Monday, 28 September 2009

3.8 Information Retrieval

Surely everyone who has used Google’s search engine has at some point been left bewildered by the amount of information that even the most obscure query returns. Many times I’ve deliberately searched using made-up words and still got back at least some loosely related results, leaving me with the warm feeling that there are other like-minded people out there who wonder about the art of Information Retrieval, or IR.

IR is the process of retrieving information relevant to a user’s requirements. The difference between querying for IR and querying an RDBMS stems from the way the information is organised. In the DB environment, information is structured and related to the underlying business model, and the process is deterministic: the same query run by different users retrieves the same results. In contrast, IR deals with unstructured information that exists in many different formats and media, and relevance depends on each user’s subjective perspective, so an IR query could return different results and is highly probabilistic.
Techniques used in IR include removing stop words, stemming, and identifying synonyms in order to create document indexes. A widely used type of index is the inverted file, a list of terms in which each term points to the list of documents that contain it. More complex queries can then be constructed using Boolean operators such as OR and AND. (Macfarlane, A., Raper, J. & Dykes, J., Lecture 08: Information Retrieval)
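To make the indexing idea more concrete, here is a minimal sketch in Python of how an inverted file could be built and queried. It is only an illustration under my own assumptions, not how any real search engine works: the stop-word list is tiny, `stem` is a crude suffix stripper standing in for a proper stemmer, synonyms are left out, and the names `tokenize`, `build_index` and `lookup` are made up for this example.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "or", "the", "to"}


def stem(word):
    """Crude suffix stripping, a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lower-case the text, drop stop words, and stem what remains."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def build_index(docs):
    """Build the inverted file: each term maps to the set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def lookup(index, word):
    """Return the ids of documents containing the (stemmed) query word."""
    return index.get(stem(word.lower()), set())


docs = {
    1: "Searching the web with search engines",
    2: "Databases store structured records",
    3: "Search engines index web documents",
}
index = build_index(docs)

# Boolean AND is set intersection, Boolean OR is set union.
both = lookup(index, "search") & lookup(index, "web")
either = lookup(index, "databases") | lookup(index, "indexing")
print(both, either)
```

With these three sample documents, "search AND web" matches documents 1 and 3, while "databases OR indexing" matches documents 2 and 3, because stemming lets "indexing" match the word "index" in document 3.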

The algorithms that modern search engines use to retrieve information efficiently are closely guarded secrets, such as the full ranking recipe Google has built around PageRank. The penetration of the web into modern life, together with the need for brand recognition, has made search engine ranking more and more important (source: SEMPO Survey). As a consequence a new IT field has arisen, namely Search Engine Optimization (SEO), which applies various ranking improvement methodologies with a view to drawing more visitors to a client’s site.
