LSI and Search Engines
by: Vikas Malhotra
Now, what do we gain through the use of this LSI technology?
Briefly, with LSI, search engines took a step toward giving us the ideal search result. What is an ideal search result? Ask yourself what you look for when you type a keyword or a phrase into Google’s search box.
With the number of pages on the Web growing enormously, we want to rely on a search engine the way we would rely on a librarian: one with a huge capacity for recall, the ability to give the most precise and relevant results, and a good sense of ordering. More technically, the ideal search engine should cater to this trinity of recall, precision, and order. This is where LSI fits in, bringing the search engine a step closer to artificial intelligence.
Let’s look at the “dumb computer” problems that LSI can take care of.
As we said earlier, a conventional search engine based on keyword searching may not give you the best results. This is simply because keyword-based search programs cannot differentiate between:
• Identical words with different meanings, e.g. “monitor” in “monitor workflow” and “monitor” as a display screen
• Words that are similar in meaning but spelled differently, e.g. disease and malady
• Singular and plural forms of words, e.g. button and buttons
• Words with similar roots, such as differed, differs, and different
• Other grammatical variants of a word, such as roast and roasted
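To make these failures concrete, here is a minimal sketch of literal keyword matching. The function name keyword_match and the sample sentences are made up for illustration, and real search engines are far more elaborate, but the blind spots are the same in kind.

```python
def keyword_match(query, document):
    """Return True only if every query word appears verbatim in the document."""
    words = set(document.lower().split())
    return all(term in words for term in query.lower().split())


# Plurals: "button" is not literally the same token as "buttons".
print(keyword_match("button", "click the button"))          # literal hit
print(keyword_match("button", "two buttons on the page"))   # missed plural

# Synonyms: a page about maladies is invisible to a search for "disease".
print(keyword_match("disease", "common maladies of the heart"))

# Same word, different meanings: both pages match, and the matcher
# has no way to tell the verb from the display screen.
print(keyword_match("monitor", "monitor your workflow"))
print(keyword_match("monitor", "buy a new monitor"))
```

The matcher only sees character strings, so it misses the plural and the synonym while treating both senses of “monitor” as the same thing.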
Because LSI looks at groups of keywords rather than a single keyword, and studies the pattern of relationships between semantically close and distant words across a collection of documents, it does not get confused by singular and plural forms or by synonyms. It simply finds the context that a group of keywords develops. So when you search for Tiger Woods, it does not merely look for Web pages that use the keywords “tiger” and “woods” but lists pages that discuss golf. Matching on the underlying concept rather than on the literal words is often called concept searching.
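The Tiger Woods idea can be sketched on a toy scale. LSI is usually described as a truncated singular value decomposition (SVD) of a term-document matrix. The sketch below assumes a five-document corpus invented for illustration, uses raw word counts with no weighting, keeps two “concepts”, and obtains the decomposition by simple power iteration so that it needs only the standard library. All function names are hypothetical; a production system would use a numerical library and a much larger corpus.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Rows are vocabulary terms, columns are documents, cells are raw counts."""
    counts = [Counter(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[t] for c in counts] for t in vocab]


def top_eigenpairs(sym, k, iters=500):
    """Top-k eigenpairs of a symmetric matrix via power iteration and deflation."""
    n = len(sym)
    m = [row[:] for row in sym]
    pairs = []
    for _ in range(k):
        v = [1.0 / math.sqrt(n)] * n
        for _ in range(iters):
            w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(m[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        for i in range(n):  # remove the component just found
            for j in range(n):
                m[i][j] -= lam * v[i] * v[j]
    return pairs


def lsi_basis(matrix, k):
    """Return k concept directions in term space (left singular vectors)."""
    n_terms, n_docs = len(matrix), len(matrix[0])
    gram = [[sum(matrix[t][a] * matrix[t][b] for t in range(n_terms))
             for b in range(n_docs)] for a in range(n_docs)]
    basis = []
    for lam, v in top_eigenpairs(gram, k):
        sigma = math.sqrt(lam)
        basis.append([sum(matrix[t][j] * v[j] for j in range(n_docs)) / sigma
                      for t in range(n_terms)])
    return basis


def project(term_vector, basis):
    """Coordinates of a term-count vector in the k-dimensional concept space."""
    return [sum(u * x for u, x in zip(direction, term_vector)) for direction in basis]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


docs = [
    "tiger woods wins golf tournament",
    "golf tournament results",
    "tiger woods press conference",
    "cooking pasta recipe",
    "pasta sauce recipe",
]
vocab, matrix = term_document_matrix(docs)
basis = lsi_basis(matrix, k=2)


def lsi_scores(query):
    """Cosine similarity between the query and every document in concept space."""
    q_counts = Counter(query.lower().split())
    q = project([q_counts[t] for t in vocab], basis)
    return [cosine(q, project([row[j] for row in matrix], basis))
            for j in range(len(docs))]


scores = lsi_scores("tiger woods")
for doc, score in sorted(zip(docs, scores), key=lambda pair: -pair[1]):
    print(f"{score:+.2f}  {doc}")
```

In this toy corpus the page “golf tournament results” contains neither “tiger” nor “woods”, yet it scores close to the pages that do, because all three share the same latent concept. The cooking pages score near zero.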
Usually, you will find that your search results shrink as you add more keywords to a search. This is because a conventional search engine works best when it studies, indexes, and recalls shorter and simpler sets of keywords. LSI goes the other way round: it first analyzes a document exhaustively before indexing or categorizing it. A latent semantic search engine can therefore support an iterative search, giving the user useful feedback with which to frame a better query if needed.
LSI is closer to human-generated taxonomies and categorization, and takes a long step toward structuring unstructured data. Hence, it is more archive-friendly: archivists can efficiently label and index the LSI-generated categories. LSI does half the job, so every document need not be indexed from scratch.
By comparing the content words of documents on a given topic, LSI helps point out content that is relevant to the topic but not covered in a particular document. This can be used in several contexts, one being a kind of automated grading system in which an assignment is compared to a sample of known quality.
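A much simplified sketch of the grading idea follows. It compares plain word counts rather than positions in an LSI concept space, so it is a stand-in for the real technique, not an implementation of it. The stopword list, the function names, and the two sample texts are all assumptions made for illustration.

```python
import math
from collections import Counter

# A tiny, hand-picked stopword list (an assumption, not a standard one).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "by", "into"}


def content_words(text):
    """Count the words of a text, ignoring punctuation, case, and stopwords."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return Counter(w for w in words if w and w not in STOPWORDS)


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def coverage_report(reference, submission):
    """Return a similarity score and the reference terms the submission lacks."""
    ref, sub = content_words(reference), content_words(submission)
    missing = sorted(w for w in ref if w not in sub)
    return cosine(ref, sub), missing


reference = ("Photosynthesis converts sunlight, water and carbon dioxide "
             "into glucose and oxygen.")
submission = "Photosynthesis converts sunlight and water into glucose."

score, missing = coverage_report(reference, submission)
print(f"similarity: {score:.2f}")
print("not covered:", ", ".join(missing))
```

Here the submission is flagged for leaving out carbon dioxide and oxygen. An LSI-based grader would make the same kind of comparison in concept space, so that a submission using different wording for the same idea would not be penalized.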
LSI can investigate the semantic relationships within a text to judge the relevance and consistency of its component parts. Building this into an application could help improve the readability and comprehensibility of the text it handles. Naturally, these properties can be used effectively in instructional design.
However, the final and perhaps most relevant use of LSI is its power to filter information and block spam, that is, unsolicited electronic mail. By adapting a latent semantic algorithm to your mailbox and feeding it the details of known spam messages, junk mail can be blocked more effectively than with the current keyword-based approaches.
LSI is an extremely methodical technique that demands a high degree of monotonous precision, exactly the kind of work a computer does efficiently. The technique is purely mechanical, based on an extensive evaluation of a set of words and a comparison of their presence across a much larger set of documents. The process can be automated because the computer does not need to understand either the search query or the meaning of the words.
About The Author
Vikas Malhotra is a successful Internet marketer utilizing both pay-per-click marketing and search engine optimization to increase website traffic. To learn more, visit
http://sem.mosaic-service.com.