Summary of a Search Engine:
Abstract
The web keeps getting larger, with its own prospects. Several hundred new websites are added to the Internet every day, so it is growing rapidly. These new sites carry new information and new elements that an individual may need, but it is not possible to find a particular piece of information by hand when every site may have, on average, a hundred pages. This is where search comes in. From this perspective, the search must be efficient, and the memory used to store this information also has to be kept within reasonable bounds. Hence the need for a good search engine. How and what makes one search engine different from another is what we will deal with here.
Keywords:
1. Page Ranking
2. Probability
3. Anchor Text
4. Vector Space Model
5. Document Indexing
6. Metadata
7. Lexicon
8. Hit List
9. Barrels
10. Parsing
11. Forward Index
12. Inverted (Backward) Index
13. Crawler
1. Page Ranking: In a search engine the same information may be found on several web pages, so which page should come first, and which one next? It is done in the following way. Suppose pages T1, ..., Tn link to a page A, and C(Ti) is the number of links going out of page Ti. We also have a damping factor d (0 < d < 1), typically d = 0.85. The PageRank of page A is then PR(A) = (1-d) + d*(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)).
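To make the formula concrete, here is a tiny iterative PageRank computation in Python. The link graph and the number of iterations are made up by me for illustration; only the formula itself comes from above.

    # Minimal PageRank sketch (illustrative only; the graph is made up).
    # links[p] is the list of pages that p links out to.
    links = {
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
    }

    d = 0.85                      # damping factor, as above
    pr = {p: 1.0 for p in links}  # initial rank for every page

    for _ in range(20):           # a few iterations are enough for this tiny graph
        new_pr = {}
        for page in links:
            # pages T that link to `page`, each contributing PR(T)/C(T)
            incoming = sum(pr[t] / len(links[t]) for t in links if page in links[t])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr

    print(pr)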
2. Probability: A user who gets bored may jump to a random page instead of following links, and this behaviour can be handled with probability. Suppose a page has n links available for traversal and the user has already used 10 of them in order; then selecting one link from the remaining ones gives each a probability of 1/(n-10). (Mine)
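A tiny sketch of that idea (the numbers here are hypothetical, and this is only my own illustration, not something from the original paper):

    import random

    # Hypothetical page with n links, 10 of which the user has already followed.
    n = 25
    links = [f"link{i}" for i in range(n)]
    visited = set(links[:10])

    unvisited = [link for link in links if link not in visited]
    choice = random.choice(unvisited)   # uniform over the remaining links
    probability = 1 / (n - 10)          # 1/(n-10), as in the note above
    print(choice, probability)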
3. Anchor Text: The text of a link is treated in a special way: it is associated with the page the link points to. This has several advantages. First, anchors often provide more accurate descriptions of web pages than the pages themselves. Second, anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases. This makes it possible to return web pages which have not actually been crawled. Note that pages that have not been crawled can cause problems, since they are never checked for validity before being returned to the user. In this case, the search engine can even return a page that never actually existed but had hyperlinks pointing to it. However, it is possible to sort the results so that this particular problem rarely happens.
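A rough sketch of how anchor text can be credited to the page the link points to (the data structures and URLs here are my own simplification, not the actual implementation):

    from collections import defaultdict

    # Map a target URL to the anchor texts of links pointing at it.
    anchor_index = defaultdict(list)

    # (source page, anchor text, target URL) triples found while parsing pages.
    # The pages and URLs below are made up for the example.
    parsed_links = [
        ("page1.html", "cute cat pictures", "http://example.com/cats.jpg"),
        ("page2.html", "cat photos",        "http://example.com/cats.jpg"),
    ]

    for source, text, target in parsed_links:
        # The anchor text is associated with the page the link points to, so even
        # an image that cannot be indexed directly gets some searchable text.
        anchor_index[target].append(text)

    print(anchor_index["http://example.com/cats.jpg"])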
4. Vector Space Model: It is a procedure divided into three stages (a small sketch of the whole pipeline follows item 5):
Document indexing
        |
Weighting of the indexed terms
        |
Ranking of the documents against the query
5. Document Indexing: Non-significant words such as "the" and "an" are removed, so that only the content-bearing terms of a document are indexed.
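Here is a compact sketch of the three stages in Python (stop-word removal, tf-idf weighting, cosine ranking). The documents, the stop-word list, and the weighting scheme are all made up for illustration; real systems use richer variants.

    import math
    from collections import Counter

    stop_words = {"the", "an", "a", "is", "of"}    # tiny illustrative stop-word list

    docs = {                                       # made-up documents
        "d1": "the cat sat on the mat",
        "d2": "an engine for searching the web",
    }

    # Stage 1: document indexing -- keep only the significant terms.
    index = {d: [w for w in text.split() if w not in stop_words]
             for d, text in docs.items()}

    # Stage 2: weight the indexed terms with tf-idf.
    df = Counter(t for terms in index.values() for t in set(terms))
    N = len(docs)
    weights = {d: {t: tf * math.log(N / df[t]) for t, tf in Counter(terms).items()}
               for d, terms in index.items()}

    # Stage 3: rank the documents against a query by cosine similarity.
    def rank(query):
        q = Counter(w for w in query.split() if w not in stop_words)
        scores = {}
        for d, w in weights.items():
            dot = sum(w.get(t, 0.0) * qtf for t, qtf in q.items())
            norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
            scores[d] = dot / norm
        return sorted(scores.items(), key=lambda x: -x[1])

    print(rank("search the web"))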
6. Metadata: Data about data. In HTML it is supplied with a tag of the form
<META HTTP-EQUIV="..." CONTENT="..."> (or <META NAME="keywords" CONTENT="...">)
where the CONTENT attribute holds the keywords, media types, and so on. Metadata sometimes gives fake information about a site or page, so it cannot be trusted blindly.
7. Lexicon: The lexicon has several different forms. One important change from earlier systems is that the lexicon can fit in memory for a reasonable price. In the current implementation we can keep the lexicon in memory on a machine with 256 MB of main memory. The current lexicon contains 14 million words (though some rare words were not added to the lexicon). It is implemented in two parts -- a list of the words (concatenated together but separated by nulls) and a hash table of pointers. For various functions, the list of words has some auxiliary information which is beyond the scope of this summary to explain fully.
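A sketch of that two-part layout in Python (a null-separated word list plus a hash table of offsets). The words are made up by me, and the real lexicon is of course far larger and more carefully engineered.

    # Word list: all words concatenated, separated by null bytes.
    words = ["search", "engine", "lexicon"]          # made-up sample vocabulary
    word_list = b"\0".join(w.encode() for w in words)

    # Hash table of pointers: word -> byte offset into the word list.
    pointers = {}
    offset = 0
    for w in words:
        pointers[w] = offset
        offset += len(w) + 1                         # +1 for the null separator

    def lookup(word):
        start = pointers[word]
        end = word_list.find(b"\0", start)
        if end == -1:
            end = len(word_list)
        return word_list[start:end].decode()

    print(lookup("engine"))                          # -> "engine"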
8. Hit List: A hit list corresponds to a list of occurrences of a particular word in a particular document, including position, font, and capitalization information. Hit lists account for most of the space used in both the forward and the inverted indices. Because of this, it is important to represent them as efficiently as possible. We considered several alternatives for encoding position, font, and capitalization -- simple encoding (a triple of integers), a compact encoding (a hand-optimized allocation of bits), and Huffman coding. In the end we chose a hand-optimized compact encoding since it required far less space than the simple encoding and far less bit manipulation than Huffman coding.
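As a sketch, here is a hand-packed two-byte hit in Python, with a layout similar to the "plain hits" of the original paper (1 bit for capitalization, 3 bits for font size, 12 bits for position); treat the code itself as my own illustration rather than the real implementation.

    # Pack a hit into 16 bits: 1 bit capitalization, 3 bits font size, 12 bits position.
    def pack_hit(capitalized, font_size, position):
        return (int(capitalized) << 15) | ((font_size & 0x7) << 12) | (position & 0xFFF)

    def unpack_hit(hit):
        return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

    hit = pack_hit(True, 3, 57)
    print(unpack_hit(hit))        # -> (True, 3, 57)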
9. Barrels: Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index.
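A sketch of distributing hits into barrels by wordID range (the number of barrels, the size of the wordID space, and the hits are all made up; the paper itself used 64 barrels):

    NUM_BARRELS = 4                     # made up for the sketch
    MAX_WORDID = 1 << 20                # assumed size of the wordID space
    RANGE = MAX_WORDID // NUM_BARRELS   # each barrel covers a contiguous wordID range

    barrels = [[] for _ in range(NUM_BARRELS)]

    # (docID, wordID, hit) records produced by the indexer; the values are made up.
    hits = [(1, 42, "pos=3"), (1, 700000, "pos=9"), (2, 42, "pos=1")]

    for doc_id, word_id, hit in hits:
        barrels[word_id // RANGE].append((doc_id, word_id, hit))

    for i, barrel in enumerate(barrels):
        print(i, barrel)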
10. Parsing: Any parser which is designed to run on the entire Web must handle a huge array of possible errors. These range from typos in HTML tags to kilobytes of zeros in the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep, and a great variety of other errors that challenge anyone's imagination to come up with equally creative ones. For maximum speed, instead of using YACC to generate a CFG parser, we use flex to generate a lexical analyzer which we outfit with its own stack. Developing this parser which runs at a reasonable speed and is very robust involved a fair amount of work.
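The real parser was a flex-generated lexical analyzer outfitted with its own stack; as a much smaller illustration (in Python rather than flex), here is a tolerant tokenizer that keeps going over malformed input instead of failing on it:

    import re

    # Very tolerant tokenizer: split a page into tags and text, never raise on bad input.
    TOKEN = re.compile(r"<[^>]*>|[^<]+", re.DOTALL)

    def tokenize(page):
        for match in TOKEN.finditer(page):
            tok = match.group()
            if tok.startswith("<") and tok.endswith(">"):
                yield ("tag", tok)
            else:
                # Strip stray NULs and other junk that real pages often contain.
                text = tok.replace("\x00", "").strip()
                if text:
                    yield ("text", text)

    broken_page = "<html><p>hello <b>world</p>\x00\x00<img src='x'"
    print(list(tokenize(broken_page)))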
Note: I am more and more interested in lexical analyzers (which I learned about in a course named "Compilers"). I will give a full description and analysis, as far as I know, of this topic in another post.
11. Forward Index: The forward index is actually already partially sorted. It is stored in a number of barrels (we used 64). Each barrel holds a range of wordID's. If a document contains words that fall into a particular barrel, the docID is recorded into the barrel, followed by a list of wordID's with hitlists which correspond to those words. This scheme requires slightly more storage because of duplicated docIDs, but the difference is very small for a reasonable number of buckets and saves considerable time and coding complexity in the final indexing phase done by the sorter.
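A sketch of the forward-index record layout inside one barrel (docID followed by the wordID/hit-list pairs for that document); the IDs and hits are made up:

    # One barrel of the forward index: for each document that has words in this
    # barrel's wordID range, store the docID followed by (wordID, hitlist) pairs.
    forward_barrel = [
        (101, [(42, ["pos=3", "pos=17"]), (57, ["pos=9"])]),   # docID 101
        (102, [(42, ["pos=1"])]),                              # docID 102
    ]

    for doc_id, postings in forward_barrel:
        for word_id, hitlist in postings:
            print(f"doc {doc_id}: word {word_id} -> {hitlist}")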
12. Inverted (Backward) Index: The inverted index consists of the same barrels as the forward index, except that they have been processed by the sorter. For every valid wordID, the lexicon contains a pointer into the barrel that wordID falls into. It points to a doclist of docID's together with their corresponding hit lists. This doclist represents all the occurrences of that word in all documents.
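And a sketch of what the sorter produces: the same data re-keyed by wordID, so that each word points to a doclist of (docID, hit list) pairs. Again, the IDs are made up and in-memory dictionaries stand in for the on-disk barrels.

    from collections import defaultdict

    # Re-sort the forward barrel from above by wordID to get the inverted barrel.
    forward_barrel = [
        (101, [(42, ["pos=3", "pos=17"]), (57, ["pos=9"])]),
        (102, [(42, ["pos=1"])]),
    ]

    inverted_barrel = defaultdict(list)      # wordID -> doclist of (docID, hitlist)
    for doc_id, postings in forward_barrel:
        for word_id, hitlist in postings:
            inverted_barrel[word_id].append((doc_id, hitlist))

    # The lexicon's "pointer" here is simply the wordID key into this dictionary.
    print(inverted_barrel[42])               # all occurrences of word 42, in all docs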
13. Crawler: Running a web crawler is a challenging task. There are tricky performance and reliability issues and, even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.
Note: I will give a more detailed description of the crawler later.
Overall Steps of Running a Search Query (a small sketch follows the list):
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist go to step 4.
8. Sort the documents that have matched by rank and return the top k.
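Putting the steps together, here is a highly simplified sketch of the query loop in Python. It ignores the short-barrel/full-barrel distinction and real ranking, and the toy lexicon and inverted index are made up to match the sketches above.

    # Toy end-to-end query evaluation over a tiny in-memory index.
    lexicon = {"search": 42, "engine": 57}

    inverted_index = {                      # wordID -> list of (docID, hitlist)
        42: [(101, ["pos=3"]), (102, ["pos=1"])],
        57: [(101, ["pos=9"])],
    }

    def search(query, k=10):
        # Steps 1-2: parse the query and convert the words into wordIDs.
        word_ids = [lexicon[w] for w in query.lower().split() if w in lexicon]
        if not word_ids:
            return []

        # Steps 3-4: walk the doclists and keep documents containing every term.
        doclists = [dict(inverted_index[w]) for w in word_ids]
        matching = set(doclists[0])
        for dl in doclists[1:]:
            matching &= set(dl)

        # Step 5: compute a (very crude) rank -- total number of hits for the query words.
        ranked = sorted(matching, key=lambda d: -sum(len(dl[d]) for dl in doclists))

        # Steps 7-8: return the top k documents by rank.
        return ranked[:k]

    print(search("search engine"))          # -> [101]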