There are 2 kinds of search engines: crawler-based ones and human-powered directories. Human-powered directories depend on humans for its listings. We will be concentrating on the crawler-based ones.
Crawler-based search engines match queries against an index they create. A query is any term you type into the search box. An index is a document that has been organized to facilitate searching through it for particular information. Basically,the search engine first combs through a document on the internet and processes it into an index or an inverted file.
Inverted files are files that have literally been turned inside out–their contents have been combed through, sorted, washed, and numbered so that when you type in a query, a search engine can quickly scan through the document to see if it should be in the results you receive milliseconds later.
Search engines are also called information retrieval (IR) systems, and the engines we use publicly search the internet, an intranet, or a database like a library catalog. They consist of 4 essential modules:
- A document processor
- A query processor
- A search and matching function
- A ranking capability
Search Engines use spiders to index websites. A ‘crawler’ is an automated program that is run by the search engine system. A crawler visits a web site, reads the content on the actual site, all the websites Meta tags and also follows the web links that the web site connects to. The spider then returns all that information back to a central depository, where the data is indexed by the document processor.
The Document Processor
The document processor prepares, processes, and inputs all the documents that form the database you are searching. It performs all or most of the following steps:
- Normalizes the document content to fit a predefined format
- Breaks the document content into easily retrievable bits
- Isolates and metatags subdocument pieces
- Identifies potential indexable elements in documents
- Deletes stop words
- Stems Terms
- Extracts index entries
- Compares weights
- Creates and updates the main inverted file against which the search engine searches in order to match queries to documents
Steps 1-3: Preprocessing
These first 3 steps help the search engine by standardizing all the different formats that documents come in. They process the content until it is in one consistent data structure for the rest of the steps.
Step 4: Identify elements to index
Depending on what the term “term” means to the system doing the indexing, this step identifies potential indexable elements in the document. Each search engine depends on a set of rules that its document processor must execute to determine what action will be taken by the software used to define terms suitable for indexing, which is called the tokenizer.
Step 5: Deleting stop words
This step pulls out all the words deemed to be “stop” words in order to save system resources. Stop words are those words found to convey little to no meaning, like articles (a, the), conjunctions (and, but), interjections (oh, but), prepositions (in, over), pronouns (he, it), etc. In order to do this, an algorithm compares the content to a list of stop words and culls out anything they find that matches.
Step 6: Term stemming
Stemming removes word suffixes, and in doing so achieves 2 goals: it saves processing power by reducing the number of unique words in the index and thus speeds up the search.
For example, if the user searches for the word analyze, she may also want to see documents like analysis, analyzing, etc. The document processor stems document terms to analy- so that documents which contain forms of analy- are more likely to be retrieved. Stemming may negatively impact precision in that all forms of a stem will match.
Step 7: Extract index entries
Having completed steps 1 through 6, the document processor extracts the remaining entries from the original document. The output of step 7 is then stored in an inverted file that lists index entries and an indication of their position and frequency of occurrence.
Step 8: Term weight assignment
Weights are assigned to terms in the index file. The more sophisticated the search engine, the more complex the weighting system. Measuring the frequency of occurence of a term in the document creates more sophisticated weighting.
Step 9: Create index
The index or inverted file is the internal data structure that stores the information and that will be searched for every query. The more complete the information in the index, the better the search results.
The Query Processor
Query processing has seven steps, though a system can cut these short and proceed to matching the query to the inverted file at any number of points during the processing. Document processing shares a lot of steps with query processing. The longer the wait for results, the higher the quality of results. Usually, search engines that are publicly used choose to sacrifice very high quality results to fast returns because they have so many documents to search.
The steps in query processing:
- Tokenize query terms
- Recognize query terms vs. special operators
- Delete stop words
- Stem words
- Create query representation
- Expand query terms
- Compute weights
Step 1: Tokenizing
As soon as a user inputs a query, the search engine has to break it down into understandable segments, a process called tokenizing the query stream.
Step 2: Parsing
Since users may employ special operators in their query, the system needs to parse the query first into query terms and operators.
Operators are special methods or phrases you can enter into your search query to control the search in some way. WE will talk about them some more later in the class session.
Steps 3 & 4: Stop list and stemming
Some search engines may take these steps to trim down queries, but most search engines try to limit the length of queries and may drop these steps.
Step 5: Creating the query
Depending on how the search engine does its matching, it will formulate the query.
Step 6: Query expansion
Since users of search engines usually include only a single statement in their query, it becomes highly probable that the info they are seeking may be expresses using synonyms. More sophisticated systems may expand the query into all possible synonymous terms.
Step 7: Query term weighting (assuming more than one query term has been entered)
The final step in query processing involves computing weights for the terms in the query. After this final step, the expanded, weighted query is searched against the inverted file of documents
Search and Matching Function
Searching the inverted file for documents meeting the query requirements (matching) is typically a simple binary search (either yes it matches or no it doesn’t). Having determined which subset of documents or pages matches the query, a similarity score is computed between the query and each page/document based on the scoring algorithm used by the system. After computing the similarity of each document in the subset of documents, the system presents an ordered list to the user.
After computing the similarity of each document in the subset of documents, the system presents an ordered list to the user. The sophistication of the ordering of the documents again depends on the model the system uses, as well as the richness of the document and query weighting mechanisms. For example, search engines that only require the presence of any alpha-numeric string from the query occurring anywhere, in any order, in a document would produce a very different ranking than one by a search engine that performed linguistically correct phrasing for both document and query representation and that utilized the proven tf/idf weighting scheme.
However the search engine determines rank, the ranked results list goes to the user, who can then simply click and follow the system’s internal pointers to the selected document/page.