Categories
Computer Science

The Simple Computer Science Behind Search Engines

Searching is an impossibly hard problem, but the foundation is simple.


In computer science, we’re always worrying about n: how the number of operations our code has to do grows with how much data we have.

One approach to searching is to just go through everything in order: look at every word on a page and see if it matches what you’re looking for. In that case, your runtime is linear because, for each word (each n) we have to look at, the program does one more unit of work.

If a program did that for this article and searched for the word ‘fish’, for example, it wouldn’t be a big deal because our n is relatively small.
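Here’s a minimal sketch of that linear scan in Python (the function name and sample text are just for illustration):

```python
# A linear scan through the words of a document.
# We look at every word, so the work grows with n, the number of words.
def linear_search(words, target):
    positions = []
    for i, word in enumerate(words):  # one comparison per word: O(n)
        if word == target:
            positions.append(i)
    return positions

text = "One fish two fish red fish blue fish"
print(linear_search(text.lower().split(), "fish"))  # [1, 3, 5, 7]
```

For eight words this is instant; for the whole internet, it isn’t.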

Now, if you had to search every word on the internet, it would be a big deal, because there are a lot of words on the internet. Google has really fast computers, but there’s no way it could scan them all in 0.7 seconds.


Inverted Index to the Rescue

The inverted index isn’t a way to search every word on the internet, but it’s a great way to organize them to make them easier to find. It allows us to do only one operation (well, usually) to perform this search.

So, how does it pull this off?

The magic sauce is hash tables.

Tricks to get to O(1)

Hash tables have two important elements: buckets and a hashing function.

These buckets are generally an array. The hashing function can be fairly sophisticated but, for the purposes of an easy intro, let’s imagine it just counts the letters in a word.

[Diagram: six numbered boxes starting at 0, with each word placed in the box matching how many letters it has.]

Our hash table organizes words into numbered buckets (an array) based on the number the hashing function gives it. So, if our hashing function looks at the word red, it will say: “Put it in bucket three!”

That means that, when we want to find it again, we just need to do the same operation on the same word.
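The letter-counting scheme above can be sketched in a few lines of Python. (A real hash function would be much more sophisticated; the function and bucket names here are made up for illustration.)

```python
# Toy hashing function: "hash" a word by counting its letters.
def letter_count_hash(word):
    return len(word)

# The buckets are an array; ten slots is enough for short words.
buckets = [[] for _ in range(10)]

def put(word):
    # Store the word in the bucket its hash points to.
    buckets[letter_count_hash(word)].append(word)

def get(word):
    # To find it again, repeat the same operation on the same word.
    return word in buckets[letter_count_hash(word)]

put("red")         # len("red") == 3, so it goes in bucket three
print(get("red"))  # True
```

Notice that `get` never scans the other buckets: it jumps straight to the right one.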

This hashing method makes hash table lookups super fast! It doesn’t matter how many items you have: every time you look something up, it takes around the same amount of time.

Compare this to other search methods where you might have to look at each item until you find what you’re looking for (linear search).

The drawback is that it takes more work upfront and takes up more space to store the table.

This illustrates a trade-off we often have to make as programmers: use up more space, or take longer to run. In the case of a search engine, speed is most important.

What’s an inverted index?

An inverted index is a hash table of key-value pairs. That means that hashing the key tells you exactly where in the table to find its value.

In an inverted index, the key is the word we’re searching for and the value is a list of indexes where we saw the word before. It lets you search for a word and, with only one operation, tells you everywhere it found that word on the whole internet.

For example:

One fish two fish red fish blue fish

If we map each word to an index (“One” is at index 0, “fish” at index 1, etc.), an inverted index of the words in this sentence would look like this:

One: [0],
fish: [1, 3, 5, 7],
two: [2],
red: [4],
blue: [6]

If I want to know where to find the word fish, I can simply look it up in the inverted index and see that it shows up at indexes 1, 3, 5, and 7.

Because this is in a hash table, we can find everywhere the word appears almost instantly.
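Building that exact index takes only a few lines with a Python dict (which, as we’ll see below, is a hash table under the hood):

```python
# Build the inverted index for the example sentence:
# map each word to the list of positions where it appears.
sentence = "One fish two fish red fish blue fish"

index = {}
for position, word in enumerate(sentence.split()):
    index.setdefault(word, []).append(position)

print(index["fish"])  # [1, 3, 5, 7]
```

One dictionary lookup, and we have every position at once.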


Let’s Look at Some Code

Python makes this somewhat simple thanks to dictionaries. Dictionaries are essentially hash tables under the hood, so we can build an inverted index with just a few methods.

If you’re storing a lot of data, you would want to do this differently but it works as an example.

Our search engine will only look at single words and tell you where to find them in a collection of files.

Google, for example, might do some more sophisticated operations (PageRank) to figure out how relevant a result is to your search.
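Here’s one possible sketch of that single-word search engine. The class and file names are invented for the example, and the documents are passed in as strings rather than read from disk:

```python
from collections import defaultdict

class InvertedIndex:
    """A toy search engine: maps each word to (document, position) pairs."""

    def __init__(self):
        self.index = defaultdict(list)

    def add_document(self, name, text):
        # Indexing does the upfront work: walk every word once.
        for position, word in enumerate(text.lower().split()):
            self.index[word].append((name, position))

    def search(self, word):
        # One hash-table lookup, no matter how many documents we indexed.
        return self.index.get(word.lower(), [])

engine = InvertedIndex()
engine.add_document("poem.txt", "One fish two fish red fish blue fish")
engine.add_document("note.txt", "A red fox")
print(engine.search("red"))  # [('poem.txt', 4), ('note.txt', 1)]
```

The trade-off from earlier shows up clearly here: `add_document` pays the linear cost once, so that every later `search` is a single lookup.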


Footnote on Web Crawling

To make this information searchable, you do have to go through and index all of it. To do that, search engines use web crawlers (also sometimes called spiders) to click through all the links and index everything in each web page.

Let me know if you’d like to see a part II on spiders!