Building a Search Engine from Scratch

The previous blog post in this series explored our journey so far in building an independent, alternative search engine. If you haven’t read it yet, we would highly recommend checking it out first!

It is no secret that Google search is one of the most lucrative businesses on the planet. With quarterly revenues of Alphabet Inc. exceeding $40 billion[1], a large portion of them driven by advertising on Google’s search properties, it might be a little surprising to see the lack of competition to Google in this area[2]. We at Cliqz believe that this is partly due to the web search bootstrapping problem: the entry barriers in this field are so massive that even the biggest, most successful companies in the world, with the resources to tackle the problem, shy away from it. This post attempts to detail the bootstrapping problem and explain the Cliqz approach to overcoming it.

But let us first start by defining the search problem. A modern web search engine is expected to answer any user question with the most relevant documents that exist on the internet for that topic. It is also expected to be blazingly fast, but we can ignore that for the time being. At the risk of gross oversimplification, we can define the web search task as computing the content match of each candidate document with respect to the user question (query), computing the current popularity of the document, and combining these scores with some heuristic.

The content match score measures how well a given document matches a given query. This could be as simple as an exact keyword match, where the score is proportional to the number of query words present in the document:

query: avengers endgame
document: avengers endgame imdb

If we could score all our documents this way, filter the ones that contain all the query words and sort them by some popularity measure, we would have a functioning, albeit toy, search engine (a minimal sketch of this scheme is given at the end of this excerpt). Let us look at the challenges involved in building a system capable of handling just this exact keyword match scenario at web scale, which is a bare minimum requirement for a modern search engine. According to a study published on worldwidewebsize.com, a conservative estimate of the number of documents indexed by Google is around 60 billion.

1. The infrastructure costs involved in serving a massive, constantly updating inverted index at scale. Considering just the text content of these documents, this represents at least a petabyte of data. A linear scan through these documents is technically not feasible, so the well-understood solution is to build an inverted index (sketched below). The big cloud providers like Amazon, Google or Microsoft can provide the infrastructure needed to serve such a system, but it is still going to cost millions of euros each year to operate. And remember, this is just to get started.

2. The engineering costs involved in crawling and sanitizing the web at scale. The crawler[3] infrastructure needed to keep this data up to date while detecting newer documents on the internet is another massive hurdle. The crawler needs to be polite (some form of domain-level rate limiting, also sketched below), be geographically distributed, handle multilingual data and aggressively avoid link farms and spider traps[4]. A huge portion of the crawlable[5] web is spam and duplicate content; sanitizing this data is another big engineering effort. Also, a significant portion of the web is cut off from you if your crawler is not famous. Google has a huge competitive advantage in this regard: a lot of site owners…
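
To make the toy scheme above concrete, here is a minimal sketch in Python. The function names, the corpus and the way popularity is folded into the final score are illustrative assumptions for this post, not how Cliqz actually ranks results.

```python
from typing import Dict, List


def content_match(query: str, document: str) -> float:
    """Fraction of query words that appear in the document (exact keyword match)."""
    query_words = query.lower().split()
    doc_words = set(document.lower().split())
    if not query_words:
        return 0.0
    return sum(1 for w in query_words if w in doc_words) / len(query_words)


def rank(query: str, corpus: Dict[str, str], popularity: Dict[str, float]) -> List[str]:
    """Keep documents containing all query words, then order them by a simple
    heuristic that combines content match with a popularity score."""
    scored = []
    for doc_id, text in corpus.items():
        match = content_match(query, text)
        if match < 1.0:  # filter: every query word must be present
            continue
        scored.append((match * popularity.get(doc_id, 0.0), doc_id))  # illustrative heuristic
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


corpus = {"doc1": "avengers endgame imdb", "doc2": "endgame chess strategies"}
popularity = {"doc1": 0.9, "doc2": 0.4}
print(rank("avengers endgame", corpus, popularity))  # ['doc1']
```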
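The inverted index mentioned in point 1 maps every term to the set of documents that contain it, so answering a query becomes an intersection of postings lists rather than a scan over billions of documents. A toy in-memory version might look like the following; the real system is sharded, compressed and constantly updated:

```python
from collections import defaultdict
from typing import Dict, Set


def build_inverted_index(corpus: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each term to the ids of the documents that contain it (its postings)."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def lookup(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Documents containing *all* query terms: intersect the postings lists."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


index = build_inverted_index({"doc1": "avengers endgame imdb",
                              "doc2": "endgame chess strategies"})
print(lookup(index, "avengers endgame"))  # {'doc1'}
```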
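Finally, the politeness requirement from point 2 boils down to never hitting the same domain too often. Below is a toy domain-level rate limiter, assuming a single-threaded crawler and a fixed per-domain delay; a production crawler would also honour robots.txt and adapt to server load.

```python
import time
from typing import Dict
from urllib.parse import urlparse


class PoliteScheduler:
    """Block until at least `min_delay` seconds have passed since the last
    fetch from the same domain (toy, single-threaded version)."""

    def __init__(self, min_delay: float = 1.0):
        self.min_delay = min_delay
        self.last_fetch: Dict[str, float] = {}

    def wait_turn(self, url: str) -> None:
        domain = urlparse(url).netloc
        earliest = self.last_fetch.get(domain, 0.0) + self.min_delay
        now = time.monotonic()
        if now < earliest:
            time.sleep(earliest - now)
        self.last_fetch[domain] = time.monotonic()


scheduler = PoliteScheduler(min_delay=1.0)
for url in ["https://example.com/a", "https://example.com/b"]:
    scheduler.wait_turn(url)  # the second call sleeps roughly one second
    # fetch(url) would go here
```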

This article first appeared on 0x65.dev.