The Architecture of a Large-Scale Web Search Engine

In previous posts of this advent series, we described some of the technologies that power our private search products. It is about time we introduced the systems that bring everything together.

A web-scale search engine is highly complex. It is a distributed system with strong constraints on performance and latency. On top of that, it can easily become extremely costly to operate, both in human resources and, of course, in money. This article explores the technology stack we employ today and some of the choices and decisions, made and iterated upon over the years, that let us cater to both external and internal users. The topic is very broad and cannot be covered in a single sitting, but we hope to give you the gist of it.

We use a combination of prominent open source and cloud-native technologies, wrapped with home-grown tooling, all battle tested. Where we have not found a solution in the open source world or among commercial offerings, we have been prepared to dive deep and write some core systems from scratch, which has worked well for us at our scale.

Disclaimer: We describe our system as it is today. Of course, we did not start like this. We went through multiple architectural overhauls over the years, always considering constraints like costs, traffic and data size. By no means would we suggest that this is a recipe for building a search engine; it is simply what works for us today. As wiser people have said:

“Premature optimization is the root of all evil” ~ Donald Knuth

And we agree wholeheartedly. As a matter of fact, we advise anyone never to throw all the ingredients into the pot at once, but instead to add them one by one, slowly and incrementally adding complexity one step at a time.
Given the nature of this post, we want to provide an ordered outline of all topics covered:

1. Cliqz search as a product and its system requirements.
2. Web Search Systems: a near real-time and truly automated search system.
3. Data Processing Platform: facilitating near real-time and batch indexing.
4. How were deployments done in the past? The pros and cons of various approaches.
5. Microservices Architecture: orchestrating the services involved in delivering content for a search engine result page.
6. Our need for containers and a container orchestration system (Kubernetes).
7. Introducing our Kubernetes stack: how we deploy, run and manage Kubernetes and various add-ons, and the problems they solve for us.
8. Local Development on Kubernetes: an end-to-end use case.
9. Optimizing on Costs.
10. Machine Learning Systems.

Our Search Experience: Dropdown & SERP

The search engine at Cliqz has two consumers with different requirements.

Search-as-you-type

Figure 1: Cliqz Dropdown in the Browser

The search in the browser address bar[1], with results shown in the dropdown. This type of search requires fewer results (typically 3) but is extremely latency sensitive (less than 150 ms); otherwise the user experience suffers.

Search in SERP

Figure 2: Cliqz Search Engine Result Page (beta.cliqz.com)

Search on a web page: the typical search engine results page everybody knows. Here, the depth of the search is unbounded, but latency requirements are less demanding (less than 1000 ms) than for the dropdown.

Fully Automated and Near Real-time Search

Consider a query like “bayern munich”. This may seem a very generic query, but when issued, it touches several services within our system. If we try to interpret the intent behind the query, we find that the user may be:

- Researching the club (in which case a Wikipedia snippet would be relevant)
- Interested in booking…
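To make the two consumer profiles described earlier (dropdown and SERP) more concrete, here is a minimal Python sketch of how such requirements could be captured as configuration. It is an illustration only, not the actual Cliqz implementation: the names `SurfaceProfile`, `DROPDOWN`, `SERP` and `shape_response` are hypothetical, and only the numbers (about 3 results under 150 ms, unbounded depth under 1000 ms) come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class SurfaceProfile:
    """Hypothetical description of one search consumer's requirements."""
    name: str
    max_results: Optional[int]  # None means the result depth is unbounded
    latency_budget_ms: int      # responses should arrive in less than this


# Figures taken from the article; everything else is an assumption.
DROPDOWN = SurfaceProfile("dropdown", max_results=3, latency_budget_ms=150)
SERP = SurfaceProfile("serp", max_results=None, latency_budget_ms=1000)


def shape_response(
    profile: SurfaceProfile,
    ranked_results: Sequence[str],
    elapsed_ms: float,
) -> Tuple[List[str], bool]:
    """Trim already-ranked results to the profile's depth and report
    whether the request stayed within the profile's latency budget."""
    within_budget = elapsed_ms < profile.latency_budget_ms
    if profile.max_results is None:
        results = list(ranked_results)
    else:
        results = list(ranked_results[: profile.max_results])
    return results, within_budget


if __name__ == "__main__":
    ranked = ["wikipedia", "news", "tickets", "shop"]
    print(shape_response(DROPDOWN, ranked, elapsed_ms=120))
    print(shape_response(SERP, ranked, elapsed_ms=1200))
```

Keeping the depth and latency budget in one immutable profile per consumer means the same ranking pipeline could, in principle, serve both surfaces and differ only in how the final response is shaped.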

This article first appeared on 0x65.dev. If you'd like to keep reading, follow the white rabbit.
