Please Don’t Use OFFSET and LIMIT For Your Pagination
@ivoecpereiraIvo Pereira Chief Technology Officer @ Ongagement, Web & Mobile Applications Developer and Security Enthusiast Gone are the days when we wouldn’t need to worry about database performance optimization. With the advance of times and every new entrepreneur wanting to build the next Facebook combined with the mindset of collecting every possible data-point to provide better Machine Learning predictions, we, as developers, need to prepare our APIs, better than ever, to provide reliable and efficient endpoints that should be able to navigate through huge amounts of data without a sweat. If you have been doing backend or database architecture for a while you have probably already done paging queries, like this: Right? But if you did build your paginations such as this, I am sorry to say but you have been doing it wrong. You don’t agree with me? You don’t need to. Slack, Shopify and Mixmax are paginating their APIs with this same concept we will be talking about today. I challenge you to name a single backend developer who hasn‘t ever had to deal with OFFSET and LIMIT for pagination purposes. For pagination in MVPs and low-data listings it “just works“. But when you want to build reliable and effective systems from the scratch, you might as well do it right upfront. Today we will be discussing what the problems are with the (wrongly) widely used implementations and how to achieve a performant pagination. What is wrong with OFFSET and LIMIT? As we briefly explored in the past paragraphs, OFFSET and LIMIT work great for projects with low to no data usage. The issue arises when your database starts gathering more data than your server can store in memory and you still need to paginate performantly through them all. For that to happen the database will need to perform an inefficient Full Table Scan everytime you request a pagination (insertions and deletes may happen meanwhile and we don’t want outdated data!). What is a Full Table Scan? A Full Table Scan (aka Sequential Scan) is a scan made in the database where every row in a table is sequentially read and the columns encountered are then checked for the validity of a condition. This type of Scan is known to be the slowest due the heavy amount of I/O reads from the disk consisting on multiple seeks as well as costly disk to memory transfers. That means that if you have 100.000.000 users and you are requesting an OFFSET of 50.000.000, it will need to fetch all those records (that will not even be needed!), put them in memory, and only after, get the 20 results specified in the LIMIT. So, to show a pagination like this in a website: 50.000 to 50.020 of 100.000 It would need to fetch 50.000 rows first. See how inefficient this is? If you don’t believe me, take a look at this fiddle I’ve created. In the left panel you have a base schema that will insert 100.000 rows for our test and in the right there are is problematic query and our solution. Just click Run on the top and compare the execution time of each. » Read More
Like to keep reading?
This article first appeared on hackernoon.com. If you'd like to keep reading, follow the white rabbit.