- 4-6 emails per month.
You will receive an email with a scenario that needs to be solved. 24 hours later, you receive another email with a detailed explanation of the scenario.
Sample
Here is the scenario for today (Level = Easy)
Past: There was a game called lulucity. It was played by 1 million users each day over the last 10 days. Each user (represented by a unique username) created about 1 million events per day. Each event is of the form "I got X", where X is a positive or negative number. The events generated by users were uniformly distributed across the day (that is, there are no uneven spikes in events at any particular time).
Today: You were given a machine (quad-core i7 processor, 64 GB RAM, 96 TB SSD) which has an SQL database with all the records. Each user created 1M events per day, and each day 1M users played, so that's 1 Trillion records per day, and the game ran for 10 days. So what you have is a database table with 10 Trillion records. Each record holds a username, a time in seconds, and a score.
Note: A quad-core i7 processor can scan and add up to 4 Billion rows in a second. While the exact numbers vary, for this problem we can stick with this. This means that if there were a database with 4 Billion rows, a simple query to sum all the rows would take a second.
Task: You need to build a leaderboard that displays the top 50 users for a given start and end time. The start and end time will be given by the user at second granularity. For example, if the user said start = yesterday, 12:30 PM, and 23 seconds and end = today, 3:00 PM, and 56 seconds, then you need to consider only the scores between those seconds and return the leaderboard. Do not worry about tiny details like formats, etc.; just focus on the performance issue.
Naive solution: We can use this SQL statement:

select username, sum(score) as total_score
from table_raw
where time_second >= start_time and time_second <= end_time
group by username
order by total_score desc
limit 50;

It does exactly what we need: it filters the rows that fall in the user-given time frame, sums them up per user, then orders users by total score, highest first. But it's too slow. For example, if start_time and end_time cover the whole 10-day range, it has to loop over 10 Trillion records and add them up. As said above, scanning and adding 4 Billion rows takes a second, so 10 Trillion rows will take 2500 seconds. Clearly, not good. So how would you solve this?
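If you want to play with the naive query, here is a minimal runnable sketch using SQLite. The table layout and the data are made up for illustration; only the query itself comes from the text above.

```python
import sqlite3

# Toy stand-in for the 10-trillion-row table_raw.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_raw (username TEXT, time_second INTEGER, score INTEGER)")
events = [
    ("alice", 10, 5), ("alice", 20, -2),
    ("bob",   15, 7), ("bob",   30, 1),
    ("carol", 25, 3),
]
conn.executemany("INSERT INTO table_raw VALUES (?, ?, ?)", events)

start_time, end_time = 0, 25  # query window in seconds

# The naive leaderboard query, exactly as in the text.
rows = conn.execute(
    """SELECT username, SUM(score) AS total_score
       FROM table_raw
       WHERE time_second >= ? AND time_second <= ?
       GROUP BY username
       ORDER BY total_score DESC
       LIMIT 50""",
    (start_time, end_time),
).fetchall()
print(rows)  # bob leads with 7; alice and carol both have 3 inside [0, 25]
```

Correct, but it scans every row in the window, which is exactly the problem at scale.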
Restrictions: None. You can use any mechanism, language, creating more tables, using other stores like redis, in-memory cache - your wish really. What we need is it got to be able to answer such queries as quickly as possible.
Solutions will be sent in 24 hours
==== STOP AND THINK ===== The content below will be delivered to your inbox 24 hours later ==== STOP AND THINK ====
Wrong Answer #1: I am surprised how many of you sent emails with this as the answer. It goes like this: "We can add an index to the time_second column, this will speed up things and return the answer under a second"
That's wrong. Adding an index does not speed things up in the worst-case scenario mentioned in yesterday's email; it helps only when the range is small. A full explanation of indexes and why they won't help is beyond this email, so I will leave you with a short summary.
Indexes give a way to binary search on a column and jump to a particular row in O(log N) time instead of O(N), where N is the number of rows. In our case, given a large range, we do two O(log N) searches to find the first valid row and the last valid row. We still have to loop over all the valid rows in between and sum their values.
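The point can be sketched with a sorted array, where `bisect` stands in for the index's binary search (toy data, my own illustration):

```python
import bisect

# An index keeps time_second sorted; bisect plays the role of the index here.
times  = [5, 10, 15, 20, 25, 30, 40]
scores = [1,  2,  3,  4,  5,  6,  7]

start_time, end_time = 10, 30
lo = bisect.bisect_left(times, start_time)   # first valid row, O(log N)
hi = bisect.bisect_right(times, end_time)    # one past the last valid row, O(log N)

# Finding the endpoints was fast, but summing is still a scan over
# every valid row -- O(K) for K rows in the range.
total = sum(scores[lo:hi])
print(lo, hi, total)  # 1 6 20
```

When the range covers most of the table, K is close to N and the index has saved essentially nothing.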
Wrong Answer #2: This was an interesting answer. The idea here is to maintain a prefix sum array per user and then answer total_score per user in O(1). This way, the total performance is O(number_of_users), which is at most 10M.
However, there is one issue. We have 10 days with 1M users per day; if all of them are unique, we can have up to 10M users. Each array is going to be of size 86400 (seconds in a day). So the total size is 8 bytes (long int) * 86400 * 10M ≈ 6.9 TB. It's not possible to store this in 64 GB of memory, and if we store it on disk, it's going to be too slow.
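For those who haven't met prefix (cumulative) sums, here is the idea shrunk to a 10-second "day"; the names and data are illustrative:

```python
import itertools

DAY = 10  # toy stand-in for 86400 seconds

# Per-second scores for one user (index = second of the day).
per_second = {"alice": [0, 3, 0, -1, 0, 2, 0, 0, 4, 0]}

# prefix[user][t] = total score from second 0 through second t.
prefix = {
    user: list(itertools.accumulate(vals))
    for user, vals in per_second.items()
}

def range_sum(user, start, end):
    """Total score for `user` in seconds [start, end], answered in O(1)."""
    p = prefix[user]
    return p[end] - (p[start - 1] if start > 0 else 0)

print(range_sum("alice", 2, 8))  # -1 + 2 + 4 = 5
# At real scale this needs 8 bytes * 86400 * 10M users -- terabytes, not RAM.
```

The query is O(1) per user, which is exactly why the answer is tempting; the memory bill is what kills it.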
If this answer is not clear to you, that's probably because you haven't used cumulative (prefix) sum arrays before. That's okay.
Correct Answer (one of many correct ways): A few of you got to this. Congrats!
Let's visualize the problem first.
The issue is that there are too many events in a given timeframe. So, how about we group them up somehow?
If we can group them up into blocks, then we can add the left-side overflow events, the right-side overflow events, and the middle blocks (instead of the middle events). A very intuitive way to group them up here is by hours of the day. We will create a new table of the following format. Let's call this "table_hours" and the initial table we got "table_raw".
The hour column in the above table counts hours from the start of our data, so hour = 3 means day 1, 3rd hour, i.e., 2:00 AM to 3:00 AM. Each row represents the total score in a single hour for a single user. We process table_raw, accumulate values on an hourly basis, and create this new table_hours.
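Building table_hours is one aggregation pass over table_raw. A hedged SQLite sketch, assuming the username/time_second/score layout used by the naive query (data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_raw (username TEXT, time_second INTEGER, score INTEGER)")
conn.executemany(
    "INSERT INTO table_raw VALUES (?, ?, ?)",
    [("alice", 100, 5), ("alice", 3700, 2), ("bob", 200, 7)],
)

# hour = time_second / 3600, counted from the start of the data.
# One row per (user, hour) holding that user's total score in that hour.
conn.execute(
    """CREATE TABLE table_hours AS
       SELECT username, time_second / 3600 AS hour, SUM(score) AS score
       FROM table_raw
       GROUP BY username, hour"""
)
print(conn.execute("SELECT * FROM table_hours ORDER BY username, hour").fetchall())
# [('alice', 0, 5), ('alice', 1, 2), ('bob', 0, 7)]
```

In practice this table would be maintained incrementally as events arrive, rather than rebuilt from scratch.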
As you can see in the image, we are splitting the problem into 3 parts (part 3 is not visible; it's similar to part 1, but at the right end of the image). We will use table_raw as usual for parts 1 and 3, and table_hours for part 2. Let's check the performance.
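The three-part split can be sketched end to end. The `leaderboard` helper and its edge arithmetic below are my own illustration, not the author's code; the two tables match the sketches above:

```python
import sqlite3
from collections import defaultdict

# Toy setup: raw events plus the hourly roll-up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_raw (username TEXT, time_second INTEGER, score INTEGER)")
conn.executemany(
    "INSERT INTO table_raw VALUES (?, ?, ?)",
    [("alice", 100, 5), ("alice", 4000, 2), ("alice", 7250, 1), ("bob", 3600, 7)],
)
conn.execute(
    """CREATE TABLE table_hours AS
       SELECT username, time_second / 3600 AS hour, SUM(score) AS score
       FROM table_raw GROUP BY username, hour"""
)

def leaderboard(conn, start, end, limit=50):
    totals = defaultdict(int)

    def add_raw(lo, hi):  # scan raw events in [lo, hi] (parts 1 and 3)
        if lo > hi:
            return
        for user, s in conn.execute(
            "SELECT username, SUM(score) FROM table_raw "
            "WHERE time_second BETWEEN ? AND ? GROUP BY username", (lo, hi)):
            totals[user] += s

    first_full = (start + 3599) // 3600  # first hour fully inside the range
    last_full = (end + 1) // 3600        # first hour past the range
    if first_full >= last_full:          # whole range fits inside a single hour
        add_raw(start, end)
    else:
        add_raw(start, first_full * 3600 - 1)     # part 1: left overflow events
        for user, s in conn.execute(              # part 2: whole-hour blocks
            "SELECT username, SUM(score) FROM table_hours "
            "WHERE hour >= ? AND hour < ? GROUP BY username",
            (first_full, last_full)):
            totals[user] += s
        add_raw(last_full * 3600, end)            # part 3: right overflow events
    return sorted(totals.items(), key=lambda kv: -kv[1])[:limit]

print(leaderboard(conn, 100, 7300))  # [('alice', 8), ('bob', 7)]
```

Only the partial hours at each end touch table_raw; everything in between is read as pre-summed blocks.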
Part 1: It is at most an hour of data (if it were more than an hour, the excess would be covered by whole-hour blocks). Since each day we get 1 Trillion events uniformly distributed, this is about 1/24th of a Trillion, or about 40 billion events. Part 3 is similar to part 1. Part 2: It's a summation of hour blocks over 10 days, for 1 million users per day: total blocks = 10 * 24 * 1M = 240M.
So we have 40B + 240M + 40B rows to be summed. At the spec of 4B sums per second, this is going to take about 20 seconds. That's quite good compared to the naive solution's 2500 seconds, but still not good enough.
Clearly, the issue is with part 1 & 3.
If you think about it, we have the exact same issue as before: too many events crammed into parts 1 and 3. Can we use the same logic again to optimize it? You see where I am going? Yup: we can create another table similar to table_hours but at minute scale instead of hour scale. Let's call this table_minutes.
I will leave the other details for you to work out. Here is the final performance:
Part 1 & 5: At most 1 minute of raw events each = 1 Trillion / (24*60) ≈ 0.7 Billion
Part 2 & 4: At most 60 minute-blocks each = 1M users * 60 minutes = 60 Million
Part 3: 24 hours a day for 10 days = 1M users * 240 hours = 240 Million
Total: 0.7B + 60M + 240M + 60M + 0.7B ≈ 1.75 Billion
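The totals above can be checked with plain arithmetic (the sum lands near 1.75 Billion rows, which at 4B rows/second is still comfortably under half a second):

```python
TRILLION, BILLION, MILLION = 10**12, 10**9, 10**6
rows_per_second = 4 * BILLION            # scan-and-add throughput from the spec

part_1 = part_5 = TRILLION // (24 * 60)  # at most 1 minute of raw events each
part_2 = part_4 = 1 * MILLION * 60       # at most 60 minute-blocks per user each
part_3 = 1 * MILLION * 24 * 10           # 240 hour-blocks per user

total = part_1 + part_2 + part_3 + part_4 + part_5
print(total, total / rows_per_second)    # ~1.75 billion rows, ~0.44 seconds
```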
It's going to take less than half a second now. Woohoo! This idea of dividing data into logical blocks is the underlying principle behind a popular data structure called the Segment Tree.
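A segment tree takes the blocks idea to its limit: blocks at every power-of-two scale, so any range sum touches O(log n) nodes. A minimal sketch (my own illustration, not from the newsletter):

```python
class SegmentTree:
    """Array-backed segment tree for range sums."""

    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * self.n + list(values)  # leaves live in tree[n:]
        for i in range(self.n - 1, 0, -1):       # each parent = sum of children
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def range_sum(self, lo, hi):
        """Sum of values[lo..hi], inclusive, in O(log n)."""
        lo += self.n
        hi += self.n + 1
        total = 0
        while lo < hi:
            if lo & 1:          # lo is a right child: take it, step right
                total += self.tree[lo]
                lo += 1
            if hi & 1:          # hi is a right boundary: step left, take it
                hi -= 1
                total += self.tree[hi]
            lo //= 2            # climb one level
            hi //= 2
        return total

scores = [3, -1, 4, 1, -5, 9, 2, 6]
t = SegmentTree(scores)
print(t.range_sum(2, 5))  # 4 + 1 - 5 + 9 = 9
```

The hour/minute tables are effectively the top two levels of such a tree, chosen by hand.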
Until next week, Anudeep
- Scenarios focus on real-world cases and are programming-language agnostic
- Explanations in the emails are concise, and pictorial when needed
- I try my best to make every edition intellectually satisfying
- If you enjoy problem-solving & critical thinking, there is a high probability that you will love this newsletter