<li>4-6 emails per month.

<li>
You will receive an email with a scenario that needs to be solved. 24 hours later, you get another email with a detailed explanation of the scenario. Sample:
Hey Devs! Here is the scenario for today (Level = Easy)

Solutions will be sent in 24 hours.

==== STOP AND THINK ====
That's wrong. Adding an index would not speed up the worst-case scenario mentioned in yesterday's email; it only helps when the range is small. A full explanation of indexes and why they won't help is beyond this email, so I will leave you with a short summary and a couple of links. An index gives us a way to binary search on a column and jump to a particular row in O(log N) time instead of O(N), where N is the number of rows. In our case, if a large range is given, we do two O(log N) searches to find the first valid row and the last valid row. We still have to loop over all the valid rows in between and sum up their values. Read more: 1, 2
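The two O(log N) boundary lookups followed by a linear sum can be sketched with Python's bisect module on a sorted list of rows; the column layout and the numbers are hypothetical, purely to illustrate why the index doesn't remove the O(K) scan.

```python
import bisect

# Hypothetical rows, sorted by timestamp -- roughly what an index on
# the timestamp column gives us.
timestamps = [10, 20, 30, 40, 50, 60]
values =     [ 1,  2,  3,  4,  5,  6]

def range_sum(lo, hi):
    # Two O(log N) binary searches locate the boundary rows...
    first = bisect.bisect_left(timestamps, lo)
    last = bisect.bisect_right(timestamps, hi)
    # ...but the sum itself still visits every row in between: O(K).
    return sum(values[first:last])

print(range_sum(20, 50))  # 2 + 3 + 4 + 5 = 14
```

For a small range K is tiny and the index wins; for a range covering most of the table, K is close to N and the index barely helps.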
However, there is one issue. We have 10 days and 1M users per day. If all of them are unique, we can have up to 10M users. Each array is going to be of size 86400 (seconds in a day), so the total size is 8 bytes (long int) * 86400 * 10M ≈ 6.9 TB. It is not possible to store this in memory, and if we store it on disk, it's going to be too slow. If this answer is not clear to you, that's probably because you are not familiar with cumulative sum arrays. That's okay.
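For readers who haven't met cumulative (prefix) sum arrays, here is a minimal sketch of the technique the memory estimate above assumes: one precomputed array per user, after which any range sum is O(1). The per-second values are made up for illustration.

```python
from itertools import accumulate

# Hypothetical per-second values for one user (86400 entries in the
# real problem; shortened here).
per_second = [5, 0, 3, 7, 1, 4]

# prefix[i] = sum of per_second[0..i-1]; prefix[0] = 0.
prefix = [0] + list(accumulate(per_second))

def range_sum(start, end):
    # Sum of per_second[start..end] inclusive, in O(1).
    return prefix[end + 1] - prefix[start]

print(range_sum(1, 4))  # 0 + 3 + 7 + 1 = 11
```

The O(1) query is exactly why this layout is tempting, and the 6.9 TB figure is exactly why it fails here.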
Let's visualize the problem first. The issue is that there are too many events in a given timeframe. So, how about we group them up somehow? If we can group them into blocks, then we can add the left-side overflow events, the right-side overflow events, and the middle blocks (instead of the middle events). A very intuitive way to group them here is by hours of the day. We will create a new table of the following format. Let's call it "table_hours", and the initial table we got "table_raw".
The hour in the above table represents the 3rd hour from the start of our data, so the row is about day 1, 3rd hour, i.e., 2:00 AM to 3:00 AM. Each row represents the total sum in a single hour for a single user. We process table_raw, accumulate values on an hourly basis, and create this new table_hours.

As you can see in the image, we are splitting the problem into 3 parts (part 3 is not visible; it's similar to part 1, but on the right end of the image). We will use table_raw as usual for parts 1 and 3, and table_hours for part 2.

Let's check the performance. Part 1 is at most an hour of data (if it were more than an hour, it would have been converted to an hour block). Since each day we get 1 trillion events uniformly distributed, this is about 1/24th of a trillion, or about 40 billion events. So we have 40B + 240M + 40B events to sum. At the spec of 4B sums per second, this is going to take about 20 seconds. That's quite good compared to the 2500 seconds of the naive solution, but still not good enough.

Clearly, the issue is with parts 1 & 3. If you think about it, we have the exact same issue as before: too many events crammed into them. Can we again use the same logic to optimize it? You see where I am going? Yup, we can create another table similar to table_hours but at minute scale instead of hour scale. Let's call it table_minutes. I will leave the other details for you to work out. Here is the final performance: parts 1 & 3 are now at most 1 minute of events = 1 trillion / (24*60) ~= 0.7 billion, so they take less than half a second. Woohoo!

This idea of dividing into logical blocks is the underlying principle for a popular data structure called Segment Trees. Until next week,
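The three-part decomposition above can be sketched as follows. Timestamps are in seconds, a dict of hourly totals stands in for table_hours, a dict of raw events for table_raw, and all names and numbers are hypothetical.

```python
SECONDS_PER_HOUR = 3600

def block_range_sum(events, hour_totals, start, end):
    """Sum event values in [start, end] using whole-hour blocks for
    the middle and raw events only at the ragged edges.

    events: dict second -> value (stands in for table_raw)
    hour_totals: dict hour index -> total (stands in for table_hours)
    """
    # First and last hour boundaries fully inside [start, end].
    first_full = -(-start // SECONDS_PER_HOUR)   # ceil division
    last_full = (end + 1) // SECONDS_PER_HOUR    # exclusive

    if first_full >= last_full:
        # Range fits inside a single hour: fall back to raw events.
        return sum(v for t, v in events.items() if start <= t <= end)

    left = sum(v for t, v in events.items()                      # part 1
               if start <= t < first_full * SECONDS_PER_HOUR)
    middle = sum(hour_totals.get(h, 0)                           # part 2
                 for h in range(first_full, last_full))
    right = sum(v for t, v in events.items()                     # part 3
                if last_full * SECONDS_PER_HOUR <= t <= end)
    return left + middle + right

events = {1800: 2, 3600: 5, 7200: 7, 9000: 1}
hour_totals = {0: 2, 1: 5, 2: 8}
print(block_range_sum(events, hour_totals, 1800, 9000))  # 2 + 5 + 8 = 15
```

Adding table_minutes just repeats the same trick one level down: the ragged edges of the hour blocks are themselves covered by minute blocks, shrinking the raw-event scans further.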

<li>Scenarios focus on real-world cases and are programming-language agnostic.

<li>Explanations in the emails are concise, and pictorial when needed.

<li>I try my best to make every edition intellectually satisfying.

<li>If you enjoy problem-solving & critical thinking, there is a high probability that you will love this newsletter.
