<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://lordvaider.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://lordvaider.github.io/" rel="alternate" type="text/html" /><updated>2024-12-20T17:45:32+00:00</updated><id>https://lordvaider.github.io/feed.xml</id><title type="html">Adventures of a Wannabe Datadude</title><subtitle>One man&apos;s journey to understand his own data</subtitle><entry><title type="html">R.I.P. Timeline on Desktop - FAQ for the Normals</title><link href="https://lordvaider.github.io/2024/12/20/RIP-Timeline-Normal-FAQ.html" rel="alternate" type="text/html" title="R.I.P. Timeline on Desktop - FAQ for the Normals" /><published>2024-12-20T00:00:00+00:00</published><updated>2024-12-20T00:00:00+00:00</updated><id>https://lordvaider.github.io/2024/12/20/RIP%20Timeline%20Normal%20FAQ</id><content type="html" xml:base="https://lordvaider.github.io/2024/12/20/RIP-Timeline-Normal-FAQ.html"><![CDATA[<p>I woke up a few months ago to the upsetting news that Google is sunsetting the Timeline on Desktop feature.</p>

<p>I was furious about this change, but when I tried to communicate my outrage to the folks around me, I was met with blank stares and shrugged shoulders. Most people around me didn’t even seem to know that the Timeline feature existed, and hence had no idea what the change implied.</p>

<p>This is an evangelical post - Basically my attempt to explain to an intelligent (and skeptical) Normal <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> what Timeline is and why it matters.</p>

<p><strong>Normal:</strong> Let’s start from the basics - What is timeline?</p>

<p><strong>Datadude:</strong> I think you mean Timeline.</p>

<p><strong>N:</strong> Sigh… Ok dude, what is Timeline?</p>

<p><strong>D:</strong> Every few seconds your phone sends its GPS coordinates to one of Google’s servers. This data is recorded and processed by Google. You can view a historic record of that data using the Google Maps web app and check where you were physically located at any point in time over the past <em>n</em> years.</p>

<p><strong>N:</strong> Wow I had no idea - That’s kinda cool!</p>

<p><strong>D:</strong> Yeah, it’s very cool!</p>

<p>Unfortunately, a surprisingly large number of people are not aware of the existence of Timeline - Many people I know didn’t switch it on and their data was not even recorded.</p>

<p>Of the ones who have it switched on, a large fraction know of its existence on an abstract level, but have never actually explored it in a meaningful way.</p>

<p><strong>N:</strong> Ok I’ll bite - How do I see my Timeline?</p>

<p><strong>D:</strong> It used to be that you could open up your Timeline on any device you were logged into your Google account on. By using a service called Google Takeout, you could also download a JSON file containing your entire Timeline history.</p>

<p>Going forward, Google is making a change where the Timeline data will all live on your mobile device and will not be maintained on their servers - You will have an option to keep an encrypted backup on their servers, but this is for the purpose of maintaining Timeline continuity when you change your mobile device. Several users have complained of losing years of data to dark patterns and weird bugs in this migration process.</p>

<p><strong>N:</strong> Hmm ok, so I spent like 5 mins playing around with this… Now what? I mean you can geek out over this data all you want, but can you bottom line it for me? What’s the point of it?</p>

<p><strong>D:</strong> The way I see it, there are a few different answers to that question.</p>

<h3 id="sentimental-answer">Sentimental answer:</h3>
<p>What is the point of a photo album? Maybe I’m just weird, but retracing my physical journeys is a great way to reminisce about my holiday or even some significant date in my past.</p>

<p>In this vein, I came across the following <a href="https://chan.co.za/how-fateful">project by Chan Perry</a>, who used her and her boyfriend’s data to figure out the number of times they could’ve potentially met before they actually did.</p>

<h3 id="practical-answer">Practical answer:</h3>
<p>For the non-sentimental hardasses among you here are some practical uses:</p>

<ol>
  <li>I wrote a whole <a href="https://lordvaider.github.io/2024/05/26/Travel-History.html">blog post</a> about how I leveraged Timeline data to fill out onerous visa application forms.</li>
  <li><a href="https://www.mileagewise.com/">Mileagewise</a> allows you to use your Timeline data in the US to claim tax rebates!</li>
  <li>J.S. Morin writes that he uses Timeline as an “<a href="https://www.jsmorin.com/2023/07/google-maps-Timeline-the-argument-killing-machine/">argument killing machine</a>”. Being able to save time and/or money is one thing, but nothing compares to the rush of being able to whip out your phone and conclusively prove your spouse wrong about something.</li>
  <li>Figuring out what credit card items mean - Sometimes I will be going through my credit card statement and see a weird entry from a couple of weeks ago that I don’t recognize. Fortunately I can use Timeline to look up where I was on that date and figure out that this is the holding company that owns the restaurant I ate at on that day.</li>
  <li>In a similar vein to 4, I was able to find the name of a quaint little restaurant I dined at in Milan while I was on a trip there a couple of years ago.</li>
</ol>

<p>Who knows what the future holds? Having this data might pay off in unexpected ways at some point and even if it doesn’t, it’s better to have it and not need it than otherwise.</p>

<h3 id="philosophical-answer">Philosophical answer:</h3>
<p>I saved the best (And most controversial) argument for the end. There is a certain point of view that the human mind is a computational process - A self-propagating, dynamic collection of data and patterns <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. Death is the erasure of those patterns from the universe. For most of human history these computational patterns - The essence of who we are - were confined to our brains. In today’s age however, significant chunks of our memories and identities exist externally in digital exobrains - Our laptops, mobile devices and other digital extensions of our minds.</p>

<p>Taken to its logical conclusion, this viewpoint dictates that deleting a significant chunk of your digitally stored patterns is like suffering brain damage, or even a small-scale death! If that seems far-fetched to you, imagine how you’d feel if all existing digital photos of you were <em>irrecoverably</em> deleted.</p>

<p>Your Google Timeline is a high-resolution projection of your computational process and it’s incredibly special that you have access to it (Unlike most humans that have ever lived). To lose it would be to let a piece of yourself fade into oblivion. Do not go gentle into that good night!</p>

<p>Understanding this point will also help you understand the outrage around this move in certain internet circles - It’s as if Google is casually committing data genocide, and no one seems to care.</p>

<p><strong>N:</strong> Woah, when did this turn into a Black Mirror episode? The philosophical stuff seems a bit far-fetched to me, but if this data is as potentially useful as you say, isn’t it dangerous for me to let Google have a copy? Aren’t I better off not saving it with them?</p>

<p><strong>D:</strong> Look, at this point, we’ve all signed away our lives to our tech overlords. If they want to, they could squash you like a cockroach with or without your Timeline data. Saving it down just means that you have access to the data yourself and can use it for your own ends if you ever want to.</p>

<p><strong>N:</strong> Ok I think you’ve successfully Pascal wager-ed me into it - Even if this data is functionally worthless to me at this moment, it costs nothing to save it down and have it in case it becomes useful later on. Is there a detailed guide somewhere on the internet that gives me step by step instructions on how to proceed with the Timeline migration?</p>

<p><strong>D:</strong> Fear not my friend, the Datadude has you covered. Now that you’re (somewhat) convinced of the importance of your Timeline data, you can proceed to my FAQ for converts to learn exactly the steps you must take in order to safely migrate to the new version of Timeline without losing your data.</p>

<h1 id="footnotes">Footnotes</h1>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>I hope that readers will appreciate my use of the judgement-free “Normal” rather than the pejorative “Normie”. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>This perspective is far less controversial today than it was decades ago when pioneers of Artificial Intelligence declared that the brain was “merely a meat machine.” In the age of Artificial Intelligence, this idea has seeped into public consciousness and some would even call it obviously true. Yet, it remains one of the foundational concepts in computer science, intimately tied to the Church-Turing thesis and the simulation hypothesis. By liberating us from the dogma of any single computational substrate, it reminds us that, in the end, “it’s Turing machines all the way down.” For an exploration of the mind-bending possibilities that arise from this idea, I highly recommend Permutation City by Greg Egan. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><summary type="html"><![CDATA[I woke up a few months ago to the upsetting news that Google is sunsetting the Timeline on Desktop feature.]]></summary></entry><entry><title type="html">IKB - Subarray Sum</title><link href="https://lordvaider.github.io/2024/11/18/Irodov-Ka-Baap-Problem-3.html" rel="alternate" type="text/html" title="IKB - Subarray Sum" /><published>2024-11-18T00:00:00+00:00</published><updated>2024-11-18T00:00:00+00:00</updated><id>https://lordvaider.github.io/2024/11/18/Irodov-Ka-Baap-Problem-3</id><content type="html" xml:base="https://lordvaider.github.io/2024/11/18/Irodov-Ka-Baap-Problem-3.html"><![CDATA[<h1 id="problem">Problem</h1>
<p>You are given two integer arrays \(A\) and \(B\), such that \(|A| = n\) and \(|B| = m\). Further, all elements of \(A\) are in the range \([ 1, \cdots, m ]\) and all elements of \(B\) are in the range \([ 1, \cdots, n ]\).</p>

<p>Prove that there exist non-empty subarrays of \(A\) and \(B\) that have the same sum.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Problem You are given two integer arrays \(A\) and \(B\), such that \(|A| = n\) and \(|B| = m\). Further all elements of \(A\) are in the range \([ 1, \cdots, m ]\) and all elements of \(B\) are in the range \([ 1, \cdots, n ]\).]]></summary></entry><entry><title type="html">IKB - Span of a Tetrahedron</title><link href="https://lordvaider.github.io/2024/11/11/Irodov-Ka-Baap-Problem-2.html" rel="alternate" type="text/html" title="IKB - Span of a Tetrahedron" /><published>2024-11-11T00:00:00+00:00</published><updated>2024-11-11T00:00:00+00:00</updated><id>https://lordvaider.github.io/2024/11/11/Irodov-Ka-Baap-Problem-2</id><content type="html" xml:base="https://lordvaider.github.io/2024/11/11/Irodov-Ka-Baap-Problem-2.html"><![CDATA[<h1 id="problem">Problem</h1>
<p>For a set \(E\) in \(R^3\), let \(L(E)\) consist of all points on all lines determined by any two points of \(E\).</p>

<p>More formally, define \(L(E)\) as:</p>

<p>$
\hspace{1cm} L(E) = \bigcup_{{p, q} \subset E} \ell(p, q),
$</p>

<p>where \(\ell(p, q) = \{ (1-t)p + tq : t \in \mathbb{R} \}\) is the line passing through the points \(p\) and \(q\).</p>

<p>Thus if \(V\) consists of the four vertices of a regular tetrahedron, then \(L(V)\) consists of the six edges of the tetrahedron, extended infinitely in both directions.</p>

<p>Does \(L(L(V))\) span all of \(R^3\)?</p>

<h1 id="solution">Solution</h1>
<p>I originally saw this problem in <a href="https://stanwagon.com/wagon/misc/bestpuzzles.html">Stan Wagon’s problem collection</a> (That link is worth bookmarking for every puzzle enthusiast). He credits Victor Klee with creating this problem. Klee was one of those badass geometers who had a fully functional amusement park inside his brain. Among his many achievements, the ones I understood and was impressed by are proposing the Art Gallery Problem and showing that the worst-case runtime of the Simplex method is exponential.</p>

<p>This problem is one of those beauties that befuddle experienced mathematicians, but can be solved by a layperson with good enough geometric intuition. The solution below relies on this fact, as I don’t have the patience to write the full-blown algebraic proof.</p>

<p>Firstly, we note that the set \(L(V)\) is just the 6 edges of the tetrahedron, extended to infinity on both sides.</p>

<p>Let’s label these lines \(E_{12}, E_{13}, E_{14}, E_{23}, E_{24}, E_{34}\). (A line is named in correspondence with the 2 vertices of the tetrahedron it passes through)</p>

<p>If 2 lines are co-planar, the span of those lines can only include the plane containing those lines. If we want to fill space, we must focus on the skew lines.</p>

<p>In the set above, there are 3 pairs of skew lines - \((E_{12}, E_{34}), (E_{14}, E_{23})\) and \((E_{13}, E_{24})\).</p>

<p>Let’s focus on one of these pairs - Convince yourself that the set \(L(E_x, E_y)\) contains all points in \(R^3\), except for two planes - The plane containing \(E_y\) and parallel to \(E_x\), and the plane containing \(E_x\) and parallel to \(E_y\).</p>

<p>If you’re having difficulty visualizing this, you can use the fact that such a pair of skew lines can be rotated to the lines \(E_{12}: x=0, z=0\) and \(E_{34}: x=1, y=0\) and then prove algebraically that you can attain all \((x, y, z)\) as linear combinations of points on these lines EXCEPT for the ones where \(x=0, z \neq 0\) or \(x=1, y \neq 0\).</p>
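<p>If you’d rather let a computer do the convincing, here is a quick sanity check of that algebraic claim (my own sketch, not part of the original solution). A point on the line through \((0, s, 0)\) on \(E_{12}\) and \((1, 0, u)\) on \(E_{34}\) is \((t, (1-t)s, tu)\), so hitting \((x, y, z)\) forces \(t = x\) and fails exactly on the two excluded planes:</p>

```python
def witness(x, y, z):
    """Try to express (x, y, z) as (1-t)*(0, s, 0) + t*(1, 0, u),
    i.e. as a point on a line joining E12 (x=0, z=0) to E34 (x=1, y=0).
    Returns parameters (s, u, t) if possible, else None."""
    if x == 0:
        # t = 0, so the point is (0, s, 0): reachable only if z == 0.
        return None if z != 0 else (y, 0.0, 0.0)
    if x == 1:
        # t = 1, so the point is (1, 0, u): reachable only if y == 0.
        return None if y != 0 else (0.0, z, 1.0)
    return (y / (1 - x), z / x, x)

def on_line(s, u, t):
    """The point (1-t)*(0, s, 0) + t*(1, 0, u)."""
    return (t, (1 - t) * s, t * u)
```

<p>Every point off the planes \(x=0\) and \(x=1\) gets a witness, while points like \((0, 1, 1)\) do not - consistent with the two missing planes per skew pair.</p>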

<p>For each pair of skew lines, we get a pair of parallel planes that is NOT in the image of \(L(E_x, E_y)\). The intersections of these pairs of planes give us 4 points that are not in the image \(L(L(V))\).</p>

<p>Is there a succinct way to describe these points? One of my favourite professors used to say that whenever you encounter a regular tetrahedron, one potentially fruitful avenue is to inscribe it in a cube. If you have a cube with vertices \(\in \{0, 1\}^3\), then you can inscribe in it the regular tetrahedron with vertices \((0, 0, 0), (0, 1, 1), (1, 0, 1)\) and \((1, 1, 0)\). The points not in \(L(L(V))\) are then exactly the 4 remaining vertices of the cube.</p>

<p><img src="/images/2024-11-11/tetrahedron.png" alt="png" /></p>
<p style="text-align: center;">
<i> When life gives you a tetrahedron, inscribe it in a cube</i>
</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Problem For a set \(E\) in \(R^3\), let \(L(E)\) consist of all points on all lines determined by any two points of \(E\).]]></summary></entry><entry><title type="html">Irodov Ka Baap - Origins</title><link href="https://lordvaider.github.io/2024/11/06/Irodov-Ka-Baap-Origins.html" rel="alternate" type="text/html" title="Irodov Ka Baap - Origins" /><published>2024-11-06T00:00:00+00:00</published><updated>2024-11-06T00:00:00+00:00</updated><id>https://lordvaider.github.io/2024/11/06/Irodov-Ka-Baap-Origins</id><content type="html" xml:base="https://lordvaider.github.io/2024/11/06/Irodov-Ka-Baap-Origins.html"><![CDATA[<p>I have been interested in solving, creating and curating math puzzles since high school. This tendency has helped me crack competitive exams, clear job interviews and make friends with loads of interesting people. Starting in high school, I began compiling a collection of my favourite problems and gave it the tongue-in-cheek title “Irodov ka Baap”.</p>

<p>Some background here for readers who don’t know - <strong>“Problems in General Physics” by I.E. Irodov</strong> was a book that nerdy Indian teenagers venerated, feared and adored. It was one of the classic preparation resources for the IIT JEE (An annual entrance exam which is the gateway to the prestigious Indian Institutes of Technology) and every serious aspirant proudly owned a copy that was dog-eared and worn out from frequent use (I used the past tense because things have probably changed since I gave the exam).</p>

<p><img src="/images/2024-11-06/irodov.png" alt="png" /></p>
<p style="text-align: center;">
<i> My personal copy looked much more beaten up</i>
</p>

<p>In Hindi, the literal meaning of the phrase “X ka baap” is “X’s father”, and colloquially it is used to denote something far superior to X. Hence the title “Irodov Ka Baap” was my way of claiming that my (yet to be written) book was far superior to a book beloved by several generations of my country’s most brilliant students. To answer your question, yes, 18-year-old me was an insufferable prick with delusions of grandeur.</p>

<p>I actually took IKB further along than most of my projects - I even managed to sell an early draft for Rs. 50k! Over the years, the contents changed to reflect my changing tastes - The physics puzzles got swapped out for probability and discrete math but like the Ship of Theseus, I retained the name of my beloved collection. Unfortunately, I was never able to publish - The act of writing my thoughts down in a satisfactory way proved to be surprisingly hard. As time went on, my core interest in puzzles started waning. My puzzle friends were re-directing their mental bandwidth towards excelling in their careers, or having babies or going insane. My own mental bandwidth proved insufficient for such lofty goals, and was exhausted in trying to untangle the tedious, bureaucratic mazes adults are required to run through.</p>

<p>Recently however, life came full circle for me. In the course of helping some high schoolers prepare for the <a href="https://artofproblemsolving.com/wiki/index.php/AMC_12_Problems_and_Solutions">AMC</a>, I revisited the old neural pathways and rediscovered the dormant joy of puzzle solving. I decided to make use of this momentum to write down some of my favourite puzzles for the internet audience. The hope is to post a new puzzle each day for as long as I can. Links to all the problems will be aggregated below.</p>

<p>The solution for the previous day’s puzzle will be posted along with the puzzle for the next day.</p>

<p>P.S. A small fraction of you reading this are probably saying “BTW Irodov ka baap already exists, it’s called Aptitude Test Problems in Physics by SS Krotov”. To you I say, well done my friend! I doubt any one else in your life gave a fuck when you told them you had read and solved Krotov (No one did in mine) but today you got to feel smug for a few seconds and isn’t that what life is all about?</p>]]></content><author><name></name></author><summary type="html"><![CDATA[I have been interested in solving, creating and curating math puzzles since high school. This tendency has helped me crack competitive exams, clear job interviews and make friends with loads of interesting people. Starting in high school, I began compiling a collection of my favourite problems and gave it the tongue-in-cheek title “Irodov ka Baap”.]]></summary></entry><entry><title type="html">IKB - The Impossible Problem, Version 1729.0</title><link href="https://lordvaider.github.io/2024/11/06/Irodov-Ka-Baap-Problem-1.html" rel="alternate" type="text/html" title="IKB - The Impossible Problem, Version 1729.0" /><published>2024-11-06T00:00:00+00:00</published><updated>2024-11-06T00:00:00+00:00</updated><id>https://lordvaider.github.io/2024/11/06/Irodov-Ka-Baap-Problem-1</id><content type="html" xml:base="https://lordvaider.github.io/2024/11/06/Irodov-Ka-Baap-Problem-1.html"><![CDATA[<h1 id="puzzle">Puzzle</h1>

<p>Two infinitely intelligent and truthful mathematicians \(P\) and \(Q\) meet to play a game. Each of them individually decides on a secret real number greater than 1 (\(P\)’s number is \(p\) and \(Q\)’s number is \(q\)). They submit their numbers to a moderator \(M\), who reveals to both of them the product \(pq\), along with another number \(r\). \(M\) doesn’t reveal which number is \(pq\) and which number is \(r\). This entire setup is common knowledge between \(P\) and \(Q\), i.e. they both know the setup and know that the other knows the setup, etc.</p>

<p>\(M\) then turns to \(P\) and asks him if he knows \(q\). If \(P\) says no, he asks \(Q\) if he knows \(p\). If \(Q\) says no, he then asks \(P\) again, and so on. Prove that this game terminates after a finite number of rounds.</p>

<h1 id="solution-updated-11nov24">Solution (Updated 11Nov24):</h1>

<p>This problem belongs to a category of puzzles that exploit the logical concept of <a href="https://en.wikipedia.org/wiki/Common_knowledge_(logic)">Common Knowledge</a>. These puzzles explore how shared knowledge, and the ability to deduce information based on what others know, can lead to surprising and often counterintuitive conclusions. Famous examples include the <a href="https://terrytao.wordpress.com/2008/02/05/the-blue-eyed-islanders-puzzle/">Blue-Eyed Islander Puzzle</a> and the many variants of the <a href="https://en.wikipedia.org/wiki/Sum_and_Product_Puzzle">Impossible Puzzle</a>. If you haven’t encountered these before, they are well worth exploring as quintessential examples of the genre. Also worth reading is Scott Alexander’s DELIGHTFUL <a href="https://slatestarcodex.com/2015/10/15/it-was-you-who-made-my-blue-eyes-blue/">short story</a> exploring the puzzle from the point of view of the islanders.</p>

<p>At first glance, such puzzles can feel perplexing or even paradoxical. A common reaction is, “Where is the new information coming from?” After all, for progress to be made, the participants \(P\) and \(Q\) must gain new insights with each round, even though no additional explicit information is revealed. This leads to the realization that the source of this “new” information is not external but emerges from the participants’ reasoning about each other’s reasoning.</p>

<p>The key to unraveling these puzzles is understanding that:</p>

<p>1) \(P\) and \(Q\) are infinitely intelligent and that</p>

<p>2) They are both aware of the setup.</p>

<p>This means in any given situation they can make all logically possible deductions and know that other participants can do the same. Therefore, the lack of immediate deductions becomes itself a crucial source of information. For example, if \(P\) doesn’t know the answer in the first round, it implies that certain conditions must hold for \(Q\), allowing \(Q\) to refine their possibilities—and vice versa.</p>

<p>With this in mind, let us systematically track the information that both participants have in each round.</p>

<p><strong>Start of the game</strong>:</p>

<p>At the start of the game, both \(P\) and \(Q\) see 2 numbers \(a\) and \(b\) (Without loss of generality, \(a &lt; b\)). Hence we have:</p>

<table>
  <thead>
    <tr>
      <th><strong>P’s Knowledge</strong></th>
      <th><strong>Q’s Knowledge</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>\(q = \frac{a}{p} \text{ OR } q = \frac{b}{p}\)</td>
      <td>\(p = \frac{a}{q} \text{ OR } p = \frac{b}{q}\)</td>
    </tr>
    <tr>
      <td>\(q &gt; 1\)</td>
      <td>\(p &gt; 1\)</td>
    </tr>
  </tbody>
</table>

<p><strong>Round 1, Question 1:</strong></p>

<p>If \(\frac{a}{p} &lt; 1\), \(P\) can immediately eliminate it and knows that \(q = \frac{b}{p}\).</p>

<p>Conversely, if \(P\) cannot immediately identify \(q\), this must imply \(\frac{a}{p} &gt; 1 \implies p &lt; a\).</p>

<p>Hence if \(P\) answers <strong>No</strong> in the first round, \(Q\)’s knowledge gets updated as following:</p>

<table>
  <thead>
    <tr>
      <th><strong>P’s Knowledge</strong></th>
      <th><strong>Q’s Knowledge</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>\(q = \frac{a}{p} \text{ OR } q = \frac{b}{p}\)</td>
      <td>\(p = \frac{a}{q} \text{ OR } p = \frac{b}{q}\)</td>
    </tr>
    <tr>
      <td>\(q &gt; 1\)</td>
      <td>\(1 &lt; p &lt; a\)</td>
    </tr>
  </tbody>
</table>

<p><strong>Round 1, Question 2:</strong></p>

<p>If \(\frac{b}{q} &gt; a\), \(Q\) can immediately deduce \(p\).</p>

<p>Conversely, if \(Q\) cannot deduce \(p\), this must imply \(\frac{b}{q} &lt; a \implies q &gt; \frac{b}{a}\).</p>

<p>Hence if \(Q\) answers <strong>No</strong> in the first round, \(P\)’s knowledge gets updated as follows</p>

<table>
  <thead>
    <tr>
      <th><strong>P’s Knowledge</strong></th>
      <th><strong>Q’s Knowledge</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>\(q = \frac{a}{p} \text{ OR } q = \frac{b}{p}\)</td>
      <td>\(p = \frac{a}{q} \text{ OR } p = \frac{b}{q}\)</td>
    </tr>
    <tr>
      <td>\(q &gt; \frac{b}{a}\)</td>
      <td>\(1 &lt; p &lt; a\)</td>
    </tr>
  </tbody>
</table>

<p>As we can see, after one round, \(P\)’s lower bound for \(q\) increased from \(1\) to \(\frac{b}{a}\). Continuing in this manner, we can see that after the \(n^{th}\) round \(P\)’s lower bound for \(q\) increases to \(\left( \frac{b}{a} \right)^n\).</p>

<p>Similarly, \(Q\)’s upper bound for \(p\) at the end of the \(n^{th}\) round decreases to \(a \cdot \left( \frac{a}{b} \right)^n\).</p>
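<p>These bound updates are mechanical enough to simulate. The sketch below is my own code (with hypothetical example values, not part of the puzzle); it tracks only the public bounds derived above and counts the questions asked until someone can answer:</p>

```python
import math

def simulate(p, q, r):
    """Count questions until someone answers 'yes'.

    p, q are the secret numbers (> 1), r is the decoy; the players see
    only the unordered pair {p*q, r}."""
    a, b = sorted([p * q, r])
    lo_q = 1.0       # common knowledge: q > lo_q
    hi_p = math.inf  # common knowledge: 1 < p < hi_p
    questions = 0
    while True:
        # P's turn: q is a/p or b/p, and P knows q > lo_q.
        questions += 1
        if a / p <= lo_q:            # smaller candidate ruled out
            return questions         # P announces q = b/p
        hi_p = a / lo_q              # P's "no" tells Q that p < a/lo_q
        # Q's turn: p is a/q or b/q, and Q knows 1 < p < hi_p.
        questions += 1
        if b / q >= hi_p or a / q <= 1:
            return questions         # Q can name p
        lo_q = b / hi_p              # Q's "no" tells P that q > b/hi_p
```

<p>For \(p=2, q=3, r=5\) the ratio \(\frac{b}{a} = \frac{6}{5}\) is close to 1, so the game drags on for 13 questions; a lopsided decoy like \(r=1000\) ends it on the second question.</p>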

<p>Hence it’s clear that, since \(\frac{b}{a} &gt; 1\), these bounds tighten geometrically, and sooner or later one of them will deduce the other’s number.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Puzzle]]></summary></entry><entry><title type="html">Google Timeline Analysis (Finally Something Useful)</title><link href="https://lordvaider.github.io/2024/05/26/Travel-History.html" rel="alternate" type="text/html" title="Google Timeline Analysis (Finally Something Useful)" /><published>2024-05-26T00:00:00+00:00</published><updated>2024-05-26T00:00:00+00:00</updated><id>https://lordvaider.github.io/2024/05/26/Travel-History</id><content type="html" xml:base="https://lordvaider.github.io/2024/05/26/Travel-History.html"><![CDATA[<p>Last year, I had to submit a large form with my travel history outside of the UK.</p>

<p>The normie approach would be to sit down with your passport and a large cup of coffee, note down each stamp and figure out when the corresponding flight was - Yuck. Further, the UK doesn’t stamp your passport on the way out, so that leaves you with edge cases (If you arrived in Bangkok on 17th May, did you leave the UK on the 16th or the 17th?) - Such edge cases would have to be resolved by tracking down the corresponding flight ticket in your inbox. Double Yuck.</p>

<p>The Type A personality approach would be to maintain an ongoing spreadsheet of your travel history that you update at fixed intervals (Along with the spreadsheets that track your investments, the expiry dates of all your medications and the last known locations of all your sworn enemies). Triple mega ultra YUCK.</p>

<p>This seemed like a prime opportunity for a datadude revival post. Google has been stalking my physical location for years and it was finally time for me to extract some value from that. I downloaded the relevant files from Google Takeout and I was ready to go.</p>

<h2 id="takeout-choices---raw-or-processed">Takeout Choices - Raw or Processed?</h2>

<p>Google gives you your history in 2 formats.</p>

<p><strong>Raw Data:</strong> A giant list where each entry is (Point in time, Point in space). This is the physicist’s <a href="https://en.wikipedia.org/wiki/World_line">worldline</a> - Your life is just a discontinuous curve moving through 3-dimensional spacetime (The discontinuities are the times your phone wasn’t online and 3-D because Google Maps doesn’t store your spatial z co-ordinate).</p>

<p>The great thing about this format is the data schema is as simple as it can get. This has many advantages, including maximal data portability - Integrating your Google worldline and your Apple Maps worldline is a simple matter of list concatenation.</p>

<p>The bad part is you have to do all the data crunching. There are also some weird issues that crop up when using the raw data - For example, soon after I moved to the UK, there is a patch in my raw data that shows my physical position as alternating between London and Mumbai. I suspect this is because I was logged in to my Google account on both my PC in my parents’ home and my cellphone, and my position was being recorded from both devices simultaneously.</p>
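<p>For the curious, loading the raw worldline is only a few lines of Python. The field names below (<code>locations</code>, <code>latitudeE7</code>, <code>longitudeE7</code>, <code>timestamp</code>) are what raw Takeout exports appear to use - treat the exact schema as an assumption, since Google changes it without notice:</p>

```python
import json

def load_worldline(path):
    """Parse a raw Takeout location file into a chronological list of
    (timestamp_str, lat_degrees, lon_degrees) tuples.
    The field names are an assumption about the export schema."""
    with open(path) as f:
        records = json.load(f)["locations"]
    return [
        (r["timestamp"], r["latitudeE7"] / 1e7, r["longitudeE7"] / 1e7)
        for r in records
        if "latitudeE7" in r and "longitudeE7" in r
    ]
```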

<p><strong>Semantic Format:</strong> Here Google uses its world-class, cutting-edge algos and armies of highly paid data scientists to slap some meaningful labels on the raw worldline data. At a high level, your location history is divided into placeVisits (Where you spend some time stationary at location X) and activitySegments (Where you travel between locations X and Y).</p>

<p>Each placeVisit contains information like the coordinates of the place visited, the duration spent there, the address as inferred by Google etc. Each activitySegment has information like the start and end coordinates, the path followed and the inferred mode of transport for that edge.</p>

<p>The problem with these is that the data model is complicated and varies over time. For example, in my semantic data, each activitySegment contains a field ‘activities’, which is a probability distribution over the possible modes of transport that activity involved. Now most (But not all) activitySegments also contain a field called ‘activityType’, which is assigned the value of the most likely activity in the list. Hence any solution that uses the ‘activityType’ field may miss some information. Without an official dictionary from Google clearly fleshing out what these fields mean, we are forced to rely on common sense to interpret them - and while that is good enough for most cases, it will certainly not cover all of them.</p>
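<p>In code, a defensive reading of the mode of transport might prefer <code>activityType</code> when present and otherwise fall back to the most probable entry in <code>activities</code> (again, a sketch that assumes the inferred, undocumented schema):</p>

```python
def best_activity(segment):
    """Most likely mode of transport for an activitySegment dict.

    Prefers the explicit 'activityType' label; falls back to the
    highest-probability entry of 'activities'. The field names are an
    assumption about the (undocumented) Takeout schema."""
    if "activityType" in segment:
        return segment["activityType"]
    candidates = segment.get("activities", [])
    if not candidates:
        return "UNKNOWN"
    best = max(candidates, key=lambda a: a.get("probability", 0.0))
    return best.get("activityType", "UNKNOWN")
```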

<p>Further, Google will sometimes fuck up the tagging of activitySegments (like the time they thought I cycled from Lucknow to Mumbai in 2 hours). For the most part however, they are probably doing a better job of cleaning up and interpreting this data than I can.</p>

<h2 id="approach-1---rawdogging-data-costly-api-calls-simplifying-assumptions">Approach 1 - Rawdogging data, Costly API calls, Simplifying assumptions</h2>

<p>Given the above, it seemed that if we want a Google-proof solution (In terms of being immune to both their sloppy data processing <em>and</em> their potential future bankruptcy from class action lawsuits filed by <a href="https://www.reddit.com/r/google/comments/1cziil6/a_rock_a_day_keeps_the_doctor_away/">rock-eaters</a>), the most reliable way is to use the raw worldline. Further, for the task I had in mind, the solution seemed super simple - For each point on the worldline, map it to the corresponding country. Find all the points in time when the country changes and Boom - You have your travel history table.</p>

<p><strong>Problem with this approach</strong>: Mapping from (Latitude, Longitude) → Country is an expensive function call. The raw data contains a point every 15-20 seconds, so that translates to a lot of points. In 3000 days of history, I had 15 million raw data points. The really dumb approach of simply mapping each raw datapoint to a country is waaaaay too slow. Maybe we could do something cleverer like form a small number of clusters (1000?), map each cluster to a country and then use that to label raw points? It could work, but it’s not foolproof by any means and relies on a complex algo. I’m a lazy fucker who hates complexity so I decided to think some more before going down this route.</p>

<p><strong>Exploiting Timeline gaps:</strong> The approach I took exploits the fact that the data is already naturally clustered - As I stated earlier, the worldline is discontinuous whenever my cellphone is offline, and these gaps partition the raw data in a natural way. The key insight is that most country changes occur during such gaps. This is because:</p>

<ol>
  <li>I mostly travel between countries by flight - This is especially true for travel in and out of the UK.</li>
  <li>There is a gap in my location history while I’m flying as data connectivity is lost (I never access WiFi on the plane, not sure how this would be affected if I did).</li>
</ol>

<p>Now there will obviously be some discontinuities which were not travel related - Maybe my phone ran out of battery, or I didn’t have connectivity, but that’s OK - By focussing just on the timeline gaps, we massively reduce the number of points we need to reverse map. We can further cut down the points by prioritizing the biggest gaps first.</p>

<p><strong>Final Approach:</strong> So what I did is</p>
<ol>
  <li>Evaluate the physical distance between consecutive points in the worldline (I used the <a href="https://en.wikipedia.org/wiki/Haversine_formula">Haversine metric</a>, just to show off more than anything else).</li>
  <li>Reverse-sorted and got the consecutive worldline points with the largest distance between them. These are the gaps.</li>
  <li>Applied a sensible threshold cutoff - 500 kms seems like a good lower bound for international flights.</li>
  <li>Mapped the remaining few hundred points to countries</li>
  <li>Filtered down to the gaps where start_country != end_country</li>
</ol>

<p>And voila! I had a table of all my international travels.</p>
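<p>The steps above can be sketched in a few lines of Python. This is a minimal sketch: the toy worldline points, the tuple layout and the 500 km threshold are illustrative, and the final country lookup is left to whatever reverse-geocoding API you prefer.</p>

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in km."""
    R = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def find_flight_gaps(worldline, threshold_km=500):
    """Return consecutive worldline points more than threshold_km apart,
    biggest gap first. Each point is a (timestamp, lat, lon) tuple."""
    gaps = []
    for p, q in zip(worldline, worldline[1:]):
        d = haversine_km(p[1], p[2], q[1], q[2])
        if d >= threshold_km:
            gaps.append((p, q, d))
    return sorted(gaps, key=lambda g: -g[2])

# Toy worldline: central London, then Heathrow, then Mumbai airport.
# The London -> Heathrow hop is small; the Heathrow -> Mumbai gap is the flight.
worldline = [
    ("2021-03-01T08:00", 51.5074, -0.1278),
    ("2021-03-01T09:00", 51.4700, -0.4543),
    ("2021-03-01T19:30", 19.0896, 72.8656),
]
gaps = find_flight_gaps(worldline)
```

<p>Only the endpoints of the surviving gaps then need a country lookup, after which you filter on start_country != end_country.</p>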

<p><strong>Comments</strong>: This solution works for me (And one other super-lazy friend who I sent my jupyter notebook to) but will the underlying assumptions hold for everyone? Certainly it won’t capture the country changes for Europeans that routinely drive/take trains across country borders. Will it work for the perennially online fucks who check their emails while on international flights? This solution won’t work for a lot of people and that is unsatisfying. On the plus side, it uses the raw data, and the code and logic are extremely simple - The only complicated bit is the country lookup and we outsource that to an API.</p>

<h2 id="approach-2---processed-data-only">Approach 2 - Processed Data only</h2>

<p>Chronologically, this was the first approach that I took to solve this problem - In this case I used the semantic data and went through the following steps:</p>

<ol>
  <li>Isolated the activitySegments tagged as FLIGHTS,</li>
  <li>Looked at the placeVisits before and after those FLIGHT Segments,</li>
  <li>Extracted the airport name from the address of the placeVisits</li>
  <li>Manually compiled a mapping of which airport was in which country.</li>
</ol>
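<p>For illustration, the pairing in steps 1-2 might look like the sketch below. The field names (‘timelineObjects’ entries carrying either an ‘activitySegment’ or a ‘placeVisit’, with ‘activityType’ and ‘location.address’ inside them) are my assumptions about the semantic format which, as noted above, varies over time - Treat them as a sketch, not gospel.</p>

```python
def flights_with_endpoints(timeline_objects):
    """Pair each FLIGHT activitySegment with the placeVisits around it.

    Returns (departure_address, arrival_address) tuples; the airport name
    can then be extracted from each address."""
    flights = []
    for prev, cur, nxt in zip(timeline_objects, timeline_objects[1:], timeline_objects[2:]):
        seg = cur.get("activitySegment", {})
        if seg.get("activityType") != "FLIGHT":
            continue
        if "placeVisit" in prev and "placeVisit" in nxt:
            flights.append((prev["placeVisit"]["location"].get("address"),
                            nxt["placeVisit"]["location"].get("address")))
    return flights

# Toy example with one flight sandwiched between two place visits.
objects = [
    {"placeVisit": {"location": {"address": "Heathrow Airport, London"}}},
    {"activitySegment": {"activityType": "FLIGHT"}},
    {"placeVisit": {"location": {"address": "Chhatrapati Shivaji Airport, Mumbai"}}},
]
flights = flights_with_endpoints(objects)
```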

<p><strong>Comments</strong>: This approach was pretty stupid, and I would’ve been better off mapping the start and end locations of the FLIGHT segments using the country lookup. I also spent way too much time at this step figuring out the meaning of random useless fields in the structured data, how they were related to one another and if the data was internally consistent (Was the startTime of each activitySegment AFTER the endTime of the previous activitySegment?)</p>

<h2 id="other-approaches">Other approaches</h2>

<p>This idea is obvious enough that a few other people have implemented their own versions of it:</p>

<ol>
  <li><a href="https://janlauge.github.io/2021/google_timeline_travel_history/">Laurens Geffert</a>: One of the aforementioned highly talented Google data scientists - His solution is kind of a mix of my 2 approaches - He whittled down the space of points required by focussing on the processed data and then mapped it using a country lookup. I suspect the API he used to do the country lookups was also much more performant than the one I used (I just used whatever ChatGPT recommended). I should add here that, like me, he also performed a sanity check of the results and manually removed a few obviously nonsensical datapoints.</li>
</ol>

<p>If Laurens has already solved the problem, why blog my solution? Well I could give you reasons like I used a different algorithm or that it’s implemented in python instead of R, but mostly it’s because Peter Campbell is my spirit animal.</p>

<p><img src="/images/2024-05-26/campbell.png" alt="png" /></p>
<p style="text-align: center;">
<i>Yeah, you tell 'em Pete!</i>
</p>
<ol start="2">
  <li><a href="https://geoprocessing.online/">Geoprocessing</a>: This company lets you upload your timeline data and then helps you with various analyses. I didn’t want to upload my data anywhere so haven’t experimented with this.</li>
  <li><a href="https://www.mileagewise.com/">Mileagewise</a>: They use Google timeline data to create mileage logs (In the US you can claim tax benefits by claiming the mileage incurred while driving around as a business expense). This is a very different use case from the one considered here, but I threw it in because it’s an ingenious use of timeline data.</li>
</ol>

<h2 id="conclusion">Conclusion</h2>

<p>The most devastating criticism of the quantified-self subculture is that the projects seldom have tangible value - Pretty graphs that show your travel locations on a map or analytics that count how many cities/countries you visited last month may be interesting, but <strong>they are not actionable in any meaningful way</strong>. This is why most people don’t bother with the Timeline digest mail that you receive from Google every month - Who gives a fuck?</p>

<p><img src="/images/2024-05-26/timeline_digest.PNG" alt="png" /></p>

<p style="text-align: center;">
<i>Looks like Google's data scientists exercised as much creativity in designing this digest as the guys who named their geographic data processing company Geoprocessing</i>
</p>

<p>This project shows that you can extract value from your personal data in a simple and direct way. As with any project, a quick and dirty first cut will handle 80% of the cases  but a solution which addresses all the edge cases presented by all of humanity is beyond the scope of a blog post project hacked together in a few hours. However, I hope that at least a few readers will be able to save a couple of hours of tedious effort by leveraging this code to fill out their own travel history forms!</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Last year, I had to submit a large form with my travel history outside of UK.]]></summary></entry><entry><title type="html">The Data Liberation Front</title><link href="https://lordvaider.github.io/2021/06/21/Data-Liberation-Front.html" rel="alternate" type="text/html" title="The Data Liberation Front" /><published>2021-06-21T00:00:00+00:00</published><updated>2021-06-21T00:00:00+00:00</updated><id>https://lordvaider.github.io/2021/06/21/Data-Liberation-Front</id><content type="html" xml:base="https://lordvaider.github.io/2021/06/21/Data-Liberation-Front.html"><![CDATA[<p><em>The code for this project can be found <a href="https://github.com/lordvaider/whatsappscraper">here</a>.</em></p>

<p>For my next post, I’d initially planned to analyse some WhatsApp group chats and see what I could deduce about the chat participants and the relationships between them - Ideally something that goes beyond the tools that collect and display statistics like number of messages sent, most active time of day and most used smileys.</p>

<p>Unfortunately, I got side-tracked while trying to collect my raw WhatsApp data, and because I’m way behind on my publishing deadlines, I decided to write about that instead.</p>

<p><strong>Table of Contents:</strong></p>

<ul id="markdown-toc">
  <li><a href="#whatsapp---the-good-the-very-good-and-the-excellent" id="markdown-toc-whatsapp---the-good-the-very-good-and-the-excellent">WhatsApp - The Good, the Very Good, and the Excellent</a></li>
  <li><a href="#whatsapp---the-ugly" id="markdown-toc-whatsapp---the-ugly">WhatsApp - The Ugly</a></li>
  <li><a href="#the-5-stages-of-grief" id="markdown-toc-the-5-stages-of-grief">The 5 Stages of Grief</a></li>
  <li><a href="#the-solution" id="markdown-toc-the-solution">The Solution</a>    <ul>
      <li><a href="#step-1---mirroring-my-phone-on-the-computer" id="markdown-toc-step-1---mirroring-my-phone-on-the-computer">Step 1 - Mirroring my Phone on the Computer</a></li>
      <li><a href="#step-2---gui-automation" id="markdown-toc-step-2---gui-automation">Step 2 - GUI Automation</a></li>
      <li><a href="#step-3---optical-character-recognition" id="markdown-toc-step-3---optical-character-recognition">Step 3 - Optical Character Recognition</a></li>
      <li><a href="#step-4---parsing-ocr-text-to-structured-data" id="markdown-toc-step-4---parsing-ocr-text-to-structured-data">Step 4 - Parsing OCR Text to Structured Data</a>        <ul>
          <li><a href="#sidebar-probabilistic-approach-to-data-extraction" id="markdown-toc-sidebar-probabilistic-approach-to-data-extraction">Sidebar: Probabilistic Approach to Data Extraction</a></li>
          <li><a href="#advantages-over-the-waterfall" id="markdown-toc-advantages-over-the-waterfall">Advantages over the waterfall</a></li>
        </ul>
      </li>
    </ul>
  </li>
  <li><a href="#why" id="markdown-toc-why">Why?</a></li>
  <li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
  <li><a href="#footnotes" id="markdown-toc-footnotes">Footnotes</a></li>
</ul>

<h1 id="whatsapp---the-good-the-very-good-and-the-excellent">WhatsApp - The Good, the Very Good, and the Excellent</h1>

<p>For as long as I can remember, WhatsApp let you easily export your chat in a neat .txt file with just a couple of button clicks. I already had great respect for WhatsApp for several reasons:</p>
<ul>
  <li><strong>Design:</strong> Beautifully minimal.</li>
  <li><strong>Penetration</strong>: Whenever I meet a new group of people, WhatsApp groups are the preferred method of communication. Everyone always seems to have and use WhatsApp.</li>
  <li><strong>Erlang</strong>: The backend of WhatsApp is written in Erlang. I don’t know why exactly that is impressive, but I do know that Erlang is fucking cool <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. To give you an idea, WhatsApp’s use of Erlang is often touted as the reason they were able to handle over 900M users with only 50 engineers.</li>
  <li><strong>Obvious Features that no one else has figured out</strong>: When you send chats on WhatsApp and there’s no signal, WhatsApp will automatically send the chat when the signal resumes! I’m still not sure why Messenger can’t accomplish this.</li>
</ul>

<p>“Good for WhatsApp!” - I thought. Data portability is just one more thing that these guys handle in a clean, simple and efficient manner! Unfortunately, I was in for a rude shock…</p>

<h1 id="whatsapp---the-ugly">WhatsApp - The Ugly</h1>

<p>I don’t remember why, but I decided to export and save down my WhatsApp call logs as well - The more data, the better right? However, on clicking the 3 dots, I couldn’t find the “Export Call Logs” option. At first, I thought ah well, guess WhatsApp isn’t perfect after all - They’ve buried the Export Call Logs button somewhere deep inside the settings window. After a few minutes of poring over the settings, it dawned on me that maybe <em>there is no Export Call Logs option in WhatsApp!</em> I fired off some Google searches, frantically trying to refute this possibility but each empty result cemented it further into a horrifying fact.</p>

<h1 id="the-5-stages-of-grief">The 5 Stages of Grief</h1>

<p><strong>Denial</strong>: I spent an embarrassingly long amount of time convinced that exporting call logs was in fact possible, and that I was just a technologically challenged simpleton who couldn’t figure it out. At some point Googling the exact same phrases again and again gave way to…</p>

<p><strong>Anger</strong>: Isn’t data portability the law? Aren’t telecoms obliged to provide us with our call logs under <a href="https://gdpr-info.eu/">GDPR</a>? What makes WhatsApp different? Why wouldn’t they just provide the call logs anyway? This time, the Zuck had gone way too far. For corrupting the soul of my beloved WhatsApp, he would pay the ultimate price…</p>

<p><strong>Bargaining</strong>: I came across some apps that claimed to be able to retrieve the WhatsApp call logs, but the only testimonials I could find were from the creators of the apps themselves. I was not quite ready to hook my WhatsApp up to a shady app I downloaded off some dark corner of the internet. I looked for ways to get the WhatsApp db files from my phone, but these files were encrypted, and it was not clear how to decrypt them (There was an app that claimed to let you do this, but I <em>definitely</em> didn’t want to hook my WhatsApp up to a shady third party <em>crypto</em> app). Maybe I could scrape the data from WhatsApp Web? Nice try, but WhatsApp Web doesn’t even have the “Calls” tab (Why would it?). Is there a way to scrape data from apps? Nope, that is a notoriously difficult problem, and apps like WhatsApp have booby traps built in specifically to prevent scraping…</p>

<p><strong>Depression</strong>: The most frustrating part was that the data was right there! In my phone! In theory, I could just scroll through the logs one by one and manually enter the data for each call into a table. All I need to execute that plan is a metric ton of Adderall. Maybe I could build a rig with a robotic arm that would do the scrolling and clicking, and a camera that would take pictures of the screen and extract the data? I actually entertained this idea for about 5 seconds, before remembering that every hardware project I’ve attempted has ended in either abortion, miscarriage or attempted suicide…</p>

<p><img src="/images/2020-06-21/joylobo.PNG" alt="png" /></p>

<p style="text-align: center;">
<i><a href="https://3idots.fandom.com/wiki/Joy_Lobo">Joy Lobo</a> also abandoned his hardware project, but he showed impressive follow through on his suicide attempt</i>
</p>

<p><b><s>Acceptance</s> Obsession</b>: Unfortunately, acceptance has never been my strong suit. I knew I had to let go, but just like the time Kiran Kapoor called me fartface in front of the entire 6th grade, my mind kept going back to it. The great thing about engineering problems though (As opposed to bitter middle school memories) is that if you keep thinking about them, it sometimes leads to a resolution.</p>

<h1 id="the-solution">The Solution</h1>
<p>The Eureka moment was realizing that I didn’t need a robotic hand or any hardware - I could just mirror my phone screen on my computer and write some code to simulate the actions of scrolling through the call logs and capturing screenshots. Once I had the screenshots, I could use OCR to extract the text data from them and chuck it into a table. Easy right? Right guys?</p>

<p><img src="/images/2020-06-21/tumbleweed.gif" alt="gif" /></p>

<p style="text-align: center;">
<i>Guys???</i>
</p>

<h2 id="step-1---mirroring-my-phone-on-the-computer">Step 1 - Mirroring my Phone on the Computer</h2>
<p>This was easily achieved with a little <a href="https://github.com/Genymobile/scrcpy">scrcpy</a> magic - The hardest part of this step was removing the USB cable from my phone charger and connecting it to my computer.</p>

<h2 id="step-2---gui-automation">Step 2 - GUI Automation</h2>
<p>The next step was figuring out how to control my mouse and keyboard with code. In principle, this was also fairly easy - <a href="https://pyautogui.readthedocs.io/en/latest/index.html">pyautogui</a> is a great python module that lets you do exactly that. Some of the challenges I faced:</p>

<ol>
  <li>
    <p>The biggest challenge with GUI Automation (Or automation of any kind really) is the inability of the machine to adapt to even the most minor surprises. The machine will keep executing the same instructions, and if something unexpected happens in the middle of that, things could go completely wrong. For example, I received a phone call in the middle of one of my runs, and that threw off the carefully balanced workflow of steps so that the code started saving screenshots of my homescreen again and again (It could’ve been worse). If you try this yourself, switch your phone to flight mode, and don’t leave the machine completely unsupervised; Sit nearby with a book so that you can keep an eye on things.</p>
  </li>
  <li>
    <p>The time interval between pyautogui operations had to be optimised to account for the non-trivial and slightly random latency of GUI functions like switching between windows, or opening up the “Save As” dialog box. I had about 800 call logs to save, and given that I was sitting around watching over the process, I couldn’t afford to spend more than 4 seconds per log.</p>
  </li>
  <li>
    <p>While the whole point of GUI automation was to bypass manual drudgery, there was still the one-time drudgery of figuring out the pixel locations for the mouse clicks in the workflow. To make this process slightly smoother, I called print(pyautogui.position()) in a for loop with a sleep timer, so that it printed out the mouse position at regular intervals. I then positioned the mouse everywhere I needed it. I also tried to use keyboard shortcuts as much as possible when executing tasks.</p>
  </li>
  <li>
    <p>Finally, I found that in some cases pressing the ‘Enter’ key via pyautogui didn’t work. Someone had written a detailed answer on Stack Overflow as to why this happens, but I only read the answer below that one, which told me to use <a href="https://pypi.org/project/PyDirectInput/">pydirectinput</a> instead. I used it, it worked, I moved on.</p>
  </li>
</ol>

<p>You can find the code I used to perform the screen grabbing <a href="https://github.com/lordvaider/whatsappscraper/blob/main/imgscraper.py">here</a>. After running this, I had a folder of images like the following, one for each call log.</p>

<p><img src="/images/2020-06-21/Example1.PNG" alt="png" /></p>

<h2 id="step-3---optical-character-recognition">Step 3 - Optical Character Recognition</h2>
<p>Optical Character Recognition or OCR is the use of technology to distinguish and identify characters in images and convert them to text. It is one of the earliest areas of AI research and there are some truly impressive OCR tools out there today - Two examples that blew me away were Google Lens, which can read in foreign-language text from your phone camera and translate it in real time, and Mathpix Snip, which can generate the LaTeX code for a mathematical formula.</p>

<p>My use case was a particularly simple one. The text was digitally generated and formatted, not handwritten or in some funky font, and the image itself was a screenshot, not a picture of a physical sheet of paper (Actual pictures are harder because the text is distorted by wrinkles in the paper and the perspective of the lens). Despite all of this, I found that the OCR module I used (<a href="https://pypi.org/project/pytesseract/">pytesseract</a>) did not give perfect results. The two issues that came up were:</p>

<p><strong>Correctness</strong>: Given the simplicity of my use case, I was expecting perfect accuracy. However, I still got errors in about 2% of my images. There are ways to boost the accuracy by pre-processing the image, but I adopted the lazy method of going into my phone settings and changing the “Display Size” setting to the largest possible, which removed all but a couple of the errors.</p>

<p><strong>Output Consistency</strong>: This was a much more serious problem (Mostly because it forced me to write more code than I’d initially expected to). The format of the string that I received as an output was not consistent across all my images. In some cases, the call type and the duration were part of the same line. In others they were in different lines - The figures below show the 2 main classes of text output I received from the OCR module.</p>

<p><img src="/images/2020-06-21/Template1.PNG" alt="png" /></p>

<p style="text-align: center;">
<i><b>Class 1:</b> Each token printed in a separate line</i>
</p>

<p><img src="/images/2020-06-21/Template2.PNG" alt="png" /></p>

<p style="text-align: center;">
<i><b>Class 2:</b> Call Type and Duration in the same line, Call Time and Size in the same line</i>
</p>

<h2 id="step-4---parsing-ocr-text-to-structured-data">Step 4 - Parsing OCR Text to Structured Data</h2>
<p>The eventual goal of the project was to convert the text output of the OCR engine for each screenshot into an object of the “Record” class below:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Record:
    Name (string)
    Status (string)
    Date (date)
    Calls (list of calls)
</code></pre></div></div>

<p>With each call being an object of the following class:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Call:
    cType (Incoming/Outgoing/Missed)
    Time (HH:MM)
    Duration (H:MM:SS/Not answered/Empty(Iff cType == Missed))
    Size ( DD.D MB/ DDD Kb /Empty(Iff cType == Missed OR Duration == Not answered))
</code></pre></div></div>
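<p>For concreteness, here is the same schema rendered as Python dataclasses - A sketch only; the exact field names and the choice to keep everything as strings are mine, not what my actual code does.</p>

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Call:
    cType: str                      # "Incoming", "Outgoing" or "Missed"
    time: str                       # "HH:MM"
    duration: Optional[str] = None  # "H:MM:SS" or "Not answered"; None iff missed
    size: Optional[str] = None      # e.g. "12.3 MB" or "333 kB"; None if no data used

@dataclass
class Record:
    name: str
    status: str
    date: str                       # kept as a plain string for simplicity
    calls: List[Call] = field(default_factory=list)

record = Record("Alice", "Hey there! I am using WhatsApp.", "2021-06-01")
record.calls.append(Call("Incoming", "19:42", "0:03:21", "333 kB"))
```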

<p>As stated in the previous section, the fact that there are 2 different types of text output complicates things. In such cases, the most straightforward solution is what I call the “Waterfall” approach - Parse the text assuming it’s a class 1 output. If that fails, parse it assuming it’s a class 2 output. If that also fails, throw an error. If your code throws a lot of errors, you realise that there are more than 2 classes of outputs, add one more level to your waterfall and repeat the process.</p>

<p>I didn’t quite use the waterfall approach in my code - Never go full waterfall if you can avoid it. However, I did something pretty close to it (You can have a look at my code <a href="https://github.com/lordvaider/whatsappscraper/blob/main/imgprocessor.py">here</a>). In a nutshell, I checked each line of the input against a list of regular expressions to determine what kind of token it is, and then used its position relative to the other tokens to insert it into the output structure.</p>

<p><strong>Example</strong>: Consider the line L = “19:42”. It matches the regular expression R = [0-9]{2}:[0-9]{2}, that is to say, it is two digits, followed by a colon, followed by two digits. This means that it is either a Call Time or a Call Duration. If the previous line was “333 kB” then we know that this token is preceded by a Call Size, which means it’s a Call Time (Call Durations are never preceded by Call Sizes). Assuming that all tokens are present in the text input in the correct order, we find the first duration-less Call in the output structure, and populate it with 19:42. Once we have processed all elements, we return the resulting output structure.</p>
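<p>A stripped-down version of this classification logic could look as follows - The regexes and token labels here are illustrative stand-ins, not the actual ones from my code.</p>

```python
import re

RE_CLOCK = re.compile(r"^[0-9]{2}:[0-9]{2}$")            # Call Time or Call Duration
RE_SIZE = re.compile(r"^[0-9]+(\.[0-9]+)?\s?(MB|kB)$")   # e.g. "333 kB"

def classify(line, prev_token):
    """Classify one OCR line using its shape and the previous token's type."""
    if RE_SIZE.match(line):
        return "SIZE"
    if RE_CLOCK.match(line):
        # Call Durations are never preceded by Call Sizes, so a clock-shaped
        # token that follows a SIZE must be a Call Time.
        return "TIME" if prev_token == "SIZE" else "DURATION"
    return "UNKNOWN"
```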

<p>The most unsatisfying thing about the above approach is that I have made a lot of assumptions about the input, but these assumptions aren’t clearly stated in the code (The only thing worse than people not stating their assumptions is people not knowing what their assumptions are). Further, because the assumptions are baked into the structure of the code, if I want to add/remove/change one of them, I need to restructure the code entirely. Not fun.</p>

<p>Another unsatisfying aspect of this approach is its “All or nothing” nature. If the OCR process omitted the “:” and incorrectly read in the input as “1942”, this code would just throw. It would be nice if we could flag it as corrupted, but still match it with the duration token, based on the fact that it’s still a pretty close match to the regexp, and the fact that it comes after a Call Size token.</p>

<h3 id="sidebar-probabilistic-approach-to-data-extraction">Sidebar: Probabilistic Approach to Data Extraction</h3>

<p>In this section, I’ve tried to flesh out a probabilistic approach to the data extraction problem, that tries to resolve the issues described above. <strong>STATUTORY WARNING</strong>: This section is speculative and hand-wavy. Consume at your own risk.</p>

<p>In a broad sense, the idea is to represent the state of knowledge about each line of the input, and what kind of token it represents, using a probability distribution. Our knowledge about the format of the input text can be specified with some conditional probabilities, and then it’s just a question of applying Bayesian inference to get the resulting posterior distribution for each line. Once this is done, the maximum probability token for each line can be selected. It might clarify things to view the example above in the context of this approach.</p>

<p><strong>Example:</strong> Again, we have the line L = “19:42” and want to figure out what kind of token it is. For simplicity, let us assume we know it is either a Call Duration, a Call Time, or a Call Size, and the prior probability of each possibility is 0.33. We can represent this state of knowledge with the vector \([0.33, 0.33, 0.33]\)</p>

<p>In order to get the final distribution for each line, we must go through a list of tests. Each test comes with a list of conditional probabilities (The probability of a particular test outcome given that the line is of a particular type) - Given these, the posterior distribution can be inferred from the test outcome. The tests and the associated conditional probabilities encode our assumptions about how the system behaves.</p>

<p>One such test could be to compare L with the regular expression R = [0-9]{2}:[0-9]{2}. The associated conditional probabilities are:</p>

<p>\(P(L\) matches \(R\) | \(L =\) Duration\()\) = 0.99</p>

<p>\(P(L\) matches \(R\) | \(L =\) Time\()\) = 0.99</p>

<p>\(P(L\) matches \(R\) | \(L =\) Size\()\) = 0.01</p>

<p>We never set any of the probabilities to 1 or 0, because there is <a href="https://www.lesswrong.com/posts/QGkYCwyC7wTDyt3yT/0-and-1-are-not-probabilities">always a chance</a> that the OCR output was corrupted (Or that we are living in The Matrix).</p>

<p>On applying the test, our probability distribution gets updated by Bayes’ Theorem into \([0.4975, 0.4975, 0.005]\)</p>

<p>Another test could be based on the token type of the previous line. The conditional probabilities in this case are:</p>

<p>\(P(L_{i-1} =\) Size | \(L_i =\) Duration\()\) = 0.5</p>

<p>\(P(L_{i-1} =\) Size | \(L_i =\) Time\()\) = 0.01</p>

<p>\(P(L_{i-1} =\) Size | \(L_i =\) Size\()\) = 0.01</p>

<p>\(P(L_{i-1} =\) Duration | \(L_i =\) Duration\()\) = 0.3</p>

<p>\(P(L_{i-1} =\) Duration | \(L_i =\) Time\()\) = 0.01</p>

<p>\(P(L_{i-1} =\) Duration | \(L_i =\) Size\()\) = 0.6</p>

<p>Let’s say that the inferred posterior distribution for the previous line \(L_{i-1}\) is \([0.05, 0, 0.95]\).</p>

<p>Then we can further update the probability distribution for \(L_i\) as follows:</p>

<p>\(P(L_i) = P(L_i\) | \(L_{i-1} =\) Size\()\)*\(P(L_{i-1} =\) Size\()\) + \(P(L_i\) | \(L_{i-1} =\) Duration\()\)*\(P(L_{i-1} =\) Duration\()\)</p>

<p>where:</p>

<p>\(P(L_i\) | \(L_{i-1}\) = Size) \(\propto P(L_{i-1} =\) Size | \(L_i)\)*\(P(L_i)\)</p>

<p>\(P(L_i\) | \(L_{i-1}\) = Duration) \(\propto P(L_{i-1} =\) Duration | \(L_i)\)*\(P(L_i)\)</p>

<p>The final posterior probability is \([0.9786, 0.0202, 0.0011]\) or a 97.86% chance that the token is a Call Duration.</p>
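<p>The two updates can be checked numerically with a few lines of Python. A sketch under one stated assumption: reproducing the final posterior quoted above requires P(prev = Size | Time) = 0.01, which is the value used below.</p>

```python
def normalize(v):
    s = sum(v)
    return [x / s for x in v]

# State order: [Duration, Time, Size]; start from a uniform prior.
prior = [1 / 3, 1 / 3, 1 / 3]

# Test 1: the line matches the regexp [0-9]{2}:[0-9]{2}.
p_match = [0.99, 0.99, 0.01]
after_regex = normalize([l * p for l, p in zip(p_match, prior)])
# after_regex is roughly [0.4975, 0.4975, 0.005]

# Test 2: condition on the previous line's (inferred) token type.
# Likelihoods P(prev = y | L_i = x), with P(prev = Size | Time) = 0.01 here.
p_prev_size = [0.5, 0.01, 0.01]
p_prev_dur = [0.3, 0.01, 0.6]
prev_posterior = {"Size": 0.95, "Duration": 0.05}

post_if_size = normalize([l * p for l, p in zip(p_prev_size, after_regex)])
post_if_dur = normalize([l * p for l, p in zip(p_prev_dur, after_regex)])
final = [prev_posterior["Size"] * s + prev_posterior["Duration"] * d
         for s, d in zip(post_if_size, post_if_dur)]
# final[0] is the probability that the line is a Call Duration (~0.979)
```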

<h3 id="advantages-over-the-waterfall">Advantages over the waterfall</h3>
<p>Basically, the probabilistic approach lets us chuck everything we know about the system and all our observations about the current instance into a big pot, and stir it till inferences appear. It makes explicit our expectations and assumptions about how the system is supposed to behave. We can add or remove assumptions by simply adding or removing the corresponding tests. The strength or weakness of these assumptions can be tweaked by changing the actual values of the conditional probabilities. Finally, it gets around the brittleness of the Boolean logic approach - Even if the regular expression doesn’t match perfectly, we can do some kind of fuzzy matching and come up with a probability of match assuming some corruption. If one of the tests fails, the information provided by the other tests could still lead us to the right answer.</p>

<h1 id="why">Why?</h1>
<p>“But datadude, you infuriatingly indirect imbecile”, I hear you say, “Surely there’s a more direct way to fetch this data than to build this Rube Goldberg contraption involving mirroring, GUI automation and OCR?”</p>

<p>I’m the first to admit that there are many shortcomings with the approach I’ve taken. Probably the biggest one is that it isn’t portable - Most people can’t use this approach to get their WhatsApp call logs. Hell, even I probably won’t go through this process on a weekly basis to keep my call records updated! Secondly, there is a lot of ugly ad hocery involved in converting the OCR text into structured data. WhatsApp has this data in a neat table somewhere inside it, and if you’re looking for something clean and scalable, hunting down that table is the only way.</p>

<p>Why then did I do it this way? It just seemed more fun and horizon-broadening! I’ve thought about both GUI automation and structured data extraction in different contexts before, and this seemed like a good project to play around with these concepts.</p>

<h1 id="conclusion">Conclusion</h1>
<p>I don’t think data portability is anyone’s top concern right now, but I think that will soon change. At the moment, a lot of companies seem to be getting away with <a href="https://chrislukic.com/2021/06/16/techniques-to-prevent-adoption-of-your-api/">shoddy</a>, or <a href="https://www.alias.dev/report">non-existent</a> data portability solutions. When consumers start rightfully demanding their data, I don’t think these companies will meekly hand it over - We will probably need to resort to creative tactics to get it. Even if they do hand over the data, it will have to be converted from the structured format they use into the structured formats that each of us wants, and that isn’t an easy problem.</p>

<p>The battle to get my call logs out of WhatsApp was a microcosm of the upcoming Portability Wars and to be honest, I had a lot of fun fighting it! Still, I live in the hope that WhatsApp will see the error of their ways and just add the damn “Extract Call Logs” button, so we can all move on to bigger and better things.</p>

<h1 id="footnotes">Footnotes</h1>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Some day, I’ll figure out exactly why Erlang is so cool (Right after I finish reading <a href="https://en.wikipedia.org/wiki/The_Art_of_Computer_Programming">The Art of Computer Programming</a>, <a href="https://mitpress.mit.edu/sites/default/files/sicp/full-text/book/book.html">Structure and Interpretation of Computer Programs</a> and the Escher and Bach parts of <a href="https://en.wikipedia.org/wiki/G%C3%B6del,_Escher,_Bach">GEB</a>) <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

<hr />

<section class="comments" id="comment-section">
  <hr />
  
  <!-- Existing comments -->
  <div class="comments__existing">
    <h2>Comments</h2>
    
    
    <!-- List main comments in reverse date order, newest first. List replies in date order, oldest first. -->
    
    

<article id="comment-52ed12f0-f566-11eb-b046-cb952d21a941" class="js-comment comment" uid="52ed12f0-f566-11eb-b046-cb952d21a941">

  <div class="comment__author">Ben,
    <span class="comment__date"><a href="#comment-52ed12f0-f566-11eb-b046-cb952d21a941" title="Permalink to this comment">August  4th, 2021 20:55</a></span>
  </div>

  <div class="comment__body">
    <p>This was both super impressive and super entertaining! Did you do any validation of the records to see how accurate the OCR+parsing was? I’d be curious to know how well it worked :)</p>

  </div>


    <div class="comment__meta">
      <a rel="nofollow" class="comment__reply-link" onclick="return addComment.moveForm('comment-52ed12f0-f566-11eb-b046-cb952d21a941', 'respond', 'Data-Liberation-Front', '52ed12f0-f566-11eb-b046-cb952d21a941')">↪&#xFE0E; Reply to Ben</a>
    </div>
</article>
  

  <hr style="border-top: 1px solid #ccc; background: transparent; margin-bottom: 10px;" />

    
    

<article id="comment-4a2e1800-dc19-11eb-b636-e919e2498f77" class="js-comment comment" uid="4a2e1800-dc19-11eb-b636-e919e2498f77">

  <div class="comment__author">Saatvik Gulati,
    <span class="comment__date"><a href="#comment-4a2e1800-dc19-11eb-b636-e919e2498f77" title="Permalink to this comment">July  3rd, 2021 16:11</a></span>
  </div>

  <div class="comment__body">
    <p>Good problem solving skills never thought of mirroring phone screen and integrating it with OCR</p>

  </div>


    <div class="comment__meta">
      <a rel="nofollow" class="comment__reply-link" onclick="return addComment.moveForm('comment-4a2e1800-dc19-11eb-b636-e919e2498f77', 'respond', 'Data-Liberation-Front', '4a2e1800-dc19-11eb-b636-e919e2498f77')">↪&#xFE0E; Reply to Saatvik Gulati</a>
    </div>
</article>
  

  <hr style="border-top: 1px solid #ccc; background: transparent; margin-bottom: 10px;" />

    
    

<article id="comment-7c4424b0-d886-11eb-8e4f-a7e8cb5c6f40" class="js-comment comment" uid="7c4424b0-d886-11eb-8e4f-a7e8cb5c6f40">

  <div class="comment__author">BB,
    <span class="comment__date"><a href="#comment-7c4424b0-d886-11eb-8e4f-a7e8cb5c6f40" title="Permalink to this comment">June 29th, 2021 03:02</a></span>
  </div>

  <div class="comment__body">
    <p>Wow wow wow, hard to stop reading this one, once you have fallen into start reading trap.</p>

  </div>


    <div class="comment__meta">
      <a rel="nofollow" class="comment__reply-link" onclick="return addComment.moveForm('comment-7c4424b0-d886-11eb-8e4f-a7e8cb5c6f40', 'respond', 'Data-Liberation-Front', '7c4424b0-d886-11eb-8e4f-a7e8cb5c6f40')">↪&#xFE0E; Reply to BB</a>
    </div>
</article>
  

  <hr style="border-top: 1px solid #ccc; background: transparent; margin-bottom: 10px;" />

    
    

<article id="comment-e42750c0-d741-11eb-ba1e-61a0b9d9f058" class="js-comment comment" uid="e42750c0-d741-11eb-ba1e-61a0b9d9f058">

  <div class="comment__author">CasualCaveman,
    <span class="comment__date"><a href="#comment-e42750c0-d741-11eb-ba1e-61a0b9d9f058" title="Permalink to this comment">June 27th, 2021 12:19</a></span>
  </div>

  <div class="comment__body">
    <p>Well written, data dude. Look forward to the gossip in the next post</p>

  </div>


    <div class="comment__meta">
      <a rel="nofollow" class="comment__reply-link" onclick="return addComment.moveForm('comment-e42750c0-d741-11eb-ba1e-61a0b9d9f058', 'respond', 'Data-Liberation-Front', 'e42750c0-d741-11eb-ba1e-61a0b9d9f058')">↪&#xFE0E; Reply to CasualCaveman</a>
    </div>
</article>
  

  <hr style="border-top: 1px solid #ccc; background: transparent; margin-bottom: 10px;" />

    
    

<article id="comment-9309a9f0-d448-11eb-bfeb-75431f5a3772" class="js-comment comment" uid="9309a9f0-d448-11eb-bfeb-75431f5a3772">

  <div class="comment__author">Ankit,
    <span class="comment__date"><a href="#comment-9309a9f0-d448-11eb-bfeb-75431f5a3772" title="Permalink to this comment">June 23rd, 2021 17:29</a></span>
  </div>

  <div class="comment__body">
    <p>That was entertaining and enlightening - A much more interesting project would be pdf parsing btw.. especially to extract tabular data.</p>

  </div>


    <div class="comment__meta">
      <a rel="nofollow" class="comment__reply-link" onclick="return addComment.moveForm('comment-9309a9f0-d448-11eb-bfeb-75431f5a3772', 'respond', 'Data-Liberation-Front', '9309a9f0-d448-11eb-bfeb-75431f5a3772')">↪&#xFE0E; Reply to Ankit</a>
    </div>
</article>
  

<article id="comment-a46c67f0-d885-11eb-8e4f-a7e8cb5c6f40" class="js-comment comment child" uid="a46c67f0-d885-11eb-8e4f-a7e8cb5c6f40">

  <div class="comment__author">BB,
    <span class="comment__date"><a href="#comment-a46c67f0-d885-11eb-8e4f-a7e8cb5c6f40" title="Permalink to this comment">June 29th, 2021 02:56</a></span>
  </div>

  <div class="comment__body">
    <p>Pdf parsing is much easier and much less interesting</p>

  </div>


</article>

  

  <hr style="border-top: 1px solid #ccc; background: transparent; margin-bottom: 10px;" />

    
    

<article id="comment-19f44880-d43f-11eb-a809-6554e141e460" class="js-comment comment" uid="19f44880-d43f-11eb-a809-6554e141e460">

  <div class="comment__author">N,
    <span class="comment__date"><a href="#comment-19f44880-d43f-11eb-a809-6554e141e460" title="Permalink to this comment">June 23rd, 2021 16:21</a></span>
  </div>

  <div class="comment__body">
    <p>Really fun read, Rube Goldberg contraptions are fun, especially if there are no moving parts to fail. Perhaps a solution to regularly updating the call logs is to have an applet that triggers when you put your phone to charge at night, putting your phone into airplane mode, and then start logging.</p>

  </div>


    <div class="comment__meta">
      <a rel="nofollow" class="comment__reply-link" onclick="return addComment.moveForm('comment-19f44880-d43f-11eb-a809-6554e141e460', 'respond', 'Data-Liberation-Front', '19f44880-d43f-11eb-a809-6554e141e460')">↪&#xFE0E; Reply to N</a>
    </div>
</article>
  

  <hr style="border-top: 1px solid #ccc; background: transparent; margin-bottom: 10px;" />

    
    

<article id="comment-1d1dbeb0-d421-11eb-bd88-2b37f4a335a0" class="js-comment comment" uid="1d1dbeb0-d421-11eb-bd88-2b37f4a335a0">

  <div class="comment__author">Mohit,
    <span class="comment__date"><a href="#comment-1d1dbeb0-d421-11eb-bd88-2b37f4a335a0" title="Permalink to this comment">June 23rd, 2021 12:47</a></span>
  </div>

  <div class="comment__body">
    <p>Enlarging the font size was a good hack to get over the limitations of OCR</p>

  </div>


    <div class="comment__meta">
      <a rel="nofollow" class="comment__reply-link" onclick="return addComment.moveForm('comment-1d1dbeb0-d421-11eb-bd88-2b37f4a335a0', 'respond', 'Data-Liberation-Front', '1d1dbeb0-d421-11eb-bd88-2b37f4a335a0')">↪&#xFE0E; Reply to Mohit</a>
    </div>
</article>
  

  <hr style="border-top: 1px solid #ccc; background: transparent; margin-bottom: 10px;" />

    
    

<article id="comment-8e27e190-d407-11eb-9aa2-e743ba0b9b12" class="js-comment comment" uid="8e27e190-d407-11eb-9aa2-e743ba0b9b12">

  <div class="comment__author">Arjun,
    <span class="comment__date"><a href="#comment-8e27e190-d407-11eb-9aa2-e743ba0b9b12" title="Permalink to this comment">June 23rd, 2021 09:44</a></span>
  </div>

  <div class="comment__body">
    <p>Woaah…never thought there could be so much to call logs on WhatsApp. 
Good analysis and witty humour. Kudos to the effort!
All in all in a good read</p>

  </div>


    <div class="comment__meta">
      <a rel="nofollow" class="comment__reply-link" onclick="return addComment.moveForm('comment-8e27e190-d407-11eb-9aa2-e743ba0b9b12', 'respond', 'Data-Liberation-Front', '8e27e190-d407-11eb-9aa2-e743ba0b9b12')">↪&#xFE0E; Reply to Arjun</a>
    </div>
</article>
  

  <hr style="border-top: 1px solid #ccc; background: transparent; margin-bottom: 10px;" />

    
    

<article id="comment-ce0286b0-d3dc-11eb-be8b-ffac44996b66" class="js-comment comment" uid="ce0286b0-d3dc-11eb-be8b-ffac44996b66">

  <div class="comment__author">Sumedh,
    <span class="comment__date"><a href="#comment-ce0286b0-d3dc-11eb-be8b-ffac44996b66" title="Permalink to this comment">June 23rd, 2021 04:38</a></span>
  </div>

  <div class="comment__body">
    <p>Super fun read 😁 Had to read the sidebar twice 😂</p>

<p>Moar plz</p>

  </div>


    <div class="comment__meta">
      <a rel="nofollow" class="comment__reply-link" onclick="return addComment.moveForm('comment-ce0286b0-d3dc-11eb-be8b-ffac44996b66', 'respond', 'Data-Liberation-Front', 'ce0286b0-d3dc-11eb-be8b-ffac44996b66')">↪&#xFE0E; Reply to Sumedh</a>
    </div>
</article>
  

  <hr style="border-top: 1px solid #ccc; background: transparent; margin-bottom: 10px;" />

    
  </div>
  

  <!-- New comment form -->
  <div id="respond" class="comment__new">
    <form class="js-form form" method="post" action="https://staticman-lordvaiderio.herokuapp.com/v2/entry/lordvaider/lordvaider.github.io/master/comments">
  <input type="hidden" name="options[origin]" value="https://lordvaider.github.io/2021/06/21/Data-Liberation-Front.html" />
  <input type="hidden" name="options[parent]" value="https://lordvaider.github.io/2021/06/21/Data-Liberation-Front.html" />
  <input type="hidden" id="comment-replying-to-uid" name="fields[replying_to_uid]" value="" />
  <input type="hidden" name="options[slug]" value="Data-Liberation-Front" />
  <input type="hidden" name="options[reCaptcha][siteKey]" value="" />
  <input type="hidden" name="options[reCaptcha][secret]" value="" />

  <div class="textfield">
    <label for="comment-form-message"><h2>Add Comment<small><a rel="nofollow" id="cancel-comment-reply-link" href="https://lordvaider.github.io/2021/06/21/Data-Liberation-Front.html#respond" style="display:none;">(cancel reply)</a></small></h2>
      <textarea class="textfield__input" name="fields[message]" type="text" id="comment-form-message" placeholder="Your comment (markdown accepted)" required="" rows="6"></textarea>
    </label>
  </div>

    <div class="textfield narrowfield">
      <label for="comment-form-name">Name
        <input class="textfield__input" name="fields[name]" type="text" id="comment-form-name" placeholder="Your name (required)" required="" />
      </label>
    </div>

    <div class="textfield narrowfield">
      <label for="comment-form-email">E-mail
        <input class="textfield__input" name="fields[email]" type="email" id="comment-form-email" placeholder="Your email (optional)" />
      </label>
    </div>

    <div class="textfield narrowfield hp">
      <label for="hp">
        <input class="textfield__input" name="fields[hp]" id="hp" type="text" placeholder="Leave blank" />
      </label>
    </div>

    <div id="reCaptcha" class="g-recaptcha" data-sitekey=""></div>

    <button class="button" id="comment-form-submit">
      Submit
    </button>

</form>

<article class="modal mdl-card mdl-shadow--2dp">
  <div>
    <h3 class="modal-title js-modal-title"></h3>
  </div>
  <div class="mdl-card__supporting-text js-modal-text"></div>
  <div class="mdl-card__actions mdl-card--border">
    <button class="button mdl-button--colored mdl-js-button mdl-js-ripple-effect js-close-modal">Close</button>
  </div>
</article>

<svg version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" style="display:none">
  <symbol id="icon-loading" viewBox="149.8 37.8 499.818 525"><path d="M557.8 187.8c13.8 0 24.601-10.8 24.601-24.6S571.6 138.6 557.8 138.6s-24.6 10.8-24.6 24.6c0 13.2 10.8 24.6 24.6 24.6zm61.2 90.6c-16.8 0-30.6 13.8-30.6 30.6s13.8 30.6 30.6 30.6 30.6-13.8 30.6-30.6c.6-16.8-13.2-30.6-30.6-30.6zm-61.2 145.2c-20.399 0-36.6 16.2-36.6 36.601 0 20.399 16.2 36.6 36.6 36.6 20.4 0 36.601-16.2 36.601-36.6C595 439.8 578.2 423.6 557.8 423.6zM409 476.4c-24 0-43.2 19.199-43.2 43.199s19.2 43.2 43.2 43.2 43.2-19.2 43.2-43.2S433 476.4 409 476.4zM260.8 411c-27 0-49.2 22.2-49.2 49.2s22.2 49.2 49.2 49.2 49.2-22.2 49.2-49.2-22.2-49.2-49.2-49.2zm-10.2-102c0-27.6-22.8-50.4-50.4-50.4-27.6 0-50.4 22.8-50.4 50.4 0 27.6 22.8 50.4 50.4 50.4 27.6 0 50.4-22.2 50.4-50.4zm10.2-199.8c-30 0-54 24-54 54s24 54 54 54 54-24 54-54-24.6-54-54-54zM409 37.8c-35.4 0-63.6 28.8-63.6 63.6S374.2 165 409 165s63.6-28.8 63.6-63.6-28.2-63.6-63.6-63.6z" />
  </symbol>
</svg>


  </div>
</section>

<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.4/jquery.min.js"></script>

<script src="/assets/main.js"></script>

<script src="https://www.google.com/recaptcha/api.js"></script>]]></content><author><name></name></author><summary type="html"><![CDATA[The code for this project can be found here.]]></summary></entry><entry><title type="html">Deliveroo Data Analysis III</title><link href="https://lordvaider.github.io/2021/05/19/Deliveroo-Analysis-III.html" rel="alternate" type="text/html" title="Deliveroo Data Analysis III" /><published>2021-05-19T00:00:00+00:00</published><updated>2021-05-19T00:00:00+00:00</updated><id>https://lordvaider.github.io/2021/05/19/Deliveroo-Analysis-III</id><content type="html" xml:base="https://lordvaider.github.io/2021/05/19/Deliveroo-Analysis-III.html"><![CDATA[<p><em>This is Part III of a 3 part series. Click here for <a href="https://lordvaider.github.io/2021/05/17/Deliveroo-Analysis.html">Part I</a> and <a href="https://lordvaider.github.io/2021/05/18/Deliveroo-Analysis-II.html">Part II</a></em></p>

<p>In this section, I want to break down my orders from some of my favourite restaurants and see which items I ordered most, and provide some commentary. I wrap up with some directions for future work.</p>

<p><strong>Table of Contents:</strong></p>

<ul id="markdown-toc">
  <li><a href="#axes-of-interest" id="markdown-toc-axes-of-interest">Axes of Interest</a></li>
  <li><a href="#sidebar-clustering" id="markdown-toc-sidebar-clustering">Sidebar: Clustering</a>    <ul>
      <li><a href="#item-categorization---k-means-clustering" id="markdown-toc-item-categorization---k-means-clustering">Item Categorization - K-Means Clustering</a></li>
      <li><a href="#item-name-changes---prefix-clustering" id="markdown-toc-item-name-changes---prefix-clustering">Item Name Changes - Prefix Clustering</a></li>
    </ul>
  </li>
  <li><a href="#shake-shack" id="markdown-toc-shake-shack">Shake Shack</a>    <ul>
      <li><a href="#summary-stats" id="markdown-toc-summary-stats">Summary Stats</a></li>
      <li><a href="#summary-graphs" id="markdown-toc-summary-graphs">Summary Graphs</a></li>
      <li><a href="#item-wise-analysis" id="markdown-toc-item-wise-analysis">Item-wise Analysis</a></li>
    </ul>
  </li>
  <li><a href="#byron" id="markdown-toc-byron">Byron</a>    <ul>
      <li><a href="#summary-stats-1" id="markdown-toc-summary-stats-1">Summary Stats</a></li>
      <li><a href="#summary-graphs-1" id="markdown-toc-summary-graphs-1">Summary Graphs</a></li>
      <li><a href="#item-wise-analysis-1" id="markdown-toc-item-wise-analysis-1">Item-wise Analysis</a></li>
    </ul>
  </li>
  <li><a href="#ping-pong" id="markdown-toc-ping-pong">Ping Pong</a>    <ul>
      <li><a href="#summary-stats-2" id="markdown-toc-summary-stats-2">Summary Stats</a></li>
      <li><a href="#summary-graphs-2" id="markdown-toc-summary-graphs-2">Summary Graphs</a></li>
      <li><a href="#item-wise-analysis-2" id="markdown-toc-item-wise-analysis-2">Item-wise Analysis</a></li>
    </ul>
  </li>
  <li><a href="#rusty-bike" id="markdown-toc-rusty-bike">Rusty Bike</a>    <ul>
      <li><a href="#summary-stats-3" id="markdown-toc-summary-stats-3">Summary Stats</a></li>
      <li><a href="#summary-graphs-3" id="markdown-toc-summary-graphs-3">Summary Graphs</a></li>
      <li><a href="#item-wise-analysis-3" id="markdown-toc-item-wise-analysis-3">Item-wise Analysis</a></li>
    </ul>
  </li>
  <li><a href="#the-pizza-room" id="markdown-toc-the-pizza-room">The Pizza Room</a>    <ul>
      <li><a href="#summary-stats-4" id="markdown-toc-summary-stats-4">Summary Stats</a></li>
      <li><a href="#summary-graphs-4" id="markdown-toc-summary-graphs-4">Summary Graphs</a></li>
      <li><a href="#item-wise-analysis-4" id="markdown-toc-item-wise-analysis-4">Item-wise Analysis</a></li>
    </ul>
  </li>
  <li><a href="#motu-indian-kitchen" id="markdown-toc-motu-indian-kitchen">Motu Indian Kitchen</a>    <ul>
      <li><a href="#summary-stats-5" id="markdown-toc-summary-stats-5">Summary Stats</a></li>
      <li><a href="#summary-graphs-5" id="markdown-toc-summary-graphs-5">Summary Graphs</a></li>
      <li><a href="#item-wise-analysis-5" id="markdown-toc-item-wise-analysis-5">Item-wise Analysis</a></li>
    </ul>
  </li>
  <li><a href="#dishoom" id="markdown-toc-dishoom">Dishoom</a>    <ul>
      <li><a href="#summary-stats-6" id="markdown-toc-summary-stats-6">Summary Stats</a></li>
      <li><a href="#summary-graphs-6" id="markdown-toc-summary-graphs-6">Summary Graphs</a></li>
      <li><a href="#item-wise-analysis-6" id="markdown-toc-item-wise-analysis-6">Item-wise Analysis</a></li>
    </ul>
  </li>
  <li><a href="#next-steps" id="markdown-toc-next-steps">Next Steps</a></li>
  <li><a href="#footnotes" id="markdown-toc-footnotes">Footnotes</a></li>
</ul>

<h1 id="axes-of-interest">Axes of Interest</h1>

<p>What kind of information are we interested in when looking at a particular restaurant?</p>

<ol>
  <li>
    <p><strong>Summary stats:</strong> Total number of orders? Total spend on the restaurant? How expensive is the restaurant? How often do I order from the restaurant?</p>
  </li>
  <li><strong>Dynamic Breakdown:</strong> I’d like to see a breakdown of the summary stats over time, since averages don’t always tell the whole story. There are two graphs here:
    <ul>
      <li><strong>Histogram of order values:</strong> This gives a fair idea of what ordering behaviour was for a particular restaurant - Did I always have the same standard order? Did I order from this restaurant when entertaining guests?</li>
      <li><strong>Chart of order frequency:</strong> Each time I ordered from that restaurant, I plotted its order share in the last 10 orders. This was benchmarked against 1/(Number of distinct restaurants in the last 10 orders) - The logic being that if all restaurants were equally popular, the order share would equal the benchmark value.</li>
    </ul>
  </li>
  <li>What are the most popular items for this particular restaurant? I’ve also included some personal commentary in this section.</li>
</ol>
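<p>The order-frequency chart is simple to reconstruct. Below is a hedged sketch (not the actual notebook code), assuming the order history is just a chronological list of restaurant names:</p>

```python
def order_share(history, restaurant, window=10):
    """For each order placed at `restaurant`, compute its share of the last
    `window` orders, along with the equal-popularity benchmark
    1 / (number of distinct restaurants in that window)."""
    points = []
    for i, r in enumerate(history):
        if r != restaurant:
            continue
        # The window of orders ending at (and including) this one
        recent = history[max(0, i - window + 1):i + 1]
        share = recent.count(restaurant) / len(recent)
        benchmark = 1 / len(set(recent))
        points.append((share, benchmark))
    return points

# Toy history: "A" is 3 of the first 4 orders, then drops off
print(order_share(["A", "B", "A", "A", "C"], "A"))
```

<p>Plotting the two numbers of each tuple against the order dates gives the frequency chart described above.</p>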

<h1 id="sidebar-clustering">Sidebar: Clustering</h1>

<p>Ever since I decided to try to become a datadude, I’ve had an irresistible urge to cluster things together, and this seemed like the perfect opportunity. In this sidebar, I describe two different problems that came up in the analysis, and how I approached them with clustering.</p>

<h2 id="item-categorization---k-means-clustering">Item Categorization - K-Means Clustering</h2>

<p>As discussed before, it’s handy to have some way to categorize restaurant items into Mains/Sides/Drinks. It’s a useful axis to view the data along (Most popular Main?), and can also feed into more complicated analyses (How wasteful are you being when you order drinks from the restaurant instead of buying them from the supermarket below your house?)</p>

<p>Probably the most accurate way to do item categorization is to somehow scrape the Deliveroo menu for each restaurant, and look up the category each item falls into. I do not have the coding skills required to do this, and even if I did, it probably wouldn’t be that straightforward (Item names change over time, as we will discuss below).</p>

<p>If I can’t get the data from an external source, I have to look within. The only information I have per item is its name and its price <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. I don’t know how to extract category data from the item name, so I decided to go with a simple binary classification of items per restaurant into Mains or Sides, based on their price. What this boils down to is applying K-Means clustering to a 1-dimensional dataset (The list of prices, in this case). I don’t know if K-Means is ideal for such situations, or if there are simpler and more direct methods one can leverage, but it was available in the scikit-learn library and seemed to give good enough results.</p>

<p>For certain restaurants this kind of binary categorization doesn’t make sense - For example, Ping Pong has a system of small plates, so all their items are Mains (or Sides). In such a case, forcing a divide into 2 categories creates an artificial distinction. There are ways to determine the correct number of clusters for a dataset, but I don’t have a lot of expertise in these matters and hence decided to just take this as an input from the user.</p>
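<p>For the curious, here is roughly what the price-based split boils down to. The analysis used scikit-learn’s KMeans; this dependency-free sketch runs the same Lloyd’s iterations, specialised to one dimension, on a made-up price list:</p>

```python
def kmeans_1d(prices, k=2, iters=100):
    """Plain Lloyd's algorithm on a list of prices (1-D K-Means).
    scikit-learn's KMeans does the same job, with smarter initialisation."""
    # For k=2 in 1-D, seeding at the extremes works fine
    centroids = [min(prices), max(prices)] if k == 2 else sorted(prices)[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in prices:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # converged
            break
        centroids = new
    return centroids

# Label each price by the nearer of the two centroids
prices = [1.00, 3.00, 4.00, 5.95, 7.50, 8.50, 9.50]
lo, hi = sorted(kmeans_1d(prices))
labels = ["Side" if abs(p - lo) <= abs(p - hi) else "Main" for p in prices]
print(list(zip(prices, labels)))
```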

<h2 id="item-name-changes---prefix-clustering">Item Name Changes - Prefix Clustering</h2>

<p>When analysing the distribution of items per restaurant, I found that item names sometimes change over time. Example: Shake Shack will sometimes refer to its “Chipotle Cheddar Chick’n Burger” as just “Chipotle Cheddar Chick’n”, and the name of Ping Pong’s “Potato and Edamame Cake <strong>(V)</strong> (2pcs)” changed one day to “Potato and Edamame Cake <strong>(v)</strong> (2pcs)”.</p>

<p>Name changes like these can lead to some nasty surprises; Can you imagine my horror when I saw “Garden Fresh Golden Dumplings” topping the Ping Pong list instead of my beloved cakes of potato and edamame?!</p>

<p>This is similar to the issue I faced earlier in the analysis with restaurant names, but solving this with manual relabelling is far more tedious, since the universe of items is much larger.</p>

<p>My first thought was to define the distance between two items as the <a href="https://en.wikipedia.org/wiki/Edit_distance">edit distance</a>, connect all pairs (i, j) in the resulting graph with distance(i, j) &lt;= d (For some suitable d), and then define each connected component as a separate cluster. You could then pick some appropriate point inside the cluster as the representative (The centroid, maybe?) However, it didn’t feel like this problem was worth that much effort.</p>

<p>The next version of this idea was to draw a directed edge from i -&gt; j if lowercase(i) is a prefix of lowercase(j). If lowercase(i) == lowercase(j), we use a lexicographic ordering to break the tie and avoid cycles. This sets up a directed acyclic graph over the items, and we can again define connected components as clusters. The longest common prefix of all the strings in the cluster pretty much presents itself as the natural candidate for cluster representative.</p>

<p>This kind of clustering worked quite well for my dataset and didn’t cluster any distinct items together. However, as a future improvement, it might be a good idea to also consider the average item price as a co-ordinate when evaluating the distance - If the item is the same, the price will be similar as well.</p>
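<p>The prefix-clustering idea is easy to sketch. The version below uses union-find in place of an explicit DAG (the connected components come out the same), with a few item names from the post as illustrative test data:</p>

```python
def prefix_clusters(names):
    """Cluster names where one lowercased name is a prefix of another.
    The representative is the shortest name in each cluster (i.e. the
    common prefix); equal-length ties break lexicographically."""
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a in names:
        for b in names:
            if a != b and b.lower().startswith(a.lower()):
                parent[find(a)] = find(b)  # merge the two clusters

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return {min(g, key=lambda n: (len(n), n)): g for g in groups.values()}

items = [
    "Chipotle Cheddar Chick'n",
    "Chipotle Cheddar Chick'n Burger",
    "Potato and Edamame Cake (V) (2pcs)",
    "Potato and Edamame Cake (v) (2pcs)",
    "Golden Dumpling (v) (gf) (3pcs)",
]
print(prefix_clusters(items))
```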

<h1 id="shake-shack">Shake Shack</h1>

<p>Shake Shack is my favourite restaurant, and with 47 orders, the data supports this. I first discovered it in the US and fell in love with their cheesy fries (As we will see, the data supports this as well).</p>

<h2 id="summary-stats">Summary Stats</h2>

<p><strong>Consumption:</strong></p>

<p>Total Number of Orders:  47</p>

<p>Total Value of Orders:  £707.4</p>

<p><strong>Cost:</strong></p>

<p>Average Order Value:  £15.05</p>

<p>Median Order Value:  £17.45</p>

<p><strong>Order</strong> <strong>History:</strong></p>

<p>First Order: 2018-11-22</p>

<p>Last Order:  2020-12-21</p>

<p>Order Frequency:  1.86  per month</p>

<h2 id="summary-graphs">Summary Graphs</h2>
<p><img src="/images/2020-05-17/output_68_1.png" alt="png" /></p>

<h2 id="item-wise-analysis">Item-wise Analysis</h2>

<p>Next up I looked at the distribution of the items.</p>

<div width="50%">
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe" style="width: 50%;">
  <thead>
    <tr style="text-align: left;">
      <th>Item</th>
      <th>Qty</th>
      <th>Value</th>
      <th>Avg Price</th>
      <th>Item Type</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Crinkle Cut Fries with a pot of Cheese Sauce (V)</th>
      <td>45</td>
      <td>180.00</td>
      <td>4.00</td>
      <td>Side</td>
    </tr>
    <tr>
      <th>Shack Cheese Sauce</th>
      <td>21</td>
      <td>21.00</td>
      <td>1.00</td>
      <td>Side</td>
    </tr>
    <tr>
      <th>Chocolate</th>
      <td>21</td>
      <td>124.95</td>
      <td>5.95</td>
      <td>Main</td>
    </tr>
    <tr>
      <th>Chick'n Shack</th>
      <td>16</td>
      <td>120.00</td>
      <td>7.50</td>
      <td>Main</td>
    </tr>
    <tr>
      <th>Chipotle Cheddar Chick'n</th>
      <td>14</td>
      <td>119.00</td>
      <td>8.50</td>
      <td>Main</td>
    </tr>
    <tr>
      <th>Black Truffle Chick'n</th>
      <td>7</td>
      <td>66.50</td>
      <td>9.50</td>
      <td>Main</td>
    </tr>
    <tr>
      <th>Fries (V) (VG)</th>
      <td>5</td>
      <td>15.00</td>
      <td>3.00</td>
      <td>Side</td>
    </tr>
  </tbody>
</table>
</div>

<p>As is evident, I am a huge fan of the cheesy fries, and an even bigger fan of the cheese sauce that they serve along with it - Counting the default serving you get with the fries along with the extra servings, that’s 76 pots of cheese sauce! This makes sense, because having to ration cheese sauce (Or worse, share it) is an abominable thought. If there was ever any doubt, I’d order an extra serving. At a price of £1, it was a steal - Though the caloric cost (240 kcal) is much higher.</p>

<p>The K-Means clusterer has categorized the Chocolate Shake as a Main based on the £5.95 price point, but given that this is <strong>Shake</strong> Shack we’re talking about, it’s probably OK.</p>

<p>My relationship with their burgers has evolved over time; At first I could only eat the Shroom burger, which kinda sucked. I was hugely excited about the launch of their fried chicken sandwich, which did not disappoint (At first). Over time though, it started tasting super dry and chewy, and whenever they came out with a special edition chicken burger (Like the Chipotle Cheddar Chick’n, or the Black Truffle Chick’n) I’d immediately switch loyalties. I’m still not sure why the special edition burgers tasted so much better; My theory about the Black Truffle Chick’n is that it was made of thigh meat, which meant a juicier and more tender patty. While the data doesn’t reflect this, recently life came full circle for me when I switched back to the Shroom burger.</p>

<h1 id="byron">Byron</h1>

<p>Byron clocks in at number 2 in terms of all-time favourites, though I’m not sure when I’ll eat there next. The pandemic seems to have hit this chain particularly hard. Most outlets haven’t re-opened, and of the ones that have, none deliver to my house. They’ve also booted my favourite items from their new menu, so I don’t know if I’ll go back at all.</p>

<p>For old time’s sake then, here are the Byron stats:</p>

<h2 id="summary-stats-1">Summary Stats</h2>

<p><strong>Consumption:</strong></p>

<p>Total Number of Orders:  31</p>

<p>Total Value of Orders:  £679.3</p>

<p><strong>Cost:</strong></p>

<p>Average Order Value:  £21.91</p>

<p>Median Order Value:  £20.65</p>

<p><strong>Order History:</strong></p>

<p>First Order: 2018-05-16</p>

<p>Last Order:  2019-11-27</p>

<p>Order Frequency:  1.66  per month</p>

<h2 id="summary-graphs-1">Summary Graphs</h2>
<p><img src="/images/2020-05-17/output_74_1.png" alt="png" /></p>

<h2 id="item-wise-analysis-1">Item-wise Analysis</h2>

<div width="50%">
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe" style="width: 50%;">
  <thead>
    <tr style="text-align: right;">
      <th>Item</th>
      <th>Qty</th>
      <th>Value</th>
      <th>Avg Price</th>
      <th>Item Type</th>
    </tr>  
  </thead>
  <tbody>
    <tr>
      <th>Blue Cheese Sauce</th>
      <td>26</td>
      <td>36.9</td>
      <td>1.42</td>
      <td>Side</td>
    </tr>
    <tr>
      <th>Onion rings (V)</th>
      <td>19</td>
      <td>76.0</td>
      <td>4.00</td>
      <td>Side</td>
    </tr>
    <tr>
      <th>Chocolate</th>
      <td>16</td>
      <td>80.0</td>
      <td>5.00</td>
      <td>Side</td>
    </tr>
    <tr>
      <th>V-Rex + Fries</th>
      <td>14</td>
      <td>190.0</td>
      <td>13.57</td>
      <td>Main</td>
    </tr>
    <tr>
      <th>Classic Chicken</th>
      <td>12</td>
      <td>143.0</td>
      <td>11.92</td>
      <td>Main</td>
    </tr>
    <tr>
      <th>Byron Lager (500ml)</th>
      <td>8</td>
      <td>47.6</td>
      <td>5.95</td>
      <td>Side</td>
    </tr>
    <tr>
      <th>Sriracha Mayonnaise (V)</th>
      <td>4</td>
      <td>5.8</td>
      <td>1.45</td>
      <td>Side</td>
    </tr>
  </tbody>
</table>
</div>

<p>The item distribution is almost exactly the same as Shake Shack (Fries don’t feature on the list, as Byron would provide them with the burger).</p>

<p>Cheese Sauce at the top of the pile again! Dunking their large, messy beer-battered onion rings into the Blue cheese sauce, adding a touch of mustard and then devouring the result was a ritual I’d perform every single time. The Byron Chocolate Shake was also amazing (Much better than Shake Shack).</p>

<p>On the burgers, I actually really liked their classic grilled chicken burger, both in terms of the flavour and the health angle - It was the leanest burger among all the burgers in this analysis. The V-Rex was a special edition vegetarian burger that is probably the closest I’ve seen UK veg burgers come to the veg burgers we have in India (It had a crunchy deep-fried patty, spicy mayo and a slice of onion in it).</p>

<p>Speaking of the V-Rex reminded me of one of my biggest pet peeves - Why isn’t anyone making veg burgers out of potatoes? If there are any English restaurant owners reading this, ditch the halloumi/beans/jackfruit/Beyond Meat patties and use potatoes instead!! I don’t know why no one has come up with this yet, but a smattering of vegetables in a matrix of mashed potatoes, breaded and deep-fried, results in the tastiest vegetarian burgers. Put this on your menu and you will win a lot of business and goodwill from the large (And rapidly growing) Indian immigrant community.</p>

<p><img src="/images/2020-05-17/tastyburger.gif" alt="gif" /></p>

<p style="text-align: center;">
<i>Also, people whose girlfriends are vegetarians, which pretty much makes them vegetarians.</i>
</p>

<h1 id="ping-pong">Ping Pong</h1>

<p>Ping Pong was the first restaurant I ever ate at in London. I’ve always enjoyed momos, springrolls and assorted dimsums, and Ping Pong elevated that whole experience into something very special. Unfortunately I only lived within the Ping Pong delivery radius for a small window in space-time, otherwise it would be a serious contender for the top spot.</p>

<h2 id="summary-stats-2">Summary Stats</h2>

<p><strong>Consumption:</strong></p>

<p>Total Number of Orders:  12</p>

<p>Total Value of Orders:  £238.25</p>

<p><strong>Cost:</strong></p>

<p>Average Order Value:  £19.85</p>

<p>Median Order Value:  £19.95</p>

<p><strong>Order History:</strong></p>

<p>First Order: 2019-01-26</p>

<p>Last Order:  2019-05-09</p>

<p>Order Frequency:  3.5  per month</p>

<h2 id="summary-graphs-2">Summary Graphs</h2>
<p><img src="/images/2020-05-17/output_79_1.png" alt="png" /></p>

<h2 id="item-wise-analysis-2">Item-wise Analysis</h2>

<div width="50%">
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe" style="width: 50%;">
  <thead>
    <tr style="text-align: right;">
      <th>Item</th>
      <th>Qty</th>
      <th>Value</th>
      <th>Avg Price</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Potato and Edamame Cake (v) (2pcs)</th>
      <td>12</td>
      <td>41.40</td>
      <td>3.45</td>
    </tr>
    <tr>
      <th>Mixed Vegetable Spring Roll (v) (3pcs)</th>
      <td>8</td>
      <td>30.00</td>
      <td>3.75</td>
    </tr>
    <tr>
      <th>Spicy Vegetable Dumpling (v) (gf) (3pcs)</th>
      <td>8</td>
      <td>30.80</td>
      <td>3.85</td>
    </tr>
    <tr>
      <th>Golden Dumpling (v) (gf) (3pcs)</th>
      <td>7</td>
      <td>26.25</td>
      <td>3.75</td>
    </tr>
    <tr>
      <th>Vegetable Sticky Rice (v)</th>
      <td>6</td>
      <td>30.90</td>
      <td>5.15</td>
    </tr>
    <tr>
      <th>Chinese Vegetable Spring Roll (V)(VG)</th>
      <td>4</td>
      <td>15.00</td>
      <td>3.75</td>
    </tr>
    <tr>
      <th>Spinach and Mushroom Dumpling (3pcs) (VG) (GF)</th>
      <td>4</td>
      <td>15.40</td>
      <td>3.85</td>
    </tr>
    <tr>
      <th>Spicy Chinese Vegetable Dumpling (V) (VG) (GF) (3pcs)</th>
      <td>3</td>
      <td>11.55</td>
      <td>3.85</td>
    </tr>
    <tr>
      <th>Asahi</th>
      <td>2</td>
      <td>9.30</td>
      <td>4.65</td>
    </tr>
  </tbody>
</table>
</div>

<p>The Ping Pong item list shows some of the limitations of the prefix clustering approach used to cluster item names. The “Mixed Vegetable Spring Roll” and “Chinese Vegetable Spring Roll” are the same item, as are the “Spicy Vegetable Dumplings” and the “Spicy Chinese Vegetable Dumplings”. After making these corrections, it looks like I ordered the potato cakes and spring rolls 12 times in 12 orders (And spicy veg dumplings 11 times), so pretty much every time I ordered. The uniform order hypothesis is also supported by the order value histogram.</p>
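<p>One cheap way to patch this up would be to normalise item names before clustering: strip the dietary/portion tags and drop the qualifier words that the menu adds or drops over time. A minimal sketch - note that the FILLER list is a hand-curated guess for these particular items, not something derived from the data:</p>

```python
import re

# Qualifier words the menu adds or drops between revisions;
# this list is a guess I'd have to curate by hand.
FILLER = {"mixed", "chinese"}

def normalise(name):
    """Collapse menu-item variants onto one canonical key."""
    # Strip dietary/portion tags like (v), (VG), (gf), (3pcs)
    name = re.sub(r"\([^)]*\)", "", name.lower())
    return " ".join(t for t in name.split() if t not in FILLER)

normalise("Mixed Vegetable Spring Roll (v) (3pcs)")   # "vegetable spring roll"
normalise("Chinese Vegetable Spring Roll (V)(VG)")    # "vegetable spring roll"
```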

<p>Not much to say about these items; Just extremely tasty, high-quality dimsums that always left me happy and satisfied. How I wish they’d open up a few <a href="https://www.bbc.co.uk/news/business-47978759">dark kitchens</a>!</p>

<p>The only thing that irritates me about this table is that I paid £9.30 for two bottles of Asahi, when I could’ve bought 4 bottles from Tesco for £5.50. Paying for drink markups is understandable when you’re dining at the restaurant, but it’s foolish when you get food delivered at home!</p>

<h1 id="rusty-bike">Rusty Bike</h1>

<p>Rusty Bike was my go-to place when I first moved to London. Unlike most delivery options, the food was not extravagant or unhealthy, and it was reasonably priced and quite tasty.</p>

<h2 id="summary-stats-3">Summary Stats</h2>

<p><strong>Consumption:</strong></p>

<p>Total Number of Orders:  21</p>

<p>Total Value of Orders:  £346.10</p>

<p><strong>Cost:</strong></p>

<p>Average Order Value:  £16.48</p>

<p>Median Order Value:  £14.15</p>

<p><strong>Order History:</strong></p>

<p>First Order: 2018-07-27</p>

<p>Last Order:  2020-09-23</p>

<p>Order Frequency:  0.8  per month</p>

<h2 id="summary-graphs-3">Summary Graphs</h2>
<p><img src="/images/2020-05-17/output_84_1.png" alt="png" /></p>

<h2 id="item-wise-analysis-3">Item-wise Analysis</h2>

<div width="50%">
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe" style="width: 50%;">
  <thead>
    <tr style="text-align: right;">
      <th>Item</th>
      <th>Qty</th>
      <th>Value</th>
      <th>Avg Price</th>
      <th>Item Type</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Green Curry</th>
      <td>21</td>
      <td>204.85</td>
      <td>9.75</td>
      <td>Main</td>
    </tr>
    <tr>
      <th>Steamed Rice</th>
      <td>15</td>
      <td>42.85</td>
      <td>2.86</td>
      <td>Side</td>
    </tr>
    <tr>
      <th>Vegetable Spring Rolls</th>
      <td>12</td>
      <td>54.20</td>
      <td>4.52</td>
      <td>Side</td>
    </tr>
  </tbody>
</table>
</div>

<p>As is evident from the price histogram, Rusty Bike orders followed a standard template - Green Curry, Steamed Rice, and on occasion, spring rolls. It is a bit odd that the numbers don’t line up better - I’d expected the Green Curry and Rice orders to be almost equal. On reviewing the raw orders data, I found that initially (For the first six orders), Rusty Bike used to include a default steamed rice with the green curry. They later decoupled them into separate items. One result of this is that the Average Price of the green curry in the table above is inflated - The actual price of the green curry is about £7.50.</p>

<h1 id="the-pizza-room">The Pizza Room</h1>

<p>I have eaten a lot of pizza in my life and The Pizza Room is something special.</p>

<h2 id="summary-stats-4">Summary Stats</h2>

<p><strong>Consumption:</strong></p>

<p>Total Number of Orders:  18</p>

<p>Total Value of Orders:  £419.54</p>

<p><strong>Cost:</strong></p>

<p>Average Order Value:  £23.31</p>

<p>Median Order Value:  £20.94</p>

<p><strong>Order</strong> <strong>History:</strong></p>

<p>First Order: 2018-08-10</p>

<p>Last Order:  2020-11-25</p>

<p>Order Frequency:  0.64  per month</p>

<h2 id="summary-graphs-4">Summary Graphs</h2>

<p><img src="/images/2020-05-17/output_89_1.png" alt="png" /></p>

<h2 id="item-wise-analysis-4">Item-wise Analysis</h2>

<div width="50%">
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe" style="width: 50%;">
  <thead>
    <tr style="text-align: right;">
      <th>Item</th>
      <th>Qty</th>
      <th>Value</th>
      <th>Avg Price</th>
      <th>Item Type</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Quattro Formaggi</th>
      <td>20</td>
      <td>299.20</td>
      <td>14.96</td>
      <td>Main</td>
    </tr>
    <tr>
      <th>Brownie with Ice Cream</th>
      <td>9</td>
      <td>56.25</td>
      <td>6.25</td>
      <td>Side</td>
    </tr>
    <tr>
      <th>Coke 330ml</th>
      <td>3</td>
      <td>7.80</td>
      <td>2.60</td>
      <td>Side</td>
    </tr>
    <tr>
      <th>Margherita</th>
      <td>2</td>
      <td>24.30</td>
      <td>12.15</td>
      <td>Main</td>
    </tr>
  </tbody>
</table>
</div>

<p>Wherever I go, the Quattro Formaggi or 4 Cheese pizza has been my go-to pizza order for a while. What makes the Pizza Room Quattro Formaggi special is:</p>

<ol>
  <li>
    <p><strong>Tomato</strong> <strong>Sauce:</strong> Nowhere else have I seen Pizza Room levels of clarity on this topic. They are upfront about the fact that the default option is no tomato sauce (Traditionally, the QF is a ‘White Pizza’). However, if you’d like tomato sauce, they will add it on for a fee. I really like the fact that they charge me a nominal amount for the sauce and hence eliminate the uncertainty - At other pizzerias, I add an awkward delivery note saying “If you don’t typically add tomato sauce to the Quattro Formaggi pizza, please can you do so in this case?” and then pace nervously till the pizza gets delivered.</p>
  </li>
  <li>
    <p><strong>Add-Ons:</strong> Definitely add green chillies to your QF pizza for a zingy complement to all the cheese.</p>
  </li>
  <li>
    <p><strong>Generous deposits of Gorgonzola:</strong> I’ve no idea why, but a lot of restaurants scrimp on the blue cheese. The Pizza Room isn’t one of them.</p>
  </li>
</ol>

<p>Again, we see the idiocy of paying &gt;3x for a can of Coke that could’ve been fetched from my refrigerator.</p>

<h1 id="motu-indian-kitchen">Motu Indian Kitchen</h1>

<p>Motu translates to “Fatty” or “Fatboy”, and they have shipped me a lot of calories.</p>

<h2 id="summary-stats-5">Summary Stats</h2>

<p><strong>Consumption:</strong></p>

<p>Total Number of Orders:  32</p>

<p>Total Value of Orders:  £565.75</p>

<p><strong>Cost:</strong></p>

<p>Average Order Value:  £17.68</p>

<p>Median Order Value:  £17.50</p>

<p><strong>Order</strong> <strong>History:</strong></p>

<p>First Order: 2018-12-08</p>

<p>Last Order:  2020-01-16</p>

<p>Order Frequency:  2.38  per month</p>

<h2 id="summary-graphs-5">Summary Graphs</h2>
<p><img src="/images/2020-05-17/output_94_1.png" alt="png" /></p>

<h2 id="item-wise-analysis-5">Item-wise Analysis</h2>

<div width="50%">
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe" style="width: 50%;">
  <thead>
    <tr style="text-align: right;">
      <th>Item</th>
      <th>Qty</th>
      <th>Value</th>
      <th>Avg Price</th>
      <th>Item Type</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Box for 1</th>
      <td>28</td>
      <td>506.0</td>
      <td>18.07</td>
      <td>Main</td>
    </tr>
    <tr>
      <th>Tadka Dal (VG)</th>
      <td>9</td>
      <td>31.5</td>
      <td>3.50</td>
      <td>Side</td>
    </tr>
    <tr>
      <th>Pilau Rice (V)</th>
      <td>7</td>
      <td>17.5</td>
      <td>2.50</td>
      <td>Side</td>
    </tr>
  </tbody>
</table>
</div>

<p>On paper, the Box for 1 had everything I could want from an Indian meal: Paneer, Dal Tadka, garlic naan and papads. The quality was not amazing, but it wasn’t too bad, and beggars can’t be choosers.</p>

<p>Eagle-eyed readers will notice that this table is not consistent with the order value histogram above - The Box for 1 costs £18.07, but there are no orders with that price. Maybe the price increased over time? But in that case you’d probably have a bimodal distribution with two peaks. What gives? In order to crack this one, I had to go back all the way to the email receipts.</p>

<p>Turns out Motu will let you add extra food to your Box for 1 order, and the Deliveroo email receipt lists it all under the Box for 1. I didn’t account for this case while writing my email parser (And I’m not sure how I should, since the unit prices of the individual items aren’t given). Hence the £18.07 is an average of the vanilla, no-frills-attached Boxes for 1 and the fancier, with-extra-fixin’s Boxes for 1, like the one below.</p>

<p><img src="/images/2020-05-17/Motu_cornercase.PNG" alt="png" /></p>
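<p>The inflation is easy to reproduce with made-up numbers (these prices are hypothetical, not from my receipts). As a bonus, the minimum over the line-item totals gives a crude estimate of the vanilla box price:</p>

```python
# Hypothetical "Box for 1" line-item totals as they appear on receipts;
# the last two bundle un-itemised extras on top of the base box.
line_totals = [17.50, 17.50, 17.50, 21.00, 23.25]

avg_price = sum(line_totals) / len(line_totals)   # 19.35, above the vanilla price
base_estimate = min(line_totals)                  # 17.5, the no-frills box
```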

<h1 id="dishoom">Dishoom</h1>

<p>A lot of people love to shit on Dishoom. These are the fools that equate contrarianism with intelligence. Dishoom is fucking amazing, and I’ll fight anyone who says otherwise.</p>

<h2 id="summary-stats-6">Summary Stats</h2>

<p><strong>Consumption:</strong></p>

<p>Total Number of Orders:  6</p>

<p>Total Value of Orders:  £166.20</p>

<p><strong>Cost:</strong></p>

<p>Average Order Value:  £27.70</p>

<p>Median Order Value:  £21.90</p>

<p><strong>Order</strong> <strong>History:</strong></p>

<p>First Order: 2020-08-14</p>

<p>Last Order:  2020-12-19</p>

<p>Order Frequency:  1.42  per month</p>

<h2 id="summary-graphs-6">Summary Graphs</h2>
<p><img src="/images/2020-05-17/output_100_1.png" alt="png" /></p>

<h2 id="item-wise-analysis-6">Item-wise Analysis</h2>

<div width="50%">
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe" style="width: 50%;">
  <thead>
    <tr style="text-align: right;">
      <th>Item</th>
      <th>Qty</th>
      <th>Value</th>
      <th>Avg Price</th>
      <th>Item Type</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Garlic Naan (V)</th>
      <td>10</td>
      <td>35.0</td>
      <td>3.50</td>
      <td>Side</td>
    </tr>
    <tr>
      <th>House Black Daal (V)</th>
      <td>6</td>
      <td>51.9</td>
      <td>8.65</td>
      <td>Main</td>
    </tr>
    <tr>
      <th>Chilli Chicken</th>
      <td>4</td>
      <td>27.6</td>
      <td>6.90</td>
      <td>Side</td>
    </tr>
    <tr>
      <th>Pau Bhaji (V)</th>
      <td>1</td>
      <td>5.7</td>
      <td>5.70</td>
      <td>Side</td>
    </tr>
    <tr>
      <th>Chole (Ve)</th>
      <td>1</td>
      <td>9.5</td>
      <td>9.50</td>
      <td>Main</td>
    </tr>
    <tr>
      <th>Mattar Paneer (V)</th>
      <td>1</td>
      <td>11.5</td>
      <td>11.50</td>
      <td>Main</td>
    </tr>
    <tr>
      <th>Chicken Berry Britannia</th>
      <td>1</td>
      <td>12.5</td>
      <td>12.50</td>
      <td>Main</td>
    </tr>
    <tr>
      <th>Chicken Ruby</th>
      <td>1</td>
      <td>12.5</td>
      <td>12.50</td>
      <td>Main</td>
    </tr>
  </tbody>
</table>
</div>

<p>Dishoom has the best naan I’ve had in the UK, just by virtue of the fact that it’s actually a naan, and not an insipid cloud of semi-cooked dough. The black daal is among the best I’ve had anywhere, but my favourite item has to be the Chilli Chicken, which is basically popcorn chicken tossed in a sticky, spicy sauce. Dishoom, if you guys are reading this, how about showing vegetarians some love and adding Chilli Paneer to the menu?</p>

<h1 id="next-steps">Next Steps</h1>

<p>The following are some ideas for next steps:</p>

<ol>
  <li>
    <p>Add support for UberEats: I’m primarily a Deliveroo user myself, but I did go through an UberEats phase. Would be nice to get that data in here as well.</p>
  </li>
  <li>
    <p>Calorie counts: I would really like to be able to figure out caloric consumption rates, but I’ve no idea how I’d go about it.</p>
  </li>
  <li>
    <p>Infer regime changes from the data. As seen in the analysis above, the data often changed due to events taking place in my life and the world in general. I would like to be able to automatically infer the different regimes from the data in some way. In a high-level, abstract sense, the way to do this would be to fit some kind of model to the data in the current phase, and calculate the probability of the next phase occurring given that fit. If the probability is low, that means an event has occurred which caused the fit to change. I have no idea what such a model would actually look like though.</p>
  </li>
  <li>
    <p>Correlate this with other datasets in my life: Was I on a health kick when I didn’t order in those 2 weeks in Jan? Did I order that brownie sundae late on Friday night because I had an extra long day at work? What do I typically watch on TV when eating Shake Shack?</p>
  </li>
  <li>
    <p>Get everyone else’s data: Wouldn’t it be awesome if everyone in London ran the parser and uploaded their spreadsheets into a central repository? We could mine the resulting dataset for insight into how all of London eats. We could charge restaurant owners and investors to run queries on the dataset, or sell them insights that are mined from it. For eg. “Guys, Chilli Chicken is super popular right now, you gotta add it to your menu!” Or how about “Guys this hole-in-the-wall type place in Brixton is really blowing up, maybe we should invest and hook them up with a location in Central London?”.</p>

    <p>If Deliveroo ever opens up a restaurant consultancy, remember, you saw it here first!</p>
  </li>
  <li>
    <p>Look at the other players in the Deliveroo ecosystem: What about the riders? Would they get some benefit out of analysing their delivery data? Comparing their stats against other riders, or the population average? I don’t know how much data Deliveroo shares with them, and in what format, but employers are typically incentivized to give their employees as little information as possible.</p>
  </li>
</ol>
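<p>For what it’s worth, the regime-change idea (item 3) can be prototyped crudely: model weekly order counts as Poisson, fit the rate on a trailing window, and flag any week that is very unlikely under that fit. This is a toy sketch with made-up counts - a serious attempt would use a proper changepoint detection method - but it captures the “fit a model, then check the probability of the next phase” idea:</p>

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of observing k events under a Poisson(lam) model."""
    return exp(-lam) * lam ** k / factorial(k)

def regime_breaks(weekly_counts, window=8, alpha=0.01):
    """Indices of weeks whose order count is very unlikely under a
    Poisson rate fitted to the preceding `window` weeks."""
    breaks = []
    for i in range(window, len(weekly_counts)):
        lam = sum(weekly_counts[i - window:i]) / window  # fitted rate
        if lam > 0 and poisson_pmf(weekly_counts[i], lam) < alpha:
            breaks.append(i)
    return breaks

# Hypothetical weekly order counts: a ~5-orders-a-week regime, then lockdown
weeks = [5, 4, 6, 5, 5, 4, 5, 6, 0, 0, 0, 0]
regime_breaks(weeks)  # flags week 8, the first week of the new regime
```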

<h1 id="footnotes">Footnotes</h1>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
<p>Well, I also know which other items typically accompany it, so one could in theory say that since item 1, item 2 and item 3 occur together, it’s a mains+side+drink combo. However, it’s MUCH harder to extract meaningful inferences in this way, and almost impossible when your dataset is so small! <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

<hr />

<section class="comments" id="comment-section">
  <hr />
  
  <!-- Existing comments -->
  <div class="comments__existing">
    <h2>Comments</h2>
    
    
    <!-- List main comments in reverse date order, newest first. List replies in date order, oldest first. -->
    
    

<article id="comment-ce20b910-be0f-11eb-9aea-53834495820b" class="js-comment comment" uid="ce20b910-be0f-11eb-9aea-53834495820b">

  <div class="comment__author">Rishabh Vaid,
    <span class="comment__date"><a href="#comment-ce20b910-be0f-11eb-9aea-53834495820b" title="Permalink to this comment">May 26th, 2021 10:47</a></span>
  </div>

  <div class="comment__body">
    <p>This is a comment</p>

  </div>


    <div class="comment__meta">
      <a rel="nofollow" class="comment__reply-link" onclick="return addComment.moveForm('comment-ce20b910-be0f-11eb-9aea-53834495820b', 'respond', 'Deliveroo-Analysis-III', 'ce20b910-be0f-11eb-9aea-53834495820b')">↪&#xFE0E; Reply to Rishabh Vaid</a>
    </div>
</article>
  

  <hr style="border-top: 1px solid #ccc; background: transparent; margin-bottom: 10px;" />

    
  </div>
  

  <!-- New comment form -->
  <div id="respond" class="comment__new">
    <form class="js-form form" method="post" action="https://staticman-lordvaiderio.herokuapp.com/v2/entry/lordvaider/lordvaider.github.io/master/comments">
  <input type="hidden" name="options[origin]" value="https://lordvaider.github.io/2021/05/19/Deliveroo-Analysis-III.html" />
  <input type="hidden" name="options[parent]" value="https://lordvaider.github.io/2021/05/19/Deliveroo-Analysis-III.html" />
  <input type="hidden" id="comment-replying-to-uid" name="fields[replying_to_uid]" value="" />
  <input type="hidden" name="options[slug]" value="Deliveroo-Analysis-III" />
  <input type="hidden" name="options[reCaptcha][siteKey]" value="" />
  <input type="hidden" name="options[reCaptcha][secret]" value="" />

  <div class="textfield">
    <label for="comment-form-message"><h2>Add Comment<small><a rel="nofollow" id="cancel-comment-reply-link" href="https://lordvaider.github.io/2021/05/19/Deliveroo-Analysis-III.html#respond" style="display:none;">(cancel reply)</a></small></h2>
      <textarea class="textfield__input" name="fields[message]" type="text" id="comment-form-message" placeholder="Your comment (markdown accepted)" required="" rows="6"></textarea>
    </label>
  </div>

    <div class="textfield narrowfield">
      <label for="comment-form-name">Name
        <input class="textfield__input" name="fields[name]" type="text" id="comment-form-name" placeholder="Your name (required)" required="" />
      </label>
    </div>

    <div class="textfield narrowfield">
      <label for="comment-form-email">E-mail
        <input class="textfield__input" name="fields[email]" type="email" id="comment-form-email" placeholder="Your email (optional)" />
      </label>
    </div>

    <div class="textfield narrowfield hp">
      <label for="hp">
        <input class="textfield__input" name="fields[hp]" id="hp" type="text" placeholder="Leave blank" />
      </label>
    </div>

    <div id="reCaptcha" class="g-recaptcha" data-sitekey=""></div>

    <button class="button" id="comment-form-submit">
      Submit
    </button>

</form>

<article class="modal mdl-card mdl-shadow--2dp">
  <div>
    <h3 class="modal-title js-modal-title"></h3>
  </div>
  <div class="mdl-card__supporting-text js-modal-text"></div>
  <div class="mdl-card__actions mdl-card--border">
    <button class="button mdl-button--colored mdl-js-button mdl-js-ripple-effect js-close-modal">Close</button>
  </div>
</article>

<svg version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" style="display:none">
  <symbol id="icon-loading" viewBox="149.8 37.8 499.818 525"><path d="M557.8 187.8c13.8 0 24.601-10.8 24.601-24.6S571.6 138.6 557.8 138.6s-24.6 10.8-24.6 24.6c0 13.2 10.8 24.6 24.6 24.6zm61.2 90.6c-16.8 0-30.6 13.8-30.6 30.6s13.8 30.6 30.6 30.6 30.6-13.8 30.6-30.6c.6-16.8-13.2-30.6-30.6-30.6zm-61.2 145.2c-20.399 0-36.6 16.2-36.6 36.601 0 20.399 16.2 36.6 36.6 36.6 20.4 0 36.601-16.2 36.601-36.6C595 439.8 578.2 423.6 557.8 423.6zM409 476.4c-24 0-43.2 19.199-43.2 43.199s19.2 43.2 43.2 43.2 43.2-19.2 43.2-43.2S433 476.4 409 476.4zM260.8 411c-27 0-49.2 22.2-49.2 49.2s22.2 49.2 49.2 49.2 49.2-22.2 49.2-49.2-22.2-49.2-49.2-49.2zm-10.2-102c0-27.6-22.8-50.4-50.4-50.4-27.6 0-50.4 22.8-50.4 50.4 0 27.6 22.8 50.4 50.4 50.4 27.6 0 50.4-22.2 50.4-50.4zm10.2-199.8c-30 0-54 24-54 54s24 54 54 54 54-24 54-54-24.6-54-54-54zM409 37.8c-35.4 0-63.6 28.8-63.6 63.6S374.2 165 409 165s63.6-28.8 63.6-63.6-28.2-63.6-63.6-63.6z" />
  </symbol>
</svg>


  </div>
</section>

<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.4/jquery.min.js"></script>

<script src="/assets/main.js"></script>

<script src="https://www.google.com/recaptcha/api.js"></script>]]></content><author><name></name></author><summary type="html"><![CDATA[This is Part III of a 3 part series. Click here for Part I and Part II]]></summary></entry><entry><title type="html">Deliveroo Data Analysis II</title><link href="https://lordvaider.github.io/2021/05/18/Deliveroo-Analysis-II.html" rel="alternate" type="text/html" title="Deliveroo Data Analysis II" /><published>2021-05-18T00:00:00+00:00</published><updated>2021-05-18T00:00:00+00:00</updated><id>https://lordvaider.github.io/2021/05/18/Deliveroo-Analysis-II</id><content type="html" xml:base="https://lordvaider.github.io/2021/05/18/Deliveroo-Analysis-II.html"><![CDATA[<p><em>This is Part II of a 3 part series. Click here for <a href="https://lordvaider.github.io/2021/05/17/Deliveroo-Analysis.html">Part I</a></em></p>

<p>In this section, I will do some broad, first-order analysis. No matter how we slice the data (By time-periods, cuisine or some other pattern), the first questions that spring to mind are always:</p>

<ol>
  <li>How many orders follow this pattern?</li>
  <li>How much money did I spend on such orders?</li>
</ol>

<p>Once we answer these in the aggregate, we can do a more in-depth analysis to see how these quantities trend over time, and do a comparative analysis to see how the aggregate values for different slices stack up against each other.</p>

<p><strong>Table of Contents:</strong></p>

<ul id="markdown-toc">
  <li><a href="#annual-consumption-trends" id="markdown-toc-annual-consumption-trends">Annual Consumption Trends</a></li>
  <li><a href="#sidebar-why-are-graphs-so-painful" id="markdown-toc-sidebar-why-are-graphs-so-painful">Sidebar: Why are Graphs so painful?</a>    <ul>
      <li><a href="#subjective-problems" id="markdown-toc-subjective-problems">Subjective Problems</a>        <ul>
          <li><a href="#my-ingratitude-and-immaturity" id="markdown-toc-my-ingratitude-and-immaturity">My Ingratitude and Immaturity</a></li>
          <li><a href="#my-crippling-ocd" id="markdown-toc-my-crippling-ocd">My Crippling OCD</a></li>
        </ul>
      </li>
      <li><a href="#objective-problems" id="markdown-toc-objective-problems">Objective Problems</a>        <ul>
          <li><a href="#non-uniformly-distributed-values-sampling" id="markdown-toc-non-uniformly-distributed-values-sampling">Non-Uniformly Distributed Values (Sampling)</a></li>
          <li><a href="#discontinuous-jumps-smoothing" id="markdown-toc-discontinuous-jumps-smoothing">Discontinuous Jumps (Smoothing)</a></li>
          <li><a href="#sampling--smoothing" id="markdown-toc-sampling--smoothing">Sampling + Smoothing?</a></li>
        </ul>
      </li>
    </ul>
  </li>
  <li><a href="#annual-consumption-trends-graphs" id="markdown-toc-annual-consumption-trends-graphs">Annual Consumption Trends (Graphs)</a>    <ul>
      <li><a href="#what-do-the-graphs-tell-us" id="markdown-toc-what-do-the-graphs-tell-us">What do the graphs tell us?</a></li>
    </ul>
  </li>
  <li><a href="#consumption-by-cuisine" id="markdown-toc-consumption-by-cuisine">Consumption by Cuisine</a>    <ul>
      <li><a href="#cuisine-distribution-over-time" id="markdown-toc-cuisine-distribution-over-time">Cuisine Distribution over Time</a></li>
      <li><a href="#restaurants-by-cuisine" id="markdown-toc-restaurants-by-cuisine">Restaurants by Cuisine</a></li>
    </ul>
  </li>
  <li><a href="#consumption-by-restaurant" id="markdown-toc-consumption-by-restaurant">Consumption by Restaurant</a>    <ul>
      <li><a href="#restaurant-distribution-over-time" id="markdown-toc-restaurant-distribution-over-time">Restaurant Distribution over time</a></li>
    </ul>
  </li>
  <li><a href="#look-ma-bar-charts" id="markdown-toc-look-ma-bar-charts">Look Ma! Bar Charts!</a></li>
  <li><a href="#cost-analysis" id="markdown-toc-cost-analysis">Cost Analysis</a>    <ul>
      <li><a href="#distribution-of-order-values" id="markdown-toc-distribution-of-order-values">Distribution of Order Values</a></li>
      <li><a href="#is-my-deliveroo-plus-account-worth-it" id="markdown-toc-is-my-deliveroo-plus-account-worth-it">Is my Deliveroo Plus Account worth it?</a></li>
      <li><a href="#most-expensive-restaurant" id="markdown-toc-most-expensive-restaurant">Most Expensive Restaurant</a></li>
    </ul>
  </li>
</ul>

<h1 id="annual-consumption-trends">Annual Consumption Trends</h1>

<p>For starters, I simply looked at the annual data.</p>

<div width="40%">
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe" style="width: 40%;">
  <thead>
    <tr style="text-align: right;">
      <th>Year</th>
      <th>No_Orders</th>
      <th>No_Items</th>
      <th>Tot_Value</th>
      <th>Avg Order Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>2018</th>
      <td>88.0</td>
      <td>182.0</td>
      <td>1302.60</td>
      <td>14.80</td>
    </tr>
    <tr>
      <th>2019</th>
      <td>162.0</td>
      <td>406.0</td>
      <td>2681.01</td>
      <td>16.55</td>
    </tr>
    <tr>
      <th>2020</th>
      <td>66.0</td>
      <td>150.0</td>
      <td>1228.76</td>
      <td>18.62</td>
    </tr>
    <tr>
      <th>Total</th>
      <td>316.0</td>
      <td>738.0</td>
      <td>5212.37</td>
      <td>16.49</td>
    </tr>
  </tbody>
</table>
</div>

<p>Broadly speaking, there was not much change in consumption frequency from 2018 to 2019 (Given that 2018 was only half a year’s worth of data). On average, I ordered in 3 times a week.</p>

<p>Consumption dropped in 2020, for 3 main reasons:</p>
<ol>
  <li>Started cooking more at home.</li>
  <li>Ate out at restaurants/friend’s houses more than previous years.</li>
  <li>Ordered only once during the first 3 months of the Covid19 pandemic.</li>
</ol>

<p>Next, I decided to plot a graph of my consumption over the course of each year.</p>

<h1 id="sidebar-why-are-graphs-so-painful">Sidebar: Why are Graphs so painful?</h1>

<p>Having found the total annual consumption, I wanted to plot out the consumption trends to see how they vary over time. This turned out to be more complicated than one would expect, for a variety of reasons. Since this is supposed to be a report of my journey, I thought I’d spend some time fleshing out these complications instead of jumping straight to the results.</p>

<h2 id="subjective-problems">Subjective Problems</h2>
<p>Part of the reason I found plotting graphs difficult was because <strong>I</strong> was plotting them, and I happen to have quite a few mental and emotional hangups. The ones that are relevant here are:</p>

<h3 id="my-ingratitude-and-immaturity">My Ingratitude and Immaturity</h3>
<p>I have incredibly powerful magical abilities that I take for granted. To be fair, this is a shortcoming I share with most of humanity - We take our visualization and graphical processing abilities for granted, and hence underestimate how hard it is to convey visual information to non-visual entities.</p>

<p>As an illustrative example, look at the scene on your desk. Imagine having to answer a series of simple questions about this scene - Is the lamp to the right of the screen or the left? What is the color of the pen lying closest to the power outlet? Is the stack of papers thicker than the notebook? You’d most likely get a perfect score. Now imagine having to describe the scene to a friend over the phone in enough detail that they can get a perfect score on a similar quiz. Sounds daunting, doesn’t it? (By the way, this is a rigorous proof of the folklore theorem: “A picture is worth a thousand words”).</p>

<p>Something similar happens when we try to plot graphs on a computer. We are communicating visual information over a text channel, and hence we need to specify “obvious” things that our brain takes for granted - A simple example of this is the <a href="https://stackoverflow.com/questions/9603230/how-to-use-matplotlib-tight-layout-with-figure">Before</a> picture of fig.tight_layout() - A human <em>knows</em> to position the graphs such that the labels don’t overlap, but matplotlib needs you to say, “Oh and by the way, please can I have a tight_layout for that fig?!”</p>

<p>Being slammed with unexpected bureaucracy in this way felt unfair and frustrating, and I started throwing tantrums - “Stupid matplotlib developers! Why is something as simple as plotting a graph so complicated?!” The answer, of course, is that plotting a graph isn’t that simple - I just felt that way because I have some extremely advanced visual processing machinery sitting between my ears. Eventually I understood that if I wanted graphs, I had to suck it up, be a big boy, and give the machine what it needs.</p>

<h3 id="my-crippling-ocd">My Crippling OCD</h3>
<p>I have strong aesthetic preferences about certain things, and find deviations physically painful - For eg. I re-wrote this meta-joke 17 different times, trying to get it just right.</p>

<p>While plotting these graphs, I had several tiny requirements which I spent a lot of (too much) time on:</p>
<ol>
  <li>When plotting annual consumption graphs for 3 consecutive years, I wanted them to be stacked on top of each other, with the dates aligned. I had to edit the dataset to make sure the dates for each year went from 1 Jan to 31 Dec.</li>
  <li>I couldn’t choose between plotting number of orders and order value, so I decided to plot both on the same graph. However, when generating the legend, I was only able to generate the legends separately, which meant that either they overlapped and were unreadable, or you had two separate legends in two different corners of the graph, which was super ugly. Finally found a way to hack around this (Thank God for Stackoverflow!) but it should really be a standard option when plotting twinx() charts.</li>
  <li>The legend was covering up part of the graph. Set the ylim to be 1.2*max in order to make room for it.</li>
<li>The dates on the x-axis were rotated. For some reason, these rotated dates really pissed me off, and I wasn’t getting the display intervals that I wanted. Is it that hard to label the x-axis with months? Eventually I managed to get the format of the x-axis exactly the way I wanted it, but I’m still not sure how, and don’t think I could repeat this feat.</li>
</ol>

<h2 id="objective-problems">Objective Problems</h2>

<p>There are also some purely technical considerations that make plotting (useful) graphs harder than simply calling a Plot function on a time series.</p>

<h3 id="non-uniformly-distributed-values-sampling">Non-Uniformly Distributed Values (Sampling)</h3>
<p>My dataset contains points for each order that I placed, and hence is not uniformly sampled. Plotting a line graph on such a dataset could result in some funky looking graphs. When you use a line chart, it will linearly interpolate missing data points, which gives a weird trend line. For eg. it is NOT the case that order activity linearly increased from March to May in the below graph.</p>

<p>This can be addressed by re-sampling the data into uniformly spaced buckets; For eg. each day is a bucket, and orders for that day feed into it. Days with no orders get assigned zero. This graph correctly shows periods of no activity.</p>

<p><img src="/images/2020-05-17/output_18_0.png" alt="png" /></p>
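<p>The re-sampling itself is a one-liner in pandas. A sketch with made-up order timestamps (the real index comes from the parsed receipts):</p>

```python
import pandas as pd

# Hypothetical order values indexed by order time
orders = pd.Series(
    [12.5, 18.0, 9.5],
    index=pd.to_datetime(["2019-07-01", "2019-07-01", "2019-07-04"]),
)

# Bucket into days; days with no orders become 0 instead of being
# linearly interpolated away by the line plot
daily = orders.resample("D").sum()
```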

<h3 id="discontinuous-jumps-smoothing">Discontinuous Jumps (Smoothing)</h3>

<p>Resampling the data gives a more accurate representation of order activity, but the resulting graph looks a bit like hedgehog roadkill. A smoother graph would give a better indication of how order activity trends over a period of time.</p>

<p>Signal smoothing is done with low pass filters, which is just a fancy way of saying you need to mathematically transform the series in a way that damps down the effect of short term fluctuations and pronounces longer term trends (Woah! Looks like the Control Systems course I took in college wasn’t a <em>complete</em> waste of time!)</p>

<p>One particular way to achieve this is to sum order activity over a lookback window - This also has the advantage of being easy to interpret (Order activity in the past ‘n’ days). How to pick ‘n’ in a general scenario is an important question, and hedge funds like WorldQuant hire legions of smart undergrads to <s>try every possible option</s> employ advanced statistical methods to figure it out. In this case however, I just chose a lookback period of one week, because:</p>
<ul>
  <li>Food delivery behaviour is roughly periodic over this time period (Tend to order more on the weekends etc). Hence variations on top of this baseline predictability will give us maximal informational payload.</li>
  <li>Order frequency is typically at least 1 per week; if you pick a lookback window shorter than the average gap between orders, there will be no smoothing effect.</li>
</ul>
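<p>The weekly-lookback sum is equally compact via a rolling window. A sketch over a made-up daily order count series:</p>

```python
import pandas as pd

daily = pd.Series(
    [1, 0, 0, 2, 0, 1, 0, 0, 3, 0],
    index=pd.date_range("2019-07-01", periods=10, freq="D"),
)

# Trailing 7-day window: each point is total order activity over the
# past week; min_periods=1 keeps the first few days instead of NaN
weekly_trend = daily.rolling(window=7, min_periods=1).sum()
```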

<p><strong>Important Note:</strong> One last thing to observe from the graph below is that the smoothed graph isn’t strictly better, as it sometimes omits juicy details. For eg. On 13 July 2019, I had 5 orders on the same day!</p>

<p><img src="/images/2020-05-17/output_20_0.png" alt="png" /></p>

<h3 id="sampling--smoothing">Sampling + Smoothing?</h3>

<p>I’m all about saving effort, and so the natural next thought was: what if, instead of resampling data into daily buckets and then smoothing it out using a weekly window, I just resampled into weekly buckets? As you can see, the weekly-sampled graph excludes a fair bit of detail, so I decided the shortcut wasn’t worth it.</p>

<p><img src="/images/2020-05-17/output_22_0.png" alt="png" /></p>

<h1 id="annual-consumption-trends-graphs">Annual Consumption Trends (Graphs)</h1>

<p>Putting together all the knowledge from the previous section, I was in a position to plot the graphs tracking the local consumption trends in each year. I wasn’t sure about whether to use order value or number of orders as my consumption metric, so I went with both (As I explain below, their interplay also allows us to make some interesting inferences).</p>

<p><img src="/images/2020-05-17/output_24_0.png" alt="png" /></p>

<p><img src="/images/2020-05-17/output_24_1.png" alt="png" /></p>

<p><img src="/images/2020-05-17/output_24_2.png" alt="png" /></p>

<h2 id="what-do-the-graphs-tell-us">What do the graphs tell us?</h2>
<ul>
  <li>
    <p>Going by the scales of the y axes, the weekly consumption dropped in 2020 (in terms of both the average and the peak value).</p>
  </li>
  <li>
    <p>The valleys with zero orders correspond to the times that I was on vacation, or my parents were visiting. The longest period of time without orders was from mid-March 20 to late April 20, which is when Covid19 first went viral (sorry) in the public imagination.</p>
  </li>
  <li>
    <p>The scales of the Value and Orders chart are such that the Orders line (Red) is, in most cases, above the Value line (Green). Hence, the instances where the green line crosses the red line correspond to particularly large orders, when I had guests over (For eg. start Mar 2019, end Apr 2019, mid Nov 2019, start Dec 2020)</p>

    <p>Note that the Value chart has roughly the same height in end April 19 and start May 19, but it’s clear that the Value in May came from several orders - that was just me pigging out for whatever reason.</p>
  </li>
</ul>

<h1 id="consumption-by-cuisine">Consumption by Cuisine</h1>

<p>The next logical prism through which to split the data is Cuisine. For starters, what is the distribution of cuisine preference?</p>

<div>
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe" style="width: 50%;">
  <thead>
    <tr style="text-align: right;">
      <th>Cuisine</th>
      <th>OrderNo</th>
      <th>Value</th>
      <th>Value Per Order</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Burger</th>
      <td>89</td>
      <td>1542.18</td>
      <td>17.3</td>
    </tr>
    <tr>
      <th>Indian</th>
      <td>61</td>
      <td>1151.30</td>
      <td>18.9</td>
    </tr>
    <tr>
      <th>Pizza</th>
      <td>28</td>
      <td>646.34</td>
      <td>23.1</td>
    </tr>
    <tr>
      <th>Thai</th>
      <td>33</td>
      <td>586.90</td>
      <td>17.8</td>
    </tr>
    <tr>
      <th>Dessert</th>
      <td>47</td>
      <td>367.05</td>
      <td>7.8</td>
    </tr>
    <tr>
      <th>Chinese</th>
      <td>18</td>
      <td>307.90</td>
      <td>17.1</td>
    </tr>
    <tr>
      <th>Italian</th>
      <td>13</td>
      <td>231.60</td>
      <td>17.8</td>
    </tr>
    <tr>
      <th>Greek</th>
      <td>14</td>
      <td>199.40</td>
      <td>14.2</td>
    </tr>
    <tr>
      <th>Lebanese</th>
      <td>13</td>
      <td>179.70</td>
      <td>13.8</td>
    </tr>
  </tbody>
</table>
</div>
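<p>A table like the one above falls out of a single groupby. A sketch with a few hypothetical order rows standing in for the real dataframe:</p>

```python
import pandas as pd

df = pd.DataFrame({
    "Cuisine": ["Burger", "Burger", "Indian", "Pizza"],
    "Value": [15.0, 19.6, 18.9, 23.1],
})

# Count and total per cuisine, sorted by total spend
by_cuisine = (
    df.groupby("Cuisine")["Value"]
    .agg(OrderNo="count", Value="sum")
    .sort_values("Value", ascending=False)
)
by_cuisine["Value Per Order"] = (by_cuisine["Value"] / by_cuisine["OrderNo"]).round(1)
```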

<p>And the same data in pie chart format (As one can see, there is a fair bit of variance in the Value per Order for different cuisines, so Order share seemed like a more democratic metric to compare). Looks like Burgers and Indian food account for about 50% of my consumption!</p>

<p><img src="/images/2020-05-17/output_29_0.png" alt="png" /></p>

<h2 id="cuisine-distribution-over-time">Cuisine Distribution over Time</h2>

<p>I wanted to see how the distribution of various cuisines has varied over time, so plotted the following charts: My cuisine preferences appear to be pretty dynamic! However, this is probably a result of external factors rather than my personality changing from year to year. One such factor is the availability of restaurants on Deliveroo; If Shake Shack delivered to my house in 2018, I’m pretty sure burgers would be the chart-topper that year as well.</p>

<p><img src="/images/2020-05-17/output_31_0.png" alt="png" /></p>

<h2 id="restaurants-by-cuisine">Restaurants by Cuisine</h2>

<p>One simple question to ask in this regard is, what is the favourite restaurant for each cuisine? As the pie charts above show, ordering behaviour is quite dynamic over time, so it makes sense to look at the favourite restaurant per cuisine per year. (I aggregated over Value in this case, since restaurants with the same cuisine would have prices closer to each other.)</p>
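<p>The lookup can be sketched with idxmax over grouped totals (hypothetical rows again; the real frame has one row per order):</p>

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2018, 2018, 2019],
    "Cuisine": ["Burger", "Burger", "Burger"],
    "rName": ["Byron", "Shake Shack", "Shake Shack"],
    "Value": [40.0, 25.0, 60.0],
})

# Total spend per (cuisine, year, restaurant)...
totals = df.groupby(["Cuisine", "Year", "rName"])["Value"].sum()
# ...then keep the restaurant with the highest spend in each (cuisine, year) cell
favourites = totals.groupby(level=["Cuisine", "Year"]).idxmax().map(lambda idx: idx[2])
table = favourites.unstack("Year")
```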

<div>
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe" style="width: 50%;">
  <thead>
    <tr style="text-align: right;">
      <th>Year</th>
      <th>2018</th>
      <th>2019</th>
      <th>2020</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Burger</th>
      <td>Byron</td>
      <td>Shake Shack</td>
      <td>Shake Shack</td>
    </tr>
    <tr>
      <th>Chinese</th>
      <td>Grilled Fusion</td>
      <td>Ping Pong</td>
      <td></td>
    </tr>
    <tr>
      <th>Dessert</th>
      <td>Cookies &amp; Cream</td>
      <td>Cookies &amp; Cream</td>
      <td>Craving Dessert</td>
    </tr>
    <tr>
      <th>Greek</th>
      <td>The Athenian</td>
      <td>The Athenian</td>
      <td>The Athenian</td>
    </tr>
    <tr>
      <th>Indian</th>
      <td>Namma by Kricket</td>
      <td>Motu Indian Kitchen</td>
      <td>Dishoom</td>
    </tr>
    <tr>
      <th>Italian</th>
      <td>La Figa</td>
      <td></td>
      <td>Scarpetta</td>
    </tr>
    <tr>
      <th>Lebanese</th>
      <td>The Chickpea</td>
      <td>Waleema</td>
      <td>Efes</td>
    </tr>
    <tr>
      <th>Pizza</th>
      <td>PizzaExpress</td>
      <td>The Pizza Room</td>
      <td>The Pizza Room</td>
    </tr>
    <tr>
      <th>Thai</th>
      <td>Rusty Bike</td>
      <td>Rusty Bike</td>
      <td>Busaba</td>
    </tr>
  </tbody>
</table>
</div>

<p>Also included a fancy Tableau graph of this data, since just showing the max value restaurant hides close runners up (As is the case with Byron and Shake Shack in 2019).</p>

<p><img src="/images/2020-05-17/cusine_year_rest.PNG" alt="png" /></p>

<h1 id="consumption-by-restaurant">Consumption by Restaurant</h1>

<p>Next, I broke down the data by restaurant. Here again, I looked at the overall total, and then looked at the distribution on a year by year basis.</p>

<div width="50%">
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe" style="width: 50%;">
  <thead>
    <tr style="text-align: right;">
      <th>rName</th>
      <th>OrderNo</th>
      <th>Value</th>
      <th>Avg Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Shake Shack</th>
      <td>47</td>
      <td>707.40</td>
      <td>15.1</td>
    </tr>
    <tr>
      <th>Byron</th>
      <td>31</td>
      <td>679.30</td>
      <td>21.9</td>
    </tr>
    <tr>
      <th>Motu Indian Kitchen</th>
      <td>32</td>
      <td>565.75</td>
      <td>17.7</td>
    </tr>
    <tr>
      <th>The Pizza Room</th>
      <td>18</td>
      <td>419.54</td>
      <td>23.3</td>
    </tr>
    <tr>
      <th>Rusty Bike</th>
      <td>21</td>
      <td>346.10</td>
      <td>16.5</td>
    </tr>
    <tr>
      <th>Busaba</th>
      <td>12</td>
      <td>240.80</td>
      <td>20.1</td>
    </tr>
    <tr>
      <th>Ping Pong</th>
      <td>12</td>
      <td>238.25</td>
      <td>19.9</td>
    </tr>
    <tr>
      <th>PizzaExpress</th>
      <td>9</td>
      <td>212.25</td>
      <td>23.6</td>
    </tr>
    <tr>
      <th>The Athenian</th>
      <td>14</td>
      <td>199.40</td>
      <td>14.2</td>
    </tr>
    <tr>
      <th>Cookies &amp; Cream</th>
      <td>27</td>
      <td>192.15</td>
      <td>7.1</td>
    </tr>
    <tr>
      <th>Manjal</th>
      <td>7</td>
      <td>177.95</td>
      <td>25.4</td>
    </tr>
    <tr>
      <th>Dishoom</th>
      <td>6</td>
      <td>166.20</td>
      <td>27.7</td>
    </tr>
  </tbody>
</table>
</div>

<h2 id="restaurant-distribution-over-time">Restaurant Distribution over time</h2>

<p>The pie charts below show the order share of restaurants per year. I only included restaurants with &gt; 5% order share to keep things readable.</p>

<p><img src="/images/2020-05-17/output_39_0.png" alt="png" /></p>
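<p>Applying the cut-off before plotting is straightforward: keep the big slices and lump the tail into an “Other” bucket. A sketch with made-up shares:</p>

```python
import pandas as pd

share = pd.Series(
    {"Shake Shack": 0.24, "Byron": 0.16, "Motu": 0.12, "Manjal": 0.03, "Efes": 0.02}
)

# Keep restaurants above 5% order share; collapse the rest into "Other"
major = share[share > 0.05].copy()
major["Other"] = share[share <= 0.05].sum()
# major is now ready for major.plot.pie()
```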

<p>I don’t know if someone who doesn’t know me could figure out much about me from the charts above, but they make a lot of sense to me given what I know about myself.</p>

<p><strong>2018:</strong> I first moved to London, and had a hankering for Indian food (Namma by Kricket) and ordered a lot from Pizza Express (The power of brand recognition). Namma by Kricket shut shop within a couple of months (The power of a really bad name) and Rusty Bike became my go-to for simple, not-too-unhealthy food. Once I discovered Pizza Room, I entirely switched over to them for all my pizza needs. Cookies and Cream was the generic cake shop closest to my house.</p>

<p><strong>2019:</strong> Early 2019 was a wild time. Two of my favourite restaurants, Shake Shack and Ping Pong started delivery to my house. Unfortunately, I moved houses and Ping Pong no longer delivered to my new house, which explains the 8.9% above. Also, Byron launched a veggie burger (Limited edition though), which meant I ordered from them a lot more. Motu was another one of the new entrants at this time - They had a Box for 1, which was strictly average in quality but sated my desire for Indian food. Dzrt was the generic cake shop closest to my new house.</p>

<p><strong>2020:</strong> Two cataclysmic events occurred in late 2019; A Chinese dude ate a bat sandwich and I got a new flatmate. My flatmate didn’t enjoy burgers as much as I did, which meant a drastic reduction in Byron + Shake Shack. Also, the pandemic seemed to affect Byron particularly badly (If Lord Byron is reading this, check out Section 3 for some ideas to increase business). The preponderance of Busaba can be attributed to my flatmate, while Capeesh was all mine. We both got behind the Athenian (Halloumi Souvlaki), Manjal (Uthappa aka Savory South Indian rice pancakes) and Scarpetta (Pasta for her, grilled chicken+veggies for me). Due to the lockdown, Dishoom finally entered the food delivery game in the later part of the year. Another pandemic baby was Chowpatty, an upstart, home-run ‘restaurant’ that delivered Bombay street food (Complete with raw mango garnish).</p>

<h1 id="look-ma-bar-charts">Look Ma! Bar Charts!</h1>

<p>Honestly, this section is just me flexing my newly developed plotting muscles…</p>

<p><img src="/images/2020-05-17/output_42_0.png" alt="png" /></p>

<p>Unsurprisingly, most of the food is ordered on the weekends. But why stop here? Let’s take a look at:</p>

<p><img src="/images/2020-05-17/output_44_0.png" alt="png" /></p>

<p>Another anti-surprise: most of the food is ordered during lunch and dinner, with more dinner orders than lunch orders. And because we live in an age of cheap compute, I decided to also plot…</p>

<p><img src="/images/2020-05-17/output_46_0.png" alt="png" /></p>

<p>OHMYGODOHMYGODOHMYGOD!!! No orders during the <a href="https://en.wikipedia.org/wiki/The_Number_23">23rd</a> minute!!! Clearly I’m <a href="https://en.wikipedia.org/wiki/The_Truman_Show">living in a movie</a>…</p>

<h1 id="cost-analysis">Cost Analysis</h1>

<p>As calculated before, my total spend on Deliveroo so far is <strong>£5212.37</strong>, which works out to <strong>£87 per month</strong> on average.</p>

<h2 id="distribution-of-order-values">Distribution of Order Values</h2>

<p>The average price of an order is £16.49 and the median is £15.77. I’ve plotted the distribution below, and it’s pretty obviously multimodal - For eg. The large concentration of orders at about £8 corresponds to the dessert orders. One can also see the larger orders when I ordered for a group of people (Seems like there were at least 7 or 8 such occasions).</p>

<p><img src="/images/2020-05-17/output_49_0.png" alt="png" /></p>

<h2 id="is-my-deliveroo-plus-account-worth-it">Is my Deliveroo Plus Account worth it?</h2>

<p>This calculation is very hard to do super accurately, since the rules of the game keep changing. For eg. in Jan 20, Deliveroo introduced a £10 minimum order value to avail free delivery. The delivery fee structure is also pretty complicated, though Deliveroo states that on average, it is about £2.5. I have no idea how that number has changed over time though - Most of my email receipts before subscribing to Plus state the delivery fee is £0, which makes me wonder why I subscribed in the first place. To cut a long story short I will proceed with the following assumptions - Delivery fee would have been £2.5 without Plus, and the £10 threshold applies to all orders.</p>

<p>I started my Plus subscription in October 2019. At the time, it cost £7.99 per month (increased to £11.49 per month in December 20). Over 15 months (14 at the old price and one at the new), this is a total spend of £123.35.</p>

<p>Over the same period of time, I had 80 orders that cost more than £10. This translates to savings on delivery fees of £200, and <strong>net savings of £76.65</strong>, or a princely sum of <strong>£5.11 per month</strong>. At the present cost of Plus, the net savings would be <strong>less than £2 per month</strong>.</p>
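<p>Spelling the arithmetic out under those assumptions (14 months at the launch price, one at the December 2020 price, £2.50 saved on each qualifying order):</p>

```python
# Subscription cost: Oct 2019 to Dec 2020, with the price rise in the final month
subscription = 14 * 7.99 + 1 * 11.49

# Savings: 80 orders over the £10 threshold, each avoiding a ~£2.50 delivery fee
savings = 80 * 2.50

net = savings - subscription
monthly = net / 15
```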

<p>It would be nice if Deliveroo themselves could do this calculation for you. Obviously, they wouldn’t want to show you that info if it turns out your Plus membership is actually a net negative for you, and I don’t think it would be acceptable for them to only show this statistic to people who benefitted from Plus.</p>

<p>Hence, the only way such a feature could work is if they showed people who don’t have Plus how much they would have benefitted with Plus given their order history - “If you had signed up for Plus in Oct 2019, and used the savings to purchase long-dated Gamestop options, today you’d have a 1000 Dogecoins!”. Of course, this is assuming that Deliveroo is actually incentivized to convert such customers to Plus; I’ve no idea how the economics on that works.</p>

<h2 id="most-expensive-restaurant">Most Expensive Restaurant</h2>

<p>This should be an easy one, right? The most expensive restaurant is the one that charges the highest price per unit of food, which seems straightforward enough. Unfortunately, the hard part of using that formula is defining a ‘unit of food’. Do we define an order to constitute 1 unit? This doesn’t seem right; as we will see in the analysis of individual restaurants, order sizes can be quite variable even for orders from the same restaurant (Due to guests, for example).</p>

<p>How about defining a unit of food to be one restaurant item? This is even more problematic, because items can fall into various categories - Mains, Sides, Pizzas (Yes, pizza is a separate category!), Drinks, Desserts etc. - and are hence even less uniform than orders. There might be a path here if we manage to categorize the items and take some sort of weighted sum across categories (1 Side = 0.5 Units, 1 Main = 1 Unit, 1 Pizza = 1.5 Units etc.) but the categorization is a challenge in its own right (That I explore in the restaurant level analysis).</p>

<p>Another approach would be to go full Physics and define food units in calories, but caloric information doesn’t exist for most of these items - Hit me up if you can think of a not-too-hard, non-manual way to come up with approximate calorie counts for the items! Also, caloric count isn’t proportional to satiety - The calorie approach would be biased against desserts, but that is just an argument for why desserts should be their own category.</p>

<p>In the absence of clear answers, I decided to just calculate the 2 simplest metrics (Average Order Value and Average Item Value), and see which one made more sense.</p>
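<p>Both metrics are cheap to compute once item counts are on each order row. A sketch with hypothetical rows (Value is the order total, Items the number of items in it):</p>

```python
import pandas as pd

df = pd.DataFrame({
    "rName": ["Dishoom", "Dishoom", "Byron"],
    "Value": [55.40, 110.80, 21.91],
    "Items": [8, 17, 4],
})

g = df.groupby("rName")
avg_order_value = g["Value"].mean().round(2)
avg_item_value = (g["Value"].sum() / g["Items"].sum()).round(2)
```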

<div width="20%">
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>

<table border="1" class="dataframe" style="width: 30%;">
  <thead>
    <tr style="text-align: right;">
      <th>rName</th>
      <th>Average Item Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Motu Indian Kitchen</td>
      <td>11.55</td>
    </tr>
    <tr>
      <td>The Pizza Room</td>
      <td>10.76</td>
    </tr>
    <tr>
      <td>PizzaExpress</td>
      <td>10.61</td>
    </tr>
    <tr>
      <td>Busaba</td>
      <td>8.30</td>
    </tr>
    <tr>
      <td>Cookies &amp; Cream</td>
      <td>7.12</td>
    </tr>
    <tr>
      <td>Manjal</td>
      <td>7.12</td>
    </tr>
    <tr>
      <td>The Athenian</td>
      <td>6.88</td>
    </tr>
    <tr>
      <td>Dishoom</td>
      <td>6.65</td>
    </tr>
    <tr>
      <td>Rusty Bike</td>
      <td>6.41</td>
    </tr>
    <tr>
      <td>Byron</td>
      <td>6.01</td>
    </tr>
  </tbody>
</table>
</div>

<p>Off the bat, we can see that Item Value as a metric gives nonsensical results - The top 3 slots are taken by pizza restaurants (As predicted) and Motu Indian Kitchen, which offers the massive “Box for 1” as a single item but is hardly an expensive/high-class restaurant. The other entries on the list are equally nonsensical (The Athenian? Rusty Bike?? Cookies &amp; Cream???)</p>

<div width="20%">
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>

<table border="1" class="dataframe" style="width: 30%;">
  <thead>
    <tr style="text-align: right;">
      <th>rName</th>
      <th>Average Order Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Dishoom</td>
      <td>27.70</td>
    </tr>
    <tr>
      <td>Manjal</td>
      <td>25.42</td>
    </tr>
    <tr>
      <td>PizzaExpress</td>
      <td>23.58</td>
    </tr>
    <tr>
      <td>The Pizza Room</td>
      <td>23.31</td>
    </tr>
    <tr>
      <td>Byron</td>
      <td>21.91</td>
    </tr>
    <tr>
      <td>Busaba</td>
      <td>20.07</td>
    </tr>
    <tr>
      <td>Ping Pong</td>
      <td>19.85</td>
    </tr>
    <tr>
      <td>Motu Indian Kitchen</td>
      <td>17.68</td>
    </tr>
    <tr>
      <td>Chowpatty</td>
      <td>17.11</td>
    </tr>
    <tr>
      <td>Rusty Bike</td>
      <td>16.48</td>
    </tr>
  </tbody>
</table>
</div>

<p>This seems more in line with the truth, but again there is a very evident bias - All of Dishoom, Manjal and PizzaExpress have instances of large (&gt; £60) orders, which skew their average order value upwards.</p>

<p>Another point in favour of the order metric - This list features the fancier places like Ping Pong, Busaba, and Byron higher up than the first list.</p>

<p>In order to see the restaurant-wise analysis, check out <a href="https://lordvaider.github.io/2021/05/19/Deliveroo-Analysis-III.html">Section III</a>!</p>

<script src="https://www.google.com/recaptcha/api.js"></script>]]></content><author><name></name></author><summary type="html"><![CDATA[This is Part II of a 3 part series. Click here for Part I]]></summary></entry><entry><title type="html">Deliveroo Data Analysis I</title><link href="https://lordvaider.github.io/2021/05/17/Deliveroo-Analysis.html" rel="alternate" type="text/html" title="Deliveroo Data Analysis I" /><published>2021-05-17T00:00:00+00:00</published><updated>2021-05-17T00:00:00+00:00</updated><id>https://lordvaider.github.io/2021/05/17/Deliveroo-Analysis</id><content type="html" xml:base="https://lordvaider.github.io/2021/05/17/Deliveroo-Analysis.html"><![CDATA[<p>In this series of posts, my goal is to analyse my Deliveroo order data starting May 2018 and see what I can learn about myself from it.</p>

<p>The series is divided into 3 sections:</p>
<ol>
  <li><strong>Section I: Fetching the raw data.</strong> Provides details on the steps taken to parse email order receipts and obtain the data in a structured format.</li>
  <li><strong>Section II: Top level analysis.</strong> Simple things like spend per year, consumption patterns over time, distribution by cuisine/restaurant and how it changed over time. Also includes a behind the scenes look at why these patterns changed in the way they did.</li>
  <li><strong>Section III: Restaurant-wise analysis:</strong> Basic stats and graphs for the most popular restaurants. Also includes a list of the most popular items, reviews and other personal tidbits.</li>
</ol>

<p><strong>Table of Contents:</strong></p>

<ul id="markdown-toc">
  <li><a href="#step-0---build-mad-skillz" id="markdown-toc-step-0---build-mad-skillz">Step 0 - Build Mad Skillz</a></li>
  <li><a href="#getting-the-data" id="markdown-toc-getting-the-data">Getting the data</a>    <ul>
      <li><a href="#parsing-email-receipts-to-csv" id="markdown-toc-parsing-email-receipts-to-csv">Parsing Email Receipts to .csv</a></li>
      <li><a href="#data-formatting--cleaning" id="markdown-toc-data-formatting--cleaning">Data Formatting + Cleaning</a></li>
    </ul>
  </li>
</ul>

<h1 id="step-0---build-mad-skillz">Step 0 - Build Mad Skillz</h1>

<p>Despite having an advanced degree in Computer Science and working as a sort-of-software engineer for several years, I knew embarrassingly little about the basic machinery required to work on this project. Hence I spent a fair bit of time just learning some super simple stuff:</p>

<ol>
  <li>
    <p><strong>Pandas:</strong> For those that don’t know, Pandas is a Python library used for dealing with tabular data sets (Dataframes) - Pretty much bread and butter for any datadude. Going through the Kaggle tutorial was sufficient to get started.</p>
  </li>
  <li>
    <p><strong>Jupyter Notebooks:</strong> I guess the only thing I had to learn was the fact that they exist! I really love it when tools are powerful AND easy to use - I felt like I had been looking for Jupyter notebooks my entire life!</p>
  </li>
  <li>
    <p><strong>Other tools:</strong> Played around with a couple of different code editors, picked up the basics of the command line and git and went through some of the lectures <a href="https://missing.csail.mit.edu/">here</a>.</p>
  </li>
  <li>
    <p><strong>Touch Typing:</strong> Imagine having to use a pencil taped to a brick every time you want to write something. That is what using a keyboard felt like to me. I decided to make the interface between my brain and the computer as seamless as possible, and learnt to touch type. Best investment of my time ever!</p>
  </li>
</ol>

<p><img src="/images/2020-05-17/engelbart.PNG" alt="png" /></p>

<h1 id="getting-the-data">Getting the data</h1>
<p>The first step of any data project is to get the data into a usable format (Most often a table), clean it (Get rid of meaningless/null values) and enrich it with derived fields that will be utilized in the analysis.</p>

<h2 id="parsing-email-receipts-to-csv">Parsing Email Receipts to .csv</h2>
<p>As a first step, I wrote some code to parse the email receipts that I get from Deliveroo each time I place an order and extract the data into a nice .csv file. This turned out to be harder than I originally thought it would be. Some challenges I faced along the way:</p>

<ol>
  <li>
    <p><strong>Deciding which format to parse:</strong> Each email contains a text version and an HTML version. Initially, parsing the HTML seemed like the more correct way; There was even a neat pandas function that converted from HTML to a dataframe!</p>

    <p>However the HTML to dataframe function was super brittle, and when it broke, I didn’t exactly know why. Rather than dig through the error messages and try to make the function work, I just wrote a backup text parser for the cases where the HTML parser failed. The text parser turned out to be much simpler and, as I found later, more accurate as well.</p>

    <p>There are libraries such as Beautiful Soup whose specific purpose is to scrape data from HTML, but I decided to postpone digging into them until I had no choice.</p>
  </li>
  <li>
    <p><strong>The format of the email changes over time:</strong> I was expecting this to be an issue. As a simple example, the top line of the Deliveroo email reads: “{Restaurant Name} has your order!”. However, before May 2019, it used to read “{Restaurant Name} has <strong>accepted</strong> your order!”.</p>

    <p>The first solution my brain suggested was to put a branch in my code to check whether the order was received before/after 1 May 2019, and parse the restaurant name accordingly. Of course, this solution was terribly unsatisfactory because:</p>

    <ul>
      <li>I didn’t know exactly when the format had changed, I just knew that it had changed at some point between two of my orders. Hence, there was a chance that if someone else ran this code for their orders, it might parse them incorrectly.</li>
      <li>More seriously, the format could change again at some point, and then I’d have to add yet another branch in my code, increasing the length and complexity of my codebase.</li>
    </ul>

    <p>I shrugged off these concerns and coded it up anyway, since I wanted to finish the data scraping ASAP and move on to the <strong>ANALYSIS!</strong> However, the code didn’t work as expected: before November 2018, the top line used to read “{Restaurant Name} has <strong>received</strong> your order!”</p>

    <p>There was no way I was putting 3 branches in my code, and so I decided to respect the problem and actually think about it for 5 minutes. At 4 minutes and 20 seconds, I realized I could leverage some regular expression magic to vastly simplify the code.</p>

    <p>For those who don’t know, regular expressions are a way to check whether some text matches a certain pattern, and to split it into sub-patterns. I was using a different regexp in each branch to extract the restaurant name:</p>
    <ul>
      <li>{Restaurant Name} has your order!</li>
      <li>{Restaurant Name} has accepted your order!</li>
      <li>{Restaurant Name} has received your order!</li>
    </ul>

    <p>However, as I later realised, I could push the branching into the regexp itself and use just one pattern: “{Restaurant Name} has (accepted |received |)your order!” This change avoided blowing up the size of the code and also exorcised the date-based check.</p>
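    <p>The single-pattern approach can be sketched like this (function and pattern names are mine, not from the actual repo):</p>

```python
import re

# One pattern covers all three historical formats: the optional group
# matches "accepted ", "received ", or nothing at all.
TOP_LINE = re.compile(r"^(?P<restaurant>.+?) has (?:accepted |received |)your order!")

def restaurant_name(line):
    """Extract the restaurant name from the top line of a receipt."""
    match = TOP_LINE.match(line)
    return match.group("restaurant") if match else None
```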
  </li>
  <li>
    <p><strong>The price convention seemed to change randomly:</strong> This was a subtle issue, and I actually noticed it when I was deep in the <strong>ANALYSIS!</strong> stage. I was plotting the distribution of order sizes, and found that I had spent £150 on a single order at Dishoom, which I did not remember.</p>

    <p>Digging further, I found that while most of the receipts contained the unit price of each item, some of them (notably, the ones for Dishoom) showed the total price (unit price × item quantity) instead. I was working under the assumption that they were all unit prices, and hence the totals for Dishoom were being calculated as much higher than they actually were.</p>

    <p>In order to figure out which receipt followed which convention, I started parsing the bottom-line numbers (Sub-Total, Delivery Fee, Taxes, Total) in the receipt, and then ran a basic checksum: I computed the implied total under both price conventions and kept whichever one matched.</p>

    <p>Eventually though, this turned out to be wasted effort. The two price conventions were an artefact introduced by some extra display logic in the HTML code, and the text part of the email had the right values all along. However, it’s still good to have the bottom-line numbers in order to analyse things like how delivery fees changed over time and how much I pay in extra charges, so this wasn’t a <em>total</em> waste.</p>
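    <p>That checksum logic is simple enough to sketch; the function name is hypothetical, and items are (quantity, printed price) pairs read off one receipt:</p>

```python
def detect_price_convention(items, subtotal, tol=0.01):
    """Decide whether printed prices are per-unit or already qty-multiplied.

    items: list of (quantity, printed_price) pairs from one receipt.
    Returns 'unit', 'total', or None if neither matches the Sub-Total.
    (When every quantity is 1 the two conventions are indistinguishable,
    and 'unit' is returned.)
    """
    implied_unit = sum(qty * price for qty, price in items)
    implied_total = sum(price for _, price in items)
    if abs(implied_unit - subtotal) < tol:
        return "unit"
    if abs(implied_total - subtotal) < tol:
        return "total"
    return None
```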
  </li>
</ol>

<p>I have shared the end result of these efforts in a Git repo <a href="https://github.com/lordvaider/DataDeliveroo">here</a>, in case any of you are interested in fetching your own Deliveroo data.</p>

<p>Eventually, I was able to get my data extraction working and got the raw data corresponding to (almost) all orders in a neat .csv file. Just scrolling through this file, I got a feeling of power. <s>Finally it was <b>ANALYSIS!</b> time!</s> Finally, it was time to clean my data and get it into the right format!</p>

<h2 id="data-formatting--cleaning">Data Formatting + Cleaning</h2>

<p>In order to do useful <strong>ANALYSIS!</strong> on the data, it first needed some cleaning and formatting:</p>
<ol>
  <li>
    <p><strong>Datatype Conversions:</strong> Convert the values of the “Date” column into datetimes.</p>
  </li>
  <li>
    <p><strong>Add Inferred Columns</strong>: Enrich the dataset with some more columns that will be required in the course of the analysis, such as Value and OrderNo (all items ordered at the same time share an order number).</p>
  </li>
  <li>
    <p><strong>Remove outliers:</strong> Remove rows with price = 0. Typically, such rows correspond to freebies such as ketchup/mustard packets. While it may be interesting to analyse such rows separately, I will exclude them for now.</p>
  </li>
  <li><strong>String Cleaning:</strong> Some of the restaurant names and item names contained weird characters such as “=E2=80=99”. These are quoted-printable escapes of <a href="https://en.wikipedia.org/wiki/UTF-8">UTF-8</a> characters: UTF-8 represents emojis and special symbols as sequences of bytes, and the email encoding writes each byte as an “=” followed by its value in hex. Examples:
    <ul>
      <li>“=C2=AE” is the encoding of the subtly threatening (R) that follows a registered trademark.</li>
      <li>“=F0=9F=8C=B6” represents the cute chili emojis that serve as a spice warning.</li>
      <li>A lot of them are Mandarin characters, representing the original Chinese name of the item.</li>
    </ul>

    <p>I don’t care about any of these (least of all the spice warnings!), and only remapped =E2=80=99 to a plain apostrophe to make things a little more readable.</p>
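    <p>For what it’s worth, the stdlib <code>quopri</code> module can decode all of these escapes in one go; here’s a small sketch (the function name is mine, and this decodes everything rather than just remapping the apostrophe as the script above does):</p>

```python
import quopri

def clean_name(raw):
    """Decode quoted-printable escapes like =E2=80=99 back to real characters."""
    # quopri turns "=E2=80=99" into the raw UTF-8 bytes b"\xe2\x80\x99"
    text = quopri.decodestring(raw.encode("ascii")).decode("utf-8", errors="ignore")
    # Map the curly apostrophe to a plain one for readability
    return text.replace("\u2019", "'")
```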
  </li>
  <li><strong>Add External Context:</strong> In order to make the analysis more meaningful, I wanted to add in some contextual data that is not present in the raw Deliveroo data:
    <ul>
      <li><strong>Map to Real Name:</strong> Sometimes, restaurants pop up with slightly different names, because of branding exercises or different branches - I mapped these to a single name (The rName or real name). This is important when trying to answer questions like which restaurant is the most popular one.</li>
      <li><strong>Add Cuisine:</strong> Knowing what kind of cuisine each restaurant serves was important for gaining a broad understanding of my tastes and how they evolve over time, and also for the analysis itself. For example, when calculating the average price of a meal, you may want to exclude dessert orders since these are much smaller on average.</li>
    </ul>

    <p>While it may be possible to automate the creation of this contextual dataset, I just did it manually. There are only 49 distinct restaurants that I ordered from and hence labelling them took less than 5 minutes.</p>
  </li>
</ol>
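<p>The first three steps are a few lines of pandas; a sketch on toy data (the real script’s column handling may differ):</p>

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2018-05-07 11:42:05", "2018-05-07 11:42:05", "2018-05-06 20:38:59"],
    "Item": ["Aloo Chaat", "Papad", "Ketchup Sachet"],
    "Qty": [1, 2, 1],
    "Price": [5.5, 1.0, 0.0],
})

df["Date"] = pd.to_datetime(df["Date"])      # 1. datatype conversion
df["Value"] = df["Qty"] * df["Price"]        # 2. inferred columns; items sharing
df["OrderNo"] = df.groupby("Date").ngroup()  #    a timestamp share an OrderNo
df["Year"] = df["Date"].dt.year
df = df[df["Price"] > 0]                     # 3. drop zero-price freebies
```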

<p>Once the data had been extracted, cleaned, formatted and enriched with external context, I spent 5 minutes gazing at it lovingly before diving into the <strong>ANALYSIS!</strong></p>

<div>
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Date</th>
      <th>Restaurant</th>
      <th>Item</th>
      <th>Qty</th>
      <th>Price</th>
      <th>rName</th>
      <th>Cuisine</th>
      <th>Value</th>
      <th>OrderNo</th>
      <th>Year</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>2018-05-06 20:38:59</td>
      <td>Mother Clucker Editions</td>
      <td>Halloumi Bun</td>
      <td>1</td>
      <td>16.5</td>
      <td>Mother Clucker Editions</td>
      <td>Burger</td>
      <td>16.5</td>
      <td>0</td>
      <td>2018</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2018-05-07 11:42:05</td>
      <td>Namma by Kricket</td>
      <td>Aloo Chaat</td>
      <td>1</td>
      <td>5.5</td>
      <td>Namma by Kricket</td>
      <td>Indian</td>
      <td>5.5</td>
      <td>1</td>
      <td>2018</td>
    </tr>
    <tr>
      <th>2</th>
      <td>2018-05-07 11:42:05</td>
      <td>Namma by Kricket</td>
      <td>Burnt Garlic Tarka Dhal</td>
      <td>1</td>
      <td>3.5</td>
      <td>Namma by Kricket</td>
      <td>Indian</td>
      <td>3.5</td>
      <td>1</td>
      <td>2018</td>
    </tr>
    <tr>
      <th>3</th>
      <td>2018-05-07 11:42:05</td>
      <td>Namma by Kricket</td>
      <td>Matar Pilau</td>
      <td>1</td>
      <td>3.2</td>
      <td>Namma by Kricket</td>
      <td>Indian</td>
      <td>3.2</td>
      <td>1</td>
      <td>2018</td>
    </tr>
    <tr>
      <th>4</th>
      <td>2018-05-07 11:42:05</td>
      <td>Namma by Kricket</td>
      <td>Papad</td>
      <td>1</td>
      <td>1.0</td>
      <td>Namma by Kricket</td>
      <td>Indian</td>
      <td>1.0</td>
      <td>1</td>
      <td>2018</td>
    </tr>
  </tbody>
</table>
</div>

<p>In order to see the top level analysis, check out <a href="https://lordvaider.github.io/2021/05/18/Deliveroo-Analysis-II.html">Section II</a>!</p>


]]></content><author><name></name></author><summary type="html"><![CDATA[In this series of posts, my goal is to analyse my Deliveroo order data starting May 2018 and see what I can learn about myself from it.]]></summary></entry></feed>