We’re awful people. Throwing data out there into the void and trusting myriad companies with our personal information, assuming it won’t be used for nefarious purposes. Well. It is. All the time. We’re experimented on and manipulated every second of the day, being bucketed and categorised while hyper-optimised algorithms whirr in the background trying to figure out how we can be gently prodded to click a button and be further monetised, driving up the CTR another few percent. All of this manipulation is driven by data.
Data is fascinating. We produce so much of it without even knowing or thinking about it, whether we’re browsing the internet and lingering a while on a website about cars, or searching for our deepest desires in a search engine. It’s all logged and used against us in order to sell more ads for us to click on. Most of this data we don’t have a good view into, but some of it is available through various APIs. I thought it might be interesting to collate some of my own for some less-than-nefarious purposes.
Anyway, I built this: https://stats.theletterd.co.uk (Code here)
I’ve had this project kicking around my head for a while. There have been numerous posts on reddit asking questions along the lines of “If you had access to a database of your life, what would you want to know?”. I don’t think I’ll be able to get data around how many near-death incidents I’ve had, or how many people have had dreams about me, but there’s some stats out there that are definitely within my grasp.
Google tracks my step-count through my Android phone, Withings’ Health Mate app syncs my weight to Google each time I step on the scale. I’m logging the books I’m reading to Goodreads, and I switched from Runkeeper to Strava for tracking runs because, well, Strava has an API that I can read things from and Runkeeper does not. I don’t know if anyone’s ever switched apps because of the lack of an API, but I suspect those instances are few and far between.
The first stat I wanted to track was my step count, because I know Google’s got the goods. However, that data is behind an API that requires you to do OAuth2 fanciness which (I thought) I would have to dedicate a large amount of time to to figure it out. Time and energy is at a premium right now, and my meagre attempts to read OAuth documentation at night were largely fruitless.
Until I stumbled across an article that showed me how to sync step data from Google Fit into a Google spreadsheet. Ok cool. Now all I needed to do is modify that to fit my means, and then figure out how to access a spreadsheet with a Flask app. That got off the ground when I discovered gspread. After some time trying to remember how to build a very basic React app, we were off to the races.
The nice thing about reading stats from a Google Sheet was that once I’d built a sheet that summarised my steps/weight statistics, I could add a bunch of my own custom stats. Birkenstock count, number of wives, etc. A nice way to extend what I’ve already got!
With Google Fit supplying step count and weight, it was time to see what other data I could pull. Goodreads has an API that is vaguely maintained, which allowed me to pull data about what I’ve read/am reading relatively easily, so that was simply integrated. Their API returns XML, which is a little annoying, so a bit of a refresher was in order to figure out how to extract what I wanted from the response.
I tidied up the UI by using Flexbox, which is an incredible way to do flexible layouts. This was deeply satisfying.
After that, I decided to explore Strava. I’d never actually used Strava for run tracking, but everyone else does. I gave it go, and it seemed fine, so I set about transferring my historical runs from Runkeeper, and then poking around at Strava’s API.
This is what poked me into actually learning OAuth, which, it turned out, wasn’t a huge deal. At first I was just using the requests library to do everything manually: posting the client id and client secret to get an access token, and then using the access token to actually get the data I wanted. But then I started to worry about performance.
Page load time was suffering. For the stats I was getting (reading from Google Sheets, Reading from Strava, reading from Goodreads), one page load required seven API calls (two each for Google Sheets and Strava, and one each of my Goodreads stats, of which there were three.
I cached stats for 15 mins in memcached which helped a lot, but first load was still expensive. For a little while, I looked into python’s asyncio functionality in an attempt to do as many of these API calls as possible in parallel. I did end up getting something working, but it was an ugly mess. It made sense to just cut down the number of requests I was doing by storing the OAuth access tokens, instead of requesting them each time.
SQLite is a super easy database to get up and running, and pairing that with SQLAlchemy, things like table creation, and queries were made much easier (particularly if it meant not hardcoding SQL statements…). And since I now had a database, I could properly store OAuth tokens, and add the ability for users (me) to login and do the OAuth2 Authorization Code flows properly, after adding some endpoints in my app. Plus, now that I had a database, I could write a job to poll Google for my step/weight data periodically instead of reading it from a spreadsheet. This meant I could simplify the system by moving logic out of a spreadsheet. Bit of a yikes there, thinking about it now, but hey, it got things off the ground!
I spent a lot of time after that getting rid of my home-spun OAuth flows and replacing them with Authlib which made things a lot cleaner and simpler, although I did end up getting a bit stuck due to the non-standard way that Strava does OAuth (they ask for the client_id and client_secret when requesting an access token).
Google also was occasionally not giving me a refresh token, which turned out to be a known issue but was easily fixed.
Once I had Authlib fully integrated, I added an internal page with authorisation links to the external services I was using, just in case I needed to re-authenticate for whatever reason. This came in very handy.
All throughout this, I spent a lot of time thinking about the right way to structure the app. I refactored, moved things around, pulled APIs and interfaces into something that made a bit more sense, and ended up with a structure that felt a lot cleaner and sensible.
All through this, a lot of things broke, underlining the importance of tests. Particularly when I broke the job that pulled Google Fit data into my database. That was broken for a day, for very obvious reasons, before I noticed. I added some tests. I think I have four right now. I’ll probably add more as time goes on, when I continue to break things 😀 For the testing, I ended up using pytest. It took some time to configure things the way I wanted. I ended up using this blog as a guide, which I also used for advice on structuring the app.
duncan@gillian:~/programming/stats $ tree -C -I 'env|__pyc' -P '.py|.html|.js|*.css'
.
├── instance
│  └── config.py
├── statsapp
│  ├── __init__.py
│  ├── blueprints
│  │  ├── __init__.py
│  │  ├── home
│  │  │  ├── stat_collector.py
│  │  │  └── views.py
│  │  ├── oauth
│  │  │  └── views.py
│  │  └── user
│  │  └── views.py
│  ├── collectors
│  │  ├── goodreads.py
│  │  ├── googlefit.py
│  │  ├── gsheet.py
│  │  └── strava.py
│  ├── config.py
│  ├── models
│  │  ├── __init__.py
│  │  ├── googlefit.py
│  │  ├── oauth.py
│  │  ├── stat.py
│  │  └── user.py
│  ├── oauth_apis
│  │  ├── __init__.py
│  │  ├── goodreads.py
│  │  ├── googlefit.py
│  │  ├── gsheet.py
│  │  └── strava.py
│  ├── static
│  │  └── style.css
│  ├── tasks
│  │  └── pull_recent_googlefit_data.py
│  ├── templates
│  │  ├── authorised_apps.html
│  │  ├── base.html
│  │  ├── index.html
│  │  └── login.html
│  └── tools
│  ├── interactive.py
│  └── util.py
└── tests
├── conftest.py
├── models
│  └── test_googlefit.py
├── tasks
│  └── test_pull_recent_googlefit_data.py
└── tools
└── test_util.py
17 directories, 34 files
At least for the moment, I’m done. I might think about some other APIs I can pull data from. I have thoughts around some simple visualisations for the weight/step data I’m collecting. And I’ve got some ideas around how to not rely on Google Sheets at all, but instead keep all configuration in the database – though it’ll take some fancy UI stuff to pull off what I’m envisioning.
And tests. I’ll probably write some more tests.
Probably.