I combine two very different approaches to time series forecasting and apply them to a dataset of air pollution in Beijing. I use Prophet to build a univariate additive regression model, then show that it performs similarly to a shallow neural network made with fast.ai. I devise a plan to give the...
[Read More]
Reproducible data science with Docker and Luigi
The case of arsenic and fluoride in Mexican drinking water
I describe a workflow that uses Docker and Luigi to create fully transparent and reproducible data analyses. End users can repeat the original calculations to produce all the final tables and figures starting from the original raw data. End users (and the author, at a later date) can easily make...
[Read More]
fast.ai Deep Learning vs XGBoost on tabular data
The case of broken water pumps in Tanzania
I use the fast.ai deep learning library for one of its newest applications: predictive modeling on tabular data. I compare its performance against the incumbent best tool in the field, gradient boosting with XGBoost, as well as against various scikit-learn classifiers. Despite recent prominence on other tabular datasets, in this...
[Read More]
RNA secondary structure in the flu virus
Using entropy and mutual information to find structure in the genome of Influenza A
The influenza virus, in its many strains, is responsible for everything from seasonal flu to the occasional pandemic. It is an RNA virus whose genome is split into 8 strands, each neatly wrapped around packaging proteins in order to fit in the virus’s envelope. While most of the RNA genome...
[Read More]
The GeoPandas Cookbook
Simple recipes for beautiful data maps
For anyone used to data science with pandas, GeoPandas is the simplest way to perform geospatial operations and (most importantly) visualize your geographic data. GeoPandas saves you from needing to use specialized spatial databases such as PostGIS. This cookbook contains the recipes that I’ve found myself using over and over...
[Read More]