Building a Reproducible Data Science Project Checklist
The Background
As a social scientist turned data scientist, I am acutely aware of the importance of research documentation, readable and reproducible code, and consistent research pathways that allow for reliable, efficient research outcomes and innovation. However, as a social scientist who never took a single computer science class until my PhD, I became a self-taught programmer who learned to code in a very linear fashion (hello, SAS) without thinking much about code reproducibility or modularization beyond some stellar SAS macro skills.
I worked for over 10 years for a government agency (CDC) and then in academia. In terms of coding efforts, it was a lot of bashing things with a stick and finding modifiable code snippets from Google (which usually led to Stack Overflow and Medium posts, of course). After years of seeing research-based data science code fail to make it into production, I committed myself to taking a more “full-stack” approach to building, testing, and (hopefully) deploying data science solutions. I started to approach every code-based project, whether for a client or “for fun,” as a way to rebuild my data science process around the end-to-end pipeline experience. Of course, this has led down an infinite number of rabbit holes, with seemingly endless choices of pre-built software solutions and blog posts…
“how to structure a data science project”
“building a repeatable data analysis process”
“developing end-to-end ML”
…and the list goes on.
Many software solutions promise an end to your “data science solutions die on PowerPoints” woes. These “ready-made” solutions have not quite worked for me, for two main reasons:
- The idea that ANYONE can just spin up an automated machine learning model and push it into a production environment or application is slightly unnerving. It seems to let folks skip right over the data munging, exploratory data analysis, model-building considerations, etc. that could help prevent pushing bad, incomplete, or biased (all of the above?) data and models into the world. It takes the responsibility off of the data science practitioner. This is really an entire area of discussion in itself, with new books, academic research, opinion pieces, etc. being produced every day. I pretty much started my dissertation from this topic, so we’ll just leave it at that for now.
- Much like no single model is best at all problems, no software solution adequately covers the path from data science solution to production application. Boxed-up solutions also tend to be just that: boxes. It’s difficult to build customized solutions inside a box.
With the increasing number of boxed solutions, blog posts, code packages, GitHub repositories, etc. available each day, it can be difficult and overwhelming to wade through it all. Over time I have selected bits and pieces from various sources that have helped me by:
- improving my coding skills
- increasing my efficiency
- improving my understanding of end-to-end data science pipelines and data engineering
- letting me start building experimental solutions and code with potential end-use products in mind
With all this in mind, here is my current data science project checklist (i.e., data science for dummies). As with most things centered around learning, career skills, and technology in general, this is a living work in progress and represents the first phase of bringing experimental code to modular code to reproducible and shareable code.
Data Science Project Checklist
Even though this checklist covers just the setup of a data science project, it has already saved me a lot of time. Taking just five minutes to follow these steps for each new project allows for consistent work, easily reproduced code and environments, and quick turnaround for minimum viable products (MVPs). This has proved advantageous for internal, client-based, and personal projects alike.
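To make the setup steps concrete, here is a minimal sketch of what a five-minute project scaffold might look like. The directory names, the Python tooling, and the placeholder dependency are my illustrative assumptions, not a prescribed layout:

```shell
# Illustrative scaffold for a new data science project.
# Directory names and tooling choices are assumptions, not a required layout.
PROJECT="example-ds-project"

# Separate raw from processed data, and notebooks from reusable source code
mkdir -p "$PROJECT/data/raw" "$PROJECT/data/processed" \
         "$PROJECT/notebooks" "$PROJECT/src" "$PROJECT/tests"
cd "$PROJECT"

# Version control from day one
git init -q

# Isolated Python environment so the project can be rebuilt anywhere
python3 -m venv .venv

# Pin dependencies in a requirements file (placeholder entry shown)
echo "pandas" > requirements.txt

# A README records the project's purpose and how to run it
echo "# example-ds-project" > README.md
```

The key idea is that every project starts from the same shape, so reproducing an environment later is a matter of cloning the repository and installing from the pinned requirements file.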
As I said, this represents a work-in-progress checklist for a simple project setup. I’ve been expanding it to include packaging code, creating APIs, building containers, running containerized code in the cloud, and application orchestration (i.e., Kubernetes). As always, stay tuned…
I hope this checklist is helpful for others who have felt overwhelmed jumping from research-based to application-based coding practices. Please chime in with any comments or suggestions of your own!