My Journey to Data-Centric AI

Dr. Jessica Rudd
5 min read · Jun 27, 2022


I worked as an epidemiologist at the CDC for 10 years. During that time, I was constantly reminded that there are “people in p-values.” There are ongoing debates about the misuse of p-values in research, but the underlying sentiment holds: the data we use in models do not exist in a vacuum. Across domains, much of our data is derived from individuals and communities.

Similarly, individuals and communities collect and curate data with their own interests in mind. When I was a Ph.D. student in Analytics and Data Science, I often urged other students and faculty to apply “garbage in, garbage out” thinking to academic research and applied business analytics. Many of my computer science-focused colleagues took what is now seen as a “model-centric” approach to AI: state-of-the-art research focused on selecting model architectures and training parameters to improve performance on benchmark datasets. It was frustrating to see studies almost wholly focused on code and computational resources while ignoring the value of high-quality data. That frustration led to my dissertation research on a data-centric approach to data science.

Knowledge acquisition in Data Science

The No Free Lunch (NFL) theorem states that no single machine learning algorithm is better than all others on all problems (Wolpert & Macready, 1997). It is therefore common to try multiple models and find the one that works best for a given data set. The goal of data analytics and modeling is not simply to predict the future with the best model. Instead, the goal is to use available information to inform decisions about possible futures and outcomes. The interdisciplinary nature of data science has led to disparate opinions on the nature of inquiry within the field and how best to approach model development. For example, pioneering Microsoft computer scientist Jim Gray proposed data-intensive science as the fourth paradigm of scientific inquiry: growing Big Data availability, new analytical methods, and the computing resources to marry the two suggest we can analyze data without hypotheses and let algorithms find patterns in data where science cannot (Kitchin, 2014).
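To make the NFL point concrete, here is a minimal sketch of that "try multiple models" workflow. It is my own illustration, using scikit-learn and one of its built-in benchmark datasets; neither the library nor the dataset is prescribed by the theorem.

```python
# Compare several model families on one dataset; the NFL theorem says
# none of them will dominate across all possible problems.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "svm_rbf": SVC(kernel="rbf"),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Which candidate wins depends entirely on the data at hand, which is exactly why the data deserves as much attention as the model.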

When the objective in data science becomes finding every possible association and letting the data inform a narrative without oversight, we risk, at best, surfacing patterns that are not meaningful and correlations that are random, with no actual causal association. Worse still, we risk perpetuating harmful patterns embedded in the data. If you look at the clouds long enough, you will undoubtedly see some mythical shapes.
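A quick simulation (my own toy example, built from pure noise rather than real data) shows how easy it is to find such shapes: screen enough unrelated variables against a target and some will look statistically significant by chance alone.

```python
# With enough unrelated variables, some will correlate with the target
# by chance alone: "mythical shapes in the clouds."
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n_samples, n_features = 100, 1000

target = rng.normal(size=n_samples)                   # pure noise
features = rng.normal(size=(n_samples, n_features))   # also pure noise

spurious = [
    j for j in range(n_features)
    if pearsonr(features[:, j], target)[1] < 0.05
]
print(f"{len(spurious)} of {n_features} random features 'significant' at p < 0.05")
# Expect roughly 50 false positives, despite zero real associations.
```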

In contrast, in his assessment of the epistemology of big data, Floridi points out that too much data presents an epistemological problem of its own: deciding what to throw away, what is valuable, and which patterns are worth pursuing (Floridi, 2012). He suggests the technological answer to this epistemological problem is more and better techniques and technologies that effectively shrink Big Data back to a manageable size. Taking the “big” problem away from Big Data brings us back to a world where statistical theory and reasoning are the keys to effective and valid pattern detection. Part of a researcher’s potential success in data science is their ability to deduce which data is valuable, which can be dropped, and which missing data matters. This focus on data quality makes the exercise very similar to traditional data mining, only at a larger scale.

As technology keeps pace with ever-larger amounts of data, “big data” becomes just “data.” The statistical reasoning and mathematical assumptions required to make valid deductions from data remain unchanged, regardless of its relative size. What is “big” today will not be “big” tomorrow, but “bad” data will always present challenges regardless of size. From a data-centric AI perspective, the work of data science is to make data of any size or structure manageable while respecting the underlying statistical requirements for translating that data into meaningful information.

The current interdisciplinary state of data science suffers from an inconsistent approach to modeling strategies, with the complexities of big data and potentially “bad” data inadequately addressed. The literature review in my dissertation focused on the epistemology of data science and the current state of AI research. I found that data science practitioners encounter several significant challenges:

  • Even big data does not exist in a vacuum.
  • What constitutes “big data” is not consistent across domains and is relative to the availability of computational resources.
  • No Free Lunch (NFL) Theorem: no single algorithm is better than all others on all problems.
  • Whether statistically valid or not, the choice of data and analytic strategy affects the results.

Data-centric AI and Mirry

I am passionate about data democratization and improving access to high-quality, representative data. I believe in the power of data science for good: technology products should be built to serve the needs and interests of people rather than people serving technology. The initial source code for Mirry was born out of this passion and curiosity about improving business analytics while protecting personal data. As a social scientist turned data scientist, I bring those social science sensibilities to developing products and leading teams, and that ethos is vital to building Mirry as both a product and an organization.

Mirry is designed to expose enterprise customers to the returns of Next-Gen data for AI. Mirry enables data-centric AI by providing data access, labeling, and augmentation in a single platform. As businesses increasingly turn to artificial intelligence (AI) systems to power their operations, ensuring accuracy and reliability is essential. In business, data collection is often treated as a one-time event: the focus stays on the code, and the data itself is overlooked. This can lead to hundreds of hours wasted fine-tuning a model on faulty data. Taking a more data-centric approach avoids these pitfalls and improves results.

The data-centric approach focuses on developing systematic engineering practices for improving data in reliable, efficient, and organized ways. This includes collecting high-quality training data; preprocessing or cleaning noisy or incomplete data sets; augmenting datasets with machine-generated synthetic examples; selecting appropriate feature engineering techniques; using automated machine learning (AutoML) tools to build models from scratch; and deploying deep learning models in production environments.
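As one small, hedged illustration of what "systematic" can mean in practice (the column names and imputation choices below are hypothetical, not a Mirry API), the same cleaning and scaling steps can be captured in a reusable pipeline instead of one-off notebook fixes:

```python
# A small, repeatable data-preparation step: impute missing values and scale
# features the same way every time, rather than with ad hoc patches.
# Column names here are hypothetical and purely illustrative.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

clean_and_scale = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill gaps with the column median
    ("scale", StandardScaler()),                   # standardize to zero mean, unit variance
])

df = pd.DataFrame({
    "age": [34, None, 29, 41],
    "income": [52000, 61000, None, 48000],
})

# fit_transform learns the medians and scaling parameters, then applies them;
# the same fitted pipeline can later be applied unchanged to new data.
print(clean_and_scale.fit_transform(df))
```

The point is less the specific transforms than the discipline: data preparation becomes a versioned, repeatable artifact rather than a one-time event.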

The Mirry approach to data-centric AI is focused on a suite of feature modules, including:

  • Data Access — Generate synthetic data, improve data security & privacy, and generate test datasets.
  • Data Labeling — Reduce the complexity of high cardinality data & enhance the quality of data labels.
  • Data Augmentation — Balance data, enumerate features, and add simulated events (a generic class-balancing sketch follows below).
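To ground the “balance data” idea, here is a generic sketch of oversampling a minority class. It illustrates the concept only and is not Mirry's implementation.

```python
# A generic illustration of "balance data": oversample the minority class
# so a model does not simply learn to predict the majority label.
# (Illustrative only; not Mirry's implementation.)
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "feature": range(100),
    "label": [0] * 90 + [1] * 10,   # 90/10 class imbalance
})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Resample the minority class with replacement up to the majority size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)
balanced = pd.concat([majority, minority_upsampled])
print(balanced["label"].value_counts())
```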

One of Mirry’s first product offerings focuses on solving the problem of data access via synthetic data, and we’ve just opened it up to general availability.

Interested in testing Mirry’s next-gen data solution? Sign up for our FREE Community Edition and join our Slack Workspace.

You can read more of my blog posts on the Mirry.AI resources page.
