The Label Blog

Data Entrepreneurship – Why I Started The Training Data Project

Data Entrepreneur [definition]: Someone who helps others uncover truth, meaning, and opportunity in data.


Have you ever been in a situation where you thought, “If only we had this, there is so much more we could do”? Having worked in government and now supporting the Defense and Intelligence Community, I’ve been there many times in my career. About 25 years ago, I got the chance to explore one of those “if only” scenarios with some early advanced analytics while in graduate school at Carnegie Mellon in Pittsburgh.

I can still recall that cold October day, sitting in a stuffy computer lab, when my professor asked if anyone was interested in a project working with the Pittsburgh Police. We aimed to explore new and innovative ways to fight crime using GIS (Geographic Information Systems) and other spatial analytical models and tools, including machine learning. It was supposed to be just a semester assignment, but trust me, it was much more than that.

That semester, I launched a career and a personal mission. Every morning, I couldn’t wait to get to work. I read everything, experimented with as many different analytical methods as possible, and had the good fortune to work with some pioneers in spatial and statistical modeling. After I graduated, I went into law enforcement and public safety just so I could make a direct, positive impact in my community. At the heart of that impact was data.

Good data to power our models was crucial but also scarce. I knew that to make a difference, I had to get as close to the data as possible. Like many of my colleagues, I always thought, “If only I had access to more and better data.” Data was fuel, and behind each piece of data was a person.


Fast forward to today. Technology has come a long way. We’re now living in the Age of AI, and with open-source tools and methods widely available, the possibilities are endless. It’s an exciting time, and I can’t wait to see what the future holds.

In the age of Human-Machine Teaming (HMT), the importance of data has only grown. Years ago, we lacked the storage, the available compute, and the easy deployment to the edge that we needed. Nowadays, we have all of that. However, there’s still a constant challenge with data.

High-quality training data that can fuel our HMT work is perpetually hard to find. It’s a serious roadblock for many government and commercial entities. Training data can drive innovation, yet there is never enough of it. Current data labeling initiatives often suffer from limited budgets, proprietary platforms, and a lack of standards, strategy, and interoperability. Finding and nurturing a diverse workforce that can perform labeling tasks fairly and equitably is also a challenge. Together, these problems undermine trust in the entire data labeling process.

The rise of Generative AI has highlighted the massive data challenges in feeding Large Language Models (LLMs). These technologies present so much opportunity but require creative human input to stretch and reach their full potential.


That’s where the Training Data Project (TDP) comes in. Our mission is to help overcome training data challenges. We see training data as critical infrastructure for AI. The data used to train AI must have integrity to ensure that people, the humans in Human-Machine Teaming, benefit from this technology as much as possible. To do that, high-quality training data is not just a goal but a necessity. Through open-source tools, industry standards, accessibility, and transparency, the TDP aims to enhance training data quality across the public and private sectors. We strive to make high-quality data labels a public purpose and a public good for all.

A line often attributed to Mark Twain says it best: “The two most important days in your life are the day you are born, and the day you find out why.” I’ve been very fortunate to find out that, at my core, I’m a data entrepreneur. I started the Training Data Project to help others become one as well, and to uncover truth, meaning, opportunity, and good through data.

We need more data entrepreneurs!

So, what do you think? Are you ready to embrace this era of innovation in training data and see where it takes us? Join us!