In mid-April, I attended the Data Universe Conference in New York City. Popping out of Penn Station, fresh bagel in hand, I walked into a battle of ideas (at least if you are a Data Nerd like me). Data Universe brought together a host of speakers covering everything from “Why Data is Not The New Oil, and Is Far More Valuable” (Doug Laney) to “The Business Blueprint for AI-Enabled Analytics” (Shingai Manjengwa). One of my favorite talks came from a colleague. Her nonprofit’s mission is to incorporate public input into AI development through open-source projects, policy support, educational initiatives, and the promotion of AI safety and accessibility.
She spoke about the need for more openness and more voices in our AI future. She had a great quote: “We worry about a future where there are just bookstores and no libraries.” That hit home when it comes to our work at The Training Data Project. AI needs its public libraries. Moreover, it needs the dynamics of libraries: the standards, the methods, the support, and the good thinking of many great “librarians” (policymakers, developers, engineers, and stakeholders). I left thinking about what libraries can teach us about AI, and that many a good librarian would be a great AI whisperer.
I grew up in the library. I love libraries enough that my wife and I were married on the MLK Library rooftop terrace in downtown DC two years ago. I remember my first library card like it was a driver’s license. There was an open road of adventure in navigating the stacks and shelves. The library was more than just a building; it was an on-ramp to discovery and a method for it. The smell of old books, the organized chaos of the Dewey Decimal System, and the whir of microfilm (I’m dating myself) gave the place a sense that any problem could be solved. I learned about marrying content with context among the shelves, when the search for one book led me to ten others. That early experience probably shaped my love of GEOINT, of location as an organizing principle for data.
Democratizing Access
One goal of The Training Data Project is democratizing access to trustworthy AI training data. Just as libraries do not charge for borrowing books, we believe that high-quality training data should be openly available wherever possible.
Libraries were essential for me growing up because buying books was often beyond my family’s means. The library was where I expanded my world for the cost of a couple of late charges. In the same way, many bright minds and companies today might be restricted by the high costs of accessing or curating high-quality AI training data. We aim to change that by advocating for “public libraries” of AI data: places where anyone can access training data without financial barriers. The government can lead the way here, as it did with Open Data (one of my favorite open data sites belongs to the DC Government), and the private sector should lean into this as well.
Open Source Tools
Libraries are a great place to make informed choices. Before you check out a book, you have plenty of opportunities to determine whether it has what you’re looking for. You can go to a shelf nearly risk-free and see if there’s something better before you check anything out. With something as transformative as AI, and something as important as the fuel that powers AI, why would we want anything less?
Open access reinforces values like transparency and responsibility. Open-source tools can help translate those values into reality and make those AI training data libraries more useful. At The Training Data Project, one of our goals is to bring more open-source tools to data labeling and provide a better method of measuring label quality. A card catalog or a search at a public library has never been proprietary. Tools that allow for independent test and evaluation of labels, tools that measure trust, should not be either. Whether you search for and “skim” a single training dataset or dive into several, we believe you should easily know the value, relevance, and risks of any training dataset given what you are trying to accomplish.
Like a library card, open-source tools would allow you to “check out” a training dataset in multiple ways. As a result, trust in AI could begin at the foundations, through fully understanding and evaluating the quality of your labels and where they fit in the risk spectrum. These tools can also help you determine whether the data was sourced ethically and responsibly.
Guidance and Best Practices
Librarians are the unsung heroes of the data and information world. Search engines have revolutionized our ability to put vast amounts of information at our fingertips. They are amazing, but every AI/ML developer knows that models crave novelty. Humans-in-the-loop are often the ones who provide context and help models stretch. There’s a tension, and a balance that has to be struck, between human computation and what can be automated.
Great librarians played a huge role in helping me define and chart a course through a vast world of knowledge. Their expertise in managing and understanding complex information was invaluable. They are the Humans-in-the-Loop on a daily basis. At The Training Data Project, we strive to emulate librarians by providing guidance and tools to help everyone navigate the complexities of AI data, ensuring it is sourced and used responsibly and ethically.
An article I was reading about information science recently reminded me why libraries and librarians are so important: “I love being a librarian because my services as a librarian are a way to serve my nation as strong libraries build strong researchers and informed researchers build strong nations.” This reflects our mission at The Training Data Project: planting seeds for well-informed AI practitioners who can contribute to a strong, ethical future for AI technology.
Not Forgetting the Great Bookstores
Admittedly, I’m a card-carrying rewards member at Barnes and Noble and Politics and Prose (a favorite DC bookstore). I could probably buy a house with what I have spent on books on Amazon. Knowledge and access to knowledge are a public good. Many a bookstore employee has guided me to a new novel that I never would have picked on my own, and I trust them now to help me find new books each time I walk in.
That’s the key: trust. How do we grow it around AI? How do we plant those seeds? Between 1886 and 1919, Andrew Carnegie’s donations of more than $40 million paid for 1,679 new library buildings in communities large and small across America. Think of the impact that likely made, and the lives it influenced.
Our work at The Training Data Project is not just about access; it’s about building transparency and fostering trust in AI. Just as public libraries have empowered generations by making knowledge accessible, we aim to empower today’s and tomorrow’s innovators by making AI training data available to all. By creating a more equitable AI landscape, we aim to pave the way for a burst of creativity and innovation, much like the one that transformed societies when libraries first opened their doors to the public.