AI Chatbots Are Hungry for More Knowledge — And Libraries Are Answering the Call
CAMBRIDGE, Mass. — The internet may have been AI’s playground until now, but the next frontier for training large language models is much older and dustier: the library.
Harvard University is releasing a massive dataset of nearly one million books, some dating back to the 15th century, to support the development of artificial intelligence. These works, scanned from Harvard’s library stacks and covering 254 languages, could dramatically expand the depth and cultural range of AI training data.
But Harvard isn’t alone. The Boston Public Library and other historic institutions are also opening up their archives, offering everything from 19th-century newspapers to government records, all in hopes of shaping a more responsible and representative generation of AI tools.
Why Libraries?
With AI chatbots like ChatGPT and Meta’s Llama dominating headlines (and lawsuits) over their use of copyrighted content, tech firms are now turning to public domain material to avoid legal landmines. Libraries, longtime stewards of human knowledge, are stepping in to help.
“It is a prudent decision to start with public domain data because that’s less controversial,” said Burton Davis, deputy general counsel at Microsoft.
This new approach is also a way for libraries to regain some agency in the AI era. As Aristana Scourtas from Harvard’s Library Innovation Lab put it:
“We’re trying to move some of the power from this current AI moment back to these institutions. Librarians have always been the stewards of data.”
What’s in the Stack?
The new dataset, called Institutional Books 1.0, contains over 394 million scanned pages — including handwritten manuscripts, literary works, legal texts, agricultural manuals, and scientific treatises.
One of the oldest pieces? A 1400s Korean manuscript on growing flowers and trees.
This treasure trove amounts to roughly 242 billion tokens, the word-sized chunks of text that language models ingest during training. That is a substantial release, yet still modest next to current AI models: Meta says its latest model was trained on more than 30 trillion tokens spanning text, images, and video, which makes the Harvard corpus less than one percent of that diet.
Legal Trouble Meets Open Knowledge
Tech companies like Meta and OpenAI are already embroiled in lawsuits over their use of copyrighted works, some allegedly scraped from “shadow libraries” of pirated books. Comedian and author Sarah Silverman and other creators argue their intellectual property was used without consent.
Now, companies are pivoting. OpenAI, for instance, recently gave $50 million to a group of research institutions, including Oxford’s Bodleian Library, to fund digitization efforts in the public interest.
Jessica Chapel of Boston Public Library emphasized transparency:
“We’ve been very clear that, ‘Hey, we’re a public library.’ Our collections are held for public use, and anything we digitized as part of this project will be made public.”
Digitization Meets AI Innovation
Turning centuries-old texts into machine-readable data is no small feat. Libraries have spent years scanning and organizing archives, such as the French-language newspapers that once served New England’s immigrant communities and are now valuable to AI researchers.
The Harvard books were first digitized starting in 2006 through Google’s controversial book-scanning project, which weathered years of copyright litigation before the U.S. Supreme Court declined to hear a final challenge in 2016, leaving the project intact.
Now, with the public domain texts retrieved from Google Books, the newly cleaned and structured dataset is being released to the public through Hugging Face, a platform for hosting open AI models and datasets.
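For researchers who want to explore the release firsthand, the sketch below shows one plausible way to browse it with the Hugging Face datasets library. The dataset identifier is an assumption based on the project’s name, not a confirmed listing, so check the Hugging Face hub for the actual repository and field names before running.

```python
# Minimal sketch: browsing the Institutional Books release with the
# Hugging Face `datasets` library (pip install datasets).
from datasets import load_dataset

# NOTE: the identifier below is an assumption based on the project's
# name; look up the real listing on huggingface.co before running.
books = load_dataset(
    "institutional/institutional-books-1.0",
    split="train",
    streaming=True,  # stream instead of downloading the full corpus
)

# Peek at a few records to see what metadata each scanned volume carries.
for record in books.take(3):
    print(record.keys())
```

Streaming avoids pulling hundreds of gigabytes onto a laptop; if the release exposes language or date metadata, a researcher could filter on those fields before committing to a larger download.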
Why This Matters
The Harvard corpus is linguistically rich — less than half is in English — with significant representation in French, German, Latin, Spanish, and Italian. That diversity helps address long-standing criticisms about the narrow cultural lens of many AI models trained mostly on English-language web content.
More importantly, older texts provide deep context on how humans reason, argue, and explain — foundational skills that AI still struggles to master.
“You have a lot of pedagogy around what it means to reason,” said Greg Leppert, executive director of the Institutional Data Initiative. “You have a lot of scientific information about how to run processes and how to run analyses.”
The Fine Print: Old Data, Old Problems
Not everything in these books is golden. There’s also harmful and outdated content — from debunked medical theories to explicitly racist texts.
That’s why Harvard’s team is including ethical guidance for AI developers, helping them identify and mitigate risks.
“When you’re dealing with such a large data set, there are tricky issues around harmful content and language,” said Kristi Mukk from Harvard’s Library Innovation Lab.
The Takeaway
Libraries are becoming unlikely power players in the AI revolution — not by creating new data, but by preserving and sharing the knowledge we’ve already gathered over centuries.
As tech giants scramble to build the next generation of intelligent machines, it turns out the wisdom they need may already be resting on a library shelf.
Source: AP News, “AI chatbots need more books to learn from. These libraries are opening their stacks”