AI Chatbots Are Hungry for More Knowledge — And Libraries Are Answering the Call
CAMBRIDGE, Mass. — The internet may have been AI’s playground until now, but the next frontier for training large language models is much older and dustier: the library.
Harvard University is releasing a massive dataset of nearly one million books, some dating back to the 15th century, to support the development of artificial intelligence. These works, scanned from Harvard’s library stacks and covering 254 languages, could dramatically expand the depth and cultural range of AI training data.
But Harvard isn’t alone. The Boston Public Library and other historic institutions are also opening up their archives, offering everything from 19th-century newspapers to government records, all in hopes of shaping a more responsible and representative generation of AI tools.
Why Libraries?
With AI chatbots like ChatGPT and Meta’s Llama dominating headlines (and lawsuits) over their use of copyrighted content, tech firms are now turning to public domain material to avoid legal landmines. Libraries, longtime stewards of human knowledge, are stepping in to help.
“It is a prudent decision to start with public domain data because that’s less controversial,” said Burton Davis, deputy general counsel at Microsoft.
This new approach is also a way for libraries to regain some agency in the AI era. As Aristana Scourtas from Harvard’s Library Innovation Lab put it:
“We’re trying to move some of the power from this current AI moment back to these institutions. Librarians have always been the stewards of data.”
What’s in the Stack?
The new dataset, called Institutional Books 1.0, contains over 394 million scanned pages — including handwritten manuscripts, literary works, legal texts, agricultural manuals, and scientific treatises.
One of the oldest pieces? A 1400s Korean manuscript on growing flowers and trees.
This treasure trove amounts to roughly 242 billion tokens, the word-sized chunks of text that language models ingest during training. That is a substantial release, yet still modest next to current AI models: Meta says its latest model was trained on more than 30 trillion tokens spanning text, images, and video, which makes the Harvard corpus less than one percent of that diet.
Legal Trouble Meets Open Knowledge
Tech companies like Meta and OpenAI are already embroiled in lawsuits over their use of copyrighted works, some allegedly scraped from “shadow libraries” of pirated books. Comedian and author Sarah Silverman and other creators argue their intellectual property was used without consent.
Now, companies are pivoting. OpenAI, for instance, recently gave $50 million to a group of research institutions, including Oxford’s Bodleian Library, to fund digitization efforts in the public interest.
Jessica Chapel of Boston Public Library emphasized transparency:
“We’ve been very clear that, ‘Hey, we’re a public library.’ Our collections are held for public use, and anything we digitized as part of this project will be made public.”
Digitization Meets AI Innovation
Turning centuries-old texts into machine-readable data is no small feat. Libraries have spent years scanning and organizing archives, such as the French-language newspapers that once served New England’s immigrant communities and are now valuable to AI researchers.
The Harvard books were first digitized starting in 2006 through Google’s controversial book-scanning project, which weathered years of copyright litigation before the U.S. Supreme Court declined to hear a final challenge in 2016, leaving the project intact.
Now, with the public domain texts retrieved from Google Books, the newly cleaned and structured dataset is being released to the public through Hugging Face, a platform for hosting open AI models and datasets.
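For researchers who want to explore the release firsthand, the sketch below shows one plausible way to browse it with the Hugging Face datasets library. The dataset identifier is an assumption based on the project’s name, not a confirmed listing, so check the Hugging Face hub for the actual repository and field names before running.

```python
# Minimal sketch: browsing the Institutional Books release with the
# Hugging Face `datasets` library (pip install datasets).
from datasets import load_dataset

# NOTE: the identifier below is an assumption based on the project's
# name; look up the real listing on huggingface.co before running.
books = load_dataset(
    "institutional/institutional-books-1.0",
    split="train",
    streaming=True,  # stream instead of downloading the full corpus
)

# Peek at a few records to see what metadata each scanned volume carries.
for record in books.take(3):
    print(record.keys())
```

Streaming avoids pulling hundreds of gigabytes onto a laptop; if the release exposes language or date metadata, a researcher could filter on those fields before committing to a larger download.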
Why This Matters
The Harvard corpus is linguistically rich — less than half is in English — with significant representation in French, German, Latin, Spanish, and Italian. That diversity helps address long-standing criticisms about the narrow cultural lens of many AI models trained mostly on English-language web content.
More importantly, older texts provide deep context on how humans reason, argue, and explain — foundational skills that AI still struggles to master.
“You have a lot of pedagogy around what it means to reason,” said Greg Leppert, executive director of the Institutional Data Initiative. “You have a lot of scientific information about how to run processes and how to run analyses.”
The Fine Print: Old Data, Old Problems
Not everything in these books is golden. There’s also harmful and outdated content — from debunked medical theories to explicitly racist texts.
That’s why Harvard’s team is including ethical guidance for AI developers, helping them identify and mitigate risks.
“When you’re dealing with such a large data set, there are tricky issues around harmful content and language,” said Kristi Mukk from Harvard’s Library Innovation Lab.
The Takeaway
Libraries are becoming unlikely power players in the AI revolution — not by creating new data, but by preserving and sharing the knowledge we’ve already gathered over centuries.
As tech giants scramble to build the next generation of intelligent machines, it turns out the wisdom they need may already be resting on a library shelf.
Source: AP News, “AI chatbots need more books to learn from. These libraries are opening their stacks”