Journos News
Thursday, October 30, 2025

AI Turns to Libraries: Harvard and BPL Open Their Vaults for Training Data

From Books to Bots: How Historic Libraries Are Powering the Next AI Revolution

By The Daily Desk
June 12, 2025
in AI & Machine Learning, Digital Archives, Educations, Libraries & Archives, Tech and Education
AI’s New Brain Food? Centuries-Old Books From Harvard and Boston Libraries - AP Photo/Charles Krupa

Why Harvard’s Ancient Book Collection Could Shape the Future of AI - AP Photo/Charles Krupa


AI Chatbots Are Hungry for More Knowledge — And Libraries Are Answering the Call

CAMBRIDGE, Mass. — The internet may have been AI’s playground until now, but the next frontier for training large language models is much older and dustier: the library.

In a groundbreaking move, Harvard University is releasing a massive dataset of nearly one million books, some dating back to the 15th century, to support the development of artificial intelligence. These works — scanned from Harvard’s library stacks and covering 254 languages — could dramatically expand the depth and cultural range of AI training data.

But Harvard isn’t alone. The Boston Public Library and other historic institutions are also opening up their archives, offering everything from 19th-century newspapers to government records, all in hopes of shaping a more responsible and representative generation of AI tools.

Why Libraries?

With AI chatbots like ChatGPT and Meta’s LLaMA dominating headlines — and lawsuits — over their use of copyrighted content, tech firms are now looking to public domain material to avoid legal landmines. Libraries, long-time stewards of human knowledge, are stepping in to help.

“It is a prudent decision to start with public domain data because that’s less controversial,” said Burton Davis, deputy general counsel at Microsoft.

This new approach is also a way for libraries to regain some agency in the AI era. As Aristana Scourtas from Harvard’s Library Innovation Lab put it:

“We’re trying to move some of the power from this current AI moment back to these institutions. Librarians have always been the stewards of data.”

What’s in the Stack?

The new dataset, called Institutional Books 1.0, contains over 394 million scanned pages — including handwritten manuscripts, literary works, legal texts, agricultural manuals, and scientific treatises.

One of the oldest pieces? A 1400s Korean manuscript on growing flowers and trees.

This treasure trove amounts to roughly 242 billion data tokens, making it a significant — though still modest — addition to the multi-trillion-token scale of current AI models. For context, Meta says its latest AI model was trained on over 30 trillion tokens spanning text, images, and video.

Legal Trouble Meets Open Knowledge

Tech companies like Meta and OpenAI are already embroiled in lawsuits for using copyrighted works — some allegedly scraped from “shadow libraries” of pirated books. Authors like Sarah Silverman and other creators argue their intellectual property was used without consent.

Now, companies are pivoting. OpenAI, for instance, recently gave $50 million to academic institutions including Oxford’s Bodleian Library, funding digitization efforts that align with public interest.

Jessica Chapel of Boston Public Library emphasized transparency:

“We’ve been very clear that, ‘Hey, we’re a public library.’ Our collections are held for public use, and anything we digitized as part of this project will be made public.”

Digitization Meets AI Innovation

Turning ancient texts into machine-readable data is no small feat. Libraries have spent years scanning and organizing archives — like French-language newspapers from New England — once vital to immigrant communities and now valuable to AI researchers.

The Harvard books were digitized starting in 2006 as part of Google’s controversial book-scanning project, which faced years of copyright litigation before the U.S. Supreme Court in 2016 let stand lower-court rulings that rejected the infringement claims.

Now, with copyright-expired texts retrieved from Google Books, the newly cleaned and structured dataset is being released to the public on Hugging Face, a platform that hosts datasets and open-source AI models.

Why This Matters

The Harvard corpus is linguistically rich: fewer than half of the volumes are in English, with significant representation in French, German, Latin, Spanish, and Italian. That diversity helps address long-standing criticisms about the narrow cultural lens of many AI models trained mostly on English-language web content.

More importantly, older texts provide deep context on how humans reason, argue, and explain — foundational skills that AI still struggles to master.

“You have a lot of pedagogy around what it means to reason,” said Greg Leppert, director of the Institutional Data Initiative. “You have a lot of scientific information about how to run processes and how to run analyses.”

The Fine Print: Old Data, Old Problems

Not everything in these books is golden. There’s also harmful and outdated content — from debunked medical theories to explicitly racist texts.

That’s why Harvard’s team is including ethical guidance for AI developers, helping them identify and mitigate risks.

“When you’re dealing with such a large data set, there are tricky issues around harmful content and language,” said Kristi Mukk from Harvard’s Library Innovation Lab.

The Takeaway

Libraries are becoming unlikely power players in the AI revolution — not by creating new data, but by preserving and sharing the knowledge we’ve already gathered over centuries.

As tech giants scramble to build the next generation of intelligent machines, it turns out the wisdom they need may already be resting on a library shelf.

Source: AP News – AI chatbots need more books to learn from. These libraries are opening their stacks

This article was rewritten by JournosNews.com based on verified reporting from trusted sources. The content has been independently reviewed, fact-checked, and edited for accuracy, tone, and global readability in accordance with Google News standards.




© JournosNews.com – Trusted source for breaking news, trending stories, and in-depth reports.
All rights reserved.
