JOURNOS NEWS

AI Turns to Libraries: Harvard and BPL Open Their Vaults for Training Data

From Books to Bots: How Historic Libraries Are Powering the Next AI Revolution

by The Daily Desk
June 12, 2025
AI’s New Brain Food? Centuries-Old Books From Harvard and Boston Libraries - AP Photo/Charles Krupa

Why Harvard’s Ancient Book Collection Could Shape the Future of AI - AP Photo/Charles Krupa


AI Chatbots Are Hungry for More Knowledge — And Libraries Are Answering the Call

CAMBRIDGE, Mass. — The internet may have been AI’s playground until now, but the next frontier for training large language models is much older and dustier: the library.

In a groundbreaking move, Harvard University is releasing a massive dataset of nearly one million books, some dating back to the 15th century, to support the development of artificial intelligence. These works — scanned from Harvard’s library stacks and covering 254 languages — could dramatically expand the depth and cultural range of AI training data.

But Harvard isn’t alone. The Boston Public Library and other historic institutions are also opening up their archives, offering everything from 19th-century newspapers to government records, all in hopes of shaping a more responsible and representative generation of AI tools.

Why Libraries?

With the companies behind AI systems such as OpenAI’s ChatGPT and Meta’s Llama facing both headlines and lawsuits over their use of copyrighted content, tech firms are now looking to public domain material to avoid legal landmines. Libraries, longtime stewards of human knowledge, are stepping in to help.

“It is a prudent decision to start with public domain data because that’s less controversial,” said Burton Davis, deputy general counsel at Microsoft.

This new approach is also a way for libraries to regain some agency in the AI era. As Aristana Scourtas from Harvard’s Library Innovation Lab put it:

“We’re trying to move some of the power from this current AI moment back to these institutions. Librarians have always been the stewards of data.”

What’s in the Stack?

The new dataset, called Institutional Books 1.0, contains over 394 million scanned pages — including handwritten manuscripts, literary works, legal texts, agricultural manuals, and scientific treatises.

One of the oldest pieces? A 1400s Korean manuscript on growing flowers and trees.

This treasure trove amounts to roughly 242 billion tokens, a meaningful but still modest addition next to the multi-trillion-token datasets used to train current AI models. For context, Meta says its latest AI model was trained on over 30 trillion tokens spanning text, images, and video.
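
To put those figures side by side, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above and treats them as approximations. The function names are illustrative, and the Meta figure is used as a lower bound because the company's count also covers images and video, so the comparison is loose at best.

```python
# Figures quoted in the article; all are approximate.
HARVARD_TOKENS = 242_000_000_000   # roughly 242 billion tokens
HARVARD_PAGES = 394_000_000        # roughly 394 million scanned pages
META_TOKENS = 30_000_000_000_000   # "over 30 trillion", used here as a lower bound


def percent_of(part: float, whole: float) -> float:
    """Return `part` expressed as a percentage of `whole`."""
    return 100.0 * part / whole


def tokens_per_page(tokens: float, pages: float) -> float:
    """Return the average number of tokens per scanned page."""
    return tokens / pages


if __name__ == "__main__":
    share = percent_of(HARVARD_TOKENS, META_TOKENS)
    density = tokens_per_page(HARVARD_TOKENS, HARVARD_PAGES)
    print(f"Harvard corpus vs. Meta's stated training data: {share:.2f}%")
    print(f"Average tokens per scanned page: {density:.0f}")
```

By this estimate the Harvard corpus would be under 1 percent of the size Meta cites, at roughly 600 tokens per scanned page on average.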

Legal Trouble Meets Open Knowledge

Tech companies like Meta and OpenAI are already embroiled in lawsuits for using copyrighted works — some allegedly scraped from “shadow libraries” of pirated books. Authors like Sarah Silverman and other creators argue their intellectual property was used without consent.

Now, companies are pivoting. OpenAI, for instance, recently committed $50 million in funding to research institutions including Oxford’s Bodleian Library, supporting digitization efforts that serve the public interest.

Jessica Chapel of Boston Public Library emphasized transparency:

“We’ve been very clear that, ‘Hey, we’re a public library.’ Our collections are held for public use, and anything we digitized as part of this project will be made public.”

Digitization Meets AI Innovation

Turning ancient texts into machine-readable data is no small feat. Libraries have spent years scanning and organizing archives — like French-language newspapers from New England — once vital to immigrant communities and now valuable to AI researchers.

Harvard’s books were digitized starting in 2006 as part of Google’s controversial book scanning project, which faced years of copyright litigation until the U.S. Supreme Court declined to hear an appeal in 2016, letting lower-court rulings in Google’s favor stand.

Now, with public domain volumes retrieved from Google Books, the newly cleaned and structured dataset is being released to the public on Hugging Face, a platform that hosts datasets and open-source AI models.

Why This Matters

The Harvard corpus is linguistically rich — less than half is in English — with significant representation in French, German, Latin, Spanish, and Italian. That diversity helps address long-standing criticisms about the narrow cultural lens of many AI models trained mostly on English-language web content.

More importantly, older texts provide deep context on how humans reason, argue, and explain — foundational skills that AI still struggles to master.

“You have a lot of pedagogy around what it means to reason,” said Greg Leppert, director of the Institutional Data Initiative. “You have a lot of scientific information about how to run processes and how to run analyses.”

The Fine Print: Old Data, Old Problems

Not everything in these books is golden. There’s also harmful and outdated content — from debunked medical theories to explicitly racist texts.

That’s why Harvard’s team is including ethical guidance for AI developers, helping them identify and mitigate risks.

“When you’re dealing with such a large data set, there are tricky issues around harmful content and language,” said Kristi Mukk from Harvard’s Library Innovation Lab.

The Takeaway

Libraries are becoming unlikely power players in the AI revolution — not by creating new data, but by preserving and sharing the knowledge we’ve already gathered over centuries.

As tech giants scramble to build the next generation of intelligent machines, it turns out the wisdom they need may already be resting on a library shelf.

Source: AP News – AI chatbots need more books to learn from. These libraries are opening their stacks

The Daily Desk

J News is a freelance editor and contributor at The Daily Desk, focusing on politics, media, and the shifting dynamics of public discourse. With a decade of experience in digital journalism, Jordan brings clarity and precision to every story.

Copyright © 2024 JournosNews.com All rights reserved.
