JOURNOS NEWS

AI Turns to Libraries: Harvard and BPL Open Their Vaults for Training Data

From Books to Bots: How Historic Libraries Are Powering the Next AI Revolution

by The Daily Desk
June 12, 2025
in AI & Machine Learning, Digital Archives, Educations, Libraries & Archives, Tech and Education
AI’s New Brain Food? Centuries-Old Books From Harvard and Boston Libraries - AP Photo/Charles Krupa

Why Harvard’s Ancient Book Collection Could Shape the Future of AI - AP Photo/Charles Krupa

AI Chatbots Are Hungry for More Knowledge — And Libraries Are Answering the Call

CAMBRIDGE, Mass. — The internet may have been AI’s playground until now, but the next frontier for training large language models is much older and dustier: the library.

In a groundbreaking move, Harvard University is releasing a massive dataset of nearly one million books, some dating back to the 15th century, to support the development of artificial intelligence. These works — scanned from Harvard’s library stacks and covering 254 languages — could dramatically expand the depth and cultural range of AI training data.

But Harvard isn’t alone. The Boston Public Library and other historic institutions are also opening up their archives, offering everything from 19th-century newspapers to government records, all in hopes of shaping a more responsible and representative generation of AI tools.

Why Libraries?

With the makers of AI chatbots such as ChatGPT and Meta’s Llama facing both headlines and lawsuits over their use of copyrighted content, tech firms are now turning to public domain material to reduce their legal exposure. Libraries, long-time stewards of human knowledge, are stepping in to help.

“It is a prudent decision to start with public domain data because that’s less controversial,” said Burton Davis, deputy general counsel at Microsoft.

This new approach is also a way for libraries to regain some agency in the AI era. As Aristana Scourtas from Harvard’s Library Innovation Lab put it:

“We’re trying to move some of the power from this current AI moment back to these institutions. Librarians have always been the stewards of data.”

What’s in the Stack?

The new dataset, called Institutional Books 1.0, contains over 394 million scanned pages — including handwritten manuscripts, literary works, legal texts, agricultural manuals, and scientific treatises.

One of the oldest pieces? A 1400s Korean manuscript on growing flowers and trees.

The collection amounts to roughly 242 billion tokens, a large figure in absolute terms but a small one next to the multi-trillion-token scale of current AI models. For context, Meta says its latest AI model was trained on over 30 trillion tokens spanning text, images, and video, which puts the Harvard release at under 1 percent of that total.

Legal Trouble Meets Open Knowledge

Tech companies like Meta and OpenAI are already embroiled in lawsuits for using copyrighted works — some allegedly scraped from “shadow libraries” of pirated books. Authors like Sarah Silverman and other creators argue their intellectual property was used without consent.

Now, companies are pivoting. OpenAI, for instance, recently gave $50 million to academic institutions including Oxford’s Bodleian Library, funding digitization efforts that align with public interest.

Jessica Chapel of Boston Public Library emphasized transparency:

“We’ve been very clear that, ‘Hey, we’re a public library.’ Our collections are held for public use, and anything we digitized as part of this project will be made public.”

Digitization Meets AI Innovation

Turning old texts into machine-readable data is no small feat. Libraries have spent years scanning and organizing archives, such as the French-language newspapers of New England that once served immigrant communities and are now valuable to AI researchers.

Digitization of the Harvard books began in 2006 as part of Google’s controversial book scanning project, which faced years of copyright litigation until 2016, when the U.S. Supreme Court let stand lower-court rulings in Google’s favor.

Now, with public domain volumes retrieved from Google Books, the newly cleaned and structured dataset is being released to the public on Hugging Face, a platform that hosts datasets and open-source AI models.

Why This Matters

The Harvard corpus is linguistically rich — less than half is in English — with significant representation in French, German, Latin, Spanish, and Italian. That diversity helps address long-standing criticisms about the narrow cultural lens of many AI models trained mostly on English-language web content.

More importantly, older texts provide deep context on how humans reason, argue, and explain — foundational skills that AI still struggles to master.

“You have a lot of pedagogy around what it means to reason,” said Greg Leppert, director of the Institutional Data Initiative. “You have a lot of scientific information about how to run processes and how to run analyses.”

The Fine Print: Old Data, Old Problems

Not everything in these books is golden. There’s also harmful and outdated content — from debunked medical theories to explicitly racist texts.

That’s why Harvard’s team is including ethical guidance for AI developers, helping them identify and mitigate risks.

“When you’re dealing with such a large data set, there are tricky issues around harmful content and language,” said Kristi Mukk from Harvard’s Library Innovation Lab.

The Takeaway

Libraries are becoming unlikely power players in the AI revolution — not by creating new data, but by preserving and sharing the knowledge we’ve already gathered over centuries.

As tech giants scramble to build the next generation of intelligent machines, it turns out the wisdom they need may already be resting on a library shelf.

Source: AP News – AI chatbots need more books to learn from. These libraries are opening their stacks
