Machine Learning & the Library of Congress Digital Collections

The Library of Congress captures the essence of big data. Each collection within the world’s largest library, which holds some 170 million items and counting, is a data set waiting for analysis. But as with other big data undertakings, siloing gets in the way. Thankfully, machine learning is now being used to tackle that problem.

Consider the Library of Congress’ collection of newspapers. One research fellow hoped to use the Chronicling America collection — millions of digitized newspaper pages, dating from 1789 to 1963 — to compare historical classified ads against contemporary Craigslist posts, to see how patterns of fake ads have taken shape over time. She even signed up for a class to learn Python to pull it off. But it turned out there was no effective way to drill down by category and subset the ads from the bulk of the newspaper text.

A similar problem played out for a graduate student who hoped to use the library’s Wayback Machine-esque web archive of election campaign sites, in order to see whether candidates changed positions after an event. But since crawlers build Web ARChive files by time of site capture, without thematic consistency, queries returned prohibitively large amounts of unrelated data. Plus, files just get massive overall.

“Before a researcher works with web archives, they need to do some significant data reduction,” said Kate Zwaard, former director of digital strategy at the Library of Congress. “And it’s not easy for us to transfer these many-terabytes-large collections to researchers. It costs money.”
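The data reduction Zwaard describes can be pictured with a minimal sketch. It assumes capture records have already been extracted from the archive files into plain dictionaries; the `url`, `timestamp` and `text` keys, the example sites and the `reduce_captures` helper are all hypothetical and do not come from the library's own tooling.

```python
from datetime import datetime

# Hypothetical capture records, ordered by capture time as a crawler would write them.
captures = [
    {"url": "http://candidate-a.example/issues", "timestamp": "2008-09-01T12:00:00",
     "text": "Our plan for energy policy."},
    {"url": "http://candidate-b.example/news", "timestamp": "2008-09-02T08:30:00",
     "text": "Rally schedule."},
    {"url": "http://candidate-a.example/issues", "timestamp": "2008-10-15T09:00:00",
     "text": "A revised plan for energy policy."},
    {"url": "http://candidate-a.example/donate", "timestamp": "2008-10-16T10:00:00",
     "text": "Contribute today."},
]


def reduce_captures(records, url_prefix, keyword, start, end):
    """Keep captures from one site, inside a date window, that mention a keyword."""
    kept = []
    for record in records:
        captured_at = datetime.fromisoformat(record["timestamp"])
        if not record["url"].startswith(url_prefix):
            continue  # different site
        if not (start <= captured_at <= end):
            continue  # outside the period of interest
        if keyword.lower() not in record["text"].lower():
            continue  # unrelated to the research question
        kept.append(record)
    return kept


subset = reduce_captures(
    captures,
    url_prefix="http://candidate-a.example/",
    keyword="energy",
    start=datetime(2008, 8, 1),
    end=datetime(2008, 11, 30),
)
for record in subset:
    print(record["timestamp"], record["url"])
```

On real web archives the same filtering would have to run over terabytes of capture files, which is why the reduction step is costly for both the library and the researcher.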

Collection interfaces are often built item-level, without a window on the whole.

“But if we’re thinking about presenting collections and data at scale, we need to be completely reconsidering a different model,” said Meghan Ferriter, senior innovation specialist at the Library of Congress.

This is where the flexibility of databases and the cloud has come in handy for scaling access to the library’s mass of information.

How the Library of Congress Converts Collections to Data

Digitizing library collections is nothing new. The Library of Congress has been doing so since 1993. It also has a few APIs that allow for subsetting of some smaller data sets. The concept of collections as data, however, is more recent.
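As a rough illustration of API-based subsetting, the sketch below only builds a paged query URL and makes no network call. It assumes the loc.gov convention of adding `fo=json` to a collection URL, with `c` for results per page and `sp` for the page number; the `example-collection` slug and the `collection_query_url` helper are placeholders, and the parameters should be checked against the library's current API documentation before use.

```python
from urllib.parse import urlencode

# Assumed base path for collection pages on loc.gov.
BASE_URL = "https://www.loc.gov/collections"


def collection_query_url(slug, page=1, per_page=25):
    """Build a URL asking for one page of a collection's records as JSON."""
    params = {
        "fo": "json",   # assumed: request JSON instead of HTML
        "c": per_page,  # assumed: number of results per page
        "sp": page,     # assumed: which page of results to return
    }
    return f"{BASE_URL}/{slug}/?{urlencode(params)}"


# "example-collection" is a placeholder slug, not a real collection name.
url = collection_query_url("example-collection", page=2, per_page=50)
print(url)
```

Paging through results this way works for smaller data sets; it does not solve the bulk-transfer problem described earlier for very large collections.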

To a growing extent, institutions like the Library of Congress have made digitized or natively digital collections, and their attendant metadata, available explicitly for computational analysis. But many collections weren’t digitized in a way that centered a big data analytics lens. So museum and library archivists have had to come together in recent years in order to hash out the best framework.

The 2019 “Always Already Computational: Collections as Data” report, co-authored by Laurie Allen, a program analyst in the Library of Congress’ Office of Digital Strategy, covers a range of collections topics. Its content ranges from advice on full-text collection inventories (“document rights status, license status, discoverability and downloadability”) to the characteristics that make a digital collection a good collections-as-data candidate: thorough metadata, strong optical character recognition results, popularity and relevance to other projects.

At the same time, the report also underscored that the concept remains a work in progress. “There’s no clean, shared state of play across collections,” Ferriter said. “Has that changed? I think the answer is no. But our awareness about it has changed. And people are thinking and speaking through the lens of collections as data a lot more.”

That “foundational work” nonetheless helped guide the Library of Congress’ organizational projects, most notably with collection cover sheets, Ferriter said. It’s basically the prep before the prep: documenting what collections contain as extensively as possible to make future data transformation easier.

The “Computing Cultural Heritage in the Cloud” Initiative

While taking stock of such access obstacles, the library’s digital strategy wing is also asking what kinds of research questions and technical approaches are possible only by analyzing large-scale data sets.

To that end, the Library of Congress put out a call for proposals in the autumn of 2020: an agency announcement looking for researchers “to experiment with solutions to problems that can only be explored at scale” via machine learning.

The project, called Computing Cultural Heritage in the Cloud (CCHC), aims to create a virtuous cycle of modeling approaches and complementary data wrangling to advance machine learning research techniques within the library, and for the so-called GLAM (galleries, libraries, archives, museums) sector more broadly.

The three present researchers, who began work in May of 2021 and will continue through December of 2022, currently document their individual findings and share regular updates.

“We’re trying to think about, ‘How do we create a comparable model for those who want to research with digital collections at a broader scale or look across our research collections?’” Ferriter said.

Another target of the project? Reusability. Even when researchers do the hard-slog work of cleaning data for a specific use case, the library currently doesn’t have a great way to reinstitute that cleaned data set back into its collection for future use. That will hopefully change.

“If we have a preliminary set of data transformation that’s a stepping off point for more specific questions, we could then try to host that, and we plan to do that as part of this project,” Ferriter said.

Staff has identified some collections that have strong data-analysis potential and begun detailing cover sheets for each, outlining information about format, size, rights issues — anything that might aid future research on that collection.

“For computational uses especially, researchers want to understand the structure of the collection at multiple levels, as well as its provenance — where it came from and how it has been transformed, as well as what is missing from the data, and what kinds of uses are allowed with it,” Allen wrote in a 2020 blog post that included an early stage sheet template.
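One way to picture a cover sheet as structured data is the sketch below. The `CoverSheet` class, its field names and the sample values are illustrative assumptions based on the attributes mentioned above (format, size, rights, provenance, transformations, gaps, allowed uses); they are not the library's actual template.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CoverSheet:
    """Hypothetical cover sheet describing one collection for computational use."""
    title: str
    formats: list
    size_gb: float
    rights: str
    provenance: str
    transformations: list = field(default_factory=list)  # how the data was changed
    known_gaps: list = field(default_factory=list)        # what is missing
    allowed_uses: list = field(default_factory=list)      # what researchers may do

    def is_open_for_computation(self):
        """Treat only collections with no known restrictions as ready to share."""
        return self.rights == "no known restrictions"


# Placeholder values for illustration only; they describe no real collection.
sheet = CoverSheet(
    title="Example map collection",
    formats=["TIFF", "JPEG2000"],
    size_gb=120.0,
    rights="no known restrictions",
    provenance="Digitized from print originals",
    transformations=["OCR applied to map labels"],
    known_gaps=["Some sheets never digitized"],
    allowed_uses=["research", "bulk download"],
)

serialized = json.dumps(asdict(sheet), indent=2)
print(serialized)
print("Open for computation:", sheet.is_open_for_computation())
```

Keeping the sheet machine-readable means a researcher's script can check rights and known gaps before any data is downloaded.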

Collections with prioritization potential include those that aren’t rights-restricted and those that have historically proven fruitful for researchers. Examples include digitized versions of stereograph cards and the Sanborn maps, which draw a picture of 19th- and 20th-century architecture and city planning and have been used to help understand changes within given settings. The full range of application will be influenced by the concluded CCHC research.

How the Library of Congress Uses Machine Learning Ethically

The Library of Congress is far removed from Silicon Valley geographically, and also temperamentally. A report on the state of machine learning in libraries, commissioned by the Library of Congress and published in 2020, draws a contrast between the “move fast, break things” ethos that defined Big Tech and the methodical approach favored by librarians. The contrast has positive implications for things like setting standards before adopting new technologies, but it’s valuable for larger artificial intelligence ethical issues too.

That paper, and other research conducted or supported by the Library of Congress, explains how algorithmic bias is far more likely to occur in historical collection analysis than in other, commercial-sector uses. As the report notes, even though the capacity for harm within research institutions is far lower than in use cases like credit lending and recidivism likelihood, narrow or biased data sets could produce flawed research or “inadvertently echo and amplify outdated or even offensive subject terms.”

Clear value statements and attention to organizational transparency and staff data fluency are key, but so are more granular details, like establishing — and making readily available — ground truth data sets and benchmarking data, researchers noted. That means, for image data, having metadata that doesn’t just explain what is pictured, but considers a host of possible features, like “digitization source, contrast, skew, noise, range effect, complexity (or a difficulty measure of some sort).” As the report underscores, the technical concerns are inextricably linked with social ones.
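A small sketch shows why that richer metadata matters for benchmarking: with it, a model's accuracy can be broken down by image characteristic instead of being reported as a single number. The items, labels, predictions and the `accuracy_by_feature` helper are toy assumptions, not results from any library project.

```python
from collections import defaultdict

# Toy ground truth: each item carries a label plus descriptive metadata.
ground_truth = [
    {"id": "img-1", "label": "map", "digitization_source": "microfilm"},
    {"id": "img-2", "label": "photo", "digitization_source": "microfilm"},
    {"id": "img-3", "label": "map", "digitization_source": "original"},
    {"id": "img-4", "label": "photo", "digitization_source": "original"},
]

# Toy model output keyed by item id.
predictions = {"img-1": "photo", "img-2": "photo", "img-3": "map", "img-4": "photo"}


def accuracy_by_feature(items, predicted, feature):
    """Return accuracy for each distinct value of one metadata feature."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for item in items:
        group = item[feature]
        totals[group] += 1
        if predicted.get(item["id"]) == item["label"]:
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}


results = accuracy_by_feature(ground_truth, predictions, "digitization_source")
for group, accuracy in results.items():
    print(f"{group}: {accuracy:.2f}")
```

In this toy data the overall accuracy hides the fact that every error falls on microfilm-sourced images, which is the kind of skew a shared benchmark with detailed metadata is meant to expose.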

That level of concern is in keeping with libraries’ long record of prudence, Zwaard said.

“There’s a long, well-developed history of professional ethics,” she said. “Libraries have been serving sensitive materials for centuries and have a really good understanding of what’s appropriate.”

The project, whose advisory board includes equity advocate and Algorithms of Oppression author Safiya Noble, advises applicants to propose projects “that are diverse in terms of their topics, their approaches, and the required collections.”

Of course, machine learning’s greatest promise in cultural heritage research — both broadly and within the Library of Congress projects — is that it might reveal connections and insights that are inadvertently hidden within the stacks.

“One of the most exciting things about looking across collections at scale,” Ferriter said, “is that we can see stories and surface patterns that were not as available, or not as prominently presented to the public.”
