8. Possible Approaches to Federating Archival Description from Multiple Repositories

Researchers face many challenges in identifying and gaining access to archival holdings distributed at archives and special collections across the United States. Many archives have not described all of their collections or made that information available online. Even if archival description is online, researchers have to look in several places to find relevant resources, searching MARC records in WorldCat, MARC and EAD records in ArchiveGrid, National Union Catalog of Manuscript Collections (NUCMC) records in Archives USA, EAD finding aids aggregated in regional repositories such as Online Archive of California and TARO, and/or finding aids provided through the Web sites of particular archives. In order to facilitate discovery of archival resources, the CLIR Hidden Collections Program aims to provide a federated catalog drawing from multiple repositories. As the 2008 program description states, "The records and descriptions obtained through this effort will be accessible through the Internet and the Web, enabling the federation of disparate, local cataloging entries with tools to aggregate this information by topic and theme." Archivists whom I interviewed recognize the value of aggregating information from multiple repositories. As one interviewee noted, "We just have to federate—there really isn't a reason to stop at the stage of putting things on the Web. The point of EAD was not to put finding aids online, but to share, to get everyone together, to do things across a collection. If we don't make the step forward to sharing, we might as well be using HTML."

However, federating archival descriptions poses some significant challenges. For one thing, an appropriate technical infrastructure needs to be developed, perhaps leveraging OAI-PMH or RDF (Resource Description Framework). A federated catalog needs to be flexible enough to accommodate the diverse data generated by archives, yet rigorous enough to present data in a standard format. Options for federating archival data include:

Make MARC and EAD available through a national/international service such as ArchiveGrid, Archives USA, or Archives Hub.
OCLC's ArchiveGrid³⁵ includes archival information from thousands of archives in the United States, the United Kingdom, Germany, Australia, and other countries. ArchiveGrid draws from two main data streams: archival records in WorldCat (about 90 percent of the total records) and finding aids harvested from contributing institutions.³⁶ These finding aids can be written in EAD, HTML, or plain text. To set up the harvesting, OCLC asks contributors to point to a Web site of finding aids that can be crawled. The crawler brings over the text of the finding aid, parses it so that it maps to ArchiveGrid's record structure, and adds it to the index. For harvested finding aids, ArchiveGrid links from its search results to the full finding aid on the contributor's Web site, much as a Google result links to its source page. Thematic collections are not currently represented; ArchiveGrid does not yet have consistent topical categories to apply across its varied contributions, but that could change. Archives pay nothing to contribute records to ArchiveGrid, but access to the full records in ArchiveGrid is available only through a subscription. However, through OpenWorldCat, researchers can access a large subset of archives' MARC records that are also available through ArchiveGrid. It is possible that an archival counterpart to the freely available OpenWorldCat (an "Open ArchiveGrid") could be developed so that a subscription would not be required. One archivist reported satisfaction with ArchiveGrid: "ArchiveGrid is harvesting our EAD files. ... It seems to be gathering those OK."
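The crawl, parse, map, and index sequence described above can be illustrated with a minimal sketch. It is not OCLC's implementation: the record fields (`url`, `title`, `text`), the function names, and the example URL are all hypothetical, and the "index" is a simple in-memory word-to-URL mapping that uses only the Python standard library.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the <title> and the remaining text of an HTML finding aid."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif data.strip():
            self.chunks.append(data.strip())


def to_record(url, html):
    """Map a harvested page onto a hypothetical common record structure."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"url": url, "title": parser.title.strip(), "text": " ".join(parser.chunks)}


def add_to_index(index, record):
    """Add every distinct word of the record to a word -> set-of-URLs index."""
    words = re.findall(r"[a-z0-9]+", (record["title"] + " " + record["text"]).lower())
    for word in set(words):
        index[word].add(record["url"])


def search(index, word):
    """Return the URLs of finding aids containing the word, linking back to the contributor."""
    return sorted(index.get(word.lower(), set()))


# Assumed sample page standing in for a contributor's finding aid.
page = (
    "<html><head><title>Example Family Papers</title></head>"
    "<body><h1>Finding aid</h1><p>Letters and photographs, 1901-1910.</p></body></html>"
)
index = defaultdict(set)
record = to_record("https://archive.example.org/findingaids/0001.html", page)
add_to_index(index, record)
```

A search result here carries only the contributor's URL, which mirrors the practice of linking out to the full finding aid rather than hosting a copy of it.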

Another aggregation model is provided by Archives Hub, the United Kingdom's "national gateway to descriptions of archives in UK universities and colleges."³⁷ Supported by Mimas, "a JISC and ESRC [Economic and Social Research Council]-supported national data centre" for higher education,³⁸ Archives Hub offers a distributed model for aggregating content from individual archives. Archives can become "spokes," enabling them to retain control over their data and provide a custom search interface to their collections while also making their content available through a common interface (Archives Hub 2008). Archives Hub is built on the Cheshire full-text information retrieval system, which includes a Z39.50 server. Archives Hub focuses on higher education institutions in the United Kingdom, but will accept contributions from other relevant repositories. (Nevertheless, it is probably more appropriate as a model than as a repository for U.S. finding aids.)

ProQuest's Archives USA "is a current directory of over 5,500 repositories and more than 161,000 collections of primary source material across the United States."³⁹ It provides online access to the NUCMC from 1959 to the present, name and subject indexes from the National Inventory of Documentary Sources (NIDS) in the United States, and collection descriptions contributed by archives. Like ArchiveGrid, Archives USA allows repositories to contribute finding aids at no cost but requires a subscription for access.

Harvest EAD from distributed repositories through OAI-PMH, Atom, or another technology.
Existing technologies such as OAI-PMH⁴⁰ and Atom⁴¹ support harvesting and aggregating content from distributed repositories. The University of Illinois at Urbana-Champaign (UIUC) has already developed preliminary OAI services and tools to harvest information from EAD and other sources.⁴² As UIUC found, exposing EAD through OAI-PMH poses several challenges: mapping a single EAD file to multiple OAI records; the variability of EAD-encoding practices; the complex hierarchical structure of EAD finding aids; and contextualizing individual results within the overall hierarchy (Prom and Habing 2002). Illinois experimented with "a schema that produces many DC [Dublin Core] metadata records from a single EAD file," producing a collection-level record that linked to the EAD finding aid as well as providing links to related parts of the collection (Cole et al. 2002). Archon is now experimenting with harvesting finding aids from a static directory via OAI-PMH, but nothing has been released yet. Other archival management systems, including CALM for Archives, Archeevo, MINISIS M2A, and Adlib Archive, already provide support for OAI. The Florida Center for Library Automation (FCLA) is also exploring the use of OAI-PMH to harvest EAD from registered provider sites (Florida Center for Library Automation 2008). While Kathy Wisser was at the North Carolina Echo Project, she developed a proof-of-concept distributed repository using the Internet Archive's Heritrix Web crawler and XTF as the indexer.
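The challenge of mapping one EAD file to many records, while keeping each record in context, can be made concrete with a small sketch. It is not the UIUC schema or tooling: the sample finding aid is a heavily simplified, namespace-free EAD fragment invented for illustration, and the output field names (`identifier`, `title`, `type`, `relation`) are only loosely modeled on Dublin Core elements. Each component becomes its own record, and the `relation` field carries the titles of its ancestors so that a hit on a single folder can still be read within the collection hierarchy.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified EAD fragment (real finding aids are far more varied).
SAMPLE_EAD = """
<ead>
  <eadheader><eadid>example-0001</eadid></eadheader>
  <archdesc level="collection">
    <did><unittitle>Example Family Papers</unittitle></did>
    <dsc>
      <c01 level="series">
        <did><unittitle>Correspondence</unittitle></did>
        <c02 level="file">
          <did><unittitle>Letters, 1901-1910</unittitle></did>
        </c02>
      </c01>
      <c01 level="series">
        <did><unittitle>Photographs</unittitle></did>
      </c01>
    </dsc>
  </archdesc>
</ead>
"""


def component_children(node):
    """Return the child components (<c> or <c01>..<c12>) of a description node."""
    container = node.find("dsc")
    if container is None:
        container = node
    return [
        child for child in container
        if child.tag == "c" or (child.tag.startswith("c") and child.tag[1:].isdigit())
    ]


def ead_to_records(ead_xml):
    """Flatten one EAD file into one record per level of description."""
    root = ET.fromstring(ead_xml.strip())
    ead_id = root.findtext("eadheader/eadid", default="").strip()
    records = []

    def walk(node, ancestors, path):
        title = node.findtext("did/unittitle", default="").strip()
        records.append({
            "identifier": ead_id + path,         # position of the component in the file
            "title": title,
            "type": node.get("level", ""),
            "relation": " > ".join(ancestors),   # context within the hierarchy
        })
        for number, child in enumerate(component_children(node), start=1):
            walk(child, ancestors + [title], f"{path}/{number}")

    walk(root.find("archdesc"), [], "")
    return records


records = ead_to_records(SAMPLE_EAD)
for rec in records:
    print(rec)
```

The sketch assumes every component has a `<did><unittitle>` and a `level` attribute; the variability of encoding practice noted above is precisely what makes that assumption unsafe for real-world harvesting.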

Adopt an archival management system that supports federation.
ICA-AToM is being designed to support harvesting and syndication via OAI-PMH and the IETF Atom Publishing Protocol. According to its Web site, "it can be set up as a multi-repository 'union list' accepting descriptions from any number of contributing institutions." Software such as ICA-AToM could perhaps be adopted to provide a union list, although such a solution may not be flexible enough to accommodate the varied methods archives use to deliver archival information.

FOOTNOTES FOR SECTION 8

³⁵ http://archivegrid.org/

³⁶ Author's interview with Bruce Washburn, consulting software engineer for RLG Programs, July 1, 2008.

³⁷ http://www.archiveshub.ac.uk/index.html

³⁸ http://www.mimas.ac.uk/

³⁹ http://archives.chadwyck.com/marketing/about.jsp

⁴⁰ http://www.openarchives.org/

⁴¹ http://www.atomenabled.org/

⁴² http://oai.grainger.uiuc.edu/
