Library Journal "Digital Libraries" Columns 1997-2007, Roy Tennant
Please note: these columns are an archive of my Library Journal column from 1997-2007. They have not been altered in content, so please keep in mind some of this content will be out of date.
Metadata's Bitter Harvest
07/15/2004
I recently conducted my first harvest. Not pulling in corn or wheat but bibliographic records. Before long I had nearly 100,000 of them on my laptop, all describing free online resources held by five different libraries. Using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) it was a breeze--anyone could do it with the right software, and there is plenty to choose from.

But I could hardly believe the results. What I had was a pile of metadata problems that in hindsight I should have expected. Certainly those who have created union catalogs could predict some of the issues. Even so, union catalog efforts typically deal with the same type of records (MARC) using the same set of rules (AACR2). What I saw occurs when there is only a very simple format (Dublin Core) and no application rules to speak of. It was a complete mess.

This mess is neither caused nor prevented by the harvesting protocol (OAI-PMH) and the guidelines for its use. The OAI developers specifically created an infrastructure with both a low threshold (a low barrier to implementation and use) and a high ceiling (the opportunity to create much richer interactions among collaborating institutions). It was brilliant, but it also sets up problems if the collaborative community of users doesn't apply a set of common guidelines and practices.

OAI developers clearly expected communities to agree on how to use the harvesting protocol to best effect--by concurring on a richer metadata format for sharing, for example. The problem is that libraries have yet to decide these issues, although movement is beginning. But first it might be helpful to review some of the current metadata problems.

Metadata woes

Data providers (libraries with records to share) make their metadata available for harvesting in segments called "sets."
This allows service providers (those who aggregate records from a variety of repositories for searching) to take smaller chunks rather than the entire pile of records. The problem is that libraries have yet to decide how to create logical and useful sets, and therefore data providers create whatever sets they wish. Some create sets based on item format (text, image, etc.). Others base their sets on administrative units (e.g., university departments). Still others devise sets based on particular collections (which are not always logical subject groupings). A further complication arises when some institutions include page images of text documents in an image set.

As for the metadata records, the only required format is simple (unqualified) Dublin Core. Yet simple Dublin Core is, for many purposes, too simple. For example, at least one institution offers three fields with the same label, but only one is the actionable URL with which to retrieve the object online. Without a qualifier to identify which of the three fields is the appropriate URL, the service provider must guess.

Even the data within the elements can be problematic. Nearly all institutions are dumbing down from a richer internal metadata scheme to simple Dublin Core. In the process, some data can be mapped into the wrong Dublin Core element. Even when the correct data is in the correct element, there can be encoding issues. In the records I retrieved, there were dozens of different methods for encoding dates. One institution might use 1991-10-01, while another uses October 1, 1991.

For more information on metadata problems, see the paper by Naomi Dushay and Diane Hillmann and "Bitter Harvest: Problems & Suggested Solutions for OAI-PMH Data and Service Providers" (available on the California Digital Library [CDL] Harvesting Project web site).

Hopeful signs

Not long after my harvesting epiphany, my colleagues at the CDL and I talked with others experienced with harvesting.
The University of Illinois at Urbana-Champaign, University of Michigan, and Cornell are all experienced at OAI harvesting and their perceptions largely predated and paralleled ours. Many institutions are now working, through the sponsorship of the Digital Library Federation (DLF), to develop best practices and guidelines for both data providers and service providers. A DLF working group will share metadata normalization and transformation tools and techniques and encourage data providers to expose metadata formats richer than simple Dublin Core (e.g., MODS)--always an option with OAI-PMH.

Although there are some problems with effective use of OAI-PMH today, we are still in the early days of understanding how best to implement the protocol. As communities come together to specify how to use the protocol, our users will be much better served.

LINK LIST
California Digital Library Harvesting Project: www.cdlib.org/inside/projects/harvesting
Dublin Core: dublincore.org
Open Archives Initiative: www.openarchives.org
Dushay and Hillmann Paper: www.siderean.com/dc2003/501_Paper24.pdf
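
To make the date-encoding problem from "Metadata woes" concrete, here is a minimal sketch of how a service provider might normalize harvested date values. It is an illustration only, not a tool used in the harvest described above. The names `KNOWN_FORMATS` and `normalize_date` are hypothetical, the short list of patterns is an assumption, and a real harvest with dozens of encodings would need many more.

```python
from datetime import datetime

# Hypothetical (input pattern, output pattern) pairs. This list is an
# assumption for illustration; real harvested data would need far more.
KNOWN_FORMATS = [
    ("%Y-%m-%d", "%Y-%m-%d"),   # 1991-10-01
    ("%B %d, %Y", "%Y-%m-%d"),  # October 1, 1991 (English month names)
    ("%Y-%m", "%Y-%m"),         # 1991-10
    ("%Y", "%Y"),               # 1991
]


def normalize_date(raw):
    """Return a normalized date string, or None if the value is not recognized.

    The output keeps the precision of the input: a bare year stays a bare
    year instead of being padded out to January 1.
    """
    text = raw.strip()
    for in_fmt, out_fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(text, in_fmt)
        except ValueError:
            continue  # this pattern does not match; try the next one
        return parsed.strftime(out_fmt)
    return None  # leave unrecognized values for manual review
```

With this sketch, both "1991-10-01" and "October 1, 1991" come out as "1991-10-01", while a value such as "circa 1991" returns None rather than a guess. Returning None instead of forcing a match is a deliberate choice: a wrong date in an aggregated index is harder to spot than a missing one.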