Library Journal "Digital Libraries" Columns 1997-2007, Roy Tennant
Please note: these columns are an archive of my Library Journal column from 1997-2007. They have not been altered in content, so please keep in mind some of this content will be out of date.
Doing Data Differently
06/15/2005
Metadata is often created in a time-consuming process by catalogers, digital library technicians, and others. It is then underused in our systems and the systems of those with whom we share the information. But recent experiments by some library organizations indicate that we are only scratching the surface of what our data can do for us.

The American West Project at the California Digital Library (CDL) is aggregating metadata that describes digital objects related to the American West from a couple of dozen libraries, museums, and archives. When we used the Open Archives Initiative Protocol for Metadata Harvesting to gather these records, we discovered such significant metadata problems that I was prompted to write a paper ("Bitter Harvest") as well as a similarly titled column (LJ 7/04, p. 32).

We've normalized

Now, a year later, we have made progress toward developing a set of procedures and software routines to normalize, transform, and enrich these records in new and interesting ways. For example, we discovered that the institutions from which we were obtaining records encoded dates using a wide variety of methods (for examples, see "CDL's OAI Harvesting Infrastructure"). We also discovered that useful dates were not always encoded in date fields within the record. Sometimes dates were embedded within a title field. Dates also appeared in descriptions, subjects, and other fields.

We are writing software routines to normalize the dates in these records and match the normalized dates against a set of time periods that relate to the history of the American West. Because of all this processing, users should have a much richer interaction with the site.

Enriched data

Normalizing is rather straightforward compared with trying to assign broad topic headings to a mass of heterogeneous records with subject headings from different vocabularies or with no headings at all. For this we turned to clustering software.
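
To make the date work described under "We've normalized" more concrete, here is a minimal sketch of the general idea: pull candidate years out of every field of a record, then match them against a list of time periods. It is not the CDL's actual routine. The period names and year ranges, the function names, and the sample record are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical periods for illustration only; not the project's actual list.
PERIODS = [
    ("Gold Rush era", 1848, 1855),
    ("Railroad expansion", 1862, 1890),
    ("Early twentieth century", 1900, 1929),
]

# A four-digit year from 1500 to 2099 that is not part of a longer number.
YEAR = re.compile(r"(?<!\d)(1[5-9]\d\d|20\d\d)(?!\d)")


def extract_years(record):
    """Collect candidate years from every field, not just the date field."""
    years = set()
    for value in record.values():
        for match in YEAR.finditer(value):
            years.add(int(match.group(1)))
    return sorted(years)


def match_periods(years):
    """Return the names of periods that contain at least one of the years."""
    return [
        name
        for name, start, end in PERIODS
        if any(start <= year <= end for year in years)
    ]


# Hypothetical record with dates scattered across several fields.
record = {
    "title": "View of Sacramento, 1849",
    "date": "ca. 1850?",
    "description": "Lithograph reprinted in 1905.",
}

years = extract_years(record)
periods = match_periods(years)
print(years, periods)
```

Note that this sketch treats the reprint year in the description as evidence just as strong as the year in the title. A real routine would have to weigh which field a date came from, and ambiguous cases would probably still need a person to look at them.
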
Clustering software attempts to find associations among records that can then be exploited to assign one or more subject headings to appropriate clusters. We do not yet have a production capability for doing this, but early experiments have been encouraging. Meanwhile, the institutions from which we obtained these records are interested in the possibility of getting the enriched records back.

This desire of institutions to receive their enhanced records back is a familiar one to Diane Hillmann and her team at the National Science Digital Library (NSDL). They have long worked to improve the metadata they received from institutions participating in the NSDL and have recently taken that experience further by suggesting that an appropriate and useful role for metadata aggregators is to provide enhanced metadata for others to use (see "Improving Metadata Quality"). A basic finding of both the NSDL and CDL is that metadata enrichment should rely neither on human procedures alone nor on software procedures alone but rather on a mix that uses the strengths of each to its full potential.

Work that metadata

OCLC provides yet another example of how our metadata can do more for us. Lorcan Dempsey, OCLC's VP of research (see "Making Data Work Harder"), pointed out that a significant characteristic of both Google and Amazon is that they squeeze as much work out of their data as they can, all to create more useful and compelling services. He suggests we could learn a lot from them, as well as from others that are also making data work harder. OCLC is itself "mining" the rich metadata store of WorldCat in interesting ways that may soon show up as useful new services (see "Works 4 You?").

Another example used frequently in this column is RedLightGreen.com, the bibliographic database tailored for the needs of undergraduate students by the Research Libraries Group.
In RedLightGreen, subject headings from the records retrieved by a search are pulled out and put in a prominent location for users to discover easily. Why should this very useful information lie hidden in the metadata and only surface when a user requests to see the full record?

Our rich collections of metadata are underused. Meanwhile, we can employ automated techniques to make our metadata more uniform when it needs to be uniform and richer when it needs to be richer. We must get smarter about metadata in so many ways. Never have the skills of software-savvy catalogers and metadata-savvy software engineers been so essential to the future of our libraries and the users we serve.

__________________________________________________________________

Link List

Bitter Harvest
  www.cdlib.org/inside/projects/harvesting/bitter_harvest.html
CDL's OAI Harvesting Infrastructure
  www.cdlib.org/inside/projects/harvesting
Improving Metadata Quality
  metadata-wg.mannlib.cornell.edu/forum/?date=2005-04-29
Making Data Work Harder
  orweblog.oclc.org/archives/000535.html
RedLightGreen
  Redlightgreen.com
Works 4 You?
  orweblog.oclc.org/archives/000579.html