Library Journal "Digital Libraries" Columns 1997-2007, Roy Tennant

Please note: these columns are an archive of my Library Journal column from 1997-2007. They have not been altered in content, so please keep in mind some of this content will be out of date.
Doing Data Differently

06/15/2005
   Metadata is often created in a time-consuming process by catalogers,
   digital library technicians, and others. It is then underused in our
   systems and the systems of those with whom we share the information.
   But recent experiments by some library organizations indicate that we
   are only scratching the surface of what our data can do for us.

   The American West Project at the California Digital Library (CDL) is
   aggregating metadata that describes digital objects related to the
   American West from a couple of dozen libraries, museums, and archives.
   When we used the Open Archives Initiative Protocol for Metadata
   Harvesting to gather these records, we discovered metadata problems
   significant enough that I was prompted to write a paper ("Bitter
   Harvest") as well as a similarly titled column (LJ 7/04, p. 32).

   We've normalized

   Now, a year later, we have made progress toward developing a set of
   procedures and software routines to normalize, transform, and enrich
   these records in new and interesting ways. For example, we discovered
   that the institutions from which we were obtaining records encoded
   dates using a wide variety of methods (for examples, see "CDL's
   OAI Harvesting Infrastructure").

   We also discovered that useful dates were not always encoded in date
   fields within the record. Sometimes dates were embedded within a title
   field. Dates also appeared in descriptions, subjects, and other fields.
   We are writing software routines to normalize the dates in these
   records and match the normalized dates against a set of time periods
   that relate to the history of the American West. Because of all this
   processing, the users should have a much richer interaction with the
   site.
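
   As a rough illustration of the idea (not the CDL's actual routines,
   which the column does not show), a minimal Python sketch might pull
   four-digit years out of every field of a record and match them
   against a list of periods. The period names and year ranges, the
   record layout, and the function names are all assumptions made up
   for this example.

```python
import re

# Hypothetical periods for illustration only; the project's real list
# of American West time periods is not given in the column.
PERIODS = [
    ("mid-19th century", 1840, 1869),
    ("late 19th century", 1870, 1899),
]

# Four-digit years from 1500 through 2099, standing alone as a word.
YEAR = re.compile(r"\b(1[5-9]\d\d|20\d\d)\b")


def extract_years(record):
    """Collect four-digit years from every field of a record dict."""
    years = set()
    for value in record.values():
        years.update(int(y) for y in YEAR.findall(value))
    return sorted(years)


def match_periods(years, periods=PERIODS):
    """Return names of periods that contain at least one of the years."""
    return [name for name, start, end in periods
            if any(start <= y <= end for y in years)]


# Made-up sample record: the date field is empty, but the title and
# description both carry usable years.
record = {
    "title": "Main Street, looking north, 1852",
    "date": "",
    "description": "Photograph reprinted in 1871.",
}
years = extract_years(record)
periods = match_periods(years)
print(years, periods)
```

   A sketch this simple would miss ranges, approximate dates such as
   "ca. 1880s," and two-digit years, which is where most of the real
   normalization effort is likely to go.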

   Enriched data

   Normalizing is rather straightforward compared with trying to assign
   broad topic headings to a mass of heterogeneous records with subject
   headings from different vocabularies or with no headings at all. For
   this we turned to clustering software. Clustering software attempts to
   find associations among records that can then be exploited to assign
   one or more subject headings to appropriate clusters. We do not yet
   have a production capability for doing this, but early experiments have
   been encouraging. Meanwhile, the institutions from which we obtained
   these records are interested in the possibility of getting the enriched
   records back.
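
   To make the clustering idea concrete, here is a minimal sketch under
   stated assumptions. It is not the software used in the project,
   which the column does not name. It swaps in a deliberately simple
   technique: greedy single-pass clustering on the Jaccard similarity
   of word sets. The sample titles, the 0.2 threshold, and the function
   names are invented for illustration.

```python
import re


def tokens(text):
    """Lowercase word set, ignoring words of three letters or fewer."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(texts, threshold=0.2):
    """Greedy single-pass clustering: a record joins the first cluster
    whose seed record is similar enough, otherwise it starts a new one."""
    clusters = []  # each entry: (seed token set, list of record indexes)
    for i, text in enumerate(texts):
        toks = tokens(text)
        for seed, members in clusters:
            if jaccard(seed, toks) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((toks, [i]))
    return [members for _, members in clusters]


# Made-up sample titles standing in for harvested records.
texts = [
    "Gold miners at a mining camp",
    "Mining camp with gold miners and tents",
    "Railroad depot and locomotive",
    "Locomotive at the railroad depot",
]
groups = cluster(texts)
print(groups)
```

   Once records are grouped this way, a person could review each
   cluster and assign it one or more broad headings, rather than
   handling every record individually.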

   This desire of institutions to receive their enhanced records back is a
   familiar one to Diane Hillmann and her team at the National Science
   Digital Library (NSDL). They have long worked to improve the metadata
   they received from institutions participating in the NSDL and have
   recently taken that experience further by suggesting that an
   appropriate and useful role for metadata aggregators is to provide
   enhanced metadata for others to use (see "Improving Metadata
   Quality").

   A basic finding of both the NSDL and CDL is that metadata enrichment
   should rely neither on human-only nor on software-only procedures
   but rather on a mix that uses the strengths of each to its full
   potential.

   Work that metadata

   OCLC provides yet another example of how our metadata can do more for
   us. Lorcan Dempsey, OCLC's VP of research (see "Making Data Work
   Harder"), pointed out that a significant characteristic of both Google
   and Amazon is that they squeeze as much work out of their data as they
   can, all to create more useful and compelling services. He suggests we
   could learn a lot from them, as well as from others that are also
   making data work harder. OCLC is itself "mining" the rich metadata
   store of WorldCat in interesting ways that may soon show up as useful
   new services (see "Works 4 You?").

   Another example used frequently in this column is RedLightGreen.com,
   the bibliographic database tailored for the needs of undergraduate
   students by the Research Libraries Group. In RedLightGreen, subject
   headings from the records retrieved by a search are pulled out and put
   in a prominent location for users to discover easily. Why should this
   very useful information lie hidden in the metadata and only surface
   when a user requests to see the full record?
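
   The general technique of surfacing subject headings from a result
   set can be sketched in a few lines. This is only an illustration,
   not RedLightGreen's implementation, which the column does not
   describe in detail. The record layout, the sample headings, and the
   function name are assumptions.

```python
from collections import Counter


def top_subjects(records, limit=5):
    """Count subject headings across a result set, most frequent first."""
    counts = Counter()
    for record in records:
        # Count each heading once per record, even if it is repeated.
        counts.update(set(record.get("subjects", [])))
    return counts.most_common(limit)


# Made-up search results; real records would come from the catalog.
results = [
    {"title": "Record 1", "subjects": ["Railroads", "Nevada"]},
    {"title": "Record 2", "subjects": ["Railroads"]},
    {"title": "Record 3", "subjects": ["Railroads", "Nevada", "Mines"]},
]
facets = top_subjects(results, limit=2)
print(facets)
```

   The resulting list of headings and counts could be displayed beside
   the search results so users can narrow or redirect a search without
   opening a full record.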

   Our rich collections of metadata are underused. Meanwhile, we can
   employ automated techniques to make our metadata more uniform when it
   needs to be uniform and richer when it needs to be richer. We must get
   smarter about metadata in so many ways. Never have the skills of
   software-savvy catalogers and metadata-savvy software engineers been so
   essential to the future of our libraries and the users we serve.
     __________________________________________________________________

                          Link List
   Bitter Harvest
   www.cdlib.org/inside/projects/harvesting/bitter_harvest.html
   CDL's OAI Harvesting Infrastructure
   www.cdlib.org/inside/projects/harvesting
   Improving Metadata Quality
   metadata-wg.mannlib.cornell.edu/forum/?date=2005-04-29
   Making Data Work Harder
   orweblog.oclc.org/archives/000535.html
   RedLightGreen
   Redlightgreen.com
   Works 4 You?
   orweblog.oclc.org/archives/000579.html