:: Digital Libraries Columns


Library Journal "Digital Libraries" Columns 1997-2007, Roy Tennant

Please note: these columns are an archive of my Library Journal column from 1997-2007. They have not been altered in content, so please keep in mind some of this content will be out of date.

The Murky Bucket Syndrome


   Two recent, unrelated events put into stark focus the major challenges
   we have ahead of us if we want to serve our users as they expect and

   The first event arose when I innocently reported that two records for
   the same journal in our union catalog failed to merge. These were not
   old, battle-scarred records that had survived numerous title changes or
   the other disasters that plague serial records. These were records
   created for Respiratory Research, founded in 2000 by BioMed Central. As
   it turned out, these records represent an issue that is going to hound

   One per format

   The problem is this: the journal has both a print and an online
   version. One of our libraries cataloged it as an electronic journal
   (creating separate records for different versions), and the others
   cataloged it with one joint print and online record. Our merging
   algorithm failed since the print and e-forms have different ISSNs.
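
   The failure can be pictured with a minimal sketch. It assumes the
   merge step simply clusters records on an exact ISSN match; the
   union catalog's actual algorithm is not described here and is
   surely more involved. The record ids and ISSN values are made-up
   placeholders, and the choice to put the print ISSN on the joint
   records is likewise an assumption.

```python
from collections import defaultdict

# Hypothetical records for one journal. The ISSN values are placeholders,
# not the journal's real ISSNs.
records = [
    # Cataloged as an electronic journal: carries the (placeholder) e-ISSN.
    {"id": "library-A-online-only", "issn": "0000-0001"},
    # Joint print-and-online records: assumed to carry the print ISSN.
    {"id": "library-B-print-and-online", "issn": "0000-0002"},
    {"id": "library-C-print-and-online", "issn": "0000-0002"},
]


def merge_by_issn(records):
    """Cluster records whose ISSN strings are identical."""
    clusters = defaultdict(list)
    for record in records:
        clusters[record["issn"]].append(record["id"])
    return dict(clusters)


clusters = merge_by_issn(records)
for issn, ids in clusters.items():
    print(issn, ids)
```

   Under these assumptions the one journal ends up in two clusters,
   because nothing in the match key connects the e-ISSN to the print
   ISSN.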

   Both methods of cataloging have proponents and opponents, benefits and
   drawbacks. In any case, as I was informed, it is a problem seemingly
   without solution. Even within one university we cannot agree about how
   to handle issues like this consistently.

   But the main points are: 1) our cataloging infrastructure is rife with
   these problems; 2) such problems have unfortunate consequences for
   users; and 3) our existing infrastructure appears to be inadequate to
   solve this problem and others like it. Certainly this is not unique to
   my institution. Unfortunately, it is systemic.

   New needs for old data

   The other incident that helped frame this problem was an email exchange
   with Lorcan Dempsey, vice president for research, OCLC, Inc. He
   responded to my column "The Trouble with Online" (LJ 9/15/04, p.
   26), which chronicled the inability of library users to limit their
   catalog search to only online materials (another indication that our
   systems are hopelessly inadequate).

   Referring to this difficulty, Dempsey described the " 'murky bucket
   syndrome' that affects any large bibliographic database--we cannot
   entirely, unambiguously slice and dice the database because of historic
   data entry and cataloging practices that...were not oriented toward our
   new needs."

   In going forward, Dempsey believes we "need to think about not just
   sharing data but extracting as much value as we can from it through
   processing." A prime example of extracting value from existing data is
   OCLC's FictionFinder service, which mines the MARC record for fiction
   genre information and uses it to provide a readers' advisory service.

   But as I was soon to discover with the Respiratory Research example, the
   effects of the murky bucket syndrome are not limited to the "slicing"
   problem. As Dempsey put it, "this issue is now cropping up all over the
   place. As we try to do things programmatically, the structure and
   content practices really matter in ways they might not have before
   (FRBRization, data mining, etc.).... Increasingly, I think we need to
   look at cataloging practices in light of the new world of programmatic
   uses." For more from Dempsey, visit his web log, available through the
   online version of this article.

   Some signs of hope

   Recent research points out that the situation may not be quite as
   daunting as we think. In a paper at a recent Dublin Core conference
   ("Assessing Metadata Utilization") Bill Moen at the University of North
   Texas School of Library and Information Sciences revealed that only a
   small number of elements accounts for the vast majority of occurrences
   in a test dataset of 400,000 WorldCat MARC records, and fewer than half
   of the nearly 2,000 fields/subfields currently defined in MARC 21
   occurred in even one of the records in the test set. In other words,
   despite the size and complexity of the MARC record, much of it is
   little used.
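
   The kind of tally behind such a finding can be sketched briefly.
   This is an illustration only, not Moen's method: the three toy
   records, the (tag, subfield codes) representation, and the
   function names element_usage and elements_covering are all
   assumptions made for the example.

```python
from collections import Counter

# Toy records: each is a list of (tag, subfield codes) pairs.
# The data is invented for illustration, not drawn from WorldCat.
records = [
    [("245", "ac"), ("260", "abc"), ("650", "a")],
    [("245", "a"), ("260", "ab"), ("650", "a"), ("650", "a")],
    [("245", "ab"), ("022", "a")],
]


def element_usage(records):
    """Count every occurrence of each tag/subfield combination."""
    counts = Counter()
    for record in records:
        for tag, subfield_codes in record:
            for code in subfield_codes:
                counts[f"{tag}${code}"] += 1
    return counts


def elements_covering(counts, share):
    """Return how many of the most-used elements reach `share` of all occurrences."""
    total = sum(counts.values())
    running = 0
    for n, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running >= share * total:
            return n
    return 0


counts = element_usage(records)
print(len(counts), "distinct elements,", sum(counts.values()), "occurrences")
print(elements_covering(counts, 0.5), "elements cover half of all occurrences")
```

   Run over a real record set, the same two numbers (distinct
   elements seen, and how few of them cover most occurrences) are
   what would show whether usage is as concentrated as reported.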

   Also, recent experiments with merging record displays based on
   principles outlined in the Functional Requirements for Bibliographic
   Records (FRBR, which leads to the verb "FRBRization") may point a way
   out of at least some of the mess. To see a system that has applied the
   FRBR principles, check out the Research Libraries Group's project.

   Metadata for tomorrow

   We need more large-scale experiments with existing catalog records to
   see what can be done with legacy data. But we must also think about how
   to reengineer our infrastructure to enable robust machine processing,
   support for multiple record formats, and flexibility in user interfaces
   and screen display. For more on where we need to go, see my article "A
   Bibliographic Metadata Infrastructure for the Twenty-First Century."

   I've been hitting on metadata issues hard in this column, especially in
   recent months. I am increasingly disturbed by our inability to get this
   right, at least given today's needs. The library profession seems fond
   of assuming that its bibliographic infrastructure is the best ever
   devised, worthy of respect and admiration. There is some truth to that
   but also some self-delusion. If this is the best bibliographic
   infrastructure ever devised, then we (and, more importantly, our users)
   are in trouble. We must fix it, and soon.

   Links List

   Assessing Metadata Utilization: An Analysis of MARC Content Designation
   A Bibliographic Metadata Infrastructure for the Twenty-First Century
   Examining Present Practices to Inform Future Metadata Use: An Empirical
      Analysis of MARC Content Designation Utilization
   FictionFinder
   Functional Requirements for Bibliographic Records
   Lorcan Dempsey's Web Log