Library Journal "Digital Libraries" Columns 1997-2007, Roy Tennant
Please note: these columns are an archive of my Library Journal column from 1997-2007. They have not been altered in content, so please keep in mind some of this content will be out of date.
The Murky Bucket Syndrome
12/15/2004
Two recent, unrelated events put into stark focus the major challenges we have ahead of us if we want to serve our users as they expect and deserve. The first event arose when I innocently reported that two records for the same journal in our union catalog failed to merge. These were not old, battle-scarred records that had survived numerous title changes or the other disasters that plague serial records. These were records created for Respiratory Research, founded in 2000 by BioMed Central. As it turned out, these records represent an issue that is going to hound us.
One per format
The problem is this: the journal has both a print and an online version. One of our libraries cataloged it as an electronic journal (creating separate records for different versions), and the others cataloged it with one joint print and online record. Our merging algorithm failed since the print and e-forms have different ISSNs. Both methods of cataloging have proponents and opponents, benefits and drawbacks. In any case, as I was informed, it is a problem seemingly without solution. Even within one university we cannot agree about how to handle issues like this consistently.
But the main points are: 1) our cataloging infrastructure is rife with these problems; 2) such problems have unfortunate consequences for users; and 3) our existing infrastructure appears to be inadequate to solve this problem and others like it. Certainly this is not unique to my institution. Unfortunately, it is systemic.
New needs for old data
The other incident that helped frame this problem was an email exchange with Lorcan Dempsey, vice president for research, OCLC, Inc. He responded to my column "The Trouble with Online" (LJ 9/15/04, p. 26), which chronicled the inability of library users to limit their catalog search to only online materials (another indication that our systems are hopelessly inadequate).
Referring to this difficulty, Dempsey described the "'murky bucket syndrome' that affects any large bibliographic database--we cannot entirely, unambiguously slice and dice the database because of historic data entry and cataloging practices that...were not oriented toward our new needs." In going forward, Dempsey believes we "need to think about not just sharing data but extracting as much value as we can from it through processing." A prime example of extracting value from existing data is OCLC's FictionFinder service, which mines the MARC record for fiction genre information and uses it to provide a readers' advisory service.
But as I was soon to discover with the Respiratory Research example, the effects of the murky bucket syndrome are not limited to the "slicing" problem. As Dempsey put it, "this issue is now cropping up all over the place. As we try to do things programmatically, the structure and content practices really matter in ways they might not have before (FRBRization, data mining, etc.).... Increasingly, I think we need to look at cataloging practices in light of the new world of programmatic uses." For more from Dempsey, visit his web log, available through the online version of this article.
Some signs of hope
Recent research points out that the situation may not be quite as daunting as we think. In a paper at a recent Dublin Core conference ("Assessing Metadata Utilization") Bill Moen at the University of North Texas School of Library and Information Sciences revealed that only a small number of elements accounts for the vast majority of occurrences in a test dataset of 400,000 WorldCat MARC records, and fewer than half of the nearly 2000 fields/subfields currently defined in MARC 21 occurred in even one of the records in the test set. In other words, despite the size and complexity of the MARC record, much of it is little used.
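To make the kind of utilization count described above concrete, here is a minimal sketch in Python. It is not Moen's method or code. It assumes each record has already been reduced to a list of (tag, subfield) pairs, and the records, the list of "defined" elements, and the function names are all invented for illustration.

```python
from collections import Counter

def element_usage(records):
    """Count how often each (tag, subfield) element occurs across all records."""
    counts = Counter()
    for record in records:
        counts.update(record)
    return counts

def elements_covering(counts, share):
    """Return the most frequent elements that together reach `share` of all occurrences."""
    total = sum(counts.values())
    covered = 0
    chosen = []
    for element, n in counts.most_common():
        if covered >= share * total:
            break
        chosen.append(element)
        covered += n
    return chosen

def unused_elements(counts, defined):
    """Return the defined elements that never occur in any record."""
    return sorted(set(defined) - set(counts))

# Toy data (hypothetical): four tiny "records" as lists of (tag, subfield) pairs.
records = [
    [("245", "a"), ("100", "a"), ("260", "b")],
    [("245", "a"), ("100", "a"), ("650", "a")],
    [("245", "a"), ("260", "b")],
    [("245", "a"), ("100", "a"), ("022", "a")],
]

# Hypothetical stand-in for the full list of elements a format defines.
defined = [
    ("022", "a"), ("100", "a"), ("245", "a"), ("246", "a"),
    ("260", "b"), ("650", "a"), ("775", "t"),
]

counts = element_usage(records)
print("Elements covering 80% of occurrences:", elements_covering(counts, 0.8))
print("Defined but never used:", unused_elements(counts, defined))
```

Run against a real file, the same two questions (which few elements carry most of the occurrences, and which defined elements never appear) would need a proper MARC parser in place of the hand-built lists.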
Also, recent experiments with merging record displays based on principles outlined in the Functional Requirements for Bibliographic Records (FRBR, which leads to the verb "FRBRization") may point a way out of at least some of the mess. To see a system that has applied the FRBR principles, check out the Research Libraries Group's redlightgreen.com project.
Metadata for tomorrow
We need more large-scale experiments with existing catalog records to see what can be done with legacy data. But we must also think about how to reengineer our infrastructure to enable robust machine processing, support for multiple record formats, and flexibility in user interfaces and screen display. For more on where we need to go, see my article "A Bibliographic Metadata Infrastructure for the Twenty-First Century."
I've been hitting on metadata issues hard in this column, especially in recent months. I am increasingly disturbed by our inability to get this right, at least given today's needs. The library profession seems fond of assuming that its bibliographic infrastructure is the best ever devised, worthy of respect and admiration. There is some truth to that but also some self-delusion. If this is the best bibliographic infrastructure ever devised, then we (and, more importantly, our users) are in trouble. We must fix it, and soon.
Links List
Assessing Metadata Utilization: An Analysis of MARC Content Designation Use (www.unt.edu/wmoen/publications/MARCPaper_Final2003.pdf)
A Bibliographic Metadata Infrastructure for the Twenty-First Century (roytennant.com/metadata.pdf)
Examining Present Practices to Inform Future Metadata Use: An Empirical Analysis of MARC Content Designation Utilization (www.unt.edu/mcdu)
FictionFinder (fictionfinder.oclc.org)
Functional Requirements for Bibliographic Records (www.ifla.org/VII/s13/frbr/frbr.pdf)
Lorcan Dempsey's Web Log (orweblog.oclc.org)
RedLightGreen (redlightgreen.com)