Library Journal "Digital Libraries" Columns 1997-2007, Roy Tennant
Please note: these columns are an archive of my Library Journal column from 1997-2007. They have not been altered in content, so please keep in mind that some of this material will be out of date.
MARC Exit Strategies
11/15/2002
In last month's column (LJ 10/15/02, p. 26ff.), I outlined why it is time for us to rethink our most basic bibliographic standards: MARC elements, MARC syntax, and AACR2 (the rules for populating them). This month I'll be looking at various ways in which we can build on our strengths without allowing the past to limit our future.

Saving what's good

Librarians have more than 30 years of experience encoding bibliographic information in computer-readable form. Along the way we've learned a few things. As physicians are instructed to "first, do no harm," so we should make sure that a new bibliographic standard does not harm what we value.

We've learned that granularity is good. Granularity (see LJ 5/15/02, p. 32ff.) is the degree to which metadata elements are chopped up into constituent parts. For example, unambiguously recording the various parts of a personal name is good because sometimes you want to display a name as someone would refer to a person (e.g., Mark Twain) and sometimes you want to display a name in reverse order (e.g., Twain, Mark). If a name is not recorded as individual parts, computers are unable to manipulate the name accurately.

Language and hierarchy

We know that not all words are created equal. Recording the title of a book (e.g., The Adventures of Tom Sawyer) is fine for most purposes. But when you need to create an alphabetical list of book titles, the word "The" becomes an impediment. In MARC we have a method of identifying how many "nonfiling" characters are at the beginning of a title string.

Names, we understand, are tricky things. When users search for "Mark Twain," do they want to find all of the works by Mark Twain, or do they want to find all the works by the person Samuel Clemens, whether he used his pseudonym or not? Authority control provides a way to uniquely identify particular individuals and all the different names or name forms they may have used in their lifetime.

Some information is hierarchical in nature, while some is not. Most bibliographic information has no hierarchical relationship. That is, a book may have one or more authors, a title, some subject headings, etc., yet none of that information could logically be nested within any other information. The table of contents for that same book, on the other hand, may be quite hierarchical in nature--a section may have one or more chapters nested within it.

Dumping what's bad

Our current standards are focused on the book as a physical object. Indeed, the physical object is paramount. Second editions of a book require a completely separate record. Therefore, when users search for a book, they are often presented with several separate records for different editions, or sometimes even for different printings. Sorting out the one that is wanted is not always easy when much of the same information is duplicated in each record. This problem is significantly exacerbated in union catalogs--particularly virtual union catalogs that attempt to merge duplicate records at the moment a user performs a search (called "on the fly").

We need to separate the base bibliographic description (author, title, subjects, etc.) from the manifestations of that item. There should be one bibliographic record and one or more holdings records, with the details of a given holding recorded with that holding. A print copy may have a physical description (e.g., the number of pages), while an electronic copy will need to record other details (e.g., the URL and format).
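To make this concrete, a single bibliographic description with two attached holdings might look something like the sketch below. The element names and values are invented purely for illustration--they are not drawn from MARC, MODS, or any other existing standard--but they show how manifestation-specific details could live with the holding rather than in the shared description.

  <bibliographicDescription id="example-0001">  <!-- hypothetical elements and values -->
    <name>
      <familyName>Twain</familyName>
      <givenName>Mark</givenName>
    </name>
    <title nonfilingCharacters="4">The Adventures of Tom Sawyer</title>
    <holdings>
      <holding type="print">
        <location>Main Library</location>
        <extent>1 v. (ill.)</extent>                     <!-- physical details stay with the print copy -->
      </holding>
      <holding type="electronic">
        <url>http://example.org/texts/tom-sawyer</url>   <!-- placeholder URL -->
        <format>text/html</format>
      </holding>
    </holdings>
  </bibliographicDescription>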
Adding new features

We need easy extensibility. As any cataloger knows, changing MARC is a painstaking and lengthy process. But the XML community has shown us how to specify a well-codified basic set of tags while allowing for further enhancement with any arbitrary set of tags. Being hierarchical in nature, XML would allow a bibliographic record to have a MARC-like container for basic bibliographic information, as well as related but separate containers, or other sets, of metadata elements. One such enhancement might be an accurate encoding of the table of contents, which would significantly enrich our bibliographic records. Smashing a table of contents into MARC is like forcing a square peg into a round hole--it's difficult, and the peg usually gets damaged in the process. Being a flat structure, MARC doesn't easily accommodate hierarchy.

Improve merging, authority control

There should be hooks for record merging. As libraries share their bibliographic information with other entities--such as in building a union catalog--merging records for the same intellectual item presents significant difficulties. This problem is particularly acute for systems that attempt to merge records on the fly. Why not create a field specifically for the purpose of record merging? I'm not yet sure of the best method to create a unique ID for each bibliographically distinct item, but I'm certain it can be done, whether by software or by catalogers. An ISBN is not sufficient, since separate ISBNs are assigned to the cloth and paper editions of the same bibliographic entity. If such a code could be created or assigned, merging records for the same book would be a piece of cake, and information that is specific to a particular physical item would be carried along.

Authority control should be robust. Our current catalogs and the standards upon which they are based have never provided us with truly effective authority control. With a new standard, we have the opportunity to treat names in an entirely different way. For example, we could record the author's name as it appears on a book but also record the unique ID of that person's record in a central name authority database. At the time of indexing, such a database could be consulted to pull in the name variants for that individual.
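A sketch of how both hooks might appear in a record follows. Again, the element names and identifier values are hypothetical, invented only to illustrate the idea: a work-level identifier shared by every record for the same intellectual item, and a name element that carries both the transcribed form and a pointer into a central name authority database.

  <record>                                                 <!-- hypothetical elements and values -->
    <workIdentifier source="shared-registry">W-000123456</workIdentifier>
        <!-- identical on every record for this intellectual item; the basis for merging -->
    <title nonfilingCharacters="4">The Adventures of Tom Sawyer</title>
    <name authorityID="n-0000000" authorityFile="central-name-authority">
        <!-- authorityID is a placeholder; at indexing time the authority database
             could be consulted to pull in variant forms of the name
             (Clemens, Samuel Langhorne; Twain, Mark; etc.) -->
      <displayForm>Mark Twain</displayForm>
      <familyName>Twain</familyName>
      <givenName>Mark</givenName>
    </name>
    <!-- item-specific information (ISBN, binding, etc.) would be carried in the
         attached holdings rather than used for merging -->
  </record>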
Start from scratch

I'm not certain that I did an adequate job last month of drawing distinctions among the problems with MARC element sets (e.g., title statement, subject heading, etc.), the MARC syntax (e.g., the numeric fields, subfield indicators, etc.), and AACR2. Unfortunately, the length of this column does not always permit such nuances. But I'm also not sure that nuance is needed to make my case. You can rearrange the deck chairs on the Titanic all you want, but the ship is still going down. While we are not (yet) in such a dramatic situation, the analogy has a kernel of truth. If we do not consider what we can achieve by starting anew, we will never know what possible solutions there are beyond our present set of standards.

So for the sake of argument, let's assume that in the fullness of time we have indeed created a new set of standards for bibliographic description. What do we do with our millions of existing MARC records? I can suggest a few possibilities.

Some exit strategies

One method is entombment, my least favorite option: stop creating new MARC records and begin building a parallel system based on the new standard. With cross-database search tools becoming a reality (see LJ 10/15/01, p. 29ff.), it may be possible to create a portal that stitches the two systems together for the user while you work over time to migrate the old records (see below).

Another option is encapsulation. The Library of Congress is already hard at work defining an XML standard for bibliographic information that breaks some of the bonds of MARC. This developing standard, the Metadata Object Description Schema (MODS), is sufficient for capturing most of what is presently in our MARC records, although it is not possible to "round trip" MARC records (go from MARC to MODS and back to MARC again) using MODS. For a project at the California Digital Library, we extracted MARC records from our catalog, translated them into MODS, and embedded the records in an XML wrapper via the Metadata Encoding and Transmission Standard (METS). We also demonstrated how we could take metadata from the publisher and embed that as well in its own section of the METS record (a rough sketch of this structure appears at the end of this column). We were then able to index and display fields selectively from either metadata source for the same item.

Or maybe migration

Then there is migration. A complete transition to a new set of bibliographic standards would require migrating our existing record base into whatever we devise. Before you object to the scale of such an undertaking, consider that not so long ago there were no records at all in digital form. And migrating digital information from one format to another will be easier than that first conversion was, since some parts (but likely not all) of the process can be automated. The point here is to do this migration once, and to a standard that is flexible enough to allow future extensions and changes without requiring another migration.

So am I out of my mind? Possibly. Am I stirring up trouble? Likely. Have I stated facts and outlined possibilities that are worth considering? Certainly. We have nothing to lose by considering what our future could be, even if it means questioning our most basic professional assumptions.

Link List

MARC www.loc.gov/marc
METS www.loc.gov/standards/mets
MODS www.loc.gov/standards/mods
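For the curious, here is roughly what such a METS wrapper can look like, with a MODS description derived from the catalog record in one section and publisher-supplied metadata in another. This is a simplified sketch, not a record from the project itself, and the details of both schemas have continued to evolve, but the element names shown (mets, dmdSec, mdWrap, xmlData, mods, and so on) are the real ones.

  <mets xmlns="http://www.loc.gov/METS/">
    <!-- first descriptive metadata section: MODS translated from the MARC record -->
    <dmdSec ID="DMD-CATALOG">
      <mdWrap MDTYPE="MODS">
        <xmlData>
          <mods xmlns="http://www.loc.gov/mods/v3">
            <titleInfo>
              <nonSort>The</nonSort>
              <title>Adventures of Tom Sawyer</title>
            </titleInfo>
            <name type="personal">
              <namePart type="family">Twain</namePart>
              <namePart type="given">Mark</namePart>
            </name>
          </mods>
        </xmlData>
      </mdWrap>
    </dmdSec>
    <!-- second descriptive metadata section: metadata supplied by the publisher -->
    <dmdSec ID="DMD-PUBLISHER">
      <mdWrap MDTYPE="OTHER" OTHERMDTYPE="publisher">
        <xmlData>
          <!-- publisher-supplied description would be embedded here -->
        </xmlData>
      </mdWrap>
    </dmdSec>
    <!-- METS also requires a structural map; a minimal, empty one is shown -->
    <structMap>
      <div/>
    </structMap>
  </mets>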