Library Journal "Digital Libraries" Columns 1997-2007, Roy Tennant
Please note: these columns are an archive of my Library Journal column from 1997-2007. They have not been altered in content, so please keep in mind that some of this material will be out of date.
MARC Exit Strategies
11/15/2002
In last month's column (LJ 10/15/02, p. 26ff.), I outlined why it is time for us to rethink our most basic bibliographic standards: MARC elements, MARC syntax, and AACR2 (the rules for populating them). This month I'll be looking at various ways in which we can build on our strengths without allowing the past to limit our future.

Saving what's good

Librarians have more than 30 years of experience encoding bibliographic information in computer-readable form. Along the way we've learned a few things. As physicians are instructed to "first, do no harm," so we should make sure that a new bibliographic standard does not harm what we value.

We've learned that granularity is good. Granularity (see LJ 5/15/02, p. 32ff.) is the degree to which metadata elements are chopped up into constituent parts. For example, unambiguously recording the various parts of a personal name is good because sometimes you want to display a name as someone would refer to a person (e.g., Mark Twain) and sometimes you want to display a name in reverse order (e.g., Twain, Mark). If a name is not recorded as individual parts, computers are unable to manipulate the name accurately.

Language and hierarchy

We know that not all words are created equal. Recording the title of a book (e.g., The Adventures of Tom Sawyer) is fine for most purposes. But when you need to create an alphabetical list of book titles, the word "The" becomes an impediment. In MARC we have a method of identifying how many "nonfiling" characters are at the beginning of a title string.

Names, we understand, are tricky things. When users search for "Mark Twain," do they want to find all of the works by Mark Twain, or do they want to find all the works by the person Samuel Clemens, whether he used his pseudonym or not? Authority control provides a way to uniquely identify particular individuals and all the different names or name forms they may have used in their lifetime.

Some information is hierarchical in nature, while some is not. Most bibliographic information has no hierarchical relationship. That is, a book may have one or more authors, a title, some subject headings, etc., yet none of that information could logically be nested within any other information. The table of contents for that same book, on the other hand, may be quite hierarchical in nature--a section may have one or more chapters nested within it.

Dumping what's bad

Our current standards are focused on the book as a physical object. Indeed, the physical object is paramount. Second editions of a book require a completely separate record. Therefore, when users search for a book, they are often presented with several separate records for different editions, or sometimes even for different printings. Sorting out the one that is wanted is not always easy when much of the same information is duplicated in each record. This problem is significantly exacerbated in union catalogs--particularly virtual union catalogs that attempt to merge duplicate records at the moment a user performs a search (called "on the fly").

We need to separate the base bibliographic description (author, title, subjects, etc.) from the manifestations of that item. There should be one bibliographic record and one or more holdings records, with the details of a given holding recorded with that holding. A print copy may have a physical description (e.g., the number of pages), while an electronic copy will need to record other details (e.g., the URL and format).
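To make this concrete, a single bibliographic description with two attached holdings might look something like the sketch below. The element names and values are invented purely for illustration--they are not drawn from MARC, MODS, or any other existing standard--but they show how manifestation-specific details could live with the holding rather than in the shared description.

  <bibliographicDescription id="example-0001">  <!-- hypothetical elements and values -->
    <name>
      <familyName>Twain</familyName>
      <givenName>Mark</givenName>
    </name>
    <title nonfilingCharacters="4">The Adventures of Tom Sawyer</title>
    <holdings>
      <holding type="print">
        <location>Main Library</location>
        <extent>1 v. (ill.)</extent>                     <!-- physical details stay with the print copy -->
      </holding>
      <holding type="electronic">
        <url>http://example.org/texts/tom-sawyer</url>   <!-- placeholder URL -->
        <format>text/html</format>
      </holding>
    </holdings>
  </bibliographicDescription>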
Adding new features

We need easy extensibility. As any cataloger knows, changing MARC is a painstaking and lengthy process. But the XML community has shown us how to specify a well-codified basic set of tags while allowing for further enhancement with any arbitrary set of tags. Being hierarchical in nature, XML would allow a bibliographic record to have a MARC-like container for basic bibliographic information, as well as related but separate containers, or other sets, of metadata elements. One such enhancement might be an accurate encoding of the table of contents, which would significantly enrich our bibliographic records. Smashing a table of contents into MARC is like forcing a square peg into a round hole--it's difficult, and the peg usually gets damaged in the process. Being a flat structure, MARC doesn't easily accommodate hierarchy.

Improve merging, authority control

There should be hooks for record merging. As libraries share their bibliographic information with other entities--such as in building a union catalog--merging records for the same intellectual item presents significant difficulties. This problem is particularly acute for systems that attempt to merge records on the fly. Why not create a field specifically for the purpose of record merging? I'm not yet sure of the best method to create a unique ID for each bibliographically distinct item, but I'm certain it can be done, whether by software or by catalogers. An ISBN is not sufficient, since separate ISBNs are assigned to the cloth and paper editions of the same bibliographic entity. If such a code could be created or assigned, merging records for the same book would be a piece of cake, and information that is specific to a particular physical item would be carried along.

Authority control should be robust. Our current catalogs and the standards upon which they are based have never provided us with truly effective authority control. With a new standard, we have the opportunity to treat names in an entirely different way. For example, we could record the author's name as it appears on a book but also record the unique ID of that person's record in a central name authority database. At the time of indexing, such a database could be consulted to pull in the name variants for that individual.
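A sketch of how both hooks might appear in a record follows. Again, the element names and identifier values are hypothetical, invented only to illustrate the idea: a work-level identifier shared by every record for the same intellectual item, and a name element that carries both the transcribed form and a pointer into a central name authority database.

  <record>                                                 <!-- hypothetical elements and values -->
    <workIdentifier source="shared-registry">W-000123456</workIdentifier>
        <!-- identical on every record for this intellectual item; the basis for merging -->
    <title nonfilingCharacters="4">The Adventures of Tom Sawyer</title>
    <name authorityID="n-0000000" authorityFile="central-name-authority">
        <!-- authorityID is a placeholder; at indexing time the authority database
             could be consulted to pull in variant forms of the name
             (Clemens, Samuel Langhorne; Twain, Mark; etc.) -->
      <displayForm>Mark Twain</displayForm>
      <familyName>Twain</familyName>
      <givenName>Mark</givenName>
    </name>
    <!-- item-specific information (ISBN, binding, etc.) would be carried in the
         attached holdings rather than used for merging -->
  </record>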
Start from scratch

I'm not certain that I did an adequate job last month of drawing distinctions among the problems with MARC element sets (e.g., title statement, subject heading, etc.), the MARC syntax (e.g., the numeric fields, subfield indicators, etc.), and AACR2. Unfortunately, the length of this column does not always permit such nuances. But I'm also not sure that nuance is needed to make my case. You can rearrange the deck chairs on the Titanic all you want, but the ship is still going down. While we are not (yet) in such a dramatic situation, the analogy has a kernel of truth. If we do not consider what we can achieve by starting anew, we will never know what possible solutions there are beyond our present set of standards.

So for the sake of argument, let's assume that in the fullness of time we have indeed created a new set of standards for bibliographic description. What do we do with our millions of existing MARC records? I can suggest a few possibilities.

Some exit strategies

One method is entombment, my least favorite option: stop creating new MARC records and begin building a parallel system based on the new standard. With cross-database search tools becoming a reality (see LJ 10/15/01, p. 29ff.), it may be possible to create a portal that stitches the two systems together for the user while you work over time to migrate the old records (see below).

Another option is encapsulation. The Library of Congress is already hard at work defining an XML standard for bibliographic information that breaks some of the bonds of MARC. This developing standard, the Metadata Object Description Schema (MODS), is sufficient for capturing most of what is presently in our MARC records, although it is not possible to "round trip" MARC records (go from MARC to MODS and back to MARC again) using MODS. For a project at the California Digital Library, we extracted MARC records from our catalog, translated them into MODS, and embedded the records in an XML wrapper via the Metadata Encoding and Transmission Standard (METS). We also demonstrated how we could take metadata from the publisher and embed that as well in its own section of the METS record (a rough sketch of this structure appears at the end of this column). We were then able to index and display fields selectively from either metadata source for the same item.

Or maybe migration

Then there is migration. A complete transition to a new set of bibliographic standards would require migrating our existing record base into whatever we devise. Before you object to the scale of such an undertaking, consider that not so long ago there were no records at all in digital form. And migrating digital information from one format to another will be easier than that first conversion was, since some parts (but likely not all) of the process can be automated. The point here is to do this migration once, and to a standard that is flexible enough to allow future extensions and changes without requiring another migration.

So am I out of my mind? Possibly. Am I stirring up trouble? Likely. Have I stated facts and outlined possibilities that are worth considering? Certainly. We have nothing to lose by considering what our future could be, even if it means questioning our most basic professional assumptions.

Link List

MARC www.loc.gov/marc
METS www.loc.gov/standards/mets
MODS www.loc.gov/standards/mods
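For the curious, here is roughly what such a METS wrapper can look like, with a MODS description derived from the catalog record in one section and publisher-supplied metadata in another. This is a simplified sketch, not a record from the project itself, and the details of both schemas have continued to evolve, but the element names shown (mets, dmdSec, mdWrap, xmlData, mods, and so on) are the real ones.

  <mets xmlns="http://www.loc.gov/METS/">
    <!-- first descriptive metadata section: MODS translated from the MARC record -->
    <dmdSec ID="DMD-CATALOG">
      <mdWrap MDTYPE="MODS">
        <xmlData>
          <mods xmlns="http://www.loc.gov/mods/v3">
            <titleInfo>
              <nonSort>The</nonSort>
              <title>Adventures of Tom Sawyer</title>
            </titleInfo>
            <name type="personal">
              <namePart type="family">Twain</namePart>
              <namePart type="given">Mark</namePart>
            </name>
          </mods>
        </xmlData>
      </mdWrap>
    </dmdSec>
    <!-- second descriptive metadata section: metadata supplied by the publisher -->
    <dmdSec ID="DMD-PUBLISHER">
      <mdWrap MDTYPE="OTHER" OTHERMDTYPE="publisher">
        <xmlData>
          <!-- publisher-supplied description would be embedded here -->
        </xmlData>
      </mdWrap>
    </dmdSec>
    <!-- METS also requires a structural map; a minimal, empty one is shown -->
    <structMap>
      <div/>
    </structMap>
  </mets>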