Library Journal "Digital Libraries" Columns 1997-2007, Roy Tennant
Please note: these columns are an archive of my Library Journal column from 1997-2007. They have not been altered in content, so please keep in mind some of this content will be out of date.
The Importance of Being Granular
05/15/2002
Our libraries are increasingly dependent on metadata. Besides the obvious (our catalogs), other uses are becoming more commonplace. Virtually any content we digitize and make available to our clientele requires metadata for discovery and access. Every interlibrary loan transaction is a slug of metadata that helps libraries get a book or journal article to a user. Libraries now license so many databases and collections of online content that they increasingly offer a way for users to search for a resource based on their topic. Such a service requires metadata.

Last month, in "Metadata as if Libraries Depended on It" (LJ 4/15/02, p. 32ff.), I discussed metadata and its various components: a standard container, qualification, usage guidelines, and the information being captured. In that overview I set aside one topic as being worthy of its own column: metadata granularity.

How you chop it

Granularity refers to how finely you chop your metadata. For example, in the standard for encoding the full text of books using the Text Encoding Initiative (TEI) schema, a book author may be recorded as: <docAuthor>William Shakespeare</docAuthor>. That's all well and good if you never need to know which string of text comprises the author's last name and which the first. If you do (and most library catalogs should have this capability), you're not going to get very far with information extracted from a book encoded using the TEI tag set.

Although TEI has been around in one form or another for 15 years, its focus is mainly on the recording of aspects of a work for humanities scholars. As such, it is not particularly well suited for library-style bibliographic description. Nonetheless, as more texts are digitized in their entirety, libraries will increasingly be using either some form of TEI or another similar schema (e.g., ISO 12083). Therefore, it behooves us to know how well or how poorly standards such as TEI and MARC can interoperate.
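To make the docAuthor problem concrete, here is a minimal Python sketch. The coarse element is the docAuthor example above; the granular element uses the names author, firstName, and lastName, which are invented for this illustration and are not drawn from TEI or any other standard. The "last word is the last name" rule is likewise only an assumed guess at what software might do when it has nothing better to go on.

```python
import xml.etree.ElementTree as ET

# One undivided string, as in the docAuthor example above.
coarse = ET.fromstring("<docAuthor>William Shakespeare</docAuthor>")

# Hypothetical local markup (not TEI) that keeps the name parts separate.
granular = ET.fromstring(
    "<author><firstName>William</firstName>"
    "<lastName>Shakespeare</lastName></author>"
)


def sort_key_from_coarse(elem):
    """Guess the sort key by assuming the final word is the last name."""
    return elem.text.split()[-1]


def sort_key_from_granular(elem):
    """Read the sort key directly from the encoded last name."""
    return elem.findtext("lastName")


print(sort_key_from_coarse(coarse))      # Shakespeare (a lucky guess)
print(sort_key_from_granular(granular))  # Shakespeare (stated, not guessed)

# The guess fails when the family name comes first.
tricky = ET.fromstring("<docAuthor>Mao Zedong</docAuthor>")
print(sort_key_from_coarse(tricky))      # Zedong, though the name files under Mao
```

With the coarse form, software can only guess where the name divides; with the granular form, the cataloger's decision is recorded once and every later use can rely on it.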
How granularity helps

Granularity is good. It makes it possible to distinguish one bit of metadata from another and can lead to all kinds of additional user services. For example, you can't sort records on author names if you can't tell the last name. Wait a minute, you're thinking, we do it all the time since the MARC record is sufficiently granular. And you would be correct. Generally speaking, most of the information in a MARC record is sufficiently granular for the purposes for which it was designed. But it becomes less than adequately granular should you wish to start loading up the MARC record with such things as book reviews. Then you are reduced to such questionable tactics as smashing it into a note field. As time goes on, in other words, we may begin to find that MARC isn't quite as extensible or granular as it will need to be.

External compliance

The issue of granularity becomes critical in the apparent slavish devotion we tend to have toward standards. Don't get me wrong. Standards are vital to sharing data with others. They are important to any situation in which you must interoperate with other systems. They are important to providing a method to layer services easily on top of a collection of metadata. But we sometimes confuse internal compliance with external compliance.

External compliance with standards means that you can export your data into whatever metadata standard applies to a given situation. For example, some libraries are involved with the Open Archives Initiative (OAI), which aims to share metadata among working paper archives. Although internally a given archive may have a richer and more granular collection of metadata, OAI specifies that at minimum the archive should be able to make its metadata available for "harvesting" (collecting via software) using the Dublin Core metadata specification.
Therefore, an OAI-compliant archive will likely "dumb down" its metadata (in some cases making granular metadata more homogeneous) to meet this minimum specification.

Not all metadata are equal

Internal compliance means storing the metadata in a particular standard even when it makes little sense to do so. Some standards are meant to provide interoperability among systems (such as the Dublin Core), while others are designed to provide a base level of standardization upon which software systems can be built (such as MARC). Therefore, not all metadata standards are created equal. They are sometimes inadequate for your internal needs or would prevent you from complying with a different standard.

In the case of the OAI-compliant archive above, for example, being internally compliant with the Dublin Core would make no sense in and of itself. So long as it could "speak" Dublin Core when required, a richer set of internal metadata may allow many other additional uses of the same information (such as MARC records for a library catalog).

Granularity questions

Nearly all metadata standards raise granularity issues. In the TEI example, for greater flexibility the author's name should be chopped up at least into the part of the name upon which sorting can take place (usually the last name). Therefore, should I decide to encode a digitized book using the TEI set of XML tags, I will have metadata that is only adequate for TEI compliance. On the other hand, should I create my own set of tags to provide more granularity (perhaps the TEI tag set plus additional tags to identify the author's first and last name, for example), then a standard such as TEI can be covered like a blanket. And a number of other metadata standards that may be important (such as MARC) can be supported as well. Once you have your metadata stored in a standard, machine-parsable container, whether in a database or an XML data stream, it's easy to spit out the information in various configurations and formats.
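As a rough illustration of "spitting out" one granular record in more than one form, consider the Python sketch below. The internal record layout and both function names are assumptions made up for this example. The Dublin Core output uses only the title and creator elements, and the MARC-like output is a simplified text display of fields 100 and 245, not a real MARC record.

```python
# Hypothetical internal record: more granular than either export needs.
record = {
    "title": "Hamlet",
    "author": {"first": "William", "last": "Shakespeare"},
}


def to_dublin_core(rec):
    """Flatten to simple Dublin Core-style elements; the name parts merge."""
    author = rec["author"]
    return {
        "title": rec["title"],
        "creator": f"{author['last']}, {author['first']}",
    }


def to_marc_like(rec):
    """Render a simplified MARC-like display (illustrative, not real MARC)."""
    author = rec["author"]
    return [
        f"100 $a {author['last']}, {author['first']}",
        f"245 $a {rec['title']}",
    ]


print(to_dublin_core(record))
print("\n".join(to_marc_like(record)))
```

Both exports are derived from the same stored record, so the archive can "speak" either format on request without storing its metadata in either one.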
Remember: select (or create) and use metadata containers that are granular enough for any purpose to which you can imagine putting them. If you do this, not only can you serve your own purposes, but you can also share your metadata with anyone you wish. If this is not practical, then you must decide which needs will remain unfulfilled. Highly granular metadata doesn't come cheap. There is a trade-off between all possible uses that you may wish to support and the staff time required to capture the metadata required to do so. In some cases, the benefit will not warrant the cost; in others, it will be worth it.

Another path to granularity

Good granularity doesn't necessarily mean that any single metadata standard or container must chop up every field into the smallest reducible part. For example, the emerging standard for digital object description, METS, is designed to take advantage of other, more granular metadata containers. As a wrapper, it is meant to enclose some things and link to others. It can refer to a metadata record for the item being described. Therefore, a digital object described using the METS schema may, in fact, refer to a MARC record for descriptive metadata.

Granularity of metadata is hard-won and easily lost. Identifying and appropriately encoding metadata elements usually requires a person, and one with training. Once granularity has been achieved, it should not be permanently surrendered through internal compliance with an external standard, unless the benefits clearly outweigh the drawbacks and no alternatives are possible. The time of cataloging staff is valuable, and once granularity is lost it may not be practical to recover it.

Our libraries depend on metadata. They are becoming even more dependent as we move into the realm of creating, managing, and preserving collections in digital form. Doing so well requires us to understand thoroughly what is at stake and the consequences of our actions.
__________________________________________________________________

Link List

Dublin Core: dublincore.org
ISO 12083: www.xmlxperts.com/12083.htm
MARC: www.loc.gov/marc
METS: www.loc.gov/standards/mets
Open Archives Initiative: www.openarchives.org
Text Encoding Initiative: www.tei-c.org