Library Journal "Digital Libraries" Columns 1997-2007, Roy Tennant
Please note: these columns are an archive of my Library Journal column from 1997-2007. They have not been altered in content, so please keep in mind some of this content will be out of date.
XML: The Digital Library Hammer
03/15/2001
Abraham Maslow once said, 'When the only tool you own is a hammer, every problem begins to resemble a nail.' Once you understand XML and the opportunities it offers for creating and managing digital library services and collections, you will begin seeing nails everywhere. It is not the only tool you have, but it is by far the most useful.
XML (Extensible Markup Language) is born of a marriage of SGML (Standard Generalized Markup Language) and the web. HTML can't do much more than describe the look of a web page, whereas SGML is too complicated and unwieldy for most applications. XML achieves much of the power of SGML without the complexity and adds web capabilities beyond HTML.
XML provides a method by which you can mark (or 'tag') the structure of an object -- a document, a database entry, or just about anything made up of definable components. XML tags define the beginning of a structure, such as a section title, and the end. Whether you know it or not, you use a similar kind of technology frequently. Whatever word processing software you use tags the text you write with style information such as the font used.
How it works
XML differs from word processing software in several essential ways. First, it is open and transparent -- the specifications can be freely read and adapted by anyone, and the markup (the tags themselves) can be seen as well as the text. Also, it is capable of describing the structure of a virtually infinite variety of objects, not just a text document. Finally, it is being built for the web -- with a robust set of linking, transformation, and rendering capabilities.
To understand the XML applications described below, it will help to know the various required pieces and how they interact. First, there is the information that has been tagged either by hand or by using an editor similar to an HTML editor.
Unlike an HTML document, an XML document must be tagged according to three basic rules: 1) tag names are case-sensitive, so a beginning tag and its ending tag must match exactly, 2) all beginning tags must have an ending tag, and 3) no tag can span another tag (that is, all tags must properly nest).
Second, you must have an XML style sheet written in Extensible Stylesheet Language Transformations, or XSLT. While a style sheet usually specifies how the information that references it (typically an HTML file) should be displayed within a browser, XSLT offers more powerful, transformative features. For example, it can include or omit parts of the text in the output. In some ways, XSLT resembles a simple programming language that has been optimized for transforming XML files into other forms. Third, there should be an HTML style sheet (Cascading Style Sheet or CSS) that tells the web browser how to display the page. All three elements reside on the server, usually (but not necessarily) close to one another.
XML and software
If you use software such as the Cocoon publishing framework, a user's request for an XML document from your web server is passed to that software. The software then applies the XSLT transformations to produce the HTML version, which is sent to the client along with the HTML style sheet. If you don't use special software on the server for these operations, the client software (typically a web browser) must attempt to process the XML file. The latest versions of Microsoft Internet Explorer will attempt to process the file, but you're unlikely to be pleased with the result. Don't even try with Netscape.
Few people know this, but any library with an integrated library system from Innovative Interfaces (with Update D) can view XML versions of catalog records. Kyle Banerjee of Oregon State University has used this capability to provide information essential to relocating 50,000 items to a storage facility.
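As an aside on the three tagging rules given earlier in this section, the sketch below shows how a parser enforces them. It uses Python's standard library, which is not something the column itself mentions, and the element names and sample text are hypothetical placeholders invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents; element names and text are placeholders.
SAMPLES = {
    "well-formed": "<citation><author>A. Author</author><title>First Novel</title></citation>",
    # Tag names are case-sensitive: <Title> is not closed by </title>.
    "case mismatch": "<citation><Title>First Novel</title></citation>",
    # <title> is never closed before </citation> appears.
    "missing end tag": "<citation><title>First Novel</citation>",
    # <title> opens inside <author> but closes outside it.
    "improper nesting": "<citation><author>A. Author<title></author>First Novel</title></citation>",
}


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


for label, text in SAMPLES.items():
    print(f"{label}: {is_well_formed(text)}")
```

Only the first sample is accepted; each of the other three breaks one of the rules and is rejected by the parser.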
Banerjee also uses XML catalog records to solve problems that many other libraries face, as with his program ILL ASAP (Interlibrary Loan Automatic Search and Print). Banerjee says that 'XML and XSLT are the most significant developments in information management since relational databases and SQL.'
XML and structured information
Bibliographies are commonplace in libraries, whether as lists of books by a particular author or pathfinders by subject. What are bibliographic citations but a structured set of textual elements? XML is made for this. Imagine a list of historical novels. Marked up in XML (which is almost as easy as HTML markup), this same list could be used to create several different web pages: one that lists items by author, another by title, and another by time period. All you'd need is an XSLT style sheet that produces a different view of the document, depending on the user request. You need only maintain one XML document. This is not yet a trivial task, but employing XML gets steadily easier as tools improve.
XML and digital publishing
XML seems made to order for publishing. A book, for instance, is a highly structured object, with basic bibliographic information, front matter, chapters, headings, paragraphs, and back matter. A book marked up in XML can then be displayed in various ways -- chapter by chapter, as a table of contents (by extracting section headings from the file), and more. XML-encoded books will soon be viewed both on the web and on personal devices such as e-book readers and personal digital assistants. The Open eBook Forum has promulgated a standard method of encoding e-books in XML specifically to provide an easy method for interchanging books across reading devices. To see this in action on the web, see Tobacco War: Inside the California Battles, which exists only as a single XML file but is delivered in chunks of HTML to the user upon request.
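The idea of one XML document feeding several views, as with the list of historical novels described above, can be sketched in a few lines. This is a minimal illustration only: Python's standard library stands in for XSLT here, and the element names, authors, titles, and periods are all hypothetical placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical reading list; every name, title, and period is a placeholder.
NOVELS_XML = """<novels>
  <novel>
    <author>Zeta, Ann</author>
    <title>Harbor Lights</title>
    <period>Medieval</period>
  </novel>
  <novel>
    <author>Alpha, Ben</author>
    <title>Winter Crown</title>
    <period>Victorian</period>
  </novel>
  <novel>
    <author>Moss, Cal</author>
    <title>Ash and Iron</title>
    <period>Ancient Rome</period>
  </novel>
</novels>"""


def view(xml_text: str, sort_key: str) -> list:
    """Return one display line per novel, ordered by the chosen element.

    Ordering is plain alphabetical on the element's text, so the
    'period' view sorts by label rather than by chronology.
    """
    root = ET.fromstring(xml_text)
    novels = sorted(
        root.findall("novel"),
        key=lambda n: n.findtext(sort_key, default=""),
    )
    return [
        f"{n.findtext('author')}. {n.findtext('title')} ({n.findtext('period')})"
        for n in novels
    ]


# The same source document yields three differently ordered listings.
for key in ("author", "title", "period"):
    print(f"By {key}:")
    for line in view(NOVELS_XML, key):
        print("  " + line)
```

Only the single source document has to be maintained; each listing is derived from it on request, which is the role the column assigns to an XSLT style sheet.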
Tobacco War will soon be joined by several dozen books on topics in international and area studies, all provided to the client in HTML from XML source files.
An essential precondition for interoperability is the capacity to share information effectively with other systems. XML supports that capability by providing information in a structured way. For example, the Open Archives Initiative is using XML as the carrier syntax for bibliographic information about e-prints ('Open Archives: A Key Convergence,' LJ 2/15/00, p. 122ff.).
XML toolbox
To learn more, the XML.com site offers a good start, while Robin Cover's XML Cover Pages is nearly exhaustive. Also see Norman Desmarais's The ABCs of XML: The Librarian's Guide to the eXtensible Markup Language. To discuss XML and its use in libraries, join the newly created XML4Lib discussion. If you are running an Apache web server and want to jump in with both feet, download and install Cocoon. This will provide a platform to transform your XML documents to HTML on the fly. For information on using XSLT, an essential part of the process, see Chapter 14 of the XML Bible as well as Michael Kay's excellent book XSLT: Programmer's Reference.
MARC then, XML now
MARC was the data encoding standard upon which librarians built modern librarianship. It enabled us to create automated library systems, shared cataloging, resource networks, and many things that we today take for granted. Now libraries are becoming publishers. We must provide access to a wider array of information resources, and resource sharing is more essential than before. XML likely will provide the carrier syntax for all of this, thereby becoming the MARC-equivalent of the 21st century.
__________________________________________________________________
LINK LIST
The ABCs of XML - http://www.newtechnologypress.com/ntp/inprint.html
Apache XML Project - http://xml.apache.org
Cocoon Publishing Framework - http://xml.apache.org/cocoon
Open Archives Initiative - http://www.libraryjournal.com/articles/infotech/digitallibraries/20000215_13666.asp
ILL Automatic Search and Print - http://ucs.orst.edu/~banerjek/illasap
Open eBook Forum - http://www.openebook.org
Tobacco War - http://escholarship.cdlib.org/ucpress/tobacco-war.xml
XML Bible, Chapter 14 - http://www.ibiblio.org/xml/books/bible/updates/14.html
XML.com - http://www.xml.com
XML Cover Pages - http://xml.coverpages.org
XML4Lib - http://sunsite.berkeley.edu/XML4Lib
XSLT: Programmer's Reference - http://www.wrox.com/Books/Book_Details.asp?ISBN=1861005067