Library Journal "Digital Libraries" Columns 1997-2007, Roy Tennant
Please note: these columns are an archive of my Library Journal column from 1997-2007. They have not been altered in content, so please keep in mind some of this content will be out of date.
Collection Development Today
05/15/2005
If you have something to say, you've never had it so good. You can get your word out by posting to a blog, by editing a wiki or a web site, or even by using cost-effective print-on-demand book production services. In other words, collection development librarians now live in a world of hurt.
All these methods, as well as a number of others, fall outside the conventional ones librarians use to select and aggregate content. If a book is not in a publisher's catalog or an approval plan, collection development librarians are likely to know nothing of it. But in a world where important content is often not part of the traditional publishing infrastructure, libraries risk becoming increasingly irrelevant unless they thoroughly reengineer their collection-building strategies.
Do these new methods of creating content mean that old ones such as print publishing are going away? Not on your life. It means that collection development librarians not only can't eliminate anything they already do, but they must also add an entirely new set of activities.
Metadata harvesting
Content on the web typically falls into one of two categories: content that can be crawled by software and content in the "deep web," which can't be crawled because it is embedded in a database. Deep web content isn't lost forever to automated methods of aggregation, however, thanks to the development of a protocol for "harvesting" metadata. The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) offers a way for repositories to let the metadata that describes their holdings be harvested either wholesale or in preconfigured clumps called "sets."
The harvesting process is performed by any number of applications (see the Open Archives web site), many of them free. One application in particular makes it easy for a librarian to create a single search interface to metadata downloaded from any number of OAI-compliant repositories.
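To make the protocol concrete: an OAI-PMH harvest is a series of plain HTTP requests against a repository's base URL, each returning a page of XML. The following is a minimal sketch, not the code of any particular harvester. The base URL, the set name "physics," and the SAMPLE response are hypothetical, made up for illustration; the verb and parameter names (ListRecords, metadataPrefix, set, resumptionToken) come from the OAI-PMH 2.0 specification.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical, trimmed-down response page (real pages also carry metadata).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url: str, set_spec: Optional[str] = None,
                     resumption_token: Optional[str] = None) -> str:
    """Build a ListRecords request URL for one page of a harvest."""
    if resumption_token:
        # A resumption token continues an earlier request and is sent on its own.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        # oai_dc (unqualified Dublin Core) is the format all repositories support.
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        if set_spec:
            # Harvest one preconfigured "set" instead of the whole repository.
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_page(xml_text: str) -> Tuple[List[str], Optional[str]]:
    """Return the record identifiers on a page and the next token, if any."""
    root = ET.fromstring(xml_text)
    ids = [el.text for el in root.iterfind(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token
```

A harvester would fetch the first URL, parse the page, and keep requesting with each returned token until a page arrives without one. Fetching is left out here so the sketch runs without a network connection.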
The Public Knowledge Project Harvester, once set up and configured, can be managed via a simple web interface. Any librarian with a tiny bit of knowledge (the base URL of the repository to harvest, etc.) can create a searchable collection. As OAIster.org has proven, there is a great deal of useful content available to those with the capability to work with the OAI-PMH protocol. As of this writing, OAIster.org has harvested information about more than 5.25 million items from over 450 separate repositories. This is a substantial collection development opportunity.
Targeted web crawling
Since web sites are increasingly the publication medium of choice for everything from government documents to personal blogs, web crawling is potentially a powerful collection development tool. I'm not suggesting that libraries attempt to replicate Google. But selecting specific web sites to crawl can bolster print collections in useful areas. This kind of targeted web crawl relies on librarian knowledge of the applicability of specific web sites to a particular clientele.
The California Digital Library (CDL) is attempting to create tools that a 21st-century collection development librarian could use to aggregate content, or information about content, on behalf of a local clientele. The purpose of these tools is to provide librarians with methods to select repositories and web sites to be harvested or crawled, with the intent that the resulting collections would be made available directly to end users or in association with other resources via metasearch software. Software development is in an early stage, but we have already gained experience with both metadata harvesting and web crawling upon which we can draw. Much of this work is supported by grants from the Hewlett Foundation, Mellon Foundation, and National Science Digital Library. Developments can be followed via the Inside CDL web site.
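What separates a targeted crawl from a Google-style crawl is the scope rule: the crawler follows only links that stay on sites a librarian has selected. The following is a minimal sketch of that rule alone, not a description of the CDL tools. The host names and the sample page in the usage note are hypothetical, and fetching, robots.txt handling, and politeness delays are deliberately left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical selection list: hosts a librarian judged relevant to local users.
SELECTED_HOSTS = {"www.example.gov", "blog.example.org"}


class LinkCollector(HTMLParser):
    """Collect the href value of every anchor tag in one HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def in_scope_links(page_url, html_text, selected_hosts=SELECTED_HOSTS):
    """Return absolute links from a page that stay on the selected hosts."""
    collector = LinkCollector()
    collector.feed(html_text)
    links = []
    for href in collector.hrefs:
        # Resolve relative links against the URL of the page they appear on.
        absolute = urljoin(page_url, href)
        if urlparse(absolute).hostname in selected_hosts and absolute not in links:
            links.append(absolute)
    return links
```

Given a page at http://www.example.gov/index.html that links to /reports/2005.html, to http://other.example.com/, and to http://blog.example.org/post, the function keeps the first and third links and drops the second. The selection list is where the librarian's judgment about a particular clientele enters the process.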
Underlying this work is the firm belief in the continued need for curation of content--selecting, organizing, and providing access to a world of resources that are appropriate to particular audiences and/or purposes. Sure, Google crawls (or attempts to crawl) everything, but sometimes what isn't searched is as important as what is. As long as users have a need to narrow in on useful and appropriate information resources, there will be work for collection development librarians. It just won't be the work that most librarians do now.
__________________________________________________________________
Link List
Inside CDL: www.cdlib.org/inside
Open Archives Initiative: www.openarchives.org
Public Knowledge Project Harvester: www.pkp.ubc.ca/pkp-harvester