Library Journal "Digital Libraries" Columns 1997-2007, Roy Tennant

Please note: these columns are an archive of my Library Journal column from 1997-2007. They have not been altered in content, so please keep in mind some of this content will be out of date.
Collection Development Today

05/15/2005
   If you have something to say, you've never had it so good. You can get
   your word out by posting to a blog, by editing a wiki or a web site, or
   even by using cost-effective print-on-demand book production services.
   In other words, collection development librarians now live in a world
   of hurt.

   All these methods, as well as a number of others, fall outside the
   conventional ones librarians use to select and aggregate content. If a
   book is not in a publisher's catalog or an approval plan, collection
   development librarians are likely to know nothing of it. But in a world
   where important content is often not part of the traditional publishing
   infrastructure, libraries risk becoming increasingly irrelevant unless
   they thoroughly reengineer their collection-building strategies.

   Do these new methods of creating content mean that old ones such as
   print publishing are going away? Not on your life. It means that
   collection development librarians not only can't eliminate anything
   they already do, but they must also add an entirely new set of
   activities.

   Metadata harvesting

   Content on the web typically falls into one of two categories: content
   that can be crawled by software and content in the "deep web," which
   can't be crawled because it is embedded in a database. Deep web content
   isn't lost forever to automated methods of aggregation, however, thanks
   to the development of a protocol for "harvesting" metadata.

   The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
   offers a way for repositories to let the metadata that describes their
   holdings be harvested either wholesale or in preconfigured clumps
   called "sets." The harvesting process is performed by any number of
   applications (see the Open Archives web site), many of them free.
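
   To give a sense of what such a harvesting application does, here is
   a minimal Python sketch of a single OAI-PMH request and response. It
   is an illustration under stated assumptions, not the code of any
   particular harvester: the function names, the example.org base URL,
   and the sample response are all made up for the example. The verb
   (ListRecords), the oai_dc metadata format, the optional "set"
   argument, and the resumption token used to page through large
   results come from the protocol itself.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by OAI-PMH 2.0 and by Dublin Core.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, set_spec=None, resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # When continuing a harvest, the token replaces the other arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        # oai_dc (unqualified Dublin Core) is the baseline metadata format.
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        if set_spec:
            # Restrict the harvest to one preconfigured "set".
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (titles, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    titles = [el.text for el in root.iter("{%s}title" % DC_NS) if el.text]
    token_el = root.find(".//{%s}resumptionToken" % OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return titles, token


# Made-up response, trimmed to the parts the sketch reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>token-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("http://example.org/oai", set_spec="physics"))
    print(parse_list_records(SAMPLE_RESPONSE))
```

   A real harvester would fetch each URL over HTTP, store the records,
   and repeat the request with the returned token until the repository
   stops sending one. The sketch leaves out the network step so that it
   runs without a live repository.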

   One application in particular makes it easy for a librarian to create a
   single search interface to metadata downloaded from any number of
   OAI-compliant repositories. The Public Knowledge Project Harvester,
   once set up and configured, can be managed via a simple web interface.
   Any librarian with a tiny bit of knowledge (the base URL of the
   repository to harvest, etc.) can create a searchable collection.

   As OAIster.org has proven, there is a great deal of useful content
   available to those with the capability to work with OAI-PMH. As of
   this writing, OAIster.org has harvested information
   about more than 5.25 million items from over 450 separate repositories.
   This is a substantial collection development opportunity.

   Targeted web crawling

   Since web sites are increasingly the publication medium of choice for
   everything from government documents to personal blogs, web crawling is
   potentially a powerful collection development tool. I'm not suggesting
   that libraries attempt to replicate Google. But selecting specific web
   sites to crawl can bolster print collections in useful areas. This kind
   of targeted web crawl relies on librarian knowledge of the
   applicability of specific web sites to a particular clientele.
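
   The core of a targeted crawl is a scope rule: follow a link only if
   it stays within the sites a librarian has selected. The Python
   sketch below shows one simple way to express that rule. The site
   list, the function names, and the exact matching policy (exact host
   name, http or https only) are assumptions made for illustration, not
   a description of any existing tool.

```python
from urllib.parse import urlparse

# Hypothetical seed list: sites a librarian has selected for a clientele.
SELECTED_SITES = ["www.example.gov", "blog.example.org"]


def in_scope(url, selected_hosts=SELECTED_SITES):
    """Return True if the URL is a web page on one of the selected sites."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        # Skip mailto:, ftp:, javascript: and other non-web links.
        return False
    host = (parts.hostname or "").lower()
    return host in {h.lower() for h in selected_hosts}


def filter_links(links, selected_hosts=SELECTED_SITES):
    """Keep in-scope links, dropping duplicates but preserving order."""
    seen = set()
    kept = []
    for link in links:
        if in_scope(link, selected_hosts) and link not in seen:
            seen.add(link)
            kept.append(link)
    return kept


if __name__ == "__main__":
    found = [
        "http://www.example.gov/reports/2005.html",
        "http://www.google.com/",
        "mailto:someone@www.example.gov",
        "http://www.example.gov/reports/2005.html",
    ]
    print(filter_links(found))
```

   A crawler would apply a filter like this to every link it extracts
   from a fetched page before adding it to the queue, which is what
   keeps the crawl tied to the librarian's selections rather than
   wandering across the web.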

   The California Digital Library (CDL) is attempting to create tools that
   a 21st-century collection development librarian could use to aggregate
   content, or information about content, on behalf of a local clientele.
   The purpose of these tools is to provide librarians with methods to
   select repositories and web sites to be harvested or crawled, with the
   intent that the resulting collections would be made available directly
   to end users or in association with other resources via metasearch
   software.

   Software development is in an early stage, but we have already gained
   experience with both metadata harvesting and web crawling upon which we
   can draw. Much of this work is supported by grants from the Hewlett
   Foundation, Mellon Foundation, and National Science Digital Library.
   Developments can be followed via the Inside CDL web site.

   Underlying this work is the firm belief in the continued need for
   curation of content--selecting, organizing, and providing access to a
   world of resources that are appropriate to particular audiences and/or
   purposes. Sure, Google crawls (or attempts to crawl) everything, but
   sometimes what isn't searched is as important as what is. As long as
   users have a need to narrow in on useful and appropriate information
   resources, there will be work for collection development librarians. It
   just won't be the work that most librarians do now.
     __________________________________________________________________

                             Link List
   Inside CDL
   www.cdlib.org/inside
   Open Archives Initiative
   www.openarchives.org
   Public Knowledge Project Harvester
   www.pkp.ubc.ca/pkp-harvester