Library Journal "Digital Libraries" Columns 1997-2007, Roy Tennant
Please note: these columns are an archive of my Library Journal column from 1997-2007. They have not been altered in content, so please keep in mind that some of this content will be out of date.
The Right Solution: Federated Search Tools
06/15/2003
I recently took library catalogs to task for being the wrong solution for the right problem--general information discovery (Digital Libraries, LJ 2/15/03, p. 28). But I believe that federated or cross-database search tools now available on the market are the correct solution for unifying access to a variety of information resources (for an earlier take on these tools, see LJ 10/15/01, p. 29ff.). These tools can search not only library catalogs but also commercial abstracting and indexing databases, web search engines, and a variety of other databases, while often merging and de-duplicating (a.k.a. de-duping) results. Although these software solutions are still in an early stage of development, they already offer key functionality, both proving the benefits of federated search tools and pointing to their potential. A number of libraries are already using such tools to serve their clientele better.

The players
Representative players in this market space are MuseGlobal, Endeavor, Ex Libris, WebFeat, and Fretwell-Downing. All have products that unify searching of a variety of databases (known as "targets"). They also provide additional services such as authentication, merging, and de-duping. The market is still in an early stage of development, with products varying quite a bit in features, stability, and ease of implementation. This means that if you're purchasing a system, you should exercise due diligence in comparing features, talking with current customers, and making sure that the product will deliver on its promises. It also means that implementation may not be a breeze.

Not just point and shoot
While some libraries may be perfectly happy doing a minimum amount of configuration to the "out of the box" interfaces of these products, a significant number of libraries will not. They may need to add extra targets not in the repertoire of the given application or reconfigure the user interface. Or they may need to go deeper by getting involved with target creation or enhancement. For example, some targets may not provide exactly the searching and results options (e.g., an option to output results in XML) that you would like. In cases where you exert some control over the target (e.g., your library catalog), you may be able to make changes to the system. This kind of work can be considered a type of target enhancement.

Hitting the right targets
In other cases, you may find that you must get involved in creating targets to be searched. In the case of a large academic library, an effective discovery service may require searching a number of institutional repositories (see Digital Libraries, LJ 9/15/02, p. 28ff.). While most such repositories support a standard that allows metadata to be harvested (the Open Archives Initiative Protocol for Metadata Harvesting, or OAI-PMH), it is not a searching protocol like Z39.50. In order to point a federated search tool at repositories such as these, a library would first need to harvest the metadata from the repositories of interest and make that metadata available to the federated search tool as a searchable target. In such a case, the library would be engaged in target creation (a sketch of the harvesting step appears below).

Another example of target creation would be focused web crawling and indexing of web sites related to a particular topic area. Rather than relying on Google or other general-purpose web search tools, it might make more sense to do a focused crawl of web sites important to a particular discipline.
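To make the harvesting step concrete, here is a minimal sketch in Python, assuming a hypothetical repository at repository.example.edu that speaks OAI-PMH. A production harvester would also follow resumptionTokens to page through large record sets and would load the harvested fields into whatever index the federated search tool queries.

```python
# Minimal OAI-PMH harvest sketch: fetch Dublin Core records from one
# repository so they can be indexed as a local federated-search target.
# The repository URL is hypothetical; real harvesters must also follow
# resumptionTokens to retrieve more than the first batch of records.
from urllib.request import urlopen
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

BASE_URL = "http://repository.example.edu/oai"  # hypothetical target

def harvest(base_url):
    """Yield (title, creator, identifier) tuples from an OAI-PMH feed."""
    url = base_url + "?verb=ListRecords&metadataPrefix=oai_dc"
    tree = ET.parse(urlopen(url))
    for record in tree.iter(OAI + "record"):
        meta = record.find(OAI + "metadata")
        if meta is None:  # deleted records carry no metadata
            continue
        def first(tag):
            el = meta.find(".//" + DC + tag)
            return el.text if el is not None else ""
        yield first("title"), first("creator"), first("identifier")

for title, creator, ident in harvest(BASE_URL):
    print(title, "|", creator, "|", ident)  # hand these to your indexer
```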
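The focused crawl can be sketched in the same spirit: start from a few discipline-specific seed pages and follow only links that stay within an approved list of hosts, handing each fetched page to an indexer. The seed URL here is invented for illustration, and a real crawler would add robots.txt handling, politeness delays, and sturdier HTML parsing.

```python
# Focused-crawl sketch: breadth-first crawl restricted to a whitelist of
# discipline-specific hosts. The seed URL is hypothetical; a production
# crawler also needs robots.txt checks, delays, and better error handling.
from urllib.request import urlopen
from urllib.parse import urljoin, urlparse
from html.parser import HTMLParser
from collections import deque

SEEDS = ["http://www.example-ornithology.org/"]  # hypothetical seed site
ALLOWED_HOSTS = {urlparse(s).netloc for s in SEEDS}
MAX_PAGES = 50  # crude cap to keep the sketch bounded

class LinkParser(HTMLParser):
    """Collect the href value of every anchor tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

queue, seen = deque(SEEDS), set(SEEDS)
while queue and len(seen) <= MAX_PAGES:
    url = queue.popleft()
    try:
        html = urlopen(url).read().decode("utf-8", "replace")
    except OSError:
        continue
    print("indexing", url)  # hand the page text to your indexer here
    parser = LinkParser()
    parser.feed(html)
    for link in parser.links:
        absolute = urljoin(url, link)
        # stay "focused": follow only links within the approved hosts
        if urlparse(absolute).netloc in ALLOWED_HOSTS and absolute not in seen:
            seen.add(absolute)
            queue.append(absolute)
```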
The de-duping challenge
In a market still in its infancy, there are a number of unmet challenges. Some are being addressed in varying degrees by the vendors (e.g., de-duplication). Others are likely to remain fairly intractable for some time (e.g., identifying what's online). A key challenge is the ability to de-duplicate search results reliably. Multiple records for the same item are inevitable with cross-database searching, and they are something users don't want. If the de-duping algorithms are too strict (i.e., records must match on multiple criteria to be merged), duplicates may slip through; if they are too loose, items that are not really duplicates may be merged. Most products allow the library to set the de-duplication protocol or protocols that should be applied, but some limit the total number of records that can be de-duped per search (a sketch of this matching trade-off appears below).

Making sense of results
A tougher problem is relevance ranking. It's one thing to return search results from a variety of sources, but it's another thing entirely to list them in order of relevance. To compute relevance, a system must assume some things related to user needs (e.g., would an overview be more appropriate than an in-depth piece?). It must also try to determine which items will best match the user's need--often with very little information to go on. Take library catalog records, for example. How would you go about ranking MARC records? If the records come from a union database, you may be able to use the number of holding libraries as an important ranking variable. But what do you use if that is not available?
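To illustrate the de-duplication trade-off described above, here is a minimal sketch in which records merge when a normalized match key collides; "strictness" is simply how many fields go into the key. The record fields are hypothetical, not drawn from any vendor's product.

```python
# De-duplication sketch: merge records whose normalized match keys collide.
# A "strict" key uses more fields (fewer false merges, but more duplicates
# slip through); a "loose" key uses fewer fields (the reverse).
import re

def normalize(value):
    """Lowercase and strip punctuation/whitespace so trivial differences
    (capitalization, trailing periods) don't defeat matching."""
    return re.sub(r"[^a-z0-9]", "", value.lower())

def match_key(record, fields):
    return tuple(normalize(record.get(f, "")) for f in fields)

def dedupe(records, fields):
    merged = {}
    for rec in records:
        merged.setdefault(match_key(rec, fields), rec)  # keep first seen
    return list(merged.values())

records = [
    {"title": "Digital Libraries.", "author": "Tennant, Roy", "year": "2003"},
    {"title": "digital libraries",  "author": "Tennant, R.",  "year": "2003"},
]
# Loose key (title + year): the two records above merge into one.
print(len(dedupe(records, ["title", "year"])))            # -> 1
# Strict key (title + author + year): the author variants keep them apart,
# so this duplicate slips through.
print(len(dedupe(records, ["title", "author", "year"])))  # -> 2
```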
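And here is a sketch of the holdings-count idea for ranking, with a crude term-overlap fallback when no holdings data is available. The weights and field names are invented for illustration only.

```python
# Relevance-ranking sketch for merged results: use the number of holding
# libraries as the primary signal when it exists, falling back to a crude
# term-overlap score between the query and the title. Weights are arbitrary.

def score(record, query_terms):
    title_terms = set(record.get("title", "").lower().split())
    overlap = len(query_terms & title_terms) / max(len(query_terms), 1)
    holdings = record.get("holdings")  # None when the target isn't a union db
    holdings_signal = min(holdings / 100.0, 1.0) if holdings else 0.0
    return 0.6 * holdings_signal + 0.4 * overlap

results = [
    {"title": "Introduction to Digital Libraries", "holdings": 250},
    {"title": "Digital Libraries", "holdings": None},
    {"title": "Cataloging Practice", "holdings": 900},
]
query = {"digital", "libraries"}
for rec in sorted(results, key=lambda r: score(r, query), reverse=True):
    print(round(score(rec, query), 2), rec["title"])
```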
A possibly intractable problem is reliably identifying what is online in full text. If you know that all of the records in a particular database have full text available, then the problem is solved for that particular target. But with libraries increasingly using OpenURL-based systems (such as Ex Libris's SFX product), it is difficult or impossible to know up front whether a given search result can be accessed online. Unfortunately, this problem is likely to get worse before it gets better. So what is a panicked undergraduate to do?

California's Searchlight
The California Digital Library launched a federated search service called Searchlight in 2000. It prompted users to pick between sciences and engineering or social sciences and humanities. From there, it took a simple user query and searched dozens of resources simultaneously. Since it was an early prototype, it did not attempt to merge or de-dupe results but rather displayed the number of hits in each database, with the databases organized in categories by type of resource. We learned that thousands of hits across a hundred or so resources are not always helpful, except to guide users to a particular database where they could then reenter their search using the unique options of that resource. We realized that an all-encompassing solution does not necessarily serve key user needs well. Through the Searchlight project, we came to realize that at one end of the scale is the undergraduate, who typically needs only "a few good things" to cite in a paper. At the other end of the scale is the graduate student or faculty member who wants much more thorough coverage of a subject discipline. A one-size-fits-all approach serves neither well.

Tailored portals
For a large university, tailored portals may be a better solution, where librarians can pull together resources specific to a topic area or purpose (such as a "core collection" for undergraduates). Meanwhile, smaller libraries or public libraries may wish to use federated search tools to provide one-stop shopping for a select group of databases. Luckily, it appears that these needs can be met by the federated search tools now on the market.
__________________________________________________________________

Link List
Endeavor ENCompass encompass.endinfosys.com
Ex Libris MetaLib www.aleph.co.il/metalib
Ex Libris SFX www.sfxit.com
Fretwell-Downing Zportal www.fdusa.com/products/zportal.html
MuseGlobal MuseSearch www.museglobal.com/Products/MuseSearch
OpenURL www.niso.org/committees/committee_ax.html
Searchlight Searchlight.cdlib.org/cgi-bin/searchlight
WebFeat www.webfeat.org