Library Journal "Digital Libraries" Columns 1997-2007, Roy Tennant
Please note: these columns are an archive of my Library Journal column from 1997-2007. They have not been altered in content, so please keep in mind that some of this content will be out of date.
The Right Solution: Federated Search Tools
06/15/2003
I recently took library catalogs to task for being the wrong solution for the right problem--general information discovery (Digital Libraries, LJ 2/15/03, p. 28). But I believe that federated or cross-database search tools now available on the market are the correct solution for unifying access to a variety of information resources (for an earlier take on these tools, see LJ 10/15/01, p. 29ff.). These tools can search not only library catalogs but also commercial abstracting and indexing databases, web search engines, and a variety of other databases, while often merging and de-duplicating (a.k.a. de-duping) results. Although these software solutions are still in an early stage of development, they already offer key functionality, both proving the benefits of federated search tools and pointing to their potential. A number of libraries are already using such tools to serve their clientele better.

The players
Representative players in this market space are MuseGlobal, Endeavor, Ex Libris, WebFeat, and Fretwell-Downing. All have products that unify searching of a variety of databases (known as "targets"). They also provide additional services such as authentication, merging, and de-duping. The market is still in an early stage of development, with products varying quite a bit in features, stability, and ease of implementation. This means that if you're purchasing a system, you should exercise due diligence in comparing features, talking with current customers, and making sure that the product will deliver on its promises. It also means that implementation may not be a breeze.

Not just point and shoot
While some libraries may be perfectly happy doing a minimum amount of configuration to the "out of the box" interfaces of these products, a significant number of libraries will not. They may need to add extra targets not in the repertoire of the given application or reconfigure the user interface. Or they may need to go deeper by getting involved with target creation or enhancement. For example, some targets may not provide exactly the searching and results options (e.g., an option to output results in XML) that you would like. In cases where you exert some control over the target (e.g., your library catalog), you may be able to make changes to the system. This kind of work can be considered a type of target enhancement.

Hitting the right targets
In other cases, you may find that you must get involved in creating targets to be searched. In the case of a large academic library, an effective discovery service may require searching a number of institutional repositories (see Digital Libraries, LJ 9/15/02, p. 28ff.). While most such repositories support a standard that allows metadata to be harvested (the Open Archives Initiative Protocol for Metadata Harvesting, or OAI-PMH), it is not a searching protocol like Z39.50. In order to point a federated search tool at repositories such as these, a library would first need to harvest the metadata from the repositories of interest and make that metadata available to the federated search tool as a searchable target. In such a case, the library would be engaged in target creation (a sketch of the harvesting step appears below).

Another example of target creation would be focused web crawling and indexing of web sites related to a particular topic area. Rather than relying on Google or other general-purpose web search tools, it might make more sense to do a focused crawl of web sites important to a particular discipline.
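To make the harvesting step concrete, here is a minimal sketch in Python, assuming a hypothetical repository at repository.example.edu that speaks OAI-PMH. A production harvester would also follow resumptionTokens to page through large record sets and would load the harvested fields into whatever index the federated search tool queries.

```python
# Minimal OAI-PMH harvest sketch: fetch Dublin Core records from one
# repository so they can be indexed as a local federated-search target.
# The repository URL is hypothetical; real harvesters must also follow
# resumptionTokens to retrieve more than the first batch of records.
from urllib.request import urlopen
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

BASE_URL = "http://repository.example.edu/oai"  # hypothetical target

def harvest(base_url):
    """Yield (title, creator, identifier) tuples from an OAI-PMH feed."""
    url = base_url + "?verb=ListRecords&metadataPrefix=oai_dc"
    tree = ET.parse(urlopen(url))
    for record in tree.iter(OAI + "record"):
        meta = record.find(OAI + "metadata")
        if meta is None:  # deleted records carry no metadata
            continue
        def first(tag):
            el = meta.find(".//" + DC + tag)
            return el.text if el is not None else ""
        yield first("title"), first("creator"), first("identifier")

for title, creator, ident in harvest(BASE_URL):
    print(title, "|", creator, "|", ident)  # hand these to your indexer
```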
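The focused crawl can be sketched in the same spirit: start from a few discipline-specific seed pages and follow only links that stay within an approved list of hosts, handing each fetched page to an indexer. The seed URL here is invented for illustration, and a real crawler would add robots.txt handling, politeness delays, and sturdier HTML parsing.

```python
# Focused-crawl sketch: breadth-first crawl restricted to a whitelist of
# discipline-specific hosts. The seed URL is hypothetical; a production
# crawler also needs robots.txt checks, delays, and better error handling.
from urllib.request import urlopen
from urllib.parse import urljoin, urlparse
from html.parser import HTMLParser
from collections import deque

SEEDS = ["http://www.example-ornithology.org/"]  # hypothetical seed site
ALLOWED_HOSTS = {urlparse(s).netloc for s in SEEDS}
MAX_PAGES = 50  # crude cap to keep the sketch bounded

class LinkParser(HTMLParser):
    """Collect the href value of every anchor tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

queue, seen = deque(SEEDS), set(SEEDS)
while queue and len(seen) <= MAX_PAGES:
    url = queue.popleft()
    try:
        html = urlopen(url).read().decode("utf-8", "replace")
    except OSError:
        continue
    print("indexing", url)  # hand the page text to your indexer here
    parser = LinkParser()
    parser.feed(html)
    for link in parser.links:
        absolute = urljoin(url, link)
        # stay "focused": follow only links within the approved hosts
        if urlparse(absolute).netloc in ALLOWED_HOSTS and absolute not in seen:
            seen.add(absolute)
            queue.append(absolute)
```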
The de-duping challenge
In a market still in its infancy, there are a number of unmet challenges. Some are being addressed in varying degrees by the vendors (e.g., de-duplication). Others are likely to remain fairly intractable for some time (e.g., identifying what's online). A key challenge is the ability to de-duplicate search results reliably. Multiple records for the same item are inevitable with cross-database searching, and they are something users don't want. If the de-duping algorithms are too strict (i.e., records must match on multiple criteria to be merged), duplicates may slip through; if they are too loose, items that are not really duplicates may be merged. Most products allow the library to set the de-duplication protocol or protocols that should be applied, but some limit the total number of records that can be de-duped per search (a sketch of this matching trade-off appears below).

Making sense of results
A tougher problem is relevance ranking. It's one thing to return search results from a variety of sources, but it's another thing entirely to list them in order of relevance. To compute relevance, a system must assume some things related to user needs (e.g., would an overview be more appropriate than an in-depth piece?). It must also try to determine which items will best match the user's need--often with very little information to go on. Take library catalog records, for example. How would you go about ranking MARC records? If the records come from a union database, you may be able to use the number of holding libraries as an important ranking variable. But what do you use if that is not available?
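To illustrate the de-duplication trade-off described above, here is a minimal sketch in which records merge when a normalized match key collides; "strictness" is simply how many fields go into the key. The record fields are hypothetical, not drawn from any vendor's product.

```python
# De-duplication sketch: merge records whose normalized match keys collide.
# A "strict" key uses more fields (fewer false merges, but more duplicates
# slip through); a "loose" key uses fewer fields (the reverse).
import re

def normalize(value):
    """Lowercase and strip punctuation/whitespace so trivial differences
    (capitalization, trailing periods) don't defeat matching."""
    return re.sub(r"[^a-z0-9]", "", value.lower())

def match_key(record, fields):
    return tuple(normalize(record.get(f, "")) for f in fields)

def dedupe(records, fields):
    merged = {}
    for rec in records:
        merged.setdefault(match_key(rec, fields), rec)  # keep first seen
    return list(merged.values())

records = [
    {"title": "Digital Libraries.", "author": "Tennant, Roy", "year": "2003"},
    {"title": "digital libraries",  "author": "Tennant, R.",  "year": "2003"},
]
# Loose key (title + year): the two records above merge into one.
print(len(dedupe(records, ["title", "year"])))            # -> 1
# Strict key (title + author + year): the author variants keep them apart,
# so this duplicate slips through.
print(len(dedupe(records, ["title", "author", "year"])))  # -> 2
```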
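And here is a sketch of the holdings-count idea for ranking, with a crude term-overlap fallback when no holdings data is available. The weights and field names are invented for illustration only.

```python
# Relevance-ranking sketch for merged results: use the number of holding
# libraries as the primary signal when it exists, falling back to a crude
# term-overlap score between the query and the title. Weights are arbitrary.

def score(record, query_terms):
    title_terms = set(record.get("title", "").lower().split())
    overlap = len(query_terms & title_terms) / max(len(query_terms), 1)
    holdings = record.get("holdings")  # None when the target isn't a union db
    holdings_signal = min(holdings / 100.0, 1.0) if holdings else 0.0
    return 0.6 * holdings_signal + 0.4 * overlap

results = [
    {"title": "Introduction to Digital Libraries", "holdings": 250},
    {"title": "Digital Libraries", "holdings": None},
    {"title": "Cataloging Practice", "holdings": 900},
]
query = {"digital", "libraries"}
for rec in sorted(results, key=lambda r: score(r, query), reverse=True):
    print(round(score(rec, query), 2), rec["title"])
```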
A possibly intractable problem is reliably identifying what is online in full text. If you know that all of the records in a particular database have full text available, then the problem is solved for that particular target. But with libraries increasingly using OpenURL-based systems (such as Ex Libris's SFX product), it is difficult or impossible to know up front whether a given search result can be accessed online. Unfortunately, this problem is likely to get worse before it gets better. So what is a panicked undergraduate to do?

California's Searchlight
The California Digital Library launched a federated search service called Searchlight in 2000. It prompted users to pick between sciences and engineering or social sciences and humanities. From there, it took a simple user query and searched dozens of resources simultaneously. Since it was an early prototype, it did not attempt to merge or de-dupe results but rather displayed the number of hits in each database, with the databases organized in categories by type of resource. We learned that thousands of hits across a hundred or so resources are not always helpful, except to guide users to a particular database where they could then reenter their search using the unique options of that resource. We realized that an all-encompassing solution does not necessarily serve key user needs well. Through the Searchlight project, we came to realize that at one end of the scale is the undergraduate, who typically needs only "a few good things" to cite in a paper. At the other end of the scale is the graduate student or faculty member who wants much more thorough coverage of a subject discipline. A one-size-fits-all approach serves neither well.

Tailored portals
For a large university, tailored portals may be a better solution, where librarians can pull together resources specific to a topic area or purpose (such as a "core collection" for undergraduates). Meanwhile, smaller libraries or public libraries may wish to use federated search tools to provide one-stop shopping for a select group of databases. Luckily, it appears that these needs can be met by the federated search tools now on the market.
__________________________________________________________________

Link List
Endeavor ENCompass encompass.endinfosys.com
Ex Libris MetaLib www.aleph.co.il/metalib
Ex Libris SFX www.sfxit.com
Fretwell-Downing Zportal www.fdusa.com/products/zportal.html
MuseGlobal MuseSearch www.museglobal.com/Products/MuseSearch
OpenURL www.niso.org/committees/committee_ax.html
Searchlight Searchlight.cdlib.org/cgi-bin/searchlight
WebFeat www.webfeat.org