roytennant.com


Trouble in Online Paradise:
An Analysis of MARC 856 Usage at One Institution

Roy Tennant — March 23, 2007

A Note on Technique

In this unscientific analysis of the usage of 856 MARC fields at one institution, 1,000,000 MARC records were obtained without using scientific random sampling methods. Nonetheless, the records were not selected by any particular criteria and the sample size represents a significant percentage of the whole (the UC Berkeley catalog). This analysis, however, should be considered anecdotal in nature.

User Needs

Numerous user studies [1] by the California Digital Library strongly indicate that academic faculty and students (and students in particular) enjoy knowing when the full content of an item they seek is available on the Internet (for example, see "Earth Sciences Metasearch Portal Usability Testing," 2006, p. 8). This desire can manifest itself in at least a couple of ways: 1) the desire to immediately see on a screen of search results which items are completely available online, and 2) the desire to filter searches based on the availability of full content (that is, "show me only items I can get to on the Internet"). Meanwhile, most of our library catalogs do a very poor job of serving either of these user desires. [2]

To serve these needs well we need the ability to know, via software algorithms, which items in our library catalogs are fully available on the Internet. In library catalogs, this information is recorded in the MARC 856 field [3]. As more content becomes available in full text via mass digitization projects such as those by Google and the Open Content Alliance, the ability of libraries to specify when full text is available only becomes more important. Therefore, this analysis seeks to discover how difficult it is for libraries to ascertain which content is fully available online, and to recommend improvements that would make such determinations easier and more accurate.

The 856 Field

The 856 field of the MARC record is where the information needed to locate and access an electronic resource is recorded. There are a number of subfields and options (for example, indicators) that do not apply to this analysis, so the documents cited should be consulted for a full explication of the field and subfields. This analysis will focus on only the subfields used within the sample data.

Analysis

1st Indicator
Explanation: The 1st indicator is the access method, which for our purposes is nearly always 4, "HTTP (Hypertext Transfer Protocol)", although it could potentially also be 1 (FTP) or 7, which refers to the method specified in subfield $2.
Discussion: Only ten items (0.05%) in the sample were discovered to have a first indicator of 1, and some of them appear to have been miscoded. Another nearly 300 items (about 1.5% of the total) are coded with 7, 156 of which have an additional subfield of $2 that specifies "http" as the method. The vast majority were coded with 4 (98.42%). Unfortunately, this indicator records only the access method, not whether the item being accessed represents the complete item.
Conclusion: This indicator cannot be used to determine whether an item is fully available.
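A distribution like the one above can be produced with a simple tally. The following is a minimal sketch, assuming the indicators of each 856 field have already been extracted as two-character strings such as "41"; the function name and the sample list are hypothetical, and only the three indicator values discussed here are labeled.

```python
from collections import Counter

# Labels for the first-indicator values discussed in this analysis.
ACCESS_METHODS = {"1": "FTP", "4": "HTTP", "7": "method given in $2"}


def tally_first_indicators(indicator_pairs):
    """Count 856 fields by access method.

    `indicator_pairs` is an iterable of two-character strings such as "41",
    where the first character is the first indicator (an assumed input shape).
    """
    counts = Counter()
    for pair in indicator_pairs:
        first = pair[0] if pair else " "
        counts[ACCESS_METHODS.get(first, "other/blank")] += 1
    return counts


# Hypothetical sample, not taken from the Berkeley data.
sample = ["41", "40", "42", "7 ", "1 ", "41"]
for method, count in tally_first_indicators(sample).most_common():
    print(f"{method}: {count} ({count / len(sample):.1%})")
```

As the conclusion says, such a tally describes how items are reached, not whether they are complete.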

2nd Indicator
Explanation: The second indicator is for specifying the relationship of the item referenced by the URL in the 856 field to the item described in the MARC record. There are almost exclusively three entries that appear in the sample. Indicator 0 indicates it is the item that is described by the record. Indicator 1 specifies a version of the resource. Indicator 2 specifies a related resource.
Discussion: The majority (64.86%) of the 856s in the sample have an indicator of 1, followed by those with an indicator of 0 (20.08%), and trailed by those with an indicator of 2 (10.62%). In this sample, records with an indicator of 2 seemed the least likely to be helpful in finding items fully available online, since most entries were for credits from the Internet Movie Database, a publisher description, or an archival finding aid. A second indicator of 0, which specifies that the URL points to the item described by the MARC record, is generally a good sign that the URL in the 856 points to the complete item. The presence of indicator 1 does not by itself indicate the presence of the complete item. Rather, it is necessary to use additional criteria to determine complete availability.
Conclusion: When a second indicator of 0 exists, it can be considered an indicator of full content availability. When a second indicator of 1 exists, it cannot be assumed that the complete item exists online, but it may be possible to use it in conjunction with other indicators to so determine, at least in some cases. A second indicator of 2 can be considered to indicate that the 856 does not point to the complete item.
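The conclusion above amounts to a three-way decision, which can be sketched as follows. This is an illustration only: the enum and function names are hypothetical, and any value other than 0 or 2 (including blank) is treated as undetermined, which is an assumption rather than something the sample data settles.

```python
from enum import Enum


class Availability(Enum):
    FULL = "full content likely"
    UNKNOWN = "needs further checks"
    NOT_FULL = "not the complete item"


def classify_second_indicator(second_indicator: str) -> Availability:
    """Map an 856 second indicator to a tentative availability verdict."""
    if second_indicator == "0":
        # The URL points to the resource described by the record.
        return Availability.FULL
    if second_indicator == "2":
        # Related resource (credits, publisher description, finding aid, ...).
        return Availability.NOT_FULL
    # "1" (a version of the resource), blank, or anything else:
    # other evidence in the field is needed.
    return Availability.UNKNOWN


print(classify_second_indicator("1").value)
```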

Subfield $3
Explanation: Subfield $3 defines the part of the described materials to which the field applies.
Discussion: Over 50% of the 856 fields in the sample have a $3 subfield. At least 42% of the $3 subfields specify the part to be "Table of contents". At least another 7% specify a publisher description. Over 7% identify a PDF version, with a Text version being identified in an additional 7+% of the cases. Since this is a free-text field, however, descriptions can vary. Examples of entries in this field include "Summary (HTML) and complete full text (PDF)", "Finding aid :", "Current issue:", "Sample text", "Abstract and full text", "maps", "v.1(2001)-", etc.
Conclusion: When a clearly negative (e.g., "Table of contents") or positive (e.g., "PDF version") declaration exists, this subfield could be helpful in determining whether or not full content is available. But there is an unknown percentage of edge cases that will require sophisticated machine processing to classify properly.
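A first pass at using $3 could be simple phrase matching, as in the sketch below. The phrase lists are assumptions drawn from the examples quoted above, not a vetted vocabulary, and the choice to test positive phrases first (so that "Abstract and full text" counts as positive) is one possible policy among several.

```python
# Phrase lists are illustrative assumptions based on the sample entries above.
POSITIVE_PHRASES = ("full text", "pdf version", "text version")
NEGATIVE_PHRASES = (
    "table of contents",
    "publisher description",
    "finding aid",
    "sample text",
)


def classify_subfield_3(value: str) -> str:
    """Return "positive", "negative", or "unknown" for a $3 value."""
    text = value.casefold()
    # Positive phrases are checked first so that mixed entries such as
    # "Abstract and full text" are not rejected outright.
    if any(phrase in text for phrase in POSITIVE_PHRASES):
        return "positive"
    if any(phrase in text for phrase in NEGATIVE_PHRASES):
        return "negative"
    return "unknown"


for entry in ("Table of contents", "Abstract and full text", "v.1(2001)-"):
    print(entry, "->", classify_subfield_3(entry))
```

Entries that fall through to "unknown" ("maps", "Current issue:", and so on) are the edge cases that would need review or more sophisticated processing.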

Subfield $z
Explanation: Subfield $z is an optional note designed to explain to end-users anything that relates to accessing the item referenced in the 856.
Discussion: Over 37% of the sampled 856s have a $z subfield. Of this total, almost half of the $z fields contain a message related to restricted access to purchased or licensed content. Over 15% of the $z subfields (5.7% of the entire sample) specified "Adobe Acrobat Reader required", which seems to indicate the availability of full text. Variations on that theme such as "Click here to download PDF file" indicate that the availability of full text is likely higher once all the permutations of this are taken into account. Other entries in this subfield include "full text available in pdf format at same site;", "Credits from Internet Movie Database", and "Freely available."
Conclusion: When a clearly negative (e.g., "Credits from Internet Movie Database") or positive (e.g., "Adobe Acrobat Reader required") declaration exists, this subfield could be helpful in determining whether or not full content is available. But there is an unknown percentage of edge cases that will require sophisticated machine processing to classify properly.
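Because $z notes come in many permutations, regular expressions may cope with the variation better than exact string comparison. The sketch below is built only from the notes quoted above; the patterns are assumptions and would need tuning against a real catalog.

```python
import re

# Patterns are illustrative assumptions derived from the quoted $z notes.
POSITIVE_Z = re.compile(
    r"acrobat\s+reader\s+required|download\s+pdf|full[\s-]?text\s+available",
    re.IGNORECASE,
)
NEGATIVE_Z = re.compile(r"internet\s+movie\s+database", re.IGNORECASE)


def classify_subfield_z(note: str) -> str:
    """Return "positive", "negative", or "unknown" for a $z public note."""
    if NEGATIVE_Z.search(note):
        return "negative"
    if POSITIVE_Z.search(note):
        return "positive"
    # Notes such as "Freely available." say nothing about completeness.
    return "unknown"


print(classify_subfield_z("Click here to download PDF file"))
```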

Other Techniques
By analyzing the 856 fields within a given database it is possible to determine patterns that reliably indicate full content availability. For example, within the UC Berkeley dataset analyzed here there are numerous 856 fields that look like this: 856 41$uhttp://www.mip.berkeley.edu/cgi-bin/csmp?000338 and 856 41$uhttp://sunsite.berkeley.edu/TechRepPages/CSD-91-633. Once patterns like this are identified, it would be possible to programmatically assign them to one category or another based on URL string matching (in these cases full content availability).

In another set of cases 856 fields such as 856 41$uhttp://www.srs.fs.fed.us/pubs/gtr/gtr%5Fsrs010.pdf can reasonably be assumed to lead the user to the full-text due to the filename extension (.pdf).

There may be exceptions to some of these assumptions, but I would guess that they would be very few in number if the decisions were carefully made, and the benefits of including such items in the pool of "fully available" are potentially substantial.
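The two URL-based checks described in this section can be sketched together. The prefixes below are generalized from the two Berkeley examples and are an assumption: a real list would come from reviewing the local data. The function name is hypothetical.

```python
from urllib.parse import unquote, urlparse

# Assumed prefixes, generalized from the two example 856 fields above.
FULL_CONTENT_URL_PREFIXES = (
    "http://www.mip.berkeley.edu/cgi-bin/csmp?",
    "http://sunsite.berkeley.edu/TechRepPages/",
)


def url_suggests_full_content(url: str) -> bool:
    """Guess full-content availability from the $u URL alone."""
    # Locally identified patterns known (by review) to lead to full content.
    if url.startswith(FULL_CONTENT_URL_PREFIXES):
        return True
    # Otherwise fall back to the filename extension, decoding escapes
    # such as %5F before looking at the end of the path.
    path = unquote(urlparse(url).path)
    return path.lower().endswith(".pdf")


print(url_suggests_full_content("http://www.srs.fs.fed.us/pubs/gtr/gtr%5Fsrs010.pdf"))
```

Only the path is inspected for the extension, so a ".pdf" that appears in a query string would not count under this sketch.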

Recommendations

Short-Term Strategies
It is possible with enough review of the existing data, similar to what I have done here, to specify an algorithm to check multiple conditions to make a determination of full online availability. Some checks will result in a clear negative determination (e.g., "Table of contents"), while others will have a clear positive (e.g., the existence of a second indicator of 0). A study of options and trade-offs will reveal whether it is better to do this in a batch process and add an unambiguous indicator to the record, or do these checks at the point of display. In the absence of any guidance from a centralized cataloging authority, the latter may be the safer route if the processing overhead is not too onerous.
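One possible shape for such a multi-condition check is sketched below. Everything here is an assumption for illustration: the Field856 class, the phrase lists, the one-value-per-subfield simplification (MARC allows some subfields to repeat), and the rule that a clear negative outranks a positive. The example.org URL is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Field856:
    """Simplified 856 field: indicators plus one value per subfield code."""
    ind1: str = " "
    ind2: str = " "
    subfields: dict = field(default_factory=dict)  # e.g. {"u": ..., "3": ..., "z": ...}


# Illustrative phrase lists taken from the examples in this analysis.
NEGATIVE_NOTES = ("table of contents", "publisher description", "internet movie database")
POSITIVE_NOTES = ("pdf version", "acrobat reader required")


def is_fully_online(f: Field856) -> Optional[bool]:
    """Return True, False, or None when the field gives no clear signal."""
    notes = " ".join(f.subfields.get(code, "") for code in ("3", "z")).casefold()
    url = f.subfields.get("u", "").casefold()
    # Clear negatives are checked first (an assumed precedence).
    if f.ind2 == "2" or any(p in notes for p in NEGATIVE_NOTES):
        return False
    # Then clear positives.
    if f.ind2 == "0" or any(p in notes for p in POSITIVE_NOTES) or url.endswith(".pdf"):
        return True
    return None


toc = Field856("4", "1", {"3": "Table of contents", "u": "http://example.org/toc.html"})
print(is_fully_online(toc))
```

The same function could be run in a batch job that writes a flag into each record or called when a record is displayed; a None result marks the fields that would still need human review.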

Long-Term Strategies
Within the existing work to refactor our bibliographic infrastructure (for example, the present work on Resource Description and Access), create a clear and unambiguous method to specify when a URL points to the complete item being described.

Conclusions

The students and faculty at the University of California have made it clear that they highly value full online access to the items they locate, and I doubt there are many users who would disagree with this opinion. Therefore it is important to solve the current problems we have with making it easy for our users to both discover this content and easily retrieve it.

As I have outlined here in anecdotal form, our MARC/AACR2 infrastructure and how it has been put into practice within at least one library is preventing us from effectively fulfilling these valid user needs. For the short-term I believe there are some strategies that can provide a decent solution to these issues for the bulk of our material. But going forward, we need to find and implement an unambiguous, machine-processable method to specify when a URL will fetch the complete item. We cannot allow optional local practices [4] to continue to be the basis upon which we rest our solutions.


[1] California Digital Library. Evaluation and Assessment Reports, <http://www.cdlib.org/inside/assess/evaluation_activities/>.
[2] Tennant, Roy. "The Trouble with Online," Library Journal (September 15, 2004), <http://libraryjournal.com/article/CA452319.html>.
[3] OCLC Online Computer Library Center. 856 Electronic Location and Access, <http://www.oclc.org/bibformats/en/8xx/856.shtm> and Library of Congress. MARC 21 Concise Bibliographic: Holdings, Location, Alternate Graphics, etc. Fields (841-88X), <http://www.loc.gov/marc/bibliographic/ecbdhold.html#mrcb856>.
[4] More information on UC local practice can be found at CDL Catalog Guidelines: CDL Conventions for Cataloging Electronic Resources, <http://libraries.universityofcalifornia.edu/hots/tfer/tferguidcon.html>, 2003.