Cataloging – Competency G

Demonstrate understanding of basic principles and standards involved in organizing information such as classification and controlled vocabulary systems, cataloging systems, metadata schemas, or other systems for making information accessible to a particular clientele.

Introduction

The frenetic nature of the COVID-19 pandemic illustrated the need for better management of information. When the world first went into lockdown, educated guesses and unfounded speculation sowed confusion amongst the general public. Time has passed since those first tumultuous weeks, and with vaccines now being distributed, an end to the crisis is in sight, but it is disheartening that misinformation continues to stymie the recovery. Much work remains to be done on expanding access to, and acceptance of, credible, scientifically sound information. Towards that end, information science professionals continue to catalog credible data as it is published, both to help identify the material as credible and to provide subject access to it. Some of this classification is done through controlled vocabulary and cataloging systems. As catalogers process new material, they create surrogate records following the structure of a metadata schema such as Machine-Readable Cataloging (MARC) and the guidelines of a content standard such as Resource Description and Access (RDA). As the field of information science grapples with the spread of misinformation in the digital age, these surrogate records can serve as a foundation to expand the reach of credible information.

Regardless, it is indisputable that there is a vast amount of knowledge in the world, and that the world requires knowledge organization systems to manage this knowledge and make it available upon demand. Cataloging systems do this primarily by referencing surrogate bibliographic records, which are created (or copied) by catalogers when their institutions acquire a resource. It should be noted that the field of information science has coalesced around a replacement for MARC, albeit one still in development. Its current iteration is called BIBFRAME 2.0. It is a basic criterion for information science professionals to be knowledgeable about information organization systems, to understand how controlled vocabulary enhances their performance through improved precision and greater recall, and to understand the optimal applications for each metadata schema.

Explication

As an information science professional, my progression through the courses of INFO 248 Beginning Cataloging, INFO 220 Medical Librarianship, LIBR 247 Vocabulary Design, and INFO 256 Archives and Manuscripts has given me a solid foundation in the principles of cataloging, classification, controlled vocabulary systems, and metadata schemas. INFO 220 focused on Medical Subject Headings (MeSH). INFO 256 broadened my knowledge of how the field of information science, and archivists specifically, catalog materials that are not necessarily literary works through adherence to Describing Archives: A Content Standard (DACS).

Controlled Vocabulary

Controlled vocabularies are at the center of effective cataloging. A controlled vocabulary designates a single authorized term for each concept that might otherwise be expressed by any number of synonyms (Bolin, 2016, p. 75). A controlled vocabulary also takes great pains to distinguish between homographs such as Mice (Animals) versus Mice (Computers). A controlled vocabulary must be maintained with institutional support. This is critical when developments within the vocabulary’s domain provide the literary warrant to add a new term: a single authorized term is selected, instead of multiple stakeholders haphazardly adding their own descriptors to the thesaurus. By using consistent terms, subjects may be aggregated (Bolin, 2016, p. 16). Controlled vocabularies are also syndetic: they outline the broader, narrower, and associative relationships for their authorized terms, and so can be referred to as syndetic systems (Khosrowpour, personal communication, 2020).
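The mapping from variant wording to a single authorized term can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions: the `Term` class, the `authorized_term` function, and every entry other than the Mice (Animals) and Mice (Computers) headings are hypothetical and are not taken from LCSH, MeSH, or any other published vocabulary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Term:
    """One authorized term and its syndetic relationships (hypothetical structure)."""
    label: str
    used_for: List[str] = field(default_factory=list)  # non-preferred synonyms (UF)
    broader: List[str] = field(default_factory=list)   # broader terms (BT)
    narrower: List[str] = field(default_factory=list)  # narrower terms (NT)
    related: List[str] = field(default_factory=list)   # associative terms (RT)


# Illustrative entries only; the parenthetical qualifiers separate the homographs.
THESAURUS = [
    Term("Mice (Animals)", used_for=["house mice", "field mice"], broader=["Rodents"]),
    Term("Mice (Computers)", used_for=["computer mouse"], broader=["Computer input devices"]),
]


def build_entry_index(terms: List[Term]) -> Dict[str, str]:
    """Map every authorized label and every synonym to its authorized label."""
    index: Dict[str, str] = {}
    for term in terms:
        index[term.label.lower()] = term.label
        for variant in term.used_for:
            index[variant.lower()] = term.label
    return index


ENTRY_INDEX = build_entry_index(THESAURUS)


def authorized_term(user_wording: str) -> Optional[str]:
    """Return the authorized term for a user's wording, or None if it is not in the vocabulary."""
    return ENTRY_INDEX.get(user_wording.strip().lower())
```

Calling `authorized_term("Computer Mouse")` leads to Mice (Computers), while a word that has no literary warrant in this tiny vocabulary, such as "rats", returns `None` rather than a guess.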

Controlled vocabularies stand in contrast to uncontrolled vocabularies, as when a discovery tool indexes words from the actual document or relies on user-generated terms or tags, as in folksonomies. The complexities of a natural language undermine the effectiveness of search queries. The ways in which a natural language vocabulary diminishes the efficacy of retrieval can be broadly divided into two categories. Firstly, discovery tools typically retrieve only exact matches, meaning that documents that use alternative forms, such as singular and plural forms of the same word or synonyms (such as juveniles versus children), risk not being identified as relevant, lowering the recall of the discovery tool in question. Secondly, the presence of homographs in a natural language search query could result in the retrieval of irrelevant resources, thereby lowering the search tool’s precision.
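To make the two measures concrete, the following is a small sketch in Python. The four documents, their identifiers, and the relevance judgments are invented for illustration; precision is computed as the share of retrieved documents that are relevant, and recall as the share of relevant documents that are retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved and a set of relevant document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: d1 to d3 are about library services for young people, d4 is not.
DOCS = {
    "d1": "services for children in public libraries",
    "d2": "reading programs for juveniles",
    "d3": "a child and the library",
    "d4": "children of the corn film review",
}
RELEVANT = {"d1", "d2", "d3"}


def exact_match(query, docs):
    """Retrieve only documents that contain the query word exactly as typed."""
    return {doc_id for doc_id, text in docs.items() if query in text.split()}


retrieved = exact_match("children", DOCS)
precision, recall = precision_recall(retrieved, RELEVANT)
```

In this toy collection, the exact-match search for "children" misses the synonym in d2 and the singular form in d3, so only one of the three relevant documents is recalled, and the irrelevant d4 halves the precision.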

It should be noted that there are benefits to using a natural language vocabulary for searching. Many times, users do not need expansive searches to meet their information needs. In such cases, tagging can quickly identify a relevant resource to satisfy their requirements.

Researchers and academics are not the only people who use information retrieval systems in their everyday life. Lay persons often lack familiarity with a controlled vocabulary and have not received any specialized training to leverage the benefits that a controlled vocabulary may offer. For such persons, it is entirely possible that searching the user-generated tags of a folksonomy may provide superior results.

Other than controlled vocabulary systems, classification systems are the other primary way to express the aboutness of an object. In the United States, the two most widespread classification systems in libraries are the Dewey Decimal Classification (DDC) and the Library of Congress Classification (LCC).

I am familiar with the Library of Congress Subject Headings (LCSH) and Medical Subject Headings (MeSH).

Classification

Classification systems sum up aboutness in a single, alphanumeric string (Khosrowpour, personal communication, 2020). Traditionally, classification systems also dictate the physical arrangement, but this association is less important for information stored digitally.

The more prominent classification systems in librarianship and information science are hierarchical. In a hierarchical classification system, resources and artifacts within a classification system’s domain are grouped into classes by a single trait such as subject. These classes should be collectively comprehensive for every item in the domain but mutually exclusive so that there is no overlap between the classes. As the term hierarchical implies, these classes may also be divided into subclasses, which should be comprehensive for the domain of the parent class but mutually exclusive amongst each other.

The Dewey Decimal Classification (DDC) system provides a well-known example of this principle. The DDC has a domain divided into ten main classes, which are further subdivided into The Hundred Divisions, which are in turn subdivided into The Thousand Sections (Online Computer Library Center [OCLC], 2003). Each Dewey Decimal Classification number (DDC number) begins with a base number of three digits followed by a decimal point. The digit in the “hundreds place” corresponds to the main class. The digit in the “tens place” corresponds to the “division” within the designated class, and the digit in the “ones place” corresponds to the “section” within the designated “division.”
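The positional logic of the three-digit base number can be sketched as follows. The function name `split_ddc_base` and the dictionary keys are my own labels, and the sketch covers only the base number, not anything after the decimal point.

```python
def split_ddc_base(ddc_number):
    """Split the three-digit base of a DDC number into main class, division, and section."""
    base = ddc_number.strip().split(".")[0]
    if len(base) != 3 or not base.isdigit():
        raise ValueError(f"expected a three-digit base number, got {base!r}")
    hundreds, tens, _ones = base
    return {
        "main_class": hundreds + "00",       # hundreds place: one of the ten main classes
        "division": hundreds + tens + "0",   # tens place: division within the main class
        "section": base,                     # ones place: section within the division
    }
```

For example, `split_ddc_base("784.092")` places the number in main class 700, division 780, and section 784.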

DDC has flaws. One of its principal flaws is how it has incorporated facets into its structure. Facets can be likened to traits or attributes such as language, geographical location, or date of creation. As a classification system that incorporates facets, DDC expresses the facet of biographical work through a single numeric structure, “92” (Davis, 2021). The construction can be found within the DDC division 78X (music), where biographies of classical musicians are assigned the DDC number 780.92 and biographies of popular musicians are assigned 784.092 (UCF Libraries, 2021).

Theoretically, when a composite subject is divided into multiple facets such as time and location, a faceted index may provide subject access for each facet (Zhang, personal communication, 2015). Recent updates to the DDC have attempted to overhaul how facets are expressed in DDC numbers. In fact, DDC 23 was created with the specific intention of allowing computers easier access to facets, but the limited number of characters makes the avoidance of notational re-use a mathematical improbability (Tinker, 2005, p. 96). Many structural problems have yet to be resolved.

Primarily, DDC 23 lacks an overarching cohesive structure for facets. Because of this, discovery tools are not able to leverage DDC facets to provide users with more precise results. The problem with the DDC’s facets as currently employed by DDC 23 is that DDC 23 is not accessible to the public, and historical structural differences keep online public access catalogs (OPACs) from effectively leveraging the subject access that facets may otherwise provide (Online Computer Library Center [OCLC], 2019).

Some may take it as criticism that facets were not added to the DDC in a manner to allow for computers to easily decipher them, but one should not forget that the Dewey Decimal Classification system was a leap in classification when it was first introduced in 1876. At the time, computers and their ability to process information and capacity to store information had yet to be conceptualized.

DDC 23 suffers from a lack of “facet indicators.” Universal Decimal Classification (UDC) uses symbols such as + : = ( ) to denote facet boundaries and the nature of the facet, such as language, form, or time (Tinker, 2005, p. 95). That is not to say that the DDC totally lacks indicators of facets, but they are not used consistently, and when they are employed it is difficult for computers to parse the character as a facet indicator rather than as a part of the Dewey Decimal Number. For example, the use of the number zero as a facet indicator of standard subdivisions in the DDC is problematic because zero is also used in the DDC for purposes other than as a facet indicator, making it extremely difficult to have computers recognize the border of facets. Of course, not all facets are standard subdivisions.
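The parsing difficulty can be illustrated with a deliberately naive rule applied to the two biography numbers cited earlier, 780.92 and 784.092. The rule and the function name below are my own invention for the sake of the illustration; they are not part of the DDC or of any OCLC tool.

```python
def naive_standard_subdivision(ddc_number):
    """Naive rule: treat the first zero after the decimal point as the facet indicator.

    Returns the digits from that zero onward, or None when no zero is found.
    """
    _, _, decimals = ddc_number.partition(".")
    position = decimals.find("0")
    if position == -1:
        return None
    return decimals[position:]


popular = naive_standard_subdivision("784.092")    # finds the zero and returns "092"
classical = naive_standard_subdivision("780.92")   # finds no zero and returns None
```

The rule recognizes the biography facet in 784.092 but finds nothing in 780.92, even though both numbers are described above as biography numbers. Because the only zero in 780.92 sits in the base number, a program cannot tell from the notation alone where the subject ends and the facet begins.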

While DDC has evolved to make its classification of subjects through the use of facets more accessible, the current state of DDC leaves lay users, researchers, and information science professionals unable to leverage the potential of facets.

A criticism of the Dewey Decimal System is its separation of 400 Language and 800 Literature and rhetoric. Classification systems should be evaluated in accordance with the logical order of their main classes, where related classes should be grouped together. The assignment of 400 to Language and 800 to Literature and rhetoric raises a few eyebrows (Tinker, 2005, p. 17). Another weakness of the DDC is Class 200 Religion. Its subdivisions strongly reflect a Western Christian bias, with seven out of ten of its divisions dedicated to Christianity and only one of its divisions, 290, set aside for “Other religions.” Similarly, Class 4XX Language and Class 8XX Literature reflect an Anglo-European bias.

Metadata Schema

In the digital age, a metadata schema is an encoding standard, sometimes referred to as an encoding schema. Metadata schemas dictate the framework or structure that is used to encode bibliographic information and authority data. Metadata schemas dictate what data elements may be used in a descriptive record and what tags or computer codes are used to identify types of data elements, but they do not dictate what a particular data element should be composed of. The composition of the data elements is articulated by cataloging codes, otherwise known as content standards, such as Resource Description and Access (RDA) and its predecessor Anglo-American Cataloguing Rules 2nd edition (AACR2) (Zhang & Gourley, 2009).

Machine-Readable Cataloging

Machine-Readable Cataloging (MARC) is the most common encoding schema in use today, but the information science field is actively working to adopt a more modern standard. MARC was created by the Library of Congress in the late 1960s (Bolin, 2016, p. 5). MARC assigns numbered tags to fields or data elements that are used to create bibliographic records and authority records (Bolin, 2016, pp. 68–69). Each MARC variable data field has two indicator positions, which may contain a numerical character or be left blank, to describe the contents of that field, although what an indicator describes varies from field to field.
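The anatomy of a single MARC field (a numbered tag, two indicator positions, and coded subfields) can be modeled with a short Python sketch. The `MarcField` class is a hypothetical teaching aid rather than the interface of any cataloging software, the sample values are invented, and the "#" shown for a blank indicator follows the display convention used in MARC documentation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MarcField:
    """A simplified model of one MARC variable data field (hypothetical class)."""
    tag: str                           # three-digit field tag, e.g. "245" for the title statement
    indicator1: str                    # a single character; a space means blank
    indicator2: str
    subfields: List[Tuple[str, str]]   # (subfield code, value) pairs in order

    def display(self) -> str:
        """Render the field in the conventional one-line form, showing blanks as '#'."""
        indicators = "".join(
            "#" if ind == " " else ind for ind in (self.indicator1, self.indicator2)
        )
        body = " ".join(f"${code} {value}" for code, value in self.subfields)
        return f"{self.tag} {indicators} {body}"


# Invented sample data; what each indicator means depends on the field it belongs to.
title_field = MarcField("245", "1", "0", [("a", "An example title")])
subject_field = MarcField("651", " ", "0", [("a", "California")])
```

Here `title_field.display()` produces `245 10 $a An example title`, and `subject_field.display()` produces `651 #0 $a California`, with the blank first indicator shown as "#".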

One of the greatest strengths of MARC is its widespread use. This benefit can most directly be exemplified through the WorldCat union catalog operated by the Online Computer Library Center (OCLC). By using the MARC metadata schema, the collections and holdings of libraries, archival repositories, and museums can be collectively examined on a global scale with a single query. Institutions may export MARC records of acquisitions to the WorldCat catalog and download MARC records from WorldCat into their own catalogs.

Unfortunately, MARC has been an obstacle to more precise searching than online public access catalogs have traditionally offered. Take, for instance, searching for content in the field of music. Should a collection contain representations of musical works, these works may have many contributors. There can be composers, lyricists, performers, conductors, and editors. Even if an information retrieval system allows users to distinguish between these roles, the complexity and nonuniformity of how metadata is stored in a MARC environment reduce the recall of relevant bibliographic records. This hampers the ability of end users to procure the data that they require. Where a single query should retrieve all relevant records within the query’s specified domain, end users in practice have to conduct multiple searches to replicate such results when the bibliographic information is stored in a MARC environment. With advances in computational power and other technologies, the process of acquiring information could be and should be more efficient. Given that other facets such as instrumentation, editions, and formats can also cause difficulty, it is not surprising that the field of librarianship and the information science community has elected to replace the MARC metadata schema.

When MARC was created in the 1960s, processing power and data storage were expensive. This made it prudent to encode concepts with a three-digit numeric tag, such as “651” for a geographic name used as a subject added entry, and to encode a facet such as form under a subfield code such as “$v” (Library of Congress [LC], 2017). Only dedicated catalogers can expend the time necessary to memorize the nomenclature of MARC 21 and decipher a raw MARC bibliographic record. For all others concerned, encoding information in MARC erects an unnecessarily high barrier to accessing bibliographic information. This is what has made Extensible Markup Language (XML) and its alphanumeric tags so appealing as a replacement for MARC. The alphanumeric nature of these tags allows words (i.e., metadata) to describe the nature of the information that they enclose, thereby overcoming this shortcoming of MARC.
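The contrast between numeric tags and descriptive tags can be sketched by re-expressing a 651 field in XML with word-based element names. The element names `geographicName`, `name`, and `formSubdivision` are labels I made up for this illustration; they do not come from MARCXML, MODS, or any other published schema, and the sample values are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical human-readable labels for one MARC tag and two of its subfield codes.
TAG_LABELS = {"651": "geographicName"}
SUBFIELD_LABELS = {"a": "name", "v": "formSubdivision"}


def to_labelled_xml(tag, subfields):
    """Re-express a MARC field as XML whose element names describe their content."""
    element = ET.Element(TAG_LABELS[tag])
    for code, value in subfields:
        child = ET.SubElement(element, SUBFIELD_LABELS[code])
        child.text = value
    return ET.tostring(element, encoding="unicode")


xml_text = to_labelled_xml("651", [("a", "California"), ("v", "Maps")])
```

The result is `<geographicName><name>California</name><formSubdivision>Maps</formSubdivision></geographicName>`, which a non-cataloger can read without consulting the MARC 21 documentation. The trade-off is length: the numeric form of the same field is far more compact.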

While much has been done to modify MARC to prolong its useful lifespan, the difficulty of applying the principles of linked data to it and of cataloging relationships between entities, such as part-whole and series relationships, has motivated the development of a replacement. This is proving to be a difficult task, with Xu, Hess, and Akerman (2018) concluding that the designated replacement, BIBFRAME 2.0, is not quite ready to be adopted by the wider profession. In the meantime, segments of information science and related fields make use of other metadata schemas.

Extensible Markup Language

The three schemas discussed below (Dublin Core, METS, and EAD) are encoded in Extensible Markup Language (XML) for digital transmission. XML was designed to transport and store data (W3Schools, 2021). It does this by storing information in tags that are not predefined: the software of the recipient or storing entity determines the function of a tag, as opposed to HTML tags, which have predefined functions. This grants XML the trait of software independence, which makes XML an ideal encoding language for several metadata schemas.
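The idea that the receiving software, not XML itself, decides what a tag means can be sketched with Python's standard `xml.etree.ElementTree` module. The `<recording>` document, its tag names, and both consumer functions are hypothetical examples of my own.

```python
import xml.etree.ElementTree as ET

# A small document whose tags were made up by its author, not predefined by XML.
DOCUMENT = """<recording>
  <title>Example Sonata</title>
  <composer>Example Composer</composer>
  <performer>Example Pianist</performer>
</recording>"""

root = ET.fromstring(DOCUMENT)


def catalog_display(record):
    """One consumer: treats every tag name as a display label."""
    return [f"{child.tag}: {child.text}" for child in record]


def contributor_index(record):
    """Another consumer: acts only on the tags it chooses to treat as contributor roles."""
    roles = {"composer", "performer"}
    return {child.text: child.tag for child in record if child.tag in roles}
```

The same three elements yield a three-line display from `catalog_display(root)` and a two-entry role index from `contributor_index(root)`. Neither behavior is built into the tags; each program assigns its own function to them.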

Dublin Core

Dublin Core was originally developed to describe web resources but now describes a variety of physical and digital resources. What sets Dublin Core apart from other widely recognized metadata schemas is its simplicity. Its core, the Dublin Core Metadata Element Set, outlines 15 metadata elements to describe physical and digital resources. These 15 metadata elements are title, subject, description, creator, publisher, contributor, date, type, format, identifier, source, language, relation, coverage, and rights (University of California, Santa Cruz [UCSC], University Library, 2017).
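A simple Dublin Core description can be sketched in Python as XML that admits only the 15 elements listed above. The `record` wrapper element, the function name, and the sample values are my own assumptions; the namespace URI is the one published for the Dublin Core element set.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
DC_ELEMENTS = (
    "title", "subject", "description", "creator", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def dublin_core_record(values):
    """Build an XML record from a dict, accepting only the 15 core element names."""
    unknown = set(values) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    ET.register_namespace("dc", DC_NS)   # serialize with the conventional dc: prefix
    record = ET.Element("record")        # hypothetical wrapper element
    for name in DC_ELEMENTS:
        if name in values:
            ET.SubElement(record, f"{{{DC_NS}}}{name}").text = values[name]
    return record


# Invented sample values.
sample = dublin_core_record({"title": "Sample finding aid", "language": "en"})
```

Every element is optional in this sketch, so the sample record carries only a title and a language, while an element name outside the set, such as "colour", is rejected with a `ValueError`.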

Metadata Encoding and Transmission Standard

“The METS schema is a standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library, expressed using the XML schema language” (Library of Congress [LC], 2021). The METS Board works in collaboration with the Network Development and MARC Standards Office of the Library of Congress to maintain METS.

Encoded Archival Description

Encoded Archival Description (EAD) is an international standard maintained by the Library of Congress in partnership with the Society of American Archivists (SAA) (Library of Congress [LC], n.d.-b). The metadata schema EAD standardizes the encoding structure for finding aids so that researchers may search through the contents of archives and manuscript repositories and locate relevant primary sources (de Lorenzo, personal communication, 2020). A finding aid describes a collection’s arrangement. Depending on the processing level, a finding aid may be detailed and expansive or as succinct as a simple summary (de Lorenzo, personal communication, 2020).

By utilizing a standardized XML structure, EAD allows researchers a high degree of precision in searching for primary sources pertinent to their current line of research. Whether an archival collection contains records, photographs, multimedia objects, or other artifacts, a finding aid should help a researcher determine if the contents of the collection will further his or her own research. It should be noted that bibliographic records that are structured to follow EAD are encoded in accordance with the Describing Archives: A Content Standard (DACS). Content standards will be discussed more thoroughly in a later section of this essay. One drawback of EAD is that it cannot house audio or moving images and cannot browse digital objects (de Lorenzo, personal communication, 2020).

BIBFRAME 2.0

The field of information science has consciously decided to replace the MARC metadata schema with a metadata structure that focuses on linked data as the new global standard to transmit and store bibliographic information.

The “BIBFRAME initiative…is designed to integrate with and engage in the wider information community and still serve the very specific needs of libraries” (Library of Congress [LC], n.d.-a). A MARC environment is focused on records that are independently understandable. While such a result can be beneficial, in practice this has resulted in significant duplication of information. The BIBFRAME Model seeks instead to rely on relationships between resources, such as Work-to-Agent relationships, to describe works, instances, and items. BIBFRAME accomplishes this through the use of controlled identifiers. While MARC uses controlled identifiers in some instances, such as geographic codes and language codes, controlled identifiers are the rule and not the exception in BIBFRAME (Library of Congress [LC], n.d.-a).
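The difference between self-contained records and linked descriptions can be sketched with plain Python tuples standing in for linked-data statements. Everything here is an assumption made for illustration: the example.org identifiers are invented, and the property names are simplified labels rather than the official BIBFRAME vocabulary.

```python
# Invented identifiers for one agent, one work, two instances, and one item.
AGENT = "http://example.org/agents/a1"
WORK = "http://example.org/works/w1"
INSTANCE_SCORE = "http://example.org/instances/i1"
INSTANCE_CD = "http://example.org/instances/i2"
ITEM = "http://example.org/items/x1"

# Each statement is (subject, property, object); property labels are simplified.
TRIPLES = [
    (AGENT, "name", "Example Composer"),
    (WORK, "title", "Example Symphony"),
    (WORK, "composer", AGENT),
    (INSTANCE_SCORE, "instanceOf", WORK),
    (INSTANCE_CD, "instanceOf", WORK),
    (ITEM, "itemOf", INSTANCE_SCORE),
]


def subjects(prop, obj):
    """Return every subject linked to the given object by the given property."""
    return [s for s, p, o in TRIPLES if p == prop and o == obj]


def value(subject, prop):
    """Return the first object recorded for a subject and property, or None."""
    for s, p, o in TRIPLES:
        if s == subject and p == prop:
            return o
    return None


instances = subjects("instanceOf", WORK)
composer_name = value(value(WORK, "composer"), "name")
```

The composer's name is stored once, on the agent, and both instances reach it by following identifiers from the work. In a record-centered environment the same name would be repeated in each bibliographic record.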

Of some note is that the materials and components of BIBFRAME are currently in the public domain (Library of Congress [LC], n.d.-a). This should help foster the adoption of BIBFRAME, but it remains to be seen whether BIBFRAME will be put behind a paywall to cover the costs of upkeep of the standard by the designated authoritative institution. It should be noted that BIBFRAME is designed to be flexible; it is not a cataloging code, subject heading list, or classification scheme, and it may require only minimal updates to meet future needs. BIBFRAME holds significant promise for the field of information science to overcome the shortcomings of MARC.

Content Standards

As discussed in the previous section, metadata schemas dictate the structure of bibliographic information. Specifically, they outline how data elements are housed in relation to one another. The format of the data elements (how bibliographic information is articulated) is set by content standards, or cataloging codes. The following section will discuss Resource Description and Access (RDA) and its predecessor Anglo-American Cataloguing Rules 2nd edition (AACR2) as well as Describing Archives: A Content Standard (DACS).

AACR2 was implemented in 1981. While it did provide rules for the creation of metadata (bibliographic information), problems arose when it was applied to resources of different formats, particularly those stored digitally. RDA was released in 2010 as a replacement for AACR2. DACS directs archivists in creating descriptive records by extending the skeletal rules for archival materials in AACR2’s chapter 4. It also illustrates how these rules might be implemented in either the MARC or EAD metadata schema (Zhang & Gourley, 2009).

Evidence

Evidence 1: LCSH MARC Record Set Submission

As evidence of my knowledge and comprehension of bibliographic records and of copying those records for my own institution’s use, I proffer “LCSH MARC Record Set Submission” from the course INFO 248 Beginning Cataloging and Classification.

Beyond the duplication of the MARC record, the professor requested that students list the authority record numbers (ARNs) of authorized subject headings and provide the OCLC number, the title of the book, the authorized subject heading that best describes the book, and that heading’s ARN underneath it to test our understanding of the process.

This document demonstrates proficiency in copy cataloging and familiarity with OCLC Connexion and ClassWeb, more formally known as the Library of Congress’s Classification Web tool, and addresses the MARC metadata schema, DDC, LCC, LCSH, bibliographic records, and authority records. Should I find myself in a position with copy cataloging as part of my duties, I should be able to perform those duties with no trouble.

Evidence 2: Collection Finding Aids of the University Archives

As evidence of my knowledge and comprehension of competency G, I submit the finding aid for a donation of papers from the Swett-Tracy Family to The Bancroft Library’s University Archives (from INFO 256 Archives and Manuscripts) as an example of original cataloging and of the creation of descriptive records for materials other than literary works. Archival repositories create finding aids to help interested parties navigate the content of archival collections. The processing of newly accessioned materials is one of the core duties of archivists. Part of this includes arranging donations into series while also creating a finding aid. In this exercise, I was asked to process a donation of papers from the Swett-Tracy Family as if I were working for The Bancroft Library (UC Berkeley). The finding aid that I created includes the cover and title pages, the biographical sketch, the scope and content note, and the series description (in accordance with original order). Series arrangement and finding aid creation are crucial steps in allowing institutions to safeguard the provenance of archival materials under their care. The completion of this finding aid attests to my ability to conduct original cataloging. Not all information comes in the form of literary works. When cataloging physical items, it is important to follow the standards of the archival profession so that artifacts of uncommon value are preserved for future use. It is particularly important for archival institutions to take care when arranging their acquisitions into series, as it is the standard for archivists to point researchers in the direction of a specific series where they may find artifacts of relevance to their current line of inquiry.

Evidence 3: Vinyl Record Thesaurus: Thesaurus Construction

To attest to my understanding of thesauri and controlled vocabularies, I offer “Vinyl Record Thesaurus: Final Project—Thesaurus Construction” (from INFO 247 Vocabulary Design).

My team and I limited the domain of this project to vinyl records to bring it within the scope of a student project. The intent of this project is to provide a controlled vocabulary from which all vinyl records may be adequately cataloged. In this project, my team undertook a process of domain analysis, term extraction, and facet analysis. This culminated in the final term selection to create the Vinyl Record Thesaurus. Thesauri are important because their controlled vocabularies provide a systematic way to catalog the subjects of materials and, therefore, to support their retrieval. When a thesaurus is implemented in an information retrieval system, search queries that use its controlled vocabulary will recall a high proportion of relevant surrogate records with a high degree of precision.

Conclusion

Librarianship has come a long way from the card catalog. In the digital age, it is theoretically possible to search and return results with perfect recall and perfect precision, but the field has yet to achieve this theoretical ideal. The cornerstone of this retrieval is cataloging. Standards that are developed and managed by a recognized authoritative body result in interoperability. This interoperability is key because, with the proliferation of information, institutions must work with many partners when it comes to cataloging their physical and digital holdings. Currently, the MARC metadata schema and the RDA content standard are the most prevalent, but many catalogers are looking forward to the day when they can replace MARC with BIBFRAME 2.0 (or its descendant). This is all for naught if the human component, namely librarians, disregards such standards, which is why it is critical for librarians and information science professionals to understand the principles of cataloging and follow the standards accepted by the information science field. In addition, librarians should help patrons use controlled vocabularies to increase the precision and recall of their search queries and the overall effectiveness of their resource retrieval. It is important not to forget that cataloging is meant to facilitate access to an institution’s holdings through name, title, and subject.

References

Aitchison, J., Gilchrist, A., & Bawden, D. (2000). Thesaurus construction and use: A practical manual (4th ed.). London: Routledge.

Bolin, M. K. (2016). Beginning cataloging and classification. San Jose, CA: San Jose State University.

Davis, B. (2021, March 26). What is the difference between 92 and 920 in Dewey Decimal System? Mvorganizing.org. Retrieved from https://www.mvorganizing.org/what-is-the-difference-between-92-and-920-in-dewey-decimal-system/

Hemmasi, H. (2002). Why not MARC? Variations [Indiana University]. http://variations.indiana.edu/pdf/hemmasi-ismir2002.pdf

Library of Congress. (n.d.-a). BIBFRAME frequently asked questions. https://www.loc.gov/bibframe/faqs/#q03

Library of Congress. (n.d.-b). Finding aids: Encoded Archival Description (EAD) at the Library of Congress. https://www.loc.gov/rr/ead/

Library of Congress. (2017, December 18). 651 – subject added entry-geographic name (R). MARC 21 Format for Bibliographic Data. https://www.loc.gov/marc/bibliographic/bd651.html

Library of Congress. (2021, October 1). Metadata Encoding & Transmission Standard. https://www.loc.gov/standards/mets/

Online Computer Library Center. (2003). Summaries: DDC Dewey Decimal Classification. https://www.oclc.org/content/dam/oclc/dewey/resources/summaries/deweysummaries.pdf

Online Computer Library Center. (2019, May 17). Introduction to the Dewey Decimal Classification. https://www.oclc.org/content/dam/oclc/dewey/versions/print/intro.pdf

Online Computer Library Center. (2021). Updates to DDC 23. Dewey Services®. https://www.oclc.org/en/dewey/updates.html

Tennant, R. (2002). Digital libraries: MARC must die. Library Journal, 127(17), 26, 28. http://soiscompsfall2007.pbworks.com/f/marc%20must%20die.pdf

Tinker, A. (2005). Deriving and applying facet views of the Dewey Decimal Classification scheme to enhance subject searching in library OPACs (7482) [Doctoral dissertation, University of Huddersfield]. University of Huddersfield Repository. http://eprints.hud.ac.uk/id/eprint/7482/

UCF Libraries. (2021, June 7). Dewey Decimal Classification System: Commonly used numbers. https://guides.ucf.edu/dewey/common

University of California, Santa Cruz, University Library. (2017, January 27). Metadata creation. https://guides.library.ucsc.edu/c.php?g=618773&p=4306386

W3Schools. (2021, September 2). XML tutorial. https://www.w3schools.com/xml/

Xu, A., Hess, K., & Akerman, L. (2018). From MARC to BIBFRAME 2.0: Crosswalks. Cataloging & Classification Quarterly, 56(2-3), 224–250. https://doi.org/10.1080/01639374.2017.1388326

Zhang, A. B., & Gourley, D. (2009). Creating digital collections. Chandos Publishing.