Analyzing the Lifecycle in Practical Terms: Part I: Definitions

Continuing our research into thinking about all collections objects as sets of data, we are applying some theoretical constructs to the real world, both to understand the nature and needs of data objects and to understand the capabilities of management, presentation, and discovery systems.

Today we start by looking at a set of characteristics of data that will eventually become criteria for determining how and where to manage and deliver our data collections. These characteristics are sometimes inherent in the objects themselves, sometimes applied by the holding institution to the objects, and sometimes created when the objects are ingested into a repository or other management or presentation system.

Characteristics of Integrity

These characteristics are inherent in the data no matter how the institution is seeking to use or manage them. They are core to the definition of a preservable digital object, and were defined at the very beginning of the digital library age. See: “Preserving Digital Information” (1996).

  • Content: Structured bits
  • Fixity: frozen as discrete objects
  • Reference: having a predictable location
  • Provenance: with a documented chain of custody
  • Context: linked to related objects

If a digital object lacks a particular characteristic of integrity, it is not preservable, but that does not mean that we don’t manage it in one system or another.

Characteristics of the Curation Lifecycle

The digital curation lifecycle models how institutions manage their data over time. Rather than being inherent in the data itself, these characteristics are dependent upon the collection development goals of the institution, and are subject to review and alteration. The characteristics below are related to digital preservation activities. This is exhaustively explained in the “Reference Model for an Open Archival Information System.”

  • Review
  • Bitstream maintenance
  • Backup/Disaster recovery
  • Format normalization
  • Format migration
  • Redundancy
  • Audit trail
  • Error checking

Characteristics of Usability

Some of the characteristics of usability are effectively inherent; others are definable by the institution. The characteristic of Intellectual Openness, while not inherent in the data itself, is typically externally determined. The institution does not generally have the ability to alter this characteristic unilaterally. The characteristics of Interoperability and Reusability are inherent in the data when it is acquired, but may be changed by creating derivatives or through normalization, depending on the level of Intellectual Openness. The ideas of Interoperability and Reusability in digital libraries come from: A Framework of Guidance for Building Good Digital Collections, 3rd ed.

  • Intellectual Openness
    • Open
    • Restricted: by license or intellectual property
  • Interoperability: the ability of one standards-based object to be used in another standards-based system
  • Reusability: the ability to reuse, alter, or modify the object, or any part of that object, to create new information or knowledge. Reusability makes scholarship possible.

Next time we will examine how these characteristics relate to digital objects, and after that, how those characteristics, along with institutional mission, help determine the systems and platforms that we could use to manage, preserve, and make available digital content from our repositories.


Records Management Meets Digital Preservation

Library data architecture map

At UConn Library we are involved in a project to develop a systematic data architecture, although we don’t quite use that term, which comes from IT. According to Wikipedia, “In information technology, data architecture is composed of models, policies, rules or standards that govern which data is collected, and how it is stored, arranged, integrated, and put to use in data systems and in organizations.”

This definition does not address the preservation or sustainability aspect of data management that is central to the data curation lifecycle, but data architecture is meant to be only one aspect of what is called solution architecture.

Like many organizations that made the transformation from the analog to the digital world, libraries have over the years developed multiple and sometimes conflicting solutions, systems, and policies for managing the digital collections and files in their domain. These solutions were usually implemented to solve particular problems that arose at the time, with less thought about how those decisions would have large-scale impact, often because there was no large-scale impact, or there was no way for these decisions to affect other areas of the organization. And of course external vendors were only too happy to sell libraries “solutions” that were specific to a particular use case.

As digital content has become the medium of activity and exchange, and as systems have improved and become more flexible, it is now possible, and in fact necessary, to look at our data management systems more broadly.

If we keep in mind that, at the root, all digital content is “ones and zeros” and that any system that can manage ones and zeros is potentially useful to a library, no matter where it comes from or what it is sold or developed for, then we can build an approach, or data architecture, that will serve us well, efficiently, and effectively.

How we get to that point is easier said than done. In order to get beyond thinking about the system first, we need to understand the nature or characteristics of our data. That’s where records management thinking intersects with this. RM thinking assesses the needs and limits of access and persistence (or what RM folks would call retention). Based on those criteria, records are held and managed in certain ways and in certain environments to meet the requirements of their characteristics. For example, sensitive records may be stored in a more secure facility than non-sensitive records.

How does RM thinking apply to digital libraries? The RM idea is embodied in the DCC’s Lifecycle model, and many digital archivists have internalized this idea already. Many librarians, who work more with current data, have had less of a reason to internalize the DCC model of data curation into their work, and the model has generally only been applied to content already designated as preservation worthy. What would it mean to apply RM/Lifecycle thinking to all areas of library content?

We have been mapping the relationships among different content types that the library is responsible for in terms of six different characteristics:

  • File format
  • Manager
  • IP rights holder
  • Retention
  • Current management platform
  • Current access platforms

Then we are going to look at the characteristics the content types have in common, and develop a set of policies that govern the data that has these characteristics, and only then will we look to use/alter/build/purchase applications and systems to implement these policies.
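The mapping exercise above can be sketched as a simple data structure. This is only an illustration of the approach, not the actual UConn map: the class name, the example content types, and all of the field values here are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class ContentType:
    """One row of a content-type map: the six characteristics listed above."""
    name: str
    file_formats: list          # e.g. ["TIFF", "JPEG"]
    manager: str                # unit responsible for the content
    ip_rights_holder: str
    retention: str              # e.g. "permanent", "1 semester"
    management_platform: str    # where the content is managed today
    access_platforms: list      # where users find it today

# Hypothetical entries for illustration; a real map would be far larger.
content_types = [
    ContentType("Digitized photographs", ["TIFF", "JPEG"],
                "Archives & Special Collections", "Library",
                "permanent", "preservation repository",
                ["digital collections site"]),
    ContentType("Course reserve scans", ["PDF"],
                "Access Services", "varies",
                "1 semester", "shared file store",
                ["learning management system"]),
]

# Group content types by a shared characteristic (here, retention) --
# the first step toward policies that govern data with that characteristic.
by_retention = defaultdict(list)
for ct in content_types:
    by_retention[ct.retention].append(ct.name)
```

Grouping by each characteristic in turn surfaces the clusters of content that could be governed by a single policy, before any platform decisions are made.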

It is always difficult to separate applications from the content they manipulate, but it is essential to do so in order to create a sustainable data architecture that puts the content first and the applications second.

Our project is in its early phases, and the map linked to above is very much a work in progress. Check back often to see the evolution of our thinking.

Automating Data Entry for 20,000 Folders

Patrick uses the scan pen to do some data entry

So from the macro level to the micro level, you never know what is going to happen. We have an artificial collection, built over some 20 years, of “alternative” news and information sources relating mostly to late-20th-century counterculture groups. The collection fills about a dozen filing cabinets, with folders that may contain two issues of a newsletter or fifty flyers from a protest group. Each folder has a typewritten title, sometimes referring to the title of the publication, sometimes referring to an idiosyncratic subject term. It has been a daunting task to think about creating an online index of these resources; the data entry alone would be enormous. And once we did that, there would be enormous pressure to provide online access to the contents as well.

Scanning titles

With some seed funding from a private donor, we are beginning to digitize the collection, and create online access to the resources. We made some decisions that are consistent with the idea of “quantum archives,” and applied some technological solutions to a difficult problem.

First, we defined the smallest unit of description to be the folder. Whether the folder held 20 different documents or a homogeneous set, we would manage and describe at the folder level. A user would discover the folder, and then browse through the pages in the folder (or use full-text searching) until they found what they wanted. Folder titles and one or two genre terms would be the initial entry points.

The genre list

In order to automate data entry (remember that the folder titles are typed), we purchased a text-scanning stylus. Using a spreadsheet, we attach a barcode to each folder, then scan the barcode, the title of the folder, and genre terms from a typewritten sheet. There are no typographical errors, and with the scanning pen we can enter data at a rate far higher than hand typing.

Once we populate the spreadsheet, we use other processes to convert the spreadsheet into MODS XML descriptive metadata records, pair them with the set of scanned objects from the folder, and use a batch process to ingest them into the preservation digital repository. After a bit of tinkering with settings, workflow, and process, we are far exceeding the throughput of a manual process.
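The spreadsheet-to-MODS conversion step could be sketched like this. The column names, the sample barcode and title, and the choice of minimal MODS elements are all assumptions for illustration; this is not our actual workflow code.

```python
import csv
import io
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)

def row_to_mods(row):
    """Build a minimal MODS record from one folder-level spreadsheet row."""
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = row["title"]
    ET.SubElement(mods, f"{{{MODS_NS}}}identifier", type="barcode").text = row["barcode"]
    for term in row["genres"].split(";"):       # genre terms separated by ";"
        ET.SubElement(mods, f"{{{MODS_NS}}}genre").text = term.strip()
    return mods

# Hypothetical spreadsheet export with one folder-level row.
sheet = io.StringIO("barcode,title,genres\n39001234,Anti-war flyers,flyers; ephemera\n")
records = [row_to_mods(row) for row in csv.DictReader(sheet)]
xml = ET.tostring(records[0], encoding="unicode")
```

Each resulting record would then be paired with the folder’s scanned images for batch ingest.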


Why We Shouldn’t Try to Save Everything

John Cook is wanted for murder, 1923. Connecticut Historical Society

A recent article in the Washington Post by UConn graduate student Matthew Guariglia talks about the dangers of keeping so much information that the sheer volume makes it impossible to sift through and make sense of, even using the most sophisticated tools available. He is talking specifically about personal information on individuals that began to be collected in the Victorian age by police forces attempting to deal with increasing crime in crowded industrial cities, and that has escalated into the massive data collection efforts of the security organizations of all modern governments.

As the availability of potentially useful data increased, from photographs to body measurements to fingerprints and beyond, management and analysis systems struggled, and ultimately failed, to keep up with this growing torrent of information.

Guariglia’s argument, in part, is that data analysis systems will never keep up with the ever-increasing flood of data, and that massively collecting undifferentiated data actually makes us less safe because you can’t find the significant data among all the noise. What does this mean for the archivist who is charged with collecting and preserving historical documentation? I think this brings into focus even more sharply that archives are not a stream-of-consciousness recording of “what happened” (as if that were even possible), but carefully selected and curated collections that serve the institutional needs and missions of the organizations of which they are a part. This is something that all archivists know as a matter of course and which informs their appraisal and curatorial decisions.

If only the NSA and the rest of the security apparatus would think like archivists, who knows what good things would happen?

Video is the Snapshot of the 21st Century

The other day my family and I were at the local “Pumpkin Festival” and we ran into another family we knew from my daughter’s pre-school class. As we walked around the festival, I was snapping photos with my “pro-sumer” digital SLR when I noticed that our friend was holding his phone up in front of him most of the time.

The still image, an obsolete documentary form?

“Oh, I don’t take pictures anymore,” he said when I asked him what he was doing. “I only shoot video. We don’t even use a camera nowadays.” I didn’t tell him that he was, in fact, using a camera, but it was instructive to hear that HE didn’t think of recording video as “taking pictures.” Still photography, even digital still photography, was to him as archaic as wet plates or daguerreotypes.

I thought this was a telling development, and one that archivists will certainly need to navigate sooner rather than later. Digital video is becoming the preferred method of recording the 21st-century family events and history that social scientists and historians will want to study. Are we ready for it? Do we have the means to store, manage, and provide access to this newly ubiquitous medium? This again points to the necessity for even local historical societies (often the recipients of this type of family material) to have access to digital repository systems. Technology aside, I wondered how we would manage to catalog this material in a way that makes it discoverable.

Then I thought that maybe it isn’t the potential problem that it could be. I mean, the digital video comes complete with a time stamp (Month, Day, Year, Hour, Minute, Second), lots of other metadata (duration, camera, settings) and most importantly, geographic location. And, if we have any kind of luck, we will have some sort of donor information that will give us at least some indication of the creator. That’s a darn sight more metadata than the vast majority of analog prints in most archives ever have.

So, is it fair to say that digital video has replaced the snapshot as the family history recording medium of choice? If not now, then soon, I think. I also think it IS fair to say that video will provide historians with a snapshot of life in the first half of the 21st century.

The Quick Start Guide to Becoming a Professional Archivist

When we were first developing a productivity-based processing workflow system for the Digital Collections and Archives at Tufts University, we had a whiteboard on which we wrote motivational phrases that reminded us of the things that were important for us to remember. These guiding principles were later codified into what we called the “Quickstart Guide to Becoming a Professional Archivist.” It had two sections, one on archival principles and one on attitudes about processing. We used the Quickstart Guide as an introductory and training tool for new staff members.

The Guide introduced concepts like “lumpers vs. splitters” and “ruthless efficiency and dogged persistence” as ideas related to archival processing, as well as asking more philosophical questions about the role of the archivist in creating knowledge.

Back then the Quickstart Guide was mostly focused on processing paper records. As time went on and I began to use the Quickstart Guide as a teaching tool, I realized that in the born digital age, processing had changed significantly and that the old Guide was a bit out of touch. For example, the original Guide emphasized that good archival description proceeded from the General to the Specific and moved down that continuum as time and resources allowed. Quantum Archival theory turns that idea on its head, and says that good archival description focuses on specifics first and moves to generalities as time allows.

So I went back and revised it for the digital world. The result is the Quick Start Guide 2.1.

The Quick Start Guide, 2.1

The key change was to emphasize that “management is not access.” That is, the way we manage our collections is not necessarily (or even desirably) the way we want users to access our collections. The ability to separate management from access is one of the key values of digitized and born digital archival content.

The Quick Start Guide remains a central statement of what I consider to be “good” archival attitudes. It is the first thing I teach in my classes.