Analyzing the Lifecycle in Practical Terms: Part I: Definitions

Continuing our research into thinking about all collection objects as sets of data, we are applying some theoretical constructs to the real world, both to understand the nature and needs of data objects and to assess the capabilities of management, presentation, and discovery systems.

Today we start by looking at a set of characteristics of data that will eventually become criteria for determining how and where to manage and deliver our data collections. These characteristics may be inherent in the objects themselves, applied to the objects by the holding institution, or created when the objects are ingested into a repository or other management or presentation system.

Characteristics of Integrity

These characteristics are inherent in the data no matter how the institution seeks to use or manage them. They are core to the definition of a preservable digital object and were defined at the very beginning of the digital library age. See “Preserving Digital Information” (1996): https://www.clir.org/pubs/reports/pub63watersgarrett.pdf

  • Content: structured bits
  • Fixity: frozen as discrete objects
  • Reference: having a predictable location
  • Provenance: with a documented chain of custody
  • Context: linked to related objects

If a digital object lacks a particular characteristic of integrity, it is not preservable, but that does not mean we won’t manage it in one system or another.
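
To make the distinction concrete, here is a minimal sketch, in Python, of how a preservable object might be modeled. The field names are my own illustrations, not drawn from the 1996 report or from any particular repository schema:

```python
from dataclasses import dataclass, field

@dataclass
class PreservableObject:
    """Illustrative model of the five characteristics of integrity."""
    content: bytes                     # Content: the structured bits themselves
    checksum: str                      # Fixity: a digest showing the bits are frozen as a discrete object
    identifier: str                    # Reference: a persistent, predictable location (e.g. a handle or PURL)
    provenance: list[str] = field(default_factory=list)   # Provenance: documented chain-of-custody events
    related_ids: list[str] = field(default_factory=list)  # Context: links to related objects

    def is_preservable(self) -> bool:
        """An object missing any characteristic of integrity is not preservable."""
        return all([self.content, self.checksum, self.identifier,
                    self.provenance, self.related_ids])
```

An object that fails is_preservable() might still live happily in some management system; it just should not be promoted to the preservation repository as-is.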

Characteristics of the Curation Lifecycle

The digital curation lifecycle models how institutions manage their data over time. Rather than being inherent in the data itself, these characteristics are dependent upon the collection development goals of the institution, and subject to review and alteration. The characteristics below are related to digital preservation activities. This is exhaustively explained in the “Reference Model for an Open Archival Information System”: https://public.ccsds.org/pubs/650x0m2.pdf.

  • Review
  • Bitstream maintenance
  • Backup/Disaster recovery
  • Format normalization
  • Format migration
  • Redundancy
  • Audit trail
  • Error checking
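
Several of the activities above (bitstream maintenance, error checking, audit trail) come down to routinely recomputing checksums and comparing them to stored values. A minimal sketch of that routine, assuming a simple JSON manifest of relative path to expected SHA-256 (an assumption for illustration, not any particular repository’s format):

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute a SHA-256 digest of a file, reading in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_manifest(manifest_path: Path) -> list[str]:
    """Return the relative paths whose current digest no longer matches the manifest."""
    manifest = json.loads(manifest_path.read_text())
    base = manifest_path.parent
    failures = []
    for rel_path, expected in manifest.items():
        if sha256_of(base / rel_path) != expected:
            failures.append(rel_path)  # candidate for repair from a redundant or backup copy
    return failures
```

Each run of a routine like this is itself an event worth recording in the audit trail.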

Characteristics of Usability

Some of the characteristics of usability are effectively inherent, while others are definable by the institution. The characteristic of Intellectual Openness, while not inherent in the data itself, is typically externally determined; the institution does not generally have the ability to alter it unilaterally. The characteristics of Interoperability and Reusability are inherent in the data when it is acquired, but may be changed by creating derivatives or through normalization, subject to the level of Intellectual Openness. The ideas of Interoperability and Reusability in digital libraries come from A Framework of Guidance for Building Good Digital Collections, 3rd ed.: http://www.niso.org/publications/rp/framework3.pdf

  • Intellectual Openness
    • Open
    • Restricted: by license or intellectual property
  • Interoperability: the ability of one standards-based object to be used in another standards-based system
  • Reusability: the ability to re-use, alter, or modify the object, or any part of that object, to create new information or knowledge. Reusability makes scholarship possible.

Next time we will examine how these characteristics relate to digital objects, and after that, how those characteristics, along with institutional mission, help determine the systems and platforms we could use to manage, preserve, and make available digital content from our repositories.


Visualizing Data Sets

A curious circle of interest around 1943 in a search for 1925.

I’ve been continuing to experiment with Kumu, a network-mapping and visualization application, seeing how I can use it to visualize all sorts of data. I’ve gotten better at manipulating the display to make the maps easier to use.

My current experiment is to take a search result set from the Connecticut Digital Archive, do some minimal manipulation on it, and put it into a Google Sheet that I link to the visualization app. The result is running on a test server, and it is, I think, quite interesting.
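
The “minimal manipulation” is mostly reshaping each result record into a row the mapping tool can import. A rough sketch of that step, with invented field and column names (the actual repository export and the columns the tool expects may differ):

```python
import csv

# Hypothetical search results: each record reduced to a few fields of interest.
results = [
    {"title": "Worker identification photo", "date": "1943",
     "creator": "Unknown photographer", "institution": "Example Historical Society"},
    # ... more rows from the repository search ...
]

# Write an "elements" sheet for the visualization; these column names are assumptions.
with open("elements.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Label", "Type", "Date", "Creator", "Institution"])
    writer.writeheader()
    for r in results:
        writer.writerow({"Label": r["title"], "Type": "Item", "Date": r["date"],
                         "Creator": r["creator"], "Institution": r["institution"]})
```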

For this basic test, I did a simple search in the repository for “1925”, not specifying any metadata field but just looking for it somewhere in a record, expecting that most results would have 1925 in the date. But that wasn’t always the case, and the outliers proved to be more interesting than the expected results.

Using the tool, you can arrange content by date, owning institution, or creator. When I arranged by “Date” I got this interesting circle around 1943. Not understanding why that would happen, I took a closer look and discovered that all of the photos were taken in 1943 as worker identification photos for the Post boatyard in Mystic, Connecticut. In the description, each worker was identified by his name and birthdate. These 20 or so men (out of more than 200 of these images in the repository) were all born in 1925. I wonder if they knew that?

I think tools like this can make it interesting and informative to do “sloppy” or simple searches, and find hidden relationships that come out of the data.

From Uniqueness to Ubiquity

A still unique but no longer scarce historical document

Another step along the path from analog to digital thinking in archival access is to stop thinking about our collections as unique, even if they are one of a kind. What does this mean?

When all access to analog content was by way of the reading room, everything existed in an environment of scarcity, since a one-of-a-kind document, like this 1815 membership certificate from the Windham County Agricultural Society, could only be experienced in one place, and at limited times. This was scarcity of opportunity. Since most manuscript collections were never published in any form, this scarcity seemed a permanent condition. In fact, some repositories, perversely it seems to us now, prided themselves on the fact that people were forced to come to their reading rooms from all over the world to view their treasures.

A digital object can be in many places at once.

Digitization changed all that. Repositories now pride themselves on how much of their collections are available 24 x 7, and in the number of places they are discoverable. Ubiquity has replaced scarcity as the coin of the realm, so to speak. The original documents remain as unique as before, but their ability to be ubiquitous gives them as much value as their uniqueness. How does this change the way we think about value in what we do?

Records Management Meets Digital Preservation

Library data architecture map

At UConn Library we are involved in a project to develop a systematic data architecture, although we don’t generally use that term, which comes from IT. According to Wikipedia, “In information technology, data architecture is composed of models, policies, rules or standards that govern which data is collected, and how it is stored, arranged, integrated, and put to use in data systems and in organizations.”

This definition does not address the preservation or sustainability aspect of data management that is central to the data curation lifecycle, but data architecture is meant to be only one aspect of what is called solution architecture.

Like many organizations that made the transformation from the analog to the digital world, libraries have over the years developed multiple and sometimes conflicting solutions, systems, and policies for managing digital collections and files in their domain. These solutions were usually implemented to solve particular problems that arose at the time, with less thought of how those decisions would have large-scale impact, often because there was no large-scale impact, or there was no way for these decisions to affect other areas of the organization. And of course external vendors were only too happy to sell libraries “solutions” that were specific to a particular use case.

As digital content has become the medium of activity and exchange, and as systems have improved and become more flexible, it is now possible, and in fact necessary, to look at our data management systems more broadly.

If we keep in mind that, at the root, all digital content is “ones and zeros” and that any system that can manage ones and zeros is potentially useful to a library, no matter where it comes from or what it is sold or developed for, then we can build an approach, or data architecture, that will serve us well, efficiently, and effectively.

Getting to that point is easier said than done. In order to get beyond thinking about the system first, we need to understand the nature, or characteristics, of our data. That’s where records management thinking intersects with this. RM thinking assesses the needs and limits of access and persistence (or what RM folks would call retention). Based on those criteria, records are held and managed in certain ways and in certain environments to meet the requirements of their characteristics. For example, sensitive records may be stored in a more secure facility than non-sensitive records.

How does RM thinking apply to digital libraries? The RM idea is embodied in the DCC’s Lifecycle model, and many digital archivists have internalized this idea already. Many librarians, who work more with current data, have had less reason to incorporate the DCC model of data curation into their work, and the model has generally only been applied to content already designated as preservation-worthy. What would it mean to apply RM/Lifecycle thinking to all areas of library content?

We have been mapping the relationships among different content types that the library is responsible for in terms of six different characteristics:

  • File format
  • Manager
  • IP rights holder
  • Retention
  • Current management platform
  • Current access platforms

Then we are going to look at the characteristics the content types have in common and develop a set of policies that govern data with those characteristics; only then will we look to use, alter, build, or purchase applications and systems to implement those policies.
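
A toy sketch of that grouping step, with invented content types and values (not our actual inventory), just to show the shape of the exercise:

```python
from collections import defaultdict

# Invented examples: each content type described in terms of the six characteristics above.
content_types = {
    "digitized photographs": {"format": "TIFF", "manager": "Archives", "rights": "Library",
                              "retention": "permanent", "management": "repository", "access": "public site"},
    "theses and dissertations": {"format": "PDF", "manager": "Scholarly Communication", "rights": "Author",
                                 "retention": "permanent", "management": "repository", "access": "public site"},
    "working documents": {"format": "DOCX", "manager": "Departments", "rights": "University",
                          "retention": "7 years", "management": "shared drive", "access": "internal"},
}

# Group content types that share a characteristic; each group is a candidate for a common policy.
by_retention = defaultdict(list)
for name, traits in content_types.items():
    by_retention[traits["retention"]].append(name)

for retention, names in sorted(by_retention.items()):
    print(f"retention = {retention}: {', '.join(names)}")
```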

It is always difficult to separate applications from the content they manipulate, but it is essential to do so in order to create a sustainable data architecture that puts the content first and the applications second.

Our project is in its early phases, and the map linked to above is very much a work in progress. Check back often to see the evolution of our thinking.

Why We Shouldn’t Try to Save Everything

John Cook is wanted for murder, 1923. Connecticut Historical Society

A recent article in the Washington Post by UConn graduate student Matthew Guariglia talks about the dangers of keeping so much information that the sheer volume makes it impossible to sift through and make sense of, even using the most sophisticated tools available. He is talking specifically about personal information on individuals that began to be collected in the Victorian Age by police forces attempting to deal with increasing crime in crowded industrial cities, and has escalated into the massive data collection efforts of the security organizations of all modern governments.

As the availability of potentially useful data increased, from photographs to body measurements to fingerprints and beyond, management and analysis systems struggled, and ultimately failed, to keep up with this growing torrent of information.

Guariglia’s argument, in part, is that data analysis systems will never keep up with the ever-increasing flood of data, and that massively collecting undifferentiated data actually makes us less safe because you can’t find the significant data among all the noise. What does this mean for the archivist who is charged with collecting and preserving historical documentation? I think this brings into focus even more sharply that archives are not a stream-of-consciousness recording of “what happened” (as if that were even possible), but carefully selected and curated collections that serve the institutional needs and missions of the organizations of which they are a part. This is something that all archivists know as a matter of course and which informs their appraisal and curatorial decisions.

If only the NSA and the rest of the security apparatus would think like archivists, who knows what good things would happen?

Alphabetical Order

Yesterday I wrote a post about some things you could do with a body of digital “data” that was not specifically related to the purpose of the original documents. Later in the day, during our opening demonstration of the web site, I was reminded of the very powerful nature of the printed word in telling the story of history. A relative of Thomas Dodd sat down and searched for the phrase “alphabetical order.”

Surprisingly to me, but not to the person who typed it, the phrase returned three results from a presentation by Dodd to the Tribunal. In showing that the execution of prisoners was a calculated policy, Dodd reviewed death records from one concentration camp:

“These pages cover death entries made for the 19th day of March, 1945, between fifteen minutes past one in the morning until two o’clock in the afternoon. In this space of twelve and three-quarter hours, on these records, 203 persons are reported as having died. They were assigned serial numbers running from 8390 to 8593. The names of the dead are listed. And interestingly enough the victims are all recorded as having died of the same ailment – heart trouble. They died at brief intervals. They died in alphabetical order. The first who died was a man named Ackermann, who died at one fifteen a.m., and the last was a man named Zynger, who died at two o’clock in the afternoon.”

Just thinking a bit about what this description says about the people and government that calmly, efficiently, and very consciously carried out and documented the horrors described here is alarming and disturbing. I know that we often say that we live in a “post-literate” society, and that data visualization is the latest and greatest way to create an impact on that highly visual society. I think that these 122 words say more in their own way than any photo or visualization of data could.

What’s for Breakfast?

In about an hour, we will be doing a public demonstration of our new repository infrastructure. Of course most people won’t know that; they will be looking at the Nuremberg Trial papers of Thomas J. Dodd (archives.lib.uconn.edu). What they won’t see is the underlying presentation and management Drupal/Islandora application, the Fedora repository, the storage layer, and a host of decisions about metadata schemas (MODS with URI tags for subjects and names), OCR (uncorrected, machine-generated), data and content models (atomistic pages brought together in a “book” using RDF), and Drupal themes (Do you like that button there or here?).
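
To give a flavor of the “atomistic pages in a book” idea, here is a rough sketch of the pattern in Python using rdflib, with made-up identifiers and predicate names standing in for the actual relationship ontology our stack uses:

```python
from rdflib import Graph, Literal, Namespace

# Illustrative namespaces only; the real predicates in the repository differ in detail.
REL = Namespace("http://example.org/relations#")
OBJ = Namespace("http://example.org/objects/")

g = Graph()
book = OBJ["dodd-book-42"]       # hypothetical book-level object
page = OBJ["dodd-book-42-p007"]  # hypothetical page-level object

# Each atomistic page asserts membership in the book object and its position in the sequence,
# so the presentation layer can reassemble the pages in reading order.
g.add((page, REL.isMemberOf, book))
g.add((page, REL.isSequenceNumber, Literal(7)))

print(g.serialize(format="turtle"))
```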

The papers themselves represent about 12,000 pages of material (about 50% of the total; we are continuing to digitize the rest) collected by then Executive Trial Counsel Thomas J. Dodd during the International Military Tribunal in Nuremberg just after WWII. There are trial briefs, depositions, documentation, and administrative memos relating to the construction and execution of the trial strategy of the U.S. prosecutors, material that has never before been available online. Because it is one of the most heavily used collections in our repository, we felt that this was an appropriate first collection for our new infrastructure. As with all digital collections, it will now be possible to access this material without having to travel to Connecticut, and this will open up all sorts of research possibilities for scholars of international law, WWII, the Holocaust, etc.

While all these things are very valuable and were the primary purpose for digitizing the collection, I wanted to focus this post on some unintended consequences (or opportunities) that full-text access to a body of material like this supplies. I’m a big believer in the opportunity of unintended consequences. This has never been more true than in the era of digitization, where documents become data that can be manipulated by computers to uncover and connect things that would take years to do by hand, if they could be done at all.

In the course of building their case, the prosecutors collected a massive amount of information about the workings of the Nazi regime. A lot of that information is mundane, relating to supply chains (what we would today call “logistics”) and procurement, or economic output, or the movement of material and resources along transportation routes.  Without expressly meaning to, they created a picture of a wartime society that includes all sorts of information about mid-20th century Europe.

It may seem inappropriate to study the record of a global tragedy to find out what people ate for breakfast or to study the technology infrastructure of  transportation systems, but that is exactly what you can do. Digital resources create opportunities to ask research questions that could never have been asked before, and as we well know, it is not our job as archivists to decide what is an appropriate question to ask about any historical resource.

Copyright and Risk

It is hard to believe that it has been almost a month since Digital Directions. (I guess being involved in two NEH grant applications and some strategic planning activities can just consume time otherwise spent thinking about archives and digital libraries.) My two biggest takeaways from DD2012 were about copyright and delivery. I’ve heard Peter Hirtle give his copyright talk a number of times over the years and was struck this time by how much the landscape around copyright and digital libraries has shifted, much to the benefit of open access to information.

Peter summed up the shift in a couple of bullet points:

  • Don’t just ask, “Is it legal?”
  • Ask “Who is going to be angry if I do this? Who will benefit?”
  • Look for ways to minimize potential harm while maximizing access and use.

Citing Principle 4 of the recently published ARL Code of Best Practices in Fair Use for Academic and Research Libraries, which supports fair use in archival collections that are comprehensively digitized, Peter emphasized the idea of risk assessment in addition to, and perhaps before, legal precedent in determining whether or not to provide digital access to primary source materials.

And finally, he said that it was important to be honest and open about your own decisions, to inform users about all you know about the rights that relate to your content so that they can be responsible as well.

This seems to me to bring us back to the common-sense approach that was prevalent in the pre-digital but post-photocopier age, when we informed people of the potential copyright issues relating to the material we were making available but trusted the users to be responsible researchers.

After some years of worrying that the specter of copyright would choke off most of the innovation in delivering digitized Special Collections, I see these developments as a positive step forward.

Metcalfe’s Law and the Information Universe, Or: Why We Should be as Connected as Possible

I think it is important to keep in mind that the information universe beyond our repository is the ultimate audience and community for the material we steward. We don’t manage our repositories for their own sake, but because the materials in them have social or cultural value. Our job is to make it possible for people to use these materials that have been entrusted to us. Has this equation changed in the digital era?  Let’s think about it.  If in the paper world, preservation of the physical object had no real value unless the object could be used, can we say that preservation in the digital world has no real value if the digital content is not linked to other content? Is it true that only information that is linked will be discovered and used, and the more links the more use?  I’d like to make that statement and see if it holds up.

Some years ago, before the arrival of social networking, Paul Conway wrote that “preservation is the creation of digital products worth maintaining over time.” Conway’s measure of worth at the time was the value added by the digitization process, which could make the digital product more useful and critical to the collection and the institution that created it. That worth generally was contained within the object itself or tied to the application in which it lived and through which it was delivered. Today, I think the value proposition has shifted from an internal measure to an external one, and one that demands interoperability. We can say that digital products worth maintaining over time are those that are the most connected to users and scholarship and have achieved a sort of transcendence over their original use or purpose through their connections with other objects or scholarship. They have achieved what Bob Metcalfe called the network effect.

The Original Illustration of Metcalfe's Law

Metcalfe’s law (as explained by computer scientist Jim Hendler) was developed in the late 1980s and originally described, in part, the “value of a network service to a user that arises from the number of people using the service.” While a network can grow linearly with the number of connections, “the value was proportional to the square of the number of users.”
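
Stated roughly: for n users, the number of possible pairwise connections (and, in Metcalfe’s formulation, the potential value) grows quadratically while the network itself grows only linearly:

```latex
% n users can form n(n-1)/2 distinct pairwise connections
V(n) \;\propto\; \binom{n}{2} \;=\; \frac{n(n-1)}{2} \;\sim\; \frac{n^{2}}{2}
```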

A corollary to Metcalfe’s law was actually more relevant to the web in particular. While the number of connections to the network was important, it was the linking of content in that network that was the key to the value of a resource on the web. This corollary is most famously demonstrated by Google’s PageRank algorithm.

According to Bob Metcalfe, the originator of Metcalfe’s Law, the value of digital content to a particular community will exceed the cost of maintaining that content if there are enough links and communities built around that content to exceed a “critical mass.” Since the cost of networks (and network storage), as well as the cost of connectivity, is going down, while the potential uses (through linking) of digital content are ever increasing, the critical mass of links necessary to make a digital resource “valuable” is also decreasing.

To re-interpret Paul Conway’s aphorism, the worth of digital products is vested in how, and how often, they are linked to other resources and scholarship on the web. And preservation is not only the “preservation of access,” but also what I would call the “preservation of connections” that are at the heart of modern scholarship.