Seven Pillars of Digital Curation

This post appeared in slightly different form in the Connecticut Digital Archive blog on February 2, 2013. 

People new to digital archives (and more often funding stakeholders, and certain IT managers) often ask about the difference between preservation and backup. The question goes something like this: “If I have backups of my files, and can restore them if something happens to my computer (or CD, or portable hard drive) then isn’t my data preserved?”

It is a good question that is often answered either too simply (“Backup is NOT preservation”) or by an explanation that goes into detail that only an archivist can understand. Here we attempt to explain digital preservation in everyday terms–well, as everyday as we can get and still be archivists.

Digital preservation seeks to guarantee the integrity of and long-term access to digital information resources. Preserving Digital Information, the 1996 report of the Task Force on Archiving Digital Information, identified five attributes of what it called digital integrity. Integrity was defined in terms of the attributes that give digital resources a distinct identity. These attributes are:

  1. content
  2. fixity
  3. reference
  4. provenance
  5. context

These five attributes became the foundation for what developed into digital preservation. Paul Conway later very succinctly explained these attributes as “formatted and structured bits (content) ‘frozen’ as discrete objects (fixity) in a predictable location (reference) with a documented chain of custody (provenance) and linkages to related objects (context).”
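To make the five attributes a little more concrete, here is a minimal sketch in Python of how a repository might record them for a single object. The class name, field names, and example values are hypothetical and are not drawn from the Task Force report or from any particular repository system; SHA-256 is simply one common choice for a fixity check.

```python
import hashlib
from dataclasses import dataclass, field


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of the bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class IntegrityRecord:
    """Hypothetical record of the five integrity attributes for one object."""
    content: bytes                                   # formatted and structured bits
    reference: str                                   # predictable location or identifier
    provenance: list = field(default_factory=list)   # chain-of-custody notes
    context: list = field(default_factory=list)      # identifiers of related objects
    fixity: str = ""                                 # checksum taken when the object was "frozen"

    def __post_init__(self):
        # Record the checksum at creation time if none was supplied.
        if not self.fixity:
            self.fixity = sha256_of(self.content)

    def is_intact(self, current_bytes: bytes) -> bool:
        """Compare a fresh checksum of the stored bytes to the recorded one."""
        return sha256_of(current_bytes) == self.fixity


record = IntegrityRecord(
    content=b"example document",
    reference="archive:example/0001",  # made-up identifier
    provenance=["received from donor", "ingested into repository"],
    context=["archive:example/0002"],
)
print(record.is_intact(b"example document"))  # unchanged bits
print(record.is_intact(b"example d0cument"))  # one altered character
```

A check like `is_intact` only says that the bits have not changed. It says nothing about whether anyone can still open or interpret them years from now.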

But while these aspects together may ensure a digital resource’s integrity, they do not necessarily ensure its preservation. Digital preservation comes from the addition of time and preservation actions to the five attributes of integrity.

Today the term “Digital Curation” is commonly used to identify the activities surrounding maintaining digital information resources over time. These activities take place within a context of stewardship that makes appraisal decisions based on judgments about the value of information resources over time. Data curators or modern archivists, like their analog predecessors, continually review the collections in their care and make decisions about what to do with them in terms of access, description, reformatting, disposition and the like.

The Digital Curation Centre’s Lifecycle Model illustrates the cyclical concepts and activities involved in digital curation.

According to the DCC, data archiving (or digital curation) both preserves and adds value to data. For example:

  • Selection decisions affect which data are kept in the long term, and therefore which data are accessible to users
  • Ingest and preservation action can lead to the addition of administrative metadata which describes the curation chain
  • Data can be transformed into new formats
  • Data are placed in a wider context in terms of their long-term management through, for example, the addition of annotations or developing relationships with other datasets

While backup strategies are important for preserving the bitstreams and can ensure some or all of the five facets of integrity, digital curation adds value and makes the preserved data usable and useful beyond their original purposes. Data backup can ensure recovery of digital information resources in forms and structures consistent with their original creation; digital curation supports preservation and reuse of those resources for future uses. Backup, disaster recovery, and digital curation are mutually supporting activities, and all are essential in a well-run digital repository.


Posted in Greatest Hits!, Preservation

Used Posts

With apologies to Tom Waits, I figure if singers can do it why can’t I? I’ve been neglectful of the quantum archivist as I have been busy writing and working in other areas, mostly in the Connecticut Digital Archive, and duplicating posts in multiple places has seemed not to be in the cards. Hmm…now, if that blog post was a digital object in a repository, I could re-purpose it and reuse it in multiple forms and presentations without having to retype or make references…

But back to reality. I thought it might be a good idea to gather some of the things I’ve been writing elsewhere, update them a little bit, and reprint them here in this venue. This is especially timely since I will be teaching at NEDCC’s Digital Directions in Portland, Oregon next week. Some may even come from early posts on this blog.  It may even be a good opportunity for me to see how the world has changed over the last five years.

Posted in Greatest Hits!

A Question of Scale

Two days at CNI in Washington, DC. The city is freaking out about the weather–typical–but the meeting is its usual fascinating and inspiring event. Like the Web Wise meetings, CNI tends to be a group of people who are doing things, talking about the things they do to other people who are doing things or who want to do them.

Clifford Lynch, the archetype of the digital library world, gives the keynote summary of what is new, news, and newsworthy, and we all listen intently to see if we are on the right track. It isn’t that he invents new things for us to do, but that he always has his finger on the pulse of the profession, and knows where things are going. So he simultaneously reports what is happening and, by choosing to emphasize one thing over another, gently pushes the profession in one direction or another.

My first takeaway from the keynote was Cliff’s idea that we are moving past the era of building small systems and federating them–a common approach in the earlier days–and that we need to think about how to interconnect systems automatically.

Interchange and re-use continue to come to the fore. Open data, linked data, linked open data–whatever–embedded in metadata, served up by open APIs, available through CC0 licenses: that is the message of CNI so far.

This was echoed again and again in Monday’s sessions, from the people at GWU collecting tweets, to Dan Cohen talking about how the DPLA is both a portal for discovery and a platform for building new services upon.

More to come …

Posted in Conferences

Alphabetical Order

Yesterday I wrote a post about some things you could do with a body of digital “data” that were not specifically related to the purpose of the original documents. Later in the day, during our opening demonstration of the web site, I was reminded of the very powerful nature of the printed word in telling the story of history. A relative of Thomas Dodd sat down and searched for the phrase: alphabetical order.

Surprisingly to me, but not to the person who typed it, the phrase returned three results from a presentation by Dodd to the Tribunal. In showing that the execution of prisoners was a calculated policy, Dodd reviewed death records from one concentration camp:

“These pages cover death entries made for the 19th day of March 1945 between fifteen minutes past one in the morning until two o’clock in the afternoon. In this space of twelve and three-quarter hours, on these records, 203 persons are reported as having died. They were assigned serial numbers running from 8390 to 8593. The names of the dead are listed. And interestingly enough the victims are all recorded as having died of the same ailment – heart trouble. They died at brief intervals. They died in alphabetical order. The first who died was a man named Ackermann, who died at one fifteen a.m., and the last was a man named Zynger, who died at two o’clock in the afternoon.”

Just thinking a bit about what the description of this activity says about the people and government that calmly and efficiently carried out and very consciously documented the horrors described here is alarming and disturbing. I know that we often say that we live in a “post-literate” society, and that data visualization is the latest and greatest way to create an impact on that highly visual society. I think that these 122 words say more in their own way than any photo or visualization of data could.

Posted in Discovery, Uncategorized

What’s for Breakfast?

In about an hour, we will be doing a public demonstration of our new repository infrastructure. Of course, most people won’t know that; they will be looking at the Nuremberg Trial papers of Thomas J. Dodd. What they won’t see is the underlying Drupal/Islandora presentation and management application, the Fedora repository, the storage layer, and a host of decisions about metadata schemas (MODS with URI tags for subjects and names), OCR (uncorrected, machine generated), data and content models (atomistic pages brought together in a “book” using RDF), and Drupal themes (Do you like that button there or here?).
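For the curious, the “atomistic pages brought together in a book” idea can be sketched in a few lines of Python. This is only an illustration of the general pattern, not our actual content model: the identifiers and the predicate names `isPageOf` and `sequenceNumber` are invented here, and a real repository would store these statements as RDF rather than as Python tuples.

```python
# Each triple is (subject, predicate, object). The identifiers and predicate
# names are invented for illustration only.
triples = [
    ("page:3", "isPageOf", "book:1"),
    ("page:1", "isPageOf", "book:1"),
    ("page:2", "isPageOf", "book:1"),
    ("page:1", "sequenceNumber", 1),
    ("page:2", "sequenceNumber", 2),
    ("page:3", "sequenceNumber", 3),
    ("page:9", "isPageOf", "book:2"),
    ("page:9", "sequenceNumber", 1),
]


def pages_of(book_id, triples):
    """Return the page identifiers belonging to a book, in reading order."""
    # Pages are independent objects; membership is just a statement about them.
    members = {s for s, p, o in triples if p == "isPageOf" and o == book_id}
    # Reading order also lives in statements, not in the pages themselves.
    order = {s: o for s, p, o in triples if p == "sequenceNumber" and s in members}
    return sorted(members, key=lambda page: order[page])


print(pages_of("book:1", triples))
```

The appeal of this arrangement is that each page stays a self-contained object with its own image, OCR text, and metadata, while the “book” exists only as a set of relationships that can be rearranged or reused without touching the pages.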

The papers themselves represent about 12,000 pages of material (about 50% of the total–we are continuing to digitize the rest) collected by then Executive Trial Counsel Thomas J. Dodd during the International Military Tribunal in Nuremberg just after WWII. There are trial briefs, depositions, documentation, and administrative memos relating to the construction and execution of the U.S. prosecutors’ trial strategy that have never before been available online. Because this is one of the most heavily used collections in our repository, we felt that it was an appropriate first collection for our new infrastructure. As with all digital collections, it will now be possible to access this material without having to travel to Connecticut, which will open up all sorts of research possibilities for scholars of international law, WWII, the Holocaust, etc.

While all these things are very valuable and were the primary purpose for digitizing the collection, I wanted to focus this post on some unintended consequences (or opportunities) that full-text access to a body of material like this supplies. I’m a big believer in the opportunity of unintended consequences. This has never been more true than in the era of digitization, where documents become data that can be manipulated by computers to uncover and connect things that would take years to do by hand, if they could be done at all.

In the course of building their case, the prosecutors collected a massive amount of information about the workings of the Nazi regime. A lot of that information is mundane, relating to supply chains (what we would today call “logistics”) and procurement, or economic output, or the movement of material and resources along transportation routes.  Without expressly meaning to, they created a picture of a wartime society that includes all sorts of information about mid-20th century Europe.
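As a rough sketch of what “documents become data” can mean in practice, the Python below looks for a few terms across page-level OCR text and counts how many pages mention each. The page identifiers and sample text are invented stand-ins, not material from the collection, and a real repository would normally do this through its search index rather than a loop like this one.

```python
import re
from collections import Counter

# Invented stand-ins for uncorrected OCR text, keyed by page identifier.
pages = {
    "page:1": "Report on rail transport of coal and grain, March.",
    "page:2": "Memo regarding procurement of grain for the district.",
    "page:3": "Schedule of hearings and list of counsel.",
}


def pages_mentioning(term, pages):
    """Return identifiers of pages whose text contains the term as a whole word."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return sorted(pid for pid, text in pages.items() if pattern.search(text))


def term_counts(terms, pages):
    """Count how many pages mention each term."""
    return Counter({t: len(pages_mentioning(t, pages)) for t in terms})


print(pages_mentioning("grain", pages))
print(term_counts(["grain", "rail", "breakfast"], pages))
```

With uncorrected OCR, exact matching like this will miss misread words, so counts of this kind are better treated as pointers to pages worth reading than as firm totals.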

It may seem inappropriate to study the record of a global tragedy to find out what people ate for breakfast or to study the technology infrastructure of  transportation systems, but that is exactly what you can do. Digital resources create opportunities to ask research questions that could never have been asked before, and as we well know, it is not our job as archivists to decide what is an appropriate question to ask about any historical resource.

Posted in Delivery, Discovery, Uncategorized

Tom Scheinfeldt Made Me Write This Post!

Sort of…I’ve been on an “anti-social” network kick for a while as I have been busy working on the Connecticut Digital Archive project. Lots of tiny details related to infrastructure that I thought would be completely uninteresting to anyone. My mistake. The beauty of blog posts is that they are in the moment and ephemeral, so if one is boring a reader or follower can just skip it. If the next one is interesting you can read it. The point is to toss it out there and add to the conversation; in the long run everything necessary will get said and everything unnecessary will be forgotten.

What does this have to do with Tom Scheinfeldt? Nothing directly and that is the point. Tom is teaching a class here at UConn about Digital Culture–I’d recommend it to anyone at UConn who has an opportunity to take it. His syllabus includes a mention of Andrew Sullivan, a former editor at the Atlantic who is of course a blogger, but who wrote an article way back in 2008 called “Why I Blog.” (Full disclosure here. I didn’t find this out for myself, I was alerted to it by my colleague Jean Nelson–who found it from one of Tom’s tweets–thanks for the tip Jean!)

Sullivan describes the blog as “the spontaneous expression of instant thought … its borders are extremely porous and its truth inherently transitory.” And, unlike print journalism or book or journal authorship, “It is accountable in immediate and unavoidable ways to readers and other bloggers, and linked via hypertext to continuously multiplying references and sources.”

It is difficult for those of us who were brought up in research disciplines to “blurt” our thoughts before we have defined, refined, and attributed them to evidence. What I ultimately understood about blogging from reading this article came from some advice Sullivan attributes to Matt Drudge, that “the key to understanding a blog is to realize that it’s a broadcast, not a publication. If it stops moving it dies. If it stops paddling it sinks.” Brevity and immediacy are the currency of the blogosphere. This doesn’t mean that posts should not be well-considered, just that they can contribute to the world without having been vetted and edited, because their value is in how they make connections with others thinking the same thing.

The social network relies on immediacy, shout outs, and sharing, something hard for a dinosaur like me to embrace, but I will do my best. When I have something to say, I won’t worry about who wants to hear. In some ways the internet is the ultimate “build it and they will come” environment.

Posted in Uncategorized

Facing Up to ARMA-geddon

Earlier this week, I spent an interesting and enjoyable evening with members of  the Connecticut chapter of ARMA, the records management professional organization. They invited me to be the after-dinner speaker at their monthly chapter meeting. I’d never been an after-dinner speaker before so I didn’t really know what to expect or what was expected.

My topic was to talk about the challenge of documenting culture in the digital age–or at least that’s what I said I would talk about when they asked me to speak. This was for me, and I think for them as well, an opportunity to get out of the bubble of talking to the usual suspects about the usual things.

Rather than follow a more traditional format of slides and linear discourse, I took advantage of the informality of the setting to try to create a discourse between me as an archivist and the records managers. It was informative for all of us.

The key idea I came away with, one I had not thought about in quite this way before, was that records managers work within organizational systems while archivists work out there in the chaos of human society. Attempting to apply rules-based approaches to content is mostly futile in the archival world. All you can do is collect what you collect and not worry about what you are ignoring.

If you look at the slides below, you  will see that I consider the value of things like home surveillance video, personal digital pedometer data, and lifecasted video channels as historical records.

Here are the slides:

Is all this stuff “records?” And if it is, what do we do with it?  I guess that’s for us to find out.


And the text of what I would have said if I had followed a script:

Posted in Futurology

Raising the Floor

Yesterday I was again fortunate to participate in an event here at UConn called “Digital Media/Innovative Collaborations,” a symposium organized by Tim Hunter of UConn’s Digital Media and Design program. The symposium brought together folks from across campus who have an interest or experience in working with digital media and was organized according to Tim’s idea of the digital media “table” being supported by four “legs” of Business, Creative Arts, STEM, and Digital Humanities/Social Science.

Two excellent keynotes by Gael McGill of Harvard Medical School, and Tom Scheinfeldt of the CHNM kicked off the day and after a networking lunch, we went to breakout sessions in each topic area with an admonition for people to try to visit an area with which they were not familiar.

I was invited to speak as part of the Digital Humanities breakout session, and I chose to speak broadly about the role of digital repositories in the context of not only the Humanities, but all digital media and design. Taking Tim Hunter’s analogy a step further, I see digital repositories as the “floor” upon which the legs of the digital media table sit.

It is repositories that supply the digital content for visualizing and are the places for created content to live and be repurposed in the future. And so without repositories the table, while it would still have legs to stand on, would not have a floor for those legs to rest on, and the structure would collapse.

The audience was mostly Digital Humanities practitioners, a core group of potential users and contributors that we wanted to reach. Some people were hearing one of my talks for the first time and understood my message, and a few were interested in pursuing a collaboration of one type or another. So, all in all, it was a worthwhile day and great exposure for the repository program.


Posted in Uncategorized