What’s for Breakfast?

In about an hour, we will be doing a public demonstration of our new repository infrastructure. Of course, most people won’t know that; they will be looking at the Nuremberg Trial papers of Thomas J. Dodd (archives.lib.uconn.edu). What they won’t see is the underlying Drupal/Islandora presentation and management application, the Fedora repository, the storage layer, and a host of decisions about metadata schemas (MODS with URI tags for subjects and names), OCR (uncorrected, machine-generated), data and content models (atomistic pages brought together in a “book” using RDF), and Drupal themes (Do you like that button there or here?).

The papers themselves represent about 12,000 pages of material (about 50% of the total–we are continuing to digitize the rest) collected by then Executive Trial Counsel Thomas J. Dodd during the International Military Tribunal in Nuremberg just after WWII. There are trial briefs, depositions, documentation, and administrative memos relating to the construction and execution of the trial strategy of the U.S. prosecutors, none of which has been available online before. Because this is one of the most heavily used collections in our repository, we felt that it was an appropriate first collection for our new infrastructure. As with all digital collections, it will now be possible to access this material without having to travel to Connecticut, which will open up all sorts of research possibilities for scholars of international law, WWII, the Holocaust, etc.

While all these things are very valuable and were the primary purpose for digitizing the collection, I wanted to focus this post on some unintended consequences (or opportunities) that full-text access to a body of material like this supplies. I’m a big believer in the opportunity of unintended consequences. This has never been more true than in the era of digitization, where documents become data that can be manipulated by computers to uncover and connect things that would take years to do by hand, if they could be done at all.

In the course of building their case, the prosecutors collected a massive amount of information about the workings of the Nazi regime. A lot of that information is mundane, relating to supply chains (what we would today call “logistics”) and procurement, or economic output, or the movement of material and resources along transportation routes.  Without expressly meaning to, they created a picture of a wartime society that includes all sorts of information about mid-20th century Europe.

It may seem inappropriate to study the record of a global tragedy to find out what people ate for breakfast or to study the technology infrastructure of  transportation systems, but that is exactly what you can do. Digital resources create opportunities to ask research questions that could never have been asked before, and as we well know, it is not our job as archivists to decide what is an appropriate question to ask about any historical resource.

Facing Up to ARMA-geddon

Earlier this week, I spent an interesting and enjoyable evening with members of  the Connecticut chapter of ARMA, the records management professional organization. They invited me to be the after-dinner speaker at their monthly chapter meeting. I’d never been an after-dinner speaker before so I didn’t really know what to expect or what was expected.

My topic was to talk about the challenge of documenting culture in the digital age–or at least that’s what I said I would talk about when they asked me to speak. This was for me, and I think for them as well, an opportunity to get out of the bubble of talking to the usual suspects about the usual things.

Rather than follow a more traditional format of slides and linear discourse, I took advantage of the informality of the setting to try to create a discourse between me as an archivist and the records managers. It was informative for all of us.

The key idea I came away with, one I had not thought about in quite this way before, is that records managers work within organizational systems, while archivists work out there in the chaos of human society. Attempting to apply rules-based approaches to content is mostly futile in the archival world. All you can do is collect what you collect and not worry about what you are ignoring.

If you look at the slides below, you  will see that I consider the value of things like home surveillance video, personal digital pedometer data, and lifecasted video channels as historical records.

Here are the slides:
 

Is all this stuff “records?” And if it is, what do we do with it?  I guess that’s for us to find out.

 

And the text of what I would have said if I had followed a script:

Seven Pillars of Digital Curation

I’ve been splitting time between blogging for the Connecticut Digital Archive and my own thoughts on the digital record. In this post on the CTDA blog, I attempt to explain digital preservation in 500 words or less. I managed it in 526!  I think it deserves mention in both places.

http://blogs.lib.uconn.edu/ctda/?p=93

CTDA


For the past several months I’ve been working with some very dedicated people both at UConn and elsewhere in Connecticut on a project that we are calling the Connecticut Digital Archive or CTDA.  The CTDA is an extension of one of the original digital aggregation projects: Connecticut History Online (CHO).

For years UConn has been managing the technical infrastructure of CHO. As UConn began to look at the next logical step in its development of digital content management, it seemed only natural that we would continue to collaborate with others in Connecticut, not only to build a shared aggregator of digital content but also to offer digital preservation services to libraries, museums, and historical societies in Connecticut.

CHO made it possible for lots of people to make their content available to a larger audience; now the CTDA will make it possible to preserve the digital cultural heritage of Connecticut for future generations.

Follow our progress at:

http://blog.ctdigitalarchive.org

 

Forum Forum

Yesterday, I had the pleasure and privilege to attend the Connecticut Forum on Digital Initiatives. The second installment of what I hope will be an annual event brought together more than 100 people interested in digital preservation and presentation from across the state and even beyond. We were treated to an engaging and challenging opening keynote from Trevor Owens from the Library of Congress. My big takeaway from that talk was the idea that we should not “confuse tools with content.”

In an era where we want to use, reuse, and manipulate our digital content, the means of display or presentation change quickly. It is the content (or what I might call the “data”) that we want to preserve. We can also preserve the story that is told with the data through the presentation platform, but that is a completely different activity, and separate from the tools.

Trevor was followed by a number of breakout presentations on a host of topics. You can look at the Google doc to see the schedule and links to presentations and examples.

I had two chances to speak at the Forum. One was to introduce our latest project, the Connecticut Digital Archive (more on that later), a state-wide collaborative preservation repository for cultural heritage organizations based in Connecticut. You can see the slides below:

My second chance to talk came at the end of the day when an interested and perhaps somewhat information overloaded group convened for the closing plenary.

My point in the closing was to encourage people to join the digital archive effort, and to think about the current challenges facing archivists in the digital age.

Anyone who has read this blog in the past will know what comes next. I wanted to convey my idea that the current challenges we face are part of a long evolution of record keeping that goes back as far as clay tablets and will extend far beyond our lifetimes. To meet today’s challenges, I said that we should respect the traditions of our profession and embrace the potential of our technology.

You can read the text of my remarks and see the slides:

Quantum Archivist Manifesto, Part VI: It’s All About the Package

I’m really fond of creating lists of slogans that encapsulate larger ideas about the work archivists do. Lately, I’ve been thinking a lot about information packages and OAIS models of SIPs, AIPs, and DIPs. In discussions with friends and colleagues, I’ve also trotted out a lot of quantum archives theory to measure up to the package approach to archives. It seems to me that digital information packages and quantum archives have a lot in common. Looking back over the blog posts of the last couple of years, and thinking about how all this might fit together, I’ve formulated a new list of slogans for the quantum universe, or what I have taken to calling the second generation digital repository. I haven’t attributed the origins of all of these ideas below, but regular readers of the Quantum Archivist should be able to pick out where they come from.

We begin with the list and follow with a bit of exposition and expansion of the items in the list. Right now there are five principles on the list. Maybe the list will grow, maybe it will shrink. We’ll see…

Five Principles of the Second Generation

  1. All digital content is data
  2. All data that has value should be managed
  3. The package is the smallest unit of management
  4. All pointers refer to the “original” resource
  5. Digital curation preserves access not objects

What is data?

Data can be defined as any information suitable for manipulation, use, or reuse in an electronic environment. This includes metadata, which is the “sum total of anything we know about an object,” as well as the digital content files (electronic records, primary content objects, etc.) themselves, which, by their binary nature, are inherently data.

What is a digital asset?

A digital asset is a set of data elements (metadata of all types, primary content objects, associative information, system information) combined into a “package” that is internally coherent and can be managed in an electronic environment by applications and processes.

We acquire digital assets through analog to digital conversion, born digital acquisition, and metadata creation in associated systems.
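The definition above can be sketched as a small data structure. This is only an illustration under my own assumptions: the class names, field names, identifier, and file names are all hypothetical and are not drawn from any particular repository platform.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ContentFile:
    """One content object inside a package (hypothetical structure)."""
    name: str
    role: str        # e.g. "master" or "derivative"
    media_type: str


@dataclass
class DigitalAssetPackage:
    """Metadata and content files bundled into one internally coherent unit."""
    identifier: str
    descriptive_metadata: Dict[str, str]
    files: List[ContentFile] = field(default_factory=list)

    def describe(self) -> dict:
        # Everything needed to understand the package comes from inside it.
        return {
            "identifier": self.identifier,
            "metadata": dict(self.descriptive_metadata),
            "files": [(f.name, f.role, f.media_type) for f in self.files],
        }


# Illustrative values only.
package = DigitalAssetPackage(
    identifier="example:0001",
    descriptive_metadata={"title": "Sample page image"},
)
package.files.append(ContentFile("page.dng", "master", "image/x-adobe-dng"))
package.files.append(ContentFile("page.jp2", "derivative", "image/jp2"))
```

Calling `package.describe()` returns the identifier, the metadata, and the file inventory together, which is the sense in which a package is meant to be internally coherent.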

Managed data and packages

Managed data is data combined into a digital asset (or digital information) package that exists in an application that allows managers (and end users) to perform operations on the digital asset including CRUD (create, read, update, delete), and preservation activities (from checksums to migration) based on rules and authorization.

While digital asset packages should be able to exist outside of a management application, they would no longer be managed. Only managed data meets the standards of archival quality.
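To make the “checksums” end of that preservation range concrete, here is a minimal fixity-check sketch. It assumes SHA-256 and in-memory bytes purely for illustration; the function names are my own, and a real repository would record and verify checksums according to its own rules.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of some content bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_fixity(data: bytes, recorded_checksum: str) -> bool:
    """True if the content still matches the checksum recorded at ingest."""
    return sha256_of(data) == recorded_checksum


# Illustrative content standing in for a file in a managed package.
content = b"page image bytes"
recorded = sha256_of(content)          # stored when the package is ingested

unchanged_ok = verify_fixity(content, recorded)
altered_ok = verify_fixity(content + b"!", recorded)
```

A package sitting outside a management application can still carry its recorded checksum, but nothing is running this comparison on a schedule, which is one way to read “no longer managed.”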

While a digital asset package is the smallest unit of management, the size and nature of a package can vary from package to package. In archival information systems, we make choices about the content and nature of the lowest level of managed data. For example, a set of page images of a book may exist as a single package with structural metadata that arranges them in order, or they could exist as a set of individual packages that have associated metadata that organizes them into their proper order. Either way, the end user sees a book with its pages in order.  This “lumper vs. splitter” decision is a choice we make at the institutional or even collection level.
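The book example can be sketched both ways. The dictionaries, keys, and file names below are hypothetical stand-ins, not a real content model; the point is only that either arrangement yields the same ordered pages for the end user.

```python
# Lumper: one package, with structural metadata listing the pages in order.
lumped_book = {
    "id": "book:1",
    "structure": ["page-001.tif", "page-002.tif", "page-003.tif"],
}

# Splitter: each page is its own package, carrying metadata that points
# back to the book and records its position in the sequence.
split_pages = [
    {"id": "page:b", "is_part_of": "book:1", "sequence": 2, "file": "page-002.tif"},
    {"id": "page:c", "is_part_of": "book:1", "sequence": 3, "file": "page-003.tif"},
    {"id": "page:a", "is_part_of": "book:1", "sequence": 1, "file": "page-001.tif"},
]


def pages_from_lumped(book: dict) -> list:
    """Read the page order straight from the package's structural metadata."""
    return list(book["structure"])


def pages_from_split(pages: list, book_id: str) -> list:
    """Assemble the book by collecting its member pages and sorting them."""
    members = [p for p in pages if p["is_part_of"] == book_id]
    return [p["file"] for p in sorted(members, key=lambda p: p["sequence"])]
```

Both functions return the same list of page files for `book:1`, even though the unit of management differs.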

Originals and derivatives

As mentioned above, a digital asset package may contain any number of content and metadata objects. These objects can be archival master files (what we tend to think of as the “originals”) as well as derivatives that support access and manipulation. In every case, derivatives are merely representations or transformations of the original object—whatever that may be—and any reference to the object refers to the original and NOT to the representation. For example, if a user interacts with a JPEG 2000 version of a DNG image file and would like to cite that image in a scholarly work, the citation should refer to a unique identifier of that resource and not the URL of the JPEG 2000 representation. The JPEG 2000 version the user interacts with today is only a convenience of access in a particular time and place, and is likely to be replaced with some other access derivative in the future. This question gets a lot more complicated when analog to digital conversion is involved, and in the case of born digital text documents. Repositories must make policy decisions about what is considered the original.
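One way to picture the citation principle is a resolver that maps a stable identifier to whatever access derivative is current. This is a sketch under my own assumptions: the `Resolver` class, the identifier, and the example.org URLs are all hypothetical.

```python
from typing import Dict, Optional


class Resolver:
    """Maps a persistent identifier to the current access derivative (hypothetical)."""

    def __init__(self) -> None:
        self._access_copy: Dict[str, str] = {}

    def set_access_copy(self, identifier: str, url: str) -> None:
        self._access_copy[identifier] = url

    def resolve(self, identifier: str) -> Optional[str]:
        return self._access_copy.get(identifier)


def citation_for(identifier: str) -> str:
    """A citation names the resource's identifier, never a derivative URL."""
    return f"Repository item {identifier}"


resolver = Resolver()
resolver.set_access_copy("example:0001", "https://repository.example.org/access/0001.jp2")
citation = citation_for("example:0001")

# Later, the access derivative is replaced with a different format.
resolver.set_access_copy("example:0001", "https://repository.example.org/access/0001.webp")
```

The citation made before the swap is still valid afterwards, because it never mentioned the derivative in the first place.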

Preservation, Access, and Curation

As Paul Conway said some years ago, “preservation is the action and access is the thing.” Digital preservation is the activity of ensuring access to digital content over time. This opens any number of possibilities and questions, especially in the born digital world. What is an original? What do we preserve—content or format? Or are there different answers at different times and in different situations?

However we answer policy questions like the ones above, the one answer that is always right is that good digital objects are never isolated or alone, but should always exist within the context of a package that is internally coherent and self-describing. As I like to say, if you found this digital object on the floor, you could pick it up and know all you needed to know about it just by looking inside.

 

Wandering Boston’s Streets and Discovering the Future in the Process: or The Quantum Archivist Manifesto, Part IV

The other day, in an extended discussion/meeting, I had an opportunity to talk about my ideas on and experiences related to data management and curation. In preparing for my ten-minute review of the topic I was reminded again of how much of current data curation is rooted in basic archival principles. This took me back to one of my first digital projects, Boston Streets. I’ve written about this project before in another context, and am continually amazed at how much of my current thinking was shaped by the work we did almost 10 years ago; how many things we got “right” within the context of the time and place; and how many things we set the stage for, without knowing it. There were of course a number of missteps, things we thought were important that turned out to have been bypassed or superseded by later developments, but all in all we, along with many other people at that time, were using corpuses of content (what is now generally called “data”) to understand how the digital world affected our work and profession.

Mapping the lives of clerks in 19th-century Boston

Boston Streets was an IMLS-funded project that attempted to make connections between a place and contextual information about that place without significantly expanding the process of data organization or preparation.

The idea was to organize data by spatial location and then link this data to many other pieces of data that shared the same location without having to extensively catalog them by subject, name, or other subjective terminology.
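The linking idea can be sketched in a few lines. This is not how Boston Streets was actually built; the records, field names, and addresses are invented for illustration, and the only assumption is that each item carries a normalized location value.

```python
from collections import defaultdict

# Invented sample records; "location" stands in for a normalized address.
records = [
    {"type": "photograph", "title": "Storefront view", "location": "10 Example St"},
    {"type": "directory entry", "title": "J. Doe, clerk", "location": "10 Example St"},
    {"type": "map", "title": "Ward atlas plate", "location": "22 Sample Ave"},
]


def index_by_location(items: list) -> dict:
    """Group records by their shared location value."""
    index = defaultdict(list)
    for item in items:
        index[item["location"]].append(item)
    return index


def related(item: dict, index: dict) -> list:
    """Other records at the same place, found without any subject cataloging."""
    return [other for other in index[item["location"]] if other is not item]


location_index = index_by_location(records)
neighbors = related(records[0], location_index)
```

Here the photograph and the directory entry are connected only because they share an address, with no subject or name terms assigned to either.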

In 2001 there was very little infrastructure to manage, display, or interrelate digital content and no easily leveraged open-source or vendor options. While some best practices were being developed, there were none for some of the data types we were working with, most notably the city directories that formed the basis of the project.

A major challenge we faced was that the metadata that did exist (mostly for the photographs) was originally created for a specific purpose and audience and not intended to be interoperable or reusable, and we needed to figure out how to normalize, manage, and preserve this data for other uses while maintaining its original integrity.

Although we didn’t use these terms at the time, what we were trying to do was to develop durable data and user-centered tools to manipulate that data, tools that weren’t tied together in a single silo but were interoperable across multiple applications. This is standard practice now, but back then, the Framework of Guidance for Building Good Digital Collections was not widely used, nor was there a general understanding of trustworthiness in digital repositories.

We were fortunate to have connections in two areas that were leaders in these fields: Greg Crane and the Perseus Project were literally in the next building, and through our Tufts Academic Technology group we made connections with Thorny Staples, then at UVA, who was putting together this thing called FEDORA, based on work done originally at Cornell. Fedora (Flexible Extensible Digital Object Repository Architecture) at the time wasn’t so much a software platform or application as it was a way of thinking about digital objects. Both these influences helped inform our thinking about how we might leverage and reuse content and metadata in multiple contexts.

Over the two and a half years of the project (including the almost obligatory no-cost extension) we began to understand the process of how to manage, manipulate, and even begin to preserve digital and digitized content.

By the end, we had learned a number of lessons, not just about managing digital projects, but about what really mattered, the content we were trying to create, manage, present, and preserve:

  • Data (in its largest sense of primary content, metadata, etc.) is creator-driven for a particular purpose and will always reflect that purpose first in its organization, content standards and vocabulary
  • Data is seldom reusable in its “original” form no matter how standards-based or consistent it is
  • Reused data must always be traceable back to the original data in its original form. Ten years ago that meant manipulating it by hand; now we have powerful computational tools and new ways of thinking about data, like OAI-ORE, that were created for just this purpose.
  • Trying to predict the future uses of data is a fool’s errand (I’d like to credit Eliot Wilczek for this phrase)

These lessons are as valid now as they were then, and are part of my approach to archives and digital archives. As we see the movement of scholarly pursuits into a “Web in the Cloud,” these basic ideas can continue to inform and guide us.

WebWise Begins With Preconferences

The 2010 edition of the IMLS WebWise conference kicks off tomorrow in Denver with two pre-conferences. I’ve been the steward for the half-day workshop called “Digital Repositories Uncovered” run by Sarah Shreeves, Coordinator of IDEALS at UIUC, and Jessica Colati, Director of the Alliance Digital Repository here in Colorado (and yes, we are related). As I mentioned in a previous post, Sarah and Jessica have what I think is a difficult job of selling people something they need but don’t think they want. But that is only one part of the story. Managing digital repositories means more than just convincing content owners that they want to deposit. It involves understanding copyright and fair use, intellectual property law, hardware and server specifications, software applications, and how to talk to programmers. If digital repository managers were soccer players, they would be center midfielders, able to direct the flow of the game, and understand and coordinate how all the parts work.

The half-day workshop covers a range of issues repository managers have to face (I’ve seen the previews) but most importantly, I think the workshop helps repository managers think about who they are, and their central role in the collection, management, preservation, and use of digital content.