When is the Right Time to Talk About Preservation?

Greenhouse Studios Design Process

At the Greenhouse Studios, we are working out the process of creating new forms of scholarship. One important aspect of what distinguishes scholarship from projects is sustainability. As I like to say, there is no scholarship without persistence. The infrastructure of persistence is well understood in traditional academic publishing, and is less understood in the world of digital humanities.

The GS model works through five distinct phases (Understand, Identify, Build, Review, Release) and is based on the idea of flattening traditional academic hierarchies: we do not build things for faculty; we gather together a group of people around a common intellectual question, and go from there.

As archivists, we have traditionally said that it improves the preservation  potential of any digital record for the archivist to be a part of the creation of that record from the beginning.  At the GS we are testing what that actually means in terms of new scholarship. What is the beginning? When is it appropriate to consider preservation?

Originally, we had a sense that it was important to consider preservation at the very beginning, but as we move through the process with our initial cohorts, we are finding that thinking about preservation in the initial Understand phase, when conversations are more about “what if” than anything else, would limit the imagination of the group. The second phase, Identify, seemed a more logical place to have a preservation discussion, since this is where the project’s core deliverable would be defined. However, this too was not the time, as this phase served to define the intellectual direction of the project more than the technology, even though the technology is generally defined in this phase. So the current thinking is that the preservation discussion will happen in the Build phase.

Pushing the preservation discussion further downstream has a number of effects. At the moment we don’t know if these are positive or negative effects. It of course gives the project much more flexibility to be creative if there are no limits on what they can do. It also keeps the preservation discussion transactional, outside the bounds of the project.

To use Henry Mintzberg’s terminology, GS projects are organized as adhocracies, where roles are loosely defined and fluid. Although projects within the GS are considered adhocracies, the GS exists within a professional bureaucracy, where roles and responsibilities are sharply defined, and the external technostructure of payroll, procurement, and Human Resources processes tends to constrict the freedom of the GS participants.

That discussion is for another day. The question today is whether or not preservation is integral to the development of scholarship or if it is part of the technostructure. By pushing the preservation discussion farther downstream we also push it farther into the technostructure, as preservation becomes an external demand that must be satisfied, rather than an integral part of the creative process.

Do preservation considerations belong within the creative process, or is it the job of archivists to figure out how to preserve whatever creative people ultimately create? It seems obvious that involving archivists in the early stages of more traditionally based scholarship, and in the creation of data management plans and such, contributes to preserving research data. But they still stand outside the creative process. We will ultimately figure this out, but for now, we are watching and waiting.


Analyzing the Lifecycle in Practical Terms: Part I: Definitions

Continuing our research in thinking about all collections objects as sets of data, we are applying some theoretical constructs to the real world, both to understand the nature and needs of data objects, and the capabilities of management, presentation and discovery systems.

Today we start by looking at a set of characteristics of data that will eventually become criteria for determining how and where to manage and deliver our data collections. These characteristics are sometimes inherent in the objects themselves, applied by the holding institution to the objects, or created when the objects are ingested into a repository or other management or presentation system.

Characteristics of Integrity

These characteristics are inherent in the data no matter how the institution is seeking to use or manage them.  They are core to the definition of a preservable digital object, and were defined at the very beginning of the digital library age. See: “Preserving Digital Information” (1996) https://www.clir.org/pubs/reports/pub63watersgarrett.pdf

  • Content: Structured bits
  • Fixity: frozen as discrete objects
  • Reference: having a predictable location
  • Provenance: with a documented chain of custody
  • Context: linked to related objects

If a digital object lacks a particular characteristic of integrity, it is not preservable, but that does not mean that we don’t manage it in some system or another.
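
To make the five characteristics concrete, here is a minimal Python sketch of how one might test an object against them. The record layout, field names, and identifiers are hypothetical illustrations rather than anything defined in the 1996 report, and a SHA-256 digest stands in for whatever fixity mechanism a given repository actually uses.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """Hypothetical record; the field names are illustrative only."""
    content: bytes                                  # Content: structured bits
    checksum: str = ""                              # Fixity: digest recorded at ingest
    location: str = ""                              # Reference: predictable location
    custody: list = field(default_factory=list)     # Provenance: chain of custody
    related: list = field(default_factory=list)     # Context: links to related objects


def missing_characteristics(obj):
    """Return the names of the integrity characteristics the object lacks."""
    missing = []
    if not obj.content:
        missing.append("content")
    # Fixity holds only if a digest was recorded and it still matches the bits.
    if not obj.checksum or hashlib.sha256(obj.content).hexdigest() != obj.checksum:
        missing.append("fixity")
    if not obj.location:
        missing.append("reference")
    if not obj.custody:
        missing.append("provenance")
    if not obj.related:
        missing.append("context")
    return missing


def is_preservable(obj):
    """An object is treated as preservable only when nothing is missing."""
    return not missing_characteristics(obj)
```

An object that fails the check is not discarded in this sketch; the list of missing characteristics simply tells us what kind of management it can and cannot receive.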

Characteristics of the Curation Lifecycle

The digital curation lifecycle models how institutions manage their data over time. Rather than being inherent in the data itself, these characteristics are dependent upon the collection development goals of the institution, and subject to review and alteration. The characteristics below are related to digital preservation activities. This is exhaustively explained in the “Reference Model for an Open Archival Information System” https://public.ccsds.org/pubs/650x0m2.pdf.

  • Review
  • Bitstream maintenance
  • Backup/Disaster recovery
  • Format normalization
  • Format migration
  • Redundancy
  • Audit trail
  • Error checking

Characteristics of Usability

Some of the characteristics of usability are effectively inherent; others are definable by the institution. The characteristic of Intellectual Openness, while not inherent in the data itself, is typically externally determined. The institution does not generally have the ability to alter this characteristic unilaterally. The characteristics of Interoperability and Reusability are inherent in the data when it is acquired, but may be changed by creating derivatives or through normalization, based on the level of Intellectual Openness. The ideas of Interoperability and Reusability in digital libraries come from: A Framework of Guidance for Building Good Digital Collections, 3rd ed. http://www.niso.org/publications/rp/framework3.pdf

  • Intellectual Openness
    • Open
    • Restricted: by license or intellectual property
  • Interoperability: the ability of one standards-based object to be used in another standards-based system
  • Reusability: the ability to reuse, alter, or modify the object, or any part of that object, to create new information or knowledge. Reusability makes scholarship possible.

Next time we will examine how these characteristics relate to digital objects, and then after that, how those characteristics, along with institutional mission,  help determine the systems and platforms that we could use to manage, preserve,  and make available digital content from our repositories.


Using the Hammer, Having the Nails

Connecticut Historical Society

We all know the old saying that when the only tool you have is a hammer, everything starts to look like a nail. I’ve been “nailing” pretty much everything around here with my social network tool, including in one case even a social network visualization. It is all part of experimenting with different tools that can leverage digital content. I’m sure soon we will find another tool that we can leverage for our content and start hammering everything with that. While it is a lot of fun, and we are going to make some more permanent visualizations with this particular tool, this exposes an important idea behind digital repositories. In order to use, reuse, and otherwise “re-present” content, it has to have certain characteristics, which I call Reusability, Interoperability, and Openness. These characteristics ensure that any new lightweight tool that comes out will be able to leverage content in the repository in more or less automated ways.

Open content and metadata that exists in an environment where it can be manipulated, remade, and shared is important. Equally important for scholarship and the historical record is the persistence of that source data in predictable locations no matter where or how it is ultimately used. This “cite-ability” is a foundational principle of history and scholarship, and is the only way we can determine the validity of the content we see.

All the lightweight visualization, presentation, discovery, etc. tools are less useful if we don’t have reliable source material. Or, if we are to follow the opening metaphor, “A hammer is useless if there are no nails.”

Records Management Meets Digital Preservation

Library data architecture map

At UConn Library we are involved in a project to develop a systematic data architecture, although we don’t quite use that term, which is more of an IT term.  According to Wikipedia, “In information technology, data architecture is composed of models, policies, rules or standards that govern which data is collected, and how it is stored, arranged, integrated, and put to use in data systems and in organizations.”

This definition does not address the preservation or sustainability aspect of data management that is central to the data curation lifecycle, but data architecture is meant to be only one of the aspects of what is called solution architecture.

Like many organizations that made the transformation from the analog to the digital world, Libraries have over the years developed multiple and sometimes conflicting solutions, systems, and policies for managing digital collections and files in their domain. These solutions  were usually implemented to solve particular problems that arose at the time, with less thought of how those decisions would have large-scale impact, often because there was no large scale impact, or there was no way for these decisions to affect other areas of the organization.  And of course external vendors were only too happy to sell libraries “solutions” that were specific to a particular use case.

As digital content became the medium of activity and exchange, and as systems improved and became more flexible, it became possible, and in fact necessary, to look at our data management systems more broadly.

If we keep in mind that, at the root, all digital content is “ones and zeros” and that any system that can manage ones and zeros is potentially useful to a library, no matter where it comes from or what it is sold or developed for, then we can build an approach, or data architecture, that will serve us well, efficiently, and effectively.

How we get to that point is easier said than done. In order to get beyond thinking about the system first, we need to understand the nature or characteristics of our data. That’s where records management thinking intersects with this. RM thinking assesses the needs and limits of access and persistence (or what RM folks would call retention). Based on those criteria, records are held and managed in certain ways and in certain environments to meet the requirements of their characteristics. For example, sensitive records may be stored in a more secure facility than non-sensitive records.

How does RM thinking apply to digital libraries?  The RM idea is embodied in the DCC’s Lifecycle model, and many digital archivists have internalized this idea already. Many librarians, who work more with current data, have had less of a reason to  internalize the DCC model of data curation into their work, and the model has generally only been applied to content already designated as preservation worthy. What would it mean to apply RM/Lifecycle thinking to all areas of library content?

We have been mapping the relationships among different content types that the library is responsible for in terms of six different characteristics:

  • File format
  • Manager
  • IP rights holder
  • Retention
  • Current management platform
  • Current access platforms

Then we are going to look at the characteristics the content types have in common, and develop a set of policies that govern the data that has these characteristics, and only then will we look to use/alter/build/purchase applications and systems to implement these policies.
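
As a rough illustration of that second step, the Python sketch below groups content types by the characteristics they share, so that a policy can be written once per group rather than once per system. The content types and all of their values are invented for the example and are not taken from our actual map.

```python
from collections import defaultdict

# Hypothetical content types, each described by the six characteristics.
CONTENT_TYPES = [
    {"name": "Digitized manuscripts", "file_format": "TIFF", "manager": "Archives",
     "ip_rights_holder": "Donor", "retention": "permanent",
     "management_platform": "Repository", "access_platforms": ["Repository"]},
    {"name": "Electronic theses", "file_format": "PDF", "manager": "Scholarly Communications",
     "ip_rights_holder": "Author", "retention": "permanent",
     "management_platform": "Repository", "access_platforms": ["Repository", "Catalog"]},
    {"name": "Course reserves scans", "file_format": "PDF", "manager": "Access Services",
     "ip_rights_holder": "Publisher", "retention": "one semester",
     "management_platform": "File share", "access_platforms": ["Course system"]},
]


def group_by(content_types, *keys):
    """Group content type names by the values they share for the given characteristics."""
    groups = defaultdict(list)
    for content_type in content_types:
        shared = tuple(content_type[key] for key in keys)
        groups[shared].append(content_type["name"])
    return dict(groups)


# Content types that share a retention requirement are candidates for one policy,
# whatever platform happens to hold them today.
by_retention = group_by(CONTENT_TYPES, "retention")
```

The point of the sketch is the ordering: the grouping looks only at characteristics of the content, and the platform columns are consulted only after a policy exists.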

It is always difficult to separate applications from the content they manipulate, but it is essential to do so in order to create a sustainable data architecture that puts the content first and the applications second.

Our project is in its early phases, and the map linked to above is very much a work in progress. Check back often to see the evolution of our thinking.

5 Ways the iPhone Revolutionized Archives

As the 10th anniversary of the iPhone comes and goes, there has been a spate of articles about how the iPhone irreversibly changed this or that. We have the more limited “How the iPhone Revolutionized Photography,” the more expansive “5 ways the iPhone changed the world,” and the even more expansive “10 ways the iPhone changed the world.”

I’m sure that the iPhone changed the world in as many ways as there are people to write about it. But I haven’t yet seen anyone write about how the iPhone revolutionized archives. So, here and now, I’m going to take a stab at a short list of suggestions about how the iPhone altered the landscape of archives. Interestingly, most of these relate to the iPhone as a camera, rather than a phone, but hey, lots of folks don’t really use it as a phone anyway.

Five Ways the iPhone Revolutionized Archives

  1. The end of the photocopier
  2. Geospatial and time/date precision in resource description
  3. The end of family snapshots on film
  4. Video becomes the snapshot of the current era
  5. The end of the paper scrapbook, the challenge of social media

First some easy ones.

The End of the Photocopier

Smartphones enabled reading room users to make reference copies of documents without subjecting them to the stress of photocopying. As reading rooms embraced the self service aspect of personal reproductions and even required it, the ubiquitous photocopier, with the copyright disclaimer sometimes attached to the copybed, disappeared from reading rooms. The loss of all those $0.05 charges was more than offset by the reduction in work and effort to maintain, run, and manage the photocopier. Although photocopier statistics were used to justify existences, archivists soon found other better things to do than make copies.

Geospatial and time/date precision in resource description

iPhones know where and when they are, and they attach this information to everything they handle. This makes it possible to get driving directions, and it also makes it possible to know, with very little doubt, exactly where a photo was taken.

Possibly, this search result will be a thing of the past

No longer do we have to confront the words “possibly” or “unknown” in place or time metadata fields, at least in photos taken with smartphones. On the flip side, integrated and cloud photo management tools simultaneously make it easier for people to manage their photos, and harder for archivists to get their hands on them later. More on that below.


The end of family snapshots on film

The family snapshot was being replaced by digital photography before the smartphone, but many cameras and printers came with a means to directly output digital photo files to print. The iPhone and the accompanying photo management tools pretty much ended that practice. Slideshow apps on televisions and computer screens replaced the framed photo, and photo sharing apps obviated the need to make prints. Even grandmothers show off photos of the grandkids by pulling out their phones and not their wallets.

Video becomes the snapshot of a new generation

Seven years ago I wrote a post about video being the new snapshot. In the intervening years I have seen that trend accelerate. Not only do grandmothers pull out their phones to show off their grandkids, but they will just as likely show you a video of the young tyke as they will a still photo. With social media becoming more video friendly (Facebook especially), the moving image is becoming the recording medium of choice. Why do we care? In some senses we don’t. File size is not the issue it once was, and so many management and presentation systems can deal with moving image files that it really isn’t a big deal in a technical sense. It is harder to describe time-based media than still images, but the challenges of description are not inherent to video.

Now the hard part:

The end of the scrapbook, and the challenge of social media

While “scrapbooking”  is alive and well as a  craft activity, the more mundane practice of saving photos in albums with black pages that you write on with white ink is pretty much over for the general population. The modern form of casual life documentation is, wait for it….Facebook.

Although, according to Facebook, “you own all of the content and information you post,” most people would be hard-pressed to figure out how to extract any of it. And although it can be done relatively easily, most people would never think to do it. If you die before you do it, it becomes almost impossible for anyone to gain access to the account or to its contents except through the Facebook interface, unless you have designated in advance of your death (or in a will, I suppose) someone called a legacy contact. This legacy contact must be a friend of yours on Facebook and then will have permission to download your content. That’s not quite the same as your grandchildren going through the attic and deciding what to do with a bunch of stuff up there, because you have to think of it ahead of time.

All of this is directly related to the way that the smartphone integrates itself into your information world and directs your activities without you even noticing. This is a real and significant result of the iPhone.

These things are not, in and of themselves, bad; they just make the archivist’s job harder, and make us understand even more that while so much of our work in the digital age is just like our work in the analog age, there is so much of our work that is different. The most significant point I’ve been seeing is that we have to make archival decisions at the point of creation, because when the records become inactive, it may be too late.


Seven Pillars of Digital Curation

This post appeared in slightly different form in the Connecticut Digital Archive blog on February 2, 2013. 

People new to digital archives (and more often funding stakeholders, and certain IT managers) often ask about the difference between preservation and backup. The question goes something like this: “If I have backups of my files, and can restore them if something happens to my computer (or CD, or portable hard drive) then isn’t my data preserved?”

It is a good question that is often answered either too simply (“Backup is NOT preservation”) or by an explanation that goes into detail that only an archivist can understand. Here we attempt to explain digital preservation in everyday terms–well, as everyday as we can get and still be archivists.

Digital preservation seeks to guarantee the integrity of and long-term access to digital information resources. Preserving Digital Information, the 1996 report of the Task Force on Archiving Digital Information, identified five attributes of what they called digital integrity. Integrity was defined as attributes that give digital resources a distinct identity. These attributes are:

  1. content
  2. fixity
  3. reference
  4. provenance
  5. context

These five attributes became the foundation for what developed into digital preservation. Paul Conway later very succinctly explained these attributes as “formatted and structured bits (content) ‘frozen’ as discrete objects (fixity) in a predictable location (reference) with a documented chain of custody (provenance) and linkages to related objects (context).”

But, while these aspects together may ensure a digital resource’s integrity, they do not necessarily ensure its preservation. Digital preservation comes from the addition of time and preservation actions to the five attributes of integrity.

Today the term “Digital Curation” is commonly used to identify the activities surrounding maintaining digital information resources over time. These activities take place within a context of stewardship that makes appraisal decisions based on judgments about the value of information resources over time. Data curators or modern archivists, like their analog predecessors, continually review the collections in their care and make decisions about what to do with them in terms of access, description, reformatting, disposition and the like.

The Digital Curation Centre’s Lifecycle Model illustrates the cyclical concepts and activities involved in digital curation.

According to the DCC, data archiving (or digital curation) both preserves and adds value to data. For example:

  • Selection decisions affect which data are kept in the long term, and therefore which data are accessible to users
  • Ingest and preservation action can lead to the addition of administrative metadata which describes the curation chain
  • Data can be transformed into new formats
  • Data are placed in a wider context in terms of their long-term management through, for example, the addition of annotations or developing relationships with other datasets

(See more at: http://www.dcc.ac.uk/resources/curation-lifecycle-model)

While backup strategies are important to ensure the preservation of the bitstreams and can ensure some or all of the five facets of integrity, digital curation adds value and makes the preserved data usable and useful beyond their original purposes. Data backup can ensure recovery of digital information resources in forms and structures consistent with their original creation; digital curation supports preservation and reuse of digital information resources for future uses. Backup, disaster recovery, and digital curation are mutually supporting and essential activities in a well-run digital repository.


Seven Pillars of Digital Curation

I’ve been splitting time between blogging for the Connecticut Digital Archive and my own thoughts on the digital record. In this post on the CTDA blog, I attempt to explain digital preservation in 500 words or less. I managed it in 526!  I think it deserves mention in both places.


Forum Forum

Yesterday, I had the pleasure and privilege to attend the Connecticut Forum on Digital Initiatives. The second installment of what I hope will be an annual event brought together more than 100 people interested in digital preservation and presentation from across the state and even beyond. We were treated to an engaging and challenging opening keynote from Trevor Owens from the Library of Congress. My big takeaway from that talk was the idea that we should not “confuse tools with content.”

In an era where we want to use, reuse, and manipulate our digital content, the means of display or presentation change quickly. It is the content (or what I might call the “data”) that we want to preserve. We can also preserve the story that is told with the data through the presentation platform, but that is a completely different activity, and separate from the tools.

Trevor was followed by a number of breakout presentations on a host of topics. You can look at the Google doc to see the schedule and links to presentations and examples.

I had two chances to speak at the Forum. One was to introduce our latest project, the Connecticut Digital Archive (more on that later), a state-wide collaborative preservation repository for cultural heritage organizations based in Connecticut. You can see the slides below:

My second chance to talk came at the end of the day when an interested and perhaps somewhat information overloaded group convened for the closing plenary.

My point in the closing was to encourage people to join the digital archive effort, and to think about the current challenges facing archivists in the digital age.

Anyone who has read this blog in the past will know what comes next. I wanted to convey my idea that the current challenges we face are part of a long evolution of record keeping that goes back as far as clay tablets and will extend far beyond our lifetimes. To meet today’s challenges, I said that we should respect the traditions of our profession and embrace the potential of our technology.

You can read the text of my remarks and see the slides:

A Shoutout to Activist Archivists!

The Chronicle of Higher Education’s “Wired Campus” blog ran a story today about Archivists and the Occupy movement. Who was this activist archivist you ask? None other than Howard Besser, who reported on his experiences at the recent CNI membership meeting in Baltimore.

The Wired Campus reporter felt it noteworthy to mention that Howard “spoke at the conference wearing an Occupy Wall Street T-shirt that he had made by hand” which would not have been a surprise to any archivists who have ever seen Howard at any public event.

Nevertheless, the questions presented by the Occupy movement and other protest movements in the social media age are complex and daunting for archivists. The Chronicle article goes on to explain how Howard and other archivists are attempting to create some systematic means of appraising and collecting the records of social movements. Failing that, they are at least interested in developing some standard approach for archivists to take when attempting to document these movements, because, as Howard is quoted in the article as saying, “The old way of doing things doesn’t scale.  …We have to find new ways of doing the selection and doing the metadata.”

Howard was joined in the panel by David Millman of NYU, and Sharon Leon of the Center for History and New Media.

Thanks to all three for raising the questions and the public awareness of the archival craft!

Metcalfe’s Law and the Information Universe, Or: Why We Should be as Connected as Possible

I think it is important to keep in mind that the information universe beyond our repository is the ultimate audience and community for the material we steward. We don’t manage our repositories for their own sake, but because the materials in them have social or cultural value. Our job is to make it possible for people to use these materials that have been entrusted to us. Has this equation changed in the digital era? Let’s think about it. If in the paper world, preservation of the physical object had no real value unless the object could be used, can we say that preservation in the digital world has no real value if the digital content is not linked to other content? Is it true that only information that is linked will be discovered and used, and the more links the more use? I’d like to make that statement and see if it holds up.

Some years ago, before the arrival of social networking, Paul Conway wrote that “preservation is the creation of digital products worth maintaining over time.” Conway’s measure of worth at the time was the value added by the digitization process that could make the digital product more useful and critical to the collection and the institution that created it. That worth generally was internally contained within the object itself or tied to the application in which it lived and was delivered. Today, I think the value proposition has shifted from an internal measure to an external one, and one that demands interoperability. We can say that digital products worth maintaining over time are those that are the most connected to users and scholarship and have achieved a sort of transcendence over their original use or purpose through their connections with other objects or scholarship. They have achieved what Bob Metcalfe called the network effect.

The Original Illustration of Metcalfe's Law

Metcalfe’s law (as explained by computer scientist Jim Hendler) was developed in the late 1980s and originally described in part the “value of a network service to a user that arises from the number of people using the service.” While a network can grow “linearly with the number of connections, the value was proportional to the square of the number of users.”

A corollary to Metcalfe’s law was actually more relevant to the web in particular. While the number of connections to the network was important, it was the linking of content in that network that was the key to the value of a resource on the web. This corollary is most famously demonstrated by Google’s page ranking algorithm.

According to Bob Metcalfe, the originator of Metcalfe’s Law, the value of digital content to a particular community will exceed the cost of maintaining that content if there are enough links and communities built around that content to exceed a “critical mass.” Since the cost of networks (and network storage), as well as the cost of connectivity, is going down, while the potential uses (through linking) of digital content are ever increasing, the critical mass of links necessary to make a digital resource “valuable” is also decreasing.

To re-interpret Paul Conway’s aphorism, the worth of digital products is vested in how and how often they are linked to other resources and scholarship on the web. And preservation is not only the “preservation of access,” but what I would call the “preservation of connections” that are the heart of modern scholarship.
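
The critical-mass argument above can be made concrete with a small Python sketch. It assumes the simplest reading of Metcalfe’s law: cost grows linearly with the number of links, and value grows with the square of that number. The function names and the two constants (cost per link and value per pair) are hypothetical, chosen only to show how the crossover point moves as costs fall.

```python
import math


def network_value(n, value_per_pair):
    """Value grows with the square of the number of links (Metcalfe's law)."""
    return value_per_pair * n ** 2


def network_cost(n, cost_per_link):
    """Assumed linear cost of maintaining n links."""
    return cost_per_link * n


def critical_mass(value_per_pair, cost_per_link):
    """Smallest whole n at which value exceeds cost.

    value_per_pair * n**2 > cost_per_link * n  implies  n > cost_per_link / value_per_pair.
    """
    return math.floor(cost_per_link / value_per_pair) + 1


# As the cost per link falls, fewer links are needed before value exceeds cost.
expensive = critical_mass(value_per_pair=1, cost_per_link=10)
cheap = critical_mass(value_per_pair=1, cost_per_link=4)
```

Under these assumptions, lowering the cost per link lowers the number of links a resource needs before it is “worth” maintaining, which is the sense in which the critical mass is decreasing.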