Using the Hammer, Having the Nails

Connecticut Historical Society

We all know the old saying that when the only tool you have is a hammer, everything starts to look like a nail. I’ve been “nailing” pretty much everything around here with my social network tool, including in one case even a social network visualization. It is all part of experimenting with different tools that can leverage digital content. I’m sure we will soon find another tool that we can leverage for our content and start hammering everything with that. While it is a lot of fun, and we are going to make some more permanent visualizations with this particular tool, the exercise exposes an important idea behind digital repositories. In order to use, reuse, and otherwise “re-present” content, it has to have certain characteristics, which I call Reusability, Interoperability, and Openness. These characteristics ensure that any new lightweight tool that comes out will be able to leverage content in the repository in more or less automated ways.

Open content and metadata that exists in an environment where it can be manipulated, remade, and shared is important. Equally important for scholarship and the historical record is the persistence of that source data in predictable locations no matter where or how it is ultimately used. This “cite-ability” is a foundational principle of history and scholarship, and is the only way we can determine the validity of the content we see.

All the lightweight visualization, presentation, discovery, etc. tools are less useful if we don’t have reliable source material. Or, if we are to follow the opening metaphor, “A hammer is useless if there are no nails.”

Records Management Meets Digital Preservation

Library data architecture map

At UConn Library we are involved in a project to develop a systematic data architecture, although we don’t quite use that term, which is more of an IT term.  According to Wikipedia, “In information technology, data architecture is composed of models, policies, rules or standards that govern which data is collected, and how it is stored, arranged, integrated, and put to use in data systems and in organizations.”

This definition does not address the preservation or sustainability aspect of data management that is central to the data curation lifecycle, but data architecture is meant to be only one of the aspects of what is called solution architecture.

Like many organizations that made the transformation from the analog to the digital world, libraries have over the years developed multiple and sometimes conflicting solutions, systems, and policies for managing digital collections and files in their domain. These solutions were usually implemented to solve particular problems that arose at the time, with less thought of how those decisions would have large-scale impact, often because there was no large-scale impact, or there was no way for these decisions to affect other areas of the organization. And of course external vendors were only too happy to sell libraries “solutions” that were specific to a particular use case.

As digital content became the medium of activity and exchange, and as systems improved and became more flexible, it became possible, and in fact necessary, to look at our data management systems more broadly.

If we keep in mind that, at the root, all digital content is “ones and zeros” and that any system that can manage ones and zeros is potentially useful to a library, no matter where it comes from or what it is sold or developed for, then we can build an approach, or data architecture, that will serve us well, efficiently, and effectively.

Getting to that point is easier said than done. In order to get beyond thinking about the system first, we need to understand the nature or characteristics of our data. That’s where records management thinking intersects with this. RM thinking assesses the needs and limits of access and persistence (or what RM folks would call retention). Based on those criteria, records are held and managed in certain ways and in certain environments to meet the requirements of their characteristics. For example, sensitive records may be stored in a more secure facility than non-sensitive records.

How does RM thinking apply to digital libraries?  The RM idea is embodied in the DCC’s Lifecycle model, and many digital archivists have internalized this idea already. Many librarians, who work more with current data, have had less of a reason to  internalize the DCC model of data curation into their work, and the model has generally only been applied to content already designated as preservation worthy. What would it mean to apply RM/Lifecycle thinking to all areas of library content?

We have been mapping the relationships among different content types that the library is responsible for in terms of six different characteristics:

  • File format
  • Manager
  • IP rights holder
  • Retention
  • Current management platform
  • Current access platforms

Then we are going to look at the characteristics the content types have in common, and develop a set of policies that govern the data that has these characteristics, and only then will we look to use/alter/build/purchase applications and systems to implement these policies.
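
One way to picture this mapping step is as a small table of content types that gets grouped by the characteristics that matter for policy. The sketch below is only an illustration of that idea, not our actual map: the class name, the field names, the sample rows, and the choice to group on retention and rights holder are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentType:
    """One row of the map: a content type and its six characteristics."""
    name: str
    file_format: str
    manager: str
    rights_holder: str
    retention: str
    management_platform: str
    access_platform: str


def group_by_policy_traits(content_types):
    """Group content types that share retention and rights characteristics.

    Platforms are deliberately left out of the key, so the grouping
    describes the content rather than the systems that currently hold it.
    """
    groups = defaultdict(list)
    for ct in content_types:
        groups[(ct.retention, ct.rights_holder)].append(ct.name)
    return dict(groups)


# Hypothetical sample rows, invented for illustration only.
SAMPLE = [
    ContentType("Digitized photographs", "TIFF", "Archives", "Library",
                "permanent", "Repository", "Public web"),
    ContentType("Faculty datasets", "CSV", "Data services", "Creator",
                "10 years", "Network drive", "By request"),
    ContentType("Digitized maps", "TIFF", "Map library", "Library",
                "permanent", "File server", "Public web"),
]

groups = group_by_policy_traits(SAMPLE)
for traits, names in groups.items():
    print(traits, names)
```

In this toy data the photographs and the maps fall into the same group even though they live on different platforms today, which is the point: a policy would be written for the group, and the platform question would come afterward.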

It is always difficult to separate applications from the content they manipulate, but it is essential to do so in order to create a sustainable data architecture that puts the content first and the applications second.

Our project is in its early phases, and the map linked to above is very much a work in progress. Check back often to see the evolution of our thinking.

Seven Pillars of Digital Curation

I’ve been splitting time between blogging for the Connecticut Digital Archive and my own thoughts on the digital record. In this post on the CTDA blog, I attempt to explain digital preservation in 500 words or less. I managed it in 526!  I think it deserves mention in both places.

http://blogs.lib.uconn.edu/ctda/?p=93

Metcalfe’s Law and the Information Universe, Or: Why We Should be as Connected as Possible

I think it is important to keep in mind that the information universe beyond our repository is the ultimate audience and community for the material we steward. We don’t manage our repositories for their own sake, but because the materials in them have social or cultural value. Our job is to make it possible for people to use these materials that have been entrusted to us. Has this equation changed in the digital era?  Let’s think about it.  If in the paper world, preservation of the physical object had no real value unless the object could be used, can we say that preservation in the digital world has no real value if the digital content is not linked to other content? Is it true that only information that is linked will be discovered and used, and the more links the more use?  I’d like to make that statement and see if it holds up.
Some years ago, before the arrival of social networking, Paul Conway wrote that “preservation is the creation of digital products worth maintaining over time.” Conway’s measure of worth at the time was the value added by the digitization process that could make the digital product more useful and critical to the collection and the institution that created it. That worth generally was internally contained within the object itself or tied to the application in which it lived and was delivered. Today, I think the value proposition has shifted from an internal measure to an external one, and one that demands interoperability. We can say that digital products worth maintaining over time are those that are the most connected to users and scholarship and have achieved a sort of transcendence over their original use or purpose through their connections with other objects or scholarship. They have achieved what Bob Metcalfe called the network effect.

The Original Illustration of Metcalfe's Law

Metcalfe’s law (as explained by computer scientist Jim Hendler) was developed in the late 1980s and originally described in part the “value of a network service to a user that arises from the number of people using the service.” While a network can grow “linearly with the number of connections, the value was proportional to the square of the number of users.”
A corollary to Metcalfe’s law was actually more relevant to the web in particular. While the number of connections to the network was important, it was the linking of content in that network that was the key to the value of a resource on the web. This corollary is most famously demonstrated by Google’s page ranking algorithm.
According to Bob Metcalfe, the originator of Metcalfe’s Law, the value of digital content to a particular community will exceed the cost of maintaining that content if there are enough links and communities built around that content to exceed a “critical mass.” Since the costs of networks (and network storage) and of connectivity are going down, while the potential uses (through linking) of digital content are ever increasing, the critical mass of links necessary to make a digital resource “valuable” is also decreasing.
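
The critical-mass argument can be worked through with a toy calculation. This is a rough sketch under stated assumptions, not Metcalfe’s own formulation: it assumes value grows with the square of the number of connections, assumes cost grows linearly with them, and uses made-up coefficients (`value_per_pair`, `cost_per_link`) purely to show the shape of the relationship.

```python
import math


def network_value(links, value_per_pair):
    """Value proportional to the square of the number of connections."""
    return value_per_pair * links ** 2


def maintenance_cost(links, cost_per_link):
    """Assumed linear cost model: every connection costs the same to maintain."""
    return cost_per_link * links


def critical_mass(cost_per_link, value_per_pair):
    """Smallest whole number of connections at which value exceeds cost.

    value > cost  <=>  value_per_pair * n**2 > cost_per_link * n
                  <=>  n > cost_per_link / value_per_pair
    """
    return math.floor(cost_per_link / value_per_pair) + 1


# Illustrative coefficients only; they are not measured figures.
print(critical_mass(cost_per_link=100, value_per_pair=2))
print(critical_mass(cost_per_link=50, value_per_pair=2))
```

With these invented numbers, halving the cost per link drops the critical mass from 51 connections to 26, which is the pattern described above: as storage and connectivity get cheaper, fewer links are needed before a resource is worth maintaining.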
To re-interpret Paul Conway’s aphorism, the worth of digital products is vested in how and how often they are linked to other resources and scholarship on the web. And preservation is not only the “preservation of access,” but what I would call the “preservation of connections” that are the heart of modern scholarship.

Wandering Boston’s Streets and Discovering the Future in the Process: or The Quantum Archivist Manifesto, Part IV

The other day, in an extended discussion/meeting, I had an opportunity to talk about my ideas on and experiences related to data management and curation. In preparing for my ten-minute review of the topic I was reminded again of how much of current data curation is rooted in basic archival principles. This took me back to one of my first digital projects, Boston Streets. I’ve written about this project before in another context, and am continually amazed at how much of my current thinking was shaped by the work we did almost 10 years ago; how many things we got “right” within the context of the time and place; and how many things we set the stage for, without knowing it. There were of course a number of missteps, things we thought were important that turned out to have been bypassed or superseded by later developments, but all in all we, along with many other people at that time, were using corpuses of content (what is now generally called “data”) to understand how the digital world affected our work and profession.

Mapping the lives of clerks in 19th-century Boston

Boston Streets was an IMLS-funded project that attempted to make connections between a place and contextual information about that place without significantly expanding the process of data organization or preparation.

The idea was to organize data by spatial location and then link this data to many other pieces of data that shared the same location without having to extensively catalog them by subject, name, or other subjective terminology.
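
In today’s terms, the idea amounts to using location as the join key between otherwise unrelated records. The sketch below is a hypothetical illustration, not the Boston Streets implementation: the record fields, the addresses, and the simple lowercase normalization of the location string are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical records; the addresses and fields are invented for illustration.
RECORDS = [
    {"source": "city directory", "location": "12 Example St", "text": "J. Smith, clerk"},
    {"source": "photograph", "location": "12 Example St", "text": "Storefront view"},
    {"source": "atlas plate", "location": "40 Sample Ave", "text": "Plate 7"},
]


def normalize(location):
    """Reduce a location string to a simple comparable key."""
    return location.strip().lower()


def index_by_location(records):
    """Build a lookup from normalized location to every record at that place."""
    index = defaultdict(list)
    for record in records:
        index[normalize(record["location"])].append(record)
    return index


def related_items(index, location):
    """Return all records that share a location, whatever their source."""
    return index.get(normalize(location), [])


index = index_by_location(RECORDS)
for item in related_items(index, "12 Example St"):
    print(item["source"], "-", item["text"])
```

Nothing here requires subject headings or name authorities: the directory entry and the photograph are connected only because they point at the same place.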

In 2001 there was very little infrastructure to manage, display, or interrelate digital content and no easily leveraged open-source or vendor options. While some best practices were being developed, there were none for some of the data types we were working with, most notably the city directories that formed the basis of the project.

A major challenge we faced was that the metadata that did exist (mostly for the photographs) was originally created for a specific purpose and audience and not intended to be interoperable or reusable, and we needed to figure out how to normalize, manage, and preserve this data for other uses while maintaining its original integrity.

Although we didn’t use these terms at the time, what we were trying to do was to develop durable data and user-centered tools to manipulate that data that weren’t tied together in a single silo but interoperable across multiple applications. This is standard practice now, but back then, the Framework of Guidance for Building Good Digital Collections was not widely used, nor was there a general understanding of trustworthiness in digital repositories.

We were fortunate to have connections in two areas that were leaders in these fields: Greg Crane and the Perseus Project were literally in the next building, and through our Tufts Academic Technology group we made connections with Thorny Staples, then at UVA, who was putting together this thing called FEDORA, based on work done originally at Cornell. Fedora (Flexible Extensible Digital Object Repository Architecture) at the time wasn’t so much a software platform or application as it was a way of thinking about digital objects. Both these influences helped inform our thinking about how we might leverage and reuse content and metadata in multiple contexts.

Over the two and a half years of the project (including the almost obligatory no-cost extension) we began to understand the process of how to manage, manipulate, and even begin to preserve digital and digitized content.

By the end, we had learned a number of lessons, not just about managing digital projects, but about what really mattered, the content we were trying to create, manage, present, and preserve:

  • Data (in its largest sense of primary content, metadata, etc.) is creator-driven for a particular purpose and will always reflect that purpose first in its organization, content standards and vocabulary
  • Data is seldom reusable in its “original” form no matter how standards-based or consistent it is
  • Reused data must always be traceable back to the original data in its original form. Ten years ago that meant manipulating it by hand; now we have powerful computational tools and new ways of thinking about data, like OAI-ORE, that were created for just this purpose.
  • Trying to predict the future uses of data is a fool’s errand (I’d like to credit Eliot Wilczek for this phrase)
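
The traceability lesson can be illustrated with a small sketch in which every reused record carries a pointer to, and a fingerprint of, the original it came from. This is not OAI-ORE and not what we built at the time; it is a much simpler stand-in, and the class name, field names, identifier, and sample directory entry are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DerivedRecord:
    """A reused record that remembers where it came from."""
    content: str        # the transformed, reusable form
    source_id: str      # identifier of the original record (hypothetical scheme)
    source_sha256: str  # fingerprint of the original, untouched text


def sha256_of(text):
    """Return the SHA-256 hex digest of a string encoded as UTF-8."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def derive(source_id, original, transform):
    """Apply a transformation while recording the original's fingerprint."""
    return DerivedRecord(transform(original), source_id, sha256_of(original))


def traces_back_to(record, original):
    """Check that a derived record really came from this exact original."""
    return record.source_sha256 == sha256_of(original)


# Invented sample entry, for illustration only.
original = "SMITH, Jno., clerk, h. 12 Example st."
normalized = derive("directory/0001", original, str.lower)

print(normalized.content)
print(traces_back_to(normalized, original))
```

The original text is never modified; the derived form can be reshaped for any new purpose, and the fingerprint lets anyone confirm later that the source they are looking at is the one the derived record was made from.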

These lessons are as valid now as they were then, and are part of my approach to archives and digital archives. As we see the movement of scholarly pursuits into a “Web in the Cloud”  these basic ideas can continue to inform and guide us.

Sustainable Development Sustaining Sustainable Development

Or so we hope. Here at DU we are in the very beginnings of what could be a very interesting and useful project, to support a “Global Campus” being developed at the Josef Korbel School of International Studies. This global campus is part of a new Masters in Development Practice program (MDP) which “… addresses just and sustainable development via transformational learning.”

Korbel School of International Studies MDP Program
As I understand it, this global campus is to be more than just a distance learning portal; its goal is to be much more dynamic and conversational than the standard distance learning application.

Our piece of this project is to support the information and content management needs of the global campus in a sustainable way. By using our digital library and repository infrastructure we can manage and maintain the resources being created by this program and support their dissemination in current systems as well as systems not yet developed or imagined.  So our sustainable content management approach supports and sustains the sustainable development educational approach, which then supports sustainable development throughout the world.

Just as in the “real” world of sustainable development, there are barriers to implementing a sustainable content management approach. It is arguably “easier,” at least in the short run, to build a self-contained data-driven web site that ignores long-term infrastructure, management, and sustainability, but this approach seems to me to be the antithesis of the goals of the program. We’ve had discussions about this with the leaders of the program with good results, and we hope that economic constraints and time pressures will not force us to step away from sustainable development of their program resources. In a way, they are living the experience of what they are trying to do for others. It will be interesting to see how it all turns out.

Clifford Lynch has said about the preservation of cultural heritage materials that “Stewardship is easy and inexpensive to claim; it is expensive and difficult to honor, and perhaps it will prove to be all too easy to later abdicate.” Our goal is to make sure that doesn’t happen with the materials, scholarship, and knowledge that are created in the MDP program by helping to create a sustainable development approach to sustainable development education. I hope you see more posts on this project in the future.

YouTube: the Ephemera of the 21st Century?

In a recent interview for the Digital Pioneers project, Howard Besser called YouTube the “ephemeral material of today” and a “microphone on the water cooler discussions people have at work.”   You can hear these comments for yourself about 4 minutes into the conversation on critical issues facing cultural heritage digitization.

Recent news about the Library of Congress collecting Twitter tweets would seem to confirm that social network material is the new “correspondence” series of personal papers collections if the terms correspondence and personal papers could be said to still have meaning in today’s archival environment. They are becoming the record of personal and casual social and intellectual interaction of the current age.

Is YouTube the kind of “ordinary everyday material produced by ordinary everyday people” that Howard Besser says it is? I guess that depends on your definition of “ordinary.” Certainly YouTube and Twitter are a view into a certain sector of the population, one that is reasonably literate and has a certain level of technological ability. And the technological barrier is certainly a lot lower than it was even a couple of years ago so this form of communication is available to a much larger pool of potential users. By collecting this content centrally, we can have access to a vast amount of material from a huge variety of people, far more than ever would have donated their personal papers to an archive. So in this aspect, I agree with Howard completely.

I’d argue though that documenting the contents of social networking tools only gets us back to where we were in the age of paper, and not much beyond that. Although my evidence is purely anecdotal, I’d bet that the people who create YouTube videos and are on Twitter, are by and large, educated people who are at home with the visual and literary communications methods of today. And although it is now so much easier for anyone from that group to get his or her ideas spread across the globe, I believe that the people who were voiceless in the age of paper have not made similar progress.

It would be interesting to consider whether Ben Franklin, the Sons of Liberty, and the authors of the Federalist Papers would have been on YouTube and blogs had they existed in those times. If it had been possible, would John and Abigail Adams have posted messages to each other’s Facebook pages rather than fool with those messy quill pens? And if they had, and we didn’t preserve this highly ephemeral material, what would we know of the early struggles of the American nation? What future counterparts to Abigail and John Adams are posting in blogs, or tweeting, or making YouTube videos of things that inspire or outrage them?

While we HAVE become much better about documenting the formal means of communications of our society in the digital age, I think that we could be at even greater risk of losing not only the ephemera of today’s society, but the personal papers of our entire culture because we blithely rely on organizations beyond our control, who have no interest in our content as historical artifacts, to maintain and preserve our own personal history for us.  (A note of disclosure here: My wife and I run a blog on a hosted web site where we post news and stories about our family for friends and relatives and this blog is hosted by a for-profit service provider.) Yet what choice do we have? As archivists and digital librarians, we have to find ways to solve this dilemma.

Nobody Wants a Digital Repository…Until They Do!

And then they want it YESTERDAY.

There have been numerous studies of why institutional repositories succeed or fail. Many of them have been gathered by Chris Bailey in his Institutional Repository Bibliography.

Basically, IRs fail because no one has any use for them (in the economic sense of “utility”) and because they are often marketed as preservation solutions and not as something that could benefit the actual users. I had a conversation the other day with some folks who want me to market our digital repository to faculty, get significant buy-in, and then explore how we can expand services for them. Sort of the opposite of “if you build it they will come.”  Now that the age of experimentation in digital libraries is over, and has been for about 5 years, the idea of leading from the front has taken a back seat to leading from behind.  The tagline we hear most often is “user-centered design.”  That is, our systems must reflect what users want and not necessarily what we think they need. Presumably, the user knows what he wants, and it is up to us to give it to him.

I think there is a flaw in this approach. Most users can only imagine what they want within the context of what they already know. This idea is illustrated in numerous folktales. One of my favorites is “Jamie O’Rourke and the Big Potato” where Jamie, after being granted a wish by a leprechaun, wishes, not for a release from poverty or anything like that, but for the biggest potato in the world because that’s the best thing he can think of. While Jamie ends up all right, he would have done much better if he could have broken free from established conventions.

But how could we have expected him to? Can we really expect users to be able to articulate or even imagine paradigm-changing scenarios without being led to them in some way by people who think about this all the time?

I think we should go back to leading from the front, by listening to our users and (at the risk of sounding like Mama Odie here), discerning from them what it is they need, rather than what they want.

Most people don’t know that they need a digital repository, or if you ask them, think that they don’t need one at all (I have backups! I have network drives! I have CDs!). But if you give them something they need, like a platform for open access publishing or a means to deliver content that they couldn’t include in their most recent publication, or a place to keep their grant-funded datasets, THEN they see a value in what you are offering not because it offers permanent durability, but because it meets a need or solves a problem, or just makes their life easier.

Despite what we as archivists know to be the value of digital repositories, for the user, the digital repository is a means, and not a destination in itself. For them it is a means of access to a corpus of content that they need to do their scholarly work that will be there when they need it. For us, it is the other side of the same coin.

Preservation vs. Durability

I’m attending and speaking at a small conference for members of the Colorado Alliance of Research Libraries called “Digital Repositories, Data Curation, and the Cloud.” The Keynote speaker in the preconference today was Thorny Staples, the “godfather” of Fedora and currently Director of Community Strategy and Alliances and Fedora Project at DuraSpace. In this morning’s talk, Thorny introduced the idea of “durability” as being different from, and preferable to, the idea of simple preservation. As I understand it, durability differs from preservation in that while preservation seeks to maintain the existence of a digital object in a way that enables it to be accessed, durability preserves not only the existence but the meaning or context of the content in a verifiable way.

This strikes me as being absolutely obvious, now that it has been pointed out. The record of humanity now takes place on the web. How do we maintain the connections that are made between and among objects that are combined and recombined in 2.0 tools  even when those objects do not live in the same place, and the tools that are used to create those connections are themselves ephemeral?

The scholarly record, and by the same token the historical record, relies on citations to stable resources that provide verification for the assumptions or assertions made in an argument. How do we verify and persist, i.e. make “durable,” the context of a digital object in all of its contexts? Once we let the object out of a controlled environment that enforces context, how can context be maintained?

These are the kinds of questions that we might address when thinking about durability rather than just preservation.