Yesterday I wrote a post about some things you could do with a body of digital “data” that was not specifically related to the purpose of the original documents. Later in the day, during our opening demonstration of the web site, I was reminded of the very powerful nature of the printed word in telling the story of history. A relative of Thomas Dodd sat down and searched for the phrase “alphabetical order.”
Surprisingly to me, but not to the person who typed it, the phrase returned three results from a presentation by Dodd to the Tribunal. In showing that the execution of prisoners was a calculated policy, Dodd reviewed death records from one concentration camp:
“These pages cover death entries made for the 19th day of March, 1945 between fifteen minutes past one in the morning until two o’clock in the afternoon. In this space of twelve and three-quarter hours, on these records, 203 persons are reported as having died. They were assigned serial numbers running from 8390 to 8593. The names of the dead are listed. And interestingly enough the victims are all recorded as having died of the same ailment – heart trouble. They died at brief intervals. They died in alphabetical order. The first who died was a man named Ackermann, who died at one fifteen a.m., and the last was a man named Zynger, who died at two o’clock in the afternoon.”
Just thinking about what this description says about the people and government that calmly and efficiently carried out, and very consciously documented, the horrors described here is alarming and disturbing. I know that we often say that we live in a “post-literate” society, and that data visualization is the latest and greatest way to create an impact on that highly visual society. I think that these 122 words say more in their own way than any photo or visualization of data could.
In about an hour, we will be doing a public demonstration of our new repository infrastructure. Of course, most people won’t know that; they will be looking at the Nuremberg Trial papers of Thomas J. Dodd (archives.lib.uconn.edu). What they won’t see is the underlying presentation and management Drupal/Islandora application, the Fedora repository, the storage layer, and a host of decisions about metadata schemas (MODS with URI tags for subjects and names), OCR (uncorrected, machine-generated), data and content models (atomistic pages brought together in a “book” using RDF), and Drupal themes (Do you like that button there or here?).
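To make one of those metadata decisions concrete, here is a minimal, hypothetical sketch (Python standard library only) of generating a MODS-style subject field whose topic carries a URI tag. The subject term and the id.loc.gov identifier below are placeholders, not the values in our actual records.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

def subject_element(term, value_uri):
    """Build a MODS-style <subject> whose <topic> carries a valueURI attribute."""
    subject = ET.Element(f"{{{MODS_NS}}}subject")
    topic = ET.SubElement(subject, f"{{{MODS_NS}}}topic")
    topic.set("valueURI", value_uri)
    topic.text = term
    return subject

# Placeholder term and identifier, for illustration only.
el = subject_element(
    "War crime trials",
    "http://id.loc.gov/authorities/subjects/sh00000000",
)
xml = ET.tostring(el, encoding="unicode")
```

Linking subjects and names to authority URIs this way is what lets records from different repositories be matched and aggregated later.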
The papers themselves represent about 12,000 pages of material (about 50% of the total–we are continuing to digitize the rest) collected by then Executive Trial Counsel Thomas J. Dodd during the International Military Tribunal in Nuremberg just after WWII. There are trial briefs, depositions, documentation, and administrative memos relating to the construction and execution of the trial strategy of the U.S. prosecutors, none of which has ever before been available online. Because this is one of the most heavily used collections in our repository, we felt it was an appropriate first collection for our new infrastructure. As with all digital collections, it will now be possible to access this material without having to travel to Connecticut, which will open up all sorts of research possibilities for scholars of international law, WWII, the Holocaust, etc.
While all these things are very valuable and were the primary purpose for digitizing the collection, I wanted to focus this post on some unintended consequences (or opportunities) that full-text access to a body of material like this supplies. I’m a big believer in the opportunity of unintended consequences. This has never been more true than in the era of digitization, where documents become data that can be manipulated by computers to uncover and connect things that would take years to do by hand, if they could be done at all.
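As a sketch of what “documents become data” can mean in practice, here is a hypothetical full-text phrase search over a handful of invented, OCR-style page transcripts. The real collection’s search runs inside the Drupal/Islandora stack, not in a script like this.

```python
def find_phrase(pages, phrase):
    """Return the ids of pages whose text contains the phrase (case-insensitive)."""
    needle = phrase.lower()
    return [page_id for page_id, text in pages.items() if needle in text.lower()]

# Invented stand-ins for uncorrected, machine-generated OCR text.
pages = {
    "page-001": "They died at brief intervals. They died in alphabetical order.",
    "page-002": "Administrative memo regarding transport schedules.",
    "page-003": "The names are recorded in ALPHABETICAL ORDER in the register.",
}

hits = find_phrase(pages, "alphabetical order")  # ["page-001", "page-003"]
```

A three-word query across 12,000 pages takes milliseconds; by hand it would take weeks, if it were attempted at all.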
In the course of building their case, the prosecutors collected a massive amount of information about the workings of the Nazi regime. A lot of that information is mundane, relating to supply chains (what we would today call “logistics”) and procurement, or economic output, or the movement of material and resources along transportation routes. Without expressly meaning to, they created a picture of a wartime society that includes all sorts of information about mid-20th century Europe.
It may seem inappropriate to study the record of a global tragedy to find out what people ate for breakfast or to study the technology infrastructure of transportation systems, but that is exactly what you can do. Digital resources create opportunities to ask research questions that could never have been asked before, and as we well know, it is not our job as archivists to decide what is an appropriate question to ask about any historical resource.
The Forum is the second installment of a program sponsored by the Connecticut State Library and begun in 2011. Paraphrasing from their website: the Forum brings together people from libraries, archives, museums, and cultural heritage institutions from around Connecticut and beyond to talk about digital initiatives and about how collaboration can enhance a project and create communities from across the cultural heritage spectrum. The Forum is a chance for the diverse voices within the cultural heritage sector to talk about the ideas, projects, and tools with which they are engaged.
And it is true. I attended last year, just after I moved to UConn, and was happily surprised by the spirit of collaboration I found at the Forum. Lots of places and groups talk about collaboration, but in Connecticut it appears to be a reality.
This year, I am honored to be on the program, talking about the importance of digital preservation and how a collaborative approach to digital preservation can make it possible to preserve the cultural record of both large and small organizations. I’m sharing the podium with folks from Connecticut (Yale, UConn, CT State Library) and beyond (Library of Congress, George Mason Univ., NYPL Labs) who will share their stories, plans, and dreams about the digital present and future.
If you can make it to Hartford on October 22, it will be worth the trip. Registration is free. So come and join the conversation!
The more things change, the more they remain the same. I am continually reminded of that when I read about each “new” approach to library or archival services that leverages new technology. Most recently, I’ve been involved in a number of discussions about the value and methods of building a community around collections or research activities.
Every collection tends to have its own community, a group of people who are interested in the topic of the collection or the person or organization that created it. This community is made up of scholars, amateur experts, and the casually interested, as well as those archivists who are the stewards of the collection. One goal of outreach and research services for archival repositories is to support communities of users of their collections and make it possible for the members of these communities to have sufficient access to the collections to do with what they wish within the bounds of responsible use.
We archivists have always tried to make our collections available, provide the means for community building, and actively engage in communities that surround our collections. In the pre-Internet days, good reference archivists knew the local, regional, and national users of their collections. With the advent of digital access to collections, the link between user and repository was often broken. At first we thought that was a good thing–getting out of the way of researchers and letting them decide what is important, rather than having the archivist as the gatekeeper. But what we didn’t realize until a bit later was that the archivist and the collection, acting as the connection point among multiple researchers, performed a function that many people found valuable.
Interposing the archivist back into the relationship between the researcher and the material, not as a “gatekeeper” but more as a “conductor,” supports the community of scholars (and as always, I mean scholar in the largest sense) that grows up around any often-used collection. And, in a chicken-and-egg situation, any collection that has a community of scholars surrounding it is often used.
So we continue to support community building as we have always done, working in the milieu of the moment, which today is social media. In this two-way interaction between scholar and curator, we will be able to continue the practice of learning about our collections from the people who may know them best, and of providing better services in the ways that people want and need. This brings us back to the thought that we are just doing what we have always done, just using different tools to do it.
At the recent WebWise 2011 conference in Baltimore there was a lot of talk about “macroscopic” analysis of data, beginning with Josh Greenberg’s (of the Sloan Foundation) keynote on Thursday morning, in which he first used the term in relation to STEM education. Macroscopic analysis involves aggregating large amounts of data from a single source or multiple sources and analyzing and presenting this data in a way that is comprehensible (more or less) using data visualization tools. But macroscopic data visualization is not limited to the sciences.
One example Josh used was the Google Ngram Viewer, which lets you graph the frequency of use of particular words in a large set of documents assembled from the corpus of the Google Books project.
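The mechanics behind a viewer like this are, at their core, simple counting: tokenize each year’s texts and compute a word’s share of all tokens in that year. A toy sketch under that assumption (the tiny two-year corpus is invented):

```python
from collections import Counter

def word_frequency_by_year(corpus_by_year, word):
    """Relative frequency of `word` among all tokens, per year."""
    freqs = {}
    for year, texts in corpus_by_year.items():
        counts = Counter(token for text in texts for token in text.lower().split())
        total = sum(counts.values())
        freqs[year] = counts[word] / total if total else 0.0
    return freqs

# An invented miniature corpus; the real viewer works over millions of books.
corpus = {
    1900: ["the telegraph office", "a telegraph line"],
    1950: ["the television set", "a radio and a television"],
}

freq = word_frequency_by_year(corpus, "telegraph")
# "telegraph" is 2 of 6 tokens in 1900 and absent from the 1950 sample.
```

Plot those per-year frequencies and you have, in miniature, the rising-and-falling curves the Ngram Viewer draws.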
Macroscopic analysis of a corpus of texts is not exactly new. This is something that the computational linguists at the Perseus Project and elsewhere have been doing for years. Perhaps it is the scale of content and the types of tools that are now becoming available to the general public that makes this the next big thing.
Nevertheless, the idea of data-driven research, even in the humanities, was seen by many at the conference as the future challenge for libraries and archives. Fran Berman, now of RPI and recently Co-Chair of the Blue Ribbon Task Force on Sustainable Digital Preservation and Access, made the same point in her Friday keynote at WebWise. She also said that we must serve both “the club” (professional researchers) and “the crowd” (non-professional experts or the just plain curious).
Fred Gibbs of the Center for History and New Media at George Mason demonstrated another tool (voyeurtools.org) they were using to test anecdotal history through sheer volume. However, at another point in the conference, Sayeed Choudhury from Johns Hopkins noted that there is a critical difference between data and documents. Documents are made to be seen with our eyes, while data is born to be processed by machine. In other words, data has no inherent meaning until it is processed, while documents have many inherent meanings; to turn them into data means de-contextualizing them and looking at the resulting data in a way that was never intended by the creator.
Archivists have long been focused on the documentary use of their collections. I think this will remain our most important function for some time to come, but data-driven research is becoming something within the realm of all researchers now that the tools to make it possible don’t require extensive technical skills.
In light of this new use of documentary content, a question that we have to answer–and it is one of the type that we have been answering for decades–is this: “How much do we do to prepare our collections for research use, and how much do we leave to the researcher to do for him- or herself?” Should we convert our documents into “data” ahead of time, or should we leave that task up to the researcher? How far do we go?
There is a continuum and matrix of service that we need to develop to answer these questions, and the answers are based on traditional principles of the profession: appraisal, assessment of value, available resources, skills, and technology, to name a few. I think you can boil it down to two questions:
1. Does the value of the source material warrant the level of access made possible by turning it into data?
2. Will converting this material into data enhance its value?
Since we obviously don’t/won’t have the resources to turn everything into data right now, we will need to make some informed appraisal decisions that are similar to the decisions we make about digitizing analog content. It is important to note here that digitizing analog content is not the same as converting it into data. I can scan a document and turn it into an image, but it is not data until I can extract the words or pictures contained in that document and manipulate them in a way of my choosing. Again, do I offer the images and let an interested researcher convert them to data, or should I convert this content to data and make it available to data aggregators for use by anyone?
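The distinction can be made concrete. Once the words are extracted, a register page stops being a picture and becomes rows you can query. The lines below are invented, register-style stand-ins for OCR output, and the parsing logic is a sketch, not our production workflow.

```python
# Invented, register-style lines standing in for OCR output.
raw_lines = [
    "8390  Ackermann  heart trouble  01:15",
    "8412  Mueller    heart trouble  04:30",
    "8593  Zynger     heart trouble  14:00",
]

def parse_register(lines):
    """Turn whitespace-separated register lines into structured records."""
    records = []
    for line in lines:
        serial, name, *cause_parts, time = line.split()
        records.append({
            "serial": int(serial),
            "name": name,
            "cause": " ".join(cause_parts),
            "time": time,
        })
    return records

records = parse_register(raw_lines)

# A question you can only ask of data, not of a page image:
in_alphabetical_order = records == sorted(records, key=lambda r: r["name"])
```

A scanned image answers no questions by itself; the moment the same page is parsed into fields, patterns like alphabetical ordering become a one-line query.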
If we see all collections as potential sources of “data,” then it is even more important for us to have some item-level control of our collections content, in as many relevant forms as possible, so that it will be as useful as possible. The decisions we make about how and when to go about doing this involve not only an understanding of the resources and mission of our institution but also of the needs and interests of the users of our collections. (More on that topic in the next post.)

Update, March 22, 2011:
I just ran across this blog post from the National Trust Historic Sites that gives a really great summary of Josh Greenberg’s keynote at WebWise and goes into some depth on the idea of the macroscope: http://historicsites.wordpress.com/2011/03/21/webwise-using-data-to-see-forests-not-just-trees/
In the early days of digital scholarship, “real” scholarship and web delivery were seen as incompatible, and even the most innovative scholars published in traditional journals. Gradually, with the development of online article archives (originally limited to pre-prints or post-prints) and the emergence of the Open Access movement, that model began to change. Concurrently, the development of software and systems designed to disseminate digital scholarship (DSpace and Open Journal Systems, for example) and of course-material sharing and collaboration sites, such as Rice University’s Connexions and MIT’s OpenCourseWare, helped to alter the attitudes of scholars and tenure committees toward the idea of web-based publishing and scholarship.
This revolution is by no means complete, but the tide is definitely running in favor of electronic publishing, in either traditional models or, more importantly, in new, more open and flexible ones. SHERPA/RoMEO has become the de facto aggregator of publishing rights information, the Directory of Open Access Journals (DOAJ) now numbers more than 6,000 titles, and the development of the Creative Commons has permanently changed the landscape for intellectual property management.
This revolution in scholarly publishing has spawned a complementary revolution in access to research data. In the print environment, citations were the chief means of referencing supporting evidence and data, and libraries housed untold volumes to support citation following. Even so, references to primary sources or unpublished research meant that this material remained almost permanently unavailable to all but a few scholars. Digitization of historical sources and digital repositories, as well as access to digitized printed works, are changing this model. This revolution was pioneered in what has become known as the digital humanities. Most notably, and somewhat surprisingly, it occurred first in classics and archaeology and then spread to the hard sciences. Today, research data in both the humanities and the hard sciences is being deposited in open-access repositories and made available to scholars worldwide. Combine that with the permeation of electronic versions of printed works and you have a scholarly experience that mimics the link-following behavior of the web. I can read a scholarly work and click to look at the data that supports a particular point, or I can read an entire letter that is only briefly quoted in an article and tell immediately whether the author took the quote out of context.
As they have for centuries, libraries can remain standing at the nexus of scholarly communication if they pursue traditional services in modern ways. Winston Tabb of Johns Hopkins University recently made the point that “data centers are the new library stacks.” As more published information becomes available electronically from cloud-based providers, local libraries can become the stewards of unique scholarly data (and by scholarly data we mean all the resources used to create scholarship and new knowledge) created by faculty and students that contributes to the growth of knowledge. Libraries have the organizational structure and ability to support the long-term preservation of not only the digital content but also the permanence of access that is required for scholarship. Additionally, libraries, with their understanding of copyright and the ethical values of information exchange, can support Open Access publishing in its own right, leading the movement in both thought and action by becoming not only the stewards of scholarly content but the distributors of that content as well.
It seems to me that this approach to thinking about the library, and increasing the visibility and prominence of its special and unique collections, will help libraries, especially Special Collections libraries, not only avoid the fate of Blockbuster Video, but remain relevant and important in the world of scholarship.
A recent post in the AOTUS blog by David Ferriero entitled “The Future is in the Palm of Our Hands” called for archivists to think about ways to connect archival collections to potential users through mobile devices. Ferriero was speaking specifically about NARA and its collections, but this idea is of course broadly applicable to all archives and collections.
The great opportunity for archives in connecting to users through mobile devices comes from one special nature of these devices: they can locate themselves in space, that is, they know where they are. And since they know where they are, we can link digital objects in our collections to those locations and have them pop-up on a mobile device and announce their presence, without the user doing practically anything at all except holding up his smartphone.
The idea of geo-coding locations for historical documents (especially photographs) has been around for some time. I was a part of some work in the late 1990s at Tufts University, in collaboration with the Perseus Digital Library, to overlay historical resources of London and Boston onto historical maps. These were large-scale, programming-intensive projects that used what we would now consider primitive, web-based GIS display tools to visually display and deliver historical information through a web browser. They certainly were not optimized for mobile devices because, of course, those devices didn’t really exist then. While these tools were good at showing a visual representation of the location of historical information, we didn’t yet have the ability to do what we could imagine: to stand in a particular spot on the earth and connect with the historical record of that particular place.
The advent and general adoption of the Google maps API made it possible to more easily connect content to maps, and the development of smart phones and web-enabled mobile devices makes it possible to deliver historical documentation to people right where the history happened even though the resources that document that history are stored in our repositories.
How great would it be to stand on the steps of the Lincoln Memorial and hear Martin Luther King Jr.’s “I Have a Dream” speech? Or stand on a street in San Francisco and see photos of that street after the 1906 earthquake? Actually, you may well be able to do this right now. The technology exists; I just don’t know whether anyone has done it yet.
Of course, there are people already doing this sort of thing. For example, if you are in Philadelphia, you can point your iPhone to http://phillyhistory.org/i/ and be shown historic photos of Philadelphia based on your location. North Carolina State has produced WolfWalk (http://www.lib.ncsu.edu/wolfwalk/), which provides information on the history of approximately 60 major sites on the NCSU campus, drawn from resources at the University’s Special Collections. In both cases, though, I need to know that PhillyHistory or WolfWalk exists and what the URL is.
What would it take for my Google maps app to list, not only restaurants or barber shops, but historical documents, images, and media related to nearby places? Well, maybe that’s getting a bit too optimistic, but we can still dream can’t we?
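The core of such a feature is nothing exotic: geocode the collection metadata, then filter items by distance from the device’s reported position. A minimal sketch, with invented sample records and coordinates:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def nearby_items(items, lat, lon, radius_m=200):
    """Titles of geocoded collection items within radius_m of a position."""
    return [i["title"] for i in items
            if haversine_m(lat, lon, i["lat"], i["lon"]) <= radius_m]

# Invented sample records; real ones would come from a repository's metadata.
items = [
    {"title": "Lincoln Memorial, 1963 march", "lat": 38.8893, "lon": -77.0502},
    {"title": "Market Street after the 1906 earthquake", "lat": 37.7749, "lon": -122.4194},
]

# A device standing near the Lincoln Memorial:
hits = nearby_items(items, 38.8895, -77.0500)
```

Everything else–the map display, the pop-up, the media playback–is presentation; the archival work is getting good coordinates into the item-level metadata.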
… and your enemies closer. Whether this comes from The Godfather, Napoleon, or an ancient Chinese philosopher, it may explain what a fervent believer in open source like me is doing giving a presentation at an ILS vendor’s user group meeting here in Chicago.
Like most academic libraries, we use a combination of tools, applications, and resources to collect and deliver our content. In the past few years, we have made an explicit choice to move toward open source software solutions, at least for our presentation layer.
Why did we do this? There are a number of reasons, most of them philosophical and operational rather than economic. Although open source is free (like a puppy), there are many costs associated with development and maintenance. I don’t think the economic argument has a lot of value in terms of decision making, since anything big costs a lot of money; big products from vendors and big software development projects seem to me to be in the same ballpark cost-wise.
I’m not going to go deeply into the whole argument here, and it is possible to argue any of these points. But in my opinion, given a certain level of technical expertise (which not everyone has or can get), the advantage of open source is the ability to be nimble in the face of new demands and to serve your user base in a much more focused way than vendor solutions can offer. The downside, of course, is that you have to maintain it all yourself, and there is no easy phone call to customer support that you can make to say “just fix it please!”
Which brings me back to Chicago, physically and intellectually. I am part of a panel with two colleagues from our library to talk about harvesting and aggregating metadata–including primary source metadata–into a presentation layer that is usable and useful for researchers.
We will of course talk about the vendor-supplied option that we currently use to harvest and aggregate book and primary source metadata, but I’m going to go a step beyond that to talk about the value of standards-based data exchange and to demonstrate not only the vendor-based model but a few open source applications that we have developed here at the library. My point is that data aggregation is a matter of policy and practice, not applications.
What I am saying is that aggregated metadata can be used in a variety of ways to support discovery, and that open source applications based on standards–applications that can be re-used and re-purposed for different audiences–can go a long way toward serving the needs of our local audiences in ways that “one-size-fits-all” vendor products don’t seem to be doing.
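As an illustration of what standards-based exchange buys you, here is a sketch of flattening a simple Dublin Core record, the kind of payload an OAI-PMH harvest returns, into plain fields. The sample record is invented, and a real harvester would page through a repository’s OAI endpoint rather than parse a canned string.

```python
import xml.etree.ElementTree as ET

# A canned fragment shaped like a simple Dublin Core record; real harvesting
# would fetch batches of these from a repository's OAI-PMH endpoint.
SAMPLE = """\
<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Trial brief on slave labor</dc:title>
  <dc:creator>Dodd, Thomas J.</dc:creator>
  <dc:subject>Nuremberg Trials</dc:subject>
</record>
"""

DC = "{http://purl.org/dc/elements/1.1/}"

def record_to_dict(xml_text):
    """Flatten a simple Dublin Core record into a field -> values dict."""
    fields = {}
    for el in ET.fromstring(xml_text):
        if el.tag.startswith(DC):
            fields.setdefault(el.tag[len(DC):], []).append(el.text)
    return fields

rec = record_to_dict(SAMPLE)
```

Because the fields follow a shared standard, the same thirty lines can aggregate records from any compliant repository, which is exactly why policy and practice matter more than any one application.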
We’ll see what sort of reception this gets in a room full of people who presumably (at least in my mind) are here to hear about the latest product from their vendor and why they should buy it.
Recently, the Penrose Library launched a brand new “user-centered” web site. I’m not a big fan of the term “user centered” since I think it is often used as an excuse not to be creative. But what we are trying to do is make available to each group of users the things that they are most interested in right up front. Rather than forcing them to learn how the library is organized administratively, we wanted the site to answer the question: “What do I want to do?” based on a second question: “Who wants to know?”
Some of this approach was informed by a workshop given by Nancy Fried Foster, library anthropologist at the University of Rochester, that some of us attended a year or so ago. She had recently completed an ethnographic study of undergraduate research behavior at the University of Rochester. Her findings were published in 2007 in a book called “Studying Students: The Undergraduate Research Project at the University of Rochester.”
Other parts of the design were informed by our own observations of user behaviors from Faculty, Students (both graduate and undergraduate) and University staff. If you are interested, there is a short “tour” of the new site, narrated by our Instruction Librarian, Carrie Forbes.
My point is really that, in order to be successful, especially in a library that hopes to teach research and scholarship skills as well as provide information, one size does not fit all, and there should be as many different library experiences as there are groups we wish to serve. Our next step is to extend the granularity of experience down to the individual, and provide each person (or at least each person who is affiliated with DU) with an experience that is tailored to his or her own interests and experience. I mean, if Amazon and L.L. Bean can do it, why can’t a library?
As archivists, we are always trying to find the best way to connect to our user community to give them what they want in the best way possible. The idea of quantum archives is to connect people to the content in as granular a way as possible while preserving the opportunity for them to contextualize the content in ways that they want. I was recently involved in a conversation where someone who wouldn’t ever consider himself an Archivist put this idea in the most succinct way.
Thought Equity Motion is a for-profit stock footage fulfillment and video asset management service that manages the video libraries of some of the biggest media organizations in the world. They happen to be based in Denver, and I’ve had a couple of opportunities over the past few months to talk with Frank Cardello, the EVP for Corporate Development at TEM. TEM has just launched a joint venture with the NCAA called the “NCAA Vault.” Timed to coincide with the beginning of the annual Men’s basketball tournament, the Vault features “ten years of full games and highlights” of the Sweet 16. As a basketball fan I appreciate this opportunity; as an archivist I am even more impressed with how TEM and the NCAA thought about presenting historical information.
While I can watch an entire game, I can also use search terms to limit to particular teams, years, and players. There are also some pre-defined categories like “great shots” or “great finishes.” Next, but not finally, you have the opportunity to search (using a text-based search box) through the play-by-play track of the video footage for a particular moment or play within a game. You can select this clip and share it in other applications.
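A sketch of the underlying idea: a time-coded play-by-play track is just searchable data, and every hit becomes an addressable “moment” that can be shared. The plays and the share-link format below are invented for illustration, not how the Vault actually works.

```python
def find_moments(play_by_play, term):
    """Return (start_seconds, description) for plays whose text mentions term."""
    t = term.lower()
    return [(start, text) for start, text in play_by_play if t in text.lower()]

def clip_url(game_id, start, duration=30):
    # Hypothetical share-link format, just to show addressable moments.
    return f"https://example.org/vault/{game_id}?t={start}&d={duration}"

# Invented time-coded play-by-play entries (seconds into the broadcast).
play_by_play = [
    (125, "Jones hits a three-pointer from the corner"),
    (2310, "Smith steals the ball and dunks"),
    (2395, "Jones buzzer-beater wins the game"),
]

moments = find_moments(play_by_play, "jones")
link = clip_url("1998-sweet16-game4", moments[-1][0])
```

The same pattern applies to any archival material with a time or page coordinate: index the transcript, and each search result points back into the larger whole.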
Frank said that the idea behind this approach was that people initially don’t want to watch the entire game, they want to “experience the moment” and share that moment with others. It was the purpose of the Vault to allow people to experience the moment.
Although he was talking about entertainment consumers, I thought that this was an interesting way to view all types of historical research. Researchers seldom want everything in a collection or a book, but those “moments” that help them prove their points, support their thesis or just inform themselves. This seems to me to be the essence of quantum archives, to reduce archival material to a collection of “moments” that can be used, shared, and re-used both in ways that we define–the pre-defined “great shots”–and the unexpected ways that result from users making their own moment out of a larger whole.
I’d like to coin a new phrase that I think I’ll add to the next version of the Quickstart Guide. It is “Deliver the Moment.” It simply means that we can manage our content according to traditional principles, but always seek to deliver that content in ways that resonate with our users.
I don’t know how scalable this idea is in terms of delivering real-life archival access. The NCAA Vault, for now, focuses on just one sport (Men’s basketball), in a very short time frame (10 years), and over a very limited scope (the last three rounds of the annual tournament). Given these limited parameters it is relatively easy to craft a satisfying user experience based on the principle of delivering the moment. There are plans to add more sports and a greater time span. I’m rooting for them.