Posts Tagged ‘Crowdsourcing’

Digitize First, Catalog Later?

April 1st, 2010 Quantum Archivist No comments

In the digital collection building workshops we do for SAA, we have always emphasized the idea that you should never digitize a collection that isn’t already processed. We generally leave the definition of “processed” a bit vague. At the most basic level, we mean that until you have some organized list of the items that you want to digitize, you shouldn’t start slapping random content on the scanner bed. In practice this meant that you didn’t digitize until you had item-level control of the collection, even if there was only a title without any other descriptive information. The value-added descriptive information is something we would advocate adding as part of the digitizing workflow.

Now I am beginning to wonder whether that idea is quite as valid for born-digital content. Perhaps if we just put the stuff out there with the absolute minimum of control, and let the crowd of interested amateur experts fill in the details beyond what we can derive automatically, we might be better off, or at least farther ahead.

For most born-digital content I can know a few basic things mostly automatically: where it came from, who created it (sometimes), what it is (document, photograph, moving image, etc.), and its file format (jpg, pdf, mp4, mp3, etc.). I can assign it the few required fields in a management system automatically, with something as basic as the title being simply the file name. Could I then just toss it out there and allow the crowd to fill in the other details?
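To make the idea of "the absolute minimum of control" concrete, here is a rough sketch of how such a stub record might be derived from nothing but a file path and its source. The field names, the `minimal_record` function, and the broad type mapping are all assumptions made for illustration. They are not the schema of any particular management system, and treating every `application/*` file as a document is a deliberately crude guess.

```python
import mimetypes
from pathlib import PurePosixPath

# Hypothetical mapping from a MIME top-level type to a broad content type.
# Lumping "application" (e.g. pdf) in with documents is a crude assumption.
BROAD_TYPES = {
    "image": "image",
    "video": "moving image",
    "audio": "sound recording",
    "text": "document",
    "application": "document",
}

def minimal_record(path, source):
    """Build a stub record using only what the file path tells us."""
    p = PurePosixPath(path)
    mime, _ = mimetypes.guess_type(p.name)   # guessed from the extension only
    top_level = mime.split("/")[0] if mime else None
    return {
        "title": p.stem,                      # the title is simply the file name
        "source": source,                     # where it came from
        "creator": None,                      # only sometimes known; left for the crowd
        "format": p.suffix.lstrip(".").lower(),
        "type": BROAD_TYPES.get(top_level, "unknown"),
    }

record = minimal_record("accession_2010/parade_1947.jpg", "University Archives")
```

Everything else, such as who is in the picture, where it was taken, and why it matters, is left empty for contributors to fill in.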

Even if I assume that there are equal parts “Wisdom of the Masses” and “Madness of the Mob” out there, would I get enough good information to make it worth the work of separating the wheat from the chaff?

One argument on the positive side is that, unless you have a very highly focused collection with a very small temporal span, no one organization or institution can possibly have all the expertise to create high-quality, in-depth information about all of its collections. And there are a lot of people out there who may know more about Ukraine, or about DU in the 1940s, than the folks here in Denver in the early part of the 21st century.

Could our role as archivists and repository managers be to view and review, rather than to create and catalog?

I don’t think this can really work. Or can it?

Distributed Cataloging and the Semantic Web

March 9th, 2010 Quantum Archivist 2 comments

In the first couple of Harry Potter books, the editions that were offered for sale in the United States were Americanized versions of the original works. What was a “jumper” in the original became a “sweater” in the US version. Lorries became trucks, boots became trunks, etc. Even the title of the first book was changed to suit the American audience. Once the books became a worldwide phenomenon, everyone was presumably familiar with Britishisms and the practice stopped, I believe.

This is an interesting and possibly significant issue as we begin to develop our distributed cataloging project for the work of Semyon Fridlyand. Will we need to develop a semantic thesaurus of some kind that will help us bridge the gap between how we think about and name things and how others do? Adding to the dilemma is the fact that we will also be dealing with multiple languages and even multiple alphabets.
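As a very rough illustration of what the simplest possible version of such a thesaurus might look like, the sketch below maps variant terms, including one in Cyrillic, onto a single preferred term so that tags from different contributors can be grouped together. The `PREFERRED` table, the function names, and the sample terms are invented for this example and are not part of the project. A real solution would need far more than a lookup table, since it says nothing about transliteration, ambiguity, or context.

```python
from collections import defaultdict

# Hypothetical variant -> preferred-term table (illustrative entries only).
PREFERRED = {
    "jumper": "sweater",
    "lorry": "truck",
    "грузовик": "truck",   # Russian for "truck"
}

def normalize(tag):
    """Return the preferred term for a contributed tag, or the tag itself."""
    key = tag.strip().casefold()   # casefold also handles non-Latin alphabets
    return PREFERRED.get(key, key)

def group_tags(tags):
    """Group contributed tags under their preferred terms."""
    groups = defaultdict(set)
    for tag in tags:
        groups[normalize(tag)].add(tag)
    return dict(groups)

grouped = group_tags(["Lorry", "truck", "Грузовик", "Jumper"])
```

Here the three spellings for a truck land under one heading, while the original contributed forms are kept so that nothing a contributor wrote is lost.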

At the Web Wise conference last week, I heard Monika Hagedorn-Saupe of Europeana, the EU’s aggregator of digital libraries, speak. They are dealing with a huge alphabetic, semantic, and language issue and are developing a semantic search engine that you can test. I think it has promise, and I’m hoping to find out more about the project and will report on it here.

The concept of the semantic web has been around for a number of years, and for at least 10 years we’ve been hearing how the semantic web would change the way we use the web. The automatic linking of similar ideas, even if those ideas are not specifically indicated in the resource, has been something of a holy grail for information professionals since the digital age began and we realized that it would be impossible to maintain metadata about digital content in the way that we did for analog content.

Finding a way out of our semantic/language/alphabet dilemma is going to be a bigger deal than we had originally thought when we came up with this idea.