Nov 15 2009

From data to metadata

Henk Ellermann

Providing a document with a proper context of other documents, annotations, citations, data sets, previous versions, guidelines and the like is the subject of a whole spectrum of recent developments. The semantic web provides the theory and an ever-growing set of tools to describe relations between documents in a way that machines can process. OAI-ORE is one of the additions to the semantic web repertoire; it allows one to describe a bundle of such documents as one whole. Relations between documents and groupings of documents are two important structures that can provide a context to a document, a context that may be really useful at reading time.

But, given a document, how can we find the metadata? The problem is that documents and metadata are often separate resources which are not necessarily bi-directionally linked. In practice, the link from a document to resources containing metadata about that document is often missing.

Let’s first change our terminology before discussing this in more detail. Following the principles and accepted terminology of the web architecture we will not use the term document, but replace it with the more general term resource. A resource refers to a variety of objects, including textual documents, audio-visual material, data sets, running code, etcetera. We will only deal with resources that have an online presence.

A resource can take two roles. It can be a metadata resource describing other resources, and it can be a data resource itself. To save on typing we will use the acronym MR to refer to a metadata resource and DR for a data resource. If a resource refers explicitly to another resource and gives some extra information about it, the referring resource is an MR and the referred-to resource a DR. Two resources can refer to each other and so be both a DR and an MR at the same time. A catalog, for example, can describe a series of art objects and it can itself be described by yet another resource, for instance one that critiques the catalog as an art object in itself. The catalog is then both a data and a metadata resource. Seen from a philosophical perspective one could note that everything refers to something else, so everything is metadata. So being a DR or an MR is a role a resource can take.

Any MR needs a reference to a DR, while a DR does not need a reference to an MR. Without loss of generality, I think, we can assume that an MR is a set of RDF triples. All metadata formats can in principle be expressed in the RDF data model. What is needed to realize this transformation is that every resource is globally and uniquely identifiable. RDF uses URIs for this. Let’s however ignore this "problem" for a while.
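To make the idea of an MR as a set of triples a bit more tangible, here is a minimal sketch in plain Python. It is not an RDF library, only an illustration: the two example.org URIs, the date value and the helper `described_resources` are made up for this example, while the predicates are ordinary Dublin Core terms.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical URIs, used for illustration only.
DR_URI = "http://example.org/articles/42"
MR_URI = "http://example.org/articles/42/metadata"

# The MR modelled as a set of triples. Two statements describe the DR,
# one statement describes the MR itself.
metadata_resource = {
    Triple(DR_URI, "http://purl.org/dc/terms/title", "From data to metadata"),
    Triple(DR_URI, "http://purl.org/dc/terms/creator", "Henk Ellermann"),
    Triple(MR_URI, "http://purl.org/dc/terms/created", "2009-11-15"),
}


def described_resources(triples: Set[Triple], mr_uri: str) -> Set[str]:
    """Return the URIs this MR says something about, excluding the MR itself."""
    return {t.subject for t in triples if t.subject != mr_uri}


print(described_resources(metadata_resource, MR_URI))
```

Note that the triples point from the MR to the DR. Nothing in the DR has to change for this MR to exist, which is exactly why finding the MR from the DR is the hard direction.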

We then have the situation in which MRs and DRs each have a URI. If the URIs are URLs, both should lead to a representation of the resource that is retrievable over HTTP. There are only a few conceivable methods to retrieve one or more MRs when a DR is known.

  1. There is an algorithm (or convention) that derives the URIs of one, or all, MRs from the URI of a DR.
  2. There is an algorithm (or convention) that derives the URIs of one, or all, MRs from the content of a DR.
  3. There is at least one registry where an MR registers itself as an MR for a given DR URI.
  4. The site that hosts a DR can be notified of the existence of MRs that mention it.
  5. References to an MR are embedded in the DR.
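Three of these options can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions, not a proposal for the actual conventions: the `.rdf` suffix in option 1, the in-memory `Registry` class for option 3, and the choice of a `<link rel="describedby">` element for option 5 are all assumptions made for this example, as are the function and class names.

```python
from collections import defaultdict
from html.parser import HTMLParser
from typing import List


def mr_uri_by_convention(dr_uri: str, suffix: str = ".rdf") -> str:
    """Option 1: derive an MR URI from a DR URI by a fixed (assumed) convention."""
    return dr_uri.rstrip("/") + suffix


class Registry:
    """Option 3: an in-memory stand-in for a registry of MRs per DR URI."""

    def __init__(self) -> None:
        self._by_dr = defaultdict(set)

    def register(self, mr_uri: str, dr_uri: str) -> None:
        """An MR announces that it describes the given DR."""
        self._by_dr[dr_uri].add(mr_uri)

    def lookup(self, dr_uri: str) -> List[str]:
        """Return all known MR URIs for a DR URI, sorted for stable output."""
        return sorted(self._by_dr.get(dr_uri, set()))


class _LinkFinder(HTMLParser):
    """Collects href values of <link> elements with a given rel value."""

    def __init__(self, rel: str) -> None:
        super().__init__()
        self.rel = rel
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").split()
        if self.rel in rel_values and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def embedded_mr_links(html: str, rel: str = "describedby") -> List[str]:
    """Option 5: read MR references embedded in an HTML representation of a DR."""
    finder = _LinkFinder(rel)
    finder.feed(html)
    return finder.hrefs
```

A short usage example, again with made-up URIs:

```python
dr = "http://example.org/doc/1"
registry = Registry()
registry.register("http://example.org/meta/a", dr)

print(mr_uri_by_convention(dr))
print(registry.lookup(dr))
print(embedded_mr_links('<head><link rel="describedby" href="http://example.org/doc/1.rdf"></head>'))
```

The code is trivial in each case. What it cannot supply is the agreement on the suffix, on who runs the registry, or on which link relation to use.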

Each of these options needs to be detailed. Option 1 could be (partially) solved by Cool URIs, though the finer details of the HTTP response mechanism (especially HTTP 302/303 responses) need to be explored. In option 4 one of the refback methods (HTTP) might be used, or perhaps a variation on the trackback mechanisms used in the weblog world. But this is not the place to present those details.
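Just to indicate the flavour of the 303 route, and without claiming this is how it should finally be done: a client could ask for an RDF representation of the DR URI and, if the server answers with 303 See Other, treat the Location header as a pointer to an MR. The function names below, the example URIs and the assumption that the server behaves this way are all mine; the sketch only builds the request and interprets a response, it does not contact any server.

```python
from typing import Mapping, Optional
from urllib.request import Request


def build_metadata_request(dr_uri: str) -> Request:
    """Build (but do not send) a GET request that asks for RDF/XML."""
    return Request(dr_uri, headers={"Accept": "application/rdf+xml"})


def metadata_location(status: int, headers: Mapping[str, str]) -> Optional[str]:
    """Return the Location target of a 303 See Other response, else None.

    Assumption for this sketch: a 303 answer to a request for RDF points
    to a resource that describes the requested one.
    """
    if status != 303:
        return None
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    return lowered.get("location")


request = build_metadata_request("http://example.org/doc/1")
print(request.get_header("Accept"))
print(metadata_location(303, {"Location": "http://example.org/doc/1.rdf"}))
```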

What is noteworthy in all solutions is that a community effort is needed to get it done. Technical details need to be worked out, but first conventions need to be written down, and workflows and responsibilities need to be spelled out and assigned to organisations. Registries might be needed, and certain groups would have to maintain them. What is lacking, therefore, is a community that sets itself to these tasks.


Nov 13 2009

Tinkering or thinking?

Henk Ellermann

We too often start with tools and wonder what we can do with them. A large part of the Library 2.0 movement is like that. There are sites, like 23dingen, that seem to promote that attitude. Learn what the internet has to offer and then use it, professionally if possible. The workflow seems to be as follows: become aware of what the internet has to offer, get used to it, and apply it. Do all that, then take the attitude of an evangelist, and you are a modern 2.0 librarian.

This is tinkering.

The idea that our work should be demand-driven leads to tinkering too. We define services in collaboration with researchers and librarians and then realize those services. Serving our customers is of course our main goal, in the end, but librarians should not jump from "demand" to "demand". What is lost is a reflection on what may connect the services thus developed. What is lost too is a critical attitude to the foundations of a library. For example, customers tend to take many things for granted (like: libraries shuffle documents, whether online or offline; libraries offer search tools that answer questions by presenting a list of documents). A rethinking of the foundations of libraries and the resources they work with will rarely be triggered by obeying customer demands.

Tinkering works from existing "infrastructures" and takes them for granted. Not tinkering, but thinking might be instrumental in changing that infrastructure in order to deliver future services with a maximum of ease. It is not that no one thinks about such an infrastructure. The OAIS reference architecture, SOA, 5S and similar undertakings show that infrastructural issues are addressed in the literature. Also, the Linked Data Initiative has come up with clear advice on how to represent metadata and how metadata can be re-used. Registries of identifiers are seen as essential in this context. The issues around Open Data and rights of (re-)use have received considerable attention too, and are an integral part of a solid infrastructure. That work could lead to the definition of an overall architecture and to the development of a flexible infrastructure on top of which new services can be developed.

It would be advisable, I think, to retreat from the demand-oriented strategy and start working on the specification of a good and flexible infrastructure using one of the existing methodologies (a few were mentioned). I think it is an essential step that would make future developments less costly and increase the likelihood that developments become stable and sustainable services. And we surely should not waste our time with 23 things. There is no inherent evil in that work, but it distracts us from the core issue: building an adequate infrastructure for the digital library.

We need to think more and tinker less.