From data to metadata

Henk Ellermann

Providing a document with a proper context of other documents, annotations, citations, data sets, previous versions, guidelines and the like is the subject of a whole spectrum of recent developments. The semantic web provides the theory and an ever-growing set of tools to describe relations between documents in a way that machines can process. OAI-ORE is one of the additions to the semantic web repertoire that allows one to describe a bundle of such documents as one whole. Relations between documents and groupings of documents are two important structures that can provide a context to a document, a context that may be really useful at reading time.

But, given a document, how can we find the metadata? The problem is that documents and metadata are often separate resources which are not necessarily bi-directionally linked. In practice, the link from a document to resources containing metadata about that document is often missing.

Let’s first change our terminology before discussing this in more detail. Following the principles and accepted terminology of the web architecture we will not use the term document, but replace it with the more general term resource. A resource refers to a variety of objects, including textual documents, audio-visual material, data sets, running code, etcetera. We will only deal with resources that have an online presence.

A resource can take two roles. It can be a metadata resource that describes other resources, and it can be a data resource that is itself described. To save on typing we will use the acronym MR to refer to a metadata resource and DR for a data resource. If a resource refers explicitly to another resource and gives some extra information about it, the referring resource is an MR and the referred-to resource a DR. Two resources can refer to each other and so each be both a DR and an MR at the same time. A catalog, for example, can describe a series of art objects and it can itself be described by yet another resource, for instance one that critiques the catalog as an art object in itself. The catalog is then both a data and a metadata resource. Seen from a philosophical perspective it can perhaps be noted that everything refers to something else, so everything is metadata. So being a DR or an MR is a role a resource takes, not a fixed property of the resource.

Any MR needs a reference to a DR, while a DR does not need a reference to an MR. Without loss of generality, I think, we can assume that an MR is a set of RDF triples. All metadata formats can in principle be expressed in the RDF data model. What is needed to realize this transformation is that every resource is globally and uniquely identifiable. RDF uses URIs for this. Let’s however ignore this "problem" for a while.
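To make the idea of an MR as a set of triples a bit more concrete, here is a minimal sketch in plain Python. The example.org URIs, the literal values and the helpers described_resources and roles are all made up for illustration; only the Dublin Core terms namespace is real. A real application would probably use an RDF library rather than bare tuples.

```python
from typing import Dict, NamedTuple, Set


class Triple(NamedTuple):
    subject: str    # URI of the resource being described (the DR)
    predicate: str  # URI of the property
    obj: str        # a literal value or another URI


DCTERMS = "http://purl.org/dc/terms/"  # Dublin Core terms namespace

# Hypothetical URIs, invented for this example.
PAINTING = "http://example.org/objects/painting-1"
CATALOG = "http://example.org/catalog"
REVIEW = "http://example.org/reviews/catalog-review"

# Each MR is identified by its own URI and consists of a set of triples.
metadata_resources: Dict[str, Set[Triple]] = {
    CATALOG: {Triple(PAINTING, DCTERMS + "title", "Still life with fruit")},
    REVIEW: {Triple(CATALOG, DCTERMS + "description", "A critique of the catalog")},
}


def described_resources(mr_uri: str) -> Set[str]:
    """Return the URIs of the DRs that the given MR says something about."""
    return {t.subject for t in metadata_resources.get(mr_uri, set())}


def roles(uri: str) -> Set[str]:
    """Return the roles ("MR", "DR") a resource plays in this small collection."""
    found = set()
    if described_resources(uri):
        found.add("MR")
    if any(uri in described_resources(mr) for mr in metadata_resources):
        found.add("DR")
    return found
```

In this toy collection the catalog comes out as both an MR and a DR, which matches the catalog example given earlier. Note that the triples only point from MR to DR: nothing stored with the painting tells us that the catalog exists.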

We then have the situation in which MRs and DRs both have a URI. If the URIs are URLs, each should lead to a representation of the resource that is retrievable over HTTP. There are only a few conceivable methods to retrieve one or more MRs when a DR is known.

  1. There is an algorithm (or convention) that derives the URIs of one, or all, MRs from the URI of a DR.
  2. There is an algorithm (or convention) that derives the URIs of one, or all, MRs from the content of a DR.
  3. There is at least one registry where an MR registers itself as an MR for a given DR URI.
  4. The site that hosts a DR can be notified of the existence of MRs that mention it.
  5. References to one or more MRs are embedded in the DR itself.
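As a rough illustration of how different options 1 and 3 are, here is a small Python sketch. Everything in it is an assumption made for the example: the ".meta" suffix is a hypothetical convention, the Registry class is a toy in-memory stand-in for a real registry service, and the URIs are invented.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set


def mr_uri_by_convention(dr_uri: str, suffix: str = ".meta") -> str:
    """Option 1: derive the URI of an MR from the URI of the DR.

    The suffix is a hypothetical convention; no such standard is implied.
    """
    return dr_uri + suffix


class Registry:
    """Option 3: a registry where MRs register themselves for a DR URI."""

    def __init__(self) -> None:
        self._mrs_by_dr: DefaultDict[str, Set[str]] = defaultdict(set)

    def register(self, mr_uri: str, dr_uri: str) -> None:
        """Record that mr_uri contains metadata about dr_uri."""
        self._mrs_by_dr[dr_uri].add(mr_uri)

    def lookup(self, dr_uri: str) -> List[str]:
        """Return all known MR URIs for a DR, in sorted order."""
        return sorted(self._mrs_by_dr.get(dr_uri, set()))


# Hypothetical URIs, invented for this example.
DR = "http://example.org/papers/42"

registry = Registry()
registry.register("http://example.org/reviews/7", DR)
registry.register("http://example.net/annotations/a1", DR)

derived = mr_uri_by_convention(DR)
known = registry.lookup(DR)
```

The convention needs no shared infrastructure but can only find MRs that live at predictable locations, typically on the same site as the DR. The registry can find MRs anywhere, but only those that somebody bothered to register, and somebody has to run it.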

Each of these options needs to be worked out in detail. Option 1 could be (partially) solved by Cool URIs, for which the finer details of the HTTP response mechanism (especially HTTP 302/303 responses) need to be explored. In option 4 one of the HTTP-based refback methods might be used, or perhaps a variation on the trackback mechanisms used in the weblog world. But this is not the place to present those details.
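Just to give a flavour of the HTTP side of option 1, the following Python sketch interprets a response that has already been received for a DR URI. It assumes a server that answers with 303 See Other and a Location header pointing at a describing resource; whether that pattern suits DRs that are themselves documents is one of the details still to be explored. The function name and the URIs are hypothetical, and no network access is involved.

```python
from typing import Mapping, Optional
from urllib.parse import urljoin


def metadata_uri_from_response(
    dr_uri: str, status: int, headers: Mapping[str, str]
) -> Optional[str]:
    """Return the URI a 303 response points to, or None if there is none.

    Assumption: a 303 See Other on the DR URI leads to a resource that
    describes the DR. Other status codes are ignored in this sketch.
    """
    if status != 303:
        return None
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    location = lowered.get("location")
    if location is None:
        return None
    # A Location value may be relative; resolve it against the DR URI.
    return urljoin(dr_uri, location)


# Hypothetical request URI and response data, invented for this example.
found = metadata_uri_from_response(
    "http://example.org/id/painting-1",
    303,
    {"Location": "/doc/painting-1"},
)
```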

What is noteworthy in all solutions is that a community effort is needed to get them done. Technical details need to be worked out, but first conventions need to be written down, and workflows and responsibilities need to be spelled out and handed over to organisations. Registries might be needed and certain groups will have to maintain them. What is lacking, therefore, is a community that sets itself to these tasks.

