Nov 17 2009

Identity: how?

Henk Ellermann

The infamous httpRange-14 issue is about making a distinction between informational and non-informational resources, that is, between resources that are on the internet and those that are not, or cannot be, on the internet. Since all resources should have an identifier (URI), it seems good to know which resources can have a URL (be on the internet) and which resources (say, a person) can only have a URN.

The W3C Technical Architecture Group proposed an operational definition to make that distinction. It goes as follows:

  1. If an “http” resource responds to a GET request with a 2xx response, then the resource identified by that URI is an information resource;
  2. If an “http” resource responds to a GET request with a 303 (See Other) response, then the resource identified by that URI could be any resource;
  3. If an “http” resource responds to a GET request with a 4xx (error) response, then the nature of the resource is unknown.
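
To make the rule concrete, here is a minimal sketch of the three cases as a function of the HTTP status code. The function name and the returned labels are my own illustration and not part of the TAG text; status codes that the rule does not mention are reported as such.

```python
def classify_by_status(status: int) -> str:
    """Map the status code of a GET response to the three cases listed above."""
    if 200 <= status <= 299:
        return "information resource"
    if status == 303:
        return "any resource"
    if 400 <= status <= 499:
        return "unknown"
    # Assumption: the rule as quoted says nothing about other status codes.
    return "not covered by the rule"


print(classify_by_status(200))
print(classify_by_status(303))
print(classify_by_status(404))
```

The function only restates the rule; it says nothing about whether the rule is a sound way to classify resources.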

It is almost too obvious that this will never work, for a variety of practical and technical reasons. If arguments for this are needed, the reader is referred to “httpRange-14, Cool URIs & FRBR” or “The web’s identity crisis and httpRange-14”.

But there is also something fundamentally wrong with this proposal. It classifies types of resources using properties of URI’s (in the context of the way the internet works). We should not use URI’s to classify things; we should use metadata for that. Classification determines, for the purpose of the classification, what things are identical. Establishing absolute identity is just classification with extreme discriminatory power.

Following Leibniz, we should say that thing x is the same as thing y (they are identical) when each and every predicate that is true of x is also true of y, and vice versa. In library terms: if the metadata describing x are the same as the metadata describing y, then x and y are identical. If the metadata are coarse, many things that we see as different will still be the same according to this definition. This leads us to an operational definition of identity.

If x and y are described by the same metadata, they are identical. If, for some reason, we see a significant or relevant difference between x and y, we should add new metadata elements that express this.
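
A minimal sketch of this operational definition, assuming that a description is simply a set of attribute/value pairs. The attribute names and values are hypothetical and serve only as illustration.

```python
def identical(x: dict, y: dict) -> bool:
    """Two things are identical when their metadata are the same."""
    return x == y


# Coarse metadata: the two descriptions cannot be told apart.
x = {"title": "Annual report", "creator": "Library"}
y = {"title": "Annual report", "creator": "Library"}
print(identical(x, y))  # True

# We see a relevant difference, so we add a metadata element that expresses it.
x_refined = {**x, "version": "draft"}
y_refined = {**y, "version": "final"}
print(identical(x_refined, y_refined))  # False
```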

The above-mentioned solution to the httpRange problem clearly violates this principle.

But classification using metadata (predicates) on the internet is not without its problems. I will mention three:

  1. It is hard to find all relevant metadata for a given thing, so a complete comparison is often not possible. Identity, in practice, has to be concluded using incomplete evidence.
  2. Not all metadata seem to be used to express identity. Some metadata merely relate things to each other without defining their identity. That document x is a commentary on document y and not on document z does not allow us to conclude that y and z are different. Or does it? In any case, the question is: which metadata elements define identity? To say that all metadata do seems too easy an answer.
  3. On the internet, what has received a name (is called x), may change over time. So what is described may not be fixed.

Problem 3 seems to be the most fundamental. Things need to be named, singled out if you want, before they can be described. (Yes, we need some intuition about what things are. The internet supports this intuition by the way it is organized. At any given moment in time one URI/URL produces an identifiable thing.)

So how can things be named if they can change all the time? Here the recent Memento proposal comes in handy. It offers a (partial) technical solution to the problem of web archiving by introducing a system that allows us to identify representations of resources at particular times (and relate these timed traces to the current URI). Conceptually, the Memento proposal allows us to see all that happens on the internet as events and offers a system to identify (name) those events.
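
As a rough illustration of asking for a resource "as it was at a particular time": Memento, as later standardized in RFC 7089, expresses the desired time in an `Accept-Datetime` request header. Whether the 2009 proposal used exactly this header name is an assumption here, and the helper name is hypothetical. The sketch only builds the header and sends nothing.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_request_headers(moment: datetime) -> dict:
    """Build request headers asking for a resource as it was at `moment`."""
    # HTTP dates are expressed in GMT, so convert to UTC before formatting.
    utc_moment = moment.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc_moment, usegmt=True)}


moment = datetime(2009, 11, 17, 12, 0, tzinfo=timezone.utc)
headers = memento_request_headers(moment)
print(headers["Accept-Datetime"])  # Tue, 17 Nov 2009 12:00:00 GMT
```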

So the internet then becomes a system of named events. We have the primitives at hand, ready to be described. We have the possibility of naming all that happens on the internet using URL’s (including parameters) and time. Having a set of events at hand allows us to work with the pragmatic definition of identity given above: events are identical if all the metadata are the same (whereby it is always possible to split up equivalence classes by adding, or finding, extra metadata).

Formally, the internet is the set of all events E, whereby an event e is a tuple e=<URL,time>.

Using metadata, we can define equivalence classes on this set. Metadata are events themselves too, of course. An event is a metadata event if it contains a reference to another event. That other event can be a metadata event too. So it is the content (having a URI in the content) that determines whether an event is a metadata event.
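
A minimal sketch of these two notions, under simplifying assumptions: an event is a (URL, time) pair, its content is plain text, and a "reference" is any http(s) URI found in that content, not a full (URL, time) pair. The class name, function names and URLs are hypothetical.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Event:
    """An event e = <URL, time>."""
    url: str
    time: datetime


# Assumption: a simple pattern is good enough to spot URIs in text content.
URI_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def referenced_uris(content: str) -> List[str]:
    """Return the URIs mentioned in the content of an event."""
    return URI_PATTERN.findall(content)


def is_metadata_event(content: str) -> bool:
    """An event is a metadata event if its content refers to something else."""
    return bool(referenced_uris(content))


e = Event("http://example.org/review", datetime(2009, 11, 17))
content = "A commentary on http://example.org/paper written last week"
print(e.url, is_metadata_event(content))
```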

What is said about that reference could, if one so wishes, determine whether the talked-about “object” is informational or non-informational. Being informational or not is the proper subject of metadata.

The second problem needs to be solved by singling out a set of attributes that create the equivalence classes. Equivalence becomes a matter of perspective, a matter of selection of appropriate metadata events. Which ones are appropriate may depend entirely on the application one has in mind.
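
The idea that a selection of attributes creates the equivalence classes can be sketched as follows. Events are written as (URL, date) pairs, and the attribute names and values are made up for the example; which attributes are appropriate is left to the application.

```python
from collections import defaultdict


def equivalence_classes(described: dict, attributes: tuple) -> list:
    """Group events that agree on every selected attribute."""
    classes = defaultdict(list)
    for event, metadata in described.items():
        # Events with the same values for the selected attributes share a key.
        key = tuple(metadata.get(name) for name in attributes)
        classes[key].append(event)
    return list(classes.values())


# Hypothetical events with hypothetical metadata.
described = {
    ("http://example.org/a", "2009-11-15"): {"type": "text", "language": "en"},
    ("http://example.org/a", "2009-11-17"): {"type": "text", "language": "nl"},
    ("http://example.org/b", "2009-11-17"): {"type": "image", "language": None},
}

coarse = equivalence_classes(described, ("type",))
fine = equivalence_classes(described, ("type", "language"))
print(len(coarse), len(fine))  # 2 3
```

Selecting more attributes can only split classes further, never merge them, which matches the remark that equivalence classes can always be split by adding metadata.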

The first problem becomes one of finding metadata events given an event. We discussed this already (if only partially) in an earlier post called “From data to metadata”.


Nov 15 2009

From data to metadata

Henk Ellermann

Providing a document with a proper context of other documents, annotations, citations, data sets, previous versions of documents, guidelines and the like, is the proper subject of a whole spectrum of recent developments. The semantic web provides the theory and an ever growing set of tools to describe relations to documents in a way that machines can process. OAI-ORE is one of the additions to the semantic web repertoire that allows one to describe a bundle of such documents as one whole. Relations between documents and groupings of documents are two important structures that can provide a context to a document, a context that may be really useful at reading time.

But, given a document, how can we find the metadata? The problem is that documents and metadata are often separate resources which are not necessarily bi-directionally linked. In practice, the link from a document to resources containing metadata about that document is often missing.

Let’s first change our terminology before discussing this in more detail. Following the principles and accepted terminology of the web architecture we will not use the term document, but replace it with the more general term resource. A resource refers to a variety of objects, including textual documents, audio-visual material, data sets, running code, etcetera. We will only deal with resources that have an online presence.

A resource can take two roles. It can be a metadata resource describing other data resources, and it can be a data resource itself. To save on typing we will use the acronym MR to refer to a metadata resource and DR for a data resource. If a resource refers explicitly to another resource and gives some extra information about it, the referring resource is an MR and the referred-to resource a DR. Two resources can refer to each other and so be both a DR and an MR at the same time. A catalog, for example, can describe a series of art objects and it can itself be described by yet another resource, for instance one that critiques the catalog as an art object in itself. The catalog is both a data and a metadata resource. Seen from a philosophical perspective, it can perhaps be noted that everything refers to something else, so everything is metadata. So, being a DR or an MR are roles a resource can take.
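
The two roles can be derived from nothing more than who refers to whom. The following minimal sketch uses the catalog example; the resource names and the function name are hypothetical.

```python
def roles(references: dict) -> dict:
    """Derive the roles (MR, DR) of each resource from explicit references."""
    result = {}
    for source, targets in references.items():
        if targets:
            # A resource that refers to another resource acts as an MR.
            result.setdefault(source, set()).add("MR")
        for target in targets:
            # A resource that is referred to acts as a DR.
            result.setdefault(target, set()).add("DR")
    return result


references = {
    "catalog": ["artwork-1", "artwork-2"],
    "critique": ["catalog"],
}
found = roles(references)
print(sorted(found["catalog"]))  # ['DR', 'MR']
```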

Any MR needs a reference to a DR, while a DR does not need a reference to an MR. Without loss of generality, I think, we can assume that an MR is a set of RDF triples. All metadata formats can in principle be expressed in the RDF data model. What is needed to realize this transformation is that every resource is globally and uniquely identifiable. RDF uses URI’s for this. Let’s however ignore this "problem" for a while.

We then have the situation in which MR’s and DR’s have a URI. If the URI’s are URL’s, both should lead to a representation (retrievable over HTTP) of the resources. Only a few methods are conceivable for retrieving one or more MR’s when a DR is known.

  1. There is an algorithm (or convention) that derives the URI’s of a, or all, MR’s from the URI of a DR.
  2. There is an algorithm (or convention) that derives the URI’s of a, or all, MR’s from the content of a DR.
  3. There is at least one registry where an MR registers itself as an MR for a given DR URI.
  4. The site that hosts a DR can be notified of the existence of MR’s that mention it.
  5. The DR itself embeds references to one or more MR’s.
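
To give an impression of how small the technical core of some of these options is, here is a minimal sketch of options 1 and 3. The ".meta" suffix convention, the class and the URIs are hypothetical; no existing convention or registry is implied.

```python
from collections import defaultdict


def mr_uri_by_convention(dr_uri: str, suffix: str = ".meta") -> str:
    """Option 1: derive the URI of an MR from the URI of a DR."""
    # Assumption: the community has agreed on appending a fixed suffix.
    return dr_uri + suffix


class Registry:
    """Option 3: a registry where MR's register themselves for a DR URI."""

    def __init__(self) -> None:
        self._by_dr = defaultdict(set)

    def register(self, mr_uri: str, dr_uri: str) -> None:
        self._by_dr[dr_uri].add(mr_uri)

    def lookup(self, dr_uri: str) -> list:
        """Return all MR URIs registered for a DR, in sorted order."""
        return sorted(self._by_dr.get(dr_uri, set()))


registry = Registry()
registry.register("http://example.org/reviews/1", "http://example.org/paper")
registry.register("http://example.org/citations/7", "http://example.org/paper")
print(mr_uri_by_convention("http://example.org/paper"))
print(registry.lookup("http://example.org/paper"))
```

The code is the easy part; the sketch assumes exactly the agreements (a shared suffix, a shared registry) that still have to be made.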

Each of these options needs to be detailed. Option 1 could be (partially) solved by Cool URI’s, and the finer details of the HTTP response mechanism (especially HTTP 302/303 responses) need to be explored. In option 4 one of the refback methods (HTTP) might be used, or perhaps a variation on the trackback mechanisms used in the weblog world. But this is not the place to present those details.

What is noteworthy in all solutions is that a community effort is needed to get it done. Technical details need to be worked out, but first conventions need to be written down, and workflows and responsibilities need to be written out and handed over to organisations. Registries might be needed and certain groups have to maintain them. What is lacking, therefore, is a community that sets itself to these tasks.