Identity: how?
The infamous httpRange-14 issue is about making a distinction between informational and non-informational resources, or between resources that are on the internet and those that are not, or cannot be, on the internet. Since all resources should have an identifier (URI), it seems good to know which resources can have a URL (be on the internet) and which resources (say, a person) can only have a URN.
The W3C Technical Architecture Group proposed an operational definition to make that distinction. It goes as follows:
- If an “http” resource responds to a GET request with a 2xx response, then the resource identified by that URI is an information resource;
- If an “http” resource responds to a GET request with a 303 (See Other) response, then the resource identified by that URI could be any resource;
- If an “http” resource responds to a GET request with a 4xx (error) response, then the nature of the resource is unknown.
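To make the rule concrete, here is a minimal sketch of it as a function of the HTTP status code alone. The function name and the returned labels are my own, chosen for illustration; the rule as quoted says nothing about other status codes, so the sketch reports those as not covered.

```python
def classify_by_status(status: int) -> str:
    """Return what the quoted rule lets us conclude from a GET response status."""
    if 200 <= status < 300:
        # 2xx: the URI identifies an information resource.
        return "information resource"
    if status == 303:
        # 303 See Other: the URI could identify any resource.
        return "any resource"
    if 400 <= status < 500:
        # 4xx: the nature of the resource is unknown.
        return "unknown"
    # The rule as quoted does not address other status codes.
    return "not covered"


print(classify_by_status(200))
print(classify_by_status(303))
print(classify_by_status(404))
```

Note that the only input is a property of how the URI behaves, not anything said about the resource.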
It is almost too obvious that this definition will never work, for a variety of practical and technical reasons. If arguments for this are needed, the reader is referred to httpRange-14, Cool URIs & FRBR or The web’s identity crisis and httpRange-14.
But there is also something fundamentally wrong with this proposal. It classifies types of resources using properties of URIs (in the context of the way the internet works). We should not use URIs to classify things; we should use metadata for that. Classification determines, for the purpose of the classification, what things are identical. Establishing absolute identity is just classification with extreme discriminatory power.
Following Leibniz we should say that thing x is the same as thing y (x and y are identical) when each and every predicate that is true of x is also true of y, and vice versa. In library terms: if the metadata describing x are the same as the metadata describing y, then x and y are identical. If the metadata are coarse, many things that we see as different will still be the same according to this definition. This leads us to an operational definition of identity.
If x and y are described by the same metadata, they are identical. If, for some reason, we see a significant or relevant difference between x and y, we should add new metadata elements that express this.
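A minimal sketch of this operational definition, assuming metadata can be represented as a simple mapping from element names to values. The element names (`title`, `author`, `edition`) and the example records are hypothetical, chosen only for illustration.

```python
def identical(x_meta: dict, y_meta: dict) -> bool:
    """Two things are identical when their metadata are the same."""
    return x_meta == y_meta


# Hypothetical coarse metadata: two printings described the same way.
x = {"title": "Monadology", "author": "Leibniz"}
y = {"title": "Monadology", "author": "Leibniz"}
print(identical(x, y))  # same metadata, so identical under this definition

# We see a relevant difference, so we add a metadata element that expresses it.
x["edition"] = "1714"
y["edition"] = "1898"
print(identical(x, y))  # the added element now separates them
```

Nothing in the comparison looks at identifiers; only the descriptions count.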
The above mentioned solution to the httpRange problem clearly violates this principle.
But classification using metadata (predicates) on the internet is not without its problems. I will mention three:
- It is hard to find all relevant metadata for a given thing, so a complete comparison is often not possible. Identity, in practice, has to be concluded using incomplete evidence.
- Not all metadata seem to be used to express identity. Some metadata merely relate things to each other without defining their identity. That document x is a commentary on document y and not on document z does not allow us to conclude that y and z are different. Or does it? In any case, the question is: which metadata elements define identity? To say that all metadata do seems too easy an answer.
- On the internet, what has received a name (is called x), may change over time. So what is described may not be fixed.
The third problem seems to be the most fundamental. Things need to be named, singled out if you want, before they can be described. (Yes, we need some intuition about what things are. The internet supports this intuition by the way it is organized. At any given moment in time one URI/URL produces an identifiable thing.)
So how can things be named when they can change all the time? Here the recent Memento proposal comes in handy. It offers a (partial) technical solution to the problem of web archiving by introducing a system that allows us to identify representations of resources at particular times (and relate these timed traces to the current URI). Conceptually, the Memento proposal allows us to see all that happens on the internet as events and offers a system to identify (name) those events.
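As a rough sketch of how a URI plus a time can name such an event, the snippet below builds the request header a client would send to ask for a resource as it was at a given moment. It assumes the header name `Accept-Datetime` and the HTTP date format as they were later standardized for Memento in RFC 7089; the function name is my own, and no request is actually sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_request_headers(when: datetime) -> dict:
    """Build the datetime-negotiation header for a timezone-aware datetime."""
    # HTTP dates are expressed in GMT, so normalise to UTC first.
    utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc, usegmt=True)}


headers = memento_request_headers(datetime(2010, 4, 1, 12, 0, tzinfo=timezone.utc))
print(headers["Accept-Datetime"])  # Thu, 01 Apr 2010 12:00:00 GMT
```

The pair of URI and datetime is what does the naming here; the archive's answer is the representation that was current at that time.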
So the internet then becomes a system of named events. We have the primitives at hand, ready to be described. We have the possibility of naming all that happens on the internet using URLs (including parameters) and time. Having a set of events at hand allows us to work with the pragmatic definition of identity given above: events are identical if all the metadata are the same (whereby it is always possible to split up equivalence classes by adding, or finding, extra metadata).
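Formally, the internet is the set of all events $E$, whereby an event $e \in E$ is a tuple $(u, t, c)$: a URL $u$ (including parameters), a time $t$, and the content $c$ returned at that time.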
Using metadata, we can define equivalence classes on this set. Metadata are events themselves too, of course. An event is a metadata event if it contains a reference to another event. That other event can be a metadata event too. So it is the content (having a URI in the content) that determines whether an event is a metadata event.
What is said about that reference could, if one so wishes, determine whether the talked-about “object” is informational or non-informational. Being informational or not is the proper subject of metadata.
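A minimal sketch of this test, assuming an event can be modelled as a URI, a time, and some textual content. The `Event` type, its field names, and the example URIs are my own illustration, and a plain substring check stands in for real link extraction.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Event(NamedTuple):
    uri: str        # where it happened
    time: datetime  # when it happened
    content: str    # what was returned


def is_metadata_event(event: Event, events: list) -> bool:
    """An event is a metadata event if its content refers to another event's URI."""
    return any(other != event and other.uri in event.content for other in events)


t = datetime(2010, 4, 1, tzinfo=timezone.utc)
page = Event("http://example.org/doc", t, "Hello, world")
note = Event("http://example.org/note", t, "About http://example.org/doc: a greeting")

print(is_metadata_event(note, [page, note]))  # note refers to page
print(is_metadata_event(page, [page, note]))  # page refers to nothing else
```

The check looks only at content, so a metadata event can itself be the target of another metadata event without any special treatment.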
The second problem needs to be solved by singling out a set of attributes that create the equivalence classes. Equivalence becomes a matter of perspective, a matter of selection of appropriate metadata events. Which ones are appropriate may depend entirely on the application one has in mind.
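The following sketch illustrates how the choice of attributes determines the equivalence classes. It assumes each thing is described by a mapping of metadata elements; the function name, the element names, and the sample data are hypothetical.

```python
from collections import defaultdict


def equivalence_classes(items: dict, attributes: list) -> list:
    """Group item names by their values on the selected attributes only."""
    classes = defaultdict(list)
    for name, meta in items.items():
        # Items that agree on every selected attribute share a key.
        key = tuple(meta.get(attr) for attr in attributes)
        classes[key].append(name)
    return list(classes.values())


items = {
    "a": {"title": "T", "year": 2009},
    "b": {"title": "T", "year": 2010},
    "c": {"title": "U", "year": 2009},
}

print(equivalence_classes(items, ["title"]))          # [['a', 'b'], ['c']]
print(equivalence_classes(items, ["title", "year"]))  # [['a'], ['b'], ['c']]
```

The same three things fall into two classes or three, depending on which metadata one considers relevant for the application at hand.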
The first problem becomes one of finding metadata events given an event. We discussed this already (if only partially) in an earlier post called From data to metadata.