Opaque DOIs considered harmful

I’d like to push back on CrossRef’s advice that DOIs should be “opaque identifiers” (see, e.g., Constructing your DOIs). The notion that we should generate DOIs that are deliberately “dumb”, such as random strings or numbers, is designed to prevent people from trying to interpret them. To quote from the advice:

DOIs should not include information that can be understood, interpreted or predicted, especially information that may change. Page numbers and dates are examples of information that shouldn’t be included in suffixes. It is particularly problematic if the suffix includes information that conflicts with the metadata associated with the DOI.

But I spend a huge amount of time trying to match cited literature to DOIs, and harvesting article metadata from journal web sites (on occasion from sites that have gone offline, sometimes for good). Any clue that helps me figure out whether a DOI exists, or whether an article has been missed, is invaluable. Likewise DOIs often contain valuable information about who minted them. It is not uncommon to have multiple DOIs for the same article, sometimes minted by different agencies (e.g., CrossRef and DataCite). And sometimes DOIs included in the text of a paper don’t resolve, but the pattern in the suffix can give clues as to what the DOI was meant to be, so the correct DOI can be inferred.
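To make the "who minted them" point concrete, here is a minimal sketch of splitting a DOI into its prefix (the part assigned to a registrant) and its suffix. The function name split_doi and the list of resolver forms it strips are my own assumptions for illustration, not part of any CrossRef tool.

```python
from typing import Tuple

# Resolver and label forms that often wrap a DOI in running text (illustrative list).
LEADS = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def split_doi(doi: str) -> Tuple[str, str]:
    """Split a DOI into (prefix, suffix) at the first slash."""
    doi = doi.strip()
    for lead in LEADS:
        if doi.lower().startswith(lead):
            doi = doi[len(lead):]
            break
    prefix, sep, suffix = doi.partition("/")
    # A DOI prefix starts with "10." and must be followed by a non-empty suffix.
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI: {doi!r}")
    return prefix, suffix
```

Grouping a list of DOIs by the first element of the result is a quick way to spot the same article registered under more than one prefix.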

I think the argument for opacity of identifiers is simplistic and potentially harmful to efforts to link things together. Rather than insist that there is no metadata in a DOI, why not acknowledge that (a) such metadata can’t always be relied on, but (b) it can be very useful for those trying to build the graph connecting the scholarly literature to the wider world of data?


Hi @rdmpage ,

Thank you for your message.

I understand your reasoning here, but at scale it just doesn’t work. That’s why we advise our members the way we do.

When our members register DOI suffixes using patterns we find two consistent problems that have led us to the advice that you quoted:

DOIs should not include information that can be understood, interpreted or predicted, especially information that may change. Page numbers and dates are examples of information that shouldn’t be included in suffixes. It is particularly problematic if the suffix includes information that conflicts with the metadata associated with the DOI.

  1. People and the systems we build are inconsistent and/or make mistakes.
  2. When people register DOI suffixes with human-readable patterns, they will inevitably change those patterns over time, thus compromising the persistence of the DOI itself.

As you may know, DOIs for journal articles and other traditional content (books, conference papers, datasets, etc.) are citation identifiers. The most fundamental objective for us for that DOI is its own persistence. What we find time and again is that, when human-readable information or patterns are entered into suffixes, our members will eventually decide that they want to alter those patterns. It may not happen tomorrow or next month, but it will happen.

Let’s look at an example you flagged to us in a different thread in the forum, this one: Three DOis for the same article!

In this thread, you highlight that there are three DOIs for the same journal article:

10.1002/mmng.4860070108

10.1002/mmng.20040070108

10.5194/fr-7-153-2004

You can see that there is what looks like some human-readable information in the suffixes of these DOIs, which, like you said above, is helpful for you:

mmng is likely an identifier for the journal or organization responsible for the journal (you’d know better than me, but I assume that is meaningful)

2004 - present in one of the three DOIs is the publication year of this journal article

07 and 01 - the volume and issue number

153 - the first page of the article itself present in that final DOI

486 - I don’t know; is that meaningful?
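As a rough sketch of the kind of guessing described in the list above, the snippet below pulls candidate fields out of a hyphenated suffix such as fr-7-153-2004. The pattern (journal code, volume, first page, year) is purely an assumption read off that one DOI, and the names HYPHENATED and guess_from_suffix are hypothetical. Whatever it returns is a hint to check against the registered metadata, not a fact.

```python
import re
from typing import Dict, Optional

# Assumed layout: <journal code>-<volume>-<first page>-<year>.
# This is a guess based on a single example, not a documented convention.
HYPHENATED = re.compile(
    r"^(?P<journal>[a-z]+)-(?P<volume>\d+)-(?P<first_page>\d+)-(?P<year>\d{4})$"
)


def guess_from_suffix(suffix: str) -> Optional[Dict[str, str]]:
    """Return candidate fields if the suffix fits the assumed layout, else None."""
    match = HYPHENATED.match(suffix.lower())
    return match.groupdict() if match else None
```

A suffix like mmng.4860070108 does not fit this layout, so the function returns None for it, which is itself a small illustration of how quickly such patterns stop holding.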

I don’t know the complex history of this journal article and why it has three DOIs, but what we sometimes find is that our members will decide to change the human-readable information in a DOI suffix so that it is uniform. They, like you, want to be able to look at the suffix and know for sure what each blip of information is about, right? In some cases, that’s why we get duplicate DOIs like 10.1002/mmng.4860070108 and 10.1002/mmng.20040070108: the member has modified their suffix pattern and registered new DOIs for existing articles in order to have uniform DOIs. They may have made a mistake entering that human-readable information or decided to adjust it over time.

When that happens it makes it much more difficult for you, others in the community, and for us to know which is the definitive DOI that will be used going forward for the journal article. It’s harder for you as a human. It’s harder for us to establish relationships, most notably cited-by links. Which one do we match to? Which one do you share with your colleagues?

I asked above which of the human-readable information was meaningful in this example. Where can I find that information? In our APIs. All of the metadata for these DOIs is there and available to all. We don’t have to guess whether, for instance, 486 is meaningful; we can simply review the whole record, because the metadata record itself is the definitive source of meaning.

If you prefer XML: http://doi.crossref.org/search/doi?pid=support@crossref.org&format=unixsd&doi=10.1002%2Fmmng.20040070108

If you prefer JSON:
http://api.crossref.org/works/10.5194/fr-7-153-2004?mailto=support@crossref.org
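For anyone who wants to script the JSON lookup, a minimal sketch follows. It builds the same kind of works URL as the link above and reads the record from the "message" key of the response. The function names works_url and fetch_metadata are hypothetical, and the optional mailto value should be your own address.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

API = "https://api.crossref.org/works/"


def works_url(doi, mailto=None):
    """Build the REST API URL for a single DOI."""
    # Keep "/" unescaped so the DOI reads the same as in the link above.
    url = API + quote(doi, safe="/")
    if mailto:
        url += "?" + urlencode({"mailto": mailto})
    return url


def fetch_metadata(doi, mailto=None):
    """Fetch the metadata record for a DOI (needs network access)."""
    with urlopen(works_url(doi, mailto), timeout=30) as resp:
        return json.load(resp)["message"]
```

Fields such as "volume", "issue" and "page" in the returned record can then be compared with whatever the suffix seemed to suggest.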

While people, including me, like patterns, they don’t hold up over time and scale. Many of us want those DOI suffixes to have consistent meaning, but they just don’t.

Thanks again,
Isaac


Also, if you do have some metadata for a work and want to enter it to see if it matches a record in our corpus, our Simple Text Query form is a very valuable tool.

If you just want to match references to DOIs, follow these instructions.

  1. Go to the Simple Text Query form and enter a reference or list of references into the search box.
  2. Optional:
     • select List all possible DOIs per reference to return multiple results
     • select Include PubMed IDs in results to include PubMed IDs
  3. Click Submit
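If you would rather do this kind of matching from a script, one option is the REST API's query.bibliographic parameter. Note that this is a different route from the Simple Text Query form, so results may not be identical. The sketch below only builds the request URL; the function name reference_query_url and the default of three rows are assumptions for illustration.

```python
from urllib.parse import urlencode


def reference_query_url(reference, rows=3, mailto=None):
    """Build a REST API search URL for a free-text reference string."""
    params = {"query.bibliographic": reference, "rows": rows}
    if mailto:
        params["mailto"] = mailto
    return "https://api.crossref.org/works?" + urlencode(params)
```

The candidates come back ranked by score, so it is still worth checking the top hit against the reference before accepting its DOI.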

Hi @ifarley,

Thanks for taking the time to respond.

I feel your pain. Indeed this is a fundamental problem in biological taxonomy, the area that I work in. Taxonomists want species names to reflect the classification of those species, so if the classification changes, so can the name. This can cause all sorts of problems; even Drosophila melanogaster might not be safe (see Whither Drosophila?).

Just to be clear, I’m not arguing for including detailed metadata in a DOI, and I’m not advocating a return to the glory days of the SICI, where one could infer almost complete metadata from the identifier! Instead I am arguing against deliberately creating obscure identifiers that are completely uninterpretable. To me that potentially makes things less resilient.

People will always imbue strings with meaning, as @gbilder pointed out in Author Identifiers: Interview with Geoffrey Bilder:

What this often means is that, when publisher A acquires publisher B, publisher A will ask CrossRef if we can create new DOIs for all of publisher B’s backfiles so that they all have the same prefix! The answer to their request is “no”, but you wouldn’t believe how stroppy publishers can get about this. They somehow imbue this ridiculous four-digit prefix with branding significance. This, of course, is absolutely mental, but it is a predictable form of mental. The French went mad when they had to replace their region-encoded license plates with opaque EU ones. People in the US go mad when they are given new area codes. In short, when people associate semantic significance in identifiers, you will face problems.

I’m well aware of publishers changing DOIs wholesale, I’ve posted many reports on this. A classic is publishers taking over content from JSTOR and wanting to drop the 10.2307 prefix that (used to) signal JSTOR content.

I would argue that “semantic” identifiers become increasingly opaque over time. People’s names once carried meaning (Roderic = “fame; ruler, king”, apparently) but now a name is just a “dumb” string. The same is true for species names: they are often Latin(ish) descriptors of a species, but few can read Latin, and we know that “australiensis” doesn’t always mean it is found in Australia.

Hence I think identifiers eventually achieve opacity anyway. Rather than aggressively make them opaque at the start, why not allow time to do its thing? There are ways to make identifiers mildly opaque without making them obscure.

Anyway, I understand the argument from CrossRef’s point of view: you are keen to provide publishers with as few incentives as possible to want to change a persistent identifier. I’m just keen that we are aware that this has perhaps unintended consequences, especially for those of us trying to retrospectively map DOIs to literature in large databases.

Rod


I have a special connection with this because I was one of the inventors on the original US patent 6957224 for short URLs (now expired). Short opaque URLs have been criticized (rightly so, in my opinion) for breaking the semantics of the web and abusing web users. When a human looks at the URL, they can form an opinion about whether it’s worth clicking on based on the domain name or the “path” in the URL. Most of us have had the experience of seeing a link to a bit.ly URL that leads to a scam or piece of malware. I’m not saying that DOIs are in any way evil, but I think there is value in having the URL convey information to the user about what they should expect from clicking on it. In some ways, the doi.org domain already conveys something useful: that we expect it to lead to officially published material that someone cared enough to pay a dollar for. One example that personally annoys me is when I click on a DOI only to be faced with a paywall. I would just as soon know up front that the DOI is useless to me. I have personally learned a couple of 10.* prefixes that I won’t bother to click on, but that’s too high a cognitive load for most people.

In my discipline, when someone sees a URL like http://eprint.iacr.org/2000/039 it means something to them. In that example, we encoded the year of publication into the URL, and that conveys useful information to the reader on what they should expect when they click on the URL. In fact, it only conveys the first date of publication, because in that example we kept the original URL but we point to subsequent revisions. This is also consistent with what Crossmark tries to satisfy. This pretty well illustrates the tension between the two goals of having permanent URLs and having URLs that convey useful information. I completely agree that DOIs should remain permanent identifiers, but I think we can carry it too far in making them opaque and useless to readers. One of the problems we are trying to solve here is the blatant disregard that many webmasters have had for the permanence of URLs. I’m glad to see that Crossref is carrying the torch for the cause of permanent URLs, but there are other issues involved.
