Existing Standard Identifiers

by Julia Clasen

With a growing amount of content online, there is an increasing need for a universal identifier. A content identifier is a key factor in generating value from content. A universal, auto-generated identifier would simplify the management, distribution, tracking and licensing of content. A free and open source solution to create, manage, and integrate such an identifier would also support collaboration and make content interoperable between different parties. This would considerably speed up many processes throughout the content industry.

The current absence of interoperable identifiers is a particular problem in online journalism. Without a common standard, there is no easy way to identify and track content across market participants.

Increased efficiency and speed are becoming more and more important because old business models such as selling advertisements are becoming less profitable, and current processes in journalism are often manual and therefore costly.

In addition, more than one party is usually involved in the production, distribution and licensing of content. Without an interoperable identifier, it is often hard to track and manage all the steps, e.g. tracking usage or proving who contributed what, and when.

Based on the strengths and weaknesses of existing identifiers for different media types, we will define technical and commercial requirements for a new universal identifier.

Introduction

Currently there are several identifiers for different types of content in use. Each identifier is usually only suitable for a specific purpose and cannot be used in a generic way. Also most identifiers are not interoperable.
In the following sections, different content identifiers are listed and examined with regard to their field of use. (Identifiers for persons and non-content entities are not discussed in this paper, as they are outside the scope of the project’s activities.)

Text / Books and Articles

ISBN – International Standard Book Number

Subject	Books / text-based monographic publications
Details	New ISBN for every edition, format or version of a book (except reprints)
Issuer / Issuing Process	National ISBN Agency issues different sizes of blocks of ISBNs to a publisher. There is no requirement for the publisher to provide any metadata about the publications identified.
Standard	ISO 2108
Price per ID	Varies by issuing country/territory and the quantity of IDs issued. Starting at $125 for one ID; larger quantities are cheaper per ID.
Format	Fixed length, 13 digits divided into: Prefix (3 digits), Registrant Group Element (1 to 5 digits), Registrant Element (up to 7 digits), Publication Element (up to 6 digits), Check digit (1 digit)
Range / Capacity	Cannot be calculated exactly. Large registrant blocks were given out in the beginning and it is unclear if/how those are utilized.
Interoperability	Subset of the EAN-13 (since early 2000s)

Publishers receive their own variable-length prefix (Registrant Element) and an accompanying block of ISBNs within the Registrant Element from their national ISBN agency. Larger publishers receive a shorter prefix, as they need more ISBNs. As soon as a publisher has used up all its ISBNs, it can request another prefix/ISBN block. So by its prefix a book can (in theory) always be assigned to a specific publisher.

As every edition, format or version has to have its own ISBN, book stores, libraries, and warehouses can differentiate between them. But since one work usually has multiple ISBNs (e.g. one ISBN per edition), a work and its variations cannot be grouped by their ISBNs.
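
Because the ISBN is a subset of EAN-13, its last digit can be checked with the usual EAN-13 rule: the first twelve digits are weighted alternately with 1 and 3, and the check digit brings the total up to a multiple of 10. The following is a minimal sketch of that rule; the function names are our own and the sample numbers are only illustrative.

```python
def isbn13_check_digit(first12: str) -> int:
    """Return the EAN-13 check digit for the first 12 digits of an ISBN."""
    if len(first12) != 12 or not first12.isdigit():
        raise ValueError("expected exactly 12 digits")
    # Digits at even positions (0-based) get weight 1, the others weight 3.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def is_valid_isbn13(isbn: str) -> bool:
    """Check a 13-digit ISBN; hyphens and spaces are ignored."""
    compact = isbn.replace("-", "").replace(" ", "")
    if len(compact) != 13 or not compact.isdigit():
        return False
    return isbn13_check_digit(compact[:12]) == int(compact[12])


if __name__ == "__main__":
    print(is_valid_isbn13("978-3-16-148410-0"))
```

The check digit only protects against typing errors. It says nothing about whether the number has actually been assigned to a publication.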

BICI - Book Item and Component Identifier

The BICI is supposed to extend the ISBN by providing unique identifiers for a work’s individual parts (e.g. chapters). It has a fixed structure but a variable length. It is intended to be human-readable, and most information is included in the identifier in abbreviated form. Unfortunately, this practice leads to very long identifiers that are hard for humans to read.
There is a draft for a BICI standard by NISO (the United States National Information Standards Organization) from 2000, but the approach does not appear to have been pursued since then.

ISTC - International Standard Text Code

Subject	Books / text-based monographic publications
Details	Identifies a work regardless of editions and versions
Issuer / Issuing Process	A request has to be sent to one of the ISTC registration agencies
Standard	ISO 21047
Price per ID	No Charge (until further notice)
Format	Fixed length, 16 characters (some hexadecimal): Registration number (3 digits), Year of registration (4 digits, year in human-readable form), Publication (8 hexadecimal digits), Check digit (1 digit)
Range / Capacity	ca. 4 billion
Interoperability	–

In contrast to the ISBN, the ISTC identifies a work and not a version of a work. It therefore serves a different purpose and cannot be used in warehouses or similar contexts. The ISTC and the ISBN are not interoperable, but when used together they complement each other.

ISSN - International Standard Serial Number

Subject	Serials
Details	Identifies serials and other continuing resources, in the electronic and print world
Issuer / Issuing Process	Registration at a national ISSN centre
Standard	ISO 3297
Price per ID	free
Format	Fixed length, 8 digits: Code (7 digits) Check digit (1 digit)
Range / Capacity	~10 million
Interoperability	can be transformed into EAN

The ISSN identifies serial publications such as newspapers, journals, magazines and other periodicals. It only identifies the serial itself and does not encode any information about the serial or its context. It also does not identify individual issues/editions. A central register of ISSNs is available. The complete dataset can be purchased for € 19,768.
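
The eighth position of an ISSN is a check character computed modulo 11 from the seven preceding digits, weighted 8 down to 2. A result of 10 is written as "X", so the last position is not always a digit. The sketch below illustrates the rule; the function names are our own and the sample numbers are only illustrative.

```python
def issn_check_character(first7: str) -> str:
    """Return the check character for the first 7 digits of an ISSN."""
    if len(first7) != 7 or not first7.isdigit():
        raise ValueError("expected exactly 7 digits")
    # Weights run from 8 (first digit) down to 2 (seventh digit).
    total = sum(int(d) * w for d, w in zip(first7, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def is_valid_issn(issn: str) -> bool:
    """Check an ISSN written with or without the hyphen."""
    compact = issn.replace("-", "").upper()
    if len(compact) != 8 or not compact[:7].isdigit():
        return False
    return issn_check_character(compact[:7]) == compact[7]


if __name__ == "__main__":
    print(is_valid_issn("0378-5955"))
    print(is_valid_issn("2434-561X"))
```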

SICI - Serial Item and Component Identifier

Subject	Articles in Serials
Details	Identifies individual parts / articles in a serial with a location indication
Issuer / Issuing Process	self issued
Standard	Standard by NISO
Price per ID	free
Format	variable length, extremely long
Range / Capacity	theoretically unlimited
Interoperability	Supposed to be compatible with DOI

The SICI extends the ISSN with information about the individual articles in a serial. The ISSN is one part of the SICI, so every article can be assigned to the serial it is published in. The SICI is useful for specifying more precisely what an ISSN refers to, but, similar to the BICI, its structure is very complex.

DOI - Digital Object Identifier

Subject	Any kind of Object (physical, digital or abstract)
Details	Identifies any kind of Object that needs to be managed or tracked. There are no rules regarding assignment of a new identifier if an identified Object or its metadata changes. It is widely adopted for scientific articles.
Issuer / Issuing Process	Organisation numbers issued by registration agencies. The individual DOI issuing process varies by registration agency.
Standard	ISO 26324
Price per ID	Varies by agency and service (example: mEDRA)
Format	Variable length: 10.ORGANISATION/ID, consisting of 10 (always 10), Organisation (e.g. publisher) and ID for object
Range / Capacity	unlimited
Interoperability	The overarching and generic design of the DOI system can assimilate/integrate other existing identifiers systems. For example the ISBN-A is an integration of the ISBN into the DOI system.

The DOI is a generic digital identifier for any kind of object at arbitrary levels of granularity, e.g. a book, a page in a book, or a sentence in a book. A DOI is meant to be permanent and “resolvable” to mutable information about the identified object.
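
The 10.ORGANISATION/ID layout described in the table can be split mechanically at the first slash and the first dot. The following is a minimal sketch under that assumption; the `Doi` type, its field names and `parse_doi` are hypothetical, and the sample DOI is only illustrative.

```python
from typing import NamedTuple


class Doi(NamedTuple):
    directory: str     # always "10"
    organisation: str  # number issued by a registration agency
    object_id: str     # chosen by the organisation, may contain further slashes


def parse_doi(doi: str) -> Doi:
    """Split a DOI of the form 10.ORGANISATION/ID into its parts."""
    prefix, slash, object_id = doi.partition("/")
    if not slash or not object_id:
        raise ValueError("a DOI needs a '/' followed by an object ID")
    directory, dot, organisation = prefix.partition(".")
    if directory != "10" or not dot or not organisation:
        raise ValueError("a DOI prefix has the form 10.ORGANISATION")
    return Doi(directory, organisation, object_id)


if __name__ == "__main__":
    print(parse_doi("10.1000/182"))
```

Only the first slash separates the organisation from the object ID, which is what allows the ID part to have any internal structure the organisation likes.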

Audio / Music

ISWC - International Standard Musical Work Code

Subject	Musical Works
Details	Identifies musical works but not unique recordings or notations
Issuer / Issuing Process	local organizations issue ISWC (e.g. in Germany GEMA)
Standard	ISO 15707
Price per ID	free
Format	Fixed length, 11 characters: Prefix (1 character, always "T"), Code (9 digits), Check digit (1 digit)
Range / Capacity	1 billion
Interoperability	–

ISMN - International Standard Music Number

Subject	Music
Details	Identifies Musical Notations (digital and print)
Issuer / Issuing Process	Issued by national agencies
Standard	ISO 10957
Price per ID	varies
Format	Fixed length, 13 digits: Prefix (4 digits, always 979-0), Publisher (3 to 7 digits), Publication (1 to 5 digits), Check digit (1 digit)
Range / Capacity	cannot be calculated
Interoperability	Same format as ISBN, so it can be used as EAN

ISRC - International Standard Recording Code

Subject	Music
Details	Identifies Musical Recordings for licensing
Issuer / Issuing Process	IDs can be ordered online
Standard	ISO 3901
Price per ID	starting at about 50 € for one ID, larger quantities are cheaper
Format	Fixed length, 12 characters: Country Code (2 letters), Publisher Code (3 alphanumeric characters), Year (2 digits), Code (5 digits, ascending number)
Range / Capacity	up to 100 million
Interoperability	EAN for CDs etc. can be ordered with ISRC
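
The fixed ISRC layout lends itself to a simple pattern match. The sketch below assumes that the publisher code may contain letters as well as digits and that hyphens are only a display convention; the `Isrc` type, its field names and the sample code are hypothetical.

```python
import re
from typing import NamedTuple

# Country (2 letters), publisher (3 alphanumeric), year (2 digits), code (5 digits).
ISRC_PATTERN = re.compile(r"^([A-Z]{2})([A-Z0-9]{3})(\d{2})(\d{5})$")


class Isrc(NamedTuple):
    country: str
    publisher: str
    year: str
    code: str


def parse_isrc(text: str) -> Isrc:
    """Split an ISRC into its four parts; hyphens are ignored."""
    compact = text.replace("-", "").upper()
    match = ISRC_PATTERN.match(compact)
    if match is None:
        raise ValueError("not a well-formed ISRC: " + text)
    return Isrc(*match.groups())


if __name__ == "__main__":
    print(parse_isrc("US-S1Z-99-00001"))
```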

Each of the three identifiers serves a slightly different purpose, but as they are not interoperable, distinguishing between them is rather complicated and confusing. It would probably be more effective to have one identifier for musical works that could be extended to refer to a certain notation or recording.

Videos

ISAN - International Standard Audiovisual Number

Subject	Audiovisual Content (e.g. videos)
Details	Information in the ID is only about the content and not about the publisher
Issuer / Issuing Process	issued by local ISAN agencies, offline registration seems to be necessary
Standard	ISO 15706
Price per ID	Varies (16€ in Germany)
Format	Fixed length, 12 bytes (usually written in hex): Root (48 bits), Part (16 bits), Version (32 bits)
Range / Capacity	about 30 trillion + different versions and episodes
Interoperability	–

The ISAN is clearly focused on audiovisual works (e.g. movies and series). This is reflected in the ID which has an element to represent the number of the episode or sequel.
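
The three fields from the table add up to 96 bits, i.e. the 12 bytes mentioned there. The following sketch shows how such fields could be packed into and unpacked from a 12-byte value. It covers only the root, part and version fields as listed above; the type and function names are hypothetical and the sample values are arbitrary.

```python
from typing import NamedTuple

ROOT_BITS, PART_BITS, VERSION_BITS = 48, 16, 32


class IsanFields(NamedTuple):
    root: int     # identifies the work
    part: int     # e.g. episode number
    version: int  # identifies a version of the work or episode


def pack_isan(fields: IsanFields) -> bytes:
    """Pack the three fields into 12 bytes, most significant field first."""
    limits = (ROOT_BITS, PART_BITS, VERSION_BITS)
    for value, bits in zip(fields, limits):
        if not 0 <= value < (1 << bits):
            raise ValueError("field value does not fit into its bit width")
    value = (
        (fields.root << (PART_BITS + VERSION_BITS))
        | (fields.part << VERSION_BITS)
        | fields.version
    )
    return value.to_bytes(12, "big")


def unpack_isan(raw: bytes) -> IsanFields:
    """Inverse of pack_isan."""
    if len(raw) != 12:
        raise ValueError("expected exactly 12 bytes")
    value = int.from_bytes(raw, "big")
    version = value & ((1 << VERSION_BITS) - 1)
    part = (value >> VERSION_BITS) & ((1 << PART_BITS) - 1)
    root = value >> (PART_BITS + VERSION_BITS)
    return IsanFields(root, part, version)


if __name__ == "__main__":
    raw = pack_isan(IsanFields(root=0x3BAB9352, part=1, version=0))
    print(raw.hex(), unpack_isan(raw))
```

Two episodes of the same series would share the root and differ only in the part field, which is how the structure makes related works groupable.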

YouTube Content-ID

The YouTube Content-ID enables copyright owners to detect their content in YouTube videos. Copyright owners can upload their content to the Content-ID database and uploaded videos will be checked against it. If a copyright violation has been detected, the copyright owner can decide whether to block the video or to place ads in order to generate revenues from the video. The Content-ID is proprietary and thus cannot be used for detecting copyright violations outside of YouTube.

Strengths and weaknesses of existing identifiers

Database structure

The necessity of a central agency for maintaining databases and issuing identifiers seems to be one of the main problems of almost all existing identifiers. Although some agencies claim to issue identifiers very quickly, it usually takes at least a few hours to issue an identifier, not to mention the time spent collecting and providing the data that the agency requires. In addition, most agencies charge a fee for issuing an ID.

This works fine for publishers who only need a few hundred IDs per year, but in fields with large amounts of daily content, such as online journalism, the process of generating IDs should be free and automated.

At this point another problem with agencies arises: especially when it comes to large amounts of data, maintaining servers is very costly, so the operators need to charge a fee.

Still, it is important to have a global database and a universal identifier for all types of digital content in order to be able to identify content globally and not just within a company or a certain region. Having a universal identifier simplifies interchanging content between different companies. A maintenance-free global database could be achieved by storing the identifiers in a decentralized blockchain. In doing so, the database would always be available for everyone to add or read identifiers. Once submitted to the blockchain, identifiers cannot be changed, meaning that no one would be able to manipulate the data.
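
The immutability argument rests on each entry being linked to its predecessor by a hash, so that changing an old entry invalidates everything after it. The sketch below shows only this linking idea as a local, single-user list. It is a strong simplification and not a blockchain: there is no network, no consensus and no signatures, and all names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(identifier: str, previous_hash: str) -> str:
    """Hash an identifier together with the hash of the preceding entry."""
    payload = json.dumps({"id": identifier, "prev": previous_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(chain: list, identifier: str) -> None:
    """Add an identifier to the end of the chain."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "id": identifier,
        "prev": previous_hash,
        "hash": entry_hash(identifier, previous_hash),
    })


def is_intact(chain: list) -> bool:
    """Recompute every link; any modified entry breaks the chain."""
    previous_hash = GENESIS
    for entry in chain:
        if entry["prev"] != previous_hash:
            return False
        if entry["hash"] != entry_hash(entry["id"], entry["prev"]):
            return False
        previous_hash = entry["hash"]
    return True


if __name__ == "__main__":
    chain = []
    for identifier in ("id-a", "id-b", "id-c"):
        append(chain, identifier)
    print(is_intact(chain))
    chain[1]["id"] = "forged"
    print(is_intact(chain))
```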

Identifier structure

When it comes to the structure of identifiers, two basic types can be distinguished:
The first basic type is usually an incremental number, sometimes extended by a check digit. The identifier itself has no internal structure and just serves as a primary key in databases. To issue an identifier, an authoritative global database has to be checked so that the recipient receives an identifier that has not already been assigned.
The second basic type is a number that is separated into different parts, which makes it possible to group different IDs with e.g. the same prefix. Examples are the ISBN and the SICI. The ISBN has a rather coarse granularity and only distinguishes between the region, the publisher and the publication, while the SICI has a very fine granularity, so that even the page on which an article is published has its own representation within the ID. In general, fine granularity is useful because a lot of information can be gathered from the ID itself, but it can only be applied to a narrowly specified type of content.

As the amount of data stored on the blockchain should be as small as possible, it is important that the identifier is divided into meaningful parts that are generic enough to be applied to many different kinds of content (good: publisher, edition…; bad: page, length…).

Even though some IDs distinguish between different elements, those elements usually work as a primary key like the first basic type of identifier. So a central database with additional information about the content and a query whether an element is already assigned is still needed in those cases.

To create a new identifier, it should not be necessary to contact a centralized service in order to receive an ID that is not yet assigned. It should be possible to self-issue new and unique identifiers from the content itself. This can be achieved by generating the identifier from hashes.
Cryptographic hash functions are designed so that collisions are extremely unlikely and so that even small differences in the input produce very different hashes.
Locality-sensitive hash functions generate similar hashes for similar input. This means that similarities in the input are preserved in the hashes. For a unique but comparable identifier, a mixture of both types of hash functions is useful: for basic metadata and the actual content a locality-sensitive hash function could be used, as similarities should be detectable, either to group similar content or to detect possible copyright violations. Still, even similar content should be identified uniquely, so at least one element of the ID should be generated by a cryptographic hash function.
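
The difference between the two kinds of hash function can be made concrete with a small experiment. The sketch below compares SHA-256 with SimHash, one well-known locality-sensitive scheme that we pick here purely for illustration; the text above does not prescribe a particular function. The word-level tokenisation, the 64-bit size and all function names are assumptions.

```python
import hashlib


def sha256_int(text: str) -> int:
    """Cryptographic hash of the text as a 256-bit integer."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest(), "big")


def simhash64(text: str) -> int:
    """64-bit SimHash over lower-cased, whitespace-separated words."""
    counts = [0] * 64
    for token in text.lower().split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        token_hash = int.from_bytes(digest, "big")
        # Every word votes +1 or -1 on each of the 64 bit positions.
        for bit in range(64):
            counts[bit] += 1 if (token_hash >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    original = ("the city council approved the new budget for public transport "
                "on monday evening after a long and heated debate")
    edited = original.replace("monday", "tuesday")
    print("SHA-256 bits changed:", hamming(sha256_int(original), sha256_int(edited)), "of 256")
    print("SimHash bits changed:", hamming(simhash64(original), simhash64(edited)), "of 64")
```

For a one-word edit, roughly half of the SHA-256 bits can be expected to change, while only a small share of the SimHash bits should flip. This is the property that would make the locality-sensitive part of an identifier comparable.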

If the identifier is composed of the results of several different hash functions, the same content always maps to the same identifier, because hash functions are deterministic. The identifier itself does not reveal anything about the content unless it is compared with other identifiers, but its correctness can be verified by re-hashing the original content. This means that identifiers can be tracked publicly and separately from the actual content while preserving verifiability. Anyone can compare content with an identifier and see whether there are similarities. What exactly those similarities consist of, however, can only be found out by comparing the raw content, e.g. by the copyright holder of a music video who has discovered a video with a similar ID.
By publishing the identifier to a blockchain, the owner of the content also creates an indication that he or she was the first to possess this content.
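
How a similarity component and a uniqueness component might be combined, and how an identifier can be verified by re-hashing, is sketched below. This is not a proposed format: the two-part layout, the 64-bit sizes, the use of SimHash and truncated SHA-256, and all function names are assumptions made for the example.

```python
import hashlib


def similarity_part(text: str) -> int:
    """Locality-sensitive component: 64-bit SimHash over words."""
    counts = [0] * 64
    for token in text.lower().split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        token_hash = int.from_bytes(digest, "big")
        for bit in range(64):
            counts[bit] += 1 if (token_hash >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)


def unique_part(text: str) -> int:
    """Cryptographic component: first 64 bits of the SHA-256 digest."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")


def make_identifier(text: str) -> str:
    """Derive the identifier from the content alone, without a registry."""
    return f"{similarity_part(text):016x}-{unique_part(text):016x}"


def verify(identifier: str, text: str) -> bool:
    """Re-hash the content and compare with the claimed identifier."""
    return identifier == make_identifier(text)


def similarity_distance(id_a: str, id_b: str) -> int:
    """Hamming distance between the similarity parts of two identifiers."""
    a = int(id_a.split("-")[0], 16)
    b = int(id_b.split("-")[0], 16)
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    article = "parliament passed the data protection bill late on thursday"
    identifier = make_identifier(article)
    print(identifier, verify(identifier, article))
    revised = article.replace("thursday", "friday")
    print(similarity_distance(identifier, make_identifier(revised)))
```

Comparing two identifiers needs only the identifiers themselves, while `verify` needs the raw content. This mirrors the separation between public tracking and private content described above.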

Conclusion

The existing identifiers are good for the purposes they serve, but they have certain limits. The main issue is that each of them can only be used in a very narrow field, either because of its structure, because it is not an open standard, or because it is simply too costly for the vast amount of existing content.
There is a need for a new identifier that is applicable to different types of digital content. It should be possible to auto-generate such identifiers and to make them publicly available with little effort.

An efficient way to achieve this is to generate identifiers with a mixture of different hash functions and to store them on a maintenance-free public blockchain.