In this post, we give a quick summary of the tests we have run so far during the development of the ISCC. Each component of the ISCC will get its own round of test runs to ensure that the concept works smoothly by the end of the development phase.
The first component of the ISCC is the Meta-ID. It is generated from the content’s metadata and is meant to make content easy to locate. At the same time, it should help to group various editions and formats of digital content and avoid data duplication. To test our concepts regarding the ISCC’s Meta-ID, we gathered metadata and the corresponding ISBNs from different open data sources and ran tests on these data sets. Our goals with these tests were:
|Supplier||Number of entries||Format||Url|
|Open Library||25 M||JSON||openlibrary.org/developers/dumps|
While parsing the data sources, we added only those records that contain a title, at least one author and an ISBN. Where a record has multiple authors, all of them are added, separated by “;”. ISBN-10s are converted into ISBN-13s.
Title, authors, ISBN and the name of the data source are stored; records with identical metadata are only stored once.
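The record filter and the ISBN conversion described above could be sketched roughly as follows. This is a minimal illustration, not our actual parser: the function names `isbn10_to_isbn13` and `to_record` are hypothetical, and we assume here that "identical metadata" means the whole (title, authors, ISBN, source) tuple is identical. The check digit calculation is the standard ISBN-13 one (prefix 978, weights alternating 1 and 3).

```python
def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 to an ISBN-13 (prefix 978, new check digit)."""
    digits = isbn10.replace("-", "").replace(" ", "")
    if len(digits) != 10:
        raise ValueError(f"not an ISBN-10: {isbn10!r}")
    core = "978" + digits[:9]  # drop the old ISBN-10 check digit
    # ISBN-13 check digit: weights alternate 1, 3 over the first 12 digits
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)


def to_record(title, authors, isbn, source):
    """Return a record tuple, or None if a required field is missing."""
    authors = [a.strip() for a in authors if a and a.strip()]
    if not title or not title.strip() or not authors or not isbn:
        return None
    isbn = isbn.replace("-", "").replace(" ", "")
    if len(isbn) == 10:
        isbn = isbn10_to_isbn13(isbn)
    # Multiple authors are joined with ";" as described in the text
    return (title.strip(), ";".join(authors), isbn, source)


# Assumption: a set is enough to store identical records only once
records = set()
record = to_record("Example Title", ["A. Author", "B. Author"], "0-306-40615-2", "demo-source")
if record is not None:
    records.add(record)
```

In this sketch, records without a title, author or ISBN simply yield `None` and are skipped by the caller.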
During parsing, we stored all eligible metadata in an Elasticsearch index.
After that, we generated the corresponding Meta-ID for each metadata entry from title and authors and stored it along with a link to the metadata entry in another index.
To test a variety of Meta-IDs, the Meta-ID index can be flushed at a later time to be refilled with Meta-IDs generated in a different way. This allows us to test different Meta-ID configurations without having to reparse the data sources for each test.
In the first test run, the entries were grouped by Meta-ID; only groups with entries coming from at least two different data sources were taken into consideration. We then counted the groups in which every entry is associated with the same ISBN, as well as the groups containing at least two entries with different ISBNs. Two entries with an identical Meta-ID should reference the same work, i.e. carry the same ISBN; we thus tried to maximize the first group, which we call “true negatives”, and minimize the second group, which we call “false negatives”.
In the second test run, the metadata was grouped by ISBN; as in the first run, only groups with entries coming from at least two different data sources were taken into consideration. We then counted the groups in which all corresponding Meta-IDs are identical, as well as the groups having at least two different Meta-IDs. As with the first run, we wanted to maximize the first group, which we call “true positives”, and minimize the second group, which we call “false positives”, as two entries with the same ISBN represent the same work and should therefore get the same Meta-ID.
The test results are stored in another Elasticsearch index. Additionally, all ISBNs and Meta-IDs with collisions are stored in .txt files, so that they can be traced back in Elasticsearch. Diagrams comparing the different test results can be created at any time.
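Both test runs follow the same pattern and differ only in which field is used for grouping and which for comparison. The following is a simplified in-memory sketch of that counting logic; our tests ran against Elasticsearch, and the function name `count_groups`, the field names and the sample entries are made up for illustration.

```python
from collections import defaultdict


def count_groups(entries, group_key, compare_key):
    """Group entries by group_key and count consistent vs. conflicting groups.

    Only groups with entries from at least two different sources are counted.
    Returns a tuple (consistent, conflicting).
    """
    groups = defaultdict(list)
    for entry in entries:
        groups[entry[group_key]].append(entry)

    consistent = conflicting = 0
    for members in groups.values():
        if len({m["source"] for m in members}) < 2:
            continue  # skip groups fed by a single data source
        if len({m[compare_key] for m in members}) == 1:
            consistent += 1  # all members agree on compare_key
        else:
            conflicting += 1  # at least two different values
    return consistent, conflicting


# Made-up sample data, only to show the call pattern
entries = [
    {"meta_id": "A", "isbn": "1", "source": "s1"},
    {"meta_id": "A", "isbn": "1", "source": "s2"},
    {"meta_id": "B", "isbn": "2", "source": "s1"},
    {"meta_id": "B", "isbn": "3", "source": "s2"},
    {"meta_id": "C", "isbn": "4", "source": "s1"},
]

# First run: group by Meta-ID, compare ISBNs ("true/false negatives")
first_run = count_groups(entries, "meta_id", "isbn")
# Second run: group by ISBN, compare Meta-IDs ("true/false positives")
second_run = count_groups(entries, "isbn", "meta_id")
```

The percentages reported below would then be each count divided by the number of groups that passed the two-source filter.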
During our tests we encountered several issues with regard to our results.
The data formats of the sources are fairly old, and in some of them the authors are given merely as IDs, so we had to retrieve them from a separate data file and link them to the records. We tried to speed up the parsing wherever possible, e.g. by keeping some of the authors along with their IDs in main memory. To further speed up our tests, we stored the metadata in the Elasticsearch backend. This way we do not have to parse the data again for every test run.
Works often receive a new ISBN for each new edition. Unfortunately, there is no book identifier that uniquely references a work across several editions. This is one of the problems we seek to solve with the ISCC, but it makes it difficult to determine to what extent the Meta-ID is up to this task. Lacking alternatives, we nonetheless used the ISBN as the identifier and accepted that the results would turn out worse because of it.
Some entries in the data sources themselves had corrupt encoding, which of course leads to incorrect test results. As these entries make up only a small percentage, and finding all entries with broken encoding would be time-consuming, if not impossible, we decided to accept this issue as well and put up with worse results.
During our tests we found that some ISBNs have been assigned multiple times, which of course further distorts the results. As these cases are rare, we accepted this issue, too.
Our initial tests primarily revealed a number of issues with our normalization. Our first run, with our standard bit length of 64 and a shingle size of 4, produced 70.34% false negatives and 32.95% false positives. We have thus improved the normalization in several ways:
Many entries had brackets within the title containing for example a year or some explanatory note regarding the edition.
For normalization purposes we thus added a regular expression that deletes the content inside round “(” and square “[” brackets (the removal of the brackets themselves was already part of the normalization process). This new normalization reduced our false negatives from 70.34% to 66.92%, but increased the false positives from 32.95% to 35.79%.
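A regular expression of this kind might look like the sketch below. The exact pattern we used is not shown here; this version is an assumption that handles non-nested brackets only, and the function name `strip_bracketed` is hypothetical.

```python
import re

# Matches a round or square bracket pair and everything inside it.
# Assumption: brackets are not nested.
BRACKETED = re.compile(r"\([^()]*\)|\[[^\[\]]*\]")


def strip_bracketed(title):
    """Delete bracketed content, then collapse the leftover whitespace."""
    without = BRACKETED.sub(" ", title)
    return " ".join(without.split())


cleaned = strip_bracketed("Faust (1808) [2nd edition]")
```

Replacing the match with a space rather than an empty string keeps neighbouring words from being fused together when a bracket sits in the middle of a title.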
In some data records, the title was followed by a colon or semicolon and then the subtitle, an explanatory note or even a short description of the book. We therefore added a normalization step that truncates the title after “:” or “;”. This normalization reduced the false negatives from 66.92% to 19.00%, i.e. to less than a third of their previous value.
Our fears that such a drastic normalization would lead to a rising share of false positives have proven unfounded. In absolute terms, the number of false positives rose by 54%. However, a large number of entries referencing the same works were now assigned the same Meta-ID, which is why the relative share of false positives also dropped, from 35.79% to 26.22%.
|Normalization||True positives||False positives||True negatives||False negatives|
|Truncating after : and ;||73.78%||26.22%||81.00%||19.00%|
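The truncation step itself is simple. A minimal sketch, assuming that the title is cut at the first colon or semicolon (the function name `truncate_title` is hypothetical):

```python
import re


def truncate_title(title):
    """Keep only the part of the title before the first ':' or ';'."""
    head = re.split(r"[:;]", title, maxsplit=1)[0]
    return head.strip()


short = truncate_title("Walden; or, Life in the Woods")
```

Note that a title starting with a colon or semicolon would be reduced to an empty string by this sketch; a real implementation would need to decide how to handle that case.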
As an alternative to truncating after a colon or semicolon, we could truncate after a certain number of characters. We saw from our metadata that the title field contains an average of 41 characters and the creators field an average of 18 characters. We are planning to do further tests with this method of truncating.
After the normalization work, we tested various combinations of bit length and shingle size. Interestingly, the tests showed that with the amount of test data available to us, these parameters make hardly any difference. As one would expect, our test with a bit length of 24 produced considerably worse results, but then again we had initially planned for a bit length of 48, 64 or 96 and a shingle size between 3 and 6.
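To make the two parameters more concrete, here is a generic similarity hash over character shingles. It is explicitly not the ISCC Meta-ID algorithm, only an illustration of what "shingle size" and "bit length" control; the choice of BLAKE2b as the underlying hash and all function names are our assumptions for this sketch.

```python
import hashlib


def shingles(text, size=4):
    """Return all overlapping character n-grams of the given size."""
    if len(text) <= size:
        return [text]
    return [text[i:i + size] for i in range(len(text) - size + 1)]


def similarity_hash(text, bits=64, shingle_size=4):
    """Generic simhash: each output bit is a majority vote over shingle hashes.

    bits must be a multiple of 8 (e.g. 24, 48, 64 or 96).
    """
    counts = [0] * bits
    for sh in shingles(text, shingle_size):
        digest = hashlib.blake2b(sh.encode("utf-8"), digest_size=bits // 8).digest()
        value = int.from_bytes(digest, "big")
        for bit in range(bits):
            counts[bit] += 1 if (value >> bit) & 1 else -1
    # Set an output bit where more shingle hashes had a 1 than a 0
    return sum(1 << bit for bit in range(bits) if counts[bit] > 0)


def hamming_distance(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


h1 = similarity_hash("the hobbit or there and back again")
h2 = similarity_hash("the hobbit, or there and back again")
distance = hamming_distance(h1, h2)
```

In a scheme like this, a shorter bit length leaves fewer distinct values for unrelated titles to spread over, which is consistent with the worse results we saw at 24 bits.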
For the next milestone we want to test truncating after a given number of characters and compare it to truncating after a colon or semicolon. Another option is to truncate after a dash, but dashes are frequently a normal part of the title.
In addition, we want to run tests for the Content-ID, Data-ID and Instance-ID, which will be a more difficult task, as it requires a large stock of texts and images.