Metadata Snapshots and the XML and JSON APIs

Description	Creation and distribution of snapshots.
Related services	UNIXSD View Member Info System
Tags	content-types snapshots ref-pref oai-pmh crm-items plus member-info
Products	metadata-retrieval

Metadata Snapshots are a counterpart to our XML and JSON APIs. They are designed to give users access to the same selection of metadata as our REST APIs, but in bulk. Snapshots are generated every month. They are available exclusively to Plus users, through an endpoint on the api.crossref.org subdomain (though not via the Cayenne REST API service). There are two distinct snapshot types, one for XML and one for JSON. These are nominally equivalent, but are produced via different routes.

Data Flow

See the full Crossref services diagram for all the pieces.

graph TD
linkStyle default interpolate basis
member-info["Member Info Data / Service"];
xml-automation["XML Automation Script"];
unixsd-view["UniXSDView"];
cddb["CDDB"];
bet["Bulk Extractor Tool"];
cayenne-public-api["Cayenne Public API"];
solr["Cayenne SOLR"];
cayenne-indexer["Cayenne Indexer"];
json-automation["JSON Automation Script"];
cayenne-dump["Cayenne Dump"];
s3["S3 storage"];
serve["Snapshots service"];
oai["OAI-PMH Tables"];
oai-apis["OAI APIs"];
pusher["Pusher"];
member-info --> bet;
oai --> oai-apis;
oai --> bet;
cddb --> oai-apis;
cddb --> bet;
unixsd-view --> bet;
bet --> xml-automation;
member-info --> pusher;
member-info --> cayenne-data;
member-info --> cayenne-public-api;
oai --> pusher;
cddb --> pusher;
unixsd-view --> pusher;
pusher --> cayenne-indexer;
cayenne-indexer --> solr;
solr --> cayenne-public-api;
solr --> cayenne-dump;
member-info --> cayenne-dump;
cayenne-dump --> json-automation;
xml-automation --> s3;
json-automation --> s3;
s3 --> serve;

The JSON and XML snapshots both begin at a point of commonality: the data services designed to support OAI-PMH. These include:

the OAI-PMH tables, which list all of the prefixes, the containers (journals, books, series), and the items (articles, chapters) that make up those containers
the MemberInfo structure, which records whether references are open or closed for each prefix
the CDDB (Central DOI Database), which stores the XML blob for each Item

Between these three sources it is possible to create a list of containers and items, retrieve the XML for each one, and send it somewhere. From here, the data is sent in two directions.

It is queried by the Bulk Extraction Tool to produce XML snapshots.
It is queried by the Pusher, which sends it into the Cayenne REST API, from which JSON snapshots are produced.

Creating XML Snapshots

XML Snapshots are created using the Bulk Extractor Tool. This tool follows a similar process to the OAI-PMH APIs:

It expects to find the Member Info file locally.
It iterates over three Publication Types: J (Journal), B (Book) and S (Series).
For each Publication Type:
1. Create a directory in the snapshot output called J, B or S. Files are placed in this directory.
2. Retrieve the list of all Publications (Journals, Series, Books etc).
3. For each Publication:
  1. Create an output file with name ListRecords_<TYPE>_<OWNER_PREFIX>_<PUBLICATION_ID>.xml, where TYPE is J, B or S, OWNER_PREFIX is the full prefix assigned to the owner of the Publication, and PUBLICATION_ID is the internal ID assigned to the Publication.
  2. Retrieve all “Citation IDs” (i.e. items).
  3. Write the item for each Citation ID as a <record> in the output XML stream. UNIXSD View is used to format the data.

UNIXSD View is responsible for creating the XML output, including removing the reference lists of items whose references are not open. UNIXSD View is also used in other contexts, where it renders the XML returned in responses to queries.

The result of the above is that there will be one file per Publication in the snapshot directory.
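
The directory and file layout described above can be sketched as follows. This is only an illustration of the naming scheme, not the Bulk Extractor Tool itself: the `publications` input, the `render_record` callback (a stand-in for UNIXSD View) and the `<ListRecords>` wrapper element are all assumptions made for the example.

```python
from pathlib import Path

PUBLICATION_TYPES = ("J", "B", "S")  # Journal, Book, Series


def output_path(root, pub_type, owner_prefix, publication_id):
    """Build <root>/<TYPE>/ListRecords_<TYPE>_<OWNER_PREFIX>_<PUBLICATION_ID>.xml."""
    if pub_type not in PUBLICATION_TYPES:
        raise ValueError(f"unknown publication type: {pub_type!r}")
    name = f"ListRecords_{pub_type}_{owner_prefix}_{publication_id}.xml"
    return Path(root) / pub_type / name


def write_xml_snapshot(root, publications, render_record):
    """Write one file per Publication and return the paths written.

    publications: hypothetical mapping of type -> list of
        (owner_prefix, publication_id, citation_ids) tuples.
    render_record: hypothetical stand-in for UNIXSD View; takes a
        Citation ID and returns the XML string for that item.
    """
    written = []
    for pub_type in PUBLICATION_TYPES:
        # One directory per Publication Type, created even if it ends up empty.
        (Path(root) / pub_type).mkdir(parents=True, exist_ok=True)
        for owner_prefix, publication_id, citation_ids in publications.get(pub_type, []):
            path = output_path(root, pub_type, owner_prefix, publication_id)
            with path.open("w", encoding="utf-8") as out:
                out.write("<ListRecords>\n")  # wrapper element is an assumption
                for citation_id in citation_ids:
                    out.write(f"<record>{render_record(citation_id)}</record>\n")
                out.write("</ListRecords>\n")
            written.append(path)
    return written
```

For example, a journal with internal ID 42 owned by prefix 10.5555 would be written to J/ListRecords_J_10.5555_42.xml.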

Creating JSON Snapshots

JSON Snapshots are currently produced by Cayenne Dump. This iterates over the SOLR database and builds JSON documents of all content types. The code is equivalent to that which serves the REST API, although data does not pass through the REST API serving code. It does use the SOLR database that is constructed by the API indexer though.

The future implementation for the Elasticsearch API will communicate directly with the API. This will ensure that the contents of the snapshot and of the public API are directly equivalent.

Both methods follow this process:

Iterate over the Works metadata, i.e. that which is present at the /v1/works endpoint.
Partition results into chunks of a few thousand, for example, 3000.
For each chunk:
1. Produce a JSON file which contains a top-level object {"items": [ ... ]} and within it a sequence of Work metadata objects.
2. Give the file a name of <NUMBER>.json. The number is not significant.

The ordering of items within the snapshot is not significant.
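
The chunking process above can be sketched as below. It is a minimal illustration, assuming the Works are available as an iterable of dictionaries; the function names and the choice of sequential file numbers are assumptions of the example, not a description of Cayenne Dump.

```python
import json
from itertools import islice
from pathlib import Path


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def write_json_snapshot(works, out_dir, chunk_size=3000):
    """Write each chunk as <NUMBER>.json containing {"items": [...]}."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, chunk in enumerate(chunked(works, chunk_size)):
        path = out_dir / f"{number}.json"  # the number carries no meaning
        with path.open("w", encoding="utf-8") as out:
            json.dump({"items": chunk}, out)
        paths.append(path)
    return paths
```

Because the input is consumed lazily, only one chunk of Works is held in memory at a time.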

Note that, in keeping with the rest of the API, full reference lists are indexed by the REST API. They are then removed by the REST API (and by cayenne-data) according to Member Info data, which is derived from the Member Info file. The reference-display behaviour of the two data sources should be identical, but because they rely on replicated data it’s possible for them to drift out of sync.

Deliverables

Scripts are expected to be run in the month after their nominal date, e.g. the “January” snapshot runs in February.

The two runner scripts construct a year-month from the previous month, expressed as YYYY/MM. The two snapshot directories are compressed into Gzip Tar archives with the names:

all.json.tar.gz
all.xml.tar.gz

and saved in S3 for later retrieval at:

s3://org.crossref.snapshots/monthly/<DATE>/all.json.tar.gz
s3://org.crossref.snapshots/monthly/<DATE>/all.xml.tar.gz
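
As a sketch of how the storage locations might be derived, the following computes the previous month as YYYY/MM and builds the two S3 locations from it. It assumes that the <DATE> placeholder is exactly that YYYY/MM string; the function names are hypothetical.

```python
from datetime import date

BUCKET = "org.crossref.snapshots"
ARCHIVE_NAMES = ("all.json.tar.gz", "all.xml.tar.gz")


def snapshot_month(run_date):
    """Return the month before `run_date`, formatted as YYYY/MM."""
    year, month = run_date.year, run_date.month - 1
    if month == 0:  # a January run produces the previous December's snapshot
        year, month = year - 1, 12
    return f"{year:04d}/{month:02d}"


def snapshot_locations(run_date):
    """Build the S3 locations, assuming <DATE> is the YYYY/MM value."""
    month = snapshot_month(run_date)
    return [f"s3://{BUCKET}/monthly/{month}/{name}" for name in ARCHIVE_NAMES]
```

A run on any day in February 2024 would therefore target monthly/2024/01/ under this assumption.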

The Snapshots controller allows hierarchical browsing of this structure as HTML.

There is also a ‘latest’ URL for each snapshot type, which returns the most recent snapshot available for that type.

Content

References are included in XML and JSON only if they are open or limited.
The set of Work content types available in the REST API is available in the JSON snapshot (see Content Types).
The J, B and S subset of content types available in OAI-PMH is available in the XML snapshot. Note that this does not include all content types, e.g. Grants.
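
The reference rule above can be illustrated with a small filter. This is a sketch under several assumptions: that a Work is a dictionary with "prefix" and "reference" keys, that the Member Info data can be reduced to a mapping from prefix to a visibility string, and that an unknown prefix is treated as closed. None of these details are confirmed by the services themselves.

```python
INCLUDED_VISIBILITIES = {"open", "limited"}


def apply_reference_visibility(work, visibility_by_prefix):
    """Return the work, dropping its reference list unless it is open or limited.

    work: hypothetical Work dictionary with "prefix" and "reference" keys.
    visibility_by_prefix: hypothetical mapping derived from Member Info data,
        e.g. {"10.5555": "open"}.
    """
    # Assumption: prefixes missing from the mapping are treated as closed.
    visibility = visibility_by_prefix.get(work.get("prefix"), "closed")
    if visibility in INCLUDED_VISIBILITIES:
        return work
    # Copy the work without its references rather than mutating the input.
    return {key: value for key, value in work.items() if key != "reference"}
```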

Authorization

The Snapshots are only available to Plus users. Tokens are validated and the user is redirected to a signed S3 URL.

Validation

The scripts are expected to run a simple size comparison check between the current and the previous XML and JSON Snapshots. If the previous Snapshot files are equal to or larger than the current ones, a failure message is sent to the Metadata-Plus channel (Slack) and by email.
Manual error and exception checking against the snapshot logs (future automation task).
Manual comparison of the number of Citation IDs between the snapshot and the API or the citation_info table in the DB (future automation task).
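
The size comparison check could look something like the sketch below. It assumes the previous and current archives are available as local files in two directories, which may not match how the real scripts read them; sending the resulting messages to Slack or email is left out.

```python
from pathlib import Path

ARCHIVE_NAMES = ("all.json.tar.gz", "all.xml.tar.gz")


def size_check_failures(previous_dir, current_dir, names=ARCHIVE_NAMES):
    """Return one message per archive that failed to grow since last month."""
    failures = []
    for name in names:
        previous_size = (Path(previous_dir) / name).stat().st_size
        current_size = (Path(current_dir) / name).stat().st_size
        # The check fails when the previous file is equal to or larger than the current one.
        if previous_size >= current_size:
            failures.append(
                f"{name}: previous snapshot ({previous_size} bytes) is equal to "
                f"or larger than current ({current_size} bytes)"
            )
    return failures
```

An empty list means both archives grew and the check passed.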