Description: Creation and distribution of snapshots.
Metadata Snapshots are a counterpart to our XML and JSON APIs. They are designed to give users access to the same selection of metadata as our REST APIs, but in bulk. Snapshots are generated every month. They are available exclusively to Plus users through an endpoint on the api.crossref.org subdomain (though not via the Cayenne REST API service). There are two distinct snapshot types, one for XML and one for JSON. These are nominally equivalent, but are produced via different routes.
Data Flow
See the full Crossref services diagram for all the pieces.
graph TD
  linkStyle default interpolate basis
  member-info["Member Info Data / Service"];
  xml-automation["XML Automation Script"];
  unixsd-view["UniXSDView"];
  cddb["CDDB"];
  bet["Bulk Extractor Tool"];
  cayenne-public-api["Cayenne Public API"];
  solr["Cayenne SOLR"];
  cayenne-indexer["Cayenne Indexer"];
  json-automation["JSON Automation Script"];
  cayenne-dump["Cayenne Dump"];
  s3["S3 storage"];
  serve["Snapshots service"];
  oai["OAI-PMH Tables"];
  oai-apis["OAI APIs"];
  pusher["Pusher"];
  member-info --> bet;
  oai --> oai-apis;
  oai --> bet;
  cddb --> oai-apis;
  cddb --> bet;
  unixsd-view --> bet;
  bet --> xml-automation;
  member-info --> pusher;
  member-info --> cayenne-data;
  member-info --> cayenne-public-api;
  oai --> pusher;
  cddb --> pusher;
  unixsd-view --> pusher;
  pusher --> cayenne-indexer;
  cayenne-indexer --> solr;
  solr --> cayenne-public-api;
  solr --> cayenne-dump;
  member-info --> cayenne-dump;
  cayenne-dump --> json-automation;
  xml-automation --> s3;
  json-automation --> s3;
  s3 --> serve;
The JSON and XML snapshots both begin at a point of commonality: the data services designed to support OAI-PMH. These include:
- the OAI-PMH tables, which list all of the prefixes, containers (journals, books, series), and the items (articles, chapters) that comprise them.
- the MemberInfo structure, which records the open/closed references per prefix.
- the CDDB (Central DOI Database), which stores the XML blob for each Item.
Between these three sources it is possible to create a list of containers and items, retrieve the XML for each one, and send it somewhere. From here, the data is sent in two directions.
- It is queried by the Bulk Extraction Tool to produce XML snapshots.
- It is queried by the Pusher, which sends it into the Cayenne REST API, from which JSON snapshots are produced.
Creating XML Snapshots
XML Snapshots are created using the Bulk Extractor Tool. This tool follows a similar process to the OAI-PMH APIs:
- It expects to find the Member Info file locally.
- It iterates over three Publication types: J (Journal), B (Book), S (Series).
- For each Publication Type:
  - Create a directory in the snapshot output called J, B or S. Files are placed in this directory.
  - Retrieve the list of all Publications (Journals, Series, Books etc).
  - For each Publication:
    - Create an output file with name ListRecords_<TYPE>_<OWNER_PREFIX>_<PUBLICATION_ID>.xml. TYPE is J, B or S. OWNER_PREFIX is the full prefix assigned to the owner of the Publication. PUBLICATION_ID is the internal ID assigned to the Publication.
    - Retrieve all “Citation IDs” (i.e. items).
    - Write each Citation ID as a <record> in the output XML stream. The UnixsdView is used to format the data.
UNIXSD View is responsible for creating the XML output, including the removal of references that are not open. It is also used in other contexts to render the XML returned in responses to queries.
The result of the above is that there will be one file per Publication in the snapshot directory.
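The directory layout and file naming described above can be sketched as follows. This is only an illustration of the naming scheme, not the Bulk Extractor Tool's code: the function names and the sample prefix and IDs are hypothetical.

```python
from pathlib import Path

# The three Publication types iterated by the tool, per the steps above.
PUBLICATION_TYPES = {"J": "Journal", "B": "Book", "S": "Series"}


def output_path(root: Path, pub_type: str, owner_prefix: str, publication_id: int) -> Path:
    """Return the snapshot file path for one Publication."""
    if pub_type not in PUBLICATION_TYPES:
        raise ValueError(f"unknown publication type: {pub_type!r}")
    name = f"ListRecords_{pub_type}_{owner_prefix}_{publication_id}.xml"
    # Files are placed in a directory named after the Publication type.
    return root / pub_type / name


def plan_snapshot(root: Path, publications: dict) -> list:
    """List every output file: one per Publication, grouped by type."""
    paths = []
    for pub_type in PUBLICATION_TYPES:
        for owner_prefix, publication_id in publications.get(pub_type, []):
            paths.append(output_path(root, pub_type, owner_prefix, publication_id))
    return paths


# Hypothetical sample data: (owner prefix, internal Publication ID) per type.
sample = {"J": [("10.5555", 1234)], "B": [("10.5555", 77)]}
planned = plan_snapshot(Path("snapshot"), sample)
```

Writing the `<record>` elements themselves is left out, since that formatting is done by the UNIXSD View.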
Creating JSON Snapshots
JSON Snapshots are currently produced by Cayenne Dump. This iterates over the SOLR database and builds JSON documents of all content types. The code is equivalent to that which serves the REST API, although data does not pass through the REST API serving code. It does use the SOLR database that is constructed by the API indexer though.
The future implementation for the Elasticsearch API will communicate directly with the API. This will ensure that the contents of the snapshot are directly equivalent to those of the public API.
Both methods follow this process:
- Iterate over the Works metadata, i.e. that which is present at the /v1/works endpoint.
- Partition results into chunks of a few thousand, for example, 3000.
- For each chunk:
  - Produce a JSON file which contains a top-level object {"items": [ ... ]} and within it a sequence of Work metadata objects.
  - Give the file a name of <NUMBER>.json. The number is not significant.
The ordering of items within the snapshot is not significant.
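A minimal sketch of the chunking step, assuming the Works arrive as an iterable of plain dictionaries. The function names are hypothetical, and numbering the files from zero is an arbitrary choice, since the number is not significant.

```python
import json
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator


def chunked(items: Iterable[dict], size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def write_chunks(works: Iterable[dict], out_dir: Path, size: int = 3000) -> int:
    """Write each chunk as <NUMBER>.json and return the number of files."""
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for number, chunk in enumerate(chunked(works, size)):
        with open(out_dir / f"{number}.json", "w", encoding="utf-8") as fh:
            # Top-level object with the Work metadata objects under "items".
            json.dump({"items": chunk}, fh)
        count += 1
    return count
```

Because the input is consumed lazily, only one chunk is held in memory at a time.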
Note that, in accordance with the rest of the API, full reference lists are indexed by the REST API. They are removed by the REST API (and by cayenne-data) with reference to Member Info data, which is derived from the Member Info file. The reference-display behaviour of the two should be identical, but as they rely on replicated data it’s possible for them to drift out of sync.
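The reference-removal rule can be illustrated with a small filter. This is a sketch under several assumptions that are not confirmed by this page: that each Work carries `prefix` and `reference` keys, that Member Info can be reduced to a mapping from prefix to a visibility label, and that an unknown prefix is treated as closed.

```python
# Visibility labels that allow references to be shown (assumed labels).
VISIBLE = {"open", "limited"}


def strip_closed_references(work: dict, visibility_by_prefix: dict) -> dict:
    """Return the Work, without its reference list if the prefix is not visible."""
    # Assumption: a prefix missing from the mapping is treated as closed.
    visibility = visibility_by_prefix.get(work.get("prefix"), "closed")
    if visibility in VISIBLE:
        return work
    cleaned = dict(work)  # shallow copy so the indexed record is untouched
    cleaned.pop("reference", None)
    return cleaned
```

If the two systems apply a rule like this against different copies of the Member Info data, a stale copy on either side is enough to make their output differ.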
Deliverables
Scripts are expected to be run in the month after their nominal date, e.g. the “January” snapshot runs in February.
The two runner scripts construct a year-month string from the previous month, expressed as YYYY/MM. The two snapshot directories are compressed into gzipped tar archives with the names:
all.json.tar.gz
all.xml.tar.gz
and saved in S3 for later retrieval at:
s3://org.crossref.snapshots/monthly/<DATE>/all.json.tar.gz
s3://org.crossref.snapshots/monthly/<DATE>/all.xml.tar.gz
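The date handling can be sketched as below, assuming `<DATE>` in the S3 paths is the YYYY/MM string for the month before the run. The function names are hypothetical and this is not the runner scripts' code.

```python
from datetime import date

BUCKET = "org.crossref.snapshots"


def snapshot_date(run_day: date) -> str:
    """Return YYYY/MM for the month before the day the script runs."""
    year, month = run_day.year, run_day.month - 1
    if month == 0:
        # A run in January produces the previous December's snapshot.
        year, month = year - 1, 12
    return f"{year:04d}/{month:02d}"


def s3_locations(run_day: date) -> list:
    """Return the S3 locations of the JSON and XML archives for one run."""
    stamp = snapshot_date(run_day)
    return [
        f"s3://{BUCKET}/monthly/{stamp}/all.{kind}.tar.gz"
        for kind in ("json", "xml")
    ]
```

For example, a run on any day in February 2018 maps to `2018/01`, matching the "January snapshot runs in February" convention.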
The Snapshots controller allows hierarchical browsing of the structure as HTML, e.g.
- http://api.crossref.org/snapshots
- http://api.crossref.org/snapshots/monthly
- http://api.crossref.org/snapshots/monthly/2018
- http://api.crossref.org/snapshots/monthly/2018/04
- http://api.crossref.org/snapshots/monthly/2018/04/all.json.tar.gz
There is also a ‘latest’ URL for each type:
- http://api.crossref.org/snapshots/monthly/latest/all.xml.tar.gz
- http://api.crossref.org/snapshots/monthly/latest/all.json.tar.gz
This will return the most recent snapshot available for each type.
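Once downloaded, a JSON snapshot can be read without unpacking it to disk. The sketch below assumes the archive contains the `<NUMBER>.json` files described earlier, each with a top-level `items` array. The function name is hypothetical.

```python
import json
import tarfile
from typing import Iterator


def iter_works(archive_path: str) -> Iterator[dict]:
    """Yield each Work object from every .json member of the archive."""
    with tarfile.open(archive_path, mode="r:gz") as tar:
        for member in tar:
            # Skip directories and anything that is not a chunk file.
            if not member.isfile() or not member.name.endswith(".json"):
                continue
            fh = tar.extractfile(member)
            if fh is None:
                continue
            with fh:
                yield from json.load(fh)["items"]
```

Iterating member by member keeps memory use bounded by the size of one chunk file rather than the whole archive.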
Content
- References are included in XML and JSON only if they are open or limited.
- The set of Work content types available in the REST API are available in the JSON snapshot (see Content Types).
- The J, B and S subset of content types available in OAI-PMH is available in the XML snapshot (see OAI-PMH). Note that this does not include all content types, e.g. Grants.
Authorization
The Snapshots are only available to Plus users. Tokens are validated and the user is redirected to a signed S3 URL.
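From the client side, a token-bearing request might be built as in the sketch below. This page does not say how the token is sent: the `Crossref-Plus-API-Token` header with a `Bearer` value and the use of `https` are assumptions here, and the function name is hypothetical.

```python
from urllib.request import Request

BASE = "https://api.crossref.org/snapshots/monthly"


def snapshot_request(token: str, kind: str = "json", period: str = "latest") -> Request:
    """Build (but do not send) a request for one snapshot archive."""
    if kind not in ("json", "xml"):
        raise ValueError(f"unknown snapshot type: {kind!r}")
    url = f"{BASE}/{period}/all.{kind}.tar.gz"
    # Assumed header name and format for the Plus token.
    return Request(url, headers={"Crossref-Plus-API-Token": f"Bearer {token}"})
```

Passing the result to `urllib.request.urlopen` would follow the redirect to the signed S3 URL, since `urlopen` follows redirects by default.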
Validation
- The scripts are expected to run a simple size comparison check between the current and the previous XML and JSON Snapshots. If the previous Snapshot files are equal to or larger than the current ones, a failure error is sent to the Metadata-Plus channel (Slack) and by email.
- Manual error and exception checking against the snapshot logs (future automation task).
- Manual comparison of the number of citation IDs between the snapshot and the API or the citation_info table in the DB (future automation task).
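The size comparison in the first item can be sketched as below, assuming sizes are compared in bytes per snapshot type. The function name and message format are hypothetical, and delivery to Slack or email is left out.

```python
def check_snapshots(sizes: dict) -> list:
    """Return one failure message per snapshot that did not grow.

    `sizes` maps a snapshot type to (previous_bytes, current_bytes).
    """
    failures = []
    for kind, (previous, current) in sorted(sizes.items()):
        # A previous file that is equal or larger counts as a failure.
        if current <= previous:
            failures.append(
                f"{kind} snapshot did not grow: previous={previous} current={current}"
            )
    return failures
```

An empty list means both snapshots passed. Any returned messages are what would be forwarded to the notification channels.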