Design of XML Query System | Crossref Knowledge Base

Related services	Synchronous REST API Deposit v2 XML Deposit System
Tags	search author-search

The XML query system is a request-response service: the request is a metadata query whose criteria are encoded in XML, and the response is the metadata of one or more content items, also encoded in XML. Prior to the introduction of the REST API, it was commonly known and documented as “the metadata search.” For further usage information, see the XML API support document, XML API.

We have two interfaces to the XML query sub-system.

The Synchronous (or “direct”) interface enables the user to get an immediate response from the query request. This interface is handled by our “QS” deployments in addition to the services used by the different forms of XML query.
The Asynchronous (or “batch”) interface enables the user to submit a query request for later processing. When the response is ready, the user is notified via email or an HTTP callback. This interface is handled by coordinating the Admin deployments, DS deployments, Deposit Queue Service, Notification Email Service, and Notification Callback Service, in addition to the services used by the different forms of XML query.

The XML query system performs four functions:

DOI lookup. The DOI is given and only one work’s metadata is returned.
Author title search. This is a broad search, as few criteria are provided.
Fielded search. This is a precise search, as the user can specify many criteria and the precision of each criterion (e.g. “fuzzy”).
Unstructured citation search (aka formatted citation search). This search accepts a string as it would be typeset in a bibliography. The system may either parse the string into an XML query or use search-based reference matching.

Data sources

DOI lookup uses CDDB.
Author Title search uses Author Title Citation Search (Solr) and CDDB.
Fielded search uses the Deposit database (Oracle or MySQL) to match specific fields, together with the Misspelling Search Service. Ultimately the result is fetched from CDDB.
Unstructured citation search uses RefXpress and Formatted Citation Search.

Schema

The XML query and result schemas comprise two specialized schemas and reuse several common schemas. See the Schemas project.

The schemas, especially crossref_query_output3.0.xsd, are difficult to read without tooling help. A graphical schema is available at http://data.crossref.org/reports/help/schema_doc/crossref_query_output3.0/output.html

Implementation

The implementation is a small framework of interfaces, specialized query types, and service implementations. On instantiation, a specialized query service implementation registers itself, along with the specialized query types it can handle, with a registry. When a user makes an XML query, it is split into one or more specialized queries. A coordinator then dispatches each query to its registered query service, and collects and organizes the results.

The key classes of this framework, in the org.crossref.qs.citationsearch package, are:

CitationSearchCoordinator, this is a process object that dispatches the queries (CitationQuery) in the search (CitationSearch) to their registered query service (CitationQueryService), and then collects the results (CitationQueryResult) into a single result (CitationSearchResult).
CitationQueryServiceRegistry, this is a process object that enables a query service (CitationQueryService) to register the types of queries (CitationQuery) it can handle. Most query services can handle one or two related types of queries.
CitationSearch, this is a data object that holds details about the search context, and a list of queries.
CitationQuery, this is a data object that holds the details about the query context, and a list of criteria.
CitationQueryService, this is a process object that implements the query using the given criteria (CitationQuery) and returns the result (CitationQueryResult).
CitationSearchResult, this is a data object that holds the results of all the queries.
CitationQueryResult, this is a data object that holds the results (ReferenceHit) of a query.
ReferenceHit, this is a data object that holds a specific hit.
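
The relationship between these classes can be sketched as follows. This is a minimal illustration only: the class names come from the list above, but every field, constructor, and method signature here is an assumption made for the example (including the use of a plain string as the query type), not the actual org.crossref.qs.citationsearch code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for the data objects; all members are assumed.
record CitationQuery(String type, Map<String, String> criteria) {}
record ReferenceHit(String doi) {}
record CitationQueryResult(CitationQuery query, List<ReferenceHit> hits) {}
record CitationSearch(List<CitationQuery> queries) {}
record CitationSearchResult(List<CitationQueryResult> results) {}

// Process object: answers one query and returns its result.
interface CitationQueryService {
    CitationQueryResult search(CitationQuery query);
}

// Process object: records which service handles which type of query.
final class CitationQueryServiceRegistry {
    private final Map<String, CitationQueryService> servicesByType = new HashMap<>();

    void register(String queryType, CitationQueryService service) {
        servicesByType.put(queryType, service);
    }

    CitationQueryService lookup(String queryType) {
        CitationQueryService service = servicesByType.get(queryType);
        if (service == null) {
            throw new IllegalArgumentException("No service registered for query type: " + queryType);
        }
        return service;
    }
}

// Process object: dispatches each query to its registered service and
// gathers the per-query results into a single search result.
final class CitationSearchCoordinator {
    private final CitationQueryServiceRegistry registry;

    CitationSearchCoordinator(CitationQueryServiceRegistry registry) {
        this.registry = registry;
    }

    CitationSearchResult search(CitationSearch search) {
        List<CitationQueryResult> results = new ArrayList<>();
        for (CitationQuery query : search.queries()) {
            results.add(registry.lookup(query.type()).search(query));
        }
        return new CitationSearchResult(results);
    }
}
```

In this sketch a service is registered once per query type it supports, so a service that handles two related types would simply call register twice.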

Notes

Given the following request:

http://localhost:8080/servlet/query?usr=****&pwd=****&type=q&format=unixsd&qdata=

<?xml version = "1.0" encoding="UTF-8"?>
<query_batch version="2.0" xmlns = "http://www.crossref.org/qschema/2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <head>
        <doi_batch_id>eXtyles Request AMP.dodge0724.doc__12</doi_batch_id>
    </head>
    <body>
        <query key="Q1_jrn" enable-multiple-hits="false" expanded-results="false">
            <year>1987</year>
            <article_title>Psychophysiological analysis during autogenic training</article_title>
            <journal_title>Advances in Biological Psychiatry</journal_title>
            <volume>16</volume>
            <first_page>72</first_page>
            <author>Narita</author>
        </query>
    </body>
</query_batch>
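
Because the XML is passed in the qdata query parameter, it has to be percent-encoded before it is appended to the URL. The following is a minimal sketch of building such a request URI. The parameter names (usr, pwd, type, format, qdata) and their values are taken from the request above; the class name QueryUrlBuilder and its build method are hypothetical helpers, and the credentials are placeholders.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the query system itself.
final class QueryUrlBuilder {
    private QueryUrlBuilder() {}

    // Builds the synchronous query URI, encoding every caller-supplied value.
    static URI build(String baseUrl, String user, String password, String queryXml) {
        String query = "usr=" + encode(user)
                + "&pwd=" + encode(password)
                + "&type=q"
                + "&format=unixsd"
                + "&qdata=" + encode(queryXml);
        return URI.create(baseUrl + "?" + query);
    }

    // Form-style encoding: spaces become '+', reserved characters become %XX.
    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

For example, QueryUrlBuilder.build("http://localhost:8080/servlet/query", "user", "secret", xml) produces a URI whose qdata value starts with %3C%3Fxml when xml begins with the XML declaration.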

Call Sequence:

QueryController.handleRequest()
CitationSearchUtil.populateQueries(qDataParam, doiQuery, search);
List<CitationQuery> updatedQueries = formattedCitationParseService.parse()
List<CitationQuery> updatedQueries = refxService.parse()
CitationSearchResult result = citationSearchService.search(search);
-->    CitationSearchCoordinator.search()
CitationSearchResult result = citationSearchService.search(request);
DispatchCitationQueryCitationSearchService.search()
queryResult = serviceUse.getCitationQueryService().search(query);
-->    LimitedConcurrentUseCitationQueryServiceAdapter.search(CitationQuery)
-->    queryMetaDataCitationQueryService.internalSearch()
protected CitationQueryResult internalSearch(MetaDataCitationQuery query)        

// THE REAL WORK

CitationQueryResult result = metadataQueryResolver.getSearchResults(query);

// ... 

CitationSearchCoordinator.artifactService.contributeArtifact(artifact);
ArtifactServiceDispatcher.contributeArtifact()
ArtifactJmsGateway.contributeArtifact(Artifact artifact)
-->    jmsMessageGateway.sendMessage(artifact)

metadataQueryResolver.getSearchResults returns a CitationQueryResult object whose content is ultimately returned to the caller. This object is also used to populate a MetadataQueryArtifact object.
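
The call sequence passes through LimitedConcurrentUseCitationQueryServiceAdapter. This page does not describe how that adapter works, but its name suggests that it caps the number of concurrent calls into the query service it wraps. One common way to do that is a semaphore-guarded decorator. The sketch below shows that general technique under this assumption. The generic QueryService interface, the LimitedConcurrentUseAdapter class, and the maxConcurrentCalls parameter are all hypothetical and are not taken from the Crossref code.

```java
import java.util.concurrent.Semaphore;

// Hypothetical stand-in for a query service interface.
interface QueryService<Q, R> {
    R search(Q query);
}

// Decorator that allows at most maxConcurrentCalls simultaneous searches
// on the wrapped service; additional callers block until a permit is free.
final class LimitedConcurrentUseAdapter<Q, R> implements QueryService<Q, R> {
    private final QueryService<Q, R> delegate;
    private final Semaphore permits;

    LimitedConcurrentUseAdapter(QueryService<Q, R> delegate, int maxConcurrentCalls) {
        this.delegate = delegate;
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    @Override
    public R search(Q query) {
        permits.acquireUninterruptibly();
        try {
            return delegate.search(query);
        } finally {
            // Always return the permit, even if the delegate throws.
            permits.release();
        }
    }

    // Exposed only so the sketch can be inspected in tests.
    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Releasing the permit in a finally block keeps a failing query from permanently reducing the available capacity.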