Sib-resource query-interface

From Web-team

To provide a lightweight central search on the ExPASy website (i.e. a query engine), a simple interface is defined. Participating SIB services must implement this interface to be searchable. The interface specification is given below.

Request definition

The request is a REST query (HTTP POST by default - HTTP GET is also allowed) with the following input parameters:

  • query (MANDATORY): main query text to be sent to the resource. The query text must be shorter than 1000 characters in case HTTP GET is used (for HTTP POST there is no such restriction).
  • type (MANDATORY): one of the following types (reserved keywords) must be chosen to clearly identify the query string (matching SHOULD be case-insensitive):
    • text - any free text query e.g. "cyclin-dependent kinase"
    • UniProtAC - UniProt Accession Number (definition) e.g. "P39951"
    • UniProtID - UniProt Identifier (entry name, definition) e.g. "CDK2_HUMAN"
    • UniParc - UniParc identifier (definition) e.g. "UPI0000000065"
    • PDBID - PDB identifier e.g. "1HCL"
    • IPI - IPI (definition) e.g. "IPI00026689"
    • RefSeq - RefSeq (definition1 definition2) e.g. "NP_000008, AP_004894, NZ_ABCD12345678" (not exhaustive)
    • EnsemblID - EnsemblID (definition) e.g. "ENST00000393489, ENSGALG00000017073" (not exhaustive)
    • eGeneID - Entrez Gene ID (definition) e.g. "306998"
    • GI - NCBI GI (GenInfo identifier, definition) e.g. "1293614" OR "GI:1293614"
    • GBA - GenBank Accession (definition) e.g. "U00001, AF000137, AAA02483"
    • EC - Enzyme Classification number (definition) e.g. "1.14.99.-"
    • AA - amino acid sequence (definition)
  • format (OPTIONAL): defines the format used to encode the result of the query. By default, the resource returns the result in XML. Optionally, the following return format(s) may be supported:
    • tsv - tab-separated values
  • Additional optional parameters are allowed but may not be interpreted by each resource.
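
As an illustration of the parameter rules above, here is a minimal client-side sketch in Python. The function name build_request and its arguments are hypothetical, and the form-encoded POST body is an assumption, since the specification does not state a body encoding.

```python
from urllib.parse import urlencode

# Reserved type keywords listed in the specification above.
SUPPORTED_TYPES = {
    "text", "UniProtAC", "UniProtID", "UniParc", "PDBID", "IPI", "RefSeq",
    "EnsemblID", "eGeneID", "GI", "GBA", "EC", "AA",
}
# Lower-case lookup so that callers may pass the type in any case.
_CANONICAL = {name.lower(): name for name in SUPPORTED_TYPES}

# For HTTP GET the query text must be shorter than this many characters.
MAX_GET_QUERY_LENGTH = 1000


def build_request(path, query, query_type, method="POST", result_format=None):
    """Return (method, request_target, body) for one query.

    `path` is the resource's relative path, e.g. "/search/swissmodel.cgi".
    The body is None for GET and form-encoded bytes for POST (assumption).
    """
    method = method.upper()
    if method not in ("GET", "POST"):
        raise ValueError(f"unsupported HTTP method: {method!r}")
    canonical = _CANONICAL.get(query_type.lower())
    if canonical is None:
        raise ValueError(f"unknown type: {query_type!r}")
    if method == "GET" and len(query) >= MAX_GET_QUERY_LENGTH:
        raise ValueError("query text too long for HTTP GET; use POST")

    params = {"query": query, "type": canonical}
    if result_format is not None:
        params["format"] = result_format
    encoded = urlencode(params)

    if method == "GET":
        return "GET", f"{path}?{encoded}", None
    return "POST", path, encoded.encode("ascii")
```

For example, build_request("/search/swissmodel.cgi", "p12344", "UniProtAC", method="GET") yields the request target /search/swissmodel.cgi?query=p12344&type=UniProtAC.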

Regular expressions

// the alternatives are grouped so that ^ and $ apply to both; accessions are 6 characters long
if(preg_match('/^([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9][A-Z][A-Z0-9]{2}[0-9])$/',  $clean_query)){     return 'UniProtAC';}
// exactly 4 digits should return textsearch, to prevent mixing with PDBID
elseif(preg_match('/^\d{4}$/',                       $clean_query)){     return 'text';}
elseif(preg_match('/^[A-Z0-9]{1,6}_[A-Z0-9]{3,5}$/', 	$clean_query)){		return 'UniProtID';}
elseif(preg_match('/^UPI[A-Z0-9]{10}/',            		$clean_query)){		return 'UniParc';}
elseif(preg_match('/^[1-9][A-Z0-9]{3}$/',          		$clean_query)){		return 'PDBID';}
// match FORMAT_IPI
elseif(preg_match('/^IPI[0-9]{8}/',                		$clean_query)){		return 'IPI';}
// match FORMAT_REFSEQ, see section 3.8 in RefSeq-releaseXX.txt (the version suffix dot is escaped)
elseif(preg_match('/^[NXAY]P_\d{6}(\d{3})?(\.\d+)?$/', 	$clean_query)){		return 'RefSeq';} //RefSeq protein
elseif(preg_match('/^ZP_\d{8}(\.\d+)?$/',               	$clean_query)){		return 'RefSeq';} //RefSeq protein II
elseif(preg_match('/^[NX][MR]_\d{6}(\d{3})?(\.\d+)?$/', 	$clean_query)){		return 'RefSeq';} //RefSeq rna
elseif(preg_match('/^N[CGTS]_\d{6}(\.\d+)?$/',          	$clean_query)){		return 'RefSeq';} //RefSeq genomic I
elseif(preg_match('/^AC_\d{6}(\.\d+)?$/',               	$clean_query)){		return 'RefSeq';} //RefSeq genomic II  (Alternate complete genomic molecule)
elseif(preg_match('/^NW_\d{6}(\d{3})?(\.\d+)?$/',       	$clean_query)){		return 'RefSeq';} //RefSeq genomic III (Intermediate genomic assemblies of BAC or Whole Genome Shotgun sequence data)
elseif(preg_match('/^NZ_[A-Z]{4}\d{8}(\.\d+)?$/',       	$clean_query)){		return 'RefSeq';} //RefSeq genomic IV  (Collection of whole genome shotgun sequence data for a project)
// match FORMAT_ENSEMBLID ie: GTEPR = Genomic, Transcript, Exon, Protein, Regulatory.   E.g.: ENSG00000139618  ENSGALG00000017073
elseif(preg_match('/^ENS([A-Z]{3})?[GTEPR][0-9]{11}$/', $clean_query)){		return 'EnsemblID';}
// TODO: add ensembl genomes and ensembl gene families accessions
elseif(preg_match('/[ACDEFGHIKLMNPQRSTVWYBJXZUO]{20,}/',$clean_query)){		return 'AA';}
// TODO: match GenBank Accession
elseif(preg_match('/^([A-Z]\d{5}|[A-Z]{2}\d{6,})/',	$clean_query)){		return 'GBA';} //Some GBA ids can look like 'UniProtAC'; ^ now applies to both alternatives
elseif(preg_match('/^[A-Z]{3}\d{5,}(\.\d+)?/', 			$clean_query)){     return 'GBA';}
// match FORMAT_EC
elseif(preg_match('/^[1-6]\.\d+\.\d+\.\d+$/', 			$clean_query)){     return 'EC';}
elseif(strlen($clean_query) >= 3 && strlen($clean_query) < 100){
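
For readers who do not use PHP, the first-match-wins idea of the chain above can be sketched in Python as an ordered rule table. This is a partial, illustrative port covering only a few of the types. The names RULES and guess_type are hypothetical, the input is assumed to be trimmed and upper-cased already, and the UniProtAC pattern is written with an explicit group so that both anchors apply to both alternatives.

```python
import re

# Ordered rules: the first pattern that matches decides the type.
# Only a subset of the reserved types is covered in this sketch.
RULES = [
    (re.compile(r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
                r"|[A-NR-Z][0-9][A-Z][A-Z0-9]{2}[0-9])$"), "UniProtAC"),
    # Exactly four digits is treated as free text, to avoid confusion with PDBID.
    (re.compile(r"^\d{4}$"), "text"),
    (re.compile(r"^[A-Z0-9]{1,6}_[A-Z0-9]{3,5}$"), "UniProtID"),
    (re.compile(r"^UPI[A-Z0-9]{10}"), "UniParc"),
    (re.compile(r"^[1-9][A-Z0-9]{3}$"), "PDBID"),
    (re.compile(r"^IPI[0-9]{8}"), "IPI"),
    (re.compile(r"^[1-6]\.\d+\.\d+\.\d+$"), "EC"),
]


def guess_type(clean_query):
    """Return the type keyword of the first matching rule, or None."""
    for pattern, type_name in RULES:
        if pattern.search(clean_query):
            return type_name
    return None
```

Keeping the rules in a list makes the evaluation order explicit, which matters here because several identifier formats overlap (for example four-digit strings and PDB identifiers).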


Example requests:

Full HTTP POST request:

POST /search/swissmodel.cgi HTTP/1.1 
Content-length: 29


Full HTTP GET request:

GET /search/swissmodel.cgi?query=p12344&type=UniProtAC HTTP/1.1 

Base URL and supported types

Each resource is identified by a base URL which consists of the following elements:

  • hostname of the resource to be queried (MANDATORY)
  • relative path to query interface (MANDATORY)
  • optional parameters (OPTIONAL)


Additionally, for each resource the supported types (see above, e.g. text, UniProtAC, PDBID etc.) need to be clearly defined and communicated to the SIB Web Team.
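
A central search could keep one small record per resource holding the base URL elements and the supported types. The following Python sketch is only an illustration: the class name Resource, the hostname resource.example.org and the http scheme are assumptions that do not come from this specification.

```python
from dataclasses import dataclass
from urllib.parse import urlencode, urlunsplit


@dataclass(frozen=True)
class Resource:
    hostname: str               # MANDATORY: host of the resource to be queried
    path: str                   # MANDATORY: relative path to the query interface
    supported_types: frozenset  # type keywords communicated by the resource
    extra_params: tuple = ()    # OPTIONAL: (name, value) pairs

    def base_url(self, scheme="http"):
        """Assemble the base URL from hostname, path and optional parameters."""
        query = urlencode(self.extra_params)
        return urlunsplit((scheme, self.hostname, self.path, query, ""))

    def supports(self, query_type):
        """Case-insensitive check against the declared types."""
        return query_type.lower() in {t.lower() for t in self.supported_types}


# Hypothetical registration of one resource.
demo = Resource(
    hostname="resource.example.org",
    path="/search/demo.cgi",
    supported_types=frozenset({"text", "UniProtAC"}),
)
```

With such a record, the search engine can skip resources that do not declare the detected type instead of sending them a request that would fail.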

Response definition

In response to the request defined above, the resource returns a result set that typically contains a reference (URL) to a page (or page cache) holding the answer to the query. Note that the resource does not return the actual data objects; these have to be requested via an additional HTTP GET request to the given URL.

By default, the response is encoded in XML since it can easily be expanded when needed. Other return formats such as tab-separated values are possible, but they do not need to be implemented by the service. The result data structure in XML (called ExpasyResult) contains the following parameters:

  • count (MANDATORY): number of matching results retrieved by the resource. Normally count >= 0; in case of errors, count must be set to -1.
  • url (MANDATORY): URL of the result page. Typically, the URL refers to a cache where the result is available for a limited amount of time. A subsequent HTTP GET request to the URL then returns the actual query result (i.e. data objects). The url field can also carry an error string (see Error handling).
  • description (OPTIONAL): description of the result set.
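
On the client side, reading these three parameters from an XML response might look like the Python sketch below. It assumes the elements carry no XML namespace and that count, url and description are direct children of ExpasyResult. The function name and the example URL are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional


class ExpasyResult(NamedTuple):
    count: int                         # -1 signals an error
    url: str                           # result page URL, or an error string
    description: Optional[str] = None  # optional description of the result set


def parse_expasy_result(xml_text):
    """Parse an ExpasyResult document into a small record."""
    root = ET.fromstring(xml_text)
    if root.tag != "ExpasyResult":
        raise ValueError(f"unexpected root element: {root.tag!r}")
    count_text = root.findtext("count")
    url = root.findtext("url")
    if count_text is None or url is None:
        raise ValueError("count and url are mandatory")
    return ExpasyResult(int(count_text), url, root.findtext("description"))


# Hypothetical response body, for illustration only.
sample = (
    '<?xml version="1.0"?>'
    "<ExpasyResult><count>3</count>"
    "<url>http://resource.example.org/cache/abc</url></ExpasyResult>"
)
result = parse_expasy_result(sample)
```

A caller would typically check result.count first: a value of -1 means the url field holds an error string rather than a link.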

XML Schema definition:

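<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="" xmlns="">
  <xs:element name="ExpasyResult">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="count" type="xs:int"/>
        <xs:element name="url" type="xs:string"/>
        <xs:element name="description" type="xs:string" minOccurs="0" maxOccurs="1"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>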


XML example:

<?xml version="1.0"?>

TSV example:


Full HTTP response:

HTTP/1.1 200 OK
Connection: close
Content-Length: 140
Content-Type: text/xml
Date: Wed, 23 Jun 2010 22:07:42 GMT

<?xml version="1.0"?>

Examples with optional descriptions of result sets:

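<?xml version="1.0"?>
<ExpasyResult>
  <count>...</count>
  <url>...</url>
  <description>number of matched PROSITE documentation entries</description>
</ExpasyResult>

<?xml version="1.0"?>
<ExpasyResult>
  <count>...</count>
  <url>...</url>
  <description>number of PROSITE motif matches in UniProtKB Q4WEB4</description>
</ExpasyResult>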

Error handling

In case a resource cannot handle a request (e.g. types are not supported, internal resource error etc.), an error must be indicated in the following way:

  • count is set to -1
  • url indicates the source or reason of the error.

Additionally, we recommend that the following HTTP status codes are used in the HTTP header to indicate the source of the error:

  • 400 Bad Request. Client side errors such as invalid type etc.
  • 500 Internal Server Error. Any error that is related to the resource (service).



XML examples:

HTTP/1.1 400 Bad Request
Date: Tue, 26 Oct 2010 10:30:12 GMT
Content-length: ...

<?xml version="1.0"?>
<ExpasyResult>
   <count>-1</count>
   <url>unsupported resource 'xxx'</url>
</ExpasyResult>

HTTP/1.1 500 Internal Server Error
Date: Tue, 26 Oct 2010 10:30:12 GMT
Content-length: ...

<?xml version="1.0"?>
<ExpasyResult>
   <count>-1</count>
   <url>Cannot access local database</url>
</ExpasyResult>

TSV example:

HTTP/1.1 400 Bad Request
Date: Tue, 26 Oct 2010 10:30:12 GMT
Content-length: ...

-1\tunsupported resource 'xxx'
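
On the service side, the error convention described in this section could be produced by a small helper such as the Python sketch below. The function name error_response is hypothetical. The TSV layout (count, tab, url) is inferred from the TSV example above, and the TSV content type is an assumption.

```python
from xml.sax.saxutils import escape

# Status codes recommended in this section, with their reason phrases.
_PHRASES = {400: "Bad Request", 500: "Internal Server Error"}


def error_response(status, reason, result_format="xml"):
    """Return (status_line, content_type, body) for an error result.

    count is always -1 and the url field carries the error string.
    """
    if status not in _PHRASES:
        raise ValueError("only 400 and 500 are covered by this sketch")
    if result_format == "tsv":
        content_type = "text/tab-separated-values"  # assumption
        body = f"-1\t{reason}"
    else:
        content_type = "text/xml"
        body = (
            '<?xml version="1.0"?>\n'
            "<ExpasyResult>\n"
            "  <count>-1</count>\n"
            f"  <url>{escape(reason)}</url>\n"
            "</ExpasyResult>"
        )
    return f"HTTP/1.1 {status} {_PHRASES[status]}", content_type, body
```

The error string is escaped before it is placed in the url element, so that characters such as < or & in an error message cannot break the XML.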