Monday, July 21, 2008

High-resolution images

We recently implemented a way to export high-res images (300 dpi, click the image below to see an example). This feature will go public with STRING 8 / STITCH 2, but if you're now using STRING or STITCH and want to prepare an image for publication, please get in touch with us (mkuhn embl de) and we can send you the image.

Thursday, June 26, 2008

How we compute scores (Part 1: experiment channel)

This is in response to a question that we get quite frequently.
Sorry, it's a bit long, but this way it should contain sufficient detail to roughly understand how our scores come about (for the 'experiments channel' at least, and limited to protein mode). Have fun reading!

Christian von Mering (and Lars Jensen).


Procedure to compute experimental scores

  • first, we import information about which proteins have been shown to interact experimentally, from the following databases: INTACT, MINT, GRID, BIND, and DIP. To a small extent, this also includes experimental data that is not necessarily indicative of a direct physical interaction, such as genetic interaction data. Most, however, are from more-or-less direct, physical detection methods.
  • then, we map the proteins mentioned in these databases onto the proteins in the STRING database, using identifiers or (if needed) sequences.
  • next, we group all interactions by their supporting publication (PMID), and make them non-redundant (they might be reported under the same PMID from several databases). We also expand pulldowns of entire protein complexes using the 'spoke' model (i.e. assuming binary interactions from the tagged/immunoprecipitated protein to all of its co-purified partners).
  • then, we subdivide all interactions into 'small-scale, medium-scale, and high-throughput', based on the number of interactions reported by a single publication. These three classes are delineated by the extent of overlap with benchmark information, see below.
  • next, for each of these classes, we determine their 'reliability' by comparing them to our KEGG benchmark. Briefly, an interaction between two proteins is counted as 'correct' when they are both annotated together in at least one 'KEGG map', i.e. in at least one functional process / pathway. It is counted as 'incorrect' when the two proteins are annotated in KEGG, but never in the same pathway. (Note that proteins that are not annotated at all in KEGG are not considered here.)
  • for the small-scale experiments, which report only very few interactions per paper, we cannot benchmark each paper separately. Therefore, all such papers are lumped and benchmarked together, and we usually find them to be of quite high quality. As a result, we fix their score to some high number, for example 0.900 in the case of STRING version 7.1.
  • for the medium-scale experiments, a separate score is computed for each publication, in a similar manner (some publications are found to report data of better reliability, others of lower reliability).
  • for the high-throughput experiments (there are fewer than 20 of these currently), we have enough information to be even a bit more specific: for each interaction in these sets, we can compute a 'raw score' from the data, because so many measurements have been made. Usually, this would be a score that describes how often a measurement has been confirmed, or how specific a particular interaction is, given the occurrence of the two proteins elsewhere throughout the dataset. These 'raw scores' are then binned, and each bin is benchmarked separately, to arrive at a 'calibration curve', again using the KEGG pathways as a benchmark as described above. Thus, for these large sets, some interactions get a higher score, and others a lower score, depending on the information in the entire dataset.
  • this brings us to the cutoffs that determine whether something is small-scale, medium-scale or high-throughput. This is defined by how many 'true positives' are in the dataset: to be a large-scale dataset, we require at least 50 true positive interactions to enable the benchmarking. Otherwise, more than 20 true positive interactions will make it a medium-scale dataset, and the rest is small-scale. ('True positives' are defined as interactions where both proteins are in KEGG and share at least one KEGG map.)
  • then, we have to deal with interactions supported by more than one independent dataset (i.e. by more than one publication). For those, the scores are 'added up'. Of course, they are not literally added up, but rather combined in a probabilistic integration, like so:
new_score = 1 - (1 - score_a) * (1 - score_b).

  • and finally, we have to deal with interactions that are reported in multiple organisms, or in an organism other than the one of interest. This is called 'interaction transfer', and is a very important step to increase coverage. It is described in the 2005 STRING paper. Essentially, the better the orthology situation can be delineated (i.e. clear orthologs for both interacting partners can be identified), the bigger the score fraction that is transferred. Transferred interactions are integrated probabilistically as mentioned above, and interactions that are reported in two very similar organisms (say, mouse and rat) are considered redundant and transferred only once. Note that transferred scores are stored separately from the 'direct' scores in the database, so that all the transferred information can be discarded, if desired.
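
To make three of the steps above a bit more concrete, here is a minimal Python sketch of the KEGG benchmark labelling, the dataset-size cutoffs and the probabilistic integration. This is not the actual STRING code: the function names and the layout of `kegg_maps` (a dictionary from protein to a set of pathway ids) are assumptions made for illustration only.

```python
def benchmark_label(protein_a, protein_b, kegg_maps):
    """Label one interaction against a KEGG-style benchmark.

    kegg_maps is a hypothetical dict: protein -> set of pathway ids.
    Returns 'correct', 'incorrect', or None if either protein is not
    annotated at all (such pairs are not considered).
    """
    maps_a = kegg_maps.get(protein_a)
    maps_b = kegg_maps.get(protein_b)
    if not maps_a or not maps_b:
        return None
    # 'correct' means the two proteins share at least one pathway
    return "correct" if maps_a & maps_b else "incorrect"


def classify_dataset(true_positives):
    """Apply the cutoffs from the text: at least 50 true positives for
    high-throughput, more than 20 for medium-scale, else small-scale."""
    if true_positives >= 50:
        return "high-throughput"
    if true_positives > 20:
        return "medium-scale"
    return "small-scale"


def combine_scores(scores):
    """Probabilistic integration: 1 - product of (1 - score)."""
    remaining = 1.0
    for score in scores:
        remaining *= 1.0 - score
    return 1.0 - remaining
```

For two scores, `combine_scores([score_a, score_b])` is the same as the formula given above; the loop simply extends it to any number of independent datasets.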

Tuesday, June 17, 2008

Downtime Wednesday morning

The STRING server will get a new disk tomorrow morning (European time), so there will be a downtime for STRING/STITCH. We hope everything will be working again in the early afternoon.

Update: We're back online, with enough room for the next version of STRING.

Monday, May 19, 2008

API also available on STRING

When I created the API, I only put it on STITCH. Of course there's no reason to not have it also on STRING, so here you go:

http://string.embl.de/api/tsv/interactors?identifier=DRD1_HUMAN

It's in the same state as the STITCH API: Still subject to change, and potentially unstable.

Monday, April 28, 2008

Getting identifiers for a list of genes

If you want to quickly get identifiers for a long list of items, you can use the following command, which uses wget to repeatedly query the API.

cat protein_names.txt | xargs -I{} wget -nv -O - \
'http://stitch.embl.de/api/tsv-no-header/resolve?identifier={}&species=4932&echo_query=1' \
> protein_identifiers.tsv

I've also introduced another parameter, echo_query, so that you can see your query item in the output.
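
If you prefer a script over a shell one-liner, the same loop could be sketched in Python roughly like this. The function names are made up for this example, and the `fetcher` argument is just a convenience so that the download step can be swapped out; the URL and parameters are the ones from the command above.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen

API_BASE = "http://stitch.embl.de/api/tsv-no-header/resolve"


def resolve_url(name, species=4932):
    """Build the resolve URL for one query item (4932 is the species used above)."""
    params = {"identifier": name, "species": species, "echo_query": 1}
    return API_BASE + "?" + urlencode(params, quote_via=quote)


def fetch(url):
    """Download one API response as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def resolve_all(names, fetcher=fetch):
    """Query the API once per non-empty name and concatenate the results."""
    chunks = []
    for name in names:
        name = name.strip()
        if name:
            chunks.append(fetcher(resolve_url(name)))
    return "".join(chunks)
```

To reproduce the command, pass the lines of protein_names.txt to `resolve_all` and write the returned text to protein_identifiers.tsv.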

Wednesday, April 23, 2008

No Downtime on Saturday, April 26

There'll be an EMBL-wide power cut on Saturday, April 26. Therefore, our servers won't be reachable at this time. We hope that the computer infrastructure will be re-activated by Monday.

Sorry, the plans were changed and not all of EMBL is affected, so we should stay online.

Tuesday, March 4, 2008

Embedding protein/chemical information into other web pages

One of the nice things about STRING/STITCH is that you can click on any item and be presented with a helpful pop-up describing the item in question. :-)

As you can see, for a protein, we show links to different servers, the domain structure and if possible a representative PDB structure. For a chemical we currently only show the structure and link to PubChem.

In order to use the same pop-up in Reflect (a cool tool that recognizes and annotates proteins and chemicals on any web page), we made a service out of it. The functionality is going to change in the future, but here are two working examples:

http://stitch.embl.de/services/iteminfo?node=9606.ENSP00000186982
http://stitch.embl.de/services/iteminfo?node=CID2244
The idea is that you can take this raw HTML and put it into an iframe with JavaScript. (This is left as an exercise to the reader... future versions of STRING may use this technique to make the pages a bit smaller.) For now, it only accepts internal STRING identifiers, which you either get from our download files or from the API.
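
As a starting point for that exercise, here is a small Python sketch that generates a static iframe tag pointing at the service, e.g. for a page that is rendered on the server. It does not use JavaScript, and the helper names and the default width and height are arbitrary choices for this example, not part of the service.

```python
from html import escape
from urllib.parse import quote


def iteminfo_url(node, host="stitch.embl.de"):
    """Build the iteminfo URL for an internal STRING/STITCH identifier."""
    return "http://%s/services/iteminfo?node=%s" % (host, quote(node, safe=""))


def iteminfo_iframe(node, width=400, height=300):
    """Return an HTML iframe tag that embeds the pop-up for one item."""
    src = escape(iteminfo_url(node), quote=True)
    return '<iframe src="%s" width="%d" height="%d"></iframe>' % (src, width, height)
```

Calling `iteminfo_iframe("CID2244")` gives a tag whose `src` is the second example URL above.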

Monday, March 3, 2008

Servers may be unreachable today (March 3)

You might have trouble accessing our servers today because of construction in the server rooms. We apologize and hope that everything will be done soon!

Update: We're back online.

Scope of the API and current plans

When I announced the API, I didn't devote much space to the intended scope of the API. To make things clearer:

  • REST/SOAP: We'll only provide a REST API plus a Soaplab2 wrapper for Taverna. Perhaps later dedicated programmers can add a SOAP interface if the demand is sufficiently high.
  • Queries for bulk data: For implementation and licensing reasons, we'll only provide methods to query by individual items, just like on the web site. If you need access to bulk data, you can download it.
  • Miscellaneous records: We want to add more query options later for retrieving information from the freely available files. For example: What are the synonyms of this item? To which orthologous group does this protein belong?

Friday, February 22, 2008

Example Taverna workflow

Taverna is a tool that lets you connect web services from different sources with each other. I've implemented a simple example workflow (.xml): You enter a human protein or chemical of interest, STITCH will identify the matching item(s), generate an interaction network and retrieve the 10 most relevant abstracts. The list of Pubmed ids is then passed on to Whatizit, which highlights disease terms in the abstracts. A simple Python script then counts the diseases.

Try it out; try other things; tell me what you think.
Note: The Python script is called via the Soaplab interface. I tried to do this in Taverna but gave up: it didn't like my XPath (due to the XML produced by Whatizit), and writing a Beanshell script seemed more cumbersome.
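
For the curious, the counting step at the end of the workflow can be sketched in a few lines of Python. This is not the script used in the workflow: it assumes the disease terms have already been pulled out of the Whatizit output into one list per abstract (that parsing is left out here), and it counts each disease at most once per abstract, which is a design choice of this sketch.

```python
from collections import Counter


def count_diseases(terms_per_abstract):
    """Count in how many abstracts each disease term occurs.

    terms_per_abstract: iterable of lists of disease terms, one list per
    abstract (hypothetical input format). Terms are compared
    case-insensitively.
    """
    counts = Counter()
    for terms in terms_per_abstract:
        # a set, so that repeated mentions within one abstract count once
        counts.update({term.strip().lower() for term in terms})
    return counts
```

`count_diseases(...).most_common()` then lists the diseases, most frequent first.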

Tuesday, February 19, 2008

We have an API!

I went to the BioHackathon 2008 in Tokyo and worked on an API for STRING and STITCH. If you think about using STRING or STITCH with an API, and miss features, please get in touch with us either via the comments or e-mail (e.g. mkuhn//embl.de).

Here's what we have to offer so far:

REST interface

The URL patterns are: http://stitch.embl.de/api/[format]/[request]?[parameters]
http://string.embl.de/api/[format]/[request]?[parameters]

Possible formats:

  • tsv: tab-separated values, with a header line
  • tsv-no-header: as above, but no header
  • json: JSON format either as a list of hashes/dictionaries, or as a plain list (if there is only one value to be returned per record)
  • psi-mi: the interaction network is available in PSI-MI 2.5 XML format
  • psi-mi-tab: there is also a tab-delimited form, modeled after the IntAct specification. This is easier to parse, but contains less information than the XML format.
  • url: return the URL of the network image
Possible requests:
  • abstracts: return a list of abstracts that contain the query item
  • abstractsList: return a list of abstracts that contain any of the query items
  • interactions: return an interaction network in PSI-MI 2.5 format (PSI-MI is currently the only format for interactions. Perhaps the PSI-MI tab-delimited form would also make sense? I don't know what a JSON form should look like.)
  • interactionsList: same as above, but for a list of identifiers
  • interactors: return a list of interaction partners for the query item
  • interactorsList: return a list of interaction partners for any of the query items
  • resolve: return the list of items that match (in name or identifier) the query item
  • network / networkList: in conjunction with the "url" format, return the URL to the network
For a full list of possible parameters, please refer to our STRING Soaplab 2 interface. With the help of Soaplab2 / Gowlab, we'll describe the set of possible parameters there. (Doesn't work right now. :-/ )

Examples

To find out which proteins match the description "dopamine receptor" in human, you can use this query:

http://stitch.embl.de/api/tsv/resolve?identifier=dopamine%20receptor&species=9606
http://string.embl.de/api/tsv/resolve?identifier=dopamine%20receptor&species=9606

This gives you a lot of additional info. If you just want to get the list of STRING identifiers, you can alter the query a bit:

http://stitch.embl.de/api/tsv-no-header/resolve?identifier=dopamine%20receptor&species=9606&format=only-ids
http://string.embl.de/api/tsv-no-header/resolve?identifier=dopamine%20receptor&species=9606&format=only-ids

Now, you'll only receive a bare list of ids that you could pipe into other STRING API functions.

To illustrate the difference between normal and "list" queries:

http://stitch.embl.de/api/tsv/interactors?identifier=DRD1_HUMAN
http://stitch.embl.de/api/tsv/interactorsList?identifiers=DRD1_HUMAN%0DDRD2_HUMAN

http://string.embl.de/api/tsv/interactors?identifier=DRD1_HUMAN
http://string.embl.de/api/tsv/interactorsList?identifiers=DRD1_HUMAN%0DDRD2_HUMAN


In the second case, the identifiers parameter contains a list of items separated by new line characters (%0A or %0D).
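
If you build these URLs from a script, the encoding can be left to a library. Here is a minimal Python sketch; `api_url` and `join_identifiers` are hypothetical helper names, and only the URL pattern and parameters shown above are taken from the API.

```python
from urllib.parse import quote, urlencode


def join_identifiers(identifiers):
    """Join several identifiers with a carriage return (encoded as %0D)."""
    return "\r".join(identifiers)


def api_url(host, fmt, request, **params):
    """Fill in the pattern http://[host]/api/[format]/[request]?[parameters]."""
    # quote_via=quote encodes spaces as %20 (the default would use '+')
    query = urlencode(params, quote_via=quote)
    return "http://%s/api/%s/%s?%s" % (host, fmt, request, query)


single = api_url("string.embl.de", "tsv", "interactors", identifier="DRD1_HUMAN")
several = api_url(
    "string.embl.de",
    "tsv",
    "interactorsList",
    identifiers=join_identifiers(["DRD1_HUMAN", "DRD2_HUMAN"]),
)
```

The path format is passed as `fmt` so that a query parameter named `format` (as in `format=only-ids`) can still be given as a keyword argument.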

SOAP / Taverna

In a separate post, I've described an example Taverna workflow. As for SOAP integration, I hope that the Soaplab interface works...

Obligatory beta notice

Like all good things these days, this is still in beta (internally, everything in fact runs on our beta server; I'm just making it accessible via the normal STITCH domain to expose it to the web). Therefore, the API might change, be down, ... until STITCH 2 / STRING 8 comes out.

Updates

03.03.2008: Added clarification – PSI-MI is currently the only interactions format.
04.03.2008: Fixed typo – it's "interactorsList"
12.03.2008: Add psi-mi-tab format
19.05.2008: Add STRING API (with same specification)
08.07.2008: Add API for generating network images
16.03.2009: Enabled interactionsList

Wednesday, February 6, 2008

Mea culpa: missing links from PDB

I intended to extract protein–chemical links from the PDB (and we wrote this in the paper), but somehow I didn't quite finish the import scripts before we finalized STITCH 1.0. I am sorry about this, and apologize if you are missing interactions.


We are currently preparing the next versions of STRING and STITCH (versions 8 and 2, respectively) and I will import the remediated PDB for the new version. I guess STITCH 2 should come out in Spring 2008.

(Special thanks to Florian Raible for testing his pet molecule and discovering what was missing.)