How we compute scores (Part 1: experiment channel)
This is in response to a question that we get quite frequently.
Sorry, it's a bit long, but this way it should contain sufficient detail to roughly understand how our scores come about (for the 'experiments channel' at least, and limited to protein mode). Have fun reading!
Christian von Mering (and Lars Jensen).
Procedure to compute experimental scores
- first, we import information about which proteins have been shown to interact experimentally, from the following databases: INTACT, MINT, GRID, BIND, and DIP. To a small extent, this also includes experimental data that is not necessarily indicative of a direct physical interaction, such as genetic interaction data. Most, however, are from more-or-less direct, physical detection methods.
- then, we map the proteins mentioned in these databases onto the proteins in the STRING database - using identifiers, or (if needed) sequences.
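This two-stage mapping (identifiers first, sequence fallback) might be sketched as follows; all function and variable names here are illustrative, not STRING's actual code:

```python
def map_to_string(record, id_to_string, seq_to_string):
    """Map an imported interaction-database record onto a STRING protein:
    try each of the record's identifiers first, then fall back to an
    exact sequence lookup. Returns None if nothing matches."""
    for ident in record["identifiers"]:
        if ident in id_to_string:
            return id_to_string[ident]
    return seq_to_string.get(record["sequence"])

# Toy lookup tables (hypothetical STRING identifiers).
id_map = {"P12345": "9606.ENSP00000001"}
seq_map = {"MKTAYIAK": "9606.ENSP00000002"}

print(map_to_string({"identifiers": ["P12345"], "sequence": ""}, id_map, seq_map))
print(map_to_string({"identifiers": ["Q99999"], "sequence": "MKTAYIAK"}, id_map, seq_map))
```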
- next, we group all interactions by their supporting publication (PMID), and make them non-redundant (they might be reported under the same PMID from several databases). We also expand pulldowns of entire protein complexes using the 'spoke' model (i.e. assuming binary interactions from the tagged/immunoprecipitated protein to all of its co-purified partners).
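The deduplication and the 'spoke' expansion can be sketched roughly like this (a simplified illustration, not STRING's actual code):

```python
def nonredundant(interactions):
    """Keep one record per (PMID, unordered protein pair), collapsing
    the same interaction reported by several source databases."""
    seen = set()
    out = []
    for pmid, a, b in interactions:
        key = (pmid, frozenset((a, b)))
        if key not in seen:
            seen.add(key)
            out.append((pmid, a, b))
    return out

def spoke_expand(bait, preys):
    """'Spoke' model for a complex pulldown: one binary interaction from
    the tagged/immunoprecipitated bait to each co-purified prey, and no
    prey-prey edges."""
    return [(bait, prey) for prey in preys if prey != bait]

print(spoke_expand("A", ["B", "C", "D"]))  # → [('A', 'B'), ('A', 'C'), ('A', 'D')]
```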
- then, we subdivide all interactions into 'small-scale, medium-scale, and high-throughput', based on the number of interactions reported by a single publication. These three classes are delineated by the extent of overlap with benchmark information, see below.
- next, for each of these classes, we determine their 'reliability' by comparing them to our KEGG benchmark. Briefly, an interaction between two proteins is counted as 'correct' when they are both annotated together in at least one 'KEGG map', i.e. in at least one functional process / pathway. It is counted as 'incorrect' when the two proteins are annotated in KEGG, but never in the same pathway. (Note that proteins that are not annotated at all in KEGG are not considered here.)
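The three-way benchmark logic above (correct / incorrect / not considered) boils down to a simple counting step. Here is a minimal sketch, assuming KEGG annotations are given as a dict from protein to its set of map identifiers:

```python
def benchmark(interactions, kegg_maps):
    """Count interactions as correct (the partners share at least one
    KEGG map), incorrect (both partners annotated in KEGG, but never in
    the same map), or skip them (at least one partner unannotated)."""
    correct = incorrect = 0
    for a, b in interactions:
        if a in kegg_maps and b in kegg_maps:
            if kegg_maps[a] & kegg_maps[b]:
                correct += 1
            else:
                incorrect += 1
    return correct, incorrect

# Toy annotations: A and B share map00010; C is in a different map;
# X is not annotated in KEGG at all, so (A, X) is ignored.
maps = {"A": {"map00010"}, "B": {"map00010", "map00020"}, "C": {"map00030"}}
print(benchmark([("A", "B"), ("A", "C"), ("A", "X")], maps))  # → (1, 1)
```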
- for the small-scale experiments, which contribute only very few interactions per paper, we cannot benchmark each paper separately. Therefore, all such papers are lumped together and benchmarked as one set, and we usually find them to be of quite high quality. As a result, we fix their score to some high number, for example 0.900 in the case of STRING version 7.1.
- for the medium-scale experiments, a separate score is computed for each publication, in a similar manner (some publications are found to report data of higher reliability, others of lower reliability).
- for the high-throughput experiments (there are fewer than 20 of these currently), we have enough information to be even a bit more specific: for each interaction in these sets, we can compute a 'raw score' from the data, because there are so many measurements done. Usually, this would be a score that describes how often a measurement has been confirmed, or how specific a particular interaction is, given the occurrence of the two proteins elsewhere throughout the dataset. These 'raw scores' are then binned, and each bin benchmarked separately, to arrive at a 'calibration curve', again using the KEGG pathways as a benchmark as described above. Thus, for these large sets, some interactions get a higher score, and others a lower score, depending on the information in the entire dataset.
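The binning-and-benchmarking step can be illustrated with a short sketch. This is a simplified stand-in for the real calibration (bin edges, helper names, and the precision-per-bin scoring are all illustrative assumptions here):

```python
import bisect

def calibration_curve(scored_interactions, shares_map, bin_edges):
    """Bin interactions by raw score and benchmark each bin separately.
    scored_interactions: list of (raw_score, protein_a, protein_b).
    shares_map(a, b): True / False, or None when either protein is
    unannotated in KEGG (such interactions are skipped).
    Returns the fraction of 'correct' interactions per bin (None if a
    bin is empty)."""
    counts = [[0, 0] for _ in range(len(bin_edges) + 1)]  # [correct, total]
    for raw, a, b in scored_interactions:
        verdict = shares_map(a, b)
        if verdict is None:
            continue
        i = bisect.bisect_right(bin_edges, raw)
        counts[i][1] += 1
        if verdict:
            counts[i][0] += 1
    return [c / t if t else None for c, t in counts]
```

With a single bin edge at 0.5, interactions below and above that raw score are calibrated independently, so a noisy low-scoring tail no longer drags down the confident part of the dataset.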
- this brings us to the cutoffs that determine whether something is small-scale, medium-scale, or high-throughput. This is defined by how many 'true positives' are in the dataset: to be a large-scale dataset, we require at least 50 true positive interactions to enable the benchmarking. Otherwise, more than 20 true positive interactions will make it a medium-scale dataset, and the rest is small-scale. ('True positives' are defined as interactions where both proteins are in KEGG and share at least one KEGG map.)
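In code, the classification rule above is just two thresholds on the true-positive count:

```python
def classify_dataset(n_true_positives):
    """Classify a publication's dataset by its number of KEGG true
    positives: >= 50 is high-throughput, > 20 is medium-scale,
    everything else is small-scale (thresholds as described above)."""
    if n_true_positives >= 50:
        return "high-throughput"
    if n_true_positives > 20:
        return "medium-scale"
    return "small-scale"

print(classify_dataset(120))  # → high-throughput
print(classify_dataset(35))   # → medium-scale
print(classify_dataset(5))    # → small-scale
```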
- then, we have to deal with interactions supported by more than one independent dataset (i.e. by more than one publication). For those, the scores are 'added up'. Of course, they are not literally added; instead, they are integrated probabilistically, assuming independent evidence: the combined score is one minus the product of the individual failure probabilities, i.e. S = 1 - (1 - S1)(1 - S2)(1 - S3)...
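This 'noisy-OR'-style integration (the probability that at least one of the independent observations is correct) is a one-liner in practice:

```python
def combine(scores):
    """Probabilistic integration of independent evidence channels:
    combined = 1 - product(1 - s_i). Adding evidence can only raise
    the score, and the result never exceeds 1."""
    p_all_wrong = 1.0
    for s in scores:
        p_all_wrong *= 1.0 - s
    return 1.0 - p_all_wrong

print(combine([0.5, 0.5]))       # → 0.75
print(combine([0.9, 0.6, 0.3]))  # two weaker datasets still push 0.9 higher
```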
- and finally, we have to deal with interactions that are reported in multiple organisms, or in an organism other than the one of interest. This is called 'interaction transfer', and is a very important step to increase coverage. It is described in the 2005 STRING paper. Essentially, the better the orthology situation can be delineated (i.e. clear orthologs for both interacting partners can be identified), the bigger the score fraction that is transferred. Transferred interactions are integrated probabilistically as mentioned above, and interactions that are reported in two very similar organisms (say, mouse and rat) are considered redundant and transferred only once. Note that transferred scores are stored separately from the 'direct' scores in the database, so that all the transferred information can be discarded, if desired.
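A heavily simplified sketch of the transfer idea: scale the source-organism score by a confidence in the orthology mapping of both partners. The real procedure is described in the 2005 STRING paper; the confidence values below are made up for illustration:

```python
def transfer_score(source_score, orthology_confidence):
    """Transfer an interaction score to another organism, scaled by how
    cleanly both partners map onto orthologs there (0..1). Illustrative
    only; the actual transfer weighting is more involved."""
    return source_score * orthology_confidence

# A clean one-to-one orthology transfers most of the score;
# an ambiguous many-to-many mapping transfers much less.
print(transfer_score(0.9, 0.95))  # → 0.855
print(transfer_score(0.9, 0.30))  # → 0.27
```

The transferred score would then be combined with any direct evidence in the target organism via the probabilistic integration described above.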