RSAT - Help about compare-classes

Description

Authors

Options

Input formats

The class file specifies the relationships between a set of elements and a set of classes.

Class memberships must be formatted as text containing at least 2 columns separated by a tab character. The first column indicates the element names, the second the class names.

          member1	class_1
          member2	class_1
          member3	class_2
        

Optionally, the file may contain additional columns, which will be ignored, with the exception of an optional score column, which is used to compute the dot product.

          member1	class_1	score_1
          member2	class_1	score_2
          member3	class_2	score_3
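As an illustration of the input format, here is a minimal Python sketch of a reader for class files. It is not the program's own parser: the function name `read_classes` is hypothetical, and the sketch assumes that the score, when present, is a number stored in the third column.

```python
from collections import defaultdict


def read_classes(lines):
    """Parse tab-delimited membership lines into {class_name: {element: score}}.

    Assumption: an optional numeric score sits in the third column.
    The score is None when the line has only two columns.
    """
    classes = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:  # skip empty lines
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"expected at least 2 tab-separated columns: {line!r}")
        element, class_name = fields[0], fields[1]
        score = float(fields[2]) if len(fields) > 2 else None
        classes[class_name][element] = score
    return dict(classes)


example = [
    "member1\tclass_1\t0.5",
    "member2\tclass_1\t2.0",
    "member3\tclass_2\t1.5",
]
classes = read_classes(example)
```

Each class is stored as a dictionary so that the element names can be used as a set (for intersections) while the scores remain available.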
        

Comparison schema

Compare query classes to reference classes

Each query class is compared to each reference class, to compute the number of shared elements (intersection) and various statistical scores related to this number.
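The comparison scheme can be sketched in Python as follows. This is only an illustration under simple assumptions: classes are given as dictionaries mapping a class name to a set of element names, and the function name `compare_classes` is hypothetical. The keys `Q`, `R`, `QR`, `QvR` and `jac_sim` follow the field names listed under "Return fields".

```python
def compare_classes(query, reference):
    """Compare each query class to each reference class.

    query, reference: dict mapping class name -> set of element names.
    Returns one dict per (reference, query) pair.
    """
    rows = []
    for r_name in sorted(reference):
        for q_name in sorted(query):
            q_members, r_members = query[q_name], reference[r_name]
            common = q_members & r_members   # intersection (QR)
            union = q_members | r_members    # union (QvR)
            rows.append({
                "ref": r_name,
                "query": q_name,
                "Q": len(q_members),
                "R": len(r_members),
                "QR": len(common),
                "QvR": len(union),
                # Jaccard similarity = intersection / union (0 for two empty classes)
                "jac_sim": len(common) / len(union) if union else 0.0,
            })
    return rows


query = {"q1": {"a", "b", "c"}, "q2": {"d"}}
reference = {"r1": {"b", "c", "e"}, "r2": {"d", "f"}}
rows = compare_classes(query, reference)
```

With x query classes and y reference classes, the function returns x*y rows, one per class pair.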

Compare query classes to query classes

Each query class is compared to every other query class.

Output formats

The result of the comparison can be reported in two formats: a pairwise class comparison table, and a matrix.

Class pairs

The program returns a tab-delimited file with one row per pair of reference and query classes, and one column per statistic.

Reference/query matrix

Return a reference/query matrix, where each row corresponds to a reference class, each column to a query class, and each cell contains a comparison between the two classes.
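To show how the two output formats relate, the following Python sketch turns a list of class pairs into a tab-delimited reference/query matrix. The function name `to_matrix`, the `ref/query` corner label and the default value for missing pairs are assumptions made for the example, not the program's actual output conventions.

```python
def to_matrix(pairs, missing=0):
    """Build a tab-delimited matrix from (reference, query, value) triples.

    Rows correspond to reference classes, columns to query classes.
    """
    refs = sorted({r for r, _, _ in pairs})
    queries = sorted({q for _, q, _ in pairs})
    values = {(r, q): v for r, q, v in pairs}
    lines = ["\t".join(["ref/query"] + queries)]  # header row (hypothetical label)
    for r in refs:
        cells = [str(values.get((r, q), missing)) for q in queries]
        lines.append("\t".join([r] + cells))
    return "\n".join(lines)


pairs = [("r1", "q1", 2), ("r1", "q2", 0), ("r2", "q1", 0), ("r2", "q2", 1)]
matrix = to_matrix(pairs)
```

Here each cell holds the intersection size, but any single statistic from the pairwise table could be used as the cell value.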

Return fields

Return fields are grouped by categories, so that each request will return several columns. For example, the group "proba" returns the P-value, the E-value and the significance.

Group	Field	Description
occ	Q	Number of elements in class Q
occ	QR	Number of elements found in the intersection between classes R and Q
occ	QvR	Number of elements found in the union of classes R and Q (R or Q)
occ	R	Number of elements in class R
freq	E(QR)	Expected number of elements in the intersection
freq	F(!Q!R)	Frequency of !Q!R elements relative to population size. F(!Q!R) = !Q!R/P
freq	F(Q!R)	Frequency of Q!R elements relative to population size. F(Q!R) = Q!R/P
freq	F(Q)	Frequency of Q elements relative to population size. F(Q) = Q/P
freq	F(QR)	Frequency of QR elements relative to population size. F(QR) = QR/P
freq	F(R!Q)	Frequency of R!Q elements relative to population size. F(R!Q) = R!Q/P
freq	F(R)	Frequency of R elements relative to population size. F(R) = R/P
freq	P(QR)	Probability of Q and R (Q^R), assuming independence. P(QR) = F(Q)*F(R)
freq	P(Q|R)	Probability of Q given R. P(Q|R) = F(QR)/F(R)
freq	P(R|Q)	Probability of R given Q. P(R|Q) = F(QR)/F(Q)
proba	E_val	E-value of the intersection. E_val = P_val * nb_tests
proba	P_val	P-value of the intersection, calculated with the hypergeometric function. P_val = P(X >= QR)
proba	sig	Significance of the intersection. sig = -log10(E_val)
jac_sim	jac_sim	Jaccard similarity. jac_sim = intersection/union = (Q and R)/(Q or R)
dotprod	dotprod	Dot product (using the score column)
dotprod	dp_bits	dp_bits = round(log2_dp). The log2 of the dot product is rounded to obtain an integer value
dotprod	log2_dp	Log2 of the dot product
dotprod	prodrts	Sum of the sqrt of products. This is a sort of dot product, but the sqrt of each pairwise product is taken before summing
dotprod	sqrt_dp	Square root of the dot product
entropy	H(Q)	Entropy of class Q. H(Q) = - F(Q)*log[F(Q)] - F(!Q)*log[F(!Q)]
entropy	H(Q,R)	Joint entropy for classes Q and R. H(Q,R) = - F(QR)*log[F(QR)] - F(Q!R)*log[F(Q!R)] - F(R!Q)*log[F(R!Q)] - F(!Q!R)*log[F(!Q!R)]
entropy	H(Q|R)	Conditional entropy of Q given R. H(Q|R) = H(Q,R) - H(R)
entropy	H(R)	Entropy of class R. H(R) = - F(R)*log[F(R)] - F(!R)*log[F(!R)]
entropy	H(R|Q)	Conditional entropy of R given Q. H(R|Q) = H(Q,R) - H(Q)
entropy	I(Q,R)	Mutual information of classes Q and R. I(Q,R) = H(Q) + H(R) - H(Q,R)
entropy	IC	Information content (as defined by Schneider, 1986). IC = F(QR)*log[F(QR)/(F(Q)*F(R))]
entropy	U(Q|R)	
entropy	U(R|Q)	
entropy	dH(Q,R)	Entropy distance between classes Q and R. dH(Q,R) = H(Q,R) - H(Q)/2 - H(R)/2
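The formulas of the freq and entropy groups can be illustrated with a short Python sketch. It takes the four counts Q, R, QR and P (population size) and applies the formulas of the table literally. The function names are hypothetical, and two points are assumptions: the logarithm is taken in base 2 (the table does not state the base), and E(QR) is computed as Q*R/P, i.e. the expected intersection under independence.

```python
import math


def plogp(f):
    """Return f*log2(f), with the usual convention 0*log(0) = 0."""
    return f * math.log2(f) if f > 0 else 0.0


def class_pair_stats(q, r, qr, p):
    """q, r: class sizes; qr: intersection size; p: population size."""
    f_q, f_r, f_qr = q / p, r / p, qr / p
    f_q_not_r = (q - qr) / p            # F(Q!R)
    f_r_not_q = (r - qr) / p            # F(R!Q)
    f_neither = (p - q - r + qr) / p    # F(!Q!R)
    h_q = -plogp(f_q) - plogp(1 - f_q)
    h_r = -plogp(f_r) - plogp(1 - f_r)
    h_qr = -(plogp(f_qr) + plogp(f_q_not_r) + plogp(f_r_not_q) + plogp(f_neither))
    return {
        "E(QR)": q * r / p,             # assumption: expectation under independence
        "F(Q)": f_q, "F(R)": f_r, "F(QR)": f_qr,
        "P(QR)": f_q * f_r,
        "P(Q|R)": f_qr / f_r if f_r else 0.0,
        "P(R|Q)": f_qr / f_q if f_q else 0.0,
        "H(Q)": h_q, "H(R)": h_r, "H(Q,R)": h_qr,
        "H(Q|R)": h_qr - h_r,
        "H(R|Q)": h_qr - h_q,
        "I(Q,R)": h_q + h_r - h_qr,
        "dH(Q,R)": h_qr - h_q / 2 - h_r / 2,
    }


stats = class_pair_stats(q=4, r=4, qr=2, p=8)
```

In this example the two classes each cover half of the population and overlap exactly as expected by chance, so the mutual information I(Q,R) is zero.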

Thresholds

A lower and an upper threshold can be imposed on various fields in order to restrict the result. To avoid applying a threshold, leave the box empty or write none.
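The effect of thresholds can be sketched as a simple row filter. The Python example below is an illustration only: the function names are hypothetical, and it assumes that both bounds are inclusive, which the text above does not specify. An empty string or the word none means that no threshold is applied.

```python
def parse_threshold(text):
    """Convert a threshold box content to a float, or None if no threshold is set."""
    text = text.strip().lower()
    return None if text in ("", "none") else float(text)


def within_thresholds(row, lower, upper):
    """Return True if every field of row lies between its lower and upper threshold.

    row: dict field -> numeric value
    lower, upper: dict field -> threshold text (missing field = no threshold)
    """
    for field, value in row.items():
        lth = parse_threshold(lower.get(field, ""))
        uth = parse_threshold(upper.get(field, ""))
        if lth is not None and value < lth:   # assumption: bounds are inclusive
            return False
        if uth is not None and value > uth:
            return False
    return True


rows = [
    {"QR": 5, "sig": 2.3},
    {"QR": 1, "sig": 4.0},
    {"QR": 7, "sig": -0.5},
]
kept = [row for row in rows
        if within_thresholds(row, lower={"QR": "2", "sig": "0"}, upper={"sig": "none"})]
```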

Comparison statistics

P-value (P_val)

The P-value is the probability of observing by chance at least c common elements between a given query class and a given reference class. It is computed using the hypergeometric distribution.

Let us assume that we have :


The P-value can be interpreted as an estimate of the false prediction risk (FPR), i.e. the risk of considering the intersection between two given classes as significant when it is not.

Note that the P-value only concerns a single comparison between one specific query class and one specific reference class. It is called a nominal P-value because it is attached to one particular test among a series of multiple tests (since we compare each query class to each reference class). The multi-testing correction is done by computing the E-value, as explained in the next section.
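The hypergeometric P-value can be sketched in Python as a sum over the upper tail of the distribution. The symbol c comes from the text above; the other names (n for the population size, q and r for the sizes of the query and reference classes) are assumptions chosen for the example, as is the function name.

```python
from math import comb


def hypergeometric_pvalue(c, q, r, n):
    """P(X >= c) for the number X of common elements between two classes.

    Assumed notation: n = population size, q = query class size,
    r = reference class size, c = observed number of common elements.
    """
    total = comb(n, q)  # number of ways to draw a query class of size q
    tail = sum(
        comb(r, i) * comb(n - r, q - i)   # i elements inside the reference, q-i outside
        for i in range(c, min(q, r) + 1)
    )
    return tail / total


p_val = hypergeometric_pvalue(c=2, q=2, r=2, n=4)
```

The sum uses exact integer arithmetic and converts to a floating-point number only at the end, which avoids rounding errors for moderate population sizes. For very large populations a dedicated statistics library would be faster.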

E-value

Assuming that there are x query classes and y reference classes, each analysis consists of x*y comparisons. The nominal P-value can thus be misleading, because even low P-values are expected to emerge by chance alone when the number of query and/or reference classes is very high. The E-value (E_val = P_val * x*y) better reflects the degree of exceptionality: it estimates the number of class pairs expected to reach such a P-value by chance.
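As a small worked example, the correction can be written in Python using the relation E_val = P_val * nb_tests from the "Return fields" table, with nb_tests = x*y. The function and parameter names are hypothetical.

```python
def e_value(p_val, n_query_classes, n_reference_classes):
    """E-value = nominal P-value multiplied by the number of comparisons (x*y)."""
    nb_tests = n_query_classes * n_reference_classes
    return p_val * nb_tests


# 50 query classes compared to 200 reference classes = 10,000 tests
e_val = e_value(1e-3, 50, 200)
```

With 10,000 comparisons, a nominal P-value of 0.001 corresponds to about 10 class pairs expected by chance, so such a P-value alone would not be convincing.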

Significance

The significance index is the minus log of the E-value. It is calculated in base 10.

This index gives an intuitive perception of the exceptionality of the common elements: a negative sig (E_val > 1) indicates that the common elements are likely to occur by chance alone, whereas a positive sig (E_val < 1) indicates that they are significant. Higher sig values indicate a higher significance.
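The conversion from E-value to significance can be sketched in Python as follows. The function name is hypothetical, and returning infinity for an E-value of zero (which can occur through floating-point underflow) is a choice made for this example rather than the program's documented behaviour.

```python
import math


def significance(e_val):
    """sig = -log10(E_val); returns +infinity when E_val is zero."""
    if e_val <= 0:
        return math.inf  # assumption: underflowed E-values are treated as maximally significant
    return -math.log10(e_val)


sig_values = [significance(e) for e in (10.0, 1.0, 0.01)]
```

An E-value of 10 gives a sig of -1, an E-value of 1 gives 0, and an E-value of 0.01 gives 2.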