Method and device for evaluating a decision condition with sequence motifs used for discrimination of symbol sequences5410493Abstract For evaluating decision conditions, by the MDL principle, used for discriminating a given amino acid sequence as a particular one of super families, a sum of the description lengths of each of the decision conditions and a precision of each decision condition is calculated to make an evaluating parameter. Each decision condition comprises a plurality of conditional clauses. A description length of the decision condition is given by a total of clause description lengths of the conditional clauses. One of the clause description lengths is given by a total of a description length of the particular super family and a description length of one or more sequence motifs and a distance between motifs contained in the clause. An optimum is determined as one of the decision condition which makes the minimum evaluating parameter. Claims What is claimed is: Description BACKGROUND OF THE INVENTION
______________________________________
motif (J, cytochrome c): contain ("CXXCH", J).
motif (J, otherwise).
______________________________________
DECISION PREDICATE 2 This decision predicate means that the sequence "J" corresponds to cytochrome c if it contains both patterns of sequence motifs "CXXCH" and "PGTKM" but it is otherwise if it does not contain the both patterns. This is also expressed as follows:
______________________________________
motif (J, cytochrome c): contain ("CXXCH", J),
contain ("PGXKM", J).
motif (J, otherwise).
______________________________________
DECISION PREDICATE 3 This decision predicate 3 means that the sequence "J" corresponds to cytochrome c if it contains the sequence motif "CXXCH" or "AAQCH", but it is otherwise if it contains neither "CXXCH" nor "AAQCH". This decision predicate is expressed by three clauses as follows:
______________________________________
motif (J, cytochrome c): contain ("CXXCH", J).
motif (J, cytochrome c): contain ("AAQCH", J).
motif (J, otherwise).
______________________________________
Table 1 shows the distribution of 6158 training sequences in the data bank PIR 18.0 observed by the decision predicates 1, 2, and 3.
TABLE 1
______________________________________
Total number
Number of Number of
Clauses of sequences
targets correct sequences
______________________________________
Decision Predicate 1
CXXCH 6158 189 116
otherwise
5969 5969 5966
Decision Predicate 2
CXXCH and
6158 78 78
PGTKM
otherwise
6080 6080 6039
Decision Predicate 3
CXXCH 6158 189 116
AAQCH 5969 4 3
otherwise
5965 5965 5965
______________________________________
In order to evaluate decision conditions by the decision predicates 1, 2, and 3, a description length of the decision condition and a description length of precision of the decision condition are calculated according to the present invention. The description length of the decision condition is given by the sum of description lengths of individual clauses in the decision predicate. The description length of each clause is given by the sum of a bit number necessary for encoding the category and a bit number necessary for encoding the items or the sequence motifs in the clause. The bit number necessary for encoding the category is given by log M, where M is a number of categories. The bit number necessary for encoding the motifs is calculated in the following manner. When the sequence motif consists of N (integer) symbols and the sequence motif does not contain X, the sequence motif is one of 20.sup.N symbol sequences because there are 20 amino acids. Therefore, the necessary bit number is given by N.times.log 20. When the sequence motif contains "X", the necessary bit number is calculated by log .sub.N C.sub.x +(N-x).times.log 20. In the formula, x is a number of "X" in the sequence motif, .sub.N C.sub.x is a number of combinations for selecting x positions from N positions in the sequence motif. Since 20 amino acids can be distributed in (N-x) positions, 20.sup.N-x symbol sequences are encoded and the necessary bit number therefor is given by (N-x).times.log 20. Now, referring to FIG. 2, description will be made as regards calculation of the description length of the decision condition in connection with the decision predicate 2. At first, the clause number C and the description length L are set 1 and 0, respectively, at step 21. At step 22, it is decided whether or not C exceeds the total number of clauses in the decision predicate. Since the decision predicate 2 has two clauses, step proceeds from 22 to 23. At step 23, an item number B is set 1. Then, it is decided at step 24 whether or not B exceeds the number of items or sequence motifs in the clause. Since the first clause in the decision predicate 2 has two items or sequence motifs, step proceeds from 24 to 25. At step 25, calculation is performed for the following equations: L=L+log .sub.N C.sub.x +(N-x).times.log 20 (1) B=B+1 (2) The first sequence motif "CXXCH" has five symbols and two "X". Therefore, the equations (1) and (2) are calculated as follows: L=0+log .sub.5 C.sub.2 +(5-2).times.log 20=16.3 B=1+1=2 Thereafter, step returns from 25 to 24. B does not exceed the number (2) of items or sequence motifs in the first clause. Therefore, equations (1) and (2) are again calculated for the second second motif in the first clause at step 25. The second sequence motif "PGXKM" has five symbols and one X. Therefore, equations (1) and (2) are calculated as follows: L=16.3+log .sub.5 C.sub.1 +(5-1).times.log 20=35.9 B=2+1=3 Then, step again returns from 25 to 24. Since B exceeds the item or sequence motif number in the first clause, step proceeds from 24 to 26. At step 26, the following equations (3) and (4) are calculated. L=L+log M (3) C=C+1 (4) The number of categories or super families of proteins is about 1000. Thus, equations (3) and (4) are calculated by: L=35.9+log 1000=45.9 C=1+1=2 Thereafter, step returns to 22 from 26. Since C does not exceed the total clause number, step proceeds from 22 to 23 and B is reset 1. The second clause does not have any motif, step proceeds to 26 from 24. The equations (3) and (4) are again calculated as follows: L=45.9+log 1000=55.9 C=2+1=3 Then, it is decided at step 22 that C exceeds the total clause number. Accordingly, the calculation is completed and the description length of the decision condition by the decision predicate 2 are given by L=55.9 (bits). The description length of precision of the decision condition is calculated by a bit number necessary for encoding the probability distribution of the number of sequences discriminated by each clause in the decision predicate. It is provided that the number of sequences corresponding to an i-th clause in the decision predicate, the actual number of sequences corresponding to a desired category and the actual number of sequences excluded from the desired category are represented by Ni, Ni.sup.+ and Ni.sup.-, respectively. Ni is a sum of Ni.sup.+ and Ni.sup.-, that is, Ni=Ni.sup.+ +Ni.sup.-. The probability of the case is expressed by: pi.sup.Ni.spsp.+ .times.(1-pi).sup.Ni.spsp.-, where pi=Ni.sup.+ /Ni. The bit number necessary for encoding the probability is given by: -log pi.sup.Ni.spsp.+ .times.(1-pi).sup.Ni.spsp.-. Therefore, the total bit number Lp for all of the clauses in the decision predicate is expressed by: ##EQU1## Providing that pi and pi represent a maximum likelihood estimate of pi and a random variable, respectively, the total bit number Lp is rewritten as follows: ##EQU2## In the formula, H(pi) and D(pi .parallel. pi) are an entropy function and an information amount by Kullbach-Leibler, respectively, as determined by the following equations: H(p)=-p log p-(1-p) log (1-p) D(pi .parallel. pi)=pi.times.(log pi-log pi)+(1-pi) log (1-pi)-log (1-pi). Now, providing that pi and pi are approximately equal to each other, that is, pi=pi=(Ni.sup.+ +1)/(Ni+2), the total bit number Lp can be approximately expressed by: ##EQU3## The equation of the total bit number Lp is a monotonous decreasing function. That is, Lp is large as pi decreases to 0, but ecreases to 0 as pi increases to 1. The total bit number Lp expressed by the entropy function is considered minimum in view of encoding according to the Shanon's first law. Now, referring to FIG. 3, description will be made as regards calculation of the description length of the precision of the decision condition in connection with the decision predicate 2. At first, the clause number C and the description length Lp are set 1 and 0, respectively, at step 31. Then, it is decided at step 32 whether or not C exceeds the total clause number of the decision predicate. Since the total clause number is 2, step proceeds from 32 to 33. At step 33, the following equations (5) and (6) are calculated: Lp=Lp+Ni.times.H(pi) (5) C=C+1 (6) According to the amino acid sequence data bank of PIR 18.0, 78 sequences of all of 6158 sequences actually have the both of sequence motifs "CXXCH" and "PGXKM", and all of the 78 sequences actually correspond to cytochrome c as shown at Decision predicate 2 in Table 2. It is provided from the actual data that N1 and p1 are 78 and (78+1)/(78+2), respectively. Therefore, equations (5) and (6) are calculated as follows: Lp=0+78.times.H(79/80)=7.8 (bits) C=1+1=2 Then, C is compared with the total clause number at step 32. Since C is not larger than the total clause number, equations (5) and (6) are calculated in connection with the second clause at step 33. According to the data bank PIR 18.0, 6080 sequences do not have both of the motif patterns "CXXCH" and "PGXKM" and 6039 sequences of them do not correspond to cytochrome c as shown at Decision Predicate 2 in Table 1. Accordingly, N2 and p2 are 6080 and (6039+1)/(6080+2), respectively. Therefore, equations (5) and (6) are calculated as follows: Lp=7.8+6080.times.H(6040/6082)=369.3 (bits) C=2+1=3 Thus, the calculation is completed since C exceeds the total clause number, and the description length of the precision of the decision condition is given by 369.3 bits in connection with the decision predicate 2. Similar calculations are performed for the decision predicates 1 and 3, and the results are shown in the following Table 2.
TABLE 2
______________________________________
Decision Decision Description Length
predicate condition
condition precision
sum
______________________________________
1 CXXCH 36.2 230.0 266.2
CXXCH
2 and 55.8 369.3 425.1
PGXKM
CXXCH
3 or 67.8 199.7 267.5
PGXKM
______________________________________
Table 2 shows that the use of the sequence motif "CXXCH" alone gives the minimum description length. Accordingly, it will be understood from the MDL principle that the sequence motif "CXXCH" is optimum as the decision condition for discrimination of cytochrome c. In order to discriminate mitochondria cytochrome c, sequence motif "CXXCH ", "PGXKM" and "AAQCH" are used in three decision predicates 1', 2', and 3' similar to the decision predicates 1, 2, and 3 and the description length are calculated by the similar way. The results are shown in Tables 3 and 4 corresponding to Tables 1 and 2, respectively.
TABLE 3
______________________________________
Total number
Number of Number of
Clauses of sequences
targets correct sequences
______________________________________
Decision Predicate 1'
CXXCH 6158 189 70
otherwise
5969 5969 5966
Decision Predicate 2'
CXXCH and
6158 78 70
PGTKM
otherwise
6080 6080 6073
Decision Predicate 3'
CXXCH 6158 189 70
AAQCH 5969 4 3
otherwise
5965 5965 5965
______________________________________
It will be understood from Table 4 that the use of sequence motifs "CXXCH" and "PGXKM" is optimum as the decision condition for discriminating mitochondria cytochrome c. EXAMPLE 2 Another type of the decision predicate is known for discriminating sub-categories. The decision predicate of this type uses predicates of "search (P, J, R,)", "match (P, J, R)" and "any (Min, Max, J, R)". The phrase "search (P, J, R)" means that when the sequence motif P is searched in the given sequence J, the residual portion of the given sequence is made as R, and back-tracking is performed. The phrase "match (P, J, R)" means that when a leading pattern of the sequence J is corresponding to the sequence motif P, the residual portion of the sequence J is made as R. The phrase "any (Min, Max, J, R)" means that a number of symbols are omitted from a leading symbol of the sequence J and the residual portion of the sequence J is made as R. The omitted sybol number is given by Min and Max as the minimum and the maximum numbers. Therefore, the back-tracking is repeated by (Max-Min) times. Two examples of the decision predicate of this type are demonstrated below. DECISION PREDICATE 4
______________________________________
motif (J, s-cytochrome c): search ("CXXCH",
J, J1), search ("MP", J1, -).
motif (J, otherwise)
______________________________________
This decision predicate 4 means that when the given sequence J has a first portion corresponding to the sequence motif "CXXCH" and has a second portion corresponding to the sequence motif "MP" in the following sequence portion after the first portion, the sequence J corresponds to the s-type cytochrome c. If the sequence J does not have those sequence motifs in the order, it is otherwise. DECISION PREDICATE 5
______________________________________
motif (J, s-cytochrome c): search ("CXXCH",
J, J1), any (36, 46, J1, J2),
match ("MP", J2, -)
motif (J, otherwise)
______________________________________
This decision predicate 5 means that when the given sequences J has first and second portions corresponding to the sequence motifs "CXXCH" and "MP", respectively, and when a distance from the first portion to the second portion is given by a symbol number between Min and Max, the sequence J correspond to s-type cytochrome c. However, when the sequence J does not fulfill the conditions, it is otherwise. Table 5 shows distribution of sequences in the data bank PIR 18.0 corresponding to each of the clauses in each of the decision predicates 4 and 5.
TABLE 5
______________________________________
Total number
Number of Number of
Clauses of sequences
targets correct sequences
______________________________________
Decision Predicate 4
CXXCH 6158 74 25
and MP
otherwise
6084 6084 6082
Decision Predicate 5
CXXCH and
6158 25 25
d(36, 46)
and MP
otherwise
6133 6133 6131
______________________________________
When the decision condition does not include a distance information, the description length of the decision condition can be calculated in the similar manner as described in connection with FIG. 2. When the decision predicate includes a distance information by the "any" predicate, the description length of the distance information should be calculated as a component for determining the description length of the decision condition. The following three methods are considered for encoding the distance information: 1. Encoding both of the minimum and the maximum values of Min and Max, 2. Encoding the maximum value Max and a difference or distance between the maximum and the minimum values of Max and Min, and 3. Encoding the minimum value Min and a distance or a difference between the maximum and the minimum value of Max and Min. Since the distance between the maximum and the minimum values Max and Min is always smaller than the maximum value Max, the description length or the bit number necessary for encoding the distance information is made minimum by the use of the third encoding method. Thus, the bit number necessary for encoding the distance information is given by the following formula: log (Min)+log (Max-Min). The calculation of the description length of the decision condition is performed according to the flow chart shown in FIG. 4. It will be understood by comparison of FIG. 2 with FIG. 4 that the flow chart of FIG. 4 is similar to FIG. 2 except for steps 45 and 47. In detail, steps 21-24, 25 and 26 in FIG. 2 is corresponding to steps 41-44, 46 and 48 in FIG. 4, respectively. At step 45, phrase in each clause is searched. When the phrase is "search" or "match", step proceeds to 46. When the phrase is "any", step proceeds to 47. At step 47, the bit number for the distance information is calculated. That is, the following equations (5) and (6) are calculated. L=L+log (Min)+log (Max-Min) (7) B=B+1 (8) Now, description will be made as regards the calculation of the description length of decision condition in the decision predicate 5 with reference to FIG. 4. The operation of step 41 through step 44 is similar to that of step 21 through step 24 in FIG. 2. Since the first phrase in the first clause is "search", step proceeds to 46. The first motif pattern "CXXCH" has 5 symbols and the number of "X" is 2. Therefore, the equations (1) and (2) are calculated as follows: L=0+log .sub.5 C.sub.2 +3 log 20=16.3 B=1+1=2 Then, the second phrase in the first clause is detected as "any" at step 45, and step proceeds to 47. Since Max and Min are 46 and 36, respectively, the equations (7) and (8) are calculated as follows: L=16.3+log 36+log (46-36)=24.8 B=2+1=3 Thereafter, the third phrase in the first clause is detected as "match" at step 46. Therefore, the step proceeds to 46. The sequence motif "MP" in the third phrase has two symbols and no "X". Therefore, the equations (1) and (2) are calculated as follows: L=24.8+log .sub.2 C.sub.0 +2 log 20=33.3 B=3+1=4 Thereafter, the step proceeds to 48 from 44, and equations (3) and (4) are calculated as follows: L=33.3+log 1000=43.3 C=1+1=2 Thereafter, the step again proceeds to 48 through steps 42, 43 and 44 because the total number of items in the final clause is 0. At step 48, equations (3) and (4) are again calculated as follows:ps L=43.3+log 1000=53.3 C=2+1=3 Thus, the calculation is completed because C exceeds the total clause number, and the description length is obtained as 53.3. The description length of the precision of the decision condition are calculated in the manner shown in FIG. 3. Similarly, description length of the decision condition was calculated in connection with the decision predicate 4. Those results are shown in Table 6.
TABLE 6
______________________________________
Decision Decision Description Length
predicate condition
condition precision
sum
______________________________________
4 CXXCH 44.8 105.9 150.7
and
MP
5 CXXCH 53.3 43.0 96.3
and
d(36, 46)
and
MP
______________________________________
It will be understood from Table 6 that the condition by decision predicate 5 is optimum for discriminating s-type cytochrome c. Referring to FIG. 5, a device for calculating the description lengths of motifs and precision of the motifs according to the flow charts of FIGS. 2, 3, and 4. The device comprises an input terminal 51 for inputting a decision predicate. The input terminal 51 is usually a keyboard. The input decision predicate is supplied to an analyzer 52. The analyzer 52 analyzes the input decision predicate and supplies information of the number of clauses, the number of motifs or items in each clause, and the items to an input memory 53. The input memory 53 stores the information as shown in FIG. 6. When the analyzer 52 receives the input decision predicate, it reports the reception to a controller 54. In response to the report, the controller 54 reads out a program from a read only memory (ROM) 55. The ROM 55 memorizes the program for calculation of the description lengths of the decision condition and the precision according to the flow charts shown in FIG. 2 or FIG. 4 and FIG. 3, respectively. The ROM 55 also memorizes the actual data for correct sequences in the data bank shown in, for example, Tables 1, 3, and 5. According to the program, the controller 54 controls a C register 56, a DL register 57 and a B register 58 for holding C, L and B, respectively, and calculator 59 for calculating L, B and C, as described in connection with FIGS. 2 or 4 and 3. The calculator 59 calculates the description length L with reference to the input memory 53 and ROM 55 under control of the controller 54.
|
Same subclass Same class Consider this |
||||||||||
