Method for clustering sequences in groups
6304868

Abstract
To cluster sequences into biological groups, conventional database search programs are called iteratively in order to gather the various sequences related to a given protein sequence. The inventive method enables fully automatic distribution of a large number of protein sequences into groups. Most of these groups are segregated from one another, so that they represent a meaningful and valid grouping of the data.

Claims
What is claimed is:

Description
This invention concerns a method of grouping sequences in families.
Sequences found:                                                Probability:
SPR|Q05925|HME1_HUMAN HOMEOBOX PROTEIN ENGRAILED-1 (HU-E...     2.4e-279
SPR|P09065|HME1_MOUSE HOMEOBOX PROTEIN ENGRAILED-1 (MO-E...     4.7e-189
SPR|Q05916|HME1_CHICK HOMEOBOX PROTEIN ENGRAILED-1 (GG-E...     4.6e-132
SPR|Q05917|HME2_CHICK HOMEOBOX PROTEIN ENGRAILED-2 (GG-E...     3.6e-95
SPR|P19622|HME2_HUMAN HOMEOBOX PROTEIN ENGRAILED-2 (HU-E...     8.0e-95
SPR|P09066|HME2_MOUSE HOMEOBOX PROTEIN ENGRAILED-2 (MO-E...     5.6e-92
SPR|P09015|HME2_BRARE HOMEOBOX PROTEIN ENGRAILED-2 (ZF-E...     5.2e-70
SPR|P31538|HMEB_XENLA HOMEOBOX PROTEIN ENGRAILED-1B (EN-...     1.5e-x
SPR|P52729|HMEC_XENLA HOMEOBOX PROTEIN ENGRAILED-2A (EN-...     2.1e-66
SPR|P52730|HMED_XENLA HOMEOBOX PROTEIN ENGRAILED-2B (EN-...     2.1e-65
SPR|P31533|HME3_BRARE HOMEOBOX PROTEIN ENGRAILED-3 (ZF-E...     6.1e-64
SPR|Q04896|HME1_BRARE HOMEOBOX PROTEIN ENGRAILED-1.             1.0e-61
SPR|P09145|HMEN_DROVI SEGMENTATION POLARITY PROTEIN ENGR...     9.1e-59
SPR|P05527|HMIN_DROME INVECTED PROTEIN.                         4.5e-57
SPR|P27609|HMEN_BOMMO SEGMENTATION POLARITY PROTEIN ENGR...     1.5e-55
SPR|P27610|HMIN_BOMMO INVECTED PROTEIN.                         1.1e-52
SPR|P09532|HMEN_TRIGR HOMEOBOX PROTEIN ENGRAILED (SU-HB...      4.0e-44
SPR|Q05640|HMEN_ARTSF HOMEOBOX PROTEIN ENGRAILED.               1.7e-42
SPR|P09076|HME3_APIME HOMEOBOX PROTEIN E30 (FRAGMENT).          2.1e-41
SPR|P09075|HME6_APIME HOMEOBOX PROTEIN E60 (FRAGMENT).          1.0e-40
SPR|P14150|HMEN_SCHAM HOMEOBOX PROTEIN ENGRAILED (G-EN...       2.3e-40
SPR|P23397|HMEN_HELTR HOMEOBOX PROTEIN HT-EN (FRAGMENT).        1.3e-38
SPR|P31537|HMEA_XENLA HOMEOBOX PROTEIN ENGRAILED-1A (EN-...     7.9e-33
SPR|P31535|HMEA_MYXGL HOMEOBOX PROTEIN ENGRAILED-LIKE A...      1.1e-27
SPR|P34326|HM16_CAEEL HOMEOBOX PROTEIN ENGRAILED-LIKE CE...     7.1e-27
SPR|P31536|HMEB_MYXGL HOMEOBOX PROTEIN ENGRAILED-LIKE B...      5.0e-26
On the basis of the threshold at 10^-30, our cluster now contains the following sequences:
HME1_HUMAN, HME1_MOUSE, HME1_CHICK, HME2_CHICK, HME2_HUMAN,
HME2_MOUSE, HME2_BRARE, HMEB_XENLA, HMEC_XENLA, HMED_XENLA,
HME3_BRARE, HME1_BRARE, HMEN_DROVI, HMIN_DROME, HMEN_BOMMO,
HMEN_DROME, HMIN_BOMMO, HMEN_TRIGR, HMEN_ARTSF, HME3_APIME,
HME6_APIME, HMEN_SCHAM, HMEN_HELTR, HMEA_XENLA.
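The thresholding step can be illustrated with a minimal Python sketch. It uses only the last six rows of the result excerpt; the names `CUTOFF`, `hits` and `split_hits` are hypothetical and chosen for illustration, and the code is not part of the described method's implementation.

```python
# Probability threshold stated in the text (10^-30).
CUTOFF = 1e-30

# Last six hits of the excerpt above, as (entry name, probability) pairs.
hits = [
    ("HMEN_SCHAM", 2.3e-40),
    ("HMEN_HELTR", 1.3e-38),
    ("HMEA_XENLA", 7.9e-33),
    ("HMEA_MYXGL", 1.1e-27),
    ("HM16_CAEEL", 7.1e-27),
    ("HMEB_MYXGL", 5.0e-26),
]


def split_hits(hits, cutoff=CUTOFF):
    """Separate hits whose probability is below the cutoff from the rest."""
    accepted = [name for name, p in hits if p < cutoff]
    rejected = [name for name, p in hits if p >= cutoff]
    return accepted, rejected


accepted, rejected = split_hits(hits)

# The weakest accepted hit is the one with the largest probability
# that still passes the cutoff.
weakest = max((h for h in hits if h[1] < CUTOFF), key=lambda h: h[1])[0]

print(accepted)
print(rejected)
print(weakest)
```

With these six rows, the three hits at 10^-33 or better are kept and the three hits between 10^-27 and 10^-26 are left out, which matches the cluster membership listed above.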
The next BLASTP run is then carried out with the lowest-scoring sequence in this set, namely the engrailed-1A homeobox protein of the African clawed frog, Xenopus laevis (HMEA_XENLA). The result of this search looks as follows (excerpts):
Sequences found:                                                Probability:
SPR|P31538|HMEB_XENLA HOMEOBOX PROTEIN ENGRAILED-1B (EN-...     2.8e-36
SPR|P31537|HMEA_XENLA HOMEOBOX PROTEIN ENGRAILED-1A (EN-...     3.2e-36
SPR|P09015|HME2_BRARE HOMEOBOX PROTEIN ENGRAILED-2 (ZF-E...     1.1e-34
SPR|Q05925|HME1_HUMAN HOMEOBOX PROTEIN ENGRAILED-1 (HU-E...     1.3e-33
SPR|P09065|HME1_MOUSE HOMEOBOX PROTEIN ENGRAILED-1 (MO-E...     1.4e-33
SPR|Q05916|HME1_CHICK HOMEOBOX PROTEIN ENGRAILED-1 (GG-E...     1.5e-33
SPR|P52729|HMEC_XENLA HOMEOBOX PROTEIN ENGRAILED-2A (EN-...     9.9e-33
SPR|P52730|HMED_XENLA HOMEOBOX PROTEIN ENGRAILED-2B (EN-...     5.9e-32
SPR|Q05917|HME2_CHICK HOMEOBOX PROTEIN ENGRAILED-2 (GG-E...     1.3e-31
SPR|P09066|HME2_MOUSE HOMEOBOX PROTEIN ENGRAILED-2 (MO-E...     1.8e-31
SPR|P19622|HME2_HUMAN HOMEOBOX PROTEIN ENGRAILED-2 (HU-E...     2.0e-31
SPR|Q04896|HME1_BRARE HOMEOBOX PROTEIN ENGRAILED-1.             8.1e-31
SPR|P31535|HMEA_MYXGL HOMEOBOX PROTEIN ENGRAILED-LIKE A...      4.3e-30
SPR|P31533|HME3_BRARE HOMEOBOX PROTEIN ENGRAILED-3 (ZF-E...     6.7e-30
SPR|P31536|HMEB_MYXGL HOMEOBOX PROTEIN ENGRAILED-LIKE B...      1.8e-28
SPR|P09532|HMEN_TRIGR HOMEOBOX PROTEIN ENGRAILED (SU-HB-...     8.8e-28
SPR|P31534|HMEN_LAMPL HOMEOBOX PROTEIN ENGRAILED-LIKE (E...     8.8e-28
SPR|P09075|HME6_APIME HOMEOBOX PROTEIN E60 (FRAGMENT).          2.1e-26
SPR|P23397|HMEN_HELTR HOMEOBOX PROTEIN HT-EN (FRAGMENT).        2.3e-26
SPR|P09076|HME3_APIME HOMEOBOX PROTEIN E30 (FRAGMENT).          3.9e-26
Let us again consider all sequences having a probability lower than 10^-30; we find that, except for HMEA_MYXGL, all of them are already contained in the cluster. This sequence is now included in the cluster, and the next BLASTP search is started with it. This search yields the following result (excerpts):
Sequences found:                                                Probability:
SPR|P31535|HMEA_MYXGL HOMEOBOX PROTEIN ENGRAILED-LIKE A...      3.8e-36
SPR|P31534|HMEN_LAMPL HOMEOBOX PROTEIN ENGRAILED-LIKE (E...     1.5e-30
SPR|P31538|HMEB_XENLA HOMEOBOX PROTEIN ENGRAILED-1B (EN-...     1.8e-30
SPR|P31537|HMEA_XENLA HOMEOBOX PROTEIN ENGRAILED-1A (EN-...     3.8e-30
SPR|P52729|HMEC_XENLA HOMEOBOX PROTEIN ENGRAILED-2A (EN-...     4.9e-29
SPR|P09015|HME2_BRARE HOMEOBOX PROTEIN ENGRAILED-2 (ZF-E...     1.4e-28
SPR|Q05925|HME1_HUMAN HOMEOBOX PROTEIN ENGRAILED-1 (HU-E...     1.7e-28
SPR|P09065|HME1_MOUSE HOMEOBOX PROTEIN ENGRAILED-1 (MO-E...     1.8e-28
SPR|P09066|HME2_MOUSE HOMEOBOX PROTEIN ENGRAILED-2 (MO-E...     3.1e-28
SPR|P19622|HME2_HUMAN HOMEOBOX PROTEIN ENGRAILED-2 (HU-E...     3.3e-28
SPR|Q05916|HME1_CHICK HOMEOBOX PROTEIN ENGRAILED-1 (GG-E...     4.6e-28
SPR|P52730|HMED_XENLA HOMEOBOX PROTEIN ENGRAILED-2B (EN-...     2.1e-27
SPR|Q05917|HME2_CHICK HOMEOBOX PROTEIN ENGRAILED-2 (GG-E...     2.2e-27
SPR|P09075|HME6_APIME HOMEOBOX PROTEIN E60 (FRAGMENT).          2.9e-27
SPR|P23397|HMEN_HELTR HOMEOBOX PROTEIN HT-EN (FRAGMENT).        4.4e-27
SPR|Q04896|HME1_BRARE HOMEOBOX PROTEIN ENGRAILED-1.             4.9e-27
SPR|P09076|HME3_APIME HOMEOBOX PROTEIN E30 (FRAGMENT).          5.4e-27
SPR|P31533|HME3_BRARE HOMEOBOX PROTEIN ENGRAILED-3 (ZF-E...     2.0e-26
SPR|P31536|HMEB_MYXGL HOMEOBOX PROTEIN ENGRAILED-LIKE B...      8.8e-26
This time we add HMEN_LAMPL to our cluster, and we start the next BLASTP search with this sequence, yielding the following result (excerpt):
Sequences found:                                                Probability:
SPR|P31534|HMEN_LAMPL HOMEOBOX PROTEIN ENGRAILED-LIKE (E...     5.7e-37
SPR|P31535|HMEA_MYXGL HOMEOBOX PROTEIN ENGRAILED-LIKE A...      5.0e-31
SPR|P31538|HMEB_XENLA HOMEOBOX PROTEIN ENGRAILED-1B (EN-...     1.4e-28
SPR|P31537|HMEA_XENLA HOMEOBOX PROTEIN ENGRAILED-1A (EN-...     2.9e-28
SPR|P23397|HMEN_HELTR HOMEOBOX PROTEIN HT-EN (FRAGMENT).        1.2e-27
SPR|P31536|HMEB_MYXGL HOMEOBOX PROTEIN ENGRAILED-LIKE B...      1.4e-27
SPR|Q04896|HME1_BRARE HOMEOBOX PROTEIN ENGRAILED-1.             1.5e-27
SPR|P09015|HME2_BRARE HOMEOBOX PROTEIN ENGRAILED-2 (ZF-E...     6.9e-27
SPR|Q05925|HME1_HUMAN HOMEOBOX PROTEIN ENGRAILED-1 (HU-E...     1.5e-26
SPR|P09065|HME1_MOUSE HOMEOBOX PROTEIN ENGRAILED-1 (MO-E...     1.6e-26
SPR|P09075|HME6_APIME HOMEOBOX PROTEIN E60 (FRAGMENT).          1.9e-26
SPR|Q05916|HME1_CHICK HOMEOBOX PROTEIN ENGRAILED-1 (GG-E...     4.5e-26
Among the hits that pass the threshold, we do not find any sequences that are not already contained in our cluster, so the SYSTERS search for this inquiry sequence is now concluded, and the cluster contains the following 26 sequences:
HME1_HUMAN, HME1_MOUSE, HME1_CHICK, HME2_CHICK, HME2_HUMAN,
HME2_MOUSE, HME2_BRARE, HMEB_XENLA, HMEC_XENLA, HMED_XENLA,
HME3_BRARE, HME1_BRARE, HMEN_DROVI, HMIN_DROME, HMEN_BOMMO,
HMEN_DROME, HMIN_BOMMO, HMEN_TRIGR, HMEN_ARTSF, HME3_APIME,
HME6_APIME, HMEN_SCHAM, HMEN_HELTR, HMEA_XENLA, HMEA_MYXGL,
HMEN_LAMPL.
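The iteration walked through above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SYSTERS implementation: `grow_cluster`, `mock_search` and `MOCK_HITS` are hypothetical names, the search is replaced by a lookup in a small table of values copied from the excerpts, and the next inquiry sequence is assumed to be the weakest newly accepted hit. Because the worked example accepts hits such as 4.3e-30 and 1.5e-30, the sketch also assumes that the cutoff is applied to the order of magnitude, i.e. p < 1e-29.

```python
# Abbreviated search results copied from the excerpts (hypothetical stand-in
# for a real BLASTP call): inquiry sequence -> list of (entry name, probability).
MOCK_HITS = {
    "HME1_HUMAN": [("HME1_HUMAN", 2.4e-279), ("HME1_MOUSE", 4.7e-189),
                   ("HME3_BRARE", 6.1e-64), ("HMEA_XENLA", 7.9e-33),
                   ("HMEA_MYXGL", 1.1e-27), ("HMEB_MYXGL", 5.0e-26)],
    "HMEA_XENLA": [("HMEA_XENLA", 3.2e-36), ("HME1_HUMAN", 1.3e-33),
                   ("HME1_MOUSE", 1.4e-33), ("HMEA_MYXGL", 4.3e-30),
                   ("HME3_BRARE", 6.7e-30), ("HMEB_MYXGL", 1.8e-28),
                   ("HMEN_LAMPL", 8.8e-28)],
    "HMEA_MYXGL": [("HMEA_MYXGL", 3.8e-36), ("HMEN_LAMPL", 1.5e-30),
                   ("HMEA_XENLA", 3.8e-30), ("HME1_HUMAN", 1.7e-28),
                   ("HMEB_MYXGL", 8.8e-26)],
    "HMEN_LAMPL": [("HMEN_LAMPL", 5.7e-37), ("HMEA_MYXGL", 5.0e-31),
                   ("HMEA_XENLA", 2.9e-28), ("HMEB_MYXGL", 1.4e-27)],
}


def mock_search(query):
    """Return the stored hits for a query (stand-in for a database search)."""
    return MOCK_HITS[query]


def grow_cluster(seed, search, cutoff=1e-29):
    """Grow a cluster from a seed by repeated searches.

    Assumed rule: after each search, add every hit below the cutoff and
    continue with the weakest newly added hit; stop when nothing new appears.
    """
    cluster = [seed]
    searched = []
    query = seed
    while query is not None:
        searched.append(query)
        accepted = [(name, p) for name, p in search(query) if p < cutoff]
        new = [(name, p) for name, p in accepted if name not in cluster]
        cluster.extend(name for name, _ in new)
        # Guard against re-running a query that was already searched.
        candidates = [(name, p) for name, p in new if name not in searched]
        query = max(candidates, key=lambda h: h[1])[0] if candidates else None
    return cluster, searched


cluster, searched = grow_cluster("HME1_HUMAN", mock_search)
print(searched)
print(cluster)
```

On this reduced data the sketch runs four searches, with HME1_HUMAN, HMEA_XENLA, HMEA_MYXGL and HMEN_LAMPL as inquiry sequences, which is the same order as in the walkthrough; HMEB_MYXGL never passes the cutoff and stays outside the cluster.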
If this procedure is performed for all 28 sequences annotated as homeobox engrailed in the Swissprot database, it yields 28 clusters at first. The clusters thus found are plotted against the sequences in the following table, where each column represents the cluster belonging to the inquiry sequence listed at the head of the table, and each row indicates the clusters in which the sequence listed at the left is contained (marked with an X). In this case, there are seven clusters having 27 sequences each, five clusters having 26 sequences each, etc.
Cluster (inquiry sequence)
Sequence    1 (HME2_BRARE)  2 (HME2_MOUSE)  3 (HMEB_XENLA)  4 (HMEC_XENLA)  5 (HMED_XENLA)  6 (HMEN_LAMPL)  7 (HMEA_MYXGL)  8 (HMEA_XENLA)
HME2_BRARE  X X X X X X X X
HME2_MOUSE  X X X X X X X X
HMEB_XENLA  X X X X X X X X
HMEC_XENLA  X X X X X X X X
HMED_XENLA  X X X X X X X X
HMEN_LAMPL  X X X X X X X
HMEA_MYXGL  X X X X X X X X
HMEA_XENLA  X X X X X X X X
HM16_CAEEL  X X X X X X X X
HME1_MOUSE  X X X X X X X X
HMEN_DROME  X X X X X X X X
HMIN_DROME  X X X X X X X X
HME6_APIME  X X X X X X X X
HME3_APIME  X X X X X X X X
HMEN_DROVI  X X X X X X X X
HMEN_TRIGR  X X X X X X X X
HMEN_SCHAM  X X X X X X X X
HMEN_HELTR  X X X X X X X X
HMEN_BOMMO  X X X X X X X X
HME3_BRARE  X X X X X X X X
HME1_BRARE  X X X X X X X X
HMIN_BOMMO  X X X X X X X X
HMEN_ARTSF  X X X X X X X X
HME2_HUMAN  X X X X X X X X
HME1_CHICK  X X X X X X X X
HME2_CHICK  X X X X X X X X
HME1_HUMAN  X X X X X X X X
HMEB_MYXGL
Cluster (inquiry sequence)
Sequence    9 (HM16_CAEEL)  10 (HME1_MOUSE)  11 (HMEN_DROME)  12 (HMIN_DROME)  13 (HME6_APIME)  14 (HME3_APIME)  15 (HMEN_DROVI)  16 (HMEN_TRIGR)
HME2_BRARE  X X X X X X X X
HME2_MOUSE  X X X X X X X X
HMEB_XENLA  X X X X X X X X
HMEC_XENLA  X X X X X X X X
HMED_XENLA  X X X X X X X X
HMEN_LAMPL
HMEA_MYXGL  X
HMEA_XENLA  X X X X X X X X
HM16_CAEEL  X X X X X X X X
HME1_MOUSE  X X X X X X X X
HMEN_DROME  X X X X X X X X
HMIN_DROME  X X X X X X X X
HME6_APIME  X X X X X X X X
HME3_APIME  X X X X X X X X
HMEN_DROVI  X X X X X X X X
HMEN_TRIGR  X X X X X X X X
HMEN_SCHAM  X X X X X X X X
HMEN_HELTR  X X X X X X X X
HMEN_BOMMO  X X X X X X X X
HME3_BRARE  X X X X X X X X
HME1_BRARE  X X X X X X X X
HMIN_BOMMO  X X X X X X X X
HMEN_ARTSF  X X X X X X X X
HME2_HUMAN  X X X X X X X X
HME1_CHICK  X X X X X X X X
HME2_CHICK  X X X X X X X X
HME1_HUMAN  X X X X X X X X
HMEB_MYXGL
Cluster (inquiry sequence)
Sequence    17 (HMEN_SCHAM)  18 (HMEN_HELTR)  19 (HMEN_BOMMO)  20 (HME3_BRARE)  21 (HME1_BRARE)  22 (HMIN_BOMMO)  23 (HMEN_ARTSF)
HME2_BRARE  X X X X X X X
HME2_MOUSE  X X X X X X X
HMEB_XENLA  X X X X X X X
HMEC_XENLA  X X X X X X X
HMED_XENLA  X X X X X X X
HMEN_LAMPL
HMEA_MYXGL
HMEA_XENLA  X X X X X
HM16_CAEEL  X X X X X X
HME1_MOUSE  X X X X X X X
HMEN_DROME  X X X X X X X
HMIN_DROME  X X X X X X X
HME6_APIME  X X X X X X X
HME3_APIME  X X X X X X X
HMEN_DROVI  X X X X X X X
HMEN_TRIGR  X X X X X X X
HMEN_SCHAM  X X X X X X X
HMEN_HELTR  X X X X X X X
HMEN_BOMMO  X X X X X X X
HME3_BRARE  X X X X X X X
HME1_BRARE  X X X X X X X
HMIN_BOMMO  X X X X X X X
HMEN_ARTSF  X X X X X X X
HME2_HUMAN  X X X X X X X
HME1_CHICK  X X X X X X X
HME2_CHICK  X X X X X X X
HME1_HUMAN  X X X X X X X
HMEB_MYXGL
Cluster (inquiry sequence)
Sequence    24 (HME2_HUMAN)  25 (HME1_CHICK)  26 (HME2_CHICK)  27 (HME1_HUMAN)  28 (HMEB_MYXGL)
HME2_BRARE  X X X X
HME2_MOUSE
After removing identical clusters and resolving inclusions, the homeobox engrailed proteins are distributed among two clusters: one with 27 sequences and the other containing only the HMEB_MYXGL sequence.
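The consolidation step can be sketched in Python as follows. This is a minimal illustration, assuming that resolving inclusions means discarding every cluster that is fully contained in a larger one; the function name `consolidate` and the toy entries A to D are hypothetical and do not come from the table.

```python
def consolidate(clusters):
    """Drop duplicate clusters and clusters contained in a larger cluster."""
    # Identical clusters collapse into one when stored as frozensets in a set.
    unique = {frozenset(c) for c in clusters}
    # Keep only clusters that are not a proper subset of another cluster.
    maximal = [c for c in unique if not any(c < other for other in unique)]
    # Sort for a stable, readable result: largest clusters first.
    return sorted(maximal, key=lambda c: (-len(c), sorted(c)))


# Toy input: two identical clusters, one included cluster, one singleton.
raw_clusters = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"D"},
]

final = consolidate(raw_clusters)
for cluster in final:
    print(sorted(cluster))
```

In the toy input, the four raw clusters reduce to two: one large cluster and one singleton that shares no member with it.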
