System and method for a precompiled database for biomolecular sequence information

6223186

Abstract

A computer system stores biomolecular data in a database in a memory. The biomolecular database has a set of entities. Each entity stores attributes for a plurality of entries. At least one attribute is stored in an array. Data associated with an entry is stored at a location in the array. An entity offset designates the location of the data in the array. The same entity offset value is used to access data associated with a particular entry for all attributes within the entity.

Description

The present invention relates generally to a system and method for storing and retrieving biomolecular sequence information. More particularly, the invention relates to a system and method for storing biomolecular sequence information in a precompiled, modular format which allows for rapid retrieval of the information.
TABLE 1

First Character of Clone Name    Format of Clone Name
I                                I [number]
Y                                Y [a . . . 2] [01 . . . 99] [a . . . h] [01 . . . 12]
Z                                Z [a . . . 23] [01 . . . 99] [a . . . h] [01 . . . 12]
N                                N [a . . . 23] [01 . . . 99] [a . . . h] [01 . . . 12]
E                                EST[000001 . . . 999999]

The nomenclature of the "Y", "N", and "Z" clone names originates from the row and column location of the clone sample which is stored in the form of purified cDNA within a 96-well assay plate. For instance, one clone name format is "Y [letter][plate][row][column]." Not every location of an assay plate may have a "clone," and therefore clone names may not be "consecutive." The clone names are listed in ASCII files. A clone offset build procedure 142 (FIG. 5) builds the clone name mapping arrays. The clone offset build procedure receives an ASCII file listing all INCYTE clone names and identifies the largest clone number. Next, the clone offset build procedure creates an "I" array 144 (FIG. 5) for the "I" clone set by allocating sufficient storage in the I-array to store one offset value for every clone number up to the largest identified clone number. The clone offset build procedure populates each entry of the I-array with "-1"s to indicate that the entries are empty. The clone offset build procedure then searches the ASCII file of INCYTE clones and stores a "2" in each entry of the I-array that has an INCYTE clone. For example, if INCYTE clone 1 is in the ASCII file, then a "2" is stored in I-array entry 0. After the I-array is populated with "2"s, the clone offset build procedure sets a clone counter to "0" and searches for "2"s. At the first occurrence of a "2", the "2" is replaced with the clone counter value of "0." The clone counter is then incremented, and the next occurrence of "2" is replaced with the clone counter value of "1". The clone counter is incremented and the process of searching, replacing and incrementing repeats until the end of the I-array is reached. In this way, duplicate clone offset values are avoided. Note, for example, that the INCYTE clones are numbered sequentially starting with one and not every "number" will actually have a clone. 
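The I-array build just described can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `build_i_array` and the in-memory list of clone numbers (standing in for the ASCII file) are assumptions made for the example.

```python
EMPTY = -1  # marks an I-array entry that has no clone
MARK = 2    # temporary marker for "a clone exists here", as in the text


def build_i_array(clone_numbers):
    """Return the I-array: index (clone number - 1) -> clone offset, or -1."""
    largest = max(clone_numbers)
    i_array = [EMPTY] * largest          # one slot per possible clone number
    for number in clone_numbers:
        i_array[number - 1] = MARK       # clone 1 lives in entry 0
    counter = 0
    for index, value in enumerate(i_array):
        if value == MARK:                # replace each marker, in order,
            i_array[index] = counter     # with the next clone offset
            counter += 1
    return i_array


# Hypothetical input: clones 1, 2, 5 and 7 exist; 3, 4 and 6 do not.
print(build_i_array([1, 2, 5, 7]))  # [0, 1, -1, -1, 2, -1, 3]
```

Because the scan moves forward only, an offset that happens to equal the marker value is never revisited, so each existing clone receives a distinct, consecutive offset.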
In addition, because INCYTE clone numbers are sequential, the INCYTE clone number will equal the value of the offset into the I-array plus one.

The process described above is also used to generate clone offsets for the non-INCYTE clone names, such as "Y", "Z", "N" and "E." For instance, the "Y" clone names are received in an ASCII file, the maximum number is determined and a Y-array 146 (FIG. 5) is allocated and populated with "-1"s. The Y clone names in the received ASCII file are not assumed to be sorted in any particular order and the Y-array will typically have unpopulated portions. The Y clone names ASCII file is searched and a "2" is stored in each location of the Y-array having a corresponding Y clone name. The "2"s are replaced in a similar manner to that described above for the I-array. Note that the relationship between a Y-clone name and a Y-array offset is as follows:

Y-array offset = (letter - 'a')*9600 + (plate - 1)*96 + (row - 'a')*12 + (column - 1).

Z, N and E arrays 146 are allocated and populated with offsets to clone names in ASCII files using a procedure similar to that just described for the Y array 146. For the other or "O" type clone names, an O-array 148 (FIG. 5) is allocated based on data from an ASCII file. "O" type clone names are ordered names and are amenable to binary searching. Therefore, the position of an O type clone name entry in a corresponding O-array equals the O-array offset. In a preferred embodiment, the range of clone offset values assigned to each of the I, Y, Z, N, E and O arrays is stored by the clone offset build procedure in a range table 150 (FIG. 5) for later use.

Clone Offset Determination Procedure

The "clone offset" is the offset value used to access all the attributes of a particular clone that are stored in the clone entity. Referring to FIG.
6A, a clone name is passed to the clone offset determination procedure 98 as a parameter, and the clone offset determination procedure 98 returns the clone offset that points to the desired entry in the clone entity. Since a clone name begins with the character indicating which array to use, such as an "I" or "Y", the appropriate array can be identified quickly. For an INCYTE clone, since the I-array offset equals the INCYTE clone number minus one, the clone offset determination procedure subtracts one from the clone number to generate the I-array offset. The clone offset determination procedure then uses the generated I-array offset to access the clone offset value stored in the I-array and returns that clone offset value. For a name beginning with "Y," the clone offset determination procedure determines the Y-array offset using the following equation:

Y-array offset = (letter - 'a')*9600 + (plate - 1)*96 + (row - 'a')*12 + (column - 1).

The clone offset determination procedure then uses the Y-array offset to access the location storing the clone offset value in the Y-array and returns that clone offset value. The clone offsets for the "Z," "N" and "E" type clone names are determined in a manner similar to that for the "Y" clones. For a clone name beginning with "O", the clone offset determination procedure searches the O-array for the clone name and the position of the clone name within the O-array indicates its O-array offset value. Since the clone offset build procedure stored the minimum value of the clone offset of the O-array in the range table, the clone offset determination procedure adds the O-array offset value to the minimum value of the clone offset in the O-array to generate the clone offset.

Clone Name Determination Procedure

Referring to FIG. 6B, a clone offset is passed as a parameter to the clone name determination procedure 100, which returns the corresponding clone name.
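As an illustration of the clone offset determination for "I" and "Y" names, the following Python sketch applies the Y-array offset equation given above. The exact string layout of a clone name (no separators, two-digit plate and column, as in "Yb02c05" or "I5") and the function names are assumptions made for the example; the text does not fix them.

```python
def y_array_offset(name):
    """Y-array offset for a name assumed to look like 'Y' + letter + plate + row + column."""
    letter = name[1]            # assumed: one lowercase letter
    plate = int(name[2:4])      # assumed: two digits, 01..99
    row = name[4]               # assumed: one lowercase letter, a..h
    column = int(name[5:7])     # assumed: two digits, 01..12
    return ((ord(letter) - ord("a")) * 9600
            + (plate - 1) * 96
            + (ord(row) - ord("a")) * 12
            + (column - 1))


def get_clone_offset(name, i_array, y_array):
    """Return the clone offset stored in the mapping array selected by the first character."""
    if name[0] == "I":
        return i_array[int(name[1:]) - 1]   # I-array offset = clone number - 1
    if name[0] == "Y":
        return y_array[y_array_offset(name)]
    raise ValueError("only I and Y names are handled in this sketch")


# Hypothetical mapping arrays; -1 marks entries without a clone.
i_array = [0, 1, -1, -1, 2, -1, 3]
y_array = [-1] * (26 * 9600)
y_array[y_array_offset("Yb02c05")] = 4

print(y_array_offset("Yb02c05"))                     # 9724
print(get_clone_offset("I5", i_array, y_array))      # 2
print(get_clone_offset("Yb02c05", i_array, y_array)) # 4
```

Both branches are plain array lookups, which is why the first character of the name is enough to reach the clone offset without any searching.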
The clone name determination procedure 100 determines which array stores that clone offset from the value of the clone offset and the range table of clone offset values for the I, Y, Z, N, E and O arrays. For clone offset values in the I-array, the clone name determination procedure calculates the clone name as the clone offset plus one. For clone offset values in the Y, Z, N and E arrays, the clone name determination procedure searches the appropriate mapping array for the stored clone offset value and determines the corresponding array offset from the position of the clone offset value within the mapping array. The clone name determination procedure then applies an inverse function to that used to map that clone name to the array-index value to generate the clone name. For clone offset values in the O-array, the clone name determination procedure subtracts the minimum O-array clone offset value from the clone offset to determine the O-array offset value. The clone name determination procedure uses the O-array offset value to access the clone name stored at that O-array offset. The clone name determination procedure then returns the clone name.

Building the Modular Database

The modular database is populated with biomolecular information. An example of populating each of three types of attributes used in the database will be provided. Similar types of attributes will be populated in a similar manner.

Populating "Absolute" or Actual Attributes

As an example of populating entries of an entity with absolute data, the population of the library.name, library.type and library.usable attribute arrays will be described. The library "type" attribute is an integer and represents a library preparation procedure. The library "usable" attribute is also an integer and represents the number of usable clones in a library. An ASCII file storing library names with the type and usable information is provided. A library build procedure 152 (FIG.
5) sorts the ASCII file by library name, counts the number of entries in the ASCII file, and allocates space for the library.name, the library.type and the library.usable attribute arrays. Beginning with the first entry, library.name[0], the library names from the ASCII file are stored sequentially in the library.name array. The position of the library name within the library.name array corresponds to its library offset value. After a library name has been stored within the library.name attribute array, its corresponding type and usable attributes are stored in the type and usable attribute arrays.

Populating Direct Offset Attributes

As an example of populating entries with direct offset data, the population of the clone.library attribute array will be described. The clone.library attribute array is an array of integers that are offsets pointing to an associated entry in the library array. A build clone library offset procedure 154 (FIG. 5) populates the clone.library attribute array with the corresponding library offset values. After the build clone library offset procedure 154 populates the library.name attribute array thereby assigning library offsets to each library name, a populate clone.library procedure 156 (FIG. 5) updates the clone.library attribute array of the clone entity. An ASCII file mapping the clone names to a library name is provided. For each clone name in the ASCII file, the populate clone.library procedure calls the clone offset determination procedure using the clone name to determine the clone offset value. The populate clone.library procedure also calls a library offset determination procedure passing the library name to determine the library offset value. The populate clone.library procedure stores the library offset in the corresponding clone.library attribute array at the clone offset value. In other words, clone.library[clone offset] = library offset.
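The two kinds of population described so far, absolute attributes and direct offsets, can be sketched together in Python. This is only an illustration under stated assumptions: the library names, the tuple records standing in for the ASCII files, the dictionary standing in for the clone offset determination procedure, and the function names are all hypothetical.

```python
def build_library(records):
    """Build parallel attribute arrays from (name, type, usable) records."""
    ordered = sorted(records, key=lambda r: r[0])   # sort by library name
    return {
        "name": [r[0] for r in ordered],     # position in this array = library offset
        "type": [r[1] for r in ordered],
        "usable": [r[2] for r in ordered],
    }


def populate_clone_library(pairs, clone_offset_of, library, clone_count):
    """Fill clone.library so that clone_library[clone offset] = library offset."""
    clone_library = [-1] * clone_count
    for clone_name, library_name in pairs:
        clone_offset = clone_offset_of[clone_name]               # stand-in lookup
        library_offset = library["name"].index(library_name)     # stand-in lookup
        clone_library[clone_offset] = library_offset
    return clone_library


# Hypothetical data.
library = build_library([("LIB_B", 2, 5000), ("LIB_A", 1, 3000)])
clone_offset_of = {"I1": 0, "I2": 1, "I5": 2}
clone_library = populate_clone_library(
    [("I1", "LIB_B"), ("I2", "LIB_A"), ("I5", "LIB_B")],
    clone_offset_of, library, clone_count=3)

print(library["name"])   # ['LIB_A', 'LIB_B']
print(clone_library)     # [1, 0, 1]
```

Once populated, finding the library attributes of a clone is two array reads, for example `library["usable"][clone_library[0]]`, with no join between tables.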
Populating a POS Array

As an example of populating entries of an entity using a POS structure, the population of the cluster.clone attribute array will be described with reference to FIG. 7. In FIG. 7, a populate cluster.clone procedure 158 (FIG. 5) populates the cluster.clone and the cluster.clone.2 attribute arrays shown in FIGS. 3B and 4. An ASCII file mapping the clone names to a cluster is provided. In step 202, the clone offset build procedure builds the clone offset values. In step 204, a cluster_offset and cluster_clone2 offset are initialized to zero to point to the first entry of the cluster.clone and cluster.clone2 secondary arrays, respectively. The populate cluster.clone procedure identifies the total number of clusters and clones in the ASCII file. The populate cluster.clone procedure creates an empty cluster.clone array of POS structures whose size is based on the total number of clusters in the ASCII file. The populate cluster.clone procedure also creates an empty secondary array, cluster.clone.2, whose size is based on the total number of clones identified in the ASCII file. In step 206, the populate cluster.clone procedure reads the ASCII file and identifies a first cluster (or next cluster, if this is not the first cluster being processed). In step 208, the cluster.clone array is accessed by the cluster_offset. The populate cluster.clone procedure counts the number of clone names in the first cluster from the ASCII file and stores the count in the count field of the cluster.clone array at the cluster_offset position. The populate cluster.clone procedure stores the offset for the next unused slot in the cluster_clone2 array in the POS structure of the cluster.clone array at the position designated by the value of cluster_offset. In step 210, the populate cluster.clone procedure populates the cluster.clone.2 array with the corresponding clone offset values.
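The POS layout that this procedure fills, one (position, count) pair per cluster pointing into a secondary array of clone offsets, can be sketched in Python as follows. The tuple representation of a POS structure, the nested-list input standing in for the ASCII file, and the function names are assumptions for illustration only.

```python
def populate_cluster_clone(clusters, get_clone_offset):
    """clusters: one list of clone names per cluster, in cluster-offset order."""
    cluster_clone = []    # POS entries: (start position in cluster_clone2, count)
    cluster_clone2 = []   # secondary array holding the clone offsets
    for clone_names in clusters:
        # Record where this cluster's clones begin and how many there are.
        cluster_clone.append((len(cluster_clone2), len(clone_names)))
        for name in clone_names:
            cluster_clone2.append(get_clone_offset(name))
    return cluster_clone, cluster_clone2


def clones_of(cluster_clone, cluster_clone2, cluster_offset):
    """Return the clone offsets belonging to one cluster."""
    start, count = cluster_clone[cluster_offset]
    return cluster_clone2[start:start + count]


# Hypothetical clone offsets and cluster membership.
offsets = {"I1": 0, "I2": 1, "I5": 2, "I7": 3}
cluster_clone, cluster_clone2 = populate_cluster_clone(
    [["I1", "I5"], ["I2"], ["I7"]], offsets.__getitem__)

print(cluster_clone)                                # [(0, 2), (2, 1), (3, 1)]
print(cluster_clone2)                               # [0, 2, 1, 3]
print(clones_of(cluster_clone, cluster_clone2, 0))  # [0, 2]
```

The count field is also what a later lookup reads when it needs the number of clones in a cluster without touching the secondary array.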
In step 212, the populate cluster.clone procedure calls the GetCloneOffset function for the clone name, stores the returned clone offset value in the cluster.clone.2 array at the position pointed to by the cluster_clone2 offset, and increments the cluster_clone2 offset. Step 214 determines if there is another clone for this cluster. If so, in step 216, the populate cluster.clone procedure gets the next clone name and repeats the process at step 212 for the next clone name. If not, in step 218, the populate cluster.clone procedure determines if there are more clusters. If not, the process ends (220). If so, in step 222, the populate cluster.clone procedure increments cluster_offset and proceeds to step 206 to repeat the process for the next cluster.

Building the Database

A system build procedure calls various build procedures to build portions of the database. In particular, attributes such as those storing offset values are populated as the offset information becomes available. Certain procedures are executed before other procedures. For instance, the clone offset build procedure is executed to build the clone name mapping arrays. After executing the clone offset build procedure, the library build procedure and the populate cluster.clone procedure are executed. Additional build procedures, similar to the procedures described above, are called to build and populate attributes of other entities.

Entity Name and Offset Determination Procedures

Often the name attribute of an entity is used to uniquely designate an entry of interest. Therefore, an exemplary library offset determination procedure 164 (FIG. 5) will also be described. The parameters and output of the library offset determination procedure are similar to those shown in FIG. 6A.
A particular library name is passed as a parameter, and the library offset determination procedure returns the library offset value that points to the particular entry storing that library name in the library.name attribute array of the library entity. To determine the corresponding library offset for the particular library name, the library offset determination procedure searches the library.name attribute array for the particular library name. Since the names are ordered, the search is fast. The position of the library name within the library.name attribute array corresponds to the library offset value for the particular library name.

A library name determination procedure 102 (FIG. 5) is also provided. The parameters and output of the library name determination procedure are similar to those shown in FIG. 6B. A particular library offset is passed as a parameter, and the library name determination procedure returns the library name. The procedure uses the particular library offset to directly access the particular library name stored in the library.name attribute array.

Although the system has been described with respect to certain attributes of the clone, cluster and library entities, these descriptions apply to populating and accessing the remaining attributes of clone, cluster, library and the other entities. Preferably, for those entities and attributes sharing similar data types, generalized functions are used to access the data stored in those entities. For example, since the ProteinClass, TissueClass, BAC, YAC and HitID entities all have a name like the library entity, general functions, such as FindOffset(entity, name) and FindName(entity, offset), are provided for those entities. The FindOffset and FindName functions are similar to the library offset determination procedure and the library name determination procedure, respectively, except for passing an additional parameter to specify the entity.
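A generalized name/offset pair of this kind might look like the following Python sketch. Representing an entity as a dictionary of attribute arrays, returning -1 for a missing name, and the snake-case function names are assumptions made for the example; only the FindOffset(entity, name) and FindName(entity, offset) parameter lists come from the text. The ordered name array is searched with the standard-library `bisect` module.

```python
import bisect


def find_offset(entity, name):
    """Binary-search the entity's sorted name array; return the offset or -1."""
    names = entity["name"]
    position = bisect.bisect_left(names, name)
    if position < len(names) and names[position] == name:
        return position
    return -1   # assumed convention for "not found"


def find_name(entity, offset):
    """Direct access: the offset is the index into the name array."""
    return entity["name"][offset]


# Hypothetical entity with a sorted name array and one further attribute array.
tissue_class = {
    "name": ["heart", "liver", "lung"],
    "total_usable": [12000, 8000, 15000],
}

offset = find_offset(tissue_class, "liver")
print(offset)                                 # 1
print(find_name(tissue_class, offset))        # liver
print(tissue_class["total_usable"][offset])   # 8000
```

The same offset indexes every attribute array of the entity, so a single name lookup gives access to all attributes of that entry.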
Because offsets are used to point to the information for a desired entity, the biomolecular database can be modified easily without disturbing other entries. In particular, additional attributes are added to an entity by generating an additional attribute array of the desired data type, ordered appropriately with respect to the current offsets for the entries for that entity, and storing the new attribute array in a file. New database access procedures are provided to access the data stored in the new attribute array.

Clustering Techniques

The individual sequences stored in the clone entity represent fragments of a gene. Prior to populating the modular database, a generate cluster procedure 160 (FIG. 5) groups the sequences into clusters and these clusters are stored in the cluster entity. Clustering techniques assess the homology and overlap of pairs of nucleotide sequences from both internal and public domain databases. Sequences are assigned to a cluster based on sequence homology when the homology satisfies the specific criteria for overlap with respect to other sequences in the database. Each cluster represents a specific gene and will be annotated with a designation assigned to a representative sequence match with an annotated entry from GenBank. This annotation information is stored in the annotation entity.

The present invention provides an improved clustering technique that clusters contiguous cDNA species, each having about 100-500 base pairs, such that much longer cluster sequences are obtained which may encompass a full length gene. To cluster the clones, the generate cluster procedure 160 (FIG. 5) first executes BLAST to compare all sequences in an internal database, such as the INCYTE clones. Query and database sequences are input to BLAST. BLAST compares the query and database sequence pairs using a scoring system, and outputs pairs of sequences called High-scoring Segment Pairs (HSP).
An HSP has two sequence fragments of arbitrary length whose alignment is locally maximal and for which an alignment score meets or exceeds a threshold or cutoff score. In the implementations of the BLAST algorithm described herein, each HSP includes a segment from the query sequence and a segment from a database sequence. Multiple HSPs involving the query sequence and a single database sequence may be statistically treated in a variety of ways. By default, the programs use "Sum" statistics. Therefore, the statistical significance ascribed to a set of HSPs may be higher than that ascribed to any individual member of the set. When the assigned significance satisfies the specific threshold (E parameter), a match will be reported to the user.

The BLAST parameter E establishes a statistical significance threshold for outputting database sequence matches. E represents an upper boundary of the expected frequency of random occurrence of an HSP or set of HSPs within the context of the entire database search. In the implementations of the BLAST algorithm described herein, E = 10^-5.

The context of the BLAST comparison includes the length and residue composition of the query sequence, the length of the database, a fixed hypothetical residue composition for the database, and the scoring system. Each nucleotide in a sequence is represented by a character. The significance of an alignment score depends on the specific scoring matrix employed and the length and composition of the query sequence and database, all of which may vary with each search. For the purpose of calculating significance levels, Y is the effective length of the query sequence and Z is the effective length of the database. The default values for Y and Z are the actual lengths of the query and database sequences, respectively. To normalize the reported statistics when searching databases of different lengths, Z may be set to a constant value for all database searches.
In the implementation described herein, Z = 3 × 10^9. Similarly, when querying with sequences of different lengths, Y can be used to normalize over all searches.

FIG. 8A shows an HSP of two sequences 230 and 232 from the BLAST results. The region of homology 250 between the sequences 230 and 232 has 100 nucleotides. There are four adjacent non-matching regions 252, 254, 256, and 258, having 22, 25, 100 and 150 nucleotides, respectively.

In FIG. 8B, the generate cluster procedure further filters the BLAST results to form additional clusters. Referring also to FIG. 8A, the filter uses the following parameters:

L: The length of the region of sequence homology 250; in FIG. 8A, L equals 100.
n: An integer representing the percent of matching bases or nucleotides within the region of sequence homology 250. The value of n ranges from zero to one hundred.
d: The length of the shortest non-matching sequence adjacent the region of sequence homology. In FIG. 8A, region 252 is the shortest non-matching sequence adjacent the region of sequence homology, therefore d equals twenty-two.

Referring to FIG. 8B, in step 270, the BLAST output is received. Block 272 sets a parameter called S in steps 274, 276 and 278. S represents a number of bases and is a function of d as defined above. S is equal to either a first value or a second value depending on whether the region of sequence homology is in the middle or at the end of the sequence. Step 272 determines if the region of sequence homology is in the middle or at one of the ends of the sequence by comparing the length of the shortest non-matching sequence adjacent the region of sequence homology d to a first predetermined threshold value, threshold 1. In one embodiment, the first predetermined threshold value equals five. If d is less than or equal to five, then in step 274 S is set to a first value such as forty.
If d is greater than the first predetermined threshold value, step 276 sets S to a second value such as eighty. In step 278, the length of the region of sequence homology L is compared to S. If L is greater than or equal to S, then the sequences match. If L is less than S, then other parameters are considered. Step 280 calculates n, the number of nucleotide matches between the pairs in the region of sequence homology divided by L. Step 282 compares n to a second threshold value, threshold 2. In one embodiment, the second threshold value equals 95 (representing a 95% match threshold). If n is less than the second threshold value, then there is no match. If n is greater than or equal to the second threshold value, step 284 calculates T, a variable threshold that is used to determine if a match occurred. T is determined using the following relationship:

T = S + (100 - n)B.

In one embodiment, B is a predetermined constant representing an incremental number of nucleotides required for a match, such as two. Therefore for every one percent difference between 100 and n, two additional nucleotides are required for a match. Alternately, B is a function of d, the length of the shortest non-matching sequence adjacent the region of sequence homology. Step 286 compares L to T: if L ≥ T, the filter indicates a "match"; if L < T, the filter indicates "no match."

After filtering as described above, the generate cluster procedure further evaluates the pairs of sequences to establish clusters. The sequences include sequences from INCYTE and public domain databases, 'template' sequences which are assemblies of ESTs or other sequences and 'anchoring' sequences which are sequences from public domain databases with a functional annotation. Anchoring sequences are compared using sequence comparison programs such as BLAST, BLAST2, FASTA, and CrossMatch or other implementations of Smith-Waterman to the complete set of sequences in the database.
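The filter of FIG. 8B (steps 272 through 286) can be written down as a short Python function. The sketch follows the steps in the order the text gives them and uses the example values from the text (threshold 1 of five, S of forty or eighty, threshold 2 of 95, B of two) as defaults; the function name, the parameter names and the choice to pass the raw match count rather than n are assumptions for illustration.

```python
def filter_match(L, matches, d, threshold1=5, s_end=40, s_middle=80,
                 threshold2=95, B=2):
    """Return True for "match", False for "no match".

    L       -- length of the region of sequence homology
    matches -- number of matching nucleotides within that region
    d       -- length of the shortest adjacent non-matching sequence
    """
    # Steps 272-276: choose S according to where the homologous region sits.
    S = s_end if d <= threshold1 else s_middle
    # Step 278: a long enough region matches outright.
    if L >= S:
        return True
    # Steps 280-282: percent identity within the region.
    n = 100 * matches / L
    if n < threshold2:
        return False
    # Steps 284-286: variable threshold T = S + (100 - n) * B.
    T = S + (100 - n) * B
    return L >= T


# FIG. 8A values from the text: L = 100 and d = 22; a perfect match is assumed.
print(filter_match(L=100, matches=100, d=22))  # True
print(filter_match(L=60, matches=57, d=22))    # False
```

Read literally, a pair that reaches the T comparison already has L less than S, and T is never smaller than S when B is non-negative, so that branch cannot report a match. The sketch keeps the stated order of steps rather than guessing at a different intended ordering.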
Sequences are assembled with a sequence assembler such as Phrap (Green, P., Univ. of Washington), ClustalW (Thompson, J. D. et al. (1994) Nucleic Acids Res., 22:4673-4680), GCG Assembly (Genetics Computer, Inc.), or CAP (Huang, X. (1996) Genomics 33: 21-31), to derive sets of template sequences representing 'pre-clusters'. The results of this comparison include clusters attached to anchoring sequences, template sequences for previous clusters and singletons which did not cluster. The sequences derived from the comparison above are further compared to each other using one of the sequence comparison tools discussed above. For instance, BLAST is used with the parameter Z equal to 3 × 10^9 and E equal to 10^-5. The sequence comparison maintains a record of each query sequence, including the query sequence identification (ID), query length, hit(s) sequence identification, hit(s) length, and the highest score derived from the comparison. This comparison results in groups of pre-clustered sequences and singletons with 5' hits. Lists are kept of duplicate clones and sequence IDs, the duplicates are removed, and the BLAST results are filtered.

The groups of sequences are further characterized to avoid inappropriately associating sequences by considering the specific context of each sequence match. For instance, considering the sequence context prevents clustering of 5' and 3' sequences which belong to different clones, and also prevents merging clusters with common sequences but known to be different genes by applying more stringent criteria to the evaluation of the match. Sequences which have matches at the 5' ends of clones are clustered, and sequences with matches at 3' ends are excluded. The remaining singletons and 3' clusters are examined. If the specific clone sequence represented by a 3' cluster or singleton forms a match with a single 5' cluster it will be merged with that cluster.
Cluster annotations and sequence composition are modified to reflect any changes that occur and files are generated to track all changes. The generate cluster procedure also generates a log during the clustering process. The log lists the order in which operations were performed and the changes that occurred such as the number of new clusters generated, the number of clones contained within the new clusters, the number of singletons incorporated into clusters, the number of singletons eliminated, and the number of merged clusters.

Transcript Imaging

The modular database allows for systematic and quantitative characterization of the distribution of ESTs or clones in a plurality of cDNA libraries. Transcript imaging compares expression data with mapping information and annotation at the genome level, rather than one gene at a time, and validates the quality of the clustering of the clones of the cDNA libraries. In particular, in the database, the cluster entity has an Annotation attribute that associates clusters with annotation information. Transcript imaging is based on the analysis of the expression of a cluster, rather than expression of individual clones or sequences, and on the analysis of cluster expression in the cDNA libraries with respect to tissue type.

In the modular database, the libraries are associated with tissue classes, and the tissue classes are hierarchically organized. The hierarchy tree of tissue classes is stored in the TissueClass entity. In the TissueClass entity, each tissue class name is stored in the name field of the TissueClass.Name attribute array. In the hierarchy, tissue classes are associated with a parent tissue class and a sub-class tissue class using TissueClass.ParentTissueClass and TissueClass.subclassTissueClass attribute arrays, respectively. Each library is associated with at least one tissue class in the hierarchy. The hierarchy "tree" of tissue classes is based on the 1998 Medical Subject Headings (MeSH™)
available from the National Library of Medicine. The top level of tissue classes is system based and includes the following classes: cardiovascular system; cells; digestive system; embryonic structures; endocrine system; genitalia, female; genitalia, male; hemic and immune system; musculoskeletal system; nervous system; respiratory system; sense organs; stomatognathic system; tissue types; and urinary tract. The lower level of tissue classes is tissue specific and includes blood vessels, heart, blood cells, bone marrow cells, cultured cells, connective tissue cells, epithelial cells, islets of Langerhans, neuroglia, neurons, phagocytes, biliary tract, esophagus, gastrointestinal system, liver, pancreas, fetus, placenta, chromaffin system, endocrine glands, neurosecretory systems, ovary, uterus, penis, prostate, seminal vesicles, testis, bone marrow, immune system, cartilage, muscles, skeleton, central nervous system, ganglia, neuroglia, neurosecretory system, peripheral nervous system, bronchus, larynx, lung, nose, pleura, ear, eye, nose, mouth, pharynx, connective tissue, epithelium, exocrine glands, bladder, kidney and ureter. Some of these second level categories are linked to more than one first level category. First level headings and second level headings are coded using the MeSH™ code system. Specific libraries are then subdivided between one or more of the second level categories. For example, clones isolated from a fetal heart sample library belong to multiple tissue classes. Within the tissue class entity, many tissue class entries have a name of "heart". The entry of the library entity for the fetal heart sample library would be associated with or point to at least one tissue class entry in the tissue class entity having a tissue class name of "heart."
The fetal heart sample library entity would be associated with those tissue class entries having a name of "heart" and having a parent tissue class called "embryonic structures" or "cardiovascular system."

When the modular database is created, a build TissueClass procedure 162 (FIG. 5) populates the TissueClass.Name attribute array with the tissue class names. The build TissueClass procedure also populates the TissueClass.ParentTissueClass and TissueClass.subclassTissueClass arrays with offsets that define the hierarchy tree of tissue classes. The relationship between tissue class names, parent and subclasses is predefined and supplied to the build TissueClass procedure in a file. Next, a populate TissueClassLibrary procedure 168 (FIG. 5) populates the TissueClass.SpecificLibrary attribute array with the corresponding library data using the methods described above.

Using the hierarchical tissue classes of the database, clusters can be examined to determine a specificity of expression of the clusters with respect to the libraries and tissue classes. To determine the specificity of expression, clusters whose members, or clones, are expressed in a single library are identified. If there is no library specificity, then clusters whose clones are expressed in a single tissue class are identified. A cluster may be fully tissue class specific for more than one tissue class because of the overlapping nature of the tissue classes. If a cluster is not tissue class specific, clusters that are partially tissue class specific are identified. A cluster is partially tissue class specific if a certain fraction of its clones are derived from libraries that are in the same tissue class. This fraction is called a threshold specificity score. The library entity has Library.SpecificCluster and Library.UnspecificCluster attribute arrays that associate clusters with the entries of the library entity based on their cluster specificities as described above.
If a cluster is found to be specific for a library or class, several values are provided based on the data stored in the modular database:

1. an expression level (E),
2. a sensitivity, and
3. a cluster specificity value.

The expression level (E) is calculated in a calculate expression level procedure 170 (FIG. 5). The expression level represents the number of clones in a cluster (N) expressed in a particular library divided by the number of clones (P) in that library, as in the following relationship: E = N/P. In the modular database, the number of clones in a cluster (N) is stored in the POS structure of the Cluster.Clone attribute array. The number of sequences in each library is stored in the library.Usable attribute array.

The sensitivity is also calculated, in a cluster sensitivity determination procedure 172 (FIG. 5), by dividing the number of clones in a cluster (N) by the total number of sequences (Q) of all the libraries in a tissue class, so that S = N/Q. For example, the number of clones in a cluster (N) is retrieved from the Cluster.Clone attribute array as described above. The cluster.LibCount attribute associates various libraries with the cluster. The libraries are associated with tissue classes using the Library.TissueClass attribute. For a particular tissue class, the number of sequences (Q) in that tissue class is retrieved from the TissueClass.TotalUsable attribute array.

The cluster specificity value or threshold specificity score is calculated for each cluster in a cluster specificity determination procedure 174 (FIG. 5). A cluster is reported to be library specific if 75% or more of the clones in the cluster are expressed in a library of interest. Partial specificities are reported for clusters when less than 75% of the clones in the cluster are expressed in the library of interest. Alternately, partial class specificity is reported when less than 75% of the clones in the cluster are expressed in the tissue class of interest.
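The three values can be illustrated with a small Python sketch. It assumes the counts N, P and Q have already been read from the attribute arrays named above; the function names, the list of per-clone library offsets and the numbers in the example are hypothetical.

```python
def expression_level(n_clones, usable_in_library):
    """E = N / P: cluster clones in a library over usable clones in that library."""
    return n_clones / usable_in_library


def sensitivity(n_clones, total_usable_in_class):
    """S = N / Q: cluster clones over all sequences of the tissue class's libraries."""
    return n_clones / total_usable_in_class


def specificity(clone_libraries, library_of_interest):
    """Fraction of a cluster's clones that come from the library of interest."""
    hits = sum(1 for lib in clone_libraries if lib == library_of_interest)
    return hits / len(clone_libraries)


def classify(value, threshold=0.75):
    """Apply the 75% rule from the text."""
    return "library specific" if value >= threshold else "partially specific"


# Hypothetical cluster of four clones; each entry is the clone's library offset.
clone_libraries = [0, 0, 0, 1]
value = specificity(clone_libraries, library_of_interest=0)

print(value)                          # 0.75
print(classify(value))                # library specific
print(classify(specificity(clone_libraries, 1)))  # partially specific
E = expression_level(3, 6000)         # 3 cluster clones among 6000 usable clones
S = sensitivity(3, 24000)             # same clones against a 24000-sequence class
```

Every input here is a count or an offset that is already stored in an attribute array, so the three values require only array reads and a division.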
In one embodiment, the cluster specificity determination procedure identifies the clones from the Cluster.Clone attribute array, accesses the clone.library attribute array, counts the number of clones associated with each library and divides the largest count by the total number of clones in the cluster to generate the cluster specificity value.
Similar to the hierarchy of Tissue Classes, as shown in FIG. 4, a ProteinClass entity provides a hierarchy of proteins using the parent-subclass organization described above. The ProteinClass entity is associated with the HitID entity which associates clusters with HitID entries. Therefore, clusters can be analyzed by ProteinClass. Clusters, via the Annotation entity, are associated with protein functions stored in the FunctionHit entity. Therefore, clusters can also be analyzed by protein function. In another alternative embodiment, a biomolecular entity provides a hierarchy of biomolecules using the parent-subclass organization described above.
Mapping
The modular database also provides a way to store and associate mapping information with clones and clusters. Referring back to FIG. 4, a MapPos entity stores mapping information that is supplied from multiple public domain databases such as the Stanford Human Genome Center (SHGC) and the Whitehead Institute Center for Genome Research (WICGR). The clone and cluster entities each have POS structures that associate the entries of the MapPos entity with the entries of the clone and cluster entities. The MapPos entity is populated in a manner similar to the procedures described above. To populate the MapPos entity, a file of clone names with the mapping information is supplied to a populate map information procedure. The populate map information procedure populates entries of the MapPos entity with the mapping information and also populates the clone.MapPos attribute of the clone entity.
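The population of the MapPos entity and of the clone.MapPos attribute can be sketched as follows. This is only an illustration: the record layout `(clone_name, marker, chromosome, position)`, the `clone_offset_of` lookup and the use of Python lists and dictionaries in place of the precompiled arrays are all assumptions, and only three of the MapPos attributes are modeled.

```python
def populate_map_info(records, clone_offset_of):
    """Build MapPos attribute arrays and the clone -> MapPos association.

    records:         iterable of (clone_name, marker, chromosome, position)
    clone_offset_of: dict, clone name -> clone entity offset (hypothetical)
    """
    map_pos = {"Marker": [], "Chromosome": [], "Position": []}
    clone_map_pos = {}  # clone offset -> list of MapPos offsets

    for clone_name, marker, chromosome, position in records:
        offset = len(map_pos["Marker"])  # next free MapPos entry
        map_pos["Marker"].append(marker)
        map_pos["Chromosome"].append(chromosome)
        map_pos["Position"].append(position)
        clone_map_pos.setdefault(clone_offset_of[clone_name], []).append(offset)

    return map_pos, clone_map_pos

def get_map_info(map_pos, offset):
    """Read every attribute of one MapPos entry using the same offset."""
    return {name: values[offset] for name, values in map_pos.items()}
```

Because every attribute array is indexed by the same MapPos offset, retrieving an entry is a direct array lookup rather than a search.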
After the clone entity is populated with the mapping information, the populate map information procedure populates the cluster.MapPos attribute of the cluster entity for those clusters that have clones that are mapped. In an alternate embodiment, the populate map information procedure applies a filter before associating a cluster entry with the MapPos entity. For instance, this procedure checks that all clones making up a particular cluster are mapped to the same MapPos entity before associating that particular cluster with an entry of the MapPos entity. Similar to the procedures described above, a get map info procedure retrieves the mapping information from the MapPos entity using an offset into the MapPos entity. Therefore, mapping information from many sources is combined into a single database. Using this database, statistical analysis of the mapped clones in clusters can be performed. In this way, no table joins are performed because the relationships among the data are pre-compiled.
Although the invention was described using a database for sequence data, the database can also be used with other biomolecular information. For example, the invention can store full length mRNA sequences, genomic sequences, synthetic sequences, peptide sequences, polypeptide sequences, peptide nucleic acid sequences, and genome mapping, pharmacogenomic, proteomic, single nucleotide polymorphism, genotyping and forensic data.
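The cluster-level mapping step described above, with and without the optional filter, can be sketched as below. The sketch reflects one reading of the filter (every clone in the cluster must be mapped, and all to the same MapPos entries); the function name, the `require_agreement` flag and the dictionary input are assumptions made for illustration.

```python
def cluster_map_positions(cluster_clones, clone_map_pos, require_agreement=False):
    """Return the sorted MapPos offsets to associate with one cluster.

    cluster_clones:    clone offsets that make up the cluster
    clone_map_pos:     dict, clone offset -> list of MapPos offsets
    require_agreement: apply the alternate-embodiment filter (assumed reading)
    """
    mapped = [set(clone_map_pos[c]) for c in cluster_clones if clone_map_pos.get(c)]
    if not mapped:
        return []  # no clone in the cluster is mapped

    if not require_agreement:
        # Default behaviour: collect the positions of every mapped clone.
        return sorted(set.union(*mapped))

    # Filtered variant: all clones must be mapped, and to the same entries.
    if len(mapped) != len(cluster_clones):
        return []
    first = mapped[0]
    return sorted(first) if all(m == first for m in mapped) else []
```

With the filter enabled, a cluster whose clones disagree on position is left unassociated.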
TABLE 2
Entity Description
Clone Entity
name-derived from clone offset to clone entity table
library-offset to library entity associated with the clone name
sequence-array of offsets (POS structure) for clone nucleic acid
sequence
cluster-offset to cluster entity containing clone
annotation-offset to annotation entity
MapPos-array of offsets to MapPos Entity
BAC-array of offsets to BAC entity for associated Bacterial artificial
chromosome clones
YAC-array of offsets to YAC entity for associated Yeast artificial
chromosome clones
Cluster Entity
Clone-a POS structure indicating the clones that belong to a particular
cluster
LibCount-a POS structure indicating the number of libraries in which
the cluster appears and the offsets of the library entity entries in
which the cluster appears
Annotation-offset to the Annotation entity
MapPos-is POS structure for associating mapping information with a
cluster.
BAC-array of offsets to BAC entity for associated Bacterial artificial
chromosome.
YAC-array of offsets to YAC entity for associated Yeast artificial
chromosome
Library Entity
Name-the library name
Clone-a POS structure for associating multiple clones with a library
Type-a number representing library preparation condition
Usable-number of usable clones in library
AnnotSGL clone[ ]-POS structure for associating annotated singleton
(a cluster consisting of one clone) clones with Cluster names
that are identical to their Clone name
UNISGL clone [ ]-POS structure for associating unique singleton
clones with cluster name
Specific Cluster-array of offsets to cluster entity, for those clusters
that occur in a single library or class
Unspecific Cluster-array of offsets to cluster entity for those clusters
that occur in multiple libraries or tissue classes
Tissue Class-array of offsets into TissueClass entity to associate a
library with tissue classes
Description-a POS structure pointing to variable-length data stored
in a secondary file. The description information describes the
tissue pathology.
Comment-similar to Description; a POS structure pointing to
variable-length data stored in a secondary file.
Annotation Entity
HitID-an offset pointing to the HitID Entity
Product Score-an integer representing a normalized value between 0
and 100 indicating the strength of a BLAST match between two
sequences.
LogLikelihood-an integer representing the probability that a sequence
match was due to random chance.
Extent-an integer which indicates whether any coding information is
present within the clone.
FunctionHit-uses a POS structure to point to multiple entries in the
FunctionHit Entity. Function Hit means protein function and is
taken from the annotation.
TissueClass Entity
Name: uses a POS structure to store a name having any number of
characters
parent tissue class: an offset pointing to another entry in the
TissueClass table that is the parent of the tissue class in the
Name attribute (for example, if the name is "leucocyte", the parent
would be the tissue class whose Name attribute is "blood")
subclass TissueClass: since a tissue class can have many subclasses,
a POS structure points to the starting entry of a secondary array or
file. Each value stored in the secondary array or file is an offset
pointing to another entry in the TissueClass entity. If there are
no subclasses, the POS structure stores a 0 as its count.
Specific library[ ]-an array of offsets pointing to the library entity that
is specific for the cluster
Specific Usable-an integer meaning the number of clones in the
library(ies) that the cluster is specific to
Total Library [ ] -an array of offsets pointing to total number of usable
clones in the library(ies) a clone is specific to
Total Usable-an integer meaning the total number of usable clones in
library(ies) a cluster is specific to
MapPos Entity
Marker-A unique positional identifier for a chromosomal location
MapType-an annotation specifying the type of map
Chromosome-an integer specifying the chromosome
Position-an integer indicating a location on a chromosome
LodScore-an integer representing a ratio of likelihood of being within
a location on a chromosome
Source-an integer indicating the map source
ProteinClass Entity
Name has a variable length and uses a POS structure.
parentClass-an offset pointing to a higher protein function class
subClass-an offset pointing to a protein function subclass
HitID-an array of offsets to the HitID entity
HitID Entity
Name is a GenBank internal identifier that is associated with a
homolog which is used to access data in GenBank.
Clone-an offset into the clone entity
Cluster-an array of offsets to the cluster entity
Singleton-an offset into the clone entity
FunctionHit Entity
Type-an integer representing a protein function class
FunctionID-an integer identifying a protein function derived from
annotation
YAC Entity
Name is an identifier for the YAC
clone-an array of offsets to the clone table associated with a YAC
clone
Source-an integer representing the source of the YAC
BAC Entity
Name is an identifier for the BAC
clone-an array of offsets to the clone table associated with a BAC
clone
Source-an integer representing the source of the BAC
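To make the layout in Table 2 concrete, the sketch below models a small part of the library entity. Each attribute is a parallel array indexed by the same entity offset, and a POS structure is modeled as a (start, count) pair into a secondary array. The class name, the field names and the (start, count) representation are assumptions for illustration; the table only states that a POS structure points to a starting entry and stores a count.

```python
from dataclasses import dataclass, field

@dataclass
class LibraryEntity:
    name: list = field(default_factory=list)             # Library.Name
    usable: list = field(default_factory=list)           # Library.Usable
    clone_pos: list = field(default_factory=list)        # Library.Clone as (start, count)
    clone_secondary: list = field(default_factory=list)  # clone offsets, back to back

    def add(self, name, usable, clone_offsets):
        """Append one entry to every attribute array; return its entity offset."""
        offset = len(self.name)
        self.name.append(name)
        self.usable.append(usable)
        # POS structure: where this entry's clones start and how many there are.
        self.clone_pos.append((len(self.clone_secondary), len(clone_offsets)))
        self.clone_secondary.extend(clone_offsets)
        return offset

    def clones(self, offset):
        """Follow the POS structure to the clone offsets of one library."""
        start, count = self.clone_pos[offset]
        return self.clone_secondary[start:start + count]
```

A library with no clones simply stores a count of 0, mirroring the convention described for tissue classes without subclasses.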
While the present invention has been described with reference to a few specific embodiments, the description is illustrative of the invention and is not to be construed as limiting the invention. Various modifications may occur to those skilled in the art without departing from the true spirit and scope of the invention as defined by the appended claims.
