Multiple alignment genome sequence matching processor6983274Abstract A digital hardware system slides a sequence to be searched, such as a genome, past a query sequence, such as a DNA fragment, via shift registers. In parallel, multiple alignments between search and query sequences are compared. As each alignment consists of many elements (characters), the parallel matching process is highly accelerated over sequential element-by-element comparison. Individual element comparisons are made using equivalence operators, such as XNOR logic gates. The results of individual element comparisons are then consolidated into a single signal: "hit" or "miss", indicating whether all the elements for the particular alignment under test matched or not. After comparison, the search sequence is shifted by a number of elements equal to the number of parallel alignments compared, and the process repeats. An additional shifting prefetch buffer increases memory efficiency by alleviating the need to fetch data every cycle. Claims I claim: Description BACKGROUND OF THE INVENTION
The linear shift register method of matching first presented in U.S. Pat. No. 5,724,253 to Skovira of IBM (1998) alleviates some of these problems but still has several disadvantages:
BACKGROUND OF THE INVENTION—OBJECTS AND ADVANTAGES Accordingly, the general object of the multiple alignment genome sequence matching processor is to provide a fast and accurate system and method suitable for searching a genomic data vector using a query DNA fragment. Such a system provides rapid search and query capabilities to large databases of genetic data, and accelerates the process of sequence assembly. Several other general objects and advantages of the present invention are:
Further objects and advantages are to provide a mechanism to reduce memory fetch requirements thereby speeding compute time; to provide a system that exploits the nature of genetic data to rapidly find small exact matches (called hits), one of the core computational tasks of many bioinformatics algorithms; and to provide a system capable of performing DNA sequence assembly an estimated one thousand times faster than a conventional general purpose computer using the pair-wise comparison of fragment reads. Still further objects and advantages will become apparent from a consideration of the subsequent description and drawings. SUMMARY The multiple alignment genome sequence matching processor slides a sequence to be searched, such as a genome, past a query sequence, such as a DNA fragment, via shift registers. In parallel, multiple alignments between search and query sequences are compared. As each alignment consists of many elements (characters), the parallel matching process is highly accelerated over sequential element-by-element, single-alignment comparison. Individual element comparisons are made using equivalence operators, such as XNOR logic gates. The results of individual element comparisons are then consolidated into a single signal: "hit" or "miss," indicating whether all the elements for the particular alignment under test matched or not. After comparison, the search sequence is shifted by a number of elements equal to the number of parallel alignments compared, and the process repeats. An additional shifting prefetch buffer increases memory efficiency by alleviating the need to fetch data every cycle. According to one feature of the invention, each element is represented by a digital value assigned such that complementary DNA nucleotides receive complementary codes. This provides the option for XOR matching of complementary sequences or XNOR equivalence matching of identical sequences. According to another feature of the invention, the match length between the data and query vectors can be varied dynamically. DRAWINGS—FIGURES FIG. 1 illustrates a two-way alignment sequence matching processor according to the present invention. FIG. 2 details the cross-comparison mechanism for a four-way alignment sequence matching processor. FIG. 3 details an extension to consolidating AND tree introduced in FIG. 1 which allows the match length between the data and query vectors to vary. FIG. 4A details the logic gate implementation of an equivalence operator. FIG. 4B details the logic gate implementation of an equivalence-complementary combined operator. FIGS. 5A, 5B, and 5C show a time series operational interaction between the data and prefetch vectors. DRAWINGS—REFERENCE NUMERALS 10 Data vector 12 Query vector 14 Prefetch Vector 20a-d: Alignment 0 equivalence operators 22 Alignment 0 comparison block 30a-d: Alignment 1 equivalence operators 32 Alignment 1 comparison block 40 Comparison engine 80 Consolidating AND trees 80a Alignment 0 consolidating AND tree 80b Alignment 1 consolidating AND tree 120 XNOR equivalence operator schematic 130 XOR/XNOR complementary/equivalence combined operator schematic 150 Choose alignment block 160 Incrementing genome index 190 AND tree mux for selecting variable match length 202 Cross-comparison wiring network 230 Select line for complementary or equivalence matching 240 Next input elements 250 Write enable signal 252 Least significant bits of match index 260 Index output bus 262 Most significant bits of match index 270 Results of element comparisons 270a Output bus for alignment 0 element comparisons 270b Output bus for alignment 1 element comparisons 270i Individual element comparison result 280 Results from AND tree consolidation 280a Result from alignment 0 AND tree 280b Result from alignment 1 AND tree 280i Individual result from an AND tree 284a Matched five elements output 284b Matched six elements output 284c Matched seven elements output 284d Matched eight elements output 290 Select line for variable length matching Detailed Description—Preferred Embodiment—FIGS. 1, 2, and 4A A preferred embodiment of the multiple alignment genome sequence matching processor is illustrated in FIG. 1. A data vector 10 comprising of K shift memory elements contains a portion of a sequence such as a genome. While only four such elements are shown for purpose of illustration (K=4), in practice, the number of elements may be much larger. Each element or box in the data vector represents one data character, such as a nucleotide. Since there are only four nucleotides in DNA—Adenine, Thiamine, Guanine, and Cytosine—each element comprises two bits. A prefetch vector 14 comprising of shift memory elements is connected to data vector 10. The connection is such that the adjacent portion of the genome residing in the prefetch vector may be shifted into the data vector. Each element in the prefetch vector represents one data character, such as a nucleotide. The prefetch vector in the preferred embodiment is chosen to have the same number of elements as the data vector, though this number may vary. A query vector 12 consisting of K memory elements contains a portion of a different sequence such as a DNA fragment. Again, each element represents a character such as a nucleotide. The query vector in the preferred embodiment is chosen to have the same number of elements as the data vector, though this number may vary. A comparison engine 40 connects data vector 10 and prefetch vector 14 to query vector 12 using a cross-comparison wiring network 202. The data and prefetch vectors collectively hold a contiguous portion of the sequence to be searched, while the query vector holds the sequence fragment acting as the query in that search. In the preferred embodiment, the comparison engine tests multiple alignments between the genome and the fragment for equivalence. Although only two such alignments, alignment 0 and alignment 1, are drawn in FIG. 1 for purpose of illustration, the number of alignments tested may vary. The comparison between genome and query fragment for alignment 0 is performed by an alignment 0 comparison block 22 which consists of exclusive NOR (XNOR) logic gates 20a, 20b, 20c, and 20d. The output of an XNOR logic gate is asserted when its inputs match exactly. For this reason, XNOR is also known as the equivalence gate or equivalence operator. As a nucleotide can be represented in 2 bits, XNORs with 2-bit inputs are used for the comparison. In practice, these may be achieved through the logical AND of two 1-bit input XNORs as illustrated in FIG. 4A. A parallel comparison on alignment 0 is performed where XNOR gate 20a compares the first nucleotide in genome data vector 10 with the first nucleotide in query vector 12, XNOR gate 20b compares the second nucleotide in genome data vector 10 with the second nucleotide in query vector 12, XNOR gate 20c compares the third nucleotide in genome data vector 10 with the third nucleotide in query vector 12, and XNOR gate 20d compares the fourth nucleotide in genome data vector 10 with the fourth nucleotide in query vector 12. In total, the alignment 0 comparison consists of a parallel comparison of elements 0 to 3 in the data vector to the corresponding elements 0 to 3 in the query vector. The comparison between genome and query fragment for alignment 1 is performed by an alignment 1 comparison block 32 which consists of exclusive NOR (XNOR) logic gates 30a, 30b, 30c, and 30d. In parallel, XNOR gate 30a compares the second nucleotide in genome data vector 10 with the first nucleotide in query vector 12, XNOR gate 30b compares the third nucleotide in genome data vector 10 with the second nucleotide in query vector 12, XNOR gate 30c compares the fourth nucleotide in genome data vector 10 with the third nucleotide in query vector 12, and XNOR gate 30d compares the first nucleotide in genome prefetch vector 14 with the fourth nucleotide in query vector 12. In total, the alignment 1 comparison consists of a parallel comparison of elements 1 to 3 in the data vector and element 0 in the prefetch vector to the corresponding elements 0 to 3 in the query vector. In generalized terms, for K-element wide data, prefetch, and query vectors, an alignment P comparison consists of a parallel comparison via XNOR gates of elements P to (K-1) of data vector 10 and the first P elements of prefetch vector 14 to the corresponding elements 0 to (K-1) in query vector 12. This generalization is restricted to the condition that P is less than or equal to K. The result of an individual XNOR comparison will be a binary 1 or 0 indicating either the equivalence of the nucleotides or their dissimilarity respectively. Individual element comparison results 270 from comparison engine 40 are sent to consolidating AND trees 80. The individual element results from the alignment 0 comparison block are sent down a bus 270a to alignment 0 consolidating AND tree 80a. The individual element results from the alignment 1 comparison block are sent down a bus 270b to alignment 1 consolidating AND tree 80b. The consolidating AND trees perform a logical AND operation on their inputs. Hence, the output of a binary 1 from an AND tree indicates an exact match between multiple nucleotides. If the output of the alignment 0 consolidating AND is asserted, all elements of alignment 0 matched. If the output of the alignment 1 consolidating AND is asserted, all elements of alignment 1 matched. AND tree outputs 280 are fed into choose alignment block 150. The choose alignment block asserts a write enable signal 250 of external memory storage if any of the AND tree outputs are asserted, indicating a match or "hit" for some alignment. The choose alignment block also outputs the alignment number of the match to least significant bits 252 of index bus 260. An incrementing index 160 increments by P, the number of simultaneous alignments checked by comparison engine 40, in order to keep track of the current location in the genome. Each time the genome residing in data vector 10 and prefetch vector 14 is shifted, the index increments. The incrementing index outputs most significant digits 262 of match index to index bus 260. The index bus 260 is written as data into external memory when write enable signal 250 is asserted. Hence, when a match between the genome and query is found, the location where the match occurred is written to memory. In general, it is not actually necessary to increment by P for each shift of the genome. Rather, if P is a power of 2, an index which increments by 1 may be substituted. This index will supply the upper log2(P) to P most significant bits to index bus 260 while choose alignment block 150 supplies the remaining least significant bits based on which alignment produced the match. FIG. 2 illustrates the cross-comparison mechanism for a four-way alignment sequence matching system for purpose of clarification. Here, comparison engine 40 contains four sets of XNOR gates (shown overlapping to emphasize the cross-comparison wiring), where each set compares a different alignment. Set 0 compares elements 0 to 3 of data vector 10 to the corresponding elements 0 to 3 of query vector 12. Set 1 compares elements 1 to 3 of data vector 10 and element 0 of prefetch vector 14 to the corresponding elements 0 to 3 of query vector 12. Set 2 compares elements 2 to 3 of data vector 10 and elements 0 to 1 of prefetch vector 14 to the corresponding elements 0 to 3 of query vector 12. Set 3 compares element 3 of data vector 10 and elements 0 to 2 of prefetch vector 14 to the corresponding elements 0 to 3 of query vector 12. Wiring network 202 which permits such cross-comparison is the primary purpose of this illustration. Since four alignments were checked, four consolidating AND trees 80 will be required. These AND trees will consolidate the element comparisons into output signals 280 indicating whether or not a match occurred for each alignment. Detailed Description—FIGS. 3 and 4B—Additional Features FIG. 3 details an extension to consolidating AND tree introduced in FIG. 1 which allows the match length between the data and query vectors to vary. This is an important extension feature to the preferred embodiment as it adds flexibility to an otherwise hardwired and static match length. FIG. 3 illustrates how such flexibility could be achieved for an embodiment which compares 8 nucleotides in parallel, and therefore produces 8 bits of comparison results 280. By breaking the monolithic 8-bit AND gate into a tree of 2-bit AND gates, the outputs of those gates can be tapped to determine if a match occurred for less than 8 elements. For example, match 5 output 284a will be asserted when the first five comparison result bits are asserted. This indicates a match between the first five elements compared. Match 6 output 284b will be asserted when the first six elements compared are equivalent. Similarly, match 7 output 284c and match 8 output 284d will be asserted if the first 7 and 8 elements respectively are equivalent. A select line 290 is used with a multiplexor 190 to choose the desired match length from among the outputs generated. An individual AND tree result 280i is then sent to the choose alignment block as previously illustrated. FIG. 4A shows the schematic of an equivalence operator 120 (also known as an XNOR gate) which has 2-bit inputs A and B. Functionally, it can be seen that a multi-bit equivalence operation is simply the logical AND of multiple single-bit equivalence operations. FIG. 4B shows the logic gate schematic for a combination equivalence-complementary match operator 130 with 2-bit inputs. The 2-bit XNOR gate can be implemented by inverting the outputs of two XOR gates and performing a logical AND. A 2-bit complementary match operation can be implemented through the logical AND of the XOR gate outputs. Once generated, select line 230 can send either the equivalence of complementary match outputs as a comparison element result 270i. When complementary data elements are given bit-wise complementary codes as shown in table 1 below, this combined unit allows for either element complementary or element equivalence matching between genome and fragment sequences. This is important, as DNA is made of complementary nucleotide pairs arranged in a double helix. Since either side of that helix may have been sequenced, the genome database being queried may store the complement of the query in question. The ability to search in either mode is therefore desirable.
Detailed Description—Alternative Embodiments The multiple alignment genome sequence matching processor may be pipelined (not shown) such that the results from the individual nucleotide comparisons are immediately buffered. The contents of the data and prefetch vectors can then be shifted and compared in the next cycle, while the consolidating AND trees operate on buffered results from the last round of comparisons. Pipelining has the possibility of speeding the processor considerably, as the compare, AND, choose alignment critical path is eliminated. The new critical path should reside in either the XNOR comparison gate or the consolidating AND, both of which are simple and therefore fast hardware operations. Operation—FIGS. 1 and 5 Variables Used P—the number of simultaneous alignments tested. K—the number of elements checked by each alignment. To use the preferred embodiment of the multiple alignment genome sequence matching processor, a genome segment is fetched from memory by an external controller and placed in data vector 10. An adjacent and contiguous segment of the genome is placed into prefetch vector 14. A DNA fragment constituting a search or query pattern against the genome is placed into query vector 12. In theory, there is no limitation on the size of these vectors. In practice, the size is dictated largely by the fetch size of the memory architecture. In a conventional memory architecture, a 32-bit word fetch is common. Since the four letters of DNA—ATGC—can be encoded in only two bits, this 32-bit word fetch contains K=16 nucleotides. This is close to the match length of interest for many bioinformatics algorithms, making the sequence matching processor a good fit for conventional memory architecture. The preferred encoding of the nucleotides is as shown in table 1, where complementary nucleotide pairs, A-T and G-C, are given complementary bit codes. During the initialization process, incrementing index 160 which keeps track of the current position in the genome is zeroed. After data has been initially placed in the K-element data, prefetch, and query vectors, comparison can begin. Comparison engine 40 uses P sets of XNOR gates to check P alignments of K elements each in parallel. This leads to a total of K×P nucleotide comparisons per cycle, and is therefore much faster than a sequential comparison. Results of the individual element comparisons 270 are sent along buses to consolidating AND trees 80. The output of each AND tree determines if there is a match for the associated alignment. A match between the genome and query vectors is an exact match of K nucleotides. In biology, this is called a K-mer "hit." Matching small segments—that is, finding "hits"—allows for the deletions, insertions, substitutions, and errors which are prevalent in DNA. For example, an insertion may be flanked on both sides by a hit. The offset between the locations of these hits gives the size of that insertion. The same is true of substitutions and deletions. Matching small regions of DNA is the core computational task of many genetic processing algorithms. These hits are then combined, weighted, and correlated for higher level genetic analysis. AND tree outputs 280 are fed into choose alignment block 150. The choose alignment block asserts write enable signal 250 of external memory storage if any of the AND tree outputs are asserted, indicating a match or "hit" for some alignment. The choose alignment block also outputs the alignment number of the match to least significant bits 252 of index bus 260. For example, if AND align 180a introduced in FIG. 1 indicated a hit, least significant index bit 252 would be set to a binary 1. In an embodiment which compared four alignments, if the alignment 3 matched, the least significant index bits would be set to 11, the binary equivalent of 3. In order to facilitate the conversion of alignment number to its binary equivalent, the number of simultaneous alignments P checked by comparison engine 40 should be a power of two. An incrementor 160 keeps track of the current location in the genome being searched. Each time the genome residing in data vector 10 and prefetch vector 14 is shifted, the index increments. This incrementor is used to supply the most significant bits of a match location to index bus 260, the least significant bits being determined by the choose alignment block. When a match occurs, the value of index bus 260 specifies the location (index) of that match within the genome. This value is written into external memory when write enable signal 250 is asserted. When comparison engine 40 has completed its element comparisons, data vector 10 and prefetch vector 14 are shifted to the left by P elements. During a shift, elements from the left side of the prefetch vector are moved into the right side of the data vector. This ensures that the data vector is always filled with valid data elements. The result of this arrangement is that a memory fetch is needed only every K/P cycles in order to ensure that the elements being compared are valid. This point is illustrated in detail in the time series FIGS. 5A, 5B, and 5C. Here, the data and prefetch vectors are filled with contiguous alphanumeric characters for ease of example. FIG. 5A shows the data and prefetch vectors filled with characters A through H. Also shown is cross-comparison wiring network 202 which leads to the comparison engine. Each of the elements connected to this wiring network must be valid at all times for a valid comparison between genome and fragment to be made. A blank box will be used to indicate an invalid element position. FIG. 5B shows the data and prefetch vectors after a cycle of operation. Here, the contents of the vectors have been shifted to the left by two elements (i.e. the number of alignments checked in this embodiment). At this time, the rightmost two elements of the prefetch vector are invalid. However, since all of the elements connected to the comparison engine are valid, no new data needs to be input from memory. During this time, next input signals 240 may either be waiting or in the process of being fetched. FIG. 5C shows the data and prefetch vectors after another cycle of operation. The contents of the vectors have again been shifted to the left by two elements. New elements have been accepted into the prefetch vector to avoid an invalid comparison. The acceptance of new values into the prefetch vector can be triggered by a counter or external signal. It can be seen from this example where K=4 and P=2, that a memory fetch is required every K/P=2 cycles so that all data being compared is valid. By alleviating the need to fetch every cycle, the system increases memory efficiency, a crucial concern in conventional computing where memory is often the primary bottleneck. The overall operation of the sequence matching processor is to find exact matches of predetermined length between a search sequence and query sequence through a shift and compare loop. At each shift position, multiple alignments between search and query sequence are checked in parallel, allowing accelerated search speed. If the match length defined by hardware is shorter than the desired match length, the matching processor may be operated as follows. Once a match of length K between the search and query sequences has been found, the adjacent sections of both sequences are loaded into the data and query vectors respectively. The matching processor operates as usual, indicating a hit (match) or miss for this new data. In the case of a hit at this location, the total match length (inclusive of the prior match) now exceeds K. This procedure may be repeated and used with the variable match length feature to provide for any match length desired. Conclusion, Ramifications, and Scope Thus the multiple alignment genome sequence matching processor meets the requisite objects and advantages previously put forth:
Although the description above contains many specificities, these should not be construed as limiting the scope of the multiple alignment genome sequence matching processor, but as merely providing illustrations of some of the presently preferred embodiments. For example, the size of the genome and query fragment vectors can vary according to implementation; the number of simultaneous alignments checked can vary; a variety of incrementing blocks may be used to keep track of the current position in the genome; strings with an alphabet greater than four letters can be matched instead of genomes, etc. Thus the scope of the invention should be determined by the appended claims and their legal equivalents, rather than by the examples given above.
|
Same subclass Same class Consider this |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
