Pattern matching access

Method and apparatus for identifying, classifying, or quantifying DNA sequences in a sample without sequencing

6141657

Abstract

This invention provides methods by which biologically derived DNA sequences in a mixed sample or in an arrayed single sequence clone can be determined and classified without sequencing. The methods make use of information on the presence of carefully chosen target subsequences, typically of length from 4 to 8 base pairs, and preferably the length between target subsequences in a sample DNA sequence together with DNA sequence databases containing lists of sequences likely to be present in the sample to determine a sample sequence. The preferred method uses restriction endonucleases to recognize target subsequences and cut the sample sequence. Then carefully chosen recognition moieties are ligated to the cut fragments, the fragments amplified, and the experimental observation made. Polymerase chain reaction (PCR) is the preferred method of amplification. Another embodiment of the invention uses information on the presence or absence of carefully chosen target subsequences in a single sequence clone together with DNA sequence databases to determine the clone sequence. Computer implemented methods are provided to analyze the experimental results and to determine the sample sequences in question and to carefully choose target subsequences in order that experiments yield a maximum amount of information.


Claims

What is claimed is:

1. A programmable apparatus for analyzing signals comprising:

(a) an inputting device for inputting one or more actual signals generated by probing a sample comprising a plurality of nucleic acids with recognition means, each recognition means recognizing a target nucleotide subsequence or a set of target nucleotide subsequences, said signals comprising a representation of (i) the length between occurrences of said target subsequences in a nucleic acid of said sample, and (ii) the identities of said target subsequences in said nucleic acid, or the identities of said sets of target subsequences among which is included the target subsequences in said nucleic acid;

(b) a searching device operatively coupled to said inputting device for searching a sequence in a nucleotide sequence database for occurrences of said target subsequences or target subsequences that are members of said sets of target subsequences, and for the length between such occurrences, said database comprising a plurality of known nucleotide sequences that may be present in said sample;

(c) a comparing device operatively coupled to said inputting device and to said searching device for finding a match between said one or more actual signals and a sequence in said database, said one or more actual signals matching a sequence from said database when the sequence from said database has both (i) the same length between occurrences of target subsequences as is represented by said one or more actual signals, and (ii) the same target subsequences as are represented by said one or more actual signals, or target subsequences that are members of the sets of target subsequences represented by said one or more actual signals; and

(d) a control device operatively coupled to said comparing device for causing said comparing to be done for sequences in the database and for outputting those database sequences that match said one or more actual signals.

2. The programmable apparatus of claim 1 wherein said searching device searches for said target subsequences or a set of target nucleotide subsequences in said database sequences by performing a string comparison of the nucleotides in said subsequences with those in said database sequence.

3. The programmable apparatus of claim 1 wherein said control device further comprises causing said searching device to search all sequences in said database in order to determine a pattern of signals that can be generated by probing said sample with said recognition means, and wherein said control device further causes said comparing device to find any matches between said one or more actual signals and said pattern of signals, said one or more actual signals matching a signal in said pattern of signals when the signal from said pattern represents (i) the same length between occurrences of target subsequences as is represented by said one or more actual signals, and (ii) the same target subsequences as are represented by said one or more actual signals, or target subsequences that are members of the sets of target subsequences represented by said one or more actual signals.

4. The programmable apparatus of claim 1 wherein said sample of nucleic acids comprises cDNA of RNA of a cell or tissue type, and said database comprises DNA sequences that are likely to be expressed by said cell or tissue type.

5. A computer readable memory containing computer code executable to direct a programmable apparatus to function for analyzing signals according to steps comprising:

(a) inputting one or more actual signals generated by probing a sample comprising a plurality of nucleic acids with recognition means, each recognition means recognizing a target nucleotide subsequence or a set of target nucleotide subsequences, said signals comprising a representation of (i) the length between occurrences of said target subsequences in a nucleic acid of said sample, and (ii) the identities of said target subsequences in said nucleic acid, or the identities of said sets of target subsequences among which is included the target subsequences in said nucleic acid;

(b) searching a sequence in a nucleotide sequence database for occurrences of said target subsequences or target subsequences that are members of said sets of target subsequences, and for the length between such occurrences, said database comprising a plurality of known nucleotide sequences that may be present in said sample;

(c) matching said one or more actual signals and a sequence in said database when the sequence in said database has both (i) the same length between occurrences of target subsequences as is represented by said one or more actual signals and (ii) the same target subsequences as are represented by said one or more actual signals, or target subsequences that are members of the sets of target subsequences as are represented by said one or more actual signals; and

(d) repetitively performing said searching and matching steps for the majority of sequences in the database and outputting those database sequences that match said one or more actual signals.

6. A programmable apparatus for selecting target subsequences comprising:

(a) an initial selection device for selecting initial target subsequences or initial sets of target subsequences;

(b) a first control device;

(c) a search device operatively coupled to said initial selection device and to said first control device (i) for searching sequences in a nucleotide sequence database for occurrences of said initial target subsequences or occurrences of target subsequences that are members of said initial sets of target subsequences and for the length between such occurrences, and (ii) for determining an initial pattern of signals that can be generated from said selected initial target subsequences or said initial sets of target subsequences, said database comprising a plurality of known nucleotide sequences, said signals comprising a representation of (i) the length between said occurrences in a sequence in said database, and (ii) the identities of said initial target subsequences that occur in said sequence in said database, or the identities of target subsequences that are members of the initial sets of target subsequences that occur in said sequence in said database; and

(d) an ascertaining device operatively coupled to said searching device and to said first control device for ascertaining the value of said determined initial pattern according to an information measure; and wherein

said first control device causes further target subsequences to be selected and causes the search device to determine a further pattern of signals and the ascertaining device to ascertain a further value of said information measure and accepts the further target subsequences when said further pattern optimizes said further value of said information measure.

7. The programmable apparatus of claim 6 wherein a predetermined one or more of the sequences in said database are of interest, and wherein said ascertaining device ascertains the value of an information measure by counting the number of such sequences of interest which generate in said determined pattern at least one signal that is not generated by any other sequence in said database.

8. The programmable apparatus of claim 7 wherein said one or more of the sequences of interest comprise substantially all the sequences in said database.

9. The programmable apparatus of claim 6 wherein said first control device optimizes the value of said information measure according to a method of exhaustive search, wherein said first control device selects further target subsequences of length less than approximately 10 and accepts the further target subsequences if said further value of said information measure is greater than the previous value.

10. The programmable apparatus of claim 6 wherein said first control device optimizes the value of said information measure according to a method comprising simulated annealing, wherein said first control device repeatedly selects further target subsequences and accepts the further target subsequences if said further value of said information measure is not decreased by greater than a probabilistic factor dependent on a simulated-temperature, and wherein said programmable apparatus further comprises a second control device operatively coupled to said first control device for decreasing said simulated-temperature as said first control device selects further target subsequences.

11. The programmable apparatus of claim 10 wherein said probabilistic factor is an exponential function of the negative of the decrease in the information measure divided by said simulated-temperature.

12. The programmable apparatus of claim 6 wherein said database comprises a majority of known DNA sequences that are likely to be expressed in one or more cell types.

13. A computer readable memory containing computer code executable to direct a programmable apparatus to function for selecting target subsequences according to steps comprising:

(a) selecting initial target subsequences or initial sets of target subsequences;

(b) searching a sequence in a nucleotide sequence database for occurrences of said initial target subsequences or occurrences of target subsequences that are members of said initial sets of target subsequences and for the length between such occurrences, said database comprising a plurality of known nucleotide sequences that may be present in said sample;

(c) determining an initial pattern of signals that can be generated from said selected initial target subsequences or said initial sets of target subsequences, said signals comprising a representation of (i) the length between said occurrences in a sequence in said database, and (ii) the identities of said initial target subsequences that occur in said sequence in said database, or the identities of target subsequences that are members of the initial sets of target subsequences that occur in said sequence in said database; and

(d) ascertaining the value of said determined initial pattern according to an information measure; and

(e) repetitively performing said selecting, searching, determining, and ascertaining steps to determine a further pattern of signals and a further value of said information measure, and accepting the further target subsequences when said further pattern optimizes said further value of said information measure.

14. A programmable apparatus for displaying data comprising:

(a) a selecting device for selecting target subsequences or sets of target subsequences, such that recognition means for recognizing said target subsequences or said sets of target subsequences can be used to generate signals by probing a sample comprising a plurality of nucleic acids, said signals comprising a representation of (i) the length between occurrences of said target subsequences in a nucleic acid of said sample and (ii) the identities of said target subsequences in said nucleic acid or the identities of said sets of target subsequences among which are included the target subsequences in said nucleic acid;

(b) an inputting device for inputting one or more actual signals generated by probing said sample with said recognition means;

(c) an analyzing device for analyzing signals operatively coupled to said selecting and inputting devices that determines which sequences in a nucleotide sequence database can generate said actual signals when subject to said recognition means, said database comprising a plurality of known nucleotide sequences that may be present in said sample;

(d) an input/output device operatively coupled to said selecting, inputting, and analyzing devices that inputs user requests and controls the selecting device to select target subsequences or sets of target subsequences, controls the inputting device to accept actual signals, controls the analyzing device to find the sequences in said database that can generate said actual signals, and displays output comprising said actual signals and said sequences in said database that can generate said actual signals.

15. The programmable apparatus of claim 14 wherein said sample is a cDNA sample prepared from a tissue specimen, and the apparatus further comprises a storage device operatively coupled to the input/output device for storing indications of the origin of said tissue specimen and information concerning said tissue specimen, and wherein said indications can be displayed upon user input.

16. The programmable apparatus of claim 15 wherein the indications and information concerning said tissue specimen comprise histological information comprising tissue images.

17. The programmable apparatus of claim 14 further comprising:

(a) one or more instrument devices for probing said sample with said recognition means and for generating said actual signals; and

(b) a control device operatively coupled to said one or more instrument devices and to said input/output device for controlling the operation of said instrument devices,

wherein said user can input control commands for control of said instrument devices and receive output concerning the status of said instrument devices.

18. The programmable apparatus of claim 17 wherein the one or more instrument devices are capable of automatic operation, whereby the probing and generating can be performed without manual intervention.

19. The programmable apparatus of claim 14 wherein one or more of said selecting, inputting, analyzing, and input/output devices are physically collocated with each other.

20. The programmable apparatus of claim 14 wherein one or more of said selecting, inputting, analyzing, and input/output devices are physically spaced apart from each other and are connected by a communication medium for exchanges of commands and information.

21. A computer readable memory containing computer code executable to direct a programmable apparatus to function for displaying data according to steps comprising:

(a) selecting target subsequences or sets of target subsequences, such that recognition means for recognizing said target subsequences or said sets of target subsequences can be used to generate signals by probing a sample comprising a plurality of nucleic acids, said signals comprising a representation of (i) the length between occurrences of said target subsequences in a nucleic acid of said sample and (ii) the identities of said target subsequences in said nucleic acid or the identities of said sets of target subsequences among which are included the target subsequences in said nucleic acid;

(b) inputting one or more actual signals generated by probing said sample with said recognition means;

(c) analyzing said one or more actual signals to determine which sequences in a nucleotide sequence database can generate said actual signals when subject to said recognition means, said database comprising a plurality of known nucleotide sequences that may be present in said sample; and

(d) inputting user requests to control said selecting step to select target subsequences or sets of target subsequences, said inputting step to input actual signals, and said analyzing step to find the sequences in said database that can generate said actual signals, and outputting in response to further user requests information comprising said actual signals and said sequences in said database that can generate said actual signals.


Description

1. FIELD OF THE INVENTION

The field of this invention is DNA sequence classification, identification or determination, and quantification; more particularly it is the quantitative classification, comparison of expression, or identification of preferably all DNA sequences or genes in a sample without performing any sequencing.

2. BACKGROUND

Over the past ten years, as biological and genomic research have revolutionized our understanding of the molecular basis of life, it has become increasingly clear that the temporal and spatial expression of genes is responsible for all life's processes, processes occurring in both health and in disease. Science has progressed from an understanding of how single genetic defects cause the traditionally recognized hereditary disorders, such as the thalassemias, to a realization of the importance of the interaction of multiple genetic defects along with environmental factors in the etiology of the majority of more complex disorders, such as cancer. In the case of cancer, current scientific evidence demonstrates the key causative roles of altered expression of and multiple defects in several pivotal genes. Other complex diseases have similar etiology. Thus the more complete and reliable a correlation that can be established between gene expression and health or disease states, the better diseases can be recognized, diagnosed and treated.

This important correlation is established by the quantitative determination and classification of DNA expression in tissue samples, and a rapid and economical method for doing so would be of considerable value. Genomic DNA ("gDNA") sequences are those naturally occurring DNA sequences constituting the genome of a cell. The state of gene, or gDNA, expression at any time is represented by the composition of total cellular messenger RNA ("mRNA"), which is synthesized by the regulated transcription of gDNA. Complementary DNA ("cDNA") sequences are synthesized by reverse transcription from mRNA. cDNA from total cellular mRNA also represents, albeit approximately, gDNA expression in a cell at a given time. Consequently, detection of all the DNA sequences in particular cDNA or gDNA samples is desired, particularly if such detection is rapid, economical, precise, and quantitative.

Heretofore, gene specific DNA analysis techniques have not been directed to the determination or classification of substantially all genes in a DNA sample representing total cellular mRNA and have required some degree of sequencing. Generally, existing cDNA, and also gDNA, analysis techniques have been directed to the determination and analysis of one or two known or unknown genetic sequences at one time. These techniques have used probes synthesized to specifically recognize by hybridization only one particular DNA sequence or gene. (See, e.g., Watson et al., 1992, Recombinant DNA, chap 7, W. H. Freeman, New York.) Further, adaptation of these methods to the problem of recognizing all sequences in a sample would be cumbersome and uneconomical.

One existing method for finding and sequencing unknown genes starts from an arrayed cDNA library. From a particular tissue or specimen, mRNA is isolated and cloned into an appropriate vector, which is then plated in a manner so that the progeny of individual vectors bearing the clone of one cDNA sequence can be separately identified. A replica of such a plate is then probed, often with a labeled DNA oligomer selected to hybridize with the cDNA representing the gene of interest. Thereby, those colonies bearing the cDNA of interest are found and isolated, the cDNA harvested and subject to sequencing. Sequencing can then be done by the Sanger dideoxy chain termination method (Sanger et al., 1977, "DNA sequencing with chain terminating inhibitors", Proc. Natl. Acad. Sci. USA 74(12):5463-5467) applied to inserts so isolated.

The DNA oligomer probes for the unknown gene used for colony selection are synthesized to hybridize, preferably, only with the cDNA for the gene of interest. One manner of achieving this specificity is to start with the protein product of the gene of interest. If a partial sequence of a 5- to 10-mer peptide fragment from an active region of this protein can be determined, corresponding 15- to 30-mer degenerate oligonucleotides can be synthesized which code for this peptide. This collection of degenerate oligonucleotides will typically be sufficient to uniquely identify the corresponding gene. Similarly, any information leading to nucleotide subsequences 15 to 30 bases long can be used to create a single gene probe.

Another existing method, which searches for a known gene in a cDNA or gDNA prepared from a tissue sample, also uses single gene or single sequence probes which are complementary to unique subsequences of the already known gene sequences. For example, the expression of a particular oncogene in a sample can be determined by probing tissue derived cDNA with a probe derived from a subsequence of the oncogene's expressed sequence tag. Similarly the presence of a rare or difficult to culture pathogen, such as the TB bacillus or HIV, can be determined by probing gDNA with a hybridization probe specific to a gene of the pathogen. The heterozygous presence of a mutant allele in a phenotypically normal individual, or its homozygous presence in a fetus, can be determined by probing with an allele specific probe complementary only to the mutant allele (See, e.g., Guo et al., 1994, Nucleic Acids Research, 22:5456-65).

All existing methods using single gene probes, of which the preceding examples are typical, if applied to determine all genes expressed in a given tissue sample, would require many thousands to tens of thousands of individual probes. It is estimated that a single human cell typically expresses approximately 15,000 genes simultaneously and that the most complex tissue, e.g., the brain, can express up to half the human genome (Liang et al., 1992, "Differential Display of Eukaryotic Messenger RNA by Means of the Polymerase Chain Reaction", Science, 257:967-971). Such an application requiring such a number of probes is clearly too cumbersome to be economic or, even, practical.

Another class of existing methods, known as sequencing by hybridization ("SBH"), in contrast, uses combinatorial probes which are not gene specific (Drmanac et al., 1993, Science, 260:1649-52; U.S. Pat. No. 5,202,231, Apr. 13, 1993, to Drmanac et al.). An exemplary implementation of SBH to determine an unknown gene requires that a single cDNA clone be probed with all DNA oligomers of a given length, say, for example, all 6-mers. Such a set of all oligomers of a given length synthesized without any selection is called a combinatorial probe library. From knowledge of all hybridization results for a combinatorial library, say all the 4096 6-mer probe results, a partial DNA sequence for the cDNA clone can be reconstructed by algorithmic manipulations. Complete sequences are not determinable because, at least, repeated subsequences cannot be fully determined. SBH adapted to the classification of known genes is called oligomer sequence signatures ("OSS") (Lennon et al., 1991, Trends In Genetics, 7(10):314-317). This technique classifies a single clone based on the pattern of probe hits against an entire combinatorial library, or a significant sub-library. It requires that the tissue sample library be arrayed into clones, each clone comprising only one pure sequence from the library. It cannot be applied to mixtures.
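The size of a combinatorial probe library and the ambiguity caused by repeats can be illustrated with a short Python sketch. It is an illustration only, not part of the methods described here: the function names `combinatorial_library` and `hit_pattern` are hypothetical, and hybridization is idealized as exact substring matching.

```python
from itertools import product


def combinatorial_library(k):
    """Return every DNA oligomer of length k (4**k sequences in total)."""
    return ["".join(bases) for bases in product("ACGT", repeat=k)]


def hit_pattern(clone, probes):
    """Return the set of probes found in the clone.

    Assumption: a probe "hits" when it occurs as an exact substring,
    which ignores mismatches and strand orientation.
    """
    return {probe for probe in probes if probe in clone}


library = combinatorial_library(6)
print(len(library))  # 4096, i.e. 4**6

# Two clones that differ only in the number of repeat units give the
# same hit pattern, so the repeat length cannot be recovered.
short_repeat = "ACACACAC"
long_repeat = "ACACACACAC"
print(hit_pattern(short_repeat, library) == hit_pattern(long_repeat, library))  # True
```

Both repeat clones hit only the probes ACACAC and CACACA, which is one way to see why a complete sequence cannot always be reconstructed from hit patterns alone.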

These exemplary existing methods are all directed to finding one sequence in an array of clones each expressing a single sequence from a tissue sample. They are not directed to rapid, economical, quantitative, and precise characterization of all the DNA sequences in a mixture of sequences, such as a particular total cellular cDNA or gDNA sample. Their adaptation to such a task would be prohibitive. Determination by sequencing the DNA of a clone, much less an entire sample of thousands of sequences, is not rapid or inexpensive enough for economical and useful diagnostics. Existing probe-based techniques of gene determination or classification, whether the genes are known or unknown, require many thousands of probes, each specific to one possible gene to be observed, or at least thousands or even tens of thousands of probes in a combinatorial library. Further, all of these methods require the sample be arrayed into clones each expressing a single gene of the sample.

In contrast to the prior exemplary existing gene determination and classification techniques, another existing technique, known as differential display, attempts to fingerprint a mixture of expressed genes, as is found in a pooled cDNA library. This fingerprint, however, seeks merely to establish whether two samples are the same or different. No attempt is made to determine the quantitative, or even qualitative, expression of particular, determined genes (Liang et al., 1995, Current Opinion in Immunology 7:274-280; Liang et al., 1992, Science 257:967-71; Welsh et al., Nucleic Acids Res., 1992, 20:4965-70; McClelland et al., 1993, Exs, 67:103-15; Lisitsyn, 1993, Science, 259:946-50). Differential display uses the polymerase chain reaction ("PCR") to amplify DNA subsequences of various lengths, which are defined by being between the hybridization sites of arbitrarily selected primers. Ideally, the pattern of lengths observed is characteristic of the tissue from which the library was prepared. Typically, one primer used in differential display is oligo(dT) and the other is one or more arbitrary oligonucleotides designed to hybridize within a few hundred base pairs of the poly-dA tail of a cDNA in the library. Thereby, on electrophoretic separation, the amplified fragments of lengths up to a few hundred base pairs should generate bands characteristic and distinctive of the sample. Changes in tissue gene expression may be observed as changes in one or more bands.

Although characteristic banding patterns develop, no attempt is made to link these patterns to the expression of particular genes. The second arbitrary primer cannot be traced to a particular gene. First, the PCR process is less than ideally specific. One to a few base pair ("bp") mismatches ("bubbles") are permitted by the lower stringency annealing step typically used and are tolerated well enough so that a new chain can be initiated by the Taq polymerase often used in PCR reactions. Second, the location of a single subsequence or its absence is insufficient information to distinguish all expressed genes. Third, length information from the arbitrary primer to the poly-dA tail is generally not found to be characteristic of a sequence due to variations in the processing of the 3' untranslated regions of genes, the variation in the poly-adenylation process and variability in priming to the repetitive sequence at a precise point. Thus, even the bands that are produced often are smeared by the non-specific background sequences present. Also, known PCR biases toward high G+C content and short sequences further limit the specificity of this method. Thus this technique is generally limited to "fingerprinting" samples for a similarity or dissimilarity determination and is precluded from use in quantitative determination of the differential expression of identifiable genes.

Existing methods for gene or DNA sequence classification or determination are in need of improvement in their ability to perform rapid and economical as well as quantitative and specific determination of the components of a cDNA mixture prepared from a tissue sample. The preceding background review identifies the deficiencies of several exemplary existing methods.

3. SUMMARY OF THE INVENTION

It is an object of this invention to provide methods for rapid, economical, quantitative, and precise determination or classification of DNA sequences, in particular genomic or complementary DNA sequences, in either arrays of single sequence clones or mixtures of sequences such as can be derived from tissue samples, without actually sequencing the DNA. Thereby, the deficiencies in the background arts just identified are solved. This object is realized by generating a plurality of distinctive and detectable signals from the DNA sequences in the sample being analyzed. Preferably, all the signals taken together have sufficient discrimination and resolution so that each particular DNA sequence in a sample may be individually classified by the particular signals it generates, and with reference to a database of DNA sequences possible in the sample, individually determined. The intensity of the signals indicative of a particular DNA sequence depends quantitatively on the amount of that DNA present. Alternatively, the signals together can classify a predominant fraction of the DNA sequences into a plurality of sets of approximately no more than two to four individual sequences.

It is a further object that the numerous signals be generated from measurements of the results of as few recognition reactions as possible, preferably no more than approximately 5-400 reactions, and most preferably no more than approximately 20-50 reactions. Rapid and economical determinations would not be achieved if each DNA sequence in a sample containing a complex mixture required a separate reaction with a unique probe. Preferably, each recognition reaction generates a large number of or a distinctive pattern of distinguishable signals, which are quantitatively proportional to the amount of the particular DNA sequences present. Further, the signals are preferably detected and measured with a minimum number of observations, which are preferably capable of simultaneous performance.

The signals are preferably optical, generated by fluorochrome labels and detected by automated optical detection technologies. Using these methods, multiple individually labeled moieties can be discriminated even though they are in the same filter spot or gel band. This permits multiplexing reactions and parallelizing signal detection. Alternatively, the invention is easily adaptable to other labeling systems, for example, silver staining of gels. In particular, any single molecule detection system, whether optical or by some other technology such as scanning or tunneling microscopy, would be highly advantageous for use according to this invention as it would greatly improve quantitative characteristics.

According to this invention, signals are generated by detecting the presence (hereinafter called "hits") or absence of short DNA subsequences (hereinafter called "target" subsequences) within a nucleic acid sequence of the sample to be analyzed. The presence or absence of a subsequence is detected by use of recognition means, or probes, for the subsequence. The subsequences are recognized by recognition means of several sorts, including but not limited to restriction endonucleases ("REs"), DNA oligomers, and PNA oligomers. REs recognize their specific subsequences by cleavage thereof; DNA and PNA oligomers recognize their specific subsequences by hybridization methods. The preferred embodiment detects not only the presence of pairs of hits in a sample sequence but also includes a representation of the length in base pairs between adjacent hits. This length representation can be corrected to true physical length in base pairs upon removing experimental biases and errors of the length separation and detection means. An alternative embodiment detects only the pattern of hits in an array of clones, each containing a single sequence ("single sequence clones").
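The idea of a signal as a pair of adjacent hits plus the length between them can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the experimental procedure: `find_hits` and `signals` are hypothetical names, recognition is modeled as exact string matching on one strand, and length is taken as the distance between the start positions of adjacent hits. The two target subsequences used are the recognition sites of the restriction endonucleases EcoRI (GAATTC) and BamHI (GGATCC).

```python
def find_hits(sequence, targets):
    """Return sorted (position, target) pairs for every exact occurrence."""
    hits = []
    for target in targets:
        start = sequence.find(target)
        while start != -1:
            hits.append((start, target))
            start = sequence.find(target, start + 1)
    return sorted(hits)


def signals(sequence, targets):
    """Return the set of (left_target, right_target, length) signals.

    Assumption: length is the distance in bases between the start
    positions of two adjacent hits; a real measurement would also
    carry experimental bias and error.
    """
    hits = find_hits(sequence, targets)
    return {
        (left, right, right_pos - left_pos)
        for (left_pos, left), (right_pos, right) in zip(hits, hits[1:])
    }


# Toy sequence built from labeled pieces so the positions are easy to check.
sample = "AA" + "GAATTC" + "CCCC" + "GGATCC" + "TTT" + "GAATTC" + "AA"
targets = ("GAATTC", "GGATCC")

print(find_hits(sample, targets))
# [(2, 'GAATTC'), (12, 'GGATCC'), (21, 'GAATTC')]
print(sorted(signals(sample, targets)))
# [('GAATTC', 'GGATCC', 10), ('GGATCC', 'GAATTC', 9)]
```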

The generated signals are then analyzed together with DNA sequence information stored in sequence databases in computer implemented experimental analysis methods of this invention to identify individual genes and their quantitative presence in the sample.

The target subsequences are chosen by further computer implemented experimental design methods of this invention such that their presence or absence and their relative distances when present yield a maximum amount of information for classifying or determining the DNA sequences to be analyzed. Thereby it is possible to have orders of magnitude fewer probes than there are DNA sequences to be analyzed, and it is further possible to have considerably fewer probes than would be present in combinatorial libraries of the same length as the probes used in this invention. For each embodiment, target subsequences have a preferred probability of occurrence in a sequence, typically between 5% and 50%. In all embodiments, it is preferred that the presence of one probe in a DNA sequence to be analyzed is independent of the presence of any other probe.
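A minimal sketch of how the preferred probability of occurrence might be screened against a sequence database follows. It assumes the database is a simple mapping of names to sequence strings and defines probability of occurrence as the fraction of database sequences containing the target at least once; both choices, and all names, are illustrative assumptions.

```python
def occurrence_probability(target, database):
    """Fraction of database sequences that contain the target at least once."""
    if not database:
        raise ValueError("database must not be empty")
    containing = sum(1 for seq in database.values() if target in seq)
    return containing / len(database)


def targets_in_preferred_range(candidates, database, low=0.05, high=0.50):
    """Keep candidates whose probability of occurrence lies in [low, high]."""
    return [
        t for t in candidates
        if low <= occurrence_probability(t, database) <= high
    ]


# Toy database (hypothetical sequences).
toy_db = {
    "g1": "AAGAATTCTT",
    "g2": "CCGGATCCGG",
    "g3": "TTTTTTTTTT",
    "g4": "ACGTACGTAC",
}
kept = targets_in_preferred_range(["GAATTC", "TT", "CATG", "A"], toy_db)
```

This sketch does not test the independence preference stated above; that would require comparing joint and individual occurrence frequencies for pairs of targets.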

Preferably, target subsequences are chosen based on information in relevant DNA sequence databases that characterize the sample. A minimum number of target subsequences may be chosen to determine the expression of all genes in a tissue sample ("tissue mode"). Alternatively, a smaller number of target subsequences may be chosen to quantitatively classify or determine only one or a few sequences of genes of interest, for example oncogenes, tumor suppressor genes, growth factors, cell cycle genes, cytoskeletal genes, etc ("query mode").

A preferred embodiment of the invention, named quantitative expression analysis ("QEA"), produces signals comprising target subsequence presence and a representation of the length in base pairs along a gene between adjacent target subsequences by measuring the results of recognition reactions on cDNA (or gDNA) mixtures. Of great importance, this method does not require the cDNA be inserted into a vector to create individual clones in a library. Creation of these libraries is time consuming, costly, and introduces bias into the process, as it requires the cDNA in the vector to be transformed into bacteria, the bacteria arrayed as clonal colonies, and finally the growth of the individual transformed colonies.

Three exemplary experimental methods are described herein for performing QEA: a preferred method utilizing a novel RE/ligase/amplification procedure; a PCR based method; and a method utilizing a removal means, preferably biotin, for removal of unwanted DNA fragments. The preferred method generates precise, reproducible, noise free signatures for determining individual gene expression from DNA in mixtures or libraries and is uniquely adaptable to automation, since it does not require intermediate extractions or buffer exchanges. A computer implemented gene calling step uses the hit and length information measured in conjunction with a database of DNA sequences to determine which genes are present in the sample and the relative levels of expression. Signal intensities are used to determine relative amounts of sequences in the sample. Computer implemented design methods optimize the choice of the target subsequences.

A second specific embodiment of the invention, termed colony calling ("CC"), gathers only target subsequence presence information for all target subsequences for arrayed, individual single sequence clones in a library, with cDNA libraries being preferred. The target subsequences are carefully chosen according to computer implemented design methods of this invention to have a maximum information content and to be minimum in number. Preferably from 10-20 subsequences are sufficient to characterize the expressed cDNA in a tissue. In order to increase the specificity and reliability of hybridization to the typically short DNA subsequences, preferable recognition means are PNAs. Degenerate sets of longer DNA oligomers having a common, short, shared, target sequence can also be used as a recognition means. A computer implemented gene calling step uses the pattern of hits in conjunction with a database of DNA sequences to determine which genes are present in the sample and the relative levels of expression.
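The gene calling step for colony calling can be pictured as a lookup from a pattern of hits to the database sequences that would produce it. The sketch below assumes error-free presence/absence observations and a database held as a dictionary of strings; the function names and toy sequences are hypothetical.

```python
from collections import defaultdict


def hit_pattern(sequence, targets):
    """Presence (True) or absence (False) of each target in the sequence."""
    return tuple(target in sequence for target in targets)


def build_pattern_index(database, targets):
    """Map each possible hit pattern to the database sequences that produce it."""
    index = defaultdict(list)
    for name, seq in database.items():
        index[hit_pattern(seq, targets)].append(name)
    return dict(index)


def call_colony(observed_pattern, index):
    """Return the database sequences consistent with an observed hit pattern."""
    return index.get(tuple(observed_pattern), [])


# Toy database and target set (hypothetical).
cc_db = {"geneA": "AAGAATTCTT", "geneB": "CCGGATCCGG", "geneC": "GAATTCGGATCC"}
cc_index = build_pattern_index(cc_db, ["GAATTC", "GGATCC"])
```

An empty result corresponds to a pattern that no database sequence produces, which the specification treats as a possible indication of an as yet unknown gene.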

The embodiments of this invention preferably generate measurements that are precise, reproducible, and free of noise. Measurement noise in QEA is typically created by generation or amplification of unwanted DNA fragments, and special steps are preferably taken to avoid any such unwanted fragments. Measurement noise in colony calling is typically created by mis-hybridization of probes, or recognition means, to colonies. High stringency reaction conditions and DNA mimics with increased hybridization specificity may be used to minimize this noise. DNA mimics are polymers composed of subunits capable of specific, Watson-Crick-like hybridization with DNA. Also useful to minimize noise in colony calling are improved hybridization detection methods. Instead of the conventional detection methods based on probe labeling with fluorochromes, new methods are based on light scattering by small 100-200 μm particles that are aggregated upon probe hybridization (Stimson et al., 1995, "Real-time detection of DNA hybridization and melting on oligonucleotide arrays by using optical wave guides", Proc. Natl. Acad. Sci. USA, 92:6379-6383). In this method, the hybridization surface forms one surface of a light pipe or optical wave guide, and the scattering induced by these aggregated particles causes light to leak from the light pipe. In this manner hybridization is revealed as an illuminated spot of leaking light on a dark background. This latter method makes hybridization detection more rapid by eliminating the need for a washing step between the hybridization and detection steps. Further by using variously sized and shaped particles with different light scattering properties, multiple probe hybridizations can be detected from one colony.

Further, the embodiments of the invention can be adapted to automation by eliminating non-automatable steps, such as extractions or buffer exchanges. The embodiments of the invention facilitate efficient analysis by permitting multiple recognition means to be tested in one reaction and by utilizing multiple, distinguishable labeling of the recognition means, so that signals may be simultaneously detected and measured. Preferably, for the QEA embodiments, this labeling is by multiple fluorochromes. For the CC embodiments, detection is preferably done by the light scattering methods with variously sized and shaped particles.

An increase in sensitivity as well as an increase in the number of resolvable fluorescent labels can be achieved by the use of fluorescent, energy transfer, dye-labeled primers. Other detection methods, preferable when the genes being identified will be physically isolated from the gel for later sequencing or use as experimental probes, include the use of silver staining gels or of radioactive labeling. Since these methods do not allow for multiple samples to be run in a single lane, they are less preferable when high throughput is needed.

Because this invention achieves rapid and economical determination of quantitative gene expression in tissue or other samples, it has considerable medical and research utility. In medicine, as more and more diseases are recognized to have important genetic components to their etiology and development, it is becoming increasingly useful to be able to assay the genetic makeup and expression of a tissue sample. For example, the presence and expression of certain genes or their particular alleles are prognostic or risk factors for disease (including disorders). Several examples of such diseases are found among the neurodegenerative diseases, such as Huntington's disease and ataxia-telangiectasia. Several cancers, such as neuroblastoma, can now be linked to specific genetic defects. Finally, gene expression can also determine the presence and classification of those foreign pathogens that are difficult or impossible to culture in vitro but which nevertheless express their own unique genes.

Disease progression is reflected in changes in genetic expression of an affected tissue. For example, expression of particular tumor promoter genes and lack of expression of particular tumor suppressor genes is now known to correlate with the progression of certain tumors from normal tissue, to hyperplasia, to cancer in situ, and to metastatic cancer. Return of a cell population to a normal pattern of gene expression, such as by using anti-sense technology, can correlate with tumor regression. Therefore, knowledge of gene expression in a cancerous tissue can assist in staging and classifying this disease.

Expression information can also be used to choose and guide therapy. Accurate disease classification and staging or grading using gene expression information can assist in choosing initial therapies that are increasingly tailored to the precise disease process occurring in the particular patient. Gene expression information can then track disease progression or regression, and such information can assist in monitoring the success or changing the course of an initial therapy. A therapy is favored that results in a regression towards normal of an abnormal pattern of gene expression in an individual, while a therapy which has little effect on gene expression or its progression may need modification. Such monitoring is now useful for cancers and will become useful for an increasing number of other diseases, such as diabetes and obesity. Finally, in the case of direct gene therapy, expression analysis directly monitors the success of treatment.

In biological research, rapid and economical assay for gene expression in tissue or other samples has numerous applications. Such applications include, but are not limited to, for example, in pathology examining tissue specific genetic response to disease, in embryology determining developmental changes in gene expression, in pharmacology assessing direct and indirect effects of drugs on gene expression. In these applications, this invention can be applied, e.g., to in vitro cell populations or cell lines, to in vivo animal models of disease or other processes, to human samples, to purified cell populations perhaps drawn from actual wild-type occurrences, and to tissue samples containing mixed cell populations. The cell or tissue sources can advantageously be a plant, a single celled animal, a multicellular animal, a bacterium, a virus, a fungus, or a yeast, etc. The animal can advantageously be laboratory animals used in research, such as mice engineered or bred to have certain genomes or disease conditions or tendencies. The in vitro cell populations or cell lines can be exposed to various exogenous factors to determine the effect of such factors on gene expression. Further, since an unknown signal pattern is indicative of an as yet unknown gene, this invention has important use for the discovery of new genes. In medical research, by way of further example, use of the methods of this invention allows correlating gene expression with the presence and progress of a disease and thereby provides new methods of diagnosis and new avenues of therapy which seek to directly alter gene expression.

This invention includes various embodiments and aspects, several of which are described below.

In a first embodiment, the invention provides a method for identifying, classifying, or quantifying one or more nucleic acids in a sample comprising a plurality of nucleic acids having different nucleotide sequences, said method comprising probing said sample with one or more recognition means, each recognition means recognizing a different target nucleotide subsequence or a different set of target nucleotide subsequences; generating one or more signals from said sample probed by said recognition means, each generated signal arising from a nucleic acid in said sample and comprising a representation of (i) the length between occurrences of target subsequences in said nucleic acid and (ii) the identities of said target subsequences in said nucleic acid or the identities of said sets of target subsequences among which is included the target subsequences in said nucleic acid; and searching a nucleotide sequence database to determine sequences that match or the absence of any sequences that match said one or more generated signals, said database comprising a plurality of known nucleotide sequences of nucleic acids that may be present in the sample, a sequence from said database matching a generated signal when the sequence from said database has both (i) the same length between occurrences of target subsequences as is represented by the generated signal and (ii) the same target subsequences as is represented by the generated signal, or target subsequences that are members of the same sets of target subsequences represented by the generated signal, whereby said one or more nucleic acids in said sample are identified, classified, or quantified.

This invention further provides in the first embodiment additional methods wherein each recognition means recognizes one target subsequence, and wherein a sequence from said database matches a generated signal when the sequence from said database has both the same length between occurrences of target subsequences as is represented by the generated signal and the same target subsequences as represented by the generated signal, or optionally wherein each recognition means recognizes a set of target subsequences, and wherein a sequence from said database matches a generated signal when the sequence from said database has both the same length between occurrences of target subsequences as is represented by the generated signal, and target subsequences that are members of the sets of target subsequences represented by the generated signal.

This invention further provides in the first embodiment additional methods further comprising dividing said sample of nucleic acids into a plurality of portions and performing the methods of this object individually on a plurality of said portions, wherein a different one or more recognition means are used with each portion.

This invention further provides in the first embodiment additional methods wherein the quantitative abundance of a nucleic acid comprising a particular nucleotide sequence in the sample is determined from the quantitative level of the one or more signals generated by said nucleic acid that are determined to match said particular nucleotide sequence.

This invention further provides in the first embodiment additional methods wherein said plurality of nucleic acids are DNA, and optionally wherein the DNA is cDNA, and optionally wherein the cDNA is prepared from a plant, a single celled animal, a multicellular animal, a bacterium, a virus, a fungus, or a yeast, and optionally wherein the cDNA is of total cellular RNA or total cellular poly(A) RNA.

This invention further provides in the first embodiment additional methods wherein said database comprises substantially all the known expressed sequences of said plant, single celled animal, multicellular animal, bacterium, or yeast.

This invention further provides in the first embodiment additional methods wherein the recognition means are one or more restriction endonucleases whose recognition sites are said target subsequences, and wherein the step of probing comprises digesting said sample with said one or more restriction endonucleases into fragments and ligating double stranded adapter DNA molecules to said fragments to produce ligated fragments, each said adapter DNA molecule comprising (i) a shorter strand having no 5' terminal phosphates and consisting of a first and second portion, said first portion at the 5' end of the shorter strand being complementary to the overhang produced by one of said restriction endonucleases and (ii) a longer strand having a 3' end subsequence complementary to said second portion of the shorter strand; and wherein the step of generating further comprises melting the shorter strand from the ligated fragments, contacting the sample with a DNA polymerase, extending the ligated fragments by synthesis with the DNA polymerase to produce blunt-ended double stranded DNA fragments, and amplifying the blunt-ended fragments by a method comprising contacting said blunt-ended fragments with a DNA polymerase and primer oligodeoxynucleotides, said primer oligodeoxynucleotides comprising the longer adapter strand, and said contacting being at a temperature not greater than the melting temperature of the primer oligodeoxynucleotide from a strand of the blunt-ended fragments complementary to the primer oligodeoxynucleotide and not less than the melting temperature of the shorter strand of the adapter nucleic acid from the blunt-ended fragments.

This invention further provides in the first embodiment additional methods wherein the recognition means are one or more restriction endonucleases whose recognition sites are said target subsequences, and wherein the step of probing further comprises digesting the sample with said one or more restriction endonucleases.

This invention further provides in the first embodiment additional methods further comprising identifying a fragment of a nucleic acid in the sample which generates said one or more signals; and recovering said fragment, and optionally wherein the signals generated by said recovered fragment do not match a sequence in said nucleotide sequence database, and optionally further comprising using at least a hybridizable portion of said fragment as a hybridization probe to bind to a nucleic acid that can generate said fragment upon digestion by said one or more restriction endonucleases.

This invention further provides in the first embodiment additional methods wherein the step of generating further comprises after said digesting removing from the sample both nucleic acids which have not been digested and nucleic acid fragments resulting from digestion at only a single terminus of the fragments, and optionally wherein prior to digesting, the nucleic acids in the sample are each bound at one terminus to a biotin molecule or to a hapten molecule, and said removing is carried out by a method which comprises contacting the nucleic acids in the sample with streptavidin or avidin or with an anti-hapten antibody, respectively, affixed to a solid support.

This invention further provides in the first embodiment additional methods wherein said digesting with said one or more restriction endonucleases leaves single-stranded nucleotide overhangs on the digested ends.

This invention further provides in the first embodiment additional methods wherein the step of probing further comprises hybridizing double-stranded adapter nucleic acids with the digested sample fragments, each said adapter nucleic acid having an end complementary to said overhang generated by a particular one of the one or more restriction endonucleases, and ligating with a ligase a strand of said adapter nucleic acids to the 5' end of a strand of the digested sample fragments to form ligated nucleic acid fragments.

This invention further provides in the first embodiment additional methods wherein said digesting with said one or more restriction endonucleases and said ligating are carried out in the same reaction medium, and optionally wherein said digesting and said ligating comprise incubating said reaction medium at a first temperature and then at a second temperature, in which said one or more restriction endonucleases are more active at the first temperature than the second temperature and said ligase is more active at the second temperature than the first temperature, or wherein said incubating at said first temperature and said incubating at said second temperature are performed repetitively.

This invention further provides in the first embodiment additional methods wherein the step of probing further comprises prior to said digesting removing terminal phosphates from DNA in said sample by incubation with an alkaline phosphatase, and optionally wherein said alkaline phosphatase is heat labile and is heat inactivated prior to said digesting.

This invention further provides in the first embodiment additional methods wherein said generating step comprises amplifying the ligated nucleic acid fragments, and optionally wherein said amplifying is carried out by use of a nucleic acid polymerase and primer nucleic acid strands, said primer nucleic acid strands being capable of priming nucleic acid synthesis by said polymerase, and optionally wherein the primer nucleic acid strands have a G+C content of between 40% and 60%.

This invention further provides in the first embodiment additional methods wherein each said adapter nucleic acid has a shorter strand and a longer strand, the longer strand being ligated to the digested sample fragments, and said generating step comprises prior to said amplifying step the melting of the shorter strand from the ligated fragments, contacting the ligated fragments with a DNA polymerase, extending the ligated fragments by synthesis with the DNA polymerase to produce blunt-ended double stranded DNA fragments, and wherein the primer nucleic acid strands comprise a hybridizable portion of the sequence of said longer strands, or optionally comprise the sequence of said longer strands, each different primer nucleic acid strand priming amplification only of blunt ended double stranded DNA fragments that are produced after digestion by a particular restriction endonuclease.

This invention further provides in the first embodiment additional methods wherein each primer nucleic acid strand is specific for a particular restriction endonuclease, and further comprises at the 3' end of and contiguous with the longer strand sequence the portion of the restriction endonuclease recognition site remaining on a nucleic acid fragment terminus after digestion by the restriction endonuclease, or optionally wherein each said primer specific for a particular restriction endonuclease further comprises at its 3' end one or more nucleotides 3' to and contiguous with the remaining portion of the restriction endonuclease recognition site, whereby the ligated nucleic acid fragment amplified is that comprising said remaining portion of said restriction endonuclease recognition site contiguous to said one or more additional nucleotides, and optionally such that said primers comprising a particular said one or more additional nucleotides can be distinguishably detected from said primers comprising a different said one or more additional nucleotides.

This invention further provides in the first embodiment additional methods wherein during said amplifying step the primer nucleic acid strands are annealed to the ligated nucleic acid fragments at a temperature that is less than the melting temperature of the primer nucleic acid strands from strands complementary to the primer nucleic acid strands but greater than the melting temperature of the shorter adapter strands from the blunt-ended fragments.

This invention further provides in the first embodiment additional methods wherein the recognition means are oligomers of nucleotides, nucleotide-mimics, or a combination of nucleotides and nucleotide-mimics, which are specifically hybridizable with the target subsequences, and optionally further provides additional methods wherein the step of generating comprises amplifying with a nucleic acid polymerase and with primers comprising said oligomers, whereby fragments of nucleic acids in the sample between hybridized oligomers are amplified.

This invention further provides in the first embodiment additional methods wherein said signals further comprise a representation of whether an additional target subsequence is present on said nucleic acid in the sample between said occurrences of target subsequences, and optionally wherein said additional target subsequence is recognized by a method comprising contacting nucleic acids in the sample with oligomers of nucleotides, nucleotide-mimics, or mixed nucleotides and nucleotide-mimics, which are hybridizable with said additional target subsequence.

This invention further provides in the first embodiment additional methods wherein the step of generating comprises suppressing said signals when an additional target subsequence is present on said nucleic acid in the sample between said occurrences of target subsequences, and optionally wherein, when the step of generating comprises amplifying nucleic acids in the sample, said additional target subsequence is recognized by a method comprising contacting nucleic acids in the sample with (a) oligomers of nucleotides, nucleotide-mimics, or mixed nucleotides and nucleotide-mimics, which hybridize with said additional target subsequence and disrupt the amplifying step; or (b) restriction endonucleases which have said additional target subsequence as a recognition site and digest the nucleic acids in the sample at the recognition site.

This invention further provides in the first embodiment additional methods wherein the step of generating further comprises separating nucleic acid fragments by length, and optionally wherein the step of generating further comprises detecting said separated nucleic acid fragments, and optionally wherein said detecting is carried out by a method comprising staining said fragments with silver, labeling said fragments with a DNA intercalating dye, or detecting light emission from a fluorochrome label on said fragments.

This invention further provides in the first embodiment additional methods wherein said representation of the length between occurrences of target subsequences is the length of fragments determined by said separating and detecting steps.

This invention further provides in the first embodiment additional methods wherein said separating is carried out by use of liquid chromatography, mass spectrometry, or electrophoresis, and optionally wherein said electrophoresis is carried out in a slab gel or capillary configuration using a denaturing or non-denaturing medium.

This invention further provides in the first embodiment additional methods wherein a predetermined one or more nucleotide sequences in said database are of interest, and wherein the target subsequences are such that said sequences of interest generate at least one signal that is not generated by any other sequence likely to be present in the sample, and optionally wherein the nucleotide sequences of interest are a majority of sequences in said database.

This invention further provides in the first embodiment additional methods wherein the target subsequences have a probability of occurrence in the nucleotide sequences in said database of from approximately 0.01 to approximately 0.30.

This invention further provides in the first embodiment additional methods wherein the target subsequences are such that the majority of sequences in said database contain on average a sufficient number of occurrences of target subsequences in order to on average generate a signal that is not generated by any other nucleotide sequence in said database, and optionally wherein the number of pairs of target subsequences present on average in the majority of sequences in said database is no less than 3, and wherein the average number of signals generated from the sequences in said database is such that the average difference between lengths represented by the generated signals is greater than or equal to 1 base pair.

This invention further provides in the first embodiment additional methods wherein the target subsequences have a probability of occurrence, p, approximately given by the solution of ##EQU1## wherein N=the number of different nucleotide sequences in said database; L=the average length of said different nucleotide sequences in said database; R=the number of recognition means; A=the number of pairs of target subsequences present on average in said different nucleotide sequences in said database; and B=the average difference between lengths represented by the signals generated from the nucleic acids in the sample, and optionally wherein A is greater than or equal to 3 and wherein B is greater than or equal to 1.

This invention further provides in the first embodiment additional methods wherein the target subsequences are selected according to the further steps comprising determining a pattern of signals that can be generated and the sequences capable of generating each such signal by simulating the steps of probing and generating applied to each sequence in said database of nucleotide sequences; ascertaining the value of said determined pattern according to an information measure; and choosing the target subsequences in order to generate a new pattern that optimizes the information measure, and optionally wherein said choosing step selects target subsequences which comprise the recognition sites of the one or more restriction endonucleases, and optionally wherein said choosing step selects target subsequences which comprise the recognition sites of the one or more restriction endonucleases contiguous with one or more additional nucleotides.

This invention further provides in the first embodiment additional methods wherein a predetermined one or more of the nucleotide sequences present in said database of nucleotide sequences are of interest, and the information measure optimized is the number of such said sequences of interest which generate at least one signal that is not generated by any other nucleotide sequence present in said database, and optionally wherein said nucleotide sequences of interest are a majority of the nucleotide sequences present in said database.
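The information measure described here, the number of sequences of interest that generate at least one signal generated by no other database sequence, can be sketched as a simple count. The sketch assumes the simulated signals have already been computed and are supplied as a mapping from sequence name to a set of signals; the names and toy signals are hypothetical.

```python
from collections import Counter


def count_uniquely_identified(signals_by_sequence, of_interest=None):
    """Count sequences having at least one signal that no other sequence generates.

    signals_by_sequence maps a sequence name to the set of signals it can generate.
    of_interest restricts the count to the named sequences (default: all).
    """
    # Each sequence contributes a given signal at most once, because values are sets.
    signal_counts = Counter(
        signal for signals in signals_by_sequence.values() for signal in signals
    )
    names = signals_by_sequence.keys() if of_interest is None else of_interest
    return sum(
        1
        for name in names
        if any(signal_counts[s] == 1 for s in signals_by_sequence.get(name, ()))
    )


# Toy signals (hypothetical): (first target, second target, length).
toy_signals = {
    "geneA": {("E", "B", 10), ("B", "E", 25)},
    "geneB": {("E", "B", 10)},
    "geneC": {("E", "B", 8)},
}
measure_all = count_uniquely_identified(toy_signals)
```

Here geneA and geneC each have a signal of their own, while geneB shares its only signal with geneA, so the measure over all sequences is 2.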

This invention further provides in the first embodiment additional methods wherein said choosing step is by exhaustive search of all combinations of target subsequences of length less than approximately 10, or wherein said step of choosing target subsequences is by a method comprising simulated annealing.
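The exhaustive-search option can be sketched as scoring every combination of candidate target subsequences and keeping the best. For brevity the score used below counts sequences with a unique presence/absence pattern, which is a simplification chosen for this sketch and not the specification's information measure; all names and toy data are assumptions. Simulated annealing would replace the full enumeration with randomized moves between candidate sets and is not shown.

```python
from collections import Counter
from itertools import combinations


def unique_pattern_count(targets, database):
    """Number of database sequences whose hit pattern is shared with no other sequence."""
    patterns = {
        name: tuple(t in seq for t in targets) for name, seq in database.items()
    }
    counts = Counter(patterns.values())
    return sum(1 for p in patterns.values() if counts[p] == 1)


def best_target_set(candidates, database, set_size):
    """Exhaustively score every combination of set_size candidates; return the best."""
    return max(
        combinations(candidates, set_size),
        key=lambda targets: unique_pattern_count(targets, database),
    )


# Toy database and candidate targets (hypothetical).
search_db = {
    "geneA": "AAGAATTCTT",
    "geneB": "CCGGATCCGG",
    "geneC": "GAATTCGGATCC",
    "geneD": "TTTTAAGCTT",
}
best = best_target_set(["GAATTC", "GGATCC", "AAGCTT"], search_db, set_size=2)
```

The number of combinations grows rapidly with the size of the candidate pool, which is presumably why the text limits exhaustive search to short target subsequences and offers simulated annealing as an alternative.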

This invention further provides in the first embodiment additional methods wherein the step of searching further comprises determining a pattern of signals that can be generated and the sequences capable of generating each such signal by simulating the steps of probing and generating applied to each sequence in said database of nucleotide sequences; and finding the one or more nucleotide sequences in said database that are able to generate said one or more generated signals by finding in said pattern those signals that comprise a representation of (i) the same lengths between occurrences of target subsequences as is represented by the generated signal and (ii) the same target subsequences as is represented by the generated signal, or target subsequences that are members of the same sets of target subsequences represented by the generated signal.

This invention further provides in the first embodiment additional methods wherein the step of determining further comprises searching for occurrences of said target subsequences or sets of target subsequences in nucleotide sequences in said database of nucleotide sequences; finding the lengths between occurrences of said target subsequences or sets of target subsequences in the nucleotide sequences of said database; and forming the pattern of signals that can be generated from the sequences of said database in which the target subsequences were found to occur.
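The determining and searching steps just described can be sketched as building a table from simulated signals to the database sequences that generate them, and then looking up an observed signal in that table. The sketch assumes exact length matching, measures length between the start positions of adjacent target occurrences, and uses hypothetical names and toy sequences throughout.

```python
from collections import defaultdict


def simulate_signals(sequence, targets):
    """Set of (first target, second target, length) signals for adjacent occurrences."""
    hits = sorted(
        (pos, t)
        for t in targets
        for pos in range(len(sequence) - len(t) + 1)
        if sequence.startswith(t, pos)
    )
    return {(t1, t2, p2 - p1) for (p1, t1), (p2, t2) in zip(hits, hits[1:])}


def build_signal_table(database, targets):
    """Map each simulated signal to the set of database sequences that generate it."""
    table = defaultdict(set)
    for name, seq in database.items():
        for signal in simulate_signals(seq, targets):
            table[signal].add(name)
    return table


def match_signal(observed, table):
    """Names of database sequences able to generate the observed signal."""
    return sorted(table.get(observed, set()))


# Toy database (hypothetical sequences).
qea_db = {
    "geneA": "AAGAATTCTTTTGGATCCAA",
    "geneB": "GAATTCTTGGATCC",
    "geneC": "TTGAATTCAAAAGGATCCTT",
}
qea_table = build_signal_table(qea_db, ["GAATTC", "GGATCC"])
```

In this toy case a single signal cannot separate geneA from geneC, which illustrates why additional recognition means or additional target subsequences are used to resolve ambiguous matches.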

This invention further provides in the first embodiment additional methods wherein said restriction endonucleases generate 5' overhangs at the terminus of digested fragments and wherein each double stranded adapter nucleic acid comprises a shorter nucleic acid strand consisting of a first and second contiguous portion, said first portion being a 5' end subsequence complementary to the overhang produced by one of said restriction endonucleases; and a longer nucleic acid strand having a 3' end subsequence complementary to said second portion of the shorter strand.

This invention further provides in the first embodiment additional methods wherein said shorter strand has a melting temperature from a complementary strand of less than approximately 68° C., and has no terminal phosphate, and optionally wherein said shorter strand is approximately 12 nucleotides long.

This invention further provides in the first embodiment additional methods wherein said longer strand has a melting temperature from a complementary strand of greater than approximately 68° C., is not complementary to any nucleotide sequence in said database, and has no terminal phosphate, and optionally wherein said ligated nucleic acid fragments do not contain a recognition site for any of said restriction endonucleases, and optionally wherein said longer strand is approximately 24 nucleotides long and has a G+C content between 40% and 60%.
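By way of a non-limiting sketch, some of the constraints recited for the longer adapter strand could be screened computationally as shown below. The example strand and database are hypothetical. The sketch treats a strand as "complementary" to a database entry only when its exact reverse complement occurs in that entry, which is a simplifying assumption. Melting temperature is not computed because the passage above gives no formula for it.

```python
def gc_fraction(strand):
    """Fraction of G and C nucleotides in the strand."""
    return (strand.count("G") + strand.count("C")) / len(strand)

def reverse_complement(strand):
    """Reverse complement of a DNA strand written 5' to 3'."""
    return strand.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def check_longer_strand(strand, database):
    """Screen a candidate longer strand against three recited constraints."""
    rc = reverse_complement(strand)
    return {
        "length_24": len(strand) == 24,
        "gc_40_to_60": 0.40 <= gc_fraction(strand) <= 0.60,
        # Assumption: only an exact full-length reverse-complement match counts.
        "not_complementary_to_database": all(rc not in seq for seq in database),
    }

# Hypothetical candidate strand and database sequences.
strand = "ATGCCGTATCGGATTACGCTAGCA"
report = check_longer_strand(strand, ["TTTTGGGGCCCCAAAA"])
```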

This invention further provides in the first embodiment additional methods wherein said one or more restriction endonucleases are heat inactivated before said ligating.

This invention further provides in the first embodiment additional methods wherein said restriction endonucleases generate 3' overhangs at the terminus of the digested fragments and wherein each double stranded adapter nucleic acid comprises a longer nucleic acid strand consisting of a first and second contiguous portion, said first portion being a 3' end subsequence complementary to the overhang produced by one of said restriction endonucleases; and a shorter nucleic acid strand complementary to the 3' end of said second portion of the longer nucleic acid strand.

This invention further provides in the first embodiment additional methods wherein said shorter strand has a melting temperature from said longer strand of less than approximately 68° C., and has no terminal phosphates, and optionally wherein said shorter strand is 12 base pairs long.

This invention further provides in the first embodiment additional methods wherein said longer strand has a melting temperature from a complementary strand of greater than approximately 68° C., is not complementary to any nucleotide sequence in said database, has no terminal phosphate, and wherein said ligated nucleic acid fragments do not contain a recognition site for any of said restriction endonucleases, and optionally wherein said longer strand is 24 base pairs long and has a G+C content between 40% and 60%.

In a second embodiment, the invention provides a method for identifying or classifying a nucleic acid comprising probing said nucleic acid with a plurality of recognition means, each recognition means recognizing a target nucleotide subsequence or a set of target nucleotide subsequences, in order to generate a set of signals, each signal representing whether said target subsequence or one of said set of target subsequences is present or absent in said nucleic acid; and searching a nucleotide sequence database, said database comprising a plurality of known nucleotide sequences of nucleic acids that may be present in the sample, for sequences matching said generated set of signals, a sequence from said database matching a set of signals when the sequence from said database (i) comprises the same target subsequences as are represented as present, or comprises target subsequences that are members of the sets of target subsequences represented as present by the generated sets of signals and (ii) does not comprise the target subsequences represented as absent or that are members of the sets of target subsequences represented as absent by the generated sets of signals, whereby the nucleic acid is identified or classified, and optionally wherein the set of signals are represented by a hash code which is a binary number.
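For illustration, the optional binary hash code for a set of presence/absence signals might be formed and searched as sketched below. The bit ordering (target i sets bit i), the function names, and the toy clone database are assumptions made for this example only.

```python
def hash_code(seq, targets):
    """Encode presence (1) or absence (0) of each target as one bit of an integer."""
    code = 0
    for i, target in enumerate(targets):
        if target in seq:
            code |= 1 << i  # assumed convention: target i sets bit i
    return code

def search(database, targets, observed_code):
    """Return names of database sequences whose hash code equals the observed one."""
    return [name for name, seq in database.items()
            if hash_code(seq, targets) == observed_code]

# Hypothetical targets and database.
targets = ["ACGT", "GGCC", "TTAA"]
database = {
    "cloneA": "AAACGTGGCCAA",   # ACGT and GGCC present
    "cloneB": "CCTTAACC",       # only TTAA present
    "cloneC": "ACGTTAA",        # ACGT and TTAA present
}
matches = search(database, targets, 0b101)
```

Because equal codes require every target to agree in both presence and absence, a single integer comparison implements conditions (i) and (ii) together.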

This invention further provides in the second embodiment additional methods wherein the step of probing generates quantitative signals of the numbers of occurrences of said target subsequences or of members of said set of target subsequences in said nucleic acid, and optionally wherein a sequence matches said generated set of signals when the sequence from said database comprises the same target subsequences with the same number of occurrences in said sequence as in the quantitative signals and does not comprise the target subsequences represented as absent or target subsequences within the sets of target subsequences represented as absent.

This invention further provides in the second embodiment additional methods wherein said plurality of nucleic acids are DNA.

This invention further provides in the second embodiment additional methods wherein the recognition means are detectably labeled oligomers of nucleotides, nucleotide-mimics, or combinations of nucleotides and nucleotide-mimics, and the step of probing comprises hybridizing said nucleic acid with said oligomers, and optionally wherein said detectably labeled oligomers are detected by a method comprising detecting light emission from a fluorochrome label on said oligomers or arranging said labeled oligomers to cause light to scatter from a light pipe and detecting said scattering, and optionally wherein the recognition means are oligomers of peptido-nucleic acids, and optionally wherein the recognition means are DNA oligomers, DNA oligomers comprising universal nucleotides, or sets of partially degenerate DNA oligomers.
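As a hedged illustration of how a partially degenerate oligomer corresponds to a set of target subsequences, the sketch below expands an oligomer written with the standard IUPAC nucleotide ambiguity codes. Representing a universal nucleotide by the code N is an assumption of this example, as are the function name and the sample oligomers.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",  # assumed here to stand in for a universal nucleotide
}

def expand(oligo):
    """Return the set of exact target subsequences a degenerate oligomer recognizes."""
    return {"".join(bases) for bases in product(*(IUPAC[b] for b in oligo))}

def recognizes(oligo, seq):
    """True if any member of the oligomer's target set occurs in the sequence."""
    return any(target in seq for target in expand(oligo))

example_set = expand("ACRT")
```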

This invention further provides in the second embodiment additional methods wherein the step of searching further comprises determining a pattern of sets of signals of the presence or absence of said target subsequences or said sets of target subsequences that can be generated and the sequences capable of generating each set of signals in said pattern by simulating the step of probing as applied to each sequence in said database of nucleotide sequences; and finding one or more nucleotide sequences that are capable of generating said generated set of signals by finding in said pattern those sets that match said generated set, where a set of signals from said pattern matches a generated set of signals when the set from said pattern (i) represents as present the same target subsequences as are represented as present or target subsequences that are members of the sets of target subsequences represented as present by the generated sets of signals and (ii) represents as absent the target subsequences represented as absent or that are members of the sets of target subsequences represented as absent by the generated sets of signals.

This invention further provides in the second embodiment additional methods wherein the target subsequences are selected according to the further steps comprising determining (i) a pattern of sets of signals representing the presence or absence of said target subsequences or of said sets of target subsequences that can be generated, and (ii) the sequences capable of generating each set of signals in said pattern by simulating the step of probing as applied to each sequence in said database of nucleotide sequences; ascertaining the value of said pattern generated according to an information measure; and choosing the target subsequences in order to generate a new pattern that optimizes the information measure.

This invention further provides in the second embodiment additional methods wherein the information measure is the number of sets of signals in the pattern which are capable of being generated by one or more sequences in said database, or optionally wherein the information measure is the number of sets of signals in the pattern which are capable of being generated by only one sequence in said database.
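The two information measures just described can be sketched as simple counts over the simulated pattern. This is an illustrative sketch under assumed names and toy data, not a definitive implementation.

```python
from collections import Counter

def signal_set(seq, targets):
    """Presence/absence of each target in one sequence, as a tuple of booleans."""
    return tuple(target in seq for target in targets)

def information_measures(database, targets):
    """Return (number of distinct signal sets generated by at least one sequence,
    number of signal sets generated by exactly one sequence)."""
    counts = Counter(signal_set(seq, targets) for seq in database.values())
    distinct = len(counts)
    unique = sum(1 for n in counts.values() if n == 1)
    return distinct, unique

# Hypothetical database: s1 and s2 give the same signal set, s3 a different one.
database = {"s1": "ACGTAA", "s2": "TTACGT", "s3": "GGCCAA"}
measures = information_measures(database, ["ACGT", "GGCC"])
```

The second count is the stricter measure, since it rewards only those signal sets that identify a single database sequence unambiguously.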

This invention further provides in the second embodiment additional methods wherein said choosing step is by a method comprising exhaustive search of all combinations of target subsequences of length less than approximately 10, or optionally wherein said choosing step is by a method comprising simulated annealing.

This invention further provides in the second embodiment additional methods wherein the step of determining by simulating further comprises searching for the presence or absence of said target subsequences or sets of target subsequences in each nucleotide sequence in said database of nucleotide sequences; and forming the pattern of sets of signals that can be generated from said sequences in said database, and optionally where the step of searching is carried out by a string search, and optionally wherein the step of searching comprises counting the number of occurrences of said target subsequences in each nucleotide sequence.

This invention further provides in the second embodiment additional methods wherein the target subsequences have a probability of occurrence in a nucleotide sequence in said database of nucleotide sequences of from 0.01 to 0.6, or optionally wherein the target subsequences are such that the presence of one target subsequence in a nucleotide sequence in said database of nucleotide sequences is substantially independent of the presence of any other target subsequence in the nucleotide sequence, or optionally wherein fewer than approximately 50 target subsequences are selected.
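One way to screen candidate target subsequences against the occurrence-probability and independence criteria above is sketched below. The function names and toy database are hypothetical, and measuring independence as the gap between the joint frequency and the product of the individual frequencies is one simple choice among several.

```python
def occurrence_probability(database, target):
    """Fraction of database sequences containing the target at least once."""
    return sum(target in seq for seq in database) / len(database)

def independence_gap(database, a, b):
    """|P(a and b) - P(a) * P(b)|; values near zero suggest independent targets."""
    joint = sum(a in seq and b in seq for seq in database) / len(database)
    return abs(joint - occurrence_probability(database, a)
               * occurrence_probability(database, b))

def acceptable(database, target, low=0.01, high=0.6):
    """True if the target's occurrence probability lies in the recited range."""
    return low <= occurrence_probability(database, target) <= high

# Hypothetical database of four sequences.
database = ["ACGTGG", "ACGTTT", "TTGGCC", "AAAAAA"]
p_acgt = occurrence_probability(database, "ACGT")
gap = independence_gap(database, "ACGT", "GG")
```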

In a third embodiment, the invention provides a programmable apparatus for analyzing signals comprising an inputting device for inputting one or more actual signals generated by probing a sample comprising a plurality of nucleic acids with recognition means, each recognition means recognizing a target nucleotide subsequence or a set of target nucleotide subsequences, said signals comprising a representation of (i) the length between occurrences of said target subsequences in a nucleic acid of said sample, and (ii) the identities of said target subsequences in said nucleic acid, or the identities of said sets of target subsequences among which is included the target subsequences in said nucleic acid; a searching device operatively coupled to said accepting device for searching a sequence in a nucleotide sequence database for occurrences of said target subsequences or target subsequences that are members of said sets of target subsequences, and for the length between such occurrences, said database comprising a plurality of known nucleotide sequences that may be present in said sample; a comparing device operatively coupled to said accepting device and to said searching device for finding a match between said one or more actual signals and a sequence in said database, said one or more actual signals matching a sequence from said database when the sequence from said database has both (i) the same length between occurrences of target subsequences as is represented by said one or more actual signals and (ii) the same target subsequences as is represented by said one or more actual signals or target subsequences that are members of the same sets of target subsequences represented by said one or more actual signals; and a control device operatively coupled to said comparing device for causing said comparing to be done for sequences in the database and for outputting those database sequences that match said one or more actual signals, and optionally wherein said searching device searches for said target subsequences or a set of target nucleotide subsequences in said database sequences by performing a string comparison of the nucleotides in said subsequences with those in said database sequence.

This invention further provides in the third embodiment that said control device further comprises causing said searching device to search substantially all sequences in said database in order to determine a pattern of signals that can be generated by probing said sample with said recognition means, and wherein said control device further causes said comparing device to find any matches between said one or more actual signals and said pattern of signals, said one or more actual signals matching a signal in said pattern of signals when the signal from said pattern represents (i) the same length between occurrences of target subsequences as is represented by said one or more actual signals and (ii) the same target subsequences as is represented by said one or more actual signals or target subsequences that are members of the same sets of target subsequences represented by said one or more actual signals.

This invention further provides in the third embodiment that said sample of nucleic acids comprises cDNA from RNA of a cell or tissue type, and said database comprises DNA sequences that are likely to be expressed by said cell or tissue type.

This invention further provides in the third embodiment a computer readable memory that can be used to direct a programmable apparatus to function for analyzing signals according to steps comprising inputting one or more actual signals generated by probing a sample comprising a plurality of nucleic acids with recognition means, each recognition means recognizing a target nucleotide subsequence or a set of target nucleotide subsequences, said signals comprising a representation of (i) the length between occurrences of said target subsequences in a nucleic acid of said sample, and (ii) the identities of said target subsequences in said nucleic acid, or the identities of said sets of target subsequences among which is included the target subsequences in said nucleic acid; searching a sequence in a nucleotide sequence database for occurrences of said target subsequences or target subsequences that are members of said sets of target subsequences, and for the length between such occurrences, said database comprising a plurality of known nucleotide sequences that may be present in said sample; matching said one or more actual signals and a sequence in said database when the sequence in said database has both (i) the same length between occurrences of target subsequences as is represented by said one or more actual signals and (ii) the same target subsequences as is represented by said one or more actual signals, or target subsequences that are members of the same sets of target subsequences as is represented by said one or more actual signals; and repetitively performing said searching and matching steps for the majority of sequences in the database and outputting those database sequences that match said one or more actual signals, or alternatively a computer readable memory for directing a programmable apparatus to function in the manner of the third object.

In a fourth embodiment, the invention provides a programmable apparatus for selecting target subsequences comprising an initial selection device for selecting initial target subsequences or initial sets of target subsequences; a first control device; a search device operatively coupled to said initial selection device and to said first control device (i) for searching sequences in a nucleotide sequence database for occurrences of said initial target subsequences or occurrences of target subsequences that are members of said initial sets of target subsequences and for the length between such occurrences and (ii) for determining an initial pattern of signals that can be generated from said selected initial target subsequences or said initial sets of target subsequences, said database comprising a plurality of known nucleotide sequences, said signals comprising a representation of (i) the length between said occurrences in a sequence in said database, and (ii) the identities of said initial target subsequences that occur in said sequence in said database, or the identities of target subsequences that are members of the same initial sets of target subsequences that occur in said sequence in said database; and an ascertaining device operatively coupled to said searching device and to said first control device for ascertaining the value of said determined initial pattern according to an information measure; and wherein said first control device causes further target subsequences to be selected and causes the search device to determine a further pattern of signals and the ascertaining device to ascertain a further value of said information measure and accepts the further target subsequences when said further pattern optimizes said further value of said information measure.

This invention further provides in the fourth object that a predetermined one or more of the sequences in said database are of interest, and wherein said ascertaining device ascertains the value of an information measure by counting the number of such sequences of interest which generate in said determined pattern at least one signal that is not generated by any other sequence in said database, and optionally that said one or more of the sequences of interest comprise substantially all the sequences in said database.

This invention further provides in the fourth embodiment that said first control device optimizes the value of said information measure according to a method of exhaustive search, wherein said first control device selects further target subsequences of length less than approximately 10 and accepts the further target subsequences if said further value of said information measure is greater than the previous value.
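A minimal sketch of such an exhaustive search follows. It enumerates every combination of target subsequences of a fixed short length and keeps a combination only when the information measure exceeds the best value seen so far. For brevity it uses presence/absence signals and the "signal sets generated by only one sequence" measure; the names, the toy database, and the choice of two targets of length 2 are assumptions of this example.

```python
from collections import Counter
from itertools import combinations, product

def unique_signal_count(database, targets):
    """Number of presence/absence signal sets generated by exactly one sequence."""
    counts = Counter(tuple(t in seq for t in targets) for seq in database)
    return sum(1 for n in counts.values() if n == 1)

def exhaustive_search(database, length, n_targets):
    """Try every combination of n_targets subsequences of the given length."""
    candidates = ["".join(p) for p in product("ACGT", repeat=length)]
    best, best_value = None, -1
    for targets in combinations(candidates, n_targets):
        value = unique_signal_count(database, targets)
        if value > best_value:  # accept only a strict improvement
            best, best_value = targets, value
    return best, best_value

# Hypothetical database of four short sequences.
database = ["ACGTAC", "TTGACC", "GGGTTT", "CATCAT"]
best, best_value = exhaustive_search(database, length=2, n_targets=2)
```

The number of combinations grows rapidly with target length and count, which is consistent with restricting exhaustive search to short target subsequences.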

This invention further provides in the fourth embodiment that said first control device optimizes the value of said information measure according to a method comprising simulated annealing, wherein said first control device repeatedly selects further target subsequences and accepts the further target subsequences if said further value of said information measure is not decreased by greater than a probabilistic factor dependent on a simulated-temperature, and wherein said programmable apparatus further comprises a second control device operatively coupled to said first control device for decreasing said simulated-temperature as said first control device selects further target subsequences, and optionally wherein said probabilistic factor is an exponential function of the negative of the decrease in the information measure divided by said simulated-temperature.
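The acceptance rule described above can be illustrated with the sketch below, which reads the "probabilistic factor" as the conventional Metropolis criterion: a proposal that does not decrease the information measure is always accepted, and one that decreases it by d is accepted with probability exp(-d/T). The geometric cooling schedule, single-nucleotide mutation move, parameter values, and all names are assumptions of this example and are not specified in the text.

```python
import math
import random
from collections import Counter

def unique_signal_count(database, targets):
    """Number of presence/absence signal sets generated by exactly one sequence."""
    counts = Counter(tuple(t in seq for t in targets) for seq in database)
    return sum(1 for n in counts.values() if n == 1)

def acceptance_probability(decrease, temperature):
    """Exponential of the negative decrease divided by the simulated temperature."""
    if decrease <= 0:
        return 1.0
    return math.exp(-decrease / temperature)

def anneal(database, targets, steps=200, t_start=2.0, cooling=0.98, seed=0):
    rng = random.Random(seed)
    current = list(targets)
    value = unique_signal_count(database, current)
    best, best_value = list(current), value
    temperature = t_start
    for _ in range(steps):
        # Propose new targets by mutating one nucleotide of one target.
        proposal = list(current)
        i = rng.randrange(len(proposal))
        j = rng.randrange(len(proposal[i]))
        proposal[i] = proposal[i][:j] + rng.choice("ACGT") + proposal[i][j + 1:]
        new_value = unique_signal_count(database, proposal)
        if rng.random() < acceptance_probability(value - new_value, temperature):
            current, value = proposal, new_value
            if value > best_value:
                best, best_value = list(current), value
        temperature *= cooling  # role of the second control device
    return best, best_value

database = ["ACGTAC", "TTGACC", "GGGTTT", "CATCAT"]
best, best_value = anneal(database, ["AAA", "CCC"])
```

Occasionally accepting a worse set of targets at high simulated temperature lets the search escape local optima that a strictly greedy rule would be trapped in.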

This invention further provides in the fourth embodiment that the database comprises a majority of known DNA sequences that are likely to be expressed by one or more cell types.

This invention further provides in the fourth embodiment a computer readable memory that can be used to direct a programmable apparatus to function for selecting target subsequences according to steps comprising selecting initial target subsequences or initial sets of target subsequences; searching a sequence in a nucleotide sequence database for occurrences of said initial target subsequences or occurrences of target subsequences that are members of said initial sets of target subsequences and for the length between such occurrences, said database comprising a plurality of known nucleotide sequences that may be present in said sample; determining an initial pattern of signals that can be generated from said selected initial target subsequences or said initial sets of target subsequences, said signals comprising a representation of (i) the length between said occurrences in a sequence in said database, and (ii) the identities of said initial target subsequences that occur in said sequence in said database, or the identities of target subsequences that are members of the initial sets of target subsequences that occur in said sequence in said database; ascertaining the value of said determined initial pattern according to an information measure; and repetitively performing said selecting, searching, determining, and ascertaining steps to determine a further pattern of signals and a further value of said information measure, and accepting the further target subsequences when said further pattern optimizes said further value of said information measure, or alternatively a computer readable memory for directing a programmable apparatus to function in the manner of the fourth object.

In a fifth embodiment, the invention provides a programmable apparatus for displaying data comprising a selecting device for selecting target subsequences or sets of target subsequences, such that recognition means for recognizing said target subsequences or said sets of target subsequences can be used to generate signals by probing a sample comprising a plurality of nucleic acids, said signals comprising a representation of (i) the length between occurrences of said target subsequences in a nucleic acid of said sample and (ii) the identities of said target subsequences in said nucleic acid or the identities of said sets of target subsequences among which are included the target subsequences in said nucleic acid; an inputting device for inputting one or more actual signals generated by probing said sample with said recognition means; an analyzing device for analyzing signals operatively coupled to said selecting and inputting devices that determines which sequences in a nucleotide sequence database can generate said actual signals when subject to said recognition means, said database comprising a plurality of known nucleotide sequences that may be present in said sample; an input/output device operatively coupled to said selecting, inputting, and analyzing devices that inputs user requests and controls the selecting device to select target subsequences or sets of target subsequences, controls the inputting device to accept actual signals, controls the analyzing device to find the sequences in said database that can generate said actual signals, and displays output comprising said actual signals and said sequences in said database that can generate said actual signals.

This invention further provides in the fifth embodiment that said sample is a cDNA sample prepared from a tissue specimen, and the apparatus further comprises a storage device operatively coupled to the input/output device for storing indications of the origin of said tissue specimen and information concerning said tissue specimen, and wherein said indications can be displayed upon user input, and optionally that the indications and information concerning said tissue specimen comprises histological information comprising tissue images.

This invention further provides in the fifth embodiment additional apparatus further comprising one or more instrument devices for probing said sample with said recognition means and for generating said actual signals; and a control device operatively coupled to said one or more instrument devices and to said input/output device for controlling the operation of said instrument devices, wherein said user can input control commands for control of said instrument devices and receive output concerning the status of said instrument devices, and optionally wherein one or more of said selecting, inputting, analyzing, and input/output devices are physically collocated with each other, or are physically spaced apart from each other and are connected by a communication medium for exchanges of commands and information.

This invention further provides in the fifth embodiment a computer readable memory that can be used to direct a programmable apparatus to function for displaying data according to steps comprising selecting target subsequences or sets of target subsequences, such that recognition means for recognizing said target subsequences or said sets of target subsequences can be used to generate signals by probing a sample comprising a plurality of nucleic acids, said signals comprising a representation of (i) the length between occurrences of said target subsequences in a nucleic acid of said sample and (ii) the identities of said target subsequences in said nucleic acid or the identities of said sets of target subsequences among which are included the target subsequences in said nucleic acid; inputting one or more actual signals generated by probing said sample with said recognition means; analyzing said one or more actual signals to determine which sequences in a nucleotide sequence database can generate said actual signals when subject to said recognition means, said database comprising a plurality of known nucleotide sequences that may be present in said sample; and inputting user requests to control said selecting step to select target subsequences or sets of target subsequences, said inputting step to input actual signals, and said analyzing step to find the sequences in said database that can generate said actual signals, and outputting in response to further user requests information comprising said actual signals and said sequences in said database that can generate said actual signals, or alternatively a computer readable memory for directing a programmable apparatus to function in the manner of the fifth object.

In a sixth embodiment, the invention provides a method for identifying, classifying, or quantifying DNA molecules in a sample of DNA molecules having a plurality of different nucleotide sequences, the method comprising the steps of digesting said sample with one or more restriction endonucleases, each said restriction endonuclease recognizing a subsequence recognition site and digesting DNA at said recognition site to produce fragments with 5' overhangs; contacting said fragments with shorter and longer oligodeoxynucleotides, each said shorter oligodeoxynucleotide hybridizable with a said 5' overhang and having no terminal phosphates, each said longer oligodeoxynucleotide hybridizable with a said shorter oligodeoxynucleotide; ligating said longer oligodeoxynucleotides to said 5' overhangs on said DNA fragments to produce ligated DNA fragments; extending said ligated DNA fragments by synthesis with a DNA polymerase to produce blunt-ended double stranded DNA fragments; amplifying said blunt-ended double stranded DNA fragments by a method comprising contacting said DNA fragments with a DNA polymerase and primer oligodeoxynucleotides, each said primer oligodeoxynucleotide having a sequence comprising that of one of the longer oligodeoxynucleotides; determining the length of the amplified DNA fragments; and searching a DNA sequence database, said database comprising a plurality of known DNA sequences that may be present in the sample, for sequences matching one or more of said fragments of determined length, a sequence from said database matching a fragment of determined length when the sequence from said database comprises recognition sites of said one or more restriction endonucleases spaced apart by the determined length, whereby DNA molecules in said sample are identified, classified, or quantified.
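The database-search step of this embodiment may be illustrated by simulating a double digest on a known sequence. The sketch below uses EcoRI (G^AATTC) and HindIII (A^AGCTT), both of which leave four-nucleotide 5' overhangs. It reports the top-strand distance between adjacent cut positions and ignores the length added by ligated adapters, which is a simplification. The function names and the example sequence are hypothetical.

```python
# Recognition site and top-strand cut offset for each enzyme.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),    # G^AATTC
    "HindIII": ("AAGCTT", 1),  # A^AGCTT
}

def cut_positions(seq, enzymes):
    """Sorted (cut position, enzyme) pairs for every recognition site found."""
    cuts = []
    for name in enzymes:
        site, offset = ENZYMES[name]
        i = seq.find(site)
        while i != -1:
            cuts.append((i + offset, name))
            i = seq.find(site, i + 1)
    return sorted(cuts)

def doubly_cut_fragments(seq, enzymes):
    """(length, left enzyme, right enzyme) for fragments cut at both ends;
    only such fragments carry an adapter on each terminus for amplification."""
    cuts = cut_positions(seq, enzymes)
    return [(b - a, ea, eb) for (a, ea), (b, eb) in zip(cuts, cuts[1:])]

# Hypothetical database sequence with one site for each enzyme.
seq = "CC" + "GAATTC" + "T" * 10 + "AAGCTT" + "CC"
fragments = doubly_cut_fragments(seq, ["EcoRI", "HindIII"])
```

A database sequence would then be reported as a match when one of its simulated fragments has the determined length and the same pair of recognition sites.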

This invention further provides in the sixth embodiment additional methods wherein the sequence of each primer oligodeoxynucleotide further comprises 3' to and contiguous with the sequence of the longer oligodeoxynucleotide the portion of the recognition site of said one or more restriction endonucleases remaining on a DNA fragment terminus after digestion, said remaining portion being 5' to and contiguous with one or more additional nucleotides, and wherein a sequence from said database matches a fragment of determined length when the sequence from said database comprises subsequences that are the recognition sites of said one or more restriction endonucleases contiguous with said one or more additional nucleotides and when the subsequences are spaced apart by the determined length.

This invention further provides in the sixth embodiment additional methods wherein said determining step further comprises detecting the amplified DNA fragments by a method comprising staining said fragments with silver.

This invention further provides in the sixth embodiment additional methods wherein said oligodeoxynucleotide primers are detectably labeled, wherein the determining step further comprises detection of said detectable labels, and wherein a sequence from said database matches a fragment of determined length when the sequence from said database comprises recognition sites of the one or more restriction endonucleases, said recognition sites being identified by the detectable labels of said oligodeoxynucleotide primers, said recognition sites being spaced apart by the determined length, and optionally wherein said determining step further comprises detecting the amplified DNA fragments by a method comprising labeling said fragments with a DNA intercalating dye or detecting light emission from a fluorochrome label on said fragments.

This invention further provides in the sixth embodiment additional steps further comprising, prior to said determining step, the step of hybridizing the amplified DNA fragments with a detectably labeled oligodeoxynucleotide complementary to a subsequence, said subsequence differing from said recognition sites of said one or more restriction endonucleases, wherein the determining step further comprises detecting said detectable label of said oligodeoxynucleotide, and wherein a sequence from said database matches a fragment of determined length when the sequence from said database further comprises said subsequence between the recognition sites of said one or more restriction endonucleases.

This invention further provides in the sixth embodiment additional methods wherein the one or more restriction endonucleases are pairs of restriction endonucleases, the pairs being selected from the group consisting of Acc65I and HindIII, Acc65I and NgoMI, BamHI and EcoRI, BglII and HindIII, BglII and NgoMI, BsiWI and BspHI, BspHI and BstYI, BspHI and NgoMI, BsrGI and EcoRI, EagI and EcoRI, EagI and HindIII, EagI and NcoI, HindIII and NgoMI, NgoMI and NheI, NgoMI and SpeI, BglII and BspHI, Bsp120I and NcoI, BssHII and NgoMI, EcoRI and HindIII, and NgoMI and XbaI, or wherein the step of ligating is performed with T4 DNA ligase.

This invention further provides in the sixth embodiment additional methods wherein the steps of digesting, contacting, and ligating are performed simultaneously in the same reaction vessel, or optionally wherein the steps of digesting, contacting, ligating, extending, and amplifying are performed in the same reaction vessel.

This invention further provides in the sixth embodiment additional methods wherein the step of determining the length is performed by electrophoresis.

This invention further provides in the sixth embodiment additional methods wherein the step of searching said DNA database further comprises determining a pattern of fragments that can be generated and for each fragment in said pattern those sequences in said DNA database that are capable of generating the fragment by simulating the steps of digesting with said one or more restriction endonucleases, contacting, ligating, extending, amplifying, and determining applied to each sequence in said DNA database; and finding the sequences that are capable of generating said one or more fragments of determined length by finding in said pattern one or more fragments that have the same length and recognition sites as said one or more fragments of determined length.

This invention further provides in the sixth embodiment additional methods wherein the steps of digesting and ligating go substantially to completion.

This invention further provides in the sixth embodiment additional methods wherein the DNA sample is cDNA prepared from mRNA, and optionally wherein the DNA is of RNA from a tissue or a cell type derived from a plant, a single celled animal, a multicellular animal, a bacterium, a virus, a fungus, a yeast, or a mammal, and optionally wherein the mammal is a human, and optionally wherein the mammal is a human having or suspected of having a diseased condition, and optionally wherein the diseased condition is a malignancy.

In a seventh embodiment, this invention provides additional methods for identifying, classifying, or quantifying DNA molecules in a sample of DNA molecules with a plurality of nucleotide sequences, the method comprising the steps of digesting said sample with one or more restriction endonucleases, each said restriction endonuclease recognizing a subsequence recognition site and digesting DNA to produce fragments with 3' overhangs; contacting said fragments with shorter and longer oligodeoxynucleotides, each said longer oligodeoxynucleotide consisting of a first and second contiguous portion, said first portion being a 3' end subsequence complementary to the overhang produced by one of said restriction endonucleases, each said shorter oligodeoxynucleotide complementary to the 3' end of said second portion of said longer oligodeoxynucleotide strand; ligating said longer oligodeoxynucleotide to said DNA fragments to produce a ligated fragment; extending said ligated DNA fragments by synthesis with a DNA polymerase to form blunt-ended double stranded DNA fragments; amplifying said double stranded DNA fragments by use of a DNA polymerase and primer oligodeoxynucleotides to produce amplified DNA fragments, each said primer oligodeoxynucleotide having a sequence comprising that of one of said longer oligodeoxynucleotides; determining the length of the amplified DNA fragments; and searching a DNA sequence database, said database comprising a plurality of known DNA sequences that may be present in the sample, for sequences matching one or more of said fragments of determined length, a sequence from said database matching a fragment of determined length when the sequence from said database comprises recognition sites of said one or more restriction endonucleases spaced apart by the determined length, whereby DNA sequences in said sample are identified, classified, or quantified.

In an eighth embodiment, this invention provides additional methods of detecting one or more differentially expressed genes in an in vitro cell exposed to an exogenous factor relative to an in vitro cell not exposed to said exogenous factor comprising performing the methods of the first embodiment of this invention wherein said plurality of nucleic acids comprises cDNA of RNA of said in vitro cell exposed to said exogenous factor; performing the methods of the first embodiment of this invention wherein said plurality of nucleic acids comprises cDNA of RNA of said in vitro cell not exposed to said exogenous factor; and comparing the identified, classified, or quantified cDNA of said in vitro cell exposed to said exogenous factor with the identified, classified, or quantified cDNA of said in vitro cell not exposed to said exogenous factor, whereby differentially expressed genes are identified, classified, or quantified.

In a ninth embodiment, this invention provides additional methods of detecting one or more differentially expressed genes in a diseased tissue relative to a tissue not having said disease comprising performing the methods of the first embodiment of this invention wherein said plurality of nucleic acids comprises cDNA of RNA of said diseased tissue such that one or more cDNA molecules are identified, classified, and/or quantified; performing the methods of the first embodiment of this invention wherein said plurality of nucleic acids comprises cDNA of RNA of said tissue not having said disease such that one or more cDNA molecules are identified, classified, and/or quantified; and comparing said identified, classified, and/or quantified cDNA molecules of said diseased tissue with said identified, classified, and/or quantified cDNA molecules of said tissue not having the disease, whereby differentially expressed cDNA molecules are detected.

This invention further provides in the ninth embodiment additional methods wherein the step of comparing further comprises finding cDNA molecules which are reproducibly expressed in said diseased tissue or in said tissue not having the disease and further finding which of said reproducibly expressed cDNA molecules have significant differences in expression between the tissue having said disease and the tissue not having said disease, and optionally wherein said finding cDNA molecules which are reproducibly expressed and said significant differences in expression of said cDNA molecules in said diseased tissue and in said tissue not having the disease are determined by a method comprising applying statistical measures, and optionally wherein said statistical measures comprise determining reproducible expression if the standard deviation of the level of quantified expression of a cDNA molecule in said diseased tissue or said tissue not having the disease is less than the average level of quantified expression of said cDNA molecule in said diseased tissue or said tissue not having the disease, respectively, and wherein a cDNA molecule has significant differences in expression if the sum of the standard deviation of the level of quantified expression of said cDNA molecule in said diseased tissue plus the standard deviation of the level of quantified expression of said cDNA molecule in said tissue not having the disease is less than the absolute value of the difference of the level of quantified expression of said cDNA molecule in said diseased tissue minus the level of quantified expression of said cDNA molecule in said tissue not having the disease.
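By way of illustration and without limitation, the two statistical criteria just described can be sketched as follows. The function names and the sample expression levels are hypothetical, and the use of the population standard deviation is an assumption, since the text does not specify an estimator.

```python
from statistics import mean, pstdev

def is_reproducible(levels):
    """Reproducible expression: the standard deviation of the quantified
    levels is less than their average (population SD is an assumption)."""
    return pstdev(levels) < mean(levels)

def is_significantly_different(diseased, normal):
    """Significant difference: SD(diseased) + SD(normal) is less than the
    absolute difference of the two average expression levels."""
    spread = pstdev(diseased) + pstdev(normal)
    return spread < abs(mean(diseased) - mean(normal))

# Hypothetical repeated measurements of one cDNA in each tissue.
diseased_levels = [10.0, 12.0, 11.0]
normal_levels = [2.0, 3.0, 2.5]

print(is_reproducible(diseased_levels))
print(is_reproducible(normal_levels))
print(is_significantly_different(diseased_levels, normal_levels))
```

A cDNA would be reported as differentially expressed in this sketch only when both tissues pass the reproducibility test and the pair passes the difference test.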

This invention further provides in the ninth embodiment additional methods wherein the diseased tissue and the tissue not having the disease are from one or more mammals, and optionally wherein the disease is a malignancy, and optionally wherein the disease is a malignancy selected from the group consisting of prostate cancer, breast cancer, colon cancer, lung cancer, skin cancer, lymphoma, and leukemia.

This invention further provides in the ninth embodiment additional methods wherein the disease is a malignancy and the tissue not having the disease has a premalignant character.

In a tenth embodiment, this invention provides methods of staging or grading a disease in a human individual comprising performing the methods of the first embodiment of this invention in which said plurality of nucleic acids comprises cDNA of RNA prepared from a tissue from said human individual, said tissue having or suspected of having said disease, whereby one or more said cDNA molecules are identified, classified, and/or quantified; and comparing said one or more identified, classified, and/or quantified cDNA molecules in said tissue to the one or more identified, classified, and/or quantified cDNA molecules expected at a particular stage or grade of said disease.

In an eleventh embodiment, this invention provides additional methods for predicting a human patient's response to therapy for a disease, comprising performing the methods of the first embodiment of this invention in which said plurality of nucleic acids comprises cDNA of RNA prepared from a tissue from said human patient, said tissue having or suspected of having said disease, whereby one or more cDNA molecules in said sample are identified, classified, and/or quantified; and ascertaining if the one or more cDNA molecules thereby identified, classified, and/or quantified correlates with a poor or a favorable response to one or more therapies, and optionally which further comprises selecting one or more therapies for said patient for which said identified, classified, and/or quantified cDNA molecules correlates with a favorable response.

In a twelfth embodiment, this invention provides additional methods for evaluating the efficacy of a therapy in a mammal having a disease, the method comprising performing the methods of the first embodiment of this invention wherein said plurality of nucleic acids comprises cDNA of RNA of said mammal prior to a therapy; performing the method of the first embodiment of this invention wherein said plurality of nucleic acids comprises cDNA of RNA of said mammal subsequent to said therapy; comparing one or more identified, classified, and/or quantified cDNA molecules in said mammal prior to said therapy with one or more identified, classified, and/or quantified cDNA molecules of said mammal subsequent to therapy; and determining whether the response to therapy is favorable or unfavorable according to whether any differences in the one or more identified, classified, and/or quantified cDNA molecules after therapy are correlated with regression or progression, respectively, of the disease, and optionally wherein the mammal is a human.

In a thirteenth embodiment, this invention provides a kit comprising one or more containers having one or more restriction endonucleases; one or more containers having one or more shorter oligodeoxynucleotide strands; one or more containers having one or more longer oligodeoxynucleotide strands hybridizable with said shorter strands, wherein either the longer or the shorter oligodeoxynucleotide strands each comprise a sequence complementary to an overhang produced by at least one of said one or more restriction endonucleases; and instructions packaged in association with said one or more containers for use of said restriction endonucleases, shorter strands, and longer strands for identifying, classifying, or quantifying one or more DNA molecules in a DNA sample, said instructions comprising (i) digest said sample with said restriction endonucleases into fragments, each fragment being terminated on each end by a recognition site of said one or more restriction endonucleases; (ii) contact said shorter and longer strands and said digested fragments to form double stranded DNA adapters annealed to said digested fragments, (iii) ligate said longer strand to said fragments; (iv) generate one or more signals by separating and detecting such of said fragments that are digested on each end, each signal comprising a representation of the length of the fragment and the identity of the recognition sites on both termini of the fragments; and (v) search a nucleotide sequence database to determine sequences that match or the absence of any sequences that match said one or more generated signals, said database comprising a plurality of known nucleotide sequences of nucleic acids that may be present in the sample, a sequence from said database matching a generated signal when the sequence from said database has both (i) the same length between occurrences of said recognition sites of said one or more restriction endonucleases as is represented by the generated signal and (ii) the same recognition sites of said one or more restriction endonucleases as is represented by the generated signal.

This invention further provides in the thirteenth embodiment a kit wherein said one or more restriction endonucleases generate 5' overhangs at the terminus of digested fragments, wherein each said shorter oligodeoxynucleotide strand consists of a first and second contiguous portion, said first portion being a 5' end subsequence complementary to the overhang produced by one of said restriction endonucleases, and wherein each said longer oligodeoxynucleotide strand comprises a 3' end subsequence complementary to said second portion of said shorter oligodeoxynucleotide strand, or optionally wherein said one or more restriction endonucleases generate 3' overhangs at the terminus of the digested fragments, wherein each said longer oligodeoxynucleotide strand consists of a first and second contiguous portion, said first portion being a 3' end subsequence complementary to the overhang produced by one of said restriction endonucleases, and wherein each said shorter oligodeoxynucleotide strand is complementary to the 3' end of said second portion of said longer oligodeoxynucleotide strand.

This invention further provides in the thirteenth embodiment a kit wherein said instructions further comprise those signals expected from one or more DNA molecules of interest when said sample is digested with a particular one or more restriction endonucleases selected from among said one or more restriction endonucleases in said kit, and optionally wherein said one or more DNA molecules of interest are cDNA molecules differentially expressed in a disease condition.

This invention further provides in the thirteenth embodiment a kit wherein the restriction endonucleases are selected from the group consisting of Acc65I, AflII, AgeI, ApaLI, ApoI, AscI, AvrI, BamHI, BclI, BglII, BsiWI, Bsp120I, BspEI, BspHI, BsrGI, BssHII, BstYI, EagI, EcoRI, HindIII, MluI, NcoI, NgoMI, NheI, NotI, SpeI, and XbaI.

This invention further provides in the thirteenth embodiment a kit further comprising one or more containers having one or more double stranded adapter DNA molecules formed by annealing said longer and said shorter oligonucleotide strands.

This invention further provides in the thirteenth embodiment a kit further comprising the computer readable memory of claim 106, or optionally further comprising the computer readable memory of claim 114, or optionally further comprising the computer readable memory of claim 122.

This invention further provides in the thirteenth embodiment a kit further comprising in a container a DNA ligase, or optionally further comprising in a container a phosphatase capable of removing terminal phosphates from a DNA sequence.

This invention further provides in the thirteenth embodiment a kit further comprising one or more primers, each said primer consisting of a single stranded oligodeoxynucleotide comprising the sequence of one of said longer strands; and a DNA polymerase, and optionally wherein each of said one or more primers further comprises (a) a first subsequence that is the portion of the recognition site of one of said one or more restriction endonucleases remaining at the terminus of a fragment after digestion, and (b) a second subsequence of one or two additional nucleotides contiguous with and 3' to said first subsequence, wherein said primer is detectably labeled such that primers with differing said one or two additional nucleotides have different labels that can be distinguishably detected.

This invention further provides in the thirteenth embodiment a kit wherein said instructions further comprise: detect such of said fragments digested on each end by a method comprising staining said fragments with silver, labeling said fragments with a DNA intercalating dye, or detecting light emission from a fluorochrome label on said fragments.

This invention further provides in the thirteenth embodiment a kit further comprising reagents for performing a cDNA sample preparation step; reagents for performing a step of digestion by one or more restriction endonucleases; reagents for performing a ligation step; and reagents for performing a PCR amplification step.

4. BRIEF DESCRIPTION OF THE DRAWINGS

These and other features, aspects, and advantages of the present invention will become better understood by reference to the accompanying drawings, following description, and appended claims, where:

FIG. 1 shows exemplary results of the signals generated by the QEA method of this invention;

FIGS. 2A, 2B, 2C and 2D show DNA adapters for an RE/ligation implementation of the QEA method of this invention, where the restriction endonucleases generate 5' overhangs, open blocks indicating strands of DNA;

FIGS. 3A and 3B show the DNA adapters for an RE/ligation implementation of the QEA method of this invention, where the restriction endonucleases generate 3' overhangs;

FIGS. 4A, 4B, and 4C show an exemplary biotin alternative embodiment of the QEA method;

FIG. 5 shows the DNA primers for a PCR embodiment of the QEA method;

FIGS. 6A and 6B show a method for DNA sequence database selection according to this invention;

FIG. 7 shows an exemplary experimental description for the QEA embodiment of this invention;

FIGS. 8A and 8B show an overview of a method for determining a simulated database of experimental results for the QEA embodiment of this invention;

FIG. 9 shows the detail of a method for simulating a QEA reaction;

FIGS. 10A-F show exemplary results of the action of the method of FIG. 9;

FIG. 11 shows the detail of a method for determining a simulated database of experimental results for a QEA embodiment of this invention;

FIGS. 12A, 12B, and 12C show an exemplary computer system apparatus, and an alternative embodiment, implementing methods of this invention;

FIG. 13A shows exemplary detail of an experimental design method for QEA and CC embodiments of this invention; and

FIG. 13B shows exemplary detail of an experimental design method for a QEA embodiment of this invention;

FIG. 14 shows an exemplary method for ordering the DNA sequences found to be likely causes of a QEA signal in the order of their likely presence in the sample;

FIG. 15 shows the detail of a method for determining a simulated database of experimental results for a CC embodiment of this invention;

FIGS. 16A, 16B, 16C, and 16D show exemplary reaction temperature profiles for preferred manual and automated implementations of a preferred RE embodiment of a QEA method.

5. DETAILED DESCRIPTION

According to the present invention, to uniquely identify an expressed gene sequence, full or partial, and many components of genomic DNA it is not necessary to determine actual, complete nucleotide sequences of samples. Full sequences provide far more information than is needed to merely classify or determine a gene according to the invention. For example, in the human genome, it is known that there are approximately 10^5 expressed genes. Since the average length of a coding sequence is approximately 2000 nucleotides, the total number of possible sequences is approximately 4^2000, or about 10^1200. The actual number of expressed human genes is an unimaginably small fraction (10^-1195) of the total number of possible DNA sequences. Even sequencing a 50 bp fragment of a cDNA sequence generates about 10^25 times more information than is needed for classification of that sequence. Use of the present invention allows direct classification of expressed gene sequences with far less information than either a complete or a partial sequence determination of a sample.

In computer science, codes which compactly identify a few members from among a large set of possibilities are called hash codes. An object of this invention is to construct hash codes for expressed DNA sequences, or alternatively for any other existing set of DNA sequences. In a fully populated code without any unassigned code words, all human genes could be coded by an approximately 17 bit binary number (2^17 = 1.3 x 10^5). A 20 bit code would be about 10% filled or 90% sparse (2^20 = 1.0 x 10^6).
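The arithmetic behind these code sizes can be checked with a brief sketch. The function names are hypothetical and the gene count of 10^5 is the approximate figure used in this description.

```python
import math

def bits_needed(n_items):
    """Smallest number of bits whose code space holds n_items distinct codes."""
    return (n_items - 1).bit_length()

def fill_fraction(n_items, n_bits):
    """Fraction of the 2**n_bits code words that would be assigned."""
    return n_items / 2 ** n_bits

N_GENES = 10 ** 5  # approximate number of expressed human genes, per the text

print(bits_needed(N_GENES))                  # bits for a fully populated code
print(round(fill_fraction(N_GENES, 20), 3))  # occupancy of a 20 bit code

# Size of the space of 2000-nucleotide sequences, as a power of ten.
print(round(2000 * math.log10(4)))
```

The last line gives the exponent of 4^2000 in base ten, which is consistent with the figure of about 10^1200 stated above.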

In this invention codes are constructed from signals which represent the presence of short nucleic acid (preferably DNA) subsequences (hereinafter called "target subsequences") in the sample sequence and, preferably, in a QEA embodiment, include a representation of the length along the sample sequence between adjacent target subsequences. The presence of these subsequences is recognized by subsequence recognition means, including, but not limited to, REs, DNA binding proteins, and oligomers ("probes") of, for example, PNAs or DNAs that are hybridizable to DNA. The subsequence recognition means allow recognition of specific DNA subsequences by the ability to specifically bind to or react with such subsequences. The invention, and particularly its computer methods, are adaptable to any subsequence recognition means available in the art. Acceptable subsequence recognition means preferably precisely and reproducibly recognize target subsequences and generate a recognition signal of adequate signal to noise ratio for all genes, however rare, in a sample, and can also provide information on the length between target subsequences.

The signals contain representations of target subsequence occurrences and, preferably, a representation of the length between target subsequence occurrences. In various embodiments of this invention these representations may differ. In embodiments where the target subsequences are exactly recognized, as where REs are used, subsequence representation may simply be the actual identity of the subsequences. In other embodiments where subsequence recognition is less exact, as where short oligomers are used, this representation may be "fuzzy". It may, for example, consist of all subsequences which differ by one nucleotide from the target, or some other set of possible subsequences, perhaps weighted by the probability that each member of the set is the actual subsequence in the sample sequence. Further, the length representation may depend on the separation and detection means used to generate the signals. In the case of electrophoretic separation, the length observed electrophoretically may need to be corrected, perhaps up to 5 to 10%, for mobility differences due to average base composition differences or due to effects of any labeling moiety used for detection. As these corrections may not be known until target sequence recognition, the signal may contain the electrophoretic length in bp and not the true physical length in bp. For simplicity and without limitation, in most of the following description unless otherwise noted the signals are presumed to represent the information conveyed exactly, as if generated by exact recognition means and error or bias free separation and detection means. However, in particular embodiments, target subsequences may be represented in a fuzzy fashion and length, if present, with separation and detection bias present.
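One "fuzzy" representation mentioned above, the set of all subsequences differing by one nucleotide from the target, can be enumerated as in the following sketch. The function name is hypothetical and the six-base target is only an example; no probability weighting is attempted here.

```python
def one_mismatch_neighbors(target, alphabet="ACGT"):
    """Return every subsequence that differs from target at exactly one position."""
    neighbors = set()
    for i, base in enumerate(target):
        for alt in alphabet:
            if alt != base:
                neighbors.add(target[:i] + alt + target[i + 1:])
    return neighbors

# Example target: a 6-base subsequence gives 6 positions x 3 alternatives.
fuzzy_set = one_mismatch_neighbors("GAATTC")
print(len(fuzzy_set))
print(sorted(fuzzy_set)[:3])
```

A weighted variant would attach to each member the probability that it is the actual subsequence in the sample sequence, as the text suggests.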

Target subsequences recognized are typically of contiguous sequence. This is required for all known REs. However, oligomers recognizing discontinuous subsequences can be used and can be constructed by inserting degenerate nucleotides in any discontinuous region. For example, a set of 16 oligomers recognizing AGC--TAT, with a two nucleotide skip between the two portions of the recognition subsequence, could be constructed as TCGNNATA, where N is any nucleotide. Alternatively, such discontinuous subsequences can be recognized by one oligomer of the form TCGiiATA, where "i" is inosine, or any other "universal" nucleotide, capable of hybridizing with any naturally occurring base.
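The count of 16 oligomers follows from expanding the two degenerate positions, as the following sketch shows. The function names are hypothetical. The oligomer TCGNNATA is taken exactly as written in the text, where each base is the complement of the base at the same position of AGC--TAT.

```python
from itertools import product

def expand_degenerate(oligo, alphabet="ACGT"):
    """Expand each degenerate position N into all four nucleotides."""
    choices = [alphabet if ch == "N" else ch for ch in oligo]
    return ["".join(combo) for combo in product(*choices)]

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(seq):
    """Base-by-base complement, without reversing the strand."""
    return seq.translate(COMPLEMENT)

oligos = expand_degenerate("TCGNNATA")
print(len(oligos))        # 4 x 4 combinations of the two N positions
print(oligos[0], oligos[-1])
print(complement("AGC"), complement("TAT"))
```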

This invention is adaptable to analyzing any DNA sample for which exists an accompanying database listing possible sequences in the sample. More generally, the invention is adaptable to analyzing the sequences of any biopolymer, built of a small number of repeating units, whose naturally occurring representatives are far fewer than the number of possible, physical polymers and in which small subsequences can be recognized. Thus it is applicable to not only naturally occurring DNA polymers but also to naturally occurring RNA polymers, proteins, glycans, etc. Typically and without limitation, however, the invention is applied to the analysis of cDNA samples from any in vivo or in vitro sources. cDNA can be synthesized either from total cellular RNA or from specific sub-pools of RNA. These RNA sub-pools can be produced by RNA pre-purification, for example, the separation of mRNA of the endoplasmic reticulum from cytoplasmic mRNA, which thereby enriches mRNA primarily encoding for cell surface or extracellular proteins (Cells et al., 1994, Cell Biology, Academic Press, New York, N.Y.). Such enriched mRNAs have increased diagnostic or therapeutic utility due to their encoded protein's cell-surface or extracellular roles, such as being a receptor. Such pre-purified RNA pools can be used in all embodiments of this invention.

First strand cDNA synthesis can use any priming method known in the art, for example, oligo(dT) primers, random hexamer primers, phasing primers, mixtures thereof, etc. Phasing primers, containing either an A, C, or G at the 3' end, can be used in separate cDNA synthesis reactions to split the cDNA first strands into 3 pools, each generated from poly(A) mRNA having a T, G, or C, respectively, 5' to the poly(A) tail. Fifteen mixtures can be synthesized by using all 15 possible oligo(dT) primers containing a pair of non-T nucleotides at the 3' end.

Two specific embodiments of the invention are respectively termed "quantitative expression analysis" ("QEA") and "colony calling" ("CC").

The specific embodiment, QEA, probes a sample with recognition means, the recognition means generating signals, a preferred signal being a triple comprising an indication of the presence of a first target subsequence, an indication of the presence of a second target subsequence, and a representation of the length between the target subsequences in the sample nucleic acid sequence. Each pair of target subsequences may occur more than once in a sample nucleic acid, in which case the associated lengths are between adjacent target subsequence occurrences.

The QEA embodiment is preferred for classifying and determining sequences in cDNA mixtures, but is also adaptable to samples with only one sequence. It is preferred for mixtures because it affords the relative advantage over prior art methods that cloning of sample nucleic acids is not required. Typically, enough distinguishable signals are generated from pairs of target subsequences to recognize a desired sequence in a sample mixture. For example, first, any pair of target subsequences may hit more than once in a single DNA molecule to be analyzed, thereby generating several signals with differing lengths from one DNA molecule. Second, even if the pair of target subsequences hits only once in two different DNA molecules to be analyzed, the lengths between the hits may differ and thus distinguishable signals may be generated.
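The signal triple described above, and the way one molecule can yield several signals of differing length, can be illustrated by a simplified sketch. All names are hypothetical. Length is taken here as the distance between the start positions of adjacent recognition sites, which is an assumption; an actual experiment would also account for cut positions within the sites and for any adapter or primer length.

```python
def find_sites(sequence, sites):
    """Return sorted (position, enzyme_name) hits for every recognition site."""
    hits = []
    for name, site in sites.items():
        start = sequence.find(site)
        while start != -1:
            hits.append((start, name))
            start = sequence.find(site, start + 1)
    return sorted(hits)

def qea_signals(sequence, sites):
    """Triples (first site, second site, length) for adjacent site occurrences."""
    hits = find_sites(sequence, sites)
    return [(n1, n2, p2 - p1) for (p1, n1), (p2, n2) in zip(hits, hits[1:])]

def matching_sequences(signal, database, sites):
    """Names of database sequences predicted to generate the given signal."""
    return [name for name, seq in database.items()
            if signal in qea_signals(seq, sites)]

# Recognition sites of two REs named in this description.
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC"}

# Hypothetical toy sequence with three recognition sites.
toy = "AA" + "GAATTC" + "TTTT" + "GGATCC" + "AAAAA" + "GAATTC" + "AA"
print(qea_signals(toy, SITES))
```

Searching a database of candidate sequences for an observed triple then reduces to a call such as matching_sequences(("EcoRI", "BamHI", 10), database, SITES), where database is a hypothetical mapping from sequence names to sequences.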

The target subsequences used in QEA are preferably optimally chosen by methods of this invention from DNA sequence databases containing sequences likely to occur in the sample to be analyzed. Human genome sequences, both expressed and genetic, produced by the Human Genome Project in the United States, by efforts abroad, and by private companies are being collected in several available databases (listed in § 5.1).

In a QEA "query mode" experiment, the focus is on determining the expression of several genes, perhaps 1-100, of interest and of known sequence. A minimal number of target subsequences is chosen to generate signals, with the goal that each of the several genes is discriminated by at least one unique signal, which also discriminates it from all the other genes likely to occur in the sample. In other words, the experiment is designed so that each gene generates at least one signal unique to it (a "good" gene, see infra). In a QEA "tissue mode" experiment, the focus is on determining the expression of as many as possible, preferably a majority, of the genes in a tissue, without the need for any prior knowledge or interest in their expression. Target subsequences are optimally chosen to discriminate the maximum number of sample DNA sequences into classes comprising one or preferably at most a few sequences. Signals are generated and detected as determined by the threshold and sensitivity of a particular experiment. Some important determinants of threshold and sensitivity are the initial amount of mRNA and thus of cDNA, the amount of molecular amplification performed during the experiment, and the sensitivity of the detection means. Preferably, enough signals are produced and detected so that the computer methods of this invention can uniquely determine the expression of a majority, or more preferably most, of the genes expressed in a tissue.
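The notion of a "good" gene in query mode, one that generates at least one signal shared with no other gene likely to occur in the sample, can be sketched as follows. The function name, gene names, and signal values are hypothetical illustrations, not data from any experiment.

```python
from collections import Counter

def good_genes(signals_by_gene):
    """Genes having at least one signal that no other gene generates."""
    # Count, for each signal, how many distinct genes produce it.
    counts = Counter(sig for sigs in signals_by_gene.values() for sig in set(sigs))
    return {gene for gene, sigs in signals_by_gene.items()
            if any(counts[sig] == 1 for sig in sigs)}

# Hypothetical predicted signals: (first site, second site, length).
predicted = {
    "geneA": {("RE1", "RE2", 120), ("RE1", "RE2", 300)},
    "geneB": {("RE1", "RE2", 120)},
    "geneC": {("RE1", "RE2", 450)},
}
print(sorted(good_genes(predicted)))
```

In this toy case geneB is not "good" because its only signal is also generated by geneA, so a further pair of target subsequences would be needed to discriminate it.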

QEA signals are generated by methods utilizing recognition means that include, but are not limited to REs in a preferred RE/ligase method or in a method utilizing a removal means, preferably contacting streptavidin linked to a solid phase with biotin-labeled DNA, for removal of unwanted DNA fragments, and nucleotide oligomer primers in a PCR method.

A preferred embodiment of the RE/ligase method is as follows. The method employs recognition reactions with a pair (or more) of REs which recognize target subsequences with high specificity and cut the sequence at the recognition sites leaving fragments with sticky ends characteristic of the particular RE. To each sticky end, special primers are ligated which are distinctively labeled with fluorochromes identifying the particular RE making the cut, and thus the particular target subsequence. A DNA polymerase is used to form blunt-ended DNA fragments. The labeled fragments are then PCR amplified using the same special primers a number of times preferably just sufficient to detect signals from all sequences of interest while making relatively small signals from the linearly amplifying singly cut fragments. The amplified fragments are then separated by length using gel electrophoresis, and the length and labeling of the fragments are optically detected. Optionally, single stranded fragments can be removed by binding to a hydroxyapatite, or other single strand specific, column or by digestion by a single strand specific nuclease. Also, this invention is adaptable to other functionally equivalent amplification and length separation means. In this manner, the identity of the REs cutting a fragment, and thereby the subsequences present, as well as the length between the cuts is determined.
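The reason singly cut fragments make relatively small signals can be seen from an idealized comparison of exponential and linear amplification. This is a sketch under the assumption of a fixed per-cycle efficiency; the function names and the efficiency parameter are hypothetical and real reactions deviate from this model.

```python
def exponential_copies(cycles, efficiency=1.0):
    """Copies per starting molecule when both ends are primed (doubly cut)."""
    return (1 + efficiency) ** cycles

def linear_copies(cycles, efficiency=1.0):
    """Copies per starting molecule when only one end is primed (singly cut)."""
    return 1 + efficiency * cycles

for n in (10, 20):
    ratio = exponential_copies(n) / linear_copies(n)
    print(n, exponential_copies(n), linear_copies(n), round(ratio))
```

Under these assumptions the doubly cut fragments dominate the singly cut fragments by several orders of magnitude after a modest number of cycles.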

In a preferred PCR method for QEA, a suitable collection of target subsequences is chosen by the computer implemented QEA experimental design methods and PCR primers distinctively labeled with fluorochromes are synthesized to hybridize with these subsequences. The primers are designed as described in § 5.3 to reliably recognize short subsequences while achieving a high specificity in PCR amplification. Using these primers, a minimum number of PCR amplification steps amplifies those fragments between the primed subsequences existing in DNA sequences in the sample. The labeled, amplified fragments are separated by gel electrophoresis and detected.

In an exemplary QEA method utilizing a removal means, which has improved quantitative characteristics and is also adapted to highly sensitive detection systems, cDNA is synthesized from a tissue sample using at least one internally biotinylated primer. The cDNA is then cyclized, cut wit