Multilingual

Language translation apparatus and method using context-based translation models

5510981

Abstract

An apparatus for translating a series of source words in a first language to a series of target words in a second language. For an input series of source words, at least two target hypotheses, each including a series of target words, are generated. Each target word has a context comprising at least one other word in the target hypothesis. For each target hypothesis, a language model match score including an estimate of the probability of occurrence of the series of words in the target hypothesis. At least one alignment connecting each source word with at least one target word in the target hypothesis is identified. For each source word and each target hypothesis, a word match score including an estimate of the conditional probability of occurrence of the source word, given the target word in the target hypothesis which is connected to the source word and given the context in the target hypothesis of the target word which is connected to the source word. For each target hypothesis, a translation match score including a combination of the word match scores for the target hypothesis and the source words in the input series of source words. A target hypothesis match score including a combination of the language model match score for the target hypothesis and the translation match score for the target hypothesis. The target hypothesis having the best target hypothesis match score is output.


Claims

We claim:

1. An apparatus for translating a series of source words in a first language to a series of target words in a second language different from the first language, said apparatus comprising:

means for inputting said series of source words;

means for generating at least two target hypotheses, each target hypothesis comprising said series of target words selected from a vocabulary of words in the second language, each target word having a context comprising at least one other word in the target hypothesis;

means for generating, for each target hypothesis, a language model match score comprising an estimate of the probability of occurrence of the series of words in the target hypothesis;

means for identifying at least one alignment between the input series of source words and each target hypothesis, the alignment connecting each source word with at least one target word in the target hypothesis;

means for generating, for each source word and each target hypothesis, a word match score comprising an estimate of the conditional probability of occurrence of the source word, given the target word in the target hypothesis which is connected to the source word and given the context of the target word in the target hypothesis which is connected to the source word;

means for generating, for each target hypothesis, a translation match score comprising a combination of the word match scores for the target hypothesis and the source words in the input series of source words;

means for generating a target hypothesis match score for each target hypothesis, each target hypothesis match score comprising a combination of the language model match score for the target hypothesis and the translation match score for the target hypothesis; and

means for outputting the target hypothesis having the best target hypothesis match score.

2. An apparatus as claimed in claim 1, characterized in that each target hypothesis comprises a series of target words selected from a vocabulary comprising words in the second language and a null word representing the absence of a word.

3. An apparatus as claimed in claim 2, characterized in that:

the means for identifying at least one alignment comprises means for identifying two or more alignments between the input series of source words and each target hypothesis, each alignment connecting each source word with at least one target word in the target hypothesis;

the word match score generator comprises means for generating, for each source word and each alignment and each target hypothesis, a word match score comprising an estimate of the conditional probability of occurrence of the source word, given the target word in the target hypothesis which is connected to the source word and given the context of the target word in the target hypothesis which is connected to the source word;

the translation match score generator comprises means for generating, for each target hypothesis, a translation match score comprising a combination of the word match scores for the target hypothesis and the alignments and the source words in the input series of source words.

4. An apparatus as claimed in claim 2, characterized in that:

the input means comprises means for transforming the input series of source words into a series of transformed source words;

the alignment means comprises means for identifying at least one alignment between the series of transformed source words and each target hypothesis, the alignment connecting each transformed source word with at least one target word in the target hypothesis;

the word match score generator comprises means for generating, for each transformed source word and each target hypothesis, a word match score comprising an estimate of the conditional probability of occurrence of the transformed source word, given the target word in the target hypothesis which is connected to the transformed source word and given the context of the target word in the target hypothesis which is connected to the transformed source word;

the translation match score generator comprises means for generating for each target hypothesis, a translation match score comprising a combination of the word match scores for the target hypothesis and the transformed source words; and

the output means comprises means for synthesizing a series of output words from the target hypothesis having the best target hypothesis match score, and means for outputting the synthesized output words.

5. An apparatus as claimed in claim 2, characterized in that:

the translation match score for a target hypothesis comprises the product of the word match scores for the target hypothesis and the source words in the input series of source words; and

the target hypothesis match score for a target hypothesis comprises the product of the language model match score for the target hypothesis multiplied by the translation match score for the target hypothesis.

6. An apparatus as claimed in claim 2, characterized in that:

the context of the target word in the target hypothesis which is connected to a source word is contained in at least one of two or more context classes; and

the estimated conditional probability of occurrence of the source word, given the target word in the target hypothesis which is connected to the source word and given the context of the target word which is connected to the source word, comprises at least one function having a value dependent on the class containing the context of the target word which is connected to the source word.

7. An apparatus as claimed in claim 2, characterized in that:

the context of the target word in the target hypothesis which is connected to a source word comprises at least one word having a part of speech in the target hypothesis; and

the estimated conditional probability of occurrence of the source word, given the target word in the target hypothesis which is connected to the source word and given the context of the target word which is connected to the source word, comprises at least one function having a value dependent on the part of speech in the target hypothesis of at least one word in the context of the target word which is connected to the source word.

8. An apparatus as claimed in claim 2, characterized in that:

the context of the target word in the target hypothesis which is connected to a source word comprises at least one word having an identity; and

the estimated conditional probability of occurrence of the source word, given the target word in the target hypothesis which is connected to the source word and given the context of the target word which is connected to the source word, comprises at least one function having a value dependent on the identity of at least one word in the context of the target word which is connected to the source word.

9. An apparatus as claimed in claim 2, characterized in that the means for outputting the target hypothesis having the best match score comprises a display.

10. An apparatus as claimed in claim 2, characterized in that the input means comprises a keyboard.

11. An apparatus as claimed in claim 2, characterized in that the input means comprises a computer disk drive.

12. An apparatus as claimed in claim 2, characterized in that the input means comprises a computer tape drive.

13. A computer-implemented method of translating a series of source words in a first language to a series of target words in a second language different from the first language, said method comprising:

inputting a series of source words;

generating at least two target hypotheses, each target hypothesis comprising said series of target words selected from a vocabulary of words in the second language, each target word having a context comprising at least one other word in the target hypothesis;

generating, for each target hypothesis, a language model match score comprising an estimate of the probability of occurrence of the series of words in the target hypothesis;

identifying at least one alignment between the input series of source words and each target hypothesis, the alignment connecting each source word with at least one target word in the target hypothesis;

generating, for each source word and each target hypothesis, a word match score comprising an estimate of the conditional probability of occurrence of the source word, given the target word in the target hypothesis which is connected to the source word and given the context of the target word in the target hypothesis which is connected to the source word;

generating, for each target hypothesis, a translation match score comprising a combination of the word match scores for the target hypothesis and the source words in the input series of source words;

generating a target hypothesis match score for each target hypothesis, each target hypothesis match score comprising a combination of the language model match score for the target hypothesis and the translation match score for the target hypothesis; and

outputting the target hypothesis having the best target hypothesis match score.

14. A method as claimed in claim 13, characterized in that each target hypothesis comprises a series of target words selected from a vocabulary comprising words in the second language and a null word representing the absence of a word.

15. A method as claimed in claim 14, characterized in that:

the step of identifying at least one alignment comprises identifying two or more alignments between the input series of source words and each target hypothesis, each alignment connecting each source word with at least one target word in the target hypothesis;

the step of generating a word match score comprises generating, for each source word and each alignment and each target hypothesis, a word match score comprising an estimate of the conditional probability of occurrence of the source word, given the target word in the target hypothesis which is connected to the source word and given the context of the target word in the target hypothesis which is connected to the source word;

the step of generating a translation match score comprises generating, for each target hypothesis, a translation match score comprising a combination of the word match scores for the target hypothesis and the alignments and the source words in the input series of source words.

16. A method as claimed in claim 14, characterized in that:

the step of inputting comprises the step of transforming the input series of source words into a series of transformed source words;

the step of aligning comprises identifying at least one alignment between the series of transformed source words and each target hypothesis, the alignment connecting each transformed source word with at least one target word in the target hypothesis;

the step of generating a word match score comprises generating, for each transformed source word and each target hypothesis, a word match score comprising an estimate of the conditional probability of occurrence of the transformed source word, given the target word in the target hypothesis which is connected to the transformed source word and given the context of the target word in the target hypothesis which is connected to the transformed source word;

the step of generating a translation match score comprises generating, for each target hypothesis, a translation match score comprising a combination of the word match scores for the target hypothesis and the transformed source words; and

the step of outputting comprises the step of synthesizing a series of output words from the target hypothesis having the best target hypothesis match score.

17. A method as claimed in claim 14, characterized in that:

the translation match score for a target hypothesis comprises the product of the word match scores for the target hypothesis and the source words in the input series of source words; and

the target hypothesis match score for a target hypothesis comprises the product of the language model match score for the target hypothesis multiplied by the translation match score for the target hypothesis.

18. A method as claimed in claim 14, characterized in that:

the context of the target word in the target hypothesis which is connected to a source word is contained in at least one of two or more context classes; and

the estimated conditional probability of occurrence of the source word, given the target word in the target hypothesis which is connected to the source word and given the context of the target word which is connected to the source word, comprises at least one function having a value dependent on the class containing the context of the target word which is connected to the source word.

19. A method as claimed in claim 14, characterized in that:

the context of the target word in the target hypothesis which is connected to a source word comprises at least one word having a part of speech in the target hypothesis; and

the estimated conditional probability of occurrence of the source word, given the target word in the target hypothesis which is connected to the source word and given the context of the target word which is connected to the source word, comprises at least one function having a value dependent on the part of speech in the target hypothesis of at least one word in the context of the target word which is connected to the source word.

20. A method as claimed in claim 14, characterized in that:

the context of the target word in the target hypothesis which is connected to a source word comprises at least one word having an identity; and

the estimated conditional probability of occurrence of the source word, given the target word in the target hypothesis which is connected to the source word and given the context of the target word which is connected to the source word, comprises at least one function having a value dependent on the identity of at least one word in the context of the target word which is connected to the source word.

21. A method as claimed in claim 14, characterized in that the step of outputting the target hypothesis having the best match score comprises displaying the target hypothesis having the best match score.


Description

BACKGROUND OF THE INVENTION

The invention relates to computerized language translation, such as computerized translation of a French sentence into an English sentence.

In U.S. patent application Ser. No. 07/736,278, filed Jul. 25, 1991, now pending, entitled "Method and System for Natural Language Translation" by Peter F. Brown et al (the entire content of which is incorporate herein by reference), there is described a computerized language translation system for translating a text F in a source language to a text E in a target language. The system described therein evaluates, for each of a number of hypothesized target language texts E, the conditional probability P(E.vertline.F) of the target language test E given the source language text F. The hypothesized target language text E having the highest conditional probability P(E.vertline.F) is selected as the translation of the source language text F.

Using Bayes' theorem, the conditional probability P(E.vertline.F) of the target language text E given the source language text F can be written as ##EQU1##

Since the probability P(F) of the source language text F in the denominator of Equation 1 is independent of the target language text E, the target language text E having the highest conditional probability P(E.vertline.F) will also have the highest product P(F.vertline.E) P(E). We therefore arrive at ##EQU2## In Equation 2, the probability P(E) of the target language text E is a language model match score and may be estimated from a target language model. While any known language model may be used to estimate the probability P(E) of the target language text E, Brown et al describe an n-gram language model comprising a 1-gram model, a 2-gram model, and a 3-gram model combined by parameters whose values are obtained by interpolated estimation.

The conditional probability P(F.vertline.E) in Equation 2 is a translation match score. As described by Brown et al, the translation match score P(F.vertline.E) for a source text F comprising a series of source words, given a target hypothesis E comprising a series of target words, may be estimated by finding all possible alignments connecting the source words in the source text F with the target words in the target text E, including alignments in which one or more source words are not connected to any target words, but not including alignments where a source word is connected to more than one target word. For each alignment and each target word e in the target text E connected to .phi. source words in the source text F, there is estimated the fertility probability n(.phi..vertline.e) that the target word e is connected to the .phi. source words in the alignment. There is also estimated for each source word f in the source text F and each target word e in the target text E connected to the source word f by the alignment, the lexical probability t(f.vertline.e) that the source word f would occur given the occurrence of the connected target word e.

For each alignment and each source word f, Brown et al further estimate the distortion probability a(j.vertline.a.sub.j,m) that the source word f is located in position j of the source text F, given that the target word e connected to the source word f is located in position a.sub.j in the target text E, and given that there are m words in the source text F.

By combining the fertility probabilities for an alignment and for all target words e in the target text E, and multiplying the result by the probability ##EQU3## of the number .phi..sub.0 of target words not connected with any source words in the alignment, given the sum of the fertilities .phi. of all of the target words in the target text E in the alignment, a fertility score for the target text E and the alignment is obtained.

By combining the lexical probabilities for an alignment and for all source words in the source text F, a lexical score for the alignment is obtained.

By combining the distortion probabilities for an alignment and for all source words in the source text F which are connected to a target word in the alignment, and by multiplying the result by ##EQU4## (where .phi..sub.0 is the number of target words in the target text E that are not connected with any source words), a distortion score for the alignment is obtained.

Finally, by combining the fertility, lexical, and distortion scores for the alignment, and multiplying the result by the combinatorial factor ##EQU5## a translation match score for the alignment is obtained. (See, Brown et al, Section 8.2.)

The translation match score P(F.vertline.E) for the source text F and the target hypothesis E may be the sum of the translation match scores for all permitted alignments between the source text F and the target hypothesis E. Preferably, the translation match score P(F.vertline.E) for the source text F and the target hypothesis E is the translation match score for the alignment estimated to be most probable.

Equation 2 may be used to directly estimate the target hypothesis match score P(F.vertline.E)P(E) for a hypothesized target language text E and a source language text F. However, in order to simply the language model P(E) and the translation model P(F.vertline.E), and in order to estimate the parameters of these models from a manageable amount of training data, Brown et al estimate the target hypothesis match score P(F.vertline.E)P(E) for simplified intermediate forms E' and F' of the target language text E and the source language text F, respectively. Each intermediate target language word e' represents a class of related target language words. Each intermediate source language word f' represents a class of related source language words. A source language transducer converts the source language text F to the intermediate form F'. The hypothesized intermediate form target language text E' having the highest hypothesis match score P(F'.vertline.E')P(E') is estimated from Equation 2. A target language transducer converts the best matched intermediate target language text E' to the target language text E.

In their language translation system, Brown et al estimate the lexical probability of each source word f as the conditional probability t(f.vertline.e) of each source word f given solely the target word e connected to the source word in an alignment. Consequently, the lexical probability provides only a coarse estimate of the probability of the source word f.

SUMMARY OF THE INVENTION

It is an object of the invention to provide an apparatus and method for translating a series of source words in a first language to a series of target words in a second language different from the first language, in which the apparatus and method provide improved estimates of the lexical probabilities of the source words.

It is another object of the invention to provide an apparatus and method for translating a series of source words in a first language to a series of target words in a second language different from the first language, in which the lexical probability of a source word is estimated as a conditional probability given the target word connected to the source word in an alignment, and given the context of the target word connected to the source word.

According to the invention, an apparatus for translating a series of source words in a first language to a series of target words in a second language different from the first language comprises means for inputting a series of source words. Means are also provided for generating at least two target hypotheses. Each target hypothesis comprises a series of target words selected from a vocabulary of words in the second language. Each target word has a context comprising at least one other word in the target hypothesis.

A language model match score generator generates, for each target hypothesis, a language model match score comprising an estimate of the probability of occurrence of the series of words in the target hypothesis. An alignment identifier identifies at least one alignment between the input series of source words and each target hypothesis. The alignment connects each source word with at least one target word in the target hypothesis.

A word match score generator is provided for generating, for each source word and each target hypothesis, a word match score comprising an estimate of the conditional probability of occurrence of the source word, given the target word in the target hypothesis which is connected to the source word and given the context of the target word in the target hypothesis which is connected to the source word. A translation match score generator generates, for each target hypothesis, a translation match score comprising a combination of the word match scores for the target hypothesis and the source words in the input series of source words.

A hypothesis match score generator is provided for generating a target hypothesis match score for each target hypothesis. Each target hypothesis match score comprises a combination of the language model match score for the target hypothesis and the translation match score for the target hypothesis. The target hypothesis having the best target hypothesis match score is provided at an output.

Preferably, each target hypothesis comprises a series of target words selected from a vocabulary comprising words in the second language and a null word representing the absence of a word.

The alignment identifier may comprise means for identifying two or more alignments between the input series of source words and each target hypothesis. Each alignment connects each source word with at least one target word in the target hypothesis. The word match score generator generates, for each source word and each alignment and each target hypothesis, a word match score comprising an estimate of the conditional probability of occurrence of the source word, given the target word connected to the source word, and given the context of the target word. The translation match score generator generates, for each target hypothesis, a translation match score comprising a combination of the word match scores for the target hypothesis and the alignments and the source words in the input series of source words.

The source text input device may comprise means for transforming the input series of source words into a series of transformed source words. The alignment means identifies at least one alignment between the series of transformed source words and each target hypothesis. Each alignment connects each transformed source word with at least one target word in the target hypothesis. The word match score generator generates, for each transformed source word and each target hypothesis, a word match score comprising an estimate of the conditional probability of occurrence of the transformed source word, given the target word in the target hypothesis which is connected to the transformed word and given the context of the target word in the target hypothesis which is connected to the transformed source word.

The translation match score generator generates, for each target hypothesis, a translation match score comprising a combination of the word match scores for the target hypothesis and the transformed source word. Output means are provided for synthesizing a series of output words from the target hypothesis having the best target hypothesis match score, and for outputting the output words.

The translation match score for a target hypothesis may comprise the product of the word match scores for the target hypothesis and the source words in the input series of source words. The target hypothesis match score for a target hypothesis may comprise the product of the language model match score for the target hypothesis multiplied by the translation match score for the target hypothesis.

The context of the target word in the target hypothesis which is connected to a source word may be contained in at least one of two or more context classes. The estimated conditional probability of occurrence of the source word, given the target word connected to the source word and given the context of the target word, comprises at least one function having a value dependent on the class containing the context of the target word.

Alternatively, the estimated conditional probability of occurrence of the source word may comprise a function having a value dependent on the part of speech in the target hypothesis of at least one word in the context of the target word which is connected to the source word, or dependent on the identity of at least one word in the context of the target word which is connected to the source word.

The means for outputting the target hypothesis having the best match score may comprise a display. The input means may comprise a keyboard, a computer disk drive, or a computer tape drive.

By estimating the lexical probability of a source word as the conditional probability of the source word given the target word connected to the source word in an alignment, and given the context of the target word connected to the source word, the present invention obtains translation match scores with improved accuracy.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an example of an apparatus for translating words from a first language to a second language according to the present invention.

FIG. 2 schematically shows an example of an alignment between a hypothetical series of source words and a hypothetical series of target words.

FIG. 3 schematically shows an example of a second alignment between the hypothetical series of source words and the hypothetical series of target words of FIG. 2.

FIG. 4 schematically shows an example of a third alignment between the hypothetical series of source words and the hypothetical series of target words of FIG. 2.

FIG. 5 is a schematic flow diagram of a source transducer.

FIG. 6 is a schematic flow diagram of a token-to-word transducer.

FIG. 7 is a schematic flow diagram of a syntactic transducer for English.

FIG. 8 is a schematic flow diagram of a syntactic transducer for French.

FIG. 9 is an example of a syntactic transduction.

FIG. 10 is a schematic flow diagram illustrating the operation of a finite-state transducer.

FIG. 11 is Table of Patterns for a finite-state pattern matcher employed in some source and target transducers.

FIG. 12 is a Table of Actions for an action processing module employed in some source and target transducers.

FIG. 13 is a schematic block diagram illustrating a manner in which statistical transfer can be incorporated into a translation system based on an interlingua.

FIG. 14 is an example of informants and informant sites.

FIG. 15 is a schematic flow diagram illustrating the operation of a sense-labelling transducer.

FIG. 16 is a schematic flow diagram of a module that determines good questions about informants for each vocabulary word.

FIG. 17 is a schematic flow diagram of a module that determines a good question about each informant of a vocabulary word.

FIG. 18 is a schematic flow diagram of a method for determining a good question about an informant.

FIG. 19 is a schematic block diagram of a simplified French to English translation system which incorporates analysis and synthesis.

FIG. 20 is a schematic flow diagram of a translation component of a batch system.

FIG. 21 is a schematic flow diagram of the hypothesis search component of the system.

FIG. 22 is a schematic flow diagram of a translation component of a human-aided system.

FIG. 23 is an example of a source sentence being transduced to a sequence of morphemes.

FIG. 24 is an example of a partial hypothesis which results from an extension by the target morpheme the.

FIG. 25 is an example of a partial hypothesis which results from an extension with an open target morpheme.

FIG. 26 is an example of a partial hypothesis which results from an extension in which an open morpheme is closed.

FIG. 27 is an example of a partial hypothesis which results from an extension in which an open morpheme is kept open.

FIG. 28 is an example of a partial hypothesis which results from an extension by the target morpheme to.sub.-- wake.

FIG. 29 is an example of a partial hypothesis which results from an extension by the pair of target morphemes up and to.sub.-- wake.

FIG. 30 is an example of a partial hypothesis which results from an extension by the pair of target morphemes up and to.sub.-- wake in which to.sub.-- wake is open.

FIG. 31 is an example of a subset lattice.

FIG. 32 is an example of a partial hypothesis that is stored in the priority queue {2,3}.

FIG. 33 is a schematic flow diagram of the process by which partial hypothesis are selected for extension.

FIG. 34 is a schematic flow diagram of the method by which the partial hypotheses on a priority queue are processed in the selection for extension step.

FIG. 35 contains pseudocode describing the method for extending hypotheses.

FIG. 36 is a schematic block diagram of a target structure language model.

FIG. 37 is a schematic block digram of a method for partitioning a vocabulary into classes.

FIG. 38 is a schematic flow diagram of a method for partitioning a vocabulary into classes.

FIG. 39 shows some sample subtrees from a tree constructed using a clustering method.

FIG. 40 is a schematic flow diagram of a method for aligning sentences in parallel corpora.

FIG. 41 is a schematic flow diagram of the basic step of a method for aligning sentences.

FIG. 42 depicts a sample of text before and after textural cleanup and sentence detection.

FIG. 43 shows an example of a division of aligned corpora into beads.

FIG. 44 shows a finite state model for generating beads.

FIG. 45 shows the beads that are allowed by the model.

FIG. 46 is a histogram of French sentence lengths.

FIG. 47 is a histogram of English sentence lengths.

FIG. 48 is a schematic diagram of a process by which word-by-word correspondences are extracted from a bilingual corpus.

FIG. 49 is an example of a partial hypothesis which results from an extension by the target morpheme mother.

FIG. 50 is a schematic flow diagram of a basic system operating in batch mode.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 1 is a block diagram of an example of an apparatus for translating words from a first language to a second language according to the present invention. The apparatus comprises a source text input device 10 for inputting a series of source words. The source text input device 10 may comprise, for example, a keyboard, a computer disk drive, or a computer tape drive.

The source text input device 10 may further comprise means for transforming the input series of source words into a series of transformed source words. Each transformed source word may represent a set of related input source words. For example, each of the input source words in the set {etre, etant, ete, suis, es, est, sommes, etes, sont, fus, fumes, serai, serons, sois, soit, soyons, soyez, soient} are forms of the verb infinitive "etre", and may be transformed to "etre" with a tag representing the tense of the original input source word.

The means for transforming the input series of source words into a series of transformed source words may comprise the source transducers described in Sections 3, 4 and 11 of Brown et al, above, which are incorporated herein by reference and set forth in full herein. In essence, these transducers check and correct the spelling of the input source words, check and correct the case of the input source words, detect titles of documents in the input series of source words, and detect names in the input series of source words. The transducers also tag each input source word with the most likely part of speech of the source word, and flag unknown source words (which are not contained in a stored vocabulary of source words). The means for transforming the input series of source words also collapses multi-word units of input source words into single transformed source words, and splits compound input source words into two or more transformed source words. The means for transforming the input series of source words into a series of transformed source words further performs linguistic or morphological transformations of different forms of a word into a single basic form. Finally, the means for transforming the input series of source words may also estimate the sense of each input source word, and assign that sense to the transformed source word.

Transducers described in Brown et al are now described in detail.

SOURCE TRANSDUCERS (Brown et al. Section 3)

In this section, some embodiments of the the source-transducer 701 will be explained. The role of this transducer is to produce one or more intermediate source-structure representations from a portion of text in the source language.

Overview (Brown et al. Section 3.1)

An embodiment of the source-transducer 701 is shown in FIG. 5. In this embodiment, the transducer takes as input a sentence in the source language and produces a single intermediate source-structure consisting of a sequence of linguistic morphs. This embodiment of the source-transducer 701 comprises transducers that:

tokenize raw text 1101;

determine words from tokens 1102;

annotate words with parts of speech 1103;

perform syntactic transformations 1104;

perform morphological transformations 1105;

annotate morphs with sense labels 1106.

It should be understood that FIG. 5 represents only one embodiment of the source-transducer 701. Many variations are possible. For example, the transducers 1101, 1102, 1103, 1104, 1105, 1106 may be augmented and/or replaced by other transducers. Other embodiments of the source-transducer 701 may include a transducer that groups words into compound words or identifies idioms. In other embodiments, rather than a single-intermediate source-structure being produced for each source sentence, a set of several intermediate source-structures together with probabilities or scores may be produced. In such embodiments the transducers depicted in FIG. 5 can be replaced by transducers which produce at each stage intermediate structures with probabilities or scores. In addition, the intermediate source-structures produced may be different. For example, the intermediate structures may be entire parse trees, or case frames for the sentence, rather than a sequence of morphological units. In these cases, there may be more than one intermediate source-structure for each sentence with scores, or there may be only a single intermediate source-structure.

Components (Brown et al. Section 3.2)

Referring still to FIG. 5, the transducers comprising the source-transducer 701 will be explained. For concreteness, these transducers will be discussed in cases in which the source language is either English or French.

Tokenizing Transducers (Brown et al. Section 3.2.1)

The purpose of the first transducer 1101, which tokenizes raw text, is well illustrated by the following Socratic dialogue.

How do you find words in text?

Words occur between spaces.

What about "however,"? Is that one word or two?

Oh well, you have to separate out the commas.

Periods too?

Of course.

What about "Mr."?

Certain abbreviations have to be handled separately.

How about "shouldn't"? One word or two?

One.

So "shouldn't" is different from "should not"?

Yes.

And "Gauss-Bonnet", as in the "Gauss-Bonnet Theorem"?

Two names, two words.

So if you split words at hyphens, what do you do with "vis-a-vis"? One word, and don't ask me why.

How about "stingray"?

One word, of course.

And "manta ray"?

One word: it's just like stingray.

But there's a space.

Too bad.

How about "inasmuch as"?

Two.

Are you sure?

No.

This dialogue illustrates that there is no canonical way of breaking a sequence of characters into tokens. The purpose of the transducer 1101, that tokenizes raw text, is to make some choice.

In an embodiment in which the source-language is English, this tokenizing transducer 1101 uses a table of a few thousand special character sequences which are parsed as single tokens and otherwise treat spaces and hyphens as word separators. Punctuation marks and digits are separated off and treated as individual words. For example, 87, is tokenized as the three words 8 7 and ,.

In another embodiment in which the source-language is French, the tokenizing transducer 1101 operates in a similar way. In addition to space and hyphen, the symbol -t- is treated as a separator when tokenizing French text.

Token-Word Transducers (Brown et al. Section 3.2.2)

The next transducer 1102 determines a sequence of words from a sequence of token spellings. In some embodiments, this transducer comprises two transducers as depicted in FIG. 6. The first of these transducers 1201 determines an underlying case pattern for each token in the sequence. The second of these transducers 1202 performs a number of specialized transformations. These embodiments can easily be modified or enhanced by a person skilled in the art.

True-Case Transducers (Brown et al. Section 3.2.3)

The purpose of a true-case transducer 1201 is made apparent by another Socratic dialogue:

When do two sequences of characters represent the same word?

When they are the same sequences.

So, "the" and "The" are different words?

Don't be ridiculous. You have to ignore differences in case.

So "Bill" and "bill" are the same word?

No, "Bill" is a name and "bill" is something you pay. With proper names the case matters.

What about the two "May"'s in "May I pay in May?"

The first one is not a proper name. It is capitalized because it is the first word in the sentence.

Then, how do you know when to ignore case and when not to?

If you are human, you just know.

Computers don't know when case matters and when it doesn't. Instead, this determination can be performed by a true-case transducer 1201. The input to this transducer is a sequence of tokens labeled by a case pattern that specifies the case of each letter of the token as it appears in printed text. These case patterns are corrupted versions of true-case patterns that specify what the casing would be in the absence of typographical errors and arbitrary conventions (e.g., capitalization at the beginning of sentences). The task of the true-case transducer is to uncover the true-case patterns.

In some embodiments of the true-case transducer the case and true-case patterns are restricted to eight possibilities. ##EQU6##

Here U denotes an upper case letter, L a lower case letter, U.sup.+ a sequence of one or more upper case letters, and L.sup.+ a sequence of one or more lower case letters. In these embodiments, the true-case transducers can determine a true-case of an occurrence of a token in text using a simple method comprising the following steps:

1. Decide whether the token is part of a name. If so, set the true-case equal to the most probable true-case beginning with a U for that token.

2. If the token is not part of a name, then check if the token is a member of a set of tokens which have only one true-case. If so, set the true-case appropriately.

3. If the true-case has not been determined by steps 1 or 2, and the token is the first token in a sentence then set the true-case equal to the most probable true-case for that token.

4. Otherwise, set the true-case equal to the case for that token.

In an embodiment of the true-case transducer used for both English and French, names are recognized with a simple finite-state machine. This machine employs a list of 12,937 distinct common last names and 3,717 distinct common first names constructed from a list of 125 million full names obtained from the IBM online phone directory and a list of names purchased from a marketing corporation. It also uses lists of common precursors to names (such as Mr., Mrs., Dr., Mlle., etc.) and common followers to names (such as Jr., Sr., III, etc).

The set of tokens with only one true-case consists of all tokens with a case-pattern entropy of less than 1.25 bits, together with 9,506 Number of records in (o+p)8260.lcwtab English and 3,794 records in r8260.coerce(a+b+c) French tokens selected by hand from the tokens of a large bilingual corpus. The most probable case pattern for each English token was determined by counting token-case cooccurences in a 67 million word English corpus, and for each French token by counting cooccurences in a 72 million word French corpus. (In this counting occurrences at the beginnings of sentences are ignored).

Specialized Transformation Transducers (Brown et al. Section 3.2.4)

Referring still to FIG. 6, the transducer 1202 performs a few specialized transformations designed to systematize the tokenizing process. These transformations can include, but are not limited to, the correction of typographical errors, the expansion of contractions, the systematic treatment of possessive, etc. In one embodiment of this transducer for English, contractions such as don't are expanded to do not, and possessives such as John's and nurses' are expanded to John 's and nurses '. In one embodiment of this transducer for French, sequences such as s'il, qu'avez, and j'adore are converted to pairs of tokens such as si il, que avez, and je adore. In addition, a few thousand sequences such as afin de are contracted to strings such as afin.sub.-- de. These sequences are obtained from a list compiled by a native French speaker who felt that such sequences should be treated as individual words. Also the four strings au, aux, du, and des are expanded to a le, a les, de le, and de les respectively.

Part-of-Speech Labeling Transducers (Brown et al. Section 3.2.5)

Referring again to FIG. 5, the transducer 1103 annotates words with part-of-speech labels. These labels are used by the subsequent transducers depicted the figure. In some embodiments of transducer 1103, part-of-speech labels are assigned to a word sequence using a technique based on hidden Markov models. A word sequence is assigned the most probable part-of-speech sequence according to a statistical model, the parameters of which are estimated from large annotated texts and other even larger un-annotated texts. The technique is fully explained in article by Bernard Merialdo entitled `Tagging text with a Probabilistic Model` in the Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, May 14-17, 1991. This article is incorporated by reference herein.

In an embodiment of the transducer 1103 for tagging of English, a tag set consisting of 163 parts of speech is used. A rough categorization of these parts of speech is given in Table 11.

In an embodiment of this transducer 1103 for the tagging of French, a tag set consisting of 157 parts of speech is used. A rough categorization of these parts of speech is given in Table 12.

                  TABLE 11
    ______________________________________
    Parts of Speech for English
    ______________________________________
    29                 Nouns
    27                 Verbs
    20                 Pronouns
    17                 Determiners
    16                 Adverbs
    12                 Punctuation
    10                 Conjunctions
     8                 Adjectives
     4                 Prepositions
    20                 Other
    ______________________________________


TABLE 12 ______________________________________ Parts-of-Speech for French ______________________________________ 105 Pronouns 26 Verbs 18 Auxiliaries 12 Determiners 7 Nouns 4 Adjectives 4 Adverbs 4 Conjunctions 2 Prepositions 2 Punctuation 12 Other ______________________________________


Syntactic Transformation Transducers (Brown et al. Section 3.2.6)

Referring still to FIG. 5, the transducer 1104, which performs syntactic transformations, will be described.

One function of this transducer is to simplify the subsequent morphological transformations 1105. For example, in morphological transformations, verbs may be analyzed into a morphological unit designating accidence followed by another unit designating the root of the verb. Thus in French, the verb ira might be replaced by 3s.sub.-- future.sub.-- indicative aller, indicating that ira is the third person singular of the future tense of aller. The same kind of transformation can be performed in English by replacing the sequence will go by future.sub.-- indicative to.sub.-- go. Unfortunately, often the two words in such a sequence are separated by intervening words as in the sentences:

will he go play in the traffic?

he will not go to church.

Similarly, in French the third person of the English verb went is expressed by the two words est alle, and these two words can be separated by intervening words as in the sentences:

est-t-il alle?

Il n'est pas alle.

It is possible to analyze such verbs morphologically with simple string replacement rules if various syntactic transformations that move away intervening words are performed first.

A second function of the syntactic transducer 1104 is to make the task presented to the statistical models which generate intermediate target-structures for an intermediate source-structure as easy as possible. This is done by performing transformations that make the forms of these structures more similar. For example, suppose the source language is English and the target language is French. English adjectives typically precede the nouns they modify whereas French adjectives typically follow them. To remove this difference, the syntactic transducer 1104 includes a transducer which moves French words labeled as adjectives to positions proceeding the nouns which they modify.

These transducers only deal with the most rudimentary linguistic phenomena. Inadequacies and systematic problems with the transformations are overcome by the statistical models used later in the target hypothesis-generating module 702 of the invention. It should be understood that in other embodiments of the invention more sophisticated schemes for syntactic transformations with different functions can be used.

The syntactic transformations performed by the transducer 1104 are performed in a series of steps. A sequence of words which has been annotated with parts of speech is fed into the first transducer. This transducer outputs a sequence of words and a sequence of parts of speech which together serve as input to the second transducer in the series, and so forth. The word and part-of-speech sequences produced by the final transducer are the input to the morphological analysis transducer.

Syntactic Transducers for English (Brown et al. Section 3.2.7)

FIG. 7, depicts an embodiment of a syntactic transducer 1104 for English. Although in much of this document, examples are described in which the source language is French and the target language is English, here for reasons of exposition, an example of a source transducer for a source language of English is provided. In the next subsection, another example in which the source language is French is provided. Those with a basic knowledge of linguistics for other languages will be able to construct similar syntactic transducer for those languages.

The syntactic transducer in FIG. 7 is comprised of three transducer that

perform question inversion 1301;

perform do-not coalescence 1302;

perform adverb movement 1303.

To understand the function of the transducer 1301 that performs question inversion, note that in English questions the first auxiliary verb is often separated by a noun phrase from the root of the verb as in the sentences:

does that elephant eat?

which car is he driving?

The transducer 1301 inverts the auxiliary verb with the subject of the sentence, and then converts the question mark to a special QINV marker to signal that this inversion has occurred. For example, the two sentences above are converted by this transducer to:

that elephant eats QINV

which car he is driving QINV

This transducer also removes supporting do's as illustrated in the first sentence.

To understand the function of the transducer 1302 that performs do-not coalescence, note that in English, negation requires the presence of an auxiliary verb. When one doesn't exist, an inflection of to do is used. The transducer 1302 coalesces the form of to do with not into the string do.sub.-- not. For example, ##EQU7##

The part of speech assigned to the main verb, like above, is modified by this transducer to record the tense and person of to do in the input. When adverbs intervene in emphatic sentences the do.sub.-- not is positioned after the intervening adverbs: ##EQU8##

To understand the function of the final transducer 1303 that performs adverb movement, note that in English, adverbs often intervene between a verb and its auxiliaries. The transducer 1303 moves adverbs which intervene between a verb and its auxiliaries to positions following the verb. The transducer appends a number onto the adverbs it moves to record the positions from which they are moved. For example, ##EQU9##

An M2 is appended to both probably and to not to indicate that they originally preceded the second word in the verbal sequence will be balkanized. Similarly, an M3 is appended to completely to indicate that it preceded the third word in the verbal sequence.

The transducer 1303 also moves adverbs that precede verbal sequences to positions following those sequences: ##EQU10##

This is done in order to place verbs close to their subjects.

Syntactic Transducers for French (Brown et al. Section 3.2.8)

Referring now to FIG. 8, an embodiment of the syntactic transducer 1104 for French is described. This embodiment comprises four transducers that

perform question inversion 1401;

perform discontinuous negative coalescence 1402;

perform pronoun movement 1403;

perform adverb and adjective movement 1404.

The question inversion transducer 1401 undoes French question inversion much the same way that the transducer 1301 undoes English question inversion: ##EQU11##

This transducer 1401 also modifies French est-ce que questions: ##EQU12##

To understand the function of the transducer 1402, which performs negative coalescence, note that propositions in French are typically negated or restricted by a pair of words, ne and some other word, that surround the first word in a verbal sequence. The transducer 1402 moves the ne next to its mate, and then coalesces the two into a new word: ##EQU13##

To understand the function of the transducer 1403, which performs pronoun movement, note that in French, direct-object, indirect-object and reflexive pronouns usually precede verbal sequences. The transducer 1403 moves these pronouns to positions following these sequences. It also maps these pronouns to new words that reflect their roles as direct-object or indirect-object, or reflexive pronouns. So, for example, in the following sentence le is converted to le.sub.-- DPRO because it functions as a direct object and vous to vous.sub.-- IPRO because it functions as an indirect object: ##EQU14##

In the next sentence, vous is tagged as a reflexive pronoun and therefore converted to vous.sub.-- RPRO. ##EQU15##

The allative pronominal clitic y and the ablative pronominal clitic en are mapped to the two-word tuples a y.sub.-- PRO and de en.sub.-- PRO: ##EQU16##

The final transducer 1404 moves adverbs to positions following the verbal sequences in which they occur. It also moves adjectives to positions preceding the nouns they modify. This is a useful step in embodiments of the present invention that translate from French to English, since adjectives typically precede the noun in English.

Morphological Transducers (Brown et al. Section 3.2.9)

Referring again to FIG. 5, a transducer 1105, which performs morphological transformations, will be described. One purpose of this transducer is to make manifest in the intermediate source-structure representation the fraternity of different forms of a word. This is useful because it allows for more accurate statistical models of the translation process. For example, a system that translates from French to English but does not use a morphological transducer can not benefit from the fact that sentences in which parle is translated as speaks provide evidence that parle should be translated as spoken. As a result, parameter estimates for rare words are inaccurate even when estimated from a very large training sample. For example, even in a sample from the Canadian Parlement of nearly 30 million words of French text, only 24 of the 35 different spellings of single-word inflections of the verb parler actually occurred.

A morphological transducer 1104 is designed to ameliorate such problems. The output of this transducer is a sequence of lexical morphemes. These lexical morphemes will sometimes be referred to in this application as morphological units or simply morphs. In an embodiment of transducer 1104 used for English, inflection morphological transformations are performed that make evident common origins of different conjugations of the same verb; the singular and plural forms of the same noun; and the comparative and superlative forms of adjectives and adverbs. In an embodiment of transducer 1104 used for French, morphological inflectional transformations are performed that make manifest the relationship between conjugations of the same verb; and forms of the same noun or adjective differing in gender and number are performed. These morphological transformations are reflected in the sequence of lexical morphemes produced. The examples below illustrate the level of detail in these embodiments of a morphological transducer 1104: ##EQU17##

Sense-Labelling Transducers (Brown et al. Section 3.2.10)

Referring again to FIG. 5, the transducer 1106, which annotates a lexical morph sequence produced by the transducer 1105 with part-of-speech labels, will be explained. Much of the allure of the statistical approach to transfer in machine translation is the ability of that approach to formally cope with the problem of lexical ambiguity. Unfortunately, statistical methods are only able to mount a successful attack on this problem when the key to disambiguating the translation of a word falls within the local purview of the models used in transfer.

Consider, for example, the French word prendre. Although prendre is most commonly translated as to take, it has a number of other less common translations. A trigram model of English can be used to translate Je vais prendre la decision as I will make the decision because the trigram make the decision is much more common than the trigram take the decision. However, a trigram model will not be of much use in translating Je vais prendre ma propre decision as I will make may own decision because in this case take and decision no longer fall within a single trigram.

In the paper, "Word Sense Disambiguation using Statistical Methods" in the proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, published in June of 1991 by the Association of Computational Linguistics and incorporated by reference herein, a description is provided of a method of asking a question about the context in which a word appears to assign that word a sense. The question is constructed to have high mutual information with the translation of that word in its context. By modifying the lexical entries that appear in a sequence of morphemes to reflect the senses assigned to these entries, informative global information can be encoded locally and thereby made available to the statistical models used in transfer.

Although the method described in the aforementioned paper assigns senses to words, the same method applies equally well to the problem of assigning senses to morphemes, and is used here in that fashion. This transducer 1106, for example maps prendre to prendre.sub.-- 1 in the sentence

Je vais prendre ma propre voiture.

but to prendre.sub.-- 2 in the sentence

Je vais prendre ma propre decision.

It should be understood that other embodiments of the sense-labelling transducer are possible. For example, the sense-labelling can be performed by asking not just a single question about the context but a sequence of questions arranged in a decision tree.

Source-Transducers with Constraints (Brown et al. Section 3.3)

In some embodiments, such as that depicted in FIG. 22, a source-structure transducer, such as that labelled 901, accepts a set of constraints that restricts its transformations source text to an intermediate target structure in source text.

Such constraints include, but are not limited to,

requiring that a particular phrase be translated as a certain linguistic component of a sentence, such a noun-phrase;

requiring that a source word be labelled as a certain part-of-speech such as a verb or determiner;

requiring that a source word be morphologically analyzed a certain way;

requiring that a source word be annotated with a particular sense label;

in embodiments in which the intermediate structure encodes parse-tree or case-frame information, requiring a certain parse or case-frame structure for a sentence; morphologically analyzed in a particular way, or be annotated with a particular sense-label.

A source-transducer accepting such constraints in similar to source transducers as described in this section. Based on the description already given, such transducers can be constructed by a person with a computer science background and skilled in the art.

FINITE-STATE TRANSDUCERS (Brown et al. Section 4)

This section provides a description of an embodiment of a mechanism by which the syntactic transductions in step 1104 and the morphological transductions in step 1105 are performed. The mechanism is described in the context of a particular example depicted in FIG. 9. One with a background in computer science and skilled in the art of producing finite-state transducers can understand from this example how to construct syntactic and morphological transducers of the type described above.

The example transducer inverts questions involving do, does, and did. After steps 1101, 1102, and 1103, the source text Why don't you ever succeed? is transduced into parallel word and part-of-speech sequences 1501:

    ______________________________________
    why     do       not     you   ever  succeed ?
    RRQ     VDO      XX      PPY   RR    VVO     ?
    ______________________________________


Here, RRQ and RR are adverb tags, VD0 and VV0 are verb tags, XX is a special tag for the word not, PPY is a pronoun tag, and ? is a special tag for a question mark.

The example transducer converts these two parallel sequences to the parallel word and part-of-speech sequences 1502:

    ______________________________________
    why   you     succeed  do.sub.-- not.sub.-- MO
                                    ever.sub.-- Mi
                                            QINV
    RRQ   PPY     VVO      XX       RR      QINV
    ______________________________________


Here, QINV is a marker which records the fact that the original input sentence was question inverted.

A mechanism by which a transducer achieves this transformation is depicted in FIG. 10, and is comprised of four components:

a finite-state pattern matcher 1601;

an action processing module 1603;

a table of patterns 1602;

a table of action specifications 1503.

The transducer operates in the following way:

1. One or more parallel input sequences 1605 are captured by the finite-state pattern-matcher 1601;

2. The finite-state pattern-matcher compare the input sequences against a table of patterns 1602 of input sequences stored in memory;

3. A particular pattern is identified, and an associated action-code 1606 is transmitted to the action-processing module 1603;

4. The action-processing module obtains a specification of the transformation associated to this action code from a table of actions 1503 stored in memory;

5. The action-processing module applies the transformation to the parallel input streams to produce one or more parallel output sequences 1604.

The parallel input streams captured by the finite-state pattern matcher 1601 are arranged in a sequence of attribute tuples. An example of such a sequence is the input sequence 1501 depicted in FIG. 9. This sequence consists of a sequence of positions together with a set of one or more attributes which take values at the positions. A few examples of such attributes are the token attribute, the word attribute, the case-label attribute, the part-of-speech attribute, the sense-label attribute. The array of attributes for a given position will be called an attribute tuple.

For example, the input attribute tuple sequence 1501 in FIG. 9 is seven positions long and is made up of two dimensional attribute tuples. The first component of an attribute tuple at a given position refers to the word attribute. This attribute specifies the spellings of the words at given positions. For example, the first word in the sequence 1501 is why. The second component of an attribute tuple at a given position for this input sequence refers to a part-of-speech tag for that position. For example, the part of speech at the first position is RRQ. The attribute tuple at position 1 is thus the ordered pair why, RRQ.

The parallel output streams produced by the action processing module 1603 are also arranged as a sequence of attribute tuples. The number of positions in an output sequence may be different from the number of positions in an input sequence.

For example, the output sequence 1502 in FIG. 9, consists of six positions. Associated with each position is a two-dimensional attribute tuple, the first coordinate of which is a word attribute and the second coordinate of which is a part-of-speech attribute.

An example of a table of patterns 1602 is shown in FIG. 11. This table is logically divided into a number of parts or blocks.

Pattern-Action Blocks. The basic definitions of the matches to be made and the actions to be taken are contained in pattern-action blocks. A pattern-action block comprises of a list of patterns together with the name of actions to be invoked when patterns in an input attribute-tuple sequence 1605 are matched.

Auxiliary Pattern Blocks. Patterns that can be used as sub-patterns in the patterns of pattern-action blocks are defined in Auxiliary Pattern blocks. Such blocks contain lists of labelled patterns of of attributes tuples. These labelled patterns do not have associated actions, but can be referenced by their name in the definitions of other patterns.

In FIG. 11 there is one Auxiliary Pattern block. This block defines four auxiliary patterns. The first of these has a name ADVERB and matches single tuple adverb-type constructions. The second has a name of BARE.sub.-- NP and matches certain noun-phrase-type constructions. Notice that this auxiliary pattern makes use of the ADVERB pattern in its definition. The third and fourth auxiliary patterns match other types of noun phrases.

Set Blocks. Primary and auxiliary patterns allow for sets of attributes. In FIG. 11, for example, there is a set called DO.sub.-- SET, of various forms of to.sub.-- do, and another set PROPER.sub.-- NOUN.sub.-- TAG of proper-noun tags.

Patterns are defined in terms of regular expressions of attribute tuples. Any pattern of attribute tuples that can be recognized by a deterministic finite-state automata can be specified by a regular expression. The language of regular expressions and methods for constructing finite-state automata are well known to those skilled in computer science. A method for constructing a finite-state pattern matcher from a set of regular expressions is described in the article "LEX--A Lexical Analyzer Generator," written by Michael E. Lesk, and appearing in the Bell Systems Technical Journal, Computer Science Technical Report Number 39, published in October of 1975.

Regular expressions accepted by the pattern matcher 1601 are described below.

Regular Expressions of Attribute Tuples: A regular expression of attribute tuples is a sequence whose elements are either

1. an attribute tuple;

2. the name of an auxiliary regular expression; or

3. a register name.

These elements can be combined using one of the logical operations:

    ______________________________________
    Operator  Meaning     Usage    Matches
    ______________________________________
    .         concatenation
                          A.B      A followed by B
    .vertline.
              union (i.e. or)
                          A.vertline.B
                                   A or B
    *         0 or more   A*       0 or more A's
    ?         0 or 1      A?       0 or 1 A's
    +         1 or more   A+       1 or more A's
    ______________________________________


Here A and B denote other regular expressions.

Examples of these logical operations are:

    ______________________________________
    Expression    Matches
    ______________________________________
    A?.B.C        0 or 1 A's followed by B then by C
    (A*).vertline.(B+)
                  0 or more A's or 1 or more B's
    (A.vertline.B).C
                  A or B, followed by C
    ______________________________________


Attribute Tuples: The most common type of element in a regular expression is an attribute tuple. An attribute tuple is a vector whose components are either

1. an attribute (as identified by its spelling);

2. a name of a set of attributes;

3. the wild card attribute.

These elements are combined using the following operators:

    ______________________________________
    Operator   Meaning             Usage
    ______________________________________
    ,          Delimiter between coordinates
                                   a,b
               of an attribute tuple
               Negation             a
    #          Wild Card           #
    ______________________________________


(Here a and b denote attribute spellings or attribute set names).

The meanings of these operators are best illustrated by example. Let a, b, and c denote either attribute spellings or set names. Assume the dimension of the attribute tuples is 3. Then:

    ______________________________________
    Attribute Tuple
              Matches
    ______________________________________
    a,b,c     First attribute matches a, second match b,
              third matches c
    ,b,c      First attribute elided (matches anything),
              Second attribute matches b, third matches c
    ,b,       First and third attribute elided (match anything)
              Second attribute matches b
    a         Second and third attributes elided (Match
              anything) First matches a
    #,b,      First attribute wild-card (i.e matches anything)
              Second attribute matches b. Third attribute
              elided
    a,  b,  c Second attribute matches anything EXCEPT b.
              Third matches anything EXCEPT c.
    ______________________________________


Auxiliary Regular Expressions: A second type of element in a regular expression is an auxiliary regular expression. An auxiliary regular expression is a labelled regular expression which is used as a component of a larger regular expression.

Logically, a regular expression involving auxiliary regular expressions is equivalent to the regular expression obtained by resolving the reference to the auxiliary pattern. For example, suppose an auxiliary regular expression named D has been defined by:

D=A.B+.A*

where A,B denote attribute tuples (or other auxiliary patterns). Then:

    ______________________________________
    Expression   is equivalent to
    ______________________________________
    C.D          C.A.B+.A*
    D+.C.D       (A.B+.A*)+.A.B+.A*.C.A*.B+.A*
    ______________________________________


Registers: Just knowing that a regular expression matches an input attribute tuple sequence usually does not provide enough information for the construction of an appropriate output attribute tuple sequence. Data is usually also required about the attribute tuples matched by different elements of the regular expression. In ordinary LEX, to extract this type of information often requires the matched input sequence to be parsed again. To avoid this cumbersome approach, the pattern-matcher 1601 makes details about the positions in the input stream of the matched elements of the regular expression more directly available. From these positions, the identities of the attribute tuples can then be determined.

Positional information is made available through the use of registers. A register in a regular expression does not match any input. Rather,

1. After a match, a register is set equal to the position in the input sequence of the next tuple in the input sequence that is matched by an element of the regular expression to the right of the register.

2. If no further part of the regular expression to the right of the register matches, then the register is set equal to zero.

The operation of registers is best illustrated by some examples. These examples use registers 1! and 2!:

    ______________________________________
    Expression
             Contents of Registers after match
    ______________________________________
    A. 1!.B.C
             Reg 1:  First position of B match
    A. 2!.(C.vertline.D)
             Reg 2:  First position of either C or D match
    A. 1!.B*. 2!.C
             Reg 1:  If B matches:
                                 First position of B match
                     Otherwise:  First position of C match
           Reg 2:
                 First position of C match
    A. 1!.B*.C*
             Reg 1:  If B matches:
                                 First position of B match
                     If C matches:
                                 First position of C match
                     Otherwise:  0
             Reg 2:  If C matches:
                                 First position of C match
                     Otherwise:  0
    ______________________________________


A pattern-action block defines a pattern matcher. When an input attribute-tuple sequence is presented to the finite-state pattern matcher a current input position counter is initialized to 1 denoting that the current input position is the first position of the sequence. A match at the current input position is attempted for each pattern. If no pattern matches, an error occurs. If more than one pattern matches, the match of the longest length is selected. If several patterns match of the same longest length, the one appearing first in the definition of the pattern-action block is selected. The action code associated with that pattern is then transmitted to the action processing module 1603.

Transformations by the action processing module are defined in a table of actions 1503 which is stored in memory. The actions can be specified in specified in any one of a number of programming languages such as C, PASCAL, FORTRAN, or the like.

In the question-inversion example, the action specified in the pseudo-code in FIG. 18 is invoked when the pattern defined by the regular expression in lines 3-4 is matched. This action inverts the order of the words in certain questions involving forms of do. An instance of this action is shown in FIG. 9. In the pseudo-code of FIG. 12 for this action, the symbol @reg(i) denotes the contents of register i. In line 6 of this pseudo-code, the output attribute tuple sequence is set to null.

A question matched by the regular expression in lines 3-4 may or may not begin with a (so-called) wh- word in the set WH.sub.-- NP. If it does match, the appropriate action is to append the input tuple in the first position to the output sequence. This is done in lines 8-9.

After the wh-word, the next words of the output sequence should be the subject noun phrase of the input sequence. This is made so in line 11-12 that appends all tuples matching the regular expression SUBJECT.sub.-- NP to the output sequence.

For negative questions involving forms of do, the part-of-speech tag of the output verb and of the output adverbs are the same as those of the input verb and adverbs. Thus the entire input tuple sequences corresponding to these words can be appended to the output. This is done in lines 15-18.

For positive questions the tag attribute of the output verb may be different than that of the input verb. This is handled in lines 25-37. The input word attribute for the verb is appended to the output word attribute in lines 26 and 31 and 35. The output tag attribute is selected based on the form of do in the input sequence. Explicit tag values are appended to the output sequence in lines 32 and 37.

The remaining input words and tags other than the question mark are written to the output sequence in lines 43-44. The input sequence is completed in line 46 by the marker QINV signalling question inversion, together with the appropriate tag.

SENSE DISAMBIGUATION (Brown et al. Section 11)

Introduction (Brown et al. Section 11.1)

An alluring aspect of the statistical approach to machine translation is the systematic framework it provides for attacking the problem of lexical disambiguation. For example, an embodiment of the machine translation system depicted in FIG. 50 translates the French sentence Je vais prendre la decision as I will make the decision, correctly interpreting prendre as make. Its statistical translation model, which supplies English translations of French words, prefers the more common translation take, but its trigram language model recognizes that the three-word sequence make the decision is much more probable than take the decision.

This system is not always so successful. It incorrectly renders Je vais prendre ma propre decision as I will take may own decision. Its language model does not realize that take my own decision is improbable because take and decision no longer fall within a single trigram.

Errors such as this are common because the statistical models of this system only capture local phenomena; if the context necessary to determine a translation falls outside the scope of these models, a word is likely to be translated incorrectly. However, if the relevant context is encoded locally, a word can be translated correctly.

As has been noted in Section 3, such encoding can be performed in the source-transduction phase 701 by a sense-labeling transducer.

In this section, the operation and construction of such a sense-labeling transducer is described. The goal of this transducer is to perform cross-lingual word-sense labeling. That is, this transducer labels words of a sentence in a source language so as to elucidate their translations into a target language. Such a transducer can also be used to label the words of an target sentence so as to elucidate their translations into a source language.

The design of this transducer is motivated by some examples. In some contexts the French verb prendre translates into English as to take, but in other contexts it translates as to make. A sense disambiguation transformation, by examining the contexts, might label occurrences of prendre that likely mean to take with one label, and other occurrences of prendre with another label. Then the uncertainty in the translation of prendre given the label would be less than the uncertainty in the translation of prendre without the label. Although the label does not provide any information that is not already present in the context, it encodes this information locally. Thus a local statistical model for the transfer of labeled sentences should be more accurate than one for the transfer of unlabeled ones.

While the translation of a word depends on many words in its context, it is often possible to obtain information by looking at only a single word. For example, in the sentence Je vais prendre ma propre decision (I will make may own decision), the verb prendre should be translated as make because its object is decision. If decision is replaced by voiture then prendre should be translated as take: Je vais prendre ma propre voiture (I will take my own car). Thus the uncertainty in the translation of prendre is reduced by asking a question about its object, which is often the first noun to its right. A sense can be assigned to prendre based upon the answer to this question.

As another example, in Il doute que les notres gagnent (He doubts that we will win), the word il should be translated as he. On the other hand, if doute is replaced by faut then il should be translated as it: Il faut que les notres gagnent (It is necessary that we win). Here, a sense label can be assigned to il by asking about the identity of the first verb to its right.

These examples motivate a sense-labeling scheme in which the label of a word is determined by a question about an informant word in its context. In the first example, the informant of prendre is the first noun to the right; in the second example, the informant of il is the first verb to the right. The first example is depicted in FIG. 14. The two sequence of this example are labeled 4001 and 4002 in the figure.

If more than two senses are desired for a word, then questions with more than two answers can be considered.

Design of a Sense-Labeling Transducer (Brown et al. Section 11)

FIG. 15 depicts an embodiment of a sense-labeling transducer based on this scheme. For expositional purposes, this embodiment will be discussed in the context of labeling words in a source language word sequence. It should be understood that in other embodiments, a sense-labeling transducer can accept as input more involved source-structure representations, including but not limited to, lexical morpheme sequences. In these embodiments, the constituents of a representation of a source word sequence are annotated by the sense-labeling transducer with labels that elucidate their translation into analogous constituents into a similar representation of a target word sequence. It should also be understood that in still other embodiments, a sense-labeling transducer annotates, target-structure representations, (not source-structure representations) with sense labels.

The operation of the embodiment of the sense-labeling transducer depicted in FIG. 15 comprises the steps of:

4101. Capturing an input sequence consisting of a sequence of words in a source language;

4102. For each word in the input sequence performing the Steps 4107, 4108, 4109, 4104 until no more words are available in Step 4105;

4107. For the input word being considered, finding a best informant site such as the noun to the right, the verb to the left, etc. A best informant for a word is obtained using a table 4103 stored in memory of informants and questions about the informants for every word in the source language vocabulary;

4108. Finding the word at the best informant site in the input word sequence;

4109. Obtaining the class of the informant word as given by the answer to a question about the informant word;

4104. Labeling the original input word of the input sequence with the class of the informant word.

For the example depicted in FIG. 14, the informant site determined by Step 4107 is the noun to the right. For the first word sequence 4001 of this example, the informant word determined by Step 4108 is decision; for the second word sequence 4109, the informant word is voiture. In this example, the class of decision determined in Step 4109 is different than the class of voiture. Thus the label attached to prendre by Step 4104 is different for these two contexts of prendre.

Constructing a Table of Informants and Questions (Brown et al. Section 11.3)

An important component of the sense-labeler depicted in FIG. 15 is a table 4103 of informants and questions for every word in a source language vocabulary.

FIG. 16 depicts a method of constructing such a table. This method comprises the steps of:

4203. Performing the Steps 4201 and 4202 for each word in a source language vocabulary.

4201. For the word being considered, finding a good question for each of a plurality of informant sites. These informant sites are obtained from a table 4207 stored in memory of possible informant sites. Possible sites include but are not limited to, the nouns to the right and left, the verbs to the right and left, the words to the right and left, the words two positions to the right or left, etc. A method of finding a good question is described below. This method makes use of a table 4205 stored in memory probabilities derived from Viterbi alignments. These probabilities are also discussed below.

4202. Storing in a table 4208 of informants and questions, the informant site and the good question.

A method for carrying out the Step 4201 for finding a good question for each informant site of a vocabulary word is depicted in FIG. 17. This method comprises the steps of:

4301. Performing the Steps 4302 and 4207 for each possible informant site. Possible informant sites are obtained from a table 4304 of such sites.

4302. For the informant site being considered, finding a question about informant words at this site that provides a lot of information about translations of the vocabulary word into the target language.

4207. Storing the vocabulary word, the informant site, and the good question in a table 4103.

Mathematics of Constructing Questions (Brown et al. Section 11.4)

A method for carrying out the Step 4302 of finding a question about an informant is depicted in FIG. 18. In this subsection, the some preliminaries for describing this method are given. The notation used here is the same as that used in Sections 8-10.

Statistical Translation with Transductions (Brown et al. Section 11.4.1)

Recall the setup depicted in FIG. 19. A system shown there employs

1. A source transducer 201 which encodes a source sentence f into an intermediate structure f'.

2. A statistical translation component 202 which translates f' into a corresponding intermediate target structure e'. This component incorporates a language model, a translation model, and a decoder.

3. A target transducer 203 which reconstructs a target sentence e from e'.

For statistical modeling, the target-transduction transformation 203 e'.fwdarw.e is sometimes required to be invertible. Then e' can be constructed from e and no information is lost in the transformation.

The purpose of source and target transduction is to facilitate the task of the statistical translation. This will be accomplished if the probability distribution Pr(f',e') is easier to model then the original distribution Pr(f,e). In practice this means that e' and f' should encode global linguistic facts about e and f in a local form.

A useful gauge of the success that a statistical model enjoys in modeling translation from sequences of source words represented by a random variable F, to sequences of target words represented by a random variable E, is the cross entropy.sup.3

.sup.3 In this equation and in the remainder of this section, the convention of using uppercase letters (e.g. E) for random variables and lower case letters (e.g. e) for the values of random variables continues to be used. ##EQU18##

The cross entropy measures the average uncertainty that the model has about the target language translation e of a source language sequence f. Here P(e.vertline.f) is the probability according to the model that e is a translation of f and the sum runs over a collection of all S pairs of sentences in a large corpus comprised of pairs of sentences with each pair consisting of a source and target sentence which are translations of one another (See Sections 8-10).

A better model has less uncertainty and thus a lower cross entropy. The utility of source and target transductions can be measured in terms of this cross entropy. Thus transformations f.fwdarw.f' and e'.fwdarw.e are useful if models P'(f'.vertline.e') and P'(e') can be constructed such that H(E'.vertline.F')<H(E.vertline.F).

Sense-Labeling in Statistical Translation (Brown et al. Section 11.4.2)

The remainder of this section is devoted to describing methods for constructing a sense-labeling transducer. In this case the following setup prevails:

The Intermediate Structures. Intermediate structures e' and f' consist of sequences of words labeled by their senses. Thus f' is a sentence over the expanded vocabulary whose `words` f' are pairs (f,s) where f is a word in the original source vocabulary and s is its sense label. Similarly, e' is a sentence over the expanded vocabulary whose words e' are pairs (e,s) where e is a target word and s is its sense label.

Source and target transductions. For each source word and each target word, an informant site, such as first noun to the left is chosen, and an n-ary question about the value of the informant at that site is also chosen. A source-transduction transformation f.fwdarw.f' and an inverse target-transduction transformation e.fwdarw.e' map a sentence to the intermediate structure in which each word is labeled by a sense, determined by the question about its informant. A target-transduction transformation e'.fwdarw.e maps a labeled sentence to a sentence in which the labels have been removed.

The probability models. A translation model such as one of the models in Sections 8-10 is used for both P'(F'.vertline.E') and for P(F.vertline.E). A trigram language model such as that discussed in Section 6 is used for both P(E) and P'(E').

The Viterbi Approximation (Brown et al. Section 11.4.3)

The probability P(f.vertline.e) computed by the translation model requires a sum over alignments as discussed in detail in Sections 8-10. This sum is often too expensive to compute directly since the number of alignments increases exponentially with sentence length. In the mathematical considerations of this Section, this sum will be approximated by the single term corresponding to the alignment .nu.(f.vertline.e), with greatest probability. This is the Viterbi approximation already discussed in Sections 8-10 and .nu.(f.vertline.e) is the Viterbi alignment.

Let c(f.vertline.e) be the expected number of times that e is aligned with f in the Viterbi alignment of a pair of sentences drawn at random from a large corpus of training data. Let c(.phi..vertline.e) be the expected number of times that e is aligned with .phi. words. Then ##EQU19## where c(f.vertline.e;.nu.) is the number of times that e is aligned with f in the alignment A, and c(.phi..vertline.e;.nu.) is the number of times that e generates .phi. target words in A. The counts above are also expressible as averages with respect to the model: ##EQU20##

Probability distributions p(e,f) and p(.phi.,e) are obtained by normalizing the counts c(f.vertline.e) and c(.phi..vertline.e): ##EQU21##

.sup.4 In these equations and in the remainder of the paper, the generic symbol ##EQU22##

is used to denote a normalizing factor that converts counts to probabilities. The actual value of ##EQU23##

will be implicit from the context. Thus, for example, in the left hand equation of (168), the normalizing factor is norm = .SIGMA..sub.f,e c(f.vertline.e) which equals the average length of source sentences. In the right hand equation of (168), the normalizing factor is the average length of target sentences.

(These are the probabilities that are stored in a table of probabilities 4205.) The conditional distributions p(f.vertline.e) and p(.phi..vertline.e) are the Viterbi approximation estimates for the parameters of the model. The marginals satisfy ##EQU24## where u(e) and u(f) are the unigram distributions of e and f and .phi.(e)=.SIGMA..sub..phi. p(.phi..vertline.e).phi. is the average number of source words aligned with e. These formulae reflect the fact that in any alignment each source word is aligned with exactly one target word.

Cross Entropy (Brown et al. Section 11.4.4)

In this subsection the cross entropies H(E.vertline.F) and H(E'.vertline.F') are expressed in terms of the information between source and target words.

In the Viterbi approximation, the cross entropy H(F.vertline.E) is given by

H(F.vertline.E)=m{H(E.vertline.F)+H(.PHI..vertline.E)}, (170)

where m is the average length of the source sentences in the training data, and H(F.vertline.E) and H(.PHI..vertline.E) are the conditional entropies for the probability distributions p(f,e) and p(.phi.,e): ##EQU25##

A similar expression for the cross entropy H(E.vertline.F) will now be given. Since

P(f,e)=P(f.vertline.e)P(e),

this cross entropy depends on both the translation model, P(f.vertline.e), and the language model, P(e). With a suitable additional approximation,

H(E.vertline.F)=m{H(.PHI..vertline.E)-I(E,F)},+H(E) (172)

where H(E) is the cross entropy of P(E) and I(F,E) is the mutual information between f and e for the probability distribution p(f,e).

The additional approximation required is, ##EQU26## where p(f) is the marginal of p(f,e). This amount to approximating P(f) by the unigram distribution that is closest to it in cross entropy. Granting this, formula (172) is a consequence of (170) and of the identities

H(E.vertline.F)=H(E.vertline.F)-H(F)+H(E),

H(F)=H(F.vertline.E)+I(F,E). (174)

Next consider H(E'.vertline.F'). Let e.fwdarw.e' and f.fwdarw.f' be sense labeling transformations of the type discussed above. Assume that these transformations preserve Viterbi alignments; that is, if the words e and f are aligned in the Viterbi alignment for (f,e), then their sensed versions e' and f' are aligned in the Viterbi alignment for (f',e'). It follows that the word translation probabilities obtained from the Viterbi alignment satisfy p(f,e)=.SIGMA..sub.f'.epsilon.f p(f',e)=.SIGMA..sub.e'.epsilon.e p(f,e'), where the sums range over the sensed versions f' of f and the sensed versions e' of e.

By applying (172) to the cross entropies H(E.vertline.F), H(E.vertline.F'), and H(E'.vertline.F), it is not hard to verify that ##EQU27##

Here I(E,F'.vertline.f) is the conditional mutual information given a source word f between its translations E and its sensed versions F'; I(F,E'.vertline.e) is the conditional mutual information given a target word e between its translations F and its sensed versions E'; and I(.PHI., E'.vertline.e) is the conditional mutual information given e between .PHI. and its sensed versions E'.

Selecting Questions (Brown et al. Section 11.5)

The method depicted in FIG. 18 for finding good informants and questions for sensing is now described.

Source Questions (Brown et al. Section 11.5.1)

For sensing source sentences, a question about an informant is a function c from the source vocabulary into the set of possible senses. If the informant of f is x, then f is assigned the sense c(x). The function c(x) is chosen to minimize the cross entropy H(E.vertline.F'). From formula (175), this is equivalent to maximizing the conditional mutual information I(F',E.vertline.f) between E and F' ##EQU28## where p(f,e,x) is the probability distribution obtained by counting the number of times in the Viterbi alignments that e is aligned with f and the value of the informant of f is x, ##EQU29##

An exhaustive search for the best c requires a computation that is exponential in the number of values of x and is not practical. In the aforementioned paper entitled "Word-Sense Disambiguation using Statistical Methods" by P. F. Brown, et al., a good c is found using the flip-flop method which is only applicable if the number of senses is restricted to two. Here a different method that can be used to find c for any number of senses is described. This method uses the technique of alternating minimization, and is similar to the k-means method for determining pattern clusters and to the generalized Lloyd method for designing vector quantitizers.

The method is based on the fact that, up to a constant independent of c, the mutual information I(F',E.vertline.f) can be expressed as an infimum over conditional probability distributions q(E.vertline.c), ##EQU30##

The best value of the information is thus an infimum over both the choice for c and the choice for the q. This suggests the iterative method, depicted in 4401 for obtaining a good c. This method comprises the steps of:

4401. Beginning with an initial choice of c;

4404. Performing Steps 4402 and 4403 until no further increase in I(F',E.vertline.f) results;

4403. For given q, finding the best c:

c(x)=argmin.sub.c D(p(E.vertline.x,f); q(E.vertline.c));

4402. For this c, finding the best q: ##EQU31##

Target Questions (Brown et al. Section 11.5.2)

For sensing target sentences, a question about an informant is a function c from the target vocabulary into the set of possible senses. c is chosen to minimize the entropy H(E'.vertline.F). From (175) this is equivalent to maximizing the sum

I(F,E'.vertline.e)+I(.PHI.,E'.vertline.e).

In analogy to (179), ##EQU32##

Again a good c is obtained alternating minimization.

Generalizations (Brown et al. Section 11.6)

The method of sense-labeling discussed above ask a single question about a single word of context. In other embodiments of the sense labeler, this question is the first question in a decision tree. In still other embodiments, rather than using a single informant site to determine the sense of a word, questions from several different informant sites are combined to determine the sense of a word. In one embodiment, this is done by assuming that the probability of an informant word x.sub.i at informant site i, given a target word e, is independent of an informant word x.sub.j at a different informant site j given the target word e. Also, in other embodiments, the intermediate source and target structure representations are more sophisticated than word sequences, including, but not limited to, sequences of lexical morphemes, case frame sequences, and parse tree structures.

Table 1 shows a hypothetical example of an input series of source words according to the invention. In this example, the source words are French words.

                  TABLE 1
    ______________________________________
    Input Series of Source Words, F
    f.sub.1 f.sub.2
                   f.sub.3  f.sub.4
                                 f.sub.5
                                        f.sub.6
    ______________________________________
    La      clef   est      dans la     porte
                                             .
    ______________________________________


The translation apparatus according to the present invention further comprises a target hypothesis generator 12. The target hypothesis generator 12 generates at least two target hypotheses. Each target hypothesis comprises a series of target words selected from a vocabulary of words in the second language. The vocabulary of words in the second language may be stored in a target language vocabulary store 14. Each target word in a target hypothesis has a context comprising at least one other word in the target hypothesis.

An example of a target hypothesis generator is described in Section 14 of Brown et al, cited above, which is incorporated herein by reference, and set forth in full herein at this time.

HYPOTHESIS SEARCH--STEPS 702 AND 902 (Brown et al. Section 14)

Overview of Hypothesis Search (Brown et al. Section 14.1)

Referring now to FIG. 20, the second step 702 produces a set of hypothesized target structures which correspond to putative translations of the input intermediate source structure produced by step 701. The process by which these target structures are produced is referred to as hypothesis search. In a preferred embodiment target structures correspond to sequences of morphemes. In other embodiments more sophisticated linguistic structures such as parse trees or case frames may be hypothesized.

An hypothesis in this step 702 is comprised of a target structure and an alignment of a target structure with the input source structure. Associated with each hypothesis is a score. In other embodiments a hypothesis may have multiple alignments. In embodiments in which step 701 produces multiple source structures an hypothesis may contain multiple alignments for each source structure. It will be assumed here that the target structure comprised by a hypothesis contains a single instance of the null target morpheme. The null morphemes will not be shown in the figures pertaining to hypothesis search, but should be understood to be part of the target structures nonetheless. Throughout this section on hypothesis search, partial hypothesis will be used interchangeably with hypothesis, partial alignment with alignment, and partial target structure with target structure.

The target structures generated in this step 702 are produced incrementally. The process by which this is done is depicted in FIG. 21. This process is comprised of five steps.

A set of partial hypotheses is initialized in step 5401. A partial hypothesis is comprised of a target structure and an alignment with some subset of the morphemes in the source structure to be translated. The initial set generated by step 5401 consists of a single partial hypothesis. The partial target structure for this partial hypothesis is just an empty sequence of morphemes. The alignment is the empty alignment in which no morphemes in the source structure to be translated are accounted for.

The system then enters a loop through steps 5402, 5403, and 5404, in which partial hypotheses are iteratively extended until a test for completion is satisfied in step 5403. At the beginning of this loop, in step 5402, the existing set of partial hypotheses is examined and a subset of these hypotheses is selected to be extended in the steps which comprise the remainder of the loop. In step 5402 the score for each partial hypothesis is compared to a threshold (the method used to compute these thresholds is described below). Those partial hypotheses with scores greater than threshold are then placed on a list of partial hypotheses to be extended in step 5404. Each partial hypothesis that is extended in step 5404 contains an alignment which accounts for a subset of the morphemes in the source sentence. The remainder of the morphemes must still be accounted for. Each extension of an hypothesis in step 5404 accounts for one additional morpheme. Typically, there are many tens or hundreds of extensions considered for each partial hypothesis to be extended. For each extension a new score is computed. This score contains a contribution from the language model as well as a contribution from the translation model. The language model score is a measure of the plausibility a priori of the target structure associated with the extension. The translation model score is a measure of the plausibility of the partial alignment associated with the extension. A partial hypothesis is considered to be a full hypothesis when it accounts for the entire source structure to be translated. A full hypothesis contains an alignment in which every morpheme in the source structure is aligned with a morpheme in the hypothesized target structure. The iterative process of extending partial hypotheses terminates when step 5402 produces an empty list of hypotheses to be extended. A test for this situation is made on step 5403.

This method for generating target structure hypotheses can be extended to an embodiment of step 902 of FIG. 22, by modifying the hypothesis extension step 5404 in FIG. 21, with a very similar step that only considers extensions which are consistent with the set of constraints 906. Such a modification is a simple matter for one skilled in the art.

Hypothesis Extension 5404 (Brown et al. Section 14.2)

This section provides a description of the method by which hypotheses are extended in step 5404 of FIG. 21. Examples will be taken from an embodiment in which the source language is French and the target language is English. It should be understood however that the method described applies to other language pairs.

Types of Hypothesis Extension (Brown et al. Section 14.2.1)

There are a number of different ways a partial hypothesis may be extended. Each type of extension is described by working through an appropriate example. For reasons of exposition, the method described in this section assigns scores to partial hypotheses based on Model 3 from the section entitled Translation Models and Parameter Estimation. One skilled in the art will be able to adopt the method described here to other models, such as Model 5, which is used in the best mode of operation.

A partial hypothesis is extended by accounting for one additional previously unaccounted for element of the source structure. When a partial hypothesis H.sub.1 is extended to some other hypothesis H.sub.2, the score assigned to H.sub.2 is a product of the score associated with H.sub.1 and a quantity denoted as the extension score. The value of the extension score is determined by the language model, the translation model, the hypothesis being extended and the particular extension that is made. A number of different types of extensions are possible and are scored differently. The possible extension types and the manner in which they are scored is illustrated in the examples below.

As depicted in FIG. 23, in a preferred embodiment, the French sentence La jeune fille a reveille sa mere is transduced in either step 701 of FIG. 20 or step 901 of FIG. 22 into the morphological sequence la jeune fille V.sub.-- past.sub.-- 3s sa mere.

The initial hypothesis accounts for no French morphemes and the score of this hypothesis is set to 1. This hypothesis can be extended in a number of ways. Two sample extensions are shown in FIGS. 24 and 49. In the first example in FIG. 24, the English morpheme the is hypothesized as accounting for the French morpheme la. The compound of the score associated with this extension is the equal to

l(the.vertline.*,*)n(1.vertline.the)t(la.vertline.the)d(1.vertline.1).(182)

Here, * is a symbol which denotes a sequence boundary, and the factor l(the.vertline.*,*) is the trigram language model parameter that serves as an estimate of the probability that the English morpheme the occurs at the beginning of a sentence. The factor n(1.vertline.the) is the translation model parameter that is an estimate of the probability that the English morpheme the has fertility 1, in other words, that the English morpheme the is aligned with only a single French morpheme. The factor t(la.vertline.the) is the translation model parameter that serves as an estimate of the lexial probability that the English morpheme the translates to the French morpheme la. Finally, the factor d(1.vertline.1) is the translation model parameter that serves as an estimate of the distortion probability that a French morpheme will be placed in position 1 of the French structure given that it is aligned with an English morpheme that is in position 1 of the English structure. In the second example in FIG. 49, the English morpheme mother is hypothesized as accounting for the French morpheme mere. The score for this partial hypothesis is

l(mother.vertline.*,*)n(1.vertline.mother)t(mere.vertline.mother)d(7.vertli ne.1). (183)

Here, the final factor d(7.vertline.1) serves as an estimate of the distortion probability that a French morpheme, such as mere, will be place in the 7th position in a source sequence given that it is aligned with an English morpheme such as mother which is in the 1st position in an hypothesized target sequence.

Now, suppose the partial hypothesis in FIG. 24 is to be extended on some other invocation of step 5404. A common translation of the pair of French morphemes jeune fille is the English morpheme girl. However, since in a preferred embodiment a partial hypothesis is extended to account for only a single French morpheme at a time, it is not possible to account for both jeune and fille with a single extension. Rather the system first accounts for one of the morphemes, and then on another round of extensions, accounts for the other. This can be done in two ways, either by accounting first for jeune or by accounting first for fille. FIG. 25 depicts the extension that accounts first for fille. The + symbol in FIG. 25 after the the English morpheme girl denotes the fact that in these extensions girl is to be aligned with more French morphemes than it is currently aligned with, in this case, at least two. A morpheme so marked is referred to as open. A morpheme that is not open is said to be closed. A partial hypothesis which contains an open target morpheme is referred to as open, or as an open partial hypothesis. A partial hypothesis which is not open is referred to as closed, or as a closed partial hypothesis. An extension is referred to as either open or closed according to whether or not the resultant partial hypothesis is open or closed. In a preferred embodiment, only the last morpheme in a partial hypothesis can be designated open. The score for the extension in FIG. 25 is ##EQU33##

Here, the factor l(girl.vertline.*,the) is the language model parameter that serves as an estimate of the probability with which the English morpheme girl is the second morpheme in a source structure in which the first morpheme is the. The next factor of 2 is the combinatorial factor that is discussed in the section entitled Translation Models and Parameter Estimation. It is factored in, in this case, because the open English morpheme girl is to be aligned with at least two French morphemes. The factor n(i.vertline.girl) is the translation model parameter that serves as an estimate of the probability that the English morpheme girl will be aligned with exactly i French morphemes, and the sum of these parameters for i between 2 and 25 is an estimate of the probability that girl will be aligned with at least 2 morphemes. It is assumed that the probability that an English morpheme will be aligned with more than 25 French morphemes is 0. Note that in a preferred embodiment of the present invention, this sum can be precomputed and stored in memory as a separate parameter. The factor t(fille.vertline.girl) is the translation model parameter that serves as an estimate of the lexical probability that one of the French morphemes aligned with the English morpheme girl will be the French morpheme fille. Finally, the factor d(3.vertline.2) is the translation model parameter that serves as an estimate of the distortion probability that a French morpheme will be placed in position 3 of the French structure given that it is aligned with an English morpheme which is in position 2 of the English structure. This extension score in Equation 184 is multiplied by the score in Equation 182 for the partial hypothesis which is being extended to yield a new score for the partial hypothesis in FIG. 24 of ##EQU34##

Consider now an extension to the partial hypothesis in FIG. 25. If a partial hypothesis that is to be extended contains an open morpheme, then, in a preferred embodiment, that hypothesis can only be extended by aligning another morpheme from the source structure with that open morpheme. When such an extension is made, there are two possibilities: 1) the open morpheme is kept open in the extended partial hypothesis, indicating that more source morphemes are to be aligned with that open target morpheme, or 2) the open morpheme is closed indicating that no more source morphemes are to be aligned with that target morpheme. These two cases are illustrated in FIGS. 26 and 27.

In FIG. 26, an extension is made of the partial alignment in FIG. 25 by aligning the additional French morpheme jeune with the English morpheme girl. In this example the English morpheme girl is then closed in the resultant partial hypothesis. The extension score for this example is ##EQU35##

Here, the first quotient adjusts the fertility score for the partial hypothesis by dividing out the estimate of the probability that girl is aligned with at least two French morphemes and by multiplying in an estimate of the probability that girl is aligned with exactly two French morphemes. As in the other examples, the second and third factors are estimates of the lexical and distortion probabilities associated with this extension.

In FIG. 27, the same extension is made as in FIG. 26, except here the English morpheme girl is kept open after the extension, hence the + sign. The extension score for this example is ##EQU36##

Here, the factor of 3 is the adjustment to the combinatorial factor for the partial hypothesis. Since the score for the partial hypothesis in FIG. 26 already has a combinatorial factor of 2, the score for the resultant partial hypothesis in FIG. 27, will have a combinatorial factor of 2.times.3=3 . The quotient adjusts the fertility score for the partial hypothesis to reflect the fact that in further extensions of this hypothesis girl will be aligned with at least three French morphemes.

Another type of extension performed in the hypothesis search is one in which two additional target morphemes are appended to the partial target structure of the partial hypothesis being extended. In this type of extension, the first of these two morphemes is assigned a fertility of zero and the second is aligned with a single morpheme from the source structure. This second target morpheme may be either open or closed.

FIG. 29 shows an extension of the partial hypothesis in FIG. 28 by the two target morphemes up her, in which her is aligned with the source morpheme sa. The score for this extension is

l(up.vertline.girl,to.sub.13 wake)l(her.vertline.to.sub.-- wake,up)n(0.vertline.up)n(1.vertline.her)t(sa.vertline.her)d(6.vertline.6) .(188)

Here, the first two factors are the trigram language model estimates of the probabilities with which up follows girl to.sub.-- wake, and with which her follows to.sub.-- wake up, respectively. The third factor is the fertility parameter that serves as an estimate of the probability that up is aligned with no source morphemes. The fourth, fifth, and sixth factors are the appropriate fertility, lexical, and distortion parameters associated with the target morpheme her in this partial alignment.

FIG. 30 shows a similar extension by up her. The difference with the extension in FIG. 29 is that in FIG. 30, the source morpheme her is open. The score for this extension is ##EQU37##

The score for this extension differs from the score in Equation 188 in that the fertility parameter n(1.vertline.her) is replaced by the combinatorial factor 2 and the sum of fertility parameters which provides an estimate of the probability that her will be aligned with at least two source morphemes.

A remaining type of extension is where a partial hypothesis is extended by an additional connection which aligns a source morpheme with the null target morpheme. The score for this type of extension is similar to those described above. No language model score is factored in, and scores from the translation model are factored in, in accordance with the probabilities associated with the null word as described in the section entitled Translation Models and Parameter Estimation.

Selection of Hypotheses to Extend 5402 (Brown et al. Section 14.3)

Throughout the hypothesis search process, partial hypotheses are maintained in a set of priority queues. In theory, there is a single priority queue for each subset of positions in the source structure. So, for example, for the source structure oui , oui, three positions: oui is in position 1; a comma is in position 2; and oui is in position 3, and there are therefore 2.sup.3 subsets of positions { }, {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, and {1,2,3}. In practice, these priority queues are initialized only on demand, and many less than the full number of queues possible are used in the hypothesis search. In a preferred embodiment, each partial hypothesis is comprised of a sequence of target morphemes, and these morphemes are aligned with a subset of source morphemes. Corresponding to that subset of source morphemes is a priority queue in which the partial hypothesis is stored. The partial hypotheses within a queue are prioritized according to the scores associated with those hypotheses. In certain preferred embodiments the priority queues are limited in size and only the 1000 hypothesis with the best scores are maintained.

The set of all subsets of a set of source structure positions can be arranged in a subset lattice. For example, the subset lattice for the set of all sets of the set {1,2,3} is shown in FIG. 31. In a subset lattice, a parent of a set S is any which contains one less element than S, and which is also a subset of S. In FIG. 31 arrows have been drawn from each set in the subset lattice to each of its parents. For example, the set {2} is a parent of the set {1,2}.

A subset lattice defines a natural partial ordering on a set of sets. Since the priority queues used in hypothesis search are associated with subsets, a subset lattice also defines a natural partial ordering on the set of priority queues. Thus in FIG. 31, there are two parents of the priority queue associated with the subset of source structure positions {1,3}. These two parents are the priority queues associated with the set {1} and {3}. A priority queue Q.sub.1 is said to be an ancestor of another priority Q.sub.2 if 1) Q.sub.1 is not equal to Q.sub.2, and 2) Q.sub.1 is a subset of Q.sub.2. If Q.sub.1 is an ancestor of Q.sub.2, then Q.sub.2 is said to be to be a descendant of Q.sub.1.

Considering now the process by which a set of partial hypotheses are selected in step 5402 to be extended in step 5404, when step 5402 is invoked, it is invoked with a list of partial hypotheses that were either 1) created by the initialization step 5401, or 2) created as the results of extensions in step 5404 on a previous pass through the loop comprised of steps 5402, 5403, and 5404. These partial hypotheses are stored in priority queues according to the sets of source morphemes they account for. For example, the partial hypothesis in FIG. 32 would be stored in the priority queue associated with the set {2,3}, since it accounts for the source morphemes in positions 2 and 3.

A priority queue is said to be active if there are partial hypotheses stored in it. An active priority queue is said to be on the frontier if it has no active descendent. The cardinality of a priority queue is equal to the number of elements in the subset with which it is associated. So, for example, the cardinality of the priority queue which is associated with the set {2,3} is 2.

The process in step 5402 functions by assigning a threshold to every active priority queue and then places on the list of partial hypotheses to be extended every partial hypothesis on an active priority queue that has an a score that is greater than the threshold for that priority queue. This is depicted in FIG. 33. First, in step 6601 the threshold for every active priority queue is initialized to infinity, in practice, some very large number. Second, in step 6602, thresholds are determined for every priority queue on the frontier.

The method by which these thresholds are computed is best described by first describing what the normalizer of a priority queue is. Each priority queue on the frontier corresponds to a set of positions of source morphemes. At each position of these positions is a particular source morpheme. Associated with each morpheme is a number, which in a preferred embodiment is the unigram probability of that source morpheme. These unigram probabilities are estimated by transducing a large body of source text and simply counting the frequency with which the different source morphemes occur. The normalizer for a priority queue is defined to be the product of all the unigram probabilities for the morphemes at the positions in the associated set of source structure positions. For example, the normalizer for the priority queue associated with the set {2,3} for the source structure la jeune fille V.sub.-- past.sub.-- 3s reveiller sa mere is:

normalizer({2,3})=Pr(jeune)Pr(fille). (190)

For each priority queue Q on the frontier define the normed score of Q to be equal to the score of the partial hypothesis with the greatest score in Q divided by the normalizer for Q. Let Z be equal to the maximum of all normed scores for all priority queues on the frontier. The threshold assigned to a priority queue Q on the frontier is then equal to Z times the normalizer for that priority queue divided by a constant which in a preferred embodiment is 45.

After step 6602, thresholds have been assigned to the priority queues on the frontier, a loop is performed in steps 6604 through 6610. The loop counter i is equal to a different cardinality on each iteration of the loop. The counter i is initialized in step 6604 to the largest cardinality of any active priority queue, in other words, i is initialized to the maximum cardinality of any priority queue on the frontier. On each iteration of the loop the value of i is decreased by 1 until i is equal to 0, at which point the test 6604 is satisfied and the process of selecting partial hypotheses to be extended is terminated.

Inside the loop through cardinalities is another loop in steps 6606 through 6609. This is a loop through all active priority queues of a given cardinality. In this loop each priority queue of cardinality i is processed in step 6608.

A schematic flow diagram for this processing step 6608 is shown in FIG. 34. The priority queue Q to be processed enters this step at 6701. Steps 6704 through 6707 perform a loop through all partial hypotheses i in the priority queue Q which are greater than the threshold associated with Q. At step 6705 the partial hypothesis i is added to the list of partial hypotheses to be extended. At step 6706 i is used to adjust the thresholds of all active priority queues which are parents of Q. These thresholds are then used when priority queues of lower priority are processed in the loop beginning at step 6604 in FIG. 33.

Each priority queue which is a parent of partial hypothesis i at step 6706 contains partial hypotheses which account for one less source morpheme than the partial hypothesis i does.

For example, consider the partial hypothesis depicted in FIG. 26. Suppose this is the partial hypothesis i. The two target morphemes the and girl are aligned with the three source morphemes la, jeune, and fille which are in source structure positions 1, 2, and 3 respectively. This hypothesis i is therefore in the priority queue corresponding to the set {1,2,3}. The priority queues that are parents of this hypothesis correspond to the sets {1,2}, {1,3}, and {2,3}. We can use partial hypothesis i to adjust the threshold in each of these priority queues, assuming they are all active, by computing a parent score, score.sub.p from the score score.sub.i associated with the partial hypothesis i. A potentially different parent score is computed for each active parent priority queue. That parent score is then divided by a constant, which in a preferred embodiment is equal to 45. The new threshold for that queue is then set to the minimum of the previous threshold and that parent score.

These parent scores are computed by removing from score, the contributions for each of the source morphemes la, jeune, and fille. For example, to adjust the threshold for the priority queue {2,3}, it is necessary to remove the contribution to the score, associated with the source morpheme in position 1, which is la. This morpheme is the only morpheme aligned with the, so the language model contribution for the must be removed, as well as the translation model contributions associated with la. Therefore, ##EQU38##

As another example, to adjust the threshold for the priority queue {1,3}, it is necessary to remove the contribution to the score, associated with the source morpheme in position 2, which is jeune. This morpheme is one of two aligned with the target morpheme girl. If the connection between girl and jeune is removed from the partial alignment in FIG. 26, there is still a connection between girl and fille. In other words, girl is still needed in the partial hypothesis to account for fille. Therefore, no language model component is removed. The parent score in this case is: ##EQU39##

Here, the first quotient adjust the fertility score, the second adjusts the lexical score and the third adjusts the distortion score.

With some thought, it will be clear to one skilled in the art how to generalize from these examples to other situations. In general, a parent score is computed by removing a connection from the partial alignment associated with the partial hypothesis i. Such a connection conne