Method for producing summaries of text document
6317708
Abstract
A computer method for preparing a summary string from a source document of encoded text. The method comprises comparing a training set of encoded text documents with manually generated summary strings associated therewith to learn probabilities that a given summary word or phrase will appear in summary strings given a source word or phrase appears in encoded text documents and constructing from the source document a summary string containing summary words or phrases having the highest probabilities of appearing in a summary string based on the learned probabilities established in the previous step.
Claims
What is claimed is:
Description
COPYRIGHT NOTICE
TABLE 1
"The U.N. Security Council on Monday was to
address a dispute between U.N. chief weapons
inspector Richard Butler and Iraq over which
disarmament documents Baghdad must hand over.
Speaking in an interview with CNN on Sunday
evening, Butler said that despite the latest
dispute with Iraq, it was too soon to make a
judgment that the Iraqis had broken last week's
agreement to unconditionally resume cooperation
with weapons inspectors -- an agreement which
narrowly averted air strikes by the United States
and Britain."
Some possible headlines/summaries for the document reproduced above are: "Security Council to address Iraqi document dispute." "Iraqi Weapons Inspections Dispute."
These summaries illustrate some of the reasoning required for summarization. The system must decide (1) what information to present in the summary, (2) how much detail to include in the summary or how long the summary can be, and (3) how best to phrase the information so that it seems coherent. The two summaries above illustrate some of the issues of length, content and emphasis.
The statistical models are produced by comparing a variety of documents, and summaries for those documents, similar to those set forth above. For a variety of parameter settings, the comparison is used to learn mechanisms for both (1) content selection for the most likely summaries of a particular length and (2) generating coherent English (or any other language) text to express the content.
The learning for both content selection and summary generation may take place at a variety of conceptual levels ranging from characters, words, word sequences or n-grams, phrases, and text spans to their associated syntactic and semantic tags. When the learning uses such tags, the texts in the training sets must be tagged prior to the comparison. Set forth in the following table is the text of Table 1 after being tagged with syntactic parts of speech using the LDC standard, e.g., DT: determiner, NNP: proper noun, JJ: adjective.
TABLE 2
The_DT U.N._NNP Security_NNP Council_NNP on_IN
Monday_NNP was_VBD to_TO address_VB a_NN dispute_NN
between_IN U.N._NNP chief_JJ weapons_NNS
inspector_NN Richard_NNP Butler_NNP and_CC Iraq_NNP
over_IN which_WDT disarmament_NN documents_NNS
Baghdad_NNP must_NN hand_NN over._CD _NN _NN _NN
Speaking_VBG in_IN an_DT interview_NN with_IN
CNN_NNP on_IN Sunday_NNP evening,_NNP Butler_NNP
said_VBD that_IN despite_IN the_DT latest_JJS
dispute_NN with_IN Iraq,_NNP it_PRP was_VBD too_RB
soon_RB to_VBP make_VB a_DT judgment_NN that_IN
the_DT Iraqis_NNPS had_VBD broken_VBN last_JJ
week's_NN agreement_NN to_TO unconditionally_RB
resume_VB cooperation_NN with_NN weapons_NNS
inspectors:_NNS an_DT agreement_NN which_WDT
narrowly_RB averted_VBP airstrikes_NNS by_IN the_DT
United_NNP States_NNPS and_CC Britain._NNP.
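A tagged line in the style of Table 2 can be split back into word and tag pairs before any statistics are gathered. The following is a minimal Java sketch, not code from the appendix; the class name TaggedTokenParser and the rule of splitting each token on its last underscore are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class TaggedTokenParser {

    /** A word paired with its part-of-speech tag. */
    public static final class TaggedToken {
        public final String word;
        public final String tag;

        TaggedToken(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    /**
     * Splits whitespace-separated "word_TAG" tokens on the last underscore.
     * Tokens with no word part or no tag part (e.g. a stray "_NN") are skipped.
     */
    public static List<TaggedToken> parse(String taggedText) {
        List<TaggedToken> tokens = new ArrayList<>();
        for (String raw : taggedText.trim().split("\\s+")) {
            int sep = raw.lastIndexOf('_');
            if (sep <= 0 || sep == raw.length() - 1) {
                continue; // nothing before or after the underscore
            }
            tokens.add(new TaggedToken(raw.substring(0, sep), raw.substring(sep + 1)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "The_DT U.N._NNP Security_NNP Council_NNP on_IN Monday_NNP";
        for (TaggedToken t : parse(line)) {
            System.out.println(t.word + "\t" + t.tag);
        }
    }
}
```

Splitting on the last underscore rather than the first keeps words that themselves contain an underscore intact.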
Set forth in the following table is the text of Table 1 with named entity markers after being tagged with semantic tags using the TIPSTER/MUC standards, e.g., NE: named entity, TE: temporal entity, LOC: location.
TABLE 3
The [U.N. Security Council]-NE on [Monday]-TE was to
address a dispute between [U.N.]-NE chief weapons
inspector [Richard Butler]-NE and [Iraq]-NE over
which disarmament documents [Baghdad]-NE must hand
over.
Speaking in an interview with [CNN]-NE on [Sunday]-TE
evening, [Butler]-NE said that despite the latest
dispute with [Iraq]-NE, it was too soon to make a
judgment that the [Iraqis]-NE had broken last week's
agreement to unconditionally resume cooperation with
weapons inspectors - an agreement which narrowly
averted airstrikes by the [United States]-NE and
[Britain]-NE.
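The bracketed spans of Table 3 can be recovered with a regular expression. This is a hedged Java sketch rather than anything taken from the appendix; the class name EntitySpanExtractor and the pattern are assumptions, and the pattern tolerates optional whitespace or a line break between the hyphen and the tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntitySpanExtractor {

    /** A bracketed text span and the semantic tag attached to it. */
    public static final class Span {
        public final String text;
        public final String tag;

        Span(String text, String tag) {
            this.text = text;
            this.tag = tag;
        }
    }

    // "[text]-TAG", allowing whitespace (including a newline) before the tag.
    private static final Pattern SPAN =
            Pattern.compile("\\[([^\\]]+)\\]-\\s*([A-Z_/]+)");

    /** Returns every tagged span in the order it appears in the input. */
    public static List<Span> extract(String annotated) {
        List<Span> spans = new ArrayList<>();
        Matcher m = SPAN.matcher(annotated);
        while (m.find()) {
            spans.add(new Span(m.group(1), m.group(2)));
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "The [U.N. Security Council]- NE on [Monday]- TE was to address a dispute";
        for (Span s : extract(text)) {
            System.out.println(s.tag + "\t" + s.text);
        }
    }
}
```

Because the tag character class also accepts "_" and "/", the same pattern reads tags such as COMMUNICATIVE_ACTION and CIRCUMSTANCE/TEMPORAL.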
Set forth in the following table is the text of Table 1 after being tagged with semantic tags, e.g., AGENT, CIRCUMSTANCE, CIRCUMSTANCE/TEMPORAL, COMMUNICATIVE_ACTION, and OBJECT.
TABLE 3A
The [U.N. Security Council]-AGENT on
[Monday]-CIRCUMSTANCE/TEMPORAL
[was to address]-COMMUNICATIVE_ACTION
[a dispute between U.N. chief weapons inspector Richard Butler
and Iraq over which disarmament documents Baghdad must hand
over.]-OBJECT
[Speaking in an interview with CNN on Sunday evening,]-CIRCUMSTANCE
[Butler]-AGENT [said]-COMMUNICATIVE_ACTION
[that despite the latest dispute with Iraq, it was too soon to make
a judgment that the Iraqis had broken last week's agreement to
unconditionally resume cooperation with weapons inspectors -- an
agreement which narrowly averted airstrikes by the United States
and Britain.]-OBJECT
The training set is used to model the relationship between the appearance of some features (text spans, labels, or other syntactic and semantic features of the document) in the document and the appearance of features in the summary. In the simplest case, this can be a mapping between the appearance of a word in the document and the likelihood of the same or another word appearing in the summary. The applicants used a training set of over twenty-five thousand documents that had associated headlines or summaries. These documents were analyzed to ascertain the conditional probability that a word appears in the headline given that the word appears in the document. The following table sets forth these probabilities for certain words relating to the text of Table 1.
TABLE 4
Word Conditional Probability
Iraqi 0.4500
Dispute 0.9977
Weapons 1.000
Inspection 0.3223
Butler 0.6641
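One simple way to estimate such a table is by relative frequency over document and headline pairs. The Java sketch below is illustrative only and is not taken from the appendix: the class name ContentModelTrainer, the lower-casing tokenizer, and the choice to count each word once per document are all assumptions. It estimates the probability that a word appears in a headline given that it appears in the document, which is the form the abstract describes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class ContentModelTrainer {

    /**
     * Each pair is {documentText, headlineText}. Returns, for every word seen
     * in a document, (documents whose headline also has the word) divided by
     * (documents containing the word).
     */
    public static Map<String, Double> train(List<String[]> pairs) {
        Map<String, Integer> inDocument = new HashMap<>();
        Map<String, Integer> inBoth = new HashMap<>();
        for (String[] pair : pairs) {
            Set<String> docWords = tokenize(pair[0]);
            Set<String> headlineWords = tokenize(pair[1]);
            for (String w : docWords) {
                inDocument.merge(w, 1, Integer::sum);
                if (headlineWords.contains(w)) {
                    inBoth.merge(w, 1, Integer::sum);
                }
            }
        }
        Map<String, Double> probability = new HashMap<>();
        for (Map.Entry<String, Integer> e : inDocument.entrySet()) {
            int both = inBoth.getOrDefault(e.getKey(), 0);
            probability.put(e.getKey(), both / (double) e.getValue());
        }
        return probability;
    }

    /** Lower-cases the text and returns its distinct alphabetic words. */
    static Set<String> tokenize(String text) {
        Set<String> words = new HashSet<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Toy pairs for illustration; not the applicants' training data.
        List<String[]> pairs = Arrays.asList(
                new String[] {"Iraq dispute continues", "Iraq dispute"},
                new String[] {"Iraq talks resume", "Talks resume"});
        System.out.println(train(pairs));
    }
}
```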
The system making use of the translation model extracts words or phrases from the source text based upon the probability that these or other words will appear in summaries. The probability that certain subsets of words individually likely to appear in summaries will appear in combination can be calculated using Bayes' theorem. Thus, the probability that the phrase "weapons inspection dispute", or any ordering thereof, appears in a summary may be expressed simply as: Pr("weapons" | "weapons" in document) * Pr("inspection" | "inspection" in document) * Pr("dispute" | "dispute" in document). Equivalently, the logarithm of this probability may be expressed: Log(Pr("weapons" | "weapons" in document)) + Log(Pr("inspection" | "inspection" in document)) + Log(Pr("dispute" | "dispute" in document)).
More involved models can express the relationship among arbitrary subsets, including subsequences, of the words in the document and subsets of candidate words that may appear in the summary. The more involved models can express relationships among linguistic characterizations of subsets of terms in the document and summaries, such as parts-of-speech tags or parse trees. The more involved models may also express relationships among these sets of terms and meta-information related to the document or the summary, such as length; derived statistics over terms (such as proportion of verbs or nouns in the document, average sentence length, etc.); typographical information, such as typeface; formatting information, such as centering, paragraph breaks and so forth; and meta-information, such as provenance (author, publisher, date of publication, Dewey or other classification), recipient, reader, news group, and media through which presented (web, book, magazine, TV chyron or caption).
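The simple word-by-word product described above can be sketched as a sum of logarithms. This is a minimal Java illustration under stated assumptions, not the appendix code: the class name ContentScore is hypothetical, natural logarithms are used because the text does not state a base, and the probabilities are the Table 4 values.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ContentScore {

    /** Conditional probabilities copied from Table 4 (keys lower-cased). */
    public static Map<String, Double> table4() {
        Map<String, Double> p = new HashMap<>();
        p.put("iraqi", 0.4500);
        p.put("dispute", 0.9977);
        p.put("weapons", 1.000);
        p.put("inspection", 0.3223);
        p.put("butler", 0.6641);
        return p;
    }

    /**
     * Sums the log conditional probabilities of the given words. The sum is
     * the log of their product, so word order does not change the result.
     */
    public static double logScore(Map<String, Double> condProb, List<String> words) {
        double sum = 0.0;
        for (String w : words) {
            Double p = condProb.get(w);
            if (p == null) {
                throw new IllegalArgumentException("no probability for: " + w);
            }
            sum += Math.log(p);
        }
        return sum;
    }

    public static void main(String[] args) {
        double log = logScore(table4(), Arrays.asList("weapons", "inspection", "dispute"));
        System.out.printf("log score = %.4f, probability = %.4f%n", log, Math.exp(log));
    }
}
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and it does not change the ranking of candidates.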
One of the advantages in learning a content selection model is that the system can learn relationships between summary terms that are not in the document and terms that are in the document, and apply those relationships to new documents, thereby introducing new terms in the summary. Once a content selection model has been trained on the training set, conditional probabilities for the features that have been seen in the summaries can be computed. The summary structure generator makes use of these conditional probabilities to compute the most likely summary candidates for particular parameters, such as length of summary. Since the probability of a word appearing in a summary can be considered to be independent of the structure of the summary, the overall probability of a particular candidate summary can be computed by multiplying the probabilities of the content in the summary by the probability of that content expressed using a particular summary structure (e.g., length and/or word order).
Since there is no limitation on the types of relationships that can be expressed in the content selection model, variations on this invention can use appropriate training sets to produce a cross-lingual or even cross-media summary. For example, a table expressing the conditional probability that an English word should appear in a summary of a Japanese document could be used to simultaneously translate and summarize Japanese documents. An inventory of spoken word forms, together with a concatenative synthesis algorithm and a table of conditional probabilities that speech segments would be used in a spoken summary of a particular document, could be used to generate spoken summaries. Similarly, corresponding video or other media could be chosen to represent the content of documents.
EXAMPLE
For use in generating summaries, the probability of finding particular words in a summary is learned from the training set.
For certain words appearing in the text set forth in Table 1, the learned probabilities are listed in the following table:
TABLE 5
Word Log probability of word in Reuters headlines
Iraqi -3.0852
Dispute -1.0651
Weapons -2.7098
Inspection -2.8417
Butler -1.0038
Also, for generating summaries, the probability of finding pairs of words in sequence in the training set summaries is learned. For certain words appearing in the text set forth in Table 1, the learned probabilities are listed in the following table:
TABLE 6
Word pair (word 1, word 2) Log probability of word 2 given word 1
Iraqi weapons -0.7622
Weapons inspection -0.6543
Inspection dispute -1.4331
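Word-pair probabilities of the kind listed in Table 6 can be estimated by counting adjacent words in the training headlines. The Java sketch below is a plain maximum-likelihood estimate offered for illustration, not the applicants' procedure: the class name BigramTrainer is hypothetical, no smoothing is applied, and base-10 logarithms are an assumption because the text does not state the base.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class BigramTrainer {

    /**
     * Returns log10 P(word2 | word1), estimated as
     * count(word1 word2) / count(word1 followed by any word).
     * Outer key is word1, inner key is word2.
     */
    public static Map<String, Map<String, Double>> train(List<String> headlines) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (String headline : headlines) {
            String[] w = headline.toLowerCase(Locale.ROOT).trim().split("\\s+");
            for (int i = 0; i + 1 < w.length; i++) {
                counts.computeIfAbsent(w[i], k -> new HashMap<>())
                      .merge(w[i + 1], 1, Integer::sum);
            }
        }
        Map<String, Map<String, Double>> logProb = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> first : counts.entrySet()) {
            int total = 0;
            for (int c : first.getValue().values()) {
                total += c;
            }
            Map<String, Double> row = new HashMap<>();
            for (Map.Entry<String, Integer> second : first.getValue().entrySet()) {
                row.put(second.getKey(), Math.log10(second.getValue() / (double) total));
            }
            logProb.put(first.getKey(), row);
        }
        return logProb;
    }

    public static void main(String[] args) {
        // Toy headlines for illustration; not the Reuters training data.
        List<String> headlines = Arrays.asList(
                "Iraqi weapons inspection", "Iraqi weapons dispute", "Iraqi dispute");
        System.out.println(train(headlines));
    }
}
```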
To calculate the desirability of a headline containing the sequence "Iraqi weapons inspection . . . ", the system multiplies the likelihood of seeing the word "Iraqi" in a headline (see Table 5) by the likelihood of "Iraqi" being followed by "weapons" and of "weapons" being followed by "inspection" (see Table 6). In log form, this may be expressed as follows: Log(P("Iraqi")) + Log(P("weapons" | "Iraqi")) + Log(P("inspection" | "weapons")), which, using the values in the tables, yields a log probability of -3.0852 + (-0.7622) + (-0.6543) = -4.5017. Alternative sequences using the same words, such as "Iraqi dispute weapons", have probabilities that can be calculated similarly. In this case, the sequence "Iraqi dispute weapons" has not appeared in the training data, and its probability is estimated using a back-off weight. A back-off weight is a very small but non-zero weight or assigned probability for words or word sequences not appearing in the training set.
These calculations can be extended to take into account the likelihood of semantic and syntactic tags at both the word and the phrase level, or can be carried out with respect to textual spans from characters on up. The calculations can also be generalized to use estimates of the desirability of sequences of more than two text spans (for example, tri-gram (three-word sequence) probabilities may be used). Other measures of the desirability of word sequences can be used. For example, the output of a neural network trained to evaluate the desirability of a sequence containing certain words and tags could be substituted for the log probabilities used in the preceding explanation. Moreover, other combination functions for these measures could be used rather than multiplication of probabilities or addition of log probabilities. In general, the summary generator comprises any function for combining any form of estimate of the desirability of the whole summary under consideration such that this overall estimate can be used to make a comparison between a plurality of possible summaries.
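The sequence calculation above can be sketched in Java using the Table 5 and Table 6 entries. This is an illustration under stated assumptions, not the appendix code: the class name SequenceScore is hypothetical, and the back-off value of -6.0 is an arbitrary placeholder, since the text gives no numeric back-off weight.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class SequenceScore {

    // Hypothetical back-off log probability for unseen words or word pairs.
    static final double BACKOFF_LOG_PROB = -6.0;

    static final Map<String, Double> UNIGRAM = new HashMap<>(); // Table 5
    static final Map<String, Double> BIGRAM = new HashMap<>();  // Table 6

    static {
        UNIGRAM.put("iraqi", -3.0852);
        UNIGRAM.put("dispute", -1.0651);
        UNIGRAM.put("weapons", -2.7098);
        UNIGRAM.put("inspection", -2.8417);
        UNIGRAM.put("butler", -1.0038);
        BIGRAM.put("iraqi weapons", -0.7622);
        BIGRAM.put("weapons inspection", -0.6543);
        BIGRAM.put("inspection dispute", -1.4331);
    }

    /**
     * Log probability of the first word plus the log probability of each
     * later word given its predecessor; unseen entries use the back-off value.
     */
    public static double score(String... words) {
        if (words.length == 0) {
            return 0.0;
        }
        double total = UNIGRAM.getOrDefault(words[0].toLowerCase(Locale.ROOT), BACKOFF_LOG_PROB);
        for (int i = 1; i < words.length; i++) {
            String pair = (words[i - 1] + " " + words[i]).toLowerCase(Locale.ROOT);
            total += BIGRAM.getOrDefault(pair, BACKOFF_LOG_PROB);
        }
        return total;
    }

    public static void main(String[] args) {
        // -3.0852 + -0.7622 + -0.6543; prints -4.5017
        System.out.println(String.format(Locale.ROOT, "%.4f", score("Iraqi", "weapons", "inspection")));
        // Both pairs are unseen, so the back-off value is used twice.
        System.out.println(String.format(Locale.ROOT, "%.4f", score("Iraqi", "dispute", "weapons")));
    }
}
```

With any back-off value below the observed pair probabilities, the unseen ordering "Iraqi dispute weapons" scores lower than "Iraqi weapons inspection", which is the behavior the text describes.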
Even though the search engine and summary generator have been presented as two separate processes, there is no reason for these to be separate. In the case of the phrase discussed above, the overall weighting used in ranking can, as one possibility, be obtained as a weighted combination of the content and structure model log probabilities: Alpha * (Log(Pr("Iraqi" | "Iraqi" in doc)) + Log(Pr("weapons" | "weapons" in doc)) + Log(Pr("inspection" | "inspection" in doc))) + Beta * (Log(Pr("Iraqi" | start_of_sentence)) + Log(Pr("weapons" | "Iraqi")) + Log(Pr("inspection" | "weapons"))).
Using a combination of content selection models, language models of user needs and preferences, and summary parameters, a plurality of possible summaries, together with estimates of their desirability, is generated. These summaries are ranked in order of estimated desirability, and the most highly ranked summary or summaries are produced as the output of the system. Depending on the nature of the language, translation and other models, heuristic means may be employed to permit the generation and ranking of only a subset of the possible summary candidates in order to render the summarization process computationally tractable. In the first implementation of the system, Viterbi beam search was used to greatly limit the number of candidates produced. The beam search makes assumptions regarding the best possible word at the front position of a summary and, when considering the next position, will not undo the assumption concerning the first position. Other search techniques, such as A*, IDA* or SMA*, may be employed to comply with particular algorithmic or resource limitations. An example of the results of commanding the search to output the most highly ranked candidate for a variety of values of the summary length control parameter is set forth in the following table.
TABLE 7
Number of Words String
1 Iraq
2 United States
3 Iraq on Weapons
4 United States on Iraq
5 United States in latest week
6 United States in latest week on Iraq
7 United States on security cooperation in latest week
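The weighted combination of content and structure scores, together with a pruned search over candidates of a fixed length, can be sketched as follows. This Java example is a generic beam search written for illustration and is not the search class from the appendix; the class name BeamSearchSketch, the back-off value, the rule that a word is used at most once, and the "start_of_sentence" key format are all assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class BeamSearchSketch {

    static final double BACKOFF = -6.0;               // hypothetical back-off log probability
    static final String START = "start_of_sentence";  // assumed start marker

    static final class Hypothesis {
        final List<String> words;
        final double score;

        Hypothesis(List<String> words, double score) {
            this.words = words;
            this.score = score;
        }
    }

    /**
     * Builds a summary of the requested length one word at a time, keeping
     * only the beamWidth best partial summaries after each position.
     * Each added word w after previous word prev contributes
     * alpha * content(w) + beta * bigram("prev w").
     */
    public static List<String> search(List<String> vocabulary,
                                      Map<String, Double> contentLogProb,
                                      Map<String, Double> bigramLogProb,
                                      int length, int beamWidth,
                                      double alpha, double beta) {
        List<Hypothesis> beam = new ArrayList<>();
        beam.add(new Hypothesis(new ArrayList<String>(), 0.0));
        for (int pos = 0; pos < length; pos++) {
            List<Hypothesis> expanded = new ArrayList<>();
            for (Hypothesis h : beam) {
                String prev = h.words.isEmpty() ? START : h.words.get(h.words.size() - 1);
                for (String w : vocabulary) {
                    if (h.words.contains(w)) {
                        continue; // assumption: each word is used at most once
                    }
                    double step = alpha * contentLogProb.getOrDefault(w, BACKOFF)
                            + beta * bigramLogProb.getOrDefault(prev + " " + w, BACKOFF);
                    List<String> next = new ArrayList<>(h.words);
                    next.add(w);
                    expanded.add(new Hypothesis(next, h.score + step));
                }
            }
            // Keep the highest-scoring partial summaries; earlier choices are never revisited.
            expanded.sort((a, b) -> Double.compare(b.score, a.score));
            beam = new ArrayList<>(expanded.subList(0, Math.min(beamWidth, expanded.size())));
        }
        return beam.isEmpty() ? Collections.<String>emptyList() : beam.get(0).words;
    }
}
```

Calling search once per value of the length parameter yields one top-ranked candidate per length, which is how a table like Table 7 could be produced. A narrow beam makes the search fast but can discard a prefix that would have led to a better complete summary.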
The following computer code appendix contains code in the Java language to implement this invention. The UltraSummarise class contains the main function, which makes a summarizer object, loads a story, creates a search object and uses the Vocabulary class and story to produce a summary. The ViteriSearch class defines the meat of the operation. It takes the LanguageModel class, the TranslationModel class and the story and searches for strings having the highest probability of being used in a summary for the story. The LanguageModel class reads in a file which is a model for summaries containing the probabilities of each word following another. The TranslationModel class reads in a file containing the probabilities that a word will appear in a summary given words in the story. The Story class reads in the story. The Vocabulary class reads in a file that turns words into numbers. The computer code appendix is contained on twenty-two pages labeled a1-a22 and attached hereto on separate sheets.
Those skilled in the computer programming arts could implement the invention described herein in a number of computer programming languages. It would not be necessary to use an object oriented programming language such as Java.
As used in the following claims, a "summary string" is a derivative representation of the source document which may, for example, comprise an abstract, key word summary, folder name, headline, file name or the like.
Having thus defined our invention in the detail and particularity required by the Patent Laws, what is desired to be protected by Letters Patent is set forth in the following claims.
