Method and apparatus providing capitalization recovery for text

6922809

Abstract

A method for capitalizing text in a document includes processing a reference corpus to construct a plurality of dictionaries of capitalized terms, where the plurality of dictionaries include a singleton dictionary and a phrase dictionary. Each record in the singleton dictionary contains a word in lowercase, a range of phrase lengths m:n for capitalized phrases that the word begins, where m is a minimum phrase length and n is a maximum phrase length, and where each record in the phrase dictionary includes a multi-word phrase in lowercase. The method adds proper capitalization to an input monocase document by capitalizing words found in mandatory capitalization positions; and by looking up each word in the singleton dictionary and, if the word is found in the singleton dictionary, testing the corresponding phrase length range. If the phrase length range indicates that the word does not start a multi-word phrase, the method capitalizes the word, while if the phrase length range indicates that the word does start a multi-word phrase, the method tests the word and an indicated plurality of next words as a candidate phrase to determine if the candidate phrase is found in the phrase dictionary and, if it is, capitalizes the words of the multi-word phrase. If the candidate phrase is not found in the phrase dictionary, the method changes the number of words in the candidate phrase (e.g., decrements by one) to form a revised candidate phrase, and determines whether the revised candidate phrase is found in the phrase dictionary.

Claims

1. A method for capitalizing text in a document comprising:

Description

TECHNICAL FIELD
The titles dictionary 15D (or simply a titles list) may be a manually generated list of common titles, including Dr., Gov., Mr., Mrs., Ms., Pres., Prof., Rep., Rev., and Sgt. Clearly other titles could be added to this list. Which titles to include depends on the domain of the text being processed. The heuristic processing involves three rules for identifying additional abbreviations that do not appear in the abbreviations dictionary 15E or the titles dictionary 15D. First, if the word is a single letter followed by a period, the word is assumed to be a middle initial. Second, if the word matches the regular expression "([a-z]\.){2,}" (single letter followed by a period, repeated two or more times), the capitalization recovery system 10 assumes the word is an abbreviation (acronym). Third, if the word consists entirely of consonants followed by a period, the capitalization recovery system 10 assumes the word is an abbreviation. Using these resources, the capitalization recovery system 10 processes words that end in a period using the following algorithm:
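As an illustration of the three heuristic rules only (not of the complete period-handling algorithm), a minimal Python sketch follows. The function name `classify_period_word` and the returned labels are hypothetical, and treating "y" as a consonant is an assumption; the text does not say how it is handled.

```python
import re

# Rule 1: a single letter followed by a period (assumed to be a middle initial).
INITIAL = re.compile(r"[a-z]\.")
# Rule 2: letter-plus-period repeated two or more times (acronym such as "u.s.").
ACRONYM = re.compile(r"(?:[a-z]\.){2,}")
# Rule 3: consonants only, followed by a period.
# ASSUMPTION: "y" is counted as a consonant here.
CONSONANTS = re.compile(r"[bcdfghjklmnpqrstvwxyz]+\.")


def classify_period_word(word):
    """Classify a lowercase word ending in a period, or return None.

    The rules are tried in the order in which the text lists them.
    """
    if INITIAL.fullmatch(word):
        return "initial"
    if ACRONYM.fullmatch(word):
        return "acronym"
    if CONSONANTS.fullmatch(word):
        return "abbreviation"
    return None
```

In this sketch a word such as "st." is classified by the third rule, while "end." matches none of the rules because it contains a vowel.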
After processing punctuation, the capitalization recovery system 10 applies one additional capitalization heuristic unrelated to abbreviations. All forms of the pronoun I (i.e., I, I've, I'm, I'd, I'll) are always capitalized. Applying the techniques described thus far has been found to recover more than 36% of the capitalized words with better than 99% accuracy. In order to increase the coverage, the capitalization recovery system 10 uses more than just the punctuation cues provided in the text being capitalized. The first technique considered to increase the number of correctly capitalized words is the use of a capitalization frequency dictionary constructed from the training corpus 2. For each word in the training text that consists entirely of letters, the capitalization dictionary 15C stores the number of times the word occurs in each of the following forms: (1) all lowercase, (2) capitalized, (3) all uppercase, and (4) in a mandatory capitalization position. Items 1 through 3 can be collected in a straightforward manner from the training corpus 2. Unless the corpus has been annotated with sentence boundaries, item 4 is collected instead by estimating sentence boundaries. This is preferably accomplished by applying the same punctuation, title, and abbreviation processing described above. The capitalization dictionary 15C allows the capitalization recovery system 10 to estimate the probability that any given word should be capitalized. The probability that a word should be capitalized is estimated as: where l, c, u, and m are counts of the number of times each word in the training text occurs lowercased (l), capitalized (c), all uppercase (u), and in a mandatory capitalization position (m). As each word in the test text is processed, if it does not match any of the punctuation, abbreviation, or title rules, the capitalization recovery system 10 calculates the word's capitalization probability using the capitalization dictionary 15C. If this probability exceeds a specified threshold (e.g., 0.5), then the word is capitalized.
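The threshold test above can be sketched in Python. The estimating formula itself does not appear in this text, so the expression used below, p = (c + u - m) / (l + c + u - m), is an assumption: it discounts occurrences in mandatory capitalization positions, since those carry no information about whether the word is normally capitalized. The function names are hypothetical.

```python
def capitalization_probability(l, c, u, m):
    """Estimate the probability that a word should be capitalized.

    ASSUMED formula (not reproduced in the source text):
        p = (c + u - m) / (l + c + u - m)
    where l, c, u, m are the lowercase, capitalized, all-uppercase,
    and mandatory-position counts for the word.
    """
    informative_caps = max(c + u - m, 0)  # capitalized uses outside mandatory positions
    total = l + informative_caps
    if total == 0:
        return 0.0  # no informative occurrences at all
    return informative_caps / total


def should_capitalize(l, c, u, m, threshold=0.5):
    """Capitalize when the estimated probability exceeds the threshold."""
    return capitalization_probability(l, c, u, m) > threshold
```

Under this assumed formula, a word seen 8 times in lowercase and 6 times capitalized, 4 of them in mandatory positions, gets p = 2 / 10 = 0.2 and is left in lowercase.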
Using the capitalization dictionary 15C, the capitalization recovery system 10 was found in one experiment to be able to recover an additional 43% of the capitalized words, or 79% total, with an accuracy over 93%. Since the capitalization dictionary 15C contains information about most known common words, it may be safe to assume that any word (consisting entirely of letters) that does not appear in the capitalization dictionary 15C is most likely a named entity and should be capitalized. Adding this assumption to the processing brings the total coverage up to 82% with an accuracy of over 92%. At this point, the majority of the missed words that still require capitalization are words that can act as both common words and proper names, e.g., 'brown', which can be both a color and a surname. Proper capitalization of these words depends on the context in which they occur. The preferred approach to adding context processing to the capitalization recovery system 10 is to create the phrase dictionary 15B from the training corpus 2, and to incorporate the phrase dictionary 15B into the capitalization processing. In that a goal is to enable named entity extraction in case-deficient text using a Named Entity recognizer that relies on case, the same named entity recognizer may be used to create the phrase dictionary. The presently preferred Named Entity recognizer is one known as Textract (see IBM Intelligent Miner for Text, "http://www-4.ibm.com/software/data/iminer/fortext/" and Y. Ravin, N. Wacholder and M. Choi, "Disambiguation of Names in Text," Proc. of the Fifth ACL Conf on Applied Natural Language Processing, pp. 202-208, Washington D.C., 1997.) Textract operates to identify proper names, places, organizations, abbreviations, dates, and a number of other vocabulary items in text. Textract also aggregates variant lexical forms of the same concept and identifies a canonical form for the concept. 
For example, Textract might identify the canonical form "President George Washington" and associate with that form the variants "President Washington," "George Washington," and "Washington." The output from Textract is a vocabulary file containing a record for each identified concept that gives the canonical form, its variants, and frequency statistics for how often the concept occurs in the collection. After Textract has processed the training data, the resulting vocabulary file is filtered to generate the singleton or singles dictionary 15A (see FIG. 10) and the phrase dictionary 15B (see FIG. 11). For every concept that occurs in at least three documents, all of the multi-word variants (including the canonical form) with capitalized words are added to the phrase dictionary 15B and the first word in each phrase is added to the singles dictionary 15A as a phrase head. For each single word variant, if its capitalization probability (according to the capitalization dictionary 15C described earlier) is greater than 0.5, then it is added to the singles dictionary 15A as a singleton. The entry for a phrase head in the singles dictionary 15A includes the lengths of the shortest and longest known phrases started by the word. Singletons and phrases with unusual capitalization (where "usual" capitalization means only the first letter in each word is capitalized) have preferred capitalization forms stored in their respective dictionaries. The capitalization recovery system 10 uses these dictionaries as follows. For each word that does not match any of the punctuation, abbreviation, or title rules, the capitalization recovery system 10 looks up the word in the singles dictionary 15A. If the word is a phrase head, n-1 additional words are parsed from the input text (where n is the length of the longest known phrase started by the current word) and the phrase is used to probe the phrase dictionary 15B.
If the phrase is not found, it is shortened from the end one word at a time until it is either found or the capitalization recovery system 10 determines that the phrase is not in the phrase dictionary 15B. When a phrase is found in the phrase dictionary 15B, every word in the phrase is capitalized and processing continues with the next word after the phrase. If the initial probe of the singles dictionary 15A reveals that the current word is a singleton and not a phrase head, then the word is capitalized. In either case, if the capitalization recovery system 10 finds a preferred capitalization form in the singles dictionary 15A or the phrase dictionary 15B, the capitalization recovery system 10 uses that form rather than the usual capitalization. The set of singletons in the singles dictionary 15A is similar to the set of words in the capitalization dictionary 15C, with capitalization probabilities greater than 0.5. The differences are that the singletons in the singles dictionary are initially selected by the Textract Named Entity extraction process, the singletons may contain punctuation (e.g., hyphens or periods), and the singletons may have preferred unusual capitalization forms. For a final processing variant, the capitalization recovery system 10 may combine the singles and phrases dictionary processing with the capitalization dictionary 15C processing. If a word is not found in the singles dictionary, the capitalization recovery system 10 probes the capitalization dictionary 15C with the word. If it is found, the word is capitalized if its capitalization probability exceeds the probability threshold. If it is not found, and if the word consists entirely of letters, it is assumed to be a proper name that does not appear in the training data, and the word is capitalized. Having thus provided an overview of the processing performed by the capitalization recovery system 10, a more detailed description is now provided. Referring to FIG. 
2, the input is the text from the source 1 which is to be automatically capitalized. A word is defined to be any sequence of characters between the current position and the next space. To be able to use this definition of word appropriately, punctuation should be part of the word it annotates. For example, there should be no space between an opening double quote and the following word. To ensure that the text conforms to this, it is first processed by the preprocessor 50, which is shown in FIG. 2. There are three inputs to the preprocessor 50 as depicted in boxes 51, 52 and 53. Box 52 is the original text from which the appropriate spaces are to be removed, and the processed text T1 is returned at the end in box 63. The list L1 shown in box 51 is a list of punctuation marks which should follow a word without an intervening space. This list may include ''',.!( )[]? but is dependent, in general, on the language. For English, the list contains all characters which are neither a letter nor a number. One preferred embodiment of the preprocessor 50 is a finite state machine, and the state is initialized to 0 in box 53. Depicted in box 54 is the loop through the text, one character at a time. In box 55 it is checked whether the next character is null, meaning that the end of the input text was reached. In that case, the processed text T1 is returned in box 63. If the character is not null, the value of the state is tested in box 56. Depending on the value, different paths in the flowchart are followed. In the case where the state is 0, it is tested whether the current character is a space in box 61. In the case where the character is a space, the value of the state is changed to 1 in box 62 and the preprocessor 50 continues to get the next character in box 54. If the character is not a space (as tested in box 61), it is appended to the text T1 in box 60 and again the preprocessor 50 continues to get the next character in box 54.
Returning to the description of box 56, in the case where the value of the state is 1, it is tested in box 57 whether the current character is a member of the list L1. If it is not, a space (which is actually the previous character) is added to the text T1 in box 58. In both cases (whether the current character is a member of the list L1 or not), the preprocessor 50 continues at box 59 and sets the state to 0 before proceeding to box 60, where the current character is added to text T1. After finishing this task, the next character is examined in box 54.

The operation of the capitalization recovery system 10 (without phrase processing, which is shown in the separate flowchart of FIG. 8) is depicted in the flowchart of FIG. 3. The input to the capitalization recovery system 10 is the text depicted in box 52 of FIG. 2, and the input to the remainder of the capitalization recovery system 10 is the processed text T1 from box 63 that is output from the preprocessor 50. In box 410 the state of the finite state machine is set to 1. Box 420 depicts the beginning of the loop through all the words in the text. Each word is sent to the several subsystems introduced above and is modified (i.e., capitalized) if appropriate and then appended to a text string T2. When the last word is encountered, which is determined by a positive null test in box 430, the capitalization recovery system 10 returns the automatically capitalized text T2 in box 440. Otherwise, the word becomes the input to the punctuation processing in box 300. The output of the punctuation processing, shown in box 360, is three strings: S1, E1 and W_String. S1 and E1 can be empty and hold the potential punctuation at the beginning and the end of each word. The string W_String is the word with the non-essential punctuation stripped out and captured in the strings S1 and E1. The string W_String is examined to determine whether it is a title.
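The two-state preprocessor 50 of FIG. 2, as described above, can be sketched in Python as follows. This is a minimal illustration rather than the flowchart itself: the function name `preprocess` and the optional `attach` parameter (standing in for the list L1) are hypothetical, and the default follows the English rule that L1 contains every character that is neither a letter nor a number.

```python
def preprocess(text, attach=None):
    """Drop spaces that precede characters in the attach list (list L1).

    State 0: copy characters until a space is seen, then hold that space.
    State 1: emit the held space only if the current character is not in L1.
    """
    def in_l1(ch):
        if attach is not None:
            return ch in attach
        return not ch.isalnum()  # English default: any non-letter, non-number

    out = []
    state = 0
    for ch in text:
        if state == 0:
            if ch == " ":
                state = 1          # box 62: hold the space, emit nothing yet
            else:
                out.append(ch)     # box 60: copy the character
        else:
            if not in_l1(ch):
                out.append(" ")    # box 58: restore the held space
            state = 0              # box 59
            out.append(ch)         # box 60
    return "".join(out)
```

For example, `preprocess("hello , world !")` yields `"hello, world!"`. A space held at the very end of the input is dropped, because the loop ends while the machine is still in state 1.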
Note that each of these subsystems return the same information: If the title processing returns 'no', W_String is examined by the abbreviation processing subsystem in box 200. If the abbreviations subsystem 200 returns 'no', the same string is examined in box 500 by the single word subsystem which then continues to box 450. Box 450 is immediately reached when either the title processing or the abbreviation processing subsystems return 'yes'. In box 450, the string S1 is prepended and the string E1 is appended to the string W_String to form the string W. This string W is then appended to the text T2 followed by a space. Then the next word of the original text T1 is retrieved in the box 420 and the loop continues. The dotted line box 600 denotes the single word processing subsystem. The first subsystem invoked is the punctuation processing subsystem 300, as shown in FIG. 6. The input is shown in box 310 as the word w. Some other variables are also initialized in this box, including strings W1, S1 and E1. It should be noted at this point that the use of strings is one preferred embodiment, and that other representations could be employed as well. The string W1 is initialized to be identical to the string W, whereas S1 and E1 are empty strings. In box 345 the first character C of W is determined. In box 320 it is tested whether C is null, indicating that the end of the string was reached, in which case the capitalization recovery system 10 continues to box 365. Otherwise, it is checked in box 325 whether it is a letter or number (for English or other characters in different languages). If it is not a letter or a number, the capitalization recovery system 10 continues to box 330. There the first character of the string W1 is removed and the string W1 is now assigned this new value. The character C is appended to the string S1. In box 335 the next character of the word W is determined and the loop continues in box 320. 
On the other hand, if the character C is a letter or number (as tested in box 325), the capitalization recovery system 10 proceeds to box 340 where the last character C of the word W is determined. The character is tested in box 345 and if it is null, indicating that the end of the string was reached, it proceeds to box 365. Otherwise, C is tested for being either a letter or number in box 350. In the case that C is a letter or number the capitalization recovery system 10 proceeds to box 365. Otherwise, the last character of the string W1 is removed and W1 is set to this new value in box 355 and the character C is added to the beginning of string E1. In box 360 the previous character of the string W is determined before continuing in box 345. Different paths through the flowchart end in box 365, at which point there are three strings: S1 and E1, containing punctuation, and W1, the word itself. However, a period may be either punctuation or a part of the word itself (as in abbreviations or titles). Hence, in box 365 it is checked whether the first character in E1 is a period. If that assertion is true, the period is appended to W1 in box 370 and removed from the beginning of E1. After box 370, or if the assertion concerning the period is false, the capitalization recovery system 10 continues in box 375 where the three strings W1, E1 and S1 are returned.

FIG. 4 is a flowchart that is descriptive of the title processing subsystem 100. The input to this subsystem is a word w and a state s which is shown in box 100. The other input is the title dictionary 15D, shown in box 110. One preferred embodiment of the title dictionary 15D is a text file, and the content of the title dictionary 15D is language dependent. For English, the title dictionary 15D may contain Dr., Prof., Mr., Mrs., Gen., to mention a few. The title dictionary 15D is denoted as T_Dict. In box 120 it is tested whether the input word is a member of T_Dict.
In the case where it is not, the title processing subsystem is done with its processing and continues to box 130 where the return values are prepared: the unchanged input word w and state s and the answer 'no'. On the other hand, if the word is a title, it is capitalized in box 140 (i.e., the first letter is capitalized) and the state is set to 1. In box 150 the return from the title processing subsystem is prepared: the capitalized word w, the state (value is now 1) and the answer 'yes'.

FIG. 5 depicts the operation of the abbreviations processing subsystem 200. The input shown in box 210 is a word w and the current state of the capitalization recovery system 10. In box 215 it is tested whether w is a single letter followed by a period. If the test is positive, w is capitalized in box 220 and the state is set to 1 before proceeding to box 265 where the answer is set to 'yes'. Processing then continues to box 275 where the word, the state and the answer are returned from the abbreviations processing subsystem. In the case where the test in box 215 is negative, the word is looked up in the abbreviations dictionary 15E in box 225. The abbreviations dictionary 15E (ABB_Dict), shown in box 280, is in a preferred embodiment a text file and contains a set of abbreviations and, where appropriate, the preferred spelling for each abbreviation. For example, the preferred spelling for an abbreviation may be all letters being capitalized instead of only the first letter. In the case where the word w is found in the abbreviation dictionary 15E, the capitalization recovery system 10 continues in box 230 where the preferred spelling of the abbreviation is determined. In box 235 it is checked whether the current state is 0, in which case the capitalization recovery system 10 continues to box 265. Otherwise, the word is capitalized (i.e., the first letter is capitalized) and the state is set to 0 in box 240 before continuing to box 265.
If the word is not in the abbreviations dictionary 15E, it is checked whether it has a certain pattern. The pattern described here in box 245 is for English; however, other patterns for either English or other languages could be employed. The pattern tested here is of the form: letter followed by a period which is repeated at least twice. If it is of that form, all letters of the word are capitalized in box 250 and the state is set to 0 before continuing to box 265. If the word does not satisfy the first mentioned pattern in box 245, it is checked for a different pattern in box 251. In this case a determination is made as to whether the word consists only of consonants followed by a period. If this is true, the state is examined in box 255, and if the state is 0 the capitalization recovery system 10 proceeds to box 265, otherwise processing continues at box 260 where the word is capitalized and the state is set to 0 before going to box 265. If the answer is negative in box 251, the answer is set to 'no' in box 270, and the word and the state are identical to the input. The final answer is returned in box 275.

The next subsystem to be described is the single word processing subsystem 500 shown in FIG. 7. The input is a word w and the current state of the capitalization recovery system 10. The conventions are as follows: when state is equal to 1, the next word is capitalized, and a capitalized word has its first letter capitalized, and the rest of the characters can be either lower or upper case. In box 515 it is checked whether the last character is a period. If 'yes', the period is removed from the word and S1 is set to the period in box 520. Otherwise the capitalization recovery system 10 proceeds to box 535. In box 525 it is checked whether the word ends with the string "'s" (apostrophe s), in which case these two characters are removed from the word and prepended to string S1 in box 530. It should be noted that testing the word for "."
and "'s" are English language-specific, and for other languages a different end of sentence punctuation mark could be substituted for the ".", and a different string than "'s" could be substituted if appropriate. In box 535 it is checked whether the word w is in the singles dictionary 15A. The singles dictionary 15A contains the words which should be capitalized in the text and, if the capitalization is different than only capitalizing the first letter, a preferred spelling is included. If the word w is in the singles dictionary 15A its preferred spelling is looked up in box 545. Otherwise the capitalization recovery system 10 continues to box 540 for (language dependent) algorithms for capitalization. For English, the following two algorithms are appropriate, but not exclusive: 1) if the word begins with "mc", the first letter and the character following the mc are capitalized; and 2) if the word is hyphenated, each word by itself is looked up in the singles dictionary 15A and the same rules as just described apply to each of the words separately before recombining them with a hyphen. In box 555 the state is examined and if it is 1 the method continues to box 550 where the word is capitalized. Recall that a word is also capitalized when it is in the singles dictionary 15A and, hence, after box 545. If the state is 0, or after the word has been capitalized, the capitalization recovery system 10 proceeds to box 560 where the start of the string S1 is checked. If it starts with a period or with the string "'s.", the capitalization recovery system 10 proceeds to box 565 where the state is set to 1, otherwise processing proceeds to box 570 where the state is set to 0. After the state has been set correctly, the capitalization recovery system 10 continues to box 575 where the string S1 is appended to word W before returning word W and the current state.

FIG. 8 is a logic flow diagram that extends the foregoing method so as to include phrase processing.
The method of FIG. 8 shares components with the capitalization system described above. In the same fashion, the input text is shown in box 52. The preprocessor 50 adjusts the space between words as previously described to produce the adjusted text T1 shown in box 63. The state of the system is initialized to 1 in box 410. The first word is obtained in box 710 and it is tested for being null (indicating that the end of the input text was reached) in box 715. At the end of the input text, the properly capitalized text is returned in box 780. Otherwise, it is checked whether the word W is in the singleton dictionary 15A in box 720. In one preferred embodiment, the singleton dictionary 15A is structured as shown in FIG. 10. As was mentioned previously, there are several items of information associated with each word 1020 in the singleton dictionary 15A. More specifically, there are two numbers denoting the minimum number (first range) of words of a phrase which starts with the word w (field 1030) and the maximum number (second range) of words in a phrase starting with this word (field 1040). Furthermore, a preferred spelling (1050), if it exists, is also associated with the word. If the word W is not found in the singleton dictionary 15A it is sent to the single word processing subsystem which has been previously described in reference to FIG. 7. However, if the word is in the singleton dictionary 15A, the system proceeds to box 725. There, the first range (1030) and the second range (1040) for the word are retrieved. In box 730 it is tested whether the second range 1040 is 1, which would indicate that the word is not the beginning of a phrase. In that case the processing of the capitalization recovery system 10 proceeds to box 600. If the second range 1040 is a number N greater than 1, the next N-1 words are retrieved from the text in box 735 and a phrase string is assembled by concatenating the N words with spaces in between.
This phrase string becomes the input to the phrase processing subsystem which is described in greater detail in FIG. 9. The output of the phrase processing subsystem is a phrase string and a number n. The number n indicates how many words were used to assemble a properly capitalized phrase, and the phrase string contains the capitalized string of words. In case the number n is 0, indicating that no appropriate phrase was found, the system continues with box 600, the single word processing subsystem shown in FIG. 7. Otherwise a counter j is initialized to 0 in box 755 and the next word obtained in box 760. This word is tested in box 765. If it is null the phrase processing is completed and the final text returned in box 780. Otherwise, the counter j is incremented by 1 in box 770. In box 775 the counter is tested against the size N of the found phrase. If these numbers are equal, the system continues in box 720, otherwise it obtains the next word in box 760. After a word is processed by the single word processing subsystem in box 600, it continues to box 785 where the counter is initialized to 1 before getting the next word in box 760.

The operation of the phrase processing subsystem 800 is illustrated in FIG. 9. The input is a phrase which is a set of words 810 and the number n which denotes the number of words in the phrase. In one preferred embodiment, a phrase is a string of characters and, as such, it is input to the punctuation processing subsystem 300 of FIG. 6. The punctuation processing subsystem 300 was previously described as taking a word as input; however, a phrase can be viewed for this purpose as a word with embedded spaces. The output of the punctuation processing subsystem 300 is shown in box 815, and contains strings S1 and E1 (the punctuation at the beginning and at the end of the phrase) and the remaining phrase (PH_String).
In box 820 the output is tested as to whether the string ends with a period, in which case a period is prepended to the string E1 and the period is removed from PH_String. In box 830 a test is made as to whether the ending string "'s" (apostrophe s) is present, in which case the string is prepended to E1 and the string "'s" is removed from PH_String. In the next box 840 a test is made as to whether PH_String is in the phrase dictionary 15B. In one preferred embodiment, the phrase dictionary 15B is structured as shown in FIG. 11. The fields 1110 of the phrase dictionary 15B include fields holding phrases 1120 and corresponding fields 1130 holding the preferred spelling, and hence the preferred capitalization for the phrases (if it differs from the standard capitalization where each word in the phrase is capitalized). If PH_String is found in the phrase dictionary 15B, the method executed by the capitalization recovery system 10 proceeds to box 870 where a check is made as to whether there is a preferred capitalization of PH_String (as indicated in the field 1130). In case there is, PH_String is set to this preferred capitalization in box 880. Otherwise, each word of PH_String is capitalized in box 875. Next, in box 845 it is checked whether E1 starts with "." or "'s" (apostrophe s) in which case the capitalization recovery system 10 proceeds to box 847 to set the state to 1. Otherwise processing continues at box 846 to set the state to 0. The string S1 is prepended to PH_String in box 885, while E1 is appended. The phrase string PH_String, the number n and the state are returned. If the phrase string is not found in the phrase dictionary 15B in box 840, the number n is decreased by 1 in box 850. In box 855 it is tested whether this number is 1, in which case the phrase string is set to be an empty string and n is set to 0 before returning. 
On the other hand, if n is not 1, the processing of capitalization recovery system 10 continues in box 865 where S1 is prepended to PH_String and the last word is removed before starting the loop again in box 300.

FIG. 10 shows the organization of the singleton dictionary 15A, discussed above. The singleton dictionary 15A contains one or more entries 1010, where each entry consists of a term string 1020 (the term in all lower case), the length of the shortest phrase that the term heads 1030, the length of the longest phrase that the term heads 1040, and the optional preferred spelling 1050 for the term, if the capitalized form of the term is other than the default (first letter uppercase, rest lowercase). If the term does not head a phrase, then fields 1030 and 1040 are one. The singleton dictionary 15A may be loaded into a hash table or other suitable data structure for rapid lookup of terms.

FIG. 11 shows the organization of the phrase dictionary 15B, also discussed above. The phrase dictionary 15B contains one or more entries 1110, where each entry includes the phrase string 1120 (the phrase in all lower case) and the optional preferred spelling 1130 for the phrase if the capitalized form of the phrase is other than the default (for each word in the phrase the first letter is uppercase and the rest are lowercase). The phrase dictionary 15B may also be loaded into a hash table or other suitable data structure for rapid lookup of phrases.

FIG. 12 illustrates a method for constructing the singleton dictionary 15A and the phrase dictionary 15B. Properly capitalized training text 1210 is input to a dictionary build process and sent to two subprocesses. Subprocess 1220 runs, for example, a conventional Named Entity Extraction system on the text to extract named entities. Named entities include proper names, people, places and similar items, and each named entity may consist of one or more terms.
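The record layouts of FIG. 10 and FIG. 11 and the longest-first phrase probing of FIG. 9 can be sketched together in Python. This is a simplified illustration under stated assumptions: the names `SingletonEntry`, `default_caps` and `match_phrase` are hypothetical, both dictionaries are plain hash tables keyed by lowercase strings, and the punctuation, period and "'s" handling of boxes 815 through 830 is omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SingletonEntry:
    min_len: int                     # field 1030: shortest phrase the term heads
    max_len: int                     # field 1040: longest phrase the term heads
    preferred: Optional[str] = None  # field 1050: optional preferred spelling


def default_caps(phrase):
    """Default capitalization: first letter of each word in uppercase."""
    return " ".join(w[:1].upper() + w[1:] for w in phrase.split(" "))


def match_phrase(words, start, singles, phrases):
    """Probe the phrase dictionary with the longest candidate first.

    Returns (capitalized_phrase, n_words), or (None, 0) when no phrase is
    found, in which case the caller falls back to single word processing.
    """
    entry = singles.get(words[start])
    if entry is None or entry.max_len <= 1:
        return None, 0
    n = min(entry.max_len, len(words) - start)
    while n > 1:
        candidate = " ".join(words[start:start + n])
        if candidate in phrases:
            # Use the preferred spelling (field 1130) when one is stored.
            return phrases[candidate] or default_caps(candidate), n
        n -= 1  # shorten the candidate from the end by one word
    return None, 0
```

With hypothetical entries `singles = {"new": SingletonEntry(2, 3)}` and `phrases = {"new york": None, "new york times": None}`, probing the words "the new york city council" at index 1 first tries "new york city", fails, and then matches "new york".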
In practice, almost every capitalized term or phrase that does not appear in a mandatory capitalization position (e.g., start of sentence) is a named entity. Subprocess 1230 counts the number of times each word in the training text 1210 occurs lowercased (l), capitalized (c), all uppercase (u), and in a mandatory capitalization position (m). These counts are then used to compute a capitalization probability p for each word using the above-mentioned formula: In step 1240, the named entities extracted in step 1220 are filtered. All named entities that occur in fewer than, for example, three documents are discarded, and all single-term named entities with capitalization probability (from step 1230) less than, for example, 0.5 are discarded. These values may be varied as a function of the nature of the reference corpus, and based on other criteria. The named entities that survive this filtering are stored into the singleton dictionary 15A at step 1250 and into the phrase dictionary 15B at step 1260. The inventors have thus described their capitalization recovery system 10 as applying a series of heuristic, statistical, and dictionary-based processing steps to recover capitalization. Experimental results have shown that the capitalization recovery system 10 is both effective and robust across a variety of text genres and training conditions. Optimum performance is found to be achieved when a suitable training corpus is available, but for most applications this is not overly burdensome, since properly capitalized text is usually readily available. Unlike other applications, such as named entity recognition or document classification, the training data does not require manual labeling. 
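The filtering and storage steps 1240 through 1260 of FIG. 12, described above, can be sketched in Python. The input shapes are assumptions made for illustration: each named entity is given as a list of properly capitalized variants plus a document count, and `cap_prob` maps lowercase words to the probabilities from step 1230. The function name `build_dictionaries` is hypothetical, as is the choice to record a term that is both a singleton and a phrase head with a minimum length of one.

```python
def default_caps(text):
    """Default form: first letter of each word uppercase, rest lowercase."""
    return " ".join(w[:1].upper() + w[1:].lower() for w in text.split(" "))


def build_dictionaries(entities, cap_prob, min_docs=3, min_prob=0.5):
    """Filter named entities into a singleton and a phrase dictionary.

    entities: iterable of (variants, document_count) pairs (assumed shape).
    Returns (singles, phrases), both keyed by lowercase strings.
    """
    singles, phrases = {}, {}
    for variants, doc_count in entities:
        if doc_count < min_docs:
            continue  # discard entities seen in too few documents
        for variant in variants:
            key = variant.lower()
            # Store a preferred spelling only when it differs from the default.
            preferred = variant if variant != default_caps(variant) else None
            words = key.split(" ")
            if len(words) > 1:
                phrases[key] = preferred
                head = singles.setdefault(
                    words[0],
                    {"min": len(words), "max": len(words), "preferred": None},
                )
                head["min"] = min(head["min"], len(words))
                head["max"] = max(head["max"], len(words))
            elif cap_prob.get(key, 0.0) >= min_prob:
                # Single-term entities below min_prob are discarded.
                entry = singles.setdefault(
                    key, {"min": 1, "max": 1, "preferred": None}
                )
                entry["min"] = 1  # ASSUMPTION: a singleton lowers the minimum to one
                entry["preferred"] = preferred
    return singles, phrases
```

Both thresholds are parameters because, as noted above, the values may be varied with the nature of the reference corpus.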
In addition to the applications discussed above for the capitalization recovery system 10, another potential application of these teachings is local document analysis, where dictionaries and statistics are modified on a per document basis as each document is processed, allowing the system to account for the occurrence of named entities that alter capitalization probabilities for the common words in those named entities. It is also contemplated that the operation of the capitalization recovery system 10 may be improved by the use of richer statistical models, such as Hidden Markov Models, that incorporate additional features (e.g., context) into the capitalization probability calculation. Thus, while these teachings have been particularly shown and described with respect to preferred embodiments thereof, it will be understood by those skilled in the art that changes in form and details may be made therein without departing from the scope and spirit of these teachings.
