Information processing system for compaction and replacement of phrases4773039Abstract An information processing system is disclosed which provides a writer with acceptable replacement phrases to substitute for trite phrases in a manuscript text. The replacement phrases are grammatically equivalent to the trite phrases and can be immediately inserted into the text without further alteration. Claims What is claimed is: Description BACKGROUND OF THE INVENTION
TABLE I
______________________________________
Source phrase = "We are not about to"
Source constant = "not about to"
Source variable = the various forms of the verb "be"
Pronoun Verb Verb Form of "be" Rank
______________________________________
I am first person, singular, present tense
1st
We are first person, plural, present tense
2nd
You are second person, singular, present tense
2nd
They are third person, plural, present tense
2nd
He is third person, singular, present tense
3rd
She is third person, singular, present tense
3rd
It is third person, singular, present tense
3rd
I was first person, singular, past tense
4th
He was third person, singular, past tense
4th
She was third person, singular, past tense
4th
It was third person, singular, past tense
4th
You were second person, singular, past tense
5th
We were first person, plural, past tense
5th
They were third person, plural, past tense
5th
______________________________________
The variable replacement word element 34 of FIG. 1 symbolically represents all of the parts of speech for a corresponding replacement verb (continuing with the same example) in the replacement phrase. (The variable replacement word element 34 can also symbolically represent all of the parts of speech for pronouns, pronoun-verb combinations, verb phrases, regular verb endings, and other grammatical elements and combinations.) The variable replacement word element 34 serves as an address pointer to a second table called the replacement table 48 shown in FIG. 1, which is stored at an addressable location in the memory 14 and which contains all of the forms of the symbolically represented replacement verb. These verb forms are called values of the variable replacement word element 34. The plurality of replacement verb forms are arranged into a plurality of ranks 40', 42', 44', 46' and 47' shown in FIG. 1. These ranks have a grammatically significant sequence with the verb form in each rank of the replacement table 48 being grammatically equivalent to the verb form in a corresponding rank of the source table 38. This is illustrated in Table II and in FIG. 2. Table II shows the replacement phrase "do not intend to" and its preceding pronoun "we" analyzed into its various grammatical forms. The verb forms for the verb "do" include the verbs "do," "does," and "did." The various parts of speech for these verb forms are shown in Table II. These various parts of speech are ranked in the same order as the corresponding rankings for the parts of speech shown for the source phrase in Table I. The first, second, third, fourth and fifth ranks shown in Table II occupy ranks 40', 42', 44', 46' and 47', respectively, of the replacement table 48 shown in FIG. 2. A plurality of replacement tables 48, 48', 48", etc. is stored at addressable locations in the memory 14, each table corresponding to a different family of verb forms, pronoun forms and other parts of speech, as is represented by step 64 in the flow diagram of FIG. 3.
TABLE II
______________________________________
Replacement phrase = "We do not intend to"
Replacement constant = "not intend to"
Replacement variable = the various forms of the verb "do"
Pronoun Verb Verb Form of "do" Rank
______________________________________
I do first person, singular, present tense
1st
We do first person, plural, present tense
2nd
You do second person, singular, present tense
2nd
They do third person, plural, present tense
2nd
He does third person, singular, present tense
3rd
She does third person, singular, present tense
3rd
It does third person, singular, present tense
3rd
I did first person, singular, past tense
4th
He did third person, singular, past tense
4th
She did third person, singular, past tense
4th
It did third person, singular, past tense
4th
You did second person, singular, past tense
5th
We did first person, plural, past tense
5th
They did third person, plural, past tense
5th
______________________________________
In response to a command entered by the writer at the keyboard 10, the execution unit 16 of the computer will compare first target words 24 in an alpha-numeric string from the input word stream 12 with the constant source word elements 32 in each of the plurality of phrase-pair expressions 28, 28' and 28", as is represented by step 66 of the flow diagram of FIG. 3. The constant source word element 32 in the phrase-pair expression 28 is the constant portion of the trite phrase which does not change when the phrase is used in various parts of speech. This constant portion can be a single word or a sequence of words in an alpha-numeric string which is compared with the alpha-numeric strings in the input word stream 12. Each phrase-pair expression 28, 28', 28", etc. in the memory 14 is accessed for the comparison and eventually a match is found between the constant portion of a trite phrase in one of the phrase-pair expressions 28 and the target word or sequence of words 24 in the input word stream 12. This is represented in step 68 of FIG. 3. As is shown in the example of FIG. 2, the first target string 24 is the phrase "not about to" which is matched with the source constant 32 in the phrase-pair expression 28. The variable source word element 30 in FIG. 1 in the trite phrase portion of the matched phrase-pair expression 28, is then used as an address pointer to the source table 38 which contains all of the forms of the symbolically represented verb in the trite phrase. This source table 38 is accessed as is represented by step 70 of the flow diagram of FIG. 3. Each of the verb forms in the source table 38 is then compared with a second target word or sequence of words in a string 26 in the input word stream 12, which is proximate to the first target words 24 found to match the constant portion of the trite phrase. This is represented by step 72 in the flow diagram of FIG. 3. If a match is found with one of the verb forms in the source table 38, then an actual trite phrase 22 has been located in the input word stream 12. In the example shown in FIG. 2, the second target string 26 is the word "are" which matches with the verb form "are" in the second rank 42 of the source table 38. The matched phrase 22 will be highlighted on the display screen 18 of the computer as a trite phrase which is a candidate for replacement. The invention then proceeds in step 74 of the flow diagram of FIG. 3, to generate the grammatically equivalent replacement phrase by identifying the grammatically significant rank of the trite verb form in the source table 38 which is equal to the matched, second target word 26. This is the second rank 42. Then, the replacement table 48 is accessed, as is represented by step 76 in the flow diagram of FIG. 3. This is accomplished by using the variable replacement word element 34 in the replacement phrase portion of the matched phrase-pair expression 28 of FIG. 1, as an address pointer to the replacement table 48 which contains all of the forms of the symbolically represented verb in the replacement phrase. The replacement table 48 is accessed and the replacement verb form "do" in the second rank 42' of the table 48 is the replacement verb form which corresponds to the grammatically significant second rank 42 which was previously identified in the source table 38. Step 78 of the flow diagram of FIG. 3 represents the selection of the replacement verb for the replacement phrase. FIG. 1 shows that an output replacement phrase 50 consisting of a replacement value 52 and a replacement constant 54, is constructed from the replacement verb selected from the rank 42' of the replacement table 48 and the constant replacement word element 36 in the matched phrase-pair expression 28. The constant replacement word element 36 in the phrase-pair expression 28 is the constant portion of the replacement phrase which does not change when the phrase is used in its various parts of speech. This constant portion can be a single word or a sequence of words in an alpha-numeric string, to which is added the replacement verb from the replacement table 48, to form the output replacement phrase 50, as represented by step 80 of the flow diagram of FIG. 3. The output replacement phrase 50 is then displayed on the display screen 18 to the writer. The writer can then decide whether he wants to substitute the replacement phrase 50 for the highlighted trite phrase 22. If he desires to make the substitution, the writer enters the command at the keyboard 10 and the alpha-numeric string comprising the replacement phrase 50 is substituted for the first and the second target words or multiple word strings 24 and 26 in the input word stream 12. Alternately, the substitution of the replacement phrases into the manuscript text can be done automatically without requiring the writer's further intervention. The output word stream 20 can then be directly stored as a modified manuscript text in the memory 14 or disk storage 17. In this manner, replacement phrases which are grammatically equivalent to the trite phrases to be replaced, can be immediately inserted into the text without further alteration. An additional advantage of the invention is the significant data compaction which is achieved for the trite phrases and the replacement phrases being stored in the memory 14. By representing the family of trite phrases and their corresponding replacement phrases with a phrase-pair expression 28, a source table 38 and the replacement table 48, a significant reduction in the memory space required for the storage of such phrases, is achieved. The second embodiment of the invention shown in FIGS. 4-9 builds upon the principle of operation of the first embodiment which has been described in conjunction with FIGS. 1-3. Improvements in the second embodiment of the invention include the use of match terms to identify the source phrase in the input word stream and the use of hash encoding of a focus word or words in the source constant of the phrase-pair expression to increase the speed of comparison with the input word stream. The second embodiment of the invention is described as follows. DESCRIPTION OF THE SECOND EMBODIMENT OF THE INVENTION The second embodiment of the system for compaction and replacement of phrases is shown in FIGS. 4-9. The system provides for the automatic replacement of words and phrases in specific linguistic context such as automatic translation, the replacement of improper grammatical phrases, and term substitution with definitions. Linguistic compaction is achieved by the invention through the use of symbolic expressions and grammatical completeness is achieved through the use of relational tables. DESCRIPTION OF COMPACTION AND DECODING PROCESS Compaction and decoding of phrases consists of three stages: 1. linguistic codification 2. phrase compaction 3. phrase decoding 1. Procedure for Linguistic Codification of the Phrases The linguistic codification of the phrases is best done manually to create files that can be understood and maintained easily. The linguistic codification requires recognizing the elements of the language with which correspondences need to be established and defining the tables used to generate conjugations and other linguistic variants. One begins with a compendium of cliches and trite phrases. Many of the phrase are associated with a preferred usage. For example, "have the ability to" can be more concisely stated as "can" or "be able to." The phrase file is the compilation of phrase-pair expressions 28, encoded to allow variants of a word by reference to list names. The "at" symbol (@) is used as a list name symbol at the beginning of those words in the phrase file which have reference lists of variants or alternate parts of speech for the word in source tables 38 or in replacement tables 48. A typical phrase file entry is: @have the ability to=@can, @be able to This entry is a phrase-pair expression 28 which references the lists of words with the names "@have," "@can,"and " @be"each consisting of four entries:
______________________________________
@have @can @be
______________________________________
have can am, are, be
has can is
had could was, were
having ## being
______________________________________
The @have list is a source table 38 and the @can and @be lists are replacement tables 48. The first line of each list contains the infinitive/present form of the verb, the second line the third person form, the third line has the past tense forms, and the fourth line the present participle form. A null entry is indicated by a double pound sign. The mapping of the phrase to its replacement is accomplished on the basis of the correspondences of these lists. Phrase-pair expression 28 entries of the phrase file consist of two parts the source <PHRASE> and the <REPLACEMENT> phrase separated by an equal sign. The following Backus-Naur Form (BNF) description defines the format of the entries. Quoted strings represent literals; lower case names represent terminal symbols. (See Naur, P., "Revised Report on the Algorithmic Language ALGOL 60," Communications of the ACM, 6 (January 1963), 1-17)).
______________________________________
The format of an ENTRY is <PHRASE> "="
<REPLACEMENT>,
where <PHRASE>
is <WORD> .vertline. <PHRASE>
" " <WORD> or <STRLIST> .vertline.
<PHRASE> " " <STRLIST>'
<REPLACEMENT> is <WORD> .vertline.
<REPLACEMENT> " " <WORD>,
<WORD> is @list
or string
or string+@list
or string(ending) and
<STRLIST> is string or strlist.
The symbol "=" is the "equal sign" character. The symbol
".vertline." means "or." The symbol " " is a blank character.
______________________________________
As an example the source phrase "@have low .vertline. high value(s)" will match against the following alpha-numeric strings in the input word stream:
__________________________________________________________________________
have low value
has low value
had low value
having low value
have low values
has low values
had low values
having low values
have high value
has high value
had high value
having high value
have high values
has high values
had high values
having high values
__________________________________________________________________________
Similarly, the use of ending lists allows a very compact form in the phrase file to match a large number of input words. The list "@elr," for example, contains the endings for regular verbs and allows "work+@elr" to match against "work," "works," "worked," and "working." 2. Phrase Compaction Process The phrase compaction is illustrated in FIG. 4. In step 101 the linguistically codified phrase-pair expressions 28 are read by the program. The word in the constant source word element 32 which is to be used as an index key (the focus word) is identified and isolated in step 102. In step 103, the invariant attributes of the key are used to construct a series of bit patterns that are highly characteristic of the key. The bit patterns for all the keys of the file are superimposed in a hash screen table. Although the superimposed bit patterns are not unique for any particular key, the presence of all the bits for a particular term indicates that the term has a high probability of being a key in the phrase table. (Examples of suitable hashing techniques are given in Bloom, B. H. "Space/Time Trade-Offs in Hash Coding with Allowable Errors," Communications of the ACM, 13(7), 1977, pages 422-426 and also in Murray, D. M., "A Scatter Storage Scheme for Dictionary Lookups," Journal of Library Automation, Vol. 3/3, September 1970, pp. 173-201.) The index key is also used in step 104 to sort the phrase-pair expressions in the phrase file, to organize it for efficient retrieval. In step 105 the source phrase portion of phrase-pair expressions 28 are decomposed into terms used for matching during the decoding stage. The match terms have to include a description of positional constraints (such as adjacency of words) and an indication of special matching requirements (references to ending lists, alternate terms, etc.). Step 105 also encompasses the encoding of the match terms based on the frequencies of the characters in the text. Frequency-based compaction reduces storage requirements significantly, although not as much as linguistic codification. Finally, in step 106, the encoded phrase file and the hash screen is written to an output file which is used during the decoding stage. 3. Process for Decoding Phrase Tables FIG. 5 illustrates the steps required for decoding the phrase-pair expressions 28 in the phrase file. The purpose of the process is to identify target phrases within the input word stream that match against the source phrase portion of phrase-pair expressions 28 and provide replacement alpha-numeric strings which can be synonyms, foreign language translations, or grammatically equivalent replacement phrases which can include annotations. Step 107 is the initial step of scanning the input word stream and identifying words and punctuation. Each word pair and word of the input word stream is hashed in step 108 using the same procedure that was used during the creation of the compact phrase tables. In step 109 the hash codes for the word pairs or word from the input word stream are tested against the hash screen table. If all the bits of the hash code are found to be "on" in the hash screen table, the word pair or word from the input word stream is presumed to be in the phrase file and processing continues, otherwise processing continues with words from the input word stream looking for a match. Once the hash screen has been successfully matched, step 110 accesses the compacted phrase file and reads the records containing the key term. A character-by-character comparison is used in step 111 to determine whether the key term in the phrase file actually matches the word from the input word stream. The word may not match because of a false "collision" in the hash screen table; in this case the process goes back to step 108 and tries another word from the input word stream. After the key term has matched, the variable source word 30 associated with the key term is matched. This requires referencing the source table 38 containing the term lists in step 112 and matching source word element values by suffix substitution. The final matching procedure in step 113 consists of applying rules for matching adjacent words of the phrase. If any matching requirement fails, processing resumes at step 108. Finally, when all terms have matched properly, the replacement phrase associated with the matching phrase is uncompacted. Processing resumes from step 108 until all the input word stream has been processed. FIG. 6 illustrates the process for converting the linguistically encoded phrase file into the bit screen table and match terms that make it possible to match efficiently against input text. Box 115 identifies a linguistically encoded phrase where the symbolic variable @BE1 represents forms of the verb "be" and the variable @DO represents forms of the verb "do." The phrase is encoded as an equation where the left side specifies the matching constraints and the right side specifies the corresponding replacement. Box 116 identifies the word pair "not about" as the index key. The word "about" is preceded by an asterisk to indicate that it is the word indexed. Box 117 contains the entries defining the symbolic variables. The entries of each symbolic variable have a one-to-one correspondence based on grammatical constraints (person, in this case). Box 119 contains the match terms derived from the left side of the phrase in box 115. Box 119 contains the relative word numbers and values required to effect a successful match; the replacement used after a successful match is given within parentheses next to the index word. The decoding process starts by isolating the words of the input word stream, hashing them, and testing the hash codes against the bit screen. When the match against the bit screen is successful, the rules for the index word are retrieved and applied against the input sentence. Box 120 in FIG. 6 identifies a word pair which has been successfully matched against the bit screen in box 118. Since the word "about" is the current word, the words around it are given relative numbers as indicated under box 120. The match terms in box 119 are applied to determine if the phrase matches. These terms are ordered so that the ones that require the least amount of effort for matching are checked first. First, the process checks to the left of the word "about" for the word "not" which is in the "-1" position (i.e., one word to the left of "about"). The next check is for the word "to" to the right of "about," and the last check is for the symbolic variable "@BE1" two words to the left of "about." Checking symbolic variables involves consulting their definition (box 117) and keeping track of the relative position of the match. In this case the word "is" matches the second line of the definition for @BE1. Having matched successfully, the process generates the replacement phrase from the parenthetical expression in box 119. Since this expression has a symbolic variable, the process retrieves a term corresponding to the same relative position as the term that matched. The second word of the "@DO" variable is "does" and the replacement phrase is "does not intend to" which is given in box 122. The system for compaction and replacement of phrases finds its preferred application in a host data processing system such as that shown in FIGS. 7, 8 and 9. FIG. 7 is a system diagram of the host data processing system. The host data processor 130 is connected through a terminal controller 134 to a plurality of workstations 136, 136A and 136B. The host data processor 130 is also connected to a bulk storage unit 133. The system configuration of FIG. 7 can be embodied with an IBM System/370-type host data processor 130, such as an IBM 3081 processor connected through an IBM 3274 terminal controller 134 to an IBM 3270 workstation 136. Details of such a configuration can be found, for example, in U.S. Pat. No. 4,271,479 to Cheselka, et al., entitled "Display Terminal With Modularly Attachable Features," which is assigned to the IBM Corporation. A more detailed description of the host data processor 130 can be found in IBM System/370 Principles of Operations, Order No. GA22-7000, published by the IBM Corporation, 1981. The host data processor 130 can emp1oy an operating system such as the Virtual Machine/Conversational Monitor System (VM/CMS) which is described in IBM Virtual Machine Facility/370 Introduction, IBM Systems Library, Order No. GC20-1800, published by the IBM Corporation, 1981. The system shown in FIG. 7 is described in greater detail in FIG. 8 where it is seen that the host data processor 130 has a primary bus 148 which interconnects the channel 146, the memory 150, the execution unit 152 and the storage controller 154. The bulk storage 133, which can be a large capacity disk drive such as an IBM 3380, is connected to the storage controller 154. The channel 146 is connected to a plurality of input/output terminals 134A. The channel 146 is also connected to the terminal controller 134. The terminal controller 134 includes a screen buffer 140 which is connected to the display screen 137, a processor 142 which is connected to the screen buffer 140 and also to the keyboard 135, and the communications adapter 144 which is connected to the processor 142. The communications adapter 144 provides the communications interface with the channel 146 of the host data processor 130. The workstation 136, which includes the display screen 137 and the keyboard 135, is also shown in FIG. 8, as it is related to the terminal controller 134. In addition, the channel 146 includes an output to the printer 156. A user at the workstation 136 will access the system by inputting commands and working text at the keyboard 135. This information is processed by the processor 142 which writes into the local screen buffer 140 for immediate display on the display screen 137. Whenever a command key or a function key is depressed on the keyboard 135, the processor 142 alerts the communications adapter 144 to transfer those portions of the working text which have been changed in the screen buffer 140, to the channel 146 of the host data processor 130. The information received by the channel 146 is transferred to the bus 148. Conversely, when information is provided by the bulk storage 133 through the information controller 154 to the bus 148, or by the execution unit 152 to the bus 148, or by the memory 150 to the bus 148, that information is transferred by the channel 146 to the communications adapter 144 at the terminal controller 134 for display on the display screen 137. The random access memory 150 in the host data processor 130 includes a number of data areas and functional programs for operating with the data input into it through the bus 160 which is connected to the bus 148. FIG. 9 is a logical block diagram showing the apparatus of the memory 150 including several designated data areas and functional programs controlling the operation of the system. The instructions in each of the functional programs are executed by the execution unit 152. The memory 150 is divided into a plurality of substantially identical partitions 200, 200A and 200B which respectively perform the functions for workstations 136, 136A and 136B of FIG. 7. The VM/CMS operating system program 70 in the memory 150 provides the overall control for the operation of the host data processor 130 and provides the coordination of the memory partitions 200, 200A and 200B so that the users of the respective workstations 136, 136A and 136B appear to have seemingly separate and independent IBM System/370 computing systems. See the above cited VM/CMS reference for further details. The file access method 172 coordinates transfers of data between the bulk store buffer 174 in the memory 150 and the storage controller 154 which interfaces with the bulk storage 133. The printer executive 175 controls printer 156 operations through the channel 146. FIG. 9 shows the apparatus of memory 150 during the decoding process when phrases within an input string are matched against the phrase file and replacement strings are provided which can be synonyms, foreign language translations, or replacement phrases which can also include annotations. During a text processing session, the operator at the workstation 116 inputs words and phrases at keyboard 135 and the terminal controller 134 transfers that text to the host data processor 130 where it is stored in the memory 150 in the working text buffer 178 where it can be operated upon by the word processing executive 176 to carry out conventional word processing operations. When the operator indicates by a control input at the keyboard 135 that phrase substitution is desired, input phrase strings are transferred from the working text buffer 178 to the input phrase register 182. The hashing processor 180 operates on the phrases in the input phrase register 182 and provides hash-encoded values for the input phrase to the hash-encoded input phrase register 184. The hash-encoded values for the input in 184 are then compared with the hash bit screen table (for the source file) in the buffer 188 by means of the comparator 186, in order to identify index or focus words. When a successful comparison is achieved by the comparator 186, the rules processor 190 performs a matching operation checking to determine if the adjacent words in the phrase in the input phrase register 182 satisfy the replacement rules. If the rules processor satisfies the matching of terms between the input phrase and the source file phrase, then the symbolic variable processor 192 selects the correct part of speech for the symbolic variables which occur in the replacement phrase. The replacement string is then output from the file buffer 194 which contains the equivalent phrases, and is stored in the replacement string phrase buffer 196, for transmission over bus 160 and through the channel 146 to the terminal controller 134, for display on the display screen 137 at the workstation 136. The operator can then elect whether to adopt the suggested replacement phrase being displayed on the display screen 137. The operator can make this election by entry to the keyboard 135, indicating the desired substitution of the replacement string for the existing input phrase. Although the disclosed embodiment in FIGS. 7, 8 and 9 is in a host data processing system, the invention can also find application in smaller data processing systems, such as the IBM Personal Computer, Model 5160, for example. The resulting system for compaction and replacement of phrases provides an improved system for phrase replacement which is based upon the linguistic relationship between the input phrase and the replacement phrase. The information is compacted using a combination of character frequency encoding and a recognition of the linguistic regularities in natural languages. Although a specific embodiment of the invention has been disclosed it will be understood by those of skill in the art that the foregoing and other changes in form and detail may be made therein without departing from the spirit and the scope of the invention.
|
Same subclass Same class Consider this |
||||||||||
