Method of and system for disambiguating syntactic word multiples
6260008
Abstract
A method and system are provided for disambiguating multiples of syntactically related words automatically using the notion of semantic similarity between words. Based on syntactically related words derived from a sample text, a set is formed containing each associating word and the words associated in the syntactic relationship with it. The associating words are expanded to all word senses. Pair-wise intersections of the resulting sets are formed so as to form pairs of semantically compatible word clusters which may be stored as pairs of cooccurrence restriction codes.
Claims
What is claimed is:
Description
BACKGROUND OF THE INVENTION
word    POS    sense   definition and/or example
fire    verb   1       open fire
               2       fire a gun; fire a bullet
               3       of pottery
               4       give notice
               5       The gun fired
               6       call forth; of emotions, feelings and responses
               7       They burned the house and its diaries
               8       provide with fuel; Oil fires the furnace
gun     noun   1       a weapon that discharges a missile esp from a metal tube or barrel
               2       large but transportable guns
               3       a pedal or hand-operated lever that controls the throttle; he stepped on the gas
clerk   noun   1       keeps records or accounts
clerk   noun   2       a salesperson in a store
This table gives partial examples of dictionary entries, omitting synonymic links. POS stands for "part of speech" and the integers in the "sense" column are indices referring to specific word uses.
In order to extract word senses from the WordNet lexical database, an input such as:
<fire_v, clerk_n>
may be supplied, where the input comprises a set containing the words "fire" and "clerk" together with abbreviations "v" and "n" which specify the parts of speech as "verb" and "noun", respectively. The output from the database is of the form
<{fire_v_4}, {clerk_n_1, clerk_n_2}>
which comprises a set having two subsets; the integers succeeding the abbreviations indicating the parts of speech refer to word senses in the lexical database. Thus, the accessing of the WordNet lexical database determines that, for the word pair comprising "fire" as a verb and "clerk" as a noun, "fire" is used as a verb in the fourth sense and "clerk" is used as a noun in its first or second sense.
As shown in FIG. 1, subsamples of text are selected at 1 so that disambiguation may be directed to a particular subject matter such as legal, financial, or medical subject matter. It is assumed that the relevant text is already in machine-readable form so that step 1 may comprise an editing step so as to tune the disambiguation towards a particular subject matter where the subsamples make use of the same terminology. At 2, all syntactically dependent pairs are extracted from the text sample, for instance using a robust parser of known type. For instance, the parser extracts and labels all verb-object pairs, all verb-subject pairs, and all adjective-noun pairs. These word pairs are sorted according to the syntactic dependencies at 3 and a first syntactic dependency, such as verb-object, is chosen at 4. Syntactic word collocates are extracted at 5 and a step 6 checks whether all syntactic dependencies have been used.
If not, another syntactic dependency is chosen at 7 and the step 5 is repeated until all syntactic dependencies have been used.
The step 5 is illustrated in detail in FIGS. 2 and 3. At 9, an associating word is selected from the first of a syntactically dependent pair of words and is entered in a first subset of a first set. For instance, "employee" as an object of a verb may be chosen as the associating word. At 10, an associated word is selected. In particular, the step 10 selects a verb which appears in the verb-object syntactic dependency with "employee" as the object in the text sample. At 11, the number N1 of occurrences of the associating word as the object of a verb in the text sample is counted. At step 12, the number N2 of occurrences of the associating word as the object of the associated word is counted in the text sample. At step 13, the conditional probability P1 for the verb (for instance "fire") given the noun ("employee") in the text sample is calculated as N2/N1. Step 14 compares the conditional probability with a threshold T1 which represents the threshold for inclusion in the set of statistically relevant associated words for the associating word ("employee").
The threshold T1 for statistical relevance can be either selected manually or determined automatically as the most ubiquitous conditional probability value. Suppose, for example, T1 is to be computed automatically with reference to the conditional probabilities for the following associated verbs of employee:
<fire_v/.25, employee_n>
<dismiss_v/.223, employee_n>
<hire_v/.27, employee_n>
<recruit_v/.22, employee_n>
<attract_v/.02, employee_n>
<be_v/.002, employee_n>
<make_v/.005, employee_n>
<affect_v/.01, employee_n>
This can be obtained by distributing all probabilities over a ten-cell template, where each cell is to receive progressively larger values starting from a value greater than 0.01, e.g.
FROM     TO    VALUES
>0.01    .1    .01, .02
>.1      .2    --
>.2      .3    .25, .223, .27, .22
>.3      .4    --
>.4      .5    --
>.5      .6    --
>.6      .7    --
>.7      .8    --
>.8      .9    --
>.9      1     --
The lowest value of the cell which has been assigned most elements, which in this case would be 0.22, is selected for T1. If the conditional probability exceeds the threshold T1, the associated word is entered in the second subset of the first set at 15. Control then passes to step 16 which checks whether all of the words associated with the present associating word have been used up. If not, another associated word is selected at 17 and control returns to the step 12. Once all of the associated words have been used up, step 18 checks whether all associating words as objects of verbs have been used up. If not, step 19 selects from the first of another syntactically dependent pair another associating word and enters it in a first subset of another first set. Control then passes to the step 10. As a specific example, all of the word pairs occurring in the text sample with "employee" as the object of a verb are as follows <fire_v/, employee_n> <dismiss_v/, employee_n> <hire_v/, employee_n> <recruit_v/, employee_n> <attract_v/, employee_n> <be_v/, employee_n> <make_v/, employee_n> <affect_v/, employee_n> The steps 11 to 15 ascribe conditional probability to the associated words and these are indicated as follows <fire_v/.25, employee_n> <dismiss_v/.223, employee_n> <hire_v/.27, employee_n> <recruit_v/.22, employee_n> <attract_v/.02, employee_n> <be_v/.002, employee_n> <make_v/.005, employee_n> <affect_v/.01, employee_n> For a threshold T1 of 0.22, only the verbs fire, dismiss, hire and recruit are entered in the second subset of the first set containing "employee" as the first subset so that the first set comprises: <{fire_v,dismiss_v,hire_v,recruit_v}, employee_n> Once all of the objects of verb-object pairs have been analysed in this way and given rise to corresponding first sets, the verbs of verb-object pairs are analysed as shown in FIG. 3. 
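The calculation of steps 11 to 15 and the automatic choice of T1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `conditional_probabilities` and `auto_threshold` are hypothetical, each template cell is taken to be the half-open interval (FROM, TO], and the inclusion test uses "at or above T1" because the worked example keeps "recruit" at exactly 0.22.

```python
from collections import Counter


def conditional_probabilities(pairs, noun):
    """Steps 11 to 13: P1 = N2 / N1 for every verb seen with `noun` as object."""
    verbs = [verb for verb, obj in pairs if obj == noun]
    n1 = len(verbs)  # N1: occurrences of the noun as the object of any verb
    return {verb: n2 / n1 for verb, n2 in Counter(verbs).items()}


def auto_threshold(values, floor=0.01, cells=10):
    """Return the lowest value in the fullest cell; cell i covers (lo, hi]."""
    bounds = [floor] + [i / cells for i in range(1, cells + 1)]
    template = [[] for _ in range(cells)]
    for value in values:
        for i in range(cells):
            if bounds[i] < value <= bounds[i + 1]:
                template[i].append(value)
                break
    fullest = max(template, key=len)  # first cell wins on a tie
    if not fullest:
        raise ValueError("no value falls inside the template")
    return min(fullest)


# Conditional probabilities quoted in the text for the object "employee".
EMPLOYEE_VERBS = {
    "fire": 0.25, "dismiss": 0.223, "hire": 0.27, "recruit": 0.22,
    "attract": 0.02, "be": 0.002, "make": 0.005, "affect": 0.01,
}

T1 = auto_threshold(EMPLOYEE_VERBS.values())
# Assumption: ">=" rather than ">" so that "recruit" (0.22) is retained.
relevant = sorted(v for v, p in EMPLOYEE_VERBS.items() if p >= T1)
print(T1, relevant)
```

With the figures above, the cell from .2 to .3 holds four values, so the sketch selects 0.22 and retains fire, dismiss, hire and recruit, in line with the first set given in the text.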
Steps 20 to 30 correspond to steps 9 to 19, respectively, but result in a plurality of second sets, each of which corresponds to an associating word in the form of a verb of a verb-object pair and comprises a fourth subset containing the associating word and a third subset of statistically relevant associated words. For instance, the verb "fire" may give rise to verb-object pairs: <fire_v, gun_n> <fire_v, rocket_n> <fire_v, employee_n> <fire_v, clerk_n> <fire_v, hymn_n> <fire_v, rate_n> As a result of steps 22 to 26, the objects "hymn" and "rate" are not found to be statistically relevant so that the second set is <fire_v, {gun_n,rocket_n,employee_n,clerk_n}> where the fourth subset contains the associating word "fire" as a verb and the third subset contains the associated words gun, rocket, employee, and clerk as nouns. When step 6 in FIG. 1 detects that all syntactic dependencies have been used, control passes to step 40 in FIG. 4. The step 40 selects a first associated word from a second or third subset and enters it in a new subset. Step 41 selects a combination of the first word and a further word which is associated with the first word. For instance, for the first set described hereinbefore and having "employee" as the associating word forming the first subset, the step 40 selects the first associated word "fire" and the step 41 selects "dismiss". Step 42 determines the semantic similarity between these words specifying the similarity as a numerical value. This is then compared with a threshold T3 at step 43 and, if the similarity exceeds the threshold, the further word is entered in the new subset at step 44. Step 45 assesses whether all pairwise combinations of the associated words in the second or third subset have been used. If not, a new combination is formed at 46 and steps 42 to 45 repeated. 
Once step 45 determines that all combinations have been used, a new set is formed at step 47 by associating the words remaining in the new subset with the associating word of the first or second set. Step 48 checks whether all words in all of the second and third subsets have been used as first words. If not, step 49 selects another first associated word from a second or third subset and enters it in another new subset. Steps 41 to 47 are then repeated until all of the associated words in the second and third subsets have been used.
These steps thus form all possible unique word pairs with non-identical members out of each associated word subset. For instance, in the case of the associated word subset
{fire, dismiss, hire, recruit}
the following word pairs are formed:
{fire-dismiss, fire-hire, fire-recruit, dismiss-hire, dismiss-recruit, hire-recruit}
Similarly, for the associated word subset
{gun, rocket, employee, clerk}
the following word pairs are formed:
{gun-rocket, gun-employee, gun-clerk, rocket-employee, rocket-clerk, employee-clerk}
The semantic similarities are then assessed, for instance by reference to a thesaurus. For the above two sets of word pairs, the following semantic similarities are obtained:
{[fire_v_4, dismiss_v_4, 11], [fire-hire, 0], [fire-recruit, 0], [dismiss-hire, 0], [dismiss-recruit, 0], [hire_v_3, recruit_v_2, 11]}
{[gun_n_1, rocket_n_1, 5.008], [gun_n_3/gun_n_2/gun_n_1, employee_n_1, 1.415], [gun_n_3/gun_n_2/gun_n_1, clerk_n_1/clerk_n_2, 1.415], [rocket_n_3, employee_n_1, 2.255], [rocket_n_3, clerk_n_1/clerk_n_2, 2.255], [employee_n_1, clerk_n_1/clerk_n_2, 4.144]}
In order to determine semantically congruent word senses for the semantic similarities given above, a semantic similarity threshold T3 is established.
The semantic similarity threshold T3 can be either selected manually or determined automatically as the most ubiquitous semantic similarity value. As in the case of the threshold T1 for statistical relevance, automatic determination of T3 can be carried out by
1. distributing all semantic similarity scores over an n-cell template, where each cell is to receive progressively larger values starting from a value greater than 0, and then
2. selecting the lowest value of the cell which has been assigned most elements.
In the present case, T3 is manually fixed at 3. The subsets produced by the step 43 for the example given above are those having a semantic similarity greater than 3 and are as follows:
{fire_v_4, dismiss_v_4}
{hire_v_3, recruit_v_2}
{clerk_n_1, clerk_n_2, employee_n_1}
{gun_n_1, rocket_n_1}
The resulting new subset therefore contains only words which are semantically related to each other and the step 47 associates each such subset with its associating word to form a new set.
Step 50 shown in FIG. 5 then expands the associating word of each new set into all of its possible senses, for instance by reference to an electronic thesaurus or dictionary. The word senses are included for all of the associated and associating words.
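The clustering of steps 40 to 47, which precedes the sense expansion of step 50, can be sketched as follows for the noun subset of the example. This is only an illustration under stated assumptions: the `SIMILARITY` table simply restates the scores and sense labels quoted in the text (with senses written as `gun_n_1`), whereas a real system would obtain them from a thesaurus, and the function name `cluster` is hypothetical.

```python
T3 = 3  # manually fixed, as in the text

# (score, sense labels) for each unordered word pair, copied from the example.
SIMILARITY = {
    frozenset({"gun", "rocket"}): (5.008, {"gun_n_1", "rocket_n_1"}),
    frozenset({"gun", "employee"}): (
        1.415, {"gun_n_1", "gun_n_2", "gun_n_3", "employee_n_1"}),
    frozenset({"gun", "clerk"}): (
        1.415, {"gun_n_1", "gun_n_2", "gun_n_3", "clerk_n_1", "clerk_n_2"}),
    frozenset({"rocket", "employee"}): (2.255, {"rocket_n_3", "employee_n_1"}),
    frozenset({"rocket", "clerk"}): (
        2.255, {"rocket_n_3", "clerk_n_1", "clerk_n_2"}),
    frozenset({"employee", "clerk"}): (
        4.144, {"employee_n_1", "clerk_n_1", "clerk_n_2"}),
}


def cluster(words, similarity, threshold):
    """Group word senses whose pairwise similarity exceeds the threshold."""
    clusters = []
    for first in words:                      # steps 40 and 49
        senses = set()
        for other in words:                  # steps 41 and 46
            if other == first:
                continue
            score, labels = similarity.get(frozenset({first, other}), (0, set()))
            if score > threshold:            # steps 42 to 44
                senses |= labels
        if senses and senses not in clusters:
            clusters.append(senses)          # step 47, duplicates dropped
    return clusters


noun_clusters = cluster(["gun", "rocket", "employee", "clerk"], SIMILARITY, T3)
print(noun_clusters)
```

For the scores above, only gun-rocket and employee-clerk pass the threshold, which gives the two noun clusters listed in the text.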
For instance, the resulting expanded new sets corresponding to the previous specific example are as follows:
<{hire_v_3, recruit_v_2}, {employee_n_1}>
<{dismiss_v_4, fire_v_4}, {employee_n_1}>
<{fire_v_1, fire_v_2, fire_v_3, fire_v_4, fire_v_5, fire_v_6, fire_v_7, fire_v_8}, {clerk_n_1, clerk_n_2, employee_n_1}>
<{fire_v_1, fire_v_2, fire_v_3, fire_v_4, fire_v_5, fire_v_6, fire_v_7, fire_v_8}, {gun_n_1, rocket_n_1}>
For the associating word "employee", there is only one meaning so that the first subset of each new expanded set comprises "employee_n_1". However, the associating word "fire" has eight possible senses so that the fourth subsets contain each of these eight senses. Thus, each expanded new set has a subset containing associated words which are semantically related to each other and another subset containing all senses of the associating word.
Steps 51 and 52 select two of the expanded new sets and step 53 intersects these two sets. In particular, in the case of verb-object pairs, the subsets of these two new sets containing verbs are intersected and likewise the subsets containing objects are intersected. Thus, the output of the step 53 comprises a new set which is non-empty if the two sets have one or more common members in both the "verb" subsets and the "object" subsets. For the specific example given hereinbefore of the expanded new sets, when the set
[{dismiss_v_4, fire_v_4}, {employee_n_1}]
is intersected with the set
[{fire_v_1, fire_v_2, fire_v_3, fire_v_4, fire_v_5, fire_v_6, fire_v_7, fire_v_8}, {clerk_n_1, clerk_n_2, employee_n_1}]
the resulting intersection comprises the set
[{fire_v_4}, {employee_n_1}].
All other pair-wise intersections of the four expanded new sets are empty because there are no verbs and objects common to both sets of each pair-wise combination. Step 54 determines whether the intersection is empty and, if not, the set formed by the intersection is added to an existing set at 55. The steps 53 and 54 effectively perform the disambiguation, the result of which is merged by step 55 into pairs of semantically compatible word clusters using a thesaurus function and/or the notion of semantic similarity. The word senses contained in the resulting subsets or clusters formed by the step 55 are all semantically similar (possibly in synonymic relation) to each other. For instance, the sets:
<fire_v_4, employee_n_1>
<dismiss_v_4, clerk_n_1>
<give_the_axe_v_1, salesclerk_n_1>
<sack_v_2, shop_clerk_n_1>
<terminate_v_4, clerk_n_2>
may be merged into the following set:
<{fire_v_4, dismiss_v_4, give_the_axe_v_1, sack_v_2, terminate_v_4}, {clerk_n_1, employee_n_1, salesclerk_n_1, shop_clerk_n_1, clerk_n_2}>
Step 56 checks whether all other new sets have been used and thus effectively determines whether all possible intersections with the new set selected at 51 have been formed. If not, another new set is selected at 57 and control returns to the step 53. When all pair-wise combinations including the new set selected in the step 51 have been formed, step 59 checks whether all new sets have been used. If not, step 58 selects another new set and the process is repeated until each expanded new set has been intersected with every other expanded new set.
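The pair-wise intersection of steps 51 to 54 can be sketched in a few lines of Python using the four expanded new sets of the example. This is an illustrative sketch only: each set is modelled as a hypothetical (verbs, objects) tuple, senses are written in the form `fire_v_4`, and the thesaurus-based merging of step 55 is not modelled.

```python
from itertools import combinations

# All eight senses of the associating verb "fire".
FIRE = {f"fire_v_{i}" for i in range(1, 9)}

# The four expanded new sets of the example, as (verbs, objects) pairs.
EXPANDED = [
    ({"hire_v_3", "recruit_v_2"}, {"employee_n_1"}),
    ({"dismiss_v_4", "fire_v_4"}, {"employee_n_1"}),
    (FIRE, {"clerk_n_1", "clerk_n_2", "employee_n_1"}),
    (FIRE, {"gun_n_1", "rocket_n_1"}),
]


def intersect(a, b):
    """Step 53: intersect verb subsets and object subsets separately.

    Returns None when either intersection is empty (step 54).
    """
    verbs, objects = a[0] & b[0], a[1] & b[1]
    return (verbs, objects) if verbs and objects else None


# Steps 51 to 59: every expanded set is intersected with every other one.
resolved = [r for a, b in combinations(EXPANDED, 2) if (r := intersect(a, b))]
print(resolved)
```

Only one of the six combinations survives, namely the pair containing `fire_v_4` and `employee_n_1`, which matches the result stated in the text.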
The results of steps 1 to 59 as described above comprise sets of pairs of semantically congruent word sense clusters, for example:
<{fire_v_4, give_the_axe_v_1, send_away_v_2, sack_v_2, force_out_v_2, terminate_v_4}, {clerk_n_1, employee_n_1, salesclerk_n_1, shop_clerk_n_1, clerk_n_2}>
<{lease_v_4, rent_v_3, hire_v_3, charter_v_3, engage_v_6, take_v_22, recruit_v_2}, {clerk_n_1, employee_n_1, salesclerk_n_1, shop_clerk_n_1, clerk_n_2}>
These results are then stored so that future disambiguation involving any of the word sense associations may be reduced to simple table look-ups. In step 60, a common first subcode is assigned to the clusters or subsets of each set. Step 61 assigns a second subcode to each subset or cluster representing its syntactic dependency. For instance, a first subcode VO may be assigned to each cluster of verbs whereas a second subcode OV may be assigned to each cluster of verb objects. A specific example would be as follows:
<{102_VO, fire_v_4, dismiss_v_4, give_the_axe_v_1, send_away_v_2, sack_v_2, force_out_v_2, terminate_v_4}, {102_OV, clerk_n_1, employee_n_1, salesclerk_n_1, shop_clerk_n_1, clerk_n_2}>
<{103_VO, lease_v_4, rent_v_3, hire_v_3, charter_v_3, engage_v_6, take_v_22, recruit_v_2}, {103_OV, clerk_n_1, employee_n_1, salesclerk_n_1, shop_clerk_n_1, clerk_n_2}>
<{104_VO, shoot_v_3, fire_v_1, . . . }, {104_OV, gun_n_1, rocket_n_1, . . . }>
Step 62 stores the codes in a cooccurrence restriction table, for instance of the form:
102_VO , 102_OV
103_VO , 103_OV
104_VO , 104_OV
Step 63 stores the subsets or clusters against their assigned codes comprising the first and second subcodes. For instance, the orthography and part of speech of each word sense in the clusters may be stored in a table along with the associated sense member and cluster code as follows:
fire , v , 4 , 102_VO
dismiss , v , 4 , 102_VO
. . . , . . . , . . . , . . .
clerk , n , 1 , 102_OV
employee , n , 1 , 102_OV
. . . , . . . , . . . , . . .
hire , v , 3 , 103_VO
recruit , v , 2 , 103_VO
. . . , . . . , . . . , . . .
shoot , v , 3 , 104_VO
fire , v , 1 , 104_VO
. . . , . . . , . . . , . . .
gun , n , 1 , 104_OV
rocket , n , 1 , 104_OV
. . . , . . . , . . . , . . .
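The two tables above lend themselves to simple look-ups. The following Python sketch holds only the rows actually shown (the elided rows are left out) and illustrates one way such tables could be queried; the names `RESTRICTIONS`, `SENSES`, `codes_for` and `resolve` are hypothetical and are not taken from the patent.

```python
from itertools import product

# Cooccurrence restriction table (step 62).
RESTRICTIONS = {
    ("102_VO", "102_OV"),
    ("103_VO", "103_OV"),
    ("104_VO", "104_OV"),
}

# Word sense table (step 63): orthography, part of speech, sense, cluster code.
SENSES = [
    ("fire", "v", 4, "102_VO"), ("dismiss", "v", 4, "102_VO"),
    ("clerk", "n", 1, "102_OV"), ("employee", "n", 1, "102_OV"),
    ("hire", "v", 3, "103_VO"), ("recruit", "v", 2, "103_VO"),
    ("shoot", "v", 3, "104_VO"), ("fire", "v", 1, "104_VO"),
    ("gun", "n", 1, "104_OV"), ("rocket", "n", 1, "104_OV"),
]


def codes_for(word, pos):
    """Return every (cluster code, sense number) stored for a word."""
    return [(code, sense) for w, p, sense, code in SENSES if (w, p) == (word, pos)]


def resolve(verb, noun):
    """Keep only the sense pairs whose code pair is a stored restriction."""
    return [
        (f"{verb}_v_{v_sense}", f"{noun}_n_{n_sense}")
        for (v_code, v_sense), (n_code, n_sense)
        in product(codes_for(verb, "v"), codes_for(noun, "n"))
        if (v_code, n_code) in RESTRICTIONS
    ]


print(resolve("fire", "employee"))
print(resolve("fire", "gun"))
```

A set of code pairs gives constant-time membership tests, which suits the elimination of code pairs that are absent from the restriction table.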
Once the disambiguation procedure is complete, subsequent disambiguation of word pairs or multiples may be achieved by conventional table look-ups using the tables described hereinbefore. For instance, disambiguation of syntactically related words such as [fire_v, employee_n] may be performed by retrieving all of the cluster codes for each word in the pair and creating all possible pairwise combinations, namely:
<102_VO, 102_OV>
<104_VO, 102_OV>
Those code pairs which are not in the table of cooccurrence restrictions are then eliminated to leave:
<102_VO, 102_OV>
The resolved cluster code pairs may then be used to retrieve the appropriate senses for the input words, giving:
<fire_v_4, employee_n_1>
FIG. 6 shows schematically a system suitable for disambiguating word pairs. The system comprises a programmable data processor 70 with a program memory 71, for instance in the form of a read only memory (ROM), storing a program for controlling the data processor 70 to perform the method illustrated in FIGS. 1 to 5. The system further comprises non-volatile read/write memory 72 for storing the cooccurrence restriction table and the table of word senses against codes. "Working" or "scratch pad" memory for the data processor is provided by random access memory (RAM) 73. An input interface 74 is provided, for instance for receiving commands and data. An output interface 75 is provided, for instance for displaying information relating to the progress and result of disambiguation.
The text sample may be supplied via the input interface 74 or may optionally be provided in a machine-readable store 76. A thesaurus and/or a dictionary may be supplied in the read only memory 71 or may be supplied via the input interface 74. Alternatively, an electronic or machine-readable thesaurus 77 and an electronic or machine-readable dictionary 78 may be provided.
The program for operating the system and for performing the method described hereinabove is stored in the program memory 71. The program memory may be embodied as semiconductor memory, for instance of ROM type as described above. However, the program may be stored in any other suitable storage medium, such as floppy disc 71a or CD-ROM 71b.
The method described above with reference to FIGS. 1 to 5 can be extended to deal with statistically inconspicuous collocates, that is, collocates which do not meet the threshold tests of steps 14 and 25 in FIGS. 2 and 3 respectively. Because, in the method described above, only statistically relevant collocations are chosen to drive the disambiguation process (see step 14), it follows that no cooccurrence restrictions might be acquired for a variety of word pairs. This, for example, might be the case with verb-object pairs such as <fire_v, hand_n> where the noun is a somewhat atypical object of the verb, and so does not occur frequently. This problem can be addressed by using the cooccurrence restrictions already acquired to classify statistically inconspicuous collocates, as described below with reference to the verb-object pair <fire_v, hand_n>.
First, all verb-object cooccurrence restrictions containing the verb fire are found, which with reference to the example given above are
<102_VO, 102_OV>
<104_VO, 104_OV>
Then all members of the direct object collocate class are retrieved, e.g.
102_ → clerk_n_1, employee_n_1
104_ → gun_n_1, rocket_n_1
The statistically inconspicuous collocate is then clustered with all members of the direct object collocate class according to semantic similarity, following the procedure described in steps 41-47 of FIG. 4. This will provide one or more sense classifications for the statistically inconspicuous collocate.
In the present case, the WordNet senses 2 and 9 (glossed as "farm labourer" and "crew member" respectively) are given when hand_n clusters with clerk_n_1 and employee_n_1, e.g.
IN: {hand_n, clerk_n_1, employee_n_1, gun_n_1, rocket_n_1}
OUT: {{hand_n_2/9, clerk_n_1, employee_n_1}, {gun_n_1, rocket_n_1}}
The disambiguated statistically inconspicuous collocate is then associated with the same code as the word senses with which it has been clustered, e.g.
hand , n , 2 , 102_OV
hand , n , 9 , 102_OV
This will make it possible to choose senses 2 and 9 for hand in contexts where hand occurs as the direct object of verbs such as fire, in accordance with the above method.
The disambiguation method described above with reference to FIGS. 1 to 5 can occasionally yield multiple results. For example, given the pair of verb and object multiples
<{wear_v, have_on_v, record_v, file_v}, {suit_n, garment_n, clothes_n, uniform_n}>
as the learning data set, the disambiguation of the pair
IN: <wear_v suit_n_1>
may yield the following results:
OUT: {<wear_v_1 suit_n_1>, <wear_v_9 suit_n_1>}
Multiple disambiguation results typically occur when some of the senses given for a word in the source dictionary database are close in meaning. For example, WordNet defines sense 1 of wear as "be dressed in" and sense 9 as "putting clothes on one's body". In order to overcome this problem, multiple word sense resolutions can be ranked with reference to the semantic similarity scores used in clustering word senses during disambiguation (step 42 of FIG. 4). The basic idea is that a word sense resolution choice represented by a word cluster with a higher semantic similarity score provides a better disambiguation hypothesis. For example, specific word senses for the verb-object pair <wear suit> are given by the following disambiguated word multiples:
{<{have_on_v_1, wear_v_1}, {clothes_n_1, garment_n_1, suit_n_1, uniform_n_1}>,
<{file_v_2, wear_v_9}, {clothes_n_1, garment_n_1, suit_n_1, uniform_n_1}>}
These arise from intersecting pairs consisting of all senses of an associating word and a semantically congruent cluster of its associated words, as described in relation to steps 50-59 of FIG. 5.
Taking into account the semantic similarity scores shown below which are used to derive the sets of associated verbs according to steps 41-47, the best word sense candidate for the verb wear in the context wear suit would be wear_v_1.
As a further extension of the method described above with reference to FIGS. 1 to 5, grammatical properties of words such as transitivity for verbs can be used to facilitate and improve the disambiguation process by reducing the number of sense expansions for the associating word at step 50. For example, in expanding the associating word fire_v in
<fire_v, {clerk_n_1/2, employee_n_1}>
all senses of fire_v pertaining to non-transitive uses of the verb fire as specified in the lexical database (e.g. sense 1, which is defined as "open fire") could be excluded, as the disambiguating context requires a transitive use of the verb.
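The transitivity filter on the sense expansion of step 50 can be sketched as follows. This is a minimal illustration under stated assumptions: the names `ALL_SENSES`, `NON_TRANSITIVE` and `expand` are hypothetical, and only sense 1 of fire is flagged because it is the only sense the text names as non-transitive; a real system would read such flags from the lexical database.

```python
# Sense inventory for the associating word; "fire" as a verb has eight senses.
ALL_SENSES = {"fire_v": range(1, 9)}

# Assumption: only the sense explicitly named in the text is flagged.
NON_TRANSITIVE = {("fire_v", 1)}


def expand(word, require_transitive=False):
    """Step 50: expand a word into its senses, optionally dropping
    senses flagged as non-transitive."""
    return [
        f"{word}_{sense}"
        for sense in ALL_SENSES[word]
        if not (require_transitive and (word, sense) in NON_TRANSITIVE)
    ]


print(expand("fire_v"))
print(expand("fire_v", require_transitive=True))
```

In a verb-object context the filtered expansion omits `fire_v_1`, so fewer candidate senses enter the intersection of step 53.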
