Method and computer system for part-of-speech tagging of incomplete sentences

6910004

Abstract

The invention relates to a method and a computer system for enhanced part-of-speech (POS) tagging as well as for grammatically disambiguating a phrase. A phrase is usually a short multiword expression that may be ambiguous. By introducing grammatical constraints, the invention supports POS-tagging as well as grammatical disambiguation of the phrase. According to an identifier for the phrase, the phrase is supplemented with artificial context information. The supplemented phrase is then POS-tagged or grammatically disambiguated. Important applications are POS-tagging, Automatic Term Encoding, Headword Detection and Information Retrieval.

Claims

1. A method, for use in a computer system, for assigning at least one part-of-speech (POS) tag to a phrase, the method comprising:

Description

BACKGROUND OF THE INVENTION
The context information may comprise textual information, POS-tags or information adapted for the POS-tagger. Referring to FIG. 1, for the phrase "close the door" the identifier for the category VerbPhrase is obtained in step 11, the identifier being associated with context information as indicated above, and in step 12 the phrase 100 is supplemented with the corresponding pre-context information 201 and post-context information 202:
Although the content of such a supplemented phrase 120 does not make sense, the POS-tagger is now able to disambiguate the parts of the phrase, because the supplemented phrase 120 represents a known grammatical structure. For the step 13 of assigning the at least one POS-tag to the phrase 120, the POS-tagger uses the tags +VERB for verbs, +NOUN_SG for singular nouns, +NOUN_PL for plural nouns, +ART for articles, +PREP for prepositions, +PRON for pronouns and +SENT for the end-of-sentence marker:
The supplemented context information is removed from the phrase when the POS-tagging process is finished. The result of step 110 is:
The step 11 of obtaining the identifier may be implemented in various manners. For example, the phrase "close the door" could be part of an instruction list in a document, thereby being associated with the identifier for the grammatical category VerbPhrase. Further, the phrase could be an input by a user, the identifier being obtained automatically by evaluating an interaction history with the user, or even manually by input of the user.

A second aspect of the invention is described in the following: a method for use in a computer system for grammatically disambiguating a phrase comprises the steps of getting the phrase; getting an identifier for the phrase, the identifier being associated with artificial information; supplementing the phrase with the artificial information; and grammatically disambiguating the phrase based on the supplemented phrase. It is apparent that this second method is not limited to POS-taggers and can be seen as a more general version of the first method of the invention. Therefore all parts of the detailed description of the present invention above and below are applicable to the second method as well, although they are discussed with reference to the first method of the present invention only.

The method illustrated in FIG. 1 improves prior art POS-tagging processes. An example of the basic steps of POS-tagging, as already indicated in the introductory part of this application, is illustrated in FIG. 6. FIG. 6 illustrates the steps in a common POS-tagger from start 60 to end 66. After the step 61 of getting a phrase 100, the phrase is tokenised in step 62 into Token1 to Token3 (101 to 103). Potential tags Tag11 to Tag32 (111-132) are provided in step 63 by evaluating each token 101-103 based on lexical information. The step 63 of providing potential tags 111-132 may comprise a morphological analysis of the tokens 101-103, for example for identifying the word "swam" as the simple past tense of the verb "swim".
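The tokenising step 62 and the lookup of potential tags in step 63 can be sketched as follows. This is a minimal illustration only, assuming a toy lexicon and whitespace tokenisation; the names `LEXICON`, `tokenise` and `potential_tags` and the tag +ADJ are hypothetical and do not come from the description.

```python
# Hypothetical toy lexicon mapping a token to its potential POS-tags.
# The entries and the +ADJ tag are assumptions made for this sketch.
LEXICON = {
    "close": ["+VERB", "+ADJ", "+NOUN_SG"],
    "the": ["+ART"],
    "door": ["+NOUN_SG"],
}


def tokenise(phrase):
    """Step 62 (simplified): split the phrase into lower-cased tokens."""
    return phrase.lower().split()


def potential_tags(tokens, lexicon=LEXICON):
    """Step 63 (simplified): list every potential tag for each token.

    Tokens missing from the lexicon receive an empty candidate list.
    """
    return [(token, list(lexicon.get(token, []))) for token in tokens]


candidates = potential_tags(tokenise("close the door"))
for token, tags in candidates:
    print(token, tags)
```

In this sketch the token "close" keeps three candidate tags, which is the kind of ambiguity the disambiguating step has to resolve.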
In step 64, by disambiguating the tags 111-132, a single tag 113, 121, 132 is assigned to each token 101, 102, 103. The disambiguated tags 113, 121 and 132 are assembled to the tokens 101, 102, 103 of the phrase 100 in step 65, resulting in the tagged phrase 190. Some prior art POS-taggers, for example, use Finite State Transducers (FSTs) or Hidden Markov Models (HMMs) in the POS-tagging process. However, the method of the present invention is applicable to any prior art POS-tagger.

The steps of the method of the present invention may be combined with prior art POS-tagging processes in various manners, some of which are discussed in the following with reference to FIG. 1 and FIG. 2. In an embodiment of the present invention the steps 11 and 12 of obtaining the identifier and supplementing the phrase can be performed with the step 61 of getting the phrase, wherein the step 13 of assigning the at least one POS-tag summarizes the steps 63 and 64 of providing potential tags and disambiguating tags. In another embodiment of the present invention the steps of the method as shown in FIG. 1 may be inserted in the step 64 of disambiguating tags, for example in case more than one potential tag is provided in step 63 for one token of the phrase.

FIG. 2 illustrates in more detail the step 11 of obtaining the identifier for a preferred embodiment of the present invention from start 20 to end 26. The phrase 100 and the associated identifier 150 are obtained in step 21, and the identifier is mapped in step 22 to a plurality of potential categories 160 for the phrase. The mapping is in effect a step of pre-selecting categories. In a further embodiment of the present invention, the plurality of categories 160 are main grammatical categories of the phrase. The plurality of categories 160 is provided in step 23 for a selection, which can be an external selection 24. In case no external selection 24 is made, the most probable category is selected in step 25 as default.
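The flow of FIG. 2 (steps 21 to 25) can be sketched as below. This is a hedged illustration, not the claimed implementation: the identifier names, the table `IDENTIFIER_TO_CATEGORIES` and the convention that candidate categories are listed most probable first are all assumptions made for the example.

```python
# Hypothetical mapping from an identifier to pre-selected categories
# (step 22). By assumption, the most probable category is listed first.
IDENTIFIER_TO_CATEGORIES = {
    "instruction_list_item": ["VerbPhrase", "NounPhrase"],
    "glossary_entry": ["NounPhrase"],
}


def preselect_categories(identifier, table=IDENTIFIER_TO_CATEGORIES):
    """Step 22: map the identifier to its candidate categories."""
    return list(table.get(identifier, []))


def select_category(categories, external_choice=None):
    """Steps 23 to 25: use an external selection if one is given,
    otherwise fall back to the first (assumed most probable) category."""
    if not categories:
        raise ValueError("no candidate categories for this identifier")
    if external_choice is not None:
        if external_choice not in categories:
            raise ValueError("external selection is not a candidate")
        return external_choice
    return categories[0]


default = select_category(preselect_categories("instruction_list_item"))
chosen = select_category(
    preselect_categories("instruction_list_item"),
    external_choice="NounPhrase",
)
print(default, chosen)
```

How the most probable category is actually determined is left open here; ordering the candidate list is only one simple way to encode a default.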
The phrase is now associated 161 with the most probable category, which in turn is associated with the context information 211. In a further embodiment of the present invention the at least one POS-tag assigned to the phrase is selected from potential POS-tags for the phrase without context, and the most probable category for the phrase is selected by evaluating the potential POS-tags. In fact such an evaluation eliminates the need for further disambiguation of the POS-tags.

Applications

FIG. 3 illustrates ways of using the method of FIG. 1 for optional applications, starting from connection point 15 to end 34. In a first optional step 31 the tagged phrase or the phrase tags are stored or output. The optional step 32 of extracting a headword out of the phrase, based on the phrase with the at least one assigned POS-tag, is another application using the method of the present invention. In another embodiment of the present invention, in a further optional step 33, a formal structure for the phrase is derived that covers variations of the original phrase. The steps 32 and 33 are discussed in more detail in the following.

Many existing applications in natural language processing (e.g. dictionary generation, terminology database creation) require the part-of-speech encoding of expressions. Currently lexicographers perform this encoding manually according to some specific grammar. This manual encoding can be improved and sped up by an automatic process called Automatic Term Encoding. The step 33 of deriving a formal structure is the final step of the Automatic Term Encoding process, which results in the automatic creation of linguistic regular expressions that can be used by natural language processing tools. For example, the phrase "close the door" could be part of a travel dictionary including short phrases for everyday use, which has to be translated into different languages.
The lexicographers specify this phrase as the term they want to encode and provide the general grammatical category for the phrase. The latter may also be derived by a structural property of the phrase, e.g. in case the phrase is part of an instruction list. Again the grammatical category obtained in this example is VerbPhrase. The tagged phrase
Syntactic categories resulting from the tagging process are mapped to more general grammatical tags. The POS-tag +VERB resulting from disambiguating and identifying the affected verb is mapped to the more generic qualifier V, which covers all types of verbs. The POS-tags +NOUN_SG (for noun, proper noun, or abbreviation) are replaced by the global qualifier N, to which all noun tags are mapped. These generic tags generalize the initial expression. The mapping rules can also insert additional information: for example, a rule can specify that adjectives can be inserted between two nouns or that several adverbs can be added after a verb. The rules applied in the step 33 of deriving a formal structure are language and tagger dependent. The phrase finally leads to the formal structure:
This formal structure captures variations of the original expression such as:
Automatic Term Encoding improves the work of language resource creators by automating part of the process of building dictionaries, terminology databases, etc. This changes the role of the resource creator, who gains more time for validation because the tedious parts of the encoding process are automated. In addition to saving time, the rule application in the Automatic Term Encoding application ensures that the encoding (e.g. the choice of generalization tags) is homogeneous, since the mapping is not performed manually and the resource creator merely guides the tagger rather than imitating it. A further application of this invention involves information retrieval, taking advantage of the methods described above by using the result of the Automatic Term Encoding. Based on the formal structure resulting from Automatic Term Encoding, an application can determine all the different variations of a multiword expression, thereby catching all the terms matching the regular expression. For example, for the phrase "dense matrix" we will get the following results from the different steps of the Automatic Term Encoding process:
A specific automatically applied grammar rule has added the possibility of having zero or more adjectives (ADJ for adjective) before a noun. Equivalent expressions can now be identified, which match this regular expression, for example:
Adding further grammar rules extends the variety of expressions that can be caught. The step 32 in FIG. 3 of extracting a headword out of the phrase, based on the phrase with at least one assigned POS-tag, is the next application using the method of the present invention. For example the phrase
By applying relevant grammatical rules to the tagged phrase a headword in the phrase is identified to be
Similar to the rules for Automatic Term Encoding, the rules that have to be applied for extracting the headword in the headword detection process are also coded using regular expressions and are language as well as tagger dependent. The step 32 of headword detection may be split into the sub-steps of finding all nouns in the phrase which are potential headwords and identifying the one noun which most probably is the headword of the phrase.

Functional Units

In FIG. 4 the functional units involved in POS-tagging, headword extracting or formal structuring processes are illustrated. In a first embodiment, the context supplementer 44 is connected to identifier input means 43, a POS-tagger 45 and a context storage 42, the latter being connected to the identifier storage 41. The context supplementer 44 obtains an identifier for a phrase via the identifier input 43. The phrase itself may be obtained from a data storage 49 or a phrase input 48, each of which is connected to the context supplementer 44. The context storage 42 comprises a plurality of context information items to be supplemented to a phrase. The identifier storage 41 comprises a plurality of identifiers, each of which is associated with at least one context information item of the context storage 42. The context supplementer 44 selects a context information item from the context storage 42 according to the obtained identifier. The phrase is supplemented with the selected context information, and both together form the input for the POS-tagger 45. The POS-tagger performs the POS-tagging process leading to the tagged phrase or the phrase tags. The result can be displayed or output at the output 83, or stored to the data storage 49.
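The interplay of context storage 42, context supplementer 44 and POS-tagger 45 can be sketched as follows. This is a minimal sketch under stated assumptions: the context strings ("we" as pre-context, "." as post-context), the toy lexicon, the +ADJ tag and the single bigram rule in `toy_tagger` are all hypothetical stand-ins chosen for illustration, and any real tagger could be passed in instead.

```python
# Hypothetical lexicon; the first candidate is the tagger's default guess.
LEXICON = {
    "we": ["+PRON"],
    "close": ["+ADJ", "+VERB", "+NOUN_SG"],
    "the": ["+ART"],
    "door": ["+NOUN_SG"],
    ".": ["+SENT"],
}

# Stand-in for context storage 42: identifier -> (pre-context, post-context).
# These context items are assumptions, not the ones used in the description.
CONTEXT_STORAGE = {
    "VerbPhrase": (["we"], ["."]),
    "NounPhrase": ([], ["."]),
}


def toy_tagger(tokens):
    """Stand-in for POS-tagger 45: prefer +VERB directly after a pronoun,
    otherwise take the first lexicon candidate (+NOUN_SG if unknown)."""
    tagged, previous_tag = [], None
    for token in tokens:
        candidates = LEXICON.get(token, ["+NOUN_SG"])
        if previous_tag == "+PRON" and "+VERB" in candidates:
            tag = "+VERB"
        else:
            tag = candidates[0]
        tagged.append((token, tag))
        previous_tag = tag
    return tagged


def tag_with_context(phrase_tokens, identifier, tagger=toy_tagger):
    """Stand-in for context supplementer 44: add context, tag, strip context."""
    pre, post = CONTEXT_STORAGE[identifier]
    tagged = tagger(pre + phrase_tokens + post)
    return tagged[len(pre):len(tagged) - len(post)]


phrase = ["close", "the", "door"]
print(toy_tagger(phrase))                      # tagged without context
print(tag_with_context(phrase, "VerbPhrase"))  # tagged with artificial context
```

With this toy tagger, the bare phrase gets the default guess for "close", while the supplemented phrase lets the pronoun rule select +VERB; the context tokens are removed again before the result is returned.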
In a further embodiment of the present invention, the computer system further comprises a category storage 47 comprising a plurality of categories, each identifier being associated with at least one category of the category storage 47 and each category being associated to at least one context information item in the context storage 42. When a category is obtained via category input 82 the context information that has to be supplemented to the phrase can be selected directly. An obtained identifier may be mapped to the category and consequently to the context information. In case more than one category is associated with the identifier, a category evaluator 46 performs the pre-selection of probable categories e.g. main grammatical categories for the phrase according to the identifier and selects a most probable category from the pre-selected categories. The selection may be performed by external selection via the selection means 81 or according to selection rules stored in the data storage 49. In a further embodiment of the present invention the most probable category is selected based on potential POS-tags for the phrase, which are provided by the data storage together with the phrase. The context information may comprise at least pre-context or post-context information, each of which may be represented by at least one POS-tag or textual information. For the applications illustrated with reference to FIG. 3 the POS-tagger 45 may be connected to a headword extractor 84 for performing the headword extraction process based on the tagged phrase, or a formalizer 85 for deriving a formal structure for the phrase, that covers variations of the original phrase. In a further embodiment of the present invention the formalizer 85 may be connected to a morphological generator 86 and the data storage 49. 
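A headword extractor 84 and a formalizer 85 operating on a tagged phrase might look like the sketch below. It is illustrative only and rests on several assumptions: the `word/TAG` string encoding, the single rule allowing extra adjectives before a noun, the +ADJ tag, and the heuristic that the last noun is the headword are all choices made for this example; the actual rules are language and tagger dependent, and morphological variants are not handled here.

```python
import re

NOUN_TAGS = {"+NOUN_SG", "+NOUN_PL"}
# Mapping of tagger-specific tags to generic qualifiers (partly assumed).
GENERIC = {"+VERB": "V", "+NOUN_SG": "N", "+NOUN_PL": "N",
           "+ADJ": "ADJ", "+ART": "ART"}


def generalise(tagged):
    """Replace each POS-tag by its generic qualifier."""
    return [(tok, GENERIC.get(tag, tag.lstrip("+"))) for tok, tag in tagged]


def formal_structure(tagged):
    """Build a regular expression over 'word/TAG ' items that allows zero
    or more additional adjectives before every noun."""
    parts = []
    for tok, generic in generalise(tagged):
        if generic == "N":
            parts.append(r"(?:\S+/ADJ )*")
        parts.append(re.escape(f"{tok}/{generic}") + " ")
    return re.compile("".join(parts))


def covers(structure, tagged):
    """Check whether a tagged expression matches the formal structure."""
    text = "".join(f"{tok}/{generic} " for tok, generic in generalise(tagged))
    return structure.fullmatch(text) is not None


def extract_headword(tagged):
    """Find all nouns, then pick the last one (an assumed heuristic)."""
    nouns = [tok for tok, tag in tagged if tag in NOUN_TAGS]
    return nouns[-1] if nouns else None


original = [("dense", "+ADJ"), ("matrix", "+NOUN_SG")]
structure = formal_structure(original)
variant = [("dense", "+ADJ"), ("square", "+ADJ"), ("matrix", "+NOUN_SG")]
print(covers(structure, variant), extract_headword(original))
```

The variant "dense square matrix" is a made-up example of an expression that such a structure would accept, while an expression with a different leading adjective would be rejected.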
The data storage 49 may function as an input or output data storage for the phrase or the tagged phrase, and may further comprise rules for POS-tagging, formalizing or headword extraction processes.

Hardware Units

FIG. 5 illustrates a computer system with a CPU 50, a keyboard 51, a display 52, a pointing device 53, a wired/wireless interface 54, audio input means 55, audio output means 56, a secondary storage 57, a printer 58 and a primary storage 59. In view of the present invention the best mode for carrying out the invention will be described in the following: the primary storage 59 comprises a computer program comprising processor-executable instructions implementing a context supplementer 44 for supplementing the context information to the phrase and a POS-tagger 45 for assigning the at least one POS-tag to the phrase. The primary storage 59 further includes a context storage 42 comprising a plurality of context information items and an identifier storage 41 comprising a plurality of identifiers, each of which is associated with at least one context information item of the plurality of context information items. The CPU 50 executes the processor-executable instructions stored in the primary storage 59, thereby performing the implemented methods of the present invention.

The keyboard 51 may be used as identifier input 43 to obtain an identifier for a phrase. The identifier is one of the plurality of identifiers of the identifier storage 41 and therefore is associated with a context information item of the plurality of context information items. The phrase is supplemented with the context information item by the context supplementer 44. The supplemented phrase, being the input for the POS-tagger 45, is evaluated for assigning at least one POS-tag to the phrase. Any rules used in the POS-tagging process are stored as a part of the POS-tagger 45. The keyboard 51 and the pointing device 53 can be used as identifier input 43, category input 82 or phrase input 48.
The display 52 or the printer 58 can serve as result output 83, and in combination with the keyboard 51 or the pointing device 53 may be used as selection means 81. The audio input means 55 can be used as one of the input means or the selection means 81, whereas the audio output means 56 can be used as the result output 83. The secondary storage 57 serves as part of the data storage 49 and may be a hard disk, CD, DVD or the like. The secondary storage typically is used for storing language dependent data, mainly because it is exchangeable.

While the invention has been described with respect to the preferred physical embodiments constructed in accordance therewith, it will be apparent to those skilled in the art that various modifications, variations and improvements of the present invention may be made in the light of the above teachings and within the purview of the appended claims without departing from the spirit and the intended scope of the invention. In addition, those areas with which those of ordinary skill in the art are believed to be familiar have not been described herein, in order not to unnecessarily obscure the invention described herein. Accordingly it is to be understood that the invention is not to be limited by the specific illustrated embodiments, but only by the scope of the appended claims.
