Method and apparatus for generating normalized representations of strings6983240Abstract A method generates normalized representations of strings, in particular sentences. The method, which can be used for translation, receives an input string. The input string is subjected to a first operation out of a plurality of operating functions for linguistically processing the input string to generate a first normalized representation of the input string that includes linguistic information. The first normalized representation is then subjected to a second operation for replacing linguistic information in the first normalized representation by abstract variables and to generate a second normalized representation. Claims What is claimed is: Description BACKGROUND OF THE INVENTION
This example sentence may be normalized and represented through different levels of normalization which may result from one of the processes described in the following. As a first normalization process applied to the example sentence morphological variance of individual words is normalized according to its base form. An example of normalizing a word to its base form is as follows.
In addition, a syntactic category preferably disambiguated, is assigned to each word. The syntactic category is determined based on a context analysis. The following examples may be assigned to individual words of the above example:
Based on the above normalizing operations, particular linguistic units, so-called tags, are assigned to each word. Such "part-off-speech" tags may be +VB to indicate the syntactic category "verb". As an example, "use" would be tagged by +VB, "HomeCenter" would be tagged as +PN for a proper noun and "buttons" would be tagged by +NNS for being a plural noun.
Table 1 above shows the results of POS (part-of-speech) tagging with respect to the example sentence. With this normalized information, matching techniques between words may match words of text units regardless of the actual form of each word. Based on this approach, adverbials or adjectives may be ignored and matches between text sequences may be successful although both sequences may differ for particular string portions. Such a difference might be "use the buttons" and "use only the relevant button". Both strings would be matched based on the above-described approach. Parsing is a further normalization operation. A parser extracts syntactic information indicating the configuration of a string into sub-strings (e.g. phrases or chunks) and relations between sub-structures. Parsing the above given example sentence, may result in a normalized representation as follows: The result indicates that the sub-string "when you are working with the HomeCenter" is recognized as a sub-clause. The sub-clause is introduced by "when". The beginning and end of the "when"-sub-clause is indicated by markers, e.g. "SC—WHEN". The infinitive clause is enclosed by markers "IV". The above example additionally shows some dependency relations which will be described in the following in detail. A syntactic analysis may also provide explicit dependency relations between sub-strings extracted for selected predefined relations. As shown in the above syntactic information representing the first example sentence, "you" is indicated as being the subject of the verb "work". Further, the term "button" is indicated as being the object of the word "use". The expression "task" is indicated to be the object of the word "do". The expression "HomeCenter" is indicated to modify the verb "work". Such dependency relations may also be noted as follows:
Such relations may be used to find matches between common syntactic structures of sentences without regarding their actual surface realization. An example for a successful matching process for different sentences may be as follows:
In both above example sentences "technician(s)" is the subject of the verb "replace"; the term "parts" is the object of the verb "replace". For normalizing strings, e.g. sentence, also taxonomic and semantic links may be used. Such a semantic link may create from an individual word a connection to a super-ordinated term having a more general meaning, to a synonym word or even to another more specialized term for the same super-ordinated term. Such semantic links may exist between "jet plane" and "civil aircraft" or between "cattle", "breeder" and "farmer". Semantic normalization relies on pre-existing language resources such as ontologies, thesauri, terminological databases and on semantic disambiguation processes. By employing operations for generating semantic links, the first example sentence described above might be annotated with the following semantic tags:
The above-described linguistic processes correspond to different levels of abstraction and provide information of increasing complexity. In addition, their representation scheme of the present invention makes use of another type of operation, named "skeletisation". Sekeletisation operates on the output of linguistic processes such as described above and provides another level of abstraction. More precisely, skeletisation replaces selected linguistic units by abstract variables. As an example, skeletisation operating on the output of a shallow parser and abstracting over direct objects, proper names and technical terms, produces the following normalized representation for the first example sentence:
In the above example, the two direct objects "the buttons" and "one or more tasks" are replaced by the index variables OBJ1 and OBJ2. Further, the expression "HomeCenter" is replaced by the variable PROPER1. Different skeletisation schemes may be used which are based on various linguistic parameters. As another example adverbial clauses may be abstracted. Such a clause is the "when-clause" in the first example sentence. Such a skeletisation process would result in the following representation of the sub-sentence:
In addition to the different possible kinds of skeletisation operations, it is possible to combine the operations into a complex one. Both above proposed skeletisation operations may be combined to produce the following representation:
A skeletisation may be more generally seen as a function Skel(x, Abstr, Filter(x)). In this function, "x" is a linguistic representation attached to an input string, "Abstr" is the set of linguistic units to be abstracted in "x" and how they should be represented in the output, and "Filter(x)" specifies how not-abstracted elements of x should be re-written (e.g., translation). By default, "Filter(x)" produces a sequence of not-abstracted surface forms as shown in the example representation of both skeletisation functions. The above described skeletisation processes may be defined by such skeletisation functions. The first described skeletisation operation is directed to abstract over direct objects, proper names and technical terms. This skeletisation function may be described by function Skel(x, {OBJ,PN},). This function indicates that parts of a sentence marked by OBJ or PN must be abstracted, while the other elements of the sentence remain unchanged. The second above described skeletisation process may be described by the following skeletisation function Skel(x, {WHEN—CLAUSE},). The finally described skeletisation procedure combining the first two may be described as Skel(x, {OBJ, PN, WHEN—CLAUSE},). A linguistic unit to be abstracted may be specified in more detail in order to only abstract particular selected kinds of words. As an example, only female objects may be abstracted. Such a skeletisation function would have a parameter specifying the abstract parameter as OBJ&Fem. The parameter "Abstr" directly specifies the type of variable to replace the term in the sentence. Each such variable is indexed according to its position of occurrence in a string or sentence. Thus, the first direct object is marked by OBJ1 in order to uniquely identify this variable. This skeletisation process highly depends on the linguistic representation in use. The examples above are only meant to illustrate the scope of invention. A plurality of different skeletisation functions is obvious for a person skilled in the art based on the above given examples. When performing a plurality of normalization functions, each results in a different representation for the input sentence. The normalization procedures may also be combined in different ways. Each input sentence can be stored together with a set of representations provided by the employed normalization procedures. The first example sentence may be stored together with the following sequence of three representations:
As already mentioned the given sentence may result in many different sequences of representations depending on the different linguistic processes and skeletisation functions used. In the following the selection of appropriate skeletisation functions or sequences of skeletisation functions will be described. In the form of Skel(x, Abstr, Filter(x)), "Abstr" defines those elements within a string, namely a sentence x, which are to be abstracted, i.e. replaced by a variable. Elements to be replaced by variables may be lexical units, phrases, clauses, or any text element which should be abstracted. In the following, the term "skeleton" will be used for the result of a skeletisation function. Skeletons may be ranked depending on their degree of generalization. When assuming Skel1=Skel(x1, Abstr1, Filter(x)) and Skel2=Skel(x2, Abstr2, Filter(x)), then Skel1 is less general than Skel2. Those representations produced by Skel2 include elements to be replaced according to Skel1. An example for such a different generalization degree is shown by the following example wherein skeleton (a) is more general that skeleton (b) as (b) may be produced from (a):
Skeletisation functions may be ranked according to their degree of generalization based on the following ordering property. For any non empty sets "Abstr1" and "Abstr2" of elements to be abstracted, in case Abstr1⊂Abstr2, then all skeletons obtained with Skel1 are less general than those skeletons obtained with Skel2. Then, the skeletisation function Skel1 is less general than the skeletisation Skel2. Different schemes may be used to represent and select sequences of normalized representations efficiently. The following proposal for a representation in the form of a binary tree is a particular embodiment of the present invention. However, it is obvious to a person skilled in the art that a plurality of other representations may be possible based on the given examples. When using the above ordering property, sequences of skeletisation functions may be represented using a binary tree. Each node of such a binary tree represents a skeletisation function. Each non-terminal node in the binary tree branches to two descendents, namely a left child and a right child. The left child represents the application of a non-empty skeletisation function, whereas its right child represents the identity function "idty". This identity function "idty" leaves the input unchanged. Thus, the identity function can be regarded as an empty skeletisation function (i.e. a skeletisation in which Abstr is empty). A sequence of subsequently performed skeletisation functions is represented by a path within the binary tree. Such a path starts from the top of a tree and connects via a plurality of nodes (skeletisation operations) a terminal node. An example will be given in connection with FIG. 5. Assuming that the following two sentences are subjected to sequences of skeletisation operations:
The following elements are to be replaced by variables:
Based on the above listed skeletisation operations for the two example sentences, the binary tree of FIG. 5 may represent sequences of skeletisation functions. For sake of clarity, only a part of the tree is shown in FIG. 5. As both sentences, namely sentence 1 and 2, may be abstracted using the whole set of above described skeletisation functions, they appear on the left most node of the tree. Such nodes may be marked in order to indicate that it is not possible to further abstract these representations based on a given set of sekeletisation functions. For any given node N within the binary tree, those nodes which are located above or on the right of the given node N wherein the depth thereof is less than or equal to the one of N defines a set of nodes which are equivalent to N. In a more particular embodiment, the set of text elements to be abstracted may be divided into two parts, the first part consisting of those elements which have to appear in the sentence in a precise order, and a second part consisting of those elements which may appear in any position within a sentence. The above described binary tree representation is only one among others which may be derived from the described ordering property. The selection of skeletons for sentences depends on the application for which the normalization procedure is intended. Skeletons relevant for translation memories may not be relevant for retrieval or for monolingual authoring. For retrieval applications, it is important that a retained skeletisation functions partitions efficiently the set of sentences, so that the search for similar sentences involves few steps. Thus, those skeletisation functions are to be selected, which discriminate best-input sentences and which stop skeletising sentences when the associated skeletisation function applies only to a few sentences. Similar considerations have to be carried out for other kinds of applications. A selection of operating functions, in particular skeletisation functions, and their order may be carried out by machine learning. Referring now again FIG. 1, the configuration of an information retrieval system 101 further comprises a "search line" which comprises a query formatting unit 104 for formatting a query and generating a normalized representation for the query. In the formatting stage, the query is transformed by the unit 104 into a normalized representation, using the same methods 103 as described for text units. The results of retrieving operations performed by a matching unit 106 will be output as retrieved strings. As illustrated in FIG. 2, a computer system 200 according to the present invention comprises a processor 201 and a main storage 202 connected to the processor 201. The processor is further connected to an input unit 203 for inputting a data base query by a user. The results of the query are displayed on a display unit 204 or may be printed out via printer 205. The information is retrieved from a database 206 included in the computer system 200. The database may be provided on a hard disk of the computer system or on a particular storage media as CDs, DVDs or tape recording media. Alternatively, database 206 may be stored on a remote location connected to the computer system via a data transmission network. In a further alternative embodiment, database 206 may comprise texts stored in a plurality of computers connected in a network at high speed. Processor 201 processing a query may also be used to process sentences. The processor 201 may further use a thesaurus memory 207. The thesaurus memory may provide semantic relations between terms or a plurality of terms. All those skilled in the art will appreciate, that the application of the method and apparatus according to the present invention is not limited to particular memories. All parts of the memories 206, 207, may be realized as a distinct area in a single memory. As those skilled in the art will further appreciate, the computer system 200 preferably comprises a main memory 202 for storing program code, such as operating and application programs 101,104, and 106 and data 102,103, and 105. It will also be appreciated by those skilled in the art that the processor 201 executes program code to process strings or queries. The computer 200 preferably comprises an external memory, such as a hard disk drive and floppy disk drive, for storing a problem code and data memory permanently. Computer system 200 may further comprise a parsing unit 208 in order to extract syntactic relations between terms of a sentence. Alternatively, parsing unit 208 may be provided as an application program to be performed by processor 201. A process of information retrieval will now be described in more detail with reference to the flow-chart diagrams as depicted in FIG. 3 and FIG. 4. FIG. 3 illustrates general steps in an information retrieval system. First, in a string processing step 301 normalized representations are generated for all strings or text units in a database. Details of the string processing are given in FIG. 4. In order to retrieve strings from a database, the user has to input a database query to find those strings relevant to the given input information. The received query is processed in the computer system 200 in step 302 in order to be in a format which may be compared to the stored data in the database. For a retrieval, the query has to be compared with the stored data. The retrieval of strings in step 303 is based on the similarity of each string with respect to the database query. The retrieved information, which may include a translated version, is presented in step 304 to a user based on the relevance of each string with respect to the database query. FIG. 4 illustrates in more detail the processing of strings, i.e. text units or sentences, for obtaining a database according to the present invention. The manner of knowledge representation, namely architecture of the database is an essential aspect of the present invention. All those skilled in the art will appreciate, that the particular knowledge representation and method of generating such knowledge representation is not restricted to a document retrieval application but may be a basis for knowledge extraction, translation information apparatus, etc. First, in step 401 linguistic information is extracted from an input sentence as described above. Linguistic information includes morphological, syntactical, and semantic information and ways to disambiguate between. Based on the extracted information, a normalized representation is generated. Such a normalizing procedure may be carried out with respect to different aspects a plurality of times. The performed sequence of operations will result in a corresponding plurality of normalized representations having a different degree of generalization. Further, as indicated in step 402, a skeletisation operation is applied to the normalized representation(s). The skeletisation operations replace particular elements of a string or sentence by a variable in order to allow for better matches during an information retrieval process. Such variables will match to corresponding variables of other sentences regardless of the particular originally used text elements in text units or sentences to be matched. The normalized representations resulting from the linguistic extracting step and the skeletisation step may each (or only selected ones) be stored together with the particular sentence, as for example strings 102. Based on such databases, information retrieval and the provision of translation information may be improved based on an improved access to the relevant information stored. It will be appreciated that portions of the system embodying the present invention may be readily implemented in software using software development environments that provide source code that can be used on a variety of hardware platforms. Alternatively, portions of the system may be implemented partially or fully in hardware using standard logic circuits. Whether software or hardware is used to implement different portions of the system varies depending on the speed and efficiency requirements of the system and also the particular function and the particular software or hardware systems and the particular microprocessor or microcomputer systems being utilized. Also it will be appreciated by those skilled in the art that any resulting program(s) incorporating the present invention, having computer-readable program code, may be embodied within one or more computer-usable media such as memory devices or transmitting devices, thereby making a computer program product or article of manufacture. As such, the terms "article of manufacture" and "computer program product" as used herein are intended to encompass a computer program existent (permanently, temporarily, or transitorily) on any computer-usable medium such as on any memory device or in any transmitting device.
|
Same subclass Same class Consider this |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
