Method for measuring thresholded relevance of a document to a specified topic
5778363
Abstract
A method is provided for specifying the representation of a document and determining the relevance of the document according to an externally defined topic profile. The topic profile includes one or more compound terms having a positive correlation with the topic of interest. Each compound term has a specified form such as capitalization, punctuation, number, or adjacency relation, that is either ignored by conventional indexing processes or requires substantial data overhead to track. The compound terms of the topic profile are tagged to indicate how corresponding terms are treated when identified in a document being analyzed. Application of the topic profile to a document generates a document representation in which compound terms present in the document are retained in their specified form. A similarity function between the document representation and the topic profile is calculated, and the result is compared to a relevance threshold associated with the topic profile. A document is deemed relevant to the topic when the similarity function meets or exceeds the threshold.
Claims
What is claimed is:
Description
RELATED PATENT APPLICATIONS
TABLE 1
______________________________________
Entry  2A       2B   2B'  2B"  2C    3A       3B  3B'  3C
______________________________________
Term   Savings  And  &    and  Loan  savings  &   and  loan
______________________________________
Entry 350 is the initial term of a compound term template 240(4) representing a compound term that has only a single component term. Consequently, entry 350 includes a label (term 4A), a tag, and a canonical name, but does not include pointers to other terms. Such compound term templates are useful for specifying a term for which a specific capitalization is sought. For example, a search term "thrift" may pick up documents relating to a wide variety of topics. By specifying it as a compound term and including a template for only the capitalized form of the word, searchers can identify only those documents that use the term to refer to financial institutions. Another single component compound term useful in the same search is "thrifts", where compounding is used to retain the plural form in the document representation. A two component compound term, "the thrift", would also be useful for these purposes. Because, for example, "thrift" and "S&L" are used interchangeably to describe certain financial institutions, they are identified by the same canonical name.
Where a compound term comprises a single term entry, i.e. the initial term, the comparison consists of matching the label of the initial term entry to the initial term used to access the compound term template. Where alternative form templates are associated with an initial term, the alternative form templates are accessed along with the compound term template, and comparison of stream terms with the component terms of the templates proceeds in parallel.
As noted above, the externally defined topic profile both generates a document representation and determines the relevance of a document to a topic specified by the topic profile. The terms of the topic profile identify the vocabulary used to discuss the specified topic. The topic profile identifies and preserves these terms, where possible, in a document representation. 
Accordingly, terms selected for the topic profile are those having a strong, positive correlation with the specified topic and a weak positive correlation with topics that are unrelated to the specified topic. In practice, however, it is often difficult to identify terms that are both strongly and uniquely correlated with the specified topic. One embodiment of the present invention employs a multiple weighting scheme to fine tune the relevance analysis, where necessary to compensate for the potential ambiguities of terms selected for the topic profile. In particular, it allows a document to be analyzed against a threshold standard using a first set of weights (selection weights) for profile terms identified in the document. A document meeting the threshold criteria is deemed relevant to the specified topic. A second set of weights (production weights) may be employed for determining a measure of relevance for the document from the profile terms identified in the document. The production weights are adjusted to reflect, among other things, the greater likelihood that a profile term present in a document deemed relevant is being used to discuss the specified topic.
The tagging feature of topic profiles is employed to identify compound terms and to bypass or bias the similarity function/weighting scheme for selected profile terms. The use of tags to bias the weighting scheme may arise, for example, where a particular profile term is invariably used in discussions of the specified topic, e.g. a term having >95% correlation with the specified topic. In order to recognize the significance of this term in a document, without artificially depressing the weights used to determine threshold relevance from other profile terms, the profile term may be tagged so that documents including the term are analyzed in a non-standard manner. 
The use of tags in this manner is suggested only where circumstances indicate that the weighting scheme may fail to capture some relevant documents or may capture a large number of irrelevant documents. Referring now to Table 2, there is shown a pseudo-code representation of a topic profile in accordance with the present invention. For completeness, the topic profile is represented with tags and multiple weights for profile terms.
TABLE 2
______________________________________
<Topic Profile> := <List of Topic Profile Elements>
<List of Topic Profile Elements> := <Tagged Primary Item><List of Topic Profile Elements>
<Tagged Primary Item> := <Tagged Compound Term> | <Tagged Simple Term>
<Tagged Compound Term> := <Item Modifier><Compound Term>
<Tagged Simple Term> := <Item Modifier><Token>
<Item Modifier> := <Selection Wt.><Production Wt.><Item Tag List>
<Selection Wt.> := <Number> | <Null>
<Production Wt.> := <Number> | <Null>
<Item Tag List> := <Item Tag><Item Tag List> | <Item Tag>
<Compound Term> := <List of Tokens>
<List of Tokens> := <Token><List of Tokens> | <Token>
<Item Tag> := <Item Tag> | <Null>
<Item Tag> := ATOM
<Token> := ATOM
<Number> := ATOM
______________________________________
As used in Table 2, a token is a word, number, or punctuation mark which follows the same rules used in tokenizing 110 (FIG. 1) a document into a stream of document terms. That is, if tokenization step 110 considers apostrophes and dashes as part of a single token, the topic profile represented by Table 2 will include apostrophes and dashes in the token. Alternatively, if tokenization step 110 considers apostrophes or dashes as token separators, the topic profile tokens will be defined similarly. In the preferred embodiment of the invention, documents are tokenized to preserve capitalization, number, and the like, at least through compound term recognition step 120. An ATOM represents a term, weight (selection or production), or tag in a topic profile.
A selection weight associated with a profile term indicates how significant the presence of the profile term in the document is in an initial threshold analysis. In one embodiment of the invention, a similarity function between the topic profile and the document is calculated using the selection weights of profile terms identified in the document and compared against a relevance threshold value. This comparison serves as a binary filter for the document set. Assigning a large selection weight to a profile term indicates that the term has a significant positive correlation with the specified topic and does not have a significant positive correlation with other topics. A production weight associated with the term indicates how relevant the document is to the specified topic once it has been established that the document meets the relevance threshold. Differences between the selection and production weights may be used, for example, to reflect the greater likelihood that a term, which may alias other topics, is being employed in a discussion of the specified topic once the overall relevance of the document to the specified topic has been established. 
In sum, associating with each term a selection weight and a production weight that may be set independently, allows the presence of a profile term in a document to be de-emphasized, i.e. assigned a lower selection weight, when determining the threshold relevance of the document if the profile term correlates strongly with topics other than the specified topic, i.e. aliasing. Once threshold relevance to the specified topic has been established for the document, use of a higher production weight reflects the increased likelihood that the term is being used to discuss the specified topic. A well-selected assignment of production weights to the terms of a topic profile provides measures of relevance for a document set that are spread over a range above the threshold level in a manner that roughly tracks the ordering an expert would assign to the relevant documents. In the absence of any significant overlap between the vocabularies used to discuss the specified topic and those used to discuss other topics, the selection weights and production weights of the profile terms would be closely related if not identical. In fact, where selection weights are not given, a default assignment for the selection weight may be to set it equal to the production weight. More generally, however, associating with each term selection and production weights that are set independently, allows the topic profile to compensate for the aliasing potential of profile terms without ignoring the terms entirely. In the present invention, overlap between a topic profile and document representation is determined by a similarity function. 
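The independent selection and production weights described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the `ProfileTerm` class and its field names are hypothetical, and the example weights are the ones Table 3 lists for "mortgage" and "Bank".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProfileTerm:
    """One profile term with independently settable weights (hypothetical structure)."""
    token: str
    production_weight: float
    selection_weight: Optional[float] = None  # None means "not given"

    def __post_init__(self) -> None:
        # Default described in the text: when no selection weight is given,
        # set it equal to the production weight.
        if self.selection_weight is None:
            self.selection_weight = self.production_weight


# An alias term: ignored for thresholding, counted once relevance is established.
mortgage = ProfileTerm("mortgage", production_weight=60, selection_weight=0)
# A term with a single weight: the selection weight defaults to the production weight.
bank = ProfileTerm("Bank", production_weight=180)
```

Keeping the two weights as separate fields lets an aliasing term be de-emphasized at the threshold stage without removing it from the profile.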
In one embodiment of the present invention, the similarity function is given by Equation (I):
$$S(\mathrm{Doc}_i, \mathrm{Profile}) = \frac{1}{N} \sum_{\mathrm{PTERM}_k \in \mathrm{Profile}} \mathrm{DTERM}_{ik} \cdot \mathrm{PTERM}_k \qquad \text{(I)}$$
Here, $\mathrm{DTERM}_{ik}$ indicates whether the k-th term of the topic profile is present in the i-th document of the document set ($\mathrm{Doc}_i$), $\mathrm{PTERM}_k$ is a weight associated with the k-th term of the topic profile, Profile is a collection of PTERMs specified in the topic profile, and N is a normalization constant. As noted above, $\{\mathrm{PTERM}_k\}$ may be unity, a weight, or a set of weights for each profile term, depending on the analysis scheme chosen. $\mathrm{DTERM}_{ik}$ is 1 or zero, depending on whether profile term k is present in $\mathrm{Doc}_i$. The set $\{\mathrm{DTERM}_{ik}\}$ may be derived from the document representation by identifying which terms of the topic profile are present in the document representation. Other functions providing comparable measures of overlap between the document representation and topic profile may be used in place of Equation (I).
As noted above, tagging allows profile terms in a topic profile to be treated in a non-standard manner. In particular, tagging profile terms provides a way to force the relevance threshold determination to a specified outcome, independent of the selection weights of profile terms identified in the document. In effect, tagging may be used to replace the selection weights of a topic profile with a single, binary weight (0/1) determined by the presence or absence of a tagged profile term. In this case, Equation (I) is not used for determining whether a document meets the relevance threshold. However, where a multiple weighting scheme is employed, Equation (I) will be used in a second calculation based on production weights. As noted above, tagging may be appropriate for a profile term that consistently appears in any discussion of the specified topic and, consequently, has a high, positive correlation with the specified topic, e.g. greater than 95%. 
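As an illustration of Equation (I), the following Python sketch sums the weights of the profile terms found in a document and scales the sum by the normalization constant. It assumes that N divides the sum; the function name `similarity`, the choice of N as the total of all profile weights, and the sample document are assumptions made for the example only.

```python
from typing import Dict, Set


def similarity(doc_terms: Set[str], profile: Dict[str, float], n: float) -> float:
    """Sum the weights of profile terms present in the document, scaled by 1/n."""
    if n <= 0:
        raise ValueError("normalization constant must be positive")
    total = 0.0
    for term, weight in profile.items():
        dterm = 1 if term in doc_terms else 0  # DTERM_ik: presence indicator
        total += dterm * weight                # DTERM_ik * PTERM_k
    return total / n


# Weights taken from Table 3; N is chosen here (an assumption) as the sum of all weights.
profile = {"Bank": 180.0, "s & l": 150.0, "credit union": 80.0}
doc = {"Bank", "credit union", "quarterly"}
score = similarity(doc, profile, n=sum(profile.values()))
```

With this choice of N the score falls between 0 and 1, which makes a relevance threshold easy to state as a fraction of the maximum possible overlap.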
In this case, the profile term may be tagged to indicate that its presence in a document is "required" in order for the document to satisfy the relevance threshold criteria. Where a topic profile includes a profile term tagged "required", no document will be deemed relevant to the selected topic unless it includes the "required" term. On the other hand, unless the "required" term is uniquely correlated with the specified topic, characterizing a document as relevant, based solely on the presence of the "required" term, ignores the relevance-discriminating power of other profile terms and is likely to lead to inclusion of non-relevant documents. Accordingly, the presence of a "required" term in a document is best regarded as a necessary but not sufficient condition for the document to be deemed relevant. In some instances where tagging is employed, it may be that the presence in a document of any one of a number of alternative terms satisfies the "required" condition. Accordingly, each term may be tagged "required", and detection of any one of the "required" terms would suffice to meet the required-term threshold. If none of the "required" terms is present, the document will be deemed not relevant, even where the selection weights of profile terms present in the document would yield a value in excess of the relevance threshold value.
Another potentially useful category of tagged terms comprises those terms whose presence in a document substantially guarantees that the specified topic is discussed, i.e. the presence of the term is a "giveaway" that the document is relevant. "Giveaway" terms are those terms that discriminate strongly against all but the specified topic and are strongly correlated with the specified topic. The presence of a giveaway term in the document is taken as such a strong indicator that the specified topic is being discussed that the relevance threshold determination is bypassed. 
In this case, the presence of a giveaway term is regarded as a sufficient but not necessary condition for the document's being relevant to the specified topic. Tagging a term as "giveaway" is equivalent to assigning a very large selection weight to the term in the topic profile. However, using the "giveaway" tag instead of a large selection weight, preserves the relevance-discriminating value of the selection weights of other profile terms in those documents where the "giveaway" term is not present. Other tags are used to determine how compound terms are treated, when they are identified in the document. For example, where a profile term is a compound term comprising multiple-component terms, tags specified in the data structure and associated with the token(s) introduced into the document stream indicate whether the tokens for the component terms are to be retained in the term representation. For example, component terms that alias topics other than the one represented by the topic profile are tagged for elimination at stopping step 130. Referring now to Table 3, there is shown an example of a topic profile useful for identifying documents relevant to the banking industry.
TABLE 3
______________________________________
Profile Terms        Canonical Name (Token)   Weight   Tag
______________________________________
mortgage             mortgage                 0/60
"Bank"               Bank                     180      --
"s & l"              s & l                    150      --
"savings & loan"
"Savings & Loan"
"savings and loan"
"Savings and Loan"
"Thrift"
"the thrift"
"thrifts"
"credit union"       credit union             80       --
"West Bank"          West Bank                0/0      %
"Left Bank"          Left Bank                0/0      %
______________________________________
In Table 3, compound terms are identified, i.e. tagged, with quotation marks. These are the selected profile terms that will be used to identify and tag tokenized document terms for the document representation when they appear in the specified form. For example, only the term "Bank", with a capital "B", will be retained in a document representation, since "bank" appears in many contexts unrelated to the financial industry. To distinguish documents that include references to "Left Bank" and "West Bank", these compound terms are tagged with %. This tag indicates that the instances of the terms "Left", "West", and "Bank" are not to be retained in the document representation when they are identified in the document within the compound term. The compound terms "West Bank" and "Left Bank" are retained but are assigned 0 selection and production weights. In this example, tagging avoids aliasing, whereby a term (Bank) correlated with the topic may also correlate with a different topic in a different context. "West Bank" and "Left Bank" are relevant to discussions of the Mid-East and Paris, respectively, despite their inclusion of the compound term "Bank". Since neither topic is relevant to Financial Institutions, the tags provide a convenient way to eliminate them from consideration.
The terms "Thrift", "thrifts", and "the thrift" are compound terms because they specify capitalization, number (plural), and multiple terms, respectively. The specific features of these terms are specified to distinguish them from the common term, thrift, which may appear in many different contexts. Index-based representations typically eliminate capitalization and number information in the indexing process. Consequently, they require substantially more data, i.e. location information for all index terms, to recognize multiple component terms and to distinguish alias terms. 
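The treatment of compound terms described above can be sketched as a longest-match pass over the token stream. This is a hedged illustration, not the template mechanism of the patent: the `COMPOUNDS` table holds only a subset of Table 3, the boolean `drop_components` stands in for the % tag, and the function name `represent` is hypothetical.

```python
from typing import Dict, List, Tuple

# Compound term -> (canonical token, drop_components).
# drop_components=True models the "%" tag: component tokens are not retained.
COMPOUNDS: Dict[Tuple[str, ...], Tuple[str, bool]] = {
    ("West", "Bank"): ("West Bank", True),
    ("Left", "Bank"): ("Left Bank", True),
    ("Bank",): ("Bank", False),
    ("Thrift",): ("s & l", False),
    ("thrifts",): ("s & l", False),
    ("the", "thrift"): ("s & l", False),
    ("savings", "and", "loan"): ("s & l", False),
}
MAX_LEN = max(len(key) for key in COMPOUNDS)


def represent(tokens: List[str]) -> List[str]:
    """Replace compound terms with canonical tokens, trying the longest match first."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        for length in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + length])
            if key in COMPOUNDS:
                canonical, drop_components = COMPOUNDS[key]
                out.append(canonical)
                if not drop_components and length > 1:
                    out.extend(key)  # keep component tokens alongside the compound
                i += length
                break
        else:
            out.append(tokens[i])  # ordinary term, passed through unchanged
            i += 1
    return out


west_bank_doc = represent("the West Bank and the Bank".split())
```

Because "West Bank" is matched before the single-component term "Bank", the capitalized "Bank" inside it never reaches the document representation, while the standalone "Bank" does. Lowercase "bank" matches nothing and passes through as an ordinary term.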
The various forms of "savings and loan" are also compound terms because they include multiple terms and/or specific capitalizations. As indicated in Table 3, the forms of "thrift" and "savings and loan" are all represented by the same token, s&l, in the document representation, and given the same weight in the topic profile. This strategy reflects the fact that these compound terms refer to the same type of entity, i.e. a savings and loan institution, and avoids overweighting the relevance of documents that use multiple terms to reference the same entity.
Referring still to Table 3, the profile term, mortgage, is assigned different selection (0) and production (60) weights. This reflects the fact that "mortgage" appears in different contexts, e.g. real estate, and so should not be weighted heavily in determining threshold relevance. Once a document is deemed relevant, however, the presence of the term mortgage in the document is accorded weight in calculating the similarity function and, thus, the measure of relevance of the document to financial institutions. Where terms are assigned only a single weight, the production and selection weights are the same. If only single weights are assigned to all terms in the topic profile, a separate calculation of the threshold relevance is unnecessary. Also, if no weights are assigned, a default value, e.g. 100, may be assumed for calculating similarity functions. In each case, the normalization factor N of Equation (I) should be adjusted to reflect the different weighting scheme.
It is useful to define different categories for characterizing terms considered for inclusion in a topic profile. In the following discussion, "common" terms are those terms that arise frequently in the discussion of many different topics and consequently lack any power to discriminate between different topics. 
Common terms, such as articles, pronouns, and prepositions, correspond to those terms eliminated from document indices by stopping (step 130 in FIG. 1). In addition to these terms, common terms, as used in reference to topic profiles, may include words like "ground", "statement", "authorize", "task", which alias many different topics. "Unambiguous terms" refers to those terms that are strongly correlated with the specified topic and which are not strongly correlated with any other topics. These terms, while rare, are ideal candidates for inclusion in a topic profile that characterizes the specified topic. As noted above, "alias terms" refers to those terms that are correlated with the specified topic, but which may be correlated with other topics in different contexts. For example, the term "French" is not a common term, and it has a strong, positive correlation with the topic of French culture. However, this term may also appear in documents relating to music (French horn), breakfast foods (French toast), and fast foods (French fries). Aliasing refers to situations in which a profile term that is strongly correlated with a specified topic may also appear in a different context with a meaning different from that in the specified topic. Alias terms are those terms having the context-dependent meaning. Ambiguities introduced by aliasing may often be removed by defining compound terms which clarify the context in which the term is being used. For example, the terms "French toast", "French horn", and "French fries" may be identified in a document as compound terms and tagged to indicate their treatment in analyses for the specified topic. In the above example, any incidence of "French" appearing in the compound terms may be ignored, i.e. the compound terms are assigned 0 weight and tagged with %, when determining the relevance of a document to French culture. Characterization of terms as common, unambiguous, or alias is neither exact nor absolute. 
Terms considered for a topic profile will fall along a continuum ranging from those that correlate strongly and exclusively with the specified topic, i.e. unambiguous terms, to those that apply to just about any topic, i.e. common terms. It is expected that profile terms and their associated weights will be determined in part by trial and error. Referring now to FIG. 4, there is shown a flowchart of a method 400 for generating a topic profile in accordance with the present invention. Initially, method 400 identifies 410 terms used to discuss the specified topic and eliminates 420 common terms from among the identified terms. The remaining terms are separated 430 into unambiguous terms and alias terms. Unambiguous terms are included 440 in the topic profile and, if weights are being used, appropriate selection and production weights are assigned. The remaining alias terms are treated according to how strongly they may be made to correlate with the specified topic. Selected alias terms may be used to greater effect by forming compound terms 450 that resolve the contextual ambiguity. This is the case in the use of variations on "thrift" described above. Alias terms that can not be resolved by compounding but are strongly correlated with the specified topic may also be included 460 in the topic profile. If no weighting scheme is used or only a single weight is associated with each term, inclusion of alias terms is likely to turn up some irrelevant documents. In the dual weighting scheme, however, the selection weight of such an aliased term may be set to zero or a very low value, while the production weight is assigned a non-zero value representative of the correlation between the term and the specified topic. 
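The sorting of candidate terms in method 400 can be sketched as follows. The patent expects terms and weights to be settled partly by trial and error, so everything numeric here is an assumption for illustration: the `Candidate` fields, the correlation estimates, and the `strong` and `weak` cut-offs are hypothetical, and the compound forms are supplied by hand.

```python
from typing import Dict, List, NamedTuple, Set, Tuple


class Candidate(NamedTuple):
    term: str
    topic_corr: float  # assumed estimate of correlation with the specified topic (0..1)
    other_corr: float  # assumed estimate of correlation with other topics (0..1)
    weight: int        # weight to use if the term is included


def build_profile(
    candidates: List[Candidate],
    common: Set[str],
    compounds: Dict[str, List[str]],
    strong: float = 0.6,
    weak: float = 0.2,
) -> Dict[str, Tuple[int, int]]:
    """Return term -> (selection weight, production weight)."""
    profile: Dict[str, Tuple[int, int]] = {}
    for c in candidates:
        if c.term in common:                       # step 420: eliminate common terms
            continue
        strongly_on_topic = c.topic_corr >= strong
        if strongly_on_topic and c.other_corr <= weak:
            profile[c.term] = (c.weight, c.weight)  # step 440: unambiguous term
        elif c.term in compounds:
            for form in compounds[c.term]:          # step 450: resolve by compounding
                profile[form] = (c.weight, c.weight)
        elif strongly_on_topic:
            profile[c.term] = (0, c.weight)         # step 460: alias, zero selection weight
        # Weakly correlated alias terms are left out of the profile.
    return profile


candidates = [
    Candidate("statement", 0.2, 0.9, 0),
    Candidate("credit union", 0.9, 0.1, 80),
    Candidate("thrift", 0.7, 0.6, 150),
    Candidate("mortgage", 0.7, 0.5, 60),
    Candidate("branch", 0.3, 0.7, 10),
]
profile = build_profile(
    candidates,
    common={"statement"},
    compounds={"thrift": ["Thrift", "thrifts", "the thrift"]},
)
```

In this sketch "mortgage" enters the profile with weights (0, 60), matching its 0/60 entry in Table 3, while the weakly correlated alias "branch" is dropped.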
This strategy allows the presence of the term in a document to be considered in determining the relevance measure of the document, provided the document is deemed to meet the threshold relevance value based on the presence of other (non-aliased) terms in the document. Alias terms that are only weakly correlated with the specified topic are not included in the topic profile. This embodiment of method 400 ensures that a document including terms that alias a different topic is not discarded at the relevance threshold stage, provided it also contains unambiguous terms. Assignment of a non-zero production weight to the alias term allows it to be considered in relevance determinations of those documents that meet the relevance threshold based on the presence of other terms. Referring now to FIG. 5A, there is shown an overview of a method 500 for determining a measure of relevance of a document to a specified topic, in accordance with the present invention. A topic profile for the subject of interest is selected 510 to analyze the document. The selected topic profile may be generated specifically for the analysis, or a previously generated topic profile may be retrieved for the analysis. The topic profile is applied 520 to the document to generate a document representation augmented to include tokens for any compound terms specified in the topic profile. As noted above, this occurs prior to the conventional stopping and stemming steps (steps 130 and 140 in FIG. 1), to preserve details of the original document text that are otherwise eliminated by these steps. A similarity function between the document representation and the topic profile is then calculated 530 to determine a measure of the corresponding document's relevance to the topic. The similarity function is calculated using any profile terms, whether single or compound, present in the document representation and with the weighting scheme indicated in the topic profile. 
For example, using the similarity function represented by Equation (I), only those document terms (DTERMs) that are also profile terms (PTERMs) are counted. Where no weighting scheme is employed, $\mathrm{PTERM}_k = 1$ for all k, i.e. all profile terms identified in a document representation are given equal weighting. In effect, Equation (I) increments a sum by $\mathrm{PTERM}_k$ for each profile term (k) present in the document representation and scales the sum by a normalization factor. The value generated by the similarity function, Equation (I), is compared 540 to a threshold specified for the topic profile. The document is deemed 550 irrelevant when the threshold level is not met and it is deemed 560 relevant when the threshold level is met or exceeded.
Referring now to FIG. 5B, there is shown an embodiment of method 500 (method 500') in which a threshold relevance and a comparative relevance are calculated 530, 544 using selection and production weights, respectively, associated with profile terms identified in the document. In particular, the selection weights of the identified terms are used to calculate 530 a similarity function between the identified terms and the topic profile. If the calculated similarity function meets or exceeds 560 a threshold value, the document is deemed relevant to the topic represented by the topic profile, and a similarity function between the identified terms and the topic profile is calculated 580 using the production weights of the profile terms identified in the representation. A topic value (TV) is then assigned 590 the value of this similarity function.
Referring now to FIG. 5C, there is shown another embodiment of method 500 (method 500") in which selected profile terms are tagged to bypass or modify calculation of a similarity function for threshold determination. In method 500", threshold relevance may be determined 530 in one of several ways. 
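The two-stage calculation of method 500' can be sketched in Python as follows. It is a minimal sketch under stated assumptions: the function name `topic_value`, the normalization constant of 100, and the threshold of 1.5 are chosen for the example and do not come from the patent; the weights are those of Table 3.

```python
from typing import Dict, Optional, Set, Tuple

Weights = Tuple[float, float]  # (selection weight, production weight)


def topic_value(
    doc_terms: Set[str],
    profile: Dict[str, Weights],
    threshold: float,
    n: float,
) -> Optional[float]:
    """Return the topic value, or None when the document misses the threshold."""
    present = [w for term, w in profile.items() if term in doc_terms]
    # Stage 1: threshold relevance from the selection weights.
    selection_score = sum(sel for sel, _ in present) / n
    if selection_score < threshold:
        return None
    # Stage 2: measure of relevance from the production weights.
    return sum(prod for _, prod in present) / n


profile = {
    "mortgage": (0, 60),
    "Bank": (180, 180),
    "s & l": (150, 150),
    "credit union": (80, 80),
}
tv_relevant = topic_value({"mortgage", "Bank"}, profile, threshold=1.5, n=100)
tv_rejected = topic_value({"mortgage", "credit union"}, profile, threshold=1.5, n=100)
```

In the first document "mortgage" contributes nothing toward the threshold but adds 60 to the topic value once "Bank" has carried the document over it. In the second document the selection score of 0.8 falls short, so no topic value is produced.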
For illustration, consider the case where a topic profile includes profile terms tagged "required" and "giveaway". If the representation does not include 524 the "required" profile term, the document is deemed 540 not relevant, independent of the correlation between the document representation and the other terms of the topic profile. If the representation includes 524 the "required" profile term, the analysis proceeds. If the representation includes 526 a profile term tagged "giveaway", the document is deemed relevant 560, and a topic value is calculated 580 using the weights of profile terms present in the document representation. If a multiple weighting scheme is employed, this calculation is done using the production weights. If the "giveaway" profile term is not present 526 in the document representation, the document must be subjected to a threshold determination of relevance. Accordingly, a similarity function is calculated 530 between the document representation and topic profile and compared 540 with the relevance threshold. The document is deemed relevant 560 if the similarity function meets or exceeds the threshold. The value calculated for the similarity function is the topic value 580, unless a multiple weighting scheme is used. In the latter case, a second similarity function is calculated using the production weights and the value of this similarity function serves as the topic value.
There has thus been provided a method for determining a thresholded measure of relevance of a document to a topic specified by a topic profile. The topic profile comprises one or more compound terms, the forms of which are specified with sufficient detail to emphasize their correlation with the topic while minimizing their correlations with other topics. 
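The "required" and "giveaway" handling of method 500" can be sketched as a small decision function. This is an illustrative reading of the flow, not the patented implementation: the `Term` structure, the tag strings, the normalization constant, the threshold, and in particular the tag assignments in the sample profile are hypothetical (Table 3 assigns no such tags).

```python
from typing import Dict, NamedTuple, Optional, Set


class Term(NamedTuple):
    selection: float
    production: float
    tag: str = ""  # "", "required", or "giveaway"


def evaluate(
    doc_terms: Set[str],
    profile: Dict[str, Term],
    threshold: float,
    n: float,
) -> Optional[float]:
    """Return the topic value, or None when the document is deemed not relevant."""
    present = {t: p for t, p in profile.items() if t in doc_terms}
    required = [t for t, p in profile.items() if p.tag == "required"]
    # Step 524: any one of the "required" terms satisfies the condition.
    if required and not any(t in present for t in required):
        return None
    # Step 526: a "giveaway" term bypasses the threshold determination.
    has_giveaway = any(p.tag == "giveaway" for p in present.values())
    if not has_giveaway:
        selection_score = sum(p.selection for p in present.values()) / n
        if selection_score < threshold:
            return None
    # Topic value from the production weights (multiple weighting scheme).
    return sum(p.production for p in present.values()) / n


# Hypothetical tag assignments, for illustration only.
profile = {
    "s & l": Term(150, 150, "required"),
    "Bank": Term(180, 180, "required"),
    "credit union": Term(80, 80, "giveaway"),
    "mortgage": Term(0, 60),
}
by_giveaway = evaluate({"credit union", "Bank"}, profile, threshold=2.0, n=100)
no_required = evaluate({"credit union", "mortgage"}, profile, threshold=2.0, n=100)
below_threshold = evaluate({"Bank", "mortgage"}, profile, threshold=2.0, n=100)
```

The three sample documents exercise the three paths: a giveaway term together with a required term yields a topic value directly, a document with no required term is rejected even though it contains the giveaway term, and a document with a required term but no giveaway term still has to meet the threshold.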
The document representation is generated with reference to the topic profile to ensure that any compound terms present in the document are retained in the document representation in the form specified in the topic profile. The resulting document representation is thus tailored to the topic being searched, and the document's relevance to the topic is determined by calculating a similarity function between the document representation and the topic profile. The present invention has been described in terms of several embodiments solely for the purpose of illustration. Persons skilled in the art will recognize from this description that the invention is not limited to the embodiments described, but may be practiced with modifications and alterations limited only by the spirit and scope of the appended claims.
