Process for determination of text relevancy5694592Abstract This is a procedure for determining text relevancy and can be used to enhance the retrieval of text documents by search queries. This system helps a user intelligently and rapidly locate information found in large textual databases. A first embodiment determines the common meanings between each word in the query and each word in the document. Then an adjustment is made for words in the query that are not in the documents. Further, weights are calculated for both the semantic components in the query and the semantic components in the documents. These weights are multiplied together, and their products are subsequently added to one another to determine a real value number(similarity coefficient) for each document. Finally, the documents are sorted in sequential order according to their real value number from largest to smallest value. Another, embodiment is for routing documents to topics/headings (sometimes referred to as faltering). Here, the importance of each word in both topics and documents are calculated. Then, the real value number (similarity coefficient) for each document is determined. Then each document is routed one at a time according to their respective real value numbers to one or more topics. Finally, once the documents are located with their topics, the documents can be sorted. This system can be used to search and route all kinds of document collections, such as collections of legal documents, medical documents, news stories, and patents. Claims I claim: Description FIELD OF THE INVENTION
______________________________________
Vapor
______________________________________
noun fog State ASTE
fume State ASTE
illusion
spirit
steam Temperature ATMP
thing imagined
verb be bombastic
bluster
boast
exhale Motion with Reference to Direction
AMDR
talk nonsense
______________________________________
The term "vapor" has eleven different meanings. We can associate the different meanings to the thematic and attribute categories given in FIG. 3. In this example, the meanings "fog" and "fume" correspond to the attribute category entitled -State-. The vapor meaning of "steam" corresponds to the attribute category entitled -Temperature-. The vapor meaning "exhale" is a trigger for the attribute category entitled -Motion with Reference to Direction-. The remaining seven meanings associated with "vapor" do not trigger any thematic roles or attributes. Since there are eleven meanings associated with "vapor", we indicate in the lexicon a probability of 1/11 each time a category is triggered. Hence, a probability of 2/11 is assigned to the category entitled -State- since two meanings "fog" and "fume" correspond. Likewise, a probability of 1/11 is assigned to the category entitied -Temperature-, and 1/11 is assigned to the category entitled -Motion with Reference to Direction-. This technique of calculating probabilities is being used as a simple alternative to an analysis to a large body of text. For example, statistics could be collected on actual usage of the word to determine probabilities. Other interpretations can exist. For example, even though there are eleven senses for vapor, one interpretation might be to realize that only three different categories could be generated so each one would have a probability of 1/3. Other thesauruses and dictionaries, etc. can be used to associate their word meanings to our 36 categories. Roget's thesaurus is only used to exemplify our process. The enclosed appendix covers all the words that have listed so far in our data base into a semantic lexicon that can be accessed using the 36 linguistic categories of FIG. 1. The format of the entries in the lexicon is as follows: <word> <list of semantic category abbreviations>. For example: <vapor> <ASTE ASTE NONE NONE ATMP NONE NONE NONE NONE AMDR NONE>, where NONE is the acronym for a sense of "vapor" that is not a semantic sense. FIRST PREFERRED EMBODIMENT FIG. 2 illustrates an overview of using applicant's invention in order to be able to rank multiple documents in order of their importance to the word query. The overview will be briefly described followed by an example of determining the real value number (similarity coefficient SQ) for Document #4. In FIG. 2, the Query Words 101 and the documents 110 are input into the df calculator 210. The output of the df calculator 210 as represented in FIG. 6 passes to the Importance Calculator 300, whose output is represented by an example in FIG. 7. This embodiment further uses data from both the Query words 101, and the Semantic Lexicon 120 to determine the category probability of the Query Words at 220, and whose output is represented by an example in FIG. 8. Each document 111, with the Lexicon 120 is cycled separately to determine the category probability of each of those document's words at 230, whose output is represented by an example in FIG. 9. The outputs of 300, 220, and 230 pass to the Text Determination Procedure 400 as described in the six step flow chart of FIG. 3 to create a real number value for each document, SQ. These real value numbers are passed to a document sorter 500 which ranks the relevancy of each document in a linear order such as a downward sequential order from largest value to smallest value. Such a type of document sorting is described in U.S. Pat. No. 5,020,019 issued to Ogawa which is incorporated by reference. It is important to note that the word query can include natural language words such as sentences, phrases, and single words as the word query. Further, the types of documents defined are variable in size. For example, existing paragraphs in a single document can be separated and divided into smaller type documents for cycling if there is a desire to obtain real number values for individual paragraphs. Thus, this invention can be used to not only locate the best documents for a word query, but can locate the best sections within a document to answer the word query. The inventor's experiments show that using the 36 categories with natural language words is an improvement over relevancy determination based on key word searching. And if documents are made to be one paragraph comprising approximately 1 to 5 sentences, or 1 to 250 words, then performance is enhanced. Thus, the number of documents that must be read to find relevant documents is greatly reduced with our technique. FIG. 3 illustrates the 6 steps for the Text Relevancy Determination Procedure 400 used for determining document value numbers for the document ranking in FIG. 2. Step 1 which is exemplified in FIG. 10, is to determine common meanings between the query and the document. Step 2, which is exemplified in FIG. 11, is an adjustment step for words in the query that are not in any of the documents. Step 3, which is exemplified in FIG. 12, is to calculate the weight of a semantic component in the query and to calculate the weight of a semantic component in the document. Step 4, which is exemplified in FIG. 13, is for multiplying the weights in the query by the weights in the document. Step 5, which is also exemplified in FIG. 13, is to sum all the individual products of step 4 into a single value which is equal to the real value for that particular document. Step 6 is to output the real value number(SQ) for that particular document to the document sorer. Clearly having 6 steps is to represent an example of using the procedure. Certainly one can reduce or enlarge the actual number of steps for this procedure as desired. An example of using the preferred embodiment will now be demonstrated by example through the following figures. FIG. 4 illustrates 4 documents that are to be ranked by the procedures of FIGS. 2 and 3. FIG. 5 illustrates a natural word query used for searching the documents of FIG. 4. The Query of "When do trains depart the station" is meant to be answered by searching the 4 documents. Obviously documents to be searched are usually much larger in size and can vary from a paragraph up to hundreds and even thousands of pages. This example of four small documents is used as an instructional bases to exemplify the features of applicant's invention. First, the df which corresponds to the number of documents each word is in must be determined. FIG. 6 shows a list of words from the 4 documents of FIG. 4 and the query of FIG. 5 along with the number of documents each word is in(df). For example the words "canopy" and "freight" appear only in one document each, while the words "the" and "trains" appears in all four documents. Box 210 represents the df calculator in FIG. 2. Next, the importance of each word is determined by the equation Log.sub.10 (N/df). Where N is equal to the total number of documents to be searched and df is the number of documents a word is in. The df values for each word have been determined in FIG. 6 above. FIG. 7 illustrates a list of words in the 4 documents of FIG. 4 and the query of FIG. 5 along with the importance of each word. For example, the importance of the word "station"=Log.sub.10 (4/2)=0.3. Sometimes, the importance of a word is undefined. This happens when a word does not occur in the documents but does occur in a query(as in the embodiment described herein). For example, the words "depart", "do" and "when" do not appear in the four documents. Thus, the importance of these terms cannot be defined here. Step 2 of the Text Relevancy Determination Procedure in FIG. 11 to be discussed later adjusts for these undefined values. The importance calculator is represented by box 300 in FIG. 2. Next, the Category Probability of each Query word is determined. FIG. 8 illustrates this where each individual word in the query is listed alphabetically with the frequency that each word occurs in that query, the semantic category triggered by each word, and the probability that each category is triggered. FIG. 8 shows an alphabetized list of all unique words from the query of FIG. 5; the frequency of each word in the query; and the semantic categories and probability each word triggers. For our example, the word "depart" occurs one time in the query. The entry for "depart" in the lexicon corresponds to this interpretation which is as follows: <DEPART> <NONE NONE NONE NONE NONE AMDR AMDR TAMT>. The word "depart" triggers two categories: AMDR(Motion with Reference to Direction) and TAMT(Amount). According to an interpretationg of this lexicon, AMDR is triggered with a probability 1/4 of the time and TAMT is triggered 1/8 of the time. Box 220 of FIG. 2 determines the category probability of the Query words. Further, a similar category probability determination is done for each document. FIG. 9 is an alphabetized list of all unique words from Document #4 of FIG. 4; and the semantic categories and probability each word triggers. For example, the word "hourly" occurs 1 time in document #4, and triggers the category of TTIM(Time) a probability of 1.0 of the time. As mentioned previously, the lexicon is interpreted to show these probability values for these words. Box 230 of FIG. 2 determines the category probability for each document. Next the text relevancy of each document is determined. Text Relevancy Determination Procedure-6 Steps The Text Relevancy Determination Procedure shown as boxes 410-460 in FIG. 2 uses 3 of the lists mentioned above: 1) List of words and the importance of each word, as shown in FIG. 7; 2) List of words in the query and the semantic categories they trigger along with the probability of triggering those categories, as shown in FIG. 8; and 3) List of words in a document and the semantic categories they trigger along with the probability of triggering those categories, as shown in FIG. 9. These lists are incorporated into the 6 STEPS referred in FIG. 3. STEP 1 Step 1 is to determine common meanings between the query and the document at 410. FIG. 10 corresponds to the output of Step 1 for document #4. In Step 1, a new list is created as follows: For each word in the query, go through either subsections (a) or (b) whichever applies. If the word triggers a category, go to section (a). If the word does not trigger a category go to section (b). (a) For each category the word triggers, find each word in the document that triggers the category and output three things: 1) The word in the Query and its frequency of occurrence. 2) The word in the Document and its frequency of occurrence. 3) The category. (b) If the word does not trigger a category, then look for the word in the document and if it's there output two things: 1) The word in the Query and it's frequency of occurrence. 2) The word in the Document and it's frequency of occurrence. 3) --. In FIG. 10, the word "depart" occurs in the query one time and triggers the category AMDR. The word "leave" occurs in Document #4 once and also triggers the category AMDR. Thus, item 1 in FIG. 10 corresponds to subsection a) as described above. An example using subsection b) occurs in Item 14 of FIG. 10. STEP 2 Step 2, is an adjustment step for words in the query that are not in any of the documents at 420. FIG. 11 shows the output of Step 2 for document #4. In this step, another list is created from the list depicted in Step 1. For each item in the Step 1 List which has a word with undefined importance, then replace the word in the First Entry column by the word in the Second Entry column. For example, the word "depart" has an undefined importance as shown in FIG. 7. Thus, the word "depart" is replaced by the word "leave" from the second column. Likewise, the words "do" and "when" also have an undefined importance and are respectively replaced by the words from the second entry column. STEP 3 Step 3 is to calculate the weight of a semantic component in the query and to calculate the weight of a semantic component in the document at 430. FIG. 12 shows the output of Step 3 for document #4. In Step 3, another list is created from the Step 2 list as follows: For each item in the Step 2 list, follow subsection a) or b) whichever applies:
______________________________________
a) If the third entry is a category, then
1. Replace the first entry by multiplying:
importance of frequency of probability the word
word in * word in * triggers the category
first entry first entry in the third entry
2. Replace the second entry by multiplying:
importance of frequency of probability the word
word in * word in * triggers the category
second entry second entry in the third entry
3. Omit the third entry.
b) If the third entry is not a category, then
1. Replace the first entry by multiplying:
importance of frequency of
word in * word in
first entry first entry
2. Replace the second entry by multiplying:
importance of frequency of
word in * word in
second entry second entry
3. Omit the third entry.
______________________________________
Item 1 in FIGS. 11 and 12 is an example of using subsection a), and item 14 is an example of utilizing subsection b). STEP 4 Step 4 is for multiplying the weights in the query by the weights in the document at 440. The top portion of FIG. 13 shows the output of Step 4. In the list created here, the numerical value created in the first entry column of FIG. 12 is to be multiplied by the numerical value created in the second entry column of FIG. 12. STEP 5 Step 5 is to sum all the values in the Step 4 list which becomes the real value number (Similarity Coefficient SQ) for a particular documentat at 450. The bottom portion of FIG. 13 shows the output of step 5 for Document #4 STEP 6 This step is for outputting the real value number for the document to the document sorter illustrated in FIG. 2 at 460. Steps 1 through 6 are repeated for each document to be ranked for answering the word query. Each document eventually receives a real value number(Similarity Coefficient). Sorter 500 depicted in FIG. 2 creates a ranked list of documents 550 based on these real value numbers. For example, if Document #1 has a real value number of 0.88, then the Document #4 which has a higher real value number of 0.91986 ranks higher on the list and so on. In the example given above, there are several words in the query which are not in the document collection. So, the importance of these words is undefined using the embodiment described. For general information retrieval situations, it is unlikely that these cases arise. They arise fin the example because only 4 very small documents are participating. FIG. 14 illustrates a simplified algorithm for running the text relevancy determination procedure for document sorting. For each of N documents, where N is the total number of documents to be searched, the 6 step Text Relevancy Determination Procedure of FIG. 3 is run to produce N real value numbers (SQ) for each document 610. The N real value numbers are then sorted 620. SECOND PREFERRED EMBODIMENT This embodiment covers using the 6 step procedure to route documents to topics or headings also referred to as filtering. In routing documents there is a need to send documents one at a time to whichever topics they are relevant too. The procedure and steps used for document sorting mentioned in the above figures can be easily modified to handle document routing. In routing, the role of documents and the Query is reversed. For example, when determining the importance of a word for routing, the equation can be equal to Log.sub.10 (NT/dft), where NT is the total number of topics and dft is the number of topics each word is located within. FIG. 15 illustrates a simplified flow chart for this embodiment. First, the importance of each word in both a topic X, where X is an individual topic, and each word in a document, is calculated 710. Next, real value numbers(SQ) are determined 720, in a manner similar to the 6 step text relevancy procedure described in FIG. 3. Next, each document is routed one at a time to one or more topics 730. Finally, the documents are sorted at each of the topics 740. This system can be used to search and route all kinds of document collections no matter what their size, such as collections of legal documents, medical documents, news stories, and patents from any sized data base. Further, as mentioned previously, this process can be used with a different number of categories fewer or more than our 36 categories. The present invention is not limited to this embodiment, but various variations and modifications may be made without departing from the scope of the present invention.
|
Same subclass Same class Consider this |
||||||||||
