Method and apparatus for profile score threshold setting and updating6430559Abstract A novel approach for filtering documents involves the use of delivery ratio threshold setting technique to set an initial profile score threshold and the use of beta-gamma regulation for dynamic threshold updating. A group of documents is scored pursuant to a user profile. The score for each document is indicative of the relevance of the corresponding document to the user profile. The score can be compared with a profile score threshold to decide if the document should be accepted or rejected. According to one aspect of the invention, the initial threshold is set to a score threshold that approximates an expected ratio of acceptable documents calibrated with respect to a set of reference documents. According to another aspect of the invention, the score threshold can be updated based on the accumulated example documents, user's relevance judgment, and the user's utility function. The accumulated example documents are first scored against a profile and a ranked list of scored documents is obtained. Each position at the ranked list corresponds to a candidate score threshold as well as a utility value computed based on the relevance status of the example documents. From these candidate threshold points, an optimal utility threshold and a zero utility threshold are determined. Using the optimal utility threshold and the zero utility threshold, a new utility threshold is calculated by interpolating between estimates of the optimal utility threshold and the zero utility threshold. This new utility threshold is used for subsequent information retrieval and filtering. Claims I claim: Description FIELD OF THE INVENTION
Terms Q.sub.1 Q.sub.2 D.sub.1 D.sub.2
dog 1 1 2 --
cat 1 -- -- 1
hat 1 -- 1 --
bat 1 -- -- --
mat 1 1 -- --
hut -- 1 2 --
cut -- 1 -- 2
luck -- -- 3 --
buck -- -- 1 --
muck -- -- -- 3
In this table, the Terms column lists a union of all the terms contained in the two documents D.sub.1 and D.sub.2. The scores of D.sub.1 and D.sub.2 refer to the frequency of those terms as they appear in the documents. The scores of Q.sub.1 and Q.sub.2 refer to frequency of the terms as they appear in the query. The similarity score of the query Q.sub.1 to document D.sub.1 is computed as: ##EQU2## Also, the similarity of the profile Q.sub.1 to document D.sub.2 is computed as: S.sub.G (Q.sub.1, D.sub.1)=0.12. As can be seen from the above example, the similarity score of profile Q.sub.1 to document D.sub.1 is higher than the similarity score of profile Q.sub.1 to document D.sub.2. As a result, the similarity score provides a relative measure of the relevance of a document to the profile. A highly-scored document is more likely to be relevant to a profile than a low-scoring one. Therefore, a high score threshold would only allow a few high-scoring documents to be accepted. Most of these high-scoring documents may be expected to be relevant to the profile. On the other hand, a low score threshold would allow more documents to be accepted. However, the ratio of actually relevant documents among these documents--referred to as precision--may be low. The correct threshold can only be determined according to the user's actual preference concerning the number amount of documents accepted as well as the expected precision of the accepted documents. FIG. 2 illustrates an embodiment of the invention used to set an initial score threshold 104. A set of reference documents is identified as reference database. The profile term vector is used to assign a score to each reference document. The reference documents are sorted by their scores to generate a sorted list of reference documents. The expected delivery ratio provided by the user determines a cutoff point at the list. Assuming that the user expects to accept a fraction r of documents from the corpus of documents(e.g., 10%), the cutoff point will be the k-th document in the ranked list, where K=r.times.N, and N equals the number of documents in the reference database. The score at the cutoff point is taken as the assigned threshold. In special cases when K<1 or when K>N, heuristic extrapolation is applied. The thresholding operation in step 106 determines whether a document will be delivered to the user in step 107. Documents yielding a score from step 105 above the score threshold 104 are accepted as relevant in step 106 and delivered to the user in step 107. Conversely, documents yielding a score below the score threshold 104 are rejected as not relevant and discarded. In step 108, relevance feedback for each accepted document is then obtained based upon the user's particular needs. The documents that the system has already processed serve as a training corpus for updating the user profile 102 in step 110 for the filtering of subsequent documents in the corpus of documents 101. This updating of the user profile 102 in step 110 can be done as frequently as needed. The frequency of updating can be determined based on the amount of new delivered documents or the time elapse since last updating. Optionally, profile editor 109 may be used to update user profile 102 directly without regard to the results obtained in step 107. In the preferred embodiment, the user profile 102 is updated in step 110 by expanding the term vector 103 and re-estimating, according to the present invention, the score threshold 104. To expand the term vector 103, standard Rocchio feedback may be used, where the centroid vector of the relevant document vectors is computed and the terms are ranked according to their centroid weight. Preferably, however, the K best-ranked terms are assigned a uniform weight before they are merged into the current term vector 103. K grows heuristically with the number of relevant documents N available for training, according to the functions: K=10+10.multidot.log(N+1). FIG. 3 illustrates an embodiment of the present invention used to update the score threshold 104 in step 110. In step 201, documents from a reference dataset (or initial training set) are scored against the profile vector, and are sorted according to their scores. At each position in the ranked list, a utility value U.sub.i can be computed by assuming a threshold that is equal to the score of the document at that position. Therefore, each position yields a candidate score threshold and a corresponding utility value. Thereafter, the "optimal" utility threshold .theta..sub.opt is determined in step 203, and the "zero" utility threshold .theta..sub.zero is determined in step 204. The optimal utility threshold .theta..sub.opt is the threshold that yields the highest utility over the accumulated training data. The zero utility threshold .theta..sub.zero is the highest threshold below the optimal utility threshold .theta..sub.opt that gives a non-positive utility over the training data under the assumption that all documents that were rejected are non-relevant. Using the optimal utility threshold .theta..sub.opt and the zero utility threshold .theta..sub.zero, a new profile utility threshold .theta.' is then calculated in step 205 by interpolating between the empirical optimal utility threshold .theta..sub.opt and the zero utility threshold .theta..sub.zero over the historical training data the system has accrued at any given point. As documents are filtered using this process, they are added to the historical training data for the system. In this way, the optimum utility is updated as new documents are evaluated. The interpolation between the optimal and zero utility threshold using a constant parameter .alpha., may be calculated according to the following evaluation formula: .theta.=.alpha..multidot..theta..sub.zero +(1-.alpha.).multidot..theta..sub.opt The parameter .alpha. can be empirically set to any value between zero and one (alpha-regulation). In the preferred embodiment, .alpha. is expressed as a function of two further parameters, .beta. and .gamma. (beta-gamma function 206), as reflected in the following calculation: .alpha.=.beta.+(1-.beta.).multidot.e.sup.-.gamma..multidot.M in which M equals the number of training documents upon which the relevance feedback in step 108 of FIG. 1 is performed. In the preferred embodiment, the new profile utility threshold .theta.' replaces the previous score threshold 104 and is used along with the newly expanded term vector 103 to filter any subsequent documents in steps 105 and 106. In writing the parameter .alpha. in terms of .beta. and .gamma. in the beta-gamma function 306, both aspects of the bias present in the optimal utility threshold .theta..sub.opt calculation are captured. First, .gamma. represents a score bias correction factor that compensates for the relatively higher scores of relevant documents in the training corpus. Second, .gamma. expresses that the estimated optimal utility threshold .theta..sub.opt approximates the true optimal utility threshold more closely when more judged training examples are available. The parameter .gamma. is the inverse of the number of documents at which the profile utility threshold is placed at approximately the midpoint of the range between the optimal utility threshold .theta..sub.opt and the zero utility threshold .theta..sub.zero. are available, the profile utility threshold will be somewhat lower. By contrast, if more than 1/.gamma. training examples are available, the profile utility threshold will be somewhat higher. FIG. 4 illustrates how a choice of the parameter .alpha. determines a cutoff point between the points of optimal and zero utility and how the parameters .beta. and .gamma. help to dynamically adjust parameter .alpha. according to the number M of judged documents in the training database. Given a ranked list of all of the documents in the training database sorted by their scores, their relevance, and a specific utility criterion, the utility value at each different cutoff position can be plotted. Each cutoff position corresponds to a utility threshold. HARDWARE OVERVIEW FIG. 5 is a block diagram which illustrates a computer system 300 upon which an embodiment of the invention may be implemented. Computer system 300 includes a bus 302 or other communication mechanism for communicating information, and a processor 304 coupled with bus 302 for processing information. Computer system 300 also includes a main memory 306, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 302 for storing information and instructions to be executed by processor 304. Main memory 306 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 304. Computer system 300 further includes a read only memory (ROM) 308 or other static storage device coupled to bus 302 for storing static information and instructions for processor 304. A storage device 310, such as a magnetic disk or optical disk, is provided and coupled to bus 302 for storing information and instructions. Computer system 300 may be coupled via bus 302 to a display 312, such as a cathode ray tube (CRT), for displaying information to a computer user. An input device 314, including alphanumeric and other keys, is coupled to bus 302 for communicating information and command selections to processor 304. Another type of user input device is cursor control 316, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 304 and for controlling cursor movement on display 312. This input device itypically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), which allows the device to specify positions in a plane. The invention is related to the use of computer system 300 to retrieving information using beta-gamma regulation of threshold updating. According to one embodiment of the invention, retrieving information using beta-gamma regulation of threshold updating is provided by computer system 300 in response to processor 304 executing sequences of instructions contained in main memory 306. Such instructions may be read into main memory 306 from another computer-readable medium, such as storage device 310. However, the computer-readable medium is not limited to devices such as storage device 310. For example, the computer-readable medium may include a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, a RAM, a PROM, and EPROM, a FLASH-EPROM, any other memory chip or cartridge, or any other medium from which a computer can read. Execution of the sequences of instructions contained in main memory 306 causes processor 304 to perform the process steps previously described. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the invention. Thus, embodiments of the invention are not limited to any specific combination of hardware circuitry and software. Computer system 300 also includes a communication interface 318 coupled to bus 302. Communication interface 318 provides a two-way data communication coupling to a network link 320 that is connected to a local network 322. For example, communication interface 318 may be an integrated services digital network (ISDN) card or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 318 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 318 sends and receives electrical, electromagnetic or optical signals which carry digital data streams representing various types of information. Network link 320 typically provides data communication through one or more networks to other data devices. For example, network link 320 may provide a connection through local network 322 to a host computer 324 or to data equipment operated by an Internet Service Provider (ISP) 326. ISP 326 in turn provides data communication services through the world wide packet data communication network now commonly referred to as the "Internet" 328. Local network 322 and Internet 328 both use electrical, electromagnetic or optical signals which carry digital data streams. The signals through the various networks and the signals on network link 320 and through communication interface 318, which carry the digital data to and from computer system 300, are exemplary forms of carrier waves transporting the information. Computer system 300 can send messages and receive data, including program code, through the network(s), network link 320 and communication interface 318. In the Internet 328 for example, a server 330 might transmit a requested code for an application program through Internet 328, ISP 326, local network 322 and communication interface 318. In accordance with the invention, one such downloaded application provides for the retrieval of information using chunks of text as described herein. The received code may be executed by processor 304 as it is received, and/or stored in storage device 310, or other non-volatile storage for later execution. In this manner, computer system 300 may obtain application code in the form of a carrier wave. To measure the effectiveness of the present invention over simple linear interpolation (i.e., alpha regulation), 49 user profiles were used to filter about 250 megabytes of 1988 Associate Press news articles. A set of 1987 Wall Street Journal documents were used as an initial reference database. The initial threshold is set for all the profiles with a delivery ratio of 0.0005 using the present invention. The utility function used in the evaluation is the utility function UF1 defined below: UF1=3.multidot.R-2.multidot.N where, R is the number of relevant documents accepted and N is the number of non-relevant documents accepted. Three experiments were conducted. In one experiment, the initial threshold was kept without updating. The other two experiments update the threshold using the present invention in two different ways--one uses the alpha regulation and the other uses the beta-gamma regulation. Updating frequency is such that a profile will be updated whenever there are four new documents accepted for the profile. The UF1 utility value for each profile and their average are shown in the following table.
baseline .alpha. improve (.alpha. reg. .beta.-.gamma.
regulation improve (.beta.-.gamma. reg.
Profile (no regulation over (.beta. = 0.1,
over
# updating) (.alpha. = 0.3) baseline) .gamma. = 0.05)
baseline)
1 -14 -8 6 -8 6
2 -15 -12 3 -12 3
3 -3 1 4 -11 -8
4 3 1 -2 1 -2
5 0 15 15 27 27
6 -16 -4 12 -4 12
7 9 1 -8 26 17
8 -108 -5 103 -5 103
9 -26 1 27 -3 23
10 3 6 3 46 43
11 -106 -4 102 -25 81
12 -31 5 36 -3 28
13 -16 -13 3 -10 6
14 -1 3 4 2 3
15 10 -3 -13 -5 -15
16 -29 -2 27 -2 27
17 46 32 -14 60 14
18 -475 -24 451 -24 451
19 -4 -6 -2 -6 -2
20 2 6 4 10 8
21 14 8 -6 35 21
23 73 4 -69 110 37
24 15 -3 -18 -7 -22
25 -68 -5 63 -16 52
26 -14 2 16 -4 10
27 -1 3 4 3 4
28 -26 -6 20 -6 20
29 -9 -2 7 -2 7
30 -2 -2 0 -2 0
31 -3 3 6 -3 0
32 -24 -6 18 -6 18
33 -8 -4 4 -4 4
34 -10 -10 0 -10 0
35 -24 -8 16 -8 16
36 4 4 0 0 -4
37 -12 -2 10 -2 10
38 12 5 -7 1 -11
39 -14 -10 4 -10 4
40 21 -1 -22 14 -7
41 -9 -4 5 -4 5
42 7 9 2 -2 -9
43 -30 5 35 -9 21
44 2 -8 -10 -8 -10
45 -10 -8 2 -8 2
46 -187 -20 167 -21 166
47 0 -1 -1 -1 -1
48 -16 -12 4 -12 4
49 7 3 -4 7 0
50 -16 -8 8 -8 8
Average -22.42857143 -1.714285714 20.71428571
1.448979592 23.87755102
These results show that threshold updating using the present invention (both the alpha regulation and the beta-gamma regulation) generally improves the utility values, and in some cases significantly, when compared with the performance without updating. The comparison between the alpha regulation and the beta-gamma regulation indicates that the beta-gamma regulation technique gives more stable utility performance and is more adaptive when a profile has the potential of achieving a high positive utility. For example, referring to topic 23, although the alpha regulation gives a very small utility, the beta-gamma regulation generates a very high positive utility. While this invention has been particularly described and illustrated with reference to particular embodiments thereof, it will be understood by those skilled in the art that changes in the above description or illustrations may be made with respect to form or detail without departing from the spirit or scope of the invention.
|
Same subclass Same class Consider this |
||||||||||
