Method and system for collecting related queries6701309Abstract A method for collecting related queries, comprising the steps of obtaining a first query and a second query that have been submitted during a search for data, and determining whether the first query and the second query have a likelihood of being submitted by a class of searcher. Claims What is claimed is: Description BACKGROUND OF THE INVENTION
TABLE 1
Time Difference
Time Submitted From Previous Submission User
Query (HH:MM:SS) (Seconds) Session
Q1 12:00:00 -- A
Q2 12:00:30 30 A
Q3 12:04:00 210 A
Q4 12:07:00 180 A
Q5 12:15:00 480 B
Q6 12:18:00 180 B
Q7 12:28:00 600 C
Q8 12:32:30 270 D
The difference in time between Q1 and Q2 is 30 seconds, the difference between Q2 and Q3 is 210 seconds, the difference between Q3 and Q4 is 180 seconds, and the difference between Q4 and Q5 is 480 seconds. Because the successive differences between Queries Q1, Q2, Q3 and Q4 are each less than 240 seconds, these queries are regarded in a set, in this case associated with User Session A. Because the difference between Q4 and Q5 is 480 seconds, which is greater than 240 seconds, Q5 is assumed to have been submitted by a different user than that of User Session A. Accordingly, Q5 is considered to be the first query of User Session B. The difference between Q5 and Q6 is 180 seconds, and the difference between Q6 and Q7 is 600 seconds. Therefore, Queries Q5 and Q6 are regarded as a set associated with User Session B. The difference between Q7 and Q8 is 270 seconds. Accordingly, Q7 is a stand-alone query associated with User Session C. Q8 is associated with User Session D, which may include subsequent queries that are not represented in Table 1. Step 120 can include the sub-steps of (1) discarding queries if a rate of query submission from a particular source exceeds a predetermined rate, and (2) discarding queries from an undesired source. An undesired source can be identified, for example, by its uniform resource locator (URL). These sub-steps are intended to screen out queries that are received from a proxy server. Queries from a proxy server are undesirable because the proxy server is typically a common point through which multiple users may concurrently submit queries. Consequently, additional processing would be required to identify queries from a single user. For the purpose of illustrating other aspects of the present invention, assume that a first pass through step 120 will yield the set of queries associated with User Session A, as shown in Table 1. After step 120 is completed, the method progresses to step 125. In step 125, the method identifies pairs of queries in the set provided by step 120. For example, as described above, Q1, Q2, Q3 and Q4 comprise a set associated with User Session A. The query pairs of this set are (Q1,Q2), (Q1,Q3), (Q1,Q4), (Q2,Q3), (Q2,Q4) and (Q3,Q4). Step 125 may perform one or more preliminary processes to normalize an individual query so that it can be compared to other queries. Such processes can include, for example: (1) converting a non-printable character to a space in a query, (2) replacing consecutive spaces with a single space in a query, (3) removing a quotation mark from a query, and (4) converting an alpha character to its uppercase representation in a query. Additionally, step 125 considers a first query and a second query to be a valid query pair only in a case where the first query is not identical to the second query, and where the first query is not a plural of the second query. For example, if the first query can be produced from the second query by adding "s" or "es", then the first query is a plural of the second query. Note that the reverse scenario is permitted. Step 125 may also perform one or more preliminary processes to improve the integrity of an individual query. Such processes can include, for example: (1) discarding a query in a case where the query comprises a number of characters that is not within a predetermined range of numbers, e.g., discarding a query if it does not have at least two character, or if it has more than twenty-five characters, and (2) discarding a query in a case where the query comprises a uniform resource locator as judged by whether the query contains the sequence "http" or ".com", e.g., the query "lycos.com" would be discarded. Since the techniques that drive query refinement are largely data driven, pornographic suggestions are inevitable. Without any filtering, nearly any celebrity that you query will lead to the suggestion that you search for that celebrity nude. Arguably this is what some users may want, but these users can easily bypass a refinement interface and express their needs accurately the first time. Not only does the presence of pornographic web sites undermine family-oriented strategy, but they also "pollute" suggestion lists for innocuous queries, harming the user experience even for porn-tolerant users. The present invention therefore strives to suppress pornographic suggestions to the greatest extent possible by filtering out pornographic words from all results using as a stoplist the keywords most highly correlated with pornography. Accordingly, step 125 also discards a query in a case where the query includes a term from the group consisting of a pornographic term, a violent term, a hateful term, an ethnically derogatory term, or any other predetermined objectionable term. After completion of step 125, the method progresses to step 130. In step 130, the method determines whether data is available from another search session. If step 130 determines that data is available from another search session, then the method branches back to step 120. If step 130 determines that data is not available from another search session, then the method advances to step 135. For example, if step 125 has just processed the data from User Session A, then the method will branch back to step 120 to process the data from User Session B. The method continues to loop about steps 120, 125 and 130 until all of the data from access log 155 has been processed. In practice, access log 155 may contain millions of records, so accordingly, millions of query pairs may be assembled. In step 135, the method determines the number of occurrences of each query pair that was submitted in a plurality of searches. That is, it determines the number of occurrences of each query pair that was found during the processing of steps 120, 125 and 130. Assume for example that there exist twenty user sessions X1 through X20, which involved twenty independent searchers, possibly at twenty different locations in the United States. Assume further that each of the twenty searchers submitted both of the queries, "patent" and "trademark". Assume also that five user sessions, X21 through X23, each submitted both of the queries "leather" and "patent leather". The processing loop of steps 120, 125 and 130 would therefore assemble an occurrence of the query pair (patent, trademark) for each of user sessions X1 through X20, and an occurrence of the query pair (leather, patent leather) for each of sessions X21 through X25. Accordingly, step 135 would determine that the query pair (patent, trademark) occurred twenty times, and the query pair (leather, patent leather) occurred five times. After completion of step 135, the method advances to step 140. In step 140, the method identifies query pairs that are likely to occur from a single searcher in a single search session. This can be achieved by determining whether the number of occurrences of a particular query pair is greater than a predetermined number. So, if the predetermined number is three, the method would conclude that the queries "patent" and "trademark", which comprise a query pair that occurred twenty times, have a likelihood of being submitted by a single searcher in a single search session. Likewise, the method would conclude that the queries "leather" and "patent leather", which comprise a query pair that occurred five times, also have a likelihood of being submitted by a single searcher in a single search session. Step 140 processes data from, and then updates, a query pair database 160. Query pair database 160 contains counts of occurrences of query pairs that were found during previous executions of the method shown in FIG. 1. Accordingly, step 140 considers not only the number of occurrences of query pairs found during the most recent processing of step 135, but also the number of occurrences of query pairs retained in query pair database 160. For a given query pair, the total number of occurrences is the sum of the number found in step 135 and the number from query pair database 160. This total is then merged into query pair database 160. For example, the twenty occurrences of the query pair (patent, trademark) and the five occurrences of (leather, patent leather) will be merged into query pair database 160. By utilizing query pair database 160, the method takes advantage of a broader spectrum of queries collected over a longer interval of time than that which is represented in access log 155 alone. Upon completion of step 140, the method progresses to step 145. In step 145, the method saves, into an alternate query database 165, unique query pairs that occur a predetermined number of times. One of the uses of a query pair comprising a first query and a second query is that in a case where a searcher submits the first query, the second query can be suggested to the searcher as an alternate query. More than one alternate query can be suggested to the searcher, yet only a select group of the most useful alternates should be suggested. The purpose of step 145 is to limit the number of alternate queries to some reasonable quantity. Assume that the predetermined number is fifty, although any appropriate number can be used. Assume also that after the processing of step 140, the query pair (patent, trademark) was found to have occurred seven hundred times. Accordingly, the query pair (patent, trademark) would be entered into the alternate query database 165. After completion of step 145, the method progresses to step 150. In step 150, the method for collecting related queries terminates. In practice, this method can be run periodically, such as once a week. This would allow for access log 155 to acquire an adequate pool of new queries, and it would further serve to provide fresh query pairs for query database 160 and alternate query database 165. The processing of steps 135, 140 and 145 in conjunction with query pair database 160 and alternate query database 165 is intended to determine whether the first query and the second query have a likelihood of being submitted by a class of searcher. These steps determine a number of occurrences of a query pair that are submitted in a plurality of searches, and use the number of occurrences to determine whether the first and second queries have the likelihood of being submitted by the class of searcher. When a first and second query have a likelihood of being submitted by the class of searcher, the queries are considered to be related. To determine whether such a relationship exists, the method can apply a common sense knowledge database, or a technique for evaluating an actual number of occurrences of a query against an expected number of occurrences of the query. For example, as an alternative to steps 135, 140 and 145 as described above, which use predetermined numbers as thresholds, the method can evaluate a ratio between an actual number of occurrences of a query pair and an expected number of occurrences of the query pair. Such a ratio will indicate a degree of correlation between the first query and the second query. The closer the correlation, the greater the likelihood that the first and second queries are submitted by the class of searcher. The ratio can be evaluated using any technique for evaluating a ratio between an observed number of events and an expected number of events such as mutual information analysis or a chi-squared (.chi..sup.2) test. The mutual information for two queries (a, b) is defined to be the ratio between the empirical probability that both a and b appear together in a given user session, and the probability assigned by a model. The model assigns to (a, b) the co-occurrence probability dictated by the independence assumption, namely the product of the empirical probability that a occurs in a given session with the empirical probability that b occurs in a given session. The query pair (a, b) will be retained if the pair meets a minimum mutual information threshold. In a chi-squared (.chi..sup.2) test, deviations, that is observed values minus expected values, are squared, divided by the expected values, and summed. The value of .chi..sup.2 is then compared with values in a statistical table to determine the significance of the deviations. Another alternative to the processing of steps 135, 140 and 145 is to evaluate a ranking of mutual information values for query pairs. A first and second query are regarded as a first unique query pair. The first query may also be the first query in N unique query pairs that have been submitted in a plurality of searches. The method determines a number of occurrences of each of the N unique query pairs submitted in the plurality of searches, and uses a number of occurrences of the first unique query pair and the number of occurrences of each of the N unique query pairs to determine whether the first and second queries have a likelihood of being submitted by a class of searcher. Each of the N unique query pairs has a mutual information value, and each of the N unique query pairs has a rank ordered according to its mutual information value where a greatest rank corresponds to a greatest mutual information value. The first and second queries are found to have the likelihood of being submitted by the class of searcher if the rank of the first unique query pair is greater than a predetermined rank. Consider the following example using sample data from Table 2. Assume that that a query pair (patent, trademark) is under consideration, so it is regarded as a first unique query pair. The query, "patent", is the first query in six unique query pairs, hence N=6. The number of occurrences of each of the six unique query pairs, and their corresponding mutual information values are set forth in columns two and three, respectively, of Table 2.
TABLE 2
Mutual Mutual
Number of Information Information
Query Pair Occurrences Value Ranking
patent, trademark 750 0.90 2
patent, fiber optics 30 0.10 4
patent, hair loss 25 0.09 5
patent, walnuts 20 0.08 6
patent, lawyer 800 0.92 1
patent, high-tech 150 0.30 3
The query pair (patent, lawyer) has the highest mutual information value, i.e., 0.92, and thus, the rank, i.e., 1. The query pair, (patent, trademark) has the second highest mutual information value, and thus, the second rank. The query pair (patent, high-tech) has the third highest mutual information value, and thus, the third highest rank. If the method is designed to accept a query pair with a rank greater than four, then the method would accept the query pairs of (patent, lawyer), (patent, trademark) and (patent, high-tech). In this example, a first query of "patent" and a second query of "trademark" would be considered as having a strong likelihood of being submitted by a class of searcher because the rank of query pair (patent, trademark) is greater than four. Related queries can be used to suggest alternate queries. For example, a system receiving a communication indicating that a searcher has submitted one of a first query and a second query can send the other of the first and second queries to the searcher as a suggested alternate query. Similarly, the one query can be used to enhance a search for the other query. For example, a person searching for "patent" may appreciate being directed to information that would be revealed by a more specific search for "patent" AND "trademark". Even more helpful may be the case where a person who first searches for "patent" and then searches for "trademark" is directed to information that has been selected by other users who have also submitted queries for both "patent" and "trademark". Related queries can also be used it aid in a selection of an advertisement. For example, given that a searcher has submitted a query for "patent", the searcher may appreciate being presented with an advertisement related to "trademark". Given a user's query Q, the query refinement method presented above returns other queries that are often asked by users who ask query Q. One consequence of this method is that it returns these other queries whether or not the users were satisfied with their results. For example, suppose query Q is "Coffee". Users that issued this query may also have queried for "Cafe", "Java", and "Cofee" (sic). Users are unlikely to have been satisfied with the results of the last of these queries, because "cofee" is a misspelling. It is desirable to suppress suggestions of such queries in the future, to avoid leading the user down unproductive paths, and to maintain the user's confidence in the system's recommendations. When the results of a query are not satisfying, such a situation can be detected by using a simple heuristic that counts the average number of clicks that a query receives on its search results. More specifically, we can sum the number of clicks in the click logs that each search result of a query Q receives, and divide by the number of times in the access logs that query Q was issued. The rationale for this heuristic is that an uninformative query, such as "Cofee", will yield results that will clearly be irrelevant to a user searching for coffee--a dentist named Dr. Bernie Cofee, a company with the acronym COFEE, and so on--and thus users will rarely click on them. On the other hand, if the user is enticed into clicking on many of the search results, then the query is probably a reasonable one. This measure of query satisfaction can be incorporated into the query refinement algorithm in a variety of ways. In a preferred embodiment, the query refinement algorithm can generate an ordered list of queries, and suppress any queries in the list whose average click rate is below a predetermined threshold. An alternative embodiment would involve combining the query refinement algorithm's mutual information score and the average click rate into an overall score for each query, ordering the queries according to their overall scores, and again suppressing queries whose overall score falls below a predetermined threshold. The combination of scores can be done in a variety of ways, for example, expressing each score as a number of standard deviations away from the mean of its distribution, and then multiplying the two scores together to form an overall score. FIG. 2 is a flowchart of a method for refining a presentation of an alternate query to a first query in accordance with the present invention. It involves a query pair filtering technique that considers the satisfaction the user perceives after selecting a suggested alternate query. The method comprises the steps of presenting a second query to a searcher that has submitted the first query, determining whether the searcher thereafter submits the second query, and determining whether the searcher thereafter utilizes information presented to the searcher if the searcher submits the second query. The method begins with step 210. In step 210, the method presents an alternate query (QA), to a searcher who has submitted a first query. The alternate query QA is obtained from an alternate query database 260, which is preferably developed in accordance with the method described above in association with FIG. 1. However, any appropriate source of alternate queries can be used. The method then progresses to step 215. In step 215, the method increments a count (QA.sub.Presented) of total times the alternate query QA has been presented to searchers. The method then progresses to step 220. In step 220, the method determines whether the searcher selects, i.e., submits, the alternate query QA. If the searcher selects the alternate query, then the method progresses to step 225. If the searcher does not select the alternate query, then the method branches to step 235. In step 225, the method determines whether the searcher utilizes information that is presented to the user with the result of the search executed for the alternate query. Often, in addition to a search result, a search engine web site will display an advertisement or another form of collateral information, such as a new set of alternate queries. Step 225 is intended to determine whether the searcher has selected any feature that indicates the searcher's satisfaction with the presentation. For example, the searcher's selection of a link to further pursue the information would be indicative of his satisfaction. The objective here is to determine whether the searcher made use of information produced downstream from a click on an alternate query. If the searcher does not utilize the information, then the method branches to step 235. If the searcher does utilize the information, then the method progresses to step 230. In step 230, the method increments a count (I.sub.Used) of the total times the information has been utilized. The method then progresses to step 235. In step 235, the method calculates a usage level (U) of the presented information. Over time, the searcher is but one of a plurality of searchers, each of whom has submitted the first query and thereafter submitted the alternate query. For each search, step 215 updates the count of total times the alternate query QA is presented QA.sub.Presented, and step 230 updates the count of total times the information has been utilized I.sub.Used. The usage level U is found by determining a ratio between the number of times the plurality of searchers further pursued, i.e., utilized, the information and a number of times the alternate query is presented to the plurality of searchers. U:=I.sub.Used /QA.sub.Presented After completion of step 235, the method progresses to step 240. In step 240, the method updates an alternate query usage database 250 based on the usage level U. The alternate query usage database contains data related to the usage level U of the alternate query QA, as well as the usage level of other alternate queries. The method then advances to step 245. In step 245, the current pass of the method for refining a presentation of an alternate query terminates. However, periodically, the method will execute step 255. In step 255, the method updates the alternate query database 260 using data from the alternate query usage database 250. If a second query is a candidate for an alternative query to a first query, step 255 determines whether to retain the second query as the candidate based on the usage level U. Step 255 eliminating the second query as the candidate if the usage level U is less than a predetermined level. FIG. 3 is a flowchart of a method for refining a search for data in a database in accordance with the present invention. It consolidates many of the features described above in association with FIGS. 1 and 2. The method includes the steps of determining that a first query and a second query have a likelihood of being submitted by a class of searcher, receiving a communication indicating that a searcher has submitted the first query in a search for data, and presenting the second query to the searcher. A discussion of a current pass of the method begins with step 310. However, in practice, step 350 will have been previously executed. For the sake of clarity, the following paragraphs first describe a pass through steps 310 through 330, inclusive, and then describe step 355. In step 310, the method receives a search query from a searcher. The method then progresses to step 315. In step 315, the method saves, to an access log 355, the search query, an identifier indicating a source from which the search query was submitted, and a time at which the search query was submitted. In the preferred embodiment, the identifier is an Internet Protocol (IP) address. The method then progresses to step 320. In step 320, the method executes a search based on the search query. The method then progresses to step 325. In step 325, the method obtains an alternate query for the search query from an alternate query database 345. The method then progresses to step 330. In step 330, the method monitors the searcher's use of the alternate query. If the searcher submits the alternate query, then step 330 monitors the searcher's decision of whether to utilize information that is presented with the search result. The method determines a usage level of the information and updates an alternate query usage database 335. The method then loops back to step 310 in anticipation of receiving a next query from the searcher. Step 340 is executed asynchronously from steps 310 through 330. In step 340, the method updates the alternate query database 345 using data from the alternate query usage database 335. More particularly, where a second query is a candidate for an alternative query, step 340 determines whether to retain the second query based on the usage level of information, which was determined in step 330 and used to update alternate query usage database 335. Step 350 is also executed asynchronously from steps 310 through 330. In step 350, the method obtains data from access log 355 and assembles query pairs, each comprising a first query and a second query that have a likelihood of being submitted by a class of searcher. The query pairs are stored into alternate query database 345. FIG. 4 is a block diagram of a system 400 for collecting related queries, refining a presentation of an alternate query to a first query in accordance with the present invention, and refining a search for data in a database in accordance with the present invention. System 400 includes a local computer 410, and access log 420, an alternate query database 430, an alternate query usage database 440, and a query pair database 450. Local computer 410 is coupled to a computer network 480, such as the Internet, to which a plurality of remote computers 470 is also coupled. Through computer network 480, a user at any of the remote computers 470 can communicate with local computer 410. Local computer 410 includes a search engine (not shown) to which the user can submit a search query to locate data in a database (not shown). Local computer 410 also includes a processor 412 and a memory 414. Memory 414 contains data and instructions, organized into one or more programs, for controlling processor 412 to execute the methods described above in association with FIGS. 1 through 3. As stated earlier, system 400 includes a capability of collecting related queries. To perform this operation, processor 412 obtains a first query and a second query that have been submitted by a user of a remote computer 470 during a search for data. The first and second queries are stored into access log 420. Thereafter, processor 412 determines whether the first and second queries have a likelihood of being submitted by a class of searcher. If the first and second queries do have the likelihood of being submitted by a class of searcher, then processor 412 stores the first and second query as a query pair into query pair database 450 and alternate query database 430. System 400 is also capable of refining a presentation of an alternative query to a first query. Processor 412 presents a second query to a searcher that has submitted the first query, and determines whether the searcher thereafter submits the second query. If the searcher submits the second query, processor 412 determines whether the searcher thereafter utilizes information presented to the searcher. Processor 412 determines a usage level of the information, and updates alternate query usage database 440. Where a second query is a candidate for an alternative query to the first query, processor reads the usage level data from the alternate query database 440 and determines whether to retain the second query as the candidate and updates alternate query database 430. To search for data in a database (not shown), processor 412 receives a communication indicating that a searcher at a remote computer 470 has submitted the first query in a search for data. Processor obtains a second query from alternate query database 430 and presents the second query to the searcher. While the instructions required to execute the invention hereof are indicated as already loaded into the memory 414 of the local computer 410, they may be configured on a storage media, such as data memory 460, for subsequent loading into memory 414. Local computer 410 can be implemented as a general purpose computer, such as a personal computer (PC) or a mainframe, or it may be a special purpose system in which processor 412 and memory 414 are configured in discrete circuitry or firmware. Local computer 410 may also be implemented as an array comprising multiple processors (not shown) and multiple memories (not shown), where various responsibilities of the method steps described above are distributed among the multiple processors. The databases 420, 440 and 450 may reside in one memory unit, or they may be contained within multiple memory units such as that found in an enterprise storage system. Those skilled in the art, having the benefit of the teachings of the present invention may impart numerous modifications thereto. Such modifications are to be construed as lying within the scope of the present invention, as defined by the appended claims.
|
Same subclass Same class Consider this |
||||||||||
