Metadata search results ranking system6718324Abstract A system and method of metadata search ranking is disclosed. The present invention utilizes a combination of popularity and/or relevancy to determine a search ranking for a given search result association. Given the exponential growth rate currently being experienced in the Internet community, the present invention provides one of the few methods by which searches of this vast distributed database can produce useful results ranked and sorted by usefulness to the searching web surfer. The present invention permits embodiments incorporating a Ranking System/Method (0100) further comprising a Session Manager (0101), Query Manager (0102), Popularity Sorter (0103), and Association (0104) functions. These components may be augmented in some preferred embodiments via the use of a Query Entry means (0155), Search Engine (0156); Data Repository (0157), Query Database (0158), and/or a Resource List (0159). Claims What is claimed is: Description CROSS-REFERENCE TO RELATED APPLICATIONS
URL Version Popularity Count
http://www.ibm.com/myfile.htm 0 49
http://www.ibm.com/myfile.htm 1 178
http://www.ibm.com/myfile.htm 2 290
http://www.ibm.com/myfile.htm 3 122
The URL here in this record set is the same for all four records. The document itself has 4 different versions, where version 0 is the most recent one. The popularity count shows how many times a particular version of a document was visited by a user. The Monitor Agent (0205) will increase only the popularity counter of version zero, if a user selects the specified URL in the search result list. Document changes that will add additional versions to a document, are handled by the Version Manager (0206) component. Version Manager (0206) Every time a document change is detected by the search engine (0223), the Version Manager (0206) receives a notification of this change. Note that there are several ways of how a search engine (0223) can detect document changes. These will not be described in more detail, because different search engines may use different approaches to accomplish this task. Once the Version Manager (0206) knows that a particular URL was changed, it updates the existing record set of the URL by increasing each version number by one, starting with the highest version number first. Finally, a new record with version number zero is created, which represents the most recent version. Version Adjusted Popularity Daemon Process (0207) A Version Adjusted Popularity Daemon Process (0207) calculates the version-adjusted popularity offline and stores it in a data field within the Ranking Database (0208). This background processing is done for efficiency reasons, and is not a mandatory component of the system. Ranking Database (0208) The Ranking Database (0208) contains all the data the present invention needs to calculate the version adjusted popularity and the document recency. One table within this database uses the URL and the version number (0 to N) as a primary key. The value for each record is the number of hits used to count the popularity. A second table only uses the URL as a primary key. One value to store is the time stamp (when the document was created). This is used to calculate the document recency. Exemplary General Ranking Method An exemplary general present invention method may best be understood by referencing the system diagrams of FIG. 1 and the exemplary flowchart of FIG. 3. These diagrams will now be discussed in detail. Referencing FIG. 3, the exemplary general search ranking method (0300) involves the following steps: Obtaining a search request query string (typically from a user) (0301) and submitting the search request to a session manager. Submitting the search request to a search engine (0302). Obtaining search results from the search engine (0303). Forwarding the query string, search results, and session ID to a query manger (0304). Using query manager data to interrogate a query database (0311) for matching query vector items (0305). Creating a popularity vector (0306). Sorting the popularity vector (0307). Making the sorted query vector items available to the user (0308). Intercepting search viewing requests and updating a query database (0311) based on session ID and resource associations (0309). One skilled in the art will recognize that these steps may be rearranged and/or augmented with no loss of generality in the teachings of the present invention. Exemplary Alternate Ranking Method An exemplary alternate present invention method may best be understood by referencing the system diagrams of FIG. 2 and the exemplary flowchart of FIG. 4. These diagrams will now be discussed in detail. Referencing FIG. 4, the exemplary alternate search ranking method (0400) involves the following steps: 1. Obtaining a search request query string (typically from a user) (0401) and submitting the search request to a session manager. 2. Submitting the search request to a search engine (0402). 3. Obtaining search results from the search engine (0403). 4. Analyzing the search results for content relevance (0404) using a ranking database (0411). 5. Calculating relevancy based on version adjusted popularity and/or document recency (0405). 6. Ranking the search results based on relevancy (0406). 7. Representing and/or viewing the ranked search results (0407). 8. Monitoring selection of ranked search results and updating the ranking database accordingly (0408). 9. Updating URL version numbers within the ranking database (0411) when there are detected document changes and/or updates to version adjusted popularity information (0409). One skilled in the art will recognize that these steps may be rearranged and/or augmented with no loss of generality in the teachings of the present invention. Query Association Metrics (QAM) A significant aspect of various embodiments of the present invention is that of Query Association Metrics (QAM) that may be spoken of in terms of the Association (0104) in FIG. 1 and the Relevancy Calculator (0203) of FIG. 2. In both of these examples a variety of information associated with the search query must be evaluated to determine a ranking score that is then used to sort the search results list(s). The general mathematical principles of projective metrics, orthogonal basis vector decomposition, and vector norms are applicable here. See David G. Luenberger, OPTIMIZATION BY VECTOR SPACE METHODS (LOC 68-8716 SBN 471 55359x, 1969); L. E. Franks, SIGNAL THEORY (ISBN 0-941052-00-7, 1981); Erwin Kreyszig, ADVANCED ENGINEERING MATHEMATICS (ISBN 0-471-85824-2, 1988). While there are a plethora of potential QAM methodologies that are applicable to the present invention, some particular schemes are worthy of note as they have wide application to the problems associated with Internet web searches. The following discussion explains how these particular techniques may be used to advantage within the present invention. Exemplary Application Background In applying the present invention to the problem of Internet Search engine technology, it will be observed that users typically enter keywords into an electronic form of a Internet Search Engine. Additionally, some electronic forms provide the possibility to enter complex Boolean queries. Before the query gets submitted to the Internet Search engine, the user input will be parsed and processed, in order that the search engine is able to understand the query command. System such as "SearchScope" for instance rely on monitoring users query behavior, selections users have made, etc. These user choices are typically stored in a database, using the query string as a key. To enhance search accuracy, however, it is important to be able to compare similarity of queries. What is the difference of a query, issued by User A who queried "CATS and DOGS", and then selected a set of specific search result items, and User B, who queries "DOGS and CATS"? Systems that are comparing similar queries currently rely on parsing the original query string. Although this approach works for a specific sort of query expressions (e.g. Boolean queries), it is difficult to handle different query types. Standard search engines support typically three different query types: Boolean query (e.g., "CATS and DOGS"); Natural Language query (e.g. "How is the weather in Seattle?"); and Vector space query (weighted strings) ("CATS[60], DOGS[40]"). The present invention utilizing QAM provides a way of comparing the similarity of search queries from these categories, thus providing a comprehensive scheme of detecting similarity of search queries. This knowledge is then used to enhance the ranking algorithm of a search engine. QAM Methodology In an information system, users make queries and a search engine replies with an ordered list of web pages to each query. The users then make choices to determine which of these pages are more preferable. These choices can often be traced and stored to enhance the answer to future queries. One way to enhance a search engine upon receiving a query is to first use the search engine to find its ordered list of web pages for the query, then use the previous data on the preference to reorder the list. This technique will statistically improve the chance that better fit pages are on the top of the list. How can this be achieved? In the simplest approach, when receiving a query Q and a list of answers R from the search engine, it is instructive to examine whether Q has been asked before. If it has, then it is instructive to incorporate the information from the previous choices (which indicates the preference of the previous users) to reorder the list R. However, very often users do not make the exactly the same query. Some previous queries are more similar to the current query Q than some others, but maybe none is exactly the same. For example, "how is the weather in SF?" is a closely related query to "SF & Weather", but they are not exactly the same. So the questions to be asked are: How to measure the similarity among queries? 1. How to use the history of the choices for one query to improve the answer of a similar query? An example is instructive at this point. Suppose a user made a query (cats .times. dogs) and received a ranked list of web pages given in their URLs. In this context a ranked list means that the web pages are given in a sorted order according to the scoring function of the search engine. Suppose, in addition, the user or some other users had already made a query on (cats .sym. dogs), and had already ranked the list of the web pages according to the relevancy to (cats .sym. dogs). How can the user-ranked list for (cats .sym. dogs) be used to further enhance the order of the list given by the search engine for (cats .times. dogs)? What is the similarity between the queries (cats .sym. dogs) and (cats .times. dogs)? Suppose URL1, URL2, URL3, and URL4 are ranked in the order of the best interests for (cats .sym. dogs), and URL4 and URL2 are returned by the search engine for (cats .times. dogs) according to its own scoring. Then it can be concluded, according to the previous data for (cats .sym. dogs), that URL2 may be more preferred than URL4. Here is an argument: A web page satisfying (cats .times. dogs) must also satisfy (cats .sym. dogs). Hence, the lists of web pages that satisfy (cats .times. dogs) is a logical projection and a subset of the list of web pages that satisfy (cats .sym. dogs). In this projective sense, URL2 is more preferred than URL4, when the user ranked the query result for (cats .sym. dogs). Even though URL3 is ranked in between URL2 and URL4 in the list for (cats .sym. dogs), the relative ranking provided with URL2 and URL4 are correct in a logical projection sense (assuming a monotonic mapping metric) for the query (cats .times. dogs). So the projective ordering given for (cats .sym. dogs) could be used fully for (cats .times. dogs). It is now instructive to look at the searching problem from the other direction. Suppose a user has a ranked list of URLs for query (cats .times. dogs) as URL4 and URL2. Suppose then a new query (cats .times. dogs) is issued and the user obtained an ordered list from the search engine as {URL1, URL2, URL3, URL4}. How much information for (cats .times. dogs) can be used in this case? Clearly, there is no additional information provided between URL1 and URL2 or URL1 and URL4. But in the projective sense, URL4 is more preferred than URL2, and should move up the list. But there is no justification to move URL4 over URL1. Thus, a reasonable reordering for (cats .sym. dogs) is {URL1, URL4, URL2, URL3}. URL4 is moved above URL3 because by the scoring result of the search engine, URL2 is more preferred than URL3, while by the previous data for (cats .times. dogs), URL4 is more preferred than URL2, and hence may be more preferred than URL3. As the result above indicates, the impact of (cats .times. dogs) to (cats .sym. dogs) is much weaker than that of (cats .sym. dogs) to (cats .times. dogs). Therefore, the similarity between queries is not symmetric. This will be verified by the projective similarity example given below. QAM Partitioning Example As an example of this technique, consider two queries Q1 and Q2. How should the previous data for Q1 be used to enhance the ordering for a list of returned pages for Q2? Here a method based on the projective similarity between Q1 and Q2 is presented. The two queries divide the space of web pages into four sets: {A}: the set of pages that satisfy Q1 but not Q2. 1. {B}: the set of pages that satisfy both Q1 and Q2. 2. {C}: the set of pages that satisfy Q2 but not Q1. 3. {D}: the set of pages that neither satisfy Q1 nor Q2. So {B} is the only common sub-set for both Q1 and Q2. The set {B} will be referred to as the projective sub-set between Q1 and Q2 (Q1 onto Q2), and denote it as QamProjection(Q1,Q2). The set {A} will be referred to as QamPages(Q1) and set {C} as QamPages(Q2). The degree of projective similarity from Q2 to Q1 is then ##EQU1## and the degree of projective similarity from Q1 to Q2 is then ##EQU2## For example, QamSimilarity ((cats.sym. dogs).fwdarw.(cats.times.dogs))=1 (3) while ##EQU3## This is the reason why the impact of (cats.times.dogs) to (cats.sym. dogs) is much weaker than that of (cats.sym. dogs) to (cats.times.dogs). Enhancement with Previous Data for the Same Query The basic method for using prior data to enhance a returned list of search results will now be discussed. Following this discussion a few practical modifications of this method will be presented. It is instructive to first illustrate how to use the previous selections of the same query to improve its future ordering. The algorithm has several components. The basic QAM component is called QamPermute. It takes two parameters: The first parameter is R=(r[1], . . . , r[t]), which is an ordered list of web pages returned by the search engine. There is a weight value w(r[i]) associated with the web page r[i], measuring the degree of match and is calculated by the search engine. In otherwords, R=(r[1], . . . , r[t]) is sorted according to w. The second parameter is a t-place vector c, measuring the selection weight given by the previous data. For example c(r[i]) can be the number of times that the web page r[i] had been selected (with 0 indicating that it has never been chosen), or some kind of weight to measure the degree of preference of r[i] in the history. It is assumed that a primitive function s(w,c) is given which maps a pair of real number to a real number. Intuitively, s(w(r[i]),c(r[i])) computes the degree of preference of the web page w[i] given the matching weight w(r[i]) returned by the search engine and preference weight c(r[i]) given by the previous searches. In practice, s(w,c) is monotonically increasing in both of its parameters. The component QamPermute can now be defined as the following: QamPermute(R,c) 1. Compute s(w(r[i]),c(r[i])). 2. Sort R in a decreasing order according to s(w(r[i]),c(r[i])). Suppose the search engine has been queried with Q and a list of web pages R(Q) has been obtained. We have selected some subset of R(Q) as to be the most fifting ones. The history of these selections was recorded in c. Suppose now some new user comes and queries the search engine with Q as well, we will then return QamPermute(R(Q),c). QamPermute allows enhancement of a list of returned web pages by the previous data for the exactly same query. However, in general, it is more beneficial to enhance the output based on previous data on similar by not exactly the same queries. Here the idea of projection and projection similarity is utilized. Enhancement with Previous Data for a Similar Query This section describes a method for enhancing the ordering of a list of returned web pages with previous data on a similar query. We first discuss one of its components, called QamPartialPermute. It has three parameters: The first parameter is R=(r[1], . . . , r[T]), returned by the search engine and its weight vector w( ). Again R=(r[1], . . . , r[T]) is sorted according to w. 1. The second parameter is a T-place vector m, called the mask, whose entries are either 0 or 1. We use the mask vector to encode the projection from one query to another. 2. The third parameter is again a T-place vector c, measuring the selection weight given by the previous data. Technically, c(r[i]) is meaningful only if m(i)=1. Again, we assume to be given a function s(w,c). An example of this function follows: QamPartialPermute(R,m,c) 1. for each i such that m(i)=1, compute s(w(r[i]),c(r[i])). 2. compute the sorted order of s(w(r[i]),c(r[i]))'s with m(i)=1 in an increasing order. Call this order p(1),p(2), . . . , p(L), where L is the number of the marked entries. 3. following the order given in step 2, for j=2 to L, 1. let h=p(j) and g=p(j-1) 2. if r[h] is higher in R than r[g], then move r[h] just in front of r[g]. 4. return R. An application example is in order. Suppose R=(r[1], r[2], r[3], r[4], r[5], r[6], r[7], r[8]) and m=[0 1 0 1 0 1 0 1], and the sorting in the second step results p(1)=4, p(2)=8, p(3)=2, and p(4)=6. The first sub-step of step 3 moves r[8] above r[4] (given by p(1)), the second sub-step did not move r[2], and the last sub step move r[6] above r[2]. Hence we have (r[1], r[6], r[2], r[3], r[8], r[4], r[5], r[7]). In other words, QamPartialPermute properly moves up a marked entry so that the marked entry are properly sorted according to s(w(r[i]),c(r[i])). Suppose Q[1] is a query that we would like to use to enhance R(Q). The mask vector m is then given as the characteristic vector for the intersection of R(Q) and projection(Q1,Q). The vector c hence measures the selection weight given by the previous data with query Q1. The procedure QamEnhance(Q,Q[1]) first computes the mask vector, and then applies QamPartialPermute. QamEnhance(Q,Q[1]) 1. compute the mast vector m from R(Q) and R(Q[1]) 2. retrieve the selection vector c for pages in R(Q[1]) 3. return QamPartialPermute(R(Q),m,c). So the basic idea behind QamEnhance is to first find the subset of common web pages between Q and Q[1], referred as the projection from Q[1] to Q, and then use the previous data of Q[1] to improve the ordering of this projective sub-lists. Enhancement with Previous Data for Similar Queries Suppose we have already made k queries with Q[1], Q_2, . . . , Q[k], with ordered web pages R(Q[1]), . . . , R(Q[k]), respectively. Suppose we are making a new query Q. Let R be the returned result from the search engine. We iteratively enhance the quality of the list of R(Q) as following. QamEnhance(R(Q),[R(Q[1]), . . . , R(Q[k])]) 1. Compute s[1]=QamSimilarity(Q1->Q), . . . , s[k]=QamSimilarity(Q[k]->Q). 2. Sort {s[1], . . . , s[k]) in the increasing order, wlog, assume {s[1], . . . , s[k]} is already sorted. 3. For i=1 to k, 1. R(Q)=QamEnhance(R(Q),R(Q[i])); 4. Return R(Q). We improve the ordering of R(Q) by a list of queried results from the weakest projective similarity to the strongest projective similarity. Score Enhancing Methods: Sampling for Practical Implementations We have presented a theoretical version of the score-enhancing algorithm. However, there are several difficulties in implementing this algorithm: 1. The computation of QamSimilarity. 2. The computation of the mask vector QamMaskVector. 3. The potentially large number of the queries. In this section, we will address these issues and design more practically suitable approximations of our theoretical version of the algorithm. This basic idea we will use here is sampling. We will use the results from the search engine to help the approximation of the similarities and mask vectors. We will use two parameters in the following method: The first parameter is T, which is the number of the pages that we could like to reorder. In other words, after obtain a list of web pages from the search engine; we will only reorder the top T pages. 1. The second parameter is t, which is the number of web pages that we will use as the sample of a query. In practice t is around 25 and t<T. But in this discussion, we will use a parameter rather than a fixed constant. When a user asks a query Q[1], the search engine returns a list of web pages weighted by the matching scores computed by the search engine. Let R[t(Q[1 ])] be the top t pages. The user will make some selections, either from the top t pages or beyond. Let S(Q[1]) be the union of R[t(Q[1])] and the set of pages selected by the user. S(Q[1]) is the sampled space of web pages for query Q[1]. For each web page in S(Q[1]), we maintain its selection weight. The weight is 0 if a web page in S(Q[1]) which is not selected. Otherwise, for a web page in S(Q[1]) that is selected, this weight is the degree of preference. This gives the c( ) vector for the selected and sampled pages. Suppose now the user asks another query Q and the search engine returns a list of web pages. Again weighed by the matching scores. Let R[T(Q)] be the set of tope T pages that we would like to enhance. Then the similarity factor will be approximated as ##EQU4## The mask vector will simply be the vector of characteristics of S(Q[1]) in R[T(Q)]. So to use Q[1] to improve Q, we only reorder among the set of pages in common with S(Q) and R[T(Q)]. With this sampling based approximation, we are able to support the algorithm QamEnhance efficiently. We now discuss the details below. Associate with each query, let S(Q) be the sample maintained for query Q. For each web page p in S(Q), let c(p) denote the weight of preference of the page p in S(Q). Let R[T(Q)] and R[t(Q)] be the top T and t, respectively, pages returned by the search engine. This following procedure masking vector computes the mask vector m and the number of common pages in S(Q[1]) and R[T(Q)]. [common,m]=QamMaskingVector(Q,Q1,T) 1. common=0; 2. Initialize the mask vector m as a T-place all zeros vector. 3. for i=1 to T, enumerate the page in R[T(Q)], 1. if the ith page is in S(Q[1]), then m[i]=1; 2. if the ith page is not in S(Q[1]), then m[i]=0; 4. common=common+1; 5. Return common and m. QamScoreEnhance(R[T(Q)],Q[1], T, t,m) 1. Assume R[T(Q)]=(r[1], . . . , r[T]), in the order returned by the search engine. 2. for i=1 to T, 1. for each i such that m(i)=1, compute s(w(r[i]),c(r[i])), where c(r[i]) is given in S(Q[1]); 3. Compute the sorted order of s(w(r[i]),c(r[i]))'s with m(i)=1 in an increasing order. Call this order p(1),p(2), . . . , p(L), where L is the number of the marked entries. 4. QamPartialPermute: following the order given in step 2, for j=2 to L, 1. let h=p(j) and g=p(j-1) 2. if r[h] is higher in R[T(Q)] than r[g], then move r[h] just in front of r[g]. 5. Return R[T(Q)]. Suppose we have already made k queries with Q[1], Q_2, . . . , Q[k]. Suppose we are making a new query Q. Let R(Q) be the returned result from the search engine. We iteratively enhance the quality of the list of R0 as follows: QamEnhance(R(Q),[Q[1], . . . , Q[k]]) 1. for i=1 to k; 1. [common[i], m[i]]=QamMaskingVector(Q,Q1,T) 2. Sort {common[1], . . . , common[k]} in the increasing order, so wlog, assume it is already sorted. 3. R(Q)=R[T(Q)]; 4. For i=1 to k, 1. R(Q)=QamScoreEnhance(R(Q),Q[i],T,t,m[i]); 5. Return R(Q). In practice, we do not need to use all of the previous queries to improve the current query. We only need to use the top few (perhaps) five queries with the highest similarity. We will discuss this issue in the implementation section. Types Queries and Practical Justification of Projection and Sampling The sampling based projection scheme developed effectively combines the power of a search engine and the preference given by users' selection. Our method does not depend on the format of the query and treat the search engine as a black box. For example, the enhancing method does not need to know how the search engine processes a query such as "How is the weather in SF?". The similarity between two queries, such as "How is the weather in SF?" and (Weather & SF), is surely a function of the search engine. If the search engine only look for document, which exactly contains the sentence "How is the weather in SF?", then the similarity between these two queries will be very small. On the other hand, if the search engine interprets the query "How is the weather in SF?" as (current & weather & SF), then the similarity between these two queries will be much larger. However, as a post-process mechanism, it is more effective in practice to have a plug-in method, a method does not need to know the semantic of the search engine. Below, we survey the types of queries supported by the search engine for the IBM jCentral site. An important feature of our method is that it does not need to processor the query in a semantic way in order to improve the quality of the search results. It treats the query as an explicit set given by a proper sample from the search engine and user's selections. Exemplary Search Syntax Boolean Operators The concept of projective similarity allows us to make use of the previous results on closely related queries, with too much additional cost in processing these queries. The basic principle of our approach is the proper combination of the weighting score from the search engine as well as the selection data for previous queries. We will illustrate these points after giving the detailed discussion of these queries. Boolean Operators One of the main reasons of using the projective similarity and sampling as the means to enhance a list of Boolean Operators.
Word Symbol Description For Example
AND & AND or & or + bubble AND sort returns
+ returns pages pages that contain both of
containing all the the terms; that is, the term sort
query terms. and the term bubble.
OR .vertline. OR or .vertline. returns all applet OR "source code"
re-
pages that have turns pages containing one or
either query term. both terms; that is, pages that
include only applet, pages
that include only "source
code" and pages that include
applet and "source code".
NOT ! - NOT or ! or - tree NOT binary returns any
returns pages that pages that include tree, but
do not contain the don't include binary.
query term that
follows.
Query
Modifiers
quotation "" Quotation marks A search on "Java
marks ("") are used to Virtual Machine" only
denote exact matches documents
phrases. containing the exact
phrase. It doesn't
return pages with the
words used in a
different order, such
as "Virtual Machine
Java".
parentheses () By default, AND When searching for
has a higher terms like jit, "Just In
precedence than Time" and Symantec.
OR. To change One possible query
the precedence would be: (jit OR "Just
order, use the In Time") AND
parentheses. Symantec
Wildcard Character jCentral allows the use of the asterisk (*) as a wildcard character in searches. The following table describes how this character can be used to expand searches:
Use the
To search for . . . Symbol format . . . For example . . .
Keywords with * prefix* program*, yields search
the same prefix results containing words that
have the prefix "program",
such as "programs",
"programming",
"programmer", and so on
Words based on ** stem** try**, yields search results
the same stem containing words based on
word the stem word "try", such as
"trying", "tried", "tries", and so
on
Proximity (Near) Operator jCentral allows searches based on the proximity of word or strings. The following table describes this feature:
Use the
To search for . . . Symbol format . . . For example . . .
Two words or near string1 near database near tool, yields
strings in the same .about. string2 search results in which
file, close together or the word "database" is
word1 .about. word2 near the word "tool"
The near operator returns a match if both of the words being searched for are in the same file. The ranking of the match depends on the proximity of the searched-for words. That is, the ranking of a file in which the searched-for words are closer together is greater than or equal to the ranking of a file in which the words are farther apart. If the searched-for words are more than 50 words apart, they are not considered near enough, and the page is assigned a ranking of zero. Weighted Strings jCentral allows a web searcher to assign a numerical weight to a word or string to influence the ranking of the search results (Vector Space queries). The greater the weight you assign, the higher the matching results are ranked. The following table describes this feature:
To search for . . . Use the format . . . For example . . .
Pages that contain word1, word2 Java resources with words that
specific words best match the words being
searched for
Pages that contain prefix, remote*, method[50],
weighted prefixes, word[weight], data[20], "procedure call"
words, and phrases phrase[weight] [200] yields search results
that contain words prefixed
by remote, the words method,
data, and the phrase procedure
call (the terms are weighted)
Search Examples
This search
string . . . Yields . . .
quick sort Search results that contain both quick and sort
"quick sort" Search results that contain the phrase "quick sort"
algorithm, sort* Search results that contain the word "algorithm" or
words with the prefix "sort".
multi* near thread* Search results that contain words starting with
"multi" appearing near words starting with "thread"
play** Search results that contain the words "play",
"player", "playing", "played", and so on
Search results are ranked using percentages. The higher the percentage given, the more closely the result matched the search criteria. (100% is the highest ranking.) Set Approach: Approximating Queries with Samples An important feature of our method is that it does not need to process the query in a semantic way in order to improve the quality of the search results. It treats the query as an explicit set given by a proper sample from the search engine and user's selections. This provides a uniform treatment of all kinds of queries without too much overhead, and allows us to measure the similarity between queries such as 1. sorting[50], selection[10], binary_search[30] 2. sorting[80], selection[1], binary_search[19] 3. quick and sort 4. binary_search, data base, sort* 5. "quick sort" using the guessing semantic logic of a search engine. In doing so, it does put a lot of trust and faith in the quality of the search engine. For example, for a search engine that does not handle weighted vector query, (sorting [80] and binary_search[20]) may be treated as the same query as (sorting[50] and binary_search[20]). In this case, based on our set approach with sampling, the similarity between them is greater than the similarity measured when we use a search engine that processes the weighted query. In any practical sense, our projective approach provided a proper treatment of the query result returned by a search engine, proportional to the internal semantics of the search engine. In other words, the search engine itself provided the proper distribution for the measurement of the similarity between queries and hence the impact in the reordering. Data Structures and Implementation Details We now address the implementation details of our method with the focus on the data structures. The method has three data components: 1. History: recording the queries processed before and its samples and their preference value. 2. Current query Q: a user entered query. 3. Search result R(Q): a sorted list of web pages (given in URLs) returned by the search engine. The search result R(Q) is can be stored in an array or a linked list. Every element contains the URL and its weight (the w vector) given the search engine. The history part is a tabulate data. Each entry has 1. The query. 2. The list or an array of sampled web page in which every element contains the URL and its selection weights (the c vector). To compute the similarity and the mask vector, we need a data structure to determine, given an URL, which query in the history table contains the URL as well as the selection weight in the sample of that entry. We will use a hash table HASH to support this operation. In HASH the key is the URL. If the URL is not in the history, then the content entry is an empty list; otherwise, the content entry a list of the entries in the history table to indicate the collection of queries whose samples contains the URL. Therefore, given an URL, we are able to locate quickly the list of previous queries whose samples contain the URL. We now introduce another hash table, QamHash to help the computation of the mask vector and the projective similarities. The key entry is the id for previous queries. The content entry contains a variable called common, and T-place vector, m, for the mask, and a T-place vector, c, for the preference value. 1. let R[T(Q)]=(r[1], . . . , q[T]) be the set of top T pages for query Q. 2. for i=1 to T, 1. let p* be the list of previous queries that contains r[i]; p* can be obtained from HASH. 2. for each query in p*, we update its entry in QamHash: we add 1 to its common, assign the ith entry of its m vector 1, and assign the ith entry of its c vector properly. 3. return QamHash. QamHash contains all the previous queries that has a non-empty overlap in web pages with R[T(Q)]. For each such a query, we also compute its mask vector and the degree of similarity. Summary While there are a wide variety of existing methodologies that attempt to standardize the original query string to detect similarity between search queries, the present invention does not depend on the semantic parsing of the search string to achieve this standardize ranking information. Instead, it utilizes the search result set itself to detect similarity which is a different approach from the prior art and provides a variety of advantages as described above. Computer Software As would be known by one skilled in the art and as indicated in FIGS. 1-4, the system and method described herein and generally illustrated in FIGS. 1-4 may be reduced to computer instruction codes and embodied on a computer readable storage means. This may take the form of a wide variety of storage media well known in the art and/or contemplated for future use. Thus, the present invention specifically anticipates the incorporation of the system and methods discussed herein in the form of tangible computer software products. Furthermore, while not limiting the scope of the present invention, the present invention specifically anticipates that one or more components of the present invention may be implemented using the Microsoft.TM. Windows.TM. operating environment in all its variations or its equivalent commercial embodiments, including but not limited to any system incorporating a graphical user interface. A metadata search results ranking system and method have been disclosed wherein popularity and/or recency may be utilized to improve search results as compared to the prior art. The present invention's use of adaptive human interaction to improve future search results provides for a self-correcting nature that both improves future search results and permits current search results to be more easily interpreted and filtered by the interactive user. Although a specific embodiment of the invention has been disclosed, it will be understood by those having skill in the art that changes can be made to this specific embodiment without departing from the spirit and scope of the invention. The scope of the invention is not to be restricted, therefore, to the specific embodiment, and it is intended that the appended claims cover any and all such applications, modifications, and embodiments within the scope of the present invention.
|
Same subclass Same class Consider this |
||||||||||
