System and method for determining relevancy of query responses in a distributed network search mechanism6961723Abstract A system and method for selecting or ordering search results received from members of a distributing search network in response to a search request. Network nodes operating as consumer or requesting nodes generate the search requests. Nodes operating as hubs are configured to route the search requests in the network. Individual nodes operating as provider nodes receive the search request and in response may generate results according to their own procedures and return them. Communication between nodes in the network may use a common query protocol. Hub nodes may resolve the search requests to a subset of the provider nodes in the network, for example by matching search requests with registration information from nodes. Search results may be selected, ordered, and/or consolidated for use by the requesting nodes by nodes receiving a plurality of the search results. Claims 1. A method for searching distributed resources, comprising: Description BACKGROUND OF THE INVENTION
An example registration for abcd.com may look like this:
Query messages may be structured to indicate which portions are queries and which include other information. For example, a default namespace may be specified by a URI such as "http://abcd.org/search". A query message may be contained within the envelope <request> . . . </request>. A query unique ID may be specified in a uuid attribute of the <request> tag. A query space may be specified within the tags <query-space> and </query-space>. An actual query data may be enveloped within the tags <query> and </query>. Query data may be arbitrary XML within a namespace that matches "http://abcd.org/search/text", which includes the tag <text> to specify free text, or within any other namespace specified by the <query-space> definition. Generally any envelop or structure may be used provided it adequately identifies information needed to define query information between members of the search network. In one embodiment, each query request message includes the <request uuid="uuid details">, <query-space>, and <query> tags. In one embodiment, although <query-space> defines a name for the type of query that is being performed, a name space may be used with the same value as that specified in <query-space> for all the queryspace-specific tags. One embodiment may use a full XML schema framework for defining and validating queryspaces. Response messages may be structured to indicate which portions are responses and which include other information. For example, a default name space may be "http://abcd.com/search". A response message may be enveloped within the <response> and </response> tags. A body of the response may be arbitrary XML as long as it corresponds to the specified queryspace and corresponding namespace, e.g. the queryspace "http://abcd.com/search" includes the <text> tag. Generally any envelop or structure may be used provided it adequately identifies information needed to define response information between members of the search network. The following is an example of the format of a query for the term "foo":
The following is an example of a response to this query:
A more complex example may be:
In this example the query space is defined as "http://bigbookseller.com/search" and the namespace "books" matches the URI for this query space. The query specifies that the "author" within the name space "books" should be "John" or "Doe" or that the title should contain "Widgets". An example of a response by abcd.com may be:
In addition, request messages may contain optional attributes. These may be contained inside request tags. If unspecified, defaults attributes may be assumed. Optional attributes may include: "max-hits-per-provider" indicating a number of hits expected from a provider; "flushafter" to indicate to flush the output stream to the client after receiving responses from a certain number of providers; "queryuuid" to indicate a unique id of the query; "querylifetime" to indicate a length of time during which the query is valid; or "maxfanout" to indicate a maximum number of providers to which to forward the query. For example, a tag may be: <request flushafter=5 providerhits=2 timeout=2>. An architecture for the distributed information discovery platform is shown in FIG. 2, according to one embodiment. In one embodiment, a consumer 140 may provide users an access point to a distributed information discovery network. A consumer such as consumer 140A may include a consumer query request protocol interface 142. A QRP interface may be a stand alone application, a component of a distributed information discovery platform, a script capable of parsing requests and generating an appropriately formatted response, or any hardware or software configured to include at least functionality for translating to or from query response protocol data. The consumer QRP interface 142 may send queries written in the query request protocol to the hub 100 for query resolution and routing. After sending a query, the consumer QRP interface 142 may await responses from providers. In one embodiment, the queries may be received by a hub consumer QRP interface 108 of router 104. In one embodiment, the consumer QRP interface 142 may also perform formatting of the responses for presentation to the end user or application, which may include ordering or otherwise organizing the responses. In one embodiment, the formatting or ordering of the responses may be in response to instructions received from the consumer or provider. In one embodiment, consumers 140 may also include a front end or user interface (e.g. a web user interface) to the hub (e.g. the router and/or resolver). In one embodiment, a consumer 140 may include a mechanism for ranking and presentation of query results. In one embodiment, this mechanism may be a component of the consumer QRP interface 142. Ranking methodology may be implicit in each queryspace, and may be returned as part of each response in some embodiments. Some ranking schemes may require third-party involvement. In one embodiment, consumers such as consumer 140C may not include a consumer QRP interface 142. These consumers may use a consumer proxy 110 to interface to the functionality of the hub 100. The consumer proxy 110 may perform translation of queries formatted in one or more query protocols supported by the consumers 140 into queries in the query routing protocol. These queries may then be sent to the hub 100 for resolution and routing. In one embodiment, the queries may be received by a hub consumer QRP interface 108 of router 104. The consumer proxy 110 may also perform translation of query responses formatted in the query routing protocol into one or more protocols supported by the consumers 140. As shown, one or more consumers 140 may interface with the consumer proxy 110. In one embodiment, a provider such as provider 120A may include a provider query request protocol (QRP) interface 122 that may accept queries from the hub 100 in the query routing protocol and respond to the queries with query responses in the query routing protocol. The provider QRP interface 122 may perform translation of queries into provider-specific requests. In one embodiment the QRP interface 122 may include an indexing and/or searching interface and may be configured to perform indexing and searching itself. In one embodiment, the provider QRP interface 122 may not perform any indexing or searching itself, but rather may call the appropriate indexing and/or searching interface of the provider 120, for example, a database search engine. In this embodiment, the provider QRP interface 122 may, if necessary, translate the queries from the query request protocol into a protocol that may be used by the appropriate indexing and/or searching interface of the provider 120. The provider QRP interface 122 may also, if necessary, translate the query responses from the protocol used by the appropriate indexing and/or searching interface of the provider 120 into the query request protocol. A provider QRP interface 122 may be, for example, a small modification of an existing search engine script (Java Server Page (JSP), Perl etc.) so that queries from a distributed information discovery network can be applied to the provider's search engine. Provider proxy 114 may perform translation of queries formatted according to the query routing protocol to specific search engine formats for a provider 120 such as provider 120C. Provider proxy 114 may also perform translation of responses formatted according to the specific search engine formats into responses formatted according to the query routing protocol. A provider proxy 114 may be used, for example, if a provider 120 does not run its own provider QRP interface 122B but does allow access to its own search engine. Hub 100 performs the routing of queries from consumers 140 to providers 120. The hub 100 accepts queries, resolves those queries to the appropriate providers 120 and then manages the routing of the queries to the providers 120. The hub 100 then may collate the results received from one or more providers 120 and send the results back to the requesting consumer 140 in a query response. In one embodiment, rather than sending the results back to the consumer 140 in a query response, the results may be provided to the client by other means. For example, the query message may include an email address or addresses to receive the results. After receiving and collating the results, the hub 100 may email the results to the email address(es) specified in the query message. In one embodiment, the hub 100 may also receive queries in email messages from consumers. As another example, the hub 100 may post the results to a URL specified in the query message. Alternatively, the provider 120 may provide the results directly to the consumer 120 rather than routing the results through the hub 100. The query message may include information that allows the provider 120 to provide the results directly to the consumer 120. For example, the query message may include an email address and/or a URL for the consumer 140, and the provider 120 may email the results to the specified email address or send the results directly to the URL specified in the query message. A hub 100 may comprise a router 104 that may provide a portion of the functionality of the hub 100. The router 104 may route queries to providers 120, manage query connections, collate results and return responses to consumers 140. A hub may also comprise a resolver 102 that matches queries to providers 120. Provider information 106 may include one or more registration files comprising metadata specified by the providers 120 during registration. In one embodiment, the resolver 102 may be based on a full text search engine. For example, the core components may be adapted from the Lucene search engine (http://www.lucene.com) written using Java. In one embodiment, the resolver 102 may index all tags and text in the registration files. A reverse index may be created which maps query terms to providers. For efficiency, the resolver may create separate indices for each queryspace. In one embodiment, a provider 120 may accept queries in the query routing protocol directly from consumers 140 without the queries being routed by the hub 100. In one embodiment, a provider 120 may also return responses to queries directly to consumers 140 without routing the responses through the hub 100. In one embodiment, a distributed information discovery system may be implemented as a series of distinct web services. Each of the router, resolver, proxies, and QRP interfaces may run independently. In one embodiment, these web services may be implemented as Java Servlet classes referencing additional Java classes for core functions. For example, in a web embodiment, each of the router, resolver, proxies, and QRP interfaces may be implemented on a web-accessible server or servers. Also, a distributed information discovery network may include multiple different routers, resolvers, proxies, and QRP interfaces. One router or resolver may register with another router or resolver. Or one distributed information discovery system implementing a router, resolver, proxies, and/or QRP interfaces may register with another such system. For example, a distributed information discovery routing system may register providers of information concerning outdoor recreation. Another distributed information discovery routing system having provider registrations for boating may register with the first system. In some embodiments, a distributed information discovery system may be implemented on different peers in a peer-to-peer network, or other networks. In one embodiment, a database used by any of the above components may be a database that provides persistency, such as a GOODS (Generic Object Oriented Database System) database. GOODS is an object-oriented fully distributed database management system (DBMS) using an active client model. Other databases based on other DBMSs may also be used. Using the proxies and QRP interfaces described above, the distributed information discovery platform may offer a unique technology by enabling search across heterogeneous communication protocols and systems and presenting those results using any other protocol and system. An example of this is the distributed information discovery platform's ability to search Java Server Page (JSP) based HTTP systems simultaneously with Perl-based XML systems and Java-based peer-to-peer protocol systems. The distributed information discovery platform may also provide a mechanism for presenting those results in HTTP-based HTML, a peer-to-peer protocol, or any other protocol/medium combination. The consumer and provider proxies and QRP interfaces may serve as adaptors for multiple data sources to plug into a standardized interface for distributed deep search. In one embodiment, the distributed information discovery platform is an XML-based request/response system. By using an XML-based messaging format, the distributed information discovery platform may enable powerful and easily implemented deep web searches. Participants in the distributed information discovery platform network need only apply fairly common and available facilities to adapt their system as a network provider. The XML nature of the response messages additionally expands the scope of a provider's ability. Applications other than web browsers may manipulate the responses for different purposes such as determining an average price or a current demand based upon availability. To be a network provider, participants may include a provider QRP interface 122 that may be tailored for a provider's specific system. A provider QRP interface 122 may parse or translate a query routing protocol request from the distributed information discovery network, query a provider back end 180 to get appropriate data 182, and then generate a response and send it back to the distributed information discovery network according to the query routing protocol. In one embodiment a QRP interface may determine whether a query is recognizably formatted, contains an illegal query or result, would access restricted information, or otherwise cannot validly be processed, and may return an error message or code or similar indication. In one embodiment, the distributed information discovery platform may provide one or more generic QRP interfaces that may be used as examples that illustrate how to use a specific language to accept requests and/or generate responses. The distributed information discovery platform may also provide one or more QRP interfaces that plug into existing, freely available systems. In one embodiment, queryspaces may be defined within the distributed information discovery platform that enable providers 120 to do more than return links to web pages. For example, rather than querying a database, a QRP interface 122 may compute a price for a particular service based on demand in real time. The QRP interface 122 may generate data, or may cause another application to generate data on demand. An example may be an auction system for spare CPU cycles, where a client would query various providers 120 for CPU time. The providers 120 may generate a price based on current availability. In one embodiment, the distributed information discovery platform may include a result presentation mechanism that may perform computations on and/or presentation formatting of results data. There may be some differences in some of the internal mechanisms of embodiments that bind to different networks. In general, the query routing protocol and the resolution mechanism may be the same or similar in the different embodiments. The routing mechanism and the client interfaces in the different embodiments, however, may be implemented at least partially differently to support the different network types. FIG. 3 illustrates message flow in a distributed information discovery network according to one embodiment. An application on consumer 140 may find information providers 120 to respond to a particular query by sending the query into the network via a specific access point (hub 100). In one embodiment, consumer 140 may send the query to router 104 of hub 100. Router 104 may then send the query to resolver 102. In one embodiment, queries may conform to the query routing protocol. In one embodiment, queries are markup language (e.g. XML) messages with essentially arbitrary structure. In this embodiment, there are no restrictions on what tags may be used in queries. Resolver 102 may determine one or more providers 120 which may receive the query. One or more information providers 120 may have previously registered with hub 100 by sending registration messages each including one or more queryspaces for the particular provider 120. In one embodiment, information from the registration messages, including queryspace information, may be maintained in provider information 106. Provider information 106 may be a file or database of files including registration information for one or more providers 120. Resolver 102 may index and search provider information 106 for queryspaces that match the query. For a provider 120 to be selected to receive the query, the queryspace specified in the query must match a queryspace of the provider 120. Also, the path predicate specified in the registration message must select a non-empty set of nodes in the query. After determining the one or more providers 120 to receive the query, resolver 102 may provide a list of the selected one or more providers 120 to router 104. Router 104 may then send the query to each of the selected one or more providers 120. Once an information provider 120 receives the query, it composes a response and sends it back to the router 104 of hub 100. Hub 100 may receive one or more responses from each provider 120 that was sent the query. Router 104 may then forward the received responses to the consumer 140, and thus to the querying application. In one embodiment, the hub 100 does not evaluate competing relevance rankings. In one embodiment that task is left to the querying application. In one embodiment, the hub 100 may collate the responses received from one or more providers 120 prior to sending them to the consumer 140. In this embodiment, consumers 120 are not required to listen for asynchronous responses. Collation may also provide security benefits. For example, collating responses may help prevent distributed denial-of-service attacks based on spoofed queries. Also, the distributed information discovery network may be used to establish peer-to-peer connections. In one embodiment, the consumer 140 may connect to the resolver 102 initially to request a set of providers 120 to be targets of a query, and then sends this list of providers 120 to the router 104, which manages the query routing from the consumer 140 to the providers 120, and which also returns the results to the consumer 140 (i.e. Consumer->Resolver->Consumer->Router->Providers->Router->Consumer). FIG. 4 illustrates a provider 120 with provider QRP interface 122 interfacing to a provider search engine backend 180 according to one embodiment. In one embodiment, a provider QRP interface 122 may serve as an adaptor to the query routing protocol. In one embodiment using a common protocol based on a technology such as XML, fairly common and available facilities may be used to create a provider QRP interface 122 to serve as an adaptor between a provider backend and the distributed information discovery network. Thus, to be a network provider, participants may include a provider QRP interface 122. A provider QRP interface 122 may be tailored for a provider's specific system. A provider QRP interface 122 may parse or translate a query routing protocol request from the distributed information discovery network, query a provider back end 180 to get appropriate data 182, and then generate a response and send it back to the distributed information discovery network according to the query routing protocol. A provider QRP interface 122 may be a stand-alone application or alternatively a script capable of parsing the requests, gathering data and generating an appropriately formatted response. In one embodiment, providers or back-end systems may send response messages to the provider QRP interface 122 using the Rich Site Summary (RSS) protocol as a default protocol. RSS is an XML protocol designed for site summaries. Using RSS may provide a common formatting standard of the responses, removing the need to handle custom HTML or other custom protocols being returned from providers. In one embodiment, provider proxies are configured to use RSS. In one embodiment, QRP interfaces may support queries wrapped as XML-RPC (XML Remote Procedure Call (RPC)) requests. XML-RPC is a protocol (which forms the basis of Simple Object Access Protocol (SOAP)) for invoking server side methods over XML. In other embodiments, QRP interfaces also support HTML or other formats for data transmission or data gathering. FIG. 5 illustrates a provider 120 with provider QRP interface 122 interfacing to a provider search engine backend 180 according to one embodiment. In this embodiment, a result presentation mechanism 190 is shown that may enable providers 120 to do more than return links to web pages. For example, result presentation mechanism 190 may take the search results in the response message from search engine 180 and tailor the results into a presentation format such as a markup language document. This markup language document may be sent to the provider QRP interface 122, which may package the document in a QRP response and send it to the consumer 140. In one embodiment, QRP interface 122 includes result presentation mechanism 190. As another example, a result presentation mechanism 190 may compute a price for a particular service based on demand in real time. As an example, in an auction system for spare CPU cycles, a consumer 140 may query various providers 120 for CPU time. The providers 120 may generate a price based on current availability. The distributed information discovery platform may be used for augmenting standard search engines that index statically available web pages. Standard web pages are useful mainly for web browsers. Other devices, such as wireless communications devices, may benefit from searches that expose relevant data. The distributed information discovery platform may provide the ability to collect queries and provide results with meaningful relevance to a wide variety of information consumers and producers. The distributed information discovery system may use a dynamic data collection methodology that is independent of an information provider's presence on the World Wide Web, for example. An information provider may use a provider QRP interface 122 to function as an adapter and handle incoming queries, provide a registration that defines which services and information are available for what devices (e.g. cell phones, PDAs, etc.), and use a result presentation mechanism 190 to tailor results for presentation on the particular devices. For example, a cell phone may be used to find open service stations and compare prices, or restaurants to compare menus. A consumer QRP interface may be integrated in the cell phone, or may be accessible from the cell phone device to handle queries and responses, tailoring results for presentation on the cell phone. The consumer QRP interface may also similarly be integrated in other mobile or portable devices, and computers generally. For providers 140 that do not run an adapter for the distributed information discovery platform, a hub 100 may run a provider proxy 114 as illustrated in FIG. 2. A provider proxy 114 may perform translation of queries formatted according to the query routing protocol to specific search engine formats for a provider 120. Provider proxy 114 may also perform translation of responses formatted according to the specific search engine formats into responses formatted according to the query routing protocol. In one embodiment, the provider proxy 114 may perform off-line spidering and indexing of the providers 140 and respond to queries as a standard search engine would (this could be considered an open search indexing service). In another embodiment, the provider proxy 114 may perform translation of queries formatted according to the query routing protocol to specific search engine formats for a provider 120, and may also perform translation of responses formatted according to the specific search engine formats into responses formatted according to the query routing protocol. FIG. 6 illustrates an exemplary distributed information discovery network including a plurality of hubs 100 according to one embodiment. Each hub 100 may support one or more providers 120 and/or consumers 140 which may use the hub 100 as an access point to the distributed information discovery network. As shown in node 180, a node on the network may include instances of both a consumer 140 and a provider 120. In one embodiment, the distributed information discovery platform may support nodes comprising one or more consumers 140, one or more providers 140, and/or one or more hubs 100. In one embodiment, the distributed information discovery network may include one or more hubs 100 that each may support a particular type of application or specialist domain. For example, a web site might run a hub 100 as a vertical aggregator of content pertaining to Java programming. Its providers 120 may include other sites with content focused on Java. However, the web site may also send queries out to a different hub 100 running on a more general technology news site whose providers 120 may include sites such as CNet or Slashdot, for example. As another example, in a peer-to-peer network, hubs 100 may be used to group together peers with similar content, geography or queryspaces. Each peer within the network may interact with the hubs 100 using its appropriate service(s) (e.g. provider, consumer, and/or registration services). FIG. 7 illustrates provider registration in a distributed information discovery network according to one embodiment. Information providers 120 may register themselves within a distributed information discovery network. To register, a provider 120 may contact a hub 100 with a registration message. The registration message may conform to the query routing protocol (QRP). In one embodiment, a provider 120A may include a provider QRP registration interface 124 that is operable to send a registration message to the hub 100. In one embodiment, hub 100 may include a QRP registration interface 112 that may be configured to receive registration messages from providers 120. Provider QRP registration interface 124 may also maintain a registration file for the provider 120A. In one embodiment, the distributed information discovery platform may include a registration service 160 that may provide a QRP registration interface to hub 100 for providers 120 that do not include a provider QRP registration interface 124. Providers 120 may specify the type of queries they wish to receive in a registration file that may be provided to a hub 100 at provider registration. In one embodiment, a registration file may be an XML document comprising metadata about the information that the provider 120 wishes to expose. This file may encode the type and structure of queries, queryspaces and response formats compatible with provider 120. A QRP interface may use the type and structure information in the file to encode queries, queryspaces and responses in formats compatible with provider 120. The registration file can be thought of as an advertisement of the provider's metadata and its structure. The registration file may include information specifying one or more of several items. For example, a provider's query server endpoint may be included. If this is a peer-to-peer network implemented using the peer-to-peer platform described herein, the endpoint may be a pipe identifier or advertisement. In the web domain, this may be a CGI script which is capable of processing the query request protocol request messages and responding with a query request protocol response. In other embodiments, the endpoint may be a URL. Queries which match one of the provider's predicates may be posted to this endpoint. The file may include a queryspace of the queries this provider will accept. In one embodiment, this may be specified as a queryspace URI (e.g. URL). When queries are posted to this queryspace, the query may be checked against the provider's predicates for matches. The file may include a response format that the provider is capable of responding in. The response format may be specified as a URI to an XML schema. The file also may include a structure and content of the queries the provider is interested in receiving, specified in predicate form. In one embodiment, a set of predicates may define the structure and content In one embodiment, a registration message may include the following tags:
The following is an example registration document according to one embodiment:
This example registers a provider 120 with a queryspace. It also registers one predicate that will direct any query containing any of the words "baba", "ghannouj", "ghannoush" or "ganoush" to the provider's query server running at http://www.efgh.com/search.jsp. This matches any query containing the particular keywords. As another example, consider the following registration:
This registers a provider 120 with the text queryspace, specified by http://bigbookseller.com/search. This registration registers the provider for the following queries: any query containing "John Doe" or "Jane Doe" in the <author> field and any query containing "Foobar", "Gadgets", or "Widgets" in the <title> field. Queries matching these conditions may be directed to the query server running at http://littlebookseller.com/exec/search. Predicates may be much larger than this exemplary predicate, and may also contain more complex structure. In some embodiments, if the provider 120 does not specify a queryspace, a default queryspace may be registered for the provider 120. In such an embodiment, queries failing to indicate a queryspace may be assumed to be of the default queryspace. In one embodiment, a provider may be registered using a user interface in which keywords may be typed or pasted. In one embodiment, the user interface may be a Web page. In one embodiment, providers may be able to choose from a list of categories in addition to choosing keywords for their registrations. These categories may reflect the contents of open directories such as dmoz.org and some common news sources (e.g. CNN). For example, the top level of dmoz may be used as a pull down list or menu of categories from which providers may choose. In one embodiment, further specialization in categories may be provided—e.g. for News, providers may choose News->Tech News. In one embodiment, a recursive menu system may be used—e.g. a provider picks News, then presses submit, then picks Tech News and so on. The category data may be updated as needed—e.g. daily for news, weekly for other categories. In one embodiment, providers may edit their registration information via a user interface (e.g. web page) or a web form, or alternatively submit a replacement/addition to their registration. In one embodiment a QRP adapter may monitor or log queries, results, number of hits, searches, results, etc. or generally the information passing through the QRP adapter. In one embodiment, a user interface may be provided through which providers may view the results of searches and hits performed by consumers—e.g. how many searches resulted in their entry being returned, how many users clicked through, etc. In one embodiment, a user interface may be provided through which providers may monitor and/or control the number of queries sent to them and also to throttle traffic (e.g. turn it off) if necessary. In some embodiments, a QRP interface may be able to access a registration file, for example to read at least part of the registration document or to write to replace or to add to at least part of the registration document. An embodiment may include a site analysis tool that may be used for building registrations for sites that do not know how to or that do not desire to build their own registration. The site analysis tool may be available as an option during registration (for example, "build me a registration file" with a turn around of 24 hours or so), and may allow the provider to enter one or more initial keyword starting points. The site analysis may produce a queryspace from the information available through a site to reflect the kind of query to which the site may respond. In one embodiment the site analysis tool is part of a QRP interface. In one embodiment the QRP interface is a proxy to a provider. The tool site analysis tool may query, crawl, spider, index, or otherwise access or interact with the site to determine the type of information available from the site. FIG. 8 is a flowchart illustrating message flow in a distributed information discovery network according to one embodiment. An application on a consumer may find information providers to respond to a particular query by sending the query into the network via a specific hub. In one embodiment, queries may conform to a query routing protocol. In one embodiment, a consumer QRP interface is configured to produce queries that conform to a query routing protocol. In one embodiment, queries are markup language (e.g. XML) messages with essentially arbitrary structure. In this embodiment, there are no restrictions on what tags may be used in queries. The consumer may send the query to the hub as indicated at 300. In one embodiment, a router on the hub may receive the query. In one embodiment, a query routing protocol interface of the consumer may translate the query from a protocol understood by the consumer to the query routing protocol before sending the query to the hub. As indicated at 302, the hub may resolve the query to determine one or more providers that may want to process the query. In one embodiment, the router may then send the query to a resolver on the hub to perform the query resolution. In one embodiment, a provider may be selected to receive the query if the queryspace specified in the query matches a queryspace of the provider and the path predicate specified in the registration message selects a non-empty set of nodes in the query. In one embodiment, the resolver may index and search provider information for queryspaces that match the query. After determining the one or more providers to receive the query, the hub may route the query to the one or more providers as indicated at 304. In one embodiment, the resolver may provide a list of the selected one or more providers to the router. The router may then send the query to each of the selected one or more providers. Once a provider receives the query, it may search for results in its queryspace that satisfy the query as indicated at 306. A backend search engine of the provider may perform the search. In one embodiment, the query may be translated from the query routing protocol to a protocol used by the provider by a query routing protocol interface of the provider. In one embodiment, a provider QRP interface or adapter may access a backend search engine of the provider to perform the search. The provider may compose a response (containing the results of the query) and send it back to the hub as indicated at 308. In one embodiment, the query response may be translated from the protocol used by the provider to the query routing protocol by a query routing protocol interface or adapter of the provider before sending the response to the hub. In one embodiment, the response may be received on the hub by the router. The hub may receive one or more responses from each provider that was sent the query at 304. As indicated at 310, in one embodiment, the hub may collate the responses received from the one or more providers prior to sending them to the consumer. The hub may be configured to tailor the collated responses, as by arranging them in a particular order or according to some categories, by chronological order, to indicate relevancy, or some other method that may be useful to the consumer. The hub may then forward the (possibly collated) responses to the consumer as indicated at 312, and thus to the querying application. In one embodiment, the router handles the routing of the response(s) to the consumer. The consumer may receive the query response and optionally display the results as indicated at 314. Optionally, the consumer can do whatever is necessary to the results, including storing the results, forwarding the results, and modifying the results. In one embodiment, the query routing protocol interface of the consumer may translate the query response from the query routing protocol to a protocol understood by the consumer after receiving the response from the hub. In one embodiment a consumer QRP interface at the hub or a consumer proxy may translate the query response from the query routing protocol to the protocol understood by the consumer. In one embodiment, instead of, or optionally as well as, sending the results to the hub, the provider may send the results directly to a location specified in the query message. For example, the query message may specify a URL that the consumer wishes the results forwarded to or displayed at. As another example, the query message may include an email address or addresses that the consumer wants the results emailed to. In some embodiments, pre-crawling may be employed to create or update a provider registration automatically. For example, a provider may register with a distributed information discovery network. The provider may use, or contract a service to use, a tool to build a statistical metadata index from documents retrieved automatically through the provider's web-based interface. The metadata index may then be used to provide query routing. In other words, the provider's site may be "crawled" to create the registration (e.g. an XML-based metadata index). Key terms may be selected as the site is crawled to form the registration index. A queryspace is a unique identifier for an abstract space over which a query will travel. Queryspaces may be identified by unique URIs. Queryspace URIs may not necessarily reference actual content. Queryspace URIs are identifiers that providers and consumers may use to find each other. In one embodiment, both providers and queries may have queryspaces. A provider's queryspace may be defined as a schema that defines the scope of the set of data which the provider is capable of searching. A query's queryspace may be defined as a schema that defines the scope of the set of data which the consumer wishes to search. In one embodiment, the distributed information discovery platform may not make assumptions about the syntax or semantics of queryspaces. In this embodiment, the distributed information discovery platform does not process queryspaces, nor does it attempt to validate queries and responses—queryspaces are purely for coordination between consumers and providers. In one embodiment, a queryspace may include information regarding structure, for example so that queryspaces may allow providers and consumers to agree on the structure of messages and by specifying structural constraints in a standard form, e.g. a DTD or an XML Schema. In one embodiment, a queryspace may include information regarding semantics, for example so that providers and consumers may agree on the meaning of the messages that they exchange (in addition to their structure). While structural information may be machine-readable, semantic information may be intended for use by in writing client and server software. In one embodiment, a queryspace may include information regarding ranking. Queryspaces may define how clients may sort the results that they receive. Ranking may be application-dependent, and some applications may not require ranking at all. In one embodiment, the distributed information discovery platform may not specify methods for exchanging queryspace information. The distributed information discovery platform may ensure that providers receive only queries that match their queryspaces. The distributed information discovery platform encourages efficiency by allowing providers to filter the queries that they receive. To filter queries, a provider may include one or more predicates with each queryspace that they register. A predicate statement may be applied to each candidate query in the given queryspace; only queries that match the predicate statement may be sent to the provider. Internally, the distributed information discovery platform may use the predicates to optimize routing. In one embodiment, each query may contain at least one query section which may contain arbitrary XML. The contained XML should conform to the specified queryspace; otherwise, the query will probably not match any information provider predicates and will therefore receive no responses. In some embodiments, the distributed information discovery platform may not attempt to validate the query. If multiple query sections are specified, the information provider may choose which query to respond to. In one embodiment any QRP interface may indicate that a query cannot be processed, for example if it is an illegal query or otherwise invalid. In one embodiment, a resolver may validate a query according to a registered schema for the queryspace identified in the query. In one embodiment, the query routing protocol does not require queries or responses to identify machine addresses. Some queryspaces may agree to share addresses explicitly (e.g. peer-to-peer file sharing), while other queryspaces may choose to share addresses implicitly (e.g. with embedded XHTML). The structure of both the query and the response may be specified (explicitly or implicitly) by the chosen queryspace. In an example of a full-text schema, the response in the data section may be mixed-content XHTML to be displayed in a browser. In an example of a music schema, the data section of a response may contain structured information intended for applications as well as "unstructured" XHTML intended for humans. Some embodiments may use full-text queryspaces. In one embodiment, a full-text queryspace may use the following DTD:
For example, a query for "dog biscuits" under this queryspace may be formatted as:
In one embodiment, a full-text queryspace may be the default queryspace. In some embodiments, a full-text queryspace, such as the above example, may be extended to support "and" and "or" operations. Providers may register query predicates with a distributed information discovery network, e.g. by registering with a hub. When a client submits a query to the network, it is resolved to matching providers. For example, a provider may register a registration using the queryspace specified by the URI "http://www.infrasearch.com/food/recipies":
This registration registers the provider with the recipes queryspace with two predicates. Queries with "appetizer" in their <type> node and either of the words "eggplant" or "tahini" in their <ingredients> node are matched by this registration. A predicate is also registered that will direct any query containing any of the words "baba", "ghannouj", "ghannoush" or "ganoush" to the provider's query server running at http://www.efgh.com/search.jsp. Query Node Patterns (QNPs) may be the basic building block of query predicates. Each matches a node of an XML query. QNPs may be XML fragments. They match a query when they match some subset of that query's structure, or, more formally, they may be constructed by a series of the following transformations: (1) deleting a node in the query; or (2) replacing the query with a subnode of itself. For example, consider the following XML query:
This query is matched by the QNPs as illustrated in Table 1 of FIG. 9. In QNP matching, tag text (a.k.a. character data) may be tokenized at whitespace breaks and considered a set of tokens. Some embodiments may be limited to keyword matching only. Other embodiments may support phrase matching as well. In some embodiments, matching may be case-insensitive. In some embodiments, a QNP may only contain one path through the query XML. In such embodiments, the following QNP would be invalid:
In other embodiments, the single path restriction does not apply and the above QNP would be valid. In single path restricted embodiments, the above QNP would instead be specified as a predicate containing the conjunction of two separate QNPs:
Tag text may be an exception to the single path restriction. In some embodiments, if a QNP node contains multiple text tokens, these may form an implicit disjunction. A query predicate may be a boolean expression composed of QNPs. In some embodiments, predicates must be in conjunctive normal form, i.e., a conjunction of disjunctions. In other embodiments, this restriction may not apply. As an example of a conjunctive normal form predicate, consider the following query predicate:
Note that the first two conjuncts are implicit disjunctions. When an <or> . . . </or > tag contains only a single QNP, the <or> . . . </or> may be dropped. Similarly, if the top-level only has one element, the <and> . . . <and> may also be dropped. Thus, according to one embodiment, at its simplest, a predicate may be of the form: This predicate would match any query containing the word "U2" or the word "Nirvana." As mentioned previously, a resolver may create and maintain a set of indices for the provider registration files, with separate indexes for each queryspace. When a provider sends a registration file, the resolver parses it into a set of predicates, each predicate having a set of clauses, and each clause having a set of disjunctions. In one embodiment, predicates may be in conjunctive normal form. Each predicate may be given a global unique predicate ID, and each clause may be given a local clause ID. For each pattern in the registration, a posting may be created which contains the predicate ID and the clause ID. The predicate ID and clause ID may be used to trace the pattern to the clause in the registration where the pattern occurs. The (pattern, posting) pair may be stored in the corresponding query space index. The posting may also include a score, which may be updated based on feedback received from the user. The following is an example of a simple XML fragment of two predicates from a registration and the corresponding index entries:
The corresponding entries in the index may be:
That is, the index will have eight entries. In one embodiment, at least three of these entries have to match a query for the query to be routed to the provider. Query resolution is the process of determining a set of one or more providers to which a given query should be routed. Sending all queries to all providers is inefficient, therefore the distributed information discovery platform defines a framework for providers to register the type of queries they are interested in receiving and provides a query resolution and routing service. Providers may specify the type of queries they wish to receive in their registration file. In one embodiment, the minimal condition for matching a query to a provider is that the query has to have the same queryspace as the provider registration. In some embodiments, the minimal condition for matching a query may be for the query to have at least one matching element to the queryspace of the provider registration. In one embodiment, the set of providers may be selected by the resolver 102 in a certain order. In one embodiment, providers which have all clauses of at least one predicate satisfied may be selected first. In order to match a predicate, a query may first be tokenized into a set of patterns (QNPs). In one embodiment, providers may be ranked based on the matched pattern scores. In one embodiment, providers which do not have a matching predicate, but are similar in their responses and have the same queryspace as providers who have a matching predicate may be selected in a lesser category. In one embodiment if the number of providers returned is still less than the maximum, a provider may be selected (e.g. at random) from the same queryspace as the query. In this embodiment, this allows the exploration of the provider content in case the provider registration file is incomplete, or is not updated frequently. As mentioned previously, there may be a score associated with each (pattern, posting) pair in the resolver index. In one embodiment, scoring may be used to determine the popularity of providers for a particular type of query. Scoring may be used in selecting the most popular providers relevant to the query first. Scoring works as follows. If a user sends some feedback in response to a query response, the (pattern, posting) pairs that matched the query may be retrieved from the corresponding queryspace index, and their scores updated (i.e. increased for a positive feedback or decreased for a negative one). In one embodiment, a simple score update formula may be used: where (0<alpha<1) determines the rate of change of the score. Other embodiments may use other score update formulas. In one embodiment, in instances where there are very few providers who match a query, providers may be selected that did not match the query, but who have registered the same query space and who are similar to a provider who matched the query. In one embodiment, a method similar to collaborative filtering may be used in determining provider similarity. Providers who tend to match the same queries are considered more similar. In one embodiment, a similarity matrix may be maintained in the resolver. The entries in this matrix may determine the degree of similarity between provider x and provider y. A router may perform the certain functions. For example, in one embodiment a router may receive the queries from the end application/consumer. In one embodiment a router may route the queries to the appropriate providers. In one embodiment a router may merge the results of the queries and presents them to the end application. In one embodiment, a router may include routing or address information with its communications. When the router receives a request from the network, it may ask the resolver for a list of nodes on the network that are registered as wanting to receive queries like the request received. Once the resolver returns a set of network node endpoints, the router routes the query to this set of providers. In one embodiment, the resolver may return network node IDs with the network node endpoints that may be relevant only within the distributed information discovery platform and that may be used for logging. In one embodiment, a router may be a JAVA Servlet. The router may be platform-independent so that the deployment platform for the router may be Linux, Win32, etc. In some embodiments, routers may be distributed or clustered. In one embodiment, a router system may be organized to include a router to perform certain functions, for example functions described above. In one embodiment a router system may include a RouterServlet to receive routing requests and give access to real-time statistics. In one embodiment, a router system may include a HttpRouteConnection to use HTTP as transport and XML as encoding for a route to a given provider. In one embodiment, a router system may include a Router.Stat to provide statistics for a given route, for example bandwidth, response times, traffic, etc. The RouterServlet may receive a request to route a particular query. In one embodiment, each routing request may be an HTTP request with certain headers. For example, a uuid or unique identifier for the request (which may be used for logging purposes in the router, and may have other uses in other components or users of a distributed information discovery network). Another header may be a timeout or the amount of time to give each provider to respond. In one embodiment another header may be a NumHits, where each provider may respond with several hits but the router may take only the first N hits to be propagated back to the app/user. Another header may be a FlushAfter that may indicate to flush the response stream after receiving responses from N providers. In one embodiment, the body of the routing request may be an XML-encoded query (see description of queries above). In some embodiments, the routing request may also include a set of cookie headers, which may be encoded, for example, as When RouterServlet receives a query request from the distributed information discovery network, it asks the resolver for a list of nodes on the network that are registered as wanting to receive queries like this one. The resolver may return a set of network node URLs and network node IDs (e.g. unique provider IDs stored within the distributed information discovery router system and used for logging). The Router may then route the query to this set of providers. The router may contact the list of providers returned by the resolver. At least one QRP interface may be used when the router contacts the list of providers. In some embodiments a router is not limited to any transport or encoding scheme. In one embodiment, different transports and encodings may be plugged in. In one embodiment, HTTP and light-weight XML encoding may be used. In one embodiment, the router may use an exponential back-off algorithm to handle spamming and/or slow or temporarily down hosts. For example, if a provider exceeds a set timeout, the resolver subsystem may be alerted to make the provider no longer active in the subsystem. If a time-out is exceeded, or exceeded too often, the provider may be unregistered or flagged so that further resolutions do not include this provider. In some embodiments, in addition to collating the responses from the providers, the router may also pass through HTTP cookies (cookies may be retrieved from and set on a URLConnection class via a get/setHeader method, so this may be transport-independent, since other network-transport implementations of the URLConnection interface may be used). When passing a cookie from a provider to the client, the router may encode them as "unique_provider_id=base64encoding_of real_cookie", for example, so that it may later match cookies with provider IDs when the user does another search. In one embodiment, the Router may receive a query in XML format through a HTTP interface. When the Router receives the query, it may sends the query to the Resolver through an HTTP Interface. The Resolver may return with a list of providers that have registered interest in this query. In one embodiment, the Router does not attempt to interpret the query at all. The Router may then set up multiple threads, each thread opening a URL to post the query to each of the provider. In one embodiment, the query may be posted to each provider with a timeout value. When the provider returns with a result page (e.g. in XML), the router may parse the result page and extract the "hits" to be merged with the other hits from the other providers. The number of hits, the timeout value and the number of provider results may be specified through a "preference" interface. In one embodiment, the router may maintain a pool of TCP/IP connections to the providers and reuse them. This reuse may reduce the overhead in opening and closing connections. For example, each HTTP request to the providers may use KEEP_ALIVE so that the connections will not be closed by the provider. In one embodiment, the router system may track certain statistics so that administrators may access the router system to view current statistics about their node, such as how many queries were sent to them today, what's the average response time, how many queries failed, etc. A provider may be registered with multiple distributed information discovery routers. In such embodiments, real-time stats may be aggregated at the time of viewing by code in the provider subsystem. This code may query each router to give up-to-the-moment stats for a given provider. The resulting information is processed and displayed. In some embodiments, a distributed information discovery router may store and allow administrators to view historical data about their nodes. In an embodiment, each router system may keep a local log of its actions and export the log for download via HTTP with authentication protection. In one embodiment, logs may be periodically aggregated to a log-administrator machine with the script. Once aggregated from all the router systems, the logs may then be parsed. The result of the parsing may be a set of logs per provider. In one embodiment, each parsed set of logs may include a log file, for example with information regarding the router noted down for that provider's ID (e.g., provider-id.log). In one embodiment, each parsed set of logs may include information regarding successful routes of requests for that provider (e.g., provider-id-success.log). In one embodiment, each parsed set of logs may include information regarding failed routes of requests for that provider (e.g., provider-id-error.log). In the above example, the log file may be available for download by the administrator, so that a human administrator may run his own set of scripts on that data and maybe glean something only he wants to see from it. A log may be plotted for each provider (e.g. using gnuplot during the parsing), so that the provider-human, who doesn't know how or doesn't have the time to pipe the log to his own charting tools, may visualize the correspondence between time, number of successful routes, and number of failed routes. In one embodiment, a log file or parts of a log file may be accessible to applications or elements of the information discovery network. In the above example, a failed route may be one where the provider didn't accept the connection or took | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
