System and method for dynamic content retrieval
6735586

Abstract
Systems and methods for collecting information distributed over a computer network are described. Problems addressed by the invention include the marking of content distributed over a network, the instant display of current information distributed over a network, and the retrieval of information at a browser without an intermediary step to save the information. The invention enables customized aggregation of content distributed over a network in real-time. The invention includes a recursive scripting language. Scripts in the recursive scripting language may be used to point dynamically to web objects whose URLs have changed. Embodiments include a feature extraction object used for identifying similar information objects. Feature Extraction may use `fuzzy logic` to insure that targeted content is identified despite modifications in the source page.

Claims
What is claimed is:

Description
BACKGROUND OF THE INVENTION
CCL_COMMAND(param1(CCL_NEXT(param2(CCL_LAST(param3,void)))))
{
    save param1
    create CCL_NEXT
    CCL_NEXT(param2(CCL_LAST(param3,void)))
    {
        save param2
        create CCL_LAST
        CCL_LAST(param3,void)
        {
            Execute CCL_LAST and
            save data values in hashtable
            RETURN hashtable
        }
        Execute CCL_NEXT using
        hashtable results from CCL_LAST
        add new data and RETURN hashtable
    }
    Execute CCL_COMMAND using hashtable results
    from CCL_NEXT and RETURN hashtable
    results to creator
}
Processing does not take place until all CCL_COMMANDS have been created. The `most nested` command is processed first and returns its results to its creator. Each command is executed using the results from its `nested child` until the final result is returned to its creator.

Commands are `chained` to obtain specific content results:

LABEL(param1, param2, param3(LABELTWO(param2.1, param2.2( )

The result is a single string command which can be used in a manner similar to a URL to describe content anywhere on the web. To illustrate, consider a script encoded in the content collection language for retrieving a graphic from a financial news site:

GRAPHIC((ANCHOR(/sandp.html(LOAD(foo_financial.com/markets/)))))

This description uses three commands to capture the S&P chart from the foo_financial.com page:

LOAD reads the foo_financial.com/markets/ web page
ANCHOR captures an anchor associated with `sandp.html`
GRAPHIC reads a graphic object when passed a URL

Using a standard scripting language, the script above may be written as follows:

if (LOAD("foo_financial.com/markets/")) {
    if (ANCHOR("sandp.html")) {
        return (GRAPHIC( ));
    }
}

The content collection language executes the command that is most deeply nested first. If this is successful, the next most deeply nested command is executed, and so on until all of the commands have either returned an error message or executed successfully.

Once elementary commands are in place, they can be combined algebraically to produce additional commands in the content collection language. Each command is made up of its parameters and the NextCommand, which together produce a third CCL command as a result. Each CCL command returns a collection of objects as a result of its parameters and the NextCommand. For example, the CCL descriptor:

NEWSLIST((LOAD(foo_news.com)))

returns a `collection` or list of all anchors, separated by a delimiter, that could be identified as a news list item.
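The innermost-first evaluation described above can be sketched in Python. This is only an illustration under stated assumptions, not the patented implementation: the `Command` class, the `parse` and `execute` helpers, the three handler functions, and the `FAKE_WEB` dictionary (an offline stand-in for real pages) are all hypothetical names introduced here. The hashtable passed from each nested child to its creator is modeled as a plain dict.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical in-memory stand-in for the web, so the sketch runs offline.
FAKE_WEB = {
    "foo_financial.com/markets/": {"anchors": ["/nasdaq.html", "/sandp.html"]},
    "/sandp.html": {"graphic": "sandp_chart.gif"},
}


@dataclass
class Command:
    name: str
    param: str
    next: Optional["Command"] = None  # the nested child command, if any


def parse(text: str) -> Command:
    """Parse NAME(param(NEXT(...))) into a chain of Command objects."""
    text = text.strip()
    open_idx = text.find("(")
    if open_idx <= 0 or not text.endswith(")"):
        raise ValueError(f"malformed command: {text!r}")
    name, body = text[:open_idx], text[open_idx + 1:-1]
    nested_idx = body.find("(")
    if nested_idx == -1:
        return Command(name, body.strip())
    if not body.endswith(")"):
        raise ValueError(f"malformed nested command in: {text!r}")
    return Command(name, body[:nested_idx].strip(), parse(body[nested_idx + 1:-1]))


def load(param: str, table: Dict) -> Dict:
    table["page"] = FAKE_WEB[param]
    return table


def anchor(param: str, table: Dict) -> Dict:
    hits = [a for a in table["page"]["anchors"] if param in a]
    if not hits:
        raise LookupError(f"no anchor matching {param!r}")
    table["url"] = hits[0]
    return table


def graphic(param: str, table: Dict) -> Dict:
    url = param or table["url"]  # use the URL found by the nested child
    table["graphic"] = FAKE_WEB[url]["graphic"]
    return table


HANDLERS = {"LOAD": load, "ANCHOR": anchor, "GRAPHIC": graphic}


def execute(cmd: Command, trace: List[str]) -> Dict:
    """Run the most deeply nested command first, then pass its table outward."""
    table = execute(cmd.next, trace) if cmd.next else {}
    trace.append(cmd.name)
    return HANDLERS[cmd.name](cmd.param, table)


trace: List[str] = []
script = "GRAPHIC((ANCHOR(/sandp.html(LOAD(foo_financial.com/markets/)))))"
result = execute(parse(script), trace)
print(trace)              # ['LOAD', 'ANCHOR', 'GRAPHIC']
print(result["graphic"])  # sandp_chart.gif
```

If any handler raises, the exception propagates to the outermost caller, which is one simple way to model the shared error handling; retry and pause behavior is not modeled here.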
Set operations in CCL include:

Union function: All elements of collection A that contain parameter B.
Exclusion function: All elements of collection A that do not contain parameter B.

Operations possible in CCL include:

BEFORE: Each element of collection A that is BEFORE parameter B.
AFTER: Each element of collection A that is AFTER parameter B.
FIRST: First element of a collection A.
FIRSTNUM: First NUMBER of collection A.
LAST: Last element of a collection A.
LASTNUM: Last NUMBER of collection A.

CCL commands all share the same error handling and behave the same way during failure, pause and retry situations.

C. Feature Extraction

The invention supports protocols and techniques for identifying information objects by use of attributes of the information objects. These protocols and techniques are collectively referred to as "Feature Extraction". A feature extraction tag for an information object comprises a number of `fuzzy rules` or attributes describing the information object. For instance, a feature extraction tag for a graphic object could be "G0ABMMZA001". The first character of the tag, `G`, defines the type of net object, with the character G being reserved for graphic objects. The second character, `0`, defines this tag as a Graphics tag version `0` so that one can easily add or modify tags and maintain backward compatibility. The `ABMMZA` characters describe the capture attributes, and `001` is a numeral indicating the occurrence of the graphic object on the page. In this case G0ABMMZA001 is the first occurrence of several ABMMZA objects on the page. The attributes are ranked with the most significant attribute leftmost in the tag, with `A` being the highest value and `Z` being the lowest value for any attribute.

For example, in FIG. 1, the URL of a page is passed to the feature extraction indexer.
The page is retrieved from the web 102 and then each `container object` is analyzed or parsed one at a time 104. A container object for HTML is the TABLE tag that is used for page layout. Each TABLE tag may have many tables which, in turn, have nested tables of their own. Each container (TABLE) is separated from the target page into a new data object containing only information for that particular container. As each TABLE is parsed, objects in that table are created for each element of the TABLE such as, by way of a non-limiting example, a headline, graphic object, or button. Within each of these element tags is information that is used to produce the element's feature tag. A loop is used 106 to build all element tags within a container and another loop is used to build all container tags within a page 108.

Feature extraction attributes are constructed using an `Inside->Out` method instead of an `Outside->In` approach. For example, FIG. 2 illustrates two pages 200, 202 with several tables. The left page 200 is tagged by building a list of table attributes from the top of the page to the desired capture target. In the approach used by this invention, illustrated on the right side of the diagram 202, the table attributes are limited to this particular table, or container, and its contents. The benefits of the `Inside->Out` approach of this invention are that advertising banners or other graphics can be added to the top or the bottom of the page, and the table, with its contents, can be moved, without disrupting the identification of the desired object. As long as the contents inside the table remain structurally unchanged, the correct table for a generated tag will be collected. This allows capture tags to remain useful and accurate even when the pages are being modified by the publisher.
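The difference between the two approaches can be illustrated with a small Python sketch. It is a simplified model under stated assumptions: pages are represented as nested `Table` objects rather than parsed HTML, and the signature scheme (a tuple built only from a table's own elements and nested tables) is a hypothetical stand-in for the attribute tags of the invention. All names are introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Table:
    elements: List[str]                      # element kinds directly inside this table
    children: List["Table"] = field(default_factory=list)


def inside_out_signature(table: Table) -> Tuple:
    """Describe a table only by its own contents (assumed signature scheme)."""
    return (
        tuple(table.elements),
        tuple(inside_out_signature(child) for child in table.children),
    )


def outside_in_path(root: Table, target: Table) -> Optional[Tuple[int, ...]]:
    """Describe a table by the child indexes leading to it from the top of the page."""
    if root is target:
        return ()
    for index, child in enumerate(root.children):
        sub_path = outside_in_path(child, target)
        if sub_path is not None:
            return (index,) + sub_path
    return None


def find_by_signature(root: Table, signature: Tuple) -> Optional[Table]:
    """Depth-first search for the first table whose own contents match."""
    if inside_out_signature(root) == signature:
        return root
    for child in root.children:
        hit = find_by_signature(child, signature)
        if hit is not None:
            return hit
    return None


def find_by_path(root: Table, path: Tuple[int, ...]) -> Optional[Table]:
    """Follow a recorded top-down path; returns None if the path no longer exists."""
    node = root
    for index in path:
        if index >= len(node.children):
            return None
        node = node.children[index]
    return node


# Original page: a navigation table followed by the chart table we want.
chart_v1 = Table(["headline", "graphic"])
page_v1 = Table([], [Table(["nav"]), chart_v1])
signature = inside_out_signature(chart_v1)
path = outside_in_path(page_v1, chart_v1)

# The publisher later adds an advertising banner at the top of the page.
chart_v2 = Table(["headline", "graphic"])
page_v2 = Table([], [Table(["banner"]), Table(["nav"]), chart_v2])

print(find_by_signature(page_v2, signature) is chart_v2)  # True
print(find_by_path(page_v2, path) is chart_v2)            # False: path now hits the nav table
```

In this toy model the top-down path breaks as soon as a banner shifts the layout, while the contents-based signature still locates the same table.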
This resilience to page changes is commonly referred to as `persistence`, and the approach used by this invention to mark and collect information is more `persistent` than other approaches in the prior art.

Feature extraction objects can be used to capture discrete net objects on a page such as a headline, graphic image, or button. Tags are also generated for distinct areas on the page, which may be a single container (TABLE), or an area made up of several nested containers. Feature extraction tags can be combined to create more accurate and persistent tags for very complicated web pages. For example, a container tag can be combined with any element tag (graphic, headline, form, etc.) to produce a very accurate extraction tag even for the most crowded of pages.

In embodiments of the invention, the fuzzy logic of the attributes is used to extract an object using not only the content of the element itself (headline, graphic, button) but also the context of the element on the page. This is especially powerful on very dense pages, such as a news portal, where there may be a large number of headlines that have very similar attributes. The same technique can be used to retrieve data on the basis of context when extracting column and row data from a `spreadsheet` type of document on a page that has several `spreadsheet` displays of data. When creating the feature extraction tag, one may choose between a simple `element` tag or a `compound` tag, made up of a container tag and an element tag, depending on the page and the object being captured.

The information retrieval processes of the present invention use tags that have been generated previously to load a page of information and subsequently extract the desired information defined by the tag. Such a process is illustrated in FIG. 3. The URL of a page is passed with a `target` tag to the feature extraction indexer 300.
The page is retrieved from the web 302 and then each `container object` 304 is parsed one at a time. Each container is examined to see if `this container tag` equals the `target` tag 306. If this container matches the target 308 then the information within this container is returned to the caller 310. Otherwise, the container is examined to see if the target tag is an element within that particular container 312. If an element matches the target tag, then that element's information is returned to the caller.

If all containers on a page are examined without an `exact` match being found, this invention makes it possible to find the `nearest` object. This is done by a `de-fuzzy` search 314 from the least significant (rightmost) attribute to the most significant (leftmost) of the attributes in a tag. For example, if an exact match was not found for the tag G0ABMMZA001 the search would look for:

G0ABMMZB001, G0ABMMZC001, G0ABMMZD001, G0ABMMZE001, G0ABMMY*001, G0ABMMX*001, G0ABMMV*001, ..., G0AB***001

In effect, one searches right to left for the best fit, narrowing the search on the most significant attributes. The information retrieval module can be tuned for different solutions to provide very fine or very coarse fuzzy match of the tag to the target field. The tag notation of this invention also makes it possible to use wildcards to get all of the graphics from a page with a tag of "G0ABMMM**", and to use operators, for example to get all of the tags `greater than` G0ABMMZA.

The Feature Extraction object has a `getContainer( )` method that will return any element's container. This feature is used on the `zoom-out` so the user can select the content (graphic, headline, button) or the desired context (area) at the same time. By passing a container tag, the target tag container's container will be returned. For example, in FIG. 4, the URL of a page is passed with a `target` tag 400 to the feature extraction `getContainer( )` method.
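The tag layout and the `de-fuzzy` search described above can be sketched as follows. This is one plausible reading of the candidate sequence, not a definitive algorithm: the assumption is that, for each attribute position from right to left, the letters closest to the original value are tried first while all positions further right are wildcarded. The names `FeatureTag`, `parse_tag`, `matches`, `defuzzy_candidates`, and `find_nearest`, and the `max_positions` tuning parameter, are hypothetical.

```python
from dataclasses import dataclass
from string import ascii_uppercase
from typing import Iterator, List, Optional


@dataclass(frozen=True)
class FeatureTag:
    kind: str        # type of net object, e.g. "G" for a graphic
    version: str     # tag format version, e.g. "0"
    attributes: str  # ranked capture attributes, most significant first
    occurrence: str  # three-digit occurrence number on the page


def parse_tag(tag: str) -> FeatureTag:
    """Split a tag such as G0ABMMZA001 into its four fields."""
    if len(tag) < 6:
        raise ValueError(f"tag too short: {tag!r}")
    return FeatureTag(tag[0], tag[1], tag[2:-3], tag[-3:])


def matches(pattern: str, tag: str) -> bool:
    """Compare position by position; '*' in the pattern matches any one character."""
    return len(pattern) == len(tag) and all(
        p == "*" or p == t for p, t in zip(pattern, tag)
    )


def defuzzy_candidates(tag: str, positions: Optional[int] = None) -> Iterator[str]:
    """Yield progressively looser patterns, least significant attribute first."""
    parsed = parse_tag(tag)
    attrs = parsed.attributes
    limit = len(attrs) if positions is None else min(positions, len(attrs))
    for pos in range(len(attrs) - 1, len(attrs) - 1 - limit, -1):
        original = attrs[pos]
        # Assumption: try the letters nearest to the original value first.
        nearest_first = sorted(
            (c for c in ascii_uppercase if c != original),
            key=lambda c: (abs(ord(c) - ord(original)), c),
        )
        wildcards = "*" * (len(attrs) - 1 - pos)
        for letter in nearest_first:
            yield (parsed.kind + parsed.version + attrs[:pos]
                   + letter + wildcards + parsed.occurrence)


def find_nearest(target: str, page_tags: List[str],
                 max_positions: Optional[int] = None) -> Optional[str]:
    """Return an exact match if present, else the first de-fuzzy match, else None."""
    if target in page_tags:
        return target
    for pattern in defuzzy_candidates(target, max_positions):
        for tag in page_tags:
            if matches(pattern, tag):
                return tag
    return None


candidates = list(defuzzy_candidates("G0ABMMZA001"))
print(candidates[:4])   # ['G0ABMMZB001', 'G0ABMMZC001', 'G0ABMMZD001', 'G0ABMMZE001']
print(candidates[25])   # G0ABMMY*001
print(find_nearest("G0ABMMZA001", ["G0QQQQQQ001", "G0ABMMXQ001"]))  # G0ABMMXQ001
```

Passing a small `max_positions` keeps the match fine (only the least significant attributes may differ), while a larger value allows a coarser match, which is one way to model the tuning mentioned in the text.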
The page is retrieved from the web 402 and then each `container object` is parsed one at a time 404. Each container is examined to see if the target tag is an element or a nested container within that particular container 406. If an element or nested container matches the target tag then that container's information is returned to the caller 408. If all containers on a page are examined without a match, the `nearest` object's container is returned.

This invention may also be used to `post-process` information in order to filter out undesired content from an otherwise good information retrieval. For example, a headline capture tag that collects all of the major headlines on a popular web site may have some `navigation` or `site related` elements within that capture that are not desired when this information is added to an aggregated collection of headlines from multiple sites. Some examples of visible text that would not be desired when doing headline aggregation would be: "show more", "click here", "email to", etc. To subtract these kinds of text from a headline capture the following tag may be used:

L0TTTTTTTTTTT003HHHHHHHaaaaaaabbbbbbb

In this example, L0 is the headline tag list, TTTTTTTTTTT003 is the area from which the headlines are to be captured, HHHHHHH is the tag for the desired headlines, and aaaaaaabbbbbbb instructs the indexer to remove headline types aaaaaaa and bbbbbbb from the collection. In other words, collect headlines HHHHHHH and remove from that capture headlines with a tag of aaaaaaa or bbbbbbb.

D. System Architecture

FIG. 5 illustrates the `portal` 500 and `content` 502 servers used in embodiments of the invention. User registration, password security, and internationalization are served by the portal server 500. The top part of the diagram represents the `portal` server 500, with the `content` servers 502 in the lower half.
In embodiments of the invention, each `content` server is as simple as possible and does not require access to the database. Each content server is specialized for a specific collection function: GRAPHIC, ARTICLE, NEWSLIST, or TABLE.

FIGS. 6A-B illustrate the process of delivering information in parallel. As the user logs into his account, a page is sent to the client web browser 600. The web page 600 does not contain all of the collected content when it is first drawn but will have a reference 602 to each `webpart` to be collected in the window. These references are sent back to a cluster of content servers. Each content server is specialized to collect content quickly from each target site. In embodiments of the invention, the information is requested from each target site and then sent immediately to the user's web page without being saved in a server database, cache, or repository.

By separating the main page creation from the collection and delivery of information, this invention makes it possible to combine information from several sources residing on many different servers without delaying or halting the display of the main page. If an information source is unavailable, has a network error, or if the information source page has been redesigned, the rest of the main page will be displayed without interruption.

The separation of the main page server from content servers also makes it possible for several `branded` servers 604, 606, 608, 610 to share the same content collection servers 612. Another advantage of this specialized server architecture is the ability to serve information to another portal server that is separated by physical distance and network address. For example, the four different portal servers 604-610 illustrated in FIG. 6C can be located across the world and have different user interfaces, languages, and features.
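The parallel delivery described above, where a failing source does not hold up the rest of the page, can be sketched with Python's standard `concurrent.futures` module. This is a minimal sketch under stated assumptions: the collector functions, the target names, and the `collect_all` helper are hypothetical, and the collectors return canned strings instead of contacting real sites.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Tuple


def collect_graphic(target: str) -> str:
    """Hypothetical GRAPHIC collector; returns canned markup for the sketch."""
    return f"<img src='{target}/chart.gif'>"


def collect_newslist(target: str) -> str:
    """Hypothetical NEWSLIST collector that simulates an unreachable source."""
    raise ConnectionError(f"{target} unreachable")


def collect_all(
    webparts: Dict[str, Tuple[Callable[[str], str], str]],
    timeout: float = 5.0,
) -> Dict[str, str]:
    """Collect every webpart in parallel; a failed source yields a placeholder."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(webparts))) as pool:
        futures = {
            pool.submit(collector, target): name
            for name, (collector, target) in webparts.items()
        }
        for future in as_completed(futures, timeout=timeout):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # One broken source must not stop the other webparts.
                results[name] = f"[{name} unavailable: {exc}]"
    return results


page = collect_all({
    "chart": (collect_graphic, "foo_financial.com"),
    "news": (collect_newslist, "foo_news.com"),
})
for name in sorted(page):
    print(name, "->", page[name])
```

Nothing is written to a database or cache in this sketch; each result is handed straight back to the caller, mirroring the immediate delivery described in the text.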
The user main page 612, 614, 616, 618 may look completely different from each of the portal servers 604-610 but the information servers will collect and deliver requested information using the same rack of collection servers, which may be located anywhere and shared by all configured portal servers 500.

Adding new information to a user's page can be accomplished by using this invention's `mark and collect` process 700, as illustrated in FIG. 7. After the user logs in to the portal server, he can begin `recording` the path to a desired net object by entering the start URL at a prompt window. During marking and navigation the user's browser screen is divided into two parts: the top frame contains a `stop recording` button and the lower frame displays the first page requested. The user is now in `navigation` mode. During navigation mode, illustrated in flowchart 800 of FIG. 8, every hypertext link on the page is modified by the content server. When a hyperlink is clicked, the page request is sent back to the navigation server, providing a record of every user action. The user will continue in `navigation` mode until the page containing the desired collection object is reached. This process allows the user to `drill down` many pages deep into a web site looking for the desired target information. This process also makes it possible to save all of the `web parameters` necessary to reach the target information again without user intervention. For example, username, password, and other items such as search criteria are all monitored and saved during navigation so that the collection server can recollect the information mechanically and quickly.
When the page containing the desired information target object is viewed, the user will click on the `stop` button. The `content` server will send a page to the user prompting a selection of the desired information he wishes to collect from the page. The user is engaged in a dialog to accurately identify the item or items on the page to collect. If needed, the software prompts the user for `parameter` information such as `search criteria` or `address information` needed to replay the information request later. The content server will extract all objects of the desired page and present those objects on a `preview` page. In embodiments of the invention, each object is evaluated before it is displayed so that the `most valuable` information objects are displayed first. For example, the largest graphic object that is located in the center of the page will be displayed first on the preview page. The preview page will show the desired objects, allowing the user to choose the object to send to his user page. By clicking on the `add to page` button, the information target reference is sent back to the portal server 500 where it is saved as part of the user's page definition. The next time the page is updated, the portal server 500 will send the user's page to the browser and the browser will request the new information object as one of many objects collected in parallel in real-time by the content servers.

E. Conclusion

The foregoing description of various embodiments of the invention has been presented for purposes of illustration and description. It is not intended to limit the invention to the precise forms disclosed. Many modifications and equivalent arrangements will be apparent.
