Systems and methods for identifying and extracting data from HTML pages

6920609

Abstract

Systems and methods for analyzing HTML formatted web pages to automatically identify and extract desired information. A computer algorithm identifies and extracts different pieces of information from different web pages automatically after minimal manual setup. The algorithm automatically analyzes pages with different content if they have the same, or similar, formats.

Claims

1. A computer implemented method of identifying desired content in HTML formatted web pages, comprising the steps of:

Description

COPYRIGHT NOTICE

This HTML code puts the line as a row in a table, adds an image, italicizes "available", and highlights "ON SALE" in red. A typical commerce page may have hundreds of formatting tags. A different product Q may appear as: "Product Q in Oregon and Washington for $15.99" and be formatted as: If one is interested in extracting only the price of the product, a typical rule-based extraction mechanism, using the first document for product P, may infer that the price appears after the ON SALE text, or after the red formatted text. However, this same extraction mechanism, when analyzing the second document for product Q, will miss the price of product Q, because neither the ON SALE text nor the red formatting is present. In general, the page may be much more complex and variable. Accordingly, it is desirable to provide methods and systems for analyzing the structure of web pages and for automatically extracting pertinent information from the web pages.

SUMMARY OF THE INVENTION

The present invention provides systems and methods for analyzing web pages formatted using HTML or other markup language to automatically identify and extract desired information. In one embodiment, aspects of the invention are embodied in a computer algorithm that identifies and extracts different pieces of information from different web pages automatically after minimal manual setup.
The algorithm automatically analyzes pages with different content if they have the same, or similar, formats. The algorithm is robust, in the sense that it operates successfully and correctly in the presence of small changes to the formatting of documents. The algorithm is fast and efficient and performs the extraction process quickly in real-time. Many database and data mining applications require structured data: they have to know the meanings of numbers and text, and not just their values, so they can infer relationships among them. Using the techniques of the present invention, it becomes possible to build databases from unstructured web information. The algorithm can be implemented in an agent that captures information about products, and compares prices or other characteristics. The algorithm can also be used to populate structured databases that, given the different pieces of information, can analyze products and their characteristics. Additionally, the algorithm can be used for data mining applications, e.g., looking for patterns useful for marketing analyses, for testing and quality assurance (QA) purposes, or other uses. According to an aspect of the invention, a method is provided for identifying and extracting content from HTML formatted web pages. The method typically comprises the steps of selecting a model page, wherein the model page includes a plurality of HTML tags, identifying an area of interest in the model page, and parsing the model page to determine a first string of symbols associated with the plurality of HTML tags, wherein the first area of interest is identified by a first portion of the first string of symbols. 
The method also typically includes the steps of retrieving a second web page, parsing the second web page to determine a second string of symbols associated with the HTML tags of the second web page, comparing the first and second strings to determine whether the second string includes a second portion similar to the first portion of the first string, wherein the second portion corresponds to a second area of interest in the second page, and thereafter extracting the second area of interest from the second page. In preferred aspects the steps of selecting the model page and identifying a first area of interest are performed manually, and the remaining steps are performed automatically. According to another aspect of the present invention, a computer readable medium is provided containing instructions for controlling a computer system to automatically identify and extract desired content from a retrieved HTML formatted web page. The medium includes instructions to control the computer system to automatically parse the HTML code of a manually selected model web page to determine a first string of symbols associated with a first plurality of HTML tags. The medium also typically includes instructions to control the computer system to automatically retrieve a second web page, parse the HTML code of the second web page to determine a second string of symbols associated with HTML tags of the second page, compare the first and second strings to determine whether the second page includes a second plurality of HTML tags substantially matching the first plurality of HTML tags, and extract a portion of the second page corresponding to the second plurality of HTML tags. According to yet another aspect of the present invention, a computer system is provided for identifying and extracting content from HTML formatted web pages. 
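The comparing step recited above can be illustrated with a minimal sketch. It assumes each HTML tag has already been reduced to a one-character symbol, so that a page becomes an ordinary string. The function name `find_similar_portion`, the 0.8 threshold, and the use of the similarity ratio from Python's `difflib` are illustrative assumptions only; the text above does not specify how similarity is measured.

```python
from difflib import SequenceMatcher


def find_similar_portion(model_portion, second_string, threshold=0.8):
    """Find the window of second_string most similar to model_portion.

    Returns (start, end, score), or None if no window reaches the
    threshold. The threshold value is an arbitrary illustrative choice.
    """
    n = len(model_portion)
    best = None
    # Slide a window of the model portion's length across the second string.
    for start in range(max(1, len(second_string) - n + 1)):
        window = second_string[start:start + n]
        score = SequenceMatcher(None, model_portion, window,
                                autojunk=False).ratio()
        # Keep the first window with the highest score.
        if best is None or score > best[2]:
            best = (start, start + n, score)
    if best is not None and best[2] >= threshold:
        return best
    return None


# Hypothetical symbol strings: upper case = opening tag, lower case = closing tag.
model_portion = "DBbFfd"           # portion marking the area of interest in the model page
second_string = "TRDIidDBbFfdrt"   # symbol string of a second, similarly formatted page
match = find_similar_portion(model_portion, second_string)
```

A fixed-length window keeps the sketch short; a portion that gains or loses several tags on the second page would call for a more tolerant alignment.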
The system typically comprises a means for retrieving web pages including HTML tags, wherein a model web page is retrieved, and a means for manually identifying a first area of interest in the model page, wherein the first area of interest corresponds to a first plurality of HTML tags. The system also typically comprises a processor including a means for parsing a page, wherein the parsing means parses the model page to determine a first string of symbols associated with the first plurality of HTML tags, and wherein the parsing means thereafter parses an automatically retrieved second web page to determine a second string of symbols associated with the HTML tags of the second web page. The processor also typically includes a means for comparing the first and second strings to determine whether the second string includes a second portion similar to the first portion of the first string, wherein the second portion corresponds to a second area of interest in the second page, and a means for extracting the second area of interest from the second page. According to a further aspect of the invention, a computer implemented method of identifying and extracting content from web pages formatted using a markup language is provided. The method typically includes the steps of selecting a model page, wherein the model page includes a plurality of tokens, identifying a first area of interest in the model page, and parsing the model page to determine a first string of symbols associated with the plurality of tokens, wherein the first area of interest is identified by a first portion of the first string of symbols. 
The method also typically includes the steps of retrieving a second web page, parsing the second web page to determine a second string of symbols associated with the tokens of the second web page, comparing the first and second strings to determine whether the second string includes a second portion similar to the first portion of the first string, wherein the second portion corresponds to a second area of interest in the second page, and thereafter extracting the second area of interest from the second page. The present invention is applicable to any markup language, including any instance of SGML, such as XML, WML, HTML, DHTML and HDML. Reference to the remaining portions of the specification, including the drawings and claims, will realize other features and advantages of the present invention. Further features and advantages of the present invention, as well as the structure and operation of various embodiments of the present invention, are described in detail below with respect to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a general overview of an information retrieval and communication system according to an embodiment of the present invention; and

FIG. 2 is a flow chart showing the process of identifying and extracting information from web pages according to an embodiment of the present invention.

DESCRIPTION OF THE SPECIFIC EMBODIMENTS

FIG. 1 illustrates a general overview of an information retrieval and communication network 10 including a client device 20 according to an embodiment of the present invention. In computer network 10, client device 20 is coupled through the Internet 40, or other communication network, to servers 501 to 50N. Client device 20 is also interconnected to server 30 either directly, over any LAN or WAN connection, or over the Internet 40. 
As will be described herein, client device 20 is configured according to the present invention to access and retrieve web pages from any of servers 501 to 50N, identify and extract desired information therefrom, and provide the information to server 30 to populate database 35. Although as described herein, access and processing of web pages is performed using client device 20, it will be understood that server 30 can also be configured to access and process web pages according to the present invention described herein. Several elements in the system shown in FIG. 1 are conventional, well-known elements that need not be explained in detail here. For example, client device 20 (and server 30) could be a desktop personal computer, workstation, laptop, PDA, cell phone, or any WAP-enabled device or any other computing device capable of interfacing directly or indirectly to the Internet. Client device 20 typically runs a browsing program, such as Microsoft's Internet Explorer, Netscape Navigator or the like, allowing a user of client 20 to access and browse pages available to it from servers 501 to 50N over Internet 40. Client 20 (and server 30) also typically includes one or more user interface devices 22, such as a keyboard, a mouse, touchscreen, pen or the like, for interacting with a graphical user interface (GUI) provided by the browser in conjunction with pages and forms retrieved from servers 501 to 50N or other servers. The present invention is suitable for use with the Internet, which refers to a specific global Internetwork of networks. However, it should be understood that other networks can be used instead of the Internet, such as an intranet, an extranet, a virtual private network (VPN), a non-TCP/IP based network, or the like. 
According to one embodiment, client device 20 or server 30, and all of its components are operator configurable using an application including computer code run using a central processing unit such as an Intel Pentium processor or the like. Computer code for operating and configuring client device 20 or server 30 as described herein is preferably stored on a hard disk, but the entire program code, or portions thereof, may also be stored in any other volatile or non-volatile memory medium or device as is well known, such as a ROM or RAM, or provided on any media capable of storing program code, such as a compact disk medium, DVD, a floppy disk, or the like. Additionally, the entire program code, or portions thereof may be downloaded from a software source to client device 20 or server 30 over the Internet as is well known, or transmitted over any other conventional network connection as is well known, e.g., extranet, VPN, LAN, etc., using any communication medium and protocol as are well known. Appendix A includes an example of code for implementing the techniques of the present invention. It will also be appreciated that computer code for implementing the present invention can be implemented in JavaScript, or any scripting language such as VBScript, that can be executed on a client device or server system. Although it is understood that server 30, or any other server, can be configured using the code as above, the following will discuss the present invention implemented in the context of client device 20. In general, a user is able to access and query servers 501 to 50N and other servers through client device 20 to view and download content such as news stories, advertising content, search query results including links to various websites and so on. Such content can also include other media objects such as video and audio clips, URL links, graphic and text objects such as icons and hyperlinks, and the like. 
As described herein, the techniques of the present invention are particularly useful for identifying and extracting information related to products from remote vendor servers. Such information can be used, for example, to populate database 35 with comparative information for access by subscribers or the general public, e.g., over the Internet. For example, the extracted information can be used to populate database 35 with comparative pricing information for a particular product or service or related products or services. One example of such an accessible server/database for which the invention is useful is the Yahoo! Shopping website. It will of course be apparent that the present invention is useful for identifying and extracting any desired information in web pages retrieved from any website for use in any data mining application or other application. FIG. 2 is a flow chart showing the process of identifying and extracting information from web pages according to an embodiment of the present invention. In the following description, it is assumed that the web pages are formatted using HTML, although the present invention is equally applicable to processing web pages formatted using any markup language including any instance of the Standard Generalized Markup Language (SGML), such as XML, WML, HDML (for hand-held devices), DHTML and others. According to one embodiment, at step 100 an operator using client device 20 (or server 30) first selects a target page that is deemed a model page for a particular product type, company format, or any other type of document. For example, the operator accesses a particular product page for product P from one of servers 501 to 50N, which corresponds to a particular remote vendor's website. At step 110, the HTML code for the selected page is parsed to determine a model pattern for the page. In one embodiment, a model pattern based on the selected page is built by first dividing the web page into HTML tokens. 
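As a minimal sketch of dividing a page into HTML tokens, the following uses Python's standard `html.parser` module. The class name `TokenCollector`, the `(kind, value)` token representation, and the sample HTML fragment are hypothetical and are not taken from Appendix A.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Collects tag tokens and text tokens in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are dropped here; only the tag name is kept.
        self.tokens.append(("tag", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("tag", "/" + tag))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            self.tokens.append(("text", text))


def tokenize(html):
    """Return the list of (kind, value) tokens for an HTML fragment."""
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens


# Hypothetical fragment, not the product page discussed in the text.
sample = "<tr><td><b>Price</b> $9.99</td></tr>"
tokens = tokenize(sample)
tag_only = [value for kind, value in tokens if kind == "tag"]
```

Filtering the list down to tag tokens, as in `tag_only`, gives the sequence from which a string of symbols could then be derived.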
In general, HTML tokens include tag elements and text elements. In one embodiment, the text is preferably initially ignored, and the tags that are primarily used for formatting purposes, e.g.,
