Software application and associated methods for generating a software layer for structuring semistructured information
6851089
Abstract
A wrapper builder application provides a variety of features for facilitating the creation of wrappers that are used to extract information from Web sites. In one aspect, the wrapper builder application provides a tool with which the process of creating a wrapper, which typically resembles coding, can be accomplished by a graphical design process involving drag and drop operations, clicking on objects, and filling in forms. A web viewer component provides a web browser frame, a source code frame, and a tree view frame, enabling the user to identify semistructured information of interest on Web sites. A wrapper editor component provides a graphical design environment in which a wrapper can be graphically constructed from operations and links. A wrapper model component provides a functioning internal representation of the graphically designed wrapper using Java objects and methods. A property editor component provides for the setting of properties that define the particular functionality of individual wrapper operations. A wrapper execution component provides features that enable the wrapper to be executed and debugged using a number of debugging tools. A wrapper serialization component provides a mechanism for storing and retrieving a wrapper for subsequent use and/or modification.
Claims
What is claimed is:
Description
APPENDICES
a. Start Operation
Input Variables: Not applicable.
Output Variables:
  TEXT - set to the complete text of the start URL.
  URL - the origin of the TEXT.
  START - set to 0.
  END - set to the length of TEXT minus 1.
Properties:
  URL - the URL at which to start the wrapper.
Destination Operations:
  THEN - the one destination operation.
The Start operation functions to start the execution of the wrapper and is the first operation to be performed when running a wrapper. As illustrated in the flowchart 800 of FIG. 8A, the Start operation fetches the Web page specified in the URL property at step 801. At step 802 the Start operation stores the text of the page in TEXT and sets the other variables, URL, START, and END accordingly. At step 803, the start operation calls the THEN operation and awaits its completion. The THEN operation is the operation to which the directed link from the Start operation points on the wrapper graph canvas 704. Once the THEN operation completes, the execution of the wrapper is also complete and the Start operation returns at step 804. As the Start operation will typically not be called or linked to by another operation, the start operation need not have input variables. The only property that needs to be set by the user is the URL at which to start the wrapper. Upon creating a new wrapper, the Start operation is automatically placed on the canvas. The user need only edit the Start operation's properties and continue creating the remainder of the wrapper.
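Steps 801-804 can be sketched in Java as follows. This is only an illustrative sketch, not the Appendix A code: the names `StartSketch`, `PageFetcher`, and `Op` are hypothetical, the execution environment is assumed to be a plain map from variable names to values, and page fetching is injected as a function so that the example runs without network access.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the Start operation (steps 801-804). */
class StartSketch {
    /** Assumed abstraction over page fetching, so no network is needed. */
    interface PageFetcher { String fetch(String url); }

    /** Assumed common shape of an operation that can be called as a destination. */
    interface Op { void run(Map<String, Object> env); }

    static void start(String urlProperty, PageFetcher fetcher, Op then,
                      Map<String, Object> env) {
        String text = fetcher.fetch(urlProperty); // step 801: fetch the page
        env.put("TEXT", text);                    // step 802: bind the output variables
        env.put("URL", urlProperty);
        env.put("START", 0);
        env.put("END", text.length() - 1);
        then.run(env);                            // step 803: call THEN and await it
    }                                             // step 804: return

    public static void main(String[] args) {
        Map<String, Object> env = new HashMap<>();
        start("http://www.rentals.com/rentals.htm",
              url -> "<html>stub page</html>",    // stand-in for a real HTTP fetch
              e -> System.out.println("THEN called with END=" + e.get("END")),
              env);
    }
}
```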
b. Match Operation
Input Variables: TEXT, URL, START, and END.
Output Variables:
  TEXT - same as the input variable.
  URL - same as the input variable.
  START - the start position of a matched variable within the TEXT.
  END - the end position of the matched variable within the TEXT.
Properties:
  MATCH - the match expression specifying how to iteratively match the input TEXT of interest. The matching is only performed on the portion of the TEXT delimited by the START and END input variables, not the whole TEXT.
  EMIT - a boolean property indicating whether the present variable values should be output as a new row in the output table.
  HOW MANY - the number of matches to process, can be any number or "ALL" to indicate that the complete TEXT should be processed for matches, regardless of how many are found.
Destination Operations: One destination can be associated with each matched variable.
The Match operation is used to match structure in a Web page using regular expressions, as illustrated by the flowchart 810 of FIG. 8B. The Match operation can be configured to bind portions of a matched expression to variables in the execution environment upon matching a regular expression. In addition, the Match operation can be configured to call destination operations to further process bound variables. Finally, the Match operation can be configured to output a row of data based upon the variable bindings in the current execution environment. At step 811 of flowchart 810, the Match operation attempts to match a match expression, as specified by the MATCH property variable, to the input variable TEXT between the START and END positions. Acceptable regular expressions can contain variables to which the Match operation will bind portions of the matched text. For example, the match expression, "`<b>`BTEXT `</b>`" consists of the literal or constant "<b>" followed by the variable "BTEXT," followed by another literal "</b>." The expression will match a sequence of characters consisting of a "<b>" followed by any sequence of characters ending in a "</b>." At step 811A, if the MATCH expression is not found in the input TEXT, then the Match operation immediately returns at step 819. However, if at step 811A a match is found, control passes to step 812. At step 812, each matched variable takes on the value of the associated matched sequence of characters. In the previous example, the sequence of characters following the "<b>" and preceding the "</b>" would be bound to or stored in the variable "BTEXT," at step 812. Within HTML, a sequence, "<b>" "</b>" denotes bold text; anything between the two delimiters will be displayed in bold format. Thus, the previous expression will match a continuous sequence of bold text within the operation's input TEXT variable and store the characters of the matched bold text in the variable BTEXT. 
In this manner, the Match operation allows a user to identify structure within a Web page that delimits information of interest or may lead to information of interest. The regular expressions from which the match expression can be composed can include variables, literals, and match instructions. Variables, such as BTEXT, as used above, can be used without being first defined or created. Literals are sequences of characters enclosed in quotes. To include a quote character in a literal, it is preceded by a backslash. Match instructions allow further flexibility in defining match expressions. Match instructions can include, but need not be limited to the following:
  to(`x`)--reads TEXT up to `x`;
  backto(`x`)--reads TEXT backwards up to `x`;
  pos(N)--read to position N in TEXT;
  pos(+N)--read forwards N characters in TEXT;
  pos(-N)--read backwards N characters in TEXT;
  to(`x` `y` `z`)--read TEXT up to `x` if match exists, otherwise read TEXT up to `y` if match exists, otherwise read TEXT up to `z` if match exists;
  set(var exp)--set variable "var" to result of evaluating expression "exp";
  `x` var1 `y` var2 `z`--read up to `x`, save up to `y` into var1, save up to `z` into var2; and
  `x` var lookahead(`y`)--read up to `x`, save up to and including `y` into var.
Each match expression typically contains at least one variable that can be linked to a destination operation. At step 813, the Match operation determines whether a destination operation is associated with the current matched variable. If so, at step 814, the START and END variables are set to identify the matched variable within TEXT and the destination operation is called at step 815. In this manner, the matched variable is effectively passed on to the destination operation. Upon completion of the destination operation associated with a particular matched variable, the Match operation passes control to step 816.
If, at step 813, no destination operation is associated with the current matched variable, control passes directly to step 816. At step 816, the Match operation checks whether there is another matched variable in the current match expression. For additional matched variables, the Match operation repeats steps 813-816 as necessary. At step 817, the Match operation checks the value of the EMIT property. If EMIT is set to true, then the wrapper outputs a row of data at step 817A. The Match operation outputs a row of data by applying the associations specified in the category file to the variable bindings in the current execution environment. As discussed above, the category file defines associations between wrapper variables and columns in an output table. Using the category file, the Match operation produces SQL commands that populate a row of a relational database table in which the data of interest is output. At step 818, the Match operation determines whether it will attempt to match the match expression again. The HOW MANY property specifies how many times a Match operation will attempt to match a match expression. The HOW MANY property can take on integer values or the value "ALL" to indicate that the Match operation should attempt to perform matches until the end of the input TEXT is reached. If the Match operation has not completed the number of matches specified by the HOW MANY property, control will pass back to step 811. If HOW MANY matches have already been performed, control passes to step 819 and the Match operation returns control to the source operation that called it.
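The matching loop can be illustrated with a short Java sketch for the bold-text expression discussed above. It is written under stated assumptions: the wrapper's match expression language is approximated with `java.util.regex`, the class and method names (`MatchSketch`, `Binding`, `matchBold`) are hypothetical, and a negative `howMany` stands in for the value "ALL". Destination calls and row emission (steps 813-817A) are left out; the sketch only shows how matching is confined to the START/END range and how the positions of the matched variable are recorded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the Match loop for the expression `<b>`BTEXT`</b>`. */
class MatchSketch {
    /** Value of a matched variable plus its inclusive START/END offsets in TEXT. */
    static class Binding {
        final String value;
        final int start;
        final int end;
        Binding(String value, int start, int end) {
            this.value = value;
            this.start = start;
            this.end = end;
        }
    }

    /** howMany < 0 is used here to mean "ALL" (an assumption of this sketch). */
    static List<Binding> matchBold(String text, int start, int end, int howMany) {
        // Regex approximation of the match expression; BTEXT is a named group.
        Pattern p = Pattern.compile("<b>(?<BTEXT>.*?)</b>", Pattern.DOTALL);
        Matcher m = p.matcher(text);
        m.region(start, end + 1); // END is inclusive; the region end is exclusive
        List<Binding> bindings = new ArrayList<>();
        while ((howMany < 0 || bindings.size() < howMany) && m.find()) { // steps 811, 818
            // Step 812: bind BTEXT; step 814: record where it sits within TEXT.
            bindings.add(new Binding(m.group("BTEXT"),
                                     m.start("BTEXT"), m.end("BTEXT") - 1));
        }
        return bindings; // step 819
    }

    public static void main(String[] args) {
        String text = "x<b>Palo Alto</b>y<b>Los Altos</b>z";
        for (Binding b : matchBold(text, 0, text.length() - 1, -1)) {
            System.out.println(b.value + " [" + b.start + ", " + b.end + "]");
        }
    }
}
```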
c. Follow Links Operation
Input Variables: TEXT, URL, START, and END.
Output Variables:
  TEXT - set to the complete text of the followed URL.
  URL - the origin of the TEXT.
  START - set to 0.
  END - set to the length of TEXT minus 1.
Properties:
  SAVETITLE - the name of the variable into which the hypertext tag associated with a link will be saved.
Destination Operations:
  THEN - the destination operation to be called upon following a Web link.
As illustrated by flowchart 820 in FIG. 8C, the Follow Links operation follows each Web link in the input TEXT between START and END. At step 821, the Follow Links operation searches for Web links within the input TEXT beginning at the START position. If, at step 822, no link has been found in the input text, the Follow Links operation returns control to the calling operation at step 828. If a link is found, however, the URL of the hypertext link is stored in the URL wrapper variable at step 823. The SAVETITLE property causes the HTML tag associated with a followed link to be stored in the variable named in the SAVETITLE property at step 824. At step 825, the Follow Links operation fetches the page associated with the URL and at step 826 the variables TEXT, START, and END are set according to the listings above. At step 827, the Follow Links operation calls the THEN operation and awaits its completion. Upon completion of the THEN operation, control is passed back to step 821, to search for additional Web links. The Follow Links operation can be used in conjunction with a Match operation to identify a link of interest and then follow it. The Match operation can serve to identify the link and then the Follow Links operation will follow the link that was matched by the Match operation. Alternatively, the Follow Links operation could be applied to the complete text of a Web page, and then the Match operation could be used to determine whether links of interest have been followed by matching expressions in the resulting Web pages.
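The loop of flowchart 820 might be sketched in Java as shown below. The sketch makes several assumptions that are not part of the description: links are recognized with a simplified regular expression for `<A HREF="...">` anchors, relative links are resolved against the current URL with `java.net.URI`, the environment is a plain map, and the fetcher and THEN operation are passed in as functions. The names `FollowLinksSketch` and `followLinks` are hypothetical.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the Follow Links operation (steps 821-828). */
class FollowLinksSketch {
    // Simplified anchor pattern (an assumption); real pages need a more tolerant parser.
    private static final Pattern LINK = Pattern.compile(
        "<a\\s+href=\"([^\"]*)\"\\s*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static void followLinks(Map<String, Object> env, String saveTitle,
                            Function<String, String> fetcher,
                            Consumer<Map<String, Object>> then) {
        // Keep the input values locally; THEN is free to overwrite the environment.
        String text = (String) env.get("TEXT");
        URI base = URI.create((String) env.get("URL"));
        int start = (Integer) env.get("START");
        int end = (Integer) env.get("END");

        Matcher m = LINK.matcher(text);
        m.region(start, end + 1);                       // END is inclusive
        while (m.find()) {                              // steps 821-822
            String url = base.resolve(m.group(1)).toString();
            env.put("URL", url);                        // step 823
            if (saveTitle != null) {
                env.put(saveTitle, m.group(2));         // step 824
            }
            String page = fetcher.apply(url);           // step 825
            env.put("TEXT", page);                      // step 826
            env.put("START", 0);
            env.put("END", page.length() - 1);
            then.accept(env);                           // step 827
        }
    }                                                   // step 828

    public static void main(String[] args) {
        Map<String, Object> env = new HashMap<>();
        String page = "<A HREF=\"calif.htm\">California</A>";
        env.put("TEXT", page);
        env.put("URL", "http://www.rentals.com/rentals.htm");
        env.put("START", 0);
        env.put("END", page.length() - 1);
        followLinks(env, "STATE", url -> "stub page for " + url,
                    e -> System.out.println(e.get("STATE") + " -> " + e.get("URL")));
    }
}
```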
d. Set Fields Operation
Input Variables: TEXT, URL, START, and END.
Output Variables:
  TEXT - same as the input variable.
  URL - same as the input variable.
  START - same as the input variable.
  END - same as the input variable.
Properties:
  VARIABLES - variables to be set in the execution environment and the associated values.
Destination Operation:
  THEN - the destination operation to be called upon completing the variable assignment(s).
The Set Fields operation assigns values to wrapper variables in the execution environment, as illustrated by flowchart 830 in FIG. 8D. The properties of the Set Fields operation comprise the names of variables to be bound in the execution environment. Each variable name has an associated value to which it will be bound by the Set Fields operation. At step 831, the variables are created, if necessary, and bound to their associated values. At step 832, the THEN operation is called, and upon its return, the Set Fields operation also returns at step 833.
e. Submit Form Operation
Input Variables: TEXT, URL, START, and END.
Output Variables:
  TEXT - set to the complete text of the followed URL.
  URL - the origin of the TEXT.
  START - set to 0.
  END - set to the length of TEXT minus 1.
Properties:
  ITERATE_OVER_ATTRIBUTES - a list of attribute-value pairs that are submitted in response to an HTML form. The value elements can be lists of values.
  ITERATE_IN_SYNC - the attributes for which iteration over the value lists will occur synchronously.
Destination Operation:
  THEN - the destination operation to be called upon following the URL resulting from the form submittal.
The Submit Form operation allows a wrapper to submit HTML forms to a Web server as illustrated in flowchart 840 of FIG. 8E. At step 841, the Submit Form operation sets the URL variable to the action URL of the form. The action URL is the URL associated with the form that allows the server to process the form. At step 842, the Submit Form operation submits a form to a Web server using one combination of attributes for the form values. The Web server will respond with a new Web page at step 843. At step 844, the Submit Form operation sets the TEXT, START, and END variables as described above. At step 845, the Submit Form operation calls the THEN operation, which operates on the new Web page, and awaits its return. At step 846, the Submit Form operation determines whether there is another combination of attribute values that can be submitted in response to the form. If so, then control is passed back to step 841 and the subsequent steps repeat. If all of the possible combinations of attribute values have been exhausted, then the Submit Form operation returns control to the calling operation at step 847. The ITERATE_OVER_ATTRIBUTES property identifies the attributes of the form and the associated values for each attribute that the user would like to submit in response to the form. Each attribute can be associated with a single value or a list of values. Upon execution, the Submit Form operation will submit a form, receive the resulting web page, and call the destination operation for each possible combination of attribute values. The number of different possible combinations of attribute values is the product of the numbers of values associated with the attributes. The following example should help to clarify the concept of attributes and values:
ATTRIBUTE    VALUES
color        red, blue, green
size         4, 6, 8
In this example, there are nine possible combinations of color and size. Supposing a form had spaces for both color and size, the Submit Form operation would submit the nine different combinations and the server would respond with the nine resulting Web pages. In some instances, the user may only be interested in certain combinations of values of attributes. In this case, the user can indicate that certain attributes iterate over their possible values synchronously. The ITERATE_IN_SYNC property is used to indicate those attributes whose values the Submit Form operation should iterate through synchronously. All attributes identified in the ITERATE_IN_SYNC property should have the same number of associated values. Thus, in the above example, if color and size were listed as ITERATE_IN_SYNC attributes, there would only be three possible combinations, namely, (red, 4), (blue, 6), and (green, 8).
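The two iteration modes can be made concrete with a small Java sketch that enumerates the attribute-value combinations for the color/size example. It is a simplification under stated assumptions: either all attributes are crossed or all are iterated in sync (the mixed case is not modeled), and the names `SubmitFormCombos`, `crossProduct`, and `inSync` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of how Submit Form could enumerate attribute-value combinations. */
class SubmitFormCombos {
    /** Every combination of values: the count is the product of the list sizes. */
    static List<Map<String, String>> crossProduct(Map<String, List<String>> attrs) {
        List<Map<String, String>> combos = new ArrayList<>();
        combos.add(new LinkedHashMap<String, String>());
        for (Map.Entry<String, List<String>> attr : attrs.entrySet()) {
            List<Map<String, String>> next = new ArrayList<>();
            for (Map<String, String> partial : combos) {
                for (String value : attr.getValue()) {
                    Map<String, String> combo = new LinkedHashMap<>(partial);
                    combo.put(attr.getKey(), value);
                    next.add(combo);
                }
            }
            combos = next;
        }
        return combos;
    }

    /** Synchronous iteration: the i-th value of every attribute is submitted together. */
    static List<Map<String, String>> inSync(Map<String, List<String>> attrs) {
        int n = -1;
        for (List<String> values : attrs.values()) {
            if (n >= 0 && values.size() != n) {
                throw new IllegalArgumentException(
                    "in-sync attributes must have the same number of values");
            }
            n = values.size();
        }
        List<Map<String, String>> combos = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            Map<String, String> combo = new LinkedHashMap<>();
            for (Map.Entry<String, List<String>> attr : attrs.entrySet()) {
                combo.put(attr.getKey(), attr.getValue().get(i));
            }
            combos.add(combo);
        }
        return combos;
    }

    public static void main(String[] args) {
        Map<String, List<String>> attrs = new LinkedHashMap<>();
        attrs.put("color", Arrays.asList("red", "blue", "green"));
        attrs.put("size", Arrays.asList("4", "6", "8"));
        System.out.println(crossProduct(attrs).size() + " combinations in total");
        System.out.println(inSync(attrs));
    }
}
```

In a full operation, each returned combination would correspond to one pass through steps 841-846.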
f. If Operation
Input Variables: TEXT, URL, START, and END.
Output Variables: TEXT, URL, START, END - all the same as the input variables.
Properties:
  TEST - expression to evaluate. If the expression evaluates to true, call true operation. If the expression evaluates to false, call false operation.
Destination Operations: TRUE and FALSE operations.
As illustrated by flowchart 850 in FIG. 8F, the If operation calls one of two destination operations based upon the evaluation of an expression. The If operation takes a TEST expression as its property and is linked to TRUE and/or FALSE destination operations. At step 851, the If operation evaluates the TEST expression. Valid TEST expressions are determined by the implementation of the If operation. In the preferred embodiment, for example, the TEST expression could be set to "contains(`<TR>`)". This expression will evaluate to TRUE if the TEXT variable contains `<TR>` between START and END. Otherwise, the expression will evaluate to FALSE. At step 852, control is passed to step 853 if the TEST expression evaluates to TRUE. At step 853, the If operation calls the TRUE destination operation and awaits its return. Upon the return of the TRUE operation, the If operation returns at step 855. At step 852, control is passed to step 854 if the TEST expression evaluates to FALSE. At step 854, the FALSE destination operation is called and the If operation awaits its return. Upon the return of the FALSE operation, the If operation returns at step 855.
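A minimal Java sketch of this branching, limited to the "contains" test, is given below. The names `IfSketch`, `contains`, and `ifOp` are hypothetical, destinations are modeled as `Runnable` objects, and a missing TRUE or FALSE destination is represented by null, which is an assumption of the sketch.

```java
/** Hypothetical sketch of the If operation for a "contains" TEST (steps 851-855). */
class IfSketch {
    /** True if needle occurs in text between start and end (both inclusive). */
    static boolean contains(String text, int start, int end, String needle) {
        return text.substring(start, end + 1).contains(needle);
    }

    static void ifOp(String text, int start, int end, String needle,
                     Runnable onTrue, Runnable onFalse) {
        boolean result = contains(text, start, end, needle); // steps 851-852
        Runnable destination = result ? onTrue : onFalse;
        if (destination != null) {
            destination.run();                               // step 853 or 854
        }
    }                                                        // step 855

    public static void main(String[] args) {
        String text = "<TABLE><TR>x</TR></TABLE>";
        ifOp(text, 0, text.length() - 1, "<TR>",
             () -> System.out.println("TRUE destination called"),
             () -> System.out.println("FALSE destination called"));
    }
}
```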
g. Extract Operation
Input Variables: TEXT, URL, START, and END.
Output Variables:
  TEXT - same as the input variable.
  URL - same as the input variable.
  START - same as the input variable.
  END - same as the input variable.
Properties:
  VARIABLES - variables to be set in the execution environment using Java extraction functions.
Destination Operation:
  THEN - the destination operation to be called upon completing the variable assignment(s).
Oftentimes, variables of interest are represented in a common format, regardless of the structure of the Web site on which they are found. For example, a price will usually be represented in the format of a dollar sign followed by a series of numbers. Dates will usually be represented in one of a number of possible formats. Information that is universally represented in one or a number of common formats can be efficiently handled by Java functions. These functions can be written once and used for all wrappers. The Extract operation provides a method of extracting variable values from the input TEXT using predefined Java functions as illustrated by flowchart 860 in FIG. 8G. At step 861, the Extract operation looks up in the category file the Java function associated with a variable listed in the VARIABLES property. At step 862, the Java function is applied to the input TEXT variable between START and END. The output of the Java function will be the value to which the variable of interest is bound in the execution environment at step 863. At step 864, the Extract operation checks to see whether there is another variable to be extracted within the VARIABLES property. If so, then control returns to step 861 for the processing of the next variable. Once all of the variables have been processed, step 864 passes control to step 865 at which the THEN destination operation is called. The Extract operation awaits the return of the THEN operation upon which the Extract operation also returns at step 866. Line 18 of the category file illustrated in FIG. 6, for example, shows the association of the variable PRICE with the Java function "amazon.util.ExtractPrice." An Extract operation listing PRICE as one of the variables in the VARIABLES property would pass to the "amazon.util.ExtractPrice" function the input TEXT variable between START and END. The function, which can be easily written in Java to recognize dollar amounts within text, will return a price.
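A price-extraction function of this general kind might look like the following Java sketch. It is not the "amazon.util.ExtractPrice" function itself, whose code is not reproduced here; the class name `ExtractPriceSketch`, the method signature, the regular expression, and the choice to return the amount as a string (or null when no price is found) are all assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical price extractor; a stand-in for a function such as ExtractPrice. */
class ExtractPriceSketch {
    // A dollar sign, digits with optional thousands separators, optional cents.
    private static final Pattern PRICE =
        Pattern.compile("\\$(\\d[\\d,]*)(\\.\\d{2})?");

    /** Returns the first dollar amount in text[start..end] (inclusive), or null. */
    static String extractPrice(String text, int start, int end) {
        Matcher m = PRICE.matcher(text);
        m.region(start, end + 1);          // END is inclusive; region end is exclusive
        if (!m.find()) {
            return null;
        }
        String dollars = m.group(1).replace(",", "");
        String cents = m.group(2) != null ? m.group(2) : "";
        return dollars + cents;
    }

    public static void main(String[] args) {
        String listing = "2 BR/1 BA with sunny dining area and new carpeting. $1200/mo.";
        System.out.println(extractPrice(listing, 0, listing.length() - 1));
    }
}
```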
The returned value is then bound in the execution environment to the wrapper variable PRICE. In this manner, simple or complicated matching functionality that is frequently reused can be written once in Java, by an experienced programmer, as opposed to recreating the functionality in each wrapper using wrapper operations each time the functionality is needed.
h. Other Operations
Although the basic operations that can be used to create wrappers are described herein, the wrapper builder application can be extended to provide additional operations. As new formats for Web pages appear and as new standards for HTML and Java are implemented, the operations required to satisfactorily extract data from Web pages can increase in number. The wrapper builder application can also be extended by providing additional features and functionality to operations that have already been described.
E. Run and Debug Environment
The wrapper builder incorporates a graphical run and debug environment in which the wrapper can be examined as it is run. The run and debug environment operates in conjunction with the wrapper editor 702, so that the wrapper can be edited and debugged/run simultaneously. The individual operations 1001-1006 are highlighted as they are executed within the wrapper graph canvas 704. A wrapper can be configured to run within the run/debug environment with specified delays between the operations or by using a number of debugging buttons incorporated into a toolbar 740 at the top of the wrapper editor 702. The buttons provide for stopping or starting the wrapper execution, for adding or removing breakpoints, for continuing or stepping operation, and for adding and removing cutpoints as discussed above in the subsection titled "Wrapper Editor." As illustrated in FIGS. 9A-D, the wrapper builder provides a debug frame 900, that displays information related to the running of the wrapper.
Four tabs at the top of the debug frame 900 allow the user to switch the data displayed by the frame. The tabs include: Site Overview, HTML Source, Variable Bindings, and Listing View. FIG. 9A illustrates the debug frame 900 when the Site Overview tab is selected. The Site Overview tab causes the debug frame 900 to display a tree listing 910 of the URLs that have been accessed by the wrapper. The tree listing 910 can be of use to the user in editing the wrapper. FIG. 9B illustrates the debug frame 900 when the HTML Source tab is selected. The HTML Source tab causes the debug frame 900 to display the HTML source 920 of the current page, with any matched portions being highlighted. FIG. 9C illustrates the debug frame 900 when the Variable Bindings tab is selected. The Variable Bindings tab causes the debug frame 900 to display the variables and bindings in the current execution environment 930. The execution environment 930 was discussed above in the subsection titled "Wrapper Operations." FIG. 9D illustrates the debug frame 900 when the Listing View tab is selected. The Listing View tab causes the debug frame 900 to display the current listings 940 (table column entries). The listings 940 are defined by the category file's association of environment variable bindings to table columns in the tabular output. Each row in the listings 940 consists of a table column. The first entry, for example "PROPERTIES.STATE," refers to the STATE column of the PROPERTIES table, as defined in the category file 600 illustrated in FIG. 6. Following the identification of the table column, in parentheses, is the environment variable to which the table column is bound, for example, the "(STATE)" environment variable. Next is the value associated with each listing, for example, the value "California." In addition to displaying the current listings 940, the run/debug environment also provides a table display window 500 as illustrated in FIG. 5.
The table display window 500 displays the accumulated rows of data produced by the SQL output of the wrapper upon execution. With the various tools provided by the wrapper builder, the proper operation of the wrapper can be verified. Thus, the wrapper can be run, edited, and verified all in the same environment. The present invention also contemplates the use of a statistical analysis tool for verifying the operation of the wrapper on more extensive and complex Web sites. The statistical analysis tool can be a standard package that the user can run on a wrapper's SQL output to check for null values or missing data. Proper wrapper operation can be verified through the number of null values or missing entries in wrapper output.
III. Example Wrapper
In this section an example wrapper is presented and its application to a sample web site is demonstrated. FIG. 10 illustrates a schematic diagram 1000 of the example wrapper. The associated category file is illustrated in FIG. 6. In FIG. 7, an illustration 1001 of the same wrapper is shown as it would be displayed on the wrapper graph canvas 704. The example wrapper was designed to extract information from the hypothetical web site illustrated in FIGS. 3A-C, the corresponding HTML being shown in FIGS. 4A-C. Each operation in the schematic 1000 consists of a box containing the name of the operation (following the word "Operation"), the type of the operation (following the word "Type"), and the properties and associated values of the operation. The names adjacent to the links between operations indicate the name a source operation uses to refer to a destination operation. Operation 1001 of the wrapper is the Start operation and is named "1." The URL property of operation 1001 is bound to the hypothetical URL "http://www.rentals.com/rentals.htm." Upon execution, the start operation will fetch the HTML text of the URL; the text is shown in FIG. 4A.
It should be noted that although Web pages typically contain references to images, the actual HTML code consists solely of text. Upon fetching this text, the operation 1001 binds the URL variable to the URL, the TEXT variable to the complete text of the URL, the START variable to the value 0, and the END variable to the value 132, which is the number of characters in the page minus 1. At this point the THEN operation 1002 is called as indicated by the directed link 1011. Operation 1002 of the wrapper is a FollowLinks operation and is named "2." The SAVETITLE property of the operation is set to STATE. Thus, upon finding a hypertext link, the operation 1002 saves the HTML tag associated with the hypertext link in the variable STATE. In this case the first link encountered is "<A HREF="calif.htm">California</A>" and the associated tag is "California." Therefore, the operation 1002 binds the variable STATE to "California." Next, the FollowLinks operation follows the first link to the URL "http://www.rentals.com/calif.htm" and fetches the associated Web page illustrated in FIG. 4B. The TEXT, URL, START, and END variables are updated to identify the complete text of the fetched page. At this point the THEN operation 1003 is called as indicated by the directed link 1012. Operation 1003 functions in a similar manner to operation 1002 as it is also a FollowLinks operation. It follows the first link, "http://www.rentals.com/bayarea.htm," within the text depicted in FIG. 4B, and fetches the page depicted in FIG. 4C. The variables TEXT, URL, START, and END are set according to the new page, and the REGION variable is bound to "Bay Area." At this point the THEN operation 1004 is called as indicated by the directed link 1013. Operation 1004 is a Match operation that operates upon the complete text depicted in FIG. 4C. The operation 1004 attempts a first match and is successful. 
The PROPERTYTYPE variable is bound to "Condos" and the LISTINGS variable is bound to the following: <B><P>Palo Alto</B>2 BR/1 BA with sunny dining area and new carpeting. $1200/mo. <P><B><P>Los Altos</B>Terrific views from this end unit. Only $1500/mo if you respond to this ad before 9/15. </P> Note that the lookahead instruction indicates that the Match operation 1004 should begin its next match attempt at the beginning of the character sequence `<H2>`, following the LISTINGS text, rather than after it. The lookahead instruction ensures that the indicating sequence `<H2>` is made available to the next match. Returning to the execution of operation 1004, as no link is associated with the PROPERTYTYPE variable, there is no link to follow. The LISTINGS variable, however, is associated with a link 1014 to operation 1005. Operation 1004 thus sets the START and END variables to define the match of the LISTINGS variable within the TEXT. START is set to 64 and END is set to 262. At this point, operation 1005 is called. Operation 1005 is another Match operation, but this Match operation 1005 only operates upon a limited portion of the TEXT variable defined by START and END, which were set by operation 1004. Within the TEXT, the Match operation 1005 matches and binds CITY to "Palo Alto" and LISTING to the following text: 2 BR/1 BA with sunny dining area and new carpeting. $1200/mo. Since there is no link associated with the CITY variable, the operation 1005 sets the START and END variables to reflect the matched text of LISTING within TEXT and calls the operation 1006 associated with the LISTING variable by a link 1015. Operation 1006 is an Extract operation. The operation 1006 looks to the category file to find the Java function associated with the variable PRICE. The operation 1006 then runs the function "amazon.util.ExtractPrice" on the TEXT between the START and END characters. The result of the function is the value 1200, which is bound to the variable PRICE.
At this point the Extract operation 1006 returns and passes control back to the Match operation 1005. The Match operation 1005 checks its EMIT property and finds it set to true; therefore, the Match operation 1005 emits a row of data. The row of data is produced by applying the category file's associations to the current environment. The execution environment at this point is illustrated in FIG. 10A. The category file, as illustrated in FIG. 6, binds the STATE, REGION, CITY, PROPERTYTYPE, and PRICE columns of the PROPERTIES table to the STATE, REGION, CITY, PROPERTYTYPE, and PRICE variables in the execution environment. Based upon the variable bindings illustrated in FIG. 10A, the wrapper produces a sequence of SQL commands that generate the first row in the output table 500 of FIG. 5. Once the Match operation 1005 has produced a row of output, it checks the HOWMANY variable, which is set to ALL. The Match operation 1005 then attempts further matches. The functioning of the wrapper continues in this manner with control going back to operation 1006, returning to operation 1005, upon which another row of data is output. Control then passes back to operation 1004, which processes another match. The sequence of control then passes back down through operations 1005 and 1006 as necessary. The wrapper continues execution in this manner until each operation has completed execution and the Web site is fully processed.
IV. Wrapper Builder Implementation
In the preferred embodiment, the wrapper builder application 1100 is implemented as a Java application comprising several components as illustrated in FIG. 11. The main desktop component 1110 implements the main desktop 700 user interface and its functionality. The web viewer component 1120 implements the web viewer 200 user interface and its functionality. The wrapper model component 1150 implements a number of methods and data structures from which a wrapper is formed in the wrapper builder application.
The wrapper editor component 1130 implements the various aspects of the wrapper editor 702 user interface. The wrapper editor component 1130 also operates in conjunction with the wrapper model component 1150 to implement the functionality of the wrapper editor 702 as viewed by the user. The property editor component 1140 provides the pop up property editor dialog box 708 and its associated functionality. The wrapper execution component 1160 implements the various aspects of the run and debug environment provided by the wrapper builder. The debug frame 900 user interface and the run/debug toolbar 740 user interface and their functionality are implemented by the wrapper execution component 1160. The wrapper serialization component 1170 implements the functionality by which wrappers are stored and retrieved, called serialization. A number of these components will be discussed in greater detail in the subsections below. The components mentioned above employ user interfaces and accompanying functionality that are well known in the art. The web viewer 200, for example, can be implemented using an encapsulation of the HotJava bean component, which is well known in the art. The wrapper editor 702 can be implemented with techniques similar to those used in the numerous drawing programs available on the market. The table display window 500, text editor, tree view frame 208, HTML source frame 212, main desktop 700, menus, capture of the user input, and other aspects of the graphical user interface can be implemented in a straightforward manner using well known programming techniques. The implementation of aspects such as these is a matter of course in the writing of any extensive Java application.
A. The Wrapper Model Component
The wrapper model component 1150 enables the creation and representation of wrappers within the wrapper builder application. The wrapper model component 1150 comprises a number of Java classes, objects, and methods that implement the wrapper.
1. Class Hierarchy
The wrapper model component 1150 defines a number of classes, within the wrapper builder's Java code, from which a wrapper is constructed. The classes are organized in a hierarchy as illustrated in FIG. 12. At the highest level of the hierarchy is the WrapperElement class 1204. The WrapperElement class 1204 serves as the superclass for its two subclasses, the Link class 1208 and the Operation class 1212. The Operation class 1212 also has a number of subclasses comprising the various operation types from which a wrapper is constructed. These subclasses comprise a Start class 1216, a Match class 1220, a FollowLinks class 1224, a SetFields class 1228, a SubmitForm class 1232, an If class 1236, and an Extract class 1240. Also illustrated in FIG. 12 is the WrapperModel class 1260. An instance of the WrapperModel class 1260 serves as a handle to the wrapper itself and contains references to all of the wrapper's operations and links.
2. Object Organization
The wrapper model component 1150 creates a wrapper model from an instantiation of objects of the wrapper model class hierarchy. FIG. 13 illustrates an organization of objects that a very simple wrapper might comprise. The illustrated wrapper includes only two operations and one link. Referring to FIG. 13, the Wrapper Model object 1304 is an instantiation of the Wrapper Model class 1260 and serves to identify the wrapper and its components. The Wrapper Model object 1304 includes references to the objects and links that the wrapper model comprises. In this case, the Wrapper Model object 1304 has references, indicated by directed arrows, to two operation objects 1308 and 1316. The operation objects include a Start operation object 1308, an instantiation of the Start class 1216, and a Match operation object 1316, an instantiation of the Match class 1220. The Wrapper Model also has a reference to a Link object 1312, again indicated by a directed arrow.
Each operation object has a reference back to the Wrapper Model object 1304 so that the object can reference and call methods of the wrapper model with which it is associated. Similarly, the Link object 1312 also has a reference back to the Wrapper Model object 1304 so that it can reference and call methods of the wrapper model object 1304. The Start operation object 1308 serves as the source operation for the Link object 1312, and the Match operation object 1316 serves as the destination operation. The Link object 1312 has references to the source and destination operations as well as the associated Wrapper Model object 1304. 3. Class Methods As discussed above, a wrapper is represented by an instantiation of the WrapperModel class 1260. Java code implementing one embodiment of the WrapperModel class 1260 is included in Appendix A. The WrapperModel class 1260 provides methods for setting and storing references to the operations and links from which it is composed. The class also provides a number of methods for running and debugging the wrapper. The methods comprise: public WrapperModel( )//the constructor; public Operation getRoot( )//returns the root (start) operation; public void setRoot(Operation op)//sets the root operation; public void addBreakpoint (Operation op)//adds a breakpoint at operation "op"; public void removeBreakpoint(Operation op)//removes breakpoint from operation "op"; public void addOperation(Operation operation)//adds operation to wrapper model; public void removeOperation(Operation operation)//removes operation from wrapper model; public Vector getOperations( )//returns the operation vector consisting of all the operations of the wrapper model; public void addLink(Link link)//adds link to wrapper model; public void removeLink(Link link)//removes link from wrapper model; and public Vector getLinks( )//returns the link vector consisting of all the links of the wrapper model. 
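The object organization of FIG. 13 can be sketched in a few lines of Java. This is a minimal illustration only, not the code of Appendix A: the class bodies are simplified stand-ins, the Fig13Demo class is hypothetical, and the choice to set each element's back-reference inside addOperation and addLink is an assumption made for brevity.

```java
import java.util.Vector;

// Hypothetical, simplified stand-ins for the classes of FIG. 12.
// The actual embodiment is in the appendices and is not reproduced here.
class WrapperElement {
    private String label;
    private WrapperModel model;
    void setLabel(String label) { this.label = label; }
    String getLabel() { return label; }
    void setWrapperModel(WrapperModel model) { this.model = model; }
    WrapperModel getWrapperModel() { return model; }
}

class Operation extends WrapperElement { }
class Start extends Operation { }
class Match extends Operation { }

class Link extends WrapperElement {
    private final Operation start;
    private final Operation end;
    Link(Operation start, Operation end) { this.start = start; this.end = end; }
    Operation getStartOperation() { return start; }
    Operation getEndOperation() { return end; }
}

class WrapperModel {
    private final Vector<Operation> operations = new Vector<>();
    private final Vector<Link> links = new Vector<>();
    private Operation root;

    void setRoot(Operation op) { root = op; }
    Operation getRoot() { return root; }
    // Assumption: adding an element also sets its back-reference to this model.
    void addOperation(Operation op) { op.setWrapperModel(this); operations.add(op); }
    void addLink(Link link) { link.setWrapperModel(this); links.add(link); }
    Vector<Operation> getOperations() { return operations; }
    Vector<Link> getLinks() { return links; }
}

// Hypothetical helper that builds the two-operation, one-link wrapper of FIG. 13.
class Fig13Demo {
    static WrapperModel build() {
        WrapperModel model = new WrapperModel();
        Start start = new Start();
        Match match = new Match();
        model.addOperation(start);
        model.addOperation(match);
        model.setRoot(start);
        Link then = new Link(start, match); // Start is the source, Match the destination
        then.setLabel("THEN");              // the link name is the element's label
        model.addLink(then);
        return model;
    }
}
```

In this sketch the model holds the only forward references between operations (through its links), while every element holds a reference back to the model, mirroring the directed arrows of FIG. 13.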
The method: public Operation getOperation(Operation operation, String linkName) returns the destination operation associated with linkName. The source operation is passed in the "operation" parameter. Referring to FIG. 13, it will be noted that individual operation objects such as the Start operation object 1308, have no direct reference to their destination operations in the depicted embodiment. Thus, the source operation object, in this case the Start operation object 1308, uses its reference to the wrapper model object 1304, to call the WrapperModel method getOperation(operation, linkName). The getOperation method returns a reference to the destination Match operation object 1316. Operation objects use this method to retrieve a reference to the destination operation in order to make a call to the destination operation. A number of additional methods of the WrapperModel class 1260 provide for identification of the operations, operation vectors, links, link vectors and the manipulation of aspects of the appearance of the wrapper model in the graphical user interface. The WrapperElement class 1204 provides a basic wrapper element having functionality applicable to both links and operations. Java code implementing one embodiment of the WrapperElement class 1204 is included in Appendix B. 
The methods of the WrapperElement class 1204 comprise: public WrapperElement( )//the constructor for an element instance; public void setLabel(String label)//sets the label of the element; public String getLabel( )//returns the label of an element; public void setSelected(boolean selected)//sets whether or not the element is selected by the user for manipulation, setting of break points etc.; public boolean isSelected( )//returns whether the element is selected; public void setId(int id)//sets the id of the element; public int getId( )//returns the id of the element; public void setImageIcon(ImageIcon icon, int x, int y)//sets the image icon and the x,y position of the icon within the wrapper graph canvas; public void SetWrapperModel(WrapperModel model)//associates the wrapper element with the wrapper model to which it belongs; and public WrapperModel getWrapperModel( )//returns the wrapper model associated with the element. The WrapperElement class 1204 also may include a number of other methods and a number of private variables within which data referenced by the above methods may be stored. The Operation class 1212 serves as a superclass for all of the individual operation classes. Java code implementing one embodiment of the Operation class 1212 is included in Appendix C. The Operation class 1212 includes the methods that are common to and identically implemented for each individual operation. Each individual operation provides a "call" method, to be discussed below, by which the individual operation is called during execution of the wrapper. The Operation class 1212 also provides the following "call" method that serves as a shell to catch exceptions and to call the "call" method of the individual operation: public Environment call(Operation from, Operation op, Environment state). This method, defined at the Operation class level, is just a shell method that encapsulates a call from an operation "from" to an operation "op" passing the environment "state." 
The shell method handles the case when the current operation is a break or cut point and if so takes the appropriate action by returning control to the user. Otherwise, the shell method then calls a "call" method of the individual operation "op." The shell call method returns the same environment returned by the individual call method of the "op" operation. This shell "call" method is used primarily to handle exceptions. Other methods defined by the Operation class 1212 comprise: public Operation( )//the constructor; public Vector getLinkNamesVector( ) //returns a vector of link names for an operation; and public Operation getOperation(String linkName). The getOperation method returns the destination operation associated with a link with the label "linkName" for which the calling operation is the source operation. This method acts as a shell for and calls the getOperation method of the WrapperModel class 1260. It will be noted that link names are the Labels derived from the WrapperElement superclass of the Link class and will be addressed in the discussion of the Link class below. A number of additional methods of the Operation class 1212 provide for the manipulation of the appearance of the operation in the graphical user interface. Each of the individual operation subclasses 1216, 1220, 1224, 1228, 1232, 1236, and 1240, implement the individual "call" method of the operation. The call method is called by the shell "call" method of the Operation class 1212, described above, but takes only one parameter--the wrapper variable execution environment. For each operation, the call method has the following format: public Environment call(Environment state) The call method implements the actual functionality of each individual operation as illustrated in FIGS. 8A-G. Upon completion, the method returns the environment "state" as modified by the operation in the course of execution. 
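The relationship between the shell "call" method, the individual "call" method, and the getOperation lookup can be sketched as follows. This is an illustrative simplification rather than the code of Appendices A, C, or D: the Environment class is reduced to a map, break point handling is omitted, the page fetch is replaced by a stub string, and recording a failure in an ERROR variable is purely an assumption, since the text does not say what the shell does after catching an exception. The CallDemo class and the MATCHED variable are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the wrapper variable execution environment.
class Environment {
    final Map<String, Object> vars = new HashMap<>();
}

class Link {
    final String label;
    final Operation start;
    final Operation end;
    Link(String label, Operation start, Operation end) {
        this.label = label; this.start = start; this.end = end;
    }
}

class WrapperModel {
    private final List<Link> links = new ArrayList<>();
    void addLink(Link link) { links.add(link); }

    // Returns the destination of the link named linkName whose source is
    // "source", or null if no such link exists.
    Operation getOperation(Operation source, String linkName) {
        for (Link l : links) {
            if (l.start == source && l.label.equals(linkName)) return l.end;
        }
        return null;
    }
}

abstract class Operation {
    WrapperModel model;

    // Individual call: implemented by each operation subclass.
    abstract Environment call(Environment state);

    // Shell call: wraps the individual call of "op" and catches exceptions.
    Environment call(Operation from, Operation op, Environment state) {
        try {
            return op.call(state);
        } catch (RuntimeException e) {
            // Assumption: record the failure and hand the environment back.
            state.vars.put("ERROR", String.valueOf(e.getMessage()));
            return state;
        }
    }

    // Shell around the model's lookup; operations hold no direct reference
    // to their destination operations.
    Operation getOperation(String linkName) {
        return model.getOperation(this, linkName);
    }
}

class Start extends Operation {
    Environment call(Environment state) {
        state.vars.put("TEXT", "<html>stub page</html>"); // stands in for the page fetch
        Operation then = getOperation("THEN");
        return then == null ? state : call(this, then, state);
    }
}

class Match extends Operation {
    Environment call(Environment state) {
        state.vars.put("MATCHED", Boolean.TRUE); // placeholder for real matching
        return state;
    }
}

// Hypothetical driver wiring Start --THEN--> Match and running the wrapper.
class CallDemo {
    static Environment run() {
        WrapperModel model = new WrapperModel();
        Start start = new Start();
        Match match = new Match();
        start.model = model;
        match.model = model;
        model.addLink(new Link("THEN", start, match));
        return start.call(new Environment());
    }
}
```

Routing every operation-to-operation call through the three-parameter shell gives a single place at which exceptions, and in the described embodiment break points, can be handled without repeating that logic in each operation subclass.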
The individual operation subclasses 1216, 1220, 1224, 1228, 1232, 1236, and 1240 also implement the method public String[ ] getLinkNames( ) that returns the set of link names for all of the links for which the calling operation serves as the source operation. This method is called by the getLinkNamesVector method of the Operation class 1212, discussed above. Each operation subclass also has a constructor. Java code implementing one embodiment of the Match class 1220 is included in Appendix D. Another subclass of the WrapperElement class 1204 is the Link class 1208, which links operations. Java code implementing one embodiment of the Link class 1208 is included in Appendix E. The Link class 1208 functions to identify the source operation and the destination operation associated with a link. In this manner, a directed graph is formed from the wrapper's operations. The methods provided by the Link class 1208 comprise: public Link(Operation start, Operation end)//constructor for the Link class; public Operation getStartOperation( ) //returns the start operation; public void setStartOperation(Operation start)//sets the start operation; public void setEndOperation(Operation end)//sets the end operation; and public Operation getEndOperation( )//returns the end operation. A number of additional methods provide for identification of the associated operations and the manipulation of aspects of the appearance of the link in the graphical user interface. It will be noted that some of the methods of the above classes reference "link names." A link name is simply the label associated with a Link object as derived from its superclass, WrapperElement. The link name is set and retrieved using the WrapperElement methods, setLabel and getLabel as described above. B. Wrapper Editor Component In the preferred embodiment, the wrapper editor component 1130 provides the wrapper editor 702 user interface. 
The wrapper editor component 1130 also operates in conjunction with the wrapper model component 1150 to create an internal representation of a wrapper in the form of a wrapper model object. FIG. 13A illustrates the process by which a wrapper model is created in a preferred embodiment of the present invention. At a step 1331 the wrapper editor component 1130 makes the appropriate calls to the wrapper model component 1150 to instantiate a wrapper model object. Once the wrapper model object has been created, the wrapper editor component 1130 can await user input at step 1332. The wrapper editor component will direct the user input to the appropriate step 1333, 1335, or 1337, depending on the character of the user input. If the user input comprises a selection of a new operation, the wrapper editor component 1130 will pass control to a step 1333. At the step 1333, the wrapper editor component 1130 makes the appropriate calls to the wrapper model component 1150 to instantiate an operation object. At a next step 1334, the wrapper editor component 1130 displays a representation of the operation on the wrapper graph canvas 704. If the user input comprises a selection of a new link, the wrapper editor component 1130 will pass control to a step 1335. At the step 1335, the wrapper editor component 1130 makes the appropriate calls to the wrapper model component 1150 to instantiate a link object. The wrapper editor component 1130 also initializes the new link object such that it references the appropriate source and destination objects indicated by the user input. At a next step 1336, the wrapper editor component 1130 displays a representation of the link on the wrapper graph canvas 704. If the user input comprises the invocation of the property editor for a particular operation, the wrapper editor component 1130 will pass control to a step 1337. 
At the step 1337, the wrapper editor component 1130 calls the property editor, which displays the appropriate dialog box 708 for the corresponding operation. Once the user has entered the properties in the dialog box, control passes to step 1338. At step 1338, the wrapper editor component 1130 sets the properties of the corresponding operation object in accordance with the user input to the dialog box 708. Once one of steps 1334, 1336, or 1338 has completed, the wrapper editor component 1130 passes control back to step 1332 for the processing of additional user input. The above-described process continues until the user has completed construction of the wrapper. At this point, the wrapper is represented internally within the wrapper builder application by the wrapper model component 1150. The wrapper can then be run, debugged, saved or otherwise manipulated by the wrapper builder application. C. The Property Editor Component The property editor component 1140 provides the pop up property editor dialog box 708 (FIG. 7) to the user upon the right clicking of an operation within the wrapper graph canvas 704. The property editor component 1140 allows the characteristics of each operation to be defined. The property editor component takes as input an object of arbitrary class, and creates a panel with editable fields corresponding to each property of this object. The initial values of the object are displayed as preset fields and can be modified by a user. At any point in time a function getInstance( ) can be called to return the edited object. Only properties which are coded in the form setXXX( ) and getXXX( ) will be displayed in the property editor dialog box 708. For example, an instance of the following operation:
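public class TestOperation
{
    // Backing fields for the two editable properties.
    private String match;
    private Boolean iterate;
    public void setMatch(String match) { this.match = match; }
    public String getMatch( ) { return match; }
    public void setIterate(Boolean iterate) { this.iterate = iterate; }
    public Boolean getIterate( ) { return iterate; }
}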
will be displayed as follows in the property editor:
Match (enter string here)
Iterate [x] (checkbox)
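The rule that only setXXX( )/getXXX( ) pairs are displayed can be illustrated with the standard java.lang.reflect API. The following is a minimal sketch under stated assumptions, not the property editor's actual code: PropertyScanner, editableProperties, and SampleOperation are hypothetical names, and the sketch only discovers property names and types without building any dialog box.

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: finds the properties a property editor could display.
class PropertyScanner {
    // Returns property name -> property type for every public getXXX()
    // method that has a matching public setXXX(T) method on the given class.
    static Map<String, Class<?>> editableProperties(Class<?> type) {
        Map<String, Class<?>> props = new TreeMap<>();
        for (Method getter : type.getMethods()) {
            String name = getter.getName();
            if (!name.startsWith("get") || name.length() == 3
                    || getter.getParameterCount() != 0) {
                continue; // not a getXXX() accessor
            }
            String prop = name.substring(3);
            try {
                // Throws NoSuchMethodException when no matching setter exists,
                // which also filters out inherited methods such as getClass().
                type.getMethod("set" + prop, getter.getReturnType());
                props.put(prop, getter.getReturnType());
            } catch (NoSuchMethodException e) {
                // Getter without a setter: not an editable property.
            }
        }
        return props;
    }
}

// Hypothetical operation with two editable properties and one read-only value.
class SampleOperation {
    private String match = "";
    private Boolean iterate = Boolean.FALSE;
    public void setMatch(String match) { this.match = match; }
    public String getMatch() { return match; }
    public void setIterate(Boolean iterate) { this.iterate = iterate; }
    public Boolean getIterate() { return iterate; }
    public String getSummary() { return match + "/" + iterate; } // no setter
}
```

Calling PropertyScanner.editableProperties(SampleOperation.class) yields entries for Match (a String, suited to a text field) and Iterate (a Boolean, suited to a checkbox); getSummary is skipped because it has no setter.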
The types used in the property editor are: String, Integer, Double, Boolean. D. Wrapper Execution Component The wrapper execution component 1160 runs the wrapper by calling the "call" method of a start operation object of a Wrapper Model object 1304. Each source operation then executes the code within its respective call method, which oftentimes results in the invocation of call methods of destination operations to which a source operation is linked. Upon completion of a call, control is returned to the source object. This process continues until control is returned to the start operation at which point execution of the wrapper has completed. The wrapper execution component 1160 also provides the functionality of the wrapper builder's debug environment such as starting, stopping, and stepping through a wrapper. During the run and debug process, the wrapper execution component 1160 provides the debug frame 900. The wrapper execution component 1160 also uses the category file 600 to create the wrapper builder's SQL output during wrapper execution. E. Wrapper Serialization Component and Wrapper Storage and Retrieval An embodiment of the present invention already described provides for the creation of a wrapper using objects instantiated from a number of classes. The wrapper can be created and executed all in the environment provided by the wrapper builder. The wrapper serialization component 1170 provides for the storage and retrieval of wrappers in XML (Extensible Markup Language) through the process of Object Serialization. XML is a well-known file format widely used on the Web. It will be noted that Object Serialization is well known in the art of Java programming. An excerpt from a Sun Microsystems Java web page summarizes the concept of serialization: Object Serialization extends the core Java Input/Output classes with support for objects. 
Object Serialization supports the encoding of objects, and the objects reachable from them, into a stream of bytes; and it supports the complementary reconstruction of the object graph from the stream. Serialization is used for lightweight persistence and for communication via sockets or Remote Method Invocation (RMI). The default encoding of objects protects private and transient data, and supports the evolution of the classes. (See http://java.sun.com/products/jdk/1.1/docs/guide/serialization/). The wrapper builder application employs serialization to encode an internal object representation of a wrapper into XML format. The XML data can be saved as a wrapper file. Step 140 of flowchart 100 comprises the serialization process. Once a wrapper file has been created and stored, the wrapper file can be read by a wrapper builder application and deserialized, by known methods, to reproduce the objects that the wrapper comprises. Alternatively, once a wrapper's development and testing is complete, it can be deployed for use. In this case, a wrapper execution engine, to be discussed below, can read the serialized wrapper, reproduce the wrapper within its execution environment, and run it. F. Wrapper Builder Extensibility The wrapper builder provides a basic set of operations. In one embodiment, however, the user is free to code additional operations using the Java language. When an operation is created for the first time, it is possible to add this operation to an operation palette by selecting the `Import Operation` item from the `Operation` menu of the wrapper editor. The property editor can be configured to automatically determine the appropriate properties of the new operation class and present to the user appropriate fields in which to enter the properties. In the present embodiment, this automatic determination is accomplished through the capabilities of the Java Reflection libraries, which are available from Sun Microsystems. 
This is a known technique and there exist a number of property editors in a number of applications that use the Java Reflection Libraries. Currently, applicable property editors are provided by Java design products such as Borland's JBuilder, Symantec's Symantec Cafe, and Microsoft's Visual J++. V. Wrapper Systems The present invention contemplates a first system involving the wrapper builder application for the construction of wrappers. A second system will also be disclosed in which wrappers that have already been constructed using the wrapper builder application can be executed to perform the useful function of automatically retrieving and structuring Web site data. A. Wrapper Builder System FIG. 14 illustrates one embodiment of a system 1400 comprising the wrapper builder application 1100. The application 1100 is executed on a host computer 1404 and is connected to a communications port 1424 that provides access to the Internet or an intranet 1428 using the HTTP protocol over TCP/IP. The application 1100 accesses a Web site 1432, which is hosted by web servers 1436. The application 1100 can write a wrapper to or read a wrapper from a wrapper file 1412. The application 1100 can also write to or read from a category file 1416. The application 1100 links in or has compiled in the operation classes 1420 from which a wrapper's operations are instantiated. One embodiment of the present invention comprises a wrapper builder application 1100 coded in the Java programming language. The application 1100 can be run on a computer with a Java interpreter, the computer and Java interpreter being referred to as a virtual machine. B. Wrapper Execution System FIG. 15 illustrates a system 1500 in which a wrapper can be used once it is constructed. The system 1500 comprises a host computer 1504 that has access to the Web site of interest 1432. A wrapper execution engine 1508, running on the host computer 1504 executes the wrapper instead of the wrapper builder application 1100. 
The wrapper execution engine 1508 receives input comprising the wrapper file 1412, the category file 1416, and the operation classes 1420 from which the wrapper has been constructed. The wrapper execution engine 1508 interfaces with a Java database connectivity (JDBC) driver 1540. The JDBC driver serves as an interface to a querying application 1544. The querying application 1544 executes on an application computer 1505 in communication with the host computer 1504. Although only one querying application 1544 and one application computer 1505 are shown, any number of querying applications and application computers could communicate with the host computer 1504. The querying application 1544 is preferably an application that is capable of making JDBC method calls. JDBC is a well-known application program interface (API) for accessing relational database systems. The JDBC driver interface 1540 to the wrapper execution engine 1508 makes the engine 1508 accessible in the same manner that relational databases are typically accessed by Java applications. The querying application 1544 sends an SQL query 1562 to the JDBC driver interface 1540. The JDBC driver 1540 returns a result set object 1566 containing the requested data retrieved from the Web site of interest 1432. The result set object 1566 is an object that provides methods by which its data can be accessed. Such objects are well known in the art and will not be described in detail herein. The JDBC driver 1540 acts as a driver for the wrapper execution engine 1508. The driver 1540 calls the engine 1508 with the URL 1550 of the web site of interest 1432. At this point the engine 1508 loads the appropriate wrapper file 1412 and category file 1416. The wrapper execution engine 1508 can consist of a stripped down version of the wrapper builder application without the wrapper editing capabilities or graphical user interface. 
In place of the graphical user interface of the wrapper builder, the wrapper execution engine 1508 can have an appropriate interface to the JDBC driver 1540. Once the wrapper execution engine loads the wrapper file 1412 and category file 1416, it runs the wrapper (not illustrated), accessing the web site of interest 1432 through the communications port 1424. The wrapper produces relational database rows 1554 that are passed back to the JDBC driver 1540 through a queue 1558. The queue 1558 buffers the database rows 1554 to compensate for any difference in processing rates between the wrapper and the driver 1540. Although this invention has been described in terms of certain preferred embodiments and applications, other embodiments and applications that are apparent to those of ordinary skill in the art, including embodiments which do not provide all of the features and advantages set forth herein, are also within the scope of this invention. Accordingly, the scope of the present invention is intended to be defined only by reference to the appended claims.
