Method facilitating data stream parsing for use with electronic commerce
6850950
Abstract
A parsing technique suitable for use in electronic commerce that avoids the disadvantages of known monolithic parsers. The parsing system generates code specific to each input document and data stream type, which may then be updated as needed to handle document-specific idiosyncrasies without requiring modifications to the core parser code. As a user (10) defines parsing rules for extracting data from a representative sample document (15), a visual trainer (16) automatically generates code in the background, referred to as a filer (17), that is specific to that document and that embodies the rules for extracting data from that particular document. The generated code may be modified manually (18) as needed to account for any idiosyncratic conditions associated with the document. Each representative document has associated with it its own filer. A parsing engine (20, 21) comprises a collection of such individual filers appropriate for the types of documents that arise in any given organization. A mapping or other association is maintained between representative document types and their filers. In regular operation, a user selects a data set to be extracted from certain documents. When the parsing engine receives a document in an input data stream, the associated filer is loaded and parses that document for the selected data set. Then another filer is loaded in response to another input document, and so on. The filers may be especially efficiently generated using an object-oriented approach and then dynamically instantiated at run time as may be conveniently achieved, for example, in the Java programming language.
Claims
What is claimed is:
Description
BACKGROUND OF THE INVENTION
TABLE I
Formatting
Command     Description                       Parameters
COM:        Comment                           The comment text that follows the
                                              formatting command. This text does
                                              not appear on the printed page.
PTX:        Presentation text item            The text to be printed immediately
                                              follows the formatting command.
                                              A POS: formatting command
                                              A FON: formatting command
POS:        Position on the printed page      The line number, followed by a
            to add the preceding              comma, followed by a column
            presentation text
FON:        The font to display the           The font name
            preceding presentation text in
PGB:        Page break                        N/A
The APF file in this example may be understood from the preceding table. The first page of the output when the file is printed is shown in FIG. 3.
A complicating point illustrated by this example is that the information shown on a printed page is not necessarily located in one part of the physical print file. In this example, the data in the print file is separated by font. In fact, in many real-world instances print file composition engines sort print files by font, which serves to optimize printing speed. The following listing shows the sorted version of the print file in this example, which would produce identical output to the non-sorted version.
COM:Sample Alysis Print Format (APF) file sorted for optimization
PTX:Acme Bill POS:1,40 FON:TimesRoman
PTX:Amount Due POS:5,20 FON: TimesRoman
PTX:. POS:4,37 FON:TimesRoman
PTX:To Jim Flynn POS:3,20 FON:Arial
PTX:$ POS:4,35 FON:Arial
PTX:200 POS:4,36 FON:Courier
PTX:00 POS:4,38 FON:Courier
PGB:
PTX:Acme Bill POS:1,40 FON:TimesRoman
PTX:Amount Due POS:5,20 FON: TimesRoman
PTX:. POS:4,37 FON:TimesRoman
PTX:To Bill Clarke POS:3,20 FON:Arial
PTX:$ POS:4,35 FON:Arial
PTX:300 POS:4,36 FON:Courier
PTX:00 POS:4,38 FON:Courier
As illustrated by the preceding print file, the text for a dollar amount might utilize three different fonts (i.e. one for the currency symbol, one for the numbers and another for the decimal place) which would be physically located in three different sections of the file. The sorted version of the print file illustrates that, even in this simplified illustrative example, parsing data out of these types of formats is not straightforward.
Building a Filer
As indicated above, "filer" is the general term used for the document-specific parsing code for extracting data out of an input data stream according to the invention. More generally, in the language of object oriented systems it may also refer to a filer class for defining such parsing code.
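As a concrete illustration of the sorted print file discussed above, the following minimal sketch reassembles the text of one printed line from items that are scattered through the file. It is not part of the disclosed embodiment: the class name ApfLineAssembler and its methods are hypothetical, and it assumes that each presentation text item occupies a single line of the form "PTX:text POS:line,column FON:font". For simplicity it concatenates the items of a line in column order rather than placing them on a character grid.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not one of the patent's classes.
class ApfLineAssembler {

    // One presentation text item: the PTX text plus its POS line and column.
    static final class Item {
        final String text;
        final int line;
        final int column;

        Item(String text, int line, int column) {
            this.text = text;
            this.line = line;
            this.column = column;
        }
    }

    // Assumed item layout: "PTX:<text> POS:<line>,<column> FON:<font>".
    private static final Pattern ITEM =
        Pattern.compile("^PTX:(.*) POS:(\\d+),(\\d+) FON:\\s*(\\S+)$");

    // Collects every PTX item; COM: and PGB: lines simply do not match.
    static List<Item> parse(List<String> lines) {
        List<Item> items = new ArrayList<>();
        for (String raw : lines) {
            Matcher m = ITEM.matcher(raw.trim());
            if (m.matches()) {
                items.add(new Item(m.group(1),
                                   Integer.parseInt(m.group(2)),
                                   Integer.parseInt(m.group(3))));
            }
        }
        return items;
    }

    // Joins the items positioned on one printed line, ordered by column,
    // regardless of where they appeared in the (font-sorted) file.
    static String textOnLine(List<Item> items, int line) {
        StringBuilder sb = new StringBuilder();
        items.stream()
             .filter(i -> i.line == line)
             .sorted(Comparator.comparingInt((Item i) -> i.column))
             .forEach(i -> sb.append(i.text));
        return sb.toString();
    }
}
```

Applied to the first page of the sorted listing, textOnLine(items, 4) joins the "$", "200", "." and "00" items, which sit in three different font sections of the file, into "$200.00".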
All of the rules for parsing a data stream are expressed directly in a filer's source code. The Java programming language is well suited for developing filers because of its support for dynamic instantiation. This makes it possible to utilize the filer program in another application without re-compiling the e-business application. Nevertheless, those skilled in the software arts will recognize that the same effective functionality may be achieved in other languages such as C++ by using the techniques appropriate for those languages, for example, dynamic link libraries (DLLs) in C++.
A simplified embodiment of a visual trainer tool is now described for generating filer code which illustrates the principles of the invention. Given the benefit of the examples and teachings provided herein, those skilled in the art will be able to develop more complicated visual trainer tools for various legacy formats. The visual trainer tool used in this example can view "Acme Presentation Format" (APF) files and generate filer code for APF files.
To create a filer by using the APF visual trainer, the user first opens a sample APF file. The APF visual trainer displays the APF file as it would appear if printed. This is sometimes referred to as a WYSIWYP display (What You See Is What You Print). The user begins to visually define the data desired to be extracted. The trainer provides a Select Text mode, which may be activated from a pull-down menu. The desired text fields may be highlighted with a mouse, for example, and the trainer will embed an appropriate rule in the filer code for extracting the selected text fields. Other data fields may also be directly or indirectly defined, for example, through conditional logic. An example of such conditional logic is given below where for purposes of illustration it is inserted manually into filer code, although it may be incorporated into the automatic code-generating capability as well.
The areas of a document from which data may be eligible to be extracted are sometimes referred to as extraction data fields. An extraction data field may alternatively be defined by its location relative to any characteristic element of the document, serving as an anchor, such as a dollar sign or a particular heading. The implementation of methods for directly and indirectly defining extraction data fields in a document is known in the art, as it is practiced also in connection with known monolithic parsers, and thus need not be described in any detail here. Before the APF visual trainer can generate a filer, all of the possible extraction data fields should be directly or indirectly defined. The next step is to generate the filer source code. This may be accomplished by activating the APF visual trainer "Publish Filer" mode, typically from a pull-down menu. The Publish Filer mode generates the source code specific to the particular sample document used to define the extraction data fields. The source code embodies the rules for extracting data from the extraction data fields. The visual trainer converts the filer source code to executable filer code and places the executable filer code in a filer repository 20 (see FIG. 5) for later use in parsing documents like the sample document arising in real-time APF input data streams. At the same time the system maintains an association between the filer just generated and the sample document and input data stream type (here, APF) so that when this particular document arises in this particular input data stream type, the system will be able to load the relevant filer code. The association of filers with their documents and data stream types may be maintained for example through a simple mapping or database structure, the provision of which is entirely routine and need not be disclosed in any detail here. The filer published by the APF Visual trainer is composed of Java source code that can parse the input data stream. 
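The association between filers and their documents and data stream types is described only as "a simple mapping or database structure". One way such a mapping might look is sketched below. The class FilerRegistry, its method names, and the "streamType/documentType" key format are assumptions made for illustration and do not come from the disclosure.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory registry; a database table could serve the same role.
class FilerRegistry {

    // Maps "<streamType>/<documentType>" to the name of the filer class.
    private final Map<String, String> filerClassByKey = new HashMap<>();

    private static String key(String streamType, String documentType) {
        return streamType + "/" + documentType;
    }

    // Records the association when the visual trainer publishes a filer.
    void register(String streamType, String documentType, String filerClassName) {
        filerClassByKey.put(key(streamType, documentType), filerClassName);
    }

    // Returns the filer class name, or null if no filer has been published
    // for this combination of data stream type and document type.
    String lookup(String streamType, String documentType) {
        return filerClassByKey.get(key(streamType, documentType));
    }
}
```

For example, after register("apf", "acme-bill", "SampleAPFFileFiler_V1"), a call to lookup("apf", "acme-bill") returns the class name that the parsing engine would then load.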
The next section describes typical code generation by the APF visual trainer along with the supporting classes for the present illustrative example.
Filer Classes
A particularly convenient and efficient way to implement filers is within an object-oriented framework, such as the Java programming language, which will be used here for purposes of illustration. FIG. 4 illustrates a filer class hierarchy. As is evident from FIG. 4, the base object of all filers is the "Filer" class. In the present example, a filer is more generally provided by a Java interface. An interface is a known concept in the Java language; it defines all of the methods that must be implemented by any class that "implements" the interface. The following listing shows the code for the Filer interface.
public interface Filer
{
    void setInputName(String inputFileName);
    void parseInput() throws Exception;
}
As can be seen from the Filer interface, any filer class simply needs to provide method implementations for setInputName( ) and parseInput( ). Because all filers must implement these methods, a separate application could dynamically instantiate a special-purpose filer and use it to parse an input data stream later on. An XML conversion program, for example, can use the SampleAPFFileFiler_V1 to convert APF documents to XML documents, which can then be used more easily in electronic commerce applications. FIG. 5 illustrates this conversion process.
The first step occurs when a Data Loader application 21 detects that an APF file is ready to be converted to XML. This could occur, for example, when the APF file is copied into a predefined directory 22, which may be periodically polled by Data Loader 21. To convert the APF file, the Data Loader dynamically instantiates the SampleAPFFileFiler_V1 class and uses it to convert the APF file to XML by calling the object's setInputName( ) and parseInput( ) methods. Since the SampleAPFFileFiler_V1 is a "Filer" object (i.e. it implements the Filer interface), the Data Loader can always call the Filer methods, regardless of the filer type, without having to know anything about the implementations of these methods.
The immediate base class of the SampleAPFFileFiler_V1 is the APFFilerBase class. This class is shown in the following listing.
public class APFFilerBase implements Filer
{
    public APFFilerBase()
    {
    }
    public void setInputName(String name)
    {
        inputFileName = name;
    }
    public void parseInput() throws Exception
    {
    }
    String inputFileName = "";
}
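The dynamic instantiation step performed by the Data Loader, described before this listing, might be sketched as follows. This is an illustration under stated assumptions rather than the disclosed implementation: DataLoaderSketch and DemoFiler are hypothetical names, DemoFiler is a stand-in that merely records the file it was asked to parse, and the Filer interface is repeated only so that the sketch is self-contained. The reflection calls used (Class.forName and getDeclaredConstructor().newInstance()) are standard Java.

```java
// Repeats the Filer interface from the listing above so the sketch compiles alone.
interface Filer {
    void setInputName(String inputFileName);
    void parseInput() throws Exception;
}

// Hypothetical stand-in for a generated filer; it only records that it ran.
class DemoFiler implements Filer {
    static String lastParsed = null;
    private String inputFileName = "";

    public DemoFiler() {
    }

    public void setInputName(String name) {
        inputFileName = name;
    }

    public void parseInput() {
        lastParsed = inputFileName;
    }
}

// Hypothetical loader: it knows the filer only by its class name.
class DataLoaderSketch {

    // Loads the named filer class at run time and runs it on one input file.
    static void convert(String filerClassName, String inputFileName) throws Exception {
        Class<?> cls = Class.forName(filerClassName);
        // The cast succeeds for any class that implements Filer.
        Filer filer = (Filer) cls.getDeclaredConstructor().newInstance();
        filer.setInputName(inputFileName);
        filer.parseInput();
    }
}
```

Because convert( ) refers only to the Filer interface, a newly published filer can be used by supplying its class name, without recompiling the loader.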
As can be seen from the APFFilerBase class, its only action is to set a class attribute called inputFileName in the setInputName( ) method. This is so here because in the present illustrative implementation all APF data will be input simply as disk-based files. No limitation is intended, however, to disk-based files, and those skilled in the art will readily be able to configure implementations for other inputs.
Now that the basic filer class hierarchy in our example has been described, we consider the code for a program and file called SampleAPFFileFiler_V1.java, illustrative of code which may be generated by the APF visual trainer. This code is shown in the following listing.
import java.io.*;
public class SampleAPFFileFiler_V1 extends APFFilerBase
{
/*
The APF visual trainer generated this class automatically.
It is strongly recommended that you do not delete any of
the comments in this document.
Parsing rules are preceded by a "RULE--" comment. You can
add your own customized rules in this section. Other
generated lines of code are preceded by "AUTO--" comments.
XML rules are also included in this class in order to help
parse the code for subsequent invocations of the APF
visual trainer.
*/
public SampleAPFFileFiler_V1()
{
super();
}
public void parseInput() throws Exception
{
/*AUTO--*/ LineNumberReader reader =
/*AUTO--*/     new LineNumberReader( new FileReader(
/*AUTO--*/     inputFileName ) );
/*AUTO--*/ String line = reader.readLine();
/*AUTO--*/ while ( line != null )
/*AUTO--*/ {
/*AUTO--*/     if ( line.startsWith( "PTX:Acme Bill POS:1,40" ) )
/*AUTO--*/     {
/*AUTO--*/         String text = "";
/*AUTO--*/         while ( line != null && !line.startsWith("PGB:") )
/*AUTO--*/         {
/*AUTO--*/             text += line;
/*AUTO--*/             line = reader.readLine();
/*AUTO--*/         }
/*AUTO--*/         APFDocument apf = new APFDocument( text );
                   applyRules( apf );
/*AUTO--*/     }
/*AUTO--*/     else
/*AUTO--*/     {
/*AUTO--*/         line = reader.readLine();
/*AUTO--*/     }
/*AUTO--*/ }
/*AUTO--*/ reader.close();
}
void applyRules(APFDocument apf)
{
/* Start of parsing rules */
/*RULE1--*/ String name = apf.getText( 3, 20, 40 );
/*RULE2--*/ String amountDue = apf.getText( 5, 20, 30 );
/* End of parsing rules */
/*AUTO--*/ XMLDocument xml = new XMLDocument
/*AUTO--*/ ("Acme Bill For" + name );
/*AUTO--*/ xml.addElement( "name", name );
/*AUTO--*/ xml.addElement( "amount-due", amountDue );
/*AUTO--*/ xml.save();
}
/* Start of auto-generated XML rules
<parsing-rules>
<file type="apf"/>
<document delimiter="PGB:"/>
<rules>
<gettext id="RULE1" line="3" start="20" end="40"/>
<gettext id="RULE2" line="5" start="20" end="30"/>
</rules>
</parsing-rules>
End of auto-generated XML Rules */
}
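The XML rules comment that closes this listing is intended to be machine-readable. A minimal sketch of how a tool might recover the standard rules from a filer's source text follows. The class XmlRuleReader and its output format are hypothetical; the marker strings are taken from the listing, and the XML is read with the standard javax.xml.parsers API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper; the patent does not show how the trainer reads the rules.
class XmlRuleReader {

    // Comment markers copied from the generated filer listing.
    static final String START = "/* Start of auto-generated XML rules";
    static final String END = "End of auto-generated XML Rules */";

    // Returns one "id:line:start:end" string per <gettext> rule found
    // in the XML rules comment of the given filer source text.
    static List<String> readRules(String filerSource) throws Exception {
        int from = filerSource.indexOf(START);
        int to = filerSource.indexOf(END);
        if (from < 0 || to < from) {
            throw new IllegalArgumentException("XML rules comment not found");
        }
        String xml = filerSource.substring(from + START.length(), to).trim();

        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        NodeList nodes = doc.getElementsByTagName("gettext");
        List<String> rules = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            rules.add(e.getAttribute("id") + ":" + e.getAttribute("line") + ":"
                + e.getAttribute("start") + ":" + e.getAttribute("end"));
        }
        return rules;
    }
}
```

Reading the rules this way requires no understanding of the surrounding Java statements, only of the comment markers and the XML between them.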
A description is now given of a few aspects of the SampleAPFFileFiler_V1.java program. First, since it extends the APFFilerBase class, it automatically inherits the setInputName( ) method implementation in that class as well as the inputFileName class attribute. The most interesting parts of the SampleAPFFileFiler_V1 class are the parseInput( ) and applyRules( ) methods.
In the present example, the entire implementation of the parseInput( ) method was generated by the APF visual trainer. Each line of code is preceded by a characteristic /*AUTO--*/ comment, which indicates that the code was generated automatically by the visual trainer. These comments mark the automatically generated code so that, when the visual trainer is subsequently used to edit SampleAPFFileFiler_V1, it will be able to distinguish the code that was auto-generated from custom code logic that may be added by a developer.
The XML rules are also included as a comment at the end of the class. These rules represent the standard parsing logic that was generated by the APF visual trainer. When the visual trainer is used to "edit" the SampleAPFFileFiler_V1.java program, it can re-build all of the rules by reading the XML metadata and rules. This is so because XML is much easier to parse than the Java code in the class. In general, filer source code will also include appropriate method calls to skip over unselected document elements to the selected data fields.
The next section describes how custom rules may be manually inserted into an automatically generated filer.
Adding Custom Parsing Rules to a Generated Filer
In many cases it may become necessary to insert custom rules into a generated filer. This may be required if the input data stream contains unique idiosyncrasies that cannot be, or simply are not, handled by the existing tool.
Suppose, for example, that Acme provides bonus reward points to customers that are members of the "platinum" program. It would be desirable to pass this information along so that it becomes part of the resultant XML document. But since not all customers are members of the platinum program, the APF visual trainer was not devised to generate this kind of conditional logic. The requisite logic may nevertheless be added to the auto-generated filer as a custom rule. In fact, such conditional logic may also be included in the visual trainer code-generating capabilities, as skilled programming practitioners will recognize, but for purposes of illustrating code customization this capability is added here manually. To add a custom rule to the SampleAPFFileFiler_V1 class, one can simply edit the applyRules( ) method. An example of an inserted rule is shown in the following listing.
void applyRules(APFDocument apf)
{
/*AUTO--*/ XMLDocument xml = new XMLDocument
/*AUTO--*/     ("Acme Bill Number" + count++ );
/* Start of parsing rules */
/*RULE1--*/ xml.addElement( "name", apf.getText( 3, 20, 40 ) );
/*RULE2--*/ xml.addElement( "amount-due", apf.getText( 5, 20, 30 ) );
if ( apf.getText( 7, 20, 31 ).equals( "Bonus Points" ) )
    xml.addElement( "bonus-points", apf.getText( 7, 33, 38 ) );
/* End of parsing rules */
/*AUTO--*/ xml.save();
}
When the visual trainer is next run on the filer, it will see that a custom rule has been inserted. Because this rule has not been defined in the XML rules, the visual trainer knows that this code should be preserved. To ensure that this code is clearly marked, the visual trainer may also add a custom rule entry to the XML rules when it saves the filer. In addition, if the programmer fails to insert a standard "/*CUST--" comment before any custom rule lines, the visual trainer can do it, as shown in the following code snippet.
void applyRules(APFDocument apf)
{
/*AUTO--*/ XMLDocument xml = new XMLDocument("Acme Bill Number" + count++ );
/* Start of parsing rules */
/*RULE1--*/ xml.addElement( "name", apf.getText( 3, 20, 40 ) );
/*RULE2--*/ xml.addElement( "amount-due", apf.getText( 5, 20, 30 ) );
/*CUST1--*/ if ( apf.getText( 7, 20, 31 ).equals( "Bonus Points" ) )
/*CUST1--*/     xml.addElement( "bonus-points", apf.getText( 7, 33, 38 ) );
/* End of parsing rules */
/*AUTO--*/ xml.save();
}
The XML metadata in the document would be represented as follows.
<parsing-rules>
<file type="apf"/>
<document delimiter="PGB:"/>
<rules>
<gettext id="RULE1" line="3" start="20" end="40"/>
<gettext id="RULE2" line="5" start="20" end="30"/>
<customrule id="CUST1"/>
</rules>
</parsing-rules>
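The marking of untagged custom lines described above could be approached as in the following sketch. It is only one possible reading of the text: the class CustomRuleTagger is hypothetical, it assumes that each statement line in the rules section either starts with a marker comment such as /*RULE1--*/ or is custom code, and it numbers each run of consecutive unmarked lines as one custom rule starting from CUST1 without checking for identifiers already in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; the disclosure does not specify how the trainer tags lines.
class CustomRuleTagger {

    // Matches lines that already begin with a marker such as /*AUTO--*/ or /*RULE2--*/.
    private static final Pattern MARKED =
        Pattern.compile("^\\s*/\\*[A-Z]+\\d*--\\*/.*");

    // Prefixes unmarked lines between the "Start of parsing rules" and
    // "End of parsing rules" comments with a /*CUSTn--*/ marker.
    static List<String> tag(List<String> lines) {
        List<String> out = new ArrayList<>();
        boolean inRules = false;   // inside the parsing rules section
        boolean inRun = false;     // inside a run of consecutive custom lines
        int custCount = 0;
        for (String line : lines) {
            String t = line.trim();
            if (t.startsWith("/* Start of parsing rules")) {
                inRules = true;
                inRun = false;
                out.add(line);
            } else if (t.startsWith("/* End of parsing rules")) {
                inRules = false;
                out.add(line);
            } else if (inRules && !t.isEmpty() && !MARKED.matcher(line).matches()) {
                if (!inRun) {
                    custCount++;   // a new custom rule begins
                    inRun = true;
                }
                out.add("/*CUST" + custCount + "--*/ " + t);
            } else {
                inRun = false;
                out.add(line);
            }
        }
        return out;
    }
}
```

Each identifier assigned this way would then be recorded as a customrule element in the XML metadata, so that a later run recognizes the lines as custom code to be preserved.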
This example demonstrates yet another advantage of the invention: any input data stream can be successfully parsed, regardless of the auto-code-generating capabilities of the visual trainer used. The above descriptions and drawings disclose illustrative embodiments of the invention. Given the benefit of this disclosure, those skilled in the art will appreciate that various modifications, alternate constructions, and equivalents may also be employed to achieve the advantages of the invention. For example, while an embodiment of the invention has been described here in terms of the Java programming language, no limitation to that language is intended. Thus, the invention is not to be limited to the above description and illustrations, but is to be defined by the following claims.
