Automatic extraction of metadata using a neural network6044375Abstract A method of automatically extracting metadata from a document. The method of the invention provides a computer readable document that includes blocks comprised of words, an authority list that includes common uses of a set of words, and a neural network trained to extract metadata from groupings of data called compounds. Compounds are created with one compound describing each of the blocks. Each compound includes the words making up the block, descriptive information about the blocks, and authority information associated with some of the words. The descriptive information may include such items as bounding box information, describing the size and position of the block, and font information, describing the size and type of font the words of the block use. The authority information is located by comparing each the words from the block to the authority list. The compounds are processed through the neural network to generate metadata guesses including word guesses, compound guesses and document guesses along with confidence factors associated with the guesses indicating the likelihood that each of the guesses is correct. The method may additionally include providing a document knowledge base of positioning information and size information for metadata in known documents. If the document knowledge base is provided, then the method includes deriving analysis data from the metadata guess and comparing the analysis data to the document knowledge base to determine metadata output. Claims We claim: Description BACKGROUND OF THE INVENTION
______________________________________
e-mail or surface addresses;
sequencer words (e.g., volume, edition);
prepositions; years;
journal names; months;
conference names;
times of year (e.g., summer);
copyright notice words;
symbols;
organizational names;
numbers;
magazine titles;
punctuations;
first names of people;
outline indicators (e.g., III.); and
last names of people;
names of known authors.
______________________________________
As the words are compared with the authority list, all possible word functions indicated by the authority list for a particular word are associated with that particular word. This associated information is called the authority information of that word. The comparison of the words with the authority list may also incorporate approximation matching. Approximation matching is where both the word and close approximations of the word are compared against the authority list. The close approximations are creating using methods known in the art. Approximation matching is particularly useful when a computer readable document has undergone an OCR operation that may leave slight errors in the words. Next, the information derived from each block is combined (block 36). For text blocks, the combination includes the words, descriptive information for the block as a whole, and both the descriptive information and the authority information associated with each word. For non-textual blocks, the combination includes descriptive information for the block and information about the content of the block. The combination of information for each block is called a compound. When the compound creation process is completed, each text block has an associated compound (block 37). The compounds are then processed through a trained neural network. Neural networks are known in the art. A neural network is a network of many simple processors (units), each possibly having a small amount of local memory. The units are connected by communication channels (connections) which usually carry numeric (as opposed to symbolic) data, encoded by any of various means. The units typically operate only on the data stored in their local memory and on the inputs they receive via the connections. Most neural networks have some sort of "training" rule where the weights of connections are adjusted on the basis of data. In other words, neural networks "learn" from examples (as children learn to recognize dogs from examples of dogs) and exhibit some capability for generalization beyond the training data. According to Haykin, S. (1994), Neural Networks: A Comprehensive Foundation, N.Y.: Macmillan, p. 2: "A neural network is a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects: 1) Knowledge is acquired by the network through a learning process; and 2) Interneuron connection strengths known as synaptic weights are used to store the knowledge." It is the ability of the neural network to learn that gives the method according to the invention one of its advantages over the prior art in the process of extracting metadata from documents. The ability to learn provides the flexibility and adaptability that make the method according to the invention adaptable to user-specified metadata and user-specified types of documentation. Before the compounds can be processed through the neural network, however, the neural network must be trained. A flowchart showing the training of a neural network is shown in FIG. 4A. While the detailed process used will vary depending on the structure of the neural network that is used, the same basic process applies to all neural networks. That is, the neural network must be provided with training examples, each example indicating the desired output for a fixed set of input conditions (block 41). In the preferred embodiment of the invention, the each neural network training example includes both an input part and an output part. The input part that includes compound information and word information. The compound information includes items that describe a block such as: 1) whether the block is centered; 2) the coordinates of the upper left corner of the bounding box surrounding the block; and 3) the coordinates of the lower right corner of the bounding box. The word information for each word includes items such as: 1) position of the word within the block; 2) size of the word (e.g., width and height within the block); 3) font size of word; 4) font style of word (e.g., bold, italics); 5) font type of word (e.g., Courier); and 6) all categories of authority information listed above. The output part includes a document part, compound part, and word part. The document part includes a likelihood that the document might be each of a number of document types including, but not limited to: a technical report, a journal document, a conference document, a chapter, a patent, a news clip, or numerous other document types that can be specified by the user. It also includes the likelihood that the document is not of any known document type. The compound part includes a likelihood that the block described by the compound information input might be each of a number of block types including, but not limited to: title, conference name, publication name, author name, date, copyright, thanks, keywords index, communication, running header, page numbers, or numerous other compound types that can be specified by the user. It also includes the likelihood that the block is not of any known block type. The word part includes a likelihood that each word described by the word input might be each of a number of word types including, but not limited to, first name, last name, company name, journal name, conference name, organization name, magazine name, or numerous other word types that can be specified by the user. It also includes the likelihood that each word is not of any known word type. Once the network has been trained, the compounds associated with each block can be processed through the neural network. A flowchart of this process is depicted in FIG. 4B. While the actual processing through the neural network varies depending on the structure of the neural network used, most neural networks would employ this general structure. First, the neural network takes the compounds as an input (block 42). While some neural networks may be able to take the compound information directly, others may require some input processing of the compounds to create the neural network input (block 47). For purposes of this description, the expression "processing the compound through a neural network" includes processing compounds that have undergone input processing to create the neural network input. Input processing may include any process that converts the compound into a format that can be easily processed as a neural network input. Summarizing and sliding windows are two types of input processing. Summarizing is when key information from the words is used as a neural network input rather than using all the words as the neural network input. The key information may be sufficient for the neural network to make compound and document guesses. By limiting the number of inputs to the neural network by summarizing, the speed and occasionally the accuracy of the neural network processing can be improved. Sliding windows is a technique for creating a neural network input that includes information not only about a particular item, but also information derived from a set number of items proceeding the particular item and possibly a set number of items following the particular item. For example, in making a word guess, the network may be provided with an input that includes not only information about the word in question, but also information derived from a preset number of words immediately proceeding and immediately following the word in question. Next the neural network analyzes the inputs, either directly from the compounds or as processed, based on the training examples it has previously been supplied as well as against preset rules. A preset rule might include, for example, that a centered text block near the top of a page in a large font should be considered a probable title. Using the training examples and the preset rules, the neural network makes metadata guesses of three types for each compound: word guesses, a compound guess, and a document guess. Word guesses (block 43) indicate possible word types for each word from the processed compound. The word guesses may also include word confidence factors. Word confidence factors are numeric values (typically between zero and one-hundred percent) that are associated with each word guess and indicate the likelihood that each possible word type indicated by the word guess is correct. Similarly, the compound guess (block 44) indicates possible block types for the blocks associated with the processed compound. The compound guess may also include compound confidence factors. Compound confidence factors are numeric values (typically between zero and one-hundred percent) that are associated with the compound guess and indicate the likelihood that each possible block type indicated by the compound guess is correct. Finally, the document guess (block 45) indicates possible document types based on the processed compound. The document guess may also include document confidence factors. Document confidence factors are numeric values (typically between zero and one-hundred percent) that are associated with the document guess and indicate the likelihood that each possible document type indicated by the document guess is correct. It is important to note that the neural network does not determine the word guesses, compound guesses, and document guesses independently. In fact the neural network processes all three types of guesses simultaneously utilizing intermediate results in the determination of each type of guess as an analysis factor in the determination of the other two types of guesses. Thus, the intermediate results in the determination of a compound guess may be used as a factor in determining both the document guess and the word guesses. As a result, some of the word confidence factors, for example, may be altered. For purposes of this description, the term neural network may include multiple neural networks. In fact, depending on the neural network used, it may most efficient to used three separate neural networks in place of the one described above. One of the neural networks can be specially configured and trained to determine word guesses, one can be specially configured and trained to determine compound guesses, and one can be specially configured and trained to determine document guesses. Alternatively, multiple neural networks can be configured with each neural network being specially configured and trained to determine metadata guesses for particular document types. Thus, after classifying the document, the metadata can be extracted from the document with a neural network that has been specially configured and trained for that type of document. This method may be particularly effective when users add new metadata types. When all of the compounds have been processed through the neural net, metadata may be determined by selecting from the word guesses, compound guesses and document guesses having the highest word, compound, and document confidence factors, respectively. Alternatively, however, the metadata guesses may be improved prior to determining the metadata through additional analysis that will ultimately result in improved accuracy and reliability of the metadata extracted from the document. FIG. 5 is a block diagram depicting the additional analysis. The additional analysis portion of the method according to the invention involves two steps: 1) deriving analysis data (blocks 52 through 55) from the metadata guesses (block 51); and 2) comparing the analysis data with a predefined document knowledge base (block 56) to improve the metadata guesses. The document knowledge base may include such information as the positioning and sizing of information in known documents. The improved metadata guesses are then used to determine the metadata (block 57). Analysis data can include the raw metadata guesses including word guesses, compound guess and the document guess for each compound processed though the neural network along with their respective confidence factors (block 52). In addition analysis data may include data derived from these raw guesses. For example, it can be very helpful in determining the function of a particular block of a document to know the function of the blocks (both textual and non-textual) that neighbor the particular block (block 54). The functions of neighboring blocks can be derived from the compound guesses describing the neighboring blocks. Similarly, knowing the positions of neighboring blocks may be helpful in determining the function of a particular block. Data describing the relative positions of neighboring blocks is called proximate block position data (block 54). The proximate block position data can be derived by comparing bounding box information from the compound describing the particular block with the bounding box information from the compounds describing the neighboring blocks. Furthermore, the position of a particular block on a page often helps define its function (block 53). The page position for a particular block can also be derived from the bounding box information taken from the compound describing the block. The page position data can also be part of the analysis data described above. Similarly, the font size and type can be useful in determining the purpose of a particular text block or of a particular word within the text block (block 55). For example, items in particularly large fonts are more likely to be titles. The font size and type information for each word of a text block may also be included in the analysis data described above. Once the analysis data has been derived, it is compared with a preexisting document knowledge base (block 56) to determine which, if any, of the word, compound, and document confidence factors should be changed to improve the word, compound, and document guesses, respectively (block 57). The document knowledge base contains information about the metadata position and size in a pool of known documents. The knowledge base may also be dynamic and arranged to include information about each of the documents that has had metadata automatically extracted using this method. The weight given to each piece of analysis data in this comparison is typically not equal and may be adjusted. Once each piece of analysis data has been compared against the knowledge base, and the metadata has been improved, the metadata can be derived from the metadata guesses. This is done by selecting the word guesses, compound guesses, and document guesses with the highest word, compound, and document confidence factors, respectively. Once the metadata has been derived, the user may verify and, if necessary, correct the automatically extracted metadata. If correction by the user is necessary, the corrected information may be used to improve the knowledge base so future errors of this type will be less likely. In the preferred embodiment of the method according to the invention, the various steps described above are performed by a computer. In light of this fact and in order to provide a more detailed description of the method according to the invention, a listing of pseudo code for running the method on a computer is attached. Although a specific embodiment of the invention has been described and illustrated, the invention is not to be limited to the specific forms or arrangements of parts so described and illustrated. The invention is limited only by the claims. PSEUDO CODE LISTING Assumptions: [1] we receive from an OCR system an ocrPage object. This object has an attribute which is an array of word strings, where a word is a white space delineated string of symbols. The object also contains markers giving the beginning and end of paragraphs, which are distinct blocks of text. [2] the ocrPage also has a metaData subclass which carries extra information about each word and paragraph in the page, and about the page itself. In particular, the metaData subclass contains the following attributes in three levels. page level: document classification paragraph level: compound classification other information (bounding box, justification, etc) word level: token classification other information (font information, numeric, punctuation etc) For training examples, all the fields are filled in. For unseen examples, the classification information is set to null. [3] ocrPage has methods for the following functions finding the first word on the page finding the next word on the page (null if the end of a paragraph is encountered) finding the first word of the next paragraph on the page (null if the end of the page is encountered) returning the current position in the page returning the meta data for the current page/paragraph/word [4] a "header-style" definition of ocrPage is given by: class ocrPage{ firstWord(); nextWord(); nextParagraph(); currentPosition(); metaPage(); metaParagraph(firstWordlndex); metaWord(wordlndex); NOTE: in the classes below, the "type" attributes are a vector of entries between 0 and 1, where each entry corresponds to a particular type. If the type(s) are known definitely, the vector will have only 0-1 entries, otherwise uncertainty is measured by the fractional values. Further, the DBMatch method searches through a vector of databases (DB), one for each token type. If the token is found in a particular database, then the corresponding type is set to 1.
__________________________________________________________________________
class Token{
type;
otherInfo;
token;
Token(word, meta){
token = word;
type = meta.type;
otherInfo = meta.otherInfo;
}
DBMatch(DBs){
for(int i=0; i<DBs.length( ); i++){
// check if this token is in the database at index i
thisDB = DBs.elementAt(i);
if (thisDB.isIn(token))
type[i] = 1;
else
type[i] = 0;
}
}
printNNInput(inFile){
inFile.print(otherInfo);
}
printNNTargets(inFile){
inFile.print(type);
}
}
class Compound{
type;
otherInfo;
Vector Tokens;
Compound(meta){
type = meta.type;
otherInfo = meta.otherInfo;
Tokens = new Vector( );
}
printNNInput(inFile){
inFile.print(otherInfo);
for(int i=0; i<Tokens.length( ); i++)
(Tokens.elementAt(i)).printNNTargets(inFile);
}
printNNTraining(inFile){
printNNInput(inFile);
for(int i=0; i<Tokens.length( ); i++)
(Tokens.elementAt(i)).printNNTargets(inFile);
inFile.print(type);
}
}
class Document{
type;
Vector Compounds;
Document(meta){
type = meta.type;
Compounds = new Vector( );
}
printNNInput(inFile){
for(int i=0; i<Compounds.length( ); i++)
(Compound.elementAt(i)).printNNInput(inFile);
}
printNNTraining(inFile){
for(int i=0; i<Compounds.length( ); i++)
(Compound.elementAt(i)).printNNTraining(inFile);
}
}
public Document readPage(ocrPage page, Vector DBs) {
Document thisDoc = new Document(page.metaPage( ));
wordIndex = 0;
word = page.firstWord( );
while(word != null) {
thisCompound = new Compound(page.metaParagraph(wordIndex));
while(word != null) {
thisToken = new Token(word,page.metaWord(wordIndex));
thisToken.DBMatch(DBs); // search the DBs
thisCompound.Tokens.addElement(thisToken);
word = page.nextWord( );
wordIndex++;
}
thisDoc.Compounds.addElement(thisCompound);
word = page.nextParagraph( );
}
return thisDoc;
/*------------------------------------------------------------------------
---
NOTE: nnOutput is a structure which gives the nn prediction for a
particular document.
In particular, nnOutput supplies a vector of numbers for the nn
prediction on each
Compound in the document (nnOutput.getCompoundType(compoundIndex))
Token in the document (nnOutput.getTokenType(tokenIndex))
as well as
the Document type (nnOutput.getDocumentType( ))
*/
public Document addNNprediction)Document thisDoc, nnOutput) {
Document newDoc = thisDoc;
newDoc.type = nnOutput.getDocumentType( );
tokenIndex = 0;
for(int i=0; i<thisDoc.Compounds.length( ); i++){
thisComp = thisDoc.Compounds.elementAt(i);
thisComp.type = nnOutput.getComoundType(i);
for(int j=0; j<thisComp.Tokens.length( ); j++){
thisTok = thisComp.Tokens.elementAt(j);
thisTok.type = nnOutput.getTokenType(tokenIndex++);
thisComp.Tokens.replaceElement(i,thisTok);
}
newDoc.Compounds.replaceElement(i,thisComp);
}
return newDoc;
}
/*------------------------------------------------------------------------
---
NOTE: the Glue routine presumes the existence of the following objects
Vector docTypes; // vector of docType objects
docType{
threshold; // a threshold on how certain we need to be to classify a
document
// as having this type
Vector compTypes; // vector of compType objects
}
compType{
threshold;
topDist; //the furthest this compound type can be from the top of the
page
botDist; //the furthest this compound type can be from the bottom of the
page
}
So, for example, a document type "Journal Article" might have a threshold
of 0.8, and
compTypes "Title", "Author", "Journal", "Date", "Page", "Address". The
"Title"
compType may then have a threshold of 0.9, and may also need to be in the
top 1/3 of the
page (that is, topDist=0.33, botDist=MAXFLOAT)
Also, maxIndex is a function which returns the position of the largest
value in a numeric
array.
*/
public Document Glue(Document thisDoc){
Document newDoc = thisDoc;
newDoc.Compounds = thisDoc.Compounds;
// set all the compound types to "unknown"
for(int i=0; i<newDoc.Compounds.length; i++){
newComp = newDoc.Compounds.elementAt(i);
for(int j=0; j<newComp.types.length( ); j++)
newComp.types[j] = 0.0;
newDoc.Compounds.replaceElement(i, newComp);
}
// find the document type
int maxDocTypeIndex = maxIndex(thisDoc.type);
thisDocType = docTypes.elementAt(maxDocTypeIndex);
// if the document type is acceptable, process the compounds
if(thisDoc.type[maxDocTypeIndex] < thisDocType.threshold){
// cycle through all the compound types
for(int i=0; i<thisDocType.compTypes.length( ); i++){
thisCompType = thisDocType.compTypes.elementAt(i);
bestComp = thisDoc.Compounds.elementAt(0);
int bestCompIndex = 0;
// find the most likely compound for this type
for(int j=1; j<thisDoc.Compounds.length( ); j++){
thisComp = thisDoc.Compounds.elementAt(j);
if(thisComp.type[i] > bestComp.type[i]){
bestComp = thisComp;
bestCompIndex = j;
}
}
// now see if the most suitable compound is acceptable. If so,
// set it to type i. yUp gives the vertical coordinate of the upper
// side of the compound's bounding box, yDown of the lower side.
if((bestComp.type[i] < thisCompType.threshold) AND
((bestComp.yUp < topDist) OR (bestComp.yDown > botDist)))
{
bestComp.type[i] = 1;
newDoc.Compounds.replaceElement(baseCompIndex,bestComp);
}
}
return newDoc;
}
else {
System.out.println("Document does not fit any current document
types");
return thisDoc;
}
}
/*------------------------------------------------------------------------
---
Main function - this calls the above algorithms. It presumes the
existance of the following
extra functions:
make DBs returns a vector of all the necessary DBs.
trainNN takes a file of NN training data and trains a NN.
printDoc prints the final results of an analyzed document in some
acceptable form.
Main takes command line arguments for either NN learning or analysis as
follows.
Learning
[0] D (make training data)
[1] name of file to put the training data in
[2-->] ocrPages with training meta data for NN learning
[0] T (train a network)
[1] name of file containing training data
[0] N (make training data AND train a network)
[1] name of file to put training data in
[2-->] ocrPages with training meta data for NN learning
Analysis
(presumes a file containing the NN prediction for the input data on each
ocrPage)
[0] A (Analysis) [1-->] according to [2*i-1] ocrPage i
[2*i] NN prediction on page i
*/
main(String[ ] args){
DBs = makeDBs( );
if(args[0] == "D"){ // create learning data
File NNTrainFile = args[1];
for(int i=2; i<args.length( ); i++){
thisDoc = readPage(args[i], DBs);
thisDoc.printNNTraining(NNTrainFile);
}
}
else if(args[0] == "T"){ // train network
File NNTrainFile = args[1];
NNTrain(NNTrainFile);
}
else if(args[0] == "N"){ // create data and train
File NNTrainFile = args[1];
for(int i=2; i<args.length( ); i++){
Document thisDoc = readPage(args[i],DBs);
thisDoc.printNNTraining(NNTrainFile);
}
NNTrain(NNTrainFile);
}
else if(args[0] == "A") { // analysis of NN predictions
numDocs = (args.length( ) - 1)/2;
for(int i=0; i< numDocs; i++){
thisDoc = readPage(args[2*i+1],DBs);
nnOutput = args[2*i+2];
// add the NN output results to the document
thisDoc = addNNPrediction(thisDoc, nnOutput);
// now apply Glue to this document
thisDoc = Glue(thisDoc);
printDoc(thisDoc);
}
}
}
__________________________________________________________________________
|
Same subclass Same class Consider this |
||||||||||
