Office correspondence storage and retrieval system
4358824
Abstract
A system that intelligently abstracts and archives a document for storage and interprets a free form user retrieval query to recall the document from the storage file. The system includes a method for automatically selecting keywords from the document using a parts of speech directory. A method is given for weighing the importance or centrality of each keyword with respect to the document of its origin. Using the same logic paths on a free form query that describes the document in the same manner that it would have to be described to a secretary to "find" it in a filing cabinet, the system automatically determines the key matching terms and finds the archived document(s) with the greatest affinity.
Claims
What is claimed is:
Description
BACKGROUND OF THE INVENTION
TABLE 1
__________________________________________________________________________
Document Abstraction Routine
__________________________________________________________________________
BEGINPROCEDURE(OCRS --ABSTRACT);
ENTER ABSTRACT, SAVE DOCUMENT NUMBER PARAMETER;
READ DOCUMENT ABSTRACT FILE RECORD FOR DOCUMENT NUMBER;
IF
RECORD FOUND
THEN
CALL (DELETE --ABSTRACT);
ENDIF;
WHILE
NOT END OF DOCUMENT
DO
WHILE
NOT END OF PAGE
DO
GET NEXT LINE OF TEXT FROM THE DOCUMENT;
WHILE
MORE CHARACTERS EXIST ON THE LINE
DO
GET NEXT WORD FROM THE LINE (2 OR MORE
CONSECUTIVE CHARACTERS A-Z, 0-9, OR
');
IF
THE WORD IS "CC"
THEN
SET CC LINE NUMBER TO THE DOCUMENT
LINE NUMBER MINUS 1;
ENDIF;
CALL (ABSTRACT --PROCESS --WORD);
ENDWHILE;
INCREMENT DOCUMENT LINE NUMBER BY 1;
ENDWHILE;
INCREMENT PAGE NUMBER BY 1;
ENDWHILE;
SET LAST BODY LINE COUNT TO THE LESSER OF:
THE CC LINE NUMBER AND THE DOCUMENT LINE NUMBER;
DECREMENT THE LAST BODY LINE COUNT BY 4;
CALL (ABSTRACT --END --PROCESSING);
ENDPROCEDURE(OCRS --ABSTRACT);
__________________________________________________________________________
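The scanning loop of Table 1 can be sketched in Python as follows. This is only an illustrative reading of the PDL, not the patented implementation: the function name `scan_document`, the list-of-pages input shape, the upper-casing of each line, and the 0-based line counter are all assumptions made for the sketch.

```python
import re

# A word is 2 or more consecutive characters from A-Z, 0-9, or the apostrophe.
WORD_RE = re.compile(r"[A-Z0-9']{2,}")


def scan_document(pages):
    """Scan pages (a list of pages, each a list of text lines).

    Returns (words, cc_line, last_body_line). The line number counts the
    lines already read (0-based), which is an assumption of this sketch.
    """
    line_number = 0
    cc_line = None
    words = []  # (word, line_number) pairs handed to the word processor
    for page in pages:
        for line in page:
            for word in WORD_RE.findall(line.upper()):
                if word == "CC":
                    # CC line number is the document line number minus 1.
                    cc_line = line_number - 1
                words.append((word, line_number))
            line_number += 1
    # Lesser of the CC line number and the final line number, minus 4.
    end_of_body = line_number if cc_line is None else min(cc_line, line_number)
    return words, cc_line, end_of_body - 4
```

For a ten-line letter whose last line is the copy list, the sketch reports the CC line as 8 and the last body line count as 4.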
Table 1 is the program routine in Program Design Language (PDL) for abstracting the document. If the document number (identifier code) is found to exist in the abstract file, the program routine branches to the delete abstract routine of Table 2 which is shown as block 22 of the flow chart of FIG. 2.
TABLE 2
__________________________________________________________________________
Delete Abstract Subroutine
__________________________________________________________________________
BEGINPROCEDURE(DELETE --ABSTRACT);
ENTER DELETE ABSTRACT;
WHILE
NOT END OF DOCUMENT ABSTRACT RECORD
DO
GET THE NEXT ENTRY IN THE DOCUMENT ABSTRACT RECORD;
READ THE WORD INDEX RECORD FOR THE WORD;
WHILE
NOT END OF WORD INDEX RECORD
DO
GET THE NEXT ENTRY IN THE WORD INDEX RECORD;
IF
THE DOCUMENT NUMBER IN THE ENTRY IS THE
SAME AS THE DOCUMENT NUMBER FROM
THE DOCUMENT ABSTRACT RECORD
THEN
REMOVE THE ENTRY FROM THE WORD INDEX
RECORD;
IF
THERE ARE NOW NO ENTRIES IN THE WORD
INDEX RECORD
THEN
DELETE THE WORD INDEX RECORD FROM
THE FILE;
ELSE
REWRITE THE WORD INDEX RECORD TO THE
FILE;
ENDIF;
ENDIF;
ENDWHILE;
ENDWHILE;
DELETE THE DOCUMENT ABSTRACT RECORD FROM THE FILE;
ENDPROCEDURE(DELETE --ABSTRACT);
__________________________________________________________________________
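As a rough illustration of Table 2, the sketch below models the document abstract file and the word index file as in-memory Python dictionaries. The record layouts (`"words"`, `"entries"`, `"doc"`) and the function name `delete_abstract` are hypothetical choices for this example; the patent does not specify a storage format at this level.

```python
def delete_abstract(doc_id, abstracts, word_index):
    """Remove every trace of doc_id from both files.

    abstracts:  {doc_id: {"words": [word, ...]}}
    word_index: {word: {"entries": [{"doc": doc_id, ...}, ...]}}
    """
    record = abstracts.get(doc_id, {"words": []})
    for word in record["words"]:
        index_record = word_index.get(word)
        if index_record is None:
            continue
        # Drop the entries that belong to this document.
        index_record["entries"] = [
            e for e in index_record["entries"] if e["doc"] != doc_id
        ]
        if not index_record["entries"]:
            # No entries remain: delete the word index record itself.
            del word_index[word]
    # Finally delete the document abstract record.
    abstracts.pop(doc_id, None)
```

A word shared with another document keeps its index record with the remaining entry, while a word unique to the deleted document disappears from the index.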
The delete abstract subroutine of Table 2 deletes the abstract from memory by deleting occurrences of the words in the abstract from the word index file. The makeup of the word index file will be fully explained below. Following deletion of the existing abstract from memory, or if no words having the document number are stored in the word index file, the document is processed at block 23 to create an abstract. Referring to the program routine in Table 1, the next word in the document is tested to determine if the Carbon Copy (CC) list follows. If not, the program branches to the Abstract Process Word routine of Table 3 to determine if the word should be included in the abstract for the document.
TABLE 3
______________________________________
Abstract Process Word Subroutine
______________________________________
BEGINPROCEDURE(ABSTRACT --PROCESS --WORD);
ENTER PROCESS WORD;
INCREMENT DOCUMENT WORD COUNT BY 1;
LOOK THE WORD UP IN THE DICTIONARY;
IF
THE WORD WAS FOUND IN THE DICTIONARY BUT
NOT FLAGGED AS A NOUN OR A SINGLE
PURPOSE ADJECTIVE
THEN
IGNORE THIS WORD;
ELSE
IF
THE WORD WAS FOUND IN THE DICTIONARY BUT
FLAGGED AS A NOUN OR A SINGLE PURPOSE
ADJECTIVE
THEN
FLAG THE WORD AS NORMAL;
ELSE
FLAG THE WORD AS ACRONYM;
ENDIF;
IF
THIS WORD HAS NOT BEEN FOUND PREVIOUSLY IN
THIS DOCUMENT
THEN
SAVE THIS WORD;
SAVE THE DOCUMENT LINE COUNT;
SET FREQUENCY COUNT FOR THIS WORD TO 1;
ELSE
INCREMENT FREQUENCY COUNT FOR THIS WORD BY 1;
ENDIF;
ENDIF;
ENDPROCEDURE(ABSTRACT --PROCESS --WORD);
______________________________________
As was previously stated, the criterion for determining whether a word is included in the abstract is whether the word is determined to be a "message specialization term", i.e., a noun, single purpose adjective, proper name, acronym, or numeric. The program routine of Table 3 compares the word to the contents of dictionary memory 108 (FIG. 1). If the word is found in the dictionary memory but is not a noun or single purpose adjective, then the word is ignored. The decision as to whether a word in the dictionary is a noun or single purpose adjective is made at the time of preparation of the dictionary memory 8, and those words designated as nouns or single purpose adjectives have a code bit appended to them. If the word is determined to be a noun or single purpose adjective, a code bit or "flag" is added to the word to indicate its status as "normal". If the word is not in the dictionary, then a code bit or "flag" is added to the word to indicate its status as acronym or proper name. Acronyms and proper names are considered to have more influence as message specialization terms than nouns and single purpose adjectives and therefore are more useful for document retrieval, as will be shown below.
The Process Word routine of Table 3 controls the processor 10 to save only one copy of each abstract term for storage in the word index file. However, the Process Word routine appends to the word the number of each line in the document where the word appears and a count of the number of times the word appears in the document. As will be seen below for document retrieval, the frequency of occurrence of the word in the document and the place of occurrence help determine the value of the word as a query term for retrieving the document.
Following completion of the Process Word subroutine, control returns to the Abstract routine of Table 1, which repeats the routine for each word in the document. The Abstract routine accumulates a count for the number of pages in the document.
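The dictionary test described above can be sketched as follows. The `DICTIONARY` contents are a toy stand-in for the dictionary memory and are purely illustrative, and the `saved` record layout is an assumption; as in Table 3, only the line of the first occurrence is saved here, and the document word count is left out for brevity.

```python
# Toy stand-in for the dictionary memory: word -> True when the word is
# flagged as a noun or single purpose adjective. Entries are illustrative.
DICTIONARY = {"INVOICE": True, "PAYMENT": True, "THE": False, "PLEASE": False}


def process_word(word, line_number, saved):
    """Update saved ({word: {"flag", "first_line", "freq"}}) for one word."""
    if word in DICTIONARY and not DICTIONARY[word]:
        return  # in the dictionary but not a message specialization term
    # Dictionary nouns/adjectives are "normal"; unknown words are treated
    # as acronyms or proper names.
    flag = "normal" if word in DICTIONARY else "acronym"
    entry = saved.get(word)
    if entry is None:
        saved[word] = {"flag": flag, "first_line": line_number, "freq": 1}
    else:
        entry["freq"] += 1
```

Feeding the words PLEASE, INVOICE, ACME, INVOICE through this function leaves two saved terms: INVOICE as a normal word with frequency 2, and ACME flagged as an acronym or proper name.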
Upon reaching the end of the document a count is calculated to determine the fifth line from the end of the body of the document and the Abstract End Processing subroutine of Table 4 is selected.
TABLE 4
__________________________________________________________________________
Abstract End Processing Subroutine
__________________________________________________________________________
BEGINPROCEDURE(ABSTRACT --END --PROCESSING);
ENTER END PROCESSING;
CREATE A DOCUMENT ABSTRACT RECORD CONSISTING OF:
THE DOCUMENT NUMBER, THE DOCUMENT WORD COUNT, AND
EACH WORD IN THE ABSTRACT;
WRITE THE DOCUMENT ABSTRACT RECORD TO THE FILE;
WHILE
MORE WORDS ARE LEFT TO PROCESS
DO
READ THE WORD INDEX RECORD FOR THE WORD;
IF
THE RECORD WAS NOT FOUND
THEN
CREATE A WORD INDEX RECORD CONSISTING OF:
THE WORD, THE NORMAL/ACRONYM/PROPER NAME
FLAG, THE DOCUMENT NUMBER, THE FREQUENCY
COUNT, AND A FLAG INDICATING IN HEADER/
TRAILER/CC LIST/BODY;
WRITE THE WORD INDEX RECORD TO THE FILE;
ELSE
ADD THE DOCUMENT NUMBER, THE FREQUENCY COUNT,
AND A FLAG INDICATING IN HEADER/TRAILER/CC
LIST/BODY TO THE RECORD;
REWRITE THE WORD INDEX RECORD TO THE FILE;
ENDIF;
ENDWHILE;
ENDPROCEDURE(ABSTRACT --END --PROCESSING);
__________________________________________________________________________
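A minimal sketch of Table 4 follows, again using in-memory dictionaries in place of the two files. The description defines the header as the first 10 lines and the trailer as the last 5 lines, but the PDL does not spell out how the location flag is derived; the `location_flag` helper, its order of precedence, the 0-based line numbers, and all record field names are assumptions of this example.

```python
def location_flag(first_line, last_body_line, cc_line):
    """Classify a word by the line on which it was first saved."""
    if cc_line is not None and first_line > cc_line:
        return "cc"
    if first_line < 10:
        return "header"   # first 10 lines
    if first_line >= last_body_line:
        return "trailer"  # last 5 lines before any copy list
    return "body"


def end_processing(doc_id, word_count, pages, saved, last_body_line, cc_line,
                   abstracts, word_index):
    """Write the abstract record and update one index record per saved word."""
    abstracts[doc_id] = {"word_count": word_count, "words": list(saved)}
    for word, info in saved.items():
        entry = {
            "doc": doc_id,
            "pages": pages,
            "freq": info["freq"],
            "where": location_flag(info["first_line"], last_body_line, cc_line),
        }
        # Create the word index record if it is missing, else add to it.
        record = word_index.setdefault(word, {"flag": info["flag"], "entries": []})
        record["entries"].append(entry)
```

Because `setdefault` either creates the record or returns the existing one, each word appears only once in the index no matter how many documents contain it.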
The Abstract End Processing subroutine controls the processor 10 to create an abstract record which includes all words saved by the Process Word subroutine of Table 3, a count of the number of words in the document and the document identifier code number. The Abstract End Processing subroutine also creates a Word Index Record for each word in the abstract record which includes the word, the "normal" or "acronym/proper name" code, the document number, the number of pages in the document, the frequency of occurrence of the word in the document, and a code indicating whether the word occurs in the header (first 10 lines), trailer (last 5 lines) or the copy list or body of the document. The words in the Word Index File are searched to determine if a record for the word already appears in the Word Index File. If it does, then the record is updated by adding the document number, frequency count, and codes such that no duplicates of the word appear in the Word Index File. Following completion of the Abstract End Processing subroutine of Table 4, control returns to the Abstract routine of Table 1, which terminates the abstracting procedure.
To retrieve a document stored in the system, the requestor must enter a query for the document into the system. This may be done through a keyboard, for example. The queries used with the preferred embodiment of this system can be a natural language statement or string of words that describes the item. The search argument is created by testing the query words against the word index file. In many cases, the words in the search argument will occur in the key word records (abstracts) of several documents. In order to provide better discrimination between contending documents, different weights are applied to different key words. Weighting criteria are applied according to these general rules:
1--Matches on numeric key words are given greater weight than matches on alpha key words.
2--Matches with key words that are proper names or acronyms are given greater weight than matches with nouns or single purpose adjectives that are found in the dictionary memory.
3--The weight assigned to a key word match is proportional to the number of times that the word occurred in the document divided by the log of the number of pages in the document.
4--Matches with key words that occur in the first ten lines of the document are given greater weight than those of key words in the center of the body of text.
5--Matches that occur with key words in the last five lines of text (before any copy lists) are given more weight than matches with words in the center of the text, but less weight than matches with words in the first ten lines.
6--The weight of a key word match is increased when that word is the name of a month or year.
7--The weight of a key word match is inversely proportional to the number of documents in the entire file that contain that key word in the body of the document (excluding occurrences as part of the copy list).
The rationale behind these general rules is to give the greatest weight to those matches that involve key words that have the most narrowly specific meaning. It is assumed that specific names, numbers and dates have very specific meaning, so they are weighted heavily. It is also assumed that the most specific items will be mentioned at the beginning or end of the correspondence. Hence, words occurring in these regions are also given greater weight.
An example of an expression that satisfies the general rules is the following:
##EQU1##
where:
F.sub.i,j =number of times ith key word appears in jth document divided by log.sub.2 of the number of pages in the document.
A.sub.i =binary indicator if ith key word is an acronym or proper name.
K.sub.i =binary indicator if ith key word occurs in first 10 lines.
L.sub.i =binary indicator if ith key word is a numeric.
E.sub.i =binary indicator if ith key word occurs in last 5 lines.
H.sub.i =binary indicator if ith key word occurs in the dictionary as a noun or single purpose adjective.
M.sub.i =binary indicator if ith key word is a month.
Y.sub.i =binary indicator if ith key word is a year.
D.sub.i =number of documents that contain ith key word.
Referring to FIG. 3, a flow chart of the processing of a query for a document is shown. At block 30 the user query is input to the processor 10 (FIG. 1) from input register 16 over bus 15. Tables 5, 6, and 7 show program routines for processing the user query according to the general rules stated above.
TABLE 5
______________________________________
Query Routine
______________________________________
BEGINPROCEDURE(OCRS --QUERY);
ENTER QUERY;
WHILE
MORE QUERY LINES OF TEXT EXIST
DO
GET THE NEXT LINE OF QUERY TEXT;
WHILE
MORE CHARACTERS EXIST ON THE LINE
DO
GET THE NEXT WORD FROM THE LINE (2 OR MORE
CHARACTERS A-Z, 0-9, OR ');
READ THE WORD INDEX RECORD FOR THE QUERY
WORD;
IF
WORD FOUND
THEN
CALL (QUERY --PROCESS --WORD);
ENDIF;
ENDWHILE;
ENDWHILE;
CALL (QUERY --END --PROCESSING);
ENDPROCEDURE(OCRS --QUERY);
______________________________________
The Query routine of Table 5 compares the query words to the contents of the word index file as shown in block 31 of the flow diagram of FIG. 3. The query words that match the word index file are processed at block 32 of the flow diagram by the Query Process Word subroutine of Table 6.
TABLE 6
______________________________________
Query Process Word Subroutine Detailed Logic
______________________________________
BEGINPROCEDURE(QUERY --PROCESS --WORD);
ENTER PROCESS WORD;
IF
THE WORD IS A YEAR
THEN
SET INDICATOR FOR YEAR IN QUERY;
ENDIF;
IF
THE WORD IS A MONTH
THEN
SET INDICATOR FOR MONTH IN QUERY;
ENDIF;
IF
THE WORD IS NUMERIC
THEN
SET NUMBER WEIGHT TO 10;
ELSE
SET NUMBER WEIGHT TO 0;
ENDIF;
COUNT THE NUMBER OF DOCUMENTS CONTAINING
THIS WORD;
COUNT THE NUMBER OF DOCUMENTS WHERE
THE WORD IS NOT IN THE CC LIST;
IF
THE WORD INDEX RECORD IS FLAGGED AS AN ACRONYM/
PROPER NAME
THEN
SET ACRONYM/PROPER NAME WEIGHT TO 10;
SET NORMAL WEIGHT TO 0;
ELSE
SET ACRONYM/PROPER NAME WEIGHT TO 0;
SET NORMAL WEIGHT TO 5;
ENDIF;
WHILE
MORE DOCUMENT ENTRIES ARE IN THE WORD INDEX
RECORD
DO
GET THE NEXT DOCUMENT ENTRY FROM THE WORD
INDEX RECORD;
IF
THE FLAG INDICATES THAT THE WORD OCCURRED
IN THE HEADER
THEN
SET HEADER WEIGHT TO 10;
ELSE
SET HEADER WEIGHT TO 0;
ENDIF;
IF
THE FLAG INDICATES THAT THE WORD OCCURRED
IN THE TRAILER
THEN
SET TRAILER WEIGHT TO 5;
ELSE
SET TRAILER WEIGHT TO 0;
ENDIF;
IF
THE FLAG INDICATES THAT THE WORD OCCURRED
IN THE CC LIST
THEN
SET CC DIVIDE WEIGHT TO 99,999;
ELSE
SET CC DIVIDE WEIGHT TO 1;
ENDIF;
SET THE RETRIEVAL VALUE TO:
(ACRONYM/PROPER NAME WEIGHT + NUMBER
WEIGHT + NORMAL WEIGHT + HEADER WEIGHT +
TRAILER WEIGHT + WORD FREQUENCY DIVIDED
BY THE LOG BASE 2 OF COUNT OF NUMBER OF
PAGES) DIVIDED BY THE LOG BASE 2 OF THE
COUNT OF DOCUMENTS NOT CONTAINING THE
WORD IN THE CC LIST;
DIVIDE THE RETRIEVAL VALUE BY THE CC DIVIDE
WEIGHT;
IF
THIS DOCUMENT HAS NOT BEEN ANALYZED YET
IN THIS QUERY
THEN
SAVE THE DOCUMENT NUMBER;
SAVE THE RETRIEVAL VALUE;
ELSE
INCREMENT THE DOCUMENT'S RETRIEVAL VALUE
BY THE NEW RETRIEVAL VALUE;
ENDIF;
ENDWHILE;
ENDPROCEDURE(QUERY --PROCESS --WORD);
______________________________________
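The retrieval value computed in Table 6 can be worked through with a short Python sketch. The weights (10, 5, 99,999) are taken from the table; the function and parameter names are hypothetical. One point the PDL leaves open is the divisor when a document has a single page or a word occurs in a single document, since log base 2 of 1 is 0. This sketch assumes the divisor is clamped to at least 1, which is a choice made here and not something the source states.

```python
import math


def safe_log2(n):
    # Assumption of this sketch: clamp the divisor so log2(1) = 0 is never used.
    return max(1.0, math.log2(n)) if n > 0 else 1.0


def retrieval_value(is_numeric, is_acronym, where, freq, pages, docs_not_cc):
    """Value of one query word for one document entry, following Table 6."""
    number_w = 10 if is_numeric else 0
    acronym_w = 10 if is_acronym else 0
    normal_w = 0 if is_acronym else 5
    header_w = 10 if where == "header" else 0
    trailer_w = 5 if where == "trailer" else 0
    cc_divide = 99_999 if where == "cc" else 1
    total = (acronym_w + number_w + normal_w + header_w + trailer_w
             + freq / safe_log2(pages))
    return total / safe_log2(docs_not_cc) / cc_divide


def accumulate(scores, doc_id, value):
    """Add a word's value to the running total for its document."""
    scores[doc_id] = scores.get(doc_id, 0.0) + value
```

For example, a proper name that appears 3 times in the header of a one-page letter and is found in 2 documents scores (10 + 10 + 3) / 1 = 23 under these assumptions, while the same word found only in a copy list is divided by 99,999 and contributes almost nothing.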
Each query word is tested to determine if it is a month, year, numeric, acronym or normal (noun or single purpose adjective). The subroutine of Table 6 also adds weighting factors if the indicators in the word index file show that the word occurs in the first ten lines (Header) of the document, in the last five lines (Trailer) of the document, or more than once in the document. The value of the word is reduced if it occurs in the copy list of the document or occurs in more than one document. An overall value is calculated for each word, and a total value is accumulated for each document number having any matches, summed over all query words that match words in the word index file. The steps of calculating the retrieval value for words and the retrieval value for documents are shown in blocks 33 and 34 of FIG. 3. Following processing of all words in the query, the Query routine of Table 5 branches to the Month/Year Evaluation subroutine of Table 7.
TABLE 7
______________________________________
Query Month/Year Evaluation
______________________________________
BEGINPROCEDURE(QUERY --END --PROCESSING);
ENTER END PROCESSING;
IF
THERE WAS A YEAR MENTIONED IN THE QUERY
THEN
INCREMENT THE RETRIEVAL VALUE OF EACH
DOCUMENT THAT DID CONTAIN THE YEAR BY 20%;
ENDIF;
IF
THERE WAS A MONTH MENTIONED IN THE QUERY
THEN
INCREMENT THE RETRIEVAL VALUE OF EACH
DOCUMENT THAT DID CONTAIN THE MONTH BY 20%;
ENDIF;
RETRIEVE THE DOCUMENT NUMBERS OF THE
DOCUMENTS WHOSE RETRIEVAL VALUE IS WITHIN
25% OF THE HIGHEST RETRIEVAL VALUE;
SORT THIS LIST BY THE NUMBER OF WORDS FROM
THE QUERY ACTUALLY OCCURRING
IN THE DOCUMENT;
OUTPUT THE DOCUMENTS;
ENDPROCEDURE(QUERY --END --PROCESSING);
______________________________________
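The final selection step of Table 7 might look like the following in Python. The function name and argument shapes are hypothetical. Two readings are assumed here because the PDL does not settle them: a document matching both the year and the month receives the two 20% increases one after the other (a factor of 1.44), and the shortlist is sorted with the most matching query words first.

```python
def select_documents(scores, matched_words, docs_with_year=(), docs_with_month=()):
    """Apply the month/year increases, then shortlist and sort documents.

    scores:        {doc_id: accumulated retrieval value}
    matched_words: {doc_id: number of query words found in the document}
    """
    boosted = dict(scores)
    for doc in docs_with_year:
        if doc in boosted:
            boosted[doc] *= 1.2   # increment by 20% for a matching year
    for doc in docs_with_month:
        if doc in boosted:
            boosted[doc] *= 1.2   # increment by 20% for a matching month
    if not boosted:
        return []
    best = max(boosted.values())
    # Keep documents whose value is within 25% of the highest value.
    shortlist = [d for d, v in boosted.items() if v >= 0.75 * best]
    return sorted(shortlist, key=lambda d: matched_words.get(d, 0), reverse=True)
```

With values of 20, 18 and 10 for documents 1, 2 and 3, and a year match on document 2, document 2 rises to about 21.6, document 3 falls outside the 25% band, and the remaining two are ordered by how many query words each contains.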
The subroutine of Table 7 increases the retrieval value for each document that contains a year and/or month that matches a year and/or month in the query. The subroutine of Table 7 then controls the processor 10 to output those documents from main memory 12 to output register 18 whose retrieval value is within 25 percent of the highest retrieval value calculated. Control is then returned to the Query routine of Table 5, which terminates the query procedure.
While the invention has been shown and described with reference to a specific set of computer instructions, i.e. PDL, and retrieval weighting values, it will be understood by those skilled in the art that the spirit of this invention can be implemented in other computer languages and the set of document retrieval weighting factors can be modified without avoiding the scope of the invention claimed herein.
