Method, system and program product for automatically retrieving documents
6990483
Abstract
A method, system and program product for automatically retrieving documents are provided. Specifically, a hit list of documents is generated directly from an input file of requests. Once generated, the hit list is processed according to system, data object, storage node identification, storage drive and/or cache. Once sorted, a plurality of retrieval programs are launched and executed in parallel to retrieve the requested documents.
Claims
The invention claimed is:
Description
BACKGROUND OF THE INVENTION
After the sorting is complete, the hit list is then split into sublists according to data system, storage drive and cache. In one embodiment, the hit list is first split according to data system. Since there are two data systems in the above example, the following two sublists will result.
Sublist One
Sublist Two
Sublist1
Sublist2
Sublist3
Sublist4
Sublist5
Sublist6
It should be understood that the precise order in which the sorting and splitting steps are performed is not intended to be limiting. For example, the hit list could first be split according to data system 26A-C, then sorted according to data system, cache, storage node and data object, and then split into the next level sublists. Alternatively, all splitting could be performed in one step. For example, the hit list could be sorted and then split into hit lists according to data system 26A-C, storage drive 28A-C and cache 30A-C in one step. Accordingly, the order/means in which the steps are performed is not limiting as long as a separate sublist for each storage drive and/or cache containing requested documents results. It should also be appreciated that not all storage drives 28A-C and/or caches 30A-C will store requested documents. To this extent, next level sublists are only created for storage drives 28A-C and/or caches 30A-C actually storing requested documents. It should also be understood that although not shown, other storage locations could exist. For example, each data system 26A-C could communicate with a database (not shown) that also stores documents. In this case, for each database storing requested documents, another next level sublist would be generated. Once the hit list has been processed (sorted and split), the documents will be retrieved. In a typical embodiment, the documents are retrieved by document system 22 by launching a retrieval program for each next level sublist. In the example above, this would result in six retrieval programs being launched. Once launched, the retrieval programs are executed in parallel, through library system 20, to retrieve the documents in the processed hit list from storage locations 26A-C, 28A-C and/or 30A-C. In another embodiment, the documents could be retrieved in batch by document system 22 (as will be described below). Retrieval of the documents depends on the format of input file 16. 
Specifically, when input file 16 is provided in query format, the hit list is created by query and then erased from memory. Accordingly, the query process must be repeated to rebuild the hit list before the documents can be retrieved. The benefit of providing input file 16 in the hit list format (preferred) is that the queries do not have to be performed a second time for each document. In addition, providing input file 16 in hit list format makes deferred processing possible. In particular, deferred processing is provided when user 10 requests documents and request system 14 generates input files 16 in hit list format. Request system 14 queries database 24 on library system 20 using provided application program interfaces (APIs). The document location information is then returned to request system 14.
Once retrieved, the documents could be accessed at library system 20 by user 10 via user system 12. Alternatively, the retrieved documents could be outputted to user 10 via a recordable medium (e.g., CD-ROM) or via user system 12. In addition, any error files and/or optional index files could be outputted to user 10. Error files are generated when a request entered by user 10 contains errors that make the processing and/or retrieval impossible. This could be, for example, a typographical error in entering a document name or serial number. In the event of an error, an error file that contains error codes describing the exact error could be automatically generated and outputted to user 10. To this extent, generation and output of the error file need not wait until processing or retrieval is complete. Rather, the error file could be generated immediately upon recognition of an error. The error file could also include information messages detailing the steps of the retrieval process.
In any event, the files could be outputted to user 10 along with the requested documents on a single recordable medium. Referring now to FIG. 2, a more detailed diagram of library system 20 is shown. As depicted, library system 20 generally comprises central processing unit (CPU) 60, memory 62, bus 64, input/output (I/O) interfaces 66, external devices/resources 68 and database 24. CPU 60 may comprise a single processing unit, or be distributed across one or more processing units in one or more locations, e.g., on a client and server. Memory 62 may comprise any known type of data storage and/or transmission media, including magnetic media, optical media, random access memory (RAM), read-only memory (ROM), a data cache, a data object, etc. Moreover, similar to CPU 60, memory 62 may reside at a single physical location, comprising one or more types of data storage, or be distributed across a plurality of physical systems in various forms. I/O interfaces 66 may comprise any system for exchanging information to/from an external source. External devices/resources 68 may comprise any known type of external device, including speakers, a CRT, LED screen, hand-held device, keyboard, mouse, voice recognition system, speech output system, printer, monitor, facsimile, pager, etc. Bus 64 provides a communication link between each of the components in library system 20 and likewise may comprise any known type of transmission link, including electrical, optical, wireless, etc. In addition, although not shown, additional components, such as cache memory, communication systems, system software, etc., may be incorporated into library system 20. Database 24 may provide storage for information necessary to carry out the present invention. Such information could include, among other things, parameters, input files, index files, error files, etc. As such, database 24 may include one or more storage devices, such as a magnetic disk drive or an optical disk drive. 
In another embodiment, database 24 includes data distributed across, for example, a local area network (LAN), wide area network (WAN) or a storage area network (SAN) (not shown). Database 24 may also be configured in such a way that one of ordinary skill in the art may interpret it to include one or more storage devices. It should be understood that although not shown, user system 12 and data systems 26A-C typically contain components (e.g., CPU, memory, etc.) similar to library system 20. Such components have not been separately depicted and described for brevity purposes. It should be understood that communication between user system 12, library system 20 and data systems 26A-C could be provided through any known means. For example, user system 12, library system 20 and data systems 26A-C could be connected via direct hardwired connections (e.g., serial port), or via addressable connections (e.g., remotely) in a client-server environment. In the case of the latter, the server and client may be connected via the Internet, wide area networks (WAN), local area networks (LAN) or other private networks. The server and client may utilize conventional network connectivity, such as Token Ring, Ethernet, or other conventional communications standards. Where the client communicates with the server via the Internet, connectivity could be provided by conventional TCP/IP sockets-based protocol. In this instance, the client would utilize an Internet service provider to establish connectivity to the server. Stored in memory 62 of library system 20 is document system 22 (shown as a program product). As depicted, document system 22 includes parameter system 42, reception/location system 44, list system 46, processing system 48, retrieval system 50, index system 52, error system 54 and output system 56. As described above, user 10 manipulates user system 12 to designate parameters and create/transfer input files containing document requests to library system 20. 
Typical parameters that user 10 can designate include whether document system 22 should be run as a daemon, a minimum request quantity an input file must have before being processed, a maximum time limit an input file can wait before being processed, etc. In one embodiment, the parameters are designated using programmable flags. Listed below are some exemplary flags and corresponding definitions:
-c output_dir: The file system where data systems 26A-C store the retrieved documents. The default is the directory where the document system 22 is invoked.
-d input_dir: The file system that contains the input files. The input directory is a required flag. This flag is ignored when the document system 22 is executed using the -R requests_file flag.
-e "delimiter": The character that separates field values. The default delimiter is a comma.
-f folder: A folder name that is required when the document system 22 is run as a daemon.
-F parmfile_ext: The extension of the input file that contains the request records. The extension consists of the period character followed by three additional characters (for example: .ext). The default is ".prm". This parameter is ignored when the document system 22 is executed using the -R requests_file flag.
-h host: The fully qualified host name or IP address of library system 20. The host name is a required parameter.
-H: If specified, the input file is in the hit list format.
-I: If specified, an index file will be created. This flag is ignored when the document system 22 is executed using the -R requests_file flag.
-l: If specified, messages are written to a log file rather than sent to stderr and stdout. This flag is ignored when the document system 22 is run as a daemon.
-m min_nbr: The minimum number of entries that must be present to initiate processing. If the minimum number of entries is not present, document system 22 will sleep for -t seconds. The minimum number of entries is a required flag. This flag is ignored when the document system 22 is executed using the -R requests_file flag.
-n nbr_drives: The maximum number of drives 28A-C, per data system 26A-C, that will be utilized during the retrieval process. The default is one drive. This parameter is ignored when the document system 22 is executed using the -R requests_file flag.
-p password: The password for the user specified with the -u parameter. The password is a required parameter when the document system 22 is run as a daemon. If there is no password, specify -p "".
-r: If specified, reconciliation processing will be performed. This process consists of performing a query for all of the requests and generating an .rcn file if an error occurs. No documents will be retrieved. This flag is ignored when the document system 22 is executed using the -R requests_file flag. This flag cannot be used with the -H flag since database queries are not performed for the hit list file format.
-t seconds: The polling time in seconds. This is the interval at which the document system 22 checks the input directory for new input files. The default is 600 seconds (ten minutes). This parameter is ignored when the document system 22 is executed using the -R requests_file flag.
-R requests_file: The name of the file that contains retrieval requests. This parameter is used for immediate processing of the requests in the file. Primarily, this parameter will be used to resubmit requests that were not processed due to some type of error.
-T seconds: The maximum wait time in seconds. This value is compared to the last-update-time for the input file to determine if the retrievals in the input file should be processed. The maximum wait time is a required parameter. This parameter is ignored when the document system 22 is executed using the -R requests_file flag.
-u userid: The userid of a document system 22 user. The userid could be a required parameter when the document system 22 is run as a daemon.
-v: If specified, generate informational messages in addition to error messages.
Once any parameters are designated, user 10 will enter requests for documents. As indicated above, the requests will be packaged into one or more input files that could be in one of two formats. The first format is the hit list format, in which all information needed to retrieve the requested documents is contained within the input file. The input file in the hit list format includes one or more records, with each record representing a document request. Each field in the input record is separated by a delimiter character. The delimiter is identified by the value of the -e flag; by default, a comma is assumed. The input record also includes a unique identifier that is used as the name of the file where the document is saved. Other fields are also provided in the input record so that a hit can be recreated, thus eliminating the need to query database 24. Listed below are typical fields that could be provided:
unique identifier: The name of the file where the document will be written.
location: The location of the document: cache, storage media, external cache, or unknown.
agid: The application group identifier.
db_field1-db_fieldn: The database fields in application group order.
name: The object name identifier (e.g., 2FAAA).
offset: The offset into the object where the document can be located.
length: The length of the document.
comp offset: The offset into the compressed object.
comp length: The compressed object length.
annotation: A flag that indicates whether there are annotations.
compression type: The compression algorithm used.
resource id: The resource identifier.
primary node id: The primary node identifier.
secondary node id: The secondary node identifier.
If the input file is in query format, the file will include one or more records with each record representing a document request.
Similar to the hit list format input file, each field in the input record is separated by a character. The character that is used as the delimiter is identified by the value for the -e flag or by default, a comma is assumed. However, in the query format, fields 2-n in the record correspond to the index field values in the application group. The order in which the index field values are listed in the record could have many variations. In one embodiment, the order is by folder field query order (e.g. field 2 corresponds to the value for the 1st folder field listed in the Search Criteria area of the Client, field 3 corresponds to the value for the 2nd folder field listed in the Search Criteria area, etc.). The first field is not an index field, but is the document identifier. Index fields that are not used to identify the document can have "null" values specified in the record. Consecutive commas in the record specify null values and will not be included in the SQL search string. When using a null value for the last index field value, the last character on the record must be the delimiter (i.e. blank spaces are not allowed at the end of the record). In providing an input file in query format, document system 22 will have to query database 24 to build the hit list. In contrast, if the input file is received in hit list format document locations are provided and the hit list can be generated directly from the input file (i.e., without a query). In either event, reception system 44 will read input files 16 and optionally store the same in database 24. If input files are sent to a destination other than library system 20, and/or if document system 22 is stored on a computer system other than library system 20, reception system 44 will access the appropriate destination to read the input files. In any event, once read, list system 46 will process the input file (according to any parameters) to build the hit list of documents. 
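The query-format record rules described above (first field is the document identifier, remaining fields are index values, consecutive delimiters denote null values that are left out of the search) can be illustrated with a minimal Python sketch. The function names, the example record and the index field names are hypothetical and are not taken from document system 22; the sketch only shows one way such a record could be interpreted.

```python
from typing import Dict, List, Optional, Tuple


def parse_query_record(line: str, delimiter: str = ",") -> Tuple[str, List[Optional[str]]]:
    """Split one query-format record into (document identifier, index values).

    Empty fields (consecutive delimiters, or a trailing delimiter) are
    returned as None to represent null index values.
    """
    fields = line.rstrip("\n").split(delimiter)
    doc_id = fields[0]
    if not doc_id:
        # The first field is the document identifier and must be present.
        raise ValueError("document identifier is missing")
    index_values = [value if value != "" else None for value in fields[1:]]
    return doc_id, index_values


def search_terms(index_values: List[Optional[str]], field_names: List[str]) -> Dict[str, str]:
    """Pair non-null index values with (hypothetical) index field names.

    Null values are skipped, mirroring the rule that they are not
    included in the search string.
    """
    return {
        name: value
        for name, value in zip(field_names, index_values)
        if value is not None
    }
```

For a record such as `DOC1,Smith,,2001,` the second and fourth index values are null, so only the first and third would contribute to the search.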
As indicated above, document system 22 can be programmed to automatically generate a hit list as each new input file is received. Alternatively, user 10 could have designated parameters (e.g., minimum request quantity, maximum wait time, etc.) which would alter this schedule. In generating the hit list, list system 46 will perform all necessary steps. Specifically, referring to FIG. 3, list system 46 is shown in greater detail. As depicted, list system 46 includes query system 70 and generation system 72. If the input file was in hit list format, generation system 72 will process the file and generate the hit list directly therefrom. However, if the input file was in query format, query system 70 will generate the SQL query string(s) and conduct the necessary queries to retrieve the information needed by generation system 72 to generate the hit list.
Once the hit list has been generated, it must be processed via processing system 48. As shown in FIG. 4, processing system 48 includes sorting system 74, server system 76, drive system 78 and cache system 80. Sorting system 74 will sort (e.g., rearrange) the hit list according to data system 26A-C, cache, storage node and data object (as described above). Once sorted, server system 76 will split the hit list into sublists according to data system 26A-C. That is, a separate sublist will be created for each data system 26A-C so that if there are three data systems, three sublists will be created. This splitting will streamline the retrieval process because all documents on a sublist are stored within a particular data system and, thus, the retrieval time is shortened. Once split, the sublists will be further split into a next level of sublists according to storage drive and cache by drive system 78 and cache system 80, respectively. For example, if each data system 26A-C has documents stored on one drive and in cache, two next generation sublists will be created for each data system 26A-C, yielding a total of six next generation sublists.
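The sort-then-split processing just described can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Hit` record, its field names and the encoding of the location as either `"cache"` or a drive label are hypothetical, and the sort key is one plausible reading of "data system, cache, storage node and data object".

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Hit(NamedTuple):
    doc_id: str    # unique identifier of the requested document
    system: str    # data system holding the document
    location: str  # "cache" or a storage drive label (assumed encoding)
    node: int      # storage node identifier
    obj: str       # data object name


def sort_hits(hits: List[Hit]) -> List[Hit]:
    """Sort by data system, cache entries first, then storage node and data object."""
    return sorted(hits, key=lambda h: (h.system, h.location != "cache", h.node, h.obj))


def split_hits(sorted_hits: List[Hit]) -> Dict[Tuple[str, str], List[Hit]]:
    """Split first by data system, then by drive or cache within each system.

    A next-level sublist exists only for locations that actually hold
    requested documents, because keys are created on demand.
    """
    by_system: Dict[str, List[Hit]] = defaultdict(list)
    for hit in sorted_hits:
        by_system[hit.system].append(hit)

    sublists: Dict[Tuple[str, str], List[Hit]] = {}
    for system, system_hits in by_system.items():
        for hit in system_hits:
            sublists.setdefault((system, hit.location), []).append(hit)
    return sublists
```

With three data systems that each hold requested documents on one drive and in cache, `split_hits(sort_hits(hits))` would yield six next-level sublists, matching the count in the example above.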
Once the hit list has been processed (i.e., sorted and split), the documents therein can be retrieved. Referring to FIG. 5, retrieval system 50 is shown in greater detail. As shown, retrieval system 50 includes launch system 82 and bulk system 84. In a typical embodiment, the documents on the hit list are retrieved in parallel. Specifically, launch system 82 will launch a separate retrieval program (e.g., agent based) for each sublist created by processing system 48. In the example above, six sublists (next generation) were created. Accordingly, launch system 82 will launch six separate retrieval programs, which will each communicate with a particular drive or cache of a particular data system. Each of the retrieval programs will execute in parallel in retrieving the documents. In another embodiment, the documents could be retrieved in bulk via bulk system 84. Typically, bulk retrieval is implemented when document system 22 is run as a non-daemon and the hit list is not processed. That is, separate sublists are not created. In this case, bulk system 84 will launch one or more retrieval programs to retrieve all documents in the hit list in bulk. In either event, if input file 16 was provided in query format, a second query and hit list generation is required before the documents can be retrieved. In this case, list system 46 of FIG. 3 will reactivate to conduct the necessary query and list generation. Referring back to FIG. 2, index system 52 will create an index file in the event it was requested (e.g., via parameter system 42) by user 10. With respect to errors, error system 54 will detect and record errors that occur during document retrieval. In general, errors result from incorrect information in a request (record). In addition to detecting errors, error system 54 can generate informational messages detailing the entire retrieval process. The messages and detected errors can be reported to user 10 in the form of an error or log file. 
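As a rough illustration of the parallel retrieval and error recording described above, the following Python sketch starts one worker per sublist and collects any failures for later reporting. It uses threads as a stand-in for the separate retrieval programs, and the `fetch` callable, the shape of the `sublists` mapping (keyed by a hypothetical (data system, location) pair) and the function names are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

SublistKey = Tuple[str, str]  # (data system, drive or cache); assumed key shape
SublistResult = Tuple[List[Tuple[str, bytes]], List[Tuple[str, str]]]


def retrieve_sublist(
    key: SublistKey,
    doc_ids: List[str],
    fetch: Callable[[SublistKey, str], bytes],
) -> SublistResult:
    """Retrieve every document on one sublist, recording failures instead of stopping."""
    retrieved: List[Tuple[str, bytes]] = []
    errors: List[Tuple[str, str]] = []
    for doc_id in doc_ids:
        try:
            retrieved.append((doc_id, fetch(key, doc_id)))
        except Exception as exc:  # keep going; the error is reported afterwards
            errors.append((doc_id, str(exc)))
    return retrieved, errors


def retrieve_in_parallel(
    sublists: Dict[SublistKey, List[str]],
    fetch: Callable[[SublistKey, str], bytes],
) -> Dict[SublistKey, SublistResult]:
    """Launch one worker per sublist and wait for all of them to finish."""
    results: Dict[SublistKey, SublistResult] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(sublists))) as pool:
        futures = {
            key: pool.submit(retrieve_sublist, key, doc_ids, fetch)
            for key, doc_ids in sublists.items()
        }
        for key, future in futures.items():
            results[key] = future.result()
    return results
```

Because each worker handles exactly one drive or cache, requests to the same storage location stay sequential while different locations are served concurrently; the collected errors could then be written to an error or log file.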
The error file name is generally the name of the input file containing the error with an error code extension that defines the error. If the original input file contains multiple records and several of the records have the same error, multiple records will be written to the error file. The following is a list of error codes/extensions and a description of their meaning:
.RC1: There are not enough index field values in the request record.
.RC2: There are too many index field values in the request record.
.RC3: An index field value is not valid.
.RC4: The document identifier is missing in the request record.
.RC5: The document file name is not valid.
.RC6: The database query did not find a document for the index field values.
.RC7: The database query failed.
.RC8: The data system for the document could not be determined.
.RC9: The requests file specified with the -R flag could not be opened.
.RC10: The input file does not contain any request records.
.RC11: The input file could not be opened.
.RC12: The document file could not be opened.
.RC13: The resource group could not be retrieved.
.RC14: The document could not be retrieved.
.RC15: A hit could not be recreated from the request record because of invalid or missing fields.
.RC16: The document location is missing from the request record.
.RC17: The application group identifier (agid) is missing from the request record.
.RC18: The application group identifier (agid) in the request record is not valid.
.RC19: The application group identifier (agid) in the request record is not in the folder.
Once any index files and/or error files have been created, output system 56 will output the same to user 10.
Referring now to FIG. 6, a first flow diagram according to the present invention is shown. In general, the program is started 102 and parameter values are read 104. Parameter values are those designated via parameter system 42 and described above.
Once the parameter values are read, it will be determined whether the values are valid 106. If not, the program is terminated 108. If, however, the values are valid, it is determined whether the program is being run as a daemon 110 (i.e., continuously). If not, the process will follow the steps set forth beginning with step "B" 112 in flow diagram 300 (FIG. 8). If the program is being run as a daemon, input files 16 will be read 114. As indicated above, input files 16 are read by document system 22 from whatever destination they were transmitted to from user system 12. To this extent, input files 16 need not be transmitted to library system 20. Moreover, document system 22 need not be loaded on library system 20. In any event, once input files 16 have been read, it will then be determined whether input files 16 contain a minimum quantity of requests 116, as specified by the parameters. If not, it will be determined whether the maximum wait time has been exceeded 120. If not, the program will wait for the specified polling time 122 and input files 16 will be read again 114. If, however, the minimum quantity of requests was provided, or the maximum wait time was exceeded, the request list will be built 118. Once built, it will be determined whether the input file was in hit list format 124. If so, the hit list can be built directly from the input files 126. If not, the hit list must be built by querying database 24 of library system 20. In either event, once the hit list has been generated, the process proceeds to step "D" 130 in flow diagram 200.
Referring now to FIG. 7, flow diagram 200 is shown in greater detail. As depicted, once the hit list has been formed, it will be sorted according to system (data systems 26A-C), cache, data object and/or storage node identification 202, split according to system 204, and then split again according to drive 206 and cache 208 (if any). Then, error files are generated detailing any errors that were detected 210.
It will then be determined whether index files were requested 212. If so, the index files will be created 214. If not, the program will proceed directly to document retrieval. To this extent, a separate retrieval program will be started for each drive sublist 216, while a separate retrieval program will be started for every cache sublist 218. As indicated above, the retrieval programs execute in parallel to retrieve the requested documents. Once complete, and if being run as a daemon (i.e., continuously), the program will return to step "C" 132 in flow diagram 100 to repeat the steps beginning with reading the input files 114. Referring now to FIG. 8, a non-daemon method flow diagram 300 is shown. As indicated above, in step 110 of flow diagram 100, it will be determined whether document system 22 is being run as a daemon. If not, the program will proceed to step "B" 112 of flow diagram 300. From this step, input files 16 are read 302 and the request list is built 304. Similar to flow diagram 100, it will then be determined whether input files 16 are in hit list format 306. If not, the hit list is built by query 310. If, however, input files 16 are in hit list format, the hit list is built simply by rearranging the information therein 308. Once the hit list is built, the documents are retrieved in bulk 312. That is, the hit list is not processed (sorted and split). Rather all documents on the hit list are simply retrieved in bulk. Once retrieved, error files for any input files 16 containing errors are generated 314 and the program is terminated 108. It is understood that the present invention can be realized in hardware, software, or a combination of hardware and software. Any kind of computer/server system(s)—or other apparatus adapted for carrying out the methods described herein—is suited. 
A typical combination of hardware and software could be a general purpose computer system with a computer program that, when loaded and executed, controls library system 20 such that it carries out the methods described herein. Alternatively, a specific use computer, containing specialized hardware for carrying out one or more of the functional tasks of the invention could be utilized. The present invention can also be embedded in a computer program product, which comprises all the features enabling the implementation of the methods described herein, and which—when loaded in a computer system—is able to carry out these methods. Computer program, software program, program, or software, in the present context mean any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: (a) conversion to another language, code or notation; and/or (b) reproduction in a different material form. In a typical embodiment, the present invention is implemented using CONTENT MANAGER ONDEMAND available from International Business Machines, Corp. of Armonk N.Y. The foregoing description of the preferred embodiments of this invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed, and obviously, many modifications and variations are possible. Such modifications and variations that may be apparent to a person skilled in the art are intended to be included within the scope of this invention as defined by the accompanying claims.
