Database system and method for data acquisition and perusal
6834276

Abstract
A data acquisition and perusal system and method including a database selection module, a database index generator module and a search module. The database selection module enables selection of a plurality of files for inclusion into at least one selectable database. The database index generator module enables generation of a searchable index of the data contained in the selectable database. The search module enables a search to be performed of the searchable index according to search criteria. The data acquisition and perusal system and method may also allow users to view, acquire, and generate single- or multiple-data sources locally or remotely, and allow users to compile, index, modify, and append the data sources according to default or user defined criteria. The data acquisition and perusal system and method may also selectively acquire and display data contained within remote databases depending upon the user's access permissions to such databases. Such a system allows for the capture of hypertext data which is automatically indexed without human intervention and has the ability to automatically and accurately locate or "pinpoint," and highlight specific text or groups of text designated by the user within the resulting database. Such a system contains a link module that enables custom links to be defined between selected terms of selected files of the selectable database including the custom links so that the searchable index includes only valid links.

Claims
What is claimed is:

Description
FIELD OF THE INVENTION
TABLE I
Long Integer Interpretation

If first long integer (x) is: | And second long integer (y) is: | Interpretation:
Positive and less than the number of files in the database. | Positive. | First number (x) is the number of files in the database containing the word. Second number (y) is an index to the file position in Part 1 of the Master Word Index at which starts the list of file numbers containing this word. The list of numbers is x entries long.
Positive, and greater than the number of files in the database. | Positive. | x indicates the number of files that DO NOT contain the given word. This number is determined by subtracting the number of files in the database from x. y is an index to the file position in Part 1 of the Master Word Index at which starts the list of the file numbers that do NOT contain this word. The length of this list is the number x, less the number of files in the database.
Positive. | -1 | x is the file number of the one and only file in the database which contains this word. (No entry is needed in Part 1.)
-1 | -1 | All files in the database contain this word. (No entry is needed in Part 1.)
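The four rows of Table I can be read as a small decision procedure. The following Python sketch is illustrative only: the function name, the `read_part1` helper, and the sample offsets are hypothetical, and the case where x equals the number of files (which Table I does not list) is treated here as an ordinary inclusion list.

```python
from typing import Callable, List


def files_containing(x: int, y: int, num_files: int,
                     read_part1: Callable[[int, int], List[int]]) -> List[int]:
    """Return the 1-based file numbers that contain a word, following Table I.

    read_part1(offset, count) is a hypothetical helper returning `count`
    file numbers stored in Part 1 starting at position `offset`.
    """
    all_files = list(range(1, num_files + 1))
    if x == -1 and y == -1:
        return all_files                      # row 4: every file has the word
    if y == -1:
        return [x]                            # row 3: x is the only file
    if x > num_files:
        # row 2: Part 1 lists the files that do NOT contain the word
        missing = set(read_part1(y, x - num_files))
        return [f for f in all_files if f not in missing]
    return sorted(read_part1(y, x))           # row 1: Part 1 lists the files


# Hypothetical Part 1 contents, keyed by position.
part1 = {68: [1], 156: [1, 2]}


def read_part1(offset: int, count: int) -> List[int]:
    return part1[offset][:count]


three_file_hits = files_containing(4, 68, 3, read_part1)
four_file_hits = files_containing(6, 156, 4, read_part1)
```

Storing the exclusion list whenever a word appears in most files keeps the Part 1 list short in both the common-word and rare-word cases.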
The information contained in Part 2 of the master word index 202 enables the search module 183 to expedite searching procedures for any search query that may be entered into the search module 183. Part 3 is a sequence of three indices, herein referred to as a first index, a second index, and a third index, for eliminating search terms that do not appear in Part 2 of the master word index file. Essentially, once a database index has been generated, the search module 183 uses Part 3 as a "negative search" index, i.e., an index to quickly eliminate search terms that do not appear in the database. In one embodiment, before the first of these three indices, there is a two-byte ASCII 5, ASCII NULL pair that serves as a dividing point between Parts 2 and 3. The first index of Part 3 is a numeric index which consists of 110 long integers. The first ten long integers are indices into the Part 2 information for words starting with "0"-"9". Thus, when the database index 200 is generated, offsets for the words starting with "0"-"9" in the Part 2 data are recorded in each of the first ten long integers. If no word in Part 2 starts with the given single digit, four ASCII 255's are written into the corresponding long integer of the first ten long integers. Following these ten long integers are 100 long integers for words starting with the pairs "00"-"99". Similar to the first ten long integers, offsets for words in the Part 2 data are recorded, but if no word starts with the given pair, four ASCII 255's are written to that long integer of the first index. The second index is an index for "odd" leading characters. This index is a list of 255 long integers, corresponding to ANSI characters 1-255. Like the first index, offsets for words in the Part 2 data are recorded, but if no word in Part 2 starts with a given character, four ASCII 255's are written to the corresponding long integer of the second index. 
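As a rough illustration of the first (numeric) index just described, the sketch below builds the 110 long integers from a word-to-offset mapping. Several details are assumptions not stated in the text: little-endian unsigned encoding, that each slot records the smallest Part 2 offset among the words sharing a prefix, and that a lookup prefers the two-digit slot over the one-digit slot. The function names are hypothetical.

```python
import struct
from typing import Dict, Optional

DIGITS = "0123456789"
EMPTY = 0xFFFFFFFF  # four bytes of 255: no word starts with this prefix


def _is_digit(ch: str) -> bool:
    return len(ch) == 1 and ch in DIGITS


def numeric_slot(word: str) -> Optional[int]:
    """Slot 0-9 for a one-digit prefix, 10-109 for a two-digit prefix."""
    if _is_digit(word[:1]) and _is_digit(word[1:2]):
        return 10 + int(word[:2])
    if _is_digit(word[:1]):
        return int(word[0])
    return None  # not a numeric leading character


def build_first_index(word_offsets: Dict[str, int]) -> bytes:
    """Pack 110 long integers; unused slots keep the all-255 marker."""
    slots = [EMPTY] * 110
    for word, offset in word_offsets.items():
        if _is_digit(word[:1]):
            d = int(word[0])
            slots[d] = min(slots[d], offset)
            if _is_digit(word[1:2]):
                p = 10 + int(word[:2])
                slots[p] = min(slots[p], offset)
    return struct.pack("<110I", *slots)


# Hypothetical Part 2 offsets for a few words.
first_index = build_first_index({"1984": 40, "19th": 52, "7up": 90, "apple": 120})
slots = struct.unpack("<110I", first_index)
```

A search term whose slot holds the all-255 marker can be rejected without reading Part 2 at all, which is the "negative search" behavior described above.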
Also, if the given character is a letter, a numeric digit, or any other character that a user is not intended to find with the search module 183, four ASCII 255's are written to the long integer that represents that character. The third index is a list of long integers that index words with alphabetical leading characters. The third index is of variable length depending on whether the index is a two or a three dimensional index (to be described herein). The first 26 long integers in the third index are offsets for words in the Part 2 data that begin with the single letters "a" through "z". If no words in Part 2 begin with a given letter, four ASCII 255's are written to the corresponding long integer. The next 676 (26 squared) long integers of the third index are offsets for words that begin with the pairs "aa", "ab", "ac", etc., through "zz", thus creating a "two dimensional" index from the third index. Offsets for these words in the Part 2 data are recorded in the 676 long integers, but if no word begins with a given pair, four ASCII 255's are written to the corresponding long integer. If desired, the third index can be a "three dimensional" index, i.e., an index including references to single alpha characters (26), pairs of alpha characters (676), and three alpha characters. If the index is three dimensional, then 26 cubed (17576) long integers follow "zz". These long integers index words beginning with the triplets "aaa", "aab", "aac", etc., through "zzz". Again, if no word begins with a given triplet, four ASCII 255's are written to the corresponding long integer for that triplet. Following these three indices is a nine byte string. The string begins with a single character that is ASCII 2 if the third index is two dimensional, and ASCII 3 if the third index is three dimensional. Following this character is a long integer corresponding to the offset at which the Part 2 data begins, i.e.
the first character following the Part 1 data, if there is any Part 1 data. The last four bytes are a long integer corresponding to the first byte that follows the last byte of the Part 2 data. This is the offset for the ASCII 5 in the ASCII 5, ASCII NULL pair that tags the beginning of the three indices of Part 3. Because the size of the three indices of Part 3 can be computed exactly based on the known dimensions of the alpha locator string as coded in byte 1 of this 9 byte string, this final four-byte long integer is not strictly necessary. After the search module 183 determines which files contain the search terms, a word number index 203 is accessed to find the exact location of the search terms in each file of the database. The word number index 203 is included in the database index 200 and can be described by two files, a DSI file 204, and a DSF file 205. The terms "DSI" and "DSF" are somewhat arbitrary character strings and are commonly used as file extensions for the respective files in the word number index 203. Broadly speaking, the terms represent a file (DSF) and an index (DSI) to that file, but for purposes of understanding, each term is referred to as a file from a portion of the database index 200. It should be noted that, in a similar manner, the remaining portions of the database index 200 are also designated with similar character strings to designate files included in the respective portions of the database index 200. The word number index 203 is used by the search module 183 to find the character and slot positions of words in database files. A character position is defined as the number of the logical byte or character in a file at which a word starts. For text files this is straightforward. For RTF, DOC (MS-Word), and HTM files, a translation from the actual binary file as stored on the disk to the logical file is necessary. 
A slot position is defined as the numeric position of the word in the file, a "word" being defined as any contiguous unit of text, including stop words, that appears between white space. Hence, for a file whose sole contents is the string "Have a nice day!", the word "nice" has a character position of 7 because the count starts at 0, where `H` is at position 0. In addition, the word "nice" has a slot position of 3 because the count starts at 1, where "Have" is at position 1. As stated, the DSI file 204 is an index into the DSF file 205 and contains a list of indices. This list contains a sequence of long integer pairs, encoded as eight bytes, for each file in the database. For a file which contains searchable words and has an entry in the DSF file 205, the first long integer in a DSI long integer pair is a start position in the DSF file 205 of information relating to that file and the second long integer in the pair is an end position of the information in the DSF file 205. For a file which contains no searchable words such as an HTM file that is simply a frame container, or a nonsense file that is filled with stop words only, each long integer of the long integer pair has a value less than 0, indicating that no DSF entry exists for the particular file. With reference to FIG. 2A, the DSF file 205 for a database index 200 contains a sequence of word position tables 219 for each file in the database that contains searchable terms. Of note, some files of the database may be without searchable terms and, thus, not included in the DSF file 205. As stated, examples of files without searchable terms might include HTM pages that describe frame containers only, and thus have no searchable data of their own, or nonsense files which contain only stop words. The beginning and end of each word position table 219 in the DSF file 205 is coded in the companion DSI file 204. 
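The character position and slot position definitions above can be checked with a short Python sketch. It assumes the logical text of the file is already available as a string (for RTF, DOC, or HTM files the translation step mentioned earlier would have to run first); the function name is hypothetical.

```python
import re
from typing import List, Tuple


def word_positions(text: str) -> List[Tuple[str, int, int]]:
    """Return (word, character_position, slot_position) for each unit of
    text between white space. Character positions count from 0, slot
    positions count from 1, matching the definitions in the text."""
    return [(m.group(), m.start(), slot)
            for slot, m in enumerate(re.finditer(r"\S+", text), start=1)]


positions = word_positions("Have a nice day!")
```

Here the entry for "nice" carries character position 7 and slot position 3. Punctuation stays attached to its word ("day!") because the definition above splits on white space only.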
For each file which has a word position table 219, the table 219 is laid out in columns as shown by a single row view. The first column of the word position table 219 includes character positions 220. The character positions 220 comprise variable length binary strings containing a sequence of long integers indicating character positions at which a given word appears in the file for which the word position table 219 was generated. In the second column of the word position table 219, a word slots list 222 is provided which is another variable length binary string containing another sequence of long integers, each indicating a slot position at which given words in the file appear. The correspondence between the character positions 220, the word slots 222 and their associated words is recorded in a locator string 224, i.e., the third column of the file's word position table 219. In this embodiment, the locator string 224 is a variable length binary string containing a sequence of twelve-byte sub-segments, each sub-segment coding three long integers. As illustrated in FIG. 2B, each twelve-byte sub-segment of the locator string 224 begins with a word number 228. The word number 228 is followed by a character position index 230 which is an index into the first column of the word position table 219 and indicates the location of the long integer that represents the position of the first character of the word in the file. This character position index 230 is followed by a slot position index 232 which is an index into the second column of the word position table 219, the word slots list 222, and indicates the location of the long integer that represents the position of the word in the file. Referring to FIG. 2A, a number of elements in locator string 226 comprises the fourth and last column in the word position table 219. The number of elements in locator string 226 is a long integer and stores the number of sub-segments in the locator string 224. Referring back to FIG. 
2, a WDN file 216 is shown that represents a streamlined master word index 202 and contains data that is loaded into WDN maps, which are used for word searches on primary databases. These searches are typically faster than direct searches of the master word index 202 because the WDN file 216 is commonly loaded directly into the memory 106 of the computer system 100. Of course, compared to accessing the hard disk storage system 120 of the computer system 100, the memory 106 provides faster access for the search module 183. However, the memory 106 is limited in size and, thus, the size of the WDN file 216 may be limited. In this embodiment, the data in the WDN file 216 consists of segments, one segment for each word in the database, where each segment consists of 52 bytes. The first 40 bytes contain the string representation of a given search word (e.g. "apple"). This string is padded on the right with spaces, so that it is always 40 bytes long, thus allowing easier loading into the word map. The next twelve bytes precisely duplicate the data in the three long integers stored in Part 2 of the master word index 202. In other words, the first long integer of the twelve bytes encodes the word's word number. The next eight bytes encode two long integers, whose interpretations depend upon one another. Refer to Table I for possible interpretations. For file/document organization, the database index 200 also includes a contents table 209 to assist the search module 183 to organize files/documents for display when a search has completed. In this embodiment, the contents table 209 includes two files, a COI file 210 and a COF file 211. The contents table 209 operates in conjunction with fields list files 212. The COI file 210 is an index into the COF file 211. The COI file 210 contains a sequence of four-byte binary encoded long integers, one long integer for each file in the database. 
These long integers encode a start position in the COF file 211 at which information for the given file begins. For example, to find the field information for the thirteenth file in a twenty-file database, the software of the computer system 100 retrieves the thirteenth long integer encoded in the COI file 210. The system 100 retrieves the fourteenth long integer encoded in the COI file 210 to determine where the fourteenth file's information begins and the thirteenth file's information ends in the COF file 211. Using these two values, the system 100 then extracts the characters from the COF file 211 and thus obtains all the field information for file thirteen of the database. Of course, for file twenty in this example, the system 100 simply reads the twentieth long integer in the COI file 210 to find the start position for the information in the COF file 211. Since no file follows the last file, the end position for the information is simply the end of the COF file 211. The COF file 211 contains the field information for each file in the database. Each file in a given database has the same number of fields, although a particular file may have several blank fields. It should be noted, however, that different databases may have different numbers of fields for their files. For example, HTM databases typically have fewer fields per file than databases containing MS-Word documents. Field information for a particular file is tab delimited. In the embodiment shown, characters are not used to delimit the field information for one file from the field information for another file. Instead, the last text character of field information for one file is immediately followed by the first character of field information for the next file. When performing a search of a database, search results for a database may be ordered based on a number of different file fields taken from the fields list files 212, including title and date fields. 
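The COI/COF lookup described above (read one start position, take the next file's start position or the end of the COF file as the end) might look like the following Python sketch. The byte order (little-endian), 0-based offsets, the Latin-1 text encoding, and the sample field values are all assumptions made for illustration; `fields_for_file` is a hypothetical name.

```python
import struct
from typing import List


def fields_for_file(coi: bytes, cof: bytes, file_number: int) -> List[str]:
    """Return the tab-delimited fields for a 1-based file number."""
    n_files = len(coi) // 4                       # one four-byte long per file
    starts = struct.unpack(f"<{n_files}i", coi)
    start = starts[file_number - 1]
    # The next file's start marks this file's end; the last file runs to EOF.
    end = starts[file_number] if file_number < n_files else len(cof)
    return cof[start:end].decode("latin-1").split("\t")


# Build a tiny three-file contents table with five fields per file.
records = [
    "a.htm\t\t\tFirst title\t1999",
    "b.htm\t\t\tSecond title\t2000",
    "c.htm\t\t\tThird title\t2001",
]
cof = b""
offsets = []
for record in records:
    offsets.append(len(cof))
    cof += record.encode("latin-1")               # no delimiter between files
coi = struct.pack(f"<{len(offsets)}i", *offsets)

second = fields_for_file(coi, cof, 2)
last = fields_for_file(coi, cof, 3)
```

Because the COF records carry no delimiter between files, the COI offsets are the only way to tell where one file's fields stop and the next file's begin.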
The fields list files 212 aid in determining a proper sort order for files based on different fields. These different files are designated CO1, CO2, . . . CO# files 213. Each of these files 213 is a list of four-byte binary encoded long integers. The long integers correspond to the numbers of each file in the database. The file numbers are presented in the order in which those files should be presented so that the files are sorted according to the given field order. For example, in a four-file database where field 1 is a title field and the files in the database are as follows: File 1--TITLE: "Warthogs Eat Wooly Worms" File 2--TITLE: "Canaries Crave Caraway Seeds" File 3--TITLE: "Aardvarks Ate Ants" File 4--TITLE: "Dogs Dine on Dairy Dumplings"; the CO1 file contains the file numbers 3, 2, 4, 1 in that order, because the alphabetical sort order for these files by title is Aardvarks (file 3), then Canaries (file 2), then Dogs (file 4), then Warthogs (file 1). In this example, the CO2 file is based on a date field in the files so that the file numbers are in a different order based on date. Thus, the files 213 each contain a presorted list of file numbers that assist the search module 183 to organize the files found in a search based on a selected field. Referring to FIG. 2, the WDN file 216 is part of a word lists structure 214. The word lists structure 214 includes files that contain different organizations of information associated with the words from the selected databases, the files being available to expedite the search of the database index 200 for the terms of a search phrase. In this embodiment, the word lists structure 214 includes a word length (WDL) file 215 that comprises an index of words according to their length, a reverse word order (WDR) file 217 that comprises an index of words spelled in reverse order (i.e., right to left order) and that are alphabetized according to the reverse spelling of the words, and the WDN file 216. 
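For the presorted field lists (the CO# files 213) described above, a minimal Python sketch of generating and reading such a list follows. It uses the four titles from the example; the little-endian encoding and the case-insensitive comparison are assumptions, and the function names are hypothetical.

```python
import struct
from typing import Dict, List


def build_sort_file(field_values: Dict[int, str]) -> bytes:
    """Pack file numbers as four-byte longs in the order the files should
    be displayed when sorted by this field."""
    order = sorted(field_values, key=lambda n: field_values[n].lower())
    return struct.pack(f"<{len(order)}i", *order)


def read_sort_file(data: bytes) -> List[int]:
    return list(struct.unpack(f"<{len(data) // 4}i", data))


titles = {
    1: "Warthogs Eat Wooly Worms",
    2: "Canaries Crave Caraway Seeds",
    3: "Aardvarks Ate Ants",
    4: "Dogs Dine on Dairy Dumplings",
}
co1 = build_sort_file(titles)
co1_order = read_sort_file(co1)
```

Doing the sort once at index time means a search only has to compare hit file numbers against a stored order rather than load and compare field strings.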
Thus, the word lists structure 214 is useful when a search query includes terms such as leading conflation searches, i.e., searches that call for all words meeting a search criteria in which only the last few letters of the search term are required to be met in the search query. For example, a search for "*ample" creates a hit for the words "sample", "example", "ample", etc. In this embodiment, if the search term is not found in the WDN file 216, the search for that term is terminated because the files/documents of the selected databases do not contain the term of the search query. If the search term is found in the WDN file 216, the exact location of additional information about the term stored in the master word index 202 is provided to the search module 183. If the computer does not have enough memory 106 to store the WDN file 216 in a memory map, the master word index 202 is searched directly for all information about the word, thus bypassing the WDN file 216 of the database index 200. In one embodiment, WDN files 216 of three databases are stored in memory 106, if possible, because users frequently select three or less databases to search and, typically, three or less WDN files 216 do not overly burden the memory 106 of a computer system operating the search module 183. Of note, the search module 183 must still perform more tasks before displaying the documents that fit the search conditions, and these tasks are not necessarily related to any specific search. Any document displayed also exhibits any hypertext jump links tying it to other files in the database to which it belongs. When the database is indexed to generate the index files, a jump link list 206 is also generated. It contains an OAI file 207 comprising an index into an OAF file 208, which contains expansive data about hypertext links that exist in the database files. 
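The leading conflation search mentioned above (for example "*ample") shows why a list of reversed, alphabetized words is useful: a required suffix becomes a required prefix, which a binary search can locate. The Python sketch below illustrates the idea only; the actual layout of the WDR file 217 is not given in the text, and the function names and sample word list are hypothetical.

```python
from bisect import bisect_left
from typing import Iterable, List


def build_reverse_list(words: Iterable[str]) -> List[str]:
    """Spell each word right to left, then alphabetize the reversed forms."""
    return sorted(w[::-1] for w in words)


def suffix_matches(reverse_list: List[str], suffix: str) -> List[str]:
    """Return all words ending in `suffix`, in reversed-spelling order."""
    key = suffix[::-1]
    i = bisect_left(reverse_list, key)        # first candidate with this prefix
    found = []
    while i < len(reverse_list) and reverse_list[i].startswith(key):
        found.append(reverse_list[i][::-1])   # restore normal spelling
        i += 1
    return found


wdr = build_reverse_list(["sample", "example", "ample", "apple", "temple"])
hits = suffix_matches(wdr, "ample")
```

All matches sit in one contiguous run of the reversed list, so the search touches only the matching entries plus one more.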
To assist in the understanding of the database index 200, the following narrative of a search for the word "unique" from the perspective of FIG. 2 is offered. In this example, a database index is created for each of three databases. One database includes three HTM files, a second database includes three RTF files, and a third database includes four DOC files. In each of the databases, the word "unique" appears twice in one document and once in another document. Therefore, upon a search for the word "unique", each database has two files with at least one hit, one file with two hits and one file with one hit. The user selects the three databases and generates database indexes. The user presses "Enter" in the search dialog, requesting a search of the selected databases for the word "unique". The search module 183 determines that there are three databases selected, and all are primary databases. Because they are primary databases, the corresponding WDN files 216 are loaded into memory 106. Starting with database 1 (the HTM database), the search module 183 searches the HTM WDN file for the word "unique". The return value indicates that "unique" exists in this database, has a given word number (e.g., 138), and has two associated numeric values. In this case, the two values might be 4 and 68. The interpretation of the numeric values is carried out according to the interpretations described in Table I, where x=4 and y=68. Because the HTM database is a three-file database, and x is 4, then row 2 of Table I applies, i.e., x (or 4) minus the number of files (3) equals one. Thus, one file does NOT contain the word "unique", but the other files do. The file number of the single file that does not contain the word "unique" may be found at position y=68 in the master word index 202. The search module 183 next looks in the master word index 202 at position 68 and reads one four-byte binary encoded long integer, whose value is 1. 
This is interpreted to mean that files 2 and 3 in this database contain the word "unique". Thus, all the files in the first database that contain the word "unique" are known. The search module 183 next performs a search on the second (RTF) database with similar results, perhaps finding that "unique" was word number 122 and files 1 and 3 contain the word "unique". This is followed by a check of the third database, i.e., the four-file MS-Word DOC database, where the word number is 190 and the numeric values are x=6 and y=156. Again, according to Table I, the return values indicate that two (6-4=2) of the four files in the database do not contain the word "unique", and those two files are recorded at position 156 of the master word index file 202. Reading the two four-byte binary encoded long integers at position 156 in the master word index 202 indicates that files 1 and 2 do not contain the word "unique", and thus files 3 and 4 do contain the word "unique". Thus, at this point, the search module 183 has determined that each of the three databases has two files that contain the word "unique". These files include Files 2 and 3 of Database 1, Files 1 and 3 of Database 2 and Files 3 and 4 of Database 3. With this information in hand, the next step of the search module 183 is to display the titles and other appropriate fields of the found files in the dialog, in the sort order specified by the user. In this example, assume that the user is sorting by document title and that the document title corresponds to field number four. First, the search module 183 reorders its file number hits list to correspond to the final display selected by the user. Initially, the file number order may be represented as the following ordered pairs (database number, file number): (1,2), (1,3), (2,1), (2,3), (3,3) and (3,4). The search module 183 begins by loading the full contents of the first database's CO4 file (213, member of 212), since ordering is by field number four. 
A comparison of the ordered contents of the CO4 file to the two "hit" file numbers for database 1 indicates that file 3 should be displayed before file 2. This process is repeated for databases 2 and 3, resulting in a final sorted list of: (1,3), (1,2), (2,1), (2,3), (3,4), (3,3). Now that the search module 183 has sorted the complete hits list, the numeric pairs are translated to field list strings 212. The search module 183 begins by looking in the COI file 210 of Database 1's contents table 209. In this example, the COI file 210 indicates that the field information for file 3 begins at position 112. Further, because 112 is the third and final number stored in the COI file 210, and the total file length for the COF file 211 is 172, the field information for file 3 ends at position 172. Reading the data in the COF file 211 from position 112 to 172, the search module 183 gives the fields for the file, including a file name (field one) of "1 uniq.htm", a title field (field four) of "Unique appears only once", and a closing date field, with blank fields in between. The search module 183 sorts these fields and composes a string in which field four is presented first, followed by the database name, followed by a number of other mostly blank fields (excluding the file name), and concluding with the file date. This string is output to the display. A similar process is carried out for each file hit, allowing a total of six field strings to be output to the dialog display 112. At this point, it is up to the user to select a file to view. If the user selects the third file in the list, which would be the first file of database 2, the dialog is closed and file 1 of database 2 starts to open. During the opening process, OAI and OAF files 207 and 208 for database 2 are checked to see if any string ranges in the RTF file need to be highlighted and treated as jump links. In this case, no jump links exist in the file. 
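The reordering of the hits list against each database's CO4 file can be sketched in Python as below. The CO4 contents shown are invented for illustration (the text gives only the resulting order), chosen so that the result matches the sorted list stated above; `order_hits` is a hypothetical name.

```python
from typing import Dict, List, Tuple

Hit = Tuple[int, int]  # (database number, file number)


def order_hits(hits: List[Hit], sort_orders: Dict[int, List[int]]) -> List[Hit]:
    """Order hits database by database, and within each database by the
    position of the file number in that database's presorted field list."""
    rank = {db: {f: i for i, f in enumerate(order)}
            for db, order in sort_orders.items()}
    return sorted(hits, key=lambda h: (h[0], rank[h[0]][h[1]]))


# Hypothetical CO4 contents for the three databases of the narrative.
co4 = {1: [3, 1, 2], 2: [1, 2, 3], 3: [4, 1, 3, 2]}
hits = [(1, 2), (1, 3), (2, 1), (2, 3), (3, 3), (3, 4)]
sorted_hits = order_hits(hits, co4)
```

Note that, as in the narrative, hits remain grouped by database; only the order within each database changes.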
Also during the opening process, the word number index 203 for database 2 is used to determine the character ranges in file 1 of database 2 that are to be highlighted and treated as search terms located in the file. The first step in using the word number index 203 occurs when the search module 183 opens the DSI and DSF files 204 and 205 for database 2. The DSI file 204 is a binary file listing pairs of long integers, each long integer coded as a four-byte binary number. Every file in a database has a corresponding pair of long integers in the DSI file 204, listed in file number order. Hence, file 1 corresponds to the first pair of long integers in the DSI file 204, and the last file in the database corresponds to the last pair of long integers in the DSI file 204. If both long integers are positive in value, then they are interpreted as beginning and ending indices into the DSF file 205, indicating the start and end of a word position table 219 describing a database file. If both long integers are less than 0, then the DSF file 205 contains no entry for this file. In the case of file 1, a DSF 205 entry exists, so the first two long integers in the DSI file 204 indicate the beginning and ending ranges for this entry in the DSF file 205. The search module 183 temporarily extracts this segment into main memory 106 and examines it. The layout of information in this segment is determined by first examining the last four bytes of this segment, and translating it into a number. The number is the number of elements in the segment's locator string 224, which immediately precedes the last four bytes of the segment. The search module 183 knows that each locator string 224 entry is twelve bytes long, and thus the locator string 224 is 1200 bytes long if the number of elements is 100. The search module 183 then examines the first entry in the locator string 224. This entry, as is true of all the entries, codes three long integers in its twelve bytes. 
The first four bytes code the word number 228 for the first indexed word in the file. For example, the file may begin with the word "Zebra" and end with the word "aardvark", but since "aardvark" lexically precedes "Zebra", "aardvark" is considered the first indexed word in the file. The second four bytes indicate the character position index 230 information for this first word, which should be 0, indicating the beginning of this DSF 205 segment. The third set of four bytes indicates the start of the slot position index 232 information for this first word, which will thus be the position in this DSF 205 segment at which the word slots list 222 information begins. Thus, the DSF 205 segment has been divided into four parts, including the character positions 220 addressed by the second long integer of each locator string 224 entry; the word slots list 222 addressed by the third long integer of each locator string 224 entry; the locator string 224, in this case containing 100 twelve-byte segments; and the number of elements in locator string 226, in this case 100. As stated earlier, if the word number for "unique" in database 2 is 122, the locator string 224 is searched for an entry whose word number portion is 122. Once this locator string 224 entry is found, the second long integer in the entry is read and interpreted, for example, a value of 68. Following this, the second long integer of the next locator string 224 entry is read and interpreted, for example, a value of 76. Thus, the eight bytes starting at 68 and ending at 76 in this segment indicate the starting positions for the word "unique" in file 1. Since these bytes are interpreted as four-byte long integers, this indicates that "unique" occurs twice in file 1. For example, the first long integer could indicate that "unique" begins at character position 100 and the second long integer could indicate another instance beginning at character position 200. 
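The walk through a DSF segment described above can be sketched in Python as follows. This is a sketch under stated assumptions, not the patented implementation: long integers are taken to be little-endian and signed, and for the last locator entry (which has no following entry) the end of its character positions is assumed to be the start of the word slots list, as given by the third long integer of the first entry. The sample segment and the name `char_positions` are hypothetical.

```python
import struct
from typing import List


def char_positions(segment: bytes, word_number: int) -> List[int]:
    """Return the character positions stored for `word_number` in one
    word position table (one DSF segment)."""
    # The last four bytes give the number of twelve-byte locator entries.
    (count,) = struct.unpack_from("<i", segment, len(segment) - 4)
    loc_start = len(segment) - 4 - 12 * count
    entries = [struct.unpack_from("<3i", segment, loc_start + 12 * i)
               for i in range(count)]
    for i, (number, char_idx, _slot_idx) in enumerate(entries):
        if number == word_number:
            # End of this word's positions: the next entry's character
            # index, or the start of the slots list for the last entry.
            end = entries[i + 1][1] if i + 1 < count else entries[0][2]
            n = (end - char_idx) // 4
            return list(struct.unpack_from(f"<{n}i", segment, char_idx))
    return []


# A small hypothetical segment holding three indexed words.
chars = struct.pack("<4i", 0, 100, 200, 50)       # column 1, bytes 0-15
slots = struct.pack("<4i", 1, 20, 41, 10)         # column 2, bytes 16-31
locator = struct.pack("<9i", 7, 0, 16, 122, 4, 20, 300, 12, 28)
segment = chars + slots + locator + struct.pack("<i", 3)

unique_positions = char_positions(segment, 122)
```

Reading the count from the tail first is what lets the reader find the locator string without any header at the front of the segment.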
With this information, plus the knowledge that "unique" is six characters long, the search module 183 is able to identify character positions 100 to 106 and 200 to 206 of file 1 in database 2 as the location of the two instances of the search term in this file. These text ranges are indicated through operations such as highlighting, and the file is finally displayed for the user. Of course, the search module 183 treats the character positions in the remaining files in a similar fashion for indicating or highlighting the terms for a user. FIG. 3 is a flow diagram of an exemplary startup sequence of a database application program implemented according to the present invention. When a user starts the program, a user logon sequence is initiated at a block 301. The user logs in to the system, and the program first loads the previous interface display settings or default settings if there are no previous interface display settings at next block 302. The interface display settings include a list of selected databases. The program checks each database that has been selected for searching and validates selected database files at next block 303. If the validation fails as indicated at next block 304, a message is displayed alerting the user that the database has corrupt or missing files at block 305 and deselects the problem database from the program. If there are more databases that have not been validated as determined at block 306, then operation returns to block 303 to resume the validation procedure. Each database has an initialization file that the software of the system 100 uses to generate the database index 200. Once all selected databases have been validated or deselected and success is achieved at block 304, the validated databases' initialization files are loaded at next block 307 and then operation proceeds to next block 308, where a start screen is displayed and the program waits for user instructions. 
When logged in to the program, a user may generate a database index. FIG. 4 is a flow diagram of an index generator processing sequence of the database application of FIG. 3. When the user starts the database application, a database generator initializes and loads previous settings at block 400. The database generator then generates a table of files to process at block 401 based on the generator settings when the user begins the index generation process. The database generator then extracts field information (or data) from the top file in the processing table at block 402 and proceeds to the next file in the processing table as indicated at block 404 until all of the files have had their field data extracted for later compilation into the contents table 209 as determined at decision block 403. The next series of steps corresponds to producing data for creating the master word index 202 and the word lists 214. For each file that is processed, valid words are extracted from the file and inserted into a word table at next block 405, an index of the word locations in the file is generated at next block 406, and a table of link patterns and field matches among the files that have been processed up to that point is then generated at next block 407 as described in conjunction with the jump link list 206. Each file in the table of files is sequentially processed in like manner as indicated by block 409 until the last file has been processed as determined at block 408. In particular, operation loops between blocks 405-409 until the last file is processed as determined at block 408. Block 406's functions regarding HTML format files are more fully illustrated by FIG. 4A. The format is first determined to be an HTML file or a non-HTML file at block 417. If the file is not an HTML file, a fast and straightforward string analysis method is used to determine the locations of words within the displayable text string of the file. 
For example, if a file consists solely of the string "hello, world", the first word occupies file positions 1-5, and the second word occupies file positions 8-12. Once the search engine reports that "world" is in the file, it determines its file positions so the word can be set off with different color text or by some other means. If the file position information for the word is not accurate, then the retrieved word will not be highlighted accurately. The string analysis method first requires obtaining an index string wherein all visible characters occupy positions absolutely relative to each other. The index string is then parsed into words entered into an index along with the numeric word location in the string. In the "hello, world" example, the search engine can then go to the absolute position of 8 as the beginning of "world" instead of the relative position of "the end of `hello` plus 3" to get the display data for the word. A string analysis method can be adapted to handle embedded control characters provided their behavior and characteristics are consistent. For example, an image in an RTF file may consist of thousands of bytes, but the beginning and end of the sequence are consistently identified, and the entire sequence always affects the file position the same way. Thus, the string analysis method can simply discard all image byte sequences without affecting the absolute position determination of visible characters in words. HTML files involve major complications for using a string analysis method to determine file positions. HTML control tags are placed in line with visible characters. Some of the tags cause the file position to increase, and some do not. Furthermore, the parameters and tag content can be of unlimited and indeterminate length. A simple HTML file that only displays "hello, world", can have thousands of invisible control characters before the first word, thousands between it and the second word, and thousands after that. 
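As a non-limiting illustration of the string analysis method for a non-HTML file, the "hello, world" example can be sketched in Python as follows. The function name word_positions and the rule that a word is a run of letters or digits are assumptions made for this sketch; the description does not specify how valid words are delimited.

```python
import re


def word_positions(index_string):
    """Parse an index string into (word, first position, last position).

    Positions are 1-based and inclusive, matching the numbering used in the
    "hello, world" example. A word is assumed to be a run of letters/digits.
    """
    return [
        (m.group(), m.start() + 1, m.end())
        for m in re.finditer(r"[A-Za-z0-9]+", index_string)
    ]


print(word_positions("hello, world"))
# [('hello', 1, 5), ('world', 8, 12)]
```

Because every position is absolute within the index string, the start of "world" can be looked up directly as position 8 without reference to the preceding word.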
Furthermore, whether those control characters cause the file position of a visible character to increase or not depends on the type of HTML tag and the interaction of other HTML tags. Consequently, obtaining an accurate index string to parse is immensely difficult when HTML files are involved. Other mark up language file types, such as SGML, etc., present similar but less egregious problems in obtaining accurate index strings. The method described herein for HTML files can also be used for other types of mark up language files. The problem is that there is no known accurate way to determine what the effect of present and future HTML control tags will be relative to the file positions of visible words displayed by an HTML viewer when using a string analysis method. HTML viewer technology includes a text ranging method to determine where visible characters are displayed. Essentially, this method assigns a null value to non-incrementing control tags, including their parameters, and a byte value to tags that cause the display to advance the "file position pointer" when they are encountered. The technology also includes rules for determining whether the interaction of tags changes their behavior with respect to advancing the file position pointer. An accurate index string representing not only the relative file positions of words within an HTML file but also the starting position can be generated using a text ranging method. However, the method is slow compared to a string analysis method because each byte in the file has to be analyzed individually, and single byte analysis using the text range method requires beginning at the first byte of the html string. Thus, the time required for analysis increases exponentially with increasing lengths of files to be analyzed. The present invention overcomes the inaccuracy of the string analysis method used on HTML files and the slowness of the text ranging method. 
The entire HTML file is a string of bytes, which will be referred to as the html string. From it, a second string consisting of only visible characters and single byte representations of all adjacent control characters combined will be derived and referred to as the visible character string. The objective is to generate an index string for parsing that will contain visible characters positioned absolutely relative to one another numerically. The index string is analogous to a plain text file string or structured file strings, such as RTF, etc., and can be unambiguously parsed to determine word locations absolutely relative to one another. At block 418, all HTML control tags and their contents are converted to single characters in the non-displayable range, typically ASCII 1 through ASCII 31. In the same block 418, adjacent strings of these control characters are then combined into just one control character. Thus, the example of "hello, world", would be reduced at most to 15 characters regardless of the length and complexity of embedded HTML tags. This is the visible character string. The HTML viewer starting position of the first visible character must next be determined relative to the html string, which is done at block 419 by using the text ranging method. From that point, the objective is to maintain synchronization between the html string and the visible character string. String analysis is used for adjacent visible characters, and the method involves designating a sub-string with its start being the character following a control character and the end of the sub-string being the character preceding a subsequent control character. Such a sub-string segment is then added to the building index string in one step, whether it is one or thousands of characters in length as depicted by block 420. 
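The derivation of the visible character string at block 418, and the splitting of that string into sub-strings of adjacent visible characters for block 420, might be sketched in Python as follows. This is a simplified illustration only: the regular expression used to recognize tags, the choice of ASCII 1 as the single control character, and the function names are assumptions of the sketch, and the text ranging steps of blocks 419 and 421 through 426 are not modeled.

```python
import re

CTRL = "\x01"                 # assumed non-displayable marker (ASCII 1)
TAG = re.compile(r"<[^>]*>")  # naive tag matcher; real HTML needs a proper parser


def visible_character_string(html_string):
    """Replace each tag with one control character, then merge adjacent ones."""
    marked = TAG.sub(CTRL, html_string)
    return re.sub(CTRL + "+", CTRL, marked)


def visible_substrings(visible_string):
    """Return the runs of visible characters lying between control characters."""
    return [segment for segment in visible_string.split(CTRL) if segment]


html = "<html><body><b>hello,</b><i> world</i></body></html>"
vcs = visible_character_string(html)
print(len(vcs))                 # 15
print(visible_substrings(vcs))  # ['hello,', ' world']
```

However long the tags in the input are, each run of adjacent tags contributes a single character, so the result here is the twelve visible characters plus three control characters.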
At this point, the effect of the encountered control character must be determined, and that first involves synchronizing the entry point for the text range method into the html string. Depicted by block 421, the length of the sub-string added to the index string in block 420 is added to an html string processing variable, and that is where the text range method is applied to the html string. One by one, each byte is analyzed as depicted by block 422. If it advances the file position pointer, it is added to the index string. If the next character is not visible (block 423), a test for the end of the html string is performed at block 424. If so, the index string is completed, and processing is transferred to block 427 for string parsing and subsequent word location index generation, block 428. If the next character is visible, resynchronization of the HTML string processing variable is performed at block 425 so that the next entry point will land on the next control character after the length of the next sub-string is added when block 421 is next encountered. Before leaving block 425, the next byte is analyzed at block 426 to determine if the end of the string has been encountered. If so, processing is transferred to block 427 as previously described. If not, the processing is transferred to block 420 again, and the process continues until the entire index string is accreted. The process of block 407 on FIG. 4 is straightforward. Link patterns and field matches are designated by the user through the Linking Control Panel depicted by FIG. 11 and the Options for Field Links dialog depicted by FIG. 13. When a user designates a custom link word by entering it in text box 1101, associates it with a specific file (such as a glossary) by entering its path into text box 1102, and then clicks the Add New Link button 1104, instructions for that link have been programmed into the index generator. 
Likewise, when a user specifies a link pattern by entering it (with or without optional wildcard characters) in text box 1106, associates it with a particular field number by selecting one in the options box 1107, and then clicks the Add New Link button 1108, instructions for that link pattern have been programmed into the index generator. The user selectable options depicted on FIG. 13 allow refinement of the link pattern choices. For example, a user may want to use aliases or synonyms so that "equine" is also linked when "horse" is the primary pattern. Functionally, generating valid links automatically as depicted by block 407 of the database index generation process of FIG. 4 is a two step process. First, the virtual list of link pointers (words and patterns) is checked each time a word is extracted in block 405. If the word is on the list, the virtual list of all the files that will be in the final database (that is, a virtual table of contents) is checked to determine if a link target exists for the link pointer. For example, a pattern of "# S.W.2d #" might match a potential link pointer of "877 S.W.2d 200" that designates a file with a field likewise containing "877 S.W.2d 200" as the target. However, if the target file is not in the virtual table of contents, the pattern will not be designated as a link pointer. This avoids having link pointers that have no target being created. Generating valid links from patterns requires knowing the potential link pointers associated with specific target files. If a target file exists in the virtual table of contents, the link pointer can be inserted during the first pass through the files. The process is simpler in the case of words becoming link pointers. The virtual table of contents is examined to determine if the target file for a word is included. If so, a link pointer is created when the specified word is encountered. 
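The two-step check for pattern-based links might be sketched in Python as follows. The sketch assumes that the "#" wildcard stands for a run of digits, which is suggested by the "# S.W.2d #" example but not stated in the description, and it models the virtual table of contents as a dictionary from field value to file name. The function names and the file name shown are hypothetical.

```python
import re


def pattern_to_regex(link_pattern):
    """Translate a link pattern into a regular expression.

    Assumption: '#' matches one or more digits; all other characters are literal.
    """
    parts = [r"\d+" if ch == "#" else re.escape(ch) for ch in link_pattern]
    return re.compile("".join(parts))


def valid_link_pointers(text, link_pattern, table_of_contents):
    """Return pattern matches in text whose target exists in the table of contents."""
    regex = pattern_to_regex(link_pattern)
    return [m.group() for m in regex.finditer(text)
            if m.group() in table_of_contents]


toc = {"877 S.W.2d 200": "case_877_200.rtf"}  # hypothetical virtual table of contents
text = "See 877 S.W.2d 200 and 901 S.W.2d 15."
print(valid_link_pointers(text, "# S.W.2d #", toc))
# ['877 S.W.2d 200']
```

The second citation matches the pattern but has no target file in the table of contents, so no link pointer is created for it.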
As with link patterns, the validity of all links is assured because no link is created before the existence of its target is established. At block 410, the master word index 202 is then compiled with the index of word locations. Block 411 entails assigning unique numbers to every unique word in the database which produces the word number index 203 having its two parts, the DSI 204 and DSF 205. Based on the data collected, the generator program's jump link index is compiled at block 412, resulting in the jump link list 206 having its two parts, the OAI 207 and the OAF 208. At next block 413, the word lists 214 are generated, resulting in the WDL 215, the WDN file 216, and the WDR 217. The fields list 212 is then generated at next block 414 to include the individual presorted lists CO1, CO2 . . . CO# 213. The contents tables 209 then are generated at next block 415 to include the COI 210 and the COF 211. The generator program returns to the start dialog allowing a user to generate another database's index or to exit. A graphic user interface (GUI) embodiment of a database application program according to the present invention will now be described which provides utilities for database index generation and database selection and searching. The following FIGS. 5-15 are exemplary screen shots at various stages of the database application program in order to demonstrate the principles of the present invention. The database application program may be executed on the computer system 100, where each of the screen shots or displays are displayed on the display 112 and viewable by a user of the computer system 100. The GUI database application program may comprise a more specific embodiment of the system 170 shown in FIG. 1C, and may further incorporate the principles described in relation to the flow diagrams shown in FIGS. 3 and 4. FIG. 
5 is a screen display illustrating an exemplary database registration dialog of a graphic user interface (GUI) embodiment of a database application program implemented according to the present invention on a computer, such as the computer 100. The screen display includes a view options button 500, a database generator button 501, a search button 502, a database display window 504 which provides a list of database names 503, a Register New Database button 505, an UnRegister Selection button 506, and an Enable Word Lists control 507. The database display window 504 shows that four databases are registered as a result of previous use of the Register New Database button 505. As indicated by associated checkmarks 508, three of the registered databases have been selected. For example, a database may be selected when the user performs a standard operation with the mouse 114 by clicking a button on the mouse 114 while a cursor is on the database name, thus, causing a checkmark 508 to appear adjacent to the database name 503. FIG. 6 is a screen display illustrating an exemplary unregister confirmation dialog 601 of the GUI database application program introduced in FIG. 5 that appears when a user has highlighted a database name 503 and then selects the UnRegister Selection button 506. The unregister confirmation dialog 601 presents the user with an unregister confirmation message 602 that reminds the user of other options that are available. A message box 603 presents the user with various messages according to the position of the mouse pointer. A message 604 is shown in the message box 603 when the mouse pointer hovers over a Cancel Unregister button 606. The message 604 in the message box 603 changes when the mouse pointer is moved to other positions such as over an Unregister ONLY button 605, over a Delete Database Index Files button 607, or over a Delete All Files In Database button 608 to perform the indicated functions. FIG. 
7 is a screen display of an exemplary index generator dialog of the GUI database application program introduced in FIG. 5 as it might appear after a user presses the database generator button 501. The index generator dialog includes a source file location edit box 700, a database output directory edit box 701, a generator type selection box 702, a set link properties or Linking button 703, a New Database Name edit box 704, a Register New Database check box 705, an enable Pause feature button 706, a Run button 707, and an Exit button 708. The index generator dialog is used for registering a database or regenerating the database index 200 from a previously registered but changed database. Should the user press the Run button 707 without changing any of the FIG. 7 parameters, the database indicated is registered and appears as shown at 503 in the database display window 504. If the database has already been registered, the database index 200 is regenerated when the Run button 707 is pressed. Checking the register new database check box 705 causes the generator to register new databases or to reregister changed databases and add them to the database display window 504. A user might choose to regenerate a database index in this manner if any of the source files in the source file location edit box 700 have been changed or if any files matching the generator type selection box 702 were added or deleted. The Pause button 706 toggles a feature that allows the user to suspend database processing indefinitely. When the pause feature is disabled, the generator completes its tasks faster. Database indexes are made from documents or files located at a path to a directory or folder indicated in the source file location edit box 700 and according to the file type indicated in the generator type selection box 702. 
If the documents of the database index are located remotely, e.g., on the World Wide Web (WWW) of the internet, the source file location edit box 700 contains a hypertext transfer protocol address, i.e., an "http" (HyperText Transfer Protocol) address to the location. Of course, other types of addresses/designations are available for remotely accessible files, and these various types of addresses/designations are entered into the source file location edit box 700 in a similar manner. A database index is placed in the location shown in the database index output directory edit box 701 when generated from the selected files. Before pressing the Run button 707, the user can press the Linking button 703 in order to cause the documents of a database to have custom links to one another automatically generated at the same time the database index is generated (see FIG. 11 and related discussion). However, in order to understand searching operations of the software of the invention, at this point it is assumed that links have already been set and a database index has already been generated. FIG. 8 is a screen display of an exemplary search/retrieval dialog of the GUI database application program introduced in FIG. 5 that is displayed when a user presses the search button 502. The search/retrieval dialog presents the user with a search expression edit box 803 in which the user enters search terms of interest. In this case, the search terms "second amended petition" (including the quote marks) have been entered into the search expression edit box 803. The search expression edit box 803 supports search expressions of any degree of complexity by using the following techniques: parentheses; phrases set off by double quotations; proximity expressions; single- and multiple-character conflation in any combination of leading, middle, and trailing conflation; and default or overriding explicit Boolean operators, such as AND, OR, XOR, etc. 
Other search expression techniques are also contemplated. In addition, the search/retrieval dialog includes default Boolean operator controls 805 to determine how the system interprets multiple words entered in the search expression edit box 803. For example, if only two terms are entered without being surrounded by double quotation marks and the default Boolean operator is AND, the system finds all occurrences of both terms in documents that contain both terms. If the default Boolean operator is set to OR using the same example, the system finds all occurrences of either term in all documents with either term. If the default Boolean operator is set to XOR, the system finds all occurrences of either term only in documents that contain one term but not the other. Further, when checked, a Search within current results box 801 causes the system to perform the search called for in the search expression edit box 803 only for those documents found by the previous search. Once search terms are entered into the search expression edit box 803, a search of the database indexes for each of the selected databases 503 is performed by the search module 183 when an Execute button 806 is pressed. Further, the Execute button 806 causes all selected databases 503 to have instructions applied such as where to position a document when viewing it on the display 112, how to order search results, etc. For example, some instructions are set with a Document Position control 800 that designates whether the document, when a View button 810 is pressed, is displayed from its first line at the top of the document or from the location of the first search term that was found. Further, an Order Search Results By control 802 determines the sort order for the list of documents found that are to be displayed in a documents found window 815. 
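The effect of the default Boolean operator controls 805 on a two-term search can be illustrated with set operations. The following Python sketch assumes that each term has already been resolved to the set of document numbers containing it; the function name and the sample document numbers are hypothetical.

```python
def combine(docs_with_a, docs_with_b, operator):
    """Combine two sets of document numbers under the default Boolean operator."""
    if operator == "AND":
        return docs_with_a & docs_with_b  # documents containing both terms
    if operator == "OR":
        return docs_with_a | docs_with_b  # documents containing either term
    if operator == "XOR":
        return docs_with_a ^ docs_with_b  # documents containing exactly one term
    raise ValueError(f"unsupported operator: {operator}")


docs_a = {1, 2, 3}  # hypothetical documents containing the first term
docs_b = {3, 4}     # hypothetical documents containing the second term
print(sorted(combine(docs_a, docs_b, "AND")))  # [3]
print(sorted(combine(docs_a, docs_b, "OR")))   # [1, 2, 3, 4]
print(sorted(combine(docs_a, docs_b, "XOR")))  # [1, 2, 4]
```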
If a Display first document found checkbox 804 is checked, the system displays the first document found that satisfies the search expression without the intermediate display of the completed search results. After the Execute button 806 is pressed, the system records and displays its progress in a Search terms found window 809 and includes the number of documents found that match the search criterion. After all documents satisfying the expression are found, a document number is displayed in a document counter 807 and the documents found window 815 is populated in the order indicated by the order search results by controls 802. The View button 810 causes a highlighted document 812 to be displayed according to the Document Position control 800 setting. Should the number of documents found exceed the number that can be displayed in the documents found window 815, a scroll bar, the down arrow, and the Page Down keys are available so that the user can see the other documents found. Since a database application program, in one embodiment, is configured to simultaneously search over two billion databases, each with over two billion files, and each file with over two billion characters, the user may want to stop a search after it has started. For that reason, a Stop button 808 is provided. Further, a Clear button 811 allows all data to be cleared from the search expression edit box 803, the search terms found window 809, and the documents found window 815. If the Enable Word Lists control 507 is enabled, a Word List button 814 is enabled. When pressed, the Word List button 814 causes a list of all words that appear in all selected databases 503 arranged in alphabetical order to be displayed. Words can be placed directly into the search expression edit box 803 from the word list. A Close button 816 closes the search/retrieval dialog and returns the user to the previous screen without taking any further actions that may be available. 
Finally, a Sort Again button 813 is used to repeat the above procedure after changing the terms in the search expression edit box 803. FIG. 9 is a screen display of an exemplary dialog displaying a document, such as the highlighted document 812, retrieved from among the documents indicated in the documents found window 815. A document display window 928 displays text and graphics of a selected document being viewed in a similar manner as it would be seen in a word processor application such as MS-Word or the like. A word wrap button 921 toggles between two display states. The first state shows text as wrapping to the next line when the right side of the document display window 928 is too narrow to show all of the text in a paragraph on a single line. The second state of the word wrap button 921 displays all the text in a paragraph on a single line, and, if necessary, a horizontal scroll bar appears at the bottom of the document display window 928 which allows the user to move the contents of the window to see any portion of the text. This second state of the word wrap button 921 is especially useful when viewing documents with table type data where columns were determined by use of tabs or spaces. Since most computers use a proportional font to display text, such table type data may not align properly unless a fixed-pitch, non-wrapping display format is used. The word wrap button 921 allows the user to instantly toggle between either display format as desired. A field link 925 is illustrated in the text in FIG. 9, in which the underlying text is shown highlighted with selectable color and font different from the surrounding text to indicate the link, where the highlight selections are made in a Search Terms display control 1011 (FIG. 10). When the user double clicks on the field link 925, the system displays the document that the field link 925 targets. To return to the text displayed, the user need only press a jump backward button 916. 
The document display window 928 then shows the text of the document 812. A found terms display 927 shows that two terms were found in the highlighted document 812 of the documents found window 815. The same information about the document 812 is accessible through activation of a title bar 906. The Document Position controls 800 were set to display the document at the first search term, and the order search results by controls 802 were set to sort the results by database name 503. A database named "RTF12231" is the first one shown in the selected databases 503, and the system assumes that the user prefers that order. The search expression edit box 803 shows that the phrase "second amended petition" was searched for, and the document display window 928 shows two instances 926 of the phrase appearing near the center of the screen display for user convenience in determining the context of a term. The terms of the phrase are shown in font attributes determined by the Search Terms display control 1011. The previous search term button 911 is not available because the first search term in the document is displayed and current as indicated by a text cursor 950. The next search term button 912 is available because there is one more instance 926 in the document. Both the next document with search terms button 915 and the previous document with search terms button 914 are shown as available because the document displayed is the thirteenth of forty documents found as shown in the document counter 807. Also shown in the document display window 928 is a phrase 909, "Texas Rules of Appellate Procedure". The phrase 909 is shown in bold italics to indicate that it has a legal pad note attached to it, where the bold italics is determined by a LegalPad Notes display control 1009. Legal pad notes allow a user to create reference notes that are accessible from a document in a manner similar to document access through the field link 925. 
The LegalPad Notes display control 1009 shows that bold italics is used when the system displays text where legal pad notes are attached. As discussed in relation to FIG. 12, a legal pad button 918 is used to create new legal pads from highlighted text. A SmartScreen button 900 causes the system to display the same screen shown when the database application program is started (initialized) as in the example embodiment of FIG. 5. The first document in universe button 901, the "universe" including all files/documents in all selected databases, is not available and thus not highlighted because the document shown in the document display window 928 just happens to be the first document in all of the documents in the selected databases. The same situation applies to a previous document in universe button 902, which is also not highlighted. However, a next document in universe button 903 is available as indicated by being highlighted. When the button 903 is pressed, the document following the one currently displayed is displayed. When pressed, a last document in universe button 904 causes the system to immediately display the last document in the list of all of the selected databases 503. Further, when pressed, a table of contents button 905 displays a dialog with collapsible table of contents to allow a user to quickly determine and view any file in any of the selected databases 503. The find document in entire universe button 907 displays a dialog allowing a user to type fragments of a sought document in order to find it and quickly view it. A find button 908 allows a user to search within the document currently displayed. A direct from text button 910 causes a phrase search to immediately be executed for all text that is selected by a user and highlighted. It is not available unless some text is selected. A bookmark button 917 allows a user to place an electronic bookmark at any point in any document through a dialog that allows the user to name and manage bookmarks. 
A copy button 919 allows the user to copy any highlighted text to the computer's memory for insertion elsewhere. A print button 920 displays a print dialog which provides full print utilities to the user. A font change button 922 allows the user to toggle from a proportional pitch font to a fixed pitch font for ease of viewing text formatted with spaces and tabs for columnar alignment or back to the original font. A help button 923 displays information about the system. An exit button 924 causes the system to terminate and asks the user whether data about the session should be saved or not. In summary, the document display window 928 illustrates examples of field links 925, legal pad phrases 909, and instance 926 of search phrases. The appearance of these portions of the document display window 928 is controlled by a display options dialog that is discussed in relation to FIG. 10. FIG. 10 is a screen display of an exemplary display options dialog, i.e., a view options dialog 1012, of the GUI database application program introduced in FIG. 5 that appears when a user has pressed the view options button 500. A FastSearch button 1002 allows the user to set a variable that controls the speed with which the system preloads certain index components when it is started. Colors and Styles controls 1001 enable the user to set display options for the document display window 928. For example, a Document Background screen color box 1000 is used to select background colors of the document display window 928. Further, a Jump Tags section 1006, a LegalPad Notes display control 1009, and the Search Terms found section 1011 are available in the view options dialog 1012, each for selecting the color, weight, and font of the text in the document display window 928. The effects of each control are immediately shown in the window appearing below the Colors and Styles controls 1001. A Default Text Font Size 1003 is set by the user. 
Pressing a Restore Defaults button 1005 resets all controls to their original state. Pressing an OK button 1004 accepts any changes the user has made and restores the disp
