Keyword and methods for using a keyword6687711Abstract Systems and methods are provided for data storage and retrieval in which data is stored in a file storage system and keywords are associated with the data. The keywords include parameters of the data and associated values for the parameters. The keywords are collected together in indexes of the data, and various methods are used on the keywords or the indexes of keywords. These methods include index creation methods, searching methods, and data conversion methods. Claims I claim: Description NOTICE OF COPYRIGHTS
TABLE 1
Units
Keyword Instance Keyword Contents Parameter Designator Parameter
Value
first keyword Author:"Bob Author N/A "Bob
Richards"
instance 10a Richards"
second keyword Subject:"Beaker" Subject N/A "Beaker"
instance 10b1
third keyword Subject :"deliveries" Subject N/A
"deliveries"
instance 10b2
fourth keyword Subject:"sales" Subject N/A "sales"
instance 10b3
fifth keyword Date Written:"July Date Written N/A "July 24,
2000"
instance 10c 24, 2000"
sixth keyword Capacity(ml):500 Capacity ml 500
instance 10d
seventh keyword Material of Material of N/A "tempered
glass
instance 10e manufacture: manufacture
"tempered glass"
eighth keyword Temperature Temperature F 212
instance 10f1 Maximum(F):212 Maximum
ninth keyword Temperature Temperature C 100
instance 10f2 Maximum(C.):100 Maximum
tenth keyword Pressure Pressure psi 120.25
instance 10g Resistance (psi): Resistance
120.25
eleventh keyword Date of order:"June Date of order N/A "June 3,
2000"
instance 10h 3, 2000"
twelfth keyword Date of Date of shipment N/A "July 24,
2000"
instance 10i shipment:"July 24,
2000"
thirteenth keyword Volume of Volume of order pieces 5000
instance 10j order(pieces):5000
fourteenth keyword Type of item Type of item N/A "beaker"
instance 10k ordered:"beaker" ordered
fifteenth keyword Brand of item Brand of item N/A "SuperTuf"
instance 10l ordered:"SuperTuf" ordered
Each keyword instance 10a, 10b1, 10b2, 10b3, 10c, 10d, 10e, 10f, 10g, 10h, 10i, 10j, 10k, 10l, 10m is an instance of a keyword 10 that is associated with a particular parameter instance 12a, 12b, 12c, 12d, 12e, 12f, 12g, 12h, 12i, 12j, 12k, 12l, 12m and an associated parameter value 14. The keyword instances 10a, 10b1, 10b2, 10b3, 10c, 10d, 10e, 10f, 10g, 10h, 10i, 10j, 10k, 10l, 10m may be associated with the first document 20 at the time the first document 20 is created, or the keyword instances 10a, 10b1, 10b2, 10b3, 10c, 10d, 10e, 10f, 10g, 10h, 10i, 10j, 10k, 10l, 10m may be later associated with an existing first document 20. In a preferred embodiment, the keyword instances 10a, 10b1, 10b2, 10b3, 10c, 10d, 10e, 10f, 10g, 10h, 10i, 10j, 10k, 10l, 10m are associated with the first document 20 at the time the first document 20 is created, by being incorporated within the first document 20 as one or more tags 22. Other methods of association could include providing a link to a separate file of keywords 10 for the first document 20. The particular method of associating the keyword instances 10a, 10b1, 10b2, 10b3, 10c, 10d, 10e, 10f, 10g, 10h, 10i, 10j, 10k, 10l, 10m with the first document 20 is a design choice for those skilled in the art, and is not critical to the invention. In a preferred embodiment, more than one keyword instance 10a, 10b1, 10b2, 10b3, 10c, 10d, 10e, 10f, 10g, 10h, 10i, 10j, 10k, 10l, 10m may contain the same instance of the parameter 12. In the exemplary embodiment of FIGS. 2A-2B, the parameter instance 12b labeled "Subject" is contained in the third keyword instance 10b1, the fourth keyword instance 10b2 and the fifth keyword instance 10b3. Thus a first document 20 discussing multiple instances of the same property can be managed by a preferred embodiment of the invention. In a preferred embodiment, parameters 12 can also vary between documents within the file storage system. For example, turning to FIGS. 3A-3B, a second document 30 is an invoice for the business transaction referenced in the first document 20. The second document 30 contains parameter instances 12b, 12d, 12e, 12f, 12g, 12h, 12j, 12k, 12l, 12m that are also contained in the first document 20. Some of these parameter instances 12b, 12d, 12e, 12f, 12g, 12h, 12j, 12k, 12l, 12m have different parameter values 14 associated with them. The second document 30 also contains parameter instance 12n that is not found in the first document 20. The first document 20 also contains parameter instances 12a, 12c, 12i that are not found in the second document 30. Turning now to FIG. 4, an index 40 of the keyword instances 10a, 10b1, 10b2, 10b3, 10c, 10d, 10e, 10f, 10g, 10h, 10i, 10j, 10k, 10l, 10m is created. In a preferred embodiment, the index 40 is created by using an automated process which parses the first document 20, gathers the keyword instances 10a, 10b1, 10b2, 10b3, 10c, 10d, 10e, 10f, 10g, 10h, 10i, 10j, 10k, 10l, 10m from the first document 20, and create an index entry 42 for each keyword 10a, 10b1, 10b2, 10b3, 10c, 10d, 10e, 10f1, 10f2, 10g, 10h, 10i, 10j, 10k, 10l. The index 40 preferably indexes each keyword 10 that actually appears in at least one of the documents within the file storage system. In another embodiment, the index 40 indexes each keyword 10 that could possibly appear within a document within the file storage system. A preferred embodiment of an index 40 is described in the Jul. 24, 2000 Nunez application. The precise method of creating the index 40 is a design choice for those skilled in the art, and is not critical to the invention. In a preferred embodiment, the index 40 includes one or more index entries 42. Each index entry 42 is created by associating a keyword 10 with one or more body-to-record pointers 44. In a preferred embodiment, the keyword 10 is associated with the one or more body-to-record pointers 44 by combining the keyword 10 and the one or more body-to-record pointers 44 into a contiguous text string. The precise method of associating a keyword 10 with the one or more body-to-record pointers 44 is, however, a design choice for those skilled in the art, and is not critical to the invention. In a preferred embodiment, the one or more body-to-record pointers 44 identify members of the collection of data files within the file storage system that contain the keyword 10 listed in the index entry 42. In a preferred embodiment, the one or more body-to-record pointers 44 include a volume designation 46 and a file designation 48. A detailed description of the one or more body-to-record pointers 44 of a preferred embodiment is contained in the Jul. 24, 2000 Nunez application. The precise nature of the one or more body-to-record pointers 44 is a design choice for those skilled in the art, and is not critical to the invention. In the exemplary embodiment of FIG. 4, the index 40 is sorted alphabetically based upon the parameter 12 contained within each keyword 10 within each index entry 42. In a preferred embodiment, the index 40 is organized as described in the Jul. 24, 2000 Nunez application. In another embodiment the index 40 is sorted using a hashing function. The specifics of the organization scheme for the index 40 is a design choice for the user and is not critical to the invention. In a preferred embodiment, once a keyword 10 is created, associated with one of the documents in the file storage system, and included in an index entry 42 of an index 40, turning to FIG. 5, the keyword 10 is then used in a identifying method 500 for identifying an index entry 42 that satisfies, turning briefly to FIG. 5A, a search criterion 50. Returning to FIG. 5, the identifying method 500 can be incorporated into a variety of other methods that operate on an index 40. For example, the identifying method 500 could be a step in a method for searching an index 40, a method for creating an index 40, or a method for deleting an index entry 42 from an index 40. A preferred method for searching an index 40 is described in the Jul. 24, 2000 Nunez application. The precise uses for the identifying method 500 are design choices for those skilled in the art, and are not critical to the invention. The identifying method 500 includes a first step 510 where a search criterion 50 is identified. In a preferred embodiment, the search criterion 50 is a keyword 10 or a parameter 12. In a preferred embodiment, the search criterion 50 is represented as a contiguous text string. In another preferred embodiment, the search criterion 50 is converted into a contiguous text string if it is not already in that form. The search criterion 50 can come from a variety of sources. For example, a user of the file storage system can provide the search criterion 50. The search criterion 50 can alternatively be automatically generated by the file storage system itself. The particular source of the search criterion 50 is a design choice for those skilled in the art, and is not critical to the invention. The identifying method 500 includes a second step 520, where an index entry 42, to be compared with the search criterion 50, is identified. The index entry 42 can be identified in a variety of ways. In a preferred embodiment, the index entry 42 is identified using the methods disclosed in the Jul. 24, 2000 Nunez application. In another embodiment, the index entry 42 is identified by iterating through all of the index entries 42 contained within the index 40. The precise method of identifying an index entry 42 for comparison with the criterion 40 is a design choice for those skilled in the art, and is not critical to the invention. The ordering of steps 510, 520 is also a design choice for those skilled in the art, and is not critical to the invention. The identifying method 500 includes a third step 530, where the search criterion 50 identified in step 510 is compared with the index entry 42 identified in step 520. In a preferred embodiment, having represented the search criterion 50 as a contiguous text string, the search criterion 50 is then compared with the keyword 10 contained within the index entry 42 by making a Boolean logic comparison of each character of the search criterion 50 with the corresponding character of the index entry 42. A result is returned for each comparison, indicating if a match occurred. In an embodiment, the result of each character comparison is the numeral one (1) if the characters match, and the numeral zero (0) if they characters do not match. Representations of results of Boolean comparisons are well known to those skilled in the art, and the specific representation chosen is not critical to the invention. The identifying method 500 includes a fourth step 540, where the results of the comparison are presented. The results of the comparison can be presented to a variety of entities. For example, the results of the comparison can be presented to another method of which the identifying method 500 is a part, such as a method for searching an index 40. The results of the comparison could also be provided directly to a user. The precise nature of the entity to which the results of the comparison are presented is a design choice for those skilled in the art, and is not critical to the invention. The identifying method 500 then terminates at step 550. As a practical example of how the identifying method 500 might be used in conjunction with the index 40, assume that a searcher is looking for a document that discusses the subject of deliveries. The searcher provides a search criterion 50 that specifies a parameter of "subject" and a parameter value of "deliveries" to a search method (not shown) which includes the identifying method 500. This search criterion 50 is identified in step 510 as the search criterion 50 to be processed. Using the index search method disclosed in the Jul. 24, 2000 Nunez application, in step 520 the index entry 42 corresponding to the second keyword 10b1 is selected as the index entry 42 to be compared with. The search criterion 50 is represented as, or converted to if necessary, the string: subject: "deliveries" which is compared with the second keyword 10b1, represented as the string: subject: "beaker" in step 530. The character-by-character Boolean comparison of the two strings will generate a result set as follows, where a numeral one (1) signifies a match, and a numeral zero (0) signifies no match: 11111111101000000000. These results are returned in step 540 to the search method (not shown) that the identifying method 500 is a component of. Since the search method (not shown) is looking for a complete match between the search criterion 50 and the second keyword 10b1, the search method (not shown), rejects the index entry 42, and continues searching. Using the index search method disclosed in the Jul. 24, 2000 Nunez application, the index entry 42 corresponding to the third keyword 10b2 is selected as the next index entry 42 to be processed, in step 520. The search criterion 50 is represented as, or converted to if necessary, the string: subject: "deliveries" which is compared with the third keyword 10b2, represented as the string: subject: "deliveries" in step 530. The character-by-character Boolean comparison of the two strings will generate a result set as follows, where a numeral one (1) signifies a match, and a numeral zero (0) signifies no match: 11111111111111111111. These results are returned in step 540 to the search method (not shown) that the identifying method 500 is a component of. Since the search method (not shown) is looking for a complete match between the search criterion 50 and the third keyword 10b1, the search method (not shown), accepts the index entry 42. Depending on the precise nature of the search query the user presented to the search method (not shown), the search method (not shown) could continue searching the rest of the index 40 for other documents that satisfy the search criterion 50, or the search method (not shown) could halt upon locating a first document 20 that satisfied the search criterion 50. The particular actions of the search method (not shown) are a design choice for those skilled in the art, and are not critical to the invention. As a second example of the identifying method 500, assume that the search criterion 50 used above specified only a parameter 12 of "subject", and had no parameter value 14 specified. The comparison with the second keyword 10b1 then returns the following result: 1111111 which is a complete match. This type of comparison allows the identifying method 500 to identify index entries 42 that match only the parameter 12 of a keyword 10 contained in an index entry 42. Thus, in an embodiment of the invention, meta-data searches as well as data searches are possible. As a third example of the use of the identifying method 500, in an embodiment using the indexing scheme and index search method disclosed in the Jul. 24, 2000 Nunez application, a more complex query is processed using the above described identifying method 500. Assume that a searcher is looking for all documents that discuss deliveries of beakers, the beakers having a capacity between 90 and 120 milliliters (ml) and a maximum temperature tolerance between 212 and 1000 degrees Fahrenheit (F.). The searcher provides the query: Subject=deliveries AND Subject=beaker AND Capacity(ml)=90 to 120 AND Temperature(F)=212 to 1000 The search is done in four passes. In the first pass, the search method (not shown) calls the identifying method 500 and passes the first search criterion subject: "deliveries" to the identifying method 500. Following the steps outlined above, the identifying method 500 returns all the records that contain the keyword matching the first search criterion. In the second pass, the search method (not shown) calls the identifying method 500 and passes the second search criterion subject: "beaker" to the identifying method 500. Following the steps outlined above, the identifying method 500 returns all the records that contain the keyword matching the second search criterion. In the third pass, the third search criterion is a compound search criterion, which specifies a range of values to be located. The third pass extracts from the third search criterion a first sub-criterion: capacity(ml): 90 that is passed to the identifying method 500. The identifying method 500 returns the index entry, matching the first sub-criterion, that is located first in the index 40. The third pass then extracts from the third search criterion a second sub-criterion: capacity(ml): 120 that is passed to the identifying method 500. The identifying method 500 returns the index entry, matching the second sub-criterion, that is located last in the index 40. The third pass of the search method then returns all records pointed to by all index entries 42 in the index 40 that are located between the index entry identified by the first sub-criterion and the index entry identified by the second sub-criterion. The fourth pass is treated similarly to the third pass. A third sub-criterion of temperature(F): 212 is extracted from the fourth search criterion an used by the identifying method 500 to locate the top of the range of index entries 42 in the index 40 that satisfy the fourth search criterion. A fourth sub-criterion of temperature(F): 1000 is extracted from the fourth search criterion an used by the identifying method 500 to locate the bottom of the range of index entries 42 in the index 40 that satisfy the fourth search criterion. The fourth pass then returns all records pointed to by the index entries 42 within the range. In an embodiment, upon completion of all four passes, the search method returns the conjunction of the result sets from the four passes, since the original query was seeking records that met all four search criteria. In another embodiment, the second pass only searches the records returned by the first pass, the third pass only searches the records returned by the second pass, and the fourth pass only searches the records returned by the third pass. Thus the first pass generates, for example, 100 matches. The second pass searches those 100 matches and returns 50 matches from within the previous 100. The third pass searches those 50 matches and returns 30 matches from within the previous 50. The fourth pass searches those 30 matches and returns a final result set of five matches from within the previous 30. The precise method for combining the result sets of the passes of the search method is a design choice for the user, and is not critical to the invention. In a preferred embodiment, shown in FIG. 6, a keyword 10 is used in a converting method 600 for converting data from an unstructured or semi-structured format into a row-and-column format, such as, turning to FIG. 7, a table 700. The row-and-column formatted data can then be accessed and manipulated by conventional database systems. The table 700 contains one or more rows 702 and one or more columns 704, which together create one or more cells 706 for containing the data extracted from the first document 20 and the second document 30. Each column 704 has a column identifier 708 associated with it. In a preferred embodiment, each column identifier 708 is a parameter 12 of a keyword 10 associated with data stored in the file storage system. Each row 702 has a row identifier 710 associated with it. In a preferred embodiment, the row identifier 710 identifies the first document 20 or the second document 30 from which the data contained in the cells 706 of the row 702 came from. Each cell 706 can be identified by the combination of the row 702 and the column 704, or the row identifier 710 and the column identifier 708, of which the cell 706 is a member. The data within each cell 706 can be stored in a variety of formats. In an embodiment, each cell 706 contains a single value. In another embodiment, each cell 706 contains multiple values for a single parameter 12, organized in some fashion. The particular formats for storing data in the cells 706 are design choices for those skilled in the art, and are not critical to the invention. The converting method 600 includes a first step 610, where the table 700 is identified as the next table to enter data into. In an embodiment, the table 700 is created in step 610. In an embodiment, a column 706 of the table 700 is created for each parameter 12 that identifies data to be converted. In another embodiment, the table 700 is an existing table of a conventional row-and-column database (not shown). The particular method of identifying the table 700 is a design choice for those skilled in the art and is not critical to the invention. The converting method 600 includes a second step 620, where the first document 20 to be converted is identified. In a preferred embodiment, the first document 20 is stored in the file storage system described in the May 23, 2000 Nunez application. In a preferred embodiment, the first document 20 is identified by searching an index file 40 and retrieving a collection of index entries 42 that point to the first document 20. In a preferred embodiment, the first document 20 itself is not retrieved, rather the data stored within the first document 20 is retrieved directly from the parameter values 14 contained within the index entries 42 that point to the first document 20. In another embodiment, the first document 20 is retrieved. The particular method of identifying the first document to be converted is a design choice for those skilled in the art, and is not critical to the invention. The converting method 600 includes a third step 630, where a row 702 of the table 700 is identified for insertion of the data from the first document 20. In an embodiment, the row 702 is created in step 630. In another embodiment, the row 702 is an existing row of the table 700. The particular method of identifying the row 702 is a design choice for those skilled in the art, and is not critical to the invention. The converting method 600 includes a fourth step 640, where a first keyword instance 10a is identified as the next keyword 10 to be processed. In a preferred embodiment, the first keyword instance 10a is identified by searching an index file 40, or a subset of an index file 40, and locating an index entry 42 that points to the first document 20, using the search method described in the Jul. 24, 2000 Nunez application. The precise method of identifying the first keyword instance 10a is, however, a design choice for those skilled in the art, and is not critical to the invention. The converting method 600 includes a fifth step 650 where the data associated with the first keyword instance 10a is written to the row 702 identified in step 630. In a preferred embodiment where the first keyword instance 10a is contained in an index entry 42 of the index 40 that was searched in the fourth step 640, the data associated with the first keyword instance 10a is extracted from the parameter value 14, stored in the index entry 42 of the index 40, that is associated with the first keyword instance 10a. In another embodiment, the data associated with the first keyword instance 10a is instead extracted directly from the first document 20, by locating the first document 20 using the body-to-record pointer 44 of the index entry 42 containing the first keyword instance 10a. In a preferred embodiment, the data associated with the first keyword instance 10a is written to the cell 706 that is in the column 704 bearing the column identifier 708 that matches the first parameter instance 12a of the first keyword instance 10a. In an embodiment, if the table 700 does not have a column 704 bearing the column identifier 708 that matches the first parameter instance 12a of the first keyword instance 10a, then such a column 704 is created. The converting method 600 then returns to step 640, where the next keyword 10 to be processed is identified. Once all keywords 10 in the first document 20 have been processed, the converting method 600 returns to step 620, where the next document to be processed is identified. Once all documents that are to be converted into the table 700 have been processed, the converting method 600 returns to step 610 where it identifies the next table to be processed. Once all tables 700 have been processed, the converting method 600 terminates in a sixth and final step 660. Several preferred embodiments of a system of and method for using keywords in parameter-based searching of text, and many of the system's attendant advantages, have thus been disclosed. It will be apparent, however, that various changes may be made in the system's or method's form and components without departing from the spirit and scope of the invention, the embodiments hereinbefore described being merely preferred or exemplary embodiments thereof Therefore, the invention is not to be restricted or limited except in accordance with the following claims and their legal equivalents.
|
Same subclass Same class Consider this
|
||||||||||||
