Document data processing method and apparatus for document retrieval5469354
Abstract
High-speed full document retrieval method and system capable of providing result of retrieval within practically acceptable short search time. Upon registration of documents in a document database, condensed texts are created by decomposing each of textual character strings of the documents to be registered into fragmental character strings in dependence on character species and by checking mutual inclusion relations existing among the fragmental character strings. A component character table is created in which characters occurring in each of the condensed texts are registered without duplication. The condensed texts and the component character table are registered in the data base together with the texts of the documents to be registered. Upon retrieval of a document containing a search term designated by a user, a component character table search is first executed to extract those documents which contain all species of characters constituting the search term by consulting the component character table, and subsequently a condensed text search is executed by consulting the condensed texts of the documents. Finally, a text body search is executed for extracting a document which satisfies query condition imposed on the search term by consulting the texts of the documents extracted through the component character table search and the condensed text search.
Claims
We claim:
1. A document data processing method for retrieving a document containing at least a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising steps of:
upon registration of text documents in said document database,
creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including katakana character, hiragana character, kanji character, alphabetic character, numeric character, and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated;
creating a component character table in which characters occurring in each of said condensed texts are registered without duplication; and
registering in said document database said condensed texts together with said component character table in addition to the texts of the document to be registered; and
upon retrieval of the document containing the designated search term, executing first a component character table search for thereby extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said component character table;
executing subsequently a condensed text search by consulting the condensed texts of the documents extracted through said component character table search for extracting only the documents corresponding to the condensed texts which contain the fragmental character strings constituting the search term designated by the operator to thereby select the documents containing the designated search term; and
executing finally a text body search for extracting a document which satisfies query condition imposed on the search term by consulting the texts of the documents extracted through said component character table search and said condensed text search.
2. A document data processing method for document retrieval according to claim 1, wherein said component character table registers without duplication all the characters as used on a document basis.
3. A document data processing method for document retrieval according to claim 2, wherein said component character table is created by mapping the character codes to a number of entry codes of said table which is smaller than a number of the characters used actually by using a hash function.
4. A document data processing method for document retrieval. according to claim 1, wherein in association with said component character table search, a bit list in which one-bit information allocated to all usable characters is prepared for each of said documents and wherein bit positions for the characters used in the documents are set to "1s" respectively, while the bit positions for the characters not used in the documents are set to "0s", respectively;
said component character table search is executed by searching said a bit list having all the bit positions set to "1s" for all the characters constituting said search term.
5. A document data processing method for document retrieval according to claim 4, wherein by mapping the character codes to a number of entry codes which is smaller than the number of the characters actually used, said bit list is realized by a number of bits which is smaller than a number of the character types actually used.
6. A document data processing method for document retrieval according to claim 5, wherein in association with said component character table search, a bit list in which one-bit information allocated to all usable characters is prepared for each of said documents and wherein bit positions for the characters used in the documents are set to "1s" respectively, while the bit positions for the characters not used in the documents are set to "0s", respectively;
said component character table search is executed by searching said bit list having all the bit positions set to "1s" for all the characters constituting said search term.
7. A document data processing method for document retrieval according to claim 6, wherein by mapping the character codes to a number of entry codes which is smaller than the number of the characters actually used, said bit list is realized by a number of bits which is smaller than a number of the characters actually used.
8. A document data processing method for document retrieval according to claim 1, wherein said text body search is executed when said search term is constituted with a plurality of character species and when said query condition includes a positional condition of the search term in the next text.
9. A document data processing method for retrieving a document containing at least a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising steps of:
upon registration of text documents in said document database,
creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including katakana character, hiragana character, kanji character, alphabetic character, numeric character, and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated;
creating a component character table in which characters occurring in registered texts are registered without duplication; and
registering in said document database said condensed texts together with said component character table in addition to the texts of the document to be registered; and
upon retrieval of the document containing the designated search term, executing first a component character table search for thereby extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said component character table;
executing subsequently a condensed text search by consulting the condensed texts of the documents extracted through said component character table search for extracting only the documents corresponding to the condensed texts which contain the fragmental character strings constituting the search term designated by the operator to thereby select the documents containing the designated search term; and
executing finally a text body search for extracting a document which satisfies query condition imposed on the search term by consulting the texts of the documents extracted through said component character table search and said condensed text search.
10. A document data processing method for document retrieval according to claim 9, wherein said component character table registers without duplication all the characters as used on a document basis.
11. A document data processing method for document retrieval according to claim 10, wherein said component character table is created by mapping the character codes to a number of entry codes of said table which is smaller than a number of the characters used actually by using a hash function.
12. A document data processing method for document retrieval according to claim 9, wherein said text body search is executed when said search term is constituted with a plurality of character species and when said query condition includes a positional condition of the search term in the next text.
13. A document data processing method for retrieving a document containing all of plural search terms designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising steps of:
upon registration of text documents in said document database,
creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including katakana character, hiragana character, kanji character, alphabetic character, numeric character and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated;
creating a component character table in which characters occurring in registered texts are registered without duplication; and
registering in said document database said condensed texts together with said component character table in addition to the texts of the document to be registered; and
upon retrieval of the document containing the designated search term, executing first a component character table search for thereby extracting those documents which contain all species of characters constituting each of said search terms designated by the operator by consulting said component character table;
executing subsequently a condensed text search by consulting the condensed texts of the documents extracted through said component character table search for extracting only the documents corresponding to the condensed texts which contain all the fragmental character strings constituting each of said search terms designated by the operator to thereby select the documents containing the designated search terms; and
executing finally a text body search for extracting a document which satisfies query condition imposed on said search terms such as positional relation thereof in the text by consulting the texts of the documents extracted through said component character table search and said condensed text search.
14. A document data processing method for retrieving a document containing any one of search terms designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising steps of:
upon registration of text documents in said document database,
creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including katakana character, hiragana character, kanji character, alphabetic character, numeric character and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated;
creating a component character table in which characters occurring in registered texts are registered without duplication; and
registering in said document database said condensed texts together with said component character table in addition to the texts of the document to be registered; and
upon retrieval of the document containing the designated search term, executing first a component character table search for thereby extracting those documents which contain all species of characters constituting any one of said search terms designated by the operator by consulting said component character table;
executing subsequently a condensed text search by consulting the condensed texts of the documents extracted through said component character table search for extracting only the documents corresponding to the condensed texts which contain all the fragmental character strings constituting any one of said search terms designated by the operator to thereby select the documents containing the designated search terms; and
executing finally a text body search for extracting a document which satisfies query condition imposed on said search terms by consulting the texts of the documents extracted through said component character table search and said condensed text search.
15. A document data processing method for retrieving a document containing at least a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising steps of:
upon registration of text documents in said document database,
creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings in dependence on character species each of the fragmental character strings being able to include one of katakana character string, hiragana character string, kanji character string, alphabetic character string, numeric character string and symbol character string, and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, while checking said hiragana character string by consulting a basic word dictionary and conjunction rules as to whether said hiragana character string represents a succession of subsidiary words having semantically no meaning as the search term, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string and any hiragana character string found to be a succession of the semantically meaningless subsidiary words are excluded;
creating a component character table in which characters occurring in registered texts are registered without duplication; and
registering in said document database said condensed texts together with said component character table in addition to the texts of the document to be registered; and
upon retrieval of the document containing the designated search term, executing first a component character table search for thereby extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said component character table;
executing subsequently a condensed text search by consulting the condensed texts of the documents extracted through said component character table search for extracting only the documents corresponding to the condensed texts which contain the fragmental character strings constituting the search term unless said fragmental character strings have been determined to be a succession of semantically meaningless words as the search term after the check of said fragmental character strings by using the basic word dictionary and the conjunction rules; and
executing finally a text body search for extracting a document which satisfies query condition imposed on the search term by consulting the texts of the documents extracted through said component character table search and said condensed text search while consulting the registered texts of the documents extracted through said component character table search when any one of said fragmental character strings has been determined to be a succession of the semantically meaningless words, for thereby extracting a document which contains each of the fragmental character strings and which satisfies the retrieval condition imposed on the search term concerning the positional relation thereof.
16. A document data processing method for document retrieval according to claim 15, wherein said text body search is executed when said search term is constituted with a plurality of character species and when said query condition includes a positional condition of the search term in the next text.
17. A document data processing method for retrieving a document containing at least a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising steps of:
upon registration of text documents in said document database,
creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings in dependence on character species each of the fragmental character strings being able to include one of katakana character string, hiragana character string, kanji character string, alphabetic character string, numeric character string and symbol character string, and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, after having eliminated all the hiragana character strings, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is excluded;
creating a component character table in which characters occurring in registered texts are registered without duplication; and
registering in said document database a plurality of said condensed texts corresponding to said character species, respectively, together with said component character table in addition to the texts of the documents to be registered; and
upon retrieval of the document containing the designated search term, executing first a component character table search for thereby extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said component character table;
executing subsequently a condensed text search by consulting the condensed texts of the documents extracted through said component character table search for extracting only the documents corresponding to the condensed texts which contain the fragmental character strings constituting the search term designated by the operator provided that said fragmental character strings constituting the search term designated by the operator has been determined as including none of the hiragana character strings as a result of corresponding decision step; and
executing finally a text body search for extracting a document which satisfies query condition imposed on the search term by consulting the texts of the documents extracted or alternatively for extracting a document containing the designated fragmental character strings and satisfying said query condition by consulting the original text of the document extracted through said component character table search.
18. A document data processing method for document retrieval according to claim 17, wherein said text body search is executed when said search term is constituted with a plurality of character species and when said query condition includes a positional condition of the search term in the next text.
19. A document data processing method for retrieving a document containing at least a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising steps of:
upon registration of text documents in said document database,
creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including katakana character, hiragana character, kanji character, alphabetic character, numeric character and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create a plurality of condensed texts separately on a character species basis, each of said condensed texts being constituted by the fragmental character strings of a same character species while excluding any character string found to be included by other character string;
creating a component character table describing the species of the characters occurring in registered texts;
registering in said document database said plurality of character-species based condensed texts together with said component character table in addition to the text of the document to be registered; and
upon retrieval of the document containing the designated search term, executing first a component character table search for thereby extracting those documents which contain all the species of characters constituting the search term designated by the operator by consulting said component character table;
executing subsequently a condensed text search by consulting the condensed text corresponding to the character species of the fragmental character strings constituting the search term designated by the operator in the documents extracted through said component character table search for extracting only the documents corresponding to the condensed texts which contain the fragmental character strings constituting the search term designated by the operator to thereby select the documents containing the designated search term; and
executing finally a text body search for extracting a document which satisfies query condition imposed on the search term by consulting the texts of the documents extracted through said component character table search and said condensed text search.
20. A document data processing method for document retrieval according to claim 19, wherein said text body search is executed when said search term is constituted with a plurality of character species and when said query condition includes a positional condition of the search term in the next text.
21. A document data processing method for retrieving a document containing a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising steps of:
upon registration of text documents in said document database,
creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including katakana character, hiragana character, kanji character, alphabetic character, numeric character and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated;
creating a component character table in which characters occurring in each of said condensed texts are registered without duplication; and
registering in said document database said condensed texts together with said component character table in addition to the text of the document to be registered; and
upon retrieval of the document containing the designated search term, executing first a component character table search for thereby extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said component character table; and
executing subsequently a condensed table search by consulting the condensed texts of the documents extracted through said component character table search for thereby extracting only the documents corresponding to the condensed texts which contain the fragmental character strings constituting the search term designated by the operator to thereby extract the documents containing the designated search term;
creating a component character table in which characters occurring in texts are registered without duplication; and
registering in said document database said component character table in addition to the texts of the documents to be registered; and
upon retrieval of the document containing the designated search term, executing first a component character table search for thereby extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said component character table; and
executing subsequently a text body search by consulting the texts of the documents extracted through said component character table search for thereby extracting only the document which contains the designated search term and which satisfies query condition imposed on the search term such as positional relation thereof in the text, whereby a full text retrieval is carried out at an equivalently increased speed.
22. A document data processing method for retrieving a document containing a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising the steps of:
upon registration of text documents in said document database,
creating a component character table in which characters occurring in texts are registered without duplication; and
registering in said document database said component character table in addition to the texts of the documents to be registered; and
upon retrieval of the document containing the designated search term,
executing first a component character table search for thereby extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said component character table; and
executing subsequently a text body search by consulting the texts of the documents extracted through said component character table search for thereby extracting only the document which contains the designated search term and which satisfies query condition imposed on the search term.
23. A document data processing method for retrieving a document containing at least a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising steps of:
upon registration of text documents in said document database,
creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including katakana character, hiragana character, kanji character, alphabetic character, numeric character and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated; and
registering in said document database said condensed texts in addition to the texts of the documents to be registered; and
upon retrieval of the document containing the designated search term, executing a condensed text search by consulting the condensed texts of the documents for extracting only the documents corresponding to the condensed texts which contain the fragmental character strings constituting the search term designated by the operator to thereby select the documents containing the designated search term; and
executing a text body search for extracting a document which satisfies query condition imposed on the search term by consulting the texts of the documents extracted through said condensed text search.
24. A document data processing method for document retrieval according to claim 23, wherein said text body search is executed when said search term is constituted with a plurality of character species and when said query condition includes a positional condition of the search term in the next text.
25. A document data processing system for retrieving a document containing a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising:
for registration of text documents in said document database,
means for creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including include hiragana character, kanji character, alphabetic character, numeric character and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated;
means for creating a component character table in which characters occurring in each of said condensed texts are registered without duplication; and
means for registering in said document database said condensed texts together with said component character table in addition to the texts of the documents to be registered; and
for document retrieval,
component character table search means for extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said component character table;
condensed text search means for extracting only the documents corresponding to the condensed texts which contain the fragmental character strings constituting the search term designated by the operator by consulting the condensed texts of the documents extracted through the component character table search; and
text body search means for extracting a document which satisfies query condition imposed on the search term by consulting the texts of the documents extracted.
26. A document data processing system for retrieving a document containing a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising:
for registration of text documents in said document database,
means for creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including include hiragana character, kanji character, alphabetic character, numeric character and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated;
means for creating a component character table in which characters occurring in each of said condensed texts are registered without duplication;
means for registering in said document database said condensed texts together with said component character table in addition to the texts of the documents to be registered; and
means for storing the condensed text data in a RAM disk while storing the component character table in a semiconductor memory; and
for document retrieval,
component character table search means for extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said component character table;
condensed text search means for extracting only the documents corresponding to the condensed texts which contain the fragmental character strings constituting the search term designated by the operator by consulting the condensed texts of the documents extracted through the component character table search; and
text body search means for extracting a document which satisfies query condition imposed on the search term by consulting the texts of the documents extracted.
27. A document data processing system for retrieving a document containing a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising:
for registration of text documents in said document database,
means for creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including include hiragana character, kanji character, alphabetic character, numeric character and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated;
means for creating a component character table in which characters occurring in each of said condensed texts are registered without duplication; and
means for registering in said document database said condensed texts together with said component character table in addition to the texts of the documents to be registered and storing the text data and the condensed text data in a magnetic disk while storing said component character table in a semiconductor memory; and
for document retrieval,
component character table search means for extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said component character table;
means for checking the number of the documents extracted through the component character table search;
condensed text search means for reading out all of said condensed texts by neglecting the result of the component character table search, when said number of said extracted documents has attained a predetermined number, to thereby extract only the documents corresponding to the condensed texts which contain the fragmental character strings constituting the search term designated by the operator, while consulting the condensed texts of the documents extracted through said component character table search to thereby extract only the documents corresponding to the condensed text containing the fragmental character strings which-constitute the search term designated by the operator, when said number of said extracted documents is smaller than said predetermined number; and
text body search means for extracting a document which satisfies query condition imposed on the search term by consulting the texts of the documents extracted.
28. A document data processing system for retrieving a document containing a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising:
for registration of text documents in said document database,
means for creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including include hiragana character, kanji character, alphabetic character, numeric character and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated;
means for creating a component character table in which characters occurring in each of said condensed texts are registered without duplication; and
means for registering in said document database said condensed texts together with said component character table in addition to the texts of the documents to be registered and storing the text data and the condensed text data in a magnetic disk while storing said component character table in a semiconductor memory; and
for document retrieval,
component character table search means for extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said component character table;
means for checking the number of the documents extracted through the component character table search;
condensed text search means for reading out all of said condensed texts by neglecting the result of the component character table search only when said number of said extracted documents has attained a predetermined number, to thereby extract only the documents corresponding to-the condensed texts which contain the fragmental character strings constituting the search term designated by the operator; and
text body search means for extracting a document which satisfies query condition imposed on the search term by consulting the texts of the documents extracted, while consulting the condensed texts of the documents extracted through said component character table search to thereby extract only the document corresponding to the condensed text containing the fragmental character strings which constitute the search term designated by the operator, when said number of said extracted documents is smaller than said predetermined number.
29. A document data processing method for retrieving a document containing at least a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising steps of:
upon registration of text documents in said document database,
creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including hiragana character, katakana character, kanji character, alphabetic character, numeric character and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated;
creating a concatenated component character table by preparing, for each of the documents, information of all usable character strings each composed of at least two characters, said information including first information indicating those character strings which are used in the document to be registered and second information indicating those character strings unused in the document to be registered; and
registering in said document database said condensed texts together with said concatenated component character table in addition to the texts of the document to be registered; and
upon retrieval of the document containing the designated search term,
executing a component character table search for extracting all the documents in which all the character strings contained in the search term designated by the operator and each composed of at least two characters are used, by consulting said concatenated component character table;
executing a condensed text search by consulting the condensed texts corresponding to the documents extracted through said component character table search for thereby extracting only the documents which contain the fragmental character strings constituting the search term designated by the operator; and
executing finally a text body search for extracting a document from the documents selected through said condensed text search which document satisfies query condition imposed on the search term by consulting the texts of the documents extracted through said concatenated component character table search and said condensed text search.
30. A document data processing method for document retrieval according to claim 29, wherein in association with said concatenated component character table a bit list in which one-bit information are allocated to all usable character strings each composed of at least two characters, respectively, is prepared for each of said documents and wherein bit positions in said bit list for the character strings used in the documents are set to "1s", respectively, while the bit positions for the character strings not used in the documents are set to "0s", respectively.
31. A document data processing method for document retrieval according to claim 30, wherein said concatenated component character table is prepared on the basis of the individual character strings each constituted by a predetermined number n (where n is an integer greater than or equal to 2) of characters for each character species including hiragana character, katakana character, kanji character, numeric character, symbol character and symbol character.
32. A document data processing method for document retrieval according to claim 30, wherein said concatenated component character table is prepared by mapping sets of character codes to the bit list having a number of entries which is smaller than the number of combinations of the characters used actually by using a hash function.
33. A document data processing method for document retrieval according to claim 32, wherein each of the character strings used actually is decomposed on the basis of at least one of the character species including hiragana character, katakana character, kanji character, alphabetic character, numeric character, symbol and symbol character, and wherein said concatenated component character table is prepared by mapping sets of character codes to the bit list having a number of entries which is smaller than the number of combinations of the characters used actually by using a hash function.
34. A document data processing method for document retrieval according to claim 32, wherein use frequencies at which the character strings are actually used are checked, and upon mapping the sets of character codes to the bit list having a number of bits smaller than the number of the character strings used actually by the hash function, the character strings of a lower use frequency are mapped to a same bit.
35. A document data processing method for document retrieval according to claim 32, wherein the character codes are mapped to a number of codes of entries which is smaller than that of the characters used actually by using said hash function, whereon sets of the hashed character codes are mapped to the bit list having a number of entries smaller than the number of the actually used character strings by using another hash function.
36. A document data processing method for document retrieval according to claim 30, said concatenated component character table being prepared on the basis of the character strings each composed of n characters, wherein in the step of the concatenated component character table search, the document containing all the character strings each composed of n characters and contained without duplication in the search term designated by the operator is extracted by searching the bit list having the relevant bit positions all set to 1".
37. A document data processing method for document retrieval according to claim 30, said concatenated component character table being prepared on the basis of the character strings each composed of n characters, wherein in the step of the concatenated component character table search, the document containing all the character strings each composed of n characters and contained in duplication in the search term designated by the operator is extracted by searching the bit list having the relevant bit positions all set to "1".
38. A document data processing method for document retrieval according to claim 30, said concatenated component character table being constituted by character strings each composed of a given number of characters in a range of one to n, wherein when the search term designated by the operator is composed of a number of characters which is smaller than n, the result of said concatenated component character table search is outputted as the final result of the document retrieval, whereupon the search processing is ended.
39. A document data processing method for document retrieval according to claim 29, wherein said text body search is executed when said search term is constituted with a plurality of character species and when said query condition includes a positional condition of the search term in the next text.
40. A document data processing method for retrieving a document containing at least a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising steps of:
upon registration of text documents in said document database,
creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including hiragana character, katakana character, kanji character, alphabetic character, numeric character and other symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated;
creating a single component character table and a concentrated component character table by preparing, for each of the documents, information of all usable single characters and character strings each composed of at least two characters, said information including first information indicating those single-character and character strings which are used in the document to be registered and second information indicating those single-character and character strings unused in the document to be registered, respectively; and
registering in said document database said condensed texts together with said concatenated component character table in addition to the texts of the document to be registered; and
upon retrieval of the document containing the designated search term,
executing a component character table search for extracting all the documents in which all the character strings contained in the search term designated by the operator and each composed of at least two characters are used, by consulting said concatenated component character table;
executing a condensed text search by consulting the condensed texts corresponding to the documents extracted through said component character table search for thereby extracting only the documents which contain the fragmental character strings constituting the search term designated by the operator; and
executing finally a text body search for extracting a document from the documents selected through said condensed text search which document satisfies query condition imposed on the search term by consulting the texts of the documents extracted through said concatenated component character table search and said condensed text search.
41. A document data processing method for document retrieval according to claim 40, wherein in association with said concatenated component character table, a bit list in which one-bit information are allocated to all usable character strings each composed of at least two characters, respectively, is prepared for each of said documents and wherein bit positions in said bit list for the character strings used in the documents are set to "1s", respectively, while the bit positions for the character strings not used in the documents are set to "0s", respectively.
42. A document data processing method for document retrieval according to claim 40, wherein said text body search is executed when said search term is constituted with a plurality of character species and when said query condition includes a positional condition of the search term in the next text.
43. A text data creating method for creating a text database for storing document information as character code data, comprising steps of:
(1) fetching text data;
(2) determining frequencies at which individual character strings each constituted by a predetermined number n of characters are used in the text data and rearraying said character strings in a sequential order in dependence on said frequencies;
(3) establishing correspondences between said character strings and a number of entries which is smaller than the number of said character strings and storing said correspondences in the form of a hash table; and
(4) storing at the entry corresponding to the character strings used in said text data said character strings in the form of a componeht character table.
44. A full text retrieval method for retrieving a document containing a search term designated by an operator from a text data database registering therein document information as character code data while referring to textual content of said document, comprising steps of:
(1) fetching text data;
(2) determining frequencies at which individual character strings each constituted by a predetermined number n of characters are used in the text data and rearraying said character strings in a sequential order in dependence on said frequencies;
(3) establishing correspondences between said character strings and a number of entries which is smaller than the number of said character strings and storing said correspondences in the form of a hash table;
(4) storing at the entry corresponding to the character strings used in said text data said character strings in the form of a component character table;
(5) decomposing the search term designated by the operator into fragmental character strings each composed of n characters;
(6) extracting from said component character table those entries which correspond to said fragmental character strings resulting from said decomposition; and
(7retrieving said document in which all the character strings constituting said search terms exist, by consulting the entries extracted from said component character table.
45. A document data processing system for retrieving a document containing a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising:
for registration of text documents in said document database,
means for registering texts of documents to be registered;
means for creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including hiragana character, katakana character, kanji character, alphabetic character, numeric character and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create and register the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by-other character string is eliminated; and
means for creating a concatenated component character table by preparing, for each of the documents, information of all usable character strings each composed of at least two characters, said information including first information indicating those character strings which are used in the document to be registered and second information indicating those character strings unused in the document to be registered and registering said concatenated component character table in said database; and
for retrieval of the document containing the designated search term,
component character table search means for extracting all the documents in which all the character strings contained in the search term designated by the operator and each composed of at least two characters are used, by consulting said concatenated component character table;
condensed text search means for executing a condensed text search by consulting the condensed texts corresponding to the documents extracted through said component character table search for thereby extracting only the documents which contain the fragmental character strings constituting the search term designated by the operator; and
text body search means for executing a text body search for extracting a document from the documents selected through said condensed text search which document satisfies query condition imposed on the search term by consulting the texts of the documents extracted through said concatenated component character table search and said condensed text search.
46. A document data processing system for retrieving a document containing a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to textual content of said document, comprising:
for registration of text documents in said document database,
means for registering texts of documents to be registered;
means for creating condensed texts by decomposing each of textual character strings of the documents to be registered into fragmental character strings on the basis of at least one of character species including hiragana character, katakana character, kanji character, alphabetic character, numeric character and symbol character and checking mutual inclusion relations possibly existing among said fragmental character strings resulting from said decomposition, to thereby create and register the condensed texts each constituted by a set of the fragmental character strings in which any character string found to be included by other character string is eliminated;
means for creating a hash table by checking frequencies at which said fragmental character strings are used, determining a hash function on the basis of the frequency information and mapping said fragmental character strings to a bit list having entries in a number smaller than that of combinations of actually used character; and
means for creating a concatenated component character table by preparing, for each of the documents, information of all usable character strings each composed of at least two characters by consulting said hash table, said information including first information indicating those character strings which are used in the document to be registered and second information indicating those character strings unused in the document to be registered and registering said concatenated component character table in said database; and
for retrieval of the document containing the designated search term,
component character table search means for extracting all the documents in which all the character strings contained in the search term designated by the operator and each composed of at ieast two characters are used, by consulting said concatenated component character table;
condensed text search means for executing a condensed text search by consulting the condensed texts corresponding to the documents extracted through said component character table search for thereby extracting only the documents which. contain the fragmental character strings constituting the search term designated by the operator; and
text body search means for executing a text body search for extracting a document from the documents selected through said condensed text search which document satisfies query condition imposed on the search term by consulting the texts of the documents extracted through said concatenated component character table search and said condensed text search.
47. An index creating apparatus, comprising:
means for fetching data for retrieval;
counting means for determining frequencies at which characters contained in said data for retrieval are used;
sorting means for rearraying said characters in the order of frequencies at which said characters are used;
means for establishing correspondences between said characters and a number of bits, respectively, said bit number being smaller than that of said characters,
means for converting character codes of said characters to the corresponding bits; and
means for manipulating said bits on a bit-by-bit basis.
48. A document retrieval apparatus, comprising:
input means for inputting a search term;
means for extracting bit lists corresponding to character strings constituting said search term from a component character table;
means for logically ANDing said bit lists; and
means for transforming result of said ANDing operation into a document identifier affixed to a document.
49. A document data processing method for retrieving a document containing a search term designated by an operator from a document database registering therein document information in terms of character code data while referring to the textual content of said document, comprising steps of:
upon registration of text documents in said document database, creating a concatenated component character table in which character strings, each being constituted with n-characters (n<2) and occurring in the text documents, are registered without duplication for each of the text documents, and registering in said document database said component character table in addition to the texts of the documents to be registered; and
upon retrieval of a document containing the designated search term, executing first a component character table search for thereby extracting those documents which contain all species of characters constituting the search term designated by the operator by consulting said concatenated component character table; and
executing subsequently a text body search by consulting the texts of the documents extracted through said component character table search for thereby extracting only the document which contains the designated search term and which satisfies a query condition imposed on the search term.
50. A document data processing method for document retrieval according to claim 49, further including, upon registration of the text documents, a step of creating and registering an additional character table in which characters occurring in the text documents or character strings, each being constituted with characters of a number smaller than n and occurring in the text documents, are registered, wherein said additional character table is consulted instead of said concatenated character table when said search term is constituted with characters of a number smaller than n.
Description
BACKGROUND OF THE INVENTION
The present invention generally relates to a document data processing system and particularly to a full document retrieval system also known as a full text search system For searching and retrieving a full text of a document From a document database on the basis of a designated character string. In more particular, the present invention is concerned with a document retrieval method and system which is capable of speeding up a full text retrieval processing significantly by using an auxiliary File For the search processing.
In the document registration/retrieval systems known heretofore, such a scheme is generally adopted in which a word or term (referred to as a keyword) representing the content of a document to be registered is used as an index. According to this method, however, it is necessary to have an expert called "indexer" read thoroughly every document to be registered and assign pertinent keywords to the documents on the basis of his or her understanding of the contents thereof. As an attempt For evading such troublesome and time-consuming work For the document registration, there has been proposed a method according to which the words or terms occurring in the texts of a document are all registered as the keywords in an index file, as is disclosed, For example, in JP-A-63-198124.
However, the method mentioned above still suffers from a drawback that difficulty is encountered in determining a semantically meaningful word or term of a minimum unit upon preparation or creation of the index file. Besides, due to possible deficiency in a word dictionary and/or grammatical rules, analysis of sentences often Fails of success, presenting a problem that even an important word can not be extracted as the keyword.
As an approach to solve the above problem, there has already been proposed a full document retrieval system which is also referred to as the full text search system and in which documents are straightforwardly loaded in a database through the medium of a computer as texts composed of coded characters upon document registration, while upon retrieval of a document, contents of all the documents stored in the database are read to thereby retrieve the document containing a given or designated keyword (hereinafter referred to as "search term" to distinguish it from the authorized or controlled keyword used in conjunction with the conventional system), as is disclosed, for example, in an article entitled "Text Database Manage System SIGMA and Applications" contained in "Study Reports of The Information Processing Society of Japan: Informatics Fundamentals 14-7", Vol. 89, No. 66 (Jul. 27, 1989). This. full text search system Features among others a character-by-character based scanning of a whole text file from the beginning, as is described in the preamble of the second section of the abovementioned article. By virtue of this feature, it is possible to search or retrieve a document from the database by using the text body as a clue, even in the case where there is available no index file containing document identifiers corresponding to the keywords. In other words, by conducting a character-string based search for all the text data with the aid of a given search term, only the document in which the search term is described or contained can be outputted as the result of the retrieval.
This full document or text retrieval system takes, however, a lot of time for the search processing because the whole text file has to be scanned from the beginning on a character-by-character basis, incurring a problem that the full text search can not practically be applied to a large scale database. As stated also in the abovementioned article in the second section, the full text search system under consideration can realize only the search processing speed (rate) on the order of 2 MB/sec., even by resorting to the use of a general-purpose large scale computer. Of course, the processing speed on this order can afford a practically admissible search time so far as the capacity of a database is several megabytes or so. In reality, however, a database used in practice for the business purpose or the like usually demands a capacity of several hundred megabytes or so. In that case, the full text search system mentioned above will not be in the position to assure any satisfactory response time for the document search.
In an effort to cope with the difficulties mentioned above, the inventors of the present application have already proposed an information retrieval system in which the reading of text data as well as the search processing effected by using a search term are speeded up by providing hardware dedicated thereto, while performing in precedence to a text body search a presearch, so to say, on an auxiliary File in which the text data are previously stored in the compressed state, to thereby screen or shift the documents to undergo the text body search, with a view to realizing the full text search at an equivalently increased speed. In this conjunction, reference may be made to PCT/JP/90/00774, U.S. patent application Ser. No. 555,483, now U.S. Pat. No. 5,168,533 and WO/90/16036. More specifically, this information retrieval system features the presearch procedures referred to as a component character table search and a condensed text search, respectively, wherein the documents to be subjected to the text body search fare screened out (i.e. reduced in the number of documents) hierarchically, so to say, by executing stepwise the component character table search and the condensed text search. To say in another way, through the document screening or narrowing-down preprocessing, the number of the documents to be subjected to the text body search the time for which occupies a greater proportion of the whole search time can be decreased, which in turn means that the time taken for the search or retrieval processing as a whole can correspondingly be shortened, whereby the full text search can be realized at an equivalently increased speed.
According to the abovementioned hierarchical presearch featuring the system proposed by the inventors, the number of the documents is decreased first through the character-based search performed by consulting the component character table, which is then followed by second document number reduction through the word- or term-based search performed by using the condensed text table on the documents remaining even after the character-based search. In connection with the capacity of the database, it is to be mentioned that storage of a condensed text requires about 30% of the capacity for storing a text while the component character table requires 256 bytes per document.
In the information retrieval system mentioned above, however, no consideration is paid to the sentences or words in which the characters contained in the component character table are used, because the document screening or reduction is realized solely in dependence on whether or not a character constituting a part of the search term exists in the component character table. As a consequence, for an input search term composed of those characters which make appearance in the text at a high frequency, the component character table search can not afford a sufficiently high screening ratio for reduction of the documents, giving rise to a problem. In that case, the number of the documents to be subjected to the text body search will not be diminished to-such an extent which can assure a sufficiently high retrieval response.
As another approach for speeding up the full text search, there can be mentioned a method disclosed in an article entitled "Method of Speeding-Up Katakana Character Search in Full Document Retrieval By Using Character String Matching" contained in "Study Reports of The Information Processing Society of Japan: Database System 83-1" Vol. 91, No. 46 (May 24, 1991). According to this known method, positional information of all the characters appearing in a document is stored as the indexes on a character-by-character basis, wherein a document in which all the characters constituting a designated or inputted search term make appearance in succession is sought by reference to the indexes. This method requires, however, as many as about 40 KB for the indexes on the assumption that the positional information of four bytes is stored for each character in the case of a document containing ten thousand characters, by way of example. Accordingly, an attempt of structuring a text database containing such documents in a number of one hundred thousands or so will require a storage capacity of 4 GB for the indexes in addition to 2 GB for the storage of the documents themselves. Accordingly, it can be said by no means that such attempt is practical, in view of the enormous capacity demanded for the index storage.
SUMMARY OF THE INVENTION
In the light of the state of the art described above, it is an object of the present invention to provide a document data processing method for high-speed full document retrieval and an apparatus for carrying out the same which allow the retrieval or search result to be outputted within a practically acceptable search time even in the search of a large scale text database for practical application.
More particularly, it is another object of the present invention to provide a hierarchical presearch type document retrieval method incorporating component character table creation and search facilities which can afford a sufficient document screening capability for a given search term as well as a full document retrieval system for carrying out the method.
According to a first aspect of the present invention, there is provided a document data processing method for full document retrieval which comprises processing steps mentioned below as well as a system for carrying out the method. (1) A step of storing or loading texts themselves. (2) A step of decomposing texts as stored into a plurality of fragmental character strings at word level, checking inclusion relation possibly existing among the Fragmental character strings resulting from the decomposition and creating condensed texts, each composed of a set of fragmental character strings in which any character string included or covered by other character string is eliminated. (3) A step of creating a component character table in which characters used in the text are collected without duplication. (4) A step of dividing or splitting a given search term at character level and effecting a component character table search for extracting only the documents that contain all the characters constituting the search term. (5) A step of extracting the documents containing the given search term by consulting the condensed texts corresponding to the documents extracted through the component character table search. (6) When a given search query condition (i.e. statement of condition for search or search condition statement, to say in another way) designates positional relations among a plurality of given search terms in a text, a step of executing a text body search for extracting only the document that contains the given search terms and at the same time satisfies the query condition such as the positional relation among the search terms by consulting the text body data corresponding to the documents extracted through the condensed text search.
By adopting such hierarchical presearch mechanism that the documents subjected to the retrieval are decreased in number hierarchically through the component character table search and the condensed text search and finally undergo the text body search according to the teachings of the invention as described above, those documents which can not meet the given search query condition ape discarded through the component character table search and the condensed text search in precedence to the text body search, whereby the number of documents which ape to undergo the text body search for retrieving the text of document of concern can significantly be decreased (i.e. significant reduction of the documents can be realized before the text body search). Thus, the search time as a whole can be shortened owing to the reduction in the time needed for the text body search which occupies a large proportion of the whole search time.
For the illustrative purpose, let's assume that there is given a query condition statement reading "search a document having a text in which " " and " " occur in one and the same sentence". In that case, according to the search methods known heretofore which are designed to perform the search straightforwardly on the texts, it will take 250 seconds or about 4 minutes for searching all the texts of the 500 MB on the assumption that the search processing rate is 2 MB/sec. In contrast, when the hierarchical presearch taught by the present invention is adopted, the number of texts can be reduced to 10% of all the texts stored in the database through the component character table search and can Further be decreased through the condensed text search to 10% of the texts or documents remaining after the component character table search in a typical case. In this conjunction, assuming that the volume of the condensed texts is 30% of the texts, the volume of the condensed texts to be subjected to the search will be 15 MB because the capacity of the component character table is so small as to be neglected when compared with the capacity of the database as a whole. Consequently, the volume of the texts which are to undergo the text body search processing will amount to no more than 1% of the capacity of the database, i.e. 5 MB. Thus, the document retrieval processing can be completed within 10 seconds even with the search rate of 2 MB/sec.
In this manner, in the hierarchical presearch processing according to the first aspect of the present invention, the two presearches of "component character table search" and "condensed text search" are stepwise carried out preparatorily to sieve out the documents at "character level" and "term or word level", respectively, to thereby constrict (or reduce the number of) the documents which are to be subjected to the text body search to a possible minimum. By virtue of this feature, the number of the documents to undergo the most time-consuming text body search can be decreased, whereby the full text search can be carried out at a correspondingly increased speed.
Furthermore, when the query condition statement designates a single search term or prescribes AND, OR or NOT condition (Boolean condition) for plural search terms, the result of the condensed text search can be outputted as the final result of the document retrieval. This can be explained by the fact that the word or term existing in the condensed text need not be searched once again because such term exists in the text as well without fail. In that case, the text body search which takes a lot of time for the search at the word level can utterly be spared, whereby the whole search time can further be shortened.
As will be appreciated from the foregoing, a high-speed full document retrieval or full text search can be realized according to the document retrieval method which comprises the processing steps mentioned hereinbefore because the load for searching directly the texts can be reduced beforehand.
According to a second aspect of the present invention, there is provided a document data processing method for full document retrieval which comprises processing steps mentioned below as well as a system for carrying out the same.
Namely, upon document registration for implementing a database,
(1) a step of loading text data,
(2) a step of counting the frequencies at which character strings each composed of a predetermined number n of characters make appearance in the text data and rearraying the character strings in the order of respective frequencies,
(3) a step of establishing correspondences between the character strings and a number of entries which is smaller than that of the character strings and storing the correspondences in the form of a hash table, and
(4) a step of storing the character strings used in the text data at the entries corresponding to the character strings in the form of component character lists indicating existence of the character strings, respectively,
while upon retrieval of a designated document,
(5) a step of dividing or splitting a designated keyword or search term into fragmental character strings each composed of n characters,
(6) a step of extracting the entries corresponding to the character strings resulting from the splitting from the abovementioned component character table, and
(7) a step of searching a document in which all the character strings composing the keyword exist, by consulting the entries extracted from the component character table, to thereby perform the component character table search for extracting only the document that contains possibly the designated search term.
The second aspect of the present invention is thus concerned with an improvement of the character component table structure.
The procedure for preparing or creating the component character table through the abovementioned processing will be described below in some detail by referring to FIG. 34 of the accompanying drawings.
In the first step, the text data is decomposed into character strings each of a predetermined length n.
In the second step, decision is made as to which of the entries in the component character table a character string resulting from the decomposition corresponds.
In the third step, information indicating existence of the character string corresponding to the entry as decided in the second step is recorded.
A procedure for searching the component character table prepared in this manner will be described below by reference to FIG. 35.
For retrieval or search processing, a search term is split into fragmental character strings each of a same string length n in a first step as is in the case of creation of the component character table.
In a second step, entries of the component character table which correspond to the fragmental character strings are obtained by using a same hash table as used in the creation of the component character table.
In a third step, only the document that contains descriptions of all existence information at the entries of the component character table corresponding to all the fragmental character strings thus obtained is outputted as the result of the component character table search.
By preparing the component character table on the basis of character strings each having a predetermined length (predetermined number of characters) contained in the text data in this manner, those documents which can not be reduced in number through the search by using a single character which is frequently used in documents written in Japanese can efficiently be marked and filtered out by using a preceding and/or succeeding character as a clue. By way of example, let's consider a character string " ". In that case, the number of documents can not be reduced to more than about 20% with the single character search. In contrast, the search with the character string consisting of two characters allow the number of documents to be reduced down to 3%. Consequently, the number of the document of which text data has to be scanned is correspondingly decreased, whereby the full text or document search can be accomplished within a correspondingly shortened time.
In preparation of the hash table used in the second step of the component character table creation processing, the individual character strings ape so distributed that they can be hit as uniformly as possible on the basis of frequency information of the characters contained in the document to be registered. To this end, all the character codes ape previously checked as to frequencies at which the corresponding characters occur in the documents, whereon as many low-frequency characters as possible are mapped to a same entry of the component character table so that a maximum hit ratio can be attained on an average regardless of the characters used in the search term.
When the component character table is created on the basis of plural characters, it is necessary to create the component character table containing combinations of all the characters as the entries. In this conjunction, it is noted that in the case of the shift JIS code system, the number of characters amounts to "6,879", which in turn means that for creation of the component character table by combining, For example, two characters, the entries have to be prepared for as many character strings as 47,320,641 different combinations (=6,879.times.6,879). Obviously, this is unfavorable from the practical standpoint, because a memory of an enormous capacity will then be required.
To cope with this problem, the characters are first hashed on a character-by-character basis to be mapped to a smaller number of different characters such as, for example, 256 characters, whereby 65,536 combinations (=256.times.256) are prepared. Subsequently, the 65,536 combinations are again hashed to thereby create the component character table containing a reduced number of fragmental character strings. This table will hereinafter be referred to as the concatenated component character table for distinguishing it from the component character table used according to the first aspect of the invention. By way of example, the 65,536 combinations may be hashed to 2,048 entries. Owing to the hash processings at two steps as mentioned above, the concatenated component character table can be implemented with a practically acceptable memory capacity.
Upon execution of search by using, for example, a two-character based concatenated component character table, a given search term, for example, " " is split on a two-character basis as follows: " " " " " " " "
as shown in FIG. 36 at 1 or alternatively " " " " " " " " " " " " " "
as shown in FIG. 36 at 2, whereon a document containing all of-these character combinations is searched out from the concatenated component character table containing the character combinations as the entries as described above.
Further, to cope with the designation of the search term consisting of a single character, there is provided a component character table created on a single-character basis in addition to the abovementioned concatenated component character table. In that case, unless folding is effected by hashing, the result of the search of the single-character based component character table can be outputted as the final retrieval result, because no more than one character is mapped to the relevant entry. Thus, the document retrieval search can be completed upon completion of the search of the single-character based component character table for the given search term consisting of one character.
By creating the concatenated component character table in which a fragmental character string composed of n characters constitutes one entry, occurrence frequency of the character string can be suppressed lower when compared with that of the string consisting of one character, whereby a sufficient reduction of the documents with regard to the number thereof can be realized even if the characters consisting the search term are those used frequently. To say in another way, stable document screening or filtering function can be accomplished without being affected by the characters of the search term. By virtue of this feature, a relatively large number of documents which are irrelevant to the search term can be discarded through the search of the concatenated component character table, as a result of which the number of the condensed texts and hence that of the texts subjected to the subsequent text body search can significantly be reduced. This means that the time taken for the text body search which occupies a greater proportion of the whole search time as well as the time taken for the condensed text search can be reduced, whereby the whole search time is remarkably shortened.
Now, let's assume that there is designated a query condition statement reading "search a document having a text in which ` ` and ` ` coexist in a same sentence", which is one example of the query condition statement which also designates the positional relation between the two search terms in a text. In this case, according to the prior art method which is carried out by consulting directly the texts, it. takes 250 seconds or about 4 minutes fop performing the search on all the texts of 500 MB on the assumption that the search processing rate is 2 MB/sec.. Further, it is assumed that through the hierarchical presearch performed by using the one-character based component character table, the number of documents can be reduced or constricted to 30% of the whole volume of the database through the component character table search while it can be reduced down only to 1% of the whole database through the condensed text search. In that case, when the volume of the condensed texts is 30% of that of the texts, the volume of the condensed texts subjected to the search is 45 MB while that of the texts is 5 MB, i.e. 1% of the whole database capacity, with the volume of the component character table being neglected, then the search processing can be completed within 25 seconds at the search speed of 2 MB/sec.. In contrast, when the document number can be reduced to 10% of the whole database through the concatenated component character search, i.e. one-third of the document number reduced by the prior art method, then the volume of the condensed texts to undergo the search is 15 MB with that of the texts being 5 MB, which makes it possible to complete the retrieval processing within 10 seconds, meaning that the search or retrieving speed can be increased about 2.5 times as high as that of the prior art method.
In this manner, by executing hierarchically the two-level presearch processing with the aid of the concatenated component character table and the condensed texts, respectively, the documents can be screened or sieved out at two levels of the n-character-based fragmental character string level and the word level, respectively, to thereby reduce the number of the documents to be subjected to the text body search which is the most time-consuming processing, as a result of which the full document retrieval can be realized at a very high speed.
Further, when a single search term composed of less than n characters is given, the result of the concatenated component character table search can be outputted as the final result of the document retrieval. In this case, the retrieval result can be obtained within an extremely short time.
As will now be appreciated from the foregoing, the full document retrieval method comprising the aforementioned steps (1) to (8) can reduce remarkably the volume of the texts to be directly searched by virtue of the inventive hierarchical presearch, whereby the full document retrieval can be accomplished at an extremely high speed, to a great advantage.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram showing a general arrangement of a full document retrieval system according to a first embodiment of the present invention;
FIG. 2 is a schematic flow chart fop illustrating a registration processing for a hierarchical presearch according to an aspect of the present invention;
FIG. 3 is a schematic flow chart for illustrating a search processing involved in the hierarchical presearch according to an aspect of the invention;
FIG. 4 is a schematic flow chart for illustrating, by way of example, a procedure for creating a condensed text;
FIG. 5 is a view for illustrating, by way of example, a manner in which the condensed texts are stored;
FIG. 6 is a view showing schematically a structure of a component character table;
FIG. 7 is a view for illustrating schematically a component character table search procedure;
FIG. 8 is a PAD diagram showing a processing procedure involved in the hierarchical presearch;
FIG. 9 is a diagram for illustrating a component character table search processing according to a third embodiment of the invention;
FIG. 10 is a PAD diagram for illustrating a processing involved in code translation of a component character table used in the full document retrieval system according to a third embodiment of the invention;
FIG. 11 is a diagram for illustrating a code transformation of the component character table adopted in the full document retrieval system according to a fourth embodiment of the invention;
FIG. 12 is a view fop illustrating schematically a structure of the component character table employed according to a fourth embodiment of the invention;
FIG. 13 is a view for illustrating schematically a structure of the component character table employed in a fifth embodiment of the invention;
FIG. 14 is a PAD diagram for illustrating a processing procedure of the hierarchical presearch adopted in the fifth embodiment of the invention;
FIG. 15 is a view for illustrating schematically a general structure of the component character table employed according to a sixth embodiment of the invention;
FIG. 16 is a PAD diagram fop illustrating a processing procedure of the hierarchical presearch employed according to a sixth embodiment of the invention;
FIG. 17 is a view For illustrating a general concept underlying a component character table creation method according to a seventh embodiment of the invention;
FIG. 18 is a view showing schematically a structure of a character code/entry ID number correspondence table for illustrating a hash function employed according to the seventh embodiment of the invention;
FIG. 19 is a schematic flow chart for illustrating a method of creating a condensed text according to an eighth embodiment of the invention;
FIG. 20 is a schematic flow chart for illustrating a hiragana character string processing method for a condensed text used in the eighth embodiment of the invention;
FIG. 21 is a view showing, by way of example, a structure of a basic word dictionary referred to in a subsidiary word analysis adopted in the eighth embodiment of the invention;
FIG. 22 is a view showing, by way of example, conjunction rules referred to in the subsidiary word analysis adopted in the eighth embodiment of the invention;
FIG. 23 is a PAD diagram for illustrating a processing procedure of a hierarchical presearch adopted in the eighth embodiment of the invention;
FIG. 24 is a schematic flow chart for illustrating a method of creating a condensed text according to a ninth embodiment of the invention;
FIG. 25 is a PAD diagram for illustrating a processing procedure of a hierarchical presearch according to the ninth embodiment of the invention;
FIG. 26 is a schematic flow chart for illustrating a method of creating a condensed text according to a tenth embodiment of the invention;
FIG. 27 is a PAD diagram showing a processing procedure of a hierarchical presearch adopted in the tenth embodiment of the invention;
FIG. 28 is a schematic flow chart showing a method of creating a condensed text according to an eleventh embodiment of the invention;
FIG. 29 is a PAD diagram showing a processing procedure of a hierarchical presearch adopted in the eleventh embodiment of the invention;
FIG. 30 is a schematic functional block diagram showing a general arrangement of a full document retrieval system according to a twelfth embodiment of the invention;
FIG. 31 is a view similar to FIG. 30 and shows remaining parts of the system according to the twelfth embodiment of the invention;
FIG. 32 is a PAD diagram showing a processing procedure of a hierarchical presearch adopted in the twelfth embodiment of the invention;
FIG. 33 is a view for illustrating, by way of example, a structure of a component character table storing characters on a character-by-character basis;
FIG. 34 is a PAD diagram for illustrating a procedure for creating a component character table;
FIG. 35 is a PAD diagram showing a procedure for performing search on a component character table;
FIG. 36 is a view for illustrating manners in which concatenated character strings are prepared;
FIG. 37 is a schematic functional block diagram showing a general arrangement of the full document retrieval system according to a fifteenth embodiment of the invention;
FIG. 38 is a PAD diagram for illustrating a procedure of document registration;
FIG. 39 is a PAD diagram for illustrating a procedure of creating a concatenated component character table of character code dependent type;
FIG. 40 is a view showing, by way of example, a structure of the concatenated component character table;
FIG. 41 is a PAD diagram showing a control procedure of a hierarchical search;
FIG. 42 is a view showing, by way of example, a structure of the character code dependent type concatenated component character table;
FIG. 43 is a PAD diagram for illustrating a procedure of searching the character code dependent type concatenated component character table;
FIG. 44 is a view for illustrating schematically a component character table search;
FIG. 45 is a view showing schematically a general concept underlying the component character table search effected by using duplicative character strings according to a sixteenth embodiment of the invention;
FIG. 46 is a view showing entries of no use in the character code dependent type concatenated component character table;
FIG. 47 is a view for illustrating a code transformation processing of a component character table according to a seventeenth embodiment of the invention;
FIG. 48 is a view showing schematically a structure of a character code transformation type concatenated component character table;
FIG. 49 is a PAD diagram showing a procedure for creating the character code transformation type concatenated component character table;
FIG. 50 is a PAD diagram for illustrating a procedure of searching the character code transformation type concatenated component character table;
FIG. 51 is a PAD diagram for illustrating a hierarchical search control procedure executed by using a hashing type concatenated component character table;
FIG. 52 is a PAD diagram for illustrating a procedure of creating a hash type concatenated component character table according to an eighteenth embodiment of the invention;
FIG. 53 is a view showing schematically a structure of a hashing type concatenated component character table;
FIG. 54 is a PAD diagram for illustrating a procedure of searching a hashing type concatenated component character table;
FIG. 55 is a view showing schematically a structure of a character-species-based hashing type concatenated component table employed in a nineteenth embodiment of the invention;
FIG. 56 is a PAD diagram for illustrating a procedure of creating the character-species-based hashing type concatenated component character table;
FIG. 57 is a view showing character code ranges of characters of various species;
FIG. 58 is a PAD diagram for illustrating a procedure of searching a character-species-based hashing type concatenated component character table;
FIG. 59 is a schematic functional block diagram showing a general arrangement of the full document retrieval system according to a twentieth embodiment of the invention;
FIG. 60 is a view for illustrating determination of a standard or reference for the hashing employed in the preparation of a frequency information hashing type concatenated component character table;
FIG. 61 is a PAD diagram for illustrating a procedure for determining a standard or reference for the hashing employed in the preparation of the frequency information hashing type concatenated component character table;
FIG. 62 is a view for illustrating schematically a concept of the frequency information hashing;
FIG. 63 is a PAD diagram for illustrating schematically a frequency information hashing procedure;
FIG. 64 is a view showing schematically a structure of a hash table;
FIG. 65 is a PAD diagram for illustrating a procedure involved in searching the frequency information hashing type component character table;
FIG. 66 is a schematic functional block diagram showing a general arrangement of a full document retrieval system according to a twenty-first embodiment of the invention;
FIG. 67 is a PAD diagram for illustrating a procedure of preparing or creating a frequency information prehashing type concatenated component character table; and
FIG. 68 is a view for illustrating a method of accessing a concatenated component character table by using a prehash table.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
Now, the present invention will be described in detail in conjunction with preferred or exemplary embodiments thereof by reference to the drawings.
In the first place, the description is directed to a first aspect of the teachings of the invention incarnated in first to fourteenth illustrated embodiments.
Referring to FIG. 1, the first embodiment of the present invention will be described. A document data processing system illustrated in this figure comprises a display unit 100, a keyboard 101, a central processing unit or CPU 102, a storage file unit 110 including a magnetic disk or the like which serves as a storage medium for storing a component character table 105, condensed texts 104 and documents of texts 103, a floppy-disk driver of FDD 106 and a main memory 20. Further, a reference numeral 107 denotes a floppy disk.
There are stored in the main memory 200, a text registration program 201, a condensed text creation/ registration program 202, a component character table creation/registration program 203, a component character table search program 204, a condensed text search program 205, a text body search program 206, and a hierarchical presearch control program 207. Further, a data area 208 is secured on the main memory 200. The programs mentioned above are executed by the CPU 102.
For registration of a document, a corresponding command is inputted through the keyboard 101. In response to the command, the CPU 102 fetches document data from the floppy disk 107 placed in the floppy-disk driver 106 and executes the text registration program 201 to thereby store the fetched document data in the file 110 as a text 103. In this conjunction, it should be mentioned that the present invention is never limited to the inputting of the document data by using the floppy disk. The invention can equally be applied to such an arrangement in which the document data is loaded from other apparatus or system via a communication line or the like circuits. Subsequently, the CPU 102 executes the condensed text creation/registration program 202 to thereby divide or decompose the text 103 into fragmental character strings at a word level and check a mutual inclusion relation possibly existing among the fragmental character strings resulting from the decomposition to thereby eliminate those fragmental character strings which are included or covered by other fragmental character strings, as a result of which there is created a condensed text composed of a set of those fragmental character strings which bear no inclusion relation to one another. The condensed text 104 created or prepared in this manner is stored in the file 110. Finally, the CPU 102 executes the component character table creation/registration program 202 to thereby create the component character table 105 in which characters used in the text 103 are collected without duplication. The component character table 105 thus prepared is then stored in the file 110 as well.
In document or text search operation, a query condition statement (i.e. statement of condition for the search) is inputted via the keyboard 101 and supplied to the CPU 102, which responds thereto by executing first the hierarchical presearch control program 207, which is then followed by sequential executions of the component character table search program 204, the condensed text search program 205 and the text body search program 206 in this order under the control of the hierarchical presearch control program 207.
More specifically, upon execution of the component character table search, a search term (or terms) given by the inputted query condition statement is divided or split to constituent or component characters, whereon only those documents that contain all the characters constituting the search term are extracted. Next, the condensed texts which correspond to the documents extracted through the component character table search are consulted to thereby extract the documents which contain the given search term (or terms). In case the given query condition statement designates only a single search term (i.e. term serving as a keyword for searching or retrieving a document) or only a logical or Boolean relation among a plurality of search terms and unless it designates the positional relation of these search terms in the text, the text or document retrieval processing then comes to an end by outputting the result of the condensed text search as the final result of the document retrieval. In contrast, in the other case where the positional relation(s) or condition(s) among a plurality of search terms in the text is designated by the given query condition statement, the text data or text bodies corresponding to the documents extracted through the condensed text search are checked, whereby only the text which contains the given search terms and which satisfies the query condition concerning the positional relation imposed on the search terms is extracted to be outputted as the result of the retrieval as performed.
The above is an outline of the concept underlying the full text or document retrieval according to the first embodiment of the present invention.
In the following, description will be made generally of registration and search methods in conjunction with the hierarchical search processing which includes presearch steps of the component character table search and the condensed text search for screening and reducing the documents in respect to the number and the text body search according to the first embodiment of the present invention.
At first, it should be recalled that creation of the condensed text and the component character table is automatically effectuated upon registration of a document. A procedure of the processing involved in the creation and registration of the condensed texts and the component character table is illustrated in FIG. 2.
Referring to FIG. 2, when a document to be registered is loaded, the document is stored intact as a text. Subsequently, a condensed text is created or generated from this text. The condensed text is then prepared by decomposing the text into character strings on the basis of the character species or types such as Chinese character (kanji), cursive kana character (hiragana), square kana character (katakana), alphabetic character and others, while excluding duplication of any character string making appearance repetitively. Let's assume, by way of example, that a text of concern reads " . . . (A search technique for a fuzzy search . . . )", as exemplified by a text #1 shown in FIG. 2. In that case, the word " (search)" is discarded as a duplicative word, as a result of which there are left " (fuzzy)", " (search technique) and " (for)" as the fragmental character strings constituting a condensed text.
Further, a component character table is created from the text. To this end, characters appearing in the text are assigned or allocated with one-bit information. For example, in the case of the abovementioned text #1, bit information of "1" is set for " " and " " (hiragana characters), respectively, since they make appearance in the text 1, while bit "0" is assigned or allocated to " " which does not occur in the text #1. Likewise, bit "1" is set for " " and " " (Chinese characters), respectively. Through similar procedure, those characters of the component character table which are found in the text of concern are assigned with "1s", respectively, while the characters in the component character table which are absent in the text under consideration are affixed with "0s", respectively.
Through the procedure described above, the condensed text and the component character table are automatically created upon registration of a document to thereby make preparation for execution of the hierarchical presearch processing.
The text or document retrieval is carried out by consulting the auxiliary file(s) storing the condensed texts and the component character table in the order reversed relative to that for the registration, as is illustrated in FIG. 3.
More specifically, the component character table search is first carried out, whereby those component characters in the component character table which are assigned with "1s" and which correspond to all the characters appearing in a given search term are selected. In a second step, the condensed text search is effected, whereby the condensed texts containing the characters selected through the component character table search are checked to thereby pick up selectively the documents, if any, which contain the search term given by the query condition statement. Finally, in the text body search, only the text that contains the search term (or terms) which makes appearance in the text at a position (or positions) meeting the given query condition is selected. In the case of the example shown only for the illustrative purpose in FIG. 3, it is assumed that the following query condition statement is given:
" [4C] "
The above query condition statement prescribes "search a document containing a text in which terms ` ` and ` ` makes appearance in such proximity relation that both terms are not distanced from each other by more than four characters". As a result of this search processing, there is extracted a document containing a text #4 in which " " and " " occur at the respective positions which are distanced by four characters from each other.
Description which follows is directed to elucidation in concrete of a method of creating or preparing the condensed texts of character-species-based decomposed/duplication excluded type and a method of creating the component character table of character-code dependent type along with the hierarchical preseaPch control method in which the condensed texts and the component character table of the types mentioned above are made use of.
In the first place, description will be made of the character-species-based decomposed/duplication excluded type condensed text creation method adopted in the first embodiment of the document data processing system for full document retrieval according to the invention. As is illustrated in FIG. 4, a given text is decomposed into fragmental character strings on the basis of (or in accordance with) character types or species. As the character species, there may be mentioned "kanji (Chinese character)", "hiragana (Japanese cursive kana character)", "katakana (Japanese square kana character)", "alphabetic letter", "numeric character", "symbol" etc.. The text is decomposed into fragmental character strings each consisting of a string of characters of same type, e.g. kanji character string, hiragana character string, katakana character string and so forth. Next, any character string which results from the the decomposition mentioned above and included or covered completely by other character string also resulting from the decomposition of the same text containing the former is eliminated or excluded as a duplicative character string from the set of the fragmental characters constituting the corresponding condensed text. By way of example, let's consider a character string " (search)". It will be readily understood that this character string is completely included or covered by other character string " (intelligent search technique)" which exists in the same text. Accordingly, the character string " " is excluded from the registration. It should however be noted that the character string " (search)" can be hit in the condensed text search as a part of the character string " (intelligent search technique)", even though the string " " is not registered.
The character strings thus determined for registration while excluding duplication in the registration in this way are separated from one another by inserting a separator in each of texts or documents, as is illustrated in FIG. 5. In the case of the example illustrated in FIG. 5,there is employed a symbol "," as the separator. On the other hand, in the case of the examples illustrated in FIGS. 2 and 3, the separator is represented by a symbol ".vertline.". In this conncection, it is unnecessary to represent the separator in the form of a character. Any specific code which is not allocated to the character may equally be used as the separator to the same effect.
Next, description will be turned to a method of creating or preparing the character-code dependent type component character table used in the instant embodiment of the invention.
As is illustrated in FIG. 6, the character-code dependent type component character table is utilized for determing the bit position at which "1" is to be set as the information bit indicating the presence of a character as a character code. In the case of the example shown in FIG. 6, it is assumed that the shift JIS code system is adopted only for the illustrative purpose. In this figure, "(XXXX)H " represents characters in the hexadecimal notation. For giving indication that a character string " " exists in a text of a document #1, bit "1s" are set at the positions (8C9F)H and (8DFS)H in the bit list for the document #1. For convenience of the description, the bit position corresponding to a character of concern will be referred to as entry identifier or ID number of the component character table. Thus, the entry ID number (identifier) of " ", for example, is given by "(8C9F)H" or "35999" in the decimal notation.
By resorting to the component character table and the condensed texts described above, the hierarchical presearch control and the document (text) search operation are carried out in the manner which will be described below. At first, the search term designated by the query condition statement is split into individual characters in order to perform the component character table search. Through this component character table search, there are determined documents having the respective bit lists in which "1s" are set at the positions of the entry ID numbers in the list which correspond to the character codes constituting the given search term. By way of example, let's assume that a character string " " is given as the search term. In that case, there are retrieved as the result of the component character table search the documents #1, #2, #, #4, . . . , all of which have the respective bit lists in which "1s" are set at the bit positions corresponding to the codes (8C9F)H and (8DF5)H corresponding to " " and " ", respectively. In more particular, referring to FIG. 7, an AND operation is performed on a bit-by-bit basis between a bit list 701 having the entry ID number of "(8C9F)H" representing the character " " and a bit list 702 having the entry ID number of "(8DF5)H" representing the character " " to thereby derive a result of the bit-based AND operation in the form of a bit list 703. In this bit list 703 containing the results of the bit-based AND operation, the document ID numbers corresponding to the bit positions of "1" represent the documents hit in the course of execution of the component character table search. In other words, all the documents (texts) containing " " and " " are extracted as the result of the component character table search.
Parenthetically, in the case where the search term consists of only one character such as " " (kanji character meaning "lake" in English), the document retrieval is ended by outputting the result of the component character table search.
Next, search processing is performed on the condensed texts of the documents extracted through the component character table search. To this end, the contents of the condensed texts registered on a document-by-document basis as illustrated in FIG. 5 are scanned for thereby extracting the documents, if any, which contains the given search term as a word constituting a part of the document. In other words, in the case of the abovementioned example, only the document containing the two characters " " and " " which make appearance in succession are extracted. To say in another way, such documents which contain the characters " " and " " and in which these characters occur as parts of mutually different terms such as " " and " " are discarded. For this purpose, search is performed on a character-by-character basis as in the case of the text body search performed on the condensed texts of every documents retrieved through the component character table search processing. At that time, it is however sufficient to scan only the contents of the condensed texts corresponding to the document ID numbers obtained as the result of the component character table search. By way of example, in case the component character table search results in the retrieval of the document (ID) numbers #1, #2, #3, #4 and so forth, the condensed texts of the document ID numbers #1, #2, #3, #4 and so forth are scanned in the condensed text search processing, and the document(s) containing the search term existing in reality in the corresponding condensed text(s) is outputted as the retrieval result of the condensed text search processing.
As will be appreciated from the above description, in the hierarchical presearch scheme taught by the present invention incarnated in the illustrative embodiment now under discussion, two steps of the presearch, i.e. the component character table search and the condensed text search are previously performed to sieve out the documents at the character level and the word (or phrase) level, respectively, to thereby reduce previously the number of the documents which are to be subjected to the time-consuming body text search to a possible minimum, whereby the volume of the documents to undergo the text body search can be reduced correspondingly, which in turn means equivalently that the full text or document retrieval can be realized at a very high speed.
In more concrete, in the component character table search in which the presence of a character of concern is represented by one-bit information, the data volume to be searched for the retrieval can extremely be reduced with the time taken for the search being correspondingly shortened. Moreover, by logically ANDing the bit lists generated for the characters constituting parts of the search word (keyword), respectively, a relatively large number of the documents which are irrelevant to the search term(s) can be discarded, whereby the number of the documents to be subjected to the subsequent retrieval processing can remarkably be decreased.
Additionally, it is noted that in the condensed text search processing according to the invention, the time taken therefor can also be reduced because of a decreased amount of data as compared with that involved in scanning directly the texts.
Description will now be turned to a second embodiment of the full document retrieval method and system according to the invention. The second embodiment of the invention is also directed to the full document retrieval which allows the hierarchical presearch processings to be efficiently carried out even in the case where a plurality of search terms are designated.
By way of example, let's assume that a query condition statement prescribing "` ` AND ` `" is given. In that case, the component character table is searched as the first processing step. In this step, all the documents containing all the characters constituting parts of the given search terms are searched, which is then followed by searching the document to be outputted which satisfies the relation imposed on the search terms. For the query condition statement reading, for example, "` ` AND ` `", the documents containing two characters " " and " " as well as two characters " " and " " are searched. Namely, search is performed to find out the documents which satisfy the condition given below:
"(` ` AND ` `) AND (` ` AND ` `)"
To state in another way,
"` ` AND ` ` and ` ` AND ` `"
In other words, the documents containing concurrently the four characters mentioned above are searched.
Next, search is performed on the filtered condensed texts corresponding to the documents which have been found out as the result of the component character table search. In the condensed text search, only the documents in which the designated keywords make appearance as the semantically meaningful words (or phrases) are extracted. Namely, the document containing simultaneously both words or phrases (meaningful character strings) " " and " " are searched.
When the relation between the search terms is represented by the Boolean relation such as "AND", "OR" or the like and unless any other conditions prescribing the positional relation between the search terms (keywords) are given, the retrieval processing comes to an end, whereon the result of the condensed text search is outputted as the final result of the document retrieval processing. On the other hand, when any positional condition is designated, search is performed on the texts extracted through the condensed text search to thereby mark the text which satisfies the designated condition and output it as the final result of the document retrieval processing.
The retrieval or search operation of the full document retrieval system according to the instant (second) embodiment of the present invention will now be understood from the foregoing description. By performing the component character table search and the ANDing operation on the the search terms in the component character table search, the hierarchical presearch can efficiently be performed to thereby realize a high-speed full text retrieval even when a plurality of search terms are given.
Next, description will be made of a third embodiment of the present invention for elucidating the search control in the hierarchical presearch in general terms. FIG. 8 is a PAD diagram (Problem Analysis Diagram) for illustrating the control involved in the hierarchical presearch procedure. It is again assumed that a query condition statement reading as follows is given:
"` ` OR ` `"
The above statement commands that a document containing either " (computer)" or " (intelligent interface" is to be searched and retrieved.
At first, in a step 8000, the component character table search is performed. In this step, the documents which contain all the characters of the search terms are searched for each of the search terms as designated, which is then followed by the step of outputting the documents which can satisfy the compound condition imposed on the search terms. In the case of the example now under consideration, for each of three characters constituting a search term " ", the bit-based AND operation performed between the relevant entry ID numbers in the component character table, as is illustrated in FIG. 9. Subsequently, the bit-based AND operation is similarly performed between the relevant entry ID numbers in the component character table for each of the nine characters constituting " ". Finally, the result of the bit-based AND operation for " " and that for " " are logically ORed. To say in another way, the following search condition command is executed.
"(` ` AND ` ` AND ` `) OR (` ` AND ` ` AND ` ` AND ` ` AND ` ` AND ` ` AND ` ` AND ` ` AND ` `)"
As a result of this, there is extracted all the documents, if any, which contain all the three characters constituting " " or all the nine characters constituting " ".
In case the number of the document extracted through the component character table search mentioned above is zero, the search result indicating zero document (i.e. none of the documents) is outputted as the ultimate result of the document retrieval, as is shown in FIG. 8, whereupon the document retrieval processing comes to an end. Further, when the search term consists of only one character as in the case of " ", the retrieval processing is ended by outputting the result of the component character table search (step 8010 in FIG. 8).
When the search term is constituted by a plurality of characters and unless the result of the component character table search results in zero text, then the condensed text search is carried out in succession. The registered contents of the condensed text are composed of character strings resulting from the character-species-based decomposition as described hereinbefore. For the search term constituted by characters of different types or species such as exemplified by " (kanji plus katakana characters)" the term is decomposed into fragmental character strings in the condensed text such as " , " separated by the separator mark ",". Consequently, simple searching of a condensed text containing the heterogeneous search term such as " " will result in absence of the corresponding character string. Under the circumstances, the search term is checked before executing the condensed text search, to thereby decompose any search term consisting of different character species into character substrings each of the same or homogeneous character species. For convenience of the description, the search term undergone the decomposition based on the character species in this manner will be referred to as the split search term to distinguish the latter from the source search term in which the split search term origins. The condensed text search is then effected by using the split search terms " (intelligent)" and " (interface)" in addition to " (computer)" in the case of the aforementioned example. It should however be noted that the split search terms which originate in a same source search term are logically ANDed in the execution of the condensed text search. In the case of the query condition statement reading, for example,
" OR " "
the condensed text search is performed on the following condition:
( ) OR (" " AND " ")
The above condition commands that a document in which " " and " " exist concurrently or a document in which " " exists be searched.
When the result of the condensed text search is zero (no text), the search result of "zero" or "no document" is outputted, whereupon the condensed text search comes to an end. At this time, the text body search is performed only when a proximity condition or a contextual condition is designated or when a search term to be split such as " " is given (i.e. when the search term differs from the split search terms), Otherwise, the hierarchical presearch processing is completed by outputting the result of the condensed text search. At this juncture, the contextual condition (or simply the context) is such as given by the following condition statement;
" " [S] " "
which commands that a document in which " " and " " coexist in one and the same sentence be searched. Further, the proximity condition is, for example, such as described as follows:
" " [10C] " "
This proximity condition statement commands that a document in which " " and " " makes appearance in such proximity that both terms are distanced by no more than ten characters be searched.
In other words, the contextual condition and the proximity condition represent the query conditions designating the positional relations between the search terms appearing in a document.
When the query condition indicating the positional relation between the search terms appearing in a text is given or when a heterogeneous search term consisting of substrings of heterogeneous or different character species punctuated by the separator in the condensed text is presented, the text data corresponding to the result of the condensed text search is referred to, to thereby output as the result of retrieval only the document in which the search terms exist in the text in conformance with the given condition, whereupon the document retrieval processing comes to an end.
As can be understood from the above description, the hierarchical presearch can efficiently be carried out to thereby allow a high-speed full text retrieval to be realized ev |