|
|
|
Access augmentation or optimizing |
Presearch type document search method and apparatus6094647
Abstract
A method for making document information searches. In performing a document search with respect to the desired key word, two stages of presearch are carried out. In a first stage of presearch, a character component table in which an existence of character codes for every document is stated with respect to all the character codes contained in the group of document text data of stored documents is generated, and the character component table is searched for all the character strings constituting a desiredly designated search subject key word to thereby extract all the documents each containing all the character codes constituting the search subject key word. In a second stage of presearch, contracted text data for every document in which adjuncts and duplication of repeatedly stated words contained in advance in the text data are eliminated is generated, and the documents each containing the search subject key words by word are extracted from the documents extracted by the first presearch. After the second stage of presearch, text search is performed in accordance with a neighbor condition, a contextual condition, or the like.
Claims
We claim:
1. A document information search method for searching for specified text data containing a given search subject key word from a group of documents including document text data stored in advance, said method comprising the steps of:
registering text data of said documents as a source file,
generating a presearch file representative of component characters of the document text data in the source file,
generating a search query component character file comprising character components of the given search subject key word,
selecting as possible search documents for key word searching, selected ones of the documents having a presearch file including the character components of the search query component character file; and
key word searching of the selected ones for the given search subject key word.
2. The document information search method according to claim 1, wherein:
said presearch file is generated by extracting a character string composed of a combination of certain single characters or connected plurality of characters from said text file to generate a character code or character strings code indicating an existing information of said characters or character strings, or placing a coding conversion to said character code or character string code.
3. The document information search method according to claim 2, wherein:
a repetition of a same character is omitted in extracting said single character while extracting said single character or character strings from said text file, and character strings other than a search object included in said extracted character string is omitted in said character string extraction.
4. The document information search method according to claim 1, wherein:
dividing said text file according to a kind of character such as kanji, hirakana, katakana, alphabet or numeral while extracting said single character or character strings from said text file, and extracting character strings composed of a combination of single characters or said certain connected plurality of characters from said divided character or divided character strings, and generating said presearch file by generating a character code or character strings code indicating an existing information of said characters or character string, or placing a coded conversion to said character code or character strings code.
5. The document information search method according to claim 4, wherein:
a repetition of the same character in separated characters is omitted while extracting said single character or single character string from said text file, and character strings other than a search object included in a separate character string are omitted.
6. The document information search method according to claim 1, wherein:
generating a plurality of presearch files corresponding to a character string length extracted from said text file, and narrowing down an object text by presearching the presearch files for a short character string length extracted from said text file.
7. The document information search method according to claim 6, wherein:
while extracting said single character and single character string from said text by using said presearch file with said short extracted character length, a single character and single character string are omitted from a certain kind of characters.
8. The document information search method according to claim 1, wherein:
generating further presearch files by code converting said presearch file in multiple stages to compress information, and narrowing down said object text by presearching said further presearch files with said compressed information.
9. The document information search method according to claim 1, wherein:
said text file and said presearch file are generated for each text.
10. A computer readable medium for use in a document information search system for searching for specified text data containing a given search subject key word from a document text data stored in advance, said medium comprising:
a source file for registering text data of said documents;
a presearch file for generating component characters representative of the document text data in the source file,
a search query component character file comprising character components of the given search subject key word,
a first code selector for selecting as possible search documents for key word searching, selected ones of the documents having a presearch file including the character components of the search query component character file; and,
a second code selector for key word searching of the selected ones for the given search subject key word.
Description
TECHNICAL FIELD
The present invention relates to an information retrieval system and, particularly, relates to a scanning-type full text search method and system. More in particular, it relates to a document search method and apparatus suitable for preventing omission in search caused by differences in synonyms and notations at the time of searching using non-controlled key words (called free words). The present invention provides a method and apparatus suitable for collectively judging whether or not a plurality of term sets exist in a character stream to be searched. Further, the invention provides a collective type magnetic disk unit having a large storage capacity suitable for enforcement of the aforementioned method and suitable for writing and reading in a short time and a collective type magnetic disk device adapted for continuous writing and reading of a plurality of files.
BACKGROUND ART
In recent years, importance of large-scale data base service including not only secondary information (bibliographic information), such as literature information, patent information and the like, but primary information (text information) have become greater. Heretofore, a method using key words or classification codes has been used for retrieving information in such a data base (hereinafter sometimes abbreviated to "DB").
Key words are selected from a collection of control terms (called "thesaurus") by a specialist who conducts key word provision (called "indexing") at the time of registration of information into a data base. On the other hand, a DB searcher employs a system in which key words are selected from the thesaurus to perform search. However, the key word provision work includes very troublesome work. In short, a vocabulary suitable for expression of the contents of a document to be registered must be selected from the thesaurus after reading the contents thereof. If indexing is unsuitable, accurate information cannot be acquired from the data base. Accordingly, there arises a problem in that the indexing requires a specialist having special knowledge of contents of documents and being well versed in vocabulary registered in the thesaurus. Further, there arises a problem in that requested documents cannot be called or unnecessary texts may be mixed in the called documents if suitable vocabulary according to the thesaurus cannot be designated as key words at the time of searching.
Further, the classification system in the thesaurus always changes with the passage of time. There arises a problem in that key words and classification codes must be always updated.
Further, since a large time is required for indexing, new documents collected by considerable quantities must be registered by batch processing. Accordingly, there arises a problem in that search-enabled information is always a predetermined time behind the present time. In such circumstances, with the spread of DB, provision of a system in which everyone, not only a specialist in DB, can conduct both document registration and document retrieval easily by using free words (also called "non-controlled words") with no restraint of the thesaurus has been desired.
With the increase in data base scale, it becomes impossible to describe contents of documents fully in detail by use of only the control words in the thesaurus. Accordingly, the number of documents which can be narrowed down by retrieval using key words is limited to a range from the order of tens to the order of hundreds. To find a target one from the documents, there is no method but a method of reading the contents thereof directly. This causes a serious problem in search efficiency.
Attempts of automatic summarizing and automatic indexing have been made counter to the problem in the present searching method based on the indexing using control words in the thesaurus. However, the problem is not yet substantially solved because Japanese requires various dictionaries for the reason of its linguistical difficulty.
Omission in search may often arise in the searching process using such free words because of differences in notation or expression, though both a search term as a key word designated by the user and a target term used in the DB have one meaning. For example,
the term " (piano; katakana)" may be represented as " (piyano; katakana)", or
the term " (intafeisu; katakana)" may be represented as " (intafesu; katakana)", " (intafeisu; katakana)" or " (intafesu; katakana)". Hence, search for desired information often becomes impossible because of fine differences between variations in syllabic notation as described above.
Hereinafter, development into terms different in notation is called "different notation development". Development into other terms by use of a dictionary is called "synonym development". The concept "terms different in notation" is called "different notation".
As means for thoroughly solving these problems, there has been proposed a full text search system in which the searcher can search for the contents of documents on direct reference to the texts of the documents based on free key words (called "free words" or "non-controlled words").
In the following, the proposed system is described with reference to a typical example of construction thereof as shown in FIG. 1.
A search system 101 is connected to a host computer and performs both reception of a search request and transmission of a search result through a communication circuit. When a search request 107 is issued from the host computer, a search controller 103 accepts it, analyzes it and then sends corresponding search control information 108 to both a term comparator 105 and a query resolver 104. Further, the search controller 102 controls a storage controller 103 so that term data (text data) 111 stored in a search data base 106 are transferred to the term comparator 105.
The term comparator 105 makes a comparison between input term data and preliminarily set search terms (key words). When a matched term is detected, the term comparator 105 sends detection information 110 to the query resolver 104. The query resolver 104 judges whether or not the detection information 110 satisfies a complex condition pertaining to the positional relation, co-existence relation and the like between terms described in the search request. When it satisfies the complex condition, identification information for corresponding document data, as well as the content of the document as a search result 109, is returned to the host computer.
An example of such a prior art is described in the reading, R. L. Haskin and A. Hollaar: "Operational Characteristics of a Hardware-Based Pattern Matcher", ACM Trans. on Database System, Vol. 8, No. 1, 1983.
As a term comparison method in the term comparator 213 which is an important part of the term search system 200, a method of retrieving a plurality of terms by one scanning by use of a finite-state automaton is known. A typical example of the method is disclosed in the reading, A. V. Aho and M. J. Corasick: "Efficient String Matching", CACM, Vol. 18, No. 6, 1975.
Two methods for generating an automaton and a string matching method using the automaton are described in the aforementioned reading. In the following, the methods are described.
A first one of the methods (hereinafter referred to as "conventional method 1") will be now described with reference to FIG. 2. The drawing shows automaton state transition for searching term data for a key word " (intafesu; katakana)" given by a user. In the drawing, the circles represent automaton states, and the arrows represent state transitions. Respective characters given to the arrows represent input characters when state transitions corresponding to the arrows occur. In the case of representing negation, for exampler in the case of representing characters other than " (n; katakana)" and other than " (i; katakana)", such negation is expressed with a negation symbol "" added to the characters to be denied, for example, as --{"", ""}--. The arrow 403 represents a starting state in which state transition starts. Numerical values given to the inside of the circles represent state numbers. The double circle represents an ending state in which comparison of " (intafesu; katakana)" is finished. This method is characterized in that state transitions corresponding to all input characters having a possibility to be inputted are described by an automaton. Therefore, the number of state transitions increases. Accordingly, there arises a problem in that a very large time is required for generating an automaton when the number of key words increases.
In the following, the term comparing operation in the conventional method 1 is described with reference to the drawing. When a character is inputted into the automaton, a token is placed to reveal the state in which comparison of the input character is to be made. In short, the token is a mark for indicating the transit state position in the automaton. First, the token is initialized to be placed in the state 0 as a starting state. In this example, the token moves to the state 1 when the input character is " (i; katakana)". When, on the contrary, a character other than " (i; katakana)" enters, the token moves to the sate 0. When, on the other hand, the token is in the state 1 and the input character is " (n; katakana)", the token moves to the state 2. When the input character is " (i; katakana)", the token moves to the state 1. When the input character is not " (i; katakana)" and not " (n; katakana)", the token moves to the state 0. When the token is in the state 2 and the input character is " (ta; katakana)", the token moves to the state 3. When the input character is " (i; katakana)", the token moves to the state 1. When the token is in the state 3 and " (fesu; katakana) " enters, the token successively moves to the state 4, the state 5, the state 6 and the state 7. The double circle is given to the state 7, so that comparison of the term (intafesu; katakana)" is perfected.
Because state transitions corresponding to all input characters having a possibility to be inputted must be described in the automaton in the conventional method 1, the number of state transitions increases as the number of key words increases. Accordingly, there arises a problem in that a very large time is required for generating an automaton. Hardware for putting the method into practice has been disclosed in Japanese Patent Unexamined Publication No. Sho-60-105040.
In the following, a second method (hereinafter referred to as "conventional method 2") is described. The conventional method 2 is designed to shorten the time required for generating an automaton, compared with the conventional method 1. The automaton generation time in the conventional method 2 is improved greatly to be one-third as long as that in the conventional method 1. The conventional method 2 has been described in detail in Japanese Patent Unexamined Publication No. Sho-63-311530. The conventional method 2 is described now with reference to FIGS. 3 and 4. FIG. 3 shows state transition in the automaton in the case where the same term " (intafesu; katakana)" as in FIG. 2 is compared. The token is initialized to be placed in the state 0 as a starting state. When a character " (i; katakana)" enters, the token moves to the state 1 after comparison is made in the state 0 in which the token is placed. When, on the contrary, a character other than " (i; katakana)" enters, the token moves to the state 0.
When the token is in the state 1 and a character " (n; katakana)" enters, the token moves to the state 2. When the token is in the state 2 and a character " (ta; katakana)" enters, the token moves to the state 3. When the token is in the state 3 and a character (for example, " (i; katakana)") other than " (fu; katakana)" described in the automaton enters, a failure function is established in the conventional method 2 so that reference to a failure function table as shown in FIG. 4 is made. Number of state effected by failure, to be re-compared with the number of the state in which the token is placed, is stored in the failure function table. In this example, the value 0 of state effected by failure, corresponding to the current state number 3, is obtained, so that the token moves to the state 0. In the state 0, comparison of the input character " (i; katakana)" is made to thereby move the token to the state 1. The aforementioned function is called "failure function". When a string of characters " (ntafesu; katakana)" enter one by one, the token successively moves to the state 2, the state 3, the state 4, the state 5, the state 6 and the state 7. The double circle is given to the state 7, so that comparison of the term " (intafesu; katakana)" is perfected.
When, for example, the term " (intafesu; katakana)" is given as a key word, the term may be described in the text by different notation as a term different from the search term designated by the user.
In the text, the term " (intafesu; katakana)" using "- (minus sign)" instead of "-- (prolonged sound symbol)" may be used (this is called "prolonged sound different notation") or the term " (intafesu; katakana)" additionally using "" may be used (this is called "presence or absence of prolonged sound") or the term " (intafeisu; katakana)" using " (fei; katakana)" instead of " (fe; katakana)" may be used based on difference in pronunciation (this is called "phonetic different notation").
To search for all the terms, nine terms, that is, " (intafesu; katakana)", " (intafesu; katakana)", " (intafeisu; katakana)":, " (intafeisu; katakana)", " (intafeisu; katakana)", " (intafesu; katakana)", " (intafesu; katakana)" " (intafesu; katakana)" and " (intafesu; katakana)" formed by combination of these different notations must be all recognized as key words.
The aforementioned example is explained with reference to FIGS. 5 and 6. FIG. 5 is a view of automaton state transition in the case where term data are compared with the nine terms written in different notations.
Comparison is started from the head of the key word, so that the state branches off and leads to another state when there is difference in transition character.
When, for example, " (intafesu; katakana)" and " (intafesu; katakana)" are subjected to comparison successively from the head of the key word, the two are the same till " (inta; katakana)" and the two are different in transition character between the next characters " (fu; katakana)" and "". Therefore, there occurs branching of state transition that the state is transited from state 3 to sate 22 at the transition character " (fu; katakana)" and the state is transited from state 3 to state 4 at the transition character "".
In short, because respective transit states are assigned to different transition characters in a certain state, the automaton is shaped like a tree. FIG. 6 is an explanatory view of a failure function table showing transition in the case where a character not described in the automaton enters. When comparison including different notation is made as described above, there arises a problem in that the number of states increases because the number of key words increases.
Further, don't care characters may be used in key words in term retrieval. An example of use of fixed-length don't care characters in key words is explained with reference to FIGS. 7 and 8. FIG. 7 shows automaton state transition in the case where a key word "A?B" including a don't care character "?" having the fixed length of one character is retrieved. FIG. 8 is an explanatory view of a failure function table for indicating the number of state effected by failure in the case where a character not described in the automaton enters.
In this example, an automaton is generated by use of one-byte character codes (JIS codes). "?" is a character symbol which is allowed to satisfy an arbitrary character or symbol. Accordingly, transition based on a don't care character "?" is shown as transition based on all character codes 00-FF in the case where the state 1 in the drawing is a current state. In short, "A?B" shows a designation for retrieving terms composed of a head character "A", an arbitrary character and an ending character "B".
There arises a problem in that the number of states in the automaton increases greatly in spite of such a simple retrieval condition when fixed-length don't care characters enter.
As a method for solving the problem pertaining to different notation and synonym, Japanese Patent Unexamined Publication No. Sho-62-011932 has been proposed. In this quatation, different notation development is called "different notation generation" and synonym development is called "synonym extraction".
FIG. 9 is a block diagram showing an example of construction of the quotation.
In the construction, a search term written in romaji or katakana is once converted into a term written in katakana by a standard notation. In short, a standardizing process to collect a plurality of notations into one is first carried out by the operation reverse to the different notation generation. On the other hand, a search term written in alphabets is generalized to expression in katakana by borrowed word/kana conversion.
The katakana term thus once standardized is subjected to synonym development by use of a synonym dictionary, so that words synonymous to the input katakana term are sent out as katakana terms. After synonym extraction, the katakana terms are subjected to kana/kanji conversion, kana/borrowed word conversion and kana/romaji conversion to form kanji terms, alphabet terms of foreign origin and romaji terms, respectively.
As described above, the katakana terms as the result of synonym extraction are respectively converted into kanji, romaji, katakana and alphabet terms to thereby carry out different notation development.
In the conventional term search system 101 as shown in FIG. 1, a magnetic disk device capable of storing a large quantity of data is required as a search data base 106 which is one of constituent members of the term search system 101. A general magnetic disk device has a problem in that high-speed data input/output is impossible. On the other hand, a multi-head magnetic disk device in which high-speed data input/output is possible has a problem in that the device is very high in cost.
Therefore, a collective-type magnetic disk unit formed by connecting a plurality of general low-cost small-size magnetic disks to improve data input/output speed has been considered. As one of this type disk units, a "picture data division storage" is disclosed in Japanese Patent Unexamined Publication No. Sho-60-117326.
This unit has a plurality of magnetic disk devices, magnetic disk controllers in the same number as the magnetic disk devices, and a master controller for controlling data transmission between an input/output buffer and an external device. In the master controller, data given from the external device are divided into a quantity not more than the capacity of the input/output buffer. The divided data are successively transferred to the respective magnetic disk controllers. The magnetic disk controllers serve to write the data in corresponding magnetic disk devices, respectively. The master controller gives a seek instruction to magnetic disk controllers corresponding to magnetic disk devices free from the writing operation, to omit the apparent seek time of data-storage magnetic disk devices on and after the second magnetic device to thereby shorten data write/read time.
When the conventional search system as shown in FIG. 1 is applied to large-capacity database search, the following problems arise.
A first problem is in search time. When, for example, full text search is applied to 20000 documents having a capacity of 20 KB per document, 400 MB data must be scanned.
If data processing is carried out by the steps of: storing the 40 MB text data in the search data base; reading the data at a mean effective speed of about 1 MB/s; and performing comparison in the term comparator at the same speed, it takes about 7 minutes to perfect the search. There arises a problem in that the time required for reading the text data is too long to bear practical use when general magnetic disk devices are used. In short, it is necessary that the reading speed in the search data base for storing the text data is improved to the same degree as the processing speed in the term comparator. Here is a first object of the invention.
Even though the reading speed in the search data base is improved to the same degree as the processing speed in the term comparator or in other words even though the reading speed is improved to 10 MB/s, it takes still 40 seconds to perfect the scanning of the 400 MB text data. To shorten the search time to a practically allowable value of the order of several seconds is a second problem of the invention.
In respect to the technique for improvement in search speed, a "term search method" has been proposed in JP-A-62-241026. In the "term search method", a "table of distribution of frequency in use of characters" is generated by examining frequency in use of characters from the contents of text (called "data") in advance, in order to improve the processing speed in the process of searching a text data base (called "file") for a designated term.
The proposed method has the steps of: performing test search with reference to the "table of distribution of frequency in use of characters" with use of a character of lowest frequency in the key word designated by the user as a key; and performing comparison on characters before and behind the character if the character satisfies a specific condition.
Further, the JP-A-62-241026 has described that retrieval can be finished without text search in the case where the character of lowest frequency in the key word has a frequency of zero in the "table of distribution of frequency in use of characters".
In short, according to the JP-A-62-241026, the number of wasteful character comparisons can be reduced, thus to attain an effect that the search processing speed is improved.
However, the method is designed to generate a "table of distribution of frequency in use of characters" in the whole of a data base (file) to thereby search the data base for a text file (data) based on the table (refer to the drawing). Accordingly, the method has an effect in efficiency in search processing in the case where a key word pertaining to characters absent in the data base is retrieved. In general, the number of characters absent in the data base is reduced as the scale of the data base increases. There arises a problem in that the effect of the method in search processing disappears.
To solve the aforementioned problem in order to enforce efficient search processing to make equivalently high-speed full text search possible is the second object of the invention.
On the other hand, in full text search using free words, a difference in expression may arise between the key word designated by the searcher and the term described in the text though they are the same in meaning. In this case, documents having a different expression form are omitted from search to make it impossible to retrieve target documents. Examples of such terms are synonyms, different-form words (called "different notation words" or "different notation"), and the like.
Examples of words synonymous to " (keisanki; kanji)" are " (densikeisanki; kanji)", " (densanki; kanji)", "Computer", and the like. Examples of different notations with respect to " (konpyuta; katakana)" are " (konpyuta; katakana)", " (konpyuta; katakana)", " (konpyuta; katakana)", " (konpyuata; katakana)", " (konpyuta; katakana)", " (konpyuta; katakana)", " (konpyuta; katakana)", " (konpyuta; katakana)", " (konpyuta; katakana)", " (konpyuta; katakana)", " (konpyuta; katakana)", " (konpyuta; katakana)", " (konpyuta; katakana)", " (konpyuta; katakana)", and the like. Examples of different notations with respect to "Computer" are "computer", "COMPUTER", and the like. To cope with the problem in difference in notation between the key word designated by the user and the term described in the text of the document, the searcher must conduct search after designating all synonyms and different notations. However, it is practically difficult that the searcher designates all different notations, because hundreds of different notations may exist. To solve the problem is a third object of the present invention.
In the conventional method, expected development results in most cases cannot be attained, because information in an original term is changed at the time of the standardizing of notation.
This fact is explained with reference to the rule of partial term conversion
from " (hoo; katakana)" to " (hou; katakana)" to standardize katakana notation. When the conversion rule is applied, the term
" (jouhoo; katakana)"
However, when the same conversion rule is applied to a given term
" (jouhoon; katakana)"
the term is standardized to a false term
" (jouhoun; katakana)"
This has an influence on both synonym development process after the standardizing process and different notation development process following the synonym development process, so that expected development results cannot be attained.
One object of the present invention is to attain expected development results without performing the aforementioned standardizing process.
In the aforementioned quotation, the key word synonym development from " (keisanki; kanji)" to " (konpyuta; katakana)" based on a synonym dictionary is made by the steps of: converting once the search key word designated by the user into a katakana term; making the katakana term be subject to synonym development; and then making the resulting term be subject to kana/kanji conversion, kana/romaji conversion and kana/foreign language conversion. Therefore, the synonym dictionary must have an ability of development from katakana term to katakana term. In short, synonyms must be always written in katakana as follows.
Headword: " (konpyuta; katakana)"
Synonym 1: " (keisanki; katakana)"
Synonym 2: " (jouhoushorishouchi; katakana)"
This causes a problem in that the scale of the dictionary is enlarged, because output terms having an expression form corresponding to the synonym development must be registered in both the kana/kanji conversion dictionary and kana/borrowed word conversion dictionary. A large number of homonyms exist in the Japanese language. This causes failure in synonym development. For example, the term " (kensaku; katakana)" can be interpreted as " (kensaku; kanji)" or can be interpreted as " (kensaku; kanji)". There arises a problem in that the distinction between the two words cannot be recognized by the synonym dictionary using only katakana notation. Further, there arises a problem in that homonym selection in katakana/kanji conversion after synonym development is made by interactive processing.
Further, a foreign language/kana conversion dictionary, a kana/kanji conversion dictionary and a kana/foreign language conversion dictionary are required for converting the search key word into katakana and converting the katakana word into a suitable-form word after synonym development. There arises a problem in that a great deal of labor is required for generation and maintenance of dictionaries, because a great variety of large-scale dictionaries must be used.
In short, the third object of the invention is to solve the problem in homonyms at the time of kana/kanji conversion and kana/foreign language conversion and solve the problem in generation and maintenance of large-scale dictionary used for the aforementioned conversion.
In the case where hundreds of synonyms and different notations are considered as key words in retrieval, a term comparator for collectively comparing these words is required. When retrieval is made under the consideration of synonyms and different notations with no use of the term comparator, the search time increases by hundreds of times so that it cannot bear practical use. A fourth object of the present invention is to provide a term comparator in which search processing can be made with no reduction of the comparison speed even though hundreds of key words are designated.
In the conventional search method using an automaton, all key words including different notations are listed and developed. Further, an automaton is generated based on the key words. Because the automaton thus generated is shaped like a tree, a very large number of automaton states are required.
In the case of retrieval with don't care character designation, all combinations of character codes allowed by don't care characters are listed and developed into key words. Because the automaton is generated based on the key words, a very large number of automaton states are required similarly to the case of different notation.
As described above, the increase in the number of automaton states causes the increase in automaton generation time and, accordingly, the increase in the capacity of a transition table for storing the automaton, that is, the increase in hardware.
An object of the invention is to provide a search method using an automaton in which the number of states is reduced by describing collectively transition in the automaton in the case where retrieval is made under the consideration of different notations and with designation of don't care characters to thereby shorten the automaton generation time, and in which the capacity of a state transition table is reduced to thereby attain retrieval by compact hardware.
When document data are further successively registered in the text data base, the capacity of the magnetic disk device which forms a search data base becomes full at a certain point of time. In this case, it is necessary that the storage capacity of the system can be enlarged with no losing the stored data. In the case where the capacity of the search text data base is enlarged to a capacity for 100000 texts, that is, a capacity for 4 GB, processing time increases as the storage capacity of the magnetic disk device is enlarged. Accordingly, the original object cannot be attained. Therefore, it is necessary to enlarge the scale of the storage capacity with no deterioration in search time.
A fifth object of the invention is therefore to provide a search system having an architecture satisfying such a requirement.
In the search data base in the term search system, there are three important factors, namely, large storage capacity, continuous high-speed input/output of a plurality of files, and low cost. A collective-type magnetic disk unit satisfying these factors has been desired.
The conventional technique is designed to shorten data write/read time merely through omitting apparently access time of seek time. In short, there is no consideration of the number of magnetic disk devices required corresponding to the data transfer rate necessary for an external device. Accordingly, the conventional technique has a problem in the viewpoint of cost performance.
The conventional technique has an effect in that access time can be saved in the case where a file large in size of data such as picture data is stored separately in a plurality of magnetic disk devices. However, the conventional technique has a problem in that the access time becomes equal to the access time in one magnetic disk device in the case where writing/reading is carried out with respect to a file small in data size which can be stored in one magnetic disk device, because seek time cannot be omitted in this case.
Further, in the conventional technique, there is no consideration of continuous writing/reading with respect to a plurality of files. Accordingly, a write/read instruction from a higher-rank apparatus can be processed with respect to only one file. In access to a plurality of files, the file processing with respect to one file must be repeated. There arises a problem in that a large overhead time is required for the repetition.
As one component of the overhead time, there is the processing time required for retrieving information pertaining to magnetic disk device storage position from file identification codes to designate files as targets of access from the higher-rank apparatus.
In a conventional general magnetic disk device, a file identification code is expressed by a file name constituted by a string of character codes such as ASCII codes. Physical storage position must be found through retrieving file management information stored in the file management information area of the magnetic disk device, based on the file name. There arises a problem in that the processing time required for it is large.
An object of the invention is therefore to provide a low-cost collective-type magnetic disk unit which has such a large storage capacity that continuous high-speed input/output with respect to a plurality of files can be attained regardless of the size of the files.
On the other hand, document information is constituted by not only text data but graphic data such as pictorial data, photographic data, and the like. Accordingly, it is necessary to answer the requirement that the retrieved document can be seen in print image. A sixth object of the invention is to provide a search system having an architecture which can answer the requirement.
Further, the text data base is provided to be shared to a plurality of users. For example, it is necessary to make access to the text data base from a conversation-type workstation through LAN (local area network). Accordingly, the search system must have a function connected to the LAN to answer search requests from other workstations. A seventh object of the invention is to provide a full text search system having the aforementioned function.
A final object of the invention is to provide a full text search system which can answer the aforementioned problems.
A BRIEF SUMMARY OF THE INVENTION
To solve these problems, the following means are used in the text search system provided according to the present invention.
In accordance with the present invention, a document information search method is provided for searching for specific text data containing a search subject key word. A database or a group of previously stored text data is searched and entries likely to contain the search subject key word are extracted. More particularly, a presearch file is generated as representative of all the component characters of the document text. A search query component character file is also generated comprising the character components of the desired search subject key words. The invention has the advantage of utilizing the presearch file and the search query component character file to identify possible search documents for final key word searching. Specifically, selected words in documents having a presearch file including the character components of the search query component character file are the only documents that are selected for the key word searching for the desired search subject key words.
First, a search data base constituted by a plurality of magnetic disk devices is used in order to increase the reading speed of the search data base (storing text data) to the same degree as the processing speed of the term comparator. In short, a high reading speed is attained by a multiplex output obtained by simultaneously driving the magnetic disk devices in parallel.
According to the present invention, the collective-type magnetic disk unit comprises a plurality of data storages having magnetic disk devices, an input/output buffer for temporarily storing input/output data of the data storages, and a multi-disk controller for performing control of the data storages and the input/output buffer.
Further, each data storage may be constituted by a magnetic disk device provided with a magnetic disk controller or may be constituted by a plurality of magnetic disk devices provided with magnetic disk controllers, and a multiplexer for selecting magnetic disk devices.
Further, the input/output buffer has at least a capacity corresponding to one cylinder of a magnetic disk device, per data storage. The input/output buffer is constituted by one semiconductor memory or two semiconductor memories.
The semiconductor memory used may be replaced by other high-speed storage devices such as optical memories or the like.
The multi-disk controller for performing control of the data storages and the input/output buffer is constituted by a communication memory using a semiconductor storage device for storing requests from higher-rank apparatus, a multiplex controller for performing control of data transfer, a physical information table using a semiconductor storage device for retrieving physical storage position in the respective magnetic disk device, and a master controller for controlling these constituent parts. The semi-conductor storage device used in both the communication memory and the physical information table may be replaced by other high-speed storage devices such as optical memories or the like.
The master controller is constituted by a micro-computer to control the constituent parts.
As file identifiers, logical classification identifiers (called "ID") constituted by logical classification file identification codes given to files when files are logically classified into hierarchical groups and file IDs constituted by special numbers in the logical classification are used in the multi-disk controller.
The multi-disk controller may be designed so that a structure definition table for storing management information for determining the physical file storage position of the respective magnetic disk device according to the logical classification ID in the file ID is placed in the memory in the master controller.
The multiplex controller for performing control of data transfer between the higher-rank apparatus and the input/output buffer is constituted by a multiplexer for selecting a data bus of the input/output buffer, a DMA controller for performing data transfer regardless of the mater controller, a top address registration table for registering the top address of the data-transfer-requested area of the input/output buffer, and an end address registration table for registering the end address thereof.
Assuming now that the number of data storages is n; the data transfer rate from a magnetic disk device of a data storage to the input/output buffer is t [Byte/sec] when seeking operation is unnecessary because transfer data in the magnetic disk device exist in one track; the one-cylinder capacity of the disk device is M [Byte]; the minimum seek time of the magnetic disk device is s [sec]; the revolution velocity of the magnetic disk device is R [rps]; and the capacity of the output buffer is the same as the one-cylinder capacity M [Byte] of the magnetic disk device, then the data transfer rate from the collective-type magnetic disk unit to the higher-rank apparatus must satisfy the following condition.
When the minimum seek time s [sec] of the magnetic disk device is larger than the time (M/T) [sec] required for transferring one data of M [Byte] from the input/output buffer to the higher-rank apparatus, the data transfer time from the data storage to the output buffer is represented by the sum of the minimum seek time s [sec] of the magnetic disk device, the maximum rotation wait time (I/R) [sec] of the magnetic disk device, and the data transfer time (M/T) [sec] from the data storage to the input/output buffer. Accordingly, it is necessary that the sum is within the time (nM/T) [sec] required for transferring all data from the input/output buffer to the higher-rank apparatus.
This can be represent ed by the following equation. ##EQU1## Accordingly, the number, n, of the data storages can be represented by the following equation. ##EQU2##
When the minimum seek time s [sec] of the magnetic disk device is smaller than the time (M/T) [sec] required for transferring one data of M [Byte] from the input/output buffer to the higher-rank apparatus, data transfer from the data storage to the input/output buffer is not allowed because the data transferring operation of the input/output buffer is not yet finished though the seeking operation of the magnetic disk device is finished. Therefore, it is necessary to wait for perfection of data transfer from the input/output buffer to the higher-rank apparatus. In short, the data transfer time from the data storage to the output buffer is represented by the sum of the data transfer rate (M/T) [sec] from one input/output buffer to the higher-rank apparatus, the maximum rotation wait time (I/R) [sec] of the magnetic disk device, and the data transfer time (M/t) [sec] from the data storage to the input/output buffer. Accordingly, it is necessary that the sum is within the time (nM/T) [sec] required for transferring all data from the input/output buffer to the higher-rank apparatus. This can be represented by the following equation. ##EQU3## Accordingly, the number, n, of the data storages can be represented by the following equation. ##EQU4##
By providing a collective-type magnetic disk unit constituted by the minimum number of data storages satisfying these conditional equations, a magnetic disk unit which is so excellent in cost performance that the data transfer rate required by the higher-rank apparatus can be satisfied can be provided.
The data storage serves to store data files. By providing a data storage constituted by a magnetic disk device having a magnetic disk controller, data write/read control on magnetic disks can be carried out by the magnetic disk controller so that the processing load on the multi-disk controller can be reduced. Further, by providing a data storage constituted by a plurality of magnetic disk devices, and a multiplexer for selecting a data bus of the respective magnetic disk device to connect the data bus to the data bus of the input-out buffer.
The input/output buffer serves to temporarily store the input/output data of the data storages.
In the case of data writing, data are successively transferred from the higher-rank apparatus to the input/output buffer at a higher rate than the writing rate of the magnetic disk device in the data storage. The input/output buffer in which data transfer is finished serves to write data in the magnetic disk device at the writing rate of the magnetic disk device. In the case of data reading, the respective magnetic disk device performs data reading to the input/output buffer at the reading rate of the magnetic disk device. The input/output buffer in which data reading is finished performs data transfer to the higher-rank apparatus at a higher rate than the reading rate of the magnetic disk device. In short, the input/output of data with respect to the higher-apparatus can be made at a higher rate than the writing/reading rate of the magnetic disk device.
Further, by providing two input/output buffers corresponding to one data storage, data writing/reading between a second input/ output buffer and the data storage can be made while data transfer between a first input/output buffer and the higher-rank apparatus is made. Accordingly, the wait time of the magnetic disk device in which the data transferring operation of the magnetic disk device is stopped till perfection of data transfer between the input/output buffer and the higher-rank apparatus can be reduced, so that writing/reading can be carried out in a short time. The condition necessary for providing a magnetic disk unit excellent in cost performance to satisfy the data transfer rate required by the higher-rank apparatus can be represented by the equation (1).
The multi-disk controller serves to perform control of the data storages and the input/output buffers according to the data file write/read request issued from the higher-rank apparatus. When the communication memory is constituted by a semiconductor device capable of storing file IDs of target files, the overhead time required for reporting acceptance of an instruction from the higher-rank apparatus and perfection of processing can be reduced, so that continuous writing/reading on data files can be carried out in a short time.
When the physical information table is constituted by a semiconductor storage device capable of short-time access, the physical storage position of the magnetic disk device can be calculated in a short time from the physical file IDs. Accordingly, the overhead time required for reading data files can be reduced.
In the conventional technique, files stored in the magnetic disk device are identified by file names constituted by a variable-length string of character codes. According to the present invention, file IDs are constituted by a fixed length string of digit codes. Accordingly, the respective file ID can be expressed by a small-size code, so that designation of data files to be used for writing/reading and detection of the physical storage position can be simplified. Consequently, the overhead time required for the procedures can be shortened.
In the case where data files are stored, the seek time can be shortened by approaching the physical storage positions of files having relation logically to each other. Accordingly, the access time can be shortened.
The multiplexer in the multiplex controller serves to select the data bus of the input/output buffer. The top address registration table and the end address registration table store top addresses and end addresses for indicating storage areas of necessary data among data stored in the input/output buffer. The DMA controller serves to transfer data in areas of the input/output buffer designated by the top address registration table and the end address registration table, to the higher-rank apparatus at a high speed regardless of the master controller.
In the case where a plurality of files to be read exist in one cylinder of the magnetic disk device, the average rotation wait time is (1/2R) sec and the condition that the time required for reading the files at once is shorter than the time required for reading the files one by one can be represented by the equation ##EQU5## in which f1 [Byte] and f2 [Byte] represent sizes of necessary files to be read, k [Byte] represents the total size of unnecessary files not to be read, t [Byte/sec] represents the reading rate of the magnetic disk device, R [rps] represents the revolution speed of the magnetic disk device, and S [sec] represents the average seek time of the magnetic disk device.
The equation can be rewritten easily as follows. ##EQU6##
When the conditional equation is satisfied, the multiplex controller serves to read unnecessary files to the input/output buffer once and remove the unnecessary file portion at the time of data transfer to the higher-rank apparatus to thereby transfer only the necessary portion. Accordingly, the magnetic disk can read a plurality of files at once by one reading process, so that the access time produced in the reading process can be shortened.
Secondly, a term comparator which mounts hardware (called "search engine") exclusively used for term comparison based on a finite-state automaton method is used. The hardware exclusively used for term comparison makes it possible to retrieve about thousand key words collectively with no reduction of comparison speed.
To attain the foregoing object, a different notation search automaton is formed by making an automaton transition branch at the top of a partial term of a key word in which different notations and by collecting the branch transitions at the end thereof to thereby reduce the number of states. A don't care character designation search automaton is formed in the same manner as the different notation search automaton. In short, a done care character designation search automaton is formed by making an automaton transition branch at a done care character with regarding a group of characters group allowed by the don't care character as different notations and by collecting the branch transitions thereof at one state to thereby reduce the number of states.
As a result, a compact search system in which the time for generating the automaton is shortened and in which allowance of the state transition table can be reduced is provided.
An automaton generating method used as means for solving the problem in the increase of automaton states is now explained. The method is different from the conventional method 2 in the following points. In short, because the conventional method 2 uses "failure process", an automaton having branch state transitions like a tree must be generated for the necessity of calculating state effected by failure. Accordingly, the number of states increases. However, because the present method does not require "failure process", not only the number of state transition branches can be reduced but state transitions can be collected so that transit state can be shared. Accordingly, the number of states is reduced. (Hereinafter, the automaton generated by the present method is called collective transition allowance automaton.)
In the following, a method for collecting state transitions is explained.
FIG. 46 shows state transition in the automaton generated by the present method.
The automaton in the drawing is provided for retrieval with respect to the same nine key words as in the automaton shown in FIG. 5, namely, " (intafesu; katakana)" and different notations thereof, such as " (intafesu; katakana)", " (intafeisu; katakana)", " (intafeisu; katakana)", " (intafeisu; katakana)", " (intafesu; katakana)", " (intafesu; katakana)", " (intafesu; katakana)" and " (intafesu; katakana)".
These can be represented by a complex expression character stream (expression 1) as shown in the under portion of FIG. 9. ##EQU7## Different notations of " (fe; katakana)" represented by ##EQU8## will be explained hereunder.
Because " (fe; katakana)" can be replaced by " (fei; katakana)" based on a phonetic different notation, the notations ##EQU9## can be obtained.
Because "--" as the long sound of " (fe; katakana)" can be replaced by "--" based on a long sound different notation, the notations ##EQU10## can be obtained.
When the long sound different notations ##EQU11## are applied to ##EQU12## the following notation can be obtained. ##EQU13##
Hereinafter, the relationship between interchangeable terms is called equivalence.
Further, the transit state is expressed as state 5.
By using this method, the number of automaton states can be reduced to about one-third the number of states in the automaton in FIG. 5.
Thirdly, a two-stage presearch means is provided as a method for accelerating scanning type full text search. As a first stage presearch used is a character component table search means in which only documents containing characters constituting a designated key word are extracted by use of a character component table in which characters written in a contracted text (which will be described later) are expressed as one-bit information. As a second stage presearch used is a means (called contracted text search) in which only documents having a designated key word are extracted by scanning a data file (called "contracted text") formed by removing adjuncts, such as postpositional words respectively functioning as an auxiliary to a main word, conjunctions, and the like, from a text and removing words appearing repeatedly. Accordingly, the text as related to the documents narrowed down by the two-stage presearch is read from the magnetic disk device and scanned (called "text search") to thereby attain equivalently very high-speed full text search. Hereinafter, examination for successively narrowing down the number of documents through character component table search and contracted text search is called "hierarchical presearch".
Fourthly, a query resolver to make retrieval under combination of not only the logical condition, the neighbor condition and the contextual condition possible in order to attain fine retrieval peculiar to full text search is provided.
Fifthly, a full test search system (called "search machine") is formed by collecting units arranged in parallel and a controller as a higher-rank device for controlling the units, each of the units being formed by collecting the search data base constituted by the plurality of magnetic disk devices, the term comparator, the presearch means and the query resolver. The aforementioned arrangement not only matches with not only a large-capacity data base but matches with increase in the capacity of the data base through extension of units in the machine.
Sixthly, an LAN connection means for connecting the text search machine to the LAN is provided in the text search machine to give service to a plurality of users and make the construction of a large-scale text data base possible. The aforementioned construction not only matches with the large-scale data base through connection of a plurality of search machines to the LAN but matches with the increase in the capacity of the data base through extension of machines in the LAN.
Seventhly, a synonym development means and a different notation development means for automatically carrying out a synonym and different notation development process are provided in the text search machine in order to solve the problem in synonyms and different notations. Desired documents can be entirely retrieved by full text search with use of the developed vocabulary as key words.
To solve the problem, a term given through the keyboard as shown in FIG. 26 is developed as follows: the term is once subjected to different notation development; the terms obtained by the different notation development are respectively subjected to synonym development by reference to a synonym dictionary; and then the terms obtained by the synonym development are respectively subjected to different notation development.
The outline of the different notation and synonym development processes is shown in FIG. 27. A key word (term) 2701 designated by a user is once subjected to different notation development. The terms 2702 obtained by the different notation development are respectively subjected to synonym development by using a synonym dictionary 2710. Then, the terms 2703 obtained by the synonym development are respectively subjected to different notation development to thereby obtain terms 2704 as a results of development.
By carrying out different notation development before synonym development, the development processes can be carried out with no change of information to standardize notation. Accordingly, because a dictionary can be formed with no consideration of expression and notation of terms in the synonym dictionary, generation or edition of the dictionary becomes easy. Different notations of terms newly obtained by synonym development can be obtained by different notation development thereof.
In the following, different notation development as an important means in the present invention is explained. Different notation development is made as follows. First, an input term is divided by character types into three groups of partial terms, *namely, kanji and hiragana terms, katakana terms and alphabet terms. The partial terms thus obtained are successively subjected to a term replacement process using a conversion rule table to carry out different notation development on kanji terms and katakana terms. In respect to alphabet terms, alphabet characters in the input terms are subjected to code conversion, for example, minuscule-to-majuscule conversion or majuscule-to-minuscule conversion.
The conversion rule table herein used has a plurality of conversion rules for replacing a partial term in the input term with a list of terms.
For example, in the case where a term " (iu; katakana)" is developed into two terms " (iu; katakana)" and " (yuu; katakana)", the conversion rule is represented by the following description.
[" (iu; katakana)".fwdarw."(" (iu; katakana)", " (yuu; katakana)"]
Conversion of a partial term into a list of terms is called "replacement".
For example, the following conversion rules are considered as conversion rules for different notation development of kanji terms and katakana terms.
(1) Case of kanji and hiragana terms
(a) Conversion rule pertaining to development corresponding to difference in notation between the old style and the new style of kanji. Example:
[" (sei)".fwdarw.("", "", "", "")]
(b) Conversion rule pertaining to development corresponding to difference in notation between declensional kana of kanji. Example:
[" (yomitori)".fwdarw.("", "")]
(2) Case of katakana
Conversion rule pertaining to development into various notations in similar syllables. Example:
[" (pia)".fwdarw.(" (pia)", " (piya)"]
As described above, kanji and hiragana terms are subjected to different notation development by using the conversion rule table.
In the following, different notation development of romaji is explained.
Hepburnian system notation, directive system notation and mixture notation thereof are considered as different notations of romaji. Accordingly, it is necessary to generate a conversion rule adopted to both syllabic notations of Hepburnian system and directive system. For example, a conversion rule for replacing a partial term
"SHI" as romaji in the Hepburnian system with a list of two partial terms
"SI" and "SHI" as romaji respectively in the directive system and the Hepburnian system is generated as follows.
["SHI".fwdarw.("SI", "SHI")]
In short, different notation development of romaji can be method, a method comprising the steps of converting all romaji terms into katakana terms, and replacing respective syllables in the katakana terms by both Hepburnian system notation and directive system notation may be used. For example, a romaji term
"SISHAMO"
is once converted into a katakana term
""
and then converted into romaji, for example, by use of the following conversion rule.
["".fwdarw.("SI", "SHI")]
After the development by character types as described above, developed partial terms are combined in the order of the initial term to form a final output of different notation development.
In short, the different notation development process can be summarized as follows.
(1) Different notation development pertaining to kanji and hiragana terms (declensional kana endings and new and old kanji styles).
(2) Different notation development pertaining to katakana terms.
(3) Different notation development pertaining to romaji terms (Hepburnian system and directive system notations).
(4) Different notation development pertaining to alphabet characters (majuscules and minuscules).
However, it is not always necessary to use all of the different notation developments. By providing means for selecting the kind of the conversion rule table to be used, wasteful development process can be eliminated and, accordingly, search process at the desire of a user can be attained.
In the following, synonym development as another important means in the present invention is explained. As synonym development, the following four developments are carried out on the input term by using a synonym dictionary.
(1) Same-rank term development
Development into vocabulary of the same rank in concept
Example:
Development from " (keisanki; kanji)" to " (kompyuta; katakana)" and " (johoshorisochi; kanji)"
(2) Higher-rank term development
Development into vocabulary having a hihger-rank meaning
Example:
Development from " (keisanki; kanji)" to " (denshikiki; kanji)"
(3) Lower-rank term development
Development into vocabulary having a lower-rank meaning
Example:
Development from " (keisanki; kanji)" to " (denshitakujokeisanki; kanji)"
(4) Related term development
Development into vocabulary having a related meaning
Example:
Development from " (keisanki; kanji)" to " (ofisuotomeishon; katakana)"
Further, by providing means for selecting suitable one from the four developments in synonym development process similarly to the different notation development process, more flexible search at the desire of a user can be attained.
By the aforementioned means, a search term given by a user is developed as follows. First, the search term is subjected to different notation development. Terms obtained by the different notation development are one-by-one subjected to synonym development. New terms obtained by the synonym development are subjected to different notation development.
Because different notation development is made before synonym development, omission of information caused by standardization of notation can be avoided so that expected development results can be always attained.
Further, because it is not necessary to standardize notation in the synonym dictionary, generation and maintenance of the dictionary can be simplified. Because different notation development is further made after the synonym development, there is no necessity of describing variations of notation in the synonym dictionary. Accordingly, the size of the dictionary can be reduced.
As described above, according to the present invention, the following effects can be attained.
Firstly, a high reading speed can be obtained by generalizing output data obtained by driving simultaneously a plurality of magnetic disk devices which are arranged in parallel to form a search data base for storing text data.
Secondly, collective search of about one thousand key words can be made with no reduction of the comparison speed, by use of a term comparator which mounts hardware exclusively used for term comparison based on a finite-state automaton method. This meaning is that search of key words including synonyms and different notations can be perfected only by scanning the text data base by once.
Thirdly, the number of times for reference to a text stored in a magnetic disk can be reduced by providing two-stage presearch means as a scanning type full text search accelerating method. In short, the quantity of text search processing which accounts for a large part of search time can be reduced, by which the search time as the whole can be shortened.
In order to carry out the hierarchical presearch, it is necessary to prepare auxiliary files "contracted text" and "character component table" before the search. The auxiliary files "contracted text" and "character component table" are generated automatically at the time of document registration. The content of the process is shown in FIG. 16.
In the drawing, a given document to be registered is stored as "text" in a magnetic disk device.
Then, "contracted text" is generated from the "text". In short, the "contacted text" is generated by removing adjuncts (not used in search) from the "text" and removing overlapping terms repeatedly appearing in the "text". In the case of document 1 having the text " (aimai kensaku notameno kensakugijutsu; hiragana and kanji)" , an adjunct " (notameno; hiragana)" and an overlapping term " (kensaku; kanji)" are removed, so that terms " (aimai; hiragana)" and " (kensakugijutsu; kanji)" are left as "contracted text".
Finally, "character component table" is generated from the "contracted text". In this embodiment, characters which appear in the "contracted text" are respectively expressed by one-bit information. In the case of document 1, bit data "1" is given to columns for expressing characters " (a; hiragana)" and " (i; hiragana)" because those characters are present. Bit data "0" is given to a column for expressing a character " (u; hiragana)" because such a character is absent. Similarly, bit data "1" is given to columns for expressing characters " (ken; kanji)" and " (saku; kanji)" because those characters are present. In short, bit data "1" is given to each character column (for expressing a specific character) in the character component table when the character is present in the "contracted text", and on the other hand, bit data "0" is given to each character column when the character is absent.
As described above, the "contacted text" and the "character component table" are generated automatically at the time of document registration, thus to prepare for hierarchical presearch.
At the time of search, reference to these auxiliary files is made in the order reverse to registration, as shown in FIG. 17. Let it be supposed that a conditional expression "[4c]" is given. The conditional expression expresses retrieval of a document containing two terms " (kensaku; kanji)" and " (rikai; kanji)" arranged at a distance of not more than 4 characters.
As a first step, the character component table is searched. The extracted by this search is only one document or documents containing all constituent characters of designated key words. In the case as shown in the drawing, all documents containing both constituent characters of " (kensaku; kanji)", that is, two characters " (ken; kanji)" and " (saku; kanji)" are retrieved. The document retrieval method comprises ANDing (making logical product) bits given to the columns "" and "" in the character component table. Documents having a result "1" are documents containing the two characters "" and "". On the contrary, documents having a result "0" are documents not having one of the two characters "" and "" or not having the two characters. Accordingly, the documents having a result "0" can be removed from the subject of search which will be made hereinafter.
Because the existence of a character is expressed by one-bit information (also called "bit list") in the character component table, the quantity of data to be searched can be reduced to a small quantity. As this result, search time can be shortened. Further, documents not related to key words can be cut off remarkably based on ANDing bits of constituent characters of key words in the bit list. Accordingly, the number of documents as a target of search can be narrowed down greatly.
Then, the contracted text of the documents narrowed down by the character component table search is searched. In the case of extracting only documents containing designated key words as words as shown in the drawing, such documents containing two characters "" and "" appearing continuously are extracted. In short, documents containing two characters "" and "" appearing as different words such as for example two words "" and "" in the document 3 are cut off here.
In respect to the term " (rikai; kanji)", the same procedure as described above is applied to character component table search and contracted text search. Each of documents thus finally remaining is subjected to text search to examine whether the text thereof satisfies the complex condition designated by the search conditional expression. In the case as shown in the drawing, a document satisfying the neighbor condition "[4c]" designated by the search conditional expression is retrieved. As this result, document 4 containing two words "" and "" arranged at a distance of four characters is retrieved.
As described above, in the "hierarchical presearch method", the two stages of presearch of "character component table" and "contracted text" are carreid out preliminarily to perform screening through the sieves of "character level" and "word level" to thereby narrow down the number of documents as a target of text search which, in general, requires a largest time. Thus, equivalently very-high-speed full text search can be attained.
Fourthly, fine search peculiar to full text search can be attained by providing a query resolver for extracting one satisfying logical condition, neighbor condition and contextual condition described in the search expression, based on the output result of the search engine in text search. In respect to logical condition, search operations, such as OR operation, AND operation, NOT operation, and the like, can be attained. In respect to neighbor condition, inter-character distance conditional search in which an upper limit or a lower limit is given to the number of characters between key words can be attained in the case of Japanese or inter-word distance conditional search in which an upper limit or a lower limit is given to the number words between key words can be attained in the case of English. Examples of neighbor condition are "inter-character distance condition" adopted to Japanese and "inter-word distance condition" adopted to English. Examples of inter-character distance condition are as follows.
"[8c]" (1)
"[10c]" (2)
"[8c, 10c]" (3)
"<10c>" (4)
The conditional expression (1) "[8c] expresses retrieval of a document (or documents) containing two terms " (bunsho; kanji)" and " (kensaku;kanji)" arranged in the order and at a distance of not more than 8 characters. Accordingly, in FIG. 14, documents 1 and 2 are detected.
The conditional expression (2) "[10c]" expresses retrieval of a document (or documents) containing two terms " (bunsho; kanji)" and " (kensaku; kanji) " arranged at a distance of not more than 10 characters regardless of the order of the two terms, that is, regardless of whether " (bunsho; kanji)" appears before " (kensaku; kanji)" or whether " (kensaku; kanji)" appears before " (bunsho; kanji)". Accordingly, in FIG. 14, documents 1, 2 and 3 are detected.
The conditional expression (3) "[8c, 10c]" expresses retrieval of a document (or documents) containing two terms " (bunsho; kanji)" and " (kensaku; kanji)" arranged at a distance of from 8 characters to 10 characters regardless of the order of the two terms. Accordingly, in FIG. 14, documents 2 and 3 are detected.
The conditional expression (4) "<10c>" expresses retrieval of a document (or documents) containing two terms " (bunsho; kanji)" and " (kensaku;kanji)" arranged at a distance of not less than 10 characters regardless of the order of the two terms. Accordingly, in FIG. 14, documents 3 and 4 are detected.
Examples of inter-word distance condition are as follows.
"text [8W] retrieval" (5)
"text [10w] retrieval" (6)
"text [8w, 10w] retrieval" (7)
"text <10w> retrieval" (8)
The conditional expression (5) "text [8W] retrieval" expresses search for a document (or documents) containing two terms "text" and "retrieval" arranged in the order and at a distance of not more than 8 words.
The conditional expression (6) "text [10w] retrieval" expresses search for a document (or documents) containing two terms "text" and "retrieval" arranged at a distance of not more than 10 words regardless of the order of the two terms, that is, regaredless of whether "text" appears before "retrieval" or whether "retrieval" appears before "text".
The conditional expression (7) "text [8w, 10w] retrieval" expresses search for a document (or documents) containing two terms "text" and "retrieval" arranged at a distance of from 8 words to 10 words regardless of the order of the two terms.
The conditional expression (8) "text <10w> retrieval" expresses search for a document (or documents) containing two terms "text" and "retrieval" arranged at a distance of not less than 10 words regardless of the order of the two terms.
In respect to contextual condition, a search function for searching for a document (or documents) containing key words coexisting in one sentence or in one paragraph can be attained.
Examples of contextual condition search adopted to both Japanese and English are as follows.
"[P]" (9)
"[p]" (10)
"[S]" (11)
"[s]" (12)
"[PH]" (13)
"[ph]" (14)
The conditional expression (9) "[P]", expresses search of a document (or documents) containing two terms " (bunsho; kanji)" and " (kensaku; kanji)" arranged in the order and in one paragraph.
The conditional expression (10) "[P]" expresses search of a document (or documents) containing two terms " (bunsho; kanji)" and " (kensaku; kanji)" arranged in one paragraph regardless of the order.
The conditional expression (11) "[S]" expresses search for a document (or documents) containing two terms " (bunsho; kanji)" and " (kensaku; kanji)" arranged in the order and in one sentence.
The conditional expression (12) "(s)" expresses search for a document (or documents) containing two terms " (bunsho; kanji)" and " (kensaku; kanji)" arranged in one sentence regardless of the order.
The conditional expression (13) "[PH]" expresses search for a document (or documents) containing two terms " (bunsho; kanji)" and " (kensaku; kanji)" arranged in the order and in one phrase. In the case of Japanese, the phrase means a composition set off by "," and ".largecircle.". In the case of English, the phrase means a composition set off by ","and ".".
The conditional expression (14) "[ph]" expresses search for a document (or documents) containing two terms " (bunsho; kanji)" and " (kensaku; kanji)" arranged in one phrase regardless of the order.
Examples of logical condition search adopted to both Japanese and English are as follows.
"[AND]" (also expressed as "and (,)") (15)
"[OR]" (also expressed as "and ()") (16)
"[NOT]" (also expressed as "and ()") (17)
The conditional expression (15) "[AND]" expresses search for a document (or documents) containing two terms " (bunsho; kanji)" and " (kensaku; kanji)".
The conditional expression (16) "[OR]" expresses search for a document (or documents) containing " (bunsho; kanji)" and " (kensaku; kanji)".
The conditonal expression (17) "[NOT]" expresses search for a document (or documents) not containing two terms " (bunsho; kanji)" and " (kensaku; kanji)".
Fifthly, not only matching with a large-capacity text data base but matching with increase in the capacity of the data base through extension of units in a search machine can be attained by providing such a search machine formed by collecting units arranged in parallel and a controller as a higher-rank device for controlling the units, each of the units being formed by collecting the search data base constituted by the plurality of magnetic disk devices, the term comparator, the presearch means and the query resolver.
Sixthly, a LAN connection means for connecting the text search machine to the LAN is provided in the text search machine to thereby give service to a plurality of users and make the construction of a large-scale text data base possible. In short, the aforementioned arrangement not only matches with the large-scale data base through connection of a plurality of search machines to the LAN but matches with the increase in the capacity of the data base through extension of machines in the LAN.
Seventhly, full text search automatically using all developed vocabulary, such as synonyms, different notations and the like, as key words can be made with no necessity that a user has to be conscious of these problems, by providing synonym development means and different notation development means for carrying out a synonym and different notation development process in the inside of the text search machine. Further, desired documents can be entirely retrieved with no occurrence of omission caused by difference in expression or notation.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram showing a conventional retrieval system;
FIGS. 2, 3, 5 and 7 are explanatory views showing the theory of term retrieval using a finite-state automaton;
FIGS. 4, 6 and 8 are explanatory views of failure function tables in the conventional system;
FIG. 9 is a block diagram showing conventional construction of different notation development;
FIG. 10 is a block diagram showing the outline of a first embodiment of the present invention;
FIG. 11 is a view showing an example of comparison position information;
FIG. 12 is a view showing an example of search engine output information containing comparison position information;
FIG. 13 is a view showing in detail the query resolver;
FIG. 14 is a view showing an example of retrieval using two key words;
FIG. 15 is a view showing a full text search accelerating procedure as one of features of the present invention;
FIG. 16 is a view showing a text registration procedure;
FIG. 17 is a view showing a procedure for performing retrieval processing based on a character component table registered and generated in FIG. 16;
FIG. 18 is a view showing an example of construction of the character component table and an example of search using the character component table;
FIG. 19 is a view showing generation of a contracted text;
FIGS. 20A, 20B, 21, 22, 23 and 24 are pad views showing character component table searching procedures;
FIG. 25 is a block diagram showing a modification of the embodiment shown in FIG. 10;
FIG. 26 is a block diagram showing an embodiment of the invention in which a series of synonym/different notation development processes as one of features of the present invention is carried out;
FIG. 27 is a view for explaining the outline of the series of processes in the embodiment in FIG. 26;
FIG. 28 is a block diagram showing the embodiment;
FIG. 29 is a view showing an example of steps of the different notation development process;
FIG. 30 is a block diagram of a different notation development means;
FIG. 31 is a view showing a conversion rule application process in the different notation development portion by use of a katakana term;
FIG. 32 is a pad view showing the different notation development process;
FIG. 33 is a view showing an example in which headword retrieval is carried out by use of a finite-state automaton;
FIG. 34 is a view of a goto function table in the finite-state automaton;
FIG. 35 is a view of an output table in the finite-state automaton;
FIG. 36 is a pad view showing a method for generating both the goto function table and the output table in the retrieval automaton;
FIG. 37 is a katakana different notation conversion table;
FIG. 38 is a different notation conversion table pertaining to the new and old styles of kanji;
FIG. 39 is a different notation conversion table pertaining to kana added to kanji;
FIG. 40 is a romaji/katakana conversion table;
FIG. 41 is a block diagram showing an embodiment in which development modes in the different notation development means can be selected;
FIG. 42 is a view showing the control states of conversion portions, development portions a switches in the different notation development;
FIG. 43 is a view showing a synonymous dictionary;
FIG. 44 is a view showing the outline of a method for searching headwords of the synonymous dictionary by using an index table;
FIG. 45 is a block diagram of a character retrieval circuit using a finite-state automaton as an embodiment of the invention;
FIGS. 46, 47, 48, 49, 50, 51, 52 and 53 are explanatory views showing the theory of term retrieval using a finite-state automaton in this embodiment;
FIG. 54 is an explanatory view of a goto function table in this embodiment;
FIG. 55 is an explanatory view of an output table;
FIG. 56 is a block diagram showing an example of construction of a collective-type magnetic disk device as an embodiment of the present invention;
FIG. 57 is a block diagram showing another embodiment;
FIGS. 58A and 58B are views showing an example of construction of a structure definition table;
FIGS. 59A, 59B, 59C and 59D are views showing an example of construction of a storage position pointer table;
FIGS. 60A, 60B, 60C and 60D are views showing an example of construction of a physical information table;
FIG. 61 is a flow chart of file writing in the embodiment shown in FIG. 57;
FIG. 62 is a timing chart of a file writing process in the collective-type magnetic disk device in FIG. 57;
FIG. 63 is a flow chart of a file reading process in the embodiment in FIG. 57;
FIG. 64 is a view showing an example of construction of a multiplex controller;
FIG. 65 is a timing chart of a file reading process in the collective-type magnetic disk device in the embodiment in FIG. 57;
FIG. 66 is a timing chart of a file reading process in the collective-type magnetic disk device in the embodiment in FIG. 57p;
FIG. 67 is a timing chart of a file reading process in the collective-type magnetic disk device composed of three magnetic disk devices in the embodiment in FIG. 57;
FIG. 68 is a timing chart of a file reading process in the collective-type magnetic disk device composed of four magnetic disk devices in the embodiment in FIG. 57;
FIG. 69 is a timing chart of a file reading process in the collective-type magnetic disk device composed of five magnetic disk devices in the embodiment in FIG. 57;
FIG. 70 is a timing chart of a file reading process in two collective-type magnetic disk devices in the embodiment in FIG. 56;
FIG. 71 is a block diagram showing an embodiment in which the present invention is applied to LAN;
FIG. 72 is a block diagram showing a modification of the embodiment shown in FIG. 71;
FIG. 73 is a block diagram showing another modification of the embodiment shown in FIG. 71;
FIG. 74 is a block diagram showing a further modification of the embodiment shown in FIG. 71; and
FIG. 75 is a view showing an example of a RAM disk device.
BEST MODE FOR CARRYING OUT THE INVENTION
A first embodiment of the present invention will be described with reference to FIG. 10.
The search system in this embodiment comprises a keyboard 1101, a search machine control computer (CPU.sub.0) 1150, a display 1120, an automaton generating computer (CPU.sub.1) 1105a, a bit search computer (CPU.sub.3) 1107a, a string search engine 1106, a query resolving computer (CPU.sub.2) 1156a, a search result storage memory 1146, and a text data file 1110. In the search machine control computer (CPU.sub.0) 1150, a search expression analysis program 1102, a synonym/different notation development program 1103a, a query analysis program 1141a, a search execution control program 1108 and a search result indication program 1147 are executed. In the automaton generating computer (CPU.sub.1) 1105a, an automaton generating program 1105 is executed. In the bit search computer (CPU.sub.3) 1107a, a bit search program 1107 is executed. In the query resolving computer (CPU.sub.2) 1145a, a query resolving program 1145 is executed.
A search conditional expression given through the keyboard 1101 is analyzed based on the search expression analysis program 1102 on the search machine control computer (CPUO) 1150. In short, a search conditional expression is separated, based on the search expression analysis program 1102, into a key word portion and a complex condition description portion in which a complex condition constituted by inclusion conditions and arrangement conditions is described. The inclusion conditions are described as logical conditions. The arrangement conditions are described as neighbor conditions or contextual conditions. After separation or extraction, the key word portion and the complex condition description portion are respectively delivered to the synonym/different notation development program 1103a and the query analysis program 1141a on the CPU.sub.0 1150.
Based on the synonym/different notation development program 1103a, synonyms with respect to the given key word and different notations thereof are found by reference to a synonym dictionary included in the program and according to conversion rules. For example, when a key word " (keisanki; kanji)" is given, synonyms thereof, such as " (densanki; kanji)", " (konpyuta; katakana)" and the like, are generated and different notations thereof, such as " (konpyuta; katakana)" and the like, are generated.
In the synonym development, synonyms include not only the same-rank words as described above, but higher-rank words, lower-rank words, related words and the like. An example of higher-rank word is " (denshikiki; kanji)", or the like. An example of lower-rank word is " (dentaku; kanji)", or the like. An example of related word is " (ofisuotomeshon; katakana)", or the like.
Different notation development includes katakana development, kanji and hiragana development, and alphabet development. The shown in the drawing is katakana development. The kanji and hiragana development includes new/old style conversion and okurigana development. An example of the new/old style conversion is conversion from " (sei; kanji)" to " (sei; kanji)" and (sei; kanji)". An example of okurigana development is development from " (yomitori; kanji)" to " (yomitori; kanji and hiragana)" and " (yomitori; kanji and hiragana)". The alphabet development includes Hepburnian system development of romaji, directive system development of romaji and majuscule/minuscule development of alphabets. An example of Hepburnian system development of romaji is development from " (katakana)" to "TISIKI (romaji)". An example of directive system development of romaji is development to "CHISHIKI (romaji)". An example of majuscule/minuscule development of alphabets is development from "TISIKI" to "tisiki".
The aforementioned development rules in synonym development and different notation development can be designated by a user so that a suitable combination thereof can be selected.
Examples of synonyms in English are as follows.
______________________________________
looking glass
.fwdarw.
mirror
pingpong .fwdarw. table tennis
the Lord .fwdarw. God
typhoon .fwdarw. cyclone .fwdarw. hurricane
HAL .fwdarw. Hitachi America Limited
ws .fwdarw. work station
______________________________________
Examples of different notations in English are as follows.
______________________________________
center .fwdarw. centre
liter .fwdarw. litre
brier .fwdarw. briar
humor .fwdarw. humour
modeler .fwdarw. modeller
Chile .fwdarw. Chili
orangutan .fwdarw. orangoutan .fwdarw. orangoutang
MacDonald .fwdarw. McDonald
______________________________________
Examples of synonyms in German are as follows.
______________________________________
Brief .fwdarw. Schreiben
Mostert .fwdarw. Mostrich
Maschine .fwdarw. Motor
______________________________________
Examples of different notations in German are as follows.
______________________________________
Foto .fwdarw. Photo
Coda .fwdarw. Koda
Code .fwdarw. Kode
Buffet .fwdarw. Buffet
Friburg .fwdarw. Fribourg
______________________________________
The group of key words thus generated by synonym and different notation development are delivered to the automaton generating program 1105 on the automaton generating computer (CPU.sub.1) 1105a.
Based on the automaton generating program 1105, an automaton for performing collective comparison with respect to the key word group delivered from the synonym/different notation development program 1103a is generated. Hundreds of development results may be obtained by synonym/different notation development in the case where the number of initially given key words is large.
Search for the key words one by one in given text data makes high-speed retrieval impossible. Accordingly, it is necessary to search for the key words collectively by scanning the text data only once. As a method of collectively comparing a plurality of key words (also called "multiplex comparison"), a comparison method using an automaton is known. As an example of the method, a method of executing the automaton by hardware has been proposed in Japanese Patent Unexamined Publication No. Sho-63-311530. The search engine 1106 is a high-speed multiplex term comparison circuit attained by the growth of the method. Accordingly, in the automaton generating program 1105, key word identification code information to be compared with a state transition table provided in the search engine 1106 is generated to be transferred to the search engine 1106.
The group of key words obtained by synonym/different notation development based on the synonym/different notation development program 1103a are, together with corresponding key word identification codes (also called "key word identifiers"), delivered to the bit search program 1107 on the bit search computer (CPU.sub.3) 1107a.
On the other hand, in the query analysis program 1141 on the search machine control computer (CPU.sub.0) 1150 receiving the complex condition description portion of the input search conditional expression from the search expression analysis program 1102, conditions such as neighbor condition, contextual condition, logical condition and the like are analyzed to form designation distance information (for indicating distance between the identification codes of the designated key words), designation contextual code information and designation logical condition code information as control information for judging the conditions, so that the information is delivered to the query resolving program 1145 on the query resolving computer (CPU.sub.2) 1145a.
When the aforementioned search expression analyzing process, synonym/different notation development process, automaton generating process and query analyzing process are finished and when the control information is delivered respectively to the bit search program 1107 on the bit search computer (CPU.sub.3) 1107a, the search engine 1106 and the query resolving program 1145 on the query resolving computer (CPU.sub.2) 1145a, a search process is started.
The search process is controlled by the search execution control program 1108 on the search machine control computer (CPU.sub.0) 1150. In short, in the search execution control program 1108, search text data are read from the text data file 1110 while operating the bit search program 1107, search engine 1106 and query resolving program 1145, to perform both hierarchical presearch and text search. First, a character component table is read from the text data file 1110 to the bit search program 1107 to carry out character component table search. Document identifiers as results of character component table search are written in the search result storage memory 1146. Then, contracted texts of documenhts identified by the document identifiers are read from the text data file 1110 to the string search engine 1106 to perform contracted text search. In the string search engine 1106, a group of key words designated by preliminarily set state transition table information are retrieved in input contracted text data. When any one of the key words is found, the identifier of the text file and the identification code and positional information of the key word are delivered to the query resolving program 1145 on the query resolving computer (CPU.sub.2) 1145a.
The positional information added as output information of the search engine is information for indicating the position of a document in which the key word is found. For example, the positional information is represented by the number of characters obtained by counting characters from the head of the document. FIG. 11 shows an example of comparison positional information. The drawing shows retrieval based on a key word " (chitekikensaku; kanji)" in the case where the contents of the document is "". In the drawing, the term " (chitekikensaku; kanji)" in " (chitekikensaku-gijutsu; kanji)" coincides with the key word. Accordingly, the term is detected. In the respect to comparison positional information, the position of the last character " (saku; kanji)" in the term " (chitekikensaku; kanji)" is employed as a value obtained by counting characters from the head of the document. In this example, 13 is the comparison positional information.
The output information of the search engine having the comparison positional information added thereto has a structure as shown in FIG. 15. In this embodiment, the output information is constituted by a 32-bit length key word identifier and a 32-bit length key word comparison positional information code. A document identifier is given to each document before a key word identifier is given, so that correspondence of the comparison positional information to the document can be recognized.
Results of contracted text search, that is, comparison information formed by combining the document identifier, key word identifier and key word comparison positional information, are delivered to the query resolving program 1145 on the query resolving computer (CPU.sub.2) 1145a. In the query resolving program 1145, a document (or documents) satisfying designated conditions is detected based on the query resolving control information set preliminarily and, on the bases of the detection, a corresponding document identifier (or identifiers) is written in the search result storage memory 1146. In the search execution control program 1108, a judgment is made as to whether the complex condition includes a neighbor condition or a contextual condition. When the complex condition includes a neighbor condition or a contextual condition, final text search is carried out. In short, text data corresponding to the document identifier obtained as the result of contracted text search are read from the text data file 1110 to the string search engine 1106 to perform text search. The comparison information as an output of the string search engine 1106 is delivered to the query resolving program 1145, in which a judgment is made as to whether data satisfy both the neighbor condition and contextual condition designated in the information. The result of the judgment is delivered to the sarch result storage memory 1146, in the form of a document identifier as final search result information.
When contracted search or text search is finished, that is, when a search process is finished, the search result indicating program 1147 on the search machine control computer (CPU.sub.0) 1150 is operated to read bibliographic items, such as the number of search results, names and authors of hit documents, or the like, from the text data file 1110 by reference to the document identifiers on the search result storage memory 1146 to thereby make up a list on the display 1120 or to read text data of hit documents from the text data file 1110 based on the designation of a user to thereby indicate the text data on the display 1120.
Thus, the first embodiment of the full text search system provided according to the present invention has been described above.
In the following, a second embodiment of the present invention is described with reference to FIG. 25.
The system in this embodiment comprises a keyboard 2501, a search machine control computer (CPU.sub.0) 2520, a display 2520, an automaton generating computer (CPU.sub.1) 2505a, a bit search computer (CPU.sub.3) 2507a, a string search engine 2506, a query resolving computer (CPU.sub.2) 2545a, a search result storage memory 2546, a semiconductor memory device 2510a, a RAM disk device 2510b, a collective type magnetic disk unit 2510c, and an image data file 2530. In the search machine control computer (CPU.sub.0) 2550, a search expression analysis program 2502, a synonym development program 2503, a different notation development program 2504, a query analysis program 2542, a contextual condition analysis program 2543, a logical condition analysis program 2544, a search execution control program 2508 and a search result indication program 2547 are executed. In the automaton generating computer (CPU.sub.1) 2505a, an automaton generating program 2505 is executed. In the bit search computer (CPU.sub.3) 2507a, a bit search program 2507 is executed. In the query resolving computer (CPU.sub.2) 2545a, a query resolving program 2545 is executed. The collective type magnetic disk unit 2510c is constituted a collective type magnetic disk controller 2510d and magnetic disk devices 2510e.sub.1 to 2510e.sub.12.
In the drawing, a search conditional expression given through the keyboard 2501 is analyzed based on the search expression analysis program 2502 on the search machine control computer (CPU.sub.0) 2550. In short, a search conditional expression is separated, based on the search expression analysis program 2502, into a key word portion and a complex condition description portion in which a complex condition constituted by inclusion conditions and arrangement conditions is described. The inclusion conditions are described as logical conditions. The arrangement conditions are described as neighbor conditions or contextual conditions. After separation or extraction, the key word portion and the complex condition description portion are respectively delivered to the synonym development program 2503 and the query analysis program 2541 on the CPU.sub.0 2550.
Based on the synonym development program 2503, synonyms with respect to the given key word are found by reference to a synonym dictionary included in the program. The group of key words obtained by synonym development are delivered to the different notation development program 2504. For example, when a key word " (keisanki; kanji)" is given, synonyms thereof, such as " (densanki; kanji)", " (konpyuta; katakana)" and "COMPUTER", and the like, are generated.
Based on the different notation development program 2504, the group of key words thus delivered thereto are subjected to different notation development. In this example, " (konpyuta; katakana)" is generated from " (konpyuta; katakana)" and "Computer" is generated from "COMPUTER".
The group of key words thus generated by the synonym and different notation development are delivered to the automaton generating program 2505 on the automaton generating computer (CPU.sub.1) 2505a.
Based on the automaton generating program 2505, an automaton for performing collective comparison with respect to the key word group delivered from the different notation development program 2504 is generated. Key word identification code information to be compared with a state transition table is provided in the search engine 2506. The search engine 2506 is a high-speed multiplex term comparison circuit based on a finite state automaton.
The group of key words generated by the different notation development based on the different notation program 2504 are, together with corresponding key word identification codes, delivered to the bit search program 2507 on the bit search computer (CPU.sub.3) 2507a.
On the other hand, in the query analysis program 2541 on the search machine control computer (CPU.sub.0) 2550 receiving the complex condition description portion of the input search conditional expression from the search expression analysis program 2502, the complex condition description portion is separated into a neighbor condition description portion, a contextual condition description portion and a logical condition description portion. The condition description portions are delivered to the neighbor condition analysis program 2542, the contextual condition analysis program 2543 and the logical condition analysis program 2544, respectively.
In the neighbor condition analysis program 2542, inter-character distance conditions and inter-word distance conditions are extracted. The respective conditions thus extracted are converted into identification codes of designated key words and information pertaining to distance between the key words and then delivered to the query resolving program 2545 on the query resolving computer (CPU.sub.2) 2545.
In the contextual condition analysis program 2543, various coexistence conditions, such as coexistence conditions in one sentence, coexistence conditions in one paragraph, coexistence conditions in one clause, coexistence conditions in one chapter, and the like, are extracted. The respective conditions thus extracted are converted into information pertaining to identification codes of designated key words and designated contextual codes and then delivered to the query resolving program 2545 on the query resolving computer (CPU.sub.2) 2545a.
In the logical condition analysis program 2544, logical conditions designated in the search conditional expression are extracted, converted into logical condition code information and then delivered to the query resolving program 2546 on the query resolving computer (CPU.sub.2) 2545a.
When the aforementioned search expression analyzing process, synonym and different notation development process, automaton generating process, query analyzing process, neighbor condition analyzing process, contextual condition analyzing process and logical condition analyzing process are finished and when the control information is delivered respectively to the bit search program 2507 on the bit search computer (CPU.sub.3) 2507a, the search engine 2506 and the query resolving program 2545 on the query resolving computer (CPU.sub.2) 2545a, a search process is started.
The search process is controlled by the search execution control program 2508 on the search machine control computer (CPU.sub.0) 2550. In short, in the search execution control program 2508, character component table search is carried out while reading a character component table from the semiconductor memory device 2510a by operating the bit search program 2507. As results of the character component table search, document identififiers are written in the search result storage memory 2546.
Contracted texts of documents designated by the document identifiers written in the search result storage memory 2546 are read from the RAM disk device 2510b to the string search engine 2506 while operating the string search engine 2506, the query resolving program 2545 and the RAM disk device 2510b, thus to carry out contracted text search. As results of the contacted text search, comparison information pertaining to document identifier, comparison key word identifier and key word comparison position is delivered to the query resolving program 2545 on the query resolving computer (CPU.sub.2) 2545a. In the query resolving program 2545, documents satisfying designated conditions are detected based on the preliminarily set query resolving control information to write the document identifiers thereof in the search result storage memory 2546.
In the search execution control program 2508, a judgment is made as to whether the complex condition includes at least one neighbor condition or contextual condition. When the complex condition includes at least one neighbor condition or contextual condition, final text search is carried out. In short, text data corresponding to the document identifier stored in the search result storage memory 2546 and obtained as the result of contracted text search are read from the collective type magnetic disk unit 2510c to the string search engine 2506 while operating the string search engine 2506, the query resolving program 2545 and the collective type magnetic disk unit 2510c, thus to perform text search.
The collective type magnetic disk unit 2510c is constituted by a plurality of magnetic disk devices 2510e.sub.1 through 2510e.sub.12 provided to store various text data, such as character component table, contracted text, text, bibliographic items, and the like, therein dispersively. The magnetic disk devices 2510e.sub.1 to 2510e.sub.12 controlled by the collective type disk controller 2510d can read text data in parallel and independently. The text data thus read are integrated or multiplexed by the collective type disk controller 2510dand then delivered to the string search engine 2506 at a high speed. In the case where 12 magnetic disk devices are operated simultaneously, a reading rate about ten times as fast as that in the case of use of only one device can be obtained.
Collective information from the string search engine 2506 is delivered to the query resolving program 2545, in which a judgment is made as to whether the text data satisfies the designated neighbor condition and contextual condition. The result of the judgment is delivered to the search result storage memory 2546 in the form of a document identifier as final search result information.
When contracted search or text search is finished, that is, when a search process is finished, the search result indicating program 2547 on the search machine control computer (CPU.sub.0) 2550 is operated to read bibliographic items, such as the number of search results, names and authors of hit documents, or the like, from the collective type magnetic disk unit 2510c by reference to the document identifiers on the search result storage memory 2546 to thereby make up a list on the display 2520 or to read text data of hit documents from the collective type magnetic disk unit 2510c based on the designation of a user to thereby indicate the text data on the display 2520. Further, in the case wehre the user designates seeing of figures or image information in hit documents corresponding image data are read from the image data file 2530 and indicated on the display 2520.
Thus, the second embodiment of the full text search system provided according to the present invention has been described above.
Although this embodiment has shown the case where a collective type disk controller 110d (FIG. 20) is used as a text data file 110 (FIG. 1) for storing text data, the invention can be applied to the case where a collective type optical disk unit may be used to increase the capacity of the text data file 110. In short, the magnetic disk devices 110e1 to 110e12 are replaced by optical disk devices. In the case where optical disk devices are used, the text search speed is reduced because optical disk devices are inferior to magnetic disk devices in access speed. An addition type optical device can be used in the case where correction of text data is not required. A rewritable type optical device can be used in the case where correction of text data is required.
In the following, a specific embodiment of the RAM disk device 2510b used in the second embodiment of the invention is described with reference to FIG. 75.
In the drawing, the RAM disk device 2510b comprises a semiconductor memory 7100 (RAM) for storing contracted texts, and an RAM disk controller 7200 for controlling the reading of the contacted texts on the semiconductor memory 7100.
The RAM disk controller 7200 is composed of a direct memory access controller 7210 (DMAC), an address controller 7220, and an address memory 7230. The address memory 7230 is arranged so that pairs of data each composed of a start address STARTn and an end address ENDn can be set in the address memory to indicate areas of the semiconductor memory 7100 to be read. The start address 7360 and the end address 7370 are given by the search execution control program 2508 based on information pertaining to the identifiers of contracted texts written in the search result storage memory 2546 and indicating targets to be read, by reference to contracted text storage information managed in the search execution control program 2508.
The address controller 7220 reads readout-area address information stored in the address memory 7230, that is, reads a start address START.sub.1 and an end address END.sub.1 based on an operation signal given from the search execution control program 2508, calculates a top address 7310 of an area to be read and the number 7320 of words to be read and sets them in the direct memory address controller 7210 to operate it. The direct memory access controller 7210 reads data from the designated area of the semiconductor memory 7100 based on the designated address 7310 and word number 7320.
When reading is finished, the direct memory access controller 7210 sends an end signal 7370 to the address controller 7220. The address controller 7220 receiving it reads information pertaining to next transfer address, that is, a start address START.sub.2 and an end address END.sub.2, calculates a top address 7310 of an area to be read and the number 7320 of words to be read and sets them in the direct memory access controller 7210 to operate it. The direct memory access controller 7210 reads data from the designated area of the semiconductor memory 7100 based on the designated address 7310 and word number 7320.
Data in the semiconductor memory 7100 corresponding to transfer information set in the address memory 7230 can be perfectly read out by repeating the aforementioned procedure.
Thus, an example of the RAM disk device 2510b has been described above.
In the following, a more detailed embodiment of the query analysis program 2541 (FIG. 25) in the second embodiment of the invention is described with reference to FIG. 13.
In this embodiment, the query analysis program 1141 has a pipe-line structure composed of a neighbor condition judgment program 330, a contextual condition judgment program 340 and a logical condition judgment program 350.
In respect to the search execution control step, the case of execution of text search is described as an example. In short, in respect to input text data, text data are given from the collective type magnetic disk unit 1110c to perform key word retrieval and comparison based on the text data by using the search engine 1106.
In respect to the search conditional expression, a complex conditional expression 301 including logical conditions, neighbor conditions and contextual conditions is given.
Complex conditional expression 301: Q.dbd.and ([4C], [S])
The complex conditional expression 301 means retrieval of a document (or documents) which has terms " (bunsho; kanji)" and " (rikai; kanji)" arranged in the order and at a distance of not more than 4 characters and which has terms " (bunsho; kanji)" and " (kensaku; kanji)" coexisting in one sentence. In short, "[4C]" expresses a neighbor condition in which terms " (bunsho; kanji)" and " (rikai; kanji)" are arranged in the order and at a distance of not more than 4 characters. Further, "[S]" expresses a contextual condition in which terms " (bunsho; kanji)" and " (rikai; kanji)" coexist in one sentence. Further, "and (. . . , . . . )" expresses a logical condition in which the two expressions are satisfied simultaneously.
When the complex conditional expression 301 is designated, the search conditional expression is analyzed based on the search expression analysis program 1102 as described above in the second embodiment (FIG. 25), so that key words contained in the search conditional expression, that is, terms " (bunsho; kanji)", " (rikai; kanji)" and " (kensaku; kanji)", are extracted. Identifiers T.sub.1, T.sub.2 and T.sub.3 are given to the key words and then delivered to the synonym development program 1103 and the different notation development program 1104. For simplification of description, it is now assumed that there is no term developed into synonyms and different notations. Accordingly, terms obtained as the result of the synonym and different notation development are input key words " (bunsho; kanji)", " (rikai; kanji)" and " (kensaku; kanji)". These terms are delivered to the automaton generating program 1107, in which an automaton for performing comparison with respect to the terms is generated, so that the state transition table thereof is set in the search engine 1106.
On the other hand, the complex condition in the search conditional expression is decomposed, based on the query analysis program 1141, into a neighbor condition "[4C]", a contextual condition "[S]" and a logical condition "and (. . . , . . . )". At this time, the key words in the respective conditional expressions are replaced by the key word identifiers (also called "term identifiers") given at the time of the generation of the automaton. Accordingly, the neighbor condition is represented by an expression "T.sub.1 [4C]T.sub.2 ". The contextual condition is represented by an expression "T.sub.1 [S]T.sub.3 ". Further, item identifiers I.sub.1 and I.sub.2 are given to the conditional expressions, respectively. Accordingly, the logical conditional expression is represented as "and (I.sub.1, I.sub.2)". The aforementioned procedures are respectively carried out in the neighbor condition analysis program 2542 (FIG. 25), the contextual condition analysis program 2543 (FIG. 25) and the logical condition analysis program 2544 (FIG. 25). The respective conditions expressed by term identifiers and item identifiers are delivered to the respective condition judgment programs in the query resolving program 2545 (FIG. 25).
When the automaton state transition table for search term comparison and the search term identifier information are set in the search engine 1106 and then the respective conditional expressions described by search term identifiers and item identifiers are set in the neighbor condition judgment program 330, the contextual condition judgment program 340 and the logical condition judgment program 350, the collective type magnetic disk unit 1110c, the search engine 1106, the query analysis program 1145, the neighbor condition judgment program 330, the contextual condition judgment program 340 and the logical condition judgment program 350 are actuated to start based on the search execution control program 1108.
As a result, text data are read from the collective type magnetic disk unit 1110c and delivered to the search engine 1106. When any one of the designated search terms " (bunsho; kanji)", " (rikai; kanji)" and " (kensaku; kanji)" is detected in the search engine 1106, a corresponding term identifier is delivered to the neighbor condition judgment program 330 together with positional information for indicating the position thereof in a text in which the search term identifier T.sub.1, T.sub.2 and T.sub.3 are detected. Also, punctuation mark ".largecircle." is detected by the search engine 1106 regardless of designation of a user, so that a corresponding punctuation mark identifier T.sub.0 together with positional information is delivered to the neighbor condition judgment program 330.
In the neighbor condition judgment program 330, the search term identifier delivered from the search engine 1106 is referred to the designated neighbor condition, inclusively of positional information. If the designated neighbor condition "T.sub.1 [4C]T.sub.2 ", that is, "[4C]", is satisfied, an item identifier I1 corresponding to the condition as the result of comparison is delivered to the contextual condition judgment program 340, together with the punctuation mark identifier T.sub.0 and search term identifiers T.sub.1, T.sub.2 and T.sub.3 given by the search engine 1106.
In the contextual condition judgment program 340, the designated contextual condition is checked based on the punctuation mark identifier T.sub.0 and search term identifiers T.sub.1 and T.sub.3 and positional information thereof. The contextual condition "T.sub.1 [S]T.sub.3 " is judged from the arrangement of the punctuation mark identifier T.sub.0 and term identifiers T.sub.1 and T.sub.3. In short, if T.sub.1 and T.sub.3 are arranged in the order and between two T.sub.0, a conclusion is made that the contextual condition "T.sub.1 [S]T.sub.3 " is established. If one satisfying the contextual condition "[S]" is detected, an item identifier I.sub.2 corresponding to the condition as the result of the comparison is delivered to the logical condition judgment program 350, together with the punctuation mark identifier T.sub.0, search term identifiers T.sub.1 and T.sub.3 and item identifier T.sub.1 given by the neighbor condition judgment program 330.
In the logical condition judgment program 350, a judgment is made based on the punctuation mark identifier T.sub.0, search term identifiers T.sub.1 and T.sub.3 and item identifiers I.sub.1 and I.sub.2 given by the contextual condition judgment program 340 as to whether there exist identifiers I.sub.1 and I.sub.2 satisfying the designated logical condition "and (I.sub.1, I.sub.2)". In short, if the two item identifiers I.sub.1 and I.sub.2 are detected, the complex condition search expression Q is established, so that the text (document) is retrieved by the search expression Q. Thus, the text 302 as shown in the drawing is retrieved.
On the other hand, identifiers of text data not described above are contained in comparison data delivered from the collective type magnetic disk unit 1110c to the search engine 1106, the neighbor condition judgment program 330, the contextual condition judgment program 340 and the logical condition judgment program 350. In short, when text data satisfy the search expression Q in the logical condition judgment program 350, the document identifier thereof is delivered to the next step, that is, the search result indication program. Based on the search result indication program, the number of hit documents is indicated or bibliographic items of the hit documents are read from the collective type magnetic disk unit 1110c based on the document identifiers and indicated on the display 1120.
The foregoing is the detailed description about the query resolving program 2545 (FIG. 25) in the second embodiment of the invention.
The foregoing is the detailed description about the text data file 1110 (FIG. 10) and the query analysis portion 1141 (FIG. 10) in the first embodiment of the invention.
In the following, the full text search system according to the present invention is described in detail.
According to the present invention, two-stage presearch, that is, character component table search 402 and contracted text search 403 as shown in FIG. 15, is employed as a scanning type full text search accelerating method. In short, the two-stage presearch is carried out before the text search 403 to thereby narrow down the number of times for reference to texts stored in the magnetic disk. Accordingly, the quantity of text search processing which requires a large part of search time can be reduced, so that the total search time can be shortened.
All of the foregoing are controlled by a search execution control program. First, an embodiment of the character component table search as a first presearch stage is described.
In the character component table search, a list of documents containing character codes in the texts thereof with respect to all the character codes in the contracted texts (which will be described later) thereof is generated according to a flow of the whole of registration process as shown in FIG. 16 and a hash coding procedure as shown in detail in FIG. 18. In short, character codes in each document are represented by one-bit information (called "bit list") and then hashed to form a character component table 500.
For example, when a key word " (kensaku; kanji)" is designated, entry addresses of the respective characters " (ken; kanji)" and " (saku; kanji)" in the character component table 500 are detected through a hash function 510 as shown in FIG. 18. A bit list 520 for documents containing two characters " (ken; kanji)" and " (saku; kanji)" is detected by ANDing the bits of bit lists 503 and 506 obtained from hash values of character codes.
The aforementioned character component table search procedure is as shown in FIG. 23. In short, character component table search is repeated by the number of key words contained in the designated search conditional expression. In the character component table search for each key word, ANDing of bits of bit lists is repeated by the number of characters contained in the key word. As a result, |