Query processing (i.e., searching)

Full-text search apparatus utilizing two-stage index file to achieve high speed and reliability of searching a text which is a continuous sequence of characters

5706496

Abstract

A new type of text search apparatus, capable of finding all occurrence positions of a search string that is an arbitrary character string, within a text which is written as a continous sequence of characters, utilizes for text position reference purposes in an index file, words which each occur (at least once within the text) as the maximum length word, referred to as an extension word, among a set of arbitrarily predefined dictionary words extending from a specific character position. Each such occurrence of a word as an extension word defines one of a set of text position elements, with that set covering all of the character positions of the text. The index file also includes a table which relates each of the extension words to the respective positions at which each of the partial character strings of the word occur within the word. Each occurrence of an arbitrary search string within the text can thereby be expressed as either a partial character string within a single text position element, or as a sequence of partial character strings within a set of sequentially occurring text position elements, so that all such occurrences can be found by utilizing the index file.


Claims

What is claimed is:

1. A text search apparatus operable for deriving a set of text position values expressing all positions of occurrence of a search string within an object text, said search string being an arbitrarily specified string of characters from a predetermined character set, said object text being formed of a finite number of characters from said character set, the apparatus comprising:

a dictionary list having registered therein a predetermined list of dictionary words or identifiers of said dictionary words, each of said dictionary words being a string of characters from said character set,

text position table means comprising a plurality of text position elements each expressing a combination of one of said dictionary words which appears within said object text, or an identifier of said dictionary word, in conjunction with a text position value of said dictionary word, each dictionary word of a text position element in said text position table being an extension word which is a maximum length word within a set of dictionary words each extending from a leading character position of said text position element;

extension word position correspondence table means having registered therein information expressing, for each of respective partial character strings of said extension words, a corresponding set of extension words and respective positions of said partial character string within each of said corresponding extension words,

extension word assessment means for obtaining from said extension word correspondence table means a first set of extension words each containing said search string as a partial character string and intra-word position information indicating respective positions of said search string within the words of said first set of extension words,

text position element look-up means for obtaining first sets of text position elements from said text position table means, each set of said first sets of text position elements being keyed to a corresponding extension word of said first set of extension words,

total set formation means for operating on said first sets of text position elements in conjunction with said intra-word position information, to obtain a first set of text position values of said search string, and

cover word assessment means for determining a specific sequence of cover words extending from a leading character to a final character of said search string, each of said cover words other than a final cover word of said sequence being a dictionary word, said final cover word being a dictionary word or a leading partial character string of a dictionary word,

wherein said extension word assessment means obtains from said extension word position correspondence table means respective second sets of extension words, said sets corresponding to respective ones of said cover words, and obtains intra-word information expressing respective positions of each of said cover words within each of the extension words of the corresponding one of said second sets,

said text position element look-up means obtains from said text position table means, for each extension word of each set of said second sets of extension words, a corresponding second set of said text position elements,

said cover word assessment means executes comparison processing of said second sets of text position elements to select groups of sequentially occurring text position elements, each of said groups containing said cover word sequence extending continuously from within a leading text position element of the group,

and wherein said total set formation means obtains, based on respective leading text position elements of said groups in conjunction with said intra-word information from said extension word assessment means, a second set of text position values of said search string, and combines said second set with said first set of text position values to obtain a complete set of text position values of said search string.

2. A text search apparatus according to claim 1, wherein said dictionary list is predetermined such that every character position within said object text is covered by at least one of said text position elements.

3. A text search apparatus according to claim 2, further including a table generating apparatus for generating said text position table, said table generating apparatus comprising means for

executing a first processing operation of sequentially examining all characters appearing at successive character positions of said object text, to obtain for each of said character positions a set of said dictionary words each beginning from said character position,

if said set of dictionary words, beginning from a first specific character position, is found to be empty, sequentially examining characters at successive character positions following said first specific character position, to find a second specific character position from which a set of dictionary words begins which is not empty, registering in said dictionary list as a new dictionary word a string of characters which extends from said first specific character position to a character position immediately preceding said second specific character position, and registering said new dictionary word in conjunction with a character number expressing said first specific character position, as a text position element of an entry in said text position table,

and means for

executing a second processing operation of sequentially examining characters at successive character positions of said object text, to obtain for each of said character positions a set of dictionary words each beginning from said each character position,

selecting as a maximum length word of said set a word of said set which satisfies respective conditions of being the longest word in said set and of extending beyond all dictionary words which begin from any preceding character position, and

designating each of said maximum length words as an extension word, and registering each said extension word, in conjunction with a character number expressing the character position from which said extension word begins within said object text, as a text position element in said text position table.

4. A text search apparatus according to claim 2, further including a table generating apparatus for generating said text position table, said table generating apparatus comprising means for

sequentially examining all characters appearing at successive character positions of said object text, to obtain for each of said character positions a set of said dictionary words each beginning from said character position,

selecting, as a maximum length word of said set, a word of said set which satisfies respective conditions of being the longest word in said set and of extending beyond all dictionary words which begin from any preceding character positions,

if said set is found to be empty at a specific character position, registering in said dictionary list a character which appears at said specific character position, and designating said character as constituting said maximum length word, with respect to said specific character position, and

designating each of said maximum length words as an extension word, and registering said extension word, in conjunction with a character number expressing the character position from which said extension word begins within said object text, as a text position element in said text position table.

5. A text search apparatus according to claim 1, wherein said extension word correspondence table means comprises

a dictionary sequence word table comprising a list of said extension words arranged in dictionary sequence, in conjunction with respective dictionary sequence word numbers of said extension words, said dictionary sequence word numbers successively increasing in magnitude in accordance with positions in the dictionary sequence,

an inverse dictionary sequence word table comprising a list of said extension words arranged in inverse dictionary sequence, in conjunction with respectively corresponding ones of said dictionary sequence word numbers,

a single character intra-word position table which relates each character of said character set to at least all of said extension words which each contain said character at a position other than a leading character position or a final character position, and which relates said each character to each position at which said character appears within each extension word containing the character, at least within a range of positions extending from a second character position to a character position immediately preceding a final character position within said each extension word, said extension words being registered in conjunction with respectively corresponding ones of said dictionary sequence word numbers,

and wherein respective ones of said extension words of said entries in said text position table are assigned respectively corresponding ones of said dictionary sequence word numbers.

6. A text search apparatus according to claim 5, wherein said extension word assessment means comprises

prefix extension word assessment means for obtaining from said dictionary sequence word table respective ones of said extension words which have said search string as a prefix portion thereof,

suffix extension word assessment means for obtaining from said inverse dictionary sequence word table respective ones of said extension words which have said search string as a suffix portion thereof, and

two-way extension word assessment means for obtaining from said inverse dictionary sequence word table respective ones of said extension words which have said search string as a partial character string that is other than a prefix portion or suffix portion.

7. A text search apparatus according to claim 1, wherein said extension word correspondence table means comprises

a dictionary sequence word table comprising a list of said extension words or identifiers of said extension words, arranged in dictionary sequence, in conjunction with respective dictionary sequence word numbers of said extension words, said dictionary sequence word numbers successively increasing in magnitude in accordance with word positions in the dictionary sequence,

a dictionary sequence initial character table comprising a list of all characters of said character set or identifiers of said characters, in conjunction with respective ones of said dictionary sequence word numbers, with each dictionary sequence word number assigned to a character being the dictionary sequence word number of the leading extension word having said character as the leading character thereof, in said extension word list in said dictionary sequence word table,

an inverse dictionary sequence word table comprising a list of said extension words or identifiers of said extension words, arranged in inverse dictionary sequence, in conjunction with respectively corresponding ones of said dictionary sequence word numbers, and in conjunction with respective inverse dictionary sequence word numbers of said extension words, said inverse dictionary sequence word numbers successively increasing in magnitude in accordance with word positions in the inverse dictionary sequence,

an inverse dictionary sequence final character table comprising a list of all characters of said character set or identifiers of said characters, in conjunction with respective ones of said inverse dictionary sequence word numbers, with each inverse dictionary sequence word number assigned to a character being that of the leading extension word having said character as the final character thereof, in said extension word list in said inverse dictionary sequence word table, and

a single character intra-word position table which relates each character of said character set to at least all of said extension words which each contain said character at a position other than a leading character position or a final character position, and which relates said each character to each position at which said character appears within each extension word containing the character, at least within a range of positions extending from a second character position to a character position immediately preceding a final character position within said each extension word, said extension words being registered in conjunction with respectively corresponding ones of said dictionary sequence word numbers,

and wherein respective ones of said extension words of said entries in said text position table are assigned respectively corresponding ones of said dictionary sequence word numbers.

8. A text search apparatus according to claim 7, wherein said extension word assessment means comprises

means for obtaining from said dictionary sequence initial character table a first dictionary sequence word number, said first dictionary sequence word number corresponding to a leading character of said search string, and a second dictionary sequence word number, said second dictionary sequence word number corresponding to a character which immediately succeeds said leading character of the search string in said dictionary sequence,

prefix extension word derivation means for obtaining from said dictionary sequence word table, from within a range of said extension words listed therein that extends from an extension word corresponding to said first dictionary sequence word number to an extension word corresponding to a dictionary sequence word number which is one less than said second dictionary sequence word number, respective dictionary sequence word numbers of a set of extension words each having said search string as a prefix portion,

means for obtaining from said inverse dictionary sequence final character table a first inverse dictionary sequence word number, said first inverse dictionary sequence word number corresponding to a final character of said search string, and a second inverse dictionary sequence word number, said second inverse dictionary sequence word number corresponding to a character which immediately succeeds said final character of the search string in said dictionary sequence, and

suffix extension word derivation means, for obtaining from said inverse dictionary sequence word table, from within a range of said extension words listed therein that extends from an extension word corresponding to said first inverse dictionary sequence word number to an extension word corresponding to an inverse dictionary sequence word number which is one less than said second inverse dictionary sequence word number, respective dictionary sequence word numbers of a set of extension words each having said search string as a suffix portion, and

two-way extension word derivation means for obtaining, from said single character intra-word position table, respective dictionary sequence word numbers of sets of extension words having respective characters of said search string at specific character positions within said extension words, and for examining said sets to extract respective dictionary sequence word numbers of those extension words which contain said search string as a partial character string other than as only a prefix portion or suffix portion.

9. A text search apparatus according to claim 8, wherein said prefix extension word derivation means comprises

means for assigning, as an initial value of a first pointer variable, a dictionary sequence word number which is substantially centrally located within a range of dictionary sequence word numbers extending from said first dictionary sequence word number to said dictionary sequence word number which is one less than said second dictionary sequence word number, and for thereafter successively executing steps of

judging whether an extension word corresponding to the dictionary sequence word number specified by said first pointer variable is a leading word of a range of extension words having said search string as a prefix portion,

if said extension word is not said leading word of said range of extension words, judging whether said extension word is located before or after said search string within said dictionary sequence, dividing said range of dictionary sequence word numbers into two substantially equal ranges, selecting one of said ranges in accordance with a result obtained by said judging, and assigning as a new value to said first pointer variable a dictionary sequence word number that is substantially centrally located within the selected one of said ranges,

and wherein said suffix extension word derivation means comprises

means for assigning as an initial value of a second pointer variable, an inverse dictionary sequence word number which is substantially centrally located within a range of inverse dictionary sequence word numbers extending from said first inverse dictionary sequence word number to said inverse dictionary sequence word number which is one less than said second inverse dictionary sequence word number, and for thereafter successively executing steps of

judging whether an extension word corresponding to the inverse dictionary sequence word number specified by said second pointer variable is a leading word of a range of extension words having said search string as a suffix portion,

if said extension word is not said leading word of the range of extension words having the search string as a suffix portion, judging whether said extension word is located before or after said search string within said inverse dictionary sequence, dividing said range of inverse dictionary sequence word numbers into two substantially equal ranges, selecting one of said ranges in accordance with a result obtained by said judging, and assigning as a new value to said second pointer variable an inverse dictionary sequence word number that is substantially centrally located within the selected one of said ranges.

10. A text search apparatus according to claim 1, wherein said extension word correspondence table means comprises a dictionary sequence high occurrence word table comprising a list of high occurrence words or identifiers of said high occurrence words, arranged in dictionary sequence, in conjunction with respective dictionary sequence high occurrence word numbers of said high occurrence words, said dictionary sequence high occurrence word numbers successively increasing in magnitude in accordance with word positions in the dictionary sequence, said high occurrence words comprising respective ones of said extension words which occur within said object text a number of times which exceeds a predetermined threshold value,

an inverse dictionary sequence high occurrence word table comprising a list of said high occurrence words or identifiers of said high occurrence words, arranged in inverse dictionary sequence, in conjunction with respectively corresponding ones of said dictionary sequence high occurrence word numbers,

a dictionary sequence low occurrence word table comprising a list of low occurrence words or identifiers of said low occurrence words, arranged in dictionary sequence, said list of low occurrence words being divided into a plurality of first groups of successive low occurrence words in accordance with predetermined criteria, each of said low occurrence words being listed in conjunction with a combination of a first group number which indicates membership of a specific one of said first groups and an intra-group number which is indicates a position of said low occurrence word within said one of the first groups, said low occurrence words comprising respective ones of said extension words which each occur within said object text a number of times which is no greater than said threshold value,

an inverse dictionary sequence low occurrence word table comprising a list of low occurrence words or identifiers of said low occurrence words, arranged in inverse dictionary sequence, said list of low occurrence words being divided into a plurality of second groups of successive low occurrence words in accordance with predetermined criteria, each of said low occurrence words being listed in conjunction with a combination of a second group number which indicates membership of a specific one of said second groups and an intra-group number which indicates a position of said low occurrence word within said one of the second groups, and

a single character intra-word position table which relates each character of said character set to at least all of said extension words which each contain said character at a position other than a leading character position or a final character position, and which relates said each character to each position at which said character appears within each extension word containing the character, at least within a range of positions extending from a second character position to a character position immediately preceding a final character position within said each extension word, with extension words which are classified as high occurrence words being registered in conjunction with respectively corresponding ones of said dictionary sequence high occurrence word numbers, and with extension words which are classified as low occurrence words being registered in conjunction with respective ones of said combinations of a first group number with an intra-group number or combinations of a second group number with an intra-group number,

and wherein said text position table means comprises

a first text position table comprising a plurality of table entries corresponding to respective ones of said high occurrence words, said table entries being respectively identified by corresponding ones of said dictionary sequence high occurrence word numbers, each of said table entries comprising the corresponding high occurrence word in conjunction with an array of respective text position values from all text position elements of said high occurrence word, said text position values being arrayed in sequence of increasing magnitude,

a second text position table comprising a plurality of table entries corresponding to respective ones of said first groups, said table entries being respectively identified by corresponding group numbers of said first group, each of said table entries comprising the dictionary sequence low occurrence word number of a leading word of the corresponding one of said first groups, in conjunction with an array of of all text position elements of low occurrence words of said first group, each of said text position elements comprising a value indicating a position of a low occurrence word within one of said first groups, in conjunction with a text position value of said low occurrence word, said text position elements being arrayed in sequence of increasing magnitude of text position value, and

a third text position table comprising a plurality of table entries corresponding to respective ones of said second groups, said table entries being respectively identified by corresponding group numbers of said second group, each of said table entries comprising the inverse dictionary sequence low occurrence word number of a leading word of the corresponding one of said second groups, in conjunction with an array of of all text position elements of low occurrence words of said first group, each of said text position elements comprising a value indicating a position of a low occurrence word within one of said second groups, in conjunction with a text position value of said low occurrence word, said text position elements being arrayed in sequence of increasing magnitude of text position value.

11. A text search apparatus according to claim 10, wherein said extension word assessment means comprises

first prefix extension word assessment means for obtaining from said dictionary sequence high occurrence word table respective ones of said high occurrence words which have said search string as a prefix portion thereof,

first suffix extension word assessment means for obtaining from said inverse dictionary sequence high occurrence word table respective dictionary sequence high occurrence word numbers of high occurrence words which have said search string as a suffix portion thereof,

second prefix extension word assessment means for obtaining from said dictionary sequence low occurrence word table, for each low occurrence word which has said search string as a prefix portion thereof, a combination of a group number and an intra-group number, said group number specifying one of a plurality of groups of low occurrence words listed in said dictionary sequence low occurrence word table and said intra-group number specifying a position of said each low occurrence word within said one of the groups,

second suffix extension word assessment means for obtaining from said inverse dictionary sequence low occurrence word table for each low occurrence word which has said search string as a suffix portion thereof, a combination of a group number and an intra-group number, said group number specifying one of a plurality of groups of low occurrence words listed in said dictionary sequence low occurrence word table and said intra-group number specifying a position of said each low occurrence word within said one of the groups, and

two-way extension word assessment means for obtaining from said single character intra-word position table respective dictionary sequence high occurrence word numbers or combinations of a group number and intra-group number, for extension words having said search string as a partial character string other than only as a prefix portion or suffix portion.

12. A text search apparatus according to claim 1, wherein said extension word correspondence table means comprises

a dictionary sequence high occurrence word table comprising a list of high occurrence words or identifiers of said high occurrence words, arranged in dictionary sequence, in conjunction with respective dictionary sequence high occurrence word numbers of said high occurrence words, said dictionary sequence high occurrence word numbers successively increasing in magnitude in accordance with word positions in the dictionary sequence, said high occurrence words comprising respective ones of said extension words which occur within said object text a number of times which exceeds a predetermined threshold value,

a first dictionary sequence initial character table comprising a list of all characters of said character set or identifiers of said characters, in conjunction with respective ones of said dictionary sequence high occurrence word numbers, with each dictionary sequence high occurrence word number assigned to a character being the dictionary sequence high occurrence word number of the leading high occurrence word, in said list in said dictionary sequence high occurrence word table, having said character as the leading character thereof,

an inverse dictionary sequence high occurrence word table comprising a list of said high occurrence words or identifiers of said high occurrence words, arranged in inverse dictionary sequence, in conjunction with respectively corresponding ones of said dictionary sequence high occurrence word numbers and in conjunction with respective inverse dictionary sequence high occurrence word numbers, said inverse dictionary sequence high occurrence word numbers successive increasing in magnitude in accordance with word positions in the inverse dictionary sequence,

a first inverse dictionary sequence final character table comprising a list of all characters of said character set or identifiers of said characters, in conjunction with respective ones of said dictionary sequence word numbers, with each dictionary sequence word number assigned to a character being the dictionary sequence word number of the leading high occurrence word, in said list in said inverse dictionary sequence high occurrence word table, having said character as the final character thereof,

a dictionary sequence low occurrence word table comprising a list of low occurrence words or identifiers of said low occurrence words, arranged in dictionary sequence in conjunction with respective dictionary sequence low occurrence word numbers which successively increase in value in accordance with word positions in the dictionary sequence, said list of low occurrence words being divided into a plurality of first groups of successive low occurrence words in accordance with predetermined criteria, each of said low occurrence words being listed in conjunction with a combination of a first group number which indicates membership of a specific one of said first groups and an intra-group number which indicates a position of said low occurrence word within said one of the first groups, said low occurrence words comprising respective ones of said extension words which each occur within said object text a number of times which is no greater than said threshold value,

a second dictionary sequence initial character table comprising a list of all characters of said character set or identifiers of said characters, in conjunction with respective ones of said dictionary sequence low occurrence word numbers, with each dictionary sequence low occurrence word number assigned to a character being the dictionary sequence word number of the leading low occurrence word, in said list in said dictionary sequence low occurrence word table, having said character as the leading character thereof,

an inverse dictionary sequence low occurrence word table comprising a list of low occurrence words or identifiers of said low occurrence words, arranged in inverse dictionary sequence in conjunction with respective inverse dictionary sequence low occurrence word numbers which successively increase in value in accordance with word positions in the inverse dictionary sequence, said list of low occurrence words being divided into a plurality of second groups of successive low occurrence words in accordance with predetermined criteria, each of said low occurrence words being listed in conjunction with a combination of a second group number which indicates membership of a specific one of said second groups and an intra-group number which indicates a position of said low occurrence word within said one of the second groups,

a second inverse dictionary sequence final character table comprising a list of all characters of said character set or identifiers of said characters, in conjunction with respective ones of said inverse dictionary sequence low occurrence word numbers, with each inverse dictionary sequence low occurrence word number assigned to a character being the inverse dictionary sequence low occurrence word number of the leading low occurrence word, in said list in said inverse dictionary sequence low occurrence word table, having said character as the final character thereof,

a single character intra-word position table which relates each character of said character set to at least all of said extension words which each contain said character at a position other than a leading character position or a final character position, and which relates said each character to each position at which said character appears within each extension word containing the character, at least within a range of positions extending from a second character position to a character position immediately preceding a final character position within said each extension word, with extension words which are classified as high occurrence words being registered in conjunction with respectively corresponding ones of said dictionary sequence high occurrence word numbers, and with extension words which are classified as low occurrence words being registered in conjunction with respectively corresponding ones of said combinations of a first group number with an intragroup number or combinations of a second group number with an intra-group number,

and wherein said text position table means comprises

a first text position Gable comprising a plurality of table entries corresponding to respective ones of said high occurrence words, said table entries being respectively identified by corresponding ones of said dictionary sequence high occurrence word numbers, each of said table entries comprising the corresponding high occurrence word in conjunction with an array of respective text position values from all text position elements of said high occurrence word, said text position values being arrayed in sequence of increasing magnitude,

a second text position table comprising a plurality of table entries corresponding to respective ones of said first groups, said table entries being respectively identified by corresponding ones of said first group numbers, each of said table entries comprising the dictionary sequence low occurrence word number of a leading word of the corresponding one of said first groups, in conjunction with an array of of all text position elements of low occurrence words of said first group, each of said text position elements comprising a value indicating a position of a low occurrence word within one of said first groups, in conjunction with a text position value of said low occurrence word, said text position elements being arrayed in sequence of increasing magnitude of text position value, and

a third text position Gable comprising a plurality of table entries corresponding to respective ones of said second groups, said table entries being respectively identified by corresponding ones of said second group numbers, each of said table entries comprising the inverse dictionary sequence low occurrence word number of a leading word of the corresponding one of said second groups, in conjunction with an array of of all text position elements of low occurrence words of said first group, each of said text position elements comprising a value indicating a position of a low occurrence word within one of said second groups, in conjunction with a text position value of said low occurrence word, said text position elements being arrayed in sequence of increasing magnitude of text position value.

13. A text search apparatus according to claim 12, wherein said extension word assessment means comprises

means for obtaining from said first dictionary sequence initial character table a first dictionary sequence high occurrence word number, said first dictionary sequence high occurrence word number corresponding to a leading character of said search string, and a second dictionary sequence high occurrence word number, said second dictionary sequence high occurrence word number corresponding to a character which immediately succeeds said leading character of the search string in said dictionary sequence,

means for obtaining from said dictionary sequence high occurrence word table, from within a range of said extension words listed therein that extends from an extension word corresponding to said first dictionary sequence high occurrence word number to an extension word corresponding to a dictionary sequence high occurrence word number which is one less than said second dictionary sequence high occurrence word number, respective dictionary sequence high occurrence word numbers of high occurrence words having said search string as a prefix portion,

means for obtaining from said first inverse dictionary sequence final character table a first inverse dictionary sequence high occurrence word number, said first inverse dictionary sequence high occurrence word number corresponding to a final character of said search string, and a second inverse dictionary sequence high occurrence word number, said second inverse dictionary sequence high occurrence word number corresponding to a character which immediately succeeds said final character of the search string in said dictionary sequence,

means for obtaining from said inverse dictionary sequence high occurrence word table, from within a range of said extension words listed therein that extends from an extension word corresponding to said first inverse dictionary sequence high occurrence word number to an extension word corresponding to an inverse dictionary sequence high occurrence word number which is one less than said second inverse dictionary sequence high occurrence word number, respective dictionary sequence high occurrence word numbers high occurrence words having said search string as a suffix portion,

means for obtaining from said second dictionary sequence initial character table a first dictionary sequence low occurrence word number, said first dictionary sequence low occurrence word number corresponding to said leading character of said search string, and a second dictionary sequence high occurrence word number, said second dictionary sequence high occurrence word number corresponding to said character which immediately succeeds the leading character of the search string in said dictionary sequence,

means for obtaining from said dictionary sequence low occurrence word table, from within a range of said extension words listed therein that extends from an extension word corresponding to said first dictionary sequence low occurrence word number to an extension word corresponding to a dictionary sequence low occurrence word number which is one less than said second dictionary sequence low occurrence word number, respective ones of said combinations of a first group number and an intragroup number, corresponding to low occurrence words having said search string as a prefix portion,

means for obtaining from said second inverse dictionary sequence final character table a first inverse dictionary sequence low occurrence word number, said first inverse dictionary sequence low occurrence word number corresponding to said final character of said search string, and a second inverse dictionary sequence low occurrence word number, said second inverse dictionary sequence low occurrence word number corresponding to said character which immediately succeeds said final character of the search string in the dictionary sequence, and

means for obtaining from said inverse dictionary sequence low occurrence word table, from within a range of said extension words listed therein that extends from an extension word corresponding to said first inverse dictionary sequence low occurrence word number to an extension word corresponding to an inverse dictionary sequence low occurrence word number which is one less than said second inverse dictionary sequence low occurrence word number, respective ones of said combinations of a second group number and an intra-group number, corresponding to low occurrence words having said search string as a suffix portion.

14. A text search apparatus according to claim 1, wherein said dictionary list is arbitrarily predetermined.

15. A text search apparatus according to claim 1, wherein said partial character strings in said extension word position correspondence table means comprise respective ones of all partial character strings of said extension words of said set of text position elements.

16. A text search apparatus according to claim 1, wherein said partial character strings in said extension word position correspondence table means comprise respective ones of all of said dictionary words which are individual characters and occur within said object text.

17. A text search apparatus according to claim 1, wherein said partial character strings in said extension word position correspondence table means comprise respective ones of all of said dictionary words which occur within said object text.

18. A text search apparatus according to claim 1, wherein said text position table means comprises a text position table having table entries assigned to respective ones of said extension words, each of said table entries comprising an extension word in conjunction with an array of text position values from all of said text position elements which are keyed to said extension word, said text position values being arranged in sequence of increasing magnitude.

19. A text search apparatus according to claim 1, wherein said cover word assessment means comprises means functioning, for each of successive pairs of said sets of text position elements, with each of said pairs of sets corresponding to a pair of cover words which occur directly sequentially in said specific sequence, to compare each of successive pairs of text position elements taken respectively from said each pair of sets, for thereby judging whether at least a part of said search string occurs within a sequential occurrence of said each pair of text position elements .


Description

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a text search apparatus for use in searching a text which is stored in an electronic storage apparatus, to an index file which is a basic component of the text search apparatus, and to an apparatus for generating the index file.

In particular, the invention relates to a text search apparatus which is capable of locating all positions of any arbitrary character string appearing within a continuous-sequence text, i.e. a text having no delimiting spaces between successive words.

More specifically, the invention relates to a text search apparatus whereby high search speed and complete reliability of searching can be achieved when searching a continuous-sequence text, and whereby only a moderate amount of data storage capacity is required.

Such a text search apparatus is applicable to data bases which contain large amounts of text information, as well as to relatively short text documents which are stored in general types of word processors or office computers used for text processing.

In the following description of prior art examples and of embodiments of the invention, text examples utilizing the lower-case English alphabet character set will be employed. In the case of a language such as English, the character set is small. However it should be understood that the invention can be advantageously applied to languages such as Japanese or Chinese for example, wherein the character set may contain several thousand characters, and wherein a high proportion of these characters will in themselves constitute meaningful words.

2. Description of Prior Art

In recent years, there has been an increasing tendency to store large amounts of text information using electronic types of storage medium. Such text information relates for example to electronic mail, electronic catalogs, electronic publishing, data bases, etc. As a result of the development of such large amounts of text storage, there is an increasing requirement for improved types of text search apparatus whereby parts of the stored text can be rapidly located, and in particular, whereby respective positions of occurrence of any arbitrary word or character string within the stored text can be rapidly obtained.

In general, prior to executing a text search, the user specifies the search conditions by a character string, generally referred to as the search condition string. That may simply consist of a specific character string that is to be located, or may also include further information for specifying one or more conditions which are to be applied to the search. It will be assumed for simplicity of description in the following that the search condition string consist only of the character string which is to be found, and it will be referred to simply as the search string. Various types of text search method have been proposed in the prior art, which are described for example in "Information Retrieval; Data Structures and Algorithms" by Frakes and Baez-Yates, Prentice Hall, pages 29 to 43. Three basic types are as follows:

(1) Full-text scanning method

(2) Signature file method

(3) Index file (inversion file) method

Search method (1) above is a full-text search method in which the entire text which is to be searched (referred to in the following as the object text) is scanned sequentially, character by character, to compare each character string in the text with the search string. Each time a text string is found which satisfies the conditions expressed by the search string, the corresponding position within the stored text is outputted, as a search result. However due to the slow speed of searching, such a method is not suitable for searching through very large amounts of stored text.

With method (2) above, files which are referred to as signature files are prepared beforehand, from the object text and are used as auxiliary files during a text search. A pre-search of these files is executed prior to performing the actual text search. As a result of the pre-search operation it may be found for example, that the desired text portion will not be found within certain of the stored documents. Thus, it is only necessary to search through the remaining stored documents, and hence the search speed can be increased by comparison with the full-text scanning method (1) above. However it is necessary to prepare and store the signature files, which may occupy a total amount of memory capacity that may be 10% to 15% of the main text storage capacity.

With method (3) above, an index file is prepared (sometimes referred to as an inverted file), in which all of the words of the object text (i.e. those words which have been found beforehand by applying some form of word analysis processing) are listed in sorted form, together with information linking each word to each of the locations in the text where the word appears. Since the words listed in the index file are arranged in sorted sequence, they can be rapidly scanned to obtain the location or locations of a desired word within a document, so that a high speed of searching can be achieved.

To establish such an index file it is necessary to first find all occurrences of each word which appears within the stored text. These words are then used to form the index file, with each word being linked to a set of text position values which point to one or more locations within the object text. Thereafter, to find the location(s) of any word within the object text it is only necessary to search the index file. Thus, a substantially higher speed of searching can be achieved than is possible with a full-text scanning method.

If the amount of object text which is to be searched is large, then the index file text search method has the disadvantage that it is necessary to store a very large amount of data in the index file, which may be substantially more data than that constituting the text that is to be searched. Thus the storage capacity requirements may be considerable.

As can be understood from the above, each of the prior art types of full-text search methods has respective advantages and disadvantages, so that an optimum method must be selected on the basis of the particular requirements of a specific application. However when the primary consideration is to achieve a high speed of searching, the index file method (3) above is to be preferred.

With a language in which respective words of a text are delimited by intra-word spaces, such as a European language, it is comparatively easy to prepare a prior art type of index file as described above. However there are various continuous-sequence languages, i.e. languages in which words in a text are not delimited by spaces, such as Chinese or Japanese. This presents special problems with respect to text searching. When constructing an index file for a text which is written in such a continuous-sequence language it is first necessary to divide the sequence of characters constituting the object text into a sequence of words, as accurately as possible. That is done by finding respective character strings which have been predefined as being words and which occur in the text, for example using a technique such as morpheme analysis in conjunction with a standard dictionary. A list of such words which appear in the text can thereby be prepared. The conventional type of index file can then be constructed, as described above, using these words. It is necessary to update the contents of the index file when changes are made in the object text.

An individual who understands the context of the text contents can read such a text and can in general accurately separate out respective words based upon the context, in spite of the absence of intra-word spacing. However that is not possible for a machine intelligence to achieve accurately at the present state of technology. As a result, if the words for the index file are selected by a machine intelligence, some words which appear in the text (i.e. character strings which were intended by the writer of the text to be understood as discrete words) may be omitted from the index file.

The problems imposed by searching a continuous-sequence text are basically different from those of a text in a language such as English. Taking for example a text containing a portion ". . . he made them appoint . . . ", which in continuous-sequence form becomes ". . . hemadethemappoint . . . ", it is possible that a machine intelligence might extract the words "he", "made", "the", "map", and "point", while omitting "them" and "appoint".

In the following, an example of a prior art type of index file will be described, for use in a full-text search apparatus which is applicable to a language in which inter-word spaces are omitted. It will be assumed that the apparatus should enable searching for any arbitrarily selected word which occurs within the object text. FIG. 1 is a basic block system diagram of an apparatus for constructing such an index file. In FIG. 1, numeral 251 denotes a text (stored in an electronic storage medium) which is to be searched, and which will be referred to as the object text. A word analysis processing section 252 separates the object text into sequential words, using a method such as analysis of morphemes, based on a dictionary list 253 (i.e. a list of strings which have been predefined as words). An index file generating section 254 then registers each of the words which are found in the text, together with of values indicating the respective positions within the object text where the word appears (such values being referred to in the following as text position values), as respective entries in an index file 255.

An example of the operation of such an apparatus will be described referring to the simple object text example of FIG. 2. It will be assumed, for the purposes of description, that the object text is formed from the English lower-case alphabetic letters, and consists of the character sequence "theeditorwonit" which, as a result of word analysis processing of the object text by a machine intelligence, might be divided the sequence of words "the edit or won it" or "the editor won it". In the first case, the resultant index file 255 would be as shown in FIG. 4, assuming that each text position value is expressed as the position of the initial letter of each word within the object text.

In that case, it would not be possible to find the position of occurrence (within the text portion shown) of the word "editor", by using such an index file.

Thus, due to the shortcomings of word analysis techniques for dividing a text into units of words, such an index file will not provide accurate results in locating all words which occur within the object text.

Furthermore, it is not possible to use such an index file to locate respective positions of occurrence within the object text of arbitrary character strings (such as "orwo" in the example "theeditorwonit") which do not appear in the dictionary list, and so will not be registered in the index file.

In order to overcome the above problems, a new type of index file has been proposed for application to Japanese texts. In such an index file, rather than utilizing dictionary words as position-specified units in the index file, fixed-length character strings referred to as n-grams (i.e. each formed of n characters) are used, which successively overlap within the object text. The n-grams are typically single characters, or pairs of characters (referred to as bi-grams). That method has been described in "The Transactions of the Institute of Electronics, Information and Communication Engineering", Vol J75-D-I, No. 9, pp. 836-846 (1992), published in Japan. The resultant index file consists of the ordered set of n-grams which appear in the object text, together with information linking each n-gram to each of the positions where it appears within the text. In order to search for any arbitrary character string in the object text, it is only necessary to examine the index file to find an n-gram which is identical to the desired character string, or a set of two or more n-grams occurring at sequential positions in the object text, which (arranged in successively overlapping form) constitute the desired character string. The position of the desired character string within the object text is then defined by the position of the first n-gram of that set.

An example of an index file constructed using bi-grams is shown in FIG. 5, for the case of the previously utilized text example. As shown, the object text is divided into sequentially overlappping character pairs, i.e. "th", "he", "ed", "di", . . . "on", "ni", "it". The index file consists of a table showing the respective positions of occurrence of these character pairs within the object text. Using such an index file, if for example it is desired to find the character string "it", then the two positions at which that string appears in the object text are obtained directly from the index file. If it is desired to find the character string "tor", then that string is first divided into successive pairs of bi-grams, i.e. "to", and "or". The index file is then searched to find each location in the object text where "to" occurs. The index file is then again searched to find locations of occurrence of the second bi-gram "or". For each bi-gram "or" thus found, i.e. which begins with the character "o", a check is made as to whether the (initial character) position of that bi-gram within the object text immediately succeeds the (initial character) position of any of the bi-grams "to". If such a combination of sequentially overlapping bi-grams is found, then the location (or locations) of the desired three-letter string "tor" within the object text has been determined, i.e. is obtained as the position (7) of the first bi-gram "to". Obviously the method can be applied to search for character strings of greater length.

It can thus be understood that the above method can be used to search for any arbitrarily selected character string, within a text or text portion which consists of successive characters which are not delimited into words by spaces.

However such a method has a basic disadvantage of a low search speed, if the size of the object text is substantial. One reason for this is that even if the character string being searched for is a dictionary word, (when searching for a word which is longer than the n-gram length) it is necessary to execute the above-described operations of dividing the required character string into n-grams, then searching the index file to find a set of n-grams which satisfy the requisite conditions. It can be understood that the number of occurrences of certain n-grams within a large text may be very great, so that considerable time may be required to complete each search operation.

Furthermore the conventional full-text search method (3) described above which employs an index file containing dictionary words, although providing a high speed of searching, also requires a very large data storage capacity, unless the size of the object text is small. Moreover as described above, complete accuracy of text searching cannot be ensured, when such a method is applied to a continuous-sequence text.

There is therefore a requirement for an improved text search apparatus which is applicable to a continuous-sequence text, whereby:

(a) a high speed of searching is achieved,

(b) it is possible to find, with complete reliability, any arbitrarily selected character string which appears at one or more positions within the text, and

(c) the data storage capacity requirements are moderate.

SUMMARY OF THE INVENTION

It is an objective of the present invention to overcome the problems of the prior art set out above, by providing a text search apparatus utilizing a new type of index file, and an apparatus for constructing such an index file. The text search apparatus enables a high speed of searching to be achieved, while ensuring that any character string which appears within the object text can be found with complete reliability. That reliability of text searching is attained even when the text search apparatus is applied to a text written in a continuous-sequence language such as Japanese or Chinese.

Moreover with such a text search apparatus, by comparison with prior art types of text search method which utilize an index file, the amount of storage capacity required to implement the index file can be moderate, so that it becomes practicable to apply the apparatus of the invention to searching in large amounts of text.

With the prior art text search apparatus of FIG. 1 described above, the index file contains a set of elements (i.e. text position elements) each of which relates a word (from the dictionary list) which has been found to occur in the text, to a position of occurrence of that word in the text. Ideally, such an element is registered for each occurrence of each dictionary word. However with an index file of a text search apparatus according to the present invention, a new type of text position element is utilized, which can be considered as a "maximum size" text position element. A set of such text position elements, which sequentially cover all of the character positions of the object text, is obtained by successively examining each text character position, to find if a dictionary word extends from that position which satisfies the conditions:

(a) it is the longest dictionary word extending from that character position, and

(b) no dictionary word extends beyond that longest dictionary word from any preceding text character position.

Such "longest dictionary words" are referred to herein as extension words. Since each extension word may consist of a plurality of sequential dictionary words, the total number of text position elements required to cover all of the character positions in the object text can be minimized, so that data storage requirements are reduced.

The index file of such a text search apparatus basically consists of two sections. The first of these, the text position table section, contains at least one table containing a plurality of arrays of text position elements, each array being keyed to a specific extension word. (Unless otherwise stated, the term "text position element" is used in the following description and the appended claims to signify a text position element which is keyed to an extension word).

In order to ensure that such a set of text position elements can be derived which will cover every character position within the object text, and to ensure that the index file will enable the text position of any arbitrary character string to be found, the following relationship must exist between the dictionary list and the object text before the set of text position elements is derived: for each character position in the object text, at least one dictionary word begins from that character position. It is possible in some cases that this necessary condition may be satisfied by a pre-existing dictionary list, however it may be necessary to augment such an existing dictionary list. If the above conditions for the dictionary list and extension words are met, then:

(1) the complete set of text position elements covers all of the character positions in the object text, and

(2) at every occurrence of a dictionary word in the object text, the word is either a partial character string within a text position element, or is itself the extension word of a text position element.

The second basic section of the index file is an extension word correspondence table section. This stores information expressing, for each character string which occurs in the object text and is a partial character string of any extension word, the set of extension words which each contain that dictionary word as a partial character string, and the respective intra-word positions of the character string within these extension words.

Thus, to obtain each of the occurrence positions of a search string within the object text the text search apparatus obtains, from the extension word correspondence table section of the index file, each of the extension words of which the search string is a partial character string, and information specifying the intra-word positions of the search string within each of these extension words. The apparatus then obtains, from the text position table section of the index file, each of the arrays of text position elements corresponding to these extension words. That information is then combined with the intra-word position information to obtain a set of positions of occurrence of the search string. If the search string is a dictionary word, then all of its occurrence positions have thereby been found. Such processing will be referred to as primary search processing.

If the search string is not a dictionary word, then it is not certain that at each of its occurrences within the object text, it will always occur within a text position element (i.e. it may extend between two sequential text position elements). In that case, it is necessary to execute further search processing, referred to in the following as secondary search processing.

First, a set of dictionary words is derived, which extend sequentially from the leading character to the final character of the search string, and are referred to as cover words. The respective sets of extension words corresponding to these cover words are then obtained from the extension word correspondence table section, then the respective sets of text position elements for each of these extension words are obtained from the text position table section and examined, to find those sequences of text position elements within which the search string appears. The respective positions of occurrence of the search string where the string extends between a plurality of text position elements, can thereby be obtained. The occurrences position information thus obtained is then combined with the occurrence position information obtained from the primary search processing, to obtain all of the occurrence positions of the search string within the object text.

More specifically, the invention provides a text search apparatus operable for deriving a set of text position values expressing all positions of occurrence within an object text of a character string which is specified as a search string, the object text being formed of a finite number of characters which are respective members of a predetermined character set, the apparatus comprising:

a dictionary list having registered therein a predetermined list of dictionary words or identifiers of the dictionary words, each of the dictionary words being a string of characters from the character set,

text position table means comprising a plurality of text position elements each expressing a combination of one of the dictionary words which appears within the object text, or an identifier of the dictionary word, in conjunction with a text position value of the dictionary word, each dictionary word of a text position element in the text position table being an extension word which is a maximum length word within a set of dictionary words each extending from a leading character position of the text position element;

extension word position correspondence table means having registered therein information expressing, for each of respective partial character strings of the extension words, a corresponding set of extension words and respective positions of the partial character string within each of the corresponding extension words,

extension word assessment means for obtaining from the extension word correspondence table means a first set of extension words each containing the search string as a partial character string and intra-word position information indicating respective positions of the search string within the words of the first set of extension words,

text position element look-up means for obtaining first sets of text position elements from the text position table means, each set of the first sets of text position elements being keyed to a corresponding extension word of the first set of extension words,

total set formation means for operating on the first sets of text position elements in conjunction with the intra-word position information, to obtain a first set of text position values of the search string, and

cover word assessment means for determining a specific sequence of cover words extending from a leading character to a final character of the search string, each of the cover words other than a final cover word of the sequence being a dictionary word, the final cover word being a dictionary word or a leading partial character string of a dictionary word,

wherein the extension word assessment means obtains from the extension word position correspondence table means respective second sets of extension words corresponding to the cover words and intra-word information expressing respective positions of each of the cover words within each of the extension words of the corresponding one of the second sets of extension words,

the text position element look-up means obtains from the text position table means, for each extension word of each set in the second sets of extension words, a corresponding second set of the text position elements;

the cover word assessment means executes comparison processing of the second sets of text position elements to select groups of sequentially occurring text position elements, each of the groups containing the cover word sequence extending continuously from within a leading text position element of the group,

and wherein the total set formation means obtains, based on respective leading text position elements of the groups in conjunction with the intra-word information from the extension word assessment means, a second set of text position values of the search string, and combines that second set with the first set of text position values to obtain a complete set of text position values of the search string, expressing all positions of occurrence of the search string within the object text.

The apparatus may also include means for modifying the dictionary list in accordance with the object text, to satisfy the above-mentioned necessary condition.

With such a text search apparatus, the partial character strings (of respective extension words) which are registered in the extension word position correspondence table means preferably comprise all of the partial character strings of the extension words. This provides maximum speed of searching.

Alternative, the partial character strings in the extension word position correspondence table means can comprise respective ones of all of the dictionary words which are individual characters and occur within the object text. This will minimize the storage capacity required to implement that table.

As another alternative, the partial character strings in the extension word position correspondence table means can comprise respective ones of all of the dictionary words which occur within the object text. This provides a compromise between the requirements for high search speed and moderate data storage capacity.

With such a text search apparatus, the text position table means comprises at least one text position table having table entries assigned to respective ones of the extension words, each of the table entries preferably comprising an extension word in conjunction with an array of text position values (specifying respective intra-text occurrence positions of that extension word) from all of the text position elements which are keyed to the extension word, the text position values being arranged in sequence of increasing magnitude.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a general block diagram of an example of a prior art type of text search apparatus;

FIG. 2 shows a simple example of an object text;

FIG. 3 shows an example of a dictionary list corresponding to the text example of FIG. 2;

FIG. 4 shows an example of an index file for use with a prior art type of text search apparatus, as applied to the text example of FIG. 2, in which text position values of dictionary words are registered;

FIG. 5 shows another example of an index file for use with a second prior art type of text search apparatus, as applied to the text example of FIG. 2, in which text position values of bi-grams are registered;

FIG. 6 is a general block diagram of a first embodiment of a text search apparatus according to the present invention;

FIG. 7 is a general block diagram illustrating the basic structure of an index file utilized in a text search apparatus according to the present invention, formed of an extension word correspondence table section and a text position table section;

FIG. 8 illustrates the basic concept of text position elements;

FIG. 9 shows a complete table of text position values for each dictionary word and character which appears in the text example of FIG. 2;

FIG. 10 shows a dictionary list, for the text example of FIG. 2, for use with the first text search apparatus embodiment;

FIG. 11 shows an example of a text position table, consisting of arrays of text position elements that are keyed to respective extension words, which constitutes the text position table section of the index file of the first text search apparatus embodiment;

FIG. 12A shows an example of an extension word correspondence table, which relates each partial character string of each extension word to respective intra-word positions within the corresponding extension words, for the first text search apparatus embodiment;

FIG. 12B shows an example of an extension word correspondence table for use with a modified version of the first text search apparatus embodiment;

FIG. 12C shows an example of an extension word correspondence table for use with another modified version of the first text search apparatus embodiment;

FIG. 13 is an example for illustrating an occurrence of a search string extending between sequential text position elements, showing the relationship between the search string and a corresponding sequence of cover words;

FIGS. 14A, 14B constitute a flow diagram of an algorithm which is executed by the first text search apparatus embodiment, to perform a text search;

FIG. 15 shows the contents of a second step in the flow diagram of FIGS. 14A, 14B;

FIG. 16 is a flow diagram showing the contents of a third step in the flow diagram of FIGS. 14A, 14B, for obtaining the set of text position elements which each contain the search string;

FIG. 17 is a flow diagram showing the contents of a fourth step in the flow diagram of FIGS. 14A, 14B, whereby each of the text position values of the search string are obtained for those occurrences where the search string is contained within respective text position elements;

FIG. 18 is a flow diagram showing the contents of a step in the flow diagram of FIGS. 14A, 14B, whereby a sequence of cover words which constitutes the search string is obtained;

FIG. 19 is a flow diagram showing the contents of a step in the flow diagram of FIGS. 14A, 14B, whereby respective sets of extension words corresponding to the cover words are obtained;

FIG. 20 is a flow diagram showing the contents of a step in the flow diagram of FIGS. 14A, 14B, whereby respective sets of text position elements corresponding to the extension words derived by the flow diagram of FIG. 19 are obtained;

FIG. 21 is a flow diagram showing the contents of a step in the flow diagram of FIGS. 14A, 14B, whereby groups of sequential text position elements are derived, each group containing the sequence of cover words;

FIG. 22 is a flow diagram showing details of the operation of the flow diagram of FIG. 21;

FIG. 23 illustrates processing performed by the flow diagram of FIG. 21 to find each group of sequential text position elements within which the cover word sequence occurs;

FIG. 24 is a general block diagram of a first embodiment of an apparatus for generating an index file for use in the first embodiment of a text search apparatus;

FIG. 25 is a flow diagram of an algorithm executed by the first index file generating apparatus embodiment, to generate a text position table;

FIG. 26 is a flow diagram of an algorithm executed by the apparatus of FIG. 24, to generate an extension word correspondence table;

FIG. 27 is a flow diagram of an algorithm executed by a second embodiment of an index file generating apparatus, to generate a text position table;

FIGS. 28A, 28B, 28C are conceptual diagrams for illustrating the basic principles of an extension word correspondence table section of the index file used in a second embodiment of a text search apparatus according to the present invention;

FIG. 29 illustrates the merging and sorting processing which must be applied to the text position values of a search string that are obtained from a plurality of arrays of text position elements keyed to respectively different extension words;

FIG. 30 illustrates the configuration of an example of a text position table utilized in the index file of the second text search apparatus embodiment, for low-occurrence extension words, in which each entry consist of an array of sequentially occurring text position elements which are keyed to respectively different extension words;

FIG. 31 illustrates the basic configuration of a table of high occurrence words listed in dictionary sequence, used in the extension word correspondence table section of the second text search apparatus embodiment;

FIG. 32 illustrates the basic configuration of a table of high occurrence words listed in inverse dictionary sequence, used in the extension word correspondence table section of the second text search apparatus embodiment;

FIG. 33 illustrates the basic configuration of a single-character intra-word position table, used in the extension word correspondence table section of the second text search apparatus embodiment, which specifies the relationship between each character which occurs within an extension word other than as the leading or final character, and the corresponding extension words and intra-word position values;

FIG. 34 is a general block diagram of the second embodiment of a text search apparatus according to the present invention;

FIG. 35 is a general block diagram illustrating the configuration of the index file utilized in the second text search apparatus embodiment;

FIG. 36 is a partial example of a table of initial characters of high occurrence words which appear in a dictionary sequence list; 10 FIG. 37 is a table constituting the dictionary sequence list of high occurrence words corresponding to the table of FIG. 36;

FIG. 38 is a partial example of a table of final characters of high occurrence words which appear in an inverse dictionary sequence list;

FIGS. 39A, 39B form a table constituting the inverse dictionary sequence list of high occurrence words corresponding to the table of FIG. 38;

FIG. 40 is a partial example of a table of initial characters of low occurrence words which appear in a dictionary sequence list;

FIG. 41 is a table constituting the dictionary sequence list of low occurrence words corresponding to the table of FIG. 40;

FIG. 42 is a partial example of a table of final characters of low occurrence words which appear in an inverse dictionary sequence list;

FIG. 43 is a table constituting the inverse dictionary sequence list of low occurrence words corresponding to the table of FIG. 42;

FIGS. 44A, 44B constitute a partial example of a single-character intra-word position table which is used in the extension word correspondence table section of the second text search apparatus embodiment in conjunction 10 with the tables of FIGS. 36 to 43;

FIG. 45 is a partial example of a text position table of high occurrence words, used in the text position table section of the second text search apparatus embodiment, in which each entry expresses an array of text position elements which are keyed to a specific extension word;

FIG. 46 is a partial example of a text position table of low occurrence words, used in the text position table section of the second text search apparatus embodiment, in which each entry expresses an array of text position elements which are keyed to respectively different low occurrence words, for a specific group of low occurrence words which are listed sequentially in the table of FIG. 41;

FIG. 47 is a partial example of a text position table of low occurrence words, used in the text position table section of the second text search apparatus embodiment, in which each entry expresses an array of text position elements which are keyed to respectively different low occurrence words, for a specific group of low occurrence words which are listed sequentially in the table of FIG. 43;

FIGS. 48A, 48B constitute a flow diagram of an algorithm executed by the second text search apparatus embodiment to perform a text search;

FIG. 49 is a flow diagram showing the contents of a second step in the flow diagram of FIGS. 48A, 48B;

FIG. 50 is a flow diagram showing the contents of a first step in the flow diagram of FIG. 49, executed to obtain all of the extension words having the search string as a prefix portion;

FIG. 51 is a diagram for assistance in describing a method of locating a set of extension words having the search string as a prefix portion, within a serially numbered list of extension words;

FIGS. 52A, 52B constitute a flow diagram of an algorithm for locating a set of high occurrence words which each have the search string as a prefix portion, based on the method illustrated in FIG. 51;

FIG. 53 is a flow diagram showing the contents of a step in the flow diagram of FIGS. 52A, 52B, for detecting whether a specific word is the first of a set of listed high occurrence words which each begin with the search string;

FIGS. 54A, 54B constitute a flow diagram of an algorithm for locating a set of low occurrence words which each have the search string as a prefix portion;

FIG. 55 is a flow diagram showing the contents of a second step in the flow diagram of FIG. 49, for obtaining all extension words having the search string as a suffix portion;

FIGS. 56A, 56B constitute a flow diagram of an algorithm for locating a set of high occurrence words which each have the search string as a suffix portion;

FIGS. 57A, 57B constitute a flow diagram of an algorithm for locating a set of low occurrence words which each have the search string as a suffix portion;

FIG. 58 is a flow diagram showing the contents of a third step in the flow diagram of FIG. 49, for locating all extension words within which the search string occurs as a partial character string, other than as a prefix or a suffix portion;

FIG. 59 is flow diagram showing the contents of a third step in the flow diagram of FIGS. 48A, 48B, for obtaining all of the text position elements which each entirely contain the search string;

FIG. 60 is a flow diagram showing the contents of a step in the flow diagram of FIG. 59, for obtaining all of the text position elements which are keyed to a specific extension word;

FIG. 61 is a flow diagram showing the contents of a fourth step in the flow diagram of FIGS. 48A, 48B executed to obtain all of the text position values of a search string, for those occurrences where the search string is contained entirely within respective text position elements; and

FIG. 62 is a flow diagram showing the contents of a step in the flow diagram of FIGS. 48A, 48B which is executed to obtain respective sets of extension words which correspond to respective ones of a sequence of cover words of the search string, for those occurrences where the search string extends between two or more text position elements.

DESCRIPTION OF PREFERRED EMBODIMENTS

FIG. 6 is a conceptual block diagram of a first embodiment of a text search apparatus according to the present invention, and FIGS. 15A, 15B constitute a flow diagram for describing an algorithm which is executed by the apparatus to perform a text search to find the intra-text positions of a character string. In FIG. 6, the functions of the text search apparatus are indicated as being performed by various sections of the apparatus. However in practice, most of these functions can of course be implemented by a computer apparatus which is programed to execute the search algorithm described hereinafter.

As shown in FIG. 6, the text search apparatus has a search string input apparatus 41 which can consist of data input devices such as a keyboard, mouse, etc. The apparatus further includes an index file 42, which is held in an electronic data storage medium, a dictionary list 44 which is similarly stored, and a search result output apparatus 50, consisting of a data output device such as a display terminal. The apparatus moreover includes a dictionary list look-up section 43, a cover word assessment section 45, an extension word assessment section 46, a text position element search section 47, a continuity possibility assessment section 48, and a total set combining section 49. The character strings which are registered in the dictionary list will be referred to as the dictionary words, in the following. In its initial state (prior to generating an index file for a specific object text, as described hereinafter) the dictionary list can be an arbitrarily determined list of strings which have been predefined as dictionary words. However in order to completely achieve all of the objectives of the present invention, it may be necessary to augment the dictionary list, in accordance with the contents of the object text. Specifically, it is necessary to ensure the following. For each character position of the object text, there must exist at least one dictionary word which begins from that character position. However with a language such as Japanese or Chinese, having thousands of characters which may be used, it is possible that a text may contain one or more (rarely used) characters which are not registered in an ordinary dictionary. For that reason as described in detail hereinafter, before generating the index file, the object text is preferably examined to locate any character position in the text which does not meet the condition that at least one dictionary word begins from that character position. If such a position is found in the text, then a string which begins from that character position is registered as a new word in the dictionary list.

The essence of the invention lies in the contents of the index file 42. That consists of two basic sections, as shown in FIG. 7, i.e. an extension word correspondence table section 31 and a text position table section 32. It is an essential feature of the invention that certain occurrences in the object text of specific dictionary words are utilized for text position reference purposes. Specifically, when a dictionary word occurs, beginning at a certain character position within the object text, such that it is the longest dictionary word among the set of dictionary words which extend from that character position, and there is no dictionary word which begins at a character position preceding the first-mentioned character position and which extends beyond the final character of the first-mentioned dictionary word, then that longest dictionary word is designated as an extension word, and the combination of that extension word and the intra-text position of the leading character of the word is designated as a text position element which is keyed to that extension word. In general, a plurality of text position elements will be keyed to a single extension word. It can be understood that each extension word consists of a single dictionary word, which may have one or more multi-character dictionary words as partial character strings thereof.

It will be assumed that respective character positions in the object text have been assigned sequential identifying numbers, referred to herein as text character numbers. The text position of a character string will be assumed to be specified by the text character number of the leading character of that character string. That text character number of the leading character will be referred to as the text position value of the character string.

The extension words are selected by executing an algorithm (described hereinafter) for sequentially examining each of the character positions of the object text to find, for each character position, the longest dictionary word of the set of dictionary words which begin from that character position. If that longest dictionary word (referred to in the following as the maximum-length word for that character position) extends beyond all maximum-length words which have been found for the preceding character positions, then that maximum-length word is selected as an extension word with respect to that position in the text. All of the text position elements of the object text are expressed by the contents of the text position table section 32 of the index file 43.

The basic concepts of text position elements and extension words are illustrated in the text example shown in FIG. 8. It will be assumed that the dictionary list includes the words "reasonable", "reason", "treason", "son", and "on", and that the portions "areasonableman", "thereasonfor" and "treasonwasfound" occur within the object text at the positions shown. In that case, the set of text character positions numbered 44 to 53 constitute a text position element which is keyed to the extension word "reasonable" (containing the dictionary words "reason", "son", etc.), and the set of positions numbered 73 to 78 constitute a text position element which is keyed to the extension word "reason", for example.

The second basic section of the index file is the extension word correspondence table section 31, which expresses intra-word position relationships between each extension word and all of the partial character strings of that extension word. The basic use can be understood from the example of FIG. 8 in which three text position element examples are designated as TPa, TPb an TPc. For example, the leading character of the string "son" is positioned at the fourth character position within the extension word "reason", i.e. it is displaced by three character positions from the initial character of "reason". Hence, one text position of "son" is obtained from the text position element (reason, 73), i.e. TP.sub.b. That text position is obtained as (73+3)=76. The displacement of the leading character of a partial character string of an extension word from the leading character of that extension word will be referred to as the intra-word position value of that string with respect to that extension word.

The term "sequential" as applied herein to words or text position elements should be understood to have the specific significance of "words (or text position elements) which occur either immediately consecutively, or sequentially but partially overlapping". In that sense, for example, the words "bore" and "red" occur sequentially within the word "bored", while "editor" and "won" occur sequentially in the text portion "editorwon".

The basic operation of the first embodiment will be described in more detail in the following, using the aforementioned text example "theeditorwonit". It will be assumed first that the dictionary list is as shown in FIG. 10. More specifically, FIG. 10 shows an example of a set of character strings, listed as respective dictionary words, whereby the aforementioned necessary condition is satisfied in relation to the object text of FIG. 2, i.e. at least one dictionary word begins from each character position in the text.

The text position table 21 of FIG. 11 is generated (by an apparatus described in detail hereinafter) by examining successive character positions in the object text as follows, for the case of the text example of FIG. 2. Firstly, the set of dictionary words extending from the first character position of the text is ›t, the!. The maximum-length word of that set is "the", and so this is selected as the extension word of a first text position element, which can be expressed in the form (the, 1). The set of dictionary words extending from the second character position of the text is ›he, heed!. In that case, the string "heed" satisfies the necessary conditions for being an extension word, i.e.:

(a) it is the maximum-length word (of the set of dictionary words which begin from the second character position), and

(b) no word, beginning from a preceding character position, extends beyond "heed".

Hence, (heed, 2) is selected as the second text position element. By repeating such operations, the set of text position elements of the text position table 21 shown in FIG. 11 is obtained. The table 21 is an example of the contents of the text position table section 32 of the index file 42 of this embodiment, for the text example of FIG. 2. Each entry in the text position table 21 consists of an array of text position elements, each array being keyed to a specific extension word, although in this very simple example each array has only a single member, such as the arrays (it, 5), and (heed, 2). Each array is preferably arranged in order of successively increasing text position values.

FIG. 12A shows an example of the contents of the extension word correspondence table section 31 of the index file 42 of this embodiment, consisting of an extension word correspondence table 22. Again, the simple text example of FIG. 2 is assumed. In the table 22, each table entry consists of a character string which is a partial character string of at least one extension word, in conjunction with the corresponding set of (extension word/intra-word position value) combinations. The entry for the character string "he" in table 22, for example, indicates that this string is a partial character string of each of the extension words "the" and "heed", and has the respective intra-word position values of 2 and 1 with respect to these extension word, signifying that the amounts of position difference between the leading character of "he" and the respective leading characters of these extension words are 1 and 0 character positions respectively.

FIG. 9 shows a table of the complete set of intra-text position values for all of the dictionary words which appear in the example text. It can be understood that the set of text position elements of table 21 is a sub-set of the complete set of FIG. 9

The principles of using the text position table 21 and extension word correspondence table 22 are as follows, assuming the simple text example "theeditorwonit" of FIG. 2 is searched. If for example the text position value of the string "or" is to be found, then the extension word correspondence table 22 is first examined to find the corresponding entry therein, and so obtain each of the extension words which contain the required string as a partial character string, and the intra-word position value of the required string with respect to each of these extension words. In this case, there is the single extension word "editor", and the intra-word position value is 5. The array of text position elements corresponding to that extension word is then found from the text position table 21. In this case, there is the single text position element (editor, 4). The text position value of the required string is obtained as:

{text position value of extension word+intra-word position value of string within that extension word-1}, i.e. (5+4-1)=8.

Furthermore, as described above, the dictionary list 44 is assumed to be configured such as to satisfy the aforementioned condition "for each character position of the object text, there exists at least one dictionary word which begins from that position".

In that case, it becomes possible to find the text position value of any occurrence of an arbitrary search string where that string extends between two successive text position elements within the text.

As a specific example, it will be assumed that the position of the string "orwo" is to be found. In that case, the string is first examined, beginning from its leading character, to find the longest dictionary word which extends from that character position, i.e. "or", with such a dictionary word being referred to in the following as a cover word, in relation to the search string. There is no dictionary word corresponding to the final two characters "wo", and so these will be considered to constitute a provisional "longest word" extending from the "w" position in the search string. The extension word correspondence table 22 is then examined to find each of the extension words which contain "or" (in this case, the single extension word "editor"), and each of the extension words which contain "wo" (in this case, the single extension word "won"). The cover words are determined by the cover word assessment section 45 in FIG. 6. Each of the text position elements corresponding to these extension words are then obtained from the extension word correspondence table 22, in order to find any sequential occurrences of these or partially overlapping occurrences of these text position elements such that the string "orwon" is formed. These operations are performed by the extension word assessment section 46 and text position element look-up section 47 of FIG. 6. Since "or" begins from the fifth character position in the extension word "editor" and "wo" begins from the first character position in the extension word "won", and since (from the text position table 21) it is found that there is a text position element keyed to "editor" which begins from text position value 4, and a text position element keyed to "won" which begins from the text position value 10, it is found that the search string "orwo" has an occurrence beginning from the text position value of the string "or", which is 8, as derived above. That is to say, an occurrence of the string "orwo" has been found as a string which extends between the consecutively occurring text position elements (editor, 4) and (won, 10). That operation is performed by the continuity possibility decision section 48 of FIG. 6.

It should be noted that it is inherent to the above operation that when deriving the cover word sequence for a search string, if a final character string is encountered which is not a dictionary word or sequence of dictionary words (as is the case with "wo" in the above example) then that character string must be an initial partial character string of a dictionary word. In that case, the final character string is considered as a final cover word of the cover word sequence, i.e. a pseudo-word.

It is equally possible to apply such a technique to find the text position value of a search string which extends between two partially overlapping text position elements. For example, it is clear that it is similarly possible to find the text position value of the string "eedi" in the text example of FIG. 2, which extends between the overlapping text position elements (heed, 2) and (editor, 4). In that case, the cover word sequence would be "e", "e", "d" and "i", each of which is registered as a word in the dictionary list of FIG. 10.

The basic operations performed in a text search by the first embodiment of the invention are as follows. First the search string is supplied, from the search string input apparatus 41 in FIG. 6. The extension word correspondence table 22 in the index file 42 is then searched, by the extension word assessment section 46, to find if there is an entry for the search string in that table. That will be the case if the search string is a dictionary word, or is a character string which is not a dictionary word but which occurs (at least once within the object text) as a partial character string of an extension word which is keyed to a text position element. If that is the case, then the set of corresponding extension words is obtained by the extension word assessment section 46 from the extension word correspondence table 22. The text position table 21 of the index file 42 is then searched (by the text position element search section 47) to find, for each member of that set of corresponding extension words, the corresponding array of text position elements. The extension word assessment section 46 then obtains from table 22 of the index file 42 the respective intra-word position values of the search string with respect to each of the corresponding extension words. The total set formation section 49 thereby obtains all of the text position values for the search string, i.e. based on respective combinations of an extension word text position element and an intra-word position value within that text position element. The total set formation section 49 then examins the respective arrays of text position values for the search string which have thereby been obtained from respective extension words, merges these into a single array (while eliminating any duplicated text position values), and then sorts the resultant set of text position values into a single sequential array (i.e. in order of increasing magnitude). That completes the primary search processing, which locates each occurrence of the search string as an extension word which is keyed to a text position element or as a partial character string of such an extension word.

A decision is then made as to whether the search string is a dictionary word. If so, there is no possibility that the search string can extend over two or more sequential text position elements. The text position values obtained are supplied to the search results output apparatus 50, and this terminates the search processing.

As explained above, if it is judged that the search string is not a single dictionary word, it is necessary to execute the secondary search processing to find any possible positions of occurrence where the search string extends over two or more text position elements. This search processing begins by finding a sequence of dictionary words which, when arranged in a specific sequence (directly consecutive or partially overlapping), constitute the search string, i.e. the aforementioned cover words. The sequence of cover words is preferably found by successively examining each of the character positions of the search string, from the first character, to find the set of dictionary words each of which begins from that character position, and selecting as a cover word any word in that set which satisfies the conditions:

(a) it is the longest word of the set, i.e. is the maximum-length word for that character position, and,

(b) it extends beyond any dictionary word which begins from a preceding character position within the search string, and

(c) it terminates at or before the final character of the search string.

A word which satisfies these conditions is designated as a cover word for the character position from which it begins. However when the point is reached at which the final cover word of the sequence is to be obtained, it may be found that there is no dictionary word which satisfies the above conditions. For example, if the search string were to be "itorwo", in the text example of FIG. 2, and assuming that the character "w" is not registerd in the dictionary list, although the string "won" is a dictioary word, then after the successive cover words "it", "to", "or" have been obtained based on the above conditions, it will be found that there is no word which meets these conditions, extending from the "w" position within the search string. In such a case, the final string "wo" is treated as a pseudo-word which constitutes the final cover word of the sequence. It can be understood that in such a case, the pseudo-word will always be a leading partial character string of at least one dictionary word, and hence of at least one extension word.

The above processing is executed by the cover word assessment section 45. If for example the search string is "rwon", in the case of the object text "theeditorwonit", then the single dictionary word "r" is obtained as the longest word for the first character position of the search string, the set of words "w" and "won" is obtained for the second character position, the set of words "o" and "on" is obtained for the third character position, and "n" is obtained for the final character position. For that reason, the set of cover words "r" and "won" will be obtained for the search string.

That set of cover words is then examined, to find respective sets of extension words corresponding to them. This is performed by the extension word assessment section 46, by searching in the extension word correspondence table 22. The respective sets of text position elements corresponding to these extension words are then obtained by the text position element search section 47, from the text position table 21. The respective sets of text position elements of the extension words are then examined, by the continuity possibility decision section 48, to find each occurrence of a condition whereby a text position element keyed to an extension word of the first cover word of the sequence can link to a text position element which is keyed to an extension word of the second cover word in the sequence of cover words (i.e. is immediately followed by, or is followed by in a partially overlapping manner, by that text position element of the second cover word). For each such linkage that is found, the pair of text position elements is examined to find if the first and second cover words occur sequentially within that pair of text position elements. The text position elements keyed to the extension words of the third cover word are then examined to find if any of these can be similarly connected to the previously found linkages, and so on. Each occurrence within the object text is thereby found of a linked sequence of text position elements within which the cover words appear in the sequence which covers the search string. The corresponding positions of the search string within the object text are then obtained based on the respective first text position elements of the linked sequences which have been found.

For example, if the search string is "orwon", for the object text "theeditorwonit", with the cover words being "or" and "won", then respective single-member sets of text position elements for the cover words "or" and "won" are obtained, as "editor, 4" and "won, 10". From Table 22, it is found that "or" occurs from the fifth character within the extension word "editor", while "wo" begins from the first character position within the extension word "won". It is thereby found that there is an occurrence of "or" at the text position value (5+4-1) =8, and an occurrence of "wo" with the text position value (10+1-1)=10. Thus the above linkage conditions are satisfied for the text position elements "editor, 4" and "won, 10", i.e. it has been found that the search string extends between these two text position elements. The text position value of the search string is then obtained as that of the first of the two cover words "or" and "won", i.e. is obtained as 8.

It will be apparent that the above operation would be similar if the final cover word were a pseudo-word, such as "wo", rather than a complete dictionary word. In that case, the extension word correspondence table would be used to find all of the extension words which begin with "wo". However (in the normal case in which the object text and index file contain large numbers of words), such a pseudo-word may be be linked to a greater number of text position elements than is a dictionary word. For example, while "won" might correspond to the extension word "wonderful", "wo" might correspond to extension words such as "wonderful", "woman", "worry", etc.

In the above example, text positions are obtained where the search string is found to extend between two directly consecutive text position elements. However the apparatus will similarly obtain each occurrence of a search string which extends over two partially overlapping sequential text position elements.

Furthermore it will be apparent that the operation will find any occurrences where the search string extends over more than two sequential text position elements.

Thus, the entire set of text occurrence positions of the search string, where that string extends between two or more sequentially occurring text position elements, can be obtained by the secondary search processing. The respective arrays of text position values for the search string which have thereby been obtained are then merged into a single set of text position values, which is then sorted into a sequential array of text position values, by the total set formation section 49, and that array of text position values is then merged with the array of text position values found by the primary search processing, with the resultant set of text position values then being sorted into a sequential array, by the total set formation section 49. The results are supplied to the search results output section 50.

The above process is further illustrated in the example of FIG. 13, in which it is assumed that the object text includes a character sequence "reasonabletotal" extending from character positions 50 to 64, with the search string being "ableto", and the sequence of cover words "able" "let" and "to" being obtained. In this case, the cover words partially overlap, as shown.

The operation of the first text search apparatus embodiment will be described more specifically referring to the flow diagram of FIGS. 14A, 14B. This shows a basic algorithm which is executed by the apparatus to find each occurrence within the object text of an arbitrarily determined search string which is supplied by the user via the search condition input apparatus 41. Firstly, in step S1, the search string is supplied. The search string has N characters, with respective character positions within that string being numbered 0 to (N-1). Step S2 is then executed, in which the extension word assessment section 46 searches the extension word correspondence table 22 of the index file 42, to find a table entry which corresponds to the search string. If such an entry is found, then the corresponding set of E extension words is obtained from that entry, with the set being designated as having m members (el to em). In addition, the extension word assessment section 46 obtains the respective intra-word position values of the search string with respect to each of the extension words of the set E. Next, in step S3, the text position element look-up section 47 searches the text position table 21 to find the respective sets of text position elements which correspond to each of the extension words (e1, e2, . . . en) of set E. The total set EIDX of these sets of text position elements thus consists of the set (ed.sub.-- 11, ed.sub.-- 12, . . . ) obtained for the first extension word e1 of the set E, the set (ed.sub.-- 21, ed.sub.-- 22, . . . ) obtained for the second extension word e2 of the set E, and so on.

Step S4 is then executed, in which the text position elements of the set EIDX, in conjunction with the respective intra-word position values of the search string with respect to these, are used by the total set formation section 49 (i.e. in merging and sorting processing as described above) to obtain a set of text position value for the search string. The resultant set of text position values is designated as P={p1, p2, . . . pk'}, i.e. having k' members. That completes the primary search processing.

Step 5 is then executed, in which a decision is made as to whether the search string is a dictionary word Specifically, the dictionary list look-up section 43 searches the dictionary list, to find if the search string is listed therein. It will be assumed that the specific function of judging whether or not a character string is a dictionary word is also performed by the dictionary list look-up section 43. If the search string is found to be a dictionary word, then this signifies that all of its positions of occurrence in the object text are specified by the set P, and so after outputting the set P to the search results output apparatus 50, processing is terminated.

If it is found in step S5 that the search string is not a dictionary word, then it is necessary to execute the secondary search processing, to find any positions where the search string occurs in the object text as a string which extends between two or more extension words. First, step S7 is executed, in which a set of cover words for the search string is obtained by the cover word assessment section 45, as described above, with that set being designated as the set C={w1, w2, . . . wr}. Respective sets of extension words corresponding to each of these cover words are then obtained by the cover word assessment section 45, from the extension word correspondence table 22 of the index file 42, in step S8. The set of extension words obtained for the first cover word w1 will be designated as E1 (=extension word.sub.-- 11, extension word.sub.-- 12, . . . ). The total set of the sets El, E2, . . . is designated as Ei. As a simple example, if the search string is "ableperson", the cover words obtained for the string might be "able", and "person", and the set E1 of first extension words(extension word.sub.-- 11, extension word.sub.-- 12) might be "notable" and "valuable" respectively, while the second extension words might be "personality" and "persons", as extension word.sub.-- 21 and extension word.sub.-- 22 respectively of the set E2.

Next, in step S9, the respective arrays of text position elements corresponding to the extension words in the set Ei are obtained by the text position element look-up section 47, from the text position table 21 of the index file 42. The total set of the various arrays of text position elements is designated as EIDX, and the respective members of that total set are designated as EIDX1, EIDX2, etc.

Step S10 is then executed, in which the text position elements of set EIDXi are assessed by the continuity possibility decision section 48, to find each occurrence in the text of a set of those text position elements which satisfy the conditions that

(a) they occur sequentially (i.e. directly consecutively, or partially overlapping, as described above), and

(b) the required sequence of cover words occurs within them.

These sets of extension word text position elements which satisfy these conditions are designated as CED1, CED2, etc., as members of an overall set CEIDX.

The purpose of the operations executed in step S11 is to obtain each of the text position values of the search string. As described hereinabove, this is achieved by obtaining (using tables 22 and 21) the position of the initial cover word of the cover word sequence within each of the sets of text position elements CED1, CED2, etc., i.e. by obtaining the amount of offset between the initial character of that first cover word and the initial character of a set of sequential text position elements, from table 22, then obtaining the text position value of the leading text position element of that set, from table 21.

In step S11, the respective first extension word text position elements of the groups CED1, CED2, etc. of the set CEIDX are first extracted. From each of these text position elements, the corresponding text position value of the search string is obtained, based on that text position element and the intra-word position value of the leading cover word (and hence of the search string) with respect to the extension word of that text position element. The set of values thus obtained is designated as Q (={q1, q2, . . . }), and specifies all of the text position values for the search string. The set Q is produced and combined with the set P, by merging and sorting processing as described above, and operation is then terminated.

FIG. 15 shows details of the contents of step S2 of the flow diagram of FIGS. 14A, 14B of this embodiment. As shown, the extension word correspondence table 22 is searched to find the set E of extension words which correspond to the search string.

FIG. 16 is a flow diagram showing details of the contents of step S3 of the flow diagram of FIGS. 14A, 14B. As shown, this includes a step S34 in which, for each of the members (e1, e2, etc.) of the set of extension words E, i.e. for the i.sup.th member (ei) of that set, the array of text position elements {ed.sub.-- i1, ed.sub.-- i2, . . . } corresponding to that extension word are obtained (from the entry for the extension word ei in the text position table. 21). With the number of members of the set E being designated as m, the step S34 is executed successively m times. At each execution, the array of text position elements obtained for extension word ei is added to the set EIDX.

FIG. 17 is a flow diagram showing details of the contents of step S4 of the flow diagram of FIGS. 14A, 14B. As shown, in a step. S42, each of the text position elements (edj) of the set EIDX is expressed in the form (ej, pj) i.e. as a combination of a specific extension word and a text position value of that extension word. In a step S45, the amount of offset L between the initial character of the extension word ej and the search string is derived, to thereby obtain a text position value pj' for the search string. These steps are executed successively for each of the text position elements of the set EIDX. As a result, the processing of step S4 derives the text position values for each occurrence of the search string within a text position element. Hence, if the search string is a dictionary word, each occurrence of the search string has been obtained at this stage. In that case, a "yes" decision is made in step S5 (e.g. after searching the dictionary list and finding the search string registered therein), the set of text position values obtained is outputted, and the search processing is terminated.

FIG. 18 is a flow diagram showing details of the contents of step S7 of the flow diagram of FIGS. 14A, 14B of this embodiment. As shown, the character positions of the search string are successively examined, by using a counter variable s. For each character, the set W of dictionary words which begin from that character position (i.e. extending towards the end of the search string) is found. As described hereinabove, it is possible that the search string may terminate in a string which is not a dictionary word, i.e. it may be found that the set W is empty. That condition is detected in step S73, whereupon in step S74 the terminating string portion is designated as the only member of the set W (in spite of the fact that it does not constitute a dictionary word, i.e. is a pseudo-word).

The longest word w of the set W is then found, in step S75, and its length is designated as L. A step S76 is then executed, to find if that longest word w extends, within the search string, beyond the position to which any previous longest word (beginning from a preceding character) has extended. If so, then in step S77 that word w is added to the set C of cover words for the search string. Otherwise, processing proceeds to the next character of the search string.

It can thus be understood that, the final member wr of the set C is either a dictionary word, or a pseudo-word which is an initial partial character string of a dictionary word.

FIG. 19 is a flow diagram showing details of the contents of step S8 of the flow diagram of FIGS. 14A, 14B of this embodiment. As shown, each of the cover words of set C is examined, sequentially, in a step S83. In that step, the set of extension words corresponding to a cover word (i.e. the set of extension words which each contain that cover word as a partial character string) is obtained, by looking up the extension word correspondence table 22 to find the entry corresponding to that cover word. For the i.sup.th cover word of the set C, the corresponding set of extension words is designated as Ei (={ew.sub.-- i1, ew.sub.-- i2, . . . }.

FIG. 20 is a flow diagram showing details of the contents of step S9 of the flow diagram of FIGS. 14A, 14B. As shown, the number of extension words corresponding to the i.sup.th cover word, i.e. number of members in the set Ei, is designated as ENi in step S93. A step S96 is repetitively executed, for each such set Ei. A counter variable j is used to successively select each of the extension words of set Ei, and in step S96, the array of text position elements corresponding to the j.sup.th extension word of the set Ei is obtained, by accessing the entry corresponding to that extension word in the text position table 21. That array of text position elements is added to a set EIDXi of respective arrays of text position elements for the extension words of set Ei. Thus, upon completion of ENi repetitions of step S96, all of the text position elements which contain the ith cover word are contained in the set EIDXi. These operations are then repeated for the set of extension words which correspond to the next cover word in the aforementioned set C, by incrementing the value of the counter i, in step S98.

Respective arrays of text position elements EIDX1, EIDX2, etc. are thereby successively obtained, corresponding to respective cover words, and are sequentially added to a complete set EIDX. In that way, the final state of EIDX consists of all of the arrays of text position elements for each of the sets of extension words which respectively correspond to the cover words of the set C.

FIG. 21 is a flow diagram showing details of the contents of step S10 of the flow diagram of FIGS. 14A, 14B. Firstly, the value of i (used as a counter which corresponds to respective ones of the set C of cover words w1, w2, . . . ) is initialized at 1. Next, in step S102, the complete set of arrays (›ed.sub.-- i1!, ›ed.sub.-- i2!, . . . ) of text position elements for the extension words of the first cover word (w1) is obtained from the aforementioned set EIDX. These arrays of text position elements are then designated as preliminary members of a set Li.

Thereafter, a counter variable j (used to specify successive members of set Li) and counter variable jj (used to specify successive members of the set of text position elements of the next cover word of set C) are initialized to 1. After testing the values of these counter variables, a step 107 is executed, to find if any text position element in the set Li and the text position element for the succeeding cover word, which are currently respectively selected, are either consecutive or sequentially overlapping, and if so, might form part of a sequence of text position elements within which the search string occurs. If such a linkage is found between the pair of text position elements, that pair is added as an element (sub-set) of the set L(i+1), in step S108.

The basic operation can be understood from the simple example of FIG. 23, in which it is assumed, for ease of description that:

(a) there is a total of 3 cover words (w1, w2, w3) in set C,

(b) each cover word has four corresponding extension words, and

(c) each of the extension words has an array of corresponding text position elements which has only a single member. Thus for example, for the first of the four extension words which correspond to the first cover word w1, the corresponding array of text position elements ›ed.sub.-- 11! has a single member, which is designated as ed.sub.-- 11.

The first execution of step S107 is illustrated by the "first stage" processing shown in the example of FIG. 23. First the text position element ed.sub.-- 11 of the set Li is compared with each of the text position elements ed.sub.-- 21 to ed.sub.-- 24 for the second cover word w2, then that operation is similarly performed for each of the other members of set Li. The preliminary linkages thus found are illustrated by the broken-line arrow indications in FIG. 23. It is assumed in that example that three such linkages are found, for the initial state of set Li, so that the set L(i+1) is assigned three sub-sets, i.e. arrays, as shown.

On completion of that process, i is incremented by 1, so that the set L(i+1) becomes the updated set Li. The above process is then repeated, to find linkages between any member of set Li and the text position elements of the third cover word (w3), as illustrated by the "second stage" operation in FIG. 23. In that example, two linkages are found, so that the final result is that two sets of text position elements have been found, i.e. (ed.sub.-- 11, ed.sub.-- 23, ed.sub.-- 31) and (ed.sub.-- 12, ed.sub.-- 21, ed.sub.-- 34) which are assigned as arrays of the set CEIDX.

The operations performed in step S107 are shown in greater detail, as the set of steps S107A to S107C in FIG. 22. As shown, after finding a pair of text position elements (corresponding to a pair of sequential cover words of set C) which satisfy the necessary conditions for "possibility of linkage continuity" (i.e. the text position elements occur sequentially) in step S107A, it is necessary to search in the extension word correspondence table 22 to find the respective positions at which the two cover words appear within the text position elements. One of these text position elements is the ith member of a sub-set within set Li, while the other is a text position element which contains the succeeding cover word. For example, referring to the "second stage" operation shown in FIG. 23, if it has been found that the text position elements ed.sub.-- 23 (which is the second text position element of a sub-set in set Li) and ed.sub.-- 31 are sequential, in step S107A, then it is necessary to find the position of the second cover word w2 within the text position element ed.sub.-- 21 and the position of the third cover word within the text position element ed.sub.-- 31.

Step 107C is then performed, to find whether that pair of cover words actually appear in the requisite sequence within the pair of text position elements. If so, then the text position element ed.sub.-- 31 is added to the sub-set (ed.sub.-- 11, ed.sub.-- 23), to obtain what in this case is a final sub-set of Li, as illustrated by the full-line arrow indications in the "second stage" processing of FIG. 23.

It can thus be understood that upon completion of the processing shown in FIG. 21, operating on all of the text position elements which were obtained in step S9, the final contents of the set Li will consist of sub-sets of sequential text position elements, with the search string extending within each of these sub-sets, and with the first element of each sub-set containing the initial character of the search string. The final contents of L(i) are designated as the set CEIDX, whose members are designated as {CED1, CED2, . . . } in step S10 of FIGS. 14A, 14B. For example, the sub-sets ›ed.sub.-- 11, ed.sub.-- 23, ed.sub.-- 31! ›ed.sub.-- 12, ed.sub.-- 21, ed.sub.-- 34! in the example of FIG. 23 might constitute the elements CED1, CED1 of the set CEIDX. Hence, by extracting the respective first text position elements of CED1, CED2, etc., and obtaining from the extension word correspondence table 22 the respective positions of the first cover word w1 within each of these first elements, the set Q of text position values of the search string is obtained, in step S11.

It can thus be understood that with the first text search apparatus embodiment described above it becomes possible to find, for any arbitrary character string which appears within the object text, all of the positions where that character string appears, with complete reliablity of searching. If the search string is a dictionary word, then the highest speed of searching will be attained, since it is not necessary to execute the steps of the secondary search processing (steps S7 to S12