DATABASE SCHEMA OR DATA STRUCTURE

Information search method and apparatus, and medium for storing information searching program

5992737

Abstract

A system and method for searching a large volume of data stored in a disk in, for example, document file format at a high speed while allowing desired ambiguity. The present invention provides a character string search scheme which enables a user to optionally designate the degree of ambiguity for a character string to be searched. The present invention also provides an index structure for implementing a character string search scheme which enables a user of the search scheme to optionally designate the degree of ambiguity for a character string to be searched. The present invention also provides a character string search scheme which enables the scheme to perform searches in a manner close to human feeling by ambiguity search. Details of the manner in which such enablements are implemented are found in the detailed specification.


Claims

What is claimed is:

1. A method for searching a document character string which contains a variable length chain document character string, and which is similar to a search character string at a predetermined similarity factor or higher of character string, said search character string containing a variable length chain search character string, in a document searchably stored through computer processing, the method comprising the steps of:

(a) associating document character string identifying information for identifying a document character string in which said variable length chain document character string exists and document character string position information for indicating the position in the document character string at which said variable length chain document character string exists with the variable length chain document character string;

(b) extracting and storing a partial document character string of a length of M characters (M being a predetermined integer of one or more) from said variable length chain document character string;

(c) associating variable length chain document character string identifying information for identifying a variable length chain document character string in which said partial document character string exists and variable length chain document character string position information for indicating the position in the variable length chain document character string at which said partial document character string exists with said partial document character string;

(d) searching a partial document character string which matches a partial search character string of a length of N characters (N being a predetermined integer of one or more) extracted from said variable length chain search character string;

(e) searching a partial document character string which matches a partial search character string of a length of N characters in the variable length chain search character string but having a start position shifted by one;

(f) identifying a set of partial document character strings matching the partial search character string in said variable length chain search character string by repeating said step (e);

(g) identifying candidates of variable length chain document character strings which have possibility of matching said variable length chain search character string at the predetermined matching factor or higher of variable length chain from variable length chain document character string identifying information and variable length chain document character string position information for a partial character string belonging to said set of partial document character strings;

(h) calculating a matching factor of variable length chain between said variable length chain search character string and said variable length chain document character string candidates;

(i) selecting a variable length chain document character string with a matching factor of variable length chain equal to or higher than a predetermined matching factor of variable length chain with the variable length chain search character string from said variable length chain document character string candidates;

(j) identifying a document character string in which said selected variable length chain document character string exists from the document character string identifying information for said selected variable length chain document character string;

(k) calculating a character string similarity factor between the document character string in which the selected variable length chain document character string exists and the search character string containing the variable length chain search character string; and

(l) when said calculated character string similarity factor is equal to or higher than the predetermined similarity factor of character string, displaying said identified variable length chain document character string.

2. A method for searching a document character string which contains a variable length chain document character string, and which is similar to a search character string, said search character string containing a variable length chain search character string, in a document searchably stored through computer processing, the method comprising the steps of:

(a) associating said variable length chain document character string with document character string identifying information for identifying a document character string in which said variable length chain document character string exists;

(b) associating, from said variable length chain document character string, a partial document character string of a length of M characters (M being a predetermined integer of one or more) with variable length chain document character string identifying information for identifying a variable length chain document character string in which said partial document character string exists;

(c) searching a partial document character string which matches a partial search character string of a length of N characters (N being a predetermined integer of one or more) extracted from said variable length chain search character string;

(d) identifying from variable length chain document character string identifying information for a partial document character string which matches said partial search character string a variable length chain document character string relating to the partial document character string which matches said partial search character string; and

(e) identifying from document character string identifying information for the identified variable length chain document character string a document character string in which the identified variable length chain document character string exists.

3. A method for searching a document character string which contains a variable length chain document character string, and which is similar to a search character string, said search character string containing a variable length chain search character string, in a document searchably stored through computer processing, the method comprising the steps of:

(a) searching a partial document character string matching a partial search character string of a length of N characters (N being a predetermined integer of one or more) extracted from said variable length chain search character string in an index file in which said variable length chain document character string is associated with document character string identifying information for identifying a document character string in which said variable length chain document character string exists, and associated with variable length chain document character string identifying information which identifies a partial document character string of a length M characters (M being a predetermined integer of one or more) and a variable length chain document character string in which the partial document character string exists from the variable length chain document character string;

(b) identifying, from variable length chain document character string identifying information for the partial document character string matching said partial search character string, a variable length chain document character string relating to the partial document character string matching said partial search character string; and

(c) identifying, from the document character string identifying information for said identified variable length chain document character string, a document character string in which the identified variable length chain document character string exists.

4. A method for searching a document character string which contains a variable length chain document character string, and which is similar to a search character string at a predetermined similarity factor or higher, said search character string containing a variable length chain search character string, in a document searchably stored through computer processing, the method comprising the steps of:

(a) associating document character string identifying information for identifying a document character string in which said variable length chain document character string exists and document character string position information for indicating the position in the document character string at which said variable length chain document character string exists with said variable length chain document character string;

(b) extracting and storing a partial document character string of a length of M characters (M being a predetermined integer of one or more) from said variable length chain document character string;

(c) associating variable length chain document character string identifying information for identifying a variable length chain document character string in which said partial document character string exists and variable length chain document character string position information for indicating the position in the variable length chain document character string at which said partial document character string exists with the partial document character string;

(d) searching a partial document character string which matches a partial search character string of a length of N characters (N being a predetermined integer of one or more) extracted from said variable length chain search character string;

(e) searching a partial document character string which matches a partial search character string of a length of N characters in said variable length chain search character string but having a start position shifted by one;

(f) identifying a set of partial document character strings matching the partial search character string in said variable length chain search character string by repeating the step (e);

(g) identifying variable length chain document character strings which are similar to said variable length chain search character string from variable length chain document character string identifying information and variable length chain document character string position information for a partial character string belonging to said set of partial document character strings; and

(h) identifying, from the document character string identifying information for said identified variable length chain document character string, a document character string in which the identified variable length chain document character string exists.

5. A method for creating a file for search by the computer from a document containing a plurality of variable length chains in a storage device accessible by a computer, the method comprising the steps of:

(a) extracting a first variable length chain from said document;

(b) storing information for the position in said document at which said first variable length chain appears;

(c) extracting a first extended fixed length chain from said first variable length chain; and

(d) storing information for identifying said first variable length chain in which said first extended fixed length chain exists and information for the position in said first variable length chain at which said first extended fixed length chain exists.

6. A method for creating a file for search by the computer from a document containing a plurality of variable length chains in a storage device accessible by a computer, the method comprising the steps of:

(a) extracting a plurality of variable length chains from said document and storing them in a character chain file;

(b) storing in a position information file information for the positions in said document at which said plurality of variable length chains appear;

(c) extracting extended fixed length chains of a length of M characters (M being a predetermined integer of one or more) from each of said plurality of variable length chains and storing them in an extended character chain file; and

(d) storing in an extended position information file information for identifying said variable length chains in which said extended fixed length chains exist and information for the position in said variable length chains in which said extended fixed length chains exist.

7. A method for calculating with a computer the similarity factor between a first character string containing a first variable length chain and a second character string containing a second variable length chain, the method comprising the steps of:

(a) searching said first variable length chain character string for a partial character string of a length of N characters (N being a predetermined integer of one or more) which matches in said second variable length chain;

(b) searching a partial character string of a length of N characters which is a partial character string in said first variable length chain with the start position shifted by one for a partial character string which matches in said second variable length chain;

(c) deriving variable length chain matching character identifying information for identifying characters matching said first variable length chain in said second variable length chain by repeating the step (b);

(d) calculating a variable length chain matching factor between said first variable length chain and said second variable length chain based on the variable length chain matching character identifying information and the number of characters in the first variable length chain; and

(e) calculating the similarity factor of the first character string based on the variable length chain matching factor.

8. A method for calculating with a computer the similarity factor between a first character string containing a first variable length chain and a second character string containing a second variable length chain, the method comprising the steps of:

(a) associating character string identifying information for identifying a second character string in which said second variable length chain character string exists and character string position information for indicating the position in the second character string at which said second variable length chain character string exists with said second variable length chain character string;

(b) extracting and storing a first partial character string of a length of M characters (M being a predetermined integer of one or more) from said second variable length chain character string;

(c) associating variable length chain character string identifying information for identifying the second variable length chain character string in which said first partial character string exists and variable length chain character string position information for indicating the position in the second variable length chain character string at which said first partial character string exists with the second partial document character string;

(d) searching a second partial character string which matches the first partial character string of a length of N characters (N being a predetermined integer of one or more) extracted from said first variable length chain character string;

(e) searching a second partial character string matching the first partial search character string of a length of N characters which is a first partial character string in said first variable length chain character string with the start position shifted by one;

(f) identifying a set of the second partial character strings matching the first partial character string in said first variable length chain character string by repeating the step (e);

(g) identifying candidates of variable length chain character strings which have possibility of matching said first variable length chain character string at the predetermined matching factor or more of variable length chain from variable length chain character string identifying information and variable length chain character string position information for a second partial character string belonging to said set of second partial character strings;

(h) calculating a matching factor of variable length chain between said first variable length chain character string and said variable length chain character string candidates;

(i) selecting a second variable length chain character string with a matching factor of variable length chain equal to or higher than a predetermined matching factor of variable length chain with said first variable length chain character string from said variable length chain character string candidates;

(j) storing the matching factor of variable length chain for said selected second variable length chain character string;

(k) identifying from character string identifying information for said selected second variable length chain character string a second document character string in which said selected second variable length chain character string exists; and (l) calculating the character string similarity factor between said second character string in which said selected second variable length chain character string exists and said search character string containing said first variable length chain search character string based on said stored variable length chain matching factor.

9. A method for searching a position where a document character string similar to a search character string with a predetermined similarity factor or higher appears, the search character string containing a variable length chain search character string in a document including variable length chain document character string, searchably stored through computer processing, the method comprising the steps of:

(a) locating the start and end positions of match in said variable length chain document character string for a partial character string of a length of M characters (M being a predetermined integer of one or more) from the top of said variable length chain search character string (hereinafter the partial character string of a length of M characters determined by the start and end positions being referred to a "valid matched character string");

(b) locating the (i+i)-th valid matched character string satisfying the following conditions:

e(D, i)+1.ltoreq.s(D, i+1).ltoreq.e(D, i)+L+1,

and

s(C, i+1)>e(C, i)-(M-1)

where

s(D, i) is the start position of the i-th valid matched character string in the document;

e(D, i) is the end position of the i-th valid matched character string in the document;

s(C, i) is the start position of the i-th valid matched character string in a character string to be searched;

e(C, i) is the end position of the i-th valid matched character string in a character string to be searched; and

L is a predetermined integer of one or more;

(c) continuing the step (b) as long as said valid matched character strings can be located;

(d) calculating a similarity factor between said variable length chain search character string and a character string which exists between the start position of the first valid matched character string in said variable length chain document character string and the end position of the last valid matched character string in said variable length chain document character string based on information on existence of valid matched character strings in a character string at least between the start position of the first valid matched character string in said variable length chain document character string and the end position of the last valid matched character string in said variable length chain document character string;

(e) identifying a variable length document character string in response to the fact that said calculated similarity factor is equal to or higher than a predetermined similarity factor; and

(f) identifying a document character string containing said identified variable length chain document character string.

10. An apparatus for searching a document character string which contains a variable length chain document character string, and which is similar to a search character string at a predetermined similarity factor or higher of character string, said search character string containing a variable length chain search character string, in a document searchably stored through computer processing, the apparatus comprising:

(a) an input device for inputting said search character string;

(b) a storage device for storing:

(b1) a character chain file for storing said variable length chain document character string;

(b2) a position information file for storing document character string identifying information for identifying a document character string in which said variable length chain document character string exists, and document character string position information for indicating the position in the document character string at which the variable length chain document character string exists;

(b3) an extended character chain file for storing a partial document character string of a length of M characters (M being a predetermined integer of one or more) extracted from said variable length chain document character string; and

(b4) variable length chain document character string identifying information for identifying a variable length chain document character string in which said partial document character string exists, and an extended position information file for storing variable length chain document character string position information indicating the position in the variable length chain document character string at which said partial document character string exists;

(c) control means for performing:

(c1) search of a partial document character string matching a partial search character string of a length of N characters (N being a predetermined integer of one or more) extracted from said variable length chain search character string;

(c2) search of a partial document character string matching a partial search character string of a length of N characters in said variable length chain search character string but having a start position shifted by one;

(c3) identification of a set of partial document character strings matching the partial search character string in said variable length chain search character string by repeating the step (c2);

(c4) identification of candidates of variable length chain document character strings which have possibility of matching said variable length chain search character string at the predetermined matching factor or higher of variable length chain from variable length chain document character string identifying information and variable length chain document character string position information for a partial character string belonging to the set of partial document character strings;

(c5) calculation of a variable length chain matching factor between said variable length chain search character string and said variable length chain document character string candidates;

(c6) selection of variable length chain document character strings having a variable length chain matching factor equal to or higher than a predetermined variable length chain matching factor with said variable length chain search character string from said variable length chain document character string candidates,

(c7) identification of a document character string in which said selected variable length chain document character string exists from said document character string identifying information for the selected variable length chain document character strings; and

(c8) calculation of a character string similarity factor between a document character string in which said selected variable length chain document character string exists and the search character string containing said variable length chain search character string; and

(d) a display device for displaying the identified variable length chain document character string when said calculated character string similarity factor is equal to or higher than said predetermined similarity factor of character string.

11. An apparatus for searching a document character string which contains a variable length chain document character string, and which is similar to a search character string, said search character string containing a variable length chain search character string, in a document searchably stored through computer processing, the apparatus comprising:

(a) means for associating said variable length chain document character string with document character string identifying information for identifying a document character string in which said variable length chain document character string exists;

(b) means for associating a partial document character string of a length of M characters (M being a predetermined integer of one or more) from said variable length chain document character string with variable length chain document character string identifying information for identifying a variable length chain document character string in which said partial document character string exists;

(c) means for searching a partial document character string matching a partial search character string of a length of N characters (N being a predetermined integer of one or more) extracted from said variable length chain search character string;

(d) means for identifying a variable length chain document character string relating to the partial document character string matching said partial search character string from the variable length chain document character string identifying information for the partial document character string matching said partial search character string; and

(e) means for identifying a document character string in which said identified variable length chain document character string exists from document character string identifying information for said identified variable length chain document character string.

12. An apparatus for searching a document character string which contains a variable length chain document character string, and which is similar to a search character string, said search character string containing a variable length chain search character string, in a document searchably stored through computer processing, the apparatus comprising:

(a) means for searching a partial document character string matching a partial search character string of a length of N characters (N being a predetermined integer of one or more) extracted from said variable length chain search character string in an index file, in which index file said variable length chain document character string is associated with document character string identifying information for identifying a document character string in which said variable length chain document character string exists, and associated with variable length chain document character string identifying information for identifying a partial document character string of a length of M characters (M being a predetermined integer of one or more) and a variable length chain document character string in which said partial document character string exists from said variable length chain document character string;

(b) means for identifying a variable length chain document character string relating to the partial document character string matching said partial search character string from the variable length chain document character string identifying information for the partial document character string matching said partial search character string; and

(c) means for identifying a document character string in which said identified variable length chain document character string exists from document character string identifying information for said identified variable length chain document character string.

13. An apparatus for searching a document character string which contains a variable length chain document character string, and which is similar to a search character string at a predetermined similarity factor or higher, said search character string containing a variable length chain search character string, in a document searchably stored through computer processing, the apparatus comprising:

(a) means for associating document character string identifying information for identifying a document character string in which said variable length chain document character string exists and document character string position information indicating the position in the document character string at which said variable length chain document character string exists with said variable length chain document character string;

(b) means for extracting and storing a partial document character string of a length of M characters (M being a predetermined integer of one or more) from said variable length chain document character string;

(c) means for associating variable length chain document character string identifying information for identifying a variable length chain document character string in which said partial document character string exists and variable length chain document character string position information for indicating the position in the variable length chain document character string at which said partial document character string exists with said partial document character string;

(d) means for searching a partial document character string matching a partial search character string of a length of N characters (N being a predetermined integer of one or more) extracted from said variable length chain search character string;

(e) means for searching a partial document character string matching a partial search character string of a length of N characters in said variable length chain search character string but having a start position shifted by one;

(f) means for identifying a set of partial document character strings matching the partial search character string in said variable length chain search character string by repeating the step (e);

(g) means for identifying a variable length chain document character string similar to said variable length chain search character string from variable length chain document character string identifying information and variable length chain document character string position information for a partial character string belonging to said set of partial document character strings; and

(h) means for identifying a document character string in which said identified variable length chain document character string exists from the document character string identifying information for said identified variable length chain document character string.

14. An apparatus, including a storage device, for creating a file for search by a computer from a document containing a plurality of variable length chains in a storage device accessible by a computer, the apparatus comprising:

(a) means for extracting a first variable length chain from said document;

(b) means for storing position information for the position in said document at which said first variable length chain appears;

(c) means for extracting a first extended fixed length chain from said first variable length chain; and

(d) means for storing information for identifying said first variable length chain in which said first extended fixed length chain exists and position information for the position in the first variable length chain at which said first extended fixed length chain exists.

15. An apparatus for creating a file for search by a computer from a document containing a plurality of variable length chains in a storage device accessible by a computer, the apparatus comprising:

(a) means for extracting said plurality of variable length chains from said document and storing them in a character chain file;

(b) means for storing position information for the positions in said document at which said plurality of variable length chains appear in a position information file;

(c) means for extracting an extended fixed length chain of a length of M characters (M being a predetermined integer of one or more) from each of said plurality of variable length chains, and storing it in an extended character chain file;

(d) means for storing information for identifying said variable length chain in which said extended fixed length chain exists and position information for the position in the variable length chain at which said extended fixed length chain exists in an extended position information file.

16. An apparatus for calculating with a computer a similarity factor between a first character string containing a first variable length chain and a second character string containing a second variable length chain, the apparatus comprising:

(a) means for searching a partial character string from the first variable length chain character string, the partial character string matching said second variable length chain for a partial character string of a length of N characters (N being a predetermined integer of one or more);

(b) means for searching a partial character string matching said second variable length chain for a partial character string of a length of N characters the start position of which is that of a partial character string in said first variable length chain shifted by one;

(c) means for deriving variable length chain matching character identifying information for identifying characters matching said first variable length chain in said second variable length chain by repeating said means for searching a partial character string matching said second variable length chain;

(d) means for calculating a variable length chain matching factor between said first and said second variable length chains based on said variable length chain matched character identifying information and the number of characters in said first variable length chain; and

(e) means for calculating a similarity factor of said first character string based on said variable length chain matching factor.

17. An apparatus for calculating with a computer a similarity factor between a first character string containing a first variable length chain and a second character string containing a second variable length chain, the apparatus comprising

(a) a storage device for storing:

(a1) a character chain file managing the content of said second variable length chain character string;

(a2) a position information file managing character string identifying information identifying the second character string in which said second variable length chain character string exists, and character string position information indicating the position in the second character string at which said second variable length chain character string exists;

(a3) an extended character chain file managing a first partial character string of a length of M characters (M being a predetermined integer of one or more) extracted from said second variable length chain character string; and

(a4) variable length chain character string identifying information identifying a second variable length chain character string in which the first partial character string exists, and an extended position information file managing variable length chain character string position information indicating the position in the second variable length chain character string at which the first partial character string exists;

(b) means for searching a second partial character string matching a first partial character string of a length of N characters (N being a predetermined integer of one or more) extracted from said first variable length chain character string;

(c) means for searching a second partial character string matching a first partial search character string of a length of N characters which is a first partial character string in said first variable length chain character string but having a start position shifted by one;

(d) means for identifying a set of second partial character strings matching the first partial character string in said first variable length chain character string by repeating said means (c);

(e) means for identifying candidates of variable length chain character strings which have possibility of matching said first variable length chain character string at the predetermined matching factor or higher variable length chain from variable length chain character string identifying information and variable length chain character string position information for a second partial character string belonging to said set of second partial character strings;

(f) means for calculating a variable length chain matching factor between said first variable length chain character string and said variable length chain character string candidates;

(g) means for a selecting a second variable length chain character string with a matching factor of variable length chain equal to or higher than a predetermined matching factor of variable length chain with the first variable length chain character string from the variable length chain character string candidates;

(h) means for identifying the second document character string in which said selected second variable length chain character string exists from the character string identifying information for said selected second variable length chain character string; and

(i) means for calculating a character string similarity factor between the second character string in which said selected second variable length chain character string exists and the search character string containing said first variable length chain search character string based on said variable length chain matching factor.

18. An apparatus for locating a position where there exists a document character string in a document containing a variable length chain character string and searchably stored through computer processing, the document character string being similar to a search character string containing a variable length chain character string with a predetermined similarly factor or higher, the apparatus comprising:

(a) means for locating the start and end positions of match in said variable length chain document character string for a partial character string of a length of M characters (M being a predetermined integer of one or more) from the top of said variable length chain search character string (hereinafter the partial character string of a length of M characters determined by the start and end positions being referred to a "valid matched character string");

(b) means for locating the (i+i)-th valid matched character string satisfying the following conditions:

e(D, i)+1.ltoreq.s(D, i+1).ltoreq.e(D, i)+L+1,

and

s(C, i+1)>e(C, i)-(M-1)

where

s(D, i) is the start position of the i-th valid matched character string in the document;

e(D, i) is the end position of the i-th valid matched character string in the document;

s(C, i) is the start position of the i-th valid matched character string in a character string to be searched;

e(C, i) is the end position of the i-th valid matched character string in a character string to be searched; and

L is a predetermined integer of one or more;

(c) means for continuing the step (b) as long as said valid matched character strings can be located;

(d) means for calculating a similarity factor between said variable length chain search character string and a character string which exists between the start position of the first valid matched character string in said variable length chain document character string and the end position of the last valid matched character string in said variable length chain document character string based on information on existence of valid matched character strings in a character string at least between the start position of the first valid matched character string in said variable length chain document character string and the end position of the last valid matched character string in said variable length chain document character string;

(e) means for identifying said variable length document character string in response to the fact that said calculated similarity factor is equal to or higher than a predetermined similarity factor; and

(f) means for identifying a document character string containing said identified variable length chain document character string.

19. A medium readable by a computer for storing a program which searches a document character string which contains a variable length chain document character string, and which is similar to a search character string, said search character string containing a variable length chain search character string, in a document searchably stored through computer processing, the program comprising:

(a) program code means for directing the computer to associate said variable length chain document character string with document character string identifying information for identifying a document character string in which said variable length chain document character string exists;

(b) program code means for directing the computer to associate a partial document character string of a length of M characters (M being a predetermined integer of one or more) from said variable length chain document character string with variable length chain document character string identifying information for identifying a variable length chain document character string in which said partial document character string exists;

(c) program code means for directing the computer to search a partial document character string matching a partial search character string of a length of N characters (N being a predetermined integer of one or more) extracted from said variable length chain search character string;

(d) program code means for directing the computer to identify a variable length chain document character string relating to the partial document character string matching said partial search character string from the variable length chain document character string identifying information for the partial document character string matching said partial search character string; and

(e) program code means for directing the computer to identify a document character string in which said identified variable length chain document character string exists from document character string identifying information for said identified variable length chain document character string.

20. A medium readable by a computer for storing a program which searches a document character string which contains a variable length chain document character string, and which is similar to a search character string, said search character string containing a variable length chain search character string, in a document searchably stored through computer processing, the program comprising:

(a) program code means for directing the computer to search a partial document character string matching a partial search character string of a length of N characters (N being a predetermined integer of one or more) extracted from said variable length chain search character string in an index file, in which index file said variable length chain document character string is associated with document character string identifying information for identifying a document character string in which said variable length chain document character string exists, and a partial document character string of a length of M characters (M being a predetermined integer of one or more) from said variable length chain document character string is associated with variable length chain document character string identifying information for identifying a variable length chain document character string in which said partial document character string exists;

(b) program code means for directing the computer to identify a variable length chain document character string relating to the partial document character string matching said partial search character string from the variable length chain document character string identifying information for the partial document character string matching said partial search character string; and

(c) program code means for directing the computer to identify a document character string in which said identified variable length chain document character string exists from document character string identifying information for said identified variable length chain document character string.

21. A medium readable by a computer, including a storage device, for storing a program which creates a file for computer search from a document containing a plurality of variable length chains in a storage device accessible by a computer, the program comprising:

(a) program code means for directing the computer to extract a first variable length chain from the document;

(b) program code means for directing the computer to store position information for the position in said document at which said first variable length chain appears;

(c) program code means for directing the computer to extract a first extended fixed length chain from said first variable length chain; and

(d) program code means for directing the computer to store information for identifying said first variable length chain in which said first extended fixed length chain exists and position information for the position in said first variable length chain at which said first extended fixed length chain exists.

22. A medium readable by a computer for storing a file for computer search, the medium comprising:

(a) a character chain file for storing a plurality of variable length chains extracted from a document containing said plurality of variable length chains;

(b) a position information file for managing position information for the position in the document at which said plurality of variable length chains appear;

(c) an extended character chain file for storing an extended fixed length chain of a length of M characters (M being a predetermined integer of one or more) extracted from each of the plurality of variable length chains; and

(d) an extended position information file for storing information for identifying said variable length chain in which said extended fixed length chain exists, and position information for the position in the variable length chain at which said extended fixed length chain exists.


Description

BACKGROUND OF THE INVENTION

The present invention relates to a system and method for searching a large volume of data stored in a disk in, for example, document file format at a high speed while allowing desired ambiguity.

Conventionally, there has been demand to search a large volume of data, such as newspaper accounts, patent publications, or scientific/engineering literature, written in natural language and stored in a disk at a high speed. There have been proposed various search methods. Such methods are roughly divided as follows.

Keyword search scheme: This scheme previously indexes individual document and keywords indicating contents of the document. At the moment, methods for determining the keywords include automatic keyword extraction by the morphemic analysis, manual appending of keywords, and a combination of them. However, this scheme can only search character strings indexed as keywords. In addition, because the accuracy of automatic keyword extraction by the morphemic analysis depends on the accuracy of a dictionary, there is a disadvantage that it requires much manpower for maintaining the dictionary.

Full text search scheme without index: This is a scheme not to use an index, but to scan all documents to be searched every time a character string which is wanted to be searched is designated. Some systems use special hardware to increase the search speed. However, a system employing special hardware increases cost, and may impose restrictions on the type of machine which can be used in a client/server environment.

Full text search scheme with index: The present invention relates to a scheme for searching full document with index. Known techniques intending to increase the speed of full text search with use of index include the following.

Japanese Published Unexamined Patent Application (PUPA) No. 4-205560 discloses creating a search file by dividing a character string to be searched into search units, each of which is a unit for performing the search; appending ascending codes for every search unit; appending attribute codes to the divided search units, the attribute code indicating a logical partition for the search unit; and appending character position sequence codes to every characters in the character string to be searched, the sequence codes indicating the position of character in the search unit, whereby character position information consisting of a search unit identification code, the character position sequence code, and the attribute code is created and stored in areas by character type.

Japanese PUPA No. 4-215181 discloses creating a search file for reducing the number of collation for character strings for search processing and for enabling a general purpose information processing system to perform high speed collation, in which search file character set position information is grouped by character set type, the character set position information indicating at which position in the character string each character set constituting the character string to be searched is positioned.

There frequently arises a necessity to search not only document containing character strings fully matching a search character string, but also document containing character strings partially matching it. For example, there is a case where the user is uncertain of the search character string, or where the search character string may appear in variations so that it is difficult to enumerate all such variations. A typical method for designating a partial character string in the prior art is to use regular expression. According to such method, it is possible to designate repetition of any character zero or more times, repetition of any character one or more times, the end of a line, the top of a line, or any character within a range of specific character codes.

In addition, Japanese PUPA No. 63-99830 discloses a system having a function for partial matching between search character string data and character string data to be searched, the system comprising a table which accumulates data on synonym relationship in the search character string data, and data indicating whether the search character string data appears in any character string data to be searched.

Furthermore, Japanese PUPA No. 62-221027 discloses, when a character string taken out from the top of a character string for partial search is not retrieved from a dictionary, reducing the number of invalid searches by performing forward search for a next taken-out character string the length of which is incremented by one, whereby a word can be taken out at a relatively high speed and efficiently.

Furthermore, Japanese PUPA Nos. 4-326164 and 5-174067 disclose a database search system comprising search means which stores self-correlation information for every articles to be searched, finds the degree of matching between the self-correlation information of search keys and the self-correlation information of the articles to be searched, and outputs the number of articles in the descending order of the degree of matching.

However, these character string search techniques of the prior art have difficulty in designating the degree of ambiguity of character strings to be searched or the like so that the result of a search may be not desirable for the user, or contain many unnatural character strings. In addition, in these character string search techniques of the prior art, when a character string to be searched is "This is a pencil", it is impossible to determine character strings such as not only "This is a pemcil" (one wrong character) and "This is a red pencil" (one interposed word) but also "This is a red pemcil" (one wrong character and one interposed word) as a character string similar to the search character string.

SUMMARY OF THE INVENTION

A purpose of the present invention is to provide a character string search scheme which enables a user to optionally designate the degree of ambiguity for a character string to be searched.

Another purpose of the present invention is to provide an index structure for implementing a character string search scheme which enables it to optionally designate the degree of ambiguity for a character string to be searched.

Still another purpose of the present invention is to provide a character string search scheme which enables the scheme to perform searches in a manner close to human feeling by ambiguity search.

BRIEF DESCRIPTION OF THE DRAWINGS

Some of the purposes of the invention having been stated, others will appear as the description proceeds, when taken in connection with the accompanying drawings, in which:

FIG. 1 is a block diagram showing a hardware configuration;

FIG. 2 is a block diagram of process elements;

FIG. 3 is a structure of index file;

FIG. 4 is a structure of index file;

FIG. 5 is a flowchart showing an index file creation process;

FIG. 6 is a flowchart showing an index file creation process;

FIG. 7 is a flowchart showing a character string search process using an index file;

FIG. 8 is a flowchart showing an ambiguity search process using an index file; and

FIG. 9 is a flowchart showing an ambiguity search process using an index file.

DESCRIPTION OF THE PREFERRED EMBODIMENT(S)

While the present invention will be described more fully hereinafter with reference to the accompanying drawings, in which a preferred embodiment of the present invention is shown, it is to be understood at the outset of the description which follows that persons of skill in the appropriate arts may modify the invention here described while still achieving the favorable results of the invention. Accordingly, the description which follows is to be understood as being a broad, teaching disclosure directed to persons of skill in the appropriate arts, and not as limiting upon the present invention.

A preferred embodiment of the present invention enables it to search any consecutive N characters in a document, or any consecutive N' characters in a word (word in English or the like) at a high speed so that it is possible to search by using the index file at a high speed how many consecutive characters match between a character string to be searched and the document, and how many characters are interposed in them.

A preferred embodiment of the present invention can attain high speed search for a document written in both a language which has many kinds of characters, but does not have explicit word delimiter in representation such as Japanese and Chinese (non-delimiter language), and a language which has relatively small kinds of characters, and is expressed with explicit word delimiters such as English (delimiter language).

According to the present invention, to perform full text search for a database consisting of a plurality of documents, unique numbers (symbols) are appended to individual document, an index file being arranged to store a sequence of characters belonging to suitably determined character set C (hereinafter called a "variable length chain") in each document, individual consecutive N characters not belonging to the character set C (hereinafter called a "fixed length chain"), and the number of document in which both of them appear and position information in the document. In addition, the variable length chain is appended with a unique number. The index file is arranged to store individual N' characters of the variable length chain (hereinafter called an "extended fixed length chain"), and the number of variable length chain in which these N' characters appear and the position information in the variable length chain. Preferably, the index file comprises four files, a character chain file, a position information file, an extended character chain file, and an extended position information file. The character chain file stores the variable length chain, the fixed length chain, a delimiter pattern, and where in the position information file the document number corresponding to them and the position number in document are positioned. The position information file stores the document number and the position number in document. The extended character chain file stores the extended character chain, and where in the extended portion information file the variable length chain number corresponding to it and the position number in the variable length chain are positioned. The extended position information file stores the variable length chain number and the position number in the variable length chain.

The present invention proposes a scheme for searching at a high speed a document containing a character string which has an arrangement of characters similar to a designated character string by using such an index file. This scheme makes it possible to designate a character string to be searched and search accuracy (larger than zero, but one or less) so that a document containing a "similar character string" with a "similarity factor" to the character string to be searched equal to or larger than the designated search accuracy and the position in the document of the "similar character string" can be identified. This processing proceeds by selecting a character string to be searched and a "similar character string" from a document, and digitizing a "similarity factor" from two viewpoints of how many characters are consecutively matched, and how many extra characters are interposed in it.

In this case, if the "similarity factor" is at its maximum value of 1, it means that the character strings are fully matched. Or, if the character strings are fully matched, the "similarity factor" is always at 1. If one or more extra characters which do not exist in a character string to be searched are interposed in a similar character string, or only a part of the character string to be searched appears in the similar character string, the "similarity factor" is a value smaller than 1. However, according to the present invention, such "similarity factor" becomes a reasonable value considerably matching to normal human feeling.

According to one aspect of the present invention, there is provided a method for searching a document character string which contains a variable length chain document character string, and which is similar to a search character string at a predetermined similarity factor, said search character string containing a variable length chain search character string, in a document searchably stored through computer processing, the method comprising the steps of associating document character string identifying information for identifying a document character string in which said variable length chain document character string exists and document character string position information for indicating the position in the document character string at which the variable length chain document character string exists with the variable length chain document character string; extracting and storing a partial document character string of a length of M characters (M being a predetermined integer of one or more) from the variable length chain document character string; associating variable length chain document character string identifying information for identifying a variable length chain document character string in which the partial document character string exists and variable length chain document character string position information for indicating the position in the variable length chain document character string at which the partial document character string exists with the partial document character string; searching a partial document character string which matches a partial search character string of a length of N characters (N being a predetermined integer of one or more) extracted from the variable length chain search character string; searching a partial document character string which matches a partial search character string of a length of N characters in the variable length chain search character string but having a start position shifted by one; Identifying a set of partial document character strings matching the partial search character string in the variable length chain search character string by repeating the step of searching the partial document character strings; identifying candidates of variable length chain document character strings which have possibility of matching the variable length chain search character string at the predetermined matching factor or higher of variable length chain from variable length chain document character string identifying information and variable length chain document character string position information for a partial character string belonging to the set of partial document character strings; calculating a matching factor of variable length chain between the variable length chain search character string and the variable length chain document character string candidates; selecting a variable length chain document character string with a matching factor of variable length chain equal to or higher than a predetermined matching factor of variable length chain with the variable length chain search character string from the variable length chain document character string candidates; identifying a document character string in which the selected variable length chain document character string exists from the document character string identifying information for the selected variable length chain document character string; calculating a character string similarity factor between the document character string in which the selected variable length chain document character string exists and the search character string containing the variable length chain search character string; and, when the calculated character string similarity factor is equal to or higher than the predetermined similarity factor of character string, displaying the identified variable length chain document character string.

The terms "in a document" in discussion of the present invention includes not only "in a single document" but also "in a plurality of documents" including a database or the like. In addition, the phrases "variable length chain document character string" and the "variable length search character string" mean a variable length chain in a document and search characters, respectively, and preferably correspond to one word of the a delimiter language as defined above. However, it is not necessarily a word of the delimiter language, but may be arranged by the non-delimiter language. Such arrangement may be applied to search of a set of variable length character strings such as keywords in the abstract of a patent specification. In addition, the terms "equal to or higher than the certain similarity factor" means a similarity factor exceeding zero, but also includes the similarity factor of one, that is, the similarity factor for complete matching.

The association of the document character string identifying information and the document character string position information with the variable length chain document character string, and the association of the variable length chain document character string identifying information and the variable length chain document character string position information with the partial document character string need not to be maintained as separate files as described in the preferred embodiment of the present invention. It is sufficient that there is a link pointing to the document character string identifying information and the document character string position information to which the variable length chain document character string corresponds. Similarly, it is sufficient that there is a link allowing the scheme to point to the variable length chain document character string identifying information and the variable length chain document character string position information to which the partial document character string corresponds.

In discussion of the present invention, the phrase "search character string containing the variable length chain search character string" means not only a search character string containing another variable length chain search character string and a search character string containing a fixed length chain search character string, but also a search character string which is the variable length chain search character string itself. The search character string is preferably entered by a user through a keyboard, but may be entered from a client or another system through a network. When the present invention is applied to spelling check for an English word, the search string is automatically entered from the system by extracting words not existing in a dictionary. The "partial document character string" and the "partial search character string" mean the extended fixed length chain in the embodiment. Their length of M or N is set by the design so that it may or may not be freely set by the user. In addition, it can be arranged that the length is changed by the system according to an entered similarity factor, a sample with an average length in the variable length chain document character string in the document, or the length of a variable length chain search character string. In addition, in extracting the "partial document character string" and the "partial search character string, " it may be possible to add a symbol such as "$" indicating the beginning of a variable length chain, and a code such as ".Yen." indicating the end of the variable length chain. The "certain similarity factor" means a similarity factor entered by the user or one determined as a default, and may be arranged to be entered by the client or another system.

According to another aspect of the present invention, there is provided a method for searching a document character string which contains a variable length chain document character string, and which is similar to a search character string at a predetermined similarity factor or higher, said search character string containing a variable length chain search character string, in a document searchably stored through computer processing, the method comprising the steps of associating the variable length chain document character string with document character string identifying information for identifying a document character string in which the variable length chain document character string exists; associating, from the variable length chain document character string, a partial document character string of a length of M characters (M being a predetermined integer of one or more) with variable length chain document character string identifying information for identifying a variable length chain document character string in which the partial document character string exists; searching a partial document character string which matches a partial search character string of a length of N characters (N being a predetermined integer of one or more) extracted from the variable length chain search character string; identifying from variable length chain document character string identifying information for a partial document character string which matches the partial search character string a variable length chain document character string relating to the partial document character string which matches the partial search character string; and identifying from document character string identifying information for the identified variable length chain document character string a document character string in which the identified variable length chain document character string exists.

In discussing the present invention, the phrase "document character string identifying information" means the document number in the embodiment, but is sufficient to be information such as address information or a document name identifying a document character string in which the variable length chain document character string exists, and not limited to the document number. In addition, the variable length chain document character string identifying information correspondes to the variable length chain number in the embodiment, but is sufficient to be information such as address information or a word name identifying a variable length chain document character string in which the partial document character string exists.

According to another aspect of the present invention, there is provided a method for searching a document character string which contains a variable length chain document character string, and which is similar to a search character string at a predetermined similarity factor or higher, said search character string containing a variable length chain search character string, in a document searchably stored through computer processing, the method comprising the steps of searching a partial document character string matching a partial search character string of a length of N characters (N being a predetermined integer of one or more) extracted from the variable length chain search character string in an index file in which the variable length chain document character string is associated with document character string identifying information for identifying a document character string in which the variable length chain document character string exists, and associated with variable length chain document character string identifying information which identifies a partial document character string of a length M characters (M being a predetermined integer of one or more) and a variable length chain document character string in which the partial document character string exists from the variable length chain document character string; identifying, from variable length chain document character string identifying information for the partial document character string matching the partial search character string, a variable length chain document character string relating to the partial document character string matching the partial search character string; and identifying, from the document character string identifying information for the identified variable length chain document character string, a document character string in which the identified variable length chain document character string exists. The "index file" has been created prior to implementation of the method of the present invention, and is stored in various media in the computer system such as RAM or the hard disk, or in various media external to the computer system such as a floppy disk or a CD-ROM.

According to another aspect of the present invention, there is provided a method for searching a document character string which contains a variable length chain document character string, and which is similar to a search character string at a predetermined similarity factor or higher, said search character string containing a variable length chain search character string, in a document searchably stored through computer processing, the method comprising the steps of associating document character string identifying information for identifying a document character string in which the variable length chain document character string exists and document character string position information for indicating the position in the document character string at which the variable length chain document character string exists with the variable length chain document character string; extracting and storing a partial document character string of a length of M characters (M being a predetermined integer of one or more) from the variable length chain document character string; associating variable length chain document character string identifying information for identifying a variable length chain document character string in which the partial document character string exists and variable length chain document character string position information for indicating the position in the variable length chain document character string at which the partial document character string exists with the partial document character string; searching a partial document character string which matches a partial search character string of a length of N characters (N being a predetermined integer of one or more) extracted from the variable length chain search character string; searching a partial document character string which matches a partial search character string of a length of N characters in the variable length chain search character string but having a start position shifted by one; identifying a set of partial document character strings matching the partial search character string in the variable length chain search character string by repeating the step of searching the matched partial document character strings; identifying, from the document character string identifying information for said identified variable length chain character string, variable length chain document character strings which are similar to the variable length chain search character string from variable length chain document character string identifying information and variable length chain document character string position information for a partial character string belonging to the set of partial document character strings; and identifying a document character string in which the identified variable length chain document character string exists.

According to another aspect of the present invention, there is provided a method for creating a file for search by the computer from a document containing a plurality of variable length chains in a storage device accessible by a computer, the method comprising the steps of extracting a first variable length chain from the document; storing information for the position in the document at which the first variable length chain appears; extracting a first extended fixed length chain from the first variable length chain; and storing information for identifying the first variable length chain in which the first extended fixed length chain exists and information for the position in the first variable length chain at which the first extended fixed length chain exists.

In discussing of the present invention, the "information for the position in the document at which the first variable length chain appears", "information for identifying the first variable length chain in which the first extended fixed length chain exists", and "information for the position in the first variable length chain at which the first extended fixed length chain exists" may be stored in separate storage devices as long as they can be accessed by the computer.

According to another aspect of the present invention, there is provided a method for creating a file for search by the computer from a document containing a plurality of variable length chains in a storage device accessible by a computer, the method comprising the steps of extracting a plurality of variable length chains from the document and storing them in a character chain file; storing in a position information file information for the positions in the document at which the plurality of variable length chains appear; extracting extended fixed length chains of a length of M characters (M being a predetermined integer of one or more) from each of the plurality of variable length chains and storing them in an extended character chain file; and storing in an extended position information file information for identifying the variable length chains in which the extended fixed length chains exist and information for the position in the variable length chains in which the extended fixed length chains exist.

According to another aspect of the present invention, there is provided a method for calculating with a computer the similarity factor between a first character string containing a first variable length chain and a second character string containing a second variable length chain, the method comprising the steps of searching the first variable length chain character string for a partial character string of a length of N characters (N being a predetermined integer of one or more) which matches in the second variable length chain; searching a partial character string of a length of N characters which is a partial character string in the first variable length chain with the start position shifted by one for a partial character string which matches in the second variable length chain; deriving variable length chain matching character identifying information for identifying characters matching the first variable length chain in the second variable length chain by repeating the step of searching the matched partial character string in the second variable length chain; calculating a variable length chain matching factor between the first variable length chain and the second variable length chain based on the variable length chain matching character identifying information and the number of characters in the first variable length chain; and calculating the similarity factor of the first character string based on the variable length chain matching factor. The variable length chain matched character identifying information is information for identifying matched and non-matched characters in the first or second variable length chain, and, particularly, the number of characters in valid matched character strings in the first or second variable length chain, but the matching factor may be calculated with information for identifying non-matched characters instead of the matched characters. In addition, it may be possible to weight or impose a penalty on the matching factor calculation depending on the nature of matched and non-matched characters instead of simply using the number of matched or non-matched characters.

According to another aspect of the present invention, there is provided a method for calculating with a computer the similarity factor between a first character string containing a first variable length chain and a second character string containing a second variable length chain, the method comprising the steps of associating character string identifying information for identifying the second character string in which the second variable length chain character string exists and character string position information for indicating the position in the second character string at which the second variable length chain character string exists with the second variable length chain character string; extracting and storing a second partial character string of a length of M characters (M being a predetermined integer of one or more) from the second variable length chain character string; associating variable length chain character string identifying information for identifying the second variable length chain character string in which the second partial character string exists and variable length chain character string position information for indicating the position in the second variable length chain character string at which the second partial character string exists with the second partial document character string; searching a second partial document character string which matches the first partial character string of a length of N characters (N being a predetermined integer of one or more) extracted from the first variable length chain character string; searching a second partial character string matching the first partial search character string of a length of N characters which is a first partial character string in the first variable length chain character string with the start position shifted by one; identifying a set of the second partial character strings matching the first partial character string in the first variable length chain character string by repeating the step of searching the second partial character string matching the first partial search character string; identifying candidates of variable length chain character strings which have possibility of matching the first variable length chain character string at the predetermined matching factor or higher of variable length chain from variable length chain character string identifying information and variable length chain character string position information for a second partial character string belonging to the set of second partial character strings; calculating a matching factor of variable length chain between the first variable length chain character string and the variable length chain character string candidates; selecting a second variable length chain character string with a matching factor of variable length chain equal to or higher than a predetermined matching factor of variable length chain with the first variable length chain character string from the variable length chain character string candidates; storing the matching factor of variable length chain for the selected second variable length chain character string; identifying from character string identifying information for the selected second variable length chain character string a second document character string in which the selected second variable length chain character string exists; and calculating the character string similarity factor between the second character string in which the selected second variable length chain character string exists and the search character string containing the first variable length chain search character string based on the stored variable length chain matching factor.

According to another aspect of the present invention, there is provided a method for searching a position where a character string similar to a search character string with a predetermined similarity factor or higher appears, the search character string containing a variable length chain search character string, the method comprising the steps of locating the start and end positions of match in the variable length chain document character string for a partial character string of a length of M characters (M being a predetermined integer of one or more) from the top of the variable length chain search character string (hereinafter the partial character string of a length of M character determined by the start and end positions being referred to a "valid matched character string"); locating the (i+i)-th valid matched character string satisfying the following conditions:

e(D,i)+1<s(D,i+1)<e(D,i)+L+1,

and

s(C,i+1)>e(C,i)-(M-1)

where

s(D,i) is the start position of the i-th valid matched character string in the document;

e(D,i) is the end position of the i-th valid matched character string in the document;

s(C,i) is the start position of the i-th valid matched character string in a character string to be searched;

e(C,i) is the end position of the i-th valid matched character string in a character string to be searched; and

L is a predetermined integer of one or more; continuing the step of locating the (i+i)-th valid matched character string as long as valid matched character strings can be located; calculating a similarity factor between the variable length chain search character string and a character string which exists between the start position of the first valid matched character string in the variable length chain document character string and the end position of the last valid matched character string in the variable length chain document character string based on information on existence of valid matched character strings in a character string at least between the start position of the first valid matched character string in the variable length chain document character string and the end position of the last valid matched character string in the variable length chain document character string; identifying a variable length document character string in response to the fact that the calculated similarity factor is equal to or higher than a predetermined similarity factor; and identifying a document character string containing the identified variable length chain document character string.

According to another aspect of the present invention, there is provided an apparatus for searching a document character string which contains a variable length chain document character string, and which is similar to a search character string at a predetermined similarity factor or higher of character string, said search character string containing a variable length chain search character string, in a document searchably stored through computer processing, the apparatus comprising an input device for inputting the search character string; a storage device for storing a character chain file for storing the variable length chain document character string, a position information file for storing document character sting identifying information for identifying a document character string in which the variable length chain document character string exists and document character string position information for indicating the position in the document character string at which the variable length chain document character string exists, an extended character chain file for storing a partial document character string of a length of M character (M being a predetermined integer of one or more) extracted from the variable length chain document character string, and variable length chain document character string identifying information for identifying a variable length chain document character string in which the partial document character string exists, and an extended position information file for storing variable length chain document character string position information indicating the position in the variable length chain document character string at which the partial document character string exists; control means for performing search of a partial document character string matching a partial search character string of a length of N characters (N being a predetermined integer of one or more) extracted from the variable length chain search character string, search of a partial document character string matching a partial search character string of a length of N characters in the variable length chain search character string but having a start position shifted by one, identification of a set of partial document character strings matching the partial search character string in the variable length chain search character string by repeating the search of partial document character string matching the partial search character string, identification of candidates of variable length chain document character strings which have possibility of matching the variable length chain search character string at the predetermined matching factor or higher of variable length chain from variable length chain document character string identifying information and variable length chain document character string position information for a partial character string belonging to the set of partial document character strings, calculation of a variable length chain matching factor between the variable length chain search character string and the variable length chain document character string candidates, selection of variable length chain document character strings having a variable length chain matching factor equal to or higher than a predetermined variable length chain matching factor with the variable length chain search character string from the variable length chain document character string candidates, identification of a document character string in which the selected variable length chain document character string exists from the document character string identifying information for the selected variable length chain document character strings, and calculation of a character string similarity factor between a document character string in which the selected variable length chain document character string exists and the search character string containing the variable length chain search character string; and a display device for displaying the identified variable length chain document character string when the calculated character string similarity factor is equal to or higher than the predetermined similarity factor.

In discussing the present invention, the "input device" is preferably a keyboard which allows a user to input characters, but the concept includes a pointing device such as a mouse and a track ball, and a touch panel for pointing a character displayed on the display device. In addition, where a search character string is input from another system, that other system is constituted as the input device. The "storage device" can be preferably implemented by using RAM, but may include various storage devices such as a hard disk drive or cache memory. The "control means" is preferably constituted by a central processing unit (CPU) having a bus, an arithmetic function and an input/output control function, a main memory for loading a program and providing a work area for the CPU, an operating system for controlling the central processing unit, and a combination of programs such as a search engine. The "display device" is preferably constituted by a display device, but includes a printing device such as a printer and a plotter for displaying a character or the like through printing.

According to another aspect of the present invention, there is provided an apparatus for searching a document character string which contains a variable length chain document character string, and which is similar to a search character string at a predetermined factor or higher, said search character string containing a variable length chain search character string, in a document searchably stored through computer processing, the apparatus comprising means for associating the variable length chain document character string with document character string identifying information for identifying a document character string in which the variable length chain document character string exists; means for associating a partial document character string of a length of M characters (M being a predetermined integer of one or more) from the variable length chain document character string with variable length chain document character string identifying information for identifying a variable length chain document character string in which the partial document character string exists; means for searching a partial document character string matching a partial search character string of a length of N characters (N being a predetermined integer of one or more) extracted from the variable length chain search character string; means for identifying a variable length chain document character string relating to the partial document character string matching the partial search character string from the variable length chain document character string identifying information for the partial document character string matching the partial search character string; and means for identifying a document character string in which the identified variable length chain document character string exists from document character string identifying information for the identified variable length chain document character string.

According to another aspect of the present invention, there is provided an apparatus for searching a document character string which contains a variable length chain document character string, and which is similar to a search character string at a predetermined similarity factor or higher, said search character string containing a variable length chain search character string, in a document searchably stored through computer processing, the apparatus comprising means for searching a partial document character string matching a partial search character string of a length of N characters (N being a predetermined integer of one or more) extracted from the variable length chain search character string in an index file, in which index file said variable length chain document character string is associated with document character string identifying information for identifying a document character string in which the variable length chain document character string exists, and associated with variable length chain document character string identifying information for identifying a partial document character string of a length of M characters (M being a predetermined integer of one or more) and a variable length chain document character string in which the partial document character string exists from said variable length chain document character string; means for identifying a variable length chain document character string relating to the partial document character string matching the partial search character string from the variable length chain document character string identifying information for the partial document character string matching the partial search character string; and means for identifying a document character string in which the identified variable length chain document character string exists from document character string identifying information for the identified variable length chain document character string.

According to another aspect of the present invention, there is provided an apparatus for searching a document character string which contains a variable length chain document character string, and which is similar to a search character string at a predetermined similarity factor or higher, said search character string containing a variable length chain search character string, in a document searchably stored through computer processing, the apparatus comprising means for associating document character string identifying information for identifying a document character string in which the variable length chain document character string exists and document character string position information indicating the position in the document character string at which the variable length chain document character string exists with the variable length chain document character string; means for extracting and storing a partial document character string of a length of M characters (M being a predetermined integer of one or more) from the variable length chain document character string; means for associating variable length chain document character string identifying information for identifying a variable length chain document character string in which the partial document character string exists and variable length chain document character string position information for indicating the position in the variable length chain document character string at which the partial document character string exists with the partial document character string; means for searching a partial document character string matching a partial search character string of a length of N characters (N being a predetermined integer of one or more) extracted from the variable length chain search character string; means for searching a partial document character string matching a partial search character string of a length of N characters in the variable length chain search character string but having a start position shifted by one; means for identifying a set of partial document character strings matching the partial search character string in the variable length chain search character string by repeating the search of partial document character string; means for identifying a variable length chain document character string similar to the variable length chain search character string from variable length chain document character string identifying information and variable length chain document character string position information for a partial character string belonging to the set of partial document character strings; and means for identifying a document character string in which the identified variable length chain document character string exists from the document character string identifying information for the identified variable length chain document character string.

According to another aspect of the present invention, there is provided an apparatus for creating a file for search by a computer from a document containing a plurality of variable length chains in a storage device accessible by a computer, the apparatus comprising means for extracting a first variable length chain from the document; means for storing position information for the position in the document at which the first variable length chain appears; means for extracting a first extended fixed length chain from the first variable length chain; and means for storing information for identifying the first variable length chain in which the first extended fixed length chain exists and position information for the position in the first variable length chain at which the first extended fixed length chain exists.

According to another aspect of the present invention, there is provided an apparatus for creating a file for search by a computer from a document containing a plurality of variable length chains in a storage device accessible by a computer, the apparatus comprising means for extracting a plurality of variable length chains from the document and storing them in a character chain file; means for storing position information for the positions in the document at which the plurality of variable length chains appear in a position information file; means for extracting an extended fixed length chain of a length of M characters (M being a predetermined integer of one or more) from each of the plurality of variable length chains, and storing it in an extended character chain file; means for storing information for identifying a variable length chain in which the extended fixed length chain exists and position information for the position in the variable length chain at which said extended fixed length chain exists in an extended position information file.

According to another aspect of the present invention, there is provided an apparatus for calculating with a computer a similarity factor between a first character string containing a first variable length chain and a second character string containing a second variable length chain, the apparatus comprising means for searching a partial character string from the first variable length chain character string, the partial character string matching in the second variable length chain for a partial character string of a length of N characters (N being a predetermined integer of one or more); means for searching a partial character string matching in the second variable length chain for a partial character string of a length of N characters the start position of which is that of a partial character string in the first variable length chain shifted by one; means for deriving variable length chain matched character identifying information for identifying characters matching the first variable length chain in the second variable length chain by repeating the means for searching the matched partial character string; means for calculating a variable length chain matching factor between the first and the second variable length chains based on the variable length chain matched character identifying information and the number of characters in the first variable length chain; and means for calculating a similarity factor of the first character string based on the variable length chain matching factor.

According to another aspect of the present invention, there is provided an apparatus for calculating with a computer a similarity factor between a first character string containing a first variable length chain and a second character string containing a second variable length chain, the apparatus comprising a storage device for storing a character chain file managing the content of the second variable length chain character string, a position information file managing character string identifying information identifying the second character string in which the second variable length chain character string exists and character string position information indicating the position in the second character string at which the second variable length chain character string exists, an extended character chain file managing a second partial character string of a length of M characters (M being a predetermined integer of one or more) extracted from the second variable length chain character string, and variable length chain character string identifying information identifying a second variable length chain character string in which the second partial character string exists, and an extended position information file managing variable length chain character string position information indicating the position in the second variable length chain character string at which the second partial character string exists; means for searching a second partial character string matching a first partial character string of a length of N characters (N being a predetermined integer of one or more) extracted from the first variable length chain character string; means for searching a second partial character string matching a first partial search character string of a length of N characters which is a first partial character string in the first variable length chain character string but having a start position shifted by one; means for identifying a set of second partial character strings matching the first partial character string in the first variable length chain character string by repeating the means for searching the second partial search character string; matching the first partial search character string; means for identifying candidates of variable length chain character strings which have possibility of matching the first variable length chain character string at the predetermined variable length matching factor or higher from variable length chain character string identifying information and variable length chain character string position information for a second partial character string belonging to the set of second partial character strings; means for calculating a variable length chain matching factor between the first variable length chain character string and the variable length chain character string candidates; means for a selecting a second variable length chain character string with a matching factor of variable length chain equal to or higher than a predetermined matching factor of variable length chain with the first variable length chain character string from the variable length chain character string candidates; means for identifying the second document character string in which the selected second variable length chain character string exists from the character string identifying information for the selected second variable length chain character string; and means for calculating a character string similarity factor between the second character string in which the selected second variable length chain character string exists and the search character string containing the first variable length chain search character string based on the variable length chain matching factor.

According to another aspect of the present invention, there is provided an apparatus for locating a position where there exists a document character string in a document containing a variable length chain document character string and searchably stored through computer processing, the document character string being similar to a search character string containing a variable length chain search character string with a predetermined similarity factor or higher, the apparatus comprising means for locating the start and end positions of match in the variable length chain document character string for a partial character string of a length of M characters (M being a predetermined integer of one or more) from the top of the variable length chain search character string (hereinafter the partial character string of a length of M characters determined by the start and end positions being referred to a "valid matched character string"); means for locating the (i+i)-th valid matched character string satisfying the following conditions:

e(D,i)+1<s(D,i+1)<e(D,i)+L+1,

and

s(C,i+1)>e(C,i)-(M-1)

where

s(D,i) is the start position of the i-th valid matched character string in the document;

e(D,i) is the end position of the i-th valid matched character string in the document;

s(C,i) is the start position of the i-th valid matched character string in a character string to be searched;

e(C,i) is the end position of the i-th valid matched character string in a character string to be searched; and

L is a predetermined integer of one or more; means for continuing the step of locating the (i+i)-th valid matched character string as long as the valid matched character strings can be located; means for calculating a similarity factor between the variable length chain search character string and a character string which exists between the start position of the first valid matched character string in the variable length chain document character string and the end position of the last valid matched character string in the variable length chain document character string based on information on existence of valid matched character strings in a character string at least between the start position of the first valid matched character string in the variable length chain document character string and the end position of the last valid matched character string in the variable length chain document character string; means for identifying the variable length document character string in response to the fact that the calculated similarity factor is equal to or higher than a predetermined similarity factor; and means for identifying a document character string containing the identified variable length chain document character string.

According to another aspect of the present invention, there is provided a medium for storing a program which searches a document character string which contains a variable length chain document character string, and which is similar to a search character string at a predetermined similarity factor or higher, said search character string containing a variable length chain search character string, in a document searchably stored through computer processing, the program comprising program code means for directing a computer to associate the variable length chain document character string with document character string identifying information for identifying a document character string in which the variable length chain document character string exists; program code means for directing the computer to associate a partial document character string of a length of M characters (M being a predetermined integer of one or more) from the variable length chain document character string with variable length chain document character string identifying information for identifying a variable length chain document character string in which the partial document character string exists; program code means for directing the computer to search a partial document character string matching a partial search character string of a length of N characters (N being a predetermined integer of one or more) extracted from the variable length chain search character string; program code means for directing the computer to identify a variable length chain document character string relating to the partial document character string matching the partial search character string from the variable length chain document character string identifying information for the partial document character string matching the partial search character string; and program code means for directing the computer to identify a document character string in which the identified variable length chain document character string exists from document character string identifying information for the identified variable length chain document character string.

The medium includes a floppy disk, a CD-ROM, an MO, a PD, and a storage device connected to a network. The program codes may be divided into a plurality of segments and stored in a plurality of media. In addition, the program may be compressed and stored in the medium. The medium is loaded into a system through various drives such as a floppy disk drive, a modem, or a serial port, or the like.

According to another aspect of the present invention, there is provided a medium for storing a program which searches a document character string which contains a variable length chain document character string, and which is similar to a search character string at a predetermined similarity factor or higher, said search character string containing a variable length chain search character string, in a document searchably stored through computer processing, the program comprising program code means for directing a computer to search a partial document character string matching a partial search character string of a length of N characters (N being a predetermined integer of one or more) extracted from the variable length chain search character string in an index file, in which index file the variable length chain document character string is associated with document character string identifying information for identifying a document character string in which the variable length chain document character string exists, and a partial document character string of a length of M characters (M being a predetermined integer of one or more) from the variable length chain document character string is associated with variable length chain document character string identifying information for identifying a variable length chain document character string in which the partial document character string exists; program code means for directing the computer to identify a variable length chain document character string relating to the partial document character string matching the partial search character string from the variable length chain document character string identifying information for the partial document character string matching the partial search character string; and program code means for directing the computer to identify a document character string in which the identified variable length chain document character string exists from document character string identifying information for the identified variable length chain document character string.

According to another aspect of the present invention, there is provided a medium for storing a program which creates a file for computer search from a document containing a plurality of variable length chains in a storage device accessible by a computer, the program comprising program code means for directing a computer to extract a first variable length chain from the document; program code means for directing the computer to store position information for the position in the document at which the first variable length chain appears; program code means for directing the computer to extract a first extended fixed length chain from the first variable length chain; and program code means for directing the computer to store information for identifying the first variable length chain in which the first extended fixed length chain exists and position information for the position in the first variable length chain at which the first extended fixed length chain exists.

According to another aspect of the present invention, there is provided a medium readable by a computer for storing a file for computer search, the medium comprising a character chain file for storing a plurality of variable length chains extracted from a document containing the plurality of variable length chains; a position information file for managing position information for the position in the document at which the plurality of variable length chains appear; an extended character chain file for storing an extended fixed length chain of a length of M characters (M being a predetermined integer of one or more) extracted from each of the plurality of variable length chains; and an extended position information file for storing information for identifying a variable length chain in which the extended fixed length chain exists, and position information for the position in the variable length chain at which the extended fixed length chain exists. Such unique file configuration allows it to easily perform ambiguity search when it is incorporated in a computer system. In addition, since overlapped character strings are compressed, the size of file may be reduced than a case where documents are stored as they are. Moreover, since overlapped character strings are not repeatedly searched, high speed search can be attained. The present invention is particularly effective for a document in which a same word repeatedly appears very often. While an embodiment represents position information for the positions at which overlapped character strings appear with numerals, the size for a whole document may be reduced by coding and compressing such numerals.

Referring to FIG. 1, there is shown a block diagram of a system configuration for implementing the present invention. In this configuration, a bus 101 is connected with a central processing unit (CPU) 102 having capabilities of arithmetic operation and input/output control, a main memory (RAM) 104 for loading a program and providing a work area for the CPU 102, a keyboard 106 for key entering commands, a character string to be searched and the like, a hard disk 108 which stores an operating system for controlling the central processing unit, a database file, a search engine, an index file and the like, a display device 110 for displaying a search result of database, and a pointing device (including a mouse, a track ball or the like) 112 for pointing any position on the screen of the display device 110 and communicating its position information to the central processing unit. Therefore, it will be readily understood that the present invention can be implemented on a conventional personal computer (PC), a workstation, or a combination of them. In addition, there is shown a storage medium 114 for storing a program code providing instructions to the CPU or the like in cooperation with the operating system to implement the present invention. The storage medium includes a floppy disk, a CD-ROM, an MO, a PD, and a storage device connected to a network. The aforementioned program code may be divided into a plurality of segments, or compressed for storage in the storage medium. The storage medium 114 is loaded into a system through various drives such as a floppy disk drive, a modem, or a serial port, or the like, whereby the system shown in FIG. 1 would be configured as the system of the present invention.

A desirable operating system is a one such as Windows (trademark of Microsoft), OS/2 (trademark of IBM), or X-window (trademark of MIT) on AIX (trademark of IBM) which supports GUI multiwindow environment as a standard feature. However, the present invention may be implemented in a character-base environment such as PC-DOS (trademark of IBM) or MS-DOS (trademark of Microsoft), and is not limited to a specific operating system environment. In addition, FIG. 1 shows a system in a stand-alone environment. However, since, in general, the database file requires a disk drive with a large capacity, it may be arranged to implement the present invention as a client/server system in which a database file and a search engine are placed on the server machine, the client machine is LAN connected to the server machine through Ethernet or Token-Ring, and only a display controller is placed on the client machine to view results of search.

The system configuration of the present invention will be described by referring to the block diagram of FIG. 2. It should be noted here that components indicated by individual blocks in FIG. 2 are individually or collectively stored in the hard disk 108 of FIG. 1 as data files or program files. The database 202 is assumed by the present invention to be a one storing a plurality of documents such as a news article database or a patent specification database. However, it should be noted that the application of the present invention is not limited to search for a database consisting of a plurality of documents, but it can be applied to search in a single document. In this case, the content of individual documents is searchably stored in the document file format. Furthermore, individual documents are provided with unique document numbers. Preferable document numbers are sequential numbers in ascending order beginning with one. However, in the case of the patent specification database, the application number or laid-open number may be used as a unique document number. In addition, a symbol such as "ABC" or "&XYZ" may be used to identify the individual documents instead of the sequential number. However, because more bytes are necessary to represent such identification symbol than numerals, it is preferable in practical use to identify the document with the sequential number.

In general, since it takes a long processing time to directly search enormous contents such as news articles or patent specifications stored in the database 202, an index file 204 is previously created by an index creation/update module 206 for the contents of all news articles stored in the database 202. According to an embodiment of the present invention described later, the thus created index file 204 comprises four files of a character chain file, a position information file, an extended character chain file, and an extended position information file. The character chain file stores where in the position information file fixed length chains, variable length chains, delimiter patterns and document numbers corresponding thereto, and position numbers in the document are positioned. The position information file stores document numbers and position numbers in the document. The extended character chain file stores where in the extended position information file extended character chains, variable length chain numbers and position numbers in the variable length chain corresponding thereto are positioned. The extended position information file stores variable length chain numbers and position numbers in the variable length chain.

In addition, the database 202 may be a one managing individual documents as separate files, or a one sequentially arranging all documents in a consecutive single file. In summary, it is essential that individual documents are provided with unique numbers, and the content of individual documents can be accessed with such unique numbers. In the former case, the database 202 manages a table which causes the unique numbers for individual documents to correspond to actual file names storing the documents. In the latter case, the database 202 manages a table which causes the unique numbers for individual documents to correspond to offset in the single database file and the size of document. A search engine 208 has capabilities to search the index file 204 with a search character string given by a search character input module 210 as an input, and to return a document number(s) of document containing the input search character string and a position(s) at which the input search character string appears in the document. The search character input module 210 preferably consists of a dialogue box in the multiwindow environment, in the input box of which the desired character(s) to be searched is input through the keyboard 106.

Furthermore, according to a feature of the present invention, the search character input module 210 allows it to input the similarity factor for ambiguity search in a value of 0-1 (it may be numerals of 0-100 based on percentage). Thus, the search character input module 210 displays a slider or scroll bar which has a handle pointing any position between zero and one. The handle of slider points one as default, and can be manipulated to point another value by dragging and moving it with the mouse 112.

A result display module 212 accesses the database 202 based on the document number as the result of search given by the search engine 208, and the value of position at which the search character appears in the document, and displays a line corresponding to that position in the document preferably in a separate search result display window. If the search result cannot be contained in one screen of the window, a scroll bar appears so that the user can sequentially view the search result by clicking it.

In the present invention, files are created, which index all of continuation of characters belonging to a character set C (variable length chains), all of continuous N characters not belonging to the character set C (fixed length chains), their positions in the document, and division information in document with the document (a document chain file 302, and a position information file 304). Here, the "character set C" means a predetermined set of characters, and preferably alphabet (`A`-`Z, ` and `a`-`z`). However, it is contemplated to be characters used in other languages such as German, French, Italian, or Russian, to impose a condition that the character should be of single byte, to be alphanumeric including double byte characters, and to add several symbol characters, special symbols or the like such as "?", "!" or "'". Moreover, the "division information in document" means typically a delimiter in document sentence such as "." or ",", and a delimiter in document in a broader sense such as "Chapter 1", "Summary", a blank line, or a blank character(s). Then, files are created in response to the variable length chain, which index all of continuous N' characters in all variable length chains (extended fixed length chains) and their positions in the variable length chains with the variable length chains (an extended character chain file 306, and an extended position information file 308).

However, the four files of the document chain file 302, the position information file 304, the extended character chain file 306, and the extended position information file 308 are not necessarily physically different files, but are sufficient to be stored in such a manner that the content controlled by each file can be logically processed.

In a preferred embodiment of the present invention, the first processing necessary for creating the index file is to normalize the character string which is processing in the following. When the document to be searched is particularly a Japanese document file, single-byte and double-byte characters may be intermixed. Then, processing is performed to, for example, replace single-byte characters with corresponding double-byte characters (or vise versa), and lowercases with uppercases (or vise versa). The normalization of character string is not essential component of the present invention.

The detail of normalization may be changed to normalize single-byte characters to double-byte characters other than alphabet, or to only normalize non-delimiter languages, or by change of the search condition for differentiate the single-byte characters from the double-byte characters, or user specification.

The next step for creating the index file is, for all characters to be searched in the normalized character string and not belonging to the character set C, to extract continuous N characters starting from these characters (hereinafter called "fixed length chains"), and to store them in the index file together with the document number and the position number in document. However, N.gtoreq.1, and N=2 is suitable for Japanese, Chinese and Korean.

It is desirable not to search continuation of a character belonging to the character set C and an adjacent blank character in order to reduce the size of the index file.

The next step to create the index file is to extract continuation of characters in the normalized character string and belonging to the character set C (variable length chains), and to store them together with the document number and the position number in document in the index file. The character set C may be defined to be other than alphabet. In such case, the character string may contain a plurality of variable length chains in which character strings are continuous. In such case, while the character string may include a plurality of variable length chains, a variable length chain can be extracted by using a blank, line feed, ", ", ". ", "!", or "?" as a delimiter. For example, in the case of "Boys be ambitious." or "Boys (line feed) be ambitious. ", three variable length chains of "Boys", "be", and "ambitious" can be extracted. In the preferred embodiment of the present invention, the case of "line feed" following "-" and not including any blank before or after it, continuation of characters before or after it may be determined to be a single variable length chain. Accordingly, even in the case such as "Boys be ambi-(line feed)tious", three variable length chains of "Boys", "be", and "ambitious" are extracted. In addition, it may be possible to perform normalization such as conversion between uppercases and lowercases, conversion between the plural and the singular, and conversion of 15 tense from the past or past perfect tense to the present tense, as desired.

In the preferred embodiment of the pre