Access augmentation or optimizing

Document search method and apparatus and portable medium used therefor

6377946

Abstract

A document search method and apparatus and a portable medium used therefor are described, in which when registering a document in a data base, the logic structures of each document to be registered are superposed one on another to generate a structure index in which the structure elements having the same position of occurrence in the document are represented by a single meta-node. At the time of document search, a mass of the meta-nodes meeting a specified structural condition is determined with reference to the structure index. A string index is searched with the meta-node identifiers as a key thereby to determine a mass of documents meeting the specified condition. As a result, a highly accurate structure-specified search is made possible on a document data base including a mass of structured documents. In the structure-specified search of structured documents, the conditions for the position of occurrence of the logic elements in the document are specified, thereby making possible a highly accurate structure-specified search.


Claims

What is claimed is:

1. A method of registering a structured document in a document search system for searching the contents of a mass of documents registered in advance, wherein the process for registering a document comprises the steps of:

registering in a document data base the analyzed document data obtained by analyzing the logical structure of a document to be registered; and

sequentially superposing the logical structure of each document to be registered, generating a structure index with a single meta-node representing a set of structure elements having the same type and the same position of occurrence in the document, and attaching a context identifier for uniquely identifying the meta-node to each meta-node configuring a structure tree of said structure index.

2. In a method of registering a structured document according to claim 1,

a portable medium for recording a structure index generated by executing said step of generating and registering said analyzed document data and said step of generating said structure index.

3. In a method of registering a structured document according to claim 1,

a portable medium for recording a computer program having the function of executing said step of generating and registering said analyzed document data and said step of generating said structure index.

4. A method of searching a structured document according to claim 1, wherein the process of registering a document includes the step of:

converting the name attached to the structure of a registered document to the type for superposing the structures in the structure index.

5. A method of registering a structured document in a document search system for searching the contents of a mass of documents registered in advance, wherein the process for registering a document comprises the step of:

extracting a predetermined substring, character position information of said substring in a document to be registered, a document identifier for uniquely identifying said document to be registered, in a document data base, and a context identifier of a meta-node representing the string data including said substring in said structure index, from each string data included in each of said documents to be registered, generating structured character position information including said character position information, said document identifier and said context identifier, and updating the string index by registering the correspondence between said substring and said structured character position information.

6. In a method of registering a structured document according to claim 5,

a portable medium for recording said string index generated by executing said step of updating said string index.

7. In method of registering a structured document according to claim 5,

a portable medium for recording a computer program having the function of executing said step of updating said string index.

8. A method of registering a structured document in a document search system for searching the contents of a mass of documents registered in advance, wherein the process for registering a document comprises the structure index generation step of:

sequentially superposing the logical structure of the documents to be registered one on another, generating a structure index including a set of elements having the same type and the same position of occurrence in the documents represented by a single meta-node, and attaching a context identifier for identifying a meta-node uniquely to each meta-node constituting a structure tree of said structure index, wherein the step of superposing the structure trees of two documents to be registered includes the substeps of:

comparing two nodes constituting the structure trees of the two documents and deciding that the two nodes correspond to each other in the case where the two nodes are both the root of the structure tree;

determining, when the two nodes are not the root of the structure tree, that the two nodes correspond to each other in the case where the superior nodes of the two nodes coincide with each other, the types of the two nodes coincide with each other and said two nodes occur in the same order as counted in forward direction from the head of the arrangement of the sibling nodes having the same type as said nodes; and

constructing said structure index so that the two corresponding nodes are represented by a single meta-node.

9. A method of registering a structured document in a document search system for searching the contents of a mass of documents registered in advance, wherein the process for registering a document comprises the structure index generation step of:

sequentially superposing the logical structure of the documents to be registered one on another, generating a structure index including a set of elements having the same type and the same position of occurrence in the documents represented by a single meta-node, and attaching a context identifier for identifying a meta-node uniquely to each meta-node constituting a structure tree of said structure index, wherein the step of superposing the structure trees of two documents to be registered includes the substeps of:

comparing two nodes constituting the structure trees of the two documents and deciding that the two nodes correspond to each other in the case where the two nodes are both the root of the structure tree;

determining, when the two nodes are not the root of the structure tree, that the two nodes correspond to each other in the case where the superior nodes of the two nodes coincide with each other, the types of the two nodes coincide with each other and said two nodes occur in the same order as counted in reverse direction from the tail of the arrangement of the sibling nodes having the same type as said nodes; and

constructing said structure index so that the two corresponding nodes are represented by a single meta-node.

10. A method of registering a structured document in a document search system for searching the contents of a mass of documents registered in advance, wherein the process for registering a document comprises the structure index generation step of:

sequentially superposing the logical structure of the documents to be registered one on another, generating a structure index including a set of elements having the same type and the same position of occurrence in the documents represented by a single meta-node, and attaching a context identifier for identifying a meta-node uniquely to each meta-node constituting a structure tree of said structure index, wherein the step of superposing the structure trees of two documents to be registered includes the substeps of:

comparing two nodes constituting the structure trees of the two documents and deciding that the two nodes correspond to each other in the case where the two nodes are both the root of the structure tree;

determining, when the two nodes are not the root of the structure tree, that the two nodes correspond to each other in the case where the superior nodes of the two nodes coincide with each other, the types of the two nodes coincide with each other and said two nodes occur at the head or at other than the head of the arrangement of the sibling nodes having the same type; and

constructing said structure index so that the two corresponding nodes are represented by a single meta-node.

11. A method of registering a structured document in a document search system for searching the contents of a mass of documents registered in advance, wherein the process for registering a document comprises the structure index generation step of:

sequentially superposing the logical structure of the documents to be registered one on another, generating a structure index including a set of elements having the same type and the same position of occurrence in the documents represented by a single meta-node, and attaching a context identifier for identifying a meta-node uniquely to each meta-node constituting a structure tree of said structure index, wherein the step of superposing the structure trees of two documents to be registered includes the substeps of:

comparing two nodes constituting the structure trees of the two documents and deciding that the two nodes correspond to each other in the case where the two nodes are both the root of the structure tree;

determining, when the two nodes are not the root of the structure tree, that the two nodes correspond to each other in the case where the superior nodes of the two nodes coincide with each other, the types of the two nodes coincide with each other and said two nodes occur at the tail or at other than the tail of the arrangement of the sibling nodes having the same type; and

constructing said structure index so that the two corresponding nodes are represented by a single meta-node.

12. A method of registering a structured document in a document search system for searching the contents of a mass of documents registered in advance, wherein the process for registering a document comprises the step of:

registering in a document data base the analyzed document data obtained by analyzing the logical structure of the documents to be registered, including the substeps of:

extracting the elements and the contents not suitable as an object of search from said analyzed document data, and deleting said elements and said contents from the analyzed document data; and

registering said analyzed document data in the document data base.

13. A method of searching a structured document in a document search system for searching the contents of a mass of documents registered in advance, wherein the process for searching a document includes the steps of:

determining a mass of context identifiers meeting a specified structural condition with reference to a structure index;

extracting a predetermined substring from a search term and extracting a mass of structured character position information corresponding to said substring with reference to a string index; and

extracting the structured character position information having the context identifiers included in said mass determined in said structural condition determining step and having the same positional relation as the arrangement of said substring on said search term from said mass of said structured character position information.

14. In a method of searching a structured document according to claim 13,

a portable medium for recording a computer program having the function of executing said structural condition determining step, said structured character position information extraction step and said index search step.

15. A method of searching a structured document according to claim 13, wherein the process of searching a document includes the step of:

determining a mass of context identifiers meeting a specified structural condition with reference to a structure index, said step including the substep of;

converting the element type name described in the query into the type information with reference to the type definition table defining the correspondence between the element type name and the structure type of a document and determining a mass of context identifiers adapted for the type information with reference to the structure index.

16. A method of searching a structured document according to claim 13, wherein the process for searching a document includes the step of:

determining a mass of context identifiers meeting a specified structural condition with reference to the structure index, said step including the substep of;

determining a mass of context identifiers adapted for type information by acquiring the position of a node corresponding to the structure index based on the element type name described in a search condition using a common name index.

17. A method of registering a structured document in a document search system for searching the contents of a mass of documents registered in advance, wherein the process for registering a document includes the steps of:

generating an analyzed document data by analyzing the logical structure of a document to be registered;

analyzing the contents of a type definition table defining the correspondence between the element type name of a document and the type of the element, and producing an element regarded as the same type;

superposing the logic structures of a plurality of analyzed documents regarded to have the base document element of the same type, causing one meta-node to represent the elements regarded to have the same position and the same type in two or more documents, and generating one meta-node expressing a structure regarded to occur in only one document, thereby generating a structure index in which a plurality of registered documents regarded to have the same base document element are expressed with a common structure information expressed by a structure tree having a meta-node as an element;

generating a root meta-node constituting the most superior node and lacking a structure corresponding to a registered document as a common containing element of the meta-nodes indicating the base document element of a plurality of structure indexes generated for each plurality of registered documents regarded to have the same type as the base document element, and generating a meta structure index with the root meta-node as the base document element, in which the meta-node indicating the base document element of each registered document is a subelement of the root meta-node;

attaching a context identifier to a meta-node constituting each structure index for uniquely identifying the meta-node constituting the meta structure index including the root meta-node; and

registering in the string index the string in the structure of each registered document corresponding to each meta-node, together with the identification information of the registered document and the context identification information.

18. A method of registering a structured document in a document search system for searching the contents of a mass of documents registered in advance, wherein the process for registering a document comprises the steps of:

generating the analyzed document data by analyzing the logic structure of a document to be registered;

producing a structure regarded as the same type by analyzing the contents of a type definition table defining the correspondence between the element type name of the document and the type of the structure;

superposing the logical structures of a plurality of analyzed document data regarded to have the same type of the base document element, causing a meta-node to represent the structures of at least two documents regarded to have the same position and the same type, and generating a meta-node representing a structure regarded to occur only in one document, thereby generating a structure index including common structure information expressed by a structure tree with a meta-node as an element for each plurality of registered documents regarded to have the same type of the base document element;

generating a root meta-node lacking a structure corresponding to the registered document as a common containing element of the meta-nodes representing the base document elements of a plurality of structure indexes generated for a plurality of registered documents, respectively, regarded to have the same type of the base document element, and generating a meta structure index having a root meta-node constituting the most superior node, in which the meta-nodes representing the base document element of each registered document are subelements of the root meta-node;

attaching a context identifier for uniquely identifying the meta-node constituting the meta structure index including the root meta-node to the meta-node constituting each structure index;

registering in the string index the string in the structure of each registered document corresponding to each meta-node, together with the identification information of the registered document and the context identification information of the registered document; and

setting as an alias a common name for the structures of the same type occurring at different positions, and generating an alias structure index in which said alias corresponds to the context identifier of the meta-node.

19. A method of registering a structured document in a document search system for searching the contents of a mass of documents registered in advance, wherein the process for registering a document includes the steps of:

generating an analyzed document data by analyzing the logical structure of a document to be registered;

analyzing the contents of a type definition table defining the correspondence between the element type name of a document and the type of the element, and producing an element regarded as the same type;

superposing the logical structures of a plurality of analyzed documents regarded to have the base document element of the same type, causing one meta-node to represent the elements regarded to have the same position and the same type in two or more documents, and generating one meta-node expressing a structure regarded to occur in only one document, thereby generating a structure index in which a plurality of registered documents regarded to have the same base document element are expressed with a common structure information expressed by a structure tree having a meta-node as an element;

attaching a context identifier for uniquely identifying the meta-node constituting each structure index to the meta-node constituting the structure index;

attaching a structure index identifier for identifying each structure index to the structure index;

registering the string in the structure of each registered document corresponding to each meta-node in the string index together with the identification information of the registered document and the information on the context identifier;

reading a common name definition table in which a common name set as an alias of the structures having a specific type and regarded as the same type over a plurality of structure indexes, and generating a common name structure index including the correspondence between the common element type name, the identifier of each structure index and the context identifier of the meta-node; and

setting a common name as an alias of the structures occurring in different structure indexes of the same type or at different positions, and generating an alias structure index in which said alias corresponds to the structure index identifier and the context identifier of the meta-node.

20. A method of registering a structured document in a document search system for searching the contents of a mass of documents registered in advance, wherein the process for registering a document includes the steps of:

generating an analyzed document data by analyzing the logical structure of a document to be registered;

analyzing the contents of a type definition table defining the correspondence between the element type name of a document and the type of the element, and producing an element regarded as the same type;

converting documents having the base document element into a document in which a provisional base document element common to all the documents is added as a containing element of said base document element;

superposing the logical structures of a plurality of analyzed documents with said provisional base document element as the base document element, causing one meta-node to represent the structures regarded to have the same position and the same type in at least two documents, and expressing a structure regarded to occur only in one document by generating one meta-node expressing the particular structure, thereby expressing common structure information by a structure tree having meta-nodes as elements thereof;

attaching a context identifier for uniquely identifying a meta-node constituting the structure index to said meta-node; and

registering in the string index the string in the structure of each registered document corresponding to each meta-node together with the identification information and the context identifier information of the registered document.


Description

BACKGROUND OF THE INVENTION

The present invention relates to a method of document registration and a method of document search for a document search system or a document management system using a computer system, or more in particular to a method and apparatus for registration and search of a mass of structured documents each having a logical structure, which is capable of searching specific document contents at high speed, and a portable medium used for them.

With the full scale progress of the information society, computerized document information generated using the word processor, the personal computer or the like have increased more than ever before. Under these circumstances, demand is rising for quickly and accurately retrieving a document containing the required information from a vast accumulation of computerized documents.

A technique meeting this demand is the full-text search. In full-text search, the entire text in the document to be registered is loaded in a computer system and converted into a data base, and the data base is searched directly for a specified character string (hereinafter referred to as the query term). This requires no key word and basically makes possible a search free of detection failure.

On the other hand, high-accuracy search can be realized by adding conditions for logic structure to the query (hereinafter referred to as the structure-specified search) intended for documents in which individual logic elements can be identified (hereinafter referred to as the structured document), including a document described in SGML, for example (C. F. Goldfarb: "THE SGML HANDBOOK" Oxford 1993).

A search method permitting the structure-specified search is proposed in JP-A-8-147311 (hereinafter referred to as the well-known example 1). The well-known example 1 will be briefly described below.

In the method of structured document search according to the well-known example 1, a document is registered first as a text directly in a search data base.

Then, a specific character string (hereinafter referred to as the front marker for the well-known example 1) indicating the head of each logic structure of the registered text and a specific character string (hereinafter referred to as the rear marker for the well-known example 1) indicating the tail of each logic structure of the registered text are detected thereby to identify the logic structure while at the same time segmenting the text by logic structure. In the electronically filed patent specification, for example, "<SDOABJ>" is detected as a front marker and "</SDO>" as a rear marker indicating the scope of the logic structure "abstract", whereby the text defined by them is cut out as a text corresponding to the "abstract". A similar cut-out work is performed also for other logic structures to segment the text by logic structure.

Then, the text corresponding to each logic structure is condensed, and a condensed text is produced. Specifically, as for the "abstract", the text thereof is segmented into substrings by word, and the inclusion relation is checked mutually between the substrings thus segmented. In the process, the character strings contained in other substrings are removed, thereby producing a condensed test of the "abstract". A similar processing is performed for other logic structures to produce a condensed text by logic structure and registered in the search data base as a condensed text file.

Then, "1" is set to a bit corresponding to the character code of the characters appearing in the text to generate a character component table, which is registered as a character component table file in the search data base.

After constructing a search data base in this way, the document search is conducted in the following manner for the well-known example 1.

First, a specified query term is decomposed by character, and the documents containing all the characters constituting the query term are extracted with reference to the character component table.

Then, the condensed text file for the logic structure specified as an object of search is selected among the condensed text files corresponding to logic structures. At the same time, only the condensed text of a document extracted by the character component table search is searched, thereby extracting a document containing the query term specified in the specified logic structure. In the case where the positional relation between a plurality of query terms in the text is not specified in the specified query formula, the search process is terminated. In the case where such a positional relation is specified, on the other hand, the contents of the text corresponding to the document extracted as a result of condensed text search is read, and only those texts containing all the specified query terms and meeting the specified conditions for the positional relation between the query terms are extracted.

In this way, according to the method of the well-known example 1, a structure-specified search is made possible while maintaining a practical search speed for a large-scale text data base.

SUMMARY OF THE INVENTION

The prior art disclosed in the well-known example 1 described above makes possible a structure-specified search to some extent. Nevertheless, there may be the case in which search meeting the structural conditions is impossible as intended by the structure-specified search of the well-known example 1.

In the method of the well-known example 1, the structure of a registered document involved is segmented into several predetermined subelements, and a condensed text file is produced for each subelement. At the time of search, a mass of the condensed text files to be searched is determined by reference to a table defining the correspondence between the structure name of the subelement and the condensed text file, and only the condensed text files contained in the particular mass are searched thereby to realize a structure-specified search.

This method estimates a future search specifying the structural condition at the time of constructing a text data base, and segments the condensed text files in such a manner as to permit a search meeting such a condition. Therefore, the search specifying the structural condition not assumed at the time of data base construction is impossible to conduct.

Assume, for example, that a document is configured of two logic elements (hereinafter called the elements) including "abstract" and "body", and the latter is configured of repetitions of an arbitrary number of "clauses", which in turn includes one "clause subject" and an arbitrary number of "paragraphs". In constructing a text data base from a set of documents having this structure, the condensed text files is segmented into those corresponding to "abstract" and those corresponding to "body". It is impossible to conduct a structure-specified search meeting the condition that "a set of documents containing a string XX in the clause subject is determined".

Of course, this condition can be met if instead of making one condensed text file of the whole "body", the "body" is segmented further into "clause subjected" and "paragraph" to produce a condensed text file. Even when the file is configured this way, however, it is impossible to meet the structural condition that "a set of documents containing a string XX in the first clause (clause subject or paragraph) is determined" or that "a set of documents containing a string XX in the last paragraph of a clause is determined". For this structural condition with a specified order is to be met, it is necessary to prepare a condensed text file for each order of occurrence of a clause and a paragraph. In view of the fact that an arbitrary number of clauses and paragraphs can occur, however, the number of the condensed text files would become enormous. In addition, the well-known example 1 lacks means for setting a correspondence between the structural condition containing an arbitrary specification of the order of occurrence and a mass of finely segmented condensed text files. Actually, therefore, the search meeting this condition is impossible.

As described above, in the prior art, the condition for the position of occurrence of the logic elements in a document cannot be included in the specification of the structural condition, and therefore a highly accurate structure-specified search cannot be executed.

An object of the present invention is to solve the above-mentioned problem of the prior art and to provide a function of conducting a highly accurate and efficient structure-specified search.

Further, the prior art described above can realize only the structure-specified search for a set of documents having a predetermined structure.

Specifically, a structure document such as SGML is the one having a structure predetermined by the DTD (document type definition). In the case where a structure-specified search is conducted for a set of documents according to a specified document type definition, therefore, a document is segmented structurally in order to meet all the conditions for structure specification that can occur, thus making a structure-specified search possible.

Nevertheless, there is not only one document type definition. A thesis, a report, etc. for example, has a different document type definition. In this way, a structured document has various document structures for different objects of the document, and a document type definition corresponding to a particular document structure is produced.

These documents are grouped and registered by document type definition, so that the structure-specified search becomes possible for each group. An attempt to realize a search specifying a common structure that can occur for all the groups, however, cannot be achieved unless the structure-specified search is conducted independently for each group and the result is integrated.

On the other hand, standardization of a structured document not necessarily requiring a specific structure like XML (Extended Markup Language) is going one at W3C (World Wide Web Consortium). The probable trend is toward the situation in which the document having a document structure meeting a specific DTD like SGML is not the only object of search.

Further, according to the prior art described above, even structures having the same meaning (type) like "title", "subject" are regarded as different structures when the element type name is different. In the structure-specified search in terms of "a document containing `SGML` in `title`", for example, a document meeting the condition "a document containing `SGML` in `subject`" cannot be produced as the search result.

Especially when a document type definition is different, different element type names may be attached to the same type of structure for each document type definition.

Assume that a structure-specified search is to be conducted for "title", for example. Unless the user specifies element type names meaning "title" occurring in each document type definition, such as "title", "subject", "name", "TITLE" and prepares a query specifying a structure, all the documents required cannot be acquired. Also, unless all the document type definitions of the registered documents are known, all the structures meaning "title" cannot be covered by the element type name determined by the user. A document according to the document type definition that a title is described in the structure "T", for example, can never be acquired by the structure-specified search by the user not knowing the rule.

Another object of the present invention is to solve the problems mentioned above and to provide a function of highly accurately and efficiently conducting structure-specified search on a set of documents having different document structures coexisting therein.

Further, assume that a condition for the structure-specified search is set as "a document containing the word `SGML` in the title of any item including a chapter, a clause, etc.". It is necessary to search all the structures meeting the structural condition "title", thereby leading to a reduced search efficiency.

If all the elements down to title are specified sequentially from the base document element such as "/document/chapter/title" as a query, a structure can be efficiently specified. This requires the user, however, to prepare the structure-specified search condition indicating all the structures, like "/document/chapter/title" or "/document/chapter/clause/title" or so forth", and thus increases the load on the user. In addition, unless the user grasps all the structures of the document to be searched, a complete search may be impossible.

Still another object of the invention is to solve the problems mentioned above and to provide a function of efficiently realizing a search specifying the same type of structure occurring in a plurality of hierarchical levels without specifying a complicated structural condition.

In order to solve the problems mentioned above, according to the present invention, there are provided a document registration and search method, comprising the following steps.

Specifically, a document registration method according to this invention includes the steps of:

(1) analyzing the logic structure of a document to be registered, generating analyzed document data, and registering the analyzed document data in a document data base;

(2) superpose the logic structures of the documents to be registered, sequentially in the order of registration, causing a single meta element to represent a set of elements having the same position of occurrence in the document and the same type, and causing a single meta string data to represent a set of string data having the same position of occurrence in the document, thereby generating a structure index composed of a structure tree of a set of meta elements and a set of meta string data (hereinafter collectively referred to as the meta-nodes), and attaching to all the meta-nodes constituting the structure index a context identifier for uniquely identifying them in the structure index;

(3) generating structured full-text data composed of the definition of the correspondence between all the string data contained in the analyzed document data corresponding to each document to be registered on the one hand and the context identifier of the meta string data representing the string data in the structure index; and

(4) extracting from the structured full-text data corresponding to each document to be registered, a predetermined substring, character position information of the substring in the document to be registered, a document identifier for uniquely identifying the document to be registered, in the document data base, and a context identifier of the metal string data representing the string data containing the substring in the structure index; generating the structured character position information including the character position information, the document identifier and the context identifier; and registering the correspondence between the substring and the structured character position information thereby to update the string index.

Also, in a document search method according to this invention, the process for searching a registered document includes the steps of:

(1) determining a mass of context identifiers meeting a specified structural condition with reference to the structure index;

(2) extracting a predetermined substring from a query term, and extracting a mass of structured character position information corresponding to the substring with reference to the string index; and

(3) extracting from the mass of the structured character position information the structured character position information having a context identifier contained in the mass determined in the structural condition determining step and having the same positional relation as the arrangement of the substring on the query term.

Further, in a document search method according to the invention, the process for collectively registering documents having a plurality of document structures includes the steps of:

(1) acquiring the type of a particular structure from the element type name with reference to a type definition table describing the correspondence between the name and the type of the structure that can occur in a plurality of structures in the structure index;

(2) acquiring a structure index having the base document element of the same type as the base document element of the document; and

(3) providing a parent node (root meta node) for collecting the structure indexes at the root of the structure index of the documents having a plurality of document structures at the time of registering the structured documents, thereby collecting a plurality of structure indexes into a single meta structure index.

Alternatively, the process for collectively registering documents having a plurality of document structures includes the steps of:

(1) acquiring the type of a particular structure from the element type name with reference to a type definition table describing the correspondence between the name and the type of each structure that can occur in a plurality of structures in a structure index; and

(4) adding a provisional base document element shared by all the documents to the analyzed document data obtained by analyzing the structure of a registered document.

The type definition table is prepared beforehand, manually or automatically by assigning synonyms to the same type using a thesaurus or the like.

Further, in a document search method according to this invention, in order to efficiently realize the structure-specified search specifying the elements of the same type occurring at many positions in the structure index, a document registration program includes the step of:

(1) generating an alias structure index together with a structure index at the time of document registration.

The alias structure index is a structure index prepared so that the information capable of being set for each document structure, such as the date of preparation and the data of updating, can be searched collectively without tracing the structure index. The structure-specified search conducted by specifying the type acquired from the alias structure index enables a plurality of elements in the structure index corresponding to an alias to be acquired collectively from the alias structure index, and therefore the search can be realized more efficiently than when acquiring the context identifier of a specified element by tracing the structure index.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram showing a general configuration of a document search system according to a first embodiment of the invention.

FIG. 2 is a diagram showing a configuration of a document registration subsystem according to the first embodiment of the invention.

FIG. 3 is a PAD generally showing the steps of document registration process according to the first embodiment of the invention.

FIG. 4 is a diagram showing an example of the DTD defining a document logic structure.

FIG. 5 is a diagram showing an example description of the structured document according to SGML.

FIG. 6 is a model diagram showing a pattern of the logic structure of a document expressed by SGML.

FIG. 7 is a PAD showing the steps of a document structure analysis program according to the first embodiment of the invention.

FIG. 8 is a diagram showing a data structure of a document structure table.

FIG. 9 is a PAD showing the steps of a structure index generation program according to the first embodiment of the invention.

FIG. 10 is a diagram showing the order in which the analyzed document data are traced according to the first embodiment of the invention.

FIG. 11 is a diagram showing the correspondence between nodes and meta-nodes according to the first embodiment of the invention.

FIG. 12 is a diagram showing the process of generating a structure index according to the first embodiment of the invention.

FIG. 13 is a diagram showing the steps of processing a structured full-text data generation program according to the first embodiment of the invention.

FIG. 14 is a diagram showing a file format of the structured full-text data according to the first embodiment of the invention.

FIG. 15 is a PAD showing the steps of processing a string index generation program according to the first embodiment of the invention.

FIG. 16 is a diagram showing the data structure of a string index according to the first embodiment of the invention.

FIG. 17 is a diagram showing a configuration of a document search server according to the first embodiment of the invention.

FIG. 18 is a PAD showing the steps of document search process according to the first embodiment of the invention.

FIG. 19 is a PAD showing the steps of processing in the search condition analysis program according to the first embodiment of the invention.

FIG. 20 is a diagram an example of generating an expanded search condition data according to the first embodiment of the invention.

FIG. 21 is a PAD showing the steps of processing a string index search program according to the first embodiment of the invention.

FIG. 22 is a diagram showing an example of execution of the neighboring plural-character occurrence determination process according to the first embodiment of the invention.

FIG. 23 is a diagram showing a data structure of the search result data according to the first embodiment of the invention.

FIG. 24 is a PAD showing the detailed steps of processing the search result data transfer according to the first embodiment of the invention.

FIG. 25 is a diagram showing a configuration of a document search client according to the first embodiment of the invention.

FIG. 26 is a PAD showing the steps of operation of a search client according to the first embodiment of the invention.

FIG. 27 is a PAD showing the steps of processing a query input program according to the first embodiment of the invention.

FIG. 28 is a PAD showing the steps of processing a search result display program according to the first embodiment of the invention.

FIG. 29 is a diagram showing a configuration of a document registration subsystem according to a second embodiment of the invention.

FIG. 30 is a PAD briefly showing the steps of processing the document registration according to the second embodiment of the invention.

FIG. 31 is a PAD showing the steps of processing a last-come first-served structure index generation program according to the second embodiment of the invention.

FIG. 32 is a diagram showing the order of tracing the analyzed document data according to the second embodiment of the invention.

FIG. 33 is a diagram showing the correspondence between nodes and meta-nodes according to the second embodiment of the invention.

FIG. 34 is a diagram showing the process of generating a last-come first-served structure index according to the second embodiment of the invention.

FIG. 35 is a PAD showing the steps of processing in a structured full-text data generation program according to the second embodiment of the invention.

FIG. 36 is a diagram showing a file format of the structured full-text data according to the second embodiment of the invention.

FIG. 37 is a PAD showing a data format of the string index according to the second embodiment of the invention.

FIG. 38 is a diagram showing a configuration of a document search server according to the second embodiment of the invention.

FIG. 39 is a PAD briefly showing the steps of the document search process according to the second embodiment of the invention.

FIG. 40 is a diagram showing the steps of processing in the search condition analysis program according to the second embodiment of the invention.

FIG. 41 is a diagram showing an example of generating an expanded search condition data according to the second embodiment of the invention.

FIG. 42 is a PAD showing the steps of processing in the string index search program according to the second embodiment of the invention.

FIG. 43 is a diagram showing the correspondence between nodes and meta-nodes according to a third embodiment of the invention.

FIG. 44 is a diagram showing a configuration of a document registration subsystem according to the third embodiment of the invention.

FIG. 45 is a PAD briefly showing the steps of the document registration process according to a fourth embodiment of the invention.

FIG. 46 is a PAD showing the steps of processing in the document structure normalization program according to the fourth embodiment of the invention.

FIG. 47 is a diagram showing a specific example of the normalization process according to the fourth embodiment of the invention.

FIG. 48 is a diagram showing a configuration of a document registration subsystem according to the fifth embodiment of the invention.

FIG. 49 is a diagram showing an example of generating a meta structure index according to the fifth embodiment of the invention.

FIG. 50 is a diagram showing the contents of a type definition table according to the fifth embodiment of the invention.

FIG. 51 is a PAD briefly showing the steps of the document registration process according to the fifth embodiment of the invention.

FIG. 52 is a PAD briefly showing the steps of process for generating a meta structure index according to the fifth embodiment of the invention.

FIG. 53 is a diagram showing a first example of the process for updating the meta structure index according to the fifth embodiment of the invention.

FIG. 54 is a diagram showing a second example of the process for updating the meta structure index according to the fifth embodiment of the invention.

FIG. 55 is a diagram showing a configuration of a document search server according to the fifth embodiment of the invention.

FIG. 56 is a PAD briefly showing the steps of the document search process according to the fifth embodiment of the invention.

FIG. 57 is a PAD showing the steps of processing in a meta structure index-applied search condition analysis program according to the fifth embodiment of the invention.

FIG. 58 is a diagram showing an example of generating the expanded search condition data according to the fifth embodiment of the invention.

FIG. 59 is a diagram showing an example of structure condition conversion according to a sixth embodiment of the invention.

FIG. 60 is a diagram showing a configuration of a document registration subsystem according to a seventh embodiment of the invention.

FIG. 61 is a diagram showing an example of the result of processing the root node add program according to the seventh embodiment of the invention.

FIG. 62 is a diagram showing a configuration of a document search server according to the seventh embodiment of the invention.

FIG. 63 is a diagram showing the steps of processing in the root node add program according to the seventh embodiment of the invention.

FIG. 64 is a diagram showing the contents of the process for converting the structural conditions according to the seventh embodiment of the invention.

FIG. 65 is a diagram showing an alias structure index according to an eighth embodiment of the invention.

FIG. 66 is a diagram showing a system configuration of a document registration subsystem according to the eighth embodiment of the invention.

FIG. 67 is a PAD briefly showing the steps of the registration process according to the eighth embodiment of the invention.

FIG. 68 is a PAD briefly showing the steps of generating an alias structure index according to the eighth embodiment of the invention.

FIG. 69 is a diagram showing a system configuration of a document search server according to the eighth embodiment of the invention.

FIG. 70 is a PAD briefly showing the document search process according to the eighth embodiment of the invention.

FIG. 71 is a PAD showing the steps of processing in an alias structure index-applied search condition analysis program according to the eighth embodiment of the invention.

FIG. 72 is a diagram showing the contents of a type definition table according to a ninth embodiment of the invention.

FIG. 73 is a diagram showing the correspondence between the meta structure index and the type definition management table for each structure index according to a tenth embodiment of the invention.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

(1) First Embodiment

A first embodiment of the present invention will be described below with reference to the drawings.

First, a system configuration according to this embodiment will be explained.

FIG. 1 is a diagram showing a general configuration of a document search system according to the first embodiment of the invention. As shown in FIG. 1, the document search system according to the first embodiment of the invention comprises a document registration subsystem 101, a document search server 102, document search clients 103, 104, and a network 105.

The document registration subsystem 101 analyzes the structure of each document input thereto as an object of search and generates index data required for the search. This index data is transferred through the network 105 to the document search server 102 which uses it for the element search process.

The document search server 102 receives a search command from the search clients 103, 104, searches the contents of the documents meeting the conditions specified by the search command using the index data generated by the document registration subsystem 101, and sends back the search result data to the search client constituting an origin of the search request.

The search clients 103, 104 display a screen for the user to specify a search condition (query) interactively, converts the search condition specified by the user on the screen into a search command that can be interpreted by the document search server 102, and transmits the search command to the document search server 102 through the network 105. Upon receipt of the search result data sent back after the search conducted by the document search server 102 in response to the search command, the search client 102 proposes the received search result data by displaying it on the screen. FIG. 1 shows a configuration example using the two computers 103, 104 as search clients. Nevertheless, either a single computer or three or more computers can be configured as search clients.

The network 105 is a local area network and/or a wide area network used by the document registration subsystem 101, the document search server 102 and the search clients 103, 104 to exchange various data and commands.

In FIG. 1, the network 105 is used for transferring the index data from the document registration subsystem 101 to the document search server 102. As an alternative, a configuration is possible which uses a portable medium such as a floppy disk, a magneto-optic disk, a write-once optical disk, etc. As another alternative, the document registration subsystem 101 and the document search server 102 are mounted on a single computer and the data transfer is eliminated.

Further, although FIG. 1 shows the case in which a computer is used for each of the search clients 103, 104, and the document search server 102, a configuration can be employed in which one search client or more and the document search server share the same computer.

The document registration subsystem according to this embodiment, i.e. the subsystem 101 in FIG. 1 will be explained below.

FIG. 2 is a diagram showing a configuration of the document registration subsystem 101 according to this embodiment.

The document registration subsystem 101 shown in FIG. 2 includes a display 201, a keyboard 202, a central processing unit (CPU) 203, a floppy disk drive 204, a floppy disk 205, a communication control unit 206, a main memory 207, a magnetic disk unit 208 and a system bus 209.

The display 201 is used for displaying the progress of execution of the document registration process in this subsystem. The keyboard 202 is used for inputting a command specifying the execution of the document registration process, etc. The central processing unit 203 executes various programs configuring this subsystem. The floppy disk drive 204 is used for writing and reading data to and from the floppy disk 205. The floppy disk 205 is used for storing a document to be registered and inputting the same document to this subsystem. The communication control unit 206 is used for communicating with the document search server 102 through the network 105 to exchange requests and data. The main memory 207 is used for holding various programs and provisional data for executing the processes in this subsystem. The magnetic disk unit 208 is used for storing the document data registered and the index data generated by this subsystem. The system bus 209 is used for connecting these various units.

The main memory 207 stores therein a document structure analysis program 210, a structure index generation program 211, a structured full-text data generation program 212, a string index generation program 213, a document registration control program 214 and a system program 215 on the one hand, and holds a work area 216 on the other hand. An analyzed document data storage area 217, a structure index storage area 218, a structured full-text data storage area 219 and a string index storage area 220 are secured in the magnetic disk unit 208.

The document structure analysis program 210, which is described by SGML, is used for reading the document to be registered stored in the floppy disk 205, generating the analyzed document data by analyzing the logic structure of the document to be registered, and storing the analyzed document data in the analyzed document data storage area 217. The structure index generation program 211 is executed for registering the information on the logic structure of the analyzed document data in the structure index stored in the structure index storage area 218 and updating the structure index. The structured full-text data generation program 212 is executed for generating the structured full-text data on the document to be registered from the analyzed document data and storing the same data in the structured full-text data storage area 219.

The string index generation program 213 is executed for generating the data indicating the correspondence between a predetermined substring and the structured character position information of the substring from the structured full-text data, and registering the same data in the string index stored in the string index storage area 220 thereby to update the string index.

The document registration control program 214 is used for controlling the activation and execution of the document structure analysis program 210, the structure index generation program 211, the structured full-text data generation program 212 and the string index generation program 213, while at the same time transferring the analyzed document data, the structure index and the string index generated by these programs to the document search server 102 through the network 105. The system program 215 provides basic functions such as inputting/outputting data to and from the peripheral units on the computer for executing each program constituting this subsystem. The work area 216 is used for storing the data temporarily required for executing each program.

Although this embodiment represents a configuration in which the document to be registered stored in the floppy disk 205 is read as an input, it is also possible to employ a configuration in which such a document is read from a magneto-optic disk, a write once optical disk or other portable medium, or a configuration in which the document transferred through the network 105 is input. Further, according to this embodiment, the network 105 is used for transferring the analyzed document data, the structure index and the string index generated to the document search server 102. Instead, a configuration can be employed which uses a floppy disk, a magneto-optic disk, a write once optical disk or the like portable medium, or a configuration in which the document registration subsystem 101 and the document search server 102 are mounted on a single computer to eliminate data transfer.

Now, the steps of processing the document registration according to this embodiment will be explained.

FIG. 3 is a PAD (Problem Analysis Diagram) briefly showing the steps of processing the document registration according to a first embodiment of the invention. Upon activation of the document registration control program 214 in response to a registration command or the like from the keyboard 202, this program first checks the floppy disk 205 for the presence or absence and the number of documents to be registered stored therein, and repeatedly executes a series of the process including steps 302 to 305 for all the documents to be registered (step 301).

In step 302, an unprocessed document to be registered is selectively read from the set of documents to be registered stored in the floppy disk 205. In step 303, the document to be registered thus read is assigned a document identifier. The document identifier is the number for identifying a specific document uniquely in a document data base.

In step 304, the document structure analysis program 210 is executed with this document to be registered read as an input. The document structure analysis program 210 generates the analyzed document data corresponding to the document to be registered and stores the data in the analyzed document data storage area 217.

In step 305, the structure index generation program 211 is executed with the analyzed document data generated in step 304 as an input. The structure index generation program 211 first reads the current structure index from the structure index storage area 217, registers the structure information held in the supplied analyzed document data in the structure index, and stores the updated structure index again in the structure index storage area 218.

In step 306, the analyzed document data generated in step 304 is supplied as an input and the structured full-text data generation program 212 is executed. The structured full-text data generation program 212, with reference to the analyzed document data supplied thereto, generates the structured full-text data corresponding to the document to be registered read in step 303, and stores it in the structured full-text data storage area 219.

In step 307, the string index generation program 213 is executed in response to the structured full-text data generated in step 306 and supplied thereto as an input. The string index generation program 213 first reads the current string index from the string index storage area 220, generates the data indicating the correspondence between a predetermined substring and the structured character position information of the particular substring from the structured full-text data, registers it in the string index, and stores the updated string index again in the string index storage area 220.

Upon complete series of process from steps 302 to 307 on all the documents to be registered, the document registration control program 214 executes step 308 and terminates the process. In step 308, all the analyzed document data stored in the analyzed document data storage area 217, the structure index stored in the structure index storage area 218, and the string index stored in the string index storage area 220 are transferred to the document search server 102 through the network 105.

Now, the detail of step 304 in FIG. 3, i.e. the steps of processing in the document structure analysis program 210 according to this embodiment will be explained.

The document structure analysis program 210 processes the structural analysis of a single document to be registered described using SGML. In SGML, the logic structure shared by a set of documents of a specific type is defined by DTD (document type definition). FIG. 4 shows an example of DTD. The DTD defines a mass of logic elements (hereinafter referred simply as "elements") constituting a document thereby to define the logic structure of the document. In FIG. 4, the part defined by the string "<!ELEMENT" and string ">" is called an element type declaration. Each element type declaration specifies the name (called the element type name) shared by a set of elements having an element type and the structure thereof. The string indicated in the left part of the element type declaration indicates the element type name and the right part is the definition of the structure of the content thereof.

In the DTD shown in FIG. 4, the element type declaration for the element type "thesis" specifies that the content of the element associated with this element type has a structure including each one of the elements of the element types "title", "author", "date", "text" and "reference list" arranged in that order. A plurality of element type names are arranged by separating them by "" from each other, indicating that the elements associated with these element type names are required to occur in the specified order.

The element type declaration for the element type "author" specifies that the content of the element associated with this element type has a structure including at least one repetition of the element associated with the element type "name". The character "+" is added to the tail of the element type name to indicate that at least one element associated with the particular element type name occurs.

The element type declaration for the element type "text" specifies that the content of the element associated with this element type has a structure including at least zero repetition of the element associated with the element type "chapter". The character "*" is added to the tail of the element type name to indicate that at least zero element associated with this element type name occurs.

The element type declaration for the element type "chapter" specifies that the content of the element associated with this element type has a structure including at least zero neighboring element associated with the element type "paragraph" or "remark" at the tail of one element associated with the element type "chapter title", followed by at least zero repetition of the element associated with the element type "clause". A plurality of element type names are segmented by ".vertline." to indicate that an element associated with any one of the element types segmented by the character occurs.

The element type declaration for the element type "clause" specifies that the content of the element associated with this element type has a structure including one element associated with the element type "clause title", followed by at least zero neighboring element associated with the element type "paragraph" or "remark", further followed by at least zero repetition of the element associated with the element type "term".

The element type declaration for the element type "term" specifies that the content of the element associated with this element type has a structure including one element associated with the element type "term title", followed by at least zero repetition of the element associated with the element type "paragraph" or "remark".

The element type declaration for the element type "reference list" specifies that the content of the element associated with this element type has a structure including at least one repetition of the element associated with the element type "reference".

The element type declaration for the element type "reference" specifies that the content of the element associated with this element type has a structure including one element each associated with the element types "author", "date" and "source" arranged in that order.

Also, the content of the elements associated with the element types "title", "name", "date", "chapter title", "clause title", "term title", "emphasis" and "source" is specified simply as "#PCDATA". This specifies that these elements have no subelements and has a content composed simply of a character string. The element type declaration for the element types "paragraph" and "remark", on the other hand, specifies that the elements associated with these element types have a structure including at least zero repetition of an element or a simple character string associated with the element type "emphasis".

In the DTD, the part defined between the string "<!ATTLIST" and the string ">" is called an attribute list declaration, which defines the attribute shared by a set of elements associated with an element type. In the DTD shown in FIG. 4, it is defined that the element associated with the element type "remark" has the attribute "type", that this attribute can assume a value of "refer" or "note", and that in the case where this last definition is omitted, "refer" is given as a value.

An example of the SGML document described according to the DTD shown in FIG. 4 is shown in FIG. 5. The part defined between the string "<!DOCTYPE" and the string ">" at the head of the document is called the document type declaration, which declares the DTD followed by the particular SGML document and the element type name of the base document element. In the example shown in FIG. 5, this part specifies that this document follows the DTD stored in the file "ronbun.dtd", and that the element type name of the base document element is "thesis". In this case, assume that the DTD shown in FIG. 4 is stored in the file "ronbun.dtd".

As shown in FIG. 5, the document structure is expressly described in SGML by adding a mark indicating the head position and a mark indicating the tail position of each element constituting a document. The mark indicating the head position of each element is called the "start tag" and the mark indicating the tail position thereof is called the "end tag". The start tag is indicated by describing the element type name of a particular element between the strings "<" and ">". The end tag is indicated by describing the element type name of a particular element between the strings "</" and ">". In the case where an element has an attribute, the specification of the attribute value can be described in the start tag (after the element type name). The specification of an attribute value is indicated by placing the string "=" between the attribute name and the attribute value. In FIG. 5, for example, the start tag "<remark type=note>" attaches the attribute value "note" to the attribute "type" of the element "remark". In the SGML document, the part describing the document structure using these tags is called the "document instance".

The detail of step 304 in FIG. 3, i.e. the steps of processing the document structure analysis program 210 according to this embodiment is shown in the PAD of FIG. 7.

As shown in FIG. 7, the document structure analysis program 210, upon activation thereof by the input thereto of one document to be registered described in SGML, first reads the document type declaration described at the head of the particular document and analyzes the syntax thereof (step 701). Then, step 702 determines the presence or absence of a syntax error in the document type declaration. In the case where a syntax error is detected, the process proceeds to step 703 where an error message is output and the process is suspended.

In the absence of a syntax error in the document type declaration, the process proceeds to step 704 for determining whether the DTD file specified in the particular document type declaration is present or not. Unless the DTD file is detected, the process proceeds to step 705 where an error message is output and the process is suspended.

In the case where the DTD file is detected, on the other hand, the process proceeds to step 706 where the content of the file is read and the syntax thereof is analyzed. Then, in step 707, the presence or absence of a syntax error in the DTD is determined. In the case where a syntax error is detected, the process proceeds to step 708, where an error message is output and the process is suspended. In the case where no syntax error is detected, on the other hand, the process proceeds to step 709 where a document structure table providing data describing the document structure model defined by the DTD is generated on the memory.

Then, in step 710, the document instance is read with reference to the document structure table described above, and the structure is analyzed, with the result that an analyzed document data is generated. Then, step 711 determines whether the document instance contains a syntax error or a structural error (deviation from the structure model defined by DTD) or not. In the case where a syntax error or a structure error is detected, the process proceeds to step 712, where an error message is output and the process is suspended. In the case where no error is detected, on the other hand, the process proceeds to step 713 where the analyzed document data including a document identifier for identifying the document to be registered and the analysis result data obtained by the structural analysis in step 710 are output to the analyzed document data storage area 217 and the process is terminated.

As an example, reference is made to the case in which the document structure analysis program 210 is executed with the SGML document of FIG. 5 as a document to be registered, and where the content of the DTD file "ronbun.dtd" referred to by the document is the DTD shown in FIG. 4. In this case, the document structure table generated in step 709 assumes a data structure as shown in FIG. 8. As shown in FIG. 8, the document structure table includes two parts, a structure definition and an attribute definition. The structure definition defines the data model of the content that the element associated with a particular element type corresponding to the element type name of each element type configuring the DTD. The attribute definition, on the other hand, defines the name, the type of the attribute value and the default value of each attribute of each element associated with each element type corresponding to the element type name configuring the DTD. By referring to this structural definition, it is determined whether the arrangement or the hierarchical relation of a set of elements occurring in the document instance is correct or not (presence or absence of an element error). Also, in the case where a tag is omitted or an attribute value is specified, they can be complemented.

Assume that the SGML document shown in FIG. 5 is supplied to the document structure analysis program 210 as a document to be registered and that the DTD thereof is as shown in FIG. 4. Then, the structure tree shown in FIG. 6 is obtained as the analyzed document data. FIG. 6 is a model diagram showing a pattern of the logic structure of the document expressed by the SGML description shown in FIG. 5. As shown in FIG. 6, the logic structure of the structured document can be grasped as a structure tree with each element as an intermediate node and the string data as end nodes. In FIG. 6, each element is expressed by a circle, and the string data by a rectangle.

According to this embodiment, a configuration is employed in which the structured document described in SGML is processed as a document to be registered. Nevertheless, a configuration is possible in which a structured document described in other forms such as ODA (open document architecture) can be used as a document to be registered.

FIG. 9 is a PAD showing the detail of step 305 in FIG. 3, i.e. a PAD showing the steps of processing in the structure index generation program 211 according to this embodiment.

The structure index generation program 211 first determines, in step 901, whether the existing structure index is present in the structure index storage area 218. In the case where the structure index is not present in the structure index storage area 218, the process proceeds to step 902 for generating an initial (vacant) structure index. In the case where the existing structure index is detected, on the other hand, the process proceeds to step 903 for reading the same structure index.

Then, in step 904, the analyzed document data of the document to be registered is read from the analyzed document data storage area 217.

Next, in step 905, the process of steps 906 to 909 is repeated for all the nodes (elements and string data) making up the structure tree of the analyzed document data.

Step 906 determines whether or not a meta-node (meta element or meta string) corresponding to a node currently closely watched in the analyzed document data exists in the structure index. In the case where there exists no such corresponding meta-node, the process proceeds to step 907 where a meta-node corresponding to the particular node is generated and registered in the structure index and further the meta-node thus registered is assigned a context identifier providing the number for uniquely identifying the meta-node in the structure index (step 908). In step 909, the correspondence between the node currently closely watched in the analyzed document data and the context identifier for identifying the meta-node corresponding to the node in the structure index is added to the analyzed document data, and thus the analyzed document data is updated.

Upon complete repetition of step 905 and subsequent steps, the process proceeds to step 910 for outputting the updated analyzed document data and storing them in the analyzed document data storage area 217. Then, in step 911, the updated structure index is output and stored in the structure index storage area 218, thus terminating the process.

Now, the order in which individual nodes are processed by tracing the structure tree of the analyzed document data at the time of repetitive processing of all the nodes making up the structure tree in step 905 will be explained with reference to FIG. 10. In FIG. 10, each element node is designated by an ellipse, each string node by a rectangle, and in the case where a given node has a plurality of subnodes, the latter are expressed by arranging them from left to right in the order of occurrence. Also, the numerical character attached to each node indicates the order of processing the particular node. As shown in FIG. 10, in step 905, a set of nodes are processed in such an order that starting with the node located at the root of the structure tree, each specific node is processed first and then the subnodes thereof are processed sequentially in the order of occurrence.

Now, the specific process in step 906, i.e. the specific process for determining whether or not a meta-node corresponding to a node currently closely watched in the analyzed document data exists in the structure index will be explained with reference to FIG. 11. FIG. 11 is a diagram showing the correspondence between a set of nodes making up the structure tree of the analyzed document data shown to the left of the drawing and a set of nodes (meta-nodes) constituting the structure tree of the structure index shown to the right in the drawing.

According to this embodiment, it is defined that a given node in the structure tree of the analyzed document corresponds to a metal node in the structure tree of the structure index in the case where the structure tree address of the particular node and the structure tree address of the particular meta-node are equal to each other.

The structure tree address is defined herein as a combination of the type of each node (element, string data, and in the case of an element, the element type) existing along a given route starting with the root of the structure tree and traced from a superior node to subnodes till reaching the specific node on the one hand and the number of the order in which the particular node occurs in the sibling nodes having the same node type.

For example, among a set of nodes in the analyzed document data shown in FIG. 11, the node 1101 has no superior node and the first "thesis" element node in the sibling nodes. Therefore, the structure tree address of this node can be expressed as "/thesis[1]". In similar fashion, the node 1102 is a subnode of the node 1101 and the first "chapter" element node in the sibling nodes. Therefore, the structure tree address of this node can be expressed as "/thesis[1]/chapter[1]". Also, the node 1103 is a subnode of the node 1102, and the second "clause" element node in the sibling nodes, and therefore the structure tree address of this node can be expressed as "/thesis[1]/chapter[1]/clause[2]". Further, the node 1104 is a subnode of the node 1103 and the first "paragraph" element node in the sibling nodes, and therefore the structure tree address of this node can be expressed as "/thesis[1]/chapter[1]/clause[2]/paragraph[1]".

In similar fashion, the structure tree address of each meta-node making up a structure tree of the structure index on the right side of FIG. 11 is determined as follows. The structure tree address of the meta-node 1105 is "/thesis[1]" and equal to that of the node 1101. Similarly, the structure tree address of the meta-node 1106 is "/thesis[1]/chapter[1]" and equal to the structure tree address of the node 1102. The structure tree address of the meta-node 1107 is "/thesis[1]/chapter[1]/clause[2]" and equal to the structure tree address of the node 1103. As a result, in step 906, it is determined that the node 1101 corresponds to the meta-node 1105, the node 1102 to the meta-node 1106, and the node 1103 to the meta-node 1107.

In the structure index of FIG. 11, there is no meta-node having the same structure tree address as the node 1104. Therefore, it is determined that there exists no meta-node corresponding to the node 1104 in the structure index. Thus, in step 907, a new meta-node is generated and registered in the structure index. In the case where a new meta-node corresponding to a given node is registered in step 907, a meta-node of the type corresponding to the particular node is added to the tail of the subnodes having a meta-node corresponding to a superior node of the particular node. In the case where a meta-node corresponding to the node 1104 in FIG. 11 is registered, for example, a meta-node of the element type "paragraph" is added to the subnodes of the meta-node 1107 corresponding to the node 1103 which is a superior node of the node 1104, and the particular meta-node is placed at the tail end of the sibling meta-nodes.

Now, the process of generating a structure index by sequentially superposing a plurality of analyzed document data will be explained with reference to FIG. 12. In FIG. 12, numerals 1201, 1203 and 1205 designate analyzed document data for documents to be registered, respectively. The elements of these analyzed document data are sequentially superposed on the existing structure index thereby to form a structure index. Initially, the structure index is vacant. First, when the analyzed document data 1201 of the document I is input, therefore, a structure tree equivalent to the analyzed data is generated and directly registered in the structure index, so that the structure index assumes the state shown by 1202. The newly-generated meta elements are assigned context identifiers E1 to D5, while the newly-generated meta string data are assigned context identifiers C1 to C3, respectively.

When the analyzed document data 1203 of the document 2 is input, nothing is done with the part where the existing structure index (1202) is superposed, but only subelements (hatched portions in the drawing) lacking a corresponding part in the structure index 1202 are newly registered.

The meta elements newly generated are assigned the context identifiers E6 and E7, and the meta string data newly generated is assigned the context identifier C4. Then, when the analyzed document data 1205 of the document 3 is input, nothing is done with the portion where the structure thereof is superposed with the existing structure index 1204, but only the subelements (hatched portions in the drawing) lacking a corresponding part in the structure index 1204 are registered anew. The meta elements newly generated are assigned context identifiers E8, E9 and E10, and the meta string data newly generated are assigned context identifiers C5 and C6. In this way, with the three documents registered, the structure index assumes the state shown by 1206.

FIG. 13 is a PAD showing the detail of step 306 in FIG. 3, i.e. the steps of processing the structured full-text data generation program 212 according to this embodiment.

First in step 1301, the structured full-text data generation program 212 reads the analyzed document data of the document to be registered described above from the analyzed document data storage area 217.

In step 1302, the document identifier for identifying the document to be registered is output to the structured full-text data storage area 219.

Then, in step 1303, the process of steps 1304 to 1306 is repeated for all the nodes (element nodes and string data nodes) making up the structure tree of the analyzed document data.

In step 1304, it is determined whether a node currently closely watched in the analyzed document data is an element node or a string data node. Only in the case where the node is a string data node, the process proceeds to step 1305. In step 1305, a context identifier corresponding to the string data node current closely watched is acquired from the analyzed document data and output to the structured full-text data storage area 219. Then, in step 1306, the content of the string data node current closely watched is output to the structured full-text data storage area 219.

Upon complete repetition of step 1303 and subsequent steps, the process for this program is terminated.

FIG. 14 shows a file format of the structured full-text data output by the structured full-text data generation program 212. FIG. 14 illustrates the case in which the structured full-text data is generated with the SGML document of FIG. 5 as an input. As shown in FIG. 14, the data file of the structured full-text data according to this embodiment is so structured that a document identifier is described at the head, followed by the repetition of as many pairs of a context identifier and a corresponding content as the string data existing in the document.

For example, the document identifier of the document to be registered corresponding to the structured full-text data shown in FIG. 14 is "D1". In FIG. 5, the string data described as the content of the element "date" is assigned the context identifier "C5". In FIG. 14 and other figures, these identifiers are expressed by symbols. However, the value actually recorded in the data as a document identifier is the number (integer) for identifying a specific document uniquely in a mass of documents to be registered, and the value of a context identifier is the number (integer) for identifying a specific meta-node uniquely in a mass of meta-nodes making up the structure index.

FIG. 15 is a PAD showing the detail of step 307 in FIG. 3, i.e. the steps of processing the string index generation program 213 according to this embodiment.

The string index generation program 213, first n step 1501, determines whether the existing string index is present in the string index storage area 220. In the case where the string index is not present in that area, the process proceeds to step 1502 and generates an initial (vacant) string index. In the case where the existing string index is detected, on the other hand, the process proceeds to step 1503 for reading the particular string index.

Then, in step 1504, the structured full-text data of the document to be registered is read from the structured full-text data storage area 219.

Then, in step 1505, the process of steps 1506 to 1507 is repeated for all the contents making up the structured full-text data.

In step 1506, a predetermined substring is extracted from the content currently closely watched in the structured full-text data. In step 1507, the correspondence between each substring extracted in step 1506 and the structured character position information of the substring is registered in the string index.

Upon complete repetition of step 1505 and subsequent steps, the process proceeds to step 1508, where the structured full-text data no longer required are deleted from the structured full-text data storage area 219 and discarded. Then, in step 1509, the updated string index is output and stored in the string index storage area 220, thus terminating the process.

In extracting a predetermined substring from a given content in step 1506, the length of the substring to be extracted is predetermined, and starting with the head of the content involved, the substrings of the predetermined length are sequentially extracted while at the same time incrementing the start position one by one. In the case where the length of the substring to be extracted is 2 characters and the content "{character pullout}", which means "actual example of conversion process" in Japanese, corresponding to the context identifier C129 among a set of contents shown in FIG. 14 is used as an object of processing, for example, six substrings are extracted, including "{character pullout}", "{character pullout}", "{character pullout}", "{character pullout}", "{character pullout}", "{character pullout}",

Further, as for the tail of the content, each string having a length of one or more characters is extracted. In the example described above, "{character pullout}" is extracted. In step 1507, these substrings are registered in the string index as a correspondence between each substring and the structured character position information indicating the position where the substring occurs. The structured character position information includes a document identifier of a document containing a corresponding substring, a context identifier for identifying the position of the string data containing the substring in the document structure, and the head character position of the substring in the document.

FIG. 16 shows a data structure of the string index according to this embodiment. FIG. 16 illustrates a part of the data structure (the portion associated with the content "{character pullout}(actual example of conversion process)") of the string index as of the time when the structured full-text data shown in FIG. 14 is processed using the string index generation program 213 and when the substring set contained in the structured full-text data is registered in the string index. In FIG. 16, however, the character node corresponding to {character pullout}at the tail of the content and the structured character position information are not shown. Also, the position of the character immediately before the content is expressed as a relative character position "X".

As shown in FIG. 16, the string index holds a list of the occurrence position information (the structured character position information including a combination of the document identifier, context identifier and the head character position) for all the substrings of a predetermined length occurring in the document to be registered. In order to increase the speed of index search, a data structure is employed in which a set of all substrings having the same first character share the first-character information. Also, the pointer to the first character from the root of the string index is arranged in the order of the code of the character indicated by the pointer. In similar fashion, the pointer from the first-character node to the second-character node is arranged in the order of the code of the character indicated by the pointer.

Once all the documents to be registered in the document data base are processed so that the set of substrings appearing therein are registered in the string index, then the position in the document where a string of given two characters (a method of searching for a string of other than two characters in length will be described later) appears can be determined simply by referring to the particular string index (without the need of scanning the document data proper at all).

According to this embodiment, the length of a substring is predetermined as two characters. Nevertheless, another length can be employed to construct a similar string index. Although the length of the substring is fixed according to this embodiment, a variable length can be used for constructing a similar string index.

The foregoing is the description of the document registration subsystem 101 according to this embodiment.

Now, an explanation will be given of the document search server according to this embodiment, i.e. the server 102 in FIG. 1.

FIG. 17 is a diagram showing a configuration of the document search server 102 according to this embodiment.

The document search server 102 shown in FIG. 17 includes a display 201, a keyboard 202, a central processing unit (CPU) 203, a communication control unit 206, a main memory 207, a magnetic disk unit 208 and a system bus 209.

The display 201 is used for displaying the operating situation of the server. The keyboard 202 is used for inputting commands for activation and deactivation of the server. The central processing unit 203 executes various programs making up the server. The communication control unit 206 is used for communication with the document registration subsystem 10 and the search clients 103 and 104 through the network 105 to exchange requests and data. The main memory 207 is used for holding various programs and temporary data for executing the process by the server. The magnetic disk unit 208 is used for storing a set of document data constituting the document data base and the index data referred to at the time of document search by the server. The system bus 209 is used for connecting these various units.

The main memory 207 holds therein a search condition analysis program 1701, a string index search program 1702, a document search control program 1703 and a system program 215. In addition, it holds a work area 216. The magnetic disk unit 208 secures therein an analyzed document data storage area 217, a structure index storage area 218, a string index storage area 220 and a search result data storage area 1704.

The search condition analysis program 1701 analyzes the search condition formula included in the search request received from the search clients 103, 104 and translates it into a condition specification that can be directly searched by the string index search program 1702. The string index search program 1702 search the string index stored in the string index storage area 220 in accordance with the condition specification translated by the search condition analysis program 1701, and stores the search result data thus obtained in the search result data storage area 1704.

The document search control program 1703 controls the activation and execution of the search condition analysis program 1701 and the string index search program 1702, while at the same time exchanging requests and data with the document registration subsystem 101 and the search clients 103, 104 through the network 105. The system program 215 provides the basic functions such as data input/output to and from the peripheral units for executing each program constituting the server in the computer. The work area 216 is used for storing data temporarily as required at the time of program execution.

According to this embodiment, the network 105 for is used for transferring data between the document search subsystem 101 and the search clients 103, 104. Alternatively, a configuration can be employed which uses a floppy disk, a magneto-optic disk, a write-once optical disk or the like portable medium. Also, a configuration is possible in which the document registration subsystem 101 and the document search server 102 are mounted on a single computer, and no data is transferred between them. Also, a configuration can be employed in which one or more search clients are executed by the same computer as the document search server 102, and no data is transferred between them.

FIG. 18 is a PAD briefly showing the steps of processing the document search according to the first embodiment of the invention. Upon activation of the document search control program 1703 in response to a server activation command from the keyboard 202, the program enters a loop in which as a server, it receives requests from the document registration subsystem 101 and the search clients 103, 104 and process them (step 1801). This loop continues until a command for deactivating the server is input from the keyboard 202.

The loop in step 1801 repeats the process (step 1802) for receiving requests from the document registration subsystem 101 and the search clients 103, 104 and the process (step 1803) for determining the type of the received requests and separating them into the processes corresponding to each type.

Step 1803 determines the type of the received requests, and in the case where a request is a data base update request (a request to register a new set of documents and update the document data base) transmitted from the document registration subsystem 101, the process branches into steps 1804 and 1805.

In the case where the request is a document search request (a request to search for a set of documents meeting a specific search condition) transmitted from the search clients 103, 104, on the other hand, the process branches into steps 1806, 1807 and 1808. Also, in the case where the request is an inquiry about the search result (a request to inquire about the result of a specific search process) transmitted from the search clients 103, 104, the process branches into step 1809. Further, in the case where the request is a document transfer request (a request to transfer a specified document data) transmitted from the search clients 103, 104, the process branches into step 1810. Upon complete processing at the destination of branching, the process returns to step 1802 to continue the loop.

In step 1804, the analyzed document data of a set of newly registered documents are received from the document registration subsystem 101, and added to the analyzed document data storage area 216. Then, in step 1805, the structure index and the string index updated in a manner to reflect the content of the newly-registered document set are received from the document registration subsystem 101 and stored in the structure index storage area 218 and the string index storage area 220, respectively.

In step 1806, the search condition analysis program 1701 is executed, and the search condition specified in the document search request is analyzed and converted into a condition specification (hereinafter referred to as the expanded search condition data) that can be directly processed by the string index search program 1702. Then, in step 1807, the string index search program 1702 is executed in response to an input of the expanded search condition data generated in step 1806, and the document set meeting the condition specified by the expanded search condition data is searched to determine the search result data. The search result data are stored in the search result data storage area 1704 in a manner corresponding to the search result identifier for identifying the search result data uniquely. Next, in step 1808, the search result identifier is returned to the search client constituting a request source.

In step 1809, a part or the whole of the search result data acquired in step 1807 is extracted from the search result data storage area 1704 in accordance with the content of the query and transferred to the search client constituting the request source.

In step 1810, the analyzed document data of the documents (all the plural specified documents, if any) specified in the document transfer request are extracted from the analyzed document data storage area 217 and transferred to the search client constituting the request source.

FIG. 19 is a PAD showing the detail of step 1806 in FIG. 18, i.e. the steps of processing in the search condition analysis program 1701 according to this embodiment.

The search condition analysis program 1701, upon activation thereof in response to an input thereof including the search condition specified in the document search request, first determines, in step 1901, whether the structural condition is included in the search condition or not. Only in the case where the structural condition is so included, the process including steps 1902 and 1903 is executed. Unless the structural condition is so included, on the other hand, the process proceeds to step 1904.

In step 1902, the structure index is read from the structure index storage area 218. In step 1903, a mass of context identifiers of all the string data included in the structure meeting the structural condition are determined with reference to the structure index. The mass is hereinafter called the context identifier mass.

In step 1904, it is determined whether the length of the string specified as a string condition in the search condition exceeds the length of the substring predetermined at the time of generating the string index. In the case where the length of the specified string exceeds the substring length, the process proceeds to step 1905, where the start character position is incremented one by one from the head of the specified string, and a set of substrings having the same length as the substring length is extracted, so that a substring list including these substrings as elements is generated. In the case where the length of the specified string does not exceed the substring length, on the other hand, the process proceeds to step 1906 for generating a vacant (lacking elements) substring list.

In step 1907, an expanded substring data including the context identifier mass obtained in step 1903, the specified string included in the query and the substring list generated in step 1905 or 1906 is generated and the process is terminated.

FIG. 20 is a diagram showing an example of generating the expanded search condition data in processing the search condition analysis program 1701.

In FIG. 20, numeral 2001 designates an example of search condition specified in the document search request. The search condition 2001 is configured of the structural condition specification "chapter/paragraph[1]" and the string condition specification "{character pullout}" which is Japanese expression of "guard". The search condition specifies the requirement to search for a case in which the string "{character pullout}" appears in the first "paragraph" element immediately under the "chapter" element.

Assume that the content of the structure index is as shown by 2002. In step 1903, the context identifiers of the "paragraph" element meeting the structural condition specification are seen to be E5 and E14 by reference to the structure index. As a result, it is known that a case should be searched for in which the string "{character pullout}" occurs in the string data underlying these paragraphs, i.e. in the string data with the context identifier C3 or C9. In view of the fact that the position of occurrence is registered only for the substrings having the length of 2 in the string index used for search, however, the specified string having three characters cannot be searched directly. In step 1905, therefore, a list of substrings of length 2 is generated by decomposing the specified string. In the case where the specified string is "{character pullout}" as described above, for example, the extracted substrings are "{character pullout}" and "{character pullout}".

As a result, in step 1907, the expanded search condition data shown by 2003, i.e. the data having the context identifier mass {C3, C9}, the specified string of "{character pullout}" and the substring list of {"{character pullout}" and "{character pullout}"} are generated.

FIG. 21 is a PAD showing the detail of step 1807 in FIG. 18, i.e. the steps of processing the string index search program 1702 according to this embodiment.

The string index search program 1702 is activated in response to an input of the expanded search condition data generated by the search condition analysis program 1701. This program, upon activation thereof, first reads the string index from the string index storage area 220. Then, the process proceeds to step 2102 for initializing the search result data.

Then, in step 2103, the length of the specified string included in the expanded search condition data is compared with the length of the substring predetermined at the time of generating the string index. In the case where the length of the specified string is equal to the length of the substring, the process proceeds to step 2104. In the case where the length of the specified string is shorter than the length of the substring, the process proceeds to step 2105. In the case where the length of the specified string exceeds the length of the substring, on the other hand, the process branches to step 2106.

In step 2104, the specified string is searched for in the string index to determine a mass of the structured character positions corresponding to the string. Then, only a set of structured character position information having any one of the context identifiers contained in the context identifier mass in the expanded search condition data is extracted thereby to generate a mass of hit positions including the extracted set of the structured character information.

In step 2105, the string index is searched for the specified string, and a mass of all the structured character position information existing before the character node corresponding to the tail end of the string is acquired, and only the mass of structured character position information having any one of the context identifiers included in the context identifier mass in the expanded search condition data is extracted thereby to generate a mass of hit positions including the extracted mass of structured character position information.

In step 2106, the process of step 2107 is repeated for each substring configuring the substring list in the expanded search condition data. In step 2107, the string index is searched for the substring, a mass of the structured character position information corresponding to the string is acquired, and only a set of structured character position information having any one of the context identifiers included in the context identifier mass in the expanded search condition data is extracted, and the extracted set of the structured character position information are stored in a manner corresponding to the substring.

Upon complete repetition in step 2106, the process proceeds to step 2108, where the neighboring plural-character occurrence is determined for the set of the structured character position information stored in at corresponding positions in step 2107, and only the set of the structured character position information constituting the specified string is extracted as a neighboring string. Also, in each set thus extracted, only the structured character position information corresponding to the substring located at the head of the specified string is extracted, and a mass of hit positions is generated from the extracted set of the structured character position information.

Upon complete processing of all the steps branching from step 2103, the process proceeds to step 2109 where a set of structured character position information included in the mass of hit positions are grouped into a set having the same document identifier and registered in the search result data.

Now, step 2108, i.e. the processing of determining a neighboring plural-character occurrence in the process of the string index search program 1702 will be explained in more detail with reference to FIG. 22.

In FIG. 22, numeral 2201 designates an example (a part) of the string index. The string index holding the data shown in 2201 is searched according to the condition indicated by the expanded search condition data 2003 in FIG. 20. As shown in step 2107, first, only the structured character position information having a context identifier C3 or C9 are extracted from those corresponding to the substrings "{character pullout}" and "{character pullout}". The data corresponding to extracted set of the structured character position information stored at positions corresponding to the substring are shown in 2202. The neighboring plural-character occurrence is determined based on this data.

In the process of determining the neighboring plural-character occurrence of step 2108, it is determined whether there exists a combination of the structured character position information constituting the specified string in the extracted structured character position information as a whole. Such a combination is required to meet the following conditions:

(1) All the document identifiers coincide among the sets of structured character position information.

(2) All the context identifiers coincide among the sets of structured character position information.

(3) By arranging the structured character position information in the ascending order of character position value and arranging the corresponding substring sets according to the string positions, a string equal to the specified string is obtained as a whole.

The cases shown by 2202 include a combination which constitutes the specified string "{character pullout}" as a whole.

Once a combination of the structured character position information meeting the above-mentioned condition is found, the structured character information with a smallest character position value is selected as a representative of the structured character information set included in the combination and registered in the mass of hit positions.

FIG. 23 is a diagram showing a data structure of the search result data generated as a result of individual search process. As shown in FIG. 23, the search result data has such a configuration that the character position information set included in the mass of hit positions is divided into groups by document identifier, a list with the group as an element is generated, and information indicating the total number of detected documents is added. The search result data are set at positions corresponding to the search result identifiers for identifying the search result data uniquely in the mass of the search result data, and stored in the search result data storage area 1704.

Then, step 1809 of FIG. 18, i.e. the process of transferring the search result corresponding to the content of the inquiry about the search result to the source client will be explained in more detail with reference to FIG. 24. FIG. 24 is a PAD showing the process of step 1809 in detail.

The body of the search result inquiry is composed of three parts including a search result identifier specification, an inquiry type specification and a document identifier specification. Some type of inquiry may have no document identifier specification.

As shown in FIG. 24, the process corresponding to step 1809 first includes step 2401 in which the search result data corresponding to the search result identifier specified in the inquiry is searched, and the search result data is read from the search result data storage area 1704.

Next, in step 2402, the inquiry type is determined, and in the case where the inquiry type is the inquiry about the number of detected documents, the process branches to step 2403. In the case where the inquiry type is the document identifier inquiry, the process branches to step 2404, and in the case where the inquiry type is the character position information inquiry, the process branches to step 2405.

In step 2403, the number of detected documents is extracted from the search result data read in step 2401, and the value of the number of detected documents is transferred to the source search client, thereby terminating the process.

In step 2404, the mass of all document IDs included in the search result data read in step 2401 is obtained, and the mass is transferred to the source search client, thereby terminating the process.

In step 2405, a list of the structured character position information corresponding to the document identifier designated in the inquiry is extracted from the search result data read in step 2401, and the list is transferred to the source search client, thereby terminating the process.

The foregoing is the description of the document search server 102 according to this embodiment.

Now, an explanation will be given of the document search client according to the first embodiment of the invention, i.e., the component parts 103 and 104 in FIG. 1.

The document search client shown in FIG. 25 is configured of a display 201, a keyboard 202, a central processing unit (CPU) 203, a communication control unit 206, a main memory 207, a magnetic disk unit 208 and a system bus 209.

The display 201 is used for displaying the screen by way of which the user inputs the search condition interactively, and also for displaying the search result, etc. The keyboard 202 is used for inputting a command for executing a search condition, a search process, etc. The central processing unit 203 executes various programs configuring the client. The communication control unit 206 is used for communicating with the document search server 102 through the network 105 and exchanging requests and data. The main memory 207 is used for holding various programs and temporary data with which the client executes the process. The magnetic disk unit 208 is used for storing the documents obtained as a result of search and other data. The system bus 209 is used for connecting the various units mentioned above.

The main memory 207 holds therein a query input program 2501, a search result display program 2502, a client control program 2503, a system program 215 and a work area 216. An analyzed document data storage area 217 and a search result data storage area 1704 are secured in the magnetic disk unit 208.

The search condition input program 2501 inputs and interprets the search condition interactively with the user. The search result display program 2502 displays the search result received from the document search server 102. The client control program 2503 controls the activation and execution of the search condition input program 2501 and the search result display program 2502, while at the same time exchanging requests and data with the document search server 102 through the network 105. The system program 215 provides basic functions such as input/output to and from the peripheral units for executing each program configuring the client in the computer. The work area 216 is used for storing the data temporarily required for program execution.

According to this embodiment, the network 105 is used for transferring data with the document search server 102. As an alternative, a floppy disk, a magneto-optic disk, a write-once optical disk or the like portable medium can be used. Also, one or more search clients can be executed on the same computer as the document search server 102 without data transfer between them. It is also possible to connect a printer to this client to print the search result.

FIG. 26 is a PAD showing the steps of operation of the search client according to the first embodiment of the invention. The client control program 2503 is activated by a client activation command or the like entered from the keyboard 202. Then, the program receives a document search command from the user and enters a loop for processing it (step 2601). This loop continues until a deactivation command is input from the client by way of the keyboard 202.

The loop of step 2601 repeats the process shown in steps 2602 to 2605.

In step 2602, the search condition input program 2501 is executed. The search condition is input interactively with the user and the search condition is converted into a document search request that can be interpreted by the document search server 102. In step 2603, the document search request is transmitted through the network 105 to the document search server 102. In step 2604, the search result identifier is returned to and received in response to the document search request by the document search server 102.

In step 2605, the search result display program 2502 is executed in response to the search result identifier as an input, while at the same time making an inquiry and displaying the search result on the screen interactively.

FIG. 27 is a PAD showing the detailed steps of processing the search condition input program 2501 executed in step 2602 in FIG. 26. The search condition input program 2501, once activated from the client control program 2503, displays on the display 201 a screen for the user to specify the search condition interactively (step 2701).

Next , in step 2702, the search condition specified by the user on the screen is read. Then, in step 2703, the search condition read in step 2702 is transformed into the form of the document search request that can be directly interpreted by the document search server 102.

FIG. 28 is a PAD showing the detailed steps of processing the search result display program 2502 executed in step 2605 of FIG. 26. The search result display program 2502, once activated in response to the search result identifier from the client control program 2503, immediately enters the loop of step 2801. This loop repeatedly executes the process shown in steps 2802 to 2815 until a command for terminating the display of the search result is received from the user.

In the loop of step 2801, which starts with step 2802, a screen used for displaying the search result and the command input from the user is displayed on the display 201. Then, in step 2803, the content specified by the user on the screen is read.

In step 2804, the type specified by the user is determined, and the process proceeds to a step corresponding to the particular type. Specifically, in the case where the command is for requesting the display of the number of detected documents, the process proceeds to steps 2805 and 2806. In the case where the command is for requesting the display of a document identifier list of a set of detected documents, on the other hand, the process proceeds to steps 2807 and 2808. In the case where the commands is for requesting the display of the content of the document, the process proceeds to steps 2809 to 2815. Upon complete processing of these steps, the process returns to step 2802 thereby to resume the loop.

In step 2805, a number-of-detected-documents inquiry list is generated for inquiring about the number of detected documents and transmitted to the document search server 102. Then, in step 2806, the number of detected documents transferred from the document search server 102 in response to the request is received and displayed on the display unit 201.

In step 2807, a document identifier inquiry list for inquiring about the document identifier list of the detected document set is generated, and the inquiry is transmitted to the document search server 102. Then, in step 2802, a mass of the document identifiers transferred from the document search server 102 in response to the inquiry is received and the document identifier set included in the mass is displayed as a list on the display unit 201.

In step 2809, the document identifier for specifying the document to be displayed is input. Then, in step 2810, a document transfer request is generated for acquiring analyzed document data for the document to be identified by the identifier, and transmitted to the document search server 102. Then, in step 2811, the analyzed document data transferred from the document search server 102 in response to the request is received, and stored in the analyzed document data storage area 217.

In step 2812, a character position information inquiry is generated for inquiring about the position where a string specified in the search condition is detected in the analyzed document data, which inquiry is transmitted to the document search server 102. Then, in step 2813, the list of the structured character position information transferred from the document search server 102 in response to the search condition is received and stored in the search result data storage area 1704.

In step 2814, the data are processed for reversed display of the specified string detected at the time of document search, with reference to the analyzed document data received in step 2811 and the structured character position information list received in step 2813. In step 2815, the analyzed document data thus reversed is formatted and displayed on the display unit 201.

The foregoing is the description of the operating steps performed on the part of the search clients 103 and 104 according to the first embodiment of the invention.

(2) Second Embodiment

Now, a second embodiment of the present invention will be explained with reference to the drawings.

FIG. 29 is a diagram showing a configuration of a document registration subsystem 101 according to this embodiment.

The document registration subsystem 101 shown in FIG. 29 has the same hardware configuration as the corresponding subsystem in the first embodiment shown in FIG. 2. The main memory 207, however, in addition to the program set held in the first embodiment, holds a last-come first-served structure index generation program 2901. Also, a last-come first-served structure index storage area 2902 is secured in the magnetic disk unit 208 in addition to the area set secured in the first embodiment. The last-come first-served structure index generation program 2901 is such that the information on the logic structure held in the analyzed document data of the document to be registered is registered in the last-come first-served structure index stored in the last-come first-served structure index storage area 2902 thereby to update the last-come first-served structure index.

According to this embodiment, the document registration control program 214 controls the activation and execution of the document structure analysis program 210, the structure index generation program 211, the last-come first-served structure index generation program 2901, the structured full-text data generation program 212 and the string index generation program 213, while at the same time transferring the analyzed document data, the structure index, the last-come first-served structure index and the string index generated by these programs to the document search server 102 through the network 105.

This embodiment is configured to read the document to be registered stored in the floppy disk 205 as an input. Alternatively, a configuration is possible to read from a magneto-optic disk, a write-once optical disk or the like portable medium. It is also possible to employ a configuration in which the document transferred by way of the network 105 is input. Further, according to this embodiment, the network 105 is used for transferring the analyzed document data, the structure index, the last-come first-served structure index and the string index to the document search server 102. As an alternative, a configuration is possible to employ a portable medium such as a floppy disk, a magneto-optic disk or a write-once optical disk. As another alternative, the document registration subsystem 101 and the document search server 102 can be mounted on a single computer, thus eliminating the data transfer.

FIG. 30 is a PAD briefly showing the steps of processing the document registration according to the second embodiment of the invention. The steps shown in FIG. 30 are substantially similar to those of the first embodiment shown in FIG. 3, but is different in that step 3001 is added immediately after step 305, and step 3002 is executed instead of step 308.

In step 3001, the last-come first-served structure index generation program 2901 is executed in response to the analyzed document data generated in step 304 input thereto. The last-come first-served structure index generation program 2901 first reads the current last-come first-served structure index from the last-come first-served structure index storage area 2902, registers the structure information held in the analyzed document data in the last-come first-served structure index, and stores the updated last-come first-served structure index again in the last-come first-served structure index storage area 2902.

In step 3002, all the analyzed document data stored in the analyzed document data storage area 217, the structure index stored in the structure index storage area 218, the last-come first-served structure index stored in the last-come first-served structure index storage area 2902 and the string index stored in the string index storage area 220 are transferred to the document search server 102 through the network 105.

FIG. 31 is a PAD showing the detail of step 3001 in FIG. 30, i.e. the steps of processing the last-come first-served structure index generation program 2901 according to this embodiment.

The last-come first-served structure index generation program 2901 determines, first in step 3101, whether the existing last-come first-served structure index is present in the last-come first-served structure index storage area 2902. In the case where the last-come first-served structure index does not exist in that area, the process proceeds to step 3102 for generating an initial (vacant) last-come first-served structure index. In the cases where the existing last-come first-served structure index is detected, on the other hand, the process proceeds to step 3103 for reading the last-come first-served structure index.

Then, in step 3104, the analyzed document data of the document to be registered is read.

Then, in step 3105, the process of steps 3106 to 3109 is repeated for all the nodes (elements and string data) making up the structure tree of the analyzed document data.

In step 3106, it is determined whether a meta-node (meta element or meta string data) corresponding to the node currently closely watched in the analyzed document data exists in the last-come first-served structure index. In the absence of a corresponding meta-node, the process proceeds to step 3107, where a meta-node corresponding to the node is generated and registered in the last-come first-served structure index. Further, the registered meta-node is assigned a last-come first-served context identifier as the number for uniquely identifying it in the last-come first-served structure index (step 3108).

In step 3109, the correspondence between the node currently closely watched in the analyzed document data and the last-come first-served context identifier for identifying the meta-node corresponding to particular node in the last-come first-served structure index is added to the analyzed document data. In this way, the analyzed document data is updated.

Upon complete repetitive processing of step 3105 and subsequent steps, the process proceeds to step 3110, where the updated analyzed document data is output and stored in the analyzed document data storage area 217. Then, in step 3111 the updated last-come first-served structure index is output and stored in the last-come first-served structure index storage area 2902 thereby to terminate the process.

As described above, the steps of processing according to the last-come first-served structure index generation program 2901 substantially corresponds to the steps of processing according to the structure index generation program 211 shown in FIG. 9. However, the order in which the structure tree of the analyzed document is traced in the repetition of step 3105 is different from that for the structure index generation program 211, with the result that the structure tree of the last-come first-served structure index is different from the structure tree of the structure