Method for computerized information retrieval using shallow linguistic analysis5696962Abstract A computerized method for retrieving documents from a text corpus in response to a user-supplied natural language input string, e.g., a question. An input string is accepted and analyzed to detect phrases therein. A series of queries based on the detected phrases is automatically constructed through a sequence of successive broadening and narrowing operations designed to generate an optimal query or queries. The queries of the series are executed to retrieve documents, which are then ranked and made available for output to the user, a storage device, or further processing. In another aspect the method is implemented in the context of a larger two-phase method, of which the first phase comprises the method of the invention and the second phase of the method comprises answer extraction. Claims What is claimed is: Description COPYRIGHT NOTIFICATION
______________________________________
term represents the single search term term. A
term can be an individual word or in some
cases another query.
<term1 term2 . . . >
represents strict ordering of terms. The IR
subsystem determines that a document matches
this query if and only if all the terms
enclosed in the angle brackets appear in the
document within a sequence containing a
maximum of p intervening words between the
first and last words of the sequence (that is,
a sequence of at most p+2 words) and in the
exact order in which they appear in the
brackets. The query <0 phrase> matches only
the exact wording of phrase phrase. Strict
ordering queries can be nested as in, for
example, the query
<5 <0 Abraham Lincoln> president>
(p term1 term2 . . . )
represents terms within a proximity of p words
from one another, with no strict ordering
imposed. The IR subsystem determines that a
document matches this query if and only if all
the terms enclosed. in parentheses appear in
the document within a sequence containing a
maximum of p intervening words. For example,
the query
(3 big white ball)
matches a document containing a sentence that
begins "Taking up the white ball, with the big
bat in its hands, the gorilla began to play
baseball . . . " because the sequence that
begins with the first term matched ("white")
and ends with the last term matched ("big")
has no more than 3 intervening words between
the first and last words of the sequence. The
order of the terms within the sequence is not
considered for this query. Proximity queries
can be nested and can also contain strict
ordering queries as in, for example, the query
(20 <0 Abraham Lincoln>
(10 slavery freedom))
›term1 term2 . . . !
represents a Boolean logical AND of terms. The IR
subsystem determines that a document matches this
query if and only if each of the terms within the
square brackets occurs at least once in the
document. AND queries can include proximity
queries or strict ordering queries, as in, for
example, the query
›<0 Abraham Lincoln>
(10 slavery freedom)!
______________________________________
5.2 Query Execution Executing a query, whether an initial query or one of the other queries to be described below, comprises a number of component steps, as illustrated for one embodiment of the invention in the flowchart of FIG. 4. In this embodiment query execution is carried out by a function called MAKE-QUERY. In step 400 MAKE-QUERY accepts as input the query to be executed. In step 402, the query is compared to any results previously stored by MAKE-QUERY to determine whether it has previously been executed. If the query has already been executed, MAKE-QUERY returns immediately; otherwise MAKE-QUERY continues with step 405. In step 405 MAKE-QUERY sends the query from the primary query construction subsystem to the IR subsystem via a channel that connects these subsystems. In step 410 the IR subsystem performs the query in conjunction with the text corpus to determine "hits," that is, documents that match the query. In step 415 the IR subsystem sends the results of this processing back to the primary query construction subsystem via a channel. The results comprise the number of hits found plus a list of the retrieved documents or of identifiers, such as names or index numbers, for these documents. In step 420 MAKE-QUERY determines whether to store the returned results. More specifically, in step 420, MAKE-QUERY determines whether the number of hits is greater than or equal to a threshold value such as 1, and less than a maximum value such as 50. The threshold and maximum values are typically specified by default (e.g., by the programmer) ahead of time. They also can be set by the user in some embodiments or dynamically by the primary query construction subsystem in some embodiments. The exact values chosen will depend on a number of factors, including without limitation the particular task to which the method of the invention is applied, the size of the text corpus, the amount of information the user wishes to see, and so forth. Furthermore, also in step 420 MAKE-QUERY checks to see whether the current query is a broadened version of a previous query (see below), and if so, whether the query has returned the same number of hits as that previous query. If so, the results are deemed redundant and so will not be stored. Depending on the outcome of step 420, either step 421 or step 423 is executed. If in step 420 the number of hits is determined to be within the range established by the threshold and maximum values and the results are not deemed redundant, step 421 is executed. In step 421 the list of documents or document identifiers sent back by the IR subsystem is stored in association with the input query and the number of hits. If in step 420 the number of hits is determined to be less than the threshold value or greater than or equal to the maximum value, or the results are deemed redundant, then step 423 is executed. In step 423 MAKE-QUERY discards the results returned by the IR subsystem, and resets the number of hits to an artificial value. The artificial value is 0 if the number of hits detected by the IR subsystem was less than the threshold or if MAKE-QUERY deemed the results to be redundant, and is the maximum value if the number of hits was greater than or equal to the maximum value. After step 421 or step 423 is complete execution proceeds at step 430. In step 430, MAKE-QUERY returns the number of hits along with any results stored in step 421, or, if step 423 was executed, an indication that the results were discarded. If step 423 was executed then number of hits that MAKE-QUERY returns step 430 is the artificial value set in step 423. 5.3 Broadening and Narrowing A word on broadening and narrowing is helpful at this point. A "broad" query is one that generates relatively many hits (that is, document matches) when executed. It is so called because it represents a relatively broad scope of search by the IR subsystem. A "narrow" query is one that generates relatively few hits when executed. It is so called because it represents a relatively narrow scope of IR search. "Broadening" means the formulation of a new query from an existing query in such a manner that the new query is broader than--that is, tends to generate more hits than-the existing query. "Narrowing" means the formulation of a new query from an existing query in such a manner that the new query tends to generate fewer hits than the existing query. Broadening can be achieved, for example, by relaxing proximity constraints in the existing query or by deleting words or component queries from the existing query in order to formulate the new query. Narrowing can be achieved, for example, by tightening proximity constraints in the existing query or by adding words or component queries to the existing query in order to formulate the new query. 5.4 Overview of Query Reformulation FIG. 5 flowcharts the component steps of automatic query construction and reformulation in one embodiment of the method. In step 500 a table for tallying information about certain queries is created. In steps 505 through 570 a sequence of queries is performed automatically by the system with little or no user intervention. These queries are based on the completed analysis of the input string as stored in step 330. Steps 505, 510, and 525 concern simple queries, which are queries based on individual noun phrases or title phrases. In step 505 initial queries are constructed and executed based on noun phrases in the input string. In step 510 broadened versions of these initial queries can be constructed and executed. In step 525 simple queries based on title phrases are constructed and executed. Steps 530, 540, 545, 550, 560, and 570 primarily concern compound queries, which are queries comprising two or more components, each component being a simple query or another search term such as a verb. In step 530 queries based on combinations of title phrases and noun phrases are constructed and executed. In step 540 queries based on maximal combinations of noun phrases are constructed and executed. In step 545 additional queries based on lesser combinations of noun phrases are constructed and executed. In step 550 queries based on main verbs in conjunction with title phrases or noun phrases are constructed and executed. In step 560 queries based on head words of noun phrases can be constructed and executed. In step 570 still further queries, such as queries based on synonyms or hyponyms of words in noun phrases, can be constructed and executed. Each of the steps of FIG. 5 will now be more fully described. 5.5 Frequency Table In step 500 a table is created for tallying information about certain of the queries that will be created and processed in subsequent steps. This table is called the noun phrase frequency table, or simply the frequency table. The frequency table is empty at its creation and is filled in as later steps are executed. The frequency table can be stored in the processor's memory. Alternatively or additionally it can be stored in a storage device coupled to the processor. The frequency table can be represented as an array, a list of 3-tuples, or any other suitable data structure. In one embodiment the frequency table is represented as a two-dimensional, 3-rows-by-n-columns array, where n is the number of noun phrases detected in the input string. Each column of the array is associated with exactly one of the n noun phrases and represents a 3-tuple containing the elements (i, number-of-hits, proximity). Here i is the unique index number, stored in step 330, associated with the noun phrase; number-of-hits is the number of document matches, or "hits", found for this query; and proximity, which is more fully explained by way of the examples given below, is an indication of the proximity constraints used in the query. Table 1 represents the contents of an example of a frequency table according to this embodiment prior to the execution of initial queries:
TABLE 1
______________________________________
Example of a frequency table before initial queries are run
i 1 2 3 4 . . .
______________________________________
number of hits
0 0 0 0 . . .
proximity 0 0 0 0 . . .
______________________________________
It will be observed that a single value is used to represent proximity information in this embodiment of the frequency table. A single value suffices because in this embodiment noun phrase queries that specify strict ordering--that is, queries of the form <p NP>--are performed if and only if the value of p is 0, and noun phrase queries that do not specify strict ordering--that is, queries of the form (p NP)--are performed if and only if the value of p is not 0. In other embodiments, separate values can be stored for proximity and strict ordering, or other information can be stored. 5.6 Initial Queries In step 505, initial IR queries are constructed and executed. Initial queries are based on the noun phrases that were detected in the input string during question processing. There is one initial query for each such noun phrase. Each initial query is of the form <0 NP.sub.i > where NP.sub.i is the text of the noun phrase whose unique index number is i. In other words, an initial query seeks to match the noun phrase exactly as it appears in the input string. In an embodiment that uses the MAKE-QUERY function described above, step 505 is executed by repeated calls to E-QUERY, one for each initial query. After each such call to MAKE-QUERY returns, if MAKE-QUERY returns results--that is, if the number of hits is within the range established by the threshold and maximum values--then the frequency table is updated by setting the number-of-hits value for the initial query to the number of hits returned by MAKE-QUERY. Otherwise the number-of-hits value remains at zero or is set to the maximum, according to the artificial value returned by MAKE-QUERY. An example of this is illustrated in Table 2 below. Table 2 represents the contents of an example of a frequency table after the execution of four initial queries for index values i=1, 2, 3, 4:
TABLE 2
______________________________________
Example of a frequency table after initial queries are run
i 1 2 3 4 . . .
______________________________________
number of hits
2 30 0 0 . . .
proximity 0 0 0 0 . . .
______________________________________
It can be seen that the first initial query returned 2 hits and the second returned 30 hits, while the third and fourth each returned a number of hits that was less than the threshold. For each query the proximity value is 0 because the proximity constraint value in the query is 0 and the query was a strict-order query. In some embodiments, if a noun phrase contains only one word it can be processed without an initial query as such in step 505. In certain IR subsystems it is inefficient to run an initial query for a one-word noun phrase because information on the frequency of individual words within the text corpus is made directly available without the need for a query. Thus instead of calling MAKE-QUERY for the query <0 NP.sub.i >, where i is the unique index number of the one-word noun phrase, in such embodiments the IR subsystem can be instructed to look up the frequency with which the single word that is the noun phrase appears in the corpus. This frequency is then stored in the frequency table as the number of hits associated with noun phrase i. Step 505 is followed by step 510, which concerns the construction and execution of broadened versions of the initial queries of step 505. Broadened versions of initial queries are of the form (p NP.sub.i) where p is a proximity constraint. In other words, broadened versions of initial queries seek to match the words of the noun phrase regardless of the order in which they appear in a document so long as these words occur within a sequence of words that is at most p+2 words long. Broadened queries are executed when initial queries or previous broadened queries return insufficient hits. Step 510 comprises a number of component steps 511 through 520 as illustrated in FIG. 6. Steps 511 through 519 form a loop that is executed repeatedly n times, once for each noun phrase detected in the input string. In this embodiment the noun phrases have index numbers i that run consecutively from 1 to n. Thus each pass of the loop concerns the ith noun phrase--that is, the noun phrase whose unique index number is i. In step 520 the updated frequency table produced in the loop of steps 511 through 519 is sorted according to number of hits. In step 511 the ith noun phrase is checked to see whether it contains more than one word. One-word noun phrases are not susceptible to broadening through relaxation of proximity constraints. If the ith noun phrase is only one word long, most of the remaining steps of the loop are skipped and execution proceeds at step 518. If the ith noun phrase contains more than one word execution proceeds with step 512. In step 512 the frequency table is examined to determine whether the number of hits for the ith noun phrase is nonzero. The number of hits can be nonzero either because hits were found for the initial query (or the alternative direct determination of frequency for one-word noun phrases), or because hits were found for previous broadened initial queries. If the number of hits for the ith noun phrase is zero, execution proceeds at step 513; otherwise, most of the remaining steps of the loop are skipped and execution proceeds at step 518. In steps 513 through 518 a loop is executed over a series of increasing proximity values p, for example, once for each of the values p=1, 5, 10, 20, and 40. This loop is executed either until a broadened initial query returns a number of hits that is in the range specified by the threshold and maximum values of MAKE-QUERY, or until the series of increasing proximity values is exhausted. In step 513 a broadened initial query of the form (p NP.sub.i), where NP.sub.i is the ith noun phrase, is constructed and executed. The query can be executed using MAKE-QUERY. In step 514 it is determined whether the number of hits returned by MAKE-QUERY is nonzero. If so, execution proceeds at step 515; otherwise execution proceeds at step 516. In step 515 the frequency table is updated to reflect the number of hits detected by the broadened initial query of step 513. That is, the number-of-hits value associated with the index i in the frequency table is set equal to the number of hits returned by MAKE-QUERY. This is a value between the threshold and maximum values inclusive, with the maximum value representing any number of hits greater than or equal to the maximum. Thereafter execution proceeds at step 518. In step 516, which is executed if the broadened initial query of step 513 detects a number of hits less than the threshold, the frequency table is left unchanged so that the number-of-hits value for noun phrase i remains at zero. The proximity value p is checked. If the proximity value has not yet achieved its maximum, then execution proceeds at step 517; otherwise execution proceeds at step 518. In step 517 the proximity value p is updated to the next larger value. For example, if the series of increasing proximity values for the loop of steps 513 through 518 is p=(1, 5, 10, 20, 40) and step 513 was executed with p=10, then in step 517 p is increased to 20. Thereafter execution transfers back to step 513, where a broadened initial query using the newly updated proximity value will be run. In step 518, the noun phrase index i is checked to see whether all n noun phrases have yet been processed. If not, then in step 519 the index value i is incremented and execution transfers back to the top of the loop at step 511; otherwise, the loop over noun phrases terminates and execution proceeds at step 520. Table 3 represents the contents of an example of a frequency table after initial processing and broadening of queries for index values i=1, 2, 3, 4, but before execution of step 520:
TABLE 3
______________________________________
Example of unsorted frequency table
after initial query broadening
i 1 2 3 4 . . .
______________________________________
number of hits
2 30 6 10 . . .
proximity 0 0 10 5 . . .
______________________________________
In Table 3, as in Table 2, 2 hits were detected for the initial query for noun phrase 1 and 30 hits were detected for the initial query for noun phrase 2. (It will be observed that if either of these noun phrases, for example, noun phrase 2, is only one word long, then its associated number of hits was determined directly from corpus frequency information, with no need for an initial query as such.) Thus no broadened queries were run for these two noun phrases. It can further be seen in Table 3 that 6 hits were detected for noun phrase 3, but only after the proximity constraint was broadened out to 10. Assuming that the series of proximity values p=1, 5, 10, 20, 40 is used, it follows that the initial query for noun phrase 3 both in its original form and broadened out to proximity constraints of 1 and 5 generated a number of hits below the threshold, for example, 0 hits. Similarly, 10 hits were detected for the broadened initial query for noun phrase 4 at a proximity value of 5, that is, for the query (5 NP.sub.4); none were detected for the original initial query <0 NP.sub.4 > or the broadened initial query (1 NP.sub.4). In step 520 the frequency table is sorted in order of descending number of hits. For example, after step 520 the frequency table of Table 3 becomes the frequency table shown in Table 4:
TABLE 4
______________________________________
Example of sorted frequency table
after initial query broadening
i 2 4 3 1 . . .
______________________________________
number of hits
30 10 6 2 . . .
proximity 0 5 10 0 . . .
______________________________________
In Table 4 the columns of Table 3 have been sorted so that the noun phrase for which the most hits were found appears in the leftmost column, the noun phrase for which the second most hits were found appears in the second-from-leftmost column, and so forth. Proximity values are not taken into account in the sorting of step 520. The flowchart of FIG. 6 can be expressed more compactly as pseudocode, for example as follows:
______________________________________
/* Outer loop over noun phrases */
FOR i = 1 to n
{IF (NP > 1 word) and (frequency.sub.-- table(i) = 0)
/* Inner loop over proximity values */
{FOR p = (1, 5, 10, 20, 40)
{/* Construct and execute query;
save number of hits */
query = (p NP.sub.1);
MAKE-QUERY (query,
number.sub.-- of.sub.-- hits,
results);
/* If the number of hits was above the
threshold, terminate inner loop
by setting p equal to 40 */
IF (number.sub.-- of.sub.-- hits >0)
{frequency.sub.-- table(i) = number.sub.-- of.sub.-- hits;
p = 40;
}
}
}
}
______________________________________
Here the statement query=(p NP.sub.i) constructs the query (p NP.sub.i). The statement M-QUERY(query, number.sub.-- of.sub.-- hits, results) executes the query and returns the number of hits and the query results (if any) in the variables number.sub.-- of.sub.-- hits and results respectively. If no results are returned, the variable results is set to the Boolean value FALSE or to some other value indicating that results were not returned. In the remainder of this Part (Part II) of the description, pseudocode is used from time to time to express certain steps of the method. 5.7 Title Phrase Queries step 525 concerns queries based on title phrases. It is executed if any title phrases were detected during question processing; otherwise it is skipped. Step 525 comprises component steps according to the following pseudocode:
______________________________________
FOR i = 1 to t
{IF (TP.sub.1 contains no high-frequency function words)
10 {query = <0 TP.sub.1 >}
ELSE /* query based on content words only */
{1.sub.1 = number of intervening words between
first and last content words of TP.sub.1 ;
query = <1.sub.1 TP.sub.1 (content1) TP.sub.1 (content2) . . . >}
MAKE-QUERY (query, number of hits, results);
STORE(number of hits in association with TP.sub.1);
}
______________________________________
The FOR loop is executed repeatedly, once for each title phrase detected during question processing. It is assumed that there are a total of t title phrases TP.sub.i having unique index numbers i that run from 1 to t. A query is constructed and executed for each title phrase. The query is formulated to match the exact wording of the title phrase as closely as possible. First the title phrase is checked to see whether it contains any high-frequency function words. These are words such as "the," "a," "and," "of," and so forth that typically appear with very high frequency in any text corpus. If the title phrase contains no high-frequency function words, a query of the form <0 TP.sub.i > is constructed and executed. This query seeks the exact wording of the title. If the title phrase contains high-frequency function words, a query is constructed and executed based on the content words of the title, that is, the words that are most important in the title. For example, the title phrase "Across the River and Into the Trees" gives rise to the query <3 River Trees>, because "River" and "Trees" are the content words that appear in the title in a sequence in which "River" precedes "Trees" and is separated from "Trees" by at most 3 words. Once the title phrase's query is constructed and executed, the number of hits detected is stored in association with the title phrase. The storage can be done in memory or in a storage device. In one embodiment the stored information is organized in an array of 2 rows by t columns, one column for each title phrase. In each column the top row is the unique index number of the title phrase and the bottom row is the number of hits. It can be seen that this is similar in structure to the noun phrase frequency table; however, there is no row for proximity values, because title phrases are not broadened by proximity constraint relaxation in this embodiment. 5.8 Interleaved Broadening and Narrowing Steps 530 through 570 primarily concern compound queries. In these steps, a sequence or series of queries is performed. This sequence is designed to include optimal queries, where "optimal query" means a query that retrieves a manageable number (e.g., 2, 10, or 30, depending on the application) of relevant documents and very few if any irrelevant documents. Put another way, the sequence of queries is designed to find the documents the user needs and only the documents the user needs, and to do so through a systematic approach that is carried out transparently to the user. In each of the steps 530 through 570, multiple queries can be generated by applying a set of interleaved broadening and narrowing operations to one or more base queries. To describe these broadening and narrowing operations it is helpful to begin with a concrete example. For ease of reference this example will be referred to in later sections as Example 1. Consider an input string for which three noun phrases NP.sub.1, NP.sub.2, NP.sub.3 are detected and no other phrases are detected. Suppose that after initial query processing the unsorted frequency table for this string is as follows:
______________________________________
i 1 2 3
______________________________________
number of hits
30 3 12
proximity 0 10 5
______________________________________
It will be observed that for each column of the table the number of hits and proximity value determine a query, specifically, the query that was executed when these values were stored. That is, the values in this frequency table came from the following three queries: <0 NP.sub.1 > (10 NP.sub.2) (5 NP.sub.3) A combination query that includes these three queries can be constructed: ›<0 NP.sub.1 >(10 NP.sub.2) (5NP.sub.3)! This query, which will be called the base query, is the Boolean AND of its three component noun phrase queries. It matches documents that contain all three noun phrases as modified according to the specified proximity and order constraints. The base query serves as the starting point for a series of queries generated automatically through systematic broadening and narrowing of the base query, for example:
______________________________________
›(1 NP.sub.1) (10 NP.sub.2) (5 NP.sub.3)!
| broaden NP.sub.1 query
(40 (1 NP.sub.1) (10 NP.sub.2) (5 NP.sub.3))
| narrowing
(20 (1 NP.sub.1) (10 NP.sub.2) (5 NP.sub.3))
| narrowing
. . .
›(5 NP.sub.1) (10 NP.sub.2) (5 NP.sub.3)!
| broaden it further
(40 (5 NP.sub.1) (10 NP.sub.2) (5 NP.sub.3))
| narrowing
(20 (5 NP.sub.1) (10 NP.sub.2) (5 NP.sub.3))
| narrowing
. . .
›(10 NP.sub.1) (10 NP.sub.2) (5 NP.sub.3)!
| even further
(40 (10 NP.sub.1) (10 NP.sub.2) (5 NP.sub.3))
| narrowing
(20 (10 Np.sub.1) (10 NP.sub.2) (5 NP.sub.3))
| narrowing
. . .
. . .
›<0 NP.sub.1 > (20 NP.sub.2) (5 NP.sub.3)!
| broaden NP.sub.2 query
(40 <0 NP.sub.1 > (20 NP.sub.2) (5 Np.sub.3))
| narrowing
(20 <0 NP.sub.1 > (20 NP.sub.2) (5 NP.sub.3))
| narrowing
. . .
. . .
. . . | etc.
______________________________________
The queries are broadened by successive relaxations of the proximity constraint of each component noun phrase query. Between broadenings the queries are narrowed by tightening the overall relative proximity constraint between the component noun phrase queries; specifically, the Boolean AND query, which seeks the component noun phrases anywhere within a document, is followed by proximity queries that seek the component noun phrases closer and closer together within a document. The procedure for generating the series of queries in the above example can be expressed in pseudocode as follows:
______________________________________
/* SET-UP */
/* Construct component queries for all noun phrases, using proximity
values from unsorted frequency table. Noun phrases have index
numbers from 1 to n. If the proximity associated with a noun phrase is
0, the component query is a strict-order query; otherwise it is a
proximity constraint query without strict ordering. If the number of
hits associated with a noun phrase is 0, its component query is
considered empty and is not included in the base query or used in the
sequence of broadening and narrowing.
FOR i = 1 to n
{hits(i) = frequency.sub.-- table(2,i);
proximity(i) = frequency.sub.-- table(3 ,
IF (hits(i) = 0)
{component.sub.-- query(i) = EMPTY}
ELSE
{IF (proximity(i) = Q)
{component.sub.-- query(i) = <O NP.sub.1 >}
ELSE
{component.sub.-- query(i) = (proximity(i) NP.sub.1)}
}
}
/* Construct base query as Boolean AND of all non-empty component
queries. */
base.sub.-- query = ›component query(i)!,
(FOR i= 1 to n,
component.sub.-- query(i) * EMPTY);
/* BROADENING */
/* Outer loop over all component queries: Broaden each component
query in turn by relaxing its proximity constraint, that is,by trying
all proximity values greater than the proximity value of the base
query.
One-word noun phrase queries#and queries not present in the base
query are not broadened.
*/
FOR i = 1 to n
{FOR p = (1, 5, 10, 20, 40), p > proximity(i)
{IF (NP.sub.1 contains >1 word)
and (component.sub.-- query(i) is in base.sub.-- query)
/* Perform the Boolean AND query: Formulate the
broadened ith component query and AND it with the other
component queries of the base query. Execute the query
thus constructed. */
{broad(i) = (p NP.sub.1);
query = ›broad(i) component#query (j)!,
(FOR j= 1 to n, j .noteq. i,
component.sub.-- query (j)
is in base.sub.-- query);
MAKE-QUERY(query,number.sub.-- of.sub.-- hits,results);
/* NARROWING */
/* Inner loop, executed if the number of hits returned by the
AND query exceeds a specified value: Narrow the query by
tightening its overall proximity constraint, that is, the
proximity constraint of the component noun phrases
relative to one another. Narrow repeatedly until the
number of hits returned is below the specified value or
until all allowable proximity values have been tried. */
FOR p = (40, 20, 10, 5)
{WHILE (number.sub.-- of hits > VALUE)
{query =
(p broad(i) component.sub.-- query(j)),
(FOR j= 1 to n, j .noteq. i,
component.sub.-- query (j)
is in base.sub.-- query);
MAKE-QUERY (query,
number.sub.-- of.sub.-- hits,results);
}
}
}
}
______________________________________
The specified value used in the narrowing loop can be, for example, 10. Like the threshold and maximum values used in MAKE-QUERY, it can be set by default or by the user in advance of execution. The specified value ensures that unnecessary narrowing is avoided, thereby improving computational efficiency. In some embodiments, an additional specified value can be used in conjunction with the outer broadening loop, so that if a query that includes the broadened ith component (p NP.sub.i) returns more hits than the additional specified value, no further broadening is done for the ith component. For all queries executed during the broadening and narrowing series, MAKE-QUERY saves non-redundant query results when the number of hits is within the range set by the threshold and maximum values. The results are simply saved at this point; they will be ranked later on in step 260. With Example 1 in mind, steps 530 through 570 will now be described. 5.9 Combination Queries Title Phrase and Noun Phrase Combinations. In step 530 combination queries that include both title phrases and noun phrases are performed. For each title phrase TP.sub.i for which a nonzero number of hits was found in step 525, a base query is constructed of the form ›<0 TP.sub.i >(p, NP.sub.1) (P.sub.2 NP.sub.2) . . . ! Here and in what follows it is to be understood that although the title phrase component query is typically expressed in the form <0 TP.sub.i >, if the title phrase contains high-frequency function words the actual query that is performed can be based on the content words of the title phrase and expressed in the form <p TP(content,) TP(content.sub.2) . . . > as described in conjunction with step 525 above. Likewise, although each noun phrase component query is expressed in the form (p.sub.j NP.sub.j), where p.sub.j is the frequency table proximity value associated with the jth noun phrase NP.sub.j, the component query can be <0 NP.sub.j > if p.sub.j =0 and is omitted if no hits were found during initial query processing for NP.sub.j. The component query for noun phrase NP.sub.j is also omitted from the base query if noun phrase NP.sub.j is contained in title phrase TP.sub.i. Once the base query is formed it is broadened and narrowed in a manner similar to that described in Example 1 above. Each noun phrase component query NP.sub.j in the base query is broadened in turn through successive relaxations of its proximity constraint. For each such relaxation the query is narrowed by tightening the overall proximity constraint. Throughout this process the title phrase query is left unchanged. Component queries based on single-word noun phrases are not broadened. As an example of step 530, suppose that the base query contains title phrase TP.sub.i and noun phrases NP.sub.1 and NP.sub.2, where TP.sub.1 is "Gone With the Wind," NP.sub.1 is "great fire" and NP.sub.2 is "Scarlett O'Hara." If the proximity values for noun phrases NP.sub.1 and NP.sub.2 are 5 and 0 respectively, the base query is ›<3 Gone Wind> (5 great fire) <0 Scarlett O'Hara>! If the list of proximity values to be tried is p=(1, 5, 10, 20, 40), then a series of queries such as the following is executed:
______________________________________
›<3 Gone Wind>(10 great fire) <0 Scarlett O'Hara>!
(40 <3 Gone Wind>(10 great fire) <O Scarlett O'Hara>)
(20 <3 Gone Wind>(10 great fire) <O Scarlett O'Hara>)
. . .
›<3 Gone Wind>(20 great fire) <0 Scarlett O'Hara>!
(40 <3 Gone Wind>(20 great fire) <0 Scarlett O'Hara>)
(20 <3 Gone Wind>(20 great fire) <0 Scarlett O'Hara>)
. . .
. . .
›<3 Gone Wind>(5 great fire) (1 Scarlett O'Hara)!
(40 <3 Gone Wind>(5 great fire) (1 Scarlett O'Hara))
(20 <3 Gone Wind>(5 great fire) (1 Scarlett O'Hara))
. . .
. . .
______________________________________
Upon completion of step 530, every title phrase TP.sub.i for which a nonzero number of hits was found in step 525 has had its respective base query processed according to the interleaved broadening and narrowing scheme. Maximal noun phrase combination. In step 540 combination queries are performed that include a maximal number of noun phrases. A base query is constructed that includes a component query for each noun phrase NP.sub.i for which a nonzero number of hits was found during initial query processing: ›(p.sub.1 NP.sub.1) (p.sub.2 NP.sub.2) (p.sub.3 NP.sub.3) . . . ! This will be called the maximal noun phrase base query or simply the maximal query. It will be appreciated that the base query from Example 1 is of this form. As in step 530 above, the proximity values p.sub.i are the proximity values from the frequency table. Also as in step 530 above, although each component query of the maximal query is expressed here in the form (p.sub.i NP.sub.i), it is to be understood that the component query can be <0 NP.sub.i >if p.sub.i =0. Noun phrases for which no hits were found during initial query processing are omitted from the maximal query. Once formed, the maximal query is broadened and narrowed in the manner of Example 1. It will be observed that the maximal query can contain noun phrases that are parts of title phrases. This can lead to more robust performance, for example, when a user enters an incorrect or partial title in the input string. Lesser noun phrase combinations. In step 545 combination queries are performed that include some but not all of the noun phrases included in the maximal query. Base queries are constructed by dropping one, two, or more component queries from the maximal query. Each component, pair of components, etc. is dropped in turn. The greater the number of components dropped, the broader the query. An example illustrates this. If the maximal query is ›<0 black cat> (5 yellow dog) (40 pink frog) (20 blue tiger)! then base queries can be constructed such as
______________________________________
›(5 yellow dog) (40 pink frog) (20 blue tiger)!
›<0 black cat> (40 pink frog) (20 blue tiger)!
›<0 black cat> (5 yellow dog) (20 blue tiger)!
›<0 black cat> (5 yellow dog) (40 pink frog)!
›(40 pink frog) (20 blue tiger)!
›(5 yellow dog) (20 blue tiger)!
›(5 yellow dog) (40 pink frog)!
›<0 black cat> (20 blue tiger)!
›<0 black cat> (40 pink frog)!
›<0 black cat> (5 yellow dog)!
______________________________________
In some embodiments only single components are dropped from the maximal query in step 545. In other embodiments only single components and pairs of components are dropped. In still other embodiments all combinations of k components, k=1 to m-2 where m=the total number of components in the maximal query, are dropped. In yet other embodiments the order of processing can be reversed, so that base queries of k components, k=2 to m-1, are constructed and processed. Main verb combinations. Step 550 concerns queries that involve main verbs. Base queries are formed by adding one or more main verbs to the title-phrase or noun-phrase base queries constructed in steps 530, 540, and 545. These queries can then be broadened and narrowed in some embodiments. For example, consider once again the example in which a title-phrase base query contains the title phrase "Gone With the Wind" and the noun phrases "great fire" and "Scarlett O'Hara," with proximity constraint values for these noun phrases of 5 and 0 respectively. Suppose further that the input string contains one main verb, "burn." Then a base query can be constructed as ›burn <3 Gone Wind> (5 great fire) <0 Scarlett O'Hara>! and can be broadened and narrowed by relaxing the proximity constraints of the noun phrase component queries and tightening the overall proximity constraint between component queries as described above. If there is another main verb in the input string, for example "mend," then a second base query can be constructed as ›mend <3 Gone Wind> (5 great fire) <0 Scarlett O'Hara>! and a third base query containing both main verbs can be constructed as ›burn mend <3 Gone Wind> (5 great fire) <0 Scarlett O'Hara>! Each of these base queries can be broadened and narrowed in the manner described above. The addition of main verbs to compound queries serves effectively to narrow those queries. Thus in some embodiments main verb queries are performed when title-phrase or noun-phrase compound queries return a number of hits that exceeds a specified value even after overall proximity constraint is narrowed to its lowest proximity value. As with other specified values discussed above, the specified value used to trigger main verb queries can be set by default or by the user in advance. In some embodiments the IR subsystem performs automatic stemming of main verbs, that is, reduction of verbs to their uninflected form. In other embodiments stemming is performed by the primary query construction subsystem in the context of constructing base queries using main verbs. 5.10 Additional Queries In some embodiments, as shown in FIG. 5, steps 560 and 570 are performed. In these steps attempts are made to detect additional hits for noun phrases or title phrases. Steps 560 and 570 can be particularly helpful with regard to noun phrases or title phrases for which few hits are detected in previous steps. Head words. Step 560 concerns queries based on the head words of noun phrases. The "head word" of a noun phrase is the most important or significant word in the noun phrase. In English the head word is typically the rightmost word in the noun phrase. For example, in the phrase "great horned owl" the head word is "owl." The head word of a noun phrase can be detected, for example, using the input string analysis developed during question processing. In one embodiment a simple query that seeks just the head word is constructed and executed for each noun phrase of more than one word for which a number of hits below a specified minimum number is recorded in the frequency table. The specified minimum can be set by default or by the user in advance. The simple head word query serves to broaden the noun phrase query beyond the scope of the initial query and the proximity-broadened versions of the same by dropping the constraint that requires all words of the noun phrase to be present in the document. As an example, if the noun phrase "great horned owl" generated 0 hits even after its proximity was broadened to the maximum value of 40, so that the query (40 great horned owl) generated 0 hits, then the simple head word query owl could be run. Alternatively, the frequency of the word "owl" in the text corpus can be provided directly by the IR subsystem in embodiments that support this. In some embodiments compound queries comprising the simple head word query and other components such as noun phrase queries, title phrase queries, main verbs, or other head word queries can be constructed and executed, for example if the simple head word query detected hits and the initial query from which it was derived did not. Such compound queries can be broadened and narrowed by varying proximity constraints after the manner of Example 1 in some embodiments. Related words. Step 570 concerns queries in which certain words of the input string are replaced by substitute related words such as synonyms or hyponyms. Related words can be drawn from a thesaurus, dictionary, or any other appropriate information source made available to the system. Related word substitution is a broadening operation in that a series of queries based on the set of words comprising both the original words of the input string and words related to these original words will tend to generate a greater number of hits than a series of queries based on the original words only. Related words can be used in conjunction with simple or compound title phrase queries by substituting at least one related word for one or more of the content words in the title phrase. Related words can be used in conjunction with simple or compound noun phrase queries by substituting at least one related word for one or more of the words in the noun phrase, for example, for the head word of the noun phrase. Related words can be used in conjunction with simple or compound main verb queries by substituted at least one related word for each main verb. In one embodiment related word substitution is performed for each noun phrase for which a number of hits below a specified minimum number is recorded in the frequency table and also for each title phrase for which a number of hits below a (possibly different) specified number was found in step 525. Each of the specified minima can be set by default or by the user in advance. Related word substitution can be performed in conjunction with head word queries in some embodiments. For example, if after broadening an initial query to its maximum proximity value and thereafter further broadening it by dropping words other than the head word the number of returned hits remains unacceptably low, the simple head word query can be yet further broadened by substituting one or more synonyms or hyponyms for the head word. In some embodiments compound queries comprising one or more related word substitutions can be broadened and narrowed by varying proximity constraints after the manner of Example 1. 6. Results Ranking This section examines step 260, results ranking, in detail. Results ranking comprises ordering the results (matched documents) retrieved in step 240 to determine which are most likely to be relevant to the user's question or other input string. A great many results ranking strategies are contemplated within the scope of the invention; one such strategy is described here. Step 260 proceeds in component steps 261, 262, 263, and 264 as shown for one embodiment in FIG. 7. In step 261 the search results generated by the queries of step 240 are made available for ranking. These results are pairs of the form (query document-list) In step 262 these results pairs are ranked according to criteria, the criteria being selected so as to place the most relevant results at the top of the ranking. In step 263 documents within the pairs are ranked and a list of documents or document identifiers to be output is constructed. In some embodiments step 264 is executed in which this list can be modified. The list created in step 263 as modified in step 264 becomes the list that is output to one or more destinations such as the user, storage, or another process in step 280. Steps 261, 262, 263, and 264 will now be more fully described. 6.1 Results Pairs At the beginning of step 260 it is assumed that non-redundant queries have been stored in association with their respective results. It will be recalled that the MAKE-QUERY routine described earlier does such storage. In step 261 the associated queries and results are retrieved from storage or otherwise made available for further processing. For example, they can be placed in memory in an association list or equivalent data structure comprising pairs of queries and document lists, each such document list being a list of the documents that were retrieved by the IR subsystem from the text corpus when the document list's associated query was executed:
______________________________________
( (query.sub.1 (document.sub.1,1 document.sub.1,2 ...))
(query.sub.2 (document.sub.2,1 document.sub.2,2 ...))
. . .
(query.sub.k (document.sub.k,1. . . document.sub.k,i...))
. . .
. . . )
______________________________________
Each element of this association list will be called a results pair. 6.2 Ranking of Results Pairs In step 262 the results pairs are ranked according to one or more criteria. Criteria are typically chosen such that documents most likely to be relevant to the user's question are ranked highest. Many different kinds of criteria can be used as the basis for results ranking. In one embodiment four criteria are used: (1) number of title phrase words matched; (2) number of capitalized words matched; (3) total number of words matched; (4) smallest number of documents in document list. Each of these criteria will now be described. (1) Number of title phrase words matched. The most highly ranked results pairs are those for which the queries include words from one or more title phrases. The more title phrases used to generate the query and the more words from each title phrase are included in the query, the higher is the ranking of the results pair. A rationale behind this criterion is that if a title phrase is found in a document, that document is likely to be highly relevant. The more title words that are found, the more likely relevant is the document. (2) Number of capitalized words matched. The next most highly ranked results pairs are those for which the queries include words having initial capital letters. The greater the number of such words in the query, the higher the ranking of the result pair. This criterion gives greater weight to queries that contain proper nouns such as person or place names. It is applied only for results pairs that do not meet criterion (1), that is, results pairs whose queries do not include component queries based on title phrases. (3) Total number of words matched. The third most highly ranked results pairs are those for which the queries have relatively many words. The more words in the query, the higher the ranking of the result pair. This criterion is applied only for results pairs that do not meet criteria (1) or (2), that is, results pairs whose queries do not include component queries based either on title phrases or on noun phrases with initial capital letters. (4) Smallest number of documents in document list. Results pairs that do not meet any of criteria (1), (2), or (3) are ranked least highly. Among themselves such results pairs are ranked according to the number of documents retrieved. Results pairs that have fewer documents are more highly ranked than those that have many documents. An example of applying the four criteria will now be given. Suppose that results pairs for the following four queries are to be ranked: 1) ›<0 Great Britain> <0 Ireland>! 2) ›Ireland sea! 3) ›<0 Great Britain> ocean! 4) ›Britain ocean sea! None of the queries contains a title phrase, so the first ranking criterion does not apply. Query 1 contains 3 words with initial capital letters, query 3 contains 2, and queries 2 and 4 each contain 1. Therefore according to the second ranking criterion results pair 1 is ranked highest and results pair 3 second highest. As between results pairs 2 and 4, query 4 contains 3 words while query 2 contains only 2. Therefore according to the third ranking criterion query 4 is ranked higher. The final ranking is thus 1, 3, 4, 2. If in the previous example the second and fourth queries are changed to: 2) ›Ireland ocean sea! 4) ›Britain ocean sea! then the third ranking criterion does not apply. Further supposing that query 2 generated 3 hits and query 4 generated 5, then according to the fourth ranking criterion results pair 2 is ranked above results pair 4, so that the final ranking becomes 1, 3, 2, 4. 6.3 Presentation List Construction In step 263 documents within one or more ranked results pairs are ranked according to one or more criteria, and an ordered list of documents most likely to be relevant to the user's question or input string is constructed based on this ranking. This ordered list, which will be called the presentation list, becomes the ordered list of relevant documents that is output in step 280, either directly or, in some embodiments, as modified in step 264. The presentation list contains N unique documents in it. That is, there are N documents in the list, and none of these documents is the same as any other. N is a number set by default or in advance by the user. N can be, for example, 3 or 30, depending on the application. In step 263, as in step 262, many kinds of ranking criteria can be used. In one embodiment, a single criterion is used, namely, the number of words that appear in the input string and do not appear in the query of the results pair. Documents that contain the greatest number of such words receive the highest rank. An example illustrates this. Suppose that the top-ranked results pair has two documents, doc.sub.i,1 and doc.sub.1,2, in its document list (that is, its query, query i, generated two hits): (›<0 Great Britain> Ireland! (doc.sub.i,1 doc.sub.i,2)) Suppose further that the input string contained an additional word such as the verb "cross" (as in "crossing the sea"). Then if doc.sub.i,1 contains the word "crossing" but doc.sub.i,2 does not, doc.sub.i,1 is ranked above doc.sub.i,2. Next suppose that the second-ranked results pair has four documents in its document list, which in order from highest rank to lowest are (doc.sub.i,1 doc.sub.i,7 doc.sub.i,3 doc.sub.i,5). Then if N=3, for example, the 3 documents doc.sub.i,1 doc.sub.i,2 and doc.sub.i,7 are the top-ranking, most-probably-relevant documents produced by step 263. doc.sub.i,1 is ranked the highest because it is the highest-ranking document from the highest-ranking results pair. doc.sub.i,2 is ranked second because it is the second highest-ranking document from the highest-ranking results pair. doc.sub.i,7 is ranked third because it is the highest-ranking unique document from the second highest-ranking results pair, that is, the highest-ranking document that is in the second-ranked results pair and that is not in the first-ranked results pair. Because N unique documents are present in the first two results pairs, the ranking of documents of any remaining results pairs is skipped in this embodiment. The foregoing procedure can be expressed in pseudocode, for example as follows:
______________________________________
/* Initialize the document counter to 0 and the presentation list to the
empty list. The document counter keeps track of how many documents
have been added to the presentation list. */
document.sub.-- counter =
presentation.sub.-- list = EMPTY;
/* Seek N unique documents. */
WHILE (document.sub.-- counter <N)
/* Outer loop over r results pairs. */
{FOR i = 1 to r
{SORT (all documents doc.sub.iJ from results pair i
according to ranking criteria);
/* Inner loop over the d ranked documents from results pair i */
FOR j = 1 to d.sub.1
{IF (jth sorted document is not yet in
presentation.sub.-- list)
{ADD doc.sub.iJ to the end of
presentation.sub.-- list;
document.sub.-- counter = document.sub.-- counter + 1
}
}
}
______________________________________
In some embodiments different ranking criteria can be used. For example, consideration can be given to the frequency of particular search terms within the retrieved documents or within the text corpus as a whole. In embodiments in which synonyms are used in queries, priority can be given to queries based on original words over queries based on synonyms. In some embodiments the presentation list comprises a list of document identifiers instead of a list of documents. 6.4 Modifications to Presentation List In some embodiments, as shown in FIG. 7, step 264 is executed after step 263. In step 264 the presentation list created in step 263 can be modified. Typically such modification is done to enhance the presentation of output to the user or to make the output suitable for further processing. A wide variety of modifications to the presentation list is possible in step 264 within the scope of the invention. In some embodiments the documents in the presentation list can be alphabetized by title or otherwise reordered in step 264. In some embodiments additional information can be added for each document in the presentation list in step 264. For example, the additional information can comprise information about the queries used to retrieve the documents. Suppose that a document was retrieved in response to a query containing a particular title phrase. This title phrase can be identified as a title phrase and can be highlighted in the text of the retrieved document, for example by a distinctive text presentation such as boldfacing, italicization, underlining, or a special font. As another example, the additional information can comprise match sentences, that is, sentences in the retrieved documents that caused or helped to cause the documents to be retrieved. Again supposing that a document was retrieved in response to a query containing a particular title phrase, the sentence or sentences in which this title phrase occurs can be associated with the retrieved document. 7. Alternative Embodiments The foregoing embodiment of the method is only one possible embodiment among many. Moreover, variations of the method fall within the scope of the invention. A few alternative embodiments will now be briefly described. Further alternatives will be apparent to those of skill in the art. Alternative techniques can be used to carry out question processing in step 220. For example, a parser based on a context-free grammar can be used. Case-splitting of noun phrases need not be performed. Cues other than or in addition to text formatting can be used to detect title phrases. Most generally, any technique can be used that converts the natural language input string into phrases--that is, contiguous, relatively short sequences of words--that can subsequently be used as the basis for IR queries in query reformulation step 240. The types of phrases detected in step 220 need not be limited to noun phrases, title phrases, and verb phrases. Step 240 can likewise be carried out using alternative techniques. For example, a query language with a greater variety of features than the simple query language described above can be use. Such features can include, for example, sentence- and paragraph-based proximity constraints. Details of component steps of step 240 can vary as well. For example, in some embodiments step 525 can further comprise the construction and execution of additional IR queries based on combinations of two or more titles in cases where queries based on single title phrases generate excessively large numbers of hits. As another example, although the particular sequence of broadening and narrowing operations described in connection with Example 1 above is effective and computationally efficient, it will be appreciated that other sequences of interleaved broadening and narrowing can be used within the scope of the invention. In yet another example, an IR subsystem that caches the results of partial Boolean expressions can be incorporated to reduce redundant searching during broadening. In still another example, for input strings comprising multiple sentences or comma-delimited clauses, input string punctuation can be used to help determine which noun phrases should be grouped together in compound queries. In a further example, when component queries are dropped in step 245, in some embodiments only selected component queries are dropped, for example, when dropping all possible component queries would require considerable computation time. If only selected component queries are dropped, it is preferable in some embodiments to drop first those components that appear with the greatest frequency in the corpus, that is, which had the largest number of hits. Dropping high-frequency component queries tends to improve the likelihood that documents retrieved by the resulting broadened queries will be specifically relevant to the input string. Part III. Murax 1. Organization of Part III Part III of the description focuses on MURAX, a system that embodies the method of the present invention in the context of a larger two-phase method for answering a natural-language question. MURAX provides a robust linguistic approach for answering natural-language questions about specific facts. It derives its answers to questions from a database comprising an on-line encyclopedia. MURAX has been implemented and tested using a hardware configuration in which a Sun SparcStation workstation runs both the primary query construction and answer extraction subsystems, and the information retrieval subsystem and its associated corpus are located remotely from the Sun workstation and are accessed by the workstation via a network. The rest of Part III is organized as follows: In Section 2 the motivation for MURAX's question-answering task is given, along with a description of the kind of questions that are its concern and their characteristics. Section 3 describes the system components. These include the encyclopedia and the IR system for accessing it. Shallow linguistic analysis is done using a part-of-speech tagger and finite-state recognizers for matching lexico-syntactic patterns. Section 4 describes the analysis of a question by considering an example and illustrates the system output. Analysis proceeds in two phases. The first, primary query construction, finds articles that are relevant to the question. The second phase (called answer extraction) analyzes these articles to find noun phrases (called answer hypotheses) that are likely to be the answer. Both phases require searching the encyclopedia. Queries made during the first phase are called primary queries, and only involve phrases from the question. The second phase creates secondary queries which MURAX generates to verify specific phrase relations. Secondary queries involve both answer hypotheses and phrases from the question. Section 5 explains the primary query construction phase, which is carried out in MURAX according to the method of the present invention. Section 6 explains the answer extraction phase, which is carried out in MURAX according to the method of co-pending application entitled METHOD FOR EXTRACTING FROM A TEXT CORPUS ANSWERS TO QUESTIONS STATED IN NATURAL LANGUAGE BY USING LINGUISTIC ANALYSIS AND HYPOTHESIS GENERATION as incorporated hereinabove by reference. Section 7 presents a performance evaluation of MURAX. 2. Task Selection MURAX's task is to answer certain general-knowledge questions using a text corpus that comprises an encyclopedia. There are several reasons for choosing this task. Robust analysis is needed for the task because the encyclopedia is composed of a significant quantity of unrestricted text. General knowledge is a broad domain, which means that it is impractical to manually provide detailed lexical or semantic information for the words of the vocabulary (the encyclopedia contains over 100,000 word stems). MURAX demonstrates that shallow syntactic analysis can be used to practical advantage in broad domains, where the types of relations and objects involved are not known in advance and may differ for each new question. Its analysis capitalizes on the information available in a question and profits from treating the encyclopedia as a lexical resource. MURAX also demonstrates that natural language analysis can add to the quality of the information retrieval process, providing text to the user that confirms phrase relations and not just word matches. MURAX uses questions that have definite answers. This means that its performance can be evaluated in a straightforward way by using a set of questions and correct answers. Given a correct noun phrase answer, it is generally easy to judge whether a noun phrase hypothesized by the system is correct or not. Thus relevance judgements are simplified, and if one correct hypothesis is considered as good as any other, recall measurements are not required and performance can be considered simply as the percentage of correctly hypothesized answers. 2.1 Question Characteristics MURAX operates on closed-class questions. A closed-class question is a direct question whose answer is assumed to lie in a set of objects and is expressible as a noun phrase. Put another way, a closed-class question is a question, stated in natural language, that assumes some definite answer typified by a noun phrase rather than by a procedural answer. Examples of closed-class questions are given below in Table 5. TABLE 5 Example Questions 1. What U.S. city is at the junction of the Allegheny and Monongahela rivers? 2. Who wrote "Across the River and into the Trees"? 3. Who married actress Nancy Davis? 4. What's the capital of the Netherlands? 5. Who was the last of the Apache warrior chiefs? 6. What chief justice headed the commission that declared: "Lee Harvey Oswald . . . acted alone."? 7. What famed falls are split in two by Goat Island? 8. What is November's birthstone? 9. Who's won the most Oscars for costume design? 10. What is the state flower of Alaska? (These Trivial Pursuit questions are Copyright Horn Abbott Ltd. Trivial Pursuit is a Registered Trademark of Horn Abbot Ltd.) The questions in Table 5 appear in the general-knowledge "Trivial Pursuit" game and typify the form of question that is the concern of the MURAX task. These questions have the virtue of being created independently of the retrieval task (that is, they are unbiased) and have a consistent and simple stylized form. Yet they are flexible in their expressive power. The interrogative words that introduce a question are an important source of information. They indicate particular expectations about the answer and some of these are illustrated below in Table 6. TABLE 6 Question Words and Expectations Who/Whose: Person What/Which: Thing, Person, Location Where: Location When: Time How Many: Number Notable omissions here are the words why and how. Questions that begin with "why" or "how" typically expect a procedural answer rather than a noun phrase answer (e.g., "How do you make a loaf of bread?"). The expectations established by the interrogative words can be used to filter various answer hypotheses. The answers to questions beginning with the word "who" are likely to be people's names. This fact can be used to advantage because various heuristics can be applied to verify whether a noun phrase is a person's name. A question introduced by "what" may or may not refer to a person; however, other characteristics can be exploited. Consider the following sentence fragments, where NP symbolizes a noun phrase: "What is the NP . . . " and "What NP . . . " The noun phrase at the start of such questions is called the question's type phrase and it indicates what type of thing the answer is. The encyclopedia can be searched to try to find evidence that an answer hypothesis is an instance of the type phrase (details are presented An section 6.1.1 below). The verbs in a question are also a useful source of information as they express a relation that exists between the answer and other phrases in the question. The answer hypotheses for "Where . . . " questions are likely to be locations, which often appear with locative prepositions or as arguments to verbs of motion. Questions of the form "When . . . " often expect answer hypotheses that are dates or times and the expectation of questions beginning "How many . . . " are numeric expressions. 3. Components An on-line version of Grolier's Academic American Encyclopedia (The Academic American Encyclopedia, Danbury, Conn.: Grolier Electronic Publishing, 1990) serves as the text corpus for MURAX. The on-line encyclopedia contains approximately 27,000 articles, which are accessed via the Text Database (D. R. Cutting, J. Pedersen, and P. -K. Halvorsen, "An object-oriented architecture for text retrieval," in Conference Proceedings of RIAO '91, Intelligent Text and Image Handling, Barcelona, Spain, April 1991, pages 285-298), which is a flexible platform for the development of retrieval system prototypes and is structured so that additional functional components, e.g., search strategies and text taggers (see D. Cutting, J. Kupiec, J. Pedersen, and P. Sibun, "A practical part-of-speech tagger," in Proceedings of the Third Conference on Applied Natural Language Processing, Trento, Italy, April 1992, ACL) can be easily integrated. The components responsible for linguistic analysis in MURAX are a part-of-speech tagger and a lexico-syntactic pattern matcher. The tagger is based on a hidden Markov model (HMM). HMM's are probabilistic and their parameters can be estimated by training on a sample of ordinary untagged text. Once trained, the Viterbi algorithm is used for tagging. To assess performance, an HMM tagger (J. M. Kupiec, "Robust part-of-speech tagging using a hidden Markov model," Computer Speech and Language, 6:225-242, 1992) was trained on the untagged words of half of the Brown corpus (W. N. Francis and F. Kucera, Frequency Analysis of English Usage, Houghton Mifflin, 1982) and then tested against the manually assigned tags of the other half. This gave an overall error rate of 4% (corresponding to an error rate of 11.2% on words that can assume more than one part-of-speech category). The percentage of tagger errors that affect correct recognition of noun phrases is much lower than 4%. The tagger uses both suffix information and local context to predict the categories of words for which it has no lexicon entries. The HMM used for tagging the encyclopedia text was also trained using the encyclopedia. A benefit of such training is that the tagger can adapt to certain characteristics of the domain. An observation in this regard was made with the word "I". The text of the encyclopedia is written in an impersonal style and the word is most often used in phrases like "King George I" and "World War I". The tagger trained on encyclopedia text assigned "I" appropriately (as a proper noun) whereas the tagger trained on the Brown corpus (a mixture of different kinds of text) assigned such instances as a pronoun. Given a sentence of text, the tagger produces a sequence of pairs of words with associated part-of-speech categories. These enable phrase recognition to be done. Phrases are specified by regular expressions in the finite-state calculus (Hopcroft and Ullman, supra). Noun phrases are identified solely by part-of-speech categories, but more generally categories and words are used to define lexico-syntactic patterns against which text is matched. Initially, only simple noun phrases are identified because they are recognized with the greatest reliability. Analysis involving prepositional phrases or other coordination is applied subsequently as part of more detailed matching procedures. Word-initial capitalization turns out to be useful for splitting noun phrases appropriately. Thus "New York City borough" is split into "New York City" and "borough". Such case-splitting improves the efficiency of Boolean query construction, because queries based on case-split noun phrases tend to generate phrase matches directly. This reduces the need to drop several words successively from a noun phrase in order to produce a match. 3.1 Title Phrases A title phrase is a multi-word phrase that is the title of a film, book, play, etc. It is usefully treated as a single unit in MURAX. A title phrase need not be a noun phrase. For example, the title "Play Misty for Me" contains the verb "play". Title phrases are readily identified when marked typographically by enclosing quotes or italics. However, title phrases may be marked only by word-initial-capitalized letters. Furthermore, some title phrase words, such as short function words, need not be capitalized. Thus, the correct extent of the title phrase can be ambiguous and alternative possibilities must be accommodated. In MURAX the most likely alternative is chosen after phrase matching has been done and the alternatives compared, based on the matches and frequency of the alternative interpretations. 4. Operational Overview This section presents an overview of the two-phase operation of the MURAX system, along with an example of the output of the system. The context is the example question presented Table 7 below.
TABLE 7
______________________________________
Example Question and Component Noun Phrases
______________________________________
"Who was the Pulitzer Prize-
wining novelist that ran for
mayor of New York City?"
Pulitzer Prize winning novelist
mayor New York City
______________________________________
4.1 Primary Query Construction In its first phase of processing MURAX embodies the method of the present invention. Simple noun phrases and main verbs are first extracted from the input question, as illustrated in Table 7. These phrases are used in a query construction/refinement procedure that forms Boolean queries with associated proximity and order constraints (section 5). The queries are used to search the encyclopedia to find a list of relevant articles. The relevant articles are heuristically scored according to the degree and number of matches with the phrases of the input question. Matching head words in a noun phrase receive double the score of other matching words in a phrase. Words with matching stems but incompatible part-of-speech categories are given minimal scores. After scoring, the relevant articles are ranked according to their scores. A subset of the ranked articles is made available for the second phrase of processing. MURAX assumes that these articles, which are called the primary documents, contain the answer to the user's question. 4.2 Answer Extraction In its second phase of processing MURAX embodies the method of co-pending application entitled METHOD FOR EXTRACTING FROM A TEXT CORPUS ANSWERS TO QUESTIONS STATED IN NATURAL LANGUAGE BY USING LINGUISTIC ANALYSIS AND HYPOTHESIS GENERATION as incorporated hereinabove by reference. Answer extraction begins by finding all simple noun phrases contained in the match sentences of the primary documents, that is, in those sentences that caused or helped to cause the primary documents to be retrieved. The match sentences correspond to one or more phrases of the input question. Each noun phrase found becomes an answer hypothesis that is distinguished by its component words and the article and match sentence in which it occurs. Answer hypotheses are scored on a per-article basis according to the sum of the scores of the articles in which they occur. The purpose of this is to minimize the probability of overlooking the correct answer hypothesis if a subsequent non-exhaustive search is performed using the hypotheses. For each answer hypothesis the system tries to verify phrase relations implied by the question. For the question in Table 7, it will be observed that the answer is likely to be a person (indicated by "who"). The type phrase indicates that the answer is preferably a "Pulitzer Prize winning novelist", or at least a "novelist" as indicated by the head noun of the type phrase. The relative pronoun indicates that the answer also "ran for mayor of New York City". Phrase matching procedures (detailed in section 6) perform the verification using the answer hypotheses and the primary documents. Verification is not limited to primary documents. It can happen that a pertinent phrase relation is not present in the primary documents although it can be confirmed elsewhere in the encyclopedia. This is because too few words are involved in the relation in comparison to other phrase matches, so the appropriate sentence does not rank high enough to be in the selected primary documents. It is also possible that appropriate verification information is not expressed in any primary document and depends only on the answer hypothesis. This is the case with one heuristic that MURAX uses to attempt to verify that a noun phrase represents a person's name. The heuristic involves looking for an article that has the noun phrase in its title; thus if the article does not share any phrases with the question, it would not be part of any primary document. Secondary queries are used as an alternative means to confirm phrase relations. A secondary query may consist of solely an answer hypothesis (as for the heuristic just mentioned) or it may also include other phrases from the input question, such as the type phrase of the question. To find out whether an answer hypothesis is a "novelist", for example, a query comprising the answer hypothesis and the word "novelist" is constructed and executed to yield a list of relevant articles. Documents whose match sentences contain co-occurrences of an answer hypothesis and a question phrase are called secondary documents. MURAX analyzes secondary documents using lexico-syntactic patterns to see if answer hypotheses can be validated as instances of the type phrase. 4.3 System Output For the question given in Table 7 MURAX produces the output shown in Table 8.
TABLE 8
______________________________________
Output for Example Question
______________________________________
The best matching phrase for this question is:
Mailer, Norman
The following documents were most relevant:
Document Title: Mailer, Norman
Relevant Text:
.cndot. "The Armies of the Night (1968), a
personal narrative of the 1967
peace march on the Pentagon, won
Mailer the Pulitzer Prize and the
National Book Award."
.cndot. "In 1969 Mailer ran unsuccessfully
as an independent candidate for
mayor of New York City."
Document Title: novel
Relevant Text:
.cndot. "Among contemporary American
novelists, Saul Bellow, John Dos
Passos, John Hawkes, Joseph
Heller, Norman Mailer, Bernard
Malamud, Thomas Pynchon, and
J.D. Salinger have reached wide
audiences."
Next best: Edith Wharton, William Faulkner
______________________________________
The presentation here is different from that of prior art IR systems. Answer hypotheses are shown to the user to focus his or her attention on likely answers and how they relate to other phrases in the question. The text presented is not necessarily from documents that have high similarity scores, but rather from documents that confirm phrase relations that lend evidence for an answer. This behavior is readily understood by users, even though they have not been involved in the tedious intermediate work done by the system. In Table 8, the first two sentences are from primary documents. The last sentence confirming Norman Mailer as a novelist is a from a secondary document. The sentence was confirmed by a lexico-syntactic pattern which identifies the answer hypothesis as being in a list-inclusion relationship with the type phrase. MURAX's approach contrasts with the alternative, prior art approach of vector-space search. Vector-space search using full-length documents is not as well suited as MURAX's approach to the task that MURAX performs. For the example question, a vector-space search was done using a typical similarity measure and the bag of content words of the question. The most relevant document (about Norman Mailer) was ranked 37th. Somewhat better results can be expected if sentence or paragraph level matching is done. However, the resulting text matches do not have the benefit of being correlated in terms of a particular answer, and they muddle information for different answer hypotheses. 5. Primary Query Construction This section describes in further detail how MURAX performs primary query construction. In accordance with the method of the present invention, MURAX translates phrases from a question into Boolean queries with proximity and order constraints. These queries are passed to an IR system which searches the encyclopedia and returns a list of matching documents (or hits). It is assumed that the IR system can process the following kinds of primitive queries, as well as compound queries constructed of these primitives: 1. The Boolean AND of terms, denoted here as: ›term.sub.1 term.sub.2 . . . term.sub.n ! 2. Proximity of a strict sequence of terms, separated by up to p other terms denoted here as: <p term.sub.1, term.sub.2, . . . term.sub.n > 3. Proximity of an unordered list of terms, separated by up to p other terms denoted here as: (p term.sub.1, term.sub.2, . . . term.sub.n) The overall process is illustrated with an example question, which will be called Example 2 below: "Who shot President Lincoln?" The question is first tagged and the noun phrases and main verbs are found. In Example 2 the only noun phrase is President Lincoln and the main verb is shot. | ||||||
