Method and apparatus for automated, context-dependent retrieval of information6236768Abstract Documents stored in a database are searched for relevance to contextual information, instead of (or in addition to) similar text. Each stored document is indexed in term of meta-information specifying contextual information about the document. Current contextual information is acquired, either from the user or the current computational or physical environment, and this "meta-information" is used as the basis for identifying stored documents of possible relevance. Claims What is claimed is: Description BACKGROUND OF THE INVENTION
(int) (width*uns int) (int) (uns int) (uns int)
(uns int)
NUM_WORDS, WORDCODE-1, NUM_DOCS=N1, DOC-1, DOC-2, . . . , DOC-N1,
WORDCODE-2, NUM_DOCS=N2, DOC-1, DOC-2, . . . , DOC-N2,
etc.
The headings indicate the type of data each variable represents (integer, unsigned integer). The first entry in the wordvec file, NUM_WORDS, is the number of words ap pearing in the entire file. Each word in the wordvec is represented by a unique numerical code, the "width" indicating the number of integers in the code (the RA uses two integers per code). The NUM_DOCS field indicates the number of documents containing the word specified by the associated wordcode. The word-count variables DOC-1, DOC-2, . . . , DOC-N1 each correspond to a document containing the word, and reflect the numb of occurrences of the word divided by the total number of words in the the document. A word offset file contains the file offsets for each word in the wordvec file, and is used to overcome the difficulties that would attend attempting to locate a particular wordcode in the wordvec file. Because each wordcode in the wordvec file can be associated with an arbitrary number of documents, locating a particular wordcode would require searching wordcode by wordcode, jumping between wordcodes separated by the arbitrary numbers of intervening word-count variables. To avoid this, a "wordvec offset" file is used to specify the location of each wordcode in the wordvec file.
(width*uns int) (long)
WORDCODE-1, OFFSET-1,
WORDCODE-2, OFFSET-2,
etc.
Since each entry has a fixed length, it is possible to perform rapid binary searches on the wordvec offset file to locate any desired wordcode. Accordingly, for each word in the query vector, the RA first looks up the word in the word offset file, and from that the word's entry is looked up in the wordvec file. An array of document similarities is used to maintain a running tally of documents and their similarities, in terms of numbers of word matches, to the query vector. The array is sorted by similarity, with the most similar documents at the top of the list. Similarity is computed for each word in the query vector by taking the product of the query-vector entry and the weight of each document in the corresponding wordvec file. To normalize this product, it is then divided by the query-vector magnitude (computed in the same manner as the document magnitude) and also by the document magnitude. The final value is added to the current running-total similarity for that document, and the process repeated for the next word in the query. In summary, the query vector is analyzed wordcode by wordcode, with the similarities array indicating the relevance to the query of each document. When computing the similarity of a query to an indexed document, it is preferred to employ a "chopping" approach that prevents an indexed word in a document from having a higher weight than the word has in the query vector. If the weight of the word in the indexed document is higher than its weight in the query vector, the document weight gets "chopped" back to the query's value. This approach avoids situations where, for example, a query containing the word "spam" as just a single unimportant word will not get overwhelmingly matched to one-word documents (which have the highest possible eight) or documents like "spam spam spam spam eggs bacon spam. . ." This method is slower on indexing and the index files take more space, but is much faster on retrieval because only documents containing words in the query are even examined. The other files created on indexing are a location file (doc_locs) containing a mapping between document number and filename for that document, a titles file containing the information for the one-line summary (titles), offset files for doc_locs and titles (dl_offs and t_offs) to do quick lookups, and a window-offset file specifying where to jump in a file for a particular portion of a windowed document. While the RA offers substantial capabilities for automated, "observational" retrieval, the cues it utilizes to identify possibly relevant documents are limited to word similarities. This is adequate for many computational tasks and clearly suits the traditional desktop environment of everyday computing: if the user is engaged in a word-related computational task, word-based cues represent a natural basis for relevance determinations. In other words, the current information reliably indicates the relevance of similar information. More broadly, however, human memory does not operate in a vacuum of query-response pairs. Instead, the context as well as the content of a remember episode or task frequently embodies information bearing on its relevance to later experience; the context may include, for example, the physical location of an event, who was there, what was happening at the same time, and what happened immediately before and after. As computer components grow smaller and less expensive, so-called "wearable" computers that accompany the user at all times become more feasible. Users will perform an ever-increasing range of computational tasks away from the desktop and in the changing environmental context of everyday life. Consequently, that changing context) will become more relevant for recall purposes. Even now, inexpensive laptop computers allow users to monitor their physical locations via global-positioning systems ("GPSs") or infrared ("IR") beacons, and to access various kinds of environmental sensors or electronic identification badges. Since information is created in a particular context, the atributes of that context may prove as significant as the information itself in determining relevance to a future context. Contextual "meta-information" is not limited to physical surroundings. Even in traditional desktop environments, where for practical purposes the physical context remains constant, meta-information such as the date, the time of day, the day of the week, or the general subject can provide cues bearing on the relevance of information (regardless of, or more typically in addition to, the content of the information itself). Word-based searching and retrieval systems such as the RA are incapable of capturing these meta-informational cues. SUMMARY OF THE INVENTION The present invention improves on the RA by extending its comparative capabilties beyond word similarities. The invention monitors various aspects of the user's computational and/or physical environment, and utilizes these as bases for relevance suggestions. Accordingly, unlike the RA, the "context" for assessing relevance is not what the user is typing, but can instead be any kind of information about the user's current situation. Examples include the user's current location (room or GPS location), the time of day, the day of the week, the date, the subject being discussed, person being talked to, etc. In this way, the invention can remind a user of personal information relevant to the current environment, or use environmental cues in effect as search vectors in a broader search that extends beyond these cues. As a result, the invention can be implemented only in the traditional computing environment in which RA operates, but also in fundamentally different environments. For example, the invention may be implemented as wearable or portable memory aid. In the RA, only the words within an indexed document are used to determine relevance. In accordance with the present invention, by contrast, these documents may be associated with a wide range of meta-information (i.e., information about the information), and it is this meta-information that is used to determine relevance--either alone or, if desired, in combination with the lexical comparisons implemented by RA. Meta-information about a document can be entered explicitly by a user, can be tagged automatically when the information is first created, or can be generated by analyzing the structure of the document. For example, if a student were writing notes in class, she could explicitly write the current date in a special header at the top of the notes; the date would then function as meta-information searchable by the present invention. Alternatively, the notes might automatically be tagged with the date based on the system clock. Finally, if the notes were instead e-mail, the e-mail would already bear a timestamp, so a system configured to recognize the structure of an e-mail document could glean the date the mail was sent from the existing e-mail header without modification or special handling. Meta-information useful in the context of the present invention can include far more than just date, of course, and the invention works best if several different kinds of meta-information are available. Some examples illustrate various forms of meta-information and their relevance to the capabilities of the present invention: Scenario #1: A student takes notes in class, and as her notes are saved as files, they are automatically tagged with several pieces of meta-information, including the room in which the notes were taken, the date, the day of the week, and the time of day. As she enters the classroom a week later, an infrared (IR) beacon broadcasts the room number to her wearable computer. The time of day and day of the week are also available to the computer from its system clock, and the invention automatically brings up the previous week's class notes as a "relevant" document on her computer. Scenario #2: A salesman is at a trade show and meets a potential client at the booth. He does not recognize the client, but the trade show has supplied everyone with name badges that also broadcast the person's name via an IR carrier beacon. The salesman's wearable computer receives that information from the potential client, and matches the person name to a note file written two years ago at a previous trade show. The notes concerned a previous meeting in which the potential client had listed his needs and business plans for the future; since at that previous meeting there were no active badges, the salesman had explicitly tagged the note with the person's name by typing it in. Because of the name match, the invention now displays the relevant information on his eyeglass-mounted computer display, allowing him to make a more focused sales pitch. Scenario #3: A tourist is visiting Seattle for the first time, and his car is equipped with a dashboard computer running the invention. As he drives around, the invention brings up notes from his wife's previous trip to Seattle, based on the location supplied from the car's global-positioning system and the time of day. He winds up going to a restaurant in which his wife had entered the note "Great place for lunch--try the fish." Scenario #4: A businesswoman has indexed her day-planner files, and the invention has gleaned the dates, times, and locations for appointments from the structured files. The invention reminds her of the time for her dentist appointment as this draws near. When she drives by the grocery store on the way back from the dentist, however, her location (supplied by GPS) triggers the invention to automatically remind her of her calendar entry "get birthday cake [Quality Foods]." In this case, the calendar entry was tagged both with a date and a machine-readable location. Accordingly, in a first aspect, the invention provides an apparatus for context-based document identification. The apparatus, ordinarily implemented on a programmable computer, includes a database for indexing a plurality of documents in terms of meta-information that specifies contextual information about the document or its contents; means for acquiring current contextual information (e.g., regarding a user's physical or computational environment); means for searching the database to locate documents whose meta-information is relevant to the current contextual information; and means for reporting the identified documents to the user. In a second aspect, the invention comprises a method of identifying document from a stored document database in response to contextual information. The method comprises indexing each stored document in terms of meta-information specifying contextual information about the document; acquiring current contextual information; identifying stored documents whose meta-information comprises information relevant to the current contextual information; and reporting the located documents to the user. BRIEF DESCRIPTION OF THE DRAWING The detailed description below refers to the accompanying drawing, which illustrates a representative hardware platform for practicing the invention. DETAILED DESCRIPTION OF AN ILLUSTRATIVE EMBODIMENT While the present invention can utilize any kind of meta-information about a document, certain kinds of sensors are particularly preferred. It should be stressed, however, that meta-information need not be provided by physical or computational sensors, but instead may be entered explicitly by the user, or through analysis of the structure of the document currently being indexed (or viewed by the user). Furthermore, the term "document" need not connote or be limited to a traditional text file; instead, the context-based approach of the present invention can be used to identify materials such as images, video or audio (either explicitly or automatically recorded) not easily searchable by conventional means. A representative list of meta-information items useful in accordance with the present invention, along with potential sources and the meaning attributable to the meta-information, is as follows:
Meta-information: LOCATION
Supplied by: IR beacon, GPS, human-entered, included in document
Meaning: Place where note was taken, information
regarding this place
Meta-information: PERSON
Supplied by: IR-transmitting name tag, human-entered,
video face-recognition, biometrics (voice-print),
included in document (e.g., "from" field in e-mail)
Meaning: Person/people who were there when note was taken,
information regarding this person; person/people
otherwise associated with document (e.g., author)
Meta-information: DATE (e.g., date and timestamp)
Supplied by: System clock, included in document (e.g., calendar
entry), entered by person
Meaning: Date/time when note was taken, information
regarding this time
Meta-information: TIME-OF-DAY
Supplied by: System clock, included in document (e.g. calendar
entry), entered by person
Meaning: Time-of-day when note was taken, information is
regarding this time
Meta-information: DAY-OF-WEEK
Supplied by: System clock, included in document (e.g. calendar
entry), entered by person
Meaning: Day of week when note was taken, information is
regarding this day of the week.
Meta-information: SUBJECT
Supplied by: Included in document (e.g., subject line in e-mail,
or key words in technical paper), entered by person,
speech-recognition software doing word spotting
Meaning: This is the subject of a piece of information, or
the subject of a current conversation or discussion
Meta-information: DOMAIN-SPECIFIC INFO/OTHER
Supplied by: Variable
Meaning: Many specific applications have special kinds of
information that may be used as meta-information.
For example, a shopper might be
interested in products names. When the shopper
passes an item on the shelf at the supermarket,
the invention might remind him that that product
is on his shopping list.
Meta-information: Text being edited, text on screen
Supplied by: Current computational situation (on screen)
Meaning: This information is not really meta-information, but
is rather the information itself, vectorized in the
same way the RA vectorizes information.
This information may be treated the same as any other
sensory information.
Many kinds of meta-information, such as people's names, can be represented as strings of words. Such information is considered "discrete," in the sense that a person either is or is not Jane Doe, and if she is not Jane Doe she cannot be "a close second." Other kinds of information are "continuous," such as times or latitude and longitude. With continuous information, values can be near each other, far from each other, or anywhere in between. Refer now to FIG. 1, which illustrates, in block-diagram form, a hardware platform incorporating a representative, generalized embodiment of the invention. As indicated therein, the system includes a central-processing unit ("CPU") 100, which perform operations on and interacts with a main system memory 103 and components thereof. System memory 103 typically includes volatile or random-access memory ("RAM") for temporary storage of information, including buffers, executing programs, and portions of the computer's basic operating system. The platform typically also includes read-only memory ("ROM") for permanent storage of the computer's configuration and additional portions of the basic operating system, and at least one mass storage device 106, such a hard disk and/or CD-ROM drive. All components of the platform are interconnected by and communicate over, a bidirectional system bus 110. The platform preferably also includes one or more sensors 113 for gathering environmental or physical information. Sensors 113 may include, for example, means for ascertaining the current geographical location of the platform (by means of a GPS circuit), local environmental or positional information (by means of an IR beacon, radio broadcast receiver, or transducer circuit), or the identities of individuals in the immediate area (by means of a receiver for IR-transmitting name tags, video face-recognition circuitry, or biometric-analysis circuitry (e.g., voice-print)). The operation of sensors 113 is regulated by a meta-information interface ("MII") 116, which also provides access to the information obtained by currently active sensors. In addition, MII 116 provides access to meta-information originated by the operating system, e.g., the date, the time of day, and the day of the week. The user interacts with the platform by means of, for example, a keyboard 120 and/or a position-sensing device (such as a mouse) 123, and an output device 126 (a conventional display or, for example, audio output from headphones or an automobile dashboard speaker). It should be stressed, however, that these components need not take the form normally associated with desktop or laptop computers, since the invention is amenable to use with wearable data-processing equipment. Thus, display 126 may be an eyepiece projecting a virtual image instead of a CRT or flat-panel screen, and positionsensing device 123 may be a small hand-held sensor pad or a wireless mouse. Furthermore, the components of the platform need not reside within a single package. Again, given the varying architectures of actual and proposed wearable computing systems, different components may reside in different physical locations, linking to one another by wired network circuits or bodyborne electrical signals. The main memory 103 contains a group of modules that control the operation of CPU 100 and its interaction with the other hardware components. These modules are implemented as executable machine instructions, running (by means of CPU 100) as active processes effectively capable of interacting (i.e., exchanging data and control commands) as illustrated. An operating system 130 directs the execution of low-level, basic system functions such as memory allocation, file management, and operation of mass storage devices 106. At a higher level, an analyzer module 133 directs execution of the primary functions performed by the invention, as discussed below; and instructions defining a user interface 136 allow straightforward interaction over display 126. User interface 136 generates words or graphical images on display 126 to facilitate user action and examination of documents, and accepts user commands from keyboard 120 and/or position-sensing device 123. The current document (i.e., the file with which the user is interacting) resides document buffer 140. A document index buffer 143 contains a vectorized index of all candidate documents, which are themselves stored on mass storage device 106. Meta-information obtained from MII 116 is stored in a memory partition 146, and is used by analysis module 133 in performing searches of the index buffer 143 in accordance with the invention. Memory 103 may also contain a series buffers 150 for storing document templates, which are used by analysis module 133 to index and, possibly, to locate meta-information as discussed below. Again, the system may contain large numbers of templates stored on device 106, only some of which reside at any one time in buffers 150. It must be understood that although the modules of main memory 103 have been described separately, this is for clarity of presentation only; so long as the system performs all necessary functions, it is immaterial how they are distributed within the system and the programming or hardware architecture thereof. Furthermore, while arrows indicate the operative relationships among the memory modules, and dashed arrows the relationships between memory modules and hardware components, the actual basis for communication is provided by operating system 130 and CPU 100. Analysis module 133 first indexes all the documents in a corpus of data (which, again, are stored as files mass storage device 106, which is assumed for explanatory purposes to be a hard disk), and writes the indices to disk. Unlike the RA, the invention preferably keeps several vectors for each document. These include not only the wordvec vector for text (if any) in the document, but also vectors for meta-information, e.g., subject, people, time, date, day of week, location, etc. Analysis module 133 indexes a document as follows: 1. Identification: Documents are broken into types based on a template file specific to the particular type of document. For example, an e-mail file includes the following recognition criteria: Template plain_email { Recognize {anyorder {startline, "From: "} {startline, "Date: "}} . . . Because the invention is intended to search different kinds of documents, recognition criteria are used to indicate the manner in which particular kinds of files are organized to facilitate the search. For example, recognition criteria can set forth file organization, indicating the manner in which files are separated from one another. The recognition criteria can also indicate where specific pieces of meta-information (e.g., the date, subject, recipient, etc.) are located, as well as the location of document text. Recognition criteria for particular types of documents are contained in templates. The invention checks for the presence of a template matching a particular document type, and if one is found, the recognition criteria therein are employed. If no template is matched, the document is considered to be raw text. (If more than one template is matched, the first is used.) 2. After the document is identified, different fields are extracted, again based on the template. For example, the e-mail template continues: Delimiter {startline, "From"} Format {{anyorder {startline, "From: ", PERSON, ".backslash.n"} {startline, "Date: ", DATE, ".backslash.n"} optional {startline, "Subject: ", SUBJECT, ".backslash.n"}} ".backslash.n.backslash.n", BODY} } Bias 21100000 The delimiter command explicitly identifies the separator between one document of this template type and another, should they both reside in the same file. (For example, a plain e-mail archive may contain several pieces of mail in the same file, all separated by the word "From" plus a space at the start of a line.) The remainder of the template specifies that the "From:" line contains the person or people associated with this document. and the line starting with "Date:" contains the date/timestamp of the document. Templates can also be employed during document creation, modification or storage to guide the acquisition of meta-information from MII 116 and its association with the document (the template typically being invoked in this situation by the application program responsible for creating the document). That is, a template field may point not to data within the document, but to meta-information acquired by MII 116 and/or sensors 113 that is to be stored with the document as part of its header. Suppose, for example, that the template for a document specifies a meta-information field indicating the geographic location where the document was created. Depending on the configuration of the system, that information may be continually acquired by a GPS sensor 113 and periodically placed in meta-information memory partition 146 by MII 116. Alternatively, GPS sensor 113 may be only intermittently active in response to a command issued by MII 116. In this case, the template instructs analysis module 133 to request a geographic location from MII 116, which in response activates the GPS sensor and loads the acquired value into partition 146. Numerous variations on these approaches are of course possible, combining varying frequencies of sensor activity, data acquisition and storage, as well user action. In this way, meta-information may be automatically associated with a document at appropriate junctures without distracting the user. For indexing purposes, the template structure of the present invention may be similar to the templates used by the RA, but with a different interpretation. With the RA, the date and person information was only used to create the one-line summary. In accordance with the present invention, each type of meta-information is placed in its own vector, and a single vector represents each type of meta-information supported by the invention. The final entry in the template file is the bias number for the particular type of file, which ranks the fields of the file in terms of importance. In the e-mail example above, the bias means that the body of the e-mail is most important, person and date fields are secondary (in a ratio 2 to 1 to 1), and no other fields are used to compute similarity. 3. Vectorization Once information is parsed out of the document, it is encoded and vectorized. The encoding is as follows. The invention uses three integers to encode words (as cot pared with the two-integer wordcodes of the RA). Consequently, each character is 6 bits, with the most significant 6 bits in the first integer being the type field. Bits wrap to the next byte, as follows: tttttt 111111 222222 333333 444444 55=32 bits 55555 666666 777777 888888 999999 0000 00 111111 222222 333333 444444 555555 =15 characters, 6 bits type
Code Type
0x0 Body
0x1 Text Location (descrete)
0x2 Subject
0x3 Person
0x4 Date
0x5 Time
0x6 Day
0x7 GPS Location (continuous)
Characters are packed into a 6-bit packed representation: a-z=01-1A 0-9=1B-24 .sub.-- =25 -=26 !=27 Anything else gets mapped to ascii(c) & 0x3F (lower 6 bits) Day of week is simply encoded as a number, 0-7, plus the type bits. Date is encoded as type bits plus number of seconds since the epoch (Jan. 1, 1970, 12:00:01AM). Time of day is encoded as number of seconds since midnight, plus type, bits. Any meta-information that can be represented by text (e.g., subject lines, room names, people names, bodies of text, etc.) is encoded in accordance with the above scheme. Like the body of text, each word in these text strings is encoded seperately and added to a vector. Vectors of discrete (text) data are all stored in one file, but the vectors are still conceptually distinct and are distinguished by their type bits. The file format for discrete type information is the same as the wordvec file format. Non-discrete information is stored in its own separate file, in order to better search for information that is "close enough" to information in a query vector. 4. Determination of relevance For each element of each discrete vector in a query--the generation and vectorization of which is described below--the algorithm used by the RA may be used to determine relevance to documents in the corpus. For "continuous" vectors (e.g., date, GPS location, etc.), the algorithm is modified to permit degrees of similarity, producing a value between 0.0 and 1.0. For example, for each date in the query's date vector, a binary search on the date_file is performed to find any dates within 24 hours of the query date. These are given a distance value based on how close--that is, how temporally proximate--the values are to the query date. These distance values are converted to weighted similarity, and are added to the similarity of the date vectors in the same way as in the discrete case. 5. Weighted addition of vectors The result of the foregoing operations is a single similarity value for each type of meta-information. These values are associated with each document in the indexed corpus, and are used to compute the overall similarity using bias values for query and document types, by the following formula: Query biases=bq pq sq lq dq etc. (i.e., body_query_bias, person_query_bias, etc.) Index bias=bi pi si li di etc. (i.e., biases for this indexed document, gleaned from the template file) Non-normalized biases=bq*bi pq*pi sq*si lq*li dq*di etc. Normalized biases=bq*bi/M pq*pi/M sq*si/M lq*li/M dq*di/M etc. where M=magnitude=(bq*bi+pq*pi+sq*si+lq*li+dq*di) Each vector similarity is multiplied by its respective bias, and the resulting biased similarity is summed, to produce an overall similarity between zero and one. Analysis module 133 preferably generates search queries autonomously from the current document in document buffer 140 or by reference to a current context. In the former case, analysis module 133 classifies the document either by its header or by reference to a template, and extracts the appropriate meta-information. In the latter case, the user's physical or interpersonal surroundings furnish the meta-information upon which the query is based. It is not necessary for the documents searched or identified to correspond in type to a current document. Furthermore, the query may not be limited to meta-information. Instead, the invention may utilize both a meta-information component (with relevance to candidate documents determined as discussed above) and a text component (with relevance determined in accordance with the RA). Analysis module 133 may also respond to queries provided by the user directly. For example, the user may specify a search in accordance with meta-information not ordinarily associated with the current document or the current surroundings, requesting, for example, a particular kind of document (or, indeed, any document generated) the last time the user was in or near a specified location. The query is vectorized as described above in connection with the RA. Analysis module 133 supplies a ranked list of the most relevant documents, which may be continually, intermittently, or upon request presented to the user over display 126. If desired, or upon user command, the list may be pruned to include only documents whose relevance level exceeds a predetermined threshold. It will therefore be seen that the foregoing represents a versatile and highly robust approach to document searching and recall. The terms and expressions employed herein are used as terms of description and not of limitation, and there is no intention, in the use of such terms and expressions, of excluding any equivalents of the features shown and described or portions thereof, but it is recognized that various modifications are possible within the scope of the invention claimed.
|
Same subclass Same class Consider this |
||||||||||
