Method and apparatus for semantic token generation based on marked phrases in a content stream
6405199
Abstract
A content stream having a plurality of phrases is accessed. One or more phrases in the content stream are marked. For example, the phrases may be annotated and/or highlighted. The marked one or more phrases are extracted and are processed to determine semantic information. A token is created based on the semantic information. Such a token might be used as part of a method for assigning semantic characterization to a content stream. For example, the token could be associated with one or more profiles. At least a portion of the content stream could be represented in a semantic space corresponding to one or more profiles, and/or a semantic record might be instantiated from the profiles and compared with other semantic records.
Claims
What is claimed is:
Description
TECHNICAL FIELD
Phrase      Token(s)
java        coffee, computer language, . . .
latte       coffee, . . .
espresso    coffee, . . .
skim        milk, . . .
cream       milk, . . .
milk        milk, . . .
When the phrases "java", "latte" or "espresso" are parsed from the content stream, the token dictionary is referenced to determine that the token "coffee" is associated with the content stream. As such, one can determine that the principal associated with the content stream is referencing "coffee." Likewise, when the phrases "skim" or "cream" are parsed, one can determine that the associated principal is referencing "milk." As illustrated in the last entry, the phrase can be very similar to the associated token. Phrase extraction can be literal (i.e., parsing the content stream directly) or interpretive (i.e., interpreting the content of the content stream). Some examples of interpretive phrase extraction include relevancy determinations (such as those found on Internet search engines like ALTAVISTA, EXCITE, etc.), linguistic morphology, tagged content streams, and the like. For instance, using an interpretive phrase extraction, such as linguistic morphology, the word "cool" could be evaluated against the context in which it is used to determine whether the associated token should be "cold" or "excellent".
The extracted tokens are then passed to the record build mechanism 36, which could be embodied in a process, a program, or other forms. The record build mechanism 36 accesses the profile store 37, which contains a plurality of profiles. In the present example, the profile store is a database contained on a computer readable medium. The extracted tokens are compared with the profiles. Each profile comprises a plurality of associated tokens. Each extracted token may participate in the instantiation of multiple semantic records. Preferably, each profile defines a semantic concept. For instance, a profile may correspond to "drinks", which profile could include the tokens "coffee", "milk", "soft drinks", etc. The profiles act as a class definition where the resulting semantic record is an instantiation of the corresponding class.
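The dictionary lookup and profile matching described above can be illustrated with a minimal sketch. It is not the patented implementation: the names TOKEN_DICTIONARY, PROFILE_STORE, extract_tokens, and instantiate_records are hypothetical, the "programming" profile and the principal "alice" are invented for illustration, and only literal (whitespace-based) phrase extraction is shown.

```python
from collections import defaultdict

# Hypothetical token dictionary mirroring the example table: phrase -> tokens.
TOKEN_DICTIONARY = {
    "java": ["coffee", "computer language"],
    "latte": ["coffee"],
    "espresso": ["coffee"],
    "skim": ["milk"],
    "cream": ["milk"],
    "milk": ["milk"],
}

# Hypothetical profile store: profile name -> tokens belonging to that concept.
# "drinks" follows the text; "programming" is an invented second profile.
PROFILE_STORE = {
    "drinks": {"coffee", "milk", "soft drinks"},
    "programming": {"computer language"},
}


def extract_tokens(content_stream):
    """Literal phrase extraction: split on whitespace and look each word up."""
    tokens = []
    for word in content_stream.lower().split():
        phrase = word.strip(".,;:!?\"'")
        tokens.extend(TOKEN_DICTIONARY.get(phrase, []))
    return tokens


def instantiate_records(principal, tokens):
    """Create one record per profile that contains at least one extracted token."""
    counts = defaultdict(lambda: defaultdict(int))
    for token in tokens:
        for profile, members in PROFILE_STORE.items():
            if token in members:
                # A single token may contribute to several records.
                counts[profile][token] += 1
    return {
        profile: {"principal": principal, "token_counts": dict(by_token)}
        for profile, by_token in counts.items()
    }


tokens = extract_tokens("I ordered a latte with skim milk.")
records = instantiate_records("alice", tokens)
```

Here the profile plays the role of the class definition and each dictionary returned by instantiate_records plays the role of an instantiated semantic record tied to a principal. Frequency thresholds and other qualifying criteria are deliberately left out.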
The profiles are instantiated into semantic records based on the extracted tokens. Further, the semantic records are associated with the principal associated with the content stream from which the tokens were extracted. In one embodiment, the profiles additionally store a frequency threshold of tokens along with other qualifying criteria for creating one or more semantic records, admitting tokens to them, deleting tokens from them, or deleting or destroying them. The instantiated profiles are stored in the semantic record store 38, which stores a plurality of semantic records.
In one embodiment, the semantic record represents the content stream in a semantic space, such as a TVS. A semantic space is a representation of the domain of interest. The semantic space is modeled by a TVS such that the axes of the TVS span the domain of interest in some metric that can measure the position, direction, and distance between any two points in the TVS. For example, measuring how much a person likes milk requires us to develop a means and method to measure and quantify the "likes milk" metric to be placed on the "likes milk" axis of the TVS. Another example would be to represent "taste" in a TVS. In this case there could be at least three axes: "bitter," "sweet," and "salty." Extracting tokens describing the taste of a thing will yield semantic records that can be positioned in the TVS such that things that taste like "apples" will tend to clump together in the multi-axis space of the TVS. As such, each semantic record is a function of a principal, the associated content stream, and time. Each semantic record contains a variety of points within the TVS representing or characterizing the principal's activity within that semantic space. One reference that discusses mapping in a TVS is Latent Semantic Indexing Is An Optimal Special Case Of Multidimensional Scaling by Brian T. Bartell, Garrison W. Cottrell, and Richard K. Belew, which is hereby incorporated by reference.
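As a rough illustration of the "taste" example, the sketch below places tokens as points on the three named axes and measures how close they are. The coordinate values, the names TOKEN_POSITIONS, distance, and centroid, and the choice of Euclidean distance as the metric are all assumptions made for this sketch; the text does not specify a metric.

```python
import math

AXES = ("bitter", "sweet", "salty")  # the three axes named in the text

# Hypothetical coordinates along (bitter, sweet, salty); the values are invented.
TOKEN_POSITIONS = {
    "apple": (0.1, 0.7, 0.0),
    "pear": (0.1, 0.8, 0.0),
    "pretzel": (0.1, 0.1, 0.9),
}


def distance(p, q):
    """Euclidean distance between two points in the space (an assumed metric)."""
    return math.dist(p, q)


def centroid(points):
    """Mean position of a cloud of points, e.g. the points of one semantic record."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(AXES)))


# Things that taste alike should sit closer together than things that do not.
apple_to_pear = distance(TOKEN_POSITIONS["apple"], TOKEN_POSITIONS["pear"])
apple_to_pretzel = distance(TOKEN_POSITIONS["apple"], TOKEN_POSITIONS["pretzel"])
```

With these invented coordinates, "apple" and "pear" fall near each other while "pretzel" lies far away along the "salty" axis, which is the clumping behavior the text describes.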
Preferably, the semantic record store is dynamic. As time transpires and the content stream varies, new semantic records are added by the record build mechanism 36. Likewise, existing semantic records are updated and modified by the record build mechanism 36 based on the changing content stream associated with the principal. In other words, the cloud of points within the TVS can vary, thereby changing the strength, frequency, location, etc. of the principal within that semantic space. Preferably, after a period of inactivity exceeding a predefined threshold defined in the associated profile, a given semantic record can be destroyed by the record build mechanism 36.
FIG. 4 depicts an example of a system for managing the token dictionary 35 and the profile store 37. Preferably, both the token dictionary 35 and the profile store 37 are editable and configurable. In this example, the profile definition mechanism 41 is the principal engine for such management. The profile definition mechanism 41 reads and writes to the token dictionary 35 and the profile store 37. Preferably, the profile definition mechanism 41 consolidates the information such that the definitions of entries in the token dictionary 35 and the profile store 37 are normalized to each other. Information about the various principals is retrieved from the directory of principals 46, such as a database, a distributed directory, an index, or the like. The directory of principals 46 is editable and updatable through the external source of directory information 47. The external source of profile information 42 interfaces with the profile definition mechanism 41 so as to update and modify the token dictionary 35 and the profile store 37. For instance, predefined profiles 43, user 44 input, agent 45 input, and the like can be used to modify the token dictionary 35 and the profile store 37.
FIG. 5 illustrates yet another example of the present invention. The connection 51 has access to a content stream.
The token extraction mechanism 52 parses phrases from the content stream, which phrases are resolved to one or more tokens by referencing the token dictionary 53. The resolved phrases and tokens are stored as tagged content 54, which can be accessed by other mechanisms and processes. Preferably, the presence of a token can be mapped back to the associated phrase using the tagged content 54. The record build mechanism 55 receives the extracted tokens from the token extraction mechanism 52. The profile store 56 is accessed and, where appropriate, semantic records are instantiated from profiles. The semantic records are then stored in the semantic record store 57.
The search engine 70 allows the tagged content 54 and semantic record store 57 to be searched. Further, two or more semantic records can be compared to one another. For instance, for a given semantic space like a TVS, each profile with a defined mapping to that TVS may be represented as a scalar field or state function which evolves through time. The methods of functional analysis and operator theory can be applied to the state functions, and the results of such methods can be used to compare the various semantic spaces. For instance, semantic records could be compared to see if the principals are active in the same semantic neighborhood (i.e., near or far from one another), or whether the principals are converging toward or diverging from one another. Further, the search engine 70 may perform a search based on a request from another search engine 70.
The token extraction mechanism 62 operates to develop a semantic space of a semantic space. The token extraction mechanism 62 parses the semantic records in the store 57 into phrases, which are resolved to tokens in the token dictionary 63. The resolved phrases and tokens are stored as tagged content 64.
The extracted tokens are passed to the record build mechanism 65, which accesses the profile store 66 to create semantic records, which are then stored in the semantic record store 67. The new semantic record can be viewed as the first derivative of the semantic records contained in the semantic record store 57. One of ordinary skill in the art will readily recognize that many higher derivatives could readily be created using this teaching.
FIG. 6 illustrates an embodiment of another aspect of the present invention. The query mechanism 81 uses a plurality of semantic record stores 82, 83 to access a plurality of semantic records. At issue is how to compare semantic records instantiated from different profiles. The normalize mechanism 87 receives from the profile store 86 information about the profiles, in this example Profile 1 and Profile 2. The normalize mechanism 87 defines a normalized profile 88 that allows the transformation from the original profiles to the normalized profile 88. The build content space mechanism 90 receives the tagged content 84, 85, and builds new semantic records using the normalized profile 88. The new semantic records are then stored in the semantic record store 89, which can be readily queried. In addition, the build content space mechanism 90 stores the tagged content 91.
As discussed above, tokens can be generated and extracted in a variety of different ways. One such technique, which may be implemented independently or in combination with the foregoing teachings, involves attributing contextual semantic information to content when work is practiced on the content. The work is monitored, codified, and analyzed to generate one or more tokens based on the work itself. One example of such a technique is the method 100 illustrated in FIG. 7. In steps 101 and 102, a content stream is accessed and monitored. At step 103, work on content in the content stream is performed. Typically, the work is associated with a principal.
By way of example, one type of work includes marking the content, such as highlighting, adding a note, comment or sound annotation, adding a document link or hyperlink, and the like. For instance, a principal might use a highlighting marker to emphasize content, use a specific color of highlighting marker to emphasize content where multiple colors are used in a document, or use a note feature to leave notes of explanation concerning a region of a document. During step 104, the work is evaluated and processed to determine semantic information related to the work. Each action in context with the content and the work practice being executed allows in-context tokens to be generated in step 105. Where the principal is a group, content being worked on by the group can provide even more semantics, taking into account the members of the group, their primary focus, etc. Further semantics can be identified as marked content, like notes, comments and annotations, are themselves worked on.
Tokens can be generated from work performed on any content, whether it be textual, audio, visual or other. For instance, token generation can include identifying an object in a bitmap image, for example, a visual element in the RGB domain that is identified as an "eye." A gray scale histogram of the same "eye" region would have additional, different semantic information. The same processing can be done in other spatial domains, signal domains, etc. Sources of such domains can include any medium that can carry a signal, such as sound, visual, tactile, etc.
FIG. 8 depicts an example of a system 110 for generating tokens based on work on content. In use, the system 110 extracts tokens based on the marking and annotating of an electronic document. Any document viewing mechanism can be used to generate the tokens as long as the viewing technology provides some mechanism for marking (e.g., highlighting).
An electronic document may comprise text, rich text, vector drawings, raster or bitmap drawings, sound, a database record or set of records, meta data, etc. The resulting tokens are derived from the markings made by the principal 115 to the electronic document, based on the "locations" (virtual, mapped, or real) of the markings in view of each marking's relationship to the document.
The two clouds 111, 112 represent a network that has a common connection 113 through which a content stream can be accessed. The marking tool 114 is associated with a principal 115 and is used to work on content. In this example, the work involves marking the content. The specific form of the marking tool 114 will depend on the type of content which is being marked. The monitoring agent 116, such as a process or program having a series of instructions in a computer system, has access to the connection 113 and the content stream. The monitoring agent 116 extracts the content and markings made by the marking tool 114 for processing by the semantic extraction agent 117. Optionally, the monitoring agent 116 can additionally extract other data, such as the identity of the associated principal 115, as may be needed for later processing. The semantic extraction agent 117 then analyzes the markings to determine the associated semantic content of the markings. This information is passed to the token creation module 119, which creates the actual tokens. In one embodiment, the tokens are defined in a semantic space. Tokens may be generated as the semantic extraction agent 117 evaluates the order (e.g., chronology of creation), locality (e.g., position within the document as defined by the semantic space), type (e.g., highlighting, notes, comments), attributes (e.g., color, or annotation type such as text, marking, or sound), and the like of the markings.
This evaluation may occur in real time or through some mechanism that provides the marking information in some order that fits the semantic space (one example of this order is chronology; when the token generator is described as "watching", this refers to this mechanism). If the user/agent has a mechanism for allowing a discipline (e.g., a marking schema) in the use of the marking tool 114, then more semantic information may be extracted. The token creation module 119 of the system does not rely on or specify any single semantic discovery/definition mechanism or group of such mechanisms. Rather, any semantic discovery/definition mechanism that produces a token that encapsulates the semantic meta data and source material used to produce the semantic mapping may be used. Optionally, monitoring the work practice that defines the markings can produce the semantics necessary to train the marking agent 118 to perform the same marking process within some range of acceptability. For instance, if the semantics of the work practice are used to identify an eye, then the marking agent 118 could "learn" that work practice and perform it with some degree of consistency. Training over time using the semantics of the work practice would refine this ability.
The following provides several examples of the method 100 and system 110 in use. Tokens can be generated without principal discipline. Assume the principal 115 is a single user/agent which is marking an electronic document. The monitoring agent 116 "watches" the user/agent's use of the marking tool 114 and the attributes of the tool. For instance, the user/agent uses yellow, green, and blue to highlight paragraphs and phrases in a text document. Each individual marking will possibly have some semantic information that can be extracted by the semantic extraction agent 117 (e.g., the nouns or a subject of the highlighted text revealed via linguistic morphology).
If the marking is for a single word, then the semantics may be derived from the phrase, sentence, and/or paragraph that the word is in. Further, all of the yellow markings may be combined and evaluated in a similar manner to see if various semantic mechanisms might expose concepts, summaries, topics, etc. The same would be done for green and blue. Further, notes in close proximity to a highlight (or some other marking) would be evaluated as a part of the semantic context for the particular color that is in close proximity. The token creation module 119 then creates tokens based on the extracted semantic content.
Tokens can be generated with a principal's discipline. If the user/agent can be disciplined to use a "marking schema", then the semantic content of markings and annotations becomes ever stronger. For instance, if the user/agent defines before applying any markings what each color means (e.g., explicit mapping to an existing semantic space, mapping to a set of keywords and/or concept words), then more semantic mechanisms can be applied intelligently to extract tokens (e.g., keywords associated with a color now become keywords lending context to the marking). Further, as a discipline is used it may be enhanced because of that use. If the discipline is mapped to a semantic store, then the semantic content of the discipline is enhanced as the semantic store is refined.
Tokens can be generated in a collaborative group. Assume no discipline exists on the part of a group of users/agents. In the case of generating tokens from a group of users/agents, locality can become very important. Since we cannot rely on any discipline to define a strong context, the context can be extracted based on locality. If several users/agents have marked the same phrase, then a context may be assumed between the several users/agents. Mapping the context to the various colors used by the group may yield more information.
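One way to picture locality-based context in an undisciplined group is to index markings by the phrase they cover and then look for phrases marked by more than one user. The sketch below does this under stated assumptions: the MARKINGS data, the users "A" and "B", and the function shared_contexts are hypothetical, and a marking is simplified to a (user, color, phrase) triple with exact phrase matching standing in for locality.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical markings as (user, color, marked phrase) triples.
MARKINGS = [
    ("A", "yellow", "token dictionary"),
    ("A", "yellow", "semantic record"),
    ("A", "blue", "profile store"),
    ("B", "green", "token dictionary"),
    ("B", "green", "semantic record"),
    ("B", "green", "content stream"),
]


def shared_contexts(markings):
    """Map each pair of (user, color) groups to the phrases both groups marked."""
    by_phrase = defaultdict(set)
    for user, color, phrase in markings:
        by_phrase[phrase].add((user, color))

    shared = defaultdict(set)
    for phrase, groups in by_phrase.items():
        for left, right in combinations(sorted(groups), 2):
            if left[0] != right[0]:  # only pair markings from different users
                shared[(left, right)].add(phrase)
    return dict(shared)


contexts = shared_contexts(MARKINGS)
```

Each key in the result pairs one user's color with another user's color, and the value lists the phrases that suggest a shared context between those two sets of markings. A real system would compare marked regions rather than exact strings.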
For instance, if User A used yellow where User B used green and some large percentage of the yellow and green markings are the same, then we have a good case for a shared context within the set of A's yellow markings and B's green markings. Further, any other annotations (e.g., notes, comments, document links, hyperlinks) that can be associated with the yellow and green markings may be used to enhance and extend the semantic value of the markings.
Tokens can be generated in a collaborative group with partial group discipline. If some of the members of a collaborative group are disciplined, then the mechanism of marking collaborative semantic links as described above can be strengthened. Again, if A's yellow and B's green have a significant overlap and A is disciplined, then B may draw from A's discipline.
Tokens can also be generated in a collaborative group with full group discipline. If all members of a group are disciplined, then the semantic context and mapping become even stronger. Where high correlations are found between users/agents, stronger disciplines may be used to enhance the semantic context and mapping in weaker disciplines.
As still a further example, as a user/agent applies markings, the order may be used to discover semantics that may be used in turn to discover other semantic coupling within or outside the electronic document. This can take the form of allowing the user/agent to make several passes through the document and allowing the discipline of a marking mechanism to be specified as secondary to the previous pass through the document (e.g., the first pass is bright yellow markings, the second is a lighter yellow showing a coupling to less important phrases or regions that are related to the first pass set, etc.). Evaluating the marking work process in real time may also yield new semantic information.
For example, if yellow markings were applied to a document in the order page 1, page 2, page 5, page 6, page 4, then page 4 may be a sub-concept or sub-context to one of the previous pages.
While some of the foregoing examples are founded upon text or rich text documents, any other type of content may be used by applying appropriate annotation and marking mechanisms. For instance, in a bitmap the location would be two dimensional and marked phrases can refer to groups of pixels, etc. Other electronic documents may have a marking and context metric defined (e.g., database) to yield the same results. While a text or rich text document has been used to illustrate the invention, one should not assume that only text or highlighting markers, etc. can be used. Any type of content that has a viewing and marking mechanism may be enhanced by the application of the invention.
The foregoing description of the preferred embodiment of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive nor to limit the invention to the precise form disclosed. Many alternatives, modifications, and variations will be apparent to those skilled in the art in light of the above teaching. Accordingly, this invention is intended to embrace all alternatives, modifications, and variations that fall within the spirit and broad scope of the amended claims.
