Content filtering for electronic documents generated in multiple foreign languages
6542888

Abstract

A system for collecting and categorizing metadata about content provided via the internet or intranet, regardless of the language of generation of the content. The content of each document is assigned token IDs, which token IDs are the same for any given topic irrespective of the language in which the document is written. Filtering of single language documents will generate a single output; whereas, multilingual documents will be divided into language segments with each segment being filtered by the appropriate language filter.

Claims

Having thus described our invention, what we claim as new and desire to secure by Letters Patent is:

Description

FIELD OF THE INVENTION
TYPE    SERVER          DIRECTORY                CHANNELS
Web     HR              /publish/benefits/401k   401k
Web     HR              /publish/jobopenings     Jobs
Web     Marketing       /publish/product/specs   Product Specs
Web     www.badco.com   /pub/productspecs        Competition Specs
Web     www.goodco.com  /pub/products/electrnic  Customer Products
PCFile  engineering     /projects/chipdesigns    Chip Designs
PCFile  marketing       /reports/companalysis    Competitive Anly.
FTP     engineering     /projects/status         Status Reports
Notes   engineering     /specs/chipspeed         A1200 Design
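The rows above can be read as a lookup from a source location to a channel. The following is a minimal sketch of such a lookup, assuming a simple directory-prefix match; the names SOURCE_CHANNELS and channels_for are hypothetical and only a few rows are reproduced.

```python
# Hypothetical lookup built from some of the rows above:
# (type, server, directory, channel).
SOURCE_CHANNELS = [
    ("Web", "HR", "/publish/benefits/401k", "401k"),
    ("Web", "HR", "/publish/jobopenings", "Jobs"),
    ("Web", "www.badco.com", "/pub/productspecs", "Competition Specs"),
    ("FTP", "engineering", "/projects/status", "Status Reports"),
]


def channels_for(source_type, server, path):
    """Return every channel whose mapped directory contains the given path."""
    return [
        channel
        for (stype, srv, directory, channel) in SOURCE_CHANNELS
        if stype == source_type
        and srv == server
        # Assumption: a document belongs to a channel when its path is the
        # mapped directory itself or lies underneath it.
        and (path == directory or path.startswith(directory + "/"))
    ]
```

A document crawled from any path under /publish/jobopenings on the HR web server would land on the Jobs channel without any analysis of its content.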
FIG. 2 provides a schematic illustration of the sources accessible to the Customer Intranet Server of the fictitious company, directly or through the System Server, and the channels that result from receiving or crawling those sources. Information gathered from external sources will also be mapped to the established channels, so that an end user can readily access all relevant information in a category or channel as the result of a single query. While some amount of categorization may be straightforward, such as those above-noted examples wherein any information obtained from a certain source will necessarily be provided on a given channel (i.e., with sites or site directories being mapped to the channels), the bulk of document categorization requires intensive analysis of the document contents.

In addition to the crawlers which automatically funnel documents obtained from certain sources into pre-established channels, there are two other primary means by which documents are categorized. The first, and most rudimentary, is categorization by manual user interface, whereby a system administrator (or even the document author) identifies the document to be loaded into the server and identifies the channels in which the document is to appear. The second, more complex, means is automatic categorization by content filtering, which is conducted by system components located at either the Customer Intranet Server or the System Server 10, the details of which are further provided below and in the co-pending application Ser. No. 08/979,248 (abandoned), entitled "Electronic Document Content Filtering", which is assigned to the present assignee and is being filed on even date herewith. Such automatic categorization can also be utilized at the Customer Intranet Server for the purpose of categorizing internal documents into channels which may match or differ from the channels provided by the System Server.
Such channel definitions can be applied as well to documents received from the System Server to fill customer-defined channels with news or other external documents.

After query processing and document content categorization, it is desirable to analyze the categories to ascertain if other relationships exist among the categories, which relationships themselves may be identified as new categories or channels. (See co-pending patent application (U.S. Pat. No. 6,182,066), entitled "Category Processing of Query Topics and Electronic Document Content Topics", which is assigned to the present assignee and is being filed on even date herewith.) The teachings of the foregoing applications are incorporated by reference herein, as are the teachings of co-pending patent application (U.S. Pat. No. 6,236,991), entitled "Method and System for Providing Access to Categorized Information from Online Internet and Intranet Sources," which is assigned to the present assignee and filed on even date herewith.

Once documents from both the internal and external sources have been categorized/assigned channels, both the documents and the assigned channels are stored in a local database at the Customer Intranet Server or associated customer location. Inventive components at the Customer Intranet Server match the channels assigned to each of the incoming documents with the user's interests as found in the user profile. Each document is then made available for access by, or is sent to, the user whose interests it matches.

The System Server's above-noted functions may be provided as part of a customer intranet, wholly outside of the customer domain, or divided in function between the two locations. In the "outside" example, all document collection and categorization would be done at the System Server as a service of the provider.
Documents found on the external internet, as well as those which may be supplied from the customer's own intranet and/or databases, would be analyzed and categorized at the provider location.

In the instance where the customer wishes to additionally be a provider to end users, two alternative scenarios are possible. Under the first scenario, an outside provider would still assemble and categorize documents from outside sources and make them available at the customer's server. The customer's server would also be adapted to perform assembly and categorization of "in-house" documents, merging of the in-house assemblage with the categorized documents from outside sources, matching the resultant merged documents to user request profiles, and disseminating the matching results to the user. The second alternative implementation would locate all categorization functionality at the customer location.

In all three implementations, the customer location would retain the capability for receipt of user request input, creation and storage of the user profile, matching of the user profile to the categories or channels into which the documents are placed, and provision of the matched documents for end user review. The customer site is provided with the capability for building applications to create a series of different user interfaces with different interaction means, different restrictions for user access (e.g., providing some users access to only documents from outside sources, while others would have access to both externally-obtained and internally-generated documents), and different levels of query and content complexity.

For the detailed descriptions of the processing "stages," including user query analysis and profile creation, document categorization, and matching, it is to be noted that the same types of analyses can frequently be applied at each stage.
For example, finding relationships between two seemingly disparate user query subject categories can parallel the effort to identify commonality of subject matter from two input documents, as well as a subsequent effort to match the profile to a category/channel. Therefore, where appropriate, the ensuing processes will reference one, two or all of the profile analysis, document content categorization, and matching stages.

Users of the system initially specify which topics are of interest. This specification takes the form of one or more of the following: selection of pre-defined user categories, modification of pre-defined user categories, or user-customized sets of queries. Each query represents a topic, and can identify a channel and additionally contain boolean, fuzzy, proximity and/or hierarchical operators. A set of topics preferred by a user is known as a user profile. The present method reduces each query to one or more vector entries, with the entry's index into the vector corresponding to a hash of the query's textual expression and the entry's value expressing the importance of that query to the overall topic/profile. A query can be either a single token (word or phrase) or a combination of tokens which includes boolean, fuzzy, proximity and/or hierarchical operators. Token IDs are assigned to each query item as hereinafter detailed.

Automatic query processing, as well as document content categorization, is optimized in the present invention by first tokenizing the content thereof. In such a tokenization process, all the words/phrases are first identified as units, then stemmed. After all stop words and phrases are filtered out, only a few of the original words/phrases are left. These surviving words/phrases are called tokens. The tokens are usually just the stems of the original words, or made-up labels which correspond to phrases. The stems or made-up labels are referred to as "terms".
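The tokenization steps just described can be sketched as follows. This is only an illustration under stated assumptions: the stop list, the phrase table, and the crude suffix stripper toy_stem are all hypothetical stand-ins, since the text does not specify the stemmer or the lists actually used.

```python
import re

STOP_WORDS = {"the", "a", "of", "in", "and", "is", "at"}  # assumed stop list
PHRASES = {("horse", "racing"): "HORSE_RACING"}  # assumed phrase -> made-up label


def toy_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Identify words/phrases as units, drop stop words, and return terms."""
    words = re.findall(r"[a-z]+", text.lower())
    terms = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASES:
            # A known phrase is treated as one unit with a made-up label.
            terms.append(PHRASES[pair])
            i += 2
            continue
        if words[i] not in STOP_WORDS:
            terms.append(toy_stem(words[i]))
        i += 1
    return terms


print(tokenize("The riders at the horse racing track"))
# → ['rider', 'HORSE_RACING', 'track']
```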
Terms are strings, and since the system must handle many thousands of terms, storing them as strings could consume a significant amount of computer memory. Therefore, a hash function is provided to assign unique token IDs to the terms (which may also consist of expressions containing words and phrases as terms combined with a variety of query operations) found in the documents and queries. The term strings are replaced by 32-bit integers. A "reverse dictionary" can be maintained which comprises a lexicon with token IDs as the keys and the words, phrases, and queries as the values. However, if the need is to mark the document with categories, and not to catalog and retrieve based on the specific tokens matched, a lexicon will not be needed. Clearly, when comparisons are being made, comparisons of 32-bit integers will be significantly faster than the prior art string comparisons.

Textual messages are likewise mapped to vectors using the same procedures as were used for the topics, above. All vectors are then normalized. Classification and matching are thereby reduced to vector processing.

Yet another challenge to providing a ubiquitous online system is the categorization and dissemination of documents which have been prepared in different languages, or which combine terms from more than one language (a so-called "multilingual" document). Rather than requiring translation of every document, and entry of user preferences/queries in every language, a common token will be assigned to the same content regardless of the language in which it is rendered. Each predefined topic/category needs to be specified by one or more topic specialists in each of the supported languages, using a query language consisting of topic keywords combined with boolean, fuzzy, proximity and/or hierarchical operators. The topic specialist assigns a unique topic category ID, regardless of the language.
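The replacement of term strings by 32-bit integers, together with the optional reverse dictionary, might look like the sketch below. The text does not name the hash function; CRC-32 is used here purely as an assumed stand-in, and because it does not guarantee unique IDs the hypothetical register helper checks for collisions explicitly.

```python
import zlib


def token_id(term):
    """Map a term string to a 32-bit integer (CRC-32 is an assumed stand-in)."""
    return zlib.crc32(term.encode("utf-8")) & 0xFFFFFFFF


# Reverse dictionary: token ID -> original word, phrase, or query.
reverse_dictionary = {}


def register(term):
    """Assign a token ID to a term and record it in the reverse dictionary."""
    tid = token_id(term)
    existing = reverse_dictionary.setdefault(tid, term)
    if existing != term:
        # CRC-32 can collide; a production hash would need to avoid this.
        raise ValueError(f"hash collision between {existing!r} and {term!r}")
    return tid
```

When documents only need to be marked with categories, the reverse dictionary can be dropped and token_id used on its own.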
Words and phrases that are part of a topic/category definition are referred to as "keywords." The topic/category definition requires that the keywords generally appear not by themselves but in particular combinations with other keywords in the same document. These keywords are combined using boolean, fuzzy, proximity and hierarchical operators to form a topic or category. Thus, these topic/category definitions are just like queries. When the proper combination of such keywords in a document matches a category or topic, all keywords involved are replaced by the topic's ID. In such case, it will be said that a "topic event" has occurred, or that a topic was "fired". The keywords, operations and how they are structured in the query corresponding to a particular topic/category will, in general, vary from language to language, but the assigned ID will not vary.

Topic specialists and linguists create synonym lists, assign IDs to each synonym list, and create dictionaries for each language to be supported, ensuring that terms with similar meaning, across different languages, have the same IDs. Therefore, for example, the category "horse racing" is specified in English using keywords such as "horse", "race", "rider" etc. and assigned a unique ID, say 58961243. When the same or another topic specialist creates the same category in Spanish for "carreras de caballos," the keywords will be in Spanish, such as "caballo", "carrera", "jinete" etc., but the ID assigned to the Spanish category will be the same as the ID assigned to the same category in English, namely 58961243.

Processing commences with the identification of documents as monolingual or multilingual. Documents are fed from a source to a language identifier which supports all required languages. The language identifier labels the documents and portions thereof with the languages in which they are written.
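The shared-ID scheme in the "horse racing" example can be sketched as follows. The firing rule is a deliberate simplification and an assumption: a topic fires when at least min_hits of its keywords occur, whereas real definitions combine keywords with boolean, fuzzy, proximity and hierarchical operators. The names TOPICS and fire_topics are hypothetical.

```python
HORSE_RACING_ID = 58961243  # ID taken from the example in the text

# Per-language topic definitions: (topic ID, keywords, minimum keyword hits).
# The keywords differ by language, but the topic ID is the same.
TOPICS = {
    "en": [(HORSE_RACING_ID, {"horse", "race", "rider"}, 2)],
    "es": [(HORSE_RACING_ID, {"caballo", "carrera", "jinete"}, 2)],
}


def fire_topics(words, language):
    """Replace the keywords of any fired topic with that topic's ID."""
    output = list(words)
    for topic_id, keywords, min_hits in TOPICS[language]:
        if len(keywords & set(words)) >= min_hits:
            # Topic event: every keyword involved is replaced by the topic ID.
            output = [topic_id if w in keywords else w for w in output]
    return output


print(fire_topics(["horse", "wins", "race"], "en"))
# → [58961243, 'wins', 58961243]
print(fire_topics(["caballo", "gana", "carrera"], "es"))
# → [58961243, 'gana', 58961243]
```

Both sentences yield the same topic ID, so downstream matching never needs to know which language the document was written in.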
The output of the language identifier is multiplexed into several outputs according to language, each output going into a name tokenizer specialized to handle a particular language. If, for example, the system needs to handle English, Spanish and French, the language identifier will have three outputs, one for each language. Each output will be fed into one of three different name tokenizers. Monolingual documents are processed by a single name tokenizer, the one appropriate to the language in which the document is written. A monolingual document contains one segment only, which comprises the whole document. A multilingual document, however, is segmented according to language, following a one-segment:one-language rule, and each segment is processed by a single name tokenizer (i.e., the one appropriate for the language in which that segment is written).

The name tokenizer executes the following tasks: replacement of all keywords involved in a topic event with a single token ID corresponding to the topic/category that was fired; replacement of all significant words/phrases with corresponding token IDs; replacement of all stop words with their corresponding token IDs; and elimination of all other remaining characters, including numbers, punctuation marks, etc., which are filtered out of the name tokenizer's output stream. The output stream of token IDs from the name tokenizer becomes the input to the stopword list filter (one for each language), which eliminates the stop word token IDs from the stream. At this point, all segments corresponding to the original document are collected by the segment consolidator filter.

The output stream of topic token IDs and significant token IDs from the segment consolidator filter is the input to the vectorizer. The vectorizer converts this stream into a normalized vector representation of the original document. Each vector position (token ID) represents a topic event or a significant word match event.
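The per-language stages described above (name tokenizer, stopword list filter, segment consolidator) can be sketched as a small pipeline. This sketch assumes that the language identifier has already labelled each segment, omits topic events, and uses made-up dictionaries and token IDs; every name in it is hypothetical.

```python
# Hypothetical per-language dictionaries: word -> token ID. Words with the
# same meaning share an ID across languages; all IDs here are made up.
SIGNIFICANT = {
    "en": {"horse": 101, "rider": 102},
    "es": {"caballo": 101, "jinete": 102},
}
STOP = {
    "en": {"the": 1, "and": 2},
    "es": {"el": 3, "y": 4},
}


def name_tokenizer(text, language):
    """Emit token IDs for significant and stop words; drop everything else."""
    ids = []
    for word in text.lower().split():
        word = word.strip(".,;:!?")
        if word in SIGNIFICANT[language]:
            ids.append(SIGNIFICANT[language][word])
        elif word in STOP[language]:
            ids.append(STOP[language][word])
        # Numbers, punctuation and unknown words are eliminated here.
    return ids


def stopword_filter(ids, language):
    """Remove the stop word token IDs for the given language."""
    stop_ids = set(STOP[language].values())
    return [i for i in ids if i not in stop_ids]


def process(segments):
    """Consolidate all (language, text) segments of one document."""
    consolidated = []
    for language, text in segments:
        ids = name_tokenizer(text, language)
        consolidated.extend(stopword_filter(ids, language))
    return consolidated


print(process([("en", "The horse and the rider."),
               ("es", "El caballo y el jinete, 2024")]))
# → [101, 102, 101, 102]
```

A monolingual document is simply the case where segments holds a single pair.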
The score stored at that position represents the frequency and location of that topic event or significant word match event in that document, thereby indicating how closely the document matches the topic associated with that particular event.

The vector output from the vectorizer goes into a categorizer, which retrieves the list of all stored topic/query vectors that have at least one token ID in common with this new document vector. Next, the dot products of the document vector and each of those topic vectors are calculated and stored. The resulting dot products are sorted from highest to lowest. Since all of the vectors involved (document and topic vectors) have been previously normalized to a common length (e.g., 1000), all of the resulting dot product scores can be compared with one another and sorted in a valid manner. Each score represents how closely the original document matches the corresponding original topic/category. The document corresponding to the document vector is labelled with the topic IDs and scores of the N topic vectors that produced the highest dot product scores. In this manner, all document content has been correlated regardless of language of origin.

The invention has been described with reference to several specific embodiments. One having skill in the relevant art will recognize that modifications may be made without departing from the spirit and scope of the invention as set forth in the appended claims.
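The vectorizer and categorizer steps of the description can be illustrated with the sketch below. Two points are assumptions rather than details from the text: the "common length" is taken to be the Euclidean length, and scores are plain frequency counts (the text says location is also taken into account). The function names are hypothetical.

```python
import math
from collections import Counter

TARGET_LENGTH = 1000.0  # common vector length mentioned in the text


def normalize(vector):
    """Scale a sparse {token_id: score} vector to length TARGET_LENGTH."""
    norm = math.sqrt(sum(v * v for v in vector.values()))
    if norm == 0:
        return {}
    return {k: v * TARGET_LENGTH / norm for k, v in vector.items()}


def vectorize(token_ids):
    """Assumed scoring: frequency counts only, then normalization."""
    return normalize(Counter(token_ids))


def categorize(doc_vector, topic_vectors, n=3):
    """Return the n (topic ID, score) pairs with the highest dot products."""
    scores = []
    for topic_id, topic_vector in topic_vectors.items():
        shared = doc_vector.keys() & topic_vector.keys()
        if not shared:
            continue  # only topics sharing at least one token ID are scored
        score = sum(doc_vector[k] * topic_vector[k] for k in shared)
        scores.append((topic_id, score))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]
```

Because every vector has the same length, the dot products are directly comparable, which is what makes the final sort meaningful.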
