Electronic mail filtering system and methods
6772196
Abstract
The system filters-out undesirable email messages sent to a user email address. The system includes a data store providing updateable storage of signature records that correspond to a subset of undesirable email messages that may be sent to the predetermined email address. An email filter processor is coupled to the store of signature records and operates against the email messages received at the predetermined email address to identify and filter-out email messages corresponding to any of the signature records. An update system is provided to automatically receive a set of signature records, which are then used to update the plurality of signature records stored by the data store. The system can be implemented to include at least a portion of the email processor system within a client site email transport system, which receives the email messages addressed to the set of email addresses assigned or associated with the client site, including the predetermined email address.
Claims
What is claimed is:
Description
BACKGROUND OF THE INVENTION
TABLE I
Message Content Form Conversion
Multiple forms: select the text/plain version.
HTML form only: <br> and <p> convert to single and
dual line returns, strip characters and codes
inside < > and convert &entities
(&quot;, etc.) to ordinary characters.
Other forms: reduce to a plain text equivalent.
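The HTML row of Table I can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function name html_to_plain_text is hypothetical, and regular expressions stand in for whatever parser a real system would use.

```python
import html
import re


def html_to_plain_text(markup: str) -> str:
    """Reduce an HTML body to plain text along the lines of Table I."""
    # <br> becomes a single line return, <p> a dual line return.
    text = re.sub(r"<br\s*/?>", "\n", markup, flags=re.IGNORECASE)
    text = re.sub(r"<p(\s[^>]*)?>", "\n\n", text, flags=re.IGNORECASE)
    # Strip the remaining characters and codes found inside < >.
    text = re.sub(r"<[^>]*>", "", text)
    # Convert character entities such as &quot; to ordinary characters.
    return html.unescape(text)
```

For a multi-form message, the same sketch would simply be bypassed in favor of the text/plain part.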
Other form conversions may be similarly implemented as different forms are identified and adopted for use in connection with email messages. For example, extensible markup language (XML) documents may soon be adopted for the transport of email content. In general, XML documents may be treated like hypertext markup language (HTML) documents, since both are derivative subsets of a broader document description standard known as standard generalized markup language (SGML). Extensions to the HTML form conversions may need to be adopted to handle XML, though the goal of generally reducing the content to a plain text form remains.
Normalization of the content of a message is then performed to effectively standardize the presentation of the content. This content normalization process is intended to remove distinctions from the message content that do not carry repeatably identifying information. Such content distinctions are generally of a nature that does not affect or interfere with the substance of the content presentation. These distinctions generally allow for random insertion of characters or for varying the presentation of existing characters in a non-substantive way. For example, randomly selected numbers or odd characters can be inserted at the end of lines or sentences, extra spaces can be added between words, and blank lines can be added between paragraphs. Also, the capitalization pattern of existing characters and the number of punctuation marks used can be varied without affecting the overall presentation of the content. The normalization functions taken to remove these and other distinctions from the content of a message are presented in Table II:
TABLE II
Message Content Normalization
Case: Content is converted to lowercase.
Whitespace: Excess whitespace (single line breaks, tabs, multiple
spaces, etc.) is coalesced with adjacent whitespace
into a single space. Markers, identifiers, pointers,
or other references are used to preserve
identification of content line, paragraph, and
sentence breaks.
Multiple Line Breaks: For paragraph processing, multiple adjacent blank lines
(paragraph boundaries) are coalesced into a single
blank line.
Punctuation: Punctuation is removed. Optionally, markers,
identifiers, pointers, or other references
may be used to preserve the possible significance
of punctuation that occurs within words and
numbers.
Extended character and Unicode: For extended characters, the 8th bit is cleared. For
Unicode, the characters are normalized to a standard
character set.
SMTP headers: SMTP header fields that are easily forged are ignored
or stripped out. The content of the subject line is
maintained. An original source IP address may
be maintained.
Numbers: Any number of more than 5 consecutive digits
is ignored. Well-formed numbers, such as
telephone numbers, are maintained. Any single
numbers at the end of a line or on a line by
themselves are ignored.
Number of Lines: All content beyond the first 25 lines
(which is a tunable parameter) or 1K bytes
(also tunable), whichever comes first,
is ignored as being outside the prime content
area and easily randomized without impacting
the UEM message.
Long words: Single words longer than 32 characters (a tunable
parameter) are likely to be a random character
string and, therefore, are ignored.
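Several of the Table II normalizations can be illustrated together. The sketch below is only an approximation under stated assumptions: the name normalize and the constants are hypothetical, the 1K limit is applied to characters rather than bytes, and the preservation of well-formed numbers such as telephone numbers, the Unicode handling, and the SMTP header handling are all omitted.

```python
import re

MAX_LINES = 25     # tunable, per Table II
MAX_CHARS = 1024   # stands in for the 1K byte limit (assumption: characters)
MAX_WORD_LEN = 32  # tunable
MAX_DIGITS = 5     # longer digit runs are ignored


def normalize(content: str) -> list[str]:
    """Return the normalized, non-blank lines of the prime content area."""
    lines = content[:MAX_CHARS].splitlines()[:MAX_LINES]
    normalized = []
    for line in lines:
        # Lowercase, then remove punctuation.
        line = re.sub(r"[^\w\s]", "", line.lower())
        words = []
        for word in line.split():  # split() also coalesces whitespace
            if len(word) > MAX_WORD_LEN:
                continue  # likely a random character string
            if word.isdigit() and len(word) > MAX_DIGITS:
                continue  # long run of digits
            words.append(word)
        # A single number at the end of a line, or alone on a line, is ignored.
        if words and words[-1].isdigit():
            words.pop()
        if words:
            normalized.append(" ".join(words))
    return normalized
```

Returning a list of lines is one way to keep the line-break markers that Table II asks for while still collapsing whitespace within each line.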
The normalizations presented in Table II are the currently preferred set of such normalizations. Additional normalizations may be developed, in a generally consistent manner, as UEM vendors develop new methods of affecting the presentation of their UEM messages in an effort to randomly vary the appearance of their messages to automated detection tools and, thereby, avoid detection as UEM messages.
Once normalized, a message is then passed to an array of algorithmic processors 48.sub.1-N to generate signatures based on a variety of differently selected subsets of the message content. These subsets may be selected as named blocks, such as a particular header field, as non-overlapping blocks, such as lines, sentences and paragraphs, and as overlapping blocks of words as may be selected through a sliding window. In the preferred embodiments of the present invention, the algorithmic processors implement computational, comparative, and other processes for generating signatures. Computationally-based signatures are preferably digests or other mathematical operations that produce signatures representing some corresponding subset of the message content. Many acceptable computational digest forms exist, including checksums, message digests, such as MD5, and cyclic redundancy checks, such as CRC-8 and CRC-32. Table III lists a preferred set of the computational algorithms, employing a preferred checksum digest form, usable in connection with the present invention.
TABLE III
Checksum-based Signature Algorithms
Multiple Words: Generate a checksum for each window group of four
consecutive words (a tunable parameter) that occur
within the window as the window is slid over the
message content, where window group words are
preferably exclusive of stop-list words (words,
such as "and, or, this," that are
non-contextual and as commonly defined in the
literature concerning automated full-text
searching systems). This is a preferred algorithm.
Lines: Generate a checksum for each of the first 20 lines (a
tunable parameter) that occurs within the message
content, where each line is delineated by a line break
in the original message content, preferably ignoring
multiple line breaks and blank lines.
Sentences: Generate a checksum for each sentence in the
message content, where sentences are delineated
by the occurrence of a period-space character
combination in the original message content.
Paragraphs: Generate a checksum for each paragraph in
the message content, where paragraphs are
delineated by multiple line breaks or text
indents in the original message content. The
subject line header field is preferably
considered a separate paragraph.
Originating IP address: Generate a checksum for the first or the
apparent original source IP address given
in the header fields of this message.
Single Byte Chunk: Generate a checksum for the first 1000 bytes
(a tunable parameter) that occurs in the message
content, preferably exclusive of header fields.
Multiple Byte Chunks: Generate a checksum for each of the first 10 blocks (a
tunable parameter) of 100 bytes (also a tunable
parameter) that occur in the message content,
preferably exclusive of header fields.
Block Chunks: Generate a checksum for the first 25 lines (a tunable
parameter) of the original message content, preferably
exclusive of header fields and blank lines.
Line Chunks: Generate a checksum for each set of 4 lines (a tunable
parameter) of the message content, preferably exclusive
of header fields and blank lines.
Sliding Window Chunks: Generate a checksum for each set of 4 lines (a tunable
parameter) sliding by 1 line (also a tunable parameter)
of the message content, preferably exclusive of header
fields and blank lines.
Authorized Words: Generate checksums for the sets of words, numbers,
and words and numbers that are also found on a
defined word list empirically constructed or
progressively developed to contain the most
common words and numbers used in UEM messages.
High IDF Terms: Generate a checksum for those terms that occur
within the message content considered, preferably
through statistical analysis, to be significant
in identifying the message content. The most
significant terms will preferably include
unique descriptive phrases, specialized product
and service names, email addresses, phone
numbers, postal addresses, and URLs.
Unique Terms: Generate a checksum for just the unique terms
that occur within the message content, as
determined against a common word-dictionary.
The unique terms will preferably include
specialized product and service names, email
addresses, phone numbers, postal addresses,
and URLs.
Call to Action Terms: Generate a checksum for the words and short phrases
identified from a list of known action words
and phrases empirically defined or progressively
identified from reviewed UEM messages to be within
a "call to action," such as an email address,
URL, phone number, and postal address. If three or
fewer terms are found in the UEM message, do not
generate a checksum.
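As an illustration of the Multiple Words row of Table III, the following sketch slides a four-word window over normalized content and emits one checksum per window. CRC-32 is used only because the text names it as an acceptable digest; the stop-word list and the name word_window_checksums are assumptions made for the example.

```python
import zlib

# Illustrative stop-list; a real list would come from full-text search literature.
STOP_WORDS = frozenset({"and", "or", "this", "the", "a", "of", "to"})


def word_window_checksums(text: str, window: int = 4) -> list[int]:
    """Checksum each group of `window` consecutive non-stop-list words."""
    words = [w for w in text.split() if w not in STOP_WORDS]
    return [
        zlib.crc32(" ".join(words[i:i + window]).encode("utf-8"))
        for i in range(len(words) - window + 1)
    ]
```

Because stop-list words are dropped before windowing, inserting or removing such words does not change the resulting signatures, which is the point of excluding them.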
Although many effective algorithms may be based on checksum generated signatures, or digests in general, other algorithms for generating signatures can be equally if not more effective in identifying UEM messages. These other algorithms may be based on absolute and relative counts of the occurrence of particular words and phrases that occur in a message. An absolute count is defined as the number of terms, identified from a defined list, that occur within some portion of the message. A relative count is defined as the number of terms, also identified from a defined list, that occur within some portion of the message relative to the total number of comparable terms that occur within the same portion of the message. These defined lists may be statically defined or dynamically generated based on empirical or progressive reviews of UEM messages, or generated based on known texts. Table IV lists a preferred set of other algorithms usable in connection with the present invention:
TABLE IV
Adder Signature Algorithms
Word Groups: The signature is a value representing the percentage of
word groups in the message that also appear in a
master word group list. Preferably, each word group
is a group of 4 successive words (a tunable parameter),
excluding stop-words given by a predefined list.
Further, the signature is preferably the percentage
of those word groups that appear within the first
25 lines (also a tunable parameter) of the message
that also appear in the master word group list.
This adder value is empirically set, such as in
a range of 0.25 and 2 points.
Predefined KeyWord Terms: The signature is a true or false value reflecting
the existence of any of a defined list of words
in the subject line of the message. The defined list
includes words such as "advertisement," "adv,"
"sale," "please read," and "chance
of a lifetime." This adder value is empirically set, such as
in a range of 0.25 and 2 points.
Legal Terms and Phrases: The signature is the percentage of legal terms and
phrases that appear within the message and also
appear on a master legal term list. This master
legal term list preferably includes terms selected
from legally required UEM notices. Thus, for a notice
that states, "This message is being sent
to you in compliance with the proposed Federal
legislation for commercial e-mail (S. 1618 -
SECTION 301). "Pursuant to Section 301,
Paragraph (a)(2)(C) of S. 1618, further transmissions
to you by the sender of this e-mail may be stopped at
no cost to you by submitting a request," the
master legal term list preferably includes the
terms: "301," "1618," "further
transmissions to you by the sender," and
"a 2 C". This adder value is empirically set,
such as in a range of 0.25 and 3 points.
Improper Header Fields: A signature is generated for each of several
header fields. Each header signature is a true or
false value reflecting whether the corresponding
header field value is deemed to be improper,
such as a blank "To:" field, a blank or
invalid domain name in the "From:" field,
and a blank or absent "X-Authenticated"
field. This adder value is empirically set,
such as in a range of 2 and 5 points for each
improper header.
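Two of the Table IV adders, the subject keyword check and the improper header check, might be sketched as follows. The point values are assumptions chosen from inside the ranges given in the table, the domain test is deliberately crude, and adder_points is a hypothetical name.

```python
import re

SUBJECT_KEYWORDS = ("advertisement", "adv", "sale", "please read",
                    "chance of a lifetime")
KEYWORD_ADDER = 1.0  # assumed value within the 0.25 to 2 point range
HEADER_ADDER = 2.0   # assumed value within the 2 to 5 point range


def adder_points(headers: dict[str, str]) -> float:
    """Sum the adder values triggered by a message's header fields."""
    points = 0.0
    subject = headers.get("Subject", "").lower()
    # Whole-word match, so that "adv" does not fire on "advice".
    if any(re.search(r"\b" + re.escape(k) + r"\b", subject)
           for k in SUBJECT_KEYWORDS):
        points += KEYWORD_ADDER
    if not headers.get("To", "").strip():
        points += HEADER_ADDER  # blank "To:" field
    sender = headers.get("From", "")
    domain = sender.rpartition("@")[2].strip(" >")
    if "@" not in sender or "." not in domain:
        points += HEADER_ADDER  # blank or invalid "From:" domain
    if not headers.get("X-Authenticated", "").strip():
        points += HEADER_ADDER  # blank or absent "X-Authenticated" field
    return points
```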
The array of algorithmic processors 48.sub.1-N preferably operate in parallel to generate a full signature record set of signatures based on a particular message. Each of these signatures is then passed to a final analysis processor 50 that composes a corresponding signature record set. Before storing the newly constructed signature record set to the signature database 18, the final analysis processor 50 preferably scans the signature record sets previously stored in the signature database 18 to determine whether a corresponding signature record set already exists in the signature database 18. The determination made is preferably whether or not there is a sufficient calculated degree of similarity, represented as a score, between the newly generated signature record set and those prior stored in the signature database 18. Generally, where the similarity score produced by the comparison scan is above a pre-set threshold, the newly generated signature record set or a variant thereof is considered to already exist in the signature database 18. The newly generated signature record set is therefore discarded. Conversely, if the threshold level is not met, the newly generated signature record set is finalized and stored to the signature database 18. In the preferred embodiment of the present invention, the comparison scan operates to compare a subset of the signatures taken from the newly generated signature record set with those present in the signature database 18. This comparison is constrained to only comparing signatures that are generated using the same algorithm. Thus, for example, a signature newly generated using a three-word sliding window algorithm is only compared against signatures in the database 18 that were previously generated using the same three-word sliding window algorithm. 
For the preferred embodiment then, no comparison correlation is made or respected between the newly generated signature record set, as a whole, and any prior stored signature record set. Alternate embodiments of the present invention may determine correlations on a signature record set basis to obtain a possibly higher level of validity for the matches made, but with greatly increased computational requirements. In either case, the count of comparison matches to comparisons made yields a ratio value on a per algorithm basis. A degree of similarity score is then produced from a preferably mathematically based combination of the ratio values.
In greater detail with reference to FIG. 2B, the determination process begins with the final analysis processor 50 first operating to select subsets of the signatures generated by several of the different algorithmic processors 48.sub.1-N from the newly generated signature record set 56. The total number of signatures selected, the algorithms for which signatures are selected, and the number of signatures selected corresponding to a particular algorithm are preferably determined empirically. For example, one algorithm applied to an email may produce 200 signatures. A representative set of these signatures, for purposes of comparison, may be a randomly selected set of 10 or 20 signatures. Thus, the comparison scan may only employ 10 signatures generated by a three word sliding window algorithm, 15 generated by a line-selection algorithm, and 5 generated by a unique word selection algorithm. The selected signatures are then compared via a comparative analysis block 58 against signatures of like generated signature records retrieved from the signature database 18. The ratio values determined through the comparison 58 are then analyzed through a match analyzer 59 to produce a degree of similarity score.
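The per-algorithm ratio described here can be illustrated with a short sketch. It assumes signatures are hashable values and that the stored signatures for one algorithm are available as a set; match_ratio and its parameters are hypothetical names, not part of the described system.

```python
import random


def match_ratio(new_sigs, stored_sigs, sample_size, rng=None):
    """Ratio of sampled new signatures found among the stored signatures.

    Both collections must come from the same algorithm, since signatures
    are only compared against others generated the same way.
    """
    rng = rng or random.Random()
    candidates = list(new_sigs)
    # Randomly select a representative subset, e.g. 10 of 200 signatures.
    sample = rng.sample(candidates, min(sample_size, len(candidates)))
    if not sample:
        return 0.0
    matches = sum(1 for sig in sample if sig in stored_sigs)
    return matches / len(sample)
```

Calling this once per algorithm yields the ratio values that the match analyzer then combines.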
While many different similarity algorithms may be used, based for example on statistical or stochastic analysis, relatively simple averaging algorithms are presently preferred. Table V provides a list of the preferred similarity detection algorithms usable in connection with the present invention:
TABLE V
Signature Combination Algorithms
Averaging: The checksum signature ratio match values are averaged
together to provide a signature record level comparison
score. This is a preferred algorithm.
Mean Tested Averaging: The checksum signature ratio match values are first ranked
against the mean value of the match values, weighted
proportionately, and averaged together to provide a
signature-set comparison score.
Differential Averaging: The one highest and one lowest (both tunable parameters)
checksum signature ratio match values are discarded and
the remaining signature match values are averaged
together to provide a signature-set comparison score.
A predefined threshold similarity level, preferably programmable and stored by the final analysis processor 50, is then used to determine whether the currently generated signature record set should be stored to the signature database 18. If the degree of similarity computed for a newly generated signature record set is below the threshold, the record set is stored. Otherwise, the signature record set is discarded. For example, the newly generated signature record set 56 may include signatures generated by four algorithms, with each algorithm generating ten checksum signatures. The signature record sets previously stored in the signature database 18 may include signatures produced by four or more algorithms though including at least the four algorithms used in generating the current generated signature record set 56. If the comparisons between the signatures, on a per-algorithm basis, produce ratio match values of 8, 7, 9, and 5, the resulting degree of similarity score, using averaging, is 72.5%. Using differential averaging, the degree of similarity score is 75%. Finally, if the predefined threshold similarity level is set to 75%, inclusive, the newly generated signature record set 56 would be stored to the database 18 if the degree of similarity determination is defined as using averaging. Where the degree of similarity determination is defined as using differential averaging, the current generated signature record set 56 would be considered to be a sufficiently close variant of a prior recognized UEM message and, therefore, would be discarded. In each case, however, a date/time-stamp is effectively updated for any signature prior stored in the signature database that matched a signature during the comparison scan. 
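The worked example above can be reproduced with a small sketch of the averaging and differential averaging rows of Table V. The function names are hypothetical, ratios are expressed as fractions of the ten signatures per algorithm, and the threshold is treated as inclusive, as in the example.

```python
def averaging(ratios):
    """Plain average of the per-algorithm ratio match values."""
    return sum(ratios) / len(ratios)


def differential_averaging(ratios, drop_high=1, drop_low=1):
    """Discard the highest and lowest values, then average the rest."""
    ranked = sorted(ratios)
    kept = ranked[drop_low:len(ranked) - drop_high]
    # Fall back to plain averaging when too few values remain.
    return averaging(kept) if kept else averaging(ratios)


def should_store(score, threshold=0.75):
    """Store the new record set only when it is below the threshold."""
    return score < threshold


ratios = [8 / 10, 7 / 10, 9 / 10, 5 / 10]
avg = averaging(ratios)                 # about 0.725, so the set is stored
diff = differential_averaging(ratios)   # 0.75, so the set is discarded
```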
Thus, for preferred embodiments of the present invention where signatures are aged and progressively removed from the signature database 18, updating of the signature date/time-stamp values allows the signature database 18, as a whole, to track with the progression of variants of UEM received by the server 12. Finally, in arriving at a similarity score, the use of only a subset of the signatures generated by the array of algorithmic processors 48.sub.1-N for comparison against those stored by the signature database 18 serves a number of related purposes. By introducing an unpredictable variation in the particular set of signatures that will ultimately be used in screening received email messages, UEM vendors are unable to discern or reliably predict the essential criteria that would result in an email being compared and determined to be UEM. Additionally, using a subset whose algorithmic composition may equally change unpredictably, the introduction of new and additional algorithms is hidden from any UEM vendor who may try to discern or reliably predict UEM detection criteria. An alternate preferred embodiment of the examination process is detailed in FIG. 2C. The final analysis processor 50 operates to compare the current signature record set 56, as generated by the algorithmic processors 48.sub.1-N, against a currently selected signature record set 57, previously stored in the signature database 18. The two signature record sets are preferably arranged, at least logically, into signature subsets 56.sub.1-N and 57.sub.1-N. Each of the signature subsets 56.sub.1-N contains the set of signatures produced by a respective algorithmic processor 48.sub.1-N. Likewise, the signature subsets 57.sub.1-N contain the sets of signatures earlier produced by respective algorithmic processors 48.sub.1-N. 
A comparative analysis block 58 preferably receives each of the signature subsets 56.sub.1-N, 57.sub.1-N and performs signature comparisons between the paired subsets 56.sub.1-N, 57.sub.1-N that correspond to a respective algorithmic processor 48.sub.1-N. As between algorithm matched subsets 56.sub.X, 57.sub.X, each signature in the subset 56.sub.X is compared to each signature in the subset 57.sub.X. Subset totals of the number of identity signature matches and non-matches found are kept for each algorithm matched pairing of the subsets 56.sub.1-N, 57.sub.1-N. The resulting totals are then passed to a match analysis block 59 for a determination of whether to discard the signature record set 56 as being identical or sufficiently similar to a signature record set 57 already present in the signature database 18.
Referring again to FIG. 2A, the empirical selection of signatures for use in the comparison is preferably performed by an administrator through a local manager 52. This local manager 52 may be an administrative console attached to the server 12 or a separate administrative system. Preferably, the local manager 52 also operates to monitor the operation of the server 12 including generation of statistical and summary reports reflecting the operation of the server 12. The local manager 52 is also preferably responsible for establishing an aging algorithm for retiring signatures and potentially entire signature record sets from the signature database 18. Since only one or a small set of typically less than ten signature record sets are required to identify a particular UEM message and related variations, the storage capacity requirement for the signature database 18 is not great. However, individual UEM mass mailings tend to occur over relatively short periods, typically 1 to 5 days. Therefore, corresponding signature record sets can be considered to have a similarly limited effective life-span.
Preferably then, the aging algorithm maintains signatures in the signature database 18 for an empirically selected aging period that may last from one day to several weeks, with a preferred period of a few days, such as two to five days. The removal of signatures is accomplished by periodically scanning the signature database 18 and removing signatures whose date/time-stamps are older than the currently defined aging period. The algorithmic processors 48.sub.1-N that are not selected to produce signatures for new signature record sets preferably continue to generate signatures. The operation of such algorithmic processors 48.sub.1-N is preferably maintained for a period of time at least equal to the aging period. Alternately, when an algorithm is retired, the signatures created using that algorithm are removed from the signature database. This tends to ensure, if it does not actually ensure, that signatures are retired before the corresponding generating algorithm is retired. Thus, the signatures generated against any given message are a proper superset of those stored by the signature record sets in the signature database 18 at any given time. The scan of the signature database 18 by the final analysis processor 50 for a matching signature record set can be performed irrespective of the particular subset of signatures selected for use in the current generation of signature record sets. When a newly generated signature-set is finally identified by the final analysis processor 50 as representing a sufficiently different UEM message, the currently generated signature record set is stored in the database 18. This record set as stored may include tuples storing a particular signature value and a versioned identification of the particular algorithm used to generate the signature. The date/time-stamp and other, optional data are preferably stored with the signature value at this time. 
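The aging scan can be pictured with a minimal sketch. It assumes, purely for illustration, that the signature database is a dictionary mapping each signature value to its date/time-stamp; purge_aged, touch, and the three-day default are hypothetical.

```python
from datetime import datetime, timedelta


def touch(signatures: dict, sig, now: datetime) -> None:
    """Refresh the date/time-stamp of a signature that matched during a scan."""
    signatures[sig] = now


def purge_aged(signatures: dict, now: datetime,
               aging_period: timedelta = timedelta(days=3)) -> int:
    """Remove signatures older than the aging period; return how many."""
    cutoff = now - aging_period
    stale = [sig for sig, stamp in signatures.items() if stamp < cutoff]
    for sig in stale:
        del signatures[sig]
    return len(stale)
```

Because matches refresh the stamp, signatures tied to a mailing that is still arriving survive the scan, while those from finished campaigns age out.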
The other data may include signature record generation dates, other data identifying or characterizing the signature or some aspect of the content from which the signature was generated, and data that may be used in support of the aging of the signature, the associated signature record, or full signature record set. The signature record set is also provided to an update manager 54 for use in a hot update of the client systems 14, 14'. In the preferred embodiments of the present invention, the operation of the update manager 54 is ultimately defined by the local manager 52 and administrator. Preferably, this defined operation is a relatively continuous process of serving hot updates of newly identified signature record sets to the client systems 14, 14'. This is desired in order to minimize the latency from the first receipt of a message from a new UEM mailing campaign to the updating of the client systems 14, 14'. In a preferred embodiment of the present invention, the hot updates are dynamically sourced by the update manager 54 over the Internet 22 to the client systems 14, 14'. A conventional hot update system and proprietary communications protocol may be used. A hot update message will likely contain a single signature record set and, therefore, may be relatively small and quickly delivered. The record set size may be further reduced by sending only those signatures that did not match against the signature database 18 and updated date/time-stamps for those that did match. A hot update message may also include new or updated algorithms for use by the client systems 14, 14'. That is, any time a new or modified algorithm is adopted by the server 12 for use by one of the algorithmic processors 48.sub.1-N, the algorithm is concurrently provided to the clients 14, 14'. Preferably, the hot update communications protocol is secure, such as through the use of an encryption protocol, to protect the content of the hot updates. Referring now to FIG. 
3, the client system 14, representing the client systems 14, 14', is shown in greater detail. A SMTP proxy server 60 is installed in place of a conventional SMTP server 62 that is used as an email routing relay for a community of Users.sub.1-N 64. In a preferred embodiment of the present invention, the SMTP proxy 60 operates to route email messages inbound from the Internet to a white list processor 66. A white list 68, which is accessible by the white list processor 66, is managed by a local manager 70 under the control of a client site administrator. This local manager 70 may be an administrative console attached to the server 14 or a separate administrative system. Preferably, the local manager 70 also operates to monitor the operation of the server 14 including generation of statistical and summary reports reflecting the operation of the server 14. Using the local manager 70, the client site administrator can manage a set of white list entries, typically consisting of some set of domain names and email addresses corresponding to sites and individual correspondents that are trusted not to be a source or forwarder of UEM messages. Thus, the white list processor preferably operates to examine each message received from the SMTP proxy to determine if the "From:" domain or email address is on the white list 68. If present, the message is passed directly to the SMTP server 62 for eventual distribution to an end-user 64. All other inbound email messages are next passed, in a preferred embodiment of the present invention, to a hold queue 72. A UEM detection processor 74 operates on messages as entered into the hold queue 72 to determine whether the message corresponds to any signature record set stored in a client signature database 76. 
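The white list test on the "From:" field might look like the following sketch. It assumes the white list is a set of lowercase domain names and email addresses; is_whitelisted is a hypothetical name, while parseaddr is the standard library helper for splitting a display name from an address.

```python
from email.utils import parseaddr


def is_whitelisted(from_header: str, white_list: set[str]) -> bool:
    """True when the sender's address or domain appears on the white list."""
    address = parseaddr(from_header)[1].lower()
    domain = address.rpartition("@")[2]
    return bool(address) and (address in white_list or domain in white_list)
```

Messages for which this returns True would bypass the hold queue and go directly to the SMTP server; all others would be queued for UEM detection.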
This detection process is performed generally in three steps: first, a client signature record set, reflecting the contents of a particular queued message, is generated and a scan comparison is made between a signature-set of the client signature record set and the signatures stored in the signature database 76; second, a similarity score is generated for the client signature-set; and third, a preferably threshold-based determination is made as to whether the similarity score credited to the client signature record set is sufficient to consider the corresponding email message to be a known UEM message. Preferably, the generation of the client signature-set is performed by an array of algorithmic processors, provided within the UEM detection processor 74, that is essentially identical to the array 48.sub.1-N. The UEM detection processor 74 also effectively includes at least that portion of the final analysis processor 50 that operates to collect the signatures generated by the array 48.sub.1-N, access the signature database 18, and perform the scan comparison to identify matching signatures in order to generate a similarity score, essentially in the same manner as performed by the server 12. This includes determining the ratio of matching checksum signatures found, normalizing, as appropriate, the ratio identity value, and adding any "adder" values to determine a final similarity score, as described above in connection with the operation of the final analysis processor 50. Finally, based on the generated score, a client-site determination is made as to whether a scored message is a UEM message. In preferred embodiments of the present invention, a score threshold value is established by the client site administrator through a local manager 70. Messages having scores higher than the threshold value are considered to be UEM messages and dealt with in a manner determined by the client site administrator. 
Preferably, the options are to drop the UEM messages completely, to distinctively mark the message as UEM and thus delegate the further handling of the messages to the user-client email client applications, and to maintain a UEM message in the hold queue 72 pending review by the client site administrator. Preferred embodiments of the present invention allow the client site administrator to set and manage through the local manager 70 multiple threshold values to support discrimination between different handling options. For example, a high threshold may be set to immediately drop UEM messages detected with a very high degree of certainty. One or more intermediate threshold values can be set to filter between messages that should be marked differently as UEM messages, such as with different rating numbers reflecting different likelihoods that the message is UEM. A low threshold value may be set to filter for all messages that are at least above a suspect threshold as being UEM messages. An algorithm set store 78 is preferably used to manage the current set of algorithms usable by the UEM detection processor 74. Both the signature database 76 and algorithm set store 78 are preferably hot-updateable from the server 12. In preferred embodiments of the present invention, the hot-update protocol is used to access a secure internet port 80 on the client system 14 and transfer updates to the local manager 70. In turn, the local manager 70, in a preferably automated process, stores the signatures of new signature record sets in the signature database 76 and new and updated signature algorithms, including their related parameters, in or through the algorithm set store 78. The hold queue 72 can also be used to implement a delay function. In an alternate preferred embodiment of the present invention, the examination of messages entered into the hold queue 72 may be intentionally delayed. 
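The multiple-threshold handling can be illustrated as a simple cascade. The threshold values, the action names, and the pairing of the low threshold with the hold queue are all assumptions made for this sketch; the text leaves those choices to the client site administrator.

```python
def handling_for(score: float, drop_above: float = 0.95,
                 mark_above: float = 0.75, hold_above: float = 0.50) -> str:
    """Map a similarity score to a handling option (hypothetical values)."""
    if score > drop_above:
        return "drop"     # very high certainty: drop immediately
    if score > mark_above:
        return "mark"     # mark as UEM and leave handling to the mail client
    if score > hold_above:
        return "hold"     # suspect: keep in the hold queue pending review
    return "deliver"      # below every threshold: pass to the SMTP server
```

Additional intermediate thresholds could be added in the same way to attach different rating numbers to marked messages.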
By forcing a delay before the UEM detection processor 74 begins the evaluation of a message newly entered into the hold queue 72, an opportunity is provided for the server 12 to notice any newly started UEM mass emailing and to generate and distribute corresponding signature record sets to the client systems 14, 14'. Preferably, the delay imposed is a variable parameter that can be set by the client site administrator through the local manager 70. In preferred embodiments of the present invention, the period of the delay may be set between 0 and 30 minutes, with a typical setting of between 5 and 15 minutes. In a presently preferred embodiment of the present invention, however, the delay is set at 0 provided a sufficient number of decoy email addresses have been deployed for a period of time sufficient to expect that many have been harvested. As a result, there is a substantial statistical chance that a decoy address will be in the first 1 to 5 percent of the email addresses used in a new UEM mass mailing campaign. Given that the latency of the server 12 in generating and distributing a corresponding new signature record set, plus the latency of the clients 14, 14' in installing the signature record set, is relatively small in comparison to the progression of a UEM mass mailing, the clients 14, 14' will be updated in time to block at least 90% of any new UEM messages. A quite reasonably expected blockage rate of 98% is attained. This number is reached by assuming that the number of decoy addresses is sufficient to reasonably have the occurrence of a decoy address within the first 1% of the UEM mass mailing and assuming that the combined server 12, client 14, 14' latency only allows another 1% of the UEM mass mailing to be delivered without detection. Even with a successful delivery rate of 10%, however, most UEM vendors should find that their UEM mass mailings are prohibitively ineffective to pursue.
FIG. 4 details, in flow diagram form, the process 100 of creating a signature record set by a server system 12 and scoring a message by a client system 14, 14'. Although performed for different ultimate purposes by the server 12 and client 14, 14' systems, the process 100 nonetheless utilizes substantially the same steps. The process 100 begins with a single message, which is first converted 102 to a plain text form (Table I) and then normalized 104 to conform (Table II) the content to a standard presentation. The message content is then evaluated through computationally based algorithms 106, weight-based algorithms 108, and any other algorithms 110. The computationally based algorithms 106 (Table III) include text block 112, chunking 114, selected text 116, and other 118 algorithms that operate to generate signatures based on a computation, such as checksums. Weight oriented algorithms 108 (Table IV) include improper field checking 120, count-based signatures 122, and other algorithms 124. The use of the process 100 by a server system 12 next collects 126 the various generated signatures into a newly generated signature record set. A determination is then made to store the signature record set in the database 18 where a generated similarity score is below a pre-set threshold. When a record set is stored to the database 18, the signature record set is stored and hot updated to the client systems 14, 14'. Changes in the current set of algorithms used, scoring parameters, and other data used by the algorithmic processor array 48.sub.1-N are also included in the hot update. The process 100, as used by the client systems 14, 14', preferably relies on the scoring parameters provided for the specific algorithms 112-124 to determine the similarity score for the message being evaluated. The computational 128 and weighted adds 130 components of the score are tallied and then combined 132 to produce a final similarity score for the message.
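The front end of the process 100 (conversion 102, normalization 104, and a chunking-style computational algorithm 114) can be sketched in Python as shown below. This is a simplified illustration under stated assumptions: the conversion and normalization rules cover only a few of the cases listed in Tables I and II, CRC32 stands in for whatever checksum the algorithms actually use, and the four-word chunk size and all function names are hypothetical.

```python
import html
import re
import zlib


def to_plain_text(body):
    """Step 102 (simplified): reduce an HTML body to plain text."""
    body = re.sub(r"(?i)<br\s*/?>", "\n", body)    # <br> becomes a single line return
    body = re.sub(r"(?i)<p\s*/?>", "\n\n", body)   # <p> becomes a dual line return
    body = re.sub(r"<[^>]*>", "", body)            # strip remaining tags
    return html.unescape(body)                     # convert &entities to characters


def normalize(text):
    """Step 104 (simplified): fold case, collapse repeated punctuation and whitespace."""
    text = text.lower()
    text = re.sub(r"([!?.,])\1+", r"\1", text)
    return re.sub(r"\s+", " ", text).strip()


def chunk_signatures(text, words_per_chunk=4):
    """Stand-in for chunking 114: a CRC32 checksum over fixed-size word chunks."""
    words = text.split()
    chunks = [" ".join(words[i:i + words_per_chunk])
              for i in range(0, len(words), words_per_chunk)]
    return [format(zlib.crc32(c.encode("utf-8")), "08x") for c in chunks]


def signature_set(body):
    # Convert, normalize, then generate the computational signatures.
    return chunk_signatures(normalize(to_plain_text(body)))


# Two presentations of the same content reduce to the same signatures.
variant_a = "<p>Buy NOW!!!<br>Limited   offer &amp; free gift</p>"
variant_b = "buy now!\nlimited offer & free gift"
same = signature_set(variant_a) == signature_set(variant_b)
```

Because both variants normalize to the same text, they yield identical signature lists, which is the property that lets a signature generated at the server match a differently formatted copy of the message at a client.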
Based on pre-established score thresholds, a score comparison 134 is performed and an appropriate action is then taken, such as dropping the message or marking the message with a relative indicator of the likelihood that the message is a UEM message. Thus, methods and a system for identifying UEM messages and supporting the filtering of such messages from the desired stream of inbound email have been described. While the present invention has been described particularly with reference to the filtering of electronic mail, the present invention is equally applicable to other and future forms of communications that operate on the basis of distributed public addresses for user-to-user communications. In view of the above description of the preferred embodiments of the present invention, many modifications and variations of the disclosed embodiments will be readily appreciated by those of skill in the art. In particular, the nature of the content normalization processes may be readily adapted to handling other and new content presentation conventions and may be adjusted to convert message content to a defined form other than plain/text. Additionally, the algorithms may be permuted in various manners to provide a renewable source of different distinct signatures and new algorithms, provided they create reproducible signatures, may be introduced at any time. Also, UEM should not be construed as restricted to just email transported by the simple mail transfer protocol (SMTP). Rather, UEM should be understood to include other message types where a server or proxy can be placed or executed in the transport path of these messages. It is therefore to be understood that, within the scope of the appended claims, the invention may be practiced otherwise than as specifically described above.
