Apparatus for and method of multiple parallel string searching
6738779
Abstract
An apparatus for and method of simultaneously searching an input character stream for the presence of multiple strings. The strings to be searched for are determined a priori, processed and stored in substring tables during a configuration phase. The strings to be searched for are divided into a plurality of two and three character substrings and stored in substring tables. A hash of each substring is calculated and stored in a hash table whose output is an index to a substring table. During searching, the content filter generates the hash of the input character stream and attempts to find a matching substring stored in the hash table. A string is declared found if all the substrings making up the string have been received in correct consecutive order.
Claims
What is claimed is:
Description
FIELD OF THE INVENTION
Term Definition
ASIC Application Specific Integrated Circuit
CPU Central Processing Unit
DSP Digital Signal Processor
EEPROM Electrically Erasable Programmable Read Only Memory
EEROM Electrically Erasable Read Only Memory
EPROM Erasable Programmable Read Only Memory
FPGA Field Programmable Gate Array
FTP File Transfer Protocol
PC Personal Computer
PDA Personal Digital Assistant
PDU Protocol Data Unit
RAM Random Access Memory
ROM Read Only Memory
TTL Time To Live
WLAN Wireless Local Area Network
WLL Wireless Local Loop
Detailed Description of the Invention
The present invention is an apparatus for and a method of searching multiple strings in parallel. The present invention is embodied in a content filter which is suitable for use in applications where an input data string is to be searched for the presence of one or more strings. For example, the content filter can be used in data communication systems to provide a real time search mechanism for searching the payload content of frames or packets in an input data stream. The content filter is operative to simultaneously search for a given set of strings contained within the input stream. The resulting output comprises a list of the matching strings found. Note that the input stream may comprise any type of input data in accordance with the particular application, such as frames, packets, bytes, PDUs, etc. For illustration purposes only, the input data stream is considered as a sequence of characters. The strings to be searched for are composed of the characters in the input data stream.
A diagram illustrating an example embodiment of the content filter of the present invention is shown in FIG. 1. The content filter, generally referenced 10, comprises an input register file 12, 3-character hash function 14, 3-character hash table 18, 3-character string table 22, 2-character hash function 16, 2-character hash table 20, 2-character string table 24, content search processor 26, configuration module 28 and temporary registers 38. It is noted that the content filter illustrated herein is shown for example purposes only and is not intended to limit the scope of the invention. One skilled in the electrical arts can construct other content filters in accordance with the particular implementation requirements using the principles of the present invention described herein without departing from the spirit and scope of the invention. The content filter is operative to search the input data stream 32 for a plurality of strings simultaneously.
The strings to be searched for are determined a priori, processed and stored in the substring tables 22, 24 during configuration of the content filter. As described in more detail hereinbelow, during configuration, the strings to be searched for are divided into a plurality of two and three character substrings. The substring tables function to store the data structures representing these three and two character substrings. The hashes of these substrings are generated and stored in hash tables used to generate an index into the substring tables. During searching, the content filter generates the hash of the input character stream and attempts to find a matching substring stored in the tables. Thus, hash functions are used both in configuring the filter and during the actual searching of the input data stream to determine whether a particular string is present. In operation, the input data stream is input to registers 12 which comprise three registers for temporarily storing three characters of the input stream. The registers effectively create a three and two character sliding window into the input stream. The characters are clocked in and their hash values calculated using hash functions 14, 16. Depending on the implementation, the characters are then output as an output data stream 34 or alternatively may be discarded. Both a three character hash and a two character hash are generated. The three character hash function 14 is taken over the most recent three characters in the input stream while the two character hash function 16 is taken over the most recent two characters in the input stream. The three and two character hash functions may comprise any suitable function, such as an exclusive OR (XOR) of the three or two characters with each other. More sophisticated hash functions may also be used so as to obtain a particular distribution density function of hits.
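The sliding window and XOR hashing described above can be sketched in software as follows. This is only an illustrative Python model, not the hardware of FIG. 1; the names `xor_hash` and `sliding_hashes` are hypothetical, and the characters are assumed to be 8-bit values.

```python
from collections import deque

def xor_hash(chars: bytes) -> int:
    """XOR all characters together; for 8-bit characters the result is 8 bits."""
    h = 0
    for c in chars:
        h ^= c
    return h

def sliding_hashes(stream: bytes):
    """Yield (position, 2-character hash, 3-character hash) per character clock.

    A hash is None until enough characters have been clocked in.
    """
    window = deque(maxlen=3)  # models the three input registers
    for pos, ch in enumerate(stream):
        window.append(ch)
        recent = bytes(window)
        h2 = xor_hash(recent[-2:]) if len(recent) >= 2 else None
        h3 = xor_hash(recent) if len(recent) == 3 else None
        yield pos, h2, h3

hashes = list(sliding_hashes(b"GET"))
```

Note that a plain XOR gives the same value for "ab" and "ba", which is one reason the characters themselves still have to be compared with the stored substring after a hash hit.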
The 3-character hash value generated is used as an index to the 3-character hash table. Similarly, the 2-character hash value generated is used as an index to the 2-character hash table. The 3-character hash table stores pointers to substring entries in the 3-character substring table. Similarly, the 2-character hash table stores pointers to substring entries in the 2-character substring table. The three and two character substrings found are input to the content search processor which functions to determine whether all the substrings making up a string have been found. In order for a string to be declared found, all its substrings must have been found in the correct consecutive order. The structure of the substring table entries will now be described in more detail. A diagram illustrating the structure of an entry in the two character string table is shown in FIG. 2A. The two character table entry, generally referenced 40, comprises a 9-bit previous index/application field 42, ignore case bit 44, ignore application bit 46, a 1-bit field 48 to indicate that the substring is the first in the string, a 1-bit field 50 to indicate that the substring is the last in the string, a 1-bit field 52 to indicate that the entry is the last of multiple entries for a particular hash value, and two byte fields 54, 56 for storing the two characters, char #1 and char #2 making up the substring. A diagram illustrating the structure of an entry in the three character string table is shown in FIG. 2B. The structure of the three character table entry, generally referenced 60, is similar to that of the two character table with the addition of a third character, char #3. 
In particular, the table comprises a 9-bit previous index/application field 62, ignore case bit 64, ignore application bit 66, a 1-bit field 68 to indicate that the substring is the first in the string, a 1-bit field 70 to indicate that the substring is the last in the string, a 1-bit field 72 to indicate that the entry is the last of multiple entries for a particular hash value, and three byte fields 74, 76, 78 for storing the three characters, char #1, char #2 and char #3 making up the substring. The configuration process portion of the present invention will now be described in more detail with reference to FIGS. 1, 2A, 2B and 3. A flow diagram illustrating the method of constructing the contents of the three and two character string tables in accordance with the present invention is shown in FIG. 3. The configuration module 28 is adapted to perform the content configuration method of the present invention. As described previously, the strings to be searched must be determined beforehand and processed. A string must comprise at least two characters in order for it to be searched. The first step is to divide the entire set of strings into two and three character substrings (step 80). As long as a string contains two or more characters, it can be broken down into two and three character substrings. The hash of every three and two character substring is generated using any suitable hash function, e.g., exclusive OR (step 82). The output of the hash function comprises an 8-bit pointer to the corresponding hash look up table. Entries are then created in both the hash tables and the substring tables (step 84). Note that the hash tables are initialized with null values while the substring tables may optionally be initialized. A null value indicates that there are no substrings in the substring table for that particular hash value.
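The division of step 80 can be sketched as follows. The text does not state how the split between two and three character pieces is chosen, so the policy here (prefer three character pieces and fall back to two character pieces only where a single character would otherwise be left over) is an assumption, as is the function name `split_into_substrings`.

```python
def split_into_substrings(s: str) -> list:
    """Split s into consecutive 2- and 3-character pieces.

    Assumed policy: take 3 characters at a time, except when 2 or 4
    characters remain, where taking 3 is impossible or would strand
    a single character.
    """
    if len(s) < 2:
        raise ValueError("a string needs at least two characters")
    pieces = []
    i = 0
    while i < len(s):
        remaining = len(s) - i
        size = 2 if remaining in (2, 4) else 3
        pieces.append(s[i:i + size])
        i += size
    return pieces

examples = {s: split_into_substrings(s) for s in ("GET", "HTTP", "HTTP/1.1")}
```

With this policy every string of two or more characters decomposes without remainder, which matches the statement that any such string can be broken down into two and three character substrings.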
In the example embodiment presented herein, the three and two character hash tables comprise 256 entries each wherein the 8-bit hash result is used as an index thereto. The hash tables output an 8-bit index into the substring tables. The three character substrings are placed in the three character substring table while the two character substrings are placed in the two character substring table. Both substring tables comprise 256 entries wherein the 8-bit hash pointer is used as an index thereto. A pointer to the particular entry in the substring table is stored in the appropriate location in the hash table (step 86). The location in the hash table corresponds to the hash value of the substring. Thus, during operation, the hash of the input stream characters serves as the index to the hash tables. In the event of a hit, the pointer in the hash table functions to point to a corresponding entry in one of the substring tables. The substring is read out and input to the content search processor. Next, the fields in the table entry are filled as follows. If the substring is the first substring in the string (step 88), the first substring field is set (step 96). If the substring is not the first, the index (i.e. address location) of the previous substring within the string is written to the previous index field of the current substring (step 90). If the substring is the last substring of the string (step 92), the last substring field is set to indicate this (step 94). Note that if multiple substrings generate the same hash value, all the entries corresponding to this hash value are stored in the same section of the substring table. There may be many strings that comprise the same substring. In this case, it is likely that each has a different previous index. In addition, the same substring may be the first, last or intermediate substring in a particular string.
In each case where the same substring appears in a different string, an additional entry is created with the fields set to the appropriate values. Thus, during searching, all the substrings located in the same area of the substring table that correspond to the same hash pointer are read and processed by the content search processor. In the case of multiple identical substrings, the location of the first substring in the group is stored as the pointer location in the hash table. In addition, the last string field of all but the last substring is cleared. The last string field of the last substring in the group is set to indicate that there are no more substrings stored in the substring table for that particular hash value. Note that there is no previous index for the first substring in a string. Optionally, the 9-bit field for the substring can be used to store additional information such as the particular application (or port) associated with the string. In the example presented herein, both the first two character substring entry and the first three character substring entry are used to store the application. It is noted that using a 256 entry three and two character substring table provides sufficient storage for a total of 1,280 characters. This value is derived from the sum of 256 × 3 characters and 256 × 2 characters. It is appreciated that larger or smaller substring tables may be constructed with corresponding different size hash tables depending on the requirements of the particular application. Once all the strings have been divided into substrings, their hash values calculated and the corresponding entries in the hash tables and substring tables created, the configuration phase is complete. The content filter is now ready to process characters in the input data stream. The hash of the input characters is used to index the substring table.
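The configuration phase can be modelled in software roughly as follows. This is a sketch under several assumptions that do not come from the text: the splitting policy in `split`, the XOR hash, the representation of the previous index as a `(substring length, address)` pair, and all names (`Entry`, `configure`, `TABLE_SIZE`). It lays the entries out so that those sharing a hash value are adjacent, stores the address of the first entry of each group in the hash table, and marks the last entry of each group.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

TABLE_SIZE = 256  # entries per table in the example embodiment

@dataclass
class Entry:
    chars: str
    first: bool                      # first substring of its string
    last: bool                       # last substring of its string
    prev: Optional[Tuple[int, int]]  # (substring length, address) of predecessor
    last_in_group: bool = False      # final entry stored for this hash value

def xor_hash(chars: str) -> int:
    h = 0
    for byte in chars.encode("latin-1"):
        h ^= byte
    return h

def split(s: str):
    """Assumed policy: 3-character pieces, with 2-character pieces as needed."""
    pieces, i = [], 0
    while i < len(s):
        size = 2 if len(s) - i in (2, 4) else 3
        pieces.append(s[i:i + size])
        i += size
    return pieces

def configure(strings):
    # Pass 1: one record per substring occurrence, linked to its predecessor.
    records = []
    for s in strings:
        if len(s) < 2:
            raise ValueError("strings need at least two characters")
        pieces, prev_id = split(s), None
        for k, piece in enumerate(pieces):
            records.append((piece, k == 0, k == len(pieces) - 1, prev_id))
            prev_id = len(records) - 1

    # Pass 2: assign addresses so that entries with equal hashes are adjacent,
    # and point each hash table slot at the first entry of its group.
    hash_tables = {2: [None] * TABLE_SIZE, 3: [None] * TABLE_SIZE}
    order, address = {}, {}
    for size in (2, 3):
        ids = sorted((i for i, r in enumerate(records) if len(r[0]) == size),
                     key=lambda i: xor_hash(records[i][0]))
        if len(ids) > TABLE_SIZE:
            raise ValueError("substring table overflow")
        order[size] = ids
        for addr, i in enumerate(ids):
            address[i] = (size, addr)
            h = xor_hash(records[i][0])
            if hash_tables[size][h] is None:
                hash_tables[size][h] = addr

    # Pass 3: create the entries, now that every predecessor address is known.
    sub_tables = {2: [], 3: []}
    for size, ids in order.items():
        for n, i in enumerate(ids):
            chars, first, last, prev_id = records[i]
            h = xor_hash(chars)
            ends = n + 1 == len(ids) or xor_hash(records[ids[n + 1]][0]) != h
            prev = address[prev_id] if prev_id is not None else None
            sub_tables[size].append(Entry(chars, first, last, prev, ends))
    return hash_tables, sub_tables

hash_tables, sub_tables = configure(["GET", "HTTP"])
```

Three passes are used because the previous index is an address, and addresses are only known once the entries have been grouped by hash value.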
Substrings found are input to the content search processor which is adapted to check the order the substrings are found using the previous index fields of the substring table entries. If all the substrings making up a string are found in the correct consecutive order, the string is declared found. The content search processor stores the substring information in found substring temporary registers. Found strings are stored in status registers. A diagram illustrating the found substring temporary register used in determining the substrings making up a string is shown in FIG. 4. The found substring temporary register, generally referenced 100, comprises at least two fields: a 9-bit index field 102 and a 2-bit time to live (TTL) field 104. The TTL field is used in determining whether the substrings found are consecutive with respect to each other. A diagram illustrating the status register used to provide the location and index of found strings is shown in FIG. 5. The status register, generally referenced 110, comprises a 1-bit valid field to indicate whether the contents of the register are valid, an 11-bit location field 114 representing the location of the last character of the string within the payload of the frame or packet and a 9-bit field 116 of the index of the last substring in the particular substring table. The operation of the content filter during searching will now be described in more detail. A flow diagram illustrating the method of searching an input data stream for a plurality of strings in accordance with the present invention is shown in FIG. 6. As described previously, the input characters are clocked into the registers 12 (FIG. 1). The hash on the most recent two characters is calculated (step 120) and the hash on the most recent three characters is also calculated (step 122) by the 2-character hash function 16 and the 3-character hash function 14, respectively. The hash values are used as look ups to the hash tables.
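As an illustration of the status register of FIG. 5, the three fields can be packed into a single 21-bit word. The field widths (1, 11 and 9 bits) come from the description; the ordering of the fields within the word and the function names `pack_status` and `unpack_status` are assumptions made for this sketch.

```python
VALID_BITS, LOCATION_BITS, INDEX_BITS = 1, 11, 9

def pack_status(valid: bool, location: int, index: int) -> int:
    """Pack the fields as [valid | location | index], valid in the top bit (assumed layout)."""
    if not 0 <= location < (1 << LOCATION_BITS):
        raise ValueError("location does not fit in 11 bits")
    if not 0 <= index < (1 << INDEX_BITS):
        raise ValueError("index does not fit in 9 bits")
    return (int(valid) << (LOCATION_BITS + INDEX_BITS)) | (location << INDEX_BITS) | index

def unpack_status(word: int):
    """Return (valid, location, index) from a packed status word."""
    index = word & ((1 << INDEX_BITS) - 1)
    location = (word >> INDEX_BITS) & ((1 << LOCATION_BITS) - 1)
    valid = bool((word >> (LOCATION_BITS + INDEX_BITS)) & 1)
    return valid, location, index

word = pack_status(True, 1500, 0x105)
fields = unpack_status(word)
```

An 11-bit location field can address offsets 0 through 2047, so in this model a string ending beyond that offset in the payload cannot be reported.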
The pointer in the hash table is used as an index to the substring tables. In the event of a hit on the hash table (i.e. the contents of the location is non-null), the one or more substrings in the substring table corresponding to the pointer are read out (step 124). In the case of a single entry corresponding to the hash pointer, the input characters are compared with the substring. In the case of multiple entries corresponding to the hash pointer, the input characters are compared with each substring in the group. If a match is found (step 126), the first substring field is checked to see if the substring is the first in the string (step 128). If it is, it is then checked whether the substring is the last within the string (step 132). Note that it is possible that a string is comprised of a single substring. In this case, the first and last substring fields in the character table entry will be set. If it is not the last substring in the string, then a found substring temporary register is created and the index corresponding to the substring is stored in the register (step 134). The index comprises 9 bits made up of a bit to indicate either the three or two character substring table and the 8-bit address of the entry within the particular substring table. In addition, the 2-bit time to live field is initialized to the value three. The TTL field is used by the content search processor in determining whether two substrings are consecutive with each other in the input data stream. Each character clock, the TTL field is decremented by one. If the TTL field of a substring reaches zero the substring is discarded since no substring consecutive to this one was found. If the substring is not the first substring in the string (step 128), the index and the TTL fields of the previous substring are verified (step 130). When a substring is received, the content search processor examines the previous index field of the substring entry. 
A bit indicating either the three or two character substring table is added to the previous index field resulting in a 9-bit index. The processor then searches for a temporary register having a matching index field. If a match is found, the TTL field is examined. If the current substring is two characters, the TTL is checked for a value of two. If the current substring is three characters, the TTL is checked for a value of one. Note that this assumes that the TTL field is decremented at the end of each character processing cycle. Alternatively, depending on the particular implementation, the time to live field may comprise a different length, e.g., three bits. The operation, however, is the same in that the field is used to indicate the `freshness` of the substring. A substring for which no subsequent substring was found within the appropriate time is discarded. Thus, for a substring to be considered in the correct consecutive order, a temporary register must be found whose index matches that of the previous index field of the newly found substring. In addition, the TTL field must be set to the proper value. For a particular substring, if the next substring is three characters long, then the TTL count will decrement to one since a substring match cannot be detected until the next three characters are clocked in. Likewise, if the next substring is two characters long, then the TTL count will decrement to two since a substring match cannot be detected until the next two characters are input. If the index and TTL fields verify correctly (step 130), it is then checked if the substring is the last in the string (step 132). If it is, the string is declared as found and a status register is written with the valid bit set, location of the string and the index of the last substring. Depending on the application, the information is forwarded to another module for further processing.
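The search procedure of FIG. 6, including the TTL check, can be simulated in software as sketched below. This is a simplified model and not the described hardware: a dictionary of hash buckets stands in for the hash table plus contiguous table sections, an index is represented as a `(substring length, address)` pair instead of a packed 9-bit value, and the splitting policy and all function names are assumptions. A register is created with a TTL of three at the end of the cycle in which its substring is found, and existing registers are decremented at the end of every cycle.

```python
def xor_hash(chars: str) -> int:
    h = 0
    for byte in chars.encode("latin-1"):
        h ^= byte
    return h

def split(s: str):
    """Assumed policy: 3-character pieces, with 2-character pieces as needed."""
    pieces, i = [], 0
    while i < len(s):
        size = 2 if len(s) - i in (2, 4) else 3
        pieces.append(s[i:i + size])
        i += size
    return pieces

def build(strings):
    tables = {2: [], 3: []}   # substring tables
    buckets = {2: {}, 3: {}}  # hash value -> addresses of the entries in that group
    for s in strings:
        prev, pieces = None, split(s)
        for k, piece in enumerate(pieces):
            size, addr = len(piece), len(tables[len(piece)])
            tables[size].append({"chars": piece, "first": k == 0,
                                 "last": k == len(pieces) - 1, "prev": prev})
            buckets[size].setdefault(xor_hash(piece), []).append(addr)
            prev = (size, addr)
    return tables, buckets

def search(stream: str, tables, buckets):
    found = []      # (location of last character, index of last substring)
    registers = []  # found substring temporary registers: (index, ttl)
    for pos in range(len(stream)):
        new = []
        for size in (2, 3):
            if pos + 1 < size:
                continue
            window = stream[pos + 1 - size:pos + 1]
            for addr in buckets[size].get(xor_hash(window), []):
                entry = tables[size][addr]
                if entry["chars"] != window:
                    continue  # hash hit, but the characters differ
                if not entry["first"]:
                    # The predecessor must have ended exactly `size` characters ago.
                    expected_ttl = 2 if size == 2 else 1
                    if (entry["prev"], expected_ttl) not in registers:
                        continue
                if entry["last"]:
                    found.append((pos, (size, addr)))
                else:
                    new.append((size, addr))
        # End of the character cycle: age registers, discard expired ones,
        # then store the substrings found in this cycle with a TTL of three.
        registers = [(i, t - 1) for i, t in registers if t > 1]
        registers += [(i, 3) for i in new]
    return found

tables, buckets = build(["GET", "HTTP/1.1", "POST"])
matches = search("GET / HTTP/1.1", tables, buckets)
```

In this model "POST" is reported for the input "POST" but not for "PO ST", because the register for "PO" holds a TTL of one rather than two by the time "ST" arrives.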
Note that in the embodiment shown, the substrings making up a string are not saved during the search process. Thus, upon finding a complete string, only the index of the last substring and its location in the input stream is provided. Note that the location in the input stream may comprise the location in the payload portion of a frame or the location in the input stream since the search was begun. It is noted that subsequent processing stages can easily construct the string found by referencing the previous index field of each substring in backwards fashion starting with the last substring indicated in the status register. Alternatively, the content filter may comprise means for reconstructing the entire string. If the substring is not the last in the string (step 132), substring information including the 9-bit index and TTL field is stored in a temporary register (step 134). The TTL field is initialized to three and decremented at each character clock cycle. If the TTL field decrements to zero, the content of the corresponding temporary register is discarded. Thus, in this fashion, the content search processor guarantees that the substrings making up a string must be in the correct consecutive order in order for a string to be declared as found. Note that in alternative embodiments, the string search can be limited depending upon one or more criteria. For example, the search can be limited to strings associated with a particular application or port. In this case, the application or port field of the first substring in the string is checked before the search process continues. If the field contains an application or port not in the allowed set, the substring is discarded. Further options can be defined depending on the application requirements. 
For example, the ignore case field can be used to instruct the content search processor to declare a string found regardless of whether the input characters are upper or lower case thus making the string search case insensitive. In addition, the ignore application field allows the string search to be performed regardless of the particular application associated with a string. If this bit is set, the application field of the first substring is ignored. Thus, the content filter of the present invention can be used to filter an input character stream for the presence of a set of strings. For example, the set of strings may comprise the relevant strings associated with one or more communication protocols.
Computer Embodiment
In another embodiment, a computer is operative to execute software adapted to perform the multiple parallel string search method of the present invention. A block diagram illustrating an example computer processing system adapted to perform the multiple parallel string search method of the present invention is shown in FIG. 7. The system may be incorporated within a communications device such as a PDA, cellular telephone, cable modem, broadband modem, laptop, PC, network transmission or switching equipment, network device or any other wired or wireless communications device. The device may be constructed using any combination of hardware and/or software. The computer system, generally referenced 140, comprises a processor 142 which may be implemented as a microcontroller, microprocessor, microcomputer, ASIC core, FPGA core, central processing unit (CPU) or digital signal processor (DSP). The system further comprises static read only memory (ROM) 146 and dynamic main memory (e.g., RAM) 150 all in communication with the processor. The processor is also in communication, via a bus 144, with a number of peripheral devices that are also included in the computer system. The device is connected to a data communications network 152 via a network interface 154.
The interface comprises wired and/or wireless interfaces to one or more communication channels. Communications I/O processing 156 transfers data between the network interface and the processor. An optional user interface 158 responds to user inputs and provides feedback and other status information. A host interface 160 connects a host device 162 to the system. The host is adapted to configure, control and maintain the operation of the system. The system also comprises a magnetic storage device 148 for storing application programs and data. The system comprises a computer readable storage medium which may include any suitable memory means including but not limited to magnetic storage, optical storage, semiconductor volatile or non-volatile memory, biological memory devices, or any other memory storage device. The multiple parallel string search method software is adapted to reside on a computer readable medium, such as a magnetic disk within a disk drive unit. Alternatively, the computer readable medium may comprise a floppy disk, Flash memory card, EPROM, EEROM, EEPROM based memory, bubble memory storage, ROM storage, etc. The software adapted to perform the multiple parallel string search method of the present invention may also reside, in whole or in part, in the static or dynamic main memories or in firmware within the processor of the computer system (i.e. within microcontroller, microprocessor, microcomputer, DSP, etc. internal memory). In alternative embodiments, the method of the present invention may be applicable to implementations of the invention in integrated circuits, field programmable gate arrays (FPGAs), chip sets or application specific integrated circuits (ASICs), wireless implementations and other communication system products. It is intended that the appended claims cover all such features and advantages of the invention that fall within the spirit and scope of the present invention.
As numerous modifications and changes will readily occur to those skilled in the art, it is intended that the invention not be limited to the limited number of embodiments described herein. Accordingly, it will be appreciated that all suitable variations, modifications and equivalents may be resorted to, falling within the spirit and scope of the present invention.
