Method of searching a data record for a valid identifier
6564214
Abstract
A backend data processor of a network monitoring system attempts to identify the point of presence (POP) associated with each uploaded network performance data record by comparing the POP identification information contained in each uploaded data record with the known, valid POP telephone numbers stored in a lookup phone book. A series of lookup searches are performed by comparing a certain number of the digits of the raw POP string with corresponding digits of the POP numbers stored in the lookup phone book until an exact, unique match is found. An initial "pessimistic" lookup search compares the rightmost N digits of the uploaded POP string with the rightmost N digits of each POP number in the lookup phone book for all countries. If the initial pessimistic lookup search is unsuccessful, an optimistic lookup search is conducted taking into account independent information indicating the country code and area code from which the data record originated. If the optimistic lookup search is unsuccessful, a final pessimistic lookup search is conducted by again comparing the rightmost N digits of the uploaded POP string with the rightmost N digits of each POP number in the lookup phone book for all countries, with successively smaller values of N.
Claims
What is claimed is:
Description
BACKGROUND OF THE INVENTION
TABLE 1
Attribute Column    Description                                          Example
ProtocolVer         InSight data version                                 3
OS                  Operating system platform/version                    "Win 95 4.0.1212 b"
OEM1                OEM 1                                                "ISP Name"
ProductVer          InSight version                                      "3.10"
RawProvider         ISP/DUN name uploaded                                "ISP Name"
RawPOP              POP number uploaded                                  "555-1212"
RawCountry          Originating country code uploaded                    "1"
RawCity/Area        Originating city/area code uploaded                  "609"
RawModem            Modem name uploaded                                  "Standard 28800"
PPPServerIP         PPP Server IP                                        "207.240.85.52"
RawRasUserName      RasUserName uploaded                                 "MyAccount"
PSTTime             Date/time in PST timezone                            "03/15/1998 15:56:06"
LocalTime           Date/time in local timezone                          "03/15/1998 15:56:06"
ResultCode          Result code                                          0
ElapsedTimeMs       Milliseconds from start of call/test to result code  31147
InitConnectSpeed    Initial modem connection speed                       28800
IsRedial            Whether this is a redial attempt                     0
It will be understood that the data record shown in Table 1 is provided by way of example only and is not intended to be limiting in any way on the scope of the invention. In general, any data record format suitable for conveying network monitoring data, user configuration data and identification information such as country, area code, POP number, service provider, etc. falls within the scope of the invention. The data records collected at collectors 15 are forwarded to aggregator 16, following any data cleansing, over connection 48. Connection 48 can be, for example, a persistent TCP connection. If performance is less of a concern, a non-persistent connection can be used. The data transfer can be done securely or non-securely. The data collected from the user modules, by its very nature, is not in an ideally normalized form. For example, one user module might be reporting connection data relating to a POP number of *70,555-1234 while another user module might report for a POP number of 555-1234. Recognizing that the prefix "*70" is the "turn call waiting off" code, it is immediately apparent that these two user modules are reporting data with respect to the same POP. In order to correlate and aggregate the data at database server 18 informatively, these similarities need to be detected. In accordance with the present invention, in order to more reliably determine the identity of the POP connecting an end-user who has uploaded monitoring information to the aggregator, the aggregator employs a customer supplied "phone book" with information about the exact number used by the customers for their POPs. More specifically, each POP can be uniquely identified by the telephone number (area code and local number) used to connect to the POP (i.e., its "POP number"). 
The POP identification technique of the present invention attempts to identify the POP associated with each record of monitoring data sent to the aggregator by comparing a raw POP string contained in each uploaded data record with preloaded POP numbers stored in the lookup phone book. As described in greater detail hereinbelow, the lookup phone book is essentially a list or table of pre-stored POP telephone numbers that are known to be valid. The lookup phone book can be organized by identifiers such as country code, service provider, equipment manufacturer, etc. The preloaded POP numbers are supplied by the service providers, and the lookup phone book of POP numbers is continually maintained (e.g., on a weekly or monthly basis) to reflect added, deleted and modified POP numbers within the service providers' systems.
The present invention involves comparing a portion (i.e., a certain number of digits) of the raw POP string with the POP numbers stored in the lookup phone book until an exact, unique match is found. The method includes an initial "pessimistic" lookup search in which comparisons are made between the last (rightmost) N digits of the raw POP string and the rightmost N digits of each POP number in the lookup phone book. The initial pessimistic lookup search involves comparing the digits of the raw POP string with all POP numbers in the lookup phone book for all countries. If a unique match is found by the initial pessimistic lookup search, the identity of the POP is determined to be the matching POP number in the lookup phone book. Character strings or numbers are conventionally presented as a sequence of digits extending from left (most significant digit) to right (least significant digit). As used herein, the term "last" or "rightmost" refers to the sequence of characters or digits that appear last or on the right-hand side when a string of digits or characters is presented in this manner.
For example, in a ten-digit telephone number with a three digit area code followed by a seven digit local number, e.g., (800) 555-1212, the last or rightmost seven digits contain the local number, i.e., 5551212. If the initial pessimistic lookup search fails to find a unique match between the raw POP string and any of the POP numbers in the lookup phone book, an "optimistic" lookup search is conducted in which independent information indicating the user's country code and area code (e.g., user configuration information uploaded along with the raw POP string) is relied upon to match a portion of the raw POP string to a POP number in the lookup phone book. The optimistic lookup search attempts to match the raw POP string to POP numbers in the lookup phone book that correspond to the calling country indicated by the user's country code contained in the user-configured information. If a unique match is found by the optimistic lookup search, the identity of the POP is determined to be the matching POP number in the lookup phone book. If the optimistic lookup search fails to find a unique match between the raw POP string and any of the POP numbers in the lookup phone book, the independent country code and area code information is distrusted, and a final "pessimistic" lookup search is conducted in which comparisons are made between the last N digits of the raw POP string and the last N digits of each POP number in the lookup phone book for all countries. In the final "pessimistic" lookup search, the number of compared digits N is successively decremented down to a minimum value until a unique match is found or multiple matches are found. In the case of multiple matches, the POP is identified by the matching portion of the POP string in the lookup phone book, resulting in an incomplete or partial identification of the POP string. A more detailed explanation of an exemplary embodiment of the POP lookup method of the present invention follows. 
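By way of non-limiting illustration, the fall-through order of the three lookup searches summarized above can be sketched in Python as follows. The function name identify_pop, the Search type alias and the callable-based interface are hypothetical and are not part of the described embodiment; each search is assumed to return an identification string on success and None on failure.

```python
from typing import Callable, Optional, Sequence

# A search takes the raw POP string and returns an identification, or None
# when it cannot identify the POP (hypothetical interface).
Search = Callable[[str], Optional[str]]


def identify_pop(raw_pop: str, searches: Sequence[Search]) -> Optional[str]:
    """Run the searches in order and return the first identification found."""
    for search in searches:
        result = search(raw_pop)
        if result is not None:
            return result
    # No search succeeded; the caller may treat the string as unparsable.
    return None


# Usage with stand-in searches (the real searches are described below).
initial_pessimistic = lambda raw: None
optimistic = lambda raw: "8005551212" if raw.endswith("5551212") else None
final_pessimistic = lambda raw: None

pop = identify_pop("5551212", [initial_pessimistic, optimistic, final_pessimistic])
```

Passing the searches as an ordered list keeps the ordering (initial pessimistic, then optimistic, then final pessimistic) in one place, which is a design choice of this sketch only.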
As seen in the example shown in Table 1, the aggregator receives several pieces of information with each call record uploaded from an end user machine. One piece of information contained in this set of data is the actual telephone number/string dialed by the user's modem. This string may contain a calling card number prefix, a dialing prefix, or other optional dialing digits or characters. By way of example, the modem-dialed string containing the POP number may contain one or more of the following tokens: escape characters to reach an outside line (e.g. "9," from a typical U.S. hotel); pause characters (e.g. ","); a country code; a code to indicate calls to a foreign country (e.g. "011" in U.S.); a code to indicate calls across area codes/regions (e.g. "1" in U.S.); a call waiting cancel code; a caller ID blocking code; an area code; a local number; calling card information; ISDN information; extraneous characters; and other miscellaneous characters.
Again, POPs are identified by their telephone numbers, including the area code and local number. Contained within the character string dialed by the telephone modem is a sequence of numeric digits representing the telephone number of the POP to which the user is connected or, alternatively, an invalid POP number to which the user attempted unsuccessfully to connect (such information can be uploaded to the aggregator subsequent to the failed attempt, once the end user successfully connects to the network by dialing a valid POP number). Thus, the POP string is essentially a data segment (i.e., a sequence of alphanumeric characters, symbols or numeric digits) embedded within a particular data field of the data record along with other data. As used herein and in the claims, the term "string" refers to a sequence of symbols that represent data or information, and the term "digit" refers to a single one of the symbols in such a sequence or the position of a particular symbol within the sequence.
Although, in the context of a modem-dialed string, the string includes alphanumeric characters and possibly punctuation and other keyboard symbols, and the POP number comprises decimal-based numbers (i.e., base-ten numbers), more generally, a "string" as used herein and in the claims can be a sequence of any kind of information symbols (e.g., binary numbers). Likewise, a "data segment" can be any portion of (or all of) such a string. The present invention attempts to extract from the telephone modem string the identity of the POP corresponding to the monitoring data uploaded from the end user client.
A top level flow diagram of the method of the present invention is shown in FIG. 3. In accordance with a first step 100, the aggregator reduces the telephone modem string to a raw POP string by stripping out certain characters known not to be part of the POP string. For example, any commas (,) and characters preceding a comma (,), any ampersands (&) and characters following an ampersand, and any non-digit (i.e., non-numeric) characters are stripped from the telephone modem string to yield the raw POP string. Note that the stripping process employed in the present invention need not be as exhaustive as that used in an algorithm that attempts to remove all characters that are not part of the POP identifier (i.e., all but the area code and local number). According to the present invention, characters other than those corresponding to the POP number can remain in the raw POP string provided the rightmost digits of the raw POP string are those corresponding to the POP telephone number, or a specific digit or character position can be identified at the rightmost digit of the POP phone number, effectively making that digit the rightmost digit of the raw POP string. After preparing the raw POP string in step 100, the initial "pessimistic" lookup search is performed in step 200. As shown in greater detail in FIG.
4, the rightmost N digits of the raw POP string are selected (step 210). The integer N represents a string length defining the number of digits of the raw POP string to be compared with pre-stored valid POP numbers in the lookup phone book. By way of non-limiting example, the value of N can be initially set to nine (9). The lookup phone book contains a list of all valid POP phone numbers for the customer grouped by service provider, country, and area code. The phone book is essentially a database that is populated either manually or in an automated manner from the directory that a service provider maintains regarding its POPs. When this data is imported into the network monitoring system, it can be used as a "matching" table to give more accurate recognition of the user-supplied dialed string than the aforementioned conventional parsing algorithm. Table 2 provides examples of the type of information that may be contained in the lookup phone book.
TABLE 2
OEM     ISP     Country Code    Area Code    Local Number    POP              Description
OEM1    ISP1    1               800          5551212         800-555-1212     toll free
OEM2    ISP2    1               963          1234567         963-123-4567
OEM3    ISP3    49              0355         4968485         0355-496-8485
OEM4    ISP4    49              0368         6464800         0368-646-4800
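By way of non-limiting illustration, the stripping of step 100 and a phone book holding the entries of Table 2 might be sketched in Python as follows. The names PopEntry, PHONE_BOOK and raw_pop_string are hypothetical. The sketch assumes one reading of the stripping rule, namely that everything up to the last comma and everything from the first ampersand onward is discarded before the remaining non-digit characters are removed.

```python
import re
from typing import NamedTuple


class PopEntry(NamedTuple):
    """One row of the lookup phone book (fields follow Table 2)."""
    oem: str
    isp: str
    country_code: str
    area_code: str
    local_number: str

    @property
    def number(self) -> str:
        # The POP number is the area code followed by the local number.
        return self.area_code + self.local_number


PHONE_BOOK = [
    PopEntry("OEM1", "ISP1", "1", "800", "5551212"),
    PopEntry("OEM2", "ISP2", "1", "963", "1234567"),
    PopEntry("OEM3", "ISP3", "49", "0355", "4968485"),
    PopEntry("OEM4", "ISP4", "49", "0368", "6464800"),
]


def raw_pop_string(dialed: str) -> str:
    """Reduce a modem-dialed string to a raw POP string (step 100)."""
    # Assumption: drop everything up to and including the last comma.
    tail = dialed.rsplit(",", 1)[-1]
    # Assumption: drop the first ampersand and everything after it.
    head = tail.split("&", 1)[0]
    # Drop any remaining non-digit characters.
    return re.sub(r"[^0-9]", "", head)
```

Under these assumptions, the two dialed strings mentioned earlier, "*70,555-1234" and "555-1234", reduce to the same raw POP string, which is what allows the records to be correlated.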
The N digits of the raw POP string are compared with the rightmost N digits of each of the POP numbers contained in the lookup phone book for every country (step 220). If an exact, unique match is found between the N digits of the raw POP string and one of the POP numbers in the lookup phone book, then it is determined in step 230 that the raw POP string contains a valid POP number, namely the valid POP number whose rightmost digits matched those of the raw POP string. If an exact, unique match is not found, then processing continues to step 240. If the rightmost N digits of the raw POP string produce an exact match with the N rightmost digits of more than one of the POP numbers in the lookup phone book, then it is decided in step 240 to terminate the initial pessimistic lookup search without a match, and the optimistic lookup search is initiated. If, on the other hand, no matches are found between the rightmost N digits of the raw POP string and the rightmost N digits of any of the POP numbers in the lookup phone book for all countries, the value of N is decremented to eight (8), and the process is repeated, as shown in FIG. 4, by comparing the rightmost eight digits of the raw POP string with the rightmost eight digits of each of the POP numbers contained in the lookup phone book for every country. If an exact, unique match is found between the rightmost eight digits of the raw POP string and the rightmost eight digits of one of the POP numbers in the lookup phone book, then it is determined that the raw POP string contains a valid POP number, namely the valid POP number whose rightmost digits matched those of the raw POP string. If the rightmost eight digits of the raw POP string produce an exact match with the rightmost eight digits of more than one of the POP numbers in the lookup phone book, the initial pessimistic lookup search is terminated without declaring a match, and the optimistic lookup search is initiated. 
If no match is found between the rightmost eight digits of the raw POP string and rightmost eight digits of any of the POP numbers in the lookup phone book, the initial pessimistic lookup search is terminated without declaring a match, and the optimistic lookup search is initiated. It will be understood from the foregoing that each comparison performed in the pessimistic lookup search does not necessarily involve comparing the entire raw POP string with the entire stored POP number. For example, in the U.S., a complete POP number, including area code and local number, consists of ten digits. Consequently, comparing the rightmost eight or nine digits of the raw POP string excludes from the comparison the leftmost digit(s) of U.S. POP numbers that include the area code. Nevertheless, the approach taken by the initial pessimistic lookup search allows the raw POP string to be easily compared with all POP numbers from every country without regard to the various different formats of the POP numbers throughout the world, while still providing a high likelihood of successfully finding a unique match. Moreover, the method does not rely on any supplemental information, such as knowledge of the calling country or area code; consequently, the initial pessimistic lookup search cannot be corrupted by inaccurate supplemental information. Thus, the initial pessimistic lookup search relies on a minimum amount of information to conduct a very broad search. While described herein as involving the rightmost nine and eight digits, it will be understood that the initial pessimistic lookup search can involve comparing any suitable number of rightmost digits (e.g.: 10, 9 and 8; 9, 8 and 7; etc.), and the present invention is not limited to the exemplary embodiment involving searches with only the rightmost nine and eight digits. 
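By way of non-limiting illustration, the initial pessimistic lookup search of FIG. 4 can be sketched in Python as follows. The function names and the representation of the phone book as a list of digit strings (area code plus local number, assumed distinct) are hypothetical. The description does not say how a raw POP string or stored number shorter than N digits is handled; this sketch assumes it simply produces no match at that length.

```python
from typing import List, Optional, Sequence


def suffix_matches(raw_pop: str, book: Sequence[str], n: int) -> List[str]:
    """Return the stored numbers whose rightmost n digits equal those of raw_pop."""
    if n <= 0 or len(raw_pop) < n:
        return []  # assumption: too few digits means no match at this length
    tail = raw_pop[-n:]
    return [num for num in book if len(num) >= n and num[-n:] == tail]


def initial_pessimistic_search(raw_pop: str, book: Sequence[str],
                               lengths: Sequence[int] = (9, 8)) -> Optional[str]:
    """Strict search over all countries: only an exact, unique match counts."""
    for n in lengths:
        matches = suffix_matches(raw_pop, book, n)
        if len(matches) == 1:
            return matches[0]  # unique match: POP identified (step 230)
        if len(matches) > 1:
            return None        # ambiguous: stop and fall through (step 240)
        # No match at this length: try the next, shorter length.
    return None


book = ["8005551212", "9631234567", "03554968485", "03686464800"]
found = initial_pessimistic_search("18005551212", book)
```

Here the raw POP string carries a leading "1", yet the rightmost nine digits still select the toll-free entry of Table 2 without any knowledge of the country or area code.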
However, use of nine and eight digits in the initial pessimistic lookup search has been found to be well suited for efficient searching given the present length and format of POP numbers throughout the world, and provides a very low probability of matching the raw POP string with the wrong valid POP number. Further, if the wrong portion of the modem-dialed string is assumed to be the raw POP string (e.g., a sequence corresponding to a credit card number), there is very little chance that a random match will occur between any of the pre-stored POP numbers and the eight or nine errant digits of the modem-dialed string. Thus, the initial pessimistic lookup search is a "strict" search in the sense that comparisons involving only a fairly large number of digits (e.g., eight or nine) are attempted and only a unique, exact match is considered positive identification of the POP number.
Referring again to FIG. 3, if the initial pessimistic lookup search fails to identify a unique match between the rightmost N digits of the raw POP string and the rightmost N digits of any of the POP numbers in the lookup phone book or finds plural matches, an "optimistic" lookup search is performed (step 300). Unlike the initial pessimistic lookup search, the optimistic lookup search is an information-assisted search in that it relies on information about the data record, in addition to the raw POP string, to refine and narrow the search for a matching valid POP number in the lookup phone book. Specifically, the optimistic lookup search attempts to rely on supplemental information that reveals the calling end-user's country, indicating where the uploaded data record originated, and, if necessary, the area code and service provider corresponding to the data record. One of the primary difficulties with international phone numbers is that the area codes and local numbers are variable length.
To address this problem in the context of the present invention, a list of country codes, number of digits of local numbers, and number of digits of area codes is maintained in a configuration database. The list forms a set of "POP rules" which indicate, for each POP number format in each country, the number of digits of the area code and the number of digits of the local number. These POP rules are employed in the optimistic lookup search where the user-configured country code is used to find a match between the raw POP string and a POP number in the lookup phone book. By way of non-limiting example, a POP rules list is shown in Table 3. Note that certain countries, e.g. Germany, may have more than one POP rule (i.e., different POP numbers may have different length area codes and local numbers).
TABLE 3
Country Code    # digits local number    # digits area code
1 (US)          7                        3
49 (Germany)    7                        4
49 (Germany)    7                        3
49 (Germany)    7                        2
49 (Germany)    6                        3
49 (Germany)    6                        2
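By way of non-limiting illustration, the POP rules of Table 3 might be held in a configuration structure such as the following Python sketch. The names PopRule, POP_RULES and rules_for are hypothetical. The rules are returned longest first (by area code digits plus local number digits), which reflects the ordering the description states as preferable for the optimistic lookup search; returning an empty list for an unknown country code is an assumption of this sketch.

```python
from typing import Dict, List, NamedTuple


class PopRule(NamedTuple):
    """Digit counts for one POP number format of a country (Table 3)."""
    area_code_digits: int
    local_number_digits: int

    @property
    def length(self) -> int:
        return self.area_code_digits + self.local_number_digits


# Keyed by country code; values copied from Table 3.
POP_RULES: Dict[str, List[PopRule]] = {
    "1": [PopRule(3, 7)],
    "49": [PopRule(4, 7), PopRule(3, 7), PopRule(2, 7),
           PopRule(3, 6), PopRule(2, 6)],
}


def rules_for(country_code: str) -> List[PopRule]:
    """Return the country's POP rules sorted from longest to shortest."""
    rules = POP_RULES.get(country_code, [])  # assumption: unknown code -> no rules
    return sorted(rules, key=lambda rule: rule.length, reverse=True)


german_lengths = [rule.length for rule in rules_for("49")]
```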
Referring to FIG. 5, a detailed flow diagram illustrating the steps of the optimistic lookup search is shown. In a first step 310, a country code, an area code and a service provider identifier corresponding to the uploaded data record are identified. As can be seen from the example shown in Table 1, the data record may contain data fields, separate from the data field containing the raw POP number, that contain the country code, the area code and the service provider identifier of the data record. The country code, area code and service provider identifier are typically user-configured parameters preset in the end-user's machine and automatically inserted into the uploaded record. Because this user-configured information is not actually used to make the connection to the network, there exists some possibility that this information is inaccurate or outdated. In general, the country code, area code and service provider information used in the optimistic lookup search can be derived from any suitable source and conveyed to the aggregator in any convenient manner, so long as this information is essentially independent of or at least distinct from the raw POP string itself.
Referring again to FIG. 5, in accordance with a first stage of the optimistic lookup search, the country code information is used to obtain a list of K different POP rules for that country, which specify the number of digits in the area code (AC) and the number of digits in the local number (LN), where K is a positive integer (step 315). The POP rule is used to determine the length of the raw POP string that will be compared with the valid POP numbers stored in the lookup phone book. Beginning with the first POP rule in the retrieved list (represented in FIG.
5 as POP rule(I), where I=1), the rightmost AC+LN digits of the raw POP string are selected and compared with the rightmost AC+LN digits of each of the POP numbers in the lookup phone book for the country indicated by the country code (step 320). For example, in the U.S., the lone POP rule (K=1) requires a three-digit area code (AC=3) and seven-digit local number (LN=7), resulting in comparisons between the rightmost ten digits of the raw POP string and the rightmost ten digits of the pre-stored valid POP numbers. Note that, in the optimistic lookup search, the search is limited to the pre-stored POP numbers of the country indicated by the user's country code. Pre-stored POP numbers of other countries are not searched. If an exact, unique match is found between the rightmost AC+LN digits of the raw POP string and the rightmost AC+LN digits of one of the pre-stored POP numbers for the identified country, then it is determined in step 325 that the raw POP string contains a valid POP number, namely the valid POP number whose rightmost AC+LN digits matched those of the raw POP string. If the comparisons performed in step 320 fail to produce an exact, unique match, step 320 is repeated for each of the remaining POP rules for that country until an exact, unique match is found or until the search has been conducted with all of the POP rules for the country (step 330). Note that many countries may have only a single POP rule (e.g., in the U.S. the area code is always three digits and the local number is always seven digits and K=1), while other countries may have multiple POP rules (e.g., Germany) which may require step 320 to be repeated for each rule until a match is found. Preferably, where a country has more than one POP rule, the rules are sorted in descending order from longest to shortest (length=AC+LN), and searching is conducted in order of descending POP rule length. In FIG.
5, the looping through of the set of POP rules is represented by incrementing a POP rule index and repeating step 320 with POP rule(I) until an exact, unique match is found or the index I equals the total number of POP rules K.
If, in the first stage of the optimistic lookup search, no match is found between the rightmost AC+LN digits of the raw POP string and those of the pre-stored valid POP numbers of the country for any of the POP rules of the country, a second search stage is performed under the optimistic lookup search. Specifically, another lookup search is conducted by concatenating the independently-supplied user area code with the rightmost LN digits of the raw POP string (i.e., the portion of the raw POP string that represents the local number), thereby forming a concatenated string of AC+LN digits (step 335). Beginning with the first listed POP rule (POP rule(1)) for the calling country, the concatenated digits are compared with the rightmost AC+LN digits of the POP numbers in the lookup phone book corresponding to the user's country (step 340). If an exact, unique match is found between the concatenated digits and the rightmost AC+LN digits of one of the pre-stored POP numbers for the identified country, then it is determined at step 345 that the raw POP string contains a valid POP number, namely the valid POP number whose rightmost AC+LN digits matched those of the concatenated digits. If no match is found, steps 335 and 340 are repeated for each of the country's K listed POP rules until a match is found or until all the POP rules for that country have been tried (step 350). The second search stage of the optimistic lookup search is essentially the same as the first search stage, except that the user-configured area code has been substituted for the digits of the raw POP string that were assumed to be the area code in the first optimistic search approach.
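By way of non-limiting illustration, the first and second stages of the optimistic lookup search can be sketched in Python as follows. All names are hypothetical. The phone book is represented as (country code, POP number) pairs and the POP rules as (AC, LN) pairs already sorted longest first. The sketch assumes that a rule is skipped in the second stage when the user-configured area code does not have exactly AC digits, since the concatenated string would then not have AC+LN digits; the description does not address this case.

```python
from typing import Optional, Sequence, Tuple

Entry = Tuple[str, str]  # (country code, POP number = area code + local number)
Rule = Tuple[int, int]   # (area code digits AC, local number digits LN)


def unique_suffix_match(digits: str, numbers: Sequence[str]) -> Optional[str]:
    """Return the only number ending in `digits`, or None if none or several do."""
    n = len(digits)
    if n == 0:
        return None
    hits = [num for num in numbers if len(num) >= n and num[-n:] == digits]
    return hits[0] if len(hits) == 1 else None


def optimistic_stages_1_and_2(raw_pop: str, user_country: str, user_area: str,
                              book: Sequence[Entry],
                              rules: Sequence[Rule]) -> Optional[str]:
    # Only POP numbers of the user-configured country are searched.
    numbers = [num for country, num in book if country == user_country]
    # Stage 1: rightmost AC+LN digits of the raw POP string (steps 320-330).
    for ac, ln in rules:
        if len(raw_pop) >= ac + ln:
            hit = unique_suffix_match(raw_pop[-(ac + ln):], numbers)
            if hit is not None:
                return hit
    # Stage 2: user area code + rightmost LN digits (steps 335-350).
    for ac, ln in rules:
        if len(user_area) == ac and len(raw_pop) >= ln:  # assumption, see lead-in
            hit = unique_suffix_match(user_area + raw_pop[-ln:], numbers)
            if hit is not None:
                return hit
    return None


book = [("1", "8005551212"), ("1", "9631234567"),
        ("49", "03554968485"), ("49", "03686464800")]
dialed_with_area = optimistic_stages_1_and_2("18005551212", "1", "800", book, [(3, 7)])
dialed_local_only = optimistic_stages_1_and_2("5551212", "1", "800", book, [(3, 7)])
```

In the second call the area code was not dialed, so the first stage cannot match and the identification comes from the user-configured area code in the second stage.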
Note that the second stage of the optimistic lookup search will produce an exact, unique match in the case where the user is not required to dial the area code in order to connect to the POP, whereas the first stage will fail to obtain a match in this case, since the area code is not reflected in the raw POP string. Since requirements to dial the area code presently are not universal, in practice, the second stage of the optimistic lookup search may correctly determine the identity of the POP associated with the data record in a great number of instances where the first stage does not. Further, if the local number portions of the POP numbers in the lookup phone book are not unique, and the user did not dial the area code, the method of the present invention can rely on the user-configured area code to uniquely match the raw POP string to one of the pre-stored POP numbers in accordance with the second stage of the optimistic lookup search.
If the second stage of the optimistic lookup search fails to produce an exact, unique match between the concatenated digits and the rightmost AC+LN digits of the pre-stored POP numbers of the user's calling country, a third stage of the optimistic lookup search is performed. In the third stage, the area code information is disregarded, and an attempt is made to match only the local number. Specifically, in accordance with the first-listed POP rule (POP rule(1)) of the calling country, the rightmost LN digits of the raw POP string are selected and compared with the rightmost LN digits of each of the POP numbers in the lookup phone book for the country (step 355). If an exact, unique match is found between the rightmost LN digits of the raw POP string and the rightmost LN digits of one of the pre-stored POP numbers, then it is determined in step 360 that the raw POP string contains a valid POP number, namely the valid POP number whose rightmost LN digits matched those of the raw POP string.
If the rightmost LN digits of the raw POP string match the rightmost LN digits of two or more of the pre-stored POP numbers, and if the matching pre-stored POP numbers all correspond to the same service provider, the POP is determined in step 365 to be a valid POP. However, the POP is identified only by the local number (i.e., the rightmost LN digits of the matching pre-stored POP numbers) rather than by a complete, unique POP number having both an area code and a local number. If no match is found between the rightmost LN digits of the raw POP string and the rightmost LN digits of any of the pre-stored POP numbers, then step 355 is repeated for each of the K POP rules in the list until one or more matches is found or step 355 has been performed for all K POP rules for the country (step 370). If, after performing step 355 for all the POP rules of the country, no match has been found, the optimistic lookup search is terminated without declaring a match, and a final pessimistic lookup search is performed.
By performing the optimistic lookup search only after the initial pessimistic lookup search fails, the present invention relies on the user-supplied country code and area code being correct only when an exact, unique match cannot be achieved through a simple comparison of raw POP string digits and digits of a valid POP number in the lookup phone book. This approach has the advantage of avoiding possible inaccuracies of the supplemental country and area code information where possible, while still taking advantage of this supplemental information where necessary.
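By way of non-limiting illustration, the third stage of the optimistic lookup search can be sketched in Python as follows. The names Pop and optimistic_stage_3 and the (identifier, is_complete) return convention are hypothetical. The description does not state what happens when several POP numbers match but belong to different service providers; this sketch assumes the search simply moves on to the next POP rule in that case.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Pop(NamedTuple):
    isp: str
    country_code: str
    area_code: str
    local_number: str


def optimistic_stage_3(raw_pop: str, user_country: str, book: Sequence[Pop],
                       local_lengths: Sequence[int]) -> Optional[Tuple[str, bool]]:
    """Match on the local number only (steps 355-370).

    Returns (identifier, is_complete): the full POP number with True for a
    unique match, or just the local number with False for a partial one.
    """
    candidates = [pop for pop in book if pop.country_code == user_country]
    for ln in local_lengths:  # LN of each POP rule of the country, in order
        if ln <= 0 or len(raw_pop) < ln:
            continue
        tail = raw_pop[-ln:]
        hits = [pop for pop in candidates
                if (pop.area_code + pop.local_number)[-ln:] == tail]
        if len(hits) == 1:
            return hits[0].area_code + hits[0].local_number, True
        if len(hits) > 1 and len({pop.isp for pop in hits}) == 1:
            return tail, False  # same provider: identified by local number only
        # Assumption: no hits, or hits across providers, -> try the next rule.
    return None


book = [Pop("ISP1", "1", "800", "5551212"),
        Pop("ISP1", "1", "212", "5551212"),
        Pop("ISP2", "1", "963", "1234567")]
partial = optimistic_stage_3("5551212", "1", book, [7])
complete = optimistic_stage_3("1234567", "1", book, [7])
```

Returning a flag alongside the identifier is one way to let downstream aggregation distinguish a complete POP number from an identification by local number alone.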
Unlike the initial pessimistic lookup search, which involves a broad search of valid POP numbers while relying on minimum information (i.e., only the raw POP string), the optimistic lookup search involves a much narrower search of valid POP numbers of the calling country using refined search criteria derived from a greater amount of information (i.e., the raw POP string, the country code, the area code, and the service provider). While the optimistic lookup search of the exemplary embodiment relies in a particular manner on the user's country code and, in some cases, on the area code and the service provider identifier, it should be understood that the optimistic lookup search of the present invention encompasses information-assisted searches that involve or are aided by supplemental information in addition to the raw POP string itself. Thus, the optimistic lookup search can involve a subset of the country code, area code and service provider identifier or any combination of these and any other supplemental identifier information that may be useful in correctly determining the identity of the POP. Moreover, the particular stages of the optimistic lookup search and the processing and decisions performed within each of the stages may vary in accordance with the particular supplemental information relied upon and how it is being relied upon.
Referring again to FIG. 3, if the optimistic lookup search fails to successfully identify a valid POP number in the lookup phone book matching the raw POP string, the user-configured information used in the optimistic lookup search is distrusted, and a final "pessimistic" lookup search is performed (step 400).
Like the initial pessimistic lookup search, the final pessimistic lookup search does not rely on supplemental information, such as the user's country or the area code, in attempting to identify a valid POP number that matches the raw POP string; thus, the final pessimistic search is a broad lookup search relying on a minimum amount of information (i.e., only the raw POP string). However, the final pessimistic lookup search is more "lenient" than the initial pessimistic lookup search in that raw POP string matches to multiple pre-stored POP numbers are treated as acceptable matches, and attempts to match the rightmost digits continue with successively fewer digits down to a significantly smaller number of digits before stopping the search. In this manner, at least some information about the POP's identity (albeit potentially imperfect information) may be gleaned from the raw POP string. As shown in FIG. 6, the final pessimistic lookup search begins by selecting the rightmost N digits of the raw POP string, where N is initially set to a value of nine (step 410). The N digits of the raw POP string are compared with the rightmost N digits of each of the POP numbers contained in the lookup phone book for every country (step 420). If an exact, unique match is found between the N digits of the raw POP string and the N digits of one of the POP numbers in the lookup phone book, then it is determined in step 430 that the raw POP string contains a valid POP number, namely the valid POP number whose rightmost digits matched those of the raw POP string. If the rightmost N digits of the raw POP string produce an exact match with the rightmost N digits of more than one of the POP numbers in the lookup phone book, and if all of the matching pre-stored POP numbers correspond to the same country and to the same service provider, it is determined in step 440 that the raw POP string contains a valid POP number. 
However, the POP is identified only by the rightmost N digits of the raw POP string that matched. This essentially amounts to a partial or incomplete validation of the POP number of the uploaded data record. If no matches are found between the rightmost N digits of the raw POP string and the rightmost N digits of any of the POP numbers in the lookup phone book, the value of N is decremented, and the process is repeated, as shown in FIG. 6, until N is decremented below six. Thus, steps 410, 420, 430 and 440 are repeated first by comparing the rightmost eight digits of the raw POP string with the rightmost eight digits of each of the POP numbers contained in the lookup phone book for every country. If the process fails to produce a match with the rightmost eight digits, N is decremented to seven. If no match is found with N equal to seven, N is decremented to six. If no match is found with N equal to six, the raw POP string is determined to be unparsable and the search process is terminated. Note that there is a small chance of a random match if the digits of the raw POP string are not, in fact, a phone number. In the final pessimistic lookup search, as few as six digits may be compared. Given a phone book of 3000 POPs, there exists only a 0.3% chance of a false match with a random 6-digit number, assuming a uniform distribution. Even if the raw POP string and stored valid POP numbers do not have an exactly uniform distribution, it can readily be seen that the likelihood of a false match is quite small. The terms "optimistic" and "pessimistic" refer to the degree to which the search method relies on user-configured information supplied by the end-user's computer, with the "pessimistic" searches essentially assuming that the user-configured information is unreliable and therefore not relying upon this information.
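The final pessimistic lookup search of FIG. 6 (steps 410 through 440) and the false-match estimate given above can be illustrated by the following minimal sketch. It is not the implementation of the exemplary embodiment. The `PopEntry` record layout, the function names, the stripping of non-digit characters, the skipping of values of N longer than the available digits, and the handling of multiple matches that disagree on country or service provider are all assumptions made for illustration; the text above does not specify them.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class PopEntry:
    """One lookup phone book row (hypothetical layout, not from the source)."""
    number: str    # full POP number, digits only
    country: str   # country associated with the POP
    provider: str  # service provider associated with the POP


def final_pessimistic_lookup(
    raw_pop: str,
    phone_book: Sequence[PopEntry],
    start_n: int = 9,
    min_n: int = 6,
) -> Optional[Tuple[str, str]]:
    """Return ("unique", full_number), ("partial", matched_digits), or None."""
    # Assumption: every non-digit character of the raw POP string is ignored.
    raw_digits = re.sub(r"\D", "", raw_pop)
    for n in range(start_n, min_n - 1, -1):  # N = 9, 8, 7, 6
        if len(raw_digits) < n:
            continue  # assumption: too few digits for this N, try a smaller N
        suffix = raw_digits[-n:]  # step 410: rightmost N digits
        # Step 420: compare against the rightmost N digits of every entry.
        matches = [e for e in phone_book if e.number[-n:] == suffix]
        if len(matches) == 1:
            return ("unique", matches[0].number)  # step 430
        if len(matches) > 1:
            if len({(e.country, e.provider) for e in matches}) == 1:
                return ("partial", suffix)  # step 440: partial validation
            # Assumption: conflicting matches are treated as a failure, since
            # a shorter suffix can only match the same entries or more.
            return None
    return None  # no match down to N = 6: unparsable


def false_match_probability(book_size: int, n_digits: int) -> float:
    """Upper bound on a random n-digit string matching some phone book entry,
    assuming uniformly distributed digits."""
    return min(1.0, book_size / 10 ** n_digits)
```

Under the uniform-distribution assumption stated above, `false_match_probability(3000, 6)` evaluates to 0.003, which is the 0.3% figure.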
In the initial pessimistic lookup search, there is no reliance at all on the user-configured information, and the search is based on a longer string of digits (i.e., eight or nine), thereby reducing the chances of an incorrect match. In the optimistic lookup search, user-configured information is assumed to be reliable, and progressively more user-configured information is introduced in each of the three stages (country code, then area code, then service provider if needed) with each stage being more "optimistic" in relying upon increasingly more of the user-configured information. If unsuccessful, the final pessimistic lookup search reverts to the original assumption that the user-configured information is unreliable and again relies only on the raw POP string itself in attempting to find a POP number match. However, the final pessimistic lookup search is more lenient than the initial pessimistic lookup search in that a smaller number of matching digits is considered a valid match and even multiple matches will be considered valid (albeit imperfect) matches. The POP identification technique of the present invention can be used in conjunction with conventional parsing algorithms. For example, the aggregator can be configured to allow an operator to select either the POP identification method of the present invention or a conventional raw POP string parsing algorithm. Further, if the POP identification method of the present invention fails to match the raw POP string to a valid POP number, a conventional parsing algorithm can subsequently be applied to the raw POP string in a further attempt to extract a valid POP number, such as the algorithm described in the aforementioned Chu et al. patent application.
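The ordering of the searches summarized above, with a conventional parsing algorithm applied only if all of them fail, can be sketched as a simple cascade. This is a hedged illustration only. The `identify_pop` function and the `Stage` type are hypothetical names, and each stage is assumed to be a callable that already has access to the lookup phone book and to any user-configured information it needs.

```python
from typing import Callable, Optional, Sequence, Tuple

# A stage is a (name, lookup) pair; lookup returns a POP identifier or None.
# Both the alias and the pairing are assumptions made for this sketch.
Stage = Tuple[str, Callable[[str], Optional[str]]]


def identify_pop(raw_pop: str, stages: Sequence[Stage]) -> Optional[Tuple[str, str]]:
    """Try each lookup stage in order and return (stage_name, pop) for the
    first stage that succeeds, or None if every stage fails."""
    for name, lookup in stages:
        pop = lookup(raw_pop)
        if pop is not None:
            return (name, pop)
    return None


# Stand-in stages for illustration; real stages would consult the phone book.
example_stages: Sequence[Stage] = [
    ("initial pessimistic", lambda raw: None),
    ("optimistic", lambda raw: None),
    ("final pessimistic", lambda raw: raw[-6:] if len(raw) >= 6 else None),
    ("conventional parser", lambda raw: None),
]
```

Returning the name of the stage that succeeded lets a caller distinguish an exact match from the more lenient matches accepted by the final pessimistic search.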
The POP identification technique of the present invention is designed to minimize the probability of matching a raw POP string with an incorrect POP number, while maximizing the probability of matching the raw POP string with the correct POP number when the raw POP string, in fact, represents a valid POP number. In experimental tests, the novel combination of the pessimistic and optimistic search techniques of the present invention achieves a substantially higher matching percentage than conventional parsing algorithms, which are more difficult to maintain, and works well across phone and dialing conventions in most countries in the world. Thus, the POP identification method of the present invention permits substantially more end-user data to be correctly associated and aggregated, thereby yielding more meaningful network monitoring information useful for more accurately assessing network performance, troubleshooting problems within the network system, and planning network development. While the present invention has been described in the context of identifying POP telephone numbers in order to associate, aggregate and organize data records according to common POP numbers, it will be understood that the concept of the present invention applies to any data record identifier that could potentially be useful for categorizing data records or associating data records in a relational database. Thus, for example, where network performance data is being aggregated and reported on a service provider basis (i.e., data records are being separated and sorted based on who the service provider is) to provide an overall comparison of different service providers, the technique of the present invention could be used to validate that the service provider information uploaded with each data record corresponds to a valid, pre-stored service provider identifier.
Likewise, if network performance data is being aggregated on a country-by-country basis, an area-code-by-area-code basis or on an equipment manufacturer basis, the technique of the present invention could be applied to validate that uploaded information corresponds to a valid country, area code or OEM. Note, however, that the present invention is particularly useful for identifying the POP number within a data record, since identification of a POP number is inherently more challenging given that the POP number is contained within a larger character string whose attributes and contents may not be fully known and may vary considerably from data record to data record. Moreover, while POPs are commonly used to connect end users to communication networks, the present invention can be used to associate data records in accordance with any type of network connection node that is identified within the data record. Furthermore, the present invention is also applicable to systems that collect and aggregate data other than network monitoring data and that generate reports and statistical information therefrom, where identification or validation of data record identifiers used to categorize, organize, correlate, associate or group the data is desirable. Having described preferred embodiments of a new and improved method of identifying information within a character string, it is believed that other modifications, variations and changes will be suggested to those skilled in the art in view of the teachings set forth herein. It is therefore to be understood that all such variations, modifications and changes are believed to fall within the scope of the present invention as defined by the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.
