System and method for organizing data
6751610

Abstract

A system and method for organizing raw data from one or more sources. The content of the raw data is converted into an appropriate number system and stored in a format that facilitates the use of efficient mathematical operations. The number system is selected to handle each of the various elements, characters, or other representative indicia found in the raw data. Furthermore, the number system is selected so that the numerical data retains semantic significance with respect to the raw data. Once converted into the numeric format, the data is processed using various techniques to extract the best information from the raw data into a distilled database.

Claims

What is claimed is:

Description

BACKGROUND
TABLE 1
Alphanumeric Base-10 Number Base-40 Number
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
A or a 10 A
B or b 11 B
C or c 12 C
D or d 13 D
E or e 14 E
F or f 15 F
G or g 16 G
H or h 17 H
I or i 18 I
J or j 19 J
K or k 20 K
L or l 21 L
M or m 22 M
N or n 23 N
O or o 24 O
P or p 25 P
Q or q 26 Q
R or r 27 R
S or s 28 S
T or t 29 T
U or u 30 U
V or v 31 V
W or w 32 W
X or x 33 X
Y or y 34 Y
Z or z 35 Z
-- 36 [
-- 37 \
-- 38 ]
-- 39
Representation of raw data 210 in a base-40 format has numerous benefits. One benefit is that raw data 210 may be represented in a numeric fashion, facilitating straightforward mathematical manipulation. Another benefit is that proper selection of both the radix and the numerals in the number system allows the represented content to maintain semantic significance, facilitating recognition of the content of raw data 210 in its representation in the numeric format. For example, the word "JOHN" represented by the four alphanumeric characters "J" "O" "H" "N" may be represented in various number systems. One such number system is a base-40 number system. Using Table 1, representing the alphanumeric characters "JOHN" as a base-40 number would result in the "tetradecimal" value `JOHN`, which is equivalent to the decimal value 1,255,103 (19*40^3 + 24*40^2 + 17*40^1 + 23*40^0, where base-40 `J` equals decimal 19, etc.). Note that the base-10 number loses semantic significance from the content of raw data 210 whereas the base-40 number retains semantic significance, as the number `JOHN` is recognizable as the content "JOHN." Semantic significance provides the benefits of a numeric representation while maintaining the ability to convey semantic content.

In some embodiments of the present invention, the selection of a radix and its corresponding number system may depend upon the number of bits used by processor 110. The number of bits used by processor 110 and the radix chosen for the number system define the number of characters that can be represented by a data word in processor 110. This relationship is governed according to the following equation: N = B*ln(2)/ln(R), where N is the number of whole characters (i.e., fractional characters are discarded) represented by a data word of processor 110, B is the number of bits per data word, and R is the selected radix. This relationship limits the number of data elements 420 of raw data 210 that may fit in a data word.
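The conversion and the word-capacity relationship described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the digit symbols for 36 through 38 are taken from Table 1, and because the symbol for digit 39 is not recoverable from the table, the sketch substitutes a placeholder.

```python
import math

# Digits 0-38 follow Table 1. The symbol for digit 39 is not recoverable
# from the table, so "?" is a placeholder chosen for this sketch only.
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]?"
RADIX = len(DIGITS)  # 40


def encode(text: str) -> int:
    """Interpret text as a base-40 numeral (upper and lower case are merged)."""
    value = 0
    for ch in text.upper():
        value = value * RADIX + DIGITS.index(ch)
    return value


def decode(value: int) -> str:
    """Write a non-negative integer back out as a base-40 numeral."""
    if value == 0:
        return DIGITS[0]
    out = []
    while value > 0:
        value, digit = divmod(value, RADIX)
        out.append(DIGITS[digit])
    return "".join(reversed(out))


def chars_per_word(bits: int, radix: int) -> int:
    """N = B*ln(2)/ln(R), with fractional characters discarded."""
    return math.floor(bits * math.log(2) / math.log(radix))
```

With these definitions, `encode("JOHN")` gives the decimal value 1,255,103 quoted in the text, and `decode` turns that integer back into the readable numeral `JOHN`.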
For example, in a 32-bit machine, the maximum number of characters that may fit in a data word using a base-40 number system is six (32*ln(2)/ln(40)=6.013). The maximum number of characters that may fit in a data word using a base-41 number system is only five (32*ln(2)/ln(41)=5.973). Thus, in some embodiments of the present invention, in addition to having a radix sufficiently large to maintain semantic significance, the radix may also be selected to maximize the number of characters represented by a single data word. In the embodiment with raw data comprised of alphanumeric characters, an appropriate radix may range from 36 to 40. This range maintains semantic significance while maximizing the number of characters represented by the 32-bit data word. Other types of raw data and other sizes of data word may dictate other appropriate radix ranges in other embodiments of the present invention. The embodiment of the present invention described above does not distinguish between uppercase and lowercase characters. However, other embodiments of the present invention may distinguish between these types of characters. Accordingly, a base-64 representation ("0"-"9", "A"-"Z", "a"-"z", and two other values) may be appropriate to distinguish between these characters as would be apparent. The number of data elements 420 in each data field 410 also dictates the precision required by the number as represented in processor 110. As described above, each data field 410 may only be six characters or data elements 420 wide for single precision operations in a 32-bit machine. In some embodiments of the present invention, this may be insufficient. In these embodiments, double, triple, or even quadruple precision may be required to represent the entire data field 410 as a single value. 
Double precision numbers are sufficient for up to twelve character data fields 410; triple precision numbers are sufficient for up to eighteen characters; and quadruple precision numbers are sufficient for up to twenty-four characters.

Alternate embodiments of the present invention may accommodate large data fields by breaking a large data field into one or more smaller data fields. The large data fields may be broken at boundaries defined by spaces. For example, a data field representing an address such as "123 West Main Street" may be broken into four smaller data fields: `123`, `West`, `Main`, and `Street`. The large data fields may also be broken at data word boundaries. In the address example above, the smaller data fields might be: `123\We`, `st\Mai`, `n\Stre`, and `et`, where the number `\` is used to represent a space. Other embodiments of the present invention may accommodate large data fields in other manners as would be apparent.

Data Structure Conversion

As illustrated in FIG. 3, in a step 330, raw data 210 represented as a number is stored in a predefined data structure. In one embodiment of the present invention, this data structure is a single-field table as illustrated by Tables 610-670 of FIG. 6. This data structure may vary. For example, in other embodiments of the present invention, the data structure may be a multiple-field table instead of a single-field table. In these embodiments, the data structures may be implemented with standard features such as table headers and indices, and as explained in greater detail below, may also include probability values for each record. These probability values represent the likelihood that the data in that record is complete. Higher probability values may indicate a higher probability of completeness, and lower probability values similarly may indicate a lower probability of completeness. This is described in further detail below. Initially, the probability values are set to 0.
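The two ways of breaking up a large data field described earlier (at spaces, or at data word boundaries) can be sketched as follows. The function names are hypothetical, and the width of six characters assumes single-precision base-40 words on a 32-bit machine.

```python
SPACE_DIGIT = "\\"  # base-40 digit 37 stands in for a space, per the text
WORD_CHARS = 6      # base-40 characters per 32-bit data word


def split_at_spaces(field: str) -> list[str]:
    """Break a field into one smaller field per space-separated token."""
    return field.upper().split()


def split_at_word_boundaries(field: str, width: int = WORD_CHARS) -> list[str]:
    """Replace spaces with the space digit, then cut every `width` characters."""
    packed = field.upper().replace(" ", SPACE_DIGIT)
    return [packed[i:i + width] for i in range(0, len(packed), width)]
```

For "123 West Main Street", the first function yields four fields and the second yields three full six-character words plus a two-character remainder. The sketch keeps the space digit in every chunk where a space occurred, including the first.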
Other embodiments may also include key numbers or identification numbers to aid in sorting and in maintaining relationships among the data records.

In a preferred embodiment of the present invention, raw data 210 illustrated in FIG. 5 includes three tables 510, 520, and 530. Table 510 may represent raw data 210 from, for example, a company's accounts receivable system. Columns of table 510 represent data fields for an account number, a last name, a first initial, and additional fields for listing various orders processed for a particular individual. Rows of table 510 (such as 510-1 and 510-2) represent data records for different individuals. Tables 520 and 530 may represent raw data 210 maintained by credit card companies. Columns of tables 520 and 530 represent data fields for an account number, a last name, a first name, and an address. Rows of tables 520 and 530 represent data records for specific accounts.

In the preferred embodiment, step 330 converts raw data 210 from the format illustrated in FIG. 5 into a format illustrated in FIG. 6. FIG. 6 illustrates raw data 210, combined from the various raw data tables 510, 520, 530 of FIG. 5, represented as numbers in a base-40 number system, and formatted as new tables (tables 610-670), which together may comprise reference database 220. Each reference database table 610-670 corresponds to an individual field from raw data tables 510, 520, and 530 of FIG. 5. More specifically, data records of reference data tables 610-670 correspond to the data records of raw data table 510, followed by the data records of raw data table 520, followed by the data records of raw data table 530. In one embodiment of the present invention, where a raw data table record has no information for a particular data field 410 represented in a reference table 610-670, an empty field value is entered in that field in the reference table.
For example, the first data record 510-1 of Table 510 has no information about an address, and thus an empty field value is placed in the first position of table 670. Data is preferably stored in reference database 220 in such a way that all data corresponding to a single data record in a raw data table is readily identified. In the embodiment represented in FIGS. 5 and 6, for example, data corresponding to any specific data record of the raw data tables (tables 510, 520, 530) is preferably represented in reference tables 610-670 as a "vector" of numeric data stored at an index i across reference tables 610-670. For example, data corresponding to the sixth record 520-6 of raw data table 520 (illustrated as account number "A60" belonging to "Jennifer Brown," residing at "51 Fourth Street") is represented in reference database tables 610-670 as a vector having coefficients formed from the tenth records 610-10, 620-10, 630-10, 640-10, 650-10, 660-10, and 670-10 of the tables 610-670. As illustrated in FIG. 6, reference database 220 includes a new table 610 that does not correspond to any data field 410 in raw data 210 illustrated in FIG. 5. This table is a "key table" that identifies the related data in these data vectors. As described below, reference database 220 comprised of the tables illustrated in FIG. 6 may include additional key tables for data fields. These may include a personal identification number ("PIDN"), an account identification number ("AIDN"), or other types of identification numbers. These key tables or identification numbers may be used to identify sets of related data vectors in reference database 220. In this example, key table 610 has a single field "PIDN," which stands for personal identification number. Key table 610 provides a unique identifier such that a specific PIDN number never refers to more than one person represented in raw data 210. 
In other words, the PIDN number reflects the fact that multiple records in raw data 210 may refer to the same person. Preferably, each data record in the key table 610 initially corresponds to a different data record represented in the raw data tables 510, 520, and 530. For example, in FIG. 6, data record 610-10 in the key table 610 is implemented such that it includes identifiers (such as pointers or indices) for corresponding data in reference tables 620-670, which together correspond to a single record 520-6 in raw data table 520.

Initially, while a single PIDN does not refer to multiple individuals, a single individual may correspond to multiple PIDNs. For example, in FIG. 6, vector 4 (defined by PIDN 4) and vector 9 (defined by PIDN 9) appear to refer to the same person, but as illustrated, this person is initially assigned to two PIDN numbers--PIDN 4 and PIDN 9. As described below, the present invention enables a determination whether PIDN 4 and PIDN 9 do, in fact, refer to the same individual, and if so, assigns a single PIDN to this individual. Alternatively, some embodiments may assign a new PIDN number to individuals so determined and a reference to the old PIDN number may be retained.

As discussed above, in this embodiment, records are represented in the reference database tables 610-670 as vectors having coefficients of base-40 numbers across eight one-field tables. This numeric representation allows the data to be analyzed using straightforward mathematical operations that may be used to, for example, produce correlations, calculate eigenvectors, perform various coordinate transformations, and utilize various pattern recognition analyses. These operations may, in turn, be used to provide or derive information about the records and their relationships to one another. By using small, one-field tables, these operations may be performed quickly.
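The layout described above, with one single-field table per data field, a key table of PIDNs, and a record read back as the vector stored at a common index, can be sketched as follows. The field names, the use of `None` as the empty field value, and the function names are assumptions made for this illustration. Values are kept as strings for readability, whereas the text stores them as base-40 numbers.

```python
EMPTY = None  # stand-in for the "empty field value"; the actual value is not specified

FIELDS = ["last_name", "first_name", "account", "address"]  # hypothetical field names


def build_reference_tables(raw_tables):
    """Concatenate raw records, table by table, into one single-field table per field."""
    tables = {"pidn": []}
    tables.update({name: [] for name in FIELDS})
    for raw in raw_tables:
        for record in raw:
            # Initially every raw record gets its own PIDN (its row number).
            tables["pidn"].append(len(tables["pidn"]) + 1)
            for name in FIELDS:
                tables[name].append(record.get(name, EMPTY))
    return tables


def vector(tables, i):
    """Return the data vector stored at 0-based index i across all tables."""
    return [tables[name][i] for name in ["pidn", *FIELDS]]
```

Because every table has the same length and ordering, any single field can be scanned or sorted on its own while the index still ties each value back to its record.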
In addition to speed, representation in base-40 numbers with raw data 210 including alphanumeric characters allows content of raw data 210 to retain its semantic significance, as will be illustrated.

Data Dialysis

Referring back to FIG. 2, once reference database 220 is created as illustrated in FIG. 6, a data dialysis process 700 is applied to distill the most accurate data for inclusion in distilled database 230. Data dialysis 700 is now described with reference to FIG. 7.

Partitioning the Reference Data

In a step 710, reference database 220 is preferably partitioned or sorted into sets based on some criteria. These sorting criteria may vary. For example, as illustrated in table 810 of FIG. 8, in this embodiment, data records may be sorted into sets based on last name, with the values arranged in increasing numeric order (recall that content of raw data is now represented as base-40 numbers in reference database 220). Table 810 is derived from reference database table 620 illustrated in FIG. 6, with each entry of table 810 defined by a unique last name and having a corresponding set of table 620 records matching that last name. In the representation illustrated, table 810 includes a field for defining the set (in this case, a last name), as well as identifiers for members of the set (such as indices, pointers or other appropriate references--in this case PIDNs).

In some embodiments of the present invention, not all vectors in reference database 220 will have data for the field on which the sets are based. Such vectors may be handled in various manners. For example, all vectors in reference database 220 having no data for that data field may be regarded as members of a single, additional set. Alternatively, each vector in reference database 220 having no data for that data field may be regarded as the single member of its own set.

Identifying Duplicate Data

Returning to FIG. 7, in a step 720, those data records within the partitioned sets identified as duplicates are marked. In some embodiments of the present invention, duplicate data may be unnecessary and may be discarded. In other embodiments, all information remains in reference database 220, as all information, even erroneous, incomplete, or duplicate information, may be better than no information and may be useful for some purpose, such as identifying fraud.

In some embodiments of the present invention, comparing a pair of vectors may identify duplicates. Various operations may be used, as would be apparent. In a simple example, a straightforward vector subtraction may be performed to measure the degree of similarity between two records. Other techniques may be used to identify duplicate vectors such as using "look-up" tables to identify common names, nicknames, abbreviations, etc. Table 810 of FIG. 8 illustrates that the last name "Smith" corresponds to PIDNs 2, 4, 8, 9, and 11, representing vectors formed from entries 2, 4, 8, 9, and 11 of the reference database tables 610-670 illustrated in FIG. 6:
For PIDN 2: [SMITH, J, 98-002, A40, A60, ]
For PIDN 4: [SMITH, J, 98-004, A50, B10, ]
For PIDN 8: [SMITH, Jennifer, , A40, , 300 Pine St.]
For PIDN 9: [SMITH, John, , A50, , 37 Hunt Dr.]
For PIDN 11: [SMITH, Jhon, , B10, , 85 Belmont Ave. ]
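The partitioning of step 710 that produces a set such as the one listed above can be sketched as a grouping of PIDNs by the base-40 value of one field. The helper names are hypothetical, and the small `base40` encoder handles only digits and letters so that the sketch stays short.

```python
from collections import defaultdict


def base40(text):
    """Encode digits and letters as a base-40 integer (Table 1 values 0-35 only)."""
    digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    value = 0
    for ch in text.upper():
        value = value * 40 + digits.index(ch)
    return value


def partition_by_field(pidns, field_values, empty=None):
    """Group PIDNs by field value.

    Returns (sets, unassigned): `sets` maps each distinct value to its PIDNs,
    with keys in increasing numeric order; `unassigned` lists the PIDNs that
    have no data for the field.
    """
    groups = defaultdict(list)
    unassigned = []
    for pidn, value in zip(pidns, field_values):
        if value == empty:
            unassigned.append(pidn)
        else:
            groups[value].append(pidn)
    return dict(sorted(groups.items())), unassigned
```

Note that numeric order differs from alphabetical order: a shorter name is a smaller base-40 number, so `LEE` sorts before `ZANE`, which sorts before `BROWN` and `SMITH`.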
Vector (or matrix) operations comparing the vectors and thresholds for determining when two entries are similar enough to be regarded as duplicates may be defined as appropriate for various embodiments. In a simple example, the sum of the absolute differences between corresponding coefficients of a pair of vectors may indicate a similarity between the corresponding pair of records. This pair of vectors may be considered duplicates if a first vector is not inconsistent with any field of a second vector, and does not provide any additional data. In this embodiment, additional rules would also be defined, for example, for comparing entries of different lengths (e.g., right aligning character strings corresponding to numbers, and left aligning character strings corresponding to letters), for recognizing common misspellings or spelling variations of words, and for recognizing transposed letters in words. This processing may be performed by various mechanisms, as would be apparent. In the example of Table 810 of FIG. 8, none of the data records are exact duplicates, and so none are marked in step 720.

Correlating Data

Referring back to FIG. 7, in a step 730, the preferred embodiment of the present invention correlates data records remaining within each set and in a step 740, further partitions the data records into independent subsets of data records. In general, the "correlation" between two vectors is a measurement of how closely one is related to the other, and specific methods of correlation may vary depending on the intended application. A general discussion and examples of correlation functions may be found in references such as NUMERICAL RECIPES IN C: THE ART OF SCIENTIFIC COMPUTING (Cambridge University Press, 2nd ed. 1992) by William H. Press, et al. Other techniques and examples may be found in THE ART OF COMPUTER PROGRAMMING (Addison-Wesley Pub., 1998) by Donald E. Knuth.
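The duplicate test of step 720 described above can be sketched as follows, assuming each vector is a list of integers with `None` for an empty field. Both function names are hypothetical, and the alignment and misspelling rules mentioned in the text are omitted.

```python
EMPTY = None  # assumed empty field value


def abs_difference(a, b):
    """Sum of absolute differences over the fields populated in both vectors."""
    return sum(
        abs(x - y)
        for x, y in zip(a, b)
        if x is not EMPTY and y is not EMPTY
    )


def is_duplicate_of(first, second):
    """True if `first` never conflicts with `second` and adds no data of its own."""
    for x, y in zip(first, second):
        if x is EMPTY:
            continue  # nothing to compare, nothing added
        if y is EMPTY or x != y:
            return False  # x is additional data, or x conflicts with y
    return True
```

The test is deliberately asymmetric: a sparse record can duplicate a fuller one, but the fuller record carries additional data and so is not a duplicate of the sparse one.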
As an example, a simple measurement of the correlation between vectors is their dot product, which may be weighted as appropriate. Depending on the application, the dot product may be calculated on only a subset of the vector coefficients, or may be defined to compare not only corresponding coefficients, but also other pairs of coefficients determined to be in related fields (e.g., comparing a "first name" coefficient of a first vector with a "middle name" coefficient of a second vector). As with the operations for identifying duplicate data, the correlation function may be appropriately tailored for its intended application. For example, a correlation function may be defined to appropriately compare entries of different lengths and to appropriately distinguish between significant and insignificant differences, as would be apparent.

In the embodiment explained with reference to the tables of FIGS. 5, 6, and 8, an example of a correlation function compares vectors corresponding to the members of a set sharing the same last name to identify independent subsets of vectors. Again, this determination may be based on application-specific criteria. In this example, independent vectors may be defined to be those vectors representing different individuals. As a result of applying the correlation function, a correlation parameter reflecting the degree of independence of a pair of vectors is assigned. For example, a high value may be assigned to indicate a high degree of similarity, and a low value may be assigned to indicate a limited degree of similarity. The correlation value is then compared to a predetermined threshold value--which again, may vary in different applications--to determine whether the two records corresponding to those vectors are considered to be independent.

Based on the correlation values, in a step 740, the preferred embodiment partitions the data records into subsets of independent data records within each set. In the examples of FIGS. 5, 6, and Table 810 of FIG. 8, members of an independent subset may be identified as those members having: the same last name (taking into consideration misspellings and spelling variations); relatively similar first names (taking into consideration misspellings, spelling variations, nicknames, and combinations of first and middle names and initials); one or more matching account numbers; and no more than three addresses (to allow for work and home addresses, and one change of address). Results of applying such a function are illustrated in Table 820 of FIG. 8. The individuals identified are:
Jennifer Brown, PIDN 10;
Howard Lee, PIDNs 3 and 6;
Carole Lee, PIDN 7;
Jennifer Smith, PIDNs 2 and 8;
John Smith, PIDNs 4 and 11;
John Smith, PIDN 9;
Ann Zane, PIDNs 1, 5, and 12; and
Molly Zane, PIDN 13.
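Steps 730 and 740 can be sketched as a pairwise score compared against a threshold, followed by a merge of every pair that passes. The scoring rule below (one point for compatible first names, one point for a shared account identifier) is a simplified assumption for illustration and is not the function behind Table 820. The data is read from the SMITH vectors listed earlier, treating the third through fifth coefficients as account identifiers.

```python
from itertools import combinations

# PIDN -> (first name, set of account identifiers), read from the SMITH vectors.
SMITHS = {
    2: ("J", {"98-002", "A40", "A60"}),
    4: ("J", {"98-004", "A50", "B10"}),
    8: ("JENNIFER", {"A40"}),
    9: ("JOHN", {"A50"}),
    11: ("JHON", {"B10"}),
}


def correlation(a, b):
    """Hypothetical score: 1 for compatible first names plus 1 for a shared account."""
    (name_a, accts_a), (name_b, accts_b) = a, b
    names_ok = name_a.startswith(name_b) or name_b.startswith(name_a)
    return int(names_ok) + int(bool(accts_a & accts_b))


def related_subsets(records, threshold=2):
    """Merge records whose pairwise score reaches the threshold (union-find)."""
    parent = {pidn: pidn for pidn in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for p, q in combinations(records, 2):
        if correlation(records[p], records[q]) >= threshold:
            parent[find(q)] = find(p)

    subsets = {}
    for pidn in records:
        subsets.setdefault(find(pidn), []).append(pidn)
    return sorted(subsets.values())
```

Under this simplified rule, `related_subsets(SMITHS)` groups PIDNs 2 and 8 together and PIDNs 4, 9, and 11 together, because PIDN 9 shares account A50 with PIDN 4. Table 820 lists PIDN 9 separately, so the function used there must apply criteria beyond this sketch.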
Other operations for correlating the vectors are available. These may include computing dot products, cross products, lengths, direction vectors, and a plethora of other functions and algorithms used for evaluation according to well-known techniques.

FIG. 9 illustrates a two-dimensional example of a concept referred to as clustering which is used conceptually to describe some general aspects of the present invention. In FIG. 9, four clusters exist as a collection of two-dimensional points. These clusters are identified as: (a,b), (c,d), (e,f), and (g,h). As illustrated, each cluster is formed from one or more points in the two-dimensional space. Each point corresponds to a data record that represents (with more or less accuracy) the "true" value of the cluster in the space.

As illustrated, clusters (a,b) and (c,d) are fairly easy to distinguish from one another and from clusters (e,f) and (g,h). However, in this simple example, clusters (e,f) and (g,h) are not easily distinguished from one another. Extending the space (i.e., adding additional data fields to the vectors) may increase the separation between clusters such as (e,f) and (g,h) so that they become more readily distinguished from one another. Alternatively, extending the space may indicate that (g,h) is a point that belongs to cluster (e,f) or even cluster (c,d).

In the abstract, the space may be extended infinitely, resulting in a Hilbert space, which has various well-known characteristics. These characteristics may be exploited by the present invention for large, albeit not infinite, vectors as would be apparent.

Furthermore, while adding additional data fields to the vectors (i.e., extending the space) may separate clusters from one another to aid in their correlation, deleting data fields from the vectors (i.e., reducing the space) may also identify some correlations.
In some embodiments of the present invention, reducing the space may identify certain clusters that are in fact representing the same individual or other unique entity. For example, one record in a database may have ten data fields exactly identical to the same ten data fields in a second record in the database. These data fields may correspond to a first name, a birth date, an address, a mother's maiden name, etc. However, these two records may have two fields that are different. These two fields may correspond to a last name and a social security number. In some cases, these records may correspond to the same individual. The present invention simplifies the process for identifying these types of records that would be difficult, if not impossible, to detect using conventional methods.

Thus, removing one or more particular data fields from a vector and reducing the corresponding space may reveal clusters that otherwise would not be apparent. Doing this for data fields traditionally used for identification purposes (e.g., last name, social security number, etc.) may reveal duplicate records in databases. This may be particularly useful for identifying fraud. Removing data fields where a vector includes an empty field value for that data field may also reveal clusters that would not otherwise be apparent. Furthermore, once the clusters are identified as representing the same individual or entity, the best information for the individual or entity may be extracted from the information provided by each record or "black dot."

The principles of the present invention may be extended beyond simple vectors and data fields. For example, the present invention may be extended through the use of tensors representing objects in a multi-dimensional space. In this manner, the present invention may be used to represent the parameters of various physical phenomena to gain additional insight into their operation and effect.
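The idea of reducing the space described above can be sketched as comparing two records with selected fields removed. The record layout (a dictionary per record), the field names, and the function name are assumptions for this illustration. Fields that are empty in either record are also skipped, as the text suggests.

```python
def same_cluster_when_reduced(a, b, ignore=()):
    """True if the records agree on every shared, populated field not in `ignore`."""
    keys = [
        k for k in a
        if k in b and k not in ignore and a[k] is not None and b[k] is not None
    ]
    # Require at least one comparable field so that empty records never match.
    return bool(keys) and all(a[k] == b[k] for k in keys)


# Hypothetical records that differ only in the usual identifying fields.
RECORD_1 = {"last_name": "SMITH", "ssn": "123456789", "first_name": "JOHN",
            "birth_date": "19600101", "address": "37 HUNT DR"}
RECORD_2 = {"last_name": "SMYTHE", "ssn": "123546789", "first_name": "JOHN",
            "birth_date": "19600101", "address": "37 HUNT DR"}
```

Compared on all fields, `RECORD_1` and `RECORD_2` look like different people. With `ignore={"last_name", "ssn"}` they coincide, which is the kind of match the text associates with fraud detection.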
Such a tensor-based application may be particularly useful for deciphering the human genome and may aid the efforts of programs such as the Human Genome Project.

Handling Stranded Data

Referring again to FIG. 7, in a step 750, the preferred embodiment of the present invention evaluates "stranded" data records. Stranded data records are those records from reference database 220 that were not partitioned into any set in step 710. In some embodiments, reference database 220 may include a large number of tables corresponding to data fields and a large number of vectors having data for various combinations of fields. For example, in an embodiment having a reference database 220 including 20 tables for different data fields and 1000 vectors defined by related data records for each table, suppose only 800 of those 1000 vectors have data for the field "last name," by which the sets were created in step 710. Step 710 may not partition those 200 vectors with no "last name" data into any set, or may partition each of those 200 vectors into its own set. In either case, the result is that those 200 vectors are not correlated with any others in steps 720, 730, and 740.

Step 750 may evaluate those vectors. Methods of evaluation may vary. For example, one embodiment may correlate each stranded entry with one member of each subset identified in step 740. Depending on the resulting correlation values, that vector may be added to the subset with which it is most highly correlated, or may define a new subset. Alternatively, in some embodiments, it may be determined that such evaluation is too time-consuming and step 750 may be completely skipped.

Repeating the Correlation Process

Steps 710-750 may be repeated as needed for specific embodiments. As noted above, some embodiments will have reference database 220 having a large number of fields and a large number of entries, with many entries having data for only a subset of fields.
In such a case, performing steps 710-750 on a single field is unlikely to derive all relevant information. Even in the simple example explained with reference to FIGS. 5, 6, and 8, correlating on the single field "last name" may provide only partial information about the correlation between those entries. For example, Jennifer Smith, corresponding to PIDNs 2 and 8 in FIG. 6, may be the same individual as Jennifer Brown, corresponding to PIDN 10, because PIDNs 2 and 10 may share a common account number. Performing the correlation on the last name field may not identify these PIDNs as corresponding to the same individual because they were evaluated only against other PIDNs sharing the same last name. Performing a correlation on the account number field may provide additional information about whether these PIDNs are related. Thus, correlation across various data fields may be necessary to fully evaluate the degree of relatedness of the data in reference database 220.

Using Correlation Results to Update Reference Data

Once steps 710-760 are completed, reference database 220 has been distilled into a distilled database 230, as illustrated in FIG. 2. In some embodiments of the present invention, these two databases are handled separately and coexist with one another. In other embodiments of the present invention, a single database exists with records marked or otherwise identified as belonging to reference database 220 or distilled database 230. This may be accomplished by using different ranges of PIDNs for the records in the two databases. Furthermore, relationships between records in the two databases may be maintained by adding a constant value to the PIDN for the record in reference database 220 to generate a PIDN for the record in distilled database 230. For example, a record with a PIDN of 12345 in reference database 220 may have a PIDN of 9012345 in distilled database 230.
In this manner, the two databases may be treated as distinct portions of a single database.

Using the Distilled Data

Once data dialysis process 700 is complete, distilled database 230 identifies subsets of data records from the reference database 220 as related records, and as noted above, probabilities may be determined for fields in the reference database 220 to provide a qualitative measure of their completeness. This may be accomplished by assigning a probability of completeness to each of the individual data fields and then using them to compute an overall probability of completeness for the data record. For example, for a data field representing a first name, a value of `J` may be assigned a low probability (e.g., 0 or 0.1), a value of `JOHN` may be assigned a higher probability (e.g., 0.7 or 0.8), and a value of `JONATHAN` may be assigned the highest probability (e.g., 0.9 or 1.0). These values may be assigned somewhat arbitrarily. However, these values help identify which data fields in the set are most likely to include the most complete information or in other words, the most probable data.

Use of the present invention may determine a significant amount of information about the records and their relationship to each other, and may be specifically tailored for particular applications. Furthermore, using standard database operations, distilled database 230 (which references records of the reference database 220) may be manipulated to provide formatted reports as needed. For example, an embodiment may be tailored to generate a report listing subsets of related records, with records of a subset providing information about a specific individual or entity.
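The completeness probabilities described above can be sketched as a per-field lookup combined into a per-record value. The text does not say how field probabilities are combined, so the arithmetic mean used here is an assumption, as are the function names and the specific table entries (chosen from the example ranges given in the text).

```python
# Hypothetical per-value probabilities for a first-name field, picked from
# the example ranges in the text.
FIRST_NAME_COMPLETENESS = {"J": 0.1, "JOHN": 0.8, "JONATHAN": 1.0}


def field_completeness(value, table, default=0.0):
    """Probability that a field value is complete; empty or unknown values get `default`."""
    if value is None:
        return default
    return table.get(value, default)


def record_completeness(field_probabilities):
    """Overall completeness of a record as the mean of its field values (an assumption)."""
    return sum(field_probabilities) / len(field_probabilities)


def most_complete(values, table):
    """Pick the value in a subset of related records most likely to be complete."""
    return max(values, key=lambda v: field_completeness(v, table))
```

Applied to the first names found in one subset of related records, `most_complete` selects the value to carry into the distilled record, for example `JONATHAN` over `JOHN` or `J`.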
The records within a subset of related records may provide information, for example, about different fields of information; aliases and/or variations of names, addresses, social security numbers, etc., used by the individual; and fields--such as occupation, address, and account numbers--for which that individual may have more than one entry. Recalling that all data is represented in numerical base-40 format, the subsets may be ordered numerically in the report. The base-40 format provides the additional advantage of representing alphabetical characters as their respective letters (as illustrated in the conversion table above). Thus, while the report will show entries in numerical representation, that representation retains the semantic significance of the data it represents, allowing the data to be manually read and analyzed. For example, if the report shows records for an individual having entries for names including J SMITH, JOHN SMITH, JOHN G SMITH, G SMITH, and GERALD SMITH, a person reading that report would understand that this individual uses various first names, including his first name or initial, his middle name or initial, or some combination thereof.

Adding New Data

As with conventional database applications, new data may be added from time to time. As illustrated in FIG. 2, the present invention accounts for adding new (or changed) data 240, which will affect reference database 220 and distilled database 230. Generally, new data records 240 may be formatted as described with reference to FIG. 3, and entered into the existing reference database 220. Additionally, new data records 240 may be measured against distilled database 230 to determine if new information or content is available in new data record 240. For example, a new data record 240 may be correlated with data records from distilled database 230 to determine whether that new data record 240 is related to any data records already present in distilled database 230.
If so, and new data record 240 contains information or content not already present in distilled database 230, new data record 240 may be used to update distilled database 230. For example, if new data record 240 included information for an individual named John Smith that corresponds to data records already present in distilled database 230 but provided the additional information that Mr. Smith's middle name was Greg, that additional information may be appropriately added to distilled database 230.

Changes to data records in reference database 220 and distilled database 230 may be handled using standard database protection operations, as described in references such as C. J. DATE, INTRODUCTION TO DATABASE SYSTEMS (Addison Wesley, 6th ed. 1994) (see specifically, Part IV), referenced above. For example, in the case that changes are made to reference database 220 by an authorized database administrator, related data records in reference database 220 are updated as determined by standard relational definitions and where appropriate, in accordance with relations defined in distilled database 230.

Various embodiments of the present invention may be used for many different applications, some of which have been described and/or alluded to above. For example, in the application described above, the invention may be used to combine billing information collected from multiple sources to derive a distilled database in which related data records are recognized and duplicate and erroneous data records are eliminated. As suggested, this may be particularly useful in cases, for example, involving fraud. Typically, persons committing credit card fraud or other forms of retail fraud make minor changes to certain pieces of their personal information while leaving the majority of it the same. For example, oftentimes, digits in a social security number may be transposed or an alias may be used.
Often, however, other information such as the person's address, date of birth, mother's maiden name, etc., is used identically. These types of fraud are readily identified by the present invention, even though they are difficult to identify by human analysis.

Other possible applications include uses in telemarketing, to compile a list of targeted individuals or addresses, or in mail-order catalogs, to reduce the number of catalogs sent to the same individual or family. Still another potential application is in the medical research or diagnostics fields, in which nucleotide sequences of Adenine (A), Guanine (G), Cytosine (C), and Thymine (T) in nucleic acids may be identified.

In other embodiments, the present invention may be used as a gatekeeper for a particular database at the outset to maintain integrity of the database from the very beginning, rather than achieving integrity in the database at a later date. In these embodiments, no raw data 210 is present and only new data 240 exists. Before new data 240 is added to the database, it is measured against distilled database 230 to determine whether new data 240 includes additional information or content. If so, only that new information or content is added to distilled database 230 by updating an existing record in distilled database 230 to reflect the new information or content as would be apparent.

While this invention has been described in a preferred embodiment, other embodiments and variations are within the scope of the following claims. For example, formatting process 300 may format data using different radices or other character sets, and may use various data structures. The data structures may represent multiple fields, and depending on the application, will represent a variety of fields. For example, in a credit application, fields may include an account status, an account number, and a legal status, in addition to personal information about the account holder.
In a medical diagnostic application, fields may include various alleles or other genetic characteristics detected in tissue samples.
