Database analysis using a probabilistic ontology
6094650

Abstract

A method and system for efficiently analyzing databases. In one embodiment, the invention is used to analyze data represented in the form of attribute-value (a-v) pairs. A primary step in building the ontology is to identify parent, child and related a-v pairs of each given a-v pair in the database. A parent is an a-v pair that is always present whenever a given a-v pair is present. A child is an a-v pair that is never present unless the given a-v pair is present. Related pairs of a given a-v pair are those a-v pairs present some of the time when a given a-v pair is present. The system calculates relationships between a-v pairs to produce tables of a-v pairs presented according to the relationships. The user performs additional analysis by investigating the a-v pair relationships through a graphical user interface. Additional visualizations of the data are possible such as through Venn diagrams and animations. Plain-text data documents collected, for example, from the Internet can be analyzed. In this case, the system pre-processes the text data to build a-v pairs based on sentence syntax.

Claims

What is claimed is:

Description

BACKGROUND OF THE INVENTION
TABLE I
______________________________________
Family Genus Species
______________________________________
1 Micrococcaceae Staphylococcus Aureus
2 Micrococcaceae Staphylococcus Saprophyticus
3 Micrococcaceae Staphylococcus Epidermidis
4 Micrococcaceae Micrococcus Luteus
______________________________________
In Table I, the columns labelled "Family," "Genus" and "Species" are attributes. Each horizontal row is an entry in the database. Each entry has a value for each defined attribute, as shown in the corresponding column. Thus, entry 1 has a-v pairs as follows: (Family, Micrococcaceae), (Genus, Staphylococcus) and (Species, aureus). These can be abbreviated as (F, M), (G, S) and (S, A).

A feature of the present invention is the ability to "work backward" to determine classification schemes based on database entries such as those shown in Table I. The system defines "parent" and "child" a-v pairs in relation to other a-v pairs. A first a-v pair is a parent of a second a-v pair if the first a-v pair occurs in every entry in which the second a-v pair occurs. Thus, (F, M) is a parent to (G, S), (G, M), (S, A), (S, S), (S, E) and (S, L). A first a-v pair is a child of a second a-v pair if the first a-v pair never occurs in an entry unless the second a-v pair is also in the entry. From Table I, (G, S) is a child of (F, M) and (S, A) is a child of both (G, S) and (F, M).

By starting with a database, such as the database represented in FIG. 4A, an analysis can be performed on the a-v pairs to determine all of the parent and child relationships. By considering parent pairs as classes of the pairs that are their children, the classification hierarchy shown in FIG. 4A is achieved.

A characteristic of the data shown in FIG. 4A and Table I is that each a-v pair has, at most, one immediate parent a-v pair. It is easy to imagine databases where more than one parent exists for a given a-v pair. An example of such a database is shown in FIG. 4B, where the item CAR has more than one parent. Note that, in FIG. 4B, the items are treated as values of attributes. The attributes themselves are not named but can be assigned as "quality" for the top row, "object" for the second row and "manufacturer" for the bottom row.
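The strict parent and child definitions given above for Table I can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of the invention: the function names (index_pairs, parents_of, children_of) are hypothetical, and it assumes that each entry is held as a dictionary mapping attribute to value and that a pair is not counted as its own parent or child.

```python
from collections import defaultdict

# The four entries of Table I, one dictionary per row.
ENTRIES = [
    {"Family": "Micrococcaceae", "Genus": "Staphylococcus", "Species": "Aureus"},
    {"Family": "Micrococcaceae", "Genus": "Staphylococcus", "Species": "Saprophyticus"},
    {"Family": "Micrococcaceae", "Genus": "Staphylococcus", "Species": "Epidermidis"},
    {"Family": "Micrococcaceae", "Genus": "Micrococcus", "Species": "Luteus"},
]


def index_pairs(entries):
    """Map each a-v pair to the set of entry indices in which it occurs."""
    occurrences = defaultdict(set)
    for i, entry in enumerate(entries):
        for pair in entry.items():  # pair is an (attribute, value) tuple
            occurrences[pair].add(i)
    return dict(occurrences)


def parents_of(pair, occurrences):
    """Pairs present in every entry in which `pair` is present."""
    own = occurrences[pair]
    return {p for p, ids in occurrences.items() if p != pair and own <= ids}


def children_of(pair, occurrences):
    """Pairs that never occur unless `pair` is also present."""
    own = occurrences[pair]
    return {c for c, ids in occurrences.items() if c != pair and ids <= own}


occ = index_pairs(ENTRIES)
print(parents_of(("Genus", "Staphylococcus"), occ))
print(children_of(("Family", "Micrococcaceae"), occ))
```

Representing each pair by the set of entries containing it reduces both tests to a subset comparison. Under this assumption, two pairs that occur in exactly the same entries, such as (Genus, Micrococcus) and (Species, Luteus) in Table I, are each reported as both parent and child of the other.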
In this case, the a-v pair (object, car) has (quality, transport vehicle) and (quality, collector's items) as its two parent nodes. Such an organization of data, where an item can have more than one parent, is referred to as an "ontology."

A generalization of the ontology organization is to allow probabilistic relationships between the a-v pairs. So far, parent and child relationships have been treated as absolute. However, in any database, especially a large one, there are likely to be errors in the data. Also, characteristics and trends of interest will likely show up as statistical occurrences at rates below 100%. The ontology described so far cannot accommodate such rates of occurrence. The present invention solves this problem by creating a probabilistic ontology in which statistics on rates of occurrence of parent and child relationships are computed and compiled for use in analysis.

A preferred embodiment of the present invention is referred to as the "High-Performance Ontology Builder and Browser" (HOBB). HOBB not only generates an ontology but also allows the user to "browse" attribute-value pairs that intersect in terms of common occurrences in database entries but that are not in strict parent/child relationships to each other. In other words, their parent/child relationships need not occur at 100%.

FIG. 5 is an example of the Table Display of HOBB. The values displayed are part of an analysis of a film database from the Human-Computer Interaction Group at UMD. The database provided about 1750 entries, or records, of 9 attribute-value (a-v) pairs each. Each database entry includes an a-v pair for each of film title, subject, length, actor, actress, director, popularity, awards and year of production. In FIG. 5, HOBB is analyzing film data and is presenting the results of calculating parent/child relationships to the user. Center column 304 includes, at the top, the a-v pair of interest at 306. This is listed as "Actor,Martin Steve".
To the right of the a-v pair of interest, at 308, is the number of times that the pair occurred in the database, which in this case is nine. Left column 302 lists parents of "Actor,Martin Steve". As expected, "Subject,Comedy" and "Awards,No" have been detected as parents. Even though the matrix of data is fairly populated in this example, the same sort of relationships would have been detected even in a sparse matrix of data.

Right column 310 lists children of "Actor,Martin Steve". Here, "Director,Reiner Carl" is a child: every time "Director,Reiner Carl" appeared in the database, "Actor,Martin Steve" was also there.

The entries in center column 304 below "Actor,Martin Steve", the a-v pair of interest, are a-v pairs that co-occurred in the database with "Actor,Martin Steve", ranked in order of highest frequency. The frequency of occurrence as a percentage is listed to the left of each pair. For example, one item in center column 304 is "Year,1987". This indicates that the year 1987 appeared with 22.2% (i.e. 2 of the 9) of the films in the database where "Actor,Martin Steve" appeared. We also see that "Length,60" co-occurred with "Actor,Martin Steve" 22.2% of the time. These related pairs are neither parents nor children of the a-v pair of interest, but may provide insight into the data because of their rather large "overlap" of occurrence with the a-v pair of interest. In this case, the two movies with the curious titles "Steve Martin The Funnier Side of Eastern Canada" and "Steve Martin Live" were both exactly 60 minutes long, and thus were probably TV specials. By displaying this portion of the probabilistic ontology, the system of the present invention allows a user to quickly make inferences and form theories about relationships between the data.

Refinements to the user interface are possible. For example, the system can allow the user to specify a cut-off threshold below which related pairs will not be displayed. In FIG.
5, if the cut-off were set to "above 15%", those pairs below "Director,Reiner Carl" would not be displayed. Also, thresholds can be applied to the parent and child criteria so that 100% co-occurrence is not required to place a pair into the parent or child column for a given a-v pair.

In the preferred embodiment, all of the a-v parent, child and co-occurrence relationships are pre-computed. This allows instant display in response to user interrogations into the a-v relationships. For example, a user can mouse-click on "Director,Reiner Carl" in either the middle or the right column to make "Director,Reiner Carl" the a-v pair of interest. "Director,Reiner Carl" will then be displayed at 306 and the display will update to show all a-v pairs related to "Director,Reiner Carl". This "browsing" feature of HOBB is very useful to the researcher in discovering relationships. The browsing feature is all the more useful because, owing to pre-computing, the display updates instantaneously when a new a-v pair of interest is selected. This allows a user to maintain concentration, be more efficient, and investigate a large number of possible relationships.

Another advantage of pre-computing the a-v relationships is that there is no need to keep the original database with the relationships database. The relationships database may be much smaller than the original database. For example, where only a few attributes from each entry are of interest, the entire entry need not be analyzed and the resulting relationships database can be smaller than the original database. Also, there may be security issues in copying the original database in its original form. Once the relationships database is created, it can be analyzed separately from any hardware and software necessary to support the original database.

Yet another implementation of the invention uses existing database programs to examine the ontology.
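The three-column table display, the cut-off for related pairs and the pre-computation described above can be sketched as follows. This is a minimal sketch under stated assumptions, not HOBB itself: the names build_index and table_display, the parameters cutoff_pct and parent_threshold, and the five film records are all hypothetical and serve only to illustrate the calculation.

```python
from collections import defaultdict

# Hypothetical toy records, not taken from the film database of FIG. 5.
FILMS = [
    {"Actor": "A", "Subject": "Comedy", "Year": "1987"},
    {"Actor": "A", "Subject": "Comedy", "Year": "1987"},
    {"Actor": "A", "Subject": "Comedy", "Year": "1990"},
    {"Actor": "B", "Subject": "Drama", "Year": "1990"},
    {"Actor": "B", "Subject": "Comedy", "Year": "1991"},
]


def build_index(entries):
    """Map each a-v pair to the set of entry indices in which it occurs."""
    occurrences = defaultdict(set)
    for i, entry in enumerate(entries):
        for pair in entry.items():
            occurrences[pair].add(i)
    return dict(occurrences)


def table_display(pair, occurrences, cutoff_pct=0.0, parent_threshold=1.0):
    """Return (parents, related, children) for the a-v pair of interest.

    `related` holds (percentage, pair) tuples, highest percentage first.
    A parent_threshold below 1.0 relaxes the 100% requirement.
    """
    own = occurrences[pair]
    parents, related, children = [], [], []
    for other, ids in occurrences.items():
        if other == pair:
            continue
        both = len(own & ids)
        if both == 0:
            continue  # the two pairs never co-occur
        share_of_pair = both / len(own)   # how often `other` accompanies `pair`
        share_of_other = both / len(ids)  # how often `pair` accompanies `other`
        if share_of_pair >= parent_threshold:
            parents.append(other)         # checked first, so ties go here
        elif share_of_other >= parent_threshold:
            children.append(other)
        elif 100 * share_of_pair > cutoff_pct:
            related.append((round(100 * share_of_pair, 1), other))
    related.sort(key=lambda item: item[0], reverse=True)
    return parents, related, children


occ = build_index(FILMS)
# Pre-compute every display once so that browsing is a dictionary lookup.
precomputed = {pair: table_display(pair, occ) for pair in occ}
print(precomputed[("Actor", "A")])
```

Because every display is computed ahead of time, selecting a new pair of interest needs no access to the original records, which matches the point above that the relationships database can be kept apart from the original database.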
Once an ontology database of a-v pairs and their parent, child and co-occurrence relationships is created, the ontology database can simply be fed as data to an off-the-shelf application program such as Microsoft Excel or Microsoft Access. The user can operate these programs using the traditional controls provided by the third-party manufacturer, or the user can design a customized front-end to approximate the functions of the HOBB program presented herein. This allows the system of the present invention to be adapted to small computers, such as personal computers, with a minimum of effort.

An example of an application where HOBB can assist a database researcher is where an economist has a database in which each quarter is an entry, and within these entries are a-v pairs that keep track of Gross Domestic Product (GDP) growth, exports, market movements, bank lending, etc., with all applicable leads and lags in time. Using the system of the present invention, (Exports, High) can be selected as the a-v pair of interest. This might show that every time (GDP Growth, High) occurs, (Exports, High) occurs two quarters later. In other words, (GDP Growth, High) is a parent to (Exports, High). Also, the same screen might show that every time (Exports, High) occurs, (Consumer Confidence, Low) occurs, meaning that (Exports, High) is a parent to (Consumer Confidence, Low). By browsing around, relationships between economic occurrences will begin to form, and the ones that seem prominent can be researched theoretically and otherwise, resulting in a much better understanding of the economy from a simple database.

Some additional examples of HOBB's utility can be seen in the following professions of Table II:

1. Economist--After a bit of work, you were able to gather time series dating back to 1950 on a quarterly basis covering France's and Italy's market and economic movements. Given this spreadsheet of 188 rows and numerous columns, what sort of information would be most valuable?
You would be interested in questions like: When the French GDP is shrinking, what tends to happen in the Italian series? Do Italian interest rates seem affected? What about exports? All these questions are co-occurrence questions which HOBB, through its browsing feature, makes clear and explorable.

2. Medical Researcher--Over the period of a year, your hospital has been keeping track of infections, how they were treated, and how successful the treatment was. The result is a large file of patient names along with bacteria names; antibiotic names, dosages, and days of use; and perceived side effects. Do certain side effects coincide with certain antibiotic dosages? If Vancomycin is ineffective, what other antibiotics tend to be ineffective? Are there certain antibiotics that do not work well with Genus Pseudomonas? All these questions are co-occurrence questions which HOBB, through its browsing feature, makes clear and explorable.

3. Retail Marketer--After much trouble and expense, your grocery chain has set up a tracking system that records each grocery purchase and stores the information in an Oracle database. You are now setting up a new store in a busy area of town and you want to convert your newfound data on grocery purchases into a layout that maximizes convenience for your customers. When customers purchase mayonnaise, how often is this accompanied by pickles? If they get peanut butter, do they also always buy bread? HOBB is a way to browse the data and get a solid understanding of grocery purchases before you take pen in hand and lay out the shelves.

4. Direct Marketer--Using a combination of databases, you gather demographic data on 50,000 customers you feel are good candidates for your mailings. After sending a test mailing, you would like to see whether there are any consistent elements or combinations of elements between the demographic data and whether or not the customers responded.
Once again we need to see the database through the lens of co-occurrence, which can be done utilizing HOBB.

TABLE II

The system of the present invention can be adapted for use in more generalized databases that are not already represented as a-v pairs. In these cases, the database is first pre-processed to generate the a-v pairs. For example, in a text database, such as documents from the Internet, each document is treated as a record or entry. The occurrence of a given word in a sentence, as well as co-occurrences of other words with the given word in each sentence, is used to build the a-v pairs which are analyzed by the system.

A feature of the present invention is the ability it provides to "visualize" data. Although the table display of FIG. 5 provides an adequate interface for looking at precise relationships between data, it requires some work and scrutiny to determine more "global" relationships involving larger numbers of a-v pairs. For example, where a first a-v pair co-occurs at 80% with another a-v pair, it would seem to imply that there is a strong relationship between the two pairs. However, if the first a-v pair also occurs in 98% of the entries in the database, then the fact that it intersects at 80% with the other a-v pair is not as significant. In fact, it becomes significant that it intersects with the other a-v pair only 80% of the time! In order to determine this from the table display of FIG. 5, a user must not just detect the seemingly high co-occurrence of the pairs in the middle column, but must also compare the occurrences of each a-v pair to the database as a whole.

To provide better global analysis of relationships, the invention uses Venn diagrams in a "Venn Display" to show co-occurrences of a-v pairs as overlaps in the diagrams. By presenting co-occurrences visually, it is easier to detect strong relationships between data. FIGS. 6 and 7 show two examples of Venn Displays. These diagrams are displayed in color in the actual system. FIG.
6 shows the attribute-value pair (Inflation, High), abbreviated (I, H), as yellow circle 350. This is the a-v pair selected, or designated as "of interest," such as the pair displayed at the top of the center column in FIG. 5, as discussed above. The pair (Long Bond Rates, High), abbreviated (LBR, H), is a second pair designated by the user for comparison with the pair of interest. In the preferred embodiment, the user can mouse-click on any a-v pair on the table display of FIG. 5 to designate the clicked pair for comparison. The user can use the scroll bars to the right of each column to bring additional pairs into view. FIG. 6 shows (LBR, H) as blue circle 352. The sizes of the circles, along with their area of overlap 354, are proportional to the numbers of occurrences. That is, (I, H) occurs 20 times in the database and has a yellow circle 350 that is about 2/3 of the area of the blue circle 352 representing (LBR, H), which occurs 30 times in the database. The area of overlap of the two circles corresponds to 18, which is the number of times that the two a-v pairs co-occur in the database entries.

Using the Venn Display of FIG. 6, the user can quickly see co-occurrence relationships and is prevented from making errors of the type discussed above, where a true interpretation of data relationships hinges on an idea of the percentage of occurrence of each a-v pair relative to the entire database.

FIG. 7 shows a second form of Venn Display, the "Full Information Display." Using the prior example of an economic model, assume that the attribute "Inflation" can have one of three values, either "Low," "Medium" or "High." The interaction between (LBR, H) and Inflation for every possible inflation value is shown graphically in FIG. 7. It is easy to see that most of the occurrences of (LBR, H) are when (I, H). Also, the overlap of (LBR, H) with (I, H) is a larger percentage of the overall occurrences of (I, H) in the database than with the other a-v pairs.
That is, (LBR, H) occurs in 5/14 occurrences of (I, H); (LBR, H) occurs in 2/10 occurrences of (I, M); and (LBR, H) occurs in 1/16 occurrences of (I, L). Again, while not shown in FIG. 7, color is used to designate each of the regions (I, L), (I, M) and (I, H).

FIGS. 8A-C show frames of a "movie" formed of several Venn Displays to create an animation that illustrates a change in data over time. Suppose a researcher is interested in using the database to see if Streptococcus fataliti is developing resistance to the antibiotic Vancomycin. A display similar to that of FIG. 7, the "Full Information Display," is computed over different time intervals. The resulting displays are shown in succession at a desired speed. From the movement of the center circle over time, it can be seen that the bacteria are gaining resistance to Vancomycin (i.e., it takes longer periods of treatment with Vancomycin to kill the bacteria).

Thus, a system for analysis and visualization of data has been presented. Although the invention has been discussed with respect to a specific embodiment, many modifications to the specific embodiment are possible without deviating from the invention, the scope of which is determined solely by the appended claims.
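As an illustration of the Venn Display of FIG. 6 described earlier, the circle geometry can be derived from the three counts alone. The following is a sketch under assumptions, not the method used by the invention: it assumes each circle's area is set equal to its occurrence count and that the overlap area is set equal to the co-occurrence count, and it finds the distance between centers by bisection on the standard formula for the area of intersection of two circles. The function names are hypothetical.

```python
import math


def circle_radius(count):
    """Radius of a circle whose area equals `count`."""
    return math.sqrt(count / math.pi)


def overlap_area(r1, r2, d):
    """Area of intersection of two circles whose centers are d apart."""
    if d >= r1 + r2:
        return 0.0                           # circles do not touch
    if d <= abs(r1 - r2):
        return math.pi * min(r1, r2) ** 2    # smaller circle fully inside
    # Clamp to guard against rounding just outside the valid range.
    c1 = max(-1.0, min(1.0, (d * d + r1 * r1 - r2 * r2) / (2 * d * r1)))
    c2 = max(-1.0, min(1.0, (d * d + r2 * r2 - r1 * r1) / (2 * d * r2)))
    kite = (-d + r1 + r2) * (d + r1 - r2) * (d - r1 + r2) * (d + r1 + r2)
    return (r1 * r1 * math.acos(c1) + r2 * r2 * math.acos(c2)
            - 0.5 * math.sqrt(max(0.0, kite)))


def center_distance(n1, n2, n_both, tol=1e-9):
    """Distance between centers giving an overlap area of `n_both`.

    Assumes 0 <= n_both <= min(n1, n2). The overlap shrinks as the
    centers move apart, so bisection on the distance converges.
    """
    r1, r2 = circle_radius(n1), circle_radius(n2)
    lo, hi = abs(r1 - r2), r1 + r2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if overlap_area(r1, r2, mid) > n_both:
            lo = mid   # too much overlap: move the centers apart
        else:
            hi = mid
    return (lo + hi) / 2


# Counts from the FIG. 6 discussion: (I, H) 20, (LBR, H) 30, both 18.
r_yellow, r_blue = circle_radius(20), circle_radius(30)
d = center_distance(20, 30, 18)
print(r_yellow, r_blue, d)
```

With areas equal to counts, the yellow circle's area is 20/30, or 2/3, of the blue circle's, in line with the description of FIG. 6. Computing the same geometry for successive time intervals would give the frames of an animation like that of FIGS. 8A-C.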
