Market analysis, demand forecasting or surveying

Internet profiling

6839680

Abstract

A system, method, and various software products provide for consistent identification of web users across multiple web sites, servers and domains, monitoring and capture of data describing the users' web activities, categorization of the web activity data, and aggregation of the data into time dependent models describing interest of users and groups over time. Categorization is made with respect to a category tree which may be standardized or customized for each web site. User groups may be defined based on membership rules for category interest information and demographics. Individual user profiles are then created for users automatically based on satisfaction of the user group membership rules. As new data is collected on a user over time, the category interest information extracted from the user's web activity is updated to form a current model of the user's interests relative to the various categories. This information is also used to automatically update group membership and user profile information. Identification of users across multiple sites is provided by a global service that recognizes each user and provides a globally unique identifier to a requesting web server, which can use the identifier to accumulate activity data for the user. Client side user identification is provided to track user activity data on web servers that do not communicate with the global service and do not process activity for category information. User profiles may be shared among web sites that form alliances. User activity data may be aggregated along various dimensions including users/user groups, categorization, and time to provide robust models of interest at any desired time scale.


Claims

We claim:

1. A system for profiling users of online information systems, comprising:

a first web server that receives requests from a user for web content items and records web events for selected web content items,

wherein each web event comprises content identification information identifying a web content item, time information describing an amount of time the user interacted with the web content item, and category relevance information for a plurality of categories about the web content item;

an aggregation service that aggregates the plurality of web events recorded by the first web server with at least one web event obtained from a source remote from the first web server into aggregated information along at least one of a plurality of categorization dimensions;

a user group service for defining a plurality of user groups, each user group having a definable membership rule which can be evaluated with respect to at least one category based on category relevance information from web events aggregated by the aggregation service;

a profile service that determines for each user a user profile, the user profile specifying at least one user group, defined by the user group service, of which the user is member;

an alliance service for sharing user profiles with one or more remote web servers participating in an alliance;

a client application resident on a computer used by the user that records web events for selected items of web content requested from a second web server that does not itself record such web events, each web event recorded by the client application containing information identifying the web content item and time information describing an amount of time the user interacted with the web content, where the client application uploads the recorded web events periodically to a global upload service; and

a global upload service, remote from the first web server, that maintains information for each user indicating one or more web servers that subscribe to receive web events for the user, receives the web events from the client application of a user, and provides the web events to the web servers that subscribe to the user's web events.

2. A system for profiling users of online information systems, comprising:

a first web server that receives requests from a user for web content items and records web events for selected web content items,

wherein each web event comprises content identification information identifying a web content item, time information describing an amount of time the user interacted with the web content item, and category relevance information for a plurality of categories about the web content item;

an aggregation service that:

aggregates the plurality of web events recorded by the first web server with at least one web event obtained from a source remote from the first web server into aggregated information along at least one of a plurality of categorization dimensions;

scales the category relevance information of a web content item as a function of the amount of time the user interacted with the web content; and

transforms the category relevance information for each category into a weight, the weight being a function of a category score, the amount of time the user interacted with the web content item, a scaling factor, and a total time scaled category score for all categories;

a user group service for defining a plurality of user groups, each user group having a definable membership rule which can be evaluated with respect to at least one category based on category relevance information from web events aggregated by the aggregation service;

a profile service that determines for each user a user profile, the user profile specifying at least one user group, defined by the user group service, of which the user is member;

an alliance service for sharing user profiles with one or more remote web servers participating in an alliance;

wherein each category receives a scaled category score:

$$\mathrm{NewScore}_i = \mathrm{CategoryScore}_i \times \mathrm{Duration}_i \times \mathrm{Constant}$$

where $\mathrm{Duration}_i$ is the amount of time the user interacted with the web content item, and $\mathrm{Constant}$ is a scaling factor; and

wherein each category receives a Weight: ##EQU2##


Description

BACKGROUND

1. Field of the Invention

The present invention relates to the analysis of the behavior and interests of users of online networks, and more particularly to the analysis and modeling of users' interests for users of the Internet and World Wide Web.

2. Background of the Invention

In any market, customer behavior is important. This is true of traditional retail businesses, where there are well developed mechanisms for determining customers' interests. In brick-and-mortar businesses, the customers of the business can be observed by watching those customers walk through a store. Customer behavior can also be observed by tracking their purchases (e.g., through credit card purchases). Customer observation is, in fact, an important technique used by many retail businesses. It is so important that major databases of customer behavior exist and are in continuous usage. For example, many supermarket chains have vast databases of customer behavior. Analysis of the data in such databases can be used for many purposes (e.g., inventory control, product placement, new product analysis).

Understanding customer behavior is also necessary for electronic commerce, but the techniques of observing the customer in this medium are necessarily different. The way that customers interact with an e-commerce web site is radically different from the experience of walking into a business in person and making a purchase, but many things remain the same. When Web visitors browse a web site, sometimes they buy, and sometimes they do not. Businesses are very interested in knowing why visitors buy and why they don't. So these new electronic merchants want to understand their prospects and their customers. These businesses must observe their web visitors. This observation leads to the need for modeling the interests of customers over time, the need for managing the tremendous amount of data that such modeling would entail, and the need for categorizing web content to provide meaningful models of user interests.

Conventionally, observation of users in online systems has typically involved using user-provided information about users' interests, such as surveys or forms that allow the user to identify the categories of information that are important to them. Examples of this approach include the various customizable home pages offered by search portals such as Yahoo and Excite. In these portals, users can select various predefined categories of interest, and relevant news and related data is then provided to the user. If however the user's interests change over time, the user must manually change the specified categories of interest; this is not done automatically. These sites also allow users to specify their interests with simple keywords, but again, if the interests change, the user must manually change these keywords.

Other web sites more systematically track user behavior in terms of clickthroughs and page views, and then assemble information about these activities. As the user's activity changes on this particular web site, the assembled information is updated. This approach, while capturing some aspects of change in user behavior, is typically limited to only identifying interests relative to a single web site. User behavior on other web sites does not affect the particular site's assembled information, even though such remote behavior may most accurately express the user's interests. More particularly, the analysis of user behavior is typically limited to the particular Internet domain of the server that tracks the usage. User activity at another domain is not tracked.

Further, the assembled information on such a server only expresses the user's interest without respect to potential future or past interests. That is, it does not model changing user interests over time. However, it is the change in user interest over time that is of significant value to web marketers and others attempting to deliver content to web visitors.

SUMMARY OF THE INVENTION

The present invention overcomes the limitations in the prior art by providing a system and methodology, and various software products, that track user activity across multiple domains, and from such activity develop a time based model that describes the user's interests over time. The changing user interests are also used to determine each user's membership in any number of defined user groups. Each user's time based model of interests and group memberships forms a detailed profile of the Internet activity that can be used to market information and products to the user, to customize web content dynamically, or for other marketing purposes.

Thus the present invention fulfills an important need: to identify web visitors and understand their interests over time. The present invention, sometimes referred to herein as "ProReach" or "ProReach system," is a software system that tracks and analyzes web visitors on the World Wide Web. In short, it helps turn web visitors into web customers. The present invention has the following features and aspects.

First, the present invention can identify and monitor a web visitor as he visits a web site. Of course, on the Internet there are many web sites, and there are many web visitors. Whether two web sites or thousands of web sites are involved, or there are millions of web visitors, the present invention provides a system which can identify many visitors across many web sites. Thus, in this aspect, the present invention identifies each visitor to a web site with unique identification information. This allows the visitor to be consistently identified, both during multiple visits to the same web site and during visits to other web sites.

ProReach combines data from many web activities to get a more complete picture of a web visitor. ProReach is able to combine the data from these different web sites because the visitor identification process works across the web. This simply means that when a web visitor goes from place to place on the World Wide Web, ProReach can typically identify the web visitor repeatedly and consistently. More specifically, in contrast to other web tracking products, the ProReach System collects data on both the web server and the web client. ProReach does the latter by providing downloadable software that web clients can install on their systems. Once installed, this software tracks the web user's actions from his machine. Each time he visits a web site, his actions are recorded. Periodically, a compact version of this data is uploaded to ProReach, and then distributed to other web sites which maintain profiles and user group information relative to the user.

Accordingly, the user's activity at each web site is monitored to identify items of web content with which the user interacts, such as page views, purchases, and so forth. The monitoring may be done by the web server itself, or by the client side software. This monitoring includes identifying each item of web content, such as with its URL or URI, along with information about how long the user viewed the content. This is beneficial because web activities that take longer, such as reading a web page, reflect a higher level of interest by the user. The data of a user's specific interaction with an item of content is stored in a web event record. (Certain web activities, e.g., simple, fast clickthroughs, may not be tracked in a web event record because they do not usefully reflect a user's interest.) This process of identifying web visitors and monitoring the web content they interact with occurs automatically and continuously. Over time then, a large number of web event records will be generated resulting from the activities of many web users at many web sites.
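The recording step described above can be sketched as follows. This is only an illustration under stated assumptions, not the ProReach implementation: the `WebEventRecord` fields, the `record_event` helper, and the 5-second cutoff for discarding fast clickthroughs are all hypothetical, since the text gives no record layout or threshold at this point.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical cutoff: shorter interactions are treated as fast
# clickthroughs and are not recorded. The source states no value.
MIN_DURATION_SECONDS = 5.0


@dataclass
class WebEventRecord:
    user_id: str     # identifier of the web visitor
    url: str         # URL or URI identifying the item of web content
    duration: float  # seconds the user spent with the content


def record_event(log: List[WebEventRecord], user_id: str, url: str,
                 duration: float) -> Optional[WebEventRecord]:
    """Append a record to the log unless the interaction was too brief."""
    if duration < MIN_DURATION_SECONDS:
        return None  # fast clickthrough: not a useful signal of interest
    event = WebEventRecord(user_id, url, duration)
    log.append(event)
    return event


log: List[WebEventRecord] = []
record_event(log, "visitor-1", "http://example.com/golf/clubs.html", 42.0)
record_event(log, "visitor-1", "http://example.com/index.html", 1.2)
print(len(log))  # only the 42-second page view is kept
```

The same helper could be called from either a web server or client side software, since both are described as possible monitoring points.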

Once data about a web visitor's visit to a web site is gathered, this data is not yet in a form that is particularly helpful for making business decisions. For example, it is not particularly helpful to know that some web visitor has viewed hundreds of web pages at a dozen web sites. Rather, it is more useful to understand what kinds of things the web visitor looked at: Motorcycles? Cosmetics? News? Technical information? Music CDs? Books?

Ideally, every document on the World Wide Web would be associated with a description that briefly describes what that document is about. That is, this description would categorize the document, much in the way in which books are categorized in a library. Such an ideal is not going to be a reality any time soon, if ever. So there needs to be a way to automatically categorize the documents that a web visitor sees. This categorization technique should be robust, accurate and maintainable.

The ProReach system provides just this capability, using a content recognition engine. A content recognition engine is a software component that can take a document and a set of categories and compute how closely the document matches up with these categories. Using the content recognition engine, the ProReach system can categorize various kinds of web documents, and provide a ranked list of categories, including hierarchical categories, that pertain to the document. The basic idea is that the content recognition engine evaluates some number of categories that may or may not match up with a given document. The content recognition engine tests the document and returns a score as to how closely it matches with each category. During this process, the document gets tested against many categories, so the resulting categorization is really a vector of categorization scores. Each categorization score of that vector shows how well that document matches up with a given category, such as sports, news or computers.
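To make the idea of a vector of categorization scores concrete, the following toy sketch scores a document against three categories. Simple term overlap is used here purely as a stand-in technique; the scoring method of the actual content recognition engine is not described at this point in the text. The category term sets and the function names `categorize` and `ranked` are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical category definitions: each category is a set of indicative terms.
CATEGORIES: Dict[str, Set[str]] = {
    "sports": {"golf", "soccer", "score", "team"},
    "news": {"election", "report", "headline"},
    "computers": {"software", "server", "browser"},
}


def categorize(document: str) -> Dict[str, float]:
    """Return one score per category: the fraction of its terms in the document."""
    words = set(document.lower().split())
    return {name: len(terms & words) / len(terms)
            for name, terms in CATEGORIES.items()}


def ranked(scores: Dict[str, float]) -> List[str]:
    """Category names ordered from best match to worst."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)


scores = categorize("The soccer team posted a record score on the server")
print(scores)          # one score per category: the categorization vector
print(ranked(scores))  # ranked list of categories for the document
```

Every category receives a score, including those that do not match at all, which is what makes the result a vector rather than a single label.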

Accordingly, each web event record is processed to determine its relevance to various defined categories. The categories are maintained in a category tree which covers a wide range of categories and topics. Preferably the web content is scored with respect to each category to indicate the degree to which the content may be said to be about the category. This categorization takes place automatically, without requiring action by a webmaster or system administrator.

The categories used in the categorization process are themselves part of the data provided to the content recognition engine. ProReach preferably provides turnkey categories, allowing the system to categorize web content as soon as ProReach is installed and running on a particular web site. In one embodiment, the turnkey categories are provided from a central host system that is in communication with a particular local ProReach system installation. The host ProReach system provides a comprehensive set of categories that target the practical information needs of e-businesses, and it provides sample data for these categories.

As an optional capability, ProReach system users can modify categories, or create their own. In this way, a web site using the ProReach system can categorize the viewing habits of its prospects and customers in a custom fashion. They can create new kinds of categories. This customization is optional. They are not required to do this. ProReach is a turnkey system that is customizable. It is not a system that requires customization to be used. ProReach also provides other tools to assist in the process of category creation and maintenance.

The data about a web visitor's activities is valuable, but ProReach can distill more meaning from this data. Electronic commerce decision makers are interested in the psychographic and demographic profile of the user. They do not want every single detail of the user's activities, but rather a summary of the user's interests which is abstracted from the details of the user's activities. It therefore becomes very desirable that all the detailed data of the user's activities can be compressed into a highly meaningful summary. Accordingly, the present invention further processes this information to develop detailed Internet profiles of each user, and of different user groups and categories of information.

The ProReach system of the present invention creates summaries of a web visitor's activities via a process of web activity aggregation. Through this process, the ProReach system automatically takes the previous history of a visitor's activities and integrates this with data collected from new visits. This process of taking new visits and integrating them with previous visits is performed on an as-needed basis. In this way, the profile of a web visitor is always kept up to date, reflecting that web visitor's interests.

More specifically, ProReach aggregates web visitors' web activity data on three dimensions: who they are (identity), what they did (content categorization) and when they did it (time). This process is called dimensional combining. Along these three dimensions, ProReach provides sophisticated, statistical-based aggregation.

Another strength of the ProReach system is its flexible approach to aggregating a visitor's activities. Different kinds of e-commerce businesses will want to summarize their visitor's activities in different ways. This is because different companies have different needs for understanding the nature of their customers. Accordingly, aggregation may be tuned to the needs of a particular business.

Hence the ProReach system provides excellent aggregation capabilities that can then be tuned by ProReach system administrators. It allows parameters to be set that control the aggregation process. Power and flexibility are combined. These parameters control what information is maintained and the amount of storage allowed for its maintenance.

In this aspect of the invention then, the web event records accumulated at a given web server are first aggregated into a set of aggregated results for each web user at the site, preferably on a periodic, fixed basis, such as a daily basis. Thus, a user may visit a particular web site several times a day, each time generating dozens of individual web event records. The same is true for many different users. Accordingly, for each user, the web event records are combined to collect all of the categorization information for that user together. In addition, the category score information in each web event record is processed to reflect the duration of the web activity. This processing results in a set of category weights.
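The transformation of category scores into duration-sensitive weights can be sketched as below. The scaling step follows the formula recited in claim 2 (category score times duration times a constant). The normalization step is an assumption: the weight equation itself is not reproduced in this text, so dividing by the total scaled score over all categories is only one reading consistent with the claim's wording. Summing over all of a user's events for the period, the value of `SCALING_CONSTANT`, and the function names are likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SCALING_CONSTANT = 1.0  # hypothetical; the source only names it "Constant"

# Each event: (duration in seconds, {category: category score}).
Event = Tuple[float, Dict[str, float]]


def scaled_scores(events: List[Event],
                  constant: float = SCALING_CONSTANT) -> Dict[str, float]:
    """Sum CategoryScore * Duration * Constant per category over all events."""
    totals: Dict[str, float] = defaultdict(float)
    for duration, scores in events:
        for category, score in scores.items():
            totals[category] += score * duration * constant
    return dict(totals)


def to_weights(scaled: Dict[str, float]) -> Dict[str, float]:
    """Assumed normalization: divide each scaled score by the grand total."""
    total = sum(scaled.values())
    if total == 0:
        return {category: 0.0 for category in scaled}
    return {category: value / total for category, value in scaled.items()}


events = [
    (60.0, {"sports": 0.5, "news": 0.5}),  # long read, split across two topics
    (20.0, {"news": 1.0}),                 # shorter visit, purely news
]
weights = to_weights(scaled_scores(events))
print(weights)
```

Under this reading, a long visit contributes more to a category's weight than a short one with the same category score, which matches the stated goal of reflecting the duration of the web activity.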

The combined category weighting information for the collected period, such as a day, describes in detail the user's degree of interest across a number of categories. However, further processing is beneficial to obtain a more summarized model of the user's interests. Thus, from the weighted category information various statistical measures are derived such as the mean category weight over the period, maximum and minimum weights, standard deviation, and the like. In addition, a trend pattern is also extracted which describes whether the user's interest in the category is increasing, decreasing, or constant, or some combination of these, over the time period. This summarized representation of the category weights for the time period can be stored, and best captures the changes in the user's interest, across a number of categories, over the time period. As a result, the underlying raw data of the web event records can be deleted, so that storage efficiency is achieved.
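A summary of this kind might be computed as in the sketch below. The statistical measures come directly from the description; the way the trend pattern is derived is an assumption, since the text does not say how it is extracted. Here the trend is taken from the sign of a least-squares slope, and combined patterns (for example, increasing then decreasing) are not handled. The names `trend` and `summarize` are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def trend(weights: List[float], tolerance: float = 1e-9) -> str:
    """Classify a series by the sign of its least-squares slope."""
    n = len(weights)
    if n < 2:
        return "constant"
    x_mean = (n - 1) / 2
    y_mean = mean(weights)
    # Numerator of the slope; the denominator is always positive,
    # so the sign of the numerator is the sign of the slope.
    numerator = sum((i - x_mean) * (w - y_mean) for i, w in enumerate(weights))
    if numerator > tolerance:
        return "increasing"
    if numerator < -tolerance:
        return "decreasing"
    return "constant"


def summarize(weights: List[float]) -> Dict[str, object]:
    """Compact summary of one category's weights over a time period."""
    return {
        "mean": mean(weights),
        "min": min(weights),
        "max": max(weights),
        "stdev": pstdev(weights),
        "trend": trend(weights),
    }


summary = summarize([0.10, 0.20, 0.30, 0.40])
print(summary)
```

Once a summary like this is stored, the individual weights it was computed from are no longer needed, which is the storage saving the description refers to.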

First, the period information may be aggregated for each user with respect to each of the categories across a longer time period. For example, the daily aggregated information for a user may be further aggregated for a week's time period, a month, a quarter, a year and so forth. This forms what is termed a user-category complex, wherein the statistical information for a single category from many different days is combined by an aggregation function. One exemplary aggregation function is mean, and thus the mean of the category weights for this particular category over the time period is obtained, along with trend pattern and other statistical measures.

Second, dimensional combining may be used to form category complexes. A category complex summarizes a large number of users' interests in a particular category over a selected time period. This complex describes the level of interest, over time, for a population of users in a particular category.

Another type of dimensional combining makes use of the user-category complexes. First, the many user-category complexes for an individual user may be combined for a selected time period, to form an aggregated view of the user's overall interests. That is, the category information from many different categories is aggregated and describes the user's interests overall.

Additionally, the user-category complexes may be combined for an individual category and across selected users who form a user group, to create user group-category complexes. The user group members are selected by having met certain membership tests based on their category interests and optionally demographics. This gives a summary of the user group's interest in that category over time.

The user complexes can be further combined into user group complexes to describe overall group interest across all categories. Finally, the group complexes may be aggregated to form an overall total complex which describes the total population's interest across all categories for the selected time period.
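The forms of dimensional combining described above can be viewed as grouping the same data by different subsets of the three dimensions (user, category, time) and aggregating the rest. The sketch below is an illustration under that assumption, using mean as the aggregation function as in the exemplary case. The sample data and the `combine` helper are hypothetical, and for brevity each complex is computed directly from the daily weights rather than from previously built complexes, which can give different numbers when groups are of unequal size.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Hypothetical daily data: (user, category, day) -> category weight.
Key = Tuple[str, str, str]
daily: Dict[Key, float] = {
    ("alice", "sports", "2000-01-01"): 0.6,
    ("alice", "sports", "2000-01-02"): 0.8,
    ("alice", "news", "2000-01-01"): 0.4,
    ("bob", "sports", "2000-01-01"): 0.2,
    ("bob", "sports", "2000-01-02"): 0.4,
}


def combine(data: Dict[Key, float], keep: Tuple[int, ...],
            aggregate: Callable[[List[float]], float] = mean
            ) -> Dict[Tuple[str, ...], float]:
    """Group by the key positions in `keep` and aggregate the rest away."""
    groups: Dict[Tuple[str, ...], List[float]] = defaultdict(list)
    for key, weight in data.items():
        groups[tuple(key[i] for i in keep)].append(weight)
    return {group: aggregate(values) for group, values in groups.items()}


user_category = combine(daily, keep=(0, 1))  # user-category complexes
category = combine(daily, keep=(1,))         # category complexes
user = combine(daily, keep=(0,))             # user complexes
total = combine(daily, keep=())              # overall total complex

print(user_category[("alice", "sports")])
print(category[("sports",)])
print(total[()])
```

A user group-category complex would follow the same pattern after first filtering `daily` to the members of the group.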

In addition to the various complexes that may be aggregated, individual profiles of the users can be further augmented with the user group information. A number of user groups may be defined, each having particular membership criteria. Marketers can define groups of users that share interests, buying propensities or demographics. The criteria are preferably based on a user having (or not having) particular levels or ranges of category weights for one or more categories. A user may be a member of multiple user groups. The group membership is automatically updated as the users interact with web content over time, and as their interests change as expressed by the changing levels of category weights. The ProReach system will automatically classify a user into the right user groups based on his or her profile. If the definition of the user groups changes, then the ProReach system will automatically re-classify users into the right user groups. Similarly, as the interests of users change, they will automatically be put into the right visitor segments based on their new interests. In this way, a marketer has immediate access to market segments on demand, and can swiftly apply electronic sales campaigns.
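Membership rules of the kind described above, based on a user having or not having particular levels of category weights, can be sketched as predicates over a profile. This is a minimal illustration only: the group names, thresholds, and the `classify` function are hypothetical, and demographic criteria are omitted.

```python
from typing import Callable, Dict, List

Profile = Dict[str, float]        # category -> category weight
Rule = Callable[[Profile], bool]  # membership rule for one user group

# Hypothetical groups; names and thresholds are illustrative only.
GROUPS: Dict[str, Rule] = {
    # Having a high level of interest in one category.
    "sports_fans": lambda p: p.get("sports", 0.0) >= 0.5,
    # Having one interest while NOT having another.
    "news_not_sports": lambda p: (p.get("news", 0.0) >= 0.3
                                  and p.get("sports", 0.0) < 0.2),
}


def classify(profile: Profile, groups: Dict[str, Rule] = GROUPS) -> List[str]:
    """Return the names of all user groups whose rule the profile satisfies."""
    return sorted(name for name, rule in groups.items() if rule(profile))


print(classify({"sports": 0.7, "news": 0.3}))
print(classify({"sports": 0.1, "news": 0.6}))
```

Re-classification after a change in interests or in group definitions amounts to calling `classify` again with the updated profile or rule set.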

The visitor profile information that ProReach systems generate can be retained for the sole use and benefit of the web site that created it. It is also possible for ProReach systems to share their user profile information. To facilitate this sharing, ProReach provides a centralized service that helps ProReach systems define policies for the transfer of information between each other. For ProReach customers that want a deeper relationship with each other, the present invention provides for alliances. An alliance is a group of ProReach systems that have decided to contribute their user profiles into a database of profiles. All members of the alliance contribute profiles, and all members of the alliance benefit by getting a degree of access to the alliance profiles. In particular, alliances are useful to vertical markets where companies may want to work together on the World Wide Web. Such groups of businesses may benefit from combining their information, but they need the infrastructure to facilitate this sharing, regulate it and make it safe. ProReach provides this enabling infrastructure. In an alliance, each member contributes visitor profiles created for visitors to the member's web sites. These contributed profiles are aggregated together in a database of profiles maintained by the alliance. All members of the alliance get controlled access to these profiles. A system of sharing rules controls this whole sharing process, so that companies only share selected information. ProReach supports the formation of multiple alliances. A ProReach-enabled system can belong to more than one alliance.

A very large amount of visitor activity data will be generated by web sites using ProReach systems. The existence of this data raises privacy concerns. It also raises issues about how ProReach Systems themselves share data amongst themselves. ProReach has an architecture that addresses privacy concerns. ProReach ensures the privacy of web visitors via what it calls an identity firewall. The purpose of an identity firewall is to establish a boundary. Inside the boundary of the identity firewall, the identity of a web visitor is accessible to authorized personnel or processes. Other personal information is also available, such as e-mail address, home address and age.

Outside the boundary of the identity firewall, no data is provided that could be used to identify a web visitor. Instead, any person or process requesting information from outside an identity firewall gets only an opaque visitor identifier. The ProReach System that issues the opaque visitor identifier can use it to uniquely identify the web visitor. Hence, an opaque visitor identifier is an externalizable reference to ProReach visitors.

A person or process with an opaque visitor identifier can present the opaque visitor identifier to that ProReach System. The ProReach System can then map that opaque visitor identifier back to the actual visitor. Using this method, it is possible for a web marketer, for example, to be given a large amount of information about the interests of a web visitor but the marketer doesn't know the visitor's identity or contact information. The web marketer is simply given an opaque visitor identifier (or a set of such identifiers). The marketer gets the data he needs, but the privacy of the visitor's data is maintained. So outside the identity firewall, a web visitor being tracked by ProReach is anonymous.
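The mapping between visitors and opaque visitor identifiers can be sketched as a two-way lookup held inside the identity firewall. This is an illustrative sketch only: the `IdentityFirewall` class, its method names, and the use of random tokens are assumptions, since the text does not say how opaque identifiers are generated or stored.

```python
import secrets
from typing import Dict


class IdentityFirewall:
    """Maps real visitor identities to opaque identifiers and back."""

    def __init__(self) -> None:
        self._opaque_by_visitor: Dict[str, str] = {}
        self._visitor_by_opaque: Dict[str, str] = {}

    def opaque_id(self, visitor: str) -> str:
        """Return the visitor's opaque identifier, issuing one if needed."""
        if visitor not in self._opaque_by_visitor:
            # Random token: carries no information about the visitor.
            token = secrets.token_hex(16)
            self._opaque_by_visitor[visitor] = token
            self._visitor_by_opaque[token] = visitor
        return self._opaque_by_visitor[visitor]

    def resolve(self, token: str) -> str:
        """Inside the firewall only: map an opaque identifier to the visitor."""
        return self._visitor_by_opaque[token]


firewall = IdentityFirewall()
token = firewall.opaque_id("jane@example.com")
print(token)                    # what a marketer outside the firewall sees
print(firewall.resolve(token))  # what the issuing system can recover
```

Because the token is random rather than derived from the visitor's details, a party holding only the token cannot work backward to the identity; only the issuing system holds the mapping.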

The web marketer may have the ProReach system contact the web visitor on his behalf using IPro's Visitor Contact Service. Given an opaque visitor identifier and a message, the Visitor Contact Service looks up the e-mail address (or other necessary information). It then sends the message to the web visitor. The web marketer gets his message delivered to the web visitor, but the web marketer does not know the web visitor's identity.

Identity firewalls can be flexibly configured. They can be configured so that the identity firewall encloses a single ProReach System. They can be configured so that an identity firewall encloses a group of ProReach systems. The latter configuration would make sense when there are multiple ProReach servers working as a group (e.g., for a portal with multiple servers) and data should be shared between the servers.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates the features of a web event.

FIG. 2 illustrates data flow in the process of aggregating web events and creating user profiles.

FIG. 3 illustrates a top level system architecture of various ProReach systems.

FIG. 4 illustrates the hub and spoke architecture of a ProReach system.

FIG. 5 illustrates an embodiment of a ProReach system operating with a firewall.

FIG. 6 illustrates the Global Services server.

FIGS. 7a-7f illustrate the overall processing flow of a ProReach system.

FIG. 8 illustrates an alliance of ProReach systems.

FIG. 9 illustrates the aggregator queue used to store web event records.

FIG. 10 illustrates the features of the aggregator service.

FIG. 11 illustrates the processing function of the parser.

FIG. 12 illustrates the concept of a category interest curve.

FIG. 13 illustrates the root portion of a central category tree.

FIG. 14 describes the process of updating the standard category tree.

FIG. 15 illustrates the operation of the content recognition engine.

FIG. 16 illustrates the process of customizing content based on a user profile.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

TABLE OF CONTENTS

I. WEB EVENTS AND AGGREGATION

A. WEB EVENT RECORDS

II. OVERVIEW OF PROREACH SYSTEM ARCHITECTURE

A. GLOBAL SERVICES

III. BASIC SYSTEM PROCESSING

A. PROREACH FUNCTIONAL OVERVIEW

B. CATEGORY DISCOVERY AND MAINTENANCE

1. Category Discovery

2. Category Maintenance

IV. PROREACH SYSTEMS WITH ALLIANCES

V. AGGREGATION

A. AGGREGATING DAILY WEB EVENTS

1. Transform Category Scores to Weights

2. Restructure Web Event Records to Collate Category Weights by User

3. Create Category Interest Time Model Information

B. DIMENSIONAL COMBINING

C. USER GROUP SYSTEM

D. DAILY AGGREGATION

E. AFFINITY GROUP MANAGER

F. THE UPDATE OBJECT

G. SCHEDULER

H. EVENT DISPATCHER

I. PROFILE SYSTEM

J. AQL SYSTEM

1. AQL Language

2. AQL Interpreter

VI. CATEGORIES AND CATEGORIZATION

A. OVERVIEW OF CATEGORIZATION

B. CATEGORIES AND HIERARCHIES ORGANIZE DATA

1. Building and Maintaining Category Hierarchies

C. CATEGORY NAMES AND ID'S

1. Default Unalterable User Category Structure

2. Similarities and Differences Between Categories and Groups

D. USING SOURCE OR LOCATION IN CATEGORIZATION

E. THE CONTENT CATEGORY LIFECYCLE: FORMATION, TUNING, AND CHANGE

1. The Standard Category Tree and Additions by ProReach System Administrators

a) Adding Categories At ProReach systems

b) Updating the Standard Category Tree

c) Building the Standard Category Tree

d) Discovery, Refinement, and Editing of Categories

F. CATEGORIZATION MODEL OF THE CONTENT RECOGNITION ENGINE

1. Category Creation

2. Document Categorization

3. Multiple Dictionary Categorization

4. Category Cache

VII. GLOBAL SERVICES

A. GLOBAL IDENTIFIER SERVICE

1. Requests For GIDs.

2. Individual Identification via PIDs

B. GLOBAL UPLOAD SERVICE

C. GLOBAL CLIENT MANAGEMENT SERVICE

D. YELLOW PAGES

E. GLOBAL EXCHANGE POLICY

VIII. PROREACH CLIENT SIDE WEB USAGE DATA COLLECTION

A. WEB ACTIVITY MONITORING

B. PROREACH CLIENT WEB USAGE DATA FILTRATION AND AGGREGATION

1. Time-based consolidation

a) Adjust web event record time stamps

b) Ignore short-term activities

c) Aggregate Web activities

2. Other Filtration of Data

3. Privacy Control

C. FILTRATION BASED ON PRIVACY SETTINGS (USER MODIFIABLE)

1. URL pattern-based filtration

2. Keyword-based filtration

D. DEFAULT PRIVACY-RELATED FILTRATION

E. PROREACH CLIENT DATA UPLOAD

1. ProReach client upload queue

2. ProReach Upload Stream and Upload Record

3. Data upload

a) Web Event Record upload

b) Homepage URL upload

4. Upload time and upload stages

a) Pre-upload stage

b) Upload stage

c) Post-upload stage

5. ProReach Upload Service and upload

IX. CONTENT TARGETING

A. ACCESS TO PROFILE BY A CGI

1. Access to page Metadata by CGI

a) Handling dynamic content categorization of multipart pages at runtime

I. Web Events and Aggregation

Referring now to FIG. 1, there is shown an illustration of the concept of a web event, which is used as a basic modeling unit for measuring the interests of web visitors in web content. A web event 101 is the combination of three different types of information. First, a web event 101 contains information which uniquely identifies the particular web visitor 103, or generically a "user." This user can be an individual person, or any group of persons to which the user is deemed to belong. Second, a web event 101 includes information which describes or measures 107 the amount of time spent by the web visitor in viewing or interacting with the web content.

Finally, the web event 101 includes information that identifies one or more categories 105 into which the web content visited by the web visitor belongs and a measure of the user's interest in each of the one or more categories. The categories used to describe the web content preferably form a hierarchy of categories, with parent categories (e.g., "Sports") having multiple child categories (e.g., "Soccer" and "Golf").

These three pieces of data are used to model the basic idea that a user viewing or interacting with an item of web content is expressing an "interest" in whatever category or categories that web content is about. The longer the visitor views or interacts with the content, the greater the visitor's interest is presumed to be (other factors may also be used to scale the level of interest, such as the type of interaction, e.g., a simple viewing of a page versus a purchase).

This measure of the interest of a user in a category at a particular time or duration is expressed as a weight. A weight is a function of the amount of time spent by the visitor interacting with an item of web content, and the degree to which the category is deemed to describe the content. In a preferred embodiment where there are a number of categories available, a web event includes a weight for each category. This reflects the fact that a given item of web content may relate to many different categories in different degrees.

To provide a meaningful scale of interpretation of these weights, and hence a level of interest in a category, the weights are scaled to a standard unit called an interaction unit. An interaction unit is interpreted to mean 1 minute of attention paid by a user to an item of content. By scaling web events using interaction units, it becomes possible to meaningfully compare the interests of any variety of different users and categories of web content.
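The relationship between viewing time, category score, and weight can be illustrated with a minimal sketch. It assumes, purely for illustration, that the weight is the product of the viewing time in minutes and a category score between 0 and 1; the text above states only that the weight is a function of these two quantities. The function name and the sample scores are hypothetical.

```python
def category_weights(duration_seconds, category_scores):
    """Return one weight per category, in interaction units.

    One interaction unit is 1 minute of attention paid to an item of content.
    Assumption (not from the specification): weight = minutes x category score.
    """
    minutes = duration_seconds / 60.0
    return {category: minutes * score for category, score in category_scores.items()}


# Hypothetical example: a 3-minute page view scored 0.8 for "Soccer"
# and 0.5 for its parent category "Sports".
weights = category_weights(180, {"Soccer": 0.8, "Sports": 0.5})
```

Because every weight is expressed in the same unit, weights from different users, pages, and categories can be summed or compared directly.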

These three types of information are collected for each item of web content viewed by a web visitor at a particular web site, and by extension by multiple different visitors across many web sites. For example, as the visitor moves from one web page to another on a given web site, a web event is generated which encapsulates the information identifying the visitor, the category description of the page, and the amount of time spent by the visitor on the page. As the same visitor visits different web sites, they are identified and web events which capture the category of content and time spent viewing such content are generated.

In themselves, web events are merely individual data items, and do not directly describe the overall patterns of interest of any individual user or groups of users, or patterns relative to categories or time. This level of abstraction is provided by a second aspect of the present invention, aggregation. Most generally, aggregation is the process of summarizing the weights of different groups of web events to establish patterns of interest. Generally, web events can be combined with respect to time periods, individual users, groups of users, categories, or groups of categories, or any combinations of these. When considered together, there are six different ways to combine web events:

1) Combine all web events between two dates: This combination approach combines web events related to all categories and all users over a given time period to provide a model of the global interests of the population of users.

2) Combine all web events for a category between two dates: This combination combines the web events for a specific category (or group of categories) for all users over a given time period to provide a model of the level and pattern of interest of all users in the specified category.

3) Combine web events for a user and a category between two dates: This combination combines the web events of a specific user and a specific category over a time period to provide a model of the user's level and pattern of interest in the specified category.

4) Combine web events for a user group and a category between two dates: This combination provides a model of the group's interest in the specified category.

5) Combine web events for a specific user between two dates, across all categories: This combination provides a description of the overall distribution of a user's interests across all categories, showing whether the user is narrowly interested in one or a few of the categories maintained at a web site or broadly interested in many of the categories at the web site.

6) Combine web events for a user group between two dates, across all categories.

In one embodiment, when performing these various types of combinations, the events selected during a given time period are those which start during the time period, even if they end after the selected time period.

We call the process of combining web events in these various ways "dimensional combining", since there are six "dimensions" in the data along which web events may be combined. These possible combinations can be used to provide an analysis of any user's or group's interest in any category or categories over any time period. Referring now to FIG. 2, there is shown an illustration of these various ways of combining web events.
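The six combinations differ only in which users and which categories are selected, so they can be sketched with a single function. The following is a minimal illustration, not the aggregation model of the preferred embodiment: each combination is reduced to a single summed weight, the time period is treated as a half-open interval, and the `WebEvent` type and `combine` function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional, Set


@dataclass
class WebEvent:
    user_id: str
    start: datetime
    weights: Dict[str, float]  # category -> weight in interaction units


def combine(events, date_from, date_to,
            users: Optional[Set[str]] = None,
            category: Optional[str] = None) -> float:
    """Sum the weights of events that START in [date_from, date_to).

    users=None selects all users; a one-element set selects a single user;
    a larger set stands in for a user group. category=None selects all categories.
    """
    total = 0.0
    for event in events:
        if not (date_from <= event.start < date_to):
            continue  # selection is by start time only, per the embodiment above
        if users is not None and event.user_id not in users:
            continue
        for cat, weight in event.weights.items():
            if category is None or cat == category:
                total += weight
    return total


events = [
    WebEvent("u1", datetime(1999, 1, 1, 10), {"Sports": 2.0, "News": 1.0}),
    WebEvent("u2", datetime(1999, 1, 1, 11), {"Sports": 3.0}),
    WebEvent("u1", datetime(1999, 1, 5, 9), {"News": 4.0}),
]
day_start, day_end = datetime(1999, 1, 1), datetime(1999, 1, 2)

all_events = combine(events, day_start, day_end)                                      # (1)
one_category = combine(events, day_start, day_end, category="Sports")                 # (2)
user_category = combine(events, day_start, day_end, users={"u1"}, category="Sports")  # (3)
one_user = combine(events, day_start, day_end, users={"u1"})                          # (5)
```

Combinations (4) and (6) follow by passing the members of a user group as the `users` set.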

In FIG. 2 there is shown a number of discrete web events 101, occurring over some period of time, such as a number of days. All of the web events 101 for each day are aggregated into user specific, daily aggregated results 201. These daily aggregated results 201 form what is labeled as Level 0 of the figure. To obtain an understanding of the web visitors' interests, the web events over some number of days (e.g., week, month, quarter, year, etc.) are combined in different ways, as discussed above.

First, in Level 1, the daily aggregates can be combined per (3) above into "UC" or User-Category complexes 203, or per (2) above into individual "C" or Category complexes 205. Note that a single daily aggregated result 201 may contribute to either of these complexes; that is, the results of a particular web visitor's web activity contribute both to the Category complexes 205 for all categories affected by that visitor's activity, and to that user's specific User-Category complexes 203 describing that user's level of interest in the various categories.

Next in Level 2, the individual UC complexes can themselves be combined. First, per (4) above, the particular UC complexes for certain users who form a user group can be combined into a "GC" or Group-Category complex 206. This complex 206 describes the group's interest in the particular category for the data. Second, per (5) above, all of the User-Category complexes for a particular user can be combined to form a single "U" or User complex 207, summarizing the user's interests across all of the categories. The User complex 207 is particularly useful to gauge the breadth or narrowness of a user's interest. For example, a web site may have a limited number of categories of content. For one user of this web site, the User complex 207 may show a high level of interest in just one or two categories, whereas another user's User complex 207 may show a high level of interest in a majority of the categories; this second user is likely to be more valuable to the web site for purposes of marketing or other value driven activities.

Next in Level 3, the complexes 207 for individual users can be combined per (6) into "G" or Group complex 209 across all categories.

Finally, in Level 4, the complexes 209 from the many groups can be combined per (1) above into Total complex 211, describing the interests of all users across all categories.
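The level-by-level roll-up of FIG. 2 can be sketched as follows. This is a simplified illustration under stated assumptions: each complex is reduced to a single summed weight, user groups are given as a hypothetical mapping from group name to member set, and the groups are assumed to partition the users so that the Level 4 total counts every user exactly once.

```python
from collections import defaultdict


def roll_up(daily_results, groups):
    """Roll Level 0 daily results up through Levels 1 to 4.

    daily_results: iterable of (user, category, weight) tuples (Level 0).
    groups: dict mapping group name -> set of users (assumed to partition the users).
    """
    uc = defaultdict(float)  # Level 1: User-Category complexes
    c = defaultdict(float)   # Level 1: Category complexes
    for user, category, weight in daily_results:
        uc[(user, category)] += weight
        c[category] += weight

    gc = defaultdict(float)  # Level 2: Group-Category complexes
    u = defaultdict(float)   # Level 2: User complexes
    for (user, category), weight in uc.items():
        u[user] += weight
        for group, members in groups.items():
            if user in members:
                gc[(group, category)] += weight

    # Level 3: Group complexes, built from the User complexes of the members.
    g = {group: sum(u.get(member, 0.0) for member in members)
         for group, members in groups.items()}

    total = sum(g.values())  # Level 4: Total complex
    return {"UC": dict(uc), "C": dict(c), "GC": dict(gc), "U": dict(u), "G": g, "Total": total}


levels = roll_up(
    [("u1", "Sports", 2.0), ("u1", "News", 1.0), ("u2", "Sports", 3.0)],
    {"g1": {"u1"}, "g2": {"u2"}},
)
```

Each level is computed only from the level below it, which is the property that allows lower-level data to be discarded once it has been summarized.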

This web event modeling and aggregation framework provides many advantageous features. First, it allows a system administrator (or a member of ProReach System) to arbitrarily select the time period over which any of these aggregations is performed, to obtain broader or narrower analyses of the time pattern of the users' interests. This is useful to identify very short term interest trends or longer term trends in users' interest. Second, because each level of aggregation fully captures the information of the level below it, the underlying web event data may be selectively discarded to improve storage efficiency. For example, web events for categories which have a very low level of interest (identified by a low weight) may be discarded after their data has been summarized into UC or C data. Web events with greater weight may be stored longer to allow them to be used for more analysis or marketing.

A. Web Event Records

When a web visitor performs a web activity, such as viewing the contents of a uniform resource locator, or clicking on a submit button that initiates a web transaction, this web activity is recorded by client-side or server-side trackers, which record this web event. The data of each web event is stored in a web event record. Web event records are then aggregated into the daily aggregated results 201, and from there into the various complexes. The basic features of a web event record are as follows:

                         Web Event Record
    Field           Explanation
    User ID         Uniquely identifies the visitor.
    Location        The URL or URI of the web content.
    Start time      Onset of activity in Greenwich Mean Time for a single
                    event. If there are multiple events at this URL, then the
                    time of the earliest download.
    End time        Last recorded activity in Greenwich Mean Time for a
                    single event. If there are multiple events at this URL,
                    then the time of the last download. If unknown, a default
                    1 minute from the start time is used.
    Event type      Stores a value indicating the type of web activity, such as
                    view, clickthrough, purchase, and so forth.
    Event count     The number of times this URL/URI was downloaded.
    Category Score  The category scores for the content.

For example, assume that a user's web activity is as follows:

    Activity  Start Time-End Time        URL      Duration
        1     10:05 am-10:10 am          <URL A>  5 min
        2     10:10 am-10:12 am          <URL B>  2 min
        3     10:12 am-10:14 am          idle
        4     10:14 am-10:15 am          <URL C>  1 min
        5     10:15 am-10:15:03 am       <URL B>  3 sec
        6     10:15:03 am-10:16 am       <URL A>  57 sec
        7     10:16 am-10:16:06 am       <URL D>  6 sec
        8     10:16:06 am-10:16:10 am    <URL A>  4 sec
        9     10:16:10 am-10:22:30 am    <URL E>  6 min 20 sec
       10     10:22:30 am-10:30 am       idle


The web event records may be generated by either the web client 108 or the web server 102. If generated on the web client 108, the corresponding web event records would be as follows (note that the user ID and category score information is not shown here).
    URL      Start-time   End-time     Duration              Occurrence
    <URL A>  10:05 am     10:16:10 am  5 min 57 sec          2
    <URL B>  10:10 am     10:12 am     4 min *(see Note 2)   1
    <URL C>  10:14 am     10:15 am     1 min                 1
    <URL E>  10:16:10 am  10:22:30 am  5 min *(see Note 3)   1


Note 1. When a URL is captured, the current time is stored in the Start-time timestamp field in the web event record. The difference between the current time and the time in the timestamp of the previous record is calculated and stored in the previous record's "duration" field.

Note 2. Duration may or may not equal (End-time - Start-time). This is because there may be other events between the earliest download at this URL and the last download. For example, there is a gap of 2 minutes between visits to <URL B> and <URL C>. The "duration" in the activity table shows the actual time a user spends on browsing a particular URL, while the "duration" in the web event record is an approximation of that time. Where the web event record is created by the web client 108, the client software may only approximate the real "duration" by taking the Start-time of the next URL as the End-time of the current URL. There is no way for the software to know about idle gaps in between URL visits without user intervention. Where the web event record is generated by the web server 102 that is tracking the user, the duration can be estimated.

Note 3. Here too, the duration for <URL E> can only be calculated by the web client 108 as 13 min 50 sec (10:30 - 10:16:10 = 00:13:50). The web client 108 will not know of the idle time after the access to <URL E>. However, the web client 108 (or the web server 102) may keep a pre-set maximum time for the duration of a single URL access, for example, 5 minutes. This is to normalize the "duration" factor so that no single URL access can have an abnormally large "duration". A user may be tied up with other activities for a while between two URL accesses, and this may result in some abnormally large duration numbers. Those abnormally large duration numbers would incorrectly affect a user's Web usage pattern and profile. Note that the cumulative duration, however, is not limited to that maximum duration number. For example, the duration for <URL A> is an aggregation of two separate URL accesses; therefore, it is not confined to the 5 minute limitation.

Note 4. Activities 5, 7, and 8 were not included in the total duration of any web event since they were filtered out for being too short a period of time. This is done to help reduce the data collection requirements and because such short duration views are not likely to be indicative of the user's actual interests.
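The client-side rules in Notes 1 through 4 can be sketched as a short routine that reproduces the Duration and Occurrence columns of the table above. The 10 second cut-off for "too short" visits is an assumption, since no value is given; the function names are hypothetical; and End-time handling is omitted because the text does not spell out how filtered visits affect it.

```python
MAX_SECONDS = 5 * 60  # pre-set maximum duration of a single URL access (Note 3)
MIN_SECONDS = 10      # assumed cut-off for visits that are too short (Note 4)


def to_seconds(hms):
    """Convert 'HH:MM:SS' to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def client_records(captures, session_end):
    """Build per-URL (duration_seconds, occurrence) pairs from URL captures.

    captures: list of (start 'HH:MM:SS', url) in time order, as seen by the client.
    session_end: time of the next recorded client activity after the last capture.
    """
    starts = [to_seconds(start) for start, _ in captures] + [to_seconds(session_end)]
    records = {}
    for index, (_, url) in enumerate(captures):
        # Note 1: duration is the gap to the next capture, idle time included.
        duration = starts[index + 1] - starts[index]
        if duration < MIN_SECONDS:
            continue  # Note 4: very short visits are filtered out
        duration = min(duration, MAX_SECONDS)  # Note 3: cap each single access
        total, count = records.get(url, (0, 0))
        records[url] = (total + duration, count + 1)  # cumulative total is not capped
    return records


captures = [
    ("10:05:00", "A"), ("10:10:00", "B"), ("10:14:00", "C"), ("10:15:00", "B"),
    ("10:15:03", "A"), ("10:16:00", "D"), ("10:16:06", "A"), ("10:16:10", "E"),
]
records = client_records(captures, session_end="10:30:00")
```

Under these assumptions, <URL A> accumulates 357 seconds (5 min 57 sec) over two occurrences, <URL B> is recorded as 4 minutes because the idle gap cannot be seen by the client, and <URL E> is capped at 5 minutes.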

In the next sections we describe the architecture and functionality of a system which records web events and provides the various capabilities to aggregate data as described.

II. Overview of ProReach System Architecture

The present invention may be embodied in a system which we call "ProReach". We begin with a very high-level overview of the ProReach architecture, and describe the high-level components involved in this architecture, and show the high-level relationships between these components. We will also describe some typical configurations of ProReach, and show how ProReach supports one or more web servers, both behind and across firewalls. A discussion of the basic elements of alliances is included.

Referring to FIG. 3, there are shown various ProReach systems 100 operating over the Internet. Each ProReach system 100 handles one or more web servers 102. These web servers 102 can all belong to the same domain, or they can belong to different domains. FIG. 3 depicts two ProReach systems 100. One ProReach system 100 supports a single web server 102, while the other ProReach system 100 supports two web servers 102. In all, there are three ProReach-enabled web servers 102 in this figure.

Each ProReach-enabled web server 102 of a ProReach system 100 tracks the web visits of individual web visitors at the web site that the web server 102 serves. The web server 102 tracks and identifies the web visitor, obtains category information for the viewed content, and logs the visit, including its time or duration. Once this data is gathered, the ProReach system 100 analyzes the data in order to evaluate the web visitor, and create or update a profile of the web visitor. The resulting profile of the user (or other profiles that are affected by the user's visits) can be used for marketing purposes, for page composition or for driving banner ads.

The various ProReach systems make use of ProReach Global Services 110. These global services 110 perform various tasks that are best centralized for purposes of efficiency and integrity of information. These global services 110, which are further discussed below, include identification of web visitors, maintenance and distribution of standardized categories to the various systems 100, and mechanisms for exchanging information between systems 100.

FIG. 3 further depicts two web clients 106, 108. A web client is a conventional computer that includes a web browser, such as Netscape Communicator.RTM. or Microsoft Internet Explorer.RTM.. ProReach integrates with existing web browsers, and a special browser is not necessary to obtain the features or benefits of the invention. As an optional enhancement however, certain web clients 108 may be ProReach-enabled. That means that these clients 108 execute client-side tracking software. On a periodic basis, ProReach-enabled clients 108 automatically use ProReach Global Services 110 to upload the data of their web activities, particularly to track web events of the users of the web client on web sites that do not have a ProReach system 100. This feature allows a more complete view of a user's interest, since it allows for integration of information about all web activity of the user, not just that activity at the ProReach systems 100 and servers 102. ProReach Global Services 110 is then responsible for sending this client data to various ProReach systems 100.

Referring now to FIG. 4, to support multiple web servers 102, each ProReach system 100 is configured in a hub and spoke topology that includes a hub 204 and one or more spokes 202. Each hub and spoke is a collection of executable software modules. Overall, a ProReach system 100 executes on enterprise server-class hardware, such as a Fujitsu teamserver M800i series server, which is a large scale web-hosting server with 4 Pentium.RTM. II Xeon.TM. processors and 8 GB of memory. The software environment preferably includes Microsoft Windows NT 4.0 as the operating system, including Microsoft.RTM. Internet Information Server.RTM. 4.0 (IIS) for web site management, Microsoft Proxy Server 2.0 for firewall management, and Microsoft Site Server 3.0 for content management and delivery based on user and group profiles.

More particularly, each spoke 202 is dedicated to collecting and categorizing the visitor data from a web server 102. Once the data is collected from the web server 102, it is partially processed on the spoke 202. The partially processed data is then moved from the spoke 202 to the hub 204. At the hub 204, the data is aggregated and further analyzed to produce up-to-date visitor profiles. Note that data from the same web visitor might stream in from different spokes 202, in which case the hub 204 aggregates this data into the appropriate user profile.

ProReach is architected so that most ProReach services are within company firewalls. Web servers 102 themselves are outside the firewall. A typical ProReach configuration including a ProReach system 100 for a single web server is depicted in FIG. 5. Here, the ProReach-enabled web server 102 is outside the firewall 206. A ProReach spoke 202 is connected to the web server 102, with communication taking place using server-side plug-ins, such as Java servlets. The ProReach spoke 202 itself is connected to a ProReach hub 204, as previously described. In FIG. 5, only one spoke 202 is shown, but as described, multiple spokes 202 may be used, each supporting its own web server 102. ProReach-enabled clients 108, having tracked user visits at non-ProReach web servers 113, send their accumulated usage data to the ProReach Global Services 112. In turn, ProReach Global Services 112 routes the usage data to the appropriate ProReach systems 100. FIG. 5 also illustrates how a ProReach system 100 can partner with other ProReach systems 100. Note how the hub 204 of one ProReach system 100 communicates with other ProReach systems 100. Such communication can involve sharing of data between the systems 100.

ProReach also works across web firewalls 206. For example, suppose a company had two web servers 102, each with its own domain name and firewall 206. It might be desirable to track all the web visitors at these web sites. In this case, a different configuration of ProReach is used, in which one of the spokes 202 is attached to a local hub 204, and the other spoke 202 is remote and behind another firewall 206. The ability for ProReach to work across firewalls is desirable, particularly when web sites belonging to different organizations or companies are to be grouped together as a logical unit, with the data of their web visitors shared.

A. Global Services

In one embodiment, ProReach provides a number of global services 112. These services are provided by a master host system and server, such as may be provided by an overall provider of ProReach systems 100. The global services are shown in FIG. 6.

Global Identifier Service 502. This global service allocates global identifiers [GIDs] and provides other functionality related to visitor identification. A GID is used to globally identify a web visitor, so that the visitor's web events and other usage data can be properly collated when received from many different ProReach systems 100 or ProReach enabled web clients 108.

Global Category Tree Service 504. This global service maintains and distributes a standard collection of categories. This allows the different ProReach systems 100 to use a common set of categories for describing and categorizing web content. In this manner, interest information from many different web sites can be measured and evaluated against a common framework of categories.

Global Upload Service 506. This global service works with the client tracking software to receive uploaded web activity data from the various ProReach enabled web clients 108. This global service then distributes this web activity data to the appropriate ProReach systems 100.

Global Client Management Service 508. This global service helps manage ProReach enabled web clients 108, by keeping a list of all such clients, and by maintaining this list (e.g., adding new ProReach enabled web clients 108 and deleting those no longer in operation).

Global Yellow Pages 510. This global service maintains an LDAP directory of ProReach systems 100.

Global Exchange Policy Service 512. This global service allows an individual ProReach system 100 to describe the business rules under which it will exchange web visitor information with other ProReach-enabled systems 100.

III. Basic System Processing

ProReach's job is to capture user data, subject it to analysis and produce a visitor profile summary for any individual visitor or groups of visitors collectively. The visitor profile summary describes the interests of that given web visitor or group. There are many different processes involved in producing this web profile summary. These generally are as follows:

tracking web visitor activity on the web server;

tracking web visitor activity from the web client;

categorizing the documents that the web visitor views and determining their weights;

aggregating web events by time, by user and by category;

identifying the same web visitor when he visits different web sites;

aggregating the data, at different web sites, for the same web visitor, so that a global profile of the web visitor results;

category discovery and maintenance

In the first of the next two sections, we will walk through some of ProReach's key application processes. Following that section, we will look at category discovery and optimization.

A. ProReach Functional Overview

In this section, we describe the basic processing steps that take place, in order to show how data flows through a basic ProReach system 100. We will also view in more detail the structural features of a ProReach system 100.

Because we want to concentrate on these basic processing steps, we will make some simplifications and only explore a specific scenario. We will explore a scenario where the ProReach-enabled web server 102 only tracks web visits based on cookies resident on web clients 106. So while ProReach also tracks web visitors based on their login name and other information, this tracking is not shown below. We also assume here that the web client 106 allows cookies, which is true for most web clients.

In general, the overall process of tracking web activity is as follows:

A web client 106 visits a ProReach-enabled site 100.

The ProReach-enabled web server 102 redirects the web client 106 to a global service web server 112. This web server 112 is responsible for allocating global identifiers (GID) that identify web visitors. Web visitors are identified as specifically as is possible. Sometimes the identification pinpoints the actual person; sometimes it can only identify the web client 106 being used.

The global service web server 112 redirects the web client 106 back to the original ProReach-enabled web site 100 with extra data that identifies the web visitor.

The ProReach-enabled web server 102 takes this identifier and logs the web hit on a log. The entry on the log contains this identifier.

The web server 102 reads from this log of web hits and sends the data to a ProReach spoke 202. Processing of each entry on this log begins on the spoke 202. The category of the web pages viewed by the visitor is computed. At this point the ProReach system 100 has determined who has accessed the web page and what the content of the web page is about.

Over time, a visitor's repeat visits to a web site 100 will result in a history of web events associated with that web visitor. ProReach manages this data by subjecting the data to an aggregation process. This process keeps the data as compact as possible while retaining useful analytical properties. In particular, the aggregation process summarizes web events into more generalized descriptions of web activity, including summaries across users and/or categories.

After the aggregation step is completed, a profiling step takes place. This profiling step identifies the interests of a web visitor. The result is a web visitor profile summary of his or her interests.

The above steps demonstrate basic processing steps used to track, categorize and aggregate web visitor data. The result of these steps is a database of web visitor profiles that can be explored by web marketers, as well as being used for other purposes (selecting banner ads, personalizing content or services). Alternatively, a web marketer can then explore the population of his web visitors by using query tools.

These steps will now be explored in detail in the remainder of this section.

Referring to FIGS. 7a-7c, there is shown the web server 102 portion of a ProReach system 100. The web server 102 includes a profile servlet 730, a category servlet 731, a logger 702, and a visitor log 704.

We begin our processing with a visit from a web client 106. The web client 106 accesses 701 a web page hosted by the web server 102. The Logger 702 requests a GID for the web client. To get this GID, the Logger 702 makes a request to the global identifier service 602 of the global ProReach service 112. This request is initiated by redirecting 703 the web client to a ProReach web server that is part of ProReach global services 612, via the HTTP protocol. In FIG. 7c, this web server can check whether the request from the web client 106 includes a ProReach cookie. If the ProReach cookie shows up in the request, the GID is extracted from the cookie. This is the GID that identifies this web client 106.

If the request does not include the ProReach cookie, and hence if the web client does not have a GID, then a new GID is generated by the global identifier 612. This GID is guaranteed to be globally unique. The GID that the global service has computed is now returned 707 to ProReach-enabled web server 102 via web redirection. The actual GID is encoded in the URL, so that the ProReach-enabled web client 106 can receive 705 this URL and extract the GID from it, storing the GID in a cookie. Other information is also encoded in the URL so that the web client 106 will be sent back to the page he originally requested.

If a web visitor has configured their browser not to accept cookies, the global identifier service 602 can detect this, and will still allocate a GID for this web visitor which is returned via the redirect as a GID in the usual way. However, the value of this GID tells the ProReach-enabled web server 102 not to try and issue a session cookie and to log the events of this web visitor as an unknown or anonymous user.
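The cookie check and redirect round trip described above can be sketched as follows. This is a minimal illustration under several assumptions not found in the text: the cookie name, the query parameter names, and the reserved anonymous value are hypothetical, and a random UUID stands in for the GID generator (a random UUID is unique only with very high probability, whereas the service described above guarantees uniqueness).

```python
import uuid
from urllib.parse import parse_qs, urlencode, urlparse

COOKIE_NAME = "proreach_gid"  # hypothetical cookie name
ANONYMOUS_GID = "anonymous"   # hypothetical reserved value for cookie-less visitors


def resolve_gid(cookies, cookies_enabled=True):
    """Return the visitor's GID: reuse the cookie value if present, else allocate one."""
    if not cookies_enabled:
        # The returned value itself signals "do not set a cookie; log as anonymous".
        return ANONYMOUS_GID
    gid = cookies.get(COOKIE_NAME)
    if gid is None:
        gid = uuid.uuid4().hex  # stand-in for the global identifier generator
    return gid


def redirect_back_url(site_base_url, original_path, gid):
    """Encode the GID and the originally requested page in the redirect URL."""
    return f"{site_base_url}?{urlencode({'gid': gid, 'return': original_path})}"


def extract_gid(url):
    """On the ProReach-enabled server: recover the GID and the original page."""
    params = parse_qs(urlparse(url).query)
    return params["gid"][0], params["return"][0]


# Hypothetical round trip for a first-time visitor.
new_gid = resolve_gid(cookies={})
url = redirect_back_url("http://shop.example/proreach", "/products/golf.html?x=1", new_gid)
gid_seen_by_site, page = extract_gid(url)
```

Carrying the originally requested page in the redirect URL is what allows the visitor to land back on that page once the GID has been extracted.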

In FIG. 7d, once this GID is returned to the web server 102, the Logger 702 can uniquely identify the web client, and thus Logger logs 709 a web event record to the ProReach Visitor Log 704. This entry contains information on when the web access occurred, the GID, the URL of the web page that was accessed, and it has some other information as well. This sequence of operations is repeated for each web page or other web activity that the visitor generates.

As shown in FIG. 7d, the contents of the log 704 are periodically transferred from the web server 102 to a ProReach Spoke 202, which is inside the firewall. The spoke 202 includes various other processing modules, including a log pre-processor 706, a hub visitor log 708, an event queue 710, an event processor 712, a categorizer 714, a page metadata cache 716, and a content recognition engine 718.

Once the data reaches the spoke 202, it is pre-processed 706 for inclusion in the Visitor Log 708. The preprocessing turns the data, no matter its specific format, into web events of the standard form (e.g., an object representation of that data).

The Event Queue 710 monitors this log 708, and when new web event data is available, it fetches the data and sorts the web entries by GID. The Event Queue 710 then calls on the Event Processor 712 to process each web event in the log 708. The Event Processor 712 ensures that the web event is categorized by making a request to the categorizer 714. It is possible that the web page has already been categorized, and that this categorization information has been entered into the Page Metadata 716. Prior categorization occurs because ProReach spiders web sites in order to categorize their web pages as early as possible, so as to avoid doing categorization at runtime. However, since some web sites produce web content dynamically, ProReach cannot pre-categorize all web pages, and must be prepared to categorize web pages on a just-in-time basis.

If the URL visited by the web visitor has already been categorized, then this data can be fetched from the Page Metadata cache 716. Otherwise, the categorizer 714 makes calls on a content recognition engine 718. The content recognition engine 718 manages a database of categories. Each category represents some kind of topic, such as "sports" or "news." A web page can be matched against any number of categories. The matched categories describe what a web page is about, and provide a means by which the visitor's interests can be identified.

The content recognition engine 718 provides a score for a number of categories, each score measuring the degree to which the page may be said to be about the category. Preferably, a score is provided by the content recognition engine 718 for each category in the category database; alternatively a score is provided only for a selected number of top scoring categories (e.g., top 10 highest scoring categories).

When the content recognition engine 718 completes its categorization process of a given web event, it updates the Page Metadata cache 716 for the web event to include a list of the scored categories and their respective scores. Once the cache is updated, the categories of the web event and their respective scores are returned to the Event Processor 712. The Event Processor 712 modifies the web event record to include the results of the categorization for that web event. Alternatively, the categorization information may be stored separately from the web event, and accessed from the web event by some other means, such as a URL. Once the web event record has been categorized, the web event is ready to be sent off to the next stage of processing. That next stage of processing is on the ProReach Hub 204. More generally, the categorized web events are streamed from the ProReach spoke 202 or spokes to the hub 204.
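The lookup-then-categorize flow described above can be sketched as follows. This is a minimal illustration only, not the implementation of the described system: the names `PageMetadataCache`, `categorize_event`, and the `score_page` callable standing in for the content recognition engine are hypothetical, and the cache is assumed to be an in-memory dictionary keyed by URL.

```python
from typing import Callable, Dict, Optional

# Hypothetical type: a mapping of category name -> score.
CategoryScores = Dict[str, int]


class PageMetadataCache:
    """In-memory stand-in for the page metadata cache (assumed keyed by URL)."""

    def __init__(self) -> None:
        self._by_url: Dict[str, CategoryScores] = {}

    def get(self, url: str) -> Optional[CategoryScores]:
        return self._by_url.get(url)

    def put(self, url: str, scores: CategoryScores) -> None:
        self._by_url[url] = scores


def categorize_event(event: dict, cache: PageMetadataCache,
                     score_page: Callable[[str], CategoryScores]) -> dict:
    """Attach category scores to a web event, scoring the page only on a cache miss."""
    scores = cache.get(event["url"])
    if scores is None:
        # Just-in-time categorization: the page was not categorized earlier.
        scores = score_page(event["url"])
        cache.put(event["url"], scores)
    # Return a copy of the event record extended with the categorization.
    return {**event, "categories": scores}


calls = []


def fake_engine(url: str) -> CategoryScores:
    # Hypothetical engine: records each call and returns fixed scores.
    calls.append(url)
    return {"sports": 900, "news": 100}


cache = PageMetadataCache()
first = categorize_event({"gid": "g1", "url": "/scores"}, cache, fake_engine)
second = categorize_event({"gid": "g2", "url": "/scores"}, cache, fake_engine)
```

In this sketch the second event for the same URL is served from the cache, so the stand-in engine is invoked only once.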

In FIG. 7e, there is shown the features of a ProReach hub 204. The hub 204 includes an aggregator queue 722, an aggregation system 724, a profiler 726, a database agent 728, and a profile database 720.

The hub 204 maintains a database 720 of web profiles. Each profile in this database 720 is uniquely identified by a GID. In each web profile, the web events of the web visitor are maintained by category. An exemplary web profile will describe an individual's (or group's) interest in each of a number of categories included in the category database.

The ProReach hub 204 takes newly categorized web events and integrates this data with the data of an existing web profile; this updates the profile of the visitor with the most current information about their interests, as captured in the web events generated from their web activity. If a web profile does not exist for the web visitor, then one is created.

The first step of this aggregation process is to fetch the needed web profile from the database 720, using the web visitor's GID to select the web profile. When a web event record or a set of event records is aggregated, the records are processed in groups in which each web event has the same GID.
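Processing events in same-GID groups can be illustrated with a short sketch. The event dictionaries and the function name `group_by_gid` are hypothetical; the text does not prescribe a particular grouping mechanism.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical web events; only the fields needed for grouping are shown.
events = [
    {"gid": "b", "url": "/news"},
    {"gid": "a", "url": "/sports"},
    {"gid": "b", "url": "/sports"},
]


def group_by_gid(web_events):
    """Sort events by GID, then return a list of (gid, [events]) groups."""
    # sorted() is stable, so arrival order is preserved within each GID.
    ordered = sorted(web_events, key=itemgetter("gid"))
    return [(gid, list(group))
            for gid, group in groupby(ordered, key=itemgetter("gid"))]


groups = group_by_gid(events)
```

Each group can then be aggregated against a single profile fetch for its GID.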

Once the web profile for a GID is retrieved, the Aggregator System 724 performs an aggregation operation for all categories of documents that this web visitor has accessed. In one preferred embodiment, a threshold value for updating category weights is established, and only those categories for which the document scored higher than the threshold are updated.
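A minimal sketch of the threshold rule follows. The text states only that categories scoring above the threshold are updated; the additive update used here, and the names `aggregate_scores`, `profile`, and `doc_scores`, are assumptions made for illustration.

```python
def aggregate_scores(profile_weights, category_scores, threshold):
    """Return a new weight mapping, updating only above-threshold categories.

    Assumption: an update adds the document's score to the running weight.
    """
    updated = dict(profile_weights)
    for category, score in category_scores.items():
        if score > threshold:
            updated[category] = updated.get(category, 0) + score
    return updated


profile = {"sports": 1000}
doc_scores = {"sports": 700, "news": 50}
new_profile = aggregate_scores(profile, doc_scores, threshold=100)
```

Here the "news" score falls below the threshold, so no "news" weight is created for the visitor.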

Generally, the aggregation system 724 updates the various user, group, and category summaries as described with respect to FIG. 2. Each of these summaries is held in its own web event record, which identifies both the user or user group or the category to which it applies, and the appropriate other aggregated weight values. Because of this approach, ProReach can retain large amounts of visitor data at lower cost, and this data is of higher quality because it is designed to support the kind of operations needed by web marketers, that is, analysis of user interests and trends.

When the aggregation process is completed, the next step is to update the visitor's profile. Profiling 726 is a task that identifies the interests of a web visitor. To understand how this works, we first explore a brief example. Suppose there is a web marketer who wants to identify "sports enthusiasts" visiting the web site. The web marketer first defines what he means by "Sports Enthusiasts". There are many ways that this term could be defined:

Absolute Interest Magnitude Definition: A sports enthusiast is someone who looks at sports-related web pages at least twenty times every year;

Relative Interest Frequency Definition: A sports enthusiast is someone who looks at sports-related web pages more frequently than he looks at other web pages. For example, a sports enthusiast is someone who, if they look at 100 web pages, tends to look at at least ten sports-related web pages.

Comparative Interest Frequency Definition: A sports enthusiast is someone who looks at sports-related documents much more often than other web visitors.

Each of these three candidate definitions for the term Sports Enthusiast describes the interest as a function of the weight or weights of a "sports" category or categories, as determined from the web activity of the user.
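The three candidate definitions can be expressed as simple predicates. The sketch below is illustrative only: the function names are hypothetical, the defaults of 20 views per year and 10 of 100 pages come from the examples above, and the `factor` of 2.0 used to model "much more often" is an assumption.

```python
def absolute_interest(sports_views_per_year, minimum=20):
    """Absolute magnitude: at least `minimum` sports page views per year."""
    return sports_views_per_year >= minimum


def relative_interest(sports_views, total_views, min_fraction=0.10):
    """Relative frequency: sports pages make up at least `min_fraction` of views."""
    return total_views > 0 and sports_views / total_views >= min_fraction


def comparative_interest(user_sports_views, population_mean_views, factor=2.0):
    """Comparative frequency: views exceed the population mean by `factor` (assumed)."""
    return user_sports_views > factor * population_mean_views
```

A visitor with 10 sports views out of 100 total satisfies the relative definition, while the same visitor fails the absolute definition if those 10 views are all the sports views in a year.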

Any of these types of definitions (or others) may be used to define an interest with respect to any set of categories. Logically, an interest may be understood as a query, such as one uses in SQL, against the profile database 720 that determines whether a web visitor does or does not have that interest. The query can be defined to evaluate the weights of any combination of categories. With ProReach, a web marketer can name and define such interests using a simple query tool, such as a query by example tool, that operates on the database 720 via database agent 728.

Once an interest is defined, the new interest is added into a given ProReach system 100 and activated. Once an interest is activated, it is the responsibility of the profiler 726 to take each interest and test whether a given web visitor has that interest or does not. When profiling takes place, each activated interest is applied to the web visitor's data to determine if the visitor has that interest. The result is a profile which identifies which interests are applicable to the visitor.

For example, imagine that there were five active interests in the database 720, such as Sports Enthusiast, Conservative, Hobbyist, Recent Divorcee and Planning For Retirement, each of which has been previously defined by a set of criteria, such as described above, with respect to various categories. Thus, the Conservative interest may be defined by a relative frequency of accessing pages which are categorized in categories deemed to be associated with conservative ideas or beliefs; the Recent Divorcee interest may be defined by comparative frequency (to identify most current behaviors) of viewing web content related to divorce attorneys.

Such a set of interests are stored in the database 720 and applied by the profiler 726 to a web visitor's data. The query associated with each interest is applied (as a predicate) and the result of this predicate evaluation is a boolean value. From this processing, a set of results would flow, for example:
     INTEREST                   RESULT
     Sports Enthusiast          YES
     Conservative               YES
     Hobbyist                   NO
     Recent Divorcee            NO
     Planning For Retirement    YES


Note that here the results are Boolean values, indicating whether or not the visitor had the interest. In an alternative embodiment using fuzzy set membership, each interest result may be expressed as a measure of the degree to which the user has the interest (e.g., a scaled value between 0.0 and 1.0).

Based on a result such as this example, the web profile of this web visitor is then updated 723. Preferably, a web profile summary record in the database 720 lists the interests of the web visitor. In one embodiment, the web profile summary record contains an interest field which lists the interests of the web visitor, as determined by the profiler 726. After profiling completes, this interest field is updated. Each interest is associated with an interest identifier, and so it is actually a sequence of integers that is assigned to this interest field, such as

{101,321,19}

For example, if the SportsEnthusiast interest has an ID of 101, and the Conservative interest has an ID of 321, and the PlanningForRetirement interest has an ID of 19, then this means the same thing as:

{SportsEnthusiast, Conservative, PlanningForRetirement}.

Each such interest ID thus concisely identifies an interest for that web visitor.
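The profiling step, in which each active interest is applied as a predicate and the matching interest IDs are stored, can be sketched as follows. The IDs 101, 321, and 19 come from the example above; the IDs 7 and 55, the category names, the weight values, and the threshold of 50 are all hypothetical.

```python
# Each active interest is an (interest ID, predicate over category weights) pair.
# IDs 7 and 55, the category names, and the thresholds are hypothetical.
active_interests = [
    (101, lambda w: w.get("sports", 0) >= 50),        # SportsEnthusiast
    (321, lambda w: w.get("conservative", 0) >= 50),  # Conservative
    (7,   lambda w: w.get("hobbies", 0) >= 50),       # Hobbyist
    (55,  lambda w: w.get("divorce", 0) >= 50),       # RecentDivorcee
    (19,  lambda w: w.get("retirement", 0) >= 50),    # PlanningForRetirement
]


def profile_visitor(weights, interests):
    """Evaluate every interest predicate; return the IDs of those that hold."""
    return [interest_id for interest_id, predicate in interests
            if predicate(weights)]


visitor_weights = {"sports": 80, "conservative": 60, "hobbies": 10,
                   "divorce": 0, "retirement": 70}
interest_field = profile_visitor(visitor_weights, active_interests)
```

With these assumed weights, the resulting interest field contains 101, 321, and 19, matching the example sequence given in the text.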

Interests are useful because they help categorize web visitors. However, interests are distinct from categories, in several ways. First, interests describe users or groups of users, whereas categories describe web content. Second, interests are formed from combinations of multiple factors, including category scoring of visited web content, demographics, and the like and thus interests are not easily constrained to hierarchical parent-child relationship, as typified by the categories of the content recognition engine 718.

As ProReach profiles web visitors, it computes the interests of each web visitor, and then recomputes them as needed. When this computation is performed, the updated profile summary is then stored 722 back in the database 720 via database agent 728. The result is an updated web profile, with all the data relating to categories, and with all the interests of that web visitor updated as well.

Other ProReach tools, such as the query tools, can use this data to quickly pinpoint groups of ProReach web visitors. For example, a query can be made to identify all web visitors who are both "sports enthusiasts" and "conservative." Alternatively, a query could be made to identify all web visitors who are "sports enthusiasts" but who are not "conservative."
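The two example queries amount to set-membership tests over stored interest IDs. The sketch below assumes a hypothetical in-memory mapping from GID to a set of interest IDs, using 101 for "sports enthusiast" and 321 for "conservative" as in the earlier example; a real deployment would issue the query against the profile database instead.

```python
SPORTS, CONSERVATIVE = 101, 321

# Hypothetical profile summaries: GID -> set of interest IDs.
profiles = {
    "gid-1": {101, 321, 19},
    "gid-2": {101},
    "gid-3": {321},
}

# Visitors who are both sports enthusiasts and conservative.
both = sorted(gid for gid, ids in profiles.items()
              if SPORTS in ids and CONSERVATIVE in ids)

# Visitors who are sports enthusiasts but not conservative.
sports_not_conservative = sorted(gid for gid, ids in profiles.items()
                                 if SPORTS in ids and CONSERVATIVE not in ids)
```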

At this point, we have shown how interests are defined and how profiles are updated to reflect the web visitor's current set of interests. FIG. 7c indicates how the web server 102 can access the web profile for any web visitor. The profile servlet 730 on the web server 102 fetches 731 the web profile of any known web visitor based on a GID, which is obtained either from a cookie resident on the web client 106, or from the global identifier service. It is this ability that makes it desirable to identify the GID of the web visitor. Once the web server 102 has access to the visitor's GID, it can use it to selectively fetch data from the web visitor's corresponding profile. Given the interests in the profile, the web server 102 can dynamically compose a web page so as to maximize the content that would be of greatest interest to the web visitor, for example, by selecting content that most closely matches the categories that the visitor is interested in.

ProReach has many other capabilities, such as the tracking of web activities from the web client; it supports the exchange of web profile data between ProReach systems. It supports facilities helping web marketers identify and contact prospects. It supports advanced categorization techniques that allow businesses in vertical markets to create categories suited to their business. It also supports categorization techniques that automate the process of developing and maintaining categories.

B. Category Discovery And Maintenance

This section introduces ProReach's processes for category discovery and category maintenance. We will describe these processes by example.

1. Category Discovery

Suppose a ProReach system 100 has the following categories for computer peripherals, as managed by its content recognition engine 718:
     Category              Number of Documents
     Storage device        500
       CD Rom              80
       Hard drives         200
       Zip drives          40
       Floppy drives       100


The Storage Device category is the parent category for the other categories. First, it should be noted that the total number of documents in the subcategories is 420, whereas there are 500 documents categorized as Storage Device documents. This suggests that there is some other category in these documents that is related to storage, but which is distinct from the existing subcategories.

The category discovery process uses statistical analysis to look for the hidden categories in some existing category. As will be further described below, category discovery identifies categories based on frequency and relationships between words appearing in a set of documents. In the example above, this category discovery process might find that many storage documents were about DVDs. It would then identify "DVDs" as a potential new category. In one embodiment, the category discovery process does not automatically create a new category. Instead, any category change suggested by the category discovery process is checked and confirmed by an operator. This interaction with the operator is desirable for a number of reasons. First of all, the category discovery process may make many valuable suggestions, but it may not always be right. Some degree of human guidance is useful to ensure that only meaningful categories get added.

Suppose in the above case that the operator confirmed that a new DVD category should be added. Once confirmation is given, the rest of the process is automatic; the category can then be used immediately by the content recognition engine 718 to categorize documents. Existing documents may also be re-evaluated to determine their category score.

One issue in applying the category discovery process is deciding when a search for new categories should take place. In one embodiment, a search for new categories takes place when any of the following are true:

There are a large number of documents categorized within a given category (e.g., more than a predetermined number or percentage of all categorized documents); or

There are signs of a missing category (e.g., parent category having more than a predetermined number or percentage of documents relative to its subcategories); or

There are a large number of web visitors accessing the documents with a given category (e.g., more than a predetermined number or percentage of visitors within a selected time period).

Also some branches of the category tree will likely exhibit more volatility over time (e.g., high technology). Hence, the historic volatility of that section of the category tree may also be a factor.
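The three triggers listed above can be combined into a single check, sketched below. The function name and all three threshold defaults are hypothetical placeholders for the "predetermined" values the text refers to, and the sketch assumes `parent_docs` and `total_docs` are positive. Historic volatility is not modeled.

```python
def should_search_for_new_categories(parent_docs, child_docs, total_docs, visitors,
                                     max_share=0.25, max_gap_share=0.10,
                                     max_visitors=1000):
    """Return True if any discovery trigger fires (all thresholds are assumed)."""
    # Trigger 1: the category holds a large share of all categorized documents.
    large_category = parent_docs / total_docs > max_share
    # Trigger 2: the parent has many documents not covered by its subcategories.
    uncovered = parent_docs - sum(child_docs)
    missing_category = uncovered / parent_docs > max_gap_share
    # Trigger 3: many visitors accessed documents in this category.
    heavy_traffic = visitors > max_visitors
    return large_category or missing_category or heavy_traffic


# Storage device example: 500 parent documents and four subcategories.
storage_children = [80, 200, 40, 100]
search_needed = should_search_for_new_categories(
    parent_docs=500, child_docs=storage_children, total_docs=10000, visitors=10)
```

For the storage device figures, only the missing-category trigger fires, because a noticeable share of the parent's documents fall outside every subcategory.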

2. Category Maintenance

Category discovery pertains to discovering new categories. Category maintenance pertains to maintaining and improving existing categories. As with category discovery, the process of category maintenance is preferably an advisory process, which suggests changes to the categories. It does not execute those changes unless confirmation is given; alternatively, the changes may be automatically implemented.

In particular, category maintenance provides suggestions for:

Removing a category; and

Altering the training documents related to a category.

Like category discovery, category maintenance involves statistical analysis. For example, a suggestion to remove a category might be made if there are very few web pages concerning this topic and there are very few people looking at such documents. Few documents and few viewers of them suggests that the category is a candidate for deletion.

For example, training documents are selected based on scoring; if the category scores are below a threshold the training documents are reselected. Categories are moved when the keywords associated with the category are not scoring sufficiently high.

To create a category:

Select category

Select training documents

Score training documents, to generate keywords

Human judgment as to whether the keywords are reflective of the category.

IV. ProReach Systems With Alliances

FIGS. 1-6 show how ProReach spokes 202 feed web activity data to a central hub 204 of the ProReach system 100. This hub-and-spoke topology handles one or more web servers 102 in a flexible and scalable fashion. ProReach, however, goes beyond this local accumulation of web events. Profiles of visitors maintained on a hub 204 are valuable, but the value of the information increases via aggregation across multiple hubs and ProReach systems 100. This aggregation can be accomplished by the merging of profiles from multiple sources, even when these sources of information belong to separate companies.

In existing systems, companies that might benefit from the sharing of visitor profile information are reluctant to do so for several reasons. There is no infrastructure to facilitate this sharing, so sharing the information would require a huge initial outlay of software support. There are also ownership and use issues in respect to the profile information itself: which companies own the profile information, and who decides?

In the present invention, alliances are a means of facilitating the sharing of profile information between businesses, and overcoming these barriers to sharing. By doing so, ProReach enables business-to-business sharing of data that is mutually beneficial to the business parties. In many cases, alliances are formed to service the businesses clustered around some vertical market. For example, there might be an alliance for pharmaceuticals, or there might be an alliance for oil-related businesses. Referring to FIG. 8, each ProReach system 100 would be a member of zero, one or more alliances 800. Membership in an alliance is voluntary. The members of an alliance 800 send copies of their profile data to the alliance 800. This data is then aggregated into an alliance profile. An alliance profile is an aggregation of the profiles collected from the alliance members.

Of course, the same web visitor may visit multiple ProReach systems 100 that are members of the same alliance 800. When different local hubs send profiles for the same web visitor, the alliance 800 can take these separate local profiles and assemble them together into a single alliance profile for that web visitor. Using the GID, the alliance can easily compute which profiles belong to the same web visitor, and correctly merge the information in these profiles to avoid duplication.
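Assembling one alliance profile per visitor from several local profiles can be sketched as a merge keyed on the GID. The text does not specify how category information from different members is combined; summing per-category weights is an assumption made here, and the name `merge_local_profiles` is hypothetical.

```python
from collections import defaultdict


def merge_local_profiles(local_profiles):
    """Combine (gid, weights) pairs into one alliance profile per GID.

    Assumption: weights for the same category are summed across members.
    """
    alliance = defaultdict(lambda: defaultdict(int))
    for gid, weights in local_profiles:
        for category, weight in weights.items():
            alliance[gid][category] += weight
    # Convert nested defaultdicts to plain dicts for storage or comparison.
    return {gid: dict(weights) for gid, weights in alliance.items()}


# Hypothetical submissions from two alliance members; gid-1 visited both.
submitted = [
    ("gid-1", {"sports": 40}),
    ("gid-2", {"news": 10}),
    ("gid-1", {"sports": 20, "news": 5}),
]
merged = merge_local_profiles(submitted)
```

Because the GID is the merge key, the two local profiles for the same visitor yield a single alliance profile rather than two separate entries.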

In exchange for providing their local profile information to the alliance, the members of the alliance 800 get some degree of access to the alliance profiles. A ProReach system 100 can be a full access, limited access or minimum access member of an alliance 800. The responsibilities and rewards of each membership level vary.

A full access member gets the maximum allowed access to vertical profiles. A full access member must also provide a maximum amount of information from its local profiles.

A limited access member gets a moderate degree of access. It must provide a moderate amount of information from its local profiles.

A minimum access member gets the least amount of access to vertical profiles. It is required to provide a minimal amount of profile information from its local profiles.

Participation in a vertical alliance allows each member controlled access to the jointly produced alliance profiles. Rewards and responsibilities are rationalized through the small number of membership levels. Memberships have to specify what categories of information they will provide and in what volume, and for what kind of web visitor. Hence this scheme provides a credible incentive for individual ProReach systems 100 to participate in various alliances.

ProReach systems 100 benefit from being members of an alliance by having access to the alliance profiles of the web visitors. Because the alliance profiles are aggregated over multiple web sites and ProReach systems 100, they provide a more accurate and comprehensive assessment of the interests of the web visitor. This in turn allows a given ProReach system 100 to more accurately target web content to the web visitor when the visitor visits the ProReach system 100 that is an alliance member.

V. Aggregation

In this section we describe in detail one embodiment of the process by which web events are aggregated by aggregation system 724 in conjunction with the aggregation queue 722. The aggregation queue 722 stores a set of web event records that are unconverted. These records are updated to the queue 722 by the event processor 712 on the spoke 202, in the order in which they are received, that is, as they come in from one or more spokes. Overall, the queue will store the web events generated by many different users over some time period.

Referring to FIG. 9, there is shown the logical structure of the aggregation queue 722. The aggregation queue 722 stores a collection of web events 900, each of which represents an instance of some visitor interacting with an item of web content. Each web event 900 contains a user identifier 902 (preferably the GID), a start time 904 of when the web activity began, a duration (in seconds) 906 of the activity (if the duration is not provided, the default is 1 minute), a type (representing either a transaction, a clickthrough or a page view), a URL (the domain name of the web site) and a category vector 908. The category vector 908 includes a list 910 of category identifiers, and respective category scores. Each category score indicates the degree to which the web content is evaluated by the content recognition engine 718 to be about the category. Preferably, there is a category score for each category stored by the content recognition engine 718. Thus, for example, if there are 1,000 categories used by the content recognition engine 718, then the vector 908 contains 1,000 <category ID, score> tuples. In one embodiment, the category scores are in a range from 0 to 1,000,000, but any useful range can be used with the appropriate scaling factors.
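The web event record just described can be modeled as a small data class. This is an illustrative sketch: the field names, the use of epoch seconds for the start time, and the string encodings of the event type are assumptions, while the 60-second default duration and the 0 to 1,000,000 score range follow the description above.

```python
from dataclasses import dataclass, field
from typing import Dict

EVENT_TYPES = {"transaction", "clickthrough", "page_view"}  # assumed encodings
MAX_SCORE = 1_000_000


@dataclass
class WebEvent:
    user_id: str       # preferably the GID
    start_time: float  # when the activity began (epoch seconds, assumed)
    event_type: str    # one of EVENT_TYPES
    url: str           # domain name of the web site
    # Category vector: category ID -> score.
    category_vector: Dict[int, int] = field(default_factory=dict)
    duration: int = 60  # seconds; defaults to 1 minute when not provided

    def __post_init__(self) -> None:
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event_type}")
        for category_id, score in self.category_vector.items():
            if not 0 <= score <= MAX_SCORE:
                raise ValueError(f"score out of range for category {category_id}")


event = WebEvent("gid-1", 0.0, "page_view", "example.com", {7: 250_000})
```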

Referring now to FIG. 10 there is shown an illustration of the components of the aggregation system 724. The aggregation system 724 is generally responsible for various types of services. First, a Daily Aggregation System 919 is responsible for generating daily aggregates from the web events that occur on the web server 102. Second, a Dimensional Aggregation System 941 is responsible for combining the daily aggregates by dimensional combining into the various User and Category complexes illustrated in FIG. 2. Third, a User Group System 950 is responsible for defining and maintaining definitions of user groups. A Profile Service 955 is responsible for maintaining individual user profiles, and responding to queries regarding these aspects. All these services are within the scope of the aggregation system 724.

The Daily Aggregation System comprises a Handler object 920, a Calculus object 922, a Parser object 924, and an Aggregator object 926. The aggregation queue 722 is also best understood as being an entry point to the Daily Aggregation System 724 (and was illustrated separately in FIGS. 7a-7d for convenience).

An Event Dispatcher 930 monitors all the activities within all the services of the Aggregation System, and fires events to whoever is interested in listening to them. The Event Dispatcher is not part of the services within the Aggregation System. It simply watches all the activities going on inside the Aggregation System, like a camera.

The Daily Query object 932 is part of the Daily Aggregation System and is responsible for all queries concerning daily aggregates. The Daily Query object handles all types of queries regarding interests of users, as described above, including defining interests, and identifying users having particular interests (on daily basis). Queries are processed by a query language interpreter 944, which uses a query language 946. The handler 920 exports the interface of the Daily Aggregation System, and manages the remaining components of the daily aggregation service during the daily aggregation process of packets of web events.

The Combiner 938 is part of the Dimensional Aggregation System and is responsible for performing dimensional aggregation as scheduled by members of ProReach. More particularly, the Combiner 938 is responsible for the dimensional combining of the daily aggregated web events (or of the complexes) into higher level summaries (e.g., across times, users, groups, and categories), such as illustrated in Levels 1-4 of FIG. 2, according to tasks scheduled by members.

The update object 940 is responsible for updating the Daily Aggregate whenever the Daily Aggregation System processes a packet of web events.

The database 720 stores the aggregated information from the web events in a number of different tables. These are as follows:

User Table: This table stores information identifying and describing each user. The fields of this table include: userID, last name, and first name. This table is indexed by userID.

UserID Contact Table: This table contains the following columns regarding the contact address: userID, address, address2, city, state_prov, zipcode, country, and e-mail.

Demographic Table: This table contains demographic information about users. It contains the following columns: userID, gender, age, education, job.

Members Table: This table contains information about the members of ProReach System, that is the people (or companies) that have an account with ProReach System. This table contains the following columns: ID#, lastname, firstname, e-mail, login, password, URL, account type. The URL represents the domain name of the web site owned by the member. If the member does not own a web site, the URL column will be empty. The account_type represents the type of account the member has. According to this type, the member will have access to certain services and other services might be denied.

Categories Table: This table stores all of the categories used by the content recognition engine 718. The table includes the fields: categoryID, category name, and parent categoryID. The table is indexed by categoryID, and secondary indices on name and parent. The parent categoryID is used to construct a hierarchy of categories, and is further used to aggregate low level category information into higher categories.

Daily Aggregate Table: Each row in this table stores daily aggregate objects for a specific user-category combination that occurred on a given day. This information corresponds to the data at Level 0 of the Aggregation Tree shown in FIG. 2. The fields include: userID, categoryID, weight, Deviation, Day, and Trend.

Deviation stores a standard deviation of the category weight over the given time period for the specified (by category ID) category.

Day stores a date or day number.

Trend stores a string or encoded value that describes the shape or slope of a curve of the user's interest over the time period. For example, and as will be further explained below, the trend may describe the curve as "increasing then decreasing", or as "constant then increasing".

User Group Table: This table identifies each of the user groups, along with their size and a description of what the user group is about, or what are the rules for defining membership. The fields include: user groupID, group name, description, and size. Size indicates the number of group members.

Criterion Table: This table stores the rules which may be used to define various membership tests for any of the user groups. It is used in conjunction with the user group criterion table, below. The fields include:

Criterion ID: identifies the rule number.

CategoryID: identifies the category to which the criterion is applied.

Minimum: defines the minimum weight a user can have to satisfy the rule.

Maximum: defines the maximum weight that satisfies the rule.

Negation: specifies whether satisfying the rule results in group inclusion or exclusion.

Example: Assume that a rule had minimum=20 and maximum=80 and that negation="No." This membership rule means:

"for a user to satisfy the membership test, his/her weight for the category must be between 20 and 80"

If negation=Yes, then this means that the weight must not be between 20 and 80 in order to be a member of the group for this rule.
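A single membership rule from the criterion table can be evaluated as shown in this sketch. The class and function names are hypothetical, and treating the minimum and maximum as inclusive bounds is an assumption, since the text says only "between 20 and 80".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    criterion_id: int
    category_id: int
    minimum: float
    maximum: float
    negation: bool  # True means the weight must fall outside the range


def satisfies(criterion: Criterion, weight: float) -> bool:
    """Test a user's category weight against one membership rule."""
    in_range = criterion.minimum <= weight <= criterion.maximum  # inclusive (assumed)
    return not in_range if criterion.negation else in_range


# The example rule: minimum=20, maximum=80, with and without negation.
rule = Criterion(criterion_id=1, category_id=42, minimum=20, maximum=80, negation=False)
negated_rule = Criterion(criterion_id=2, category_id=42, minimum=20, maximum=80, negation=True)
```

A weight of 50 satisfies `rule` but not `negated_rule`; a weight of 90 does the opposite.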

User Group Criterion Table: This table associates each user group with one or more of the membership rules defined in the criterion table. The fields include: user group ID, and criterion ID.

Maintained Categories Table: This table contains the set of categories for which information (such as weight, user groups, profiles, and so forth) will be maintained. The fields include: Category ID, CurrentValue, Permanent, LowInterested, MediumInterested, HighInterested, and VeryHighInterested.

This table allows the system administrator or a marketer to choose which categories will be maintained and which categories will be disregarded. This choice can be either absolute or dynamic. In the absolute case, the marketer simply chooses a collection of categories once and for all and maintains information only about these categories. In the dynamic case, the marketer considers all categories on the same footing and gives each category a certain rank in the CurrentValue field. The CurrentValue rank can change dynamically according to how many users are interested in the category. If, for example, the CurrentValue drops under a certain level, then the category will be disregarded and removed from the table. If a new category acquires a degree of importance, then it can be added to the table.

The marketer can even combine the dynamic and absolute cases. For example, the marketer can choose a certain number of categories to be Permanent (Boolean flag), and other categories to be dynamic rather than permanent. The permanent categories will always stay in the table, and information related to them (through user groups, profiles, etc.) will always be maintained. The dynamic categories are categories that can be removed from this table whenever their CurrentValue is under a certain level. The threshold is preferably defined by a configuration file for the aggregation system 724 or by a system administrator.

The other columns of the table, such as LowInterested, MediumInterested, HighInterested, and VeryHighInterested, contain the number of users whose interest in the category is low, medium, high, and very high, as determined by their weights. In one embodiment, these interest groupings are associated with weight quartiles: if the weight is between 1 and 24, the interest is low (hence the user is counted under "LowInterested"); if the weight is between 25 and 49, the interest is medium; if the weight is between 50 and 74, the interest is high; and if the weight is between 75 and 100, the interest is very high.
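The quartile grouping can be sketched as a bucketing function plus a per-category count. The function name is hypothetical, and the sketch assumes integer weights from 0 to 100; a weight of 0, which the ranges above do not cover, is treated here as belonging to no interest level.

```python
from collections import Counter


def interest_level(weight):
    """Map an integer weight to its interest column name, or None if out of range."""
    if 1 <= weight <= 24:
        return "LowInterested"
    if 25 <= weight <= 49:
        return "MediumInterested"
    if 50 <= weight <= 74:
        return "HighInterested"
    if 75 <= weight <= 100:
        return "VeryHighInterested"
    return None  # e.g., weight 0: not counted under any column (assumed)


# Hypothetical weights of five users for one category.
user_weights = [10, 30, 60, 90, 95]
column_counts = Counter(level for level in map(interest_level, user_weights)
                        if level is not None)
```

The resulting counts correspond to the values that would be stored in the four interest columns for that category.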

Maintained Users Table: This table lists all of the users for which profiles will be maintained. The fields include user ID, Rank, and HotCategoryID. The Rank field is a value that can change according to the importance of the user. If this value is under a certain level (e.g., below the 100th or 1000th rank), the user will be removed from the table and no profile will be maintained on this user. If, however, a new user becomes very important, then this user will be added to this table and a profile will be maintained for the user.

HotCategoryID identifies the category which has the highest category weight for this user.

Profile Table: This table describes each user's profile in terms of the user groups of which the user is a member. The fields include: user ID, user group ID, Member Since, Membership Ended, Current Member, and Last Update.

Member Since: identifies the date that the user first became a member of the user group. A user can be a member of many user groups, and this membership is also dynamic and changes over time. The profile table keeps a history record of user group membership. For every user group, the profile table indicates when the user first became a member (Member Since), whether he/she is still a member (Current Member) and when the membership ended (Membership Ended). From this history record of changes between different user groups, one can derive behavior patterns that can be used to predict user reactions in the future, and use this information for marketing purposes.

User-Category Complex Table: This table stores the data for the UC (User-Category) complexes 203 described for FIG. 2. The fields include: user ID, category ID, weight, deviation, weight against categories, weight against population, trend, from and to.

User ID and category ID define the respective user-category combination.

Weight: describes the average weight of the user's interest in the category specified by category ID.

Deviation: the standard deviation for this average.

Weight against categories: stores a measure of how important the specified category is for the user relative to other categories. In one embodiment, the value of WeightAgainstCategories is the specified category's weight expressed as a percentage of the total of the user's category weights. That is, WeightAgainstCategories for category j is equal to the weight of category j divided by the sum of all category weights, and then multiplied by 100 to create a percentage (though the raw decimal value may also be used).

Weight against population: stores a measure of how important the specified category is for the user relative to all other users. In one embodiment, the value of WeightAgainstPopulation is the user's weight for the specified category expressed as a percentage of the total of all users' weights for that category. That is, WeightAgainstPopulation for category j and user k is equal to the weight of category j for user k divided by the sum of category weights for category j for all users, and then multiplied by 100 to create a percentage (though the raw decimal value may also be used).
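The two relative measures can be illustrated with a short sketch. The dictionary layout keyed by (user, category) and the function names are assumptions made for the example; only the two ratios come from the description above.

```python
def weight_against_categories(weights, user, category):
    """Percentage of the user's total category weight that falls in this category."""
    total = sum(w for (u, _), w in weights.items() if u == user)
    return 100.0 * weights[(user, category)] / total if total else 0.0


def weight_against_population(weights, user, category):
    """Percentage of all users' weight for this category that belongs to this user."""
    total = sum(w for (_, c), w in weights.items() if c == category)
    return 100.0 * weights[(user, category)] / total if total else 0.0


# Hypothetical weights keyed by (user ID, category ID).
weights = {
    ("jack", "art"): 30.0,
    ("jack", "cars"): 10.0,
    ("jill", "art"): 90.0,
}
print(weight_against_categories(weights, "jack", "art"))   # 30 out of jack's 40
print(weight_against_population(weights, "jack", "art"))   # 30 out of 120 for "art"
```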

Trend: describes the shape or slope of the user's interest in the category over the time period defined by From and To.

From and To: define the earliest and latest start time of web activity used to generate this complex.

User Complex Table: This table stores the contents of the U (User) complexes 205. The fields include user ID, weight, deviation, trend, from and to, and Categories Count. Since a user complex summarizes the user's interest over many categories, Categories Count tracks the number of categories that interest the user. This number is also the number of children of the user complex object in the aggregation tree.

The Categories Count value is used in incremental updating of the weights. When a new user-category complex 207 is formed (i.e., a new child of a user-complex) with a new weight w, then the new weight of the User complex is incremented as follows:

new weight (UComplex)=([categoriesCount*old weight(UComplex)]+w)/(categoriesCount+1)

Category Complex Table: This table stores the data for the C (Category) complexes 205 described in FIG. 2. The fields include: category ID, Weight, Deviation, Trend, From and To. As this complex summarizes over multiple users, the weight and deviation are with respect to all users over the time period defined by From and To.

Group Category Complex Table: This table stores the contents of the GC (Group Category) complexes 207. The fields include user group ID, category ID, weight, deviation, trend, from and to, and users Count. Users Count tracks the number of users in this group with respect to the selected category.

Group Complex Table: This table stores the contents of Group complexes 209, that is group summaries across all categories. The fields include user group ID, Weight, Deviation, Trend, From and To, and user Count.

The user count is used to update the weight for a group during incremental aggregation as follows:

new weight(GComplex)=((usersCount*old weight(GComplex))+w)/(usersCount+1)

where w is the weight of the new added member to the user group.

Total Complex Table: Finally, this table stores the overall Total complex 211. Every row corresponds to a total complex 211 for a defined period of time. The fields include: Start Date, LengthDays, LengthWeeks, LengthMonths, LengthYears, weight, deviation, trend, and usergroup Count. The various length fields define the time interval over which the aggregation is performed for a particular complex. The user group count contains the total number of user groups over which the total is aggregated. As with the other counts, this is used during incremental aggregation:

new weight(TComplex)=((usergroupCount*old weight(TComplex))+w)/(usergroupCount+1)

where w is the weight of a new user group complex 209 being added to the total complex.
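The three update formulas above share one form, a running mean extended by one new child. A minimal sketch follows; the function name and the convention of returning the updated count alongside the weight are assumptions for illustration.

```python
def incremental_weight(old_weight, count, w):
    """Fold one new child weight w into a complex that already summarizes `count` children.

    Returns the new weight and the new count.
    """
    new_weight = (count * old_weight + w) / (count + 1)
    return new_weight, count + 1


# Adding children one at a time gives the same result as averaging them all at once.
weight, count = 0.0, 0
for child_weight in [40.0, 60.0, 80.0]:
    weight, count = incremental_weight(weight, count, child_weight)
print(weight, count)
```

Because only the previous weight and the count are needed, the underlying child weights do not have to be re-read when a new child is added.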

We now describe the process of aggregating web events.

A. Aggregating Daily Web Events

The scheduler 934 is responsible for initiating various processes for aggregating web events into aggregated information for various periods of time. Accordingly, on at least a daily basis, the scheduler 934 invokes the handler 920 to aggregate web events from the aggregation queue 722 into daily aggregated events, as shown in Level 0 of FIG. 2. The handler 920 requests and receives a set of web events from the aggregation queue 722 for a given day. The queue 722 keeps track of which events have been retrieved, and provides, in response to a handler request, those events which have not been processed, assembling the events that correspond to the desired day.

The Aggregation System does the combining using two subsystems. A first subsystem is responsible for generating the daily aggregates from the web events (the web events are called user hits in the terminology of the Aggregation System). The second subsystem is responsible for generating the higher level of aggregation (aggregation over weeks, months, quarters, or years, across categories, across users, across user groups), that is the dimensional combining.

The Daily Aggregation Service operates as follows:

1. The Handler object takes a packet of web events from the Aggregation Queue.

2. The Handler sends the packet to the Calculus object to compute the weights of the web events and to scale them from 0 to 100.

Consider a very simple example. Suppose that the packet contains only two web events, A and B. Web event A contains only one category C1 with a score of 200 and a duration of 4 minutes. Web event B contains one category C2 with a score of 300 and a duration of 2 minutes. First, the Calculus object computes the weight for the category C1 in the web event A:

weight(C1)=score(C1)*duration=200*4=800

Since there are no other categories in the web event A, we go to the next web event B to compute the weight for the category C2:

weight(C2)=score(C2)*duration=300*2=600

Since there are no other categories in the web event B, we have finished computing the weights. Now we need to scale the numbers we have just computed, namely 800 and 600. Scaling consists of replacing 800 by:

[800/(800+600)]*100=57.14%

and replacing 600 by:

[600/(800+600)]*100=42.86%

Now, if the userID in web event A and in web event B are the same, and category C1 and category C2 are also the same, then the Aggregator object will average the two weights:

(57.14%+42.86%)/2=50%

and keep the average. If the two web events A and B have different userIDs or different categories, then we do not average, and we keep the two weights 57.14% and 42.86%.

In any case, inside the DailyAggregate object, every pair (userID, category) has only one number between 0 and 100 (a percentage number) that we call the weight of the pair (userID, category). If (within a single packet of web events) one (userID, category) pair has many percentage numbers (i.e. many weights), then we average them (this is done by the Aggregator object when the Parser gives the hash map to the Aggregator, as described next).

3. The Calculus object returns the packet (of web events, where the scores are now scaled weights) to the Handler object, and the Handler gives it to the Parser object. The Parser object transforms the data structure of the packet (from a vector to a hash map) and gives the hash map to the Aggregator object.

4. The Aggregator object computes certain quantities such as the mean, the deviation, the trend and the time interval (from, to). The Aggregator object uses the services of the Calculus object to compute these quantities. After computing these quantities, the Aggregator object calls the update methods of the Update object. The Update object has many methods (all of which start with the word update), each with its own purpose. For example, the method updateDailyAggregate( ) updates the values in the DailyAggregate object using incremental aggregation from the new hash map that was produced by the Aggregator. The method updateUCComplex( ) updates the values of all UCComplex objects using incremental aggregation from what has changed in level 0 of the aggregation tree, and so on. That is, the dimensional aggregation is automatically done (incrementally) just after the Aggregator finishes processing one packet of web events.

So the Update object provides data access between the two systems, the Daily Aggregation System and the Dimensional Aggregation System. Whenever the Daily Aggregation System finishes processing a packet of web events, the Update object starts the Dimensional Aggregation (incrementally) based on what has changed at level 0 of the aggregation tree due to the processing of the new packet of web events.

There is another aspect of the dimensional aggregation that is scheduled. We have just said that the dimensional aggregation starts automatically (and incrementally) each time the daily aggregation system finishes processing a single packet of web events. Let us explain why we also use a scheduled dimensional aggregation:

When the ProReach System is running, it will have some members. A member is a person or a company that has an account with the central ProReach System. Let's say User A is a member. User A will have a login name, a password, and an ID number that is assigned to User A by the ProReach System (when User A subscribed for the first time). When User A wants to use the services offered by the ProReach System, he first goes to the web page of the central ProReach System and logs in using his login name and password. Once he logs in, he can use the services. Here is a short list of the services that he can use:

a. Issue queries (on the web page) and the answer to the queries will show on the web page.

Queries can be on profiles, user groups, on interest for some categories, etc.

b. Create a user group and set the membership rules that a user must satisfy in order to be added to the user group User A has created. User A can schedule when to update the members of each user group, when to add new members, and how long he would like to keep each user group in the database.

c. If User A owns a web site, he can have the web traffic of his web site sent to the central ProReach system, so that ProReach can do aggregation for the web events of his site and keep the results of the analysis in ProReach's database, ready for him to query at any time.

These are only examples of the services that can be offered by the ProReach System through the web. Each service has a certain fee. There are different types of accounts. Some accounts provide users with a certain set of services, and other accounts may provide users with a larger set of services. For example, consider the case of a person (or company) that owns a web site and uses the last service of the list above (that is, service c.). Such a person has the right to choose when to do dimensional aggregation (for the web events of his/her web site) and for what time interval. Such a person can schedule these tasks from his/her account. This is what we call the scheduled dimensional aggregation tasks. This is different from the dimensional aggregation that is done automatically each time the Daily Aggregation System finishes processing a single packet of web events.

1. Transform Category Scores to Weights

The handler 920 first invokes the math package 922 to transform the category scores in each web event 900 (within a single packet of web events) into duration adjusted scores. This step normalizes the scores, and removes the need to separately store both the category scores and the duration of the event. Normalization further allows different web events to be compared as to their overall significance with respect to any category or user.

The Calculus object 922 operates as follows to support this function. As noted, each web event 900 includes a vector of categories and scores. The Calculus object 922 processes each web event 900 in turn (inside a packet of web events). For each category in the category vector of a single web event 900, the math package 922 scales each category score by the duration of the web event, and with respect to all other category scores for that web event. In one embodiment, the scaling process is as follows:

First, the Calculus object 922 adjusts each score by the duration of the web event and the type of the web event:

NewScore=Score*Duration*type

where NewScore is the adjusted category score (which we call the weight once it has been scaled from 0 to 100), Score is the original category score, Duration is the time between the start time and end time (or the duration value if directly provided; if it is not provided, the duration's default value is 1 minute), and type is a number that depends on the type of the web event. For example, if the web event is a transaction, the type would be higher than for just a clickthrough or a page view. The type of a page view is higher than the type of a clickthrough.

Next, the Calculus object 922 scales the adjusted scores relative to all of the adjusted scores:

Weight(i)=[NewScore(i)/(NewScore(1)+NewScore(2)+ . . . +NewScore(n))]*100

where n is the number of categories inside the packet of web events (for example, a packet might contain 10 web events, each containing 20 categories, for a total of 200 categories), and i iterates over each category.

The result of this process is that each web event 900 now contains a list of weights in place of the original category scores. The weights succinctly describe the significance of the category with respect to all other categories for that particular web event; more particularly, the weights describe each category's score as a percentage of all of the time-adjusted scores.
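The score-to-weight transformation might be sketched as follows. This is an illustration under stated assumptions: the event layout (a dictionary with "type", "duration" and "scores" keys), the function name `scale_packet`, and the numeric values in `TYPE_FACTORS` are all hypothetical. The text fixes only the ordering of the type values (transaction above page view, page view above clickthrough) and the 1 minute default duration.

```python
DEFAULT_DURATION_MINUTES = 1.0  # default duration stated in the text

# Hypothetical type values; only their ordering is taken from the description.
TYPE_FACTORS = {"clickthrough": 1.0, "page_view": 2.0, "transaction": 5.0}


def scale_packet(events, type_factors=TYPE_FACTORS):
    """Return one {category: weight} map per event, with weights scaled to 0..100."""
    adjusted = []
    for event in events:
        duration = event.get("duration", DEFAULT_DURATION_MINUTES)
        factor = type_factors[event["type"]]
        # NewScore = Score * Duration * type
        adjusted.append({c: s * duration * factor for c, s in event["scores"].items()})
    total = sum(sum(a.values()) for a in adjusted)
    if total == 0:
        return [{c: 0.0 for c in a} for a in adjusted]
    # Each adjusted score becomes a percentage of all adjusted scores in the packet.
    return [{c: 100.0 * v / total for c, v in a.items()} for a in adjusted]


# The two-event example: scores 200 and 300, durations 4 and 2 minutes, same type.
packet = [
    {"type": "page_view", "duration": 4, "scores": {"C1": 200}},
    {"type": "page_view", "duration": 2, "scores": {"C2": 300}},
]
scaled = scale_packet(packet)
print(scaled)
```

When every event in a packet has the same type, the type value cancels out of the ratio, so the result depends only on score and duration.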

2. Restructure Web Event Records to Collate Category Weights by User

The handler 920 next calls the parser 924, and passes in the updated packet of web events 900. The parser 924 restructures the packet for input into the Aggregator object 926. More particularly, the parser 924 collates the category weights of a number of web event records 900 first by user, and then by category.

Referring to FIG. 11, there is shown an example illustration of the processing function of the parser 924. As input, the parser takes a packet of web events 900; each web event inside the packet includes, in part, the category vector 908. As described above, the web event includes a user ID 902, start time, duration, type (that is, transaction, clickthrough or page view), URL (domain name of the visited web site) and N <category, weight> pairs, where N is the number of categories. The various web events correspond to different users, and there are likely to be many web events for the same user, since each clickthrough, transaction, page view, etc. may generate a web event.

Let us explain the task of the Parser object by a very simple example. Suppose that the packet of web events contains only 5 web events that we may call, for example, we1, we2, we3, we4, and we5 (we is an abbreviation for Web Event). Assume that the first, third and last web events (we1, we3, we5) all have the same userID (let's call this userID Jack). Assume further that a category C exists inside the three web events we1, we3, we5. We have three weights for the pair (Jack, C): w1, w3, w5. The first weight w1 is the weight of the category C inside the first web event we1:

w1=weight(Jack, C) inside web event we1

The second weight w3 is the weight of the same category C for the same user Jack, but inside the third web event we3:

w3=weight(Jack, C) inside web event we3

The third weight w5 is the weight of the same category C for the same user Jack but inside the last web event we5 of the packet:

w5=weight(Jack, C) inside web event we5

The Parser object associates the sequence (w1, w3, w5) to the pair (Jack, C). The sequence (w1, w3, w5) is a sequence of weights for different instants of time and it represents a curve (a function of time that measures the interest of the user Jack for the category C). This function is given only by this sequence (w1, w3, w5), and is thus a discrete function. Ideally, we would like to have a continuous function because a continuous function can show us clearly what the shape of the graph is. If we know the shape of this graph (as a curve), then we know how the interest of Jack in the category C is changing with time. Since the sequence (w1, w3, w5) represents a discrete function and not a continuous function, we apply the rules of Probability theory to this discrete function in order to get some information about it.
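The restructuring performed by the Parser, from a vector of web events to a hash map keyed by (user, category), can be sketched as below. The event layout, the numeric weights and the second user "Jill" are assumptions added to make the five-event example concrete.

```python
from collections import defaultdict


def collate(packet):
    """Turn a list of web events into {(user_id, category): [(start_time, weight), ...]}."""
    table = defaultdict(list)
    for event in packet:
        for category, weight in event["weights"].items():
            table[(event["user_id"], category)].append((event["start_time"], weight))
    return dict(table)


# Hypothetical packet mirroring we1..we5; we1, we3 and we5 belong to Jack.
packet = [
    {"user_id": "Jack", "start_time": 1, "weights": {"C": 20.0}},  # we1
    {"user_id": "Jill", "start_time": 2, "weights": {"C": 5.0}},   # we2
    {"user_id": "Jack", "start_time": 3, "weights": {"C": 30.0}},  # we3
    {"user_id": "Jill", "start_time": 4, "weights": {"D": 15.0}},  # we4
    {"user_id": "Jack", "start_time": 5, "weights": {"C": 30.0}},  # we5
]
table = collate(packet)
print(table[("Jack", "C")])  # the sequence (w1, w3, w5) with its start times
```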

The first thing we do with this discrete function is to compute what in Probability theory is called the expectation of the random variable. In our case, this expectation is simply the average of the weights in the sequence (w1, w3, w5). This average is called the mean and it is computed by the Aggregator object (with the help of the Calculus object). The second thing the Aggregator does is to compute the "error", or what Probability theory calls the variance of the random variable. This "error" is called the deviation. The third thing that the Aggregator object does is to determine roughly the shape of the graph of the discrete function represented by the values (w1, w3, w5). Is it the shape of an increasing curve, a decreasing curve, or some combination of the two? The shape of this curve is called the trend. Once this is done, the Aggregator object associates the data (mean, deviation, trend) to the pair (Jack, C) in some data structure (such as a hash map or a hash table). The Aggregator does all this for every pair (user, category).

When the Aggregator finishes the processing, the result (which is a hash map or hash table) forms an object that we call DailyAggregate. Therefore, a DailyAggregate is an object that contains many pairs (user, category), and every pair (user, category) has associated with it data of the form (mean, deviation, trend). There is also a time stamp, which is the time interval that was covered by the packet of web events.

In conclusion, the Daily Aggregation System processes a single packet of web events, and produces a result object that we call DailyAggregate.

When the Daily Aggregation System finishes processing a packet of web events (by producing a DailyAggregate object), it goes again to the Aggregation Queue to pick up another packet of web events. The Daily Aggregation System keeps processing web events from the Aggregation Queue by packets.

Now assume that we start the Daily Aggregation Service for the first time. The Daily Aggregation System goes to the Aggregation Queue and picks up the first packet of web events (packet1). After processing packet1, it produces an object (called a daily aggregate, or just aggregate for short). Let us call this aggregate agg1. Now the Daily Aggregation System goes again to the Aggregation Queue, takes the second packet of web events (packet2) and processes it. After processing packet2, it produces a second aggregate, which we can call agg2. This aggregate agg2 is merged with agg1 to form only one aggregate object, which we can call agg12. After fusion, the aggregates agg1 and agg2 both cease to exist, and only the aggregate agg12 exists in the database. This fusion between agg1 and agg2 is an incremental aggregation that is carried out by the Update object (through its updateDailyAggregate( ) method). The new aggregate object agg12 represents the outcome of processing a single packet of web events that is the union of the first two packets, packet1 and packet2.
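One way the fusion of agg1 and agg2 could be carried out is sketched below. The text does not specify the merge arithmetic, so this sketch assumes that each entry stores its mean together with the number of web events behind it and that means are combined as a count-weighted average; the function name and data layout are likewise hypothetical, and the deviation and trend are left out.

```python
def merge_aggregates(agg1, agg2):
    """Merge two {(user, category): (mean, count)} maps into a single map."""
    merged = dict(agg1)
    for key, (mean2, n2) in agg2.items():
        if key in merged:
            mean1, n1 = merged[key]
            # Count-weighted mean, equal to the mean over the union of both packets.
            merged[key] = ((n1 * mean1 + n2 * mean2) / (n1 + n2), n1 + n2)
        else:
            merged[key] = (mean2, n2)
    return merged


agg1 = {("Jack", "C"): (40.0, 2)}
agg2 = {("Jack", "C"): (70.0, 1), ("Jill", "D"): (10.0, 4)}
agg12 = merge_aggregates(agg1, agg2)
print(agg12)
```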

Daily Aggregate objects (or aggregates for short) are the data at level 0 of the Aggregation Tree illustrated in FIG. 2. Each day is represented by a single Daily Aggregate object.

The result is that for a given user associated with a number of web event records (as will typically occur during a visit to a web site, perhaps generating 20 to 100 or more web events), the category weights from the many different records are collected and collated in a single category hash table 1100, so that for each category, all of the weights and start times are packaged together. This allows all of the relevant information about the user's web activity during the day the web event records were collected to be easily accessed from a single data source.

3. Create Category Interest Time Model Information

The result of the prior step is one user-category table 1100 for each user that appeared on the web server 102 on the day being processed. With each of these user-category hash tables 1100, the handler 920 next calls the aggregation engine 926. The aggregation engine 926 processes these tables into category interest time model information for each user. The summarized information describes the particular user's interests in the various categories over the day for the collected web event records. The aggregation engine 926 operates as follows on each received user-category hash table:

First, for each category table 1100 the aggregation engine 926 sorts the category's weight list 1102 by the start times. The aggregation engine 926 preferably does this by calling a sorting routine in the math package 922. The result is a set of data points, essentially a curve, which describes the user's level of interest in the category over the time period from the earliest start time to the latest start time. FIG. 12 illustrates such a category interest curve 1200, for a hypothetical "Art Deco" category. The graph shows the data of 14 web events related to this category, sorted by their starting time, and shows that the user's interest was initially very high, then declined, and then rose again.

The goal at this next stage is then to capture each category interest curve 1200 mathematically, and eliminate the need to store the underlying weight and time data of the weight list. More particularly, for each category, the aggregation engine 926 determines the expected value of the category interest curve 1200 over the time period (e.g., one day). In one embodiment, the aggregation engine 926 determines the mean weight and the standard deviation of the weights in the category for the time period. The mean weight is simply the total of all weights in the weight list 1102 for the category divided by the number of weights, which will be the number of web events for this user during the time period. The standard deviation is computed normally. Again, these computations are preferably performed by the math package 922, as requested by the aggregation engine 926.
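The sorting and summary statistics for one weight list might look like the sketch below. The function name and the returned dictionary are assumptions for illustration, and the population form of the standard deviation is also an assumption, since the text does not say whether the population or sample form is intended.

```python
import statistics


def summarize(weight_list):
    """Summarize (start_time, weight) pairs for one user and one category.

    The list must be non-empty.
    """
    ordered = sorted(weight_list)  # tuples sort by start time first
    weights = [w for _, w in ordered]
    return {
        "mean": statistics.mean(weights),
        "deviation": statistics.pstdev(weights),  # population form, an assumption
        "from": ordered[0][0],   # earliest start time
        "to": ordered[-1][0],    # latest start time
    }


summary = summarize([(5, 40.0), (1, 20.0), (3, 30.0)])
print(summary)
```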

The aggregation engine 926 then creates a trend description for the category interest curve. The trend description describes the changes in the user's level of interest in the category over the time period represented by the curve. Preferably, this trend description is a string description (or its coded equivalent).

To obtain this trend in one embodiment, the aggregation engine 926 first takes the difference between the weight at the earliest start time and the mean weight. This describes whether the curve is increasing, decreasing, or constant relative to the earliest start time. Next, the aggregation engine 926 takes the difference between the mean weight and the weight at the latest start time, and again determines whether the curve is decreasing, increasing or constant. Thus, there are nine possible trends:

1. Increasing, decreasing

2. Increasing, constant

3. Increasing, increasing

4. Constant, decreasing

5. Constant, constant

6. Constant, increasing

7. Decreasing, decreasing

8. Decreasing, constant

9. Decreasing, increasing.

The aggregation engine 926 determines the appropriate time trend, and stores information for this time trend for the category. The stored information may be the strings themselves ("increasing," "constant," and "decreasing"), or code values for these (e.g., 1=increasing, and so forth). More than three times/two segments can be selected to produce more complex time trend descriptions.
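The two-segment trend of this embodiment can be sketched as follows. The function names and the optional `tolerance` parameter (below which a difference counts as "constant") are assumptions; the text does not state how close two values must be to count as constant.

```python
def direction(a, b, tolerance=0.0):
    """Describe the change from value a to value b."""
    if b - a > tolerance:
        return "increasing"
    if a - b > tolerance:
        return "decreasing"
    return "constant"


def trend(sorted_weights, tolerance=0.0):
    """Return the pair of segment descriptions for weights already ordered by start time."""
    mean = sum(sorted_weights) / len(sorted_weights)
    first, last = sorted_weights[0], sorted_weights[-1]
    # First segment: earliest weight to mean; second segment: mean to latest weight.
    return direction(first, mean, tolerance), direction(mean, last, tolerance)


print(trend([90, 20, 10, 80]))
```

An interest curve that starts high, dips, and recovers, as in the FIG. 12 description, would come out as a decreasing segment followed by an increasing one.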

The aggregation engine 926 may apply other methods to determine the time trend of the category interest curve. In another embodiment, the aggregation engine 926 selects a number of sample times in the interval, including a point at or near the earliest start time, a point at or near the latest start time, and a number of times between these two times. Then, beginning with the first selected time, the aggregation engine 926 determines whether the curve is increasing, decreasing, or constant to the next selected time, and assigns a string or code equivalent to that portion of the curve. For example, in one embodiment, three times are selected: the earliest start time, the middle start time, and the last start time. With these three times, there are two curve segments, and the aggregation engine 926 determines whether the curve is increasing, decreasing or constant in each segment.

In yet another embodiment, the aggregation engine 926 determines the time trend, by identifying the times at which the slope of the category interest curve changes from positive to negative, and storing both the start time, and the appropriate descriptive information about the time period being described.
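A sketch of this slope-change approach follows. It assumes the curve is given as (start time, weight) points sorted by start time and treats a point as a turning point only when it is strictly higher than both neighbors; the function name and the handling of flat stretches are assumptions, not details from the description.

```python
def slope_turning_times(points):
    """Return start times at which the slope changes from positive to negative.

    `points` is a list of (start_time, weight) pairs sorted by start time.
    """
    result = []
    for (_, w0), (t1, w1), (_, w2) in zip(points, points[1:], points[2:]):
        if w1 > w0 and w1 > w2:  # rising into t1, falling after it
            result.append(t1)
    return result


curve = [(1, 10), (2, 30), (3, 20), (4, 25), (5, 5)]
print(slope_turning_times(curve))
```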

With the time trend information, the aggregation engine 926 now has a complete description of the user's category interest for the given day. More specifically, it can store the following category time pattern model for subsequent use:

{User ID, Category ID, Mean Category Weight, Category Weight Standard Deviation, From, To, Trend}

where "From" is the earliest start time, and "To" is the latest start time in the sorted weight list 1102, and Trend is the description of the curve changes (either string or encoded).

The underlying category weight information from the raw web events can now be deleted, and the category time pattern model stored in the database 720 in the User-Category table. This process is repeated for each category weight list in the user-category hash table 1100.

B. Dimensional Combining.

The combiner 938 is the component that is responsible for combining the daily aggregated information into the summarized information of the various complexes. The dimensional aggregation tasks carried out by the Combiner object correspond to the tasks scheduled by members. The automatic (incremental) dimensional aggregation that occurs all the time is carried out by the Update object.

Referring again to FIG. 2, there is shown the various levels of aggregated information that are provided by ProReach, specifically which are computed by the combiner 938. The combiner 938 is designed to combine any provided set of category interest time pattern information with respect to any combination of user, category, or time period. We describe the operation of the combiner 938 with respect to the various levels of aggregated information in FIG. 2.

Generally, each of the aggregate complexes in FIG. 2 contains a weight value, as described with respect to each of the tables of the database 720. The weight value is computed by an aggregation function which operates on the weight values of all of the complexes which contribute to the complex being evaluated. For example, if