Method and system for monitoring online behavior at a remote site and creating online behavior profiles
6983379
Abstract
A method and system for monitoring users on one or more computer networks, disassociating personally identifiable information from the collected data, and storing it in a database so that the privacy of the users is protected. The system includes monitoring transactions at both a client and at a server, collecting network transaction data, and aggregating the data collected at the client and at the server. The system receives a user identifier and uses it to create an anonymized identifier. The anonymized identifier is then associated with one or more users' computer network transactions. The data is stored by a collection engine and then aggregated to a central database server across a computer network.
Claims
What is claimed is:
Description
BACKGROUND OF THE INVENTION
Thus, an order including a purchase of books and electronics would have bits 0 and 1 set. Thus, the first character would be the binary representation 0011, which is equivalent to 3. If an order includes office supplies and electronics, the first character would be 1010, or A. The second character represents the method of payment used by the purchaser according to the following table:
Thus, the two-character transaction code "3D" represents that a customer purchased books and electronics using a Visa card. One of ordinary skill in the art will readily appreciate an abundance of data that can be encoded in a transaction code other than that shown in this representative example. Additional embodiments use multiple transaction codes. In an additional embodiment, a three-character transaction code is used representing a price range as shown in the following table:
In order for the collection engine 103 to create online behavioral profiles that are unassociated with individual users, the present invention uses an anonymized identifier to represent an individual user. In this embodiment of the present invention, the anonymized identifier is obtained from the username of the individual user. However, to maintain user anonymity, it is imperative that the original username cannot be obtained from the anonymized identifier. The present embodiment applies a one-way hashing function to the login usernames. One-way hashing functions, such as Message Digest 4 (MD4), Message Digest 5 (MD5), Secure Hashing Algorithm 1 (SHA-1), etc., are commonly used in cryptography applications including digital signatures.
FIG. 2A shows an example of a unique identifier 203 being created from a username 201 and a key 204 using a one-way hashing function 202. In this example, the one-way hashing function is the Secure Hashing Algorithm (SHA) developed by the National Institute of Standards and Technology (NIST) and published as a Federal Information Processing Standard (FIPS PUB 180). The key 204 is appended to the username 201. One-way hashing function 202 is applied to the combined key 204 and username 201 to produce the anonymized identifier 203. Use of the key 204 makes it more difficult to decrypt the anonymized identifier, and using a unique key for each ISP ensures that usernames or other identifiers are unique across ISPs. One of skill in the art will readily appreciate that any other one-way hashing algorithm can be used with the present invention.
FIG. 2B shows a two-pass method for creating online behavioral profiles that are unassociated with individual users. This two-pass method is similar to the one-pass method shown in FIG. 2A. In this embodiment, a first anonymized identifier is created as discussed above with regard to FIG. 2A.
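By way of non-limiting illustration, the keyed one-way hashing of FIG. 2A might be sketched as follows. The function names, the UTF-8 encoding, the hexadecimal output, and the example key values are assumptions made for this sketch and are not specified in the description. SHA-1 is used only because it is the function named above, and a second application of the same routine with a different key is shown for the two-pass arrangement of FIG. 2B.

```python
import hashlib


def anonymize(identifier: str, key: str) -> str:
    """Hash an identifier with a key appended to it, as in FIG. 2A.

    UTF-8 encoding and hex output are assumptions of this sketch.
    """
    combined = (identifier + key).encode("utf-8")  # key appended to username
    return hashlib.sha1(combined).hexdigest()


def anonymize_two_pass(username: str, key_a: str, key_b: str) -> str:
    """Apply the one-pass routine twice with two keys, as in FIG. 2B."""
    first = anonymize(username, key_a)   # first anonymized identifier
    return anonymize(first, key_b)       # second anonymized identifier


if __name__ == "__main__":
    # Hypothetical per-ISP keys; real keys would be kept secret.
    print(anonymize("bob", "isp-one-key"))
    print(anonymize("bob", "isp-two-key"))  # differs from the line above
    print(anonymize_two_pass("bob", "isp-one-key", "third-party-key"))
```

Because the key is part of the hashed input, the same username yields different identifiers under different ISP keys, which is the property the description relies on.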
Then, the first anonymized identifier is encrypted using one-way hashing function 205 along with key B 206 to create a second anonymized identifier 207. The two-pass technique allows a third party to assist without compromising the security of the resulting collected data.
When a user logged on to an ISP accesses a Web page located on a server 105, the user's workstation 101 opens a network connection to the desired server 105 using the Internet Protocol (IP). The network packets sent between workstation 101 and server 105 contain the network address of both devices; however, the packets do not contain a username. Thus, the collection engine 103 needs to associate a unique identifier 203 with a network IP address to record the transaction without tracing it to the individual user. In order to create a unique identifier 203 and associate it with an IP address, the collection engine 103 needs to obtain a username. In one embodiment of the present invention, the collection engine 103 monitors the network for packets containing authentication information that associate a user identifier with an IP address. For example, if the ISP 102 is using RADIUS to authenticate users, then the RADIUS server sends an authentication timestamp containing a username associated with an IP address whenever a user successfully logs on to the network. In alternative embodiments of the present invention, other authentication mechanisms may be used. In most cases, the user identifier and IP address are sent across the network unencrypted and can be obtained by the collection engine 103; however, some authentication mechanisms may use encryption, or the information may not be sent across the network at all. In some instances, the access server is configured to suggest an IP address to the RADIUS server 107; if the address is not taken, the RADIUS server 107 sends back a packet allowing the assignment.
In these cases, one of ordinary skill in the art using conventional software development techniques can develop software to obtain the user identifier/IP address correlation. Some other methods that are commonly used to assign IP addresses to users are Dynamic Host Configuration Protocol (DHCP) and Bootp.
In one embodiment of the present invention, a collection engine 103 is an Intel™-based computer running Linux™. In order to maintain a high degree of security, the operating system is hardened using conventional techniques. For example, the "inetd" daemon and other unnecessary daemons are disabled to limit the possibility that an unauthorized user could gain access to the system. The collection engine 103 also includes one or more network interface cards (NIC) that allow the operating system to send and receive information across a computer network. In some embodiments of the present invention, Internet network traffic and authentication network traffic may be sent across different networks. In this case, the collection engine 103 can use multiple NICs to monitor packets sent across the different networks. Additionally, a site may wish to monitor user activity on multiple networks. The collection engine 103 can monitor as many sites as the situation demands and the hardware supports.
Using the network and hardware configuration discussed above, we now turn to the software implementation of the collection engine 103. In accordance with the present invention, application software, developed in a manner that is conventional and well known to those of ordinary skill in the art, is installed at the point-of-presence (POP) location with an ISP. The software includes a process that monitors packets sent across the device's network interfaces as shown in FIG. 4. This embodiment of the present invention begins by waiting for a network packet to be received.
When a network packet is received in block 401, relevant data is extracted from the packet in block 402. The relevant data depends on the protocol of the received packet. For example, if the packet is a RADIUS packet, the relevant data would include a user identifier, an IP address, and the time of authentication. If the packet is an HTTP packet, the system extracts the relevant header information including the size of the packet and the source and destination IP addresses, and records this information along with the date and time of the request. In addition, the system also records the requested Uniform Resource Locator (URL). For other packet types, the system extracts information including the source and destination IP addresses, the source and destination ports, the size of the packet, and the time of transmission.
In the preferred embodiment of the present invention, the collection engine 103 is aware of several standard protocols including HTTP, FTP, RealAudio™, RealVideo™, and Windows Media™. When network interactions are made using one of these protocols, the collection engine 103 can collect additional information such as the name of the files requested.
One embodiment of the present invention also provides additional capabilities to track user sessions. For example, when a user is browsing a Web site, the user makes a series of separate requests to a Web server. In fact, a user may make several separate requests to a Web server in order to show a single Web page. When analyzing the behavior of a user to create a profile, it is useful to think of the related requests in terms of a single session instead of as multiple sessions. For example, when a user requests a Web page, the text of that Web page is downloaded along with each image referenced by that page. The user may then browse multiple pages within that Web site.
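By way of non-limiting illustration, the protocol-dependent extraction of block 402 might be sketched as follows. The dictionary-based packet representation, the field names (`user_id`, `src_ip`, and so on), and the function name `extract_relevant` are assumptions made for this sketch; the description states only which kinds of data are kept for each protocol, not how packets are parsed.

```python
from typing import Any, Dict, Tuple

# Fields kept per protocol, following the description above.
# The field names themselves are hypothetical.
RADIUS_FIELDS: Tuple[str, ...] = ("user_id", "ip", "auth_time")
HTTP_FIELDS: Tuple[str, ...] = ("size", "src_ip", "dst_ip", "time", "url")
DEFAULT_FIELDS: Tuple[str, ...] = (
    "src_ip", "dst_ip", "src_port", "dst_port", "size", "time",
)


def extract_relevant(packet: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the fields that are relevant for the packet's protocol."""
    protocol = packet.get("protocol")
    if protocol == "RADIUS":
        fields = RADIUS_FIELDS
    elif protocol == "HTTP":
        fields = HTTP_FIELDS
    else:
        fields = DEFAULT_FIELDS
    record = {name: packet.get(name) for name in fields}
    record["protocol"] = protocol
    return record


if __name__ == "__main__":
    parsed = {
        "protocol": "HTTP", "size": 512, "src_ip": "10.0.0.1",
        "dst_ip": "192.0.2.7", "time": 1000.0, "url": "/index.html",
        "payload": "<html>...</html>",  # not recorded
    }
    print(extract_relevant(parsed))
```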
In one embodiment of the present invention, the collection engine 103 records the beginning of an interaction in a datastore when an initial HTTP network connection is opened. The system also records the time when that interaction was opened. Additional HTTP requests are determined to be within the same interaction until the interaction ends. In one embodiment of the present invention, interactions end after an inactivity period. In an additional embodiment of the present invention, interactions remain active for Transmission Control Protocol (TCP) connections until the connection is closed using TCP flow control mechanisms. Once data has been collected by a collection engine 103, the data can be aggregated with data collected by other collection engines. For example, an ISP may have multiple POPs and may use a collection engine to collect data at each POP. The resulting data can then be aggregated by a central aggregation server 501. In one embodiment of the present invention, an aggregation server 501 is connected to the Internet 104 through a conventional mechanism. Additionally, one or more collection engines 103 are connected to the Internet 104, as well as one or more merchant collection engines 106. The aggregation server 501 can access each of the collection engines 103 and merchant collection engines 106 to configure and maintain them, as well as to receive network transaction data. As discussed above, efforts are taken to maintain the security of each collection engine 103. For this reason, a secure mechanism for logging on to collection engines 103 and merchant collection engines 106 and a secure mechanism to retrieve data are desirable. One embodiment of the present invention uses the Secure Shell (SSH) to provide strong authentication. This helps prevent unauthorized access to the server. SSH also provides a mechanism for encrypting the datastreams between collection engines 103 and an aggregation server 501. 
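By way of non-limiting illustration, grouping HTTP requests into interactions that end after an inactivity period might be sketched as follows. The class and attribute names, the choice to key an interaction on the pair of anonymized identifier and remote IP address, and the 30-minute default are assumptions of this sketch; the description does not state a timeout value or a grouping key.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

INACTIVITY_TIMEOUT = 1800.0  # seconds; assumed, the description gives no value

Key = Tuple[str, str]  # (anonymized identifier, remote IP), an assumed key


@dataclass
class Interaction:
    start: float       # time the interaction was opened
    last_seen: float   # time of the most recent request
    requests: int = 1


class InteractionTracker:
    def __init__(self, timeout: float = INACTIVITY_TIMEOUT) -> None:
        self.timeout = timeout
        self.active: Dict[Key, Interaction] = {}
        self.finished: List[Tuple[Key, Interaction]] = []

    def observe(self, aid: str, remote_ip: str, now: float) -> Interaction:
        """Record one HTTP request and return the interaction it belongs to."""
        key = (aid, remote_ip)
        current = self.active.get(key)
        if current is not None and now - current.last_seen > self.timeout:
            # Inactivity period exceeded: close the old interaction.
            self.finished.append((key, current))
            current = None
        if current is None:
            current = Interaction(start=now, last_seen=now)
            self.active[key] = current
        else:
            current.last_seen = now
            current.requests += 1
        return current
```

A request arriving within the timeout extends the existing interaction, while a later request closes it and opens a new one with its own start time.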
One of ordinary skill in the art will appreciate that many additional forms of secure login can be used, including one-time password systems and Kerberos™.
As stated above, the aggregation server 501 performs two major tasks: (1) configuration and management of collection engines 103 and merchant collection engines 106; and (2) aggregating data from the engines. In one embodiment of the present invention, the aggregation server 501 monitors each collection engine 103 using a protocol based on the User Datagram Protocol (UDP). Every five minutes, a collection engine 103 sends a UDP packet to the aggregation server 501 signifying that the collection engine 103 is still alive. The UDP packet also specifies the amount of data collected and the number of users currently using the system. In this manner, the aggregation server 501 can be alerted when a collection engine 103 crashes, loses its network connection, or stops collecting data. This permits the effective management of the collection engines 103 from a central aggregation server 501. Additionally, the aggregation server 501 monitors each merchant collection engine 106 using a UDP-based protocol in a manner similar to that used with collection engines 103. In one embodiment, the UDP-based protocol specifies the number of transactions recorded and the number of transactions pending.
In alternative embodiments of the present invention, the collection engines 103 and the merchant collection engines 106 implement a Simple Network Management Protocol (SNMP) Management Information Base (MIB). The MIB includes information such as the time the server has been active, the amount of data stored on the server, and the number of active users and network sessions.
The aggregation server 501 also performs the additional task of collecting and aggregating data from the various collection engines 103 and merchant collection engines 106.
The data is collected once per day by the aggregation server 501 through a secure SSH connection as discussed above. The data is then initially validated so that corrupt packet information is removed and the data is sorted to facilitate loading into the central datastore. In some embodiments of the present invention, the collection engines 103 and 106 do not have enough storage to permit one collection every twenty-four (24) hours. In these cases, the aggregation server can collect data from the collection engine more often than every 24 hours. In one embodiment of the present invention, the UDP-based management protocol discussed above can be used to determine when a collection needs to be scheduled. In addition to the information discussed above, the UDP-based management protocol also includes the percentage of collection storage that has been used. A threshold can be set to initiate a collection. For example, if a collection engine 103 or a merchant collection engine 106 sends a UDP-based management protocol packet stating that it has used 70% of its storage capacity, then the aggregation server can initiate the process of aggregating the data from that collection engine as discussed above. In one embodiment of the present invention, aggregation server 501 is a Sun™ Enterprise 6500™ server with sixteen (16) Sparc Ultra II™ processors and four (4) Fiber Channel connections to an EMC™ disk array. The aggregation server 501 includes an Oracle™ database that is configured to store data retrieved from the various collection engines 103 and 106. 
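By way of non-limiting illustration, the server-side handling of the UDP-based management packets, including the storage threshold that triggers an early collection, might be sketched as follows. The JSON payload, the field names, the class name `EngineMonitor`, and the rule of treating an engine as silent after two missed intervals are assumptions of this sketch; the description specifies the five-minute interval and the 70% example but not a wire format. Socket handling is omitted, and the sketch operates on datagram payloads that have already been received.

```python
import json
from typing import Dict, List

HEARTBEAT_INTERVAL = 300.0  # five minutes, per the description
STORAGE_THRESHOLD = 70.0    # percent, the example value in the description


def parse_heartbeat(datagram: bytes) -> Dict[str, object]:
    """Decode one management packet (JSON is an assumed encoding)."""
    message = json.loads(datagram.decode("utf-8"))
    return {
        "engine_id": str(message["engine_id"]),
        "bytes_collected": int(message["bytes_collected"]),
        "active_users": int(message["active_users"]),
        "storage_used_pct": float(message["storage_used_pct"]),
    }


class EngineMonitor:
    def __init__(self, grace: float = 2 * HEARTBEAT_INTERVAL) -> None:
        self.grace = grace                    # assumed tolerance before alerting
        self.last_seen: Dict[str, float] = {}

    def record(self, datagram: bytes, now: float) -> bool:
        """Note that an engine is alive; return True if a collection is due."""
        heartbeat = parse_heartbeat(datagram)
        self.last_seen[str(heartbeat["engine_id"])] = now
        return float(heartbeat["storage_used_pct"]) >= STORAGE_THRESHOLD

    def silent_engines(self, now: float) -> List[str]:
        """Engines that have not reported within the grace period."""
        return sorted(
            engine for engine, seen in self.last_seen.items()
            if now - seen > self.grace
        )
```

An engine that stops reporting appears in `silent_engines`, which corresponds to the alert on a crash or lost connection, and a `True` return from `record` corresponds to scheduling a collection before the regular daily run.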
In one embodiment of the present invention, the aggregation server 501 stores the following information that is retrieved from the various collection engines 103: (1) ISP, a representation of an ISP that collects data; (2) POP, a representation for a particular point of presence within an ISP; (3) AID, an anonymized user identifier; (4) Start Date, the date and time that an interaction began; (5) End Date, the date and time that an interaction ended; (6) Remote IP, the IP address of the remote host (e.g., the IP address of a Web server being accessed by a user); (7) Remote Port, the port of the remote computer that is being accessed; (8) Packets To, the total number of packets sent during the interaction; (9) Bytes To, the total number of bytes sent to the remote server during an interaction; (10) Packets From, the total number of packets received from the remote computer; (11) Bytes From, the total number of bytes received from the remote computer; and (12) IP Protocol, the protocol code used during the interaction. For example, FIG. 6 shows a typical data table for the aggregation server.
Protocols such as the Hypertext Transfer Protocol (HTTP) and the File Transfer Protocol (FTP) contain additional information that can be useful in describing user behavior. One embodiment of the present invention collects additional information for these protocols. For example, FIG. 7 shows a representative data table containing additional HTTP information as follows: (1) HTTP Host, the hostname sent as part of the HTTP request; (2) HTTP URL, the Uniform Resource Locator requested; (3) HTTP Version, the HTTP version sent as part of the request.
The various embodiments of the present invention discussed above maintain the anonymity of the user by creating and using an anonymized identifier; however, the URL used in an HTTP request may contain identifying data. One embodiment of the present invention attempts to strip identifying data from URLs before storing them.
According to this embodiment, the system searches for the following words within a URL: "SID", "username", "login", and "password". If these are found, the system strips the associated identifying information. For example, if the URL were "/cgibin/shop.exe/?username=bob", then the system would strip "bob" from the URL so that this identifying information would not be stored in the aggregated database.
In one embodiment of the present invention, the aggregation server includes a database associating anonymized identifiers with a classification. For example, in one embodiment, the classification is the physical location of the user. This information is determined using the address of the user. There are commercial applications available that will translate a well-formed address into a Census block code identifying the general location of that address. In another embodiment of the present invention, anonymized identifiers are associated with job functions. For example, a company may wish to monitor how classes of employees are using computer network resources. An anonymized identifier representing a single employee can be associated with a job function classification so that network utilization by employees with the same job function classification can be aggregated. One of ordinary skill in the art will readily appreciate that other classification systems can be used with the present invention.
The transaction codes collected from the merchant collection engines 106 are associated with anonymized identifiers by matching the IP addresses associated with transaction codes with those associated with anonymized identifiers. In this manner, the system can record information about transactions made across the Web. For example, if a user logs on to the Internet through an ISP, he/she is assigned a dynamic IP address. The collection engine 103 stores the IP address or the hashed IP address of the user and associates it with an anonymized identifier.
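By way of non-limiting illustration, and returning to the stripping of identifying data from URLs described earlier in this section, that step might be sketched as follows. The function name, the restriction to query-string parameters, and the case-insensitive substring match on parameter names are assumptions of this sketch; the description says only that the system searches the URL for the listed words and strips the associated information.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Words listed in the description, lowercased for case-insensitive matching.
SENSITIVE_WORDS = ("sid", "username", "login", "password")


def strip_identifying_data(url: str) -> str:
    """Blank the value of any query parameter whose name contains a listed word."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    cleaned = [
        (name, "" if any(word in name.lower() for word in SENSITIVE_WORDS) else value)
        for name, value in pairs
    ]
    # urlencode re-encodes the remaining values, so the stored URL may be
    # normalized relative to the URL seen on the wire.
    return urlunsplit(parts._replace(query=urlencode(cleaned)))


if __name__ == "__main__":
    print(strip_identifying_data("/cgibin/shop.exe/?username=bob"))
```

The parameter name is retained and only its value is removed, so the stored URL still indicates that a username was supplied without recording what it was. Substring matching errs toward over-stripping, since a name that merely contains "sid" would also be blanked.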
Then, every connection made by that user is logged together with other information including the IP address, the anonymized identifier, the time, the destination IP address, and the protocol being used. If the user accesses Amazon.com™ and makes a purchase, the collection engine 103 does not know whether a purchase was made or not; however, the collection engine 103 can determine all of the Web sites visited during the user's session. The aggregation server 501 retrieves all the information about the user's connections to Amazon.com™ from the collection engine 103; however, that collection engine cannot determine whether a purchase was made. If Amazon.com™ were running a merchant collection engine 106, then a transaction code containing information about the purchase would have been logged. The aggregation server 501 can collect the information from both collection engines and aggregate it into a single database so that the data can be analyzed to determine the actions that led to a purchase.
As discussed above, various embodiments of the present invention permit the collection of network utilization data while ensuring the privacy of individual users. In the embodiments discussed above, the system maintains the IP addresses of users in order to match data collected on the client side with data collected on the merchant side. The use of IP addresses alone can weaken the privacy-protection features of various embodiments of the present invention by providing an identifier that can possibly be traced to a particular user. In this embodiment, the IP addresses are hashed in a manner analogous to user identifiers.
Embodiments of the present invention have now been generally described in a non-limiting manner. It will be appreciated that these examples are merely illustrative of the present invention, which is defined by the following claims. Many variations and modifications will be apparent to those of ordinary skill in the art.
