Method and apparatus for analyzing performance of data processing system
6742143

Abstract
A method and apparatus for analyzing the performance of a data processing system, particularly a distributed data processing system, provide a system user with tools for analyzing an application running thereon. Information about the flow and performance of the application can be specified, captured, and analyzed, without modifying it or degrading its performance or data security characteristics, even if it is distributed across multiple machines. The user interface permits the system user to filter the performance information, to set triggers which the performance analyzer is able to reduce and/or combine, to observe multiple time-synchronized displays of performance data either in real time or post mortem, and to play and re-play the operation of an automatically generated application model. The invention is implemented in part by providing suitable Application Program Interfaces (APIs) in the operating system of the data processing system.

Claims
We claim:

Description
TECHNICAL FIELD
TABLE 1
Pre-Defined Event Fields
Arguments
CausalityID
CorrelationID
DynamicEventData
Exception
ReturnValue
SecurityIdentity
SourceComponent
SourceHandle
SourceMachine
SourceProcess
SourceProcessName
SourceSession
SourceThread
TargetComponent
TargetHandle
TargetMachine
TargetProcess
TargetProcessName
TargetSession
TargetThread
Time
Entity
Instance
Because the default set of events is large, pre-defined event categories are provided to visually organize the events in the filter editor. Each event belongs to exactly one category, and each category may have any number of different events. Each category may also have any number of child categories. The combination of all of the events and categories makes a tree where the leaves are events and the branches are categories. Event categories have no semantic impact on the event but do allow the filter to be displayed, stored, and processed more efficiently. Event categories have merely an organizational function, in that they help the user understand events. Pre-defined event categories are listed in Table 2 below:
TABLE 2
Pre-Defined Event Categories
All
Call/Return
Measured
Query/Result
Start/Stop
Transaction
Each event has a type, which specifies how VSA 100 should interpret the event. The type is used to distinguish events that come from DECs. The event type is also used to distinguish events that are outbound (CALL or LEAVE) from those that are inbound (ENTER or RETURN), consistent with Tables 3 and 4 below. This distinction is important for matching up the four events of a CALL/ENTER/LEAVE/RETURN sequence, which is described later. If an event belongs to neither of these groups, then it is called generic. Event types are unrelated to event categories. Events of the same type may be in different categories, and, conversely, events in the same category may be of different types. Event types are listed in Table 3 below:
TABLE 3
Event Types
Begin/End - correspond to a set of events that surround an action.
Default - for a default event (or unspecified event type).
Generic - for a simple event (not a grouped event).
Measured - for DEC events.
Outbound/Inbound - for call/return events. Outbound means the transition is "out" of the component. Inbound means the transition is "into" the component.
The data design of the present invention allows the user to define his or her own events and event taxonomy. However, to provide some basic interoperability between data (so that generic analysis tools can be written and/or used), in one embodiment of the invention some typical events are defined. Compliant event generators within this embodiment are encouraged to use these events rather than to define their own. This helps simplify the filter editor. Alternative embodiments could either have no typical events or a very large set of typical events. The choice of typical events is merely dictated by the kind of events that are expected to be common within the embodiment of the invention which is implemented. Table 4 below identifies pre-defined events and their categories and types:
TABLE 4
Pre-Defined Events and Categories
Event                  Category      Type
Call                   Call/Return   Outbound
Call Data              Call/Return   Outbound
Component Start        Start/Stop    Begin
Component Stop         Start/Stop    End
Enter                  Call/Return   Inbound
Enter Data             Call/Return   Inbound
Events Lost            Transaction   Generic
Leave Data             Call/Return   Outbound
Leave Exception        Call/Return   Outbound
Leave Normal           Call/Return   Outbound
Query Enter            Query/Result  Inbound
Query Leave            Query/Result  Outbound
Query Result           Query/Result  Inbound
Query Send             Query/Result  Outbound
Return                 Call/Return   Inbound
Return Data            Call/Return   Inbound
Return Exception       Call/Return   Inbound
Return Normal          Call/Return   Inbound
Transaction Commit     Transaction   End
Transaction Rollback   Transaction   End
Transaction Start      Transaction   Begin
User                   All           Generic
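As an illustration only, the pre-defined events of Table 4 could be held in a small lookup structure. The following Python sketch is not part of the described embodiment; the names EventType, EventInfo, PREDEFINED_EVENTS, and events_in_category are hypothetical, and only the table contents come from the text above.

```python
from enum import Enum
from typing import NamedTuple


class EventType(Enum):
    """Event types from Table 3 (Begin/End and Outbound/Inbound split out)."""
    BEGIN = "Begin"
    END = "End"
    DEFAULT = "Default"
    GENERIC = "Generic"
    MEASURED = "Measured"
    OUTBOUND = "Outbound"
    INBOUND = "Inbound"


class EventInfo(NamedTuple):
    category: str      # organizational only, no semantic impact
    type: EventType    # tells the analyzer how to interpret the event


# Rows of Table 4 as (event, category, type).
_ROWS = [
    ("Call", "Call/Return", "Outbound"), ("Call Data", "Call/Return", "Outbound"),
    ("Component Start", "Start/Stop", "Begin"), ("Component Stop", "Start/Stop", "End"),
    ("Enter", "Call/Return", "Inbound"), ("Enter Data", "Call/Return", "Inbound"),
    ("Events Lost", "Transaction", "Generic"),
    ("Leave Data", "Call/Return", "Outbound"),
    ("Leave Exception", "Call/Return", "Outbound"),
    ("Leave Normal", "Call/Return", "Outbound"),
    ("Query Enter", "Query/Result", "Inbound"), ("Query Leave", "Query/Result", "Outbound"),
    ("Query Result", "Query/Result", "Inbound"), ("Query Send", "Query/Result", "Outbound"),
    ("Return", "Call/Return", "Inbound"), ("Return Data", "Call/Return", "Inbound"),
    ("Return Exception", "Call/Return", "Inbound"),
    ("Return Normal", "Call/Return", "Inbound"),
    ("Transaction Commit", "Transaction", "End"),
    ("Transaction Rollback", "Transaction", "End"),
    ("Transaction Start", "Transaction", "Begin"),
    ("User", "All", "Generic"),
]

PREDEFINED_EVENTS = {
    name: EventInfo(category, EventType(type_name))
    for name, category, type_name in _ROWS
}


def events_in_category(category):
    """Return the names of all pre-defined events filed under one category."""
    return sorted(n for n, info in PREDEFINED_EVENTS.items() if info.category == category)
```

Keeping the category and the type as separate fields mirrors the statement that event types are unrelated to event categories.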
In Table 4, the "Category" descriptors are merely annotational, not semantic. A brief description of each Event listed in the "Event" column will now be given: A "Call" event is the first step of a four-part Call/Enter/Leave/Return transition. A function call is departing from a caller. "Call Data" means subsidiary data to a call is being transmitted. This always follows a Call. "Component Start" means a component has been created and is starting to execute (note that "component" in this sense is not the same as an "entity" as used herein; it means a real component). "Component Stop" means a component has been destroyed and is stopping its execution (note the comment above). "Enter" means the second step in a four-step transition. A function call is arriving at the callee. "Enter Data" means subsidiary data to an Enter has been received. "Events Lost" means the system has had to discard events to avoid overloading the eventing infrastructure. "Leave Data" means subsidiary data to a leave has been transmitted from a callee to the caller. "Leave Exception" means an exception (error) has been transmitted from the callee to the caller. This is the third step in the four-part transition. "Leave Normal" means a success has been transmitted from the callee to the caller. This is the third step in the four-part transition. "Query Enter" means a database query has arrived at the database. "Query Leave" means a database query has been completed. "Query Result" means a database query result set has started transmitting back to the caller. "Query Send" means a database query has left the caller. "Return" means the fourth step in the four-part transition. Control has returned to the caller. "Return Data" means subsidiary data to a Return has been received at the caller. "Return Exception" means an exception (error) has been received at the caller. This is the fourth step in the four-part transition. "Return Normal" means a success has been received at the caller. 
This is the fourth step in the four-part transition. "Transaction Commit" means a transaction has been committed successfully. "Transaction Rollback" means a transaction was aborted. "Transaction Start" means a new transaction was created and started. "User" means an unknown event. Data Design--E0/E1 Entity Transition FIG. 5 illustrates a transition between two entities, E0 and E1, within the hardware and operating environment. A "transition" occurs when one entity (e.g. a program, process, or object) turns execution over to another to complete a specific task. In FIG. 5 the illustrated transition comprises four events, a Call event, an Enter event, a Leave event, and a Return event. When understanding the structure and behavior of distributed systems, understanding transitions between different applications entities is important. The VSA employs an innovative data design that allows two communicating entities to describe their interactions despite knowing almost nothing about each other. Each participant in a transition provides only information about its environment, plus a unique identifier that allows the entity at the other end of the transition to link the pair of events. Every destination called needs to have a unique i.d., and every source of a Call has a unique i.d. In an embodiment which was implemented, these unique i.d.'s are GUIDs. This design has a number of benefits. First, because entity systems typically already include a quasi-unique identifier for transitions, no extra information needs to be transmitted between the two entities. Second, each entity data load is reduced through less duplicated data. Each application is treated as a series of black boxes. A "transition" is defined as when an application moves from one of those boxes to another one. So if we have a Client and a Server, a transition occurs when we go to the Server, and another occurs when we go back. 
In a three-tier design, a transition occurs for Client to Server, Server to Database, Database to Server, and Server to Client movements. These are entity to entity transitions and not necessarily machine to machine transitions. One example of an entity to entity transition is one COM client component calling a COM server component. Essentially four events represent that transition, which can be a remote procedure Call (RPC) within a distributed system. An event from the client says "I'm initiating a Call". An event at the server says "I've entered the server". An event at the server says "I'm leaving the server". And finally an event at the client says "I've returned". In the case of COM, an event occurs at both sides of the transition. By looking at all or nearly all of these events and taking appropriate pieces of information about them and correlating them, a great deal of information is derived about the structure and performance of the system, and accordingly a performance model of the system can be constructed. Data Design--Determination of Source/Target Relationship FIG. 6 is a table which illustrates how pre-defined event fields are used to establish a relationship between a source entity and a target entity. For each of the events involved in a Call, Enter, Leave, and Return sequence, the event producer specifies the Machine of the source, the Process of the source, the Entity (e.g. class, such as ADO) of the source, and the Instance of the source. Thus the VSA knows the Machine, Process, Entity, and Instance at the Source for a Call event, but it doesn't know the Machine, Process, Entity, and Instance at the Target for a Call event. And for the Enter event, the situation is reversed. The VSA doesn't know it for the Source, but it does know it for the Target. In almost all cases the events are fired at the place the event is happening. Using this information the VSA is able to piece together a functional block diagram of the system as described below. 
There are basically two kinds of users that use VSA. There are people who give us events, and there is the actual end user who is collecting data to understand it. The data design of the invention is manipulated and used by the portion of the operating system that gives us events, and the end user doesn't really need to understand it in great depth. This format makes it possible to draw a block diagram of the system, even though no one piece knows what the system should look like. In most existing systems, E0 and E1 have a very weak relationship. The data design of the present invention is innovative in that it can tolerate this weak relationship and still provide useful results. E0 doesn't really need to know what machine E1 is on, and vice versa. Even though these two entities communicate through the system, e.g. via COM, they don't really know about each other. So when a Call event is fired by E0, it doesn't really know whom it's talking to. When E1 fires the Enter event that goes with that Call event, it doesn't really know that that Enter event goes with that Call event. So the small amount of information that the operating system has is leveraged to make sure that the Call event maps the Enter event. The Handle, the Correlation i.d., and the Causality i.d. fields are largely responsible for enabling an Enter event to be linked with a Call event. There are generally two kinds of events. There are asynchronous events, e.g. "this thing happened". And there are transition events, e.g. going from E0 to E1. When you have a transition event, you typically have a transition back. The user firing the event specifies a Correlation i.d., which enables the Call event to be identified with the Return event. The Call and Return have the same Correlation i.d., and the Enter and Leave have their own Correlation i.d. Each Correlation pair matches up exactly one pair of Enter/Leave and Call/Return to enable the VSA to understand how to match up the pairs. 
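The pairing of events by Correlation i.d. described above can be sketched roughly as follows. This is a minimal Python illustration, not the implemented embodiment; the dictionary field names ("kind", "correlation_id"), the function name pair_by_correlation, and the sample identifier values are assumptions made for the example.

```python
from collections import defaultdict

# Events that open a pair, mapped to the event that closes it.
CLOSER = {"CALL": "RETURN", "ENTER": "LEAVE"}


def pair_by_correlation(events):
    """Group events sharing a Correlation i.d. into (opening, closing) pairs."""
    by_corr = defaultdict(list)
    for ev in events:
        by_corr[ev["correlation_id"]].append(ev)

    pairs = {}
    for corr, group in by_corr.items():
        if len(group) != 2:
            continue  # incomplete or ambiguous; leave unpaired
        kinds = {ev["kind"] for ev in group}
        for opener, closer in CLOSER.items():
            if kinds == {opener, closer}:
                pairs[corr] = (opener, closer)
    return pairs


# Sample stream: one CALL/RETURN pair and one ENTER/LEAVE pair.
events = [
    {"kind": "CALL", "correlation_id": "CA"},
    {"kind": "ENTER", "correlation_id": "CB"},
    {"kind": "LEAVE", "correlation_id": "CB"},
    {"kind": "RETURN", "correlation_id": "CA"},
]
pairs = pair_by_correlation(events)
```

In this sketch, a Correlation i.d. seen on anything other than exactly one opening and one closing event is simply left unpaired, which is one possible way to tolerate lost events.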
Each event source has its own notion that correlates a CALL with a RETURN. For example, COM is able to generate a GUID based on the current execution context and processor. In an alternative embodiment, a Correlation i.d. could be generated using the time the CALL was made. Generation of a Correlation i.d. is typically simple but cannot really be generalized. Each IEC caller must pick its own scheme. Even within a currently implemented embodiment, several schemes for generating Correlation i.d.'s coexist. Another key piece of information is the Causality i.d. This is normally provided by COM, but any entity can provide its own value if desired. Whenever a COM RPC is created, a GUID is created for that RPC. That information is tracked around the network, e.g. for purposes of identifying when a circular reference has been created. For the purposes of the present invention, it is used to match things up. It's basically a unique i.d. to identify a particular stream of calls and to sort them out. It says that this Call goes with this Return, and that this Enter goes with this Leave. The VSA knows from the Causality i.d. that these are all somehow interrelated. In general, the Correlation i.d. operates on the events that are known to one machine, and the Causality i.d. operates on events that occur across machines. A Handle is a way of referencing an individual instance of an entity. Handles are used by a calling entity to call (reference) a particular instance of an entity. Thus, the calling entity knows what Handle it is calling, and the entity being called (the target) knows its own Handle. When this process is applied for both the source and the target (each of which will have its own Handle), it is possible to collect together four events into the standard group of CALL/ENTER/LEAVE/RETURN. It is important to realize that any entity instance can have many different Handles that refer to it. 
For example, when A and C are both talking to B, A might use the Handle "BAT" to refer to B, whereas C might use the Handle "BALL" to refer to B. From the information contained in the table shown in FIG. 6, the VSA deduces that Call 170 goes with Return 176, and that Enter 172 goes with Leave 174. The VSA knows they're related. By knowing that the Source Handle 180 for Call 170 corresponds to Source Handle 186 for Enter 172, and that Target Handle 182 for Call 170 corresponds to Target Handle 184 for Enter 172, it knows that Call 170 is linked with Enter 172. In similar fashion, the VSA determines that Enter 172 is linked with Leave 174, and that Leave 174 is linked with Return 176. The table shown in FIG. 6 will now be described in detail to illustrate how a relationship can be deduced between a source entity and a target entity. The table of FIG. 6 shows a standard four-event transition sequence. This sequence is not the only possible one but is merely one example. In this example, the CALL event fires, and the system is given full information about the source but only knows the target Handle is H1. When the target fires the ENTER event, two deductions can be made: (1) the CALL event can now be filled in, and (2) Handle H1 (the target) has now been defined to be M1, P1, E1, I1. So the CALL event is now completely specified. Additionally, the ENTER event uses Handle H0, which was previously defined to be M0, P0, E0, I0, and so the ENTER event can be completely filled in too. When the LEAVE event arrives, again from the target, two more deductions can be made: (1) the source information for the LEAVE event can be filled in by noticing that Handle H0 has previously been defined to mean M0, P0, E0, I0, and (2) we can now deduce that this LEAVE event and the previous ENTER event are a pair, because they have the same Correlation i.d. (i.e. "CB").
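The Handle-based deductions described so far can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the field names ("source_handle", "target_handle", "local") and the function resolve_handles are hypothetical, and a two-pass scan is used here in place of the incremental, event-by-event deduction described in the text.

```python
def resolve_handles(events):
    """Fill in the unknown side of each event using what Handles are known to mean.

    CALL and RETURN are fired at the source, so their "local" identity
    (Machine, Process, Entity, Instance) defines the source Handle.
    ENTER and LEAVE are fired at the target, so theirs defines the target Handle.
    """
    handles = {}
    # Pass 1: learn what each Handle refers to from the side that fired the event.
    for ev in events:
        side = "source" if ev["kind"] in ("CALL", "RETURN") else "target"
        handles[ev[side + "_handle"]] = ev["local"]

    # Pass 2: complete both sides of every event; an unknown Handle stays None.
    return [
        {
            "kind": ev["kind"],
            "source": handles.get(ev["source_handle"]),
            "target": handles.get(ev["target_handle"]),
        }
        for ev in events
    ]


SRC = ("M0", "P0", "E0", "I0")
TGT = ("M1", "P1", "E1", "I1")
events = [
    {"kind": "CALL", "source_handle": "H0", "target_handle": "H1", "local": SRC},
    {"kind": "ENTER", "source_handle": "H0", "target_handle": "H1", "local": TGT},
    {"kind": "LEAVE", "source_handle": "H0", "target_handle": "H1", "local": TGT},
    {"kind": "RETURN", "source_handle": "H0", "target_handle": "H1", "local": SRC},
]
resolved = resolve_handles(events)
```

If only the CALL event were available, the target would remain unresolved, which corresponds to the state before the ENTER event arrives.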
When the final RETURN event arrives, three deductions can be made: (1) we can fill in the target information for the RETURN event, because we know that H1 means M1, P1, E1, I1, (2) we can pair this RETURN up with the previous CALL by noticing that the Correlation i.d. ("CA") matches that of the CALL event, and (3) all four events are a set because their Causality i.d. is the same, and they have two pairs of matching Correlation i.d.'s. The proper choice of a Handle depends in part on the entity causing the event. As in the case of a Correlation i.d., the generation of a Handle is typically simple but cannot really be generalized. Several routine schemes for generating Handles exist within a currently implemented embodiment of the invention. It generally takes all three pieces of information together in context to create a functional diagram of how all of the pieces communicate. No single piece of information is vital to successful analysis. Dropping one or more fields still allows an implemented embodiment of the invention to generate useful analysis data. However, the removal of all source information makes it impossible to recognize a transition, for example, and thus impossible to diagram transitions in the system. Similarly, the loss of critical data such as the Correlation i.d. makes it impossible to draw a tree of events. It will be understood by one of ordinary skill that other options for ensuring that a source and a target can appropriately identify themselves are possible. Triggers FIG. 7 illustrates in schematic fashion how events selected by a user are monitored. Triggers enable the VSA user to watch for a selected condition or error to occur. In many cases, a developer knows that an error will occur, but he or she doesn't know exactly when it will occur. The present invention allows the developer to set a trigger for collecting data in these situations. 
Triggers can be set either for conditions for which an IEC creates an event, such as "a COM event in Machine A", or for conditions for which a DEC creates an event, such as PerfMon data reflecting CPU utilization. The user can use Boolean operators, for example "OR" and "AND", to specify a set of two or more trigger conditions to watch. For example, a client can request to be alerted when the utilization of a first designated CPU OR of a second designated CPU exceeds 75%. Alternatively, an alert could happen when CPU utilization exceeded 75% AND disk utilization was less than 10%, potentially highlighting the need to obtain additional processing power. A developer can also specify a first filter for "normal" event-monitoring, and a second filter (which can be more detailed or comprehensive than the first filter) to apply when the trigger condition occurs. A "filter" is a way in which the system user can specify what is to be monitored in the system under examination. Filters will be discussed in greater detail below in the sub-sections entitled "Filter Reduction", "Filter Combination", and "Filter Specification". In FIG. 7 an LEC 192 is depicted monitoring an application 190. Events created by IECs and DECs (not illustrated in FIG. 7) are collected by LEC 192. Upon the occurrence of a trigger condition, LEC 192 dumps the events to the VSA 100 or else signals an alert to the VSA 100. While watching for one or more trigger conditions, event monitoring continues as usual: data requested only by the trigger filter is not logged, while data requested by the monitoring filter continues to be logged as normal. While waiting for a trigger condition to occur, events are retained transiently by the LEC 192 in a circular buffer whose size can be specified by VSA 100. For example, VSA 100 can specify that the buffer store 500 events, so when the 501st event comes in, the first event is written over.
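A circular buffer with a Boolean trigger condition, as described above, might be sketched as follows. This Python fragment is illustrative only; the class TriggerBuffer, the function high_cpu_idle_disk, and the event fields ("seq", "cpu", "disk") are assumptions for the example and do not appear in the described embodiment.

```python
from collections import deque


class TriggerBuffer:
    """Retain the most recent events and dump them when a trigger fires."""

    def __init__(self, capacity, trigger):
        # A deque with maxlen silently overwrites the oldest entry when full.
        self.events = deque(maxlen=capacity)
        self.trigger = trigger

    def record(self, event):
        """Buffer one event; return the buffered events if it trips the trigger."""
        self.events.append(event)
        if self.trigger(event):
            dumped = list(self.events)
            self.events.clear()
            return dumped
        return None


def high_cpu_idle_disk(sample):
    """Trigger: CPU utilization above 75% AND disk utilization below 10%."""
    return sample["cpu"] > 75 and sample["disk"] < 10


buf = TriggerBuffer(capacity=500, trigger=high_cpu_idle_disk)
for seq in range(501):                      # 501 quiet events, seq 0..500
    buf.record({"seq": seq, "cpu": 20, "disk": 50})
oldest_before = buf.events[0]["seq"]        # event 0 has been written over

dump = buf.record({"seq": 501, "cpu": 90, "disk": 5})   # trips the trigger
```

The dump contains the triggering event together with the events that preceded it, which is the history the analyzer would log.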
When the user's specified trigger condition is detected, the LEC 192 can immediately transmit all of the buffered events to the VSA 100 for logging. These provide data about the application prior to the failure or other condition. In addition, the LEC 192 can start collecting more events at a higher rate (in accordance with the second filter, for example) which events provide additional detailed information. VSA 100 can also specify a reset condition, either as part of the second filter or as a separate filter. When the reset condition is met, the LEC 192 returns to the low-impact minimal collection condition specified by the first filter and once again monitors for a trigger condition. It will be apparent to one of ordinary skill in the art that suitable data compression techniques can be applied to increase the efficiency of the event buffering and data transmission aspects of the invention. Data compression can be used both for storing events and for sending large quantities of events or event-related data through the data processing system. Data Security Information that is processed by a system performance analysis tool is likely to be confidential. Like any debugging tool, the VSA should ensure that the debuggability of the system cannot become a security hole. Additionally, VSA debugging is a shared resource in a distributed environment. As such, it is important that proper security precautions be taken to prevent malicious users from obtaining this data. The invention provides a secure environment for data collection through the use of discretionary access controls. These access controls can be applied, at the discretion of the user, to the collection of data from a specific machine, to the monitoring of specific entities, and to the collection of specific events. In one aspect of the invention VSA 100 is implemented as a DCOM server which can be configured to run as any identity, so it can control the resources and information it has access to. 
In addition, the server can run in a Windows NT authenticated domain, so that access to the server can be controlled by discretionary access controls based on authentication identities. It will be apparent to one of ordinary skill in the art that discretionary access enforcement can be based on the processes desired to be monitored. It will also be apparent to one of ordinary skill in the art that suitable encryption techniques can be employed to enhance security within the VSA. Since DCOM is used to communicate with the server, standard RPC encryption can be used. In addition, the use of COM's custom marshalling allows virtually any type of encryption technology to be used.

Filter Reduction

FIG. 8 illustrates a process of filter reduction as used within an exemplary embodiment of the invention. First, the use of filters within the context of the invention will be discussed. VSA users specify the desired information to monitor via a User Filter 200. That is, a filter defines what information the VSA will collect and analyze. Users can specify this information in a "system" scope, for example, "All COM and ADO events from Machines A and B". In addition to directing a filter to a machine, a filter can be directed to a process, component (e.g. ADO), IEC, DEC, event, thread, or to multiples or combinations of the foregoing. The user filter 200 can comprise a filter 202 for Machine A, which in turn can comprise filters 204, 206, 208 for Processes A1, A2, A3, respectively. Likewise user filter 200 can comprise a filter 212 for Machine B that in turn comprises filters 214, 216, 218 for Processes B1, B2, B3, respectively. A filter can generally be expressed as a single Boolean expression in a set of unbound variables. These variables refer to the data provider, to events, and to the event sources and their categories. Using the example above, the filter would be (Machine=A OR Machine=B) AND (EventSource=COM OR EventSource=ADO).
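To make the notion of a filter as a Boolean expression concrete, the example filter can be encoded and evaluated as in the following Python sketch. The nested-tuple encoding and the function name evaluate are assumptions made for illustration; the text does not specify an internal representation.

```python
# The example filter, encoded as nested tuples: (operator, operand, ...).
FILTER = (
    "AND",
    ("OR", ("EQ", "Machine", "A"), ("EQ", "Machine", "B")),
    ("OR", ("EQ", "EventSource", "COM"), ("EQ", "EventSource", "ADO")),
)


def evaluate(expr, event):
    """Evaluate a filter against one event whose variables are all bound."""
    op = expr[0]
    if op == "EQ":
        _, variable, value = expr
        return event.get(variable) == value
    if op == "AND":
        return all(evaluate(sub, event) for sub in expr[1:])
    if op == "OR":
        return any(evaluate(sub, event) for sub in expr[1:])
    raise ValueError("unknown operator: " + repr(op))


wanted = evaluate(FILTER, {"Machine": "A", "EventSource": "COM"})
unwanted = evaluate(FILTER, {"Machine": "C", "EventSource": "COM"})
```

Here every variable is bound by the event, so the result is always TRUE or FALSE.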
Filter reduction is a process employed by the VSA to extract the portions of a filter that are relevant to a specific portion of the monitoring infrastructure. Using the previous example, the filter would be reduced by "Machine A" and then "Machine B" to determine the filter fragments that are specific to each machine. These fragments are transmitted to the LECs. The LECs, in turn, reduce the filter by the registered entities/processes on the system. The result is a filter fragment that can be used to determine if a specific data source is enabled or disabled. This information is communicated to the IECs to provide the efficient IsActive function. Filter reduction is the process of modifying or creating a new version of a Boolean expression by binding a subset of the variables within the expression. For example, if the example filter above is sent to machine C, the Machine=A clause can be reduced to FALSE, and the Machine=B clause can be reduced to FALSE, so that (Machine=A OR Machine=B) reduces to FALSE. Since the expression "FALSE AND anything" is FALSE, the whole expression evaluates to FALSE for machine C, meaning that all collection infrastructure on machine C can be deactivated. Another example of filter reduction would be to reduce the example filter ("All COM and ADO events from Machines A and B") by "Machine=A". This results in the filter "EventSource=COM OR EventSource=ADO". Thus the result of this filter reduction is a Boolean expression, not just a TRUE or FALSE value. The LECs also make use of a specialized form of filter reduction to determine which dynamic data is desired. Collection and transmission of dynamic data is expensive, so the filter is scanned for clauses that specifically refer to the dynamic information that is required. The VSA is communicating with multiple LECs, and to operate efficiently it reduces the filter from a global scale down to a filter for a particular machine. What goes into an LEC is that portion of the filter that pertains to a particular machine.
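Filter reduction by binding a subset of variables can be sketched as partial evaluation of the Boolean expression. The Python fragment below is a minimal illustration, assuming a hypothetical nested-tuple encoding of filters; the function name reduce_filter is likewise an assumption and not an API of the described embodiment.

```python
FILTER = (
    "AND",
    ("OR", ("EQ", "Machine", "A"), ("EQ", "Machine", "B")),
    ("OR", ("EQ", "EventSource", "COM"), ("EQ", "EventSource", "ADO")),
)


def reduce_filter(expr, bindings):
    """Partially evaluate a filter. Returns True, False, or a smaller filter."""
    op = expr[0]
    if op == "EQ":
        _, variable, value = expr
        if variable in bindings:
            return bindings[variable] == value   # bound: clause becomes a constant
        return expr                              # unbound: keep the clause as is

    parts = [reduce_filter(sub, bindings) for sub in expr[1:]]
    absorbing = (op == "OR")     # TRUE decides an OR, FALSE decides an AND
    if any(p is absorbing for p in parts):
        return absorbing
    remaining = [p for p in parts if not isinstance(p, bool)]
    if not remaining:
        return not absorbing     # every operand was the neutral constant
    if len(remaining) == 1:
        return remaining[0]
    return (op, *remaining)


on_machine_c = reduce_filter(FILTER, {"Machine": "C"})
on_machine_a = reduce_filter(FILTER, {"Machine": "A"})
```

A result of False means collection can be switched off at that level, True means everything at that level is wanted, and any other result is a filter fragment to pass further down the chain.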
At the next level the LEC breaks the information into pieces which are germane to each IEC to identify whether or not that IEC should be turned on or off. So filter reduction occurs on at least two levels. The first level of filter reduction occurs at the VSA itself. The second level occurs at the LEC, which decides which IEC to turn on or off. It will be apparent to one of ordinary skill in the art that a third level could be at the IEC level. If at any point in the reduction the VSA determines that the filter is guaranteed to be False for a given machine, the collection mechanism is turned off on that machine. If a filter specifying "Machine=A and Process=7" is sent to Machine B, it's just False. Data collection for Machine B is left off and not turned on, which lets Machine B operate more efficiently. On Machine A the collection mechanism is left off for everything except Process 7. This is similar to binding variables in a Boolean expression. If it's either True or False, you know what to do. But if it's undefined, you have to send the expression further down the chain. This feature applies to processes and components as well. It will be apparent to one of ordinary skill in the art that it could be applied to any level, from the machine level down to the thread level. A machine-specific filter can be broadcast to a given machine. Generally, the reduction is performed at the client machine, and then the reduced filter is broadcast to specific machines. Again, it will be apparent to one of ordinary skill in the art that specific filters can be applied to any level. A third level of filter reduction can occur in the DEC. The DEC can specify exactly what pieces of information are being looked for. For example, an event monitoring application such as PerfMon can collect about 7000 pieces of information, and it's very expensive to collect each one. So the filter needs to be reduced further by identifying exactly which pieces of information to collect. 
In the VSA user interface, the user can, if desired, be constrained to select PerfMon events a certain way, so they can't select them in complex Boolean expressions. When the filter makes its way through the network to the right creator, those PerfMon expressions are specifically referenced to the filter and collect exactly those expressions. That combination of constraint in the VSA user interface and appropriate analysis of the results means that the VSA collects only those things specifically asked for in the dynamic case. This is important because every time a dynamic event is timed, one event can be fired every half second or every second, meaning a lot of events are fired. This can overwhelm the system infrastructure. So a filter reduction system is applied to the events that are initiated by the application. And extra reduction can be applied to events which are initiated by PerfMon. This could also be done for events at the IEC if desired. Filter Combination FIG. 9 illustrates a process of filter combination as used within an exemplary embodiment of the invention. It is possible, and likely, that multiple users will be monitoring applications running on shared servers. When this occurs, multiple filters can be issued to the same LEC. To ensure the most efficient collection, the LEC can combine all of the filters prior to performing the entity/process reduction. With reference to FIG. 9, a first user generates user filter 1 in box 231, while a second user generates user filter 2 in box 232. These filters are combined by the LEC into a merged or combined filter 235, which in turn applies a filter for process A1 in box 236, a filter for process A2 in box 237, and a filter for process A3 in box 238. The filters are reduced after they have been combined. Appropriate IECs and DECs then monitor and collect events in accordance with the combined filter. 
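Filter combination at an LEC might look roughly like the following Python sketch. It assumes, as an illustration only, that filters are plain predicates over an event and that the combined filter is the logical OR of the individual user filters; the text does not state the combining operator. The names combine, route, and the event fields are hypothetical.

```python
def combine(filters):
    """Merge several user filters into one: collect an event if any user wants it."""
    filters = list(filters)
    return lambda event: any(f(event) for f in filters)


def route(event, user_filters):
    """Return the users whose own filter requested this event."""
    return sorted(user for user, f in user_filters.items() if f(event))


user_filters = {
    "user1": lambda ev: ev["source"] == "COM",    # all COM events
    "user2": lambda ev: ev["process"] == "A1",    # everything in process A1
}
combined = combine(user_filters.values())

com_in_a2 = {"source": "COM", "process": "A2"}
ado_in_a1 = {"source": "ADO", "process": "A1"}
ado_in_a3 = {"source": "ADO", "process": "A3"}
```

With a single combined filter the collection infrastructure is consulted once per event, and the per-user filters are needed only to decide where a collected event is sent.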
One or more LECs, depending upon whether the items being monitored are on one or multiple machines, collect events from the IECs and DECs, in accordance with the combined filter, and send them to their respective requesting users, who may be on a single control station or at multiple control stations. FIG. 10 illustrates another process of filter combination as used within an exemplary embodiment of the invention. With reference to FIG. 10, filters for processes B1-B3 in boxes 246-248, respectively, are combined in LEC 245 and passed on to users 1 and 2 in boxes 241 and 242, respectively. When events are collected by the LEC 245 from different sources within the data processing system under examination, it determines which clients are interested and routes the events to the respective clients who specified that the events be monitored. Because of the efficient and flexible nature of the filters, and the general-case nature of the reduction process described above, monitoring and collection from multiple machines imposes no extra performance overhead. Performance is simply as if all the monitoring were happening from a single machine. Filter Specification FIG. 11 illustrates a screen print of an exemplary user interface for specifying a filter. The VSA provides a large number of events that can be monitored. Consequently, an efficient mechanism is provided for the user to specify desired event data. The user interface (UI) of the invention provides a quick, easy graphical way for the user to specify the desired queries. In the graphical UI, users are presented with three trees, each appearing in a separate window 250, 252, 254, that represents the key information: a Machines/Processes window 250, a Components window 252, and a Categories/Events window 254. The Machines/Processes window 250 presents all of the machines being monitored and the processes on the machines. The Components window 252 presents the registered VSA data sources on the machines being monitored. 
The Categories/Events window 254 identifies all of the registered VSA events that can be monitored. These can be organized hierarchically in a pre-defined structure, but the user can tailor it to his or her own structure and define his or her own events to be monitored. It will be apparent to one of ordinary skill in the art that process threads could constitute another level of filter specification. Event sources are required to pre-register which events they can emit when they are installed, and this information is transmitted at startup from the LEC to the central machine. By selecting the "Collect" tab 256, the user can quickly select the desired information to analyze. More complex queries can be generated by creating groups of selections using the "OR" tab 258. As the user makes selections, a textual representation of the query, appearing in text window 260, is dynamically generated in synchronism with the graphical depiction in windows 250, 252, and 254, so the user can verify his or her selection, and understand its behavior. Finally, the user can specify very sophisticated filter queries by entering the filter directly as text in text window 260. The tree-oriented part of the user interface allows highly complex filters to be created without a user having to understand the specific syntax or functionality. The system takes advantage of the fact that users have built-in understanding about the "rational" Boolean operators that are used to combine clauses ("OR" for bindings of the same variable, "AND" for bindings of independent variables). The same filter mechanism and user interface are used to both specify what to analyze and to refine the data which has been collected and which is presented to the user. VSA 100 analyzes data both as events are collected as well as after they have been collected. 
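The combining rule described above (OR for bindings of the same variable, AND for bindings of independent variables) can be illustrated with a minimal sketch. The `Event` and `Filter` types and the `Matches` function are hypothetical names chosen for illustration; they are not part of the VSA interfaces, and the actual filter representation is not specified here.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical model: an event is a set of field name -> value bindings.
using Event = std::map<std::string, std::string>;

// Hypothetical model: a filter maps each constrained field to the set of
// values the user selected for it in the corresponding tree window.
using Filter = std::map<std::string, std::set<std::string>>;

// Returns true when the event satisfies every clause of the filter.
// Values selected for the same field are alternatives (OR);
// clauses for different fields must all hold (AND).
inline bool Matches(const Filter& filter, const Event& event) {
    for (const auto& clause : filter) {
        const auto field = event.find(clause.first);
        if (field == event.end()) {
            return false;  // the event does not carry the constrained field
        }
        if (clause.second.count(field->second) == 0) {
            return false;  // none of the OR'ed values matched
        }
    }
    return true;  // an empty filter accepts every event
}
```

Under this model, selecting machines A and B in one window and the Call event in another accepts a Call from either machine, and rejects both a Call from machine C and a Return from machine A.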
Because collected data can also be analyzed afterward, users can filter already collected data in a "post mortem" fashion to create analysis reports of specific elements of the data without having to recollect the data. The user can additionally specify debug and/or trace switches. These are run-time switches. They have a filter to determine the appropriate targets. Components, for example, can access the name/value pairs using the same interface that provides IsActive and FireEvent. Thus a user can choose which events to monitor. Boolean operators can be applied both within the windows and between the windows. Generally ORs are used within the windows, while ANDs are used between the windows. In addition, the UI can enable the user to choose from a pre-defined list of the "top N" filters or queries, so that the user can quickly select a commonly used one.
Location of APIs
FIG. 12 illustrates a system level overview of an exemplary embodiment showing where APIs of the present invention can appear within the software architecture of a distributed computing system. In a generalized and slightly over-simplified manner, the software architectures for two separate data processing systems 301 and 302 are illustrated. Systems 301 and 302 each comprise a plurality of applications, represented by 310 and 340, respectively. Systems 301 and 302 additionally each comprise software referred to as "middleware", identified by reference numbers 320 and 350, respectively, and they each comprise operating system software 330 and 360, respectively. The above-described software executes in the processor(s) of data processing systems 301 and 302, the application programs running under the control of their corresponding operating systems.
It will be understood that applications 310, 340, middleware 320, 350, and the operating system software 330, 360 can be entirely local to the data processing system 301 or 302, or they can be distributed among data processing systems 301, 302, and additional data processing systems (not shown but implied by busses 322 and 342). Systems 301 and 302 can communicate with each other over bus 332. Systems 301 and 302 can communicate with other systems (not shown) over busses 322 and 352, respectively. Each system 301 and 302 comprises APIs located in either the middleware or the operating system or in both. In a currently implemented embodiment, APIs are located in both. In order to facilitate utilization of the performance analysis tools of the present invention by software developers, APIs are provided to give a wide variety of functions, in the form of software modules and components, in common to a broad spectrum of applications. Any one application typically uses only a small subset of the available APIs. Providing a wide variety of APIs frees application developers from having to write code that would have to be potentially duplicated in each application. The APIs of the present invention offer the application developer ready access to the built-in performance analysis functions appearing in the middleware and operating system portions of the software architecture. In the next section, various APIs are presented which allow applications to interface with various modules and components of the networking and operating system environment in order to implement the performance monitoring and analysis features of the invention. Exemplary APIs and Their Functions This section presents and describes exemplary APIs relating to the performance monitoring and analysis features of the invention. 
It will be understood that these APIs are embodied on a computer-readable medium for execution on a computer in conjunction with an operating system or with middleware that interfaces with an application program having one or more event-generating components. The APIs will first be described in functional terms. One or more applications, e.g., applications identified generally by reference number 310 or 340 in FIG. 12, are assumed to be running under the control of an operating system, e.g., operating system 330 or 360. Any one application program can have any number of event-generating components. The application program utilizes APIs (such as APIs 325 or 355 located in middleware 320 or 350, respectively, or APIs 335 or 365 located within operating systems 330 or 360, respectively) associated with the event-generating component, which operate to receive data from the operating system and to send data to the operating system. This set of APIs includes a first interface that enables the operating system to set or disable a status condition ("IsActive") in the application, and it further includes a second interface that receives a status query from the operating system and that returns the status (True or False) of the status condition to the operating system. The set of APIs includes an interface that enables the operating system to read any one or more of several fields in the application. These fields include arguments, causality i.d., correlation i.d., dynamic event data, exception, return value, security i.d., source component, source handle, source machine, source process, source process name, source session, source thread, target component, target handle, target machine, target process, target process name, target session, and target thread. Now from the point of view of an operating system, consider that an operating system can have an event-registering or event-collecting component.
The APIs also include an interface that enables the operating system to query whether a status condition ("IsActive") is set or disabled in the application, and they further include an interface that returns data to the operating system only if the status condition is set. The APIs detailed below are described in terms of the C/C++ programming language. However, the invention is not so limited, and the APIs can be defined and implemented in any programming language, as those of ordinary skill in the art will recognize. Furthermore, the names given to the API functions and parameters are meant to be descriptive of their function. However, other names or identifiers could be associated with the functions and parameters, as will be apparent to those of ordinary skill in the art. Four sets of APIs are presented: APIs for generating events (C interface), APIs for generating events (automation binding), APIs for registering events and sources (C binding), and APIs for registering events and sources (automation binding). APIs for generating events used by applications that interface with the performance analysis functions of the present invention are presented below, both for C interface and for automation binding. APIs for Generating Events (C Interface)
HRESULT BeginSession(
[in] REFGUID guidSourceID,
[in] LPCOLESTR strSessionName
);
HRESULT EndSession(
);
HRESULT IsActive(
);
typedef [v1_enum] enum VSAParameterType {
cVSAParameterKeyMask= 0x80000000,
cVSAParameterKeyString=0x80000000,
cVSAParameterValueMask=0x0007ffff,
cVSAParameterValueTypeMask=0x00070000,
cVSAParameterValueUnicodeString=0x00000,
cVSAParameterValueANSIString=0x10000,
cVSAParameterValueGUID=0x20000,
cVSAParameterValueDWORD=0x30000,
cVSAParameterValueBYTEArray=0x40000,
cVSAParameterValueLengthMask=0xffff,
} VSAParameterFlags;
typedef [v1_enum] enum VSAStandardParameter {
cVSAStandardParameterDefaultFirst=0,
cVSAStandardParameterSourceMachine=0,
cVSAStandardParameterSourceProcess=1,
cVSAStandardParameterSourceThread=2,
cVSAStandardParameterSourceComponent=3,
cVSAStandardParameterSourceSession=4,
cVSAStandardParameterTargetMachine=5,
cVSAStandardParameterTargetProcess=6,
cVSAStandardParameterTargetThread=7,
cVSAStandardParameterTargetComponent=8,
cVSAStandardParameterTargetSession=9,
cVSAStandardParameterSecurityIdentity=10,
cVSAStandardParameterCausalityID=11,
cVSAStandardParameterSourceProcessName=12,
cVSAStandardParameterTargetProcessName=13,
cVSAStandardParameterDefaultLast=13,
cVSAStandardParameterNoDefault=0x4000,
cVSAStandardParameterSourceHandle=0x4000,
cVSAStandardParameterTargetHandle=0x4001,
cVSAStandardParameterArguments=0x4002,
cVSAStandardParameterReturnValue=0x4003,
cVSAStandardParameterException=0x4004,
cVSAStandardParameterCorrelationID=0x4005,
cVSAStandardParameterDynamicEventData=0x4006,
cVSAStandardParameterNoDefaultLast=0x4006
} VSAStandardParameters;
typedef [v1_enum] enum eVSAEventFlags {
cVSAEventStandard=0,
cVSAEventDefaultSource=1,
cVSAEventDefaultTarget=2,
cVSAEventForceSend=8
} VSAEventFlags;
HRESULT FireEvent(
[in] REFGUID guidEvent,
[in] int nEntries,
[in, size_is(nEntries)] LPDWORD rgKeys,
[in, size_is(nEntries)] LPDWORD rgValues,
[in, size_is(nEntries)] LPDWORD rgTypes,
[in] DWORD dwTimeLow,
[in] LONG dwTimeHigh,
[in] VSAEventFlags dwFlags
);
}
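The VSAParameterType constants listed above appear to pack several fields into one DWORD. The sketch below shows one plausible reading of that layout, assuming (from the constant names only) that the top bit marks a string key, bits 16 through 18 select the value type, and the low 16 bits carry a length. The helper names `DecodedParameterType`, `MakeParameterType`, and `DecodeParameterType` are hypothetical and are not part of the listed API.

```cpp
#include <cassert>
#include <cstdint>

// Constants copied from the VSAParameterType enumeration.
constexpr std::uint32_t cVSAParameterKeyString       = 0x80000000u;
constexpr std::uint32_t cVSAParameterValueTypeMask   = 0x00070000u;
constexpr std::uint32_t cVSAParameterValueANSIString = 0x10000u;
constexpr std::uint32_t cVSAParameterValueGUID       = 0x20000u;
constexpr std::uint32_t cVSAParameterValueDWORD      = 0x30000u;
constexpr std::uint32_t cVSAParameterValueBYTEArray  = 0x40000u;
constexpr std::uint32_t cVSAParameterValueLengthMask = 0xffffu;

// Hypothetical helper type: the three fields assumed to be packed
// into a single type word.
struct DecodedParameterType {
    bool keyIsString;          // assumed meaning of the top bit
    std::uint32_t valueType;   // one of the cVSAParameterValue* type codes
    std::uint32_t valueLength; // assumed meaning of the low 16 bits
};

// Packs the fields into one word by OR-ing the masked parts together.
inline std::uint32_t MakeParameterType(bool keyIsString,
                                       std::uint32_t valueType,
                                       std::uint32_t valueLength) {
    return (keyIsString ? cVSAParameterKeyString : 0u) |
           (valueType & cVSAParameterValueTypeMask) |
           (valueLength & cVSAParameterValueLengthMask);
}

// Splits a type word back into its fields using the same masks.
inline DecodedParameterType DecodeParameterType(std::uint32_t flags) {
    DecodedParameterType decoded;
    decoded.keyIsString = (flags & cVSAParameterKeyString) != 0;
    decoded.valueType = flags & cVSAParameterValueTypeMask;
    decoded.valueLength = flags & cVSAParameterValueLengthMask;
    return decoded;
}
```

Note that the three masks do not overlap, so packing and unpacking are lossless for lengths up to 0xffff under this reading.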
"BeginSession" is called by an entity before it fires events to register its entity and instance names (source and session). "EndSession" is called by an entity after it completes firing events. "IsActive" is called by an entity which is considering firing events and wishes to know if anyone is listening. "FireEvent" fires an actual event from an entity. APIs for Generating Events (Automation Binding)
HRESULT BeginSession(
[in] BSTR guidSourceID,
[in] BSTR strSessionName
);
HRESULT EndSession(
);
HRESULT IsActive(
[out] VARIANT_BOOL *pbIsActive
);
HRESULT FireEvent(
[in] BSTR guidEvent,
[in] VARIANT rgKeys,
[in] VARIANT rgValues,
[in] long rgCount,
[in] VSAEventFlags dwFlags
);
}
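Both bindings above suggest the same calling sequence: begin a session, ask whether anyone is listening, fire events only if so, and end the session. The following sketch illustrates that guard pattern against a hypothetical in-memory stand-in. `MockEventSource`, `HResult`, `kOk`, and `FireIfActive` are invented for illustration, the method signatures are deliberately simplified, and the rule that IsActive reports false outside a session is an assumption of the mock rather than documented behavior.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-ins so the sketch compiles without COM headers.
using HResult = long;
constexpr HResult kOk = 0;

// Hypothetical event source that mimics the shape of BeginSession,
// IsActive, FireEvent, and EndSession, recording what was fired.
class MockEventSource {
public:
    explicit MockEventSource(bool listenerAttached)
        : listenerAttached_(listenerAttached), inSession_(false) {}

    HResult BeginSession(const std::string& /*sourceId*/,
                         const std::string& /*sessionName*/) {
        inSession_ = true;
        return kOk;
    }
    HResult EndSession() {
        inSession_ = false;
        return kOk;
    }
    // Mock assumption: active only inside a session with a listener attached.
    HResult IsActive(bool* pbIsActive) const {
        *pbIsActive = listenerAttached_ && inSession_;
        return kOk;
    }
    HResult FireEvent(const std::string& eventId) {
        fired_.push_back(eventId);
        return kOk;
    }
    const std::vector<std::string>& fired() const { return fired_; }

private:
    bool listenerAttached_;
    bool inSession_;
    std::vector<std::string> fired_;
};

// Fires the event only when a listener is present, so that an unmonitored
// component skips the cost of building and sending event data.
inline bool FireIfActive(MockEventSource& source, const std::string& eventId) {
    bool active = false;
    if (source.IsActive(&active) != kOk || !active) {
        return false;
    }
    return source.FireEvent(eventId) == kOk;
}
```

The point of the guard is that an instrumented component pays only for the IsActive query when nobody has asked to monitor it.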
The comments for the above set of "APIs For Generating Events" are the same as for the C Interface APIs preceding them. APIs for registering events and sources used by applications that interface with the performance analysis functions of the present invention are presented below, both for C interface and for automation binding. APIs for Registering Events and Sources (C Interface)
HRESULT RegisterSource(
[in] LPCOLESTR strVisibleName,
[in] REFGUID guidSourceID
);
HRESULT IsSourceRegistered(
[in] REFGUID guidSourceID
);
HRESULT RegisterStockEvent(
[in] REFGUID guidSourceID,
[in] REFGUID guidEventID
);
HRESULT RegisterCustomEvent(
[in] REFGUID guidSourceID,
[in] REFGUID guidEventID,
[in] LPCOLESTR strVisibleName,
[in] LPCOLESTR strDescription,
[in] long nEventType,
[in] REFGUID guidCategory,
[in] LPCOLESTR strIconFile,
[in] long nIcon
);
HRESULT RegisterEventCategory(
[in] REFGUID guidSourceID,
[in] REFGUID guidCategoryID,
[in] REFGUID guidParentID,
[in] LPCOLESTR strVisibleName,
[in] LPCOLESTR strDescription,
[in] LPCOLESTR strIconFile,
[in] long nIcon
);
HRESULT UnRegisterSource(
[in] REFGUID guidSourceID
);
HRESULT RegisterDynamicSource(
[in] LPCOLESTR strVisibleName,
[in] REFGUID guidSourceID,
[in] LPCOLESTR strDescription,
[in] REFGUID guidClsid,
[in] long inproc);
HRESULT UnRegisterDynamicSource(
[in] REFGUID guidSourceID);
HRESULT IsDynamicSourceRegistered(
[in] REFGUID guidSourceID);
};
"RegisterSource" is called by code that is installing a new event-generating entity on a machine. "IsSourceRegistered" detects if an event-generating entity is present. "RegisterStockEvent" is called by an event-generating entity to note its use of a system event. "RegisterCustomEvent" is called by an event-generating entity to note its definition of a custom event. "RegisterEventCategory" is called by an event-generating entity to note its definition of a custom event category. "UnRegisterSource" is called by code that is uninstalling an event-generating entity. "RegisterDynamicSource" is called by code that is installing a DEC (dynamic event-generating entity). "UnRegisterDynamicSource" is called by code that is uninstalling a DEC (dynamic event-generating entity). "IsDynamicSourceRegistered" detects if an event-generating entity is present. APIs for Registering Events and Sources (Automation Binding)
HRESULT RegisterSource(
[in] BSTR strVisibleName,
[in] BSTR guidSourceID
);
HRESULT IsSourceRegistered(
[in] BSTR guidSourceID,
[out] VARIANT_BOOL *pbIsRegistered
);
HRESULT RegisterStockEvent(
[in] BSTR guidSourceID,
[in] BSTR guidEventID
);
HRESULT RegisterCustomEvent(
[in] BSTR guidSourceID,
[in] BSTR guidEventID,
[in] BSTR strVisibleName,
[in] BSTR strDescription,
[in] long nEventType,
[in] BSTR guidCategory,
[in] BSTR strIconFile,
[in] long nIcon
);
HRESULT RegisterEventCategory(
[in] BSTR guidSourceID,
[in] BSTR guidCategoryID,
[in] BSTR guidParentID,
[in] BSTR strVisibleName,
[in] BSTR strDescription,
[in] BSTR strIconFile,
[in] long nIcon
);
HRESULT UnRegisterSource(
[in] BSTR guidSourceID
);
HRESULT RegisterDynamicSource(
[in] BSTR strVisibleName,
[in] BSTR guidSourceID,
[in] BSTR strDescription,
[in] BSTR guidClsid,
[in] long inproc);
HRESULT UnRegisterDynamicSource(
[in] BSTR guidSourceID);
HRESULT IsDynamicSourceRegistered(
[in] BSTR guidSourceID,
[out] VARIANT_BOOL *boolRegistered);
};
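The registration calls above imply an ordering: a source is registered first, then its categories (each optionally under a parent), then its custom events within a category. The sketch below models that ordering with a hypothetical in-memory registry. `MockRegistry` and `PathOf` are invented names, string identifiers stand in for GUIDs, and the rule that a call fails when its source, parent, or category is unknown is an assumption of this sketch, not behavior stated for the real APIs.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical registry mirroring the shape of the registration APIs.
class MockRegistry {
public:
    bool RegisterSource(const std::string& sourceId) {
        return sources_.insert(sourceId).second;
    }
    bool IsSourceRegistered(const std::string& sourceId) const {
        return sources_.count(sourceId) != 0;
    }
    bool UnRegisterSource(const std::string& sourceId) {
        return sources_.erase(sourceId) == 1;
    }
    // An empty parentId denotes a top-level category.
    bool RegisterEventCategory(const std::string& sourceId,
                               const std::string& categoryId,
                               const std::string& parentId) {
        if (!IsSourceRegistered(sourceId)) return false;
        if (!parentId.empty() && categories_.count(parentId) == 0) return false;
        return categories_.emplace(categoryId, parentId).second;
    }
    // Each event belongs to exactly one category.
    bool RegisterCustomEvent(const std::string& sourceId,
                             const std::string& eventId,
                             const std::string& categoryId) {
        if (!IsSourceRegistered(sourceId)) return false;
        if (categories_.count(categoryId) == 0) return false;
        return events_.emplace(eventId, categoryId).second;
    }
    // Walks from the event (a leaf) up through its categories (branches).
    std::string PathOf(const std::string& eventId) const {
        const auto event = events_.find(eventId);
        if (event == events_.end()) return "";
        std::string path = eventId;
        std::string category = event->second;
        while (!category.empty()) {
            path = category + "/" + path;
            category = categories_.at(category);
        }
        return path;
    }

private:
    std::set<std::string> sources_;
    std::map<std::string, std::string> categories_;  // category -> parent
    std::map<std::string, std::string> events_;      // event -> category
};
```

Because a parent category must already exist when a child is registered, the category map cannot contain a cycle, so the walk in `PathOf` always terminates.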
The comments for the above set of "APIs For Registering Events and Sources" are the same as for the C Interface APIs preceding them. The APIs for registering events and sources (C interface/automation binding) can be used by an application to register which events can be generated by a data source. These APIs turn on and off such registration. They also specify whether the registration is a pre-defined, standard event or a custom event. They can also specify the event category, and they can determine whether a source is registered or not. Automatic Generation of Animated Application Model FIG. 13 illustrates a screen print of an animated application model which the present invention generates to show the structure and activity of an application whose performance is being studied. An important innovation in the VSA's analysis function is its ability to dynamically generate diagrams of the functionally active structure of the application. The VSA creates the application diagrams by closely examining the event data that is received. As explained above, events are correlated by the VSA to understand the flow of control. The data design described above makes it possible to understand which events need to be correlated and how they should be grouped and connected. Correlation makes use of the source and target information specified in the event data. When insufficient information is present, additional heuristics can be used to extrapolate the event flow. This includes time-ordering, COM causality information, and event handles. With reference to the screen print 370 of FIG. 13, the functional interrelationship among blocks such as blocks 371 and 372 is visually depicted. (It will be understood by one of ordinary skill in the art that, while all blocks in FIG. 13 are depicted with dummy labels, in practice each block will bear an appropriate label in accordance with that block's function or place within the performance model.) 
It will also be understood by one of ordinary skill that many other forms of visual portrayal of the application performance model can be used. As new diagram elements are identified, they are added to the user's screen 370. Frequently sufficient information is not available to immediately connect them to other entities on the diagram. This is the case with blocks 381 and 382 in FIG. 13. As data becomes available, the entities are connected. This application model diagram is highly interactive. Selections made in other VSA windows can result in selections in the diagram. Incoming events are directly animated into the diagram. Diagram blocks can be expanded or collapsed to show more or less detail. To support this interactive behavior, the diagram data structures use a network of linked mapping tree data structures to efficiently understand the impact of new data, and to determine the blocks required to be added or removed when more data arrives. Incomplete information is stored specially, and when other incomplete data arrives, there is an attempt to pair up the incomplete data using pre-defined heuristics and the data design described above. Because the internal storage of the diagram only stores blocks and their connections, it is very space efficient. In normal scenarios storage space does not grow very fast proportionate to the number of events that have been viewed. FIG. 14 illustrates various user interface features of an animated application model in an exemplary embodiment of the invention. The user interface features are shown generally by reference number 400. In the UI depicted in FIG. 14, diagrams are portrayed of the different blocks representing varying levels of detail of a hierarchical model of the application. As shown in FIG. 14, four different types of diagrams are available representing varying levels of detail: machines, processes, data sources, entities, and instances. 
Users can expand and collapse items on these diagrams to create the exact level of detail required. As well, the recorded event data can be depicted adjacent to the animated application model or overlaid upon it. In addition, using VCR-like commands, described below with reference to FIG. 14, users can play and replay the application execution, stop, pause, reverse, speed up, slow down, and so forth. Merely by way of illustration, an animated application model, shown generally by reference number 410, includes a machine 404, which is shown coupled functionally to a machine 412, which in turn is coupled to a machine 411. Each machine 404, 411, 412 can, in turn, be coupled to other items (not shown). A visual depiction of a first machine 404 can be "exploded" into its constituent processes, depicted by box 402. The user can further "drill" into a process, such as Process #1, to explode its constituent entities, depicted by box 406. Further, the user can drill into an entity, such as Entity #1, for example, to explode a view, depicted by box 408, showing the various Instances #1 through #N which are included in Entity #1. The drill-in shown in FIG. 14 can be mixed in the same user screen. That is, a drill-in for machine 411 could show only its constituent processes, and a drill-in for machine 412 could show only its constituent processes plus the entities for one of the processes. So any individual box can be drilled down or up independently. In addition, the user can perform zooming, printing, and any other known screen operations. The graphical UI includes a display and a user interface selection device, such as a keyboard or mouse. A model of the functionally active structure of the data processing system is displayed. Using the user interface selection device, a selection signal is generated with respect to a portion of the animated model, along with the user's expansion or contraction command. 
The VSA performs an expansion or contraction function on the selected portion in response to the selection signal and to the expansion or contraction command, and the selected portion is either exploded or contracted per the expansion or contraction command. Behind this visual depiction of the application model, the VSA maintains a log of all of the events that have been collected. The VSA utilizes a graphical UI paradigm in the form of a video cassette recorder (VCR) having, for example, Reverse, Stop, Pause, Speed, and Play commands. Other appropriate commands can be provided as indicated by an unlabeled button on the control panel. Using the VCR paradigm to control the depiction of the application performance, the VSA can run through each of the events and correspondingly animate the application model shown in FIG. 13 or FIG. 14. For example, if the current event is between Machine #1 and Machine #N, then a connection segment 411 is highlighted. Using the VCR commands, the user can change the speed, pause the display, and go backward and forward. While the user is doing this, a separate, adjacent window 430 shows the event details. So while the event is occurring, and the application model diagram of FIG. 14 is being animated, the user can also view other pertinent performance details in window 430. Also shown in FIG. 14 is an adjacent time line window 440 having equally spaced vertical lines throughout the time duration of an event. A special marker 445 moves from left to right through the vertical lines to show the progress of an event, either as the event occurs, or as the event is being played back by the user. All of the windows are time-synchronized to one another. Performance Analysis FIG. 15 illustrates a representative display of performance data in an exemplary embodiment of the invention. The VSA provides another important component for automatic analysis of collected data, the performance analysis component. 
The performance analysis component analyzes the collected data and creates a call tree by pairing events (e.g. Call and Return) and ordering them using temporal ordering and heuristics. The result is a presentation of the call tree in a Gantt style view with any Perfmon (or other dynamic) data displayed adjacent to or overlying the displayed call tree. With this view, the VSA provides a mechanism to simultaneously view application and environmental performance information and quickly drill into the details (by expanding to another level in the call tree). When the VSA is used to track and graph load information, the VSA provides an innovative way for the user to view how applications perform, behave, and degrade under different load and stress scenarios. Like the animated application model, the call tree is generated by the application of suitable pre-determined heuristics, since the user does not have any a priori knowledge of the call relationships of more than two objects. Temporal and contextual information, for example, are used to deduce a call tree without full information. It will be apparent to one of ordinary skill that other kinds of information can also be used to deduce a call tree. With reference to FIG. 15, an upper window 450 includes a process summary portion 460 and a performance summary portion 470. The process summary portion 460 comprises a Call Hierarchy including Call, Enter, Leave, and Return events. Each of these events can contain sublevels, as shown for the Call event. It will be understood that the sublevels can be further subdivided to whatever degree is required, as shown for the Leave event. The user can expand or collapse the levels of detail for each of the events, as desired. Each of the Call, Enter, Leave, and Return events can have a corresponding Gantt type of representation, as illustrated in performance summary portion 470, showing the duration of the event. 
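The pairing of Call and Return events into a call tree with per-call durations can be sketched as follows. This is a simplified illustration only: it assumes a single, time-ordered event stream with properly nested calls and uses a stack to match each Return to the most recent open Call, whereas the component described above also relies on heuristics and contextual information. `TimedEvent`, `GanttBar`, and `BuildCallBars` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

enum class EventKind { Call, Return };

// Hypothetical, reduced event record: kind, callee name, timestamp.
struct TimedEvent {
    EventKind kind;
    std::string target;
    std::uint64_t time;
};

// One bar of a Gantt-style view: nesting depth, start, and duration.
struct GanttBar {
    std::string target;
    int depth;
    std::uint64_t start;
    std::uint64_t duration;  // stays 0 if no matching Return was seen
};

// Pairs each Return with the most recent unmatched Call.
inline std::vector<GanttBar> BuildCallBars(const std::vector<TimedEvent>& events) {
    std::vector<GanttBar> bars;
    std::vector<std::size_t> open;  // indices of bars still awaiting a Return
    for (const TimedEvent& event : events) {
        if (event.kind == EventKind::Call) {
            bars.push_back(GanttBar{event.target,
                                    static_cast<int>(open.size()),
                                    event.time, 0});
            open.push_back(bars.size() - 1);
        } else if (!open.empty()) {
            GanttBar& bar = bars[open.back()];
            bar.duration = event.time - bar.start;
            open.pop_back();
        }
        // A Return with no open Call is ignored in this sketch.
    }
    return bars;
}
```

The depth recorded for each bar corresponds to the level at which it would appear when the call hierarchy is expanded.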
In performance summary portion 470, for example, Gantt segment 471 represents the duration of the Call event. The durations of the Enter, Leave, and Return events are shown by Gantt segments 472, 473, and 474, respectively. Performance summary portion 470 thus provides a Gantt-style presentation of the call tree, i.e., who calls whom. The Gantt bars 471-474 show when each event started and how long it lasted. This information comes from the IEC. Beneath the call tree performance summary, a graph 480 can be depicted to show, for example, the CPU utilization during a Call operation such as an RPC. Graph 480, which may be positioned adjacent to or overlaying the Gantt segments 471-474, could also illustrate any one or more other desired aspects of the system performance besides the CPU utilization. The Gantt chart can be based upon the application events. The graph can be selected from the time base. Also shown in FIG. 15 is a summary window 490 which provides a distillation of what is shown in the performance windows 410 and 430 of FIG. 14 and in the upper window 450 of FIG. 15. For example, if the time slice between dashed lines 481 and 482 is selected for scrutiny, a summary performance graph 492 is generated for the selected time segment. Summary window 490 also contains a textual description of the application's performance during the specified time segment. Thus the user can view a tightly synchronized, easily comprehensible graphical and textual analysis and representation of the application performance, in the form of the animated block diagram 410, the Event Detail window 430, and the Time Line window 440 of FIG. 14, as well as the process summary portion 460 and the performance summary portion 470 of FIG. 15. The summary window 490 ties everything together. Again, everything is time-synchronized. In addition, all of the above windows can be operated to display the application performance in real time as well as "post mortem".
This applies as well to the animated application models, as shown in the screen print of FIG. 13 and in window 410 of FIG. 14, so that in real time as an application is being analyzed, one block will appear, then another, and then the interconnection between the two blocks. Blocks are dynamically added, removed, and moved, and the interconnections between them are dynamically changed to reflect changing conditions in the execution of the application. The diagram is kept up to date with what is really happening. FIG. 16 illustrates a screen print 500 of an exemplary display of performance data. Screen print 500 depicts the percentage of CPU utilization for a selected group of processors. Window 504 shows a graph line 505 which, for example, depicts the percentage of CPU utilization (right-hand side) versus time (bottom side). In general, graph lines represent overlaid DEC data. Window 502 depicts a list of events relating to the operation of the processors under scrutiny. Window 506 depicts a legend or key to the information shown in window 504. Window 506 indicates the source machines (all) as well as summary performance information (a minimum of 13 processors, a maximum of 100 processors, and an average of 49 processors executing simultaneously; currently 35 processors concurrently executing). Window 506 also comprises a "legend" 507 which provides a color key 508 to assist the user in identifying graph lines in window 504, such as Gantt bars 510, 511, and 512, or graph line 505. While window 504 only shows one graph line 505, more can be shown. Window 506 provides an indication of the source machines, maximum, minimum, average, and current value for each graph line shown in window 504. Additional Tools The VSA provides a few other tools which, when used in conjunction with the features described above, provide additional insight into application performance. FIG. 17 illustrates a screen print 520 of a timeline display of performance data. 
The timeline window presents a visual representation of the timing of all related events. Dark clumps 522 represent tight groupings of events, while spaces 524 represent possible underutilization of resources. Timeline 520 can be annotated to present event activity per machine or per process (or other system resource) using different colors. This allows users to visually identify both potential system-wide and per-machine bottlenecks. As playback or monitoring continues, the timeline 520 acts as a real-time indicator of the current system context. FIG. 18 illustrates a screen print 530 of a summary display of performance data. Similar to previously described summary window 490 in FIG. 15, but depicting different information, the summary information in screen print 530 presents a distillation of all events selected by the VSA user. That is, if multiple events are selected, the unique elements (e.g., source and target machines, processes, entities, etc.) are displayed. This is very useful when a time range is selected in either the timeline or the performance viewer. The summary window allows the user to see a quick tally of what is going on in the application. This is a particularly important view because of the large volumes of data generated while monitoring a system.
Synchronization
FIG. 19 illustrates a screen print 550 of several synchronized sets of performance data. Screen 550 comprises several windows, including an animated application model or process diagram 552, an event log window 554, a CPU performance view window 556, an event viewing window 558, a summary window 560, and a time line window 562. The VSA ensures that all information presented to the user is cross-correlated. This provides instant synchronization. When the user selects an item (or set of items) in one window, all other windows can (based on user preference) automatically highlight the selection.
This includes the selection of specific events, selection of all events in a specified time range, or selection of all events associated with a specified entity. However, if the user desires, auto-synchronization can be turned off for any one or more windows. FIG. 19 illustrates this concept. Here, for example, the user made a time selection in the performance view window 556 (representing PerfMon data) over a period of time where CPU behavior was in question. The animated application model or process diagram 552 highlights the entities/processes involved in the selection. The event log window 554 highlights all events in the specified time range, some of which represent a call tree. The event viewing window 558 presents data on a single event (for multi-event selections it highlights the first event). The timeline window 562 highlights the specified time range and shows performance peaks, and the summary window 560 tallies the events in the time range and presents a summary. Thus, while displaying the animated functional model 552, the control station can also simultaneously display items such as summary data 560, time data 562, event details 558, and/or an event log or call tree 554. Window synchronization avoids a
