System and method for efficiently monitoring quality of service in a distributed processing environment
5958009
Abstract
A measurement system and method of instrumenting a computer program for efficiently monitoring the quality of service in a distributed processing environment are described. A plurality of interconnected network nodes in a computer system with an application process operating on each network node is provided. At least one intelligent sensor is associated with each application process. Each intelligent sensor selectively collects data about at least one of the network node upon which the associated application process operates and the associated application process. An observer is associated with each application process and filters out unchanged and zero values from the data collected by the at least one intelligent sensor. A collector is logically associated with each network node. The intervalized collected data is asynchronously received into the collector periodically pushed from the observer. An analyzer is associated with the distributed processing environment and correlates the intervalized collected data. The intervalized collected data is asynchronously received into the analyzer periodically pushed from the collector.
Claims
We claim:
Description
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
______________________________________
Level Description
______________________________________
0 Off
1 Threshold
2 Means
3 Minimums and Maximums
4 First moments of a distribution
5 Second moments of a distribution
6 Histogram
______________________________________
The lowest information level, "Off" (Level 0), corresponds to instantiated but inactive sensors and incurs virtually no overhead. The "Threshold" information level (Level 1) is generally used for monitoring network quality of service and performance. Each sensor 42 can be configured with definable thresholds. Ordinarily, each sensor 42 reports data only when its threshold is exceeded, although the threshold can be overridden on demand to force immediate reporting. For example, data is only reported when 90% of all response times exceed 10 seconds. The other information levels support various statistical quantities that provide insights into the load and performance distribution of the collected data. Primitive quantities are collected for levels 2, 3, 4, and 5. These quantities can be processed by an analyzer 48 to convert them into statistical information. This reduces the overhead of the sensor 42 by deferring compute intensive processing to another network node where the analyzer 48 executes. Level 2 collects value summaries and a count of events so that the mean value can be computed. Level 3 collects the minimum and maximum value measured during the time interval. Level 4 collects the sum of the squares of the values so that the first moment in the distribution of this data can be computed. Level 5 collects the sum of the cubes of the values so that the second moment in the distribution of this data can be computed. Level 6 measures values and stores frequency of events in the relevant bin of a histogram. Four categories of standard sensors 42 are defined: timers provide interval times; counters record the number of occurrences or the state of an event; normalizers record events per unit time or unit count; and composers return an arbitrary structure for use by higher level objects. For example, normalizers transform values collected by timers or counters into rates. 
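The division of labor between the sensor 42 and the analyzer 48 described above can be illustrated with a minimal sketch. All names below (InfoLevel, Primitives, SketchSensor, Mean, Variance) are hypothetical and do not appear in the APPENDIX. The sketch assumes that each information level also collects the quantities of the lower levels, which the description does not state, and it uses the mean and the population variance only as examples of statistics that can be derived later from the primitive quantities.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Level numbers follow the table above; every name here is hypothetical.
enum InfoLevel { kOff = 0, kThreshold = 1, kMeans = 2, kMinMax = 3,
                 kFirstMoments = 4, kSecondMoments = 5, kHistogram = 6 };

// Primitive quantities accumulated by a sensor over one reporting interval.
struct Primitives {
    long   count = 0;
    double sum = 0.0;
    double min = std::numeric_limits<double>::infinity();
    double max = -std::numeric_limits<double>::infinity();
    double sum_squares = 0.0;
    double sum_cubes = 0.0;
    std::vector<long> bins;  // filled only at the histogram level
};

class SketchSensor {
public:
    SketchSensor(InfoLevel level, double bin_width, std::size_t num_bins)
        : level_(level), bin_width_(bin_width) {
        assert(bin_width > 0.0);
        if (level_ >= kHistogram) data_.bins.assign(num_bins, 0);
    }

    // Per-event work is limited to additions and comparisons.
    // Assumption: a level also collects everything the lower levels collect.
    void Record(double value) {
        if (level_ >= kMeans) { ++data_.count; data_.sum += value; }
        if (level_ >= kMinMax) {
            if (value < data_.min) data_.min = value;
            if (value > data_.max) data_.max = value;
        }
        if (level_ >= kFirstMoments)  data_.sum_squares += value * value;
        if (level_ >= kSecondMoments) data_.sum_cubes += value * value * value;
        if (level_ >= kHistogram && !data_.bins.empty()) {
            // Clamp the bin index so out-of-range values land in an edge bin.
            double idx = std::floor(value / bin_width_);
            const double last = static_cast<double>(data_.bins.size() - 1);
            if (idx < 0.0) idx = 0.0;
            if (idx > last) idx = last;
            ++data_.bins[static_cast<std::size_t>(idx)];
        }
    }

    const Primitives& Current() const { return data_; }

private:
    InfoLevel  level_;
    double     bin_width_;
    Primitives data_;
};

// Analyzer-side conversions: the divisions are deferred until after transport.
inline double Mean(const Primitives& p) {
    return p.count > 0 ? p.sum / static_cast<double>(p.count) : 0.0;
}

inline double Variance(const Primitives& p) {
    if (p.count == 0) return 0.0;
    const double m = Mean(p);
    return p.sum_squares / static_cast<double>(p.count) - m * m;
}
```

In this sketch the sensor never divides or takes roots; it only accumulates running sums, which is one way to keep the instrumented application process lightly loaded.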
In addition to standard sensors 42, an application process 41a, 41b can define custom sensors that supplement the standard sensors by recording, for instance, application-specific behavior or network node state, such as described in the related, commonly-assigned patent application entitled "SYSTEM AND METHOD FOR CONTINUOUSLY MEASURING QUALITY OF SERVICE IN A FEDERATED APPLICATION ENVIRONMENT," referenced hereinabove. Custom sensors support extensible measurement of business or organizational units of work that are not available within the standard distribution infrastructure. The observer 43 is also an application process-level object which provides a control point for one or more associated sensors 42. There is one observer 43 for each application process 41a, 41b. Each observer 43 off-loads from its associated sensors 42 the need to manage and transmit collected data by periodically "pushing" the intervalized data generated by its sensors 42 to a collector 44 through a Collector Data Interface 46, as further described hereinbelow. Ordinarily, each observer 43 reports data only when its time interval has expired, although the time interval can be overridden on demand to force immediate reporting. The observer 43 helps to minimize processing overhead by enabling the sensors 42 to defer some computations in favor of the observer 43. The observer 43 supports sensor registration and unregistration and the collected data from multiple sensors 42 is transferred simultaneously to minimize overhead. The observer 43 exports a node transparent sensor access and control interface, named the Performance Measurement Interface (PMI) 45, as further described hereinbelow in the APPENDIX. The collector 44 is a network node-level object which performs local network node data management and control for one or more associated observers 43. There is preferably one collector 44, although multiple collectors 44 or no collector 44 are acceptable configurations. 
For the zero collector 44 configuration, a remote collector (not shown) logically associated with the network node 40a would provide access to the sensor 42. Two or more collectors 44 can be used for improved efficiency. Each collector 44 accumulates sensor data from observers 43 through a Collector Data Interface 46. It "pushes" the accumulated sensor data to an analyzer 48 through an Analyzer Data Interface 50, as further described hereinbelow. The accumulated data from the observers 43 is transferred simultaneously to minimize collection overhead. Ordinarily, each collector 44 reports data only when its time interval has expired, although the time interval can be overridden on demand to force immediate reporting. The collector 44 also controls each sensor 42 through its associated observer 43 via the PMI 45 and provides a sensor registry 47 that contains the states of all registered and unregistered sensors 42. The collector 44 exports a network transparent sensor access and control interface, named the Collector Measurement Interface (CMI) 49, as further described hereinbelow in the APPENDIX. The CMI 49 is a network communication interface which the analyzer 48 uses to communicate with collectors 44 located anywhere in the network for requesting sensor data and specifying sensor configurations. The collector 44 imports an observer data collection interface, named the Collector Data Interface (CDI) 46, as further described hereinbelow in the APPENDIX. Each observer 43 summarizes the sensor data collected by the associated sensors 42 and periodically "pushes" the collected data to the collector 44 using the CDI 46. The CDI 46 eliminates the need for polling of sensors 42 for data by providing an asynchronous data transport channel. The analyzer 48 analyzes the data gathered by one or more associated collectors 44 within a specific network domain, that is, a network node or cluster of nodes. 
The analyzer 48 filters and analyzes the collected data by applying statistical routines to compute the distributional characteristics of the collected data, correlating data from application elements residing in different network nodes 40a, 40b and preparing the data for expert system or human analysis. Simple operations, such as counting events, are most efficiently performed by the sensors 42 or observers 43; the analyzer 48, however, performs complex statistical calculations, such as computing moments and histograms. In addition, any analyzer 48 can request subsets of sensor data from multiple collectors 44 and multiple analyzers 48 can access any specific collector 44. Each analyzer 48 gathers data from the associated collectors 44 through the Analyzer Data Interface 50, as further described hereinbelow. The analyzer 48 provides the data to higher-level objects, such as a presentation service for visualization or an interpretation service for autonomous agent operation, examples of which are presented hereinbelow. A repository 51 is associated with the analyzer 48 and is used for storing configuration data for the sensors 42, observers 43 and collectors 44 operating on each network node 40a, 40b in the network. Finally, the analyzer 48 provides the basis for dynamic end-to-end quality-of-service negotiation and monitoring. The analyzer 48 exports a collector data collection interface, named the Analyzer Data Interface (ADI) 50, as further described hereinbelow in the APPENDIX. The data collected from multiple sensors 42 via their associated observers 43 are batched into a single packet by the collector 44 and "pushed" in bulk to the analyzer 48 through the ADI 50. This interface minimizes the number of network packets exchanged between the collector 44 and the analyzer 48. 
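The batching behavior of the collector 44 can be sketched as follows. This is an illustration only: SensorReport, BatchingCollector, Accept and Tick are hypothetical names that do not appear in the APPENDIX, time is passed in as a plain count of seconds, and the actual transport over the ADI 50 is left out. The force flag models the on-demand override of the time interval.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record for one sensor value received from an observer.
struct SensorReport {
    long   observer_id;
    long   sensor_id;
    double value;
};

// Buffers reports from many observers and releases them as a single batch.
class BatchingCollector {
public:
    explicit BatchingCollector(long interval_sec)
        : interval_sec_(interval_sec), last_push_sec_(0) {
        assert(interval_sec > 0);
    }

    // Called whenever an observer pushes data; nothing is sent yet.
    void Accept(const SensorReport& report) { pending_.push_back(report); }

    // Called periodically. Returns the batch to push to the analyzer, or an
    // empty batch when the interval has not elapsed and no override is given.
    std::vector<SensorReport> Tick(long now_sec, bool force = false) {
        std::vector<SensorReport> batch;
        if (pending_.empty()) return batch;
        if (!force && now_sec - last_push_sec_ < interval_sec_) return batch;
        batch.swap(pending_);        // hand over everything in one transfer
        last_push_sec_ = now_sec;
        return batch;
    }

    std::size_t Pending() const { return pending_.size(); }

private:
    long interval_sec_;
    long last_push_sec_;
    std::vector<SensorReport> pending_;
};
```

With a five second interval, three reports accepted at second 3 would leave the sketch in one batch at second 5, so the analyzer side sees one transfer instead of three.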
Moreover, as the collector 44 periodically "pushes" sensor data to the ADI 50, the collection overhead otherwise incurred by the analyzer 48 to periodically poll each of the collectors 44 for updates is avoided. This technique improves scalability to large networks and reduces collection and processing overhead. However, each collector 44 must use self-describing data techniques to facilitate the interpretation of data formats since each collector 44 and analyzer 48 may reside on separate, heterogenous network nodes 40a, 40b. To summarize, it is critical that the collection of quality-of-service data not significantly impact the application processes and the system under measurement. Optimization techniques are employed in the described embodiment to minimize network bandwidth utilization and to improve scalability to large networks. Data is reported only when the information level thresholds are violated. Intelligent sensors 42 report summarized data periodically, but do not report unchanged data or zero values. Aggregated sensor data at the observer 43 and collector 44 levels are transported in bulk using data "pushing." The intelligent sensors 42, observers 43, collectors 44 and analyzer 48 comprise the basic quality-of-service measurement infrastructure. Additional, higher-level objects can be added to complement operation of the measurement infrastructure. By way of example, the objects can include any or all of the following, also shown in FIG. 3. A presenter 52 can be used to interactively produce reports, displays and alarms by supporting a user interface on a workstation (not shown). The presenter 52 receives intervalized and analyzed data from the analyzer 48 and presents a logically integrated view that resembles that of an application process executing on a single, centralized network node. Visualization techniques are necessary to provide efficient online analysis of the collected data. 
An interpreter 53 is an intelligent entity, either human or expert system, that evaluates and identifies complex relationships between the data, such as obtained using the presenter 52, and the environment and draws meaningful interpretations. The interpreter 53 can rely on extrinsic resources 55, such as dynamic modeling 56 of a hypothetical network, to estimate and compare measured data with quality-of-service requirements. Finally, a controller 54 sets and modifies system parameters and configurations based on an interpretation of the monitored quality-of-service data, such as provided by the interpreter 53. Together with the measurement system and dynamic models, the controller 54 creates a closed feedback loop necessary for providing a self-adapting system that can manage itself with minimal human intervention. The basic measurement infrastructure of the described embodiment has been implemented as an object-oriented computer program written in the C++ programming language, although any programming language, object-oriented or otherwise, would be suitable. The APPENDIX is a listing of pertinent portions of code for the header files defining the standardized application programming interfaces for use in implementing the present invention. The sensors 42, observers 43, collectors 44 and analyzer 48 were written in software to conform to the Open Group's Distributed Computing Environment (DCE) model because of availability and commercial interest. However, the use of the DCE as a distribution infrastructure impacted only the implementation of the sensors 42 and observers 43, but not the interface. The sensors 42 were developed and embedded within a DCE runtime library (RTL) which is linked during program compilation with all application processes 41a, 41b. Thus, the standard sensors 42 were made available without having to modify application process, client or server source code. The observers 43 were also implemented in the RTL. 
The collectors 44 were implemented as daemon processes that communicated with all the observers 43 via interprocess communication through the PMI 45 and CDI 46. The analyzer 48 was also implemented as a daemon process that communicated with all the collectors 44 in the network via remote procedure calls. FIG. 4 shows a state diagram of the basic measurement infrastructure of the quality-of-service monitoring system of FIG. 3. Raw data is observed by a "probe" in each sensor 42. A sensor probe is a software construct referring to a platform-dependent request for raw data, such as a function call to the operating system to obtain processor performance data. Whenever the threshold information level for a sensor 42 has been exceeded, that is, both the time interval has elapsed and the threshold value has been satisfied, the reportable data 90 is collected by the sensor 42 into its local process address space. Whenever two or more analyzers 48 configure a sensor 42 with different threshold values, this collection technique is modified so that modified data 93 is written to collector 44 and collector 44 performs the test for threshold exceptions. Periodically, each observer 43 retrieves the reportable values 90 from the address space for consolidation with data from other associated sensors 42. In turn, each observer 43 copies control data 91 into the configuration address space for each sensor 42. Whenever the time interval for an observer 43 has elapsed, only the modified data 93 for the associated sensors 42 is "pushed" to the collector 44. Zero values or unmodified data 92 are discarded by the observer 43. In turn, each collector 44 sends control data 94 to each observer 43. Whenever the time interval for a collector 44 has elapsed, aggregated modified data 94 for all of the associated observers 43 is "pushed" to the analyzer 48. In turn, each analyzer 48 sends control data 95 to each collector 44. 
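The discarding of zero values and unmodified data 92 by the observer 43 can be sketched as a small filter. The class and method names are hypothetical and are not taken from the APPENDIX. The sketch assumes that "unmodified" means equal to the value the same sensor reported in the previous interval, which is one plausible reading of the description.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Hypothetical observer-side filter: keeps only readings worth pushing.
class FilteringObserver {
public:
    typedef std::pair<long, double> Reading;  // (sensor id, value)

    // Returns the readings that are non-zero and differ from the value the
    // same sensor produced in the previous interval.
    std::vector<Reading> Filter(const std::vector<Reading>& readings) {
        std::vector<Reading> modified;
        for (const Reading& r : readings) {
            std::map<long, double>::iterator it = last_seen_.find(r.first);
            const bool changed =
                (it == last_seen_.end()) || (it->second != r.second);
            last_seen_[r.first] = r.second;   // remember for the next interval
            if (r.second != 0.0 && changed) modified.push_back(r);
        }
        return modified;
    }

private:
    std::map<long, double> last_seen_;  // previous value per sensor id
};
```

Only the returned readings would be pushed to the collector; a sensor that stays idle or constant contributes nothing to the transfer.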
If the analyzer 48 or any other higher-level object (not shown) requires immediate reporting, the request is conveyed downwards as control data and the same data as ordinarily reported is returned, only it is reported immediately. FIG. 5 shows a block diagram of a system for providing end-to-end quality-of-service measurements for client-server and distributed applications in accordance with a further embodiment of the present invention. The system is organized into a hierarchy of objects. The network nodes 60a, 60b, 60c execute a plurality of application processes 61a, 61b, 61c, 61d, 61e, 61f, 61g, each of which is logically divided into application 62, middleware 63, network 64 and operating system 65 layers, each with an associated intelligent sensor 66, 67, 68, 69, respectively. Each of the network layers 64 shares a logical connection 74 over a physical network. The intelligent sensors 66, 67, 68, 69 each consist of the sensors 42, observer 43 and PMI 45 shown in FIG. 3. The sensors 66, 67, 68, 69 for each application process 61a send their collected data to a collector 70 which is responsible for the entire network node 60a. In turn, the collector 70 sends the collected sensor data to an analyzer 71 which is responsible for the entire network. The analyzer 71, together with the correlater 72, filters and correlates the data, reconfigures sensor measurement parameters if necessary, and prepares the correlated data for presentation on a workstation 73. The analyzer 71, correlater 72 and workstation 73 can be located anywhere in the networked environment. Client-server and distributed application quality of service and performance data is captured and monitored as follows. By way of example, assume that the application process 61a running on the network node 60a is a client object and the application process 61c running on the network node 60b is a server object. 
The client object 61a communicates with the server object 61c via a logical connection 74 through their respective network layers 64. The sensors 67 in their respective middleware layers 63 can capture data pertaining to the client-server interactions by observing the information exchanged between the application 62 and network 64 layers in their respective application processes. This data, from both the client object 61a and the server object 61c running on separate network nodes 60a, 60b, respectively, is eventually forwarded to the correlater 72 for consolidation, correlation and reporting in an integrated manner. By way of further example, assume that the application process 61a running on the network node 60a and the application process 61c running on the network node 60b together comprise a distributed application. The component application processes 61a, 61c communicate with each other via a logical connection 74 through their respective network layers 64. The sensors 67 in their respective middleware layers 63 can capture data pertaining to the component interactions by observing the information exchanged between the application 62 and network 64 layers in their respective application processes. This data, from both of the component application processes 61a, 61c running on separate network nodes 60a, 60b, respectively, is eventually forwarded to the correlater 72 for consolidation, correlation and reporting in an integrated manner. The present invention presents several unique quality-of-service monitoring capabilities. First, geographically distributed software processes can be measured, correlated and displayed in a single, integrated view. Second, sensors are uniquely identifiable and sensor data is timestamped so that the data can be correlated. Finally, the placement of sensors in the middleware, such as shown in FIG. 
5, supports the collection of quality-of-service metrics for application processes on heterogeneous operating system platforms and eliminates the need to have specialized instrumentation in each host operating system. Using the terminology of the Reference Model for Open Distributed Processing, the present invention provides correlated quality-of-service metrics across objects (application components) and their channels (network communication), integrates disparate performance measurement interfaces from a node's nucleus object (operating system), and efficiently transports collected data from network nodes to management stations. The data collected is useful to application designers, model developers, quality-of-service monitors and distributed application managers. In summary, the present invention can correlate application resource usage across nodes; locate or measure application server response times; monitor application servers on a per function basis; judiciously report data to minimize collection overhead for large networks; provide a single, integrated, correlated view of a distributed application; and large networked environments (exceeding a million network nodes) can be efficiently monitored. Having described and illustrated the principles of the invention in a preferred embodiment thereof, it should be apparent that the invention can be modified in arrangement and detail without departing from such principles. We claim all modifications and variations coming within the spirit and scope of the following claims.
__________________________________________________________________________
APPENDIX: application programming interfaces source code
//----------------------------------------------------------------------------
//
// File:         collector.h
// Description:  Collector class definitions.
//
// Language:     C++
//
// (C) Copyright 1997, Hewlett-Packard, all rights reserved.
//
//----------------------------------------------------------------------------
#ifndef COLLECTOR_CLASS
#define COLLECTOR_CLASS
/**********************************************************************
Module Name: collector.h/collector.cc
Module Overview
===============
This module provides the implementation of the collector's two interfaces:
the PRI/PMI and the NPRI/NPMI. The operations are broken up into two parts.
The PRI/PMI operators include:
  Collector::AddObserver(..)     - register observer
  Collector::AddSensor(..)       - register sensor
  Collector::Data(..)            - report new sensor data
  Collector::DeleteSensor(..)    - delete observer sensor
  Collector::DeleteObserver(..)  - delete observer
The above are handler routines that are called based on an observer's PRI
request.
The NPRI/NPMI operators include:
  Collector::GetDirectory(..)    - returns observer and sensor lists to
                                   analyzers
  Collector::DataRequest(..)     - analyzer registers interest in
                                   observer/sensors
  Collector::AddSensors(..)      - register interest in additional sensor
  Collector::ChangeInfoLevel(..) - change reporting level of sensor
  Collector::GetData(..)         - get latest sensor data
  Collector::DeleteAnalyzer(..)  - remove analyzer
  Collector::DeleteObservers(..) - remove observer
  Collector::DeleteSensors(..)   - remove sensor
A typical way one of these handler routines is invoked is when a request is
received from either an observer via the ipc abstraction layer (for PRI/PMI
requests) or an analyzer via an rpc call (for NPRI/NPMI requests).
************************************************************************/
#include <constants.h>
#include <arrayList.h>
extern "C"
{
#include <pthread.h>
}
#include <dmsdaemon.h>
#include <dmsdaemonS.H>
#include "basic-defs.h"
#include <collector-ipc.h>
#include <baseDataType.h>
#include <observerEntry.h>
const long OBSERVERS_PER_NODE = 20;      // For optimum array size
const long COLLECTOR_CYCLE_SEC = 1;
const long COLLECTOR_CYCLE_MSEC = 1000;
class Collector : public dmsdaemon_2_0_ABS // For DCE-RPC using C++ classes
{
private:
    // Observers, sensor management
    pthread_mutex_t _observer_lock;      // MUTEX LOCK
    long            _num_of_observers;   // Number of observers
    ArrayList       _observers;          // Array of active observers
    // Analyzers management
    pthread_mutex_t _analyzer_lock;      // MUTEX LOCK
    long            _num_of_analyzers;   // Number of analyzers
    ArrayList       _analyzers;          // Array of analyzers
    // Delay for reading the IPC channel and RPC to analyzers
    long            _delay_interval;     // in seconds
    // Log file to write the data
    FILE           *_log_file;
    long            _log_file_count;
public:
    Collector();
/************************************************************************
The following functions are called to handle an observer's request. They
support the PRI/PMI collector interface (mostly the PRI). For example, they
support the registration of observers and sensors, the reporting of sensor
data, etc.
************************************************************************/
    // To add a new observer
    void AddObserver(long   /* Observer ID */,
                     char * /* Service Name */,
                     char * /* Directory */,
                     long   /* Observer pid */,
                     char * /* hostname */);
    // To add a new sensor
    void AddSensor(long   /* Observer ID */,
                   long   /* Sensor ID */,
                   long   /* Number of Info. levels */,
                   long * /* Pointer to data formats */,
                   char * /* Sensor name */,
                   char * /* Interface name */,
                   char * /* Function (Manager) Name */);
    // Receiving new data from the Observers
    void Data(long   /* Observer ID */,
              long   /* Sensor ID */,
              char * /* Pointer to the data region */,
              int    /* Size of data */);
    // Deleting sensors
    void DeleteSensor(long /* Observer ID */,
                      long /* Sensor ID */);
    // Deleting observers
    void DeleteObserver(long /* Observer ID */);
    // Method called by the time_out functions to obtain the
    // time out interval.
    long GetDelayInterval() { return _delay_interval; };
    // To print the contents of the collector -- Observers, Sensors and data
    void Print();              // Print to screen
    long Write(FILE * = 0);    // Write to a file
/************************************************************************
The following functions are called to handle an analyzer's (PMT's) request.
They support the NPRI/NPMI collector interface. The initial IPC protocol
was DCE/RPC.
************************************************************************/
    ObserverEntry* GetObserver(long observer_id);
    // Directory service, to obtain Observer and Sensor names
    SensorDir_p GetDirectory(DirRequest * /* service */);
    // RPC call made by an analyzer requesting data
    // for a single Observer and multiple sensors within that observer,
    // their information levels and frequency of updates
    long DataRequest(long           /* analyzer_id */,
                     long           /* time_interval */,
                     uuid_t         /* object_uuid */,
                     ServiceNames * /* m_services */,
                     SensorNames *  /* m_sensors */,
                     CollectorID ** /* collector_id */);
    // To add a set of sensors to an Observer
    long AddSensors(long           /* analyzer_id */,
                    long           /* service_id */,
                    long           /* frequency */,
                    SensorNames *  /* m_sensors */,
                    CollectorID ** /* collector_id */);
    // Analyzers changing information levels, frequency and
    // time_outs for a set of sensors
    long ChangeInfoLevel(long               /* analyzer_id */,
                         long               /* time_interval */,
                         ServiceFrequency * /* m_service_id */,
                         SensorInfoLevel *  /* m_sensor_id */);
    // Routine called periodically to transmit data
    void TransmitData();
    // Analyzers directly obtaining data from the collectors
    GatheredData_p GetData(long /* analyzer_id */);
    long DeleteSensors(long        /* analyzer_id */,
                       ServiceID * /* service_id */,
                       SensorID *  /* sensor_id */);
    long DeleteObservers(long        /* analyzer_id */,
                         ServiceID * /* service_id */);
    // To delete analyzers from the collectors
    long DeleteAnalyzer(long /* analyzer_id */);
};
// Set to determine debugging output mode.
extern int _verbose_mode;
#endif
//----------------------------------------------------------------------------
//
// File:         observer.h
// Description:  Observer class definitions.
//
// Language:     C++
//
// (C) Copyright 1997, Hewlett-Packard, all rights reserved.
//
//----------------------------------------------------------------------------
#ifndef OBSERVER_CLASS
#define OBSERVER_CLASS
#include <sensor.h>
#include <baseDataType.h>
#include <arrayList.h>
#include <observerinfo.h>
#include <observer-ipc.h>
const long OBSERVER_CYCLE_SEC = 5;
const long OBSERVER_CYCLE_MSEC = 5000;
// This class encapsulates all the Sensors contained in a service. Its
// methods are accessed by watchdog threads off the RPC RTL timer queue
// periodically for shipping performance data across address spaces using
// Unix IPC mechanisms.
class ObserverClass : public BaseType
{
private:
    char *_observer_name;    // user specified logical name
