System and method of enterprise systems and business impact management
6983321
Abstract
A system architecture and a method for management using a cellular architecture to allow multi-tier management of events, such as managing the actual or potential impact of IT infrastructure situations on business services. A preferred embodiment includes a high availability management backbone to frame monitoring operations using a cross-domain model where IT Component events are abstracted into IT Aggregate events. By combining IT Aggregate events with transaction events, an operational representation of the business services is possible. Another feature is the ability to connect this information to dependent business user groups such as internal end-users or external customers for direct impact measurement. A web of peer-to-peer rule-based cellular event processors, preferably using Dynamic Data Association, constitutes a management backbone crossed by event flows, the execution of rules, and a distributed set of dynamic inter-related object data rooted in the top data instances featuring the business services.
Claims
What is claimed is:
Description
FIELD OF THE INVENTION
The journal file 16 stores the transactions performed by the event processor 10. For each transaction, the event processor 10 records the changes (addition, suppression, modification) it performs on the events 18 and data objects 20. At regular intervals, and when the system is not overloaded, the event processor 10 can trigger the update of the state file 14. In a preferred embodiment, the update is performed by an independent process that reads the old state file 14, then reads the journal 16, and applies the changes in order to produce a new state file 14. At that stage the old state file 14 may be deleted. When triggering the production of a new state file 14 by state builder 22, the event processor 10 can continue its operation, but it will open a new journal file 16. The advantage of this system is that the event processor 10 focuses on writing the changes sequentially in the journal file 16. This preferred method is a simple, lightweight, and fast operation compared to updating tables in a relational database system. Advantageously, the event processor 10 can better handle a massive number of events 18 arriving over a short period of time. This massive number of events 18 is sometimes referred to as an event storm.
With each of the event processors distributed across the environment storing events, there is no single place where all the events are located. Therefore, the event console GUI connects to a plurality of event processors in order to provide a better overall picture. The event processors are also able to forward events and data among themselves. With events being forwarded to many different places, there is a need to keep track of where each event came from and where each event is propagated. Without this information, it would be very difficult to update all the copies of events when a change is performed, thus leading to inconsistencies between the event processors.
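The journal and state file mechanism described earlier in this passage can be illustrated with a minimal Python sketch. It is not the patented implementation: the JSON record layout, the operation names, and the functions `apply_change` and `build_state` are assumptions made for illustration only.

```python
import json

def apply_change(state, change):
    """Apply one journaled change to the state dictionary (hypothetical record layout)."""
    op, key = change["op"], change["id"]
    if op in ("add", "modify"):
        state[key] = change["value"]
    elif op == "suppress":
        state.pop(key, None)
    else:
        raise ValueError(f"unknown journal operation: {op}")
    return state

def build_state(old_state, journal_lines):
    """Replay the journal, in order, on top of a copy of the old state."""
    new_state = dict(old_state)
    for line in journal_lines:
        apply_change(new_state, json.loads(line))
    return new_state

# The event processor only appends records like these to its journal.
journal = [
    json.dumps({"op": "add", "id": "ev1", "value": {"status": "OPEN"}}),
    json.dumps({"op": "modify", "id": "ev1", "value": {"status": "CLOSED"}}),
    json.dumps({"op": "add", "id": "ev2", "value": {"status": "OPEN"}}),
    json.dumps({"op": "suppress", "id": "ev2"}),
]

# A separate state builder produces the new state from the old state plus the journal.
new_state = build_state({}, journal)
```

The point of the design is that the processor itself performs only sequential appends, while the replay work is left to the independent builder process.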
In a preferred embodiment, the tracking information is stored directly in the events. In this fashion, each event object contains a set of fields that store the information necessary to know where the event is coming from and where it has been propagated. In the event that the propagation cannot connect immediately to the destination, the event processor may buffer the propagation information and periodically try to reestablish the connection. When the connection is reestablished, the event processor will propagate the information. The propagation information can also be used by the GUI to connect to the event processors that have a copy of the event based on a review of the event description. The GUI can display the path of propagation and, when connected to the event processors on that path, the GUI can explore how the event relates to others in that particular processor. In this manner, the system is able to start from one single event description in one event processor and explore the other processors that worked on the event in order to provide a complete picture to the operators.
Within one event processor, relationships can be created between events. An example of a relationship is the cause/effect relationship, which can link an event considered as a probable cause to its multiple effects. Another example is the abstraction relationship, which is used to build one event out of several others. The abstraction relationship can provide a higher level of information in one single event description. Through the abstraction relationship, an abstract event is produced and linked to the abstracted events. The abstraction event can be viewed as one event that summarizes the problem reported by the multiple abstracted events. Another use of the abstraction event is to provide a generic description of problems so that a complete model of analysis can be built without focusing on the exact format of the events that the monitoring sources are going to use.
This is helpful for working in distributed environments where multiple different monitoring sources can be used. Rules are typically used to set up abstractions. An abstraction rule is triggered by the arrival of many different classes of events and generates a single event description. The rule instructs the system on how to produce the abstraction from the information coming in the received events. This method allows for different event descriptions of the same problems to be reformatted into a generic abstraction.
The following provides an example of the versatility of the invention. For instance, two different monitoring programs are able to report events about the disks attached to server hardware. The two monitoring programs are likely to use different formats for their event representations. Monitoring Software A may report the problem with the following format:
Monitoring Software B may report the problem with the following format:
It is impractical to build a model for the analysis of the event that relies on those specific event formats because they use different fields. That is where it is useful to set up an abstraction. In a preferred embodiment, the abstraction may use the following format:
Accordingly, using this abstract rule process, the format is not limiting.
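As a minimal sketch of such an abstraction, the Python function below maps two source-specific events onto one generic DISK_PROBLEM description with 'System' and 'Disk' fields. The two source formats, including the field names `hostname`, `device`, and `target`, are hypothetical and are not the formats used by any actual monitoring program.

```python
def abstract_disk_event(event):
    """Map a source-specific disk event onto a generic DISK_PROBLEM description."""
    source = event["source"]
    if source == "A":
        # Hypothetical format for Monitoring Software A: separate host and device fields.
        return {"class": "DISK_PROBLEM",
                "System": event["hostname"],
                "Disk": event["device"]}
    if source == "B":
        # Hypothetical format for Monitoring Software B: "host:disk" packed in one field.
        system, disk = event["target"].split(":", 1)
        return {"class": "DISK_PROBLEM", "System": system, "Disk": disk}
    raise ValueError(f"no abstraction rule for source {source!r}")

event_a = {"source": "A", "hostname": "srv01", "device": "sda"}
event_b = {"source": "B", "target": "srv01:sda"}

# Both reports reduce to the same generic description.
abstraction_a = abstract_disk_event(event_a)
abstraction_b = abstract_disk_event(event_b)
```

Downstream analysis can then be written against the generic description only, regardless of which monitoring program reported the problem.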
Preferably, to avoid generating duplicate abstractions, the definition of the abstraction format should contain declarations such that the system can detect the generation of a duplicate. In the current example, a duplicate generation needs to be avoided when the two monitoring programs report the problem for the same 'Disk' on the same 'System'. The declaration of the DISK_PROBLEM event class could look like this:
Assuming 'System' and 'Disk' are the only two fields being declared as part of the duplicate key, this means that two DISK_PROBLEM events are considered duplicates when they have the same values for their fields 'System' and 'Disk'. Assuming all those declarations and rules are available, the event processor can then generate a single description like
If events reporting the same problem on the same disk are reported by the two monitoring programs, both original events generated by those programs will remain in the event processor and be linked through an abstraction relationship with one single DISK_PROBLEM event. The big advantage is that the rest of the analysis can be based on the DISK_PROBLEM event. With the abstraction relationship explicitly recorded between events, it is possible to explore from the GUI which events produced the abstraction. The exploration of those relationships is done through the same interface that enables exploration of the propagation paths followed by events.
With event processors distributed across multiple different architectures, it is most preferable that the Knowledge Bases prepared by administrators can be distributed independently of the target architecture. For example, the same knowledge base could be distributed indifferently to workstations running, for example, Sun Solaris™ or Microsoft Windows® NT. One method to accomplish this goal is to include a rule interpreter. Because former rule interpreters performed poorly, it is preferable to use a virtual machine in the event processor capable of interpreting intermediate byte-code. An example of such a virtual machine is a Warren Abstract Machine (WAM). Using a compiler that produces Warren Intermediate Code (WIC) from the rules defined in the knowledge base, the WIC code may be maintained independent of the target architecture. Because administrators may want to preserve the integrity of the code of their knowledge and discourage reverse engineering, the rules compiler is preferably capable of encoding the intermediate code into a non-readable format.
Time synchronization across the distributed event processors is important for time dependent analysis. For time dependent analysis, each event received by an event processor is first time-stamped, and the timestamp is stored in the event itself.
When an event is propagated from one event processor to another, it is preferable to preserve the original timestamp. Because of the nature of the invention, it is possible that the origin processor and the target processor run on two different computers. Those computers may not have their clocks synchronized. For the origin processor, "present time" may be 2:00 P.M. For the target processor, "present time" may be 2:05 P.M. To keep this inconsistency from affecting the time dependent analysis, the event processors may have to apply a correction to the timestamps based on an estimation of the difference between the clocks of the systems. By establishing some threshold criteria on the estimated difference between clocks, the event processors can estimate the differences between the clocks and account for the discrepancy.
A preferred embodiment of the invention includes an auto-limitation feature during heavy analysis. When an event processor is installed on a business-critical server, the administrator may want to auto-limit the event processor, i.e., restrict its processing power to a percentage of its capabilities. This avoids situations where the event processor would consume too many of the server's resources because of an event storm. In this embodiment, the event processor has to benchmark its own operations regularly and adapt its auto-limitation accordingly. During normal operations, the event processor will attempt to evaluate its processing capabilities while the other event processors continue to function on the computer. This benchmark will provide the upper limit of the work the event processor allows itself to perform. This upper limit may be determined by simulating full processing, including such activities as parsing, analysis, and storage of a predefined set of events. Running this benchmark at regular intervals allows adapting the auto-limitation to the actual load supported by the computer at different times.
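The benchmark-based auto-limitation can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `benchmark_rate` and `allowed_rate`, the `limit_percent` setting, and the use of events per second as the measure of processing power are all hypothetical.

```python
import time

def benchmark_rate(process, sample_events):
    """Estimate full processing capacity (events per second) on a predefined sample."""
    start = time.perf_counter()
    for event in sample_events:
        process(event)  # stands in for parsing, analysis, and storage
    elapsed = time.perf_counter() - start
    return len(sample_events) / max(elapsed, 1e-9)

def allowed_rate(full_rate, limit_percent):
    """Upper limit the processor grants itself: a percentage of the benchmarked rate."""
    if not 0 < limit_percent <= 100:
        raise ValueError("limit_percent must be in (0, 100]")
    return full_rate * limit_percent / 100.0

def simulated_processing(event):
    """Placeholder workload used only for this sketch."""
    return sorted(event.items())

sample = [{"id": i, "severity": "WARNING"} for i in range(1000)]
full_rate = benchmark_rate(simulated_processing, sample)

# Re-running the benchmark periodically would refresh full_rate as the load changes.
limit = allowed_rate(full_rate, limit_percent=25)
```

A processor using this sketch would compare its current event throughput against `limit` and delay further processing whenever the limit is exceeded.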
The administrator of the system can tune that auto-limitation by setting a parameter that instructs the processor to limit itself to a given percentage of the full processing power estimated through the benchmark. The event processor thus adaptively throttles its capabilities in order to preserve resources for the critical applications with which it has to share the computing resources.
By combining cells into a network, events can not only be processed as close as possible to the event source, but smarter processing of the events can occur. Each cell has access to a different view of the enterprise and events can not only be analyzed in the context of each other, but in the context of the current cell. This context is provided by the knowledge base and through the execution of external commands. Cells propagate specific events to other cells as appropriate depending on the results of the analysis.
Several criteria can be applied to configure a network of cells and propagation of events within this network. One cell can be installed at every site where the company has facilities in order to work on the events collected from equipment located in these facilities. A second level of cells can be installed to receive events from all the cells within a particular location (e.g., country or state). A top-level cell can collect from any cell at the second level in order to provide a worldwide view. Some cells can be dedicated to collecting events related to database servers, while some others are dedicated to mission-critical applications. Multiple levels of cells can be used in order to provide application-specific and enterprise-wide views of the functional areas. Cells can be set up following organizational unit boundaries (e.g., departments). This type of configuration makes it possible to create a hierarchy of cells that mimics the hierarchy of organizational units.
The network of cells can combine any of the criteria mentioned here as well as any other criteria. Typically the result will be some kind of multi-level network with a directed flow of events, but not necessarily a strict hierarchy. In order to provide scalability, the cells at lower levels are tuned to filter, aggregate, or establish relationships between events and propagate only important events to some other cells.
Each cell is configured to group events into collectors. A representative event collector is shown in FIG. 5. Collectors are simply sets of events that meet pre-specified criteria. Collectors provide the ability to specify how the events are displayed to the event browsers. The collectors defined for one cell are published to any event browser that connects to the cell. Collectors are typically organized into hierarchies so specialized collectors may be combined into more generic collectors. Criteria used for defining the collectors range from location in the network, generating application, and organizational unit to service levels. Collectors are presented in the event browser as an expandable tree with color-coded severity indicators. For each collector, the operator can view a list of all the events belonging to it. Collectors are defined in the knowledge base loaded by each cell.
Only significant events or events containing high-level descriptions of problems should be reported to the top-level cells in the network shown in more detail below. However, many events have been evaluated in order to decide what information to propagate upwards in the network of cells. These events are stored locally by the cells and can be of interest for operators who want to go into more detail about some of the reported problems. To that effect, the event browser provides a "drill-down" capability where it is possible to explore the path that was followed by the events as well as relationships established between the events by the rules applied in the cells.
Each cell is named and the cell directory provides the ability to reference cells by name independent of location. Cells and event browsers rely on the directory to establish the connection with cells. Through careful definition of cell directories, independent domains of cells or sub-domains can be established to allow different operators to access different levels of cells. Notably, communication with a cell can be protected with encryption. When protected, communications can only be established if the key is known. Each cell can trigger local actions in response to event situations or patterns. The actions can be taken as the result of the analysis performed by rules loaded from the knowledge base or/and by an operator through the event browser. The actions that can be triggered interactively are declared in the knowledge base. The programs that are executed in response to events have access to the complete event description. The execution of the programs occurs on the workstation where the cell is installed. The cells are not active probes or agents. They do not poll to detect events. Event detection can be done using existing tools on the market. These tools may have different conventions for encoding the events. Natively in the preferred embodiment, the software understands events coded using the BAROC language. Other formats can be transformed into BAROC descriptions by the use of adapters. The BAROC language is used to define data structures describing objects or entities. The language has roots in the object-oriented data structuring techniques. Classes or types of entities (e.g. events) are defined and then instances of the defined object types are created. A class defines the fields that can be used in the description of instances of each type of event. In BAROC terminology, these fields are called slots. BAROC is a highly structured language and provides the ability to capture the semantics of the events in a format suitable for processing. 
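The class, instance, and slot concepts can be illustrated with a small Python analogue. This is not BAROC syntax; the `EventClass` helper, the slot names, and the default values are assumptions chosen for the sketch.

```python
class EventClass:
    """A class definition listing the slots allowed in its instances (illustrative only)."""

    def __init__(self, name, slots):
        self.name = name
        self.slots = dict(slots)  # slot name -> default value

    def new_instance(self, **values):
        """Create an instance, rejecting slots the class does not declare."""
        unknown = set(values) - set(self.slots)
        if unknown:
            raise KeyError(f"undeclared slots for {self.name}: {sorted(unknown)}")
        instance = {"class": self.name}
        instance.update(self.slots)   # start from the declared defaults
        instance.update(values)       # then apply the supplied slot values
        return instance

# Define a class once, then create structured instances of it.
disk_problem = EventClass("DISK_PROBLEM",
                          {"System": "", "Disk": "", "severity": "WARNING"})
event = disk_problem.new_instance(System="srv01", Disk="sda")
```

Because every instance is structured when it is created, later tests and queries can address slots directly instead of scanning free text.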
Some event management solutions use free text or message representation of events that can be expensive to process. It is better to structure the information conveyed by an event once when it enters the system rather than propagate a free-text representation of an event everywhere and rely on text-scanning functions to extract information each time it is needed. The BAROC language supports structured text classes and instance definitions with a very simple syntax as illustrated in the two previous figures. The cell reads these definitions, interprets the structure and builds an optimized representation of the event. Events not represented in the BAROC language cannot be sent to a cell. Such events have to first go through an adapter that translates the events into BAROC before sending the translated representation to a cell. This portion of the invention comes with adapters that can translate Simple Network Management Protocol (SNMP) events, events from the NT Event Log and generic text log file entries (i.e. syslog). A self-contained command is also available to post events from scripts or directly from a terminal session. To use the data model built in BAROC, software has query and test facilities that work explicitly on the concept of classes and slots. These facilities unleash the power of the event data model. The event processor makes heavy use of these in the analysis of events. The event processor or cell runs as a background process and may collect events, analyze events, respond to events, store events, propagate events, and/or group events. The cell builds the event collectors that are used by the browsers to present the events to users. These collectors are dynamic and an event may move among collectors as slot values for the event change. Configuration of the cell is done through a limited number of configuration files and through a knowledge base. 
The knowledge base encompasses the class definitions of the events that the cell can process, the rules to build the collectors, the rules to perform the analysis of the events and, optionally, executables for the external actions that may be triggered in response to events. The configuration of the cell to support those different functions is done through a limited number of configuration files and the knowledge base. The knowledge base itself may contain class definitions of the events that the cell can process, rules to build the collectors, rules to perform the analysis and correlation of the events, rules to propagate events to other cells, and executables for the external actions that may be triggered in response to events.
As mentioned above, each cell works independently from its neighbor cells. If communication between cells is not possible at some point in time, all cells continue to do their work and simply buffer what they need to propagate to others. They catch up when communications are reestablished.
Event processing is configured through rules included in the knowledge base. The rules are defined using the classes of events declared in the knowledge base. The analysis of the events is organized into nine different phases as shown in FIG. 6. Each phase usually has a well-identified mission that allows the rule language to be greatly simplified and enables a strict organization of rules. This organization makes it possible to provide a fully functional GUI-based editor for the knowledge base. Users have a choice of using the knowledge base editor or editing the rules files directly. The setup of the event analysis into phases with an appropriate rule language provides a goal-oriented process for writing rules instead of a programming exercise. Administrators can focus on what they want to happen rather than how to write a rule.
Basically, the rules are statements which combine tests and queries on the BAROC data model with actions to be performed depending on the type of rule.
The first phase is refine. This phase is dedicated to "polishing" the events and collecting information that may be missing in the event description. It results in updating slot values of the events so as to standardize them. The next phase is filter. This phase determines which events are going to be processed further. It enables discarding of unwanted and/or irrelevant events. Following the filter phase, the regulate phase occurs. This phase handles duplicate events. It enables the cell to wait for a given number of repeated events within a specified time window before forwarding an event to the next phase. A conditional reset mechanism implements a hysteresis behavior. The next phase is update. In this phase, the system looks for previously received events that need to be updated with the information conveyed in a newly received event. Following update, the abstract phase takes place. During this phase, the cell tries to summarize events into a higher-level event description, as discussed above. It can help in dramatically reducing the number of events that need to be propagated. Furthermore, the analysis includes a correlate phase. This phase is used to compute the cause-effect relationships between events. The transitivity between the cause-effect relationships leads to the identification of the root cause of problems. Following correlate, the execution phase occurs. During this phase, the cell executes actions when an event satisfies certain conditions. Triggering of the execution can be based on dynamic conditions such as a slot value change. The timer phase may occur next. This phase introduces actions to be executed after a timer has expired. It provides a delayed execution mechanism. Finally, the propagate phase occurs. This phase defines which events get propagated and where they are propagated.
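The ordering of the nine phases can be sketched as a simple Python pipeline. The phase names follow the description above, while the `run_phases` function, the handler signatures, and the convention that a handler returning `None` discards the event are assumptions made for illustration.

```python
PHASES = ["refine", "filter", "regulate", "update", "abstract",
          "correlate", "execute", "timer", "propagate"]

def run_phases(event, handlers):
    """Pass an event through the phases in order; a handler returning None drops it."""
    trace = []
    for phase in PHASES:
        handler = handlers.get(phase)
        if handler is None:
            continue  # no rule defined for this phase
        trace.append(phase)
        event = handler(event)
        if event is None:
            break  # event discarded, later phases never see it
    return event, trace

def refine(event):
    """Standardize slot values, here by upper-casing the severity."""
    return dict(event, severity=event["severity"].upper())

def drop_info(event):
    """Filter out events that are not worth further processing."""
    return None if event["severity"] == "INFO" else event

def mark_propagated(event):
    """Stand-in for the propagate phase: record a hypothetical destination cell."""
    return dict(event, propagated_to=["top_cell"])

handlers = {"refine": refine, "filter": drop_info, "propagate": mark_propagated}

dropped, dropped_trace = run_phases({"severity": "info"}, handlers)
kept, kept_trace = run_phases({"severity": "critical"}, handlers)
```

Keeping each phase to one narrow job is what lets the rules for a phase stay simple, since a rule never has to decide when in the analysis it runs.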
Events may be propagated immediately after reception as well as later depending on slots in the event description receiving specific values. In a preferred embodiment, the present invention uses DDA technology to break out implementation-specific, topological data from the rules and put that data in tables, so they can be updated on the fly at runtime. Then instead of writing rules against specific data items, the user may write them against the data contained in the DDA tables. FIG. 7 shows the use of data associations to learn an environment. Using DDA technology, it is possible to prepare a knowledge base that performs intelligent analysis of an IT infrastructure without coding anything specific about the infrastructure directly in the rules. Instead, the rules are dynamically associated with data representing that specific knowledge. It is sufficient to provide the event processor with the data associations in order to initiate the analysis process to a specific infrastructure. If something changes in the environment, providing the updates makes the event processor adapt itself to the new situation dynamically, without recoding the rules. The data representation allows the recording of IT infrastructure element properties, as well as the relationships between the different elements. A complete set of elements and their relationships can be coded in order to get a full description of a complex environment. The event processor uses the BAROC language for data encoding, the same language used for event encoding. Using BAROC for data representation enables reuse of the same query/test facilities on the data and/or on the events. When creating a knowledge base, BAROC classes are defined. These classes enumerate tags that can be used to describe the instances. The data code can include just about anything: topology information, application setup information, components, dependencies, and similar information. 
Preparation of data classes in the knowledge base is a requirement for the cell to interpret instances provided at runtime. Having the data structures defined, it is then possible to write rules that refer to the data structures without knowing any specifics about the instances. When the rules are evaluated, the event processor is able to search for specific data received as instances. Data can be sent to the event processor or updated while it is at work. Changing the available instances of a given data class modifies the evaluation of rules that refer to that specific class. Therefore, it is possible to build generic rules that automatically adapt themselves to changes in the IT environment.
To make the data useful, it needs to be associated with rules. The idea is that when a rule is evaluated, it queries the data to decide in which context the rule is applicable. When creating rules, the data instances are not known. Therefore, the association statements must be expressed as queries on the set of data. The rules are then further evaluated using the solution(s) from these queries. When the event processor receives an event reporting that a service is not available, it must search for applications running on the affected workstation and then find which of those applications depends on the failing service. Thus, it is possible to associate combinations of data elements with rules, taking into account the complex dependencies found in distributed IT environments.
A knowledge pack is simply a canned knowledge base that can be used immediately by an event processor in order to perform intelligent analysis on events received. Knowledge packs can be prepared for varied typical environments and/or applications. The knowledge packs include event class definitions, actions that can be triggered in response to events, data class definitions, and rules. Experts prepare these knowledge packs.
They define the data classes that are used in the rules and expect data instances to be created for a specific environment. The data instances can be provided explicitly by the administrators (through a GUI application or from the Command Line Interpreter ("CLI")) or can be automatically generated by an auto-discovery agent. Providing the cell with data instances is relatively simple. It is at least several orders of magnitude simpler than coding a complete knowledge base. If the pre-built knowledge pack requires modification for special situations, it is possible to use the graphical Knowledge Base Editor to introduce the required changes.
The user can adjust the behavior of the present invention on the fly, at runtime. If an application is moved, the user may simply update the tables with its new location. If new servers are added to a web farm, the user may insert them into the tables and the rules will use the new information automatically. If one needs to bring a whole new line of business under management, one just adds the information to the tables. This gives unprecedented benefits to the enterprise. For example, maintenance costs drop immensely. Instead of having a team of dedicated rule writers on call who change the rules for every change on the monitored systems, the user writes the rules once and then creates automated systems, such as a web site, for updating the DDA tables when the environment changes. An example of DDA is a rule that takes each record in the Close_Event_Table and adds it to the message slot:
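As a rough Python analogue of such a table-driven rule (not the rule language itself), the sketch below walks every record of a small Close_Event_Table and appends it to the message slot of an OPEN event. The two-column record layout, the `status` and `message` keys, and the function name are assumptions for illustration.

```python
# Each record pairs two event class names; the table can be changed at runtime.
close_event_table = [
    ("HOST_UP", "HOST_DOWN"),
    ("PROCESSOR_UP", "PROCESSOR_DOWN"),
    ("NFS_SERVER_UP", "NFS_SERVER_DOWN"),
]

def append_table_to_message(event, table):
    """Return a copy of an OPEN event with every table record appended to its message."""
    if event.get("status") != "OPEN":
        return event  # the rule only applies to OPEN events
    records = " ".join(f"{first}, {second};" for first, second in table)
    updated = dict(event)
    updated["message"] = (event.get("message", "") + " " + records).strip()
    return updated

first_result = append_table_to_message({"status": "OPEN", "message": ""},
                                       close_event_table)

# Updating the table changes the outcome for the next event; the rule is untouched.
close_event_table.append(("HOST_OK", "SWAP_FULL"))
second_result = append_table_to_message({"status": "OPEN", "message": ""},
                                        close_event_table)
```

The rule logic refers only to the table, so extending its scope is a data update rather than a code change.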
Any OPEN event that is received by the cell will have the following appended to its message slot: "HOST_UP, HOST_DOWN; PROCESSOR_UP, PROCESSOR_DOWN; NFS_SERVER_UP, NFS_SERVER_DOWN; CLEAR_MAINTENANCE_EVENT, SET_MAINTENANCE_EVENT; HOST_OK, SWAP_FULL;ntDiskPercentUsage, by universal_swapavail;" reflecting the fact that this rule will walk every record in a DDA table. The user may also update the table on the fly at runtime, extending the scope of a DDA enabled rule quickly and easily. If the user has another pair of classes due to client changes, the user may add them using the client, and the next incoming event will use the new data.
A representative high availability management backbone is depicted in FIG. 8. From a general perspective, such a backbone can be regarded as a cellular network or a group of interconnected cellular networks 135a, 135b spreading over several locations and possibly several companies. In the latter case, each company can actually operate its own backbone and allow only a limited set of interactions, both from a technical and a functional standpoint, with the other backbones. A backbone is typically made, at the low end, of multiple service processors 130a-130i either collecting events from external monitoring sources 903 or using embedded instrumentation functions to actively monitor some IT Components and generate their own events 901, 902. Those complementary actions are all maintained as IT Indicators and relate to IT Components. Based on the dependencies existing between the IT Components, an incoming instrumentation event can lead to the generation of additional dependency events interpreted in the local processor or propagated to the remote service processor(s) owning the dependent IT Components. Similarly, dependency events can lead to the generation of new dependency events.
As a result, horizontal event flows are created throughout the access layer, as illustrated by the arrows between the service processors such as 130a→130b→130c; 130e→130d; 130e→130f. IT Component events, i.e., instrumentation and dependency events, are all abstracted by the service processors into IT Aggregate events that are then propagated 904 to the domain processors 120a-120c of the Abstraction Layer. Abstraction and propagation are made according to the specific "interest" of each domain processor. Using a system component referred to as an e-Console, an operator can connect to the domain processors in order to view or manipulate those events 905, including drilling down to the underlying events in the Access Layer.
As shown in FIG. 9, IT Aggregate events are all abstracted by the domain processors 120a-120c into IT Path events that are then propagated 905 to the service processors 115a, 115b of the Business Layer. Abstraction and propagation are made according to the specific "interest" of each service processor. In parallel, each service processor can generate Site Transaction Emulation And Detection (STEAD) activation or de-activation requests. An activation request encompasses one sample site application transaction emulation sub-request (including frequency) completed with several detection sub-requests (including function name and input data) dispatched along a specific transaction IT Path. A de-activation request disables all the sub-requests of an activation request for a given site application transaction (SAT). Such requests are submitted directly to the service processors, although they can be relayed by an elected domain processor when required, for example, when restricted access applies to a remote location. Information about which processor(s) should be contacted for a STEAD request is provided on demand by ODS processors which maintain the appropriate mapping table.
When receiving a STEAD emulation sub-request 906a, a service processor permanently enables the sample site application transaction and triggers its execution 907a, in accordance with the specified frequency, using an incremental SAT-specific identification tag. For each cycle, it sends back an execution confirmation event 909a containing a timestamp and the last SAT tag used. When receiving a STEAD detection sub-request 906b, a service processor permanently activates 907b the specified instrumentation function with the input data in order to capture any execution information related to a sample site application transaction. For each match 908, it sends back an execution control event 909b containing a timestamp and the SAT tag detected. All the STEAD events are consolidated in the originating service processor, along with the propagated IT Path events, on a per-SAT basis. This leads eventually to the generation of Business Impact events related to Business Services and business user groups. Using an e-Console, an operator can connect to the service processors in order to view or manipulate those events 910, including drilling down to the underlying events. Notably, additional processing capabilities may be required in the service processors in order to support the STEAD sub-requests. These extensions can be added in the service processors 130az, 130bz that run IT monitoring operations or they can be implemented in dedicated service processors.
Referring to FIG. 10, cross-layer communications are shown in the aforementioned three-layer functional architecture. At initialization or when an IT Aggregate object is added or updated, a domain processor 120 sends one or several subscription requests 911 to the ODS processor 125 serving its IT Domain. Such requests contain the IDs of the IT Components that the domain processor is interested in, as a means of maintaining its IT Aggregates.
Based on its mapping table as described further on in this narrative, the ODS processor forwards the subscription requests 912 to the service processors 130, 130z owning those IT Components. As a result, each service processor will abstract and propagate to the registered domain processor 120 all the IT Component events 904 where the related IT Component is one of those the domain processor has subscribed to. At initialization or when a transaction object is added or updated, a service processor 115 sends one or several subscription requests 913 to the ODS processor 125. Such requests contain the IDs of the IT Aggregates the service processor is interested in, as a means of maintaining a snapshot of the IT resources supporting the execution of transactions. Based on its mapping table as described further on in this narrative, the ODS processor forwards the subscription requests 914 to the domain processors 120 owning those IT Aggregates. As a result, each domain processor will abstract and propagate to the registered service processor 115 all the IT Aggregate events 905 where the related IT Aggregate is one of those the service processor has subscribed to. At initialization or when a maintenance period ends, a service processor 115 queries 915, 916 the ODS processor 125 to determine which processors should be contacted for a given STEAD request. The service processor then pushes sub-requests 906 to those servers 130z, which in turn return events 909 related to each execution cycle. FIG. 11 depicts the high end of the distributed data model supporting the aforementioned three-layer functional architecture. This upper block presents the data structures stored and maintained in the service processors 115. Turning to the low end of the distributed data model shown in FIG. 12, the left bottom block relates to the ODS processors 125; the middle bottom block to the domain processors 120; and the right bottom block to the service processors 130.
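By way of illustration only, the subscription routing described with respect to FIG. 10 can be sketched as a lookup in a mapping table that associates each resource ID with its owning processor. The class name `OdsProcessor`, the method `route_subscription`, and the table layout are hypothetical and are not taken from the disclosure; the two ownership entries reuse the 'MDW_app process' (130f) and 'MDW_db process' (130g) example of FIG. 22.

```python
from collections import defaultdict


class OdsProcessor:
    """Hypothetical ODS directory: maps a resource ID to the processor owning it."""

    def __init__(self, owner_by_resource):
        # Mapping table (assumed layout): resource ID -> owning processor name.
        self.owner_by_resource = dict(owner_by_resource)

    def route_subscription(self, resource_ids):
        """Group the requested IDs by owning processor.

        Returns (forwarded, unmatched): `forwarded` maps each owner to the IDs
        it should receive a subscription for; `unmatched` lists IDs that have
        no entry in this ODS processor's table.
        """
        forwarded = defaultdict(list)
        unmatched = []
        for rid in resource_ids:
            owner = self.owner_by_resource.get(rid)
            if owner is None:
                unmatched.append(rid)
            else:
                forwarded[owner].append(rid)
        return dict(forwarded), unmatched


# Example: a domain processor subscribes to two known components and one unknown ID.
ods = OdsProcessor({"MDW_app process": "130f", "MDW_db process": "130g"})
forwarded, unmatched = ods.route_subscription(
    ["MDW_app process", "MDW_db process", "unknown component"]
)
```

In this sketch the unmatched IDs are simply returned to the caller; how they are handled is left open.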
By default, each ODS processor should have an entry for every IT Aggregate existing in the various domain processors and for every IT Component existing in the various service processors of the given backbone. However, from a technical standpoint, nothing prevents splitting the backbone into several logical areas with one ODS server per area. The ODS servers then simply need to forward unmatched requests automatically to their peers. FIG. 13 depicts the High Availability scheme provided with the Management Backbone. As an example, two service processors are shown. Taking advantage of the built-in peer-to-peer capabilities, the first processor 130a is configured to act as an active backup for the second processor 130b, which in turn acts as an active backup for the first one 175. Notably, asymmetric backup configurations are possible. A backup processor maintains a dormant copy of the resources managed by its peer and, during the normal course of operations, the events related to those resources are propagated from the peer to the backup and automatically synchronized 185. In each processor, the processing service and the instrumentation service monitor each other 170. If the processing service detects that the instrumentation service is no longer available, it attempts to restart it. If the instrumentation service fails to restart, the processing service notifies the backbone administrator. If the instrumentation service detects that the processing service is no longer available, it attempts to restart it for a first cycle. If the processing service fails to restart, the instrumentation service automatically redirects the flow of events for a second cycle from 145 the local processing service to 146 the processing service of the backup processor.
If the backup detects that the processing service of its peer is not responding for two consecutive cycles, it activates the dormant copy of the resources of the peer, takes ownership of them, and informs the ODS processor to switch the processor names in its tables. The ODS processor then notifies the other processors of the IT Domain(s) it is associated with, plus the other ODS processors, if any. Based on this scheme, controlled event sources 140 and smart event sources 160 will not be affected by a non-recoverable failure of the processing service in a processor. Only the information coming from static event sources might be lost or buffered, if they do not support dual delivery or if this option is dismissed for performance reasons. In addition, when the instrumentation service of a processor experiences a non-recoverable failure, the local processing service may request the processing service of the backup processor to operate in an assistance mode. In this situation, the processing service of the backup first tags those of its dormant resources that normally rely on the instrumentation service of the peer. Then, it triggers 180 local instrumentation functions to monitor the tagged resources. Finally, it propagates the resulting events to the peer for processing. With this mode, the first processor keeps ownership of its resources, such that only the monitoring actions are subcontracted to the backup. Resources of low importance or resources that cannot be monitored by the backup may be excluded from this mode by using a static marker. The IT Infrastructure of a representative company is shown in FIG. 14 as implementing the system and method of the present invention. This company is shown as having a main office 205 and a branch office 200. The central application server and database server reside on the internal network of the main office. Separated from the internal network by a firewall is a demilitarized zone (DMZ) 210 with two Web servers.
Depending on the type of service being used, the users in the main office either access the application server directly or first go through the Web servers. Users in the branch office can only access the application services through the Web servers. A possible setup for the processors at the representative company is depicted in FIG. 15. All the servers receive a service processor 130b-130g. An ODS processor is installed on the internal network of each of the two offices (125a and 125c), and a third one 125b is installed in the DMZ. A similar layout is adopted for the domain processors 120a-120c, and a service processor 115a is installed at the main office. The resulting Management Backbone at the representative company is shown in FIG. 16. Service processors 130a-130g are combined in symmetric backup configurations and associated with one of the three IT Domains: main office, DMZ, or branch office. Each domain processor 120a-120c obtains directory services from the local ODS processor 125a-125c and delivers aggregated IT information to the service processor 115a. FIG. 17 depicts three site business transactions (SBT) at the representative company. One, SBT 1-bo, corresponds to the business users of the branch office submitting a sequence of site application transactions through a web interface. Another, SBT 1-mo, corresponds to the business users of the main office submitting a sequence of site application transactions through a web interface. The last one, SBT 2-mo, corresponds to the business users of the main office submitting a sequence of site application transactions directly through a proprietary client. In this example, the first two SBTs relate to the same business transaction BT 1 while the third one relates to another business transaction BT 2. Six SATs support the three site business transactions (SBT) at the representative company in FIG. 18. The SBT 1-bo (shown in FIG. 17) is made of SAT 11-bo, SAT 12-mo, and SAT 13-bo.
The site business transaction SBT 1-mo is made of SAT 11-mo, SAT 12-mo, and SAT 13-mo. The site business transaction SBT 2-mo is made of SAT 22-mo. The logical tree of the resources at the representative company is shown in FIG. 19. The tree goes from the business service at the top down to the IT Aggregates. It also shows how business user groups relate to site business transactions. As illustrated, site business transactions can share some site application transactions. In turn, SATs can share the same IT Path, which in turn can share some IT Aggregates with other IT Paths. An end-to-end representation of the IT Path ITP (a) at the representative company is depicted in FIG. 20. This IT Path supports the site application transactions SAT 11-bo and SAT 13-bo, which are part of the site business transaction SBT 1-bo. SBT 1-bo is an instantiation of the business transaction BT 1, which belongs to the business service BS 1. Turning to FIG. 21, the IT Aggregates in the IT Path ITP (a) at the representative company are shown. The split of the IT Path into several IT Aggregates is arbitrary but, from a general perspective, it should comply with the division into IT Domains. FIG. 22 depicts the underlying IT Components and dependencies for one of the IT Aggregates at the representative company. The IT Aggregate ITA 3 is owned by the domain processor 120c of the main office and is associated with four IT Components: 'FW1', 'R1', 'application service', and 'database service'. As a result, a subscription has been made on the three service processors 130e-130g maintaining those IT Components in order to have any related events forwarded to the domain processor. While the first two IT Components are not involved in any relationship, the other two are in fact non-instrumented logical objects combining the events of various other IT Components through cascaded dependencies.
For example, the 'application service' component depends on the 'APP process' as a main service and on the 'MDW_app process' as a secondary service. Those two components in turn depend on the 'APP server'. As explained with respect to FIG. 25 and illustrated in FIG. 26, typed dependencies imply specific propagation policies. Also, dependencies can link components 'horizontally' and across the processor boundaries, like the bi-directional relationship between the 'MDW_app process' component (owned by 130f) and the 'MDW_db process' component (owned by 130g). FIG. 23 depicts a set of IT Indicators providing availability information about interrelated IT Components at the representative company. Each IT Indicator encompasses a range of instrumentation event(s) in the availability discipline and for a given IT Component. For example, the availability status 215b of the IT Component 'APP process' is the product of (i) the instrumentation events issued by the two associated IT Indicators 'process existence' and 'process errors', and (ii) the dependency events resulting from the dependency on the 'APP server' component. Referring to FIG. 24, a set of IT Indicators providing performance information about interrelated IT Components at the representative company is shown. Each IT Indicator encompasses a range of instrumentation event(s) in the performance discipline and for a given IT Component. For example, the performance status 220c of the IT Component 'MDW_app process' is the product of (i) the instrumentation events issued by the two associated IT Indicators 'process mem use' and 'process cpu use', and (ii) the dependency events resulting from the dependencies on the 'APP server' component and the 'MDW_db process' component.
Referring again to FIG. 25, some Impact Propagation Policies at the representative company are shown. The relationship between the 'MDW_app process' component and the 'application service' component in FIG. 22 is an example where one is a secondary service for the other. This relationship is governed by the Impact Propagation Policy 6. Thus, when an instrumentation event (bsi) occurs for the 'MDW_app process', the first table determines whether it has to be propagated as a dependency event (bsd) to the 'application service'. By default, a FATAL bsi event translates into a WARNING bsd event. Bsi events with a lower severity are usually not propagated. When a dependency event occurs for the 'MDW_app process' as a consequence of an upstream dependency, the second table determines whether or not it has to be propagated as a new dependency event to the 'application service'. Any bsd event with a severity equal to CRITICAL or FATAL translates into a WARNING bsd event. Bsd events with a lower severity are not propagated. Tables, i.e., policies, are stored in the data repository of the processors. In a preferred system, these tables and the policies contained therein may be modified in real time. FIG. 26 depicts an impact propagation case at the representative company involving availability events. Instrumentation standard events (ise) issued by the IT Indicators are abstracted into instrumentation events (bsi) for the related IT Components, using the severity as a means to group events. Looking at the 'APP process' component for example, ise-1 (CRITICAL) and ise-2 (CRITICAL) both abstract into bsi-1 (CRITICAL) while ise-3 (MINOR) abstracts into bsi-2 as this severity is different. Similarly, the 'APP server' has the ise-4 abstracted into bsi-3 (CRITICAL) and the two ise-5 and ise-6 abstracted into the same bsi-4 (WARNING). As the 'APP server' is a vital component for the 'APP process', bsi-3 is abstracted into bsd-1 (MINOR) based on the Impact Propagation Policy 2 (shown in FIG. 25), but bsi-4 is not abstracted because of its lower severity.
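By way of illustration only, the two tables of Impact Propagation Policy 6 described above can be sketched as severity lookup tables, one for incoming bsi events and one for incoming bsd events. The dictionary layout and the function name `propagate` are hypothetical; only the default severity translations stated above are encoded, and any severity absent from a table is treated as not propagated.

```python
# Hypothetical encoding of Impact Propagation Policy 6 (secondary service).
# Each table maps the severity of an incoming event to the severity of the
# bsd event to raise on the dependent component.
POLICY_6 = {
    "bsi": {"FATAL": "WARNING"},                         # first table
    "bsd": {"CRITICAL": "WARNING", "FATAL": "WARNING"},  # second table
}


def propagate(policy, event_kind, severity):
    """Return the severity of the resulting bsd event, or None if the
    incoming event is not propagated under this policy."""
    return policy[event_kind].get(severity)


# A FATAL bsi on the 'MDW_app process' yields a WARNING bsd on the
# 'application service'; a CRITICAL bsi is not propagated by default.
fatal_bsi_result = propagate(POLICY_6, "bsi", "FATAL")
critical_bsi_result = propagate(POLICY_6, "bsi", "CRITICAL")
```

Because the policy is plain data in this sketch, changing an entry of the dictionary changes the propagation behavior without touching the lookup function, which is one simple way to accommodate policies that are modified at run time.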
In addition, as the 'APP process' is a main service for the 'application service' component, bsi-1 (CRITICAL) is abstracted into bsd-3 (MINOR) based on the Impact Propagation Policy 5, but bsi-2 is not abstracted because of its lower severity. Although it has the same severity as bsi-2, bsd-1 is abstracted into bsd-3 (opened by bsi-1) because Impact Propagation Policy 5 takes into account the severity MINOR for the dependency events. In conclusion, all the ise events shown at the bottom of the figure eventually lead to a single bsd-3 MINOR dependency event at the 'application service' level. This outcome could be different with modified Impact Propagation Policies. In FIG. 27, the Instrumentation Standard Event (ISE) hierarchy is depicted in a partial view. These ISE event structures are used by the IT Indicators to deliver standardized information regardless of event source. Turning to FIG. 28, the Base Status Event (BSE) hierarchy is shown in a partial view. These BSE event structures are used throughout the Management Backbone as a means to carry the necessary information for determining the base status of the managed resources such as IT Components, IT Aggregates, IT Paths, SAT, BST, business user groups, and Business Services. The base status of a given resource is the highest severity among those of the open BSE_IMPACT (bsi, bss, bsd, bst), BSE_AVAILABILITY (bsa), and BSE_PERFORMANCE (bsf) events which relate to that resource. The model enforces the following preferred principles for IT Components. First, a resource can have up to 4 bsi events open at the same time (one per severity value: WARNING, MINOR, CRITICAL, FATAL) in each discipline. Next, a resource can have up to 4 bsd events open at the same time (one per severity value: WARNING, MINOR, CRITICAL, FATAL) in each discipline. Moreover, a resource can have only 1 bsa event open at once in the availability discipline. In addition, a resource can have only 1 bsf event open at once in the performance discipline.
Furthermore, in the availability discipline, bsi/bsd events (on one side) and the bsa event (on the other side) are mutually exclusive when open. In the performance discipline, bsi/bsd events (on one side) and the bsf event (on the other side) are mutually exclusive when open. Finally, by definition, event collectors associated with the resources will only display open bsi, bsd, bsa, and bsf events. The same principles apply for IT Aggregates, IT Paths, BST, and Business Services, with the exception of bsi events, which cannot occur at those levels. The same principles apply for SAT, with bst events in place of bsi events. In addition, for IT Aggregates and IT Paths, a resource can have up to 4 bss events open at the same time (one per severity value: WARNING, MINOR, CRITICAL, FATAL) in each discipline. The consolidated status carried in BSE_CONSOLIDATED events (bsc) is derived from the severity values on a per-resource basis, with: two HARMLESS events (one per discipline: bsa+bsf) translating into OPERATIONAL; any combination of events reaching but not exceeding the severity range [WARNING, MINOR] translating into OPERATIONAL_WITH_INCIDENTS; and any combination of events reaching the severity range [CRITICAL, FATAL] translating into NOT_OPERATIONAL. The impact statement carried in SERVICE_IMPACT_STATEMENT (sis) and USER_IMPACT_STATEMENT (uis) events is derived, like the consolidated status, from the severity values on a per-service or per-user group basis, with three possible statements: NO_IMPACT_REPORTED, MINOR_IMPACT, and SEVERE_IMPACT. FIG. 29 depicts the event processing steps from the instrumentation level up to the IT Aggregate level. ISE events are updated and regulated 235 and lead to the creation of bsi events, which in turn lead to the creation of cascaded bsd events. When the bsi and bsd events are all closed for an IT Component in a given discipline, a bsa event (availability discipline) or a bsf event (performance discipline), respectively, is automatically reopened for that resource.
Each time a change occurs at the component level 240, a new bsc event replaces the previous one for the related resource as a means to consolidate status information. In addition, bsi and bsd events are abstracted into new bsd events and propagated to the IT Aggregate level with seamless synchronization over time. From that level 245, bsd events are further abstracted and propagated upwards. The event processing steps from the IT Aggregate level up to the site business transaction level are depicted in FIG. 30. The events coming from the IT Aggregate level are abstracted bsd events. From the IT Path level 250, those bsd events are abstracted into new bsd events to the SAT level, where they are correlated 255 with bst events coming from the STEAD monitoring channel. From the SAT level, bsd and bst events are further abstracted into new bsd events to the BST level. Each time a change occurs at the BST level 260, a new bsc event replaces the previous one for the related resource as a means to consolidate status information; in addition, bsd events are abstracted to the upper level. FIG. 31 depicts the event processing steps from the site business transaction level up to the Business Service level. From the BST level, bsd events are abstracted into new bsd events to (a) the business user group level and (b) the Business Service level. Each time a change occurs at the business user group level 265, a new uis event replaces the previous one for the related resource as a means to consolidate user business impact. Similarly, each time a change occurs at the Business Service level 270, a new sis event replaces the previous one for the related resource as a means to consolidate service business impact.
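By way of illustration only, the derivation of the base status and of the consolidated status carried in bsc events, as described with respect to FIG. 28, can be sketched as follows. The function names and the list-based representation of open events are hypothetical. The ranking of HARMLESS below WARNING is an assumption, as is the premise that every resource always has at least one open event (a bsa or bsf event being reopened whenever all bsi and bsd events are closed).

```python
# Severity values from lowest to highest. Placing HARMLESS below WARNING is
# an assumption of this sketch.
SEVERITY_ORDER = ["HARMLESS", "WARNING", "MINOR", "CRITICAL", "FATAL"]
RANK = {name: position for position, name in enumerate(SEVERITY_ORDER)}


def base_status(open_severities):
    """Highest severity among the open events of one resource.

    Assumes a non-empty list; raises ValueError on an empty one.
    """
    return max(open_severities, key=RANK.__getitem__)


def consolidated_status(open_severities):
    """Map the open severities of one resource to its consolidated status."""
    worst = base_status(open_severities)
    if worst == "HARMLESS":
        return "OPERATIONAL"
    if worst in ("WARNING", "MINOR"):
        return "OPERATIONAL_WITH_INCIDENTS"
    return "NOT_OPERATIONAL"  # worst is CRITICAL or FATAL


# One HARMLESS bsa event and one HARMLESS bsf event: the resource is operational.
healthy = consolidated_status(["HARMLESS", "HARMLESS"])
# A MINOR event in one discipline, HARMLESS in the other: incidents only.
degraded = consolidated_status(["HARMLESS", "MINOR"])
```

In this sketch a new consolidated status would simply be recomputed from the current list of open severities each time an event is opened or closed for the resource.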
This system and method and many of its intended advantages will be understood from the disclosure herein and it will be apparent that, although the invention and its advantages have been described in detail, various changes, substitutions, and alterations may be made in the form, construction, and/or arrangement of the elements without departing from the spirit and scope of the invention, or sacrificing its material advantages, the form described previously and subsequently herein as being merely a preferred or exemplary embodiment thereof.
