System for managing group of computers by displaying only relevant and non-redundant alert message of the highest severity and controlling processes based on system resources
5862333
Abstract
The system and method of this invention automatically manages a group of computers by automatically gathering data, storing the data, analyzing the stored data to identify specified conditions, and initiating automated actions to respond to the detected conditions. The invention, hereafter "SYSTEMWatch AI-L", comprises a SYSTEMWatch AI-L client which turns a computer into a managed computer, a SYSTEMWatch AI-L console, which turns a computer into a monitoring computer, a SYSTEMWatch AI-L send facility, which allows a system administrator to send commands to various SYSTEMWatch AI-L clients through the SYSTEMWatch AI-L console, and a SYSTEMWatch AI-L report facility which allows a system administrator to query information collected and processed by the SYSTEMWatch AI-L clients and SYSTEMWatch AI-L consoles.
Claims
I claim:
Description
FIELD OF THE INVENTION
TABLE 1
__________________________________________________________________________
FEATURE      DESCRIPTION
__________________________________________________________________________
NAME         A property must have a name.
TYPE         A property must have a type, which corresponds to the type of
             the data to be stored in the field.
FORMAT       A property may optionally have a string which describes how the
             data in the field should be formatted. The format string is
             similar to the C language's printf() formatting control.
HEADER       A property may optionally contain a string which will be
             displayed as the column header when a report featuring records
             containing the property is displayed.
DISPLAYUNIT  A string used by the reporting facility which is appended to the
             data in the field during a report. Thus, if the PROPERTY is a
             description of memory utilization in kilobytes, an appropriate
             DISPLAYUNIT might be "kb".
DISPLAYTYPE  Some display formats are commonly used throughout SYSTEMWatch
             AI-L. DISPLAYTYPEs are keywords which correspond to a particular
             FORMAT. Examples of DISPLAYTYPEs include STRING20, for a string
             limited to 20 characters in width; DATESMALL, for displaying a
             date in mm/dd format; and PERCENT, for automatically displaying
             numbers between 0.0 and 1.0 as percentages (e.g.: 0.52 is
             displayed as 52%).
SHORTDESC    A PROPERTY may optionally contain an abbreviated description of
             the PROPERTY.
LONGDESC     A PROPERTY may optionally contain a long description of the
             PROPERTY.
__________________________________________________________________________
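As an illustration of how the FORMAT, DISPLAYUNIT, and DISPLAYTYPE features of Table 1 might interact during a report, the following is a minimal Python sketch. It is not the SYSTEMWatch AI-L implementation: the function name `format_value`, its parameters, and the rule that a DISPLAYTYPE takes precedence over a FORMAT string are assumptions made for the example. Dates are assumed to be stored as seconds since the start of UNIX time.

```python
import time


def format_value(value, displaytype=None, fmt=None, displayunit=""):
    """Render one field value for a report (hypothetical helper)."""
    if displaytype == "STRING20":
        # String limited to 20 characters in width.
        text = str(value)[:20]
    elif displaytype == "DATESMALL":
        # Date in mm/dd format; UTC is used so the result is deterministic.
        text = time.strftime("%m/%d", time.gmtime(value))
    elif displaytype == "PERCENT":
        # Numbers between 0.0 and 1.0 shown as percentages.
        text = "%d%%" % round(value * 100)
    elif fmt is not None:
        # printf-like FORMAT string, e.g. "%d" or "%.1f".
        text = fmt % value
    else:
        text = str(value)
    # DISPLAYUNIT is appended to the data in the field.
    return text + displayunit


print(format_value(0.52, displaytype="PERCENT"))
print(format_value(2048, fmt="%d", displayunit="kb"))
```

Python's `%` operator is used here only because it resembles printf-style formatting; the actual format language of the product may differ.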
2. Entity
Conceptually, ENTITYs are similar to database tables. In SYSTEMWatch AI-L, ENTITYs are used to group related PROPERTYs. FIG. 7 illustrates the concept that each piece of data in the database is associated with a given PROPERTY and a given ENTITY. In this document, it will be necessary to refer to certain combinations of ENTITYs and PROPERTYs. The construction <entity name>_<property name> (e.g.: IGNORE_IGNORETIME) will refer to a database entry with an entity equal to <entity name> and a property equal to <property name>.
In addition to ENTITYs and PROPERTYs, the database, 41, in SYSTEMWatch AI-L also has these additional features:
1. Host Information
Each piece of data in database, 41, automatically has host information associated with it. Thus, as data is stored in the database, the database automatically records the host from which the data originated. This is because in SYSTEMWatch AI-L, data is "owned" by the host where the data originated. Other hosts may request a copy of the data since SYSTEMWatch AI-L has communications capabilities. Some data may be stored in a central location (e.g.: a SYSTEMWatch AI-L console) if it is relevant to multiple computers. Because each piece of data has host information associated with it, a SYSTEMWatch AI-L console can consolidate data from multiple hosts.
2. Time Information
Each piece of data in database, 41, has a time field associated with it. The time field by default holds the last time the data was updated, but SYSTEMWatch AI-L provides a mechanism for changing the time field, so it is possible to store some other time in the field.
3. Name
Each piece of data in database, 41, has a key field which is called the name field. A name field must be unique for a given ENTITY, PROPERTY, and host (the name of a computer). Thus, within an ENTITY and PROPERTY used for tracking computer processes, the name field might be the process id, since process ids are unique on each computer; by specifying the ENTITY name, PROPERTY name, and host name, the name field forms a unique key to locate the data.
4. Value
Of course, a database stores data. In SYSTEMWatch AI-L, the term value refers to the data stored in the database.
In one example, database, 41, is currently implemented as a relational database. One table is used for describing ENTITYs; this table is used to associate ENTITYs with PROPERTYs. Another table is used for describing PROPERTYs. Finally, another table holds the information, which can be located by providing an ENTITY name, PROPERTY name, and the name field of the data. This table also contains the associated host and time information. In another embodiment, database, 41, can also be implemented with a database which is object oriented, i.e., a database which supports the ability to inherit data and methods from super and sub classes.
An additional requirement of database, 41, used in the core is that the database must support certain query operations and certain set operations. Specifically, the query operations supported by the database include:
1. regular expression matching in queries.
2. creation time or update time query, i.e., searching for a data item based upon the time the data was stored in the database or based on the time the data was last updated in the database.
3. host of origin in queries, i.e., searching for a data item based on the host which created the data.
4. time comparison query, i.e., searching for data based upon a time comparison.
Note: SYSTEMWatch AI-L stores its time in a manner similar to the UNIX operating system. That is to say, all time is converted to seconds elapsed since the beginning of UNIX time. The advantage of this method is that time comparisons are easily made, and an interval can be added to a time to obtain a future time.
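The keying scheme and query operations described above can be sketched in a few lines of Python. This is a simplified in-memory illustration, not the relational implementation the text describes: the class name `Database`, its method names, and the sample ENTITY and PROPERTY names (`PROCESS`, `CPUTIME`) are all hypothetical. Each value is keyed by (ENTITY, PROPERTY, host, name) and carries a time field expressed in seconds since the start of UNIX time.

```python
import re
import time


class Database:
    """Toy store keyed by (entity, property, host, name)."""

    def __init__(self):
        self._rows = {}  # key tuple -> (value, time in UNIX seconds)

    def store(self, entity, prop, host, name, value, when=None):
        # By default the time field holds the last update time, but a
        # caller may supply some other time instead.
        stamp = int(time.time()) if when is None else when
        self._rows[(entity, prop, host, name)] = (value, stamp)

    def get(self, entity, prop, host, name):
        return self._rows[(entity, prop, host, name)][0]

    def query(self, entity, prop, name_pattern=".*", host=None,
              updated_since=None):
        """Return matching keys, filtered by name regex, host, and time."""
        regex = re.compile(name_pattern)
        hits = []
        for key, (_value, stamp) in self._rows.items():
            e, p, h, n = key
            if (e, p) != (entity, prop):
                continue
            if host is not None and h != host:
                continue
            # Integer timestamps make the time comparison trivial.
            if updated_since is not None and stamp < updated_since:
                continue
            if regex.fullmatch(n):
                hits.append(key)
        return sorted(hits)


db = Database()
db.store("PROCESS", "CPUTIME", "server1", "101", 3.5, when=1000)
db.store("PROCESS", "CPUTIME", "server1", "202", 9.0, when=2000)
db.store("PROCESS", "CPUTIME", "server2", "101", 1.0, when=2000)
print(db.query("PROCESS", "CPUTIME", host="server1", updated_since=1500))
```

Because the host is part of the key, the same process id can be stored for two different hosts without conflict, which is what allows a console to consolidate data from several managed computers.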
The set operations which database, 41, supports include:
1. set intersections (ANDs)--given 2 or more sets of data, return the elements present in all sets.
2. set union (ORs)--given 2 or more sets of data, return the elements present in any of the sets.
3. set exclusion (NOTs)--given a first set and a second set, return elements in the first set which are not elements of the second set.
Core Layer Description--The Expert System
The second element of the core layer is an expert system, 40, which is used for problem detection and action initiation. The expert system, 40, is a forward chaining rule based expert system using a rule specificity algorithm. When SYSTEMWatch AI-L client, 13, is started, the expert system contains no rules. Rules are declared and incorporated into the core layer. Both IF-THEN rules and IF-THEN-ELSE rules are supported. The rules used in SYSTEMWatch AI-L permit assignments and function calls within the condition of the rule. Additionally, SYSTEMWatch AI-L expert system, 40, also has the following features:
a. Rules can declare variables. All variables declared within a rule are static variables.
b. Rules can have an initialization section. The initialization section contains actions which must be performed only once, and before the rule is ever tested. It can, for example, contain a state declaration and an interval declaration (states and intervals are described below). It may contain variable declarations for variables used by the rules, and it may contain code to do a variety of actions.
c. Rules can have, for instance, an INTERVAL and a LASTCHECK time. In accordance with the principles of the present invention, in order for a rule to be eligible for testing by the expert system, at the time of testing the clock time must be equal to or greater than the LASTCHECK time plus the INTERVAL time. The LASTCHECK time for each rule is set to the clock time whenever a rule is actually tested. This way, the INTERVAL specifies the minimum amount of time which must elapse since the last time a rule was checked before the rule becomes eligible for testing again.
d. The expert system and its rules have a state property. One example of the possible states is described below. Under expert system, 40, in order for a rule to be eligible for testing, the rule's state must equal the expert system's state. All rules except one must declare a state in their initialization sections. The one rule without such a declaration is a rule used by expert system, 40, to switch it into the DATA state. Other rules are responsible for managing the transition from DATA to DATA2, and from DATA2 to EXCEPT. These states are described below:
DATA: The DATA state is assigned to rules which gather raw data from the computer system. Examples of such rules would be rules which gather the amount of free space remaining on a file system, or the amount of CPU time consumed by a process. SYSTEMWatch AI-L contains a series of rules responsible for switching states, and those rules ensure that rules with the DATA state are eligible to be tested before rules with a DATA2 or EXCEPT state.
DATA2: Sometimes, a rule which performs problem detection or a rule which initiates an action requires data which can only be computed after certain raw data is gathered in the DATA state. Although the rule can compute that information directly, if that computation is necessary for a variety of rules, it is more efficient to ensure that the computation is performed only once. The DATA2 state is assigned to rules which perform this intermediate level calculation. The rules responsible for switching states ensure that DATA2 state rules are eligible for testing after DATA state rules, but before EXCEPT state rules.
EXCEPT: The EXCEPT state is assigned to the remaining rules, which are used to perform problem detection and action initiation. The rule responsible for switching states ensures that EXCEPT state rules are eligible for testing only after both DATA and DATA2 state rules are tested. However, after the EXCEPT state rules are tested, the state is reset to the DATA state, and the cycle resumes.
e. Each rule in the expert system also has a ONCE property. ONCE defaults to true, but can be set to false on a per rule basis by making the appropriate declarations in the initialization section of the rule, or the THEN clause, or the ELSE clause of the rule. In SYSTEMWatch AI-L, a rule is not eligible for testing by the expert system if the ONCE property for the rule is true and if, during this pass through the expert system, the rule has previously tested true.
After all the rules are declared, the expert system is in a state where it is ready to test rules. SYSTEMWatch AI-L forces the expert system component of the core layer to run through its rules whenever the execRules function is called. As described later, the SYSTEMWatch AI-L client, 13, and SYSTEMWatch AI-L console, 21, each call an execRules function in their main loop.
As shown in FIGS. 8a-8b, in one embodiment, the expert system functions as follows: First, if the rules have not been sorted, INQUIRY 59, "Have the rules been sorted?", the expert system reorders the rules by sorting them in specificity order, STEP 60. Rules are ranked in their order of specificity, with the most specific rules ordered before the least specific rules. Specificity is the total number of comparison operators (less than, less than or equal to, equal to, greater than, greater than or equal to, not equal to) and logical operators (AND, OR, NOT) contained within the boolean expression used as the test in the rules. For example, consider these boolean expressions:
TABLE 2
______________________________________
Boolean Expression         Specificity
______________________________________
A AND NOT B OR (C == D)    4
(A == B) && NOT C          3
(A == B) && C              2
A == B                     1
TRUE                       0
______________________________________
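The specificity values in Table 2 can be reproduced by counting operator tokens. The Python sketch below is only an approximation under stated assumptions: the grammar of the SYSTEMWatch AI-L rule language is not given here, so the tokenizer simply recognizes the comparison symbols and the logical operators in the spellings that appear in Table 2 (`AND`, `OR`, `NOT`, `&&`) plus `||` as an assumed counterpart. The function names are hypothetical.

```python
import re

# Longer symbols come first so that "<=" is not counted as "<".
_OPERATOR = re.compile(r"<=|>=|==|!=|<|>|&&|\|\||\b(?:AND|OR|NOT)\b")


def specificity(expression):
    """Count comparison and logical operators in a boolean expression."""
    return len(_OPERATOR.findall(expression))


def sort_by_specificity(expressions):
    """Most specific first; Python's sort is stable, so ties keep input order."""
    return sorted(expressions, key=lambda e: -specificity(e))


for expr in ["A AND NOT B OR (C == D)", "(A == B) && NOT C",
             "(A == B) && C", "A == B", "TRUE"]:
    print(specificity(expr), expr)
```

A production implementation would more likely count operators while parsing the rule rather than by pattern matching on its text, since a regular expression cannot tell an operator from the same characters inside a string literal.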
If during the sorting, a group of rules has the same specificity, that group is sorted in declaration order, with the earlier declared rule ordered before a later declared rule. The reordering of the rules is only done once, during the first time the execRules function is called.
If this is the first time the execRules function is called, INQUIRY 61, "Have the rules been initialized?", the expert system also performs rule initialization by running through each rule in order: each rule's LASTCHECK time is set to zero, its ONCE state is set to TRUE, and any statements contained in the rule's initialization section are executed, STEP 62.
Subsequent to initialization, or if initialization was previously performed, expert system, 40, begins testing rules in sequence. First, the expert system sets its rule state to an empty string, STEP 63. Then the expert system sets its current rule pointer to the first rule, STEP 64. It makes the current rule be the rule pointed to by the current rule pointer, STEP 65. Then, before testing the rule, the expert system checks to see if the current time is greater than the rule's LASTCHECK time plus the rule's INTERVAL time, INQUIRY 66. If so, the required interval has elapsed, and the rule is not disqualified from testing. Otherwise the rule is disqualified from testing during this pass through the rules.
If the above inquiry is affirmative, the expert system checks to see if the expert system rule state is equal to the current rule's state, INQUIRY 67. If they are equal, the rule is not disqualified from testing during this pass through the rules. Otherwise the rule is disqualified.
Should the expert system rule state equal the current rule's rule state, the expert system checks to see if the rule's ONCE variable is set to TRUE, INQUIRY 68. If it is, and if this rule has already tested TRUE during the current call to the execRules function, the rule is disqualified from testing during this pass through the expert system. If not, the rule is eligible for testing.
If a rule is eligible for testing, the expert system tests its condition and sets the rule's LASTCHECK time to be equal to the current time, STEP 69. (The rule's LASTCHECK time is updated when the condition is tested.) If the condition is true, the expert system then executes the THEN clause of the rule, STEP 70. If the condition is false, the expert system executes the ELSE clause of the rule, STEP 71, if it exists.
What happens next depends upon what happened during rule qualification and rule testing. If the rule was disqualified from testing, or if the rule was tested and the condition was false, the expert system checks to see if the current rule is the last rule in the expert system, INQUIRY 72. If so, the expert system pass is completed for the time being, and the execRules function returns, STEP 74. If not, the expert system sets the current rule pointer to the next rule, STEP 73, and begins the process of checking rule testing eligibility and rule checking again, STEP 75. On the other hand, if the rule was tested, and the condition was true, then the expert system sets the current rule pointer to the first rule in the expert system, STEP 64, and the expert system begins the process of checking rule testing eligibility and rule checking again, STEP 65.
Core Layer Description--Language Interpreter
Returning to FIG. 6, the third element of the core layer is a mechanism for configuring and controlling the database and the expert system. One preferred embodiment of this layer is an interpreter, 39, for a high level language, said language containing a mechanism for expressing database operations, database data definitions, and expert system rules.
Core Layer Description--Communications Mechanism
Finally, the fourth element of the core layer is communications mechanism, 42. The communication mechanism, 42, used by SYSTEMWatch AI-L is based on mailboxes.
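Returning briefly to the expert system, the rule-testing pass of FIGS. 8a-8b described above (eligibility by INTERVAL, by state, and by ONCE, with a restart from the first rule whenever a condition tests true) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class and field names are hypothetical, rule conditions and clauses are modeled as Python callables, the initialization section is omitted, and eligibility uses "equal to or greater than" as in the description of the INTERVAL property.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    name: str
    condition: Callable[[], bool]
    then: Callable[[], object]
    otherwise: Optional[Callable[[], object]] = None  # optional ELSE clause
    state: str = ""        # "" marks the one rule with no declared state
    interval: int = 0      # seconds that must elapse between tests
    specificity: int = 0
    once: bool = True
    lastcheck: int = 0
    fired: bool = False    # tested true during the current pass


class ExpertSystem:
    def __init__(self, rules):
        self.rules = list(rules)
        self.state = ""
        self._sorted = False

    def exec_rules(self, now=None):
        now = int(time.time()) if now is None else now
        if not self._sorted:
            # Stable sort: ties in specificity keep declaration order.
            self.rules.sort(key=lambda r: -r.specificity)
            self._sorted = True
        self.state = ""
        for rule in self.rules:
            rule.fired = False
        i = 0
        while i < len(self.rules):
            rule = self.rules[i]
            eligible = (now >= rule.lastcheck + rule.interval
                        and rule.state == self.state
                        and not (rule.once and rule.fired))
            if eligible:
                rule.lastcheck = now
                if rule.condition():
                    rule.fired = True
                    rule.then()
                    i = 0          # restart from the first rule
                    continue
                if rule.otherwise is not None:
                    rule.otherwise()
            i += 1                 # disqualified or false: next rule


log, facts = [], {}
engine = ExpertSystem([])


def switch(new_state):
    engine.state = new_state


engine.rules = [
    Rule("to_data", lambda: True,
         lambda: (log.append("to_data"), switch("DATA"))),
    Rule("gather", lambda: True,
         lambda: (log.append("gather"), facts.update(free=5)), state="DATA"),
    Rule("to_except", lambda: True,
         lambda: (log.append("to_except"), switch("EXCEPT")), state="DATA"),
    Rule("alert", lambda: facts["free"] < 10,
         lambda: log.append("alert"), state="EXCEPT"),
]
engine.exec_rules(now=1000)
print(log)
```

In this sketch a rule whose ONCE property is false, whose interval is zero, and whose condition stays true would restart the pass indefinitely, so termination relies on ONCE or on a nonzero INTERVAL. The demo omits the DATA2 state for brevity.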
Each module has its own mailboxes, which are used to receive incoming data and commands. In one example, SYSTEMWatch AI-L contains two modules, SYSTEMWatch AI-L client, 13, and SYSTEMWatch AI-L console, 21. Messages are sent by delivering files to the desired module's mailbox. If the desired module is on a different computer, the delivery mechanism must be able to transport a message from one computer to another.
In one example of a preferred embodiment, the communication mechanism, 42, operates by running a communications daemon on each machine which has either SYSTEMWatch AI-L client, 13, or SYSTEMWatch AI-L console, 21. A sending module delivers its message to a receiving module by passing the message to the communications daemon located on the machine where the sending module is located. That communications daemon then transmits the message over a computer network to the communications daemon on the machine where the receiving module is located; message passing is accomplished by sending messages on a TCP/IP based network using network sockets. The communications daemon on the machine where the receiving module is located then places the message in a file in the mailbox of the receiving module.
In another example of a preferred embodiment, the communications mechanism, 42, operates by placing all mailboxes of all modules in a central location, say a certain directory on a file server. On each machine which contains either SYSTEMWatch AI-L client, 13, or SYSTEMWatch AI-L console, 21, the file server directory where the mailboxes are located is made accessible. Thus, a sending module delivers its message to a receiving module simply by writing a file into the appropriate mailbox.
Now that the client program organization has been explained, it is possible to understand how the SYSTEMWatch AI-L client operates within the context of its bifurcated layers, i.e., the core and application layers.
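The shared-directory embodiment of the mailbox mechanism, in which a message is delivered by writing a file into the receiving module's mailbox, might look like the following Python sketch. The directory layout, the `.msg` file suffix, and the function names `deliver` and `collect` are assumptions made for illustration. The write-to-temporary-file-then-rename step is a design choice of this sketch, not something the text specifies; it is included so that a reader never sees a half-written message.

```python
import os
import tempfile
import uuid


def deliver(mailbox_root, module, message):
    """Write one message file into the mailbox directory of `module`."""
    box = os.path.join(mailbox_root, module)
    os.makedirs(box, exist_ok=True)
    # Write under a temporary name first, then rename into place.
    fd, tmp_path = tempfile.mkstemp(dir=box, prefix=".tmp-")
    with os.fdopen(fd, "w") as handle:
        handle.write(message)
    final_path = os.path.join(box, uuid.uuid4().hex + ".msg")
    os.rename(tmp_path, final_path)
    return final_path


def collect(mailbox_root, module):
    """Read and remove every pending message for `module`."""
    box = os.path.join(mailbox_root, module)
    messages = []
    if not os.path.isdir(box):
        return messages
    for filename in sorted(os.listdir(box)):
        if not filename.endswith(".msg"):
            continue  # skip temporary files still being written
        path = os.path.join(box, filename)
        with open(path) as handle:
            messages.append(handle.read())
        os.remove(path)
    return messages
```

A client's main loop could call `collect` once per iteration. Because file names here are random, this sketch does not preserve the order in which messages were sent.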
When the SYSTEMWatch AI-L client first begins, it consists of the core layer program reading a file containing a program written in the high level language which can be interpreted by the core. That program, together with the 14 programs which it will read, comprises the application layer for SYSTEMWatch AI-L client, 13. At this point, the database in the core layer has no data record definitions, and no data records. Similarly, the expert system within the core layer has no rules, variables, or routines.
As the language interpreter, 39, portion of the core begins to interpret and execute the program, the first thing the program causes the core to do is to perform some housekeeping work. This work consists of ensuring that the communications mailboxes used by the SYSTEMWatch AI-L client are set up.
After the housekeeping is done, SYSTEMWatch AI-L client, 13, causes the core to read in a series of files. These files also contain programs written in the high level language. As each file is read, the routines, data record definitions, and rules expressed in each file are incorporated into the database, expert system, and language interpreter, 39, of the core. One preferred embodiment is to split these programs into 14 parts, consisting of the following files:
1. worksets
2. configs
3. events
4. requests
5. coms
6. lib
7. alerts
8. filesys
9. files
10. swap
11. process
12. daemon
13. actions
14. ruleinit
Note that if the system administrator wanted to add additional modules to detect, analyze, and respond to additional problems, he need only write a program in the high level language conforming to the conventions used in the other files in SYSTEMWatch AI-L and modify the application layer to read in his program(s) before the SYSTEMWatch AI-L client reads the ruleinit program.
Each of the 14 files read by the SYSTEMWatch AI-L client will now be described in detail:
1. worksets: A program which contains database declarations and routines relating to worksets. The worksets program does not declare any rules. A workset is a SYSTEMWatch AI-L ENTITY which is used to track groups of items for inclusion and exclusion, typically for including/excluding certain objects from being tested by the rules.
TABLE 3
__________________________________________________________________________
ENTITY   PROPERTY     TYPE     DESCRIPTION
__________________________________________________________________________
WORKSET  ITEMLIST     string   Actual list of colon separated items for
                               maintaining working sets
WORKSET  ADDEL        string   Contains the string ADD in case of a temporary
                               addition record, and the string DEL in the case
                               of a temporary deletion record. An empty string
                               means this record is a permanent work set
                               record. Other values are illegal.
WORKSET  WORKSETNAME  string   The name of the workset that a temporary
                               add/delete transaction references
WORKSET  TIMEOUT      integer  Specifies the time at which a temporary ADD
                               action will delete an item, or at which a
                               temporary DEL action will ADD an item back to
                               the database. 0 identifies a permanent working
                               set record.
__________________________________________________________________________
The routines declared in the workset program are the following:
TABLE 4
__________________________________________________________________________
NAME                FUNCTION
__________________________________________________________________________
addItem             Takes a string and adds it to a workset if the string is
                    not already a member of the workset. Accepts the string
                    and a workset name.
addWorkSet          Adds a string of colon delimited items to a workset. If
                    the workset does not exist, it is created. Can optionally
                    accept a time out value, which if present means the
                    addition is temporary, and will be deleted from the
                    specified workset after the timeout period has expired.
                    Accepts a hostname, workset name, a string, and
                    optionally, a time out period.
checkInclExcl       Determines whether an item is on the include or exclude
                    list of a particular workset. checkInclExcl first checks
                    the workset for an include list. If an include list
                    exists, and if the item is on the include list, then
                    checkInclExcl returns the string "INCLUDE". If an include
                    list exists and the item is not on the include list,
                    checkInclExcl returns an empty string.
                    If an include list does not exist, checkInclExcl then
                    checks to see if an exclude list exists. If an exclude
                    list exists and the item is not on the exclude list,
                    checkInclExcl returns "INCLUDE". If an exclude list
                    exists and the item is on the exclude list, checkInclExcl
                    returns an empty string.
                    checkInclExcl accepts a workset name and an itemstring.
checkWorkSet        Reviews the workset list of items waiting to be added or
                    deleted and executes the adds and deletes if the
                    appropriate time has arrived. checkWorkSet does not
                    accept any parameters.
delItem             Accepts an itemstring and a workset, goes through the
                    workset and deletes every item in the itemstring from the
                    workset, and then returns the (modified) workset.
delWorkSet          Accepts a hostname, a workset name, itemlist, and
                    optionally a time out period. Deletes each item in the
                    itemlist from the workset, and if after the deletion(s)
                    no elements remain in the workset, deletes the workset
                    itself. If the optional time out period is provided, the
                    deletion is temporary, and after the expiration of the
                    timeout period, the workset is restored.
getProblemInterval  Accepts a problem name and searches for the workset entry
                    with the corresponding problem name to retrieve the
                    problem checking interval. This function is usually used
                    in the initialization section of a rule in the expert
                    system to get the interval.
getWorkSet          Accepts a hostname and a workset name, and returns the
                    itemstring containing elements of the specified workset.
isItem              Accepts a workset name and an itemstring, examines
                    whether the itemstring is contained in the workset,
                    returning TRUE if found and FALSE if not found.
__________________________________________________________________________
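The include/exclude decision described for checkInclExcl in Table 4, together with the colon separated item lists of Table 3, can be illustrated with the short Python sketch below. The data layout is an assumption: worksets are modeled as a dictionary whose entries hold optional `include` and `exclude` item strings, and the helper functions operate on item strings directly rather than on database records. The behavior when neither list exists is not stated in the table; returning "INCLUDE" in that case is a guess made for the example.

```python
def split_items(itemlist):
    """Split a colon separated item string, ignoring empty fields."""
    return [item for item in itemlist.split(":") if item]


def add_item(itemlist, item):
    """Add an item unless it is already a member (cf. addItem)."""
    items = split_items(itemlist)
    if item not in items:
        items.append(item)
    return ":".join(items)


def del_item(itemlist, itemstring):
    """Delete every item named in itemstring (cf. delItem)."""
    doomed = set(split_items(itemstring))
    return ":".join(i for i in split_items(itemlist) if i not in doomed)


def check_incl_excl(worksets, workset_name, item):
    """Return "INCLUDE" or "" following the order described in Table 4."""
    workset = worksets.get(workset_name, {})
    include = workset.get("include")
    exclude = workset.get("exclude")
    if include is not None:
        # An include list takes priority over any exclude list.
        return "INCLUDE" if item in split_items(include) else ""
    if exclude is not None:
        return "" if item in split_items(exclude) else "INCLUDE"
    return "INCLUDE"  # assumption: no lists means nothing is filtered out


worksets = {
    "FILESYS": {"include": "/usr:/home"},
    "PROCESS": {"exclude": "init:swapper"},
}
print(check_incl_excl(worksets, "FILESYS", "/usr"))
print(repr(check_incl_excl(worksets, "PROCESS", "init")))
```

A rule could call such a function before testing an object, skipping any object for which the result is an empty string.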
As previously mentioned, the workset program does not contain any rule declarations.
2. configs: A program which contains database declarations and routines relating to configurations. The configs program does not declare any rules. In SYSTEMWatch AI-L, configuration refers to a method of assigning threshold values and other data to a particular computer. Configurations provide a mechanism by which the system administrator can change the behavior of SYSTEMWatch AI-L without having to modify the rules of the application layer.
For example, suppose SYSTEMWatch AI-L contains a rule which notifies the system administrator when the load average of a computer remains above a certain threshold, such that the computer has become non-responsive. This threshold number will vary across a variety of computers because a more powerful computer can remain responsive at the same load average which might cause a less powerful computer to become non-responsive. Therefore, if a particular computer site has, say, 10 computers of lesser power and 2 more powerful computers, the proper way to configure SYSTEMWatch AI-L would be to specify a lower load average threshold for the 10 weaker computers, and a higher threshold for the 2 more powerful computers.
In SYSTEMWatch AI-L, configurations are specified in a text file. After the SYSTEMWatch AI-L client has read in each of the program files, it reads the configuration file. Thus, the system administrator can alter the threshold value used by the rules by modifying the text file containing the configuration information. Configurations can be specified in one of 5 formats:
CONF:<hostname>:<config parameter name>:<string>:string value:
Which is used to associate a string value with a config parameter name of type string.
CONF:<hostname>:<config parameter name>:<num>:numeric value:
Which is used to associate a numeric value with a config parameter name of numeric type.
CONF:<hostname>:<config parameter name>:PROBLEM:<problem name>:
Which is used to associate a configuration parameter name with a particular problem.
CONF:<hostname>:<config parameter name>:SUBPROBLEM:<behavior 1>:
Which is used to associate a configuration parameter name with a particular subproblem.
WORK:<hostname>:<workset name>:item1:item2: . . . :itemN:
Which is used to associate a workset name with a list of data.
In all five formats above, the <hostname> field can be either the name of a host being managed or DEFAULT, which means all hosts except those which have a specific entry. Thus, in the example above, if the threshold for the 10 less powerful computers should be 5.2 and the threshold for the 2 more powerful computers should be 7.5, the following configuration declarations would be appropriate, assuming the 2 more powerful computers have host names of server1 and server2 and the config parameter name is UNRESP LOAD AVE:
CONF:DEFAULT:UNRESP LOAD AVE:NUM:5.2:
CONF:server1:UNRESP LOAD AVE:NUM:7.5:
CONF:server2:UNRESP LOAD AVE:NUM:7.5:
The database declarations made in the configs program are, for instance, the following:
TABLE 5
__________________________________________________________________________
ENTITY  PROPERTY    TYPE    DESCRIPTION
__________________________________________________________________________
CONFIG  VALTYPE     string  The data type for a particular configuration
                            parameter
CONFIG  STRINGVAL   string  The string value for a particular configuration
                            parameter
CONFIG  NUMVAL      float   The numeric value for a particular configuration
                            parameter
CONFIG  PROBLEM     string  This value indicates the general class of problem
                            or type of configuration described by this value.
CONFIG  SUBPROBLEM  string  This value indicates a more specific measurement
                            of subproblem as it relates to more general
                            configurations described by this value.
__________________________________________________________________________
The routines declared in the configs program are, for example, the following:
TABLE 6
__________________________________________________________________________
NAME              FUNCTION
__________________________________________________________________________
declConfig        This routine declares a configuration entry. It accepts a
                  host name, configuration parameter name, a value type, a
                  problem name, and a subproblem name.
delConfig         This routine deletes from the database a particular
                  configuration record. It accepts a host name and a
                  configuration parameter name.
getConfigStr      This routine returns the string value of a configuration
                  parameter name if the configuration parameter name is of
                  string type. It accepts a host name and a configuration
                  parameter name.
getConfigNum      This routine returns the numeric value of a configuration
                  parameter name if the configuration parameter name is of
                  numeric type. It accepts a host name and a configuration
                  parameter name.
getConfigType     This routine returns the type of a configuration parameter
                  name. It accepts a host name and a configuration parameter
                  name.
getConfigProblem  This routine returns the problem associated with a
                  configuration parameter name. It accepts a host name and a
                  configuration parameter name.
readConfigFile    This routine reads a file which contains configuration and
                  workset declarations. It accepts a file name.
setConfig         This routine sets the value of a particular configuration
                  parameter name. It accepts a hostname, a configuration
                  parameter name, and a value.
__________________________________________________________________________
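The DEFAULT fallback described for the configuration file can be made concrete with a small Python sketch. It handles only the numeric form shown in the UNRESP LOAD AVE example (a `NUM` value type); the string, PROBLEM, SUBPROBLEM, and WORK forms are skipped. The function names `read_config_lines` and `get_config_num` are hypothetical stand-ins loosely modeled on readConfigFile and getConfigNum, and the in-memory dictionary replaces the CONFIG database records of Table 5.

```python
def read_config_lines(lines):
    """Parse numeric CONF declarations into {(host, parameter): value}."""
    table = {}
    for line in lines:
        fields = line.strip().split(":")
        # Expected shape: CONF:<hostname>:<parameter>:NUM:<value>:
        if len(fields) < 5 or fields[0] != "CONF" or fields[3] != "NUM":
            continue  # other declaration formats are ignored in this sketch
        _, host, parameter, _, value = fields[:5]
        table[(host, parameter)] = float(value)
    return table


def get_config_num(table, host, parameter):
    """Return the host-specific value, falling back to the DEFAULT entry."""
    if (host, parameter) in table:
        return table[(host, parameter)]
    return table[("DEFAULT", parameter)]


config = read_config_lines([
    "CONF:DEFAULT:UNRESP LOAD AVE:NUM:5.2:",
    "CONF:server1:UNRESP LOAD AVE:NUM:7.5:",
    "CONF:server2:UNRESP LOAD AVE:NUM:7.5:",
])
print(get_config_num(config, "server1", "UNRESP LOAD AVE"))
print(get_config_num(config, "workstation7", "UNRESP LOAD AVE"))
```

Here `workstation7` is an invented host name standing in for one of the 10 less powerful computers, which have no specific entry and therefore receive the DEFAULT threshold.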
As previously mentioned, the configs program does not declare any rules.
3. events: A program which contains database declarations and routines which implement the SYSTEMWatch AI-L event handler, which allows SYSTEMWatch AI-L to execute functions either at specified times or periodically. The events program does not declare any rules. The events program defines an ordered list of records, each describing a type of event. The order is such that the next event to be executed is first on the list. Each record contains the next event time, the function to be executed at that event, and two optional values, viz., the number of instances that event is to be executed, and the interval between those instances. To add an event, an event record is added to the database. SYSTEMWatch AI-L will check for events whenever the checkEvent function is called. This function call should be placed in the main loop of the SYSTEMWatch AI-L client and the SYSTEMWatch AI-L console. The database declarations made in the events program are, for instance, the following:
TABLE 7
__________________________________________________________________________
ENTITY  PROPERTY   TYPE     DESCRIPTION
__________________________________________________________________________
EVENT   EVENTNAME  string   Unique generated name for a scheduled event.
EVENT   FUNCTION   string   Name of function to be executed at the time of
                            the event. (Function name only--do not include
                            any command line arguments for the function)
EVENT   ALARMTIME  integer  The alarm time after which the event gets
                            executed
EVENT   INTERVAL   integer  The minimum time between event repetitions
EVENT   REPEATS    integer  Number of times the event gets put back onto the
                            event queue, after the currently scheduled event
                            has been executed.
__________________________________________________________________________
The routines declared in the event program are, for example, the following:
TABLE 8
__________________________________________________________________________
NAME          FUNCTION
__________________________________________________________________________
addEvent      Given a function name, a time period, and an optional
              repetition factor, addEvent schedules SYSTEMWatch AI-L to
              execute the function named at a time equal to the present
              time plus the time period. If the optional repetition factor
              is given, the function is scheduled that many times, each
              time differing from the previous event time by the time
              period.
checkEvent    Checks the event list to see if any events are ready to
              execute. If so, the ready events are executed.
delEvent      Accepts a function name and removes all occurrences of that
              function from the event handling system.
getNextEvent  Returns the clock time to the next event waiting.
__________________________________________________________________________
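The event handler described by Tables 7 and 8 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SYSTEMWatch AI-L implementation: the class name EventQueue, the in-memory heap standing in for the EVENT records, the function registry, and the reading of the repetition factor as a total execution count are all hypothetical choices made for the example.

```python
import heapq
import itertools
import time


class EventQueue:
    """Time-ordered event list; the earliest event is always first."""

    def __init__(self, functions, clock=time.time):
        self._functions = functions    # hypothetical registry: name -> callable
        self._clock = clock            # injectable clock, useful for testing
        self._heap = []                # (alarm_time, seq, name, interval, repeats)
        self._seq = itertools.count()  # tie-breaker for equal alarm times

    def add_event(self, name, period, times=1):
        """Schedule `name` at now + period, `times` executions in total (assumed)."""
        alarm = self._clock() + period
        heapq.heappush(self._heap, (alarm, next(self._seq), name, period, times - 1))

    def check_event(self):
        """Execute every event whose alarm time has been reached."""
        now = self._clock()
        while self._heap and self._heap[0][0] <= now:
            alarm, _, name, interval, repeats = heapq.heappop(self._heap)
            self._functions[name]()
            if repeats > 0:
                # Put the event back on the queue, one interval later.
                entry = (alarm + interval, next(self._seq), name, interval, repeats - 1)
                heapq.heappush(self._heap, entry)

    def del_event(self, name):
        """Remove all occurrences of `name` from the queue."""
        self._heap = [e for e in self._heap if e[2] != name]
        heapq.heapify(self._heap)

    def get_next_event(self):
        """Return the alarm time of the next waiting event, or None."""
        return self._heap[0][0] if self._heap else None
```

A main loop would call check_event() once per iteration, as the text prescribes for checkEvent. Re-queuing from the previous alarm time rather than from the current clock time keeps repetitions one interval apart even when the loop runs late.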
As previously mentioned, the events program does not declare any rules.
4. requests: A program which contains two routines used by SYSTEMWatch AI-L for communication between the SYSTEMWatch AI-L client and the SYSTEMWatch AI-L console. The requests program contains only these two function declarations. The requests program provides a higher level interface to the communications system by performing some message formatting and calling the communication functions declared in the coms program discussed below. The routines declared in the requests program are, for example, the following:
TABLE 9
__________________________________________________________________________
NAME           FUNCTION
__________________________________________________________________________
request        Accepts a hostname, module name, a function name, and
               arguments to the function named. Sends a message to request
               the module on the host specified to execute the named
               function with the specified arguments.
requestReport  Accepts information which identifies a particular report and
               a particular module which requested the specified report.
               Formats a string containing a report request, and sends the
               resulting string to the specified module.
__________________________________________________________________________
5. coms: A program which contains routines relating to a SYSTEMWatch AI-L supplemental communications system. The coms program contains neither database declarations nor rules. The routines declared in the coms program are, for example, the following:
TABLE 10
__________________________________________________________________________
NAME               FUNCTION
__________________________________________________________________________
manageMe           Given a host name, a module name, and an optional
                   string, adds a SYSTEMWatch AI-L client to the console
                   list by calling the addWorkSet routine. Also sends a
                   "notifyMe" message unless the optional string is equal
                   to "NO RESPONSE".
notifyMe           Accepts a hostname and a module name and adds a process
                   to the liveconsole list. If a process is not on the
                   approved console list, this function does nothing.
sendMultiString    Accepts one or more hostnames ("process list"), a module
                   name, a message, and a mailbox name, and sends the
                   message to the specified module on each host in the
                   process list, using the specified mailbox.
getMultiData       Accepts a workset which contains a list of processes and
                   an entity name. This function requests data from each of
                   the processes on the list. The data requested is all the
                   data contained in the specified entity.
multiRequest       Accepts a function name, parameters for the function,
                   and a list of at least one pair of hostname and module
                   name. Sends a message to each of the hostname/module
                   name combinations requesting that they execute the
                   specified function with the specified parameters.
sendMultiManageMe  Sends multiple manage me messages to the SYSTEMWatch
                   AI-L consoles on the console list.
sendMultiNotifyMe  Sends multiple notify me messages to SYSTEMWatch AI-L
                   clients on the client list.
sendData           Accepts a host name, module name, and entity name;
                   sendData sends all the data comprising the specified
                   entity to the specified host.
__________________________________________________________________________
6. lib: A program which contains a series of miscellaneous routines. The lib program contains neither database declarations nor rule declarations. The routines declared in the lib program are, for example, the following:
TABLE 11
__________________________________________________________________________
NAME         FUNCTION
__________________________________________________________________________
fileUser     Accepts a file name and returns a colon delimited list of the
             users who are using the specified file.
istr         Accepts a floating point number and returns a string which
             contains the integer portion of the floating point number.
ls           Accepts an optional path name. If the path name is specified,
             ls returns a directory listing of the specified path. If no
             path name is specified, ls returns the directory listing of
             the current working directory.
mkDirTree    Accepts a directory name and creates all the directories
             necessary to create the directory name specified. Thus, if a
             file system only contains the root directory (/), and
             mkDirTree is called with the directory name of /A/B/C,
             mkDirTree creates the following directories: /A; /A/B; and
             /A/B/C.
procAlive    Accepts a process id and determines whether the process id
             specified corresponds to a process in the process table.
readSwap     Obtains the following information from the virtual memory
             subsystem:
             swapused: the amount of swap space used on the system.
             swaptotal: the total amount of swap space allocated on the
             system.
             swapavail: the remaining amount of swap space.
             swapperc: the percentage of the allocated swap space that is
             used.
systemInOut  Accepts a command name and an input string. Executes the
             command named using the specified input string as the
             command's input. Returns a string equal to the output of the
             command.
systemOut    Accepts a command name, executes the command specified, and
             returns a string equal to the output of the command.
lockProcess  Accepts a directory name and a filename. lockProcess is used
             when only one process of a particular kind should be running
             at any one time. It guarantees process uniqueness by first
             testing whether the lock file exists, and whether it has the
             current process id in it. If it has a process id in it and
             that process is still alive, it returns with a warning
             message. If the process id in the file is not a live process,
             lockProcess writes its own process id into the file.
             lockProcess then re-reads the file, and if it finds its own
             process id in the file, lockProcess returns without error.
             Otherwise, an error message is generated.
__________________________________________________________________________
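The lockProcess and procAlive routines of Table 11 lend themselves to a short illustration. The following Python sketch follows the steps listed for lockProcess; it is not the patented code. The function names, the boolean return value in place of a warning message, and the use of a signal-0 probe (POSIX only) to test whether a process is alive are assumptions of this example.

```python
import os


def proc_alive(pid):
    """Return True if `pid` names a live process (POSIX signal-0 probe)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True      # the process exists but belongs to another user
    return True


def lock_process(directory, filename):
    """Return True if this process holds the lock, False if a live peer does."""
    path = os.path.join(directory, filename)
    me = os.getpid()
    try:
        with open(path) as f:
            holder = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        holder = None    # no lock file, or no usable process id in it
    if holder is not None and holder != me and proc_alive(holder):
        return False     # the "warning" case: another instance is running
    with open(path, "w") as f:
        f.write(str(me))
    # Re-read the file to confirm that our process id is the one recorded.
    with open(path) as f:
        if f.read().strip() != str(me):
            raise RuntimeError("could not acquire lock file " + path)
    return True
```

The re-read narrows, but does not close, the window in which two processes could both write the file; an atomic create (for example, opening with os.O_EXCL) would be a stricter alternative to the scheme described in the table.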
7. alerts: A program which contains the database declarations and routines used in the SYSTEMWatch AI-L alert handling system, which is used to manage problem alerts and their associated actions. The alerts program does not declare any rules. The SYSTEMWatch AI-L alert handling mechanism provides a method of presenting problem notifications to the system administrator. There are several concepts used within the alert handling mechanism:
1. PRIORITY: Each alert within SYSTEMWatch AI-L has an associated priority, which indicates the seriousness of the detected problem. In one embodiment, SYSTEMWatch AI-L uses, for example, 5 priority levels: FYI (least serious), NOTIFY, PROBLEM, FAILURE, and CRITICAL (most serious).
2. ESCALATION: After an alert has been created, SYSTEMWatch AI-L provides a technique for automatically changing the PRIORITY of the alert over time, so that a particular alert can be promoted or demoted. The promotion/demotion process is known as an escalation scheme. SYSTEMWatch AI-L supports multiple, user-defined escalation schemes. In SYSTEMWatch AI-L, an escalation scheme is defined with a name and can be associated with an alert by referencing that name. SYSTEMWatch AI-L stores the escalation schemes in the configuration file.
3. TIME OUT: After an alert has been created, or after an alert has been escalated to a particular state, the technique of the present invention provides for timing out the alert. A timed out alert is cleared from the alert system.
4. CLEAR: After an alert has been created, SYSTEMWatch AI-L provides a mechanism for clearing the alert, which removes it from the active alert pool. Cleared alerts, however, remain within SYSTEMWatch AI-L for some period of time. That period is called the reset time; if the condition which caused the alert recurs during the reset time, the alert will not be posted. Once the reset time period has elapsed, the alert is completely removed from the alert handling mechanism, and if the condition which can cause the alert to be generated recurs, a new alert is posted. Each alert can have a different reset time.
5. IGNORE: After an alert has been created, SYSTEMWatch AI-L provides a mechanism for ignoring the alert, which, like clearing an alert, removes it from the active alert pool. Like a cleared alert, the ignored alert is kept within the alert handling mechanism and has an associated time period, called an ignore time, during which the alert will not be posted if the condition recurs. Unlike the CLEAR mechanism, however, the IGNORE mechanism does not necessarily have a fixed ignore time for each alert. Rather, SYSTEMWatch AI-L supports an ignore scheme similar to the escalation scheme. In the ignore scheme, SYSTEMWatch AI-L remembers how many times a particular alert has been ignored. By specifying an ignore scheme, it is possible to vary the length of the ignore period depending on how many times that particular alert has already been ignored. The typical application is to lengthen the ignore period as the number of ignore actions taken for a particular alert increases. This way, SYSTEMWatch AI-L can "learn" from the actions of the system administrator and interrupt the system administrator less frequently with an alert that he has previously ignored. In SYSTEMWatch AI-L, an ignore scheme is defined with a name, and thereafter, the ignore scheme can be applied to any alert by referencing its name. SYSTEMWatch AI-L stores the ignore schemes in the configuration file.
6. ALERT ID v. ALERT REFERENCE NUMBER: Each alert in SYSTEMWatch AI-L can be identified either by an alert id, which, when combined with a host name and a module name, uniquely identifies an alert, or by an alert reference number, which, when combined with a host name and a module name, uniquely identifies an alert, but only during a specific time period. In other words, the alert id is a unique number generated by SYSTEMWatch AI-L as each alert is created. On the other hand, so that the system administrator can refer to an alert without having to type a large multi-digit number, SYSTEMWatch AI-L creates a smaller number (in one example, typically 2 digits) which points to an active alert. In order to keep the alert reference number at 2 digits, SYSTEMWatch AI-L automatically reuses the alert reference numbers over time, so an alert reference number can only uniquely identify an alert within a certain window of time.
7. ALERT NAME and ALERT INSTANCE NAME: In addition to the alert id and the alert reference number described in the paragraph above, each alert in SYSTEMWatch AI-L can also be identified through a combination of two items, specifically the alert name and the alert instance name. The alert name identifies the class of problem which triggered the alert, while the alert instance name identifies the object involved in the problem. For example, if the /usr file system reaches 90% capacity, and the fact that a file system reached 90% capacity is defined as a problem named FSFYI, then the alert name in this case is FSFYI and the alert instance name is /usr.
8. OWNER: SYSTEMWatch AI-L allows a system administrator to optionally assign owner(s) to a problem identified in an alert. This is used when the system administrator decides that someone must manually resolve the problem. Once an alert has at least one owner, the alert ceases to escalate or time out. The alert remains active within the alert handling system, and will not be removed until it is cleared.
9. PROBLEM HIERARCHIES and UNIQUE LISTS: Alerts in SYSTEMWatch AI-L may be arranged in problem hierarchies. Problem hierarchies are used to prevent a problem from triggering several overlapping alerts. For example, suppose three problems were defined as:
TABLE 12
______________________________________
Problem Description
______________________________________
FSFYI A file system reached 90% capacity
FSWARN A file system reached 95% capacity
FSALERT A file system reached 98% capacity
______________________________________
If a particular file system reached 98% capacity, the 3 rules which detect the FSFYI, FSWARN, and FSALERT problems would all attempt to post alerts of alert types FSFYI, FSWARN, and FSALERT for the same alert instance (in this case, the file system name). However, this is redundant. What is needed is a single alert of type FSALERT. To resolve this problem, SYSTEMWatch AI-L allows problems to be grouped into hierarchies. Once a problem hierarchy has been defined, SYSTEMWatch AI-L automatically ensures that only the alert with the most severe priority in a particular hierarchy survives. Problem hierarchies are specified in the SYSTEMWatch AI-L configuration. In SYSTEMWatch AI-L, problem hierarchies are called unique lists.
With an understanding of the above information, the operation of an alert mechanism in accordance with the principles of the present invention can now be described. When a rule detects a problem, the rule posts an alert to the alert mechanism by calling the function addAlert. During the SYSTEMWatch AI-L client's main loop, the SYSTEMWatch AI-L client calls the function checkAlert to handle alert escalation and alert clearing.
When the addAlert function is called, SYSTEMWatch AI-L performs 5 validation tests before a new alert is created. In the description below, the term candidate alert refers to the alert given to addAlert for posting. The validation tests are the following:
1. Unique List Check: In order to prevent a severe problem from posting related and less severe alerts, addAlert queries the database to see whether there is an existing alert with the same alert instance name and an alert name which occupies a higher priority position in the same unique list as the candidate alert. If such an alert exists, the candidate alert is rejected and not posted.
2. Duplicate Alert Check: In order to avoid posting multiple identical alerts at different times, addAlert queries the database for an alert with the same alert name and alert instance name. If such an alert exists, the candidate alert is rejected and not posted.
3. Ignore List Check: In order to avoid posting a new alert while the problem is being ignored, addAlert queries the database for a corresponding entry of IGNORE_IGNORETIME. If such an entry exists, addAlert compares the current clock time with the value of the entry found. The candidate alert is rejected if the clock time is less than or equal to the value of the entry found, because that condition means that the alert is being ignored at this time.
4. Clear List Check: In order to avoid posting a new alert while the problem is in the cleared state, addAlert queries the database for a corresponding entry of ALERT_CLEARED. If such an entry exists and its value is true, then addAlert queries the database for an entry of ALERT_CLEARTIME and compares its value against the clock time. The candidate alert is rejected if the clock time is less than or equal to the value of the entry found, because that condition means that the alert was cleared and the current time is within the reset time period.
5. Lower Priority Check: While the unique list check prevents a severe problem from also posting less severe alerts, if a more severe problem occurs after a less severe problem belonging to the same unique list has already posted an alert, the less severe alert must be removed before the more severe alert is posted. Thus, addAlert queries the database for an alert with the same instance name and an alert name which is of a lower priority on the same unique list as the candidate alert. If such an alert is found, it is deleted.
If the candidate alert passes the 5 validation tests described above, the alert is posted.
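The five validation tests can be condensed into a short Python sketch. It is illustrative only: the AlertStore class, its in-memory dictionaries (which stand in for the database queries), and the convention that a unique list is ordered from least to most severe are assumptions of this example rather than details taken from the source.

```python
# Hypothetical unique list, ordered from least to most severe.
UNIQUE_LISTS = [["FSFYI", "FSWARN", "FSALERT"]]


def unique_list_for(name):
    """Return (unique list, position) for an alert name, or ([], -1)."""
    for members in UNIQUE_LISTS:
        if name in members:
            return members, members.index(name)
    return [], -1


class AlertStore:
    def __init__(self):
        self.active = {}         # (alert name, instance name) -> alert record
        self.ignore_until = {}   # (alert name, instance name) -> ignore expiry
        self.cleared_until = {}  # (alert name, instance name) -> reset expiry

    def add_alert(self, name, instance, now):
        """Apply the five validation tests; return True if the alert is posted."""
        key = (name, instance)
        members, pos = unique_list_for(name)
        # 1. Unique list check: a more severe related alert already exists.
        if any((other, instance) in self.active for other in members[pos + 1:]):
            return False
        # 2. Duplicate alert check.
        if key in self.active:
            return False
        # 3. Ignore list check: still inside the ignore period.
        if key in self.ignore_until and now <= self.ignore_until[key]:
            return False
        # 4. Clear list check: still inside the reset time period.
        if key in self.cleared_until and now <= self.cleared_until[key]:
            return False
        # 5. Lower priority check: delete less severe related alerts.
        for other in members[:pos]:
            self.active.pop((other, instance), None)
        self.active[key] = {"created": now}
        return True
```

With this ordering, posting FSALERT for /usr after FSFYI replaces the FSFYI alert, and a later FSWARN for /usr is rejected, so only the most severe alert of the hierarchy survives.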
Posting an alert is a multistep process which involves the following steps:
1. Create Alert: addAlert creates an alert in the database with the alert name and alert instance name corresponding to the candidate alert.
2. Add Information to Alert: addAlert stores descriptive information into the alert.
3. Assign Starting Priority: addAlert queries the configuration information stored in the database to retrieve the problem priority associated with an alert with the same alert name as the candidate alert, and assigns the same priority to the alert.
4. Assign Escalation Scheme: addAlert queries the configuration information stored in the database to retrieve the escalation scheme associated with an alert with the same alert name as the candidate alert, and stores the same with the alert.
5. Assign Ignore Scheme: addAlert queries the configuration information stored in the database to retrieve the ignore scheme associated with an alert with the same alert name as the candidate alert, and stores the same with the alert.
6. Assign Available Actions: addAlert queries the configuration information stored in the database to retrieve the available actions associated with an alert with the same alert name as the candidate alert, and stores the same with the alert.
7. Assign Default Actions: addAlert queries the configuration information stored in the database to retrieve the default actions associated with an alert with the same alert name as the candidate alert, and stores the same with the alert.
8. Update SYSTEMWatch AI-L Consoles: addAlert allows the alert to be communicated to the attached SYSTEMWatch AI-L consoles by calling updateNetworkAlert.
9. Save Alert to Disk: addAlert saves the alert to a disk file.
10. Execute Default Action: addAlert executes any default actions associated with the alert.
When the checkAlert function is called as part of the main loop of the SYSTEMWatch AI-L client and the SYSTEMWatch AI-L console, alert escalation and alert clearing are performed.
Alert escalation is performed by executing the following steps for each alert which has not been cleared, ignored, or assigned an owner:
1. Query the database to retrieve the "escalation item" of an alert with the same alert name and with a priority equal to the alert's current priority. This information consists of the current priority, a time period, and a new priority.
2. If the time the alert has been in the current priority state is greater than or equal to the time period above, change the alert's priority to the new priority, according to the escalation scheme.
3. If the new priority is zero, clear the alert by removing it from the active alerts and placing it on the clear list for the reset time period.
4. Determine whether any default action(s) are registered for this alert name, this priority, and the current time. If so, execute such actions by calling the doAction function.
Alert clearing is performed by executing the following steps for each of the alerts:
1. Query the database to see whether the alert has a corresponding entry of ALERT_CLEARED. If so, and if the value is true, perform step 2. Otherwise, the process is done for this alert.
2. Query the database and retrieve a corresponding entry of ALERT_CLEARTIME. Check its value against the clock time. If that time is less than or equal to the clock time, the alert has been cleared and the reset time has expired, so remove the alert.
Ignoring an alert is accomplished by performing the following steps:
1. If this is the first time this alert has been ignored, store to the database a corresponding entry of IGNORE_IGNORECOUNT with value 0.
2. Query the database for an entry of IGNORE_IGNORECOUNT associated with this alert. Increment the value by one and store it back into the database.
3. Query the database for the configuration of the associated ignore scheme for this alert name and alert instance.
4. Get the Nth entry in the ignore scheme, where N is the value of the updated IGNORE_IGNORECOUNT stored in step 2, and store into the database a corresponding entry of IGNORE_NEXTTIME with a value equal to the current time plus the time interval of the Nth entry. Note that if the IGNORE_IGNORECOUNT value is greater than the number of entries in the ignore scheme, a very large number is put into IGNORE_NEXTTIME. This makes the ignore period infinite for all practical purposes, thereby preventing the alert from recurring.
Note that escalation schemes and ignore schemes can be different for each managed computer by including computer specific information in the configuration database. The alerts program contains the following database declarations:
TABLE 13
__________________________________________________________________________
ENTITY  PROPERTY          TYPE     DESCRIPTION
__________________________________________________________________________
ALERT   PRIORITY          string   Describes priority of problem with the
                                   following words: FYI, NOTIFY, PROBLEM,
                                   FAILURE, CRITICAL.
ALERT   PROBLEMAREA       string   Describes the general nature of the
                                   problem.
ALERT   SHORTDESCRIPTION  string   Provides a brief overview of the problem.
ALERT   DETAIL            string   Provides a detailed overview of the
                                   problem.
ALERT   RECOMMENDFILE     string   Offers recommended solutions to problem,
                                   including useful System data.
ALERT   RECOMMENDFLAG     integer  TRUE if RECOMMENDFILE exists.
ALERT   HISTORYFILE       string   A cumulative problem history, saved in an
                                   outside file. The filename is stored in
                                   this field.
ALERT   HISTORYFLAG       string   TRUE if HISTORYFILE exists and is a valid
                                   file name.
ALERT   ACTIONSAVAILABLE  string   Provides information about actions
                                   available for problem type. Different
                                   actions are separated by colons, such as
                                   1stAction:2ndAction.
ALERT   ACTIONSTAKEN      string   Provides information about actions in
                                   progress and previously taken for this
                                   alert. It is the action's responsibility
                                   to maintain this field. Multiple actions
                                   are separated by colons.
ALERT   ACTIONTIME        integer  Contains a time stamp for when the action
                                   should review the current action of this
                                   Alert. This field is under the control of
                                   the action.
ALERT   CREATTIME         integer  Time stamp of when the alert was created.
ALERT   CLEARED           integer  If a record has the cleared flag set to
                                   TRUE, then an alert will not be displayed
                                   as a live alert. However, it is still
                                   tracked in the database to avoid immediate
                                   realerts of the same problem.
ALERT   CLEARTIME         integer  Time at which the cleared alert is
                                   automatically removed from the list and a
                                   new problem can be generated.
ALERT   ESCALATION        string   Specifies name of escalation mechanism to
                                   use for this alert.
ALERT   ESCALTIME         integer  Time of next escalation check.
ALERT   OWNER             string   This is a list of people who claim
                                   ownership for the problem and are thereby
                                   acknowledging the problem's existence,
                                   which stops problem escalation.
ALERT   PROBLEMID         string   Contains problem id:host:entity. For
                                   example: FSWARN:HOST1:/dev/sd0a. Used for
                                   tracking if a problem has been previously
                                   seen and whether to realert.
ALERT   NOTIFY            string   Notify gets set to ADD, OWNER, or
                                   RECOMMEND depending on what value changed.
                                   Multiple notifications are allowed by
                                   delimiting the notification items with
                                   colons.
ALERT   PROCESS           string   Specifies the owning and originating
                                   process in the HOST:MODULE format. The
                                   PROCESS field with the ALERTID uniquely
                                   specifies a process. It is the originating
                                   process's responsibility to maintain
                                   unique ALERTIDs. By default, any PROCESS
                                   specified by just the HOST will default to
                                   the SYSTEMWatch AI-L client module.
ALERT   ALERTID           integer  An identification number unique to the
                                   originating process specified in the
                                   PROCESS property.
ALERT   REFNUM            integer  Temporary reference number that is used on
                                   each local host to identify a particular
                                   alert from the alert displays without
                                   having to type the whole alert name.
IGNORE  IGNORECOUNT       integer  Number of times the user requested to
                                   ignore the problem.
IGNORE  NEXTTIME          integer  Describes the next time that particular
                                   alert instance may reappear if the
                                   particular problem is noticed again.
REFNUM  REFNUM            integer  Contains an Alert Reference Number
                                   allocated to a particular local alert.
__________________________________________________________________________
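The escalation and ignore procedures described before Table 13 can be illustrated with the following Python sketch. It is a sketch under assumptions, not the actual checkAlert code: alerts are plain dictionaries whose keys loosely mirror the ALERT properties above, an escalation scheme is a hypothetical mapping from the current priority to a (time period, new priority) pair, an ignore scheme is a list of intervals in seconds, and FOREVER stands in for the "very large number".

```python
FOREVER = 2 ** 62  # stands in for the "very large number" of the ignore procedure


def escalate(alert, scheme, now):
    """One escalation pass over a single alert.

    `scheme` maps a current priority to (time period, new priority); a new
    priority of 0 means that the alert is cleared.
    """
    if alert.get("cleared") or alert.get("owner"):
        return alert                 # cleared or owned alerts do not escalate
    item = scheme.get(alert["priority"])
    if item is None:
        return alert                 # no escalation item for this priority
    period, new_priority = item
    if now - alert["since"] >= period:
        if new_priority == 0:
            alert["cleared"] = True  # moves to the clear list ...
            alert["cleartime"] = now + alert["reset_time"]  # ... for the reset time
        else:
            alert["priority"] = new_priority
            alert["since"] = now
    return alert


def ignore_alert(ignore_count, scheme, now):
    """Return (updated ignore count, next time the alert may reappear)."""
    count = ignore_count + 1         # first ignore: 0 -> 1
    if count > len(scheme):
        return count, FOREVER        # scheme exhausted: ignore indefinitely
    return count, now + scheme[count - 1]  # Nth entry, counting from 1
```

For example, an ignore scheme of [3600, 86400] suppresses the alert for one hour after the first ignore action, for one day after the second, and indefinitely after the third.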
8. filesys: A program which contains database declarations, and rules used by SYSTEMWatch AI-L to monitor files and file systems on a computer. The filesys program detects, for example, the following file system problems:
TABLE 15
__________________________________________________________________________
Problem       Description                                     Available Actions
__________________________________________________________________________
FSFYI         File system has reached 90% full                fsrecom, rmjunk, rmoldjunk
FSWARN        File system has reached 95% full                fsrecom, rmjunk, rmoldjunk
FSALERT       File system has reached 98% full                fsrecom, rmjunk, rmoldjunk
FSABSMIN      File system has less than 1 Mb free             fsrecom, rmjunk, rmoldjunk
FSINODEFYI    File system has less than 1000 inodes free      fsrecom, rmjunk, rmoldjunk
FSINODEWARN   File system has less than 200 inodes free       fsrecom, rmjunk, rmoldjunk
FSINODEALERT  File system has less than 20 inodes free        fsrecom, rmjunk, rmoldjunk
FSBEHAVE1     Unusual short term behavior: File system        fsrecom, rmjunk, rmoldjunk
              utilization grows 3% in 3 minutes, as compared
              to the average file utilization for the most
              recent 30-minute period.
FSBEHAVE2     Unusual long term behavior: File system         fsrecom, rmjunk, rmoldjunk
              utilization grows 3% over 30 minutes, as
              compared to the average file utilization for
              the most recent 24-hour period.
__________________________________________________________________________
Each of the threshold values underlined in the above table is a default
value, which can be changed by the system administrator on either a
computer specific basis or on a network wide basis via the configuration
mechanism, as described above in the section on the config program.
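As an illustration of how such default thresholds might be overridden, the following Python sketch resolves a threshold by looking first at a computer specific setting, then at a network wide setting, and finally at the default from Table 15. The dictionary layout, the function names, and the precedence order are assumptions made for this example; the source states only that both kinds of override exist.

```python
# Default "percent full" thresholds, taken from Table 15.
DEFAULTS = {"FSFYI": 90.0, "FSWARN": 95.0, "FSALERT": 98.0}


def threshold(problem, host, network=None, per_host=None):
    """Resolve a threshold: host setting, then network setting, then default."""
    network = network or {}
    per_host = per_host or {}
    if problem in per_host.get(host, {}):
        return per_host[host][problem]
    return network.get(problem, DEFAULTS[problem])


def worst_full_problem(percent_used, host, network=None, per_host=None):
    """Return the most severe "file system full" problem that applies, or None."""
    for problem in ("FSALERT", "FSWARN", "FSFYI"):  # most severe first
        if percent_used >= threshold(problem, host, network, per_host):
            return problem
    return None
```

Checking the most severe problem first mirrors the effect of a unique list, in which only the highest priority alert of a hierarchy is kept.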
The FSBEHAVE1 and FSBEHAVE2 problems can only be detected if the SYSTEMWatch AI-L client can establish a historical trend line for file system space utilization. The SYSTEMWatch AI-L client performs the historical trend line evaluation by using a recursive average filter. (This filter can also be used in areas other than file system space monitoring.) Although trend line analysis can also be performed using a moving average filter, a moving average filter is less desirable than a recursive average filter because the latter can accommodate more historical data and can function in an environment where the sample measurement time is irregular. The recursive average filter takes the current measurement as its first value and computes each subsequent value as a weighted average of the prior value and the new measurement. The weighting factor, which is called "ratio" below, may be set depending on the desired sensitivity to fluctuations in the current value. The higher the ratio is set, the more the computed value will fluctuate. In SYSTEMWatch AI-L, the ratio used depends upon the measurement window and the time difference between the prior calculation and the current calculation. The advantage of this ratio is that it provides a filter which gives a consistent response even if the measurement intervals vary substantially. This is important, since a real time measurement system cannot necessarily guarantee the accuracy of the time between calculations. One example of a recursive average filter technique is the following:
Xnow = the current value of the measurement, in this case, the file system space utilization.
Xp = the historical value if it exists; otherwise, for the first calculation, it is equal to Xnow.
tdelta = current time - previous time Xp was calculated
ratio = 1, if tdelta > time window; otherwise, ratio = tdelta / time window
Xp = (Xp × (1 - ratio)) + (Xnow × ratio)
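The filter translates almost directly into code. The Python sketch below follows the definitions of Xnow, Xp, tdelta, and ratio given above; the class name RecursiveAverage and the choice of seconds as the time unit are assumptions of this example.

```python
class RecursiveAverage:
    """Recursive average filter with a time-dependent weighting ratio."""

    def __init__(self, window):
        self.window = window  # measurement window, in seconds (assumed unit)
        self.xp = None        # historical (trend) value Xp
        self.t_prev = None    # time at which Xp was last calculated

    def update(self, x_now, t_now):
        """Fold the measurement x_now, taken at time t_now, into the trend."""
        if self.xp is None:
            self.xp = x_now   # first calculation: Xp = Xnow
        else:
            tdelta = t_now - self.t_prev
            ratio = 1.0 if tdelta > self.window else tdelta / self.window
            self.xp = self.xp * (1.0 - ratio) + x_now * ratio
        self.t_prev = t_now
        return self.xp
```

With a 30-minute window (1800 seconds), a sample taken 3 minutes after the previous one receives a weight of 0.1, while a sample arriving after a gap longer than the window replaces the trend value outright. This is how the filter stays consistent when measurements are irregularly spaced.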
The above is an embodiment of how the FSBEHAVE1 and FSBEHAVE2 rules determine the historical trend value. The FSBEHAVE1 and FSBEHAVE2 problems are detected based upon three inputs:
1. The historical trend value within a trend window;
2. The percentage difference of the current value from the trend value; and
3. The period of time over which the difference persists.
A problem occurs when the current value differs from the trend value by the specified amount for longer than a specified period. In the FSBEHAVE1 and FSBEHAVE2 problems, only increases beyond the trend line are considered, since, as far as computer file systems are concerned, drops in space utilization are not considered problems. The fileSysBehave1Compute and fileSysBehave2Compute rules function by calculating a new trend value and storing it, together with the current time, into the database. They also set a flag if the current value differs from the trend value by the specified amount. The trend values are stored in the database under the FILESYS_XP1 and FILESYS_XP2 entity/property combinations. The flags are stored in the database under the FILESYS_FL1 and FILESYS_FL2 entity/property combinations. The fileSysBehave1Test and fileSysBehave2Test rules call addAlert if the time for which the flag has been set exceeds the specified time period. The following actions are available to respond to problems detected by the filesys program:
TABLE 16
__________________________________________________________________________
Action     Description
__________________________________________________________________________
fsrecom    Analyzes a specified file system by traversing the entire file
           system and gathering the following information: names of the 10
           largest files, names of the 10 largest directories, the
           processes using each file, the percentage of the file system
           each file utilizes, names of all non-device files in the /dev
           directory, and names of all junk files, log files, and error
           files on the file system. The files which constitute junk
           files, log files, and error files are defined in the
           configuration. The information gathered by the fsrecom action
           is stored into the database.
rmjunk     Queries the database for a list of junk files produced by the
           fsrecom action (see above), and removes all the junk files
           retrieved from the database.
rmoldjunk  Virtually the same as rmjunk (above), but only removes those
           junk files whose modification time is at least 2 hours behind
           the clock time when the rmoldjunk action is initiated.
__________________________________________________________________________
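The age test that distinguishes rmoldjunk from rmjunk can be sketched as follows. This Python example is an illustration only: it takes the junk file list as a plain argument, whereas the actions above obtain it from the database entries produced by fsrecom, and the function names are hypothetical.

```python
import os
import time

TWO_HOURS = 2 * 60 * 60  # minimum age, in seconds, from the rmoldjunk description


def old_junk(paths, now=None, min_age=TWO_HOURS):
    """Return the existing files in `paths` last modified at least min_age ago."""
    now = time.time() if now is None else now
    return [p for p in paths
            if os.path.isfile(p) and now - os.path.getmtime(p) >= min_age]


def rm_old_junk(paths, now=None):
    """Remove the old junk files and return the list of removed paths."""
    removed = []
    for path in old_junk(paths, now):
        os.remove(path)
        removed.append(path)
    return removed
```

Fixing `now` once, when the action is initiated, matches the description above, in which file ages are measured against the clock time at initiation rather than at the moment each file is examined.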
The filesys program contains, for example, the following database declarations:
TABLE 17
__________________________________________________________________________
ENTITY   PROPERTY    TYPE     DESCRIPTION
__________________________________________________________________________
FILESYS  MOUNTPOINT  string   Mount point or directory name that the file
                              system is mounted onto.
FILESYS  FSTYPE      string   File system type.
FILESYS  MOUNTED     boolean  Is the file system mounted?
FILESYS  OPTIONS     string   Describes options that the file system may be
                              mounted with.
FILESYS  SPACETOTAL  integer  Kilobytes of file system space total,
                              including space reserved by root.
FILESYS  SPACEUSED   integer  Kilobytes of file system space used.
FILESYS  SPACEAVAIL  integer  Kilobytes of file system space available to
                              users. This number does NOT include any space
                              in reserve for root.
FILESYS  SPACEFREE   integer  Kilobytes of file system space free, including
                              space reserved for root.
FILESYS  SPACEPERC   float    Percentage space used, excluding the root
                              reserve.
FILESYS  FILEUSED    integer  Number of inodes/files used.
FILESYS  FILESFREE   integer  Number of inodes/files free.
FILESYS  FILESTOTAL  integer  Total number of inodes/files.
FILESYS  FILESPERC   integer  Percentage of total inodes used.
FILESYS  XP1         float    Historical trend value for the FSBEHAVE1
                              problem.
FILESYS  XP2         float    Historical trend value calculated with a
                              recursive average filter for the FSBEHAVE2
                              problem.
FILESYS  FL1         integer  Variation flag used in the FSBEHAVE1 problem.
FILESYS  FL2         integer  Variation flag used in the FSBEHAVE2 problem.
__________________________________________________________________________
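The space properties above are related in the same way as the columns of the UNIX df command. The following Python sketch derives them from raw block counts; it is only an illustration under stated assumptions. The function name and the dictionary layout are hypothetical, and the percentage is computed as used space over used plus user-available space, which is one reading of "excluding the root reserve".

```python
def filesys_space(block_size, blocks_total, blocks_free, blocks_avail):
    """Derive the FILESYS space properties (in kilobytes) from block counts.

    blocks_free includes the blocks reserved for root;
    blocks_avail counts only the blocks ordinary users may allocate.
    """
    kb_per_block = block_size / 1024.0
    total = int(blocks_total * kb_per_block)
    free = int(blocks_free * kb_per_block)
    avail = int(blocks_avail * kb_per_block)
    used = total - free
    usable = used + avail  # space visible to users, root reserve left out
    perc = 100.0 * used / usable if usable else 0.0
    return {
        "SPACETOTAL": total,
        "SPACEUSED": used,
        "SPACEAVAIL": avail,
        "SPACEFREE": free,
        "SPACEPERC": perc,
    }
```

On a UNIX host the four inputs could be taken from Python's `os.statvfs(path)` result (`f_frsize`, `f_blocks`, `f_bfree`, `f_bavail`).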
No routines are declared in filesys.
TABLE 18
__________________________________________________________________________
Rule Name
    Initialization, Condition, Then-Action, Else-Action
__________________________________________________________________________
fileSysUpdate
    Initialization: Sets state to DATA, gets rule interval from
                    configuration.
    Condition:      (always true)
    Then-Action:    Gathers information on file systems.
    Else-Action:    N/A
fileSysAbsMin
    Initialization: Sets state to EXCEPT, gets rule interval from
                    configuration.
    Condition:      If file system percentages have been updated since
                    the last time this rule was checked, and there are
                    file systems in the database.
    Then-Action:    Checks database for file systems which meet the
                    FSABSMIN problem criteria. For each problem
                    detected, posts an alert to the alert mechanism.
    Else-Action:    N/A
fileSysAlertFull
    Initialization: Sets state to EXCEPT, sets ONCE to false, gets rule
                    interval from configuration.
    Condition:      If file system percentages have been updated since
                    the last time this rule was checked, and there are
                    file systems in the database.
    Then-Action:    Checks database for file systems which meet the
                    FSALERT problem criteria. For each problem
                    detected, posts an alert to the alert mechanism.
    Else-Action:    N/A
fileSysWarnFull
    Initialization: Sets state to EXCEPT, gets rule interval from
                    configuration.
    Condition:      If file system percentages have been updated since
                    the last time this rule was checked, and there are
                    file systems in the database.
    Then-Action:    Checks database for file systems which meet the
                    FSWARN problem criteria. For each problem
                    detected, posts an alert to the alert mechanism.
    Else-Action:    N/A
fileSysFYIFull
    Initialization: Sets state to EXCEPT, gets rule interval from
                    configuration.
    Condition:      If file system percentages have been updated since
                    the last time this rule was checked, and there are
                    file systems in the database.
    Then-Action:    Checks database for file systems which meet the
                    FSFYI problem criteria. For each problem
                    detected, posts an alert to the alert mechanism.
    Else-Action:    N/A
fileInodeALERT
    Initialization: Sets state to EXCEPT, gets rule interval from
                    configuration.
    Condition:      If file system percentages have been updated since
                    the last time this rule was checked, and there are
                    file systems in the database.
    Then-Action:    Checks database for file systems which meet the
                    FSINODEALERT problem criteria. For each problem
                    detected, posts an alert to the alert mechanism.
    Else-Action:    N/A
fileInodeWarn
    Initialization: Sets state to EXCEPT, gets rule interval from
                    configuration.
    Condition:      If file system percentages have been updated since
                    the last time this rule was checked, and there are
                    file systems in the database.
    Then-Action:    Checks database for file systems which meet the
                    FSINODEWARN problem criteria. For each problem
                    detected, posts an alert to the alert mechanism.
    Else-Action:    N/A
fileInodeFYI
    Initialization: Sets state to EXCEPT, gets rule interval from
                    configuration.
    Condition:      If file system percentages have been updated since
                    the last time this rule was checked, and there are
                    file systems in the database.
    Then-Action:    Checks database for file systems which meet the
                    FSINODEFYI problem criteria. For each problem
                    detected, posts an alert to the alert mechanism.
    Else-Action:    N/A
fileSysBehave1Compute
    Initialization: Sets state to DATA2, gets rule interval from
                    configuration.
    Condition:      TRUE
    Then-Action:    Computes the historical trend value using the
                    recursive average filter and stores the results in
                    the database under a record of type FILESYS_XP1.
    Else-Action:    N/A
fileSysBehave1Test
    Initialization: Sets state to EXCEP, gets rule interval from
                    configuration.
    Condition:      If the FL1 flag for a file system has been set for a
                    time period exceeding the applicable time period.
    Then-Action:    Adds an FSBEHAVE1 alert.
    Else-Action:    N/A
fileSysBehave2Compute
    Initialization: Sets state to DATA2, gets rule interval from
                    configuration.
    Condition:      TRUE
    Then-Action:    Computes the historical trend value using the
                    recursive average filter and stores the results in
                    the database under a record of type FILESYS_XP2.
    Else-Action:    N/A
fileSysBehave2Test
    Initialization: Sets state to EXCEP, gets rule interval from
                    configuration.
    Condition:      If the FL2 flag for a file system has been set for a
                    time period exceeding the applicable time period.
    Then-Action:    Adds an FSBEHAVE2 alert.
    Else-Action:    N/A
__________________________________________________________________________
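The table names a "recursive average filter" for the historical trend values but does not give its formula. One common form, shown here purely as an assumption, is the exponentially weighted recursive average, in which each new sample is blended into the previous trend value. The weight `alpha`, both function names, and the way the flag duration is measured are all hypothetical.

```python
def recursive_average(previous, sample, alpha=0.2):
    """One step of a recursive average: blend the new sample into the trend.

    previous is the stored trend value (None before the first sample);
    alpha is an assumed smoothing weight between 0 and 1.
    """
    if previous is None:
        return float(sample)
    return (1.0 - alpha) * previous + alpha * sample


def flag_duration_exceeded(flag_set_at, now, limit_seconds):
    """True when a variation flag has been set for longer than limit_seconds.

    flag_set_at is the time the flag was raised, or None if it is clear.
    """
    return flag_set_at is not None and (now - flag_set_at) > limit_seconds
```

Because the filter needs only the previous trend value and the latest sample, a Compute rule can keep a single float per file system rather than a history of samples.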
9. files: A program which simply contains the following database declarations, which are used in tracking files and file systems.
TABLE 19
__________________________________________________________________________
ENTITY      PROPERTY      TYPE     DESCRIPTION
__________________________________________________________________________
FILES       DIRECTORY     string   Name of directory containing a file, or
                                   the directory name if the record
                                   describes a directory.
FILES       FILENAME      string   Name of file or directory without its
                                   path.
FILES       FILESYS       string   Name of file system containing file or
                                   directory.
FILES       SIZE          integer  Size in bytes of a file or the sum of
                                   files in a directory.
FILES       LINKS         integer  Number of links to a file or directory.
FILES       FSPERC        float    Percentage of file system size.
FILES       MODE          string   File mode and permissions.
FILES       FILETYPE      string   File types: FILE, DIR, LINK, OTHER.
FILES       UID           integer  Owner's UID (user id number).
FILES       OWNER         string   Owner name.
FILES       GID           integer  Owner's GID (group id number).
FILES       GROUP         string   Group name.
FILES       ACCESSTIME    integer  File/directory access time.
FILES       MODTIME       integer  File/directory last modification time.
FILES       PROCID        string   Process ids that are accessing the file
                                   as determined by the command fuser.
FILES       PROCUSER      string   Process user names that are accessing
                                   the file as determined by the command
                                   fuser.
FILES       PROCCOMMAND   string   Command name of first process on the
                                   list.
FILES       DIRENTRIES    string   Number of directory entries in a
                                   directory.
FILES       DIRTREESIZE   integer  Sum of all file sizes in bytes in a
                                   directory tree.
FILES       TIMEOUT       integer  Time at which data should be erased.
FILES       COMMENT       string   Free form list: used primarily by the
                                   file system recommendation action to
                                   store class of problem file.
FILECHANGE  DIRECTORY     string   Name of directory containing file, or
                                   the directory name if the record
                                   describes a directory.
FILECHANGE  FILENAME      string   Name of file or directory without its
                                   path.
FILECHANGE  FILESYS       string   Name of file system containing file or
                                   directory.
FILECHANGE  SIZE          integer  Size of a file or the sum of files in a
                                   directory.
FILECHANGE  FSPERC        float    Percentage of file system size.
FILECHANGE  MODE          string   File mode and permissions.
FILECHANGE  FILETYPE      string   File types: FILE, DIR, LINK, OTHER.
FILECHANGE  UID           integer  Owner's UID (user id number).
FILECHANGE  OWNER         string   Owner's name.
FILECHANGE  GID           integer  Owner's GID (group id number).
FILECHANGE  GROUP         string   Owner's group name.
FILECHANGE  CREATETIME    integer  File/directory create time.
FILECHANGE  MODTIME       integer  File/directory last modification time.
FILECHANGE  PROCID        string   Process ids that are accessing a file as
                                   determined by the command fuser.
FILECHANGE  PROCUSER      string   Process user names that are accessing a
                                   file as determined by the command fuser.
FILECHANGE  PROCCOMMAND   string   Command name of first process on the
                                   list.
FILECHANGE  DIRENTRIES    integer  Directory entries/modes.
FILECHANGE  DIRSIZE       integer  Sum of all file sizes in a directory.
FILECHANGE  DIRTREESIZE   integer  Sum of all file sizes in a directory
                                   tree.
FILECHANGE  TIMEOUT       integer  Time at which data should be erased.
FILECHANGE  COMMENT       string   Free form field: used primarily by the
                                   file system recommendation action to
                                   store class of problem files.
FILECHANGE  STARTSIZE     integer  File size at beginning of measurement.
FILECHANGE  RATEINCREASE  integer  Rate of increase: (current size - start
                                   size)/timedelt/60.
__________________________________________________________________________
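The RATEINCREASE entry can be read as the growth in file size since the start of measurement, divided by the elapsed time and then by 60. The Python sketch below follows that literal reading. The function name is hypothetical, and the source does not state the units of timedelt, so the units of the result are left open here as well.

```python
def rate_increase(current_size, start_size, timedelt):
    """Literal reading of RATEINCREASE: (current - start) / timedelt / 60.

    Returns an integer because the property is declared with type integer.
    """
    if timedelt <= 0:
        return 0  # no elapsed time yet, so no rate can be computed
    return int((current_size - start_size) / timedelt / 60)
```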
10. swap: A program which contains database declarations, a routine declaration, and rules used by SYSTEMWatch AI-L to monitor the virtual memory swap file for problems. The swap program, for example, detects the following virtual memory problems:
TABLE 20
__________________________________________________________________________
Problem      Description                               Available Actions
__________________________________________________________________________
SWAPFYI      Swap space is up to 85% capacity.         addswap, tmpshutdown
SWAPWARN     Swap space is up to 90% capacity.         addswap, tmpshutdown
SWAPALERT    Swap space is up to 95% capacity.         addswap, tmpshutdown
SWAPABSMIN1  Available swap space is less than 5 Mb.   addswap, tmpshutdown
SWAPABSMIN2  Available swap space is less than 2 Mb.   addswap, tmpshutdown
__________________________________________________________________________
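The five swap problems above amount to three percentage thresholds and two absolute minimums. The following Python sketch checks a swap reading against all five and returns every problem whose criterion is met. The function name, the dictionary of limits, and the choice of "at or above" for the percentage tests are assumptions made for illustration.

```python
# Default limits taken from the table above; percentages of swap in use,
# and minimum available swap converted from megabytes to kilobytes.
DEFAULT_SWAP_LIMITS = {
    "SWAPFYI": 85.0,
    "SWAPWARN": 90.0,
    "SWAPALERT": 95.0,
    "SWAPABSMIN1": 5 * 1024,
    "SWAPABSMIN2": 2 * 1024,
}


def swap_problems(swap_perc, swap_avail_kb, limits=DEFAULT_SWAP_LIMITS):
    """Return the names of all swap problems met by the given reading."""
    found = []
    for name in ("SWAPFYI", "SWAPWARN", "SWAPALERT"):
        if swap_perc >= limits[name]:
            found.append(name)
    for name in ("SWAPABSMIN1", "SWAPABSMIN2"):
        if swap_avail_kb < limits[name]:
            found.append(name)
    return found
```

Keeping the limits in a dictionary mirrors the idea that each threshold is read from configuration rather than fixed in the rule.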
The following actions are available to respond to problems detected by the swap program:
TABLE 21
__________________________________________________________________________
Action Description
__________________________________________________________________________
addswap
Increases the amount of swap space available on the system by a
two-step process. First, addSwap creates a large file by using the UNIX
command mkfile. Then, addSwap incorporates that file into the virtual
memory system by using the UNIX command swapon, which lets the UNIX
operating system use the newly created file as swap space.
addSwap attempts to create sufficient additional swap space so
that at most 80% of the augmented swap space is used.
tmpshutdown
Shuts down the SYSTEMWatch AI-L client and console by causing the
SYSTEMWatch AI-L client and the SYSTEMWatch AI-L console to exit their
main loop.
cleanswap
Deletes the files added by the addswap action (above).
__________________________________________________________________________
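The 80% goal stated for addswap fixes how much swap must be added: the augmented total has to be at least the used amount divided by 0.8. A minimal Python sketch of that arithmetic follows; the function name and parameters are hypothetical, and the sketch covers only the sizing step, not the creation or activation of the swap file.

```python
def extra_swap_needed_kb(used_kb, total_kb, target_percent=80):
    """Smallest amount of extra swap (kb) so that used swap is at most
    target_percent of the augmented total.

    Integer arithmetic is used so the result is rounded up exactly.
    """
    # Ceiling of used_kb * 100 / target_percent without floating point.
    required_total = -(-used_kb * 100 // target_percent)
    return max(0, required_total - total_kb)
```

For example, with 950 kb used out of 1000 kb, the augmented total must reach 1188 kb, so 188 kb would be added.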
The following database declarations are made in swap:
TABLE 22
__________________________________________________________________________
ENTITY    PROPERTY   TYPE     DESCRIPTION
__________________________________________________________________________
SWAPSTAT  SWAPUSED   integer  Number of kb of swap space in use. E.g.: the
                              USED value of the UNIX command pstat -s.
SWAPSTAT  SWAPAVAIL  integer  Number of kb of swap space available. E.g.:
                              the AVAILABLE value of the UNIX command
                              pstat -s.
SWAPSTAT  SWAPPERC   float    Percentage of swap space in use. E.g.:
                              USED/(USED + AVAILABLE) from the UNIX
                              command pstat -s.
SWAPSTAT  SWAPTOTAL  integer  Number of kb of swap space total. E.g.: the
                              USED + AVAILABLE values from the UNIX
                              command pstat -s.
__________________________________________________________________________
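Given the USED and AVAILABLE figures, the remaining SWAPSTAT properties follow by simple arithmetic, taking the total as USED + AVAILABLE as the SWAPTOTAL entry does. The Python sketch below builds such a record; the function name and dictionary layout are assumptions, and parsing the output of pstat -s is deliberately left out.

```python
def swapstat_record(used_kb, avail_kb):
    """Build a SWAPSTAT record from the used and available swap, in kb."""
    total_kb = used_kb + avail_kb
    # Percentage of total swap in use; 0.0 when no swap is configured.
    perc = 100.0 * used_kb / total_kb if total_kb else 0.0
    return {
        "SWAPUSED": used_kb,
        "SWAPAVAIL": avail_kb,
        "SWAPPERC": perc,
        "SWAPTOTAL": total_kb,
    }
```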
The routines declared in the swap program are the following:
TABLE 23
__________________________________________________________________________
NAME     FUNCTION
__________________________________________________________________________
getSwap  Gathers swap space information by calling the readSwap function,
         and places the information returned by the readSwap function
         into the database.
__________________________________________________________________________
These are the rules declared in swap:
TABLE 24
__________________________________________________________________________
Rule Name
    Initialization, Condition, Then-Action, Else-Action
__________________________________________________________________________
swapUpdate
    Initialization: Sets state to DATA, gets interval from configuration.
    Condition:      (always true)
    Then-Action:    Calls the getSwap routine.
    Else-Action:    N/A
swapAbsMin2
    Initialization: Sets state to EXCEP, gets interval from
                    configuration, gets SWAPABSMIN2 limit from
                    configuration.
    Condition:      If available swap is less than the limit.
    Then-Action:    Posts a SWAPABSMIN2 alert to the alert system.
    Else-Action:    N/A
swapAbsMin1
    Initialization: Sets state to EXCEP, gets interval from
                    configuration, gets SWAPABSMIN1 limit from
                    configuration.
    Condition:      If available swap is less than the limit.
    Then-Action:    Posts a SWAPABSMIN1 alert to the alert system.
    Else-Action:    N/A
swapAlert
    Initialization: Sets state to EXCEP, gets interval from
                    configuration, gets SWAPALERT limit from
                    configuration.
    Condition:      If available swap is less than the limit.
    Then-Action:    Posts a SWAPALERT alert to the alert system.
    Else-Action:    N/A
swapWarn
    Initialization: Sets state to EXCEP, gets interval from
                    configuration, gets SWAPWARN limit from
                    configuration.
    Condition:      If available swap is less than the limit.
    Then-Action:    Posts a SWAPWARN alert to the alert system.
    Else-Action:    N/A
swapFYI
    Initialization: Sets state to EXCEP, gets interval from
                    configuration, gets SWAPFYI limit from
                    configuration.
    Condition:      If available swap is less than the limit.
    Then-Action:    Posts a SWAPFYI alert to the alert system.
    Else-Action:    N/A
__________________________________________________________________________
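Each rule above pairs an initialization step (state, interval, and limit read from configuration) with a condition and a then-action. A minimal Python sketch of that structure, using swapAbsMin2 as the example, might look as follows. The Rule class, the configuration key names, and the plain list standing in for the alert system are all hypothetical; the section does not describe how rules are actually represented or scheduled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    state: str       # e.g. "DATA" or "EXCEP", set during initialization
    interval: int    # time between checks, read from configuration
    condition: Callable[[Dict], bool]
    then_action: Callable[[Dict, List[str]], None]

    def fire(self, db: Dict, alerts: List[str]) -> bool:
        """Run the then-action if the condition holds; report whether it ran."""
        if self.condition(db):
            self.then_action(db, alerts)
            return True
        return False


def make_swap_abs_min2(config: Dict) -> Rule:
    """Initialization step: read the interval and limit from configuration."""
    limit_kb = config["SWAPABSMIN2_LIMIT_KB"]  # assumed key name
    return Rule(
        name="swapAbsMin2",
        state="EXCEP",
        interval=config["SWAPABSMIN2_INTERVAL"],  # assumed key name
        condition=lambda db: db["SWAPAVAIL"] < limit_kb,
        then_action=lambda db, alerts: alerts.append("SWAPABSMIN2"),
    )
```

Reading the limit once at initialization matches the table, where each rule fetches its limit before any condition is evaluated.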
11. process: A program which contains database declarations, routines, and rules used by SYSTEMWatch AI-L to monitor processes on the computer. The process program detects, for example, the following process problems:
TABLE 25
__________________________________________________________________________
Problem   Description                                          Available Actions
__________________________________________________________________________
PROCCPU1  A process is using 30% of the CPU time and the       kill, stoptmp, stopload, nice5, nice10,
          system load average has reached 2.5                  nice15, nice20, schedule10, schedule25,
                                                               schedule50, scheduleVIP10,
                                                               scheduleVIP25, scheduleVIP50
PROCCPU2  A process is using 15% of the CPU time and the       kill, stoptmp, stopload, nice5, nice10,
          system load average has reached 5.0                  nice15, nice20, schedule10, schedule25,
                                                               schedule50, scheduleVIP10,
                                                               scheduleVIP25, scheduleVIP50
PROCCPU3  A process is using 10% of the CPU time and the       kill, stoptmp, stopload, nice5, nice10,
          system load average has reached 7.5                  nice15, nice20, schedule10, schedule25,
                                                               schedule50, scheduleVIP10,
                                                               scheduleVIP25, scheduleVIP50
PROCMEM1  A process is using 40% of the swap space and the     kill, stoptmp, stopload, nice5, nice10,
          virtual memory system is using 80% of the available  nice15, nice20, schedule10, schedule25,
          swap space.                                          schedule50, scheduleVIP10,
                                                               scheduleVIP25, scheduleVIP50
PROCMEM2  A process is using 60% of the swap space and the     kill, stoptmp, stopload, nice5, nice10,
          virtual memory system is using 80% of the available  nice15, nice20, schedule10, schedule25,
          swap space.                                          schedule50, scheduleVIP10,
                                                               scheduleVIP25, scheduleVIP50
PROCMEM3  A process is using 80% of the swap space and the     kill, stoptmp, stopload, nice5, nice10,
          virtual memory system is using 80% of the available  nice15, nice20, schedule10, schedule25,
          swap space.                                          schedule50, scheduleVIP10,
                                                               scheduleVIP25, scheduleVIP50
__________________________________________________________________________
Each threshold value in the above table (each percentage and each load
average) is a default value, which the system administrator can change on
either a computer-specific basis or a network-wide basis via the
configuration mechanism, as described above in the section on the config
program.
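Each process problem above combines a per-process threshold with a system-wide threshold, and both must be reached. The following Python sketch evaluates one process against the default values from the table; the limits are passed in as dictionaries so that they can be overridden, in the spirit of the configuration mechanism. The function name, the parameter names, and the use of "at or above" comparisons are assumptions.

```python
# (per-process percent, system-wide limit) pairs from the table above.
DEFAULT_CPU_LIMITS = {
    "PROCCPU1": (30.0, 2.5),
    "PROCCPU2": (15.0, 5.0),
    "PROCCPU3": (10.0, 7.5),
}
DEFAULT_MEM_LIMITS = {
    "PROCMEM1": (40.0, 80.0),
    "PROCMEM2": (60.0, 80.0),
    "PROCMEM3": (80.0, 80.0),
}


def process_problems(cpu_perc, load_avg, proc_swap_perc, system_swap_perc,
                     cpu_limits=DEFAULT_CPU_LIMITS,
                     mem_limits=DEFAULT_MEM_LIMITS):
    """Return the names of all process problems met by one process."""
    found = []
    for name, (proc_limit, load_limit) in cpu_limits.items():
        if cpu_perc >= proc_limit and load_avg >= load_limit:
            found.append(name)
    for name, (proc_limit, swap_limit) in mem_limits.items():
        if proc_swap_perc >= proc_limit and system_swap_perc >= swap_limit:
            found.append(name)
    return found
```

Note that the CPU thresholds fall as the load average rises, so a moderately busy process is flagged only when the system as a whole is heavily loaded.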
The following actions are available to respond to problems detected by the process program:
TABLE 26
__________________________________________________________________________
Action Description
__________________________________________________________________________
kill Kills the specified process by sending the process the UNIX kill
signal.
stoptmp
Stops the specified process for a specified period of time by first
sending the process a UNIX STOP signal, and then sending the process a
UNIX CONTINUE signal after the specified period of time has elapsed.
stopload
Stops the specified process until the 1 minute system load average
drops beneath a specified load, by first sending the process a UNIX STOP
signal and then, when the system load drops to the specified limit,
sending the process a UNIX CONTINUE signal.
nice5 Sets the specified process' nice value to 5.
nice10 Sets the specified process' nice value to 10.
nice15 Sets the specified process' nice value to 15.
nice20 Sets the specified process' nice value to 20.
schedule10
Reschedules a process so that it runs approximately 10% of the
time. Schedule10 queries the database periodically to ascertain what
percentage of the CPU the specified process is consuming. If the process
uses more than the goal percent CPU consumption, it is reniced such
that it uses less CPU resources. If the process uses less than the
goal percent CPU consumption, it is reniced so that it uses more CPU
resources. This action only uses non-privileged calls to renice.
schedule25
Similar to schedule10, except the percent CPU goal is 25% instead
of 10%.
schedule50
Similar to schedule10, except the percent CPU goal is 50% instead
of 10%.
scheduleVIP10
Similar to schedule10, except this action can utilize privileged
calls to renice as well as the
normal n | ||||||
