MON Help on Service Definitions
This is the second and last stage of MON configuration.
Default values are shown for the mandatory services. See the respective help topic below for more help on the Service Definitions.
For "mail.alert", ensure that sendmail is configured and the "sendmail" daemon is started on the host machine.
Service Definitions
- service servicename
-
A service definition begins with the keyword
service
followed by a word which is the tag for this service.
The components of a service are an interval, monitor, and
one or more time period definitions, as defined below.
If a service name of "default" is defined within a watch
group called "default" (see above), then the default/default
definition will be used for handling unknown mon traps.
- interval timeval
-
The keyword
interval
followed by a time value specifies the frequency that
a monitor script will be triggered.
Time values are defined as "30s", "5m", "1h", or "1d",
meaning 30 seconds, 5 minutes, 1 hour, or 1 day. The numeric portion
may be a fraction, such as "1.5h" for an hour and a half. This
format of a time specification will be referred to as
timeval.
- traptimeout timeval
-
This keyword takes the same time specification argument as
interval,
and makes the service expect a trap from an external source
at least that often, else a failure will be registered. This is
used for a heartbeat-style service.
- trapduration timeval
-
If a trap is received, the status of the service the trap was delivered
to will normally remain constant. If
trapduration
is specified, the status of the service will remain in a failure
state for the duration specified by
timeval,
and then it will be reset to "success".
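As a rough illustration of the two trap keywords, a heartbeat-style service might contain lines like the following. The service name and the time values are made up for this example, and the rest of the service definition (periods and alerts) is left out.

```
service heartbeat
    # register a failure if no trap arrives for 10 minutes
    traptimeout 10m
    # a failure state set by a trap is reset to "success" after 30 minutes
    trapduration 30m
```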
- randskew timeval
-
Rather than schedule the monitor script to run at the start of each
interval, randomly adjust the interval specified by the
interval
parameter by plus-or-minus
randskew.
The skew value is specified in the same format as the
interval
parameter: "30s", "5m", etc.
For example, if
interval
is "1m" and
randskew
is "5s", then
mon
will schedule the monitor script to run once every
55 to 65 seconds.
The intent is to help distribute the load on the server when
many services are scheduled at the same intervals.
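The example above could be written roughly as follows. The service name is illustrative, and fping.monitor stands in for whichever monitor script is actually installed.

```
service ping
    # nominal schedule: once a minute
    interval 1m
    # each run is shifted by up to 5 seconds in either direction
    randskew 5s
    monitor fping.monitor
```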
- monitor monitor-name [arg...]
-
The keyword
monitor
followed by a script name and arguments
specifies the monitor to run when the timer
expires. Shell-like quoting conventions are
followed when specifying the arguments to send
to the monitor script.
The script is invoked from the directory
given with the
-s
argument, and all following words are supplied
as arguments to the monitor program, followed by the
list of hosts in the group referred to by the current watch group.
If the monitor line ends with ";;" as a separate word,
the host groups are not appended to the argument list
when the program is invoked.
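A sketch of the two variants follows. The script name check_web.monitor and its "-t 10" argument are hypothetical and only serve to show where the host list goes. The two lines are alternatives, not meant to appear together in one service.

```
# host list appended: the script receives "-t 10" followed by the hosts
# of the current watch group
monitor check_web.monitor -t 10

# host list suppressed by the trailing ";;": the script receives only "-t 10"
monitor check_web.monitor -t 10 ;;
```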
- allow_empty_group
-
The
allow_empty_group
option will allow a monitor to be invoked even when the
hostgroup for that watch is empty because of
disabled hosts. The default behavior is not
to invoke the monitor when all hosts in a hostgroup
have been disabled.
- description descriptiontext
-
The text following
description
can be queried by client programs and is passed to alerts and monitors via an
environment variable. It should contain a brief description of the
service, suitable for inclusion in an email or on a web page.
- exclude_hosts host [host...]
-
Any hosts listed after
exclude_hosts
will be excluded from the service check.
- exclude_period periodspec
-
Do not run a scheduled monitor during the time
identified by
periodspec.
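For instance, a maintenance window could be excluded as sketched below. This assumes periodspec uses the Time::Period syntax referred to under Period Definitions. The window itself is an invented example.

```
service ping
    interval 5m
    monitor fping.monitor
    # do not run the scheduled monitor during the early hours of Sunday
    exclude_period wd {Sun} hr {2am-4am}
```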
- depend dependexpression
-
The
depend
keyword is used to specify a dependency expression, which
evaluates to either true or false, in the boolean sense.
Dependencies are actual Perl expressions, and must obey all syntactical
rules. The expressions are evaluated in their own package space to
avoid unwanted side effects.
If a syntax error is found when evaluating the expression, it
is logged via syslog.
Before evaluation, the following substitutions on the expression occur:
phrases which look like "group:service" are substituted with the value
of the current operational status of that specified service. These
opstatus substitutions are computed recursively, so if service A
depends upon service B, and service B depends upon service C, then
service A depends upon service C. Successful operational statuses (which
evaluate to "1") are "STAT_OK", "STAT_COLDSTART", "STAT_WARMSTART", and
"STAT_UNKNOWN". The word "SELF" (in all caps) can be used for the group
(e.g. "SELF:service"), and is an abbreviation for the current watch group.
This feature can be used to control alerts for services which are
dependent on other services, e.g. an SMTP test which is dependent upon
the machine being ping-reachable.
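The SMTP-on-ping case could look something like this. The service names and the "routers" watch group are assumptions made for the example. The expression is ordinary Perl, so "&&" combines the two substituted status values.

```
service ping
    interval 5m
    monitor fping.monitor
service smtp
    interval 10m
    monitor smtp.monitor
    # true only while the ping service of this watch group and the ping
    # service of the (hypothetical) "routers" group are both successful
    depend SELF:ping && routers:ping
```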
- dep_behavior {a|m}
-
The evaluation of dependency graphs
can control the
suppression of either alert or monitor invocations.
Alert suppression.
If this option is set to "a",
then the dependency expression
will be evaluated after the
monitor for the service exits or
after a trap is received.
An alert will only be sent
if the evaluation succeeds, meaning
that none of the nodes in the dependency
graph indicate failure.
Monitor suppression.
If it is set to "m",
then the dependency expression will be evaluated
before the monitor for the service is about to run.
If the evaluation succeeds, then the monitor
will be run. Otherwise, the monitor will not
be run and the status of the service will remain
the same.
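A minimal sketch of monitor suppression, reusing an illustrative SMTP service that depends on a ping service in the same watch group:

```
service smtp
    interval 10m
    monitor smtp.monitor
    depend SELF:ping
    # "m": do not run the SMTP monitor while SELF:ping is failing
    # "a" would run the monitor but hold back its alerts instead
    dep_behavior m
```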
Period Definitions
Periods are used to define the conditions which
should allow alerts
to be delivered.
- period [label:] periodspec
-
A period groups one or more alarms and variables
which control how often an alert happens when there
is a failure.
The
period
keyword has two forms. The first
takes an argument which is a
period specification from Patrick Ryan's
Time::Period Perl 5 module. Refer to
"perldoc Time::Period" for more information.
The second form requires a label followed by a period specification, as
defined above. The label is a tag consisting of an alphabetic character
or underscore followed by zero or more alphanumerics or underscores
and ending with a colon. This
form allows multiple periods with the same period definition. One use
is to have a period definition which has no
alertafter
or
alertevery
parameters for a particular time period, and another
for the same time period with a different
set of alerts that does contain those
parameters.
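The labelled form might be used as sketched here, with two periods that share one period specification. The labels and mail addresses are invented for the example.

```
# every failure during working hours is mailed immediately
period mailnow: wd {Mon-Fri} hr {9am-4pm}
    alert mail.alert mis@example.com
# same hours, but a second list is mailed only after repeated failures
# and at most once an hour
period escalate: wd {Mon-Fri} hr {9am-4pm}
    alert mail.alert oncall@example.com
    alertafter 3
    alertevery 1h
```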
- alertevery timeval
-
The
alertevery
keyword (within a
period
definition) takes the same type of argument as the
interval
variable, and limits the number of times an alert
is sent when the service continues to fail.
For example, if the interval is "1h", then
the alerts in the period section will
be triggered only once every hour. If the
alertevery
keyword is
omitted in a period entry, an alert will be sent
out every time a failure is detected. By default,
if the output of two successive failures changes,
then the alertevery interval is overridden.
If the word
"summary" is the last argument, then only the summary
output lines will be considered when comparing the
output of successive failures.
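A possible period using this keyword is shown below. The period specification and the address are placeholders.

```
period wd {Mon-Fri} hr {7am-10pm}
    alert mail.alert mis@example.com
    # at most one alert per hour while the failure persists; only the
    # summary lines are compared when checking for changed output
    alertevery 1h summary
```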
- alertafter num
-
- alertafter num timeval
-
The
alertafter
keyword (within a
period
section) has two forms: with only the "num"
argument, or with both the "num" and "timeval" arguments.
In the first form, an alert will only be invoked
after "num" consecutive failures.
In the second form,
the arguments are a positive integer followed by an interval,
as described by the
interval
variable above.
If these parameters are specified,
then the alerts for that period will only
be called after that many failures happen
within that interval. For example,
if
alertafter
is given the arguments "3 30m", then the alert will be called
if 3 failures happen within 30 minutes.
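The two forms side by side, as alternatives within a period section (the numbers are examples only):

```
# first form: alert after 3 consecutive failures
alertafter 3

# second form: alert if 3 failures happen within 30 minutes
alertafter 3 30m
```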
- numalerts num
-
This variable tells the server to call no more than
num
alerts during a
failure. The alert counter is kept on a per-period basis,
and is reset upon each success.
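For example, a period that should stop alerting after a handful of messages might look like this. The limit of 5 and the address are arbitrary choices.

```
period wd {Sun-Sat}
    alert mail.alert mis@example.com
    # call no more than 5 alerts during one failure; the counter is
    # reset when the service succeeds again
    numalerts 5
```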
- comp_alerts
-
If this option is specified, then upalerts will only be
called if a corresponding "down" alert has been called.
- alert alert [arg...]
-
A period may contain multiple alerts, which are triggered
upon failure of the service. An alert is specified with
the
alert
keyword, followed by an optional
exit
parameter, and arguments which are interpreted the same as
the
monitor
definition, but without the ";;" exception. The
exit
parameter takes the form of
exit=x
or
exit=x-y
and has the effect that the alert is only called if the
exit status of the monitor script falls within the range
of the
exit
parameter. If, for example, the alert line is
alert exit=10-20 mail.alert mis
then
mail.alert
will only be invoked with
mis
as its arguments if the monitor
program's exit value is between 10 and 20. This feature
allows you to trigger different alerts at different
severity levels (like when free disk space goes from 8% to 3%).
See the
ALERT PROGRAMS
section above for a list of the parameters mon will pass
automatically to alert programs.
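Different severity levels could be routed along these lines. The exit ranges and addresses are illustrative and depend on what the monitor script actually returns.

```
period wd {Mon-Fri} hr {7am-10pm}
    # monitor exit status in the range 10-20: mail the regular list
    alert exit=10-20 mail.alert mis@example.com
    # monitor exit status in the range 21-30: mail the on-call list
    alert exit=21-30 mail.alert oncall@example.com
```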
- upalert alert [arg...]
-
An
upalert
is the complement of an
alert.
An upalert is called when a service makes the state transition from
failure to success. The
upalert
script is called supplying
the same parameters as the
alert
script, with the addition of the
-u
parameter which is simply used to let
an alert script know that it is being called
as an upalert. Multiple upalerts may be
specified for each period definition.
Please note that the default behavior is that
an upalert will be sent
regardless of whether any prior "down" alerts
were sent, since upalerts are triggered on a state
transition. Set the per-period
comp_alerts
option to pair upalerts with "down" alerts.
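A period that pairs upalerts with "down" alerts might be sketched as follows. The address is a placeholder.

```
period wd {Mon-Fri} hr {7am-10pm}
    alert mail.alert mis@example.com
    # called on the transition from failure to success; mon adds -u itself
    upalert mail.alert mis@example.com
    # send the upalert only if a corresponding "down" alert was called
    comp_alerts
```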
- startupalert alert [arg...]
-
A
startupalert
is only called when the
mon
server starts execution.
- upalertafter timeval
-
The
upalertafter
parameter is specified as a string that
follows the syntax of the
interval
parameter ("30s", "1m", etc.), and
controls the triggering of an
upalert.
If a service comes back up after
being down for a time greater than
or equal to the value of this option, an
upalert
will be called. Use this option to prevent
upalerts from being called because of "blips" (brief outages).
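As an example, brief outages could be filtered like this. The 5-minute threshold is an arbitrary choice.

```
period wd {Mon-Fri} hr {7am-10pm}
    alert mail.alert mis@example.com
    upalert mail.alert mis@example.com
    # call the upalert only if the service was down for 5 minutes or more
    upalertafter 5m
```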