rig(1) | General Commands Manual | rig(1) |
NAME
rig - Monitor a system for events and trigger specific actions
USAGE
rig <RESOURCE OR SUBCOMMAND> [OPTIONS] <ACTIONS> [ACTION OPTIONS]
DESCRIPTION
rig is a tool to assist in troubleshooting seemingly randomly occurring events or events that occur at times that make active monitoring by a sysadmin difficult.
rig sets up detached processes, known as 'rigs', that watch a given resource for a trigger condition and, once that trigger condition is met, take the actions defined by the user.
GLOBAL OPTIONS
The following are options available to all rigs (resources).
- --delay DELAY
- Specify the number of seconds to wait after a rig is triggered before
running the configured actions. Note that the rig will still trigger and
stop all watcher threads immediately - this delay comes after thread
termination but before action execution in order to avoid a possible race
condition where multiple watcher threads could conceivably trigger during
a sufficiently high delay time.
Default: 0 seconds, meaning actions execute immediately once the rig's trigger condition is met.
- --debug
- Set logging level to debug instead of the default info level.
- --foreground
- Run the rig in the foreground, keeping stdout attached.
- --interval SECONDS
- Specify the amount of time to wait between a rig's polling cycles. Most
rigs monitor their resources in a flow of update -> compare -> wait,
where wait is simply sleeping until the next needed update. Use this
option to set how long a rig should wait/sleep before updating its
monitors again.
Default: 1, meaning update and compare once every second.
- --name NAME
- Give the rig a name, rather than generating a random one at deployment.
By default, rigs are given a randomly generated string as a name, which appears in rig info output and in the rig's socket name. This option uses the provided name instead, which may be useful for distinguishing rigs when several are deployed at one time.
- --no-archive
- Do not create a tar archive of the collected data after a rig has been
triggered.
Normally, once all data has been collected, rig will create a gzip'd tar archive under /var/tmp containing all the files created from the rig's actions - after which, the temp directory at /var/tmp/rig/<id>/ is deleted.
Using this option skips creating the archive and preserves the temp directory.
- --repeat COUNT
- Repeat certain actions COUNT number of times after the initial execution
of the action.
Unless this option is given, actions execute only once. With this option, actions that support repetition are repeated an additional COUNT times. For example, using --repeat 2 will result in repeatable actions being executed three (3) total times.
Not every action supports repetition; in fact, most do not. See each action's section for whether it can be repeated.
- --repeat-delay SECONDS
- Number of seconds to wait between repetitive executions of the same
action.
This can be useful when using an action like gcore when you want to get coredumps over a certain time period. For example, using --repeat 1 --repeat-delay 60 will give you two (2) coredumps taken one minute apart.
Defaults to one second.
- --restart COUNT
- Restart a configured rig up to COUNT number of times after being
triggered.
By default, a rig will trigger once and then terminate. Using this option, an individual rig may restart itself up to COUNT number of times, producing an additional archive of the requested data after the triggering event happens again.
Note that this is the number of times to restart, not the total number of times to run. Using a restart value of '2' means that there will be 3 total archives generated for a rig.
By default, this is set to 0, meaning terminate after the first trigger event. Use a value of '-1' to have a rig perpetually restart itself without limit.
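The counting rules for --repeat, --repeat-delay, and --restart can be summarized with a small sketch. The function names below are hypothetical and are not part of rig; they only restate the arithmetic given in the option descriptions above.

```python
import math


def total_action_runs(repeat: int = 0) -> int:
    """Initial execution plus COUNT repetitions (repeatable actions only)."""
    if repeat < 0:
        raise ValueError("repeat must be 0 or greater")
    return 1 + repeat


def max_archives(restart: int = 0) -> float:
    """Initial trigger plus up to COUNT restarts; -1 means no limit."""
    if restart == -1:
        return math.inf
    if restart < -1:
        raise ValueError("restart must be -1 or greater")
    return 1 + restart


def repeat_span_seconds(repeat: int, repeat_delay: int = 1) -> int:
    """Seconds spent waiting between the first and last run of an action."""
    return repeat * repeat_delay
```

For instance, --repeat 1 --repeat-delay 60 corresponds to total_action_runs(1) == 2 runs with 60 seconds of waiting between them, and --restart 2 corresponds to at most three archives.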
SUBCOMMANDS
- rig list
- Show a list of known existing rigs and their status. Status information is obtained by querying the socket created for that particular rig.
- rig destroy -i [ID or 'all']
- Destroy a deployed rig with id ID. If ID is 'all', destroy
all known rigs. Note that if another entity kills the pid for the running
rig, destroy will fail as the socket is no longer connected to the (now
killed) process. In this case use the --force option to clean up the
lingering socket.
Any data the rig has generated will be lost when invoking destroy.
- rig info -i [ID]
- Get detailed information on a rig. This information will include
configuration options, the entire cmdline string given to launch the rig,
as well as information on each action the rig is configured to take and
what the expected results of those actions are.
Currently, this data is written to stdout in JSON format.
- rig trigger -i [ID]
- Manually trigger rig with id ID. This will cause the specified rig
to begin executing the actions configured for it, as if the trigger
condition had been met.
Note that this is only effective on a single rig basis, so using a value of 'all' for the ID will not work.
RESOURCES
These are the system resources that rig can monitor. There may be additional manpages for specific resources. Where applicable this will be noted below.
Note that 'resources', 'monitors', and referencing 'a rig' as a distinct entity all refer to the same thing.
When creating a rig, if successful the rig's ID will be printed to console.
- logs
- Watch one or more log files and/or journald units for a specified
message. When that message is matched to any watched file or journal, the
trigger condition is met and configured actions are initiated.
The following options are available for the logs rig:
- -m|--message STRING
- Define the string that serves as the trigger condition for the rig. This
can be a regex string or an exact message. Be very careful in using the
'*' regex character as this may cause unintended behavior such as
the rig immediately triggering on the first message seen.
Note that a small amount of transformation and testing is done on the provided STRING. First, '*' characters are converted to the python-style regex match of '.*'. Then rig tests whether the provided message will regex-match itself; if that test fails, rig aborts the creation process.
Aside from the conversion noted above, regexes provided in this option must be python-style and not shell-style.
- --logfile FILE
- A comma-delimited list of files to watch. Each FILE specified will
be monitored from the current end of the file, so old entries will not set
off the rig's actions.
Default: /var/log/messages
- --no-files
- Do not monitor any log files.
- --journal UNIT
- A comma-delimited list of journal units to watch. The journal is watched
as a singular entity, and will be filtered to only read from the provided
UNIT(s). If no UNIT is specified, the whole system journal
will be monitored.
Default: 'system'
- --no-journal
- Do not monitor the journal.
- --count COUNT
- The number of times the --message string should be matched before the rig is triggered. Default: 1, meaning match on the first occurrence.
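The --message handling and the --count threshold described above can be sketched in Python as follows. This is an illustration under stated assumptions, not rig's actual code: the function names are hypothetical, and the use of re.search (rather than re.match) is a guess. Note that in this naive sketch a message that already contains '.*' would become '..*'.

```python
import re


def prepare_message(message: str) -> str:
    """Convert each '*' into the Python-style regex '.*'."""
    return message.replace("*", ".*")


def validate_message(message: str) -> str:
    """Return the converted pattern, or raise if it cannot match itself."""
    pattern = prepare_message(message)
    try:
        matched = re.search(pattern, message)
    except re.error as err:
        raise ValueError(f"invalid regex {pattern!r}: {err}") from None
    if not matched:
        raise ValueError(f"{pattern!r} does not match the provided message")
    return pattern


def should_trigger(pattern: str, lines, count: int = 1) -> bool:
    """True once the pattern has matched `count` of the given lines."""
    hits = 0
    for line in lines:
        if re.search(pattern, line):
            hits += 1
            if hits >= count:
                return True
    return False
```

A message such as 'a+b' fails the self-match test here because '+' is a regex quantifier, so the pattern matches 'ab' or 'aab' but not the literal text 'a+b'.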
- ping
- Perform a simple ongoing ping test against a specified host. Pings are
sent one at a time at a defined interval, and the response is evaluated.
Ping-type rigs may monitor for number of lost packets and/or packets
exceeding a specified RTT in milliseconds.
Packets are first evaluated for loss (including timeouts), then for RTT.
The following options are available for the ping rig:
- --host ADDRESS
- The target IP or hostname to ping. This is a required option in
order for a ping rig to be created.
During rig creation, a 'sanity check' ping is sent to the ADDRESS to ensure that it is an address that is reachable on the network and that it will respond to ICMP packets. If this sanity check fails, rig creation is aborted.
- --ping-timeout SECONDS
- Specify the number of SECONDS to allow for a ping response. If a ping encounters a timeout, then it is considered both a lost packet and a packet exceeding the RTT threshold (see --ping-ms-max and --ping-ms-count).
- --lost-count PACKETS
- Specify the number of PACKETS to accept being lost or timed-out, before
triggering the rig.
Default: 1 (trigger on the first lost packet)
- --ping-interval SECONDS
- Specify the number of SECONDS to wait between ping requests sent to the
target host.
Default: 1
- --ping-ms-max MILLISECONDS
- Specify the RTT threshold to allow for a returned ping request. If the RTT
reported by the ping command is above this value in milliseconds, it is
counted against the threshold of packets exceeding this value specified by
--ping-ms-count.
By default, this form of checking is disabled. Any integer value passed to this option will enable RTT monitoring.
- --ping-ms-count PACKETS
- Specify the number of PACKETS that may exceed the defined
--ping-ms-max RTT value before triggering the rig.
Default: 5
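The evaluation order described for ping rigs (loss first, then RTT, with a timeout counting toward both) can be sketched as a small counter. The class below is hypothetical. It assumes that a rig triggers once a counter reaches its configured value, and that timeouts only count toward the RTT counter when --ping-ms-max is set; neither detail is spelled out above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PingWatcher:
    lost_count: int = 1             # --lost-count
    ms_max: Optional[float] = None  # --ping-ms-max (None: RTT check disabled)
    ms_count: int = 5               # --ping-ms-count
    lost: int = 0
    slow: int = 0

    def record(self, rtt_ms: Optional[float]) -> bool:
        """Record one ping result; rtt_ms is None for a lost or timed-out
        packet. Returns True when the rig should trigger."""
        if rtt_ms is None:
            # A timeout is both a lost packet and a packet over the RTT limit.
            self.lost += 1
            if self.ms_max is not None:
                self.slow += 1
        elif self.ms_max is not None and rtt_ms > self.ms_max:
            self.slow += 1
        # Loss is evaluated first, then RTT.
        if self.lost >= self.lost_count:
            return True
        return self.ms_max is not None and self.slow >= self.ms_count
```

With the defaults, PingWatcher().record(None) triggers on the first lost packet, while a slow but successful reply never triggers because RTT monitoring is disabled until ms_max is set.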
- process
- Watch a single process or list of processes for state changes or resource
consumption thresholds. When the process enters the specified state or the
specified resource consumption threshold is met, the trigger condition is
met.
The following options are available for the process rig:
- --proc
- A PID or process name of processes to watch. If a process name is specified, then rig will attempt to convert this to a PID during rig creation. If multiple PIDs are found, the default behavior is to fail creation and exit. To have rig monitor all processes found for a process name, use the --all option.
- --state STATE
- The state that a process needs to be in, in order to trigger the rig. The
following is a list of supported states:
NAME        DESCRIPTION                  SHORTHAND
dead        Dead - should never be seen  'X'
disk-sleep  Uninterruptible sleep        'D' or 'UN'
running     Currently running            'R' or 'run'
sleeping    Interruptible sleep          'S' or 'sleep'
stopped     Stopped                      'T' or 'stop'
zombie      Exited, still in proc table  'Z' or 'zomb'
Users can use either the full status name, or the shorthand noted in the final column of the table above. Both the names and the shorthand values are case sensitive.
This can also be set to a "not" value by preceding one of the above state strings with an exclamation mark (!), e.g. '!sleeping' will match any state other than sleeping (S) for the process(es). Most shells will require you to quote the state string when using the '!' character.
Note that using '!running' will cause rig to not trigger against a state of 'sleeping', as generally speaking 'running' processes spend much of their time in S state, and it is assumed that triggering against such a process is not desired.
Process status is polled once every second.
- --rss INTEGER
- The amount of rss (resident set size) memory usage to use as a threshold
for triggering the rig. If the process' RSS usage goes above this value,
trigger.
The value provided here may be suffixed with K, M, or G to denote the IEC unit. Rig will convert the provided value and suffix into a value in bytes.
- --vms INTEGER
- The same as --rss but monitoring Virtual Memory Size instead.
- --memperc PERCENT
- The percentage of total system memory a process is consuming to use as a
threshold for triggering the rig. If the process' %mem meets or
exceeds this value, trigger.
PERCENT may be a whole integer or a float. When using a float, the process rig respects up to two (2) decimal places of precision. For example, using '--memperc 10.25' is the same as using '--memperc 10.25678'.
- --cpuperc PERCENT
- The percentage of CPU usage a process is consuming to use as a threshold
for triggering the rig. If the process' %cpu meets or exceeds this
value, trigger.
PERCENT may be a whole integer or a float. When using a float and monitoring for CPU usage, rig respects one (1) decimal point of precision due to how CPU usage is reported.
PERCENT may be above 100 - as CPU usage can exceed 100 when a process is running on multiple CPUs.
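The --state matching rules described above (case-sensitive names and shorthands, '!' negation, and the special case for '!running') can be sketched as follows. The names in this block are hypothetical and the logic is a reading of the text above, not rig's implementation.

```python
# Full state names and their shorthands, as listed in the --state table.
STATES = {
    "dead": ("X",),
    "disk-sleep": ("D", "UN"),
    "running": ("R", "run"),
    "sleeping": ("S", "sleep"),
    "stopped": ("T", "stop"),
    "zombie": ("Z", "zomb"),
}
ALIASES = {
    alias: name
    for name, shorts in STATES.items()
    for alias in (name, *shorts)
}


def canonical_state(value: str) -> str:
    """Map a full name or shorthand to the full name (case sensitive)."""
    try:
        return ALIASES[value]
    except KeyError:
        raise ValueError(f"unsupported state: {value!r}") from None


def state_triggers(wanted: str, current: str) -> bool:
    """True if a process in state `current` satisfies the --state value."""
    negate = wanted.startswith("!")
    target = canonical_state(wanted[1:] if negate else wanted)
    observed = canonical_state(current)
    if not negate:
        return observed == target
    # '!running' deliberately ignores processes that are merely sleeping.
    if target == "running" and observed == "sleeping":
        return False
    return observed != target
```

So state_triggers('!sleeping', 'R') is true, while state_triggers('!running', 'S') is false even though 'S' is not the running state.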
- system
- Watch the system's utilization of resources as a whole, e.g. total CPU or memory usage. When the utilization of a given resource is either exceeded or falls below the given threshold (determined as appropriate for each resource), the trigger condition is met.
The following options are available for the system rig:
- --iowait PERCENT
- The amount of %iowait as reported by the kernel to use as a threshold
value.
If exceeded, trigger the rig.
- --steal PERCENT
- The amount of %steal as reported by the kernel to use as a threshold
value.
If exceeded, trigger the rig.
- --nice PERCENT
- The amount of %nice as reported by the kernel to use as a threshold value.
If exceeded, trigger the rig.
- --guest PERCENT
- The amount of %guest as reported by the kernel to use as a threshold
value.
If exceeded, trigger the rig.
- --user
- The amount of %user as reported by the kernel to use as a threshold value.
If exceeded, trigger the rig.
- --available INTEGER
- The amount of available memory in MiB as reported by the kernel to use as
a threshold value.
If the amount of available memory falls below this threshold, trigger the rig.
- --free INTEGER
- The amount of free memory in MiB as reported by the kernel to use as a
threshold value.
If the amount of free memory falls below this threshold, trigger the rig.
- --used INTEGER
- The amount of used memory in MiB as reported by the kernel to use as a
threshold value.
If the amount of used memory exceeds this threshold, trigger the rig.
- --slab INTEGER
- The amount of slab memory in MiB as reported by the kernel to use as a
threshold value.
If the amount of slab memory exceeds this threshold, trigger the rig.
- --cpuperc PERCENT
- The amount of total CPU usage as reported by the kernel as a
percentage to use as a threshold value.
If exceeded, trigger the rig.
This value may be a whole integer or a float. Floats are precise out to one (1) decimal point.
- --memperc PERCENT
- The amount of total memory usage as reported by the kernel as a
percentage to use as a threshold value.
If exceeded, trigger the rig.
This value may be a whole integer or a float. Floats are precise out to one (1) decimal point.
- --loadavg FLOAT
- System load average as reported by the OS to use as a threshold value. If
the reported loadavg exceeds this value, trigger the rig. This option can
accept either an integer (1) or a float (1.0).
Linux returns loadavg data for the past 1, 5, and 15 minutes. The system rig will monitor only one (1) of these intervals at a time, as controlled by the --loadavg-interval option.
- --loadavg-interval [1, 5, 15]
- Which time interval the rig should monitor when watching the system's
loadavg. Only 1, 5, and 15 are accepted values for this option, as that is
what the Linux kernel returns loadavg data for.
Default: 1
- --temp INTEGER
- The temperature in Celsius rig should monitor the CPU for meeting or
exceeding.
This option takes an integer value, though temperature data is single decimal point sensitive, so a temperature of 50.9 degrees will not trigger a rig that sets this option to 51.
By default rig will monitor the first physical CPU package installed on the system. This may be changed via the --cpu-id option. Note that rig will only monitor whole packages and not individual cores, and that package temperatures reported are the highest reported temperature for any core in that package.
- --cpu-id ID
- If specified, monitor this physical CPU package. By default, rig will
monitor physical CPU package 0 - meaning the first physically installed
CPU.
When specifying an ID here, remember that in Linux CPU IDs are zero-indexed, so the first CPU will be ID 0, the second ID 1, and so forth.
Default: 0
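As an illustration of the --loadavg, --loadavg-interval, and --temp comparisons described above, here is a minimal sketch. The function names are hypothetical. It relies on os.getloadavg(), which returns the 1, 5, and 15 minute load averages as a tuple; whether rig reads load averages this way is an assumption.

```python
import os

INTERVALS = (1, 5, 15)


def loadavg_exceeded(threshold, interval=1, sample=None) -> bool:
    """Compare one load average (1, 5, or 15 minute) against a threshold."""
    if interval not in INTERVALS:
        raise ValueError("interval must be 1, 5, or 15")
    if sample is None:
        sample = os.getloadavg()  # (1 min, 5 min, 15 min)
    return sample[INTERVALS.index(interval)] > float(threshold)


def temp_reached(threshold: int, reading: float) -> bool:
    """--temp triggers when the reading meets or exceeds the threshold."""
    return reading >= threshold
```

Passing a fixed sample makes the behavior easy to check: loadavg_exceeded(1, 5, sample=(0.5, 1.2, 0.9)) is true because only the 5 minute value is examined, and temp_reached(51, 50.9) is false, matching the 50.9 degree example above.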
- filesystem
- Watch a filesystem, directory, or file for utilization changes. Currently this rig is focused on space consumption, and will trigger when the specified path or backing filesystem exceeds the defined threshold for space utilization.
The following options are available for the filesystem rig:
- --path PATH
- Specify the filesystem, directory, or file path for the rig to monitor. The location provided must exist when the rig initializes for monitoring to be supported.
- --size SIZE
- Specify the size threshold to trigger on for the provided --path. The size given must be an integer suffixed with either K, M, G, or T. The provided value will be converted to bytes.
- --fs-size SIZE
- Use this option instead of --size if you want to monitor the space
usage of the backing filesystem for --path rather than the size of
the path alone.
Similar to --size this value must be suffixed with either K, M, G, or T.
- --fs-used PERCENT
- Similar to --fs-size but instead provide a percentage value to
trigger on, when the filesystem's %used exceeds this value.
Note that using this option is ultimately the same as --fs-size as rig will convert the specified percentage into a raw bytes value to use for comparisons.
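The conversions described for --size, --fs-size, and --fs-used can be sketched as below. The helper names are hypothetical, and treating the suffixes as powers of 1024 is an assumption carried over from the process rig's IEC wording. shutil.disk_usage() is used to look up the filesystem's total size when none is supplied.

```python
import shutil

# Assumed multipliers: IEC-style powers of 1024.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def size_to_bytes(size: str) -> int:
    """Convert a value such as '10G' into bytes."""
    number, suffix = size[:-1], size[-1:]
    if suffix not in UNITS or not number.isdigit():
        raise ValueError(f"expected an integer suffixed with K, M, G, or T: {size!r}")
    return int(number) * UNITS[suffix]


def percent_to_bytes(percent: float, path: str = "/", total=None) -> int:
    """Convert a %used threshold into a raw byte count for comparisons."""
    if total is None:
        total = shutil.disk_usage(path).total
    return int(total * percent / 100)
```

With these helpers, a threshold given as a percentage and one given as a size end up as the same kind of value, which is why the two options behave alike once converted.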
ACTIONS
The following actions are supported responses to triggered rigs. These may be chained together on a single rig, so deploying multiple rigs with matching trigger conditions with single, varying actions is unnecessary.
Actions are executed based on a priority weighting system, where lower values represent a higher priority action, and those actions with lower values are executed before those with higher values. This is to allow more time-sensitive actions to be taken before those that may either take a long time to execute or are otherwise unaffected by allowing other actions to run before them. Action priority values are set by the actions directly and are currently not able to be modified by users.
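A short sketch of this priority ordering follows. The numeric weights are invented purely for illustration; rig's real values are set by the actions themselves and are not documented here. Only the relative order (gcore first, tcpdump shortly after, sosreport after time-sensitive actions) comes from the action descriptions in this section.

```python
# Hypothetical weights: lower values run earlier.
PRIORITIES = {"gcore": 1, "tcpdump": 2, "sosreport": 100}


def execution_order(actions):
    """Sort configured actions so that lower weights execute first."""
    return sorted(actions, key=PRIORITIES.__getitem__)
```

For example, execution_order(["sosreport", "tcpdump", "gcore"]) yields the list in the order gcore, tcpdump, sosreport, regardless of how the actions were given on the command line.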
- gcore
- Collect a coredump of a given process or processes using GDB's
gcore utility.
Note that this does _not_ interrupt the running process(es). Cores are saved to /tmp and will be named either core.$pid or core.$proc_name.$pid depending on whether a PID or process name was provided. This action will be executed first when a rig is triggered and multiple actions are specified.
This action supports repetition via the --repeat option.
The gcore action supports the following options:
- --gcore PROCESS
- Enables this action and takes either a PID or process name as a value. If
a process name is given, the PID is determined at rig creation. If
multiple PIDs are found for the same process name, the default behavior is
to fail rig creation. Use the --all-pids option to instead use all
PIDs discovered for a process name.
This option can be specified multiple times. For example, --gcore 12345 --gcore myprocess will generate a coredump for PID 12345 and for a process matching the name 'myprocess'.
- --all-pids
- Tells this action to collect a coredump for all PIDs found for a provided process name.
- --freeze
- Freeze the process(es) that will be core dumped by sending a SIGSTOP prior
to calling gcore on the discovered pid(s).
If successful, then rig will send a SIGCONT after the gcore execution has completed in order to thaw the process.
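The --freeze sequence described above (SIGSTOP, then gcore, then SIGCONT) might look roughly like the following. This is a sketch, not rig's code: the function name and the injectable run and kill parameters are hypothetical, and the /tmp/core prefix simply mirrors the core.$pid naming mentioned earlier. gcore's -o option sets the output prefix, to which gcore appends the PID.

```python
import os
import signal
import subprocess


def dump_core(pid, freeze=False, prefix="/tmp/core",
              run=subprocess.run, kill=os.kill):
    """Collect a coredump of `pid`, optionally freezing the process first."""
    frozen = False
    if freeze:
        kill(pid, signal.SIGSTOP)  # freeze the process before dumping
        frozen = True
    try:
        # gcore writes the core to "<prefix>.<pid>"
        run(["gcore", "-o", prefix, str(pid)], check=True)
    finally:
        if frozen:
            kill(pid, signal.SIGCONT)  # thaw the process again
    return f"{prefix}.{pid}"
```

Placing the SIGCONT in a finally block is a design choice of this sketch, so that a failed gcore run does not leave the process stopped.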
- kdump
- Generate a vmcore by triggering a kernel crash via sysrq.
Note that this action WILL cause node disruption by triggering a kernel panic to generate the vmcore. This means your system will reboot when this action is triggered.
The kdump action does not perform any configuration checks on the system's kdump installation. It is assumed that kdump has been properly configured and tested prior to using this action.
The kdump action supports the following options:
- --kdump
- Enables this action
- --sysrq INTEGER
- When the rig is deployed, if this option is set, rig will set the system's /proc/sys/kernel/sysrq to the value provided. See sysrq kernel documentation for information on what values are supported.
- sosreport
- Run an sos report after the rig has been triggered. Select plugin
enablement options as well as the --plugin-option from sos report are
supported by this rig. This action should run after any time-sensitive
actions otherwise specified by the user for a given rig.
The sosreport action supports the following options:
- --sosreport
- Enables this action
- --enable-plugins PLUGINS
- Specifically force the specified comma-delimited list of PLUGINS to be enabled.
- --plugin-option PLUGOPT
- Modify a specific plugin's runtime options. This is passed directly to sos
report as the same --plugin-option value, which should take the form
'name.option=value'. For example, to increase the podman plugin timeout
use '--plugin-option podman.timeout=600'.
If you need to pass multiple sos report plugin options, use a comma-delimited list here instead of specifying this option multiple times.
- --skip-plugins PLUGINS
- Do not run these specified plugins. Use a comma-delimited list to skip multiple plugins.
- --only-plugins PLUGINS
- Only enable these specific plugins, disable all others. Use a comma-delimited list to specify multiple plugins.
- tcpdump
- Start collecting a tcpdump when the rig is initialized, and stop the
collection when the rig triggers. This action will be triggered before
most other actions, but after the gcore action.
Note there will be a slight delay in configuring any rig that uses the tcpdump action as rig must verify that the tcpdump process started successfully during the initialization process.
The tcpdump action supports the following options:
- --tcpdump
- Enables this action
- --iface INTERFACE
- Starts the tcpdump to monitor the provided INTERFACE. In almost all
situations this should likely be set to a specific interface on the
system, however the value of 'any' is accepted by the tcpdump command in
order to listen on all interfaces. Be wary of using this however, as use of
'any' makes it impossible to determine which interface a
particular packet came in on in the resulting packet capture.
Default: eth0
- --filter FILTER
- Provide a filter to use with tcpdump in order to reduce the amount of
traffic recorded in the packet capture. This value is passed directly to
the tcpdump utility, and thus can be any valid filter accepted by tcpdump.
For most shells you must quote the filter string for rig to pass it correctly.
- --snaplen LENGTH, --snapshot-length LENGTH
- Set the snapshot length for the packet capture. Captured packets are truncated to LENGTH bytes. The default value of 0 implies a LENGTH of 262144 bytes.
- --dump-size SIZE
- Limit the size of the packet capture file(s) to SIZE in MB.
Default: 10
- --captures CAPTURES
- Specify the number of packet capture files to keep. If more than one (1),
then tcpdump will rotate the packet capture file when it reaches the
--dump-size value and keep CAPTURES number of files.
For example, using a CAPTURES of 2 and a --dump-size of 5, when the rig terminates you will have up to two 5 MB packet captures.
Default: 1 (packet capture file is replaced upon reaching SIZE limit).
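One plausible way the tcpdump action's options could map onto a tcpdump command line is sketched below. The mapping is an assumption made for illustration; rig's actual invocation may differ. The tcpdump flags themselves are real: -i (interface), -s (snapshot length), -C (file size limit), -W (number of files to keep), and -w (output file). The function and parameter names are hypothetical.

```python
def tcpdump_argv(outfile, iface="eth0", capture_filter=None,
                 snaplen=0, dump_size=10, captures=1):
    """Build a tcpdump argument list from rig-style option values."""
    argv = [
        "tcpdump",
        "-i", iface,           # --iface
        "-s", str(snaplen),    # --snaplen (0 means tcpdump's 262144 default)
        "-C", str(dump_size),  # --dump-size
        "-W", str(captures),   # --captures
        "-w", outfile,
    ]
    if capture_filter:
        # The filter is passed through as a single expression argument.
        argv.append(capture_filter)
    return argv
```

Building the command as a list rather than a shell string means a filter such as 'port 443 and host 10.0.0.1' needs no extra quoting once it has reached the program.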
- monitor
- While a rig is running, monitor various system statistics and record them
for later review. These statistics may be file contents or command
outputs.
This action begins collecting information when the rig is started, and stops when the rig is triggered.
By default, networking-centric information is monitored via commands such as netstat, ss, top, ps, and more. Similarly several networking-related files under /proc/ are monitored.
The rate at which these collections take place is controlled via the --interval option.
The monitor action supports the following options:
- --monitor
- Enables this action.
- --disable-monitor-defaults
- Do not monitor or collect any of the default items. This implies that all collections will be specified via --monitor-files and/or --monitor-commands.
- --monitor-files FILES
- A comma-delimited list of files to monitor. Monitored files have their contents copied to a file within the rig's archive of the same name. The contents will be separated by a timestamp header taken at the time of collection.
- --monitor-commands COMMANDS
- A comma-delimited list of commands to execute every --interval
seconds, and have that output saved to a file within the rig's archive.
Output collections are separated by a timestamp header taken at the time
of collection.
Note that commands will need to be properly quoted if there are spaces (or other quotes) in the command string. For example, to run 'ps auxwww' the proper invocation would be --monitor-commands='ps auxwww'.
In-line shell scripting is not supported. While it may be possible for such values to function, there are no guarantees as to those executions working properly, at all, not causing unintended side-effects or harm, et cetera. Dragons ahead, and so forth.
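The collection behavior described for --monitor-commands (run a command each interval and append its output under a timestamp header) can be sketched as follows. The header format, file layout, and function names are hypothetical. Commands are split with shlex and run without a shell, in keeping with the note above that in-line shell scripting is not supported.

```python
import shlex
import subprocess
import time


def collect_once(command: str, outfile: str) -> None:
    """Run one command and append its output under a timestamp header."""
    header = time.strftime("==== %Y-%m-%d %H:%M:%S ====")
    result = subprocess.run(shlex.split(command), capture_output=True, text=True)
    with open(outfile, "a") as fh:
        fh.write(f"{header}\n{result.stdout}\n")


def monitor(command: str, outfile: str, interval: float = 1, cycles: int = 3) -> None:
    """Collect `cycles` samples, sleeping `interval` seconds between them."""
    for cycle in range(cycles):
        collect_once(command, outfile)
        if cycle < cycles - 1:
            time.sleep(interval)
```

A real rig would keep collecting until it is triggered; the fixed cycles count here only keeps the sketch finite.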
- noop
- Does nothing - this action runs a no-op. This is ideally used when you need to test a rig's configuration to make sure its trigger condition is set properly - e.g. a regex string for the logs rig's --message option.
The noop action supports the following options:
- --noop
- Enables this action
MAINTAINER
Jake Hunsaker <jhunsake@redhat.com>
January 2019 |