Aug 23

MountStatusMonitor

Our environment is designed around a model in which a few very large servers provide terabytes of storage to several hundred machines, with cfengine synchronizing our filesystem configuration. This greatly simplifies our infrastructure: it gives our users a single, consistent filesystem view no matter which machine they're using, avoids the need for us to manage large quantities of local storage, and has allowed us to treat servers and workstations as interchangeable parts (automated installs mean a failed system can be replaced in as little as 5 minutes).

NFS was designed to handle failures smoothly: when something breaks, the client suspends all of the affected processes until the server starts responding again, and then everything resumes normal operation. Unfortunately, we've discovered bugs in several operating systems which cause this process to fail, often without logging any indication of the failure.

I wrote MountStatusMonitor to provide that missing notification. It's a simple daemon which periodically checks every mounted filesystem for failures. Unlike most other monitoring tools, MountStatusMonitor robustly handles every failure mode, even the ones which cause unrecoverable client hangs. Normally it syslogs a message like this after each run:

MountStatusMonitor[2659]: Checked 42 mounts in 0 seconds

When something fails it logs a summary message like this:

MountStatusMonitor[21900]: Checked 37 mounts in 60 seconds: 1 dead
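
If you want to alert on these summary lines, something like the following might do. It is only a sketch based on the two sample messages above: the regular expression, the `parse_summary` name, and the assumption that a missing "dead" suffix means zero dead mounts are mine, not part of MountStatusMonitor, and real logs may contain suffixes not covered here.

```python
import re

# Pattern inferred from the two sample log lines; other suffixes may exist.
SUMMARY = re.compile(
    r"Checked (?P<mounts>\d+) mounts in (?P<seconds>\d+) seconds"
    r"(?:: (?P<dead>\d+) dead)?"
)


def parse_summary(line):
    """Return a dict of counts from a summary line, or None if it is not one."""
    match = SUMMARY.search(line)
    if match is None:
        return None
    return {
        "mounts": int(match.group("mounts")),
        "seconds": int(match.group("seconds")),
        # Assumption: no "dead" suffix means every mount responded.
        "dead": int(match.group("dead") or 0),
    }
```

A caller could then raise an alert whenever the returned `dead` count is greater than zero.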

Other information about the failed mount depends on the type of failure. Many NFS failures will simply cause a process to hang when it accesses the mount, so MountStatusMonitor runs the actual check in a separate worker process, allowing the main process to record an error if the check takes too long:

MountStatusMonitor[21900]: Timed out waiting for child process 30686: sending SIGKILL
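
The general technique of checking a mount from a child process with a timeout can be sketched roughly as follows. This is an illustration rather than MountStatusMonitor's actual code: the `check_mount` name, the use of `os.statvfs` as the probe, the polling interval, and the return values are all assumptions made for the example.

```python
import os
import signal
import time


def check_mount(path, timeout=60.0, probe=os.statvfs, poll=0.05):
    """Probe one mountpoint from a child process; return 'ok', 'error' or 'dead'."""
    pid = os.fork()
    if pid == 0:
        # Child: the probe may block forever on a dead NFS mount, which is
        # why it runs here and not in the monitoring process itself.
        status = 1
        try:
            probe(path)
            status = 0
        finally:
            os._exit(status)

    # Parent: poll for the child's exit without ever blocking on it.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        done, status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            exited_cleanly = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
            return "ok" if exited_cleanly else "error"
        time.sleep(poll)

    # The check took too long: kill the worker and report the mount as dead.
    os.kill(pid, signal.SIGKILL)
    # Reap the child if it exits promptly, but never block: a process stuck
    # in an uninterruptible NFS call may not die even after SIGKILL.
    for _ in range(20):
        if os.waitpid(pid, os.WNOHANG)[0] == pid:
            break
        time.sleep(poll)
    return "dead"
```

The important design point is that the parent never makes a call that touches the mount itself, so a hung filesystem can only ever cost it one child process and one timeout.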

The worker process runs as the owner of the mountpoint, but it's still possible to encounter permissions errors, which will be logged like this:

MountStatusMonitor[18038]: Couldn't check mountpoint /example: mode 4000 does not allow access
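
A permission test of this kind might look like the sketch below. It assumes the check is simply whether the owner's read and execute bits are set on the mountpoint; the function names are hypothetical, switching to the owner's identity is left out, and the real tool may decide differently. Only the message format is taken from the log line above.

```python
import os
import stat


def owner_can_list(mode):
    """True if the owner bits allow reading and traversing a directory."""
    needed = stat.S_IRUSR | stat.S_IXUSR
    return mode & needed == needed


def access_error(path, mode):
    """Return a log message if the owner cannot access path, else None."""
    if owner_can_list(mode):
        return None
    return "Couldn't check mountpoint %s: mode %o does not allow access" % (path, mode)


def check_access(path):
    # os.stat can itself hang on a dead mount, so in a monitor this call
    # belongs in the worker process, not in the main process.
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return access_error(path, mode)
```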

To use MountStatusMonitor, download the code from the MountStatusMonitor Subversion repository and run make install.