Room 135 Cluster Emergency Information
Purpose:
Monitor the temperature in Rm 135, home of T-8's Nightshade Linux Cluster
and T-10's Microsome Cluster. In case of excessively high temperature
(e.g., air-conditioner failure), the appropriate people are notified and
the clusters are shut down.
Emergency System Components:
Sensaphone 2000
- Monitors a 10K thermistor (soon 4 thermistors) to determine temperature
in Rm 135.
- In the event of a high temperature or power failure, will begin
calling a list of phone numbers.
- The temperature near a cluster is about 70°F, and the alarm set
point is currently 90°F.
- The Sensaphone page is here,
and documentation in pdf format is
here.
Windows PC (fahrenheit.lanl.gov)
- The OS is Windows 98: an unfortunate choice, but the only one available
for the Sensaphone 2000.
- Sensaphone 2000
software automatically polls the state of the Sensaphone 2000
every 2 minutes, and generates a web page of the results at
http://fahrenheit.lanl.gov/indexaut.htm.
- An ftp server, wftpd, is needed to
receive the web page from the Sensaphone software.
- The Apache web
server is run as a service under Windows 98 to serve the web pages.
- A Perl script, tempmonitor.pl, runs continuously;
every minute it reads the indexaut.htm file and writes
a corrected HTML file to
http://fahrenheit.lanl.gov/index.htm.
In the process, it scans for alarm conditions (which are marked in red).
If the web page has not been updated recently enough, or the backup
batteries have failed, email messages are sent to an administrator list.
If a power failure or temperature high alarm is detected, email messages
are sent to an expanded list, and special messages are sent to the
clusters to initiate an automatic shutdown. The script sends only
one batch of mail messages per alarm.
- An SMTP server, Postcast Server,
handles outgoing email. This software is a little clunky and can take
as long as a minute to send the messages.
- The clusters authenticate a shutdown message only if it was sent
recently, so the local time must be set accurately.
Software
by NIST synchronizes the system clock via NTP every hour to the
time at utcnist.colorado.edu (the clusters get their time from the
same server).
- All software is configured to start automatically on reboot.
- The status page on fahrenheit is mirrored periodically to the outside
world (with a small additional delay)
at http://t8web.lanl.gov/sensaphone.
A directory listing instead of a status page indicates that
fahrenheit has crashed; in this case the last status page
is at previous.html
(including the time of last update).
- A temperature history of the room generated using this info is
here.
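The tempmonitor.pl behavior described above (scan the page for red-marked alarms, choose the recipients, and mail only once per alarm) might be sketched as follows. This is an illustrative Python sketch, not the actual Perl script: the red-font markup, the alarm labels, and every name in it (RED_TEXT, SHUTDOWN_ALARMS, find_alarms, audience, AlarmTracker) are assumptions, since the page format is not shown here.

```python
import re

# Assumed markup: alarms appear as <font color="red">LABEL</font>.
RED_TEXT = re.compile(r'<font color="?red"?>(.*?)</font>', re.IGNORECASE | re.DOTALL)

# Hypothetical labels for the alarms that trigger a cluster shutdown.
SHUTDOWN_ALARMS = {"AC Power", "Temperature"}


def find_alarms(html):
    """Return the set of labels marked in red on the status page."""
    return {label.strip() for label in RED_TEXT.findall(html)}


def audience(alarm):
    """Power and temperature alarms go to the expanded list and the clusters;
    anything else (e.g. a failed backup battery) goes to administrators only."""
    return "expanded+clusters" if alarm in SHUTDOWN_ALARMS else "admins"


class AlarmTracker:
    """Remembers the active alarms so that each one is reported only once."""

    def __init__(self):
        self.active = set()

    def newly_raised(self, current):
        """Return the alarms in `current` that were not active on the last poll."""
        fresh = current - self.active
        self.active = set(current)
        return fresh
```

Calling newly_raised on every poll with the result of find_alarms yields each alarm once when it first appears, and again only after it has cleared and come back.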
Clusters:
- The cluster master nodes look for mail at the address
emergency_signal@clustername.lanl.gov, and upon receiving
such a message, should launch the emergency_stop
Perl script. To configure this, do the following:
- Put this shell script in
/usr/local/sbin
- Set the permissions so that the mail group can run emergency_stop
as a setuid script:
chgrp mail /usr/local/sbin/emergency_stop
chmod 4750 /usr/local/sbin/emergency_stop
- Since nightshade is set up to use SMRSH, we will run this from
the proper directory:
ln /usr/local/sbin/emergency_stop_wrap /etc/smrsh/emergency_stop
- Add a line to /etc/aliases that says:
emergency_signal: "|/etc/smrsh/emergency_stop"
- Execute the newaliases command
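To see what chmod 4750 grants, the mode can be decoded with Python's standard stat module. This is only an illustration of the permission bits; the regular-file type bit is added by hand so that no real file is needed.

```python
import stat

# 4750 octal = setuid bit (4000) + owner rwx (700) + group r-x (050) + nothing for others.
MODE = 0o4750

# stat.filemode expects the file-type bits as well, so mark it as a regular file.
print(stat.filemode(stat.S_IFREG | MODE))  # -rwsr-x---
```

The s in the owner field is the setuid bit; the group (mail, after the chgrp) may read and execute the script, and all other users have no access.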
- The script will perform a number of checks to see that the email
message actually came from fahrenheit.lanl.gov:
- A sample message would look like this:
From sensaphone_debug@steck.us Wed Sep 11 19:08:21 2002
Received: from 127.0.0.1 (fahrenheit.lanl.gov [128.165.59.190])
by nightshade.lanl.gov (8.9.3/8.9.3) with SMTP id TAA12914
for <emergency_signal@nightshade.lanl.gov>; Wed, 11 Sep 2002 19:08:21 -0600
Message-Id: <200209120108.TAA12914@nightshade.lanl.gov>
From: "Rm 135 Sensaphone 2000" <invalid_address@invalid_domain>
To: <emergency_signal@nightshade.lanl.gov>
Date: Wed Sep 11 19:07:05 2002
Subject: emergency shutdown request
Priority: Highest
X-Priority: 1 (Highest)
X-Date: 1031792808
X-AuthCode: 1685cc7d5514caea2bc214b310b0d333895ca12108ab84021d4e2ce90dca0f35
X-RSA-Public-Key: 129dad94f1d26ff9790d1f9a713672e39fdc34fd3633681b46cd0c26b6e53675
X-RSA-Modulus: 1c630ce7db21dee339eb573d8dc92ffbe228465010cdef2df8f4ba90ae0bc71d
X-GCMulti: 1
AC Power OFF Off Waiting
- The X-AuthCode: line is an encrypted form of the X-Date:
line, using a 256-bit private RSA key known only to fahrenheit.
The public key and modulus needed to decrypt it are given as headers,
but these are not used by the emergency_stop script.
- Further, the decrypted X-Date: time (in absolute seconds)
must not differ from the cluster time by more than 5 minutes.
(Thus the cluster time needs to be synchronized regularly to
time.nist.gov.)
- The message should have been received directly from fahrenheit,
and this is checked by looking at the first Received
header (to defeat forged headers) and verifying the DNS lookup
performed automatically by sendmail upon receipt of the
message.
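The three checks above could be sketched in Python roughly as follows. This is not the emergency_stop script itself. It assumes unpadded ("textbook") RSA, in which X-AuthCode raised to the public exponent modulo the modulus gives back the X-Date value; the actual encoding is not specified here. The exponent and modulus are passed in as arguments, on the assumption that the script keeps its own copy rather than trusting the headers. The function names and the toy key are hypothetical.

```python
import re
from email import message_from_string

MAX_SKEW_SECONDS = 5 * 60  # the 5-minute window described above


def auth_code_matches(msg, exponent, modulus):
    """Textbook RSA: X-AuthCode raised to the public exponent must give X-Date."""
    stamp = int(msg["X-Date"])
    code = int(msg["X-AuthCode"], 16)
    return pow(code, exponent, modulus) == stamp


def is_fresh(msg, now):
    """The signed time must be within MAX_SKEW_SECONDS of the local clock."""
    return abs(now - int(msg["X-Date"])) <= MAX_SKEW_SECONDS


def relayed_by(msg, host):
    """Compare the host name recorded in the first Received header."""
    first = msg.get_all("Received", [""])[0]
    match = re.search(r"\(([\w.-]+) \[[\d.]+\]\)", first)
    return match is not None and match.group(1) == host


# Toy key for demonstration only (two Mersenne primes); far too small for real use.
P, Q, E = 2**31 - 1, 2**61 - 1, 65537
N = P * Q
D = pow(E, -1, (P - 1) * (Q - 1))  # modular inverse; needs Python 3.8+


def demo_message(stamp):
    """Build a message the way a sender holding the private exponent D might."""
    code = pow(stamp, D, N)
    return message_from_string(
        "Received: from 127.0.0.1 (fahrenheit.lanl.gov [128.165.59.190])\n"
        "\tby nightshade.lanl.gov with SMTP\n"
        f"X-Date: {stamp}\n"
        f"X-AuthCode: {code:x}\n"
        "\n"
        "AC Power OFF\n"
    )


msg = demo_message(1031792808)
ok = (auth_code_matches(msg, E, N)
      and is_fresh(msg, now=1031792808 + 60)
      and relayed_by(msg, "fahrenheit.lanl.gov"))
```

Because the auth code is bound to the timestamp, a replayed message fails the freshness check, and a message with an edited X-Date fails the RSA check.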
- The script will then send out a broadcast message, warning users that
the compute nodes will shut down in 5 minutes, with SIGTERM signals
being sent 100 seconds before shutdown. After 3 minutes, the shutdown
commands are actually sent via rsh to the compute nodes.
- After shutdown, the nodes should actually be powered down. The
emergency_stop script supports two mechanisms for this:
- If the BIOS and kernel support APM or a similar power management
system, the nodes can be configured to power off on shutdown.
(This may not work on SMP kernels.)
To do this,
edit the /etc/rc.d/init.d/halt script; change the line that says
command="halt"
to
command="halt -p"
- If the compute nodes have EMP ports, the script will use the
vash command-line interface to VACM to power off all the
nodes 2 minutes after the shutdowns are initiated.
For more information on installing VACM, see the
VACM page.
Note that if the master node loses power, VACM must be reconfigured,
using a script such as this one,
which can be invoked from /etc/rc.d/init.d/rc.local.
- The shutdown request is logged once via syslog, and actions
are logged in detail in a dedicated log file.
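The timing of the shutdown sequence described above can be laid out as a small worked example. This sketch takes one reading of the text: the shutdown commands go out 3 minutes after the warning, SIGTERM is sent 100 seconds before that, and the VACM power-off follows 2 minutes after the shutdown commands. The constant and function names are hypothetical, and the real emergency_stop script may schedule things differently.

```python
SHUTDOWN_AT = 180       # shutdown commands are sent after 3 minutes
SIGTERM_LEAD = 100      # SIGTERM is sent 100 seconds before shutdown
POWER_OFF_DELAY = 120   # VACM powers nodes off 2 minutes after the shutdowns


def shutdown_schedule():
    """Return (seconds after the request, action) pairs in chronological order."""
    return [
        (0, "broadcast warning to users"),
        (SHUTDOWN_AT - SIGTERM_LEAD, "send SIGTERM to user processes"),
        (SHUTDOWN_AT, "send shutdown commands to compute nodes via rsh"),
        (SHUTDOWN_AT + POWER_OFF_DELAY, "power off nodes through VACM"),
    ]


for offset, action in shutdown_schedule():
    print(f"t+{offset:3d}s  {action}")
```

Under this reading the final power-off lands 300 seconds after the request, which would match the 5 minutes announced in the broadcast warning.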