We used an HP ProCurve switch to interconnect the nodes
and our disk server. Each node obtains its network configuration
via DHCP from the disk server. We linked our current cluster to
our old cluster using a single crossover cable.
We decided that a standard batch system, such as the Portable Batch System,
was inappropriate for our computer usage patterns. Batch systems are effective
at ensuring that CPUs are in constant use, but they have several limitations
that make them difficult to use at SEP.
The most significant issue is that
SEP problems generally take up significant disk space and
can have high IO needs. With a batch system you have little to no
control over which node your process starts on. As a result, a process
may start on a node that does not hold its data locally and
pay a high IO cost.
The other problem with a batch system is that its design parameters don't
address the type of problems we encounter in a research environment.
At best, a batch system determines the number of processes to
use at start up. We occasionally have jobs that will take up to a
week to run. A traditional batch system will either fill
all available nodes with the job, making the machine unavailable
to everyone else for a week, or start it on so few nodes
that a week's job will take two months. Neither option
is optimal.
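
The trade-off above can be made concrete with a little arithmetic. The sketch below assumes ideal linear scaling (runtime inversely proportional to node count) and a hypothetical 32-node machine; neither figure comes from our cluster, and real jobs will scale less cleanly.

```python
def runtime_weeks(node_weeks, nodes):
    """Wall-clock weeks for a job that needs `node_weeks` of total work,
    assuming (optimistically) perfect linear scaling across `nodes`."""
    return node_weeks / nodes


TOTAL_NODES = 32             # hypothetical cluster size, for illustration only
WORK = 1.0 * TOTAL_NODES     # a job that takes one week on the full machine

# Option 1: the batch system gives the job every node.
full = runtime_weeks(WORK, TOTAL_NODES)

# Option 2: the batch system starts the job on a small, fixed slice.
small = runtime_weeks(WORK, 4)

print(f"all {TOTAL_NODES} nodes: {full:.0f} week, nothing left for other users")
print(f"4 nodes: {small:.0f} weeks, roughly two months")
```

Because the node count is fixed at start up, the job cannot move between these two extremes as other users come and go.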
To monitor the nodes, we wrote a simple service that returns the output
of the ps command when a certain port is accessed. This approach
proved more reliable than a simple rsh approach. From our
entry point we periodically attempt to contact each machine and store load
and usage information. For easier evaluation of this information,
we wrote a simple web interface.
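
A minimal sketch of this kind of monitoring, written in Python for illustration, is shown here. It is not our actual service: the names `make_server`, `query_node`, and `poll`, the exact `ps` options, and any port number are assumptions chosen for the example.

```python
import socket
import socketserver
import subprocess

# Assumed ps invocation; the options used by the real service are not specified.
PS_COMMAND = ["ps", "-eo", "pid,user,pcpu,pmem,comm"]


def make_server(host, port, command=PS_COMMAND):
    """Build a TCP server that replies to every connection with the
    output of `command` and then closes the connection."""

    class Handler(socketserver.BaseRequestHandler):
        def handle(self):
            result = subprocess.run(
                command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
            )
            self.request.sendall(result.stdout)

    server = socketserver.ThreadingTCPServer((host, port), Handler)
    server.daemon_threads = True  # do not block shutdown on open requests
    return server


def query_node(host, port, timeout=5.0):
    """Return the text a node sends back, or None if it cannot be reached."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            chunks = []
            while True:
                data = conn.recv(4096)
                if not data:  # server closed the connection: reply complete
                    break
                chunks.append(data)
        return b"".join(chunks).decode(errors="replace")
    except OSError:  # refused, timed out, unreachable, ...
        return None


def poll(nodes, port):
    """Contact each node once; map node name to its reply (None if down)."""
    return {node: query_node(node, port) for node in nodes}
```

On each node, one would call `make_server("0.0.0.0", port).serve_forever()`; on the entry point, calling `poll(node_names, port)` at a fixed interval yields a snapshot that can be stored or rendered by a web page. A node that fails to answer within the timeout shows up as `None` rather than hanging the poller, which is one plausible reason a dedicated port is more robust than rsh.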
Currently we limit cluster usage by assigning separate cluster accounts, but
keep the usage transparent by giving each user the same uid on both the SEP and
cluster accounts.
For console access, we debated whether to use a RocketPort serial hub
or a series of switches. Both solutions introduced significant
cabling and additional costs. For now we are using Virtual Network Computing (VNC),
developed by AT&T, for general console displays, and we attach a monitor
and keyboard when we experience problems with a node. So far this
solution has proven effective, but it may not be optimal for a large system.