Nightshade and Venom Linux Clusters


More sample batch-submission scripts:

In some cases you have a serial code that you want to run many times. Doing so requires a bit more sophistication, so we will switch to Perl as our scripting language.

Here is the simplest example, where you request two processors on a single node to run two independent jobs in parallel (e.g., the executables a.out.1 and a.out.2). To do this, we use fork to spawn child processes that launch the executables. The script then monitors the processes using the waitpid command with a flags argument of 0 for blocking waits (i.e., the waitpid command does not return until the process exits).

#!/usr/bin/perl
#PBS -N sample
#PBS -e sample.err
#PBS -o sample.out
#PBS -m abe
#PBS -q normal
#PBS -l nodes=1:ppn=2

### above PBS settings are: name = sample, std error -> sample.err,
###   std out -> sample.out, mail notification to the job owner
###   (on abort, begin, and end), queue = normal, request for
###   2 processors on 1 node (i.e. 2 jobs per node)

use POSIX ":sys_wait_h";

### set locations
#     assume the executable names are a.out.1, a.out.2
$workdir_base = $ENV{'PBS_O_WORKDIR'};
$executable1 = "$workdir_base/a.out.1";
$executable2 = "$workdir_base/a.out.2";

### get node file name from shell
$pbs_nodefile = $ENV{'PBS_NODEFILE'};

### log stuff to standard error
chomp($hostname = `hostname`);
chomp($numprocs = `wc -l <$pbs_nodefile`); $numprocs =~ s/\s//g;
print STDERR "Working base directory is $workdir_base\n";
print STDERR "Running on host $hostname\n";
print STDERR "Time is " . localtime(time) . "\n";
print STDERR "Allocated $numprocs processors\n";

### launch the first executable on the local machine
#   launch executable by forking child processes
#   keep pid as $child1
print STDERR "running command: $executable1\n";
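$child1 = fork();
die "fork failed: $!" unless defined $child1;
# if exec fails, stop the child here so it does not run the rest of this script
if ($child1 == 0) { exec($executable1) or die "Could not exec $executable1: $!"; }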

### launch second executable
print STDERR "running command: $executable2\n";
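$child2 = fork();
die "fork failed: $!" unless defined $child2;
# if exec fails, stop the child here so it does not run the rest of this script
if ($child2 == 0) { exec($executable2) or die "Could not exec $executable2: $!"; }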

### monitor jobs until they finish (using blocking waits)
waitpid($child1, 0);
waitpid($child2, 0);
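
The script above does not record how the two jobs ended. Here is a minimal sketch, assuming you want each job's exit code in the log: after a blocking waitpid, Perl leaves the child's status in $?, and the WIFEXITED and WEXITSTATUS macros from the POSIX module decode it. The helper names launch and wait_status are hypothetical, and the two perl one-liners are placeholders standing in for a.out.1 and a.out.2.

```perl
#!/usr/bin/perl
use POSIX ":sys_wait_h";

# launch is a hypothetical helper: fork a child that execs the command,
# and return the child's pid to the parent
sub launch {
  my @command = @_;
  my $child = fork();
  die "fork failed: $!" unless defined $child;
  if ($child == 0) { exec(@command) or die "exec failed: $!"; }
  return $child;
}

# wait_status is a hypothetical helper: block until the child exits and
# return its exit code, or -1 if it was terminated by a signal
sub wait_status {
  my ($child) = @_;
  waitpid($child, 0);   # blocking wait; the status lands in $?
  return WIFEXITED($?) ? WEXITSTATUS($?) : -1;
}

# placeholder commands standing in for a.out.1 and a.out.2;
# $^X is the path of the running perl interpreter
$child1 = launch($^X, '-e', 'exit 0');
$child2 = launch($^X, '-e', 'exit 3');

# both jobs are already running in parallel; now collect their results
$status1 = wait_status($child1);
$status2 = wait_status($child2);
print STDERR "first job exited with code $status1\n";
print STDERR "second job exited with code $status2\n";
```

A nonzero exit code in sample.err is then a quick indication that one of the jobs failed.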

Here is a more sophisticated example, where n jobs (e.g., a.out.1, a.out.2, ..., a.out.n) are to be run on 10 processors. The script parses the node file and maintains a job table to keep track of running jobs. When a job exits, the script launches the next job on the corresponding node, until there are no more jobs. Notice the use of nonblocking waits (using the WNOHANG flag) to monitor the running processes.

#!/usr/bin/perl
#PBS -N sample
#PBS -e sample.err
#PBS -o sample.out
#PBS -m abe
#PBS -q normal
#PBS -l nodes=5:ppn=2

### above PBS settings are: name = sample, std error -> sample.err,
###   std out -> sample.out, mail notification to the job owner
###   (on abort, begin, and end), queue = normal, request for
###   10 processors on 5 nodes (i.e. 2 jobs per node)

use POSIX ":sys_wait_h";

### set locations
#     assume the executable names are a.out.1, a.out.2, ..., a.out.n
$workdir_base = $ENV{'PBS_O_WORKDIR'};
$executable_base = "$workdir_base/a.out"; # everything but the last number & dot
$num_executables = 20; # number of executables to run

### get node file name from shell
$pbs_nodefile = $ENV{'PBS_NODEFILE'};

### log stuff to standard error
chomp($hostname = `hostname`);
chomp($numprocs = `wc -l <$pbs_nodefile`); $numprocs =~ s/\s//g;
print STDERR "Working base directory is $workdir_base\n";
print STDERR "Running on host $hostname\n";
print STDERR "Time is " . localtime(time) . "\n";
print STDERR "Allocated $numprocs processors\n";

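### read the node file: one line per allocated processor
die "Could not open node file" unless open(NODEFILE, "<$pbs_nodefile");
chomp(@nodes = <NODEFILE>);
close(NODEFILE);
die "No processors allocated" unless @nodes;

### launch one job per free processor, refilling slots as jobs finish
$jobcounter = 0;   # number of jobs launched so far
%jobtable = ();    # job table: child pid -> node the job runs on
while ( $jobcounter < $num_executables || %jobtable ) {
  # launch jobs while free processors and unlaunched jobs both remain
  while ( @nodes && $jobcounter < $num_executables ) {
    $nodename = shift @nodes;
    $executable = $executable_base . '.' . (1 + $jobcounter);
    $jobcommand = "rsh $nodename 'cd $workdir_base; $executable'";
    print STDERR "running command: $jobcommand\n";
    $child = fork();
    die "fork failed: $!" unless defined $child;
    if ($child == 0) {
      exec($jobcommand) or die "Could not exec $jobcommand: $!";
    }
    $jobtable{$child} = $nodename;
    $jobcounter++;
  }

  # nonblocking check of each running job; waitpid returns 0 while the
  # job is still running, so any other value means the slot is free again
  foreach $pid (keys %jobtable) {
    if ( waitpid($pid, &WNOHANG) != 0 ) {
      push @nodes, delete $jobtable{$pid};
    }
  }
  sleep 1;
}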
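
The monitoring loop depends on what a nonblocking waitpid returns: 0 while the child is still running, the child's pid on the call that collects the finished child, and -1 on any later call, because the process no longer exists. Here is a minimal sketch of that behavior, assuming a placeholder child that just sleeps for 2 seconds in place of a real job (the variable names are illustrative only).

```perl
#!/usr/bin/perl
use POSIX ":sys_wait_h";

# placeholder job: a child that sleeps for 2 seconds, then exits
$child = fork();
die "fork failed: $!" unless defined $child;
if ($child == 0) { sleep 2; exit 0; }

# poll without blocking; count the polls that found the child still running
$polls = 0;
while (1) {
  $result = waitpid($child, WNOHANG);
  last if $result != 0;   # nonzero: the child has been collected (or is gone)
  $polls++;
  sleep 1;
}
print STDERR "waitpid returned $result after $polls polls that found the child running\n";

# the child has already been collected, so a second call reports -1
$again = waitpid($child, WNOHANG);
print STDERR "second waitpid returned $again\n";
```

Between polls the script is free to do other work, such as launching the next job on a processor that has just become free.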


Questions/comments: contact Daniel Steck.