
The ECN No Name Newsletter is no longer being published. This is an archived issue.
Large programs are often called "number crunchers" because they use the computer to do calculations that would be very tedious if done by hand. Most ECN sites have designated their Gould machines as their "number crunching" machines. This comes with some limitations, however, because the same machines also support the undergraduate users.
Sometimes people need to run the same program repeatedly because they are trying different variations of their data. If one person fires off 10 number-cruncher jobs, they could drag the whole machine down. The Gould machines have a program built into the kernel that compensates for what are fondly called "hogs". Most ECN machines are run under the policy that "a maximum of 2 number crunchers can be run simultaneously." The Gould processor looks at who is running each job and swaps between the large jobs. The load average reported by "uptime" basically reflects the number of crunchers in the queue, although a few heavy "vi" sessions can have the same effect as one number cruncher. So, when you type "ps -axuR" to see the crunchers, the number of lines of information reported should be close to the load average reported by uptime.
USER PID PR %CPU %MEM SZ RSS TT STAT TIME COMMAND
badhog 20127 34.6 0.3 247 80 q4 R 5:38 run
badhog 20140 34.6 0.3 247 80 q4 R 9:28 run2
whitley 23127 34.3 0.2 139 65 pf R 9:57 a.run
bigfoot 18127 33.8 1.2 9800 357 pb R N 39:24 RMREX.10M
clarkst 2750 79.1 0.0 0.2 47 52 R 10:00 ps -axuR
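The comparison between the "ps" listing and the load average can be sketched as two small Bourne shell helpers. This is only an illustration: count_jobs and load_1min are hypothetical names, not ECN tools, and the sketch assumes that uptime prints the usual "load average: a, b, c" wording.

```sh
#!/bin/sh
# Sketch only: compare the number of listed crunchers with the load average.
# count_jobs and load_1min are hypothetical helper names, not ECN tools.

# Count the process lines in "ps" output on stdin (the header line is skipped).
count_jobs() {
    awk 'NR > 1 { n++ } END { print n + 0 }'
}

# Print the first (1-minute) load average from "uptime" output on stdin.
# Assumes the usual "load average: a, b, c" wording.
load_1min() {
    sed -n 's/.*load averages\{0,1\}: *\([0-9.]*\).*/\1/p'
}
```

On a machine whose ps accepts the flags used in this article, the two numbers could be printed with "ps -axuR | count_jobs" and "uptime | load_1min". Other systems may need different ps flags.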
Assuming the users from the above table are using a Gould (EI, EN, CN, GN, or MN), swapping would be as follows:
badhog's run, whitley's a.run, bigfoot's RMREX.10M, badhog's run2, whitley's a.run, bigfoot's RMREX.10M, etc.
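The swap order listed above can be reproduced with a toy model: rotate between users, and within one user rotate between that user's jobs. This is not the Gould kernel's actual code, only a sketch under that assumption, and swap_order is a hypothetical name introduced for the example.

```sh
#!/bin/sh
# Toy model only (not the kernel's algorithm): rotate between users, and
# within one user rotate between that user's jobs.
# Usage: swap_order TURNS user:job [user:job ...]
# Prints one "user job" line per turn.
swap_order() {
    turns=$1; shift
    [ $# -gt 0 ] || return 0
    # Build the list of distinct users in first-seen order.
    users=""
    for pair in "$@"; do
        u=${pair%%:*}
        case " $users " in
            *" $u "*) ;;
            *) users="$users $u" ;;
        esac
    done
    t=0
    round=0
    while [ "$t" -lt "$turns" ]; do
        for u in $users; do
            [ "$t" -lt "$turns" ] || break
            # Collect this user's jobs and count them.
            ujobs=""
            n=0
            for pair in "$@"; do
                if [ "${pair%%:*}" = "$u" ]; then
                    ujobs="$ujobs ${pair#*:}"
                    n=$((n + 1))
                fi
            done
            # On each round the user gets a turn for the next job in line.
            idx=$((round % n))
            i=0
            for j in $ujobs; do
                [ "$i" -eq "$idx" ] && echo "$u $j"
                i=$((i + 1))
            done
            t=$((t + 1))
        done
        round=$((round + 1))
    done
}
```

For the table above, "swap_order 6 badhog:run badhog:run2 whitley:a.run bigfoot:RMREX.10M" prints the six turns in the order given: badhog gets one turn per round no matter how many jobs are queued, which is why running two jobs at once gains nothing over running them serially.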
Under this swap arrangement, the "badhog" jobs would take longer to run simultaneously than they would have if they had been fired off serially. This way the effect on the other machine users is minimized. Sometimes people fire off jobs on a protected terminal and go home, leaving themselves logged in, so that they can obtain timed results. The same thing can be done with a shell script, if the output is captured in a file:
$ fort. temp.f            [compile the FORTRAN program]
$ mv a.out RUN            [rename the executable as "RUN"]
$ ex timeit               [edit a file called "timeit"]
:a
time RUN
.
:wq
$ chmod +x timeit         [make "timeit" executable]
CAPTURE THE OUTPUT in the file err:
Bourne Shell                 C Shell
$ timeit > err 2>&1 &        % timeit >& err &
$ cat err
-- has all standard output, screen stuff --
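As a minimal Bourne shell sketch of why the "2>&1" matters: the file only receives error messages (and the "time" report, which is written to standard error) if standard error is redirected along with standard output. The noisy and capture_demo functions are hypothetical stand-ins for a real job.

```sh
#!/bin/sh
# noisy is a hypothetical stand-in for a job that writes to both streams.
noisy() {
    echo "result: 42"                       # goes to standard output
    echo "warning: slow convergence" >&2    # goes to standard error
}

# Capture both streams in the file named by $1.
# The order matters: "> file" must come before "2>&1", so that standard
# error is pointed at the file rather than at the terminal.
capture_demo() {
    noisy > "$1" 2>&1
}
```

Calling "capture_demo err" leaves both lines in err. With "> err" alone, the warning would still appear on the terminal.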
You could use the shell script timeit to run multiple jobs like this:
$ ex timeit
: a
date >> err
echo "run 1" >> err
time a.out < input1 > output1
echo "run 2" >> err
time a.out < input2 > output2
echo "run 3" >> err
time a.out < input3 > output3
echo "all done" >> err
date >> err
.
: wq
$ chmod +x timeit
$ touch err     (to make sure the file exists,
BECAUSE if you have "noclobber" set then
you can't append (>>) to a non-existent file!)
Bourne Shell                 C Shell
$ timeit >> err 2>&1         % timeit >>& err
--OR--
$ timeit >> err 2>&1 &       % timeit >>& err &
If you are running /bin/csh, then you can log out. The "time" information will be captured in the file err, along with the dates and the items echoed by the timeit script. Make sure that you always use append (>>) to add to an existing file, because ">" writes over existing files.
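The repeated echo and time lines in the multi-run timeit script can also be written as a loop. The following is a Bourne shell sketch, not part of the original script: run_all, LOG, and TIMER are hypothetical names. TIMER defaults to "time"; because it is reached through a variable, the shell runs the external time program (assuming one is installed) rather than any built-in.

```sh
#!/bin/sh
# Sketch: loop form of the multi-run "timeit" script (Bourne shell).
# run_all, LOG and TIMER are hypothetical names introduced for this example.
# Usage: run_all PROGRAM N [N ...]
# Reads inputN, writes outputN, and appends progress markers to $LOG.
run_all() {
    prog=$1; shift
    log=${LOG-err}
    date >> "$log"
    for n in "$@"; do
        echo "run $n" >> "$log"
        # The timing report goes to standard error, so invoke the script
        # with ">> err 2>&1" to collect it in the same file.
        ${TIMER-time} "$prog" < "input$n" > "output$n"
    done
    echo "all done" >> "$log"
    date >> "$log"
}
```

A script whose last line is "run_all ./a.out 1 2 3" would then behave like the three-run version above, and a fourth run only needs one more argument.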
NOTE: The "time" function in the Bourne Shell ($) is different from the one in the C Shell (%). A shell script can be set to run under a particular shell by the first line of the script.
Bourne Shell                 C Shell
#!/bin/sh                    #!/bin/csh
more commands                more commands
Also note that if you are a C Shell user and the job takes more than 1 CPU hour to run, you will need to use the "limit cputime u" or "unlimit" command to ensure that the job finishes. These commands should be used carefully. If all your jobs have unlimited time allowed on the computer, it is your responsibility to kill off runaway jobs.