
The ECN No Name Newsletter is no longer being published. This is an archived issue.
Many ECN users have had to run a large job at one time or another. A large job can be considered any program that takes anywhere from a few hours to a number of days to run. Large jobs must be started with care: a job that is set up poorly can waste precious CPU cycles and take much longer to run than necessary. The intent of this article is to provide a guide to making a job of this type run as efficiently as possible without annoying other users.
First, let's talk about how one starts a large job and places it in the background. A background job is one that is started and left alone to do its business. Most large jobs are run as background jobs so that the terminal can be used to perform other tasks. Normally, an ampersand (&) is placed after the name of the program to be run in the background. Additionally, output is usually redirected into a file so that it may be saved for later analysis. Redirecting the output also keeps it from interfering with other work being done at the terminal. Let's assume a user has a program named bigjob that he wishes to run in the background. He would type "% bigjob > big_output &".
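The background-and-redirect pattern above can be tried out with a small stand-in program. This is only a sketch in Bourne shell syntax: the bigjob script created here is a hypothetical placeholder that sleeps for a second and prints one line, and the temporary directory and variable names are assumptions made for the illustration.

```shell
#!/bin/sh
# Work in a scratch directory so nothing in the user's files is touched.
workdir=$(mktemp -d)

# Hypothetical stand-in for bigjob: works briefly, then prints a result.
cat > "$workdir/bigjob" <<'EOF'
#!/bin/sh
sleep 1
echo "bigjob result ready"
EOF
chmod 700 "$workdir/bigjob"

# Start the job in the background and save its output in a file.
"$workdir/bigjob" > "$workdir/big_output" &
job_pid=$!    # $! holds the process ID of the most recent background job
echo "started bigjob as process $job_pid"

# The shell is free for other commands here. wait blocks until the
# background job has finished and returns the job's exit status.
wait "$job_pid"
job_status=$?
echo "bigjob exited with status $job_status"
cat "$workdir/big_output"
```

Saving the process ID at the start is a convenience: it is the same number that ps would report, so the job can be checked on or killed later without searching for it.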
Now he can go about his other business and use his terminal for other tasks until his large job finishes. Suppose his program is going to take a day or two to finish. What does he do if he is using a public terminal and has to log out? His job will be killed when he logs out, right? Not necessarily. By using the nohup command, he can set his job up so it will continue to chug along after he logs out. Our sample user would type: "% nohup bigjob > big_output &".
Note that when using nohup, output should be redirected to a file, since the terminal will no longer be available once the user logs out. If the output is not redirected, nohup normally appends it to a file named nohup.out.
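The effect of nohup can be seen without actually logging out by sending the job the same hangup signal by hand. The following is a rough sketch, not a definitive recipe: the bigjob script is again a hypothetical stand-in, and sending the signal with kill -HUP is only an approximation of what happens at logout.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Hypothetical stand-in for bigjob: works for a few seconds, then reports.
cat > "$workdir/bigjob" <<'EOF'
#!/bin/sh
sleep 3
echo "bigjob finished"
EOF
chmod 700 "$workdir/bigjob"

# nohup starts the job with the hangup signal (SIGHUP) ignored.
# Output goes to a file and input comes from /dev/null, so the job
# does not depend on the terminal at all.
nohup "$workdir/bigjob" > "$workdir/big_output" 2>&1 < /dev/null &
job_pid=$!

# Give nohup a moment to start, then simulate the hangup a logout delivers.
sleep 1
kill -HUP "$job_pid"

# The job should ignore the hangup and run to completion.
wait "$job_pid"
job_status=$?
echo "bigjob exited with status $job_status"
cat "$workdir/big_output"
```

Running the same script with the word nohup removed should show the difference: the hangup then ends the job before it prints anything.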
It is also possible to set up jobs to run in the middle of the night. This is especially suited to jobs that only take a few hours to run. A user can set up a job to run at night before he leaves at the end of the day. When he returns in the morning, his answers will be ready for him to analyze. Running the job this way takes advantage of the CPU when it is the least busy. One can run his job in the middle of the night using the at(1) command. For example:
% at 200A
at> bigjob < bigdata > big_out
at>
After typing at and the desired run time (the above job will run at 2:00 AM), one just enters commands after the at prompts as if he were using his normal shell, and ends the input by typing Control-D. Alternatively, a file containing commands can be redirected to at. See the man pages for at(1), atq(1), and atrm(1) for more details.
It is usually not a good idea to start several large jobs, or even several copies of the same job, at the same time. The large machines at ECN run a program known as a scheduler, whose main function is to make sure that no single user gets more than his share of CPU time. The scheduler allocates CPU time per user, not per job. So if a person starts three large jobs at the same time, his share is split three ways, and each of those jobs runs much more slowly. The three jobs will run in less total time if they are run one at a time. Running them this way is also considerate of other users who need the system; no one wants to be deemed a "CPU Hog."
To make the jobs run sequentially automatically, one simply puts the names of the programs in a shell script in the order he wants them to run. (Remember to chmod 700 the shell script.) For example:
% cat bigj.sh
#!/bin/sh
bigjob < data1 > big_output1
bigjob < data2 > big_output2
bigjob < data3 > big_output3
% chmod 700 bigj.sh
% nohup bigj.sh &
% logout
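When the runs differ only in their data files, the same sequential idea can be written as a loop that also stops as soon as one run fails, so that later runs do not burn CPU time on a broken setup. This is a sketch of one possible variation rather than the script above: the bigjob stand-in (which just counts its input lines), the data files, and the runs_done counter are all hypothetical names created for the illustration.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for bigjob: counts the lines of its input.
cat > bigjob <<'EOF'
#!/bin/sh
wc -l | tr -d ' '
EOF
chmod 700 bigjob

# Small sample inputs standing in for the real data files.
printf 'a\n' > data1
printf 'a\nb\n' > data2
printf 'a\nb\nc\n' > data3

# Run the jobs one at a time, in order. Each run starts only after the
# previous one has finished, and the loop stops at the first failure.
runs_done=0
for n in 1 2 3; do
    if ./bigjob < "data$n" > "big_output$n"; then
        runs_done=$((runs_done + 1))
    else
        echo "run $n failed; stopping" >&2
        break
    fi
done
echo "completed $runs_done of 3 runs"
```

Saved as a script, the loop would be started the same way as bigj.sh, with nohup and an ampersand.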
One should not start a large job and forget about it. Users should be aware of how much CPU time a program uses. If the program seems to be taking too long to finish, its output should be checked: it may have gone awry and be wasting a great many CPU cycles. Background jobs can be "watched" with the ps(1) command. The jobs that one currently has running can be listed by typing "ps x". More information about all running processes can be obtained by typing "ps axugR". If our sample user looks at some of the results and determines that the job is not producing correct answers, he can kill the job by typing "kill -9" followed by the job's process ID (the number in the PID column of the ps listing). For example:
% ps x
  PID TT STAT    TIME COMMAND
 1653 pa 1       0:00 ps x
 1435 pa S       0:01 -csh (csh)
11588 q1 R     523:12 bigjob
% kill -9 11588
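The check-then-kill routine can also be scripted. The sketch below makes one change that is not in the article: it first sends the default termination signal with a plain kill, which gives a program the chance to clean up, and falls back to kill -9 only if the job is still around a moment later. The long sleep stands in for a runaway bigjob, and the one-second grace period is an arbitrary assumption.

```shell
#!/bin/sh
# Stand-in for a job that has gone awry: left alone, it runs for minutes.
sleep 300 &
job_pid=$!

# List just this one process; -p limits ps to the given process ID.
ps -p "$job_pid"

# Ask politely first: plain kill sends the termination signal (SIGTERM).
kill "$job_pid"
sleep 1

# kill -0 sends no signal; it only tests whether the process still exists.
if kill -0 "$job_pid" 2>/dev/null; then
    # Still there, so force it with the signal that cannot be ignored.
    kill -9 "$job_pid"
fi

# Collect the job's exit status; a value above 128 means a signal ended it.
wait "$job_pid" 2>/dev/null
job_status=$?
echo "job ended with status $job_status"
```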