University of Minnesota Institute of Technology
School of Mathematics

Tips for Running Large Computations

The Math Department's computer resources are limited to desktop computers that are shared by a large number of users. As such, your large computations should be crafted to play well with others. Dedicated research computing servers are available at the Minnesota Supercomputing Institute (MSI), run by the Digital Technology Center.

Run your computations with a reduced scheduling priority.

Computations on shared workstations must use 'nice -n 19 COMMAND' where COMMAND is the command with arguments used to run your computation. Running your computations with the lowest scheduling priority should not significantly impact the run time of your computation unless another computation with a higher priority is running on the same workstation.

On lab machines, it's important to 'nice' commands so the computer is still usable for desktop users.
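
The following is a minimal sketch of how to confirm the priority a job will receive, and how to lower the priority of a job that was started without nice. The sleep command is only a stand-in for a real computation, and the variable names are illustrative.

```shell
# With no arguments, nice prints the niceness it is running at, so this
# shows what a job started with 'nice -n 19' would receive.
job_nice=$(nice -n 19 nice)
echo "niceness under 'nice -n 19': $job_nice"

# Without -n, nice adds 10 to the current niceness instead of 19.
echo "niceness under plain 'nice': $(nice nice)"

# If a job is already running without nice, renice can lower its priority.
sleep 1 &                      # stand-in for a forgotten long job
renice -n 19 -p "$!"
wait
```

A regular user can lower the priority of their own processes this way, but generally cannot raise it again afterwards.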

Include checkpoints in your code.

Workstations in the Math Department may need to be restarted at any time, which terminates your computation. At a minimum, every machine in the department is restarted for software installation every few months. For large computations, you should include checkpoints in your code so that you can resume your computation with minimal data loss. Our staff has a pretty good idea of when each machine will be reinstalled, so contact adm@math.umn.edu for workstation suggestions or to find out when a particular workstation will be reinstalled.

Faculty with their own office workstations can be notified automatically the day before their workstations are scheduled to be reinstalled. To receive this notification, contact adm@math.umn.edu.
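
How you checkpoint depends on your language and your problem, but the pattern is usually the same: periodically save enough state to resume, and load that state at startup if it exists. The sketch below illustrates the pattern in shell. The function name run_with_checkpoints, the checkpoint file format, and the sum-of-squares "work" are all hypothetical stand-ins.

```shell
# Usage: run_with_checkpoints CHECKPOINT_FILE TOTAL_STEPS
# Sums the squares 1^2 + ... + TOTAL_STEPS^2, saving progress every 100 steps.
run_with_checkpoints() {
    ckpt=$1
    total=$2
    i=0
    sum=0
    # Resume from the last saved state if a checkpoint exists.
    if [ -f "$ckpt" ]; then
        read -r i sum < "$ckpt"
    fi
    while [ "$i" -lt "$total" ]; do
        i=$((i + 1))
        sum=$((sum + i * i))          # stand-in for one unit of real work
        if [ $((i % 100)) -eq 0 ]; then
            # Write to a temporary file and rename it, so that a restart
            # never finds a half-written checkpoint.
            printf '%s %s\n' "$i" "$sum" > "$ckpt.tmp" && mv "$ckpt.tmp" "$ckpt"
        fi
    done
    echo "$sum"
}

# Example (hypothetical path): run_with_checkpoints ~/my_job.checkpoint 1000000
```

If the workstation restarts, running the same command again picks up from the last saved step rather than from the beginning. Keep the checkpoint file somewhere that survives a reinstall.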

Check system resources while running your code

No matter what programming language you develop in, you can use tools to check system-wide resource usage. Fedora includes the command gnome-system-monitor in the menu entry System Tools > System Monitor. You can also get OS-wide details with commands like top, free, uptime, iostat and mpstat.

For example, top and uptime show the current load average, but only as a single figure for the whole system. The mpstat command can show the usage of each processor or core.

$ uptime
 12:20:08 up 7 days,  1:15,  1 user,  load average: 0.07, 0.02, 0.00
$ mpstat -P ALL
Linux 2.6.23.1-21.fc7 (vinh270b-2.math.umn.edu)         03/24/2008

12:20:10 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
12:20:10 PM  all    0.35    0.00    0.04    0.05    0.00    0.01    0.00   99.54   1044.94
12:20:10 PM    0    0.18    0.00    0.02    0.15    0.00    0.00    0.00   99.64    126.25
12:20:10 PM    1    0.23    0.00    0.01    0.01    0.00    0.00    0.00   99.74    125.61
12:20:10 PM    2    0.71    0.00    0.03    0.15    0.00    0.00    0.00   99.10    126.26
12:20:10 PM    3    0.26    0.00    0.02    0.03    0.00    0.00    0.00   99.69    125.62
12:20:10 PM    4    0.18    0.00    0.01    0.02    0.00    0.00    0.00   99.78    125.94
12:20:10 PM    5    0.15    0.00    0.01    0.02    0.00    0.00    0.00   99.82    125.61
12:20:10 PM    6    0.85    0.00    0.24    0.01    0.01    0.06    0.00   98.84    137.53
12:20:10 PM    7    0.25    0.00    0.01    0.03    0.00    0.00    0.00   99.71    125.60
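
A load average is easier to interpret next to the number of processors. The following is a minimal sketch, assuming a Linux workstation where /proc/loadavg is available; the variable names are illustrative.

```shell
# Number of processors currently online.
cores=$(getconf _NPROCESSORS_ONLN)

# The first field of /proc/loadavg is the 1-minute load average.
read -r load1 rest < /proc/loadavg

# Shell arithmetic is integer-only, so let awk do the division.
per_core=$(awk -v l="$load1" -v c="$cores" 'BEGIN { printf "%.2f", l / c }')

echo "1-minute load $load1 on $cores processors ($per_core per processor)"
```

As a rough guide, a value near or above 1.00 per processor suggests the workstation is already busy and another machine may be a better choice.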

For more info see Resource Monitoring for Red Hat Linux 9, and the more recent System Monitoring for Red Hat Enterprise Linux 6.

Request inclusion in the 'dontkill' list on each workstation.

Each workstation automatically kills processes that have been running for more than a couple of days. The purpose of this is to kill runaway processes that users may not be aware of. Contact us to request that your username be added to the 'dontkill' list. Be sure to include all workstations you plan to run computations on. You can check whether your name is on this list by looking at the file /etc/dontkill on the specific workstation. Keep in mind that each time a workstation is reinstalled, this list reverts to its original form. If you are added to this list, it becomes your responsibility to watch for runaway processes.

Some faculty office machines may have this feature disabled.

If an instructional lab has a regularly scheduled class that uses it, we will not add users to the /etc/dontkill list of the computers in that lab.
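
Because the list is reset on reinstall, it can be worth checking it before starting a long job. The sketch below assumes that /etc/dontkill holds one username per line; the actual file format is not documented here, so adjust the match if it differs. The function name on_dontkill_list is hypothetical.

```shell
# Usage: on_dontkill_list USERNAME [LIST_FILE]
# ASSUMPTION: the list holds one username per line.
on_dontkill_list() {
    list=${2:-/etc/dontkill}
    # -x matches whole lines only, -F treats the name as a fixed string.
    grep -qxF -- "$1" "$list" 2>/dev/null
}

me=$(id -un)
if on_dontkill_list "$me"; then
    echo "$me is on the dontkill list of $(uname -n)"
else
    echo "$me is NOT on the dontkill list of $(uname -n)"
fi
```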

Minimize network traffic by directing heavy I/O to local disk space.

In the test below, local file access is about 10 times faster than network file access.

$ for d in /var/tmp ~ ; do
	echo -e "\n$d";
	time dd count=50000 if=/dev/zero of="$d/zero_test" 2>/dev/null ;
	rm "$d/zero_test";
 done

/var/tmp

real    0m0.260s
user    0m0.018s
sys     0m0.237s

/home/johndoe

real    0m2.580s
user    0m0.025s
sys     0m0.263s

If your computation will be generating heavy I/O, you should be reading and writing to disk space local to the workstation. Use /var/tmp for this purpose, and move data to your home directory periodically or at the completion of your computation.

The directory /var/tmp will be erased each time the workstation is reinstalled.

Every user's home directory is centralized on our primary file server. Each workstation mounts these directories over the network using NFS. Thus, every time you read from or write to your home directory, you produce network traffic.

/usr/bin/magma magma.script > /var/tmp/magma.out && mv /var/tmp/magma.out ~/magma.out
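
The same idea can be wrapped in a small helper that gives each run its own scratch directory, which avoids collisions when several jobs share /var/tmp. This is only a sketch: the function name run_with_local_output and the SCRATCH_BASE variable are hypothetical, and unlike the one-liner above it moves the output even when the job fails, so that partial results can be inspected.

```shell
# Usage: run_with_local_output DESTINATION COMMAND [ARGS...]
# Runs COMMAND with its output on local disk, then moves the result
# to DESTINATION (for example a file in your home directory).
run_with_local_output() {
    dest=$1
    shift
    # SCRATCH_BASE is a hypothetical override; the default is local /var/tmp.
    scratch=$(mktemp -d "${SCRATCH_BASE:-/var/tmp}/job.XXXXXX") || return 1
    "$@" > "$scratch/out"
    status=$?
    # One network write at the end instead of many during the run.
    mv "$scratch/out" "$dest" && rmdir "$scratch"
    return "$status"
}

# Example: run_with_local_output ~/magma.out /usr/bin/magma magma.script
```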

Run the job noninteractively

Interactive sessions make it easy to leave the steps that produced a result undocumented. Some environments manage their memory as a heap, which can become fragmented over time. To reduce complexity, try running computations with no human interaction. A noninteractive run also starts with a clean environment each time.

For example,

nice /usr/local/bin/math < ~/commands.mathematica > /tmp/mathematica_results.txt

Schedule the job to run later or when the load isn't high

Jobs can be run later using the at and batch commands.

For example,

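echo "nice matlab -nodesktop < ~/my_simulation.m > /tmp/my_simulation_results.txt" | at 3:00
and to run when the load is below 0.8, try
echo "nice maple < ~/commands.maple > /tmp/maple.out" | batch
In both cases nice goes inside the quoted command, so that it applies to the job itself rather than to echo.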

Use nohup to run the job

Prefix a long job with the nohup command, and the job will continue to run after you log out.

nohup nice ./monte_carlo 1.7128 > /var/tmp/results.out

Remember to send the output to local disk to reduce network load.
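
In practice a nohup job is usually also put in the background, with its process ID recorded so it can be checked later. The sketch below illustrates this with a one-second stand-in job; the file created by mktemp stands in for a path such as /var/tmp/results.out, and the variable names are illustrative.

```shell
out=$(mktemp)     # stand-in for a local output file such as /var/tmp/results.out

# Redirect stdin, stdout and stderr so the job holds no reference to the
# terminal, then put it in the background with '&'.
nohup nice -n 19 sh -c 'sleep 1; echo done' < /dev/null > "$out" 2>&1 &
job_pid=$!
echo "started job $job_pid"

# 'kill -0' sends no signal; it only tests whether the process still exists.
if kill -0 "$job_pid" 2>/dev/null; then
    echo "job $job_pid is still running"
fi

# Only for this demonstration: wait for the stand-in job and show its output.
# With a real job you would log out here and look at the file later.
wait "$job_pid"
cat "$out"
```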

Optionally use screen to run the job

Using nohup is still a good idea, but if you want to see output from the job as it's running over multiple days you can use screen to detach a login shell and reattach at a later time. You can also run multiple shells from a single screen session, which is another nice feature if you're connecting from home to run or check the status of jobs. Screen commands start with a prefix (Ctrl-a by default) and then a command key (usually a single character).

For example, to start matlab in a screen session you could run...

screen nice matlab -nodesktop -nojvm -nosplash
Then press Ctrl-a d to detach the screen session from your terminal. You can log out and screen will keep matlab running.

The next day you could reattach to the running screen by running

screen -r
Inside of a screen session, the basic commands are listed below.
  • Ctrl-a d Detach from the screen session.
  • Ctrl-a c Create a new terminal inside screen.
  • Ctrl-a 0 Go to terminal 0.
  • Ctrl-a 1 Go to terminal 1.
  • Ctrl-a ? Show a list of commands.
In the help screen, the caret (or hat) character "^" means press the Ctrl key.

Use a fast machine

Check the Computing Facilities list for fast machines. (As of Feb 2008, the VinH 270d lab machines are the fastest.) If you have time, compare the performance of a short job on different types of machines. Always remember to checkpoint your code. Some of the fastest machines are high-end workstations that are used by multiple users, and long-running processes may be killed at any time. Please contact the MSI to run long jobs.

Use ssh to run jobs on a remote machine using nice, nohup, and local temp space for data files.

If the ssh client machine goes down, nohup keeps the computation running and sending its output to a file.

ssh charger 'cd ~/magma; nohup nice /usr/bin/magma magma.script > /var/tmp/magma.out' &

www.math.umn.edu/systems_guide/computation.html
Last Modified March 24, 2008
Contact the School of Mathematics
The University of Minnesota is an equal opportunity educator and employer.
© 2008, The Regents of the University of Minnesota