School of Mathematics
Tips for Running Large Computations
The Math Department's computer resources are limited to desktop computers, which are shared by a large number of users, and a few computation servers. Large computations should therefore be written to play well with others.

General Info

Run your computations with a reduced scheduling priority.

Computations on shared workstations must use 'nice -n 19 COMMAND', where COMMAND is the command, with its arguments, used to run your computation. Running at the lowest scheduling priority should not significantly increase the run time of your computation unless another computation with a higher priority is running on the same workstation.

On lab machines, it's important to 'nice' commands so the computer is still usable for desktop users. If an instructional lab has a regularly scheduled class that uses it, any jobs on the computers in that lab may be killed if they adversely affect the computers' performance. The top and renice commands can be used to change the nice value of a job that is already running.

Include checkpoints in your code.

Workstations in the Math Department may need to be restarted at any time, which terminates your computation. Every workstation in the department is restarted every week for software installation. Each workstation shows the date and time of its next scheduled restart on its login screen, as well as in the message shown when logging in via SSH. For large computations, you should include checkpoints in your code so that you can resume your computation with minimal data loss.

Our staff has a pretty good idea of when each machine will be reinstalled, so contact adm@math.umn.edu for workstation suggestions or to find out when a particular workstation will be reinstalled.

Selecting Hardware Resources

Use a fast machine

Check the Math Computing Facilities page and the Minnesota Supercomputing Institute for lists of fast machines. (As of Mar 2008, the VinH 314 lab machines are the fastest.)
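
Before settling on a machine, it can help to check how many cores it has and how busy it already is. The sketch below is one rough way to do that on Linux. The helper name has_spare_capacity and the rule of thumb it encodes (the load average leaves at least one core idle) are illustrative assumptions, not department policy.

```shell
#!/bin/sh
# Rough check of whether this machine has room for one more job.
# Assumption: "room" means the 1-minute load average leaves at least
# one core idle. The helper name below is illustrative only.

# has_spare_capacity LOAD CORES: succeeds if LOAD + 1 <= CORES.
has_spare_capacity() {
    awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load + 1 <= cores) }'
}

# Number of online cores (nproc is in GNU coreutils; getconf is a fallback).
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)

# The first field of /proc/loadavg is the 1-minute load average (Linux only).
load=$(cut -d' ' -f1 /proc/loadavg)

if has_spare_capacity "$load" "$cores"; then
    echo "$(uname -n): load $load on $cores cores, looks free"
else
    echo "$(uname -n): load $load on $cores cores, looks busy"
fi
```

Running the same script over SSH on a few candidate workstations gives a quick, if crude, picture of which one is least loaded at the moment.
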
Compare the performance of a short job on different types of machines, if you have time.

Check system resources while running your code

No matter what programming language you develop in, you can use tools to check system-wide resource usage. Fedora includes the command gnome-system-monitor in the menu entry System Tools > System Monitor. You can also get OS-wide details with commands like top, free, uptime, iostat and mpstat.

For example, uptime and the summary line of top report a single load average for the whole system. The mpstat command can show the load for each processor or core.

    $ uptime
     12:20:08 up 7 days,  1:15,  1 user,  load average: 0.07, 0.02, 0.00
    $ mpstat -P ALL
    Linux 2.6.23.1-21.fc7 (vinh270b-2.math.umn.edu)   03/24/2008

    12:20:10 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
    12:20:10 PM  all    0.35    0.00    0.04    0.05    0.00    0.01    0.00   99.54   1044.94
    12:20:10 PM    0    0.18    0.00    0.02    0.15    0.00    0.00    0.00   99.64    126.25
    12:20:10 PM    1    0.23    0.00    0.01    0.01    0.00    0.00    0.00   99.74    125.61
    12:20:10 PM    2    0.71    0.00    0.03    0.15    0.00    0.00    0.00   99.10    126.26
    12:20:10 PM    3    0.26    0.00    0.02    0.03    0.00    0.00    0.00   99.69    125.62
    12:20:10 PM    4    0.18    0.00    0.01    0.02    0.00    0.00    0.00   99.78    125.94
    12:20:10 PM    5    0.15    0.00    0.01    0.02    0.00    0.00    0.00   99.82    125.61
    12:20:10 PM    6    0.85    0.00    0.24    0.01    0.01    0.06    0.00   98.84    137.53
    12:20:10 PM    7    0.25    0.00    0.01    0.03    0.00    0.00    0.00   99.71    125.60

For more info see Resource Monitoring for Red Hat Linux 9, and the more recent System Monitoring for Red Hat Enterprise Linux 5.

Use local disk, it's faster

Local file access can be roughly 10 times faster than network file access. If your computation needs to read or write data, this can save you time.

As an example, here are some times to write 50,000 blocks of zeroes (dd's default block size is 512 bytes, so about 25 MB) to local disk and to a home directory mounted using NFS.
    $ for d in /var/tmp ~ ; do echo -en "\n$d"; time dd count=50000 if=/dev/zero of=$d/zero_test 2>/dev/null ; rm $d/zero_test; done

    /var/tmp
    real    0m0.260s
    user    0m0.018s
    sys     0m0.237s

    /home/johndoe
    real    0m2.580s
    user    0m0.025s
    sys     0m0.263s

Use local disk, it reduces load on file servers.

If your computation will be generating heavy I/O, you should be reading and writing to disk space local to the workstation. Use /var/tmp for this purpose, and move data to your home directory periodically or at the completion of your computation.

The lsof command on Linux lists open file handles. For example, to find files under your home directory that one of your processes has open for writing:

    [jdoe@foo ~]$ lsof | grep jdoe | grep home | grep '[0-9]w'

The directory /var/tmp will be erased each time the workstation is reinstalled. Every user's home directory is centralized on our primary file server. Each workstation mounts these directories over the network using NFS. Thus, every time you read from or write to your home directory, you produce network traffic. The command below runs a program that writes to local disk and, if it succeeds, moves the data file to the user's home directory.

    /usr/bin/magma magma.script > /var/tmp/magma.out && nice mv /var/tmp/magma.out ~/magma.out

Job Control

Run the job noninteractively

Working interactively makes it more likely that some of the steps used to produce a result go undocumented. Some environments also manage their memory as a heap, which can become fragmented over a long session. To reduce complexity, try running computations with no human interaction. A noninteractive run also starts with a clean environment each time.

For example,

    nice /usr/local/bin/math < ~/commands.mathematica > /var/tmp/mathematica_results.txt

Use data export features for smaller output

Programming environments like Fortran, C, Matlab, Mathematica and Maple have different data import and export features. Check the support documents for more information.

In C you might use fopen and fprintf to write into a file; in Fortran, the OPEN and WRITE statements.
Programming environments like Matlab, Mathematica or Maple provide more convenience, with functions that can export to specific output formats like JPG, PDF, Excel, or XHTML. Using data export functions can produce smaller files, and your program can write to multiple files. Capturing standard output can be easier, but the files produced can be large and difficult to process afterwards.

Schedule the job to run later and for when the load isn't high

Jobs can be run later using the at and batch commands.

For example,

    echo "nice -n 19 matlab -nodesktop < ~/my_simulation.m > /var/tmp/my_simulation_results.txt" | at 3:00

and to run when the load is below 0.8 try,

    echo "nice -n 19 maple < ~/commands.maple > /var/tmp/maple.out" | batch

Note that nice goes inside the quoted command, so that it applies to the scheduled job rather than to echo.

Use nohup to run the job

Put the nohup command in front of a long job, and the job will continue to run after you log out.

    nohup nice ./monte_carlo 1.7128 > /var/tmp/results.out

Remember to send the output to local disk to reduce network load.

Optionally use screen to run the job

Using nohup is still a good idea, but if you want to see output from the job while it runs over multiple days, you can use screen to detach a login shell and reattach to it at a later time. You can also run multiple shells from a single screen session, which is another nice feature if you're connecting from home to run or check the status of jobs. Screen commands start with a prefix (Ctrl-a by default) followed by a command key (usually a single character).

For example, to start matlab in a screen session you could run

    screen nice matlab -nodesktop -nojvm -nosplash

Then press Ctrl-a d to detach the screen session from your terminal. You can log out and screen will keep matlab running. The next day you could reattach to the running session with

    screen -r

Inside a screen session, press Ctrl-a ? to see a list of the available commands.
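
Whichever way a job is left running, it is useful to be able to check on it from a later login. The following is a minimal sketch of the nohup approach from earlier in this section, extended to record the process ID so the job can be found again. A short sleep stands in for a real program such as ./monte_carlo, and the directory and file names are hypothetical.

```shell
#!/bin/sh
# Sketch: start a long job so it survives logout, then check on it later.
# Assumptions: a short "sleep" stands in for the real computation, and the
# work directory and file names are hypothetical.

# Prefer local disk under /var/tmp; fall back to the default temp location.
workdir=$(mktemp -d /var/tmp/job.XXXXXX 2>/dev/null || mktemp -d)
out="$workdir/results.out"
pidfile="$workdir/job.pid"

# nohup ignores the hangup signal sent at logout, nice -n 19 lowers the
# priority, and "&" puts the job in the background so the prompt returns.
nohup nice -n 19 sh -c 'echo started; sleep 1; echo finished' > "$out" 2>&1 &
echo $! > "$pidfile"

# Later, from any shell on the same machine: is the job still running?
pid=$(cat "$pidfile")
if kill -0 "$pid" 2>/dev/null; then
    echo "job $pid is still running"
else
    echo "job $pid has exited"
fi

# Wait here only so that the sketch ends cleanly, then show the output.
wait "$pid"
cat "$out"
rm -f "$pidfile"
```

In real use you would drop the wait, log out, and come back later to run the kill -0 check and read the output file before moving it to your home directory.
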
Customizing Screen Preferences

Here's a sample ".screenrc" that makes the screen hardstatus line always show the hostname, a list of the virtual terminals and a clock.

# Keep the status line active
hardstatus alwayslastline
# Use the "pinfo screen" command to review the topic "string escapes"
hardstatus string '%{= kG}%H:%{w} %= %?%-Lw%?%{G}(%n%f %t%?(%u)%?%)%{w}%?%+Lw%?%?%= %Y%m%d %c'
# ^ ^ ^
# tomato:   0$ sh   (1$ mail)   2$ sh          20081023 10:00
# green     white   green       white          white
# Hostname  Shells Zero Through Nine           Date and Time
# Default screens
term linux
screen -t sh 0
screen -t emacs 1 emacs -nw
screen -t sh 2
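
Returning to the earlier advice on checkpoints, the sketch below shows one minimal way to make a job resumable. It is only an illustration: the "computation" is a running sum of squares, and the function name sum_squares_resumable and the checkpoint file name are hypothetical. Real programs would save whatever state they need in their own language, but the pattern is the same: load the saved state if it exists, save periodically, and write the checkpoint through a temporary file so a restart in mid-write cannot corrupt it.

```shell
#!/bin/sh
# Sketch of a resumable loop (function and file names are hypothetical).
# The saved state is "i total": the next index to process and the sum so far.

# sum_squares_resumable CHECKPOINT LAST: print the sum of i*i for i = 1..LAST.
sum_squares_resumable() {
    ckpt=$1
    last=$2

    # Resume from the checkpoint if one exists; otherwise start fresh.
    if [ -s "$ckpt" ]; then
        read -r i total < "$ckpt"
    else
        i=1
        total=0
    fi

    while [ "$i" -le "$last" ]; do
        total=$((total + i * i))
        i=$((i + 1))
        # Every 100 steps, save the state. Writing to a temporary file and
        # renaming it keeps the previous checkpoint intact until the new
        # one is complete.
        if [ $((i % 100)) -eq 0 ]; then
            echo "$i $total" > "$ckpt.tmp"
            mv "$ckpt.tmp" "$ckpt"
        fi
    done

    echo "$total"
    rm -f "$ckpt"    # finished, so the checkpoint is no longer needed
}

# Keep the checkpoint on local disk, as recommended for other job output.
ckpt_dir=/var/tmp
[ -w "$ckpt_dir" ] || ckpt_dir=${TMPDIR:-/tmp}
sum_squares_resumable "$ckpt_dir/sum_squares.ckpt" 1000
```

If the workstation restarts partway through, running the same command again picks up from the last saved step instead of from the beginning. Because /var/tmp is erased when a workstation is reinstalled, a long job may also want to copy its checkpoint to the home directory occasionally.
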