We use Linux, Apple Macintosh, and Microsoft Windows. The Kirschner lab mostly uses Linux, currently Ubuntu 10.04. The Linderman lab mostly uses Macs (as does the College of Engineering in general). Below are some specific systems used for development, model runs, and storing model run results, in addition to individual user desktops and laptops.
When installing software on a system it is generally best to download the install package for that software and follow the official installation instructions, rather than copying directories from a system that already has that package installed. This installs the latest version of the application, and the version will be the proper one for that system. There are circumstances where copying from an existing installation will work, but it can make the installation difficult to manage from then on, e.g. when upgrading or uninstalling the package.
The following specific systems are used. See the Kirschner lab or Linderman lab system administrator for IDs and passwords to access these systems.
Axiom is a compute cluster maintained by the Medical School and CCMB (Center for Computational and Molecular Biology). The main contact person is Jonathon Poisson, jdpoisso@umich.edu. Another contact person is Jim Cavacoli, cavalcol@umich.edu. We use Axiom for LHS runs. Typically we get between 50 and 120 CPU cores when we run an LHS. Axiom has Intel CPUs running Red Hat Linux. It uses PBS for its job scheduling. See the CCMB Cluster Usage Guide for more details about using Axiom. This site has a nice overview of using PBS.
You will need an account set up to access Axiom. Ask one of the Axiom contact persons to have one set up for you. Your user ID and password will be your umich unique name and Kerberos password (the same ID and password you use for your umich e-mail). To access Axiom you log on via ssh to the Axiom head node, which runs Red Hat Linux. You cannot log on to Axiom from outside the university network unless you first log on to a computer on the university network. You also cannot log on to Axiom from North Campus. We are not sure about central campus, but you can definitely log on from the medical school network, such as in the Kirschner lab. For example, ssh to helico and from there ssh to axiom. An alternative is to use a VPN client, in which case you would start the VPN client on your local computer and then ssh to axiom directly. See this ITS web page for instructions on how to use a VPN client with the university network. We have not had any experience with the VPN client approach.
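A typical two-hop login from outside the med school network might look like the following. The full helico host name is an assumption based on the other lab machines listed in this document; substitute your own uniqname and the actual host names used in the lab.

    ssh uniqname@helico.micro.med.umich.edu   # log on to a machine on the med school network first
    ssh uniqname@axiom                        # then ssh to the Axiom head node from there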
Once you have an account on Axiom, to run an LHS do the following:
1. Build the model executable on the Axiom head node with
       make
   and copy the executable to the LHS main directory. The run script will expect it there. Also, building on Axiom means you will be able to make changes to the model for different LHS runs, such as doing a knockout LHS and a depletion LHS, since those sometimes require small model changes.
2. Run the LHS with
       make lhs
3. Check the status of your jobs with
       qstat -u ID
   where ID is your Axiom logon ID, e.g. qstat -u pwolberg.
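Put together, a hypothetical session on the Axiom head node might look like the following; the directory names are placeholders rather than the lab's actual layout.

    cd ~/model            # model source tree (placeholder path)
    make                  # build the model executable
    cp model ~/lhs        # copy the executable to the LHS main directory (placeholder path and name)
    cd ~/lhs
    make lhs              # run the LHS
    qstat -u pwolberg     # check the status of the submitted jobs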
Some problems occasionally occur when running jobs on Axiom. If any of these happen, contact the main Axiom system administrator mentioned above.
The Axiom system administrator described the current scheduling behavior as follows:
Right now actual priority is undefined in the scheduling system; instead it operates on a soft cap, which is set to distribute 8 cores to each user (not counting resources requested by users who have purchased resources on the system) before assigning the remaining cores, in excess of the soft cap, to waiting jobs.
By default the scheduler as currently configured assigns jobs first come, first served, which would naturally put your jobs at a disadvantage, but the soft cap mechanism interferes with the normal mechanism for a user's jobs in excess of the cap. This means that when the scheduler re-evaluates the queue it may allow another user to grab all the cores.
I have increased your soft cap to 64 cores for the time being, but as you may have observed the system is experiencing some abnormally high load, so it may not be able to take advantage of that increased cap immediately.
Some Axiom users have priority on some of the compute nodes, since those users paid for those nodes. Because of this, jobs run on those nodes by other users (our jobs, for example) may be preempted by a job from a user with higher priority. In that case the lower-priority job is halted and placed back on the input queue, to be re-run from the start (there is no automatic checkpointing). The system also does not automatically clean up any files created by the preempted job. This means our PBS scripts must be written to take preemption into account: the directory on the head node file system that receives job result files must be emptied before copying files from a compute node's scratch disk space to that destination directory.
Note also that because of preemption there may be more job result log files than expected, which can also keep the job counts reported by job status scripts from adding up correctly.
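As noted above, the copy-back step of a PBS script has to empty the destination directory first, so that a preempted and re-run job does not mix its output with leftovers from the earlier attempt. Here is a minimal sketch of that step, assuming results are staged in node-local scratch space; the directory names are placeholders, not those used by our actual run scripts.

    # Copy results from compute node scratch space back to the head node.
    SCRATCH=/scratch/$USER/$PBS_JOBID     # node-local scratch area (placeholder path)
    DEST=$HOME/lhs/results/run1           # destination on the head node file system (placeholder path)

    # If this job was preempted and restarted, DEST may contain partial
    # results from the earlier attempt, so empty it before copying.
    rm -rf "$DEST"
    mkdir -p "$DEST"
    cp -r "$SCRATCH"/. "$DEST"/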
Unlike the Flux cluster, Axiom does not use modules to manage dependencies on specific versions of libraries and tools. This can cause some problems with the Boost library and the g++ compiler. The default g++ compiler is 4.1.2; on Axiom, g++ version 4.4 is available as g++44 and g++ version 4.6 is available as g++46. Because library versions differ between systems, it is not possible, for example, to build a version of a model on one of our Linux systems and then run it on Axiom: the executable will expect a particular version of the Boost library, and the version on our systems is typically different from the one on Axiom. The executable will quit immediately with an error message stating the Boost library version found and the version expected. This can happen with any library used by a model, not just Boost. In any case, when running on Axiom, or any cluster, it is better to build on the cluster head node.
The g++ "-march=native" command line option should not be used on Axiom. When building on the head node, that option generates object code for the specific processor type of the head node, which is not the same as the processors on the compute nodes. For example, the head node processors support the ssse3, sse4.1, and sse4.2 vector instructions, whereas the compute nodes only support sse and sse2.
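A hedged example of a build command that follows both recommendations, using the newer compiler explicitly on the head node and targeting only the instruction sets the compute nodes support; the source file name is a placeholder, and in practice the build is driven by the model's Makefile.

    g++44 -O2 -msse2 -o model model.cpp   # no -march=native; sse2 is safe on the compute nodes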
Flux is the UM campus-wide cluster, managed by CAC (the Center for Advanced Computing). CAC also manages the Nyx cluster. The difference is that Flux usage is available on a fee basis, whereas Nyx is free; Nyx typically has long wait times in its batch queue. We do not yet use Flux because we have access to the med school Axiom cluster, which is free for us since we are part of the med school. Flux uses PBS. It also uses modules for managing dependencies on specific versions of libraries and tools. See the Flux web site for more information on using Flux and the fee structure.
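On a cluster that uses modules, the typical commands look like the following. The exact module names vary by site and over time, so check what is actually installed before loading anything.

    module avail          # list the software modules available on the cluster
    module load gcc       # load a particular compiler version
    module load boost     # load a Boost build that matches that compiler
    module list           # show which modules are currently loaded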
A 4 CPU, 8 core desktop system. It is used for small compute runs, such as small LHS runs, and as a development system. It runs Ubuntu 10.04.
innoculant is a computer in the Kirschner lab that is used to run some server applications. It has a file sharing web server at http://innoculant.micro.med.umich.edu/ftp2/. Files can be uploaded to this server for viewing (such as this documentation) or for download. This avoids various issues with transferring files via e-mail, such as file size and file format restrictions. The file sharing server requires authentication: it will ask for an ID and password when a page is accessed for the first time in a session. The authentication remains valid for an extended period of time; when it times out the server will ask for authentication again on the next page access.
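Files on the server can also be fetched from the command line, for example with wget, which will prompt for the password; the file path below is a placeholder.

    wget --user=YOUR_ID --ask-password http://innoculant.micro.med.umich.edu/ftp2/somefile.pdf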
innoculant also hosts the Subversion source archive server at svn://innoculant.micro.med.umich.edu/dev.
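A typical checkout from the archive looks like the following; the project name is a placeholder for whichever project you are working on.

    svn checkout svn://innoculant.micro.med.umich.edu/dev/someproject
    cd someproject
    svn update                              # bring the working copy up to date
    svn commit -m "Describe the change"     # commit local changes back to the archive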
A 4 CPU, 8 core desktop system. It is used for small compute runs, such as small LHS runs, and as a development system. It runs Ubuntu 10.04.
necrosis is the model run server. It runs a web server that allows access to the results of model runs, at http://necrosis.micro.med.umich.edu/ftp2/. It is also used for post-processing of model runs to produce run reports. This typically requires access to a graphical interface, either by being physically at the machine or by using a remote desktop application like VNC (which has not been set up on necrosis).
snow0 is a 32-bit desktop system in the Kirschner lab running Ubuntu Linux 10.04. It can be used for post-processing model run results if necrosis is not available. It can also be used as the primary desktop for grad students or postdocs in the lab.
Teragrid is a US national resource for high-end scientific computing. It provides access to high performance computing (clusters, vector computers, etc.), large data storage facilities, and various software and database resources. Here is the Teragrid official web site.
Teragrid resources are free, but users must apply for a Teragrid allocation. The UM CAC (Center for Advanced Computing) is the campus resource for Teragrid. They can help with applying for a Teragrid allocation, and the CAC Teragrid web page has more information about applying for an allocation and using it once granted. We are not yet making use of Teragrid resources.
Teragrid is being phased out and is to be replaced by a new initiative called XSEDE (eXtreme Science and Engineering Discovery Environment).
XSEDE is the successor to Teragrid. It is a US national resource for high-end scientific computing, providing access to high performance computing (clusters, vector computers, etc.), large data storage facilities, and various software and database resources. Here is the XSEDE official web site. We are not yet making use of XSEDE resources.