Biociphers Compute Hardware

The Biociphers lab has a number of compute resources which can be used at no additional cost to the lab. When capacity is available, please use these before turning to other resources such as the PMACS HPC or the Biociphers Oracle cluster.

Brief description:

Jordi1, 2, 3, 4: The newest, highest capacity machines. Nvidia Tesla P100 GPU compute is available on jordi1 and jordi2.

Naxos1, 2, 3, 4: The previous-generation machines, which can be used as overflow if Jordi is at capacity. They are still formidable, but less capable than Jordi.

Delos: An archive-only machine used to store data from completed or inactive projects.

Oracle: Burst compute which can be used on demand in the event that Jordi and Naxos are not adequate. (See Oracle HPC.)

General Workflow and Tips

All of the Jordi and Naxos machines are shared by all Biociphers lab members. Access to CPUs/GPUs is on a quasi "first come, first served" basis. The usual unofficial procedure is:

  1. Connect to the VPN and go to http://jordi0/ganglia in your web browser. This gives an overview of how many CPUs are in use on each of the systems.

  2. If you need one machine or less, and/or your job will run for only a few hours or less, just pick a machine, log in, and run it.

  3. If you are planning a longer / more extensive run, it's best to post a message in the "Requests" campfire and wait half an hour or so for any objections before taking the resources.

Avoid running jobs that use more cores than are available. For example, Jordi1 has 112 cores. If someone else is already running a job which is using about 60 cores, there are only 52 cores left. Do not run a job that uses more than 52 cores, as this will end up slowing down both your job and everyone else's jobs.
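The core budget described above is simple arithmetic, and can be sketched in Python. This is only an illustration, not a lab tool: the function names free_cores and safe_job_size are hypothetical, and treating the 1-minute load average as "cores in use" is an assumption. Ganglia remains the reference view.

```python
import math
import os


def free_cores(total_cores: int, cores_in_use: float) -> int:
    """Return how many cores are left, never less than zero."""
    # Round usage up so the estimate errs on the side of leaving room.
    return max(0, total_cores - math.ceil(cores_in_use))


def safe_job_size(requested: int, total_cores: int, cores_in_use: float) -> int:
    """Cap a requested core count at the number of cores still free."""
    return min(requested, free_cores(total_cores, cores_in_use))


if __name__ == "__main__":
    # Assumption: the 1-minute load average approximates the number of busy cores.
    total = os.cpu_count() or 1
    busy, _, _ = os.getloadavg()
    print(f"{total} cores total, about {busy:.0f} busy, {free_cores(total, busy)} free")
```

With the numbers from the example, free_cores(112, 60) gives 52, so a request for 80 cores would be capped at 52 by safe_job_size(80, 112, 60).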

Above is an example of a section of the ganglia web page. Here you can quickly see that there are about 40 cores used on jordi1 and about 60 cores used on jordi4, while jordi2 and jordi3 are mostly free to use.

System software tips

Installing software that exactly suits the needs of all lab members is nontrivial. Therefore, there is a mix of approaches:

  • Per-system software is installed on the operating system disk using the Ubuntu package manager. This software is often at least a few years behind the current versions, but is easy and quick to install. It's also needed for root-user packages like Docker.

  • Shared-compiled software is something the system administrator can install in /data/opt for all users to use.

  • User software can be installed by users themselves, right in their own home directories.

If you would like to test new software, and it's not too painful to compile, consider installing it in your own home directory first. If there is interest from more than one person, ask the system administrator to install it using one of the first two methods.

Also of note: Many common datasets, databases, annotations, etc. are available in /data/shared. If you find a new annotation that should be added, please let the system administrator know.

Optimizing I/O performance on Jordi*

A lot of the software that we use in the lab needs to read or write large amounts of data from or to disk. On network disks, reading huge amounts of data, or a huge number of small pieces of data at once, is often pretty slow (even on the high speed connections). While your software is running, you can check its status with $ glances or $ top . If you notice it's not using 100% CPU and is often showing a state of 'D' instead of 'R', it may be slowed down by an I/O bottleneck.
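The 'D' versus 'R' check can also be done from a script. The following is a minimal sketch, assuming a Linux system where /proc/<pid>/status contains a "State:" line. The function names and the 50% threshold are hypothetical choices for illustration, not an established rule.

```python
import time
from pathlib import Path


def parse_state(status_text: str) -> str:
    """Extract the one-letter state code from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            # The line looks like "State:  D (disk sleep)"; the code is the second field.
            return line.split()[1]
    raise ValueError("no State line found")


def sample_states(pid: int, samples: int = 10, interval: float = 1.0) -> list[str]:
    """Read the state of process `pid` several times, `interval` seconds apart."""
    states = []
    for _ in range(samples):
        states.append(parse_state(Path(f"/proc/{pid}/status").read_text()))
        time.sleep(interval)
    return states


def looks_io_bound(states: list[str], threshold: float = 0.5) -> bool:
    """Return True if at least `threshold` of the sampled states are 'D'."""
    # Assumption: a 50% share of 'D' samples is an arbitrary illustrative cutoff.
    if not states:
        return False
    return states.count("D") / len(states) >= threshold
```

For a job with process ID pid, looks_io_bound(sample_states(pid)) gives a rough hint. Multi-threaded programs may need a per-thread look in glances or top.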

There are two direct ways to speed up jobs in these cases: Use /scratch or use /dev/shm.

  • /dev/shm is a directory which is actually a RAM disk, which means that everything put here resides in the system's main memory instead of on a disk. Because the jordi machines each have 512GB of memory in total, you will usually have a good ~250GB or so that can be safely used here (check with $ free -h or $ top ; it also depends on how much memory your program itself uses!). This storage is super fast, but of course it's limited, and you should not store anything of value here long term.

  • /scratch is a single SSD which is local to each individual jordi machine. This means it doesn't have the bottleneck of traveling over the network, and doesn't share bandwidth with the network disks. It has a total size of 1TB on each jordi.
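Before copying a large input to either location, it can help to check how much room is left. A minimal sketch using Python's shutil.disk_usage follows; the helper names free_gb and fits are hypothetical. Note that the figure reported for /dev/shm is the RAM disk's size limit, not a guarantee that the memory is actually free, so $ free -h is still worth checking.

```python
import shutil


def free_gb(path: str) -> float:
    """Return the free space at `path` in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def fits(path: str, needed_gb: float, reserve_gb: float = 0.0) -> bool:
    """Check that `needed_gb` fits at `path` while leaving `reserve_gb` untouched."""
    return free_gb(path) - reserve_gb >= needed_gb


if __name__ == "__main__":
    for location in ("/dev/shm", "/scratch"):
        try:
            print(f"{location}: {free_gb(location):.0f} GB free")
        except FileNotFoundError:
            print(f"{location}: not present on this machine")
```

The reserve_gb argument is there so you can leave headroom for other users and for your own program's memory.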

In both cases, to optimize bandwidth and avoid data loss, I would recommend copying the required inputs from /data to these locations, and then running your software so that it reads from the fast location and writes back to /data. This way, if anything in the fast location is ever deleted, nothing important is lost.

Please be kind and remember to delete all of the files you've placed in these locations when your runs are complete.
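The copy, run, and clean up pattern can be wrapped in a small helper. This is a sketch under stated assumptions, not an official lab script: run_staged and the job callback are hypothetical names, and input files are assumed to have distinct file names.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_staged(inputs: list[Path], fast_root: Path, output_dir: Path,
               job: Callable[[list[Path], Path], None]) -> None:
    """Copy inputs to fast storage, run `job`, and always clean up afterwards.

    `job(staged_inputs, output_dir)` reads the staged copies and writes its
    results directly to `output_dir` (for example, somewhere under /data).
    """
    # A private working directory avoids clashing with other users' files.
    workdir = Path(tempfile.mkdtemp(prefix="stage_", dir=fast_root))
    try:
        # copy2 returns the path of each new copy inside workdir.
        staged = [Path(shutil.copy2(src, workdir)) for src in inputs]
        output_dir.mkdir(parents=True, exist_ok=True)
        job(staged, output_dir)
    finally:
        # Remove the staged copies even if the job failed.
        shutil.rmtree(workdir, ignore_errors=True)
```

Here fast_root would be Path("/scratch") or Path("/dev/shm"). Because cleanup happens in the finally block, the staged files are removed whether or not the job succeeds.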

Note: please do not use "/tmp" for any large data sets, as this disk is primarily used by the operating system and is small.

Detailed Description of systems:

Network diagram:

Disks and storage devices:

jordi*:/data 62TB of SSD space. Networked with high speed connection to all Jordi machines, and slower speed to Naxos and Delos

naxos1:/data_naxos1 11TB of HDD space. Will work fastest on naxos1 itself, but it's also networked at medium speed to all other naxos and jordi machines

naxos2, naxos3, and naxos4 all have their own local 11TB disks which work much in the same manner.

The jordi* machines share one /home directory, and the naxos* machines share another. This means that any configuration or software you install in your home directory on naxos1 is automatically available on naxos2, naxos3, and naxos4, and likewise across jordi1, 2, 3, and 4. (However, naxos* home directories are not shared with jordi* home directories.)

delos:/Volumes/data 9TB of HDD space

delos:/Volumes/data3 9TB of HDD space

delos:/Volumes/data4 55TB of HDD space

delos:/Volumes/data5 35TB of HDD space

Google Drive backup: 214TB of archive space used (lol)

System capabilities:

jordi0: 12 CPU cores, 64GB memory

jordi1: 112 CPU cores, 512GB memory, 4x Tesla P100 GPU

jordi2: 112 CPU cores, 512GB memory, 4x Tesla P100 GPU

jordi3: 112 CPU cores, 512GB memory

jordi4: 112 CPU cores, 512GB memory

Note that jordi0 should be used only for long-running compression or rsync / transfer tasks, _not_ general purpose compute jobs.

naxos1: 32 CPU cores, 64GB memory, GeForce GTX TITAN X

naxos2: 32 CPU cores, 64GB memory, GeForce GTX 1080

naxos3: 32 CPU cores, 64GB memory, GeForce RTX 2070

naxos4: 32 CPU cores, 64GB memory, GeForce GTX 1080

delos: 8 CPU cores, 16GB memory