Biociphers Compute Hardware

The Biociphers lab has a number of compute resources that can be used at no additional cost to the lab. When capacity is available, please use these machines before turning to other resources such as the PMACS HPC or the Biociphers Oracle cluster.

Brief description:

Jordi1, 2: General purpose compute machines + Tesla P100 GPUs available

Jordi3, 4: General purpose compute machines

Jordi5, 6: GPU intensive machines with Tesla A40s available

Jordi7: Intensive compute / RAM usage machine for demanding software

Jordi0: Storage server, file transfer server

Naxos1, 2, 4: Legacy machines that can be used as overflow if Jordi is at capacity. Assorted GPUs are available for debugging.

Naxos3: Repurposed legacy machine now used for long-term archiving of data from completed or inactive projects.

Milos: Backup host for mistake or disaster recovery. (Located in Smilow)

Paros: PMACS-managed web host that serves public-facing Voila instances for sharing with collaborators.

Minos: PMACS-managed web host for the public-facing Majiq websites and shared data.

Accessing the systems

The main Jordi machines are located behind the PMACS university VPN. Please check here for general setup instructions. If you have any trouble with this process, please reach out to the system administrator to have a ticket opened. The machine hostnames are:

The various other machines described above (Naxos, Milos, Delos) are located on the PMACS VPN instead. You probably won't need to access them directly very often, but if you do, check here for setup instructions.


You may use any ssh software to log into the systems, and any SFTP software to transfer files between your local machine and the systems.

On Linux and macOS, both ssh and sftp are usually built into the operating system. ssh runs from the terminal, and SFTP can be used by "opening a location" in Dolphin, Nautilus, Finder, or whatever file manager you use.

On Windows, ssh is available through a number of tools. PuTTY is a small, popular tool, though it requires converting the key file into its own format. Other popular options include Git Bash (which has ssh built in), Cygwin, or simply WSL. For SFTP, the free software FileZilla works well.
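As a rough sketch, an entry in ~/.ssh/config makes logging in and transferring files less tedious. The host alias, hostname, username, and key path below are placeholders; substitute the values for your own account and the machine you want to reach.

# Example ~/.ssh/config entry (alias, hostname, user, and key path are placeholders)
Host jordi-example
    HostName <machine hostname>
    User <your username>
    IdentityFile ~/.ssh/id_ed25519

# With the entry in place, logging in and transferring files is simply:
$ ssh jordi-example
$ sftp jordi-example
$ scp results.tar.gz jordi-example:/data/<your directory>/

In a graphical file manager, the equivalent "location" to open would look like sftp://<your username>@<machine hostname>/data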

General Workflow and Tips

All of the Jordi and Naxos machines are shared by all Biociphers lab members. Access to CPUs/GPUs is on a quasi "first come, first served" basis. The usual unofficial procedure is:

Avoid running jobs such that more cores are used than are available. For example, Jordi1 has 112 cores. If someone else is already running a job that is using about 60 cores, there are only 52 cores left. Do not run a job that uses more than 52 cores, as this will slow down both your job and everyone else's. Current usage can be checked at a glance on the lab's netdata web page.
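As a quick sketch, the same check can be done from the command line on the machine itself before launching anything:

# How many cores this machine has
$ nproc

# Load averages; the first number is roughly how many cores are busy right now
$ uptime

# Interactive per-process view of CPU and memory usage
$ top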

Above is an example of a section of the netdata web page. Here you can quickly see that jordi3 is heavily used while the remaining machines are mostly available.

Above is another example of a section of the netdata web page. In order to check GPU usage you need to select or scroll down to the "nvidia-smi" section. In this case you can see that jordi2 is heavily used.
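The same information is available directly on the GPU machines from the command line:

# Per-GPU memory usage, utilization, and the processes using each card
$ nvidia-smi

# Pin a job to a specific free GPU (e.g. GPU 2); most CUDA-based tools respect this
# variable (train.py here is just a placeholder for your own program)
$ CUDA_VISIBLE_DEVICES=2 python train.py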

System software tips

Installing software that exactly suits the needs of all lab members is non-trivial, so a mix of approaches is used.

If you would like to test a new piece of software and it's not too painful to compile, consider installing it in your own home directory first. If there is interest from more than one person, ask the system administrator to install it using one of the two methods described above.
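As a rough sketch of a home-directory install, assuming a typical autotools-style package or a Python tool (the package names here are placeholders):

# Compile and install under ~/.local instead of the system binary locations
$ ./configure --prefix=$HOME/.local
$ make -j 4 && make install

# Python tools can similarly be installed per-user
$ pip install --user some-tool

# Make sure ~/.local/bin is on your PATH (e.g. add this to ~/.bashrc)
$ export PATH="$HOME/.local/bin:$PATH"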


Also of note: many common datasets, databases, annotations, etc. are available in /data/shared. If you find a new annotation that should be added, please let the system administrator know (or just move it there yourself).

System directories

/data/opt -> custom-compiled software that is not installed in the system binary locations.

/data/shared -> common data files of general use, such as FASTA sequences and annotations from GENCODE or Ensembl.

Optimizing I/O performance on Jordi*

A lot of the software we use in the lab needs to read or write large amounts of data to/from disk. When using the network disks, reading huge amounts of data, or a huge number of small files, all at once is often quite slow, even over the high-speed connections. While a program is running, you can check its status with $ glances or $ top . If you notice it is not using 100% CPU and frequently shows a state of 'D' instead of 'R', it may be slowed down by an I/O bottleneck.
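For example, one way to spot this from the command line (the program name below is a placeholder):

# STAT column: 'R' = running on CPU, 'D' = uninterruptible sleep, usually waiting on disk I/O
$ ps -o pid,stat,pcpu,comm -C my_program

# The "wa" (iowait) value in top's CPU summary line tells a similar story machine-wide
$ top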

There are two direct ways to speed up jobs in these cases: use /scratch or use /dev/shm.

-/dev/shm is a directory backed by a RAM disk, which means everything placed there resides in the system's main memory rather than on a disk. Because the jordi machines each have a fairly large amount of memory, there is usually a good amount of space that can safely be used here (check with $ free -h or $ top ; it also depends on how much memory your program itself uses). This storage is extremely fast, but of course it is limited, and you should not keep anything of value here long term, as it is extremely volatile.

-/scratch is a single SSD local to each individual jordi machine. This means it avoids the bottleneck of traveling over the network and does not share bandwidth with the network disks. Please note it is automatically cleared at intervals (with an announcement first).

In both cases, to optimize bandwidth and avoid data loss, I would recommend copying the required inputs from /data to one of these locations, then running your software so that it reads from the fast location and writes its results back to /data. This way, if anything here is ever deleted, all of the important information is still safe.

Please be kind and remember to delete all of the files you've placed in these locations when your runs are complete. 
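Putting that together, a minimal sketch of the copy-in / compute / copy-out pattern (the paths and program name are placeholders, and the same pattern works with /scratch in place of /dev/shm):

# Check how much memory is free before staging anything into /dev/shm
$ free -h

# Stage inputs onto the fast local storage
$ mkdir -p /dev/shm/$USER/run1
$ cp /data/myproject/inputs/* /dev/shm/$USER/run1/

# Run the job reading from the fast copy and writing results back to /data
$ my_program --input /dev/shm/$USER/run1 --output /data/myproject/results/

# Clean up once the run is complete
$ rm -rf /dev/shm/$USER/run1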

Note: please do not use /tmp for any large datasets, as this disk is primarily used by the operating system and is small.

Detailed Description of systems:

Network diagram:

Disks and storage devices:

jordi*:/data 62TB of SSD space. Networked with a high-speed connection to all Jordi machines, and a slower connection to Naxos and Delos.

naxos1:/data_naxos1 11TB of HDD space. It is fastest on naxos1 itself, but it is also networked at medium speed to all other naxos and jordi machines.

naxos2, naxos3, and naxos4 each have their own local 11TB disks, which work in much the same way.

jordi* and naxos* each have a shared /home directory. This means that any configuration or software you install in your home directory on naxos1 automatically applies on naxos2, naxos3, etc., and the same holds across the Jordi machines. (However, the naxos* home directories are not shared with the jordi* home directories.)


Archive machine space:

milos:/Volumes/data4 55TB of HDD space

milos:/Volumes/data5 73TB of HDD space

milos:/Volumes/data6 5TB of HDD space

System capabilities:


jordi0: 12 CPU cores, 64 GB memory

jordi1: 112 CPU cores, 512 GB memory, 4x Tesla P100 GPUs

jordi2: 112 CPU cores, 512 GB memory, 4x Tesla P100 GPUs

jordi3: 112 CPU cores, 512 GB memory

jordi4: 112 CPU cores, 512 GB memory

jordi5: 128 CPU cores, 128 GB memory, 4x Tesla A40 GPUs

jordi6: 128 CPU cores, 128 GB memory, 4x Tesla A40 GPUs

jordi7: 256 CPU cores, 2048 GB memory

Notes:

-jordi0 should be used only for long-running compression or rsync / transfer tasks, _not_ general-purpose compute jobs (see the sketch after these notes).

-It is possible to run GPU and CPU jobs concurrently on the machines with GPUs; however, take care to leave enough CPUs free to "drive" the GPU jobs and avoid thrashing with them.

-jordi7 should be prioritized for high-memory jobs. 
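For the transfer tasks mentioned above, a common pattern is to run rsync inside a screen (or tmux) session on jordi0 so the copy survives a dropped connection. This is just a sketch: the source and destination paths are placeholders, with the destination taken from the archive volumes listed below.

# Start a persistent session so the transfer survives a lost connection or logout
$ screen -S transfer

# Copy with progress reporting; rsync can safely be re-run to resume an interrupted transfer
$ rsync -avh --progress /data/myproject/ milos:/Volumes/data4/myproject/

# Detach with Ctrl-a d, log out, and reattach later with:
$ screen -r transfer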


naxos1: 32 CPU cores, 64 GB memory, GeForce GTX TITAN X

naxos2: 32 CPU cores, 64 GB memory, GeForce GTX 1080

naxos3: 32 CPU cores, 64 GB memory, GeForce RTX 2070

naxos4: 32 CPU cores, 64 GB memory, GeForce GTX 1080


minos: 8 CPU cores, 16 GB memory

delos: 8 CPU cores, 16 GB memory