San's Linux Shell and RNA processing introduction

Section 1 - Linux shell crash course and the very basics

The "Linux shell", often tied very closely with the programming language "Bash", so you might hear of them interchangeably. It is a very powerful text based interface which allows fully configuring and running programs on the system. Using the shell is a very useful skill to learn, which you will find all over, from high performance computers like we will discuss here, to the vast majority of the web servers that back the internet, to raspberry pi's, routers, as well as a number of desktops and laptops. On Mac OSX, you also often have access to the programming language Bash in the terminal application. I personally don't have as much experience with mac OSX, but from what I have seen, there is a good amount of overlap between the programs you can run in Linux and mac OSX. However, there are some small differences that you may have to adapt to.


Why, in 2020, is it still standard to use this cryptic-looking, text-based interface as the primary interaction method with most of our most powerful computers? There are a number of reasons:


- Efficiency of not needing to waste system resources (memory, CPU time) on rendering a graphical desktop

- Efficiency of remotely using the system. Using a remote desktop protocol like VNC or Windows RDP requires a fast network connection and can often be laggy and annoying to use. Connecting to a remote shell over SSH, you will barely notice any latency.

- Creating programs: it is easier to quickly write additional features behind a simple text-based argument interface than to create a graphical representation for each one.

- Most importantly, it is extremely easy to automatically run many programs, and chain them together to automate repetitive tasks and save time. This is what we would like to build up during our time together.



Our first tasks will be learning some of the most basic commands:

-log in with SSH

-edit a text file with your preferred terminal text editor

-list a directory, create a directory, move a file, copy a file, delete a file

-write and read a file with input/output tools like head, tail, cat, grep and shell pipes / redirection


Please note that there are a number of well written tutorials which go along with these suggestions. I previously mentioned this page: http://www.ee.surrey.ac.uk/Teaching/Unix/  ; I would highly recommend that you follow along and read that material as well. It is very concise and covers quite a lot of good content.


- Log in with SSH

To my knowledge, I already went over this individually with each of you. We use SSH to open a shell session on a remote computer. Here are a few general tips to keep in mind after logging into naxos1 with $ ssh <username>@naxos1  ; (note that any time I use the symbol "$" here, it simply means "the text that follows is meant to be run in the shell". It is a shorthand you will commonly see)


- you can 'disconnect' from the current ssh session by typing CTRL+D

- you can stop the current process by typing CTRL+C. You may run something by accident which will take a long time (for example, searching for a file over the entire system); press CTRL+C to stop it. You will use this ALL THE TIME. Sometimes, for complicated programs, one 'interrupt' (press of CTRL+C) will not be enough, and you may need to press it a few more times.

- You can press the "up arrow" key to go to the previous command, edit it, and run it again by pressing enter (instead of typing the whole thing again). This is again something you will use all the time. You can press the up arrow multiple times to go back through multiple commands.


- edit a text file with your preferred terminal text editor

You will often need to edit files of data or configuration. There are a number of text editors you can use. The most popular "simple" editor is called "nano". There are also more advanced editors; popular options are "emacs" and "vim". The choice is up to you. I would not recommend spending too much time in the beginning learning these editors in depth. As long as you can figure out how to open a file, write some text, save the file, and exit, you should be fine for the majority of our use cases. I would recommend you try a few of these programs, or read up on them, and check which one you like the most. If you would like me to install another option you find, let me know and I will check it out. Please try to add some text and save a file, and make sure you can open the same file again and see your text.
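
For example, with nano (the key bindings mentioned here are the standard nano defaults):


$ nano notes.txt    # open (or create) the file in the editor

(type some text, then: CTRL+O followed by enter to save the file, CTRL+X to exit)

$ nano notes.txt    # open it again and confirm your text is still there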


- list a directory, create a directory, move a file, copy a file, delete a file


These are simple tasks you are most likely used to doing at this point with a GUI based tool like Windows Explorer or the macOS Finder. This is just a quick overview.


- show which directory you are currently in by typing $ pwd ; After you log in, this will be your "home directory" which should say "/home/<username>"

- list the current directory by typing $ ls ; At this point, you should be able to see the name of the file you created in the last step with the text editor. You can use $ ls -la to show hidden files, as well as much more information about the size and permissions of the files and folders.

- make a new folder / directory (these words are interchangeable): $ mkdir new_folder ; Running 'ls' now should show the new folder.

- $ mv <source> <dest> ; can be used to rename a file / folder or place it in a different directory. You can for example move your new file to your new folder: $ mv <name of new file> new_folder. Now $ ls ; should not show the new file, but $ ls new_folder ; should show the file.

- $ cp <source> <dest> ; similar to mv, but duplicates the data. For example: $ cp new_folder/<name of new file> <some other name> ; Now the file exists both inside the folder and in your home directory.

- $ rm <some other name> ; will delete the copied file. Note that this will happen without any confirmation. Be careful, there is no way to undo this action! (A short worked session using all of these commands follows below.)
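
Putting those commands together, a worked session might look like this (assuming the file from the editor exercise is called "notes.txt"):


$ pwd
/home/<username>
$ ls
notes.txt
$ mkdir new_folder
$ mv notes.txt new_folder
$ ls new_folder
notes.txt
$ cp new_folder/notes.txt notes_copy.txt
$ rm notes_copy.txt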


Notes:

- for cp or mv, the <dest> will be COMPLETELY OVERWRITTEN if it exists already, without any warning. Be careful not to overwrite files you care about!

- for cp and rm, you will need to add the "-r" flag, like cp -r <source> <dest> , if you would like to copy or remove an entire directory and its children (-r is for recursive)

- all file and folder names are case sensitive

- it is possible to put spaces in file and folder names, but this requires quoting the whole name or using escape characters, which is generally more annoying, so most people will just use underscores instead of spaces

- you can tab complete when typing file or folder names. This is extremely useful when you can't remember the full name of something while you are trying to type it. When you go to move something with 'mv', you can type 'mv ' and then tap the TAB key a few times. This will show the possible files in the folder. You can start typing the first few letters of one of them, then press TAB again, which will complete the rest of the name. This will save you a lot of time and frustration.


-write and read a file with input/output tools like head, tail, cat, grep and shell pipes / redirection


Simple input and output tools are useful for processing text files. Use your text editor to make a file "testfile.txt" with text like this:


1) Line one

2) Line two

3) Line three

4) Line four

5) Line five

6) Line six

7) Line seven

8) Line eight

9) Line nine

10) Line ten


- To quickly dump a whole file to the output, use $ cat testfile.txt ; you should see all of the text you wrote dumped to the output. I use this all the time with small files.

- Sometimes you only care about the part of a file near the beginning: $ head -3 testfile.txt , for example, prints the first three lines.

- The reverse, and usually more common, case is that you care about the end of a file (the last actions in a log, for example). You can just use $ tail -3 testfile.txt ; in this case.

- Maybe you would like to search for certain text in a file. Here I will introduce a simple use of shell pipes and redirection: '|', '>', '>>'. We will first combine "cat" and "grep". Cat reads the whole file, like we saw before, and grep searches through the data it is fed, and outputs the lines that match.


- $ cat testfile.txt | grep 5


Here, we read the file, then the pipe symbol '|' is used to tell the shell to pass the data from the first process (cat) to the second process (grep). Then we just tell grep to search for '5'. You should get one line back. We can take this a step further. Perhaps you would like to save the output of your filter:


- $ cat testfile.txt | grep 'e t' > found.txt


Here we do the same cat and search as before, but then we use the redirection operator '>', which takes the output and writes it to a file. You can chain an arbitrary number of commands together like this, which is one reason the shell is so powerful.


- You can use '>>' instead of '>' to append to the output file, instead of overwriting it.

- Instead of using normal text as an argument to grep, you can use a language called "regular expressions" to find specific patterns of text that you are interested in. This is a complicated subject, but you might do well to take some time checking out the basics of it (a small example follows below). https://regexr.com/
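
For instance, the regular expression '^1' matches lines that begin with the character '1' (the '^' anchors the match to the start of the line). Note that grep can also read a file directly, without cat; both styles are common:


$ grep '^1' testfile.txt
1) Line one
10) Line ten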



Section 2 - Screen and preprocessing RNA data


Prerequisite: learning to use GNU "Screen". Some of the tasks we are going to do are computationally intensive, and can therefore take a long time to finish running. When you are using a remote machine, through a VPN, across multiple ISPs, over a long distance, there is always a chance that you might lose your internet connection. If you are running a process that takes an hour or more to complete, this can be very frustrating. Some SSH clients are able to reconnect after brief disconnects, but in general, if you lose the connection, the process you were running will stop or have undefined behavior.


There are a few ways to mitigate this problem, but one of the most universal methods on Linux systems is to use a program called "screen". You can use this program for a lot of advanced functionality, but the part we care about is being able to set up a shell session, run all of our long commands in there, and reconnect to it in case we get disconnected from the internet or VPN for some reason. You can read a more comprehensive introduction here: https://opensource.com/article/17/3/introduction-gnu-screen ; but for a quickstart I will just describe the extreme basics:


-Log into naxos1 with SSH

-type the command $ screen

(you will get some details about the program, just press the enter key)

Now it will look the same as when you started, but you are in a screen session, and protected from disconnects! If you get disconnected, your SSH session may freeze, and you will need to restart the connection. Once you are reconnected, you can get back to your session by typing $ screen -r ; It should flip you back to where you were. If you would like to 'manually' disconnect from screen, so that you leave the session running but back out 'one level' to a 'regular ssh session', you must type the following sequence:


first press CTRL+A

take your hand off the keyboard

then press 'd'


And you will flip back to the regular ssh session.


If you would like to simply end/terminate the screen session, when you don't need it anymore, you can press CTRL+D. It should flip back to ssh and say something like 'screen session terminated'.


Note that the CTRL+A 'escape', followed by some other key, is used for a bunch of different tasks in screen. I figure this should be a good introduction though (a quick recap follows below). I would highly recommend running all of the tasks in this exercise in a 'screen' session.
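
To recap, a minimal screen cheat sheet (screen -ls is one extra command beyond what I described above; it simply lists your open sessions):


$ screen       # start a new session (press enter at the intro message)
$ screen -r    # reattach to your session, e.g. after a disconnect
$ screen -ls   # list your running sessions

CTRL+A then d  - detach manually, leaving the session running
CTRL+D         - terminate the session for good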



I don't find prefabricated exercises super exciting, so I am going to propose some tasks which are actually used to accomplish a purpose. I expect this will be a little more difficult, and I am expecting many questions to come up. The purpose of this exercise, beyond simply becoming more familiar with the shell environment, is to work through a real RNA-seq processing pipeline, from downloading published raw data all the way to visualizing the splicing results.


1 - downloading the published data.


Caleb has a recommendation for some good data which was recently generated by a close collaborator lab at UPenn (Kristen Lynch). The data was last updated on Jun 17, 2020 (very recently). We will start by downloading the raw data for one sample from a publicly hosted repository (ncbi.nlm.nih.gov) to the naxos1 machine. I picked a sample at random from Caleb's suggestions to start with: https://www.ncbi.nlm.nih.gov/sra/?term=SRR9861783


I have run through this part myself, and the fastq-dump step took a very long time. So the choice is up to you: if you would like, you may follow the procedure below; otherwise, I have placed the files in a shared location on the server, and you can proceed directly to step 2.


[the procedure that takes a long time]


-Make a directory called SRR9861783 where you will download the data

-download the data using the 'prefetch' command from the "sra toolkit". This software is already installed and available on naxos1. (https://www.ncbi.nlm.nih.gov/books/NBK242621/) ( prefetch SRR9861783 )

-extract the data with ( fastq-dump SRR9861783 --split-3 --gzip -O ./ ) ; the whole sequence is sketched out below
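
Put together, the whole procedure looks something like this (a sketch; note that depending on the sra-toolkit version and configuration, prefetch may place the downloaded .sra file in a default download location rather than the current directory):


$ mkdir SRR9861783
$ cd SRR9861783
$ prefetch SRR9861783                            # download the .sra archive from NCBI
$ fastq-dump SRR9861783 --split-3 --gzip -O ./   # extract the paired reads as gzipped fastq files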


[if you don't want to take that time, you can use the data from this path on naxos1: /data/apsa_data/SRR9861783_dump ]


2 - trimming / cleaning the data


Now that we have the raw sequences downloaded to the local machine, we will need to 'trim' them. Caleb might be able to give a better technical explanation, but at a high level: there is often common garbage attached by the sequencing machine (such as adapter sequences) which is not part of the real test data, as well as sequences that are too short, and many other cases which can be algorithmically removed to leave mostly the data we care about. There are a few tools for this. On Caleb's recommendation I have decided to use one called "bbduk", which has a good multithreaded mode that allows it to run faster.


In this case, we don't have bbduk installed on the system. You can start typing $ bbdu and press TAB a few times; nothing comes up, so the program does not exist yet. You may encounter this situation often when working with machines on an HPC. You have two options: ask the system administrator to install the program, so everyone on the system may use it, or install it into your own home directory yourself. In cases where multiple users may benefit from the program, you may make a case to the administrator to install it. There are also programs which are extremely annoying to install yourself, because they have many dependencies. You may also be able to beg for administrator intervention in this case :). However, in general, when you want to try a new tool, it really helps to know how to install something for yourself, as if nothing else it can take some time for an administrator to respond to your request.


For bbduk we can search up "bbduk install" and land on this page: https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/installation-guide/

They provide a download link to a 'sourceforge' website to download the tar.gz file. Note that in many cases, you can download files directly to the server in your ssh session by using the "wget" command. In this specific example, though, the sourceforge website makes the direct download link difficult to find, and the file ends up with a weird name when downloaded. So we will use the strategy of downloading to your local computer, and then uploading that file to naxos1. (You are still welcome to try downloading the tar.gz file from sourceforge using wget as a bonus challenge, if you so choose.) To upload the tar.gz file from your local computer to naxos1, a number of options are available. I would recommend something like "FileZilla" on Windows for an easy GUI tool (using an sftp connection). I believe the Finder on macOS can connect directly to sftp servers as well. You should investigate how you would upload the file, and let me know if you run into a roadblock and need help (as each of your systems may be a little different).


Once you have uploaded BBMap_38.86.tar.gz , you will need to 'unzip' it, similar to a zip file on Windows. Look into how to use the "tar" command on Linux to extract this file into a new folder. When that is done, you should have a file among the extracted files called 'bbduk.sh'. You can test that the program runs correctly by changing directory and typing its name. Change directory using the 'cd' command to the folder where the files were extracted. You can check that bbduk.sh is there using 'ls'. To test running it, try the command $ ./bbduk.sh ; Note that when running a program in the current directory, you must ALWAYS prefix it with './' like this. In this case, you should see the program print a bunch of example arguments and suggestions, because you have not given it any commands. This tells you that the program is installed correctly and ready to run. (The other case would be an immediate cryptic error instead, in which case you would need to fix the installation.)
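
If you get stuck on the extraction, a typical tar invocation looks something like this (a sketch; I believe this particular archive extracts into a folder named "bbmap", but verify with ls after extracting):


$ tar -xzf BBMap_38.86.tar.gz   # x = extract, z = the archive is gzip compressed, f = read from this file
$ cd bbmap
$ ls                            # confirm that bbduk.sh is in here
$ ./bbduk.sh                    # test run with no arguments


Assuming that you are still in that bbmap directory, this is the command you can run to start the trimming: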


$ ./bbduk.sh in=/data/apsa_data/SRR9861783_dump/SRR9861783_1.fastq.gz in2=/data/apsa_data/SRR9861783_dump/SRR9861783_2.fastq.gz out=~/SRR9861783_1_val_1.fq.gz out2=~/SRR9861783_2_val_2.fq.gz ref=resources/adapters.fa ktrim=r k=23 mink=11 hdist=1 minlength=35 tpe tbo qtrim=r trimq=20 qin=33 -Xmx10g threads=8


Take note of the way some of the paths to different files are specified here. You can specify paths as "absolute" or "relative". Meaning, every time a file or folder is specified, you can type out the full path from the root of the file system "/", or you can simply type a path relative to your current directory (recall you can double check where you currently are using $ pwd ; ). For example, the arguments to "in" and "in2" here use absolute paths, where I have specified the shared location where I placed the files. You can change these to the location of the files that you processed yourself if you desire. For the "out" and "out2" arguments, you can note that they start with "~". This symbol (tilde) is a shorthand for "my home directory" and saves a bit of typing. For the "ref" argument, we use a relative path, which will look for a "resources" folder in the current directory, and then an "adapters.fa" file inside of that. You can use any combination of absolute and relative paths in your commands. I tend to favor absolute paths in most cases, even though they are very long, because I don't have to worry about my current directory when running commands, but the choice is up to you!
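
To make this concrete, here are three ways of pointing at the same file (with <username> as a placeholder; substitute your own):


$ ls -l /home/<username>/SRR9861783_1_val_1.fq.gz   # absolute path, spelled out from the root "/"
$ ls -l ~/SRR9861783_1_val_1.fq.gz                  # "~" expands to /home/<username>
$ cd ~
$ ls -l SRR9861783_1_val_1.fq.gz                    # relative path, resolved against the current directory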


3 - quality check


We use fastqc to analyze the trimmed data for potential issues before moving on to the alignment step.


fastqc ~/SRR9861783_1_val_1.fq.gz -o ~

fastqc ~/SRR9861783_2_val_2.fq.gz -o ~


4 - Alignment


Alignment is one of the most interesting steps in this process. As I've mentioned many times, Caleb might be able to explain it better... In any case, at a high level as I understand it, we are taking all of the sequences of bases (reads) from the fastq files and comparing them to the reference genome, to try to place each one in the correct position. This process, of course, is not always 100% successful. It is possible that some sequences cannot be placed at all, or that a sequence is not specific enough to be placed in only one location, but could fit in a few different positions. This second issue is called a "multimapped read". In the best case, we are generally looking for a very high percent of "uniquely mapped reads" when aligning.


To actually perform the alignment we will be using a software package called STAR. Different versions of STAR need to cache the reference genome in their own way, so in order to use the cached data we already have available, you need to use a specific version of STAR. For this reason, you will also need to download this software yourself. You will need to find STAR version 2.7.0d. You should be able to download this release using wget like the following:


$ wget https://github.com/alexdobin/STAR/archive/2.7.0d.tar.gz


Use tar to extract the program. You then need to test run the program with no arguments, like before. Dig around in the extracted files to see if you can find the correct STAR executable to run on Linux. When you run it, you should get a message like:


Usage: STAR  [options]... --genomeDir REFERENCE   --readFilesIn R1.fq R2.fq

Spliced Transcripts Alignment to a Reference (c) Alexander Dobin, 2009-2019

...


You can always reach out for help if you need.
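
If you get stuck, the steps might look roughly like this (a sketch: the GitHub archive should extract into a folder named STAR-2.7.0d, and in my experience the prebuilt Linux binaries live somewhere under bin/ , but digging around with ls is part of the exercise):


$ tar -xzf 2.7.0d.tar.gz
$ cd STAR-2.7.0d
$ ls bin/                     # look for a Linux build in here
$ ./bin/Linux_x86_64/STAR     # assumption: one possible location of the prebuilt executable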


To start STAR, use the following command. Note that you will need to replace <path_to_STAR> with the path where you found the correct STAR from above.


<path_to_STAR> --genomeDir /data/STAR_DATA_2.7.0d  --readFilesIn ~/SRR9861783_1_val_1.fq.gz ~/SRR9861783_2_val_2.fq.gz --runThreadN 8 --outSAMtype BAM Unsorted --outFileNamePrefix ./SRR9861783. --outSAMattributes All --alignSJoverhangMin 8 --readFilesCommand zcat --outSAMunmapped Within


When STAR finishes running, you will get an uninteresting message that the program has completed. You can check how well the processing went by reading the file ending in "final.out" in the run directory. You can use a pager program, such as "more", "less" or "most" to read the output file, or just good old $ cat will also work. When you have gotten this far, let us know how good the data seems to be. What are the percentages of uniquely mapped reads and multimapped reads? Does this sound reasonable to you? What happens if you try to align the UNTRIMMED data? How much worse does this make things?


Our next exercises will involve using the BAM files generated here in a MAJIQ PSI analysis, followed by a visualization of the splicing!

Section 3 - Majiq and Voila


Ok, I kind of lied, there are a few steps still needed before the majiq run, but I wanted to end the last section on the STAR output as a good place to stop. :)


- Sorting and creating the .bam.bai index


This is fairly straightforward, just a few more long processes to run. In order to efficiently use the very large BAM files, without scanning through all of the data to find specific areas of the genome and how the reads mapped to them, we make an index of the file, which gets stored as a separate .bam.bai file. Majiq (and many other programs which work with BAMs) needs both the .bam and the .bam.bai file to run. You can create the sorted file and its index using these two commands for the experiment:


samtools sort -@ 8 -o ./SRR9861783.bam ./SRR9861783.Aligned.out.bam

samtools index -@ 8 ./SRR9861783.bam


(this, of course, assumes that the SRR9861783.Aligned.out.bam file from the STAR output is in the current directory)


- Configure and run majiq build


Majiq takes a few items as input:

-.bam and .bam.bai

-annotation database of the genome

-config file describing general options and experiments


We will want to start here with the simplest run possible. We define one experiment based on our one processed sample. For the annotation database, we will use a shared copy that is already on the server (the exact path is given below).


Majiq is built mainly using python, so we will get a short introduction to the python package management system. When dealing with python packages, there are often three ways they can be installed.


1) by the administrator of the system, so that the packages are seen by all users

2) by a standard user, making the packages available to any program run by that user

3) by a user, into an isolated environment, so that ONLY packages in this environment are used when it is active


The third option is generally best when developing software, or working with software which might change or update often. Let's try installing majiq into an isolated environment using this method.


$ python3.8 -m venv some_environment_name

$ source some_environment_name/bin/activate


Here, a few interesting things happen. We first created the environment, which makes a folder to store it (you can call it what you like). Then we used the `source` command to activate the environment. The source command itself is a little less intuitive. It is very similar to running a program in general, except that the program can affect your current shell / bash environment, whereas when you run a program normally, it runs in isolation and its environment is destroyed after it finishes. At a high level, in this case, activating the environment makes it so that installed python packages go directly into this folder you created, instead of to some system location, and also any python programs you run will only try to use packages from this folder. If you ever want to reverse the source command, just run "deactivate".
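
You can sanity-check that the environment is active: the prompt gains a prefix, and the python on your PATH now points inside the environment folder:


(some_environment_name) $ which python
/home/<username>/some_environment_name/bin/python
(some_environment_name) $ deactivate
$ which python    # back to the system python (the exact path will vary)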


With that environment activated, you can install the 'stable' version of majiq from the public repository:


pip install git+https://bitbucket.org/biociphers/majiq_academic.git#egg=majiq


This will take a while but should end with success. You can verify that the installation was successful by running


$ majiq --help


which should print the help information.


Next you will need to prepare a directory to run majiq in. Make a new one somewhere. We will first make the config file. The config file is in the "Microsoft INI" format, a somewhat outdated but still commonly used simple configuration standard. It consists of "sections" surrounded by square brackets ( [some_section] ), each followed by a list of key=value pairs. Note that there is an example config file documented at https://biociphers.bitbucket.io/majiq/quick.html#conf_file ; however, we don't need that many options for our case.


For ours we would like the [info] section to specify:


bamdirs= some path, relative or absolute, to a folder where you will put the .bam and .bam.bai files

genome=hg38 (human, instead of mouse; the 38th version of the genome assembly)


Then in the [experiments] section we only need to specify one experiment with a single sample like:


SRR9861783=SRR9861783


We won't be needing the [optional] section at all.
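
Putting those pieces together, the whole config file might look like this (here "bams" is just a folder name I picked for bamdirs; any name works as long as it matches the real folder):


[info]
bamdirs=bams
genome=hg38

[experiments]
SRR9861783=SRR9861783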


Create the config file and save it in the run folder you made. Make sure the folder you specified for "bamdirs" exists, and move the SRR9861783.bam and SRR9861783.bam.bai files into it.


The last piece is just that annotation database I mentioned before. This is so commonly used by us that we keep a copy of it in a shared location. Here is the correct version for hg38 human data:


/data/DB/hg38/ensembl.Homo_sapiens.GRCh38.94.gff3


With majiq installed and the run environment prepared, we are ready to run "majiq build". You should be able to figure out how to give it the location of your config file, annotation database, and some output directory (a new folder, you need not make it yourself). There are lots of other options, but you don't need to worry about any of them for now. Reference majiq build --help or the online documentation (https://biociphers.bitbucket.io/majiq/MAJIQ.html#builder) for help. Remember I am always here to answer questions if you need it!
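
If you want to check your answer afterwards, my reading of the documentation suggests an invocation shaped roughly like this (a sketch: the -c / -o / -j flag names are my assumption from the docs, and "settings.ini" / "build_output" are just names I picked; trust $ majiq build --help over this):


$ majiq build /data/DB/hg38/ensembl.Homo_sapiens.GRCh38.94.gff3 -c settings.ini -o build_output -j 8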


- run majiq quantifier


After parsing all of that input data into a format majiq can work with, you have a few options for how to quantify the splicing data. For example, you may only care about looking at each group in isolation (our case here), or you might want to compare the difference between certain groups. The simplest standalone quantification, which we will try, is a "majiq psi" quantification. This will take the .majiq files from the output directory of our build, and turn them into .voila files (and some other outputs) that we can use with the visualizer!


Please take a look at $ majiq psi --help , or the online documentation (https://biociphers.bitbucket.io/majiq/MAJIQ.html#psi) and try to run majiq PSI to produce a .voila file from the .majiq file generated in the last step. 
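
Again, only as a hedged sketch to compare your attempt against (the -o / -n flag names are my assumption from the documentation, and the file and folder names follow from the build step above):


$ majiq psi build_output/SRR9861783.majiq -o psi_output -n SRR9861783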


- run the voila visualization interface. 


After staring at the terminal for this long, the voila program will provide relief by finally showing something more visually pleasing! (well, that's the idea, anyway ;S ) Running voila is not difficult:


$ voila view <path_to_output_folder> -p <some_number_between_5000_and_6000> --host 0.0.0.0


For the sake of this example, I will assume the number picked is 5050, but you should choose a different one (all that matters is that it does not conflict with anyone else's).


However, there is an issue over the UPenn VPN, where certain ports are not allowed to be used (most are blocked). This is a common network administration problem you might run into when working in large institutions. There are a few possible workarounds. We will use a trick called an "ssh tunnel". What this does is perform some magic to take a port on your local computer, so that when you try to connect to it, the connection is instead redirected to the SSH port, tunneled through the ssh connection, and redirected back to the original port on the other machine (naxos1, in this case). If you are on macOS, you can use the standard ssh binary to do this. Leave the first terminal open, which should still be running voila from above. Open a second terminal and enter:


ssh -L 5050:naxos1:5050 naxos1  


(changing "5050" to whatever you picked)


The process should just log in and look like a normal shell, but in the background it is running the tunnel, so don't close it yet. For Windows machines, the "PuTTY" SSH client also provides options to set up an ssh tunnel. Just specify the same local and remote port that you picked above, and keep the resulting window open.


If everything went well, you should now be able to open your local web browser (Chrome preferred), enter "localhost:5050" as the URL, and see voila load up. Celebrate by checking out some of the splicing variations found. :D


I don't doubt that there may be some additional questions this time around, so feel free to reach out any time you need!