2018年8月7日 星期二

One-line get sorted Bam file from bowtie

One line get sorted Bam file

$ read=collapsed_read.fasta       # collapsed fasta that fastx_collapser creates
$ genome_index=Genome             # basename of bowtie's .ebwt file
$ cat $read | fastx_uncollapser | bowtie -S -f $genome_index - | samtools view -u -h -F 4 - | samtools sort - -o aln.sorted.bam
"-"    It mean <STDIN> that is <STDOUT> of previous command/ program.

fasta_uncollapser [-i INFILE] [-o OUTFILE]

-i [INFILE]      FASTA input file. default is STDIN.

-o [OUTFILE]     FASTA output file. default is STDOUT.

bowtie [options] <ebwt> <s> <hit>

<s>              The query input file. 
                 If - is specified, Bowtie gets the reads from the "standard in" filehandle.
-f               The query input file (<s>) is FASTA files
-S/--sam         Print alignments in SAM format.

samtools view [option] 

-h               Include the header in the output.

-u               Output uncompressed BAM. 
                 This option saves time spent on compression/decompression 
                 and is thus preferred when the output is piped to another 
                 samtools command.

-F [INT]         Do not output alignments with any bits set in INT present 
                 in the FLAG field. # 4 is unmapped reads


samtools sort [-o out.sam|out.bam|out.cram] [in.sam|in.bam|in.cram]

-m [INT]         Approximately the maximum required memory per thread, 
                 specified either in bytes or with a K, M, or G suffix.

-@ [INT]         Set number of sorting and compression threads. 
                 By default, operation is single-threaded


Finally, Create index file of Bam that some software requires

$ samtools index aln.sorted.bam aln.sorted.bam.bai

Get sequences of rRNA, tRNA, snRNA and snoRNA from Rfam database

Create a folder for storing Rfam and downloading index file, family.txt.

# Create and enter Rfam folder
$ cd $HOME
$ mkdir Rfam_ncRNA
$ cd Rfam_ncRNA
# Download Annotation of Rfam fasta file for selecting ncRNA
$ wget ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/database_files/family.txt.gz
$ gzip -d family.txt.gz

Create Rfam download link list

$ curl -l ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/fasta_files/ | awk '{print "ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/fasta_files/"$0}' >Rfam_link.txt
# curl -l will return a list of filenames on FTP

Grep ncRNA download link list from Rfam download link list.

$ cat family.txt | awk -F "\t" '{ if($19~/snRNA/ && $19!~/snoRNA/){print $1} }' | grep -f - Rfam_link.txt >Rfam_snRNA_link.txt
$ cat family.txt | awk -F "\t" '{ if($19~/snoRNA/){print $1} }' | grep -f - Rfam_link.txt >Rfam_snoRNA_link.txt
$ cat family.txt | awk -F "\t" '{ if($19~/tRNA/){print $1} }' | grep -f - Rfam_link.txt >Rfam_tRNA_link.txt
$ cat family.txt | awk -F "\t" '{ if($19~/rRNA/){print $1} }' | grep -f - Rfam_link.txt >Rfam_rRNA_link.txt

# $ Filtered condition: awk -F "\t" '{print $1,$19}' family.txt | head
# ...
# RF00003  Gene; snRNA; splicing;   #<--- snRNA match "snRNA" and don't match "snoRNA"
# ...
# RF00012  Gene; snRNA; snoRNA; CD-box;  #<--- snoRNA match "snRNA" and "snoRNA"

Download multiple files by list file

$ mkdir snRNA
$ mkdir snoRNA
$ mkdir tRNA
$ mkdir rRNA
$ wget -P snRNA -i Rfam_snRNA_link.txt 2>snRNA.log &
$ wget -P snoRNA -i Rfam_snoRNA_link.txt 2>snoRNA.log &
$ wget -P tRNA -i Rfam_tRNA_link.txt 2>tRNA.log &
$ wget -P rRNA -i Rfam_rRNA_link.txt 2>rRNA.log &

# -P [dir]       File was stored at this directory
# -i [file]      File is a list of download links.
# 2>file.log     STDERR was written in the log file.
# &              add "&" to tail of command, the command will run in the background

Wait for download finish! 

All "wget" will be disappeared below "COMMAND".
$ top -u $USER 
   # Rfam_snoRNA have many files, you will wait for about 1~2hr! 
   # show numbers of link pre file
   # wc -l Rfam_*_link.txt 
   #   15 Rfam_rRNA_link.txt
   #  753 Rfam_snoRNA_link.txt
   #   18 Rfam_snRNA_link.txt
   #    2 Rfam_tRNA_link.txt
   #  788 total

Merge multiple fasta.gz files

$ zcat snRNA/*.gz >Rfam_snRNA.fasta
$ zcat snoRNA/*.gz >Rfam_snoRNA.fasta
$ zcat rRNA/*.gz >Rfam_rRNA.fasta
$ zcat tRNA/*.gz >Rfam_tRNA.fasta

DEseq2 usage