2018年9月30日 星期日

2018年9月29日 星期六

Window10 install Jbrowse on subsystem for linux (WSL)

1. install Ubuntu

1.a
Select the Start button and Open Microsoft Store
Search "linux" on Microsoft Store
Select and install Ubuntu

2018年8月7日 星期二

One-line get sorted Bam file from bowtie

One line get sorted Bam file

$ read=collapsed_read.fasta       # collapsed fasta that fastx_collapser creates
$ genome_index=Genome             # basename of bowtie's .ebwt file
$ cat $read | fastx_uncollapser | bowtie -S -f $genome_index - | samtools view -u -h -F 4 - | samtools sort - -o aln.sorted.bam
"-"    It mean <STDIN> that is <STDOUT> of previous command/ program.

fasta_uncollapser [-i INFILE] [-o OUTFILE]

-i [INFILE]      FASTA input file. default is STDIN.

-o [OUTFILE]     FASTA output file. default is STDOUT.

bowtie [options] <ebwt> <s> <hit>

<s>              The query input file. 
                 If - is specified, Bowtie gets the reads from the "standard in" filehandle.
-f               The query input file (<s>) is FASTA files
-S/--sam         Print alignments in SAM format.

samtools view [option] 

-h               Include the header in the output.

-u               Output uncompressed BAM. 
                 This option saves time spent on compression/decompression 
                 and is thus preferred when the output is piped to another 
                 samtools command.

-F [INT]         Do not output alignments with any bits set in INT present 
                 in the FLAG field. # 4 is unmapped reads


samtools sort [-o out.sam|out.bam|out.cram] [in.sam|in.bam|in.cram]

-m [INT]         Approximately the maximum required memory per thread, 
                 specified either in bytes or with a K, M, or G suffix.

-@ [INT]         Set number of sorting and compression threads. 
                 By default, operation is single-threaded


Finally, Create index file of Bam that some software requires

$ samtools index aln.sorted.bam aln.sorted.bam.bai

Get sequences of rRNA, tRNA, snRNA and snoRNA from Rfam database

Create a folder for storing Rfam and downloading index file, family.txt.

# Create and enter Rfam folder
$ cd $HOME
$ mkdir Rfam_ncRNA
$ cd Rfam_ncRNA
# Download Annotation of Rfam fasta file for selecting ncRNA
$ wget ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/database_files/family.txt.gz
$ gzip -d family.txt.gz

Create Rfam download link list

$ curl -l ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/fasta_files/ | awk '{print "ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/fasta_files/"$0}' >Rfam_link.txt
# curl -l will return a list of filenames on FTP

Grep ncRNA download link list from Rfam download link list.

$ cat family.txt | awk -F "\t" '{ if($19~/snRNA/ && $19!~/snoRNA/){print $1} }' | grep -f - Rfam_link.txt >Rfam_snRNA_link.txt
$ cat family.txt | awk -F "\t" '{ if($19~/snoRNA/){print $1} }' | grep -f - Rfam_link.txt >Rfam_snoRNA_link.txt
$ cat family.txt | awk -F "\t" '{ if($19~/tRNA/){print $1} }' | grep -f - Rfam_link.txt >Rfam_tRNA_link.txt
$ cat family.txt | awk -F "\t" '{ if($19~/rRNA/){print $1} }' | grep -f - Rfam_link.txt >Rfam_rRNA_link.txt

# $ Filtered condition: awk -F "\t" '{print $1,$19}' family.txt | head
# ...
# RF00003  Gene; snRNA; splicing;   #<--- snRNA match "snRNA" and don't match "snoRNA"
# ...
# RF00012  Gene; snRNA; snoRNA; CD-box;  #<--- snoRNA match "snRNA" and "snoRNA"

Download multiple files by list file

$ mkdir snRNA
$ mkdir snoRNA
$ mkdir tRNA
$ mkdir rRNA
$ wget -P snRNA -i Rfam_snRNA_link.txt 2>snRNA.log &
$ wget -P snoRNA -i Rfam_snoRNA_link.txt 2>snoRNA.log &
$ wget -P tRNA -i Rfam_tRNA_link.txt 2>tRNA.log &
$ wget -P rRNA -i Rfam_rRNA_link.txt 2>rRNA.log &

# -P [dir]       File was stored at this directory
# -i [file]      File is a list of download links.
# 2>file.log     STDERR was written in the log file.
# &              add "&" to tail of command, the command will run in the background

Wait for download finish! 

All "wget" will be disappeared below "COMMAND".
$ top -u $USER 
   # Rfam_snoRNA have many files, you will wait for about 1~2hr! 
   # show numbers of link pre file
   # wc -l Rfam_*_link.txt 
   #   15 Rfam_rRNA_link.txt
   #  753 Rfam_snoRNA_link.txt
   #   18 Rfam_snRNA_link.txt
   #    2 Rfam_tRNA_link.txt
   #  788 total

Merge multiple fasta.gz files

$ zcat snRNA/*.gz >Rfam_snRNA.fasta
$ zcat snoRNA/*.gz >Rfam_snoRNA.fasta
$ zcat rRNA/*.gz >Rfam_rRNA.fasta
$ zcat tRNA/*.gz >Rfam_tRNA.fasta

2017年7月5日 星期三

Fastx-toolkit Installation

fastx-toolkit Installation


Install libgtextutils (necessary package for fastx-tookit)

$ cul -LOk https://github.com/agordon/libgtextutils/releases/download/0.7/libgtextutils-0.7.tar.gz
$ tar -zxf libgtextutils-0.7.tar.gz
$ cd libgtextutils-0.7
$ ./configure prefix=/usr/bhhou/packages
$ make
$ make install
# Tell pkg-config to look for libraries in /usr/bhhou/packages/lib.
$ echo "export PKG_CONFIG_PATH=\$PKG_CONFIG_PATH:/usr/bhhou/packages/lib" >>~/.bash_prefile

Install fastx-toolkit

$ cul -LOk https://github.com/agordon/fastx_toolkit/releases/download/0.0.14/fastx_toolkit-0.0.14.tar.bz2
$ tar -jxf fastx_toolkit-0.0.12.tar.bz2
$ cd fastx_toolkit-0.0.12
$ ./configure prefix=/usr/bhhou/SOFTWARE
$ make
$ make install
# Add /usr/bhhou/packages/lib to PATH
$ echo "export PKG_CONFIG_PATH=\$PATH:/usr/bhhou/SOFTWARE/bin" >>~/.bash_prefile

Filter non-coding RNA ,multiple hits and size-selection using bowtie and shell script for sRNA-Seq

Software requirements:

1. Bowtie
2. Fastx-toolskit
3. Linux only

Building bowtie indexes of references for mapping

index_path=/usr1/bhhou/Reference
cd $index_path
genome_fasta=Genome.fasta
cDNA_fasta=cDNA.fasta
tRNA_fasta=tRNA.fasta
rRNA_fasta=rRNA.fasta
snRNA_fasta=snRNA.fasta
snoRNA_fasta=snoRNA.fasta
ncRNA_fasta=ncRNA.fasta
cat $tRNA_fasta $rRNA_fasta $snRNA_fasta $snoRNA_fasta >ncRNA.fasta
bowtie-build $genome_fasta ${genome_fasta%%.fasta}
bowtie-build $cDNA_fasta ${cDNA_fasta%%.fasta}
bowtie-build $ncRNA_fasta ${ncRNA_fasta%%.fasta}

genome_index=$index_path/Genome
cDNA_index=$index_path/cDNA
ncRNA_index=$index_path/ncRNA

Mapping

1. Filtering t/r/sn/snoRNAs,
2. Refraining reads having more than 20 genomic locus
3. Remain mapping reads of genome and cDNA
4. Allowing 18 to 26nt reads
cd /usr1/bhhou/sRNA
output=clean_collasped
mkdir $output
for read in /usr1/bhhou/sRNA/collasped/*_trimmed.fasta
do
read_id=${read%%_trimmed*}
read_id=${read_id##*/}
echo $read_id
fastx_collapser -i $read -o ${read_id}_collapsed.fasta
bowtie -v 0 -f $ncRNA_index ${read_id}_collapsed.fasta /dev/null --un ${read_id}_ncRNA_clean.fasta
bowtie -v 0 -f $genome_index ${read_id}_ncRNA_clean.fasta /dev/null --al ${read_id}_genome_map.fasta --un ${read_id}_genome_unmap.fasta
bowtie -v 0 -f -m 20 $genome_index ${read_id}_genome_map.fasta /dev/null --al ${read_id}_genome_clean.fasta
bowtie -v 0 -f $cDNA_index ${read_id}_genome_unmap.fasta /dev/null --al ${read_id}_cDNA_clean.fasta
cat ${read_id}_genome_clean.fasta ${read_id}_cDNA_clean.fasta > ${output}/${read_id}_collapsed_clean.fasta
cat ${output}/${read_id}_collapsed_clean.fasta | fasta_formatter -t | awk -F "\t" '{ if(length($2) >=18 && length($2)<=26){ print ">"$1"\n"$2} }' >${output}/${read_id}_collapsed_clean_size_selected.fasta
rm ${read_id}_*.fasta
done

DEseq2 usage