2017年6月25日 星期日

Extract sequence from fasta using NGS tools : bedtools, samtools and gffread

Install bedtools:

wget https://github.com/arq5x/bedtools2/releases/download/v2.25.0/bedtools-2.25.0.tar.gz
tar -zxf bedtools-2.25.0.tar.gz
cd bedtools2
make
cd bin
echo "export PATH=\$PATH:$PWD" >>$HOME/.bash_profile
source $HOME/.bash_profile
# You download the file using "wget" and return the waring "File name too long". You can try "curl" with option "LOk", eg:
curl -LOk https://github.com/arq5x/bedtools2/releases/download/v2.25.0/bedtools-2.25.0.tar.gz

Install gffread:

wget http://ccb.jhu.edu/software/stringtie/dl/gffread-0.9.8c.Linux_x86_64.tar.gz
tar -zxf gffread-0.9.8c.Linux_x86_64.tar.gz
cd gffread-0.9.8c.Linux_x86_64
echo "export PATH=\$PATH:$PWD" >>$HOME/.bash_profile
source $HOME/.bash_profile

Install samtools:

wget https://github.com/samtools/samtools/releases/download/1.4/samtools-1.4.tar.bz2
tar -jxf samtools-1.4.tar.bz2
cd samtools-1.4
make 
make prefix=$HOME/SOFTWARE/samtools install
cd $HOME/SOFTWARE/samtools/bin
echo "export PATH=\$PATH:$PWD" >>$HOME/.bash_profile
source $HOME/.bash_profile


Get transcript:

gffread:
gffread genome.gff3 -g genome.fasta -w trancript.fasta
# GFF3 (default) or GTF2 format (with the -T option)

bedtools:
bedtools getfasta -fi genome.fasta -bed trancript.bed -fo trancript.fasta -name -s -split
# trancript.bed
The trancript.bed file has 12 columns,
because positions of exons are written in column 11 and 12.


Get specific sequence:

For example 1: Extract 1 to 5 nucleotides of chr1

samtools (forward only):
samtools faidx genome.fasta chr1:1-5 > extract.fasta
# samtools according to one-based position

bedtools:
perl -le 'print join("\t",(chr1,0,5,seq1,1,"-"))'>extract.bed
perl -le 'print join("\t",(chr1,0,5,seq2,1,"+"))'>>extract.bed
bedtools getfasta -fi genome.fasta -bed test.bed -fo extract.fasta -name -s
# simple BED file format
Column     Description
     1            SeqID according to fasta's header (eg. chromsome name)
     2            START position (zero-based)
     3            END position (one-based)
     4            Description you want (eg. gene ID)
     5            Score, but not important
     6            The Strand of sequence [+,-]


For example 2: Extract upstream sequences of genes with the GFF3 file.

But, they are not perfect methods
1. Some upstream sequence and their upstream gene overlap
2. Not consider the maximum chromosome length

samtools:
(It do NOT reverse complement upstream sequence of genes that are located at negative strand)
cat Gene.gff3 | awk -F "\t" '{if($3=="gene"){print $0}}' >gene_feature_only.gff
cat gene_feature_only.gff | awk -F "\t|=|;" 'if ($7=="+"){$5=$4-1;$4-=500;if($4<1){$4=1};print $1":"$4"-"$5}else{$4=$5+1;$5+=500;print $1":"$4"-"$5}}' >upstream500+_gene.txt
cat upstream500+_gene.txt | perl -ne '{print `samtools faidx Sspe_Genome.fa $_`}' >upstream_500_sequence_gene.fasta

bedtools:
(Reverse complement upstream sequence of genes that are located at negative strand)
cat Gene.gff3 | awk -F "\t" '{if($3=="gene"){print $0}}' >gene_feature_only.gff
cat gene_feature_only.gff | awk -F "\t|=|;" '{OFS="\t";if($7=="+"){$5=$4-1;$4-=501;if($4<0){$4=0};print $1,$4,$5,$10,1,$7}else{$4=$5;$5+=500;print $1,$4,$5,$10,1,$7}}'>upstream_500_gene.bed
bedtools getfasta -fi genome.fasta -bed upstream_500_gene.bed -fo upstream_500_sequence_gene.fasta -name -s

沒有留言:

張貼留言

DEseq2 usage