Install bedtools:
wget https://github.com/arq5x/bedtools2/releases/download/v2.25.0/bedtools-2.25.0.tar.gz
tar -zxf bedtools-2.25.0.tar.gz
cd bedtools2
make
cd bin
echo "export PATH=\$PATH:$PWD" >>$HOME/.bash_profile
source $HOME/.bash_profile
# You download the file using "wget" and return the waring "File name too long". You can try "curl" with option "LOk", eg:
curl -LOk https://github.com/arq5x/bedtools2/releases/download/v2.25.0/bedtools-2.25.0.tar.gz
Install gffread:
wget http://ccb.jhu.edu/software/stringtie/dl/gffread-0.9.8c.Linux_x86_64.tar.gz
tar -zxf gffread-0.9.8c.Linux_x86_64.tar.gz
cd gffread-0.9.8c.Linux_x86_64
echo "export PATH=\$PATH:$PWD" >>$HOME/.bash_profile
source $HOME/.bash_profile
Install samtools:
wget https://github.com/samtools/samtools/releases/download/1.4/samtools-1.4.tar.bz2
tar -jxf samtools-1.4.tar.bz2
cd samtools-1.4
make
make prefix=$HOME/SOFTWARE/samtools install
cd $HOME/SOFTWARE/samtools/bin
echo "export PATH=\$PATH:$PWD" >>$HOME/.bash_profile
source $HOME/.bash_profile
Get transcript:
gffread:
gffread genome.gff3 -g genome.fasta -w trancript.fasta
# GFF3 (default) or GTF2 format (with the -T option)
bedtools:
bedtools getfasta -fi genome.fasta -bed trancript.bed -fo trancript.fasta -name -s -split
# trancript.bed
The trancript.bed
file has 12 columns,
because positions of exons are written in column 11 and 12.
Get specific sequence:
For example 1: Extract 1 to 5 nucleotides of chr1
samtools (forward only):
samtools faidx genome.fasta chr1:1-5 > extract.fasta
# samtools according to
one-based position
bedtools:
perl -le 'print join("\t",(chr1,0,5,seq1,1,"-"))'>extract.bed
perl -le 'print join("\t",(chr1,0,5,seq2,1,"+"))'>>extract.bed
bedtools getfasta -fi genome.fasta -bed test.bed -fo extract.fasta -name -s
# simple BED file format
Column Description
1 SeqID according to fasta's header (eg. chromsome name)
2 START position (
zero-based)
3 END position (
one-based)
4 Description you want (eg. gene ID)
5 Score, but not important
6 The Strand of sequence [+,-]
For example 2: Extract upstream sequences of genes with the GFF3 file.
But, they are not perfect methods
1. Some upstream sequence and their upstream gene overlap
2. Not consider the maximum chromosome length
samtools:
(It do
NOT reverse complement upstream sequence of genes that are located at negative strand)
cat Gene.gff3 | awk -F "\t" '{if($3=="gene"){print $0}}' >gene_feature_only.gff
cat gene_feature_only.gff | awk -F "\t|=|;" 'if ($7=="+"){$5=$4-1;$4-=500;if($4<1){$4=1};print $1":"$4"-"$5}else{$4=$5+1;$5+=500;print $1":"$4"-"$5}}' >upstream500+_gene.txt
cat upstream500+_gene.txt | perl -ne '{print `samtools faidx Sspe_Genome.fa $_`}' >upstream_500_sequence_gene.fasta
bedtools:
(Reverse complement upstream sequence of genes that are located at negative strand)
cat Gene.gff3 | awk -F "\t" '{if($3=="gene"){print $0}}' >gene_feature_only.gff
cat gene_feature_only.gff | awk -F "\t|=|;" '{OFS="\t";if($7=="+"){$5=$4-1;$4-=501;if($4<0){$4=0};print $1,$4,$5,$10,1,$7}else{$4=$5;$5+=500;print $1,$4,$5,$10,1,$7}}'>upstream_500_gene.bed
bedtools getfasta -fi genome.fasta -bed upstream_500_gene.bed -fo upstream_500_sequence_gene.fasta -name -s
沒有留言:
張貼留言