Tuesday, August 7, 2018

Get sequences of rRNA, tRNA, snRNA and snoRNA from Rfam database

Create a folder to hold the Rfam files, then download the family index file, family.txt.

# Create and enter Rfam folder
$ cd $HOME
$ mkdir Rfam_ncRNA
$ cd Rfam_ncRNA
# Download the Rfam family annotation table, used below to select ncRNA families
$ wget ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/database_files/family.txt.gz
$ gzip -d family.txt.gz

Create the Rfam download link list

$ curl -l ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/fasta_files/ | awk '{print "ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/fasta_files/"$0}' >Rfam_link.txt
# curl -l returns only the names of the files in the FTP directory
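
The awk step only prepends the directory URL to each filename printed by curl -l. A minimal offline sketch of that step, assuming the listing contains names such as RF00001.fa.gz (the two names below are stand-ins, not fetched from the server):

```shell
# Base URL taken from the command above.
base="ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/fasta_files/"

# Stand-in for the output of `curl -l`: one filename per line (assumed names).
listing=$(printf '%s\n' RF00001.fa.gz RF00002.fa.gz)

# Prefix every filename with the base URL, as the awk command does.
links=$(printf '%s\n' "$listing" | awk -v base="$base" '{print base $0}')
printf '%s\n' "$links"
```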

Extract the download links for each ncRNA class from the full Rfam link list.

$ awk -F "\t" '$19~/snRNA/ && $19!~/snoRNA/ {print $1}' family.txt | grep -F -f - Rfam_link.txt >Rfam_snRNA_link.txt
$ awk -F "\t" '$19~/snoRNA/ {print $1}' family.txt | grep -F -f - Rfam_link.txt >Rfam_snoRNA_link.txt
$ awk -F "\t" '$19~/tRNA/ {print $1}' family.txt | grep -F -f - Rfam_link.txt >Rfam_tRNA_link.txt
$ awk -F "\t" '$19~/rRNA/ {print $1}' family.txt | grep -F -f - Rfam_link.txt >Rfam_rRNA_link.txt

# Filter criteria. To inspect the accession (column 1) and type (column 19):
# $ awk -F "\t" '{print $1,$19}' family.txt | head
# ...
# RF00003  Gene; snRNA; splicing;   #<--- snRNA: matches "snRNA" but not "snoRNA"
# ...
# RF00012  Gene; snRNA; snoRNA; CD-box;  #<--- snoRNA: matches both "snRNA" and "snoRNA"
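
The snRNA/snoRNA split can be tried offline on a tiny stand-in table. This is only a sketch: make_row is a hypothetical helper that pads columns 2 to 18 with "-", the two rows reuse the examples above, and example.invalid is a placeholder host rather than the real FTP server.

```shell
# Write one 19-column, tab-separated row: accession in column 1, type in column 19.
# Columns 2-18 are "-" placeholders (hypothetical helper, not part of Rfam).
make_row() {
    printf '%s' "$1"
    i=2
    while [ "$i" -le 18 ]; do printf '\t-'; i=$((i + 1)); done
    printf '\t%s\n' "$2"
}

tmp=$(mktemp -d)
{
    make_row RF00003 "Gene; snRNA; splicing;"
    make_row RF00012 "Gene; snRNA; snoRNA; CD-box;"
} > "$tmp/family_sample.txt"

# Stand-in link list, one URL per accession (placeholder host).
printf 'ftp://example.invalid/%s.fa.gz\n' RF00003 RF00012 > "$tmp/links_sample.txt"

# snRNA: type mentions snRNA but not snoRNA.
snRNA_links=$(awk -F "\t" '$19 ~ /snRNA/ && $19 !~ /snoRNA/ {print $1}' "$tmp/family_sample.txt" \
    | grep -F -f - "$tmp/links_sample.txt")
# snoRNA: type mentions snoRNA.
snoRNA_links=$(awk -F "\t" '$19 ~ /snoRNA/ {print $1}' "$tmp/family_sample.txt" \
    | grep -F -f - "$tmp/links_sample.txt")

echo "snRNA:  $snRNA_links"
echo "snoRNA: $snoRNA_links"
rm -r "$tmp"
```

Without the second condition, RF00012 would land in both lists, because its type string contains "snRNA" as well as "snoRNA".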

Download multiple files from a link list

$ mkdir snRNA
$ mkdir snoRNA
$ mkdir tRNA
$ mkdir rRNA
$ wget -P snRNA -i Rfam_snRNA_link.txt 2>snRNA.log &
$ wget -P snoRNA -i Rfam_snoRNA_link.txt 2>snoRNA.log &
$ wget -P tRNA -i Rfam_tRNA_link.txt 2>tRNA.log &
$ wget -P rRNA -i Rfam_rRNA_link.txt 2>rRNA.log &

# -P [dir]       Save the downloaded files in this directory.
# -i [file]      Read the download links from this file.
# 2>file.log     Write STDERR to the log file.
# &              Appending "&" to a command runs it in the background.
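
Since each wget is started with "&", the shell's wait builtin is one possible way to block until the jobs exit and to read their exit statuses. The sketch below uses sleep as a stand-in for wget so that it runs offline; with real downloads, a non-zero status would indicate that wget reported an error.

```shell
# Stand-ins for two background wget commands; $! holds the PID of the last background job.
sleep 1 &
pid1=$!
sleep 1 &
pid2=$!

# wait PID blocks until that process exits and returns its exit status.
wait "$pid1"
status1=$?
wait "$pid2"
status2=$?

echo "job 1 exit status: $status1"
echo "job 2 exit status: $status2"
```

wait only works for jobs started from the same shell session, so it has to be run in the terminal that launched the downloads.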

Wait for the downloads to finish!

Watch the process list until no "wget" entries remain in the "COMMAND" column.
$ top -u $USER
   # The snoRNA list contains many files, so expect to wait about 1 to 2 hours.
   # Number of links per file:
   # wc -l Rfam_*_link.txt
   #   15 Rfam_rRNA_link.txt
   #  753 Rfam_snoRNA_link.txt
   #   18 Rfam_snRNA_link.txt
   #    2 Rfam_tRNA_link.txt
   #  788 total
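
Once wget has exited, the link counts above can be compared with the number of files that actually arrived. A rough sketch, where check_downloads is a hypothetical helper and the temporary directory with empty files stands in for a real download folder:

```shell
# Compare the number of links in a list with the number of .gz files in a directory.
check_downloads() {
    # $1 = link list file, $2 = download directory
    expected=$(( $(wc -l < "$1") ))
    actual=$(( $(find "$2" -type f -name '*.gz' | wc -l) ))
    echo "$actual of $expected files present"
}

# Offline demo: two links, but only one file has been "downloaded".
tmp=$(mktemp -d)
mkdir "$tmp/tRNA"
printf 'ftp://example.invalid/%s.fa.gz\n' a b > "$tmp/links.txt"
: > "$tmp/tRNA/a.fa.gz"

result=$(check_downloads "$tmp/links.txt" "$tmp/tRNA")
echo "$result"
rm -r "$tmp"
```

In the working folder this would be called as, for example, check_downloads Rfam_tRNA_link.txt tRNA. A shortfall suggests checking the matching log file.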

Merge multiple fasta.gz files

$ zcat snRNA/*.gz >Rfam_snRNA.fasta
$ zcat snoRNA/*.gz >Rfam_snoRNA.fasta
$ zcat rRNA/*.gz >Rfam_rRNA.fasta
$ zcat tRNA/*.gz >Rfam_tRNA.fasta
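
A quick sanity check on a merged file is to count its FASTA header lines, which start with ">". The sketch below builds two small gzipped FASTA files with made-up sequences, merges them with gzip -dc (which decompresses to standard output like zcat), and counts the records:

```shell
tmp=$(mktemp -d)

# Two tiny gzipped FASTA files with made-up sequences.
printf '>seq1\nACGU\n' | gzip > "$tmp/a.fa.gz"
printf '>seq2\nGGCC\n>seq3\nAUAU\n' | gzip > "$tmp/b.fa.gz"

# Decompress and concatenate, as in the merge step.
gzip -dc "$tmp"/*.gz > "$tmp/merged.fasta"

# Each record has exactly one header line beginning with ">".
n_seqs=$(grep -c '^>' "$tmp/merged.fasta")
echo "sequences: $n_seqs"
rm -r "$tmp"
```

On the real output, grep -c '^>' Rfam_*.fasta prints one count per merged file.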
