Create a folder for the Rfam files and download the index file, family.txt.
# Create and enter Rfam folder
$ cd $HOME
$ mkdir Rfam_ncRNA
$ cd Rfam_ncRNA
# Download the Rfam family annotation table, used to select the ncRNA families
$ wget ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/database_files/family.txt.gz
$ gzip -d family.txt.gz
Create the Rfam download link list.
$ curl -l ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/fasta_files/ | awk '{print "ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/fasta_files/"$0}' >Rfam_link.txt
# curl -l returns only the file names in the FTP directory
Extract the download links for each ncRNA type from the full Rfam link list.
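To see what the awk step does without contacting the FTP server, here is a minimal sketch that prefixes two sample file names with the base URL. The file names are placeholders standing in for the real listing, and the variable names are only for illustration.

```shell
# Base URL of the Rfam fasta directory (same as in the command above)
base="ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/fasta_files/"

# Two sample file names stand in for the output of "curl -l" (assumed names)
links=$(printf 'RF00001.fa.gz\nRF00002.fa.gz\n' | awk -v base="$base" '{print base $0}')

# Each file name is now a full download link
printf '%s\n' "$links"
```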
$ awk -F "\t" '$19~/snRNA/ && $19!~/snoRNA/ {print $1}' family.txt | grep -f - Rfam_link.txt >Rfam_snRNA_link.txt
$ awk -F "\t" '$19~/snoRNA/ {print $1}' family.txt | grep -f - Rfam_link.txt >Rfam_snoRNA_link.txt
$ awk -F "\t" '$19~/tRNA/ {print $1}' family.txt | grep -f - Rfam_link.txt >Rfam_tRNA_link.txt
$ awk -F "\t" '$19~/rRNA/ {print $1}' family.txt | grep -f - Rfam_link.txt >Rfam_rRNA_link.txt
# Filter condition; inspect the columns with: awk -F "\t" '{print $1,$19}' family.txt | head
# ...
# RF00003 Gene; snRNA; splicing; #<--- snRNA: matches "snRNA" but not "snoRNA"
# ...
# RF00012 Gene; snRNA; snoRNA; CD-box; #<--- snoRNA: matches both "snRNA" and "snoRNA"
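The filter can be tried on a small sample table before running it on the real family.txt. This is only a sketch: the helper make_row and the file family_sample.txt are made up for the test, columns 2 to 18 are filled with a dummy value, and the RF00005 row is a sample entry added to show a non-matching type.

```shell
# Write one tab-separated row: accession in column 1, type string in column 19
make_row() {
    printf '%s' "$1"
    i=2
    while [ "$i" -le 18 ]; do printf '\tx'; i=$((i + 1)); done
    printf '\t%s\n' "$2"
}

tmp=$(mktemp -d)
{
    make_row RF00003 'Gene; snRNA; splicing;'
    make_row RF00012 'Gene; snRNA; snoRNA; CD-box;'
    make_row RF00005 'Gene; tRNA;'
} > "$tmp/family_sample.txt"

# Same conditions as above: snRNA excludes anything that is also a snoRNA
snRNA_ids=$(awk -F '\t' '$19 ~ /snRNA/ && $19 !~ /snoRNA/ {print $1}' "$tmp/family_sample.txt")
snoRNA_ids=$(awk -F '\t' '$19 ~ /snoRNA/ {print $1}' "$tmp/family_sample.txt")

echo "snRNA only: $snRNA_ids"
echo "snoRNA:     $snoRNA_ids"
```

With this sample, RF00003 should land in the snRNA list and RF00012 in the snoRNA list only.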
Download the files listed in each link file.
$ mkdir snRNA
$ mkdir snoRNA
$ mkdir tRNA
$ mkdir rRNA
$ wget -P snRNA -i Rfam_snRNA_link.txt 2>snRNA.log &
$ wget -P snoRNA -i Rfam_snoRNA_link.txt 2>snoRNA.log &
$ wget -P tRNA -i Rfam_tRNA_link.txt 2>tRNA.log &
$ wget -P rRNA -i Rfam_rRNA_link.txt 2>rRNA.log &
# -P [dir] Directory in which the downloaded files are stored.
# -i [file] File containing the list of download links.
# 2>file.log Write STDERR to the log file.
# & Appending "&" to a command runs it in the background.
Wait for the downloads to finish.
They are done when no "wget" entries remain in the "COMMAND" column.
$ top -u $USER
# Rfam_snoRNA has many files, so expect to wait about 1-2 hours.
# Show the number of links per file
# wc -l Rfam_*_link.txt
# 15 Rfam_rRNA_link.txt
# 753 Rfam_snoRNA_link.txt
# 18 Rfam_snRNA_link.txt
# 2 Rfam_tRNA_link.txt
# 788 total
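As an alternative to watching top, the shell builtin "wait" blocks until every background job started from the same shell has exited. The sketch below uses "sleep" as a stand-in for the four wget jobs so it can run offline; the directory logdir is an assumption made for the example. It only works if "wait" is typed in the same shell session that launched the jobs.

```shell
logdir=$(mktemp -d)

# Stand-ins for the four background wget jobs, one per ncRNA type
for type in snRNA snoRNA tRNA rRNA; do
    ( sleep 1; echo "$type done" ) > "$logdir/$type.log" 2>&1 &
done

# Returns only after all background jobs of this shell have finished
wait
echo "all downloads finished"
```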
Merge the downloaded fasta.gz files into one FASTA file per ncRNA type.
$ zcat snRNA/*.gz >Rfam_snRNA.fasta
$ zcat snoRNA/*.gz >Rfam_snoRNA.fasta
$ zcat rRNA/*.gz >Rfam_rRNA.fasta
$ zcat tRNA/*.gz >Rfam_tRNA.fasta
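A quick sanity check after merging is to count the FASTA header lines, since each sequence starts with ">". The sketch below builds two tiny sample .gz files (made-up sequences and file names) and merges them the same way; "gzip -dc" is used here because it behaves like zcat on systems where zcat expects .Z files.

```shell
tmp=$(mktemp -d)
mkdir "$tmp/demo"

# Two small sample archives stand in for the downloaded Rfam files
printf '>seq1\nACGU\n>seq2\nGGCC\n' | gzip > "$tmp/demo/a.fa.gz"
printf '>seq3\nUUAA\n' | gzip > "$tmp/demo/b.fa.gz"

# Decompress and concatenate, as in the merge step
gzip -dc "$tmp"/demo/*.gz > "$tmp/demo.fasta"

# Count the sequences by counting header lines
n_seq=$(grep -c '^>' "$tmp/demo.fasta")
echo "sequences: $n_seq"
```

On the real data, the same check would be: grep -c '^>' Rfam_snRNA.fasta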