Wednesday 22 February 2012

Fetching sequences from EMBL/ENA using wget/curl

Every time I want to download several EMBL files (e.g. all the bacterial genomes), I spend at least an hour trying to find the right URL syntax. This post is a public note to self that will help me next time, and perhaps help others who are also receiving a few lines of HTML when all they want is a verdammt plain-text EMBL-formatted file.

There is actual documentation on the right syntax here, but it takes a while to find: searching for wget, curl, EMBL and various related combinations doesn't get you there quickly. The main issue I have, however, is that if I go to the recommended sequence record (e.g. here), none of the links work with a simple "wget URL" or "curl -G URL".

So, if I want to fetch the Roseobacter denitrificans genome sequence with EMBL accession CP000362, I use:
wget http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/embl/CP000362
or if you're into curl:
curl -G http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/embl/CP000362 > CP000362.embl

Simple!
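
For the "several EMBL files" case, the same URL pattern can be wrapped in a small loop. This is only a minimal sketch, assuming the dbfetch URL above still responds the same way; the function names dbfetch_url and fetch_embl and the file accessions.txt are made up for illustration, and the -f/-L flags are my own additions rather than anything the EBI documentation asks for.

```shell
#!/bin/sh
# Base URL copied from the wget/curl commands above.
DBFETCH_BASE="http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/embl"

# Build the dbfetch URL for one accession (hypothetical helper).
dbfetch_url() {
    printf '%s/%s\n' "$DBFETCH_BASE" "$1"
}

# Download every accession given as an argument to <accession>.embl
# (hypothetical helper).
fetch_embl() {
    for acc in "$@"; do
        # -f: fail on HTTP errors instead of saving an HTML error page
        # -L: follow redirects, in case the server issues one
        # -sS: no progress meter, but still print errors
        curl -fsSL "$(dbfetch_url "$acc")" -o "$acc.embl" \
            || echo "failed to fetch $acc" >&2
    done
}

# Example usage (accessions.txt is a hypothetical file, one accession per line):
#   fetch_embl CP000362
#   fetch_embl $(cat accessions.txt)
```

Each record ends up in its own file named after the accession, and a failed download prints a message instead of silently leaving a few lines of HTML behind.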

5 comments:

  1. I came across the same problem as I needed to download 50+ shotgun sequences of the same organism. I just googled "download embl wget" and google sent me here! Small world. Hope you are having fun back home!
    Miao (former rotation student)

  2. Hi Miao, great to hear from you! I have to keep looking this up too.

  3. is there a reason the accessions don't match?

    1. To annoy the OCD? Most likely due to my incompetence.

    2. Fixed now. Unless I screwed something else up. Thnx.
