Downloading

Downloading individual genomes in the browser

Method 1

  1. Open the data directory: http://genomesync.nig.ac.jp/naf/.
  2. Browse and download individual files.

Method 2

  1. Open the statistics page.
  2. Use search, or just browse to your taxon of interest.
  3. Activate the "Show genomes" option.
  4. If necessary, increase the "Tree depth".
  5. Click on the "NAF" links to download genomes.

Method 3

Use the Genome Selector tool to get a list of links to genomes for a specific taxon.


Bulk downloading

These methods are for the command line, typically on a Linux machine. They depend on these utilities: curl, wget, lftp. You may already have these tools installed. If not, check your distribution's package manager or the tools' homepages. Under Windows you can install and use these tools via WSL or Cygwin.

Downloading everything

Please make sure that you have enough disk space before trying to download the entire set.
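As a rough way to check free space from the command line, the sketch below wraps the standard df utility. The function names free_kib and has_space are hypothetical helpers, not part of GenomeSync, and the required size must be supplied by you in KiB.

```shell
# Print the free space, in KiB, of the filesystem holding the given directory.
# "df -P" prints one header line and one data line; column 4 is the available space.
free_kib() {
    df -Pk "${1:-.}" | awk 'NR == 2 { print $4 }'
}

# Succeed only if the directory's filesystem has at least the given number of KiB free.
# Usage: has_space DIRECTORY REQUIRED_KIB
has_space() {
    [ "$(free_kib "$1")" -ge "$2" ]
}
```

For example, "has_space /data 500000000 && echo OK" prints OK only if about 500 GB is free under /data; that number is an arbitrary placeholder, not the actual size of GenomeSync.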

You can use your favorite recursive download tool to download the entire set. For example, using wget:

wget --directory-prefix=/data/GenomeSync -r -np -N -l inf -nH -A '*.naf' http://genomesync.nig.ac.jp/naf/

This command will save the genomes into '/data/GenomeSync' (which will be created if missing).

Selective downloading

It's possible to download selectively, by taxonomy. Example command to download mammalian genomes:

curl 'http://genomesync.nig.ac.jp/selector/?t=Mammalia' | wget -i - --directory-prefix=/data/GenomeSync -x -N -nH

How this works: we get the list of URLs from the selector script, then pipe the list to wget, which downloads the files.
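Because the selector output is just a list that wget reads with "-i", the two steps can also be run separately, which lets you inspect the list before downloading anything. The sketch below assumes the list contains one URL per line; count_urls is a hypothetical helper and mammalia-urls.txt is an arbitrary file name.

```shell
# Count the non-empty lines (assumed: one URL per line) arriving on standard input.
# grep exits non-zero when nothing matches, so "|| true" keeps the function's status clean.
count_urls() {
    grep -c . || true
}

# Possible usage (these lines need network access, so they are left as comments):
#   curl -s 'http://genomesync.nig.ac.jp/selector/?t=Mammalia' > mammalia-urls.txt
#   count_urls < mammalia-urls.txt
#   wget -i mammalia-urls.txt --directory-prefix=/data/GenomeSync -x -N -nH
```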

It's possible to specify multiple taxa. Prepending a name with '-' means "exclude this taxon". So, for example, to download reptile genomes, we start by including sauropsids, but then exclude birds:

curl 'http://genomesync.nig.ac.jp/selector/?t=Sauropsida&t=-Aves' | wget -i - --directory-prefix=/data/GenomeSync -x -N -nH
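When a query combines many taxa, it can be easier to build the URL from a list of arguments. The following is a minimal sketch; selector_url is a hypothetical helper that simply joins its arguments into repeated "t=" parameters, exactly as in the commands above. Names are inserted verbatim, so anything with special characters still has to be percent-encoded first.

```shell
# Build a selector URL from taxon terms given as arguments.
# Each argument becomes one "t=" parameter; a leading "-" marks an excluded taxon.
selector_url() (
    query=''
    for taxon in "$@"; do
        # Append "&" only when the query already holds an earlier term.
        query="${query}${query:+&}t=${taxon}"
    done
    printf '%s?%s\n' 'http://genomesync.nig.ac.jp/selector/' "$query"
)
```

For example, "selector_url Sauropsida -Aves" prints the same URL as in the reptile command above, so it can be used as: curl -s "$(selector_url Sauropsida -Aves)" | wget -i - --directory-prefix=/data/GenomeSync -x -N -nH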

Destination directories. These commands store the genomes into the "/data/GenomeSync" directory, re-creating the original GenomeSync directory structure inside it. If you want to just save the files, without creating all directories, remove '-x' from the command.

Escaping special characters. Some taxon names include characters that need to be escaped. For example, the fruit fly genus "Drosophila" is called "Drosophila <flies,genus>", to disambiguate it from the same-named subgenus "Drosophila <flies,subgenus>", and from the mushroom genus "Drosophila <basidiomycetes>". (Yes, it's crazy. I did not invent these names; all names come from the NCBI taxonomy database.) So, to download all available genomes for the fruit fly genus, we have to percent-encode the special characters:

curl 'http://genomesync.nig.ac.jp/selector/?t=Drosophila%20%3Cflies%2Cgenus%3E' | wget -i - --directory-prefix=/data/GenomeSync -x -N -nH
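Percent-encoding by hand is error-prone, so here is a small sketch that does it in plain shell. The urlencode function is a hypothetical helper; it assumes ASCII taxon names and encodes every character except letters, digits, and ". _ ~ -".

```shell
# Percent-encode a string: unreserved characters pass through,
# everything else becomes %XX (uppercase hex of the character code).
urlencode() (
    LC_ALL=C
    s=$1
    out=''
    while [ -n "$s" ]; do
        rest=${s#?}          # the string without its first character
        c=${s%"$rest"}       # the first character alone
        case $c in
            [A-Za-z0-9._~-]) out="$out$c" ;;
            *) out="$out$(printf '%%%02X' "'$c")" ;;
        esac
        s=$rest
    done
    printf '%s\n' "$out"
)
```

Running urlencode 'Drosophila <flies,genus>' prints Drosophila%20%3Cflies%2Cgenus%3E, the same string used in the command above. Note that the function would also encode a leading "-", which is used literally in the exclusion example above, so the cautious choice is to encode only the taxon name and prepend such markers afterwards.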

Windows notes. If you use Windows, there are some extra points to take care of: 1) You may have to use double quotes instead of single quotes. 2) The % character has special meaning in Windows, so it has to be escaped by typing it twice. 3) Make sure that the destination directory is a path that Windows recognizes. So, the Drosophila example may become:

curl "http://genomesync.nig.ac.jp/selector/?t=Drosophila%%20%%3Cflies%%2Cgenus%%3E" | wget -i - --directory-prefix=C:/data/GenomeSync -x -N -nH

Downloading representative genomes

Sometimes you may want to use just representative genomes for each species, instead of the entire set of genomes. There are two such representative subsets in GenomeSync:

1. "rep" subset. It basically includes genomes marked as "representative" at NCBI Assembly. It also includes one genome per species for eukaryote species that don't have any genome marked as "representative".

2. "rep2" subset. It includes all genomes from the "rep" subset, plus one genome per species for prokaryotes that don't have any genome included yet.

Adding (rep) or (rep2) in front of a taxon name in a selector query restricts that term, whether it includes or excludes the taxon, to representative genomes only. For example, to download representative prokaryote genomes:

curl -s 'http://genomesync.nig.ac.jp/selector/?t=(rep)Archaea&t=(rep)Bacteria' | wget -i - --directory-prefix=/data/GenomeSync -x -N -nH -nv

It's possible to combine representative and complete sets. For example, this command will download representative prokaryote genomes and all virus genomes:

curl -s 'http://genomesync.nig.ac.jp/selector/?t=(rep)Archaea&t=(rep)Bacteria&t=Viruses' | wget -i - --directory-prefix=/data/GenomeSync -x -N -nH -nv

Checking data size

It's a good idea to check the size of the genome data before downloading, and to make sure you have enough disk space. For example, let's check the size of the representative fungal genomes:

Checking in the browser:

http://genomesync.nig.ac.jp/data-size/?t=(rep)Fungi&format=naf&webpage=1

Checking in the command line:

curl -s 'http://genomesync.nig.ac.jp/data-size/?t=(rep)Fungi&format=naf'

The "format" parameter can be one of:


Synchronizing

Synchronizing means updating your local set of genomes so that it matches the current upstream version. Note that genomes are not only added, but sometimes removed or replaced. So, a genome that you currently use might be deleted during synchronization. You can make a backup first, to be sure that you can reproduce results obtained with a particular snapshot of GenomeSync.
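Alongside a backup copy, it may help to record exactly which files a snapshot contained before synchronizing. The sketch below is one possible way to do that, not a GenomeSync feature: write_manifest is a hypothetical helper that lists every .naf file under a directory together with its checksum and size, using the standard cksum utility.

```shell
# Print one line per .naf file under the given directory: "CRC SIZE ./relative/path",
# sorted by path so that two manifests can be compared with diff.
write_manifest() (
    cd "$1" || exit 1
    find . -type f -name '*.naf' -exec cksum {} + | LC_ALL=C sort -k 3
)

# Possible usage, assuming the data lives in /data/GenomeSync/naf:
#   write_manifest /data/GenomeSync/naf > "genomesync-manifest-$(date +%Y-%m-%d).txt"
```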

Synchronizing entire GenomeSync

Example command to synchronize your local GenomeSync data (previously downloaded to '/data/GenomeSync'):

lftp -e 'open http://genomesync.nig.ac.jp/ && mirror -c --delete --delete-first naf /data/GenomeSync/naf && exit'

(If the destination directory /data/GenomeSync is missing, this command will just download everything.)

Selective synchronization

For example, let's suppose that you previously downloaded the representative archaea genomes into './GenomeSync'. Now you would like to update this set. First you will need to remove the outdated genomes:

comm -1 -3 <(curl -s 'http://genomesync.nig.ac.jp/selector/?paths=1&t=(rep)Archaea' | sort) <(cd ./GenomeSync/naf; find * -type f | sort) | xargs -d '\n' -I F rm './GenomeSync/naf/F'

Next you can download the new genomes:

curl 'http://genomesync.nig.ac.jp/selector/?t=(rep)Archaea' | wget -i - --directory-prefix=./GenomeSync -x -nH -nc
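
Since the removal step deletes files, you may prefer to preview what it would remove. The sketch below performs the same comparison as the comm command above, but only prints the outdated paths. list_stale is a hypothetical helper, and the two input files (one path per line) are assumed to have been saved beforehand.

```shell
# Print the paths that exist locally but are absent from the upstream list.
# Usage: list_stale UPSTREAM_LIST LOCAL_LIST
list_stale() (
    tmp=$(mktemp -d) || exit 1
    trap 'rm -rf "$tmp"' EXIT
    # comm needs both inputs sorted in the same collation order.
    LC_ALL=C sort "$1" > "$tmp/upstream"
    LC_ALL=C sort "$2" > "$tmp/local"
    # -1 and -3 suppress lines unique to upstream and lines common to both.
    LC_ALL=C comm -13 "$tmp/upstream" "$tmp/local"
)

# Possible usage (needs network access, so left as comments):
#   curl -s 'http://genomesync.nig.ac.jp/selector/?paths=1&t=(rep)Archaea' > upstream.txt
#   (cd ./GenomeSync/naf && find . -type f | sed 's|^\./||') > local.txt
#   list_stale upstream.txt local.txt
```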