GenomeSync

2024-04-30 update:
921,420 genomes
14,068 Gbp
2,995 GB

What?

GenomeSync is a database of genome sequences, designed for convenience and efficiency when downloading and using them.

Why?

Downloading a comprehensive set of genomes and keeping it up-to-date is a non-trivial task when working with traditional databases. GenomeSync makes it easy to maintain an up-to-date set of genomes.

What genomes are included?

GenomeSync aims to include all publicly available genomes. However, to avoid data explosion, only one genome per taxonomic node (species or subspecies) is included for animals and plants.

Most genomes come from NCBI Assembly, but public genome data released elsewhere is also added when we find it.

You can see how each taxon is represented by genome data on the Statistics page.

Taxonomy

All genomes are named according to the NCBI taxonomy database, and the matching taxonomy dump is available here: taxdmp.zip.

Downloading

GenomeSync data is available at http://genomesync.nig.ac.jp/naf/. The genomes can be downloaded one by one, or all at once using tools such as wget. For selective downloading, wget can be combined with the Genome Selector tool. (Please see the "Downloading" page for details.)
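As a rough illustration of the "all at once" case, the sketch below wraps a recursive wget call in a shell function. The function name download_naf and the particular choice of wget options are assumptions made for this example, not the project's official recipe; the "Downloading" page remains the authoritative reference. Note that the full collection is close to 3 TB, so make sure the target disk has enough space.

```shell
# Base URL of the GenomeSync "naf" tree, as given in the text above.
GENOMESYNC_URL="http://genomesync.nig.ac.jp/naf/"

# download_naf [target_dir]
# Hypothetical helper: recursively fetch the whole "naf" tree into target_dir
# (defaults to the current directory). The files end up under target_dir/naf/.
download_naf() {
    target_dir="${1:-.}"
    # --recursive --level=inf    : follow the directory listing to any depth
    # --no-parent                : never climb above /naf/
    # --no-host-directories      : do not create a "genomesync.nig.ac.jp" folder
    # --continue                 : resume partially downloaded files
    # --reject 'index.html*'     : discard the generated directory index pages
    wget --recursive --level=inf --no-parent --no-host-directories \
         --continue --reject 'index.html*' \
         --directory-prefix="$target_dir" \
         "$GENOMESYNC_URL"
}
```

Usage: download_naf /data/genomes would create /data/genomes/naf (the path is only an example).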

Synchronizing

Suppose you have already downloaded the "naf" directory and now want to synchronize your copy with the upstream one. This can be done using the lftp tool:

lftp -e 'open http://genomesync.nig.ac.jp/ && mirror -c --delete --delete-first naf naf && exit'
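Because the --delete option removes local files that no longer exist upstream, it may be worth previewing the changes first. The sketch below is the same command with lftp's mirror --dry-run option added, wrapped in a function whose name (sync_naf_preview) is made up for this example. It assumes it is run from the directory that contains your local "naf" copy.

```shell
# sync_naf_preview
# Hypothetical helper: print the transfers and deletions that a real
# synchronization would perform, without changing any local files.
sync_naf_preview() {
    lftp -e 'open http://genomesync.nig.ac.jp/ && mirror --dry-run -c --delete --delete-first naf naf && exit'
}
```

If the listed actions look right, run the original command without --dry-run to apply them.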

Selective synchronization for a specific taxon (or a combination of taxa) is also possible. Please see the "Downloading" page for details.

File formats

Individual genomes are stored in the NAF format. They can be decompressed using this command: unnaf file.naf >file.fa. They can also be used without storing the decompressed file, by piping unnaf output to another tool. E.g., this counts the sequences in a genome: unnaf file.naf | grep ">" | wc -l.
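The piping approach extends to any tool that reads FASTA from standard input. As a minimal sketch, the function below (fasta_stats, a name invented for this example) reports the number of sequences and the total number of bases without writing a decompressed file to disk. It assumes plain FASTA input, which is what the unnaf command above produces.

```shell
# fasta_stats
# Hypothetical helper: read FASTA on stdin, print "<sequences> <bases>".
fasta_stats() {
    awk '
        # Header line: count one more sequence and skip to the next line.
        /^>/ { seqs++; next }
        # Sequence line: strip whitespace, then add its length to the total.
        { gsub(/[ \t\r]/, ""); bases += length($0) }
        END { printf "%d %d\n", seqs, bases }
    '
}
```

Usage: unnaf file.naf | fasta_stats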