GenomeSync

2024-10-08 update:
1,083,950 genomes
15,505 Gbp
3,282 GB

About

GenomeSync is a collection of whole genome sequences aggregated from public databases (mostly NCBI) and compressed into the NAF format. It aims to include all publicly available genomes; however, for practicality it includes only one genome per taxonomic node for plants and animals. Please explore the content on the Statistics page. Genome names are consistent with the bundled snapshot of NCBI taxonomy: taxdmp.zip.

Downloading and synchronizing

GenomeSync data is available at http://genomesync.nig.ac.jp/naf/. The genomes can be downloaded one by one, or all at once using tools such as wget. For selective downloading, wget can be combined with the Genome Selector tool. A previously downloaded set of genomes can be synchronized to the upstream GenomeSync. Please see details on the Downloading page.
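
As a rough illustration of the "all at once" option, the sketch below wraps a recursive wget call in a shell function. It uses only standard wget options. The function name mirror_genomesync, the default destination directory, and the assumptions that all genome files end in .naf and sit within wget's default recursion depth are illustrative and not taken from GenomeSync documentation. The Downloading page remains the authoritative source for the recommended commands.

```shell
# Sketch only: mirror .naf files from GenomeSync using standard wget options.
# The function name and default destination directory are hypothetical.
GENOMESYNC_URL="http://genomesync.nig.ac.jp/naf/"

mirror_genomesync() {
    dest="${1:-genomesync}"
    # -r   recurse into subdirectories
    # -np  never ascend above the /naf/ directory
    # -N   timestamping: skip files whose local copy is already up to date
    # -nH  do not create a directory named after the host
    # -A   keep only files matching *.naf (assumed suffix of genome files)
    # -P   directory under which downloaded files are stored
    wget -r -np -N -nH -A '*.naf' -P "$dest" "$GENOMESYNC_URL"
}
```

Calling mirror_genomesync /data/genomesync would start the download into that directory. The full collection is several terabytes, so check available disk space first. Re-running the function fetches only new or changed files because of -N, but it does not delete local files that were removed upstream, so it is not a full replacement for the synchronization procedure described on the Downloading page.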

Decompressing

Individual genomes are stored in the NAF format. They can be decompressed using this command: unnaf file.naf >file.fa. They can also be used without storing the decompressed file, by piping unnaf output to another tool, e.g.: unnaf file.naf | grep ">" | wc -l.
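
The piping approach extends naturally to a whole directory of downloaded genomes. The following is a minimal sketch, assuming unnaf is installed and on the PATH. The function names count_fasta_records and count_records_in_dir are hypothetical helpers introduced here for illustration.

```shell
# Count FASTA records read from standard input.
# Each record starts with a header line beginning with ">".
count_fasta_records() {
    grep -c '^>'
}

# For every .naf file in a directory (default: current directory),
# print the file name and its number of sequences, separated by a tab.
# Nothing is written to disk: unnaf output is piped straight to the counter.
count_records_in_dir() {
    dir="${1:-.}"
    for f in "$dir"/*.naf; do
        [ -e "$f" ] || continue   # skip the literal pattern if no files match
        printf '%s\t%s\n' "$f" "$(unnaf "$f" | count_fasta_records)"
    done
}
```

For a single file, unnaf file.naf | count_fasta_records gives the same count as the grep and wc pipeline above. Anchoring the pattern with ^ restricts matches to ">" at the start of a line, which is where FASTA headers begin.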

Citation

Kirill Kryukov, So Nakagawa, Tadashi Imanishi (2024) “GenomeSync: a synchronizable database of genome sequences” iDarwin, vol. 4, 4-23.