GenomeSync
2024-12-03 update: 1,111,526 genomes, 16,235 Gbp, 3,431 GB
About
GenomeSync is a collection of whole genome sequences aggregated from public databases (mostly NCBI) and compressed into the NAF format. It aims to include all publicly available genomes. For practicality, however, it includes only one genome per taxonomic node for plants and animals. Please explore the content on the Statistics page. Genome names are consistent with the bundled snapshot of NCBI taxonomy: taxdmp.zip.
Downloading and synchronizing
GenomeSync data is available at http://genomesync.nig.ac.jp/naf/. The genomes can be downloaded one by one, or all at once using tools such as wget. For selective downloading, wget can be combined with the Genome Selector tool. A previously downloaded set of genomes can be synchronized with the upstream GenomeSync. See the Downloading page for details.
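As an illustration of the "all at once" option, here is a minimal sketch of a recursive wget download. The function name mirror_genomesync and the default destination directory are hypothetical, and the choice of flags is an assumption based on standard GNU wget options, not the command recommended by GenomeSync; the Downloading page remains the authoritative reference. Note that the full collection is about 3,431 GB.

```shell
# Hypothetical helper: mirror the GenomeSync NAF tree into a local directory.
# The flags below are standard GNU wget options chosen for this sketch;
# they are not taken from the GenomeSync documentation.
mirror_genomesync() {
    dest="${1:-genomesync}"   # local destination directory (assumed default)
    # --recursive / --level=inf : follow the directory listing to any depth
    # --no-parent               : never ascend above /naf/
    # --no-host-directories     : do not create a directory named after the host
    # --timestamping            : skip files that are not newer than local copies
    # --accept '*.naf'          : keep only NAF files
    wget --recursive --level=inf --no-parent \
         --no-host-directories --timestamping \
         --accept '*.naf' \
         --directory-prefix="$dest" \
         http://genomesync.nig.ac.jp/naf/
}
```

For example, mirror_genomesync /data/genomesync would place the files under /data/genomesync/naf/. Re-running it fetches only new or updated files, but it does not delete local genomes that were removed upstream, so it is not a full synchronization.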
Decompressing
Individual genomes are stored in the NAF format. They can be decompressed using this command: unnaf file.naf >file.fa. They can also be used without storing the decompressed file, by piping unnaf output to another tool. For example, this command counts the sequences in a genome: unnaf file.naf | grep ">" | wc -l.
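The same piping pattern extends to a whole directory of downloaded genomes. The following is a minimal sketch, assuming unnaf is on the PATH and the .naf files sit directly in one directory; the function name count_sequences is hypothetical and not part of GenomeSync or NAF.

```shell
# Hypothetical helper: for each .naf file in a directory, print the file name
# and its number of sequences, without writing decompressed FASTA to disk.
count_sequences() {
    dir="${1:-.}"                       # directory to scan (default: current)
    for f in "$dir"/*.naf; do
        [ -e "$f" ] || continue         # skip when no .naf files match
        # FASTA header lines start with ">"; grep -c counts them.
        # "|| true" keeps going when a file has zero headers.
        n=$(unnaf "$f" | grep -c '^>' || true)
        printf '%s\t%s\n' "$f" "$n"
    done
}
```

Calling count_sequences path/to/genomes prints one tab-separated line per genome. Because unnaf streams its output, only one genome is decompressed at a time and nothing is stored on disk.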
Citation
Kirill Kryukov, So Nakagawa, Tadashi Imanishi (2024) “GenomeSync: a synchronizable database of genome sequences” iDarwin, vol. 4, 4-23.