Researchers from Microsoft and the University of Washington have demonstrated the first fully automated system to store and retrieve data in manufactured DNA, a key step in moving the technology out of the laboratory and into practical applications. A separately proposed DNA storage scheme uses chaotic mapping to control the GC content of base sequences and builds a dual-rule coding table based on Goldman's rotational coding. The table maps binary sequences to base sequences while conforming to the constraints of DNA storage; the scheme balances GC content, enhances the stability of the DNA sequence, and greatly reduces the error rate during sequencing. It also incorporates a mechanism to prevent error propagation during decoding. With an added Reed-Solomon code, 90% of the data can still be recovered after a 2% error rate is introduced, indicating high robustness and reliability.
Our tool for rapid functional annotation for phages, viruses, bacteria, archaea, eukaryotes, and communities is finally published. Supports a variety of databases: KEGG/KO, FOAM, COG, PHROG, VOG, pVOG, CAZy, EC. Updates soon for a variety of other dbs https://t.co/D9m1hRxYId
GenomicLLM performs classification tasks such as identification of ORFs, forward vs reverse strands, and coding vs non-coding sequences; regression tasks such as GC content calculation; and generation tasks such as nucleotide-to-amino-acid sequence translation and reverse complement sequence generation.
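For reference, the simpler sequence operations named in this post (GC content, reverse complement, translation) have exact rule-based definitions. The sketch below is not GenomicLLM code; the function names are hypothetical, and it assumes the standard genetic code and plain A/C/G/T input.

```python
# Illustrative helpers only; names are hypothetical and unrelated to GenomicLLM.

# Complement table for upper- and lower-case bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate in frame 0; a trailing partial codon is dropped and
    codons with unknown bases become 'X'."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))
```

For example, `translate("ATGGCCTAA")` yields `"MA*"`, where `*` marks a stop codon.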
GenomicLLM is an all-in-one tool that enables us to carry out a much wider range of functions, including classification tasks with performance comparable to the state-of-the-art DNABERT-2 and HyenaDNA, as well as regression and generation tasks.
smrest (somatic mutation rate estimator) is a prototype mutation caller for real long-read data. The program haplotags each read, discovers candidate variants, then calculates class probabilities for each candidate using a read-haplotype likelihood model.
CNCA constructs multiple alignments of small genomes by integrating both coding and non-coding sequences. This preserves regions traditionally ignored in conventional back-translation methods, such as non-coding regions.
CNCA aligns annotated genomes from GenBank files. It generates a nucleotide alignment that is then updated based on the protein sequence alignment. The final nucleotide alignment matches the protein alignment and is guaranteed to contain no frameshifts.
doubletrouble can also calculate substitution rates per substitution site (i.e., Ka and Ks) from duplicate pairs, find peaks in Ks distributions with Gaussian Mixture Models (GMMs), and classify gene pairs into age groups based on Ks peaks.
doubletrouble can identify gene pairs derived from six duplication modes (segmental, tandem, proximal, retrotransposon-derived, DNA transposon-derived, and dispersed duplications), calculate substitution rates, and detect signatures of putative whole-genome duplication events.
doubletrouble: an R/Bioconductor package for the identification, classification, and analysis of gene and genome duplications https://t.co/43PI0Z4d6o https://t.co/reohaGZQPy
MetaCerberus offers scalable gene elucidation to major public databases, including KEGG (KO), COGs, CAZy, FOAM, and specific databases for viruses, including VOGs and PHROGs, from single genomes to metacommunities.
MetaCerberus is a massively parallel, fast, low-memory, scalable annotation tool for inferring gene function across genomes to metacommunities. MetaCerberus provides an elusive HMM/HMMER-based tool at a rapid scale with low memory.
Microsoft and UW demonstrate first fully automated DNA data storage. Researchers from Microsoft and the University of Washington have demonstrated the first fully automated system to store and retrieve data in manufactured DNA, a key step in moving the technology out of the… https://t.co/0u3XE2JX9G
doubletrouble: an R/Bioconductor package for the identification, classification, and analysis of gene and genome duplications https://t.co/xWyiownxAX #biorxiv_bioinfo
The decoding process includes a mechanism to prevent error propagation. With an added Reed-Solomon code, 90% of the data can still be recovered after a 2% error rate is introduced, indicating that the proposed DNA storage scheme is highly robust and reliable.
This storage scheme effectively balances the GC content, avoids generating homopolymers, enhances the stability of the DNA sequence, and greatly reduces the error rate during sequencing.
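The two constraints mentioned here, balanced GC content and no long homopolymers, are easy to check on any candidate strand. The sketch below is only an illustration: the function names are hypothetical, and the thresholds (40-60% GC, runs of at most 3 identical bases) are assumed defaults, not values taken from the scheme.

```python
from itertools import groupby


def max_homopolymer_run(seq):
    """Length of the longest run of identical consecutive bases."""
    return max((len(list(group)) for _, group in groupby(seq.upper())), default=0)


def satisfies_constraints(seq, gc_min=0.4, gc_max=0.6, max_run=3):
    """Check a strand against assumed GC-content and homopolymer limits.

    The default thresholds are illustrative assumptions, not the paper's values.
    """
    seq = seq.upper()
    if not seq:
        return False
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return gc_min <= gc <= gc_max and max_homopolymer_run(seq) <= max_run
```

With these defaults, `satisfies_constraints("ACGTACGT")` passes, while `satisfies_constraints("AAAAACGT")` fails on both the GC fraction and the run of five A's.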
A new DNA storage system uses chaotic mapping to control the GC content of base sequences and builds a dual-rule coding table, based on Goldman's rotational coding, that maps binary sequences to base sequences while conforming to the constraints of DNA storage.
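To make the idea of rotational coding concrete, here is a minimal sketch of a Goldman-style rotating code, in which each base-3 digit selects one of the three bases that differ from the previous base, so no base ever repeats. It does not reproduce the dual-rule table or the chaotic mapping described in the post. The fixed 6-trits-per-byte conversion, the rotation table layout, the start base, and all function names are assumptions made for illustration.

```python
# Assumed rotation table: for each previous base, trit 0/1/2 picks the next base.
# The previous base itself is never an option, which rules out homopolymers.
ROTATION = {"A": "CGT", "C": "GTA", "G": "TAC", "T": "ACG"}
TRITS_PER_BYTE = 6  # 3**6 = 729 >= 256, so six trits can hold any byte


def bytes_to_trits(data):
    """Write each byte as six base-3 digits, most significant first."""
    trits = []
    for byte in data:
        digits = []
        for _ in range(TRITS_PER_BYTE):
            byte, remainder = divmod(byte, 3)
            digits.append(remainder)
        trits.extend(reversed(digits))
    return trits


def trits_to_bytes(trits):
    """Inverse of bytes_to_trits for well-formed input."""
    out = bytearray()
    for i in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        for trit in trits[i:i + TRITS_PER_BYTE]:
            value = value * 3 + trit
        out.append(value)
    return bytes(out)


def encode(data, start="A"):
    """Map bytes to a DNA string with no two identical adjacent bases."""
    prev, bases = start, []
    for trit in bytes_to_trits(data):
        prev = ROTATION[prev][trit]
        bases.append(prev)
    return "".join(bases)


def decode(dna, start="A"):
    """Recover the bytes by reading each base relative to the one before it."""
    prev, trits = start, []
    for base in dna:
        trits.append(ROTATION[prev].index(base))
        prev = base
    return trits_to_bytes(trits)
```

A round trip such as `decode(encode(b"DNA"))` returns `b"DNA"`. Note that a rotating code alone only prevents homopolymers; it does not by itself balance GC content, which is the role the post assigns to the chaotic mapping.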