Genome engineering is undergoing unprecedented development and is now becoming widely available. Genetic engineering attribution can make sequence-lab associations and assist forensic experts in ensuring responsible biotechnology innovation and reducing misuse of engineered DNA sequences. Here we propose a method based on metric learning to rank the most likely labs of origin while simultaneously generating embeddings for plasmid sequences and labs. These embeddings can be used for various downstream tasks, such as clustering DNA sequences and labs or serving as features in machine learning models. Our approach employs a circular shift augmentation method and correctly ranks the lab of origin within its top-10 predictions 90% of the time. We also demonstrate that we can perform few-shot learning and obtain 76% top-10 accuracy using only 10% of the sequences. Finally, our approach can also extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model's outputs.
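
The circular shift augmentation and top-10 ranking mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`circular_shift`, `augment`, `rank_labs`, `top_k_accuracy`) are hypothetical, the embeddings are assumed to be plain numeric vectors produced by some already-trained model, and squared Euclidean distance is an assumed choice of metric that the abstract does not specify.

```python
from __future__ import annotations

import random


def circular_shift(seq: str, offset: int) -> str:
    """Rotate a circular sequence so that it starts at position `offset`."""
    if not seq:
        return seq
    offset %= len(seq)
    return seq[offset:] + seq[:offset]


def augment(seq: str, n_copies: int, rng: random.Random) -> list[str]:
    """Return `n_copies` random rotations of the same plasmid sequence.

    A plasmid is circular, so every rotation encodes the same molecule.
    """
    if not seq:
        return [seq] * n_copies
    return [circular_shift(seq, rng.randrange(len(seq))) for _ in range(n_copies)]


def squared_distance(a, b) -> float:
    """Squared Euclidean distance (an assumed metric, not taken from the paper)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank_labs(seq_emb, lab_embs: dict, k: int = 10) -> list:
    """Return the k labs whose embeddings lie closest to the sequence embedding."""
    ranked = sorted(lab_embs, key=lambda lab: squared_distance(seq_emb, lab_embs[lab]))
    return ranked[:k]


def top_k_accuracy(seq_embs, true_labs, lab_embs: dict, k: int = 10) -> float:
    """Fraction of sequences whose true lab appears among the top-k ranked labs."""
    hits = sum(
        true_lab in rank_labs(emb, lab_embs, k)
        for emb, true_lab in zip(seq_embs, true_labs)
    )
    return hits / len(true_labs)
```

Under these assumptions, `augment(plasmid, 8, random.Random(0))` would yield eight rotations of one plasmid for training, and `top_k_accuracy(seq_embs, true_labs, lab_embs, k=10)` would compute the kind of top-10 accuracy figure the abstract reports.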


Source: http://dx.doi.org/10.1038/s43588-022-00234-z


