Publications by authors named "Arsenii Zinkevich"

A DNA sequence pattern, or "motif", is an essential representation of DNA-binding specificity of a transcription factor (TF). Any particular motif model has potential flaws due to shortcomings of the underlying experimental data and computational motif discovery algorithm. As a part of the Codebook/GRECO-BIT initiative, here we evaluated at large scale the cross-platform recognition performance of positional weight matrices (PWMs), which remain popular motif models in many practical applications.
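As a rough illustration of what a PWM does, the sketch below converts a small count matrix into log-odds weights and scans a sequence for the best-scoring window. The count matrix, pseudocount, uniform background, and function names are all assumptions made for this example; they are not taken from the study.

```python
import math

ALPHABET = "ACGT"

# Hypothetical 4-position count matrix (rows: positions; columns: A, C, G, T).
COUNTS = [
    [8, 0, 1, 1],
    [0, 9, 0, 1],
    [1, 0, 9, 0],
    [0, 1, 0, 9],
]

def counts_to_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log2-odds weights against a uniform background."""
    pwm = []
    for row in counts:
        total = sum(row) + pseudocount * len(row)
        pwm.append([math.log2((c + pseudocount) / total / background) for c in row])
    return pwm

def score_window(pwm, window):
    """Sum the per-position weights for one window of the same width as the PWM."""
    return sum(pwm[i][ALPHABET.index(base)] for i, base in enumerate(window))

def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window on the given strand."""
    width = len(pwm)
    hits = [(score_window(pwm, sequence[i:i + width]), i)
            for i in range(len(sequence) - width + 1)]
    return max(hits)

pwm = counts_to_pwm(COUNTS)
score, start = best_hit(pwm, "TTACGTAA")
```

This sketch scans only the forward strand; a practical scanner would usually also score the reverse complement.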

Article Synopsis
  • A systematic evaluation is necessary to understand how different model architectures and training strategies affect the performance of genomics models, prompting the organization of a DREAM Challenge.
  • In the challenge, competitors trained models on a dataset of millions of random promoter DNA sequences and their expression levels measured in yeast; the best models employed a variety of neural network architectures and training approaches.
  • The Prix Fixe framework enabled an in-depth analysis of these models and led to improved performance; the top models not only excelled on yeast data but also outperformed existing benchmarks on Drosophila and human datasets.
Article Synopsis
  • The prediction of RNA structure from its sequence is challenging due to a lack of experimental data, which has slowed advancement in the field.
  • Researchers have developed a dataset called Ribonanza, consisting of chemical mapping data from two million RNA sequences, collected through crowdsourcing platforms like Eterna.
  • Utilizing this dataset, they created a deep learning model named RibonanzaNet, which, when fine-tuned, demonstrates superior performance in predicting various RNA behaviors, potentially improving understanding of RNA structures.

Neural networks have emerged as immensely powerful tools in predicting functional genomic regions, notably evidenced by recent successes in deciphering gene regulatory logic. However, a systematic evaluation of how model architectures and training strategies impact genomics model performance is lacking. To address this gap, we held a DREAM Challenge where competitors trained models on a dataset of millions of random promoter DNA sequences and corresponding expression levels, experimentally determined in yeast, to best capture the relationship between regulatory DNA and gene expression.

We present a major update of the HOCOMOCO collection that provides DNA binding specificity patterns of 949 human transcription factors and 720 mouse orthologs. To make this release, we performed motif discovery in peak sets that originated from 14 183 ChIP-Seq experiments and reads from 2554 HT-SELEX experiments yielding more than 400 thousand candidate motifs. The candidate motifs were annotated according to their similarity to known motifs and the hierarchy of DNA-binding domains of the respective transcription factors.

Motivation: The increasing volume of data from high-throughput experiments including parallel reporter assays facilitates the development of complex deep-learning approaches for modeling DNA regulatory grammar.

Results: Here, we introduce LegNet, an EfficientNetV2-inspired convolutional network for modeling short gene regulatory regions. By approaching the sequence-to-expression regression problem as a soft classification task, LegNet secured first place for the autosome.org team in the DREAM 2022 challenge of predicting gene expression from gene promoter sequences.
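One way to read "regression as soft classification" is sketched below: a scalar target is spread over discrete bins as a probability vector, and a prediction is decoded as the expected bin index. The bin count, the Gaussian spread, and the function names are assumptions for illustration only and may differ from what LegNet actually does.

```python
import math

def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def soft_labels(value, n_bins=18, std=0.5):
    """Spread a scalar target over integer bins 0..n_bins-1.

    Bin k covers [k - 0.5, k + 0.5); the two outer bins absorb the tails,
    so the probabilities always sum to 1.
    """
    probs = []
    for k in range(n_bins):
        lo = -math.inf if k == 0 else k - 0.5
        hi = math.inf if k == n_bins - 1 else k + 0.5
        probs.append(normal_cdf(hi, value, std) - normal_cdf(lo, value, std))
    return probs

def expected_value(probs):
    """Decode a class-probability vector back to a scalar prediction."""
    return sum(k * p for k, p in enumerate(probs))

target = soft_labels(7.0)
decoded = expected_value(target)
```

In training, a network would output a probability vector of the same length and be fit against `target` with a divergence-style loss; that part is omitted here.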

Coronavirus disease 2019 (COVID-19) is an acute infection of the respiratory tract that emerged in December 2019 in Wuhan, China. It was quickly established that both the symptoms and the disease severity may vary from one case to another and several strains of SARS-CoV-2 have been identified. To gain a better understanding of the wide variety of SARS-CoV-2 strains and their associated symptoms, thousands of SARS-CoV-2 genomes have been sequenced in dozens of countries.

Many problems of modern genetics and functional genomics require the assessment of functional effects of sequence variants, including gene expression changes. Machine learning is considered to be a promising approach for solving this task, but its practical applications remain a challenge due to the insufficient volume and diversity of training data. A promising source of valuable data is a saturation mutagenesis massively parallel reporter assay, which quantitatively measures changes in transcription activity caused by sequence variants.
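To make the idea of saturation mutagenesis concrete, the sketch below enumerates every single-nucleotide substitution of a short reference sequence, which is the variant set such an assay would aim to measure. The function name and the toy reference sequence are hypothetical.

```python
def saturation_variants(reference):
    """List all single-nucleotide substitutions of a reference sequence.

    Each entry is (position, reference base, alternative base, mutated sequence),
    with 0-based positions.
    """
    variants = []
    for pos, ref_base in enumerate(reference):
        for alt in "ACGT":
            if alt != ref_base:
                mutated = reference[:pos] + alt + reference[pos + 1:]
                variants.append((pos, ref_base, alt, mutated))
    return variants

variants = saturation_variants("ACGT")
```

A reference of length n yields 3n variants, so the number of measurements grows linearly with the length of the tested region.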

Background: Transposons are selfish genetic elements that self-reproduce in host DNA. They were active during evolutionary history and now occupy almost half of mammalian genomes. Close insertions of transposons reshaped structure and regulation of many genes considerably.
