Stakeholders in software development projects have varied information needs when making rational decisions in their daily work. Satisfying these needs requires substantial knowledge of where and how the relevant information is stored, and it consumes valuable time that is often not available. Reducing the need for this knowledge is a well-suited text-to-SQL problem, a field in which public benchmark datasets are scarce and needed. We propose the SEOSS-Queries dataset, consisting of natural language utterances and accompanying SQL queries drawn from previous studies, software projects, issue tracking tools, and expert surveys to cover a wide variety of information-need perspectives. The dataset consists of 1,162 English utterances that translate into 166 SQL queries; each query has four precise utterances and three more general ones. In addition, the dataset contains 393,086 labeled utterances extracted from issue tracker comments. We provide pre-trained SQLNet and RatSQL baseline models for benchmark comparisons and a replication package that facilitates seamless application, and we discuss various other tasks that may be solved and evaluated using the dataset. The whole dataset, with paraphrased natural language utterances and SQL queries, is hosted at figshare.com/s/75ed49ef01ac2f83b3e2.
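
The composition stated in the abstract (166 SQL queries, each paired with four precise and three more general utterances) can be illustrated with a minimal sketch. The record layout, the field names, and the example query with its `issue` table are hypothetical and are not taken from the dataset's actual file format or schema; only the counts come from the abstract.

```python
from dataclasses import dataclass


@dataclass
class QueryEntry:
    """Hypothetical record: one SQL query with its paraphrased utterances."""
    sql: str
    specific_utterances: list[str]  # the "precise" phrasings
    general_utterances: list[str]   # the "more general" phrasings

    def utterances(self) -> list[str]:
        # All natural language phrasings that map to this one SQL query.
        return self.specific_utterances + self.general_utterances


def total_utterances(n_queries: int, specific: int = 4, general: int = 3) -> int:
    # Each query contributes a fixed number of utterances.
    return n_queries * (specific + general)


# Illustrative entry; table and column names are assumptions, not the real schema.
example = QueryEntry(
    sql="SELECT COUNT(*) FROM issue WHERE status = 'Open'",
    specific_utterances=[
        "Count the issues whose status is Open.",
        "How many issues have the status Open?",
        "Return the number of issues with status Open.",
        "Give me the count of issues in status Open.",
    ],
    general_utterances=[
        "How many open issues are there?",
        "How much work is still open?",
        "What is left to do?",
    ],
)

print(len(example.utterances()))  # utterances attached to one query
print(total_utterances(166))      # utterances across all queries
```

With four precise and three general utterances per query, 166 queries yield 166 × 7 = 1,162 utterances, which matches the figure reported in the abstract.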


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9079685
DOI: http://dx.doi.org/10.1016/j.dib.2022.108211

Publication Analysis

Top Keywords

sql queries: 12
natural language: 8
language utterances: 8
dataset: 6
utterances: 5
seoss-queries software: 4
software engineering: 4
engineering dataset: 4
dataset text-to-sql: 4
text-to-sql question: 4

Similar Publications

An open-source SQL database schema for integrated clinical and translational data management in clinical trials.

Clin Trials

December 2024

Cancer Research UK Southampton Clinical Trials Unit, MP131, Southampton General Hospital, University of Southampton, Southampton, UK.

Unlocking the power of personalised medicine in oncology hinges on the integration of clinical trial data with translational data (i.e. biospecimen-derived molecular information).


Background: This study describes how New York City (NYC) Health + Hospitals implemented a large-scale Community Health Worker (CHW) program in adult primary care clinics between January 2022 and December 2023 and established metrics to monitor program implementation. This study is timely as healthcare systems consider how to scale high-quality CHW programs.

Methods: We collected metrics in the following areas: (1) Workforce demographics, team structure, and training; (2) Enrolled patient demographics; (3) Patient-centered metrics, such as patient counts (e.


The Block Copolymer Phase Behavior Database.

J Chem Inf Model

August 2024

Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States.

The Block Copolymer Database (BCDB) is a platform that allows users to search, submit, visualize, benchmark, and download experimental phase measurements and their associated characterization information for di- and multiblock copolymers. To the best of our knowledge, there is no widely accepted data model for publishing experimental and simulation data on block copolymer self-assembly. This proposed data schema with traceable information can accommodate any number of blocks and at the time of publication contains over 5400 block copolymer total melt phase measurements mined from the literature and manually curated and simulation data points of the phase diagram generated from self-consistent field theory that can rapidly be augmented.


Towards cross-application model-agnostic federated cohort discovery.

J Am Med Inform Assoc

October 2024

Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15260, United States.

Objectives: To demonstrate that 2 popular cohort discovery tools, Leaf and the Shared Health Research Information Network (SHRINE), are readily interoperable. Specifically, we adapted Leaf to interoperate and function as a node in a federated data network that uses SHRINE and dynamically generate queries for heterogeneous data models.

Materials And Methods: SHRINE queries are designed to run on the Informatics for Integrating Biology & the Bedside (i2b2) data model.


Introduction: In response to the increasing prevalence of electronic medical records (EMRs) stored in databases, healthcare staff are encountering difficulties retrieving these records due to their limited technical expertise in database operations. As these records are crucial for delivering appropriate medical care, there is a need for an accessible method for healthcare staff to access EMRs.

Methods: To address this, natural language processing (NLP) for Text-to-SQL has emerged as a solution, enabling non-technical users to generate SQL queries using natural language text.

