An Improved Python Package for Loading Datasets from the UCI Machine Learning Repository.

bioRxiv

Department of Pathology and the Division of Clinical Informatics, Department of Medicine, BIDMC and with Harvard Medical School, Boston, MA 02215.

Published: October 2024

AI Article Synopsis

  • The UCI Machine Learning Repository (UCIMLR) is a well-known source for datasets, but over 28% of the top 250 datasets are difficult to import due to their nonstandard formats stored in .zip files.
  • A new utility, load University California Irvine examples, has been created to automatically detect the format of these challenging datasets and import them while maintaining a tabular structure.
  • The utility was designed using the top 100 datasets and benchmarked on the next 130, achieving a 95.4% success rate compared to 73.1% for the existing method. It is available as a Python package on PyPI with 98% code coverage.

Article Abstract

The University of California-Irvine (UCI) Machine Learning (ML) Repository (UCIMLR) is consistently cited as one of the most popular dataset repositories, hosting hundreds of high-impact datasets. However, a significant portion, including 28.4% of the top 250, cannot be imported via the package that is provided and recommended by the UCIMLR website. Instead, they are hosted as .zip files containing nonstandard formats that are difficult to import without additional ad hoc processing. To address this issue, we present a utility, load University California Irvine examples, that automatically determines the data format and imports many of these previously non-importable datasets, while preserving as much of a tabular data structure as possible. The utility was designed using the top 100 most popular datasets and benchmarked on the next 130, where it achieved a success rate of 95.4% vs. 73.1% for the existing recommended package. It is available as a Python package on PyPI with 98% code coverage.
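
To make the idea of automatic format detection concrete, the following is a minimal sketch using only the Python standard library. It is not the package's implementation or API: the function name `load_tabular_from_zip`, the candidate delimiter set, and the "every row has the same number of columns" acceptance rule are all assumptions chosen for illustration.

```python
import csv
import io
import zipfile


def load_tabular_from_zip(zip_bytes, candidates=",;\t|"):
    """Return {member_name: rows} for each archive member that parses as a table.

    Hypothetical helper for illustration only; not the published package's API.
    """
    tables = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            try:
                text = archive.read(name).decode("utf-8")
            except UnicodeDecodeError:
                continue  # binary member; a real loader would handle more cases
            if not text.strip():
                continue  # empty file
            try:
                # Guess the delimiter from a sample of the file contents
                dialect = csv.Sniffer().sniff(text[:4096], delimiters=candidates)
            except csv.Error:
                continue  # no consistent delimiter found (e.g. a README)
            rows = [row for row in csv.reader(io.StringIO(text), dialect) if row]
            # Assumed acceptance rule: keep the member only if it is rectangular
            if rows and len({len(row) for row in rows}) == 1:
                tables[name] = rows
    return tables
```

A caller would pass the raw bytes of a downloaded archive and receive one list of rows per member that looks tabular. Fixed-width files, header detection, and non-UTF-8 encodings are left out of this sketch.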

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11526970
DOI: http://dx.doi.org/10.1101/2024.10.18.618994
