The proliferation of scientific podcasts has generated an extensive repository of audio content, rich in specialized terminology, diverse topics, and expert dialogues. Here, we introduce a computational framework designed to enhance large language models (LLMs) by leveraging informational content from publicly accessible podcasts across science, technology, engineering, mathematics, and medicine (STEMM) disciplines. This dataset, comprising over 3,700 hours of audio content, was transcribed to generate over 42 million text tokens.
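A minimal sketch of the transcription step described above. The abstract does not specify the ASR system or tokenizer used; openai-whisper and tiktoken are illustrative assumptions here, and the file paths and model size are hypothetical.

```python
import whisper
import tiktoken


def transcribe_and_count(audio_paths):
    """Transcribe podcast episodes and tally the resulting text tokens."""
    model = whisper.load_model("base")               # assumed ASR model size
    encoder = tiktoken.get_encoding("cl100k_base")   # assumed tokenizer
    transcripts = []
    total_tokens = 0
    for path in audio_paths:
        result = model.transcribe(path)              # returns dict with "text"
        transcripts.append(result["text"])
        total_tokens += len(encoder.encode(result["text"]))
    return transcripts, total_tokens


# Hypothetical usage on a single episode file.
transcripts, n_tokens = transcribe_and_count(["episode_001.mp3"])
print(f"Total text tokens: {n_tokens}")
```

At the scale reported (thousands of hours of audio yielding tens of millions of tokens), such a pipeline would typically be batched and parallelized; the loop above is kept sequential for clarity.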