1.1 Million Podcast Episodes, Finally in One Dataset

Source: arXiv:2411.07892 · PDF · SPoRC data (GitHub) · Hugging Face

Podcasts are everywhere—Pew puts monthly listenership at 42% of Americans 12+—but academic work on the medium has mostly used hand-picked shows or tiny samples. Litterer, Jurgens, and Card (University of Michigan) close that gap with SPoRC (Structured Podcast Research Corpus): 1.1 million English episodes from public RSS feeds in May–June 2020, 6.6 billion words of transcript text, plus speaker-role labels for every episode and turn-level audio features for 370K diarized episodes. It is the first open corpus meant to do for podcasts what big Twitter or Reddit dumps did for social media research.

  1. The pipeline is the product as much as the snapshot. They started from Podcast Index (~273K English shows active that spring), downloaded 1.3M episodes, transcribed them with Whisper (whisper-base.en), filtered out repetitive ASR glitches, and landed at 1.1M usable transcripts (<10% WER against professional transcripts on validation). A random subset got pyannote diarization and openSMILE prosody features (pitch, formants, MFCCs). Host and guest names are pulled from the transcript by a RoBERTa classifier trained on Prolific labels (κ≈0.77), because RSS metadata is a mess and hosts usually introduce themselves in the first 350 words.
  2. Religion is the elephant in the room. Self-assigned category labels are noisy, but Religion turns out to be the most common category in the corpus, and many of those episodes are recorded Christian sermons. LDA with 200 topics surfaces coherent islands (Wrestling, Bitcoin, Judaism) inside broad labels. Sports, Religion, and Business categories hang together topically; COVID and racial justice cut across categories, which makes them places where ideas might cross-pollinate even when guest networks do not.
  3. Guests wire the graph; not every genre plays. They build a podcast–guest bipartite graph and project it: 10,480 shows, 26,589 edges from shared guests. Business and Sports form tight modules (high modularity)—shows in those lanes reuse the same guest pool. Religion and Society are huge by episode count but invite guests far less often, so they are less connected for cross-show diffusion. Promotional tour guests are a real edge type in this medium.
  4. Podcasts do respond to news—slower than cable, wider than you’d guess. May–June 2020 was deliberately chosen (George Floyd, COVID crossing 100K US deaths). After Floyd’s murder, racial-justice topic share spikes over ~10 days (BLM peaks ~4 days after Floyd topics)—a “media storm” pattern, but slower than TV news storms in prior work. 21% of all shows mention “George Floyd” at least once by end of June. Peaks hit ~20% even in Sports and Religion; only News gives heavy weight to policing/protest framing; Society sustains elevated mention longer than Sports.
  5. Incidental politics is the implication. Widespread Floyd discussion outside News/Society supports the idea that listeners get political content from trusted hosts they chose for other reasons—aligned with incidental exposure findings on other platforms. That matters for misinformation research too; prior work already flags podcast trust and bad health claims during COVID.
  6. What SPoRC is not. Shows locked into exclusive platform deals are missing (the paper notes Joe Rogan as a notable omission). Video-only feeds are out of scope. One eight-week slice is thick but not longitudinal: dynamics after June 2020 are unknowable from this release alone. Role labeling still confuses ad reads with hosts (Ryan Reynolds in a sponsorship read gets labeled HOST). The release carries a non-commercial license.
  7. You might think transcripts are enough for podcast research. The paper argues otherwise. Long-form audio carries turn structure, prosody, and who speaks—dimensions Twitter text never had. SPoRC is built for computational social science and NLP (summarization, narrative detection, popularity prediction) at a scale Spotify’s deprecated 200K corpus never sustained.
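
To make the guest-network step in item 3 concrete, here is a minimal sketch of projecting a show–guest bipartite graph onto shows, so that two shows are linked whenever they share a guest. This is not the authors' code: the function name `project_shows`, the show names, and the guest names are all invented for illustration, and the real corpus would supply the (show, guest) pairs from its role labels.

```python
from collections import defaultdict
from itertools import combinations


def project_shows(appearances):
    """Project a show-guest bipartite graph onto shows.

    appearances: iterable of (show, guest) pairs.
    Returns {(show_a, show_b): number_of_shared_guests}, with show_a < show_b.
    """
    # Group shows by the guest who appeared on them.
    shows_by_guest = defaultdict(set)
    for show, guest in appearances:
        shows_by_guest[guest].add(show)

    # Every pair of shows that hosted the same guest gets one unit of edge weight.
    edges = defaultdict(int)
    for shows in shows_by_guest.values():
        for a, b in combinations(sorted(shows), 2):
            edges[(a, b)] += 1
    return dict(edges)


# Toy data, invented for illustration only.
appearances = [
    ("BizTalk", "Ana"), ("StartupHour", "Ana"),
    ("BizTalk", "Raj"), ("StartupHour", "Raj"),
    ("HoopsDaily", "Lee"), ("BizTalk", "Lee"),
    ("SundaySermon", "Pastor Kim"),
]
edges = project_shows(appearances)
```

In the toy data the two business shows end up with an edge of weight 2, while the sermon show, whose only guest appears nowhere else, has no edges at all. That is the same shape as the finding above: genres that rarely share guests stay disconnected in the projection no matter how many episodes they publish.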

The takeaway: If you study media, health communication, or information diffusion—or you build tools on podcast text—grab SPoRC before reinventing a scrape-and-transcribe pipeline. Read the Floyd case study as a template for event studies on an ecosystem that is fragmented, trusted, and mostly invisible to existing dashboards. And treat any single May 2020 finding as snapshot science: the corpus is a map of one summer, not the whole territory.
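
For readers who want to reuse the Floyd case study as an event-study template, the sketch below computes the kind of number quoted in item 4: the share of shows that have mentioned a term at least once by a given date. It assumes episodes arrive as `(show_id, date, transcript)` tuples, which is a hypothetical layout and not SPoRC's actual schema, and it uses plain substring matching where the paper may well do something more careful.

```python
from datetime import date


def cumulative_mention_share(episodes, term):
    """Share of shows that have mentioned `term` at least once, by date.

    episodes: iterable of (show_id, date, transcript) tuples (assumed layout).
    Returns a list of (date, share) pairs, one per date on which some show
    mentioned the term for the first time, in chronological order.
    """
    term = term.lower()
    all_shows = set()
    first_mention = {}  # show_id -> earliest date the term appears
    for show, day, text in episodes:
        all_shows.add(show)
        if term in text.lower():
            if show not in first_mention or day < first_mention[show]:
                first_mention[show] = day

    total = len(all_shows)
    days = sorted(set(first_mention.values()))
    return [
        (d, sum(1 for fm in first_mention.values() if fm <= d) / total)
        for d in days
    ]


# Toy episodes, invented for illustration only.
episodes = [
    ("news_show", date(2020, 5, 27), "Protests after George Floyd was killed."),
    ("sports_show", date(2020, 5, 30), "Players respond to George Floyd."),
    ("sports_show", date(2020, 6, 2), "More on George Floyd and the league."),
    ("faith_show", date(2020, 5, 30), "A sermon mentioning george floyd."),
    ("crypto_show", date(2020, 6, 1), "Bitcoin halving recap."),
]
curve = cumulative_mention_share(episodes, "George Floyd")
```

Counting shows by first mention, not episodes, keeps one prolific feed from dominating the curve. Splitting the same calculation by category would give the per-genre comparison the article describes.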

Crepi il lupo! 🐺