https://github.com/stiles/survivor-transcripts
Fetching and storing complete transcripts for each episode of the American television show and analyzing the text for keyword/phrase frequency.
- Host: GitHub
- URL: https://github.com/stiles/survivor-transcripts
- Owner: stiles
- License: cc0-1.0
- Created: 2024-07-17T21:00:14.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-12-19T18:04:16.000Z (over 1 year ago)
- Last Synced: 2025-10-30T13:45:21.302Z (7 months ago)
- Language: Python
- Size: 31.9 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Survivor transcripts
## About
This repository contains scripts for downloading and parsing show transcripts and for counting the keywords and phrases castaways use, both by season and across the series overall.
## Sources
Most transcripts are sourced from [subslikescript](https://subslikescript.com/series/Survivor-239195), with a few missing seasons pulled from the closed-captioning XML files embedded in the CBS/Paramount video player or from YouTube TV's timedtext API.
*Still need to find transcripts for season 45*
## Processes
- `scripts/fetch_transcripts.py`: This script collects episode transcript URLs for seasons 1-44, converts the URLs to metadata (episode number, season, episode title, URL, etc.) and then fetches the full transcript for each episode. The results are stored as `transcripts` in CSV and JSON formats in the `data/raw/transcripts` directory.
- `scripts/fetch_youtube_transcripts.py`: This script reads a series of episode transcripts from YouTube TV for seasons 46 and 47. The results are stored as `youtube_transcripts` in CSV and JSON formats in the `data/raw/transcripts` directory. *Still searching for a season 45 source*.
- `scripts/process_all_transcripts.py`: This script reads all the assembled transcripts and outputs them in a single clean file with episode details in CSV and JSON formats in the `data/processed/transcripts` directory. The latest version is also stored on S3: [CSV](https://stilesdata.com/survivor/transcripts/transcripts.csv), [JSON](https://stilesdata.com/survivor/transcripts/transcripts.json). This script also loops through each transcript in the dataframe, creates a directory for each season and saves each episode transcript as a .txt file. *See below.*
- `scripts/fetch_words.py`: This script reads a list of dozens of subjectively selected words and associated categories from an evolving [Google Sheets doc](https://docs.google.com/spreadsheets/d/1owUkwauJE24EkMUmVyDl7CbnumOygGfC6BufG7Vspd8/edit?gid=0#gid=0) so they can be used for text analysis of episode transcripts.
- `scripts/analyze_all_transcripts.py`: This script counts how often these [jargon words](https://docs.google.com/spreadsheets/d/1owUkwauJE24EkMUmVyDl7CbnumOygGfC6BufG7Vspd8/edit?gid=0#gid=0) ("tribe", "vote", "idol", "reward", etc.) are used in each season and episode, according to the transcripts.
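The counting step described above can be sketched roughly as follows. This is a minimal illustration, not the code in `scripts/analyze_all_transcripts.py`: the sample rows, the `season`/`episode`/`text` field names, and the helper names `count_keywords` and `count_by_season` are all assumptions made for the example, and the real dataset's columns may differ.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample rows; the real dataset's field names may differ.
SAMPLE_EPISODES = [
    {"season": 1, "episode": 1, "text": "The tribe has spoken. Vote wisely, tribe."},
    {"season": 1, "episode": 2, "text": "No idol here, only the vote."},
    {"season": 2, "episode": 1, "text": "Reward challenge today. Tribes, get ready!"},
]

# A few of the jargon words named in the README.
KEYWORDS = ["tribe", "vote", "idol", "reward"]


def count_keywords(text, keywords):
    """Count whole-word, case-insensitive matches of each keyword in text."""
    counts = Counter()
    for word in keywords:
        pattern = r"\b" + re.escape(word) + r"\b"
        counts[word] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


def count_by_season(episodes, keywords):
    """Sum keyword counts across all episodes in each season."""
    totals = defaultdict(Counter)
    for row in episodes:
        totals[row["season"]].update(count_keywords(row["text"], keywords))
    return totals


season_counts = count_by_season(SAMPLE_EPISODES, KEYWORDS)
for season, counts in sorted(season_counts.items()):
    print(season, dict(counts))
```

Because the pattern uses word boundaries, "Tribes" does not count as a match for "tribe". Whether plurals and other inflections should count is a design choice this sketch leaves open.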
## Outputs
The individual Survivor episode transcripts are organized by season and episode number. You can access the files directly from S3 storage or via the provided URLs. The files are an amalgamation from many sources, so formatting isn't perfect or consistent.
For example:
```txt
JEFF PROBST:
From this tiny,
Malaysian fishing village,
these 16 Americans are
beginning the adventure
of a lifetime.
They have volunteered
to be marooned for 39 days
on mysterious Borneo.
This is their story.
This is Survivor.
Are we getting two of these?
Where's that box?
JEFF:
You are witnessing 16 Americans
begin an adventure
that will forever change
their lives.
```
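If you want to group lines by speaker, one possible approach is sketched below. It assumes that a speaker label is a line of capital letters ending in a colon, as in the excerpt above. Since formatting varies between sources, that assumption will not hold for every file. The name `split_by_speaker` and the pattern are illustrative and not part of this repository.

```python
import re

# Assumed convention: a speaker label is an all-caps line ending in a colon.
SPEAKER_LINE = re.compile(r"^([A-Z][A-Z .'\-]*):$")


def split_by_speaker(transcript):
    """Group transcript lines under the most recent speaker label.

    Lines that appear before any label are attributed to None.
    """
    segments = []
    speaker, lines = None, []
    for raw in transcript.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = SPEAKER_LINE.match(line)
        if match:
            # A new label closes the previous speaker's segment.
            if lines:
                segments.append((speaker, " ".join(lines)))
            speaker, lines = match.group(1), []
        else:
            lines.append(line)
    if lines:
        segments.append((speaker, " ".join(lines)))
    return segments


sample = """JEFF PROBST:
This is their story.
This is Survivor.
Are we getting two of these?
JEFF:
You are witnessing 16 Americans
begin an adventure
"""

segments = split_by_speaker(sample)
for name, text in segments:
    print(name, "->", text)
```

Unlabeled lines such as "Are we getting two of these?" are attached to the previous speaker, so attributions from this approach should be treated as approximate.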
### File structure
Each transcript is stored in the following format:
- **Season directories**: Files are organized by season, with each season having its own directory.
- **File naming convention**: Within each season directory, files are named based on the episode number, formatted as `episode_XX.txt` (where `XX` is the episode number).
### Directory structure
```
data/processed/transcripts/files/
├── season_1/
│   ├── episode_01.txt
│   ├── episode_02.txt
│   └── ...
├── season_2/
│   ├── episode_01.txt
│   ├── episode_02.txt
│   └── ...
└── season_44/
    ├── episode_01.txt
    ├── episode_02.txt
    └── ...
```
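For working with a local checkout, the naming convention can be turned into paths with a small helper. This is a sketch under the layout shown above. The names `episode_path` and `list_episodes` are hypothetical, and the root directory may need adjusting to wherever the repository lives on your machine.

```python
from pathlib import Path

# Root taken from the tree above; adjust for your own checkout.
TRANSCRIPT_ROOT = Path("data/processed/transcripts/files")


def episode_path(season, episode, root=TRANSCRIPT_ROOT):
    """Build the path for one episode; episode numbers are zero-padded to two digits."""
    return root / f"season_{season}" / f"episode_{episode:02d}.txt"


def list_episodes(season, root=TRANSCRIPT_ROOT):
    """Return the episode files found for a season, sorted by file name."""
    season_dir = root / f"season_{season}"
    if not season_dir.is_dir():
        return []
    return sorted(season_dir.glob("episode_*.txt"))


print(episode_path(1, 1).as_posix())
```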
### File access
You can access each transcript by navigating to the corresponding URL. For example, to view the transcript for Season 1, Episode 1, visit the following link:
[Season 1, Episode 1 Transcript](https://stilesdata.com/survivor/transcripts/files/season_1/episode_01.txt)
To access a different episode, simply change the `season_1` and `episode_01.txt` parts of the URL to the appropriate season and episode number. For instance:
- [Season 47, Episode 14 Transcript](https://stilesdata.com/survivor/transcripts/files/season_47/episode_14.txt)
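The same URL pattern can be built programmatically. The sketch below uses only the Python standard library. The function names `transcript_url` and `fetch_transcript` are illustrative, and the UTF-8 decoding is an assumption about how the files are encoded.

```python
from urllib.request import urlopen

# Base URL taken from the example links above.
BASE_URL = "https://stilesdata.com/survivor/transcripts/files"


def transcript_url(season, episode):
    """Build the URL for one episode; episode numbers are zero-padded to two digits."""
    return f"{BASE_URL}/season_{season}/episode_{episode:02d}.txt"


def fetch_transcript(season, episode, timeout=30):
    """Download one transcript as text. Assumes the file is UTF-8 encoded."""
    with urlopen(transcript_url(season, episode), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


print(transcript_url(47, 14))
```

Calling `fetch_transcript(1, 1)` would request the Season 1, Episode 1 file. A missing season or episode surfaces as an `urllib.error.HTTPError`.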
## Related work
- [survivor-voteoffs](https://github.com/stiles/survivor-voteoffs): *How did each castaway react to his or her torch getting snuffed? There's data for that.*
- [survivoR2py](https://github.com/stiles/survivoR2py): *Converting the authoritative [survivoR](https://github.com/doehm/survivoR) repo's R data files into comma-delimited formats for use with other tools.*
## Questions? Corrections?
[Please let me know](mailto:mattstiles@gmail.com).