{"id":38551632,"url":"https://github.com/stiles/survivor-transcripts","last_synced_at":"2026-01-17T07:35:27.214Z","repository":{"id":248935005,"uuid":"830221805","full_name":"stiles/survivor-transcripts","owner":"stiles","description":"Fetching and storing complete transcripts for each episode of the American television show and analyzing the text for keyword/phrase frequency. ","archived":false,"fork":false,"pushed_at":"2024-12-19T18:04:16.000Z","size":33423,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-30T13:45:21.302Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc0-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/stiles.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-17T21:00:14.000Z","updated_at":"2025-07-03T00:36:58.000Z","dependencies_parsed_at":"2024-07-18T01:56:07.267Z","dependency_job_id":null,"html_url":"https://github.com/stiles/survivor-transcripts","commit_stats":null,"previous_names":["stiles/survivor-transcripts"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/stiles/survivor-transcripts","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stiles%2Fsurvivor-transcripts","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stiles%2Fsurvivor-transcripts/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stiles%2Fsurvivor-transcripts/releases","manifests_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/stiles%2Fsurvivor-transcripts/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/stiles","download_url":"https://codeload.github.com/stiles/survivor-transcripts/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stiles%2Fsurvivor-transcripts/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28504356,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-17T06:57:29.758Z","status":"ssl_error","status_checked_at":"2026-01-17T06:56:03.931Z","response_time":85,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-17T07:35:27.125Z","updated_at":"2026-01-17T07:35:27.177Z","avatar_url":"https://github.com/stiles.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Survivor transcripts\n\n## About\n\nThis repository has scripts for downloading and parsing show transcripts and counting castaways' keywords and phrases — by season and the series overall. \n\n## Sources\n\nMost transcripts sourced from [subslikescript](https://subslikescript.com/series/Survivor-239195) with a few missing seasons pulled from the closed-captioning XML files embedded in the CBS/Paramount video player or from YouTube TV's timedtext API. 
\n\n*Still need to find transcripts for season 45*\n\n## Processes\n\n- `scripts/fetch_transcripts.py`: This script collects episode transcript URLs for seasons 1-44, converts the URLs to metadata (episode number, season, episode title, URL, etc.) and then fetches the full transcript for each episode. The results are stored as `transcripts` in CSV and JSON formats in the `data/raw/transcripts` directory.\n\n- `scripts/fetch_youtube_transcripts.py`: This script reads a series of episode transcripts from YouTube TV for seasons 46 and 47. The results are stored as `youtube_transcripts` in CSV and JSON formats in the `data/raw/transcripts` directory. *Still searching for a season 45 source*.\n\n- `scripts/process_all_transcripts.py`: This script reads all the assembled transcripts and outputs them in a single clean file with episode details in CSV and JSON formats in the `data/processed/transcripts` directory. The latest version is also stored on S3: [CSV](https://stilesdata.com/survivor/transcripts/transcripts.csv), [JSON](https://stilesdata.com/survivor/transcripts/transcripts.json). This script also loops through each transcript in the dataframe, creates a directory for each season and saves each episode transcript as a .txt file. *See below.*\n\n- `scripts/fetch_words.py`: This script reads a list of dozens of subjectively selected words and associated categories from an evolving [Google Sheets doc](https://docs.google.com/spreadsheets/d/1owUkwauJE24EkMUmVyDl7CbnumOygGfC6BufG7Vspd8/edit?gid=0#gid=0) so they can be used for text analysis of episode transcripts.\n\n- `scripts/analyze_all_transcripts.py`: This script counts how often these [jargon words](https://docs.google.com/spreadsheets/d/1owUkwauJE24EkMUmVyDl7CbnumOygGfC6BufG7Vspd8/edit?gid=0#gid=0) (\"tribe\", \"vote\", \"idol\", \"reward\", etc.) 
have been used by season and episode, according to the transcripts.\n\n## Outputs\n\nThe individual Survivor episode transcripts are organized by season and episode number. You can access the files directly from S3 storage or via the provided URLs. The files are an amalgamation from many sources, so formatting isn't perfect or consistent. \n\nFor example:  \n\n```txt\n         JEFF PROBST: \n From this tiny, \n Malaysian fishing village, \nthese 16 Americans are \n beginning the adventure \nof a lifetime. \nThey have volunteered \n to be marooned for 39 days \non mysterious Borneo. \nThis is their story. \nThis is Survivor. \nAre we getting two of these? \nWhere's that box? \nJEFF: \nYou are witnessing 16 Americans \n begin an adventure \nthat will forever change \n their lives. \n```\n\n### File structure\n\nEach transcript is stored in the following format:\n\n- **Season directories**: Files are organized by season, with each season having its own directory.\n- **File naming convention**: Within each season directory, files are named based on the episode number, formatted as `episode_XX.txt` (where `XX` is the episode number).\n\n### Directory structure\n\n```\ndata/processed/transcripts/files/\n├── season_1/\n│   ├── episode_01.txt\n│   ├── episode_02.txt\n│   └── ...\n├── season_2/\n│   ├── episode_01.txt\n│   ├── episode_02.txt\n│   └── ...\n└── season_44/\n    ├── episode_01.txt\n    ├── episode_02.txt\n    └── ...\n```\n\n### File access\n\nYou can access each transcript by navigating to the corresponding URL. For example, to view the transcript for Season 1, Episode 1, visit the following link:\n\n[Season 1, Episode 1 Transcript](https://stilesdata.com/survivor/transcripts/files/season_1/episode_01.txt)\n\nTo access a different episode, simply change the `season_1` and `episode_01.txt` parts of the URL to the appropriate season and episode number. 
For instance:\n\n- [Season 47, Episode 14 Transcript](https://stilesdata.com/survivor/transcripts/files/season_47/episode_14.txt)\n\n## Related work\n- [survivor-voteoffs](https://github.com/stiles/survivor-voteoffs): *How did each castaway react to his or her torch getting snuffed? There's data for that.*\n- [survivoR2py](https://github.com/stiles/survivoR2py): *Converting the authoritative [survivoR](https://github.com/doehm/survivoR) repo's R data files into comma-delimited formats for use with other tools.*\n\n## Questions? Corrections? \n\n[Please let me know](mailto:mattstiles@gmail.com).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstiles%2Fsurvivor-transcripts","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstiles%2Fsurvivor-transcripts","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstiles%2Fsurvivor-transcripts/lists"}