{"id":18058731,"url":"https://github.com/nathancooperjones/script-scraper","last_synced_at":"2026-05-17T09:49:10.467Z","repository":{"id":54809814,"uuid":"264315374","full_name":"nathancooperjones/script-scraper","owner":"nathancooperjones","description":"Lightweight Python parser for film and TV scripts.","archived":false,"fork":false,"pushed_at":"2021-04-13T15:25:27.000Z","size":302,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-04-05T11:29:49.258Z","etag":null,"topics":["docker","movies","python","script","tv"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nathancooperjones.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-05-15T22:53:09.000Z","updated_at":"2021-04-13T15:25:30.000Z","dependencies_parsed_at":"2022-08-14T03:31:11.662Z","dependency_job_id":null,"html_url":"https://github.com/nathancooperjones/script-scraper","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/nathancooperjones/script-scraper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nathancooperjones%2Fscript-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nathancooperjones%2Fscript-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nathancooperjones%2Fscript-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nathancooperjones%2Fscript-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/owners/nathancooperjones","download_url":"https://codeload.github.com/nathancooperjones/script-scraper/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nathancooperjones%2Fscript-scraper/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279013903,"owners_count":26085326,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-12T02:00:06.719Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["docker","movies","python","script","tv"],"created_at":"2024-10-31T03:08:59.690Z","updated_at":"2025-10-13T01:42:49.053Z","avatar_url":"https://github.com/nathancooperjones.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# `script_scraper` - Parser for Film and TV Scripts\n\n[![codecov](https://codecov.io/gh/nathancooperjones/script-scraper/branch/master/graph/badge.svg?token=4YKUOBQM53)](https://codecov.io/gh/nathancooperjones/script-scraper)\n\nA lightweight Python parser built in my spare time as a potential upgrade to the current tool used by the [Geena Davis Institute's](https://seejane.org) Spell Check for Bias tool.\n\n![The first page of the Inception script](https://nathancooperjones.com/wp-content/uploads/2020/05/2-1024x888.jpg)\n\n### Usage\n```python\n\u003e\u003e\u003e from script_scraper import open_pdf, script_scraper, 
word_and_sentence_count\n\u003e\u003e\u003e # open the PDF file\n\u003e\u003e\u003e pdf = open_pdf(path='~/Desktop/Inception.pdf')\n\u003e\u003e\u003e # run the analysis\n\u003e\u003e\u003e words_spoken = script_scraper(pdf=pdf,\n...                               remove_first_line=False)\n\u003e\u003e\u003e # check words spoken for each character\n\u003e\u003e\u003e words_spoken['ATTENDANT']\n['He was delirious. But he asked for', 'you by name. And...', 'Show him.']\n\u003e\u003e\u003e # get word and sentence count for each character\n\u003e\u003e\u003e word_count, sentence_count = word_and_sentence_count(words_spoken['ATTENDANT'])\n\u003e\u003e\u003e word_count, sentence_count\n(13, 3)\n```\n\n### Development\nBegin by installing [Docker](https://docs.docker.com/install/), if you have not already. Once Docker is running, run development from within the Docker container:\n\n```bash\n# build the Docker image\ndocker build -t script_scraper .\n\n# run the Docker container in interactive mode\ndocker run \\\n    -it \\\n    --rm \\\n    -v \"${PWD}:/script_scraper\" \\\n    -p 8888:8888 \\\n    script_scraper /bin/bash\n\n# launch JupyterLab...\njupyter lab --ip 0.0.0.0 --no-browser --allow-root --NotebookApp.token='' --NotebookApp.password=''\n\n# ... or, now in the container, run unit tests, if you'd like\npytest -v --cov-report term --cov=script_scraper\n```\n\n### FAQ\n_My PDF has a watermark across every page. What can I do?_\nBy default, these PDFs will _not_ work in `script_scraper`. Here is how I have been able to run these documents through the library:\n\n1. Use a tool such as [this](https://smallpdf.com/pdf-to-word) to convert the PDF to a Word document.\n2. Open the document in Word, then save the document in XML format.\n3. Open the XML file in a text editor, find-and-replace the watermark text with an empty string.\n4. Open the XML file back up in Word.\n5. 
Save the Word document as a PDF for online use.\n\nNow, you can run `script_scraper` on the edited PDF, which should no longer have the watermark.\n\nIf you have a better way to deal with this issue (that is hopefully more automated), feel free to make a PR!\n\n### Known Bugs / Issues Progress\n- [ ] PDFs with watermarks will NOT work.\n- [ ] Dialogue that might span multiple lines with multiple characters separated with a `/` is incorrectly counted and reported as a single character.\n  - Low priority, might not address in the foreseeable future.\n- [X] Character names with slight misspellings are counted as separate characters.\n  - Addressed in version `0.5.0`\n- [X] Dialogue with multiple characters speaking at once (side-by-side type) with sub-groups of characters speaking at once separated with a `/` is incorrectly counted and reported as only two characters.\n  - Addressed in version `0.4.0`\n- [X] Characters with `V.O.` in their name are counted as separate characters.\n  - Addressed in version `0.3.0`\n- [X] Dialogue with multiple characters separated with an `AND` or `\u0026` is incorrectly counted and reported as a single character.\n  - Addressed in version `0.3.0`\n- [X] Characters with punctuation are sometimes counted as more than one character.\n  - Addressed in version `0.3.0`\n- [X] Sentence count is not the _most_ reliable yet for some characters' dialogue.\n  - Addressed in version `0.2.2`\n- [X] Dialogue with multiple characters separated with a `/` is incorrectly counted and reported as a single character.\n  - Addressed in version `0.2.0`\n- [X] `non_dialogue_sentence` _might_ not be required for `get_character_dialogue_for_page`...\n  - Addressed in version `0.1.1`\n- [X] Sometimes, different scene descriptions are counted as characters.\n  - Addressed in version `0.2.1`\n- [X] When a character speaks in all caps, it is assumed to be a character name.\n  - Addressed in version 
`0.1.1`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnathancooperjones%2Fscript-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnathancooperjones%2Fscript-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnathancooperjones%2Fscript-scraper/lists"}