https://github.com/milahu/opensubtitles-scraper
scrape subtitles from opensubtitles.org
https://github.com/milahu/opensubtitles-scraper
opensubtitles subtitles web-scraper
Last synced: 6 months ago
JSON representation
scrape subtitles from opensubtitles.org
- Host: GitHub
- URL: https://github.com/milahu/opensubtitles-scraper
- Owner: milahu
- License: mit
- Created: 2023-04-22T06:26:00.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-04-20T21:01:30.000Z (6 months ago)
- Last Synced: 2025-04-20T21:35:46.249Z (6 months ago)
- Topics: opensubtitles, subtitles, web-scraper
- Language: Python
- Homepage: http://s.milahu.webredirect.org/bin/get-subtitles
- Size: 1.54 MB
- Stars: 33
- Watchers: 5
- Forks: 2
- Open Issues: 4
-
Metadata Files:
- Readme: readme.md
- License: license.txt
Awesome Lists containing this project
README
# opensubtitles-scraper
scrape subtitles from [opensubtitles.org](https://www.opensubtitles.org/)
## result
torrent RSS feed: [opensubtitles.org.dump.torrent.rss](release/opensubtitles.org.dump.torrent.rss)
- [opensubtitles.org.dump.9180519.to.9521948.by.lang.2023.04.26.torrent](release/opensubtitles.org.dump.9180519.to.9521948.by.lang.2023.04.26.torrent)
- [opensubtitles.org.dump.9500000.to.9599999.v20240306.torrent](release/opensubtitles.org.dump.9500000.to.9599999.v20240306.torrent)
- [opensubtitles.org.dump.9600000.to.9699999.torrent](release/opensubtitles.org.dump.9600000.to.9699999.torrent)
- [opensubtitles.org.dump.9700000.to.9799999.torrent](release/opensubtitles.org.dump.9700000.to.9799999.torrent)
- [opensubtitles.org.dump.9800000.to.9899999.v20240420.torrent](release/opensubtitles.org.dump.9800000.to.9899999.v20240420.torrent)
- [opensubtitles.org.dump.9900000.to.9999999.v20240609.torrent](release/opensubtitles.org.dump.9900000.to.9999999.v20240609.torrent)
- [opensubtitles.org.dump.10000000.to.10099999.v20240820.torrent](release/opensubtitles.org.dump.10000000.to.10099999.v20240820.torrent)
- [opensubtitles.org.dump.10100000.to.10199999.v20241003.torrent](release/opensubtitles.org.dump.10100000.to.10199999.v20241003.torrent)
- [opensubtitles.org.dump.10200000.to.10299999.v20241124.torrent](release/opensubtitles.org.dump.10200000.to.10299999.v20241124.torrent)unreleased subs are stored in [github.com/milahu/opensubtitles-scraper-new-subs](https://github.com/milahu/opensubtitles-scraper-new-subs)
## usage
- download the torrents from section [result](#result)
- download `subtitles_all.txt.gz` from https://dl.opensubtitles.org/addons/export/
- run [subtitles_all.txt.gz-parse.py](subtitles_all.txt.gz-parse.py) to build `subtitles_all.db`
- `python3 subtitles_all.txt.gz-parse.py subtitles_all.db subz_metadata subtitles_all.txt.gz subtitles_all.txt.gz-parse-errors.txt subtitles_all.txt.gz-parse-debug.txt`
- fix the values for `db_path` in [local-subtitle-providers.json](local-subtitle-providers.json)run [get-subs.py](get-subs.py) to get subtitles for a movie:
```
~/src/opensubtitles-scraper/get-subs.py Scary.Movie.2000.mp4video_path Scary.Movie.2000.mp4
video_filename Scary.Movie.2000.mp4
video_parsed MatchesDict([('title', 'Scary Movie'), ('year', 2000), ('container', 'mp4'), ('mimetype', 'video/mp4'), ('type', 'movie')])
output 'Scary.Movie.2000.en.00018286.sub' from 'Scary_eng.txt' (us-ascii)
output 'Scary.Movie.2000.en.00018615.sub' from 'Scary Movie.txt' (us-ascii)
output 'Scary.Movie.2000.en.00106539.sub' from 'Scary Movie - ENG.txt' (us-ascii)
output 'Scary.Movie.2000.en.00117707.sub' from 'scream_english.sub' (iso-8859-1)
output 'Scary.Movie.2000.en.00203573.sub' from 'Scary Movie - ENG.txt' (us-ascii)
output 'Scary.Movie.2000.en.00204203.sub' from 'Scary Movie_engl.sub' (iso-8859-1)
output 'Scary.Movie.2000.en.03112243.srt' from 'Scary Movie 1 (2000).en.bug-fixed.srt' (Windows-1252)
output 'Scary.Movie.2000.en.03142326.srt' from 'kns-sm.srt' (Windows-1252)
output 'Scary.Movie.2000.en.03279944.srt' from 'Scary Movie 1 iNT DvD RiP- WaCkOs.srt' (Windows-1252)
output 'Scary.Movie.2000.en.03318665.srt' from 'rvlt-scarymovie.srt' (us-ascii)
output 'Scary.Movie.2000.en.03552139.srt' from 'Scary Movie.srt' (us-ascii)
output 'Scary.Movie.2000.en.03686957.sub' from 'Scary.Movie.(2000).DVDRIP.Divx.DOMiNION.sub' (iso-8859-1)
output 'Scary.Movie.2000.en.04867080.srt' from 'Scary.Movie.2000.BrRip.720p.x264.YIFY-eng.srt' (Windows-1252)
output 'Scary.Movie.2000.en.05115082.srt' from 'Scary Movie 1.[2000].UNRATED.DVDRIP.XVID.[Eng]-DUQA®.srt' (Windows-1252)
...
```### subtitles server
to run your own subtitles server,
see [docs/lighttpd.conf](docs/lighttpd.conf)
to expose [get-subs.py](get-subs.py) as a CGI script on an HTTP server#### demo server
[s.milahu.webredirect.org/bin/get-subtitles](http://s.milahu.webredirect.org/bin/get-subtitles)
([http](http://milahu.webredirect.org/bin/get-subtitles))## based on
- [5719123 subtitles from opensubtitles.org](https://www.reddit.com/r/DataHoarder/comments/w7sgcz/5719123_subtitles_from_opensubtitlesorg/) by `-marked-4life`
- [opensubtitles.org.Actually.Open.Edition.2022.07.25.torrent](release/opensubtitles.org.Actually.Open.Edition.2022.07.25.torrent)
- cloudflare bypassing with commercial [scraping services](opensubtitles_dump_client/scraping.md)
- bittorrent## offline version of opensubtitles
useful for subtitle-fetchers like
- [bazarr](https://github.com/morpheus65535/bazarr)
- [subliminal](https://github.com/Diaoul/subliminal)
- [subdl](https://github.com/alexanderwink/subdl)## scraping
opensubtitles.org is protected by cloudflare, so im using a scraping proxy ([zenrows.com](https://www.zenrows.com/)).
with `max_concurrency = 10` in fetch-subs.py, one request takes about 0.2 seconds.videos:
- https://asciinema.org/a/6fG6TXEkF3UOOL0qu8tivViXF
- [docs/fetch-subs.py.cast](docs/fetch-subs.py.cast)