{"id":25934375,"url":"https://github.com/sanand0/imdbscrape","last_synced_at":"2026-05-30T21:31:28.825Z","repository":{"id":275115813,"uuid":"925118269","full_name":"sanand0/imdbscrape","owner":"sanand0","description":"A weekly archive of the IMDB Top 250 results. Automatically scraped via GitHub Actions. Useful to see trends on IMDb Top 250","archived":false,"fork":false,"pushed_at":"2026-04-12T15:16:01.000Z","size":183,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-04-12T17:24:00.300Z","etag":null,"topics":["data"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sanand0.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-01-31T09:09:36.000Z","updated_at":"2026-04-12T15:16:06.000Z","dependencies_parsed_at":"2025-02-23T01:20:41.807Z","dependency_job_id":"6c930db5-f486-436a-af58-e96e7f4fbad6","html_url":"https://github.com/sanand0/imdbscrape","commit_stats":null,"previous_names":["sanand0/imdbscrape"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/sanand0/imdbscrape","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sanand0%2Fimdbscrape","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sanand0%2Fimdbscrape/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sanand0%2Fimdbscrape/releases","manifes
ts_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sanand0%2Fimdbscrape/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sanand0","download_url":"https://codeload.github.com/sanand0/imdbscrape/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sanand0%2Fimdbscrape/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33711018,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-30T02:00:06.278Z","response_time":92,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data"],"created_at":"2025-03-04T00:57:36.775Z","updated_at":"2026-05-30T21:31:28.820Z","avatar_url":"https://github.com/sanand0.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# IMDb Top 250 Scraper: What Broke, Why It Broke, and Why the Schedule Is Off\n\nThis repository started as a very small daily scraper for IMDb's Top 250 chart.\nIt used a plain `httpx` request, parsed the returned HTML with `lxml`, and\nappended one JSON line per day to a date-stamped file.\n\nFor a while, that worked.\n\nThen it stopped.\n\nThis README documents what happened, what changed upstream, what is still true\nabout the code in this repository, and why the scheduled GitHub Action has been\ndisabled.\n\n## The Original Design\n\nThe scraper in 
[scrape.py](scrape.py) is simple:\n\n1. Fetch `https://www.imdb.com/chart/top/`\n2. Parse the HTML response\n3. Select movie rows with CSS selectors\n4. Extract title, year, and rating\n5. Append the result to `imdb-top250-YYYY-MM-DD.json`\n\nThe script has no browser automation, no cookies, no JavaScript execution, and\nno anti-bot handling. That simplicity was a feature when the page was publicly\nfetchable as ordinary HTML.\n\nIt became the failure mode once the page stopped behaving like ordinary HTML for\nthis class of client.\n\n## What Failed\n\nRunning the script now fails with:\n\n```text\nlxml.etree.ParserError: Document is empty\n```\n\nThat error happens here:\n\n```python\ntree = html.fromstring(response.text)\n```\n\nAt first glance, this looks like a parser bug or a broken selector.\nIt is neither.\n\nThe real failure is earlier: IMDb no longer returns the expected page body to\nthis request pattern.\n\n## What IMDb Returns Now\n\nAs of April 12, 2026, requests from this scraper receive an AWS WAF challenge\nresponse instead of the Top 250 page. In this environment, the key signals are:\n\n- HTTP status: `202`\n- Response header: `x-amzn-waf-action: challenge`\n- Response body for the script's request path: empty\n\nIn some raw `curl` responses, IMDb returns a small interstitial document that\nloads AWS WAF challenge JavaScript. In the exact request path used by the\nscript, the body is empty, which is why `lxml` raises `Document is empty`.\n\nThis is the important point: the script is not parsing the wrong HTML. It is no\nlonger receiving the HTML it expects.\n\n## When It Started\n\nThere are two different dates worth keeping separate.\n\n### 1. 
When AWS introduced WAF challenge support\n\nAWS announced the AWS WAF `Challenge` rule action on October 27, 2022:\n\n- https://aws.amazon.com/about-aws/whats-new/2022/10/aws-waf-challenge-rule-action-bot-control-targeted-bots/\n\nSo the underlying mechanism is not new.\n\n### 2. When this repository appears to have been affected\n\nThe repository history shows successful daily outputs through:\n\n- `imdb-top250-2026-03-19.json`\n\nAnd the failure was reproduced in this environment on:\n\n- April 12, 2026\n\nThat means the best evidence-based statement is:\n\nIMDb appears to have enabled or tightened AWS WAF challenge behavior for\nrequests like this sometime after March 19, 2026 and by April 12, 2026.\n\nThere is no verified public announcement here giving the exact day IMDb changed\nbehavior on `/chart/top/`, so anything narrower than that window would be guesswork.\n\n## Why `httpx` No Longer Works\n\nAWS documents the challenge flow as a browser-oriented mechanism.\nThe challenge page is designed to run JavaScript in the client, obtain a valid\nAWS WAF token, and then present that token on subsequent requests.\n\nRelevant AWS documentation:\n\n- AWS WAF JavaScript challenge integration:\n  https://docs.aws.amazon.com/waf/latest/developerguide/waf-js-challenge-api.html\n- AWS WAF token and domain behavior:\n  https://docs.aws.amazon.com/waf/latest/developerguide/web-acl-captcha-challenge-token-domains.html\n- AWS WAF `ChallengeAction` API reference:\n  https://docs.aws.amazon.com/waf/latest/APIReference/API_ChallengeAction.html\n\nA plain `httpx.get(...)` call does not:\n\n- execute challenge JavaScript\n- store and replay the resulting token the way the page expects\n- behave like a real browser session\n\nThat means the current scraper is not just missing a header or user-agent\nstring. 
The model of access is now wrong.\n\n## Why This Is Not Just a Selector Fix\n\nIt is tempting to think the page layout changed and the CSS selectors drifted.\nThat would have been the easy case.\n\nIf the page structure had changed, the script would likely still receive HTML,\nand one of these things would happen:\n\n- it would return zero movies\n- it would extract incomplete fields\n- it would fail later while indexing into selectors\n\nInstead, the script fails immediately at HTML parsing because the body is empty.\nThat points to access control upstream, not DOM drift downstream.\n\n## Why the Daily Workflow Has Been Disabled\n\nThe GitHub Action used to run this every day at `00:00 UTC`.\n\nThat schedule has been removed for two reasons:\n\n1. The current script is known broken against IMDb's current behavior.\n2. Keeping the job on a timer would just generate repeated failing runs with no\n   user value.\n\nManual runs remain enabled through `workflow_dispatch`, which preserves a path\nfor future testing once the project has a legitimate replacement data source or\na different execution strategy.\n\n## Could We Bypass the Challenge Technically?\n\nProbably, yes.\n\nA real browser automation flow using Playwright or Chromium is more likely to\nwork than `httpx` because it can:\n\n- load the interstitial\n- execute the AWS WAF challenge JavaScript\n- obtain the browser token\n- continue navigation with normal browser state\n\nBut \"technically possible\" is not the same as \"the right fix.\"\n\nThis repository should distinguish between three different questions:\n\n1. Can the challenge be bypassed?\n2. Would that be robust?\n3. 
Is that the appropriate or permitted way to obtain the data?\n\nThe answer set is not especially flattering:\n\n- It may be possible.\n- It will be brittle.\n- It may conflict with IMDb's access restrictions and anti-bot posture.\n\n## What IMDb Says About Data Access\n\nIMDb points users toward official data products instead of scraping.\n\nImportant references:\n\n- IMDb help, \"Can I use IMDb data in my software?\":\n  https://help.imdb.com/article/imdb/general-information/can-i-use-imdb-data-in-my-software/G5JTRESSHJBBHTGX\n- IMDb Conditions of Use:\n  https://www.imdb.com/conditions\n- IMDb non-commercial datasets:\n  https://developer.imdb.com/non-commercial-datasets/\n- IMDb API access documentation:\n  https://developer.imdb.com/documentation/api-documentation/getting-access/\n\nThe practical reading is straightforward:\n\n- If you need sanctioned non-commercial access, use the published datasets.\n- If you need sanctioned live product/API access, use IMDb's official API.\n- A brittle HTML scraper against a now-challenged page is not the stable path.\n\n## What the Non-Commercial Datasets Do and Do Not Solve\n\nIMDb publishes non-commercial datasets that include title basics and ratings,\nincluding:\n\n- `title.basics.tsv.gz`\n- `title.ratings.tsv.gz`\n\nThose datasets are useful, but they are not a drop-in replacement for the\n`/chart/top/` page itself.\n\nWhy:\n\n- The Top 250 chart is a curated/ranked IMDb product view.\n- The public datasets expose ingredients like rating and vote count.\n- They do not simply hand you \"the exact current Top 250 page output\" as a\n  ready-made file.\n\nSo a dataset-based rebuild would require a new ranking definition or a best-effort\napproximation, and it should be described honestly as that.\n\n## What the Official API Solves\n\nIMDb's official API is the cleanest long-term answer if the goal is current IMDb\ndata with a stable contract.\n\nThat path is stronger because:\n\n- it is sanctioned\n- it is designed for 
structured access\n- it avoids HTML scraping fragility\n- it aligns with where IMDb is directing developers\n\nThe tradeoff is obvious:\n\n- it is a product integration, not a tiny anonymous scrape\n- it may require credentials, setup, and potentially paid access\n\n## Current Status of This Repository\n\nRight now, the repository contains:\n\n- the original scraper\n- historical daily JSON outputs\n- a manual GitHub workflow\n- this documentation\n\nIt does **not** currently contain a working replacement for the old scraping\npath.\n\nThat is intentional. A broken scraper should be documented clearly before it is\nquietly replaced with something more complicated or more questionable.\n\n## Recommended Next Steps\n\nThere are three realistic paths forward.\n\n### Option 1: Rebuild on IMDb's official API\n\nChoose this if the goal is current IMDb data with a stable and legitimate access\nmethod.\n\nThis is the best long-term engineering decision.\n\n### Option 2: Rebuild on the non-commercial datasets\n\nChoose this if the project can tolerate a derived or approximate chart based on\npublished IMDb data rather than the exact `/chart/top/` page.\n\nThis is the best free and policy-aligned option.\n\n### Option 3: Use browser automation\n\nChoose this only if the project explicitly accepts the operational fragility and\npolicy risk of automating a browser through an anti-bot challenge flow.\n\nThis is the closest replacement for the old script behavior, but the weakest\nlong-term design.\n\n## Bottom Line\n\nThe script did not fail because of a small bug.\nIt failed because the assumptions it depended on are no longer true.\n\nThis repository used to scrape a publicly fetchable HTML page.\nThat page is now protected by an AWS WAF challenge flow for this class of\nclient, and the old one-request parser approach is no longer a valid access\nstrategy.\n\nThat is why the schedule is off, why the scraper is left unmodified, and why any\nfuture fix should start with a 
decision about data source and access model, not\njust parsing code.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsanand0%2Fimdbscrape","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsanand0%2Fimdbscrape","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsanand0%2Fimdbscrape/lists"}