{"id":30184626,"url":"https://github.com/studyresearchprojects/hacker-news-scraper","last_synced_at":"2025-08-12T12:43:51.298Z","repository":{"id":104133210,"uuid":"406974459","full_name":"StudyResearchProjects/hacker-news-scraper","owner":"StudyResearchProjects","description":"A example web scraper for Hacker News exposed through a REST API","archived":false,"fork":false,"pushed_at":"2021-10-08T21:36:17.000Z","size":9,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-08-06T00:42:55.785Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/StudyResearchProjects.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-09-16T01:19:49.000Z","updated_at":"2025-08-04T20:57:09.000Z","dependencies_parsed_at":null,"dependency_job_id":"9f89ad78-412e-4bcf-a923-9dd5ea98f00d","html_url":"https://github.com/StudyResearchProjects/hacker-news-scraper","commit_stats":{"total_commits":4,"total_committers":1,"mean_commits":4.0,"dds":0.0,"last_synced_commit":"79940aa124c467b74efaea6952dcfe0b0d52faad"},"previous_names":["leoborai/hacker-news-scraper","estebanborai/hacker-news-scraper","studyresearchprojects/hacker-news-scraper"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/StudyResearchProjects/hacker-news-scraper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StudyResearchProjects%2Fhacker-news-scraper","tags_url":"https://repos.ecosyste.ms/api
/v1/hosts/GitHub/repositories/StudyResearchProjects%2Fhacker-news-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StudyResearchProjects%2Fhacker-news-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StudyResearchProjects%2Fhacker-news-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/StudyResearchProjects","download_url":"https://codeload.github.com/StudyResearchProjects/hacker-news-scraper/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StudyResearchProjects%2Fhacker-news-scraper/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270064257,"owners_count":24520928,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-12T02:00:09.011Z","response_time":80,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-08-12T12:43:47.021Z","updated_at":"2025-08-12T12:43:51.221Z","avatar_url":"https://github.com/StudyResearchProjects.png","language":"Python","readme":"\u003cdiv\u003e\n  \u003ch1 align=\"center\"\u003ehacker-news-scraper\u003c/h1\u003e\n  \u003ch4 align=\"center\"\u003e\n    An example web scraper for Hacker News exposed through a REST API\n  \u003c/h4\u003e\n\u003c/div\u003e\n\n## Production\n\n```bash\ndocker image build -t hacker-news-scrapper .\n```\n\n```bash\ndocker run -d 
-p 5000:5000 hacker-news-scrapper\n```\n\n## Development\n\n### Execute\n\n```bash\ndocker-compose -f ./docker-compose.dev.yml up --build\n```\n\nThen visit:\n\n```\nhttp://0.0.0.0:5000\n```\n\n### Shutdown\n\nFocus the terminal where the session is running and press `Ctrl + C`.\nThen execute:\n\n```bash\ndocker-compose -f ./docker-compose.dev.yml down\n```\n\n### HTTP Server\n\nAn HTTP server acts as the interface to this application.\nIts endpoints are listed in the following table:\n\nMethod | URI | Description\n--- | --- | ---\n`GET` | `/crawl` | Executes the `HackerNewsBotSpider` and retrieves the state\n`GET` | `/results` | Retrieves the results from the `/crawl` process if available\n`GET` | `/context` | Retrieves the current state for relevant values\n\nTo execute any of these HTTP requests, first follow the\n[Execute](#execute) section, then use an HTTP client such as _cURL_.\n\nExample of a cURL call to this API while it is running:\n\n```bash\ncurl http://0.0.0.0:5000/crawl\n```\n\n### Scraper\n\nThis project uses Scrapy as its _web crawler_/_scraper_ to retrieve posts from\nHacker News.\n\nThe _Scrapy_ project is stored under `src/scraper` and contains the\n`HackerNewsBotSpider`.\n\nTo use the _Scrapy_ shell, first [execute the docker-compose](#execute)\nand then use `docker exec` to open a shell in the running container.\n\n1. Execute `docker ps` to gather container details\n\n```bash\ndocker ps\n```\n\n2. Copy the relevant `CONTAINER ID` to your clipboard\n\n3. Execute `docker exec -it \u003cCONTAINER ID\u003e bash`\n\n\u003e At this point you will be using the container's Bash instance.\n\n4. 
Change directory to `src/scraper` and then execute the _Scrapy_ shell\n\n```bash\nscrapy shell\n```\n\nWith the Scrapy shell you will be able to debug and test CSS selectors to\ngather data from the website in question.\n\n### Running _Scrapy_ Spider\n\nAs mentioned above, a `HackerNewsBotSpider` is available. The purpose of this\nspider is to retrieve post details from Hacker News.\n\nFollow the [\"Scraper\" instructions](#scraper) through the third step. Then, in the fourth step, run\n\n```bash\nscrapy crawl hacker_news_bot\n```\n\ninstead of:\n\n```bash\nscrapy shell\n```\n\nto execute the bot. This command should output the scraped data as part of\nthe debug output in the terminal.\n\n### Dependencies\n\n- **Flask**: Lightweight WSGI (Web Server Gateway Interface) web application\nframework\n- **crochet**: Makes it easier to use Twisted from regular blocking code\n- **Scrapy**: Web crawling and scraping framework\n- **Twisted**: Multi-purpose event-based framework\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstudyresearchprojects%2Fhacker-news-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstudyresearchprojects%2Fhacker-news-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstudyresearchprojects%2Fhacker-news-scraper/lists"}