{"id":18925603,"url":"https://github.com/nightmachinery/hw-twitter-scraper","last_synced_at":"2026-03-14T15:30:18.004Z","repository":{"id":40980031,"uuid":"203416879","full_name":"NightMachinery/hw-twitter-scraper","owner":"NightMachinery","description":"A distributed system to scrape Twitter to neo4j, with a high-level API for querying neo4j.","archived":false,"fork":false,"pushed_at":"2022-12-08T06:03:30.000Z","size":61,"stargazers_count":2,"open_issues_count":8,"forks_count":1,"subscribers_count":4,"default_branch":"master","last_synced_at":"2024-12-31T17:48:11.883Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NightMachinery.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-08-20T16:48:30.000Z","updated_at":"2022-11-21T03:28:32.000Z","dependencies_parsed_at":"2023-01-24T15:31:18.977Z","dependency_job_id":null,"html_url":"https://github.com/NightMachinery/hw-twitter-scraper","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NightMachinery%2Fhw-twitter-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NightMachinery%2Fhw-twitter-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NightMachinery%2Fhw-twitter-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NightMachinery%2Fhw-twitter-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NightMachinery","download_url":"https://c
odeload.github.com/NightMachinery/hw-twitter-scraper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239921844,"owners_count":19718844,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-08T11:12:29.480Z","updated_at":"2026-03-14T15:30:17.929Z","avatar_url":"https://github.com/NightMachinery.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Requirements\nYou need docker, docker-compose, and a neo4j cluster connection at \"bolt+routing://localhost:7687\". Your neo4j should have APOC installed and configured properly. You can use our neo4j-compose.yml to set this up, but you still need to download APOC to the plugins folder yourself.\n\nYou need to have a socks5 proxy active at localhost:1080, or you need to disable proxying via suitable environment variables.\n\n# Usage\nYou can use the dockerfile `hworkerDF2` to create a Docker image capable of scraping to neo4j and querying it. Or you can just install `cypher-shell`, `zsh`, and the Pythonic requirements.txt and use the scripts directly.\n\nIf you want to use this via docker, build `hworkerDF2`:\n\n`docker build --tag hworker -f hworkerDF2 . # Run this in our directory`\n\nThen you can prefix all the following commands with `docker run --rm -it --net=host hworker zsh -c 'COMMAND HERE'`.\n\nFirst source `helpers.zsh` in your `zsh` session. (I have included some wrapper scripts which simply source `helpers.zsh` and call the desired function. 
Feel free to use those if that floats your boat.)\n\nUse `interrogatrix.py --help` to see its documentation. It is a high-level API for creating Cypher queries you can run against cypher-shell or the neo4j browser (which is accessible at `http://localhost:7474/browser/` in our config).\n\nYou can run all `interrogatrix` queries from the command line like this:\n\n`interrogatrix.py usertweets jack -s like -n 2 -e | cyph`\n\nHere, `cyph` is an alias that authenticates cypher-shell with our config.\n\nSee `t2n.py --help` for our twint-to-neo4j tool.\n\nOf note is `t2n.py trackuser \u003cusername\u003e`, which marks that user to be tracked by us.\n\nRead the source of `helpers.zsh`; I provide some neat helpers there. E.g., you can use this one-liner to track all your followees:\n\n`cygetfollowees your_username | cypara t2n.py trackuser`\n\n`cypara`, in particular, is a very helpful function that runs jobs in parallel. It uses `GNU parallel` under the hood.\n\nTo start the machinery that automatically tracks users, use `docker-compose` with one of our `hworkers.yml` configs:\n\n`docker-compose --file hworkers_lightweight.yml up`\n\nFeel free to create your own `hworkers.yml` config. We hash each tracked username and assign it a bucket between 0 and 100, and these config files specify which buckets each worker updates.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnightmachinery%2Fhw-twitter-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnightmachinery%2Fhw-twitter-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnightmachinery%2Fhw-twitter-scraper/lists"}