{"id":14110254,"url":"https://github.com/Xetera/jiu","last_synced_at":"2025-08-01T09:33:25.230Z","repository":{"id":41492216,"uuid":"387467177","full_name":"Xetera/jiu","owner":"Xetera","description":"🕵️ Detect new images and video on social media feeds and dispatch webhooks on updates","archived":false,"fork":false,"pushed_at":"2022-03-12T11:53:05.000Z","size":1374,"stargazers_count":65,"open_issues_count":0,"forks_count":4,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-11-27T20:04:03.373Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Xetera.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-07-19T13:05:31.000Z","updated_at":"2024-06-13T22:11:22.000Z","dependencies_parsed_at":"2022-08-28T19:22:51.769Z","dependency_job_id":null,"html_url":"https://github.com/Xetera/jiu","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Xetera%2Fjiu","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Xetera%2Fjiu/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Xetera%2Fjiu/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Xetera%2Fjiu/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Xetera","download_url":"https://codeload.github.com/Xetera/jiu/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":228360688,"owners_count":17907956
,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-14T10:02:44.826Z","updated_at":"2024-12-05T19:31:41.075Z","avatar_url":"https://github.com/Xetera.png","language":"Rust","funding_links":[],"categories":["Rust"],"sub_categories":[],"readme":"\u003ch1\u003e\n  \u003cimg src=\"https://i.imgur.com/qVp1N9y.png\"\u003e\n\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cb\u003eScrape multiple media providers on a cron job and dispatch webhooks when changes are detected.\u003c/b\u003e\n\u003c/p\u003e\n\n## Jiu\n\nJiu is a multi-threaded media scraper capable of juggling thousands of endpoints from different providers with unique\nrestrictions/requirements.\n\nIt is built for the purpose of fetching media posted on different sites in the form of slow, eventual consistency, and\nnot for instant change detection.\n\n## Providers\n\nProvider is the umbrella term that encapsulates all endpoints for a given domain.\n\nFor example, https://weverse.io/bts/artist and https://weverse.io/dreamcatcher/artist are 2 endpoints under the Weverse\nprovider.\n\n### Supported providers\n\n* [Twitter](https://twitter.com/RBW_MAMAMOO)\n* [Pinterest Boards](https://www.pinterest.com/janairaoliveira314/handong)\n* [Weverse.io](https://weverse.io/dreamcatcher/feed)\n* [United Cube](https://www.united-cube.com/)\n\n## Dynamic Priority \u0026 Tokens\n\nDynamic priority is the main idea behind how JiU can scrape many resources without getting rate limited.\n\nUnique endpoints that have more than 1 token are grouped by their provider type and get scheduled to be scraped at 
even\nintervals at the start of every day to avoid hammering APIs with requests.\n\n![](./assets/scrape_interval.png)\n\nAfter each successful request, a 30-day sliding window of that endpoint's request history gets graded on a curve that\ndetermines how its priority should change based on how many new images it found in each request.\n\n![](./assets/scraping_history.png)\n\nPages that post at least one image regularly get assigned a higher priority, up to a maximum of 3 requests every 2 days.\nPages that don't post anything sink down to a scrape schedule of once every 2 weeks.\n\nNew results found on more recent dates have a higher contribution to priority than those found further back. This curve\nallows JiU to match its request frequency with the changing posting schedule of sites it's processing to avoid wasting\nrequests on resources that are rarely updated.\n\nAt the end of each day, every endpoint gets tokens added to it equal to its current priority, which get checked as a\ncriterion when scheduling requests the next day.\n\n## Authorization\n\nAnonymous requests are always preferred when possible.\n\nThere is a customizable login flow for providers that require authorization which allows logging into APIs after an\nauthorization error, and persists additional data (such as a JWT token) to be shared across each provider during the\nlifetime of the process.\n\nThe login flow is reverse engineered for providers that don't have a public API.\n\n\u003e Juggling multiple accounts per provider is currently not supported and probably won't be as long as your accounts aren't getting banned (and if they are then you're sending too many requests and need to increase your rate limits).\n\nJiu will try its best to identify itself in its requests' `User-Agent` header, but will submit a fake UA for providers\nthat gate posts behind a user agent check like Twitter.\n\n## Proxies\n\nProxies are not supported or needed.\n\n## Webhooks\n\nJiu is capable of sending webhooks to 
multiple destinations when an update for a provider is detected.\n\nAlthough data about posts are aggregated within webhooks, they're not persisted to the database as that's the responsibility of the service receiving the events and are not relevant for image aggregation.\n\n```json\n{\n  \"provider\": {\n    \"type\": \"twitter.timeline\",\n    \"id\": \"729935154290925570\",\n    \"ephemeral\": false\n  },\n  \"posts\": [\n    {\n      \"unique_identifier\": \"1460196926796623873\",\n      \"body\": \"[#가현] 삐뚤빼뚤 즐거운 라이브였다❣️ 다음 주에도 재밌는 시간 보내 보카?\\n\\n#드림캐쳐 #Dreamcatcher #4주_집콕_프로젝트 https://t.co/r1ImPUPKkv\",\n      \"url\": \"https://twitter.com/hf_dreamcatcher/status/1460196926796623873\",\n      \"post_date\": null,\n      \"account\": {\n        \"name\":\"드림캐쳐 Dreamcatcher\",\n        \"avatar_url\":\"https://pbs.twimg.com/profile_images/1415983453200261124/4-viIm27_normal.jpg\"\n      },\n      \"metadata\": {\n        \"language\": \"ko\",\n        \"like_count\": 12474,\n        \"retweet_count\": 2760\n      },\n      \"images\": [\n        {\n          \"type\": \"Image\",\n          \"media_url\": \"https://pbs.twimg.com/media/FEOpVKmagAELmzI.jpg\",\n          \"reference_url\": \"https://twitter.com/hf_dreamcatcher/status/1460196926796623873/photo/1\",\n          \"unique_identifier\": \"1460196885285994497\",\n          \"metadata\": {\n            \"width\": 1128,\n            \"height\": 1504\n          }\n        },\n        {\n          \"type\": \"Image\",\n          \"media_url\": \"https://pbs.twimg.com/media/FEOpV2FaAAEG4zr.jpg\",\n          \"reference_url\": \"https://twitter.com/hf_dreamcatcher/status/1460196926796623873/photo/2\",\n          \"unique_identifier\": \"1460196896958709761\",\n          \"metadata\": {\n            \"width\": 1128,\n            \"height\": 1504\n          }\n        }\n      ]\n    }\n  ]\n}\n```\n\nEvery provider has its own `provider_metadata` field that _may_ contain extra information about the image or the 
post it\nwas found under, but may also be missing. _Documentation WIP_\n\nThe `unique_identifier` field is unique **per provider** and not globally.\n\nThe `ephemeral` field defines whether an image is only accessible for a short period after dispatch (for example\nInstagram image links expire after some time).\n\nIf a Discord webhook URL is detected, the payload is changed to allow Discord to display the images in the channel.\n\nThere is currently no retry mechanism for webhooks that fail to deliver successfully.\n\n## Endpoints\n\nJiu runs a webserver on port 8080 to allow dynamically resolving new resources by URL and getting stats at runtime.\n\n- `POST    /v1/provider` Create a new provider by resolving a URL to a resource\n- `DELETE  /v1/provider` Delete an existing provider (sets it to `enabled=false`)\n- `GET     /v1/schedule` Get the upcoming scheduled scrapes\n- `GET     /v1/history`  The list of the last 100 scraped endpoints\n- `GET     /v1/stats`    The stats of all the registered providers\n\n## Jiu is **NOT**:\n\n* For bombarding sites like Twitter with requests to detect changes within seconds.\n* Capable of executing JavaScript with a headless browser.\n* Able to send requests to any social media site without explicit support.\n\n## Jiu **IS**:\n\n* For slowly monitoring changes in different feeds over the course of multiple hours without abusing the provider.\n* Capable of adjusting the frequency of scrapes based on how frequently the source is updated.\n* Able to send webhooks or push to AMQP on discovery.\n* The lead singer of [Dreamcatcher](https://www.youtube.com/watch?v=1QD0FeZyDtQ).\n\n## Usage\n\n1. Copy over `.env.example` to `.env` and fill out relevant fields.\n2. `docker-compose up -d jiu_db` to start postgres.\n3. 
`RUST_LOG=jiu cargo run` to start the crawler.\n\nTo create a production-ready image, make sure to run `cargo sqlx generate` before building if you modified any of the\nSQL queries.\n\n\u003e If you would like to use this project, please change the `USER_AGENT` environment variable to identify your crawler accurately.\n\nBuilt for [kiyomi.io](https://github.com/xetera/kiyomi)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FXetera%2Fjiu","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FXetera%2Fjiu","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FXetera%2Fjiu/lists"}