{"id":19492630,"url":"https://github.com/infoforcefeed/thray","last_synced_at":"2025-02-25T20:18:32.482Z","repository":{"id":66362330,"uuid":"43096990","full_name":"infoforcefeed/thray","owner":"infoforcefeed","description":"Scraper thing","archived":false,"fork":false,"pushed_at":"2015-10-27T05:08:10.000Z","size":1044,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":12,"default_branch":"master","last_synced_at":"2025-01-08T09:12:08.287Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/infoforcefeed.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-09-24T22:43:39.000Z","updated_at":"2015-10-25T04:56:56.000Z","dependencies_parsed_at":"2023-02-20T16:15:17.444Z","dependency_job_id":null,"html_url":"https://github.com/infoforcefeed/thray","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/infoforcefeed%2Fthray","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/infoforcefeed%2Fthray/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/infoforcefeed%2Fthray/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/infoforcefeed%2Fthray/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/infoforcefeed","download_url":"https://codeload.github.com/infoforcefeed/thray/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240738122,"owners_count":1
9849549,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-10T21:22:15.389Z","updated_at":"2025-02-25T20:18:32.449Z","avatar_url":"https://github.com/infoforcefeed.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"THRAY\n=====\n\n![](./thray.gif?raw=true)\n\nThray is a generalized scraper/consumer thing. More information coming.\n\nBackground\n----------\n\n```\n2015-09-11 14:42:28     WAMPA_STOMPA    i'll crawl through to the point that was actually \ncrawling\n2015-09-11 14:42:44     WAMPA_STOMPA    there is really nothing interesting about the \nimplementation, and the architecture is pretty straight forward\n2015-09-11 14:43:01     WAMPA_STOMPA    you just have producers that are crawling and pushing \ninto a shared queue for the consumers\n2015-09-11 14:43:13     WAMPA_STOMPA    the consumers all have their own db they are writing to\n2015-09-11 14:43:21     WAMPA_STOMPA    then at the end the dbs are condensed into a single db\n\n...\n\n2015-09-11 15:51:06     Xamayon the crawler needs a massive queue of links, which are \nprovided to workers, workers grab the data at the link and insert it into something, \nanalyzers take that data and mark the link in the queue either done or failed, and if \ndone, grab more links out of the data if textual, or save the file sanely if binary\n2015-09-11 15:51:35     Xamayon if failed, it goes back into queue and gets processed again\n2015-09-11 15:52:06     Xamayon that's the general idea atleast, there are a few tricky parts\n2015-09-11 15:52:50     Xamayon If I can get this 
working for DA, I'll probably use it to \nredo pixiv too\n```\n\nGeneral Architecture\n--------------------\n\nTech involved:\n* C++\n* Riak (for distributing scraped information, deduping stuff)\n* Mother Postgres for processed data (notes text, usernames, reblogs, whatever)\n* Whatever random libraries we need for C++\n\nScraper\n=======\n\nScrapers are small, distributed C++ processes responsible for pulling endpoints out of a distributed queue. A scraper hits the Tumblr API for that endpoint's data, puts it into Riak, marks the username as scraped (with metadata?), sticks a new job into the \"To Be Processed\" queue, and moves on to the next item in the \"To Be Scraped\" queue. This should also include any associated media, like images or video.\n\nProcessor\n=========\n\nProcessors continually pull scraped blobs out of Riak and transform them into identified data, which is loaded into Postgres. This includes things like the username, note text, tags, date posted, etc. We should be able to repopulate Postgres at any time with data from Riak.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finfoforcefeed%2Fthray","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Finfoforcefeed%2Fthray","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finfoforcefeed%2Fthray/lists"}