{"id":22496785,"url":"https://github.com/keosariel/ramby","last_synced_at":"2025-03-27T21:24:37.295Z","repository":{"id":57744331,"uuid":"518013364","full_name":"keosariel/ramby","owner":"keosariel","description":"Ramby is a simple way to setup a webscraper","archived":false,"fork":false,"pushed_at":"2022-07-26T16:11:55.000Z","size":25,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-03T23:17:19.623Z","etag":null,"topics":["beautifulsoup","crawler","python3","webscraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/keosariel.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-07-26T10:23:12.000Z","updated_at":"2022-09-19T23:59:56.000Z","dependencies_parsed_at":"2022-08-30T11:32:29.373Z","dependency_job_id":null,"html_url":"https://github.com/keosariel/ramby","commit_stats":null,"previous_names":["keosariel/ramby"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/keosariel%2Framby","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/keosariel%2Framby/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/keosariel%2Framby/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/keosariel%2Framby/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/keosariel","download_url":"https://codeload.github.com/keosariel/ramby/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kin
d":"github","repositories_count":245925529,"owners_count":20694915,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["beautifulsoup","crawler","python3","webscraping"],"created_at":"2024-12-06T20:14:22.934Z","updated_at":"2025-03-27T21:24:37.272Z","avatar_url":"https://github.com/keosariel.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Ramby\n\nRamby is a simple way to set up a web scraper.\n\n## Installation\n\n`pip install ramby`\n\n## Examples\n\n```python\nfrom ramby import Ramby\n\nscraper = Ramby('./examples/hackernews.yaml')\ndata = scraper.scrape(\"https://news.ycombinator.com/item?id=32237445\")\n```\n\n## Configuration\n\nA configuration file needs two fields, `HOST` and `RULES`.\n\n### HOST\n\nThe `HOST` holds the base domain of the site you wish to scrape. Keep in mind that an error is thrown if you try to scrape a `URL` with a different `HOST`.\n\nIn practice, the `HOST` is added to the configuration like so:\n\n```yaml\nhost: example.com\n```\n\n### RULES\n\nA `RULE` is a way to target certain elements in a webpage. 
For example, to select the titles of the top posts on [Hacker News](https://news.ycombinator.com), you would write a rule like so:\n\n```yaml\nhost: news.ycombinator.com\n\nrules:\n    homepage:\n        pattern: '/' # The `/` path signifies we use the `homepage` rule \n        topics:    # This would denote a section in the homepage, making it easy to add other objects if needed, e.g. all_authors\n            title: # An object property\n                selector: '.athing .title \u003e a' # The title target\n                text: true                     # We would want the text inside the target element\n                # html: true is optional\n                count: 2                       # The amount of elements to return\n                attrs:                         # Specify the html attributes you'd want\n                    - href                     # Also taking the link to the post\n```\n\n#### Sample returned Object based on the rules above\n\n```python\n{'topics': {'title': {0: {'attrs': {'href': 'https://paulbutler.org/2022/why-is-it-so-hard-to-give-google-money/'},\n                          'text': 'Why is it so hard to give Google money?'},\n                      1: {'attrs': {'href': 'https://mullvad.net/en/blog/2022/7/26/mullvad-is-now-available-on-amazon-us-se/'},\n                          'text': 'Mullvad is now available on Amazon'}}}}\n```\n\n#### And if you choose to scrape a post and its comments\n\n```yaml\nhost: news.ycombinator.com\n\nrules:\n    homepage:\n        pattern: '/' # The `/` path signifies we use the `homepage` rule \n        topics:    # This would denote a section in the homepage, making it easy to add other objects if needed, e.g. all_authors\n            title: # An object property\n                selector: '.athing .title \u003e a' # The title target\n                text: true                     # We would want the text inside the target element\n                # html: true is optional\n                count: 
2                       # The amount of elements to return\n                attrs:                         # Specify the html attributes you'd want\n                    - href                     # Also taking the link to the post\n                  \n    posts:\n        pattern: /item/\n        post:\n            title: \n                selector: '.fatitem:first-child .title \u003e a'\n                count: 1\n                text: true\n                attrs: \n                    - href \n\n        comments:\n            texts:\n                selector: '.comment .commtext'\n                count: 2\n                text: true\n\n```\n\n#### Sample returned Object based on the rules above\n\n```python\n{'comments': {'texts': {0: {'text': 'Wonder how much money \u0026 resources Shopify '\n                                    'spent on all of their NFT features \u0026 '\n                                    'integrations over the last months, how '\n                                    'many people worked on it and how many of '\n                                    \"those are part of the lay-off now. I'd \"\n                                    \"guess the support you'd need to provide \"\n                                    'for it and their tokengated commerce '\n                                    \"isn't little either.Tobi removed all the \"\n                                    'NFT stuff from his Twitter profile and '\n                                    \"didn't tweet much about it for months \"\n                                    'now, after being pretty vocal about it '\n                                    'until earlier this year.Would love to '\n                                    'hear his real thoughts on it and why '\n                                    'he/they even (seemingly) invested so much '\n                                    'into it. One of the few things I never '\n                                    'got about Tobi / Shopify. 
Just seemed so '\n                                    'late and weird to be so bullish there. '\n                                    \"Don't think he's the kind of person to \"\n                                    'push it just for personal gain, nor that '\n                                    \"he'd have to, but ...\"},\n                        1: {'text': 'I’m honestly still in disbelief at how '\n                                    'many very smart people fell for the NFT '\n                                    'trap. If you’ve spent even a single bull '\n                                    'cycle in the crypto community you could '\n                                    'tell right away NFTs we’re ICO level '\n                                    'scams. The mental gymnastics very smart '\n                                    'and technical people performed to '\n                                    'rationalize paying for a jpeg still makes '\n                                    'me question reality. I participate in '\n                                    'crypto because I take a calculated risk, '\n                                    'and I’m comfortable gambling. People who '\n                                    'actually think something like an NFT has '\n                                    'any real value still messes with my head. '\n                                    'I really can’t grasp how they actually '\n                                    'believe this. 
And yes, I understand '\n                                    'technically how NFTs work.'}}},\n 'post': {'title': {0: {'attrs': {'href': 'https://www.wsj.com/articles/shopify-to-lay-off-10-of-workers-in-broad-shake-up-11658839047'},\n                        'text': 'Shopify to lay off 10% of workers in broad '\n                                'shake-up'}}}}\n```\n\n### See more examples [here](https://github.com/keosariel/ramby/tree/master/examples)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkeosariel%2Framby","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkeosariel%2Framby","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkeosariel%2Framby/lists"}