{"id":25439653,"url":"https://github.com/marcuth/xcrap","last_synced_at":"2025-05-15T10:10:22.722Z","repository":{"id":222576878,"uuid":"757782862","full_name":"marcuth/xcrap","owner":"marcuth","description":"Xcrap is a Web Scraping framework for JavaScript, designed to facilitate the process of extracting data from multiple pages or even just one, with a sophisticated page parsing system.","archived":false,"fork":false,"pushed_at":"2025-03-14T23:17:34.000Z","size":698,"stargazers_count":0,"open_issues_count":3,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-15T00:23:13.110Z","etag":null,"topics":["crawling","scraping","typescript","web","xcrap"],"latest_commit_sha":null,"homepage":"https://www.npmjs.com/package/xcrap","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/marcuth.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-15T00:56:18.000Z","updated_at":"2025-03-14T23:17:37.000Z","dependencies_parsed_at":"2024-11-02T16:21:21.353Z","dependency_job_id":"4a251828-14c4-498a-ad2f-e665ffceb2d0","html_url":"https://github.com/marcuth/xcrap","commit_stats":{"total_commits":41,"total_committers":1,"mean_commits":41.0,"dds":0.0,"last_synced_commit":"45d69fd5a2e2117d1b98a62551d90e704863e56b"},"previous_names":["1marcuth/xcrap","marcuth/xcrap"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/marcuth%2Fxcrap","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/marcuth%2Fxcra
p/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/marcuth%2Fxcrap/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/marcuth%2Fxcrap/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/marcuth","download_url":"https://codeload.github.com/marcuth/xcrap/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254319720,"owners_count":22051075,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawling","scraping","typescript","web","xcrap"],"created_at":"2025-02-17T10:24:13.046Z","updated_at":"2025-05-15T10:10:22.710Z","avatar_url":"https://github.com/marcuth.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv style=\"text-align: center;font-weight: 500;\"\u003e“If I have seen further, it is by standing on the shoulders of giants.”\u003c/div\u003e\n\n## Technologies used:\n\n- [node-html-parser](https://www.npmjs.com/package/node-html-parser)\n- [axios](https://www.npmjs.com/package/axios)\n- [puppeteer](https://www.npmjs.com/package/puppeteer)\n- [axios-rate-limit](https://www.npmjs.com/package/axios-rate-limit)\n- [date-fns](https://www.npmjs.com/package/date-fns)\n\n## Books I read:\n- Web Scraping with Python: Data Extraction from the Modern Web:\n    - [PT-BR](https://encurtador.com.br/svq8Y)\n    - [EN](https://encurtador.com.br/5dS11)\n\n## Friends I've been arguing with:\n- Rafael F.: https://github.com/justonedev42/\n\n---\n\n# Xcrap: A Web Scraping Framework for 
JavaScript\n\nXcrap is a framework written in TypeScript to handle data extraction from web pages.\n\n---\n\nData extraction works based on two types of models:\n\n## HtmlParsingModel\n\nEach model key receives a `query`, which is a CSS selector, and an `extractor`, which is a function that extracts a given property from an HTML element. A field can return multiple results if you indicate this through `fieldType`. Models also support nesting, so you can put models inside models to obtain a complex data structure. You can also declare that a field is a group of objects through the `isGroup` property, but don't get too attached to the resulting data structure.\n\n## TransformationModel\n\nEach model key receives an array of functions called `middlewares`. These `middlewares` work much like the ones used when building a backend server: a middleware may or may not call the next one. The key does not need to exist in the `HtmlParsingModel` you used for data extraction. Each function receives an object containing all the keys from the extraction result, so you can structure the data however you want.\n\n## Clients\n\n### Default Clients\n\nXcrap ships with two clients by default, `AxiosClient` and `PuppeteerClient`, which use [`Axios`](https://npmjs.com/package/axios) and [`Puppeteer`](https://www.npmjs.com/package/puppeteer) respectively to handle HTTP requests and retrieve the HTML of a website.\n\n---\n\n### Custom Clients\n\nIf you want to use another library to handle HTTP requests, or customize what happens between one request and the next, you can make your own custom client by extending the `BaseClient` class.\n\nHere is an example of how [`PuppeteerExtraClient` (xcrap-puppeteer-extra-client)](https://www.npmjs.com/package/xcrap-puppeteer-extra-client) was made:\n\n```ts\nimport { PuppeteerClientOptions } from \"xcrap/dist/clients/puppeteer.client\"\nimport puppeteer, { PuppeteerExtraPlugin } from 
\"puppeteer-extra\"\nimport { PuppeteerClient } from \"xcrap\"\n\nexport type PuppeteerExtraClientOptions = PuppeteerClientOptions \u0026 {\n    plugins?: PuppeteerExtraPlugin[]\n}\n\nclass PuppeteerExtraClient extends PuppeteerClient {\n    public constructor(options: PuppeteerExtraClientOptions = {}) {\n        super(options)\n\n        if (options.plugins) {\n            for (const plugin of options.plugins) {\n                this.usePlugin(plugin)\n            }\n        }\n    }\n\n    protected async initBrowser(): Promise\u003cvoid\u003e {\n        const puppeteerArguments: string[] = []\n\n        if (this.proxy) {\n            const currentProxy = typeof this.proxy === \"function\" ?\n                this.proxy() :\n                this.proxy\n\n            puppeteerArguments.push(`--proxy-server=${currentProxy}`)\n        }\n\n        if (this.options.args \u0026\u0026 this.options.args.length \u003e 0) {\n            puppeteerArguments.push(...this.options.args)\n        }\n\n        this.browser = await puppeteer.launch({\n            ...this.options,\n            args: puppeteerArguments,\n            headless: this.options.headless ? \"shell\" : false\n        })\n    }\n\n    public usePlugin(plugin: PuppeteerExtraPlugin): void {\n        puppeteer.use(plugin)\n    }\n}\n\nexport default PuppeteerExtraClient\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmarcuth%2Fxcrap","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmarcuth%2Fxcrap","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmarcuth%2Fxcrap/lists"}