https://github.com/marcuth/xcrap
Xcrap is a Web Scraping framework for JavaScript, designed to facilitate the process of extracting data from multiple pages or even just one, with a sophisticated page parsing system.
- Host: GitHub
- URL: https://github.com/marcuth/xcrap
- Owner: marcuth
- License: mit
- Created: 2024-02-15T00:56:18.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2025-03-14T23:17:34.000Z (over 1 year ago)
- Last Synced: 2025-03-15T00:23:13.110Z (over 1 year ago)
- Topics: crawling, scraping, typescript, web, xcrap
- Language: TypeScript
- Homepage: https://www.npmjs.com/package/xcrap
- Size: 682 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
“If I have seen further, it is by standing on the shoulders of giants.”
## Technologies used:
- [node-html-parser](https://www.npmjs.com/package/node-html-parser)
- [axios](https://www.npmjs.com/package/axios)
- [puppeteer](https://www.npmjs.com/package/puppeteer)
- [axios-rate-limit](https://www.npmjs.com/package/axios-rate-limit)
- [date-fns](https://www.npmjs.com/package/date-fns)
## Books I read:
- Web Scraping with Python: Data Extraction from the Modern Web:
- [PT-BR](https://encurtador.com.br/svq8Y)
- [EN](https://encurtador.com.br/5dS11)
## Friends I've been arguing with:
- Rafael F.: https://github.com/justonedev42/
---
# Xcrap: A Web Scraping Framework for JavaScript
Xcrap is a framework written in TypeScript for extracting data from web pages.
---
Data extraction works based on two types of models:
## HtmlParsingModel
Each model key receives a `query`, which is a CSS selector, and an `extractor`, which is a function that extracts a given property from an HTML element. A field can return multiple results if you indicate this in `fieldType`. The model also supports nesting, so you can put models inside models to obtain a complex data structure, and you can declare that a nested model is a group of objects through the `isGroup` property. Don't get too attached to the resulting data structure, though.
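To make the idea concrete, here is a minimal, self-contained sketch of how a model with `query`, `extractor`, `fieldType`, nested models and `isGroup` could be walked. It is not Xcrap's implementation: `ElementLike`, `selectAll`, `parseWithModel`, `extractText`, `extractAttribute` and the `"single" | "multiple"` values are all hypothetical names chosen for illustration, and the stand-in element tree only understands tag-name queries rather than full CSS selectors.

```typescript
// Hypothetical stand-in for a parsed HTML element (not Xcrap's types).
interface ElementLike {
    tagName: string
    text: string
    attributes: Record<string, string>
    children: ElementLike[]
}

// Assumption: only tag-name queries are supported, to keep the sketch small.
function selectAll(root: ElementLike, query: string): ElementLike[] {
    const found: ElementLike[] = []
    for (const child of root.children) {
        if (child.tagName === query) found.push(child)
        found.push(...selectAll(child, query))
    }
    return found
}

type Extractor = (element: ElementLike) => string | undefined

// A field either extracts a value or delegates to a nested model.
type FieldModel =
    | { query: string; extractor: Extractor; fieldType?: "single" | "multiple" }
    | { query: string; model: ParsingModel; isGroup?: boolean }

interface ParsingModel {
    [key: string]: FieldModel
}

function parseWithModel(root: ElementLike, model: ParsingModel): Record<string, unknown> {
    const result: Record<string, unknown> = {}
    for (const [key, field] of Object.entries(model)) {
        const matches = selectAll(root, field.query)
        if ("model" in field) {
            // Nested model: parse each match with the inner model.
            const nested = field.model
            const parsed = matches.map((element) => parseWithModel(element, nested))
            result[key] = field.isGroup ? parsed : parsed[0]
        } else {
            const extractor = field.extractor
            const values = matches.map((element) => extractor(element))
            result[key] = field.fieldType === "multiple" ? values : values[0]
        }
    }
    return result
}

// Two illustrative extractors.
const extractText: Extractor = (element) => element.text.trim()
const extractAttribute = (name: string): Extractor => (element) => element.attributes[name]

// Small helper to build the stand-in element tree.
function makeElement(
    tagName: string,
    text: string,
    attributes: Record<string, string> = {},
    children: ElementLike[] = []
): ElementLike {
    return { tagName, text, attributes, children }
}

const page = makeElement("body", "", {}, [
    makeElement("h1", " Books "),
    makeElement("li", "", {}, [makeElement("a", "Dune", { href: "/dune" })]),
    makeElement("li", "", {}, [makeElement("a", "Emma", { href: "/emma" })]),
])

const parsingModel: ParsingModel = {
    heading: { query: "h1", extractor: extractText },
    links: { query: "a", extractor: extractAttribute("href"), fieldType: "multiple" },
    books: {
        query: "li",
        isGroup: true,
        model: { title: { query: "a", extractor: extractText } },
    },
}

const parsed = parseWithModel(page, parsingModel)
console.log(JSON.stringify(parsed))
```

In this sketch, `parsed` comes out as `{ heading: "Books", links: ["/dune", "/emma"], books: [{ title: "Dune" }, { title: "Emma" }] }`: a single value, a list of values, and a group of nested objects from one model.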
## TransformationModel
Each model key receives an array of functions called `middlewares`. These `middlewares` work much like the ones we are used to when creating a backend server: each one may or may not call the next middleware. The key does not need to exist in the `HtmlParsingModel` you used for data extraction. Each function receives an object containing all the keys from the extraction result, so you can structure the data however you want.
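The middleware idea can be sketched in a few lines. This is only an illustration of the pattern, not Xcrap's API: the `Middleware` signature (extracted data, the value produced so far, and a `next` callback), `runChain` and `transform` are assumptions made for this example.

```typescript
// Hypothetical types for illustration; Xcrap's real signatures may differ.
type ExtractedData = Record<string, unknown>
type Next = (value: unknown) => unknown
type Middleware = (data: ExtractedData, value: unknown, next: Next) => unknown
type TransformationModel = Record<string, Middleware[]>

// Run one key's middlewares in order. A middleware that returns without
// calling `next` ends the chain, and its return value becomes the result.
function runChain(middlewares: Middleware[], data: ExtractedData): unknown {
    const dispatch = (index: number, value: unknown): unknown => {
        const middleware = middlewares[index]
        if (middleware === undefined) return value // end of the chain
        return middleware(data, value, (nextValue) => dispatch(index + 1, nextValue))
    }
    return dispatch(0, undefined)
}

// Build the output object key by key; every middleware sees the full
// extraction result, so output keys need not match the extracted keys.
function transform(model: TransformationModel, data: ExtractedData): ExtractedData {
    const result: ExtractedData = {}
    for (const [key, middlewares] of Object.entries(model)) {
        result[key] = runChain(middlewares, data)
    }
    return result
}

const extracted: ExtractedData = { title: "  Hello World  ", price: "12.50" }

const transformationModel: TransformationModel = {
    title: [
        (data, _value, next) => next(String(data.title).trim()),
        (_data, value, next) => next(String(value).toUpperCase()),
    ],
    // `priceInCents` does not exist in the extracted data; it is derived from `price`.
    priceInCents: [
        (data) => Math.round(Number(data.price) * 100), // does not call next
        () => -1, // never reached, because the previous middleware stopped the chain
    ],
}

const transformed = transform(transformationModel, extracted)
console.log(transformed)
```

Here `title` passes through both middlewares and ends up as `"HELLO WORLD"`, while `priceInCents` stops at the first middleware and yields `1250`.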
## Clients
### Default Clients
Xcrap ships with two clients by default, `AxiosClient` and `PuppeteerClient`, which use [`Axios`](https://npmjs.com/package/axios) and [`Puppeteer`](https://www.npmjs.com/package/puppeteer) respectively to handle HTTP requests and retrieve the HTML of a website.
---
### Custom Clients
If you want to use another library to handle HTTP requests or even customize something that happens from one request to another, you can make your own custom client by extending the `BaseClient` class.
Here is an example of how [`PuppeteerExtraClient` (xcrap-puppeteer-extra-client)](https://www.npmjs.com/package/xcrap-puppeteer-extra-client) was made:
```ts
import { PuppeteerClientOptions } from "xcrap/dist/clients/puppeteer.client"
import puppeteer, { PuppeteerExtraPlugin } from "puppeteer-extra"
import { PuppeteerClient } from "xcrap"

export type PuppeteerExtraClientOptions = PuppeteerClientOptions & {
    plugins?: PuppeteerExtraPlugin[]
}

class PuppeteerExtraClient extends PuppeteerClient {
    public constructor(options: PuppeteerExtraClientOptions = {}) {
        super(options)

        // Register any plugins passed in the options
        if (options.plugins) {
            for (const plugin of options.plugins) {
                this.usePlugin(plugin)
            }
        }
    }

    // Launch the browser through puppeteer-extra so registered plugins apply
    protected async initBrowser(): Promise<void> {
        const puppeteerArguments: string[] = []

        // The proxy can be a fixed value or a function that returns one
        if (this.proxy) {
            const currentProxy = typeof this.proxy === "function" ?
                this.proxy() :
                this.proxy

            puppeteerArguments.push(`--proxy-server=${currentProxy}`)
        }

        // Append any extra launch arguments supplied by the user
        if (this.options.args && this.options.args.length > 0) {
            puppeteerArguments.push(...this.options.args)
        }

        this.browser = await puppeteer.launch({
            ...this.options,
            args: puppeteerArguments,
            headless: this.options.headless ? "shell" : false
        })
    }

    public usePlugin(plugin: PuppeteerExtraPlugin): void {
        puppeteer.use(plugin)
    }
}

export default PuppeteerExtraClient
```