https://github.com/tomas2d/puppeteer-table-parser
Scrape and parse HTML tables with the Puppeteer table parser.
csv html javascript puppeteer puppeteer-tables scrape scraping table typescript
- Host: GitHub
- URL: https://github.com/tomas2d/puppeteer-table-parser
- Owner: Tomas2D
- License: mit
- Created: 2021-03-18T08:32:59.000Z (almost 4 years ago)
- Default Branch: master
- Last Pushed: 2024-12-13T00:56:28.000Z (14 days ago)
- Last Synced: 2024-12-18T19:20:25.791Z (8 days ago)
- Topics: csv, html, javascript, puppeteer, puppeteer-tables, scrape, scraping, table, typescript
- Language: TypeScript
- Homepage:
- Size: 1.69 MB
- Stars: 22
- Watchers: 3
- Forks: 3
- Open Issues: 4
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Funding: .github/FUNDING.yml
- License: LICENSE
README
# 🕸 🕷 puppeteer-table-parser
Library to make parsing website tables much easier!
When you use `puppeteer` to scrape websites and web applications, you will find that parsing tables consistently is not that easy.
This library provides an abstraction layer between `puppeteer` and the page context.

## This library solves the following issues:
- ✨ Parsing columns by their name.
- ✨ Respect the defined order of columns.
- ✨ Appending custom columns with custom data.
- ✨ Custom sanitization of data in cells.
- ✨ Group and Aggregate data by your own function.
- ✨ Merge data from two independent tables into one structure.
- ✨ Handles invalid HTML structure.
- ✨ Retrieve results as CSV or array of plain JS objects.
- ✨ And much more!

## Installation
```shell
yarn add puppeteer-table-parser
```
```shell
npm install puppeteer-table-parser
```

```typescript
// CommonJS
const { tableParser } = require('puppeteer-table-parser')

// ESM / TypeScript
import { tableParser } from 'puppeteer-table-parser'
```

## API
```typescript
interface ParserSettings {
  selector: string; // CSS selector
  allowedColNames: Record<string, string>; // (key = input name, value = output name)

  headerRowsSelector?: string | null; // (default: 'thead tr', null ignores table's header selection)
  headerRowsCellSelector?: string; // (default: 'td,th')
  bodyRowsSelector?: string; // (default: 'tbody tr')
  bodyRowsCellSelector?: string; // (default: 'td')
  reverseTraversal?: boolean // (default: false)
  temporaryColNames?: string[]; // (default: [])
  extraCols?: ExtraCol[]; // (default: [])
  withHeader?: boolean; // (default: true)
  csvSeparator?: string; // (default: ';')
  newLine?: string; // (default: '\n')
  rowValidationPolicy?: RowValidationPolicy; // (default: 'NON_EMPTY')
  groupBy?: {
    cols: string[];
    handler?: (rows: string[][], getColumnIndex: GetColumnIndexType) => string[];
  };
  rowValidator: (
    row: string[],
    getColumnIndex: GetColumnIndexType,
    rowIndex: number,
    rows: Readonly<string[][]>,
  ) => boolean;
  rowTransform?: (row: string[], getColumnIndex: GetColumnIndexType) => void;
  asArray?: boolean; // (default: false)
  rowValuesAsArray?: boolean; // (default: false)
  rowValuesAsObject?: boolean; // (default: false)
  colFilter?: (elText: string[], index: number) => string; // (default: (txt: string) => txt.join(' '))
  colParser?: (value: string, formattedIndex: number, getColumnIndex: GetColumnIndexType) => string; // (default: (txt: string) => txt.trim())
  optionalColNames?: string[]; // (default: [])
};
```

## Parsing workflow
1. Find table(s) by provided CSS selector.
2. Find associated columns by applying `colFilter` on their text and verify their count.
3. Filter rows based on `rowValidationPolicy`.
4. Add extra columns specified in `extraCols` property in settings.
5. Run `rowValidator` function for every table row.
6. Run `colParser` for every cell in a row.
7. Run `rowTransform` function for each row.
8. Group results into buckets by the `groupBy.cols` property and pick the aggregated rows.
9. Add processed row to a temp array result.
10. Add the header row if the `withHeader` property is `true`.
11. Merge partial results and return them.

## Examples
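As a rough illustration of the parsing workflow, the steps that run after the cells have been extracted (row validation, cell parsing, row transform, grouping, header, CSV join) can be mimicked on plain arrays. This is only a sketch under stated assumptions, not the library's implementation: `runPipeline` and `SketchSettings` are hypothetical names, the settings are a reduced subset of `ParserSettings`, and the sample rows are copied from the CSV output shown in this README.

```typescript
// Illustrative only: NOT the library's code. Hypothetical names throughout.
type GetColumnIndex = (colName: string) => number;

interface SketchSettings {
  header: string[]; // output column names, in output order
  rowValidator?: (row: string[], getColumnIndex: GetColumnIndex) => boolean;
  colParser?: (value: string) => string;
  rowTransform?: (row: string[], getColumnIndex: GetColumnIndex) => void;
  groupBy?: {
    cols: string[];
    handler: (rows: string[][], getColumnIndex: GetColumnIndex) => string[];
  };
  withHeader?: boolean;
  csvSeparator?: string;
  newLine?: string;
}

function runPipeline(rows: string[][], settings: SketchSettings): string {
  const {
    header,
    rowValidator = () => true,
    colParser = (value: string) => value.trim(),
    rowTransform = () => {},
    groupBy,
    withHeader = true,
    csvSeparator = ';',
    newLine = '\n',
  } = settings;

  // Resolves an output column name to its position in a row.
  const getColumnIndex: GetColumnIndex = (colName) => {
    const index = header.indexOf(colName);
    if (index === -1) throw new Error(`Unknown column: ${colName}`);
    return index;
  };

  // Steps 5-7: validate each row, parse each cell, then transform rows in place.
  let result = rows
    .filter((row) => rowValidator(row, getColumnIndex))
    .map((row) => row.map((cell) => colParser(cell)));
  result.forEach((row) => rowTransform(row, getColumnIndex));

  // Step 8: bucket rows by the values of `groupBy.cols`, keep one row per bucket.
  if (groupBy) {
    const buckets = new Map<string, string[][]>();
    for (const row of result) {
      const key = groupBy.cols.map((col) => row[getColumnIndex(col)]).join('\u0000');
      const bucket = buckets.get(key) ?? [];
      bucket.push(row);
      buckets.set(key, bucket);
    }
    result = Array.from(buckets.values()).map((bucket) =>
      groupBy.handler(bucket, getColumnIndex),
    );
  }

  // Steps 9-11: prepend the header if requested and join everything as CSV.
  const lines = withHeader ? [header, ...result] : result;
  return lines.map((row) => row.join(csvSeparator)).join(newLine);
}

const csv = runPipeline(
  [
    ['Audi S5', '332', '2015'],
    ['Alfa Romeo Giulia', '500', '2020'],
    ['BMW X3', '215', '2017'],
    ['Skoda Octavia', '120', '2012'],
  ],
  {
    header: ['car', 'hp', 'year'],
    rowValidator: (row, getColumnIndex) => Number(row[getColumnIndex('hp')]) < 250,
  },
);
console.log(csv);
// car;hp;year
// BMW X3;215;2017
// Skoda Octavia;120;2012
```

The sketch skips everything that needs a browser (selectors, header matching via `colFilter`, `rowValidationPolicy`, `extraCols`), so it is useful mainly for reasoning about the order in which the callbacks see a row.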
> All data comes from the HTML page in `test/assets/1.html`.
**Basic example** (a simple table from which we parse three columns without modification)
```typescript
import { tableParser } from 'puppeteer-table-parser'

await tableParser(page, {
  selector: 'table',
  allowedColNames: {
    'Car Name': 'car',
    'Horse Powers': 'hp',
    'Manufacture Year': 'year',
  },
});
```

```csv
car;hp;year
Audi S5;332;2015
Alfa Romeo Giulia;500;2020
BMW X3;215;2017
Skoda Octavia;120;2012
```

**Basic example** with custom column name parsing:
```typescript
import { tableParser } from 'puppeteer-table-parser'

await tableParser(page, {
  selector: 'table',
  colFilter: (value: string[]) => {
    return value.join(' ').replace(' ', '-').toLowerCase();
  },
  colParser: (value: string) => {
    return value.trim();
  },
  allowedColNames: {
    'car-name': 'car',
    'horse-powers': 'hp',
    'manufacture-year': 'year',
  },
})
```

```csv
car;hp;year
Audi S5;332;2015
Alfa Romeo Giulia;500;2020
BMW X3;215;2017
Skoda Octavia;120;2012
```

**Basic example** with row validation and a temporary column:
```typescript
import { tableParser } from 'puppeteer-table-parser'

await tableParser(page, {
  selector: 'table',
  allowedColNames: {
    'Car Name': 'car',
    'Manufacture Year': 'year',
    'Horse Powers': 'hp',
  },
  temporaryColNames: ['Horse Powers'],
  rowValidator: (row: string[], getColumnIndex) => {
    const powerIndex = getColumnIndex('hp');
    return Number(row[powerIndex]) < 250;
  },
});
```

```csv
car;year
BMW X3;2017
Skoda Octavia;2012
```

**Advanced example:**
Uses a temporary column for filtering and an extra column at a custom position, filled on the fly.

```typescript
import { tableParser } from 'puppeteer-table-parser'

await tableParser(page, {
  selector: 'table',
  allowedColNames: {
    'Manufacture Year': 'year',
    'Horse Powers': 'hp',
    'Car Name': 'car',
  },
  temporaryColNames: ['Horse Powers'],
  extraCols: [
    {
      colName: 'favorite',
      data: '',
      position: 0,
    },
  ],
  rowValidator: (row: string[], getColumnIndex) => {
    const horsePowerIndex = getColumnIndex('hp');
    return Number(row[horsePowerIndex]) > 150;
  },
  rowTransform: (row: string[], getColumnIndex) => {
    const nameIndex = getColumnIndex('car');
    const favoriteIndex = getColumnIndex('favorite');

    if (row[nameIndex].includes('Alfa Romeo')) {
      row[favoriteIndex] = 'YES';
    } else {
      row[favoriteIndex] = 'NO';
    }
  },
  asArray: false,
  rowValuesAsArray: false
});
```

```csv
favorite;year;car
NO;2015;Audi S5
YES;2020;Alfa Romeo Giulia
NO;2017;BMW X3
```

**Optional columns**
Sometimes some of the columns you want are not available in a table.
You can mark them as optional via the `optionalColNames` property.

```typescript
import { tableParser } from 'puppeteer-table-parser'

await tableParser(page, {
  selector: 'table',
  allowedColNames: {
    'Car Name': 'car',
    'Rating': 'rating',
  },
  optionalColNames: ['rating']
});
```

**Grouping and Aggregating**
```typescript
import { tableParser } from 'puppeteer-table-parser'

await tableParser(page, {
  selector: '#my-table',
  allowedColNames: {
    'Employee Name': 'name',
    'Age': 'age',
  },
  groupBy: {
    cols: ['name'],
    handler: (rows: string[][], getColumnIndex) => {
      const ageIndex = getColumnIndex('age');

      // select the row with the minimal age
      // (cells are strings, so convert before comparing numerically)
      return rows.reduce((previous, current) =>
        Number(previous[ageIndex]) < Number(current[ageIndex]) ? previous : current,
      );
    },
  }
});
```

For more, look at the `test` folder! 🙈
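
One detail worth keeping in mind when writing a `groupBy.handler`: rows are typed as `string[][]`, so cells compare as strings unless they are converted first. The standalone sketch below shows the difference on one bucket of rows. The sample rows, `lookup`, and `pickYoungest` are made up for illustration and are not part of the library; `lookup` merely stands in for the `getColumnIndex` function the parser passes to handlers.

```typescript
type GetColumnIndex = (colName: string) => number;

// Hypothetical stand-in for the lookup function handed to the handler.
const lookup: GetColumnIndex = (colName) => ['name', 'age'].indexOf(colName);

// Picks the row with the smallest numeric age from one bucket of rows.
const pickYoungest = (rows: string[][], getColumnIndex: GetColumnIndex): string[] => {
  const ageIndex = getColumnIndex('age');
  return rows.reduce((previous, current) =>
    Number(previous[ageIndex]) <= Number(current[ageIndex]) ? previous : current,
  );
};

// Made-up bucket: three rows that share the same `name`.
const bucket: string[][] = [
  ['John', '10'],
  ['John', '9'],
  ['John', '41'],
];

const youngest = pickYoungest(bucket, lookup);
console.log(youngest); // [ 'John', '9' ]

// String comparison is lexicographic, so '10' sorts before '9'.
const ten: string = '10';
const nine: string = '9';
console.log(ten < nine); // true
console.log(Number(ten) < Number(nine)); // false
```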