Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/tomas2d/puppeteer-table-parser

Scrape and parse HTML tables with the Puppeteer table parser.
https://github.com/tomas2d/puppeteer-table-parser

csv html javascript puppeteer puppeteer-tables scrape scraping table typescript

Last synced: about 1 month ago
JSON representation

Scrape and parse HTML tables with the Puppeteer table parser.

Awesome Lists containing this project

README

        

# 🕸 🕷 puppeteer-table-parser

Library to make parsing website tables much easier!
When you are using `puppeteer` for scrapping websites and web application, you will find out that parsing tables consistently is not that easy.
This library brings you abstraction between `puppeteer` and `page context`.

## This library solves the following issues:

- ✨ Parsing columns by their name.
- ✨ Respect the defined order of columns.
- ✨ Appending custom columns with custom data.
- ✨ Custom sanitization of data in cells.
- ✨ Group and Aggregate data by your own function.
- ✨ Merge data from two independent tables into one structure.
- ✨ Handles invalid HTML structure.
- ✨ Retrieve results as CSV or array of plain JS objects.
- ✨ And much more!

## Installation

```shell
yarn add puppeteer-table-parser
```
```shell
npm install puppeteer-table-parser
```

```typescript
// CommonJS
const { tableParser } = require('puppeteer-table-parser')

// ESM / Typescript
import { tableParser } from 'puppeteer-table-parser'
```

## API

```typescript
interface ParserSettings {
selector: string; // CSS selector
allowedColNames: Record; // key = input name, value = output name)

headerRowsSelector?: string | null; // (default: 'thead tr', null ignores table's header selection)
headerRowsCellSelector?: string; // (default: 'td,th')
bodyRowsSelector?: string; // (default: 'tbody tr')
bodyRowsCellSelector?: string; // (default: 'td')
reverseTraversal?: boolean // (default: false)
temporaryColNames?: string[]; // (default: [])
extraCols?: ExtraCol[]; // (default: [])
withHeader?: boolean; // (default: true)
csvSeparator?: string; // (default: ';')
newLine?: string; // (default: '\n')
rowValidationPolicy?: RowValidationPolicy; // (default: 'NON_EMPTY')
groupBy?: {
cols: string[];
handler?: (rows: string[][], getColumnIndex: GetColumnIndexType) => string[];
}
rowValidator: (
row: string[],
getColumnIndex: GetColumnIndexType,
rowIndex: number,
rows: Readonly,
) => boolean;
rowTransform?: (row: string[], getColumnIndex: GetColumnIndexType) => void;
asArray?: boolean; // (default: false)
rowValuesAsArray?: boolean; // (default: false)
rowValuesAsObject?: boolean; // (default: false)
colFilter?: (elText: string[], index: number) => string; // (default: (txt: string) => txt.join(' '))
colParser?: (value: string, formattedIndex: number, getColumnIndex: GetColumnIndexType) => string; // (default: (txt: string) => txt.trim())
optionalColNames?: string[]; // (default: [])
};
```

## Parsing workflow

1. Find table(s) by provided CSS selector.
2. Find associated columns by applying `colFilter` on their text and verify their count.
3. Filter rows based on `rowValidationPolicy`
4. Add extra columns specified in `extraCols` property in settings.
5. Run `rowValidator` function for every table row.
6. Run `colParser` for every cell in a row.
7. Run `rowTransform` function for each row.
8. Group results into buckets (`groupBy.cols`) property and pick the aggregated rows.
9. Add processed row to a temp array result.
10. Add `header` column if `withHeader` property is `true`.
11. Merge partial results and return them.

## Examples

> All data came from the HTML page, which you can find in `test/assets/1.html`.

**Basic example** (the simple table where we want to parse three columns without editing)

```typescript
import { tableParser } from 'puppeteer-table-parser'

await tableParser(page, {
selector: 'table',
allowedColNames: {
'Car Name': 'car',
'Horse Powers': 'hp',
'Manufacture Year': 'year',
},
});
```

```csv
car;hp;year
Audi S5;332;2015
Alfa Romeo Giulia;500;2020
BMW X3;215;2017
Skoda Octavia;120;2012
```

**Basic example** with custom column name parsing:

```typescript
import { tableParser } from 'puppeteer-table-parser'

await tableParser(page, {
selector: 'table',
colFilter: (value: string[]) => {
return value.join(' ').replace(' ', '-').toLowerCase();
},
colParser: (value: string) => {
return value.trim();
},
allowedColNames: {
'car-name': 'car',
'horse-powers': 'hp',
'manufacture-year': 'year',
},
})
```

```csv
car;hp;year
Audi S5;332;2015
Alfa Romeo Giulia;500;2020
BMW X3;215;2017
Skoda Octavia;120;2012
```

**Basic example** with row validation and using temporary column.

```typescript
import { tableParser } from 'puppeteer-table-parser'

await tableParser(page, {
selector: 'table',
allowedColNames: {
'Car Name': 'car',
'Manufacture Year': 'year',
'Horse Powers': 'hp',
},
temporaryColNames: ['Horse Powers'],
rowValidator: (row: string[], getColumnIndex) => {
const powerIndex = getColumnIndex('hp');
return Number(row[powerIndex]) < 250;
},
});
```

```csv
car;year
BMW X3;2017
Skoda Octavia;2012
```

**Advanced example:**

Uses custom temporary column for filtering. It uses an extra column with custom
position to be filled on a fly.

```typescript
import { tableParser } from 'puppeteer-table-parser'

await tableParser(page, {
selector: 'table',
allowedColNames: {
'Manufacture Year': 'year',
'Horse Powers': 'hp',
'Car Name': 'car',
},
temporaryColNames: ['Horse Powers'],
extraCols: [
{
colName: 'favorite',
data: '',
position: 0,
},
],
rowValidator: (row: string[], getColumnIndex) => {
const horsePowerIndex = getColumnIndex('hp');
return Number(row[horsePowerIndex]) > 150;
},
rowTransform: (row: string[], getColumnIndex) => {
const nameIndex = getColumnIndex('car');
const favoriteIndex = getColumnIndex('favorite');

if (row[nameIndex].includes('Alfa Romeo')) {
row[favoriteIndex] = 'YES';
} else {
row[favoriteIndex] = 'NO';
}
},
asArray: false,
rowValuesAsArray: false
});
```

```csv
favorite;year;car
NO;2015;Audi S5
YES;2020;Alfa Romeo Giulia
NO;2017;BMW X3
```

**Optional columns**

Sometimes you can be in a situation where some if
your columns are desired, but they are not available in a table.
You can easily add an exception for them via `optionalColNames` property.

```typescript
import { tableParser } from 'puppeteer-table-parser'

await tableParser(page, {
selector: 'table',
allowedColNames: {
'Car Name': 'car',
'Rating': 'rating',
},
optionalColNames: ['rating']
});
```

**Grouping and Aggregating**
```typescript
import { tableParser } from 'puppeteer-table-parser'

await tableParser(page, {
selector: '#my-table',
allowedColNames: {
'Employee Name': 'name',
'Age': 'age',
},
groupBy: {
cols: ['name'],
handler: (rows: string[][], getColumnIndex) => {
const ageIndex = getColumnIndex('age');

// select one with the minimal age
return rows.reduce((previous, current) =>
previous[ageIndex] < current[ageIndex] ? previous : current,
);
},
}
});
```

For more, look at the `test` folder! 🙈