https://github.com/tomas2d/puppeteer-table-parser

Scrape and parse HTML tables with the Puppeteer table parser.
https://github.com/tomas2d/puppeteer-table-parser

csv html javascript puppeteer puppeteer-tables scrape scraping table typescript

Last synced: 3 months ago
JSON representation

Scrape and parse HTML tables with the Puppeteer table parser.

Host: GitHub
URL: https://github.com/tomas2d/puppeteer-table-parser
Owner: Tomas2D
License: mit
Created: 2021-03-18T08:32:59.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2024-12-13T00:56:28.000Z (11 months ago)
Last Synced: 2024-12-18T19:20:25.791Z (11 months ago)
Topics: csv, html, javascript, puppeteer, puppeteer-tables, scrape, scraping, table, typescript
Language: TypeScript
Homepage:
Size: 1.69 MB
Stars: 22
Watchers: 3
Forks: 3
Open Issues: 4
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Funding: .github/FUNDING.yml
- License: LICENSE

Awesome Lists containing this project

README

          # 🕸 🕷 puppeteer-table-parser 

Library to make parsing website tables much easier! 

When you are using `puppeteer` for scrapping websites and web application, you will find out that parsing tables consistently is not that easy.

This library brings you abstraction between `puppeteer` and `page context`.

## This library solves the following issues:

- ✨ Parsing columns by their name.

- ✨ Respect the defined order of columns.

- ✨ Appending custom columns with custom data.

- ✨ Custom sanitization of data in cells.

- ✨ Group and Aggregate data by your own function.

- ✨ Merge data from two independent tables into one structure.

- ✨ Handles invalid HTML structure.

- ✨ Retrieve results as CSV or array of plain JS objects.

- ✨ And much more!

## Installation

```shell

yarn add puppeteer-table-parser

```

```shell

npm install puppeteer-table-parser

```

```typescript

// CommonJS

const { tableParser } = require('puppeteer-table-parser')

// ESM / Typescript

import { tableParser } from 'puppeteer-table-parser'

```

## API

```typescript

interface ParserSettings {

  selector: string; // CSS selector

  allowedColNames: Record; // key = input name, value = output name)

  headerRowsSelector?: string | null; // (default: 'thead tr', null ignores table's header selection)

  headerRowsCellSelector?: string; // (default: 'td,th')

  bodyRowsSelector?: string;  // (default: 'tbody tr')

  bodyRowsCellSelector?: string;  // (default: 'td')

  reverseTraversal?: boolean // (default: false)

  temporaryColNames?: string[]; // (default: []) 

  extraCols?: ExtraCol[]; // (default: [])

  withHeader?: boolean; // (default: true)

  csvSeparator?: string; // (default: ';')

  newLine?: string; // (default: '\n')

  rowValidationPolicy?: RowValidationPolicy; // (default: 'NON_EMPTY')

  groupBy?: {

    cols: string[];

    handler?: (rows: string[][], getColumnIndex: GetColumnIndexType) => string[];

  }

  rowValidator: (

    row: string[],

    getColumnIndex: GetColumnIndexType,

    rowIndex: number,

    rows: Readonly,

  ) => boolean;

  rowTransform?: (row: string[], getColumnIndex: GetColumnIndexType) => void;

  asArray?: boolean; // (default: false)

  rowValuesAsArray?: boolean; // (default: false)

  rowValuesAsObject?: boolean; // (default: false)

  colFilter?: (elText: string[], index: number) => string; // (default: (txt: string) => txt.join(' '))

  colParser?: (value: string, formattedIndex: number, getColumnIndex: GetColumnIndexType) => string; // (default: (txt: string) => txt.trim())

  optionalColNames?: string[]; // (default: [])

};

```

## Parsing workflow

1. Find table(s) by provided CSS selector.

2. Find associated columns by applying `colFilter` on their text and verify their count.

3. Filter rows based on `rowValidationPolicy`

4. Add extra columns specified in `extraCols` property in settings.

5. Run `rowValidator` function for every table row.

6. Run `colParser` for every cell in a row.

7. Run `rowTransform` function for each row.

8. Group results into buckets (`groupBy.cols`) property and pick the aggregated rows.

9. Add processed row to a temp array result. 

10. Add `header` column if `withHeader` property is `true`.

11. Merge partial results and return them.

## Examples

> All data came from the HTML page, which you can find in `test/assets/1.html`.

**Basic example** (the simple table where we want to parse three columns without editing)

```typescript

import { tableParser } from 'puppeteer-table-parser'

await tableParser(page, {

  selector: 'table',

  allowedColNames: {

    'Car Name': 'car',

    'Horse Powers': 'hp',

    'Manufacture Year': 'year',

  },

});

```

```csv

car;hp;year

Audi S5;332;2015

Alfa Romeo Giulia;500;2020

BMW X3;215;2017

Skoda Octavia;120;2012

```

**Basic example** with custom column name parsing:

```typescript

import { tableParser } from 'puppeteer-table-parser'

await tableParser(page, {

  selector: 'table',

  colFilter: (value: string[]) => {

    return value.join(' ').replace(' ', '-').toLowerCase();

  },

  colParser: (value: string) => {

    return value.trim();

  },

  allowedColNames: {

    'car-name': 'car',

    'horse-powers': 'hp',

    'manufacture-year': 'year',

  },

})

```

```csv

car;hp;year

Audi S5;332;2015

Alfa Romeo Giulia;500;2020

BMW X3;215;2017

Skoda Octavia;120;2012

```

**Basic example** with row validation and using temporary column.

```typescript

import { tableParser } from 'puppeteer-table-parser'

await tableParser(page, {

  selector: 'table',

  allowedColNames: {

    'Car Name': 'car',

    'Manufacture Year': 'year',

    'Horse Powers': 'hp',

  },

  temporaryColNames: ['Horse Powers'],

  rowValidator: (row: string[], getColumnIndex) => {

    const powerIndex = getColumnIndex('hp');

    return Number(row[powerIndex]) < 250;

  },

});

```

```csv

car;year

BMW X3;2017

Skoda Octavia;2012

```

**Advanced example:**

Uses custom temporary column for filtering. It uses an extra column with custom 

position to be filled on a fly.

```typescript

import { tableParser } from 'puppeteer-table-parser'

await tableParser(page, {

  selector: 'table',

  allowedColNames: {

    'Manufacture Year': 'year',

    'Horse Powers': 'hp',

    'Car Name': 'car',

  },

  temporaryColNames: ['Horse Powers'],

  extraCols: [

    {

      colName: 'favorite',

      data: '',

      position: 0,

    },

  ],

  rowValidator: (row: string[], getColumnIndex) => {

    const horsePowerIndex = getColumnIndex('hp');

    return Number(row[horsePowerIndex]) > 150;

  },

  rowTransform: (row: string[], getColumnIndex) => {

    const nameIndex = getColumnIndex('car');

    const favoriteIndex = getColumnIndex('favorite');

    if (row[nameIndex].includes('Alfa Romeo')) {

      row[favoriteIndex] = 'YES';

    } else {

      row[favoriteIndex] = 'NO';

    }

  },

  asArray: false,

  rowValuesAsArray: false

});

```

```csv

favorite;year;car

NO;2015;Audi S5

YES;2020;Alfa Romeo Giulia

NO;2017;BMW X3

```

**Optional columns**

Sometimes you can be in a situation where some if

your columns are desired, but they are not available in a table.

You can easily add an exception for them via `optionalColNames` property.

```typescript

import { tableParser } from 'puppeteer-table-parser'

await tableParser(page, {

  selector: 'table',

  allowedColNames: {

    'Car Name': 'car',

    'Rating': 'rating',

  },

  optionalColNames: ['rating']

});

```

**Grouping and Aggregating**

```typescript

import { tableParser } from 'puppeteer-table-parser'

await tableParser(page, {

  selector: '#my-table',

  allowedColNames: {

    'Employee Name': 'name',

    'Age': 'age',

  },

  groupBy: {

    cols: ['name'],

    handler: (rows: string[][], getColumnIndex) => {

      const ageIndex = getColumnIndex('age');

      // select one with the minimal age

      return rows.reduce((previous, current) =>

        previous[ageIndex] < current[ageIndex] ? previous : current,

      );

    },

  }

});

```

For more, look at the `test` folder! 🙈

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/tomas2d/puppeteer-table-parser

Awesome Lists containing this project

README