{"id":22612114,"url":"https://github.com/tomas2d/puppeteer-table-parser","last_synced_at":"2025-08-22T03:32:04.023Z","repository":{"id":42528986,"uuid":"348997179","full_name":"Tomas2D/puppeteer-table-parser","owner":"Tomas2D","description":"Scrape and parse HTML tables with the Puppeteer table parser.","archived":false,"fork":false,"pushed_at":"2024-12-13T00:56:28.000Z","size":1768,"stargazers_count":22,"open_issues_count":4,"forks_count":3,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-12-18T19:20:25.791Z","etag":null,"topics":["csv","html","javascript","puppeteer","puppeteer-tables","scrape","scraping","table","typescript"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Tomas2D.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null},"funding":{"github":["Tomas2D"]}},"created_at":"2021-03-18T08:32:59.000Z","updated_at":"2024-12-06T09:00:45.000Z","dependencies_parsed_at":"2023-02-19T16:46:03.735Z","dependency_job_id":"d7db97ef-170a-44d4-8771-bb841a2f150d","html_url":"https://github.com/Tomas2D/puppeteer-table-parser","commit_stats":{"total_commits":209,"total_committers":2,"mean_commits":104.5,"dds":0.416267942583732,"last_synced_commit":"40cc526ed95cdd098d0212e8e9a35c5a3fcaa8c2"},"previous_names":[],"tags_count":27,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tomas2D%2Fpuppeteer-table-parser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tomas2D%2Fpuppeteer-table-parser/tags
","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tomas2D%2Fpuppeteer-table-parser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tomas2D%2Fpuppeteer-table-parser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Tomas2D","download_url":"https://codeload.github.com/Tomas2D/puppeteer-table-parser/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230554330,"owners_count":18244234,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["csv","html","javascript","puppeteer","puppeteer-tables","scrape","scraping","table","typescript"],"created_at":"2024-12-08T17:10:00.913Z","updated_at":"2024-12-20T08:08:47.685Z","avatar_url":"https://github.com/Tomas2D.png","language":"TypeScript","funding_links":["https://github.com/sponsors/Tomas2D"],"categories":[],"sub_categories":[],"readme":"# 🕸 🕷 puppeteer-table-parser \n\nLibrary to make parsing website tables much easier! 
\nWhen you are using `puppeteer` for scraping websites and web applications, you will find out that parsing tables consistently is not that easy.\nThis library brings you an abstraction between `puppeteer` and `page context`.\n\n## This library solves the following issues:\n\n- ✨ Parsing columns by their name.\n- ✨ Respect the defined order of columns.\n- ✨ Appending custom columns with custom data.\n- ✨ Custom sanitization of data in cells.\n- ✨ Group and Aggregate data by your own function.\n- ✨ Merge data from two independent tables into one structure.\n- ✨ Handles invalid HTML structure.\n- ✨ Retrieve results as CSV or array of plain JS objects.\n- ✨ And much more!\n\n## Installation\n\n```shell\nyarn add puppeteer-table-parser\n```\n```shell\nnpm install puppeteer-table-parser\n```\n\n```typescript\n// CommonJS\nconst { tableParser } = require('puppeteer-table-parser')\n\n// ESM / TypeScript\nimport { tableParser } from 'puppeteer-table-parser'\n```\n\n## API\n\n```typescript\ninterface ParserSettings {\n  selector: string; // CSS selector\n  allowedColNames: Record\u003cstring, string\u003e; // (key = input name, value = output name)\n\n  headerRowsSelector?: string | null; // (default: 'thead tr', null ignores table's header selection)\n  headerRowsCellSelector?: string; // (default: 'td,th')\n  bodyRowsSelector?: string;  // (default: 'tbody tr')\n  bodyRowsCellSelector?: string;  // (default: 'td')\n  reverseTraversal?: boolean // (default: false)\n  temporaryColNames?: string[]; // (default: []) \n  extraCols?: ExtraCol[]; // (default: [])\n  withHeader?: boolean; // (default: true)\n  csvSeparator?: string; // (default: ';')\n  newLine?: string; // (default: '\\n')\n  rowValidationPolicy?: RowValidationPolicy; // (default: 'NON_EMPTY')\n  groupBy?: {\n    cols: string[];\n    handler?: (rows: string[][], getColumnIndex: GetColumnIndexType) =\u003e string[];\n  }\n  rowValidator: (\n    row: string[],\n    getColumnIndex: GetColumnIndexType,\n    rowIndex: 
number,\n    rows: Readonly\u003cstring[][]\u003e,\n  ) =\u003e boolean;\n  rowTransform?: (row: string[], getColumnIndex: GetColumnIndexType) =\u003e void;\n  asArray?: boolean; // (default: false)\n  rowValuesAsArray?: boolean; // (default: false)\n  rowValuesAsObject?: boolean; // (default: false)\n  colFilter?: (elText: string[], index: number) =\u003e string; // (default: (txt: string[]) =\u003e txt.join(' '))\n  colParser?: (value: string, formattedIndex: number, getColumnIndex: GetColumnIndexType) =\u003e string; // (default: (txt: string) =\u003e txt.trim())\n  optionalColNames?: string[]; // (default: [])\n};\n```\n\n## Parsing workflow\n\n1. Find table(s) by the provided CSS selector.\n2. Find associated columns by applying `colFilter` on their text and verify their count.\n3. Filter rows based on `rowValidationPolicy`.\n4. Add extra columns specified in the `extraCols` property in settings.\n5. Run `rowValidator` function for every table row.\n6. Run `colParser` for every cell in a row.\n7. Run `rowTransform` function for each row.\n8. Group results into buckets by the `groupBy.cols` property and pick the aggregated rows.\n9. Add the processed row to a temporary result array.\n10. Add `header` column if `withHeader` property is `true`.\n11. 
Merge partial results and return them.\n\n## Examples\n\n\u003e All data came from the HTML page, which you can find in `test/assets/1.html`.\n\n**Basic example** (the simple table where we want to parse three columns without editing)\n\n```typescript\nimport { tableParser } from 'puppeteer-table-parser'\n\nawait tableParser(page, {\n  selector: 'table',\n  allowedColNames: {\n    'Car Name': 'car',\n    'Horse Powers': 'hp',\n    'Manufacture Year': 'year',\n  },\n});\n```\n\n```csv\ncar;hp;year\nAudi S5;332;2015\nAlfa Romeo Giulia;500;2020\nBMW X3;215;2017\nSkoda Octavia;120;2012\n```\n\n**Basic example** with custom column name parsing:\n\n```typescript\nimport { tableParser } from 'puppeteer-table-parser'\n\nawait tableParser(page, {\n  selector: 'table',\n  colFilter: (value: string[]) =\u003e {\n    return value.join(' ').replace(' ', '-').toLowerCase();\n  },\n  colParser: (value: string) =\u003e {\n    return value.trim();\n  },\n  allowedColNames: {\n    'car-name': 'car',\n    'horse-powers': 'hp',\n    'manufacture-year': 'year',\n  },\n})\n```\n\n```csv\ncar;hp;year\nAudi S5;332;2015\nAlfa Romeo Giulia;500;2020\nBMW X3;215;2017\nSkoda Octavia;120;2012\n```\n\n**Basic example** with row validation and using temporary column.\n\n```typescript\nimport { tableParser } from 'puppeteer-table-parser'\n\nawait tableParser(page, {\n  selector: 'table',\n  allowedColNames: {\n    'Car Name': 'car',\n    'Manufacture Year': 'year',\n    'Horse Powers': 'hp',\n  },\n  temporaryColNames: ['Horse Powers'],\n  rowValidator: (row: string[], getColumnIndex) =\u003e {\n    const powerIndex = getColumnIndex('hp');\n    return Number(row[powerIndex]) \u003c 250;\n  },\n});\n```\n\n```csv\ncar;year\nBMW X3;2017\nSkoda Octavia;2012\n```\n\n**Advanced example:**\n\nUses custom temporary column for filtering. 
It uses an extra column with custom \nposition to be filled on the fly.\n\n```typescript\nimport { tableParser } from 'puppeteer-table-parser'\n\nawait tableParser(page, {\n  selector: 'table',\n  allowedColNames: {\n    'Manufacture Year': 'year',\n    'Horse Powers': 'hp',\n    'Car Name': 'car',\n  },\n  temporaryColNames: ['Horse Powers'],\n  extraCols: [\n    {\n      colName: 'favorite',\n      data: '',\n      position: 0,\n    },\n  ],\n  rowValidator: (row: string[], getColumnIndex) =\u003e {\n    const horsePowerIndex = getColumnIndex('hp');\n    return Number(row[horsePowerIndex]) \u003e 150;\n  },\n  rowTransform: (row: string[], getColumnIndex) =\u003e {\n    const nameIndex = getColumnIndex('car');\n    const favoriteIndex = getColumnIndex('favorite');\n\n    if (row[nameIndex].includes('Alfa Romeo')) {\n      row[favoriteIndex] = 'YES';\n    } else {\n      row[favoriteIndex] = 'NO';\n    }\n  },\n  asArray: false,\n  rowValuesAsArray: false\n});\n```\n\n```csv\nfavorite;year;car\nNO;2015;Audi S5\nYES;2020;Alfa Romeo Giulia\nNO;2017;BMW X3\n```\n\n**Optional columns**\n\nSometimes you can be in a situation where some of\nyour columns are desired, but they are not available in a table.\nYou can easily add an exception for them via the `optionalColNames` property.\n\n```typescript\nimport { tableParser } from 'puppeteer-table-parser'\n\nawait tableParser(page, {\n  selector: 'table',\n  allowedColNames: {\n    'Car Name': 'car',\n    'Rating': 'rating',\n  },\n  optionalColNames: ['rating']\n});\n```\n\n**Grouping and Aggregating**\n```typescript\nimport { tableParser } from 'puppeteer-table-parser'\n\nawait tableParser(page, {\n  selector: '#my-table',\n  allowedColNames: {\n    'Employee Name': 'name',\n    'Age': 'age',\n  },\n  groupBy: {\n    cols: ['name'],\n    handler: (rows: string[][], getColumnIndex) =\u003e {\n      const ageIndex = getColumnIndex('age');\n\n      // select one with the minimal age\n      return rows.reduce((previous, current) 
=\u003e\n        Number(previous[ageIndex]) \u003c Number(current[ageIndex]) ? previous : current,\n      );\n    },\n  }\n});\n```\n\nFor more, look at the `test` folder! 🙈\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftomas2d%2Fpuppeteer-table-parser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftomas2d%2Fpuppeteer-table-parser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftomas2d%2Fpuppeteer-table-parser/lists"}