{"id":19162751,"url":"https://github.com/centre-for-humanities-computing/web-extractor","last_synced_at":"2026-02-14T03:18:59.864Z","repository":{"id":42937253,"uuid":"235980246","full_name":"centre-for-humanities-computing/web-extractor","owner":"centre-for-humanities-computing","description":"A tool for extracting DOM content and taking screenshots of websites","archived":false,"fork":false,"pushed_at":"2024-06-19T20:15:11.000Z","size":331,"stargazers_count":1,"open_issues_count":5,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-10-03T13:35:51.196Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/centre-for-humanities-computing.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-01-24T10:16:35.000Z","updated_at":"2025-09-12T08:07:26.000Z","dependencies_parsed_at":"2024-10-31T04:01:35.369Z","dependency_job_id":"86a2f43b-91ae-4bbd-9d67-839e188e26b5","html_url":"https://github.com/centre-for-humanities-computing/web-extractor","commit_stats":{"total_commits":103,"total_committers":4,"mean_commits":25.75,"dds":"0.27184466019417475","last_synced_commit":"feff5ce944b490e462f4537db2e5481183f74ac5"},"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"purl":"pkg:github/centre-for-humanities-computing/web-extractor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/centre-for-humanities-computing%2Fweb-extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/centre-for-humanities-computing%2Fweb-extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/centre-for-humanities-computing%2Fweb-extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/centre-for-humanities-computing%2Fweb-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/centre-for-humanities-computing","download_url":"https://codeload.github.com/centre-for-humanities-computing/web-extractor/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/centre-for-humanities-computing%2Fweb-extractor/sbom","scorecard":{"id":270987,"data":{"date":"2025-08-11","repo":{"name":"github.com/centre-for-humanities-computing/web-extractor","commit":"feff5ce944b490e462f4537db2e5481183f74ac5"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":2,"checks":[{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no pull requests merged into dev branch"],"documentation":{"short":"Determines if the project uses static code 
analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared 
and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"License","score":0,"reason":"license file not detected","details":["Warn: project does not have a license file"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection 
settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Vulnerabilities","score":5,"reason":"5 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: GHSA-v6h2-p8h4-qcjw","Warn: Project is vulnerable to: GHSA-3xgq-45jj-v275","Warn: Project is vulnerable to: GHSA-pq67-2wwv-3xjx","Warn: Project is vulnerable to: GHSA-8cj5-5rvv-wf4v","Warn: Project is vulnerable to: GHSA-3h5v-q93c-6h6q"],"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}}]},"last_synced_at":"2025-08-17T13:20:37.347Z","repository_id":42937253,"created_at":"2025-08-17T13:20:37.347Z","updated_at":"2025-08-17T13:20:37.347Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29433304,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-14T02:20:56.896Z","status":"ssl_error","status_checked_at":"2026-02-14T02:11:29.478Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-09T09:13:03.741Z","updated_at":"2026-02-14T03:18:59.827Z","avatar_url":"https://github.com/centre-for-humanities-computing.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Web 
Extractor\nA tool for extracting DOM content and taking screenshots of web pages. \n\nProvided a list of urls and a set of extraction rules, Web Extractor loads each url \nand tests each rule against the page until a rule succeeds or there are no more rules. If a rule \nsucceeds, the data described in the rule's extract method is exported.\n\nWeb Extractor can be used as a CLI program or as an npm module. \n\n## CLI Installation\n- Install [node.js](https://nodejs.org/en/download/) version 16.x or higher\n- Clone this repository\n- Navigate to the root of the repository and run\n```\nnpm install\n```\n\n\n### Usage\n- Navigate to the root of the repository and run \n```\nnode extract -h \n```\n\n### CLI options\n- **`-u, --urls \u003cfile\u003e`** [required] - A path to a file with a list of urls for extraction. Each url in the file should be on its own line\n- **`-d, --destination \u003cdirectory\u003e`** [required] - A path to the dir where data should be saved. If the dir already contains previously collected data, the new data will be appended to the existing files\n- **`-r, --rules \u003cdirectory\u003e`** [optional] - A path to the dir where extraction rules are located. 
If not set, the \"rules\" folder in the project is used by default\n- **`-c, --concurrency \u003cinteger\u003e`** [optional, default=15] - The maximum number of simultaneously loaded pages\n- **`-n, --no-screenshot`** [optional] - Disable screenshots\n- **`-t, --page-timeout \u003cinteger\u003e`** [optional, default=90000] - Milliseconds to wait for the initial loading of a page\n- **`-h, --headless`** [optional, default=true] - Run the browser in headless mode\n- **`-i, --use-id-for-screenshot-name`** [optional] - Use a universally unique id for screenshot names instead of the url\n- **`-x, --debug`** [optional] - Print more detailed error information\n\n\u003e**NOTE** If `cpm-data.json` contains many results with a `requestStrategy` equal to `domContentLoaded` or `errors.json` \n\u003e contains many `TimeoutError` errors, try lowering concurrency or increasing page-timeout. \n\n### Full Example\n```\n$ node extract -u /data/urls.txt -d /data/web-extract -c 35 -t 90000\n```\n*Analyze each url in '/data/urls.txt' and save the results in '/data/web-extract'. 
\nLoad a maximum of 35 simultaneous pages and wait a maximum of 90000ms for each page to load.*\n\n## NPM Module Installation\nRun:\n```\nnpm install @chcaa/web-extractor\n```\n\n### Usage\nTo get started with some simple extractions, create a simple rule (see [Extraction Rules](#extraction-rules)) \nand do the following:\n```\nimport { WebExtractor } from '@chcaa/web-extractor';\n\nasync function run() {\n    let urls = ['https://www.dr.dk', 'tv2.dk'];\n\n    let rule = {\n        extractor: {\n            extract: function() {\n                return document.querySelector('h1');\n            }\n        }\n    };\n\n    let destDir = '/temp/data/web-extractor';\n\n    let extractor = new WebExtractor(urls, rule, destDir);\n\n    await extractor.execute();\n}\n\nrun();\n```\n\nWebExtractor uses [puppeteer-extra](https://www.npmjs.com/package/puppeteer-extra), so it is possible to add plugins using the `options.configurePuppeteer` function.\n```\nimport { WebExtractor } from '@chcaa/web-extractor';\nimport StealthPlugin from 'puppeteer-extra-plugin-stealth';\n\nasync function run() {\n    let urls = ['https://www.dr.dk', 'tv2.dk'];\n\n    let rule = {...};\n    let destDir = '/temp/data/web-extractor';\n    \n    let options = {\n        configurePuppeteer(puppeteer) {\n            puppeteer.use(StealthPlugin());\n        }\n    };\n\n    let extractor = new WebExtractor(urls, rule, destDir, options);\n\n    await extractor.execute();\n}\n\nrun();\n```\n\n\n### WebExtractor\nThe following methods and properties are available:\n\n##### constructor(urls, rules, destDir, [options])\n- `urls` - an array of urls or a path to a file with urls (one url per line). If further input is needed along with\nthe url, the url can be an object with a property named `url`; the object will then be passed to the relevant methods of\nthe extraction rules (see [Creating Rules](#creating-rules)). 
If the urls are located in a file, each line can be\na string in JSON format.\n- `rules` - the dir where the extraction rules are located or a rule object or an array of rule objects \n(see [Creating Rules](#creating-rules))  \n- `destDir` - the dir where the extracted data, screenshots and logs should be saved \n- `options` - additional options in the format\n\n  - ```\n    {\n        useIdForScreenshotName: {boolean} default false,\n        maxConcurrency: {integer} default 15,\n        pageTimeoutMs: {integer} default 90000,\n        headless: {boolean} default true,\n        userAgent: {string} default \n        waitUntil: {string} default 'load' // one of [load|domcontentloaded|networkidle0|networkidle2]\n        output: {\n            screenshot: {boolean} default true,\n            logs: {boolean} default true,\n            data: {boolean} default true\n        },\n        ruleInitOptions: {}, // options which should be passed to rules init() method\n        printProgression: {boolean} default false,\n        configurePuppeteer: {function(puppeteer)} default undefined // a function to further configure puppeteer\n    }\n    ```  \n##### execute([progressionListener]) \\\u003casync\u003e\n- `progressionListener` - a function which will be notified on progression during the extraction\n\nReturns: `Promise\u003cundefined\u003e` - resolves when extraction completes, or rejects if an unhandled error occurred\n\n##### errors \\\u003cstatic\u003e\nAn object with the internal custom errors thrown by Web Extractor.  Can be used in rules that need to \nthrow the same kinds of errors, e.g. for consistency in the error log.\n\n##### debug([enable]) \\\u003cstatic\u003e\n- `enable` - a boolean enabling or disabling debug information to the console. \n    \n## Results\nIf the destination path does not contain data from a previous extraction session, a new directory will be created with the \nname \"data-{DATE-TIME}\". 
The created directory has the following structure:\n\n- data.json\n- no-rule-match-urls.txt\n- errors.json\n- ./screenshots\n\n**data.json** contains the extracted data for each url where a matching rule could be found. Each line is a \nself-contained JSON object, which makes it easy to parse large files line by line.\n\n**no-rule-match-urls.txt** contains a list of urls which did not match any of the rules.\n\n**errors.json** contains a JSON object for each error that occurred during the extraction process.\n\n**screenshots** contains one or more screenshots for each url where a rule matched.\n\n### Resuming a Previous Extraction\nIf a path to a directory containing previously extracted data is passed in, Web Extractor will\nadd to the existing files and screenshot directory instead of creating a new directory. \n\n### Screenshots\nIf screenshots are enabled (default), a screenshot will be taken of every page which can be reached.\nSo even if no rule matches or there are no rules at all, a screenshot is still taken. \n\nFor further control, each rule can specify whether additional screenshots should be taken (see [extractor.beforeExtract](#extractorbeforeextractpage-async)).\n\n## Extraction Rules\nRules define what should be extracted from a given web page as well as when the extraction should take place.\n\nRules are tested one by one in alphabetical order until a match is found. \nIn the event of a match the result is saved and any remaining rules are skipped.\n\nIf no `rules` path is passed in, the rules (if any) located in the `rules` directory of the project will be used.\n\n### Disabling and Deleting Rules\nA rule can be temporarily disabled by prefixing the file name with a double underscore, e.g. `__cookiebot.js`. 
To completely\nremove a rule simply delete the file.\n\n### Creating Rules\nEach rule is a node.js module with the following\nstructure:\n\n```\nexport default {\n    name: 'name of the rule', // required\n\n    init: async function(options) {}, // optional\n    dataTemplate: function() {}, // optional\n    extractorOptions: function() {}, // optional\n    \n    extractor: {\n        beforeExtract: async function(page, url, options) {}, // optional \n        extract: function(template, url, options) {}, // optional\n        extractPuppeteer: async function(page, template, url, options) {}, // optional\n        afterExtract: async function(data, url, options) {} // optional\n     }\n    \n};\n```\n##### name \\\u003cstring\u003e\n\nThe name of the rule or some other name identifying the extracted data. If only one rule is used the name can be omitted.\n\n##### init(options) \\\u003casync\u003e\n- `options` - an object with relevant config data for the extractor (see the [WebExtractor](#webextractor) constructor's `options.ruleInitOptions`)\n \n\n\nReturns: `Promise\u003cundefined\u003e`\n\nIf defined, this method will be called before each new url extraction begins. \n\nCan be used for initializing the rule or other preparations which should take place before a \nrule is processed.\n\n\u003e **Info Rule State** It is safe to store state in a rule using `this.xxx = yyy` from when `init(options)` is called\n\u003e and throughout the analysis, as each rule for each url will have its own context-object with the rule as prototype.\n\n##### dataTemplate()\n\nReturns: `Object` - a template object to use as a starting point in the first extractor for the rule\n\nIf the function is defined it must return a JSON-compliant object which will be passed into the first\nextractor of the rule (see below). 
For each url the rule is tested against a new clone of the template\nobject, so it is safe to modify the template object in the `extract` method.\n\n##### extractorOptions()\n\nReturns: `*` - options to pass to each method in the defined extractor(s). The returned value must be JSON-serializable.\n\n##### extractor \\\u003cobject | array\u003e\n\nThe extractor object should have at least one of the following methods:\n\n##### extractor.beforeExtract(page, [url], [options]) \\\u003casync\u003e\n\n- `page` - an instance of a [puppeteer page](https://github.com/puppeteer/puppeteer/blob/v2.1.0/docs/api.md#class-page)\n- `url` - the url or url object currently being processed\n- `options` - the options returned from [extractorOptions](#extractoroptions)\n\nReturns: `Promise\u003cobject | undefined\u003e` - a control object for what to do when beforeExtract succeeds or `undefined` (default) if the normal order of execution should be followed\n\nThe returned object can have the following options:\n- `screenshot` - take a screenshot\n- `nextExctractorIndex` - continue with the extractor at the given index. Makes selection and iteration possible\n\n```\n{\n    screenshot: {boolean}, // optional\n    nextExctractorIndex: {integer} // optional\n}\n```\n\nThe extraction engine will wait for this method to complete before running the `extract` method.\n\nIn the case of a `puppeteer.errors.TimeoutError` the next rule (if present) will be tested. 
\nAll other errors will be treated as actual errors: the remaining rules will be aborted and the error logged.\n\nTo wait for a given DOM-element to become present and then take a screenshot you could do:\n\n```\nextractor: { \n    beforeExtract: async function(page) {\n        await page.waitFor('#my-element', {timeout: 5000});\n        return {\n            screenshot: true\n        };\n    },\n    extract() {...}\n}\n```\n\nWhen the `extractor` is made up of multiple extractors (see [Multiple Extractors](#multiple-extractors)) it can be desirable to choose which\nextractor to run next depending on a condition in `beforeExtract`. This can be controlled by returning an object with the \nproperty `nextExctractorIndex` which determines the next extractor to run. E.g.\n\n```\nextractor: {\n    beforeExtract: async function(page) {\n        try {\n            await page.waitFor('#my-element', {timeout: 5000});\n            return { nextExctractorIndex: 2 };\n        } catch(e) {\n            return { nextExctractorIndex: 3 };\n        }\n    }\n}\n```\n\n##### extractor.extract([template], [url], [options])\n\n- `template` - a clone of the template object returned by `dataTemplate()` or `undefined` if `dataTemplate()` is not defined\n- `url` - the url or url object currently being processed\n- `options` - the options returned from [extractorOptions](#extractoroptions)\n\nReturns: the extraction result or one of: `null`, `undefined`, `[]` or `{}` if no result was found\n\nThe `extract` method is executed in the context of the page so you have access to `document`, `window` etc.\n\nTo extract all paragraph text from a page you could do: \n\n```\nextractor: {\n    extract: function() {\n        let results = [];\n        // querySelectorAll returns every matching element (an empty list if none)\n        let paragraphs = document.querySelectorAll('p');\n        for (let p of paragraphs) {\n            results.push(p.textContent);\n        }\n        return results;\n    }\n}\n```\n\n##### extractor.extractPuppeteer(page, [template], [url], [options]) 
\\\u003casync\u003e\n\n- `page` - a puppeteer [`page`](https://github.com/puppeteer/puppeteer/blob/master/docs/api.md#class-page) instance. \n- `template` - a clone of the template object returned by `dataTemplate()` or `undefined` if `dataTemplate()` is not defined\n- `url` - the url or url object currently being processed\n- `options` - the options returned from [extractorOptions](#extractoroptions)\n\nReturns: `Promise\u003c?\u003e` - the extraction result or one of: `null`, `undefined`, `[]` or `{}` if no result was found\n\nUse this method if you need access to the `page` object and want to use its methods for extraction.\n\nIf both `extractor.extractPuppeteer` and `extractor.extract` are present, `extractor.extractPuppeteer` will\nbe executed first.\n\nTo extract all paragraph text from a page you could do: \n\n```\nextractor: {\n    extractPuppeteer: async function(page) {\n        // $$eval runs the callback in the page with all elements matching the selector\n        let results = await page.$$eval('p', (elements) =\u003e elements.map((element) =\u003e element.textContent));\n        return results;\n    }\n}\n```\n\n##### extractor.afterExtract(data, [url], [options]) \\\u003casync\u003e\n\n- `data` - the extracted data from [extractor.extract](#extractorextracttemplate)\n- `url` - the url or url object currently being processed\n- `options` - the options returned from [extractorOptions](#extractoroptions)\n\nReturns: `Promise\u003cdata | undefined\u003e` - the processed version of the passed-in data, or `undefined` \nif you will handle saving the data yourself \n\nThis method will only be called on a successful return value from [extractor.extract](#extractorextracttemplate)\n\nUse this method to post-process the extracted data or to handle further saving of data yourself. \nIf `undefined` is returned, the extractor chain will be broken and no further extractors will be called.\n\n#### Multiple Extractors\nSometimes it is required to click buttons, wait for events to happen etc. to complete an extraction. 
The extractor can\nthen be divided into multiple sections by providing an array of objects with one or both of `beforeExtract` and `extract`. \nEach extractor is passed the return value from the previous extractor, so you can add to it to get a combined result \n(if present, the `template` will be passed to the first object's `extract` method).\n\nIf an extractor returns one of `null`, `undefined`, `[]` or `{}` the extraction chain will be aborted and no result will\nbe saved for this rule. If there are more rules the next rule in line will be tested.\n\nTo wait for an element to appear, extract some text, \nclick next and extract some more text you could do:\n\n```\nextractor: [\n    {\n        beforeExtract: async function(page) {\n            await page.waitFor('.popup');\n        },\n        extract: function() {\n            let data = {\n                popupPage1: document.querySelector('.popup-text').textContent\n            };\n            return data;\n        }\n    }, {\n        beforeExtract: async function(page) {\n            await page.click('.popup .next-button');\n            await page.waitFor('.popup .page2');\n        },\n        extract: function(data) { // data is the object returned in the extractor above\n            data['popupPage2'] = document.querySelector('.popup-text').textContent;\n            return data;\n        }\n    }\n\n]\n```\n\nAll this could also have been done in the `extract` method alone using the DOM, events, mutation-observers etc., \nso the above is only an easier way to do it. \n \n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcentre-for-humanities-computing%2Fweb-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcentre-for-humanities-computing%2Fweb-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcentre-for-humanities-computing%2Fweb-extractor/lists"}