{"id":42566841,"url":"https://github.com/datopian/harvesterjs","last_synced_at":"2026-01-28T20:45:11.960Z","repository":{"id":312193740,"uuid":"1040218378","full_name":"datopian/harvesterjs","owner":"datopian","description":"Framework and scripts for harvesting datasets into PortalJS . Provides reusable pipelines, connectors, and utilities for extracting, transforming, and loading (ETL) data from diverse sources into your data portal. Designed for extensibility and automation, making it easy to bootstrap data portals at scale.","archived":false,"fork":false,"pushed_at":"2025-10-06T14:21:34.000Z","size":99,"stargazers_count":20,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-21T16:49:21.740Z","etag":null,"topics":["ckan","dataverse","dkan","harvest","open-data","opendatasoft","socrata"],"latest_commit_sha":null,"homepage":"https://www.portaljs.com/","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/datopian.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-08-18T16:25:23.000Z","updated_at":"2025-10-06T14:21:38.000Z","dependencies_parsed_at":"2025-08-29T06:42:31.234Z","dependency_job_id":"6d121d59-0605-4ac1-880d-16574d7d1f95","html_url":"https://github.com/datopian/harvesterjs","commit_stats":null,"previous_names":["datopian/harvesterjs"],"tags_count":0,"template":true,"template_full_name":null,"purl":"pkg:github/datopian/harvesterjs","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datopian%2Fharvesterjs",
"tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datopian%2Fharvesterjs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datopian%2Fharvesterjs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datopian%2Fharvesterjs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/datopian","download_url":"https://codeload.github.com/datopian/harvesterjs/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datopian%2Fharvesterjs/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28851249,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-28T15:15:36.453Z","status":"ssl_error","status_checked_at":"2026-01-28T15:15:13.020Z","response_time":57,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ckan","dataverse","dkan","harvest","open-data","opendatasoft","socrata"],"created_at":"2026-01-28T20:45:11.111Z","updated_at":"2026-01-28T20:45:11.942Z","avatar_url":"https://github.com/datopian.png","language":"TypeScript","readme":"# PortalJS Harvesters\n\n\u003cdiv align=\"center\"\u003e\n  \n**Extendable harvester framework built with TypeScript. 
Harvest data from a variety of sources into your PortalJS portal.**\n  \nWith out-of-the-box support for 🌀 [PortalJS Cloud](https://portaljs.com)\n\n\u003c/div\u003e\n\n---\n\n## Built-in Harvesters\n\nThe following sources are supported out of the box:\n\n- [CKAN](./src/harvesters/ckan.ts)\n- [DKAN](./src/harvesters/dkan.ts)\n- [Socrata Open Data](./src/harvesters/socrata.ts)\n- [OpenDataSoft (ODS)](./src/harvesters/ods.ts)\n- [Dataverse Repository](./src/harvesters/dataverse.ts)\n- [ArcGIS Hub/Portal](./src/harvesters/arcgis.ts)\n\n## Getting Started\n\nThis tool is intended as a template/starter framework rather than a standalone deployable app. The idea is that you use this repo as the base for your own harvesting tool, customizing it and integrating it into your workflow.\n\nTo get started, click on the \"Use This Template\" button in the top-right corner of this page and then click on \"Create a new repository\".\n\n## Running Harvesters\n\nYou can run this tool on any platform that supports Node, such as GitHub Actions.\n\n1. Install dependencies with `npm install`\n2. Set up the environment variables according to the [configuration](#configuration) section\n3. Run `npm run start`\n\nSee the [GitHub Action example](https://github.com/datopian/harvesterjs/blob/main/.github/workflows/run-harvester.yml).\n\n## Configuration\n\nThe following environment variables can be used to configure the tool:\n\n- `HARVESTER_NAME` - E.g., \"CkanHarvester\". Literally the name of the harvester class as defined in [./src/harvesters](./src/harvesters).\n- `SOURCE_API_URL` - E.g., \"http://ckan.com\". 
The source URL from which you want to harvest datasets.\n- `SOURCE_API_KEY` - (Optional) Used for authenticated requests when private data should be harvested.\n- `PORTALJS_CLOUD_API_URL` - (Optional) Defaults to https://api.cloud.portaljs.com/.\n- `PORTALJS_CLOUD_MAIN_ORG` - The name of your main organization in PortalJS Cloud.\n- `PORTALJS_CLOUD_API_KEY` - You can create PortalJS Cloud API keys in your PortalJS Cloud account profile.\n- `DRY_RUN` - (Optional) Whether data should be ingested or just logged. Either `true` or undefined.\n\nYou can set these environment variables either with a `.env` file or in the runner's environment.\n\n## Development\n\nFor development and testing harvesters locally:\n\n1. Clone this repo\n2. Install dependencies with `npm i`\n3. Duplicate [`.env.example`](./.env.example) and rename it to `.env`\n4. Customize the `.env` as you'd like (see [configuration](#configuration)) \n5. Start harvesting with `npm run start`\n\n\u003e [!TIP]\n\u003e Dry runs are supported via the `DRY_RUN=true` environment variable\n\n## Extending\n\nThis tool is built to be extendable by design. \n\nIt can be customized to harvest data from any source by extending either one of the [built-in harvesters](./src/harvesters) or the [base harvester](./src/harvesters/base.ts).\n\nOne common use case is harvesting data from a CKAN instance that uses a custom metadata schema. \n\nIn this case, you could simply create a new harvester extending the [CKAN harvester](./src/harvesters/ckan.ts) and override the source-to-target mapping, as shown in the example below.\n\n### Creating a Custom Harvester\n\n1. Create a new file in the `src/harvesters/` directory.\n2. Extend `BaseHarvester` (or any other pre-built harvester class) and decorate it with `@Harvester`.\n3. 
Implement overrides:\n   * `getSourceDatasets()` → Fetch and return all datasets from your source.\n   * `mapSourceDatasetToTarget()` → Convert source dataset schema into the PortalJS Cloud dataset schema.\n4. Set `HARVESTER_NAME=YourCustomHarvester` in `.env` and run. The name of your custom harvester is simply the name of the class that defines it.\n\nThe base harvester handles concurrency, rate limiting, retries, and upserts, but note that all of these can be freely overridden and customized.\n\n#### Example: Harvester for a CKAN instance with a custom dataset metadata schema\n\n```ts\nimport { CkanPackage } from \"@/schemas/ckanPackage\";\nimport { PortalJsCloudDataset } from \"@/schemas/portaljs-cloud\";\nimport { Harvester } from \".\";\nimport { BaseHarvesterConfig } from \"./base\";\nimport { CkanHarvester } from \"./ckan\";\nimport { env } from \"../../config\";\n\ntype CustomCkanPortalDataset = CkanPackage \u0026 {\n    data_owner_email: string;\n};\n\n@Harvester\nclass CustomCkanPortalHarvester extends CkanHarvester\u003cCustomCkanPortalDataset\u003e {\n  constructor(args: BaseHarvesterConfig) {\n    super(args);\n  }\n\n  mapSourceDatasetToTarget(pkg: CustomCkanPortalDataset): PortalJsCloudDataset {\n    const owner_org = env.PORTALJS_CLOUD_MAIN_ORG;\n    return {\n      owner_org,\n      name: `${owner_org}--${pkg.name}`,\n      title: pkg.title,\n      notes: pkg.notes || \"no description\",\n      resources: (pkg.resources || []).map((r: any) =\u003e ({\n        name: r.name,\n        url: r.url,\n        format: r.format,\n        ...(r.id ? 
{ id: r.id } : {}),\n      })),\n      language: pkg.language || \"EN\",\n      contact_point: pkg.data_owner_email // \u003c== Custom field to PortalJS Cloud mapping\n    };\n  }\n}\n\nexport { CustomCkanPortalHarvester };\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatopian%2Fharvesterjs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatopian%2Fharvesterjs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatopian%2Fharvesterjs/lists"}