{"id":17458147,"url":"https://github.com/pocesar/apify-query-dataset","last_synced_at":"2026-02-10T01:33:53.157Z","repository":{"id":41854346,"uuid":"249091538","full_name":"pocesar/apify-query-dataset","owner":"pocesar","description":"Use MongoDB query language style to search and generate a new subset of your datasets","archived":false,"fork":false,"pushed_at":"2023-03-04T07:29:00.000Z","size":618,"stargazers_count":2,"open_issues_count":2,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-10-18T06:28:33.162Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pocesar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-03-22T01:24:35.000Z","updated_at":"2023-03-04T17:53:31.000Z","dependencies_parsed_at":"2024-10-20T19:23:22.179Z","dependency_job_id":null,"html_url":"https://github.com/pocesar/apify-query-dataset","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/pocesar/apify-query-dataset","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pocesar%2Fapify-query-dataset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pocesar%2Fapify-query-dataset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pocesar%2Fapify-query-dataset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pocesar%2Fapify-query-dataset/manifests","owner_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pocesar","download_url":"https://codeload.github.com/pocesar/apify-query-dataset/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pocesar%2Fapify-query-dataset/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":268470799,"owners_count":24255391,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-02T02:00:12.353Z","response_time":74,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-18T03:55:28.652Z","updated_at":"2026-02-10T01:33:48.120Z","avatar_url":"https://github.com/pocesar.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Query Dataset\n\nQuery your existing datasets, map and generate a subset of your data\n\nUses MongoDB-like query style, for extended documentation, check [MongoDB query documentation](http://docs.mongodb.org/manual/reference/operator/query/)\n\nIt uses [sift](https://www.npmjs.com/package/sift) module for matching, means you can use as query input.\n\n## Example\n\nTake this dataset, for example:\n\n```json\n[\n    {\n        \"name\": \"Name 1\",\n        \"anotherValue\": 1\n    },\n    {\n        \"name\": \"Name 2\",\n        \"anotherValue\": 2\n    },\n    {\n        \"name\": \"\",\n        \"anotherValue\": 3\n    }\n]\n```\n\nYou want to query only items 
that have a name that isn't empty, so you use the following `INPUT`:\n\n```json\n{\n    \"datasetId\": \"YOUR_DATASET_ID\",\n    \"query\": {\n        \"name\": { \"$ne\": \"\" }\n    }\n}\n```\n\n`$ne` means \"not equal\" in MongoDB, so you'll receive the \"Name 1\" and \"Name 2\" items.\n\nNow say you want to strip the \"Name \" prefix from the \"name\" field and add an extra field:\n\n```json\n{\n    \"datasetId\": \"YOUR_DATASET_ID\",\n    \"query\": {\n        \"name\": { \"$ne\": \"\" }\n    },\n    \"filterMap\": \"({ item }) =\u003e { item.name = item.name.replace('Name ', ''); item.extra = true; return item; }\"\n}\n```\n\nYour generated dataset is now:\n\n```json\n[\n    {\n        \"name\": \"1\",\n        \"anotherValue\": 1,\n        \"extra\": true\n    },\n    {\n        \"name\": \"2\",\n        \"anotherValue\": 2,\n        \"extra\": true\n    }\n]\n```\n\n## filterMap and customOperationSetup\n\nThe `filterMap` parameter lets you perform more complex checks. It runs in a limited context, with the following variables available inside your function:\n\n* `sift`: the [sift](https://www.npmjs.com/package/sift) module, so you can create a filter on-the-fly\n* `console.log`: tied to the 'outside' `console.log` and outputs information to the actor log\n* `item`: the current dataset item\n* `index`: the current filtered index\n* `total`: total items available in the dataset\n* `filter`: the created filter from `query` parameter\n* `datasetIndex`: the current position in the dataset index\n\nThe `customOperationSetup` parameter is mostly useful for preparing a [custom operation](https://www.npmjs.com/package/sift#custom-operations) using `sift`:\n\n```js\n() =\u003e ({\n    $gtDate(params, ownerQuery, options) {\n        const timestamp = new Date(params).getTime();\n\n        return createEqualsOperation(\n            value =\u003e new Date(value).getTime() \u003e timestamp, // 'value' here is the date from the field you provide\n            ownerQuery,\n            options\n        );\n 
   }\n})\n```\n\nThen use it directly inside your `query` (\"2020-01-01\" is passed as the `params` argument):\n\n```json\n{\n    \"query\": {\n        \"lastModified\": { \"$gtDate\": \"2020-01-01\" }\n    }\n}\n```\n\nMost of the time, you won't need to use `customOperationSetup`, since the built-in operators can do a lot by themselves, but it is provided for completeness.\n\n## Expected Consumption\n\nThe memory requirements should be low, but you need at least 128 MB. The dataset items aren't loaded into memory all at once, but depending on the shape of your query, you may need more. The more query parameters you provide, the more memory and CPU are required, and with more resources your query finishes faster.\n\n## Limitations\n\nSome types aren't allowed in JSON, such as `Date` and `RegExp`. The workaround is to define a query without those types, then, inside `filterMap`, return either `null` or `undefined` for items whose dates or regular expressions don't match.\n\nE.g.:\n\n```json\n{\n    \"datasetId\": \"YOUR_DATASET_ID\",\n    \"query\": {\n\n    },\n    \"filterMap\": \"({ item }) =\u003e { if (new Date(item.someDateField).getTime() \u003c new Date(2019, 10, 20).getTime()) { return item } }\"\n}\n```\n\nOr you can use `customOperationSetup` and provide your own advanced operator for native types.\n\n## License\n\nApache-2.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpocesar%2Fapify-query-dataset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpocesar%2Fapify-query-dataset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpocesar%2Fapify-query-dataset/lists"}