{"id":16643407,"url":"https://github.com/sonots/embulk-filter-row","last_synced_at":"2025-03-21T15:32:29.054Z","repository":{"id":49727204,"uuid":"38870008","full_name":"sonots/embulk-filter-row","owner":"sonots","description":"A filter plugin for Embulk to filter out rows with conditions","archived":false,"fork":false,"pushed_at":"2022-10-24T14:18:52.000Z","size":375,"stargazers_count":13,"open_issues_count":0,"forks_count":6,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-14T23:14:22.689Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sonots.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-07-10T08:32:38.000Z","updated_at":"2022-10-31T13:32:22.000Z","dependencies_parsed_at":"2022-09-13T07:01:41.612Z","dependency_job_id":null,"html_url":"https://github.com/sonots/embulk-filter-row","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sonots%2Fembulk-filter-row","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sonots%2Fembulk-filter-row/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sonots%2Fembulk-filter-row/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sonots%2Fembulk-filter-row/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sonots","download_url":"https://codeload.github.com/sonots/embulk-filter-row/tar.gz/refs/heads/master","host":{"name":"GitHub","
url":"https://github.com","kind":"github","repositories_count":244146369,"owners_count":20405819,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-12T08:08:24.382Z","updated_at":"2025-03-21T15:32:28.769Z","avatar_url":"https://github.com/sonots.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Row filter plugin for Embulk\n\n[![Build Status](https://secure.travis-ci.org/sonots/embulk-filter-row.png?branch=master)](http://travis-ci.org/sonots/embulk-filter-row)\n\nA filter plugin for Embulk to filter out rows\n\n## Configuration\n\nRequirement: version \u003e= 0.3.0\n\n* **where**: Select only rows which match with conditions written in SQL-like syntax. 
See [SQL-like Syntax](#sql-like-syntax)\n\n## Example\n\n```yaml\nfilters:\n  - type: row\n    where: column1 = 'str'\n```\n\n```yaml\nfilters:\n  - type: row\n    where: |-\n      (\n        string_column START_WITH 'str' AND\n        number_column \u003e 1.0\n      )\n      OR\n      (\n        time_column = TIMESTAMP '2016-01-01 +0900' AND\n        \"true_column\" = true\n      )\n```\n\nSee [SQL-like Syntax](#sql-like-syntax) for more details.\n\n# SQL-like Syntax\n\nThis syntax is intended to be similar to standard SQL syntax.\n\n```sql\nwhere: column1 = 'str'\n```\n\n```sql\nwhere: |-\n  (\n    string_column START_WITH 'str' AND\n    number_column \u003e 1.0\n  )\n  OR\n  (\n    time_column = TIMESTAMP '2016-01-01 +0900' AND\n    \"true_column\" = true AND\n    string_column REGEXP '^reg'\n  )\n```\n\n## Literals\n\n### Boolean Literal\n\n`true`, `TRUE`, `false`, and `FALSE` are treated as boolean literals.\n\n### Number Literal\n\nCharacters matching the regular expression `-?[0-9]+(\\.[0-9]+)?` are treated as a number literal.\n\n### String Literal\n\nCharacters surrounded by `'`, such as `'foo'`, are treated as a string literal.\n\n### Timestamp Literal\n\nNOTE: In versions \u003e= 0.3.3, the `TIMESTAMP` keyword can be omitted when comparing with a `timestamp` identifier (column).\n\n`TIMESTAMP ( NumberLiteral | StringLiteral )`, such as `TIMESTAMP 1470433087.747123` or `TIMESTAMP '2016-08-06 06:38:07.747123 +0900'`, is treated as a timestamp literal.\n\nA number is an epoch time since 1970-01-01 UTC with nanosecond resolution.\n\nA string is a timestamp string that matches one of the following formats:\n\n* `%Y-%m-%d %H:%M:%S.%N %z`\n* `%Y-%m-%d %H:%M:%S.%N`\n* `%Y-%m-%d %H:%M:%S %z`\n* `%Y-%m-%d %H:%M:%S`\n* `%Y-%m-%d %z`\n* `%Y-%m-%d`\n\nThe time zone for formats without `%z` is UTC, and the time resolution is microseconds (a limitation of Embulk's TimestampParser).\n\n### Json Literal\n\nNot supported yet\n\n### Identifier 
Literal\n\nCharacters matching the regular expression `[a-zA-Z_][a-zA-Z0-9_]*`, such as `foobar`, and characters surrounded by `\"`, such as `\"foo-bar\"`, `\"foo.bar\"`, and `\"foo\\\"bar\"`, are treated as an identifier literal, that is, an Embulk column name.\n\n## Operators\n\n### Boolean Operator\n\n* `=`\n* `!=`\n\n### Number Operator (Long and Double)\n\n* `=`\n* `!=`\n* `\u003e`\n* `\u003e=`\n* `\u003c=`\n* `\u003c`\n\n### String Operator\n\n* `=`\n* `!=`\n* `START_WITH`\n* `END_WITH`\n* `INCLUDE`\n* `REGEXP`\n\n### Timestamp Operator\n\n* `=`\n* `!=`\n* `\u003e`\n* `\u003e=`\n* `\u003c=`\n* `\u003c`\n\n### Json Operator\n\nNot supported yet\n\n### Unary Operator\n\n* `xxx IS NULL`\n* `xxx IS NOT NULL`\n* `NOT xxx`\n\n## Old Configuration\n\nVersions \u003e= 0.3.0 have a `where` option that supports SQL-like syntax. I recommend using it.\n\nThe following options are **deprecated** and **will be removed someday**.\n\n* **condition**: AND or OR (string, default: AND).\n* **conditions**: select only rows that match the conditions.\n  * **column**: column name (string, required)\n  * **operator**: operator (string, optional, default: ==)\n    * boolean operator\n      * `==`\n      * `!=`\n    * numeric operator (long, double, Timestamp)\n      * `==`\n      * `!=`\n      * `\u003e`\n      * `\u003e=`\n      * `\u003c=`\n      * `\u003c`\n    * string operator\n      * `==`\n      * `!=`\n      * `start_with` (or `startsWith`)\n      * `end_with` (or `endsWith`)\n      * `include` (or `contains`)\n    * unary operator\n      * `IS NULL`\n      * `IS NOT NULL`\n  * **argument**: argument for the operation (string, required for non-unary operators)\n  * **not**: negate the condition (boolean, optional, default: false)\n  * **format**: special option for timestamp columns; specifies the format of the timestamp argument. The parsed argument is compared with the column value as a Timestamp object (string, default is `%Y-%m-%d %H:%M:%S.%N %z`)\n  * **timezone**: special option for timestamp 
column, specify the timezone of timestamp argument (string, default is `UTC`)\n\nNOTE: The column type is automatically retrieved from the input data (inputSchema).\n\n## Example (AND)\n\n**Deprecated**\n\n```yaml\nfilters:\n  - type: row\n    condition: AND\n    conditions:\n      - {column: foo,  operator: \"IS NOT NULL\"}\n      - {column: id,   operator: \"\u003e=\", argument: 10}\n      - {column: id,   operator: \"\u003c\",  argument: 20}\n      - {column: name, operator: \"include\", argument: foo, not: true}\n      - {column: time, operator: \"==\", argument: \"2015-07-13\", format: \"%Y-%m-%d\"}\n```\n\n## Example (OR)\n\n**Deprecated**\n\n```yaml\nfilters:\n  - type: row\n    condition: OR\n    conditions:\n      - {column: a, operator: \"IS NOT NULL\"}\n      - {column: b, operator: \"IS NOT NULL\"}\n```\n\n## Example (AND of OR)\n\n**Deprecated**\n\nYou can express a condition such as `(A OR B) AND (C OR D)` by chaining multiple filters, like this:\n\n```yaml\nfilters:\n  - type: row\n    condition: OR\n    conditions:\n      - {column: a, operator: \"IS NOT NULL\"}\n      - {column: b, operator: \"IS NOT NULL\"}\n  - type: row\n    condition: OR\n    conditions:\n      - {column: c, operator: \"IS NOT NULL\"}\n      - {column: d, operator: \"IS NOT NULL\"}\n```\n\n## Comparisons\n\n* [embulk-filter-calcite](https://github.com/muga/embulk-filter-calcite)\n  * embulk-filter-calcite is a nice plugin that lets you write SQL queries to filter Embulk records, supporting not only `WHERE` but also `SELECT`.\n  * Based on [my benchmark (Japanese)](http://qiita.com/sonots/items/a70482d29862de87624d), embulk-filter-row was faster than embulk-filter-calcite.\n  * Choose whichever fits your needs.\n\n## ToDo\n\n* Support filtering by values of `type: json` with JSONPath\n* Support IN operator\n\n## ChangeLog\n\n[CHANGELOG.md](./CHANGELOG.md)\n\n## Development\n\nRun example:\n\n```\n$ ./gradlew classpath\n$ embulk preview -I lib example/example.yml\n```\n\nRun test:\n\n```\n$ 
./gradlew test\n```\n\nRun checkstyle:\n\n```\n$ ./gradlew check\n```\n\nRelease gem:\n\n```\n$ ./gradlew gemPush\n```\n\n## Development of SQL-like Syntax\n\nRead the article [Supported SQL-like Syntax with embulk-filter-row using BYACC/J and JFlex](http://blog.livedoor.jp/sonots/archives/48172830.html).\n\nTo download BYACC/J and JFlex and run them, you can use:\n\n```\n$ script/byaccj.sh\n```\n\nor\n\n```\n$ ./gradlew byaccj # this runs script/byaccj.sh internally\n```\n\nThis generates `src/main/java/org/embulk/filter/row/where/{Parser,ParserVal,Yylex}.java`.\n\nThe Gradle `byaccj` task runs automatically before the `compileJava` task (and therefore also before the `classpath` and `test` tasks).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsonots%2Fembulk-filter-row","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsonots%2Fembulk-filter-row","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsonots%2Fembulk-filter-row/lists"}