https://github.com/alipsa/jparq
JDBC driver for parquet files
https://github.com/alipsa/jparq
jdbc jdbc-driver parquet parquet-files parquet-tools
Last synced: 6 days ago
JSON representation
JDBC driver for parquet files
- Host: GitHub
- URL: https://github.com/alipsa/jparq
- Owner: Alipsa
- License: apache-2.0
- Created: 2025-10-16T18:43:24.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2026-03-27T17:11:27.000Z (3 months ago)
- Last Synced: 2026-06-14T01:32:45.654Z (6 days ago)
- Topics: jdbc, jdbc-driver, parquet, parquet-files, parquet-tools
- Language: Java
- Homepage:
- Size: 2 MB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
- Agents: AGENTS.md
Awesome Lists containing this project
README
[](https://mvnrepository.com/artifact/se.alipsa/jparq)
[](https://javadoc.io/doc/se.alipsa/jparq)
# JParq
JParq is a JDBC driver for Apache Parquet files. It treats a directory as a database and every `.parquet` file in that
directory as a table. The table name is the filename without the `.parquet` extension.
Sub directories are treated as schemas.
For example, given the following directory structure:
```
/data
├── employees.parquet
├── departments.parquet
└── sales
├── orders.parquet
└── customers.parquet
```
The tables `employees` and `departments` would be in the default PUBLIC schema, while `orders` and `customers` would be in the `sales` schema.
JParq uses Apache Arrow and Apache Parquet for efficient columnar reads and jsqlparser to parse SQL statements.
It aims to be 100% compliant with the read part of the SQL standard. There are a few common extensions supported as well e.g. LIMIT, ASCII, REPEAT, DIFFERENCE.
Note: When common implementations differ from the SQL standard, we stick to the standard. An example of this is
convert which in many databases is used for data type casting, but in the SQL standard is used for character set
conversion. If you want to convert data types, use CAST instead.
> **Note**
> The majority of the code was created in collaboration with (vibe coded with) ChatGPT Codex with Copilot and
> myself (Per Nyfelt) as code reviewers.
## Requirements
- Java 21 (higher is not supported due to Hadoop restrictions)
- Parquet files stored in a directory accessible from the JVM running the driver
## Installation
Add the dependency to your build. Replace `x.y.z` with the latest version number from the Maven Central badge above.
```xml
se.alipsa
jparq
x.y.z
```
## Usage
The driver registers itself when the `se.alipsa.jparq.JParqDriver` class is loaded, so simply placing the JAR on the
classpath is normally enough. If your runtime requires explicit registration you can call
`Class.forName("se.alipsa.jparq.JParqDriver")` before obtaining a connection.
```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import se.alipsa.jparq.JParqSql;
public class JParqExample {
// Standard JDBC
void selectMtcarsLimit() throws SQLException {
String jdbcUrl = "jdbc:jparq:/home/user/data";
try (Connection conn = DriverManager.getConnection(jdbcUrl);
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery("SELECT * FROM mtcars LIMIT 5")) {
while (rs.next()) {
System.out.println(rs.getString(1));
}
}
}
// Using the JParqSql helper
void selectMtcarsToyotas() {
String jdbcUrl = "jdbc:jparq:/home/user/data";
JParqSql jparqSql = new JParqSql(jdbcUrl);
jparqSql.query("SELECT model, cyl, mpg FROM mtcars WHERE model LIKE 'Toyota%'", rs -> {
try {
while (rs.next()) {
System.out.println(rs.getString(1) + ", " + rs.getInt(2) + ", " + rs.getDouble(3));
}
} catch (SQLException e) {
System.out.println("Query failed: " + e);
}
});
}
}
```
### Connection options
- `caseSensitive` — defaults to `false`. Set to `true` to make table-name resolution case-sensitive.
```text
jdbc:jparq:/home/user/data?caseSensitive=true
```
- Paths may be specified directly or using the `file://` prefix.
### Fat Jar
A "fat" jar, which includes all dependencies, is also created. This is useful when you want to add the driver to the classpath of a tool. The fat jar is named `jparq-x.y.z-fat.jar` i.e. fat is the maven classifier. It is available in maven central and in the [releases](https://github.com/Alipsa/JParq/releases) section.
## JParq CLI
An interactive command line interface is available for querying Parquet files using JParq. The CLI supports command
history and basic line editing features. It also includes several commands starting with a slash (/) for controlling the CLI itself.
To start the CLI, run the following command, replacing `path/to/jparq.jar` with the actual path to the JParq fat jar and the {path to jparq dir} with the path the directory where your parquet files reside.
```shell
java -jar path/to/jparq.jar {path to jparq dir}
```
To download the fat jar, you can either download it from [releases](https://github.com/Alipsa/JParq/releases) or use the following command:
```shell
LATEST_VERSION=$(curl -s https://repo1.maven.org/maven2/se/alipsa/jparq/maven-metadata.xml | sed -nE 's/.*(.*)<\/release>.*/\1/p') && \
curl -o jparq.jar https://repo1.maven.org/maven2/se/alipsa/jparq/$LATEST_VERSION/jparq-$LATEST_VERSION-fat.jar
```
For simpler access you can create a script to it (e.g called jparq) that you put in your ~/bin dir
```shell
#!/usr/bin/env bash
java -jar /path/to/jparq.jar $@
```
Once the CLI is started, you can use the following commands:
- /connect - Connect to a Parquet database directory (dir with parquet files)
- /close - Close the current connection
- /list - List available tables
- /describe - Show column definitions for a table
- /info - Show information about the current connection
- /help - Display this help text
- /exit - Exit the CLI
> Anything else will be treated as a SQL query and executed against the connected dir.
Example session:
```
$ java -jar ~/lib/jparq.jar target/test-classes/acme
JParq CLI version 1.3.0
Connected to /home/user1/project/JParq/target/test-classes/acme
jparq(acme)> select * from employees e left join salary s on e.id = salary.employee
e__id | first_name | last_name | s__id | employee | salary | change_date
------+------------+------------+-------+----------+----------+------------
1 | Per | Andersson | 1 | 1 | 150000.0 | 2020-03-01
1 | Per | Andersson | 6 | 1 | 160000.0 | 2021-01-01
1 | Per | Andersson | 8 | 1 | 165000.0 | 2021-08-01
2 | Karin | Pettersson | 2 | 2 | 180000.0 | 2020-03-01
3 | Tage | Lundström | 3 | 3 | 130000.0 | 2021-01-01
3 | Tage | Lundström | 7 | 3 | 140000.0 | 2021-12-01
4 | Arne | Larsson | 4 | 4 | 195000.0 | 2020-10-01
5 | Sixten | Svensson | 5 | 5 | 230000.0 | 2020-12-15
jparq(acme)> /exit
$
```
## JDBC support
- **Driver scope** — Read-only. Only `SELECT` statements are supported; `executeUpdate`/DML, generated keys, and updatable result sets are not implemented. Result sets are `TYPE_FORWARD_ONLY`, `CONCUR_READ_ONLY` with `CLOSE_CURSORS_AT_COMMIT` holdability.
- **Prepared statements** — SQL is planned when the `PreparedStatement` is created. Positional `?` parameters are bound via the standard `setXxx` methods and rendered as safe SQL literals (named parameters are also supported). Parameter metadata is available but reports generic `VARCHAR`/`IN` characteristics because actual typing occurs at execution. Only forward-only/read-only statements with `CLOSE_CURSORS_AT_COMMIT` holdability are accepted; overloads requesting other result-set types or generated keys/column-based key selection throw `SQLFeatureNotSupportedException`. Prepared-statement batching without a SQL argument is supported for convenience and reports `SUCCESS_NO_INFO` counts per entry. `Statement` batching and `addBatch(String)` remain unsupported.
- **Parquet predicate pushdown** — When a query targets a Parquet-backed table with readable schema metadata and the full `WHERE` clause can be translated by `ParquetFilterBuilder`, JParq attaches a Parquet `FilterPredicate` and enables statistics, column-index, dictionary, and bloom-filter pruning. Supported shapes are column-vs-literal comparisons (`=`, `!=`, `<`, `<=`, `>`, `>=`), `BETWEEN`, `IN` with up to 20 literals, and `AND`/`OR`/`NOT` compositions where every branch is pushdownable. Predicates involving functions, `LIKE`, column-to-column comparisons, subqueries, unsupported types, or mixed pushdownable/non-pushdownable trees fall back to residual filtering in Java. `JParqPreparedStatement.getPushdownInfo()` exposes which path a prepared statement uses.
- **Transactions and sessions** — The connection is always read-only with `TRANSACTION_NONE`; `commit`, `rollback`, and savepoints are no-ops. `getAutoCommit()` always returns `true`, and changing isolation level, holdability, or poolability is not supported.
- **Metadata** — `DatabaseMetaData` exposes tables and columns derived from the base directory and subdirectories (schemas); the catalog is the base directory name. Callable statements, stored procedures, and custom type mappings are not supported.
- **Other JDBC surfaces** — `CallableStatement`, `createClob`/`createBlob`/`createArrayOf`, network timeout, and client-info features are not implemented. Connection validation via `isValid()` returns `true` while the connection is open and `false` after it is closed.
### Non standard JDBC extensions
- Named parameters i.e. :paramName syntax for prepared statements
## SQL Support
JParq aims to be fully compliant with the read part of the SQL standard (SQL:2016 and earlier).
As of version 1.1.1, the support for the standard is complete (as far as I know). The following is a detailed list of supported features and functions.
The following SQL statements are supported:
- `SELECT` with support for
- `*` to select all columns
- Qualified wildcard projections (table.*).
- alias support for columns and tables
- Support computed expressions with aliases (e.g. SELECT mpg*2 AS double_mpg)
- Quoted identifiers using "double quotes" as per the SQL standard.
- `CASE` support
- `SELECT` statements with `WHERE` supporting:
- `BETWEEN`, `IN`, `LIKE` operators
- `AND`, `OR`, `NOT` logical operators
- Comparison operators: `=`, `!=`, `<`, `>`, `<=`, `>=`
- Null checks: `IS NULL`, `IS NOT NULL`
- `ORDER BY` clause with multiple columns and `ASC`/`DESC` options
- support for optional explicit NULLS FIRST and NULLS LAST
- `OFFSET` and `LIMIT` support
- Also support the PostgreSQL shorthand `LIMIT n OFFSET m` syntax
- Standard row-limiting syntax (FETCH FIRST / OFFSET … FETCH)
- `DISTINCT` support
- Functions support
- Date functions
- Aggregate functions (count, sum, avg, max, min)
- CAST support
- CONVERT support (for both character set conversion with USING and data type conversion without USING, as per SQL standard and common extensions)
- coalesce (The COALESCE() function returns the first non-null value in a list.)
- String functions (all SQL standard string function supported)
- also support || operator for string concatenations
- Numeric functions (abs, ceil/ceiling, floor, round, sqrt, truncate/trunc, mod, power/pow, exp, log, log10,
rand/random, sign, sin, cos, tan, cot, asin, acos, atan, atan2, degrees, radians, pi)
- ARRAY constructor function
- comments (line --) and block (/* */)
- Subquery support
- In the SELECT Clause : Used to return a single value or a set of values. e.g.
SELECT first_name, (
SELECT department_name FROM departments WHERE departments.department_id = employees.department_id
) AS department_name
FROM employees;
- In the FROM Clause : Treated as a derived table or inline view. E.g:
SELECT *
FROM (SELECT first_name, salary FROM employees WHERE salary > 5000) AS "high_salaried"
- In the WHERE Clause : Used to filter the results. e.g
SELECT first_name
FROM employees
WHERE department_id IN (SELECT department_id FROM departments WHERE location_id>1500);
- In the HAVING Clause : Used to filter groups. E.g:
SELECT department_id, AVG(salary)
FROM employees
GROUP BY department_id
HAVING AVG(salary) > (SELECT AVG(salary) FROM employees);
- `GROUP BY` support
- `COUNT(*)` aggregation
- `HAVING` clause with conditions
- support aggregation functions and case statements in the `GROUP BY` and `SELECT` clause
- `EXISTS` support
- `ANY` and `ALL` support
- Join support: INNER, LEFT, RIGHT, FULL, CROSS, and Self Join, join ... using syntax
- `UNION` and `UNION ALL` support
- Complete set-operation coverage.
- `EXCEPT`
- `INTERSECT`
- `INTERSECT ALL`
- `EXCEPT ALL`
- nesting of set operations
- CTE (Common Table Expressions) support
- Windowing
- Ranking functions
- `ROW_NUMBER`, `RANK`, `DENSE_RANK`, `PERCENT_RANK`, `CUME_DIST`, `NTILE`
- Aggregate window functions
- `SUM`, `AVG`, `MIN`, `MAX`, `COUNT`
- Analytic Value/Navigation Functions
- `LAG`, `LEAD`, `FIRST_VALUE`, `LAST_VALUE`, `NTH_VALUE`
- Advanced GROUP BY constructs i.e:
- `GROUPING SETS`
- `ROLLUP`
- `CUBE`
- Derived Tables: `UNNEST` with or without a table wrapper, `LATERAL` derived tables, `VALUES` table constructors
- `INFORMATION_SCHEMA.COLUMNS` and `INFORMATION_SCHEMA.TABLES`
#### String functions support details
##### Character Length and Position
- CHAR_LENGTH(string) or CHARACTER_LENGTH(string) Returns number of characters in a string. CHAR_LENGTH('hello') → 5
- OCTET_LENGTH(string) Returns number of bytes in the string (depends on encoding). OCTET_LENGTH('Å') → 2 (in UTF-8)
- POSITION(substring IN string) Finds the position (1-based) of substring in string. POSITION('l' IN 'hello') → 3
##### Substrings and Extraction
- SUBSTRING(string FROM start [FOR length]) Extracts substring starting at start, optionally limited by length. SUBSTRING('abcdef' FROM 2 FOR 3) → 'bcd'
- LEFT(string, count) (optional extension) Leftmost characters. LEFT('abcdef', 3) → 'abc'
- RIGHT(string, count) (optional extension) Rightmost characters. RIGHT('abcdef', 2) → 'ef'
##### Concatenation
- CONCAT(string1, string2, …) Concatenates two or more strings (SQL:2016 added variadic support). CONCAT('a','b','c') → 'abc'
- Support for the || operator, covering both binary and text concatenation, as well as null propagation.
##### Case Conversion
- UPPER(string) Converts to uppercase. UPPER('sql') → 'SQL'
- LOWER(string) Converts to lowercase. LOWER('SQL') → 'sql'
##### Trimming and Padding
- TRIM([LEADING TRAILING BOTH] [characters] FROM string)
- LTRIM(string) (extension) Trims leading spaces. LTRIM(' hi') → 'hi'
- RTRIM(string) (extension) Trims trailing spaces. RTRIM('hi ') → 'hi'
- LPAD(string, length [, fill]) (SQL:2008 optional) Pads string on the left. LPAD('42', 5, '0') → '00042'
- RPAD(string, length [, fill]) (SQL:2008 optional) Pads string on the right. RPAD('42', 5, '0') → '42000'
##### Searching and Replacing
- OVERLAY(string PLACING replacement FROM start [FOR length]) Replaces part of string starting at start with replacement. OVERLAY('abcdef' PLACING 'xyz' FROM 3 FOR 2) → 'abxyze f'
- REPLACE(string, search, replace) (SQL:2008) Replaces all occurrences of search with replace. REPLACE('banana', 'na', 'xy') → 'baxyxy'
##### Collation and Comparison
- COLLATE(string, collation_name) Applies a specific collation to a string. 'abc' COLLATE "sv_SE"
- SIMILAR TO e.g.
'cat' SIMILAR TO '(cat|dog)' → TRUE
'cab' SIMILAR TO 'c(a|o)b' → TRUE
'cab' SIMILAR TO 'c(a|e)b' → FALSE
'abc' SIMILAR TO 'a%' → TRUE
- REGEXP_LIKE (pattern matching operators) e.g:
REGEXP_LIKE('abc123', '^[a-z]+[0-9]+$') → TRUE
REGEXP_LIKE('AbC', 'abc', 'i') → TRUE -- 'i' = case-insensitive
REGEXP_LIKE('cat', 'dog|cat') → TRUE
##### Unicode and Codepoints
- CHAR(code) Returns the character corresponding to a code point. CHAR(65) → 'A'
- UNICODE(string) Returns Unicode code point of first character. UNICODE('A') → 65
##### SQL:2016–2023 Additions
- NORMALIZE(string [USING form]) Normalizes Unicode text (SQL:2016). NORMALIZE('é') → 'é'
- STRING_AGG(expression, separator) Aggregates values into a single string with a separator. STRING_AGG(name, ', ') → 'Alice, Bob, Carol'
- JSON_VALUE, JSON_QUERY, JSON_OBJECT, JSON_ARRAY JSON construction/extraction—technically not core string functions but string-returning functions standardized in SQL:2016–2023.
##### Non-standard extensions
- ASCII(string) Returns the ASCII code of the first character. ASCII('A') → 65
- LOCATE(substring, string[, start]) Locates substring in string with optional start position. LOCATE('c', 'abcabc', 3) → 3
- REPEAT(string, count) Repeats the string N times. REPEAT('a', 4) → 'aaaa'
- SPACE(count) Creates a string of N space characters. SPACE(5) → ' '
- INSERT(string, start, length, replacement) Inserts replacement at position after removing length characters. INSERT('abcdef', 3, 2, 'XYZ') → 'abXYZef'
- SOUNDEX(string) Computes the Soundex phonetic code. SOUNDEX('Robert') → 'R163'
- DIFFERENCE(string1, string2) Calculates similarity based on Soundex codes (0-4). DIFFERENCE('Smith', 'Smyth') → 4
### Build and test
```bash
mvn verify
```
- To run without spotless, add -Dspotless.check.skip=true
- To skip unit tests, add -DskipTests
- To skip integration tests add -DskipITs=true
This project uses Checkstyle, PMD, and Spotless. The checks run automatically as part of the Maven build. Use the
`spotless:apply` goal before committing if you need to fix formatting issues.
### Release notes
See [release.md](release.md) for the full version history and work in progress.
## Roadmap: _Might_ be implemented in the future
### Non standard extensions
- Allow omission of from clause in some cases
- Several popular databases allow you to omit the FROM clause when you are only selecting literal values, performing arithmetic, or evaluating scalar functions (functions that return a single value). In these cases, the query returns a single row e.g: `SELECT CURRENT_DATE as cur_date`.
- Support for variable assignment and use within SQL scripts.
- @variable_name syntax to define a variable that exists for the duration of the connection
- Example (direct assignment, connection scope):
declare @myVar INT = 10;
SELECT * FROM myTable WHERE myColumn > @myVar;
SELECT * FROM anotherTable LIMIT @myVar;
- PIVOT and UNPIVOT operators.
- TABLESAMPLE clause for sampling rows from a table.
- Support for modular encryption via a keystore and/or a Key Management Service
## Out of scope (will not be supported, at least not in the foreseeable future)
- Data modification statements (INSERT, UPDATE, DELETE, MERGE)
- Transaction control (COMMIT, ROLLBACK, SAVEPOINT)
- Data definition statements (CREATE, ALTER, DROP, TRUNCATE)
- User management and security (GRANT, REVOKE, CREATE USER, etc.)
- Stored procedures and functions (CREATE PROCEDURE, CREATE FUNCTION)
- Triggers (CREATE TRIGGER, DROP TRIGGER)
- Advanced indexing and optimization hints
- Full-text search capabilities
- TEMPORARY TABLES, you need to use CTE's or value tables instead.