https://github.com/queukat/spark_oracle_hive_streaming
- Host: GitHub
- URL: https://github.com/queukat/spark_oracle_hive_streaming
- Owner: queukat
- License: mit
- Created: 2023-01-11T13:34:50.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2023-12-15T21:29:10.000Z (over 2 years ago)
- Last Synced: 2025-01-28T21:18:02.603Z (over 1 year ago)
- Language: Scala
- Size: 123 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# Project: Spark Universal Migrator
Spark Universal Migrator is a Scala/Spark library for full-load table-by-table migration from Oracle to Hive. It captures an Oracle snapshot SCN, reads source rows through Spark JDBC using `ROWID` range queries built from Oracle extent metadata, applies an explicit Oracle-to-Spark schema policy, and writes the result into Hive through a temporary table plus `INSERT OVERWRITE`.
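The read and write halves of that flow can be sketched as plain SQL builders. This is a minimal illustration, not the library's actual SQL: `Extent`, `SnapshotSqlSketch`, and its methods are hypothetical names, and the sketch assumes the extent bounds come from Oracle extent metadata (file number, first block, block count) and that the row number bounds `0` and `32767` are wide enough to cover every row in a block.

```scala
// Hypothetical model of one Oracle extent; field names are assumptions for this sketch.
final case class Extent(dataObjectId: Long, relativeFno: Long, startBlock: Long, blocks: Long)

object SnapshotSqlSketch {
  // Lowest ROWID in the extent: first block, row 0 (rowid type 1 = extended).
  private def lowRowId(e: Extent): String =
    s"DBMS_ROWID.ROWID_CREATE(1, ${e.dataObjectId}, ${e.relativeFno}, ${e.startBlock}, 0)"

  // Highest ROWID in the extent: last block, assumed maximum row number 32767.
  private def highRowId(e: Extent): String = {
    val lastBlock = e.startBlock + e.blocks - 1
    s"DBMS_ROWID.ROWID_CREATE(1, ${e.dataObjectId}, ${e.relativeFno}, $lastBlock, 32767)"
  }

  // One range query per extent, pinned to the captured SCN with a flashback clause.
  def rangeQuery(owner: String, table: String, scn: Long, e: Extent): String =
    s"SELECT t.* FROM $owner.$table AS OF SCN $scn t " +
      s"WHERE t.ROWID BETWEEN ${lowRowId(e)} AND ${highRowId(e)}"

  // Final Hive step: replace the target's contents from the temporary table.
  def overwriteStatement(target: String, tempTable: String): String =
    s"INSERT OVERWRITE TABLE $target SELECT * FROM $tempTable"

  def main(args: Array[String]): Unit = {
    val extent = Extent(dataObjectId = 1001L, relativeFno = 4L, startBlock = 128L, blocks = 8L)
    println(rangeQuery("HR", "EMPLOYEES", scn = 42L, extent))
    println(overwriteStatement("employees_hive", "employees_hive_tmp"))
  }
}
```

Because every range query carries the same SCN, all partitions read the same point-in-time view of the table even though they run at different moments.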
# Requirements
- Apache Spark 3.5.7
- Scala 2.12.x
- Oracle as the source system
- Hive metastore / Hive-enabled Spark as the target system
- Oracle JDBC driver on the runtime classpath
# What It Supports
- Full-load data migration
- Table-by-table execution
- Range-based reads using Oracle `ROWID`
- Snapshot-based reads through captured SCN
- Oracle type handling modes:
  - `spark`: keep Spark JDBC inferred types for ambiguous Oracle `NUMBER`
  - `oracle`: profile Oracle `NUMBER` columns without precision/scale metadata
  - `skip`: avoid extra profiling and fall back to `StringType` for ambiguous `NUMBER`
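As a rough illustration of how the three mode strings might map onto a policy for ambiguous `NUMBER` columns, here is a minimal sketch. The names `NumberPolicy`, `TypeCheckSketch`, `parse`, and `isAmbiguous` are hypothetical and not part of the library. The sketch also assumes that unknown mode strings are rejected and that "ambiguous" means Oracle reports neither precision nor scale.

```scala
// Hypothetical policy ADT; the library's internal representation is not shown in this README.
sealed trait NumberPolicy
case object KeepSparkInferred extends NumberPolicy // "spark"
case object ProfileInOracle extends NumberPolicy   // "oracle"
case object FallBackToString extends NumberPolicy  // "skip"

object TypeCheckSketch {
  // Map the documented mode strings to a policy; anything else is an error (an assumption).
  def parse(typeCheck: String): Either[String, NumberPolicy] =
    typeCheck match {
      case "spark"  => Right(KeepSparkInferred)
      case "oracle" => Right(ProfileInOracle)
      case "skip"   => Right(FallBackToString)
      case other    => Left(s"Unknown typeCheck mode: $other")
    }

  // Assumed definition: a NUMBER column is ambiguous when it has no precision and no scale.
  def isAmbiguous(precision: Option[Int], scale: Option[Int]): Boolean =
    precision.isEmpty && scale.isEmpty

  def main(args: Array[String]): Unit = {
    println(parse("oracle"))
    println(isAmbiguous(None, None))
  }
}
```

Only ambiguous columns need a policy decision. A column declared as `NUMBER(10,2)` already carries enough metadata to map directly to a decimal type.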
# What It Does Not Do
- CDC / change capture
- Incremental load orchestration
- Schema evolution management across runs
- Environment-variable based configuration
- Standalone CLI entrypoint in this repository
# Testing
Run the test suite with:
```bash
sbt test
```
The test suite covers SQL generation, schema conversion, Spark session creation, Spark-side JDBC load composition, and Hive overwrite behavior.
# Using the Library
Use the `NewSpark.migrate(...)` API from your own application entrypoint:
```scala
import queukat.spark_universal.NewSpark

object ExampleMigration {
  def main(args: Array[String]): Unit = {
    NewSpark.migrate(
      url = "jdbc:oracle:thin:@//localhost:1521/ORCL",
      oracleUser = "your_oracle_username",
      oraclePassword = "your_oracle_password",
      tableName = "employees",
      owner = "HR",
      hivetable = "employees_hive",
      numPartitions = 8,
      fetchSize = 1000,
      typeCheck = "spark"
    )
  }
}
```
# Notes
- The Oracle user must have access to the metadata views used to compute extent ranges.
- Unsupported Oracle types fail fast during schema conversion instead of being silently downgraded.
- Temporary Hive table names are generated uniquely per migration run.
- Logging stays on the existing `slf4j` facade so the host Spark application keeps control over the final backend.
- The library emits stage-oriented log messages tagged `[MIGRATE]`, `[SCHEMA]`, `[LOAD]`, and `[HIVE]` for easier scanning.
- ANSI color can be forced on or off with `-Dspark.universal.log.color=true|false`; by default the library colors logs only when it detects an interactive terminal.
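The color rule in the last note can be sketched as a small decision function. This is an illustration only: `LogColorSketch`, `colorEnabled`, and `tag` are hypothetical names, the cyan escape code is an arbitrary choice, and using `System.console() != null` as the interactive-terminal check is an assumption about how such detection might work, not the library's confirmed behavior.

```scala
object LogColorSketch {
  private val Esc = "\u001b" // ANSI escape character

  // An explicit property value wins; otherwise fall back to terminal detection.
  def colorEnabled(propertyValue: Option[String], interactive: Boolean): Boolean =
    propertyValue.map(_.trim.toLowerCase) match {
      case Some("true")  => true
      case Some("false") => false
      case _             => interactive
    }

  // Prefix a message with a stage tag such as [LOAD], colored cyan when enabled.
  def tag(stage: String, message: String, color: Boolean): String = {
    val label = s"[$stage]"
    if (color) s"$Esc[36m$label$Esc[0m $message" else s"$label $message"
  }

  def main(args: Array[String]): Unit = {
    // Assumed heuristic: a non-null console means an interactive terminal.
    val interactive = System.console() != null
    val color = colorEnabled(sys.props.get("spark.universal.log.color"), interactive)
    println(tag("MIGRATE", "starting HR.employees", color))
  }
}
```

Keeping the decision in a pure function makes it easy to test without a real terminal: the property value and the interactivity flag are passed in rather than read inside the function.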
# Publishing
The repository includes GitHub Actions workflows for CI and Maven Central publishing. CI runs `sbt test`, and publishing uses `sbt +publishSigned` on release.