https://github.com/queukat/spark_oracle_hive_streaming

Host: GitHub
URL: https://github.com/queukat/spark_oracle_hive_streaming
Owner: queukat
License: mit
Created: 2023-01-11T13:34:50.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2023-12-15T21:29:10.000Z (over 2 years ago)
Last Synced: 2025-01-28T21:18:02.603Z (over 1 year ago)
Language: Scala
Size: 123 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

# Project: Spark Universal Migrator

Spark Universal Migrator is a Scala/Spark library for full-load table-by-table migration from Oracle to Hive. It captures an Oracle snapshot SCN, reads source rows through Spark JDBC using `ROWID` range queries built from Oracle extent metadata, applies an explicit Oracle-to-Spark schema policy, and writes the result into Hive through a temporary table plus `INSERT OVERWRITE`.
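The read and write sides described above can be sketched as plain SQL string builders. This is a minimal illustration, not the library's own code: `MigrationSqlSketch`, `RowIdRange`, and all three helper functions are hypothetical names, the UUID-based temp table name is an assumed scheme, and the ROWID bounds are taken as given rather than derived from extent metadata.

```scala
import java.util.UUID

object MigrationSqlSketch {
  // Hypothetical holder for one ROWID range. In practice the bounds would be
  // derived from Oracle extent metadata; here they are simply passed in.
  final case class RowIdRange(start: String, end: String)

  // Read side: an Oracle flashback query pinned to one SCN and limited to one
  // ROWID range, so every range query sees the same snapshot.
  def rangeQuery(owner: String, table: String, scn: Long, range: RowIdRange): String =
    s"SELECT * FROM $owner.$table AS OF SCN $scn " +
      s"WHERE ROWID BETWEEN CHARTOROWID('${range.start}') AND CHARTOROWID('${range.end}')"

  // Write side: a temp table name that differs on every call. The UUID suffix
  // is an assumption; the README does not show the real naming scheme.
  def tempTableName(hiveTable: String): String = {
    val suffix = UUID.randomUUID().toString.replace('-', '_')
    s"${hiveTable}_tmp_$suffix"
  }

  // Write side: replace the target table's contents from the temp table.
  def overwriteStatement(hiveTable: String, tempTable: String): String =
    s"INSERT OVERWRITE TABLE $hiveTable SELECT * FROM $tempTable"
}
```

For example, `MigrationSqlSketch.rangeQuery("HR", "employees", 42L, MigrationSqlSketch.RowIdRange("AAA", "AAB"))` yields one query per range, and each such query could be handed to a separate Spark JDBC read.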

# Requirements

- Apache Spark 3.5.7

- Scala 2.12.x

- Oracle as the source system

- Hive metastore / Hive-enabled Spark as the target system

- Oracle JDBC driver on the runtime classpath

# What It Supports

- Full-load data migration

- Table-by-table execution

- Range-based reads using Oracle `ROWID`

- Snapshot-based reads through captured SCN

- Oracle type handling modes:

  - `spark`: keep Spark JDBC inferred types for ambiguous Oracle `NUMBER`

  - `oracle`: profile Oracle `NUMBER` columns without precision/scale metadata

  - `skip`: avoid extra profiling and fall back to `StringType` for ambiguous `NUMBER`
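The three type handling modes can be pictured as a small decision function for `NUMBER` columns. The sketch below is an assumption-laden illustration rather than the library's implementation: `NumberTypePolicySketch`, `parseMode`, and `resolveNumber` are hypothetical names, types are represented as plain strings to keep the example free of Spark dependencies, the handling of columns with declared precision/scale is assumed, and the profiling step is a caller-supplied function.

```scala
object NumberTypePolicySketch {
  sealed trait Mode
  case object SparkMode extends Mode
  case object OracleMode extends Mode
  case object SkipMode extends Mode

  // Maps the typeCheck string to a mode; unknown values are rejected early.
  def parseMode(typeCheck: String): Mode = typeCheck.toLowerCase match {
    case "spark"  => SparkMode
    case "oracle" => OracleMode
    case "skip"   => SkipMode
    case other    => throw new IllegalArgumentException(s"Unknown typeCheck mode: $other")
  }

  // declared:      Some((precision, scale)) when Oracle metadata provides them.
  // sparkInferred: the type name Spark JDBC inferred for the column.
  // profile:       hypothetical profiling pass; only invoked in OracleMode.
  def resolveNumber(
      mode: Mode,
      declared: Option[(Int, Int)],
      sparkInferred: String,
      profile: () => (Int, Int)
  ): String = declared match {
    // Assumption: a fully declared NUMBER(p, s) maps to a decimal in every mode.
    case Some((p, s)) => s"decimal($p,$s)"
    case None =>
      mode match {
        case SparkMode => sparkInferred
        case OracleMode =>
          val (p, s) = profile()
          s"decimal($p,$s)"
        case SkipMode => "string"
      }
  }
}
```

Passing the profiling step as a function makes the cost difference visible: in `skip` and `spark` modes it is never called, so no extra query against Oracle is needed.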

# What It Does Not Do

- CDC / change capture

- Incremental load orchestration

- Schema evolution management across runs

- Environment-variable based configuration

- Standalone CLI entrypoint in this repository

# Testing

Run the test suite with:

```bash
sbt test
```

The test suite covers SQL generation, schema conversion, Spark session creation, Spark-side JDBC load composition, and Hive overwrite behavior.

# Using the Library

Use the `NewSpark.migrate(...)` API from your own application entrypoint:

```scala
import queukat.spark_universal.NewSpark

object ExampleMigration {
  def main(args: Array[String]): Unit = {
    NewSpark.migrate(
      url = "jdbc:oracle:thin:@//localhost:1521/ORCL",
      oracleUser = "your_oracle_username",
      oraclePassword = "your_oracle_password",
      tableName = "employees",
      owner = "HR",
      hivetable = "employees_hive",
      numPartitions = 8,
      fetchSize = 1000,
      typeCheck = "spark"
    )
  }
}
```
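Because migration runs table by table and the repository ships no CLI, an application entrypoint will often loop over several tables. The following is one possible shape for such a driver, under stated assumptions: `TableByTableSketch`, `TableJob`, and `runAll` are hypothetical names, and `migrateOne` is a stand-in for a call such as `NewSpark.migrate(...)` so that the sketch stays independent of the library.

```scala
import scala.util.Try

object TableByTableSketch {
  // Hypothetical description of one table to migrate.
  final case class TableJob(owner: String, tableName: String, hiveTable: String)

  // Runs the jobs one after another and records the outcome per source table,
  // so a failure on one table does not stop the remaining tables.
  def runAll(jobs: Seq[TableJob])(migrateOne: TableJob => Unit): Map[String, Try[Unit]] =
    jobs.map { job =>
      s"${job.owner}.${job.tableName}" -> Try(migrateOne(job))
    }.toMap
}
```

A caller would pass a function that forwards each `TableJob` to the migration API with its own connection settings, then inspect the returned map to report which tables failed.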

# Notes

- The Oracle user must have access to the metadata views used to compute extent ranges.

- Unsupported Oracle types fail fast during schema conversion instead of being silently downgraded.

- Temporary Hive table names are generated uniquely per migration run.

- Logging stays on the existing `slf4j` facade so the host Spark application keeps control over the final backend.

- The library now emits stage-oriented log messages such as `[MIGRATE]`, `[SCHEMA]`, `[LOAD]`, and `[HIVE]` for easier scanning.

- ANSI color is opt-in/out through `-Dspark.universal.log.color=true|false`; by default the library only colors logs when it detects an interactive terminal.
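The color toggle and stage tags from the last two notes could work roughly as follows. This is a sketch under assumptions, not the library's logging code: `LogColorSketch` and its functions are hypothetical, `System.console() != null` is only one plausible way to detect an interactive terminal, and the cyan color is an arbitrary choice.

```scala
object LogColorSketch {
  // System property named in the note above.
  val ColorProperty = "spark.universal.log.color"

  private val Cyan = "\u001b[36m"
  private val Reset = "\u001b[0m"

  // An explicit "true"/"false" wins; anything else falls back to whether an
  // interactive terminal was detected.
  def colorEnabled(explicit: Option[String], interactive: Boolean): Boolean =
    explicit.map(_.trim.toLowerCase) match {
      case Some("true")  => true
      case Some("false") => false
      case _             => interactive
    }

  // Assumption: a non-null console is treated as "interactive terminal".
  def colorEnabledFromJvm(): Boolean =
    colorEnabled(sys.props.get(ColorProperty), System.console() != null)

  // Prefixes a message with a stage tag such as [LOAD], optionally colored.
  def tag(stage: String, message: String, color: Boolean): String = {
    val label = s"[$stage]"
    if (color) s"$Cyan$label$Reset $message" else s"$label $message"
  }
}
```

Keeping the decision in a pure function (`colorEnabled`) separate from the JVM lookup makes the precedence easy to test without touching system properties.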

# Publishing

The repository includes GitHub Actions workflows for CI and Maven Central publishing. CI runs `sbt test`, and publishing uses `sbt +publishSigned` on release.
