https://github.com/osahp/pyspark_db_utils

An easy-to-use database connector that allows one-command operations between PySpark and PostgreSQL or ClickHouse databases.

Host: GitHub
URL: https://github.com/osahp/pyspark_db_utils
Owner: osahp
License: mit
Created: 2017-10-20T12:10:55.000Z (almost 8 years ago)
Default Branch: master
Last Pushed: 2018-05-24T10:27:49.000Z (over 7 years ago)
Last Synced: 2024-10-01T15:36:15.279Z (about 1 year ago)
Language: Python
Homepage:
Size: 5.87 MB
Stars: 8
Watchers: 3
Forks: 2
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE


# pyspark_db_utils

Utilities for database work in Spark: they move data between PySpark and PostgreSQL or ClickHouse.

## Documentation

http://pyspark-db-utils.readthedocs.io/en/latest/

## Usage example

This library requires JDBC drivers. Download them from:

https://jdbc.postgresql.org/download.html

https://github.com/yandex/clickhouse-jdbc

and put them in the `jars/` directory of your project.

### Example settings:

```
settings = {
  "PG_PROPERTIES": {
    "user": "user",
    "password": "pass",
    "driver": "org.postgresql.Driver"
  },
  "PG_DRIVER_PATH": "jars/postgresql-42.1.4.jar",
  "PG_URL": "jdbc:postgresql://db.olabs.com/dbname",
}
```
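
To illustrate how settings of this shape typically reach Spark, here is a minimal sketch built on PySpark's own JDBC reader and writer. It is not this library's API: `build_session`, `write_table`, and `read_table` are hypothetical helper names introduced only for this example, and the credentials and URL are the placeholders from the settings above.

```python
# Settings in the same shape as the example above (placeholder values).
settings = {
    "PG_PROPERTIES": {
        "user": "user",
        "password": "pass",
        "driver": "org.postgresql.Driver",
    },
    "PG_DRIVER_PATH": "jars/postgresql-42.1.4.jar",
    "PG_URL": "jdbc:postgresql://db.olabs.com/dbname",
}


def build_session(settings, app_name="pg_example"):
    """Hypothetical helper: start Spark with the PostgreSQL driver jar attached."""
    # Imported lazily so the other helpers can be used without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.jars", settings["PG_DRIVER_PATH"])
        .getOrCreate()
    )


def write_table(df, table, settings, mode="append"):
    """Hypothetical helper: write a DataFrame to PostgreSQL via DataFrameWriter.jdbc."""
    df.write.jdbc(
        url=settings["PG_URL"],
        table=table,
        mode=mode,
        properties=settings["PG_PROPERTIES"],
    )


def read_table(spark, table, settings):
    """Hypothetical helper: read a PostgreSQL table via DataFrameReader.jdbc."""
    return spark.read.jdbc(
        url=settings["PG_URL"],
        table=table,
        properties=settings["PG_PROPERTIES"],
    )
```

With a reachable database, `read_table(build_session(settings), "some_table", settings)` would return a DataFrame, where `some_table` is a placeholder table name. The `spark.jars` option only takes effect if it is set before the session is first created.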

### Example code

See `example.py`.

### Example run

```
vsmelov@vsmelov:~/PycharmProjects/pyspark_db_utils$ mkdir jars
vsmelov@vsmelov:~/PycharmProjects/pyspark_db_utils$ cp /var/bigdata/spark-2.2.0-bin-hadoop2.7/jars/postgresql-42.1.4.jar ./jars/
vsmelov@vsmelov:~/PycharmProjects/pyspark_db_utils$ python3 pyspark_db_utils/example.py
host: ***SECRET***
db: ***SECRET***
user: ***SECRET***
password: ***SECRET***
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/03/05 11:43:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/03/05 11:43:29 WARN Utils: Your hostname, vsmelov resolves to a loopback address: 127.0.1.1; using 192.168.43.26 instead (on interface wlp2s0)
18/03/05 11:43:29 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
TRY: create df
OK: create df
+---+-----------+
| id|    mono_id|
+---+-----------+
|  1|          0|
|  2|          1|
|  3|          2|
|  4|          3|
|  5| 8589934592|
|  6| 8589934593|
|  7| 8589934594|
|  8| 8589934595|
|  9| 8589934596|
| 10|17179869184|
| 11|17179869185|
| 12|17179869186|
| 13|17179869187|
| 14|17179869188|
| 15|25769803776|
| 16|25769803777|
| 17|25769803778|
| 18|25769803779|
| 19|25769803780|
+---+-----------+
TRY: write_to_pg
OK: write_to_pg
TRY: read_from_pg
OK: read_from_pg
+---+-----------+
| id|    mono_id|
+---+-----------+
| 10|17179869184|
| 11|17179869185|
| 12|17179869186|
| 13|17179869187|
| 14|17179869188|
|  1|          0|
|  2|          1|
|  3|          2|
|  4|          3|
|  5| 8589934592|
|  6| 8589934593|
|  7| 8589934594|
|  8| 8589934595|
|  9| 8589934596|
| 15|25769803776|
| 16|25769803777|
| 17|25769803778|
| 18|25769803779|
| 19|25769803780|
|  1|          0|
+---+-----------+
only showing top 20 rows
```
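
The `mono_id` values in the output jump in steps of 2^33, which matches the layout Spark documents for `monotonically_increasing_id()`: the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. That `example.py` uses this function is an assumption based on the numbers alone. The sketch below decodes a few values from the run; `split_mono_id` is a hypothetical helper written for this illustration.

```python
# Spark's monotonically_increasing_id() reserves the lower 33 bits
# for the record number within a partition.
RECORD_BITS = 33
RECORD_MASK = (1 << RECORD_BITS) - 1


def split_mono_id(mono_id):
    """Hypothetical helper: return (partition_id, record_number) for an ID."""
    return mono_id >> RECORD_BITS, mono_id & RECORD_MASK


# A few mono_id values copied from the example run above.
for value in (3, 8589934592, 17179869188, 25769803780):
    partition_id, record_number = split_mono_id(value)
    print(value, "-> partition", partition_id, "record", record_number)
```

Under this reading, the 19 rows were spread across four partitions (0 to 3), which would also explain why the rows come back from PostgreSQL grouped by partition rather than sorted by `id`.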
