https://github.com/apache/doris-spark-connector
Spark Connector for Apache Doris
- Host: GitHub
- URL: https://github.com/apache/doris-spark-connector
- Owner: apache
- License: apache-2.0
- Created: 2022-02-10T03:40:38.000Z (about 3 years ago)
- Default Branch: master
- Last Pushed: 2025-04-01T06:24:07.000Z (22 days ago)
- Last Synced: 2025-04-02T02:13:49.380Z (21 days ago)
- Topics: apache, connector, data-warehousing, dbms, doris, mpp, olap, spark
- Language: Java
- Homepage: https://doris.apache.org/
- Size: 957 KB
- Stars: 89
- Watchers: 33
- Forks: 99
- Open Issues: 36
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE-dependencies.txt
- Code of conduct: CODE_OF_CONDUCT.md
README
# Spark Connector for Apache Doris
[License: Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html)
[Join the Slack](https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-11jb8gesh-7IukzSrdea6mqoG0HB4gZg)

### Spark Doris Connector
For more information about compilation and usage, see [Spark Doris Connector](https://doris.apache.org/docs/ecosystem/spark-doris-connector).
## License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
## How to Build
Before building, copy customer_env.sh.tpl to customer_env.sh and configure it for your environment.
```shell
git clone git@github.com:apache/doris-spark-connector.git
cd doris-spark-connector/spark-doris-connector
./build.sh
```

### QuickStart
1. Download and compile the Spark Doris Connector from https://github.com/apache/doris-spark-connector. We suggest compiling it with the official Doris build image.
```bash
$ docker pull apache/doris:build-env-ldb-toolchain-latest
```

2. The compiled jar has a name like spark-doris-connector-3.1_2.12-1.0.0-SNAPSHOT.jar.
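The jar name above appears to encode the Spark version, the Scala version, and the connector version. The following is a minimal sketch that splits a file name into those parts and checks it against a Spark release. The pattern `spark-doris-connector-<spark>_<scala>-<connector>.jar` is an assumption inferred from the single example name, and `parse_connector_jar` and `is_built_for` are hypothetical helpers, not part of the connector.

```python
import re
from typing import NamedTuple


class ConnectorJar(NamedTuple):
    spark_version: str
    scala_version: str
    connector_version: str


# Assumed layout: spark-doris-connector-<spark>_<scala>-<connector>.jar
_JAR_PATTERN = re.compile(
    r"spark-doris-connector-(?P<spark>\d+(?:\.\d+)*)_(?P<scala>\d+\.\d+)-(?P<connector>.+)\.jar"
)


def parse_connector_jar(file_name: str) -> ConnectorJar:
    """Split a connector jar file name into its version components."""
    match = _JAR_PATTERN.fullmatch(file_name)
    if match is None:
        raise ValueError(f"unexpected connector jar name: {file_name!r}")
    return ConnectorJar(match["spark"], match["scala"], match["connector"])


def is_built_for(jar: ConnectorJar, spark_release: str) -> bool:
    """Return True if the jar's Spark version is a prefix of the release (3.1 vs 3.1.2)."""
    wanted = jar.spark_version.split(".")
    return spark_release.split(".")[: len(wanted)] == wanted


jar = parse_connector_jar("spark-doris-connector-3.1_2.12-1.0.0-SNAPSHOT.jar")
print(jar.spark_version, jar.scala_version, jar.connector_version)
print(is_built_for(jar, "3.1.2"))
```

A check like this can catch a mismatch between the connector build and the Spark distribution before the jar is copied into the Spark installation.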
3. Download Spark from https://spark.apache.org/downloads.html. In China, the Tencent mirror at https://mirrors.cloud.tencent.com/apache/spark/spark-3.1.2/ is a good alternative.
```bash
#download
wget https://mirrors.cloud.tencent.com/apache/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
#decompression
tar -xzvf spark-3.1.2-bin-hadoop3.2.tgz
```

4. Configure the Spark environment.
```shell
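# Open /etc/profile and append the two export lines below
vim /etc/profile
export SPARK_HOME=/your_path/spark-3.1.2-bin-hadoop3.2
export PATH=$PATH:$SPARK_HOME/bin
# Reload the profile so the variables take effect
source /etc/profile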
```

5. Copy spark-doris-connector-3.1_2.12-1.0.0-SNAPSHOT.jar to the Spark jars directory.
```shell
cp /your_path/spark-doris-connector/target/spark-doris-connector-3.1_2.12-1.0.0-SNAPSHOT.jar $SPARK_HOME/jars
```

6. Create the Doris database and table.
```sql
create database mongo_doris;
use mongo_doris;
CREATE TABLE data_sync_test_simple
(
_id VARCHAR(32) DEFAULT '',
id VARCHAR(32) DEFAULT '',
user_name VARCHAR(32) DEFAULT '',
member_list VARCHAR(32) DEFAULT ''
)
DUPLICATE KEY(_id)
DISTRIBUTED BY HASH(_id) BUCKETS 10
PROPERTIES("replication_num" = "1");
INSERT INTO data_sync_test_simple VALUES ('1','1','alex','123');
```

7. Enter this code in spark-shell.

```scala
import org.apache.doris.spark._
val dorisSparkRDD = sc.dorisRDD(
  // Use the table created in step 6
  tableIdentifier = Some("mongo_doris.data_sync_test_simple"),
  cfg = Some(Map(
    "doris.fenodes" -> "127.0.0.1:8030",
    "doris.request.auth.user" -> "root",
    "doris.request.auth.password" -> ""
  ))
)
dorisSparkRDD.collect()
```

- mongo_doris: Doris database name.
- data_sync_test_simple: Doris table name.
- doris.fenodes: Doris FE address, in the form IP:http_port.
- doris.request.auth.user: Doris user name.
- doris.request.auth.password: Doris password.

8. If Spark runs in cluster mode, upload the jar to HDFS and add the HDFS URL of the doris-spark-connector jar to spark.yarn.jars.
```bash
spark.yarn.jars=hdfs:///spark-jars/doris-spark-connector-3.1.2-2.12-1.0.0.jar
```

Link: https://github.com/apache/doris/discussions/9486
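The connection settings described under step 7 can be assembled and sanity-checked before they are handed to the connector. The sketch below is illustrative only: `check_fenodes` and `doris_cfg` are hypothetical helpers, the option keys are the ones shown in the spark-shell example, and accepting a comma-separated list of FE addresses is an assumption.

```python
def check_fenodes(fenodes: str) -> list:
    """Validate that each FE address looks like host:http_port and return the list."""
    nodes = [node.strip() for node in fenodes.split(",")]  # comma-separated list is an assumption
    for node in nodes:
        host, sep, port = node.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:http_port, got {node!r}")
    return nodes


def doris_cfg(fenodes: str, user: str = "root", password: str = "") -> dict:
    """Build the configuration map used in the spark-shell example."""
    check_fenodes(fenodes)
    return {
        "doris.fenodes": fenodes,
        "doris.request.auth.user": user,
        "doris.request.auth.password": password,
    }


print(doris_cfg("127.0.0.1:8030"))
```

Failing early on a malformed address gives a clearer error than a connection failure raised from inside a Spark job.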
9. In PySpark, enter this code in the pyspark shell.
```python
# Parentheses allow the method chain to span several lines
dorisSparkDF = (
    spark.read.format("doris")
    # Use the table created in step 6
    .option("doris.table.identifier", "mongo_doris.data_sync_test_simple")
    .option("doris.fenodes", "127.0.0.1:8030")
    .option("user", "root")
    .option("password", "")
    .load()
)
# Show the first 5 rows
dorisSparkDF.show(5)
```

## Type conversion for writing to Doris using Arrow
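As a rough illustration of how the mapping in this section's table can be applied, the sketch below looks up the Spark type name for a Doris scalar column type. `spark_type_for` is a hypothetical helper, the dictionary simply copies the scalar rows of the table, and nested types (ARRAY, MAP, STRUCT) are deliberately left out.

```python
import re

# Scalar rows copied from the table in this section.
DORIS_TO_SPARK = {
    "BOOLEAN": "BooleanType",
    "TINYINT": "ByteType",
    "SMALLINT": "ShortType",
    "INT": "IntegerType",
    "BIGINT": "LongType",
    "LARGEINT": "StringType",
    "FLOAT": "FloatType",
    "DOUBLE": "DoubleType",
    "DATE": "DateType",
    "DATETIME": "TimestampType",
    "CHAR": "StringType",
    "VARCHAR": "StringType",
    "STRING": "StringType",
}


def spark_type_for(doris_type: str) -> str:
    """Return the Spark type name for a Doris scalar type such as VARCHAR(32)."""
    match = re.fullmatch(r"\s*([A-Za-z]+)\s*(?:\(([^)]*)\))?\s*", doris_type)
    if match is None:
        raise ValueError(f"unsupported Doris type: {doris_type!r}")
    base, args = match.group(1).upper(), match.group(2)
    if base == "DECIMAL" and args:
        # DECIMAL(M,D) keeps its precision and scale.
        precision_scale = ",".join(part.strip() for part in args.split(","))
        return f"DecimalType({precision_scale})"
    if base not in DORIS_TO_SPARK:
        raise ValueError(f"unsupported Doris type: {doris_type!r}")
    return DORIS_TO_SPARK[base]


for column_type in ["VARCHAR(32)", "DECIMAL(10, 2)", "DATETIME", "LARGEINT"]:
    print(column_type, "->", spark_type_for(column_type))
```

Note that length arguments on CHAR and VARCHAR are dropped, since both map to StringType.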
| Doris type | Spark type |
|---|---|
| BOOLEAN | BooleanType |
| TINYINT | ByteType |
| SMALLINT | ShortType |
| INT | IntegerType |
| BIGINT | LongType |
| LARGEINT | StringType |
| FLOAT | FloatType |
| DOUBLE | DoubleType |
| DECIMAL(M,D) | DecimalType(M,D) |
| DATE | DateType |
| DATETIME | TimestampType |
| CHAR(L) | StringType |
| VARCHAR(L) | StringType |
| STRING | StringType |
| ARRAY | ArrayType |
| MAP | MapType |
| STRUCT | StructType |

## Report issues or submit pull request
If you find any bugs, feel free to file a [GitHub issue](https://github.com/apache/doris/issues) or fix them by submitting a [pull request](https://github.com/apache/doris/pulls).
## Contact Us
Contact us through the following mailing list.
| Name | Scope | Subscribe | Unsubscribe | Archives |
|:-----|:------|:----------|:------------|:---------|
| [dev@doris.apache.org](mailto:dev@doris.apache.org) | Development-related discussions | [Subscribe](mailto:dev-subscribe@doris.apache.org) | [Unsubscribe](mailto:dev-unsubscribe@doris.apache.org) | [Archives](https://mail-archives.apache.org/mod_mbox/doris-dev/) |

## Links
* Doris official site - <https://doris.apache.org>
* Developer mailing list - <dev@doris.apache.org>. Mail to <dev-subscribe@doris.apache.org> and follow the reply to subscribe to the mailing list.
* Slack channel - [Join the Slack](https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-11jb8gesh-7IukzSrdea6mqoG0HB4gZg)