Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/spotify/scio
A Scala API for Apache Beam and Google Cloud Dataflow.
https://github.com/spotify/scio
batch beam bigquery data dataflow google-cloud ml scala scio streaming
Last synced: 5 days ago
JSON representation
A Scala API for Apache Beam and Google Cloud Dataflow.
- Host: GitHub
- URL: https://github.com/spotify/scio
- Owner: spotify
- License: apache-2.0
- Created: 2015-03-26T19:07:34.000Z (almost 10 years ago)
- Default Branch: main
- Last Pushed: 2025-01-14T19:35:15.000Z (11 days ago)
- Last Synced: 2025-01-16T04:11:09.203Z (10 days ago)
- Topics: batch, beam, bigquery, data, dataflow, google-cloud, ml, scala, scio, streaming
- Language: Scala
- Homepage: https://spotify.github.io/scio
- Size: 69.7 MB
- Stars: 2,572
- Watchers: 115
- Forks: 513
- Open Issues: 143
-
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- awesome-beam - Scio - Scala wrapper for Apache Beam
- awesome-google-cloud - Scio - A Scala API for Google Cloud Dataflow and Apache Beam. (Big Data / Apache Beam & Dataflow)
- awesome-scala - **scio** - activity/y/spotify/scio) (Table of Contents / Big Data)
- awesome-twitter-algo - Scio
README
# Scio
[![Continuous Integration](https://github.com/spotify/scio/actions/workflows/ci.yml/badge.svg)](https://github.com/spotify/scio/actions/workflows/ci.yml)
[![codecov.io](https://codecov.io/github/spotify/scio/coverage.svg?branch=master)](https://codecov.io/github/spotify/scio?branch=master)
[![GitHub license](https://img.shields.io/github/license/spotify/scio.svg)](./LICENSE)
[![Maven Central](https://img.shields.io/maven-central/v/com.spotify/scio-core_2.12.svg)](https://maven-badges.herokuapp.com/maven-central/com.spotify/scio-core_2.12)
[![Scaladoc](https://img.shields.io/badge/scaladoc-latest-blue.svg)](https://spotify.github.io/scio/api/com/spotify/scio/index.html)
[![Scala Steward badge](https://img.shields.io/badge/Scala_Steward-helping-brightgreen.svg?style=flat&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAA4AAAAQCAMAAAARSr4IAAAAVFBMVEUAAACHjojlOy5NWlrKzcYRKjGFjIbp293YycuLa3pYY2LSqql4f3pCUFTgSjNodYRmcXUsPD/NTTbjRS+2jomhgnzNc223cGvZS0HaSD0XLjbaSjElhIr+AAAAAXRSTlMAQObYZgAAAHlJREFUCNdNyosOwyAIhWHAQS1Vt7a77/3fcxxdmv0xwmckutAR1nkm4ggbyEcg/wWmlGLDAA3oL50xi6fk5ffZ3E2E3QfZDCcCN2YtbEWZt+Drc6u6rlqv7Uk0LdKqqr5rk2UCRXOk0vmQKGfc94nOJyQjouF9H/wCc9gECEYfONoAAAAASUVORK5CYII=)](https://scala-steward.org)> Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o]
> Verb: I can, know, understand, have knowledge.Scio is a Scala API for [Apache Beam](http://beam.incubator.apache.org/) and [Google Cloud Dataflow](https://github.com/GoogleCloudPlatform/DataflowJavaSDK) inspired by [Apache Spark](http://spark.apache.org/) and [Scalding](https://github.com/twitter/scalding).
# Features
- Scala API close to that of Spark and Scalding core APIs
- Unified batch and streaming programming model
- Fully managed service\*
- Integration with Google Cloud products: Cloud Storage, BigQuery, Pub/Sub, Datastore, Bigtable
- Avro, Cassandra, Elasticsearch, gRPC, JDBC, neo4j, Parquet, Redis, [TensorFlow](http://tensorflow.org/) IOs
- Interactive mode with Scio REPL
- Type safe BigQuery
- Integration with [Algebird](https://github.com/twitter/algebird) and [Breeze](https://github.com/scalanlp/breeze)
- Pipeline orchestration with [Scala Futures](http://docs.scala-lang.org/overviews/core/futures.html)
- Distributed cache\* provided by Google Cloud Dataflow
# Quick Start
Download and install the [Java Development Kit (JDK)](https://adoptopenjdk.net/index.html) version 8.
Install [sbt](https://www.scala-sbt.org/1.x/docs/Setup.html).
Use our [giter8 template](https://github.com/spotify/scio.g8) to quickly create a new Scio job repository:
`sbt new spotify/scio.g8`
Switch to the new repo (default `scio-job`) and build it:
```
cd scio-job
sbt stage
```Run the included word count example:
`target/universal/stage/bin/scio-job --output=wc`
List result files and inspect content:
```
ls -l wc
cat wc/part-00000-of-00004.txt
```# Documentation
[Getting Started](https://spotify.github.io/scio/Getting-Started.html) is the best place to start with Scio. If you are new to Apache Beam and distributed data processing, check out the [Beam Programming Guide](https://beam.apache.org/documentation/programming-guide/) first for a detailed explanation of the Beam programming model and concepts. If you have experience with other Scala data processing libraries, check out this comparison between [Scio, Scalding and Spark](https://spotify.github.io/scio/Scio,-Scalding-and-Spark.html).
Example Scio pipelines and tests can be found under [scio-examples](https://github.com/spotify/scio/tree/master/scio-examples/src). A lot of them are direct ports from Beam's Java [examples](https://github.com/apache/beam/tree/master/examples). See this [page](http://spotify.github.io/scio/examples/) for some of them with side-by-side explanation. Also see [Big Data Rosetta Code](https://github.com/spotify/big-data-rosetta-code) for common data processing code snippets in Scio, Scalding and Spark.
- [Scio Docs](https://spotify.github.io/scio/) - main documentation site
- [Scio Scaladocs](http://spotify.github.io/scio/api/) - current API documentation
- [Scio Examples](http://spotify.github.io/scio/examples/) - examples with side-by-side explanation# Artifacts
Scio includes the following artifacts:
- `scio-avro`: add-on for Avro, can also be used standalone
- `scio-cassandra*`: add-ons for Cassandra
- `scio-core`: core library
- `scio-elasticsearch*`: add-ons for Elasticsearch
- `scio-extra`: extra utilities for working with collections, Breeze, etc., best effort support
- `scio-google-cloud-platform`: add-on for Google Cloud IO's: BigQuery, Bigtable, Pub/Sub, Datastore, Spanner
- `scio-grpc`: add-on for gRPC service calls
- `scio-jdbc`: add-on for JDBC IO
- `scio-neo4j`: add-on for Neo4J IO
- `scio-parquet`: add-on for Parquet
- `scio-redis`: add-on for Redis
- `scio-repl`: extension of the Scala REPL with Scio specific operations
- `scio-smb`: add-on for Sort Merge Bucket operations
- `scio-snowflake`: add-on for Snowflake IO
- `scio-tensorflow`: add-on for TensorFlow TFRecords IO and prediction
- `scio-test`: all following test utilities. Add to your project as a "test" dependency
- `scio-test-core`: test core utilities
- `scio-test-google-cloud-platform`: test utilities for Google Cloud IO's
- `scio-test-parquet`: test utilities for Parquet# License
Copyright 2024 Spotify AB.
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0