https://github.com/spotify/scio

A Scala API for Apache Beam and Google Cloud Dataflow.
https://github.com/spotify/scio

batch beam bigquery data dataflow google-cloud ml scala scio streaming

Last synced: 8 months ago
JSON representation

A Scala API for Apache Beam and Google Cloud Dataflow.

Host: GitHub
URL: https://github.com/spotify/scio
Owner: spotify
License: apache-2.0
Created: 2015-03-26T19:07:34.000Z (almost 11 years ago)
Default Branch: main
Last Pushed: 2025-04-14T17:29:43.000Z (9 months ago)
Last Synced: 2025-04-15T09:49:45.526Z (8 months ago)
Topics: batch, beam, bigquery, data, dataflow, google-cloud, ml, scala, scio, streaming
Language: Scala
Homepage: https://spotify.github.io/scio
Size: 77 MB
Stars: 2,589
Watchers: 113
Forks: 516
Open Issues: 137
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

Awesome Lists containing this project

awesome-scala - **scio** - activity/y/spotify/scio) (Table of Contents / Big Data)
awesome-twitter-algo - Scio
awesome-beam - Scio - Scala wrapper for Apache Beam
fucking-awesome-scala - **scio** - activity/y/spotify/scio) (Table of Contents / Big Data)
awesome-scala - Scio - A Scala API for [Apache Beam](https://beam.apache.org/) and [Google Cloud Dataflow](https://cloud.google.com/dataflow/) - None (Big Data)
awesome-java - Scio

README

          # Scio

[![Continuous Integration](https://github.com/spotify/scio/actions/workflows/ci.yml/badge.svg)](https://github.com/spotify/scio/actions/workflows/ci.yml)

[![codecov.io](https://codecov.io/github/spotify/scio/coverage.svg?branch=master)](https://codecov.io/github/spotify/scio?branch=master)

[![GitHub license](https://img.shields.io/github/license/spotify/scio.svg)](./LICENSE)

[![Maven Central](https://img.shields.io/maven-central/v/com.spotify/scio-core_2.12.svg)](https://maven-badges.herokuapp.com/maven-central/com.spotify/scio-core_2.12)

[![Scaladoc](https://img.shields.io/badge/scaladoc-latest-blue.svg)](https://spotify.github.io/scio/api/com/spotify/scio/index.html)

[![Scala Steward badge](https://img.shields.io/badge/Scala_Steward-helping-brightgreen.svg?style=flat&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAA4AAAAQCAMAAAARSr4IAAAAVFBMVEUAAACHjojlOy5NWlrKzcYRKjGFjIbp293YycuLa3pYY2LSqql4f3pCUFTgSjNodYRmcXUsPD/NTTbjRS+2jomhgnzNc223cGvZS0HaSD0XLjbaSjElhIr+AAAAAXRSTlMAQObYZgAAAHlJREFUCNdNyosOwyAIhWHAQS1Vt7a77/3fcxxdmv0xwmckutAR1nkm4ggbyEcg/wWmlGLDAA3oL50xi6fk5ffZ3E2E3QfZDCcCN2YtbEWZt+Drc6u6rlqv7Uk0LdKqqr5rk2UCRXOk0vmQKGfc94nOJyQjouF9H/wCc9gECEYfONoAAAAASUVORK5CYII=)](https://scala-steward.org)



> Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o]

> Verb: I can, know, understand, have knowledge.

Scio is a Scala API for [Apache Beam](http://beam.incubator.apache.org/) and [Google Cloud Dataflow](https://github.com/GoogleCloudPlatform/DataflowJavaSDK) inspired by [Apache Spark](http://spark.apache.org/) and [Scalding](https://github.com/twitter/scalding).

# Features

- Scala API close to that of Spark and Scalding core APIs

- Unified batch and streaming programming model

- Fully managed service^\*

- Integration with Google Cloud products: Cloud Storage, BigQuery, Pub/Sub, Datastore, Bigtable

- Avro, Cassandra, Elasticsearch, gRPC, JDBC, neo4j, Parquet, Redis, [TensorFlow](http://tensorflow.org/) IOs

- Interactive mode with Scio REPL

- Type safe BigQuery

- Integration with [Algebird](https://github.com/twitter/algebird) and [Breeze](https://github.com/scalanlp/breeze)

- Pipeline orchestration with [Scala Futures](http://docs.scala-lang.org/overviews/core/futures.html)

- Distributed cache

^\* provided by Google Cloud Dataflow

# Quick Start

Download and install the [Java Development Kit (JDK)](https://adoptopenjdk.net/index.html) version 8.

Install [sbt](https://www.scala-sbt.org/1.x/docs/Setup.html).

Use our [giter8 template](https://github.com/spotify/scio.g8) to quickly create a new Scio job repository:

`sbt new spotify/scio.g8`

Switch to the new repo (default `scio-job`) and build it:

```

cd scio-job

sbt stage

```

Run the included word count example:

`target/universal/stage/bin/scio-job --output=wc`

List result files and inspect content:

```

ls -l wc

cat wc/part-00000-of-00004.txt

```

# Documentation

[Getting Started](https://spotify.github.io/scio/Getting-Started.html) is the best place to start with Scio. If you are new to Apache Beam and distributed data processing, check out the [Beam Programming Guide](https://beam.apache.org/documentation/programming-guide/) first for a detailed explanation of the Beam programming model and concepts. If you have experience with other Scala data processing libraries, check out this comparison between [Scio, Scalding and Spark](https://spotify.github.io/scio/Scio,-Scalding-and-Spark.html).

Example Scio pipelines and tests can be found under [scio-examples](https://github.com/spotify/scio/tree/master/scio-examples/src). A lot of them are direct ports from Beam's Java [examples](https://github.com/apache/beam/tree/master/examples). See this [page](http://spotify.github.io/scio/examples/) for some of them with side-by-side explanation. Also see [Big Data Rosetta Code](https://github.com/spotify/big-data-rosetta-code) for common data processing code snippets in Scio, Scalding and Spark.

- [Scio Docs](https://spotify.github.io/scio/) - main documentation site

- [Scio Scaladocs](http://spotify.github.io/scio/api/) - current API documentation

- [Scio Examples](http://spotify.github.io/scio/examples/) - examples with side-by-side explanation

# Artifacts

Scio includes the following artifacts:

- `scio-avro`: add-on for Avro, can also be used standalone

- `scio-cassandra*`: add-ons for Cassandra

- `scio-core`: core library

- `scio-elasticsearch*`: add-ons for Elasticsearch

- `scio-extra`: extra utilities for working with collections, Breeze, etc., best effort support

- `scio-google-cloud-platform`: add-on for Google Cloud IO's: BigQuery, Bigtable, Pub/Sub, Datastore, Spanner

- `scio-grpc`: add-on for gRPC service calls

- `scio-jdbc`: add-on for JDBC IO

- `scio-neo4j`: add-on for Neo4J IO

- `scio-parquet`: add-on for Parquet

- `scio-redis`: add-on for Redis

- `scio-repl`: extension of the Scala REPL with Scio specific operations

- `scio-smb`: add-on for Sort Merge Bucket operations

- `scio-snowflake`: add-on for Snowflake IO

- `scio-tensorflow`: add-on for TensorFlow TFRecords IO and prediction

- `scio-test`: all following test utilities. Add to your project as a "test" dependency

  - `scio-test-core`: test core utilities

  - `scio-test-google-cloud-platform`: test utilities for Google Cloud IO's

  - `scio-test-parquet`: test utilities for Parquet

# License

Copyright 2024 Spotify AB.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/spotify/scio

Awesome Lists containing this project

README