{"id":24175853,"url":"https://github.com/iht/scio-quickstart","last_synced_at":"2026-06-07T04:31:42.474Z","repository":{"id":74375669,"uuid":"515419359","full_name":"iht/scio-quickstart","owner":"iht","description":"This repository contains a sample pipeline for starting with Scio, the Scala framework to develop Apache Beam pipelines.  Fork this repository so you can commit your changes in your own repository.","archived":false,"fork":false,"pushed_at":"2022-07-20T16:21:39.000Z","size":842,"stargazers_count":0,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-13T02:37:06.513Z","etag":null,"topics":["scala","scio"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/iht.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-07-19T03:20:47.000Z","updated_at":"2024-04-26T08:19:09.000Z","dependencies_parsed_at":null,"dependency_job_id":"683e253c-78ee-4f6b-8ff5-f879206e530e","html_url":"https://github.com/iht/scio-quickstart","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iht%2Fscio-quickstart","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iht%2Fscio-quickstart/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iht%2Fscio-quickstart/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iht%2Fscio-quickstart/m
anifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/iht","download_url":"https://codeload.github.com/iht/scio-quickstart/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241523942,"owners_count":19976424,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["scala","scio"],"created_at":"2025-01-13T02:33:25.639Z","updated_at":"2026-06-07T04:31:41.798Z","avatar_url":"https://github.com/iht.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Scio quickstart\n\nThis repository contains a sample pipeline for starting with [Scio](https://spotify.github.io/scio/), the Scala\nframework to develop Apache Beam pipelines.\n\nFork this repository so you can commit your changes in your own repository.\n\n# Pipeline\n\nThe goal of this example is to count the words in Don Quixote, the famous novel by Miguel de Cervantes. The novel has\nseveral characters: Sancho, the buddy of Don Quixote; Dulcinea, the significant other of Don Quixote; Rocinante, the\nfearful horse of Don Quixote, etc.\n\nThe pipeline does not only count the words, it also sorts the words by number of occurrences, and provides an answer\nto an existential question: who is mentioned more in the novel, Sancho or Dulcinea?\n\nLet's find out with the help of Scio.\n\n## Compile\n\nThe first step to solve the mysterious question is to compile the code. 
For that, you need to have SBT installed:\n* https://www.scala-sbt.org/\n\nOnce SBT is installed, you can run\n\n* `sbt compile` to compile the code (for instance, while you are developing the code for the pipeline)\n* `sbt stage` to produce a runnable package\n\n## Input data\n\nIn the `data` directory you will find two files:\n\n* `sample.txt`, a small extract of the novel. You can use this for tests while you are developing the pipeline\n* `el_quijote.txt`, the full novel, to solve the important question about Sancho or Dulcinea\n\n## Running the example\n\nOnce you have run `sbt stage`, there will be a script in the directory `target/universal/stage/bin`. You can use that\nscript to run the pipeline.\n\nFor instance, to find the top 10 words in the sample data:\n\n`./target/universal/stage/bin/scio-quickstart --input-file=./data/sample.txt --output-file=tmp --num-words=10`\n\nAfter that you should find a file with a name like `part-00000-of-00001.txt` in the `tmp` subdirectory.\n\nTo run with the full data and top 100 words:\n\n`./target/universal/stage/bin/scio-quickstart --input-file=./data/el_quijote.txt --output-file=tmp --num-words=100`\n\nSearch for `sancho` and `dulcinea` in the output to solve this burning question.\n\n# Development\n\nThe pipeline is initially empty. Your task, should you accept it, is to create the pipeline that is required to solve\nthe Sancho vs. Dulcinea question.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiht%2Fscio-quickstart","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fiht%2Fscio-quickstart","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiht%2Fscio-quickstart/lists"}