{"id":18802238,"url":"https://github.com/oracle-quickstart/oci-streamsets","last_synced_at":"2025-04-13T18:31:23.252Z","repository":{"id":106380792,"uuid":"151475798","full_name":"oracle-quickstart/oci-streamsets","owner":"oracle-quickstart","description":"Terraform module to deploy StreamSets on Oracle Cloud Infrastructure (OCI)","archived":true,"fork":false,"pushed_at":"2020-01-22T01:22:58.000Z","size":1039,"stargazers_count":0,"open_issues_count":2,"forks_count":2,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-02-19T21:12:46.783Z","etag":null,"topics":["cloud","oci","oracle","partner-led","streamsets","terraform"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oracle-quickstart.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-10-03T20:25:19.000Z","updated_at":"2024-05-09T11:42:52.000Z","dependencies_parsed_at":null,"dependency_job_id":"17b847f0-06f9-488e-8d4b-bac3e9ca3238","html_url":"https://github.com/oracle-quickstart/oci-streamsets","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oracle-quickstart%2Foci-streamsets","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oracle-quickstart%2Foci-streamsets/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oracle-quickstart%2Foci-streamsets/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/oracle-quickstart%2Foci-streamsets/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oracle-quickstart","download_url":"https://codeload.github.com/oracle-quickstart/oci-streamsets/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248760335,"owners_count":21157341,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cloud","oci","oracle","partner-led","streamsets","terraform"],"created_at":"2024-11-07T22:27:07.408Z","updated_at":"2025-04-13T18:31:23.239Z","avatar_url":"https://github.com/oracle-quickstart.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# oci-streamsets\nThese are Terraform modules that deploy [StreamSets](https://streamsets.com/) on [Oracle Cloud Infrastructure (OCI)](https://cloud.oracle.com/en_US/cloud-infrastructure).  They are developed jointly by Oracle and StreamSets.\n\n## Getting Started\nWelcome! This folder contains Terraform scripts that set up the StreamSets Data Collector (SDC) to ingest data rapidly and easily. The top level of this directory contains the Terraform files that create a single compute instance running one Data Collector. This is commonly used for learning or developing on the StreamSets Data Operations Platform. However, it can also be used for production-ready data movement and transformation.\n\nThe folder titled \"SDC Standalone with EDH Cluster\" will create a single SDC instance ready for data movement inside a Cloudera Enterprise Data Hub. 
This SDC instance will reside in the same subnet(s) as the worker nodes in the cluster. This is mainly for easy development and for learning how the StreamSets Data Operations Platform extends to Hadoop infrastructure.\n\nThe folder titled \"SDC via CDH Parcel Manager\" will create SDC instances on all the worker nodes in the cluster and enable things like clustered execution of pipelines or REST-based microservices pipelines. This is not yet production-ready and is still in development, so stay tuned!\n\n## Standalone StreamSets Data Collector Architecture\n![](./images/OCI_Arch_StreamSets_SDC_Capture.PNG)\n\n## Prerequisites\nIn addition to an active tenancy on OCI, you will need a functional installation of Terraform and an API key for a privileged user in the tenancy.  See these documentation links for more information:\n\n[Getting Started with Terraform on OCI](https://docs.cloud.oracle.com/iaas/Content/API/SDKDocs/terraformgetstarted.htm)\n\n[How to Generate an API Signing Key](https://docs.cloud.oracle.com/iaas/Content/API/Concepts/apisigningkey.htm#How)\n\nOnce the prerequisites are in place, you will need to copy the templates from this repository to where you have Terraform installed.\n\n## Clone the Terraform template\nNow, you'll want a local copy of this repo.  You can make one with the commands:\n\n    git clone https://github.com/oracle-quickstart/oci-streamsets.git\n    cd oci-streamsets\n    ls\n\n## Update Template Configuration\nUpdate the environment variables in the [env-vars](https://github.com/cloud-partners/oci-streamsets/blob/master/env-vars) config file to specify your OCI account details, such as tenancy_ocid, user_ocid, and compartment_ocid. 
To source this file prior to installation, either reference it in your shell's .rc file or run the following:\n\n        source env-vars\n\n## Deployment \u0026 Post Deployment\n\nDeploy using standard Terraform commands:\n\n        terraform init\n        terraform plan\n        terraform apply\n\n## SSH to SDC Node\nWhen terraform apply is complete, the terminal will display the public IP address of the SDC node.  The default login is opc.  You can SSH into the machine with a command like this:\n\n        ssh -i ~/.ssh/id_rsa opc@${data.oci_core_vnic.datacollector_vnic.public_ip_address}\n\nThe Data Collector web console is available at:\n\n        http://${data.oci_core_vnic.datacollector_vnic.public_ip_address}:18630/\n\nThe default username and password are admin and admin.\n\n## Data Collector Web Console\n![](./images/Pipeline_Screenshot.png)\n![](./images/metrics_Capture.PNG)\n\n## What is the StreamSets Data Collector?\nStreamSets Data Collector is a lightweight, powerful design and execution engine that streams data in real time. SDC is used to route and process data in your data streams from almost any origin to almost any destination.\n\nTo define the flow of data, you design a pipeline in SDC. A pipeline consists of one or more stages that represent the origin(s) and destination of the pipeline, as well as any additional processing that you want to perform. After you design the pipeline, you can preview it to assist with debugging.  When ready to run the pipeline live, you click Start and SDC goes to work.\n\nOnce SDC is running, it processes the data when it arrives at the origin and waits quietly when not needed. You can view real-time statistics about your data, inspect data as it passes through the pipeline, or take a closer look at a snapshot of data.\n\n## How should I use SDC?\nUse SDC like a pipe for a data stream. Throughout your enterprise data topology, you have streams of data that you need to move, collect, and process on the way to their destinations. 
SDC provides the crucial connection between hops in the stream.\n\nTo solve your ingest needs, you can use a single SDC to run one or more pipelines. Or you might install a series of Data Collectors to stream data across your enterprise data topology.\n\n## How does this really work?\nLet's walk through it...\n\nAfter you run the Terraform script for a standalone SDC, you use the Data Collector UI to log in and create your first pipeline.\n\nWhat do you want it to do? Let's say you want to read XML files from a directory and remove the newline characters before moving the data into HDFS. To do this, you start with a Directory origin stage and configure it to point to the source file directory. (You can also have the stage archive processed files and write files that were not fully processed to a separate directory for review.)\n\nTo remove the newline characters, connect Directory to an Expression Evaluator processor and configure it to remove the newline character from the last field in the record.\n\nTo make the data available to HDFS, you connect the Expression Evaluator to a Hadoop FS destination stage. You configure the stage to write the data as a JSON object (though you can use other data formats as well).\n\nYou preview data to see how source data moves through the pipeline and notice that some fields have missing data. So you add a Field Replacer to replace null values in those fields.\n\nNow that the data flow is done, you configure the pipeline error record handling to write error records to a file, you create a data drift alert to let you know when field names change, and you configure an email alert to let you know when the pipeline generates more than 100 error records. Then, you start the pipeline and Data Collector goes to work.\n\nThe Data Collector goes into Monitor mode and displays summary and error statistics immediately. 
To get a closer look at the activity, you take a snapshot of the pipeline so you can examine how a set of data passed through the pipeline. You see some unexpected data in the pipeline, so you create a data rule for a link between two stages to gather information about similar data and set an alert to notify you when the numbers get too high.\n\nAnd what about those error records being written to file? They're saved with error details, so you can create an error pipeline to reprocess that data. Et voila!\n\nStreamSets Data Collector is a powerful tool, but we're making it as simple as possible to use. So give it a try, click the Help icon for information, and contact us if you need a hand. For more use cases or examples to learn on, please visit: https://github.com/streamsets/tutorials\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foracle-quickstart%2Foci-streamsets","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foracle-quickstart%2Foci-streamsets","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foracle-quickstart%2Foci-streamsets/lists"}