{"id":14069481,"url":"https://github.com/openbridge/ob_datastash","last_synced_at":"2025-04-10T11:53:17.510Z","repository":{"id":73272227,"uuid":"101433462","full_name":"openbridge/ob_datastash","owner":"openbridge","description":"Stream your CSV files to an HTTP API","archived":false,"fork":false,"pushed_at":"2018-04-09T01:44:03.000Z","size":7511,"stargazers_count":12,"open_issues_count":0,"forks_count":1,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-03-24T10:47:34.284Z","etag":null,"topics":["aws","bigquery","csv","csv-files","logstash","parquet","redshift"],"latest_commit_sha":null,"homepage":"https://www.openbridge.com","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/openbridge.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-08-25T18:54:41.000Z","updated_at":"2023-07-05T13:14:44.000Z","dependencies_parsed_at":"2023-03-05T21:15:39.676Z","dependency_job_id":null,"html_url":"https://github.com/openbridge/ob_datastash","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openbridge%2Fob_datastash","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openbridge%2Fob_datastash/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openbridge%2Fob_datastash/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openbridge%2Fob_datastash/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/owners/openbridge","download_url":"https://codeload.github.com/openbridge/ob_datastash/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248215193,"owners_count":21066622,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","bigquery","csv","csv-files","logstash","parquet","redshift"],"created_at":"2024-08-13T07:06:59.297Z","updated_at":"2025-04-10T11:53:17.488Z","avatar_url":"https://github.com/openbridge.png","language":"Shell","funding_links":[],"categories":["Shell"],"sub_categories":[],"readme":"# Data Stash - Event API Client\n\nData Stash is a `logstash` service that can ingest data from different data sources, transform it, and then send JSON output via HTTP to the Openbridge Events API. You can also store the output in other formats such as CSV.\n\n![Data Stash](https://raw.githubusercontent.com/openbridge/ob_datastash/master/datastash.png \"How It Works\")\n\n# Why Data Stash?\n\nData Stash can perform some magic by automatically processing, cleaning, encoding and streaming the contents of one or more CSVs directly to our API. Once the data arrives at our API we automatically route it to a destination table in your data warehouse. Since CSV files can be a bit messy we have pre-packaged processing configurations that turn those old files into first class data sources. 
Here are a few of the standard operations we have defined:\n\n- Exclude columns resident in a CSV (e.g., remove/drop the userID, email address and social security columns) from the output\n- Replace non-ASCII characters with an ASCII approximation, or if none exists, a replacement character which defaults to ?\n- Remove extraneous white space from records in target columns\n- Strip backslashes, question marks, equals, hashes, minuses or other characters from the target columns\n- Set a desired data type of a given column and have it transform records to meet that type\n- Set everything to lowercase\n- Proper UTF-8 encoding of the data\n- Mask sensitive data with security \"hashes\" for one or more fields\n- Add new fields, such as IDs or concatenations of other columns, which can replace the contents of a column or store the results in a new field that is appended to the CSV\n\n## Quick Start Sample Config Files\nFor reference, sample configs can be found in the [`/config/pipeline`](config/pipeline) folder of this repo.\n\n- **CSV to API**: CSV files with header rows use [`sample-csv-api-header.conf`](config/pipeline/sample-csv-api-header.conf)\n- **CSV to API**: CSV files without header rows use [`sample-csv-api-noheader.conf`](config/pipeline/sample-csv-api-noheader.conf)\n- **CSV to CSV**: To process one CSV to generate a clean processed CSV use [`sample-csv-csv-noheader.conf`](config/pipeline/sample-csv-csv-noheader.conf)\n- **Multiple CSV Inputs to Multiple CSV Outputs**: To process multiple CSV files to generate multiple clean CSV files use [`sample-multi-csv-csv-noheader.conf`](config/pipeline/sample-multi-csv-csv-noheader.conf)\n\n\n# Install\n\nData Stash is neatly packaged into a Docker image so you can run this on your local laptop or deploy it to a server. 
The first step is to build or pull the image:\n\n```docker\ndocker build -t openbridge/ob_datastash .\n```\n\nor simply pull it from Docker Hub:\n\n```docker\ndocker pull openbridge/ob_datastash:latest\n```\n\nOnce you have your image you are ready to get started!\n\n# Getting Started: How To Stream CSV Files\nData Stash is based on a premise of inputs, filters and outputs:\n\n- **Inputs**: Your data sources. Primarily this will be a CSV file, but it can be many others.\n- **Filters**: This is where your data is pre-processed prior to delivery to an output location\n- **Outputs**: There are a few output options but the principal one is the Openbridge Webhook API\n\nData Stash can take a CSV file and break each row into a streamed JSON \"event\". These JSON events are delivered to an Openbridge API for import into your target warehouse.\n\nThere are a couple of CSV file use cases:\n\n- **Static Files**: You have exports from a system that you want to load to your data warehouse. Data Stash will process the exported source file and stream the content of the file until it reaches the end.\n- **Dynamic Files**: You have a file that continually has new rows added. Data Stash will process changing files and stream new events as they are appended to a file.\n\nFor our example walk-through we use a static CSV file called `sales.csv`.\n\n## `sales.csv` Needs A Data Stash Configuration File\n\nTo run Data Stash for `sales.csv` you need to define a config file. Each config file is composed of three parts: input, filter and output. A config file describes how Data Stash should process your `sales.csv` file.\n\n### Step 1: Define Your Input\n\nLet's dig into your example `sales.csv`. The principal part of the input is setting the `path =\u003e` to your file(s). You will need to specify the path to the file you want to process like this `path =\u003e \"/the/path/to/your/sales.csv\"`. 
We are going to assume this is located in a folder on your laptop here: `/Users/bob/csv/mysalesdata`.\n\nHowever, Data Stash has its own location where it references your data. It will use its own default directory called `/data` to reference your files. What does this mean? In the Data Stash config you will use `/data` in the file path as a default. When you run Data Stash you will tell it to map your laptop directory `/Users/bob/csv/mysalesdata` to `/data`. This means anything in your laptop directory will appear exactly the same way inside `/data`.\n\nSee the \"How To Run\" section for more details on this mapping.\n\n```bash\n input {\n   file {\n      path =\u003e \"/data/sales.csv\"\n      start_position =\u003e \"beginning\"\n      sincedb_path =\u003e \"/dev/null\"\n   }\n }\n```\n\n### Step 2: Define Your Filter\n\nThis is where you define a CSV filter. A basic filter is focused on setting the schema and removing system generated columns.\n\n- The `separator =\u003e \",\"` defines the delimiter. Do not change\n- The removal of system generated columns is done via `remove_field =\u003e [ \"message\", \"host\", \"@timestamp\", \"@version\", \"path\" ]`. Do not change unless you want to remove other columns from your CSV file. For example, let's say you had a column called `userid`. You can add it like this `remove_field =\u003e [ \"message\", \"host\", \"@timestamp\", \"@version\", \"path\", \"userid\" ]`. 
Now `userid` will be suppressed and not sent to Openbridge.\n- If your CSV file has a header row, then you can set `autodetect_column_names =\u003e \"true\"` and `autogenerate_column_names =\u003e \"true\"` to leverage those values when processing the file.\n\n```bash\n filter {\n   csv {\n      separator =\u003e \",\"\n      remove_field =\u003e [ \"message\", \"host\", \"@timestamp\", \"@version\", \"path\" ]\n      autodetect_column_names =\u003e \"true\"\n      autogenerate_column_names =\u003e \"true\"\n   }\n }\n```\n\nIf your CSV does **not** have a header in the file you need to provide context about the target source file. You need to supply the header to the application with `columns =\u003e [\"Sku\",\"Name\",\"SearchKeywords\",\"Main\",\"Price\",\"ID\",\"Brands\"]`. This header should align to the layout of the CSV file.\n\n```bash\n  filter {\n    csv {\n       separator =\u003e \",\"\n       remove_field =\u003e [ \"message\", \"host\", \"@timestamp\", \"@version\", \"path\" ]\n       columns =\u003e [\"Sku\",\"Name\",\"SearchKeywords\",\"Main\",\"Price\",\"ID\",\"Brands\"]\n    }\n  }\n```\n\n#### Advanced Filtering\n\nHere is a more advanced filter. This performs pre-processing cleanup on the CSV file. 
For example, it will strip whitespace from columns, remove bad characters, convert a column to a different data type and so forth.\n\n```bash\n\nfilter {\n\n# The CSV filter takes an event field containing CSV data,\n# parses it, and stores it as individual fields (can optionally specify the names).\n# This filter can also parse data with any separator, not just commas.\n\n  csv {\n  # Set the comma delimiter\n    separator =\u003e \",\"\n\n  # We want to exclude these system columns\n    remove_field =\u003e [\n       \"message\",\n       \"host\",\n       \"@timestamp\",\n       \"@version\",\n       \"path\"\n    ]\n\n  # Define the layout of the input file\n    columns =\u003e [\n    \"Sku\",\"Name\",\"SearchKeywords\",\"Main\",\"Price\",\"ID\",\"Brands\"\n    ]\n  }\n\n  # The mutate filter allows you to perform general\n  # mutations on fields. You can rename, remove, replace\n  # and modify fields in your events\n\n  # We need to set the target column to \"string\" to allow for find and replace\n  mutate {\n    convert =\u003e [ \"Sku\", \"string\" ]\n  }\n\n  # Strip backslashes, question marks, equals and hashes from the target column\n  mutate {\n     gsub =\u003e [ \"Sku\", \"[\\\\?#=]\", \"\" ]\n  }\n\n  # Strip extraneous white space from records\n  mutate {\n     strip =\u003e [ \"Sku\",\"Name\",\"SearchKeywords\",\"Main\",\"Price\",\"ID\",\"Brands\"\n     ]\n  }\n\n  # Set everything to lowercase\n  mutate {\n     lowercase =\u003e [ \"Sku\",\"Name\",\"SearchKeywords\",\"Main\",\"Price\",\"ID\",\"Brands\"\n     ]\n  }\n}\n```\n\n### Step 3: Define Your Output Destination\n\nThe output defines the delivery location for all the records in your CSV(s). Openbridge generates a private API endpoint which you use in the `url =\u003e \"\"`. 
The delivery API would look like this `url =\u003e \"https://myapi.foo-api.us-east-1.amazonaws.com/dev/events/teststash?token=774f77b389154fd2ae7cb5131201777\u0026sign=ujguuuljNjBkFGHyNTNmZTIxYjEzMWE5MjgyNzM1ODQ=\"`\n\nYou would take the Openbridge provided endpoint and put it into the config:\n\n```bash\n   output {\n     http {\n        url =\u003e \"https://myapi.foo-api.us-east-1.amazonaws.com/dev/events/teststash?token=774f77b389154fd2ae7cb5131201777\u0026sign=ujguuuljNjBkFGHyNTNmZTIxYjEzMWE5MjgyNzM1ODQ=\"\n        http_method =\u003e \"post\"\n        format =\u003e \"json\"\n        pool_max =\u003e \"10\"\n        pool_max_per_route =\u003e \"5\"\n     }\n   }\n```\n\n**Note**: Do not change `http_method =\u003e \"post\"`, `format =\u003e \"json\"`, `pool_max =\u003e \"10\"`, `pool_max_per_route =\u003e \"5\"` from the defaults listed in the config.\n\nYou can also store the data to a CSV file (vs sending it to an API). This might be useful to test or validate your data prior to using the API. It also might be useful if you want to create a CSV for upload to Openbridge via SFTP or SCP.\n\n```bash\noutput {\n\n  # Saving output to CSV so we define the layout of the file\n    csv {\n      fields =\u003e [ \"Sku\",\"Name\",\"SearchKeywords\",\"Main\",\"Price\",\"ID\",\"Brands\" ]\n\n   # Where do you want to export the file\n     path =\u003e \"/data/foo.csv\"\n    }\n}\n```\n\nYou need to reach out to your Openbridge team so they can provision your private API for you.\n\n### Step 4: Save Your Config\n\nYou will want to store your configs in an easy to remember location. You should also name the config in a manner that reflects the data resident in the CSV file. Since we are using `sales.csv` we saved our config like this: `/Users/bob/datastash/configs/sales.conf`. 
We will need to reference this config location in the next section.\n\nThe final config will look something like this:\n\n```bash\n####################################\n# An input enables a specific source of\n# events to be read by Logstash.\n####################################\n\ninput {\n  file {\n     # Set the path to the source file(s)\n     path =\u003e \"/data/sales.csv\"\n     start_position =\u003e \"beginning\"\n     sincedb_path =\u003e \"/dev/null\"\n  }\n}\n\n####################################\n# A filter performs intermediary processing on an event.\n# Filters are often applied conditionally depending on the\n# characteristics of the event.\n####################################\n\nfilter {\n\n csv {\n\n   # The CSV filter takes an event field containing CSV data,\n   # parses it, and stores it as individual fields (can optionally specify the names).\n   # This filter can also parse data with any separator, not just commas.\n\n  # Set the comma delimiter\n    separator =\u003e \",\"\n\n  # We want to exclude these system columns\n    remove_field =\u003e [\n    \"message\", \"host\", \"@timestamp\", \"@version\", \"path\"\n    ]\n\n  # Define the layout of the input file\n    columns =\u003e [\n    \"Sku\",\"Name\",\"SearchKeywords\",\"Main\",\"Price\",\"ID\",\"Brands\"\n    ]\n  }\n\n  # The mutate filter allows you to perform general\n  # mutations on fields. You can rename, remove, replace\n  # and modify fields in your events\n\n  # We need to set the target column to \"string\" to allow for find and replace\n  mutate {\n    convert =\u003e [ \"Sku\", \"string\" ]\n  }\n\n  # Find and remove backslashes, question marks, equals and hashes from the target column. 
These are characters we do not want in our column\n  mutate {\n     gsub =\u003e [ \"Sku\", \"[\\\\?#=]\", \"\" ]\n  }\n\n  # Strip extraneous white space from records\n  mutate {\n     strip =\u003e [ \"Sku\",\"Name\",\"SearchKeywords\",\"Main\",\"Price\",\"ID\",\"Brands\"\n     ]\n  }\n\n  # Set everything to lowercase\n  mutate {\n     lowercase =\u003e [ \"Sku\",\"Name\",\"SearchKeywords\",\"Main\",\"Price\",\"ID\",\"Brands\"\n     ]\n  }\n}\n\n####################################\n# An output sends event data to a particular\n# destination. Outputs are the final stage in the\n# event pipeline.\n####################################\n\noutput\n{\n  # Sending the contents of the file to the event API\n  http\n  {\n    # Put the URL for your HTTP endpoint to deliver events to\n    url =\u003e \"https://myapi.foo-api.us-east-1.amazonaws.com/dev/events/teststash?token=774f77b389154fd2ae7cb5131201777\u0026sign=ujguuuljNjBkFGHyNTNmZTIxYjEzMWE5MjgyNzM1ODQ=\"\n    # Leave the settings below untouched.\n    http_method =\u003e \"post\"\n    format =\u003e \"json\"\n    pool_max =\u003e \"10\"\n    pool_max_per_route =\u003e \"5\"\n  }\n}\n```\n\n# How To Run\n\nWith your `sales.csv` config file saved to `/Users/bob/datastash/configs/sales.conf` you are ready to stream your data!\n\nThere are two things that Data Stash needs to be told in order to run.\n\n1. Where to find your source CSV file (`/Users/bob/csv/mysalesdata`)\n2. The location of the config file (`/Users/bob/datastash/configs`)\n\nYou tell Data Stash where the file and config are via the `-v` (or `--volume`) flag in Docker. In our example your CSV is located on your laptop in this folder: `/Users/bob/csv/mysalesdata`. This means we put that path into the first `-v` flag. Internally Data Stash defaults to `/data` so you can leave that untouched. 
It should look like this:\n\n```bash\n-v /Users/bob/csv/mysalesdata:/data\n```\n\nIn our example you also saved your config file on your laptop here: `/Users/bob/datastash/configs`. Data Stash defaults to looking for configs in `/config/pipeline` so you can leave that untouched:\n\n```bash\n-v /Users/bob/datastash/configs:/config/pipeline\n```\n\nLastly, we put it all together so we can tell Data Stash to stream the file. Here is the command to run our Docker based Data Stash image:\n\n```bash\ndocker run -it --rm \\\n-v /Users/bob/csv/mysalesdata:/data \\\n-v /Users/bob/datastash/configs:/config/pipeline \\\nopenbridge/ob_datastash \\\ndatastash -f /config/pipeline/xxxxx.conf\n```\n# Performance\nIf you are processing very large CSV files that have millions of records this approach can take a while to complete. Depending on the complexity of the filters, you can expect about 1000 to 3000 events (i.e., rows) processed per minute. A CSV with 1,000,000 rows might take anywhere from 5 to 8 hours to complete.\n\nWe limit the requests to 100 per second, so the maximum number of transactions possible in a minute would be 6000. At a rate of 6000 per minute, processing a 1M record CSV file would take close to 3 hours.\n\nYou might want to explore using the Openbridge SFTP or SCP options for processing larger files.\n\n# Notes\n\n## Processing A Folder Of CSV Files\n\nIn the example below we used a wildcard `*.csv` to specify processing all sales CSV files in the directory.\n\n`path =\u003e \"/the/path/to/your/*.csv\"`\n\nFor example, if you had files called `sales.csv`, `sales002.csv` and `sales-allyear.csv`, using a wildcard `*.csv` will process all of them.\n\nPlease note, using a `*.csv` assumes all files have the same structure/layout. 
If they do not, then you may be streaming disjointed data sets which will likely fail when it comes time to load the data into your warehouse.\n\n\n\n# Versioning\n\nDocker Tag | GitHub Release | Logstash | Alpine Version\n---------- | -------------- | -------- | --------------\nlatest     | Master         | 6.2.3    | latest\n\n# Reference\n\nAt the heart of Data Stash is [Logstash](https://www.elastic.co/products/logstash). For a deeper dive into the capabilities of Logstash check out their [documentation](https://www.elastic.co/guide/en/logstash/current/index.html). Logstash is pretty cool and can do a lot more than just processing CSV files.\n\nCSV files should follow RFC 4180 standards/guidance to ensure success with processing:\n\n- \u003chttps://www.loc.gov/preservation/digital/formats/fdd/fdd000323.shtml\u003e\n- \u003chttps://tools.ietf.org/html/rfc4180\u003e\n\nThis image is used for virtualizing your data streaming using Docker. If you don't know what Docker is read \"[What is Docker?](https://www.docker.com/what-docker)\". Once you have a sense of what Docker is, you can then install the software. It is free: \"[Get Docker](https://www.docker.com/products/docker)\". Select the Docker package that aligns with your environment (i.e., OS X, Linux or Windows). 
If you have not used Docker before, take a look at the guides:\n\n- [Engine: Get Started](https://docs.docker.com/engine/getstarted/)\n- [Docker Mac](https://docs.docker.com/docker-for-mac/)\n- [Docker Windows](https://docs.docker.com/docker-for-windows/)\n\n# TODO\n\n- Create more sample configs, including complex wrangling examples.\n\n# Issues\n\nIf you have any problems with or questions about this image, please contact us through a GitHub issue.\n\n# Contributing\n\nYou are invited to contribute new features, fixes, or updates, large or small; we are always thrilled to receive pull requests, and do our best to process them as fast as we can.\n\nBefore you start to code, we recommend discussing your plans through a GitHub issue, especially for more ambitious contributions. This gives other contributors a chance to point you in the right direction, give you feedback on your design, and help you find out if someone else is working on the same thing.\n\n# License\n\nThis project is licensed under the MIT License\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenbridge%2Fob_datastash","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopenbridge%2Fob_datastash","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenbridge%2Fob_datastash/lists"}