https://github.com/jey-37/nginx-pipeline
The Apache Beam program which reads nginx access logs from Google Cloud Pub/Sub, parses them, and saves into BigQuery.
apache-beam bigquery dataflow gcp-pubsub
- Host: GitHub
- URL: https://github.com/jey-37/nginx-pipeline
- Owner: Jey-37
- Created: 2023-06-03T17:21:04.000Z (about 3 years ago)
- Default Branch: master
- Last Pushed: 2023-07-07T10:28:02.000Z (almost 3 years ago)
- Last Synced: 2025-05-12T08:11:16.806Z (about 1 year ago)
- Topics: apache-beam, bigquery, dataflow, gcp-pubsub
- Language: Java
- Homepage:
- Size: 4.88 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Nginx Logs Pipeline
An Apache Beam program that reads nginx access logs from Google Cloud Pub/Sub, parses them, and saves them to BigQuery.
The pipeline can run either locally or on Google Cloud Dataflow.
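To illustrate the parsing step, here is a minimal sketch that splits one access log line into named fields. It assumes nginx's default `combined` log format. The class name `NginxLogParser` and the field names are hypothetical, and the repository's actual parser and BigQuery schema may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the repository.
final class NginxLogParser {
    // Assumption: the default "combined" format,
    // $remote_addr - $remote_user [$time_local] "$request" $status
    // $body_bytes_sent "$http_referer" "$http_user_agent"
    private static final Pattern COMBINED = Pattern.compile(
        "^(\\S+) - (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\d+) \"([^\"]*)\" \"([^\"]*)\"$");

    /** Returns the parsed fields in order, or null if the line does not match. */
    static Map<String, String> parse(String line) {
        Matcher m = COMBINED.matcher(line.trim());
        if (!m.matches()) {
            return null; // caller decides whether to drop or dead-letter the line
        }
        Map<String, String> row = new LinkedHashMap<>();
        row.put("remote_addr", m.group(1));
        row.put("remote_user", m.group(2));
        row.put("time_local", m.group(3));
        row.put("request", m.group(4));
        row.put("status", m.group(5));
        row.put("body_bytes_sent", m.group(6));
        row.put("http_referer", m.group(7));
        row.put("http_user_agent", m.group(8));
        return row;
    }
}
```

In a Beam pipeline, a function like this would typically be called from a `DoFn` that converts each Pub/Sub message into a BigQuery row. Returning `null` for unparseable lines keeps the sketch short; a real pipeline might route them to a separate output instead.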
## Environment Variables
To run this project, set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the path of a service account key file in JSON format.
## Command line arguments
### Mandatory arguments:
`--project` – GCP project name
`--inputSubscription` – short name of the Cloud Pub/Sub subscription to read from
`--dataset` – name of the BigQuery dataset
`--table` – name of the table in the dataset
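Because `--inputSubscription` takes only the short name, the pipeline presumably combines it with `--project` to form the full resource path. The sketch below shows how the four arguments could map to the identifiers that Beam's `PubsubIO` and `BigQueryIO` accept. The class `ResourceNames` and its methods are hypothetical, not the repository's code.

```java
// Hypothetical helper, not taken from the repository.
final class ResourceNames {
    /** Full Pub/Sub subscription path: projects/<project>/subscriptions/<name>. */
    static String subscriptionPath(String project, String subscription) {
        return String.format("projects/%s/subscriptions/%s", project, subscription);
    }

    /** BigQuery table spec in the <project>:<dataset>.<table> form. */
    static String tableSpec(String project, String dataset, String table) {
        return String.format("%s:%s.%s", project, dataset, table);
    }
}
```

For example, `--project=my-project --inputSubscription=nginx-logs` (placeholder values) would yield `projects/my-project/subscriptions/nginx-logs`.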
### Additional arguments required for the Dataflow runner:
`--runner=DataflowRunner`
`--region` – Dataflow regional endpoint
`--gcpTempLocation` – Cloud Storage location where Dataflow stores temporary files
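The conditional requirement above can be checked before the pipeline starts. This is a minimal sketch, assuming arguments arrive in Beam's usual `--name=value` form; the class `DataflowArgsCheck` is hypothetical and is not part of the repository.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not taken from the repository.
final class DataflowArgsCheck {
    /** Returns the Dataflow-only arguments that are missing; empty when another runner is used. */
    static List<String> missingDataflowArgs(String[] args) {
        Map<String, String> parsed = new HashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (arg.startsWith("--") && eq > 2) {
                parsed.put(arg.substring(2, eq), arg.substring(eq + 1));
            }
        }
        List<String> missing = new ArrayList<>();
        if (!"DataflowRunner".equals(parsed.get("runner"))) {
            return missing; // local run: no extra arguments needed
        }
        for (String name : new String[] {"region", "gcpTempLocation"}) {
            String value = parsed.get(name);
            if (value == null || value.isEmpty()) {
                missing.add(name);
            }
        }
        return missing;
    }
}
```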