https://github.com/notdodo/sparktrail
Query AWS CloudTrail using Spark (python) to perform analysis
- Host: GitHub
- URL: https://github.com/notdodo/sparktrail
- Owner: notdodo
- Created: 2023-03-05T23:56:48.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2024-04-08T12:27:52.000Z (almost 2 years ago)
- Last Synced: 2024-04-09T15:29:53.084Z (almost 2 years ago)
- Language: Python
- Size: 146 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# SparkTrail [CodeQL](https://github.com/notdodo/SparkTrail/actions/workflows/codeql.yml)
Use this Python script to start a Spark standalone session and interact with the CloudTrail bucket.
Spark lets you query the logs with a SQL-like syntax.
The startup script `main.py` automatically loads SSO credentials and sets temporary AWS credentials from the SSO session to authenticate to the bucket.
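For reference, here is a minimal sketch of that flow, assuming `boto3` and an SSO profile named `default`; the actual `main.py` may resolve the profile and configure the session differently:

```python
# Hedged sketch: resolve AWS SSO credentials with boto3 and hand them
# to Spark's S3A connector. The profile name and builder settings are
# assumptions, not the repository's actual code.
import boto3
from pyspark.sql import SparkSession

creds = boto3.Session(profile_name="default").get_credentials().get_frozen_credentials()

spark = (
    SparkSession.builder.appName("SparkTrail")
    .config("spark.hadoop.fs.s3a.access.key", creds.access_key)
    .config("spark.hadoop.fs.s3a.secret.key", creds.secret_key)
    .config("spark.hadoop.fs.s3a.session.token", creds.token)
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
    )
    .getOrCreate()
)
```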
## Usage
0. Spawn a Poetry shell:
`poetry shell`
1. Start the cluster:
`PYSPARK_DRIVER_PYTHON=ipython PYTHONSTARTUP=main.py pyspark --packages org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262 --driver-memory 15G --executor-memory 5G --name SparkTrail`
2. The environment is now configured. Once the IPython shell is up, link the S3 bucket and start running queries:
```python
# Point Spark at the CloudTrail prefix; link_s3 is defined in main.py
spark = link_s3("audit-cloudtrail-logs/AWSLogs/")
# List the first 10 distinct event names found in the logs
spark.select("Records.eventName").distinct().show(10)
```
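Since Spark also accepts plain SQL, you can register the result as a temporary view and query it that way. This is a sketch under the assumption that `link_s3` returns a DataFrame (as the example above suggests); `trail` and the view name `cloudtrail` are placeholders:

```python
from pyspark.sql import SparkSession

# Assumption: link_s3 (from main.py) returns a DataFrame over the bucket
trail = link_s3("audit-cloudtrail-logs/AWSLogs/")
trail.createOrReplaceTempView("cloudtrail")

# Reuse the shell's session to run SQL against the view; each CloudTrail
# file carries a top-level "Records" array, so explode it first
session = SparkSession.builder.getOrCreate()
session.sql("""
    SELECT r.eventName, COUNT(*) AS hits
    FROM cloudtrail
    LATERAL VIEW explode(Records) t AS r
    GROUP BY r.eventName
    ORDER BY hits DESC
    LIMIT 10
""").show()
```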
### Notes
- You can use this script with any bucket containing partitioned JSON files (e.g., Databricks audit logs); see the sketch after this list
- You may need to adjust the memory sizing to fit your environment
- Run `aws sso login` before starting the cluster
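For a rough idea of what the generic case looks like without `link_s3`, here is a plain-Spark sketch; the bucket path is a placeholder and the session must already carry valid S3A credentials (see the credentials sketch above):

```python
from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()

# Read every JSON file under the prefix; recursiveFileLookup makes Spark
# descend into date-partitioned subdirectories (e.g., year/month/day)
logs = (
    session.read
    .option("recursiveFileLookup", "true")
    .json("s3a://my-audit-bucket/some/prefix/")
)
logs.printSchema()
```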