https://github.com/andgineer/spark-aws-rdkit
Docker image with Apache Spark / Hadoop 3 (compatible with AWS services such as S3) and RDKit installed in an Anaconda environment
- Host: GitHub
- URL: https://github.com/andgineer/spark-aws-rdkit
- Owner: andgineer
- Created: 2021-03-10T11:49:05.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2024-09-20T04:24:35.000Z (about 1 year ago)
- Last Synced: 2025-02-01T11:35:01.810Z (8 months ago)
- Topics: anaconda, aws, pyspark, rdkit, spark
- Language: Shell
- Homepage:
- Size: 65.4 KB
- Stars: 4
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Docker-based PySpark cluster with AWS services and RDKit
Docker-based Apache Spark standalone cluster with
AWS services integration (S3, etc.) and a comprehensive Data Science environment
including PySpark, Pandas, and RDKit for cheminformatics.

## Features
- Apache Spark cluster in standalone mode
- Full AWS services compatibility (S3, etc.)
- Conda environment with Data Science tools:
  - PySpark
  - Pandas
  - RDKit for cheminformatics (https://www.rdkit.org)
- Deployment options:
  - Local with `docker compose`
  - Cloud with AWS ECS

## Quick Start
Launch locally with `docker compose`:
```bash
./compose.sh up --build
```

This starts:
- Spark Master
- Two Spark Workers
- Example job container (`submit`)

Access points:
- Spark Web UI: http://localhost:8080
- Spark master: `spark://localhost:7077`
- For PySpark use `setMaster('spark://localhost:7077')`

> Note: On Linux, change `docker.for.mac.localhost` to `localhost` in the `.env` file.
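
For example, a minimal PySpark session pointing at the local cluster could look like the sketch below. The bucket name, object path, and credential values are placeholders; S3 access goes through the `s3a://` scheme provided by Hadoop 3.

```python
from pyspark.sql import SparkSession

# Connect to the standalone master started by docker compose
# (equivalent to SparkConf().setMaster('spark://localhost:7077')).
spark = (
    SparkSession.builder
    .master("spark://localhost:7077")
    .appName("quickstart")
    # Only needed for S3 access; the key values below are placeholders.
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    .getOrCreate()
)

# Hadoop 3 / hadoop-aws make S3 objects readable directly via s3a://.
df = spark.read.csv("s3a://your-bucket/path/data.csv", header=True)
df.show()
```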
## Example PySpark Application
The `submit` container demonstrates how to:
- Connect to the Spark cluster
- Submit Spark jobs
- Process data with PySpark

See the `src/` directory for implementation details.
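
As an illustration only (the actual example code lives in `src/`), a job could use RDKit inside a PySpark UDF, for instance to compute molecular weights from SMILES strings:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from rdkit import Chem
from rdkit.Chem import Descriptors

spark = (
    SparkSession.builder
    .master("spark://localhost:7077")
    .appName("rdkit-demo")
    .getOrCreate()
)

def mol_weight(smiles):
    """Molecular weight for a SMILES string, or None if it cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    return Descriptors.MolWt(mol) if mol is not None else None

mol_weight_udf = udf(mol_weight, DoubleType())

# Tiny inline dataset: ethanol and benzene.
df = spark.createDataFrame([("CCO",), ("c1ccccc1",)], ["smiles"])
df.withColumn("mol_weight", mol_weight_udf("smiles")).show()
```

Because RDKit is installed in the image's conda environment, such a UDF runs on the workers without extra packaging.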
## AWS ECS Deployment
For production deployment on AWS Elastic Container Service (ECS):
1. Navigate to `ecs/` directory
2. Configure your deployment in `config.sh`
3. Run the automated deployment scripts

Detailed instructions are available in `ecs/README.md`.
## Docker Images
Public Docker images are available on Docker Hub, so there is no need to build locally:
1. **[andgineer/spark-aws](https://hub.docker.com/r/andgineer/spark-aws)**
   - Base image with Spark 3 and Hadoop 3
   - AWS services integration
2. **[andgineer/spark-aws-conda](https://hub.docker.com/r/andgineer/spark-aws-conda)**
   - Extends base image
   - Adds Anaconda with Pandas
3. **[andgineer/spark-aws-rdkit](https://hub.docker.com/r/andgineer/spark-aws-rdkit)**
   - Adds RDKit for cheminformatics
   - Complete Data Science environment