{"id":16708803,"url":"https://github.com/conema/spark-terraform","last_synced_at":"2025-09-04T04:39:37.115Z","repository":{"id":101242640,"uuid":"329616654","full_name":"conema/spark-terraform","owner":"conema","description":"This project create an Hadoop and Spark cluster on Amazon AWS with Terraform","archived":false,"fork":false,"pushed_at":"2021-02-24T12:01:41.000Z","size":31,"stargazers_count":3,"open_issues_count":0,"forks_count":4,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-05T11:24:04.259Z","etag":null,"topics":["aws","cluster","hadoop","hadoop-cluster","hcl","spark","spark-clusters","terraform"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/conema.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-01-14T13:00:18.000Z","updated_at":"2023-03-31T15:32:23.000Z","dependencies_parsed_at":null,"dependency_job_id":"5cedef88-8eb7-4f6b-8ca2-1c4267c3cce0","html_url":"https://github.com/conema/spark-terraform","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/conema/spark-terraform","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/conema%2Fspark-terraform","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/conema%2Fspark-terraform/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/conema%2Fspark-terraform/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/conema%2Fs
park-terraform/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/conema","download_url":"https://codeload.github.com/conema/spark-terraform/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/conema%2Fspark-terraform/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261243168,"owners_count":23129591,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","cluster","hadoop","hadoop-cluster","hcl","spark","spark-clusters","terraform"],"created_at":"2024-10-12T19:46:59.587Z","updated_at":"2025-09-04T04:39:37.103Z","avatar_url":"https://github.com/conema.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Hadoop/Spark with Terraform on AWS\n\nThis project creates a Hadoop and Spark cluster on AWS with Terraform.\n\n1. [Variables](#Variables)\n2. [Software version](#Software-version)\n3. [Project Structure](#Project-Structure)\n4. [How to](#How-to)\n5. 
[See also](#See-also)\n\n## Variables\n\n| Name           | Description                                | Default               |\n|----------------|--------------------------------------------|-----------------------|\n| region         | AWS region                                 | us-east-1             |\n| access_key     | AWS access key                             |                       |\n| secret_key     | AWS secret key                             |                       |\n| token          | AWS token                                  | null                  |\n| instance_type  | AWS instance type                          | m5.xlarge             |\n| ami_image      | AWS AMI image                              | ami-0885b1f6bd170450c |\n| key_name       | Name of the key pair used between nodes    | localkey              |\n| key_path       | Path of the key pair used between nodes    | .                     |\n| aws_key_name   | AWS key pair used to connect to nodes      | amzkey                |\n| amz_key_path   | AWS key pair path used to connect to nodes | amzkey.pem            |\n| namenode_count | Namenode count                             | 1                     |\n| datanode_count | Datanode count                             | 3                     |\n| ips            | Default private IPs used for nodes         | See variables.tf      |\n| hostnames      | Default private hostnames used for nodes   | See variables.tf      |\n\n\n## Software version\n* Default AMI image: ami-0885b1f6bd170450c (Ubuntu 20.04, amd64, hvm-ssd)\n* Spark: 3.0.1\n* Hadoop: 2.7.7\n* Python: latest available (currently 3.8)\n* Java: OpenJDK 8u275 JDK\n\n## Project Structure\n\n* app/: folder where you can put your application; it will be copied to the namenode\n* install-all.sh: script executed on every node; it installs Hadoop/Spark and does all the configuration for you\n* main.tf: definition of the resources\n* output.tf: Terraform output declaration\n* variables.tf: 
Terraform variable declaration\n\n\n## How to\n\n0. Download and install Terraform\n1. Download the project and unzip it\n2. Open the Terraform project folder \"spark-terraform-master/\"\n3. Create a file named \"terraform.tfvars\" and paste this:\n```\naccess_key=\"\u003cYOUR AWS ACCESS KEY\u003e\"\nsecret_key=\"\u003cYOUR AWS SECRET KEY\u003e\"\ntoken=\"\u003cYOUR AWS TOKEN\u003e\"\n```\n**Note:** if you do not set the other variables (you can find them in variables.tf), Terraform will create a cluster in region \"us-east-1\" with 1 namenode, 3 datanodes, and an instance type of m5.xlarge.\n\n3. Put your application files into the \"app\" folder of the Terraform project\n4. Open a terminal and generate a new SSH key\n```\nssh-keygen -f \u003cPATH_TO_SPARK_TERRAFORM\u003e/spark-terraform-master/localkey\n```\nWhere `\u003cPATH_TO_SPARK_TERRAFORM\u003e` is the path to the /spark-terraform-master/ folder (e.g. /home/user/)\n\n5. Log in to AWS and create a key pair named **amzkey** in **PEM** file format. Follow the guide in the [AWS docs](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#having-ec2-create-your-key-pair). Download the key and put it in the spark-terraform-master/ folder.\n\n6. Open a terminal, go to the spark-terraform-master/ folder, and execute the commands\n ```\n terraform init\n terraform apply\n ```\n After a while (wait!) it should print several public DNS names in green; these are the public DNS names of your instances.\n\n7. Connect to each of your instances via SSH with\n ```\nssh -i \u003cPATH_TO_SPARK_TERRAFORM\u003e/spark-terraform-master/amzkey.pem ubuntu@\u003cPUBLIC DNS\u003e\n ```\n\n8. Execute on the master (one by one):\n ```\n$HADOOP_HOME/sbin/start-dfs.sh\n$HADOOP_HOME/sbin/start-yarn.sh\n$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver\n$SPARK_HOME/sbin/start-master.sh\n$SPARK_HOME/sbin/start-slaves.sh spark://s01:7077\n```\n\n9. You are ready to execute your app! 
Execute this command on the master\n```\n/opt/spark-3.0.1-bin-hadoop2.7/bin/spark-submit --master spark://s01:7077  --executor-cores 2 --executor-memory 14g yourfile.py\n```\n\n10. Remember to run `terraform destroy` to delete your EC2 instances\n\n**Note:** Steps 0 to 5 (inclusive) are needed only the first time you set up the project\n\n\n## See also\n * [TransE PySpark](https://github.com/conema/TransE-pyspark): an application using this project\n * [hadoop-spark-cluster-deployment](https://github.com/kostistsaprailis/hadoop-spark-cluster-deployment): the starting point of this project\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fconema%2Fspark-terraform","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fconema%2Fspark-terraform","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fconema%2Fspark-terraform/lists"}