{"id":41705556,"url":"https://github.com/saraivaufc/bigdata-docker","last_synced_at":"2026-01-24T21:30:31.088Z","repository":{"id":37217440,"uuid":"223521716","full_name":"saraivaufc/bigdata-docker","owner":"saraivaufc","description":"Run Hadoop Cluster within Docker Containers.","archived":false,"fork":false,"pushed_at":"2025-01-08T03:00:23.000Z","size":682,"stargazers_count":16,"open_issues_count":18,"forks_count":6,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-01-08T03:27:59.767Z","etag":null,"topics":["docker","docker-compose","hadoop","hdfs","hive","hue","spark"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/saraivaufc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-11-23T02:55:25.000Z","updated_at":"2025-01-08T02:59:11.000Z","dependencies_parsed_at":"2023-01-25T08:16:31.864Z","dependency_job_id":null,"html_url":"https://github.com/saraivaufc/bigdata-docker","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/saraivaufc/bigdata-docker","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saraivaufc%2Fbigdata-docker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saraivaufc%2Fbigdata-docker/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saraivaufc%2Fbigdata-docker/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saraivaufc%2Fbigdata-docker/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/saraivaufc","download_url":"https
://codeload.github.com/saraivaufc/bigdata-docker/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saraivaufc%2Fbigdata-docker/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28737276,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-24T21:19:41.845Z","status":"ssl_error","status_checked_at":"2026-01-24T21:13:38.675Z","response_time":89,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["docker","docker-compose","hadoop","hdfs","hive","hue","spark"],"created_at":"2026-01-24T21:30:30.447Z","updated_at":"2026-01-24T21:30:31.082Z","avatar_url":"https://github.com/saraivaufc.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"![bigdata-docker](./docs/bigdata-docker.png)\n\n### Install Docker (Ubuntu):\n\n```shell\n$ sudo apt-get remove docker docker-engine docker.io\n$ sudo apt-get update\n$ sudo apt-get install apt-transport-https ca-certificates  curl software-properties-common\n$ sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -\n$ sudo apt-key fingerprint 0EBFCD88\n$ sudo add-apt-repository \"deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable\"\n$ sudo apt-get update\n$ sudo apt-get install docker-ce\n$ sudo docker run 
hello-world\n```\n\n### Install Docker Compose\n\n```shell\n$ sudo curl -L \"https://github.com/docker/compose/releases/download/1.23.2/docker-compose-$(uname -s)-$(uname -m)\" -o /usr/local/bin/docker-compose\n$ sudo chmod +x /usr/local/bin/docker-compose\n```\n\n### To use GPU\n\n#### Install NVIDIA Docker 2\n```shell\n$ sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit\n```\n\n#### Update /etc/docker/daemon.json\n\nFrom:\n```json\n{\n    \"runtimes\": {\n        \"nvidia\": {\n            \"path\": \"nvidia-container-runtime\",\n            \"runtimeArgs\": []\n        }\n    }\n}\n```\nTo:\n```json\n{\n\t\"default-runtime\":\"nvidia\",\n    \"runtimes\": {\n        \"nvidia\": {\n            \"path\": \"nvidia-container-runtime\",\n            \"runtimeArgs\": []\n        }\n    }\n}\n```\n\n#### Restart Docker\n```shell\n$ service docker restart\n```\n\n\n### Build images\n```shell\n$ docker-compose build --parallel\n```\n\n### Up containers via compose\n```shell\n$ docker-compose up -d\n```\n\n# Applications\n\n| Application | URL |\n|-----------------------|-------------------------------------------------------|\n| Hadoop                | http://localhost:9870                                 |\n| Hadoop Cluster        | http://localhost:8088                                 |\n| Hadoop HDFS           | hdfs://localhost:9000                                 |\n| Hadoop WEBHDFS        | http://localhost:14000/webhdfs/v1                     |\n| Hive Server2          | http://localhost:10000                                |\n| Hue                   | http://localhost:8888 (username: hue,password: secret)|\n| Spark Master UI       | http://localhost:4080                                 |\n| Spark Jobs            | http://localhost:4040                                 |\n| Livy                  | http://localhost:8998                                 |\n| Jupyter notebook      | http://localhost:8899                                 |\n| AirFlow        
       | http://localhost:8080 (username: airflow,password: airflow)|\n| Flower                | http://localhost:8555                                 |\n\n# Tutorials\n\n## HDFS\n\n### Access the Hadoop Namenode container\n```shell\ndocker exec -it hadoop-master bash\n```\n\n### List root content\n```shell\nhadoop fs -ls /\n```\n### Create a directory structure\n```shell\nhadoop fs -mkdir /dados\nhadoop fs -ls /\nhadoop fs -ls /dados\nhadoop fs -mkdir /dados/bigdata\nhadoop fs -ls /dados\n```\n\n### Test the deletion of a directory\n```shell\nhadoop fs -rm -r /dados/bigdata\nhadoop fs -ls /dados\n```\n\n### Add an external file to the cluster\n```shell\ncd /root\nls\nhadoop fs -mkdir /dados/bigdata\nhadoop fs -put /var/log/alternatives.log /dados/bigdata\nhadoop fs -ls /dados/bigdata\n```\n\n### Copy files\n```shell\nhadoop fs -ls /dados/bigdata\nhadoop fs -cp /dados/bigdata/alternatives.log /dados/bigdata/alternatives2.log\nhadoop fs -ls /dados/bigdata\n```\n\n### List the contents of a file\n```shell\nhadoop fs -ls /dados/bigdata\nhadoop fs -cat /dados/bigdata/alternatives.log\n```\n### Create a HUE User\n```shell\nhadoop fs -mkdir /user/hue\nhadoop fs -ls /user/hue\nhadoop fs -chmod 777 /user/hue\n```\n\n## Hive\n\n### Access the Hadoop Namenode container\n```shell\ndocker exec -it hadoop-master bash\n```\n\n### Run Hive Shell\n```shell\nhive\n```\n\n### List databases\n```shell\n\u003e show databases;\n```\n\n### Access 'default' Database\n```shell\n\u003e use default;\n```\n\n### List database tables\n```shell\n\u003e show tables;\n```\n\n## Spark\n\nDocumentation: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=read%20csv\n\n### Data ingestion in HDFS\n```shell\n# Access the Hadoop Namenode container\ndocker exec -it hadoop-master bash\n\n# Download ENEM datasets: http://inep.gov.br/microdados\n\n# create spark folder in HDFS\nhadoop fs -mkdir /user/spark/\n\n# Data ingestion in HDFS\nhadoop fs -put  MICRODADOS_ENEM_2018.csv 
/user/spark/\nhadoop fs -put  MICRODADOS_ENEM_2017.csv /user/spark/\n```\n### Access the Spark master node container\n```shell\ndocker exec -it spark-master bash\n```\n\n### Access Spark shell\n```shell\nspark-shell\n```\n\n### Load ENEM 2018 data from HDFS\n```shell\nval df = spark.read.format(\"csv\").option(\"sep\", \";\").option(\"inferSchema\", \"true\").option(\"header\", \"true\").load(\"hdfs://hadoop-master:9000/user/spark/MICRODADOS_ENEM_2018.csv\")\n```\n### Show dataframe schema\n```shell\ndf.printSchema()\n```\n### Show how many visually impaired students participated in the ENEM test in 2018.\n```shell\ndf.groupBy(\"IN_CEGUEIRA\").count().show()\n```\n\n### Show how many students participated in the ENEM test in 2018 grouped by age.\n```shell\ndf.groupBy(\"NU_IDADE\").count().sort(asc(\"NU_IDADE\")).show(100, false)\n```\n\n## Kafka\n\n### Connect Kafka Broker 1\n```shell\ndocker exec -it kafka-broker1 bash\n```\n\n### Create topic\n```shell\nkafka-topics.sh --create --zookeeper zookeeper:2181 --replication-factor 1 --partitions 1 --topic test\n```\n\n### List topics\n```shell\nkafka-topics.sh --zookeeper zookeeper:2181 --list\n```\n\n### Run Producer on Kafka Broker 1\n```shell\nkafka-console-producer.sh --bootstrap-server kafka-broker1:9091 --topic test\n```\n### Enter data\n```shell\n\u003eHello\n```\n\n### Connect Kafka Broker 2\n```shell\ndocker exec -it kafka-broker2 bash\n```\n\n### Run Consumer on Kafka Broker 2\n```shell\nkafka-console-consumer.sh --bootstrap-server kafka-broker1:9091 --from-beginning --topic test\n```\n\n### Delete topic\n```shell\nkafka-topics.sh --zookeeper zookeeper:2181 --delete --topic test\n```\n\n\u003ca rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc/4.0/\"\u003e\n    \u003cimg alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by-nc/4.0/88x31.png\" /\u003e\n\u003c/a\u003e\n\u003cbr /\u003e\nThis work is licensed under a \u003ca rel=\"license\" 
href=\"http://creativecommons.org/licenses/by-nc/4.0/\"\u003eCreative Commons Attribution-NonCommercial 4.0 International License\u003c/a\u003e.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsaraivaufc%2Fbigdata-docker","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsaraivaufc%2Fbigdata-docker","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsaraivaufc%2Fbigdata-docker/lists"}