{"id":23402591,"url":"https://github.com/martincastroalvarez/apache-hive-docker","last_synced_at":"2025-04-08T22:22:56.489Z","repository":{"id":181640483,"uuid":"623518933","full_name":"MartinCastroAlvarez/apache-hive-docker","owner":"MartinCastroAlvarez","description":"Running Hive jobs using Docker","archived":false,"fork":false,"pushed_at":"2023-07-16T15:25:18.000Z","size":6303,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-02-14T17:31:06.534Z","etag":null,"topics":["hadoop","hdfs","hive"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MartinCastroAlvarez.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-04-04T14:29:43.000Z","updated_at":"2024-09-03T05:05:57.000Z","dependencies_parsed_at":null,"dependency_job_id":"87481304-07bc-41b6-b722-e817e382a934","html_url":"https://github.com/MartinCastroAlvarez/apache-hive-docker","commit_stats":null,"previous_names":["martincastroalvarez/apache-hive-docker"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MartinCastroAlvarez%2Fapache-hive-docker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MartinCastroAlvarez%2Fapache-hive-docker/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MartinCastroAlvarez%2Fapache-hive-docker/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MartinCastroAlvarez%2Fapache-hive-docker/manifests","owner_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/owners/MartinCastroAlvarez","download_url":"https://codeload.github.com/MartinCastroAlvarez/apache-hive-docker/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247935686,"owners_count":21020870,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["hadoop","hdfs","hive"],"created_at":"2024-12-22T12:29:35.893Z","updated_at":"2025-04-08T22:22:56.470Z","avatar_url":"https://github.com/MartinCastroAlvarez.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Hadoop Hive Docker\nRunning Hive jobs using Docker\n\n![img](./wallpaper.jpg)\n\n## Overview\n\n#### HDFS\n\nHDFS, or Hadoop Distributed File System, is a distributed file system designed to store and\nprocess large datasets using commodity hardware. It is part of the Apache Hadoop ecosystem\nand is widely used in big data processing. HDFS uses a master-slave architecture with one\nNameNode and multiple DataNodes. The NameNode manages the file system metadata, while the\nDataNodes store the actual data. This allows for scalable and fault-tolerant data storage\nand processing. HDFS is optimized for batch processing and sequential reads, making it\nwell-suited for applications like log analysis, data warehousing, and machine learning.\nHowever, it is not well suited for random writes and low-latency data access. 
HDFS is a\ncritical component of the Hadoop ecosystem and is used by many big data applications.\nIts scalable and fault-tolerant design makes it a reliable choice for storing and\nprocessing large datasets. Overall, HDFS plays a crucial role in the world of big\ndata and is an essential tool for data engineers and analysts.\n\n![hadoop.png](hadoop.png)\n\n#### Hive\n\nApache Hive is a data warehousing and SQL-like query tool built on top of the Hadoop\nDistributed File System (HDFS). It provides a SQL-like interface for querying and\nanalyzing large datasets stored in HDFS or other Hadoop-compatible file systems.\nHive translates SQL-like queries into MapReduce jobs, which are executed on the\nHadoop cluster.\n\nHive is designed to be highly scalable, allowing you to process and analyze large\ndatasets using distributed computing resources. It provides a range of built-in\nfunctions and operators for querying and manipulating data, as well as the ability\nto define custom user-defined functions (UDFs) in Java, Python, or other programming\nlanguages.\n\nHive also supports partitioning and bucketing of data for faster query execution,\nas well as the ability to use external tables to access data stored outside of\nHDFS, such as in Amazon S3 or HBase.\n\nOverall, Hive is a powerful tool for processing and analyzing large datasets\nusing the familiar SQL-like interface. 
It allows you to leverage the\nscalability and distributed computing power of Hadoop to process and\nanalyze data that might be too large or complex to analyze using traditional\ndatabase systems.\n\n## Software Architecture\n\n|File|Purpose|\n|---|---|\n|[docker-compose.yml](docker-compose.yml)|Docker Compose file with the infrastructure required to run the Hadoop cluster.|\n|[requirements.txt](tests/requirements.txt)|Python requirements file.|\n|[app/test_hdfs.py](tests/test_hdfs.py)|Python script that tests writing data into HDFS.|\n|[app/test_hive.py](tests/test_hive.py)|Python script that tests writing data using Hive.|\n\n## References\n\n- [Docker Hadoop](https://github.com/big-data-europe/docker-hadoop)\n- [HDFS Simple Docker Installation Guide for Data Science Workflow](https://towardsdatascience.com/hdfs-simple-docker-installation-guide-for-data-science-workflow-b3ca764fc94b)\n- [Set Up Containerize and Test a Single Hadoop Cluster using Docker and Docker compose](https://www.section.io/engineering-education/set-up-containerize-and-test-a-single-hadoop-cluster-using-docker-and-docker-compose/)\n- [Spark Docker](https://github.com/big-data-europe/docker-spark)\n- [Hadoop Namenode](https://hub.docker.com/r/bde2020/hadoop-namenode)\n- [Apache ZooKeeper](https://zookeeper.apache.org/)\n- [Word Counter using Map Reduce on Hadoop](https://medium.com/analytics-vidhya/word-count-using-mapreduce-on-hadoop-6eaefe127502)\n- [Docker Hive](https://github.com/big-data-europe/docker-hive)\n- [Docker Compose Hive](https://github.com/big-data-europe/docker-hadoop-spark-workbench)\n\n## Instructions\n\n#### Starting the Hadoop ecosystem\n```bash\ndocker rm -f $(docker ps -a -q)\ndocker volume rm $(docker volume ls -q)\ndocker-compose up\n```\n\n#### Validating the status of the Hadoop cluster\n```bash\ndocker ps\n```\n```bash\nCONTAINER ID        IMAGE                                                    COMMAND                  CREATED             STATUS                    
PORTS                                            NAMES\n0f87a832960b        bde2020/hadoop-resourcemanager:2.0.0-hadoop3.2.1-java8   \"/entrypoint.sh /r...\"   12 hours ago        Up 54 seconds             0.0.0.0:8088-\u003e8088/tcp                           yarn\n51da2508f5b8        bde2020/hadoop-historyserver:2.0.0-hadoop3.2.1-java8     \"/entrypoint.sh /r...\"   12 hours ago        Up 55 seconds (healthy)   0.0.0.0:8188-\u003e8188/tcp                           historyserver\nec544695c49a        bde2020/hadoop-nodemanager:2.0.0-hadoop3.2.1-java8       \"/entrypoint.sh /r...\"   12 hours ago        Up 56 seconds (healthy)   0.0.0.0:8042-\u003e8042/tcp                           nodemanager\n810f87434b2f        bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8          \"/entrypoint.sh /r...\"   12 hours ago        Up 56 seconds (healthy)   0.0.0.0:9864-\u003e9864/tcp                           datenode1\nca5186635150        bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8          \"/entrypoint.sh /r...\"   12 hours ago        Up 56 seconds (healthy)   0.0.0.0:9000-\u003e9000/tcp, 0.0.0.0:9870-\u003e9870/tcp   namenode\nbeed8502828c        bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8          \"/entrypoint.sh /r...\"   12 hours ago        Up 55 seconds (healthy)   0.0.0.0:9865-\u003e9864/tcp                           datenode2\n[...]\n```\n\n#### Testing HDFS using raw HTTP requests.\nThe `-L` flag allows redirections. 
By default, the namenode redirects the request to one of the datanodes.\n````bash\ndocker exec -it namenode /bin/bash\ncurl -L -i -X PUT \"http://127.0.0.1:9870/webhdfs/v1/data/martin/lorem-ipsum.txt?op=CREATE\" -d 'testing'\n````\n````bash\nHTTP/1.1 307 Temporary Redirect\nDate: Thu, 30 Mar 2023 00:40:44 GMT\nCache-Control: no-cache\nExpires: Thu, 30 Mar 2023 00:40:44 GMT\nDate: Thu, 30 Mar 2023 00:40:44 GMT\nPragma: no-cache\nX-Content-Type-Options: nosniff\nX-FRAME-OPTIONS: SAMEORIGIN\nX-XSS-Protection: 1; mode=block\nLocation: http://datanode2.martincastroalvarez.com:9864/webhdfs/v1/data/martin/lorem-ipsum.txt?op=CREATE\u0026namenoderpcaddress=namenode:9000\u0026createflag=\u0026createparent=true\u0026overwrite=false\nContent-Type: application/octet-stream\nContent-Length: 0\n\nHTTP/1.1 100 Continue\n\nHTTP/1.1 201 Created\nLocation: hdfs://namenode:9000/data/martin/lorem-ipsum.txt\nContent-Length: 0\nAccess-Control-Allow-Origin: *\nConnection: close\n````\n\n#### Listing the content of the root directory\n```bash\ndocker exec -it namenode /bin/bash\nhdfs dfs -ls /\n```\n```bash\nFound 1 items\ndrwxr-xr-x   - root supergroup          0 2023-03-03 14:15 /rmstate\n```\n\n#### Creating a new directory in HDFS\n```bash\ndocker exec -it namenode /bin/bash\nhdfs dfs -mkdir -p /user/root\nhdfs dfs -ls /\n```\n```bash\nFound 2 items\ndrwxr-xr-x   - root supergroup          0 2023-03-03 14:15 /rmstate\ndrwxr-xr-x   - root supergroup          0 2023-03-03 14:17 /user\n```\n\n#### Adding a file to HDFS\n```bash\ndocker exec -it namenode /bin/bash\necho \"lorem\" \u003e /tmp/hadoop.txt\nhdfs dfs -put /tmp/hadoop.txt /user/\nhdfs dfs -ls /user/\n```\n```bash\nFound 2 items\n-rw-r--r--   3 root supergroup          6 2023-03-03 14:20 /user/hadoop.txt\ndrwxr-xr-x   - root supergroup          0 2023-03-03 14:17 /user/root\n```\n\n#### Printing the content of a file in HDFS\n```bash\ndocker exec -it namenode /bin/bash\nhdfs dfs -cat /user/hadoop.txt\n```\n```bash\nlorem\n```\n\n#### 
Checking the status of the NameNode at [http://127.0.0.1:9870/dfshealth.html](http://127.0.0.1:9870/dfshealth.html)\n\n![status1.png](status1.png)\n![status2.png](status2.png)\n\n#### Testing HDFS using Python\n\n```bash\nvirtualenv -p python3 .env\nsource .env/bin/activate\npip install -r requirements.txt\npython3 app/test_hdfs.py\n```\n```bash\n[...]\nWritten: 684 files 336846 words 1852059 chars\n```\n\n#### Entering the Hive server\n```bash\ndocker exec -it hive /bin/bash\n```\n\n#### Validating that the Hive service has started correctly.\n```bash\nps -ef | grep hive\n```\n```bash\nroot       398   269 28 16:15 ?        00:00:04 /usr/lib/jvm/java-8-openjdk-amd64//bin/java -Xmx256m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop-2.7.4/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/hadoop-2.7.4 -Dhadoop.id.str=root -Dhadoop.root.logger=INFO,console -Djava.library.path=/opt/hadoop-2.7.4/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xmx512m -Dproc_hiveserver2 -Dlog4j.configurationFile=hive-log4j2.properties -Djava.util.logging.config.file=/opt/hive/conf/parquet-logging.properties -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /opt/hive/lib/hive-service-2.3.2.jar org.apache.hive.service.server.HiveServer2 --hiveconf hive.server2.enable.doAs=false\n```\n\n#### Troubleshooting logs\n```bash\ntail -n 100 -f /tmp/root/hive.log \n```\n```bash\n[...]\nndler{/static,jar:file:/opt/hive/lib/hive-service-2.3.2.jar!/hive-webapps/static}\n2023-04-04T16:21:13,754 INFO  [main]: handler.ContextHandler (ContextHandler.java:startContext(737)) - started o.e.j.s.ServletContextHandler{/logs,file:/tmp/root/}\n2023-04-04T16:21:13,770 INFO  [main]: server.HiveServer2 (HiveServer2.java:start(508)) - Web UI has started on port 10002\n2023-04-04T16:21:13,768 INFO  [main]: server.AbstractConnector (AbstractConnector.java:doStart(333)) - Started 
SelectChannelConnector@0.0.0.0:10002\n2023-04-04T16:21:13,770 INFO  [main]: http.HttpServer (HttpServer.java:start(214)) - Started HttpServer[hiveserver2] on port 10002\n```\n\n#### Entering the Hive prompt\n```bash\nbeeline -u jdbc:hive2://hive:10000\n```\n```bash\n0: jdbc:hive2://hive:10000\u003e \n```\n\n#### Creating a new table.\n```sql\nCREATE TABLE pokes (foo INT, bar STRING);\n```\n```bash\nNo rows affected (1.234 seconds)\n```\n\n#### Inserting data into the table.\n```sql\nINSERT INTO TABLE pokes VALUES (1, 'John'), (2, 'Jane'), (3, 'Bob');\n```\n```bash\nNo rows affected (4.089 seconds)\n```\n\n#### Reading the table\n```sql\nSELECT * FROM pokes;\n```\n```bash\n+------------+------------+\n| pokes.foo  | pokes.bar  |\n+------------+------------+\n| 1          | John       |\n| 2          | Jane       |\n| 3          | Bob        |\n+------------+------------+\n3 rows selected (0.267 seconds)\n```\n\n#### Checking that the table was created at [http://127.0.0.1:9870/explorer.html#/user/hive/warehouse/pokes](http://127.0.0.1:9870/explorer.html#/user/hive/warehouse/pokes):\n\n![pokes.png](pokes.png)\n\n```bash\n1\u0001John\n2\u0001Jane\n3\u0001Bob\n```\n\n#### Connecting to Hive using Python\n```bash\nvirtualenv -p python3 .env/\nsource .env/bin/activate\npip install -r requirements.txt\npython3 app/test_hive.py\n```\n```bash\nConnected: \u003cpyhive.hive.Connection object at 0x105a3efd0\u003e\nCursor: \u003cpyhive.hive.Cursor object at 0x1062a5c10\u003e\nSQL: \n    CREATE TABLE fiscales (\n        id INT,\n        name STRING\n    )\nSQL: \n    INSERT INTO fiscales\n    VALUES (1, 'John'), (2, 'Jane'), (3, 'Bob')\nInserted!\nCommitted!\nSQL: SELECT * FROM fiscales\nRow: (1, 'John')\nRow: (2, 'Jane')\nRow: (3, 'Bob')\nRow: (1, 'John')\nRow: (2, 'Jane')\nRow: (3, 'Bob')\nRow: (1, 'John')\nRow: (2, 'Jane')\nRow: (3, 'Bob')\nRow: (1, 'John')\nRow: (2, 'Jane')\nRow: (3, 'Bob')\nRow: (1, 'John')\nRow: (2, 'Jane')\nRow: (3, 'Bob')\nRow: (1, 'John')\nRow: 
(2, 'Jane')\nRow: (3, 'Bob')\nConnection closed!\n```\n\n#### Generating a CSV\n```bash\nvirtualenv -p python3 .env/\nsource .env/bin/activate\npip install -r requirements.txt\npython3 app/test_csv.py\n```\n\nThen look at [result.csv](result.csv):\n```bash\n1,John\n2,Jane\n3,Bob\n1,John\n2,Jane\n3,Bob\n1,John\n2,Jane\n3,Bob\n1,John\n2,Jane\n3,Bob\n```\n\n#### Visualizing the Hive web interface at [http://127.0.0.1:10002/](http://127.0.0.1:10002/)\n\n![hive.png](hive.png)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmartincastroalvarez%2Fapache-hive-docker","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmartincastroalvarez%2Fapache-hive-docker","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmartincastroalvarez%2Fapache-hive-docker/lists"}