{"id":15208719,"url":"https://github.com/tomkat-cr/data_lakehouse_local_stack","last_synced_at":"2025-02-27T07:31:26.456Z","repository":{"id":245500029,"uuid":"818384810","full_name":"tomkat-cr/data_lakehouse_local_stack","owner":"tomkat-cr","description":"Data Lakehouse local stack with PySpark, Trino, and Minio. Includes an example to process Raygun error data and the IP address occurrence.","archived":false,"fork":false,"pushed_at":"2024-07-03T16:11:42.000Z","size":1429,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-10-12T00:01:20.664Z","etag":null,"topics":["docker-compose","hive","hive-metastore","minio","minio-storage","python","spark","spark-sql","trino"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tomkat-cr.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-21T18:22:34.000Z","updated_at":"2024-07-03T16:11:46.000Z","dependencies_parsed_at":"2024-10-12T00:02:15.444Z","dependency_job_id":"46c4f627-1713-49b2-b9ba-b9680fad1715","html_url":"https://github.com/tomkat-cr/data_lakehouse_local_stack","commit_stats":{"total_commits":22,"total_committers":2,"mean_commits":11.0,"dds":0.2272727272727273,"last_synced_commit":"abc6951146ae496f8214e623b4959189c24c5a1f"},"previous_names":["tomkat-cr/data_lakehouse_local_stack"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomkat-cr%2Fdata_lakehous
e_local_stack","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomkat-cr%2Fdata_lakehouse_local_stack/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomkat-cr%2Fdata_lakehouse_local_stack/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomkat-cr%2Fdata_lakehouse_local_stack/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tomkat-cr","download_url":"https://codeload.github.com/tomkat-cr/data_lakehouse_local_stack/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":219842826,"owners_count":16556560,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["docker-compose","hive","hive-metastore","minio","minio-storage","python","spark","spark-sql","trino"],"created_at":"2024-09-28T07:01:38.368Z","updated_at":"2024-10-16T05:44:15.341Z","avatar_url":"https://github.com/tomkat-cr.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Lakehouse Local Stack\n\n**Data Lakehouse local stack with PySpark, Trino, and Minio**\n\nThis repository aims to introduce the Data Lakehouse pattern as a suitable and flexible solution for companies growing from small businesses into established enterprises. It lets you implement a local data lakehouse from open-source solutions that are compatible with production-grade cloud tools.\n\nIt also includes an [example to process Raygun error data and the IP address occurrence](#raygun-data-processing).\n\n## Introduction\n\nIn the world of ML and AI, **data** is the 
crown jewel, but it's normally lost in [Swamps](https://www.superannotate.com/blog/data-lakes-vs-data-swamps-vs-data-warehouse) due to bad practices with **Data Lakes** when companies try to productionize their data.\n\n**Data Warehouses** are costly solutions for this problem, and increase the complexity of simple Data Lakes.\n\nHere's where **Data Lakehouses** come into action, being a hybrid solution with the best of both worlds. ([source](https://2024.pycon.co/en/talks/23)).\n\n**Data Lakehouses** aim to combine elements of data warehousing with core elements of data lakes. Put simply, they are designed to provide the lower costs of cloud storage even for large amounts of raw data alongside support for certain analytics concepts – such as SQL access to curated and structured data tables stored in relational databases, or support for large scale processing of Big Data analytics or Machine Learning workloads ([source](https://www.exasol.com/resource/data-lake-warehouse-or-lakehouse/)).\n\n\u003c!--\n![Common DatalakeHouse technologies](./images/1676637608474.png)\u003cBR/\u003e\n([Image source](https://www.linkedin.com/pulse/lakehouse-convergence-data-warehousing-science-dr-mahendra/))\n--\u003e\n\n![Common DatalakeHouse technologies](./images/data_lakehouse_new.png)\u003cBR/\u003e\n([Image source](https://www.databricks.com/glossary/data-lakehouse))\n\n## The Medallion Architecture\n\nA medallion architecture is a data design pattern used to logically organize data in a lakehouse, with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture (from Bronze ⇒ Silver ⇒ Gold layer tables). 
Medallion architectures are sometimes also referred to as \"multi-hop\" architectures.\n\n![The Medallion Architecture](./images/building-data-pipelines-with-delta-lake-120823.png)\u003cBR/\u003e\n([Image source](https://www.databricks.com/glossary/medallion-architecture))\n\n* Bronze: ingestion tables (raw data, originals).\n\n* Silver: refined/cleaned tables.\n\n* Gold: feature/aggregated data store.\n\n* Platinum (optional): data copied to a faster store such as a high-speed DBMS, because `gold` lives in cloud bucket storage (like AWS S3), which is too slow for uses such as real-time dashboards.\n\nReadings:\n\n* https://www.databricks.com/glossary/medallion-architecture\n* https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion\n* https://medium.com/@junshan0/medallion-architecture-what-why-and-how-ce07421ef06f\n\n## Data Lakehouse Layers\n\n\u003c!--\n![DatalakeHouse components](./images/Data_Lakehouse_framework_and_process_flow.jpg)\u003cBR/\u003e\n([Image source](https://www.visionet.com/blog/data-lakehouse-the-future-of-modern-data-warehousing-analytics))\n--\u003e\n![DatalakeHouse components](./images/Data_Lakehouse_framework_and_process_flow_AWS.jpg)\u003cBR/\u003e\n([Image source](https://aws.amazon.com/es/blogs/big-data/build-a-lake-house-architecture-on-aws/))\n\nStoring structured and unstructured data in a data lakehouse presents many benefits to a data organization, namely making it easier and more seamless to support both business intelligence and data science workloads. This starts at the data source.\n\nWe describe the five layers in this section, but let’s first talk about the sources that feed the Lake House Architecture.\n\n### Data sources\n\nThe Lake House Architecture enables you to ingest and analyze data from a variety of sources. Many of these sources such as line of business (LOB) applications, ERP applications, and CRM applications generate highly structured batches of data at fixed intervals. 
In addition to internal structured sources, you can receive data from modern sources such as web applications, mobile devices, sensors, video streams, and social media. These modern sources typically generate semi-structured and unstructured data, often as continuous streams.\n\n### Data ingestion layer\n\nThe ingestion layer is responsible for ingesting data into the Lake House storage layer. It provides the ability to connect to internal and external data sources over a variety of protocols. It can ingest and deliver batch as well as real-time streaming data into a data warehouse as well as data lake components of the Lake House storage layer.\n\n### Data storage layer\n\nThe data storage layer is responsible for providing durable, scalable, and cost-effective components to store and manage vast quantities of data. In a Lake House Architecture, the data warehouse and data lake natively integrate to provide an integrated cost-effective storage layer that supports unstructured as well as highly structured and modeled data. The storage layer can store data in different states of consumption readiness, including raw, trusted-conformed, enriched, and modeled.\n\n### Catalog layer\n\nThe catalog layer is responsible for storing business and technical metadata about datasets hosted in the Lake House storage layer. In a Lake House Architecture, the catalog is shared by both the data lake and data warehouse, and enables writing queries that incorporate data stored in the data lake as well as the data warehouse in the same SQL. It allows you to track versioned schemas and granular partitioning information of datasets. As the number of datasets grows, this layer makes datasets in the Lake House discoverable by providing search capabilities.\n\n### Processing layer\n\nComponents in the processing layer are responsible for transforming data into a consumable state through data validation, cleanup, normalization, transformation, and enrichment. 
The processing layer provides purpose-built components to perform a variety of transformations, including data warehouse style SQL, big data processing, and near-real-time ETL.\n\n### Data consumption layer\n\nThe data consumption layer is responsible for providing scalable and performant components that use unified Lake House interfaces to access all the data stored in Lake House storage and all the metadata stored in the Lake House catalog. It democratizes analytics to enable all personas across an organization by providing purpose-built components that enable analysis methods, including interactive SQL queries, warehouse style analytics, BI dashboards, and ML.\n\nSources:\n* [https://aws.amazon.com/es/blogs/big-data/build-a-lake-house-architecture-on-aws/](https://aws.amazon.com/es/blogs/big-data/build-a-lake-house-architecture-on-aws/)\n* [https://www.montecarlodata.com/blog-data-lakehouse-architecture-5-layers/](https://www.montecarlodata.com/blog-data-lakehouse-architecture-5-layers/)\n* [https://www.visionet.com/blog/data-lakehouse-the-future-of-modern-data-warehousing-analytics](https://www.visionet.com/blog/data-lakehouse-the-future-of-modern-data-warehousing-analytics)\n\n## Data LakeHouse Components by Cloud Providers\n\n![DatalakeHouse Components by Cloud](./images/7316_cloud-data-lakehouse-success-story.002.png)\u003cBR/\u003e\n([Image source](https://www.mssqltips.com/sqlservertip/7316/cloud-data-lakehouse-success-story-architecture-outcomes-lessons-learned/))\n\n## Local Data Lakehouse Components\n\n\u003c!--\n![DatalakeHouse components](./images/1_xuE8_N_LxoP49S1Pu5Wn5A.webp)\u003cBR/\u003e\n([Image source](https://medium.com/adfolks/data-lakehouse-paradigm-of-decade-caa286f5b7a1))\n\n![Common DatalakeHouse technologies](./images/Data_Lake_vs_Data_Warehouse.webp)\u003cBR/\u003e\n([Image source](https://www.montecarlodata.com/blog-data-lake-vs-data-warehouse))\n--\u003e\n![Common DatalakeHouse 
technologies](./images/Data_lakehouse_local_components.CR.svg)\n\n* **Minio**\u003cBR/\u003e\n  Data sources, storage layer, landing buckets.\u003cBR/\u003e\n  [https://min.io](https://min.io)\n\n* **Apache Spark**\u003cBR/\u003e\n  Processing layer, compute.\u003cBR/\u003e\n  [https://spark.apache.org](https://spark.apache.org)\u003cBR/\u003e\n  [https://spark.apache.org/docs/latest/api/python/index.html](https://spark.apache.org/docs/latest/api/python/index.html)\u003cBR/\u003e\n  [https://sparkbyexamples.com/pyspark-tutorial/](https://sparkbyexamples.com/pyspark-tutorial/)\u003cBR/\u003e\n\n* **Apache Hive**\u003cBR/\u003e\n  Data catalog, catalog layer, metadata.\u003cBR/\u003e\n  [https://hive.apache.org](https://hive.apache.org)\n\n* **Postgres Database**\u003cBR/\u003e\n  Data catalog persistence.\u003cBR/\u003e\n  [https://www.postgresql.org](https://www.postgresql.org)\n\n* **Trino**\u003cBR/\u003e\n  Query engine and data governance.\u003cBR/\u003e\n  [https://trino.io](https://trino.io)\n\n## Other Components\n\n* **SQLAlchemy**\u003cBR/\u003e\n  [https://docs.sqlalchemy.org](https://docs.sqlalchemy.org)\n\n* **Pandas**\u003cBR/\u003e\n  [https://pandas.pydata.org](https://pandas.pydata.org)\n\n* **JupyterLab**\u003cBR/\u003e\n  [https://docs.jupyter.org](https://docs.jupyter.org)\n\n* **Delta Lake**\u003cBR/\u003e\n  Open table format.\u003cBR/\u003e\n  [https://delta.io](https://delta.io)\u003cBR/\u003e\n  Open source framework developed by Databricks. Like other modern table formats, it employs file-level listings, improving the speed of queries considerably compared to the directory-level listing of Hive. 
It offers enhanced CRUD operations, including the ability to update and delete records in a data lake, which would previously have been immutable.\u003cBR/\u003e\n  (Click [here](https://www.starburst.io/data-glossary/open-table-formats/) for more information about Open Table Formats).\u003cBR/\u003e\n\n## Requirements\n\n* [Git](https://www.atlassian.com/git/tutorials/install-git)\n* Make: [Mac](https://formulae.brew.sh/formula/make) | [Windows](https://stackoverflow.com/questions/32127524/how-to-install-and-use-make-in-windows)\n* [Docker and Docker Compose](https://www.docker.com/products/docker-desktop)\n* [wget](https://www.jcchouinard.com/wget-install/)\n\n## Usage\n\nClone the repository:\n\n```bash\ngit clone https://github.com/tomkat-cr/data_lakehouse_local_stack.git\ncd data_lakehouse_local_stack\n```\n\nDownload the required packages:\n\n```bash\nmake install\n```\n\n**IMPORTANT**: this process will take a long time, depending on your Internet connection speed.\n\nStart the local stack (or `spark stack`):\n\n```bash\nmake run\n```\n\nRun the local Jupyter engine:\n\n```bash\nmake open_local_jupiter\n```\n\nRun the local Minio explorer:\n\n```bash\nmake open_local_minio\n```\n\n## Raygun Data Processing\n\nTo process a single Raygun error, you need to download the Raygun data (a series of JSON files, one for each occurrence of the error, with all the data shown in Raygun), upload the data to S3 (managed locally by Minio), ingest the data using Spark, build the Hive metastore, and finally run the SQL queries to get the data analytics.\n\n### Raygun Data Preparation\n\n1. Go to Raygun ([https://app.raygun.com](https://app.raygun.com)), and select the corresponding Application.\n\n2. Put the checkmark in the error you want to analyze.\n\n3. Click on the `Export selected groups` button.\n\n4. Click on the `Add to export list` button.\n\n5. A message like this will be shown:\n\n```\nGreat work! 
Those error groups have been added to the export list.\nView your exports by clicking here, or by using the \"Export\" link under \"Crash Reporting\" in the sidebar.\n```\n\n6. Under `Crash Reporting` in the sidebar, click on the `Export` link.\n\n7. Click on the `Start export` button.\n\n```\nConfirm export\n\nExport all errors between XXXX and YYYY.\n\nExports can take some time depending on the volume of errors being exported. You will be notified when your export is ready to be downloaded. Once an export is started, another cannot begin until the first has completed.\n\nExports are generated as a single compressed file. [Learn how to extract them](https://raygun.com/documentation/product-guides/crash-reporting/exporting-data/)\n\nRecipients:\nexample@address.com\n```\n\n8. Click on the `Start export` button.\n\n9. Wait until the compressed file arrives in your email inbox.\n\n10. A message like this arrives in your inbox:\n\n```\nSubject: Your Raygun Crash Reporting export is complete for \"XXXX\"\n\nYour error export has been generated\nWe have completed the error group export for your application \"XXXX\". You can now download it below.\nDownload export - XXX MB\nLearn how to extract 7z files\n```\n\n11. Click on the `Download export - XXX MB` link.\n\n12. Put the compressed file in a local directory like `${HOME}/Downloads/raygun-data`\n\n13. Uncompress the file.\n\n14. A new directory will be created with the JSON files, each one with an error for a date/time, in the directory: `${HOME}/Downloads/raygun-data/assets`\n\n15. To check the total input file size:\n\nSet an environment variable with the path:\n\n```sh\nORIGINAL_JSON_FILES_PATH=\"${HOME}/Downloads/raygun-data/assets\"\n```\n\nGet the total size for all files downloaded:\n\n```sh\ndu -sh ${ORIGINAL_JSON_FILES_PATH}\n```\n\nCount the number of files in that directory:\n\n```sh\nls ${ORIGINAL_JSON_FILES_PATH} | wc -l\n```\n\n16. 
Move the files to the `data/raygun` directory in the Project, or perform the `Large number of input data files` procedure (see next section) to link the `${HOME}/Downloads/raygun-data/assets` directory to the `data/raygun` directory.\n\n### Large number of input data files\n\nIf you have more than 1000 raw data input files, you can use the following procedure to mount the input files directory in the local stack `data` directory:\n\n1. Edit the docker-compose configuration file in the project's root:\n\n```sh\nvi ./docker-compose.yml\n```\n\n2. Add the `data/any_directory_name` input files directory in the `volumes` section, replacing `any_directory_name` with your own directory name, e.g. `raygun`:\n\nFile: `./docker-compose.yml`\n\n```yaml\nversion: \"3.9\"\nservices:\n  spark:\n      .\n      .\n    volumes:\n      - /path/to/input/directory/:/home/LocalLakeHouse/Project/data/any_directory_name\n```\n\nSo your massive input files will be under the `data/any_directory_name` directory.\n\n3. Alternatively, use a symbolic link:\n\n```sh\nln -s /path/to/input/directory/ data/any_directory_name\n```\n\n4. Once you finish the ingestion process (see `Raygun Data Ingestion` section), the link can be removed:\n\nExit the `spark` docker container by pressing Ctrl-D (or running the `exit` command), and run:\n\n```sh\nunlink data/any_directory_name\n```\n\n### Configuration\n\n1. Change the current directory to the Raygun processing path:\n\n```sh\ncd /home/LocalLakeHouse/Project/batch_processing/raygun_ip_processing\n```\n\n2. Copy the configuration template:\n\n```sh\ncp processing.example.env processing.env\n```\n\n3. Edit the configuration file:\n\n```sh\nvi processing.env\n```\n\n4. Assign the Spark App Name:\n\n```env\nSPARK_APPNAME=RaygunErrorTraceAnalysis\n```\n\n5. 
Assign the data input sub-directory (under `BASE_PATH` or `/home/LocalLakeHouse/Project`):\n\n```env\n# Local directory path containing JSON files\nINPUT_LOCAL_DIRECTORY=\"data/raygun\"\n```\n\n6. Assign the Local S3 (Minio) raw data input sub-directory:\n\n```env\n# S3 prefix (directory path in the bucket) to store raw data read from\n# the local directory\nS3_PREFIX=\"Raw/raygun\"\n```\n\n7. Assign the Cluster storage bucket name, to save the processed Dataframes and be able to resume the execution from the last one when any error aborts the process:\n\n```env\n# Dataframe cluster storage bucket name\nDF_CLUSTER_STORAGE_BUCKET_PREFIX=\"ClusterData/RaygunIpSummary\"\n```\n\n8. Assign the attribute name and alias for the IP Address in the Raygun JSON files:\n\n```env\n# Desired attribute and alias to filter one column\nDESIRED_ATTRIBUTE=\"Request.IpAddress\"\nDESIRED_ALIAS=RequestIpAddress\n```\n\n9. Assign the results sub-directory (under `/home/LocalLakeHouse/Project/Outputs`):\n\n```env\n# Final output result sub-directory\nRESULTS_SUB_DIRECTORY=raygun_ip_addresses_summary\n```\n\n10. Assign the page size for reading input files, to prevent errors when building the file list for a large number of files:\n\n```env\n# S3 pagination page size: 1000 files chunks\nS3_PAGE_SIZE=1000\n```\n\n11. Define the batch size, to read the JSON files in batches during the Spark Dataframe creation. Adjust the batch size based on your memory capacity and data size. 5000 works well on a MacBook with 16 GB of RAM.\n\n```env\n# Spark Dataframe creation batch size: 5000 files per batch.\nDF_READ_BATCH_SIZE=5000\n```\n\n12. Assign the Spark driver memory. For example, the Raygun JSON event files are about 16 KB each (small files).\n\n```env\n# Spark driver memory\n# \"3g\" for small files it's better 2-3g\n# \"12g\" for big files with more data it's better 4-5g\nSPARK_DRIVER_MEMORY=3g\n```\n\n13. 
Assign the Spark number of partitions, to optimize parallel processing and memory usage:\n\n```env\n# Repartition the DataFrame to optimize parallel processing and memory usage\n# (Adjust the number of partitions based on your environment and data size,\n#  workload and cluster setup)\nDF_NUM_PARTITIONS=200\n```\n\n14. Assign the number of batches used to save data into the Apache Hive metastore, to prevent errors when saving a large number of files.\n\n```env\n# Number of batches to Save Data into Apache Hive\nHIVE_BATCHES=10\n```\n\n15. For other parameters check the [Raygun IP processing main Python code](batch_processing/raygun_ip_processing/main.py), function `get_config()`.\n\n\n### Raygun Data Copy to S3 / Minio\n\n1. In a terminal window, restart the `spark stack`:\n\nIf the local stack is already running, press Ctrl-C to stop it.\n\nRun this command:\n\n```sh\nmake run\n```\n\nThe spark stack docker container will have a `/home/LocalLakeHouse/Project` directory that maps to the Project's root directory on your local computer.\n\n2. Open a second terminal window and enter the `spark` docker container:\n\n```sh\ndocker exec -ti spark bash\n```\n\n3. Then run the load script:\n\n```sh\ncd /home/LocalLakeHouse/Project\nsh ./Scripts/1.init_minio.sh data/raygun\n```\n\n### Raygun Data Ingestion\n\n1. Run the ingest process:\n\nOpen a terminal window and enter the `spark` docker container:\n\n```sh\ndocker exec -ti spark bash\n```\n\nRun the ingestion process from scratch:\n\n```sh\ncd Project\nMODE=ingest make raygun_ip_processing\n```\n\nWhen the ingestion process ends, the `Build the Hive Metastore` and `Raygun Data Reporting` steps will run as well.\n\n\n2. 
If the process stops, copy the counter XXXX after the last `Persisting...` message:\n\nFor example:\n\n```\nPersisting DataFrame to disk (round X)...\n3) From: XXXX | To: YYYY\n```\n\nThen run:\n\n```sh\nMODE=ingest FROM=XXXX make raygun_ip_processing\n```\n\n### Build the Hive Metastore\n\nIf the ingestion process stops and you don't want to resume/finish it, you can build the Hive metastore from the data ingested so far, so that you can still run the SQL queries and build the reports:\n\n```sh\nMODE=hive_process make raygun_ip_processing\n```\n\n### Raygun Data Reporting\n\n1. To run the default SQL query using Spark:\n\n```sh\nMODE=spark_sql make raygun_ip_processing\n```\n\nThis process generates a report in the Project's `Outputs/raygun_ip_addresses_summary` directory, where you will find a `part-00000-\u003csome-hexadecimal-code\u003e.csv` file with the results.\n\nThe default SQL query that generates the results file is:\n\n```sql\nSELECT RequestIpAddress, COUNT(*) AS IpCount\n  FROM raygun_error_traces\n  GROUP BY RequestIpAddress\n  ORDER BY IpCount DESC\n```\n\nThe result will be the list of IP addresses and the number of times each IP appears across all the Requests evaluated.\n\n2. To run a custom SQL query using Spark:\n\n```sh\nSQL='SELECT RequestIpAddress FROM raygun_error_traces GROUP BY RequestIpAddress' MODE=spark_sql make raygun_ip_processing\n```\n\n3. To run the default SQL query using Trino:\n\n```sh\nMODE=trino_sql make raygun_ip_processing\n```\n\n### Ingest process stats\n\n1. To check the Dataframe cluster S3 (Minio) files:\n\nThis lets you follow the progress of the ingestion process: during the Dataframe build step, files are processed in batches (groups of 5,000 files), and each time a batch finishes, the result is written to the S3 cluster directory. 
If the ingestion process stops, it can be resumed from the last executed batch, and the processed files will be appended to the S3 cluster directory.\n\nSet an environment variable with the path:\n\n```sh\nCLUSTER_DIRECTORY=\"/home/LocalLakeHouse/Project/Storage/minio/data-lakehouse/ClusterData/RaygunIpSummary\"\n```\n\nThe Parquet files resulting from the Dataframe build step of the ingestion process are in the path assigned to `CLUSTER_DIRECTORY`.\n\nGet the total size for all files currently processed in the Dataframe:\n\n```sh\ndu -sh ${CLUSTER_DIRECTORY}\n```\n\nCount the number of files in that directory:\n\n```sh\nls ${CLUSTER_DIRECTORY} | wc -l\n```\n\n2. To check the raw input JSON files uploaded to S3 (Minio):\n\nSet an environment variable with the path:\n\n```sh\nS3_JSON_FILES_PATH=\"/home/LocalLakeHouse/Project/Storage/minio/data-lakehouse/Raw/raygun\"\n```\n\nGet the total size for all files:\n\n```sh\ndu -sh ${S3_JSON_FILES_PATH}\n```\n\nCount the number of files in that directory:\n\n```sh\nls ${S3_JSON_FILES_PATH} | wc -l\n```\n\n## Minio UI\n\n1. Run the local Minio explorer:\n\n```bash\nmake open_local_minio\n```\n\n2. This URL will open automatically in a browser: [http://127.0.0.1:9001](http://127.0.0.1:9001)\n\n3. The credentials for the login are in the `minio.env` file:\u003cBR/\u003e\n   Username (`MINIO_ACCESS_KEY`): minio_ak\u003cBR/\u003e\n   Password (`MINIO_SECRET_KEY`): minio_sk\u003cBR/\u003e\n\n## Monitor Spark processes\n\nTo access the `pyspark` web UI:\n\n1. Run the following command in a terminal window:\n\n```sh\nmake open_pyspark_ui\n```\n\nIt runs the following command:\n\n```sh\ndocker exec -ti spark pyspark\n```\n\n2. Go to this URL in a browser: [http://127.0.0.1:4040](http://127.0.0.1:4040)\n\n## Run the local Jupyter Notebooks\n\n1. Run the local Jupyter engine:\n\n```bash\nmake open_local_jupiter\n```\n\n2. 
This URL will open automatically in a browser: [http://127.0.0.1:8888/lab](http://127.0.0.1:8888/lab)\n\n3. A screen will appear asking for the `Password or token` to authenticate.\u003cBR/\u003e\n   It can be found in the `docker attach` screen (the one that stays running when you execute `make run` to start the `spark stack`).\n\n4. Search for a message like this:\u003cBR/\u003e\n    `http://127.0.0.1:8888/lab?token=xxxx`\n\n5. The `xxxx` is the `Password or token` to authenticate.\n\n## Connect to the Jupyter Server in VSC\n\nTo connect to the Jupyter Server in VSC (Visual Studio Code):\n\n1. In the docker attach screen, look for a message like this:\u003cBR/\u003e\n    `http://127.0.0.1:8888/lab?token=xxxx`\n\n2. The `xxxx` is the password to use when the Jupyter Kernel Connection asks for it.\n\n3. Then select the `Existing Jupyter Server` option.\n\n4. Specify the URL: `http://127.0.0.1:8888`\n\n5. Specify the password copied in the second step: `xxxx`\n\n6. Select the desired Kernel from the list.\n\nVSC will then be connected to the Jupyter Server.\n\n## Pokemon Data preparation\n\nTo prepare the data for the Pokemon Data Ingestion:\n\n1. Download the compressed files from: [https://github.com/alejogm0520/lakehouses-101-pycon2024/tree/main/data](https://github.com/alejogm0520/lakehouses-101-pycon2024/tree/main/data)\n\n2. Copy those files to the `data` directory.\n\n3. Decompress the example data files:\n\n```bash\ncd data\nunzip moves.zip\nunzip pokemon.zip\nunzip types.zip\n```\n\n### Jupyter notebooks\n\n* [Raygun data ingestion](notebooks/Raygun-data-ingestion.ipynb)\n\n* [Pokemon data ingestion](notebooks/Pokemon-data-ingestion.ipynb)\n\n## License\n\nThis is open-source software licensed under the [MIT](LICENSE) license.\n\n## Credits\n\nThis project is maintained by [Carlos J. 
Ramirez](https://www.carlosjramirez.com).\n\nIt was forked from the [Data Lakehouse 101](https://github.com/alejogm0520/lakehouses-101-pycon2024) repository made by [Alejandro Gómez Montoya](https://github.com/alejogm0520).\n\nFor more information or to contribute to the project, visit [Data Lakehouse Local Stack on GitHub](https://github.com/tomkat-cr/data_lakehouse_local_stack).\n\nHappy Coding!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftomkat-cr%2Fdata_lakehouse_local_stack","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftomkat-cr%2Fdata_lakehouse_local_stack","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftomkat-cr%2Fdata_lakehouse_local_stack/lists"}