{"id":26896056,"url":"https://github.com/sofiasawczenko/factored_datathon_2023","last_synced_at":"2025-07-07T01:35:57.780Z","repository":{"id":184339055,"uuid":"670813288","full_name":"sofiasawczenko/factored_datathon_2023","owner":"sofiasawczenko","description":"This Datathon leverages two primary data sources related to 82 million product reviews on Amazon and Amazon metadata, using Databricks for processing and analysis.","archived":false,"fork":false,"pushed_at":"2024-12-18T11:39:35.000Z","size":34,"stargazers_count":0,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-01T02:59:29.045Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sofiasawczenko.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-07-25T22:42:03.000Z","updated_at":"2024-12-18T17:50:16.000Z","dependencies_parsed_at":"2023-07-28T02:50:01.695Z","dependency_job_id":null,"html_url":"https://github.com/sofiasawczenko/factored_datathon_2023","commit_stats":null,"previous_names":["sofiasawczenko/factored-datathon-2023-data-hackers"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/sofiasawczenko/factored_datathon_2023","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sofiasawczenko%2Ffactored_datathon_2023","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sofiasawczenko%2Ffactored_datathon_2023/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sofiasawczenko%2Ffactored_datathon_2023/releas
es","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sofiasawczenko%2Ffactored_datathon_2023/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sofiasawczenko","download_url":"https://codeload.github.com/sofiasawczenko/factored_datathon_2023/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sofiasawczenko%2Ffactored_datathon_2023/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263998578,"owners_count":23541908,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-04-01T02:59:31.581Z","updated_at":"2025-07-07T01:35:57.736Z","avatar_url":"https://github.com/sofiasawczenko.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Factored Datathon 2023\n\nThis project leverages two primary data sources related to product reviews on Amazon, using Databricks for processing and analysis.\n\n## 1. Batch-format Tables\n\nThese tables contain valuable insights and information about product reviews on Amazon. They are structured in batches, distributed in partitions, and stored as JSON files in an Azure Data Lake Storage instance. Specifically, the following two tables are used:\n\n- **Amazon Product Reviews**: This dataset consists of 82.83 million unique product reviews contributed by approximately 20 million users.\n- **Amazon Metadata**: This table contains crucial product descriptions and metadata for all products in the dataset.\n\n## 2. 
Streaming Data - Amazon Product Reviews\n\nIn addition to the batch-format tables, real-time data updates are available through a streaming mode. This streaming topic continuously receives new data as it becomes available, keeping the dataset up-to-date with the latest developments.\n\n```python\nspark.conf.set(\"fs.azure.account.auth.type.{0}.dfs.core.windows.net\".format(\"safactoreddatathon\"), \"SAS\")\nspark.conf.set(\"fs.azure.sas.token.provider.type.{0}.dfs.core.windows.net\".format(\"safactoreddatathon\"), \"org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider\")\n```\n### Installation of Required Libraries\nTo get started with the necessary packages, install the required libraries using the following commands:\n\n```python\n!pip install pyspark\n!pip install azure-eventhub\n```\n### Spark Session and EventHub Consumer\nInitialize the SparkSession and EventHub consumer client for handling real-time events.\n```python\nfrom azure.eventhub import EventHubConsumerClient\nfrom pyspark.sql import SparkSession\n\n# Define Event Hub Connection Parameters\neventhub_namespace = \"factored-datathon\"\neventhub_name = \"factored_datathon_amazon_review_1\"\nlisten_policy_key = \"sJJnyi8GGTBAa55jY89kacoT6hXAzWx2B+AEhCPEKYE=\"\nlisten_policy_connection_string = \"Endpoint=sb://factored-datathon.servicebus.windows.net/;SharedAccessKeyName=datathon_listener;SharedAccessKey=sJJnyi8GGTBAa55jY89kacoT6hXAzWx2B+AEhCPEKYE=;EntityPath=factored_datathon_amazon_review\"\n\n# Event processing function\ndef on_event(partition_context, event):\n    print(\"Received event from partition: {}\".format(partition_context.partition_id))\n    print(\"Data: {}\".format(event.body_as_str(encoding='UTF-8')))\n    print(\"Properties: {}\".format(event.properties))\n\n# EventHub Consumer Client setup\nconnection_str = listen_policy_connection_string\nconsumer_client = EventHubConsumerClient.from_connection_string(connection_str, consumer_group=\"$Default\", eventhub_name=eventhub_name)\n\n# 
Start receiving events\ntry:\n    with consumer_client:\n        consumer_client.receive(on_event=on_event, starting_position=\"-1\")\nexcept KeyboardInterrupt:\n    print(\"Receiving has stopped.\")\n```\n## 3. Accessing the Data\nUse the following command to access the batch-format data stored in Azure Data Lake:\n\n```python\ndbutils.fs.ls(\"abfss://source-files@safactoreddatathon.dfs.core.windows.net/amazon_reviews/\")\n```\nThis command lists the contents of the amazon_reviews folder in the Azure Data Lake.\n\n### Example Output:\n```python\n[FileInfo(path='abfss://source-files@safactoreddatathon.dfs.core.windows.net/amazon_reviews/partition_1/', name='partition_1/', size=0, modificationTime=1689569806000),\n FileInfo(path='abfss://source-files@safactoreddatathon.dfs.core.windows.net/amazon_reviews/partition_10/', name='partition_10/', size=0, modificationTime=1689569900000),\n FileInfo(path='abfss://source-files@safactoreddatathon.dfs.core.windows.net/amazon_reviews/partition_100/', name='partition_100/', size=0, modificationTime=1689570860000),\n FileInfo(path='abfss://source-files@safactoreddatathon.dfs.core.windows.net/amazon_reviews/partition_1000/', name='partition_1000/', size=0, modificationTime=1689580431000),\n ...]\n```\nEach entry is a partition directory of product reviews, which can then be read for further processing and analysis.\n\n## Conclusion\nThe Factored Datathon 2023 project integrates both batch and streaming data sources to analyze product reviews from Amazon. By using Spark, Databricks, and Azure Data Lake, it facilitates efficient processing of large datasets and real-time data streaming. 
The setup includes necessary libraries for PySpark and Azure EventHub to handle data processing and event-driven workflows.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsofiasawczenko%2Ffactored_datathon_2023","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsofiasawczenko%2Ffactored_datathon_2023","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsofiasawczenko%2Ffactored_datathon_2023/lists"}