{"id":23230664,"url":"https://github.com/brynlai/data-engineering-assignment-rdsy2s2","last_synced_at":"2025-04-05T19:20:27.942Z","repository":{"id":263989714,"uuid":"892013369","full_name":"Brynlai/Data-Engineering-Assignment-RDSY2S2","owner":"Brynlai","description":"This repository contains a data engineering project aimed at processing and analyzing scraped data using PySpark, Redis, and Neo4j. The goal is to efficiently store, process, and analyze text data.","archived":false,"fork":false,"pushed_at":"2025-01-11T06:00:20.000Z","size":1525,"stargazers_count":1,"open_issues_count":3,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-11T17:14:51.061Z","etag":null,"topics":["data-engineering","gemini-ai","google","hadoop","kafka","neo4j","pyspark","redis"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Brynlai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-21T11:08:11.000Z","updated_at":"2025-01-19T11:42:59.000Z","dependencies_parsed_at":"2024-12-11T09:19:28.886Z","dependency_job_id":"083e239e-8e5d-4af6-82cb-2d149d3e8fae","html_url":"https://github.com/Brynlai/Data-Engineering-Assignment-RDSY2S2","commit_stats":null,"previous_names":["brynlai/data-engineering-assignment-rdsy2s2"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Brynlai%2FData-Engineering-Assignment-RDSY2S2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposito
ries/Brynlai%2FData-Engineering-Assignment-RDSY2S2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Brynlai%2FData-Engineering-Assignment-RDSY2S2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Brynlai%2FData-Engineering-Assignment-RDSY2S2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Brynlai","download_url":"https://codeload.github.com/Brynlai/Data-Engineering-Assignment-RDSY2S2/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247387095,"owners_count":20930775,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-engineering","gemini-ai","google","hadoop","kafka","neo4j","pyspark","redis"],"created_at":"2024-12-19T02:11:03.660Z","updated_at":"2025-04-05T19:20:27.919Z","avatar_url":"https://github.com/Brynlai.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Engineering Assignment\n\n## All the api keys in the repo are dead! Use your own keys.\n\n## Description\nThis project involves processing and analyzing scraped data using PySpark, Redis, and Neo4j. The aim is to store, process, and analyze text data efficiently.\n\n## Usage\n\n### Starting Services\n0. Open Powershell in Administrator mode and run wsl:\n    ```bash\n    wsl ~\n    ```\n1. Start Hadoop and Spark services:\n   ```bash\n   start-dfs.sh\n   ```\n   ```bash\n   start-yarn.sh\n   ```\n2. 
Start Kafka and Zookeeper:\n   \u003e Note: Wait for about 30 seconds before performing the next step.\n   ```bash\n   zookeeper-server-start.sh $KAFKA_HOME/config/zookeeper.properties \u0026\n   ```\n   ```bash\n   kafka-server-start.sh $KAFKA_HOME/config/server.properties \u0026\n   ```\n    \n\n3. Switch to student:\n    ```bash\n    su - student\n    ```\n### Running Notebooks (Currently not working if you run scrape_articles_into_words.ipynb while the consumer is running.)\n1. Activate the virtual environment and start Jupyter Lab:\n   ```bash\n   source de-prj/de-venv/bin/activate\n   jupyter lab\n   ```\n   \n2. Open two PowerShell terminals from Windows, then go to your working directory in each; the prompt should look like (de-venv) student@R2D3:~/urdirectory$\n3. (To show Kafka working) cd into the directory that contains both files:\n    - Producer Terminal:\n       ```bash\n       python kafka_producer_show.py\n        ```\n   - Consumer Terminal:\n       ```bash\n        spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.13:3.5.1 kafka_consumer_show.py\n        ```\n     \u003e [!IMPORTANT]  \n        DO NOT RUN \n        \"$ python kafka_producer_show.py\"\n        when scrape_articles_into_words.ipynb or neo4j.ipynb is running.\n        \"kafka_consumer_show.py\" can run in the background. \n\n4. Run the notebooks in this sequence:\n   - `scrape_articles_into_words.ipynb`\n   - `neo4j.ipynb`\n\n\n### Stopping Services\n\n1. Stop Kafka and Zookeeper:\n   \u003e Note: Wait for about 30 seconds before performing the next step.\n   ```bash\n   kafka-server-stop.sh\n   ```\n   ```bash\n   zookeeper-server-stop.sh\n   ```\n2. 
Stop Hadoop and Spark services:\n   ```bash\n   stop-yarn.sh\n   ```\n   ```bash\n   stop-dfs.sh\n   ```\n\n\n## Data Storage and Processing\n\n### Data Collection and Raw Storage\n- **What to Store**: Raw scraped text data.\n- **Where to Store**: Hadoop HDFS.\n- **Tool**: PySpark for ingestion and Hadoop for storage.\n\n### Processed Data\n- **What to Store**: Cleaned and tokenized text.\n- **Where to Store**: Hadoop HDFS or a relational database.\n- **Tool**: PySpark for preprocessing.\n\n### Lexicon\n- **What to Store**: Words with definitions, relationships, and POS annotations.\n- **Where to Store**: Neo4j for relationships; Redis for fast retrieval.\n- **Tool**: Neo4j and Redis.\n\n### Analytics\n- **What to Store**: Analytical results.\n- **Where to Store**: Local files, Neo4j, and Redis.\n- **Tool**: Neo4j.\n\n### Real-Time Updates\n- **What to Store**: New and updated words.\n- **Where to Store**: Kafka for message streaming.\n- **Tool**: Kafka and Spark Structured Streaming.\n\n## Decision Highlights\n- **Neo4j**: For storing and querying word relationships.\n- **Redis**: For fast key-value lookups.\n- **Hadoop HDFS**: For scalable storage of raw and processed data.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrynlai%2Fdata-engineering-assignment-rdsy2s2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbrynlai%2Fdata-engineering-assignment-rdsy2s2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrynlai%2Fdata-engineering-assignment-rdsy2s2/lists"}