{"id":21959902,"url":"https://github.com/cricksmaidiene/snowplough","last_synced_at":"2026-02-26T18:37:05.845Z","repository":{"id":208690862,"uuid":"708437374","full_name":"cricksmaidiene/snowplough","owner":"cricksmaidiene","description":"🏂 A machine learning model that performs topic classification of news articles for media bias analysis. Final project for UC Berkeley MIDS 266 (Natural Language Processing)","archived":false,"fork":false,"pushed_at":"2023-12-15T03:59:00.000Z","size":3320,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-11-05T20:21:27.048Z","etag":null,"topics":["all-the-news","databricks","delta-lake","jupyter-notebook","machine-learning","natural-language-processing","pandas","plotly"],"latest_commit_sha":null,"homepage":"https://cricksmaidiene.github.io/snowplough/","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cricksmaidiene.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-10-22T15:06:35.000Z","updated_at":"2024-09-22T18:58:25.000Z","dependencies_parsed_at":"2023-12-15T04:46:59.712Z","dependency_job_id":null,"html_url":"https://github.com/cricksmaidiene/snowplough","commit_stats":{"total_commits":35,"total_committers":3,"mean_commits":"11.666666666666666","dds":0.4,"last_synced_commit":"1cd7b235cca6e430c8672cc6cf7369070be85778"},"previous_names":["cricksmaidiene/snowplough"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/cricksmaidiene/snowplough","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
cricksmaidiene%2Fsnowplough","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cricksmaidiene%2Fsnowplough/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cricksmaidiene%2Fsnowplough/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cricksmaidiene%2Fsnowplough/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cricksmaidiene","download_url":"https://codeload.github.com/cricksmaidiene/snowplough/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cricksmaidiene%2Fsnowplough/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29867561,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-26T18:27:06.972Z","status":"ssl_error","status_checked_at":"2026-02-26T18:26:57.848Z","response_time":89,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["all-the-news","databricks","delta-lake","jupyter-notebook","machine-learning","natural-language-processing","pandas","plotly"],"created_at":"2024-11-29T09:34:54.106Z","updated_at":"2026-02-26T18:37:05.828Z","avatar_url":"https://github.com/cricksmaidiene.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Snowplough 🏂\n\n\u003e Find More Info on the Project 
Page: [Snowplough Project](https://cricksmaidiene.github.io/snowplough)\n\nA machine learning model that performs topic classification of news articles for media bias analysis. Final project for UC Berkeley MIDS 266 (Natural Language Processing).\n\nEnvironments:\n\n![](https://img.shields.io/badge/Jupyter-F37626.svg?style=for-the-badge\u0026logo=Jupyter\u0026logoColor=white)\n![](https://img.shields.io/badge/Databricks-FF3621.svg?style=for-the-badge\u0026logo=Databricks\u0026logoColor=white)\n![](https://img.shields.io/badge/Python-3776AB.svg?style=for-the-badge\u0026logo=Python\u0026logoColor=white)\n![](https://img.shields.io/badge/Poetry-60A5FA.svg?style=for-the-badge\u0026logo=Poetry\u0026logoColor=white)\n\nLibraries:\n\n![](https://img.shields.io/badge/Anaconda-44A833.svg?style=for-the-badge\u0026logo=Anaconda\u0026logoColor=white)\n![](https://img.shields.io/badge/pandas-150458.svg?style=for-the-badge\u0026logo=pandas\u0026logoColor=white)\n![](https://img.shields.io/badge/NumPy-013243.svg?style=for-the-badge\u0026logo=NumPy\u0026logoColor=white)\n![](https://img.shields.io/badge/TensorFlow-FF6F00.svg?style=for-the-badge\u0026logo=TensorFlow\u0026logoColor=white)\n![](https://img.shields.io/badge/scikitlearn-F7931E.svg?style=for-the-badge\u0026logo=scikit-learn\u0026logoColor=white)\n\nData:\n\n![](https://img.shields.io/badge/Delta-003366.svg?style=for-the-badge\u0026logo=Delta\u0026logoColor=white)\n![](https://img.shields.io/badge/Amazon%20S3-569A31.svg?style=for-the-badge\u0026logo=Amazon-S3\u0026logoColor=white)\n![](https://img.shields.io/badge/Files-4285F4.svg?style=for-the-badge\u0026logo=Files\u0026logoColor=white)\n\n## Installation\n\nSet up a conda virtual environment:\n\n```bash\nconda create --name snowplough python=3.10 -y\nconda activate snowplough\n```\n\nDownload the snowplough source:\n\n```bash\ngit clone https://github.com/cricksmaidiene/snowplough\ncd snowplough\n```\n\nInstall with poetry:\n\n```bash\npoetry 
install\n```\n\nOr with pip:\n\n```bash\npip install .\n```\n\n## Tools \u0026 Infrastructure\n\nAll descriptive analysis, data engineering, processing, and baseline modeling were run in Python-based Databricks notebooks on CPU-backed single-node clusters. Spark was not required; Databricks was chosen mainly because it allows cluster size to vary with the requirements of each project stage. No Databricks-specific commands or dependencies exist, so the **notebooks are platform-agnostic and can be run directly on Jupyter or Google Colab as well**, provided that the Python requirements are met and the requisite hardware is available. A custom handler for Delta Lake (an open-source storage format built on top of Apache Parquet) was used to store data either in the local file system or on AWS S3, in order to manage memory better given the size of All The News v2. The neural-network-based classifiers were trained on P-class and G-class GPU instances made available through AWS \u0026 Databricks. MLflow was used to track and save experimental results during trial-and-error hyperparameter tuning.\n\n## Data Layer\n\nThis project uses [Delta Lake](https://delta.io/) for data storage. The storage location can be either [AWS S3](https://aws.amazon.com/s3/) or the local filesystem. 
The data layer is abstracted away from the user and can be specified when calling `FileSystemHandler` from `src.utils.io` in notebooks.\n\nExample:\n\n```python\nfrom src.utils.io import FileSystemHandler\n\n# AWS S3\ndatafs = FileSystemHandler(\"s3\", s3_bucket=\"snowplough-mids\")\n\n# Local Filesystem\ndatafs = FileSystemHandler(\"local\", local_path=\"/path/to/data/dir\")\n\n# List Tables\ndatafs.listdir(\"/location/catalog/\")\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcricksmaidiene%2Fsnowplough","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcricksmaidiene%2Fsnowplough","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcricksmaidiene%2Fsnowplough/lists"}