{"id":14982339,"url":"https://github.com/gr-menon/bigbanyantree","last_synced_at":"2026-03-06T14:34:02.607Z","repository":{"id":254669130,"uuid":"847195108","full_name":"GR-Menon/BigBanyanTree","owner":"GR-Menon","description":"Gathering insights from Common Crawl using Apache Spark and LLMs.","archived":false,"fork":false,"pushed_at":"2024-11-14T14:43:57.000Z","size":8537,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-02T01:31:43.407Z","etag":null,"topics":["apache-spark","docker","docker-compose","llamafile"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GR-Menon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-25T05:57:09.000Z","updated_at":"2024-12-14T05:33:58.000Z","dependencies_parsed_at":"2024-09-15T02:45:34.611Z","dependency_job_id":null,"html_url":"https://github.com/GR-Menon/BigBanyanTree","commit_stats":{"total_commits":6,"total_committers":2,"mean_commits":3.0,"dds":"0.16666666666666663","last_synced_commit":"0acf3f18529f3b102d24913b7b6b3bb7d6401aad"},"previous_names":["gr-menon/bigbanyantree"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GR-Menon%2FBigBanyanTree","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GR-Menon%2FBigBanyanTree/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GR-Menon%2FBigBanyanTree/releases","manif
ests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GR-Menon%2FBigBanyanTree/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GR-Menon","download_url":"https://codeload.github.com/GR-Menon/BigBanyanTree/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238825707,"owners_count":19537112,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","docker","docker-compose","llamafile"],"created_at":"2024-09-24T14:05:13.882Z","updated_at":"2025-10-29T12:31:33.470Z","avatar_url":"https://github.com/GR-Menon.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# BigBanyanTree\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/bbt.jpg\" width=\"50%\"\u003e\n\u003c/p\u003e      \n\u003cbr\u003e\nBigBanyanTree is an initiative to empower engineering colleges to set up their data engineering clusters and drive\ninterest in data processing and analysis using tools such as Apache Spark.\n\nThis project was made in collaboration with [Suchit](https://www.linkedin.com/in/suchitg04/) under the guidance\nof [Mr. 
Harsh Singhal](https://www.linkedin.com/in/harshsinghal/).\n\nThe endeavour comprised three main steps:\n\n- Set up a ***dedicated Apache Spark cluster*** along with a JupyterLab interface to run Spark jobs.\n- Parse a ***random 1% sample of the Common Crawl data dumps*** spanning the years 2018 to 2024, extracting various\n  attributes.\n- Perform **various analyses on the extracted datasets** and open-source our findings.\n\nCheck out the open-sourced Hugging Face datasets we created\nat [huggingface.co/big-banyan-tree](https://huggingface.co/big-banyan-tree).\n\n### Apache Spark Cluster Setup\n\nWe first set up an Apache Spark cluster in standalone mode on a dedicated Hetzner server. The server setup was\nmade straightforward by using `Docker` and `Docker Compose`.\n\nFor a more in-depth understanding of our Apache Spark cluster setup, check out the following resources:\n\n- [SparkBazaar](https://github.com/GR-Menon/Spark-Bazaar)\n- [BigBanyanTree Setup Blog](https://datascience.fm/zero-to-spark-apache-spark-cluster-setup/)\n- [LLM Service Setup](https://datascience.fm/llamafile-an-executable-llm/)\n  \u003cbr\u003e\n\n### CommonCrawl Data Processing\n\nCommon Crawl releases data dumps every few months, containing the raw HTML source of the web pages it crawls, and\npublishes this data in archival file formats such as **WARC** (Web ARChive).\n\nUnder the BigBanyanTree project, we undertook two main data processing tasks:\n\n- Extracting the JavaScript libraries used by webpages from the `src` attributes of HTML `script` tags, among other fields.\n- Enriching server IP addresses with geolocation data from the MaxMind database.\n\nFor a deep dive into both of these topics, check out our blog posts:\n\n- [JSLib Extraction](https://datascience.fm/parsing-html-source-code-with-apache-spark-selectolax/)\n- [Server MaxMind Data Enrichment](https://datascience.fm/bigbanyantree-enriching-warc-data-with-ip-information-from-maxmind/)\n\n    - [Serializability in 
Spark](https://datascience.fm/serializability-in-spark-using-non-serializable-objects-in-spark-transformations/)\n\n### Extracted Data Analysis\n\nTODO","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgr-menon%2Fbigbanyantree","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgr-menon%2Fbigbanyantree","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgr-menon%2Fbigbanyantree/lists"}