{"id":25939145,"url":"https://github.com/shruti-h/github-topic-repo-scraper","last_synced_at":"2026-04-11T09:02:25.446Z","repository":{"id":280401592,"uuid":"941857353","full_name":"Shruti-H/github-topic-repo-scraper","owner":"Shruti-H","description":"A Python-based web scraper that extracts GitHub topics and their top repositories, saving the data into structured CSV files for analysis.","archived":false,"fork":false,"pushed_at":"2025-03-03T07:40:54.000Z","size":41,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-03T08:29:18.450Z","etag":null,"topics":["automation","beautifulsoup","datascience","github","opensource","pandas","python","requests","scraper","webscraping"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Shruti-H.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-03T07:01:06.000Z","updated_at":"2025-03-03T07:43:55.000Z","dependencies_parsed_at":"2025-03-03T08:39:58.365Z","dependency_job_id":null,"html_url":"https://github.com/Shruti-H/github-topic-repo-scraper","commit_stats":null,"previous_names":["shruti-h/github-topic-repo-scraper"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Shruti-H/github-topic-repo-scraper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Shruti-H%2Fgithub-topic-repo-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Shruti-H%2Fgithub-topic-repo-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Shruti-H%2Fgithub-topic-repo-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Shruti-H%2Fgithub-topic-repo-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Shruti-H","download_url":"https://codeload.github.com/Shruti-H/github-topic-repo-scraper/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Shruti-H%2Fgithub-topic-repo-scraper/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31674624,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-11T08:18:19.405Z","status":"ssl_error","status_checked_at":"2026-04-11T08:17:08.892Z","response_time":54,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automation","beautifulsoup","datascience","github","opensource","pandas","python","requests","scraper","webscraping"],"created_at":"2025-03-04T04:15:50.398Z","updated_at":"2026-04-11T09:02:25.410Z","avatar_url":"https://github.com/Shruti-H.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# **GitHub Topics \u0026 Repositories Scraper**  \n\nA Python-based web scraper that extracts **GitHub topics and their top repositories**, then saves the data into structured CSV files for further analysis.\n\n---\n\n## **📌 Project Overview**\nThis project **scrapes GitHub Topics** from the [GitHub Topics Page](https://github.com/topics) and retrieves information about **top repositories** under each topic.  \nThe scraped data is then **saved as CSV files** inside the `Data/` directory.\n\n---\n\n## **🛠 Tools Used**\nThis project was built using the following technologies:  \n- **Python** - The core programming language  \n- **BeautifulSoup** - For parsing and extracting HTML data  \n- **Requests** - To send HTTP requests and fetch web pages  \n- **Pandas** - For data manipulation and exporting to CSV  \n\n---\n\n## **🔍 Features**\n- Extracts a list of top **GitHub topics** (Machine Learning, Data Science, etc.)\n- Retrieves **top repositories** under each topic\n- Extracts **repository details** (username, repo name, stars, repo URL)\n- Saves all data into **CSV files** for easy analysis\n- Implements **logging** to track scraping progress and errors\n\n---\n\n## **📂 Project Structure**\n```\n├── GitHub_Scraper.ipynb    # Jupyter Notebook containing the scraper\n├── LOGS/                   # Stores log files for debugging\n├── Data/                   # Stores CSV files with scraped data\n└── README.md               # Project documentation\n```\n\n## **⚙️ How It Works**\n1. **Scrape GitHub Topics** - The script fetches the top topics from GitHub and extracts their names, descriptions, and URLs.  \n2. **Scrape Top Repositories** - For each topic, the script scrapes **top repositories**, extracting details like **username, repo name, star count, and repo URL**.  \n3. **Save Data to CSV** - Extracted data is stored in structured **CSV files**, one for each topic.  \n4. **Logging for Debugging** - A logging system is implemented to track the scraping progress and handle errors.  \n\n---\n\n## **📊 Example Output**\nExample **3D** topic data stored in `Data/3d.csv`:\n\n![Example CSV Screenshot](https://github.com/user-attachments/assets/d949f6e8-e2e2-44fd-b848-cf2a45e79214)\n\n---\n\n## **🔧 Setup \u0026 Installation**\n### **1. Clone the Repository**\n```sh\ngit clone https://github.com/Shruti-H/github-topic-repo-scraper.git\ncd github-topic-repo-scraper\n```\n\n### **2. Open Jupyter Notebook**\n```sh\njupyter notebook scraping-github-topics-repositories.ipynb\n```\n*If Jupyter Notebook is not installed, install it using:*\n```sh\npip install notebook\n```\n\n### **3. Run the Notebook**\n- Open `scraping-github-topics-repositories.ipynb` in Jupyter Notebook.\n- Click **\"Run All\"** to execute all the cells.\n\n---\n\n## **🚀 Future Work**\n- Further analyze CSV data using **Pandas, SQL, or visualization tools**.\n- Improve scraping speed using **parallel processing**.\n- Add **database storage** for structured data retrieval.\n- Extract additional repository details (Forks, Issues, Contributors).\n- Implement **pagination** to scrape **more topics and repositories** by iterating through multiple pages (`?page=1`, `?page=2`, etc.).\n\n---\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshruti-h%2Fgithub-topic-repo-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshruti-h%2Fgithub-topic-repo-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshruti-h%2Fgithub-topic-repo-scraper/lists"}