Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/anonym0uswork1221/python-code-docstring-scraper
A multi-threaded GitHub scraper to collect Python code with docstrings from public repositories, creating a well-documented dataset for the JaraConverse LLM model.
- Host: GitHub
- URL: https://github.com/anonym0uswork1221/python-code-docstring-scraper
- Owner: Anonym0usWork1221
- License: mit
- Created: 2024-07-21T19:50:31.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2024-07-21T20:54:16.000Z (7 months ago)
- Last Synced: 2024-11-12T09:13:14.635Z (3 months ago)
- Topics: causal-language-modeling, data-scraping, dataset, dataset-generation, dataset-scripts, docst, docstring-generator, github-scraper, llm, llm-training, nlp, nlp-machine-learning, python-code, python-dataset, python3, scraper, script
- Language: Python
- Homepage:
- Size: 454 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# GitHub Python Code Scraper
This project is a GitHub scraper designed to collect Python code from repositories, focusing on code that includes docstrings in functions and classes. The collected data is used to create an extensive dataset for the **JaraConverse** LLM model created by **Abdul Moez**.
* Date : 2024/07/22
* Author : **__Abdul Moez__** & **__Hammad Hussain__**
* Version Code : 0.1
* Used For: JaraConverse Coding dataset creation
* License: MIT License, Copyright (c) 2024 AbdulMoez

## Table of Contents
- [Overview](#overview)
- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
- [Configuration](#configuration)
- [Ethical Considerations](#ethical-considerations)
- [Contributing](#contributing)
- [License](#license)

## Overview
The GitHub Python Code Scraper automates the process of searching for and scraping Python repositories from GitHub. It focuses on extracting code snippets that contain docstrings, ensuring that the dataset includes well-documented code. The scraper supports multithreading and handles GitHub API rate limits.
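The extraction logic itself is not reproduced in this README. As a rough illustration, here is a minimal sketch of how docstring-bearing functions and classes can be detected with Python's standard `ast` module; the function name and overall structure are assumptions for illustration, not the project's actual API:

```python
import ast

def extract_documented_defs(source: str) -> list[str]:
    """Return source snippets of functions/classes that carry a docstring.

    Hypothetical helper for illustration; not part of the project.
    """
    snippets = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # ast.get_docstring() returns None when the node has no docstring
            if ast.get_docstring(node) is not None:
                snippets.append(ast.get_source_segment(source, node))
    return snippets
```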
## Features
- **Multi-threaded Scraping**: Utilizes multiple threads to speed up the scraping process.
- **Rate Limit Handling**: Automatically switches GitHub tokens when rate limits are reached (see the token-rotation sketch after this list).
- **Customizable Search Queries**: Allows specifying custom GitHub search queries.
- **Database Storage**: Stores the scraped data in an SQLite database.
- **Verbose Logging**: Provides detailed status messages during the scraping process.
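The project's own rotation code is not shown in this README; the snippet below is a minimal sketch of the token-switching pattern against the GitHub REST API using `requests`. The class name and retry policy are assumptions made for illustration:

```python
import requests

class TokenRotator:
    """Rotate through GitHub tokens when the current one is rate-limited.

    Hypothetical helper for illustration; not the project's actual class.
    """

    def __init__(self, tokens: list[str]):
        self._tokens = tokens
        self._index = 0

    def get(self, url: str) -> requests.Response:
        for _ in range(len(self._tokens)):
            headers = {"Authorization": f"token {self._tokens[self._index]}"}
            response = requests.get(url, headers=headers)
            # GitHub signals an exhausted quota with a 403/429 status and a
            # zeroed X-RateLimit-Remaining header; move on to the next token.
            if (response.status_code in (403, 429)
                    and response.headers.get("X-RateLimit-Remaining") == "0"):
                self._index = (self._index + 1) % len(self._tokens)
                continue
            return response
        raise RuntimeError("All supplied tokens are currently rate-limited")
```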
## Installation

To install the required dependencies, run:
```bash
pip install -r requirements.txt
```

## Usage
To start the scraper, run the following command:
```bash
python scraper.py
```

### Example
```python
from scraper import Scraper

# Initialize the Scraper object
scraper = Scraper(
    search_query="language:python pushed:<2023-01-01",
    git_tokens=["your_github_token1", "your_github_token2"],
    number_of_threads=4,
)

# Start the scraping process
scraper.start_scraper()
```
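In this example, the `search_query` restricts results to Python repositories last pushed before 2023-01-01, and the two token strings are placeholders that must be replaced with your own GitHub personal access tokens.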
## Configuration

You can customize the scraper's behavior using the following parameters (a combined example follows the list):
- `search_query`: GitHub search query used to filter repositories.
- `git_tokens`: List of GitHub personal access tokens for authentication.
- `id_record_file`: File name for storing scraped repository IDs.
- `db_commit_index`: Number of records after which the database is committed.
- `width_for_wrap`: Width used for wrapping docstrings.
- `verbose`: Whether to print verbose status messages.
- `use_threads`: Whether to use multiple threads for scraping.
- `number_of_threads`: The number of threads to use for scraping.
- `data_base_path`: Path where the SQLite database will be stored.
- `data_base_file_name`: Name of the SQLite database file.
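Assuming these keyword arguments map one-to-one onto the `Scraper` constructor, a fully configured instance might look like the following; every value shown is a placeholder, not a project default:

```python
from scraper import Scraper

scraper = Scraper(
    search_query="language:python stars:>50",
    git_tokens=["your_github_token1", "your_github_token2"],
    id_record_file="scraped_ids.txt",      # placeholder file name
    db_commit_index=100,                   # commit to SQLite every 100 records
    width_for_wrap=80,                     # wrap docstrings at 80 columns
    verbose=True,
    use_threads=True,
    number_of_threads=4,
    data_base_path="./data",               # placeholder path
    data_base_file_name="python_code.db",  # placeholder database name
)
scraper.start_scraper()
```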
## Ethical Considerations

When using this scraper, it is essential to adhere to GitHub's terms of service and respect the rights of repository owners. Follow these guidelines to ensure ethical use:
1. **Obtain Permission**: Ensure you have permission to scrape and use the data from public repositories.
2. **Respect Rate Limits**: Be mindful of GitHub's rate limits and avoid excessive requests that could impact the service for others.
3. **Respect Licensing**: Check the licenses of the repositories you scrape and comply with their terms.
4. **Data Privacy**: Avoid scraping personal data or sensitive information from repositories.
## Contributing
Contributions are welcome! Please open an issue or submit a pull request with your changes.
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.