https://github.com/rossop/data-scraper
[Undergoing Refactoring] I needed a scraper to download white papers.
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/rossop/data-scraper
- Owner: rossop
- License: mit
- Created: 2024-05-20T22:53:11.000Z (8 months ago)
- Default Branch: develop
- Last Pushed: 2024-06-12T23:36:16.000Z (7 months ago)
- Last Synced: 2024-10-27T17:07:04.053Z (3 months ago)
- Language: Python
- Homepage:
- Size: 61.5 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Data Scraper
A modular data scraping project designed to extract information from various sources. Currently, the project includes a PDF scraper, and it will be expanded to include a Goodreads scraper.
## Table of Contents
- [Introduction](#introduction)
- [Project Structure](#project-structure)
- [PDF Scraper](#pdf-scraper)
  - [Installation](#installation)
  - [Usage](#usage)
  - [Configuration](#configuration)
- [Goodreads Scraper](#goodreads-scraper)
  - [Installation](#installation-1)
  - [Usage](#usage-1)
  - [Configuration](#configuration-1)
- [Common Utilities](#common-utilities)
- [Testing](#testing)
- [Contributing](#contributing)
- [License](#license)
## Introduction
`data_scraper` is a Python project aimed at scraping and extracting data from various sources. The project is structured in a modular way to allow for easy addition of new scrapers. Currently, it includes:
- PDF Scraper
- Goodreads Scraper (coming soon)
## Project Structure
```bash
data_scraper/
│
├── src/
│ ├── common/
│ │ ├── file_utils.py
│ │ └── web_utils.py
│ ├── pdf/
│ │ ├── __init__.py
│ │ └── scraper.py
│ ├── goodreads/
│ │ ├── __init__.py
│ │ └── scraper.py
│ └── main.py
│
├── tests/
│ ├── test_file_utils.py
│ ├── test_web_utils.py
│ ├── test_pdf_scraper.py
│ └── test_goodreads_scraper.py
│
├── .gitignore
├── requirements.txt
└── README.md
```
## PDF Scraper
The PDF Scraper is designed to download and parse PDF files from given URLs.
### Installation
1. Clone the repository:
```bash
git clone https://github.com/rossop/data-scraper.git
cd data-scraper
```
2. Set up a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # On Windows, use `venv\Scripts\activate`
```
3. Install the required dependencies:
```bash
pip install -r requirements.txt
```
### Usage
To use the PDF Scraper, run the `main.py` script:
```bash
python src/main.py
```
The `main.py` script is currently configured to run the PDF scraper. Modify it to include your URLs and configurations as needed.
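As a rough illustration of what such a modification could involve, the sketch below turns a list of URLs into local downloads. It is not the project's actual `main.py`: the names `PDF_URLS`, `filename_from_url`, `fetch_bytes`, and `download_all` are hypothetical, and the standard-library `urllib` is used only to keep the example free of extra dependencies.

```python
from pathlib import Path
from typing import Callable, Iterable
from urllib.parse import unquote, urlparse
from urllib.request import urlopen

# Hypothetical placeholder list; replace with the documents you want.
PDF_URLS = ["https://example.com/papers/white-paper.pdf"]


def filename_from_url(url: str) -> str:
    """Derive a local file name from the last path segment of a URL."""
    name = Path(unquote(urlparse(url).path)).name
    if not name:
        name = "download"
    return name if name.lower().endswith(".pdf") else name + ".pdf"


def fetch_bytes(url: str) -> bytes:
    """Download the raw bytes behind a URL (performs network access)."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def download_all(
    urls: Iterable[str],
    dest_dir: Path,
    fetch: Callable[[str], bytes] = fetch_bytes,
) -> list[Path]:
    """Save each URL into dest_dir and return the written paths."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        target = dest_dir / filename_from_url(url)
        target.write_bytes(fetch(url))
        saved.append(target)
    return saved
```

Calling `download_all(PDF_URLS, Path("downloads"))` would fetch each URL over the network. Passing a different `fetch` callable makes the loop easy to exercise in tests without network access.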
### Configuration
The `src/pdf/scraper.py` module contains the main logic for the PDF scraper. You can configure the URLs and directories within this file or extend its functionality as needed.
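One possible way to keep the URLs and directories in a single place is a small configuration object. The sketch below is an assumption about how this could look, not the project's current code: `PDFScraperConfig` and all of its fields are hypothetical names.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PDFScraperConfig:
    """Hypothetical settings holder for a PDF scraping run."""

    urls: tuple[str, ...] = ()
    download_dir: Path = Path("downloads")
    timeout_seconds: float = 30.0

    def __post_init__(self) -> None:
        # Reject obviously invalid values early, before any download starts.
        if self.timeout_seconds <= 0:
            raise ValueError("timeout_seconds must be positive")
        for url in self.urls:
            if not url.startswith(("http://", "https://")):
                raise ValueError(f"unsupported URL: {url!r}")
```

Making the dataclass frozen means a configuration cannot be changed by accident once a run has started. Whether that trade-off suits this project is a design choice left to the maintainers.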
## Goodreads Scraper
(Coming soon)
### Installation
(Coming soon)
### Usage
(Coming soon)
### Configuration
(Coming soon)
## Common Utilities
The `common` directory contains utility functions that are shared across different scrapers:
- `file_utils.py`: Functions related to file operations.
- `web_utils.py`: Functions related to web operations, such as setting up the Selenium WebDriver.
## Testing
To run the tests, use pytest:
```bash
pytest
```
Make sure all tests pass before making any contributions.
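For readers new to pytest, a test is a function whose name starts with `test_` and that uses plain `assert` statements. The sketch below is illustrative only: `sanitize_filename` is a hypothetical helper defined inline for the example, not a function confirmed to exist in `file_utils.py`.

```python
import re


def sanitize_filename(name: str) -> str:
    """Hypothetical helper: replace characters that are unsafe in file names."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("_")
    return cleaned or "unnamed"


def test_sanitize_filename_replaces_unsafe_characters():
    assert sanitize_filename("white paper: v2.pdf") == "white_paper_v2.pdf"


def test_sanitize_filename_handles_empty_input():
    assert sanitize_filename("///") == "unnamed"
```

Saved in a file whose name matches `test_*.py` under `tests/`, these functions would be picked up by a plain `pytest` run.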
# Contributing to Data Scraper
Thank you for considering contributing to our project! We appreciate your support and aim to make the contribution process as smooth as possible.
## How to Contribute
1. **Fork the Repository**: Click the "Fork" button at the top right of this repository to create a copy of the repository on your own GitHub account.
2. **Clone the Repository**: Clone your fork to your local machine using the following command:
```bash
git clone https://github.com/yourusername/data-scraper.git
cd data-scraper
```
3. **Create a Branch**: Create a new branch from `develop` for your feature or bug fix:
```bash
git checkout -b feature-branch develop
```
4. **Make Changes**: Make your changes to the codebase.
5. **Commit Changes**: Follow our commit message guidelines when committing your changes:
```bash
git commit -m "feat: Add new feature

Added functionality to scrape PDF files from a list of URLs.
- Implemented PDFScraper class
- Added unit tests for PDFScraper

# Issue References:
# - Related to #123
# - Fixes #456"
```
6. **Push Changes**: Push your changes to your forked repository:
```bash
git push origin feature-branch
```
7. **Create a Pull Request**: Open a pull request to merge your changes into the main repository.
## Commit Message Guidelines
Please follow the commit message template below to ensure consistency and clarity in our project's history.
```plaintext
:-
# Issue References:
# - Related to #
# - Fixes #

# --------------- Commit Message Guide --------------- #
# Verbs: Add, Remove, Update, Replace, Fix, Implement,
# Improve, Refactor, Optimize, Ensure, Prevent, Correct
#
# Type: Categorizes the nature of the commit:
# - feat: A new feature
# - fix: A bug fix
# - docs: Changes to documentation
# - style: Formatting, missing semicolons, etc. (does not affect code logic)
# - refactor: Code changes that neither fix a bug nor add a feature
# - perf: Performance improvements
# - test: Adding missing tests or correcting existing tests
# - chore: Updates to the build process or auxiliary tools and libraries
#
# Short Summary: Concise description in imperative mood ("Add" not "Adds").
#
# Body: Detailed explanation of what was changed and why, not how.
#
# Details: Additional context, limitations of the current solution, etc.
# - Use bullet points for multiple details.
#
# Issue References: Optionally, mention related issue numbers for more context.
#
# ------------------- Dos and Don'ts ------------------ #
# DO:
# - Keep the subject line under 50 characters
# - Use the body to explain what and why vs. how
# - Wrap the body at 72 characters
# - Use imperative mood in the subject line
#
# DON'T:
# - End the subject line with a period
# - Use the body to explain how
#
# ----------------- Sentence Structure ---------------- #
# Type: :
# Example: feat:
#
# Short Summary:
# Example: Add user login functionality
#
# Body and Details:
# For Body:
# For Details: -
# ----------------------------------------------------- #
```
### Breaking Down Features
When working on a new feature, consider breaking it into multiple smaller commits if necessary. This approach helps in:
- Easier code reviews
- Clearer project history
- Simplified debugging

Each commit should represent a single logical change and follow the commit message guidelines.
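Some of the subject-line rules from the commit message guide can be checked mechanically. The following is a minimal sketch, not tooling that ships with this project: `check_subject`, `ALLOWED_TYPES`, and `SUBJECT_PATTERN` are hypothetical names, and only the type prefix, the 50-character limit, and the trailing-period rule are covered.

```python
import re

# Commit types listed in the commit message guide.
ALLOWED_TYPES = ("feat", "fix", "docs", "style", "refactor", "perf", "test", "chore")
SUBJECT_PATTERN = re.compile(r"^(?P<type>[a-z]+): (?P<summary>.+)$")


def check_subject(subject: str) -> list[str]:
    """Return a list of guideline violations for a commit subject line."""
    match = SUBJECT_PATTERN.match(subject)
    if match is None:
        return ["subject must look like 'type: Short summary'"]
    problems = []
    if match.group("type") not in ALLOWED_TYPES:
        problems.append(f"unknown type: {match.group('type')}")
    if len(subject) >= 50:
        problems.append("subject should be under 50 characters")
    if subject.endswith("."):
        problems.append("subject should not end with a period")
    return problems
```

An empty list means no violations were found. A function like this could be called from a `commit-msg` Git hook, although wiring that up is outside the scope of this guide.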
### Additional Notes
Feel free to reach out if you have any questions or need further assistance. We appreciate your contributions and look forward to collaborating with you!
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.