Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.

https://github.com/nneji123/ycombinator-scraper
A Python library and CLI tool for scraping companies, jobs, and founders data from Workatastartup.com.
- Host: GitHub
- URL: https://github.com/nneji123/ycombinator-scraper
- Owner: Nneji123
- License: MIT
- Created: 2024-01-23T13:41:15.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-03-06T00:35:57.000Z (11 months ago)
- Last Synced: 2024-11-24T20:38:53.532Z (3 months ago)
- Topics: automation, cli, library, mkdocs-material, package, pypi, python, selenium, webscraping, ycombinator
- Language: Python
- Homepage: https://nneji123.github.io/ycombinator-scraper
- Size: 1.27 MB
- Stars: 8
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# YCombinator-Scraper
| | |
| --- | --- |
| CI/CD | [![CI - Test](https://github.com/Nneji123/ycombinator-scraper/actions/workflows/tests.yml/badge.svg)](https://github.com/Nneji123/ycombinator-scraper/actions/workflows/tests.yml) [![publish-pypi](https://github.com/Nneji123/ycombinator-scraper/actions/workflows/pypi.yml/badge.svg)](https://github.com/Nneji123/ycombinator-scraper/actions/workflows/pypi.yml) [![Coverage](https://codecov.io/gh/Nneji123/ycombinator-scraper/graph/badge.svg?token=37muKJo0SL)](https://codecov.io/gh/Nneji123/ycombinator-scraper)|
| Docs | [![Docs](https://github.com/Nneji123/ycombinator-scraper/actions/workflows/docs.yml/badge.svg)](https://github.com/Nneji123/ycombinator-scraper/actions/workflows/docs.yml) |
| Package | [![PyPI - Version](https://img.shields.io/pypi/v/ycombinator-scraper.svg?logo=pypi&label=PyPI&logoColor=gold)](https://pypi.org/project/ycombinator-scraper/) [![PyPI - Downloads](https://img.shields.io/pypi/dm/ycombinator-scraper.svg?color=blue&label=Downloads&logo=pypi&logoColor=gold)](https://pypi.org/project/ycombinator-scraper/) [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/ycombinator-scraper.svg?logo=python&label=Python&logoColor=gold)](https://pypi.org/project/ycombinator-scraper/) |
| Meta | [![linting - Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff) [![License - MIT](https://img.shields.io/badge/license-MIT-9400d3.svg)](./LICENSE) |
YCombinator-Scraper provides a web scraping tool for extracting data from the [Workatastartup](https://www.workatastartup.com/) website. The package uses Selenium and BeautifulSoup to navigate through the pages and extract information.
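The division of labor is worth making concrete: Selenium drives a real browser so JavaScript-rendered pages load fully, and BeautifulSoup parses the resulting HTML. Below is a minimal sketch of that generic pattern; the URL and selector are placeholders for illustration, not the package's internal code:

```python
# Illustrative sketch of the generic Selenium + BeautifulSoup pattern this
# package builds on; the URL and selector below are placeholders.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # scrape without opening a browser window
driver = webdriver.Chrome(options=options)

driver.get("https://www.workatastartup.com/companies/example-inc")  # placeholder URL
soup = BeautifulSoup(driver.page_source, "html.parser")

heading = soup.find("h1")  # illustrative selector; the real selectors differ
print(heading.get_text(strip=True) if heading else "not found")
driver.quit()
```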
---
**Documentation**: https://nneji123.github.io/ycombinator-scraper
**Source Code**: https://github.com/nneji123/ycombinator-scraper
---
# Sponsor
[Proxycurl APIs](https://nubela.co/proxycurl/?utm_campaign=influencer_marketing&utm_source=github&utm_medium=social&utm_content=ifeanyi_nneji_ycombinator_scraper)
Scrape public LinkedIn profile data at scale with Proxycurl APIs.
- Scraping public profiles is battle-tested in court (hiQ v. LinkedIn).
- GDPR, CCPA, SOC2 compliant.
- High rate limit - 300 requests/minute.
- Fast - APIs respond in ~2s.
- Fresh data - 88% of data is scraped in real time; the remaining 12% is no older than 29 days.
- High accuracy.
- Tons of data points returned per profile.

Built for developers, by developers.
## Features
- **Web Scraping Capabilities:**
  - Extract detailed information about companies, including name, description, tags, images, job links, and social media links.
  - Scrape job-specific details such as title, salary range, tags, and description.
- **Founder and Company Data Extraction:**
  - Obtain information about company founders, including name, image, description, LinkedIn profile, and optional email addresses.
- **Headless Mode:**
  - Run the scraper in headless mode to perform web scraping without displaying a browser window.
- **Configurability:**
  - Easily configure scraper settings such as login credentials and the logs directory via environment variables or a configuration file; the right webdriver for your browser is installed automatically with the `webdriver-manager` package. A configuration sketch follows this list.
- **Command-Line Interface (CLI):**
  - Command-line tools to perform various scraping tasks interactively or in batch mode.
- **Data Output Formats:**
  - Save scraped data in JSON or CSV format, providing flexibility for further analysis or integration with other tools.
- **Caching Mechanism:**
  - A caching feature stores function results for a specified duration, reducing redundant web requests and improving performance.
- **Docker Support:**
  - Package the scraper as a Docker image for easy deployment in containerized environments, or run the prebuilt image: `docker pull nneji123/ycombinator_scraper`.
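To make the configurability concrete: since `pydantic-settings` is among the dependencies, settings of this kind are typically declared as a model whose fields are filled from environment variables or an `.env` file. The sketch below shows that pattern; the field names and the `YC_` prefix are hypothetical illustrations, not the package's actual settings (see the documentation for those):

```python
# Hedged sketch of env-var-driven configuration with pydantic-settings.
# The YC_ prefix and field names are hypothetical, not the package's real settings.
from pydantic_settings import BaseSettings, SettingsConfigDict

class ScraperSettings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="YC_", env_file=".env")

    username: str = ""            # filled from YC_USERNAME or .env
    password: str = ""            # filled from YC_PASSWORD or .env
    headless: bool = True         # filled from YC_HEADLESS
    logs_directory: str = "logs"  # filled from YC_LOGS_DIRECTORY

settings = ScraperSettings()
print(settings.headless)
```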
## Requirements
- Python 3.9+
- Chrome or Chromium browser installed.

## Installation
```console
$ pip install ycombinator-scraper
$ ycscraper --help

# Output
YCombinator-Scraper Version 0.7.0
Usage: python -m ycombinator_scraper [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  login
  scrape-company
  scrape-founders
  scrape-job
  version
```

### With Docker
```bash
$ git clone https://github.com/Nneji123/ycombinator-scraper
$ cd ycombinator-scraper
$ docker build -t your_name/scraper_name .  # e.g. docker build -t nneji123/ycombinator_scraper .
$ docker run nneji123/ycombinator_scraper python -m ycombinator_scraper --help
```
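The same container can run a real scrape rather than just the help text; substitute any of the subcommands listed in the help output above. The company URL here is a placeholder:

```bash
# The company URL is a placeholder; any of the scrape-* subcommands works the same way.
docker run nneji123/ycombinator_scraper python -m ycombinator_scraper scrape-company \
  --company-url https://www.workatastartup.com/companies/example-inc
```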
## Dependencies
- **click**: Enables the creation of a command-line interface for interacting with the scraper tool.
- **beautifulsoup4**: Facilitates the parsing and extraction of data from HTML and XML in the web scraping process.
- **loguru**: Provides a robust logging framework to track and manage log messages generated during the scraping process.
- **pandas**: Utilized for the manipulation and organization of data, particularly in generating CSV files from scraped information.
- **pathlib**: Offers an object-oriented approach to handle file system paths, contributing to better file management within the project.
- **pydantic**: Used for data validation and structuring the models that represent various aspects of scraped data.
- **pydantic-settings**: Extends Pydantic to enhance the management of settings in the project.
- **selenium**: Employs browser automation for web scraping, allowing interaction with dynamic web pages and extraction of information.

## Usage
### With CLI
```bash
ycscraper scrape-company --company-url https://www.workatastartup.com/companies/example-inc
```

This command will scrape data for the specified company and save it in the default output format (JSON).
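The other subcommands follow the same shape. For example, a job scrape might look like the line below; note that the `--job-url` flag name is an assumption by analogy with `--company-url`, so run `ycscraper scrape-job --help` for the real options:

```bash
# --job-url is assumed by analogy with --company-url; check --help for the actual flag.
ycscraper scrape-job --job-url https://www.workatastartup.com/jobs/example-job
```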
### With Library
```python
from ycombinator_scraper import Scraper

scraper = Scraper()
company_data = scraper.scrape_company_data("https://www.workatastartup.com/companies/example-inc")
print(company_data.model_dump_json(by_alias=True, indent=2))
```
Pydantic is used under the hood, so methods like `model_dump_json` are available for all the scraped data.

> **You can view more examples here: [Examples](https://nneji123.github.io/ycombinator-scraper/usage/examples)**
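Because the returned objects are Pydantic models, they also convert cleanly for the CSV output path. A minimal sketch using `pandas` (a listed dependency), assuming `company_data` from the example above; this illustrates the general pattern rather than the package's own CSV writer:

```python
import pandas as pd

# model_dump() turns the Pydantic model into a plain dict; wrap it in a
# one-row DataFrame and write it out as CSV.
record = company_data.model_dump(by_alias=True)
pd.DataFrame([record]).to_csv("company_data.csv", index=False)
```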
## Contribution
We welcome contributions from the community! To contribute to this project, follow the steps below.
### Setting Up Development Environment
#### Gitpod
You can use Gitpod, a free online VS Code-like environment, to quickly start contributing.
[![Open in Gitpod](https://gitpod.io/button/open-in-gitpod.svg)](https://gitpod.io/#https://github.com/nneji123/ycombinator-scraper)
#### Local Setup
1. Clone the repository:
```bash
git clone https://github.com/nneji123/ycombinator-scraper.git
cd ycombinator-scraper
```

2. Create a virtual environment (optional but recommended):
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```

3. Install dependencies:
```bash
pip install -r requirements.txt
```

### Running Tests
Make sure to run tests before submitting a pull request.
```bash
pip install -r requirements-test.txt
pytest tests
```

### Installing Documentation Requirements
If you make changes to documentation, install the necessary dependencies:
```bash
pip install -r requirements-docs.txt
mkdocs serve
```

### Setting Up Pre-Commit Hooks
We use `pre-commit` to ensure code quality. Install it by running:
```bash
pip install pre-commit
pre-commit install
```

Now, `pre-commit` will run automatically before each commit to check for linting and other issues.
### Submitting a Pull Request
1. Fork the repository and create a new branch for your contribution:
```bash
git checkout -b feature-or-fix-branch
```

2. Make your changes and commit them:
```bash
git add .
git commit -am "Your meaningful commit message"
```

3. Push the changes to your fork:
```bash
git push origin feature-or-fix-branch
```

4. Open a pull request on GitHub. Provide a clear title and description of your changes.
## Documentation
The [documentation](https://nneji123.github.io/ycombinator-scraper/) is made with [Material for MkDocs](https://github.com/squidfunk/mkdocs-material) and is hosted by [GitHub Pages](https://docs.github.com/en/pages).
## License
YCombinator-Scraper is distributed under the terms of the [MIT](./LICENSE) license.