https://github.com/lukafilipxvic/yc-vault
YC Directory Database
https://github.com/lukafilipxvic/yc-vault
database yc ycombinator
Last synced: about 1 year ago
JSON representation
YC Directory Database
- Host: GitHub
- URL: https://github.com/lukafilipxvic/yc-vault
- Owner: lukafilipxvic
- License: other
- Created: 2024-09-22T10:14:37.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-21T11:15:05.000Z (about 1 year ago)
- Last Synced: 2025-03-21T12:26:36.919Z (about 1 year ago)
- Topics: database, yc, ycombinator
- Language: Python
- Homepage:
- Size: 8.29 MB
- Stars: 5
- Watchers: 1
- Forks: 1
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# YC Vault
Analysis on every YC Batch ever.
Read the initial blog post [here.](https://lukafilipovic.com/writing/2024/10/12/analysing-every-y-combinator-batch-ever/)
## Why?
Y Combinator is one of the largest startup accelerators in the world.
It has one of the highest concentrations of technical founders.
Companies like Airbnb, Docker, Instacart and Coinbase were all brought up through the accelerator. But they only represent the top percentile.
YC Vault is my attempt to make sense of the entire Y Combinator directory.
## Requirements
Any language model of your choice through LiteLLM. High-performing models like GPT-4o-mini are recommended for their data extraction accuracy.
## Project installation
```
git clone https://github.com/lukafilipxvic/YC-Vault.git
```
```
uv sync
```
3. Set up environment:
- Create a `.env` file using the '.env.example' file as a template
- Example `.env` file:
```
[llm]
OPENAI_API_KEY=your_api_key_here
[data]
DATA_DIR=./data
```
## Usage
1. Configure your data sources:
- Update the `YC_Batches.csv` file with all batch IDs
- This file will need updating as new batches are launched
2. Run the pipeline:
```
uv run python run_pipeline.py
```
## Performance
### Time Requirements
- `get_yc_urls.py`: ~2.5 minutes to scrape all YC URLs
- `get_yc_data.py`: ~3.68 seconds per company (approximately 5.11 hours to scrape 5,000 YC companies synchronously)
### Cost Analysis
- Using GPT-4o-mini costs approximately $0.00026 to extract one YC company page
- Total cost for 5,000 YC companies: ~$1.30
- For comparison, Gumloop costs ~$80.83 for the same data (62.18x more expensive)
## Data Structure
The scraping pipeline generates 3 CSV files:
- `YC_Companies.csv`: Company profiles and metrics
- `YC_Founders.csv`: Founder information and backgrounds
- `YC_URLs.csv`: Source URLs for all scraped data
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
Licensed under [AGPL-3.0](https://choosealicense.com/licenses/agpl-3.0/)