{"id":25028840,"url":"https://github.com/lukafilipxvic/yc-vault","last_synced_at":"2025-04-13T15:57:54.482Z","repository":{"id":261398839,"uuid":"861215348","full_name":"lukafilipxvic/YC-Vault","owner":"lukafilipxvic","description":"YC Directory Database","archived":false,"fork":false,"pushed_at":"2025-03-21T11:15:05.000Z","size":8688,"stargazers_count":5,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-21T12:26:36.919Z","etag":null,"topics":["database","yc","ycombinator"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lukafilipxvic.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-22T10:14:37.000Z","updated_at":"2025-03-21T11:15:10.000Z","dependencies_parsed_at":"2025-03-21T12:37:29.964Z","dependency_job_id":null,"html_url":"https://github.com/lukafilipxvic/YC-Vault","commit_stats":null,"previous_names":["lukafilipxvic/yc-analyzed","lukafilipxvic/yc-vault"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lukafilipxvic%2FYC-Vault","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lukafilipxvic%2FYC-Vault/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lukafilipxvic%2FYC-Vault/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lukafilipxvic%2FYC-Vault/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lukafi
lipxvic","download_url":"https://codeload.github.com/lukafilipxvic/YC-Vault/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248741161,"owners_count":21154252,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["database","yc","ycombinator"],"created_at":"2025-02-05T20:47:54.081Z","updated_at":"2025-04-13T15:57:54.476Z","avatar_url":"https://github.com/lukafilipxvic.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# YC Vault\nAnalysis of every YC batch ever.\nRead the initial blog post [here](https://lukafilipovic.com/writing/2024/10/12/analysing-every-y-combinator-batch-ever/).\n\n## Why?\nY Combinator is one of the largest startup accelerators in the world.\nIt has one of the highest concentrations of technical founders.\nCompanies like Airbnb, Docker, Instacart and Coinbase all came up through the accelerator, but they represent only the top percentile.\n\nYC Vault is my attempt to make sense of the entire Y Combinator directory.\n\n## Requirements\nAny language model of your choice through LiteLLM. High-performing models like GPT-4o-mini are recommended for their data extraction accuracy.\n\n## Project installation\n1. Clone the repository:\n```\ngit clone https://github.com/lukafilipxvic/YC-Vault.git\n```\n2. Install the dependencies:\n```\nuv sync\n```\n\n3. Set up environment:\n   - Create a `.env` file using the `.env.example` file as a template\n   - Example `.env` file:\n     ```\n     [llm]\n     OPENAI_API_KEY=your_api_key_here\n     \n     [data]\n     DATA_DIR=./data\n     ```\n\n## Usage\n\n1. 
Configure your data sources:\n   - Update the `YC_Batches.csv` file with all batch IDs\n   - This file will need updating as new batches are launched\n2. Run the pipeline:\n```\nuv run python run_pipeline.py\n```\n\n## Performance\n\n### Time Requirements\n\n- `get_yc_urls.py`: ~2.5 minutes to scrape all YC URLs\n- `get_yc_data.py`: ~3.68 seconds per company (approximately 5.11 hours to scrape 5,000 YC companies synchronously)\n\n### Cost Analysis\n\n- Using GPT-4o-mini costs approximately $0.00026 to extract one YC company page\n- Total cost for 5,000 YC companies: ~$1.30\n- For comparison, Gumloop costs ~$80.83 for the same data (62.18x more expensive)\n\n## Data Structure\n\nThe scraping pipeline generates 3 CSV files:\n- `YC_Companies.csv`: Company profiles and metrics\n- `YC_Founders.csv`: Founder information and backgrounds\n- `YC_URLs.csv`: Source URLs for all scraped data\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n## License\n\nLicensed under [AGPL-3.0](https://choosealicense.com/licenses/agpl-3.0/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flukafilipxvic%2Fyc-vault","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flukafilipxvic%2Fyc-vault","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flukafilipxvic%2Fyc-vault/lists"}