https://github.com/limeo131/top-startups-founders
Used machine learning to segment and classify top scientists involved in startups, based on PitchBook data. Analyzed founder profiles, patents, and affiliations using Python and Pandas to uncover trends in academic entrepreneurship and tech commercialization.
classification identification machine-learning python sklearn
Last synced: 5 months ago
- Host: GitHub
- URL: https://github.com/limeo131/top-startups-founders
- Owner: Limeo131
- Created: 2024-06-04T21:39:15.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-03T23:31:45.000Z (7 months ago)
- Last Synced: 2025-05-19T14:21:22.926Z (6 months ago)
- Topics: classification, identification, machine-learning, python, sklearn
- Language: Jupyter Notebook
- Homepage:
- Size: 5.28 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Top-Startup-Founders
**AI-driven prediction and profiling of top scientists for startup success.**
This project leverages machine learning and natural language processing to analyze over 1,000 top scientists' **publications**, **patents**, and **grant histories**, aiming to identify researchers with high potential for **startup success**. By combining state-of-the-art embedding models and ensemble learning, we generate actionable insights to support **investment** and **product development** decisions.
---
## Key Highlights
- Extracted key features from large-scale researcher datasets using advanced NLP and transformer-based embeddings (e.g., `allenai-specter`)
- Trained and fine-tuned ML models: **Neural Network**, **XGBoost**, **Decision Trees**, and more
- Built an **ensemble majority voting classifier**, achieving ~**90% accuracy** in predicting startup outcomes
- Delivered interpretable outputs for **investors**, **VCs**, and **startup founders** to identify high-impact scientific talent
---
## Project Structure
### `prediction.ipynb`
Processes researcher metadata and textual content (publications, patents, grants). Core tasks:
- Sentence embedding using `allenai-specter`
- Feature engineering across multiple institutions
- Model training (classification and similarity scoring)
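
The embedding step above could look roughly like the following. This is a minimal sketch, not the notebook's actual code: the `title` and `abstract` field names, the `[SEP]` joining convention, and both helper functions are assumptions made for illustration. Only `allenai-specter` and the `sentence-transformers` package come from this README.

```python
from typing import Dict, List

# Separator placed between title and abstract before embedding
# (assumption: mirrors the common usage pattern for SPECTER-style models).
SEP = "[SEP]"


def build_input_text(record: Dict[str, str]) -> str:
    """Join a record's title and abstract into one string for embedding."""
    title = record.get("title", "").strip()
    abstract = record.get("abstract", "").strip()
    if title and abstract:
        return f"{title}{SEP}{abstract}"
    # Fall back to whichever field is present (or "" if both are missing).
    return title or abstract


def embed_records(records: List[Dict[str, str]], model_name: str = "allenai-specter"):
    """Return one embedding per record as a NumPy array of shape (n, dim)."""
    # Imported lazily so build_input_text works without the dependency installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)  # downloads the model on first use
    texts = [build_input_text(r) for r in records]
    return model.encode(texts, convert_to_numpy=True)
```

Calling `embed_records(records)` on a list of such dictionaries yields a matrix that can be fed to the downstream classifiers or compared with cosine similarity.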
### `correlation_0.ipynb`
Analyzes cross-institutional patterns and correlations among predictors of scientist success, to check how well the models generalize across universities.
### `sample`
Example input/output file (can be replaced with your data). Format typically includes researcher name, affiliation, text fields (abstracts/patents), and labeled outcomes.
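
Since the exact schema is not spelled out here, a quick check before running the notebooks can catch missing fields early. The sketch below assumes a CSV with hypothetical column names `name`, `affiliation`, `text`, and `label`; rename them to match the real `sample` file.

```python
import csv
import io
from typing import Iterable, List

# Hypothetical column names; adjust to match the actual `sample` file.
REQUIRED_COLUMNS = ("name", "affiliation", "text", "label")


def load_rows(handle: Iterable[str]) -> List[dict]:
    """Read CSV rows, checking that required columns exist and are non-empty."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        empty = [c for c in REQUIRED_COLUMNS if not (row[c] or "").strip()]
        if empty:
            raise ValueError(f"line {line_no}: empty values for {empty}")
        rows.append(row)
    return rows


# Made-up placeholder row, only to show the expected shape of the input.
example = io.StringIO(
    "name,affiliation,text,label\n"
    "Ada Example,Example University,Abstract text here,1\n"
)
rows = load_rows(example)
```

For a file on disk, pass an open handle instead: `load_rows(open("my_data.csv", newline=""))`.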
---
## Installation
```bash
pip install pandas matplotlib numpy transformers keyphrasetransformer sentence-transformers scikit-learn xgboost
```
---
## Models Used
| Model Type | Purpose |
|--------------------|-----------------------------------------|
| `allenai-specter` | Generate semantic embeddings from text |
| `XGBoost` | Feature-rich startup success prediction |
| `Neural Network` | Deep modeling of researcher profiles |
| `Decision Trees` | Transparent classification models |
| `VotingClassifier` | Ensemble method (~90% accuracy) |
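
To illustrate how the models in the table can be combined, here is a minimal sketch of hard (majority) voting with scikit-learn's `VotingClassifier`. It runs on synthetic data, not the project's dataset, so the printed accuracy says nothing about the ~90% figure. `GradientBoostingClassifier` stands in for XGBoost to keep the example to one dependency, and all hyperparameters are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for engineered researcher features and outcome labels.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)),
        # Stand-in for xgboost.XGBClassifier (a swap made for this sketch only).
        ("gbt", GradientBoostingClassifier(random_state=0)),
    ],
    voting="hard",  # each model casts one vote; the majority label wins
)
ensemble.fit(X_train, y_train)

y_pred = ensemble.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

With three estimators and two classes, hard voting can never tie, which is one reason an odd number of base models is a convenient choice.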
---
## Use Cases
- **VC & Investment Strategy**: Prioritize researchers with strong innovation potential
- **Startup Partnerships**: Identify academic collaborators with commercial success prospects
- **R&D Analytics**: Track emerging research trends and their entrepreneurial applications
---
## How to Use
1. Clone the repo:
```bash
git clone https://github.com/limeo131/top-startups-founders.git
cd top-startups-founders
```
2. Run notebooks:
- Use `prediction.ipynb` to generate features and train predictive models
- Use `correlation_0.ipynb` for analysis across institutions
3. Modify `sample`, or replace it with your own dataset (`.csv`) following the expected schema.
---
## Sample Output
- Similarity scores between scientists and startup keywords
- Top predicted scientists ranked by success probability
- Confusion matrix, classification report (accuracy, precision, recall)
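
As a rough illustration of how the ranking and the report metrics are derived, the dependency-free sketch below computes both from plain lists. The scientist identifiers, probabilities, and labels are made-up placeholders, and the function names are hypothetical rather than taken from the notebooks.

```python
from typing import Dict, List, Sequence, Tuple


def rank_by_probability(
    names: Sequence[str], probs: Sequence[float], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Return the top_k (name, probability) pairs, highest probability first."""
    ranked = sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


def binary_report(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute confusion-matrix counts plus accuracy, precision, and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        # Guard against division by zero when a class is never predicted/present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Placeholder values for illustration only.
top = rank_by_probability(
    ["scientist_a", "scientist_b", "scientist_c"], [0.2, 0.9, 0.5], top_k=2
)
report = binary_report([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```

In practice `sklearn.metrics.confusion_matrix` and `classification_report` produce the same quantities; the hand-rolled version just makes the definitions explicit.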
---
## Contributions
Pull requests are welcome! For major changes, please open an issue first to discuss what you would like to change.
---
## Contact
If you have any questions or suggestions, feel free to reach out or open an issue.
---