An open API service indexing awesome lists of open source software.

https://github.com/limeo131/top-startups-founders

Used machine learning to segment and classify top scientists involved in startups, based on PitchBook data. Analyzed founder profiles, patents, and affiliations using Python and Pandas to uncover trends in academic entrepreneurship and tech commercialization.
https://github.com/limeo131/top-startups-founders

classification identification machine-learning python sklearn

Last synced: 5 months ago
JSON representation

Used machine learning to segment and classify top scientists involved in startups, based on PitchBook data. Analyzed founder profiles, patents, and affiliations using Python and Pandas to uncover trends in academic entrepreneurship and tech commercialization.

Awesome Lists containing this project

README

          

# ๐Ÿง  Top-Startup-Founders

**AI-driven prediction and profiling of top scientists for startup success.**

This project leverages machine learning and natural language processing to analyze over 1,000 top scientists' **publications**, **patents**, and **grant histories**, aiming to identify researchers with high potential for **startup success**. By combining state-of-the-art embedding models and ensemble learning, we generate actionable insights to support **investment** and **product development** decisions.

---

## ๐Ÿš€ Key Highlights

- ๐Ÿ” Extracted key features from large-scale researcher datasets using advanced NLP and transformer-based embeddings (e.g., `allenai-specter`)
- ๐Ÿงช Trained and fine-tuned ML models: **Neural Network**, **XGBoost**, **Decision Trees**, and more
- ๐Ÿง  Built an **ensemble majority voting classifier**, achieving ~**90% accuracy** in predicting startup outcomes
- ๐Ÿ“Š Delivered interpretable outputs for **investors**, **VCs**, and **startup founders** to identify high-impact scientific talent

---

## ๐Ÿ—‚๏ธ Project Structure

### ๐Ÿ“ `prediction.ipynb`
Processes researcher metadata and textual content (publications, patents, grants). Core tasks:
- Sentence embedding using `allenai-specter`
- Feature engineering across multiple institutions
- Model training (classification and similarity scoring)

### ๐Ÿ“ `correlation_0.ipynb`
Analyzes cross-institutional patterns and correlation in scientist success predictors, validating generalizability of the models across universities.

### ๐Ÿ“„ `sample`
Example input/output file (can be replaced with your data). Format typically includes researcher name, affiliation, text fields (abstracts/patents), and labeled outcomes.

---

## ๐Ÿ“ฆ Installation

```bash
pip install pandas matplotlib numpy transformers keyphrasetransformer sentence-transformers scikit-learn xgboost
```

---

## ๐Ÿง  Models Used

| Model Type | Purpose |
|--------------------|-----------------------------------------|
| `allenai-specter` | Generate semantic embeddings from text |
| `XGBoost` | Feature-rich startup success prediction |
| `Neural Network` | Deep modeling of researcher profiles |
| `Decision Trees` | Transparent classification models |
| `VotingClassifier` | Ensemble method (~90% accuracy) |

---

## ๐Ÿ’ผ Use Cases

- ๐Ÿ’ฐ **VC & Investment Strategy** โ€“ Prioritize researchers with strong innovation potential
- ๐Ÿš€ **Startup Partnerships** โ€“ Identify academic collaborators with commercial success prospects
- ๐Ÿ”ฌ **R&D Analytics** โ€“ Track emerging research trends and their entrepreneurial applications

---

## ๐Ÿ“ How to Use

1. Clone the repo:
```bash
git clone https://github.com/your-username/Top-scientist-for-Startups.git
cd Top-scientist-for-Startups
```

2. Run notebooks:
- Use `prediction.ipynb` to generate features and train predictive models
- Use `correlation_0.ipynb` for analysis across institutions

3. Modify `sample` or replace with your dataset (`.csv`) following the expected schema.

---

## ๐Ÿ“ˆ Sample Output

- Similarity scores between scientists and startup keywords
- Top predicted scientists ranked by success probability
- Confusion matrix, classification report (accuracy, precision, recall)

---

## ๐Ÿค Contributions

Pull requests are welcome! For major changes, please open an issue first to discuss what you would like to change.

---

## ๐Ÿ“ฌ Contact

If you have any questions or suggestions, feel free to reach out or open an issue.

---

Let me know if you want this saved as a `README.md` file, or if you'd like to include badges (like GitHub stars, last updated, license, etc.)!