Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/gnrlleclerc/inf-554-classification-data-challenge
Machine and Deep Learning Kaggle Competition
- Host: GitHub
- URL: https://github.com/gnrlleclerc/inf-554-classification-data-challenge
- Owner: GnRlLeclerc
- Created: 2023-12-03T00:20:31.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2023-12-07T18:44:49.000Z (about 1 year ago)
- Last Synced: 2024-11-20T14:34:11.529Z (about 2 months ago)
- Language: Jupyter Notebook
- Size: 2.88 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# INF554 Data Challenge
https://www.kaggle.com/competitions/inf554-extractive-summarization-2023/data
## Installation
Using a Python virtual environment:

```bash
python --version  # Check that you have Python >= 3.11

# Create the virtual environment
python -m venv venv

# Activate it on MacOS / Linux
source venv/bin/activate
# Activate it on Windows
venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

## Usage
### Data
All training data files (`.json` utterances and `.txt` graph edges) must be located in the [`data/training/`](./data/training/) directory. The same goes for testing data files in the [`data/test/`](./data/test/) directory.
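As a quick sanity check that the files are in the right place, something like the following can be used (this snippet is illustrative and not part of the repository):

```python
# Illustrative sanity check that the data files are where the scripts expect.
from pathlib import Path

for split in ("training", "test"):
    directory = Path("data") / split
    n_json = len(list(directory.glob("*.json")))  # utterance files
    n_txt = len(list(directory.glob("*.txt")))    # graph-edge files
    print(f"{split}: {n_json} utterance files, {n_txt} edge-list files")
```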
### Generate utterance text embeddings
We used pretrained embedders from the `sentence_transformers` library.
You can test the performance of our own custom embedder with the following command (we ended up not using it because it always overfit):
```bash
python src/word_embedding.py
```

In order to save time when embedding utterances, you must first generate the pre-embedded tensors for each data file. This can be done with the following command:
```bash
python src/loader.py
```

This generates pre-embedded tensors using two models: `all-MiniLM-L6-v2` (small) and `all-mpnet-base-v2` (large).
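For reference, here is a minimal sketch of what this pre-embedding step looks like with `sentence_transformers`. The model names are the two listed above; the JSON layout and file names are illustrative assumptions, not the repository's actual loader code:

```python
# Minimal sketch of pre-embedding utterances with sentence_transformers.
# The JSON layout and file names below are illustrative assumptions.
import json

import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # or "all-mpnet-base-v2"

with open("data/training/example.json") as f:
    utterances = [u["text"] for u in json.load(f)]  # assumed field name

# encode() returns one embedding per utterance (384-dim for MiniLM)
embeddings = model.encode(utterances, convert_to_tensor=True)
torch.save(embeddings, "data/training/example.pt")
```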
### Run the classifiers
We experimented with different methods to classify utterances as important or not. The best results were obtained with a simple logistic regression on the embeddings (a minimal sketch follows below). Our RNN and GNN attempts, as well as reimplementations of lab 8 and lab 9 adapted for this task, can be found in [`rnn.py`](./src/rnn.py), [`graph_model.py`](./src/graph_model.py), [`lab8_lstm.py`](./src/lab8_lstm.py), and [`lab9_gcn.py`](./src/lab9_gcn.py).
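For intuition, the best-performing approach is conceptually very small. Here is a minimal sketch, assuming scikit-learn and the pre-embedded tensors from the previous step (file names are hypothetical; this is not the repository's actual code):

```python
# Minimal sketch of logistic regression on utterance embeddings.
# Assumes scikit-learn; file names are hypothetical, not the repo's code.
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = torch.load("data/training/example.pt").numpy()    # (n_utterances, dim)
y = torch.load("labels/training/example.pt").numpy()  # 0/1 importance labels

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Validation accuracy: {clf.score(X_val, y_val):.3f}")
```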
You can run the logistic regression classifier with the following command:
```bash
python src/majority.py
```

## File structure
```bash
├── .vscode # VSCode configuration
│
├── data # Datasets
│ ├── test # Data for testing
│ └── training # Data for training
│
├── labels # Computed labels + true training labels
│ ├── test # Labels for testing
│ └── training # Labels for training
│
├── notebooks # Jupyter notebooks (experiments)
│
└── src # Python script dependencies
```

## Baseline and submission
Run the baseline text-only (no graph) model with the following command.
It will evaluate the model against the training and validation data, and write the estimated test labels to `labels/train`.

```bash
python src/baseline.py
```

Use the following command to display help for the function that generates a submission file:
```bash
python src/make_submission.py -h
```
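For orientation only, a submission in this kind of binary utterance-classification challenge boils down to mapping utterance ids to 0/1 labels. The sketch below is purely hypothetical: the real column names and id scheme are defined by `src/make_submission.py` and the competition page.

```python
# Purely hypothetical submission writer; the actual column names and id
# scheme are defined by src/make_submission.py and the Kaggle page.
import csv

# Illustrative predictions: utterance id -> importance label
predictions = {"example_0": 1, "example_1": 0}

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "label"])  # hypothetical header
    for utterance_id, label in predictions.items():
        writer.writerow([utterance_id, label])
```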