https://github.com/junayed-hasan/bengaliclinicalmt
Official repository for "Extrinsic Evaluation of Machine Translation Quality via Downstream Tasks for Low-Resource Bengali Clinical Texts." Includes datasets, preprocessing pipelines, translation models, code for evaluation, and clinical outcome prediction tasks.
- Host: GitHub
- URL: https://github.com/junayed-hasan/bengaliclinicalmt
- Owner: junayed-hasan
- License: mit
- Created: 2024-12-08T00:19:57.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-08T00:42:47.000Z (over 1 year ago)
- Last Synced: 2025-03-04T14:51:40.767Z (over 1 year ago)
- Language: Jupyter Notebook
- Size: 11.3 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# BengaliClinicalMT
Official repository for the paper **"Extrinsic Evaluation of Machine Translation Quality via Downstream Tasks for Low-Resource Bengali Clinical Texts"**. This repository contains datasets, preprocessing pipelines, translation models, and scripts for evaluating Bengali machine-translated clinical texts on downstream clinical tasks, including Mortality Prediction (MP) and Length of Stay Prediction (LOS).
---
## Repository Contents
- `LICENSE`: License file for the repository.
- `README.md`: Documentation and usage instructions.
- `balanced_translated_dataset.csv`: Translated Bengali dataset obtained from machine translation models.
- `balanced_translated_dataset_preprocessed.csv`: Preprocessed version of the translated Bengali dataset.
- `classification-ben-opus-with-preprocessing.ipynb`: Jupyter notebook for downstream task classification using BanglaBERT.
- `translation-opus-with-preprocessing.ipynb`: Jupyter notebook for preprocessing and translation of clinical texts using OPUS-MT.
- `requirements.txt`: Python dependencies for reproducing the experiments.
---
## Abstract
Low-resource languages face significant challenges in the development of NLP solutions, particularly in specialized domains like healthcare. This repository implements the experiments detailed in our paper, including extrinsic evaluation of machine-translated Bengali clinical texts. It demonstrates how translation quality influences the performance of Bengali clinical NLP models on downstream tasks.
---
## Features
1. **Dataset Creation**: Contains translated Bengali clinical text datasets derived from MIMIC-III admission notes.
2. **Preprocessing Pipelines**: Implements robust preprocessing for noisy clinical texts.
3. **Machine Translation**: Uses multiple machine translation models (e.g., OPUS-MT, ChatGPT, BanglaNMT) for English-to-Bengali translation.
4. **Downstream Tasks**: Evaluates translations via real-world tasks such as:
   - Mortality Prediction (MP)
   - Length of Stay Prediction (LOS)
5. **Comparative Analysis**: Includes extrinsic evaluation and comparison with intrinsic metrics like BLEU.
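
As an illustration of the intrinsic metric mentioned above, here is a minimal sketch of corpus-level BLEU. It assumes one reference per hypothesis, whitespace tokenization, and no smoothing; these are simplifications chosen for the example and are not confirmed to match the paper's setup. The function names `ngrams` and `corpus_bleu` are hypothetical. In practice, an established library such as sacreBLEU is a safer choice because it standardizes tokenization.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU, one reference per hypothesis, whitespace tokens."""
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Counter intersection clips each count by its reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: any order without a match zeroes the score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    # Placeholder tokens stand in for tokenized Bengali sentences.
    print(corpus_bleu(["a b c d"], ["a b c d e"]))
```

The hypothesis in the example matches the reference on every n-gram it contains, so the score below 1.0 comes entirely from the brevity penalty for being one token short.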
---
## Setup and Installation
1. Clone this repository:
```bash
git clone https://github.com/junayed-hasan/BengaliClinicalMT
cd BengaliClinicalMT
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Download the MIMIC-III database (requires appropriate credentials) and preprocess the admission notes using the provided scripts. Please see https://github.com/bvanaken/clinical-outcome-prediction for more details.
---
## Usage
### Translation and Preprocessing
Run the notebook for preprocessing and translation:
```bash
jupyter notebook translation-opus-with-preprocessing.ipynb
```
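
To give a sense of what cleaning noisy clinical text can involve, the sketch below removes common noise before translation. It is an assumption about the kind of cleaning that is useful, not a copy of the notebook's pipeline: the function name `clean_clinical_note` and all three patterns are hypothetical. The `[** ... **]` pattern targets the de-identification placeholders found in MIMIC-III notes.

```python
import re

# Assumption: de-identified spans in MIMIC-III notes look like [** ... **].
DEID_PATTERN = re.compile(r"\[\*\*.*?\*\*\]")
# Assumption: runs of 3+ punctuation marks are visual separators, not content.
SEPARATOR_PATTERN = re.compile(r"[-_=*#]{3,}")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_clinical_note(text: str) -> str:
    """Hypothetical cleaning step applied to a note before translation."""
    text = DEID_PATTERN.sub(" ", text)       # drop de-identification placeholders
    text = SEPARATOR_PATTERN.sub(" ", text)  # drop separator lines such as -----
    text = WHITESPACE_PATTERN.sub(" ", text) # collapse newlines and repeated spaces
    return text.strip()


if __name__ == "__main__":
    raw = "CHIEF COMPLAINT:   chest pain\n-----\nAdmitted [**2101-1-1**] with   dyspnea."
    print(clean_clinical_note(raw))
```

Removing placeholders before translation keeps the MT model from producing spurious output for bracketed tokens it has never seen, at the cost of occasionally leaving a sentence with a missing date or name.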
### Downstream Task Evaluation
Use the classification notebook to evaluate the performance of BanglaBERT on the translated and preprocessed dataset:
```bash
jupyter notebook classification-ben-opus-with-preprocessing.ipynb
```
---
## Datasets
- **balanced_translated_dataset.csv**: Bengali translations of English clinical texts.
- **balanced_translated_dataset_preprocessed.csv**: Preprocessed Bengali clinical text dataset, cleaned for downstream task evaluation.
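
Before training, it can help to check what a CSV file actually contains. The following sketch reports the column names, the row count, and optionally the label distribution. This README does not document the column names, so `text` and `label` in the example are placeholders, and `summarize_dataset` is a hypothetical helper rather than part of the repository.

```python
import csv
import io
from collections import Counter


def summarize_dataset(handle, label_column=None):
    """Return (column names, row count, label counts) for an open CSV file."""
    reader = csv.DictReader(handle)
    label_counts = Counter()
    row_count = 0
    for row in reader:
        row_count += 1
        if label_column is not None:
            label_counts[row[label_column]] += 1
    return reader.fieldnames, row_count, dict(label_counts)


if __name__ == "__main__":
    # Stand-in data; replace with the real file and its actual column names.
    sample = io.StringIO("text,label\nnote one,0\nnote two,1\nnote three,1\n")
    print(summarize_dataset(sample, label_column="label"))
```

For the real files, open the dataset with `open("balanced_translated_dataset_preprocessed.csv", newline="", encoding="utf-8")` and pass the handle to the function; UTF-8 is assumed here because the text is Bengali.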
---
## Results
We evaluate translation quality through:
- **Intrinsic Metric**: BLEU score
- **Extrinsic Metrics**: AUROC scores for MP and LOS tasks
The full results and analysis are available in the paper and notebooks.
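
For readers unfamiliar with the extrinsic metric, here is a minimal sketch of binary AUROC computed through the rank-sum (Mann-Whitney U) formulation. It covers only the binary case, such as MP, and the function name `auroc` is hypothetical. The notebooks may compute the metric differently, for example with scikit-learn's `roc_auc_score`.

```python
def auroc(labels, scores):
    """Binary AUROC via rank sums; tied scores share their average rank."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the group of equal scores.
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative label")
    pos_rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    # U statistic of the positives, normalized by the number of pos/neg pairs.
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


if __name__ == "__main__":
    # Toy labels (1 = positive class) and predicted probabilities.
    print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The result is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, which makes it independent of any classification threshold.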
---
## Reproducibility
To ensure reproducibility:
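1. Run the preprocessing and translation notebook (`translation-opus-with-preprocessing.ipynb`) to generate the datasets.
2. Run the classification notebook (`classification-ben-opus-with-preprocessing.ipynb`) for training and evaluation on the downstream tasks.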
3. All code and implementation details are provided in this repository.
---
## Contact
For questions, please contact:
**Mohammad Junayed Hasan**
Master’s Student, Computer Science
Johns Hopkins University
Email: [junayedhasan100@gmail.com](mailto:junayedhasan100@gmail.com)
GitHub: [junayed-hasan](https://github.com/junayed-hasan)
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Copyright
© 2024 Mohammad Junayed Hasan, Johns Hopkins University. All rights reserved.