Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/zunainaliazam/medical-data-ner-with-bert
This project leverages BERT for Named Entity Recognition (NER) on a medical dataset. The notebook provides a step-by-step guide, from dataset preparation to fine-tuning and saving the trained model for medical NER tasks.
bert-base-cased bert-fine-tuning berttokenizer confusion-matrix fine-tuning jupyter-notebook medical-data named-entity-recognition pytorch
- Host: GitHub
- URL: https://github.com/zunainaliazam/medical-data-ner-with-bert
- Owner: ZunainAliAzam
- License: MIT
- Created: 2024-10-23T13:23:36.000Z (15 days ago)
- Default Branch: main
- Last Pushed: 2024-10-27T14:47:57.000Z (11 days ago)
- Last Synced: 2024-10-27T16:33:24.675Z (11 days ago)
- Topics: bert-base-cased, bert-fine-tuning, berttokenizer, confusion-matrix, fine-tuning, jupyter-notebook, medical-data, named-entity-recognition, pytorch
- Language: Jupyter Notebook
- Homepage:
- Size: 210 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
Awesome Lists containing this project
README
# Medical Data NER with Fine-tuned BERT
This project leverages BERT for Named Entity Recognition (NER) on a medical dataset. The notebook provides a step-by-step guide, from dataset preparation to fine-tuning and saving the trained model for medical NER tasks.
## Project Structure
1. **Dataset Loading**
Upload and load your medical dataset in JSON format:
```python
from google.colab import files
import json

# Upload the file in Colab and parse the JSON data
uploaded = files.upload()
file_name = list(uploaded.keys())[0]
with open(file_name, 'r') as f:
    dataset = json.load(f)
```
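The record shape is not spelled out here, but step 2 reads `item["sentence"]` and `item["entities"]`, and the entity-level output in the inference section suggests entries like the following (illustrative values):
```python
# One assumed dataset record (schema inferred from steps 2 and 7; values illustrative)
example_record = {
    "sentence": "Mrs. Williams attends physical therapy to improve her mobility after hip replacement surgery.",
    "entities": [
        {"text": "physical therapy", "label": "Service"},
        {"text": "hip replacement surgery", "label": "MedicalProcedure"},
    ],
}
```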
2. **Tokenization and BIO Tagging**
Use BERT's tokenizer to tokenize sentences and assign BIO tags to each token:
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

def tokenize_and_create_labels(sentence, entities, tokenizer):
    # Tokenize the sentence; every token starts as 'O' (outside any entity)
    tokenized_sentence = tokenizer.tokenize(sentence)
    labels = ['O'] * len(tokenized_sentence)
    # Sample logic: assign 'B-' and 'I-' tags by matching each entity's
    # tokenized span inside the tokenized sentence
    for entity in entities:
        entity_tokens = tokenizer.tokenize(entity["text"])
        for i in range(len(tokenized_sentence) - len(entity_tokens) + 1):
            if tokenized_sentence[i:i + len(entity_tokens)] == entity_tokens:
                labels[i] = 'B-' + entity["label"]
                labels[i + 1:i + len(entity_tokens)] = ['I-' + entity["label"]] * (len(entity_tokens) - 1)
                break
    return tokenized_sentence, labels

# Process each sentence in the dataset
tokenized_data = []
for item in dataset:
    tokens, labels = tokenize_and_create_labels(item["sentence"], item["entities"], tokenizer)
    tokenized_data.append({"tokens": tokens, "labels": labels})
```
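As a quick check of the labeling logic above, one sentence with a single annotated entity (illustrative output; WordPiece may split some words further):
```python
tokens, labels = tokenize_and_create_labels(
    "Mrs. Williams attends physical therapy.",
    [{"text": "physical therapy", "label": "Service"}],
    tokenizer,
)
print(list(zip(tokens, labels)))
# Expected along the lines of:
# [('Mrs', 'O'), ('.', 'O'), ('Williams', 'O'), ('attends', 'O'),
#  ('physical', 'B-Service'), ('therapy', 'I-Service'), ('.', 'O')]
```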
3. **Tensor Conversion**
Convert tokenized data to tensors with padding for uniform input size:
```python
import torch
from sklearn.model_selection import train_test_split

# Collect the label set; its size is reused in step 4 as num_labels
unique_labels = sorted({label for item in tokenized_data for label in item["labels"]})
label2id = {label: i for i, label in enumerate(unique_labels)}

def convert_to_tensors(tokenized_data, tokenizer, max_length=128):
    # Example conversion: map tokens and labels to IDs, build attention masks,
    # and pad everything to max_length
    input_ids, attention_masks, label_ids = [], [], []
    for item in tokenized_data:
        ids = tokenizer.convert_tokens_to_ids(item["tokens"])[:max_length]
        labels = [label2id[label] for label in item["labels"]][:max_length]
        padding = max_length - len(ids)
        attention_masks.append([1] * len(ids) + [0] * padding)
        input_ids.append(ids + [tokenizer.pad_token_id] * padding)
        label_ids.append(labels + [-100] * padding)  # -100 is ignored by the loss
    return torch.tensor(input_ids), torch.tensor(attention_masks), torch.tensor(label_ids)

input_ids, attention_masks, label_ids = convert_to_tensors(tokenized_data, tokenizer)
train_inputs, test_inputs, train_labels, test_labels, train_masks, test_masks = train_test_split(
    input_ids, label_ids, attention_masks, test_size=0.2
)
```
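Each returned tensor has one row per sentence and `max_length` columns; a quick sanity check:
```python
print(input_ids.shape, attention_masks.shape, label_ids.shape)
# All three: torch.Size([num_sentences, 128])
```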
4. **Model Training**
Fine-tune the BERT model on the medical dataset using Hugging Face's `Trainer`:
```python
from transformers import BertForTokenClassification, Trainer, TrainingArguments

model = BertForTokenClassification.from_pretrained('bert-base-cased', num_labels=len(unique_labels))

training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",  # Evaluate at the end of every epoch
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=15,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
)

# train_dataset / eval_dataset wrap the tensors from step 3 (see the sketch below)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()
```
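The call above assumes `train_dataset` and `eval_dataset` already exist; they are not shown in this README. A minimal sketch of a wrapper that feeds `Trainer` the tensors from step 3 (the class name `NERDataset` is illustrative):
```python
import torch

class NERDataset(torch.utils.data.Dataset):
    # Wraps the padded tensors in the dict format Trainer expects
    def __init__(self, input_ids, attention_masks, labels):
        self.input_ids = input_ids
        self.attention_masks = attention_masks
        self.labels = labels

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {
            "input_ids": self.input_ids[idx],
            "attention_mask": self.attention_masks[idx],
            "labels": self.labels[idx],
        }

train_dataset = NERDataset(train_inputs, train_masks, train_labels)
eval_dataset = NERDataset(test_inputs, test_masks, test_labels)
```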
5. **Saving the Model**
Save the trained model weights for future use:
```python
model_save_path = "model.pt"
torch.save(model.state_dict(), model_save_path)
print(f"Model saved as {model_save_path}")
```
6. **Evaluation Metrics**
| Metric | BERT Model |
|-------------------------|------------|
| **Accuracy** | 96% |
| **Precision (Average)** | 96% |
| **Recall (Average)** | 96% |
| **F1 Score (Average)** | 96% |
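The repo's topics list `confusion-matrix`; averaged scores like these are typically computed by flattening the token-level predictions and gold tags and handing them to scikit-learn (a sketch; `references` and `predictions` are hypothetical flat lists of BIO tags with padding positions removed):
```python
from sklearn.metrics import classification_report, confusion_matrix

# `references` and `predictions`: flat, equal-length lists of BIO tags for
# every non-padding token in the evaluation split (hypothetical names)
print(classification_report(references, predictions))
print(confusion_matrix(references, predictions, labels=unique_labels))
```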
7. **NER Inferences**
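The README only shows the output below; a minimal sketch of how such token-level predictions could be produced with the fine-tuned model (assumes `model`, `tokenizer`, and `label2id` from the steps above; `id2label` and `predict` are illustrative names):
```python
import torch

id2label = {i: label for label, i in label2id.items()}

def predict(sentence, model, tokenizer, max_length=128):
    # Tokenize, run a forward pass, and map predicted IDs back to BIO tags
    tokens = tokenizer.tokenize(sentence)[:max_length]
    input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
    with torch.no_grad():
        logits = model(input_ids=input_ids).logits
    pred_ids = logits.argmax(dim=-1)[0].tolist()
    return [{"token": t, "label": id2label[p]} for t, p in zip(tokens, pred_ids)]
```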
Enter a sentence for entity recognition: Mrs. Williams attends physical therapy to improve her mobility after hip replacement surgery.
```
Token-Level Predictions:
[
{
"token": "Mrs",
"label": "B-Person"
},
{
"token": ".",
"label": "I-Person"
},
{
"token": "Williams",
"label": "I-Person"
},
{
"token": "attends",
"label": "O"
},
{
"token": "physical",
"label": "B-Service"
},
{
"token": "therapy",
"label": "I-Service"
},
{
"token": "to",
"label": "O"
},
{
"token": "improve",
"label": "B-Outcome"
},
{
"token": "her",
"label": "O"
},
{
"token": "mobility",
"label": "B-Outcome"
},
{
"token": "after",
"label": "O"
},
{
"token": "hip",
"label": "B-MedicalProcedure"
},
{
"token": "replacement",
"label": "I-MedicalProcedure"
},
{
"token": "surgery",
"label": "I-MedicalProcedure"
},
{
"token": ".",
"label": "O"
}
]
Entity-Level JSON Output:
{
"sentence": "Mrs. Williams attends physical therapy to improve her mobility after hip replacement surgery.",
"entities": [
{
"text": "Mrs . Williams",
"label": "Person"
},
{
"text": "physical therapy",
"label": "Service"
},
{
"text": "improve",
"label": "Outcome"
},
{
"text": "mobility",
"label": "Outcome"
},
{
"text": "hip replacement surgery",
"label": "MedicalProcedure"
}
]
}
```

## Requirements
- Python 3.x
- Libraries: `transformers`, `torch`, `scikit-learn`

## How to Run
1. **Install Requirements**
```bash
pip install transformers torch scikit-learn
```
2. **Upload the Dataset**
Place your JSON dataset in the same directory or upload it in the notebook.
3. **Run the Notebook**
Execute each cell sequentially to preprocess data, train, and save the model.

## Acknowledgments
- Hugging Face `transformers`
- PyTorch for tensor manipulation