https://github.com/parthapray/disease-symptom-knowledge-database_flattening

This repo contains code that flattens the Disease-Symptom Knowledge Database (as hosted on GitHub) into comma-separated symptom rows, one row per disease.

Host: GitHub
URL: https://github.com/parthapray/disease-symptom-knowledge-database_flattening
Owner: ParthaPRay
License: mit
Created: 2025-07-28T09:51:46.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-07-28T10:09:51.000Z (11 months ago)
Last Synced: 2025-07-28T11:39:46.650Z (11 months ago)
Topics: disease-symptom, knowledge-database
Language: Python
Homepage:
Size: 107 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# Disease-Symptom Data Cleaner and Flattener

This repository provides a Python script for cleaning, normalizing, and flattening the [Unified Medical Language System (UMLS)](https://www.nlm.nih.gov/research/umls/index.html)-coded disease-symptom dataset from the [Disease-Symptom Knowledge Database](https://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html).

It is designed to process medical data from Excel files (especially those with composite codes separated by `^`), remove unwanted junk characters, and output a clean, analysis-ready CSV.

## Features

* **Directly loads medical data** from a GitHub-hosted Excel file

* **Fills missing disease names** using last valid entry (forward fill)

* **Removes unnecessary columns** (e.g., occurrence counts)

* **Eliminates rows with missing symptoms**

* **Handles composite disease codes**: splits codes joined by `^` into separate records

* **Removes junk characters** (e.g., `Â`) from disease and symptom columns

* **Exports a clean, flattened CSV** ready for ML or analytics
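The forward-fill, missing-symptom, and junk-character steps listed above can be illustrated on a small hand-made table. This is a minimal sketch, not the repository's code: the sample rows and the helper name `strip_non_ascii` are invented for illustration, and only the column names `Disease` and `Symptom` come from the script.

```python
import re

import pandas as pd

# Toy frame mimicking the raw layout: the disease name appears only on the
# first row of each block (hypothetical sample values).
toy = pd.DataFrame({
    "Disease": ["UMLS:C0020538_hypertensive\u00c2 disease", None, None],
    "Symptom": ["UMLS:C0008031_pain chest", None, "UMLS:C0012833_dizziness"],
})


def strip_non_ascii(text):
    """Drop every character outside printable ASCII (0x20-0x7E)."""
    return re.sub(r"[^\x20-\x7E]", "", str(text))


toy["Disease"] = toy["Disease"].ffill()      # fill down disease names
toy = toy[toy["Symptom"].notna()].copy()     # drop rows without a symptom
toy["Disease"] = toy["Disease"].apply(strip_non_ascii)

print(toy.to_string(index=False))
```

After these steps every remaining row carries a disease name, and the stray `Â` is gone.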

## Input

The script loads `raw_data.xlsx`, the raw data in .xlsx format from the [Disease-Symptom Knowledge Database](https://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html), using the GitHub link below:

```
https://raw.githubusercontent.com/anujdutt9/Disease-Prediction-from-Symptoms/master/notebook/dataset/raw_data.xlsx
```

## Output

A cleaned and flattened CSV named:

```
flattened_url.csv
```

Each row contains a single disease code and its associated symptoms.

## Usage

1. **Clone this repository** or copy the script into your project.

2. Ensure you have [Python 3.9+](https://www.python.org/downloads/) and the following packages installed:

   ```bash
   # openpyxl is the engine pandas uses to read .xlsx files
   pip install pandas openpyxl
   ```

3. **Run the script**:

   ```bash
   python flatten.py
   ```

4. **Result:**

   You’ll get a `flattened_url.csv` file in your working directory.
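Once the CSV exists, the comma-separated `Symptom` field can be expanded into one row per disease-symptom pair for analysis. The sketch below is an assumption about how one might consume the file, not part of the repository. It reads an in-memory sample (`sample_csv`, invented here) so that it runs on its own; pass `'flattened_url.csv'` to `pd.read_csv` to use the real output.

```python
import io

import pandas as pd

# In-memory stand-in for flattened_url.csv (hypothetical sample row).
sample_csv = io.StringIO(
    "Disease,Symptom\n"
    'UMLS:C0600139_carcinoma prostate,'
    '"UMLS:C0008031_pain chest,UMLS:C0012833_dizziness"\n'
)

flat = pd.read_csv(sample_csv)

# Split the comma-joined symptoms into lists, then emit one row per symptom.
long_form = (
    flat.assign(Symptom=flat["Symptom"].str.split(","))
        .explode("Symptom", ignore_index=True)
)

print(long_form)
```

This assumes no individual symptom name contains a comma; if one did, the split would cut it in two.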

## Example

**Input disease string:**

```
UMLS:C0376358_malignant neoplasm of prostate^UMLS:C0600139_carcinoma prostate^UMLS:C0600159_carcinoma digitailis
```

**Becomes three rows:**

```
UMLS:C0376358_malignant neoplasm of prostate, symptom_1, symptom_2, ...
UMLS:C0600139_carcinoma prostate, symptom_1, symptom_2, ...
UMLS:C0600159_carcinoma digitailis, symptom_1, symptom_2, ...
```
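The split shown in this example can be reproduced in isolation. The following is a minimal sketch that mirrors the `split('^')` step of the script. The helper name `split_composite` and the placeholder symptom string are illustrative assumptions.

```python
composite = (
    "UMLS:C0376358_malignant neoplasm of prostate"
    "^UMLS:C0600139_carcinoma prostate"
    "^UMLS:C0600159_carcinoma digitailis"
)


def split_composite(disease):
    """Split a '^'-joined disease string into individual entries."""
    return [part.strip() for part in disease.split("^") if part.strip()]


symptoms = "symptom_1,symptom_2"  # placeholder for the joined symptom list

# Every split entry receives the same symptom string.
rows = [{"Disease": entry, "Symptom": symptoms}
        for entry in split_composite(composite)]

for row in rows:
    print(row["Disease"], "->", row["Symptom"])
```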

## Full Script

```python
import re

import pandas as pd

# Download the Excel file directly from the GitHub raw link
file_url = 'https://raw.githubusercontent.com/anujdutt9/Disease-Prediction-from-Symptoms/master/notebook/dataset/raw_data.xlsx'
df = pd.read_excel(file_url)

# Fill down disease names
df['Disease'] = df['Disease'].ffill()

# Drop 'Count of Disease Occurrence' if it exists
if 'Count of Disease Occurrence' in df.columns:
    df = df.drop(columns=['Count of Disease Occurrence'])

# Remove rows where Symptom is NaN (copy so later assignments do not
# trigger a SettingWithCopyWarning)
df = df[df['Symptom'].notna()].copy()

# Replace '^' with ',' in symptoms
df['Symptom'] = df['Symptom'].str.replace('^', ',', regex=False)


# --- Remove junk symbols such as Â from Disease and Symptom ---
def clean_text(text):
    # Keep only printable ASCII characters (space through '~')
    return re.sub(r'[^\x20-\x7E]', '', str(text))


df['Disease'] = df['Disease'].apply(clean_text)
df['Symptom'] = df['Symptom'].apply(clean_text)

# Prepare new rows for disease splits
expanded_rows = []
for disease, group in df.groupby('Disease', sort=False):
    symptoms = ','.join(group['Symptom'])
    disease_codes = [d.strip() for d in disease.split('^')]
    for code in disease_codes:
        expanded_rows.append({'Disease': code, 'Symptom': symptoms})

# Create the result DataFrame, preserving order
result = pd.DataFrame(expanded_rows)

# (Optional) Clean junk symbols from the final result again, just in case
result['Disease'] = result['Disease'].apply(clean_text)
result['Symptom'] = result['Symptom'].apply(clean_text)

# Save to CSV
output_csv = 'flattened_url.csv'
result.to_csv(output_csv, index=False)

# Show the first few rows as a quick check
print(result.head())
```

### Raw data to flattened data

[Image: raw data before flattening]

[Image: flattened data (comma separated) after processing]

```
UMLS:C0008031_pain chest,UMLS:C0392680_shortness of breath,UMLS:C0012833_dizziness,UMLS:C0004093_asthenia,UMLS:C0085639_fall,UMLS:C0039070_syncope,UMLS:C0042571_vertigo,UMLS:C0038990_sweat,UMLS:C0700590_sweating increased,UMLS:C0030252_palpitation,UMLS:C0027497_nausea,UMLS:C0002962_angina pectoris,UMLS:C0438716_pressure chest
```
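Each entry in a flattened string such as the one above follows the pattern `UMLS:<code>_<name>`, so it can be separated into a code and a readable name. The sketch below is an illustrative assumption about downstream use, not part of the repository. The function name `parse_entries` is hypothetical, and it assumes that names contain no commas and that the first underscore separates code from name.

```python
flattened = (
    "UMLS:C0008031_pain chest,UMLS:C0392680_shortness of breath,"
    "UMLS:C0012833_dizziness"
)


def parse_entries(text):
    """Return (code, name) pairs from a comma-separated UMLS string."""
    pairs = []
    for entry in text.split(","):
        # Split at the first underscore only; names may contain spaces.
        code, _, name = entry.strip().partition("_")
        # str.removeprefix requires Python 3.9+
        pairs.append((code.removeprefix("UMLS:"), name))
    return pairs


for code, name in parse_entries(flattened):
    print(code, name)
```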

## License

MIT

## References

1. https://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html

2. https://github.com/anujdutt9/Disease-Prediction-from-Symptoms
