{"id":25975739,"url":"https://github.com/samia35-2973/living-type-classification-from-codon-usage","last_synced_at":"2026-05-06T01:34:34.119Z","repository":{"id":279897925,"uuid":"940368327","full_name":"Samia35-2973/Living-Type-Classification-from-Codon-Usage","owner":"Samia35-2973","description":"Machine learning project to classify living types based on codon usage data using Random Forest and XGBoost classifiers.","archived":false,"fork":false,"pushed_at":"2025-02-28T03:57:11.000Z","size":9158,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-28T11:26:57.839Z","etag":null,"topics":["classification","codon-usage","data-cleaning","data-preprocessing","excel","exploratory-data-analysis","living-type","machine-learning","python","random-forest-classifier","scikit-learn","supervised-learning","xgboost-classifier"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Samia35-2973.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-28T03:50:45.000Z","updated_at":"2025-02-28T05:42:11.000Z","dependencies_parsed_at":"2025-02-28T11:27:02.005Z","dependency_job_id":"bedcdf7c-0a66-49b8-8749-5733930a2ed1","html_url":"https://github.com/Samia35-2973/Living-Type-Classification-from-Codon-Usage","commit_stats":null,"previous_names":["samia35-2973/living-type-classification-from-codon-usage"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/repositories/Samia35-2973%2FLiving-Type-Classification-from-Codon-Usage","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Samia35-2973%2FLiving-Type-Classification-from-Codon-Usage/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Samia35-2973%2FLiving-Type-Classification-from-Codon-Usage/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Samia35-2973%2FLiving-Type-Classification-from-Codon-Usage/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Samia35-2973","download_url":"https://codeload.github.com/Samia35-2973/Living-Type-Classification-from-Codon-Usage/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241956594,"owners_count":20048662,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["classification","codon-usage","data-cleaning","data-preprocessing","excel","exploratory-data-analysis","living-type","machine-learning","python","random-forest-classifier","scikit-learn","supervised-learning","xgboost-classifier"],"created_at":"2025-03-05T03:23:53.955Z","updated_at":"2026-05-06T01:34:34.062Z","avatar_url":"https://github.com/Samia35-2973.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Living Type Classification from Codon Usage\n\nThis project comes from a hackathon in which I worked with a raw codon usage dataset. The goal was to clean the data thoroughly and then classify each record into a living type. 
The classes of the `type` target were:\n\n- bacteria\n- virus\n- plant\n- bacteriophage\n- mammal\n- vertebrate\n- invertebrate\n- rodent\n- archaea\n- primate\n- plasmid\n\n## Directory Structure\n\n```\n- codon-usage-dataset-partially-cleaned\n    - dna_test - dna_test.csv.csv\n    - train_replaced_missing_values.csv\n- Raw Data\n    - dna_sample_submission.csv\n    - dna_test.csv\n    - dna_train.csv\n- output\n    - submission3.csv\n    - submission4.csv\n    - submission5.csv\n- living-type-classification-from-codon-usage.ipynb\n```\n\n### Data Cleaning\n\n1. **Handling Missing Values**: The first step was to load the raw data (`dna_train.csv` and `dna_test.csv`) into Excel, where I cleaned up string values that stood in for `NaN`. These strings caused the affected columns to be read with the `object` data type instead of a numeric one. Once they were cleaned, the columns were automatically parsed with the correct data types. The cleaned datasets were saved as:\n   - `train_replaced_missing_values.csv`\n   - `dna_test - dna_test.csv.csv`\n\n2. **Exploratory Data Analysis (EDA)**: After inspecting the data, I performed EDA to identify and impute the remaining missing values. I used the distribution of each numerical variable to choose how to impute it. During this analysis, I found that the `uuu` column contained a string value among its float values, and I corrected it.\n\n3. **Encoding Categorical Data**: After cleaning and imputation, I encoded the categorical data for model training.\n\n### Model Training\n\nI trained the following models:\n\n1. **RandomForestClassifier**: \n   ```python\n   r_mod = RandomForestClassifier(n_estimators=100, random_state=42)\n   ```\n\n2. **XGBoost Classifier**: \n   ```python\n   x_mod = XGBClassifier(objective='multi:softmax', n_estimators=100, random_state=42)\n   ```\n\n3. 
**Optimized XGBoost Classifier**:\n   ```python\n   xm = XGBClassifier(objective='multi:softmax', n_estimators=500, learning_rate=0.5, random_state=42)\n   ```\n\n### Results\n\n- **Random Forest Results**:\n  - Accuracy: 0.8399\n  - Precision: 0.8468\n  - Recall: 0.8399\n  - F1-Score: 0.8252\n\n- **XGBoost Results**:\n  - Accuracy: 0.8910\n  - Precision: 0.8899\n  - Recall: 0.8910\n  - F1-Score: 0.8865\n\n### Submission Files\n\nThe results were saved in the following files:\n- `submission3.csv`\n- `submission4.csv`\n- `submission5.csv`\n\n## How to Run\n\n1. Clone this repository.\n2. Open the `living-type-classification-from-codon-usage.ipynb` notebook.\n3. Load the cleaned datasets: `train_replaced_missing_values.csv` and `dna_test - dna_test.csv.csv`.\n4. Replace the train data path under the \"Understanding Data\" section and the test data path under the \"Test Data Preparation\" section of the notebook (use the `read_csv()` function).\n5. Run the notebook.\n\nAlternatively, you can start from scratch by using the raw data files (`dna_train.csv` and `dna_test.csv`) and following the cleaning and preprocessing steps in the notebook.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsamia35-2973%2Fliving-type-classification-from-codon-usage","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsamia35-2973%2Fliving-type-classification-from-codon-usage","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsamia35-2973%2Fliving-type-classification-from-codon-usage/lists"}