https://github.com/walidalsafadi/toxicity
The dataset includes 171 molecules designed for functional domains of a core clock protein, CRY1, responsible for generating circadian rhythm. 56 of the molecules are toxic and the rest are non-toxic.
exploratory-data-analysis genetic-algorithm jupyter python svm svm-classifier
- Host: GitHub
- URL: https://github.com/walidalsafadi/toxicity
- Owner: WalidAlsafadi
- License: mit
- Created: 2023-03-15T21:42:52.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-03-16T20:20:52.000Z (over 2 years ago)
- Last Synced: 2025-01-22T17:16:15.497Z (5 months ago)
- Topics: exploratory-data-analysis, genetic-algorithm, jupyter, python, svm, svm-classifier
- Language: Jupyter Notebook
- Homepage:
- Size: 590 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Toxicity
The dataset includes 171 molecules designed for functional domains of a core clock protein, CRY1, responsible for generating circadian rhythm. 56 of the molecules are toxic and the rest are non-toxic.

### Information:
| Dataset Characteristics | Subject Area | Associated Tasks | Attribute Type | Instances | Attributes |
|---------|:-------------|:--------------:|:-------------|:-------------|--------------:|
| Tabular | Life Sciences | Classification | - | 171 | 1203 |

What do the instances in this dataset represent?
Small molecules.
Was there any data preprocessing performed?
The data consist of a complete set of 1203 molecular descriptors and need feature selection before classification, since some of the features are redundant. We used Recursive Feature Elimination together with a Decision Tree Classifier (DTC) to get the best set of molecular descriptors for DTC. A subset of the data with 13 features is included as a supplementary file.
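
The preprocessing described above can be sketched with scikit-learn's `RFE` wrapped around a `DecisionTreeClassifier`. This is a minimal illustration, not the dataset authors' actual pipeline: the data are synthetic (generated with `make_classification`), the column count is cut to 60 to keep the run short, and only the target of 13 selected features is taken from the description.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the descriptor table (assumption): 171 rows, 60 columns.
X, y = make_classification(
    n_samples=171, n_features=60, n_informative=8, random_state=0
)

# Recursively drop the least important feature (step=1) until 13 remain.
selector = RFE(
    estimator=DecisionTreeClassifier(random_state=0),
    n_features_to_select=13,
    step=1,
)
selector.fit(X, y)

# Column indices that survived elimination, and the reduced matrix.
selected_columns = [i for i, keep in enumerate(selector.support_) if keep]
X_reduced = selector.transform(X)

print("selected columns:", selected_columns)
print("reduced shape:", X_reduced.shape)
```

On the real 1203-column table, a larger `step` would shorten the run at the cost of a coarser elimination.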
### Task:
Implement an SVM algorithm using any benchmark dataset of your choosing; the only condition is that the data must have 1000 or more columns (features). Then use a Genetic Algorithm (GA) to perform dimensionality reduction by feature selection.
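
One way to approach this task is sketched below. It is an illustrative outline under stated assumptions, not the notebook's implementation: the data are synthetic with only 50 columns so the example runs quickly, and the GA design (boolean masks, tournament selection, one-point crossover, bit-flip mutation, elitism) as well as the names `fitness` and `evolve` and all parameter values are choices made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for a wide dataset (assumption, not the toxicity data).
X, y = make_classification(
    n_samples=171, n_features=50, n_informative=6, n_redundant=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)


def fitness(mask):
    """Mean 3-fold CV accuracy of a scaled RBF SVM on the masked columns."""
    if not mask.any():
        return 0.0  # an empty feature set cannot be scored
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    return cross_val_score(model, X_train[:, mask], y_train, cv=3).mean()


def evolve(n_features, pop_size=12, generations=8, mutation_rate=0.05):
    """Search boolean feature masks; return the best mask seen and its score."""
    population = rng.random((pop_size, n_features)) < 0.5
    best_mask, best_score = None, -1.0
    for generation in range(generations):
        scores = np.array([fitness(ind) for ind in population])
        top = scores.argmax()
        if scores[top] > best_score:
            best_mask, best_score = population[top].copy(), scores[top]
        children = [best_mask.copy()]  # elitism: carry the best mask forward
        while len(children) < pop_size:
            parents = []
            for _ in range(2):  # binary tournament selection
                a, b = rng.integers(pop_size, size=2)
                winner = a if scores[a] >= scores[b] else b
                parents.append(population[winner])
            cut = rng.integers(1, n_features)  # one-point crossover
            child = np.concatenate([parents[0][:cut], parents[1][cut:]])
            flip = rng.random(n_features) < mutation_rate  # bit-flip mutation
            children.append(child ^ flip)
        population = np.array(children)
    return best_mask, best_score


best_mask, cv_score = evolve(X.shape[1])

# Compare held-out accuracy with all features and with the GA subset.
baseline = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)
reduced = make_pipeline(StandardScaler(), SVC()).fit(X_train[:, best_mask], y_train)
print("features kept:", int(best_mask.sum()), "of", X.shape[1])
print("test accuracy, all features:", baseline.score(X_test, y_test))
print("test accuracy, GA subset:", reduced.score(X_test[:, best_mask], y_test))
```

Fitness is computed by cross-validation on the training split only, so the held-out test set plays no part in choosing features. Whether the selected subset beats the baseline depends on the data and the random seed.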
### Conclusion:
The SVM algorithm is a powerful technique for classification problems, but its performance can be improved by selecting a relevant subset of features.
In this study, we performed an exploratory data analysis and found that the dataset consists of 1003 continuous and 200 discrete features, with no missing values or redundant instances. We implemented an SVM classifier using Python's scikit-learn library and used a GA for feature selection. The accuracy of the SVM model rose from 54.3% before feature selection to 68.6% after it, a gain of 14.3 percentage points. This indicates that the GA identified a subset of features that improved the performance of the model.
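
The exploratory checks mentioned above (missing values, duplicate rows, continuous versus discrete features) can be sketched with pandas as follows. The table is a hypothetical toy example, the column names and class labels are invented for illustration, and treating integer-typed columns as discrete is an assumption rather than the rule used in the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy table standing in for the descriptor data (assumption).
df = pd.DataFrame({
    "desc_a": rng.normal(size=6),
    "desc_b": rng.normal(size=6),
    "count_c": rng.integers(0, 5, size=6),
    "Class": ["Toxic", "NonToxic", "NonToxic", "Toxic", "NonToxic", "NonToxic"],
})

features = df.drop(columns="Class")

# Total number of missing cells across all feature columns.
n_missing = int(features.isna().sum().sum())

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(df.duplicated().sum())

# Assumption: integer-typed columns are discrete, float-typed are continuous.
discrete_cols = features.select_dtypes(include=np.integer).columns.tolist()
continuous_cols = features.select_dtypes(include=np.floating).columns.tolist()

print("missing values:", n_missing)
print("duplicate rows:", n_duplicates)
print("discrete features:", len(discrete_cols))
print("continuous features:", len(continuous_cols))
print(df["Class"].value_counts())
```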
In summary, the combination of SVM and GA is an effective approach for classification problems on high-dimensional datasets. The results of this study illustrate the importance of feature selection in improving the accuracy of SVM models and highlight the potential of GA as a feature selection technique.
### References:
UC Irvine Machine Learning Repository. (n.d.). UC Irvine Machine Learning Repository. [https://archive-beta.ics.uci.edu/dataset/728/toxicity-2](https://archive-beta.ics.uci.edu/dataset/728/toxicity-2)