https://github.com/walidalsafadi/toxicity

The dataset includes 171 molecules designed for functional domains of a core clock protein, CRY1, responsible for generating circadian rhythm. 56 of the molecules are toxic and the rest are non-toxic.

exploratory-data-analysis genetic-algorithm jupyter python svm svm-classifier

# Toxicity
The dataset includes 171 molecules designed for functional domains of a core clock protein, CRY1, responsible for generating circadian rhythm. 56 of the molecules are toxic and the rest are non-toxic.

### Information:
| Dataset Characteristics | Subject Area | Associated Tasks | Attribute Type | Instances | Attributes |
|---------|:-------------|:--------------:|:-------------|:-------------|--------------:|
| Tabular | Life Sciences | Classification | - | 171 | 1203 |

What do the instances in this dataset represent?

Small molecules

Was there any data preprocessing performed?

The data consists of a complete set of 1203 molecular descriptors and needs feature selection before classification, since some of the features are redundant. We used Recursive Feature Elimination (RFE) together with a Decision Tree Classifier (DTC) to get the best set of molecular descriptors for the DTC. The subsetted data, with 13 features, is included as a supplementary file.
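
The preprocessing described above can be sketched roughly as follows. This is a minimal illustration rather than the original preprocessing script: it assumes scikit-learn's `RFE` and `DecisionTreeClassifier`, uses a synthetic stand-in with the same shape as the dataset (171 x 1203) instead of the real descriptors, and the `step=50` and `random_state` values are assumptions chosen only to keep the run short and repeatable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the dataset's shape (171 molecules x 1203 descriptors).
# The real descriptor table is NOT loaded here; this is an assumption for illustration.
X, y = make_classification(
    n_samples=171, n_features=1203, n_informative=20, random_state=0
)

# Recursively drop the least important descriptors until 13 remain.
# step=50 removes up to 50 features per round (an assumed value, for speed).
selector = RFE(
    estimator=DecisionTreeClassifier(random_state=0),
    n_features_to_select=13,
    step=50,
)
X_subset = selector.fit_transform(X, y)

# Column indices of the descriptors that survived elimination.
selected_idx = np.flatnonzero(selector.support_)
print("Subset shape:", X_subset.shape)
print("Selected column indices:", selected_idx)
```

With the real data, `selected_idx` would map back to descriptor names through the column order of the loaded table.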

### Task:

Implement an SVM algorithm using any benchmark dataset of your choosing; the only condition is that the data must have 1000 or more columns (features). Then use a Genetic Algorithm (GA) to perform dimensionality reduction by feature selection.
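
As a starting point for the first half of the task, a baseline SVM on all features might look like the sketch below. It assumes scikit-learn, a synthetic stand-in dataset that satisfies the 1000-column condition, an RBF kernel, feature scaling, and an 80/20 stratified split; none of these choices are confirmed by the repository.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with more than 1000 features (assumption, not the real data).
X, y = make_classification(
    n_samples=171, n_features=1203, n_informative=20, random_state=0
)

# Hold out a stratified test set so both classes appear in each split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# SVMs are sensitive to feature scale, so standardize before fitting.
baseline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
baseline.fit(X_train, y_train)

baseline_accuracy = baseline.score(X_test, y_test)
print(f"Baseline accuracy on all {X.shape[1]} features: {baseline_accuracy:.3f}")
```

The held-out accuracy from this baseline is the number a feature-selection step would then try to beat.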

### Conclusion:

The SVM algorithm is a powerful technique for classification problems, but its performance can be improved by selecting a relevant subset of features.

In this study, an exploratory data analysis showed that the dataset consists of 1003 continuous and 200 discrete features, with no missing values or redundant instances. We implemented an SVM classifier using Python's scikit-learn library and used a GA for feature selection. The accuracy of the SVM model rose from 54.3% before feature selection to 68.6% after it, a gain of 14.3 percentage points. This indicates that the GA identified a subset of features that improved the performance of the model.
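
A GA-based feature selector wrapped around an SVM can be sketched as below. This is an illustrative outline under stated assumptions, not the notebook's implementation: the data is a synthetic stand-in, and the population size, number of generations, mutation rate, tournament selection, uniform crossover, and elitism are all hypothetical choices. Fitness is the cross-validated accuracy on the training split only, so the test split stays untouched until the final comparison.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the 171 x 1203 dataset (assumption, not the real data).
X, y = make_classification(n_samples=171, n_features=1203, n_informative=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)


def make_svm():
    """Scaled RBF-kernel SVM; the kernel choice is an assumption."""
    return make_pipeline(StandardScaler(), SVC(kernel="rbf"))


def fitness(mask):
    """Mean 3-fold CV accuracy on the training split, using only masked features."""
    if not mask.any():
        return 0.0  # an empty feature set cannot be scored
    return cross_val_score(make_svm(), X_train[:, mask], y_train, cv=3).mean()


def tournament(population, scores, k=3):
    """Return the fittest of k randomly drawn individuals."""
    contenders = rng.choice(len(population), size=k, replace=False)
    return population[contenders[np.argmax(scores[contenders])]]


# Hypothetical GA settings, kept small so the sketch runs quickly.
POP_SIZE, N_GENERATIONS, MUTATION_RATE = 12, 5, 0.01
n_features = X_train.shape[1]

# Each individual is a boolean mask; start with roughly 10% of features switched on.
population = rng.random((POP_SIZE, n_features)) < 0.1

for generation in range(N_GENERATIONS):
    scores = np.array([fitness(ind) for ind in population])
    children = [population[np.argmax(scores)].copy()]  # elitism: keep the best mask
    while len(children) < POP_SIZE:
        parent_a = tournament(population, scores)
        parent_b = tournament(population, scores)
        # Uniform crossover: each gene comes from either parent with equal chance.
        take_a = rng.random(n_features) < 0.5
        child = np.where(take_a, parent_a, parent_b)
        # Bit-flip mutation: toggle a small random fraction of genes.
        flip = rng.random(n_features) < MUTATION_RATE
        children.append(np.logical_xor(child, flip))
    population = np.array(children)

scores = np.array([fitness(ind) for ind in population])
best_mask = population[np.argmax(scores)]

# Compare held-out accuracy with all features versus the GA-selected subset.
acc_all = make_svm().fit(X_train, y_train).score(X_test, y_test)
acc_ga = make_svm().fit(X_train[:, best_mask], y_train).score(X_test[:, best_mask], y_test)
print(f"All features: {acc_all:.3f} | GA subset ({best_mask.sum()} features): {acc_ga:.3f}")
```

On synthetic data the two accuracies will not match the 54.3% and 68.6% reported above, and a run this short is not guaranteed to improve on the baseline; larger populations and more generations trade run time for a more thorough search.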

In summary, combining SVM with GA is an effective approach for classification problems on high-dimensional datasets such as this one. The results of this study demonstrate the importance of feature selection in improving the accuracy of SVM models and highlight the potential of GA as a feature selection technique.

### References:

UC Irvine Machine Learning Repository. (n.d.). UC Irvine Machine Learning Repository. [https://archive-beta.ics.uci.edu/dataset/728/toxicity-2](https://archive-beta.ics.uci.edu/dataset/728/toxicity-2)