https://github.com/amirhosseinazami1373/dna-sequencing-using-k-mer-counting-and-machine-learning-techniques

Utilizing support vector machines, naïve bayes classification and neural networks to find which gives the best results of classification of DNA sequence data into 7 different protein classes.

gene-sequencing k-mer-counting naive-bayes-classifier neural-networks nlp svm

# DNA-Sequencing-Using-K-mer-Counting-and-Machine-Learning-Techniques
Analyzing and reading gene sequence data is a hard task for a machine mainly because:

**1- Gene sequence data are in the form of basic nucleotide sequences (A, T, C, G), not numeric values.**

**2- The length of gene sequences varies from one sequence to the next.**

Most classifiers require numeric input vectors of a fixed length, so variable-length sequences cannot be fed to them directly and must first be vectorized into a uniform length. There are several ways to approach the pre-processing of sequencing data; the most popular seems to be the **‘k-mer’** approach.

Using this method, we split each sequence into overlapping words of length k. Values of k are often chosen as multiples of three, because three consecutive nucleotides form a codon, which corresponds to one amino acid. Using the k-mer technique, we can create a **bag of words** for gene sequences.
This bag of words is later used to train different machine-learning algorithms. For this project, we aim to compare the performance of Neural Networks, Naïve Bayes, and Support Vector Machines for the detection of enzymes based on their sequence.
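
The k-mer splitting and bag-of-words steps described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `kmers` and `bag_of_words` and the toy sequences are assumptions made for the example.

```python
from collections import Counter


def kmers(sequence, k=4):
    """Split a sequence into overlapping words of length k (stride 1)."""
    sequence = sequence.upper()
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bag_of_words(sequences, k=4):
    """Turn variable-length sequences into count vectors of equal length.

    Returns the sorted k-mer vocabulary and one count vector per sequence.
    """
    counts = [Counter(kmers(seq, k)) for seq in sequences]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors


if __name__ == "__main__":
    # Hypothetical toy sequences of different lengths.
    vocabulary, vectors = bag_of_words(["ATGCATG", "GGGG"], k=3)
    print(vocabulary)  # ['ATG', 'CAT', 'GCA', 'GGG', 'TGC']
    print(vectors)     # [[2, 1, 1, 0, 1], [0, 0, 0, 2, 0]]
```

Every sequence ends up as a vector with one entry per vocabulary k-mer, regardless of its original length, which is what makes the data usable by the classifiers below.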

# Gene types and distribution:

The plots below show the distribution of each class for a k-mer length of k = 4. Other k-mer lengths can be used for the task as well.

![image](https://github.com/user-attachments/assets/894780c2-8954-4948-bfc1-bf5623ed2920)
![image](https://github.com/user-attachments/assets/6dc5a3e6-213d-4d01-a656-bb064b97b708)

# ML methods:

**Naïve Bayes** is a probabilistic model of data that performs generative
classification rather than discriminative classification. The model assigns a piece of data
to whichever class is most probable for it. Naïve Bayes, in particular, learns a
probability for each feature (here, each k-mer) independently
of the other features and combines those probabilities to
estimate the total probability for the whole piece of data.

*Tunable parameters:*

*1- Smoothing value*
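
As a rough sketch of how the smoothing value enters the model, the following multinomial Naïve Bayes works on lists of k-mers. It assumes additive (Laplace) smoothing with a parameter named `alpha`; the function names and toy data are illustrative and not taken from the project.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on lists of k-mers.

    alpha is the smoothing value and must be > 0 so that k-mers
    unseen in a class never get probability zero.
    """
    class_counts = Counter(labels)
    kmer_counts = defaultdict(Counter)
    vocabulary = set()
    for doc, label in zip(documents, labels):
        kmer_counts[label].update(doc)
        vocabulary.update(doc)

    model = {}
    for label, n_docs in class_counts.items():
        total = sum(kmer_counts[label].values())
        denominator = total + alpha * len(vocabulary)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_likelihood": {
                word: math.log((kmer_counts[label][word] + alpha) / denominator)
                for word in vocabulary
            },
        }
    return model


def predict_naive_bayes(model, doc):
    """Return the class with the highest log-probability for a list of k-mers."""
    def score(label):
        entry = model[label]
        # k-mers outside the training vocabulary are ignored.
        return entry["log_prior"] + sum(
            entry["log_likelihood"][w] for w in doc if w in entry["log_likelihood"]
        )
    return max(model, key=score)


if __name__ == "__main__":
    docs = [["ATG", "TGC"], ["ATG", "ATG"], ["GGG", "GGC"], ["GGC", "GGG"]]
    labels = ["a", "a", "b", "b"]
    model = train_naive_bayes(docs, labels, alpha=1.0)
    print(predict_naive_bayes(model, ["ATG", "TGC"]))  # a
    print(predict_naive_bayes(model, ["GGG"]))         # b
```

Working in log space avoids numerical underflow when the per-k-mer probabilities of a long sequence are combined.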

**Support vector machine** is used here as a linear classifier that sorts data into different classes based on linear
separation. The count of each k-mer in a sequence is multiplied by a weight
value, the weighted counts are added up, and a bias
value is then added for the whole sequence. The bias and weight values are adjusted as the model
learns from the training data.

*Tunable parameters:*

*1- Learning rate*

*2- Number of iterations*

*3- C parameter*
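
To show where the three parameters above enter, here is a minimal sketch of a binary linear SVM trained by subgradient descent on the hinge loss. It is an assumption about one common way to train such a model, not the project's implementation; the seven-class problem would additionally need a scheme such as one-vs-rest, which is not shown. All names and data are illustrative.

```python
def train_linear_svm(samples, labels, learning_rate=0.01, iterations=1000, C=1.0):
    """Minimise 0.5 * ||w||^2 + C * sum(hinge loss) by subgradient descent.

    samples: list of equal-length feature vectors (e.g. k-mer counts)
    labels:  list of -1 / +1 class labels
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(iterations):
        grad_w = list(weights)  # gradient of the 0.5 * ||w||^2 term
        grad_b = 0.0
        for x, y in zip(samples, labels):
            margin = y * (sum(w * v for w, v in zip(weights, x)) + bias)
            if margin < 1:  # the sample violates the margin
                for i, v in enumerate(x):
                    grad_w[i] -= C * y * v
                grad_b -= C * y
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


def predict_svm(weights, bias, x):
    """Weighted sum of the features plus bias, thresholded at zero."""
    return 1 if sum(w * v for w, v in zip(weights, x)) + bias >= 0 else -1


if __name__ == "__main__":
    # Hypothetical two-feature count vectors for two classes.
    samples = [[2, 0], [3, 1], [0, 2], [1, 3]]
    labels = [1, 1, -1, -1]
    weights, bias = train_linear_svm(samples, labels)
    print([predict_svm(weights, bias, x) for x in samples])
```

A larger `C` penalises margin violations more heavily relative to the weight norm, while the learning rate and iteration count control how far and how long the weights are adjusted.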

**Neural networks** are machine learning algorithms that were
developed based on the biological neural networks that make up
our brain and the way we process data. Neural networks take in
data and process it using a chosen number of ‘hidden’ layers that
transform the input data using weights, biases, and activation functions
to produce an output that indicates the class of the data that
was fed into the algorithm. We made use of a
neural network with three hidden layers and one output layer that
all started with random weight and bias values, which were
then tuned using backpropagation. The backpropagation
algorithm adjusts the weights and biases by going
backward through the network so that the algorithm’s output
lines up more closely with the actual class of the data. We
specifically made use of the widely used ReLU activation function, which
passes any value larger than 0 through unchanged and maps any
value less than 0 to zero.

*Tunable parameters:*

*1- Regularization value*

*2- Learning rate*

*3- Number of epochs*
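
The sketch below illustrates a network with three hidden ReLU layers and one output layer trained by backpropagation, and shows where the three parameters above enter. Several details are assumptions not stated in this README: a softmax output with cross-entropy loss, L2 regularization, per-sample gradient descent, the initialization scheme, and the layer sizes. All function names are hypothetical.

```python
import math
import random


def relu(value):
    """Pass positive values through unchanged; map the rest to zero."""
    return value if value > 0 else 0.0


def init_layer(n_in, n_out, rng):
    """Random starting weights (scaled by fan-in, an assumption) and zero biases."""
    scale = math.sqrt(2.0 / n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(layers, x):
    """Return the activations of every layer; the last entry is the softmax output."""
    activations = [x]
    for index, (weights, biases) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activations[-1])) + b
             for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            activations.append([relu(v) for v in z])
        else:
            peak = max(z)  # subtract the maximum for numerical stability
            exps = [math.exp(v - peak) for v in z]
            total = sum(exps)
            activations.append([e / total for e in exps])
    return activations


def train_network(samples, labels, sizes, learning_rate=0.05, epochs=100,
                  regularization=1e-4, seed=0):
    """sizes = [inputs, hidden1, hidden2, hidden3, classes]; labels are class indices."""
    rng = random.Random(seed)
    layers = [init_layer(sizes[i], sizes[i + 1], rng) for i in range(len(sizes) - 1)]
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            acts = forward(layers, x)
            # Gradient of cross-entropy with respect to the output pre-activations.
            delta = [p - (1.0 if k == y else 0.0) for k, p in enumerate(acts[-1])]
            # Walk backward through the layers.
            for li in range(len(layers) - 1, -1, -1):
                weights, biases = layers[li]
                inputs = acts[li]
                if li > 0:
                    # Propagate the error through the ReLU of the previous layer,
                    # using the weights before they are updated.
                    prev_delta = [
                        sum(weights[j][i] * delta[j] for j in range(len(delta)))
                        if inputs[i] > 0 else 0.0
                        for i in range(len(inputs))
                    ]
                for j, row in enumerate(weights):
                    for i in range(len(row)):
                        row[i] -= learning_rate * (delta[j] * inputs[i]
                                                   + regularization * row[i])
                    biases[j] -= learning_rate * delta[j]
                if li > 0:
                    delta = prev_delta
    return layers


if __name__ == "__main__":
    # Hypothetical normalized k-mer frequency vectors for two classes.
    samples = [[1.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.9, 0.1]]
    labels = [0, 0, 1, 1]
    layers = train_network(samples, labels, sizes=[4, 8, 8, 8, 2])
    print(forward(layers, samples[0])[-1])  # class probabilities for the first sample
```

In practice a numerical library would replace the nested loops; plain Python is used here only to keep each backpropagation step visible.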