Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/hayatiyrtgl/automated_essay_scoring
- Host: GitHub
- URL: https://github.com/hayatiyrtgl/automated_essay_scoring
- Owner: HayatiYrtgl
- License: mit
- Created: 2024-07-09T11:38:38.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-07-09T11:39:52.000Z (4 months ago)
- Last Synced: 2024-07-09T14:51:39.940Z (4 months ago)
- Topics: artificial-intelligence, essay, essayscoring, keras, linear-models, lstm-neural-networks, machine-learning, python, tensorflow
- Language: Jupyter Notebook
- Homepage:
- Size: 179 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
Here's a detailed analysis of the provided code:
### Imports and Dependencies
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from keras.utils import to_categorical
from keras.layers import *
from keras.models import Model, Sequential
import matplotlib.pyplot as plt
import seaborn as sns
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from sklearn.model_selection import train_test_split
from keras.callbacks import ModelCheckpoint
```
The code uses several libraries:
- **Pandas** and **NumPy** for data manipulation.
- **Scikit-learn** for preprocessing and machine learning models.
- **Keras** (part of TensorFlow) for building and training neural networks.
- **Matplotlib** and **Seaborn** for data visualization.

### Data Loading
```python
dataset = pd.read_csv("../dataset/learning-agency-lab-automated-essay-scoring-2/train.csv")
```
The training data is loaded from a CSV file.
### Data Exploration and Preprocessing
1. **Basic Information**
```python
dataset.head()
dataset.dtypes
dataset.isna().sum()
```
Displaying the first few rows, inspecting the data types, and checking for missing values.
2. **Text Cleaning**
```python
dataset.full_text = dataset.full_text.replace("[^a-zA-Z0-9 ]", "", regex=True)
```
Removing all non-alphanumeric characters (except spaces) from the essay text.
3. **Tokenization**
```python
tokenizer = Tokenizer(num_words=10000)
```
Initializing a tokenizer that keeps only the top 10,000 most frequent words (see the toy sketch after this list).
4. **Word Count Feature**
```python
def apply_func(x):
    # Count whitespace-separated tokens in one essay
    splitted = x.split()
    return len(splitted)

dataset["word_num"] = dataset.full_text.apply(apply_func)
```
Creating a `word_num` column that counts the words in each essay.
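To make the tokenizer's behavior from step 3 concrete, here is a small standalone sketch; the sentences are illustrative and not from the dataset:
```python
from keras.preprocessing.text import Tokenizer

# Toy corpus to illustrate fit_on_texts / texts_to_sequences
toy_texts = ["the essay was good", "the essay was very very long"]

toy_tokenizer = Tokenizer(num_words=10000)
toy_tokenizer.fit_on_texts(toy_texts)

# Word indices are assigned by descending frequency, starting at 1
print(toy_tokenizer.word_index)
# e.g. {'the': 1, 'essay': 2, 'was': 3, 'very': 4, 'good': 5, 'long': 6}
print(toy_tokenizer.texts_to_sequences(toy_texts))
# e.g. [[1, 2, 3, 5], [1, 2, 3, 4, 4, 6]]
```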
### Exploratory Data Analysis (EDA)
```python
dataset.describe()
sns.countplot(data=dataset, x="score")
sns.boxenplot(data=dataset, x="score", y="word_num")
sns.regplot(data=dataset, x="score", y="word_num")
```
Basic statistical descriptions and visualizations:
- **Count plot** of scores.
- **Boxen plot** and **regression plot** to explore the relationship between word count and score.

### Data Filtering
```python
data = dataset.copy()
data = data.sort_values(by="word_num", ascending=False)
ninety_nine = round(len(data) * 0.01)  # number of rows in the top 1%
data = data.iloc[ninety_nine:, :]      # drop the 1% longest essays
data = data.sort_index().reset_index()
data.drop("index", axis=1, inplace=True)
```
Filtering out the top 1% of essays with the highest word counts to remove outliers (note that `ninety_nine`, despite its name, holds the size of the top 1%).
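As an aside, a roughly equivalent and more direct way to express this filter uses `DataFrame.quantile`; this is an alternative formulation, not what the notebook does:
```python
# Keep essays at or below the 99th percentile of word count
cutoff = dataset["word_num"].quantile(0.99)
data = dataset[dataset["word_num"] <= cutoff].reset_index(drop=True)
```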
### Text Preprocessing
```python
texts = data.full_text.values
tokenizer.fit_on_texts(texts)
texts = tokenizer.texts_to_sequences(data.full_text.values)
max_len = max([len(i) for i in texts])
padded = pad_sequences(texts, maxlen=max_len, padding="post")
```
Fitting the tokenizer, converting each essay to a sequence of integer word indices, and padding all sequences to the same length.
### Preparing Features and Labels
```python
y = to_categorical(data.score.values, num_classes=7)
scaler = MinMaxScaler()
# MinMaxScaler expects shape (n_samples, n_features); the original passed
# [data.word_num.values] (shape (1, n)), which scales every count to 0.
X = scaler.fit_transform(data.word_num.values.reshape(-1, 1))
```
- **Labels (`y`)**: One-hot encoding the scores (`num_classes=7` accommodates integer scores up to 6).
- **Features (`X`)**: Min-max scaling the word counts to the range [0, 1].

### Model Definition
```python
def create_model():
    # Text branch: embedding + stacked LSTMs over the padded sequences
    input_lstm = Input(shape=(max_len,))
    text_ai_input = Embedding(input_dim=10000, input_length=max_len, output_dim=128)(input_lstm)
    text_ai_lstm_1 = LSTM(128, return_sequences=True)(text_ai_input)
    text_ai_dr1 = Dropout(0.3)(text_ai_lstm_1)
    text_ai_lstm2 = LSTM(128)(text_ai_dr1)
    text_ai_dr2 = Dropout(0.2)(text_ai_lstm2)
    text_ai_dense = Dense(128, activation="relu")(text_ai_dr2)

    # Word-count branch: small dense stack over the scaled count feature
    linear_model_input = Input(shape=(1,))
    linear_model_dense = Dense(128, activation="relu")(linear_model_input)
    linear_model_dr = Dropout(0.2)(linear_model_dense)
    linear_model_dense2 = Dense(128, activation="relu")(linear_model_dr)
    linear_model_dr2 = Dropout(0.2)(linear_model_dense2)
    linear_model_dense3 = Dense(128, activation="relu")(linear_model_dr2)

    # Merge both branches and classify into 7 score classes
    concated_layer = Concatenate()([text_ai_dense, linear_model_dense3])
    concated_dense = Dense(256, activation="relu")(concated_layer)
    concated_dr = Dropout(0.2)(concated_dense)
    concated_dense2 = Dense(128)(concated_dr)
    out = Dense(7, activation="softmax")(concated_dense2)

    model = Model(inputs=[input_lstm, linear_model_input], outputs=out)
    model.summary()
    return model
```
Creating a combined neural network with two branches:
- **Text branch**: LSTM layers processing the padded essay sequences.
- **Word-count branch**: Dense layers processing the scaled word-count feature.
- **Concatenation**: Merging both branches ahead of the 7-way softmax output.

### Model Training
```python
model = create_model()
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
txt_train, txt_test, X_train, X_test, y_train, y_test = train_test_split(padded, X, y, train_size=0.8, random_state=42)
cp = ModelCheckpoint(filepath="essay", save_best_only=True, mode="min", monitor="val_loss")
model.fit([txt_train, X_train], y_train, epochs=100, validation_data=([txt_test,X_test], y_test), callbacks=[cp])
```
- **Model Compilation**: Categorical crossentropy loss with the Adam optimizer and accuracy as the metric.
- **Data Splitting**: An 80/20 train/test split with a fixed `random_state` for reproducibility.
- **Model Checkpoint**: Saving the model with the lowest validation loss to the `essay` path.
- **Model Training**: Training for up to 100 epochs, with the test split used for validation.

### Model Evaluation and Saving
```python
h = model.history
history_df = pd.DataFrame(h.history)
history_df.plot(title="Model Saved on 11th epoch")
```
Plotting the training curves; per the title, the best checkpoint was saved at the 11th epoch.
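Since `ModelCheckpoint` wrote the best weights to the `essay` path, a natural follow-up (not shown in the notebook) would be to reload that checkpoint and evaluate it on the held-out split; this assumes the TF-Keras SavedModel format that a bare `essay` path produces:
```python
from keras.models import load_model

# Reload the checkpoint with the lowest validation loss and evaluate it
best_model = load_model("essay")
loss, acc = best_model.evaluate([txt_test, X_test], y_test, verbose=0)
print(f"best checkpoint - val loss: {loss:.4f}, accuracy: {acc:.4f}")
```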
```python
## REMEMBER
# Save tokenizer
# Save MinMaxScaler
# Save model
```
A reminder to save the tokenizer, MinMaxScaler, and model for later inference.
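A minimal sketch of those three save steps, assuming the `tokenizer`, `scaler`, and `model` objects from above (file names are illustrative):
```python
import pickle

import joblib

# Persist the fitted tokenizer so new essays are tokenized identically
with open("tokenizer.pkl", "wb") as f:
    pickle.dump(tokenizer, f)

# Persist the fitted MinMaxScaler for the word-count feature
joblib.dump(scaler, "scaler.joblib")

# Persist the trained model (architecture + weights)
model.save("essay_model.h5")
```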
### Summary
This notebook provides an end-to-end pipeline for predicting essay scores from both the essay text and a word-count feature: data loading, cleaning, exploratory analysis, outlier filtering, tokenization and padding, a two-branch Keras model (LSTMs over the text, dense layers over the scaled word count), training with checkpointing, and a training-history plot. The closing comments note that the tokenizer, scaler, and model still need to be saved for inference.
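For completeness, here is a hedged sketch of how those saved artifacts could be used to score a new essay at inference time. The file names, helper function, and cleaning step mirror the training code but are assumptions, not part of the repository; `model.input_shape[0][1]` assumes the two-branch model above with the text input first.
```python
import pickle
import re

import joblib
import numpy as np
from keras.models import load_model
from keras.utils import pad_sequences

def score_essay(text: str) -> int:
    """Score one essay with the saved tokenizer, scaler, and model (hypothetical paths)."""
    tokenizer = pickle.load(open("tokenizer.pkl", "rb"))
    scaler = joblib.load("scaler.joblib")
    model = load_model("essay_model.h5")

    # Mirror the training preprocessing: strip non-alphanumerics, tokenize,
    # and pad to the text branch's input length
    cleaned = re.sub(r"[^a-zA-Z0-9 ]", "", text)
    max_len = model.input_shape[0][1]
    seq = pad_sequences(tokenizer.texts_to_sequences([cleaned]),
                        maxlen=max_len, padding="post")
    word_count = scaler.transform([[len(cleaned.split())]])

    # Pick the score class with the highest softmax probability
    probs = model.predict([seq, word_count], verbose=0)
    return int(np.argmax(probs, axis=1)[0])

print(score_essay("This is a short sample essay about technology."))
```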