Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/hayatiyrtgl/automated_essay_scoring

artificial-intelligence essay essayscoring keras linear-models lstm-neural-networks machine-learning python tensorflow


README

Here's a detailed analysis of the provided code:

### Imports and Dependencies

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression    # imported but unused below
from sklearn.ensemble import RandomForestClassifier    # imported but unused below
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical, pad_sequences
from keras.layers import Input, Embedding, LSTM, Dropout, Dense, Concatenate
from keras.models import Model, Sequential
from keras.preprocessing.text import Tokenizer
from keras.callbacks import ModelCheckpoint
import matplotlib.pyplot as plt
import seaborn as sns
```

The code uses several libraries:
- **Pandas** and **NumPy** for data manipulation.
- **Scikit-learn** for preprocessing and machine learning models.
- **Keras** (part of TensorFlow) for building and training neural networks.
- **Matplotlib** and **Seaborn** for data visualization.
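Note that `pad_sequences` has moved between Keras releases (it lives in `keras.utils` from roughly Keras 2.9 onward and in `keras.preprocessing.sequence` before that), so a version-tolerant import may be useful; a minimal sketch:

```python
# Version-tolerant import: keras.utils in Keras >= 2.9,
# keras.preprocessing.sequence in older releases
try:
    from keras.utils import pad_sequences
except ImportError:
    from keras.preprocessing.sequence import pad_sequences
```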

### Data Loading

```python
dataset = pd.read_csv("../dataset/learning-agency-lab-automated-essay-scoring-2/train.csv")
```

The training split of the Learning Agency Lab Automated Essay Scoring 2.0 dataset is loaded from a local CSV file.

### Data Exploration and Preprocessing

1. **Basic Information**
```python
dataset.head()
dataset.dtypes
dataset.isna().sum()
```

Displaying the first few rows, inspecting the column data types, and checking for missing values.

2. **Text Cleaning**
```python
dataset.full_text = dataset.full_text.replace("[^a-zA-Z0-9 ]", "",regex=True)
```

Removing every character that is not a letter, digit, or space from the essay text.
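To see what the pattern does, here is a quick check on a hypothetical sample string; note that stripping punctuation and newlines can fuse adjacent words together:

```python
import re

# Hypothetical sample: punctuation and the newline are removed,
# so "they?\n(Score" collapses into "theyScore"
sample = "Well-written essays, aren't they?\n(Score: 5)"
print(re.sub(r"[^a-zA-Z0-9 ]", "", sample))
# -> Wellwritten essays arent theyScore 5
```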

3. **Tokenization**
```python
tokenizer = Tokenizer(num_words=10000)
```

Initializing a tokenizer that keeps only the 10,000 most frequent words; it is fitted on the essay texts later, in the text-preprocessing step.
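A small illustration of how `num_words` behaves (toy corpus, not from the dataset): the tokenizer records the full vocabulary in `word_index`, but `texts_to_sequences` emits indices only for the most frequent words.

```python
# Toy corpus: with num_words=3, only indices 1 and 2 survive in the
# output sequences (index 0 is reserved for padding)
toy = Tokenizer(num_words=3)
toy.fit_on_texts(["the cat sat", "the cat ran", "the dog ran"])
print(toy.word_index)                           # full vocabulary: {'the': 1, 'cat': 2, ...}
print(toy.texts_to_sequences(["the dog sat"]))  # -> [[1]]; 'dog' and 'sat' are dropped
```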

4. **Word Count Feature**
```python
def apply_func(x):
    splitted = x.split()
    return len(splitted)

dataset["word_num"] = dataset.full_text.apply(apply_func)
```

Creating a new column that counts the number of words in each essay.
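The same column can be built without a helper function, using pandas string methods:

```python
# Vectorized equivalent of the apply-based word count
dataset["word_num"] = dataset.full_text.str.split().str.len()
```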

### Exploratory Data Analysis (EDA)

```python
dataset.describe()
sns.countplot(data=dataset, x="score")
sns.boxenplot(data=dataset, x="score", y="word_num")
sns.regplot(data=dataset, x="score", y="word_num")
```

Basic statistical descriptions and visualizations:
- **Count plot** of scores.
- **Boxen plot** and **regression plot** to explore the relationship between word count and score.

### Data Filtering

```python
data = dataset.copy()
data = data.sort_values(by="word_num", ascending=False)
ninety_nine = round(len(data) * 0.01)  # number of rows in the top 1% by word count
data = data.iloc[ninety_nine:, :]      # drop those longest essays
data = data.sort_index().reset_index()
data.drop("index", axis=1, inplace=True)
```

Filtering out the top 1% of essays with the highest word counts to remove outliers.
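A roughly equivalent filter (a sketch, not the author's code) uses a quantile threshold directly, which avoids the sort-and-slice steps:

```python
# Keep essays at or below the 99th percentile of word counts;
# ties at the threshold may make the cut differ slightly from the original
cutoff = dataset.word_num.quantile(0.99)
data = dataset[dataset.word_num <= cutoff].reset_index(drop=True)
```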

### Text Preprocessing

```python
texts = data.full_text.values
tokenizer.fit_on_texts(texts)
texts = tokenizer.texts_to_sequences(data.full_text.values)
max_len = max([len(i) for i in texts])
padded = pad_sequences(texts, maxlen=max_len, padding="post")
```

Converting texts to sequences of integers and padding them to the same length.
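A toy example of what post-padding produces:

```python
# Shorter sequences are filled with zeros on the right up to maxlen
print(pad_sequences([[5, 3], [7, 2, 9, 1]], maxlen=4, padding="post"))
# [[5 3 0 0]
#  [7 2 9 1]]
```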

### Preparing Features and Labels

```python
y = to_categorical(data.score.values, num_classes=7)
scaler = MinMaxScaler()
# MinMaxScaler expects shape (n_samples, n_features), so reshape the counts
# to a column vector before scaling across essays
X = scaler.fit_transform(data.word_num.values.reshape(-1, 1))
```

- **Labels (`y`)**: Converting the integer scores to one-hot vectors of length 7.
- **Features (`X`)**: Scaling word counts into the [0, 1] range. The counts are reshaped into a column vector first, so that `MinMaxScaler` scales across essays rather than treating each essay as a separate one-sample feature (which would collapse every value to zero).
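Assuming the scores run from 1 to 6 (which `num_classes=7` suggests, with index 0 left unused), `to_categorical` produces one-hot rows like these:

```python
# Toy illustration: integer scores become one-hot vectors of length 7
print(to_categorical([1, 6], num_classes=7))
# [[0. 1. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0. 0. 1.]]
```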

### Model Definition

```python
def create_model():
    # text branch: embedding + stacked LSTMs over the padded token sequences
    input_lstm = Input(shape=(max_len,))
    text_ai_input = Embedding(input_dim=10000, input_length=max_len, output_dim=128)(input_lstm)
    text_ai_lstm_1 = LSTM(128, return_sequences=True)(text_ai_input)
    text_ai_dr1 = Dropout(0.3)(text_ai_lstm_1)
    text_ai_lstm2 = LSTM(128)(text_ai_dr1)
    text_ai_dr2 = Dropout(0.2)(text_ai_lstm2)
    text_ai_dense = Dense(128, activation="relu")(text_ai_dr2)

    # numeric branch: dense layers over the single word-count feature
    linear_model_input = Input(shape=(1,))
    linear_model_dense = Dense(128, activation="relu")(linear_model_input)
    linear_model_dr = Dropout(0.2)(linear_model_dense)
    linear_model_dense2 = Dense(128, activation="relu")(linear_model_dr)
    linear_model_dr2 = Dropout(0.2)(linear_model_dense2)
    linear_model_dense3 = Dense(128, activation="relu")(linear_model_dr2)

    # merge both branches and classify into 7 score classes
    concated_layer = Concatenate()([text_ai_dense, linear_model_dense3])
    concated_dense = Dense(256, activation="relu")(concated_layer)
    concated_dr = Dropout(0.2)(concated_dense)
    concated_dense2 = Dense(128)(concated_dr)
    out = Dense(7, activation="softmax")(concated_dense2)

    model = Model(inputs=[input_lstm, linear_model_input], outputs=out)
    model.summary()
    return model
```

Creating a two-branch neural network:
- **Text branch**: an embedding layer followed by stacked LSTM layers over the padded token sequences.
- **Numeric branch** (called the "linear model" in the code): dense layers over the single word-count feature.
- **Concatenation**: merging the two branch outputs and passing them through further dense layers to a 7-way softmax.
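A quick way to sanity-check the two-input wiring is to run a hypothetical dummy batch through the model and confirm the output shape (a sketch, not part of the original script):

```python
# Dummy batch of 4 "essays": random token ids plus random scaled word counts.
# The model should emit one 7-way probability vector per essay.
m = create_model()
dummy_text = np.random.randint(0, 10000, size=(4, max_len))
dummy_count = np.random.rand(4, 1)
print(m.predict([dummy_text, dummy_count]).shape)  # -> (4, 7)
```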

### Model Training

```python
model = create_model()
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# split all three arrays with the same 80/20 partition
txt_train, txt_test, X_train, X_test, y_train, y_test = train_test_split(
    padded, X, y, train_size=0.8, random_state=42)

# keep only the model state with the lowest validation loss
cp = ModelCheckpoint(filepath="essay", save_best_only=True, mode="min", monitor="val_loss")
model.fit([txt_train, X_train], y_train, epochs=100,
          validation_data=([txt_test, X_test], y_test), callbacks=[cp])
```

- **Model Compilation**: Using categorical crossentropy loss and Adam optimizer.
- **Data Splitting**: Splitting data into training and testing sets.
- **Model Checkpoint**: Saving the best model based on validation loss.
- **Model Training**: Training the model for 100 epochs.
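Since the checkpoint already tracks validation loss, one optional refinement (not in the original code) is to stop training early when validation loss plateaus rather than always running 100 epochs:

```python
# Hypothetical addition: halt after 5 epochs without val_loss improvement
# and restore the best weights seen so far
from keras.callbacks import EarlyStopping

es = EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
model.fit([txt_train, X_train], y_train, epochs=100,
          validation_data=([txt_test, X_test], y_test), callbacks=[cp, es])
```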

### Model Evaluation and Saving

```python
h = model.history
history_df = pd.DataFrame(h.history)
history_df.plot(title="Model Saved on 11th epoch")
```

Plotting the training history.

```python
## REMEMBER
# Save tokenizer
# Save MinMaxScaler
# Save model
```

Reminder to save the tokenizer, MinMaxScaler, and model.
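One way to act on that reminder (a sketch; the file names are illustrative, not from the original code):

```python
import pickle

# Persist the preprocessing objects so inference can reproduce
# training-time tokenization and scaling
with open("tokenizer.pkl", "wb") as f:
    pickle.dump(tokenizer, f)
with open("scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)

# Persist the trained model (".keras" on recent Keras; ".h5" on older versions)
model.save("essay_model.keras")
```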

### Summary

This code implements a complete pipeline for predicting essay scores from both the essay text and its word count: data loading, text cleaning, exploratory analysis, tokenization and padding, and training of a two-branch neural network that combines LSTM layers for the text with dense layers for the word-count feature. The remaining step, flagged in the closing reminder, is to persist the tokenizer, scaler, and trained model for later inference.