# Clickstream Mining with Decision Trees
The project is based on a task posed in the KDD Cup 2000 competition. It involves mining clickstream data collected from Gazelle.com, a retailer of legwear products. The task is to determine: given a set of page views, will the visitor view another page on the site or leave?

The numerical features in the data set have been discretized by partitioning them into 5 intervals of equal frequency, so every feature takes values from a finite set. These values are mapped to integers to make the data easier to handle.
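
The shipped data is already discretized, but the same equal-frequency binning could be reproduced with pandas along these lines (the file name and column handling here are placeholders for this sketch, not the preprocessing actually used to build the dataset):
```python
import pandas as pd

# Equal-frequency discretization: cut each numeric feature into 5 bins
# holding roughly the same number of rows, labelled 0..4.
# 'features.csv' is a placeholder file name for this sketch.
df = pd.read_csv('features.csv')
for col in df.columns:
    # duplicates='drop' merges bins when one value spans several quantiles,
    # so a heavily repeated value may yield fewer than 5 bins.
    df[col] = pd.qcut(df[col], q=5, labels=False, duplicates='drop')
```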

## Implementation details
Implemented the [ID3 algorithm](https://en.wikipedia.org/wiki/ID3_algorithm) for decision trees with a chi-square stopping criterion. The code structure mirrors scikit-learn's classification models, with similar methods such as `model.fit()`, `model.save()` and `model.predict()`.
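
As a rough illustration, the chi-square test applied at each candidate split might look like the sketch below (the function name and signature are assumptions for this example, not the repository's actual code):
```python
from scipy.stats import chi2


def split_is_significant(child_counts, p_threshold):
    """Chi-square significance test for a candidate split.

    child_counts: list of (n_positive, n_negative) pairs, one per child node.
    Returns True if the split's p-value falls below p_threshold, i.e. the
    split is statistically significant and the node should be expanded.
    """
    total_pos = float(sum(pos for pos, neg in child_counts))
    total_neg = float(sum(neg for pos, neg in child_counts))
    total = total_pos + total_neg

    statistic = 0.0
    for pos, neg in child_counts:
        size = pos + neg
        expected_pos = total_pos * size / total  # expected counts under the
        expected_neg = total_neg * size / total  # no-association null hypothesis
        if expected_pos > 0:
            statistic += (pos - expected_pos) ** 2 / expected_pos
        if expected_neg > 0:
            statistic += (neg - expected_neg) ** 2 / expected_neg

    dof = len(child_counts) - 1        # degrees of freedom
    p_value = chi2.sf(statistic, dof)  # survival function = 1 - CDF
    return p_value < p_threshold
```
With a threshold of 1.0 the test always passes, so the full tree is grown, which matches the results discussed below.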

## Usage
Run the following command to build a decision tree with significance threshold 0.01 on `train.csv`, evaluate it on `test.csv`, write predictions to `output.csv`, and save the tree to `tree.pkl`.
```bash
python q1_classifier.py -p 0.01 -f1 train.csv -f2 test.csv -o output.csv -t tree.pkl
```
The above command prints the number of leaf nodes and the number of internal nodes in the tree, along with the prediction accuracy. The decision tree is saved to the `tree.pkl` pickle file, while the predictions are stored in `output.csv` in the working directory. The accuracy can also be printed for a given `tree.pkl` and `output.csv` using the `autograder_basic.py` script:
```bash
python autograder_basic.py
```
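
Programmatically, the scikit-learn-style interface mentioned above could be used along these lines. Only the `fit`/`save`/`predict` method names come from this README; the class name, import path, and constructor argument below are hypothetical placeholders:
```python
import pandas as pd
from q1_classifier import DecisionTreeModel  # hypothetical class name and import

# Load the (already discretized) training and test data.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

model = DecisionTreeModel(p_value=0.05)  # chi-square significance threshold
model.fit(train)                         # grow the ID3 tree
model.save('tree.pkl')                   # pickle the trained tree
predictions = model.predict(test)
```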

## Results
The accuracy, number of internal nodes, and number of leaf nodes for different p-value thresholds (significance levels) are reported in the table below.

|p-value|Accuracy   |# Internal Nodes|# Leaf Nodes|
|-------|-----------|----------------|------------|
|0.01   |0.75216    |26              |105         |
|0.05   |**0.75324**|35              |141         |
|1.00   |0.74736    |187             |749         |

As the table shows, accuracy peaks at 0.75324 for a p-value threshold of 0.05. The threshold controls the chi-square stopping criterion: a split is made only if its chi-square test yields a p-value below the threshold, so a smaller threshold demands stronger statistical evidence and stops tree growth earlier. Consequently, the number of internal nodes and leaf nodes grows as the threshold increases, and at a threshold of 1.00 the entire tree is generated without any pruning.

The full tree overfits, effectively memorizing the training examples: the model performs extremely well on the training set but generalizes poorly to the test set. Conversely, a very small threshold prunes the tree too early, causing underfitting and again lowering accuracy. A threshold of 0.05 strikes the balance, so the model neither overfits nor underfits the training data and achieves the best test accuracy. The resulting tree is also much smaller (35 internal and 141 leaf nodes versus 187 and 749 for the full tree), consuming less space.