# Clickstream Mining with Decision Trees
The project is based on a task posed in the KDD Cup 2000 competition. It involves mining clickstream data collected from Gazelle.com, a retailer of legwear products. The task is to determine: given a set of page views, will the visitor view another page on the site or leave?

The numerical features in the data set have been discretized by partitioning them into 5 intervals of equal frequency, so every feature takes values from a finite set. These values are mapped to integers to make the data easier to handle.
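
The shipped data is already discretized, but the same equal-frequency binning could be reproduced with pandas along these lines (the file name and column handling here are placeholders for this sketch, not the preprocessing actually used to build the dataset):
```python
import pandas as pd

# Equal-frequency discretization: cut each numeric feature into 5 bins
# holding roughly the same number of rows, labelled 0..4.
# 'features.csv' is a placeholder file name for this sketch.
df = pd.read_csv('features.csv')
for col in df.columns:
    # duplicates='drop' merges bins when one value spans several quantiles,
    # so a heavily repeated value may yield fewer than 5 bins.
    df[col] = pd.qcut(df[col], q=5, labels=False, duplicates='drop')
```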

## Implementation details
Implemented the [ID3 algorithm](https://en.wikipedia.org/wiki/ID3_algorithm) for decision trees with a chi-square stopping criterion. The code structure mirrors scikit-learn's classification models, with similar methods such as `model.fit()`, `model.save()` and `model.predict()`.
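
As a rough illustration, the chi-square test applied at each candidate split might look like the sketch below (the function name and signature are assumptions for this example, not the repository's actual code):
```python
from scipy.stats import chi2


def split_is_significant(child_counts, p_threshold):
    """Chi-square significance test for a candidate split.

    child_counts: list of (n_positive, n_negative) pairs, one per child node.
    Returns True if the split's p-value falls below p_threshold, i.e. the
    split is statistically significant and the node should be expanded.
    """
    total_pos = float(sum(pos for pos, neg in child_counts))
    total_neg = float(sum(neg for pos, neg in child_counts))
    total = total_pos + total_neg

    statistic = 0.0
    for pos, neg in child_counts:
        size = pos + neg
        expected_pos = total_pos * size / total  # expected counts under the
        expected_neg = total_neg * size / total  # no-association null hypothesis
        if expected_pos > 0:
            statistic += (pos - expected_pos) ** 2 / expected_pos
        if expected_neg > 0:
            statistic += (neg - expected_neg) ** 2 / expected_neg

    dof = len(child_counts) - 1        # degrees of freedom
    p_value = chi2.sf(statistic, dof)  # survival function = 1 - CDF
    return p_value < p_threshold
```
With a threshold of 1.0 the test always passes, so the full tree is grown, which matches the results discussed below.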

## Usage
Run the following command to build a decision tree with significance threshold 0.01 on `train.csv`, evaluate it on `test.csv`, write predictions to `output.csv`, and save the tree to `tree.pkl`.
```bash
python q1_classifier.py -p 0.01 -f1 train.csv -f2 test.csv -o output.csv -t tree.pkl
```
The above command prints the number of leaf nodes and the number of internal nodes in the tree, along with the prediction accuracy. The decision tree is saved to the `tree.pkl` pickle file, while the predictions are stored in `output.csv` in the working directory. The accuracy can also be printed for a given `tree.pkl` and `output.csv` using the `autograder_basic.py` script:
```bash
python autograder_basic.py
```
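
Programmatically, the scikit-learn-style interface mentioned above could be used along these lines. Only the `fit`/`save`/`predict` method names come from this README; the class name, import path, and constructor argument below are hypothetical placeholders:
```python
import pandas as pd
from q1_classifier import DecisionTreeModel  # hypothetical class name and import

# Load the (already discretized) training and test data.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

model = DecisionTreeModel(p_value=0.05)  # chi-square significance threshold
model.fit(train)                         # grow the ID3 tree
model.save('tree.pkl')                   # pickle the trained tree
predictions = model.predict(test)
```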

## Results
The accuracy, number of internal nodes, and number of leaf nodes for different p-value thresholds (significance levels) are reported in the table below.

|p-value|Accuracy   |# Internal Nodes|# Leaf Nodes|
|-------|-----------|----------------|------------|
|0.01   |0.75216    |26              |105         |
|0.05   |**0.75324**|35              |141         |
|1.00   |0.74736    |187             |749         |

As the table shows, accuracy peaks at 0.75324 for a p-value threshold of 0.05. The threshold controls the chi-square stopping criterion: a split is made only if its chi-square test yields a p-value below the threshold, so a smaller threshold demands stronger statistical evidence and stops tree growth earlier. Consequently, the number of internal nodes and leaf nodes grows as the threshold increases, and at a threshold of 1.00 the entire tree is generated without any pruning.

The full tree overfits, effectively memorizing the training examples: the model performs extremely well on the training set but generalizes poorly to the test set. Conversely, a very small threshold prunes the tree too early, causing underfitting and again lowering accuracy. A threshold of 0.05 strikes the balance, so the model neither overfits nor underfits the training data and achieves the best test accuracy. The resulting tree is also much smaller (35 internal and 141 leaf nodes versus 187 and 749 for the full tree), consuming less space.