https://github.com/rmodi6/clickstream-mining
Mining clickstream data to predict if a visitor will view another page or leave the website.
- Host: GitHub
- URL: https://github.com/rmodi6/clickstream-mining
- Owner: rmodi6
- License: MIT
- Created: 2019-11-22T04:01:14.000Z
- Default Branch: master
- Last Pushed: 2019-12-16T19:45:10.000Z
- Last Synced: 2025-01-11T08:09:02.070Z
- Topics: artificial-intelligence, chisquare-test, decision-trees, id3-algorithm, machine-learning, python27
- Language: Python
- Homepage:
- Size: 2.59 MB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Clickstream Mining with Decision Trees
The project is based on a task posed in the KDD Cup 2000. It involves mining clickstream data collected from Gazelle.com,
which sold legwear products. The task is to determine: given a set of page views, will the visitor view another page
on the site or leave?
The numerical attributes in the data set have been discretized by partitioning their values into 5 intervals of equal
frequency, so every feature takes values from a finite set. These values are in turn mapped to integers, which makes
the data easier to handle.
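For illustration (this preprocessing was done by the task organizers, not by this code), equal-frequency binning can be reproduced with pandas; the column name below is hypothetical:

```python
import pandas as pd

# Hypothetical example: discretize one numerical column into 5
# equal-frequency bins and map each bin to an integer code 0..4.
df = pd.DataFrame({"session_length": [3, 7, 1, 12, 5, 9, 2, 8, 4, 11]})
df["session_length_binned"] = pd.qcut(
    df["session_length"], q=5, labels=False  # labels=False returns integer codes
)
print(df)
```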
## Implementation details
Implemented the [ID3 algorithm](https://en.wikipedia.org/wiki/ID3_algorithm) for decision trees with a chi-square stopping criterion. The code structure closely follows scikit-learn's classification models, exposing similar methods such as `model.fit()`, `model.save()` and `model.predict()`.
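A minimal sketch of the two computations at the core of this approach — information gain for choosing the split attribute, and the chi-square test that decides whether a candidate split is statistically significant. Function names and details here are illustrative, not taken from the repository:

```python
import numpy as np
from scipy.stats import chi2

def entropy(y):
    """Shannon entropy of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(x, y):
    """Entropy reduction from splitting the labels y on feature column x."""
    gain = entropy(y)
    for v in np.unique(x):
        mask = x == v
        gain -= mask.mean() * entropy(y[mask])
    return gain

def split_is_significant(x, y, p_threshold):
    """Chi-square test of independence between a candidate split and the
    labels; the tree grows at this node only if the split is significant."""
    values, classes = np.unique(x), np.unique(y)
    observed = np.array([[np.sum((x == v) & (y == c)) for c in classes]
                         for v in values], dtype=float)
    expected = (observed.sum(axis=1, keepdims=True)
                * observed.sum(axis=0) / observed.sum())
    stat = np.sum((observed - expected) ** 2
                  / np.where(expected == 0, 1, expected))
    dof = (len(values) - 1) * (len(classes) - 1)
    if dof == 0:  # a single feature value or a single class: nothing to test
        return False
    return chi2.sf(stat, dof) <= p_threshold  # small p-value => significant
```

A lower `p_threshold` makes the test harder to pass, so tree growth stops sooner — the behaviour discussed in the results below.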
## Usage
Run the following command to create a decision tree for the given dataset:
```bash
python q1_classifier.py -p 0.01 -f1 train.csv -f2 test.csv -o output.csv -t tree.pkl
```
The above command prints the number of leaf nodes and internal nodes created in the tree, along with the prediction accuracy. The decision tree is saved as a pickle dump in `tree.pkl`, while the predictions are stored in `output.csv` in the working directory. The accuracy can also be printed for a given `tree.pkl` and `output.csv` using the `autograder_basic.py` script:
```bash
python autograder_basic.py
```
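The check the autograder performs can be approximated in a few lines; the file layout assumed below (predictions as a single column in `output.csv`, true labels in the last column of `test.csv`) is a guess, not taken from the repository:

```python
import pandas as pd

# Assumed layout: output.csv holds one predicted label per row,
# and the true labels live in the last column of test.csv.
pred = pd.read_csv("output.csv", header=None).iloc[:, 0]
true = pd.read_csv("test.csv", header=None).iloc[:, -1]
accuracy = (pred.values == true.values).mean()
print(f"Accuracy: {accuracy:.5f}")
```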
## Results
The accuracy, number of internal nodes and number of leaf nodes for different p values (significance thresholds) are reported in the table below.

|P Value|Accuracy |# Internal Nodes|# Leaf Nodes|
|-------|-------------|----------------|------------|
|0.01 |0.75216 |26 |105 |
|0.05 |**0.75324** |35 |141 |
|1.00 |0.74736 |187 |749 |

As the table shows, accuracy peaks at 0.75324 for a p value of 0.05. The p value is the significance threshold of the chi-square stopping criterion: a smaller p value makes the test harder to pass, so tree growth stops earlier. As the p value increases, the numbers of internal and leaf nodes grow, and at p value 1 the entire tree is generated without any pruning. This leads to over-fitting: the tree effectively memorizes the training examples, performing extremely well on the training set but poorly on the test set. Conversely, a very low p value prunes the tree too early, causing under-fitting and again lowering accuracy. A p value of 0.05 strikes the balance, generalizing better than either extreme while also producing a smaller tree that consumes less space.
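To reproduce a table like this, one can sweep the significance threshold and rerun the classifier for each value; a small driver along these lines (the file names and flags follow the usage example above):

```python
import subprocess

# Sweep the chi-square significance threshold and rerun the classifier,
# writing a separate predictions file and tree pickle per p value.
for p in ["0.01", "0.05", "1.00"]:
    subprocess.run(
        ["python", "q1_classifier.py",
         "-p", p,
         "-f1", "train.csv", "-f2", "test.csv",
         "-o", f"output_{p}.csv", "-t", f"tree_{p}.pkl"],
        check=True,
    )
```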