{"id":19817255,"url":"https://github.com/rmodi6/clickstream-mining","last_synced_at":"2025-02-28T15:04:10.925Z","repository":{"id":147566315,"uuid":"223320458","full_name":"rmodi6/clickstream-mining","owner":"rmodi6","description":"Mining clickstream data to predict if a visitor will view another page or leave the website.","archived":false,"fork":false,"pushed_at":"2019-12-16T19:45:10.000Z","size":2712,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-01-11T08:09:02.070Z","etag":null,"topics":["artificial-intelligence","chisquare-test","decision-trees","id3-algorithm","machine-learning","python27"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rmodi6.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-11-22T04:01:14.000Z","updated_at":"2023-06-22T06:08:50.000Z","dependencies_parsed_at":null,"dependency_job_id":"7995370b-a5af-41aa-b165-e9dfe880dc77","html_url":"https://github.com/rmodi6/clickstream-mining","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rmodi6%2Fclickstream-mining","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rmodi6%2Fclickstream-mining/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rmodi6%2Fclickstream-mining/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/rmodi6%2Fclickstream-mining/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rmodi6","download_url":"https://codeload.github.com/rmodi6/clickstream-mining/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241170702,"owners_count":19921679,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","chisquare-test","decision-trees","id3-algorithm","machine-learning","python27"],"created_at":"2024-11-12T10:12:08.031Z","updated_at":"2025-02-28T15:04:10.893Z","avatar_url":"https://github.com/rmodi6.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Clickstream Mining with Decision Trees\nThe project is based on a task posed in KDD Cup 2000. It involves mining click-stream data collected from Gazelle.com, \nwhich sells legwear products. The task is to determine: given a set of page views, will the visitor view another page \non the site or will they leave?  \nThe data set has discretized numerical values obtained by partitioning them into 5 intervals of equal frequency. \nThis way, we get a data set where each feature has a finite set of values. These values are mapped to \nintegers, so that the data is easier to handle.\n\n## Implementation details\nImplemented the [ID3 algorithm](https://en.wikipedia.org/wiki/ID3_algorithm) for Decision Trees with a chi-square stopping criterion. 
The code structure loosely follows scikit-learn's classifier interface, with methods such as `model.fit()`, `model.save()` and `model.predict()`.\n\n## Usage\nRun the following command to create a decision tree for the given dataset.\n```bash\npython q1_classifier.py -p 0.01 -f1 train.csv -f2 test.csv -o output.csv -t tree.pkl\n```\nThe above command prints the number of leaf nodes and the number of internal nodes created in the tree along with the prediction accuracy. The decision tree is saved in the `tree.pkl` pickle dump file while the predictions are stored in the `output.csv` file in the working directory. The accuracy can also be printed for a given `tree.pkl` and `output.csv` file using the `autograder_basic.py` script:\n```bash\npython autograder_basic.py\n```\n\n## Results\nThe accuracy, number of internal nodes and number of leaf nodes for different p values (significance thresholds) are reported in the table below.\n\n|P Value|Accuracy     |# Internal Nodes|# Leaf Nodes|\n|-------|-------------|----------------|------------|\n|0.01   |0.75216      |26              |105         |\n|0.05   |**0.75324**  |35              |141         |\n|1.00   |0.74736      |187             |749         |\n\nAs the table shows, the accuracy peaks at 0.75324 for a p value of 0.05. The p value is the threshold of the chi-squared stopping criterion for the decision tree: the smaller the p value, the earlier tree generation stops. Hence, as the p value increases, the number of internal nodes and the number of leaf nodes in the decision tree also increase, and at a p value of 1 the entire decision tree is generated without pruning. This results in over-fitting, where the decision tree memorizes the training examples. As a result, the model fits the training dataset very closely but loses accuracy on the test dataset as the p value increases (0.74736 at p = 1.00 versus 0.75324 at p = 0.05). 
With a lower p value, the tree is pruned too early, which results in under-fitting and lower accuracy (0.75216 at p = 0.01). To generalize better, the tree must be pruned so that it neither overfits nor underfits the training data. Among the values tested, a p value of 0.05 strikes this balance and gives the best accuracy. The resulting decision tree is also much smaller than the unpruned tree, so it consumes less space.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frmodi6%2Fclickstream-mining","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frmodi6%2Fclickstream-mining","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frmodi6%2Fclickstream-mining/lists"}