{"id":19505563,"url":"https://github.com/bogdaaamn/url-classifier-thesis","last_synced_at":"2026-03-01T08:35:20.674Z","repository":{"id":131320067,"uuid":"430749608","full_name":"bogdaaamn/url-classifier-thesis","owner":"bogdaaamn","description":"The data, models, methods and finding of Bogdan Covrig's thesis \"URL Product Web Page Classifier\"","archived":false,"fork":false,"pushed_at":"2021-11-22T19:28:02.000Z","size":112,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-02-25T22:15:08.219Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bogdaaamn.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-11-22T14:54:54.000Z","updated_at":"2024-06-30T00:31:24.000Z","dependencies_parsed_at":"2023-09-10T02:31:14.596Z","dependency_job_id":null,"html_url":"https://github.com/bogdaaamn/url-classifier-thesis","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/bogdaaamn/url-classifier-thesis","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bogdaaamn%2Furl-classifier-thesis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bogdaaamn%2Furl-classifier-thesis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bogdaaamn%2Furl-classifier-thesis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bogdaaamn%2Furl-clas
sifier-thesis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bogdaaamn","download_url":"https://codeload.github.com/bogdaaamn/url-classifier-thesis/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bogdaaamn%2Furl-classifier-thesis/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29965407,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-01T06:55:38.174Z","status":"ssl_error","status_checked_at":"2026-03-01T06:53:04.810Z","response_time":124,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-10T22:31:44.280Z","updated_at":"2026-03-01T08:35:20.652Z","avatar_url":"https://github.com/bogdaaamn.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# URL Product Web Page Classifier Thesis\nThe data, models, methods and findings of Bogdan Covrig's thesis titled _\"URL Product Web Page Classifier\"_.\n\n## Data\n- `product_links_annotated.csv`: the product page URLs _(n=2522)_, manually annotated into two categories (`0: non-shopping webpage` and `1: shopping webpage`). The annotation was performed as described in [Mathur _et al._](https://webtransparency.cs.princeton.edu/dark-patterns/) (section 4.2.1). The set contains URLs from:\n    - `0-714`: A. 
Mathur, G. Acar, M. J. Friedman, E. Lucherini, J. Mayer, M. Chetty, and A. Narayanan, [“Dark patterns at scale: Findings from a crawl of 11k shopping websites,”](https://webtransparency.cs.princeton.edu/dark-patterns/) _Proceedings of the ACM on Human-Computer Interaction_, vol. 3, no. CSCW, pp. 1–32, 2019.\n    - `714-2521`: B. Covrig, C. Rosca, C. Goanta _TBD_\n- `product_links_features_labels.csv`: all the features extracted from the product page URLs. The set contains the following columns:\n    - `url`: The full product URL. *Example: https://harrypottershop.com/products/hogwarts-express-ticket-ornament?pr_prod_strat=copurchase\u0026pr_rec_pid=5843216793758\u0026pr_ref_pid=5843215679646\u0026pr_seq=uniform*\n    - `schema`: The scheme of the URL. _Example: https://_\n    - `domain`: The domain name of the URL. _Example: harrypottershop.com_\n    - `path`: The path of the URL. *Example: /products/hogwarts-express-ticket-ornamentpr_prod_strat=copurchase\u0026pr_rec_pid=5843216793758\u0026pr_ref_pid=5843215679646\u0026pr_seq=uniform*\n    - `path_len`: Total length of the path. _Example: 131_\n    - `num_dot`: Number of dots _(.)_ (e.g. file extensions) in the path. _Example: 0_\n    - `num_hyphen`: Number of hyphens _(-)_ (word separators within the sections) in the path. _Example: 3_\n    - `num_slash`: Number of slashes _(/)_ (section separators) in the path. _Example: 2_\n    - `num_hash`: Number of hashes _(#)_ (anchors) in the path. _Example: 0_\n    - `num_param`: Number of equals signs _(=)_ (URL parameters) in the path. _Example: 4_\n    - `contains_product`: Whether one or more _product_-related keywords _(e.g. product, item)_ are present in the path. _Example: 1_\n    - `contains_category`: Whether one or more _category_-related keywords _(e.g. category, collection)_ are present. _Example: 0_\n    - `longest_num`: Length of the longest run of digits (usually an ID) found in the path. _Example: 13_\n    - `label`: The non-product/product label of the URL. 
_Example: 1_\n\n## Models\n- `log-reg-mod.pkl`: Logistic Regression classifier trained on the URLs in the `product_links_annotated.csv` dataset. The classification report of the model, tested on _n=247_ entries of the set, is shown below:\n    ```\n                precision    recall  f1-score   support\n\n            0       0.91      0.89      0.90       128\n            1       0.89      0.91      0.90       119\n\n    accuracy                            0.90       247\n    macro avg       0.90      0.90      0.90       247\n    weighted avg    0.90      0.90      0.90       247\n    ```\n\nUse [pickle](https://docs.python.org/3/library/pickle.html) to load the model; [sklearn](https://scikit-learn.org/stable/install.html) must be installed to use it.\n\n```python\nimport pickle\n# Unpickling the classifier requires scikit-learn to be installed.\nwith open('log-reg-mod.pkl', 'rb') as f:\n    clf = pickle.load(f)\n```\n\n## Notebooks\n_tbd_\n\n## More information\nThe notebooks and publications will follow soon.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbogdaaamn%2Furl-classifier-thesis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbogdaaamn%2Furl-classifier-thesis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbogdaaamn%2Furl-classifier-thesis/lists"}