{"id":20425899,"url":"https://github.com/mohamedhmini/iww","last_synced_at":"2025-10-26T01:20:06.531Z","repository":{"id":43914196,"uuid":"183448936","full_name":"MohamedHmini/iww","owner":"MohamedHmini","description":"AI based web-wrapper for web-content-extraction","archived":false,"fork":false,"pushed_at":"2023-02-06T20:59:56.000Z","size":62116,"stargazers_count":72,"open_issues_count":2,"forks_count":12,"subscribers_count":5,"default_branch":"master","last_synced_at":"2023-03-04T03:10:24.491Z","etag":null,"topics":["ai","data-mining","information-extraction","library","python","web-content-extractor","web-data-extraction","web-mining","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MohamedHmini.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-04-25T14:17:04.000Z","updated_at":"2023-02-15T10:01:32.000Z","dependencies_parsed_at":"2022-09-02T13:30:37.572Z","dependency_job_id":null,"html_url":"https://github.com/MohamedHmini/iww","commit_stats":null,"previous_names":[],"tags_count":null,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MohamedHmini%2Fiww","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MohamedHmini%2Fiww/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MohamedHmini%2Fiww/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MohamedHmini%2Fiww/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MohamedHmini","download_url":"https://codeload.github.com/
MohamedHmini/iww/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224742388,"owners_count":17362232,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","data-mining","information-extraction","library","python","web-content-extractor","web-data-extraction","web-mining","web-scraping"],"created_at":"2024-11-15T07:14:39.425Z","updated_at":"2025-10-26T01:20:01.440Z","avatar_url":"https://github.com/MohamedHmini.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# IWW-IntelliWebWrapper\n\n![](/iww2.png)\u003cbr/\u003e\n[![GitHub license](https://img.shields.io/github/license/Naereen/StrapDown.js.svg)](https://github.com/Naereen/StrapDown.js/blob/master/LICENSE)\n[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)\n[![GitHub version](https://badge.fury.io/gh/Naereen%2FStrapDown.js.svg)](https://github.com/Naereen/StrapDown.js)\n[![Generic badge](https://img.shields.io/badge/docs-passing-\u003cgreen\u003e.svg)](https://shields.io/)\n[![Ask Me Anything !](https://img.shields.io/badge/Ask%20me-anything-1abc9c.svg)](https://GitHub.com/Naereen/ama)\n\n\nan AI-based web-mining library for web content extraction using machine learning algorithms.\n\ncurrently, the library offers the following functionality and algorithms:\n\n  - DOM extractor, mapper, reducer and flattening functionality...\n  - DoC (degree of coherence), a Euclidean-distance-based similarity measure.\n  - LD, Lists detector algorithm.\n 
 - MCD, Main content detector algorithm.\n  - MCD algorithms results integrator method.\n  - CETD algorithm.\n  - DOM tags detector script (highlighting the chosen nodes).\n\nP.S : \n   - the documentation isn't available yet.\n   - LD \u0026 MCD algorithms are to be released as a research article in the near future.\n   - the pip package of iww will be available online as soon as possible.\n\n\n\n## USE CASE EXAMPLE :\n\n### 1- extraction :\n\n```python\nfrom iww.extractor import extractor\nfrom iww.detector import detector\nfrom iww.features_extraction.lists_detector import Lists_Detector as LD\nfrom iww.features_extraction.main_content_detector import MCD\n```\n\n```python\nurl = \"https://www.theiconic.com.au/catalog/?q=kids%20sunglasses\"\njson_file = \"./iconic.json\"\n\nextractor.extract(\n    url = url, \n    destination = json_file\n)\n```\n\n### 2- data exploratory analysis :\n\n```python\nfrom iww.utils.dom_mapper import DOM_Mapper as DM\n\ndm = DM()\ndm.retrieve_DOM_tree(\"./iconic.json\")\nprint(\"total number of nodes : {}\".format(dm.DOM['CETD']['tagsCount']))\n```\n\u003e total number of nodes : 2098\n\n![](iww/test/webpage.PNG)\n\n\n### 3- LD algorithm :\n\n```python\nld = LD()\nld.retrieve_DOM_tree(file_path = \"./iconic.json\")\nld.apply(\n    node = ld.DOM, \n    coherence_threshold= (0.75,1), \n    sub_tags_threshold = 2\n)\nld.update_DOM_tree()\n```\n\n```python\ndetector.detect(\n    input_file = \"./iconic.json\", \n    output_file = \"./iconic_ld.png\",\n    mark_path = \"LISTS.mark\", \n    mark_value = \"1\"\n)\n```\n\n![](iww/test/ld.png)\n\n### 4- MCD algorithm :\n\n```python\nmcd = MCD()\nmcd.retrieve_DOM_tree(\"./iconic.json\")\nmcd.apply(\n    node = mcd.DOM, \n    min_ratio_threshold = 0.0, \n    nbr_nodes_threshold = 1\n)\nmcd.update_DOM_tree()\n```\n\n```python\ndetector.detect(\n    input_file = \"./iconic.json\", \n    output_file = \"./iconic_mcd.png\",\n    mark_path = \"MCD.mark\", \n    mark_value = 
\"1\"\n)\n```\n\n![](iww/test/mcd.png)\n\n### 5- LD/MCD integration (main list detection) :\n\n```python\nmcd.integrate_other_algorithms_results(\n    node = mcd.DOM, \n    nbr_nodes = 1,\n    mode = \"ancestry\", \n    condition_features = [(\"LISTS.mark\",\"1\")])\n\nmcd.update_DOM_tree()\n```\n\n```python\ndetector.detect(\n    input_file = \"./iconic.json\", \n    output_file = \"./iconic_main_list.png\",\n    mark_path = \"MCD.main_node\", \n    mark_value = \"1\"\n)\n```\n\n![](iww/test/main_list.png)\n\n\n## License\n[MIT](https://choosealicense.com/licenses/mit/)\n\n**MOHAMED-HMINI 2019**\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmohamedhmini%2Fiww","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmohamedhmini%2Fiww","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmohamedhmini%2Fiww/lists"}