{"id":27692073,"url":"https://github.com/0xnan/carbonaraextractor","last_synced_at":"2025-04-25T11:17:41.760Z","repository":{"id":39738841,"uuid":"141299587","full_name":"0xNaN/carbonaraextractor","owner":"0xNaN","description":"Product specifications extractor driven by a DOM Classifier.","archived":false,"fork":false,"pushed_at":"2022-12-08T02:16:57.000Z","size":614,"stargazers_count":5,"open_issues_count":8,"forks_count":4,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-25T11:17:36.804Z","etag":null,"topics":["camera","dom","extractor","keras","neural-network","scraping","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/0xNaN.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-07-17T14:15:50.000Z","updated_at":"2025-01-25T13:44:00.000Z","dependencies_parsed_at":"2023-01-25T03:16:01.528Z","dependency_job_id":null,"html_url":"https://github.com/0xNaN/carbonaraextractor","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0xNaN%2Fcarbonaraextractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0xNaN%2Fcarbonaraextractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0xNaN%2Fcarbonaraextractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0xNaN%2Fcarbonaraextractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/0xNaN","download_url":"https://codeload.github.com/0xNaN/carbonara
extractor/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250805802,"owners_count":21490189,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["camera","dom","extractor","keras","neural-network","scraping","tensorflow"],"created_at":"2025-04-25T11:17:41.268Z","updated_at":"2025-04-25T11:17:41.754Z","avatar_url":"https://github.com/0xNaN.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Carbonara Extractor :spaghetti:\n\nProduct specifications extractor driven by a DOM Classifier.\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=carbonaraextractor.gif width=600 /\u003e\n\u003c/p\u003e\n\n_Note: this is an old PoC developed to demonstrate how tables and lists of the DOM can be classified before being extracted._\n\n## how to run\n\nTo install dependencies:\n\n    python3.7 -m venv venv\n    source ./venv/bin/activate\n    pip install --upgrade pip\n    pip install -r requirements.txt\n\nTo extract information from a URL, simply run:\n\n    \u003e python -m carbonaraextractor \"https://www.dpreview.com/products/canon/slrs/canon_eosm50\"\n    ...\n    ...\n    ...\n    \u003e cat result.json\n    {\n    \"Articulated LCD\": \"Fully articulated\",\n    \"Body type\": \"SLR-style mirrorless\",\n    \"Dimensions\": \"116 x 88 x 59 mm (4.57 x 3.46 x 2.32″)\",\n    \"Effective pixels\": \"24 megapixels\",\n    \"Focal length mult.\": \"1.6×\",\n    \"Format\": \"MPEG-4, H.264\",\n    \"GPS\": \"None\",\n    \"ISO\": \"Auto, 100-25600 (expands to 51200)\",\n    \"Lens 
mount\": \"Canon EF-M\",\n    \"Max resolution\": \"6000 x 4000\",\n    \"Max shutter speed\": \"1/4000 sec\",\n    \"Screen dots\": \"1,040,000\",\n    \"Screen size\": \"3″\",\n    \"Sensor size\": \"APS-C (22.3 x 14.9 mm)\",\n    \"Sensor type\": \"CMOS\",\n    \"Storage types\": \"SD/SDHC/SDXC slot (UHS-I compatible)\",\n    \"USB\": \"USB 2.0 (480 Mbit/sec)\",\n    \"Weight (inc. batteries)\": \"390 g (0.86 lb / 13.76 oz)\",\n    \"__unstructured\": []\n    }\n\nThe standard output shows information about every table and list found at the specified URL, along with the \"relevance\" scores assigned by the classifiers.  Every red row is a table/list that the classifier has tagged as *irrelevant*.  Every green row is a table/list that the classifier considers *relevant* to the trained domain. The relevant content is then parsed as `\u003ckey, value\u003e` pairs and written to the file `result.json`.  \n\n## classifiers\n\nThe project uses two simple classifiers trained on the \"Camera\" domain, implemented with Keras and saved inside `models/`.\n\nThe notebook `notebooks/train_classifiers` shows the process of training, testing, and saving the models.\n\nThe datasets used are inside the `data` folder, which contains:\n\n   1. `list.csv`: feature values extracted for relevant/not_relevant lists. The features are extracted from a corpus of webpages not shared here.\n   2. `table.csv`: feature values extracted for relevant/not_relevant tables. The features are extracted from a corpus of webpages not shared here.\n   3. `camera_hot_words.txt`: 200 *stems* of words relevant to the \"Cameras\" domain.\n   4.  `list_xpath_ground_truth.txt`: XPath expressions for relevant lists on known domains.\n   5.  
`table_xpath_ground_truth.txt`: xpath for relevant tables for known domains.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F0xnan%2Fcarbonaraextractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F0xnan%2Fcarbonaraextractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F0xnan%2Fcarbonaraextractor/lists"}