https://github.com/0xnan/carbonaraextractor
Product specifications extractor driven by a DOM Classifier.
https://github.com/0xnan/carbonaraextractor
camera dom extractor keras neural-network scraping tensorflow
Last synced: about 1 year ago
JSON representation
Product specifications extractor driven by a DOM Classifier.
- Host: GitHub
- URL: https://github.com/0xnan/carbonaraextractor
- Owner: 0xNaN
- Created: 2018-07-17T14:15:50.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T02:16:57.000Z (over 3 years ago)
- Last Synced: 2025-04-25T11:17:36.804Z (about 1 year ago)
- Topics: camera, dom, extractor, keras, neural-network, scraping, tensorflow
- Language: Jupyter Notebook
- Homepage:
- Size: 600 KB
- Stars: 5
- Watchers: 2
- Forks: 4
- Open Issues: 8
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
## Carbonara Extractor :spaghetti:
Product specifications extractor driven by a DOM Classifier.
_Note: this is an old PoC developed to demonstrate how tables and lists of the DOM can be classified before being extracted._
## how to run
To install dependencies:
python3.7 -m venv venv
source ./venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
To extract information from an URL, simple run:
> python -m carbonaraextractor "https://www.dpreview.com/products/canon/slrs/canon_eosm50"
...
...
...
> cat result.json
{
"Articulated LCD": "Fully articulated",
"Body type": "SLR-style mirrorless",
"Dimensions": "116 x 88 x 59 mm (4.57 x 3.46 x 2.32″)",
"Effective pixels": "24 megapixels",
"Focal length mult.": "1.6×",
"Format": "MPEG-4, H.264",
"GPS": "None",
"ISO": "Auto, 100-25600 (expands to 51200)",
"Lens mount": "Canon EF-M",
"Max resolution": "6000 x 4000",
"Max shutter speed": "1/4000 sec",
"Screen dots": "1,040,000",
"Screen size": "3″",
"Sensor size": "APS-C (22.3 x 14.9 mm)",
"Sensor type": "CMOS",
"Storage types": "SD/SDHC/SDXC slot (UHS-I compatible)",
"USB": "USB 2.0 (480 Mbit/sec)",
"Weight (inc. batteries)": "390 g (0.86 lb / 13.76 oz)",
"__unstructured": []
}
The standard output shows information about every Table and List found on the specified URL with their "relevance" scores given by the classifiers. Every red row is a table/list that the classifier has tagged as *unrelevant*. Every green row is a table/list that the classifier think is *relevant* about the trained domain. The relevant content is then parsed as `` pairs and reported inside the file `result.json`.
## classifiers
The project uses two simple classifiers trained on the "Camera" domains, implemented with Keras and saved inside `models/`.
The notebook `notebooks/train_classifiers` show the process of traning/testing/saving of the models.
The datasets used are inside the `data` folder which contains:
1. `list.csv`: features values extracted for relevant/not_relevant lists. The features are extracted from a corpus of webpages not shared here.
2. `table.csv`: features values extracted for relevant/not_relevant tables. The features are extracted from a corpus of webpages not shared here.
3. `camera_hot_words.txt`: 200 *stems* of relevant words about the "Cameras" domain.
4. `list_xpath_ground_truth.txt`: xpath for relevant lists for known domains.
5. `table_xpath_ground_truth.txt`: xpath for relevant tables for known domains.