https://github.com/kevalmorabia97/cova-web-object-detection
A Context-aware Visual Attention-based training pipeline for Object Detection from a Webpage screenshot!
- Host: GitHub
- URL: https://github.com/kevalmorabia97/cova-web-object-detection
- Owner: kevalmorabia97
- License: apache-2.0
- Created: 2019-10-22T06:11:54.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2025-02-25T09:27:39.000Z (4 months ago)
- Last Synced: 2025-03-31T05:09:05.928Z (2 months ago)
- Topics: attention, computer-vision, convolutional-neural-networks, deep-learning, graph-attention-networks, graph-convolutional-networks, information-extraction, multimodal-learning, object-detection, pytorch, visual-attention
- Language: Python
- Homepage: https://aclanthology.org/2022.ecnlp-1.11/
- Size: 1.4 MB
- Stars: 92
- Watchers: 5
- Forks: 14
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# CoVA: Context-aware Visual Attention for Webpage Information Extraction
## Abstract
Webpage information extraction (WIE) is an important step in creating knowledge bases. For this, classical WIE methods leverage the Document Object Model (DOM) tree of a website. However, use of the DOM tree poses significant challenges because context and appearance are encoded in an abstract manner. To address this challenge, we propose to `reformulate WIE as a context-aware Webpage Object Detection` task. Specifically, we develop a Context-aware Visual Attention-based (CoVA) detection pipeline which combines appearance features with syntactical structure from the DOM tree. To study the approach, we collect a `new large-scale dataset of e-commerce websites` for which we manually annotate every web element with four labels: product price, product title, product image, and others. On this dataset we show that the proposed CoVA approach is a new challenging baseline which improves upon prior state-of-the-art methods.

In Proceedings of The Fifth Workshop on e-Commerce and NLP (ECNLP 5), Association for Computational Linguistics, 2022. [Paper](https://aclanthology.org/2022.ecnlp-1.11), [Slides](https://docs.google.com/presentation/d/1K1uebQl4hr0vNGwtFCShX7IPfvqBLEb80mUemXuanb8/edit?usp=sharing), [Video Presentation](https://youtu.be/YDgysgyCPFQ)
## CoVA Dataset
We labeled _7,740_ webpages spanning _408_ domains (Amazon, Walmart, Target, etc.). Each of these webpages contains exactly one labeled price, title, and image. All other web elements are labeled as background. On average, there are _90_ web elements in a webpage.

Webpage screenshots and bounding boxes can be obtained [here](https://drive.google.com/drive/folders/1LcF40ZPrcRAc4RXyIVZGgmGf2LaQzjGK?usp=sharing).
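As a rough illustration of what one annotation looks like, each labeled web element pairs a bounding box with one of the four labels. The record layout below is an assumption for illustration only; the actual files in the linked folder may use a different schema.

```python
from dataclasses import dataclass

# Hypothetical record for one annotated web element; the real dataset
# files may store these fields in a different format.
@dataclass
class WebElement:
    x: int       # left coordinate of the bounding box (pixels)
    y: int       # top coordinate of the bounding box (pixels)
    w: int       # bounding-box width (pixels)
    h: int       # bounding-box height (pixels)
    label: str   # one of "price", "title", "image", "background"

# Example: the single labeled price element on a product page.
price_box = WebElement(x=812, y=340, w=120, h=32, label="price")
```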
### Train-Val-Test split
We create a cross-domain split which ensures that each of the train, val, and test sets contains webpages from different domains. Specifically, we construct a 3 : 1 : 1 split based on the number of distinct domains. We observed that the top-5 domains (based on number of samples) were Amazon, EBay, Walmart, Etsy, and Target. So, we created 5 different splits for 5-fold cross validation such that each of these major domains appears in the test set of exactly one of the 5 splits. These splits can be accessed [here](splits/).
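How the released splits were generated is not described here; the snippet below is only a minimal sketch, assuming each webpage is identified by a `(domain, page_id)` pair, of how a cross-domain 5-fold assignment could be built so that each of the top-5 domains ends up in the test portion of exactly one fold.

```python
import random
from collections import defaultdict

def make_cross_domain_folds(pages, top5_domains, num_folds=5, seed=0):
    """Assign whole domains (never individual pages) to folds.

    `pages` is a list of (domain, page_id) tuples. Each domain in
    `top5_domains` seeds one fold so it is tested exactly once. All names
    and the input format are illustrative assumptions, not the repo's API.
    """
    rng = random.Random(seed)
    domains = sorted({d for d, _ in pages})
    remaining = [d for d in domains if d not in top5_domains]
    rng.shuffle(remaining)

    # Seed each fold with one major domain, then round-robin the rest.
    folds = [[dom] for dom in top5_domains]
    for i, dom in enumerate(remaining):
        folds[i % num_folds].append(dom)

    by_domain = defaultdict(list)
    for dom, page in pages:
        by_domain[dom].append(page)

    # For fold k: its domains form the test set; the other domains are
    # split 3:1 into train and val, giving an overall 3:1:1 ratio.
    splits = []
    for k in range(num_folds):
        test_domains = set(folds[k])
        rest = [d for d in domains if d not in test_domains]
        rng.shuffle(rest)
        cut = (3 * len(rest)) // 4
        splits.append({
            "train": [p for d in rest[:cut] for p in by_domain[d]],
            "val":   [p for d in rest[cut:] for p in by_domain[d]],
            "test":  [p for d in test_domains for p in by_domain[d]],
        })
    return splits
```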
## CoVA End-to-end Training Pipeline

Our Context-Aware Visual Attention-based end-to-end pipeline for Webpage Object Detection (_CoVA_) aims to learn a function _f_ to predict labels _y = [y1, y2, ..., yN]_ for a webpage containing _N_ elements. The input to CoVA consists of:
1. a screenshot of a webpage,
2. a list of bounding boxes _[x, y, w, h]_ of the web elements, and
3. neighborhood information for each element obtained from the DOM tree.

This information is processed in four stages:
1. the graph representation extraction for the webpage,
2. the Representation Network (_RN_),
3. the Graph Attention Network (_GAT_), and
4. a fully connected (_FC_) layer.

The graph representation extraction computes for every web element _i_ its set of _K_ neighboring web elements _Ni_. The _RN_ consists of a Convolutional Neural Network (_CNN_) and a positional encoder aimed at learning a visual representation _vi_ for each web element _i ∈ {1, ..., N}_. The _GAT_ combines the visual representation _vi_ of the web element _i_ to be classified and those of its neighbors, i.e., _vk ∀k ∈ Ni_, to compute the contextual representation _ci_ for web element _i_. Finally, the visual and contextual representations of the web element are concatenated and passed through the _FC_ layer to obtain the classification output.
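The repository contains the actual implementation; the block below is only a compact PyTorch sketch of this four-stage forward pass. The backbone, the feature sizes, the single-layer attention scorer standing in for the GAT, and all names are illustrative assumptions rather than the repository's code.

```python
import torch
import torch.nn as nn
import torchvision

class CoVASketch(nn.Module):
    """Simplified CoVA-style forward pass: RN -> attention over neighbors -> FC.
    Shapes and layer choices are assumptions, not the actual architecture."""

    def __init__(self, visual_dim=256, num_classes=4):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])   # feature map, stride 32
        self.roi_pool = torchvision.ops.RoIAlign(output_size=3, spatial_scale=1 / 32,
                                                 sampling_ratio=2)
        self.pos_enc = nn.Linear(4, 32)                  # encodes [x, y, w, h]
        self.project = nn.Linear(512 * 9 + 32, visual_dim)
        self.attn = nn.Linear(2 * visual_dim, 1)         # GAT-style pairwise scorer
        self.fc = nn.Linear(2 * visual_dim, num_classes)

    def forward(self, screenshot, boxes, neighbors):
        # screenshot: (1, 3, H, W); boxes: (N, 4) float tensor as [x, y, w, h];
        # neighbors: list of K neighbor indices per element, from the DOM tree.
        fmap = self.cnn(screenshot)
        xyxy = torch.cat([boxes[:, :2], boxes[:, :2] + boxes[:, 2:]], dim=1)
        rois = torch.cat([torch.zeros(len(boxes), 1, device=boxes.device), xyxy], dim=1)
        appearance = self.roi_pool(fmap, rois).flatten(1)
        v = self.project(torch.cat([appearance, self.pos_enc(boxes)], dim=1))  # visual v_i

        contexts = []
        for i, nbrs in enumerate(neighbors):             # contextual c_i
            vk = v[nbrs]                                 # (K, D) neighbor representations
            scores = self.attn(torch.cat([v[i].expand_as(vk), vk], dim=1))
            alpha = torch.softmax(scores, dim=0)         # attention over neighbors
            contexts.append((alpha * vk).sum(dim=0))
        c = torch.stack(contexts)

        return self.fc(torch.cat([v, c], dim=1))         # per-element class logits
```

Note that the single linear scorer above only stands in for the multi-head Graph Attention Network described in the paper.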

## Experimental Results

Cross-domain accuracy (mean ± standard deviation) for 5-fold cross validation.

NOTE: Cross-domain means we train the model on some web domains and test it on completely different domains to evaluate the generalizability of the models to unseen web templates.
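As a sketch of this protocol (`train_fn` and `accuracy_fn` are placeholders for the repository's actual training and evaluation code, and `splits` follows the per-fold structure sketched earlier):

```python
import statistics

def cross_domain_evaluation(splits, train_fn, accuracy_fn):
    """Train on each fold's train/val domains, evaluate on its held-out
    (unseen) test domains, and report mean and standard deviation."""
    accuracies = []
    for fold in splits:
        model = train_fn(fold["train"], fold["val"])
        accuracies.append(accuracy_fn(model, fold["test"]))  # unseen domains only
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```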
## Attention Visualizations!

Attention visualizations, where the red border denotes the web element to be classified and its contexts are shaded green, with intensity denoting the attention score. The price in (a) gets a much higher score than the other contexts. The title and image in (b) are scored higher than the other contexts for the price.
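A minimal sketch of how such a visualization can be produced with matplotlib, assuming you already have the screenshot path, the target element's box, and per-neighbor attention scores (all variable names and inputs here are placeholders, not outputs of this repository):

```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image

def draw_attention(screenshot_path, target_box, neighbor_boxes, scores):
    """Outline the classified element in red and shade its contexts green,
    with opacity proportional to the normalized attention score."""
    img = Image.open(screenshot_path)
    fig, ax = plt.subplots()
    ax.imshow(img)

    max_score = max(scores) or 1.0
    for (x, y, w, h), s in zip(neighbor_boxes, scores):
        ax.add_patch(patches.Rectangle((x, y), w, h, facecolor="green",
                                       edgecolor="none", alpha=0.6 * s / max_score))

    x, y, w, h = target_box
    ax.add_patch(patches.Rectangle((x, y), w, h, fill=False,
                                   edgecolor="red", linewidth=2))
    ax.axis("off")
    plt.show()
```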
## Cite

If you find this useful in your research, please cite our [ACL 2022 Paper](https://aclanthology.org/2022.ecnlp-1.11/) (also on [arXiv](https://arxiv.org/abs/2110.12320)):
```
@inproceedings{kumar-etal-2022-cova,
title = "{C}o{VA}: Context-aware Visual Attention for Webpage Information Extraction",
author = "Kumar, Anurendra and
Morabia, Keval and
Wang, William and
Chang, Kevin and
Schwing, Alex",
booktitle = "Proceedings of The Fifth Workshop on e-Commerce and NLP (ECNLP 5)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.ecnlp-1.11",
pages = "80--90",
abstract = "Webpage information extraction (WIE) is an important step to create knowledge bases. For this, classical WIE methods leverage the Document Object Model (DOM) tree of a website. However, use of the DOM tree poses significant challenges as context and appearance are encoded in an abstract manner. To address this challenge we propose to reformulate WIE as a context-aware Webpage Object Detection task. Specifically, we develop a Context-aware Visual Attention-based (CoVA) detection pipeline which combines appearance features with syntactical structure from the DOM tree. To study the approach we collect a new large-scale datase of e-commerce websites for which we manually annotate every web element with four labels: product price, product title, product image and others. On this dataset we show that the proposed CoVA approach is a new challenging baseline which improves upon prior state-of-the-art methods.",
}
```