{"id":21241883,"url":"https://github.com/kevalmorabia97/cova-web-object-detection","last_synced_at":"2025-04-07T07:11:05.585Z","repository":{"id":42994562,"uuid":"216736230","full_name":"kevalmorabia97/CoVA-Web-Object-Detection","owner":"kevalmorabia97","description":"A Context-aware Visual Attention-based training pipeline for Object Detection from a Webpage screenshot!","archived":false,"fork":false,"pushed_at":"2025-02-25T09:27:39.000Z","size":1473,"stargazers_count":92,"open_issues_count":3,"forks_count":14,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-03-31T05:09:05.928Z","etag":null,"topics":["attention","computer-vision","convolutional-neural-networks","deep-learning","graph-attention-networks","graph-convolutional-networks","information-extraction","multimodal-learning","object-detection","pytorch","visual-attention"],"latest_commit_sha":null,"homepage":"https://aclanthology.org/2022.ecnlp-1.11/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kevalmorabia97.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-10-22T06:11:54.000Z","updated_at":"2025-02-25T09:27:43.000Z","dependencies_parsed_at":"2025-03-15T15:10:50.491Z","dependency_job_id":"337a2047-53b7-4d72-bfc0-19b852cd7604","html_url":"https://github.com/kevalmorabia97/CoVA-Web-Object-Detection","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kevalmorabia97%2FCoVA-Web-Object-Detection","tags_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kevalmorabia97%2FCoVA-Web-Object-Detection/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kevalmorabia97%2FCoVA-Web-Object-Detection/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kevalmorabia97%2FCoVA-Web-Object-Detection/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kevalmorabia97","download_url":"https://codeload.github.com/kevalmorabia97/CoVA-Web-Object-Detection/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247608151,"owners_count":20965952,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["attention","computer-vision","convolutional-neural-networks","deep-learning","graph-attention-networks","graph-convolutional-networks","information-extraction","multimodal-learning","object-detection","pytorch","visual-attention"],"created_at":"2024-11-21T00:57:24.038Z","updated_at":"2025-04-07T07:11:05.564Z","avatar_url":"https://github.com/kevalmorabia97.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CoVA: Context-aware Visual Attention for Webpage Information Extraction\n\n## Abstract\nWebpage information extraction (WIE) is an important step to create knowledge bases. For this, classical WIE methods leverage the Document Object Model (DOM) tree of a website. However, use of the DOM tree poses significant challenges as context and appearance are encoded in an abstract manner. 
To address this challenge we propose to `reformulate WIE as a context-aware Webpage Object Detection` task. Specifically, we develop a Context-aware Visual Attention-based (CoVA) detection pipeline which combines appearance features with syntactical structure from the DOM tree. To study the approach we collect a `new large-scale dataset of e-commerce websites` for which we manually annotate every web element with four labels: product price, product title, product image and others. On this dataset we show that the proposed CoVA approach is a new challenging baseline which improves upon prior state-of-the-art methods.\n\nIn Proceedings of The Fifth Workshop on e-Commerce and NLP (ECNLP 5), Association for Computational Linguistics 2022. [Paper](https://aclanthology.org/2022.ecnlp-1.11), [Slides](https://docs.google.com/presentation/d/1K1uebQl4hr0vNGwtFCShX7IPfvqBLEb80mUemXuanb8/edit?usp=sharing), [Video Presentation](https://youtu.be/YDgysgyCPFQ) \n\n\u003c!--\n## Key Contributions\n1. We formulate WIE as a context-aware Webpage Object Detection problem.\n2. We develop a Context-aware Visual Attention-based detection pipeline (_CoVA_), which is end-to-end trainable and exploits syntactic structure from the DOM tree along with screenshot images. CoVA uses a variant of Fast R-CNN to obtain a visual representation and graph attention for contextual learning on a graph constructed from the DOM tree. CoVA improves recent state-of-the-art baselines by a significant margin.\n3. We create the largest public dataset of _7.7k_ product webpage screenshots from 408 online retailers for Object Detection from product webpages. Our dataset is \u0026sim;_10x_ larger than existing datasets.\n4. We show the interpretability of CoVA using attention visualizations.\n--\u003e\n\n## CoVA Dataset\nWe labeled _7,740_ webpages spanning _408_ domains (Amazon, Walmart, Target, etc.). Each of these webpages contains exactly one labeled price, title, and image. 
All other web elements are labeled as background. On average, there are _90_ web elements in a webpage.\n\nWebpage screenshots and bounding boxes can be obtained [here](https://drive.google.com/drive/folders/1LcF40ZPrcRAc4RXyIVZGgmGf2LaQzjGK?usp=sharing)\n\n### Train-Val-Test split\nWe create a cross-domain split which ensures that each of the train, val and test sets contains webpages from different domains. Specifically, we construct a 3 : 1 : 1 split based on the number of distinct domains. We observed that the top-5 domains (based on number of samples) were Amazon, EBay, Walmart, Etsy, and Target. So, we created 5 different splits for 5-Fold Cross Validation such that each of the major domains is present in one of the 5 splits for test data. These splits can be accessed [here](splits/)\n\n## CoVA End-to-end Training Pipeline\nOur Context-Aware Visual Attention-based end-to-end pipeline for Webpage Object Detection (_CoVA_) aims to learn function _f_ to predict labels _y = [y\u003csub\u003e1\u003c/sub\u003e, y\u003csub\u003e2\u003c/sub\u003e, ..., y\u003csub\u003eN\u003c/sub\u003e]_ for a webpage containing _N_ elements. The input to CoVA consists of:\n1. a screenshot of a webpage,\n2. list of bounding boxes _[x, y, w, h]_ of the web elements, and\n3. neighborhood information for each element obtained from the DOM tree.\n\nThis information is processed in four stages:\n1. the graph representation extraction for the webpage,\n2. the Representation Network (_RN_),\n3. the Graph Attention Network (_GAT_), and\n4. a fully connected (_FC_) layer.\n\nThe graph representation extraction computes for every web element _i_ its set of _K_ neighboring web elements _N\u003csub\u003ei\u003c/sub\u003e_. The _RN_ consists of a Convolutional Neural Net (_CNN_) and a positional encoder aimed to learn a visual representation _v\u003csub\u003ei\u003c/sub\u003e_ for each web element _i \u0026isin; {1, ..., N}_. 
The _GAT_ combines the visual representation _v\u003csub\u003ei\u003c/sub\u003e_ of the web element _i_ to be classified and those of its neighbors, i.e., _v\u003csub\u003ek\u003c/sub\u003e \u0026forall;k \u0026isin; N\u003csub\u003ei\u003c/sub\u003e_ to compute the contextual representation _c\u003csub\u003ei\u003c/sub\u003e_ for web element _i_. Finally, the visual and contextual representations of the web element are concatenated and passed through the _FC_ layer to obtain the classification output.\n\n![Pipeline](imgs/CoVA-architecture.jpg)\n\n## Experimental Results\n![Table of Comparison](imgs/performance-comparison.jpg)\nCross Domain Accuracy (mean \u0026pm; standard deviation) for 5-fold cross validation.\n\nNOTE: Cross Domain means we train the model on some web domains and test it on completely different domains to evaluate the generalizability of the models to unseen web templates.\n\n## Attention Visualizations!\n![Attention Visualizations](imgs/attn_viz.jpg)\nAttention visualizations in which a red border marks the web element to be classified and its context elements are shaded green, with the intensity of the shade denoting the attention score. In (a), the price receives a much higher score than the other contexts. 
Title and image in (b) are scored higher than other contexts for price.\n\n## Cite\nIf you find this useful in your research, please cite our [ACL 2022 Paper](https://aclanthology.org/2022.ecnlp-1.11/):\n```\n@inproceedings{kumar-etal-2022-cova,\n    title = \"{C}o{VA}: Context-aware Visual Attention for Webpage Information Extraction\",\n    author = \"Kumar, Anurendra  and\n      Morabia, Keval  and\n      Wang, William  and\n      Chang, Kevin  and\n      Schwing, Alex\",\n    booktitle = \"Proceedings of The Fifth Workshop on e-Commerce and NLP (ECNLP 5)\",\n    month = may,\n    year = \"2022\",\n    address = \"Dublin, Ireland\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/2022.ecnlp-1.11\",\n    pages = \"80--90\",\n    abstract = \"Webpage information extraction (WIE) is an important step to create knowledge bases. For this, classical WIE methods leverage the Document Object Model (DOM) tree of a website. However, use of the DOM tree poses significant challenges as context and appearance are encoded in an abstract manner. To address this challenge we propose to reformulate WIE as a context-aware Webpage Object Detection task. Specifically, we develop a Context-aware Visual Attention-based (CoVA) detection pipeline which combines appearance features with syntactical structure from the DOM tree. To study the approach we collect a new large-scale datase of e-commerce websites for which we manually annotate every web element with four labels: product price, product title, product image and others. 
On this dataset we show that the proposed CoVA approach is a new challenging baseline which improves upon prior state-of-the-art methods.\",\n}\n\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkevalmorabia97%2Fcova-web-object-detection","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkevalmorabia97%2Fcova-web-object-detection","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkevalmorabia97%2Fcova-web-object-detection/lists"}