{"id":19448108,"url":"https://github.com/scrapinghub/aile","last_synced_at":"2026-03-03T05:39:03.542Z","repository":{"id":66021308,"uuid":"42531767","full_name":"scrapinghub/aile","owner":"scrapinghub","description":"Automatic Item List Extraction","archived":false,"fork":false,"pushed_at":"2016-06-15T12:32:18.000Z","size":911,"stargazers_count":87,"open_issues_count":6,"forks_count":16,"subscribers_count":82,"default_branch":"master","last_synced_at":"2025-02-25T08:52:25.877Z","etag":null,"topics":["data-science"],"latest_commit_sha":null,"homepage":null,"language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scrapinghub.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-09-15T16:27:35.000Z","updated_at":"2024-07-29T08:58:50.000Z","dependencies_parsed_at":null,"dependency_job_id":"be995732-317e-4004-83cc-1d9e71c5592b","html_url":"https://github.com/scrapinghub/aile","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/scrapinghub/aile","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapinghub%2Faile","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapinghub%2Faile/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapinghub%2Faile/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapinghub%2Faile/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/scrapinghub","download_url":"https://codeload.github.com/scra
pinghub/aile/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scrapinghub%2Faile/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30033494,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-03T05:09:26.876Z","status":"ssl_error","status_checked_at":"2026-03-03T05:09:23.944Z","response_time":61,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science"],"created_at":"2024-11-10T16:24:16.126Z","updated_at":"2026-03-03T05:39:03.516Z","avatar_url":"https://github.com/scrapinghub.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Automatic Item List Extraction [![Build Status](https://travis-ci.org/scrapinghub/aile.svg?branch=master)](https://travis-ci.org/scrapinghub/aile)\n\nThis repository is a temporary container for experiments in automatic extraction of lists and tables from web pages.\nAt some later point I will merge the surviving algorithms into either [scrapely](https://github.com/scrapy/scrapely)\nor [portia](https://github.com/scrapinghub/portia).\n\nI document my ideas and algorithm descriptions at [readthedocs](http://aile.readthedocs.org/en/latest/).\n\nThe current approach is based on the HTML code of the page, treated as a stream of HTML tags as processed by\n[scrapely](https://github.com/scrapy/scrapely). 
An alternative approach would be to also use the web page's\nrendering information ([this script](https://github.com/plafl/aile/blob/master/misc/visual.py) renders a tree\nof bounding boxes for each element).\n\n## Installation\n\tpip install -r requirements.txt\n\tpython setup.py develop\n\n## Running\nTo get a feel for how it works, try the two demo scripts included in the repo.\n\n- demo1.py\n  Will annotate the HTML code of a web page, marking in red the lines that form part of the repeating item\n  and prefixing each line with its field number inside the item. The output is written to the file 'annotated.html'.\n\n      python demo1.py https://news.ycombinator.com\n\n  ![annotated HTML](https://github.com/plafl/aile/blob/master/misc/demo1_img.png)\n\n- demo2.py\n  Will label, color and draw the HTML tree so that repeating elements are easy to see. The output is interactive\n  (requires PyQt4).\n\n      python demo2.py https://news.ycombinator.com\n\n  ![annotated tree](https://github.com/plafl/aile/blob/master/misc/demo2_img.png)\n\n## Algorithms\n\nWe are trying to auto-detect repeating patterns in the tags, not necessarily made of *li*, *tr* or *td* tags.\n\n### Clustering trees with a measure of similarity\nThe idea is to compute the distance between all subtrees in the web page and run a clustering algorithm with this distance matrix.\nFor a web page of size N this can be achieved in time O(N^2). The current algorithm actually computes a kernel and\nderives the distance from it. 
The algorithm is based on:\n\n    Kernels for semi-structured data\n    Hisashi Kashima, Teruo Koyanagi\n\nOnce the distance between all subtrees of the web page is computed, DBSCAN clustering is run on the distance matrix.\nThe resulting clusters are then post-processed a little further to obtain the final result.\n\n### Markov models\nThe problem of detecting repeating patterns in streams is known as *motif discovery* and most of the literature about it seems\nto be published in the field of genetics. Inspired by this, there is [a branch](https://github.com/plafl/aile/tree/markov_model)\n(MEME and Profile HMM algorithms).\n\nThe Markov model approach has the following problems right now:\n\n- Requires several web pages for training, depending on the web page type\n- Training is performed using the EM algorithm, which requires several attempts until a good optimum is achieved\n- The number of hidden states is hard to determine. There are some heuristics applied that work partially\n\nThese problems are not insurmountable (I think) but require a lot of work:\n\n- Precision could be improved using [conditional random fields](https://en.wikipedia.org/wiki/Conditional_random_field).\n  These could alleviate the need for data.\n- Training parallelizes well. This is actually already done using [joblib](https://pythonhosted.org/joblib/parallel.html)\n  on a single PC but it could be further improved using a cluster of computers\n- There are some papers about hidden state merging/splitting and even an\n  [infinite number of states](http://machinelearning.wustl.edu/mlpapers/paper_files/nips02-AA01.pdf)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrapinghub%2Faile","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscrapinghub%2Faile","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscrapinghub%2Faile/lists"}