{"id":19452222,"url":"https://github.com/seagatesoft/webdext","last_synced_at":"2025-04-25T04:30:42.247Z","repository":{"id":47926883,"uuid":"88947751","full_name":"seagatesoft/webdext","owner":"seagatesoft","description":"Intelligent Web Data Extractor","archived":false,"fork":false,"pushed_at":"2022-12-05T02:39:28.000Z","size":6060,"stargazers_count":74,"open_issues_count":7,"forks_count":16,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-04-03T15:52:32.424Z","etag":null,"topics":["javascript","scraping"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/seagatesoft.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-04-21T06:09:59.000Z","updated_at":"2025-01-19T12:50:17.000Z","dependencies_parsed_at":"2023-01-22T23:31:32.314Z","dependency_job_id":null,"html_url":"https://github.com/seagatesoft/webdext","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seagatesoft%2Fwebdext","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seagatesoft%2Fwebdext/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seagatesoft%2Fwebdext/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seagatesoft%2Fwebdext/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/seagatesoft","download_url":"https://codeload.github.com/seagatesoft/webdext/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github",
"repositories_count":250754556,"owners_count":21481835,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["javascript","scraping"],"created_at":"2024-11-10T16:45:55.338Z","updated_at":"2025-04-25T04:30:37.225Z","avatar_url":"https://github.com/seagatesoft.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"=======\nWebdext\n=======\n\nWebdext is a Javascript library for web data extraction (web scraping). Currently, it only supports data records extraction from a list page (a web page containing 2 or more data records).\n\nIn order to use it, you must run Webdext inside the web page context. There are 2 ways to do that:\n\n1. Use it as browser extension (currently, I only implemented the Chrome extension) \n2. Inject the script into the web page context using headless browser such as Puppeteer_, PhantomJS_, or Splash_ (currently, I only implemented the runner script for PhantomJS)\n\n.. _Puppeteer: https://pptr.dev/\n.. _PhantomJS: http://phantomjs.org/\n.. _Splash: http://github.com/scrapinghub/splash\n\nCheck the video below to see how it works as Chrome extension: \n\n|DemoVideo|_\n\n.. |DemoVideo| image:: https://img.youtube.com/vi/TmSgcPI25Qc/0.jpg\n.. _DemoVideo: https://www.youtube.com/watch?v=TmSgcPI25Qc\n\nInstallation and usage\n======================\n\n1. `Chrome Extension`_\n2. `PhantomJS script`_\n\n.. _Chrome extension: https://github.com/seagatesoft/webdext/wiki/Chrome-extension\n.. 
_PhantomJS script: https://github.com/seagatesoft/webdext/wiki/PhantomJS-script\n\n\nInternals\n=========\n\nThe intelligent extraction algorithm is heavily based on AutoRM [1]_ and DAG-MTM [2]_, though it is not an exact implementation of either.\n\n.. [1] `Shengsheng Shi, Chengfei Liu, Yi Shen, Chunfeng Yuan, Yihua Huang. 2015. AutoRM: An effective approach for automatic Web data record mining. Knowledge-Based Systems, 89, 314–331. doi:10.1016/j.knosys.2015.07.012 \u003chttp://dl.acm.org/citation.cfm?id=2840138\u003e`_\n\n.. [2] `Shengsheng Shi, Chengfei Liu, Chunfeng Yuan, Yihua Huang. 2014. Multi-feature and DAG-based multi-tree matching algorithm for automatic web data mining. Proceedings of International Joint Conferences on Web Intelligence and Intelligent Agent Technology, 739–755. doi:10.1109/WI-IAT.2014.24 \u003chttp://dl.acm.org/citation.cfm?id=2682781\u003e`_\n\nAuthor\n======\n\nSigit Dewanto, sigitdewanto11[at]yahoo[dot]co[dot]uk\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fseagatesoft%2Fwebdext","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fseagatesoft%2Fwebdext","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fseagatesoft%2Fwebdext/lists"}