{"id":19452226,"url":"https://github.com/seagatesoft/sde","last_synced_at":"2025-06-28T19:41:00.515Z","repository":{"id":3456356,"uuid":"4510190","full_name":"seagatesoft/sde","owner":"seagatesoft","description":"Structured Data Extractor. An application to extract structured data from web pages. It uses Data Extraction Based on Partial Tree Alignment (DEPTA) method. (UPDATE: I implemented a newer algorithm: https://github.com/seagatesoft/webdext)","archived":false,"fork":false,"pushed_at":"2012-06-09T05:02:27.000Z","size":255,"stargazers_count":49,"open_issues_count":1,"forks_count":26,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-04-03T15:52:32.134Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://seagatesoft.blogspot.com","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/seagatesoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-05-31T16:36:11.000Z","updated_at":"2025-01-19T12:50:17.000Z","dependencies_parsed_at":"2022-09-02T04:00:32.204Z","dependency_job_id":null,"html_url":"https://github.com/seagatesoft/sde","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seagatesoft%2Fsde","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seagatesoft%2Fsde/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seagatesoft%2Fsde/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seagatesoft%2Fsde/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/own
ers/seagatesoft","download_url":"https://codeload.github.com/seagatesoft/sde/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250754556,"owners_count":21481835,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-10T16:45:56.718Z","updated_at":"2025-04-25T04:30:36.979Z","avatar_url":"https://github.com/seagatesoft.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp\u003eStructured Data Extractor (SDE) is an implementation of \u003ca href=\"http://www.cs.uic.edu/~yzhai/paper/www05_depta.pdf\"\u003eDEPTA\u003c/a\u003e (Data Extraction based on Partial Tree Alignment), a method for extracting data from web pages (HTML documents). DEPTA was invented by \u003ca href=\"http://www.cs.uic.edu/~yzhai/\"\u003eYanhong Zhai\u003c/a\u003e and \u003ca href=\"http://www.cs.uic.edu/~liub/\"\u003eBing Liu\u003c/a\u003e of the University of Illinois at Chicago and was published in their paper \"Structured Data Extraction from the Web based on Partial Tree Alignment\" (\u003cem\u003eIEEE Transactions on Knowledge and Data Engineering\u003c/em\u003e, 2006). Given a web page, SDE detects the \u003cem\u003edata records\u003c/em\u003e it contains and extracts them into a table structure (rows and columns). 
You can download the application from this link: \u003ca href=\"http://seagatesoft.com/download/sde.zip\"\u003eDownload Structured Data Extractor\u003c/a\u003e.\u003c/p\u003e\n\u003ch3\u003eUsage\u003c/h3\u003e\n\u003cp\u003e\n\u003col\u003e\n\u003cli\u003eExtract sde.zip.\u003c/li\u003e\n\u003cli\u003eMake sure that a Java Runtime Environment (version 5 or higher) is installed on your computer.\u003c/li\u003e\n\u003cli\u003eOpen a \u003cem\u003ecommand prompt\u003c/em\u003e (Windows) or \u003cem\u003eshell\u003c/em\u003e (UNIX).\u003c/li\u003e\n\u003cli\u003eGo to the directory where you extracted sde.zip.\u003c/li\u003e\n\u003cli\u003eRun this command: \u003ccode\u003ejava -jar sde-runnable.jar URI_input path_to_output_file\u003c/code\u003e\u003c/li\u003e\n\u003cli\u003eThe \u003cem\u003eURI_input\u003c/em\u003e parameter can refer to a local file or a remote file, as long as it is a valid URI. A URI referring to a local file must start with \"file:///\". For example, on Windows: \"file:///D:/Development/Proyek/structured_data_extractor/bin/input/input.html\", or on UNIX: \"file:///home/seagate/input/input.html\".\u003c/li\u003e\n\u003cli\u003eThe output file parameter must be a valid path on the host operating system, such as \"D:\\Data\\output.html\" (Windows) or \"/home/seagate/output/output.html\" (UNIX).\u003c/li\u003e\n\u003cli\u003eExtracted data can be viewed in the output file. 
The output file is an HTML document, and the extracted data is presented in HTML tables.\u003c/li\u003e\n\u003c/ol\u003e\n\u003c/p\u003e\n\u003ch3\u003eSource Code\u003c/h3\u003e\n\u003cp\u003eSDE source code is available at \u003ca href=\"https://github.com/seagatesoft/sde\"\u003eGitHub\u003c/a\u003e.\u003c/p\u003e\n\u003ch3\u003eDependencies\u003c/h3\u003e\n\u003cp\u003eSDE was developed using these libraries:\n\u003cul\u003e\n\u003cli\u003e\u003ca href=\"http://nekohtml.sourceforge.net/\"\u003eNeko HTML Parser\u003c/a\u003e by Andy Clark and Marc Guillemot. Licensed under Apache License Version 2.0.\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"http://xerces.apache.org/\"\u003eXerces\u003c/a\u003e by The Apache Software Foundation. Licensed under Apache License Version 2.0.\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/p\u003e\n\u003ch3\u003eLicense\u003c/h3\u003e\n\u003cp\u003eSDE is licensed under the MIT license.\u003c/p\u003e\n\u003ch3\u003eAuthor\u003c/h3\u003e\n\u003cp\u003eSigit Dewanto, sigitdewanto11[at]yahoo[dot]co[dot]uk, 2009.\u003c/p\u003e","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fseagatesoft%2Fsde","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fseagatesoft%2Fsde","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fseagatesoft%2Fsde/lists"}