https://github.com/seagatesoft/sde
Structured Data Extractor. An application to extract structured data from web pages. It uses the Data Extraction Based on Partial Tree Alignment (DEPTA) method. (UPDATE: I implemented a newer algorithm: https://github.com/seagatesoft/webdext)
- Host: GitHub
- URL: https://github.com/seagatesoft/sde
- Owner: seagatesoft
- Created: 2012-05-31T16:36:11.000Z (about 14 years ago)
- Default Branch: master
- Last Pushed: 2012-06-09T05:02:27.000Z (about 14 years ago)
- Last Synced: 2025-04-03T15:52:32.134Z (about 1 year ago)
- Language: Java
- Homepage: http://seagatesoft.blogspot.com
- Size: 249 KB
- Stars: 49
- Watchers: 8
- Forks: 26
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
Structured Data Extractor (SDE) is an implementation of DEPTA (Data Extraction based on Partial Tree Alignment), a method for extracting data from web pages (HTML documents). DEPTA was invented by Yanhong Zhai and Bing Liu of the University of Illinois at Chicago and published in their paper "Structured Data Extraction from the Web based on Partial Tree Alignment" (IEEE Transactions on Knowledge and Data Engineering, 2006). Given a web page, SDE detects the data records it contains and extracts them into a table structure (rows and columns). You can download the application from this link: Download Structured Data Extractor.
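To give a feel for the tree comparison that partial tree alignment builds on, here is a minimal sketch of Simple Tree Matching, the similarity measure described in the DEPTA paper. It is an illustration only and is not taken from the SDE source code: the class name `SimpleTreeMatching`, the `Node` type, and the `match` method are hypothetical names chosen for this example, and a node is compared by tag name alone.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; these names are hypothetical and not part of SDE.
class SimpleTreeMatching {

    /** A minimal tag tree node: a tag name plus ordered children. */
    static final class Node {
        final String tag;
        final List<Node> children;

        Node(String tag, Node... children) {
            this.tag = tag;
            this.children = Arrays.asList(children);
        }
    }

    /**
     * Returns the number of node pairs in a maximum matching between the two
     * trees, where matched nodes must share a tag and keep sibling order.
     */
    static int match(Node a, Node b) {
        // Trees whose roots differ do not match at all.
        if (!a.tag.equals(b.tag)) {
            return 0;
        }
        int m = a.children.size();
        int n = b.children.size();
        // best[i][j] = best matching between the first i children of a
        // and the first j children of b (dynamic programming table).
        int[][] best = new int[m + 1][n + 1];
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                int paired = best[i - 1][j - 1]
                        + match(a.children.get(i - 1), b.children.get(j - 1));
                best[i][j] = Math.max(paired,
                        Math.max(best[i - 1][j], best[i][j - 1]));
            }
        }
        // Add 1 for the matched roots.
        return best[m][n] + 1;
    }

    public static void main(String[] args) {
        Node first = new Node("li", new Node("img"), new Node("a"), new Node("span"));
        Node second = new Node("li", new Node("a"), new Node("span"));
        System.out.println(match(first, second));
    }
}
```

In this sketch, two data records with similar tag structure get a high score, which is the kind of signal an alignment step could use to decide which nodes belong in the same table column.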
Usage
- Extract sde.zip.
- Make sure that the Java Runtime Environment (version 5 or higher) is already installed on your computer.
- Open command prompt (Windows) or shell (UNIX).
- Go to the directory where you extracted sde.zip.
- Run this command:
java -jar sde-runnable.jar URI_input path_to_output_file
- The URI_input parameter can refer to a local file or a remote file, as long as it is a valid URI. A URI referring to a local file must be preceded by "file:///". For example, in a Windows environment: "file:///D:/Development/Proyek/structured_data_extractor/bin/input/input.html" or in a UNIX environment: "file:///home/seagate/input/input.html".
- The path_to_output_file parameter must be a valid path on the host operating system, such as "D:\Data\output.html" (Windows) or "/home/seagate/output/output.html" (UNIX).
- Extracted data can be viewed in the output file. The output file is an HTML document, and the extracted data is presented in HTML tables.
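The steps above can also be scripted. The following is a minimal sketch, assuming sde-runnable.jar sits in the current working directory; the class `SdeCommand` and its methods are hypothetical helpers written for this example and are not part of SDE. It shows one way to turn a local path into the "file:///" form and assemble the command line from the usage steps.

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the SDE distribution.
class SdeCommand {

    /** Converts a local file path into a URI string that starts with "file:///". */
    static String toInputUri(String localPath) {
        return Paths.get(localPath).toAbsolutePath().toUri().toString();
    }

    /** Builds the command from the usage steps: java -jar sde-runnable.jar URI_input path_to_output_file */
    static List<String> buildCommand(String inputUri, String outputPath) {
        return Arrays.asList("java", "-jar", "sde-runnable.jar", inputUri, outputPath);
    }

    /** Runs the command and returns its exit code (assumes the jar is in the working directory). */
    static int run(List<String> command) throws IOException, InterruptedException {
        return new ProcessBuilder(command).inheritIO().start().waitFor();
    }

    public static void main(String[] args) {
        String inputUri = toInputUri("/home/seagate/input/input.html");
        List<String> command = buildCommand(inputUri, "/home/seagate/output/output.html");
        // Print the command instead of running it, so the sketch works without the jar.
        System.out.println(String.join(" ", command));
    }
}
```

Remote pages need no conversion: an "http://" URI can be passed to `buildCommand` directly as the first argument.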
Source Code
The SDE source code is available on GitHub.
Dependencies
SDE was developed using these libraries:
- Neko HTML Parser by Andy Clark and Marc Guillemot. Licensed under Apache License Version 2.0.
- Xerces by The Apache Software Foundation. Licensed under Apache License Version 2.0.
License
SDE is licensed under the MIT license.
Author
Sigit Dewanto, sigitdewanto11[at]yahoo[dot]co[dot]uk, 2009.