https://github.com/longi94/wdps1706
- Host: GitHub
- URL: https://github.com/longi94/wdps1706
- Owner: Longi94
- Created: 2017-11-14T16:18:34.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-12-04T11:28:10.000Z (over 7 years ago)
- Last Synced: 2025-02-15T12:13:15.810Z (4 months ago)
- Language: Java
- Size: 50.5 MB
- Stars: 1
- Watchers: 4
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# wdps1706
## Extracting the text
1) The WARC file(s) are loaded into Spark and split into individual WARC
records. Because a compressed file cannot be partitioned until it has been
fully decompressed, the RDD created by Spark is repartitioned after loading.
2) Spark reads through the whole file and HTML code is detected just by
looking for a line starting with `