https://github.com/longi94/wdps1706

Last synced: 4 months ago

Host: GitHub
URL: https://github.com/longi94/wdps1706
Owner: Longi94
Created: 2017-11-14T16:18:34.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2017-12-04T11:28:10.000Z (over 7 years ago)
Last Synced: 2025-02-15T12:13:15.810Z (4 months ago)
Language: Java
Size: 50.5 MB
Stars: 1
Watchers: 4
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# wdps1706

## Extracting the text
1) The WARC file(s) are loaded into Spark and split into WARC
records. Because a compressed file cannot be partitioned until the
whole file has been decompressed, the RDD created by Spark is repartitioned.
2) Spark reads through the whole file and HTML code is detected just by
looking for a line starting with `
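
The record-splitting idea in step 1 can be illustrated without Spark. The following is a minimal plain-Java sketch, not the repository's actual code: the class name `WarcSplitSketch` and the method `splitRecords` are hypothetical, and it assumes the WARC content is already decompressed and that a new record begins at every line starting with `WARC/` (the version line that opens each record).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not taken from the repository.
class WarcSplitSketch {
    // Assumption: each WARC record opens with a version line such as "WARC/1.0".
    static final String VERSION_PREFIX = "WARC/";

    /** Splits decompressed WARC text into one string per record. */
    static List<String> splitRecords(String warcText) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : warcText.split("\n", -1)) {
            // A version line marks the start of the next record,
            // so flush whatever has been collected so far.
            if (line.startsWith(VERSION_PREFIX) && current.length() > 0) {
                records.add(current.toString());
                current.setLength(0);
            }
            current.append(line).append('\n');
        }
        // Keep the final record unless it is only whitespace.
        if (!current.toString().trim().isEmpty()) {
            records.add(current.toString());
        }
        return records;
    }
}
```

This simplification would split incorrectly if a payload line happened to start with `WARC/`; a stricter parser would rely on each record's `Content-Length` header instead. In Spark, the resulting records would then be spread across workers, for example with `JavaRDD.repartition(int)`.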
