https://github.com/longi94/wdps1706



# wdps1706

## Extracting the text
1) The WARC file(s) are loaded into Spark by splitting each file into WARC
records. Because a compressed file cannot be partitioned until the whole
file has been decompressed, the RDD created by Spark is repartitioned
afterwards.
2) Spark reads through the whole file and HTML code is detected just by
looking for a line starting with `