https://github.com/gianlucabortoli/topic-zoomer
URL categorization system using Spark
- Host: GitHub
- URL: https://github.com/gianlucabortoli/topic-zoomer
- Owner: GianlucaBortoli
- License: mit
- Created: 2016-12-07T14:01:19.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2021-05-11T15:49:32.000Z (over 4 years ago)
- Last Synced: 2025-02-05T15:26:52.664Z (8 months ago)
- Language: TeX
- Homepage:
- Size: 461 KB
- Stars: 1
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# topic-zoomer
This is the project for the Big Data course at the University of Trento.

The goal is to categorise geolocalised URLs. An area of interest is selected by
its top-left and bottom-right points, and a step S further divides this area
into squares of size SxS.

# Requirements
* Python 3, version 3.5 or earlier (Python 3.6 does not work because of an incompatibility with PySpark)
* pip dependencies installed via `pip install -r requirements.txt` (see _init.sh_ script)
* working Spark environment

# How to run the tool
## Generating a dataset from URL file
```bash
cd /path/to/repo/data
python crawler.py --input 100_wikipedia_urls --output out.csv --min 0 --max 100
```

## Running the topic extractor on Spark
```bash
cd /path/to/repo/src
/path/to/spark-submit topic_zoomer.py 5 0 100 100 0 100 ../data/test.csv 0
```

## Generating charts from timings
```bash
cd /path/to/repo/charts
python generate_charts.py --input timings.csv --output out
```
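
# Sketch: dividing the area into squares

The introduction describes selecting an area by a top-left and a bottom-right point and dividing it into squares of size SxS. The snippet below is a minimal sketch of that idea, not the code used in `topic_zoomer.py`. The names `Point`, `Square` and `split_area` are hypothetical. It assumes that the y coordinate grows upwards, so the top-left point has the larger y, and that squares on the right and bottom edges are clipped to the selected area. The tool itself may handle both differently.

```python
import math
from collections import namedtuple

# Hypothetical helper types, not taken from the repository.
Point = namedtuple("Point", ["x", "y"])
Square = namedtuple("Square", ["top_left", "bottom_right"])


def split_area(top_left, bottom_right, step):
    """Return the squares of side `step` covering the area, row by row from the top.

    Assumption: y grows upwards, so top_left.y > bottom_right.y.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    width = bottom_right.x - top_left.x
    height = top_left.y - bottom_right.y
    if width <= 0 or height <= 0:
        raise ValueError("top_left must be above and to the left of bottom_right")

    cols = math.ceil(width / step)
    rows = math.ceil(height / step)
    squares = []
    for r in range(rows):
        for c in range(cols):
            left = top_left.x + c * step
            top = top_left.y - r * step
            # Assumption: clip edge squares so they never leave the selected area.
            right = min(left + step, bottom_right.x)
            bottom = max(top - step, bottom_right.y)
            squares.append(Square(Point(left, top), Point(right, bottom)))
    return squares


# Example: a 100x100 area split with S = 50.
example = split_area(Point(0, 100), Point(100, 0), 50)
```

Computing each corner from the row and column index, rather than repeatedly adding `step`, avoids accumulating floating-point error when S is not an integer.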