Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/nikitaeverywhere/hadoop-network-of-keywords
Keywords network builder based on TF-IDF with the use of Hadoop platform
- Host: GitHub
- URL: https://github.com/nikitaeverywhere/hadoop-network-of-keywords
- Owner: nikitaeverywhere
- License: MIT
- Created: 2017-12-15T13:06:29.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2017-12-17T12:01:10.000Z (about 7 years ago)
- Last Synced: 2024-11-30T03:13:37.644Z (2 months ago)
- Topics: cloudera, cloudera-hadoop, document-frequency, hadoop, hadoop-platform, keywords-builder, mapreduce, term-frequency, tf-idf
- Language: Python
- Size: 86.9 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
- License: license
Awesome Lists containing this project
README
# Network of Keywords Builder with Hadoop
Keywords network builder based on TF-IDF with the use of Hadoop platform.
Preview
-------

A keywords graph built from [this article](http://news.bbc.co.uk/2/hi/technology/4276125.stm).
Set Up
------

This repository is intended to work with the Cloudera Hadoop technology stack,
but can easily be ported to other Hadoop stacks.

1. Download [this VM](http://content.udacity-data.com/courses/ud617/Cloudera-Udacity-Training-VM-4.1.1.c.zip) used by Cloudera.
2. Log in to the VM using the `training/training` login/password.
3. Clone this repository together with its submodules: `git clone --recursive https://github.com/ZitRos/hadoop-network-of-keywords`.
4. `cd hadoop-network-of-keywords` and run the shell script `run_mapreduce.sh`.
5. After step 4 completes, run `network_builder.py` to generate the graph data in `result.csv`.
6. Build a visual graph from the `result.csv` file, for example, using [Gephi](https://gephi.org).
Running Keywords Builder
------------------------

TF-IDF metrics are computed using Hadoop; further processing and graph building
happen once the TF-IDF values are ready. Running the `run_mapreduce.sh` script
should produce output similar to the following. Note that you can pass the shell
script the name of a particular file from the `texts` directory to analyze:
`run_mapreduce.sh animals/dogs.txt`.

Sample output:
```txt
[training@localhost hadoop-network-of-keywords]$ ./run_mapreduce.sh
Calculating TF-IDF for tech/ink-helps-drive-democracy-in-asia.txt
Running TF mapreduce...
Removing old results...
Deleted /temp
Putting files to HDFS...
Counting files...
Running TF mapreduce on Hadoop...
packageJobJar: [tf_mapper.py, tf_reducer.py, utils.py, /tmp/hadoop-training/hadoop-unjar7892492009998614173/] [] /tmp/streamjob4399530855769057884.jar tmpDir=null
17/12/16 21:15:56 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
17/12/16 21:15:56 WARN snappy.LoadSnappy: Snappy native library is available
17/12/16 21:15:56 INFO snappy.LoadSnappy: Snappy native library loaded
17/12/16 21:15:56 INFO mapred.FileInputFormat: Total input paths to process : 2095
17/12/16 21:15:58 INFO streaming.StreamJob: getLocalDirs(): [/var/lib/hadoop-hdfs/cache/training/mapred/local]
17/12/16 21:15:58 INFO streaming.StreamJob: Running job: job_201712162108_0001
17/12/16 21:15:58 INFO streaming.StreamJob: To kill this job, run:
17/12/16 21:15:58 INFO streaming.StreamJob: UNDEF/bin/hadoop job -Dmapred.job.tracker=0.0.0.0:8021 -kill job_201712162108_0001
17/12/16 21:15:58 INFO streaming.StreamJob: Tracking URL: http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201712162108_0001
17/12/16 21:15:59 INFO streaming.StreamJob: map 0% reduce 0%
17/12/16 21:16:23 INFO streaming.StreamJob: map 1% reduce 0%
...
17/12/16 21:18:45 INFO streaming.StreamJob: map 5% reduce 0%
17/12/16 21:19:12 INFO streaming.StreamJob: map 5% reduce 2%
17/12/16 21:19:22 INFO streaming.StreamJob: map 6% reduce 2%
...
17/12/16 22:14:49 INFO streaming.StreamJob: map 99% reduce 33%
17/12/16 22:15:23 INFO streaming.StreamJob: map 100% reduce 33%
17/12/16 22:15:44 INFO streaming.StreamJob: map 100% reduce 74%
17/12/16 22:15:47 INFO streaming.StreamJob: map 100% reduce 83%
17/12/16 22:15:50 INFO streaming.StreamJob: map 100% reduce 92%
17/12/16 22:15:54 INFO streaming.StreamJob: map 100% reduce 100%
17/12/16 22:15:55 INFO streaming.StreamJob: Job complete: job_201712162108_0001
17/12/16 22:15:55 INFO streaming.StreamJob: Output: /temp/output
Running DF mapreduce on Hadoop...
packageJobJar: [df_mapper.py, df_reducer.py, utils.py, /tmp/hadoop-training/hadoop-unjar8254911625928607214/] [] /tmp/streamjob64323986015252274.jar tmpDir=null
17/12/16 22:15:57 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
17/12/16 22:15:57 WARN snappy.LoadSnappy: Snappy native library is available
17/12/16 22:15:57 INFO snappy.LoadSnappy: Snappy native library loaded
17/12/16 22:15:57 INFO mapred.FileInputFormat: Total input paths to process : 1
17/12/16 22:15:57 INFO streaming.StreamJob: getLocalDirs(): [/var/lib/hadoop-hdfs/cache/training/mapred/local]
17/12/16 22:15:57 INFO streaming.StreamJob: Running job: job_201712162108_0002
17/12/16 22:15:57 INFO streaming.StreamJob: To kill this job, run:
17/12/16 22:15:57 INFO streaming.StreamJob: UNDEF/bin/hadoop job -Dmapred.job.tracker=0.0.0.0:8021 -kill job_201712162108_0002
17/12/16 22:15:57 INFO streaming.StreamJob: Tracking URL: http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201712162108_0002
17/12/16 22:15:58 INFO streaming.StreamJob: map 0% reduce 0%
17/12/16 22:16:03 INFO streaming.StreamJob: map 100% reduce 0%
17/12/16 22:16:11 INFO streaming.StreamJob: map 100% reduce 95%
17/12/16 22:16:13 INFO streaming.StreamJob: map 100% reduce 100%
17/12/16 22:16:14 INFO streaming.StreamJob: Job complete: job_201712162108_0002
17/12/16 22:16:14 INFO streaming.StreamJob: Output: /temp/dfoutput
Getting results into tf_df_output.txt...
```
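The `packageJobJar` lines in the log show that `run_mapreduce.sh` ships `tf_mapper.py`/`tf_reducer.py` and then `df_mapper.py`/`df_reducer.py` (plus `utils.py`) to the cluster as Hadoop Streaming jobs. The repository's actual scripts are not reproduced here; the following is only a minimal sketch of what such a streaming term-frequency pair can look like, assuming plain whitespace tokenization (the real `utils.py` helpers may tokenize differently):

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming sketch of a term-frequency pass. This is NOT the
# repository's tf_mapper.py/tf_reducer.py; tokenization and record layout are
# simplified. Run as "tf_sketch.py map" for the mapper and "tf_sketch.py reduce"
# for the reducer.
import sys


def mapper():
    # Emit one "word<TAB>1" record per token read from stdin.
    for line in sys.stdin:
        for word in line.strip().lower().split():
            sys.stdout.write("%s\t1\n" % word)


def reducer():
    # Hadoop Streaming sorts mapper output by key, so identical words arrive
    # on consecutive lines; sum their counts and emit one total per word.
    current, count = None, 0
    for line in sys.stdin:
        word, _, value = line.rstrip("\n").partition("\t")
        if word != current:
            if current is not None:
                sys.stdout.write("%s\t%d\n" % (current, count))
            current, count = word, 0
        count += int(value)
    if current is not None:
        sys.stdout.write("%s\t%d\n" % (current, count))


if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

A pair like this would be wired up through the streaming jar's `-mapper` and `-reducer` options, which is essentially what `run_mapreduce.sh` does with the real scripts.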
The result is written to the `tf_df_output.txt` file. Each row in this file is a
tab-separated tuple of three values: term frequency, document frequency, and word.
To make TF-IDF computable, the total number of documents is saved to the
`files_count.txt` file as a plain number.

Example of `tf_df_output.txt`:
```txt
3 5 a
1 3 and
1 5 are
1 1 awesome
1 1 best
1 1 can
1 1 dog
5 1 dogs
1 1 everybody
1 1 friend
1 1 high
1 3 is
1 1 jump
2 3 love
1 1 man
1 3 of
1 2 other
```
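The README does not state the exact TF-IDF formula applied downstream, but given the `(tf, df, word)` rows above and the document count `N` stored in `files_count.txt`, a standard weighting such as `tfidf = tf * log(N / df)` can be computed per word. A sketch, assuming that formula:

```python
# Sketch: combine tf_df_output.txt and files_count.txt into TF-IDF scores
# using the common tf * log(N / df) weighting. The repository's
# network_builder.py may weight terms differently.
import math

with open("files_count.txt") as f:
    n_docs = int(f.read().strip())  # total number of documents, N

with open("tf_df_output.txt") as f:
    for line in f:
        if not line.strip():
            continue
        tf, df, word = line.split("\t")
        score = int(tf) * math.log(float(n_docs) / int(df))
        print("%s\t%.4f" % (word.strip(), score))
```

With this weighting, a word that appears in every document scores `log(N / N) = 0`, so ubiquitous words such as `a` or `is` in the sample above are naturally suppressed.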
After `tf_df_output.txt` and its helper files are generated, run the
`network_builder.py` script to produce the `result.csv` file. The example below
shows the `result.csv` generated from the `Ink helps drive democracy in Asia`
article:

```text
;use;voter;thumb;readers;type;uv;serbia;elections;light;sprayed;ultraviolet;ink;republic;kyrgyz;ballot
use;0;0;0;0;0;0;0;2;0;0;0;12;0;1;0
voter;0;0;0;0;0;2;0;1;1;0;0;5;0;0;0
thumb;0;0;0;0;0;0;0;0;1;0;0;3;0;0;0
readers;0;0;0;0;0;0;0;1;0;0;0;3;0;0;0
type;0;0;0;0;0;0;0;0;0;0;0;2;0;0;0
uv;0;2;0;0;0;0;0;0;0;0;0;0;0;0;0
serbia;0;0;0;0;0;0;0;1;0;0;0;0;0;0;0
elections;2;1;0;1;0;0;1;0;0;0;0;21;0;2;0
light;0;1;1;0;0;0;0;0;0;0;0;2;0;0;0
sprayed;0;0;0;0;0;0;0;0;0;0;0;1;0;0;0
ultraviolet;0;0;0;0;0;0;0;0;0;0;0;4;0;0;0
ink;12;5;3;3;2;0;0;21;2;1;4;0;2;4;2
republic;0;0;0;0;0;0;0;0;0;0;0;2;0;0;0
kyrgyz;1;0;0;0;0;0;0;2;0;0;0;4;0;0;0
ballot;0;0;0;0;0;0;0;0;0;0;0;2;0;0;0
```
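`result.csv` above is a semicolon-separated co-occurrence matrix: the first row and first column list the keywords, and each cell holds the weight of the edge between a pair (the sample is symmetric). Gephi can import such a matrix directly; for other tools, a small sketch like the following, with the file name and layout taken from the example above, converts it into a weighted edge list:

```python
# Sketch: convert the semicolon-separated keyword matrix in result.csv
# into a list of weighted, undirected edges (keyword_a, keyword_b, weight).
import csv

with open("result.csv") as f:
    rows = list(csv.reader(f, delimiter=";"))

keywords = rows[0][1:]  # header row looks like ";use;voter;..."
edges = []
for row in rows[1:]:
    if not row:
        continue
    source = row[0]
    for target, cell in zip(keywords, row[1:]):
        weight = int(cell)
        # The sample matrix is symmetric, so keep each pair only once.
        if weight > 0 and source < target:
            edges.append((source, target, weight))

for source, target, weight in sorted(edges, key=lambda e: -e[2]):
    print("%s -- %s (weight %d)" % (source, target, weight))
```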
License
-------

MIT © [Nikita Savchenko](https://nikita.tk)