https://github.com/m3dwards/spark-retreat-python
- Host: GitHub
- URL: https://github.com/m3dwards/spark-retreat-python
- Owner: m3dwards
- Created: 2015-06-20T04:15:56.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2015-09-28T19:20:39.000Z (about 9 years ago)
- Last Synced: 2024-07-27T17:57:45.747Z (4 months ago)
- Language: Python
- Size: 148 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
## Using Olympics.csv data:
Example data:
Athlete,Country,Year,Sport,Gold,Silver,Bronze,Total
Yang Yilin,China,2008,Gymnastics,1,0,2,3

## Questions
1: Which country won the most medals?
2: For “United States”, find the year in which it won the most medals.
3: List the years in which “United States” won fewer than 200 medals.
4: Which athletes have medals in more than one sport?
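
One possible way to approach the four questions with the RDD API is sketched below. It is not the repository's solution. It assumes the file starts with the header row shown in the example, that "medals" means the `Total` column, and that `sc` is an already created `SparkContext`. The function names (`parse_row`, `load_rows`, and so on) are made up for illustration.

```python
import csv


def parse_row(line):
    """Parse one CSV line into (athlete, country, year, sport, total)."""
    athlete, country, year, sport, _gold, _silver, _bronze, total = next(
        csv.reader([line])
    )
    return athlete, country, int(year), sport, int(total)


def load_rows(sc, path):
    """Return an RDD of parsed rows, skipping the header line (assumed present)."""
    lines = sc.textFile(path)
    header = lines.first()
    return lines.filter(lambda line: line != header).map(parse_row)


def top_country(rows):
    """Question 1: (country, medals) for the country with the most medals."""
    return (rows.map(lambda r: (r[1], r[4]))
                .reduceByKey(lambda a, b: a + b)
                .max(key=lambda kv: kv[1]))


def medals_per_year(rows, country):
    """Questions 2 and 3: RDD of (year, medals) for a single country."""
    return (rows.filter(lambda r: r[1] == country)
                .map(lambda r: (r[2], r[4]))
                .reduceByKey(lambda a, b: a + b))


def multi_sport_athletes(rows):
    """Question 4: RDD of athletes with medals in more than one sport."""
    return (rows.filter(lambda r: r[4] > 0)
                .map(lambda r: (r[0], r[3]))   # (athlete, sport)
                .distinct()                    # one pair per athlete and sport
                .map(lambda pair: (pair[0], 1))
                .reduceByKey(lambda a, b: a + b)
                .filter(lambda kv: kv[1] > 1)
                .keys())
```

With `rows = load_rows(sc, path)`, question 2 could be answered with `medals_per_year(rows, "United States").max(key=lambda kv: kv[1])`. For question 3, filter the same RDD with `kv[1] < 200` and collect the keys.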
## Inverted Indices
Build an inverted index from the given text.
Given the following input:
1: if you prick us do we not bleed
2: if you tickle us do we not laugh
3: if you poison us do we not die and
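4: if you wrong us shall we not revenge

The inverted index we want to build looks like this. Each entry has the form term : number of lines containing the term : (line number, occurrences in that line):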
and : 1 : (3, 1)
bleed : 1 : (1, 1)
die : 1 : (3, 1)
do : 3 : (1, 1), (2, 1), (3, 1)
if : 4 : (1, 1), (2, 1), (3, 1), (4, 1)
laugh : 1 : (2, 1)
not : 4 : (1, 1), (2, 1), (3, 1), (4, 1)
poison : 1 : (3, 1)
prick : 1 : (1, 1)
revenge : 1 : (4, 1)
shall : 1 : (4, 1)
tickle : 1 : (2, 1)
us : 4 : (1, 1), (2, 1), (3, 1), (4, 1)
we : 4 : (1, 1), (2, 1), (3, 1), (4, 1)
wrong : 1 : (4, 1)
you : 4 : (1, 1), (2, 1), (3, 1), (4, 1)

Note: The data set is a plain text file without line numbers. To get line numbers, call “zipWithIndex” on the RDD of lines; it pairs each line with a zero-based index.
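
A minimal sketch of how such an index could be built with `zipWithIndex` follows. It is not the repository's solution. It assumes that terms are lowercased and split on whitespace (punctuation stays attached to the word) and that `lines` is an RDD from `sc.textFile(...)`. All function names are hypothetical.

```python
from collections import Counter


def line_postings(indexed_line):
    """Turn (text, zero_based_index) into [(term, (line_number, count)), ...]."""
    text, index = indexed_line
    counts = Counter(text.lower().split())
    # zipWithIndex starts at 0, the expected output counts lines from 1.
    return [(term, (index + 1, n)) for term, n in counts.items()]


def format_entry(term, postings):
    """Render one entry as 'term : number of lines : (line, count), ...'."""
    ordered = sorted(postings)
    pairs = ", ".join("({}, {})".format(line, n) for line, n in ordered)
    return "{} : {} : {}".format(term, len(ordered), pairs)


def lines_by_frequency(postings):
    """Map 'occurrences per line' to the number of lines with that count."""
    return dict(Counter(n for _line, n in postings))


def build_postings(lines):
    """RDD of (term, sorted list of (line_number, count)) from an RDD of lines."""
    return (lines.zipWithIndex()
                 .flatMap(line_postings)
                 .groupByKey()
                 .mapValues(sorted))


def build_index(lines):
    """RDD of formatted index entries, sorted by term."""
    return (build_postings(lines)
            .sortByKey()
            .map(lambda kv: format_entry(kv[0], kv[1])))
```

For the questions that follow, `build_postings(lines).lookup("starcross'd")` returns the postings for that term (an empty list if it is absent), and passing the postings for "gold" to `lines_by_frequency` gives the number of lines for each occurrence count.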
Questions to answer
In which line can you find the term "starcross'd"?
In how many lines does "gold" appear once, twice, and three times?

## Useful commands
password !abcd1234
scp ./application.py [email protected]:~/
spark-submit application.py
hdfs dfs -rm -r /tmp/output
hdfs dfs -ls /tmp/output
hdfs dfs -tail /tmp/output/part-00001
## Note
Everyone connects to the server with the same username, so to avoid overwriting each other's files, use unique directory names or unique Python application names.
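
One way to get a unique output directory is to build it from a name you choose plus a random suffix. The helper below is only a sketch. `unique_output_dir` and the `/tmp/output-...` layout are assumptions for illustration, not part of the exercise setup.

```python
import uuid


def unique_output_dir(name, base="/tmp"):
    """Return a path such as /tmp/output-<name>-<8 hex characters>."""
    suffix = uuid.uuid4().hex[:8]  # random part keeps repeated runs apart
    return "{}/output-{}-{}".format(base, name, suffix)
```

The returned path can be passed to `saveAsTextFile` on the result RDD, so each run writes to its own directory.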