https://github.com/arun11299/mining-massive-datasets
Programs written as part of Coursera's MMDS course by Ullman-Rajaraman-Leskovec
- Host: GitHub
- URL: https://github.com/arun11299/mining-massive-datasets
- Owner: arun11299
- Created: 2014-11-09T08:27:51.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2014-11-09T17:56:34.000Z (over 10 years ago)
- Last Synced: 2024-12-27T11:14:30.653Z (5 months ago)
- Language: Python
- Size: 2.12 MB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
Mining-Massive-Datasets
=======================

Programs written as part of Coursera's MMDS course by Ullman-Rajaraman-Leskovec.

adwords.py: Given a set of advertisers, their budgets, and their click-through rates, choose advertisers so that,
when one advertiser's budget runs out, the next advertiser chosen is the one that can bring in the maximum revenue,
based on its click-through rate over the available impressions (which are limited to 101).

lsh/lsh_test.py: Implements the min-hashing technique by shingling the document lines and creating a signature matrix
for them. This signature matrix is then fed to the LSH (Locality-Sensitive Hashing) code, which finds the
best-matching lines within the document. The Jaccard similarity threshold is kept around 0.8 (but the code just
displays the best-matching lines that differ by one word).
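
The min-hashing and LSH pipeline described for lsh/lsh_test.py can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function names, the 2-word shingle size, the 100 hash functions, the 20 bands, and the use of `hashlib.blake2b` as the hash family are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(line, k=2):
    """Return the set of k-word shingles of a line (k=2 is an assumption)."""
    words = line.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle; the seed selects the hash function."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, num_hashes=100):
    """One column of the signature matrix: the minimum hash per hash function."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def lsh_candidate_pairs(signatures, bands=20):
    """Split signatures into bands; lines that agree on any band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for line_id, sig in signatures.items():
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(line_id)
        for ids in buckets.values():
            candidates.update(combinations(sorted(ids), 2))
    return candidates


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def similar_lines(lines, threshold=0.8, k=2, num_hashes=100, bands=20):
    """Return index pairs of lines whose exact Jaccard similarity meets the threshold."""
    sets = {i: shingles(line, k) for i, line in enumerate(lines)}
    sets = {i: s for i, s in sets.items() if s}  # skip empty lines
    if not sets:
        return []
    sigs = {i: minhash_signature(s, num_hashes) for i, s in sets.items()}
    pairs = lsh_candidate_pairs(sigs, bands)
    # Verify candidates against the real similarity to drop false positives.
    return sorted((i, j) for i, j in pairs if jaccard(sets[i], sets[j]) >= threshold)
```

Calling `similar_lines(open(path).read().splitlines())` on a text file (with `path` being whatever file you want to scan) would return pairs of line indices. LSH is probabilistic, so pairs near the threshold can occasionally be missed; more bands with fewer rows per band lowers the effective threshold at the cost of more candidates to verify.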
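
For the adwords.py description above, one plausible reading is a greedy allocation: each impression goes to the advertiser with the highest expected revenue among those that still have budget. The sketch below is hypothetical and is not taken from the repository. In particular, the `bid` field, the `Advertiser` class, the `allocate` function, and the choice to charge the expected amount (`bid * ctr`) per impression are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Advertiser:
    name: str
    bid: float     # assumed: amount paid per click (not mentioned in the README)
    ctr: float     # click-through rate
    budget: float  # total budget available


def allocate(advertisers, impressions=101):
    """Greedily assign impressions by expected revenue per impression (bid * ctr).

    An advertiser is skipped once its remaining budget cannot cover the
    expected charge, so the next-best advertiser takes over.
    """
    remaining = {a.name: a.budget for a in advertisers}
    shown = []
    revenue = 0.0
    for _ in range(impressions):
        eligible = [a for a in advertisers if remaining[a.name] >= a.bid * a.ctr]
        if not eligible:
            break  # every budget is exhausted
        best = max(eligible, key=lambda a: a.bid * a.ctr)
        charge = best.bid * best.ctr  # expected charge for this impression
        remaining[best.name] -= charge
        revenue += charge
        shown.append(best.name)
    return shown, revenue, remaining
```

With two advertisers such as `Advertiser("A", 1.0, 0.5, 2.0)` and `Advertiser("B", 1.0, 0.25, 1.0)`, A is shown until its budget is spent and B takes over afterwards. A simulation that samples actual clicks instead of charging expected values would be a reasonable alternative design.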