Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
SHAD Information Retrieval Class materials
https://github.com/aborg-dev/information_retrieval_class
- Host: GitHub
- URL: https://github.com/aborg-dev/information_retrieval_class
- Owner: aborg-dev
- Created: 2014-05-01T12:49:07.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2015-05-14T09:35:33.000Z (over 9 years ago)
- Last Synced: 2024-10-13T19:10:14.059Z (25 days ago)
- Language: C++
- Size: 508 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
README
# Information Retrieval Course [![Build Status](https://travis-ci.org/IIoTeP9HuY/Information_Retrieval_Class.png?branch=master)](https://travis-ci.org/IIoTeP9HuY/Information_Retrieval_Class)
Information Retrieval YSDA course programming assignments repository
http://shad.yandex.ru/Introduction

These are my solutions for programming exercises from the YSDA Information Retrieval classes. Written in Python/C++.
Requirements
============
* gcc >= 4.8 or clang >= 3.4
* cmake
* boost >= 1.55
* glib2
* libglibmm-2.4
* libxml++2.6
* libtidy

Usage tutorial
============

Build everything using
```bash
mkdir build && cd build
cmake ..
make
```

#### Part 1: Download wiki and analyze webgraph
##### Download wiki
```bash
crawler http://simple.wikipedia.org/ -o wiki -t 8
```
This will download the whole domain into the folder "wiki", preserving the hierarchical structure, and will
fill the file "ready_urls.txt" with the downloaded URLs.

##### Flatten the hierarchical structure to simplify parsing
```bash
flatten --urlsList ready_urls.txt --urlsDir wiki --outDir flat_wiki --urlsMapping urls
```
This will create the folder "flat_wiki" with files 1.html, 2.html, ...
and also create the file "urls" with the mapping 1.html -> url_1, ...
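For later scripts it can be handy to load this mapping into a dictionary. A minimal Python sketch, assuming one whitespace-separated filename/URL pair per line (the exact separator written by flatten may differ):
```python
# Load the "filename -> URL" mapping written by the flatten step.
# Assumes one whitespace-separated pair per line, e.g. "1.html http://...".
def load_url_mapping(path="urls"):
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                mapping[parts[0]] = parts[1]
    return mapping

urls_by_file = load_url_mapping()
print(len(urls_by_file), "documents in the mapping")
```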

##### Strip hypertext tags and extract text from webpages
```bash
extract --urlsDir flat_wiki --outDir text_wiki --urlsMapping urls
```
This will fill the folder "text_wiki" with files 1.txt, 2.txt, ... corresponding to the text
extracted from 1.html, 2.html, ...
It will also produce a file "token_frequency" containing lines token_1 frequency_1, ...

##### Finally, analyze the webgraph
```bash
flat_webgraph --path flat_wiki --urlMapping urls --domain http://simple.wikipedia.org/ --start_page http://simple.wikipedia.org/wiki/Main_Page
```
This will create files:
* "distances" - distance from start_page to every other page
* "in_out_stats" - input and output degrees for each page
* "pageranks" - pagerank for each page (damping factor = 0.85)####Part2, Find duplicates among documents

#### Part 2: Find duplicates among documents
The next steps allow you to find duplicates among the downloaded pages using a technique called "Simhashing".
##### Preprocess data a bit
During this step you should remove all frequent and wiki-specific words from the documents.
There are more general ways to do this using word frequency and scoring (see the sketch after this step), but in this case
it's easier to run a simple sed command :)
```bash
cp -r text_wiki text_wiki_clean && cd text_wiki_clean
find . -type f -exec gsed -n -i "/Navigation menu/q;p" {} \;
```
This will remove all footers, from the string "Navigation menu" to the end of each file.
That's where the extracted wiki-specific words are located.
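The more general approach mentioned above can reuse the "token_frequency" file from the extraction step: treat the most frequent tokens as a crude stopword list and drop them from every document. A rough Python sketch under those assumptions (the token/count line format and the output folder name text_wiki_filtered are made up for illustration):
```python
# Alternative, frequency-based cleanup: drop the N most frequent tokens
# (treated as stopwords) from every extracted text file.
import os

def load_stopwords(path="token_frequency", top_n=200):
    # Assumed format: "token count" per line; adjust if the real file differs.
    counts = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2 and parts[1].isdigit():
                counts.append((parts[0], int(parts[1])))
    counts.sort(key=lambda tc: tc[1], reverse=True)
    return {token for token, _ in counts[:top_n]}

def filter_documents(src="text_wiki", dst="text_wiki_filtered", top_n=200):
    stopwords = load_stopwords(top_n=top_n)
    os.makedirs(dst, exist_ok=True)
    for name in os.listdir(src):
        with open(os.path.join(src, name), encoding="utf-8") as f:
            tokens = f.read().split()
        kept = [t for t in tokens if t not in stopwords]
        with open(os.path.join(dst, name), "w", encoding="utf-8") as f:
            f.write(" ".join(kept))

filter_documents()
```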

##### Build simhash signatures
Now we can build simhashes for documents:
```bash
Simhash -b --path=text_wiki_clean --dest=results
```
This will produce a file "simhashes" with the following format: **url** **length\_in\_words** **simhash**

For example: http://simple.wikipedia.org/wiki/Nathalia_Dill 383 11074093965332231517
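For intuition about these fingerprints: a 64-bit simhash can be built by letting every token's hash vote on each bit position. A simplified Python sketch (the Simhash tool's real tokenization, hashing, and weighting may differ):
```python
# Simplified 64-bit simhash: each token votes +1/-1 on every bit position,
# and the sign of the total decides the final bit of the fingerprint.
import hashlib

def simhash(tokens, bits=64):
    votes = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # take 64 bits of the hash
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

print(simhash("the quick brown fox".split()))
```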
##### Cluster documents
To cluster the documents we need to specify the maximum allowed simhash distance between documents
in a cluster. In this case we will use "-s 5", which means that documents whose simhashes differ in no more
than 5 bit positions will be considered similar (see the sketch at the end of this step).
```bash
Simhash -f -s 5 --dest=results
```
This will create files:
* "clusters_5" - containts list of found clusters
* "clusters_5_sizes" - contains sizes of found clustersCollaboration Policy

Collaboration Policy
==========
I open-sourced this code because I believe it can help people learn something new and improve their skills.
Use it as your conscience dictates, but I encourage you not to copy-paste these sources and to use
them only for educational purposes :)