https://github.com/rexshijaku/lotor
The Albanian Web Crawler
albanian-domains domain-size language language-detection multilingual-domains quality quality-estimation web-crawler
- Host: GitHub
- URL: https://github.com/rexshijaku/lotor
- Owner: rexshijaku
- License: mit
- Created: 2020-03-22T17:54:59.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2020-03-22T21:20:29.000Z (over 6 years ago)
- Last Synced: 2025-02-22T02:15:34.949Z (over 1 year ago)
- Topics: albanian-domains, domain-size, language, language-detection, multilingual-domains, quality, quality-estimation, web-crawler
- Language: C#
- Homepage:
- Size: 63.5 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Lotor
Lotor, built from scratch and written in C#, is a hybrid of the focused and incremental crawler types: (1) focused, since it collects relevant (Albanian) documents, and (2) incremental, because it incrementally refreshes the existing collection for a list of domains. This language-focused web crawler aims to exclude domains that are not written in Albanian from a given set of domains, and it ranks only the Albanian domains by their importance. The tool makes predictions about (1) the size of domains, (2) the main language of a domain (Albanian or not), (3) multilingual domains, where at least one of the languages is Albanian, and (4) the quality of a domain. Lotor gathers all the necessary information about a domain by crawling at most three levels of it.
Procyon lotor is a mammal that dips its food in water before eating, and lotor is Latin for washer. This web crawler does not decide that a domain is Albanian just by checking the language of its index page; it analyzes the index page and all first-level pages, so it washes the domain well.
## Configuration and usage
- Clone this repository
- Make sure all packages are installed
- Run the following command in the Package Manager Console: Install-Package HtmlAgilityPack
## Helpful information
Almost every piece of code is commented.
Lotor/lotor_input and Lotor/lotor_output are the only folders you need to work with.
Lotor/lotor_input folder contains all important files which are essential for Lotor to start its work.
Add the domain URLs you want to crawl to seed.txt, which is located in the Lotor/lotor_input folder, in the following format:
- nytimes.com
- shqipfm.al
- startek.al
- linktone.al
- eurosistemalbania.al
- sabah.com.tr
- emeraldhotel.info
The result of processing the preceding list with Lotor will be as follows:
- nytimes.com
- startek.al 41.583
- shqipfm.al 14.212
- eurosistemalbania.al 4.67
- linktone.al 1.87
- sabah.com.tr
- emeraldhotel.info
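If you want to post-process a ranking like the one above, the following is a minimal sketch and not part of Lotor. It assumes each line holds a domain, optionally followed by whitespace and a score that uses a dot as the decimal separator; the names RankedOutput, ParseLine and Ranked are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper, not part of Lotor: reads lines such as "startek.al 41.583".
public static class RankedOutput
{
    // Returns the domain and its score; the score is null when the line has no number.
    public static (string Domain, double? Score) ParseLine(string line)
    {
        string[] parts = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length == 0)
            throw new FormatException("Empty line");

        double? score = null;
        // Assumption: scores use '.' as the decimal separator, as in the example list.
        if (parts.Length > 1 &&
            double.TryParse(parts[1], NumberStyles.Float, CultureInfo.InvariantCulture, out double parsed))
        {
            score = parsed;
        }
        return (parts[0], score);
    }

    // Keeps only the scored domains, highest score first.
    public static List<(string Domain, double Score)> Ranked(IEnumerable<string> lines)
    {
        return lines
            .Where(l => !string.IsNullOrWhiteSpace(l))
            .Select(l => ParseLine(l))
            .Where(r => r.Score.HasValue)
            .Select(r => (Domain: r.Domain, Score: r.Score.Value))
            .OrderByDescending(r => r.Score)
            .ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        // Sample lines copied from the example result above.
        string[] lines =
        {
            "nytimes.com",
            "startek.al 41.583",
            "shqipfm.al 14.212",
            "eurosistemalbania.al 4.67",
            "linktone.al 1.87",
            "sabah.com.tr",
            "emeraldhotel.info"
        };

        foreach (var entry in RankedOutput.Ranked(lines))
        {
            Console.WriteLine(entry.Domain + "\t" + entry.Score.ToString(CultureInfo.InvariantCulture));
        }
    }
}
```

Domains without a score are dropped by Ranked, which matches the example above, where only the Albanian domains carry a score.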
The files al_stopwords.txt (which should contain 45 stopwords) and albTerms4.txt (which should contain the 30,000 most common Albanian words on the web that have at least four letters) are incomplete in this repository, due to copyright and company privacy.
If you need these files, I can email them to you within minutes or hours. Write to me at rexhepshijaku@gmail.com for these files or for any other kind of help.
Lotor/lotor_output contains cache folders and descriptive files which sum up the crawl process.
For instance, the results folder contains files such as albanian_domains.csv, nonalbanian_domains.csv, likely_albanian_domains.html and likely_multilingual.htm, which give information about the recently crawled domains.
For the previous input list, albanian_domains.csv should contain the following domains at the end of the crawling process: startek.al, shqipfm.al, eurosistemalbania.al and linktone.al, since these are all written in Albanian. On the other hand, nonalbanian_domains.csv should be populated with nytimes.com, sabah.com.tr and emeraldhotel.info, because Albanian is not the main language of these domains. The file likely_albanian_domains.html will contain emeraldhotel.info, because this domain is multilingual and offers Albanian as an alternative language, recorded in the following format:
- emeraldhotel.info => emeraldhotel.info?lang=sq.
Similarly, the file likely_multilingual.htm will contain multilingual domains that are written primarily in Albanian; in this case it will contain startek.al in the following format:
- startek.al => startek.al/?lang=en
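The "domain => alternative URL" entries shown above can be split with a few lines of code. This is a rough sketch and not part of Lotor. It assumes an entry is available as plain text in exactly that form (the real files are HTML and may wrap entries in markup) and that the language is carried in a "lang" query parameter, as in the two examples. The class and method names are hypothetical.

```csharp
using System;

// Hypothetical helper, not part of Lotor.
public static class AlternateLanguageEntry
{
    // Splits "domain => alternative URL" into its two trimmed parts.
    public static (string Domain, string AlternateUrl) Parse(string entry)
    {
        int arrow = entry.IndexOf("=>", StringComparison.Ordinal);
        if (arrow < 0)
            throw new FormatException("Expected 'domain => alternative URL'");

        string domain = entry.Substring(0, arrow).Trim();
        string alternate = entry.Substring(arrow + 2).Trim();
        return (domain, alternate);
    }

    // Returns the value of the "lang" query parameter, or null when there is none.
    public static string LanguageCode(string alternateUrl)
    {
        int query = alternateUrl.IndexOf('?');
        if (query < 0)
            return null;

        foreach (string pair in alternateUrl.Substring(query + 1).Split('&'))
        {
            string[] keyValue = pair.Split(new[] { '=' }, 2);
            if (keyValue.Length == 2 && keyValue[0] == "lang")
                return keyValue[1];
        }
        return null;
    }
}

public static class Program
{
    public static void Main()
    {
        var entry = AlternateLanguageEntry.Parse("startek.al => startek.al/?lang=en");
        Console.WriteLine(entry.Domain);
        Console.WriteLine(AlternateLanguageEntry.LanguageCode(entry.AlternateUrl));
    }
}
```

Sites that signal the language through a path segment or a subdomain instead of a query parameter would need a different extraction step.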
Additional information about the Lotor/lotor_input folder can be found in the comments in Globals/Configs.cs.
Lotor was used to test our proposed methods in a scientific paper published as: "Model-based prediction of the size, the language and the quality of the web domains" and it produced highly accurate results in determining and classifying Albanian and non-Albanian domains.
The aim of publishing Lotor on GitHub is, first of all, to be useful and to help anyone in need, and also for the project to be improved and generalized to more than one language.
This project was initiated by Rexhep Shijaku and Ercan Canhasi.