{"id":15308534,"url":"https://github.com/rexshijaku/lotor","last_synced_at":"2026-06-22T13:31:47.166Z","repository":{"id":106639387,"uuid":"249240819","full_name":"rexshijaku/Lotor","owner":"rexshijaku","description":"The Albanian Web Crawler","archived":false,"fork":false,"pushed_at":"2020-03-22T21:20:29.000Z","size":65,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-02-22T02:15:34.949Z","etag":null,"topics":["albanian-domains","domain-size","language","language-detection","multilingual-domains","quality","quality-estimation","web-crawler"],"latest_commit_sha":null,"homepage":"","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rexshijaku.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-03-22T17:54:59.000Z","updated_at":"2021-12-15T18:56:47.000Z","dependencies_parsed_at":null,"dependency_job_id":"49103254-60dc-4704-bed5-0e844c1f9d93","html_url":"https://github.com/rexshijaku/Lotor","commit_stats":{"total_commits":11,"total_committers":2,"mean_commits":5.5,"dds":"0.18181818181818177","last_synced_commit":"7a9d049a13ee9d4fff651de8106c55300143f5a2"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/rexshijaku/Lotor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rexshijaku%2FLotor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rexshijaku%2FLotor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rexshijaku
%2FLotor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rexshijaku%2FLotor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rexshijaku","download_url":"https://codeload.github.com/rexshijaku/Lotor/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rexshijaku%2FLotor/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34651747,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-22T02:00:06.391Z","response_time":106,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["albanian-domains","domain-size","language","language-detection","multilingual-domains","quality","quality-estimation","web-crawler"],"created_at":"2024-10-01T08:16:44.493Z","updated_at":"2026-06-22T13:31:47.145Z","avatar_url":"https://github.com/rexshijaku.png","language":"C#","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Lotor\n\nLotor, built from scratch and written in C#, is a hybrid of a focused and an incremental crawler: (1) focused, since it collects relevant (Albanian) documents, and (2) incremental, because it incrementally refreshes the existing collection of a domain list. 
This language-focused web crawler excludes domains that are not written in Albanian from a given set of domains and ranks the remaining Albanian domains by their importance. The tool makes predictions about (1) the size of a domain, (2) the domain's main language (Albanian or not), (3) whether a domain is multilingual with Albanian as at least one of its languages, and (4) the quality of a domain. Lotor gathers all the necessary information about a domain by crawling at most three levels deep.\n\nProcyon lotor is a mammal that dips its food in water before eating, and Lotor is Latin for washer. This web crawler doesn't decide that a domain is Albanian just by checking the language of its index page; it analyzes the index page and all first-level pages, so it washes it well.\n\n\u003cb\u003eConfiguration and usage\u003c/b\u003e \n\u003col\u003e\n  \u003cli\u003e Clone this repository \u003c/li\u003e\n  \u003cli\u003e Make sure all packages are installed \u003c/li\u003e\n  \u003cli\u003e Run the following command in the Package Manager Console: Install-Package HtmlAgilityPack \u003c/li\u003e\n\u003c/ol\u003e\n\n\u003cb\u003eHelpful information\u003c/b\u003e\n\nAlmost every piece of code is commented.\n\n\u003cb\u003eLotor/lotor_input\u003c/b\u003e and \u003cb\u003eLotor/lotor_output\u003c/b\u003e are the only folders you need to focus on.\n\nThe \u003cb\u003eLotor/lotor_input\u003c/b\u003e folder contains all the files Lotor needs to start its work.\n\nDomain URLs \u003cu\u003eyou want to crawl\u003c/u\u003e should be added to \u003cb\u003eseed.txt\u003c/b\u003e, which is located in the \u003cb\u003eLotor/lotor_input\u003c/b\u003e folder, in the following 
format:\n\n\u003cul\u003e\n\u003cli\u003enytimes.com\u003c/li\u003e\n\u003cli\u003eshqipfm.al\u003c/li\u003e\n\u003cli\u003estartek.al\u003c/li\u003e\n\u003cli\u003elinktone.al\u003c/li\u003e\n\u003cli\u003eeurosistemalbania.al\u003c/li\u003e\n\u003cli\u003esabah.com.tr\u003c/li\u003e\n\u003cli\u003eemeraldhotel.info\u003c/li\u003e\n\u003c/ul\u003e\n\nAfter Lotor processes the preceding list, the result will be as follows:\n\n\u003cul\u003e\n\u003cli\u003e\u003cstrike\u003enytimes.com\u003c/strike\u003e\u003c/li\u003e\n\u003cli\u003estartek.al 41.583\u003c/li\u003e\n\u003cli\u003eshqipfm.al 14.212\u003c/li\u003e\n\u003cli\u003eeurosistemalbania.al 4.67\u003c/li\u003e\n\u003cli\u003elinktone.al 1.87\u003c/li\u003e\n\u003cli\u003e\u003cstrike\u003esabah.com.tr\u003c/strike\u003e\u003c/li\u003e\n\u003cli\u003e\u003cstrike\u003eemeraldhotel.info\u003c/strike\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\nThe files \u003cb\u003eal_stopwords.txt\u003c/b\u003e (which should contain 45 stopwords) and \u003cb\u003ealbTerms4.txt\u003c/b\u003e (which should contain the 30,000 most common Albanian words on the web that have at least four letters) are incomplete, due to copyright and company privacy. \nIf you need these files, I can email them to you within minutes or hours. Write to me at rexhepshijaku@gmail.com for these files or any other kind of help.\n\n\u003cb\u003eLotor/lotor_output\u003c/b\u003e contains cache folders and descriptive files that summarize the crawl process.\nFor instance, the \u003cb\u003eresults\u003c/b\u003e folder contains files such as albanian_domains.csv, nonalbanian_domains.csv, likely_albanian_domains.html and likely_multilingual.htm, which describe the recently crawled domains.\n\nFor the previous input list, \u003cb\u003ealbanian_domains.csv\u003c/b\u003e should, at the end of the crawling process, contain these domains: startek.al, shqipfm.al, eurosistemalbania.al and linktone.al, since all of these are written in Albanian. 
On the other hand, \u003cb\u003enonalbanian_domains.csv\u003c/b\u003e should be populated with nytimes.com, sabah.com.tr and emeraldhotel.info, because the main language of these domains is not Albanian. The file \u003cb\u003elikely_albanian_domains.html\u003c/b\u003e will contain emeraldhotel.info, because this domain is multilingual and offers Albanian as an alternative language; the entry appears in the following format:\n\n\u003cul\u003e\n  \u003cli\u003eemeraldhotel.info =\u003e emeraldhotel.info?lang=sq.\u003c/li\u003e\n\u003c/ul\u003e\n\nSimilarly, the file \u003cb\u003elikely_multilingual.htm\u003c/b\u003e will contain multilingual domains that are primarily written in Albanian. In this case it will contain startek.al in the following format:\n\n\u003cul\u003e\n  \u003cli\u003estartek.al =\u003e startek.al/?lang=en\u003c/li\u003e\n\u003c/ul\u003e\n\n\u003cb\u003eAdditional information about the Lotor/lotor_input folder can be found in the comments in Globals/Configs.cs.\u003c/b\u003e\n\nLotor was used to test our proposed methods in a scientific paper published as \u003ci\u003e\"Model-based prediction of the size, the language and the quality of the web domains\"\u003c/i\u003e, and it produced highly accurate results in determining and classifying Albanian and non-Albanian domains.\n\nThe project is on GitHub first of all to be useful and to help anyone in need, and also to be improved and generalized to more than one language.\n\nThis project was initiated by Rexhep Shijaku and Ercan Canhasi. \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frexshijaku%2Flotor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frexshijaku%2Flotor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frexshijaku%2Flotor/lists"}