Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/nitsas/simple-web-search-engine
Last synced: 5 days ago
- Host: GitHub
- URL: https://github.com/nitsas/simple-web-search-engine
- Owner: nitsas
- License: mit
- Created: 2014-11-14T11:41:59.000Z (about 10 years ago)
- Default Branch: master
- Last Pushed: 2014-11-14T15:54:50.000Z (about 10 years ago)
- Last Synced: 2023-03-23T03:56:06.946Z (over 1 year ago)
- Language: Python
- Size: 129 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
INTRO
=====

A **very** simple web search engine written in Python 2.
Originally created for my *Language Technology* course project at the
Computer Engineering and Informatics Department, University of Patras,
around 2010.
USAGE
=====

There are three basic ways to use the software:
- with an existing index file, jump straight to making queries
- create an index file from a set of downloaded webpages
- crawl first and then create the index file

With existing index file, jump to making queries
------------------------------------------------

Prerequisites:
- an index file (default name `index.xml`)
- the url-map (default name `urls.pickle`)

Just run:
    python evaluate_index.py -i <index-file> -u <url-map>
Or, if you are using the default file names, just:
    python evaluate_index.py
Caution: this last command will actually start at the crawling step if it
can't find the index and url-map.

`evaluate_index.py` will load the index file and url-map into memory and give
you a prompt to start issuing search queries. A minimal sketch of such a
query loop is shown below.
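The index format used by `evaluate_index.py` is not documented above, so the
following is only an illustrative sketch: it assumes the url-map unpickles to
a dict of page id to url, and that the index can be held as a dict of token
to set of page ids. The names `query_loop`, `index` and `url_map` are made up
for this example.

```python
# Illustrative only: the real index.xml format may differ.
def query_loop(index, url_map):
    # index: dict mapping token -> set of page ids (an assumption)
    # url_map: dict mapping page id -> url, as unpickled from urls.pickle
    while True:
        try:
            query = raw_input('search> ')  # use input() on Python 3
        except EOFError:
            break
        tokens = query.lower().split()
        if not tokens:
            continue
        # AND semantics: keep only pages that contain every query token
        hits = set.intersection(*(index.get(t, set()) for t in tokens))
        for page_id in sorted(hits):
            print(url_map[page_id])
```

With the default files, `url_map` would come from
`pickle.load(open('urls.pickle', 'rb'))`; parsing `index.xml` into the
assumed dict shape is left out of the sketch.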
Create index file from set of downloaded webpages
-------------------------------------------------

Prerequisites:
- a directory containing `.html` files (default `./html/` directory)

Just run:
    python preprocessor.py
    python indexer.py

The preprocessor will clean and tokenize every `.html` file in the given
directory (let's call it `<dir>`) and store the tokenized webpages
inside a `./tokenized/` directory; page `<dir>/x.html` will be
stored as `./tokenized/x.txt` after the tokenization.
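The exact cleaning rules of `preprocessor.py` and the format `indexer.py`
writes are not spelled out above, so this is only a sketch under assumptions:
tags are stripped naively, tokens are lowercased alphanumeric runs, and the
"index" is an in-memory dict rather than `index.xml`.

```python
# Illustrative sketch of the preprocess-then-index pipeline; the real
# preprocessor.py/indexer.py may clean, tokenize and serialize differently.
import os
import re

TAG_RE = re.compile(r'<[^>]+>')     # naive tag stripper, fine for a sketch
WORD_RE = re.compile(r'[a-z0-9]+')

def preprocess(html_dir='html', out_dir='tokenized'):
    # <dir>/x.html -> ./tokenized/x.txt, one whitespace-separated token stream
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    for name in os.listdir(html_dir):
        if not name.endswith('.html'):
            continue
        with open(os.path.join(html_dir, name)) as f:
            text = TAG_RE.sub(' ', f.read()).lower()
        out_name = name[:-len('.html')] + '.txt'
        with open(os.path.join(out_dir, out_name), 'w') as f:
            f.write(' '.join(WORD_RE.findall(text)))

def build_index(tokenized_dir='tokenized'):
    # token -> set of page ids; a real indexer would serialize this to disk
    index = {}
    for name in os.listdir(tokenized_dir):
        page_id = name[:-len('.txt')]
        with open(os.path.join(tokenized_dir, name)) as f:
            for token in f.read().split():
                index.setdefault(token, set()).add(page_id)
    return index
```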
Crawl first and then create index file
--------------------------------------

Prerequisites:
- nothing!

Just run:
    python crawler.py
    python preprocessor.py
    python indexer.py

The crawler will start crawling webpages, starting from a default set of five
*seed* webpages, and saving them inside a `./html/` directory. Each webpage
must pass a set of default requirements to be saved. Some of the default
requirements are (see the sketch after this list):

- page must be cacheable, i.e. no `no-store` attribute in the `cache-control`
header
- page length must be at least 40000 characters, including html tags
- must be a `text/html` page
- language must be English, i.e. `content-language` must be `en`
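A minimal sketch of these checks, assuming the response headers have been
collected into a dict of lowercased header names (the function name
`page_is_acceptable` is made up here):

```python
# Illustrative check of the default page requirements listed above.
def page_is_acceptable(headers, body):
    # headers: dict of lowercased response header names -> values
    # body: the raw page text, html tags included
    if 'no-store' in headers.get('cache-control', ''):
        return False                                   # not cacheable
    if not headers.get('content-type', '').startswith('text/html'):
        return False                                   # not a text/html page
    if headers.get('content-language', '') != 'en':
        return False                                   # not declared English
    return len(body) >= 40000                          # long enough
```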
The crawler extracts links to visit next from every page it crawls, but
there are some links it does not follow. Default link requirements are
(sketched in code after this list):

- only follow `http://` links, i.e. no `ftp://`, `mailto:` etc. links
- only crawl `.com` and `.co.uk` urls (no `.gov` etc. urls)
- block `twitter.com`, `facebook.com`, `wikipedia` and `imdb` urls
- only follow urls ending in `.html`, `.htm` or `/`
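A sketch of that filter, plus a toy breadth-first crawl loop around it (the
1000-page default stop is described just below); `link_is_followable`,
`fetch` and `extract_links` are illustrative names, not the crawler's actual
API:

```python
# Illustrative sketch of the default link filter and a toy crawl loop.
try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2

BLOCKED_HOSTS = ('twitter.com', 'facebook.com', 'wikipedia', 'imdb')

def link_is_followable(url):
    parsed = urlparse(url)
    if parsed.scheme != 'http':                    # no ftp://, mailto:, ...
        return False
    host = parsed.netloc
    if not (host.endswith('.com') or host.endswith('.co.uk')):
        return False
    if any(blocked in host for blocked in BLOCKED_HOSTS):
        return False
    path = parsed.path or '/'
    return path.endswith(('.html', '.htm', '/'))

def crawl(seeds, fetch, extract_links, max_pages=1000):
    # fetch(url) -> page text, or None if the page fails the checks above
    queue, seen, saved = list(seeds), set(seeds), 0
    while queue and saved < max_pages:
        url = queue.pop(0)
        page = fetch(url)
        if page is None:
            continue
        saved += 1                      # the real crawler writes to ./html/
        for link in extract_links(page):
            if link not in seen and link_is_followable(link):
                seen.add(link)
                queue.append(link)
```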
The crawler will by default crawl until it has exactly 1000 pages (or it runs
out of links).

I might allow the user to change the defaults via command line parameters and
configuration files in the future, if I find the time. Don't count on it.

After the crawler finishes, the preprocessor and indexer will process all
`.html` pages inside the `./html/` directory, as described earlier.

After the whole process ends, the user can start querying the index by
running the command:

    python evaluate_index.py