https://github.com/marinagorbacheva/information-retrieval
Website scraping, parsing, indexing and search.
- Host: GitHub
- URL: https://github.com/marinagorbacheva/information-retrieval
- Owner: MarinaGorbacheva
- Created: 2021-12-17T10:51:01.000Z (about 3 years ago)
- Default Branch: master
- Last Pushed: 2021-12-19T19:28:40.000Z (about 3 years ago)
- Last Synced: 2024-10-28T16:38:01.599Z (2 months ago)
- Topics: indexing, parsing, scraping, search
- Language: Python
- Homepage:
- Size: 19.5 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
### Project structure:
#### Data retrieval (`data retrieval`):
1) `data`:
* `parsed_data`: data retrieved during parsing that has not been indexed yet
* `scraper`: binary file that contains the `Scraper` object
2) `parsing`:
* class `Parser`
* class `ParserSettings`
3) `scraping`:
* class `Scraper`
* class `ScraperSettings`
4) `storing`:
* class `ESDataStoring` (derived from `ESConnection`)
* class `IndexBody`
5) `settings`: data used as settings for scraping, parsing and storing:
* `indices_settings`: a set of directories, each of which represents one index and contains JSON files with the settings and mappings for that index
* `parser_settings`: parser settings; should return a dictionary whose keys are index names and whose values are index-specific functions that parse a page
* `scraper_settings`: scraper settings; should return the first URL to scrape and a list of URL patterns to ignore or to parse while scraping
6) `main.py`: an entry point for scraping, parsing and storing
7) `methods.py`: additional methods used in the main function
#### Elasticsearch connection (`es_connection`): class `ESConnection`
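As a rough illustration of how the `indices_settings` directories described under Data retrieval might be turned into bodies for index creation, here is a minimal sketch. It is not the repository's `IndexBody` or `ESConnection` code: the helper name `load_index_bodies` and the file names `settings.json` and `mappings.json` are assumptions made for the example, and no Elasticsearch client is involved.

```python
import json
from pathlib import Path


def load_index_bodies(indices_dir):
    """Build one index body per subdirectory of `indices_dir`.

    Hypothetical helper: each subdirectory is treated as one index,
    and its JSON files are merged into a single dictionary.
    """
    bodies = {}
    for index_dir in sorted(Path(indices_dir).iterdir()):
        if not index_dir.is_dir():
            continue
        body = {}
        # Assumed file names; the repository may name them differently.
        for key in ("settings", "mappings"):
            json_file = index_dir / f"{key}.json"
            if json_file.exists():
                body[key] = json.loads(json_file.read_text(encoding="utf-8"))
        # The directory name doubles as the index name.
        bodies[index_dir.name] = body
    return bodies
```

Each resulting dictionary has the usual `settings` and `mappings` top-level keys, so it could be passed to whatever call creates the index.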
#### Search (`search`):
1) `es_search`: class `ESDataSearch`
2) `query_ui`: class `QueryUI`
3) `main.py`: an entry point for search
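
The contracts described for `parser_settings` and `scraper_settings` under Data retrieval can be sketched as follows. Everything named here is hypothetical: the functions `get_parser_settings`, `get_scraper_settings`, `parse_article` and `classify_url`, the `articles` index, the example URL and the split of the patterns into separate `ignore` and `parse` lists are assumptions for illustration, not the repository's actual code.

```python
import re


def parse_article(html):
    """Hypothetical page parser for a hypothetical `articles` index."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return {"title": match.group(1).strip() if match else None}


def get_parser_settings():
    # Keys are index names; values are functions that parse a page for that index.
    return {"articles": parse_article}


def get_scraper_settings():
    # The first URL to scrape, plus URL patterns to ignore or to parse.
    first_url = "https://example.com/"
    patterns = {
        "ignore": [re.compile(r"/login"), re.compile(r"\.(png|jpg|css|js)$")],
        "parse": [re.compile(r"/articles/\d+")],
    }
    return first_url, patterns


def classify_url(url, patterns):
    """Decide what a scraper could do with `url`; ignore rules win."""
    if any(p.search(url) for p in patterns["ignore"]):
        return "ignore"
    if any(p.search(url) for p in patterns["parse"]):
        return "parse"
    return "follow"
```

A scraper built on these settings could start at `first_url`, call `classify_url` on every discovered link, and hand pages marked `parse` to the function registered for the target index.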