Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/pavel-durov/scraper.challenge
scraper challenge (momentum ai)
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/pavel-durov/scraper.challenge
- Owner: Pavel-Durov
- License: MIT
- Created: 2023-11-21T17:35:58.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2023-12-01T09:57:58.000Z (about 1 year ago)
- Last Synced: 2024-12-06T21:32:49.837Z (about 1 month ago)
- Language: HTML
- Homepage:
- Size: 5.56 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Scraper challenge
## Overview
In this solution there are five components: `Server`, `UpdateJob`, `Store`, `Scraper`, and `Classifier`. A minimal sketch of how they fit together follows the list.
`Store` - a simple in-memory store that holds scraped company information.
`Server` - serves incoming HTTP traffic and returns classified/scraped company data from the `Store`.
`UpdateJob` - refreshes company information in the `Store` and schedules future cron runs to keep it up to date.
`Scraper` - scrapes website information and classifies each company site based on its findings. A single site might have multiple classifications.
`Classifier` - the decision-making component based on the scraped company information. Currently it is naive: it simply uses the first piece of scraped information it gets, but it can be extended.
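A minimal sketch of these pieces in TypeScript (all names and signatures below are illustrative assumptions, not the repository's actual API):
```typescript
// Hypothetical interfaces illustrating the components described above;
// names and signatures are assumptions, not the repository's actual API.
type ChatType = 'Drift' | 'Salesforce' | 'None';

interface CompanyInfo {
  companyName: string;
  chatType: ChatType;
}

// Store: a simple in-memory map of scraped company information.
class Store {
  private data = new Map<string, CompanyInfo>();

  upsert(info: CompanyInfo): void {
    this.data.set(info.companyName, info);
  }

  all(): CompanyInfo[] {
    return [...this.data.values()];
  }
}

// Classifier: currently naive - take the first finding, default to 'None'.
function classify(findings: ChatType[]): ChatType {
  return findings[0] ?? 'None';
}
```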
## Known issues
- For simplicity and due to time constraints, this design runs in a single process. A production implementation would likely split it into multiple distributed components: for example, scaling `Server` horizontally would require distributing `Server`, `UpdateJob`, and `Store`.
- Some HTML pages fail to load, while others are empty files.
- For scraping classification, the only approach I could think of in the given timeframe was to check included scripts and global variables. I am sure more can be done to improve chat technology identification; a sketch of this heuristic follows the list.
- Scraping classification has low accuracy - it can only identify a few Drift and a few Salesforce chats.
- I am not sure what kind of tooling was used for snapshotting the HTML files, but I think it can definitely be improved.
- It takes time to scrape all the HTML files. The server endpoint `GET /chat/find` returns the data it has so far, so the list might be incomplete while the processing of HTML files is still running.
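As a rough illustration of the script/global-variable heuristic mentioned above (the signature patterns here are assumptions, not the exact ones this scraper uses):
```typescript
// Hypothetical signature table: match known chat-widget script URLs or
// global variable names in the raw HTML. The patterns are illustrative
// guesses, not the exact ones used by this scraper.
const SIGNATURES: Array<{ chatType: 'Drift' | 'Salesforce'; pattern: RegExp }> = [
  { chatType: 'Drift', pattern: /js\.driftt\.com|window\.drift/i },
  { chatType: 'Salesforce', pattern: /embedded_svc|liveagent/i },
];

function detectChatType(html: string): 'Drift' | 'Salesforce' | 'None' {
  for (const { chatType, pattern } of SIGNATURES) {
    if (pattern.test(html)) {
      return chatType;
    }
  }
  return 'None';
}
```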
## Install dependencies
```shell
$ yarn
```
## Run
```shell
$ yarn start
```
### Try it
```shell
$ curl -i -XGET localhost:8000/chat/find
HTTP/1.1 200 OK
...[..., {"companyName":"bittitan.html","chatType":"Drift"}, ...{"companyName":"exabeam.html","chatType":"None"}, ..., {"companyName":"konfio.mx.html","chatType":"Salesforce"}, ...]
```
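For the incomplete-list caveat noted under Known issues, here is a sketch of what such an endpoint could look like, assuming Express (the actual HTTP framework used here is not confirmed):
```typescript
import express from 'express';

// Minimal stand-in for the in-memory Store sketched in the Overview.
const store: Array<{ companyName: string; chatType: string }> = [];

const app = express();

// Return whatever has been classified so far; the list may be
// incomplete while HTML processing is still running.
app.get('/chat/find', (_req, res) => {
  res.json(store);
});

app.listen(8000);
```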
## Test
```shell
$ npm run test
```
## Lint
```shell
$ yarn lint # lint check
$ yarn lint:fix # lint write
```
## Git hooks
### Tests
```shell
npx husky add .husky/pre-commit "npm test"
npx husky add .husky/pre-commit "npm run lint"
git add .husky/pre-commit
```
### Commit message
```shell
npx husky add .husky/commit-msg 'npx --no -- commitlint --edit ${1}'
npm pkg set scripts.commitlint="commitlint --edit"
npx husky add .husky/commit-msg 'npm run commitlint ${1}'
```