https://github.com/martincastroalvarez/html2vec

Algorithm that converts an HTML to a vectorized object suitable for neural networks.

data-science html2vec natural-language-processing python web-scraping word2vec


# html2vec
An algorithm that converts an HTML document into a vectorized object suitable for neural networks, including sequential models.

![wallpaper](/wallpaper.jpeg)
![architecture](/architecture.jpg)
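To make the idea of "HTML in, one vector per node out" concrete, here is a toy sketch using only the Python standard library. It is not the algorithm implemented in this repository: the `NodeVectorizer` class, the `hash_vector` helper, and the 8-dimensional hashed bag-of-words features are all hypothetical stand-ins chosen for illustration.

```python
from html.parser import HTMLParser
import hashlib

DIM = 8  # toy vector size, chosen only for illustration
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag


def hash_vector(tokens, dim=DIM):
    """Hashed bag-of-words: each token increments one of `dim` buckets."""
    vec = [0.0] * dim
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    return vec


class NodeVectorizer(HTMLParser):
    """Hypothetical helper: records one (tag, vector) pair per text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags
        self.nodes = []  # (tag, vector) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to (and including) the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        words = data.lower().split()
        if words and self.stack:
            tag = self.stack[-1]
            # The enclosing tag name is used as a feature alongside the words.
            self.nodes.append((tag, hash_vector([f"<{tag}>"] + words)))


parser = NodeVectorizer()
parser.feed("<html><body><h1>Hello world</h1><p>Some text here</p></body></html>")
parser.close()
for tag, vec in parser.nodes:
    print(tag, vec)
```

The result is an ordered list of fixed-width vectors, which is the general shape a sequential model expects. The project itself relies on a spaCy model (see the installation steps), so its vectors will differ from this sketch.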

## Instructions
Install the dependencies in a virtual environment and download the spaCy model:
```bash
virtualenv -p python3 .env
source .env/bin/activate
pip install -r requirements.txt
python -m spacy download en_core_web_md
```
Vectorize a web page from the CLI by passing its URL:
```bash
python3 html2vec.py "https://hippie-inheels.com/3-day-new-orleans-itinerary/"
```
Vectorize HTML inside a Python script:
```python
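from html2vec import Html2Vec

# Placeholder input (assumption: fit() accepts an HTML string).
html: str = "<html><body><h1>Title</h1><p>Some text.</p></body></html>"

model: Html2Vec = Html2Vec()
model.relatives = 5  # setting used in the project's own example
for node in model.fit(html):
    print(node)               # the parsed node
    print(node.get_vector())  # its vector representation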
```
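Sequential models usually expect every input to have the same number of time steps, while pages differ in how many nodes they contain. The sketch below shows one common way to handle that: truncate or zero-pad the list of node vectors to a fixed length. `pad_sequence` is a hypothetical helper, not part of html2vec, and the example assumes `node.get_vector()` returns a flat sequence of floats of constant width; short hard-coded lists stand in for real vectors here.

```python
from typing import List, Sequence


def pad_sequence(vectors: Sequence[Sequence[float]], max_len: int) -> List[List[float]]:
    """Truncate or zero-pad equal-width vectors to exactly `max_len` rows."""
    if not vectors:
        return []  # width is unknown without at least one vector
    width = len(vectors[0])
    rows = [list(v) for v in vectors[:max_len]]
    rows += [[0.0] * width for _ in range(max_len - len(rows))]
    return rows


# Stand-in data; in real use these would be collected from node.get_vector().
node_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
batch = pad_sequence(node_vectors, max_len=4)
print(batch)
```

Zero-padding is only one option. Frameworks that support masking or variable-length sequences may not need it.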