https://github.com/martincastroalvarez/html2vec
Algorithm that converts an HTML document into a vectorized object suitable for neural networks.
- Host: GitHub
- URL: https://github.com/martincastroalvarez/html2vec
- Owner: MartinCastroAlvarez
- Created: 2020-11-02T18:01:50.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2020-11-02T18:28:46.000Z (over 5 years ago)
- Last Synced: 2025-03-25T16:21:46.294Z (about 1 year ago)
- Topics: data-science, html2vec, natural-language-processing, python, web-scraping, word2vec
- Language: Python
- Homepage: https://martincastroalvarez.com
- Size: 204 KB
- Stars: 13
- Watchers: 2
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# html2vec
Algorithm that converts an HTML document into a vectorized object suitable for neural networks, including sequential models.


## Instructions
Installing dependencies.
```bash
virtualenv -p python3 .env
source .env/bin/activate
pip install -r requirements.txt
python -m spacy download en_core_web_md
```
Vectorizing HTML from the CLI.
```bash
python3 html2vec.py "https://hippie-inheels.com/3-day-new-orleans-itinerary/"
```
Vectorizing HTML inside a Python script.
```python
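from html2vec import Html2Vec

# Illustrative input; `html` is assumed to hold the page markup as a string.
html = "<html><body><p>Hello, world.</p></body></html>"

model: Html2Vec = Html2Vec()
model.relatives = 5
# Print each node and its vector.
for node in model.fit(html):
    print(node)
    print(node.get_vector())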
```
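Preparing the vectors for a sequential model.
Sequential models usually expect a fixed number of rows per input, while a page can yield any number of nodes. The sketch below is not part of html2vec: `to_sequence` and `max_len` are hypothetical names, and it assumes that `node.get_vector()` returns a flat sequence of numbers of the same length for every node. It truncates or zero-pads the list of vectors to exactly `max_len` rows.

```python
from typing import List, Sequence


def to_sequence(vectors: Sequence[Sequence[float]], max_len: int) -> List[List[float]]:
    """Truncate or zero-pad equally sized vectors to exactly max_len rows.

    Hypothetical helper, not part of html2vec.
    """
    if not vectors:
        # The vector dimension is unknown without at least one vector.
        return []
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    # Keep at most max_len rows, converting every value to float.
    rows = [[float(x) for x in v] for v in vectors[:max_len]]
    # Zero-pad at the end so every page yields the same number of rows.
    rows.extend([0.0] * dim for _ in range(max_len - len(rows)))
    return rows


# Stand-in data. With html2vec this would presumably be
# [node.get_vector() for node in model.fit(html)] (an assumption).
vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
matrix = to_sequence(vectors, max_len=4)
print(len(matrix), len(matrix[0]))
```

Padding at the end keeps the original node order intact. Whether zero rows are an appropriate filler depends on the model that consumes them.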