Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/scrapinghub/page_clustering
A simple algorithm for clustering web pages, suitable for crawlers
data-science
- Host: GitHub
- URL: https://github.com/scrapinghub/page_clustering
- Owner: scrapinghub
- License: bsd-3-clause
- Created: 2016-05-25T11:17:45.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2017-03-06T17:29:19.000Z (almost 8 years ago)
- Last Synced: 2024-04-15T01:20:14.806Z (9 months ago)
- Topics: data-science
- Language: HTML
- Homepage:
- Size: 468 KB
- Stars: 35
- Watchers: 5
- Forks: 8
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
# Description [![Build Status](https://travis-ci.org/scrapinghub/page_clustering.svg?branch=master)](https://travis-ci.org/scrapinghub/page_clustering)
A simple algorithm for clustering web pages.
A wrapper around scikit-learn's [MiniBatchKMeans](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html#sklearn.cluster.MiniBatchKMeans).
Web pages are converted to vectors, where each vector entry is just the count of a given tag and class attribute.
The dimension of the vectors will change as new pages with new tags or class attributes arrive.
Simple outlier detection is also available and enabled by default. It rejects web pages
that are highly unlikely to belong to any cluster.

# Install

    pip install page_clustering

# Usage

    import page_clustering

    clt = page_clustering.OnlineKMeans(n_clusters=5)
    # `pages` must have been obtained somehow
    for page in pages:
        clt.add_page(page)
    y = clt.classify(new_page)
    for page in more_pages:
        clt.add_page(page)
    y = clt.classify(yet_another_page)

# Demo

    wget -r --quota=5M https://news.ycombinator.com
    python demo.py news.ycombinator.com

# Tests

    cd tests
    py.test

# Algorithm
The first part, vectorization, transforms the web page to a vector. For example,
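take the following page:

```html
<html>
  <body>
    <ul class="list1">
      <li>A</li>
      <li>B</li>
    </ul>
    <ul class="list2">
      <li>Y</li>
      <li>Z</li>
    </ul>
  </body>
</html>
```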
Each non-closing (tag, class) pair is mapped to a vector position and the number
of times it appears in the document is the value of the vector at that position.
| tag, class | position | count |
|------------|----------|-------|
| html | 0 | 1 |
| body | 1 | 1 |
| ul, list1 | 2 | 1 |
| li | 3 | 4 |
| ul, list2 | 4 | 1 |
The vector is therefore `[1, 1, 1, 4, 1]`. This vector is normalized so that
its elements sum to 1, which gives the final frequency vector:
`[0.125, 0.125, 0.125, 0.5, 0.125]`
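
The vectorization step described above can be sketched with the standard library alone. This is an illustrative sketch, not the code used by `page_clustering`: the names `TagClassCounter`, `vectorize`, and `vocabulary` are hypothetical, the page markup is inferred from the table, and the whole `class` attribute is treated as a single string.

```python
from html.parser import HTMLParser


class TagClassCounter(HTMLParser):
    """Counts opening (tag, class) pairs; closing tags are ignored."""

    def __init__(self, vocabulary):
        super().__init__()
        self.vocabulary = vocabulary  # maps (tag, class) -> vector position
        self.counts = {}              # maps vector position -> count

    def handle_starttag(self, tag, attrs):
        # Assumption: the full class attribute value is used as one key.
        key = (tag, dict(attrs).get("class"))
        # Unseen pairs take the next free position, so the vocabulary grows.
        position = self.vocabulary.setdefault(key, len(self.vocabulary))
        self.counts[position] = self.counts.get(position, 0) + 1


def vectorize(html, vocabulary):
    """Return the normalized frequency vector of `html`."""
    counter = TagClassCounter(vocabulary)
    counter.feed(html)
    counter.close()
    vector = [counter.counts.get(i, 0) for i in range(len(vocabulary))]
    total = sum(vector)
    return [count / total for count in vector] if total else vector


page = """
<html><body>
<ul class="list1"><li>A</li><li>B</li></ul>
<ul class="list2"><li>Y</li><li>Z</li></ul>
</body></html>
"""

vocabulary = {}
vector = vectorize(page, vocabulary)
print(vector)  # [0.125, 0.125, 0.125, 0.5, 0.125]
```

Sharing one `vocabulary` dictionary across calls keeps positions stable from page to page, which is what allows vectors from different pages to be compared.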
When a new page arrives, it may contain (tag, class) pairs that have not been seen before.
For example, imagine that this new page arrives:
```html
<html>
  <body>
    <p>Another page with a paragraph tag</p>
  </body>
</html>
```
The new page would be mapped according to this table:
| tag, class | position | count |
|------------|----------|-------|
| html | 0 | 1 |
| body | 1 | 1 |
| ul, list1 | 2 | 0 |
| li | 3 | 0 |
| ul, list2 | 4 | 0 |
| p | 5 | 1 |
The vector for this page would be `[1, 1, 0, 0, 0, 1]`, and with normalization:
`[0.33, 0.33, 0, 0, 0, 0.33]`.
The new vector has 6 dimensions, which means that the previous page vector must
be extended with zeros on the right: `[0.125, 0.125, 0.125, 0.5, 0.125, 0]`.
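
The zero-padding step can be sketched as follows. The helper name `pad_to` is hypothetical and plain Python lists are used for clarity; the library may store and resize its vectors differently.

```python
def pad_to(vector, dimension):
    """Return a copy of `vector` right-padded with zeros to `dimension`."""
    if len(vector) > dimension:
        raise ValueError("vector is longer than the target dimension")
    return vector + [0.0] * (dimension - len(vector))


old_page = [0.125, 0.125, 0.125, 0.5, 0.125]  # vector of the first page
new_page = [1 / 3, 1 / 3, 0, 0, 0, 1 / 3]     # vector of the page with <p>

# Align every stored vector to the largest dimension seen so far.
dimension = max(len(old_page), len(new_page))
aligned = [pad_to(v, dimension) for v in (old_page, new_page)]
print(aligned[0])  # [0.125, 0.125, 0.125, 0.5, 0.125, 0.0]
```

Padding with zeros leaves the sum of each vector unchanged, so vectors that were already normalized stay normalized.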
Once all needed pages are vectorized, KMeans is applied.