https://github.com/kahliloppenheimer/Web-page-classification
Classifies webpages into categories defined in DMOZ dataset
- Host: GitHub
- URL: https://github.com/kahliloppenheimer/Web-page-classification
- Owner: kahliloppenheimer
- License: MIT
- Created: 2015-12-01T04:07:57.000Z
- Default Branch: master
- Last Pushed: 2015-12-14T22:16:24.000Z
- Last Synced: 2024-10-12T01:38:49.421Z
- Language: Shell
- Size: 169 KB
- Stars: 41
- Watchers: 2
- Forks: 10
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Topical Web-page classification of the DMOZ Dataset
### [Read the paper](paper.pdf)
This repository contains all scripts associated with my research on topical Web-page classification. You can read the full paper describing the task, experiments, and results [here](paper.pdf).
## Abstract
Multi-class topical web-page classification is a difficult task with widespread application. In this paper, I analyze the performance of well-studied techniques on two different representations of web-pages: hand-written meta-descriptions and on-page text content. I acquired all of the training labels and website descriptions from the DMOZ dataset, and all of the on-page content by scraping the actual web-pages. In a 16-way classification task, I achieved 74.035% accuracy on on-page content and 79.121% accuracy on website descriptions, compared with a 42.032% baseline that always predicts the most frequently tagged category.
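
To make the baseline figure in the abstract concrete, here is a minimal sketch of how a most-frequently-tagged baseline accuracy can be computed. It is not taken from this repository's scripts: the function name `majority_baseline_accuracy` and the toy labels are hypothetical, chosen only for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label.

    Hypothetical helper for illustration; not part of the repository.
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    # most_common(1) returns a one-element list: [(label, count)]
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)


# Toy labels invented for this example (not real DMOZ data).
toy_labels = ["Arts", "Arts", "Business", "Arts", "Science"]
print(majority_baseline_accuracy(toy_labels))  # 3 of 5 labels are "Arts"
```

Any classifier worth reporting should beat this number, which is why the abstract quotes it alongside the two accuracy results.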