https://github.com/kahliloppenheimer/Web-page-classification

Classifies webpages into categories defined in DMOZ dataset

# Topical Web-page classification of the DMOZ Dataset

### [Read the paper](paper.pdf)

This repository contains all scripts associated with my research on topical Web-page classification. You can read the full paper describing the task, experiments, and results [here](paper.pdf).

## Abstract
Multi-class topical web-page classification is a difficult task with widespread application. In this paper, I analyze the performance of well-studied techniques on two different representations of web-pages: hand-written meta-descriptions and on-page text content. I took all of the training labels and website descriptions from the DMOZ dataset and scraped all of the on-page content from the web-pages themselves. In a 16-way classification task, I achieved 74.035% accuracy on on-page content and 79.121% accuracy on website descriptions, compared with a baseline accuracy of 42.032% from always predicting the most frequently tagged category.
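
As a rough illustration of what a most-frequently-tagged baseline measures, here is a minimal Python sketch. The function name `majority_baseline_accuracy` and the toy labels are made up for this example; they are not taken from the scripts in this repository or from the DMOZ data.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the accuracy of always predicting the most common label."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    # most_common(1) returns a one-element list of (label, count) pairs.
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


# Toy labels invented for illustration only (not DMOZ data).
toy_labels = ["Arts", "Arts", "Science", "Arts", "Sports"]
print(majority_baseline_accuracy(toy_labels))  # 3 of the 5 labels are "Arts"
```

Any classifier worth using should beat this number, which is why the accuracies above are reported alongside it.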