https://github.com/nagypeterjob/xtractor
Topic extractor with the idea of generating labels using gensim.n_similarity
genism labels machine-learning pandas python topic-extraction
- Host: GitHub
- URL: https://github.com/nagypeterjob/xtractor
- Owner: nagypeterjob
- License: mit
- Created: 2018-04-02T19:50:40.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2018-04-02T21:46:30.000Z (about 8 years ago)
- Last Synced: 2025-10-28T00:31:35.490Z (8 months ago)
- Topics: genism, labels, machine-learning, pandas, python, topic-extraction
- Language: Python
- Size: 8.79 KB
- Stars: 7
- Watchers: 2
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
[License: MIT](https://opensource.org/licenses/MIT)
[CircleCI](https://circleci.com/gh/nagypeterjob/xtractor/tree/master)
```
_ _
__ __| |_ _ _ __ _ __ | |_ ___ _ _
\ \ /| _|| '_|/ _` |/ _|| _|/ _ \| '_|
/_\_\ \__||_| \__,_|\__| \__|\___/|_|
xtractor
Topic extractor with the idea of generating labels using gensim.n_similarity
by Peter Nagy
```
## Overview
xtractor is a small package that aims to label text automatically using pre-trained word vectors.
The idea is the following:
- You must provide one or more gensim-compatible pre-trained word vector models
- You must define categories with keywords
- You must provide the tokenized text features you want to label
- Run the extractor to label the input text
- The extractor computes the cosine similarity between the word vectors of each sentence and the keyword vectors of each category
- Then it chooses the most "similar" category as the label
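
The steps above can be sketched in plain Python. This is only an illustration of the idea, not xtractor's actual code: the toy 2-d vectors, the `CATEGORIES` dict and the helper names (`mean_vector`, `cosine`, `label`) are all made up for this example. It compares the mean vector of the sentence with the mean vector of each category's keywords, which is what gensim's `n_similarity` computes for two word lists. Unlike gensim, the sketch silently skips unknown words instead of raising an error.

```python
import math

# Toy 2-d "word vectors" (made-up values); real use would load a
# pre-trained, gensim-compatible model instead.
VECTORS = {
    "goal": [0.9, 0.1],
    "match": [0.8, 0.2],
    "football": [1.0, 0.0],
    "team": [0.7, 0.3],
    "bank": [0.1, 0.9],
    "inflation": [0.0, 1.0],
    "market": [0.2, 0.8],
}

# Hypothetical category definition: category name -> keywords.
CATEGORIES = {
    "sport": ["football", "team"],
    "economy": ["bank", "inflation"],
}


def mean_vector(words):
    """Average the vectors of the words found in the vocabulary."""
    known = [VECTORS[w] for w in words if w in VECTORS]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def label(tokens):
    """Return the category whose keywords are most similar to the tokens."""
    sentence = mean_vector(tokens)
    if sentence is None:
        return None  # no known word in the sentence, nothing to compare
    scores = {
        name: cosine(sentence, mean_vector(keywords))
        for name, keywords in CATEGORIES.items()
    }
    return max(scores, key=scores.get)
```

With these toy vectors, `label(["goal", "match"])` picks `"sport"` and `label(["market", "inflation"])` picks `"economy"`.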
## Installation
```bash
$ pip install xtractor
```
## Usage
See `example.py` for a more detailed example.
```python
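from xtractor import TopicExtractor as te
# models: list of gensim-compatible pre-trained word vector models
# categories: list of categories, each defined with keywords
extractor = te.TopicExtractor(models=models, categories=categories)
# pandas_data_frame: single-column DataFrame of tokenized text
labels = extractor.extract(pandas_data_frame)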
```
## Parameters
#### TopicExtractor(models=models, categories=categories)
##### models
- list of gensim-compatible models
##### categories
- list of categories
Format:
#### extract(X=pandas_dataframe)
- `X` is either a pandas DataFrame or a regular Python `list`
- if `X` is a pandas DataFrame, it must have exactly one column (the feature column)
- each feature is expected to be a tokenized string (e.g. in the following format: `['Tokenized', 'string']`)
- the return value is a regular `list` with one category name (label) per input row, in input order (e.g. `['economy', 'sport']` for a 2-row input)
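
A minimal sketch of preparing such input from raw strings, using only the standard library. The `tokenize` helper and the sample sentences are hypothetical, and the regex split is deliberately naive. Use whatever tokenizer suits your language and your word vectors.

```python
import re


def tokenize(text):
    """Split a raw string into word tokens (naive: keeps runs of word characters)."""
    return re.findall(r"\w+", text)


# Made-up sample rows, one raw string per row.
raw_rows = ["The central bank raised rates.", "The team won the match!"]

# X is a regular Python list of tokenized strings, one entry per row.
X = [tokenize(row) for row in raw_rows]
```

The resulting `X` has the list form described above and could be passed to `extract`, or placed in a single-column DataFrame first.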
## Precision
It really depends on the quality of your pre-trained word vectors and on the quality of your intuitively defined category keywords.
In my use case I used these [vectors](https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md) and went through several iterations of keywords.
I reached around 69% precision, which is not bad. With more carefully picked keywords it can be improved.
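
If you want to measure this on your own data, one option is to hand-label a sample and compare. The sketch below computes precision per category (correct predictions of a category divided by all predictions of that category). It is only one possible way to measure it, since how the figure above was calculated is not stated here. The function name and the sample labels are made up for illustration.

```python
from collections import Counter


def precision_per_category(predicted, expected):
    """For each predicted category: correct predictions / all predictions."""
    predicted_counts = Counter(predicted)
    correct_counts = Counter(p for p, e in zip(predicted, expected) if p == e)
    return {c: correct_counts[c] / n for c, n in predicted_counts.items()}


# Made-up labels: extractor output vs. hand-assigned labels for 4 rows.
predicted = ["economy", "sport", "sport", "economy"]
expected = ["economy", "sport", "economy", "economy"]

scores = precision_per_category(predicted, expected)
```

Here both `economy` predictions are correct, while only one of the two `sport` predictions is.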
## F.A.Q.
* Q: Why did you make this?
A: Because I was looking for a way to automatically label a huge amount of (Hungarian) text and found no simple way to do it.
## Author
* Peter Nagy | nagypeterjob@gmail.com