https://github.com/antidigest/userclassify

:construction_worker: User Classify

Host: GitHub
URL: https://github.com/antidigest/userclassify
Owner: antiDigest
Created: 2015-08-03T14:07:52.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2016-01-28T16:17:05.000Z (almost 9 years ago)
Last Synced: 2024-03-18T06:33:24.220Z (8 months ago)
Language: Python
Homepage:
Size: 135 KB
Stars: 1
Watchers: 3
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md

# UserClassify

Classify Twitter users by occupation, using their tweets and the metadata attached to those tweets.

## Characteristics to be extracted:

* Profile Features
    * Location -- industries seem to concentrate in some areas. Y

* User Network Structure
    * Follower count -- millions of followers if the user is a celebrity or political figure. Y

* Tweeting Behaviour
    * No. of messages/tweets per day ?? -- may be important, depends on experiments
    * No. of links, images etc. -- based on experiments and how the HITS algorithm works
    * Status count -- check whether the date of joining is available

* Linguistic Content
    * Bad words -- a politician might not use bad words. Y
    * Hashtagged words -- on the side
    * Sentiment words -- useful. Y

* Social Network
    * Who you tweet -- on the side
    * Tweets in response to tweets -- number. Y
    * Retweets -- number. Y
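
The tweeting-behaviour and social-network counts listed above could be computed along the following lines. This is only a minimal sketch: the tweet record layout (the keys `text`, `created_at`, `is_retweet`, `in_reply_to`) and the function name `extract_features` are assumptions made for illustration, not the project's actual data format.

```python
import re

# Simple patterns for links and hashtags; real tweets may need stricter rules.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_features(tweets):
    """Summarise one user's tweets into a flat feature dict.

    Each tweet is assumed (hypothetically) to be a dict with the keys
    'text' (str), 'created_at' (datetime), 'is_retweet' (bool) and
    'in_reply_to' (str or None).
    """
    n = len(tweets)
    if n == 0:
        return {
            "tweets_per_day": 0.0,
            "links_per_tweet": 0.0,
            "hashtags_per_tweet": 0.0,
            "retweet_count": 0,
            "reply_count": 0,
        }
    dates = [t["created_at"] for t in tweets]
    # Treat anything shorter than a day as one day to avoid dividing by zero.
    span_days = max((max(dates) - min(dates)).days, 1)
    links = sum(len(URL_RE.findall(t["text"])) for t in tweets)
    hashtags = sum(len(HASHTAG_RE.findall(t["text"])) for t in tweets)
    return {
        "tweets_per_day": n / span_days,
        "links_per_tweet": links / n,
        "hashtags_per_tweet": hashtags / n,
        "retweet_count": sum(1 for t in tweets if t["is_retweet"]),
        "reply_count": sum(1 for t in tweets if t["in_reply_to"] is not None),
    }
```

The resulting dict holds one row of numeric features per user, which is the shape most classifiers expect.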

## Basis of Classification:

* Occupation

## Algorithms that can be employed:

* Multiclass SVM ??

* KNN

* Gradient Boost Decision Trees
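
Of the candidates above, KNN is the simplest to illustrate. The sketch below is a plain-Python version using Euclidean distance and majority vote. The function name `knn_predict`, the two-feature layout and the toy data are all made up for illustration and say nothing about real occupations. In practice a library implementation would probably be used instead.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    training points, measured by Euclidean distance."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    top_labels = [label for _, label in distances[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Toy, hand-made feature vectors: (tweets_per_day, links_per_tweet).
# The values and labels are placeholders, not measured data.
train_X = [(20.0, 0.8), (18.0, 0.9), (22.0, 0.7),
           (3.0, 0.1), (2.0, 0.2), (4.0, 0.1)]
train_y = ["occupation_a"] * 3 + ["occupation_b"] * 3

print(knn_predict(train_X, train_y, (19.0, 0.75)))
print(knn_predict(train_X, train_y, (2.5, 0.15)))
```

Because KNN relies on raw distances, features on very different scales (for example, follower counts versus per-tweet ratios) would likely need normalising before use.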

### Comments

* we find that a support vector machine (SVM) trained on hashtag metadata outperforms an SVM trained on the full text of users’ tweets, yielding predictions of political affiliations with 91% accuracy.
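
To try hashtag-based features like those mentioned in the comment above, each user's hashtags first have to be turned into a numeric vector. A minimal sketch follows, assuming tweets are available as plain strings. The helper names `hashtag_counts` and `to_vector` and the fixed vocabulary are illustrative assumptions.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")


def hashtag_counts(texts):
    """Count lower-cased hashtags across one user's tweet texts."""
    counts = Counter()
    for text in texts:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts


def to_vector(counts, vocabulary):
    """Project hashtag counts onto a fixed, ordered vocabulary."""
    return [counts.get(tag, 0) for tag in vocabulary]
```

For example, `to_vector(hashtag_counts(texts), vocabulary)` gives one row per user. The vocabulary would normally be built from the most frequent hashtags in the training set.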

## TODO:

* We are collecting tweets and have about 40k tweet records so far. The remaining step is to extract meaningful data from them.

* HITS does not give a value for an individual link: it works on graphs, and we do not have the web graph. It is also unclear how the number of links will help us. Everything else is done, so we can start extracting data from the tweets and visualise how things work out.

* We now only need to select the classifiers to use; hopefully that will not take much time.

* Find any further features we can use to determine occupation. This is done by manually visualising the data.

* We need to meet Tanvir Sir after doing this.
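
On the HITS point in the list above: the algorithm scores nodes of a directed graph, which is why it has nothing to say about a bare link count. The sketch below shows the standard hub/authority power iteration on an edge list. The function name `hits` and the idea of feeding it user-to-user edges (for instance, who mentions whom) are assumptions for illustration, not something the project does.

```python
import math


def hits(edges, iterations=50):
    """Compute hub and authority scores by power iteration.

    `edges` is a list of (source, target) pairs of a directed graph.
    Returns two dicts: (hub_scores, authority_scores).
    """
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of nodes pointing at the node.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub score: sum of authority scores of nodes the node points at.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth
```

Without such an edge list there is nothing for the iteration to run on, which matches the concern raised in the TODO item.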