https://github.com/nikdon/similaritymeasure
TF-IDF and similarity measure in C#
article c-sharp cosine-similarity similarity-measures tf-idf
- Host: GitHub
- URL: https://github.com/nikdon/similaritymeasure
- Owner: nikdon
- Created: 2014-03-29T11:38:27.000Z (almost 11 years ago)
- Default Branch: master
- Last Pushed: 2014-03-31T06:26:40.000Z (almost 11 years ago)
- Last Synced: 2023-08-31T16:56:16.965Z (over 1 year ago)
- Topics: article, c-sharp, cosine-similarity, similarity-measures, tf-idf
- Language: C#
- Homepage:
- Size: 527 KB
- Stars: 3
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Similarity Measure
## Summary
Content-based similarity measure for the articles at two given URLs. Each article is represented as a vector of TF-IDF weights (see http://en.wikipedia.org/wiki/Tf%E2%80%93idf). Term frequency is scaled sublinearly, as described in http://nlp.stanford.edu/IR-book/html/htmledition/sublinear-tf-scaling-1.html.

Once both articles are represented as vectors, their similarity is measured as the cosine of the angle between them (see http://en.wikipedia.org/wiki/Cosine_similarity).
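
The weighting referenced above can be sketched roughly as follows. This is a minimal illustration of sublinear TF scaling combined with a plain `ln(N / df)` inverse document frequency, not the repository's actual code: the class `TfIdfSketch` and its method names are hypothetical, and the README does not say which IDF variant the project uses.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the repository's API.
public static class TfIdfSketch
{
    // Sublinear term-frequency scaling: 1 + ln(tf) when tf > 0, otherwise 0.
    public static double SublinearTf(int rawCount) =>
        rawCount > 0 ? 1.0 + Math.Log(rawCount) : 0.0;

    // Plain inverse document frequency: ln(N / df).
    // Assumption: a term that occurs in no document gets weight 0.
    public static double Idf(int documentCount, int documentsWithTerm) =>
        documentsWithTerm > 0
            ? Math.Log((double)documentCount / documentsWithTerm)
            : 0.0;

    // TF-IDF weight of one term in one document.
    public static double Weight(int rawCount, int documentCount, int documentsWithTerm) =>
        SublinearTf(rawCount) * Idf(documentCount, documentsWithTerm);
}
```

Note that with this plain IDF a term occurring in every document gets weight 0, so when only two documents are compared every shared term is zeroed out. A real implementation would likely need a smoothed IDF or a larger reference corpus, which is a design choice the README leaves unspecified.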
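
For illustration, the cosine step might look like the sketch below. It assumes both articles have already been turned into weight vectors of equal length over the same vocabulary; `CosineSketch` and `Cosine` are hypothetical names and are not taken from the repository.

```csharp
using System;

// Hypothetical helper for illustration only; not the repository's implementation.
public static class CosineSketch
{
    // Cosine of the angle between two equal-length vectors:
    // dot(a, b) / (|a| * |b|).
    public static double Cosine(double[] a, double[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        // A zero vector has no direction; returning 0 here is a convention
        // chosen for this sketch.
        if (normA == 0.0 || normB == 0.0)
            return 0.0;

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

Because TF-IDF weights are non-negative, the result for two article vectors falls between 0 (no shared weighted terms) and 1 (identical direction).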
## Example of usage
To compare two articles, create an instance of `SimilarityCalculator`:

```csharp
SimilarityCalculator sc = new SimilarityCalculator();
```

Then pass the two URLs, along with a threshold for the vocabulary:

```csharp
string url1 = "http://www.dailymail.co.uk/news/article-2592103/Minister-faces-censure-expenses-abuse.html";
string url2 = "http://www.telegraph.co.uk/news/newstopics/mps-expenses/10729984/Maria-Miller-to-have-to-repay-thousands-of-pounds-and-apologise-over-expenses-claims.html";
int threshold = 3;
sc.Compare(url1, url2, vocabularyThreshold: threshold);
```

Running the program prints output similar to:

> url1 consists of 424 words, url2 consists of 301 words.
>
> Vocabulary contains 41 words after tokenization and thresholding.
>
> Similarity is 0.8897
>
> Press any key