Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/tressos-aristomenis/most-similar-string-to-given-query
In this project I am using the tf - idf algorithm and cosine similarity to find the similarity of two strings.
cosine-similarity cosine-similarity-scores document-frequency idf inverse-document-frequency query string-similarity term-frequency tf tf-idf tf-idf-vectorizer
- Host: GitHub
- URL: https://github.com/tressos-aristomenis/most-similar-string-to-given-query
- Owner: Tressos-Aristomenis
- Created: 2018-03-30T15:08:14.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2018-03-31T01:12:56.000Z (over 6 years ago)
- Last Synced: 2024-01-26T03:40:25.094Z (11 months ago)
- Topics: cosine-similarity, cosine-similarity-scores, document-frequency, idf, inverse-document-frequency, query, string-similarity, term-frequency, tf, tf-idf, tf-idf-vectorizer
- Language: Java
- Homepage:
- Size: 97.7 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README - How the algorithm works
README
# Most-similar-string-to-given-query
~ In this project I use the TF-IDF algorithm and cosine similarity to find the similarity of two strings.
~ The input is a text file with 2000 randomly generated tweets.
~ Given a query that contains keywords, the goal is to find the tweet in the .txt file that is most similar to the query.

The algorithm:
A string is represented as a vector. Every word of the string represents a dimension of this vector.
So the given query can be represented as a vector as well.
For every unique term (non-duplicate word) of the query, we count its term frequency (TF).
Term frequency: How many times a word appears in its string.
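As an illustration, counting term frequency could look like the sketch below. The class name `TermFrequency` and the tokenization (lower-casing, splitting on whitespace) are assumptions made for this example, not necessarily what the project does.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class TermFrequency {
    // Counts how many times each word appears in the given string.
    // Assumption: words are lower-cased and separated by whitespace.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tf.merge(word, 1, Integer::sum);
            }
        }
        return tf;
    }
}
```

For the string "the cat saw the dog", this maps "the" to 2 and every other word to 1.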
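Then, for every unique term of the query, we compute its Inverse Document Frequency (IDF).
Inverse Document Frequency: a weight derived from how many strings (in our case, how many tweets of the .txt file) contain the specific word. That count is the document frequency; the IDF is usually the logarithm of the total number of strings divided by it, so words that appear in fewer tweets weigh more.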
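A minimal sketch of this step, assuming the common formula idf = ln(N / df), where N is the number of tweets and df is the number of tweets containing the word. The class name, the whitespace tokenization and the choice to return 0 for words found in no tweet are assumptions for this example; the project may use a different variant.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, for illustration only.
final class InverseDocumentFrequency {
    // Document frequency: number of documents that contain the term at least once.
    static int documentFrequency(String term, List<String> documents) {
        String needle = term.toLowerCase();
        int df = 0;
        for (String doc : documents) {
            Set<String> words =
                new HashSet<>(Arrays.asList(doc.toLowerCase().trim().split("\\s+")));
            if (words.contains(needle)) {
                df++;
            }
        }
        return df;
    }

    // idf = ln(N / df). Assumption: a term that occurs in no document gets weight 0.
    static double idf(String term, List<String> documents) {
        int df = documentFrequency(term, documents);
        if (df == 0) {
            return 0.0;
        }
        return Math.log((double) documents.size() / df);
    }
}
```

With this formula, a word that appears in every tweet gets an IDF of 0, so it contributes nothing to the similarity score.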
In the end, to find the similarity of two strings, we calculate the cosine of the angle between their vectors.
One of the two strings must be the query; the other one is a specific document from the file.
E.g., if the text file contains 5 tweets (document1, document2, document3, document4, document5):
For document doc : file
calculate cosine(query, doc).
The document with the highest cosine value is the most similar to our query.
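The loop above can be sketched end to end as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, words are split on whitespace, both the query and each document are weighted with TF × ln(N / df), and vectors are stored as maps from word to weight.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical end-to-end sketch, for illustration only.
final class MostSimilar {
    // Assumption: lower-case the text and split on whitespace.
    static String[] tokenize(String text) {
        String t = text.toLowerCase().trim();
        return t.isEmpty() ? new String[0] : t.split("\\s+");
    }

    // Builds a TF-IDF weighted vector with one dimension per word of the text.
    static Map<String, Double> vector(String text, List<String> documents) {
        Map<String, Double> v = new HashMap<>();
        for (String w : tokenize(text)) {
            v.merge(w, 1.0, Double::sum); // term frequency
        }
        for (Map.Entry<String, Double> e : v.entrySet()) {
            int df = 0; // number of documents containing this word
            for (String doc : documents) {
                if (Arrays.asList(tokenize(doc)).contains(e.getKey())) {
                    df++;
                }
            }
            double idf = df == 0 ? 0.0 : Math.log((double) documents.size() / df);
            e.setValue(e.getValue() * idf);
        }
        return v;
    }

    // Cosine of the angle between two sparse vectors; 0 if either has zero length.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double x : b.values()) {
            normB += x * x;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the document with the highest cosine value against the query,
    // or null if the list is empty.
    static String mostSimilar(String query, List<String> documents) {
        Map<String, Double> q = vector(query, documents);
        String best = null;
        double bestScore = -1.0;
        for (String doc : documents) {
            double score = cosine(q, vector(doc, documents));
            if (score > bestScore) {
                bestScore = score;
                best = doc;
            }
        }
        return best;
    }
}
```

Recomputing the document frequency for every word is slow for 2000 tweets; a real implementation would more likely compute the counts once and reuse them.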
The method of finding the similarity of strings using the cosine is called cosine similarity.

Here is some information on how the TF-IDF algorithm works: https://en.wikipedia.org/wiki/Tf%E2%80%93idf