Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/noureldin2303/vsm
Web Information Retrieval | Vector Space Model (VSM)
idf information-retrieval ir learn search-engine vectorspacemodel
- Host: GitHub
- URL: https://github.com/noureldin2303/vsm
- Owner: Noureldin2303
- License: mit
- Created: 2023-05-16T14:42:15.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-05-16T15:35:11.000Z (over 1 year ago)
- Last Synced: 2024-01-26T03:38:28.874Z (about 1 year ago)
- Topics: idf, information-retrieval, ir, learn, search-engine, vectorspacemodel
- Language: Jupyter Notebook
- Homepage:
- Size: 15.6 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Security: SECURITY.md
README
# VSM
#### Web Information Retrieval | Vector Space Model (VSM)
The Vector Space Model (VSM) is used to find the documents most relevant to a given query. In VSM, each document or query is an N-dimensional vector,
where N is the number of distinct terms over all the documents and queries. The i-th index of a vector contains
the score of the i-th term for that vector.
```vue
- The Vector Space Model for Information Retrieval represents documents and queries as vectors of weights.
- The weights represent the importance of the terms (aka words, tokens) in the documents and queries.
- Each weight is a measure of the importance of an index term in a document or a query, respectively.
```
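As an illustration of this representation, here is a minimal sketch that builds a vocabulary and turns each text into an N-dimensional vector of raw counts. The helper names (`tokenize`, `build_vocabulary`, `count_vector`), the whitespace tokenizer, and the sample texts are assumptions made for this example and are not taken from the repository's notebook.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_vocabulary(texts):
    """Map each distinct term to a vector index, in sorted order."""
    terms = sorted({term for text in texts for term in tokenize(text)})
    return {term: i for i, term in enumerate(terms)}


def count_vector(text, vocabulary):
    """Return a list whose i-th entry is the raw count of the i-th term."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for term, n in counts.items():
        if term in vocabulary:
            vector[vocabulary[term]] = n
    return vector


# Hypothetical sample data, used only to show the shapes involved.
docs = ["the cat sat on the mat", "the dog chased the cat"]
query = "cat on mat"

# N is the number of distinct terms over all documents and the query.
vocabulary = build_vocabulary(docs + [query])
doc_vectors = [count_vector(d, vocabulary) for d in docs]
query_vector = count_vector(query, vocabulary)
```

Raw counts are only a placeholder weight here; the tf and idf scores described next would replace them.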
##### The main score functions are based on: Term-Frequency (tf) and Inverse-Document-Frequency (idf).
###### Term-Frequency and Inverse-Document-Frequency
###### The Term-Frequency (tf_{ij}) is computed with respect to the i-th term and the j-th document from n_{i,j}, the number of occurrences of the i-th term in the j-th document.
###### The idea is that if a document contains multiple occurrences of a given term, it probably deals with that topic.
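A commonly used length-normalized form of the term frequency is shown below. It is given as an assumption, not as the exact definition this repository uses; some implementations keep the raw count n_{i,j} instead.

```latex
\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}
```

Here the denominator is the total number of term occurrences in the j-th document, so longer documents do not get higher scores merely for being longer.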
###### The Inverse-Document-Frequency (idf_{i}) takes into consideration the i-th term and all the documents in the collection:
![image](idf.png)
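One widely used text form of the inverse document frequency is shown below. It is an assumption for illustration and may differ in detail (for example in smoothing or logarithm base) from the formula in `idf.png`.

```latex
\mathrm{idf}_{i} = \log \frac{|D|}{|\{\, d_j \in D : t_i \in d_j \,\}|}
```

Here |D| is the number of documents in the collection and the denominator counts the documents that contain the i-th term t_i.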
###### The intuition is that rare terms are more important than common ones: if a term is present in only one document, it likely characterizes that document.
###### The final score for the i-th term in the j-th document is the simple product of the two: tf_{ij} × idf_{i}.
###### Since a document or query contains only a subset of all the distinct terms in the collection, the term frequency is zero for a large number of terms, so a sparse vector representation is needed to keep the space requirements manageable.
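The sparse representation can be sketched with plain dictionaries that store only the terms a document actually contains. The function name `tf_idf_vectors`, the length-normalized tf, the unsmoothed natural-log idf, and the sample documents are all assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf_vectors(tokenized_docs):
    """Return one sparse {term: weight} dict per tokenized document."""
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # Assumed idf form: log(N / df), no smoothing.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}

    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc)
        # Only terms present in the document are stored; all others are implicitly 0.
        vectors.append({t: (n / total) * idf[t] for t, n in counts.items()})
    return vectors


# Hypothetical pre-tokenized documents.
docs = [["cat", "sat", "mat"], ["cat", "dog"], ["dog", "barked", "dog"]]
vectors = tf_idf_vectors(docs)
```

Each dictionary holds at most as many entries as the document has distinct terms, regardless of how large the overall vocabulary grows.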
#### Cosine Similarity
```vue
In order to compute the similarity between two vectors a and b (document/query, but also document/document),
the cosine similarity is used:
```
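The cosine similarity is the dot product of the two vectors divided by the product of their Euclidean norms. A minimal sketch over sparse `{term: weight}` dictionaries follows; the function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    # Iterate over the smaller dict; only shared terms contribute to the dot product.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)
```

Because the result depends only on the angle between the vectors, scaling either vector by a positive constant leaves the similarity unchanged.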
![image](cos.png)
### The algorithm steps for VSM are:
```vue
1- Collecting and preprocessing documents
2- Creating a vocabulary of unique terms
3- Representing each document as a vector
4- Representing each query as a vector
5- Calculating the similarity between each document vector and the query vector
6- Ranking documents based on their similarity to the query
```
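The six steps above can be sketched end to end as follows. This is an illustrative outline under stated assumptions, not the notebook's implementation: the names `tokenize` and `rank`, the whitespace tokenizer, the length-normalized tf, and the unsmoothed idf are all chosen for this example. It also computes idf from the documents only, so query terms that never occur in the collection are dropped.

```python
import math
from collections import Counter


def tokenize(text):
    """Step 1: naive preprocessing (lowercase, split on whitespace)."""
    return text.lower().split()


def rank(documents, query):
    """Return (score, document_index) pairs, best match first."""
    tokenized = [tokenize(d) for d in documents]

    # Step 2: the vocabulary is the key set of the document-frequency table.
    n_docs = len(tokenized)
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log(n_docs / c) for t, c in df.items()}

    def vectorize(tokens):
        # Steps 3-4: sparse tf-idf vector; terms unknown to the collection are skipped.
        counts = Counter(tokens)
        total = len(tokens)
        return {t: (n / total) * idf[t] for t, n in counts.items() if t in idf}

    def cosine(a, b):
        # Step 5: dot product over shared terms divided by the product of the norms.
        dot = sum(w * b[t] for t, w in a.items() if t in b)
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    doc_vectors = [vectorize(doc) for doc in tokenized]
    query_vector = vectorize(tokenize(query))

    # Step 6: sort by descending score, breaking ties by document index.
    scores = [(cosine(v, query_vector), i) for i, v in enumerate(doc_vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Hypothetical collection and query.
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]
ranking = rank(documents, "cat mat")
```

With this sample data, the first document shares both query terms and ranks first, while the third shares none and scores zero; note that the naive tokenizer treats "cats" and "cat" as different terms.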