https://github.com/navierula/document-similarity


# Document Similarity

There exists a **plethora of textual data** on the Internet. This is good... because the data can perhaps be used to assist
machine learning algorithms. This is also bad... because the data sometimes repeats itself.

I want to examine the bad, particularly with regard to deduplication.