{"id":13412271,"url":"https://github.com/adriacabeza/similaripy","last_synced_at":"2025-04-30T23:33:23.212Z","repository":{"id":112051646,"uuid":"187345211","full_name":"adriacabeza/similaripy","owner":"adriacabeza","description":" 📝 Approach for a better clustering built in HackNLP","archived":false,"fork":false,"pushed_at":"2019-05-18T16:01:56.000Z","size":1534,"stargazers_count":4,"open_issues_count":0,"forks_count":1,"subscribers_count":4,"default_branch":"master","last_synced_at":"2024-07-31T20:50:06.521Z","etag":null,"topics":["clustering","java","knn","nmslib","python","similarity"],"latest_commit_sha":null,"homepage":"https://devpost.com/software/similaripy","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/adriacabeza.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-05-18T10:37:39.000Z","updated_at":"2024-04-26T13:59:28.000Z","dependencies_parsed_at":"2023-07-31T09:15:41.797Z","dependency_job_id":null,"html_url":"https://github.com/adriacabeza/similaripy","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adriacabeza%2Fsimilaripy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adriacabeza%2Fsimilaripy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adriacabeza%2Fsimilaripy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adriacabeza%2Fsimilaripy/manifests","owner_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/owners/adriacabeza","download_url":"https://codeload.github.com/adriacabeza/similaripy/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224227221,"owners_count":17276759,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clustering","java","knn","nmslib","python","similarity"],"created_at":"2024-07-30T20:01:22.848Z","updated_at":"2024-11-12T06:22:32.873Z","avatar_url":"https://github.com/adriacabeza.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Similaripy\n\n[![ForTheBadge built-with-love](http://ForTheBadge.com/images/badges/built-with-love.svg)](https://github.com/adriacabeza/similaripy/) ![forthebadge made-with-python](http://ForTheBadge.com/images/badges/made-with-python.svg)\n\n[![HitCount](http://hits.dwyl.io/adriacabeza/similaripy.svg)](http://hits.dwyl.io/adriacabeza/similaripy)\n[![GitHub stars](https://img.shields.io/github/stars/adriacabeza/similaripy.svg)](https://GitHub.com/adriacabeza/similaripy/stargazers/)\n[![GitHub forks](https://img.shields.io/github/forks/adriacabeza/similaripy.svg)](https://GitHub.com/adriacabeza/similaripy/network/)\n[![GitHub repo size in bytes](https://img.shields.io/github/repo-size/adriacabeza/similaripy.svg)](https://github.com/adriacabeza/similaripy)\n[![GitHub contributors](https://img.shields.io/github/contributors/adriacabeza/similaripy.svg)](https://GitHub.com/adriacabeza/similaripy/graphs/contributors/)\n[![GitHub 
license](https://img.shields.io/github/license/adriacabeza/similaripy.svg)](https://github.com/adriacabeza/similaripy/blob/master/LICENSE)\n\n\n📝 Approach for a better clustering built in HackNLP\n\n## What we wanted to do\n\nOur approach was to build an N-dimensional index representing the similarity between texts and to derive clusters from it.\n\n- Take the matrix from Java\n- Create an N-dimensional index\n- Write an algorithm to create clusters\n- Implement a 3D representation\n\n## What we have done\n\nWe took the similarity scores for all possible pairs of vector representations of several texts (requirements) provided by the ESSI University group project, which are calculated using cosine distance. Using those scores as a matrix, we created an index with **NMSLIB** (source: https://github.com/nmslib/nmslib) and implemented a **clustering algorithm** based on thresholding and selecting a number of neighbours.\n\n## Challenges we ran into\n\nWe did not have a lot of time to develop our ideas. The brainstorming was a little rushed, and we are not used to it. Moreover, the dataset and the method for validating our model were a little difficult to deal with.\n\n## What we learned\n\nWe had never used nmslib or written a clustering algorithm before, so almost everything we did was new to us.\n\n## What's next for Similaripy\n\nRethink how the model's accuracy is computed and experiment with several parameters to get the best result. 
We could also try other ways of computing the distance and similarity score instead of cosine distance.\n\n## Usage\n\n### Build model\n\n```bash\npython3 -m src.scripts.build_api_model data/input_buildModel_duplicates.json\n```\n\n![](docs/images/build_model.png)\n\n\n### Get matrix\n\n```bash\npython3 -m src.scripts.get_matrix data/input_computeClusters_duplicates.json data/score_matrix.json data/mapping.json\n```\n\n![](docs/images/get_matrix.png)\n\n\n### Build index\n\n```bash\npython3 -m src.build data\n```\n\n![](docs/images/build_index.png)\n\n\n### Find clusters\n\n```bash\npython3 -m src.find_clusters data \n```\n\n![](docs/images/find_clusters.png)\n\n\n### Eval\n\n```bash\npython3 -m src.eval data/input_computeClusters_duplicates.json data/clusters.json \u0026\u0026 python3 -m src.eval data/input_computeClusters_duplicates.json data/clusters.json \n```\n\n![](docs/images/eval.png)\n\n## License\n\nMIT © Similaripy\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadriacabeza%2Fsimilaripy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fadriacabeza%2Fsimilaripy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadriacabeza%2Fsimilaripy/lists"}