{"id":20801886,"url":"https://github.com/philips-software/textsimilarityprocessor","last_synced_at":"2025-05-07T00:43:51.983Z","repository":{"id":37636708,"uuid":"236949066","full_name":"philips-software/TextSimilarityProcessor","owner":"philips-software","description":"Resolving the Technical Debt in \"Test/Requirement/Issues/Any-text\" repos with unique id using Natural Language Processing Continuous  de-duplicate monitoring system in place to check the duplication of any new text added to \"Test/Requirement/Issues/Any-text\" bank.  Grouping of similar \"Test/Requirement/Issues/Any-text\" helps in reduction of \"Test/Requirement/Issues/Any-text\" yet quality quotient remain same.   Cycle time of test execution comes down as similar tests are identified for merging.  Repeated requirement can be reduced Issues list can be merged/reduced","archived":false,"fork":false,"pushed_at":"2024-06-17T23:29:03.000Z","size":370,"stargazers_count":4,"open_issues_count":2,"forks_count":1,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-16T20:42:21.975Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/philips-software.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-01-29T09:47:58.000Z","updated_at":"2023-08-13T14:26:15.000Z","dependencies_parsed_at":"2022-09-10T03:44:55.778Z","dependency_job_id":null,"html_url":"https://github.com/philips-software/TextSimilarityProcessor","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/repositories/philips-software%2FTextSimilarityProcessor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/philips-software%2FTextSimilarityProcessor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/philips-software%2FTextSimilarityProcessor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/philips-software%2FTextSimilarityProcessor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/philips-software","download_url":"https://codeload.github.com/philips-software/TextSimilarityProcessor/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252793561,"owners_count":21805053,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-17T18:24:57.459Z","updated_at":"2025-05-07T00:43:51.971Z","avatar_url":"https://github.com/philips-software.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Text Similarity\n\n![Python application](https://github.com/philips-software/TextSimilarityProcessor/workflows/Python%20application/badge.svg)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![codecov](https://codecov.io/gh/philips-software/TextSimilarityProcessor/branch/master/graph/badge.svg)](https://codecov.io/gh/philips-software/TextSimilarityProcessor)\n\nTool to identify the similarity of the input text\n\nIt can be used to identify the similarity of,\n\n- Tests  \n\n- Code  
\n\n- Requirements  \n\n- Defects  \n\nAdvantages of such similarity analysis include:\n\n- Resolving technical debt  \n\n- Grouping together similar code / tests / requirements / defects etc.  \n  \n## Dependencies\n\n- Python 3.8 (64-bit)  \n\n- Python packages (xlrd, xlsxwriter, pandas, scikit-learn, numpy)  \n\n## Installation\n  \n[INSTALL.md](INSTALL.md)\n\n```sh\npip install similarity-processor\n```\n\n## Usage\n\n### UI\n\n```sh\n\u003e\u003e\u003epython -m similarity.similarity_ui\n```\n\n- Path to the test/requirement/other document to be\n analyzed (xlsx / csv format).  \n\n- Index of the column that holds the unique ID in the csv/xlsx (0, 1, etc.)  \n\n- Indexes of the steps/description columns used for content matching\n (columns of interest in the csv/xlsx, comma-separated, e.g. 1,2,3)  \n\n- If a new requirement / test is to be checked against the existing ones, enable the\n check box and paste the content to be checked in the new text box.  \n\n### Commandline\n\n```sh\n\u003e\u003e\u003epython -m similarity --p \"path\\to\\TestBank.xlsx\" --u 0 --c \"1,2,3\" --n 8\n```\n\n- The help option is available with:  \n\n```sh\n\u003e\u003e\u003epython -m similarity --h\n```\n\n### Code\n\n```python\n\u003e\u003e\u003e from similarity.similarity_io import SimilarityIO\n\u003e\u003e\u003e # Raw string, so that \"\\t\" in the Windows path is not read as a tab\n\u003e\u003e\u003e similarity_io_obj = SimilarityIO(r\"path\\to\\TestBank.xlsx\", 0, \"1,2,3\")\n\u003e\u003e\u003e similarity_io_obj.orchestrate_similarity()\n```\n\n### Arguments\n\nMandatory\n\n- Path to the input file\n- Index of the unique ID column in the xlsx  \n- Columns of interest in the xlsx  \n\nOptional\n\n- Upper and lower range to filter the similarity values in the output\n   (default \"60,100\")\n- Number of rows in the html report (default 100)  \n- Whether a new text is being checked against an existing text bank\n- If so, the new text\n- Row limit at which the report xlsx file is split (default 500000);\n   rows from 500001 onward are moved to a new file\n\n```python\nimport pandas as pd\nfrom similarity.similarity_io import SimilarityIO\n\ndemo_df = 
pd.read_excel(r\"input\\xlsx\\sheet\\name\")  # You could read from any input source\n\n# A fourth argument sets the number of rows in the brief html report,\n# e.g. SimilarityIO(None, None, None, 200); default is 10\nsimilarity_io_obj = SimilarityIO(None, None, None)\nsimilarity_io_obj.file_path = r\"path\\to\\report\\folder\"  # report folder when a data frame is supplied; otherwise the input file path to read data from\nsimilarity_io_obj.data_frame = demo_df  # input data frame\nsimilarity_io_obj.uniq_header = \"Uniq ID\"  # Unique header of the input data frame (string)\nsimilarity_io_obj.create_merged_df()\nprocessed_similarity = similarity_io_obj.process_cos_match()\nsimilarity_io_obj.report_brief_html(processed_similarity)\nprocessed_similarity.to_csv(r\"path\\to\\report\\folder\\report.csv\", header=True)\n```\n\n### Output\n  \n- Output is written to the same folder as the input file, or to the `file_path`\n specified  \n\n- If the unique ID column contains duplicate IDs, a file whose name contains\n 'duplicate id'  \n\n- A recommendation file with similarity values  \n\n- A merged file with the data from the columns of interest in the xlsx  \n\n- A brief html report containing the top similarities\n (100 rows by default, which can be changed with the --n option)  \n\n## Contact\n\n[MAINTAINERS.md](MAINTAINERS.md)  \n\n## License\n\n[License.md](LICENSE.md)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fphilips-software%2Ftextsimilarityprocessor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fphilips-software%2Ftextsimilarityprocessor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fphilips-software%2Ftextsimilarityprocessor/lists"}