https://github.com/philips-software/textsimilarityprocessor

Resolves technical debt in "Test/Requirement/Issues/Any-text" repositories with unique IDs using Natural Language Processing. A continuous de-duplication monitoring system checks any new text added to the "Test/Requirement/Issues/Any-text" bank for duplication. Grouping similar "Test/Requirement/Issues/Any-text" items helps reduce their number while the quality quotient remains the same. The cycle time of test execution comes down as similar tests are identified for merging, repeated requirements can be reduced, and issue lists can be merged or reduced.

Host: GitHub
URL: https://github.com/philips-software/textsimilarityprocessor
Owner: philips-software
License: other
Created: 2020-01-29T09:47:58.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2024-06-17T23:29:03.000Z (about 2 years ago)
Last Synced: 2025-04-16T20:42:21.975Z (about 1 year ago)
Language: Python
Size: 361 KB
Stars: 4
Watchers: 4
Forks: 1
Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.md


# Text Similarity

![Python application](https://github.com/philips-software/TextSimilarityProcessor/workflows/Python%20application/badge.svg)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

[![codecov](https://codecov.io/gh/philips-software/TextSimilarityProcessor/branch/master/graph/badge.svg)](https://codecov.io/gh/philips-software/TextSimilarityProcessor)

A tool to measure the similarity between input texts.

It can be used to identify similarity among:

- Tests
- Code
- Requirements
- Defects

Advantages of such similarity analysis include:

- Resolving technical debt
- Grouping together similar code, tests, requirements, defects, etc.

## Dependencies

- Python 3.8 (64-bit)
- Python packages: xlrd, xlsxwriter, pandas, scikit-learn, numpy

## Installation

See [INSTALL.md](INSTALL.md).

```sh
pip install similarity-processor
```

## Usage

### UI

```sh
python -m similarity.similarity_ui
```

The UI takes the following inputs:

- Path to the test, requirement, or other document to be analyzed (xlsx or csv format).
- Column index of the unique ID in the csv/xlsx file (0, 1, etc.).
- Column indices of the steps/description columns used for content matching (the columns of interest in the csv/xlsx file, separated by commas, such as 1,2,3).
- To check a new requirement or test against the existing ones, enable the check box and paste the content to be checked into the new-text box.
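As a rough illustration of how a unique ID column and a comma-separated list of columns of interest can be combined into one text per entry, here is a minimal standard-library sketch. It is not the package's implementation: the helper names (`parse_column_ids`, `merge_columns`, `duplicate_ids`) and the sample data are hypothetical.

```python
import csv
import io
from collections import Counter


def parse_column_ids(spec):
    """Turn a comma-separated string such as "1,2,3" into [1, 2, 3]."""
    return [int(part) for part in spec.split(",") if part.strip()]


def merge_columns(rows, uniq_col, cols_of_interest):
    """Return (unique ID, joined text of the columns of interest) per row."""
    return [
        (row[uniq_col], " ".join(row[c] for c in cols_of_interest))
        for row in rows
    ]


def duplicate_ids(rows, uniq_col):
    """Return the sorted IDs that occur more than once in the ID column."""
    counts = Counter(row[uniq_col] for row in rows)
    return sorted(uid for uid, n in counts.items() if n > 1)


# Hypothetical test bank: column 0 is the unique ID, columns 1-2 hold the text.
SAMPLE_CSV = """ID,Title,Steps
T1,Login,Enter user name and password
T2,Logout,Click the logout button
T1,Login again,Enter user name and password
"""

rows = list(csv.reader(io.StringIO(SAMPLE_CSV)))[1:]  # skip the header row
cols = parse_column_ids("1,2")
merged = merge_columns(rows, 0, cols)
dups = duplicate_ids(rows, 0)
```

Here `merged` holds one combined text per row and `dups` lists the IDs that would need attention before matching.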

### Commandline

```sh
python -m similarity --p "path\to\TestBank.xlsx" --u 0 --c "1,2,3" --n 8
```

The help option is available with:

```sh
python -m similarity --h
```

### Code

```python
from similarity.similarity_io import SimilarityIO

# Raw string so that "\t" in the Windows path is not read as a tab character
similarity_io_obj = SimilarityIO(r"path\to\TestBank.xlsx", 0, "1,2,3")
similarity_io_obj.orchestrate_similarity()
```

### Arguments

Mandatory

- Path to the input file
- Column index of the unique ID in the xlsx file
- Columns of interest in the xlsx file

Optional

- Upper and lower bounds used to filter the similarity values in the output (default "60,100")
- Number of rows in the HTML report (default 100)
- Whether a new text is being checked against an existing text bank
- If so, the new text
- Row limit used to split the report xlsx file (default 500000); rows from 500001 onward are moved to a new file
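The range filter and the report split described above can be pictured with a small sketch. It assumes similarity values are percentages and that the range bounds are inclusive. Neither is confirmed here, and the function names are hypothetical rather than part of the package.

```python
def parse_range(spec="60,100"):
    """Parse a "lower,upper" string into a (lower, upper) tuple of floats."""
    lower, upper = (float(part) for part in spec.split(","))
    return lower, upper


def filter_by_range(matches, spec="60,100"):
    """Keep (id_a, id_b, score) rows whose score lies within the range (inclusive, by assumption)."""
    lower, upper = parse_range(spec)
    return [m for m in matches if lower <= m[2] <= upper]


def split_rows(rows, max_rows=500000):
    """Split rows into consecutive chunks of at most max_rows, one chunk per output file."""
    return [rows[i:i + max_rows] for i in range(0, len(rows), max_rows)]


# Hypothetical similarity results: (first ID, second ID, similarity in percent).
matches = [
    ("T1", "T2", 95.0),
    ("T1", "T3", 42.5),
    ("T2", "T3", 60.0),
    ("T3", "T4", 71.2),
]

kept = filter_by_range(matches, "60,100")
chunks = split_rows(kept, max_rows=2)  # tiny limit, only to make the split visible
```

With the default limit of 500000, `split_rows` returns a single chunk unless the report has more than 500000 rows.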

To run the analysis on a data frame instead of an input file:

```python
import pandas as pd
from similarity.similarity_io import SimilarityIO

demo_df = pd.read_excel(r"input\xlsx\sheet\name")  # you could read from any input source

# SimilarityIO(None, None, None, 200) => 200 rows in the brief HTML report; default is 10
similarity_io_obj = SimilarityIO(None, None, None)
# Report folder when used in this format; otherwise the input file path to read data from
similarity_io_obj.file_path = r"path\to\report\folder"
similarity_io_obj.data_frame = demo_df  # input data frame
similarity_io_obj.uniq_header = "Uniq ID"  # unique ID header of the input data frame (string)
similarity_io_obj.create_merged_df()
processed_similarity = similarity_io_obj.process_cos_match()
similarity_io_obj.report_brief_html(processed_similarity)
processed_similarity.to_csv(r"path\to\report\folder\report.csv", header=True)
```
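The method name `process_cos_match` suggests that matching is based on cosine similarity. The sketch below illustrates that general idea with plain word counts and the standard library only. It is an assumption-laden illustration, not the package's algorithm (which depends on scikit-learn and may weight terms differently), and all names in it are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_counts(text):
    """Count lower-cased alphanumeric word tokens in a text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors, from 0.0 to 1.0."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def pairwise_similarity(bank):
    """Score every pair of entries in an {id: text} bank as a percentage."""
    vectors = {uid: term_counts(text) for uid, text in bank.items()}
    return [
        (a, b, round(100 * cosine_similarity(vectors[a], vectors[b]), 2))
        for a, b in combinations(vectors, 2)
    ]


# Hypothetical text bank keyed by unique ID.
bank = {
    "T1": "Enter user name and password then click login",
    "T2": "Enter user name and password then click login",
    "T3": "Export the report as a csv file",
}
scores = pairwise_similarity(bank)
```

Identical texts score 100 and texts with no words in common score 0, so a range filter such as "60,100" would keep only the `T1`/`T2` pair here.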

### Output

- Output is written to the same folder as the input file, or to the `file_path` specified.
- If the unique ID column contains duplicate IDs, a file whose name contains 'duplicate id'
- A recommendation file with similarity values
- A merged file with the data from the columns of interest in the xlsx file
- A brief HTML report containing the top similarities (100 rows by default, configurable with the `--n` option)

## Contact

[MAINTAINERS.md](MAINTAINERS.md)  

## License

[License.md](LICENSE.md)
