https://github.com/aurumnpegasus/text-summariser

Summarises text for articles in Hindi and compares a custom TF-IDF algorithm with the baseline

# Text Summarizer for Hindi Wikipedia Articles

This is a project made by:
* Prajneya Kumar
* Shivansh S.
* Tejasvi Chebrolu

## How to Use

* Clone the repository
* Install all dependencies mentioned in ```requirements.txt```
* Choose the method you would like to use, then go to the corresponding section below

### Method I

This model generates a summary using a Document Term Matrix and frequency counts. To use it:

* Go to the ```method_1``` folder
* Place your article in the ```valid``` folder, named ```article.txt```
* Run the ```extractive.py``` file using python3

* The summary is written to ```summary.txt``` inside the ```valid``` folder
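
As a rough illustration of the idea behind Method I, the sketch below builds a small document-term matrix (one row of term counts per sentence) and ranks sentences by the average frequency of their terms. It is not the code in ```extractive.py```; the function names, the sentence splitter, and the default ratio of 0.3 are assumptions made for this example.

```python
import re
from collections import Counter


def split_sentences(text):
    # Assumption: sentences end with the Hindi danda "।" or with ?, ! or .
    parts = re.split(r"[।?!.]+", text)
    return [p.strip() for p in parts if p.strip()]


def summarise_by_frequency(text, ratio=0.3):
    sentences = split_sentences(text)
    # Document-term matrix: one row of term counts per sentence.
    rows = [Counter(sentence.split()) for sentence in sentences]
    # Column sums give the frequency of every term in the whole article.
    totals = Counter()
    for row in rows:
        totals.update(row)
    # Score a sentence by the mean article-wide frequency of its distinct terms.
    scores = [sum(totals[t] for t in row) / max(len(row), 1) for row in rows]
    # Keep roughly `ratio` of the sentences, and always at least one.
    keep = max(1, round(ratio * len(sentences)))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(top)]
```

Averaging over distinct terms keeps long sentences from winning purely because of their length; a real implementation would probably also remove stop words first.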

### Method II

This model generates a summary using a modified TF-IDF computed over the document dataset, with weights attached. To use it:

* Go to the ```method_2``` folder
* Place your article in the ```valid``` folder
* Run the code in the Jupyter notebook
* When prompted, enter the name of your article file in that directory

* The summary and a word cloud are written to the output folder :)
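
For readers unfamiliar with TF-IDF, here is a minimal sketch of sentence ranking with plain TF-IDF. It does not reproduce the modified weighting used in the notebook, which this README does not describe; the function names, the smoothed IDF formula, and the default ratio of 0.3 are assumptions made for this example.

```python
import math
import re
from collections import Counter

# Assumption: a token is any run of characters that is not whitespace or punctuation.
TOKEN = re.compile(r"[^\s।,.?!]+")


def tokenise(text):
    return TOKEN.findall(text)


def idf_table(documents):
    # Smoothed inverse document frequency over the given documents.
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenise(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def summarise_by_tfidf(article, documents, ratio=0.3):
    # Include the article itself so every one of its terms has an IDF value.
    idf = idf_table(list(documents) + [article])
    tokens = tokenise(article)
    tf = Counter(tokens)
    total = max(len(tokens), 1)
    sentences = [s.strip() for s in re.split(r"[।?!.]+", article) if s.strip()]
    scores = []
    for sentence in sentences:
        # Score a sentence by the mean TF-IDF weight of its tokens.
        weights = [tf[t] / total * idf[t] for t in tokenise(sentence)]
        scores.append(sum(weights) / max(len(weights), 1))
    keep = max(1, round(ratio * len(sentences)))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(top)]
```

Terms that appear in every document, such as common function words, receive the lowest IDF, so they contribute less to a sentence's score than terms specific to the article.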

### Calculating Accuracy

* Add the gold-standard summary as ```n.txt``` in the Gold folder inside the Summaries directory, where `n` is the next number in the sequence already in the Gold folder.
* For example, if there are 7 files in the Gold folder, they must be named ```1.txt```, ```2.txt```, ..., ```7.txt```.
* Repeat this process for the summaries generated by the extractive method and the rule-based method, storing them in the `Extractive` and `RuleBased` directories respectively.
* You can do this from the terminal using simple output redirection.
* In the `accuracy.py` file, on line 15, change the loop to ```for i in range(1, n+1):```, where `n` is the same number as above.
* For example, if your file was saved as `9.txt`, change the loop to ```for i in range(1, 10):```
* Run the code with `python accuracy.py`
* To get individual accuracies for each article, uncomment line 62 in the ```Rouge_1.py``` file.
* In that case, redirecting the output to a new file, as in `python accuracy.py > output.txt`, makes it easier to read.
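
The numbering convention above can be checked with a small helper. This is a hypothetical convenience function, not part of the repository; it only assumes that the folder contains files named ```1.txt```, ```2.txt``` and so on.

```python
from pathlib import Path


def next_summary_number(folder):
    # Collect the numeric stems of files such as 1.txt, 2.txt, ...
    numbers = [int(p.stem) for p in Path(folder).glob("*.txt") if p.stem.isdigit()]
    # The next file number is one more than the largest existing one.
    return max(numbers, default=0) + 1
```

If the helper returns 8 for the Gold folder, the new gold summary would be saved as ```8.txt``` and the loop in `accuracy.py` would become ```for i in range(1, 9):```.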

## Initial Results

* For _Method I_ we got an accuracy of 74.1%
* For _Method II_ we got an accuracy of 83.4%

## Methods of Evaluation

The evaluation is based on the ROUGE method proposed by Chin-Yew Lin. Since the summarization is extractive, only ROUGE-1 is used. The gold-standard summaries were annotated manually: for each article, the annotators were asked to pick the most important sentences. The only rule was that the number of sentences chosen had to equal **0.3N**, where **N** is the number of sentences in the original article.
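
To make the metric concrete, the sketch below computes ROUGE-1 as unigram overlap between a generated summary and a gold summary, along with the 0.3N sentence budget. It is an illustration and not the contents of ```Rouge_1.py```: whitespace tokenisation, rounding 0.3N to the nearest integer, and reporting precision, recall and F1 together are all assumptions, since this README does not say which of these the project uses.

```python
from collections import Counter


def rouge_1(candidate, reference):
    # Count unigrams in both summaries (assumption: whitespace tokenisation).
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    # Clipped overlap: each unigram counts at most as often as it occurs in both.
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def sentence_budget(n_sentences, ratio=0.3):
    # Number of sentences an annotator may pick (assumption: rounded to nearest).
    return round(ratio * n_sentences)
```

For example, `rouge_1("a b c d", "a b e")` finds two overlapping unigrams, giving a precision of 2/4 and a recall of 2/3.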

## Human Evaluators

We thank the following for creating the gold standard summaries:

* Abhinav Menon
* Trisha Kaore
* Yash Agrawal
* Eshika Khandelwal
* Vidushi Bhartari
* Shashwat Singh
* Shubhankar Kamthankar

## How to Contribute

* Fork this repository
* Clone the forked repository to your local system
* ```git remote add upstream https://github.com/AurumnPegasus/Text-Summariser.git```
* Install all required dependencies (mentioned in ```requirements.txt```)
* Commit and Send PRs :)