https://github.com/humsha/USCorpus
Urdu Summary Corpus and Software Tools Version 1.0
- Host: GitHub
- URL: https://github.com/humsha/USCorpus
- Owner: humsha
- License: mit
- Created: 2016-05-22T19:35:31.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2022-10-16T09:27:33.000Z (about 2 years ago)
- Last Synced: 2024-04-06T22:32:44.014Z (7 months ago)
- Homepage: https://dl.dropboxusercontent.com/u/48044196/USCTools.zip
- Size: 18.5 MB
- Stars: 13
- Watchers: 3
- Forks: 10
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
Awesome Lists containing this project
- awesome-urdu - UrduSummary Corpus Benchmark, 2016
README
# USCorpus
## Urdu Summary Corpus
The Urdu Summary Corpus consists of 50 articles collected from various sources.
Only the unformatted content text of the original HTML documents was kept; everything else was removed.
We provide abstractive summaries of these 50 articles.
After normalization, we applied several NLP tools to the articles to generate part-of-speech tagged,
morphologically analyzed, lemmatized and stemmed versions.
## Urdu Summary Corpus Tools
Normalization is taken from [2]; diacritic marks are also removed in this step.
The table-lookup based morphological analyzer and lemmatizer are built from [3].
The stemmer is built from [1].
The table-lookup based POS tagger is built from [4], using unigram and bigram counts.
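
To make the idea of a table-lookup tagger driven by unigram and bigram counts concrete, here is a minimal Python sketch. It is not the USCTools implementation, which is written in Java and whose internals are not documented here. The sketch assumes one plausible reading of "unigram and bigram counts": a table keyed by the word alone and a table keyed by the (previous word, word) pair, with the bigram table consulted first. The function names, the `<s>` start marker and the default tag are all hypothetical.

```python
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start marker


def train(tagged_sentences):
    """Build the two lookup tables from sentences of (word, tag) pairs."""
    unigram = defaultdict(Counter)  # word -> counts of tags seen for it
    bigram = defaultdict(Counter)   # (previous word, word) -> counts of tags
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            unigram[word][tag] += 1
            bigram[(prev, word)][tag] += 1
            prev = word
    return unigram, bigram


def tag(words, unigram, bigram, default="NN"):
    """Tag each word: bigram table first, then unigram table, then a default."""
    tags = []
    prev = START
    for word in words:
        if (prev, word) in bigram:
            best = bigram[(prev, word)].most_common(1)[0][0]
        elif word in unigram:
            best = unigram[word].most_common(1)[0][0]
        else:
            best = default  # assumption: unknown words get a fixed tag
        tags.append(best)
        prev = word
    return tags
```

Backing off from the more specific table to the less specific one keeps the tagger a pure lookup while still using some context; the actual back-off order and unknown-word handling in USCTools may differ.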
## Commands:
Unzip USCTools.zip
Open a console
Go to the USCTools directory by typing: cd USCTools
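
All tools in this README follow one invocation pattern: `java -cp bin USCTools <task> input.txt output.txt`. For processing a file with every tool in one go, a small Python wrapper could look like the sketch below. The wrapper itself, the function names and the output file naming are hypothetical; only the task names and the command shape come from this README. It also assumes that the tools accept absolute paths, which has not been verified.

```python
import subprocess
from pathlib import Path

# Task names exactly as they appear in the commands of this README.
TASKS = ["normalize", "lemmatize", "morph_analysis", "stemming", "tagging"]


def build_command(task, input_path, output_path):
    """Return the argument list for one USCTools invocation."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    return ["java", "-cp", "bin", "USCTools", task,
            str(input_path), str(output_path)]


def run_all(input_path, out_dir, tools_dir="USCTools"):
    """Run every task on one input file, writing <task>.txt into out_dir."""
    input_path = Path(input_path).resolve()
    out_dir = Path(out_dir).resolve()
    out_dir.mkdir(parents=True, exist_ok=True)
    for task in TASKS:
        output_path = out_dir / f"{task}.txt"
        # cwd is the unzipped USCTools directory so that "-cp bin" resolves.
        subprocess.run(build_command(task, input_path, output_path),
                       cwd=tools_dir, check=True)
```

Calling `run_all("article.txt", "out")` would then produce one output file per tool, provided Java is installed and `USCTools` is the unzipped directory.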
## For Normalization
$ java -cp bin USCTools normalize input.txt output.txt
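
As noted earlier, the normalization step also removes diacritic marks. The Python sketch below illustrates that single idea; it is not the USCTools code. It assumes that the marks to strip are the Arabic-script signs U+064B to U+0652 plus the superscript alef U+0670, and the tool's actual set of marks and its other normalization rules may differ.

```python
import unicodedata

# Assumption: diacritics are U+064B..U+0652 (fathatan through sukun)
# plus U+0670 (superscript alef). The real tool may use a different set.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def strip_diacritics(text):
    """Return text in NFC form with the diacritics listed above removed."""
    composed = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in composed if ch not in DIACRITICS)
```

For example, `strip_diacritics("\u0627\u064F\u0631\u062F\u064F\u0648")` drops the two pesh marks and returns the bare letters `"\u0627\u0631\u062F\u0648"`.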
## For Lemmatization
$ java -cp bin USCTools lemmatize input.txt output.txt
## For Morphological analysis
$ java -cp bin USCTools morph_analysis input.txt output.txt
## For stemming by Assas-Band
$ java -cp bin USCTools stemming input.txt output.txt
## For POS tagging
$ java -cp bin USCTools tagging input.txt output.txt
## Contributors:
Muhammad Humayoun, [email protected]
Muhammad Uzair, [email protected]
Saba Aslam, [email protected]
Omer Farzand, [email protected]
Rao Muhammad Adeel Nawab, [email protected]
## Maintainer
Muhammad Humayoun (PhD)
Assistant Professor
Computer Information Sciences Division
Abu Dhabi Men’s Campus, Higher Colleges of Technology
Abu Dhabi, United Arab Emirates
## Related Publications:
Muhammad Humayoun, Rao Muhammad Adeel Nawab, Muhammad Uzair, Saba Aslam, Omer Farzand (2016). Urdu Summary Corpus. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA). ISBN: 978-2-9517408-9-1. http://www.lrec-conf.org/proceedings/lrec2016/index.html

Muhammad Humayoun and Hwanjo Yu (2016). Analyzing Pre-processing Settings for Urdu Single-document Extractive Summarization. In Nicoletta Calzolari, et al., editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA). ISBN: 978-2-9517408-9-1.
## References:
[1] Q.-u.-A. Akram, A. Naseer, and S. Hussain. Assas-band, an Affix-Exception-List Based Urdu Stemmer. In Proceedings of the 7th Workshop on Asian Language Resources (ALR7), pages 40–47. Association for Computational Linguistics, 2009.

[2] A. Gulzar. Urdu normalization utility v1.0. Technical report, Center for Language Engineering, Al-Khawarizmi Institute of Computer Science (KICS), University of Engineering and Technology, Lahore, Pakistan, 2007. http://www.cle.org.pk/software/langproc/urdunormalization.htm

[3] M. Humayoun, H. Hammarström, and A. Ranta. Urdu morphology, orthography and lexicon extraction. CAASL-2: The Second Workshop on Computational Approaches to Arabic Script-based Languages, LSA Linguistic Institute, Stanford University, California, USA, pages 21–22, 2007. http://www.lama.univ-savoie.fr/~humayoun/UrduMorph/

[4] B. Jawaid, A. Kamran, and O. Bojar. A tagged corpus and a tagger for Urdu. In N. Calzolari (Conference Chair), K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA). https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-65A9-5