https://github.com/oroszgy/phd-dissertation
Hybrid algorithms for preprocessing agglutinative languages and less-resourced domains
- Host: GitHub
- URL: https://github.com/oroszgy/phd-dissertation
- Owner: oroszgy
- License: mit
- Created: 2014-07-15T06:46:17.000Z (about 11 years ago)
- Default Branch: master
- Last Pushed: 2017-02-24T13:42:30.000Z (over 8 years ago)
- Last Synced: 2025-02-10T08:13:11.715Z (8 months ago)
- Topics: hungarian, nlp, phd-dissertation, phd-thesis
- Language: TeX
- Homepage: http://gyorgy.orosz.link/
- Size: 2.78 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Hybrid algorithms for preprocessing agglutinative languages and less-resourced domains effectively
This thesis examines text processing methods suitable for less-resourced and
agglutinative languages and presents accurate preprocessing algorithms for them.

The first part of this study describes morphological tagging algorithms which can
compute both the morpho-syntactic tags and lemmata of words accurately. A tool (called
PurePos) was developed that was shown to produce precise annotations for Hungarian
texts and also to serve as a good base for rule-based domain adaptation scenarios.
In addition, we present a methodology for combining tagger systems that raises the
overall accuracy of Hungarian annotation systems.

Next, an application of the presented tagger is described that aims to produce
morphological annotation for speech transcripts; with it, the first morphological
disambiguation tool for spoken Hungarian is introduced. Following this, a method is
described which uses the adapted PurePos system to estimate the morpho-syntactic
complexity of Hungarian speech transcripts automatically.

The third part of the study deals with the preprocessing of electronic health records.
On the one hand, a hybrid algorithm is presented for segmenting clinical texts into words
and sentences accurately. On the other hand, domain-specific enhancements of PurePos
are described, showing that the resulting tagger has satisfactory performance on noisy
medical records.

Finally, the main results of this study are summarized by presenting the author’s
theses. Applications of the presented methods that target less-resourced languages
are then listed.

*Continue reading [here](https://github.com/oroszgy/phd-dissertation/releases/download/Final/thesis.pdf).*
---
The thesis is built on [this template](https://github.com/kks32/phd-thesis-template).