https://github.com/oroszgy/phd-dissertation

Hybrid algorithms for preprocessing agglutinative languages and less-resourced domains
hungarian nlp phd-dissertation phd-thesis

Host: GitHub
URL: https://github.com/oroszgy/phd-dissertation
Owner: oroszgy
License: mit
Created: 2014-07-15T06:46:17.000Z (about 11 years ago)
Default Branch: master
Last Pushed: 2017-02-24T13:42:30.000Z (over 8 years ago)
Last Synced: 2025-02-10T08:13:11.715Z (8 months ago)
Topics: hungarian, nlp, phd-dissertation, phd-thesis
Language: TeX
Homepage: http://gyorgy.orosz.link/
Size: 2.78 MB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Hybrid algorithms for preprocessing agglutinative languages and less-resourced domains effectively

This thesis examines text processing methods suitable for less-resourced and agglutinative languages and presents accurate preprocessing algorithms.

The first part of this study describes morphological tagging algorithms that accurately compute both the morpho-syntactic tags and the lemmata of words. A tool called PurePos was developed; it was shown to produce precise annotations for Hungarian texts and to serve as a good base for rule-based domain adaptation. In addition, we present a methodology for combining taggers that raises the overall accuracy of Hungarian annotation systems.

Next, an application of the presented tagger is described that produces morphological annotation for speech transcripts, thereby introducing the first morphological disambiguation tool for spoken Hungarian. Following this, a method is described that uses the adapted PurePos system to estimate the morpho-syntactic complexity of Hungarian speech transcripts automatically.

The third part of the study deals with the preprocessing of electronic health records. First, a hybrid algorithm is presented for accurately segmenting clinical texts into words and sentences. Second, domain-specific enhancements of PurePos are described, showing that the resulting tagger performs satisfactorily on noisy medical records.

Finally, the main results of this study are summarized by presenting the author’s theses. Applications of the presented methods that target less-resourced languages are also listed.

*Continue reading [here](https://github.com/oroszgy/phd-dissertation/releases/download/Final/thesis.pdf).*

--- 

The thesis is typeset using [this template](https://github.com/kks32/phd-thesis-template).
