Statistical Methods and Algorithms of Text Processing (based on Romanian texts) / June / 2007 / Theses / CNAA


Statistical Methods and Algorithms of Text Processing (based on Romanian texts)

Author:	Victoria Bobicev
Degree:	doctor of informatics
Speciality:	01.05.04 - Mathematical modelling, mathematical methods, software
Year:	2007
Scientific adviser:	Anatol Popescu, doctor habilitat, professor, Technical University of Moldova
Institution:
Scientific council:

Status

The thesis was presented on 8 June 2007
Approved by NCAA on 20 September 2007

Abstract

0.32 Mb / in Romanian

Keywords

Natural Language Processing (NLP), text processing, statistical models of text, text element probabilities, smoothing methods, frequency dictionary, Zipf’s law, Heaps’ law, word distribution, corpus similarity, statistical compression methods, Prediction by Partial Matching (PPM), diacritic restoration algorithm, text classification, morpho-syntactic annotation, morpho-syntactic disambiguation algorithm

Summary

The thesis analyses, studies and develops statistical methods for text processing. The research covers three types of text elements (letters, words and morpho-syntactic tags) as well as sequences of these elements in text. Both the characteristics that make statistical methods applicable and the features that create difficulties for them have been investigated.

The resources required for the study have been prepared: a morphological dictionary, four corpora of Romanian texts, a morphologically annotated corpus, and a number of scripts for the experiments.

The first part of the thesis examines the theoretical laws of text. The constants of Zipf’s and Heaps’ laws have been calculated for sequences of all three types of elements (letters, words and morpho-syntactic tags).
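
As an illustration of how such constants can be estimated, the sketch below fits Zipf's law f(r) ~ C / r^s and Heaps' law V(n) ~ K * n^beta by ordinary least squares in log-log space. This is only a minimal sketch under stated assumptions: the function names (fit_line, zipf_exponent, heaps_constants) are hypothetical, the fitting procedure is a common textbook choice, and nothing here is taken from the thesis itself.

```python
import math
from collections import Counter

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def zipf_exponent(tokens):
    """Estimate s in f(r) ~ C / r**s from the rank-frequency list."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    _, slope = fit_line(xs, ys)
    return -slope

def heaps_constants(tokens):
    """Estimate (K, beta) in V(n) ~ K * n**beta from vocabulary growth."""
    seen = set()
    xs, ys = [], []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        xs.append(math.log(n))            # log of text length so far
        ys.append(math.log(len(seen)))    # log of vocabulary size so far
    a, beta = fit_line(xs, ys)
    return math.exp(a), beta
```

The tokens may be letters, words or morpho-syntactic tags; the functions only require a sequence of hashable items with at least two distinct elements.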

The second part describes several aspects of the effectiveness of statistical text models. To improve the performance of statistical methods, ways of estimating probabilities for rare text elements have been reviewed. Distribution laws for high-frequency elements in Romanian texts have been determined. Corpus similarity and corpus homogeneity have been evaluated.
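
One simple way to estimate probabilities for rare or unseen elements is additive (Lidstone) smoothing. The sketch below is only an example of the general idea. The summary does not say which smoothing methods were reviewed, and the function name lidstone_probs and the parameter k are assumptions made for illustration.

```python
from collections import Counter

def lidstone_probs(tokens, vocabulary, k=1.0):
    """Additive smoothing: P(w) = (count(w) + k) / (N + k * |V|).

    Assumes every token belongs to `vocabulary`; elements that never
    occur in `tokens` still receive a small non-zero probability.
    """
    counts = Counter(tokens)
    total = len(tokens) + k * len(vocabulary)
    return {w: (counts[w] + k) / total for w in vocabulary}
```

With k = 1 this is Laplace (add-one) smoothing; smaller values of k move less probability mass to unseen elements.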

The last part of the work presents several text processing algorithms that use a statistical compression method, Prediction by Partial Matching (PPM). PPM is considered the best compression method because it builds the optimal statistical model of a text. Three variants of the PPM method have been applied, based on letters, words and morpho-syntactic tags. The letter-based variant has been applied to diacritics restoration. The document classification task has been solved with the word-based variant. The use of the tag-based variant for morpho-syntactic disambiguation has been tested.
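
The idea of using a PPM-style model for classification can be sketched as follows: train one model per class and assign a document to the class whose model encodes it in the fewest bits per symbol. The code is a deliberately simplified illustration, not the thesis implementation. It assumes method C escape estimates, omits symbol exclusions (so the probabilities are not exactly normalised), and uses hypothetical names (SimplePPM, classify) and an assumed alphabet_size for the final uniform fallback.

```python
import math
from collections import defaultdict, Counter

class SimplePPM:
    """Simplified PPM-style model (method C escapes, no exclusions)."""

    def __init__(self, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size  # assumed size for the uniform fallback
        # contexts[k][context_tuple] -> Counter of symbols seen after that context
        self.contexts = [defaultdict(Counter) for _ in range(order + 1)]

    def train(self, symbols):
        symbols = list(symbols)
        for i, sym in enumerate(symbols):
            for k in range(self.order + 1):
                if i >= k:
                    ctx = tuple(symbols[i - k:i])
                    self.contexts[k][ctx][sym] += 1

    def prob(self, history, sym):
        """Probability of sym after history, backing off via escapes."""
        p = 1.0
        for k in range(min(self.order, len(history)), -1, -1):
            ctx = tuple(history[len(history) - k:])
            counts = self.contexts[k].get(ctx)
            if not counts:
                continue  # unseen context: move to a shorter one at no cost
            total = sum(counts.values())
            distinct = len(counts)
            if sym in counts:
                return p * counts[sym] / (total + distinct)
            p *= distinct / (total + distinct)  # escape probability
        return p / self.alphabet_size  # fallback: uniform over the alphabet

    def cross_entropy(self, symbols):
        """Average number of bits per symbol needed to encode `symbols`."""
        symbols = list(symbols)
        bits = 0.0
        for i, sym in enumerate(symbols):
            history = symbols[max(0, i - self.order):i]
            bits -= math.log2(self.prob(history, sym))
        return bits / len(symbols)

def classify(document, models):
    """Return the label whose model gives the lowest cross-entropy."""
    return min(models, key=lambda label: models[label].cross_entropy(document))
```

The symbols may be characters of a string, a list of words or a list of tags, which mirrors the three variants described above. The document passed to classify must be non-empty.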

The results of the experiments confirm the effectiveness of the PPM method in text processing.

Official Reviewers

Svetlana Cojocaru
doctor habilitat, professor, Institute of Mathematics and Computer Science
Nicolae Objelean
doctor, associate professor (docent), Moldova State University

Theses

13 theses have been written in this specialty, including 1 thesis for the degree of doctor habilitat.
