
Statistical Methods and Algorithms of Text Processing (based on Romanian texts)


Author: Victoria Bobicev
Degree: doctor of informatics
Speciality: 01.05.04 - Mathematical modelling, mathematical methods, software
Year: 2007
Scientific adviser: Anatol Popescu
doctor habilitat, professor, Technical University of Moldova
Institution:
Scientific council:

Status

The thesis was presented on 8 June 2007
Approved by NCAA on 20 September 2007

Abstract

Adobe PDF document, 0.32 MB / in Romanian

Keywords

Natural Language Processing (NLP), text processing, statistical models of text, text element probabilities, smoothing methods, frequency dictionary, Zipf’s law, Heaps’ law, word distribution, corpus similarity, statistical compression methods, Prediction by Partial Matching (PPM), diacritic restoration algorithm, text classification, morpho-syntactic annotation, morpho-syntactic disambiguation algorithm

Summary

The thesis analyses, studies and develops statistical methods for text processing. The research covers three types of text elements: letters, words and morpho-syntactic tags, as well as sequences of these elements in text. Both the characteristics that make statistical methods applicable and the features that make them difficult to apply have been investigated.

The resources needed for the study have been prepared: a morphological dictionary, four corpora of Romanian texts, a morphologically annotated corpus, and a number of scripts for the experiments.

The first part of the thesis examines the theoretical laws of text. The constants of Zipf’s and Heaps’ laws have been calculated for sequences of all three types of elements (letters, words and morpho-syntactic tags).
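As an illustration of how such constants can be estimated, the sketch below fits Zipf’s law, f(r) ≈ C / r^s, and Heaps’ law, V(n) ≈ K · n^β, by ordinary least squares in log-log space. This is an assumed, common fitting procedure and not necessarily the one used in the thesis; the function names `fit_line`, `zipf_constants` and `heaps_constants` are hypothetical.

```python
import math
from collections import Counter


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b).

    Needs at least two distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def zipf_constants(tokens):
    """Fit f(r) ~ C / r**s to the rank-frequency list; returns (C, s)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    intercept, slope = fit_line(xs, ys)
    return math.exp(intercept), -slope


def heaps_constants(tokens):
    """Fit V(n) ~ K * n**beta to vocabulary growth; returns (K, beta)."""
    seen = set()
    xs, ys = [], []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        xs.append(math.log(n))        # text length so far
        ys.append(math.log(len(seen)))  # vocabulary size so far
    intercept, slope = fit_line(xs, ys)
    return math.exp(intercept), slope
```

The `tokens` argument can be any sequence of hashable elements, so the same functions apply to letters, words or morpho-syntactic tags, for example `zipf_constants(list(text))` for letters or `zipf_constants(text.split())` for whitespace-separated words.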

The second part describes several aspects of the effectiveness of statistical text models. To improve the performance of statistical methods, ways of estimating probabilities for rare text elements have been reviewed. Distribution laws for high-frequency elements in Romanian texts have been determined. Corpus similarity and corpus homogeneity have been evaluated.
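To make the problem of rare elements concrete, the sketch below shows additive (Laplace) smoothing, one widely used way of giving non-zero probability to elements that never occur in the training text. The summary does not say which smoothing methods the thesis reviewed, so this is only an assumed example; the function name `additive_smoothing` and the parameter `alpha` are hypothetical.

```python
from collections import Counter


def additive_smoothing(tokens, vocabulary, alpha=1.0):
    """Return smoothed probabilities for every element of `vocabulary`.

    Each count is increased by `alpha`, so unseen elements get a small
    non-zero probability. Tokens outside `vocabulary` are ignored.
    """
    vocab = list(vocabulary)
    counts = Counter(t for t in tokens if t in set(vocab))
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}
```

With `alpha=1.0` this is classic add-one smoothing; smaller values of `alpha` move less probability mass from observed elements to unseen ones.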

The last part of the work presents several text processing algorithms that use a statistical compression method, Prediction by Partial Matching (PPM). This method is considered among the best compression methods, as it builds a very accurate statistical model of text. Three variants of the PPM method have been applied, based on letters, words and morpho-syntactic tags. The letter-based PPM method has been applied to diacritics restoration. The document classification task has been solved with the word-based method. The use of the PPM method based on morpho-syntactic tags for morpho-syntactic disambiguation has been tested.
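The idea behind compression-based classification can be sketched as follows: one model is trained per class, and a document is assigned to the class whose model encodes it in the fewest bits. The code below is a simplified, character-based PPM-style model with method C escape probabilities, without symbol exclusion and without an actual arithmetic coder. It is an assumption-laden illustration, not the thesis implementation (which used a word-based variant for classification); the names `SimplePPM` and `classify` and the default `alphabet_size` are hypothetical.

```python
import math
from collections import Counter, defaultdict


class SimplePPM:
    """Character model with PPM-style escape (method C), no exclusions."""

    def __init__(self, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size  # assumed size for the uniform fallback
        # counts[k][context] -> Counter of symbols seen after that context
        self.counts = [defaultdict(Counter) for _ in range(order + 1)]

    def train(self, text):
        for i, ch in enumerate(text):
            for k in range(self.order + 1):
                if i >= k:
                    self.counts[k][text[i - k:i]][ch] += 1

    def prob(self, context, ch):
        """Probability of `ch` after `context`, backing off to shorter contexts."""
        p = 1.0
        for k in range(min(self.order, len(context)), -1, -1):
            ctx = context[len(context) - k:]
            seen = self.counts[k].get(ctx)
            if not seen:
                continue  # context never observed: back off at no cost
            total = sum(seen.values())
            distinct = len(seen)
            if ch in seen:
                return p * seen[ch] / (total + distinct)
            p *= distinct / (total + distinct)  # escape to a shorter context
        return p / self.alphabet_size  # order -1: uniform over the alphabet

    def bits(self, text):
        """Estimated code length of `text` in bits under this model."""
        return -sum(
            math.log2(self.prob(text[max(0, i - self.order):i], ch))
            for i, ch in enumerate(text)
        )


def classify(text, models):
    """Return the label of the model that encodes `text` most compactly."""
    return min(models, key=lambda label: models[label].bits(text))
```

A model is built per class with `SimplePPM(order=2)` followed by `train(...)` on that class’s texts, and `classify(document, {"label_a": model_a, "label_b": model_b})` returns the best-matching label. The same scoring could in principle rank candidate diacritic variants of a word, although that use is not shown here.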

The results of the experiments confirm the effectiveness of the PPM method in text processing.