Status: The thesis was presented on 8 June 2007
Approved by NCAA on 20 September 2007
Abstract (0.32 MB, in Romanian)
The thesis covers the analysis, study and development of statistical methods for text processing. The research addresses three types of text elements: letters, words and morpho-syntactic tags, as well as sequences of these elements in text. It investigates both the characteristics that make statistical methods applicable and the features that create difficulties for them.
The resources required for the study have been prepared: a morphological dictionary, four corpora of Romanian texts, a morphologically annotated corpus, and a number of scripts for the experiments.
The first part of the thesis examines the theoretical laws of text. The constants of Zipf’s law and Heaps’ law have been calculated for sequences of all three types of elements (letters, words and morpho-syntactic tags).
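As an illustration of how such constants can be estimated, the sketch below fits Zipf's law, f(r) ≈ C / r^s, and Heaps' law, V(n) ≈ K · n^β, by ordinary least squares on a log-log scale. This is a minimal sketch under stated assumptions, not the procedure used in the thesis: the function names (`fit_line`, `zipf_constants`, `heaps_constants`) are hypothetical, and the input is assumed to be an already tokenized sequence of letters, words or tags.

```python
import math
from collections import Counter


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def zipf_constants(tokens):
    """Fit f(r) ~ C / r**s to the rank-frequency list; returns (C, s).

    Needs at least two distinct element types.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    a, b = fit_line(xs, ys)
    return math.exp(a), -b


def heaps_constants(tokens):
    """Fit V(n) ~ K * n**beta to vocabulary growth; returns (K, beta).

    Needs at least two tokens.
    """
    seen = set()
    xs, ys = [], []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        xs.append(math.log(n))           # text length so far
        ys.append(math.log(len(seen)))   # vocabulary size so far
    a, b = fit_line(xs, ys)
    return math.exp(a), b
```

The same functions apply to any of the three element types, since they only rely on the tokens being hashable. A least-squares fit over all ranks is the simplest choice; it gives the sparse low-frequency tail as much weight as the head of the distribution, so other fitting procedures may yield different constants.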
The second part describes several aspects of the effectiveness of statistical text models. To improve the performance of statistical methods, means of estimating probabilities for rare elements in text have been reviewed. Distribution laws for high-frequency elements in Romanian texts have been determined. Corpus similarity and corpus homogeneity have been evaluated.
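The abstract does not say which estimators for rare elements were reviewed. As one common example of the idea, the sketch below uses additive (Laplace) smoothing, which reserves some probability mass for elements that are rare or unseen in the training text. The function name `additive_smoothing` and the parameters `vocab_size` and `alpha` are assumptions made for this illustration.

```python
from collections import Counter


def additive_smoothing(tokens, vocab_size, alpha=1.0):
    """Return a function estimating P(x) with additive smoothing.

    vocab_size is the assumed number of possible element types,
    including types that never occur in `tokens`.
    """
    counts = Counter(tokens)
    denom = len(tokens) + alpha * vocab_size

    def prob(x):
        # Unseen elements get alpha / denom instead of zero.
        return (counts.get(x, 0) + alpha) / denom

    return prob
```

For example, with tokens `["a", "a", "b"]`, `vocab_size=4` and `alpha=1`, the estimate for `"a"` is 3/7 and an unseen element receives 1/7, so the estimates over the four assumed types sum to one.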
The last part of the work presents several text processing algorithms based on a statistical compression method, Prediction by Partial Matching (PPM). This method is considered one of the best for compression because it builds an accurate statistical model of the text. Three variants of the PPM method have been applied, based on letters, words and morpho-syntactic tags. The letter-based PPM method has been applied to diacritics restoration. The document classification task has been addressed with the word-based method. The use of the PPM method based on morpho-syntactic tags has been tested for morpho-syntactic disambiguation.
The results of the experiments confirm the effectiveness of the PPM method in text processing.
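To make the use of a compression model for classification concrete, the sketch below trains one simplified PPM-style model per class and assigns a document to the class whose model encodes it in the fewest bits per symbol. It is a hedged illustration, not the implementation from the thesis: the class `SimplePPM` and the function `classify` are hypothetical names, the model works on characters rather than words, it uses escape method C, and it omits the exclusion mechanism of full PPM, so its probabilities sum to less than one.

```python
import math
from collections import Counter, defaultdict


class SimplePPM:
    """Simplified PPM-style model (escape method C, no exclusions)."""

    def __init__(self, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size  # assumed size for the fallback
        # counts[k][context] counts symbols seen after a context of length k
        self.counts = [defaultdict(Counter) for _ in range(order + 1)]

    def train(self, text):
        for i, symbol in enumerate(text):
            for k in range(self.order + 1):
                if i - k < 0:
                    break
                self.counts[k][text[i - k:i]][symbol] += 1

    def prob(self, history, symbol):
        """Probability of `symbol`, backing off from long to short contexts."""
        p = 1.0
        for k in range(min(self.order, len(history)), -1, -1):
            context = history[len(history) - k:]
            seen = self.counts[k].get(context)
            if not seen:
                continue  # context never observed: back off at no cost
            total = sum(seen.values())
            distinct = len(seen)
            if symbol in seen:
                return p * seen[symbol] / (total + distinct)
            p *= distinct / (total + distinct)  # escape probability
        return p / self.alphabet_size  # uniform fallback for novel symbols

    def cross_entropy(self, text):
        """Average number of bits per symbol needed to encode `text`."""
        bits = 0.0
        for i, symbol in enumerate(text):
            history = text[max(0, i - self.order):i]
            bits -= math.log2(self.prob(history, symbol))
        return bits / len(text)


def classify(document, models):
    """Return the label whose model gives the lowest cross-entropy."""
    return min(models, key=lambda label: models[label].cross_entropy(document))
```

A possible usage, with toy training strings standing in for real class corpora:

```python
models = {"ro": SimplePPM(), "en": SimplePPM()}
models["ro"].train("ana are mere si pere " * 5)
models["en"].train("the cat sat on the mat " * 5)
label = classify("ana are pere", models)
```

The same scoring idea carries over to word or tag sequences by passing tuples of tokens instead of strings, since slicing and hashing work the same way for tuples.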