Automation of the process of computational linguistic resources creation / December / 2011 / Theses / CNAA

Automation of the process of computational linguistic resources creation

Author:	Petic Mircea
Degree:	doctor of informatics
Speciality:	01.05.01 - Theoretical foundation of computer science; programming
Year:	2012
Scientific adviser:	Svetlana Cojocaru doctor habilitat, professor, Institute of Mathematics and Computer Science
Scientific consultant:	Elena Boian doctor, associate professor (docent), Institute of Mathematics and Computer Science
Institution:	Institute of Mathematics and Computer Science
Scientific council:	D 01-01.05.01-25.12.03 Institute of Mathematics and Computer Science

Status

The thesis was presented on 22 December 2011
Approved by NCAA on 16 February 2012

Abstract

0.35 MB / in Romanian

Keywords

computational linguistic resources, derivational algorithm, affix, prefix, suffix, word segmentation, vocalic/consonantal alternations, automatic derivative generation, generative derivational mechanisms

Summary

The thesis was written at the Institute of Mathematics and Computer Science of the Academy of Sciences of Moldova, Chisinau, in 2011. It is written in Romanian and contains an introduction, three chapters, general conclusions and recommendations, a bibliography of 200 titles, 14 appendices, 133 pages of main text, 15 figures, and 44 tables. The results have been published in 27 scientific papers.

The thesis addresses a topical research area: automating the creation of computational linguistic resources, specifically by automatically generating derived words that are absent from those resources.

The purpose is to study the mechanisms of derivation and to develop algorithms that automatically generate derived words in order to complete these resources.

The research objectives are: to evaluate existing methods for automating the derivational process; to study the structural particularities of the computational linguistic resources available for research; to establish the quantitative and qualitative characteristics of derived words; to develop algorithms for the automatic recognition of derived words; and to establish the mechanisms and develop the algorithms for the automatic generation of derived words.

Novelty and scientific originality. This work contributes to research in natural language processing by developing mathematical models and algorithms for the automatic generation of derivatives. The results constitute a new methodology for studying problems in computational derivational morphology, based on the algorithmization of linguistic mechanisms such as affix substitution, derivative projection, derivational constraints, and formal derivational rules.
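As an illustration of one mechanism named above, affix substitution, the following is a minimal sketch and not the thesis's algorithm. It assumes the simplest possible formulation: a word that carries one prefix is turned into a candidate derivative by swapping in another prefix, and candidates already present in the lexicon are discarded. The function names, the prefix pair în-/des-, and the sample words are illustrative assumptions; the derivational constraints and vocalic/consonantal alternations treated in the thesis are not modelled.

```python
def substitute_prefix(word, old_prefix, new_prefix):
    """Return word with old_prefix swapped for new_prefix, or None if it does not apply."""
    # The word must start with the prefix and keep a non-empty remainder.
    if not word.startswith(old_prefix) or len(word) <= len(old_prefix):
        return None
    return new_prefix + word[len(old_prefix):]


def generate_candidates(words, old_prefix, new_prefix, known_lexicon):
    """Propose derivatives that are absent from the known lexicon (hypothetical helper)."""
    candidates = []
    for word in words:
        candidate = substitute_prefix(word, old_prefix, new_prefix)
        if candidate is not None and candidate not in known_lexicon:
            candidates.append(candidate)
    return candidates


# Illustrative data only: a tiny word list and a lexicon that already holds one derivative.
sample_words = ["încărca", "închide", "casă"]
lexicon = {"deschide"}
new_words = generate_candidates(sample_words, "în", "des", lexicon)
```

A realistic generator would also have to validate each candidate, since plain string substitution overgenerates.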

Theoretical significance and applied value of the thesis. A statistical method based on the notion of entropy was proposed for evaluating the uncertainty of Romanian affixes. Formal mathematical descriptions of the word formation mechanisms for derivatives were developed and served as the basis for algorithms that generate derivatives automatically. These algorithms can facilitate the completion of computational linguistic resources and can serve as tools for further research in natural language processing. The research results are of interest for lexicographic practice, both in dictionary compilation and in the lexicographic treatment of derivatives. The results can also serve as methodological support for specialists in both computer science and linguistics.
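The summary does not state how entropy is applied to affixes, so the sketch below only shows the general idea under an assumption: for each candidate affix string, a set of labelled observations is collected, and the Shannon entropy of the label distribution measures how uncertain that affix is. The function name, the labels "prefix" and "root", and the toy counts are hypothetical.

```python
import math
from collections import Counter


def affix_entropy(outcomes):
    """Shannon entropy, in bits, of the empirical distribution of the outcomes."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    # H = sum over outcomes of p * log2(1 / p), with p = count / total.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Toy data (assumed): for words starting with each string, whether the string
# acts as a true prefix or is merely part of the root.
observations = {
    "re": ["prefix", "prefix", "root", "root"],
    "des": ["prefix", "prefix", "prefix", "prefix"],
}
entropy_by_affix = {affix: affix_entropy(obs) for affix, obs in observations.items()}
```

Under this reading, an entropy of zero means the string always behaves the same way, while higher values mark affixes whose status is harder to decide automatically.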

Implementation of scientific results. An extension of the RRTLN database was developed, which allowed the correct extraction of about 15,000 derivatives without a special program for segmenting words into morphemes (41 prefixes, about 420 suffixes, and over 8 thousand roots/stems). The established mechanisms, on which the algorithms and corresponding programs are based, led to the generation of a significant number of derivatives with different affixes: 8839 with 11 prefixes and 2352 with 24 suffixes. These will contribute substantially to enriching the computational linguistic resources for the Romanian language.

Official Reviewers

Anatol Popescu
doctor habilitat, professor, Technical University of Moldova
Adrian Iftene
doctor of informatics, UAIC Iaşi, Romania

Council Members

Constantin Gaindric, president
doctor habilitat, professor, Institute of Mathematics and Computer Science
Constantin Ciubotaru, secretary
doctor, associate professor (docent), Institute of Mathematics and Computer Science
Victoria Bobicev, member
doctor, associate professor (docent), Technical University of Moldova
Alexandru Colesnicov, member
doctor
Iurie Rogojin, member
doctor habilitat, Institute of Mathematics and Computer Science
Ilie Costaş, member
doctor habilitat, professor, Academy of Economic Studies of Moldova

Theses

Three theses have been written in this specialty.