

Automation of the process of computational linguistic resources creation


Author: Petic Mircea
Degree: doctor of informatics
Speciality: 01.05.01 - Theoretical foundation of computer science; programming
Year: 2012
Scientific adviser: Svetlana Cojocaru
doctor habilitat, professor, Institute of Mathematics and Computer Science
Scientific consultant: Elena Boian
doctor, associate professor (docent), Institute of Mathematics and Computer Science
Institution: Institute of Mathematics and Computer Science
Scientific council: D 01-01.05.01-25.12.03
Institute of Mathematics and Computer Science

Status

The thesis was presented on 22 December 2011
Approved by NCAA on 16 February 2012

Abstract

Adobe PDF document, 0.35 MB, in Romanian

Keywords

computational linguistic resources, derivational algorithm, affix, prefix, suffix, word segmentation, vocalic/consonantal alternations, automatic derivative generation, generative derivational mechanisms

Summary

The thesis was completed at the Institute of Mathematics and Computer Science of the Academy of Sciences of Moldova, Chisinau, in 2011. It is written in Romanian and contains an introduction, three chapters, general conclusions and recommendations, a bibliography of 200 titles, 14 appendices, 133 pages of main text, 15 figures, and 44 tables. The results are published in 27 scientific papers.

The thesis addresses a topical research area: automating the creation of computational linguistic resources, specifically by automatically generating derived words that are absent from those resources.

The purpose is to study the mechanisms of derivation and to develop algorithms for the automatic generation of derived words, so that these resources can be completed.

The research objectives are: to evaluate the existing methods for automating the derivational process; to study the structural particularities of the computational linguistic resources available for research; to establish the quantitative and qualitative characteristics of derived words; to develop algorithms for the automatic recognition of derived words; and to establish the mechanisms and develop the algorithms for the automatic generation of derived words.

Novelty and scientific originality. This work contributes to research in natural language processing by developing mathematical models and algorithms for the automatic generation of derivatives. The results constitute a new methodology for studying problems in computational derivational morphology, based on the algorithmization of certain linguistic mechanisms, such as affix substitution, derivative projection, derivational constraints, and formal derivational rules.
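As an illustration of what algorithmizing affix substitution under a derivational constraint can look like, here is a minimal Python sketch. It is not the thesis's algorithm: the function names `substitute_suffix` and `min_stem_length`, the minimum-stem-length constraint, and the choice of the suffix pair -ism/-ist are all assumptions made for this example.

```python
from typing import Callable, Optional


def substitute_suffix(
    word: str,
    old: str,
    new: str,
    constraint: Optional[Callable[[str], bool]] = None,
) -> Optional[str]:
    """Replace suffix `old` with `new`, or return None if the rule does not apply.

    Hypothetical helper: the rule applies only when the word ends with `old`,
    the remaining stem is non-empty, and the stem passes the optional constraint.
    """
    if not word.endswith(old):
        return None
    stem = word[: len(word) - len(old)]
    if not stem:
        return None
    if constraint is not None and not constraint(stem):
        return None
    return stem + new


def min_stem_length(n: int) -> Callable[[str], bool]:
    """Toy derivational constraint (an assumption): the stem has at least n letters."""
    return lambda stem: len(stem) >= n


# Illustrative word list, not data from the thesis.
words = ["realism", "comunism", "masă"]
candidates = [substitute_suffix(w, "ism", "ist", min_stem_length(3)) for w in words]
print(candidates)  # ['realist', 'comunist', None]
```

A purely formal rule like this overgenerates, so in practice each candidate would still need to be checked against a lexicon or by a linguist before being added to a resource.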

Theoretical significance and applied value of the thesis. A statistical method based on the notion of entropy was proposed for evaluating the uncertainty of Romanian affixes. Formal mathematical descriptions of the word-formation mechanisms for derivatives were developed and served as the basis for algorithms that generate derivatives automatically. These algorithms can facilitate the completion of computational linguistic resources and can serve as tools for further research in natural language processing. The research results are of interest for lexicographic practice, in particular for dictionary compilation and the lexicographic treatment of derivatives. They can also serve as methodological support for specialists in both computer science and linguistics.
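The summary does not say which distribution the entropy measure is computed over. The sketch below is therefore only a generic illustration of Shannon entropy, H = -sum(p_i * log2(p_i)), applied to outcome counts for a single affix. The function name `shannon_entropy`, the outcome labels, and all counts are hypothetical and are not taken from the thesis.

```python
import math
from typing import Mapping


def shannon_entropy(counts: Mapping[str, int]) -> float:
    """Shannon entropy in bits of the distribution given by outcome counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        if count > 0:  # outcomes with zero count contribute nothing
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical counts: how often a word-initial string turns out to be
# a real prefix versus an accidental match. Made-up numbers for illustration.
ambiguous_affix = {"true_affix": 50, "accidental_match": 50}
reliable_affix = {"true_affix": 100, "accidental_match": 0}

print(shannon_entropy(ambiguous_affix))  # 1.0
print(shannon_entropy(reliable_affix))   # 0.0
```

Under this reading, an entropy near zero marks an affix whose occurrences are almost always genuine, while values near one bit mark affixes that need additional checks before segmentation.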

Implementation of scientific results. An extension of the RRTLN database was developed, which allowed the correct extraction of about 15,000 derivatives without a dedicated program for segmenting words into morphemes (41 prefixes, about 420 suffixes, and over 8,000 roots/stems). The established mechanisms, which made it possible to develop the algorithms and the corresponding programs, led to the generation of a significant number of derivatives with different affixes: 8,839 with 11 prefixes and 2,352 with 24 suffixes. These derivatives will help to substantially enrich the computational linguistic resources of the Romanian language.