Research Group for Human Language Technologies

Our group deals with all aspects of Natural Language Processing (NLP) from early recognition (automatic speech recognition, ASR; optical character recognition, OCR) to late synthesis tasks, with particular emphasis on the intermediate stages that require understanding (semantic modeling). Our work combines rule-based and statistical techniques, with the idea that the rules themselves need to be established by Machine Learning (ML) techniques.

In addition to standard NLP tasks such as morphological analysis and synthesis, part-of-speech tagging, parsing, and generation, we deal with all subsystems required for full-fledged human-computer interaction (HCI) systems, particularly in the information retrieval (IR) area. The theoretical foundations of our work are closely linked to the study of finite state automata (FSA), finite state transducers (FST), and finite state machines (Eilenberg 1974). Currently, we are developing a theory of formal semantics for natural language where the model structures are machines. In addition to providing real-time recognition (clearly a requirement in HCI), finite state methods have the potential for automatic acquisition of the basic building blocks, finite transductions.
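As a rough illustration of what a finite transduction looks like, the following minimal sketch implements a deterministic finite state transducer in Python. It is not code from the Hun* toolchain: the class name `ToyTransducer`, the transition table, and the `+N`/`+PL` tags are all hypothetical, chosen only to show how a single left-to-right pass over the input yields an analysis in time linear in the input length.

```python
from typing import Dict, Optional, Set, Tuple

# (state, input symbol) -> (next state, output string)
Transitions = Dict[Tuple[int, str], Tuple[int, str]]


class ToyTransducer:
    """A deterministic finite state transducer (illustrative only)."""

    def __init__(self, transitions: Transitions, start: int, finals: Set[int]) -> None:
        self.transitions = transitions
        self.start = start
        self.finals = finals

    def transduce(self, symbols: str) -> Optional[str]:
        """Return the output string, or None if the input is rejected."""
        state = self.start
        output = []
        for symbol in symbols:
            step = self.transitions.get((state, symbol))
            if step is None:
                return None  # no transition defined: reject
            state, emitted = step
            output.append(emitted)
        # Accept only if the whole input was consumed in a final state.
        return "".join(output) if state in self.finals else None


# Hypothetical toy lexicon: analyzes "cat" and "cats" only.
toy = ToyTransducer(
    transitions={
        (0, "c"): (1, "c"),
        (1, "a"): (2, "a"),
        (2, "t"): (3, "t+N"),
        (3, "s"): (4, "+PL"),
    },
    start=0,
    finals={3, 4},
)

print(toy.transduce("cats"))  # cat+N+PL
print(toy.transduce("cat"))   # cat+N
print(toy.transduce("ca"))    # None
```

Each input symbol triggers exactly one table lookup, which is what makes recognition with such machines suitable for real-time use.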

Our group continues, and builds on, the Free and Open Source Software tradition of NLP established in Hungary with the Hun* toolchain, which includes the HunMorph morphological analyzer, the HunNER named entity recognizer, the HunParse parser, the HunAlign parallel sentence aligner, and, perhaps best known, the HunSpell spellchecking library, now widely used in OpenOffice.org, Mozilla Firefox, and Thunderbird.

Research Areas

- Machine understanding
- Knowledge-based Human-Computer Interaction
- Machine Learning
- Artificial Intelligence
- Question Answering
- Lexical semantics
- Information extraction and retrieval
- Morphological analysis
- Named Entity Recognition
- Shallow parsing
- Syntactic parsing and generation
- Intelligent dictionary building
- Machine Translation

Latest results

As part of our OTKA project, 'Semantics-based language technology', we have produced a defining vocabulary of about 3000 semantic units, available with English, Hungarian, Polish, and Latin bindings, formalized using the theoretical framework of machines. Our current efforts are in three related directions. First, we are extending the model to a larger vocabulary by automatically translating pre-existing dictionary definitions into the formal model. Second, we are building several demo systems that perform semantics-driven parsing and generation: in 2011 we introduced to the public the SHRDLU 2.0 system, which engages in simple dialogues and follows natural language instructions (essentially, a modern version of Winograd’s classic system), and in 2012 we presented two demos that perform real-world tasks, allowing users to purchase train tickets and make timetable inquiries using natural language. Third, we are applying the technology to well-known standard tasks such as Question Answering (QA) and Machine Translation (MT).

Products and services

- huntoken tokenizer
- hunpos part-of-speech tagger
- morphdb morphological database
- hunmorph morphological analyzer
- hunner named entity recognizer
- hunchunk shallow parser
- hunpars syntactic parser
- hunalign sentence aligner

Our tools can be downloaded from:
http://hlt.sztaki.hu

More information: