Professor Gerhard Heyer | Dr Michael Richter – Models for Understanding Language
Languages are filled with unexplored quirks that can be attributed to more than their mere design. The way we use words has an interdependent relationship with the rules of that language. Professor Gerhard Heyer and Dr Michael Richter of the University of Leipzig have been working to create models that help us to find grammatical universals that transcend any one particular language.
A Tool to Deepen Understanding
The verbs we use in everyday speech are used for different purposes. The verb ‘run’ can denote two different possibilities of completion. ‘I ran’ is different from ‘I ran to the store’ in that the first use gives the verb the trait of being unbounded in time (imperfective), whereas the second use indicates a completed activity (perfective). These two different temporal structures are due to different verbal or sentence aspects. In some languages, verbs can be overtly marked to indicate their aspectual load – they can be marked to indicate duration, repetition, completion or quality.
In comparative language studies, there are a number of areas involving statistical data that enhance our knowledge of the nature of language. Such research can provide language models based on algorithms.
Professor Heyer and Dr Richter of the University of Leipzig have recently been focusing their studies on verbs with quantifiable, binary aspectual distinctions, such as perfective and imperfective. To a great extent, this research topic was instigated by world-leading typologist Professor Martin Haspelmath, senior researcher at the Max Planck Institute for the Science of Human History, and Head of the ERC-funded research project Grammatical Universals.
Dr Richter and Professor Heyer are currently endeavouring to collect data on verbs with aspectual distinctions in order to establish new algorithms. The purpose of these algorithms will be to predict certain ‘asymmetries’ that these verbs are coded with across languages.
The project focuses on the asymmetry between verbs’ aspectual coding forms. As Dr Richter and Professor Heyer point out, non-default coded verbs tend to be longer. For instance, a verb with default imperfective aspect should be longer when used with a non-default perfective aspect. Note, however, that verbal aspect marking is not obligatory in natural languages: English and German, for instance, exhibit barely any overt marking, while Slavic languages have a rich aspect-marking morphology.
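The length prediction described above can be illustrated with a minimal Python sketch. It is not the team's software: the verb pairs are a hypothetical toy sample of transliterated Russian imperfective/perfective pairs, the imperfective is treated as the default aspect purely for illustration, and character count is only a crude proxy for coding length. The names `VERB_PAIRS` and `length_asymmetry` are assumptions introduced here.

```python
from statistics import mean

# Hypothetical toy data: (default form, non-default form) pairs.
# These are transliterated Russian imperfective/perfective pairs, chosen
# only for illustration; they are not taken from the project's corpora.
VERB_PAIRS = [
    ("pisat", "napisat"),      # 'write'
    ("chitat", "prochitat"),   # 'read'
    ("delat", "sdelat"),       # 'do'
    ("stroit", "postroit"),    # 'build'
]


def length_asymmetry(pairs):
    """Summarise form length for default vs non-default aspect.

    Returns (mean default length, mean non-default length,
    share of pairs in which the non-default form is longer).
    """
    default_lengths = [len(default) for default, _ in pairs]
    marked_lengths = [len(marked) for _, marked in pairs]
    longer = sum(1 for default, marked in pairs if len(marked) > len(default))
    return mean(default_lengths), mean(marked_lengths), longer / len(pairs)


if __name__ == "__main__":
    default_mean, marked_mean, share = length_asymmetry(VERB_PAIRS)
    print(f"mean default length:     {default_mean}")
    print(f"mean non-default length: {marked_mean}")
    print(f"share with longer non-default form: {share:.0%}")
```

On a real corpus, the same comparison would be run over many verbs and languages and checked for statistical significance, rather than over four hand-picked pairs.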
Through establishing a connection between cross-linguistic patterns that form language and general trends of language use, the team is creating a universal model that can account for grammar across languages. This has implications for understanding language that go beyond the rules of one specific language.
A product of this project will be an analysis software tool, available to anyone, for quantitatively understanding language. The tool will be based on an evaluation and interpretation of the team’s results. Further, this technology will work at the boundaries between disciplines, using predictive statistical techniques based on entropy (in information theory, a measure of the uncertainty, or average information content, of a distribution) and probability. The tool will equip linguists with the means for testing hypotheses and explanations, for example, the hypothesis that the established forms of verbs are shaped by a tendency in usage towards more efficient sentence structures.
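To give a sense of what an entropy-based measure looks like, here is a minimal Python sketch of Shannon entropy computed over observed aspect labels. It is an illustration only, not the project's tool: the function name `shannon_entropy` and both label lists are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(observations):
    """Shannon entropy (in bits) of the empirical distribution of labels."""
    counts = Counter(observations)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical observations of how one verb is used in a corpus.
balanced = ["perfective", "imperfective"] * 50          # 50 / 50 split
skewed = ["imperfective"] * 90 + ["perfective"] * 10    # 90 / 10 split

if __name__ == "__main__":
    print(f"balanced usage: {shannon_entropy(balanced):.3f} bits")
    print(f"skewed usage:   {shannon_entropy(skewed):.3f} bits")
```

A verb that is used almost always in one aspect has low entropy, meaning its aspect is highly predictable. A verb used equally in both aspects has the maximum entropy of one bit for a binary distinction.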
Trends in Words
It has been established that the most frequently-expressed meanings tend to be expressed by short words or phrases across different languages. Professor Heyer and Dr Richter are contributing to a new area of research that builds on this and involves attributing these tendencies to phenomena that transcend specific languages.
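The relation between frequency and length can be checked on any text with a few lines of Python. The sketch below is an illustration under simplifying assumptions: it counts word forms rather than meanings, measures length in characters, and uses a made-up sample sentence. The function name `frequency_length_correlation` is hypothetical.

```python
import math
from collections import Counter


def frequency_length_correlation(tokens):
    """Pearson correlation between log frequency and length of word types.

    A negative value means that more frequent words tend to be shorter.
    """
    counts = Counter(tokens)
    log_freqs = [math.log(c) for c in counts.values()]
    lengths = [len(word) for word in counts]
    n = len(counts)
    mean_f = sum(log_freqs) / n
    mean_l = sum(lengths) / n
    cov = sum((f - mean_f) * (l - mean_l) for f, l in zip(log_freqs, lengths))
    var_f = sum((f - mean_f) ** 2 for f in log_freqs)
    var_l = sum((l - mean_l) ** 2 for l in lengths)
    return cov / math.sqrt(var_f * var_l)


# Made-up sample text, for illustration only.
SAMPLE = "the cat sat on the mat and the dog sat on the extraordinary carpet"

if __name__ == "__main__":
    r = frequency_length_correlation(SAMPLE.split())
    print(f"correlation between log frequency and length: {r:.2f}")
```

A single sentence proves nothing, of course. The point is only to show the shape of the calculation that would be repeated over large corpora in many languages.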
In the history of typology (the study of the structural and functional features of languages), statistical methods have rarely been used for examining universal governing phenomena. The team is employing special software for this project, because of the sheer volume of information required to understand trends with such scope. Only computational models are capable of validating typological universals (such as asymmetries between specific aspects of verbs) through statistically significant statements. The software tools created as a result of this project will provide linguists with a much deeper understanding of how usage tendencies create recurring grammatical patterns.
Prior research has found that the ‘rules’ of a language both influence and are influenced by the words speakers choose to utter. This premise will be used by Dr Richter and Professor Heyer to build on the information-theory-based approach named after the linguist George Kingsley Zipf.
Zipf’s theory held that the use of words was largely influenced by their inherent meaning – words of higher magnitude (having meanings that are employed frequently) tend to be shorter. In this view, the effect of predictability that emerges from the use of language across time was not explicitly integrated. The team’s recent work seeks to address this. To do so, they have been using data on both the inherent design of asymmetries across languages and the usage tendencies across languages to establish models for prediction.
Zipf argued for a principle that states speakers of a given language will tend to favour words that require the least effort. ‘Information’, for example, is gradually abandoned in favour of ‘info’. Serving this principle, Professor Heyer and Dr Richter are in the process of identifying a general mechanism of ‘usage-based language change’ that will help to create and maintain efficient language systems that centre on ease of use. They envisage that through these systems, words and phrases that communicate meaning most readily will be able to characterise language.
The Universals of Grammar
Professor Heyer and Dr Richter are in the process of documenting and explaining their concept of a model that accounts for universals in grammar across languages. The pair plans to achieve this by demonstrating a link between cross-linguistic patterns of language form and general trends of language use. Essential parts of the analysis of these universals are being carried out by the software tool that is being developed in the course of the project.
There has been a tendency to examine the relationship between form and meaning to the exclusion of that between form, usage patterns and economic motivation (a tendency to use words that are shorter and easier to say). The team has been addressing this discrepancy. Professor Heyer and Dr Richter seek to prove the idea that frequently-expressed meanings tend to be expressed by short forms. The quantities ‘frequency’ and ‘form’ are components of Zipf’s law that will be pursued by extending the frequency-form relation to a frequency-form/function relation. In this way, implications for the ways in which meaning is structured in phrases can be incorporated.
Collecting Corpora
Research investigating the nature of grammar has improved considerably since the turn of the millennium. A movement to fully document the peculiarities of language has been made possible only by database technology and by digitisation, which has made information quickly available. This has meant that data can be collected, organised and published with drastically increased effectiveness.
Professor Heyer’s team has crawled language-dense websites including Wikipedia, online newspapers and others. With the data acquired, the team is working to provide the corpora (large and structured sets of text) publicly through various forms of access so that future language technology can build on the data. In their research, the languages covered and size of resources provided by the team are constantly expanding. By providing the corpora in a uniform way for widespread use, the project is contributing to a much-improved understanding of language usage in everyday life.
In this fairly new type of research, the pair has studied the relative usage frequency of patterns in corpora between diverse languages. Through this, they are building models that establish the asymmetry in forms that exists between grammatical meanings at a universal level. There are limitations, however, on the simultaneous access to corpora from multiple languages. These place restrictions on available sample sizes.
Nonetheless, the background meanings that grammar is coded with are known to be universal, and Dr Richter and Professor Heyer hypothesise that the corresponding frequency distributions are similar across languages. As a result, smaller sample sizes should not reduce the statistical value of the data for studying frequency asymmetries.
A History of Language
Diachrony concerns the ways in which languages evolve over time. The patterns that have been found in diachronic paths are the fundamental basis of Professor Heyer and Dr Richter’s investigation. Diachrony is limited, however, in that usable data only extends back in terms of decades.
As such, their project focuses, out of necessity, on European, Western Asian, East Asian and Egyptian languages. It must also employ reconstructed language histories (which are available for a few additional language families). The result of these limitations is that the project focuses on present-day aspects of structure and usage, as opposed to including a long-term historical perspective.
Technologies of Prediction
Programming the team’s analysis software tool will establish a degree of predictability in asymmetry for use in other research. The tool provides multiple statistical techniques based on entropy and probability so that new linguistic theories can be generated with a basis in relevant statistical data.
Statistical research has already shown that certain forms tend to be asymmetrical. There are length asymmetries between words denoting plural meanings and singular ones, and between first person and third person forms. With this software, linguists will be provided with further and more specific insight as to the patterns that coding asymmetries follow.
Future Research
In their upcoming research, Professor Heyer and Dr Richter will apply new lenses to the project’s topics through diversifying approaches into further disciplines. These will include formalisms such as stochastic optimality theory (which states that language forms arise from a balance between conflicting constraints) or evolutionary game theory (the study of mathematical models of conflict and cooperation between organisms from an evolutionary perspective).
In this way, they will endeavour to explain linguistic phenomena and linguistic behaviour in a broad theory frame that captures human development and behaviour in general. This research can potentially provide us with insight into not only what informs language, but what language tells us about the world of meaning that we live in.
Meet the researchers
Professor Gerhard Heyer
Department of Natural Language Processing
Institute of Computer Science
University of Leipzig
Leipzig
Germany
Professor Gerhard Heyer holds the Chair of Natural Language Processing at the Computer Science Department of the University of Leipzig. His work focuses on automatic semantic processing of natural language text, with applications in information retrieval and search, as well as knowledge management. Until he moved to Leipzig, he was responsible within the Olivetti Group for establishing research and development in electronic publishing and natural language processing. Professor Heyer has published numerous papers on natural language processing, as well as the well-known book Text Mining – Wissensrohstoff Text, published by W3L/Springer. He is currently conducting several research projects funded by the EU, the German Research Foundation (DFG), and industry.
CONTACT
E: heyer@informatik.uni-leipzig.de
W: http://asv.informatik.uni-leipzig.de/staff/Gerhard_Heyer
Dr Michael Richter
Department of Natural Language Processing
Institute of Computer Science
University of Leipzig
Leipzig
Germany
Dr Michael Richter earned his PhD in linguistics in 2000, working on verbal constructions in German. Having previously worked as a music teacher, he switched to linguistics and completed his MA at the Katholieke Universiteit Nijmegen (now Radboud University, the Netherlands). Dr Richter worked at universities in Germany and the Netherlands before joining the Natural Language Processing group at the Institute of Computer Science at Leipzig University. He works as a researcher in the project Actionality classes and cross-linguistic coding tendencies: typological research and development of an analysis software tool, which is funded by the German Research Foundation.
CONTACT
E: richter@informatik.uni-leipzig.de
W: http://asv.informatik.uni-leipzig.de/staff/Michael_Richter
KEY COLLABORATORS
Dr Giuseppe Celano, Department of Natural Language Processing, Institute of Computer Science, University of Leipzig
Professor Martin Haspelmath, Max Planck Institute for the Science of Human History, Head of the ERC-funded research project Grammatical Universals
Professor Roeland van Hout, Centre for Language Studies, Radboud University Nijmegen
FUNDING
German Research Foundation