The LADAL Opening event will consist of weekly presentations by eminent figures in linguistics, data science, and the computational humanities, covering a wide range of LADAL-relevant topics!
The first event of the LADAL Opening is a presentation by Stefan Th. Gries on MuPDAR(F) (Multifactorial Prediction and Deviation Analysis Using Regression/Random Forests) on June 3, 2021, 5pm Brisbane time. The event will take place on Zoom (the Zoom link will be announced here, on Twitter (@slcladal), and via our collaborators).
See below for the full list of presentations that are part of the LADAL Opening.
MuPDAR(F) (Multifactorial Prediction and Deviation Analysis Using Regression/Random Forests)
In this talk, I will give a brief and relatively practical introduction to an approach called MuPDAR(F) (for Multifactorial Prediction and Deviation Analysis using Regressions/Random Forests) that I developed (see Gries and Deshors (2014), Gries and Adelman (2014) for the first applications). The main part of the talk involves using a version of the data in Gries and Adelman (2014) to exemplify how this protocol works and how it can be done in R. Second, I will discuss a few recent extensions proposed in Gries and Deshors (2020) and Gries (n.d.), which have to do with
how to deal with situations with more than two linguistic choices,
how predictions are made, and
how deviations are quantified.
Finally, I will briefly comment on exploring individual variation among the target speakers (based on Gries and Wulff (2021)).
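The three-step logic behind this protocol can be sketched in plain Python. Note that this is only a toy: it uses per-context relative frequencies in place of the regression/random forest actually fitted in MuPDAR(F), and the data and variable names are invented for illustration.

```python
from collections import Counter, defaultdict

# Step 1: "fit" a model on the reference (native-speaker) data: for each
# context, record how often each linguistic choice occurs. This frequency
# table stands in for the regression/random forest used in the real protocol.
native = [("formal", "A"), ("formal", "A"), ("formal", "B"),
          ("informal", "B"), ("informal", "B"), ("informal", "A")]
model = defaultdict(Counter)
for context, choice in native:
    model[context][choice] += 1

def predict_proba(context, choice):
    """Relative frequency of a choice in a context, per the reference data."""
    counts = model[context]
    return counts[choice] / sum(counts.values())

# Step 2: for each learner data point, predict the choice a native speaker
# would most likely have made in the same context.
# Step 3: quantify the deviation, e.g. as the difference between the
# predicted probability of the nativelike choice and that of the
# learner's actual choice.
learners = [("formal", "B"), ("informal", "B")]
for context, actual in learners:
    predicted = model[context].most_common(1)[0][0]
    deviation = predict_proba(context, predicted) - predict_proba(context, actual)
    print(context, actual, predicted, round(deviation, 3))
```

In the actual protocol, step 1 would be a regression or random forest fitted to the reference data, and the step-3 deviations would themselves be analysed in a second model.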
Stefan Th. Gries is full professor at the University of California, Santa Barbara (UCSB), as well as Honorary Liebig-Professor and Chair of English Linguistics at the Justus-Liebig-Universität Giessen. Stefan has held several prestigious visiting professorships at top universities and, methodologically, he is a quantitative corpus linguist working at the intersection of corpus linguistics, cognitive linguistics, and computational linguistics. Stefan has applied a variety of statistical methods to investigate a wide range of linguistic topics, and much of his work involves the open-source software R. Stefan has produced more than 200 publications (articles, chapters, books, and edited volumes) and is an active member of various editorial boards as well as academic societies.
The Australian Text Analytics Platform (ATAP) and the Language Technology and Data Analysis Laboratory (LADAL) - building computational humanities infrastructures: experiences, problems, and potentials
This talk introduces the Language Technology and Data Analysis Laboratory (LADAL), a computational humanities resource infrastructure maintained by the School of Languages and Cultures at the University of Queensland. The talk will also provide information about its relation to the Australian Text Analytics Platform (ATAP), an effort to promote text analytics in Australia and to make text analytics resources available to a wider community of researchers.
Martin is a language data scientist with a PhD in English linguistics who has specialized in corpus linguistics and quantitative, computational analyses of language data. Martin is currently an Associate Professor in the AcqVA-Aurora Center at the Arctic University of Norway in Tromsø, and he holds an additional appointment as Postdoctoral Research Fellow in Language Technology at the University of Queensland, Australia, where he has been establishing the Language Technology and Data Analysis Laboratory (LADAL).
Michael is Professor of Linguistics and a Fellow of the Australian Academy of the Humanities. His research interests lie primarily in the field of pragmatics, the science of language-in-use. He works with recordings and transcriptions of naturally occurring spoken interactions, as well as data from digitally mediated forms of communication, across a number of languages. An area of emerging importance in his view is the role that language corpora can play in the humanities and social sciences more broadly. He has been involved in the establishment of the Australian National Corpus and the Language Technology and Data Analysis Laboratory (LADAL), and is currently leading the establishment of a national language data commons.
Field-based methods for collecting quantitative data
Shana Poplack has set benchmarks for the development of corpora since the early 1980s. Poplack (2015, p. 921) maintains that the “gold standard remains the (standard sociolinguistic-style) … corpus”. The aim of producing corpora using these principles is to avoid the ‘cherry picking’ approach which dominates much of the theoretical literature. Poplack and her team have created the Ottawa-Hull Corpus, which consists of 3.5 million words of informal speech data. A corpus of this size is beyond the capabilities of a single linguist working in a small language community. This talk offers suggestions for corpus development in the field that follow Poplack’s principles, but also shows where compromises can be made. I discuss a method developed during the Gurindji Kriol project called ‘peer elicitation’. It supplements Poplack’s gold standard of naturally occurring speech with semi-formal elicitation to ensure sufficient data for quantitative analyses.
Felicity Meakins is an ARC Future Fellow in Linguistics at the University of Queensland and a CI in the ARC Centre of Excellence for the Dynamics of Language. She is a field linguist who specialises in the documentation of Australian Indigenous languages in the Victoria River District of the Northern Territory and the effect of English on Indigenous languages. She has worked as a community linguist as well as an academic over the past 20 years, facilitating language revitalisation programs, consulting on Native Title claims and conducting research into Indigenous languages. She has compiled a number of dictionaries and grammars of traditional Indigenous languages and has written numerous papers on language change in Australia.
Text Crunching Center (TCC): Data-driven methods for linguistics, the social sciences, and the digital humanities
This talk introduces the Text Crunching Center (TCC), a computational linguistics and digital humanities service hosted at the University of Zurich and a collaboration partner of LADAL. We present a selection of our text analytics case studies, drawn from cognitive linguistics as well as social, political, and historical studies. We show how stylistics, document classification, topic modelling, conceptual maps, distributional semantics, and eye-tracking can offer new perspectives. Our case studies include language and age, learner language, the history of medicine, democratisation, religion, and attitudes to migration. We conclude with an outlook on the future of text analytics.
Gerold Schneider is a Senior Lecturer, researcher, and computing scientist at the Department of Computational Linguistics at the University of Zurich, Switzerland. His doctoral thesis was on large-scale dependency parsing, his habilitation on using computational models for corpus linguistics. His research interests include corpus linguistics, statistical approaches, digital humanities, text mining, and language modeling. He has published over 100 articles on these topics, including a book on statistics for linguists (Schneider and Lauber 2019); a book on digital humanities is under way. His publications are listed on his Google Scholar profile.
Corpus-based media linguistics: A case study of linguistic diversity in Australian television
Bayesian versus Frequentist Approaches in Statistics (preliminary title)
Gathering data and creating online experiments with jsPsych (preliminary title)
Reproducible research, Open Science, and Pre-Publication (preliminary title)
Neurocognitive effects of bilingual experience
Starting to work with networks
(Semi-)Automated speech recognition using ELPIS (preliminary title)
Establishing Computational Humanities Infrastructures in Finland (preliminary title)
Data-driven learning for younger learners: Boosting schoolgirls’ knowledge of passive voice constructions for STEM education
This paper explores how corpus technology and DDL pedagogy can support secondary schoolgirls’ reporting of an observed science experiment through a written research report, focusing particularly on how corpora were used to develop receptive and productive knowledge of passive voice constructions. A pre-test of the grammaticality of passive constructions and a diagnostic pre-instruction written report requiring the retelling of an observed science experiment were collected from 60 Year 9-10 girls at a high school in Australia. During a full 10-week term, students were given guided individual homework tasks and short in-class pair/group DDL activities focusing on passive voice constructions, using freely available online corpus applications such as Sketch Engine. Following this treatment, a post-test was conducted and an additional written research report was collected. Questionnaire and interview data were also collected to determine the perceptions of younger female learners and their teachers regarding their engagement with corpora and DDL for improving knowledge and use of passive constructions over time. The data suggest that the DDL treatment resulted in increased knowledge of the grammaticality of passive voice constructions, and that students were more likely to include such constructions in their written reports, with increased accuracy of use. Qualitative stakeholder perceptions of improved disciplinary linguistic knowledge, increased data management skills, and positive engagement with “science” were also found in the survey/interview data, although a number of technical and conceptual challenges for DDL remain.
Peter is Senior Lecturer in the School of Languages and Cultures at UQ (since 2017), having formerly been an assistant professor at the Centre for Applied English Studies (CAES), University of Hong Kong (since 2014). His areas of research and supervisory expertise include corpus linguistics and the use of corpora for language learning (known as data-driven learning), as well as English for General and Specific Academic Purposes. He is the author of the monograph Learning the Language of Dentistry: Disciplinary Corpora in the Teaching of English for Specific Academic Purposes in Benjamins’ Studies in Corpus Linguistics series (with Lisa Cheung, published 2019), as well as the edited volumes Data-Driven Learning for the Next Generation: Corpora and DDL for Pre-tertiary Learners (published 2019) and Referring in a Second Language: Reference to Person in a Multilingual World (with Jonathon Ryan, published 2020), both with Routledge. He is also currently serving as the corpus linguistics section editor for Open Linguistics (ISI-ESCI), an open-access linguistics journal from De Gruyter, as well as on the editorial board of Applied Corpus Linguistics, a new journal covering the direct applications of corpora to teaching and learning.
Strategic targeting of rich inflectional morphology for linguistic analysis and L2 acquisition
Many languages have rich inflectional morphology signaling grammatical categories such as case, number, tense, etc. Rich morphology presents a challenge for L2 learners because even a basic vocabulary of a few thousand words can entail mastery of over 100,000 word forms. However, only a handful of the potential forms of a given word occur frequently, while the remainder are rare. Access to digital corpora makes it possible to determine which forms of any given word are of highest frequency, as well as what grammatical and collocational contexts motivate those few frequent forms, facilitating strategically focused language learning tools. Corpus analysis of the frequency distributions of inflectional forms also provides linguists with added insights into how languages function. The results, achieved primarily through correspondence analysis of Russian material, are potentially portable to any language with rich inflectional morphology.
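The frequency skew described here is easy to demonstrate with a simple frequency count. The sketch below uses invented token counts for forms of the Russian noun god ‘year’ purely for illustration; real numbers would come from corpus queries.

```python
from collections import Counter

# Toy corpus: tokens of the Russian noun "god" ('year') in various
# inflectional forms (counts invented for illustration).
tokens = (["goda"] * 55 + ["god"] * 20 + ["godu"] * 12 +
          ["let"] * 8 + ["godam"] * 3 + ["godami"] * 1 + ["godakh"] * 1)
freq = Counter(tokens)
total = sum(freq.values())

# A small number of forms typically cover the bulk of the tokens: this
# cumulative coverage is what motivates focusing learning (and analysis)
# on a handful of frequent forms rather than the full paradigm.
top2 = sum(n for _, n in freq.most_common(2))
print(f"top 2 of {len(freq)} forms cover {top2 / total:.0%} of tokens")
# prints: top 2 of 7 forms cover 75% of tokens
```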
In the Cold War era, Laura Janda combined study of Slavic linguistics at Princeton and UCLA with US-government-funded adventures as an exchange student behind the Iron Curtain in countries that have since changed their names: USSR, Czechoslovakia, and Yugoslavia. After over two decades at the University of Rochester and UNC-Chapel Hill, she moved to the University of Tromsø in 2008. Laura Janda was an early adopter of Cognitive Linguistics in the 1980s and has explored quantitative methods since 2007. Her research focuses primarily on the morphology of Slavic languages, with various admixtures (North Saami, conlangs, political discourse).
Text classification for automatic detection of hate speech, counter speech, and protest events
The social sciences have opened up to text mining, i.e., a set of methods to automatically identify semantic structures in large document collections. However, these methods have often been limited to a statistical analysis of textual data, strongly limiting the scope of possible research questions. The more complex concepts central to the social sciences, such as arguments, frames, narratives, and claims, are still mainly studied using manual content analyses, in which the knowledge needed to apply a category (i.e. to “code”) is verbally described in a codebook and implicit in the coder’s own background knowledge. Supervised machine learning provides an approach to scale up this coding process to large datasets. Recent advances in neural network-based natural language processing allow for pretraining language models that can transfer semantic knowledge from unsupervised text collections to specific automatic coding problems. With deep learning models such as BERT, automatic coding of context-sensitive semantics, with substantially lowered effort in training data generation, comes within reach for content analysis. The talk will introduce the applied usage of these technologies, along with two interdisciplinary research projects studying hate speech and counter speech in German Facebook postings, and information extraction for the analysis of the coverage of protest events in local news media.
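As a minimal illustration of scaling up “coding” with supervised learning, the sketch below trains a tiny add-one-smoothed Naive Bayes classifier on invented example posts. It stands in for the pretrained BERT models the projects actually use; all texts and labels are made up for illustration.

```python
import math
from collections import Counter

# Tiny labeled "codebook" examples (invented): the manual coding that a
# supervised model learns to reproduce at scale.
train = [("you people should disappear", "hate"),
         ("go back where you came from", "hate"),
         ("everyone deserves respect here", "counter"),
         ("please stop the insults and talk", "counter")]

word_counts = {"hate": Counter(), "counter": Counter()}
for text, label in train:
    word_counts[label].update(text.split())

def code(text, labels=("hate", "counter")):
    """Assign the label with the highest add-one-smoothed bag-of-words score."""
    vocab = set().union(*word_counts.values())
    scores = {}
    for label in labels:
        counts = word_counts[label]
        total = sum(counts.values()) + len(vocab)
        scores[label] = sum(math.log((counts[w] + 1) / total)
                            for w in text.split())
    return max(scores, key=scores.get)

print(code("you should disappear"))  # words seen only in the "hate" examples
```

A fine-tuned transformer replaces the bag-of-words scoring with contextual representations, but the workflow — label examples, train, then code unseen documents automatically — is the same.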
Dr. Gregor Wiedemann is a Senior Researcher in Computational Social Science at the Leibniz Institute for Media Research │ Hans Bredow Institute (HBI). Since September 2020, he has headed the Media Research Methods Lab (MRML). His current work focuses on the development of methods and applications of natural language processing and text mining for empirical social and media research. Gregor Wiedemann studied political science and computer science in Leipzig and Miami, USA. In 2016 he received his doctorate from the Department of Computer Science at the University of Leipzig for his thesis on the automation of discourse and content analysis using text mining and machine learning methods. Afterwards he worked as a postdoc in the NLP group of the Computer Science Department at the University of Hamburg. Among other things, the resulting work is concerned with unsupervised information extraction to support investigative research in unknown document collections (see newsleak.io) and with the detection of hate speech and counter-speech in social media.
Case studies from VARIENG (preliminary title)
AntConc 4.0 (preliminary title)
Doing diachronic linguistics with distributional semantic models in R
Analysing 18th-century publications and publishing networks (preliminary title)
Tanja Säily is a tenure-track assistant professor in English language at the University of Helsinki. Her research interests include corpus linguistics, digital humanities, historical sociolinguistics, and linguistic productivity. She is also interested in the social embedding of language variation and change in general, including gendered styles in the history of English and extralinguistic factors influencing language change. Her overarching aim is to develop new ways of understanding language variation and change, often in collaboration with experts from other fields. Her current project combines historical sociolinguistics, intellectual history, book history and data science to analyse eighteenth-century publications and publishing networks.
Git and GitHub/Gitlab for versioning and collaborating
Git is a tool for versioning and collaborating on any text-based file. Widely used in software development, it has now been adopted in many different settings, from document versioning to data analysis management. It is also at the centre of major platforms like GitHub and GitLab, used by millions to share and collaborate on code and documents. In this workshop, you will learn about:
The main commands used in a git workflow
How to publish your work online
How to collaborate on a GitHub repository
If you would like to follow along, please do the following before attending:
Install Git on your computer (here are OS-specific instructions)
Stéphane has worked for the last 10 years at the University of Queensland (UQ). After completing a master’s degree in plant science and ecology in France, he worked in research on sustainable agriculture. In 2018, a drastic move to a Technology Trainer position at the Library allowed him to share best practices in data analysis and promote open-source tools for research. He is motivated by the principles of Open Science and the opportunities an increasingly collaborative research ecosystem offers.
How to Make your Research Reproducible
Gries, Stefan Th. n.d. “MuPDAR for Corpus-Based Learner and Variety Studies: Two (More) Suggestions for Improvement.” In TBA, edited by Martin Hilpert and Susanne Flach. Forthcoming.
Gries, Stefan Th., and Allison S. Adelman. 2014. “Subject Realization in Japanese Conversation by Native and Non-Native Speakers: Exemplifying a New Paradigm for Learner Corpus Research.” In Yearbook of Corpus Linguistics and Pragmatics 2014, 35–54. Springer.
Gries, Stefan Th., and Sandra C. Deshors. 2014. “Using Regressions to Explore Deviations Between Corpus Data and a Standard/Target: Two Suggestions.” Corpora 9 (1): 109–36.
———. 2020. “There’s More to Alternations Than the Main Diagonal of a 2×2 Confusion Matrix: Improvements of MuPDAR and Other Classificatory Alternation Studies.” ICAME Journal 44: 69–96.
Gries, Stefan Th., and Stefanie Wulff. 2021. “Examining Individual Variation in Learner Production Data: A Few Programmatic Pointers for Corpus-Based Analyses Using the Example of Adverbial Clause Ordering.” Applied Psycholinguistics 42 (2): 279–99.
Schneider, Gerold, and Max Lauber. 2019. “Introduction to Statistics for Linguists.” Pressbooks. https://dlf.uzh.ch/openbooks/statisticsforlinguists/.