Unlocking the Treasures of Language: Publicly Available ILL Data


The State Data Agency (Statistics Lithuania), in collaboration with the Institute of the Lithuanian Language (ILL), has opened access to significant linguistic datasets. Datasets prepared by ILL are now available on the data portal (duomenys.stat.gov.lt), opening the door to language research, education, and digital solutions. Povilas Bialoglovis, ILL Information Technology Specialist, and his team share insights on where and how these datasets can be used:
Database of Current Issues in Administrative Language | https://data.gov.lt/datasets/2686/#info
This resource is valuable for research on standard Lithuanian, particularly in the fields of functional stylistics or official language. It is also relevant for projects requiring a representative sample of early 21st century Lithuanian administrative language usage. Additionally, this data can be useful in developing educational materials or various digital tools.
Language Consultation Bank (LCB) | https://data.gov.lt/datasets/2765/#info
What questions have people asked about the language over the past 28 years (1997–2025)? The bank contains over 5,000 structured language consultations addressing spelling, punctuation, loanwords, word meanings, grammatical forms, and the use of proper nouns. This dataset reflects the state of standard language codification at the turn of the 20th and 21st centuries, highlights the most dynamic areas of language norm changes, and reveals emerging usage trends.
Text Corpora of the Ideological Narrative of Modern Identity | https://data.gov.lt/datasets/2713/#info
This dataset contains texts – both complete and divided into paragraphs – from the Institute of the Lithuanian Language’s text corpus of journalistic writing on the ideological narrative of modern identity. It is a text database intended for linguistic, statistical, and sociological analysis of written language. The corpus is useful for research across academic disciplines, especially studies of the development of journalistic writing and of changes in the Lithuanian language (the texts are authentic and have not been edited to conform to current standard language rules).
Data for this dataset was collected between 2018 and 2021. The corpus includes press texts from the pre-war period (1928 and 1930), the Soviet era (1945, 1956–1957, 1962), and the period of restored independence (1992 and 1998).
Dictionary of Borrowed Terms | https://data.gov.lt/datasets/2883/
This resource is valuable for linguists, translators, and terminologists. While the dataset is also available on the original ILL website, it has now been included in the open data portal for broader analytical use.
These data are open not only to scientists but to anyone interested in the dynamics of the Lithuanian language, whether for research, content creation, or educational purposes.
We invite researchers, educational institutions, creators, and data enthusiasts to make use of these datasets by integrating Lithuanian language content into interactive tools, AI models, and cultural projects, or by otherwise contributing to the evolving linguistic landscape.
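As one possible starting point, the sketch below groups corpus records by the historical periods listed in the description of the Text Corpora of the Ideological Narrative of Modern Identity. The publication years come from that description; the record layout (a dictionary with a "year" field) and the function names are illustrative assumptions, since the actual file format and column names of the published dataset are not described here.

```python
from collections import Counter

# Publication years per period, as listed in the corpus description.
# "1956–1957" is expanded into the two individual years.
PERIOD_YEARS = {
    "pre-war": {1928, 1930},
    "Soviet era": {1945, 1956, 1957, 1962},
    "restored independence": {1992, 1998},
}


def period_for_year(year):
    """Return the period a publication year belongs to, or None if the year is not in the corpus."""
    for period, years in PERIOD_YEARS.items():
        if year in years:
            return period
    return None


def count_texts_by_period(records):
    """Count records per period.

    Each record is assumed to be a dict with a 'year' key (hypothetical field
    name); records whose year is outside the listed years are skipped.
    """
    counts = Counter()
    for record in records:
        period = period_for_year(int(record["year"]))
        if period is not None:
            counts[period] += 1
    return dict(counts)


if __name__ == "__main__":
    # Made-up rows for illustration only; these are not taken from the dataset.
    sample = [{"year": "1928"}, {"year": "1957"}, {"year": "1962"}, {"year": "1998"}]
    print(count_texts_by_period(sample))
```

With the real data, the sample rows would be replaced by records read from the downloaded file, with the year field renamed to match whatever the dataset actually provides.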
Special thanks to Vytautas Dominykas Leipus, Programmer-Analyst at the State Data Agency; Darius Sedleckas, Information Manager; and the other specialists who contributed to the publication of these datasets.
