Events

International online workshop
Department of Humanities (DISUM) – University of Catania
Teams link

Digital Tools and Corpus-Based Approaches in (Arabic) Sociolinguistics: Methods, Challenges, and Cross-Disciplinary Insights

This workshop is intended for students and researchers interested in the intersection of sociolinguistics, corpus-based analysis, and digital tools, with a particular focus on Arabic and its varieties. Held online on February 11, 2026, the workshop aims to bring together scholars from diverse disciplinary backgrounds to reflect on methodological challenges and opportunities in the corpus-based analysis of Arabic sociolinguistic data. While Arabic and its many varieties (Standard and non-standard, spoken and written) are central to the discussion, the workshop also seeks to foster a cross-disciplinary dialogue with researchers who, though not necessarily specialists in Arabic, work with digital tools and corpus methods in sociolinguistics and related fields.

Arabic presents a particularly rich and complex terrain for sociolinguistic inquiry, given its diglossic structure, regional variation, and the interplay of spoken and written forms, often shaped by multilingualism and language contact. These features, combined with the increasing availability of digital data and tools, raise important methodological questions for how we collect, process, and analyze linguistic data.

Participants are invited to explore both practical and theoretical questions related to the use of IT tools, natural language processing, and corpus-based methodologies in sociolinguistic research, with a particular (but not exclusive) focus on Arabic.

Contributions may address, among others, the following topics:

  • Designing and processing corpora involving multiple varieties of Arabic (e.g., Standard, dialectal, mixed-language)
  • Methodological strategies for dealing with mixed data (written/oral, standard/non-standard, monolingual/multilingual)
  • Using corpora to analyze stylistic variation, register shifts, and genre diversity
  • Challenges in annotating and tagging sociolinguistically relevant features in under-resourced languages or dialects
  • Applications of NLP tools to Arabic and implications for sociolinguistic interpretation
  • Visualization and quantification of sociolinguistic phenomena through corpus data
  • Reflections on tool selection, customization, or development in light of specific linguistic and sociolinguistic questions
  • Comparative methodological perspectives: what can be learned from working across different languages and data types?
  • Theoretical implications of corpus-based approaches for analyzing variation, identity, and language practices

Each presentation will be allocated 20 minutes, followed by 10 minutes for discussion.

We welcome contributions that are empirical, theoretical, or methodological in nature, and particularly encourage interdisciplinary perspectives that bridge sociolinguistics, corpus linguistics, and computational methods. We look forward to a lively and generative exchange on how digital tools and corpus-based approaches can enrich the study of Arabic sociolinguistics—and beyond.

Workshop programme

February 11, 2026 – Teams

8:45 Welcoming participants – Rosa Pennisi (University of Catania): Introduction to the workshop

1st Session (9:00 – 10:30)

9:00 – 9:30 Veronika Laippala (University of Turku), Tracing universals of register variation in masses of web data – a machine learning approach.

9:30 – 10:00 May Rostom (IREMAM, Aix-Marseille University), Methods, challenges, and strategies adopted to analyse and represent mixed Arabic repertoire employed in digital contexts: reflections on a case study.

10:00 – 10:30 Rosa Pennisi (University of Catania), From Posts to Podcasts: Corpus-Based Methods for Mixed (Moroccan) Arabic Across Written and Oral Data.

Break (10:30 – 10:45)

2nd Session (10:45 – 12:15)

10:45 – 11:15 Ibraam Abdelsayed (University for Foreigners of Siena), Building the LAPE Corpus: Challenges in Transcription, Lemmatization, and Statistical Processing of Spoken Egyptian Arabic.

11:15 – 11:45 Marco Venuti (University of Catania), Exploring identity on 4chan: the view of the people on Migration.

11:45 – 12:15 Antonino Andrea Belfiore & Ludovica Blangiforti (University of Catania), Ideological lexicon and textual frequencies: A data-driven approach to comparing Arabic and Western Media.

Organizing Committee: Rosa Pennisi (University of Catania)

For further information, please email Rosa Pennisi (rosa.pennisi@unict.it)

This workshop is part of the activities of the SABIRANET Project (CUP E63C24001920006; ID SOE2024_0000078), funded by the European Union – NextGenerationEU under Italy’s National Recovery and Resilience Plan (PNRR), Young Researcher 2024 – SoE line, administered by the Italian Ministry of University and Research (MUR).

