
Commit 1444366

Update thesis.md

1 parent 3b526a3 commit 1444366

1 file changed: _pages/thesis.md (7 additions, 7 deletions)
@@ -256,7 +256,7 @@ Other projects in summarization or simplification (e.g., resource building, mult
 - *Improving the Cross-Lingual Alignment of IR Models.* Machine Translation (MT) and Cross-Lingual Information Retrieval (CLIR) are two interconnected Natural Language Processing (NLP) tasks. In CLIR, MT is typically used to translate queries at retrieval time (translate test) or to translate training data (translate train). Recent studies have proposed a novel approach: aligning the representation spaces of MT models with those of large language models to improve their performance in multilingual NLP tasks ([Acharya et al., 2025](https://aclanthology.org/2025.naacl-long.220.pdf); [Schmidt et al., 2024](https://aclanthology.org/2024.findings-emnlp.394.pdf); [Yoon et al., 2024](https://aclanthology.org/2024.acl-long.405/)). The goal of this project is to investigate the effectiveness of aligning internal representations within LLM-based IR models, and compare their performance to translation-based methods for CLIR across both high- and low-resource languages. Depending on the student’s interests, the focus of this project can be on cross-lingual retrieval or reranking.
 **Level: MSc.**
 
-- *Evaluating the quality and difficulty of exam questions.* When high-stakes educational assessments (like university entrance exams) are being developed, they usually need to be piloted with a large sample of test takers to make sure that the questions are of appropriate quality and difficulty. Recently, researchers have tried to use LLMs to predict these properties automatically, either directly or by simulating test takers at various levels of ability. However, out-of-the-box LLMs are not very good at this. A student project could experiment with fine-tuning LLMs to match different ability levels or to directly evaluate properties such as item difficulty, discrimination, or guessability.
+- :hourglass_flowing_sand: *Evaluating the quality and difficulty of exam questions.* When high-stakes educational assessments (like university entrance exams) are being developed, they usually need to be piloted with a large sample of test takers to make sure that the questions are of appropriate quality and difficulty. Recently, researchers have tried to use LLMs to predict these properties automatically, either directly or by simulating test takers at various levels of ability. However, out-of-the-box LLMs are not very good at this. A student project could experiment with fine-tuning LLMs to match different ability levels or to directly evaluate properties such as item difficulty, discrimination, or guessability.
 **References:** [Yaneva et al. (2024)](https://aclanthology.org/2024.bea-1.39/), [Acquaye et al. (2025)](https://arxiv.org/abs/2601.09953), [Liu et al. (2025)](https://doi.org/10.1111/bjet.13570), [Gorgun & Bulut (2024)](https://doi.org/10.1111/emip.12663), [Laverghetta et al. (2022)](https://doi.org/10.1007/978-3-031-04572-1_12)
 **Level: MSc.**
 
@@ -265,7 +265,7 @@ Other projects in summarization or simplification (e.g., resource building, mult
 
 <!--
 ### **Selected Research Projects**
-- *NLP for Job Market Analysis*. Job postings are a rich resource to understand the
+- :hourglass_flowing_sand: *NLP for Job Market Analysis*. Job postings are a rich resource to understand the
 dynamics of the labor market, including which skills are demanded, which is also
 important from an educational viewpoint. Recently, the emerging line of work on
 computational job market analysis or NLP for human resources has started to
@@ -389,7 +389,7 @@ This theme covers aspects of uncertainty and ambiguity in human language and per
 - *Interpreting Visual Ambiguity: Humans and Vision-Language Models.* The objective of this thesis is to investigate ambiguity in images and to compare how humans and Vision-Language Models perceive and resolve such ambiguity. Ambiguous images often contain multiple salient elements or support multiple plausible interpretations. While humans shift attention depending on context, expectations, and individual differences, what do models do? Depending on the student’s interests, this thesis can focus more on modeling approaches or on human strategies for processing ambiguity (e.g., through behavioral or eye-tracking experiments). **References:** [Hindennach et al., 2024](https://dl.acm.org/doi/10.1145/3649902.3656356), [Testoni et al., 2025](https://aclanthology.org/2025.emnlp-main.1206.pdf)
 **Level: MSc.**
 
-- *Beyond Probability Metrics: Evaluating Free-Form Rationales for Disagreement.* Current approaches to Human Label Variation (HLV) mostly treat the problem as a distribution-matching task. We evaluate models using metrics like KL-divergence or Jensen-Shannon distance to see if the predicted probabilities align with the crowd. However, getting the numbers right doesn't mean the model understands the disagreement. A model might predict a 60/40 split for the wrong reasons, or hallucinate a conflict where none exists. As we move towards "Glass-Box" NLP, we need models that can explicitly justify why a sample is ambiguous through Explanations ([Chen et al. 2025a](https://aclanthology.org/2025.findings-acl.562/)) or Chain-of-Thought (CoT, [Chen et al. 2025b](https://aclanthology.org/2025.emnlp-main.1682/)).
+- :hourglass_flowing_sand: *Beyond Probability Metrics: Evaluating Free-Form Rationales for Disagreement.* Current approaches to Human Label Variation (HLV) mostly treat the problem as a distribution-matching task. We evaluate models using metrics like KL-divergence or Jensen-Shannon distance to see if the predicted probabilities align with the crowd. However, getting the numbers right doesn't mean the model understands the disagreement. A model might predict a 60/40 split for the wrong reasons, or hallucinate a conflict where none exists. As we move towards "Glass-Box" NLP, we need models that can explicitly justify why a sample is ambiguous through Explanations ([Chen et al. 2025a](https://aclanthology.org/2025.findings-acl.562/)) or Chain-of-Thought (CoT, [Chen et al. 2025b](https://aclanthology.org/2025.emnlp-main.1682/)).
 While we have metrics for label distributions, we lack a robust framework for evaluating the quality of the reasoning behind the variation. How do we judge if a free-text explanation accurately captures a pragmatic ambiguity versus a semantic one? Does the generated CoT truly reflect the linguistic nuance that causes humans to disagree? This thesis aims to design an evaluation framework for Free-Form HLV Rationales. The student will move beyond standard overlap metrics (like BLEU/ROUGE) and develop metrics, potentially LLM-based or taxonomy-guided (e.g., using LITEX, [Hong et al. 2025a](https://aclanthology.org/2025.emnlp-main.1728/)), to assess: 1. Faithfulness: Does the explanation actually align with the predicted label distribution? 2. Coverage: Does the CoT capture all valid perspectives (the "Yes" view and the "No" view) or does it collapse into a single viewpoint? 3. Linguistic Validity: Can we quantify the "quality" of the ambiguity detection?
 **Level: MSc.**
 
@@ -407,7 +407,7 @@ This thesis explores Personalized HLV to disentangle ambiguity from preference.
 **References:** [Keuleers & Brysbaert (2010)](https://doi.org/10.3758/BRM.42.3.627), [New et al. (2024)](https://doi.org/10.1177/17470218231164373)
 **Level: BSc or MSc.**
 
-- *Interpreting Irony in Multimodal Memes.* Internet memes are a widely used form of online communication ([Shifman, 2013](https://onlinelibrary.wiley.com/doi/full/10.1111/jcc4.12013)) and often express irony, where the literal meaning of the text or image differs from the intended message. This mismatch makes automatic meme understanding difficult, as multimodal large language models (MLLMs) may assign incorrect sentiment or intent labels when they rely on surface-level cues ([Fersini et al., 2022](https://aclanthology.org/2022.semeval-1.74/), [Nguyen et al., 2025](https://aclanthology.org/2025.acl-long.927/)). Previous research ([Ilic et al., 2018](https://aclanthology.org/W18-6202.pdf)) has studied irony and sarcasm in text and has addressed meme classification tasks ([Kiela et al., 2020](https://proceedings.nips.cc/paper_files/paper/2020/file/1b84c4cee2b8b3d823b30e2d604b1878-Paper.pdf), [Liu et al., 2022](https://aclanthology.org/2022.emnlp-main.476/)), but the specific ways in which irony arises from text, images, or their interaction are not yet fully understood. The goal of this thesis is to analyze how irony is constructed in multimodal memes and to evaluate how MLLMs detect and explain ironic meaning. Possible research directions include defining or refining a taxonomy of irony in memes, annotating or analyzing existing meme datasets, and assessing model performance in irony detection and explanation, including consistency between predicted labels and generated explanations as well as common error patterns.
+- :hourglass_flowing_sand: *Interpreting Irony in Multimodal Memes.* Internet memes are a widely used form of online communication ([Shifman, 2013](https://onlinelibrary.wiley.com/doi/full/10.1111/jcc4.12013)) and often express irony, where the literal meaning of the text or image differs from the intended message. This mismatch makes automatic meme understanding difficult, as multimodal large language models (MLLMs) may assign incorrect sentiment or intent labels when they rely on surface-level cues ([Fersini et al., 2022](https://aclanthology.org/2022.semeval-1.74/), [Nguyen et al., 2025](https://aclanthology.org/2025.acl-long.927/)). Previous research ([Ilic et al., 2018](https://aclanthology.org/W18-6202.pdf)) has studied irony and sarcasm in text and has addressed meme classification tasks ([Kiela et al., 2020](https://proceedings.nips.cc/paper_files/paper/2020/file/1b84c4cee2b8b3d823b30e2d604b1878-Paper.pdf), [Liu et al., 2022](https://aclanthology.org/2022.emnlp-main.476/)), but the specific ways in which irony arises from text, images, or their interaction are not yet fully understood. The goal of this thesis is to analyze how irony is constructed in multimodal memes and to evaluate how MLLMs detect and explain ironic meaning. Possible research directions include defining or refining a taxonomy of irony in memes, annotating or analyzing existing meme datasets, and assessing model performance in irony detection and explanation, including consistency between predicted labels and generated explanations as well as common error patterns.
 **Level: BSc**, adaptation to MSc possible.
 
 - *Gaze data for NLP.* The way in which our eyes move when reading a text can tell us a lot about the cognitive processes required for language understanding. For example, longer reading times indicate higher processing difficulty. In the past 10 years, a line of research has emerged that attempts to use gaze data obtained by eye tracking to improve NLP models for various tasks (e.g., [Barrett et al., 2016](https://aclanthology.org/P16-2094/), [Hollenstein & Zhang, 2019](https://aclanthology.org/N19-1001/), [Deng et al., 2023](https://aclanthology.org/2023.emnlp-main.400/), [Alaçam et al., 2024](https://aclanthology.org/2024.emnlp-main.11/)). A student project could involve investigating under which circumstances and for which NLP tasks gaze data can be beneficial, or whether we can achieve the same effect with artificially synthesized gaze data. **References:** [Hollenstein et al. (2019)](https://arxiv.org/abs/1904.02682), [Sood et al. (2020)](https://proceedings.neurips.cc/paper/2020/hash/460191c72f67e90150a093b4585e7eb4-Abstract.html), [Khurana et al. (2023)](https://aclanthology.org/2023.eacl-main.139/), [Bolliger et al. (2023)](https://aclanthology.org/2023.emnlp-main.960/).
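The gaze bullet above rests on a simple mechanism: per-token reading measures (e.g. total fixation duration) are attached to a model's token features as an auxiliary signal. A minimal sketch of that feature attachment; the helper name `attach_gaze_features` and all durations are invented, not taken from any cited paper:

```python
def attach_gaze_features(embeddings, reading_times):
    """Append a z-normalized total-reading-time feature to each token vector.

    embeddings:    list of per-token feature vectors (lists of floats)
    reading_times: per-token total fixation duration in ms (same length)
    """
    n = len(reading_times)
    mean = sum(reading_times) / n
    var = sum((t - mean) ** 2 for t in reading_times) / n
    std = var ** 0.5 or 1.0  # fall back to 1.0 if all durations are equal
    return [vec + [(t - mean) / std]
            for vec, t in zip(embeddings, reading_times)]

# Toy example: 3 tokens with 2-dim embeddings and fixation durations;
# longer duration roughly signals higher processing difficulty
emb = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5]]
gaze = [180.0, 420.0, 250.0]
augmented = attach_gaze_features(emb, gaze)
```

The same interface would accept synthesized gaze values, which is exactly the substitution the project proposes to evaluate.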
@@ -460,16 +460,16 @@ lower scope in the domain of ATS are also possible).
 
 ### **Thesis projects**
 
-- *Aggregation and Multi-Agent Systems.* Multi-Agent Systems (MAS) are ensembles of Language Models that solve tasks as a group. The performance of MAS improves with a higher diversity between agents and their individual solutions. However, to evaluate the performance, the models need to reach consensus or their individual answers must be aggregated. This project offers multiple directions: either aiming at understanding the impact of aggregation on the group solution or analyzing the diversity of the individual solutions. Example projects: a) comparing different aggregation methods and evaluating the faithfulness of LMs generating the final answer (for Master students: improving existing aggregation methods / comparing the impact on different multi-agent approaches); b) making non-aggregated solutions comprehensible for humans, e.g. through visualization, summarization, and/or formalization (for Master students: + evaluation of non-aggregated solutions / comparing different multi-agent approaches); c) other projects regarding MAS. **References:** [Du et al. (2024)](https://dl-acm-org.emedien.ub.uni-muenchen.de/doi/10.5555/3692070.3692537), [Casola et al. (2023)](https://aclanthology.org/2023.emnlp-main.212/), [Surowiecki (2005) (Introduction)](https://opac.ub.lmu.de/Record/3146851?sid=71261150).
+- :hourglass_flowing_sand: *Aggregation and Multi-Agent Systems.* Multi-Agent Systems (MAS) are ensembles of Language Models that solve tasks as a group. The performance of MAS improves with a higher diversity between agents and their individual solutions. However, to evaluate the performance, the models need to reach consensus or their individual answers must be aggregated. This project offers multiple directions: either aiming at understanding the impact of aggregation on the group solution or analyzing the diversity of the individual solutions. Example projects: a) comparing different aggregation methods and evaluating the faithfulness of LMs generating the final answer (for Master students: improving existing aggregation methods / comparing the impact on different multi-agent approaches); b) making non-aggregated solutions comprehensible for humans, e.g. through visualization, summarization, and/or formalization (for Master students: + evaluation of non-aggregated solutions / comparing different multi-agent approaches); c) other projects regarding MAS. **References:** [Du et al. (2024)](https://dl-acm-org.emedien.ub.uni-muenchen.de/doi/10.5555/3692070.3692537), [Casola et al. (2023)](https://aclanthology.org/2023.emnlp-main.212/), [Surowiecki (2005) (Introduction)](https://opac.ub.lmu.de/Record/3146851?sid=71261150).
 **Level: BSc or MSc.**
 
 - *Analyzing shortcut learning in VLMs across NLI and visual entailment.* Vision-language models (VLMs) achieve strong performance on many tasks, yet they can exhibit shortcut learning, where predictions rely on simple input patterns rather than on a full use of the available evidence. For LLMs, this behavior has been observed in NLI, which asks whether a hypothesis follows from a given premise. Prior work has shown that models can often solve NLI by relying mainly on cues in the hypothesis, without fully capturing the relationship between the premise and the hypothesis ([Poliak et al., 2018](https://aclanthology.org/S18-2023/), [Yuan et al., 2024](https://aclanthology.org/2024.emnlp-main.679/)). In visual entailment, the premise is an image rather than a text ([Xie et al., 2019](https://arxiv.org/abs/1901.06706), [Kayser et al., 2021](https://openaccess.thecvf.com/content/ICCV2021/papers/Kayser_E-ViL_A_Dataset_and_Benchmark_for_Natural_Language_Explanations_in_ICCV_2021_paper.pdf)). The goal of this project is to investigate whether similar shortcut behavior occurs in vision-language models when performing visual entailment, and to analyze which visual and textual information models rely on when making inferences. The scope can be adjusted for BSc or MSc, for example by varying the number of models, prompting strategies, or the depth and types of cues analyzed.
 **Level: BSc or MSc.**
 
-- *Data Mining and LLM-as-a-Judge to better understand LLM behavior.* While the behavior of LLMs and their nuanced and complex output data are challenging to evaluate, data mining approaches can be leveraged to explain model behavior, to bring structure into evaluation, and to gain new insights, e.g. on cultural biases or task failure [1]. In this thesis project, we want to take this approach further by evaluating the use of newly proposed data mining algorithms and/or the combination of LLM-as-a-Judge with data mining processes. The project offers the possibility to work on a technical evaluation of methods as well as to develop and evaluate a new method. **References:** [1] [https://aclanthology.org/2025.acl-long.985/](https://aclanthology.org/2025.acl-long.985/)
+- :hourglass_flowing_sand: *Data Mining and LLM-as-a-Judge to better understand LLM behavior.* While the behavior of LLMs and their nuanced and complex output data are challenging to evaluate, data mining approaches can be leveraged to explain model behavior, to bring structure into evaluation, and to gain new insights, e.g. on cultural biases or task failure [1]. In this thesis project, we want to take this approach further by evaluating the use of newly proposed data mining algorithms and/or the combination of LLM-as-a-Judge with data mining processes. The project offers the possibility to work on a technical evaluation of methods as well as to develop and evaluate a new method. **References:** [1] [https://aclanthology.org/2025.acl-long.985/](https://aclanthology.org/2025.acl-long.985/)
 **Level: MSc.**
 
-- *Understanding Post-Training Effects Through Model Behavior Analysis and Interpretability.* Post-training has become an essential technique for adapting pretrained language models, e.g. to improve instruction following [1] or abilities for underrepresented languages [2], or to align model behavior with safety standards [3]. Correctly adapting models through post-training is, however, a complex and difficult process which can, e.g., trigger broad misalignment and unexpected effects like safety failures [4]. To better control post-training, it is crucial to understand how models change during the process.
+- :hourglass_flowing_sand: *Understanding Post-Training Effects Through Model Behavior Analysis and Interpretability.* Post-training has become an essential technique for adapting pretrained language models, e.g. to improve instruction following [1] or abilities for underrepresented languages [2], or to align model behavior with safety standards [3]. Correctly adapting models through post-training is, however, a complex and difficult process which can, e.g., trigger broad misalignment and unexpected effects like safety failures [4]. To better control post-training, it is crucial to understand how models change during the process.
 This thesis will study the effects of post-training through a dual lens. Through model behavior analysis tools like Spotlight [5], it will explore how a model changes with respect to non-performance metrics like gender [6] and cultural biases [7]. Using probing, the logit lens, or other interpretability techniques, it will then go one step further and start explaining how these changes occur within the model. Depending on scope and resource availability, this thesis can either work with existing models (checkpoints) or post-train specific model aspects.
 **References:**
 [1] [Ouyang et al. (2022): Training language models to follow instructions with human feedback. arXiv 2203.02155.](https://arxiv.org/pdf/2203.02155)
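The aggregation question in the multi-agent bullet of the last hunk can be grounded with the simplest baseline: plurality voting over the agents' final answers. A sketch under the assumption that answers are directly comparable strings; the agents and their outputs are invented:

```python
from collections import Counter

def majority_vote(answers):
    """Aggregate final answers from multiple agents by plurality.

    Ties are broken in favor of the answer encountered first, following
    Counter.most_common's insertion-order behavior for equal counts.
    """
    if not answers:
        raise ValueError("no agent answers to aggregate")
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)  # answer plus agreement ratio

# Toy example: five agents answer a multiple-choice question
answer, agreement = majority_vote(["B", "B", "A", "B", "C"])
print(answer, agreement)  # B with 3/5 agreement
```

The agreement ratio is one crude proxy for the solution diversity the bullet mentions; richer aggregation (e.g. debate rounds or an LM that synthesizes the final answer) is where the faithfulness questions in direction a) arise.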