Deeper Dive
For our project, we created EnDive (English Diversity), a cross-dialect benchmark that evaluates the fairness and performance of large language models (LLMs) across five underrepresented English dialects. EnDive is significant for two reasons. First, it is unusually comprehensive: it spans 12 reasoning tasks and includes more than 300 evaluations of seven of the most widely used LLMs under both zero-shot and chain-of-thought prompting, a scope far broader than that of most natural language processing papers. Second, our use of a wide variety of analytical and evaluation techniques is novel. We employed human validators to verify translation faithfulness, fluency, and formality; used ROUGE diversity scores, lexical diversity evaluations, BARTScore evaluations, and preference tests; and conducted select qualitative analyses.
EnDive builds on our earlier work, AAVENUE, which focused on African American Vernacular English. AAVENUE was recognized at top AI workshops at EMNLP and in the NeurIPS High School Track, and it has been cited by research institutions including Microsoft, Google Research, Oxford, and Stanford. After attending EMNLP ’24 and engaging with leading NLP researchers, including the Stanford SALT Lab researchers we cited in our paper, we recognized the urgent need to expand our benchmark to multiple dialects and reasoning tasks. This led to the creation of EnDive, a more comprehensive framework designed to reveal hidden AI biases and guide the development of more inclusive language technologies.
We believe EnDive can directly improve quality of life for millions of people by helping ensure that AI systems treat all English speakers fairly, regardless of dialect. By revealing systemic performance gaps in LLMs, our benchmark gives developers the tools to identify and address hidden biases before deployment in high-stakes applications such as education, hiring, and healthcare. This work not only promotes equity in AI-powered tools but also affirms the cultural and linguistic identities of communities that have historically been underrepresented and misrepresented in technology. In the long term, we envision EnDive as a foundation for dialect-aware NLP systems that are both powerful and inclusive, contributing to a world where technology serves everyone, not just those who speak Standard American English.