UK Exeter Talk: Developing Turkish Language Models: Tokenization, Data Quality and Domain Adaptation

This video features my seminar presentation at the University of Exeter, where I shared the core findings of my PhD research. The talk focuses on building more robust, efficient, and scalable language models for morphologically rich and relatively low-resource languages, using Turkish as the primary case. I approach the LLM development pipeline from a holistic perspective, discussing not a single bottleneck but the interconnected challenges of evaluation, tokenization, data quality, and domain adaptation.

Key topics covered in this talk:
• The critical need for robust evaluation: Building the TR-MMLU ecosystem
• Morphologically aware tokenizer design for Turkish
• Measuring tokenization efficiency with TR and Pure metrics
• Data quality-aware learning approaches
• Domain-specific model adaptations, featuring a Turkish medical dataset
• Embedding generation: Insights into the Magibu 200M model
• Open-source AI tools, datasets, and accessible benchmarks

My ultimate goal with this research is to foster an open-science ecosystem for Turkish NLP that prioritizes high representational power, computational efficiency, and reproducibility.

Thank you to the academic community in the UK for the engaging discussions, and to everyone watching.