초거대 AI 모델 및 플랫폼 최적화 센터


SNU Thunder-LLM 한국어 벤치마크

SNU Thunder-LLM Korean Benchmark Suite

개요

Overview

Thunder-LLM 한국어 벤치마크는 한국어 언어 모델의 성능을 다양한 과제(task) 전반에 걸쳐 평가할 수 있도록 구성되었습니다. “SNU Ko-”로 시작하는 데이터셋은 기존 영어 벤치마크를 번역하거나, 새롭게 개발한 한국어 벤치마크입니다. 영어 벤치마크의 번역 과정에서는 전문가 교정과 한국 문화·언어에 맞춘 현지화, 그리고 교차 검토 절차를 거쳤습니다. 자체 개발한 벤치마크는 자연스러운 한국어 텍스트를 기반으로 구축되었으며, 완성된 문항은 적합성과 품질을 검수하여 최종 선정했습니다. 이에 더해, 이미 공개된 한국어 벤치마크 중에서도 품질과 활용도가 높은 일부를 선별하여 포함하였습니다. Thunder-LLM 한국어 벤치마크는 한국어 LLM의 성능 분석과 비교 평가를 위해 활용될 수 있으며, 지속적으로 확장 및 개선될 예정입니다.

The Thunder-LLM Korean Benchmark Suite is designed to evaluate Korean language models across a diverse set of tasks. The datasets prefixed with “SNU Ko-” are either English benchmarks translated into Korean or benchmarks newly created for Korean. Each translation underwent expert proofreading, cultural and linguistic localization, and a round of cross-review. The newly created benchmarks were built from natural Korean texts, and all items were reviewed for relevance and quality. In addition, the suite includes a selection of existing public Korean benchmarks chosen for their quality and usefulness. The suite can be used to analyze and compare Korean LLMs and will continue to expand and improve over time.

데이터 구축 방법	이름	분야	문항 개수 (train / val / test)	논문
자체 개발	SNU Ko-LAMBADA	문맥 기반 단어 예측	- / - / 2,255	TBA
자체 개발	SNU Ko-MuSR	한국어 장문 다단계 추론	- / - / 750
영어 벤치마크 번역	SNU Ko-WinoGrande	상식 기반 명사 추론	- / - / 1,267	TBA
영어 벤치마크 번역	SNU Ko-ARC-Challenge	초등 수준 과학 문제 풀이 (어려움)	- / - / 1,167	TBA
영어 벤치마크 번역	SNU Ko-ARC-Easy	초등 수준 과학 문제 풀이 (쉬움)	- / - / 2,376	TBA
영어 벤치마크 번역	SNU Ko-GSM8K	수학 문제 해결	- / - / 1,319	TBA
영어 벤치마크 번역	SNU Ko-IFEval	지시문 이해 및 수행	- / - / 841	TBA
영어 벤치마크 번역	SNU Ko-EQ-Bench	대화 내 감정 추론	- / - / 171	TBA
기존 한국어 벤치마크 선별	KoBEST HellaSwag	문장 끝부분 추론	3,665 / 700 / 1,404
기존 한국어 벤치마크 선별	KMMLU	다양한 분야 지식 평가	208,522 / 225 / 35,030
기존 한국어 벤치마크 선별	KR-HumanEval	파이썬 코드 생성	- / - / 328
기존 한국어 벤치마크 선별	KorQuAD V2.1	지문 기반 정답 추출	83,486 / 10,165 / -

Dataset Construction Method	Name	Evaluation Task	Dataset Size (train / val / test)	Paper
Developed New Benchmark	SNU Ko-LAMBADA	Broad-context word prediction	- / - / 2,255	TBA
Developed New Benchmark	SNU Ko-MuSR	Korean multistep soft reasoning	- / - / 750
Translated Existing English Benchmark	SNU Ko-WinoGrande	Pronoun cloze commonsense reasoning	- / - / 1,267	TBA
Translated Existing English Benchmark	SNU Ko-ARC-Challenge	Grade-school science QA (challenging)	- / - / 1,167	TBA
Translated Existing English Benchmark	SNU Ko-ARC-Easy	Grade-school science QA (easy)	- / - / 2,376	TBA
Translated Existing English Benchmark	SNU Ko-GSM8K	Math problem solving	- / - / 1,319	TBA
Translated Existing English Benchmark	SNU Ko-IFEval	Instruction following	- / - / 841	TBA
Translated Existing English Benchmark	SNU Ko-EQ-Bench	Emotion prediction in dialogue	- / - / 171	TBA
Curated Existing Korean Benchmark	KoBEST HellaSwag	Sentence ending completion	3,665 / 700 / 1,404
Curated Existing Korean Benchmark	KMMLU	Multidomain factual knowledge	208,522 / 225 / 35,030
Curated Existing Korean Benchmark	KR-HumanEval	Python code generation	- / - / 328
Curated Existing Korean Benchmark	KorQuAD V2.1	Span-based extractive QA	83,486 / 10,165 / -

자체 개발 벤치마크

다음은 자체 개발한 2종의 벤치마크(SNU Ko-LAMBADA, SNU Ko-MuSR)에 대한 상세한 설명입니다.

Newly-Developed Benchmarks

The following sections describe the two newly developed benchmarks, SNU Ko-LAMBADA and SNU Ko-MuSR.

1) SNU Ko-LAMBADA

SNU Ko-LAMBADA는 한국어 문맥 이해 능력을 평가하는 벤치마크로, 긴 문맥 속에서 앞선 문장에서 나온 핵심 명사를 빈칸으로 제시하고 이를 예측하게 합니다. 영어 원본 LAMBADA가 문장의 마지막 단어(주로 명사)를 예측하는 구조인 것과 달리, 한국어는 문장 구조상 문장의 마지막 단어가 동사이므로, 문장 중간에 나오는 명사 예측으로 설계되었습니다. 해당 데이터셋은 공유마당에서 수집한 저작권이 만료된 한국 문학 작품을 기반으로 하며, 문맥 이해 능력을 평가할 수 있도록 제작되었습니다.

SNU Ko-LAMBADA is a benchmark designed to evaluate Korean language models’ ability to understand long-range context. Each item blanks out a key noun that appeared earlier in a long passage and asks the model to predict it. The original LAMBADA asks models to predict the final word of a sentence (usually a noun); because Korean sentences typically end with a verb, Ko-LAMBADA instead targets a semantically important noun in mid-sentence. The dataset is built from Korean literary works whose copyright has expired, collected from 공유마당.
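The mid-sentence cloze format described above can be illustrated with a minimal sketch. The names `ClozeItem`, `make_cloze_item`, and `exact_match`, the `____` blank marker, and the example sentences are all hypothetical; the actual Ko-LAMBADA item format and construction procedure are not specified here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

BLANK = "____"  # hypothetical blank marker, not taken from the dataset


@dataclass
class ClozeItem:
    context: str  # passage with the target noun blanked in the last sentence
    answer: str   # the noun the model is expected to recover


def make_cloze_item(
    context_sentences: Sequence[str], final_sentence: str, target_noun: str
) -> Optional[ClozeItem]:
    """Blank the first occurrence of target_noun in final_sentence.

    Returns None unless the noun appears both in the earlier context
    and in the final sentence, so the answer is recoverable from context.
    """
    earlier = " ".join(context_sentences)
    if target_noun not in earlier:
        return None
    idx = final_sentence.find(target_noun)
    if idx == -1:
        return None
    blanked = final_sentence[:idx] + BLANK + final_sentence[idx + len(target_noun):]
    return ClozeItem(context=f"{earlier} {blanked}", answer=target_noun)


def exact_match(prediction: str, answer: str) -> bool:
    """Score a prediction by exact string match after trimming whitespace."""
    return prediction.strip() == answer


item = make_cloze_item(
    ["철수는 시장에서 사과를 샀다."], "집에 돌아와 사과를 깎아 먹었다.", "사과"
)
```

Because the blank sits before a particle (here `를`) rather than at the end of the sentence, the model has to use the preceding context to recover the noun instead of completing a sentence-final verb.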

2) SNU Ko-MuSR

SNU Ko-MuSR은 한국어 장문 맥락에서 다단계 추론 능력을 평가하기 위한 벤치마크입니다. MuSR(Sprague et al., 2024)의 데이터 합성 방법과 같이, 각 지문은 논리적 일관성을 보장하기 위해 중간 단계인 추론 트리(reasoning tree)를 기반으로 생성됩니다. 지문은 LLM 기반 데이터 합성 파이프라인으로 생성되었으며, 연구진은 생성된 문항들이 사실성, 논리성, 자연스러움을 갖추었는지 검수했습니다. 벤치마크는 살인범 추리(Murder Mysteries), 물건 위치 추론(Object Placements), 팀 배정(Team Allocations)의 세 영역으로 구성되어 있으며, 각각 다른 유형의 추론 능력을 평가합니다. SNU Ko-MuSR은 한국어 대형언어모델의 복합 추론 능력을 평가하는 데 중요한 벤치마크로 활용될 것입니다.

SNU Ko-MuSR is a Korean benchmark for evaluating multistep reasoning in long narrative contexts. It follows the MuSR (Sprague et al., 2024) synthesis process and generates data through an LLM-based pipeline, where each narrative is constructed from an intermediate reasoning tree that ensures logical consistency. The dataset includes three subtasks (Murder Mysteries, Object Placements, and Team Allocations), each testing a different type of reasoning. Each item contains a 200–600 word narrative and a multiple-choice question. All instances were human-reviewed for factuality, logical consistency, and linguistic naturalness, making Ko-MuSR suitable for assessing Korean LLMs’ reasoning and prompting strategies.

번역 벤치마크

다음은 번역한 6종의 벤치마크(SNU Ko-WinoGrande, SNU Ko-ARC-Challenge, SNU Ko-ARC-Easy, SNU Ko-GSM8K, SNU Ko-IFEval, SNU Ko-EQ-Bench)에 대한 상세한 설명입니다.

Translated Benchmarks

The following sections describe the six translated benchmarks: SNU Ko-WinoGrande, SNU Ko-ARC-Challenge, SNU Ko-ARC-Easy, SNU Ko-GSM8K, SNU Ko-IFEval, and SNU Ko-EQ-Bench.

1. 초기 기계 번역 및 데이터 생성
영문 벤치마크(ARC, GSM8K, WinoGrande, EQ-Bench, IFEval) 데이터셋을 DeepL API를 활용하여 영어에서 한국어로 기계 번역했습니다.

2. 도메인 전문가의 교정
도메인 전문가들이 다음과 같은 문제를 교정했습니다:
- 오탈자, 라벨 오류, 중복 항목
- 어색하거나 지나치게 직역인 번역
- 어투, 표현 방식, 서식의 불일치
- 부자연스럽거나 모호한 표현

3. 현지화(Localization; 번역의 경우에만 해당함)
다음과 같은 한국의 문화적, 언어적 현지화 작업을 진행했습니다.
- 외국 인명, 단위, 용어 등을 한국식으로 변환 (예: “Jessica” → “지희”, feet → 미터, dollars → 원)
- 문화적으로 낯선 상황을 한국적 맥락에 맞게 수정 (예: 잔디 깎기 → 구두 닦기)

4. 교차 검토
별도의 검토자가 전 데이터셋을 독립적으로 검토하여 논리적 일관성과 문장의 자연스러움을 확인했습니다.

1. Initial Machine Translation / Data Generation
The original English benchmark datasets (ARC, GSM8K, WinoGrande, EQ-Bench, IFEval) were machine-translated from English to Korean using the DeepL API.

2. Expert Correction
Domain experts corrected various issues in the datasets, including:
- Typos, label errors, and duplicate entries
- Awkward or overly literal translations
- Inconsistencies in tone, style, or formatting
- Unnatural or ambiguous expressions

3. Localization (only for translated benchmarks)
To ensure cultural and linguistic appropriateness, the datasets were localized by:
- Replacing foreign names, units, and terms (e.g., “Jessica” → “지희”, feet → meters, dollars → won)
- Modifying culturally unfamiliar scenarios (e.g., lawn mowing → shoe shining)

4. Cross-review
A separate reviewer independently reviewed all datasets to ensure logical coherence and overall fluency.
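The duplicate-entry check mentioned in step 2 could be supported by a simple script that flags candidates for the experts to inspect. The sketch below is only an illustration: the text above describes manual expert correction and does not mention any tooling, and the `normalize` and `find_duplicates` helpers are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict
from typing import Dict, List, Sequence


def normalize(text: str) -> str:
    """Unify Unicode forms, collapse whitespace, and lowercase Latin letters."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def find_duplicates(questions: Sequence[str]) -> List[List[int]]:
    """Return groups of indices whose questions are identical after normalization."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for index, question in enumerate(questions):
        groups[normalize(question)].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


questions = [
    "물은 몇 도에서 끓나요?",
    "물은  몇 도에서 끓나요? ",  # same question with stray whitespace
    "식물은 무엇으로 광합성을 하나요?",
]
duplicate_groups = find_duplicates(questions)
```

A script like this only catches near-verbatim repeats; paraphrased duplicates and label errors would still need human review.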

1) SNU Ko-WinoGrande

Ko-WinoGrande는 앞서 나온 명사 중 빈칸에 들어갈 알맞은 명사를 고르는 것을 통해 상식 추론 능력을 평가하는 벤치마크입니다. 영어 WinoGrande 데이터셋의 validation split을 기반으로 번역되었으며, 문맥에 따라 적절한 명사를 선택하는 방식의 2지선다형 문제로 구성되어 있습니다. 번역 및 검수 과정에서 한국의 문법에 맞는 문장 구조로 조정하고, 외국 이름이나 상황을 한국 문화에 맞게 현지화하여 논리적 일관성과 자연스러운 흐름을 유지하였습니다.

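Ko-WinoGrande assesses commonsense reasoning by asking models to choose which of the previously mentioned nouns correctly fills a blank. Based on the validation split of the English WinoGrande dataset, it consists of two-choice cloze-style questions that require models to infer the correct referent from context. During translation and review, sentences were restructured to fit Korean grammar, and foreign names and scenarios were localized to Korean culture, preserving logical consistency and a natural flow.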
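One common way to evaluate a two-choice cloze item is to fill the blank with each option and keep the option whose completed sentence the model scores higher. The sketch below assumes a `_` blank marker (as in the original WinoGrande) and takes the scoring function as a parameter; `score` stands in for something like a model log-likelihood and is not part of any specific library. The example sentence is made up for illustration.

```python
from typing import Callable, Sequence, Tuple


def fill_blank(sentence: str, option: str, blank: str = "_") -> str:
    """Replace the first blank marker with the candidate noun."""
    return sentence.replace(blank, option, 1)


def pick_option(
    sentence: str, options: Sequence[str], score: Callable[[str], float]
) -> int:
    """Return the index of the option whose filled sentence scores highest."""
    scores = [score(fill_blank(sentence, option)) for option in options]
    return max(range(len(options)), key=scores.__getitem__)


def accuracy(
    items: Sequence[Tuple[str, Sequence[str], int]], score: Callable[[str], float]
) -> float:
    """Fraction of (sentence, options, gold_index) items answered correctly."""
    correct = sum(
        pick_option(sentence, options, score) == gold
        for sentence, options, gold in items
    )
    return correct / len(items)


def toy_score(text: str) -> float:
    """Toy scorer for illustration only; a real run would query a language model."""
    return 1.0 if "트로피가 너무" in text else 0.0


items = [("트로피가 상자에 들어가지 않았다. _가 너무 컸기 때문이다.", ["트로피", "상자"], 0)]
result = accuracy(items, toy_score)
```

Passing the scorer in as a function keeps the selection logic independent of any particular model or inference library.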

2) SNU Ko-ARC-Challenge

Ko-ARC-Challenge는 초등학생 수준의 과학 문제를 다루는 비교적 고난이도 평가 벤치마크입니다. 원본 ARC-Challenge 데이터셋을 기반으로 번역, 교정, 현지화 과정을 거쳐 제작되었습니다. 영어 문장에서 반복되던 중복 항목, 잘못된 정답 레이블 등의 오류도 수정되었으며, 질문 형식도 한국어로 자연스럽게 다듬어졌습니다.

Ko-ARC-Challenge translates the ARC-Challenge dataset into Korean and includes difficult science questions for elementary students. It corrects issues in the original data (e.g., duplication, label errors), and localizes content to fit the Korean curriculum. Question styles were standardized for natural fluency.

3) SNU Ko-ARC-Easy

Ko-ARC-Easy는 초등학생 수준의 비교적 쉬운 과학 문제로 구성된 벤치마크입니다. 원본 ARC-Easy 데이터셋을 도메인 전문가들이 표현 교정 및 현지화를 수행하였습니다. 예를 들어, "Which of the 27 following is the best use of a robot?"에서 "27" 같은 불필요한 표현을 제거하고, 질문 종결 표현을 “무엇인가요?”와 같이 표준화하여 문장 일관성을 확보했습니다.

Ko-ARC-Easy is a benchmark consisting of relatively easy science questions at the elementary school level. Domain experts revised and localized the original ARC-Easy dataset. For example, unnecessary phrases like “27” in the sentence “Which of the 27 following is the best use of a robot?” were removed, and question endings were standardized to expressions like “무엇인가요?” to ensure consistency and fluency in Korean.

4) SNU Ko-GSM8K

Ko-GSM8K는 수학 문제 해결 능력을 평가하는 벤치마크입니다. 번역 과정에서 수량 단위(피트 → 미터, 달러 → 원), 이름, 문제 상황 등을 한국 실정에 맞게 현지화하였습니다. 풀이 형식은 자연어 해설과 중간 연산식이 포함된 형태로 유지되며, 도메인 전문가들이 계산 오류, 비문 등을 교정하였습니다.

Ko-GSM8K is a benchmark designed to evaluate mathematical problem-solving skills. During translation, quantities (e.g., feet → meters, dollars → won), names, and problem scenarios were localized to better reflect Korean contexts. The solution format preserves natural language explanations along with intermediate calculation steps. Domain experts corrected calculation errors, ungrammatical expressions, and other inconsistencies.
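When scoring a benchmark of this kind, the final numeric answer has to be pulled out of a free-form solution. The sketch below assumes Ko-GSM8K keeps the conventions of the original English GSM8K, where intermediate calculations are annotated as `<<48/2=24>>` and the final answer follows a `####` marker; whether the Korean version preserves these exact markers is an assumption, and the helper names are hypothetical.

```python
import re
from typing import Optional

# Final-answer line in the original GSM8K format, e.g. "#### 1,250".
FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_final_answer(solution: str) -> Optional[str]:
    """Return the number after '####' with thousands separators removed."""
    match = FINAL_ANSWER.search(solution)
    if match is None:
        return None
    return match.group(1).replace(",", "")


def answers_match(predicted: Optional[str], gold: Optional[str]) -> bool:
    """Compare two extracted answers numerically, so '24.0' equals '24'."""
    if predicted is None or gold is None:
        return False
    return abs(float(predicted) - float(gold)) < 1e-6


reference = "사과 48개를 두 명이 나누면 48/2 = <<48/2=24>>24개씩입니다.\n#### 24"
gold_answer = extract_final_answer(reference)
```

Comparing numerically rather than as strings avoids penalizing a model for formatting differences such as a trailing `.0` or a thousands separator.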

5) SNU Ko-IFEval

Ko-IFEval은 자연어 지시문을 언어 모델이 정확히 따를 수 있는지를 평가하는 벤치마크입니다. 각 항목은 특정 형식이나 제약 조건(예: 쉼표 금지, 특정 길이 미만, 마크다운 형식 등)을 만족해야 하며, 다양한 형태의 지시문 수행 평가에 활용할 수 있습니다. 한국어 스타일에 맞게 명령문의 어조, 표현, 문화적 맥락 등을 수정하였습니다.

Ko-IFEval is a benchmark designed to evaluate whether a language model can follow natural language instructions precisely. Each item requires the model to meet specific formatting or constraint conditions (e.g., no commas, under a certain length, Markdown format, etc.), and it can be used to assess a wide range of instruction-following capabilities. The tone, expressions, and cultural context of commands were revised to suit natural Korean usage.
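The constraint types listed above lend themselves to simple programmatic checks. The following sketch shows what such verifiers might look like; the function names, the character-based length limit, and the heading-based Markdown check are illustrative assumptions rather than the benchmark's actual checkers.

```python
import re
from typing import Callable, Sequence


def no_commas(response: str) -> bool:
    """Constraint: the response must not contain any comma."""
    return "," not in response


def under_char_limit(response: str, max_chars: int) -> bool:
    """Constraint: the response must be shorter than max_chars characters."""
    return len(response) < max_chars


def has_markdown_heading(response: str) -> bool:
    """Constraint: at least one line is a Markdown heading such as '# 제목'."""
    return any(re.match(r"#{1,6} \S", line) for line in response.splitlines())


def follows_all(response: str, checks: Sequence[Callable[[str], bool]]) -> bool:
    """An item passes only if every attached constraint is satisfied."""
    return all(check(response) for check in checks)


response = "# 서울 소개\n서울은 대한민국의 수도입니다."
passed = follows_all(
    response,
    [no_commas, has_markdown_heading, lambda r: under_char_limit(r, 100)],
)
```

Because each check is deterministic, results are reproducible and do not depend on a judge model.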

6) SNU Ko-EQ-Bench

Ko-EQ-Bench는 대화 속 등장인물의 감정 반응을 예측하게 함으로써 감성 추론 능력을 평가하는 벤치마크입니다. 한국어로 번역 및 현지화된 대화를 바탕으로, 특정 감정(예: 분노, 희망, 공포 등)에 대해 0~10 사이의 점수를 예측하도록 구성되어 있습니다. 한국어 어투에 맞춰 인물 호칭이나 표현을 자연스럽게 조정하고, 감정에 대한 번역상의 애매함을 줄이기 위해 영문으로 된 감정의 한국어 의미를 매핑한 표를 활용했습니다.

Ko-EQ-Bench evaluates a model’s ability to perform emotional inference by predicting the emotional responses of characters in a conversation. Based on translated and localized Korean dialogues, the benchmark asks models to rate specific emotions (e.g., anger, hope, fear) on a scale from 0 to 10. Character names and expressions were adapted to match natural Korean speech, and a semantic mapping table was used to reduce ambiguity in emotion translation.
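As an illustration of the rating task, the sketch below parses a model's `emotion: score` lines and measures how far they fall from reference ratings. The output format, the helper names, and the mean-absolute-difference measure are assumptions made for this example; the official EQ-Bench scoring formula is different and is not reproduced here.

```python
from typing import Dict, Optional


def parse_ratings(text: str) -> Dict[str, float]:
    """Parse lines like '분노: 7' into {'분노': 7.0}, keeping only scores in [0, 10]."""
    ratings: Dict[str, float] = {}
    for line in text.splitlines():
        name, separator, value = line.partition(":")
        if not separator:
            continue  # no colon on this line
        try:
            score = float(value.strip())
        except ValueError:
            continue  # the part after the colon is not a number
        if 0 <= score <= 10:
            ratings[name.strip()] = score
    return ratings


def mean_abs_diff(
    predicted: Dict[str, float], reference: Dict[str, float]
) -> Optional[float]:
    """Average absolute gap per emotion; None if the emotion sets differ or are empty."""
    if not reference or set(predicted) != set(reference):
        return None
    return sum(abs(predicted[k] - reference[k]) for k in reference) / len(reference)


model_output = "분노: 7\n희망: 2\n(설명 생략)"
predicted = parse_ratings(model_output)
gap = mean_abs_diff(predicted, {"분노": 5.0, "희망": 2.0})
```

Returning `None` for incomplete answers keeps unparseable responses distinguishable from genuinely poor ratings.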

Acknowledgements

본 연구는 과학기술정보통신부 선도연구센터사업(ERC)의 지원을 받아 수행된 연구입니다 (과제번호: RS-2023-00222663, 초거대 AI 모델 및 플랫폼 최적화 센터). 또한, GPU 장비는 과학기술정보통신부·광주광역시가 공동 지원한 '인공지능 중심 산업융합 집적단지 조성사업'의 지원을 받았습니다.

This work was supported by the National Research Foundation of Korea (NRF) under Grant No. RS-2023-00222663 (Center for Optimizing Hyperscale AI Models and Platforms, ERC). This research was also supported by the Artificial Intelligence Industrial Convergence Cluster Development Project funded by the Ministry of Science and ICT (MSIT, Korea) and Gwangju Metropolitan City.

Contributors

소연경⁺, 김종민⁺, 이규성⁺, 박찬우*, 김상호⁺, 박종연⁺, 이준학⁺, 정성목⁺, 표세호⁺, 조경제⁺, 김서린⁺, 김지수⁺, 박수영⁺, 박현지⁺, 서영호*, 안예림⁺, 강지아⁺, 배수민*, 강민규*, 이재진*⁺

* 서울대학교 컴퓨터공학부
⁺ 서울대학교 데이터사이언스대학원

Yeonkyoung So⁺, Jongmin Kim⁺, Gyuseong Lee⁺, Chanwoo Park*, Sangho Kim⁺, Jongyeon Park⁺, Joonhak Lee⁺, Sungmok Jung⁺, Seho Pyo⁺, Gyeongje Cho⁺, Seorin Kim⁺, Jisoo Kim⁺, Suyoung Park⁺, Hyunji Park⁺, Yeongho Seo*, Yelim Ahn⁺, JiA Kang⁺, Sumin Bae*, Mingyu Kang*, Jaejin Lee*⁺

* Department of Computer Science and Engineering, Seoul National University
⁺ Graduate School of Data Science, Seoul National University