Center for Optimizing Hyperscale AI Models and Platforms


SNU Thunder-LLM English Benchmark Suite

Overview

The Thunder-LLM English Benchmark Suite is designed to evaluate English language models across a diverse set of tasks. SNU Thunder-NUBench is a newly developed benchmark for negation understanding, constructed through careful curation and cross-review by the Thunder Research Group. The suite also includes a selection of existing public English benchmarks, chosen for their quality and wide use, to broaden coverage and support reliable evaluation. The suite can be used to analyze and compare English LLMs, and it will continue to be expanded and improved.

Dataset Construction Method	Name	Evaluation Task	Dataset Size (train / val / test)
Newly developed	SNU Thunder-NUBench	Negation understanding	3,772 / 100 / 1,002
Curated existing English benchmark	WinoGrande	Pronoun cloze commonsense reasoning	40,398 / 1,267 / 1,767
Curated existing English benchmark	ARC-Challenge	Grade-school science QA (challenging)	1,119 / 299 / 1,172
Curated existing English benchmark	ARC-Easy	Grade-school science QA (easy)	2,251 / 570 / 2,376
Curated existing English benchmark	GSM8K	Math problem solving	7,473 / - / 1,319
Curated existing English benchmark	IFEval	Instruction following	- / - / 541
Curated existing English benchmark	HellaSwag	Sentence ending completion	39,905 / 10,042 / 10,003
Curated existing English benchmark	MMLU	Multidomain factual knowledge	99,842 / 1,531 / 14,042
Curated existing English benchmark	HumanEval	Python code generation	- / - / 164
Curated existing English benchmark	OpenBookQA	Open-book, elementary-school science QA	4,957 / 500 / 500
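The split sizes in the table can be tallied with a short script. The following is a minimal sketch: the numbers are copied from the table, while the names `SPLITS`, `total_items`, and `test_only_benchmarks` are made up for this illustration and are not part of any released tooling.

```python
# Split sizes (train, val, test) copied from the table above.
# None marks a split that the table lists as "-".
SPLITS = {
    "SNU Thunder-NUBench": (3772, 100, 1002),
    "WinoGrande": (40398, 1267, 1767),
    "ARC-Challenge": (1119, 299, 1172),
    "ARC-Easy": (2251, 570, 2376),
    "GSM8K": (7473, None, 1319),
    "IFEval": (None, None, 541),
    "HellaSwag": (39905, 10042, 10003),
    "MMLU": (99842, 1531, 14042),
    "HumanEval": (None, None, 164),
    "OpenBookQA": (4957, 500, 500),
}


def total_items(split_index: int) -> int:
    """Sum one split (0=train, 1=val, 2=test) over all benchmarks, skipping missing splits."""
    return sum(s[split_index] for s in SPLITS.values() if s[split_index] is not None)


def test_only_benchmarks() -> list[str]:
    """Return the benchmarks that provide neither a train nor a validation split."""
    return [name for name, (train, val, _) in SPLITS.items() if train is None and val is None]


if __name__ == "__main__":
    print("test items in the suite:", total_items(2))
    print("test-only benchmarks:", test_only_benchmarks())
```

Under these numbers the suite contains 32,886 test items in total, and IFEval and HumanEval are the two benchmarks that ship only a test split.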

Newly Constructed Benchmarks

The following is a detailed description of the one newly developed benchmark, SNU Thunder-NUBench.

1) SNU Thunder-NUBench

Thunder-NUBench (Negation Understanding Benchmark) is designed specifically to evaluate the sentence-level negation understanding of large language models (LLMs). Many prior benchmarks treat negation as a simple syntactic feature or a secondary property of language. Thunder-NUBench instead provides semantically rich, manually constructed and proofread sentence-negation pairs in a multiple-choice format that contrasts standard negation with distractors that are structurally similar but semantically different: local negation (negating only part of the sentence), contradiction, and paraphrase. The goal is to measure how well language models understand the meaning of negation, which plays an important role in human language.
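To make the multiple-choice format concrete, the sketch below models one item and scores a predictor on it. Everything here is an assumption made for illustration: the `NegationItem` class, its field names, the choice-type labels, and the example sentences are invented and do not reflect the actual dataset schema or its contents.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical label for the correct option; the other three are distractor types.
CORRECT = "standard_negation"


@dataclass
class NegationItem:
    sentence: str            # original sentence
    choices: dict[str, str]  # choice type -> candidate sentence


# Invented example, not taken from the benchmark.
EXAMPLE = NegationItem(
    sentence="She finished the report and sent it to her manager.",
    choices={
        "standard_negation": "She did not finish the report or did not send it to her manager.",
        "local_negation": "She finished the report and did not send it to her manager.",
        "contradiction": "She deleted the report.",
        "paraphrase": "She completed the report and sent it to her manager.",
    },
)


def score(items, predict):
    """Return (accuracy, error counts by distractor type).

    `predict(item)` must return the choice type the model selected.
    """
    picks = Counter(predict(item) for item in items)
    accuracy = picks[CORRECT] / len(items) if items else 0.0
    errors = {kind: n for kind, n in picks.items() if kind != CORRECT}
    return accuracy, errors
```

Counting errors by distractor type shows whether a model tends to confuse standard negation with local negation, contradiction, or paraphrase, which plain accuracy hides. A real evaluation would also shuffle the option order so that position gives no hint.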

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) under Grant No. RS-2023-00222663 (Center for Optimizing Hyperscale AI Models and Platforms, ERC), funded by the Ministry of Science and ICT. GPU resources were supported by the Artificial Intelligence Industrial Convergence Cluster Development Project, funded jointly by the Ministry of Science and ICT (MSIT, Korea) and Gwangju Metropolitan City.

Contributors

Chanwoo Park*, Yeonkyoung So⁺, Sangmin Lee*, Mingyu Kang*, Hanbeul Kim*, Gyuseong Lee⁺, Sungmok Jung⁺, Joonhak Lee⁺, Jongyeon Park⁺, Jia Kang⁺, Sangho Kim⁺, Jaejin Lee*⁺

* Department of Computer Science and Engineering, Seoul National University
⁺ Graduate School of Data Science, Seoul National University