Center for Optimizing Hyperscale AI Models and Platforms


Post-Training Datasets


Overview

We release the datasets used for post-training Llama-Thunder-LLM. They are designed to help the language model perform well in both Korean and English across a wide range of domains: general knowledge, commonsense reasoning, reading comprehension, mathematical reasoning, science problem solving, code generation, and instruction following.

The datasets are of two types. Processed datasets are built by reformatting the training data of public datasets. Synthetic datasets are generated with open language models.

Processed Datasets

We collected the training data of public Korean and English datasets and processed it into a format suitable for post-training.

For both SFT and DPO, each data instance was converted into a question-and-answer format. For multiple-choice questions, we built a preference dataset for DPO by treating the correct answer to each question as the chosen response and the incorrect answers as rejected responses.

Each source dataset has its own processing script. To reproduce the processed datasets, download the original datasets and run the provided scripts on them.
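
The multiple-choice conversion described above can be sketched as follows. This is a minimal illustration and not the released processing code: the field names (`question`, `choices`, `answer_index`), the prompt layout, and the choice to emit one preference pair per incorrect option are all assumptions. The last assumption is at least consistent with the counts reported in the data table on this page, where, for example, MMLU has 99842 SFT pairs and 299526 (3 × 99842) DPO pairs.

```python
from dataclasses import dataclass


@dataclass
class MultipleChoiceItem:
    """Hypothetical schema for one multiple-choice training example."""
    question: str
    choices: list[str]
    answer_index: int  # position of the correct option in `choices`


def format_prompt(item: MultipleChoiceItem) -> str:
    """Render the question followed by lettered options (assumed layout)."""
    lines = [item.question]
    for i, choice in enumerate(item.choices):
        lines.append(f"{chr(ord('A') + i)}. {choice}")
    return "\n".join(lines)


def format_answer(item: MultipleChoiceItem, index: int) -> str:
    """Render one option as an answer string, e.g. 'B. Mercury'."""
    return f"{chr(ord('A') + index)}. {item.choices[index]}"


def to_sft(item: MultipleChoiceItem) -> dict:
    """One question-answer pair that uses the correct option as the response."""
    return {
        "prompt": format_prompt(item),
        "response": format_answer(item, item.answer_index),
    }


def to_dpo(item: MultipleChoiceItem) -> list[dict]:
    """One preference pair per incorrect option (assumption, see lead-in)."""
    prompt = format_prompt(item)
    chosen = format_answer(item, item.answer_index)
    return [
        {"prompt": prompt, "chosen": chosen, "rejected": format_answer(item, i)}
        for i in range(len(item.choices))
        if i != item.answer_index
    ]


# Made-up example item, not taken from any of the source datasets.
item = MultipleChoiceItem(
    question="Which planet is closest to the Sun?",
    choices=["Venus", "Mercury", "Earth", "Mars"],
    answer_index=1,
)
sft_example = to_sft(item)
dpo_examples = to_dpo(item)
print(sft_example["response"])  # B. Mercury
print(len(dpo_examples))        # 3
```

A four-option question therefore yields one SFT pair and three DPO pairs under these assumptions. A two-option source such as Winogrande would yield one of each, which matches its equal SFT and DPO counts.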

Synthetic Datasets

The synthetic datasets are built from responses that open language models generated for high-quality questions.

We collected the questions from public datasets and generated a response to each one with LLaMA3.3-70B-Instruct, EXAONE3.5-32B-Instruct, and Qwen2.5 (32B and 72B). Low-quality responses, such as those that were excessively long or written in a language other than Korean or English, were filtered out. The remaining responses were evaluated with rule-based criteria. Correct responses were included in the SFT and DPO datasets, and incorrect responses were used as rejected responses in the preference dataset for DPO.

The synthetic datasets used in this study cover three main domains: mathematical reasoning, instruction following, and coding.
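
The filter-then-evaluate flow described above might look like the sketch below for a math question. Everything specific here is an assumption made for illustration: the source does not state the length cutoff (`MAX_CHARS`), how the language was detected (a simple Unicode-range check is used here), or what the rule-based criteria were (a final-number match is used here). Pairing each incorrect response with the first correct one is also a guess.

```python
import re
from typing import Optional

MAX_CHARS = 4000  # hypothetical length cutoff; the source gives no value


def is_korean_or_english(text: str) -> bool:
    """Assumed language rule: every letter must be Latin or Hangul.

    This is a simplification; it would also reject Greek math symbols.
    """
    for ch in text:
        if not ch.isalpha():
            continue
        code = ord(ch)
        is_latin = code < 0x0250
        is_hangul = (
            0xAC00 <= code <= 0xD7A3      # Hangul syllables
            or 0x1100 <= code <= 0x11FF   # Hangul jamo
            or 0x3130 <= code <= 0x318F   # Hangul compatibility jamo
        )
        if not (is_latin or is_hangul):
            return False
    return True


def last_number(text: str) -> Optional[float]:
    """Assumed rule-based check: read the last number in the response."""
    text = re.sub(r"(?<=\d),(?=\d{3}\b)", "", text)  # 1,250 -> 1250
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(matches[-1]) if matches else None


def build_examples(question: str, reference: float, responses: list[str]):
    """Filter low-quality responses, then split the rest by correctness."""
    correct, incorrect = [], []
    for response in responses:
        if len(response) > MAX_CHARS or not is_korean_or_english(response):
            continue  # dropped as low quality
        if last_number(response) == reference:
            correct.append(response)
        else:
            incorrect.append(response)
    sft = [{"prompt": question, "response": r} for r in correct]
    dpo = [
        {"prompt": question, "chosen": correct[0], "rejected": r}
        for r in incorrect
    ] if correct else []
    return sft, dpo


# Made-up question and responses, for illustration only.
question = "Tom has 3 apples and buys 4 more. How many apples does he have?"
responses = [
    "3 + 4 = 7. Tom has 7 apples.",    # correct
    "3 + 4 = 8, so the answer is 8.",  # incorrect -> rejected response
    "答案是 7。",                       # not Korean or English -> filtered
    "7 " * 3000,                       # 6000 characters -> filtered
]
sft, dpo = build_examples(question, 7.0, responses)
print(len(sft), len(dpo))  # 1 1
```

Instruction-following and coding data would need different correctness rules, such as constraint checks or unit tests. The source does not describe those rules, so they are left out here.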

Data Sources and Distribution

Language	Domain	Type	Source Dataset	SFT Question-Answer Pairs	DPO Question-Answer Pairs
English	General Knowledge	Processed	MMLU	99842	299526
English	Commonsense Reasoning	Processed	Hellaswag	39905	119715
English	Commonsense Reasoning	Processed	Winogrande	40398	40398
English	Reading Comprehension	Processed	OBQA	4957	14871
English	Mathematical Reasoning	Processed	GSM8K	7473	-
English	Mathematical Reasoning	Synthetic	GSM8K	12363	2335
English	Mathematical Reasoning	Synthetic	OrcaMath	257922	-
English	Science	Processed	ARC-Easy	2251	6751
English	Science	Processed	ARC-Challenge	1119	3357
English	Coding	Synthetic	MBPP	1475	-
English	Coding	Synthetic	LeetCodeDataset	2698	-
English	Instruction Following	Synthetic	SlimOrca	319526	18889
Korean	General Knowledge	Processed	KMMLU	208522	625566
Korean	Commonsense Reasoning	Processed	Kobest-Hellaswag	2029	6087
Korean	Mathematical Reasoning	Synthetic	Translated GSM8K	11238	1804
Korean	Mathematical Reasoning	Synthetic	Translated OrcaMath	246123	-
Korean	Mathematical Reasoning	Synthetic	mwp-korean-2021	3126	660
Korean	Instruction Following	Synthetic	KoAlpaca	49832	3825

Acknowledgements

This work was supported by the ERC program of the Ministry of Science and ICT (MSIT, Korea) and by the National Research Foundation of Korea (NRF) under Grant No. RS-2023-00222663 (Center for Optimizing Hyperscale AI Models and Platforms). GPU equipment was supported by the Artificial Intelligence Industrial Convergence Cluster Development Project, jointly funded by the Ministry of Science and ICT (MSIT, Korea) and Gwangju Metropolitan City.

Contributors

Jongmin Kim⁺, Jaejin Lee*⁺

* Department of Computer Science and Engineering, Seoul National University
⁺ Graduate School of Data Science, Seoul National University