[Lecture] CS25 Transformers United V4, Lecture 8: Behind the Scenes of LLM Pre-training: StarCoder Use Case (Loubna Ben Allal)


silhumin9 2025. 2. 18. 13:20

About the lecture

Speaker: Loubna Ben Allal, ML Engineer at Hugging Face
StarCoder model

What does it take to train a good LLM?

Intro

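A few years ago, the prediction was that open-source models would take a long time to catch up with closed models such as GPT

↔ The gap is now much smaller, e.g. Llama: because the weights are open, the model can be quantized, and it can run even on a consumer desktop, it became possible to build use cases on top of it

The number of companies releasing open LLMs has grown

Gemma, Mistral, …
The performance gap between open and closed models is narrowing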

Limitations: missing details about data and model training

Reasons

To avoid legal scrutiny
To maintain a competitive edge

What do we need to train good LLMs?

Model
- Transformer
- Mamba
- MoE (multiple transformer models)
GPUs
- ask
Data → the focus of this lecture
- The backbone of good LLMs
- For a fixed budget, data is what matters
- Knowing how to obtain higher-quality samples is key

How to get good data for LLM training?

Training data of LLMs

How much data for which model size?
Where to get the data?
How to filter it?

How much data to train LLMs?

Over time, model sizes, datasets, and the number of training tokens have changed a lot

Scaling laws

Kaplan (2020): for 10x more compute, scale parameters by 5.5x and training tokens by 1.8x
- GPT-3, BLOOM: 175B parameters trained on 300B tokens → undertrained
Chinchilla (2022)
- Argues that the result above came from using a fixed cosine scheduler
- The cosine schedule was not matched to the data size, so performance was underestimated
- New scaling laws: scale data & model size equally
- To reach a target loss, data and model size grow linearly together
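
The two allocation rules above can be compared numerically. The following is a minimal sketch, not taken from the lecture: it assumes each rule is a power law in compute, derives the Kaplan-style exponents from the "10x compute → 5.5x parameters, 1.8x tokens" figures quoted in these notes, and models "scale equally" as an exponent of 0.5 for both parameters and tokens. The function names are hypothetical.

```python
import math

def allocation_exponents(params_factor, tokens_factor, compute_factor=10.0):
    """Convert 'x-fold growth per compute_factor' into power-law exponents."""
    a = math.log(params_factor) / math.log(compute_factor)
    b = math.log(tokens_factor) / math.log(compute_factor)
    return a, b

def scale(compute_multiplier, a, b):
    """Return (parameter multiplier, token multiplier) for a compute multiplier."""
    return compute_multiplier ** a, compute_multiplier ** b

# Kaplan-style rule as quoted in the notes: 10x compute -> 5.5x params, 1.8x tokens.
kaplan_a, kaplan_b = allocation_exponents(5.5, 1.8)

# Chinchilla-style rule (assumption of this sketch): params and tokens share the
# extra compute equally, i.e. both grow with the square root of compute.
chinchilla_a, chinchilla_b = 0.5, 0.5

rules = {"Kaplan": (kaplan_a, kaplan_b), "Chinchilla": (chinchilla_a, chinchilla_b)}
for name, (a, b) in rules.items():
    n_mult, d_mult = scale(100.0, a, b)
    print(f"{name}: 100x compute -> {n_mult:.2f}x params, {d_mult:.2f}x tokens")
```

Under these assumptions, 100x more compute means roughly 30x parameters and 3.2x tokens for the Kaplan-style rule, versus 10x each for the Chinchilla-style rule, which is why models sized by the older rule look undertrained.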

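After Chinchilla
- LLaMA: 7B parameters, 1T tokens
- However, compute-optimal is not always ‘optimal’
- Inference cost matters too, not only training cost
- Resulting preference: a large model trained on less data < a small model trained for longer
- Loss keeps decreasing and downstream evaluations keep improving with longer training

→ Beyond the cost of training: inference gets more expensive

Inference is costly
The larger the model, the longer it takes to process each token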

Compute optimal is not optimal

Scaling laws do not account for inference cost
- smaller models trained longer are more cost effective
- Not following the Chinchilla scaling laws → pay a ‘compute overhead’ (spend more during training in order to save during inference)
Harm’s law: compute overhead for training past Chinchilla
- 13% overhead (extra training compute), 50% cost savings
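
The trade-off between training overhead and inference savings can be made concrete with a back-of-the-envelope calculation. This is a sketch under stated assumptions: it uses the common rules of thumb of about 6·N·D FLOPs for training and about 2·N FLOPs per generated token for inference, the two model configurations are hypothetical, and it simply assumes both models reach comparable quality (the code does not check that).

```python
def training_flops(params, tokens):
    """Approximate training cost: ~6 FLOPs per parameter per training token."""
    return 6.0 * params * tokens

def inference_flops(params, served_tokens):
    """Approximate inference cost: ~2 FLOPs per parameter per served token."""
    return 2.0 * params * served_tokens

def total_flops(params, train_tokens, served_tokens):
    return training_flops(params, train_tokens) + inference_flops(params, served_tokens)

def break_even_served_tokens(large, small):
    """Served tokens after which the smaller, longer-trained model is cheaper overall.

    large, small: (params, train_tokens) tuples.
    """
    (n_large, d_large), (n_small, d_small) = large, small
    saving_per_token = 2.0 * (n_large - n_small)
    if saving_per_token <= 0:
        raise ValueError("the 'small' model must have fewer parameters")
    extra_training = training_flops(n_small, d_small) - training_flops(n_large, d_large)
    return max(extra_training, 0.0) / saving_per_token

# Hypothetical configurations, for illustration only.
large_model = (14e9, 280e9)  # 14B params, 280B training tokens
small_model = (7e9, 1e12)    # 7B params, 1T training tokens

tokens_needed = break_even_served_tokens(large_model, small_model)
print(f"break-even after {tokens_needed:.3g} served tokens")
```

Past the break-even point, every additional served token makes the smaller model the cheaper choice, which is the sense in which the extra training compute is an overhead that gets paid back at inference time.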

Scaling laws: further reading

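Scaling Data-Constrained Language Models (paper)
- When the amount of data is limited, it can be repeated for up to 4 epochs
- Useful in domains where little data is publicly available
DeepSeek LLM: new scaling laws for data
- Found that scaling behavior depends heavily on data quality
- Tried different subsets and filtering → performance changed
- Concluded that with a higher-quality dataset, more compute should be allocated to model size rather than data size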

Where to get the data?

Web Data = web pages

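Common Crawl: a public repository of crawled web pages; a new dump is released every few months
- Requires filtering at very large scale, e.g. a recent one-month dump was 400TB
Use existing filtered web datasets
- Reported to give the best performance
- Available for both code data and web data
→ FineWeb: released by Hugging Face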

Code Data

→ BigCode: a Hugging Face collaboration

V1: GitHub
V2: Software Heritage

Synthetic data

Microsoft: Textbooks Are All You Need
- Builds the pre-training corpus from synthetic textbooks
- Synthetic data is also used for, e.g., Claude and Llama

→ Cosmopedia: by Hugging Face, generated with Mixtral 8x7B

- 80% of the data is sourced from the web
- Prompts ask the model to generate new textbooks from web samples
- curated sources such as WikiHow, Stanford sources…
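
The idea of seeding a textbook-generation prompt with a web sample can be sketched as follows. The template text, the function name `build_textbook_prompt`, and the `audience` and `max_chars` parameters are all hypothetical and are not the actual Cosmopedia prompts; the sketch only shows the shape of the approach.

```python
# Hypothetical prompt template, written for illustration only.
PROMPT_TEMPLATE = (
    "Here is an extract from a web page:\n"
    '"{extract}"\n\n'
    "Write a textbook section for {audience} on a topic related to this extract. "
    "Do not copy the extract; explain the concepts step by step."
)

def build_textbook_prompt(web_sample, audience="college students", max_chars=500):
    """Build a generation prompt seeded with a truncated web sample."""
    extract = web_sample.strip()[:max_chars]
    return PROMPT_TEMPLATE.format(extract=extract, audience=audience)

sample = "Photosynthesis converts light energy into chemical energy stored in glucose."
print(build_textbook_prompt(sample, audience="high school students"))
```

Varying the seed sample and the target audience is one simple way to get diverse outputs from the same generator model.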

How to filter the data?

LLMs training data filtering

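A high-quality dataset might bring out advanced capabilities even with a standard architecture
Data is seen as the backbone → curating it and removing outliers is important
Yi's pipeline
- language filtering
- → second box (of the pipeline diagram)
- → deduplication (important): if deduplication is skipped, the model memorizes the duplicates, leaving less room for creativity
- → semantic, topic filtering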
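
The deduplication step can be illustrated in its simplest form, exact deduplication by content hash. This is a minimal sketch and not Yi's implementation: the normalization (lowercasing and collapsing whitespace) is an assumption chosen for the example, and production pipelines usually add fuzzy or near-duplicate matching on top.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def exact_dedup(docs):
    """Keep the first occurrence of each document, comparing normalized content hashes."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["Hello   World", "hello world", "Something else"]
print(exact_dedup(docs))
```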

How to find the best filtering techniques?

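Why filtering matters
- ex. i) files that have many comments: well-documented code is likely to be high quality
  - Filtered out files with almost no comments and trained a model, but the performance gain was marginal
- ex. ii) removing files from repositories with fewer than 5 stars: this removed 70% of the data, so …
Example when using standard filters:
Manual inspection
Train ablation models (important)
- Idea: after applying basic filtering, take a subset of the dataset, train small models, and compare how they behave with and without the filter
  - cf) An ablation study removes a specific part of a model (e.g. a layer, an input feature, a module) and compares the results to analyze how each component affects performance
- Need high-signal benchmarks that reveal a filter's effect early in training
- Better to run the same experiment with two or three seeds, rather than one, and average the results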
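
The comment-based heuristic from the first example can be sketched as a simple line-level ratio. This is an illustrative sketch only: it handles just `#`-style comments, and the thresholds `min_ratio` and `max_ratio` are hypothetical values, not the ones used for StarCoder.

```python
def comment_ratio(source):
    """Fraction of non-empty lines that are '#' comments (Python-style only)."""
    lines = [line.strip() for line in source.splitlines() if line.strip()]
    if not lines:
        return 0.0
    comments = sum(1 for line in lines if line.startswith("#"))
    return comments / len(lines)

def keep_file(source, min_ratio=0.01, max_ratio=0.8):
    """Keep files that have some comments but are not almost entirely comments.

    Both thresholds are hypothetical and would need tuning via ablations.
    """
    return min_ratio <= comment_ratio(source) <= max_ratio

documented = "# add two numbers\nx = 1\ny = 2\nprint(x + y)\n"
undocumented = "x = 1\ny = 2\nprint(x + y)\n"
print(keep_file(documented), keep_file(undocumented))
```

As the notes point out, a heuristic that sounds reasonable may still yield only a marginal gain, so the filter itself should be validated by training on the filtered data.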

→ 20+ ablation models for FineWeb filtering
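
Comparing ablation runs across seeds can be sketched as below. The scores are placeholder numbers, not results from the lecture, and the decision rule (the mean gain must exceed the larger per-condition standard deviation) is a simple assumption for illustration rather than the procedure used for FineWeb.

```python
from statistics import mean, stdev

def summarize(scores_by_seed):
    """Return (mean, standard deviation) of benchmark scores across seeds."""
    values = list(scores_by_seed.values())
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread

def filter_helps(baseline, filtered):
    """Hypothetical rule: the mean gain must exceed the seed-to-seed noise."""
    base_mean, base_std = summarize(baseline)
    filt_mean, filt_std = summarize(filtered)
    return (filt_mean - base_mean) > max(base_std, filt_std)

# Placeholder scores keyed by random seed, for illustration only.
baseline_scores = {0: 0.30, 1: 0.32, 2: 0.31}
filtered_scores = {0: 0.36, 1: 0.35, 2: 0.37}

print(summarize(baseline_scores), summarize(filtered_scores))
print("filter helps:", filter_helps(baseline_scores, filtered_scores))
```

Averaging over seeds matters because small ablation models are noisy: a difference seen with a single seed may be smaller than the variation between seeds.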

StarCoder use case: filtering The Stack v1

Language selection
Data quality inspection
Near-deduplication: MinHash + LSH → the most effective step
Removing PII
Data decontamination: remove benchmarks used for evaluation from the training set
Data formatting
- Unlike StarCoder 1, StarCoder2 is repository-aware
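
The near-deduplication step named in the list above (MinHash + LSH) can be sketched in pure Python. This is a minimal sketch, not the StarCoder pipeline: the shingle size, number of hash functions, band count, and similarity threshold are hypothetical defaults, and a real pipeline would use an optimized library and run in a distributed setting.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def shingles(text, n=3):
    """Set of word n-grams; short texts fall back to a single shingle."""
    tokens = text.split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def _hash(seed, item):
    """Seeded 64-bit hash; each seed acts as one MinHash 'permutation'."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(text, num_perm=64):
    """For each seed, keep the minimum hash over all shingles."""
    items = shingles(text)
    return tuple(min(_hash(seed, s) for s in items) for seed in range(num_perm))

def candidate_pairs(signatures, bands=16):
    """LSH banding: documents sharing any identical band become candidates."""
    if not signatures:
        return set()
    rows = len(signatures[0]) // bands
    pairs = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for idx, sig in enumerate(signatures):
            buckets[sig[b * rows:(b + 1) * rows]].append(idx)
        for members in buckets.values():
            pairs.update(combinations(members, 2))
    return pairs

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def near_duplicates(docs, threshold=0.7, num_perm=64, bands=16):
    """Return index pairs whose estimated similarity reaches the threshold."""
    sigs = [minhash_signature(d, num_perm) for d in docs]
    return sorted(
        pair for pair in candidate_pairs(sigs, bands)
        if estimated_jaccard(sigs[pair[0]], sigs[pair[1]]) >= threshold
    )

docs = [
    "def add(a, b): return a + b",
    "def add(a, b): return a + b",
    "completely different text about web crawling and filtering",
]
print(near_duplicates(docs))
```

The banding step is what keeps this tractable at scale: only documents that collide in at least one band are compared, instead of every pair in the corpus.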

Frameworks for data filtering

More about LLMs for code

→ How to build cool code assistants

GitHub Copilot in 2021
- OpenAI's Codex model
- Showed that it works if you train on a lot of code: just treat code like text
Many instruction-tuned models are available on Hugging Face

Bigcode: open scientific collaboration

Open & Responsible Research on LLMs → what open source means!
- opt out for data: users can exercise the right to refuse (opt out of) having their data collected or used

Q&A

What are the consequences of training AI models with synthetic data? Are there any problems? Does synthetic data represent the distribution of language well? The questioner thinks low-quality human data is sometimes needed too
- Concerns: (1) it forces a certain bias onto the model; (2) contamination: training on data that resembles the benchmarks risks contamination
- People were skeptical because they had not been able to see the synthetic data
- It is a valid point that synthetic data does not match the web distribution
- A mix of natural and synthetic data should work well; human intuition is not always suitable for the model
Is RLHF data preferable to unsupervised learning?
- These days the common approach skips RL and simply fine-tunes the model on instruction-solution pairs
- or DPO, ORPO
- not necessarily RLHF
Will multimodal grounding reduce the need for text data in model training?
Major differences between text vs code
- use similar architecture
- long-context extension
- MQA, GQA
- Overall similar, but smaller models are needed
What should be prioritized when fine-tuning a model on a single GPU?
- llama.cpp
- PEFT
- Quality matters more than quantity
Pre-training vs Fine-tuning
- Fine-tuning needs far less data than pre-training, so more time can be spent on filtering
