[Lecture] CS25 Transformers United V4, Lecture 2: Jason Wei & Hyung Won Chung of OpenAI


silhumin9 2025. 1. 20. 01:28

Lecture source: https://web.stanford.edu/class/cs25/

Intuitions on Language Models (Jason)

Q. Why do LMs work so well?

→ manually inspect data

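ex) He once worked on a project to classify types of lung cancer

but was told that doing this task requires a medical degree
so he read many papers and went through many pathology cases
How LMs are trained
- next-word prediction task: given the preceding words, predict the word that comes next
- the model outputs a probability for each candidate word
- an LM assigns a probability to every word in its vocabulary
- loss: how far the prediction is from the actual next word, measured as the negative log of the probability assigned to the actual next word
- → first intuition: next-word prediction = massively multi-task learning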
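To make the loss concrete, here is a minimal sketch of next-word prediction loss. The context sentence, the three-word vocabulary, and the probabilities are all made up for illustration; a real LM assigns probabilities over its whole vocabulary.

```python
import math

# Hypothetical model output for the context "The cat sat on the":
# one probability per word in a tiny, made-up vocabulary.
predicted = {"mat": 0.6, "sofa": 0.3, "moon": 0.1}

def next_word_loss(probs, actual_next_word):
    """Negative log of the probability assigned to the word that actually came next."""
    return -math.log(probs[actual_next_word])

loss_likely = next_word_loss(predicted, "mat")     # high probability -> low loss
loss_unlikely = next_word_loss(predicted, "moon")  # low probability -> high loss
print(round(loss_likely, 3), round(loss_unlikely, 3))
```

The loss is small when the model put high probability on the word that really followed, and large when it did not.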

Example tasks from next-word prediction
- Grammar
- Lexical semantics
- World knowledge
- Sentiment analysis
- Translation
- Spatial reasoning
- Math question
  
  → however, these tasks can sometimes be very arbitrary
  
  = next-word prediction is hard!

Scaling compute(=data, model size) reliably improves loss. (Kaplan et al., 2020)
- With compute on the x-axis and loss on the y-axis (both on log scales), loss falls along a roughly straight line as compute increases, i.e. a power law
- Why does scaling improve the loss?
- Comparison
  - Small LM: choosy about what to memorize (memorization), first-order heuristics (heuristics)
  - Large LM: memorizes tail knowledge (memorization), complex heuristics (heuristics)
- While overall loss improves smoothly, individual tasks can improve suddenly.
  - overall loss = (small number) * loss_grammar + (small number) * loss_sentiment + … → might be saturated
    
    … + (small number) * loss_math → can improve suddenly
- Scaling curves
  - smooth: 29%
  - not correlated with scale: 13%
  - emergent: 33%
  - emergent: when you plot accuracy against compute, accuracy stays near zero and then suddenly improves
- Inverse scaling / U-shaped scaling
  - Repeat after me.
  - All that glisters is not glib.
  - All that glisters is not ___.
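  - cf) Suppose there are XS, S, and L language models. What mechanism makes only the XS and L models predict the word 'glib'?
    - Repeat: XS(o) S(o) L(o)
    - Fix a quote: XS(x) S(o) L(o)
    - Follow instruction: XS(x) S(x) L(o)
    - → XS can only repeat, so it outputs 'glib'; S knows the real quote and "fixes" it to 'gold'; L follows the instruction and outputs 'glib'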
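The U-shaped result can be sketched as a toy decision rule. This is only an illustration of the decomposition into three abilities (repeat, fix a quote, follow instruction); the function `predict` and the ability flags are hypothetical and say nothing about how a real model works internally.

```python
def predict(can_repeat, can_fix_quote, can_follow_instruction):
    """Word filled in for 'All that glisters is not ___' after 'Repeat after me'."""
    if can_follow_instruction:
        return "glib"  # obeys the instruction even though the quote looks wrong
    if can_fix_quote:
        return "gold"  # knows the real quote and "corrects" it
    if can_repeat:
        return "glib"  # simply copies the previous line
    return None

# Ability flags per model size: (repeat, fix a quote, follow instruction)
models = {
    "XS": (True, False, False),
    "S": (True, True, False),
    "L": (True, True, True),
}
predictions = {name: predict(*abilities) for name, abilities in models.items()}
correct = {name: word == "glib" for name, word in predictions.items()}
print(predictions)
print(correct)
```

Accuracy goes right, wrong, right as size grows, which is the U shape, even though each underlying ability only ever improves with scale.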
Takeaway: Plot scaling curves

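Assume three points on the plot: the baseline, the result with half of the data, and "my thing" (all of the data)
- Case 1 (top curve): there was no need to collect all the data; collecting more will not improve performance
- Case 2 (middle curve): doing more of the research project may improve performance further
- Case 3 (bottom curve): doing more would give an even larger jump in performance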
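One rough way to tell the three cases apart from just three measured points is to compare the half-data result against the straight line between the baseline and "my thing". The function name `curve_shape`, the tolerance, and the scores below are assumptions for illustration, not something given in the lecture.

```python
def curve_shape(baseline, half_data, full_data, tolerance=0.01):
    """Classify a three-point scaling curve by where the midpoint lies."""
    midpoint_on_line = (baseline + full_data) / 2
    if half_data > midpoint_on_line + tolerance:
        return "case 1: saturating"    # most of the gain already came from half the data
    if half_data < midpoint_on_line - tolerance:
        return "case 3: accelerating"  # gains are still speeding up
    return "case 2: linear"            # more of the same should keep helping

# Made-up accuracy numbers: (baseline, half of the data, all of the data)
print(curve_shape(0.50, 0.69, 0.70))
print(curve_shape(0.50, 0.60, 0.70))
print(curve_shape(0.50, 0.52, 0.70))
```

In practice more than three points and an actual plot give a much clearer picture; this only encodes the reading of the three curves described above.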
Q&A
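- Data?
  
  → It is better to train mainly on good data
- For an emergent task, while performance is still flat, is there any hint that the ability will emerge later?
- biggest bottleneck in LLMs
  
  → scaling laws paradigm: the size of the data and the model
- Are emergent abilities in LLMs a mirage?
  
  → He thinks the abilities of language models are ultimately real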

Shaping the Future of AI from the History of Transformer (Hyung Won)

Goal: Future ↔ History

→ important to study the change itself
What does it mean?
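1. Identify: find the dominant driving force behind the change
2. Understand: understand the dominant driving force
3. Predict: predict the future trajectory of the change
Dominant driving force: the cost of compute is decreasing exponentially
- How was it done?
- teach machines how to think
- incorporate that into
- understand at a very low level
Over the past 70 years, AI research has progressed toward weaker modeling assumptions (less structure)

→ bitter lesson: approaches that do not restrict degrees of freedom are more useful than building in complex structure
- This was made possible by exponentially cheaper compute
- add more data and computation (i.e. scale up)
- Compute cost falls exponentially → rather than trying to become a better researcher… → work on leveraging it!
Comparison by amount of structure
- more structure: can perform better early on, but scaling plateaus because of structural limits
- less structure: more degrees of freedom, so it scales better in the long run
- In the general case, we cannot wait indefinitely
  
  → However, the speaker thinks that far more structure, such as inductive biases, should be removed
  
  → What is not good at present can be more powerful and scalable in the long run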

Understand the driving force

What is transformer?
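- Input sequence: the input of a sequence model; each sequence element is represented as a vector
- Interaction between sequence elements: the dot product of each pair of vectors
  - if high, semantically more related
- Transformer: a particular sequence model that processes the input sequence in this way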
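A minimal sketch of "interaction as dot product", using hand-written 3-dimensional vectors. The words and numbers are hypothetical; in a real model the vectors are learned and much larger.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical embeddings, one vector per sequence element.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

# Pairwise interaction scores between all sequence elements.
tokens = list(embeddings)
scores = {(a, b): dot(embeddings[a], embeddings[b]) for a in tokens for b in tokens}

print(scores[("cat", "dog")])  # larger: the vectors point in similar directions
print(scores[("cat", "car")])  # smaller: the vectors are nearly orthogonal
```

Attention builds on these pairwise scores by normalizing them and using them as weights when mixing the vectors.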

The traditional Transformer model

encoder (orange part)
- Allows interaction between all tokens of the input sequence
- Produces its output through a block ending in an MLP (feed-forward layer), repeated N times
- output = sequence of vectors (each representing a sequence element, e.g. a word)
decoder (blue part)
- Uses a causal self-attention mechanism over its input
- causal self-attention = a token cannot attend to future tokens
- sequence-to-sequence mapping
Important patterns
- Cross-attention: the decoder interacts with a specific encoder layer
- all the layers in the decoder attend to the final layer of the encoder
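The causal restriction in the decoder can be sketched in a few lines. This is a simplified illustration in plain Python, assuming a precomputed square matrix of raw attention scores; the function name and the score values are made up, and real implementations work on batched tensors.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], keeping only positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # token i may look at tokens 0..i only
        peak = max(visible)     # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Future positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Made-up raw scores for a 3-token sequence.
scores = [
    [1.0, 2.0, 3.0],
    [1.0, 1.0, 5.0],
    [0.5, 0.5, 0.5],
]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

Dropping the `visible` slice and applying softmax over the full row would give the all-to-all (bidirectional) attention used in the encoder.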

encoder-only (BERT)

The output is a single vector that summarizes the input sequence
Mainly used for classification problems such as sentiment analysis
Does not generate sequences, so it is limited in the long term
With BERT, the benchmark was GLUE
- sequence in, classification label out
- additional structure: give up on generation
- sequence to classification label is so much easier
- not generating a sequence
  
  ↔ in the long term not really useful

decoder-only (GPT)

A single stack that can handle both the input and the output sequence
decoder-only can also be used for supervised learning ↔ concatenate the input with the target
Self-attention: handles both the input and the output sequence, and absorbs the role of cross-attention
output is a sequence
self-attention is serving both roles
it covers both the input and the target sequences
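As a sketch of "concatenate the input with the target", the snippet below builds one training example for a decoder-only model. The token strings and the helper name are hypothetical, and computing the loss only on target positions (`loss_mask`) is a common practice assumed here, not something stated in these notes.

```python
def build_decoder_only_example(input_tokens, target_tokens):
    """Concatenate input and target, then shift by one for next-token prediction."""
    sequence = input_tokens + target_tokens
    model_inputs = sequence[:-1]  # what the model reads
    labels = sequence[1:]         # what it must predict at each position
    # Assumption: only positions whose label is a target token count toward the loss.
    loss_mask = [0] * (len(input_tokens) - 1) + [1] * len(target_tokens)
    return model_inputs, labels, loss_mask

model_inputs, labels, loss_mask = build_decoder_only_example(
    ["translate", ":", "hello"],  # input sequence
    ["bonjour", "<eos>"],         # target sequence
)
print(model_inputs)
print(labels)
print(loss_mask)
```

The same stack and the same self-attention process both halves, which is why no separate cross-attention is needed.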

How different are encoder-decoder and decoder-only architectures?

→ quite similar; an encoder-decoder turns into a decoder-only model with the following changes

- Share cross- and self-attention parameters
- Share encoder and decoder parameters
- Decoder layer l attends to encoder layer l
- Make encoder self-attention causal


Additional structures

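When the input and output sequences are sufficiently different, using separate parameters is effective
- If input and output differ greatly in structure, sharing the same parameters can confuse or lose information
- Encoder-decoder architecture
  
  ex) In translation, the input (English) and the output (Korean) differ greatly in vocabulary, grammar, and expression. In that case, handling input and output with separate parameters allows more effective learning
- However, using separate parameters is not natural for larger, more general models
Each element of the output sequence can attend to the fully encoded representation of the input sequence
- A decoder output element can refer to the final representation of the input elements produced by the encoder
  
  ex) If decoder layer 1 attends to the final encoder layer, the attention mechanism itself is no different from the original
- The speaker therefore considers the encoder-decoder an unnecessary design
When encoding the input sequence, all-to-all interaction between the elements of the sequence is preferred

Academic datasets can have long inputs, but are constrained in that the output (target) cannot be made long
- ex) In text summarization, the input document is long but the summary (target) is short
- This property suits the encoder-decoder architecture
Today's LMs must handle complex tasks where both the input and the output (target) are long (longer target situation)
- So it is better to adopt the simpler structure (decoder-only)
- Scalability to larger data and longer sequences
Bidirectional vs. unidirectional
- Bidirectional: all elements of the input sequence influence one another during learning (ex. BERT)
  - Considers interaction between all input tokens, so it can learn richer contextual information
  - Inefficient in multi-turn conversation, because the input must be re-encoded at every turn
- Unidirectional: the current word depends only on the preceding words (ex. GPT)
  - Efficient learning is possible, because the current word depends only on the preceding words
  - A bidirectional model has to consider "How" (how did it happen?) in order to learn "Why" (why did it happen?), whereas a unidirectional model simply learns from sequential data and needs no such consideration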
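The multi-turn inefficiency can be shown with a simple count of how many tokens must be encoded over a conversation. The turn lengths are made up, and the model is reduced to a counter: bidirectional attention is assumed to re-encode the whole history each turn, while unidirectional attention is assumed to reuse cached states for earlier tokens.

```python
def tokens_encoded(turn_lengths, bidirectional):
    """Total number of token encodings needed across a multi-turn conversation."""
    total = 0
    history = 0
    for new_tokens in turn_lengths:
        history += new_tokens
        # Bidirectional: earlier tokens depend on later ones, so the whole
        # history is re-encoded. Unidirectional: earlier states are unchanged
        # and can be cached, so only the new tokens are encoded.
        total += history if bidirectional else new_tokens
    return total

turns = [10, 20, 30]  # hypothetical number of new tokens in each turn
print(tokens_encoded(turns, bidirectional=True))   # 10 + 30 + 60
print(tokens_encoded(turns, bidirectional=False))  # 10 + 20 + 30
```

The gap widens as the conversation gets longer, since the bidirectional count grows roughly quadratically with the number of turns.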
Conclusion
- Re-examine current architectures and remove unnecessary elements
- Move toward designs that account for generalization and scalability
