🚀 Fixing Errors When Training Transformers Models on Google Colab

@leekh8 · November 08, 2024 · 14 min read

Code N Solve 📘: A Guide to Fixing Errors When Training Transformers Models on Google Colab

This post collects the various errors a beginner may run into while training Transformer-based NLP models on Google Colab.

In particular, let's look at how to understand and resolve library installation problems, file path configuration, data preprocessing issues, and more.


Understanding the Google Colab Environment

What Is Google Colab?

Google Colab (Colaboratory) is a cloud-based Jupyter notebook environment provided by Google. You can run Python code and train machine learning models in the browser without installing anything locally.

Its biggest advantage is free GPU access, which makes it well suited to processing large datasets and training deep learning models.

Free Plan vs. Pro Plans

Item             Free              Pro                 Pro+
GPU              T4 (random)       T4, V100, A100      A100 priority
Max session      up to 12 hours    up to 24 hours      up to 24 hours
RAM              ~12 GB            ~25 GB              ~52 GB
Idle timeout     ~90 minutes       ~90 minutes         ~90 minutes
Disk             ~78 GB            ~166 GB             ~166 GB
Price            free              ~$10/month          ~$50/month

How to Get a GPU

# Check the current GPU
import torch
print(torch.cuda.is_available())      # True means a GPU is available
print(torch.cuda.get_device_name(0))  # prints the GPU name

Colab menu: Runtime → Change runtime type → Hardware accelerator → GPU

Session Limits and Caveats

  • Session timeout: closing the browser or leaving the tab inactive disconnects the session after about 90 minutes
  • Runtime restarts: when the session ends, installed packages, variables, and training state are all reset
  • File volatility: files saved under Colab's /content directory are deleted when the session ends
  • Google Drive integration: any file that must persist has to be saved to Google Drive
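In practice, this means copying anything worth keeping into the mounted Drive folder as soon as it is produced. A minimal stdlib sketch (the Drive path in the usage comment is illustrative, assuming drive.mount has already run):

```python
import os
import shutil

def persist_to_drive(src_path, drive_dir):
    """Copy a file from ephemeral Colab storage into a (mounted) Drive folder."""
    os.makedirs(drive_dir, exist_ok=True)            # create the target folder if missing
    dst = os.path.join(drive_dir, os.path.basename(src_path))
    shutil.copy2(src_path, dst)                      # copy2 also preserves timestamps
    return dst

# Typical Colab usage, after drive.mount('/content/drive'):
# persist_to_drive("/content/results.csv", "/content/drive/MyDrive/backups")
```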

The Hugging Face Transformers Pipeline, End to End

Hugging Face Transformers is a library that makes thousands of pretrained models such as BERT, GPT, RoBERTa, and T5 easy to use.

The overall flow of fine-tuning a model looks like this.

๋ฐ์ดํ„ฐ ์ค€๋น„

CSV/JSON/HuggingFace Dataset
ํ† ํฌ๋‚˜์ด์ € ๋กœ๋“œ

AutoTokenizer
๋ฐ์ดํ„ฐ ํ† ํฌ๋‚˜์ด์ง•

tokenize + padding + truncation
Dataset ๊ฐ์ฒด ์ƒ์„ฑ

DatasetDict
๋ชจ๋ธ ๋กœ๋“œ

AutoModelForSequenceClassification
TrainingArguments ์„ค์ •

batch size, epochs, lr
Trainer ์ดˆ๊ธฐํ™”

model + args + dataset + collator
ํ•™์Šต ์‹œ์ž‘

trainer.train
ํ‰๊ฐ€

trainer.evaluate
๋ชจ๋ธ ์ €์žฅ

model.save_pretrained
HuggingFace Hub ์—…๋กœ๋“œ

์„ ํƒ์‚ฌํ•ญ


Problem 1: Library Installation Errors with sklearn and datasets

To start training Transformers models on Google Colab, you need the Hugging Face transformers, torch, and datasets libraries, among others. When installing sklearn, however, you may hit the following error.

error: metadata-generation-failed

Encountered error while generating package metadata.

See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Root Cause

sklearn is not the real package name. On the Python Package Index (PyPI), the actual package is called scikit-learn. A separate package named sklearn does exist, but it is an old wrapper package that in many cases no longer installs correctly.

In other words, pip install sklearn tries to install a different (broken) package instead of scikit-learn, and that is what fails.

Solution

# Wrong
!pip install sklearn

# Correct
!pip install -U scikit-learn

Install all the required libraries at once:

!pip install transformers torch datasets scikit-learn evaluate accelerate

๊ฐ ํŒจํ‚ค์ง€์˜ ์—ญํ• :

ํŒจํ‚ค์ง€ ์šฉ๋„
transformers BERT ๋“ฑ ์‚ฌ์ „ํ•™์Šต ๋ชจ๋ธ ๋กœ๋“œ ๋ฐ ํŒŒ์ธํŠœ๋‹
torch ๋”ฅ๋Ÿฌ๋‹ ์—ฐ์‚ฐ ๋ฐฑ์—”๋“œ (PyTorch)
datasets HuggingFace ๋ฐ์ดํ„ฐ์…‹ ๋กœ๋“œ ๋ฐ ์ฒ˜๋ฆฌ
scikit-learn ํ‰๊ฐ€ ์ง€ํ‘œ ๊ณ„์‚ฐ (accuracy, f1 ๋“ฑ)
evaluate HuggingFace ๊ณต์‹ ํ‰๊ฐ€ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ
accelerate ๋ถ„์‚ฐ ํ•™์Šต, ํ˜ผํ•ฉ ์ •๋ฐ€๋„(fp16) ์ง€์›

Problem 2: Google Drive File Path Issues

To use a dataset in Colab, files stored in Google Drive have to be connected to the Colab session. If Drive is not mounted, you get a FileNotFoundError.

FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/Dataset/Dataset.json'

Root Cause

A Colab session's /content directory is local storage on the Colab VM. Google Drive is a separate storage space that only becomes reachable under /content/drive after you mount it explicitly. The mount is also released when the session restarts, so you must remount every time.

Solution

from google.colab import drive
drive.mount('/content/drive')

Running this opens a popup asking you to authenticate with your Google account. Once authenticated, your files are accessible under /content/drive/MyDrive/.

import pandas as pd

# Read a file from Google Drive
file_path = "/content/drive/MyDrive/Dataset/Dataset.json"
data = pd.read_json(file_path, lines=True)

print(data.head())
print(f"Data shape: {data.shape}")

Google Drive Path Layout

/content/drive/
  MyDrive/          ← My Drive (your own files)
    Dataset/
      Dataset.json
  Shareddrives/     ← shared drives (team drives)

ํŒŒ์ผ ํ™•์ธ ๋ฐฉ๋ฒ•

import os

# ๊ฒฝ๋กœ ์กด์žฌ ํ™•์ธ
path = "/content/drive/MyDrive/Dataset"
if os.path.exists(path):
    print("ํด๋” ์กด์žฌ")
    print(os.listdir(path))
else:
    print("๊ฒฝ๋กœ ์—†์Œ โ€” Drive ๋งˆ์šดํŠธ ํ™•์ธ ํ•„์š”")

Problem 3: Data Padding Errors During Training

If the sequences in a batch do not all have the same length, training can fail with one of the following errors.

ValueError: expected sequence of length 128 at dim 1 (got 97)

# or
RuntimeError: stack expects each tensor to be equal size, but got [128] at entry 0 and [97] at entry 1

Root Cause

Transformer models require every input sequence within a batch to have the same length. Natural language sentences vary in length, so shorter ones must be padded with [PAD] tokens to match.

Even if you set padding=True during tokenization, tokenizing each sample individually can still leave mismatched sizes when the samples are stacked into a batch.

Solution

DataCollatorWithPadding applies padding dynamically as each batch is assembled.

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,  # adds dynamic padding automatically
)

The advantage of dynamic padding: each batch is padded only up to its longest sequence, so unnecessary padding is minimized and training runs faster.

# Full tokenization example
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")

def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,     # cut off anything over the maximum length
        max_length=128,      # maximum number of tokens
        # no padding here; the DataCollator pads dynamically per batch
    )

tokenized_dataset = dataset.map(tokenize_function, batched=True)
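To see how much work dynamic padding saves, here is a stdlib-only estimate comparing padding every sequence to a fixed max_length=128 against padding each batch only to its own longest sequence (the token counts are made up for illustration):

```python
# Count wasted [PAD] tokens: fixed-length padding vs. per-batch dynamic padding
def padding_waste(lengths, batch_size, max_length=None):
    waste = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        # fixed mode pads to max_length; dynamic mode pads to the batch maximum
        target = max_length if max_length else max(batch)
        waste += sum(target - n for n in batch)
    return waste

lengths = [12, 30, 25, 18, 90, 95, 88, 100]  # hypothetical token counts per sample
fixed = padding_waste(lengths, batch_size=4, max_length=128)   # pad everything to 128
dynamic = padding_waste(lengths, batch_size=4)                 # pad per batch

print(fixed, dynamic)  # 566 62
```

Grouping samples of similar length into the same batch increases the savings further, which is what the Trainer's group_by_length option is for.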

Problem 4: wandb Login Prompt During Training

By default, the Hugging Face Trainer tries to track training runs with Weights & Biases (wandb). On the first run, you will see a prompt like this.

wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice:

In Colab, the cell then hangs waiting for user input.

Solution

Disable wandb with report_to="none".

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none"  # disable wandb
)

Or set an environment variable:

import os
os.environ["WANDB_DISABLED"] = "true"

If you do want wandb, log in first and set your API token:

import wandb
wandb.login(key="your_api_key_here")

Problem 5: CUDA Out of Memory Errors

Symptom

RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB
(GPU 0; 14.76 GiB total capacity; 12.54 GiB already allocated;
1.20 GiB free; 12.67 GiB reserved in total by PyTorch)

Root Cause

This error means GPU memory (VRAM) has run out. Common causes:

  • Batch size too large: processing a single batch needs more memory than the VRAM provides
  • Sequences too long: the memory needed for attention grows with the square of the token length
  • Model too large: models with many parameters (BERT-large, GPT-2, etc.)
  • Memory from a previous run not released: a model from an earlier cell is still resident
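The quadratic growth in the second point is easy to verify with arithmetic: one raw attention score tensor holds batch × heads × seq_len² values, so doubling the sequence length quadruples it. A rough per-layer estimate (the sizes are illustrative, fp16 assumed at 2 bytes per value):

```python
def attention_matrix_bytes(batch, heads, seq_len, bytes_per_val=2):
    """Memory for one layer's raw attention scores, shape (batch, heads, seq, seq)."""
    return batch * heads * seq_len * seq_len * bytes_per_val

# BERT-base-like setup: 12 attention heads, fp16
for seq in (128, 256, 512):
    mib = attention_matrix_bytes(batch=8, heads=12, seq_len=seq) / 1024**2
    print(f"seq_len={seq}: {mib:.0f} MiB per layer")

# 4x the sequence length means 16x the attention memory
assert attention_matrix_bytes(8, 12, 512) == 16 * attention_matrix_bytes(8, 12, 128)
```

This is why trimming max_length is often a more effective OOM fix than shrinking the model.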

Solutions

1. Reduce the Batch Size + Gradient Accumulation

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,     # reduced from 8 to 4
    gradient_accumulation_steps=4,     # accumulate 4 steps for an effective batch of 16
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    report_to="none",
)

With gradient_accumulation_steps=4, only 4 samples' worth of activations are in memory at a time, but the weights are updated after accumulating gradients over 4 steps, so the effective batch size is 16.
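The trick works because the gradient of a mean loss over 16 samples equals the average of four micro-batch gradients over 4 samples each. A dependency-free sketch with a one-parameter model and loss = mean((w*x - y)^2):

```python
def grad(w, xs, ys):
    """d/dw of the mean squared error for a 1-parameter linear model."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = list(range(1, 17))            # 16 training samples
ys = [3 * x for x in xs]

full = grad(w, xs, ys)             # one big batch of 16

accum = 0.0
for i in range(0, 16, 4):          # four micro-batches of 4
    accum += grad(w, xs[i:i+4], ys[i:i+4])
accum /= 4                         # average the accumulated micro-batch gradients

assert abs(full - accum) < 1e-9    # identical up to float rounding
```

Only the activations of one micro-batch ever live in memory at once, which is exactly the saving the Trainer exploits.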

2. Enable Gradient Checkpointing

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_checkpointing=True,  # recompute activations instead of storing them
    fp16=True,                    # 16-bit floats cut memory roughly in half
    report_to="none",
)

With gradient_checkpointing=True, the intermediate results (activations) of the forward pass are not stored; they are recomputed during the backward pass when needed. Training slows down by roughly 20-30%, but the memory savings are substantial.

3. Mixed Precision Training

training_args = TrainingArguments(
    output_dir="./results",
    fp16=True,   # float32 → float16 (half the memory, 1.5-2x faster)
    # bf16=True, # on A100 GPUs, bfloat16 is more numerically stable
    report_to="none",
)

4. Free Memory From Previous Runs

import torch
import gc

# Empty the GPU cache
torch.cuda.empty_cache()

# Run the Python garbage collector
gc.collect()

# Check current GPU memory usage
print(f"Allocated memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"Reserved memory:  {torch.cuda.memory_reserved() / 1024**3:.2f} GB")

Problem 6: Training Interrupted When the Session Disconnects

Symptom

On free Colab, if the session disconnects during a long training run, all training progress is lost.

Solution: A Checkpointing Strategy

Save checkpoints to Google Drive so that training can resume even after a disconnect.

from google.colab import drive
drive.mount('/content/drive')

training_args = TrainingArguments(
    # save checkpoints to Google Drive
    output_dir="/content/drive/MyDrive/model_checkpoints",
    
    # save at the end of every epoch
    save_strategy="epoch",
    evaluation_strategy="epoch",
    
    # keep only the 3 most recent checkpoints (saves disk space)
    save_total_limit=3,
    
    # automatically select the best model at the end
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    
    report_to="none",
)

์ฒดํฌํฌ์ธํŠธ์—์„œ ์ด์–ด ํ•™์Šตํ•˜๊ธฐ

์„ธ์…˜์ด ๋Š๊ธด ํ›„ ๋‹ค์‹œ ์‹œ์ž‘ํ•  ๋•Œ:

# Find the latest checkpoint
import os
checkpoint_dir = "/content/drive/MyDrive/model_checkpoints"
checkpoints = [d for d in os.listdir(checkpoint_dir) if d.startswith("checkpoint")]
# sort numerically: a plain string sort would rank checkpoint-500 after checkpoint-1000
latest_checkpoint = max(checkpoints, key=lambda d: int(d.split("-")[-1]))
checkpoint_path = os.path.join(checkpoint_dir, latest_checkpoint)

print(f"Resuming from checkpoint: {checkpoint_path}")

# Resume training from the checkpoint
trainer.train(resume_from_checkpoint=checkpoint_path)

A Trick for Keeping the Colab Session Alive

Running the following in the browser developer console (F12) attempts to reconnect automatically (an unofficial workaround):

// ๋ธŒ๋ผ์šฐ์ € ์ฝ˜์†”์—์„œ ์‹คํ–‰
function ClickConnect(){
  console.log("์—ฐ๊ฒฐ ์œ ์ง€ ํด๋ฆญ");
  document.querySelector("colab-connect-button").click()
}
setInterval(ClickConnect, 60000)  // 1๋ถ„๋งˆ๋‹ค ์‹คํ–‰

Note that this may stop working depending on Colab's policies. The most reliable safeguard is saving checkpoints frequently.


Problem 7: Tokenizer and Model Mismatch Errors

Symptom

ValueError: You are trying to use a fast tokenizer, which is not supported by this model.
# or
RuntimeError: The size of tensor a (30522) must match the size of tensor b (32000)
    at non-singleton dimension 1

Root Cause

This error occurs when the tokenizer's vocabulary size does not match the model's. For example, the English BERT model has a vocabulary of 30,522 entries, while LLaMA has 32,000. Mixing one pretrained model's tokenizer with another model's weights triggers it.

Solution

Always load the tokenizer and the model from the same model name.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "klue/bert-base"  # use the same model name for both

# Correct: load both from the same name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2  # number of classes
)

# Wrong: using a different model's tokenizer
# tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # English BERT
# model = AutoModelForSequenceClassification.from_pretrained("klue/bert-base")  # Korean BERT

Models commonly used for Korean NLP:

Model                                      Notes
klue/bert-base                             Korean-specific BERT, KLUE benchmark
snunlp/KR-ELECTRA-discriminator            Korean ELECTRA, fast fine-tuning
monologg/koelectra-base-v3-discriminator   KoELECTRA v3
beomi/kcbert-base                          KcBERT (trained on community text)

Problem 8: Dataset Format Errors (Dataset format, column names)

Symptom

KeyError: 'label'
# or
ValueError: The model did not return a loss from the inputs,
only the following keys: logits. For reference, the inputs it received are: input_ids, attention_mask.

Root Cause

The Hugging Face Trainer expects the training dataset to use specific column names.

  • The label column must be named label or labels.
  • The input columns must match the tokenizer's output keys, such as input_ids and attention_mask.

Solution

import pandas as pd
from datasets import Dataset

# Raw data (column names may differ from what the Trainer expects)
df = pd.DataFrame({
    "review": ["정말 좋아요", "별로예요", "괜찮네요"],
    "sentiment": [1, 0, 1]
})

# Rename the columns to the names the Trainer expects
df = df.rename(columns={
    "review": "text",
    "sentiment": "label"
})

# Convert the pandas DataFrame to a Hugging Face Dataset
dataset = Dataset.from_pandas(df)
print(dataset)
# Dataset({
#     features: ['text', 'label'],
#     num_rows: 3
# })

Splitting Train/Validation Sets

from datasets import Dataset, DatasetDict

# 80/20 split
split = dataset.train_test_split(test_size=0.2, seed=42)
dataset_dict = DatasetDict({
    "train": split["train"],
    "test": split["test"],
})

print(dataset_dict)

Removing Unneeded Columns After Tokenization

def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=128,
    )

tokenized_dataset = dataset_dict.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"],  # drop the raw text column (a column the Trainer doesn't recognize)
)

print(tokenized_dataset["train"].column_names)
# ['label', 'input_ids', 'attention_mask', 'token_type_ids']

Tips for Training Efficiently on Colab

1. Keep the Runtime Alive

Colab disconnects the session when the tab goes inactive or the browser is closed. Keep the Colab tab active while training.

# Print progress during training as a "still active" signal
from transformers import TrainerCallback

class ProgressCallback(TrainerCallback):
    def on_epoch_end(self, args, state, control, **kwargs):
        print(f"Epoch {state.epoch:.0f}/{args.num_train_epochs} complete")
        print(f"Current loss: {state.log_history[-1].get('loss', 'N/A')}")

2. Save the Model to Google Drive

# Save to Drive after training finishes
save_path = "/content/drive/MyDrive/my_model"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model saved to: {save_path}")

3. Automatic Batch Size Search

# Find the largest batch size that fits in GPU memory
def find_max_batch_size(model, tokenizer, start=32):
    batch_size = start
    while batch_size > 1:
        try:
            # try a forward pass with a test batch
            inputs = tokenizer(
                ["테스트 문장"] * batch_size,  # a short Korean test sentence
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=128
            ).to("cuda")
            with torch.no_grad():
                model(**inputs)
            print(f"Batch size {batch_size}: OK")
            return batch_size
        except RuntimeError:
            batch_size //= 2
            print(f"OOM: reducing batch size to {batch_size}")
            torch.cuda.empty_cache()
    return 1

4. Monitoring GPU Usage

# Check GPU memory in real time
def print_gpu_status():
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**3
        reserved = torch.cuda.memory_reserved() / 1024**3
        total = torch.cuda.get_device_properties(0).total_memory / 1024**3
        print(f"GPU memory: used {allocated:.2f}GB / reserved {reserved:.2f}GB / total {total:.2f}GB")
    else:
        print("No GPU (CPU mode)")

print_gpu_status()

Full Training Example: Fine-Tuning a Sentiment Analysis Model

The following is the complete code for fine-tuning a positive/negative sentiment model on a Korean movie review dataset.

# ===== 1. Install libraries =====
# !pip install transformers torch datasets scikit-learn evaluate accelerate

# ===== 2. Imports =====
import os
import torch
import numpy as np
import pandas as pd
from datasets import Dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
)
import evaluate

# ===== 3. Mount Google Drive =====
from google.colab import drive
drive.mount('/content/drive')

# ===== 4. Check the GPU =====
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")

# ===== 5. Prepare the data =====
# Example: the Naver movie review dataset (NSMC)
# In practice, use your own data or load one from Hugging Face datasets
file_path = "/content/drive/MyDrive/Dataset/nsmc_train.txt"

try:
    df = pd.read_csv(file_path, sep="\t")
    df = df.dropna()  # drop rows with missing values
    df = df.rename(columns={"document": "text", "label": "label"})
    print(f"Data loaded: {df.shape}")
    print(df.head())
except FileNotFoundError:
    # fall back to sample data
    print("File not found: using sample data")
    df = pd.DataFrame({
        "text": [
            "정말 재미있는 영화였어요", "별로였어요 시간 낭비",
            "최고의 작품입니다", "기대 이하였습니다",
            "강력 추천합니다", "다시는 안 볼 거예요",
        ],
        "label": [1, 0, 1, 0, 1, 0]
    })

# ===== 6. Build and split the Dataset =====
dataset = Dataset.from_pandas(df[["text", "label"]])
split = dataset.train_test_split(test_size=0.1, seed=42)
dataset_dict = DatasetDict({
    "train": split["train"],
    "validation": split["test"],
})
print(f"Train: {len(dataset_dict['train'])} examples, validation: {len(dataset_dict['validation'])} examples")

# ===== 7. Load the tokenizer and model =====
model_name = "klue/bert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
)
model = model.to(device)

# ===== 8. Tokenize =====
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=128,
    )

tokenized_dataset = dataset_dict.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"],
)
print("Tokenization complete")
print(tokenized_dataset)

# ===== 9. Set up the data collator =====
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# ===== 10. Define the metrics =====
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(
        predictions=predictions, references=labels
    )["accuracy"]
    f1 = f1_metric.compute(
        predictions=predictions, references=labels, average="binary"
    )["f1"]
    return {"accuracy": accuracy, "f1": f1}

# ===== 11. Training configuration =====
output_dir = "/content/drive/MyDrive/sentiment_model"

training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    fp16=(device == "cuda"),           # use fp16 only on GPU
    gradient_accumulation_steps=2,
    save_total_limit=2,
    report_to="none",                  # disable wandb
    logging_steps=50,
)

# ===== 12. Initialize the Trainer and train =====
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

print("Starting training!")
trainer.train()

# ===== 13. Final evaluation =====
results = trainer.evaluate()
print("\nFinal evaluation results:")
print(f"  Accuracy: {results['eval_accuracy']:.4f}")
print(f"  F1 Score: {results['eval_f1']:.4f}")

๋ชจ๋ธ ์ €์žฅ ๋ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

# ===== Save the model =====
save_path = "/content/drive/MyDrive/sentiment_model_final"

# save the model weights
model.save_pretrained(save_path)

# save the tokenizer (needed at inference time)
tokenizer.save_pretrained(save_path)

print(f"Model saved to: {save_path}")
print(f"Saved files: {os.listdir(save_path)}")
# ['config.json', 'model.safetensors', 'tokenizer.json', 'tokenizer_config.json', ...]

# ===== Load the model =====
from transformers import pipeline

# build a pipeline from the saved model
classifier = pipeline(
    "text-classification",
    model=save_path,
    tokenizer=save_path,
    device=0 if device == "cuda" else -1,
)

# inference test
test_texts = [
    "이 영화 정말 감동적이었어요!",  # "This movie was really moving!"
    "돈이 아까운 영화였습니다.",     # "That movie was a waste of money."
    "배우들 연기가 최고였어요.",     # "The acting was superb."
]

for text in test_texts:
    result = classifier(text)
    label = "positive" if result[0]["label"] == "LABEL_1" else "negative"
    score = result[0]["score"]
    print(f"'{text}' → {label} ({score:.2%})")
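Hard-coding "LABEL_1" as above is fragile; the generic names come from the model's default config. Passing id2label/label2id when loading the model makes the pipeline report readable labels directly. A sketch of the mapping logic (the label names here are our own choice, assuming 1 = positive):

```python
# Passing these to from_pretrained makes the pipeline output "positive"/"negative"
# instead of "LABEL_0"/"LABEL_1":
id2label = {0: "negative", 1: "positive"}
label2id = {v: k for k, v in id2label.items()}

# model = AutoModelForSequenceClassification.from_pretrained(
#     model_name, num_labels=2, id2label=id2label, label2id=label2id
# )

# The same mapping can also be applied after the fact to pipeline output:
def readable_label(raw_label):
    """Translate a default 'LABEL_<i>' name into the human-readable one."""
    idx = int(raw_label.split("_")[-1])
    return id2label[idx]

print(readable_label("LABEL_1"))  # positive
```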

Uploading to the Hugging Face Hub

If you share your fine-tuned model on the Hugging Face Hub, others can use it easily too.

# ===== Upload to the Hub =====
from huggingface_hub import notebook_login

# log in to Hugging Face (token required: https://huggingface.co/settings/tokens)
notebook_login()

# upload the model
model.push_to_hub("your-username/klue-bert-sentiment")
tokenizer.push_to_hub("your-username/klue-bert-sentiment")

print("Upload to the Hugging Face Hub complete!")
print("Model page: https://huggingface.co/your-username/klue-bert-sentiment")

After uploading, anyone can use the model in a few lines:

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="your-username/klue-bert-sentiment"
)
result = classifier("정말 재미있는 영화였어요!")

Conclusion

This post walked through the main errors that occur while training Transformers models on Google Colab and how to fix them.

Key Checklist

  1. Library installation: use scikit-learn, not sklearn
  2. File paths: mount Google Drive, then use paths under /content/drive/MyDrive/
  3. Data padding: handle it dynamically with DataCollatorWithPadding
  4. wandb: disable it with report_to="none"
  5. OOM errors: reduce the batch size + gradient_accumulation_steps + fp16=True
  6. Session disconnects: save checkpoints to Google Drive and use resume_from_checkpoint
  7. Tokenizer mismatches: always load the model and tokenizer from the same model_name
  8. Dataset format: standardize column names to text and label and drop unneeded columns

๊ฐ ๋‹จ๊ณ„์—์„œ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋Š” ๋ฌธ์ œ๋“ค์„ ์ดํ•ดํ•˜๊ณ  ์ด๋ฅผ ํ•ด๊ฒฐํ•ด ๋‚˜๊ฐ€๋ฉด, Colab ํ™˜๊ฒฝ์—์„œ ํšจ์œจ์ ์œผ๋กœ ๋ชจ๋ธ์„ ํ•™์Šต์‹œํ‚ฌ ์ˆ˜ ์žˆ๋‹ค.

@leekh8
A tech blog covering security, web development, and Python