🚀 Resolving Errors When Training Transformers Models in Google Colab

@leekh8 · November 08, 2024 · 3 min read

Code N Solve 📘: A Guide to Resolving Errors When Training Transformers Models in Google Colab

This post collects the errors a beginner is likely to run into while training a Transformer-based NLP model in Google Colab.

It covers library installation problems, file path setup, and data preprocessing issues, and explains why each error occurs and how to fix it.

Google Colab?

  • A cloud-based Jupyter notebook environment, useful for running Python code and training machine learning models.
  • In particular, it offers free GPU access, which is a big advantage when working with large datasets.

Transformers library [1]

  • A library from Hugging Face [2] that makes it easy to use natural language processing (NLP) models.

Problem 1: Library installation error - sklearn and datasets

  • Training a Transformers model in Google Colab requires libraries such as Hugging Face transformers, torch, and datasets.

  • However, installing sklearn can fail with an error like the following.

      ValueError: metadata-generation-failed
  • The error occurs because sklearn is not the package to install: the library is published on PyPI as scikit-learn, and sklearn is only its import name.

Solution

  • Install scikit-learn instead of sklearn. [3]

      !pip install -U scikit-learn
  • Install the other required libraries in one command.

      !pip install transformers torch datasets
  • This resolves the library installation errors.
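As a quick sanity check after installation, the sketch below reports which of the libraries named above still cannot be imported. The `REQUIRED` mapping and the `missing_packages` helper are illustrative names, not part of any library; the only point is that the pip name (scikit-learn) and the import name (sklearn) differ.

```python
from importlib.util import find_spec

# Maps the name passed to `pip install` to the name used in `import`.
# The two differ for scikit-learn, which is the source of the confusion above.
REQUIRED = {
    "scikit-learn": "sklearn",
    "transformers": "transformers",
    "torch": "torch",
    "datasets": "datasets",
}


def missing_packages(required):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Run in a Colab cell: !pip install " + " ".join(missing))
    else:
        print("All required libraries are importable.")
```

Running this in a fresh cell after the install commands should print the second message; anything listed as missing can be installed by its pip name.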

Problem 2: Google Drive file path setup [4]

  • To use a dataset stored in Google Drive, the Drive must first be connected to Colab.
  • If Drive is not mounted, reading a file under the Drive path raises FileNotFoundError.

Solution

  • Mount Google Drive in Colab.

      from google.colab import drive
      drive.mount('/content/drive')
  • Point the file path at the Google Drive location.

    • For example, if Dataset.json is in the Dataset folder of Google Drive, set the path as follows.
      import pandas as pd
      file_path = "/content/drive/MyDrive/Dataset/Dataset.json"
      data = pd.read_json(file_path, lines=True)
    • /content/drive/ is the base path under which Google Drive is mounted; the Drive contents appear under /content/drive/MyDrive/.
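A FileNotFoundError alone does not say whether Drive is unmounted or the file name is wrong. The sketch below separates the two cases before any data is read. The helper name `resolve_drive_file` and its `mount_root` default are assumptions made for illustration, based on the path used in the example above.

```python
from pathlib import Path


def resolve_drive_file(relative_path, mount_root="/content/drive/MyDrive"):
    """Return the full path of a Drive file, or raise a descriptive error."""
    root = Path(mount_root)
    if not root.is_dir():
        # The mount point itself is missing, so Drive has not been mounted.
        raise FileNotFoundError(
            f"{root} does not exist; run drive.mount('/content/drive') first."
        )
    full_path = root / relative_path
    if not full_path.is_file():
        # Drive is mounted, so the folder or file name is the likely culprit.
        raise FileNotFoundError(
            f"{full_path} not found; check the folder and file name."
        )
    return full_path
```

With Drive mounted, `file_path = resolve_drive_file("Dataset/Dataset.json")` yields the same path as the hard-coded string above and can be passed to `pd.read_json`.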

Problem 3: Data padding error when training a Transformers model [2]

  • If the sequences in a batch have different lengths, training can fail with ValueError: expected sequence of length ....
  • The cause is that sequences of unequal length cannot be stacked into a single tensor.

Solution

  • Use DataCollatorWithPadding to pad every input in a batch to the same length.
  from transformers import DataCollatorWithPadding, Trainer

  # Assumes tokenizer, model, training_args, and tokenized_dataset are already defined
  data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

  trainer = Trainer(
      model=model,
      args=training_args,
      train_dataset=tokenized_dataset,
      data_collator=data_collator,  # add padding automatically
  )
  • With this setting, each batch is padded to a uniform length automatically, which prevents the error.
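To make the idea of per-batch padding concrete, here is a minimal pure-Python sketch. It is a conceptual illustration only, not the library's implementation; the `pad_batch` helper, the pad id of 0, and the token ids are arbitrary values chosen for the example.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to the longest length in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Once every row has the same length, the batch can be converted to a rectangular tensor, which is the step that fails when lengths differ.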

Problem 4: wandb login prompt during training [5]

  • The Hugging Face Trainer can track training with Weights & Biases (wandb).
  • As a result, a wandb login prompt can appear when training starts.

Solution

  • If you do not need wandb, disable it by setting report_to="none" in TrainingArguments.
  from transformers import TrainingArguments

  training_args = TrainingArguments(
      output_dir="./results",
      evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
      per_device_train_batch_size=8,
      per_device_eval_batch_size=8,
      num_train_epochs=3,
      weight_decay=0.01,
      report_to="none",  # disable wandb and other logging integrations
  )
  • With this setting, training proceeds without a wandb login prompt.
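If editing TrainingArguments is inconvenient, an environment variable may achieve the same effect. The sketch below is an alternative to the method above, not part of it, and it assumes the installed wandb client honors WANDB_MODE. It must run before the Trainer is created.

```python
import os

# Assumption: the installed wandb client reads WANDB_MODE; the value
# "disabled" is meant to turn wandb calls into no-ops, so no login
# prompt should appear.
os.environ["WANDB_MODE"] = "disabled"
```

Setting report_to="none" remains the more explicit choice, since it is visible in the training configuration itself.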

Conclusion

This post covered how to resolve the main errors that come up when training a Transformers model in Google Colab.

Understanding the problems that can arise at each step and fixing them one by one makes it possible to train models efficiently in Colab.

I hope this post helps you work through these problems and train NLP models more reliably in Google Colab.
