torch.OutOfMemoryError: HIP out of memory. Tried to allocate 96.00 MiB. GPU 1 has a total capacity of 15.98 GiB of which 0 bytes is free

Post by Anonymous »

I am trying to start training on AMD again, but the problem is that I have run into this error once more with the following training code:

Code: Select all

import time
import torch
from datetime import datetime
from datasets import Dataset
import pandas as pd
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForLanguageModeling, AutoTokenizer, AutoModelForCausalLM
import os
from dotenv import load_dotenv
from huggingface_hub import login
import evaluate
import numpy as np

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

load_dotenv()
hf_token = os.getenv("HF_TOKEN")

if hf_token:
    login(hf_token)
else:
    raise ValueError("HF_TOKEN not found. Make sure you are authenticated.")

file_csv = "datasets/dataset_completo.csv"

if not os.path.exists(file_csv):
    raise FileNotFoundError(f"Error: the file {file_csv} does not exist!")

df = pd.read_csv(file_csv)

if df.empty:
    raise ValueError("Error: the dataset is empty!")

if {"request", "answer"}.issubset(df.columns):
df["conversations"] = df.apply(
lambda x: [
{"content": x["request"], "role": "user"},
{"content": x["answer"], "role": "assistant"}
], axis=1
)
df = df.drop(columns=["request", "answer"])

if "conversations" in df.columns:
dataset = Dataset.from_pandas(df)
else:
raise ValueError("Errore: il dataset non contiene né 'request' e 'answer' né 'conversations'.")

split_dataset = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split_dataset["train"]
eval_dataset = split_dataset["test"]

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    valid_indices = labels != -100
    filtered_predictions = predictions[valid_indices]
    filtered_labels = labels[valid_indices]

    return metric.compute(predictions=filtered_predictions, references=filtered_labels)

hf_output_path = "/storage/code/finetuning/hf_model7"

torch.cuda.empty_cache()

if os.path.exists(hf_output_path):
    print("Trained model found! Loading the model from the local folder.")
    model = AutoModelForCausalLM.from_pretrained(hf_output_path)
    tokenizer = AutoTokenizer.from_pretrained(hf_output_path)

else:
    print("No trained model found locally. Loading the base model from Hugging Face.")
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

tokenizer.pad_token = tokenizer.eos_token
tokenizer.chat_template = """{% for message in messages %}
{% if message.role == 'user' %}
: {{ message.content }} 
{% elif message.role == 'assistant' %}
: {{ message.content }} 
{% endif %}
{% endfor %}
"""

def format_conversation(example):
    return {"text": tokenizer.apply_chat_template(example["conversations"], tokenize=False)}

train_dataset = train_dataset.map(format_conversation)
eval_dataset = eval_dataset.map(format_conversation)

training_args = TrainingArguments(
    per_device_train_batch_size = 4,
    gradient_accumulation_steps = 8,
    warmup_ratio = 0.05,
    num_train_epochs = 5,
    learning_rate = 1e-5,
    fp16 = False,
    bf16 = True,
    logging_steps = 1,
    optim = "adamw_torch",
    weight_decay = 0.01,
    lr_scheduler_type = "cosine",
    seed = 3407,
    output_dir = hf_output_path,
    eval_strategy = "epoch",
    report_to = "none",
    save_strategy = "epoch"
)

start_train_time = time.time()
start_train_datetime = datetime.now().strftime("%H:%M:%S")

trainer = SFTTrainer(
    model = model,
    processing_class = tokenizer,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    args = training_args,
    compute_metrics = compute_metrics
)

trainer.train()

end_train_time = time.time()
end_train_datetime = datetime.now().strftime("%H:%M:%S")
training_time = end_train_time - start_train_time

trainer.model.save_pretrained(hf_output_path)
tokenizer.save_pretrained(hf_output_path)

print("\n**Training completato!**")
print(f"Modello addestrato salvato in: {hf_output_path}")
print(f"Inizio alle: {start_train_datetime}")
print(f"Fine alle: {end_train_datetime}")
print(f"Tempo totale di training: {training_time:.2f} secondi\n")
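One note on the allocator setting: the script sets PYTORCH_CUDA_ALLOC_CONF, while the HIP error below points at PYTORCH_HIP_ALLOC_CONF. A minimal sketch of setting the HIP-named variable instead, as the error text itself suggests (assuming a ROCm build of PyTorch):

Code: Select all

import os

# Set the allocator option named in the ROCm error message before torch
# allocates any GPU memory (assumption: the ROCm build reads the HIP-named variable).
os.environ["PYTORCH_HIP_ALLOC_CONF"] = "expandable_segments:True"

import torch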
And this is the error:
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 96.00 MiB. GPU 1 has a total capacity of 15.98 GiB of which 0 bytes is free. Of the allocated memory 15.72 GiB is allocated by PyTorch, and 21.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
This error is printed as soon as I start training with the plain command: python train.py. So according to the message, in this case I am out of memory by only 96 MiB...
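For scale, a rough back-of-the-envelope estimate (my own assumptions: ~3.2e9 parameters, bf16 weights and gradients, standard AdamW with two fp32 moment tensors, activations ignored) suggests a full fine-tune of a 3B-parameter model cannot fit on a single 16 GiB GPU, so the 96 MiB is merely the allocation that tipped it over:

Code: Select all

# Rough memory estimate for full fine-tuning, under the assumptions above.
params = 3.2e9
weights_gib = params * 2 / 2**30   # bf16 weights               ~ 6.0 GiB
grads_gib   = params * 2 / 2**30   # bf16 gradients             ~ 6.0 GiB
adam_gib    = params * 8 / 2**30   # fp32 exp_avg + exp_avg_sq  ~ 23.8 GiB
print(f"{weights_gib + grads_gib + adam_gib:.1f} GiB before activations")

The exact split depends on whether the weights are kept in fp32 or bf16, but either way the AdamW state alone already exceeds 16 GiB.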
But if I try to use torchrun --nproc_per_node=2 train_fsdp.py with these two lines added:

Code: Select all

fsdp="full_shard auto_wrap",
fsdp_transformer_layer_cls_to_wrap="LlamaDecoderLayer",
I run into this error:
[rank1]: Traceback (most recent call last):
[rank1]: File "/storage/code/finetuning/train_fsdp.py", line 138, in
[rank1]: trainer.train()
[rank1]: File "/storage/code/finetuning/ai_venv/lib/python3.12/site-packages/transformers/trainer.py", line 2240, in train
[rank1]: return inner_training_loop(
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/storage/code/finetuning/ai_venv/lib/python3.12/site-packages/transformers/trainer.py", line 2555, in _inner_training_loop
[rank1]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/storage/code/finetuning/ai_venv/lib/python3.12/site-packages/trl/trainer/sft_trainer.py", line 733, in training_step
[rank1]: return super().training_step(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/storage/code/finetuning/ai_venv/lib/python3.12/site-packages/transformers/trainer.py", line 3791, in training_step
[rank1]: self.accelerator.backward(loss, **kwargs)
[rank1]: File "/storage/code/finetuning/ai_venv/lib/python3.12/site-packages/accelerate/accelerator.py", line 2473, in backward
[rank1]: loss.backward(**kwargs)
[rank1]: File "/storage/code/finetuning/ai_venv/lib/python3.12/site-packages/torch/_tensor.py", line 648, in backward
[rank1]: torch.autograd.backward(
[rank1]: File "/storage/code/finetuning/ai_venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 353, in backward
[rank1]: _engine_run_backward(
[rank1]: File "/storage/code/finetuning/ai_venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 824, in _engine_run_backward
[rank1]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 752.00 MiB. GPU 1 has a total capacity of 15.98 GiB of which 590.00 MiB is free. Of the allocated memory 13.50 GiB is allocated by PyTorch, and 1.55 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/c ... -variables)
[rank0]: Traceback (most recent call last):
[rank0]: File "/storage/code/finetuning/train_fsdp.py", line 138, in
[rank0]: trainer.train()
[rank0]: File "/storage/code/finetuning/ai_venv/lib/python3.12/site-packages/transformers/trainer.py", line 2240, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/storage/code/finetuning/ai_venv/lib/python3.12/site-packages/transformers/trainer.py", line 2555, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/storage/code/finetuning/ai_venv/lib/python3.12/site-packages/trl/trainer/sft_trainer.py", line 733, in training_step
[rank0]: return super().training_step(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/storage/code/finetuning/ai_venv/lib/python3.12/site-packages/transformers/trainer.py", line 3791, in training_step
[rank0]: self.accelerator.backward(loss, **kwargs)
[rank0]: File "/storage/code/finetuning/ai_venv/lib/python3.12/site-packages/accelerate/accelerator.py", line 2473, in backward
[rank0]: loss.backward(**kwargs)
[rank0]: File "/storage/code/finetuning/ai_venv/lib/python3.12/site-packages/torch/_tensor.py", line 648, in backward
[rank0]: torch.autograd.backward(
[rank0]: File "/storage/code/finetuning/ai_venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 353, in backward
[rank0]: _engine_run_backward(
[rank0]: File "/storage/code/finetuning/ai_venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 824, in _engine_run_backward
[rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 752.00 MiB. GPU 0 has a total capacity of 15.98 GiB of which 576.00 MiB is free. Of the allocated memory 14.23 GiB is allocated by PyTorch, and 843.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/c ... -variables)
0%| | 0/260 [00:03
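For reference, a minimal sketch of how those two settings would slot into the existing TrainingArguments in train_fsdp.py (assuming the rest of the script matches the train.py shown above):

Code: Select all

# Sketch only: same arguments as in train.py, plus the two FSDP settings.
training_args = TrainingArguments(
    per_device_train_batch_size = 4,
    gradient_accumulation_steps = 8,
    bf16 = True,
    output_dir = hf_output_path,
    fsdp = "full_shard auto_wrap",
    fsdp_transformer_layer_cls_to_wrap = "LlamaDecoderLayer",
    # ... remaining arguments unchanged ...
)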

More details here: https://stackoverflow.com/questions/796 ... b-gpu-1-ha