An error occurs during Nimble training. #12

@jsun94

Hello,

Running the inference code produces no error,
but running the training code does.

import torch
import torchvision
import os

os.environ["CUDA_VISIBLE_DEVICES"]="1"

BATCH = 32


model = torchvision.models.resnet50(num_classes=10)
model = model.cuda()
model.train()

loss_fn = torch.nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

input_shape = [BATCH, 3, 32, 32]
dummy_input = torch.randn(*input_shape).cuda()

nimble_model = torch.cuda.Nimble(model)
nimble_model.prepare(dummy_input, training=True)

rand_input = torch.rand(*input_shape).cuda()
output = nimble_model(rand_input)

label = torch.zeros(BATCH, dtype=torch.long).cuda()
loss = loss_fn(output, label)

loss.backward()

optimizer.step()

When I run the code above, prepare reports:

TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
With rtol=1e-05 and atol=1e-05, found 297 element(s) (out of 320) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.0005762577056884766 (-0.9435920119285583 vs. -0.9441682696342468), which occurred at index (15, 3).

Since I get the error above, I would like to ask which part is wrong.
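
For reference, the tolerance figures in the warning can be checked by hand. The sketch below assumes the comparison uses the usual closeness rule |actual - expected| <= atol + rtol * |expected| (the form documented for torch.allclose). Which of the two reported values is treated as the reference is an assumption here, and the names within_tolerance, first_value, and second_value are illustrative only. The 320 compared elements match the output shape of this script: 32 samples times 10 classes.

```python
# Values copied from the TracerWarning quoted in this issue.
RTOL = 1e-05
ATOL = 1e-05
first_value = -0.9435920119285583
second_value = -0.9441682696342468

# Output shape of the script above: BATCH samples x num_classes logits.
BATCH = 32
NUM_CLASSES = 10
total_elements = BATCH * NUM_CLASSES
mismatched_elements = 297


def within_tolerance(actual, expected, rtol=RTOL, atol=ATOL):
    """Closeness rule assumed here: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


difference = abs(first_value - second_value)
# Assumption: second_value is the reference; swapping the two barely changes this bound.
allowed = ATOL + RTOL * abs(second_value)

print("elements compared:", total_elements)
print("fraction mismatched:", mismatched_elements / total_elements)
print("largest difference:", difference)
print("allowed difference:", allowed)
print("within tolerance:", within_tolerance(first_value, second_value))
```

Under these assumptions the largest reported difference is about 30 times the allowed margin, so the check fails by a wide margin relative to the tolerance even though the absolute gap is small (below 1e-3).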

Environment:
Ubuntu: 18.04
Linux: 5.4.0
PyTorch: 1.7.0
Python: 3.7.10
cuDNN and CUDA are the versions required by Nimble.
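
The exact CUDA and cuDNN versions are not listed above. A minimal sketch for printing them follows, assuming the installed PyTorch build exposes the standard torch.version.cuda and torch.backends.cudnn.version() attributes. The helper name collect_environment is hypothetical, and the script falls back gracefully when torch is not importable.

```python
import platform


def collect_environment():
    """Return version strings that are useful to paste into a bug report."""
    info = {
        "Python": platform.python_version(),
        "Kernel": platform.release(),
    }
    try:
        import torch
    except ImportError:
        # torch is unavailable in this interpreter; report that instead of failing.
        info["PyTorch"] = "not installed"
        return info
    info["PyTorch"] = torch.__version__
    # CUDA version the PyTorch binary was built against (None for CPU-only builds).
    info["CUDA (build)"] = str(torch.version.cuda)
    # cuDNN version as an integer, or None if cuDNN is unavailable.
    info["cuDNN"] = str(torch.backends.cudnn.version())
    return info


for name, value in collect_environment().items():
    print(f"{name}: {value}")
```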
