Update 05_ddp.md #525
base: master
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -35,8 +35,10 @@ | |||||
| download=True, | ||||||
| ) | ||||||
|
|
||||||
| sampler = flow.utils.data.distributed.DistributedSampler(training_data) | ||||||
|
|
||||||
| train_dataloader = flow.utils.data.DataLoader( | ||||||
| training_data, BATCH_SIZE, shuffle=True | ||||||
| training_data, BATCH_SIZE, shuffle=(sampler is None), sampler=sampler | ||||||
doombeaker marked this conversation as resolved.
|
||||||
| ) | ||||||
|
|
||||||
| model = flowvision.models.mobilenet_v2().to(DEVICE) | ||||||
|
|
@@ -48,6 +50,7 @@ | |||||
|
|
||||||
| for t in range(EPOCH_NUM): | ||||||
| print(f"Epoch {t+1}\n-------------------------------") | ||||||
| train_dataloader.sampler.set_epoch(t) | ||||||
| size = len(train_dataloader.dataset) | ||||||
| for batch, (x, y) in enumerate(train_dataloader): | ||||||
| x = x.to_global(placement=PLACEMENT, sbp=S0) | ||||||
|
|
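The two hunks above add a `DistributedSampler` and a per-epoch `set_epoch(t)` call. As a rough illustration of why both are needed, here is a minimal pure-Python sketch of how a distributed sampler typically shards indices. It is modeled on the common design (shared seed plus epoch, then a strided slice per rank) and is not OneFlow's actual implementation; `shard_indices` and its parameters are hypothetical names used only for this example.

```python
import random


def shard_indices(dataset_size, world_size, rank, epoch, seed=0):
    """Return the sample indices that one rank would read in a given epoch.

    Hypothetical helper for illustration only; not part of OneFlow.
    """
    indices = list(range(dataset_size))
    # Every rank seeds with the same (seed + epoch), so all ranks agree on
    # one permutation. Without a set_epoch-style update, the order would be
    # identical in every epoch.
    random.Random(seed + epoch).shuffle(indices)
    # Pad with leading samples so the list divides evenly across ranks.
    per_rank = -(-dataset_size // world_size)  # ceiling division
    total = per_rank * world_size
    indices += indices[: total - len(indices)]
    # Strided slice: rank r takes positions r, r + world_size, ...
    return indices[rank:total:world_size]


rank0 = shard_indices(10, world_size=2, rank=0, epoch=0)
rank1 = shard_indices(10, world_size=2, rank=1, epoch=0)
print("rank 0:", rank0)
print("rank 1:", rank1)
```

With a plain `shuffle=True` loader, each process would instead draw its own independent permutation, so the two ranks could read overlapping samples within one epoch.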
@@ -88,6 +91,8 @@ | |||||
| y = y.to_global(placement=PLACEMENT, sbp=S0) | ||||||
| ``` | ||||||
|
|
||||||
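| - Note that in distributed parallel training, the `BATCH_SIZE` specified in the code is the local value for each machine rather than the `GLOBAL_BATCH_SIZE`, so running the code above on one machine with two GPUs and `BATCH_SIZE=64` trains the same way as one machine with one GPU and `BATCH_SIZE=128`. | ||||||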
|
Collaborator
Suggested change
There should be spaces between Chinese and English text, and between Chinese text and numbers.
Collaborator
Actually, I don't think this sentence needs to be added here. A reader who understands global tensors should already understand this point.
OK. The global tensor documentation already explains how tensor shapes change and gives examples. I added this because a customer asked about it in a WeChat conversation.
||||||
|
|
||||||
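| In this way, following the introduction in [Common Distributed Parallel Strategies](./01_introduction.md), we have carried out distributed data-parallel training by splitting the data with `split(0)` and broadcasting the model. | ||||||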
|
|
||||||
| ## Data-Parallel Training with DistributedDataParallel | ||||||
|
|
||||||
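The batch-size note added in this hunk comes down to simple arithmetic. The sketch below assumes every rank holds a local batch of the same size and that the global tensor is formed by a `split(0)` concatenation along dimension 0. `global_batch_size` and `global_shape` are hypothetical helpers written for this example, not OneFlow APIs.

```python
def global_batch_size(local_batch_size: int, world_size: int) -> int:
    """Batch size seen by the model when each rank contributes one local batch."""
    return local_batch_size * world_size


def global_shape(local_shape, world_size):
    """Logical shape after concatenating equal local shards along dim 0."""
    return (local_shape[0] * world_size,) + tuple(local_shape[1:])


two_gpu = global_batch_size(64, world_size=2)   # one machine, two GPUs
one_gpu = global_batch_size(128, world_size=1)  # one machine, one GPU
print(two_gpu, one_gpu)  # 128 128
print(global_shape((64, 10), world_size=2))  # (128, 10)
```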