The use of 'discounts' in REINFORCE() class #38

This is a question about the REINFORCE() class in chapter 11.

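import numpy as np
import torch

class REINFORCE():
    ...
    def optimize_model(self):
        T = len(self.rewards)
        # discounts[k] = gamma**k for k = 0..T-1
        discounts = np.logspace(0, T, num=T, base=self.gamma, endpoint=False)
        # returns[t] = discounted sum of the rewards from step t to the end of the episode
        returns = np.array([np.sum(discounts[:T-t] * self.rewards[t:]) for t in range(T)])

        discounts = torch.FloatTensor(discounts).unsqueeze(1)
        returns = torch.FloatTensor(returns).unsqueeze(1)
        self.logpas = torch.cat(self.logpas)

        # each step's term is scaled by discounts[t] on top of the already discounted return
        policy_loss = -(discounts * returns * self.logpas).mean()
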
In the code above, 'returns' already takes 'discounts' into account: each entry is the discounted sum of the rewards from step t onward. So why is it multiplied by 'discounts' a second time when computing 'policy_loss'? I am not clear on this.
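
To make the question concrete, here is a minimal NumPy sketch that reproduces the two quantities outside the class. The value of `gamma`, the `rewards` array, and the name `weights` are made up for illustration and do not come from the book's code.

```python
import numpy as np

gamma = 0.9                              # assumed discount factor (illustrative)
rewards = np.array([1.0, 1.0, 1.0])      # made-up rewards for a 3-step episode
T = len(rewards)

# Same expressions as in optimize_model(), with self.gamma / self.rewards replaced.
discounts = np.logspace(0, T, num=T, base=gamma, endpoint=False)  # gamma**0, gamma**1, gamma**2
returns = np.array([np.sum(discounts[:T-t] * rewards[t:]) for t in range(T)])

# The factor that multiplies each log-probability in policy_loss.
weights = discounts * returns

print("discounts:", discounts)  # approximately [1.0, 0.9, 0.81]
print("returns:  ", returns)    # approximately [2.71, 1.9, 1.0]
print("weights:  ", weights)    # approximately [2.71, 1.71, 0.81]
```

Here `returns[t]` is already the discounted return from step t, and `weights[t]` scales it by an extra gamma**t. That extra factor looks like the gamma**t term in the REINFORCE update for the discounted episodic case in Sutton and Barto's textbook; whether that is the intent of this code is an assumption, not something the snippet itself confirms.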
