benchmarl.algorithms.Ippo

class Ippo(share_param_critic: bool, clip_epsilon: float, entropy_coef: bool, critic_coef: float, loss_critic_type: str, lmbda: float, scale_mapping: str, use_tanh_normal: bool, minibatch_advantage: bool, **kwargs)

Bases: Algorithm

Independent PPO (from https://arxiv.org/abs/2011.09533).

Parameters:

share_param_critic (bool) – whether to share the parameters of the critics within agent groups.
clip_epsilon (scalar) – clipping threshold for the importance weight (probability ratio) in the clipped PPO loss equation.
entropy_coef (scalar) – entropy multiplier when computing the total loss.
critic_coef (scalar) – critic loss multiplier when computing the total loss.
loss_critic_type (str) – loss function for the value discrepancy. Can be one of “l1”, “l2” or “smooth_l1”.
lmbda (float) – The GAE lambda
scale_mapping (str) – positive mapping function to be used with the std. Choices: “softplus”, “exp”, “relu”, “biased_softplus_1”.
use_tanh_normal (bool) – if True, use TanhNormal as the continuous action distribution with support bound to the action domain. Otherwise, an IndependentNormal is used.
minibatch_advantage (bool) – if True, advantage computation is performed on minibatches of size experiment.config.on_policy_minibatch_size instead of the full experiment.config.on_policy_collected_frames_per_batch, which helps keep memory usage from exploding.
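
To make the roles of clip_epsilon, entropy_coef, critic_coef and loss_critic_type concrete, the following is a minimal single-sample sketch of a clipped PPO loss in plain Python. It is not the library's implementation: the function names, the default values, and the output keys ("loss_objective", "loss_entropy", "loss_critic") are assumptions chosen for illustration, and a real loss operates on batched tensors rather than floats.

```python
import math


def critic_loss(pred, target, loss_critic_type="smooth_l1"):
    """Value discrepancy for one sample (illustrative helper, not a library function)."""
    diff = pred - target
    if loss_critic_type == "l1":
        return abs(diff)
    if loss_critic_type == "l2":
        return diff ** 2
    if loss_critic_type == "smooth_l1":
        # Usual smooth L1 definition: quadratic near zero, linear elsewhere.
        return 0.5 * diff ** 2 if abs(diff) < 1.0 else abs(diff) - 0.5
    raise ValueError(f"unknown loss_critic_type: {loss_critic_type}")


def clipped_ppo_loss(
    log_prob_new,
    log_prob_old,
    advantage,
    entropy,
    value_pred,
    value_target,
    clip_epsilon=0.2,      # illustrative default, not the library's
    entropy_coef=0.01,     # illustrative default, not the library's
    critic_coef=1.0,       # illustrative default, not the library's
    loss_critic_type="smooth_l1",
):
    # Importance weight between the current and the data-collecting policy.
    ratio = math.exp(log_prob_new - log_prob_old)
    # Clip the weight to [1 - clip_epsilon, 1 + clip_epsilon].
    clipped_ratio = max(1.0 - clip_epsilon, min(ratio, 1.0 + clip_epsilon))
    # Pessimistic (minimum) surrogate, negated because losses are minimized.
    loss_objective = -min(ratio * advantage, clipped_ratio * advantage)
    # Entropy bonus: higher entropy lowers the loss.
    loss_entropy = -entropy_coef * entropy
    loss_critic = critic_coef * critic_loss(value_pred, value_target, loss_critic_type)
    return {
        "loss_objective": loss_objective,
        "loss_entropy": loss_entropy,
        "loss_critic": loss_critic,
    }
```

With a ratio of 1.5, a positive advantage and clip_epsilon=0.2, the clipped term wins the minimum, so the policy gets no extra credit for moving further than 20% away from the data-collecting policy.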

_get_loss(group: str, policy_for_loss: TensorDictModule, continuous: bool) → Tuple[LossModule, bool]

Implement this function to return the LossModule for a specific group.

Parameters:

group (str) – agent group of the loss
policy_for_loss (TensorDictModule) – the policy to use in the loss
continuous (bool) – whether to return a loss for continuous or discrete actions

Returns: the LossModule and a bool indicating whether the loss should have target parameters

_get_parameters(group: str, loss: ClipPPOLoss) → Dict[str, Iterable]

Get the dictionary mapping loss names to the corresponding parameters to optimize for a given group loss.

Returns: a dictionary mapping loss names to a list of parameters

_get_policy_for_loss(group: str, model_config: ModelConfig, continuous: bool) → TensorDictModule

Get the non-explorative policy for a specific group.

Parameters:

group (str) – agent group of the policy
model_config (ModelConfig) – model config class
continuous (bool) – whether the policy should be continuous or discrete

Returns: TensorDictModule representing the policy

_get_policy_for_collection(policy_for_loss: TensorDictModule, group: str, continuous: bool) → TensorDictModule

Implement this function to add an explorative layer to the policy used in the loss.

Parameters:

policy_for_loss (TensorDictModule) – the group policy used in the loss
group (str) – agent group
continuous (bool) – whether the policy is continuous or discrete

Returns: TensorDictModule representing the explorative policy

process_batch(group: str, batch: TensorDictBase) → TensorDictBase

This function can be used to reshape data coming from collection before it is passed to the policy.

Parameters:

group (str) – agent group
batch (TensorDictBase) – the batch of data coming from the collector

Returns: the processed batch
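
One thing a batch-processing hook like this could do for an on-policy algorithm is attach advantage estimates to the collected data. The sketch below shows GAE with the lmbda parameter in plain Python, plus a chunked variant in the spirit of minibatch_advantage. It rests on several assumptions: that advantages are computed at this stage, that trajectories are plain dicts of lists, and that a discount gamma is supplied from elsewhere. The names gae and advantages_in_minibatches are hypothetical and are not part of the library.

```python
def gae(rewards, values, next_values, dones, gamma=0.99, lmbda=0.95):
    """Generalized Advantage Estimation over one trajectory (lists of floats)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the next step.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; no bootstrapping past a terminal step.
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        # lmbda trades bias (0: one-step TD) against variance (1: full return).
        running = delta + gamma * lmbda * not_done * running
        advantages[t] = running
    return advantages


def advantages_in_minibatches(trajectories, minibatch_size, gamma=0.99, lmbda=0.95):
    """Process trajectories chunk by chunk instead of all at once.

    Chunks are taken across trajectories, never across time, so every
    trajectory is still processed whole. With tensors, each chunk would be
    one vectorized call, which bounds peak memory by the chunk size.
    """
    results = []
    for start in range(0, len(trajectories), minibatch_size):
        chunk = trajectories[start:start + minibatch_size]
        for traj in chunk:
            results.append(gae(gamma=gamma, lmbda=lmbda, **traj))
    return results
```

Because each trajectory is handled independently, the chunked version returns the same advantages as processing everything in one pass; only the amount of data held in flight at a time changes.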

process_loss_vals(group: str, loss_vals: TensorDictBase) → TensorDictBase

Here you can modify the loss_vals tensordict, which contains entries of the form loss_name->loss_value. For example, you can sum two entries into a new entry to optimize them together.

Parameters:

group (str) – agent group
loss_vals (TensorDictBase) – the tensordict returned by the loss forward method

Returns: the processed loss_vals
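
The idea of summing two entries so that they are optimized together can be sketched with a plain dict standing in for the tensordict. The entry names ("loss_objective", "loss_entropy", "loss_critic") and the helper name merge_entropy_into_objective are assumptions made for illustration; a real override would receive and return a TensorDictBase.

```python
def merge_entropy_into_objective(loss_vals):
    """Fold the entropy term into the objective term (illustrative only).

    `loss_vals` is a plain dict mapping loss names to values, used here as a
    stand-in for the tensordict returned by the loss forward method.
    """
    merged = dict(loss_vals)  # copy so the caller's mapping is left untouched
    # After the merge there is one entry fewer, so one backward pass
    # optimizes both terms together.
    merged["loss_objective"] = merged["loss_objective"] + merged.pop("loss_entropy")
    return merged


loss_vals = {"loss_objective": 1.0, "loss_entropy": -0.25, "loss_critic": 0.5}
processed = merge_entropy_into_objective(loss_vals)
```

Here processed holds two entries, with the entropy term absorbed into the objective, while the critic entry passes through unchanged.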

get_critic(group: str) → TensorDictModule
