benchmarl.algorithms.Masac
- class Masac(share_param_critic: bool, num_qvalue_nets: int, loss_function: str, delay_qvalue: bool, target_entropy: float | str, discrete_target_entropy_weight: float, alpha_init: float, min_alpha: float | None, max_alpha: float | None, fixed_alpha: bool, scale_mapping: str, use_tanh_normal: bool, coupled_discrete_values: bool, **kwargs)
Bases: Algorithm
Multi Agent Soft Actor Critic.
- Parameters:
share_param_critic (bool) – Whether to share the parameters of the critics within agent groups
num_qvalue_nets (integer) – number of Q-Value networks used.
loss_function (str) – loss function to be used with the value function loss.
delay_qvalue (bool) – Whether to separate the target Q value networks from the Q value networks used for data collection.
target_entropy (float or str, optional) – Target entropy for the stochastic policy. Default is “auto”, where target entropy is computed as -prod(n_actions).
discrete_target_entropy_weight (float) – weight for the target entropy term when actions are discrete
alpha_init (float) – initial entropy multiplier.
min_alpha (float) – min value of alpha.
max_alpha (float) – max value of alpha.
fixed_alpha (bool) – if True, alpha will be fixed to its initial value. Otherwise, alpha will be optimized to match the ‘target_entropy’ value.
scale_mapping (str) – positive mapping function to be used with the std. Choices: “softplus”, “exp”, “relu”, “biased_softplus_1”.
use_tanh_normal (bool) – if True, use TanhNormal as the continuous action distribution with support bound to the action domain. Otherwise, an IndependentNormal is used.
coupled_discrete_values (bool) – only relevant for discrete action spaces. If True, the critic will predict n_agents x n_actions action values given the global state (or concatenation of agents’ observations). If False, the critic will predict n_actions values given the global state and the actions of the other agents. This is done for all agents in parallel. True is more theoretically sound and should be preferred. However, if the number of outputs gets too large, you may want to try False.
- _get_loss(group: str, policy_for_loss: TensorDictModule, continuous: bool) → Tuple[LossModule, bool]
Implement this function to return the LossModule for a specific group.
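- Parameters:
group (str) – agent group of the loss
policy_for_loss (TensorDictModule) – the policy to use in the loss
continuous (bool) – whether the loss is for continuous or discrete actions
Returns: LossModule and a bool representing if the loss should have target parameters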
- _get_parameters(group: str, loss: LossModule) → Dict[str, Iterable]
Get the dictionary mapping loss names to the relative parameters to optimize for a given group loss.
Returns: a dictionary mapping loss names to a parameters’ list
- _get_policy_for_loss(group: str, model_config: ModelConfig, continuous: bool) → TensorDictModule
Get the non-explorative policy for a specific group.
- Parameters:
group (str) – agent group of the policy
model_config (ModelConfig) – model config class
continuous (bool) – whether the policy should be continuous or discrete
Returns: TensorDictModule representing the policy
- _get_policy_for_collection(policy_for_loss: TensorDictModule, group: str, continuous: bool) → TensorDictModule
Implement this function to add an explorative layer to the policy used in the loss.
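- Parameters:
policy_for_loss (TensorDictModule) – the group policy used in the loss
group (str) – agent group
continuous (bool) – whether the policy is continuous or discrete
Returns: TensorDictModule representing the explorative policy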
- process_batch(group: str, batch: TensorDictBase) → TensorDictBase
This function can be used to reshape data coming from collection before it is passed to the policy.
- Parameters:
group (str) – agent group
batch (TensorDictBase) – the batch of data coming from the collector
Returns: the processed batch
- get_discrete_value_module_coupled(group: str) → TensorDictModule
- get_discrete_value_module_decoupled(group: str) → TensorDictModule
- get_continuous_value_module(group: str) → TensorDictModule
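The coupled and decoupled discrete value modules differ in how many action values one critic produces, as described for coupled_discrete_values. The following is a minimal sketch of that output-size arithmetic only. The function name is hypothetical and not part of BenchMARL, and the agent and action counts are made-up example values.

```python
def critic_outputs_per_forward(n_agents: int, n_actions: int, coupled: bool) -> int:
    """Number of action values predicted by one discrete critic (hypothetical helper)."""
    if n_agents < 1 or n_actions < 1:
        raise ValueError("n_agents and n_actions must be positive")
    if coupled:
        # Coupled: n_agents x n_actions values from the global state
        return n_agents * n_actions
    # Decoupled: n_actions values per agent, given the global state
    # and the actions of the other agents
    return n_actions


# Compare how the two variants scale with the number of agents (example values).
sizes = {
    n_agents: (
        critic_outputs_per_forward(n_agents, 5, coupled=True),
        critic_outputs_per_forward(n_agents, 5, coupled=False),
    )
    for n_agents in (2, 8, 32)
}
for n_agents, (coupled_size, decoupled_size) in sizes.items():
    print(f"{n_agents} agents: coupled={coupled_size}, decoupled={decoupled_size}")
```

The coupled output grows linearly with the number of agents while the decoupled output stays constant, which is the situation where trying coupled_discrete_values=False may be worthwhile.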
- _abc_impl = <_abc._abc_data object>