Base

Module

class sonnet.Module(name=None)[source]

Base class for Sonnet modules.

A Sonnet module is a lightweight container for variables and other modules. Modules typically define one or more “forward” methods (e.g. __call__) which apply operations combining user input and module parameters. For example:

>>> class MultiplyModule(snt.Module):
...   def __call__(self, x):
...     if not hasattr(self, 'w'):
...       self.w = tf.Variable(2., name='w')
...     return x * self.w

>>> mod = MultiplyModule()
>>> mod(1.)
<tf.Tensor: ... numpy=2.0>

Sonnet modules are a layer on top of tf.Module, implementing automatic name scoping as described in the original RFC [1].

__init__(name=None)[source]

Initializes the current module with the given name.

Subclasses should call this constructor before creating other modules or variables such that those modules are named correctly.

Parameters

name (Optional[str]) – An optional string name for the class. Must be a valid Python identifier. If name is not provided then the class name for the current instance is converted to lower_snake_case and used instead.

property variables

Sequence of tf.Variables owned by this module and its submodules.

See tf.Module.variables for implementation details.

NOTE: Most Sonnet modules create variables lazily (e.g. the first time they are called). As such just after construction there are typically no variables. To mitigate a common error (calling .variables or .trainable_variables before any variables are created) these properties will raise an exception if their result is empty. See allow_empty_variables() if you want to suppress this error.

Returns

A sequence of variables for the current module (sorted by attribute name) followed by variables from all submodules recursively (breadth first).
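
For example, most modules create their variables lazily on first call; a minimal sketch using snt.Linear:

>>> mod = snt.Linear(output_size=4)
>>> _ = mod(tf.ones([2, 3]))  # First call creates the variables.
>>> assert len(mod.variables) == 2  # w and b.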

property trainable_variables

Sequence of tf.Variables owned by this module and its submodules.

See tf.Module.trainable_variables for implementation details.

NOTE: Most Sonnet modules create variables lazily (e.g. the first time they are called). As such just after construction there are typically no variables. To mitigate a common error (calling .variables or .trainable_variables before any variables are created) these properties will raise an exception if their result is empty. See allow_empty_variables() if you want to suppress this error.

Returns

A sequence of variables for the current module (sorted by attribute name) followed by variables from all submodules recursively (breadth first).

once

sonnet.once(f)[source]

Decorator which ensures a wrapped method is only ever run once.

>>> @snt.once
... def f():
...   print('Hello, world!')
>>> f()
Hello, world!
>>> f()
>>> f()

If f is a method then it will be evaluated once per instance:

>>> class MyObject:
...   @snt.once
...   def f(self):
...     print('Hello, world!')
>>> o = MyObject()
>>> o.f()
Hello, world!
>>> o.f()
>>> o2 = MyObject()
>>> o2.f()
Hello, world!
>>> o.f()
>>> o2.f()

If an error is raised during execution of f it will be raised to the user. Next time the method is run, it will be treated as not having run before.

Parameters

f – A function to wrap which should only be called once.

Returns

Wrapped version of f which will only evaluate f the first time it is called.

no_name_scope

sonnet.no_name_scope(method)[source]

Decorator to wrap a method, preventing automatic name scope wrapping.

By default, any method on a module is considered a forward function, and so any variables / modules created by the method will be scoped as belonging to the module. In some cases this is undesirable, for example when implementing .clone() / .transpose(), since in those cases we want the new module to have the scope of wherever the .transpose() call is made. To allow this, decorate any such methods with no_name_scope.

Parameters

method (TypeVar(T)) – the method to wrap.

Return type

TypeVar(T)

Returns

The method, with a flag indicating no name scope wrapping should occur.
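
For example, a minimal sketch of a module with a clone() method (a hypothetical method, mirroring the .clone() case mentioned above) whose result should live in the caller's scope:

>>> class Cloneable(snt.Module):
...   @snt.no_name_scope
...   def clone(self):
...     # Modules/variables created here are not scoped under this
...     # module's name scope, but under the caller's scope.
...     return Cloneable()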

Deferred

class sonnet.Deferred(*args, **kwargs)[source]

Defers the construction of another module until the first call.

Deferred can be used to declare modules that depend on computed properties of other modules before those modules are defined. This allows users to separate the declaration and use of modules. For example at the start of your program you can declare two modules which are coupled:

>>> encoder = snt.Linear(64)
>>> decoder = snt.Deferred(lambda: snt.Linear(encoder.input_size))

Later you can use these naturally (note that using decoder first would cause an error, since encoder.input_size is only defined after encoder has been called):

>>> x = tf.ones([8, 32])
>>> y = encoder(x)
>>> z = decoder(y)  # Constructs the Linear encoder by calling the lambda.

The result will satisfy the following conditions:

>>> assert x.shape == z.shape
>>> assert y.shape == [8, 64]
>>> assert decoder.input_size == encoder.output_size
>>> assert decoder.output_size == encoder.input_size

__init__(constructor, call_methods=('__call__',), name=None)[source]

Initializes the Deferred module.

Parameters
  • constructor – A no argument callable which constructs the module to defer to. The first time one of the call_methods are called the constructor will be run and then the constructed module will be called with the same method and arguments as the deferred module.

  • call_methods – Methods which should trigger construction of the target module. The default value configures this module to construct the first time __call__ is run. If you want to trigger construction from methods other than __call__ you should pass them explicitly (optionally including __call__), for example call_methods=("__call__", "encode", "decode").

  • name – Name for the deferred module.

property target

Returns the target module.

If the constructor has not already run this will trigger construction. Subsequent calls to target will return the same instance.

Returns

A Module instance as created by self.constructor() .
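
For example, a minimal sketch showing that accessing target constructs the module exactly once:

>>> deferred = snt.Deferred(lambda: snt.Linear(2))
>>> linear = deferred.target  # Triggers construction.
>>> assert isinstance(linear, snt.Linear)
>>> assert deferred.target is linear  # Same instance on later access.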

__call__(*args, **kwargs)[source]

Call self as a function.

__str__()[source]

Return str(self).

__repr__()[source]

Return repr(self).

__delattr__(name)[source]

Implement delattr(self, name).

Linear modules

Linear

class sonnet.Linear(output_size, with_bias=True, w_init=None, b_init=None, name=None)[source]

Linear module, optionally including bias.
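
Example Usage (a minimal sketch; w and b are the weight and bias variables created on first call):

>>> layer = snt.Linear(output_size=4)
>>> outputs = layer(tf.ones([8, 10]))
>>> assert outputs.shape == [8, 4]
>>> assert layer.w.shape == [10, 4] and layer.b.shape == [4]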

__init__(output_size, with_bias=True, w_init=None, b_init=None, name=None)[source]

Constructs a Linear module.

Parameters
  • output_size (int) – Output dimensionality.

  • with_bias (bool) – Whether to include bias parameters. Default True.

  • w_init (Optional[Initializer]) – Optional initializer for the weights. By default the weights are initialized from a truncated normal distribution with a standard deviation of 1 / sqrt(input_feature_size), which is commonly used when the inputs are zero centered (see https://arxiv.org/abs/1502.03167v3).

  • b_init (Optional[Initializer]) – Optional initializer for the bias. By default the bias is initialized to zero.

  • name (Optional[str]) – Name of the module.

__call__(*args, **kwargs)[source]

Call self as a function.

Bias

class sonnet.Bias(output_size=None, bias_dims=None, b_init=None, name=None)[source]

Bias module.

Example Usage:

>>> N, H, W, C = 1, 2, 3, 4
>>> x = tf.random.normal([N, H, W, C])
>>> scalar_bias = snt.Bias(bias_dims=[])
>>> scalar_bias_output = scalar_bias(x)
>>> assert scalar_bias.b.shape == []

Create a bias over all non-minibatch dimensions:

>>> all_bias = snt.Bias()
>>> all_bias_output = all_bias(x)
>>> assert all_bias.b.shape == [H, W, C]

Create a bias over the last non-minibatch dimension:

>>> last_bias = snt.Bias(bias_dims=[-1])
>>> last_bias_output = last_bias(x)
>>> assert last_bias.b.shape == [C]

Create a bias over the first non-minibatch dimension:

>>> first_bias = snt.Bias(bias_dims=[1])
>>> first_bias_output = first_bias(x)
>>> assert first_bias.b.shape == [H, 1, 1]

Subtract and later add the same learned bias:

>>> bias = snt.Bias()
>>> h1 = bias(x, multiplier=-1)
>>> h2 = bias(x)
>>> h3 = bias(x, multiplier=-1)
>>> reconstructed_x = bias(h3)
>>> assert tf.reduce_all(tf.equal(x, reconstructed_x))

__init__(output_size=None, bias_dims=None, b_init=None, name=None)[source]

Constructs a Bias module that supports broadcasting.

Parameters
  • output_size (Optional[int]) – Output size (output shape without batch dimension). If output_size is left as None, the size will be inferred directly from the input.

  • bias_dims (Optional[Sequence[int]]) – Sequence of which dimensions to retain from the input shape when constructing the bias. The remaining dimensions will be broadcast over (given size of 1), and leading dimensions will be removed completely. See class doc for examples.

  • b_init (Optional[Initializer]) – Optional initializer for the bias. Default to zeros.

  • name (Optional[str]) – Name of the module.

__call__(*args, **kwargs)[source]

Call self as a function.

Convolutional modules

Conv1D

class sonnet.Conv1D(output_channels, kernel_shape, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NWC', name=None)[source]

Conv1D module.

__init__(output_channels, kernel_shape, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NWC', name=None)[source]

Constructs a Conv1D module.

Parameters
  • output_channels (int) – The number of output channels.

  • kernel_shape (Union[int, Sequence[int]]) – Sequence of length 1, or an integer. kernel_shape will be expanded to define a kernel size in all dimensions.

  • stride (Union[int, Sequence[int]]) – Sequence of strides of length 1, or an integer. stride will be expanded to define stride in all dimensions.

  • rate (Union[int, Sequence[int]]) – Sequence of dilation rates of length 1, or integer that is used to define dilation rate in all dimensions. 1 corresponds to standard convolution, rate > 1 corresponds to dilated convolution.

  • padding (Union[str, Callable[[int], Sequence[int]], Sequence[Callable[[int], Sequence[int]]]]) – Padding to apply to the input. This can be either SAME, VALID or a callable or sequence of callables of size 1. Any callables must take a single integer argument equal to the effective kernel size and return a list of two integers representing the padding before and after. See snt.pad.* for more details and example functions.

  • with_bias (bool) – Whether to include bias parameters. Default True.

  • w_init (Optional[Initializer]) – Optional initializer for the weights. By default the weights are initialized from a truncated normal distribution with a standard deviation of 1 / sqrt(input_feature_size), which is commonly used when the inputs are zero centered (see https://arxiv.org/abs/1502.03167v3).

  • b_init (Optional[Initializer]) – Optional initializer for the bias. By default the bias is initialized to zero.

  • data_format (str) – The data format of the input.

  • name (Optional[str]) – Name of the module.

Conv2D

class sonnet.Conv2D(output_channels, kernel_shape, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NHWC', name=None)[source]

Conv2D module.
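
Example Usage (a minimal sketch with the default NHWC data format, SAME padding and stride 1):

>>> conv = snt.Conv2D(output_channels=16, kernel_shape=3)
>>> images = tf.random.normal([8, 32, 32, 3])
>>> features = conv(images)
>>> assert features.shape == [8, 32, 32, 16]  # Spatial dimensions are preserved.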

__init__(output_channels, kernel_shape, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NHWC', name=None)[source]

Constructs a Conv2D module.

Parameters
  • output_channels (int) – The number of output channels.

  • kernel_shape (Union[int, Sequence[int]]) – Sequence of kernel sizes (of length 2), or an integer. kernel_shape will be expanded to define a kernel size in all dimensions.

  • stride (Union[int, Sequence[int]]) – Sequence of strides (of length 2), or an integer. stride will be expanded to define stride in all dimensions.

  • rate (Union[int, Sequence[int]]) – Sequence of dilation rates (of length 2), or integer that is used to define dilation rate in all dimensions. 1 corresponds to standard convolution, rate > 1 corresponds to dilated convolution.

  • padding (Union[str, Callable[[int], Sequence[int]], Sequence[Callable[[int], Sequence[int]]]]) – Padding to apply to the input. This can be either SAME, VALID or a callable or sequence of callables of size 2. Any callables must take a single integer argument equal to the effective kernel size and return a list of two integers representing the padding before and after. See snt.pad.* for more details and example functions.

  • with_bias (bool) – Whether to include bias parameters. Default True.

  • w_init (Optional[Initializer]) – Optional initializer for the weights. By default the weights are initialized from a truncated normal distribution with a standard deviation of 1 / sqrt(input_feature_size), which is commonly used when the inputs are zero centered (see https://arxiv.org/abs/1502.03167v3).

  • b_init (Optional[Initializer]) – Optional initializer for the bias. By default the bias is initialized to zero.

  • data_format (str) – The data format of the input.

  • name (Optional[str]) – Name of the module.

Conv3D

class sonnet.Conv3D(output_channels, kernel_shape, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NDHWC', name=None)[source]

Conv3D module.

__init__(output_channels, kernel_shape, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NDHWC', name=None)[source]

Constructs a Conv3D module.

Parameters
  • output_channels (int) – The number of output channels.

  • kernel_shape (Union[int, Sequence[int]]) – Sequence of kernel sizes (of length 3), or an integer. kernel_shape will be expanded to define a kernel size in all dimensions.

  • stride (Union[int, Sequence[int]]) – Sequence of strides (of length 3), or an integer. stride will be expanded to define stride in all dimensions.

  • rate (Union[int, Sequence[int]]) – Sequence of dilation rates (of length 3), or integer that is used to define dilation rate in all dimensions. 1 corresponds to standard convolution, rate > 1 corresponds to dilated convolution.

  • padding (Union[str, Callable[[int], Sequence[int]], Sequence[Callable[[int], Sequence[int]]]]) – Padding to apply to the input. This can be either SAME, VALID or a callable or sequence of callables of size 3. Any callables must take a single integer argument equal to the effective kernel size and return a list of two integers representing the padding before and after. See snt.pad.* for more details and example functions.

  • with_bias (bool) – Whether to include bias parameters. Default True.

  • w_init (Optional[Initializer]) – Optional initializer for the weights. By default the weights are initialized from a truncated normal distribution with a standard deviation of 1 / sqrt(input_feature_size), which is commonly used when the inputs are zero centered (see https://arxiv.org/abs/1502.03167v3).

  • b_init (Optional[Initializer]) – Optional initializer for the bias. By default the bias is initialized to zero.

  • data_format (str) – The data format of the input.

  • name (Optional[str]) – Name of the module.

Conv1DTranspose

class sonnet.Conv1DTranspose(output_channels, kernel_shape, output_shape=None, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NWC', name=None)[source]

A 1D transpose convolutional module.

__init__(output_channels, kernel_shape, output_shape=None, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NWC', name=None)[source]

Constructs a Conv1DTranspose module.

Parameters
  • output_channels (int) – Number of output channels.

  • kernel_shape (Union[int, Sequence[int]]) – Sequence of integers (of length 1), or an integer representing kernel shape. kernel_shape will be expanded to define a kernel size in all dimensions.

  • output_shape (Union[int, Sequence[int], TensorShape, None]) – Output shape of the spatial dimensions of a transpose convolution. Can be either an integer, an iterable of integers or Dimensions, or a TensorShape (of length 1). If a None value is given, a default shape is automatically calculated.

  • stride (Union[int, Sequence[int]]) – Sequence of integers (of length 1), or an integer. stride will be expanded to define stride in all dimensions.

  • rate (Union[int, Sequence[int]]) – Sequence of integers (of length 1), or integer that is used to define dilation rate in all dimensions. 1 corresponds to standard 1D convolution, rate > 1 corresponds to dilated convolution.

  • padding (str) – Padding algorithm, either “SAME” or “VALID”.

  • with_bias (bool) – Boolean, whether to include bias parameters. Default True.

  • w_init (Optional[Initializer]) – Optional initializer for the weights. By default the weights are initialized from a truncated normal distribution with a standard deviation of 1 / sqrt(input_feature_size), which is commonly used when the inputs are zero centered (see https://arxiv.org/abs/1502.03167v3).

  • b_init (Optional[Initializer]) – Optional initializer for the bias. By default the bias is initialized to zero.

  • data_format (str) – The data format of the input.

  • name (Optional[str]) – Name of the module.

Conv2DTranspose

class sonnet.Conv2DTranspose(output_channels, kernel_shape, output_shape=None, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NHWC', name=None)[source]

A 2D transpose convolutional module.
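
Example Usage (a minimal sketch; with SAME padding, stride 2 and no explicit output_shape, the default calculated output shape doubles the spatial dimensions):

>>> deconv = snt.Conv2DTranspose(output_channels=3, kernel_shape=4, stride=2)
>>> features = tf.random.normal([8, 32, 32, 16])
>>> upsampled = deconv(features)
>>> assert upsampled.shape == [8, 64, 64, 3]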

__init__(output_channels, kernel_shape, output_shape=None, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NHWC', name=None)[source]

Constructs a Conv2DTranspose module.

Parameters
  • output_channels (int) – The number of output channels.

  • kernel_shape (Union[int, Sequence[int]]) – Sequence of integers (of length 2), or an integer representing kernel shape. kernel_shape will be expanded to define a kernel size in all dimensions.

  • output_shape (Union[int, Sequence[int], TensorShape, None]) – Output shape of the spatial dimensions of a transpose convolution. Can be either an integer, an iterable of integers or Dimensions, or a TensorShape (of length 2). If a None value is given, a default shape is automatically calculated.

  • stride (Union[int, Sequence[int]]) – Sequence of integers (of length 2), or an integer. stride will be expanded to define stride in all dimensions.

  • rate (Union[int, Sequence[int]]) – Sequence of integers (of length 2), or integer that is used to define dilation rate in all dimensions. 1 corresponds to standard 2D convolution, rate > 1 corresponds to dilated convolution.

  • padding (str) – Padding algorithm, either “SAME” or “VALID”.

  • with_bias (bool) – Boolean, whether to include bias parameters. Default True.

  • w_init (Optional[Initializer]) – Optional initializer for the weights. By default the weights are initialized from a truncated normal distribution with a standard deviation of 1 / sqrt(input_feature_size), which is commonly used when the inputs are zero centered (see https://arxiv.org/abs/1502.03167v3).

  • b_init (Optional[Initializer]) – Optional initializer for the bias. By default the bias is initialized to zero.

  • data_format (str) – The data format of the input.

  • name (Optional[str]) – Name of the module.

Conv3DTranspose

class sonnet.Conv3DTranspose(output_channels, kernel_shape, output_shape=None, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NDHWC', name=None)[source]

A 3D transpose convolutional module.

__init__(output_channels, kernel_shape, output_shape=None, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NDHWC', name=None)[source]

Constructs a Conv3DTranspose module.

Parameters
  • output_channels (int) – The number of output channels.

  • kernel_shape (Union[int, Sequence[int]]) – Sequence of integers (of length 3), or an integer representing kernel shape. kernel_shape will be expanded to define a kernel size in all dimensions.

  • output_shape (Union[int, Sequence[int], TensorShape, None]) – Output shape of the spatial dimensions of a transpose convolution. Can be either an integer, an iterable of integers or Dimensions, or a TensorShape (of length 3). If a None value is given, a default shape is automatically calculated.

  • stride (Union[int, Sequence[int]]) – Sequence of integers (of length 3), or an integer. stride will be expanded to define stride in all dimensions.

  • rate (Union[int, Sequence[int]]) – Sequence of integers (of length 3), or integer that is used to define dilation rate in all dimensions. 1 corresponds to standard 3D convolution, rate > 1 corresponds to dilated convolution.

  • padding (str) – Padding algorithm, either “SAME” or “VALID”.

  • with_bias (bool) – Boolean, whether to include bias parameters. Default True.

  • w_init (Optional[Initializer]) – Optional initializer for the weights. By default the weights are initialized from a truncated normal distribution with a standard deviation of 1 / sqrt(input_feature_size), which is commonly used when the inputs are zero centered (see https://arxiv.org/abs/1502.03167v3).

  • b_init (Optional[Initializer]) – Optional initializer for the bias. By default the bias is initialized to zero.

  • data_format (str) – The data format of the input.

  • name (Optional[str]) – Name of the module.

DepthwiseConv2D

class sonnet.DepthwiseConv2D(kernel_shape, channel_multiplier=1, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NHWC', name=None)[source]

Spatial depth-wise 2D convolution module, including bias.

This acts as a light wrapper around the TensorFlow op tf.nn.depthwise_conv2d, abstracting away variable creation and sharing.
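
Example Usage (a minimal sketch):

>>> depthwise = snt.DepthwiseConv2D(kernel_shape=3, channel_multiplier=2)
>>> x = tf.random.normal([8, 32, 32, 4])
>>> y = depthwise(x)
>>> assert y.shape == [8, 32, 32, 8]  # channel_multiplier * input_channels.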

__init__(kernel_shape, channel_multiplier=1, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NHWC', name=None)[source]

Constructs a DepthwiseConv2D module.

Parameters
  • kernel_shape (Union[int, Sequence[int]]) – Sequence of kernel sizes (of length num_spatial_dims), or an integer. kernel_shape will be expanded to define a kernel size in all dimensions.

  • channel_multiplier (int) – Number of channels to expand convolution to. Must be an integer greater than 0. When channel_multiplier is 1, applies a different filter to each input channel producing one output channel per input channel. Numbers larger than 1 cause multiple different filters to be applied to each input channel, with their outputs being concatenated together, producing channel_multiplier * input_channels output channels.

  • stride (Union[int, Sequence[int]]) – Sequence of strides (of length num_spatial_dims), or an integer. stride will be expanded to define stride in all dimensions.

  • rate (Union[int, Sequence[int]]) – Sequence of dilation rates (of length num_spatial_dims), or integer that is used to define dilation rate in all dimensions. 1 corresponds to standard ND convolution, rate > 1 corresponds to dilated convolution.

  • padding (str) – Padding to apply to the input. This can be either "SAME" or "VALID".

  • with_bias (bool) – Whether to include bias parameters. Default True.

  • w_init (Optional[Initializer]) – Optional initializer for the weights. By default the weights are initialized from a truncated normal distribution with a standard deviation of 1 / sqrt(input_feature_size), which is commonly used when the inputs are zero centered (see https://arxiv.org/abs/1502.03167v3).

  • b_init (Optional[Initializer]) – Optional initializer for the bias. By default the bias is initialized to zero.

  • data_format (str) – The data format of the input.

  • name (Optional[str]) – Name of the module.

__call__(*args, **kwargs)[source]

Call self as a function.

Normalization modules

LayerNorm

class sonnet.LayerNorm(axis, create_scale, create_offset, eps=1e-05, scale_init=None, offset_init=None, data_format='channels_last', name=None)[source]

Normalizes inputs along the given axes.

This is a generic implementation of normalization along specific axes of the input. InstanceNorm is a subclass of this module that normalizes over the spatial dimensions.

It transforms the input x into:

\[\text{outputs} = \text{scale} \dfrac{x - \mu}{\sigma + \epsilon} + \text{offset}\]

Where \(\mu\) and \(\sigma\) are respectively the mean and standard deviation of x.
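
For example, a minimal sketch with trainable scale and offset, normalizing over the channel dimension:

>>> layer_norm = snt.LayerNorm(axis=-1, create_scale=True, create_offset=True)
>>> x = tf.random.normal([8, 10])
>>> y = layer_norm(x)
>>> assert y.shape == x.shape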

There are many different variations for how users want to manage scale and offset if they require them at all. These are:

  • No scale/offset in which case create_* should be set to False and scale/offset aren’t passed when the module is called.

  • Trainable scale/offset in which case create_* should be set to True and again scale/offset aren’t passed when the module is called. In this case this module creates and owns the scale/offset variables.

  • Externally generated scale/offset, such as for conditional normalization, in which case create_* should be set to False and then the values fed in at call time.

scale

If create_scale=True, a trainable tf.Variable holding the current scale.

offset

If create_offset=True, a trainable tf.Variable holding the current offset.

__init__(axis, create_scale, create_offset, eps=1e-05, scale_init=None, offset_init=None, data_format='channels_last', name=None)[source]

Constructs a LayerNorm module.

Parameters
  • axis (Union[int, slice, Sequence[int]]) – An int, slice or sequence of ints representing the axes which should be normalized across. Typical usages are: 1 or -1 for normalization over just the channels, and slice(1, None) or slice(2, None) for normalization over the spatial and channel dimensions whilst avoiding the batch and/or time dimensions.

  • create_scale (bool) – bool representing whether to create a trainable scale per channel applied after the normalization.

  • create_offset (bool) – bool representing whether to create a trainable offset per channel applied after normalization and scaling.

  • eps (Union[float, floating, ndarray, Tensor, Variable]) – Small epsilon to avoid division by zero variance. Defaults to 1e-5.

  • scale_init (Optional[Initializer]) – Optional initializer for the scale variable. Can only be set if create_scale=True. By default scale is initialized to 1.

  • offset_init (Optional[Initializer]) – Optional initializer for the offset variable. Can only be set if create_offset=True. By default offset is initialized to 0.

  • data_format (str) – The data format of the input. Can be either channels_first, channels_last, N...C or NC.... By default it is channels_last.

  • name (Optional[str]) – Name of the module.

__call__(*args, **kwargs)[source]

Call self as a function.

InstanceNorm

class sonnet.InstanceNorm(create_scale, create_offset, eps=1e-05, scale_init=None, offset_init=None, data_format='channels_last', name=None)[source]

Normalizes inputs along the spatial dimensions.

See LayerNorm for more details.

scale

If create_scale=True, a trainable tf.Variable holding the current scale.

offset

If create_offset=True, a trainable tf.Variable holding the current offset.

__init__(create_scale, create_offset, eps=1e-05, scale_init=None, offset_init=None, data_format='channels_last', name=None)[source]

Constructs an InstanceNorm module.

This method creates a module which normalizes over the spatial dimensions.

Parameters
  • create_scale (bool) – bool representing whether to create a trainable scale per channel applied after the normalization.

  • create_offset (bool) – bool representing whether to create a trainable offset per channel applied after normalization and scaling.

  • eps (Union[float, floating, ndarray, Tensor, Variable]) – Small epsilon to avoid division by zero variance. Defaults to 1e-5.

  • scale_init (Optional[Initializer]) – Optional initializer for the scale variable. Can only be set if create_scale=True. By default scale is initialized to 1.

  • offset_init (Optional[Initializer]) – Optional initializer for the offset variable. Can only be set if create_offset=True. By default offset is initialized to 0.

  • data_format (str) – The data format of the input. Can be either channels_first, channels_last, N...C or NC.... By default it is channels_last.

  • name (Optional[str]) – Name of the module.

BaseBatchNorm

class sonnet.BaseBatchNorm(create_scale, create_offset, moving_mean, moving_variance, eps=1e-05, scale_init=None, offset_init=None, data_format='channels_last', name=None)[source]

Batch normalization module.

This implements normalization across the batch and spatial dimensions. It maintains moving averages of the mean and variance which can be used to normalize at test time. The constructor is generic and requires the user to pass in objects to compute these.

At training time we use the batch statistics for that batch and these are then used to update the moving averages.

At test time we can either use the moving averages of the batch statistics (test_local_stats=False) or we can use the local statistics (test_local_stats=True).

It transforms the input x into:

\[\text{outputs} = \text{scale} \dfrac{x - \mu}{\sigma + \epsilon} + \text{offset}\]

Where \(\mu\) and \(\sigma\) are respectively the mean and standard deviation of x. Note that this module automatically uses the fused batch norm op if the data format is NHWC.

There are many different variations for how users want to manage scale and offset if they require them at all. These are:

  • No scale/offset in which case create_* should be set to False and scale/offset aren’t passed when the module is called.

  • Trainable scale/offset in which case create_* should be set to True and again scale/offset aren’t passed when the module is called. In this case this module creates and owns the scale/offset variables.

  • Externally generated scale/offset, such as for conditional normalization, in which case create_* should be set to False and then the values fed in at call time.

scale

If create_scale, a trainable tf.Variable holding the current scale after the module is connected for the first time.

offset

If create_offset, a trainable tf.Variable holding the current offset after the module is connected for the first time.

__init__(create_scale, create_offset, moving_mean, moving_variance, eps=1e-05, scale_init=None, offset_init=None, data_format='channels_last', name=None)[source]

Constructs a BaseBatchNorm module.

Parameters
  • create_scale (bool) – whether to create a trainable scale per channel applied after the normalization.

  • create_offset (bool) – whether to create a trainable offset per channel applied after normalization and scaling.

  • moving_mean (Metric) – A metric which tracks the moving average of the mean which can be used to normalize at test time.

  • moving_variance (Metric) – A metric which tracks the moving average of the variance which can be used to normalize at test time.

  • eps (Union[float, floating, ndarray, Tensor, Variable]) – Small epsilon to avoid division by zero variance. Defaults to 1e-5.

  • scale_init (Optional[Initializer]) – Optional initializer for the scale variable. Can only be set if create_scale=True. By default scale is initialized to 1.

  • offset_init (Optional[Initializer]) – Optional initializer for the offset variable. Can only be set if create_offset=True. By default offset is initialized to 0.

  • data_format (str) – The data format of the input. Can be either channels_first, channels_last, N...C or NC.... By default it is channels_last.

  • name (Optional[str]) – Name of the module.

__call__(inputs, is_training, test_local_stats=False, scale=None, offset=None)[source]

Returns normalized inputs.

Parameters
  • inputs (Tensor) – An n-D tensor of the data_format specified above on which the transformation is performed.

  • is_training (Union[bool, bool_, ndarray, Tensor, Variable]) – Whether the module should be connected in training mode, meaning the moving averages are updated.

  • test_local_stats (Union[bool, bool_, ndarray, Tensor, Variable]) – Whether local batch statistics should be used when is_training=False. If not, moving averages are used. By default False.

  • scale (Optional[Tensor]) – A tensor up to n-D. The shape of this tensor must be broadcastable to the shape of inputs. This is the scale applied to the normalized inputs. This cannot be passed in if the module was constructed with create_scale=True.

  • offset (Optional[Tensor]) – A tensor up to n-D. The shape of this tensor must be broadcastable to the shape of inputs. This is the offset applied to the normalized inputs. This cannot be passed in if the module was constructed with create_offset=True.

Returns

An n-d tensor of the same shape as inputs that has been normalized.

BatchNorm

class sonnet.BatchNorm(create_scale, create_offset, decay_rate=0.999, eps=1e-05, scale_init=None, offset_init=None, data_format='channels_last', name=None)[source]

Batch normalization with exponential moving average for test statistics.

See BaseBatchNorm for details.
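
For example, a minimal sketch (the is_training flag selects between batch statistics and the moving averages):

>>> batch_norm = snt.BatchNorm(create_scale=True, create_offset=True)
>>> x = tf.random.normal([8, 32, 32, 3])
>>> y_train = batch_norm(x, is_training=True)   # Batch stats; moving averages updated.
>>> y_test = batch_norm(x, is_training=False)   # Moving averages used.
>>> assert y_train.shape == y_test.shape == x.shape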

scale

If create_scale=True, a trainable tf.Variable holding the current scale after the module is connected for the first time.

offset

If create_offset, a trainable tf.Variable holding the current offset after the module is connected for the first time.

__init__(create_scale, create_offset, decay_rate=0.999, eps=1e-05, scale_init=None, offset_init=None, data_format='channels_last', name=None)[source]

Constructs a BatchNorm module.

Parameters
  • create_scale (bool) – whether to create a trainable scale per channel applied after the normalization.

  • create_offset (bool) – whether to create a trainable offset per channel applied after normalization and scaling.

  • decay_rate (float) – Decay rate of the exponential moving averages of the mean and variance.

  • eps (Union[float, floating, ndarray, Tensor, Variable]) – Small epsilon to avoid division by zero variance. Defaults to 1e-5.

  • scale_init (Optional[Initializer]) – Optional initializer for the scale variable. Can only be set if create_scale=True. By default scale is initialized to 1.

  • offset_init (Optional[Initializer]) – Optional initializer for the offset variable. Can only be set if create_offset=True. By default offset is initialized to 0.

  • data_format (str) – The data format of the input. Can be either channels_first, channels_last, N...C or NC.... By default it is channels_last.

  • name (Optional[str]) – Name of the module.

CrossReplicaBatchNorm

class sonnet.distribute.CrossReplicaBatchNorm(create_scale, create_offset, moving_mean, moving_variance, eps=1e-05, scale_init=None, offset_init=None, data_format='channels_last', name=None)[source]

Cross-replica Batch Normalization.

At every step the full batch is used to calculate the batch statistics, even within a distributed setting (note that this only works with snt.(Tpu)Replicator).

See BaseBatchNorm for details.

scale

If create_scale=True, a trainable tf.Variable holding the current scale after the module is connected for the first time.

offset

If create_offset, a trainable tf.Variable holding the current offset after the module is connected for the first time.

__init__(create_scale, create_offset, moving_mean, moving_variance, eps=1e-05, scale_init=None, offset_init=None, data_format='channels_last', name=None)[source]

Constructs a CrossReplicaBatchNorm module.

Parameters
  • create_scale (bool) – whether to create a trainable scale per channel applied after the normalization.

  • create_offset (bool) – whether to create a trainable offset per channel applied after normalization and scaling.

  • moving_mean (Metric) – An object which keeps track of the moving average of the mean which can be used to normalize at test time. This object must have an update method which takes a value and updates the internal state and a value property which returns the current mean.

  • moving_variance (Metric) – An object which keeps track of the moving average of the variance which can be used to normalize at test time. This object must have an update method which takes a value and updates the internal state and a value property which returns the current variance.

  • eps (Union[float, floating, ndarray, Tensor, Variable]) – Small epsilon to avoid division by zero variance. Defaults to 1e-5.

  • scale_init (Optional[Initializer]) – Optional initializer for the scale variable. Can only be set if create_scale=True. By default scale is initialized to 1.

  • offset_init (Optional[Initializer]) – Optional initializer for the offset variable. Can only be set if create_offset=True. By default offset is initialized to 0.

  • data_format (str) – The data format of the input. Can be either channels_first, channels_last, N...C or NC.... By default it is channels_last.

  • name (Optional[str]) – Name of the module.

GroupNorm

class sonnet.GroupNorm(groups, axis=slice(1, None, None), create_scale=True, create_offset=True, eps=1e-05, scale_init=None, offset_init=None, data_format='channels_last', name=None)[source]

Group normalization module.

This applies group normalization to the inputs. This involves splitting the channels into groups before calculating the mean and variance. The default behaviour is to compute the mean and variance over the spatial dimensions and the grouped channels. The mean and variance will never be computed over the created groups axis.

It transforms the input x into:

\[\text{outputs} = \text{scale} \dfrac{x - \mu}{\sigma + \epsilon} + \text{offset}\]

Where \(\mu\) and \(\sigma\) are respectively the mean and standard deviation of x.
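
For example, a minimal sketch splitting 16 channels into 4 groups:

>>> group_norm = snt.GroupNorm(groups=4)
>>> x = tf.random.normal([8, 32, 32, 16])
>>> y = group_norm(x)
>>> assert y.shape == x.shape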

There are many different variations for how users want to manage scale and offset if they require them at all. These are:

  • No scale/offset in which case create_* should be set to False and scale/offset aren’t passed when the module is called.

  • Trainable scale/offset in which case create_* should be set to True and again scale/offset aren’t passed when the module is called. In this case this module creates and owns the scale/offset variables.

  • Externally generated scale/offset, such as for conditional normalization, in which case create_* should be set to False and then the values fed in at call time.

scale

If create_scale=True, a trainable tf.Variable holding the current scale.

offset

If create_offset=True, a trainable tf.Variable holding the current offset.

__init__(groups, axis=slice(1, None, None), create_scale=True, create_offset=True, eps=1e-05, scale_init=None, offset_init=None, data_format='channels_last', name=None)[source]

Constructs a GroupNorm module.

Parameters
  • groups (int) – number of groups to divide the channels by. The number of channels must be divisible by this.

  • axis (Union[int, slice, Sequence[int]]) – int, slice or sequence of ints representing the axes which should be normalized across. By default this is all but the first dimension. For time series data use slice(2, None) to average over all but the batch and time dimensions.

  • create_scale (bool) – whether to create a trainable scale per channel applied after the normalization.

  • create_offset (bool) – whether to create a trainable offset per channel applied after normalization and scaling.

  • eps (Union[float, floating, ndarray, Tensor, Variable]) – Small epsilon to add to the variance to avoid division by zero. Defaults to 1e-5.

  • scale_init (Optional[Initializer]) – Optional initializer for the scale variable. Can only be set if create_scale=True. By default scale is initialized to 1.

  • offset_init (Optional[Initializer]) – Optional initializer for the offset variable. Can only be set if create_offset=True. By default offset is initialized to 0.

  • data_format (str) – The data format of the input. Can be either channels_first, channels_last, N...C or NC.... By default it is channels_last.

  • name (Optional[str]) – Name of the module.

__call__(*args, **kwargs)[source]

Call self as a function.

Recurrent modules

RNNCore

class sonnet.RNNCore(name=None)[source]

Base class for Recurrent Neural Network cores.

This class defines the basic functionality that every core should implement: initial_state(), used to construct an example of the core state; and __call__() which applies the core parameterized by a previous state to an input.

Cores are typically used with dynamic_unroll() and static_unroll() to iteratively construct an output sequence from the given input sequence.

__call__(*args, **kwargs)[source]

Call self as a function.

UnrolledRNN

class sonnet.UnrolledRNN(name=None)[source]

Base class for unrolled Recurrent Neural Networks.

This class is a generalization of RNNCore which operates on an input sequence as opposed to a single time step.

__call__(*args, **kwargs)[source]

Call self as a function.

TrainableState

class sonnet.TrainableState(initial_values, mask=None, name=None)[source]

Trainable state for an RNNCore.

The state can be constructed manually from a nest of initial values:

>>> state = snt.TrainableState((tf.zeros([16]), tf.zeros([16])))

or automatically for a given RNNCore:

>>> core = snt.LSTM(hidden_size=16)
>>> state = snt.TrainableState.for_core(core)

classmethod for_core(core, mask=None, name=None)[source]

Constructs a trainable state for a given RNNCore.

Parameters
  • core (RNNCore) – An RNNCore to construct the state for.

  • mask (Union[ndarray, Tensor, Variable, Iterable[TensorNest], Mapping[str, TensorNest], None]) – Optional boolean mask of the same structure as the initial state of core specifying which components should be trainable. If not given, the whole state is considered trainable.

  • name (Optional[str]) – Name of the module.

Returns

A TrainableState.

__init__(initial_values, mask=None, name=None)[source]

Constructs a trainable state from initial values.

Parameters
  • initial_values (Union[ndarray, Tensor, Variable, Iterable[TensorNest], Mapping[str, TensorNest]]) – Arbitrarily nested initial values for the state.

  • mask (Union[ndarray, Tensor, Variable, Iterable[TensorNest], Mapping[str, TensorNest], None]) – Optional boolean mask of the same structure as initial_values specifying which components should be trainable. If not given, the whole state is considered trainable.

  • name (Optional[str]) – Name of the module.

__call__(*args, **kwargs)[source]

Call self as a function.

dynamic_unroll

sonnet.dynamic_unroll(core, input_sequence, initial_state, sequence_length=None, parallel_iterations=1, swap_memory=False)[source]

Performs a dynamic unroll of an RNN.

>>> core = snt.LSTM(hidden_size=16)
>>> batch_size = 3
>>> input_sequence = tf.random.uniform([1, batch_size, 2])
>>> output_sequence, final_state = snt.dynamic_unroll(
...     core,
...     input_sequence,
...     core.initial_state(batch_size))

An unroll corresponds to calling the core on each element of the input sequence in a loop, carrying the state through:

state = initial_state
for t in range(len(input_sequence)):
   outputs, state = core(input_sequence[t], state)

A dynamic unroll preserves the loop structure when executed within tf.function. See static_unroll() for an unroll function which replaces a loop with its body repeated multiple times.

Parameters
  • core – An RNNCore to unroll.

  • input_sequence – An arbitrarily nested structure of tensors of shape [T, B, ...] where T is the number of time steps, and B is the batch size.

  • initial_state – initial state of the given core.

  • sequence_length – An optional tensor of shape [B] specifying the lengths of sequences within the (padded) batch.

  • parallel_iterations – An optional int specifying the number of iterations to run in parallel. Those operations which do not have any temporal dependency and can be run in parallel, will be. This parameter trades off time for space. Values >> 1 use more memory but take less time, while smaller values use less memory but computations take longer. Defaults to 1.

  • swap_memory – Transparently swap the tensors produced in forward inference but needed for back prop from GPU to CPU. This allows training RNNs which would typically not fit on a single GPU, with very minimal (or no) performance penalty. Defaults to False.

Returns

  • output_sequence - An arbitrarily nested structure of tensors of shape [T, B, ...]. Dimensions following the batch size could be different from those of input_sequence.

  • final_state - Core state at time step T.

Return type

A tuple with two elements

Raises

ValueError – If input_sequence is empty.

static_unroll

sonnet.static_unroll(core, input_sequence, initial_state, sequence_length=None)[source]

Performs a static unroll of an RNN.

>>> core = snt.LSTM(hidden_size=16)
>>> batch_size = 3
>>> input_sequence = tf.random.uniform([1, batch_size, 2])
>>> output_sequence, final_state = snt.static_unroll(
...     core,
...     input_sequence,
...     core.initial_state(batch_size))

An unroll corresponds to calling the core on each element of the input sequence in a loop, carrying the state through:

state = initial_state
for t in range(len(input_sequence)):
   outputs, state = core(input_sequence[t], state)

A static unroll replaces a loop with its body repeated multiple times when executed inside tf.function:

state = initial_state
outputs0, state = core(input_sequence[0], state)
outputs1, state = core(input_sequence[1], state)
outputs2, state = core(input_sequence[2], state)
...

See dynamic_unroll() for a loop-preserving unroll function.

Parameters
  • core (RNNCore) – An RNNCore to unroll.

  • input_sequence (Union[ndarray, Tensor, Variable, Iterable[TensorNest], Mapping[str, TensorNest]]) – An arbitrarily nested structure of tensors of shape [T, B, ...] where T is the number of time steps, and B is the batch size.

  • initial_state (Union[ndarray, Tensor, Variable, Iterable[TensorNest], Mapping[str, TensorNest]]) – An initial state of the given core.

  • sequence_length (Union[int, integer, ndarray, Tensor, Variable, None]) – An optional tensor of shape [B] specifying the lengths of sequences within the (padded) batch.

Returns

  • output_sequence - An arbitrarily nested structure of tensors of shape [T, B, ...]. Dimensions following the batch size could be different from those of input_sequence.

  • final_state - Core state at time step T.

Return type

A tuple with two elements

Raises

ValueError – If input_sequence is empty or its leading dimension is not known statically.

VanillaRNN

class sonnet.VanillaRNN(hidden_size, activation=<function tanh>, w_i_init=None, w_h_init=None, b_init=None, dtype=tf.float32, name=None)[source]

Basic fully-connected RNN core.

Given \(x_t\) and the previous hidden state \(h_{t-1}\) the core computes

\[h_t = \operatorname{activation}(w_i x_t + w_h h_{t-1} + b)\]

input_to_hidden

Input-to-hidden weights \(w_i\), a tensor of shape [input_size, hidden_size].

hidden_to_hidden

Hidden-to-hidden weights \(w_h\), a tensor of shape [hidden_size, hidden_size].

b

Bias, a tensor of shape [hidden_size].
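
Example Usage (a minimal sketch of a single step):

>>> core = snt.VanillaRNN(hidden_size=16)
>>> x = tf.random.normal([8, 128])
>>> state = core.initial_state(batch_size=8)
>>> outputs, next_state = core(x, state)
>>> assert outputs.shape == [8, 16]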

__init__(hidden_size, activation=<function tanh>, w_i_init=None, w_h_init=None, b_init=None, dtype=tf.float32, name=None)[source]

Constructs a vanilla RNN core.

Parameters
  • hidden_size (int) – Hidden layer size.

  • activation (Callable[[Union[ndarray, Tensor, Variable]], Union[ndarray, Tensor, Variable]]) – Activation function to use. Defaults to tf.tanh.

  • w_i_init (Optional[Initializer]) – Optional initializer for the input-to-hidden weights. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(input_size).

  • w_h_init (Optional[Initializer]) – Optional initializer for the hidden-to-hidden weights. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(hidden_size).

  • b_init (Optional[Initializer]) – Optional initializer for the bias. Defaults to Zeros.

  • dtype (DType) – Optional tf.DType of the core’s variables. Defaults to tf.float32.

  • name (Optional[str]) – Name of the module.

__call__(*args, **kwargs)[source]

Call self as a function.

DeepRNN

class sonnet.DeepRNN(layers, name=None)[source]

Linear chain of RNNCores or callables.

The core takes (input, prev_state) as input and passes the input through each internal module in the order they were presented, using elements from prev_state as necessary for internal RNN cores.

>>> deep_rnn = snt.DeepRNN([
...     snt.LSTM(hidden_size=16),
...     snt.LSTM(hidden_size=16),
... ])

Note that the state of a DeepRNN is always a tuple, which will contain the same number of elements as there are internal RNN cores. If no internal modules are RNN cores, the state of the DeepRNN as a whole is an empty tuple.

Wrapping non-recurrent modules into a DeepRNN can be useful to produce something API compatible with a “real” recurrent module, simplifying code that handles the cores.
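
For example, a minimal sketch mixing an RNN core with a plain callable:

>>> deep_rnn = snt.DeepRNN([snt.LSTM(hidden_size=16), tf.nn.relu])
>>> x = tf.random.normal([8, 32])
>>> state = deep_rnn.initial_state(batch_size=8)
>>> outputs, next_state = deep_rnn(x, state)
>>> assert outputs.shape == [8, 16]
>>> assert len(next_state) == 1  # One entry per internal RNN core.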

__init__(layers, name=None)[source]

Constructs a DeepRNN.

Parameters
  • layers – A list of RNNCores and/or callables to call in order.

  • name – Name of the module.

sonnet.deep_rnn_with_skip_connections(layers, concat_final_output=True, name='deep_rnn_with_skip_connections')[source]

Constructs a DeepRNN with skip connections.

Skip connections alter the dependency structure within a DeepRNN. Specifically, input to the i-th layer (i > 0) is given by a concatenation of the core’s inputs and the outputs of the (i-1)-th layer.

outputs0, ... = layers[0](inputs, ...)
outputs1, ... = layers[1](tf.concat([inputs, outputs0], axis=1), ...)
outputs2, ... = layers[2](tf.concat([inputs, outputs1], axis=1), ...)
...

This allows the layers to learn decoupled features.

Parameters
  • layers (Sequence[RNNCore]) – A list of RNNCores.

  • concat_final_output (bool) – If enabled (default), the output of the core is a concatenation of the outputs of all intermediate layers; otherwise, only the output of the final layer, i.e. that of layers[-1], is returned.

  • name (str) – Name of the module.

Return type

RNNCore

Returns

A DeepRNN with skip connections.

Raises

ValueError – If any of the layers is not an RNNCore.

sonnet.deep_rnn_with_residual_connections(layers, name='deep_rnn_with_residual_connections')[source]

Constructs a DeepRNN with residual connections.

Residual connections alter the dependency structure in a DeepRNN. Specifically, the input to the i-th intermediate layer is a sum of the original core’s inputs and the outputs of all the preceding layers (<i).

outputs0, ... = layers[0](inputs, ...)
outputs0 += inputs
outputs1, ... = layers[1](outputs0, ...)
outputs1 += outputs0
outputs2, ... = layers[2](outputs1, ...)
outputs2 += outputs1
...

This allows the layers to learn specialized features that compose incrementally.

Parameters
  • layers (Sequence[RNNCore]) – A list of RNNCores.

  • name (str) – Name of the module.

Return type

RNNCore

Returns

A DeepRNN with residual connections.

Raises

ValueError – If any of the layers is not an RNNCore.

LSTM

class sonnet.LSTM(hidden_size, projection_size=None, projection_init=None, w_i_init=None, w_h_init=None, b_init=None, forget_bias=1.0, dtype=tf.float32, name=None)[source]

Long short-term memory (LSTM) RNN core.

The implementation is based on [2]. Given \(x_t\) and the previous state \((h_{t-1}, c_{t-1})\) the core computes

\[\begin{array}{ll} i_t = \sigma(W_{ii} x_t + W_{hi} h_{t-1} + b_i) \\ f_t = \sigma(W_{if} x_t + W_{hf} h_{t-1} + b_f) \\ g_t = \tanh(W_{ig} x_t + W_{hg} h_{t-1} + b_g) \\ o_t = \sigma(W_{io} x_t + W_{ho} h_{t-1} + b_o) \\ c_t = f_t c_{t-1} + i_t g_t \\ h_t = o_t \tanh(c_t) \end{array}\]

Where \(i_t\), \(f_t\), \(o_t\) are input, forget and output gate activations, and \(g_t\) is a vector of cell updates.

Notes

Forget gate initialization:

Following [3] we add a constant forget_bias (defaults to 1.0) to \(b_f\) after initialization in order to reduce the scale of forgetting in the beginning of the training.

Recurrent projections:

The hidden state can be projected (via the projection_size parameter) to reduce the number of parameters and speed up computation. For more details see [4].

input_to_hidden

Input-to-hidden weights \(W_{ii}\), \(W_{if}\), \(W_{ig}\) and \(W_{io}\) concatenated into a tensor of shape [input_size, 4 * hidden_size].

hidden_to_hidden

Hidden-to-hidden weights \(W_{hi}\), \(W_{hf}\), \(W_{hg}\) and \(W_{ho}\) concatenated into a tensor of shape [hidden_size, 4 * hidden_size].

b

Biases \(b_i\), \(b_f\), \(b_g\) and \(b_o\) concatenated into a tensor of shape [4 * hidden_size].

__init__(hidden_size, projection_size=None, projection_init=None, w_i_init=None, w_h_init=None, b_init=None, forget_bias=1.0, dtype=tf.float32, name=None)[source]

Constructs an LSTM.

Parameters
  • hidden_size (int) – Hidden layer size.

  • projection_size (Optional[int]) – Optional int; if set, then the hidden state is projected to this size via a trainable projection matrix.

  • projection_init (Optional[Initializer]) – Optional initializer for the projection matrix. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(hidden_size).

  • w_i_init (Optional[Initializer]) – Optional initializer for the input-to-hidden weights. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(input_size).

  • w_h_init (Optional[Initializer]) – Optional initializer for the hidden-to-hidden weights. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(hidden_size).

  • b_init (Optional[Initializer]) – Optional initializer for the biases. Defaults to Zeros.

  • forget_bias (Union[float, floating, ndarray, Tensor, Variable]) – Optional float to add to the bias of the forget gate after initialization.

  • dtype (DType) – Optional tf.DType of the core’s variables. Defaults to tf.float32.

  • name (Optional[str]) – Name of the module.

__call__(*args, **kwargs)[source]

Call self as a function.

class sonnet.LSTMState(hidden, cell)

lstm_with_recurrent_dropout

sonnet.lstm_with_recurrent_dropout(hidden_size, dropout=0.5, seed=None, **kwargs)[source]

Constructs an LSTM with recurrent dropout.

The implementation is based on [5]. Dropout is applied to the previous hidden state \(h_{t-1}\) during the computation of gate activations:

\[\begin{array}{ll} i_t = \sigma(W_{ii} x_t + W_{hi} d(h_{t-1}) + b_i) \\ f_t = \sigma(W_{if} x_t + W_{hf} d(h_{t-1}) + b_f) \\ g_t = \tanh(W_{ig} x_t + W_{hg} d(h_{t-1}) + b_g) \\ o_t = \sigma(W_{io} x_t + W_{ho} d(h_{t-1}) + b_o) \end{array}\]

Parameters
  • hidden_size – Hidden layer size.

  • dropout – Dropout probability.

  • seed – Optional int; seed passed to tf.nn.dropout.

  • **kwargs – Optional keyword arguments to pass to the LSTM constructor.

Returns

  • train_lstm - An LSTM with recurrent dropout enabled for training.

  • test_lstm - The same as train_lstm but without recurrent dropout.

Return type

A tuple of two elements

Raises

ValueError – If dropout is not in [0, 1).

UnrolledLSTM

class sonnet.UnrolledLSTM(hidden_size, w_i_init=None, w_h_init=None, b_init=None, forget_bias=1.0, dtype=tf.float32, name=None)[source]

Unrolled long short-term memory (LSTM).

The implementation uses efficient device-specialized ops, e.g. CuDNN-RNN on a CUDA-enabled GPU, and can be an order of magnitude faster than snt.*_unroll with an LSTM core.
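
Example Usage (a minimal sketch; the whole [T, B, ...] sequence is processed in one call):

>>> unrolled_lstm = snt.UnrolledLSTM(hidden_size=16)
>>> input_sequence = tf.random.normal([20, 8, 32])  # [T, B, ...]
>>> output_sequence, final_state = unrolled_lstm(
...     input_sequence, unrolled_lstm.initial_state(batch_size=8))
>>> assert output_sequence.shape == [20, 8, 16]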

__init__(hidden_size, w_i_init=None, w_h_init=None, b_init=None, forget_bias=1.0, dtype=tf.float32, name=None)[source]

Construct an unrolled LSTM.

Parameters
  • hidden_size – Hidden layer size.

  • w_i_init (Optional[Initializer]) – Optional initializer for the input-to-hidden weights. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(input_size).

  • w_h_init (Optional[Initializer]) – Optional initializer for the hidden-to-hidden weights. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(hidden_size).

  • b_init (Optional[Initializer]) – Optional initializer for the biases. Defaults to Zeros.

  • forget_bias (Union[float, floating, ndarray, Tensor, Variable]) – Optional float to add to the bias of the forget gate after initialization.

  • dtype (DType) – Optional tf.DType of the core’s variables. Defaults to tf.float32.

  • name (Optional[str]) – Name of the module.

__call__(*args, **kwargs)[source]

Call self as a function.

Conv1DLSTM

class sonnet.Conv1DLSTM(input_shape, output_channels, kernel_shape, data_format='NWC', w_i_init=None, w_h_init=None, b_init=None, forget_bias=1.0, dtype=tf.float32, name=None)[source]

1-D convolutional LSTM.

The implementation is based on [6]. Given \(x_t\) and the previous state \((h_{t-1}, c_{t-1})\) the core computes

\[\begin{array}{ll} i_t = \sigma(W_{ii} * x_t + W_{hi} * h_{t-1} + b_i) \\ f_t = \sigma(W_{if} * x_t + W_{hf} * h_{t-1} + b_f) \\ g_t = \tanh(W_{ig} * x_t + W_{hg} * h_{t-1} + b_g) \\ o_t = \sigma(W_{io} * x_t + W_{ho} * h_{t-1} + b_o) \\ c_t = f_t c_{t-1} + i_t g_t \\ h_t = o_t \tanh(c_t) \end{array}\]

where \(*\) denotes the convolution operator; \(i_t\), \(f_t\), \(o_t\) are input, forget and output gate activations, and \(g_t\) is a vector of cell updates.

Notes

Forget gate initialization:

Following [3] we add a constant forget_bias (defaults to 1.0) to \(b_f\) after initialization in order to reduce the scale of forgetting in the beginning of the training.

input_to_hidden

Input-to-hidden convolution weights \(W_{ii}\), \(W_{if}\), \(W_{ig}\) and \(W_{io}\) concatenated into a single tensor of shape [kernel_shape*, input_channels, 4 * output_channels] where kernel_shape is repeated once.

hidden_to_hidden

Hidden-to-hidden convolution weights \(W_{hi}\), \(W_{hf}\), \(W_{hg}\) and \(W_{ho}\) concatenated into a single tensor of shape [kernel_shape*, input_channels, 4 * output_channels] where kernel_shape is repeated once.

b

Biases \(b_i\), \(b_f\), \(b_g\) and \(b_o\) concatenated into a tensor of shape [4 * output_channels].

__init__(input_shape, output_channels, kernel_shape, data_format='NWC', w_i_init=None, w_h_init=None, b_init=None, forget_bias=1.0, dtype=tf.float32, name=None)[source]

Constructs a 1-D convolutional LSTM.

Parameters
  • input_shape (Union[int, Sequence[int], TensorShape]) – Shape of the inputs excluding batch size.

  • output_channels (int) – Number of output channels.

  • kernel_shape (Union[int, Sequence[int]]) – Sequence of kernel sizes (of length 1), or an int. kernel_shape will be expanded to define a kernel size in all dimensions.

  • data_format – The data format of the input.

  • w_i_init (Optional[Initializer]) – Optional initializer for the input-to-hidden convolution weights. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(kernel_shape * input_channels).

  • w_h_init (Optional[Initializer]) – Optional initializer for the hidden-to-hidden convolution weights. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(kernel_shape * input_channels).

  • b_init (Optional[Initializer]) – Optional initializer for the biases. Defaults to Zeros.

  • forget_bias (Union[float, floating, ndarray, Tensor, Variable]) – Optional float to add to the bias of the forget gate after initialization.

  • dtype (DType) – Optional tf.DType of the core’s variables. Defaults to tf.float32.

  • name (Optional[str]) – Name of the module.

Conv2DLSTM

class sonnet.Conv2DLSTM(input_shape, output_channels, kernel_shape, data_format='NHWC', w_i_init=None, w_h_init=None, b_init=None, forget_bias=1.0, dtype=tf.float32, name=None)[source]

2-D convolutional LSTM.

The implementation is based on [6]. Given \(x_t\) and the previous state \((h_{t-1}, c_{t-1})\) the core computes

\[\begin{array}{ll} i_t = \sigma(W_{ii} * x_t + W_{hi} * h_{t-1} + b_i) \\ f_t = \sigma(W_{if} * x_t + W_{hf} * h_{t-1} + b_f) \\ g_t = \tanh(W_{ig} * x_t + W_{hg} * h_{t-1} + b_g) \\ o_t = \sigma(W_{io} * x_t + W_{ho} * h_{t-1} + b_o) \\ c_t = f_t c_{t-1} + i_t g_t \\ h_t = o_t \tanh(c_t) \end{array}\]

where \(*\) denotes the convolution operator; \(i_t\), \(f_t\), \(o_t\) are input, forget and output gate activations, and \(g_t\) is a vector of cell updates.

Notes

Forget gate initialization:

Following [3], we add a constant forget_bias (defaults to 1.0) to \(b_f\) after initialization in order to reduce the scale of forgetting at the beginning of training.

input_to_hidden

Input-to-hidden convolution weights \(W_{ii}\), \(W_{if}\), \(W_{ig}\) and \(W_{io}\) concatenated into a single tensor of shape [kernel_shape*, input_channels, 4 * output_channels] where kernel_shape is repeated 2 times.

hidden_to_hidden

Hidden-to-hidden convolution weights \(W_{hi}\), \(W_{hf}\), \(W_{hg}\) and \(W_{ho}\) concatenated into a single tensor of shape [kernel_shape*, input_channels, 4 * output_channels] where kernel_shape is repeated 2 times.

b

Biases \(b_i\), \(b_f\), \(b_g\) and \(b_o\) concatenated into a tensor of shape [4 * output_channels].

__init__(input_shape, output_channels, kernel_shape, data_format='NHWC', w_i_init=None, w_h_init=None, b_init=None, forget_bias=1.0, dtype=tf.float32, name=None)[source]

Constructs a 2-D convolutional LSTM.

Parameters
  • input_shape (Union[int, Sequence[int], TensorShape]) – Shape of the inputs excluding batch size.

  • output_channels (int) – Number of output channels.

  • kernel_shape (Union[int, Sequence[int]]) – Sequence of kernel sizes (of length 2), or an int. kernel_shape will be expanded to define a kernel size in all dimensions.

  • data_format (str) – The data format of the input.

  • w_i_init (Optional[Initializer]) – Optional initializer for the input-to-hidden convolution weights. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(kernel_shape**2 * input_channels).

  • w_h_init (Optional[Initializer]) – Optional initializer for the hidden-to-hidden convolution weights. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(kernel_shape**2 * input_channels).

  • b_init (Optional[Initializer]) – Optional initializer for the biases. Defaults to Zeros.

  • forget_bias (Union[float, floating, ndarray, Tensor, Variable]) – Optional float to add to the bias of the forget gate after initialization.

  • dtype (DType) – Optional tf.DType of the core’s variables. Defaults to tf.float32.

  • name (Optional[str]) – Name of the module.
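As an illustrative sketch (the shapes below are assumptions, and the output spatial size assumes SAME padding), a Conv2DLSTM can be stepped over a batch of frames like any other RNN core:

core = snt.Conv2DLSTM(input_shape=[64, 64, 3], output_channels=16, kernel_shape=3)
frame = tf.random.normal([8, 64, 64, 3])   # [batch, H, W, C] for data_format='NHWC'.
state = core.initial_state(batch_size=8)
h, state = core(frame, state)              # h: [8, 64, 64, 16] if padding is SAME.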

Conv3DLSTM

class sonnet.Conv3DLSTM(input_shape, output_channels, kernel_shape, data_format='NDHWC', w_i_init=None, w_h_init=None, b_init=None, forget_bias=1.0, dtype=tf.float32, name=None)[source]

3-D convolutional LSTM.

The implementation is based on [6]. Given \(x_t\) and the previous state \((h_{t-1}, c_{t-1})\) the core computes

\[\begin{array}{ll} i_t = \sigma(W_{ii} * x_t + W_{hi} * h_{t-1} + b_i) \\ f_t = \sigma(W_{if} * x_t + W_{hf} * h_{t-1} + b_f) \\ g_t = \tanh(W_{ig} * x_t + W_{hg} * h_{t-1} + b_g) \\ o_t = \sigma(W_{io} * x_t + W_{ho} * h_{t-1} + b_o) \\ c_t = f_t c_{t-1} + i_t g_t \\ h_t = o_t \tanh(c_t) \end{array}\]

where \(*\) denotes the convolution operator; \(i_t\), \(f_t\), \(o_t\) are input, forget and output gate activations, and \(g_t\) is a vector of cell updates.

Notes

Forget gate initialization:

Following [3], we add a constant forget_bias (defaults to 1.0) to \(b_f\) after initialization in order to reduce the scale of forgetting at the beginning of training.

input_to_hidden

Input-to-hidden convolution weights \(W_{ii}\), \(W_{if}\), \(W_{ig}\) and \(W_{io}\) concatenated into a single tensor of shape [kernel_shape*, input_channels, 4 * output_channels] where kernel_shape is repeated 3 times.

hidden_to_hidden

Hidden-to-hidden convolution weights \(W_{hi}\), \(W_{hf}\), \(W_{hg}\) and \(W_{ho}\) concatenated into a single tensor of shape [kernel_shape*, input_channels, 4 * output_channels] where kernel_shape is repeated 3 times.

b

Biases \(b_i\), \(b_f\), \(b_g\) and \(b_o\) concatenated into a tensor of shape [4 * output_channels].

__init__(input_shape, output_channels, kernel_shape, data_format='NDHWC', w_i_init=None, w_h_init=None, b_init=None, forget_bias=1.0, dtype=tf.float32, name=None)[source]

Constructs a 3-D convolutional LSTM.

Parameters
  • input_shape (Union[int, Sequence[int], TensorShape]) – Shape of the inputs excluding batch size.

  • output_channels (int) – Number of output channels.

  • kernel_shape (Union[int, Sequence[int]]) – Sequence of kernel sizes (of length 3), or an int. kernel_shape will be expanded to define a kernel size in all dimensions.

  • data_format (str) – The data format of the input.

  • w_i_init (Optional[Initializer]) – Optional initializer for the input-to-hidden convolution weights. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(kernel_shape**3 * input_channels).

  • w_h_init (Optional[Initializer]) – Optional initializer for the hidden-to-hidden convolution weights. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(kernel_shape**3 * input_channels).

  • b_init (Optional[Initializer]) – Optional initializer for the biases. Defaults to Zeros.

  • forget_bias (Union[float, floating, ndarray, Tensor, Variable]) – Optional float to add to the bias of the forget gate after initialization.

  • dtype (DType) – Optional tf.DType of the core’s variables. Defaults to tf.float32.

  • name (Optional[str]) – Name of the module.

GRU

class sonnet.GRU(hidden_size, w_i_init=None, w_h_init=None, b_init=None, dtype=tf.float32, name=None)[source]

Gated recurrent unit (GRU) RNN core.

The implementation is based on [7]. Given \(x_t\) and the previous state \(h_{t-1}\) the core computes

\[\begin{array}{ll} z_t &= \sigma(W_{iz} x_t + W_{hz} h_{t-1} + b_z) \\ r_t &= \sigma(W_{ir} x_t + W_{hr} h_{t-1} + b_r) \\ a_t &= \tanh(W_{ia} x_t + W_{ha} (r_t h_{t-1}) + b_a) \\ h_t &= (1 - z_t) h_{t-1} + z_t a_t \end{array}\]

where \(z_t\) and \(r_t\) are the update and reset gates, respectively.

input_to_hidden

Input-to-hidden weights \(W_{iz}\), \(W_{ir}\) and \(W_{ia}\) concatenated into a tensor of shape [input_size, 3 * hidden_size].

hidden_to_hidden

Hidden-to-hidden weights \(W_{hz}\), \(W_{hr}\) and \(W_{ha}\) concatenated into a tensor of shape [hidden_size, 3 * hidden_size].

b

Biases \(b_z\), \(b_r\) and \(b_a\) concatenated into a tensor of shape [3 * hidden_size].

__init__(hidden_size, w_i_init=None, w_h_init=None, b_init=None, dtype=tf.float32, name=None)[source]

Constructs a GRU.

Parameters
  • hidden_size – Hidden layer size.

  • w_i_init (Optional[Initializer]) – Optional initializer for the input-to-hidden weights. Defaults to Glorot uniform initializer.

  • w_h_init (Optional[Initializer]) – Optional initializer for the hidden-to-hidden weights. Defaults to Glorot uniform initializer.

  • b_init (Optional[Initializer]) – Optional initializer for the biases. Defaults to Zeros.

  • dtype (DType) – Optional tf.DType of the core’s variables. Defaults to tf.float32.

  • name (Optional[str]) – Name of the module.

__call__(*args, **kwargs)[source]

Call self as a function.
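A short usage sketch (hedged; for a GRU the emitted output equals the new hidden state):

gru = snt.GRU(hidden_size=32)
x = tf.random.normal([8, 4])               # [batch_size, input_size].
state = gru.initial_state(batch_size=8)
out, next_state = gru(x, state)            # out and next_state are both h_t.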

Batch

reshape

sonnet.reshape(inputs, output_shape, preserve_dims=1, name=None)[source]

A shortcut for applying Reshape to the inputs.

Return type

Tensor

Reshape

class sonnet.Reshape(output_shape, preserve_dims=1, name=None)[source]

Reshapes input Tensor, preserving the batch dimension.

For example, given an input tensor with shape [B, H, W, C, D]:

>>> B, H, W, C, D = range(1, 6)
>>> x = tf.ones([B, H, W, C, D])

The default behavior when output_shape is (-1, D) is to flatten all dimensions between B and D:

>>> mod = snt.Reshape(output_shape=(-1, D))
>>> assert mod(x).shape == [B, H*W*C, D]

You can change the number of preserved leading dimensions via preserve_dims:

>>> mod = snt.Reshape(output_shape=(-1, D), preserve_dims=2)
>>> assert mod(x).shape == [B, H, W*C, D]

>>> mod = snt.Reshape(output_shape=(-1, D), preserve_dims=3)
>>> assert mod(x).shape == [B, H, W, C, D]

>>> mod = snt.Reshape(output_shape=(-1, D), preserve_dims=4)
>>> assert mod(x).shape == [B, H, W, C, 1, D]
__init__(output_shape, preserve_dims=1, name=None)[source]

Constructs a Reshape module.

Parameters
  • output_shape (Union[int, Sequence[int], TensorShape]) – Shape to reshape the input tensor to while preserving its first preserve_dims dimensions. When the special value -1 appears in output_shape the corresponding size is automatically inferred. Note that -1 can only appear once in output_shape. To flatten all non-batch dimensions use Flatten.

  • preserve_dims (int) – Number of leading dimensions that will not be reshaped.

  • name (Optional[str]) – Name of the module.

Raises

ValueError – If preserve_dims is not positive.

__call__(*args, **kwargs)[source]

Call self as a function.

reversed(name=None)[source]

Returns inverse batch reshape.

Return type

Reshape

flatten

sonnet.flatten(inputs, name='flatten')[source]

A shortcut for applying Flatten to the inputs.

Return type

Tensor

Flatten

class sonnet.Flatten(preserve_dims=1, name=None)[source]

Flattens the input Tensor, preserving the batch dimension(s).

Flatten reshapes input tensors to combine all trailing dimensions apart from the first. Additional leading dimensions can be preserved by setting the preserve_dims parameter.

See Reshape for more details.

__init__(preserve_dims=1, name=None)[source]

Constructs a Flatten module.

Parameters
  • preserve_dims (int) – Number of leading dimensions that will not be reshaped.

  • name (Optional[str]) – Name of the module.

BatchApply

class sonnet.BatchApply(module, num_dims=2, name=None)[source]

Merges a number of leading dimensions of an input tensor to manipulate it.

Merges a number of leading dimensions of a tensor into a single dimension, connects the provided module, then splits the leading dimension of the result to match the input.

Input tensors whose rank is smaller than the number of dimensions to collapse (e.g. all scalar values, which are tensors of rank 0), are passed unaltered to the provided module.

This is useful for applying some module to each timestep of a Time x Batch x N tensor. If a module is hard coded to only support 2D (Batch x N) then the full 3D Tensor cannot be provided. BatchApply will ‘merge’ the first two dimensions of the sequence tensor by reshaping to a (Time * Batch) x N Tensor, and then the internal module can be applied. The result of that operation is reshaped such that its first dimensions are split to match the leading dimensions of the input.
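For example, the following hedged sketch applies a snt.Linear, which expects [batch, features] inputs, to every timestep of a [Time, Batch, N] tensor:

linear = snt.Linear(16)
batch_apply = snt.BatchApply(linear)       # Merges the first num_dims=2 dimensions.
x = tf.random.normal([10, 8, 4])           # [Time, Batch, N].
y = batch_apply(x)                         # [10, 8, 16]; internally reshaped via [80, 4].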

__init__(module, num_dims=2, name=None)[source]

Initializes the current module with the given name.

Subclasses should call this constructor before creating other modules or variables such that those modules are named correctly.

Parameters

name (Optional[str]) – An optional string name for the class. Must be a valid Python identifier. If name is not provided then the class name for the current instance is converted to lower_snake_case and used instead.

__call__(*args, **kwargs)[source]

Call self as a function.

Embedding modules

Embed

class sonnet.Embed(vocab_size=None, embed_dim=None, existing_vocab=None, densify_gradients=False, initializer=None, trainable=True, dtype=tf.float32, name=None)[source]

Module for embedding tokens in a low-dimensional space.

__init__(vocab_size=None, embed_dim=None, existing_vocab=None, densify_gradients=False, initializer=None, trainable=True, dtype=tf.float32, name=None)[source]

Constructs an Embed module.

Parameters
  • vocab_size (Optional[int]) – Number of unique tokens to embed. If not provided, an existing vocabulary matrix from which vocab_size can be inferred must be provided as existing_vocab.

  • embed_dim (Optional[int]) – Number of dimensions to assign to each embedding. If not specified, we use 6 * sqrt(sqrt(vocab_size)). If an existing vocabulary matrix initializes the module, this should not be provided as it will be inferred.

  • existing_vocab (Union[ndarray, Tensor, Variable, None]) – A [vocab_size, embed_dim] vocabulary matrix. Will be converted to a tf.float32 tensor. If provided, neither vocab_size nor embed_dim should be provided, as they are inferred.

  • densify_gradients (bool) – If True, we convert the embedding gradient from a tf.IndexedSlices to a regular tensor before sending it back to the parameter server. This avoids excess computation on the parameter server. Use this option for moderately sized embeddings, e.g. a vocabulary size on the order of up to thousands. For larger embeddings, e.g. a vocabulary size on the order of tens or hundreds of thousands, set this to False.

  • initializer (Optional[Initializer]) – Initializer for the embeddings. By default, embeddings are initialized via a truncated normal distribution.

  • trainable (bool) – if True, the embeddings will be updated during training. If False, they are fixed to their initial values.

  • dtype (DType) – The dtype to use for the embedding. Defaults to float32.

  • name (Optional[str]) – Name for this module.

Raises

ValueError – if neither vocab_size nor existing_vocab is provided, or if existing_vocab is provided along with vocab_size, embed_dim or initializer (as these should be inferred).

__call__(*args, **kwargs)[source]

Call self as a function.
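A short usage sketch (the shapes are illustrative):

embed = snt.Embed(vocab_size=1000, embed_dim=16)
token_ids = tf.constant([[1, 2, 3], [4, 5, 6]])   # [batch, sequence] integer ids.
embeddings = embed(token_ids)                     # [2, 3, 16].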

Optimizers

Sonnet optimizers built for TensorFlow 2.

All optimizers implement the snt.Optimizer interface.
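As a hedged sketch of that interface, a typical training step computes gradients with tf.GradientTape and passes them to the optimizer's apply method together with the matching parameters (the model, loss and data below are placeholders):

model = snt.nets.MLP([64, 10])
optimizer = snt.optimizers.SGD(learning_rate=0.1)

def train_step(x, y):
  with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(x) - y))
  params = model.trainable_variables
  grads = tape.gradient(loss, params)
  optimizer.apply(grads, params)   # Updates the parameters in place.
  return loss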

Adam

class sonnet.optimizers.Adam(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, name=None)[source]

Adaptive Moment Estimation (Adam) optimizer.

Adam is an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. See [8] for more details.

Note: default parameter values have been taken from the paper.

learning_rate

Step size (alpha in the paper).

beta1

Exponential decay rate for first moment estimate.

beta2

Exponential decay rate for second moment estimate.

epsilon

Small value to avoid zero denominator.

step

Step count.

m

Biased first moment estimate (a list with one value per parameter).

v

Biased second raw moment estimate (a list with one value per parameter).

__init__(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, name=None)[source]

Constructs an Adam module.

Parameters
  • learning_rate (Union[float, floating, ndarray, Tensor, Variable]) – Step size (alpha in the paper).

  • beta1 (Union[float, floating, ndarray, Tensor, Variable]) – Exponential decay rate for first moment estimate.

  • beta2 (Union[float, floating, ndarray, Tensor, Variable]) – Exponential decay rate for second moment estimate.

  • epsilon (Union[float, floating, ndarray, Tensor, Variable]) – Small value to avoid zero denominator.

  • name (Optional[str]) – Name of the module.

Momentum

class sonnet.optimizers.Momentum(learning_rate, momentum, use_nesterov=False, name=None)[source]

SGD with Momentum module.

learning_rate

Learning rate.

momentum

Momentum scalar.

use_nesterov

True if using Nesterov momentum.

accumulated_momentum

Accumulated momentum for each parameter.

__init__(learning_rate, momentum, use_nesterov=False, name=None)[source]

Constructs a Momentum module.

Parameters
  • learning_rate (Union[float, floating, ndarray, Tensor, Variable]) – Learning rate.

  • momentum (Union[float, floating, ndarray, Tensor, Variable]) – Momentum scalar.

  • use_nesterov (bool) – Whether to use Nesterov momentum.

  • name (Optional[str]) – Name of the module.

RMSProp

class sonnet.optimizers.RMSProp(learning_rate, decay=0.9, momentum=0.0, epsilon=1e-10, centered=False, name=None)[source]

RMSProp module.

See: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

Maintains a moving (discounted) average of the square of updates and divides each update by the root of this average.

ms <- decay * ms + (1 - decay) * update^2
mom <- momentum * mom + learning_rate * update / sqrt(ms + epsilon)
parameter <- parameter - mom

This implementation of RMSprop uses plain momentum, not Nesterov momentum.

The centered version additionally maintains a moving average of the gradients, and uses that average to estimate the variance:

mg <- decay * mg + (1 - decay) * update
ms <- decay * ms + (1 - decay) * update^2
mom <- momentum * mom + learning_rate * update / sqrt(ms - mg^2 + epsilon)
parameter <- parameter - mom
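The same centered update, written as a plain-Python sketch for a single parameter (illustrative only; the module applies the equivalent TensorFlow ops per variable):

def centered_rmsprop_step(parameter, update, mg, ms, mom,
                          learning_rate, decay, momentum, epsilon):
  mg = decay * mg + (1 - decay) * update
  ms = decay * ms + (1 - decay) * update ** 2
  mom = momentum * mom + learning_rate * update / (ms - mg ** 2 + epsilon) ** 0.5
  parameter = parameter - mom
  return parameter, mg, ms, mom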

learning_rate

Learning rate.

decay

Learning rate decay over each update.

momentum

Momentum scalar.

epsilon

Small value to avoid zero denominator.

centered

True if centered.

mom

Accumulated mom for each parameter.

ms

Accumulated ms for each parameter.

mg

Accumulated mg for each parameter.

__init__(learning_rate, decay=0.9, momentum=0.0, epsilon=1e-10, centered=False, name=None)[source]

Constructs an RMSProp module.

Parameters
  • learning_rate (Union[float, floating, ndarray, Tensor, Variable]) – Learning rate.

  • decay (Union[float, floating, ndarray, Tensor, Variable]) – Learning rate decay over each update.

  • momentum (Union[float, floating, ndarray, Tensor, Variable]) – Momentum scalar.

  • epsilon (Union[float, floating, ndarray, Tensor, Variable]) – Small value to avoid zero denominator.

  • centered (bool) – If True, gradients are normalized by the estimated variance of the gradient; if False, by the uncentered second moment. Setting this to True may help with training, but is slightly more expensive in terms of computation and memory. Defaults to False.

  • name (Optional[str]) – Name for this module.

SGD

class sonnet.optimizers.SGD(learning_rate, name=None)[source]

Stochastic Gradient Descent (SGD) module.

learning_rate

Learning rate.

__init__(learning_rate, name=None)[source]

Constructs an SGD module.

Parameters
  • learning_rate (Union[float, floating, ndarray, Tensor, Variable]) – Learning rate.

  • name (Optional[str]) – Name of the module.

Initializers

Initializers.

Initializer

class sonnet.initializers.Initializer[source]

Initializer base class; all initializers must implement a __call__ method.

abstract __call__(shape, dtype)[source]

Returns a tensor of the given shape and dtype.

Return type

Tensor

Constant

class sonnet.initializers.Constant(value)[source]

Initializer that generates tensors initialized to the given value.

__init__(value)[source]
__call__(shape, dtype)[source]

Returns a tensor of the given shape and dtype.

Return type

Tensor

Identity

class sonnet.initializers.Identity(gain=1.0)[source]

Initializer that generates the identity matrix.

Constructs a 2D identity matrix or batches of these.

__init__(gain=1.0)[source]

Constructs an identity initializer.

Parameters

gain (float) – Multiplicative factor to apply to the identity matrix.

__call__(shape, dtype)[source]

Returns a tensor of the given shape and dtype.

Return type

Tensor

Ones

class sonnet.initializers.Ones[source]

Initializer that generates tensors initialized to 1.

__call__(shape, dtype)[source]

Returns a tensor of the given shape and dtype.

Return type

Tensor

Orthogonal

class sonnet.initializers.Orthogonal(gain=1.0, seed=None)[source]

Initializer that generates an orthogonal matrix.

NOTE: Does not support 1D tensors.

The implementation is based on [9].

If the shape of the tensor to initialize is two-dimensional, it is initialized with an orthogonal matrix obtained from the QR decomposition of a matrix of random numbers drawn from a normal distribution. If the matrix has fewer rows than columns then the output will have orthogonal rows. Otherwise, the output will have orthogonal columns.

If the shape of the tensor to initialize is more than two-dimensional, a matrix of shape (shape[0] * ... * shape[n - 2], shape[n - 1]) is initialized, where n is the length of the shape vector. The matrix is subsequently reshaped to give a tensor of the desired shape.

__init__(gain=1.0, seed=None)[source]

Constructs an orthogonal initializer.

Parameters
  • gain (float) – Multiplicative factor to apply to the orthogonal matrix

  • seed (Optional[int]) – int, the seed used in the generation of random numbers.

__call__(shape, dtype)[source]

Returns a tensor of the given shape and dtype.

Return type

Tensor

RandomNormal

class sonnet.initializers.RandomNormal(mean=0.0, stddev=1.0, seed=None)[source]

Initializer that generates tensors with a normal distribution.

__init__(mean=0.0, stddev=1.0, seed=None)[source]

Constructs a random normal initializer.

Parameters
  • mean (Union[float, floating, ndarray, Tensor, Variable]) – A python scalar or a scalar tensor. Mean of the random values to generate.

  • stddev (Union[float, floating, ndarray, Tensor, Variable]) – A python scalar or a scalar tensor. Standard deviation of the random values to generate.

  • seed (Optional[int]) – The seed used in the generation of random numbers.

__call__(shape, dtype)[source]

Returns a tensor of the given shape and dtype.

Return type

Tensor

RandomUniform

class sonnet.initializers.RandomUniform(minval=0, maxval=1, seed=None)[source]

Initializer that generates tensors with a uniform distribution.

The generated values follow a uniform distribution in the range [minval, maxval).

__init__(minval=0, maxval=1, seed=None)[source]

Constructs a random uniform initializer.

Parameters
  • minval (Union[float, floating, ndarray, Tensor, Variable]) – A python scalar or a scalar tensor. Lower bound of the range of random values to generate. Defaults to 0.

  • maxval (Union[float, floating, ndarray, Tensor, Variable]) – A python scalar or a scalar tensor. Upper bound of the range of random values to generate. Defaults to 1.

  • seed (Optional[int]) – The seed used in the generation of random numbers.

__call__(shape, dtype)[source]

Returns a tensor of the given shape and dtype.

TruncatedNormal

class sonnet.initializers.TruncatedNormal(mean=0.0, stddev=1.0, seed=None)[source]

Initializer that generates a truncated normal distribution.

These values follow a normal distribution except that values more than two standard deviations from the mean are discarded and re-drawn. This is the recommended initializer for neural network weights and filters.

__init__(mean=0.0, stddev=1.0, seed=None)[source]

Constructs a truncated normal initializer.

Parameters
  • mean (Union[float, floating, ndarray, Tensor, Variable]) – A python scalar or a scalar tensor. Mean of the random values to generate.

  • stddev (Union[float, floating, ndarray, Tensor, Variable]) – A python scalar or a scalar tensor. Standard deviation of the random values to generate.

  • seed (Optional[int]) – The seed used in the generation of random numbers.

__call__(shape, dtype)[source]

Returns a tensor of the given shape and dtype.

VarianceScaling

class sonnet.initializers.VarianceScaling(scale=1.0, mode='fan_in', distribution='truncated_normal', seed=None)[source]

Initializer capable of adapting its scale to the shape of weights tensors.

With distribution="truncated_normal" or "normal", samples are drawn from a distribution with a mean of zero and a standard deviation (after truncation, if used) stddev = sqrt(scale / n) where n is:

  • Number of input units in the weight tensor, if mode = fan_in.

  • Number of output units, if mode = fan_out.

  • Average of the numbers of input and output units, if mode = fan_avg.

Note that for a transposed convolution the chosen mode should be reversed: use fan_out for the number of input units and fan_in for the number of output units.

With distribution="uniform", samples are drawn from a uniform distribution within [-limit, limit], where limit = sqrt(3 * scale / n).

The variance scaling initializer can be configured to generate other standard initializers using the scale, mode and distribution arguments. Here are some example configurations:

  • glorot_uniform – scale=1.0, mode=fan_avg, distribution=uniform

  • glorot_normal – scale=1.0, mode=fan_avg, distribution=truncated_normal

  • lecun_uniform – scale=1.0, mode=fan_in, distribution=uniform

  • lecun_normal – scale=1.0, mode=fan_in, distribution=truncated_normal

  • he_uniform – scale=2.0, mode=fan_in, distribution=uniform

  • he_normal – scale=2.0, mode=fan_in, distribution=truncated_normal
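For instance, the Glorot and He style initializers from the list above can be recovered directly (a brief sketch):

glorot_uniform = snt.initializers.VarianceScaling(
    scale=1.0, mode='fan_avg', distribution='uniform')
he_normal = snt.initializers.VarianceScaling(
    scale=2.0, mode='fan_in', distribution='truncated_normal')
w = glorot_uniform([64, 32], tf.float32)   # Sampled from U(-limit, limit).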

__init__(scale=1.0, mode='fan_in', distribution='truncated_normal', seed=None)[source]

Constructs a variance scaling initializer.

Parameters
  • scale (float) – Scaling factor (positive float).

  • mode (str) – One of fan_in, fan_out, fan_avg.

  • distribution (str) – Random distribution to use. One of truncated_normal, untruncated_normal and uniform.

  • seed (Optional[int]) – int, the seed used in the generation of random numbers.

Raises

ValueError – In case of an invalid value for the scale, mode or distribution arguments.

__call__(shape, dtype)[source]

Returns a tensor of the given shape and dtype.

Return type

Tensor

Zeros

class sonnet.initializers.Zeros[source]

Initializer that generates tensors initialized to 0.

__call__(shape, dtype)[source]

Returns a tensor of the given shape and dtype.

Return type

Tensor

Regularizers

Regularizers.

Regularizer

class sonnet.regularizers.Regularizer[source]

Base regularizer class.

abstract __call__(tensors)[source]

Apply a regularizer.

Parameters

tensors (Sequence[Tensor]) – A sequence of tensors to regularize.

Return type

Tensor

Returns

Combined regularization loss for the given tensors.

L1

class sonnet.regularizers.L1(scale)[source]

L1 regularizer.

>>> reg = snt.regularizers.L1(0.01)
>>> reg([tf.constant([1.0, 2.0, 3.0])])
<tf.Tensor: ...>
__init__(scale)[source]

Create an L1 regularizer.

Parameters

scale (Union[float, floating, ndarray, Tensor, Variable]) – A non-negative regularization factor.

Raises

ValueError – if scale is <0.

__call__(tensors)[source]

See base class.

Return type

Tensor

L2

class sonnet.regularizers.L2(scale)[source]

L2 regularizer.

>>> reg = snt.regularizers.L2(0.01)
>>> reg([tf.constant([1.0, 2.0, 3.0])])
<tf.Tensor: ...>
__init__(scale)[source]

Create an L2 regularizer.

Parameters

scale (Union[float, floating, ndarray, Tensor, Variable]) – float or scalar tensor; regularization factor.

Raises

ValueError – if scale is <0.

__call__(tensors)[source]

See base class.

Return type

Tensor

OffDiagonalOrthogonal

class sonnet.regularizers.OffDiagonalOrthogonal(scale)[source]

Off-diagonal orthogonal regularizer.

The implementation is based on https://arxiv.org/abs/1809.11096. Given a rank N >= 2 tensor, the regularizer computes the sum of off-diagonal entries of (W^T W)^2 where

  • W is the input tensor reshaped to a matrix by collapsing the leading N - 1 axes into the first one;

  • ^2 is the element-wise square.

NB: that is equivalent to computing the off-diagonal sum of (W^T W - I)^2, as off-diagonal entries of I are 0.

For example,

>>> t = tf.reshape(tf.range(8, dtype=tf.float32), [2, 2, 2])
>>> reg = snt.regularizers.OffDiagonalOrthogonal(0.01)
>>> reg([t])
<tf.Tensor: ...>

corresponds to computing

>>> w = tf.reshape(t, [-1, 2])
>>> w_gram_sq = tf.square(tf.matmul(tf.transpose(w), w))
>>> 0.01 * (tf.reduce_sum(w_gram_sq) - tf.linalg.trace(w_gram_sq))
<tf.Tensor: ...>
__init__(scale)[source]

Create an off-diagonal orthogonal regularizer.

Parameters

scale (Union[float, floating, ndarray, Tensor, Variable]) – A non-negative regularization factor.

Raises

ValueError – if scale is <0.

__call__(tensors)[source]

See base class.

Return type

Tensor

Paddings

Paddings.

causal

sonnet.pad.causal(effective_kernel_size)[source]

Pre-padding such that output has no dependence on the future.

create

sonnet.pad.create(padding, kernel, rate, n, channel_index)[source]

Generates the padding required for a given padding algorithm.

Parameters
  • padding (Union[Callable[[int], Sequence[int]], Sequence[Callable[[int], Sequence[int]]]]) – callable or list of callables of length n. The callables take an integer representing the effective kernel size (kernel size when the rate is 1) and return a list of two integers representing the padding before and padding after for that dimension.

  • kernel (Union[int, Sequence[int]]) – int or list of ints of length n. The size of the kernel for each dimension. If it is an int it will be replicated for the non channel and batch dimensions.

  • rate (Union[int, Sequence[int]]) – int or list of ints of length n. The dilation rate for each dimension. If it is an int it will be replicated for the non channel and batch dimensions.

  • n (int) – the number of spatial dimensions.

  • channel_index (int) – the channel position of the input to which the padding will be applied.

Returns

A list of length n+2 containing the padding for each element. These are of the form [pad_before, pad_after].
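For example, the following hedged call builds causal padding for a 1-D convolution with an NWC input (the expected result assumes an effective kernel size of 3):

paddings = snt.pad.create(
    padding=snt.pad.causal, kernel=3, rate=1, n=1, channel_index=2)
# Expected: [[0, 0], [2, 0], [0, 0]] - no padding for the batch and channel
# dimensions, two elements of pre-padding for the single spatial dimension.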

full

sonnet.pad.full(effective_kernel_size)[source]

Maximal padding whilst not convolving over just padded elements.

reverse_causal

sonnet.pad.reverse_causal(effective_kernel_size)[source]

Post-padding such that output has no dependence on the past.

same

sonnet.pad.same(effective_kernel_size)[source]

Pads such that the output size matches input size for stride=1.

valid

sonnet.pad.valid(effective_kernel_size)[source]

No padding.

Distribution

Utilities for using Sonnet with TensorFlow Distribution Strategy.

Replicator

class sonnet.distribute.Replicator(devices=None, cross_device_ops=None)[source]

Replicates input, parameters and compute over multiple accelerators.

Replicator is a TensorFlow “Distribution Strategy” implementing the programming model described in the TF-Replicator paper [10] and TensorFlow RFC [11]. Replicator enables data-parallel training across multiple accelerators on a single machine; it supports eager execution and tf.function.

To get started create a Replicator instance:

>>> replicator = snt.distribute.Replicator()

Replicator provides a scope inside which any new tf.Variables will be replicated across all local devices:

>>> with replicator.scope():
...    mod = snt.Linear(32)

Additionally replicator provides utility functions to apply a module in parallel on multiple devices. First we need to define some computation that runs on each GPU. The “replica context” object provides us a way to communicate between replicas (e.g. to perform an all_reduce):

>>> def forward():
...   # Compute a random output on each GPU.
...   x = tf.random.normal([8, 28 * 28])
...   y = mod(x)
...   # Synchronize the value of `y` between all GPUs.
...   ctx = tf.distribute.get_replica_context()
...   y = ctx.all_reduce("mean", y)
...   return y

Finally we use the run API to apply forward in parallel on all accelerator devices:

>>> per_replica_y = replicator.run(forward)
scope()[source]

Context manager to make the strategy current and distribute variables.

This method returns a context manager, and is used as follows:

>>> strategy = tf.distribute.MirroredStrategy(["GPU:0", "GPU:1"])
>>> # Variable created inside scope:
>>> with strategy.scope():
...   mirrored_variable = tf.Variable(1.)
>>> mirrored_variable
MirroredVariable:{
  0: <tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>,
  1: <tf.Variable 'Variable/replica_1:0' shape=() dtype=float32, numpy=1.0>
}
>>> # Variable created outside scope:
>>> regular_variable = tf.Variable(1.)
>>> regular_variable
<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>

_What happens when Strategy.scope is entered?_

  • strategy is installed in the global context as the “current” strategy. Inside this scope, tf.distribute.get_strategy() will now return this strategy. Outside this scope, it returns the default no-op strategy.

  • Entering the scope also enters the “cross-replica context”. See tf.distribute.StrategyExtended for an explanation on cross-replica and replica contexts.

  • Variable creation inside scope is intercepted by the strategy. Each strategy defines how it wants to affect the variable creation. Sync strategies like MirroredStrategy, TPUStrategy and MultiWorkerMirroredStrategy create variables replicated on each replica, whereas ParameterServerStrategy creates variables on the parameter servers. This is done using a custom tf.variable_creator_scope.

  • In some strategies, a default device scope may also be entered: in MultiWorkerMirroredStrategy, a default device scope of “/CPU:0” is entered on each worker.

Note: Entering a scope does not automatically distribute a computation, except in the case of high-level training frameworks like Keras model.fit. If you’re not using model.fit, you need to use the strategy.run API to explicitly distribute that computation. See an example in the [custom training loop tutorial](https://www.tensorflow.org/tutorials/distribute/custom_training).

_What should be in scope and what should be outside?_

There are a number of requirements on what needs to happen inside the scope. However, in places where we have information about which strategy is in use, we often enter the scope for the user, so they don’t have to do it explicitly (i.e. calling those either inside or outside the scope is OK).

  • Anything that creates variables that should be distributed variables must be called in a strategy.scope. This can be accomplished either by directly calling the variable creating function within the scope context, or by relying on another API like strategy.run or keras.Model.fit to automatically enter it for you. Any variable that is created outside scope will not be distributed and may have performance implications. Some common objects that create variables in TF are Models, Optimizers, Metrics. Such objects should always be initialized in the scope, and any functions that may lazily create variables (e.g., Model.__call__(), tracing a tf.function, etc.) should similarly be called within scope. Another source of variable creation can be a checkpoint restore - when variables are created lazily. Note that any variable created inside a strategy captures the strategy information. So reading and writing to these variables outside the strategy.scope can also work seamlessly, without the user having to enter the scope.

  • Some strategy APIs (such as strategy.run and strategy.reduce) which require to be in a strategy’s scope, enter the scope automatically, which means when using those APIs you don’t need to explicitly enter the scope yourself.

  • When a tf.keras.Model is created inside a strategy.scope, the Model object captures the scope information. When high level training framework methods such as model.compile, model.fit, etc. are then called, the captured scope will be automatically entered, and the associated strategy will be used to distribute the training etc. See a detailed example in [distributed keras tutorial](https://www.tensorflow.org/tutorials/distribute/keras). WARNING: Simply calling model(..) does not automatically enter the captured scope – only high level training framework APIs support this behavior: model.compile, model.fit, model.evaluate, model.predict and model.save can all be called inside or outside the scope.

  • The following can be either inside or outside the scope:
    • Creating the input datasets

    • Defining `tf.function`s that represent your training step

    • Saving APIs such as tf.saved_model.save. Loading creates variables, so that should go inside the scope if you want to train the model in a distributed way.

    • Checkpoint saving. As mentioned above - checkpoint.restore may sometimes need to be inside scope if it creates variables.

Returns

A context manager.

TpuReplicator

class sonnet.distribute.TpuReplicator(tpu_cluster_resolver=None, experimental_device_assignment=None, experimental_spmd_xla_partitioning=False)[source]

Replicates input, parameters and compute over multiple TPUs.

TpuReplicator is a TensorFlow “Distribution Strategy” implementing the programming model described in the TF-Replicator paper [10] and TensorFlow RFC [11]. TpuReplicator enables data-parallel training across multiple TPUs on one or more machines; it supports tf.function.

To get started create a TpuReplicator instance:

>>> replicator = snt.distribute.TpuReplicator()

This provides a scope inside which any new tf.Variables will be replicated across all TPU cores:

>>> with replicator.scope():
...    mod = snt.Linear(32)

Additionally replicator provides utility functions to apply a module in parallel on multiple devices. First we need to define some computation that runs on each TPU. The “replica context” object provides us a way to communicate between replicas:

>>> def forward():
...   # Compute a random output on each TPU core.
...   x = tf.random.normal([8, 28 * 28])
...   y = mod(x)
...   # Synchronize the value of `y` between all replicas.
...   ctx = tf.distribute.get_replica_context()
...   y = ctx.all_reduce("mean", y)
...   return y

Finally we use the run API to apply forward in parallel on all TPU devices. This must be run as part of a tf.function since TpuReplicator uses XLA to compile and replicate our function to run in parallel over all TPU cores:

>>> @tf.function(autograph=False)
... def all_forward():
...   return replicator.run(forward)
>>> per_replica_y = all_forward()
scope()[source]

Context manager to make the strategy current and distribute variables.

This method returns a context manager, and is used as follows:

>>> strategy = tf.distribute.MirroredStrategy(["GPU:0", "GPU:1"])
>>> # Variable created inside scope:
>>> with strategy.scope():
...   mirrored_variable = tf.Variable(1.)
>>> mirrored_variable
MirroredVariable:{
  0: <tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>,
  1: <tf.Variable 'Variable/replica_1:0' shape=() dtype=float32, numpy=1.0>
}
>>> # Variable created outside scope:
>>> regular_variable = tf.Variable(1.)
>>> regular_variable
<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>

_What happens when Strategy.scope is entered?_

  • strategy is installed in the global context as the “current” strategy. Inside this scope, tf.distribute.get_strategy() will now return this strategy. Outside this scope, it returns the default no-op strategy.

  • Entering the scope also enters the “cross-replica context”. See tf.distribute.StrategyExtended for an explanation on cross-replica and replica contexts.

  • Variable creation inside scope is intercepted by the strategy. Each strategy defines how it wants to affect the variable creation. Sync strategies like MirroredStrategy, TPUStrategy and MultiWorkerMirroredStrategy create variables replicated on each replica, whereas ParameterServerStrategy creates variables on the parameter servers. This is done using a custom tf.variable_creator_scope.

  • In some strategies, a default device scope may also be entered: in MultiWorkerMirroredStrategy, a default device scope of “/CPU:0” is entered on each worker.

Note: Entering a scope does not automatically distribute a computation, except in the case of high-level training frameworks like Keras model.fit. If you’re not using model.fit, you need to use the strategy.run API to explicitly distribute that computation. See an example in the [custom training loop tutorial](https://www.tensorflow.org/tutorials/distribute/custom_training).

_What should be in scope and what should be outside?_

There are a number of requirements on what needs to happen inside the scope. However, in places where we have information about which strategy is in use, we often enter the scope for the user, so they don’t have to do it explicitly (i.e. calling those either inside or outside the scope is OK).

  • Anything that creates variables that should be distributed variables must be called in a strategy.scope. This can be accomplished either by directly calling the variable creating function within the scope context, or by relying on another API like strategy.run or keras.Model.fit to automatically enter it for you. Any variable that is created outside scope will not be distributed and may have performance implications. Some common objects that create variables in TF are Models, Optimizers, Metrics. Such objects should always be initialized in the scope, and any functions that may lazily create variables (e.g., Model.__call__(), tracing a tf.function, etc.) should similarly be called within scope. Another source of variable creation can be a checkpoint restore - when variables are created lazily. Note that any variable created inside a strategy captures the strategy information. So reading and writing to these variables outside the strategy.scope can also work seamlessly, without the user having to enter the scope.

  • Some strategy APIs (such as strategy.run and strategy.reduce) which require to be in a strategy’s scope, enter the scope automatically, which means when using those APIs you don’t need to explicitly enter the scope yourself.

  • When a tf.keras.Model is created inside a strategy.scope, the Model object captures the scope information. When high level training framework methods such as model.compile, model.fit, etc. are then called, the captured scope will be automatically entered, and the associated strategy will be used to distribute the training etc. See a detailed example in [distributed keras tutorial](https://www.tensorflow.org/tutorials/distribute/keras). WARNING: Simply calling model(..) does not automatically enter the captured scope – only high level training framework APIs support this behavior: model.compile, model.fit, model.evaluate, model.predict and model.save can all be called inside or outside the scope.

  • The following can be either inside or outside the scope:
    • Creating the input datasets

    • Defining `tf.function`s that represent your training step

    • Saving APIs such as tf.saved_model.save. Loading creates variables, so that should go inside the scope if you want to train the model in a distributed way.

    • Checkpoint saving. As mentioned above - checkpoint.restore may sometimes need to be inside scope if it creates variables.

Returns

A context manager.

Metrics

Metric

class sonnet.Metric(name=None)[source]

Metric base class.

property value

Returns the current value of the metric.

__call__(*args, **kwargs)[source]

Call self as a function.

Mean

class sonnet.Mean(name=None)[source]

Calculates the element-wise mean of the given values.

__init__(name=None)[source]

Initializes the current module with the given name.

Subclasses should call this constructor before creating other modules or variables such that those modules are named correctly.

Parameters

name (Optional[str]) – An optional string name for the class. Must be a valid Python identifier. If name is not provided then the class name for the current instance is converted to lower_snake_case and used instead.

property value

See base class.

Sum

class sonnet.Sum(name=None)[source]

Calculates the element-wise sum of the given values.

__init__(name=None)[source]

Initializes the current module with the given name.

Subclasses should call this constructor before creating other modules or variables such that those modules are named correctly.

Parameters

name (Optional[str]) – An optional string name for the class. Must be a valid Python identifier. If name is not provided then the class name for the current instance is converted to lower_snake_case and used instead.

property value

See base class.
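A hedged usage sketch of the metrics above (it assumes that calling a metric updates it and returns the running value):

mean = snt.Mean()
mean(tf.constant([2.0, 4.0]))   # Running element-wise mean after one update.
mean(tf.constant([6.0, 8.0]))   # Now [4.0, 6.0], the mean over both updates.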

Nets

Common network architectures implemented as Sonnet modules.

MLP

class sonnet.nets.MLP(output_sizes, w_init=None, b_init=None, with_bias=True, activation=<function relu>, dropout_rate=None, activate_final=False, name=None)[source]

A multi-layer perceptron module.

__init__(output_sizes, w_init=None, b_init=None, with_bias=True, activation=<function relu>, dropout_rate=None, activate_final=False, name=None)[source]

Constructs an MLP.

Parameters
  • output_sizes (Iterable[int]) – Sequence of layer sizes.

  • w_init (Optional[Initializer]) – Initializer for Linear weights.

  • b_init (Optional[Initializer]) – Initializer for Linear bias. Must be None if with_bias is False.

  • with_bias (bool) – Whether or not to apply a bias in each layer.

  • activation (Callable[[Tensor], Tensor]) – Activation function to apply between linear layers. Defaults to ReLU.

  • dropout_rate – Dropout rate to apply, a rate of None (the default) or 0 means no dropout will be applied.

  • activate_final (bool) – Whether or not to activate the final layer of the MLP.

  • name (Optional[str]) – Optional name for this module.

Raises

ValueError – If with_bias is False and b_init is not None.

__call__(*args, **kwargs)[source]

Call self as a function.
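A brief usage sketch (assuming the is_training flag controls dropout when dropout_rate is set):

mlp = snt.nets.MLP([128, 64, 10], dropout_rate=0.1)
x = tf.random.normal([8, 32])
logits = mlp(x, is_training=True)    # Dropout applied between layers.
logits = mlp(x, is_training=False)   # Dropout disabled for evaluation.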

Cifar10ConvNet

class sonnet.nets.Cifar10ConvNet(num_classes=10, w_init=None, b_init=None, data_format='NHWC', output_channels=(64, 64, 128, 128, 128, 256, 256, 256, 512, 512, 512), strides=(1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1), name=None)[source]

Convolutional network designed for Cifar10.

Approximately equivalent to “VGG, minus max pooling, plus BatchNorm”. For best results the input data should be scaled to be between -1 and 1 when using the standard initializers.

__init__(num_classes=10, w_init=None, b_init=None, data_format='NHWC', output_channels=(64, 64, 128, 128, 128, 256, 256, 256, 512, 512, 512), strides=(1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1), name=None)[source]

Initializes the current module with the given name.

Subclasses should call this constructor before creating other modules or variables such that those modules are named correctly.

Parameters

name (Optional[str]) – An optional string name for the class. Must be a valid Python identifier. If name is not provided then the class name for the current instance is converted to lower_snake_case and used instead.

__call__(*args, **kwargs)[source]

Call self as a function.

ResNet

class sonnet.nets.ResNet(blocks_per_group_list, num_classes, bn_config=None, resnet_v2=False, channels_per_group_list=(256, 512, 1024, 2048), name=None)[source]

ResNet model.

__init__(blocks_per_group_list, num_classes, bn_config=None, resnet_v2=False, channels_per_group_list=(256, 512, 1024, 2048), name=None)[source]

Constructs a ResNet model.

Parameters
  • blocks_per_group_list (Sequence[int]) – A sequence of length 4 that indicates the number of blocks created in each group.

  • num_classes (int) – The number of classes to classify the inputs into.

  • bn_config (Optional[Mapping[str, float]]) – A dictionary of two elements, decay_rate and eps to be passed on to the BatchNorm layers. By default the decay_rate is 0.9 and eps is 1e-5.

  • resnet_v2 (bool) – Whether to use the v1 or v2 ResNet implementation. Defaults to False.

  • channels_per_group_list (Sequence[int]) – A sequence of length 4 that indicates the number of channels used for each block in each group.

  • name (Optional[str]) – Name of the module.

__call__(*args, **kwargs)[source]

Call self as a function.

ResNet50

class sonnet.nets.ResNet50(num_classes, bn_config=None, resnet_v2=False, name=None)[source]

ResNet50 module.

__init__(num_classes, bn_config=None, resnet_v2=False, name=None)[source]

Constructs a ResNet model.

Parameters
  • num_classes (int) – The number of classes to classify the inputs into.

  • bn_config (Optional[Mapping[str, float]]) – A dictionary of two elements, decay_rate and eps to be passed on to the BatchNorm layers.

  • resnet_v2 (bool) – Whether to use the v1 or v2 ResNet implementation. Defaults to False.

  • name (Optional[str]) – Name of the module.
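A brief usage sketch (assuming the forward pass takes an is_training flag for its BatchNorm layers; the image size is illustrative):

model = snt.nets.ResNet50(num_classes=1000)
images = tf.random.normal([8, 224, 224, 3])
logits = model(images, is_training=True)   # [8, 1000].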

VectorQuantizer

class sonnet.nets.VectorQuantizer(embedding_dim, num_embeddings, commitment_cost, dtype=tf.float32, name='vector_quantizer')[source]

Sonnet module representing the VQ-VAE layer.

Implements the algorithm presented in ‘Neural Discrete Representation Learning’ by van den Oord et al. https://arxiv.org/abs/1711.00937

Input any tensor to be quantized. Last dimension will be used as space in which to quantize. All other dimensions will be flattened and will be seen as different examples to quantize.

The output tensor will have the same shape as the input.

For example a tensor with shape [16, 32, 32, 64] will be reshaped into [16384, 64] and all 16384 vectors (each of 64 dimensions) will be quantized independently.

embedding_dim

integer representing the dimensionality of the tensors in the quantized space. Inputs to the modules must be in this format as well.

num_embeddings

integer, the number of vectors in the quantized space.

commitment_cost

scalar which controls the weighting of the loss terms (see equation 4 in the paper - this variable is Beta).

__init__(embedding_dim, num_embeddings, commitment_cost, dtype=tf.float32, name='vector_quantizer')[source]

Initializes a VQ-VAE module.

Parameters
  • embedding_dim (int) – dimensionality of the tensors in the quantized space. Inputs to the modules must be in this format as well.

  • num_embeddings (int) – number of vectors in the quantized space.

  • commitment_cost (Union[float, floating, ndarray, Tensor, Variable]) – scalar which controls the weighting of the loss terms (see equation 4 in the paper - this variable is Beta).

  • dtype (DType) – dtype for the embeddings variable, defaults to tf.float32.

  • name (str) – name of the module.

__call__(*args, **kwargs)[source]

Call self as a function.
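A hedged usage sketch; the returned dictionary keys ('quantize', 'loss') are taken from the reference implementation and should be treated as assumptions:

vq = snt.nets.VectorQuantizer(
    embedding_dim=64, num_embeddings=512, commitment_cost=0.25)
z = tf.random.normal([16, 32, 32, 64])     # Last dimension must equal embedding_dim.
result = vq(z, is_training=True)
quantized, loss = result['quantize'], result['loss']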

VectorQuantizerEMA

class sonnet.nets.VectorQuantizerEMA(*args, **kwargs)[source]

Sonnet module representing the VQ-VAE layer.

Implements a slightly modified version of the algorithm presented in ‘Neural Discrete Representation Learning’ by van den Oord et al. https://arxiv.org/abs/1711.00937

The difference between VectorQuantizerEMA and VectorQuantizer is that this module uses exponential moving averages to update the embedding vectors instead of an auxiliary loss. This has the advantage that the embedding updates are independent of the choice of optimizer (SGD, RMSProp, Adam, K-Fac, …) used for the encoder, decoder and other parts of the architecture. For most experiments the EMA version trains faster than the non-EMA version.

Input any tensor to be quantized. Last dimension will be used as space in which to quantize. All other dimensions will be flattened and will be seen as different examples to quantize.

The output tensor will have the same shape as the input.

For example a tensor with shape [16, 32, 32, 64] will be reshaped into [16384, 64] and all 16384 vectors (each of 64 dimensions) will be quantized independently.

embedding_dim

integer representing the dimensionality of the tensors in the quantized space. Inputs to the modules must be in this format as well.

num_embeddings

integer, the number of vectors in the quantized space.

commitment_cost

scalar which controls the weighting of the loss terms (see equation 4 in the paper).

decay

float, decay for the moving averages.

epsilon

small float constant to avoid numerical instability.

__init__(embedding_dim, num_embeddings, commitment_cost, decay, epsilon=1e-05, dtype=tf.float32, name='vector_quantizer_ema')[source]

Initializes a VQ-VAE EMA module.

Parameters
  • embedding_dim – integer representing the dimensionality of the tensors in the quantized space. Inputs to the modules must be in this format as well.

  • num_embeddings – integer, the number of vectors in the quantized space.

  • commitment_cost – scalar which controls the weighting of the loss terms (see equation 4 in the paper - this variable is Beta).

  • decay – float between 0 and 1, controls the speed of the Exponential Moving Averages.

  • epsilon – small constant to aid numerical stability, default 1e-5.

  • dtype – dtype for the embeddings variable, defaults to tf.float32.

  • name – name of the module.

__call__(*args, **kwargs)[source]

Call self as a function.

Mixed Precision

Sonnet mixed precision built for TensorFlow 2.

modes

sonnet.mixed_precision.modes(valid_types)[source]

Decorate a function to cast inputs/outputs to different precision.

>>> support_modes = snt.mixed_precision.modes([tf.float32, tf.float16])
>>> snt.Linear.__call__ = support_modes(snt.Linear.__call__)
>>> mod = snt.Linear(10)
>>> snt.mixed_precision.enable(tf.float16)
>>> y = mod(tf.ones([1, 1]))  # First call will be done in F32.
>>> y = mod(tf.ones([1, 1]))  # MatMul/Add will be done in F16.
>>> snt.mixed_precision.disable()
Parameters
  • valid_types – Collection of types that the function being decorated is legal to run in.

Returns

A decorator that will cast the inputs and outputs of the decorated function according to the global mixed precision policy and the function’s eligibility for mixed precision.

enable

sonnet.mixed_precision.enable(dtype)[source]

Set the mixed precision mode.

Parameters

dtype – type to cast to.

disable

sonnet.mixed_precision.disable()[source]

Disable mixed precision training.

scope

sonnet.mixed_precision.scope(dtype)[source]

Temporarily set the global mixed precision type to dtype.

The global type is reset to its original value when the context is exited:

snt.mixed_precision.enable(tf.float32)
support_modes = snt.mixed_precision.modes([tf.float32, tf.float16])
snt.Linear.__call__ = support_modes(snt.Linear.__call__)
mod = snt.Linear(10)

with snt.mixed_precision.scope(tf.float16):
    y = mod(tf.ones([1, 1]))  # First call will be done in F32.
    y = mod(tf.ones([1, 1]))  # MatMul/Add will be done in F16.
y = mod(tf.ones([1, 1]))  # Outside the scope will be done in F32.
Parameters

dtype (DType) – type to set the mixed precision mode to.

Yields

Nothing. This is required for contextlib.contextmanager.

References

1

Ashish Agarwal, David Berthelot, Tom Hennigan, Alex Passos, and Malcolm Reynolds. Stateful containers with tf.Module. TensorFlow Community RFCs, Google / DeepMind, 2019. URL: https://github.com/tensorflow/community/pull/56.

2

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014. URL: https://arxiv.org/abs/1409.2329.

3

Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In International Conference on Machine Learning, 2342–2350. 2015.

4

Haşim Sak, Andrew Senior, and Françoise Beaufays. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. arXiv preprint arXiv:1402.1128, 2014. URL: https://arxiv.org/abs/1402.1128.

5

Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems, 1019–1027. 2016.

6

SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional lstm network: a machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, 802–810. 2015.

7

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014. URL: https://arxiv.org/abs/1412.3555.

8

Diederik P. Kingma and Jimmy Ba. Adam: a method for stochastic optimization. 2014. arXiv:1412.6980.

9

Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013. URL: https://arxiv.org/abs/1312.6120.

10

Peter Buchlovsky, David Budden, Dominik Grewe, Chris Jones, John Aslanides, Frederic Besse, Andy Brock, Aidan Clark, Sergio Gómez Colmenarejo, Aedan Pope, and others. TF-Replicator: Distributed machine learning for researchers. arXiv preprint arXiv:1902.00465, 2019. URL: https://arxiv.org/abs/1902.00465.

11

Peter Buchlovsky, Dominik Grewe, Priya Gupta, Tom Hennigan, Jonathan Hseu, Chris Jones, and Josh Levenberg. Distribution Strategy - Revised API. TensorFlow Community RFCs, Google / DeepMind, 2018. URL: https://github.com/tensorflow/community/pull/25.