Base
Module

class sonnet.Module(name=None)
Base class for Sonnet modules.
A Sonnet module is a lightweight container for variables and other modules. Modules typically define one or more “forward” methods (e.g. __call__) which apply operations combining user input and module parameters. For example:
>>> class MultiplyModule(snt.Module):
...   def __call__(self, x):
...     if not hasattr(self, 'w'):
...       self.w = tf.Variable(2., name='w')
...     return x * self.w
>>> mod = MultiplyModule()
>>> mod(1.)
<tf.Tensor: ... numpy=2.0>
Sonnet modules are a layer on top of tf.Module, implementing automatic name scoping as described in the original RFC [1].
__init__(name=None)
Initializes the current module with the given name.
Subclasses should call this constructor before creating other modules or variables so that those modules are named correctly.
Parameters
name (Optional[str]) – An optional string name for the class. Must be a valid Python identifier. If name is not provided then the class name for the current instance is converted to lower_snake_case and used instead.
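The default naming rule above can be pictured with a small sketch. The function name `to_lower_snake_case` and the exact regular expressions are assumptions made for illustration; Sonnet's own conversion may treat digits or acronyms differently.

```python
import re


def to_lower_snake_case(class_name: str) -> str:
    """Converts a CamelCase class name to lower_snake_case (illustrative rule)."""
    # Put an underscore before a capitalised word that follows another character.
    partial = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", class_name)
    # Put an underscore between a lowercase letter or digit and an uppercase letter.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", partial).lower()


print(to_lower_snake_case("MultiplyModule"))
print(to_lower_snake_case("Linear"))
```

Under this rule a class called MultiplyModule would get the default name multiply_module.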

property variables
Sequence of tf.Variables owned by this module and its submodules.
See tf.Module.variables for implementation details.
NOTE: Most Sonnet modules create variables lazily (e.g. the first time they are called). As such, just after construction there are typically no variables. To mitigate a common error (calling .variables or .trainable_variables before any variables are created) these properties will raise an exception if their result is empty. See allow_empty_variables() if you want to suppress this error.
Returns
A sequence of variables for the current module (sorted by attribute name) followed by variables from all submodules recursively (breadth first).

property trainable_variables
Sequence of trainable tf.Variables owned by this module and its submodules.
See tf.Module.trainable_variables for implementation details.
NOTE: Most Sonnet modules create variables lazily (e.g. the first time they are called). As such, just after construction there are typically no variables. To mitigate a common error (calling .variables or .trainable_variables before any variables are created) these properties will raise an exception if their result is empty. See allow_empty_variables() if you want to suppress this error.
Returns
A sequence of trainable variables for the current module (sorted by attribute name) followed by variables from all submodules recursively (breadth first).

once

sonnet.once(f)
Decorator which ensures a wrapped method is only ever run once.
>>> @snt.once
... def f():
...   print('Hello, world!')
>>> f()
Hello, world!
>>> f()
>>> f()
If f is a method then it will be evaluated once per instance:
>>> class MyObject(object):
...   @snt.once
...   def f(self):
...     print('Hello, world!')
>>> o = MyObject()
>>> o.f()
Hello, world!
>>> o.f()
>>> o2 = MyObject()
>>> o2.f()
Hello, world!
>>> o.f()
>>> o2.f()
If an error is raised during execution of f it will be raised to the user. Next time the method is run, it will be treated as not having run before.
Parameters
f – A function to wrap which should only be called once.
Returns
Wrapped version of f which will only evaluate f the first time it is called.
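The retry-after-error behaviour described above can be sketched in plain Python. This is not Sonnet's implementation: `run_once` is a hypothetical name, and the sketch only covers plain functions, not the once-per-instance handling of methods.

```python
import functools


def run_once(f):
    """Runs `f` on the first successful call only; later calls return None."""
    state = {"done": False}

    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        if state["done"]:
            return None
        # If f raises, "done" stays False, so the next call retries.
        result = f(*args, **kwargs)
        state["done"] = True
        return result

    return wrapper


calls = []


@run_once
def setup(fail=False):
    calls.append(fail)
    if fail:
        raise ValueError("setup failed")


try:
    setup(fail=True)  # Raises, so it does not count as having run.
except ValueError:
    pass
setup()  # Runs: the earlier failure was not recorded.
setup()  # Skipped.
print(calls)
```

Recording completion only after f returns is what makes a failed call retryable.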
no_name_scope

sonnet.no_name_scope(method)
Decorator to wrap a method, preventing automatic name scope wrapping.
By default, any method on a module is considered as a forwards function, and so any variables / modules created by the method will be scoped as belonging to the module. In some cases this is undesirable, for example when implementing .clone() / .transpose(), as in those cases we want the new module to have the scope of wherever the .transpose() call is made. To allow this, decorate any such methods with no_name_scope.
Parameters
method (~T) – the method to wrap.
Return type
~T
Returns
The method, with a flag indicating no name scope wrapping should occur.
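A flag-based opt-out of this kind can be sketched as follows. The names `mark_no_name_scope`, `should_wrap_in_name_scope` and the attribute `_skip_name_scope` are invented for this illustration and are not Sonnet's internals.

```python
def mark_no_name_scope(method):
    """Tags `method` so a (hypothetical) scoping wrapper can skip it."""
    # The attribute name is made up for this sketch.
    method._skip_name_scope = True
    return method


def should_wrap_in_name_scope(method) -> bool:
    """Returns True unless the method carries the opt-out flag."""
    return not getattr(method, "_skip_name_scope", False)


class Example:
    def forward(self, x):
        return x

    @mark_no_name_scope
    def clone(self):
        return Example()


print(should_wrap_in_name_scope(Example.forward))
print(should_wrap_in_name_scope(Example.clone))
```

The decorator returns the method unchanged apart from the flag, so whatever applies name scopes can check the flag and leave the method alone.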
Deferred

class sonnet.Deferred(*args, **kwargs)
Defers the construction of another module until the first call.
Deferred can be used to declare modules that depend on computed properties of other modules before those modules are defined. This allows users to separate the declaration and use of modules. For example at the start of your program you can declare two modules which are coupled:
>>> encoder = snt.Linear(64)
>>> decoder = snt.Deferred(lambda: snt.Linear(encoder.input_size))
Later you can use these naturally (note that using decoder first would cause an error since encoder.input_size is only defined after encoder has been called):
>>> x = tf.ones([8, 32])
>>> y = encoder(x)
>>> z = decoder(y)  # Constructs the Linear decoder by calling the lambda.
The result will satisfy the following conditions:
>>> assert x.shape == z.shape
>>> assert y.shape == [8, 64]
>>> assert decoder.input_size == encoder.output_size
>>> assert decoder.output_size == encoder.input_size

__init__(constructor, call_methods=('__call__',), name=None)
Initializes the Deferred module.
Parameters
constructor – A no-argument callable which constructs the module to defer to. The first time one of the call_methods is called the constructor will be run and then the constructed module will be called with the same method and arguments as the deferred module.
call_methods – Methods which should trigger construction of the target module. The default value configures this module to construct the first time __call__ is run. If you want other methods to trigger construction as well, pass them explicitly, for example call_methods=("__call__", "encode", "decode").
name – Name for the deferred module.

property target
Returns the target module.
If the constructor has not already run this will trigger construction. Subsequent calls to target will return the same instance.
Returns
A Module instance as created by self.constructor().
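The construct-on-first-use pattern can be sketched without TensorFlow. `LazyTarget` and `make_doubler` are hypothetical names for this illustration; the sketch only forwards __call__ and does not model call_methods or module naming.

```python
class LazyTarget:
    """Builds an object on first use and forwards calls to it (illustrative)."""

    def __init__(self, constructor):
        self._constructor = constructor
        self._target = None

    @property
    def target(self):
        # Run the constructor at most once, then reuse the instance.
        if self._target is None:
            self._target = self._constructor()
        return self._target

    def __call__(self, *args, **kwargs):
        return self.target(*args, **kwargs)


built = []


def make_doubler():
    built.append("doubler")
    return lambda x: 2 * x


lazy = LazyTarget(make_doubler)
assert built == []  # Nothing is constructed at declaration time.
print(lazy(3))      # First call triggers construction.
print(lazy(4))      # Reuses the same target.
print(len(built))
```

Because the constructor runs only when the target is first needed, it can safely read properties of other objects that are not defined until they have been used.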

Linear modules
Linear

class sonnet.Linear(output_size, with_bias=True, w_init=None, b_init=None, name=None)
Linear module, optionally including bias.

__init__(output_size, with_bias=True, w_init=None, b_init=None, name=None)
Constructs a Linear module.
Parameters
output_size (int) – Output dimensionality.
with_bias (bool) – Whether to include bias parameters. Default True.
w_init (Optional[Initializer]) – Optional initializer for the weights. By default the weights are initialized to truncated random normal values with a standard deviation of 1 / sqrt(input_feature_size), which is commonly used when the inputs are zero centered (see https://arxiv.org/abs/1502.03167v3).
b_init (Optional[Initializer]) – Optional initializer for the bias. By default the bias is initialized to zero.
name (Optional[str]) – Name of the module.
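The default weight scale can be worked through in a short sketch. The helper names are hypothetical, and the two-standard-deviation truncation is an assumption of this sketch (a common convention), not something stated in the text above.

```python
import math
import random


def default_stddev(input_feature_size: int) -> float:
    """Standard deviation 1 / sqrt(input_feature_size), as described above."""
    return 1.0 / math.sqrt(input_feature_size)


def truncated_normal(stddev: float, rng: random.Random) -> float:
    """Draws from N(0, stddev^2), resampling values beyond two stddevs.

    The two-standard-deviation cut-off is an assumption of this sketch.
    """
    while True:
        value = rng.gauss(0.0, stddev)
        if abs(value) <= 2.0 * stddev:
            return value


rng = random.Random(0)
stddev = default_stddev(64)
weights = [truncated_normal(stddev, rng) for _ in range(1000)]
print(stddev)
```

With 64 input features the standard deviation is 1 / 8, so wider inputs get proportionally smaller initial weights.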

Bias

class sonnet.Bias(output_size=None, bias_dims=None, b_init=None, name=None)
Bias module.
Example Usage:
>>> N, H, W, C = 1, 2, 3, 4
>>> x = tf.random.normal([N, H, W, C])
>>> scalar_bias = snt.Bias(bias_dims=[])
>>> scalar_bias_output = scalar_bias(x)
>>> assert scalar_bias.b.shape == []
Create a bias over all non-minibatch dimensions:
>>> all_bias = snt.Bias()
>>> all_bias_output = all_bias(x)
>>> assert all_bias.b.shape == [H, W, C]
Create a bias over the last non-minibatch dimension:
>>> last_bias = snt.Bias(bias_dims=[-1])
>>> last_bias_output = last_bias(x)
>>> assert last_bias.b.shape == [C]
Create a bias over the first non-minibatch dimension:
>>> first_bias = snt.Bias(bias_dims=[1])
>>> first_bias_output = first_bias(x)
>>> assert first_bias.b.shape == [H, 1, 1]
Subtract and later add the same learned bias:
>>> bias = snt.Bias()
>>> h1 = bias(x, multiplier=-1)
>>> h2 = bias(x)
>>> h3 = bias(x, multiplier=-1)
>>> reconstructed_x = bias(h3)
>>> assert tf.reduce_all(tf.equal(x, reconstructed_x))

__init__(output_size=None, bias_dims=None, b_init=None, name=None)
Constructs a Bias module that supports broadcasting.
Parameters
output_size (Optional[int]) – Output size (output shape without batch dimension). If output_size is left as None, the size will be directly inferred by the input.
bias_dims (Optional[Sequence[int]]) – Sequence of which dimensions to retain from the input shape when constructing the bias. The remaining dimensions will be broadcast over (given size of 1), and leading dimensions will be removed completely. See class doc for examples.
b_init (Optional[Initializer]) – Optional initializer for the bias. Default to zeros.
name (Optional[str]) – Name of the module.
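The bias_dims rule can be sketched as a pure shape computation. `bias_shape` is a hypothetical helper written to reproduce the shapes in the class examples; it does not validate its arguments and is not Sonnet's code.

```python
from typing import List, Optional, Sequence


def bias_shape(input_shape: Sequence[int],
               bias_dims: Optional[Sequence[int]] = None) -> List[int]:
    """Computes a broadcastable bias shape from an input shape (sketch)."""
    rank = len(input_shape)
    if bias_dims is None:
        # Default: one bias entry per non-minibatch element.
        return list(input_shape[1:])
    # Resolve negative indices such as -1 to absolute positions.
    kept = {d % rank for d in bias_dims}
    if not kept:
        return []  # Scalar bias.
    first = min(kept)
    # Dimensions before the first retained one are dropped; later ones
    # that are not retained get size 1 so they broadcast.
    return [input_shape[i] if i in kept else 1 for i in range(first, rank)]


shape = (1, 2, 3, 4)  # N, H, W, C
print(bias_shape(shape, []))
print(bias_shape(shape))
print(bias_shape(shape, [-1]))
print(bias_shape(shape, [1]))
```

The four calls correspond to the scalar, all-dimension, last-dimension and first-dimension cases in the class examples.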

__call__(inputs, multiplier=None)
Adds bias to inputs and optionally multiplies by multiplier.
Parameters
inputs (Tensor) – A Tensor of size [batch_size, input_size1, …].
multiplier (Union[float, floating, ndarray, Tensor, Variable, None]) – A scalar or Tensor which the bias term is multiplied by before adding it to inputs. Anything which works in the expression bias * multiplier is acceptable here. This may be useful if you want to add a bias in one place and subtract the same bias in another place via multiplier=-1.
Returns
A Tensor of size [batch_size, input_size1, …].

Convolutional modules
Conv1D

class sonnet.Conv1D(output_channels, kernel_shape, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NWC', name=None)
Conv1D module.

__init__(output_channels, kernel_shape, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NWC', name=None)
Constructs a Conv1D module.
Parameters
output_channels (int) – The number of output channels.
kernel_shape (Union[int, Sequence[int]]) – Sequence of length 1, or an integer. kernel_shape will be expanded to define a kernel size in all dimensions.
stride (Union[int, Sequence[int]]) – Sequence of strides of length 1, or an integer. stride will be expanded to define stride in all dimensions.
rate (Union[int, Sequence[int]]) – Sequence of dilation rates of length 1, or integer that is used to define dilation rate in all dimensions. 1 corresponds to standard convolution, rate > 1 corresponds to dilated convolution.
padding (Union[str, Callable[[int], Sequence[int]], Sequence[Callable[[int], Sequence[int]]]]) – Padding to apply to the input. This can be either SAME, VALID or a callable or sequence of callables of size 1. Any callables must take a single integer argument equal to the effective kernel size and return a list of two integers representing the padding before and after. See snt.pad.* for more details and example functions.
with_bias (bool) – Whether to include bias parameters. Default True.
w_init (Optional[Initializer]) – Optional initializer for the weights. By default the weights are initialized to truncated random normal values with a standard deviation of 1 / sqrt(input_feature_size), which is commonly used when the inputs are zero centered (see https://arxiv.org/abs/1502.03167v3).
b_init (Optional[Initializer]) – Optional initializer for the bias. By default the bias is initialized to zero.
data_format (str) – The data format of the input.
name (Optional[str]) – Name of the module.
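The padding-callable contract (one integer in, a two-element list out) can be sketched as below. The functions `effective_kernel_size`, `causal_padding` and `centered_padding` are hypothetical helpers for illustration and are not the snt.pad functions; the effective-size formula is the usual one for dilated kernels.

```python
from typing import List


def effective_kernel_size(kernel_size: int, rate: int = 1) -> int:
    """Span covered by a kernel once dilation gaps are included."""
    return (kernel_size - 1) * rate + 1


def causal_padding(effective_size: int) -> List[int]:
    """Pads only before the input so outputs never see later positions."""
    return [effective_size - 1, 0]


def centered_padding(effective_size: int) -> List[int]:
    """Splits the padding around the input, any extra element going after."""
    total = effective_size - 1
    before = total // 2
    return [before, total - before]


size = effective_kernel_size(kernel_size=3, rate=2)
print(size, causal_padding(size), centered_padding(size))
```

A kernel of size 3 with rate 2 spans 5 input positions, so each callable distributes 4 padding elements between the two sides.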

Conv2D

class sonnet.Conv2D(output_channels, kernel_shape, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NHWC', name=None)
Conv2D module.

__init__(output_channels, kernel_shape, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NHWC', name=None)
Constructs a Conv2D module.
Parameters
output_channels (int) – The number of output channels.
kernel_shape (Union[int, Sequence[int]]) – Sequence of kernel sizes (of length 2), or an integer. kernel_shape will be expanded to define a kernel size in all dimensions.
stride (Union[int, Sequence[int]]) – Sequence of strides (of length 2), or an integer. stride will be expanded to define stride in all dimensions.
rate (Union[int, Sequence[int]]) – Sequence of dilation rates (of length 2), or integer that is used to define dilation rate in all dimensions. 1 corresponds to standard convolution, rate > 1 corresponds to dilated convolution.
padding (Union[str, Callable[[int], Sequence[int]], Sequence[Callable[[int], Sequence[int]]]]) – Padding to apply to the input. This can be either SAME, VALID or a callable or sequence of callables of size 2. Any callables must take a single integer argument equal to the effective kernel size and return a list of two integers representing the padding before and after. See snt.pad.* for more details and example functions.
with_bias (bool) – Whether to include bias parameters. Default True.
w_init (Optional[Initializer]) – Optional initializer for the weights. By default the weights are initialized to truncated random normal values with a standard deviation of 1 / sqrt(input_feature_size), which is commonly used when the inputs are zero centered (see https://arxiv.org/abs/1502.03167v3).
b_init (Optional[Initializer]) – Optional initializer for the bias. By default the bias is initialized to zero.
data_format (str) – The data format of the input.
name (Optional[str]) – Name of the module.

Conv3D

class sonnet.Conv3D(output_channels, kernel_shape, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NDHWC', name=None)
Conv3D module.

__init__(output_channels, kernel_shape, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NDHWC', name=None)
Constructs a Conv3D module.
Parameters
output_channels (int) – The number of output channels.
kernel_shape (Union[int, Sequence[int]]) – Sequence of kernel sizes (of length 3), or an integer. kernel_shape will be expanded to define a kernel size in all dimensions.
stride (Union[int, Sequence[int]]) – Sequence of strides (of length 3), or an integer. stride will be expanded to define stride in all dimensions.
rate (Union[int, Sequence[int]]) – Sequence of dilation rates (of length 3), or integer that is used to define dilation rate in all dimensions. 1 corresponds to standard convolution, rate > 1 corresponds to dilated convolution.
padding (Union[str, Callable[[int], Sequence[int]], Sequence[Callable[[int], Sequence[int]]]]) – Padding to apply to the input. This can be either SAME, VALID or a callable or sequence of callables up to size N. Any callables must take a single integer argument equal to the effective kernel size and return a list of two integers representing the padding before and after. See snt.pad.* for more details and example functions.
with_bias (bool) – Whether to include bias parameters. Default True.
w_init (Optional[Initializer]) – Optional initializer for the weights. By default the weights are initialized to truncated random normal values with a standard deviation of 1 / sqrt(input_feature_size), which is commonly used when the inputs are zero centered (see https://arxiv.org/abs/1502.03167v3).
b_init (Optional[Initializer]) – Optional initializer for the bias. By default the bias is initialized to zero.
data_format (str) – The data format of the input.
name (Optional[str]) – Name of the module.

Conv1DTranspose

class sonnet.Conv1DTranspose(output_channels, kernel_shape, output_shape=None, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NWC', name=None)
A 1D transpose convolutional module.

__init__(output_channels, kernel_shape, output_shape=None, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NWC', name=None)
Constructs a Conv1DTranspose module.
Parameters
output_channels (int) – The number of output channels.
kernel_shape (Union[int, Sequence[int]]) – Sequence of integers (of length 1), or an integer representing kernel shape. kernel_shape will be expanded to define a kernel size in all dimensions.
output_shape (Union[int, Sequence[int], TensorShape, None]) – Output shape of the spatial dimensions of a transpose convolution. Can be either an integer or an iterable of integers or Dimensions, or a TensorShape (of length 1). If a None value is given, a default shape is automatically calculated.
stride (Union[int, Sequence[int]]) – Sequence of integers (of length 1), or an integer. stride will be expanded to define stride in all dimensions.
rate (Union[int, Sequence[int]]) – Sequence of integers (of length 1), or integer that is used to define dilation rate in all dimensions. 1 corresponds to standard 1D convolution, rate > 1 corresponds to dilated convolution.
padding (str) – Padding algorithm, either “SAME” or “VALID”.
with_bias (bool) – Whether to include bias parameters. Default True.
w_init (Optional[Initializer]) – Optional initializer for the weights. By default the weights are initialized to truncated random normal values with a standard deviation of 1 / sqrt(input_feature_size), which is commonly used when the inputs are zero centered (see https://arxiv.org/abs/1502.03167v3).
b_init (Optional[Initializer]) – Optional initializer for the bias. By default the bias is initialized to zero.
data_format (str) – The data format of the input.
name (Optional[str]) – Name of the module.

Conv2DTranspose

class sonnet.Conv2DTranspose(output_channels, kernel_shape, output_shape=None, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NHWC', name=None)
A 2D transpose convolutional module.

__init__(output_channels, kernel_shape, output_shape=None, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NHWC', name=None)
Constructs a Conv2DTranspose module.
Parameters
output_channels (int) – The number of output channels.
kernel_shape (Union[int, Sequence[int]]) – Sequence of integers (of length 2), or an integer representing kernel shape. kernel_shape will be expanded to define a kernel size in all dimensions.
output_shape (Union[int, Sequence[int], TensorShape, None]) – Output shape of the spatial dimensions of a transpose convolution. Can be either an integer or an iterable of integers or Dimensions, or a TensorShape (of length 2). If a None value is given, a default shape is automatically calculated.
stride (Union[int, Sequence[int]]) – Sequence of integers (of length 2), or an integer. stride will be expanded to define stride in all dimensions.
rate (Union[int, Sequence[int]]) – Sequence of integers (of length 2), or integer that is used to define dilation rate in all dimensions. 1 corresponds to standard 2D convolution, rate > 1 corresponds to dilated convolution.
padding (str) – Padding algorithm, either “SAME” or “VALID”.
with_bias (bool) – Whether to include bias parameters. Default True.
w_init (Optional[Initializer]) – Optional initializer for the weights. By default the weights are initialized to truncated random normal values with a standard deviation of 1 / sqrt(input_feature_size), which is commonly used when the inputs are zero centered (see https://arxiv.org/abs/1502.03167v3).
b_init (Optional[Initializer]) – Optional initializer for the bias. By default the bias is initialized to zero.
data_format (str) – The data format of the input.
name (Optional[str]) – Name of the module.

Conv3DTranspose

class sonnet.Conv3DTranspose(output_channels, kernel_shape, output_shape=None, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NDHWC', name=None)
A 3D transpose convolutional module.

__init__(output_channels, kernel_shape, output_shape=None, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NDHWC', name=None)
Constructs a Conv3DTranspose module.
Parameters
output_channels (int) – The number of output channels.
kernel_shape (Union[int, Sequence[int]]) – Sequence of integers (of length 3), or an integer representing kernel shape. kernel_shape will be expanded to define a kernel size in all dimensions.
output_shape (Union[int, Sequence[int], TensorShape, None]) – Output shape of the spatial dimensions of a transpose convolution. Can be either an integer or an iterable of integers or Dimensions, or a TensorShape (of length 3). If a None value is given, a default shape is automatically calculated.
stride (Union[int, Sequence[int]]) – Sequence of integers (of length 3), or an integer. stride will be expanded to define stride in all dimensions.
rate (Union[int, Sequence[int]]) – Sequence of integers (of length 3), or integer that is used to define dilation rate in all dimensions. 1 corresponds to standard 3D convolution, rate > 1 corresponds to dilated convolution.
padding (str) – Padding algorithm, either “SAME” or “VALID”.
with_bias (bool) – Whether to include bias parameters. Default True.
w_init (Optional[Initializer]) – Optional initializer for the weights. By default the weights are initialized to truncated random normal values with a standard deviation of 1 / sqrt(input_feature_size), which is commonly used when the inputs are zero centered (see https://arxiv.org/abs/1502.03167v3).
b_init (Optional[Initializer]) – Optional initializer for the bias. By default the bias is initialized to zero.
data_format (str) – The data format of the input.
name (Optional[str]) – Name of the module.

DepthwiseConv2D

class sonnet.DepthwiseConv2D(kernel_shape, channel_multiplier=1, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NHWC', name=None)
Spatial depthwise 2D convolution module, including bias.
This acts as a light wrapper around the TensorFlow op tf.nn.depthwise_conv2d, abstracting away variable creation and sharing.

__init__(kernel_shape, channel_multiplier=1, stride=1, rate=1, padding='SAME', with_bias=True, w_init=None, b_init=None, data_format='NHWC', name=None)
Constructs a DepthwiseConv2D module.
Parameters
kernel_shape (Union[int, Sequence[int]]) – Sequence of kernel sizes (of length num_spatial_dims), or an integer. kernel_shape will be expanded to define a kernel size in all dimensions.
channel_multiplier (int) – Number of channels to expand convolution to. Must be an integer greater than 0. When channel_multiplier is 1, applies a different filter to each input channel producing one output channel per input channel. Numbers larger than 1 cause multiple different filters to be applied to each input channel, with their outputs being concatenated together, producing channel_multiplier * input_channels output channels.
stride (Union[int, Sequence[int]]) – Sequence of strides (of length num_spatial_dims), or an integer. stride will be expanded to define stride in all dimensions.
rate (Union[int, Sequence[int]]) – Sequence of dilation rates (of length num_spatial_dims), or integer that is used to define dilation rate in all dimensions. 1 corresponds to standard ND convolution, rate > 1 corresponds to dilated convolution.
padding (str) – Padding to apply to the input. This can be either “SAME” or “VALID”.
with_bias (bool) – Whether to include bias parameters. Default True.
w_init (Optional[Initializer]) – Optional initializer for the weights. By default the weights are initialized to truncated random normal values with a standard deviation of 1 / sqrt(input_feature_size), which is commonly used when the inputs are zero centered (see https://arxiv.org/abs/1502.03167v3).
b_init (Optional[Initializer]) – Optional initializer for the bias. By default the bias is initialized to zero.
data_format (str) – The data format of the input.
name (Optional[str]) – Name of the module.
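The effect of channel_multiplier on the output channel count can be seen in a toy sketch restricted to a single pixel and 1x1 filters. `depthwise_1x1` is a hypothetical helper; the output ordering (input channel k, filter q, at index k * channel_multiplier + q) follows the layout documented for tf.nn.depthwise_conv2d and is assumed to carry over here.

```python
from typing import List


def depthwise_1x1(pixel: List[float], filters: List[List[float]]) -> List[float]:
    """Applies per-channel 1x1 filters to one pixel.

    filters[k][q] is the q-th filter weight for input channel k, so the
    result has len(pixel) * channel_multiplier entries.
    """
    output = []
    for k, value in enumerate(pixel):
        for weight in filters[k]:
            # Output index is k * channel_multiplier + q.
            output.append(value * weight)
    return output


pixel = [1.0, 2.0, 3.0]                            # 3 input channels
filters = [[1.0, 10.0], [1.0, 10.0], [1.0, 10.0]]  # channel_multiplier = 2
print(depthwise_1x1(pixel, filters))
```

Each input channel is filtered independently, which is why 3 input channels with a multiplier of 2 give 6 output channels.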

Normalization modules
LayerNorm

class sonnet.LayerNorm(axis, create_scale, create_offset, eps=1e-05, scale_init=None, offset_init=None, data_format='channels_last', name=None)
Normalizes inputs along the given axes.
This is a generic implementation of normalization along specific axes of the input. InstanceNorm is a subclass of this module; it normalizes over the spatial dimensions.
It transforms the input x into:
\[\mathrm{outputs} = \mathrm{scale} \dfrac{x - \mu}{\sigma + \epsilon} + \mathrm{offset}\]
Where \(\mu\) and \(\sigma\) are respectively the mean and standard deviation of x.
There are many different variations for how users want to manage scale and offset if they require them at all. These are:
- No scale/offset, in which case create_* should be set to False and scale/offset aren’t passed when the module is called.
- Trainable scale/offset, in which case create_* should be set to True and again scale/offset aren’t passed when the module is called. In this case this module creates and owns the scale/offset variables.
- Externally generated scale/offset, such as for conditional normalization, in which case create_* should be set to False and then the values fed in at call time.
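The transformation can be worked through on a plain list of numbers. This sketch follows the formula exactly as written in this section, with epsilon added to the standard deviation; `normalize` is a hypothetical helper with scalar scale and offset, and a real implementation may instead add epsilon to the variance before taking the square root.

```python
import math
from typing import List


def normalize(x: List[float], scale: float = 1.0, offset: float = 0.0,
              eps: float = 1e-5) -> List[float]:
    """Applies scale * (x - mean) / (stddev + eps) + offset elementwise."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    stddev = math.sqrt(variance)
    return [scale * (v - mean) / (stddev + eps) + offset for v in x]


outputs = normalize([1.0, 2.0, 3.0, 4.0])
print(outputs)
```

With the default scale and offset the outputs have zero mean and close to unit variance; passing scale and offset at call time corresponds to the externally generated case.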

scale
If create_scale=True, a trainable tf.Variable holding the current scale.

offset
If create_offset=True, a trainable tf.Variable holding the current offset.

__init__(axis, create_scale, create_offset, eps=1e-05, scale_init=None, offset_init=None, data_format='channels_last', name=None)
Constructs a LayerNorm module.
Parameters
axis (Union[int, slice, Sequence[int]]) – An int, slice or sequence of ints representing the axes which should be normalized across. Typical usages are: 1 or -1 for normalization over just the channels and slice(1, None), slice(2, None) for normalization over the spatial and channel dimensions whilst avoiding the batch and/or time dimensions.
create_scale (bool) – bool representing whether to create a trainable scale per channel applied after the normalization.
create_offset (bool) – bool representing whether to create a trainable offset per channel applied after normalization and scaling.
eps (Union[float, floating, ndarray, Tensor, Variable]) – Small epsilon to avoid division by zero variance. Defaults to 1e-5.
scale_init (Optional[Initializer]) – Optional initializer for the scale variable. Can only be set if create_scale=True. By default scale is initialized to 1.
offset_init (Optional[Initializer]) – Optional initializer for the offset variable. Can only be set if create_offset=True. By default offset is initialized to 0.
data_format (str) – The data format of the input. Can be either channels_first, channels_last, N...C or NC.... By default it is channels_last.
name (Optional[str]) – Name of the module.

__call__(inputs, scale=None, offset=None)
Returns normalized inputs.
Parameters
inputs (Tensor) – An n-D tensor of the data_format specified in the constructor on which the transformation is performed.
scale (Optional[Tensor]) – A tensor up to n-D. The shape of this tensor must be broadcastable to the shape of inputs. This is the scale applied to the normalized inputs. This cannot be passed in if the module was constructed with create_scale=True.
offset (Optional[Tensor]) – A tensor up to n-D. The shape of this tensor must be broadcastable to the shape of inputs. This is the offset applied to the normalized inputs. This cannot be passed in if the module was constructed with create_offset=True.
Return type
Tensor
Returns
An n-D tensor of the same shape as inputs that has been normalized.
InstanceNorm

class sonnet.InstanceNorm(create_scale, create_offset, eps=1e-05, scale_init=None, offset_init=None, data_format='channels_last', name=None)
Normalizes inputs along the spatial dimensions.
See LayerNorm for more details.

scale
If create_scale=True, a trainable tf.Variable holding the current scale.

offset
If create_offset=True, a trainable tf.Variable holding the current offset.

__init__(create_scale, create_offset, eps=1e-05, scale_init=None, offset_init=None, data_format='channels_last', name=None)
Constructs an InstanceNorm module.
This method creates a module which normalizes over the spatial dimensions.
Parameters
create_scale (bool) – bool representing whether to create a trainable scale per channel applied after the normalization.
create_offset (bool) – bool representing whether to create a trainable offset per channel applied after normalization and scaling.
eps (Union[float, floating, ndarray, Tensor, Variable]) – Small epsilon to avoid division by zero variance. Defaults to 1e-5.
scale_init (Optional[Initializer]) – Optional initializer for the scale variable. Can only be set if create_scale=True. By default scale is initialized to 1.
offset_init (Optional[Initializer]) – Optional initializer for the offset variable. Can only be set if create_offset=True. By default offset is initialized to 0.
data_format (str) – The data format of the input. Can be either channels_first, channels_last, N...C or NC.... By default it is channels_last.
name (Optional[str]) – Name of the module.

BaseBatchNorm

class sonnet.BaseBatchNorm(create_scale, create_offset, moving_mean, moving_variance, eps=1e-05, scale_init=None, offset_init=None, data_format='channels_last', name=None)
Batch normalization module.
This implements normalization across the batch and spatial dimensions. It maintains moving averages of the mean and variance which can be used to normalize at test time. The constructor is generic and requires the user to pass in objects to compute these.
At training time we use the batch statistics for that batch and these are then used to update the moving averages.
At test time we can either use the moving averages of the batch statistics (test_local_stats=False) or we can use the local statistics (test_local_stats=True).
It transforms the input x into:
\[\mathrm{outputs} = \mathrm{scale} \dfrac{x - \mu}{\sigma + \epsilon} + \mathrm{offset}\]
Where \(\mu\) and \(\sigma\) are respectively the mean and standard deviation of x. Note that this module automatically uses the fused batch norm op if the data format is NHWC.
There are many different variations for how users want to manage scale and offset if they require them at all. These are:
- No scale/offset, in which case create_* should be set to False and scale/offset aren’t passed when the module is called.
- Trainable scale/offset, in which case create_* should be set to True and again scale/offset aren’t passed when the module is called. In this case this module creates and owns the scale/offset variables.
- Externally generated scale/offset, such as for conditional normalization, in which case create_* should be set to False and then the values fed in at call time.

scale
If create_scale, a trainable tf.Variable holding the current scale after the module is connected for the first time.

offset
If create_offset, a trainable tf.Variable holding the current offset after the module is connected for the first time.

__init__(create_scale, create_offset, moving_mean, moving_variance, eps=1e-05, scale_init=None, offset_init=None, data_format='channels_last', name=None)
Constructs a BaseBatchNorm module.
Parameters
create_scale (bool) – Whether to create a trainable scale per channel applied after the normalization.
create_offset (bool) – Whether to create a trainable offset per channel applied after normalization and scaling.
moving_mean (Metric) – A metric which tracks the moving average of the mean which can be used to normalize at test time.
moving_variance (Metric) – A metric which tracks the moving average of the variance which can be used to normalize at test time.
eps (Union[float, floating, ndarray, Tensor, Variable]) – Small epsilon to avoid division by zero variance. Defaults to 1e-5.
scale_init (Optional[Initializer]) – Optional initializer for the scale variable. Can only be set if create_scale=True. By default scale is initialized to 1.
offset_init (Optional[Initializer]) – Optional initializer for the offset variable. Can only be set if create_offset=True. By default offset is initialized to 0.
data_format (str) – The data format of the input. Can be either channels_first, channels_last, N...C or NC.... By default it is channels_last.
name (Optional[str]) – Name of the module.

__call__(inputs, is_training, test_local_stats=False, scale=None, offset=None)[source]¶ Returns normalized inputs.
 Parameters
inputs (Tensor) – An n-D tensor of the data_format specified above on which the transformation is performed.
is_training (Union[bool, ndarray, Tensor, Variable]) – Whether the module should be connected in training mode, meaning the moving averages are updated.
test_local_stats (Union[bool, ndarray, Tensor, Variable]) – Whether local batch statistics should be used when is_training=False. If not, moving averages are used. By default False.
scale (Optional[Tensor]) – A tensor up to n-D. The shape of this tensor must be broadcastable to the shape of inputs. This is the scale applied to the normalized inputs. This cannot be passed in if the module was constructed with create_scale=True.
offset (Optional[Tensor]) – A tensor up to n-D. The shape of this tensor must be broadcastable to the shape of inputs. This is the offset applied to the normalized inputs. This cannot be passed in if the module was constructed with create_offset=True.
 Returns
An n-D tensor of the same shape as inputs that has been normalized.
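The interaction between is_training and test_local_stats described above can be sketched in plain Python for a single channel. This is a rough illustration rather than Sonnet's implementation: MovingAverage and batch_norm_1d are hypothetical names, the plain exponential update is an assumption, and the normalization (x - mean) / sqrt(variance + eps) is assumed from the description of eps.

```python
import math


class MovingAverage:
    """Hypothetical metric exposing an `update` method and a `value` property."""

    def __init__(self, decay_rate=0.999, initial=0.0):
        self._decay_rate = decay_rate
        self._value = initial

    def update(self, new_value):
        # Plain exponential moving average (an assumption, no bias correction).
        d = self._decay_rate
        self._value = d * self._value + (1.0 - d) * new_value

    @property
    def value(self):
        return self._value


def batch_norm_1d(xs, is_training, moving_mean, moving_variance,
                  test_local_stats=False, scale=1.0, offset=0.0, eps=1e-5):
    """Normalizes a list of floats (one channel) following the text above."""
    if is_training or test_local_stats:
        # Local batch statistics.
        mean = sum(xs) / len(xs)
        variance = sum((x - mean) ** 2 for x in xs) / len(xs)
    else:
        # Test time: fall back to the tracked moving averages.
        mean, variance = moving_mean.value, moving_variance.value
    if is_training:
        # Moving averages are only updated in training mode.
        moving_mean.update(mean)
        moving_variance.update(variance)
    inv_std = 1.0 / math.sqrt(variance + eps)
    return [scale * (x - mean) * inv_std + offset for x in xs]


mean_metric = MovingAverage(decay_rate=0.5, initial=0.0)
var_metric = MovingAverage(decay_rate=0.5, initial=1.0)
train_out = batch_norm_1d([1.0, 3.0], True, mean_metric, var_metric)
eval_out = batch_norm_1d([1.0, 3.0], False, mean_metric, var_metric)
```

In the training call the batch mean and variance are used and folded into the metrics; the second call reads the metrics instead, so the same inputs normalize differently.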
BatchNorm¶

class sonnet.BatchNorm(create_scale, create_offset, decay_rate=0.999, eps=1e-05, scale_init=None, offset_init=None, data_format='channels_last', name=None)[source]¶ Batch normalization with exponential moving average for test statistics.
See BaseBatchNorm for details.
scale¶ If create_scale=True, a trainable tf.Variable holding the current scale after the module is connected for the first time.

offset¶ If create_offset=True, a trainable tf.Variable holding the current offset after the module is connected for the first time.

__init__(create_scale, create_offset, decay_rate=0.999, eps=1e-05, scale_init=None, offset_init=None, data_format='channels_last', name=None)[source]¶ Constructs a BatchNorm module.
 Parameters
create_scale (bool) – Whether to create a trainable scale per channel applied after the normalization.
create_offset (bool) – Whether to create a trainable offset per channel applied after normalization and scaling.
decay_rate (float) – Decay rate of the exponential moving averages of the mean and variance.
eps (Union[float, floating, ndarray, Tensor, Variable]) – Small epsilon to avoid division by zero variance. Defaults to 1e-5.
scale_init (Optional[Initializer]) – Optional initializer for the scale variable. Can only be set if create_scale=True. By default scale is initialized to 1.
offset_init (Optional[Initializer]) – Optional initializer for the offset variable. Can only be set if create_offset=True. By default offset is initialized to 0.
data_format (str) – The data format of the input. One of channels_first, channels_last, N...C or NC...; defaults to channels_last.
name (Optional[str]) – Name of the module.

CrossReplicaBatchNorm¶

class sonnet.distribute.CrossReplicaBatchNorm(create_scale, create_offset, moving_mean, moving_variance, eps=1e-05, scale_init=None, offset_init=None, data_format='channels_last', name=None)[source]¶ Cross-replica Batch Normalization.
At every step the full batch is used to calculate the batch statistics even within a distributed setting (note only with snt.(Tpu)Replicator).
See BaseBatchNorm for details.
scale¶ If create_scale=True, a trainable tf.Variable holding the current scale after the module is connected for the first time.

offset¶ If create_offset=True, a trainable tf.Variable holding the current offset after the module is connected for the first time.

__init__(create_scale, create_offset, moving_mean, moving_variance, eps=1e-05, scale_init=None, offset_init=None, data_format='channels_last', name=None)[source]¶ Constructs a CrossReplicaBatchNorm module.
 Parameters
create_scale (bool) – Whether to create a trainable scale per channel applied after the normalization.
create_offset (bool) – Whether to create a trainable offset per channel applied after normalization and scaling.
moving_mean (Metric) – An object which keeps track of the moving average of the mean which can be used to normalize at test time. This object must have an update method which takes a value and updates the internal state, and a value property which returns the current mean.
moving_variance (Metric) – An object which keeps track of the moving average of the variance which can be used to normalize at test time. This object must have an update method which takes a value and updates the internal state, and a value property which returns the current variance.
eps (Union[float, floating, ndarray, Tensor, Variable]) – Small epsilon to avoid division by zero variance. Defaults to 1e-5.
scale_init (Optional[Initializer]) – Optional initializer for the scale variable. Can only be set if create_scale=True. By default scale is initialized to 1.
offset_init (Optional[Initializer]) – Optional initializer for the offset variable. Can only be set if create_offset=True. By default offset is initialized to 0.
data_format (str) – The data format of the input. One of channels_first, channels_last, N...C or NC...; defaults to channels_last.
name (Optional[str]) – Name of the module.

GroupNorm¶

class sonnet.GroupNorm(groups, axis=slice(1, None, None), create_scale=True, create_offset=True, eps=1e-05, scale_init=None, offset_init=None, data_format='channels_last', name=None)[source]¶ Group normalization module.
This applies group normalization to the inputs. This involves splitting the channels into groups before calculating the mean and variance. The default behaviour is to compute the mean and variance over the spatial dimensions and the grouped channels. The mean and variance will never be computed over the created groups axis.
It transforms the input x into:
\[\mathrm{outputs} = \mathrm{scale} \dfrac{x - \mu}{\sigma + \epsilon} + \mathrm{offset}\]
Where \(\mu\) and \(\sigma\) are respectively the mean and standard deviation of x.
There are many different variations for how users want to manage scale and offset if they require them at all. These are:
No scale/offset, in which case create_* should be set to False and scale/offset aren’t passed when the module is called.
Trainable scale/offset, in which case create_* should be set to True and again scale/offset aren’t passed when the module is called. In this case this module creates and owns the scale/offset variables.
Externally generated scale/offset, such as for conditional normalization, in which case create_* should be set to False and then the values are fed in at call time.

scale¶ If create_scale=True, a trainable tf.Variable holding the current scale.

offset¶ If create_offset=True, a trainable tf.Variable holding the current offset.

__init__(groups, axis=slice(1, None, None), create_scale=True, create_offset=True, eps=1e-05, scale_init=None, offset_init=None, data_format='channels_last', name=None)[source]¶ Constructs a GroupNorm module.
 Parameters
groups (int) – Number of groups to divide the channels by. The number of channels must be divisible by this.
axis (Union[int, slice, Sequence[int]]) – int, slice or sequence of ints representing the axes which should be normalized across. By default this is all but the first dimension. For time series data use slice(2, None) to average over the non-batch, non-time dimensions.
create_scale (bool) – Whether to create a trainable scale per channel applied after the normalization.
create_offset (bool) – Whether to create a trainable offset per channel applied after normalization and scaling.
eps (Union[float, floating, ndarray, Tensor, Variable]) – Small epsilon to add to the variance to avoid division by zero. Defaults to 1e-5.
scale_init (Optional[Initializer]) – Optional initializer for the scale variable. Can only be set if create_scale=True. By default scale is initialized to 1.
offset_init (Optional[Initializer]) – Optional initializer for the offset variable. Can only be set if create_offset=True. By default offset is initialized to 0.
data_format (str) – The data format of the input. One of channels_first, channels_last, N...C or NC...; defaults to channels_last.
name (Optional[str]) – Name of the module.

__call__(inputs, scale=None, offset=None)[source]¶ Returns normalized inputs.
 Parameters
inputs (Tensor) – An n-D tensor of the data_format specified in the constructor on which the transformation is performed.
scale (Optional[Tensor]) – A tensor up to n-D. The shape of this tensor must be broadcastable to the shape of inputs. This is the scale applied to the normalized inputs. This cannot be passed in if the module was constructed with create_scale=True.
offset (Optional[Tensor]) – A tensor up to n-D. The shape of this tensor must be broadcastable to the shape of inputs. This is the offset applied to the normalized inputs. This cannot be passed in if the module was constructed with create_offset=True.
 Returns
An n-D tensor of the same shape as inputs that has been normalized.
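As a worked illustration of the group normalization formula, the following plain-Python sketch normalizes each group of channels of a [N, C] input independently. It is a simplified sketch, not the library code: group_norm is a hypothetical helper, there are no spatial dimensions, and eps is added to the standard deviation exactly as in the formula above.

```python
import math


def group_norm(x, groups, scale=None, offset=None, eps=1e-5):
    """Group-normalizes `x`, a list of rows with C channel values each.

    `scale` and `offset`, if given, are per-channel lists of length C.
    """
    channels = len(x[0])
    if channels % groups != 0:
        raise ValueError("The number of channels must be divisible by groups.")
    size = channels // groups
    out = []
    for row in x:
        new_row = []
        for g in range(groups):
            chunk = row[g * size:(g + 1) * size]
            mu = sum(chunk) / size
            sigma = math.sqrt(sum((v - mu) ** 2 for v in chunk) / size)
            # outputs = (x - mu) / (sigma + eps), per group and per example.
            new_row.extend((v - mu) / (sigma + eps) for v in chunk)
        if scale is not None:
            new_row = [s * v for s, v in zip(scale, new_row)]
        if offset is not None:
            new_row = [v + o for v, o in zip(new_row, offset)]
        out.append(new_row)
    return out


x = [[1.0, 3.0, 10.0, 30.0]]
plain = group_norm(x, groups=2)
affine = group_norm(x, groups=2, scale=[2.0] * 4, offset=[1.0] * 4)
```

Both groups end up close to [-1, 1] even though their raw magnitudes differ by a factor of ten, which is the point of computing statistics per group.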
Recurrent modules¶
RNNCore¶

class sonnet.RNNCore(name=None)[source]¶ Base class for Recurrent Neural Network cores.
This class defines the basic functionality that every core should implement: initial_state(), used to construct an example of the core state; and __call__(), which applies the core parameterized by a previous state to an input.
Cores are typically used with dynamic_unroll() and static_unroll() to iteratively construct an output sequence from the given input sequence.
abstract __call__(inputs, prev_state)[source]¶ Performs one step of an RNN.
 Parameters
inputs (Union[ndarray, Tensor, Variable, Iterable[ForwardRef], Mapping[str, ForwardRef]]) – An arbitrarily nested structure of shape [B, …] where B is the batch size.
prev_state – Previous core state.
 Returns
outputs - An arbitrarily nested structure of shape [B, …]. Dimensions following the batch size could be different from that of inputs.
next_state - Next core state, must be of the same shape as the previous one.
 Return type
A tuple with two elements

abstract initial_state(batch_size, **kwargs)[source]¶ Constructs an initial state for this core.
 Parameters
batch_size (Union[int, integer, ndarray, Tensor, Variable]) – An int or an integral scalar tensor representing batch size.
**kwargs – Optional keyword arguments.
 Returns
Arbitrarily nested initial state for this core.

UnrolledRNN¶

class sonnet.UnrolledRNN(name=None)[source]¶ Base class for unrolled Recurrent Neural Networks.
This class is a generalization of RNNCore which operates on an input sequence as opposed to a single time step.
abstract __call__(input_sequence, initial_state)[source]¶ Apply this RNN to the input sequence.
 Parameters
input_sequence (Union[ndarray, Tensor, Variable, Iterable[ForwardRef], Mapping[str, ForwardRef]]) – An arbitrarily nested structure of shape [T, B, ...] where T is the number of time steps and B is the batch size.
initial_state (Union[ndarray, Tensor, Variable, Iterable[ForwardRef], Mapping[str, ForwardRef]]) – Initial RNN state.
 Returns
output_sequence - An arbitrarily nested structure of tensors of shape [T, B, ...]. Dimensions following the batch size could be different from that of the input_sequence.
final_state - Final RNN state, must be of the same shape as the initial one.
 Return type
A tuple with two elements

abstract initial_state(batch_size, **kwargs)[source]¶ Construct an initial state for this RNN.
 Parameters
batch_size (Union[int, integer, ndarray, Tensor, Variable]) – An int or an integral scalar tensor representing batch size.
**kwargs – Optional keyword arguments.
 Returns
Arbitrarily nested initial state for this RNN.

TrainableState¶

class sonnet.TrainableState(initial_values, mask=None, name=None)[source]¶ Trainable state for an RNNCore.
The state can be constructed manually from a nest of initial values:
>>> state = snt.TrainableState((tf.zeros([16]), tf.zeros([16])))
or automatically for a given RNNCore:
>>> core = snt.LSTM(hidden_size=16)
>>> state = snt.TrainableState.for_core(core)

classmethod for_core(core, mask=None, name=None)[source]¶ Constructs a trainable state for a given RNNCore.
 Parameters
core (RNNCore) – An RNNCore to construct the state for.
mask (Union[ndarray, Tensor, Variable, Iterable[ForwardRef], Mapping[str, ForwardRef], None]) – Optional boolean mask of the same structure as the initial state of core specifying which components should be trainable. If not given, the whole state is considered trainable.
name (Optional[str]) – Name of the module.
 Returns
A TrainableState.

__init__(initial_values, mask=None, name=None)[source]¶ Constructs a trainable state from initial values.
 Parameters
initial_values (Union[ndarray, Tensor, Variable, Iterable[ForwardRef], Mapping[str, ForwardRef]]) – Arbitrarily nested initial values for the state.
mask (Union[ndarray, Tensor, Variable, Iterable[ForwardRef], Mapping[str, ForwardRef]]) – Optional boolean mask of the same structure as initial_values specifying which components should be trainable. If not given, the whole state is considered trainable.
name (Optional[str]) – Name of the module.

dynamic_unroll¶

sonnet.dynamic_unroll(core, input_sequence, initial_state, sequence_length=None, parallel_iterations=1, swap_memory=False)[source]¶ Performs a dynamic unroll of an RNN.
>>> core = snt.LSTM(hidden_size=16)
>>> batch_size = 3
>>> input_sequence = tf.random.uniform([1, batch_size, 2])
>>> output_sequence, final_state = snt.dynamic_unroll(
...     core,
...     input_sequence,
...     core.initial_state(batch_size))
An unroll corresponds to calling the core on each element of the input sequence in a loop, carrying the state through:
state = initial_state
for t in range(len(input_sequence)):
    outputs, state = core(input_sequence[t], state)
A dynamic unroll preserves the loop structure when executed within tf.function. See static_unroll() for an unroll function which replaces a loop with its body repeated multiple times.
 Parameters
core – An RNNCore to unroll.
input_sequence – An arbitrarily nested structure of tensors of shape [T, B, ...] where T is the number of time steps, and B is the batch size.
initial_state – Initial state of the given core.
sequence_length – An optional tensor of shape [B] specifying the lengths of sequences within the (padded) batch.
parallel_iterations – An optional int specifying the number of iterations to run in parallel. Operations which do not have any temporal dependency and can be run in parallel will be. This parameter trades off time for space. Values >> 1 use more memory but take less time, while smaller values use less memory but computations take longer. Defaults to 1.
swap_memory – Transparently swap the tensors produced in forward inference but needed for back prop from GPU to CPU. This allows training RNNs which would typically not fit on a single GPU, with very minimal (or no) performance penalty. Defaults to False.
 Returns
output_sequence - An arbitrarily nested structure of tensors of shape [T, B, ...]. Dimensions following the batch size could be different from that of the input_sequence.
final_state - Core state at time step T.
 Return type
A tuple with two elements
 Raises
ValueError – If input_sequence is empty.
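The unroll loop described above can be made concrete with a toy core in plain Python. This is only a sketch of the calling convention: RunningSumCore and unroll are hypothetical names, lists stand in for tensors, and sequence_length handling is left out.

```python
class RunningSumCore:
    """Toy core whose state is a running sum per batch element."""

    def initial_state(self, batch_size):
        return [0.0] * batch_size

    def __call__(self, inputs, prev_state):
        next_state = [s + x for s, x in zip(prev_state, inputs)]
        # This toy core emits a copy of its new state as the output.
        return list(next_state), next_state


def unroll(core, input_sequence, initial_state):
    """Calls `core` on each time step of a [T, B] nested list."""
    if not input_sequence:
        raise ValueError("input_sequence must not be empty.")
    state = initial_state
    output_sequence = []
    for inputs_t in input_sequence:
        # Carry the state from one time step to the next.
        outputs, state = core(inputs_t, state)
        output_sequence.append(outputs)
    return output_sequence, state


core = RunningSumCore()
inputs = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # T=3, B=2
output_sequence, final_state = unroll(core, inputs, core.initial_state(2))
```

The output sequence keeps the leading time dimension of the input, and the final state is whatever the core returned at the last step.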
static_unroll¶

sonnet.static_unroll(core, input_sequence, initial_state, sequence_length=None)[source]¶ Performs a static unroll of an RNN.
>>> core = snt.LSTM(hidden_size=16)
>>> batch_size = 3
>>> input_sequence = tf.random.uniform([1, batch_size, 2])
>>> output_sequence, final_state = snt.static_unroll(
...     core,
...     input_sequence,
...     core.initial_state(batch_size))
An unroll corresponds to calling the core on each element of the input sequence in a loop, carrying the state through:
state = initial_state
for t in range(len(input_sequence)):
    outputs, state = core(input_sequence[t], state)
A static unroll replaces a loop with its body repeated multiple times when executed inside tf.function:
state = initial_state
outputs0, state = core(input_sequence[0], state)
outputs1, state = core(input_sequence[1], state)
outputs2, state = core(input_sequence[2], state)
...
See dynamic_unroll() for a loop-preserving unroll function.
 Parameters
core (RNNCore) – An RNNCore to unroll.
input_sequence (Union[ndarray, Tensor, Variable, Iterable[ForwardRef], Mapping[str, ForwardRef]]) – An arbitrarily nested structure of tensors of shape [T, B, ...] where T is the number of time steps, and B is the batch size.
initial_state (Union[ndarray, Tensor, Variable, Iterable[ForwardRef], Mapping[str, ForwardRef]]) – An initial state of the given core.
sequence_length (Union[int, integer, ndarray, Tensor, Variable, None]) – An optional tensor of shape [B] specifying the lengths of sequences within the (padded) batch.
 Returns
output_sequence - An arbitrarily nested structure of tensors of shape [T, B, ...]. Dimensions following the batch size could be different from that of the input_sequence.
final_state - Core state at time step T.
 Return type
A tuple with two elements
 Raises
ValueError – If input_sequence is empty or its leading dimension is not known statically.
VanillaRNN¶

class sonnet.VanillaRNN(hidden_size, activation=<function tanh>, w_i_init=None, w_h_init=None, b_init=None, dtype=tf.float32, name=None)[source]¶ Basic fully-connected RNN core.
Given \(x_t\) and the previous hidden state \(h_{t-1}\) the core computes
\[h_t = w_i x_t + w_h h_{t-1} + b\]
Input-to-hidden weights \(w_i\), a tensor of shape [input_size, hidden_size].
Hidden-to-hidden weights \(w_h\), a tensor of shape [hidden_size, hidden_size].

b¶ Bias, a tensor of shape [hidden_size].

__init__(hidden_size, activation=<function tanh>, w_i_init=None, w_h_init=None, b_init=None, dtype=tf.float32, name=None)[source]¶ Constructs a vanilla RNN core.
 Parameters
hidden_size (int) – Hidden layer size.
activation (Callable[[Union[ndarray, Tensor, Variable]], Union[ndarray, Tensor, Variable]]) – Activation function to use. Defaults to tf.tanh.
w_i_init (Optional[Initializer]) – Optional initializer for the input-to-hidden weights. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(input_size).
w_h_init (Optional[Initializer]) – Optional initializer for the hidden-to-hidden weights. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(hidden_size).
b_init (Optional[Initializer]) – Optional initializer for the bias. Defaults to Zeros.
dtype (DType) – Optional tf.DType of the core’s variables. Defaults to tf.float32.
name (Optional[str]) – Name of the module.
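A single step of the recurrence can be written out in plain Python to make the weight shapes concrete. This is a sketch under assumptions: vanilla_rnn_step is a hypothetical helper, it handles one example rather than a batch, and it assumes the activation is applied to the affine sum and that the output doubles as the next state.

```python
import math


def vanilla_rnn_step(x_t, h_prev, w_i, w_h, b, activation=math.tanh):
    """One recurrent step for a single example.

    x_t: [input_size], h_prev: [hidden_size],
    w_i: [input_size][hidden_size], w_h: [hidden_size][hidden_size],
    b: [hidden_size].
    """
    hidden_size = len(b)
    h_t = []
    for j in range(hidden_size):
        pre = b[j]
        pre += sum(x_t[k] * w_i[k][j] for k in range(len(x_t)))
        pre += sum(h_prev[k] * w_h[k][j] for k in range(hidden_size))
        # Assumption: the activation wraps the whole affine expression.
        h_t.append(activation(pre))
    # Assumption: the output and the next state are the same vector.
    return h_t, h_t


outputs, next_state = vanilla_rnn_step(
    x_t=[1.0, 2.0],
    h_prev=[0.0],
    w_i=[[0.5], [0.25]],  # input_size=2, hidden_size=1
    w_h=[[1.0]],
    b=[0.0],
)
```

Here the pre-activation is 0.5 * 1.0 + 0.25 * 2.0 = 1.0, so the new hidden value is tanh(1.0).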
DeepRNN¶

class sonnet.DeepRNN(layers, name=None)[source]¶ Linear chain of RNNCores or callables.
The core takes (input, prev_state) as input and passes the input through each internal module in the order they were presented, using elements from prev_state as necessary for internal RNN cores.
>>> deep_rnn = snt.DeepRNN([
...     snt.LSTM(hidden_size=16),
...     snt.LSTM(hidden_size=16),
... ])
Note that the state of a DeepRNN is always a tuple, which will contain the same number of elements as there are internal RNN cores. If no internal modules are RNN cores, the state of the DeepRNN as a whole is an empty tuple.
Wrapping non-recurrent modules into a DeepRNN can be useful to produce something API compatible with a “real” recurrent module, simplifying code that handles the cores.
__init__(layers, name=None)[source]¶ Constructs a DeepRNN.
 Parameters
layers – A list of RNNCores or callables.
skip_connections – See deep_rnn_with_skip_connections().
concat_final_output_if_skip – See deep_rnn_with_skip_connections().
name (Optional[str]) – Name of the module.
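How the state tuple lines up with the internal cores can be sketched in plain Python. The names CounterCore and deep_step are hypothetical, integers stand in for tensors, and detecting a core by the presence of an initial_state attribute is an assumption made for this sketch only.

```python
class CounterCore:
    """Toy recurrent core: adds its state to the input and counts steps."""

    def initial_state(self, batch_size):
        return 0

    def __call__(self, inputs, prev_state):
        return inputs + prev_state, prev_state + 1


def deep_step(layers, inputs, prev_state):
    """Passes `inputs` through `layers`; each core consumes one state element."""
    state_iter = iter(prev_state)
    next_state = []
    outputs = inputs
    for layer in layers:
        if hasattr(layer, "initial_state"):  # treated as a recurrent core
            outputs, layer_state = layer(outputs, next(state_iter))
            next_state.append(layer_state)
        else:
            # Plain callables carry no state.
            outputs = layer(outputs)
    return outputs, tuple(next_state)


layers = [CounterCore(), lambda x: 2 * x, CounterCore()]
outputs, next_state = deep_step(layers, 5, prev_state=(1, 10))
stateless_outputs, stateless_state = deep_step([abs], -3, prev_state=())
```

The chain with two cores and one plain callable has a two-element state, while the chain made only of callables has an empty tuple as its state.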


sonnet.deep_rnn_with_skip_connections(layers, concat_final_output=True, name='deep_rnn_with_skip_connections')[source]¶ Constructs a DeepRNN with skip connections.
Skip connections alter the dependency structure within a DeepRNN. Specifically, input to the i-th layer (i > 0) is given by a concatenation of the core’s inputs and the outputs of the (i-1)-th layer.
outputs0, ... = layers[0](inputs, ...)
outputs1, ... = layers[1](tf.concat([inputs, outputs0], axis=1), ...)
outputs2, ... = layers[2](tf.concat([inputs, outputs1], axis=1), ...)
...
This allows the layers to learn decoupled features.
 Parameters
layers (Sequence[RNNCore]) – A list of RNNCores.
concat_final_output (bool) – If enabled (default), the output of the core is a concatenation of the outputs of all intermediate layers; otherwise, only the outputs of the final layer, i.e. that of layers[-1], are returned.
name (str) – Name of the module.
 Return type
RNNCore
 Returns
A DeepRNN with skip connections.
 Raises
ValueError – If any of the layers is not an RNNCore.
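The wiring of skip connections can be illustrated with stateless callables on plain lists. This is a simplified sketch: skip_chain is a hypothetical helper, recurrent state is omitted entirely, and list concatenation stands in for tf.concat along the feature axis.

```python
def skip_chain(layers, inputs, concat_final_output=True):
    """Applies `layers` (callables on lists of floats) with skip connections."""
    all_outputs = []
    layer_inputs = inputs
    for layer in layers:
        outputs = layer(layer_inputs)
        all_outputs.append(outputs)
        # The next layer sees the original inputs next to these outputs.
        layer_inputs = inputs + outputs
    if concat_final_output:
        # Concatenate the outputs of every layer.
        return [v for outputs in all_outputs for v in outputs]
    return all_outputs[-1]


def total(values):
    """Toy layer: a single feature holding the sum of its inputs."""
    return [sum(values)]


concatenated = skip_chain([total, total], [1.0, 2.0])
final_only = skip_chain([total, total], [1.0, 2.0], concat_final_output=False)
```

The first layer sees [1.0, 2.0] and the second sees [1.0, 2.0, 3.0], so the concatenated output holds one value per layer while final_only holds just the last one.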

sonnet.deep_rnn_with_residual_connections(layers, name='deep_rnn_with_residual_connections')[source]¶ Constructs a DeepRNN with residual connections.
Residual connections alter the dependency structure in a DeepRNN. Specifically, the input to the i-th intermediate layer is a sum of the original core’s inputs and the outputs of all the preceding layers (<i).
outputs0, ... = layers[0](inputs, ...)
outputs0 += inputs
outputs1, ... = layers[1](outputs0, ...)
outputs1 += outputs0
outputs2, ... = layers[2](outputs1, ...)
outputs2 += outputs1
...
This allows the layers to learn specialized features that compose incrementally.
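The accumulation performed by residual connections can be shown with scalar stand-ins for tensors. This is a minimal sketch: residual_chain is a hypothetical helper and the layers are stateless callables, so recurrent state is not modelled.

```python
def residual_chain(layers, inputs):
    """Applies `layers` in order, adding each layer's input to its output."""
    outputs = inputs
    for layer in layers:
        # outputs_i = layers[i](outputs_{i-1}) + outputs_{i-1}
        outputs = layer(outputs) + outputs
    return outputs


layers = [lambda x: 2.0 * x, lambda x: x + 1.0]
result = residual_chain(layers, 1.0)
```

With an input of 1.0 the first layer yields 2.0 + 1.0 = 3.0, and the second yields (3.0 + 1.0) + 3.0 = 7.0.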
LSTM¶

class sonnet.LSTM(hidden_size, projection_size=None, projection_init=None, w_i_init=None, w_h_init=None, b_init=None, forget_bias=1.0, dtype=tf.float32, name=None)[source]¶ Long short-term memory (LSTM) RNN core.
The implementation is based on [2]. Given \(x_t\) and the previous state \((h_{t-1}, c_{t-1})\) the core computes
\[\begin{array}{ll}
i_t = \sigma(W_{ii} x_t + W_{hi} h_{t-1} + b_i) \\
f_t = \sigma(W_{if} x_t + W_{hf} h_{t-1} + b_f) \\
g_t = \tanh(W_{ig} x_t + W_{hg} h_{t-1} + b_g) \\
o_t = \sigma(W_{io} x_t + W_{ho} h_{t-1} + b_o) \\
c_t = f_t c_{t-1} + i_t g_t \\
h_t = o_t \tanh(c_t)
\end{array}\]
Where \(i_t\), \(f_t\), \(o_t\) are input, forget and output gate activations, and \(g_t\) is a vector of cell updates.
Notes
 Forget gate initialization:
Following [3] we add a constant forget_bias (defaults to 1.0) to \(b_f\) after initialization in order to reduce the scale of forgetting in the beginning of the training.
 Recurrent projections:
Hidden state could be projected (via the projection_size parameter) to reduce the number of parameters and speed up computation. For more details see [4].
Input-to-hidden weights \(W_{ii}\), \(W_{if}\), \(W_{ig}\) and \(W_{io}\) concatenated into a tensor of shape [input_size, 4 * hidden_size].
Hidden-to-hidden weights \(W_{hi}\), \(W_{hf}\), \(W_{hg}\) and \(W_{ho}\) concatenated into a tensor of shape [hidden_size, 4 * hidden_size].

b¶ Biases \(b_i\), \(b_f\), \(b_g\) and \(b_o\) concatenated into a tensor of shape [4 * hidden_size].

__init__(hidden_size, projection_size=None, projection_init=None, w_i_init=None, w_h_init=None, b_init=None, forget_bias=1.0, dtype=tf.float32, name=None)[source]¶ Constructs an LSTM.
 Parameters
hidden_size (int) – Hidden layer size.
projection_size (Optional[int]) – Optional int; if set, then the hidden state is projected to this size via a trainable projection matrix.
projection_init (Optional[Initializer]) – Optional initializer for the projection matrix. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(hidden_size).
w_i_init (Optional[Initializer]) – Optional initializer for the input-to-hidden weights. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(input_size).
w_h_init (Optional[Initializer]) – Optional initializer for the hidden-to-hidden weights. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(hidden_size).
b_init (Optional[Initializer]) – Optional initializer for the biases. Defaults to Zeros.
forget_bias (Union[float, floating, ndarray, Tensor, Variable]) – Optional float to add to the bias of the forget gate after initialization.
dtype (DType) – Optional tf.DType of the core’s variables. Defaults to tf.float32.
name (Optional[str]) – Name of the module.

class sonnet.LSTMState(hidden, cell)¶
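The LSTM equations can be traced with a scalar example in plain Python. This is a sketch, not the library implementation: lstm_step and LSTMStateSketch are hypothetical names, the hidden and input sizes are both 1, the gate order (i, f, g, o) follows the order in which the weights are listed above, and forget_bias is added at step time, which gives the same result as adding it once to \(b_f\).

```python
import math
from collections import namedtuple

# Hypothetical stand-in for a (hidden, cell) state pair.
LSTMStateSketch = namedtuple("LSTMStateSketch", ["hidden", "cell"])


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, prev_state, w_i, w_h, b, forget_bias=1.0):
    """Scalar LSTM step; w_i, w_h and b each hold the (i, f, g, o) entries."""
    h_prev, c_prev = prev_state
    pre = [w_i[k] * x_t + w_h[k] * h_prev + b[k] for k in range(4)]
    i_t = sigmoid(pre[0])
    f_t = sigmoid(pre[1] + forget_bias)  # bias towards remembering
    g_t = math.tanh(pre[2])
    o_t = sigmoid(pre[3])
    c_t = f_t * c_prev + i_t * g_t
    h_t = o_t * math.tanh(c_t)
    return h_t, LSTMStateSketch(hidden=h_t, cell=c_t)


zeros = [0.0, 0.0, 0.0, 0.0]
start = LSTMStateSketch(hidden=0.0, cell=2.0)
_, no_bias = lstm_step(1.0, start, zeros, zeros, zeros, forget_bias=0.0)
_, with_bias = lstm_step(1.0, start, zeros, zeros, zeros, forget_bias=1.0)
```

With all weights at zero the cell update \(g_t\) is 0, so the new cell value is just \(f_t c_{t-1}\): half of the old cell without the bias, and a larger fraction with forget_bias=1.0.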
lstm_with_recurrent_dropout¶

sonnet.lstm_with_recurrent_dropout(hidden_size, dropout=0.5, seed=None, **kwargs)[source]¶ Constructs an LSTM with recurrent dropout.
The implementation is based on [5]. Dropout is applied on the previous hidden state \(h_{t-1}\) during the computation of gate activations:
\[\begin{array}{ll}
i_t = \sigma(W_{ii} x_t + W_{hi} d(h_{t-1}) + b_i) \\
f_t = \sigma(W_{if} x_t + W_{hf} d(h_{t-1}) + b_f) \\
g_t = \tanh(W_{ig} x_t + W_{hg} d(h_{t-1}) + b_g) \\
o_t = \sigma(W_{io} x_t + W_{ho} d(h_{t-1}) + b_o)
\end{array}\]
 Parameters
hidden_size – Hidden layer size.
dropout – Dropout probability.
seed – Optional int; seed passed to tf.nn.dropout.
**kwargs – Optional keyword arguments to pass to the LSTM constructor.
 Returns
train_lstm - An LSTM with recurrent dropout enabled for training.
test_lstm - The same as train_lstm but without recurrent dropout.
 Return type
A tuple of two elements
 Raises
ValueError – If dropout is not in [0, 1).
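The function \(d(\cdot)\) in the equations above can be sketched as inverted dropout on a plain list. This is an illustration under assumptions: recurrent_dropout is a hypothetical helper, Python's random module stands in for the TensorFlow random ops, and kept values are rescaled by 1 / (1 - rate) as tf.nn.dropout does.

```python
import random


def recurrent_dropout(h_prev, rate, rng):
    """Zeroes each element of `h_prev` with probability `rate`."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("dropout must be in [0, 1).")
    keep = 1.0 - rate
    # Kept values are rescaled so the expected value is unchanged.
    return [v / keep if rng.random() >= rate else 0.0 for v in h_prev]


rng = random.Random(0)  # fixed seed for reproducibility
h_prev = [1.0, 2.0, 3.0]
dropped = recurrent_dropout(h_prev, 0.5, rng)
unchanged = recurrent_dropout(h_prev, 0.0, rng)
```

At test time the dropout is simply skipped, which corresponds to the rate=0.0 call returning the hidden state unchanged.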
UnrolledLSTM¶

class sonnet.UnrolledLSTM(hidden_size, w_i_init=None, w_h_init=None, b_init=None, forget_bias=1.0, dtype=tf.float32, name=None)[source]¶ Unrolled long short-term memory (LSTM).
The implementation uses efficient device-specialized ops, e.g. CuDNN-RNN on a CUDA-enabled GPU, and can be an order of magnitude faster than snt.*_unroll with an LSTM core.
__init__(hidden_size, w_i_init=None, w_h_init=None, b_init=None, forget_bias=1.0, dtype=tf.float32, name=None)[source]¶ Construct an unrolled LSTM.
 Parameters
hidden_size – Hidden layer size.
w_i_init (Optional[Initializer]) – Optional initializer for the input-to-hidden weights. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(input_size).
w_h_init (Optional[Initializer]) – Optional initializer for the hidden-to-hidden weights. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(hidden_size).
b_init (Optional[Initializer]) – Optional initializer for the biases. Defaults to Zeros.
forget_bias (Union[float, floating, ndarray, Tensor, Variable]) – Optional float to add to the bias of the forget gate after initialization.
dtype (DType) – Optional tf.DType of the core’s variables. Defaults to tf.float32.
name (Optional[str]) – Name of the module.

Conv1DLSTM¶

class sonnet.Conv1DLSTM(input_shape, output_channels, kernel_shape, data_format='NWC', w_i_init=None, w_h_init=None, b_init=None, forget_bias=1.0, dtype=tf.float32, name=None)[source]¶ 1-D convolutional LSTM.
The implementation is based on [6]. Given \(x_t\) and the previous state \((h_{t-1}, c_{t-1})\) the core computes
\[\begin{array}{ll}
i_t = \sigma(W_{ii} * x_t + W_{hi} * h_{t-1} + b_i) \\
f_t = \sigma(W_{if} * x_t + W_{hf} * h_{t-1} + b_f) \\
g_t = \tanh(W_{ig} * x_t + W_{hg} * h_{t-1} + b_g) \\
o_t = \sigma(W_{io} * x_t + W_{ho} * h_{t-1} + b_o) \\
c_t = f_t c_{t-1} + i_t g_t \\
h_t = o_t \tanh(c_t)
\end{array}\]
where \(*\) denotes the convolution operator; \(i_t\), \(f_t\), \(o_t\) are input, forget and output gate activations, and \(g_t\) is a vector of cell updates.
Notes
 Forget gate initialization:
Following [3] we add a constant forget_bias (defaults to 1.0) to \(b_f\) after initialization in order to reduce the scale of forgetting in the beginning of the training.
Input-to-hidden convolution weights \(W_{ii}\), \(W_{if}\), \(W_{ig}\) and \(W_{io}\) concatenated into a single tensor of shape [kernel_shape*, input_channels, 4 * output_channels] where kernel_shape is repeated 1 time.
Hidden-to-hidden convolution weights \(W_{hi}\), \(W_{hf}\), \(W_{hg}\) and \(W_{ho}\) concatenated into a single tensor of shape [kernel_shape*, input_channels, 4 * output_channels] where kernel_shape is repeated 1 time.

b¶ Biases \(b_i\), \(b_f\), \(b_g\) and \(b_o\) concatenated into a tensor of shape [4 * output_channels].

__init__(input_shape, output_channels, kernel_shape, data_format='NWC', w_i_init=None, w_h_init=None, b_init=None, forget_bias=1.0, dtype=tf.float32, name=None)[source]¶ Constructs a 1-D convolutional LSTM.
 Parameters
input_shape (Union[int, Sequence[int], TensorShape]) – Shape of the inputs excluding batch size.
output_channels (int) – Number of output channels.
kernel_shape (Union[int, Sequence[int]]) – Sequence of kernel sizes (of length 1), or an int. kernel_shape will be expanded to define a kernel size in all dimensions.
data_format – The data format of the input.
w_i_init (Optional[Initializer]) – Optional initializer for the input-to-hidden convolution weights. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(kernel_shape * input_channels).
w_h_init (Optional[Initializer]) – Optional initializer for the hidden-to-hidden convolution weights. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(kernel_shape * input_channels).
b_init (Optional[Initializer]) – Optional initializer for the biases. Defaults to Zeros.
forget_bias (Union[float, floating, ndarray, Tensor, Variable]) – Optional float to add to the bias of the forget gate after initialization.
dtype (DType) – Optional tf.DType of the core’s variables. Defaults to tf.float32.
name (Optional[str]) – Name of the module.
Conv2DLSTM¶

class sonnet.Conv2DLSTM(input_shape, output_channels, kernel_shape, data_format='NHWC', w_i_init=None, w_h_init=None, b_init=None, forget_bias=1.0, dtype=tf.float32, name=None)[source]¶ 2-D convolutional LSTM.
The implementation is based on [6]. Given \(x_t\) and the previous state \((h_{t-1}, c_{t-1})\) the core computes
\[\begin{array}{ll}
i_t = \sigma(W_{ii} * x_t + W_{hi} * h_{t-1} + b_i) \\
f_t = \sigma(W_{if} * x_t + W_{hf} * h_{t-1} + b_f) \\
g_t = \tanh(W_{ig} * x_t + W_{hg} * h_{t-1} + b_g) \\
o_t = \sigma(W_{io} * x_t + W_{ho} * h_{t-1} + b_o) \\
c_t = f_t c_{t-1} + i_t g_t \\
h_t = o_t \tanh(c_t)
\end{array}\]
where \(*\) denotes the convolution operator; \(i_t\), \(f_t\), \(o_t\) are input, forget and output gate activations, and \(g_t\) is a vector of cell updates.
Notes
 Forget gate initialization:
Following [3] we add a constant forget_bias (defaults to 1.0) to \(b_f\) after initialization in order to reduce the scale of forgetting in the beginning of the training.
Input-to-hidden convolution weights \(W_{ii}\), \(W_{if}\), \(W_{ig}\) and \(W_{io}\) concatenated into a single tensor of shape [kernel_shape*, input_channels, 4 * output_channels] where kernel_shape is repeated 2 times.
Hidden-to-hidden convolution weights \(W_{hi}\), \(W_{hf}\), \(W_{hg}\) and \(W_{ho}\) concatenated into a single tensor of shape [kernel_shape*, input_channels, 4 * output_channels] where kernel_shape is repeated 2 times.

b¶ Biases \(b_i\), \(b_f\), \(b_g\) and \(b_o\) concatenated into a tensor of shape [4 * output_channels].

__init__(input_shape, output_channels, kernel_shape, data_format='NHWC', w_i_init=None, w_h_init=None, b_init=None, forget_bias=1.0, dtype=tf.float32, name=None)[source]¶ Constructs a 2-D convolutional LSTM.
 Parameters
input_shape (Union[int, Sequence[int], TensorShape]) – Shape of the inputs excluding batch size.
output_channels (int) – Number of output channels.
kernel_shape (Union[int, Sequence[int]]) – Sequence of kernel sizes (of length 2), or an int. kernel_shape will be expanded to define a kernel size in all dimensions.
data_format (str) – The data format of the input.
w_i_init (Optional[Initializer]) – Optional initializer for the input-to-hidden convolution weights. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(kernel_shape**2 * input_channels).
w_h_init (Optional[Initializer]) – Optional initializer for the hidden-to-hidden convolution weights. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(kernel_shape**2 * input_channels).
b_init (Optional[Initializer]) – Optional initializer for the biases. Defaults to Zeros.
forget_bias (Union[float, floating, ndarray, Tensor, Variable]) – Optional float to add to the bias of the forget gate after initialization.
dtype (DType) – Optional tf.DType of the core’s variables. Defaults to tf.float32.
name (Optional[str]) – Name of the module.
Conv3DLSTM¶

class
sonnet.
Conv3DLSTM
(input_shape, output_channels, kernel_shape, data_format='NDHWC', w_i_init=None, w_h_init=None, b_init=None, forget_bias=1.0, dtype=tf.float32, name=None)[source]¶ 3D convolutional LSTM.
The implementation is based on [6]. Given \(x_t\) and the previous state \((h_{t-1}, c_{t-1})\) the core computes
\[\begin{array}{ll} i_t = \sigma(W_{ii} * x_t + W_{hi} * h_{t-1} + b_i) \\ f_t = \sigma(W_{if} * x_t + W_{hf} * h_{t-1} + b_f) \\ g_t = \tanh(W_{ig} * x_t + W_{hg} * h_{t-1} + b_g) \\ o_t = \sigma(W_{io} * x_t + W_{ho} * h_{t-1} + b_o) \\ c_t = f_t c_{t-1} + i_t g_t \\ h_t = o_t \tanh(c_t) \end{array}\]
where \(*\) denotes the convolution operator; \(i_t\), \(f_t\), \(o_t\) are input, forget and output gate activations, and \(g_t\) is a vector of cell updates.
Notes
 Forget gate initialization:
Following [3] we add a constant forget_bias (defaults to 1.0) to \(b_f\) after initialization in order to reduce the scale of forgetting at the beginning of training.
Input-to-hidden convolution weights \(W_{ii}\), \(W_{if}\), \(W_{ig}\) and \(W_{io}\) concatenated into a single tensor of shape [kernel_shape*, input_channels, 4 * output_channels] where kernel_shape is repeated 3 times.
Hidden-to-hidden convolution weights \(W_{hi}\), \(W_{hf}\), \(W_{hg}\) and \(W_{ho}\) concatenated into a single tensor of shape [kernel_shape*, input_channels, 4 * output_channels] where kernel_shape is repeated 3 times.

b
¶ Biases \(b_i\), \(b_f\), \(b_g\) and \(b_o\) concatenated into a tensor of shape
[4 * output_channels]
.
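As a rough illustration of the gate equations and of the forget_bias note above, the sketch below runs one step of the recurrence on scalars. Each convolution is replaced by a plain multiplication, which is an assumption made to keep the example dependency-free and is not how the core is implemented. The name lstm_step and the dictionary layout of the weights are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w_i, w_h, b, forget_bias=1.0):
    """One scalar LSTM step; w_i, w_h and b map gate names 'i', 'f', 'g', 'o' to floats."""
    i = sigmoid(w_i["i"] * x + w_h["i"] * h_prev + b["i"])
    # Adding forget_bias here is equivalent to adding it to b_f once after initialization.
    f = sigmoid(w_i["f"] * x + w_h["f"] * h_prev + b["f"] + forget_bias)
    g = math.tanh(w_i["g"] * x + w_h["g"] * h_prev + b["g"])
    o = sigmoid(w_i["o"] * x + w_h["o"] * h_prev + b["o"])
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


zeros = {gate: 0.0 for gate in "ifgo"}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, w_i=zeros, w_h=zeros, b=zeros)
kept_fraction = c / 2.0  # share of the previous cell state that survives the step
print(kept_fraction)
```

With all-zero weights the forget gate equals sigmoid(forget_bias), so a forget_bias of 1.0 keeps more of the previous cell state than the 0.5 obtained without it.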

__init__
(input_shape, output_channels, kernel_shape, data_format='NDHWC', w_i_init=None, w_h_init=None, b_init=None, forget_bias=1.0, dtype=tf.float32, name=None)[source]¶ Constructs a 3D convolutional LSTM.
 Parameters
input_shape (Union[int, Sequence[int], TensorShape]) – Shape of the inputs excluding batch size.
output_channels (int) – Number of output channels.
kernel_shape (Union[int, Sequence[int]]) – Sequence of kernel sizes (of length 3), or an int. kernel_shape will be expanded to define a kernel size in all dimensions.
data_format (str) – The data format of the input.
w_i_init (Optional[Initializer]) – Optional initializer for the input-to-hidden convolution weights. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(kernel_shape**3 * input_channels).
w_h_init (Optional[Initializer]) – Optional initializer for the hidden-to-hidden convolution weights. Defaults to TruncatedNormal with a standard deviation of 1 / sqrt(kernel_shape**3 * input_channels).
b_init (Optional[Initializer]) – Optional initializer for the biases. Defaults to Zeros.
forget_bias (Union[float, floating, ndarray, Tensor, Variable]) – Optional float to add to the bias of the forget gate after initialization.
dtype (DType) – Optional tf.DType of the core’s variables. Defaults to tf.float32.
name (Optional[str]) – Name of the module.
GRU¶

class
sonnet.
GRU
(hidden_size, w_i_init=None, w_h_init=None, b_init=None, dtype=tf.float32, name=None)[source]¶ Gated recurrent unit (GRU) RNN core.
The implementation is based on [7]. Given \(x_t\) and the previous state \(h_{t-1}\) the core computes
\[\begin{array}{ll} z_t &= \sigma(W_{iz} x_t + W_{hz} h_{t-1} + b_z) \\ r_t &= \sigma(W_{ir} x_t + W_{hr} h_{t-1} + b_r) \\ a_t &= \tanh(W_{ia} x_t + W_{ha} (r_t h_{t-1}) + b_a) \\ h_t &= (1 - z_t) h_{t-1} + z_t a_t \end{array}\]
where \(z_t\) and \(r_t\) are the update and reset gates, respectively.
Input-to-hidden weights \(W_{iz}\), \(W_{ir}\) and \(W_{ia}\) concatenated into a tensor of shape [input_size, 3 * hidden_size].
Hidden-to-hidden weights \(W_{hz}\), \(W_{hr}\) and \(W_{ha}\) concatenated into a tensor of shape [hidden_size, 3 * hidden_size].

b
¶ Biases \(b_z\), \(b_r\) and \(b_a\) concatenated into a tensor of shape
[3 * hidden_size]
.
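The GRU recurrence above can be sketched on scalars as follows. This is a minimal illustration under the assumption that every weight is a single float; the name gru_step and the dictionary layout are hypothetical and do not mirror the module's internals.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, w_i, w_h, b):
    """One scalar GRU step; w_i, w_h and b map the names 'z', 'r', 'a' to floats."""
    z = sigmoid(w_i["z"] * x + w_h["z"] * h_prev + b["z"])  # gate z_t
    r = sigmoid(w_i["r"] * x + w_h["r"] * h_prev + b["r"])  # gate r_t
    a = math.tanh(w_i["a"] * x + w_h["a"] * (r * h_prev) + b["a"])  # candidate a_t
    return (1.0 - z) * h_prev + z * a


w_i = {"z": 0.0, "r": 0.0, "a": 1.0}
w_h = {"z": 0.0, "r": 0.0, "a": 0.0}
b = {"z": 0.0, "r": 0.0, "a": 0.0}
h = gru_step(x=0.5, h_prev=1.0, w_i=w_i, w_h=w_h, b=b)
print(h)
```

Here z_t is 0.5, so the new state is the average of the previous state and the candidate tanh(0.5).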

__init__
(hidden_size, w_i_init=None, w_h_init=None, b_init=None, dtype=tf.float32, name=None)[source]¶ Constructs a GRU.
 Parameters
hidden_size – Hidden layer size.
w_i_init (Optional[Initializer]) – Optional initializer for the input-to-hidden weights. Defaults to Glorot uniform initializer.
w_h_init (Optional[Initializer]) – Optional initializer for the hidden-to-hidden weights. Defaults to Glorot uniform initializer.
b_init (Optional[Initializer]) – Optional initializer for the biases. Defaults to Zeros.
dtype (DType) – Optional tf.DType of the core’s variables. Defaults to tf.float32.
name (Optional[str]) – Name of the module.
Batch¶
reshape¶
Reshape¶

class
sonnet.
Reshape
(output_shape, preserve_dims=1, name=None)[source]¶ Reshapes input Tensor, preserving the batch dimension.
For example, given an input tensor with shape [B, H, W, C, D]:
>>> B, H, W, C, D = range(1, 6)
>>> x = tf.ones([B, H, W, C, D])
The default behavior when output_shape is (-1, D) is to flatten all dimensions between B and D:
>>> mod = snt.Reshape(output_shape=(-1, D))
>>> assert mod(x).shape == [B, H*W*C, D]
You can change the number of preserved leading dimensions via preserve_dims:
>>> mod = snt.Reshape(output_shape=(-1, D), preserve_dims=2)
>>> assert mod(x).shape == [B, H, W*C, D]
>>> mod = snt.Reshape(output_shape=(-1, D), preserve_dims=3)
>>> assert mod(x).shape == [B, H, W, C, D]
>>> mod = snt.Reshape(output_shape=(-1, D), preserve_dims=4)
>>> assert mod(x).shape == [B, H, W, C, 1, D]
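To make the shape arithmetic in these examples concrete, here is a small sketch that computes the resulting shape without TensorFlow. The helper reshape_shape is hypothetical and only models the documented behavior (preserved leading dimensions plus a single inferred -1 wildcard); it is not the module's code.

```python
import math


def reshape_shape(input_shape, output_shape, preserve_dims=1):
    """Computes the result shape of reshaping all non-preserved dimensions."""
    input_shape, output_shape = tuple(input_shape), tuple(output_shape)
    if preserve_dims <= 0:
        raise ValueError("preserve_dims must be positive")
    if len(input_shape) < preserve_dims:
        raise ValueError("input rank is less than preserve_dims")
    if output_shape.count(-1) > 1:
        raise ValueError("-1 can appear at most once in output_shape")
    # Number of elements that have to be redistributed over output_shape.
    remaining = math.prod(input_shape[preserve_dims:])
    known = math.prod(d for d in output_shape if d != -1)
    has_wildcard = -1 in output_shape
    if known == 0 or remaining % known or (not has_wildcard and known != remaining):
        raise ValueError("output_shape is incompatible with the input shape")
    inferred = remaining // known
    reshaped = tuple(inferred if d == -1 else d for d in output_shape)
    return input_shape[:preserve_dims] + reshaped


B, H, W, C, D = range(1, 6)
for preserve_dims in (1, 2, 3, 4):
    print(preserve_dims, reshape_shape((B, H, W, C, D), (-1, D), preserve_dims))
```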

__init__
(output_shape, preserve_dims=1, name=None)[source]¶ Constructs a Reshape module.
 Parameters
output_shape (Union[int, Sequence[int], TensorShape]) – Shape to reshape the input tensor to while preserving its first preserve_dims dimensions. When the special value -1 appears in output_shape the corresponding size is automatically inferred. Note that -1 can only appear once in output_shape. To flatten all non-batch dimensions use Flatten.
preserve_dims (int) – Number of leading dimensions that will not be reshaped.
name (Optional[str]) – Name of the module.
 Raises
ValueError – If preserve_dims is not positive.

__call__
(inputs)[source]¶ Reshapes inputs.
 Parameters
inputs (Tensor) – A tensor of shape [b_1, b_2, ..., b_preserve_dims, b_preserve_dims + 1, ...].
 Return type
Tensor
 Returns
A tensor of shape [b_1, b_2, ..., b_preserve_dims, b_reshape_1, b_reshape_2, ...], with reshaping defined by the constructor output_shape parameter.
 Raises
ValueError – If output_shape is incompatible with the shape of inputs; or if output_shape contains more than one wildcard -1; or if the rank of inputs is less than preserve_dims; or if the shape of inputs contains unknown, non-preserved dimensions (except when the unknown dimension is the only non-preserved dimension and doesn’t actually need reshaping).

flatten¶
Flatten¶

class
sonnet.
Flatten
(preserve_dims=1, name=None)[source]¶ Flattens the input Tensor, preserving the batch dimension(s).
Flatten reshapes input tensors to combine all trailing dimensions apart from the first. Additional leading dimensions can be preserved by setting the preserve_dims parameter.
See Reshape for more details.
BatchApply¶

class
sonnet.
BatchApply
(module, num_dims=2, name=None)[source]¶ Merges a number of leading dimensions of an input tensor to manipulate it.
Merges a number of leading dimensions of a tensor into a single dimension, connects the provided module, then splits the leading dimension of the result to match the input.
Input tensors whose rank is smaller than the number of dimensions to collapse (e.g. all scalar values, which are tensors of rank 0), are passed unaltered to the provided module.
This is useful for applying some module to each timestep of a Time x Batch x N tensor. If a module is hard coded to only support 2D (Batch x N) then the full 3D Tensor cannot be provided. BatchApply will ‘merge’ the first two dimensions of the sequence tensor by reshaping to a (Time * Batch) x N Tensor, and then the internal module can be applied. The result of that operation is reshaped such that its first dimensions are split to match the leading dimensions of the input.
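The merge, apply and split sequence described above can be sketched with nested Python lists. This is only an illustration of the idea for a Time x Batch x N input; batch_apply and double_rows are hypothetical names, and double_rows stands in for a module that only accepts 2D (Batch x N) inputs.

```python
def batch_apply(fn, inputs, time, batch):
    """Applies fn, which expects a flat batch of rows, to a [time][batch] nested list."""
    # Merge the two leading dimensions: [time][batch] -> [time * batch].
    merged = [inputs[t][b] for t in range(time) for b in range(batch)]
    outputs = fn(merged)
    # Split the leading dimension of the result back into [time][batch].
    return [outputs[t * batch:(t + 1) * batch] for t in range(time)]


def double_rows(rows):
    # Stand-in for a module that is hard coded to 2D (Batch x N) inputs.
    return [[2 * value for value in row] for row in rows]


# Time=2, Batch=3, N=2
x = [[[1, 2], [3, 4], [5, 6]], [[7, 8], [9, 10], [11, 12]]]
y = batch_apply(double_rows, x, time=2, batch=3)
print(y)
```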

__init__
(module, num_dims=2, name=None)[source]¶ Initializes the current module with the given name.
Subclasses should call this constructor before creating other modules or variables such that those modules are named correctly.
 Parameters
name (Optional[str]) – An optional string name for the class. Must be a valid Python identifier. If name is not provided then the class name for the current instance is converted to lower_snake_case and used instead.

Embedding modules¶
Embed¶

class
sonnet.
Embed
(vocab_size=None, embed_dim=None, existing_vocab=None, densify_gradients=False, initializer=None, trainable=True, dtype=tf.float32, name=None)[source]¶ Module for embedding tokens in a low-dimensional space.

__init__
(vocab_size=None, embed_dim=None, existing_vocab=None, densify_gradients=False, initializer=None, trainable=True, dtype=tf.float32, name=None)[source]¶ Constructs an Embed module.
 Parameters
vocab_size (Optional[int]) – Number of unique tokens to embed. If not provided, an existing vocabulary matrix from which vocab_size can be inferred must be provided as existing_vocab.
embed_dim (Optional[int]) – Number of dimensions to assign to each embedding. If not specified, we use 6 * sqrt(sqrt(vocab_size)). If an existing vocabulary matrix initializes the module, this should not be provided as it will be inferred.
existing_vocab (Union[ndarray, Tensor, Variable, None]) – A [vocab_size, embed_dim] vocabulary matrix. Will be converted to a tf.float32 tensor. If provided, neither vocab_size nor embed_dim should be provided as they are inferred.
densify_gradients (bool) – If True, we convert the embedding gradient from a tf.IndexedSlices to a regular tensor before sending it back to the parameter server. This avoids excess computation on the parameter server. Use this option for moderately sized embeddings, e.g., a vocabulary size on the order of up to thousands. For embeddings larger than these, e.g. a vocabulary size on the order of tens or hundreds of thousands, set this to False.
initializer (Optional[Initializer]) – Initializer for the embeddings. By default, embeddings are initialized via a truncated normal distribution.
trainable (bool) – If True, the embeddings will be updated during training. If False, they are fixed to their initial values.
dtype (DType) – The dtype to use for the embedding. Defaults to float32.
name (Optional[str]) – Name for this module.
 Raises
ValueError – If neither vocab_size nor existing_vocab is provided, or if existing_vocab is provided along with vocab_size, embed_dim or initializer (as these should be inferred).
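As a worked example of the default embed_dim formula, 6 * sqrt(sqrt(vocab_size)), the sketch below evaluates it for two vocabulary sizes. Rounding to the nearest integer is an assumption of this sketch, and default_embed_dim is a hypothetical helper name.

```python
import math


def default_embed_dim(vocab_size):
    # 6 times the fourth root of the vocabulary size.
    # Rounding to an int is an assumption made for this illustration.
    return int(round(6.0 * math.sqrt(math.sqrt(vocab_size))))


small = default_embed_dim(16)      # fourth root of 16 is 2
large = default_embed_dim(10000)   # fourth root of 10000 is 10
print(small, large)
```

The default therefore grows slowly with vocabulary size: a 625-fold larger vocabulary only multiplies the embedding dimension by five.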

Optimizers¶
Sonnet optimizers built for TensorFlow 2.
All optimizers implement the snt.Optimizer interface.
Adam¶

class
sonnet.optimizers.
Adam
(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, name=None)[source]¶ Adaptive Moment Estimation (Adam) optimizer.
Adam is an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. See [8] for more details.
Note: default parameter values have been taken from the paper.

learning_rate
¶ Step size (alpha in the paper).

beta1
¶ Exponential decay rate for first moment estimate.

beta2
¶ Exponential decay rate for second moment estimate.

epsilon
¶ Small value to avoid zero denominator.

step
¶ Step count.

m
¶ Biased first moment estimate (a list with one value per parameter).

v
¶ Biased second raw moment estimate (a list with one value per parameter).

__init__
(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, name=None)[source]¶ Constructs an Adam module.
 Parameters
learning_rate (Union[float, floating, ndarray, Tensor, Variable]) – Step size (alpha in the paper).
beta1 (Union[float, floating, ndarray, Tensor, Variable]) – Exponential decay rate for first moment estimate.
beta2 (Union[float, floating, ndarray, Tensor, Variable]) – Exponential decay rate for second moment estimate.
epsilon (Union[float, floating, ndarray, Tensor, Variable]) – Small value to avoid zero denominator.
name (Optional[str]) – Name of the module.

apply
(updates, parameters)[source]¶ Applies updates to parameters.
Applies the Adam update rule for each update, parameter pair:
\[\begin{array}{ll} m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot update \\ v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot update^2 \\ \hat{m}_t = m_t / (1 - \beta_1^t) \\ \hat{v}_t = v_t / (1 - \beta_2^t) \\ delta = \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) \\ param_t = param_{t-1} - delta \\ \end{array}\]
 Parameters
updates (Sequence[Union[Tensor, IndexedSlices, None]]) – A list of updates to apply to parameters. Updates are often gradients as returned by tf.GradientTape.gradient.
parameters (Sequence[Variable]) – A list of parameters.
 Raises
ValueError – If updates and parameters are empty, have different lengths, or have inconsistent types.
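The Adam update rule given for apply can be sketched for a single scalar parameter as follows. This is a minimal illustration, not the optimizer's implementation; adam_step is a hypothetical name and the step counter t is assumed to start at 1.

```python
import math


def adam_step(param, update, m, v, t, learning_rate=0.001, beta1=0.9,
              beta2=0.999, epsilon=1e-8):
    """Returns (param, m, v) after applying step t (1-based) of the Adam rule."""
    m = beta1 * m + (1.0 - beta1) * update
    v = beta2 * v + (1.0 - beta2) * update ** 2
    m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
    delta = learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return param - delta, m, v


param, m, v = 1.0, 0.0, 0.0
for t, grad in enumerate([0.5, 0.4, 0.3], start=1):
    param, m, v = adam_step(param, grad, m, v, t)
print(param)
```

On the first step the bias correction makes m_hat equal to the update and v_hat equal to its square, so the parameter moves by almost exactly learning_rate regardless of the update's magnitude.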

Momentum¶

class
sonnet.optimizers.
Momentum
(learning_rate, momentum, use_nesterov=False, name=None)[source]¶ SGD with Momentum module.

learning_rate
¶ Learning rate.

momentum
¶ Momentum scalar.

use_nesterov
¶ True if using Nesterov momentum.

accumulated_momentum
¶ Accumulated momentum for each parameter.

__init__
(learning_rate, momentum, use_nesterov=False, name=None)[source]¶ Constructs a Momentum module.
 Parameters
learning_rate (Union[float, floating, ndarray, Tensor, Variable]) – Learning rate.
momentum (Union[float, floating, ndarray, Tensor, Variable]) – Momentum scalar.
use_nesterov (bool) – Whether to use Nesterov momentum.
name (Optional[str]) – Name of the module.

apply
(updates, parameters)[source]¶ Applies updates to parameters.
By default it applies the momentum update rule for each update, parameter pair:
accum_t <- momentum * accum_{t-1} + update
parameter <- parameter - learning_rate * accum_t
And when using Nesterov momentum (use_nesterov=True) it applies:
accum_t <- momentum * accum_{t-1} + update
parameter <- parameter - (learning_rate * update + learning_rate * momentum * accum_t)
 Parameters
updates (Sequence[Union[Tensor, IndexedSlices, None]]) – A list of updates to apply to parameters. Updates are often gradients as returned by tf.GradientTape.gradient.
parameters (Sequence[Variable]) – A list of parameters. A parameter is a tf.Variable.
 Raises
ValueError – If updates and parameters are empty, have different lengths, or have inconsistent types.
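Both momentum update rules can be written out for a single scalar parameter. The sketch below assumes plain Python floats and uses the hypothetical name momentum_step; it illustrates the rules and is not the optimizer's code.

```python
def momentum_step(param, update, accum, learning_rate, momentum, use_nesterov=False):
    """Returns (param, accum) after one momentum update."""
    accum = momentum * accum + update
    if use_nesterov:
        # Nesterov variant: combine the raw update with the accumulated momentum.
        param = param - (learning_rate * update + learning_rate * momentum * accum)
    else:
        param = param - learning_rate * accum
    return param, accum


p, a = momentum_step(1.0, 1.0, 0.0, learning_rate=0.1, momentum=0.9)
p2, a2 = momentum_step(p, 1.0, a, learning_rate=0.1, momentum=0.9)
p_nesterov, _ = momentum_step(1.0, 1.0, 0.0, learning_rate=0.1, momentum=0.9,
                              use_nesterov=True)
print(p, p2, p_nesterov)
```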

RMSProp¶

class
sonnet.optimizers.
RMSProp
(learning_rate, decay=0.9, momentum=0.0, epsilon=1e-10, centered=False, name=None)[source]¶ RMSProp module.
See: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
Maintains a moving (discounted) average of the square of updates and divides each update by the root of this average:
ms <- decay * ms + (1 - decay) * update^2
mom <- momentum * mom + learning_rate * update / sqrt(ms + epsilon)
parameter <- parameter - mom
This implementation of RMSProp uses plain momentum, not Nesterov momentum.
The centered version additionally maintains a moving average of the gradients, and uses that average to estimate the variance:
mg <- decay * mg + (1 - decay) * update
ms <- decay * ms + (1 - decay) * update^2
mom <- momentum * mom + learning_rate * update / sqrt(ms - mg^2 + epsilon)
parameter <- parameter - mom

learning_rate
¶ Learning rate.

decay
¶ Learning rate decay over each update.

momentum
¶ Momentum scalar.

epsilon
¶ Small value to avoid zero denominator.

centered
¶ True if centered.

mom
¶ Accumulated mom for each parameter.

ms
¶ Accumulated ms for each parameter.

mg
¶ Accumulated mg for each parameter.

__init__
(learning_rate, decay=0.9, momentum=0.0, epsilon=1e-10, centered=False, name=None)[source]¶ Constructs an RMSProp module.
 Parameters
learning_rate (Union[float, floating, ndarray, Tensor, Variable]) – Learning rate.
decay (Union[float, floating, ndarray, Tensor, Variable]) – Learning rate decay over each update.
momentum (Union[float, floating, ndarray, Tensor, Variable]) – Momentum scalar.
epsilon (Union[float, floating, ndarray, Tensor, Variable]) – Small value to avoid zero denominator.
centered (bool) – If True, gradients are normalized by the estimated variance of the gradient; if False, by the uncentered second moment. Setting this to True may help with training, but is slightly more expensive in terms of computation and memory. Defaults to False.
name (Optional[str]) – Name for this module.

apply
(updates, parameters)[source]¶ Applies updates to parameters.
 Parameters
updates (Sequence[Union[Tensor, IndexedSlices, None]]) – A list of updates to apply to parameters. Updates are often gradients as returned by tf.GradientTape.gradient.
parameters (Sequence[Variable]) – A list of parameters.
 Raises
ValueError – If updates and parameters are empty, have different lengths, or have inconsistent types.
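The plain and centered update rules listed in the RMSProp class description can be sketched for one scalar parameter. The function name rmsprop_step and its argument order are hypothetical; the arithmetic follows the rules as stated in this document.

```python
import math


def rmsprop_step(param, update, ms, mom, mg=0.0, learning_rate=0.1, decay=0.9,
                 momentum=0.0, epsilon=1e-10, centered=False):
    """Returns (param, ms, mom, mg) after one RMSProp update."""
    ms = decay * ms + (1.0 - decay) * update ** 2
    if centered:
        # Track the mean update so the denominator estimates the variance.
        mg = decay * mg + (1.0 - decay) * update
        denominator = math.sqrt(ms - mg ** 2 + epsilon)
    else:
        denominator = math.sqrt(ms + epsilon)
    mom = momentum * mom + learning_rate * update / denominator
    return param - mom, ms, mom, mg


plain, _, _, _ = rmsprop_step(1.0, 2.0, ms=0.0, mom=0.0)
centered, _, _, _ = rmsprop_step(1.0, 2.0, ms=0.0, mom=0.0, centered=True)
print(plain, centered)
```

Because the centered denominator subtracts mg^2, it is never larger than the plain one, so the centered step in this example is slightly bigger.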

SGD¶

class
sonnet.optimizers.
SGD
(learning_rate, name=None)[source]¶ Stochastic Gradient Descent (SGD) module.

learning_rate
¶ Learning rate.

__init__
(learning_rate, name=None)[source]¶ Constructs an SGD module.
 Parameters
learning_rate (Union[float, floating, ndarray, Tensor, Variable]) – Learning rate.
name (Optional[str]) – Name of the module.

apply
(updates, parameters)[source]¶ Applies updates to parameters.
 Parameters
updates (Sequence[Union[Tensor, IndexedSlices, None]]) – A list of updates to apply to parameters. Updates are often gradients as returned by tf.GradientTape.gradient.
parameters (Sequence[Variable]) – A list of parameters.
 Raises
ValueError – If updates and parameters are empty, have different lengths, or have inconsistent types.

Initializers¶
Initializers.
Initializer¶
Constant¶
Identity¶
Ones¶
Orthogonal¶

class
sonnet.initializers.
Orthogonal
(gain=1.0, seed=None)[source]¶ Initializer that generates an orthogonal matrix.
NOTE: Does not support 1D tensors.
The implementation is based on [9].
If the shape of the tensor to initialize is two-dimensional, it is initialized with an orthogonal matrix obtained from the QR decomposition of a matrix of random numbers drawn from a normal distribution. If the matrix has fewer rows than columns then the output will have orthogonal rows. Otherwise, the output will have orthogonal columns.
If the shape of the tensor to initialize is more than two-dimensional, a matrix of shape (shape[0] * ... * shape[n - 2], shape[n - 1]) is initialized, where n is the length of the shape vector. The matrix is subsequently reshaped to give a tensor of the desired shape.
RandomNormal¶

class
sonnet.initializers.
RandomNormal
(mean=0.0, stddev=1.0, seed=None)[source]¶ Initializer that generates tensors with a normal distribution.

__init__
(mean=0.0, stddev=1.0, seed=None)[source]¶ Constructs a random normal initializer.
 Parameters
mean (Union[float, floating, ndarray, Tensor, Variable]) – A python scalar or a scalar tensor. Mean of the random values to generate.
stddev (Union[float, floating, ndarray, Tensor, Variable]) – A python scalar or a scalar tensor. Standard deviation of the random values to generate.
seed (Optional[int]) – The seed used in the generation of random numbers.

RandomUniform¶

class
sonnet.initializers.
RandomUniform
(minval=0, maxval=1, seed=None)[source]¶ Initializer that generates tensors with a uniform distribution.
The generated values follow a uniform distribution in the range [minval, maxval).
__init__
(minval=0, maxval=1, seed=None)[source]¶ Constructs a random uniform initializer.
 Parameters
minval (Union[float, floating, ndarray, Tensor, Variable]) – A python scalar or a scalar tensor. Lower bound of the range of random values to generate. Defaults to 0.
maxval (Union[float, floating, ndarray, Tensor, Variable]) – A python scalar or a scalar tensor. Upper bound of the range of random values to generate. Defaults to 1.
seed (Optional[int]) – The seed used in the generation of random numbers.

TruncatedNormal¶

class
sonnet.initializers.
TruncatedNormal
(mean=0.0, stddev=1.0, seed=None)[source]¶ Initializer that generates a truncated normal distribution.
These values follow a normal distribution except that values more than two standard deviations from the mean are discarded and redrawn. This is the recommended initializer for neural network weights and filters.

__init__
(mean=0.0, stddev=1.0, seed=None)[source]¶ Constructs a truncated normal initializer.
 Parameters
mean (Union[float, floating, ndarray, Tensor, Variable]) – A python scalar or a scalar tensor. Mean of the random values to generate.
stddev (Union[float, floating, ndarray, Tensor, Variable]) – A python scalar or a scalar tensor. Standard deviation of the random values to generate.
seed (Optional[int]) – The seed used in the generation of random numbers.
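The "discarded and redrawn" behavior in the TruncatedNormal description can be illustrated with simple rejection sampling. This is a sketch of the idea using the standard library only; it makes no claim about how TensorFlow or Sonnet generate the samples, and truncated_normal is a hypothetical helper.

```python
import random


def truncated_normal(mean=0.0, stddev=1.0, rng=random):
    """Draws one sample, redrawing while it lies more than 2 stddev from the mean."""
    while True:
        value = rng.gauss(mean, stddev)
        if abs(value - mean) <= 2.0 * stddev:
            return value


rng = random.Random(0)  # seeded generator so repeated runs draw the same values
samples = [truncated_normal(mean=1.0, stddev=0.5, rng=rng) for _ in range(1000)]
print(min(samples), max(samples))
```

Every sample falls inside [mean - 2 * stddev, mean + 2 * stddev], which here is the interval [0, 2].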

VarianceScaling¶

class
sonnet.initializers.
VarianceScaling
(scale=1.0, mode='fan_in', distribution='truncated_normal', seed=None)[source]¶ Initializer capable of adapting its scale to the shape of weights tensors.
With distribution="truncated_normal" or "normal", samples are drawn from a distribution with a mean of zero and a standard deviation (after truncation, if used) stddev = sqrt(scale / n) where n is:
Number of input units in the weight tensor, if mode = fan_in.
Number of output units, if mode = fan_out.
Average of the numbers of input and output units, if mode = fan_avg.
Note that for transposed convolution the mode selected should be reversed. For number of input units use fan_out and for number of output units fan_in.
With distribution=uniform, samples are drawn from a uniform distribution within [-limit, limit], with limit = sqrt(3 * scale / n).
The variance scaling initializer can be configured to generate other standard initializers using the scale, mode and distribution arguments. Here are some example configurations:
Name            Parameters
glorot_uniform  scale=1.0, mode=fan_avg, distribution=uniform
glorot_normal   scale=1.0, mode=fan_avg, distribution=truncated_normal
lecun_uniform   scale=1.0, mode=fan_in, distribution=uniform
lecun_normal    scale=1.0, mode=fan_in, distribution=truncated_normal
he_uniform      scale=2.0, mode=fan_in, distribution=uniform
he_normal       scale=2.0, mode=fan_in, distribution=truncated_normal
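As a worked example of the formulas above, the sketch below computes the standard deviation or uniform limit for a weight tensor with a given fan-in and fan-out. The helper variance_scaling_params and its return format are hypothetical; only the two formulas, stddev = sqrt(scale / n) and limit = sqrt(3 * scale / n), come from this document.

```python
import math


def variance_scaling_params(fan_in, fan_out, scale=1.0, mode="fan_in",
                            distribution="truncated_normal"):
    """Returns the sampling parameter implied by scale, mode and distribution."""
    n = {
        "fan_in": fan_in,
        "fan_out": fan_out,
        "fan_avg": (fan_in + fan_out) / 2.0,
    }[mode]
    if distribution == "uniform":
        return {"limit": math.sqrt(3.0 * scale / n)}
    return {"stddev": math.sqrt(scale / n)}


# glorot_uniform and he_normal configurations from the table, for a 100 x 200 weight matrix.
glorot_uniform = variance_scaling_params(100, 200, scale=1.0, mode="fan_avg",
                                         distribution="uniform")
he_normal = variance_scaling_params(100, 200, scale=2.0, mode="fan_in")
print(glorot_uniform, he_normal)
```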

__init__
(scale=1.0, mode='fan_in', distribution='truncated_normal', seed=None)[source]¶ Constructs a variance scaling initializer.
 Parameters
scale (float) – Scaling factor (positive float).
mode (str) – One of fan_in, fan_out, fan_avg.
distribution (str) – Random distribution to use. One of truncated_normal, untruncated_normal and uniform.
seed (Optional[int]) – int, the seed used in the generation of random numbers.
 Raises
ValueError – In case of an invalid value for the scale, mode or distribution arguments.
Regularizers¶
Regularizers.
Regularizer¶
L1¶

class
sonnet.regularizers.
L1
(scale)[source]¶ L1 regularizer.
>>> reg = snt.regularizers.L1(0.01)
>>> reg([tf.constant([1.0, 2.0, 3.0])])
<tf.Tensor: ...>
L2¶

class
sonnet.regularizers.
L2
(scale)[source]¶ L2 regularizer.
>>> reg = snt.regularizers.L2(0.01)
>>> reg([tf.constant([1.0, 2.0, 3.0])])
<tf.Tensor: ...>
OffDiagonalOrthogonal¶

class
sonnet.regularizers.
OffDiagonalOrthogonal
(scale)[source]¶ Off-diagonal orthogonal regularizer.
The implementation is based on https://arxiv.org/abs/1809.11096. Given a rank N >= 2 tensor, the regularizer computes the sum of off-diagonal entries of (W^T W)^2 where
W is the input tensor reshaped to a matrix by collapsing the leading N - 1 axes into the first one;
^2 is the elementwise square.
NB: that is equivalent to computing the off-diagonal sum of (W^T W - I)^2, as off-diagonal entries of I are 0.
For example,
>>> t = tf.reshape(tf.range(8, dtype=tf.float32), [2, 2, 2])
>>> reg = snt.regularizers.OffDiagonalOrthogonal(0.01)
>>> reg([t])
<tf.Tensor: ...>
corresponds to computing
>>> w = tf.reshape(t, [-1, 2])
>>> w_gram_sq = tf.square(tf.matmul(tf.transpose(w), w))
>>> 0.01 * (tf.reduce_sum(w_gram_sq) - tf.linalg.trace(w_gram_sq))
<tf.Tensor: ...>
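The same computation can be followed by hand with plain Python lists. The sketch below uses the tensor from the example, already collapsed to a matrix with two columns; off_diagonal_orthogonal is a hypothetical helper that mirrors the described formula, not the regularizer's code.

```python
def off_diagonal_orthogonal(scale, w):
    """w is a matrix given as a list of rows; returns scale * off-diagonal sum of (W^T W)^2."""
    cols = len(w[0])
    # Gram matrix W^T W: entry (i, j) is the dot product of columns i and j.
    gram = [[sum(row[i] * row[j] for row in w) for j in range(cols)]
            for i in range(cols)]
    return scale * sum(gram[i][j] ** 2
                       for i in range(cols) for j in range(cols) if i != j)


# The values 0..7 arranged as four rows of two columns.
w = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
penalty = off_diagonal_orthogonal(0.01, w)
print(penalty)
```

For this input the single off-diagonal Gram entry is 0*1 + 2*3 + 4*5 + 6*7 = 68, it appears twice, and 0.01 * 2 * 68^2 = 92.48.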
Paddings¶
Paddings.
causal¶
create¶

sonnet.pad.
create
(padding, kernel, rate, n, channel_index)[source]¶ Generates the padding required for a given padding algorithm.
 Parameters
padding (Union[Callable[[int], Sequence[int]], Sequence[Callable[[int], Sequence[int]]]]) – callable or list of callables of length n. The callables take an integer representing the effective kernel size (kernel size when the rate is 1) and return a list of two integers representing the padding before and padding after for that dimension.
kernel (Union[int, Sequence[int]]) – int or list of ints of length n. The size of the kernel for each dimension. If it is an int it will be replicated for the non channel and batch dimensions.
rate (Union[int, Sequence[int]]) – int or list of ints of length n. The dilation rate for each dimension. If it is an int it will be replicated for the non channel and batch dimensions.
n (int) – the number of spatial dimensions.
channel_index (int) – the channel position of the input to which the padding will be applied.
 Returns
A list of length n+2 containing the padding for each element. These are of the form [pad_before, pad_after].
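The following sketch shows one way the described inputs could map to a padding list of length n + 2. It rests on several assumptions that are not stated in this document: the effective kernel size is taken to be (kernel - 1) * rate + 1, batch and channel dimensions are given [0, 0] padding, and a channel_index of 1 is treated as channels-first. The names same_padding and create_padding are hypothetical.

```python
def same_padding(effective_kernel_size):
    # Hypothetical padding callable: splits the padding as evenly as possible.
    return [(effective_kernel_size - 1) // 2, effective_kernel_size // 2]


def create_padding(padding, kernel, rate, n, channel_index):
    """Builds [pad_before, pad_after] pairs for batch, spatial and channel dimensions."""
    paddings = [padding] * n if callable(padding) else list(padding)
    kernels = [kernel] * n if isinstance(kernel, int) else list(kernel)
    rates = [rate] * n if isinstance(rate, int) else list(rate)
    # Assumed definition of the effective (dilated) kernel size.
    spatial = [p((k - 1) * r + 1) for p, k, r in zip(paddings, kernels, rates)]
    if channel_index == 1:  # assumed channels-first layout, e.g. NCHW
        return [[0, 0], [0, 0]] + spatial
    return [[0, 0]] + spatial + [[0, 0]]  # assumed channels-last layout, e.g. NHWC


channels_last = create_padding(same_padding, kernel=3, rate=1, n=2, channel_index=-1)
channels_first = create_padding(same_padding, kernel=[3, 4], rate=[2, 1], n=2,
                                channel_index=1)
print(channels_last, channels_first)
```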
full¶
reverse_causal¶
same¶
Distribution¶
Utilities for using Sonnet with TensorFlow Distribution Strategy.
Replicator¶

class
sonnet.distribute.
Replicator
(devices=None, cross_device_ops=None)[source]¶ Replicates input, parameters and compute over multiple accelerators.
Replicator is a TensorFlow “Distribution Strategy” implementing the programming model described in the TF-Replicator paper [10] and TensorFlow RFC [11]. Replicator enables data-parallel training across multiple accelerators on a single machine; it supports eager execution and tf.function.
To get started create a Replicator instance:
>>> replicator = snt.distribute.Replicator()
Replicator provides a scope inside which any new tf.Variables will be replicated across all local devices:
>>> with replicator.scope():
...   mod = snt.Linear(32)
Additionally replicator provides utility functions to apply a module in parallel on multiple devices. First we need to define some computation that runs on each GPU. The “replica context” object provides us a way to communicate between replicas (e.g. to perform an all_reduce):
>>> def forward():
...   # Compute a random output on each GPU.
...   x = tf.random.normal([8, 28 * 28])
...   y = mod(x)
...   # Synchronize the value of `y` between all GPUs.
...   ctx = tf.distribute.get_replica_context()
...   y = ctx.all_reduce("mean", y)
...   return y
Finally we use the run API to apply forward in parallel on all accelerator devices:
>>> per_replica_y = replicator.run(forward)

scope
()[source]¶ Context manager to make the strategy current and distribute variables.
This method returns a context manager, and is used as follows:
>>> strategy = tf.distribute.MirroredStrategy(["GPU:0", "GPU:1"])
>>> # Variable created inside scope:
>>> with strategy.scope():
...   mirrored_variable = tf.Variable(1.)
>>> mirrored_variable
MirroredVariable:{
  0: <tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>,
  1: <tf.Variable 'Variable/replica_1:0' shape=() dtype=float32, numpy=1.0>
}
>>> # Variable created outside scope:
>>> regular_variable = tf.Variable(1.)
>>> regular_variable
<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>
_What happens when Strategy.scope is entered?_
The strategy is installed in the global context as the “current” strategy. Inside this scope, tf.distribute.get_strategy() will now return this strategy. Outside this scope, it returns the default no-op strategy.
Entering the scope also enters the “cross-replica context”. See tf.distribute.StrategyExtended for an explanation on cross-replica and replica contexts.
Variable creation inside scope is intercepted by the strategy. Each strategy defines how it wants to affect the variable creation. Sync strategies like MirroredStrategy, TPUStrategy and MultiWorkerMirroredStrategy create variables replicated on each replica, whereas ParameterServerStrategy creates variables on the parameter servers. This is done using a custom tf.variable_creator_scope.
In some strategies, a default device scope may also be entered: in MultiWorkerMirroredStrategy, a default device scope of “/CPU:0” is entered on each worker.
 Note: Entering a scope does not automatically distribute a computation, except in the case of a high level training framework like keras model.fit. If you’re not using model.fit, you need to use the strategy.run API to explicitly distribute that computation. See an example in the [custom training loop tutorial](https://www.tensorflow.org/tutorials/distribute/custom_training).
_What should be in scope and what should be outside?_
There are a number of requirements on what needs to happen inside the scope. However, in places where we have information about which strategy is in use, we often enter the scope for the user, so they don’t have to do it explicitly (i.e. calling those either inside or outside the scope is OK).
Anything that creates variables that should be distributed variables must be in strategy.scope. This can be either by directly putting it in scope, or relying on another API like strategy.run or model.fit to enter it for you. Any variable that is created outside scope will not be distributed and may have performance implications. Common things that create variables in TF: models, optimizers, metrics. These should always be created inside the scope. Another source of variable creation can be a checkpoint restore - when variables are created lazily. Note that any variable created inside a strategy captures the strategy information. So reading and writing to these variables outside the strategy.scope can also work seamlessly, without the user having to enter the scope.
Some strategy APIs (such as strategy.run and strategy.reduce) that must run in a strategy’s scope enter the scope for you automatically, which means when using those APIs you don’t need to enter the scope yourself.
When a tf.keras.Model is created inside a strategy.scope, we capture this information. When high level training framework methods such as model.compile, model.fit etc. are then called on this model, we automatically enter the scope, as well as use this strategy to distribute the training etc. See a detailed example in the [distributed keras tutorial](https://www.tensorflow.org/tutorials/distribute/keras). Note that simply calling the model(..) is not impacted - only high level training framework APIs are. model.compile, model.fit, model.evaluate, model.predict and model.save can all be called inside or outside the scope.
 The following can be either inside or outside the scope:
Creating the input datasets
Defining `tf.function`s that represent your training step
Saving APIs such as tf.saved_model.save. Loading creates variables, so that should go inside the scope if you want to train the model in a distributed way.
Checkpoint saving. As mentioned above - checkpoint.restore may sometimes need to be inside scope if it creates variables.
 Returns
A context manager.

TpuReplicator¶

class
sonnet.distribute.
TpuReplicator
(tpu_cluster_resolver=None, experimental_device_assignment=None)[source]¶ Replicates input, parameters and compute over multiple TPUs.
TpuReplicator is a TensorFlow “Distribution Strategy” implementing the programming model described in the TF-Replicator paper [10] and TensorFlow RFC [11]. TpuReplicator enables data-parallel training across multiple TPUs on one or more machines; it supports tf.function.
To get started create a TpuReplicator instance:
>>> replicator = snt.distribute.TpuReplicator()
This provides a scope inside which any new tf.Variables will be replicated across all TPU cores:
>>> with replicator.scope():
...   mod = snt.Linear(32)
Additionally replicator provides utility functions to apply a module in parallel on multiple devices. First we need to define some computation that runs on each TPU. The “replica context” object provides us a way to communicate between replicas:
>>> def forward():
...   # Compute a random output on each TPU core.
...   x = tf.random.normal([8, 28 * 28])
...   y = mod(x)
...   # Synchronize the value of `y` between all TPU cores.
...   ctx = tf.distribute.get_replica_context()
...   y = ctx.all_reduce("mean", y)
...   return y
Finally we use the run API to apply forward in parallel on all TPU devices. This must be run as part of a tf.function since TpuReplicator uses XLA to compile and replicate our function to run in parallel over all TPU cores:
>>> @tf.function(autograph=False)
... def all_forward():
...   return replicator.run(forward)
>>> per_replica_y = all_forward()

scope()
Context manager to make the strategy current and distribute variables.
This method returns a context manager, and is used as follows:
>>> strategy = tf.distribute.MirroredStrategy(["GPU:0", "GPU:1"])
>>> # Variable created inside scope:
>>> with strategy.scope():
...     mirrored_variable = tf.Variable(1.)
>>> mirrored_variable
MirroredVariable:{
  0: <tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>,
  1: <tf.Variable 'Variable/replica_1:0' shape=() dtype=float32, numpy=1.0>
}
>>> # Variable created outside scope:
>>> regular_variable = tf.Variable(1.)
>>> regular_variable
<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>
_What happens when Strategy.scope is entered?_
The strategy is installed in the global context as the “current” strategy. Inside this scope, tf.distribute.get_strategy() will now return this strategy. Outside this scope, it returns the default no-op strategy.
Entering the scope also enters the “cross-replica context”. See tf.distribute.StrategyExtended for an explanation of cross-replica and replica contexts.
Variable creation inside scope is intercepted by the strategy. Each strategy defines how it wants to affect the variable creation. Sync strategies like MirroredStrategy, TPUStrategy and MultiWorkerMirroredStrategy create variables replicated on each replica, whereas ParameterServerStrategy creates variables on the parameter servers. This is done using a custom tf.variable_creator_scope.
In some strategies, a default device scope may also be entered: in MultiWorkerMirroredStrategy, a default device scope of “/CPU:0” is entered on each worker.
 Note: Entering a scope does not automatically distribute a computation, except in the case of a high-level training framework like Keras model.fit. If you’re not using model.fit, you need to use the strategy.run API to explicitly distribute that computation. See an example in the [custom training loop tutorial](https://www.tensorflow.org/tutorials/distribute/custom_training).
_What should be in scope and what should be outside?_
There are a number of requirements on what needs to happen inside the scope. However, in places where we have information about which strategy is in use, we often enter the scope for the user, so they don’t have to do it explicitly (i.e. calling those either inside or outside the scope is OK).
Anything that creates variables that should be distributed variables must be in strategy.scope. This can be done either by directly putting it in scope, or by relying on another API like strategy.run or model.fit to enter it for you. Any variable that is created outside scope will not be distributed and may have performance implications. Common things that create variables in TF: models, optimizers, metrics. These should always be created inside the scope. Another source of variable creation can be a checkpoint restore, when variables are created lazily. Note that any variable created inside a strategy captures the strategy information, so reading and writing these variables outside the strategy.scope can also work seamlessly, without the user having to enter the scope.
Some strategy APIs (such as strategy.run and strategy.reduce) that must run inside a strategy’s scope enter the scope for you automatically, which means that when using those APIs you don’t need to enter the scope yourself.
When a tf.keras.Model is created inside a strategy.scope, we capture this information. When high-level training framework methods such as model.compile and model.fit are then called on this model, we automatically enter the scope and use this strategy to distribute the training. See the detailed example in the [distributed keras tutorial](https://www.tensorflow.org/tutorials/distribute/keras). Note that simply calling model(..) is not affected; only the high-level training framework APIs are. model.compile, model.fit, model.evaluate, model.predict and model.save can all be called inside or outside the scope.
 The following can be either inside or outside the scope:
Creating the input datasets
Defining `tf.function`s that represent your training step
Saving APIs such as tf.saved_model.save. Loading creates variables, so that should go inside the scope if you want to train the model in a distributed way.
Checkpoint saving. As mentioned above, checkpoint.restore may sometimes need to be inside the scope if it creates variables.
 Returns
A context manager.

Metrics¶
Metric¶
Mean¶

class sonnet.Mean(name=None)
Calculates the elementwise mean of the given values.

__init__(name=None)
Initializes the current module with the given name.
Subclasses should call this constructor before creating other modules or variables such that those modules are named correctly.
 Parameters
name (Optional[str]) – An optional string name for the class. Must be a valid Python identifier. If name is not provided then the class name for the current instance is converted to lower_snake_case and used instead.
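The default-name rule can be illustrated with a small regular-expression sketch. The helper name `camel_to_snake` and the exact pattern are assumptions made for illustration; Sonnet's own conversion may differ in edge cases.

```python
import re

# Hypothetical helper, not part of the Sonnet public API.
_CAMEL_BOUNDARY = re.compile(r"((?<=[a-z0-9])[A-Z]|(?!^)[A-Z](?=[a-z]))")


def camel_to_snake(class_name: str) -> str:
    """Converts a CamelCase class name to lower_snake_case."""
    # Insert an underscore before an upper-case letter that either follows a
    # lower-case letter or digit, or starts a new capitalised word.
    return _CAMEL_BOUNDARY.sub(r"_\1", class_name).lower()


print(camel_to_snake("Mean"))                # mean
print(camel_to_snake("VectorQuantizerEMA"))  # vector_quantizer_ema
```

Under this sketch a run of capitals such as "EMA" stays together as a single word, which agrees with the default name 'vector_quantizer_ema' shown for VectorQuantizerEMA.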

property value
See base class.
 Return type
Tensor
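As a rough illustration of what an elementwise mean metric tracks, the sketch below keeps a running per-element sum and an update count in plain Python. The class `RunningMean` and its `update` method are hypothetical stand-ins chosen for this example; they are not Sonnet's implementation, which operates on tensors.

```python
from typing import List, Optional, Sequence


class RunningMean:
    """Hypothetical pure-Python analogue of an elementwise mean metric."""

    def __init__(self) -> None:
        self._sum: Optional[List[float]] = None
        self._count = 0

    def update(self, values: Sequence[float]) -> None:
        # State is created lazily on the first update, once the shape is known.
        if self._sum is None:
            self._sum = [0.0] * len(values)
        self._sum = [s + v for s, v in zip(self._sum, values)]
        self._count += 1

    @property
    def value(self) -> List[float]:
        if self._sum is None:
            raise ValueError("update() must be called at least once.")
        return [s / self._count for s in self._sum]


mean = RunningMean()
mean.update([1.0, 10.0])
mean.update([3.0, 30.0])
print(mean.value)  # [2.0, 20.0]
```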

Sum¶

class sonnet.Sum(name=None)
Calculates the elementwise sum of the given values.

__init__(name=None)
Initializes the current module with the given name.
Subclasses should call this constructor before creating other modules or variables such that those modules are named correctly.
 Parameters
name (Optional[str]) – An optional string name for the class. Must be a valid Python identifier. If name is not provided then the class name for the current instance is converted to lower_snake_case and used instead.

property value
See base class.
 Return type
Tensor

Nets¶
Common network architectures implemented as Sonnet modules.
MLP¶

class sonnet.nets.MLP(output_sizes, w_init=None, b_init=None, with_bias=True, activation=<function relu>, dropout_rate=None, activate_final=False, name=None)
A multilayer perceptron module.

__init__(output_sizes, w_init=None, b_init=None, with_bias=True, activation=<function relu>, dropout_rate=None, activate_final=False, name=None)
Constructs an MLP.
 Parameters
output_sizes (Iterable[int]) – Sequence of layer sizes.
w_init (Optional[Initializer]) – Initializer for Linear weights.
b_init (Optional[Initializer]) – Initializer for Linear bias. Must be None if with_bias is False.
with_bias (bool) – Whether or not to apply a bias in each layer.
activation (Callable[[Tensor], Tensor]) – Activation function to apply between linear layers. Defaults to ReLU.
dropout_rate – Dropout rate to apply, a rate of None (the default) or 0 means no dropout will be applied.
activate_final (bool) – Whether or not to activate the final layer of the MLP.
name (Optional[str]) – Optional name for this module.
 Raises
ValueError – If with_bias is False and b_init is not None.

__call__(inputs, is_training=None)
Connects the module to some inputs.
 Parameters
inputs (Tensor) – A Tensor of shape [batch_size, input_size].
is_training – A bool indicating if we are currently training. Defaults to None. Required if using dropout.
 Returns
output: The output of the model of size [batch_size, output_size].

reverse(activate_final=None, name=None)
Returns a new MLP which is the layerwise reverse of this MLP.
NOTE: Since computing the reverse of an MLP requires knowing the input size of each linear layer, this method will fail if the module has not been called at least once. See snt.Deferred as a possible solution to this problem.
The contract of reverse is that the reversed module will accept the output of the parent module as input and produce an output whose size is the input size of the parent.
>>> mlp = snt.nets.MLP([1, 2, 3])
>>> y = mlp(tf.ones([1, 2]))
>>> rev = mlp.reverse()
>>> rev(y)
<tf.Tensor: shape=(1, 2), ...>
 Parameters
activate_final (Optional[bool]) – Whether the final layer of the MLP should be activated.
name (Optional[str]) – Optional name for the new module. The default name will be the name of the current module prefixed with "reversed_".
 Return type
MLP
 Returns
An MLP instance which is the reverse of the current instance. Note these instances do not share weights and, apart from being symmetric to each other, are not coupled in any way.
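The shape contract of reverse can be checked with plain arithmetic. The sketch below is only an illustration of that contract; `reversed_output_sizes` is a hypothetical helper, not a Sonnet function, and it shows why the input size (known only after the first call) is needed.

```python
from typing import List, Sequence


def reversed_output_sizes(input_size: int, output_sizes: Sequence[int]) -> List[int]:
    """Output sizes of an MLP that mirrors the given one layer by layer."""
    # Layer i maps in_sizes[i] -> output_sizes[i]; its mirror maps the other way.
    in_sizes = [input_size] + list(output_sizes[:-1])
    return in_sizes[::-1]


# Same sizes as the doctest: MLP([1, 2, 3]) called on inputs of size 2.
print(reversed_output_sizes(2, [1, 2, 3]))  # [2, 1, 2]
```

The last reversed size equals the parent's input size, which matches the shape (1, 2) returned by rev(y) in the doctest.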

Cifar10ConvNet¶

class sonnet.nets.Cifar10ConvNet(num_classes=10, w_init=None, b_init=None, data_format='NHWC', output_channels=(64, 64, 128, 128, 128, 256, 256, 256, 512, 512, 512), strides=(1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1), name=None)
Convolutional network designed for Cifar10.
Approximately equivalent to “VGG, minus max pooling, plus BatchNorm”. For best results the input data should be scaled to be between -1 and 1 when using the standard initializers.
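One simple way to bring image data into the range [-1, 1] is a linear rescale. The sketch below assumes 8-bit pixel values in [0, 255]; the helper `scale_pixels` is hypothetical and not part of Sonnet.

```python
from typing import List, Sequence


def scale_pixels(pixels: Sequence[int]) -> List[float]:
    """Maps 8-bit pixel values in [0, 255] linearly onto [-1.0, 1.0]."""
    return [p / 127.5 - 1.0 for p in pixels]


print(scale_pixels([0, 255]))  # [-1.0, 1.0]
```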

__init__(num_classes=10, w_init=None, b_init=None, data_format='NHWC', output_channels=(64, 64, 128, 128, 128, 256, 256, 256, 512, 512, 512), strides=(1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1), name=None)
Initializes the current module with the given name.
Subclasses should call this constructor before creating other modules or variables such that those modules are named correctly.
 Parameters
name (Optional[str]) – An optional string name for the class. Must be a valid Python identifier. If name is not provided then the class name for the current instance is converted to lower_snake_case and used instead.

__call__(inputs, is_training, test_local_stats=True)
Connects the module to some inputs.
 Parameters
inputs (Tensor) – A Tensor of size [batch_size, input_height, input_width, input_channels], representing a batch of input images.
is_training (Union[bool, ndarray, Tensor, Variable]) – Boolean to indicate to snt.BatchNorm if we are currently training.
test_local_stats (bool) – Boolean to indicate to snt.BatchNorm if batch normalization should use local batch statistics at test time. By default True.
 Returns
A dictionary containing two items:
logits: The output logits of the network, this will be of size [batch_size, num_classes].
activations: A list of tf.Tensor, the feature activations of the module. The order of the activations is preserved in the output list. The activations in the output list are those computed after the activation function is applied, if one is applied at that layer.

ResNet¶

class sonnet.nets.ResNet(blocks_per_group_list, num_classes, bn_config=None, resnet_v2=False, channels_per_group_list=(256, 512, 1024, 2048), name=None)
ResNet model.

__init__(blocks_per_group_list, num_classes, bn_config=None, resnet_v2=False, channels_per_group_list=(256, 512, 1024, 2048), name=None)
Constructs a ResNet model.
 Parameters
blocks_per_group_list (Sequence[int]) – A sequence of length 4 that indicates the number of blocks created in each group.
num_classes (int) – The number of classes to classify the inputs into.
bn_config (Optional[Mapping[str, float]]) – A dictionary of two elements, decay_rate and eps to be passed on to the BatchNorm layers. By default the decay_rate is 0.9 and eps is 1e-5.
resnet_v2 (bool) – Whether to use the v1 or v2 ResNet implementation. Defaults to False.
channels_per_group_list (Sequence[int]) – A sequence of length 4 that indicates the number of channels used for each block in each group.
name (Optional[str]) – Name of the module.

ResNet50¶

class sonnet.nets.ResNet50(num_classes, bn_config=None, resnet_v2=False, name=None)
ResNet50 module.

__init__(num_classes, bn_config=None, resnet_v2=False, name=None)
Constructs a ResNet model.
 Parameters
num_classes (int) – The number of classes to classify the inputs into.
bn_config (Optional[Mapping[str, float]]) – A dictionary of two elements, decay_rate and eps to be passed on to the BatchNorm layers.
resnet_v2 (bool) – Whether to use the v1 or v2 ResNet implementation. Defaults to False.
name (Optional[str]) – Name of the module.
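The "50" in ResNet50 can be related to a blocks-per-group list with a little arithmetic. The sketch below assumes the standard ResNet-50 layout of (3, 4, 6, 3) bottleneck blocks with three convolutions each, plus one initial convolution and one final linear layer; those numbers come from the usual ResNet description rather than from this page, and the helper name is hypothetical.

```python
from typing import Sequence


def bottleneck_resnet_depth(blocks_per_group: Sequence[int]) -> int:
    """Counts weighted layers: 3 convolutions per block, plus the initial
    convolution and the final linear layer."""
    return 3 * sum(blocks_per_group) + 2


print(bottleneck_resnet_depth((3, 4, 6, 3)))  # 50
```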

VectorQuantizer¶

class sonnet.nets.VectorQuantizer(embedding_dim, num_embeddings, commitment_cost, dtype=tf.float32, name='vector_quantizer')
Sonnet module representing the VQ-VAE layer.
Implements the algorithm presented in ‘Neural Discrete Representation Learning’ by van den Oord et al. https://arxiv.org/abs/1711.00937
The input can be any tensor to be quantized. The last dimension is used as the space in which to quantize. All other dimensions are flattened and treated as different examples to quantize.
The output tensor will have the same shape as the input.
For example a tensor with shape [16, 32, 32, 64] will be reshaped into [16384, 64] and all 16384 vectors (each of 64 dimensions) will be quantized independently.
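The flatten-then-quantize behaviour can be sketched in plain Python. This is a minimal illustration that assumes nearest-neighbour lookup under squared Euclidean distance; the function `quantize` and the toy codebook are hypothetical and do not reproduce Sonnet's tensor implementation.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def quantize(
    vectors: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[List[float]]]:
    """Maps each vector to its nearest codebook entry (squared Euclidean distance)."""
    indices = []
    for v in vectors:
        distances = [sum((a - b) ** 2 for a, b in zip(v, e)) for e in codebook]
        indices.append(distances.index(min(distances)))
    quantized = [list(codebook[i]) for i in indices]
    return indices, quantized


# A [16, 32, 32, 64] input flattens to 16 * 32 * 32 vectors of size 64.
print(16 * 32 * 32)  # 16384

codebook = [[0.0, 0.0], [1.0, 1.0]]  # num_embeddings=2, embedding_dim=2
indices, quantized = quantize([[0.1, -0.2], [0.9, 0.8]], codebook)
print(indices)    # [0, 1]
print(quantized)  # [[0.0, 0.0], [1.0, 1.0]]
```

Each flattened vector is handled independently, so the result can be reshaped back to the input shape afterwards.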

embedding_dim
integer representing the dimensionality of the tensors in the quantized space. Inputs to the modules must be in this format as well.

num_embeddings
integer, the number of vectors in the quantized space.

commitment_cost
scalar which controls the weighting of the loss terms (see equation 4 in the paper; this variable is Beta).
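For orientation, the VQ-VAE training objective from the cited paper can be written as follows, where sg is the stop-gradient operator, z_e(x) is the encoder output, e is the selected embedding and β corresponds to commitment_cost. The equation numbering may differ between versions of the paper, and the loss returned by this module presumably covers only the last two terms, leaving the reconstruction term to the caller.

```latex
L = \log p\bigl(x \mid z_q(x)\bigr)
  + \bigl\lVert \mathrm{sg}[z_e(x)] - e \bigr\rVert_2^2
  + \beta \, \bigl\lVert z_e(x) - \mathrm{sg}[e] \bigr\rVert_2^2
```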

__init__(embedding_dim, num_embeddings, commitment_cost, dtype=tf.float32, name='vector_quantizer')
Initializes a VQ-VAE module.
 Parameters
embedding_dim (int) – dimensionality of the tensors in the quantized space. Inputs to the modules must be in this format as well.
num_embeddings (int) – number of vectors in the quantized space.
commitment_cost (Union[float, floating, ndarray, Tensor, Variable]) – scalar which controls the weighting of the loss terms (see equation 4 in the paper; this variable is Beta).
dtype (DType) – dtype for the embeddings variable, defaults to tf.float32.
name (str) – name of the module.

__call__(inputs, is_training)
Connects the module to some inputs.
 Parameters
inputs – Tensor, final dimension must be equal to embedding_dim. All other leading dimensions will be flattened and treated as a large batch.
is_training – boolean, whether this connection is to training data.
 Returns
dict containing the following keys and values:
quantize: Tensor containing the quantized version of the input.
loss: Tensor containing the loss to optimize.
perplexity: Tensor containing the perplexity of the encodings.
encodings: Tensor containing the discrete encodings, i.e. which element of the quantized space each input element was mapped to.
encoding_indices: Tensor containing the discrete encoding indices, i.e. which element of the quantized space each input element was mapped to.

VectorQuantizerEMA¶

class sonnet.nets.VectorQuantizerEMA(*args, **kwargs)
Sonnet module representing the VQ-VAE layer.
Implements a slightly modified version of the algorithm presented in ‘Neural Discrete Representation Learning’ by van den Oord et al. https://arxiv.org/abs/1711.00937
The difference between VectorQuantizerEMA and VectorQuantizer is that this module uses exponential moving averages to update the embedding vectors instead of an auxiliary loss. This has the advantage that the embedding updates are independent of the choice of optimizer (SGD, RMSProp, Adam, K-FAC, …) used for the encoder, decoder and other parts of the architecture. For most experiments the EMA version trains faster than the non-EMA version.
The input can be any tensor to be quantized. The last dimension is used as the space in which to quantize. All other dimensions are flattened and treated as different examples to quantize.
The output tensor will have the same shape as the input.
For example a tensor with shape [16, 32, 32, 64] will be reshaped into [16384, 64] and all 16384 vectors (each of 64 dimensions) will be quantized independently.

embedding_dim
integer representing the dimensionality of the tensors in the quantized space. Inputs to the modules must be in this format as well.

num_embeddings
integer, the number of vectors in the quantized space.

commitment_cost
scalar which controls the weighting of the loss terms (see equation 4 in the paper).

decay
float, decay for the moving averages.

epsilon
small float constant to avoid numerical instability.

__init__(embedding_dim, num_embeddings, commitment_cost, decay, epsilon=1e-05, dtype=tf.float32, name='vector_quantizer_ema')
Initializes a VQ-VAE EMA module.
 Parameters
embedding_dim – integer representing the dimensionality of the tensors in the quantized space. Inputs to the modules must be in this format as well.
num_embeddings – integer, the number of vectors in the quantized space.
commitment_cost – scalar which controls the weighting of the loss terms (see equation 4 in the paper; this variable is Beta).
decay – float between 0 and 1, controls the speed of the Exponential Moving Averages.
epsilon – small constant to aid numerical stability, default 1e-5.
dtype – dtype for the embeddings variable, defaults to tf.float32.
name – name of the module.
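To make the roles of decay and epsilon more concrete, here is a minimal pure-Python sketch of one exponential-moving-average codebook update. It follows the general EMA formulation described in the paper; the use of epsilon for Laplace smoothing of the cluster sizes, the function name `ema_update` and all argument names are assumptions for illustration and may not match Sonnet's implementation in detail.

```python
from typing import List, Sequence, Tuple


def ema_update(
    cluster_size: Sequence[float],
    embed_sum: Sequence[Sequence[float]],
    counts: Sequence[float],
    sums: Sequence[Sequence[float]],
    decay: float,
    epsilon: float = 1e-5,
) -> Tuple[List[float], List[List[float]], List[List[float]]]:
    """One EMA step for a codebook with K entries.

    counts[i] is the number of inputs assigned to entry i in this batch and
    sums[i] is the sum of those inputs. Assumes at least one assignment has
    been seen, so the total cluster size is non-zero.
    """
    k = len(cluster_size)
    # Moving averages of assignment counts and of the assigned-vector sums.
    new_size = [decay * n + (1 - decay) * c for n, c in zip(cluster_size, counts)]
    new_sum = [
        [decay * m + (1 - decay) * s for m, s in zip(m_row, s_row)]
        for m_row, s_row in zip(embed_sum, sums)
    ]
    total = sum(new_size)
    # Laplace smoothing keeps rarely used entries away from a zero denominator.
    smoothed = [(n + epsilon) / (total + k * epsilon) * total for n in new_size]
    embeddings = [[m / n for m in row] for row, n in zip(new_sum, smoothed)]
    return new_size, new_sum, embeddings


size, total_sum, embeddings = ema_update(
    cluster_size=[1.0, 1.0],
    embed_sum=[[1.0], [2.0]],
    counts=[1.0, 1.0],
    sums=[[1.0], [2.0]],
    decay=0.5,
)
print(size, embeddings)
```

A decay close to 1 makes the embeddings change slowly, while a smaller decay lets each batch move them further.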

__call__(inputs, is_training)
Connects the module to some inputs.
 Parameters
inputs – Tensor, final dimension must be equal to embedding_dim. All other leading dimensions will be flattened and treated as a large batch.
is_training – boolean, whether this connection is to training data. When this is set to False, the internal moving average statistics will not be updated.
 Returns
dict containing the following keys and values:
quantize: Tensor containing the quantized version of the input.
loss: Tensor containing the loss to optimize.
perplexity: Tensor containing the perplexity of the encodings.
encodings: Tensor containing the discrete encodings, i.e. which element of the quantized space each input element was mapped to.
encoding_indices: Tensor containing the discrete encoding indices, i.e. which element of the quantized space each input element was mapped to.

Mixed Precision¶
Sonnet mixed precision built for TensorFlow 2.
modes¶

sonnet.mixed_precision.modes(valid_types)
Decorate a function to cast inputs/outputs to different precision.
>>> support_modes = snt.mixed_precision.modes([tf.float32, tf.float16])
>>> snt.Linear.__call__ = support_modes(snt.Linear.__call__)
>>> mod = snt.Linear(10)
>>> snt.mixed_precision.enable(tf.float16)
>>> y = mod(tf.ones([1, 1]))  # First call will be done in F32.
>>> y = mod(tf.ones([1, 1]))  # MatMul/Add will be done in F16.
>>> snt.mixed_precision.disable()
 Parameters
valid_types – Collection of types that the function being decorated is legal to run in.
 Returns
A decorator that will cast the inputs and outputs of the decorated function according to the global mixed precision policy and the function's eligibility for mixed precision.
enable¶
scope¶

sonnet.mixed_precision.scope(dtype)
Temporarily set the global mixed precision type to dtype.
The global type is reset to its original value when the context is exited:
snt.mixed_precision.enable(tf.float32)
support_modes = snt.mixed_precision.modes([tf.float32, tf.float16])
snt.Linear.__call__ = support_modes(snt.Linear.__call__)
mod = snt.Linear(10)
with snt.mixed_precision.scope(tf.float16):
    y = mod(tf.ones([1, 1]))  # First call will be done in F32.
    y = mod(tf.ones([1, 1]))  # MatMul/Add will be done in F16.
y = mod(tf.ones([1, 1]))  # Outside the scope will be done in F32.
 Parameters
dtype (DType) – type to set the mixed precision mode to.
 Yields
Nothing. This is required for contextlib.contextmanager.
References¶
 1
Ashish Agarwal, David Berthelot, Tom Hennigan, Alex Passos, and Malcolm Reynolds. Stateful containers with tf.Module. TensorFlow Community RFCs, Google / DeepMind, 2019. URL: https://github.com/tensorflow/community/pull/56.
 2
Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014. URL: https://arxiv.org/abs/1409.2329.
 3
Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In International Conference on Machine Learning, 2342–2350. 2015.
 4
Haşim Sak, Andrew Senior, and Françoise Beaufays. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. arXiv preprint arXiv:1402.1128, 2014. URL: https://arxiv.org/abs/1402.1128.
 5
Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems, 1019–1027. 2016.
 6
SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, 802–810. 2015.
 7
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014. URL: https://arxiv.org/abs/1412.3555.
 8
Diederik P. Kingma and Jimmy Ba. Adam: a method for stochastic optimization. 2014. arXiv:1412.6980.
 9
Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013. URL: https://arxiv.org/abs/1312.6120.
 10
Peter Buchlovsky, David Budden, Dominik Grewe, Chris Jones, John Aslanides, Frederic Besse, Andy Brock, Aidan Clark, Sergio Gómez Colmenarejo, Aedan Pope, and others. TF-Replicator: Distributed machine learning for researchers. arXiv preprint arXiv:1902.00465, 2019. URL: https://arxiv.org/abs/1902.00465.
 11
Peter Buchlovsky, Dominik Grewe, Priya Gupta, Tom Hennigan, Jonathan Hseu, Chris Jones, and Josh Levenberg. Distribution Strategy - Revised API. TensorFlow Community RFCs, Google / DeepMind, 2018. URL: https://github.com/tensorflow/community/pull/25.