# Approximation with Tensor Networks. Part I: Approximation Spaces

###### Abstract.

We study the approximation of functions by tensor networks (TNs). We show that Lebesgue $L^p$-spaces in one dimension can be identified with tensor product spaces of arbitrary order through tensorization. We use this tensor product structure to define subsets of $L^p$ of rank-structured functions of finite representation complexity. These subsets are then used to define different approximation classes of tensor networks, associated with different measures of complexity. These approximation classes are shown to be quasi-normed linear spaces. We study some elementary properties and relationships of said spaces.

In part II of this work,
we will show that classical smoothness (Besov) spaces
are continuously embedded into these approximation classes.
We will also show that functions in these approximation
classes do not possess any Besov smoothness, unless
one restricts the *depth* of the tensor networks.

The results of this work are both an analysis of the approximation spaces of TNs and a study of the expressivity of a particular type of neural networks (NN) – namely feed-forward sum-product networks with sparse architecture. The input variables of this network result from the tensorization step, interpreted as a particular featuring step which can also be implemented with a neural network with a specific architecture. We point out interesting parallels to recent results on the expressivity of rectified linear unit (ReLU) networks – currently one of the most popular types of NN.

###### Key words and phrases:

Tensor Networks, Tensor Trains, Matrix Product States, Neural Networks, Approximation Spaces, Besov Spaces, direct (Jackson) and inverse (Bernstein) inequalities

###### 2010 Mathematics Subject Classification:

41A65, 41A15, 41A10 (primary); 68T05, 42C40, 65D99 (secondary)

## 1. Introduction

### 1.1. Approximation of Functions

We present a new perspective and therewith a new tool for approximating one-dimensional real-valued functions on bounded intervals. We focus on the one-dimensional setting to keep the presentation comprehensible, but we intend to address the multi-dimensional setting in a forthcoming part III.

The approximation of general functions by simpler “building blocks” has been a central topic in mathematics for centuries with many arising methods: algebraic polynomials, trigonometric polynomials, splines, wavelets or rational functions are among some of the by now established tools. Recently, more sophisticated tools such as tensor networks (TNs) or neural networks (NNs) have proven to be powerful techniques. Approximation methods find application in various areas: signal processing, data compression, pattern recognition, statistical learning, differential equations, uncertainty quantification, and so on. See [10, 6, 32, 33, 8, 5, 36, 39, 21, 22, 35] for examples.

In the 20th century a deep mathematical theory of approximation has been established. It is by now well understood that approximability properties of a function by more standard tools – such as polynomials or splines – are closely related to its smoothness. Moreover, functions that can be approximated with a certain rate can be grouped to form quasi-normed linear spaces. Varying the approximation rate then generates an entire scale of spaces that turn out to be so-called interpolation spaces. See [14, 13] for more details.

In this work, we address the classical question of one-dimensional function approximation but with a new set of tools relying on tensorization of functions and the use of rank-structured tensor formats (or tensor networks). We analyze the resulting approximation classes. We will show in part II [2] that many known classical spaces of smoothness are embedded in these newly defined approximation classes. On the other hand, we will also show that these classes are, in a sense, much larger than classical smoothness spaces.

### 1.2. Neural Networks

Our work was partly motivated by current developments in the field of deep learning. Originally developed in [34], artificial neural networks (NNs) were inspired by models in theoretical neurophysiology of the nervous system that constitutes a brain. The intention behind NNs was to construct a mathematical (and ultimately digital) analogue of a biological neural network. The increase in computational power in recent decades has led to many successful applications of NNs in various fields [43]. This in turn has sparked interest in a better mathematical foundation for NNs. One flavor of this research is the analysis of the approximation power of feed-forward NNs. In particular, relevant for this work is the recent paper [23], where the authors analyzed the approximation spaces of deep ReLU and rectified power unit (RePU) networks. In Figure 1, we sketch an example of a feed-forward NN.

Mathematically, a feed-forward NN can be represented by a tuple

$$\Phi = \big( (A_1, \sigma_1), \ldots, (A_L, \sigma_L) \big),$$

where $A_l : \mathbb{R}^{n_{l-1}} \to \mathbb{R}^{n_l}$ are affine maps; $n_l$ is the number of neurons in layer $l$, with $n_0$ being the number of inputs and $n_L$ the number of outputs; the functions $\sigma_l : \mathbb{R}^{n_l} \to \mathbb{R}^{n_l}$ are typically non-linear and represent the operations performed on data inside the neurons. The functions $\sigma_l$ are often implemented via a component-wise application of a single non-linear function $\sigma : \mathbb{R} \to \mathbb{R}$, referred to as the *activation* function. The realization of the NN is the function

$$R(\Phi) := \sigma_L \circ A_L \circ \cdots \circ \sigma_1 \circ A_1.$$
Before one can proceed with *training* (i.e., estimating)
a NN, one usually specifies the *architecture* of
the NN: this includes the number of layers $L$, the numbers of neurons $n_l$,
the connections between the neurons and
the non-linearities $\sigma_l$.
Only after this, one proceeds with training,
which entails determining the affine maps
$A_l$. This is typically done by minimizing some *loss*
or distance functional $\ell(f, R(\Phi))$,
where $f$ is the target function
that we want to approximate, in some sense.
If we set $n_0 = n_L = 1$ and, e.g.,
$\ell(f, g) = \|f - g\|_p$, then we are in
the classical setting of one-dimensional approximation
in the $L^p$-norm (though one
would typically not train a network by directly minimizing the $L^p$-norm).
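The tuple formalism above can be sketched in a few lines of code. The following is an illustrative example only (the function name `realize` and the tiny ReLU network are our own, not constructions from this paper): a network is a list of affine maps applied in alternation with an activation.

```python
import numpy as np

def realize(layers, sigma):
    """Realization of the NN: x -> sigma(A_L(... sigma(A_1(x)) ...))."""
    def f(x):
        for (W, b) in layers:
            x = sigma(W @ x + b)  # affine map A_l followed by the activation
        return x
    return f

relu = lambda x: np.maximum(x, 0.0)

# A tiny network with n_0 = 1 input, one hidden layer of 2 neurons, n_L = 1 output.
layers = [
    (np.array([[1.0], [-1.0]]), np.zeros(2)),  # A_1 : R^1 -> R^2
    (np.array([[1.0, 1.0]]), np.zeros(1)),     # A_2 : R^2 -> R^1
]
f = realize(layers, relu)
# The hidden layer computes relu(x) and relu(-x), so the realization is |x|.
assert abs(f(np.array([-2.0]))[0] - 2.0) < 1e-12
```

The hidden layer stores the positive and negative parts of the input separately, a standard trick for representing the absolute value with ReLU networks.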

### 1.3. Tensor Networks

Tensor networks have been studied in parallel
in different fields, sometimes under different names: e.g.,
hierarchical tensor formats in numerical analysis, sum-product networks in machine learning, belief networks in Bayesian inference.
Tensor networks are commonly
applied and studied in condensed matter physics, where understanding phenomena in quantum many-body systems
has proven to be a challenging problem, to a large extent
due to the sheer amount of dependencies that cannot be simulated
even on the most powerful computers (see [38] for a non-technical introduction).
In all these fields, a common challenging problem is the approximation
of functions of a very large number
of variables.
This led to the development of tools tailored to
so-called *high-dimensional* problems.

The mathematical principle behind tensor networks is to use tensor
products. For approximating a $d$-variate function $f(x_1, \ldots, x_d)$,
there are several types of tensor formats.
The simplest is the so-called *$r$-term* or *CP* format, where $f$ is approximated as

$$f(x_1, \ldots, x_d) \approx \sum_{k=1}^{r} v_1^k(x_1) \cdots v_d^k(x_d). \tag{1.1}$$

If each factor $v_\nu^k$ is encoded with $n$ parameters, the total number of parameters is thus $rnd$, which is linear in the number of variables.
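The parameter count in the $r$-term format is easy to make concrete. The following is a minimal sketch on a tensor-product grid (the names `cp_eval` and `factors` are ours, for illustration):

```python
import numpy as np

def cp_eval(factors, idx):
    """Evaluate a CP representation (1.1) at a multi-index.

    factors[nu] is an (n, r) array holding the r univariate factors of
    variable nu sampled on n grid points."""
    r = factors[0].shape[1]
    out = 0.0
    for k in range(r):
        term = 1.0
        for nu, V in enumerate(factors):
            term *= V[idx[nu], k]  # v_nu^k evaluated at grid point idx[nu]
        out += term
    return out

d, n, r = 4, 10, 3
rng = np.random.default_rng(0)
factors = [rng.standard_normal((n, r)) for _ in range(d)]

# Total number of parameters: n*r per variable, i.e. r*n*d -- linear in d,
# in contrast with the n**d entries of the full tensor of evaluations.
num_params = sum(V.size for V in factors)
assert num_params == r * n * d

# Consistency check against a dense evaluation of the same representation.
dense = np.einsum('ak,bk,ck,dk->abcd', *factors)
assert np.isclose(cp_eval(factors, (1, 2, 3, 4)), dense[1, 2, 3, 4])
```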
The approximation format (1.1) is successful in many applications (chemometrics, inverse problems in signal processing…)
but due to a few unfavorable properties (see
[24, Chapter 9]), different types of tensor formats
are frequently used in numerical approximation.
In particular, with the so-called *tensor train* (TT) format or *matrix product state* (MPS), the function $f$ is approximated as

$$f(x_1, \ldots, x_d) \approx \sum_{k_1=1}^{r_1} \cdots \sum_{k_{d-1}=1}^{r_{d-1}} v_1^{k_1}(x_1)\, v_2^{k_1, k_2}(x_2) \cdots v_d^{k_{d-1}}(x_d). \tag{1.2}$$

The numbers $r_1, \ldots, r_{d-1}$ are referred to as *multi-linear ranks* or *hierarchical ranks*. The rank $r_\nu$ is related to
the classical notion of rank for bi-variate functions, by identifying a $d$-variate function as a function of two complementary groups
of variables $(x_1, \ldots, x_\nu)$ and $(x_{\nu+1}, \ldots, x_d)$. It corresponds to the so-called $\beta$-rank $\operatorname{rank}_\beta(f)$, with $\beta = \{1, \ldots, \nu\}$.
The format in (1.2) is a particular case of *tree-based tensor formats*, or *tree tensor networks* [24, 17],
the TT format being associated with a linear dimension partition tree.
Numerically, such formats have favorable stability properties
with robust algorithms (see [19, 40, 37, 21]). Moreover, the corresponding tensor networks and decompositions
have a physical
interpretation in the context of entangled many-body systems,
see [38, 39].
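A code sketch of the TT/MPS format on a grid may help: each variable carries a core, and evaluating one entry of the tensor amounts to a product of matrices selected by the indices. The encoding and names below are ours; storage is linear in the number of variables for bounded ranks.

```python
import numpy as np

def tt_eval(cores, idx):
    """Evaluate a TT representation (1.2) at a multi-index by matrix products."""
    v = np.ones((1,))
    for G, i in zip(cores, idx):
        v = v @ G[:, i, :]  # contract over the rank index k_nu
    return v.item()

d, n = 5, 4
ranks = [1, 2, 3, 3, 2, 1]  # r_0, ..., r_d with boundary ranks r_0 = r_d = 1
rng = np.random.default_rng(1)
cores = [rng.standard_normal((ranks[nu], n, ranks[nu + 1])) for nu in range(d)]

# Storage grows like sum_nu r_{nu-1} * n * r_nu -- linear in d for bounded
# ranks, versus n**d entries for the full tensor.
num_params = sum(G.size for G in cores)
assert num_params == sum(ranks[nu] * n * ranks[nu + 1] for nu in range(d))
assert isinstance(tt_eval(cores, (0, 1, 2, 3, 0)), float)
```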

For more general tensor networks, we refer to Figure 2 for graphical representations.

Figure 2. Graphical representations of tensor networks: nodes represent functions, an edge between two nodes represents a pairing (*contraction*) between two functions, such as the summation over $k_\nu$ in (1.2), and the free edges represent the input variables in (1.2).

The specific choice of a tensor network is sometimes suggested by the problem at hand: e.g., in quantum physics by the entanglement/interaction structure of the quantum system that is to be modeled – see, e.g., [26, 1, 44, 3].

At first glance, it seems that tensor networks are a tool
suited only for approximating high-dimensional functions.
However, such formats can be applied in any multi-variate setting,
and such a multi-variate setting can be enforced even in one dimension by a
“coarse-graining” of an interval, allowing us to identify a one-dimensional function with a multi-variate function (or tensor). This identification is the *tensorization* of functions, which is at the core of the approximation tools considered in this work.
It was originally applied for matrices in
[41] and later coined as
*quantized tensor format* when tensorization is combined with the use of a tensor format.

In high-dimensional approximation, keeping the ranks small relies on the correct choice of the tensor network that “fits” the interaction structure, as hinted above. For the approximation of tensorized functions, a different type of structure is required. In [20, 41], it was shown that, if $f$ is a vector of evaluations of a polynomial on a uniform grid, then choosing the TT format yields hierarchical ranks that are bounded by the degree of the polynomial. Similar statements were shown for trigonometric polynomials and exponential functions. In [29], this fact was utilized to show that a finite element approximation of two-dimensional functions with singularities, where the coefficient vector is stored in a quantized tensor format, automatically recovers an exponential rate of convergence, analogous to that of $hp$-approximation.
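The rank bound for tensorized polynomials can be observed numerically. The sketch below (our own variable names) uses the standard form of the bound, namely that a degree-$m$ polynomial sampled on a $b$-adic grid has unfolding ranks at most $m + 1$:

```python
import numpy as np

b, L, m = 2, 10, 2                 # base, level, polynomial degree
x = np.arange(b**L) / b**L         # uniform grid points j / b^L
f = x**m + 0.5 * x + 0.25          # samples of a degree-m polynomial (m = 2)

# Tensorization: reshape the vector of b^L evaluations into an order-L
# tensor indexed by the base-b digits (i_1, ..., i_L) of the grid index.
T = f.reshape((b,) * L)

# Hierarchical (TT) ranks = ranks of the sequential unfoldings.
ranks = [np.linalg.matrix_rank(T.reshape(b**nu, b**(L - nu)))
         for nu in range(1, L)]
assert max(ranks) <= m + 1         # bounded by degree + 1, independent of L
```

The reason is the split $x = a + 2^{-\nu} y$: a degree-$2$ polynomial in $a + 2^{-\nu} y$ is a combination of only three functions of $y$, so every unfolding has rank at most $3$.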

This work can be seen as a consolidation and a deeper analysis of the approximation of one-dimensional functions using quantized tensor formats. We first show that the Lebesgue spaces $L^p$ of $p$-integrable functions are isometric to tensor product spaces of any order and analyze some basic properties of this identification. We then define and analyze the approximation classes of functions that can be approximated by rank-structured functions in the TT format with a certain rate. In Part II [2], we will show direct and (lack of) inverse embeddings.

### 1.4. Tensor vs. Neural Networks

Recently multiple connections between TNs and NNs
have been discovered.
In [7], the author exhibits an analogy between
the Renormalization Group (RG) – the fundamental physical concept
behind TNs – and deep learning,
with the scale in RG being akin to depth in NNs.
In [31], it was observed that tree tensor networks can in fact
be viewed as a specific type of feed-forward NNs
with multi-linear functions $\sigma_l$, namely sum-product NNs (or arithmetic circuits) [42].
This connection offers intriguing perspectives
and it can be exploited both ways:
we can use our fundamental
understanding of quantum entanglement
to measure and design NNs, see [31, 12]
(that is not to say
that we have understood quantum entanglement,
but an argument can be made that our understanding
of entanglement, and thus of tensor networks, offers a different perspective on neural networks).
Or we can use NNs
to *learn* entanglement and augment
tensor networks to better represent
highly entangled systems, see [11, 30].
Tensor networks also
offer a large choice of well-studied and robust
numerical algorithms.

In this spirit, our work in part II [2] can be seen as a result
on the approximation power of a particular type
of NN,
where the TT format corresponds to a
feed-forward sum-product NN with an architecture akin to that of a recurrent neural network.
When compared to the results
of [23] on approximation classes
of
RePU networks, we will observe in Part II [2] that
both tools achieve optimal approximation order for Sobolev spaces.
We also show that TNs (using the TT format) achieve optimal approximation
order for Besov spaces on the embedding line (corresponding to
non-linear approximation). These statements hold for
a tensor network of fixed polynomial degree and *any* smoothness
order of the Sobolev/Besov spaces.

On the other hand,
TNs are much easier to handle – both analytically and
numerically. Moreover,
it seems that the much simpler
architecture of TNs does not sacrifice anything in terms
of representing functions of classical smoothness when compared
to RePU networks.
In particular, both tools are able to recover optimal
or close to optimal approximation rates – without
being “adapted” to the particular space in question.
In other words, all smoothness spaces are included in the
same approximation class.
This is to be contrasted with more standard approximation
tools, such as splines or wavelets, where the approximation class
(and thus the approximation method) has to be adapted to
the smoothness notion in question.
Moreover, both tools will frequently perform
better than predicted on specific instances of functions
that possess structural features that are not
captured by classical smoothness
theory (a fundamental theory of these structures
remains an open question for both tools).

Of course, this is simply to say that both tools do a good job when it comes to classical notions of smoothness. We still expect that NN approximation classes are very different from those of TNs, in an appropriate sense. We also show in Part II [2] that TN approximation classes (using the TT format) are not embedded into any Besov space – as was shown in [23] for RePU networks.

### 1.5. Main Results

First, we show that any $L^p$-function $f$ defined on the interval $[0,1)$ can be identified with a tensor. For a given $b \in \mathbb{N}$, $b \ge 2$ (the base) and $L \in \mathbb{N}$ (the level), we first note that any $x \in [0,1)$ can be uniquely decomposed as

$$x = \sum_{k=1}^{L} i_k b^{-k} + b^{-L} \bar{x},$$

where $(i_1, \ldots, i_L) \in \{0, \ldots, b-1\}^L$ is the representation of $\lfloor b^L x \rfloor$ in base $b$ and $\bar{x} := b^L x - \lfloor b^L x \rfloor \in [0,1)$. This allows us to identify $f$ with a tensor (or multivariate function)

$$T_{b,L} f (i_1, \ldots, i_L, \bar{x}) := f\Big( \sum_{k=1}^{L} i_k b^{-k} + b^{-L} \bar{x} \Big),$$

and to define different notions of ranks for a univariate function. A function can be tensorized at different levels $L$. We analyze the relation between tensorization maps at different levels, and the relation between the ranks of the corresponding tensors of different orders. When looking at $T_{b,L}$ as a map on $L^p$, a main result is given by Theorem 2.15 and Lemma A.1.
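The decomposition above is straightforward to implement; a minimal sketch with our own function names:

```python
def decompose(x, b, L):
    """Split x in [0, 1) into L base-b digits and a remainder xbar in [0, 1)."""
    j = int(b**L * x)                   # index of the subinterval containing x
    digits = []
    for _ in range(L):                  # digits of j in base b
        j, d = divmod(j, b)
        digits.append(d)
    digits.reverse()                    # most significant digit first
    xbar = b**L * x - int(b**L * x)     # rescaled position inside the subinterval
    return digits, xbar

def recompose(digits, xbar, b):
    """Inverse: digits and remainder -> the point x."""
    L = len(digits)
    return sum(i * b**-(k + 1) for k, i in enumerate(digits)) + b**-L * xbar

digits, xbar = decompose(0.8125, b=2, L=3)   # 0.8125 = (0.1101)_2
assert digits == [1, 1, 0] and abs(xbar - 0.5) < 1e-12
assert abs(recompose(digits, xbar, b=2) - 0.8125) < 1e-12
```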

###### Main Result 1.1.

For any $p \in (0, \infty]$, $b \in \mathbb{N}$ ($b \ge 2$) and $L \in \mathbb{N}$, the map $T_{b,L}$ is a linear isometry from $L^p([0,1))$ to the algebraic tensor space $\mathbb{R}^b \otimes \cdots \otimes \mathbb{R}^b \otimes L^p([0,1))$, where the latter is equipped with a reasonable crossnorm.

For later use in approximation, we introduce the tensor subspace

$$\mathbf{V}_{b,L,S} := \mathbb{R}^b \otimes \cdots \otimes \mathbb{R}^b \otimes S,$$

where $S \subset L^p([0,1))$ is some finite-dimensional subspace. Then, we can identify $\mathbf{V}_{b,L,S}$ with a finite-dimensional subspace of $L^p([0,1))$ as

$$V_{b,L,S} := T_{b,L}^{-1}\big( \mathbf{V}_{b,L,S} \big).$$

We introduce the crucial assumption that $S$ is closed under $b$-adic dilation, i.e., for any $f \in S$ and any $k \in \{0, \ldots, b-1\}$, $f(b^{-1}(\cdot + k)) \in S$. Under this assumption, which is reminiscent of multi-resolution analysis (MRA), we obtain bounds for multilinear ranks that are related to the dimension of $S$. Also, under this assumption on $S$, we obtain the main results given by Propositions 2.24 and 2.23 and Theorem 2.25.

###### Main Result 1.2.

The spaces $V_{b,L,S}$ form a hierarchy of $L^p$-subspaces, i.e.

$$V_{b,0,S} \subset V_{b,1,S} \subset \cdots \subset V_{b,L,S} \subset \cdots \subset L^p([0,1)),$$

and $V_{b,S} := \bigcup_{L \in \mathbb{N}} V_{b,L,S}$ is a linear space. If we further assume that $S$ contains the constant function one, $V_{b,S}$ is dense in $L^p$ for $0 < p < \infty$.

For the approximation of multivariate functions (or tensors), we use the set of tensors in the tensor train (TT) format with ranks bounded by $r = (r_1, \ldots, r_L)$, cf. Section 1.3 and Figure 1(b). Given a basis $\{\varphi_k\}_{k=1}^{\dim S}$ of $S$, a tensor in this format can be written as

$$\boldsymbol{f}(i_1, \ldots, i_L, \bar{x}) = \sum_{k_1=1}^{r_1} \cdots \sum_{k_L=1}^{r_L} v_1(i_1, k_1)\, v_2(k_1, i_2, k_2) \cdots v_L(k_{L-1}, i_L, k_L) \sum_{k=1}^{\dim S} v_{L+1}(k_L, k)\, \varphi_k(\bar{x}),$$

where the parameters $(v_1, \ldots, v_{L+1})$ form a tensor network (a collection of low-order tensors) with

$$v_1 \in \mathbb{R}^{b \times r_1}, \qquad v_\nu \in \mathbb{R}^{r_{\nu-1} \times b \times r_\nu} \ (2 \le \nu \le L), \qquad v_{L+1} \in \mathbb{R}^{r_L \times \dim S}.$$
With this we define

$$\Phi_{b,L,S,r} := \big\{ R_{b,L,S}(v_1, \ldots, v_{L+1}) : v_1 \in \mathbb{R}^{b \times r_1}, \ldots, v_{L+1} \in \mathbb{R}^{r_L \times \dim S} \big\},$$

where $R_{b,L,S}$ is the map which associates to a tensor network $(v_1, \ldots, v_{L+1})$ the function $f = T_{b,L}^{-1} \boldsymbol{f}$ with $\boldsymbol{f}$ defined as above. Then our approximation tool for univariate functions is defined as

$$\Phi_n := \{ f : \operatorname{compl}(f) \le n \},$$

where $\operatorname{compl}(f)$ is some measure of complexity of a function $f$, defined as

$$\operatorname{compl}(f) := \inf \big\{ \operatorname{compl}(v_1, \ldots, v_{L+1}) : R_{b,L,S}(v_1, \ldots, v_{L+1}) = f \big\},$$

where the infimum is taken over all tensor networks whose realization is the function $f$. We introduce three different measures of complexity

$$\operatorname{compl}_N(v) := \sum_{\nu=1}^{L} r_\nu, \qquad \operatorname{compl}_C(v) := b r_1 + \sum_{\nu=2}^{L} r_{\nu-1} b\, r_\nu + r_L \dim S, \qquad \operatorname{compl}_S(v) := \sum_{\nu=1}^{L+1} \| v_\nu \|_0,$$

where $\|v_\nu\|_0$ is the number of non-zero entries in the tensor $v_\nu$. Consequently, this defines three types of subsets

$$\Phi_n^N, \qquad \Phi_n^C, \qquad \Phi_n^S.$$

Complexity measures $\operatorname{compl}_N$ and $\operatorname{compl}_C$ are related to the TT-ranks $r_\nu(\boldsymbol{f})$ of the tensor $\boldsymbol{f}$, where $r_\nu(\boldsymbol{f})$ is the minimal $r_\nu$ such that $\boldsymbol{f}$ admits a representation as above. The function $\operatorname{compl}_C$ is a natural measure of complexity which corresponds to the dimension of the parameter space. The function $\operatorname{compl}_S$ is also a natural measure of complexity which counts the number of non-zero parameters. When interpreting a tensor network as a sum-product neural network, $\operatorname{compl}_N$ corresponds to the number of neurons, $\operatorname{compl}_C$ to the number of weights, and $\operatorname{compl}_S$ to the number of non-zero weights (or connections).
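The three measures can be made concrete for a tensor network stored as TT cores. This is a sketch under our own encoding and names, counting rank sizes ("neurons"), total parameters ("weights"), and non-zero parameters ("connections") as described above:

```python
import numpy as np

def complexity_ranks(cores):
    """Sum of the representation ranks (number of 'neurons')."""
    return sum(G.shape[2] for G in cores[:-1])

def complexity_params(cores):
    """Dimension of the parameter space (number of 'weights')."""
    return sum(G.size for G in cores)

def complexity_sparse(cores):
    """Number of non-zero parameters (non-zero 'weights' / connections)."""
    return sum(int(np.count_nonzero(G)) for G in cores)

rng = np.random.default_rng(2)
ranks = [1, 2, 2, 1]                           # boundary ranks equal to 1
cores = [rng.standard_normal((ranks[k], 2, ranks[k + 1])) for k in range(3)]
cores[1][0, 0, 0] = 0.0                        # introduce one sparse (zero) entry

assert complexity_ranks(cores) == 2 + 2
assert complexity_params(cores) == 1*2*2 + 2*2*2 + 2*2*1
assert complexity_sparse(cores) == complexity_params(cores) - 1
```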

We use $\Phi_n$ and the corresponding best approximation error

$$E_n(f) := \inf_{\varphi \in \Phi_n} \| f - \varphi \|_p$$

for functions $f \in L^p$ to define approximation classes

$$A_q^\alpha := A_q^\alpha(L^p) := \big\{ f \in L^p : \| f \|_{A_q^\alpha} < \infty \big\}$$

for $\alpha > 0$ and $0 < q \le \infty$, where

$$\| f \|_{A_q^\alpha} := \begin{cases} \Big( \sum_{n=1}^{\infty} \big[ n^\alpha E_{n-1}(f) \big]^q \frac{1}{n} \Big)^{1/q}, & 0 < q < \infty, \\ \sup_{n \ge 1}\, n^\alpha E_{n-1}(f), & q = \infty. \end{cases}$$
For the three approximation classes

$$N_q^\alpha, \qquad C_q^\alpha, \qquad S_q^\alpha,$$

associated with the complexity measures $\operatorname{compl}_N$, $\operatorname{compl}_C$ and $\operatorname{compl}_S$, respectively, we obtain the main result of this part I given by Theorems 3.19 and 3.17.

###### Main Result 1.3.

For any $\alpha > 0$, $q \in (0, \infty]$ and $p \in (0, \infty]$, the classes $N_q^\alpha$, $C_q^\alpha$ and $S_q^\alpha$ are quasi-normed vector spaces and satisfy continuous embeddings into each other.

### 1.6. Outline

In Section 2, we discuss how one-dimensional functions can be identified with tensors and analyze some basic properties of this identification. In Section 3, we introduce our approximation tool, briefly review general results from approximation theory, and analyze several approximation classes of rank-structured functions. In particular, we show that these classes are quasi-normed linear spaces. We conclude in Section 4 with a brief discussion of how tensorization can be viewed as a particular featuring step and tensor networks as a particular neural network with features as input variables.

## 2. Tensorization of Functions

We begin by introducing how one-dimensional functions can be identified with tensors of arbitrary dimension. We then introduce finite-dimensional subspaces of tensorized functions and show that these form a hierarchy of subspaces that are dense in $L^p$. This will be the basis for our approximation tool in Section 3.

### 2.1. The Tensorization Map

Consider one-dimensional functions on the unit interval

$$\mathbb{R}^{[0,1)} := \{ f : [0,1) \to \mathbb{R} \}.$$

We tensorize such functions by encoding the input variable $x \in [0,1)$ as follows. Let $b \in \mathbb{N}$, $b \ge 2$, be the base and $L \in \mathbb{N}$ the level. We introduce a uniform partition of $[0,1)$ with $b^L$ intervals $[x_j, x_{j+1})$ with $x_j = j b^{-L}$, $j = 0, \ldots, b^L - 1$. An integer $j \in \{0, \ldots, b^L - 1\}$ admits a representation in base $b$ such that

$$j = \sum_{k=1}^{L} i_k b^{L-k},$$

where $(i_1, \ldots, i_L) \in \{0, \ldots, b-1\}^L$.
We define a *conversion map* $t_{b,L}$ from $\{0, \ldots, b-1\}^L \times [0,1)$ to $[0,1)$ defined by

$$t_{b,L}(i_1, \ldots, i_L, \bar{x}) := \sum_{k=1}^{L} i_k b^{-k} + b^{-L} \bar{x}.$$

For any $x \in [0,1)$, there exists a unique $(i_1, \ldots, i_L, \bar{x}) \in \{0, \ldots, b-1\}^L \times [0,1)$ such that $x = t_{b,L}(i_1, \ldots, i_L, \bar{x})$, where $(i_1, \ldots, i_L)$ is the representation of $\lfloor b^L x \rfloor$ in base $b$ and $\bar{x} = b^L x - \lfloor b^L x \rfloor$. We therefore deduce the following property.

###### Lemma 2.1.

The conversion map $t_{b,L}$ defines a bijection from the set $\{0, \ldots, b-1\}^L \times [0,1)$ to the interval $[0,1)$, with inverse defined for $x \in [0,1)$ by

$$t_{b,L}^{-1}(x) = \big( i_1, \ldots, i_L,\, b^L x - \lfloor b^L x \rfloor \big),$$

where $(i_1, \ldots, i_L)$ is the representation of $\lfloor b^L x \rfloor$ in base $b$.

###### Definition 2.2 (Tensorization Map).

We define the
*tensorization map*

$$T_{b,L} : \mathbb{R}^{[0,1)} \to \mathbb{R}^{\{0, \ldots, b-1\}^L \times [0,1)},$$

which associates to a function $f \in \mathbb{R}^{[0,1)}$ the multivariate function $\boldsymbol{f} = T_{b,L} f$ such that

$$\boldsymbol{f}(i_1, \ldots, i_L, \bar{x}) := f\big( t_{b,L}(i_1, \ldots, i_L, \bar{x}) \big).$$

From Lemma 2.1, we directly deduce the following property of $T_{b,L}$.

###### Proposition 2.3.

The tensorization map $T_{b,L}$ is a linear bijection from $\mathbb{R}^{[0,1)}$ to $\mathbb{R}^{\{0, \ldots, b-1\}^L \times [0,1)}$, with inverse given for $\boldsymbol{f} = T_{b,L} f$ by $f = \boldsymbol{f} \circ t_{b,L}^{-1}$.
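The tensorization map is simply composition with the conversion map; a minimal sketch (function names ours):

```python
def t_conv(digits, xbar, b):
    """Conversion map: digits (i_1, ..., i_L) and remainder xbar -> x in [0, 1)."""
    L = len(digits)
    return sum(i * b**-(k + 1) for k, i in enumerate(digits)) + b**-L * xbar

def tensorize(f, b, L):
    """Return the multivariate function (T f)(i_1, ..., i_L, xbar) = f(t(...))."""
    return lambda digits, xbar: f(t_conv(digits, xbar, b))

f = lambda x: 3.0 * x                 # any function on [0, 1)
F = tensorize(f, b=2, L=2)
# digits (1, 0) and xbar = 0.5 encode x = 1/2 + (1/4) * 0.5 = 0.625
assert abs(F((1, 0), 0.5) - f(0.625)) < 1e-12
```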

The space $\mathbb{R}^{\{0, \ldots, b-1\}^L \times [0,1)}$ can be identified with the algebraic tensor space

$$\mathbf{V}_{b,L} := \mathbb{R}^b \otimes \cdots \otimes \mathbb{R}^b \otimes \mathbb{R}^{[0,1)},$$

which is the set of functions defined on $\{0, \ldots, b-1\}^L \times [0,1)$ that admit a representation

$$\boldsymbol{f}(i_1, \ldots, i_L, \bar{x}) = \sum_{k=1}^{r} v_1^k(i_1) \cdots v_L^k(i_L)\, g^k(\bar{x}) \tag{2.1}$$

for some $r \in \mathbb{N}$ and for some functions $v_\nu^k \in \mathbb{R}^{\{0, \ldots, b-1\}}$ and $g^k \in \mathbb{R}^{[0,1)}$, $1 \le \nu \le L$, $1 \le k \le r$. Letting $(e_i)_{i=0}^{b-1}$ be the canonical basis of $\mathbb{R}^b$, defined by $e_i(j) = \delta_{ij}$, a function $\boldsymbol{f} = T_{b,L} f$ admits the particular representation

$$\boldsymbol{f} = \sum_{i_1=0}^{b-1} \cdots \sum_{i_L=0}^{b-1} e_{i_1} \otimes \cdots \otimes e_{i_L} \otimes \boldsymbol{f}(i_1, \ldots, i_L, \cdot\,). \tag{2.2}$$
The following result provides an interpretation of the above representation.

###### Lemma 2.4.

Let $f \in \mathbb{R}^{[0,1)}$ and $\boldsymbol{f} = T_{b,L} f$. For $(i_1, \ldots, i_L) \in \{0, \ldots, b-1\}^L$ and $j = \sum_{k=1}^{L} i_k b^{L-k}$, it holds

$$\boldsymbol{f}(i_1, \ldots, i_L, \bar{x}) = f\big( b^{-L} (j + \bar{x}) \big), \tag{2.3}$$

and

$$\boldsymbol{f}(i_1, \ldots, i_L, \cdot\,) = f_{j,L}, \tag{2.4}$$

where $f_{j,L} := f(b^{-L}(j + \cdot\,))$ is the restriction of $f$ to the interval $[b^{-L} j, b^{-L}(j+1))$ rescaled to $[0,1)$.

###### Proof.

From Lemma 2.4, we deduce that the representation (2.2) corresponds to the decomposition of $f$ as a superposition of functions with disjoint supports,

$$f = \sum_{j=0}^{b^L - 1} f_j, \tag{2.5}$$

where $f_j$ is the function supported on the interval $[b^{-L} j, b^{-L}(j+1))$ and equal to $f$ on this interval. Also, Lemma 2.4 yields the following result.

###### Corollary 2.5.

A function $f$ defined by

$$f\big( t_{b,L}(i_1, \ldots, i_L, \bar{x}) \big) = v_1(i_1) \cdots v_L(i_L)\, g(\bar{x})$$

with $v_1, \ldots, v_L \in \mathbb{R}^b$ and $g \in \mathbb{R}^{[0,1)}$ admits a tensorization $T_{b,L} f = v_1 \otimes \cdots \otimes v_L \otimes g$, which is an elementary tensor.

We now provide a useful result on compositions of tensorization maps for changing the representation level of a function.

###### Lemma 2.6.

Let $L, L' \in \mathbb{N}$ be such that $L' \le L$. For any $f \in \mathbb{R}^{[0,1)}$, it holds

$$T_{b,L} f = \big( \operatorname{id}^{\otimes L'} \otimes T_{b,L-L'} \big)\, \big( T_{b,L'} f \big),$$

and the operator $T_{b,L} \circ T_{b,L'}^{-1}$ from $\mathbb{R}^{\{0, \ldots, b-1\}^{L'} \times [0,1)}$ to $\mathbb{R}^{\{0, \ldots, b-1\}^{L} \times [0,1)}$ is such that

$$T_{b,L} \circ T_{b,L'}^{-1} = \operatorname{id}^{\otimes L'} \otimes T_{b,L-L'},$$

where $\operatorname{id}$ is the identity operator on $\mathbb{R}^b$ and $T_{b,L-L'}$ is the tensorization map from $\mathbb{R}^{[0,1)}$ to $\mathbb{R}^{\{0, \ldots, b-1\}^{L-L'} \times [0,1)}$. Also, the operator $T_{b,L'} \circ T_{b,L}^{-1}$ from $\mathbb{R}^{\{0, \ldots, b-1\}^{L} \times [0,1)}$ to $\mathbb{R}^{\{0, \ldots, b-1\}^{L'} \times [0,1)}$ is such that

$$T_{b,L'} \circ T_{b,L}^{-1} = \operatorname{id}^{\otimes L'} \otimes T_{b,L-L'}^{-1}.$$

###### Proof.

See Appendix A. ∎

For , we adopt the conventions that is the identity on , is the identity operator on , and .

### 2.2. Ranks and Minimal Subspaces

The minimal integer $r$ such that $\boldsymbol{f}$ admits a representation of the form (2.1) is the *canonical tensor rank* of $\boldsymbol{f}$, denoted $\operatorname{rank}(\boldsymbol{f})$. We deduce from the representation (2.2) that

$$\operatorname{rank}(T_{b,L} f) \le b^L.$$

Other notions of ranks can be defined from the classical notion of rank by identifying a tensor with a tensor of order two (through unfolding). Letting $V_\nu := \mathbb{R}^b$ for $1 \le \nu \le L$, and $V_{L+1} := \mathbb{R}^{[0,1)}$, we have

$$\mathbf{V}_{b,L} = V_1 \otimes \cdots \otimes V_{L+1}.$$

Then for any $\beta \subset \{1, \ldots, L+1\}$ and its complementary set $\beta^c := \{1, \ldots, L+1\} \setminus \beta$, a tensor $\boldsymbol{f} \in \mathbf{V}_{b,L}$ can be identified with an order-two tensor in $V_\beta \otimes V_{\beta^c}$, where $V_\beta := \bigotimes_{\nu \in \beta} V_\nu$, called the $\beta$-unfolding of $\boldsymbol{f}$. This allows us to define the notion of $\beta$-rank.

###### Definition 2.7 (-rank).

For $\beta \subset \{1, \ldots, L+1\}$, the $\beta$-rank of $\boldsymbol{f}$, denoted $\operatorname{rank}_\beta(\boldsymbol{f})$, is the minimal integer $r_\beta$ such that $\boldsymbol{f}$ admits a representation of the form

$$\boldsymbol{f} = \sum_{k=1}^{r_\beta} v_k \otimes w_k, \tag{2.6}$$

where $v_k \in V_\beta$ and $w_k \in V_{\beta^c}$.

Since $\mathbf{V}_{b,L}$ is an algebraic tensor space, the $\beta$-rank is finite and we have $\operatorname{rank}_\beta(\boldsymbol{f}) \le \operatorname{rank}(\boldsymbol{f})$ (though the $\beta$-rank can be much smaller). Moreover, we have the following straightforward property

$$\operatorname{rank}_\beta(\boldsymbol{f}) = \operatorname{rank}_{\beta^c}(\boldsymbol{f}),$$

and the bound

$$\operatorname{rank}_\beta(\boldsymbol{f}) \le \min\Big\{ \prod_{\nu \in \beta} \dim V_\nu,\ \prod_{\nu \in \beta^c} \dim V_\nu \Big\}, \tag{2.7}$$

which can be useful for small $L$ and either very small or very large $\#\beta$.
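In a fully discrete surrogate of this setting (finite index sets in every dimension), the $\beta$-rank is the matrix rank of the $\beta$-unfolding and can be computed by reshaping. A sketch with our own names, also illustrating the standard identity $\operatorname{rank}_\beta = \operatorname{rank}_{\beta^c}$:

```python
import numpy as np

def beta_rank(T, beta):
    """Rank of the beta-unfolding of T (beta is a tuple of axes)."""
    d = T.ndim
    beta_c = tuple(nu for nu in range(d) if nu not in beta)
    rows = int(np.prod([T.shape[nu] for nu in beta]))
    M = np.transpose(T, beta + beta_c).reshape(rows, -1)  # beta-unfolding
    return np.linalg.matrix_rank(M)

rng = np.random.default_rng(3)
# A rank-1 (elementary) tensor has all beta-ranks equal to 1 ...
a, b2, c, d2 = (rng.standard_normal(3) for _ in range(4))
T1 = np.einsum('a,b,c,d->abcd', a, b2, c, d2)
assert beta_rank(T1, (0, 1)) == 1
# ... and rank_beta always equals rank over the complementary set beta^c.
T = rng.standard_normal((3, 3, 3, 3))
assert beta_rank(T, (0, 2)) == beta_rank(T, (1, 3))
```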

Representation (2.6) is not unique but
the space spanned by the $v_k$ is unique and corresponds to the $\beta$-*minimal subspace* of $\boldsymbol{f}$.

###### Definition 2.8 (-minimal subspace).

For $\beta \subset \{1, \ldots, L+1\}$, the $\beta$-minimal subspace of $\boldsymbol{f}$, denoted $U_\beta^{\min}(\boldsymbol{f})$, is the smallest subspace $U_\beta \subset V_\beta$ such that $\boldsymbol{f} \in U_\beta \otimes V_{\beta^c}$, and its dimension is $\dim U_\beta^{\min}(\boldsymbol{f}) = \operatorname{rank}_\beta(\boldsymbol{f})$.

We have the following useful characterization of minimal subspaces from partial evaluations of a tensor.

###### Lemma 2.9.

For $\boldsymbol{f} \in \mathbf{V}_{b,L}$ and $\beta \subset \{1, \ldots, L+1\}$,

$$U_\beta^{\min}(\boldsymbol{f}) = \operatorname{span}\big\{ \boldsymbol{f}(\cdot\,, y_{\beta^c}) \big\},$$

where $\boldsymbol{f}(\cdot\,, y_{\beta^c})$ is a partial evaluation of $\boldsymbol{f}$ along dimensions $\beta^c$, and the span is taken over all such evaluations.

###### Proof.

See Appendix A. ∎

Next we define a notion of -rank for univariate functions in .

###### Definition 2.10 (-rank).

For a function $f \in \mathbb{R}^{[0,1)}$ and $\beta \subset \{1, \ldots, L+1\}$, we define the $\beta$-rank of $f$, denoted $\operatorname{rank}_\beta(f)$, as the $\beta$-rank of its tensorization $T_{b,L} f$ in $\mathbf{V}_{b,L}$,

$$\operatorname{rank}_\beta(f) := \operatorname{rank}_\beta(T_{b,L} f).$$

In the rest of this work, we will essentially consider subsets of the form $\beta = \{1, \ldots, \nu\}$ or $\beta = \{\nu+1, \ldots, L+1\}$ for some $\nu \in \{1, \ldots, L\}$. For the corresponding $\beta$-ranks, we will use the shorthand notation

$$r_\nu(f) := \operatorname{rank}_{\{1, \ldots, \nu\}}(f) = \operatorname{rank}_{\{\nu+1, \ldots, L+1\}}(f).$$

Note that $r_\nu(f)$ should not be confused with $\operatorname{rank}_{\{\nu\}}(f)$. The ranks of a tensor have to satisfy some relations, as seen in the next lemma.

###### Lemma 2.11 (Ranks Admissibility Conditions).

Let $\boldsymbol{f} \in \mathbf{V}_{b,L}$. For any set $\beta \subset \{1, \ldots, L+1\}$ and any partition $\beta = \beta_1 \,\dot\cup\, \beta_2$, we have

$$\operatorname{rank}_\beta(\boldsymbol{f}) \le \operatorname{rank}_{\beta_1}(\boldsymbol{f})\, \operatorname{rank}_{\beta_2}(\boldsymbol{f}),$$

and in particular

$$b^{-1}\, r_{\nu-1}(\boldsymbol{f}) \le r_\nu(\boldsymbol{f}) \le b\, r_{\nu-1}(\boldsymbol{f}). \tag{2.8}$$

###### Proof.

See Appendix A. ∎

A function admits infinitely many tensorizations of different levels. The following result provides a relation between minimal subspaces.

###### Lemma 2.12.

Consider a function $f \in \mathbb{R}^{[0,1)}$ and its tensorization $T_{b,L} f$ at level $L$. For any $\nu \in \{1, \ldots, L\}$,

$$U_{\{\nu+1, \ldots, L+1\}}^{\min}(T_{b,L} f) = \operatorname{span}\big\{ T_{b,L-\nu} f_{j,\nu} : 0 \le j \le b^{\nu} - 1 \big\},$$

where $T_{b,L-\nu} f_{j,\nu}$ is the tensorization of $f_{j,\nu}$ at level $L - \nu$.

###### Proof.

See Appendix A. ∎

For $0 \le j \le b^{\nu} - 1$, since $f_{j,\nu}$ is the restriction of $f$ to the interval $[b^{-\nu} j, b^{-\nu}(j+1))$ rescaled to $[0,1)$, Lemma 2.12 provides a simple interpretation of the minimal subspace $U_{\{\nu+1, \ldots, L+1\}}^{\min}(T_{b,L} f)$ as the linear span of $b^{\nu}$ contiguous pieces of $f$ rescaled to $[0,1)$, see the illustration in Figure 3.

###### Corollary 2.13.

Let and . For any

and

###### Proof.

It holds

and

Lemma 2.12 then implies that and provides the characterization from the linear span of with which is linearly identified with the restriction shifted and rescaled to . ∎

### 2.3. Measures, Lebesgue and Smoothness Spaces

We now look at $T_{b,L}$ as a linear map between spaces of measurable functions, by equipping the interval $[0,1)$ with the Lebesgue measure.

###### Proposition 2.14.

The Lebesgue measure $\lambda$ on $[0,1)$ is the push-forward measure through the map $t_{b,L}$ of the product measure $\mu^{\otimes L} \otimes \lambda$, where $\mu$ is the uniform probability measure on $\{0, \ldots, b-1\}$. Then the tensorization map $T_{b,L}$ defines a linear isomorphism from the space of measurable functions on $[0,1)$ to the space of measurable functions on $\{0, \ldots, b-1\}^L \times [0,1)$, where $[0,1)$ is equipped with the Lebesgue measure and $\{0, \ldots, b-1\}^L \times [0,1)$ is equipped with the product measure $\mu^{\otimes L} \otimes \lambda$.

###### Proof.

See Appendix A. ∎

#### 2.3.1. Lebesgue Spaces

For , we consider the Lebesgue space of functions defined on equipped with its standard (quasi-)norm. Then we consider the algebraic tensor space

which is the space of multivariate functions on with partial evaluations . From hereon we frequently abbreviate .

###### Theorem 2.15 (Tensorization is an $L^p$-Isometry).

For any $0 < p \le \infty$, $T_{b,L}$ is a linear isometry from $L^p$ to $\mathbb{R}^b \otimes \cdots \otimes \mathbb{R}^b \otimes L^p$ equipped with the (quasi-)norm defined by