1 Introduction
This project is not yet complete, but intends to formalize a strengthening of the following theorem. Let \(\mathbb {I} = \left[0,1\right]\) and let \(n \geq 2\) be an given integer. We use \(C^d\) to indicate the space of functions defined on a domain whose \(d\)-th derivative is continuous and use \(\circ \) to indicate function composition.
For every \(f \in C\left(\mathbb {I}^n\right) \mapsto \mathbb {R}\), there exist increasing \(\psi _{p,q} \in C\left(\mathbb {I}\right) \mapsto \mathbb {R}\) that do not depend on \(f\) and generally non-monotone \(\chi _q \in C\left(\mathbb {R}\right) \mapsto \mathbb {R}\) that depend on \(f\) and \(\psi _{p,q}\), such that \(f\left(x_0, x_1, \cdots , x_{n - 1}\right) = \sum \limits _{q = 0}^{2n} \chi _q \circ \Psi _q\left(x_0, x_1, \cdots , x_{n - 1}\right)\), where \(\Psi _q\left(x_0, x_1, \cdots , x_{n - 1}\right) \equiv \sum \limits _{p = 0}^{n - 1} \psi _{p,q}\left(x_p\right)\).
This theorem is called the Kolmogorov Superposition Representation (KSR) or the Kolmogorov Superposition Theorem (KST) because Kolmogorov proved that a given multivariable “target” function, \(f\), could be represented as a finite sum of compositions of univariate functions.
The relevance of the KRT continues to be debated in the neural networks literature. On one side, many want to utilize the KSR to approximate unknown or expensive target functions. However, the “inner” functions, \(\psi _{p,q}\), and “outer” functions, \(\chi _q\), are not elementary, in the sense that they cannot be computed with a finite number of calls to basic functions. Thus, the KSR requires harmonic analysis and other side is skeptical that the it will ever be of much practical use.
Although the KST is perhaps surprising in its generality and simplicity, it has been proven in several ways, raising the question of what value is added by an additional proof? The extant proofs of the KST are of three kinds (and none have been formalized):
Nonconstructive: The most elegant proofs utilize the Baire Category Theorem to prove the existence of inner and outer functions that satisfy the KST but do not suggest how to construct them.
Constructive: Other proofs are considered to be constructive but do not even suggest a feasible algorithm to implement them in floating-point arithmetic.
Constructed: Two constructions have been implemented in software but both utilize bad inner functions and worse outer functions, making them unworkable in floating-point arithmetic.
By “bad" inner functions, we specifically mean they are in the Devil’s Staircase family, that is, their derivative is zero almost everywhere on \(\mathbb {I}\) but does not exist on a dense set of points. By “worse" outer functions, we specifically mean that they are nowhere differentiable. Thus, even if \(f\) were differentiable at some point, the chain rule does not apply to a compositions involving nowhere differentiable functions, so it cannot be used to compute a gradient under the original KSR. Most algorithms in statistics and supervised learning rely on gradients as part of some larger algorithm, so the KSR has not been useful for those purposes.
Our goal is to construct decent inner functions and adequate outer functions that satisfy the KST and allow the chain rule to compute the gradient of \(f\), which can then be implemented to floating-point arithmetic. Vitushkin and Khenkin proved that the inner functions in the KST cannot all be continuously differentiable. Thus, by “decent” inner functions, we specifically mean Lipschitz continuous functions, which satisfy \(\left|\psi _{p,q}\left(x\right) - \psi _{p,q}\left(x^\prime \right)\right| \leq K\left|x - x^\prime \right|\) for some positive but finite constant \(K\). The KST literature often refers to a more general definition of Lipschitz continuity (which is also termed Hölder continuity) where \(\left|\psi _{p,q}\left(x\right) - \psi _{p,q}\left(x^\prime \right)\right| \leq K\left|x - x^\prime \right|^\alpha \). Fridman and Kahane proved that the inner functions can all be weak contractions, that is, \(\alpha = 1\) and \(K = 1\).
Sprecher proved that there are outer functions that satisfy the KST for any target function when \(\psi _{p,q}\left(x_p\right) \equiv \lambda ^p \varphi \left(x_p - \frac{q}{2n}\right)\), where \(\lambda {\gt} 0\) is an irrational number and \(\varphi \) is a weak contraction that we call the “core” function. However, the only known attempt to implement the Fridman-Sprecher approach in software was made by Actor, which was unsuccessful.
Actor did succeed in reparameterizing the Hölder continuous core function proposed by Sprecher, corrected by Köppen, and proven correct by Braun and Griebel. Actor’s function is parameterized via its arc length and is Lipschitz continuous by construction. While there is graphical evidence that strongly suggests Actor’s function is compatible with the KST, some aspects of the approach are not fully proved, and it requires an expensive numerical inversion of Braun and Griebel’s function just to achieve a few decimal places of accuracy.
Montanelli and Yang notes the importance of constructing a Lipschitz continuous KST
We have proven upper bounds for the approximation of multivariate functions \(f : [0, 1]^n \mapsto \mathbb {R}\) by deep ReLU networks, for which the curse of dimensionality is lessened. The depth and the size of the networks to approximate such functions \(f\) grow like \(\mathcal{O}\left(\varepsilon ^{-\log n}\right)\), as opposed to \(\mathcal{O}\left(\varepsilon ^{-n}\right)\). The proof is based on the ability of very deep ReLU networks to implement the Kolmogorov-Arnold superposition theorem.
There are many ways in which this work could be fruitfully continued. If we were able to construct a Lipschitz continuous inner function, we would be able to obtain \(\mathcal{O}\left(\varepsilon ^{-1}\right)\) estimates. Actor and Knepley designed in 2017 an algorithm to compute a Lipschitz continuous inner function, but they did not provide a method to compute the outer functions.
This quote reflects the fact that the outer functions depend on the inner functions, in addition to \(f\). The smoother are the inner functions, the easier it is to obtain “adequate” outer functions that approximate a target function to a given degree of accuracy. In short, a Lipschitz continuous KST would overcome (to first order) the curse of dimensionality when approximating \(y\) on \(\mathbb {I}^n\).
Also, target functions in statistics, such as log-likelihood functions, typically depend on \(N\) data points and the cost of evaluating them directly is polynomial (at best) in \(N\). Once the outer functions have been calibrated to the target function, the KSR of the target function can be evaluated in parallel in essentially constant wall time regardless of \(n\) or \(N\). These savings could be enormous in computationally intensive applications, such as Markov Chain Monte Carlo or Large Language Models.
There are several open questions that are not even investigated in the KST literature:
Can some proper subset of the inner functions be continuously differentiable (or smoother)? We show that if \(0 {\lt} q {\lt} 2n\), then \(\psi _{p,q}\) can be twice continuously differentiable.
Does the core function depend on \(n\), apart from being evaluated at \(x - \frac{q}{2n}\)? We show that \(\varphi \) can be universal, although \(\lambda \) does depend on \(n\) in a simple fashion.
Can the set of points where \(\varphi \) is not differentiable be finite? We show that \(\varphi \) is not differentiable at \(1\) and \(-1\) but is differentiable on \(\left(-1,1\right)\).
Can \(\varphi \) be approximated to floating-point precision? We show that \(\varphi \) can be quickly approximated to any degree of accuracy.
Our approach is unique in that we first construct a function with the requisite properties and then show that it is a viable core function in the KSR. Chapter 2 uses basic real analysis — most of which is already formalized in the Mathlib library used by the Lean software — to show that our candidate for \(\psi \) is increasing, Lipschitz continuous, and not continuously differentiable. In addition, since \(\varphi \) can be represented as a series, it is essential to demonstrate that the series converges uniformly, which is an improvement over the other constructions in the KST literature whose functions only converge pointwise. As a result, \(\varphi \) can be evaluated to double precision in fewer than a hundred floating point operations, and its derivative can be evaluated at double that cost (albeit to less accuracy).
Chapter 3 establishes in Lean that \(\Psi _q\left(x_0, x_1, \cdots , x_{n - 1}\right) \equiv \sum \limits _{p = 1}^n \lambda ^{p -1} \varphi \left(x_p - \frac{q}{2n}\right)\) is injective for each \(q\) when \(\lambda \) is a transcendental number. The proof is only slightly different from the one Sprecher has relied upon for decades that essentially extended the rational number field with an irrational \(\lambda \). Much of this can leverage the theorems on cyclotomic fields that are already in Mathlib.
Chapter 4 is the least complete but is devoted to the outer functions in the KSR. Kahane utilized the notion of a Helson set, which is a subset with the characterizing property that every continuous function over this subset can be represented as a convergent Fourier series. Kahane proved necessary and sufficient conditions for a subset of \(\mathbb {I}^{2n + 1}\) to be a Helson set.