Fourier Series and Fourier Transform - I

Certain feelings in my body lead me to believe that I have to study Fourier Series and Fourier Transform for a better understanding of probability theory, measure theory, entropy, and information theory.

For people in the field of AI, it is well known that 'neural networks can approximate any function given enough hidden neurons'. This is a very powerful statement. But what does it mean? What is the proof? What is the connection between this statement and linear algebra, such as projection matrices, eigenvalues, and eigenvectors? What role does the distribution of the input play in this statement? What role does the activation function play?

To answer these questions, we need to understand Fourier Series and Fourier Transform. I have been trying to master them for a long time, and it has taken me several years just to understand the basic concepts. I am still learning. I will try to explain what I have learned so far in this series of blog posts.

If you have read my previous post about probability and distribution, you might notice that sometimes we have to talk about the Gamma function, the Beta function, etc. These functions are related to Fourier Series and Fourier Transform, so I will also try to explain them in this series of blog posts. The point is that to understand the basic concepts of probability theory, we have to understand Fourier Series and Fourier Transform. Since probability theory is the foundation of AI and even quantum mechanics, we have to understand Fourier Series and Fourier Transform.

Where to start?

Both probability theory and Fourier analysis (Fourier Series and Fourier Transform) are very deep and rich subjects. One could easily spend an entire life studying them. So, where to start? We need to make a choice, which certainly involves some trade-offs.

The choice I am making is to take an advanced standpoint that will allow us to have a unified view of Fourier analysis, probability theory, and their applications in the field of AI. This standpoint is both elementary and advanced: elementary in the sense that it does not require any advanced math background, and advanced in the sense that it will lead us to more advanced topics in the future.

The standpoint I am taking is to start with a handful of concepts that this post walks through: the Taylor series, Euler's number and Euler's formula, the dot product and inner product, and the least squares solution.

To explain this standpoint, let's take a look at a specific example. Look at the following picture:

Figure 1. The right triangle with sides of length $a=b=1$ has hypotenuse of length $\sqrt{2}$.

Now, you want to estimate the value of $\sqrt{2}$. People from different regions have different ways to estimate the value of $\sqrt{2}$. Please read this if you are interested in the history of $\sqrt{2}$.

Now, living in the 21st century, how could we estimate the value of $\sqrt{2}$? We can rely on the Taylor series. Let's review the Taylor series first.

Taylor series

The Taylor series is a way to approximate a function $f(x)$ by a polynomial $p(x)$. Here is the formula:

$$
\begin{aligned}
p(x) & = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} (x-a)^n \\
& = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \frac{f'''(a)}{3!}(x-a)^3 + \cdots
\end{aligned}
$$

where $f^{(n)}(a)$ is the $n$-th derivative of $f(x)$ evaluated at $x=a$. Now, we let $f(x) = \sqrt{x} = x^{1/2}$ and $a=1$. We know

$$
\begin{aligned}
f(x) & = x^{1/2} \quad \Rightarrow \quad f(1) = 1 \\
f'(x) & = \frac{1}{2} x^{-1/2} \quad \Rightarrow \quad f'(1) = \frac{1}{2} \\
f''(x) & = -\frac{1}{4} x^{-3/2} \quad \Rightarrow \quad f''(1) = -\frac{1}{4} \\
f'''(x) & = \frac{3}{8} x^{-5/2} \quad \Rightarrow \quad f'''(1) = \frac{3}{8} \\
f^{(4)}(x) & = -\frac{15}{16} x^{-7/2} \quad \Rightarrow \quad f^{(4)}(1) = -\frac{15}{16}
\end{aligned}
$$

Then, we have

$$
\begin{aligned}
p(x) & = \sum_{n=0}^{\infty} \frac{f^{(n)}(1)}{n!} (x-1)^n \\
& = f(1) + f'(1)(x-1) + \frac{f''(1)}{2!}(x-1)^2 + \frac{f'''(1)}{3!}(x-1)^3 + \cdots \\
& = 1 + \frac{1}{2}(x-1) - \frac{1}{8}(x-1)^2 + \frac{1}{16}(x-1)^3 + \cdots
\end{aligned}
$$

If we let $x=2$, we have

$$
\begin{aligned}
p(2) & = 1 + \frac{1}{2}(2-1) - \frac{1}{8}(2-1)^2 + \frac{1}{16}(2-1)^3 + \cdots \\
& = 1 + \frac{1}{2} - \frac{1}{8} + \frac{1}{16} + \cdots \\
& = ? \quad \text{(let's write a program to compute this value)}
\end{aligned}
$$

Figure 2. The Taylor series estimation of $\sqrt{2}$, and its convergence.
```python
#%%
import math

import numpy as np
import matplotlib.pyplot as plt


# f(x) = x^{1/2}
def f(x, power=0.5):
    return x**power


# n-th derivative of f(x) = x^{1/2} for n >= 1, via the recursion
# f^(n)(x) = (1/2 - (n - 1)) * f^(n-1)(x) / x
def df(x, n):
    if n == 1:
        return 1/2 * x**(-1/2)
    return (1/2 - (n - 1)) * df(x, n - 1) / x


# partial sum of the Taylor series around a, with terms up to order n
def taylor_series(x, n, a=1):
    ts = f(a)
    for i in range(1, n + 1):
        ts += df(a, i) * (x - a)**i / math.factorial(i)
    return ts


# plot the partial sums against the order to show convergence
def plot_taylor_series():
    nlist = list(range(1, 10))
    y = [taylor_series(2, i) for i in nlist]
    fig, ax = plt.subplots(1, 1, figsize=(7, 4))
    ax.plot(nlist, y, 'k--', label='Taylor series')
    # add np.sqrt(2) as a horizontal reference line
    ax.axhline(y=np.sqrt(2), color='k', linestyle='-', label='sqrt(2)')
    ax.set_xlabel('order')
    ax.set_ylabel('value')
    ax.set_title('Taylor series of sqrt(x) at x=2 around a=1')
    # legend at the lower right
    ax.legend(loc='lower right')
    plt.show()


if __name__ == "__main__":
    # high-resolution figures (e.g., for retina displays)
    plt.rcParams['figure.dpi'] = 300
    print(f(1))                 # 1.0
    print(df(1, 4))             # -0.9375 = -15/16
    print(taylor_series(2, 6))  # ~ 1.4053, approaching sqrt(2)
    plot_taylor_series()
```

Figure 2 shows the Taylor series estimate of $\sqrt{2}$ and its convergence: as the order of the Taylor series increases, the partial sums converge to $\sqrt{2}$, and already at modest orders the estimate is quite good.

Reflections
This simple example lets us reflect on many topics: i) theoretical vs. applied mathematics, ii) the idealized vs. the real world, iii) continuous vs. discrete, and iv) the connection between analysis and linear algebra.
The most important message I want to convey is that one has to develop a certain level of mathematical maturity to carry out the mathematical analysis and then implement the calculation in the real world, which is what we call 'engineering'.

When we used the Taylor series to estimate $\sqrt{2}$, we did not cover the following interesting and important questions: does the series converge, and for which $x$? How fast does it converge? How large is the error when we truncate the series at a finite order? And which functions admit such an expansion at all?

All those questions will be answered in the future blog posts when we study Fourier Series and Fourier Transform systematically. So, without further ado, let’s start our journey of Fourier Series and Fourier Transform.

How about $2^{1/3}$ or $\sqrt{57}$?

I added this section after talking with a friend, who suggested that one could estimate $\sqrt{2}$ with a method called 'long division'. We used this method to do division in elementary school, but I never thought about using it to estimate $\sqrt{2}$. So, I tried it. It works (however, I have to warn you that this is quite a wild and unstable method).

Figure 3. Two examples of using long division to estimate square roots.

The problem with using this method to estimate $\sqrt{2}$ is that it is not a 'general' method: it is not easy to guess the successive digits of the estimate. There is an algorithm for doing this systematically, but it is not easy to implement, so I will not talk about it here. Moreover, you cannot rely on this method to estimate $2^{1/3}$, but you can use the Taylor series to estimate $2^{1/3}$.
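
To make this concrete, here is a minimal sketch (the helpers `dnf` and `taylor_estimate` are my own names) that estimates $2^{1/3}$ with the same Taylor-series idea, and estimates $\sqrt{57}$ by expanding around the nearby perfect square $49$ so that the series converges quickly:

```python
import math


def dnf(a, n, p):
    # n-th derivative of x**p at x=a (n >= 1):
    # p * (p-1) * ... * (p-n+1) * a**(p-n)
    c = 1.0
    for k in range(n):
        c *= (p - k)
    return c * a ** (p - n)


def taylor_estimate(x, p, order, a=1):
    # partial sum of the Taylor series of x**p around x=a
    s = a ** p
    for i in range(1, order + 1):
        s += dnf(a, i, p) * (x - a) ** i / math.factorial(i)
    return s


print(taylor_estimate(2, 1/3, 30))  # slowly approaches 2**(1/3) = 1.2599...

# sqrt(57) = 7 * sqrt(57/49); with x = 57/49 and a = 1, |x - a| = 8/49
# is small, so a few terms already give sqrt(57) = 7.5498...
print(7 * taylor_estimate(57/49, 1/2, 10))
```

Expanding around a point where the function and its derivatives are easy to evaluate, and where $|x-a|$ is small, is what makes the series practical.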

Reflections
Since mathematical thinking does not come to our minds naturally, one has to train herself/himself to think mathematically. When we think mathematically, we have to ask: i) what is the generalization of the problem; ii) how could we prove the generalization; iii) how could we apply the generalization to solve real-world problems.
A certain algorithm might work for certain problems, but it might not work for others. This reminds me of an interview with Wu WenJun (吴文俊). When he was asked about the difference between Chinese mathematicians and Western mathematicians, he said that Chinese mathematicians are good at solving specific problems, whereas Western mathematicians are good at solving general problems. I think this is a very good observation.
You may not like theory, but the power of theory has to be appreciated. Theory is the foundation of practice. Without theory, practice is just wild guessing.

Euler’s number and Euler’s formula

Euler’s number is a very important number in mathematics. It is denoted by $e$ and is defined by the following infinite series:

$$ e = \sum_{n=0}^{\infty} \frac{1}{n!} = 1 + \frac{1}{1!} + \frac{1}{2!} + \frac{1}{3!} + \cdots $$

where $n!$ is the factorial of $n$; for example, $3! = 3 \times 2 \times 1 = 6$. One cannot overstate the importance of Euler’s number. It is almost everywhere in mathematics. For instance, the famous Gaussian distribution is defined as the following:

$$ \mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$

where $\exp(x) = e^x$. In number theory, probability theory, statistics, etc., Euler’s number is everywhere. To learn how Euler discovered this number, please read this. I have not read this note, but put it here for future reference.
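
As a quick numerical sanity check (a minimal sketch), the partial sums of the series above converge to $e$ very quickly, thanks to the factorial in the denominator:

```python
import math

# partial sums of e = sum(1/n!) up to n = 10
s = 0.0
for n in range(11):
    s += 1 / math.factorial(n)

print(s)       # 2.7182818011...  (already ~7 digits)
print(math.e)  # 2.718281828459045
```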

With an appreciation of Euler’s number, we can now state Euler’s formula:

$$ e^{ix} = \cos(x) + i \sin(x) \tag{1} $$

where $i$ is the imaginary unit, which is defined by the following:

$$ i^2 = -1 $$

This simple formula connects the exponential function with the trigonometric functions. It is a very powerful formula, and a beautiful one. Please read the Wikipedia article to learn more about Euler’s formula.

With Euler’s formula, we can derive the following two important formulas:

$$
\begin{aligned}
\cos(x) & = \frac{e^{ix} + e^{-ix}}{2} \\
\sin(x) & = \frac{e^{ix} - e^{-ix}}{2i}
\end{aligned} \tag{2}
$$

In our posts, we will use the above two formulas frequently. So, please remember them.
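
To help remember them, here is a small numerical check of equations (1) and (2) (a sketch; the test value $x = 0.7$ is an arbitrary choice):

```python
import cmath
import math

x = 0.7  # an arbitrary test value

# Euler's formula (1): e^{ix} = cos(x) + i sin(x)
print(cmath.exp(1j * x))                  # (0.7648...+0.6442...j)
print(complex(math.cos(x), math.sin(x)))  # the same number

# formulas (2): cos and sin recovered from complex exponentials
cos_x = (cmath.exp(1j * x) + cmath.exp(-1j * x)) / 2
sin_x = (cmath.exp(1j * x) - cmath.exp(-1j * x)) / (2j)
print(cos_x.real, math.cos(x))  # 0.7648... twice
print(sin_x.real, math.sin(x))  # 0.6442... twice
```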

Dot product and inner product

In linear algebra, we have the concepts of the dot product and the inner product. They are very important concepts, and in this section we will review them. It is important to know that the inner product is a generalization of the dot product, so we will start with the dot product first.

The dot product of two vectors $\mathbf{a}$ and $\mathbf{b}$ is defined as the following:

$$ \mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n $$

where $n$ is the dimension of the vectors $\mathbf{a}$ and $\mathbf{b}$. For example, if $\mathbf{a} = [1, 2, 3]$ and $\mathbf{b} = [4, 5, 6]$, then

$$ \mathbf{a} \cdot \mathbf{b} = 1 \times 4 + 2 \times 5 + 3 \times 6 = 32 $$

The dot product of two vectors $\mathbf{a}$ and $\mathbf{b}$ is also called their inner product, denoted by $\langle \mathbf{a}, \mathbf{b} \rangle$. So, we have

$$ \langle \mathbf{a}, \mathbf{b} \rangle = \mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n \tag{3} $$

The inner product has a very nice geometric interpretation:

$$ \langle \mathbf{a}, \mathbf{b} \rangle = \|\mathbf{a}\| \|\mathbf{b}\| \cos \theta \tag{4} $$

where $\theta$ is the angle between the two vectors $\mathbf{a}$ and $\mathbf{b}$, and $\|\mathbf{a}\|$ and $\|\mathbf{b}\|$ are the lengths of the vectors $\mathbf{a}$ and $\mathbf{b}$, respectively. We also refer to $\|\mathbf{a}\|$ as the norm of the vector $\mathbf{a}$, which is defined as the following:

$$ \|\mathbf{a}\| = \sqrt{\langle \mathbf{a}, \mathbf{a} \rangle} = \sqrt{\mathbf{a} \cdot \mathbf{a}} = \sqrt{\sum_{i=1}^{n} a_i^2} \tag{5} $$
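
A short NumPy sketch, using the example vectors from above, ties equations (3), (4), and (5) together:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

dot = np.dot(a, b)          # inner product, eq. (3): 32
norm_a = np.linalg.norm(a)  # ||a|| = sqrt(14), eq. (5)
norm_b = np.linalg.norm(b)  # ||b|| = sqrt(77)

# the angle between a and b, from eq. (4)
theta = np.arccos(dot / (norm_a * norm_b))
print(dot, np.degrees(theta))  # 32, ~12.93 degrees
```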

Before we move on, we need to define the inner product for complex functions. Suppose we have two complex-valued functions $f(t)$ and $g(t)$, where $t$ is a real number. The inner product of $f(t)$ and $g(t)$ is defined as the following:

$$ \langle f(t), g(t) \rangle = \int_{0}^{1} f(t) \overline{g(t)} \, dt $$

where $\overline{g(t)}$ is the complex conjugate of $g(t)$. For example, if $g(t) = e^{2\pi int}$, then $\overline{g(t)} = e^{-2\pi int}$. The inner product of $f(t)$ and $g(t)$ is also called the dot product of $f(t)$ and $g(t)$.

Please read this to learn more about the complex conjugate.

Two ways of deriving the least squares solution

In this section, we will review two ways of deriving the least squares solution. The first way uses calculus; the second uses linear algebra. Again, the purpose of doing this is to give you a certain level of mathematical maturity, especially regarding the relationship between analysis and algebra.

Suppose the regression model has the following matrix form

$$
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} =
\begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1p} \\
1 & x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{np}
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} +
\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}
$$

where $y_i$ is the $i$-th observation of the dependent variable, $x_{ij}$ is the $i$-th observation of the $j$-th independent variable, $\beta_j$ is the coefficient of the $j$-th independent variable, and $\epsilon_i$ is the error term of the $i$-th observation. We assume that the error terms $\epsilon_i$ are independent and identically distributed (i.i.d.) with mean zero and variance $\sigma^2$, and that they are independent of the independent variables $x_{ij}$.

The above system can be written in the following compact matrix form:

$$ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} $$

where $\mathbf{y}$ is the vector of the dependent variable, $\mathbf{X}$ is the matrix of the independent variables, $\boldsymbol{\beta}$ is the vector of the coefficients, and $\boldsymbol{\epsilon}$ is the vector of the error terms.

Now, our goal is to find an estimate of the coefficients $\boldsymbol{\beta}$, denoted by $\hat{\boldsymbol{\beta}}$, by minimizing the residual sum of squares (RSS), which is defined as the following:

$$ RSS = \sum_{i=1}^{n} \hat{\epsilon}_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \mathbf{x}_i^T \hat{\boldsymbol{\beta}})^2 $$

where $\hat{\epsilon}_i$ is the residual of the $i$-th observation, $\hat{y}_i$ is the predicted value of the $i$-th observation, and $\mathbf{x}_i$ is the vector of the $i$-th observation of the independent variables.

Since $\hat{\boldsymbol{\epsilon}} = \mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}}$, we have

$$
\begin{aligned}
RSS & = \hat{\boldsymbol{\epsilon}}^T \hat{\boldsymbol{\epsilon}} = (\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}})^T (\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}}) \\
& = \mathbf{y}^T \mathbf{y} - 2\mathbf{y}^T \mathbf{X} \hat{\boldsymbol{\beta}} + \hat{\boldsymbol{\beta}}^T \mathbf{X}^T \mathbf{X} \hat{\boldsymbol{\beta}}
\end{aligned}
$$

The first-order condition of the above equation is the following:

$$
\begin{aligned}
\frac{\partial RSS}{\partial \hat{\boldsymbol{\beta}}} & = -2 \mathbf{X}^T \mathbf{y} + 2 \mathbf{X}^T \mathbf{X} \hat{\boldsymbol{\beta}} = 0 \\
\Rightarrow \quad \hat{\boldsymbol{\beta}} & = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
\end{aligned} \tag{6}
$$

This is the least squares solution. We can also derive the least squares solution using linear algebra. To do this, we need to rely on the following two facts: i) the fitted vector $\mathbf{X}\hat{\boldsymbol{\beta}}$ is the orthogonal projection of $\mathbf{y}$ onto the column space $\mathcal{C}(\mathbf{X})$, and ii) the residual of an orthogonal projection, $\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$, is orthogonal to every column of $\mathbf{X}$.

Therefore, we have

$$ \mathbf{X}^T (\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}}) = 0 $$

where $\mathbf{X}\hat{\boldsymbol{\beta}}$ is the projection of $\mathbf{y}$ onto $\mathcal{C}(\mathbf{X})$. Solving this equation leads to the same least squares solution as the one derived using calculus.
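
We can confirm numerically that the calculus route, eq. (6), and the orthogonality condition agree (a sketch on synthetic data; the 'true' coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data: y = 2 + 3*x1 - 1*x2 + noise
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([2.0, 3.0, -1.0]) + 0.1 * rng.normal(size=n)

# the normal equations X^T X beta = X^T y, i.e. eq. (6);
# solving them directly is numerically safer than forming the inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [2, 3, -1]

# the residual is orthogonal to the column space of X
print(X.T @ (y - X @ beta_hat))  # ~ [0, 0, 0] up to floating point
```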

Back to Euler’s formula

I hope that at this stage you have developed the following intuition:

In the real world, we need both analysis and algebra to solve real-world problems, especially problems in the fields of probability and AI.

Now, let’s study Euler’s formula again. We have

$$ e^{ix} = \cos(x) + i \sin(x) $$

Now, we will extend the above formula to the complex exponentials. To connect analysis and algebra, we will rely on the following family of functions:

$$ e_n(t) = e^{2\pi int}, \quad n = 0, \pm 1, \pm 2, \cdots $$

The inner product of two of them, $e_n(t)$ and $e_m(t)$, is the following:

$$
\begin{aligned}
\langle e_n(t), e_m(t) \rangle = \int_{0}^{1} e^{2\pi int} e^{-2\pi imt} dt & = \int_{0}^{1} e^{2\pi i(n-m)t} dt \\
& = \begin{cases} 1, & \text{if } n = m \\ 0, & \text{if } n \neq m \end{cases}
\end{aligned} \tag{7}
$$

When $n=m$, the integrand is $1$ and the derivation is easy. When $n \neq m$, we have

$$
\begin{aligned}
\int_{0}^{1} e^{2\pi i(n-m)t} dt & = \frac{1}{2\pi i(n-m)} e^{2\pi i(n-m)t} \bigg|_{0}^{1} \\
& = \frac{1}{2\pi i(n-m)} (e^{2\pi i(n-m)} - 1) \\
& = \frac{1}{2\pi i(n-m)} (1 - 1) \\
& = 0
\end{aligned}
$$

where the last step uses Euler's formula and the fact that $n - m$ is an integer:

$$ e^{2\pi i(n-m)} = \cos(2\pi(n-m)) + i \sin(2\pi(n-m)) = 1 $$

Therefore, the complex exponentials $e_n(t)$ are orthogonal to each other. They form an orthonormal basis of the vector space $\mathcal{C}([0, 1])$, which is quite similar to the standard basis of the vector space $\mathbb{R}^n$.
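
We can also verify equation (7) numerically (a sketch using a simple Riemann sum over $[0, 1]$; the grid size is an arbitrary choice):

```python
import numpy as np

def e_n(n, t):
    return np.exp(2j * np.pi * n * t)

def inner(n, m, num=1000):
    # Riemann-sum approximation of <e_n, e_m> on [0, 1]
    t = np.linspace(0, 1, num, endpoint=False)
    return np.mean(e_n(n, t) * np.conj(e_n(m, t)))

print(inner(3, 3))   # ~ 1
print(inner(3, 5))   # ~ 0
print(inner(2, -4))  # ~ 0
```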

Reflections
I hope you can appreciate that $e_n(t)$ and $e_m(t)$ are orthogonal to each other for $n \neq m$. This is the key to understanding the Fourier Series and Fourier Transform.
What I like about the above derivation is that it connects analysis and algebra: the analysis is the inner product over the complex field, and the algebra is the orthonormal basis. Again, look at $e_n(t) = e^{2\pi int}$, where $n = 0, \pm 1, \pm 2, \cdots$ and $t$ is a real number. You have a real number $t$, a complex number $e^{2\pi int}$, and a sequence indexed by $n$. This is a very rich structure.

Fourier Series

Now, we are ready to study the Fourier Series. The Fourier Series starts with the following question: can we write a given function $f(t)$ on $[0, 1]$ as a linear combination of the complex exponentials $e_n(t)$?

To answer this question, we need to find the coefficients of the linear combination, and to find the coefficients, we rely on the inner product.

Now, suppose we could write $f(t)$ as the following:

$$ f(t) = \sum_{n=-N}^{N} c_n e_n(t) = \sum_{n=-N}^N c_n e^{2\pi int} \tag{8} $$

Now, we will fix an index $k$ and pull the $k$-th term out of the summation. We have

$$ c_k e^{2\pi ikt} = f(t) - \sum_{n=-N, n \neq k}^N c_n e^{2\pi int} $$

Now, multiply both sides by $e^{-2\pi ikt}$, and integrate both sides from $0$ to $1$. We have

$$
\begin{aligned}
\int_{0}^{1} c_k e^{2\pi ikt} e^{-2\pi ikt} dt & = \int_{0}^{1} f(t) e^{-2\pi ikt} dt - \int_{0}^{1} \sum_{n=-N, n \neq k}^N c_n e^{2\pi int} e^{-2\pi ikt} dt \\
& = \int_{0}^{1} f(t) e^{-2\pi ikt} dt - \sum_{n=-N, n \neq k}^N c_n \int_{0}^{1} e^{2\pi i(n-k)t} dt \\
& = \int_{0}^{1} f(t) e^{-2\pi ikt} dt - \sum_{n=-N, n \neq k}^N c_n \delta_{nk}
\end{aligned}
$$

where $\delta_{nk}$ is the Kronecker delta: by the orthonormality relation (7), $\delta_{nk} = 1$ if $n = k$ and $0$ otherwise. Since every term in the sum has $n \neq k$, the sum vanishes, and the left-hand side is simply $c_k$. Therefore, we have

$$ c_k = \int_{0}^{1} f(t) e^{-2\pi ikt} dt = \langle f(t), e_k(t) \rangle \tag{9} $$

Therefore, the Fourier series coefficients $c_k$ are the inner products of the function $f(t)$ with the complex exponentials $e_k(t)$. Or you can say that the Fourier series coefficients $c_k$ are the coordinates of the projection of the function $f(t)$ onto the subspace of $\mathcal{C}([0, 1])$ spanned by the complex exponentials $e_k(t)$.
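
As a sketch of equations (8) and (9), we can compute the coefficients $c_k$ of a simple test function numerically and rebuild the truncated sum (the square-wave test function and the truncation level $N$ are my arbitrary choices):

```python
import numpy as np

t = np.linspace(0, 1, 2000, endpoint=False)
f = np.where(t < 0.5, 1.0, -1.0)  # a period-1 square wave as a test function

def coeff(k):
    # eq. (9): c_k = <f, e_k>, approximated by a Riemann sum
    return np.mean(f * np.exp(-2j * np.pi * k * t))

# eq. (8): partial sum with |n| <= N
N = 25
f_hat = sum(coeff(n) * np.exp(2j * np.pi * n * t) for n in range(-N, N + 1))

print(np.max(np.abs(f_hat.imag)))       # ~ 0: the rebuilt signal is real
print(np.mean(np.abs(f_hat.real - f)))  # small mean error, shrinking with N
```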

Reflections
We keep repeating the same idea: analysis (calculus and integration) on one side, and algebra (linear algebra and inner products) on the other. I hope you now have a unified view of the two. While $f(t)$ is defined for a continuous variable $t$, the Fourier series coefficients $c_n$ are defined for a discrete variable $n$.