Data Experiment #04: Recover the sine curve

Recently I found a very interesting blog entry, "Neural Nets: Time Series Prediction". The entry is a demonstration of a neural network, and the sample code is written in Scala. Because the author does not compare the neural network with other statistical learning algorithms, I decided to do so myself. But ...

Problem: recover the sine curve from a small part of it

The values of $\sin(10x)$ are given for $x=0,0.05,0.10,\cdots,0.75$. Predict the values of $\sin(10x)$ for $x=0.80, 0.85,0.90,\cdots$.

Training Data

The idea is simple: detect a recurrence relation $f$ $$\sin(10(x+3a)) = f( \sin(10(x+2a)), \sin(10(x+a)), \sin(10x) )$$ by using statistical learning. Here $a=0.05$. Our training data therefore look like the following table.

    x-coord            x0        x1        x2         y
0      0.15 -2.775558e-16  0.479426  0.841471  0.997495
1      0.20  4.794255e-01  0.841471  0.997495  0.909297
2      0.25  8.414710e-01  0.997495  0.909297  0.598472
3      0.30  9.974950e-01  0.909297  0.598472  0.141120
4      0.35  9.092974e-01  0.598472  0.141120 -0.350783
5      0.40  5.984721e-01  0.141120 -0.350783 -0.756802
6      0.45  1.411200e-01 -0.350783 -0.756802 -0.977530
7      0.50 -3.507832e-01 -0.756802 -0.977530 -0.958924
8      0.55 -7.568025e-01 -0.977530 -0.958924 -0.705540
9      0.60 -9.775301e-01 -0.958924 -0.705540 -0.279415
10     0.65 -9.589243e-01 -0.705540 -0.279415  0.215120
11     0.70 -7.055403e-01 -0.279415  0.215120  0.656987
12     0.75 -2.794155e-01  0.215120  0.656987  0.938000

If $x$ denotes a value in the x-coord column, then x0, x1, x2 and y are $\sin(10(x-3a))$, $\sin(10(x-2a))$, $\sin(10(x-a))$ and $\sin(10x)$, respectively.

import numpy as np
import pandas as pd

''' Prepare the training set '''
step = 0.05
df = pd.DataFrame(np.arange(0.15, 0.80, step=step), columns=['x-coord'])
df['x0'] = np.sin(10*(df['x-coord']-3*step)) ## sin(10*(x-3*step))
df['x1'] = np.sin(10*(df['x-coord']-2*step)) ## sin(10*(x-2*step))
df['x2'] = np.sin(10*(df['x-coord']-1*step)) ## sin(10*(x-1*step))
df['y']  = np.sin(10*df['x-coord'])          ## sin(10*x) <- target variable
print(df)

The training data are very simple, so let's build a predictive model. The following image shows the prediction of a model which I trained. The prediction is perfect. (The RMSE is 4.61466567015e-15.)

[Figure: Prediction]

This model is not a neural network

Which statistical learning algorithm did I use? The answer is in the following code.

''' Create a linear regression estimator '''
from sklearn.linear_model import LinearRegression
fit_lr = LinearRegression()
fit_lr.fit(df[['x0','x1','x2']],df['y'])
print('coefficients:', fit_lr.coef_)      ## [-0.69092766  0.21269213  1.06423747]
print('intercept   :', fit_lr.intercept_) ## 8.32667268469e-17

Yes, this is just a linear regression. No trick. The estimated intercept is almost 0. Therefore we have a very simple recurrence relation:

$$ -0.691 \sin(10x) + 0.213 \sin(10(x+a)) + 1.064 \sin(10(x+2a)) = \sin(10(x+3a))$$

''' Make a prediction '''
yhat = []
x_test = np.arange(0.8,4.05,step=0.05)
X = df.loc[df.shape[0]-1, ['x1','x2','y']].to_numpy().reshape((1,3)) ## initial input: the last three known values
for i in range(len(x_test)):
    pred = fit_lr.predict(X)[0] ## predict the value of sin(10x)
    yhat.append(pred)
    X = np.array([[X[0,1],X[0,2],pred]]) ## next input 
print('RMSE        :', np.sqrt(np.mean((np.sin(10*x_test)-yhat)**2)))
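
For reference, here is a minimal matplotlib sketch that reproduces a figure like the prediction plot above. It reuses `df`, `x_test` and `yhat` from the code; the plotting details are my own choices, not taken from the original post.

''' Plot the true curve, the training data and the predictions '''
import matplotlib.pyplot as plt

x_true = np.linspace(0, 4, 401)
plt.plot(x_true, np.sin(10*x_true), label='sin(10x)')         ## true curve
plt.plot(df['x-coord'], df['y'], 'o', label='training data')  ## known part of the curve
plt.plot(x_test, yhat, 'x', label='prediction')               ## recursively predicted values
plt.legend()
plt.show()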

What is happening?

An initial guess might be to make use of $10a = 0.5 \approx \pi/6 = 0.52\ldots$. But this approximation is far too rough to explain the accuracy of the prediction above. In fact there is an exact linear recurrence relation.

Remark. In the following discussion we write $\sin(x+\alpha)$ instead of $\sin(10x+10a)$, i.e. $\alpha = 10a$.

The key observation is the Fourier series expansion of $\sin(x+\alpha)$: $$\sin(x+\alpha) = \frac{e^{i(x+\alpha)}-e^{-i(x+\alpha)}}{2i} = -\frac{e^{-i\alpha}}{2i} e^{-ix}+\frac{e^{i\alpha}}{2i}e^{ix}.$$ Here $i=\sqrt{-1}$. Therefore $\sin(x)$, $\sin(x+\alpha)$, $\sin(x+2\alpha)$ and $\sin(x+3\alpha)$ all lie in the complex 2-dimensional vector space spanned by $e^{-ix}$ and $e^{ix}$. (Replace $\alpha$ with $0$, $2\alpha$, $3\alpha$ in the above formula.) Thus the four functions are linearly dependent (with complex coefficients).

(If you are not familiar with Fourier series, you may replace the two functions $e^{-ix}$ and $e^{ix}$ with two linearly independent vectors $\vec{a}$ and $\vec{b}$; the argument then goes through word for word.)
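
This linear dependence is easy to check numerically. In the following sketch, $\alpha=0.5$ and the four sample points are arbitrary choices of mine: evaluating $\sin(x+k\alpha)$ for $k=0,1,2,3$ at the sample points gives a matrix of rank 2.

''' Check the linear dependence numerically '''
alpha = 0.5                             ## alpha = 10*a
xs = np.array([0.3, 1.1, 2.7, 4.2])     ## four arbitrary sample points
## column k contains the values of sin(x + k*alpha) at the sample points
M = np.column_stack([np.sin(xs + k*alpha) for k in range(4)])
print(np.linalg.matrix_rank(M))         ## 2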

Put $\sin(x+3\alpha) = \beta_0 \sin(x) + \beta_1\sin(x+\alpha) + \beta_2\sin(x+2\alpha)$ ($\beta_0, \beta_1, \beta_2 \in \mathbb C$). Looking at the coefficients of $e^{-ix}$ and $e^{ix}$, we can easily obtain the following formulas.

$$ \beta_0 = t, \quad \beta_1 = -2t\cos\alpha -1, \quad \beta_2 = t + 2\cos\alpha \quad (t \in \mathbb C).$$
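
To spell this step out: the coefficients of $e^{ix}$ and $e^{-ix}$ give the two linear equations

$$ e^{3i\alpha} = \beta_0 + \beta_1 e^{i\alpha} + \beta_2 e^{2i\alpha}, \qquad e^{-3i\alpha} = \beta_0 + \beta_1 e^{-i\alpha} + \beta_2 e^{-2i\alpha}. $$

Two equations in three unknowns leave a one-parameter family of solutions, parametrized above by $t = \beta_0$.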

Therefore we have the following formula

$$\sin(x+3\alpha) = t \sin(x) - (2t\cos\alpha +1)\sin(x+\alpha) + (t+2\cos\alpha) \sin(x+2\alpha)$$

which holds for any $t$. (You can directly check that the above formula holds.)
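
For example, here is a direct numerical check of the identity; the values of $t$ and $\alpha$ below are arbitrary choices of mine.

''' Check the identity numerically for arbitrary t and alpha '''
t, alpha = 1.23, 0.5
x = np.linspace(-np.pi, np.pi, 1000)
lhs = np.sin(x + 3*alpha)
rhs = t*np.sin(x) - (2*t*np.cos(alpha) + 1)*np.sin(x + alpha) + (t + 2*np.cos(alpha))*np.sin(x + 2*alpha)
print(np.max(np.abs(lhs - rhs)))        ## ~1e-15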

Our training data are enough to determine a value of $t$, and the formula explains why the estimated intercept is almost 0. Note also that we never used the specific value of $\alpha$: the formula above holds for any $\alpha$.
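
Indeed, taking $\alpha = 10a = 0.5$ and $t$ equal to the first fitted coefficient, the formula reproduces the other two fitted coefficients. A quick check, reusing `fit_lr` from above:

''' Recover t from the fitted coefficients '''
alpha = 0.5                               ## alpha = 10*a
t = fit_lr.coef_[0]                       ## beta_0 = t
print(-2*t*np.cos(alpha) - 1)             ## ~0.2127, matches fit_lr.coef_[1]
print(t + 2*np.cos(alpha))                ## ~1.0642, matches fit_lr.coef_[2]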

Other algorithms?

The linear regression gives an almost perfect prediction together with an exact analytical explanation, so I do not try other algorithms here.

Edit [22.03.2016]: I compared a neural network with other algorithms. See the next entry.

Categories: #data-mining