In this blog post, we will analyze China’s GDP growth from the year 1960 to 2019. If the data shows a curvy trend, then linear regression will not produce very accurate results when compared to a non-linear regression.
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# read the data into a pandas dataframe
= pd.read_csv('/content/china_gdp.csv')
df df
Year | Value | |
---|---|---|
0 | 1960 | 5.918412e+10 |
1 | 1961 | 4.955705e+10 |
2 | 1962 | 4.668518e+10 |
3 | 1963 | 5.009730e+10 |
4 | 1964 | 5.906225e+10 |
5 | 1965 | 6.970915e+10 |
6 | 1966 | 7.587943e+10 |
7 | 1967 | 7.205703e+10 |
8 | 1968 | 6.999350e+10 |
9 | 1969 | 7.871882e+10 |
10 | 1970 | 9.150621e+10 |
11 | 1971 | 9.856202e+10 |
12 | 1972 | 1.121598e+11 |
13 | 1973 | 1.367699e+11 |
14 | 1974 | 1.422547e+11 |
15 | 1975 | 1.611625e+11 |
16 | 1976 | 1.516277e+11 |
17 | 1977 | 1.723490e+11 |
18 | 1978 | 1.483821e+11 |
19 | 1979 | 1.768565e+11 |
20 | 1980 | 1.896500e+11 |
21 | 1981 | 1.943690e+11 |
22 | 1982 | 2.035496e+11 |
23 | 1983 | 2.289502e+11 |
24 | 1984 | 2.580821e+11 |
25 | 1985 | 3.074796e+11 |
26 | 1986 | 2.988058e+11 |
27 | 1987 | 2.713498e+11 |
28 | 1988 | 3.107222e+11 |
29 | 1989 | 3.459575e+11 |
30 | 1990 | 3.589732e+11 |
31 | 1991 | 3.814547e+11 |
32 | 1992 | 4.249341e+11 |
33 | 1993 | 4.428746e+11 |
34 | 1994 | 5.622611e+11 |
35 | 1995 | 7.320320e+11 |
36 | 1996 | 8.608441e+11 |
37 | 1997 | 9.581594e+11 |
38 | 1998 | 1.025277e+12 |
39 | 1999 | 1.089447e+12 |
40 | 2000 | 1.205261e+12 |
41 | 2001 | 1.332235e+12 |
42 | 2002 | 1.461906e+12 |
43 | 2003 | 1.649929e+12 |
44 | 2004 | 1.941746e+12 |
45 | 2005 | 2.268599e+12 |
46 | 2006 | 2.729784e+12 |
47 | 2007 | 3.523094e+12 |
48 | 2008 | 4.558431e+12 |
49 | 2009 | 5.059420e+12 |
50 | 2010 | 6.039659e+12 |
51 | 2011 | 7.492432e+12 |
52 | 2012 | 8.461623e+12 |
53 | 2013 | 9.490603e+12 |
54 | 2014 | 1.035483e+13 |
55 | 2015 | 1.105995e+13 |
56 | 2016 | 1.123700e+13 |
57 | 2017 | 1.232317e+13 |
58 | 2018 | 1.389188e+13 |
59 | 2019 | 1.436348e+13 |
Plot the data
=(8,5))
plt.figure(figsize= (df['Year'].values, df['Value'].values)
x_data, y_data 'ro')
plt.plot(x_data, y_data, 'GDP')
plt.ylabel('Year')
plt.xlabel( plt.show()
We can see that the growth starts off slow. Then, from 2005 onwards, the growth is very significant. It decelerates slightly after the period of the 2008 global recession.
Choosing a model
Looking at the plot, a logistic function would be a good approximation, since it has the property of starting with a slow growth, increasing growth in the middle, and then decreasing again at the end.
Let’s check this assumption below:
= np.arange(-5.0, 5.0, 0.1)
X = 1.0 / (1.0 + np.exp(-X))
Y
plt.plot(X, Y)'Dependent Variable')
plt.ylabel('Independent Variable')
plt.xlabel( plt.show()
Build the model
Let’s build our regression model and initialize its parameters.
def sigmoid(x, Beta_1, Beta_2):
= 1/ (1 + np.exp(-Beta_1 * (x - Beta_2)))
y return y
Let’s look at a sample sigmoid line that might fit with the data.
= 0.1
beta_1 = 1990
beta_2
# logistic function
= sigmoid(x_data, beta_1, beta_2)
Y_pred
# plot initial prediction againts data points
*15000000000000)
plt.plot(x_data, Y_pred'ro') plt.plot(x_data, y_data,
[<matplotlib.lines.Line2D at 0x7f5c93a4d780>]
Our task is to find the best parameters for the model.
First, lets normalize our x and y.
= x_data / max(x_data)
xdata = y_data / max(y_data) ydata
How can we find the best parameters for our fit line?
We can use curve_fit
, which uses non-linear least squares to fit our sigmoid function to the data.
from scipy.optimize import curve_fit
= curve_fit(sigmoid, xdata, ydata)
popt, pcov
# print the final parameters
print('beta_1=%f, beta_2=%f' % (popt[0], popt[1]))
beta_1 = 571.415035, beta_2 = 0.995885
Plot the model
# plot the resulting regression model
= np.linspace(1960, 2015, 55)
x = x/max(x)
x =(8,5))
plt.figure(figsize= sigmoid(x, *popt)
y 'ro', label='data')
plt.plot(xdata, ydata, =3.0, label='fit')
plt.plot(x, y, linewidth='best')
plt.legend(loc'GDP')
plt.ylabel('Year')
plt.xlabel( plt.show()
Train/Test Split the data
Split data into training and testing sets.
= np.random.randn(len(df)) < 0.8
msk = x_data[msk]
train_x = xdata[~msk]
test_x = y_data[msk]
train_y = ydata[~msk] test_y
Build the model using the train set.
= curve_fit(sigmoid, train_x, train_y) popt, pcov
/usr/local/lib/python3.6/dist-packages/scipy/optimize/minpack.py:808: OptimizeWarning: Covariance of the parameters could not be estimated
category=OptimizeWarning)
Predict GDP using the test set.
= sigmoid(test_x, *popt) y_hat
Evaluate the model
print('Mean absolute error: %.2f' % np.mean(np.absolute(y_hat - test_y)))
print('Residual sum of error (MSE): %.2f' % np.mean((y_hat - test_y)**2))
from sklearn.metrics import r2_score
print('R2-score: %.2f' % r2_score(y_hat, test_y))
Mean absolute error: 0.40
Residual sum of error (MSE): 0.17
R2-score: -34427.16