NUS AI-ML Program: Population Regression (2024 Summer)
This post documents my complete workflow for the Population Regression Project, completed during the 2024 Summer AI-ML Research Program at the National University of Singapore (NUS), where the group I led was awarded the sole Outstanding Team honor.
Artificial Intelligence and Machine Learning (AI-ML)
- Instructor: Prof. Mehul Motani
- AI-ML Project Description: P1-AIML-Project-Description-2023.pdf
Introduction
Why is population forecasting an important problem worth working on?
- Studying how and why populations change helps scientists better predict future changes in population size and growth rates. This is essential for answering questions in areas such as biodiversity conservation, resource allocation, economic planning, and public policy formulation.
- Studying population forecasting can also help scientists understand the causes of changes in population size and growth rates, and respond appropriately to those influences.
- Studying population growth can give scientists insight into how organisms are connected to their environment and how organisms interact with each other. This is especially relevant in today's era of global warming and increasing population aging.
Problem Description
- The primary problem addressed in this project is the accurate prediction of future population trends from historical population data. Specifically, the project involves building a machine learning model to estimate the future population of Singapore based on historical data from 1950 to 2023. The goal is to develop a reliable model that can provide accurate population forecasts for the years 2024, 2030, and 2050.
- In addition to forecasting Singapore's population, this project also explores a comparative analysis between the population trends and predictions of Singapore and the United States. By examining the historical population data and future projections for both countries, we aim to uncover differences in demographic patterns, growth rates, and potential influencing factors.
Exploratory Data Analysis
In this section, we conduct an Exploratory Data Analysis (EDA) of the population data obtained from the Singapore Department of Statistics (SingStat). EDA is a crucial step in any data science project as it helps us understand the structure, quality, and characteristics of the data before applying any machine learning models. This process involves loading the data, inspecting its structure, and summarizing its main characteristics using both visual and quantitative methods.
Loading the Data
First, we load the dataset into a Pandas DataFrame. The dataset contains population statistics of Singapore from 1950 to 2023. We use the pd.read_excel function to read the data from an Excel file, specifying the header row, index column, number of rows to read, and how to handle missing values:

```python
from IPython.display import display
import pandas as pd
import numpy as np

# Limit how much of the wide DataFrame is rendered in the notebook.
pd.set_option('display.max_columns', 2)
pd.set_option('display.max_rows', 10)

# header=9: the data series names begin partway down the SingStat sheet;
# na_values=["na"]: SingStat marks missing entries as "na".
df = pd.read_excel("./Singapore-Population-1950-2023.xlsx",
                   header=9, index_col=0, nrows=30, na_values=["na"])
df
```
```
                                                         2023  ...       1950
Data Series                                                    ...
Total Population (Number)                           5917648.0  ...  1022100.0
Resident Population (Number)                        4149253.0  ...        NaN
Singapore Citizen Population (Number)               3610658.0  ...        NaN
Permanent Resident Population (Number)               538595.0  ...        NaN
Non-Resident Population (Number)                    1768395.0  ...        NaN
...                                                       ...  ...        ...
Age Dependency Ratio: Citizens Aged Under 20
  Years And 65 Years & Over Per Hundred Citizens
  Aged 20-64 Years (Number)                              63.9  ...        NaN
Child Dependency Ratio: Citizens Aged Under 20
  Years Per Hundred Citizens Aged 20-64 Years (Number)   32.6  ...        NaN
Old-Age Dependency Ratio: Citizens Aged 65 Years
  & Over Per Hundred Citizens Aged 20-64 Years (Number)  31.3  ...        NaN
Resident Natural Increase (Number)                     4951.0  ...    34059.0
Rate Of Natural Increase (Per Thousand Residents)         1.2  ...       33.4

[29 rows x 74 columns]
```
The dataset consists of 29 rows and 74 columns: each column corresponds to a year from 1950 to 2023, and each row is a demographic data series. The rows include metrics such as total population, resident population, citizen population, permanent resident population, non-resident population, population growth rates, population density, sex ratio, median age, dependency ratios, and natural increase rates.
Inspecting the Data Structure
To get a preliminary understanding of the data, we use the info() method. This method provides a concise summary of the DataFrame, including the number of non-null entries in each column, the data types, and the memory usage. This information is critical for identifying any missing values or data type issues that need to be addressed:

```python
df.info()
```
```
<class 'pandas.core.frame.DataFrame'>
Index: 29 entries, Total Population (Number) to Rate Of Natural Increase (Per Thousand Residents)
Data columns (total 74 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   2023    29 non-null     float64
 1   2022    29 non-null     float64
 2   2021    29 non-null     float64
 ...
 71  1952    5 non-null      float64
 72  1951    5 non-null      float64
 73  1950    5 non-null      float64
dtypes: float64(74)
memory usage: 17.0+ KB
```
The info() method reveals that the dataset is predominantly composed of float64 data types, which is appropriate for numerical population data. There are some columns with missing values, especially in the earlier years (e.g., 1950-1960), which is not uncommon in historical datasets.
It is worth noting that five variables have no missing values in any year; these play an important role later when we improve the model.
Summarizing the Data
Next, we use the describe() method to generate descriptive statistics of the dataset. This method provides a summary of the central tendency, dispersion, and shape of the dataset's distribution, excluding NaN values. The output includes metrics such as count, mean, standard deviation, minimum, and maximum values for each numerical column. This step helps in understanding the overall distribution and variability in the data:

```python
pd.set_option('display.max_columns', 8)
pd.reset_option('display.max_rows')
df.describe()
```
```
               2023          2022          2021          2020  ...          1953          1952          1951          1950
count  2.900000e+01  2.900000e+01  2.900000e+01  2.900000e+01  ...  5.000000e+00  5.000000e+00  5.000000e+00  5.000000e+00
mean   5.516914e+05  5.297643e+05  5.142706e+05  5.323568e+05  ...  2.471966e+05  2.334658e+05  2.210064e+05  2.114740e+05
std    1.462173e+06  1.408811e+06  1.370533e+06  1.412252e+06  ...  5.283716e+05  4.997826e+05  4.737871e+05  4.533883e+05
min    1.200000e+00  1.600000e+00 -4.100000e+00 -3.000000e-01  ...  5.700000e+00  5.500000e+00  4.500000e+00  4.400000e+00
25%    2.040000e+01  2.070000e+01  2.080000e+01  2.070000e+01  ...  3.610000e+01  3.470000e+01  3.340000e+01  3.340000e+01
50%    3.260000e+01  3.280000e+01  3.300000e+01  3.290000e+01  ...  1.149000e+03  1.153000e+03  1.159000e+03  1.173000e+03
75%    9.500000e+02  9.550000e+02  9.600000e+02  9.570000e+02  ...  4.299200e+04  3.913600e+04  3.573500e+04  3.405900e+04
max    5.917648e+06  5.637022e+06  5.453566e+06  5.685807e+06  ...  1.191800e+06  1.127000e+06  1.068100e+06  1.022100e+06

[8 rows x 74 columns]
```
Graph the total population vs year
In this section, we aim to visualize the trend of total population over the years for Singapore. This is a crucial step in our exploratory data analysis (EDA) as it helps us understand the historical growth pattern and provides a foundation for our subsequent predictive modeling efforts.
- We extract the years and corresponding population values from the dataset.
- We use the seaborn and matplotlib libraries to create a line plot. These libraries are chosen for their ease of use and their ability to produce clear, informative visualizations.
- We configure the plot with appropriate titles, labels, and styles to ensure clarity and readability, as sketched below.
```python
import matplotlib.pyplot as plt
```
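A minimal sketch of the rest of this plotting cell, continuing from the import above. The way x and y are extracted here is an assumption on our part; the later cells only require that x holds the years and y the total-population values, both in the dataset's descending-year order:

```python
import seaborn as sns

# Years (the column labels) and the total-population row, in the
# dataset's descending-year order (2023 first, 1950 last).
x = df.columns.astype(int)
y = df.loc["Total Population (Number)"].astype(int)

sns.set(style="whitegrid")              # light grid for readability
plt.figure(figsize=(10, 5))
sns.lineplot(x=x, y=y, marker="o")
plt.title("Total Population of Singapore, 1950-2023")
plt.xlabel("Year")
plt.ylabel("Total Population")
plt.show()
```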
Overall Growth Trend:
- The population of Singapore has shown a consistent upward trend from 1950 to 2022. This indicates a steady increase in the number of inhabitants over the years.
Recent Trends:
- In the most recent years (2020-2022), there is a slight dip followed by a sharp increase. This could be due to recent global events such as the COVID-19 pandemic, which may have temporarily impacted population growth due to factors like migration, birth rates, and mortality rates.
Use linear regression to build an estimator of the total population of Singapore in the future.
- Use the data for years 2019 and earlier as training data.
The primary goal of this section is to build a linear regression model that can predict the future population of Singapore based on historical data. We will use data from 2019 and earlier as training data and validate the model on data from 2020 onwards.
Linear regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. In this context, the independent variable is the year, and the dependent variable is the total population of Singapore.
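Concretely, the model fits a straight line

$$\hat{y} = \beta_0 + \beta_1 x,$$

where $x$ is the year, $\hat{y}$ is the (standardized) total population, $\beta_1$ is the slope, and $\beta_0$ is the y-intercept. Ordinary least squares chooses the coefficients that minimize $\sum_i \left(y_i - \beta_0 - \beta_1 x_i\right)^2$ over the training years.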
Data Preparation
We start by preparing the data for training and testing. The data from 2019 and earlier are used as training data, while the data from 2020 onwards are used as test data.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# The first four entries correspond to 2023-2020 (descending order),
# so index 4 onwards covers 2019 back to 1950.
x_tr = np.array(x[4:])
y_tr = np.array(y[4:])
x_test = np.array(x[:4])
y_test = np.array(y[:4])

pd.set_option('display.max_columns', 6)
display(pd.DataFrame({"Training Data": y_tr}, index=x_tr).T)
display(pd.DataFrame({"Test Data": y_test}, index=x_test).T)
```
```
                  2019     2018     2017  ...     1952     1951     1950
Training Data  5703569  5638676  5612253  ...  1127000  1068100  1022100

[1 rows x 70 columns]
```

```
              2023     2022     2021     2020
Test Data  5917648  5637022  5453566  5685807
```
Here, we use the StandardScaler to normalize the training data.
In our project, we aim to predict the future population of Singapore using historical data. Population data, by its nature, involves very large numbers; Singapore's population has been in the millions for several decades. When such large-scale values are fed directly into machine learning models, particularly linear regression, we encounter several challenges:
- Error Metrics: When calculating performance metrics such as the Mean Squared Error (MSE), the large scale of the raw population data produces error values in the millions, which makes the model's performance difficult to interpret.
- Numerical Instability: Machine learning algorithms, including linear regression, perform numerous mathematical operations. When these operations involve very large numbers, they can lead to numerical instability, which might result in inaccurate model parameters and predictions.
To address these challenges, we employ data normalization techniques, specifically using the StandardScaler from the sklearn.preprocessing module. Normalization transforms the data to have a mean of 0 and a standard deviation of 1.
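For each raw value $x$, the scaler computes

$$z = \frac{x - \mu}{\sigma},$$

where $\mu$ and $\sigma$ are the mean and standard deviation estimated from the training data. The same $\mu$ and $\sigma$ are reused later to map standardized predictions back to raw population counts via the inverse transform $x = \mu + \sigma z$.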
```python
# Fit the scaler on the training targets and standardize them in place.
y_tr = scaler.fit_transform(y_tr.reshape(-1, 1)).flatten()
display(scaler)
print("Training Data:")
pd.DataFrame({"Original Train Data": np.array(y[4:]), "Scaled Train Data": y_tr}, index=x_tr).T
```
```
Training Data:
                             2019          2018          2017  ...          1952          1951          1950
Original Train Data  5.703569e+06  5.638676e+06  5.612253e+06  ...  1.127000e+06  1.068100e+06  1.022100e+06
Scaled Train Data    1.915539e+00  1.868530e+00  1.849389e+00  ... -1.399780e+00 -1.442448e+00 -1.475771e+00

[2 rows x 70 columns]
```
As shown above, the scaling operation maps the original values, which are on the order of millions, onto a distribution centered at 0 with a standard deviation of 1.
Model Training
We then trained a linear regression model using the prepared training data. The LinearRegression class from sklearn.linear_model was employed for this purpose. The model was fitted using the year as the independent variable and the standardized population as the dependent variable.

```python
import sklearn.linear_model as lm
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold

lr = lm.LinearRegression()
lr.fit(x_tr.reshape(-1, 1), y_tr)        # train on years 2019 and earlier
y_re = lr.predict(x_tr.reshape(-1, 1))   # in-sample predictions on the training years
y_pr = lr.predict(x_test.reshape(-1, 1)) # predictions for the held-out years 2020-2023
lr
```
Performance metrics
What are the slope and y-intercept of the best fit line? Plot the best fit line over the empirical data.
The linear regression model yielded the following coefficients:

```python
pd.DataFrame({"Slope": lr.coef_, "y-intercept": lr.intercept_}, index=["Model Coefficients"])
```

```
                       Slope  y-intercept
Model Coefficients  0.048582   -96.411442
```
```python
sns.set(style="whitegrid")
```
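A minimal sketch of the remainder of this plotting cell, assuming the x, y, x_tr, x_test, y_re, and y_pr variables defined earlier; the exact styling here is illustrative, with colors matching the description below:

```python
plt.figure(figsize=(10, 5))

# Empirical data, standardized so it lives on the same scale as the model.
plt.plot(x, scaler.transform(np.array(y).reshape(-1, 1)).flatten(),
         "o-", color="tab:blue", markersize=3, label="Original Curve")

# Best fit line on the training years (slope/intercept reported above).
plt.plot(x_tr, y_re, color="tab:green", label="Fitting Curve")

# Held-out years 2020-2023: actual values and model predictions.
plt.plot(x_test, scaler.transform(y_test.reshape(-1, 1)).flatten(),
         "o", color="tab:red", label="Test Data")
plt.plot(x_test, y_pr, "x", color="tab:green", label="Prediction")

plt.title("Best Fit Line over the Empirical Population Data")
plt.xlabel("Year")
plt.ylabel("Standardized Total Population")
plt.legend()
plt.show()
```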
- Original Curve (Blue Dots and Line): This represents the empirical data of Singapore's population over the years. The population shows a clear increasing trend, with some periods of accelerated growth and others of relative stability.
- Fitting Curve (Green Line): This line represents the linear regression model's best fit to the training data (years up to 2019). The fitting curve captures the general upward trend in the population data but does not account for the non-linear fluctuations present in the empirical data. This limitation is inherent in linear models when applied to complex real-world phenomena that exhibit non-linear patterns.
- Test Data (Red Dots): These points represent the actual population data for the years 2020 to 2023, which were not used in training the model. The predictions for these years are also plotted. The proximity of the red dots to the green line indicates the model's performance in predicting the population for these years. While the model's predictions are reasonably close to the actual values, there are noticeable deviations, particularly for the year 2022.
What is the R² coefficient and mean squared error (MSE) of the estimator on the training data?
```python
def SSres(y, y_hat):
    # Residual sum of squares between observations and predictions.
    return np.sum((y - y_hat) ** 2)
```
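The rest of this cell follows the textbook definitions; the SStot, R2, and MSE helpers below are our reconstruction of the remaining lines (the MSE helper is reused for the test-set evaluation further down):

```python
def SStot(y):
    # Total sum of squares: spread of the observations around their mean.
    return np.sum((y - np.mean(y)) ** 2)

def R2(y, y_hat):
    # Coefficient of determination: 1 - SSres / SStot.
    return 1 - SSres(y, y_hat) / SStot(y)

def MSE(y, y_hat):
    # Mean squared error.
    return np.mean((y - y_hat) ** 2)

pd.DataFrame({"On Training Data": [R2(y_tr, y_re), MSE(y_tr, y_re)]}, index=["R2", "MSE"])
```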
```
     On Training Data
R2           0.963565
MSE          0.036435
```
Use years 2020 and later as test data and predict the population for those years
```python
y_pr = scaler.inverse_transform(lr.predict(x_test.reshape(-1,1)).reshape(-1, 1)).flatten()
# Compare actual vs. predicted populations (display step reconstructed from the output below).
pd.DataFrame({"Test Data": y_test, "Prediction": y_pr}, index=x_test).T
```

```
                    2023          2022          2021          2020
Test Data   5.917648e+06  5.637022e+06  5.453566e+06  5.685807e+06
Prediction  5.641280e+06  5.574215e+06  5.507151e+06  5.440087e+06
```
What is the MSE of the estimator on the test data?
```python
pd.DataFrame({"MSE": MSE(scaler.transform(y_test.reshape(-1,1)).flatten(), scaler.transform(y_pr.reshape(-1,1)).flatten())}, index=["On Test Data"]).T
```

```
     On Test Data
MSE      0.018836
```
What is your estimate of Singapore’s population in 2024, 2030 and 2050?
- Do you think these estimates are reasonable? Explain your answer.
```python
x_es = np.array([2024, 2030, 2050])
```
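The remainder of the cell presumably mirrors the test-set prediction step: predict in standardized space, then undo the scaling. A sketch under that assumption:

```python
# Predict standardized populations for 2024/2030/2050, then invert the scaling.
y_es = scaler.inverse_transform(lr.predict(x_es.reshape(-1, 1)).reshape(-1, 1)).flatten()
pd.DataFrame({"Estimate": y_es.astype(int)}, index=x_es).T
```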
```
             2024     2030     2050
Estimate  5708344  6110730  7452019
```
According to our model's predictions, the population of Singapore in 2030 and 2050 will be 6,110,730 and 7,452,019, respectively. We believe these two estimates are unreasonable. Whether viewed from common sense or from a geographical perspective, Singapore has entered a modern demographic regime characterized by low birth rates, low mortality rates, and low natural growth rates. Extrapolating the population 20+ years ahead with a linear model is therefore inaccurate; we believe Singapore's total population as a function of time should behave more like a logarithmic curve, with growth flattening over time.
What pattern do you expect for human population growth in Singapore?
We believe that Singapore's population is about to enter a period of slow growth, with the natural growth rate gradually decreasing and possibly even turning negative. Moreover, population aging will deepen.
Therefore, Singapore may continue to rely on immigration policies to maintain its population size, especially in attracting high skilled immigrants to promote economic development. However, immigration policies also need to balance social acceptance and population stability.
How could you improve your estimates of the future population?
Use various machine learning models
To improve the accuracy of our future population estimates, we explored the use of various machine learning models beyond the simple linear regression model initially employed. The rationale behind this approach is to leverage the strengths of different algorithms in capturing complex patterns and relationships within the data, which may not be adequately addressed by a single model.
We selected four different machine learning models for comparison:
- Linear Regressor: A baseline model using linear regression.
- Support Vector Regression (SVR): A model that uses a radial basis function (RBF) kernel to capture non-linear relationships.
- Decision Tree Regressor: A model that splits the data into subsets based on feature values, capturing non-linear patterns.
- Random Forest Regressor: An ensemble method that builds multiple decision trees and merges them to improve accuracy and control over-fitting.
```python
from sklearn.linear_model import LinearRegression
```
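The rest of the comparison cell, reconstructed as a sketch. The evaluate dictionary and the hyperparameters are our assumptions (the SVR settings simply mirror the multivariate cell later in the post), and the loop reuses the MSE and R2 helpers defined earlier:

```python
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

models = {
    "Linear Regressor": LinearRegression(),
    "Support Vector Regression": SVR(kernel='rbf', C=100, gamma=0.1, epsilon=.1),
    "Decision Tree Regressor": DecisionTreeRegressor(),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=100),
}
evaluate = {"MSE on Train Data": [], "R2 on Train Data": [], "MSE on Test Data": []}

# Standardize the held-out targets so errors are comparable across models.
y_test_sc = scaler.transform(y_test.reshape(-1, 1)).flatten()

for name, model in models.items():
    model.fit(x_tr.reshape(-1, 1), y_tr)
    y_re = model.predict(x_tr.reshape(-1, 1))
    y_pr = model.predict(x_test.reshape(-1, 1))
    evaluate["MSE on Train Data"].append(MSE(y_tr, y_re))
    evaluate["R2 on Train Data"].append(R2(y_tr, y_re))
    evaluate["MSE on Test Data"].append(MSE(y_test_sc, y_pr))

pd.DataFrame(evaluate, index=models.keys()).T
```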
```
                   Linear Regressor  Support Vector Regression  Decision Tree Regressor  Random Forest Regressor
MSE on Train Data          0.036435                   0.007985                 0.000000                 0.000173
R2 on Train Data           0.963565                   0.992015                 1.000000                 0.999827
MSE on Test Data           0.018836                   1.225470                 0.014835                 0.014400
```
We evaluated these models using the following performance metrics:
- Mean Squared Error (MSE) on Train Data: Measures the average of the squares of the errors between the predicted and actual values on the training dataset.
- R² on Train Data: Represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s) on the training dataset.
- MSE on Test Data: Measures the average of the squares of the errors between the predicted and actual values on the test dataset.
```python
import matplotlib.pyplot as plt
```
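A sketch of one way to visualize these numbers, assuming the evaluate dictionary and models mapping from the sketch above:

```python
# Grouped bar chart: one group per model, one bar per metric.
pd.DataFrame(evaluate, index=models.keys()).plot(kind="bar", figsize=(10, 5))
plt.title("Model Comparison (Univariate Features)")
plt.ylabel("Score")
plt.xticks(rotation=15)
plt.tight_layout()
plt.show()
```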
From the evaluation results, we observe the following:
- The Linear Regressor performs reasonably well, with an MSE of 0.018836 on the test data. This suggests that it captures the general trend in the data, but there may be non-linear patterns it fails to account for, leading to a slightly higher MSE on the test data compared to the tree-based models.
- The Support Vector Regression model, despite its strong performance on the training data (low MSE and high R²), exhibits poor generalization to the test data, with a high MSE of 1.225470. This discrepancy indicates that the SVR model is likely overfitting the training data.
- The Decision Tree Regressor shows excellent performance on the training data, with an MSE of 0.000000 and an R² of 1.000000. On the test data, it also performs well, with an MSE of 0.014835. The perfect fit on the training data suggests that the Decision Tree Regressor has fully memorized the training data, which is a classic sign of overfitting. However, its reasonable performance on the test data indicates that, despite overfitting, it still captures some underlying patterns; the test data is evidently similar enough to the training data to mitigate some of the negative effects.
- The Random Forest Regressor also shows excellent performance on the training data, with an MSE of 0.000195 and an R² of 0.999805. On the test data, it performs slightly better than the Decision Tree Regressor, with an MSE of 0.014362. The slightly lower test MSE suggests that the Random Forest is better at capturing the true underlying patterns without overfitting as severely.
This suggests that ensemble methods like the Random Forest Regressor may provide better generalization and robustness in predicting future population values.
Feature Engineering
To further enhance our predictions, we conducted exploratory analysis and feature engineering on additional demographic variables. During the initial EDA, we observed that four features, Total Population Growth (TPG), Sex Ratio (SR), Resident Natural Increase (RNI), and Rate Of Natural Increase (RONI), had no missing values across all years. This prompted us to examine the relationships between these features and the total population (TP), as incorporating them could potentially improve the predictive performance of our models.

```python
# Transpose so years become rows, then keep only columns with no missing values.
df2 = df.T
df2 = df2.loc[:, ~df2.isnull().any()]
df2.head()
```
```
Data Series  Total Population (Number)  Total Population Growth (Per Cent)  Sex Ratio (Males Per Thousand Females)  Resident Natural Increase (Number)  Rate Of Natural Increase (Per Thousand Residents)
2023                         5917648.0                                 5.0                                   950.0                              4951.0                                                1.2
2022                         5637022.0                                 3.4                                   955.0                              6704.0                                                1.6
2021                         5453566.0                                -4.1                                   960.0                             10913.0                                                2.7
2020                         5685807.0                                -0.3                                   957.0                             13248.0                                                3.3
2019                         5703569.0                                 1.2                                   957.0                             15042.0                                                3.7
```
To gain insights into the relationships between the total population (TP) and other demographic variables, we performed a correlation analysis and visualized the results using a pair plot and a heatmap.
Pair Plot Analysis
We created a pair plot to visualize the relationships between these variables.
A pair plot, also known as a scatterplot matrix, is a grid of plots that visualizes the relationship between each pair of variables in a dataset. It combines histograms and scatter plots, providing a compact overview of the dataset's distributions and correlations. The primary purpose of a pair plot is to simplify the initial stages of data analysis by offering a comprehensive snapshot of potential relationships within the data.

```python
# Shorten the column names, then draw the scatterplot matrix.
df2.columns = ["TP", "TPG", "SR", "RNI", "RONI"]
sns.pairplot(df2)
plt.show()
```
Pearson Correlation Matrix Analysis
The pair plots of TP against the other features already suggest a clear correlation between these four variables and TP. To quantify these relationships, we computed the correlation matrix and visualized it using a heatmap.
A correlation matrix is a table that summarizes the pairwise relationships between the variables in a dataset. Each cell contains a coefficient, where:

- 1 indicates a perfect positive relationship between variables.
- 0 indicates no relationship.
- -1 indicates a perfect negative relationship.
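These coefficients are Pearson correlation coefficients, which is what the pandas corr() method computes by default:

$$r_{XY} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$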
The correlation matrix is particularly useful in building regression models, as it helps identify which features are most strongly associated with the target variable.

```python
# Correlations of TP (the first row of the matrix) with the other features.
df3 = pd.DataFrame(df2.corr().iloc[0, 1:]).T
df3
```
```
         TPG        SR       RNI      RONI
TP -0.404611 -0.932966 -0.872509 -0.868613
```
```python
fig, ax = plt.subplots(figsize=(16, 3))
```
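The rest of the heatmap cell, sketched with seaborn; the annotation and colormap choices are ours:

```python
# Annotated heatmap of TP's correlation with each candidate feature.
sns.heatmap(df3, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation of Demographic Features with Total Population (TP)")
plt.show()
```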
Sex Ratio (SR), Resident Natural Increase (RNI), and Rate Of Natural Increase (RONI) show strong negative correlations with TP, indicating that these features are likely to be valuable predictors in our models.
Multivariate Regression Models
To further enhance the accuracy of our population estimates, we incorporated additional demographic features into our regression models. This approach allows us to capture more complex relationships within the data, which a univariate model might miss.
Data Preparation
We began by merging the demographic features with the year information and then extracting the relevant columns for our analysis:

```python
# Attach the year (taken from the original column labels) to the features,
# then drop TPG, whose correlation with TP was the weakest of the four.
df4 = pd.concat([df.columns.to_frame(name="Year"), df2], axis=1)
df4 = df4.reset_index(drop=True).reindex(columns=["Year", "TPG", "SR", "RNI", "RONI", "TP"]).drop(columns="TPG")
X = df4.iloc[:, :4].astype(float)
y = df4.iloc[:, 4].astype(int)
display(X.head())
display(y[:5])
```
```
     Year     SR      RNI  RONI
0  2023.0  950.0   4951.0   1.2
1  2022.0  955.0   6704.0   1.6
2  2021.0  960.0  10913.0   2.7
3  2020.0  957.0  13248.0   3.3
4  2019.0  957.0  15042.0   3.7

0    5917648
1    5637022
2    5453566
3    5685807
4    5703569
Name: TP, dtype: int32
```
```python
# Convert to NumPy arrays and confirm the shapes (remaining lines
# reconstructed from the output below).
X = np.array(X)
y = np.array(y)
X.shape, y.shape
```

```
((74, 4), (74,))
```
Data Splitting and Scaling
We split the data into training and testing sets and applied standard scaling to normalize the features and target values:

```python
# The first four rows correspond to 2023-2020; hold them out as the test set.
X_tr = X[4:, :]
y_tr = y[4:]

scaler_X = StandardScaler()
scaler_y = StandardScaler()
X_tr = scaler_X.fit_transform(X_tr)
y_tr = scaler_y.fit_transform(y_tr.reshape(-1, 1)).flatten()

X_test = X[:4, :]
y_test = y[:4]
# Scale the test set with the statistics learned from the training set.
X_test = scaler_X.transform(X_test)
y_test = scaler_y.transform(y_test.reshape(-1, 1)).flatten()

X_test.shape, X_tr.shape
```
```
((4, 4), (70, 4))
```
Model Training and Evaluation
We trained and evaluated multiple regression models using 10-fold cross-validation to ensure robust performance metrics.
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. In K-fold cross-validation, the training data is divided into K subsets, and K rounds of training and validation are performed. In each round, one subset serves as the validation set while the remaining K-1 subsets form the training set. The performance metrics from the K rounds are then averaged to evaluate the model.
```python
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import Ridge

evaluate2 = {"Average MSE On Test Data": [], "Average R2 On Test Data": []}
models2 = {
    "Linear Regressor": LinearRegression(),
    "Support Vector Regression": SVR(kernel='rbf', C=100, gamma=0.1, epsilon=.1),
    "Decision Tree Regressor": DecisionTreeRegressor(),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=100),
}

# 10-fold cross-validation with shuffled, reproducible splits.
kf = KFold(n_splits=10, shuffle=True, random_state=42)

for name, model in models2.items():
    model.fit(X_tr, y_tr)
    y_re = model.predict(X_tr)
    y_pr = model.predict(X_test)
    # Score each model across the 10 folds; sklearn returns negated MSE.
    mse_scores = cross_val_score(model, X_tr, y_tr, cv=kf, scoring='neg_mean_squared_error')
    r2_scores = cross_val_score(model, X_tr, y_tr, cv=kf, scoring='r2')
    evaluate2["Average MSE On Test Data"].append(-mse_scores.mean())
    evaluate2["Average R2 On Test Data"].append(r2_scores.mean())

pd.DataFrame(evaluate2, index=models2.keys()).T
```
```
                          Linear Regressor  Support Vector Regression  Decision Tree Regressor  Random Forest Regressor
Average MSE On Test Data          0.012449                   0.005609                 0.008630                 0.002920
Average R2 On Test Data           0.982033                   0.992568                 0.992839                 0.996275
```
```python
import matplotlib.pyplot as plt
```
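As with the earlier comparison, a grouped bar chart is one way to present the cross-validation scores; a sketch assuming evaluate2 and models2 from the previous cell:

```python
# Grouped bar chart of the 10-fold cross-validation metrics per model.
pd.DataFrame(evaluate2, index=models2.keys()).plot(kind="bar", figsize=(10, 5))
plt.title("Model Comparison (Multivariate Features, 10-Fold CV)")
plt.ylabel("Score")
plt.xticks(rotation=15)
plt.tight_layout()
plt.show()
```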
Model Performance Analysis
- Linear Regressor: The linear regression model performed reasonably well, with an average MSE of 0.012449 and an average R² of 0.982033. This indicates that the model explains approximately 98.2% of the variance across the validation folds, but there is still room for improvement.
- Support Vector Regression (SVR): The SVR model showed a significant improvement over the linear regressor, with an average MSE of 0.005609 and an average R² of 0.992568. This suggests that the SVR model captures non-linear relationships in the data more effectively.
- Decision Tree Regressor: The decision tree regressor also performed well, with an average MSE of 0.008860 and an average R² of 0.993103. This model can capture complex patterns in the data, leading to high accuracy.
- Random Forest Regressor: The random forest regressor outperformed all other models, with the lowest average MSE of 0.003095 and the highest average R² of 0.995590. The ensemble approach of combining multiple decision trees helps reduce overfitting and improve generalization, making it the most robust model among those tested.
The results indicate that incorporating additional demographic features and using more sophisticated models can significantly improve the accuracy of population estimates. The random forest regressor, in particular, demonstrates superior performance, suggesting that it is well-suited for capturing the complex relationships within the dataset.
Notably, the training error of the linear regression model was greatly reduced after incorporating multiple variables, indicating that the model learned the information in the data better.
Additionally, we observed that the generalization ability of the SVR model improved markedly once multiple variables were introduced: the univariate SVR overfit severely, while the multivariate SVR predicts normally.




