
Documentation

Downloading and visualizing the data

import pandas as pd
data = pd.read_csv('/content/Country-data.csv')  # loading the dataset
data
  • Loaded the dataset into a DataFrame using pd.read_csv from the pandas library.
data[['exports','health','imports']] = data[['exports','health','imports']].apply(lambda x : x*data["gdpp"]/100)    #multiplying by %age GDP
  • Converted these columns from percentage-of-GDP values to absolute values by multiplying each by gdpp/100 inside a lambda function.
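The percentage-to-absolute conversion above can be sketched on a tiny toy frame (toy values only, not from Country-data.csv):

```python
import pandas as pd

# exports is given as a percentage of per-capita GDP (gdpp)
toy = pd.DataFrame({'exports': [20.0, 35.0], 'gdpp': [1000.0, 2000.0]})
toy['exports'] = toy['exports'] * toy['gdpp'] / 100  # convert to an absolute value
```

After the conversion, the first row's exports becomes 20% of 1000, i.e. 200.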
data.isna().sum()
  • Checked for null values in each column using data.isna().sum().
len(data['country'].unique()) == len(data)
  • Checked for duplicates in the country column by comparing the number of unique values against the dataset length.
data.shape
  • using .shape to check the dimensions of dataset
data.dtypes
  • using .dtypes to check the datatypes of each column
data.describe()
  • using .describe() to do statistical analysis of dataset
df = data.drop('country', axis=1)
columns = df.columns
columns
  • using .drop to drop the country column from the dataset
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(nrows=3, ncols=3, figsize=(15, 15))
for i in range(9):
    plt.subplot(3, 3, i + 1)  # i+1 because subplot indices start at 1 (3, 3, 1)
    sns.histplot(df[columns[i]])  # plotting histograms
plt.show()
  • using matplotlib and seaborn to plot a histogram of each column with sns.histplot

Similarly plotted the rest of the columns

plt.figure(figsize=(10,10))
corr = df.corr()  # use the numeric DataFrame; the country column is non-numeric
sns.heatmap(corr, annot=True)
  • calculated the correlation matrix using df.corr() (the non-numeric country column must be excluded)
  • visualised the matrix using sns.heatmap(), with annot=True to print each coefficient

Preprocessing the data


from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
new = scaler.fit_transform(df) #standardising the numerical data
newdf = pd.DataFrame(data= new , columns = columns)
newdf
  • used StandardScaler from the sklearn.preprocessing library to scale the data via the scaler.fit_transform() method
  • created a new Dataframe using pd.DataFrame()
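A minimal sketch of what StandardScaler does, on a toy array (illustrative values, not the Country data): each column ends up with zero mean and unit variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])
scaled = StandardScaler().fit_transform(toy)
# per-column mean is now ~0 and per-column standard deviation is ~1
```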
from sklearn.decomposition import PCA
import numpy as np

pca = PCA()
pca.fit(newdf)
pca_data_standard = pca.transform(newdf)
  • used PCA from sklearn.decomposition library and transformed the data using pca.transform()
  • imported numpy library as np
var = np.round(pca.explained_variance_ratio_*100, decimals=2) 
l = len(var)
labels = ['PC' + str(x) for x in range (1, l+1)] #creating a list of labels for each principal component
plt.bar(x=range(1,l+1), height=var, tick_label = labels) #plotting the variance
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Component')
plt.show()
  • used pca.explained_variance_ratio_ to calculate the percentage of explained variance for each principal component
  • used np.round() to round to two decimal places
  • plotted a bar graph of each component using matplotlib
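As a sanity check, explained_variance_ratio_ sums to 1 when every component is kept, and the ratios come out in descending order; a sketch on random toy data (not the scaled Country data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 4))  # toy matrix: 50 samples, 4 features
pca = PCA().fit(toy)
var_pct = np.round(pca.explained_variance_ratio_ * 100, decimals=2)
# var_pct is sorted from the largest to the smallest component
```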
pca_df = pd.DataFrame(pca_data_standard, columns=labels)
plt.scatter(pca_df.PC1, pca_df.PC2, color='black', edgecolors='pink') #scatter plot to visualise projection of data onto first two PC obtained from PCA
plt.title('PCA')
plt.xlabel(f'PC1 - {var[0]}%')
plt.ylabel(f'PC2 - {var[1]}%')
  • used plt.scatter from matplotlib to visualise the projection of the data onto the first two principal components obtained from PCA
pca_df.drop(['PC5','PC6','PC7','PC8','PC9'], axis = 1, inplace=True)
pca_df
  • removed the unnecessary principal components using df.drop()

Kmeans Clustering

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

sil=[]
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, init='random', n_init=10, max_iter=300).fit(pca_df)  # KMeans instantiated with k clusters and fitted
    labels = kmeans.labels_  # resulting cluster labels from the model
    sil.append(silhouette_score(pca_df, labels, metric='euclidean'))  # appending silhouette score for each k
sns.lineplot(x = range(2, 11), y = sil);
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.show()
  • imported KMeans from the sklearn.cluster library and silhouette_score from the sklearn.metrics library
  • The silhouette score measures how similar an object is to its own cluster compared to other clusters. A higher silhouette score indicates better clustering.
  • n_clusters (int) - the number of clusters to form as well as the number of centroids to generate.
  • n_init (int) - the number of times the k-means algorithm will be run with different centroid seeds.
  • max_iter (int) - the maximum number of iterations of the k-means algorithm for a single run.
  • labels (array-like) - the resulting cluster labels from the model.
  • metric (string) - the distance metric to use. 'euclidean' is the default metric.
  • sil (list) - a list containing the silhouette score for each number of clusters.
  • sns.lineplot (Line plot object) - a line plot showing the silhouette score for each number of clusters.
  • init(string) - the method used to initialize the centroids. 'random' selects random data points as initial centroids.
  • .fit() fits the model to the dataset
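The silhouette-based selection above can be illustrated on synthetic blobs (make_blobs and the fixed centers below are illustrative assumptions, not part of the project): when k matches the true number of groups, the score is high.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# three well-separated synthetic clusters
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [5, 5], [10, 0]],
                  cluster_std=0.5, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels, metric='euclidean')
# for clean, well-separated blobs the score is close to 1
```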
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

sse = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='random', n_init=10, max_iter=300).fit(pca_df)
    sse.append(kmeans.inertia_)  # appending sum of squared errors for each k

plt.plot(range(1, 11), sse, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared errors')
plt.show()
  • inertia_ (float) - the within-cluster sum of squared distances (inertia) of the fitted KMeans model.
  • sse (list) - the sum of squared errors for each number of clusters k.
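The elbow method rests on inertia never increasing as k grows; a small sketch on synthetic blobs (toy data with hypothetical centers, not the project data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=[[0, 0], [5, 5], [10, 0]],
                  cluster_std=0.6, random_state=0)
# SSE (inertia) for k = 1..6; the curve drops steeply until k reaches
# the true number of clusters, then flattens out - the "elbow"
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 7)]
```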
kmeans = KMeans(n_clusters=4, init='random', n_init=10, max_iter=300, random_state=0)
kmeans.fit(pca_df)
  • created new kmeans object and fitted the dataset
cluster = kmeans.fit_predict(pca_df)
cluster
  • cluster(array-like) - an array containing the predicted clusters for each data point in the input data.
unique_cluster, counts = np.unique(cluster, return_counts=True)
percentages = counts / len(cluster) * 100

print("Number of samples:")
for i, label in enumerate(unique_cluster):  # number of samples in each cluster
    print(f"Cluster {label}: {counts[i]}")
print("")

print("Percentage:")
for i, label in enumerate(unique_cluster):
    print(f"Cluster {label}: {percentages[i]:.2f}%")
  • unique_cluster (array-like) - an array containing the unique predicted clusters.
  • counts (array-like) - an array containing the count of samples for each predicted cluster.
  • percentages (array-like) - an array containing the percentage of samples for each predicted cluster.
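The counting step can be checked on a toy label array (values are illustrative):

```python
import numpy as np

toy_labels = np.array([0, 1, 1, 2, 2, 2])
unique_cluster, counts = np.unique(toy_labels, return_counts=True)
percentages = counts / len(toy_labels) * 100
# cluster 2 holds 3 of 6 samples, i.e. 50%
```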
silhouette_score(pca_df, cluster, metric = 'euclidean')
  • calculated the overall silhouette score for the final clustering
df_cluster = data.copy()
df_cluster['cluster'] = cluster
df_cluster
  • assigned the cluster labels to a copy of the original dataset (data.copy() avoids mutating data)
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (15,5))

plt.subplot(1,2,1)
sns.boxplot(x='cluster', y='child_mort', data=df_cluster);
plt.title('child_mort vs cluster')

plt.subplot(1,2,2)
sns.boxplot(x='cluster', y='income', data=df_cluster);
plt.title('income vs cluster')

plt.show()
  • plotted boxplots using sns.boxplot()
df_cluster['cluster'] = df_cluster['cluster'].replace([3], 1)
  • used .replace() to relabel cluster 3 as cluster 1
df_cluster.loc[df_cluster['cluster'] == 0, 'cluster'] = 'Need Help'
df_cluster.loc[df_cluster['cluster'] == 1, 'cluster'] = 'No Help Needed'
df_cluster.loc[df_cluster['cluster'] == 2, 'cluster'] = 'Might Need Help'
  • This code replaces each numeric value in the cluster column of df_cluster (0, 1, or 2) with a descriptive label (Need Help, No Help Needed, or Might Need Help). Writing through df.loc[mask, column] avoids pandas' chained-assignment warning.
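An alternative, arguably more idiomatic way to do this renaming in one step is Series.map with a dictionary (sketched on a toy frame; the mapping mirrors the labels used here):

```python
import pandas as pd

toy = pd.DataFrame({'cluster': [0, 1, 2, 0]})
name_map = {0: 'Need Help', 1: 'No Help Needed', 2: 'Might Need Help'}
toy['cluster'] = toy['cluster'].map(name_map)
```

map replaces every key in one pass and leaves no room for chained-assignment warnings.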
import plotly.express as px
fig = px.choropleth(df_cluster[['country', 'cluster']],
                    locationmode='country names',
                    locations='country',
                    color='cluster',
                    color_discrete_map={'Need Help': 'Red',
                                        'Might Need Help': 'Yellow',
                                        'No Help Needed': 'Blue'})
fig.update_geos(fitbounds="locations", visible=True)
fig.show(engine='kaleido')
  • used px from plotly.express library to plot countries
  • df_cluster (DataFrame) - the input DataFrame containing the data to be visualized.
  • locationmode (str) - the location mode to use for the choropleth map (e.g. 'country names', 'ISO-3', etc.).
  • color (str) - the column in df_cluster that contains the data to be plotted as the color of the map.
  • color_discrete_map (dict) - a dictionary mapping each unique value in the color column to a specific color.
  • fig (Figure) - a Plotly figure object representing the choropleth map.
  • fig.update_geos(fitbounds = "locations", visible = True) sets the fitbounds and visible parameters of the update_geos() method of the fig object. This modifies the appearance and behavior of the map, allowing it to display more effectively.
  • fig.show(engine = 'kaleido') displays the map using the kaleido engine, which is a vector graphics rendering engine that can be used to save high-quality static images of Plotly figures. The show() method opens a window in the browser, displaying the map.

Similarly plotted the rest of the continents

Hierarchical clustering

from scipy.cluster.hierarchy import dendrogram, linkage

Z = linkage(pca_df, method='ward', metric='euclidean') #created a linkage matrix

plt.figure(figsize=(25,8))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
dendrogram(Z) #plotting a dendrogram
plt.show()
  • from scipy.cluster.hierarchy import dendrogram, linkage imports the dendrogram() and linkage() functions from the scipy.cluster.hierarchy module.
  • Z = linkage(pca_df, method='ward', metric='euclidean') calculates the hierarchical clustering linkage matrix Z using the Ward method and the Euclidean distance metric.
  • dendrogram(Z) generates and plots a dendrogram from the linkage matrix Z.
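The linkage matrix can also be cut into flat clusters without eyeballing the dendrogram, using scipy's fcluster (not used in the original code; shown here as a sketch on toy two-blob data):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(0, 0.3, size=(20, 2)),   # blob around (0, 0)
                 rng.normal(5, 0.3, size=(20, 2))])  # blob around (5, 5)
Z = linkage(toy, method='ward', metric='euclidean')
flat = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into 2 flat clusters
```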
from sklearn.cluster import AgglomerativeClustering
np.random.seed(0)
clustering = AgglomerativeClustering(n_clusters=3, metric='euclidean')
clustering.fit(pca_df)
  • from sklearn.cluster import AgglomerativeClustering imports the AgglomerativeClustering class from the sklearn.cluster module.
  • np.random.seed(0) sets the random seed to 0 for reproducibility.
  • clustering = AgglomerativeClustering(n_clusters=3, metric='euclidean') instantiates an AgglomerativeClustering object with n_clusters=3 and metric='euclidean'. This creates a hierarchical clustering model that will group samples into 3 clusters based on the Euclidean distance between the samples.
  • clustering.fit(pca_df) fits the hierarchical clustering model to the pca_df dataset, grouping the samples into 3 clusters based on the Euclidean distance between the samples.
cluster2 = clustering.fit_predict(pca_df)
cluster2
  • cluster2 = clustering.fit_predict(pca_df) assigns each sample to a cluster based on the hierarchical clustering model, and returns an array of cluster assignments for each sample in the dataset. The resulting array cluster2 contains the cluster assignments of each sample in the pca_df dataset.
unique_cluster, counts = np.unique(cluster2, return_counts=True)
percentages = counts / len(cluster2) * 100

print("Number of samples:")
for i, label in enumerate(unique_cluster):  # number of samples in each cluster
    print(f"Cluster {label}: {counts[i]}")
print("")

print("Percentage:")
for i, label in enumerate(unique_cluster):
    print(f"Cluster {label}: {percentages[i]:.2f}%")
  • calculated cluster counts using np.unique() function and calculated percentages
df_cluster2 = data.copy()
df_cluster2['cluster'] = cluster2  # assigned the cluster labels to a copy of the original dataset
df_cluster2
  • assigned the hierarchical cluster labels to a copy of the original dataset (data.copy() avoids overwriting the KMeans results)
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (15,5))

plt.subplot(1,2,1)
sns.boxplot(x='cluster', y='child_mort', data=df_cluster2);
plt.title('child_mort vs cluster')

plt.subplot(1,2,2)
sns.boxplot(x='cluster', y='income', data=df_cluster2);
plt.title('income vs cluster')

plt.show()
  • plotted boxplot using sns.boxplot()
df_cluster2.loc[df_cluster2['cluster'] == 0, 'cluster'] = 'No Help Needed'
df_cluster2.loc[df_cluster2['cluster'] == 1, 'cluster'] = 'Need Help'
df_cluster2.loc[df_cluster2['cluster'] == 2, 'cluster'] = 'Might Need Help'
  • as in the earlier step, assigned descriptive names to the clusters (via df.loc to avoid chained assignment)
sns.pairplot(df_cluster2, hue = "cluster")
  • The sns.pairplot() function creates a matrix of scatter plots for all pairs of features in a dataset.
  • each point is colored according to cluster using hue parameter
import plotly.express as px
fig = px.choropleth(df_cluster2[['country', 'cluster']],
                    locationmode='country names',
                    locations='country',
                    color='cluster',
                    color_discrete_map={'Need Help': 'Red',
                                        'Might Need Help': 'Yellow',
                                        'No Help Needed': 'Blue'})
fig.update_geos(fitbounds="locations", visible=True)
fig.show(engine='kaleido')
  • same steps as in KMeans to visualise the clusters by country

Similarly all continents are plotted

DBSCAN clustering

from sklearn.cluster import DBSCAN

eps_range = np.linspace(0.10, 1.00, num=100)
min_samples_range = range(2, 10)
s=-np.inf

for eps in eps_range:
    for min_samples in min_samples_range:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        labels = dbscan.fit_predict(pca_df)
        n_clusters = len(np.unique(labels)) - (1 if -1 in labels else 0)  # exclude the noise label -1
        n_noise = list(labels).count(-1)
        if n_clusters == 2:  # score only solutions with two real clusters (plus the separate noise group)
            score = silhouette_score(pca_df, labels, metric="euclidean")
            print(f'eps={eps:.2f}, min_samples={min_samples}, n_clusters={n_clusters}, n_noise={n_noise}, silhouette_score={score}')
            if score > s:
                final_eps = eps
                final_sample = min_samples  # updated eps and min_samples whenever the score improves
                s = score
  • imported DBSCAN from sklearn.cluster library
  • np.linspace(start, stop, num) returns num evenly spaced numbers over the interval [start, stop]
  • dbscan = DBSCAN(eps=eps, min_samples=min_samples) creates a DBSCAN object with the specified eps and min_samples values. eps is the radius of the neighborhood around each point that will be considered to define the density of the points. min_samples is the minimum number of points required to form a dense region. If a point has fewer than min_samples neighbors within a distance of eps, it will be considered as noise.
  • This code performs hyperparameter tuning for the DBSCAN clustering algorithm. It first defines a range of values for the epsilon parameter (eps_range) and for the minimum number of samples required to form a dense region (min_samples_range). It then loops over all combinations of these parameters and, for each combination, fits a DBSCAN model to the data and calculates the number of clusters, the number of noise points, and the silhouette score. Whenever the number of real clusters equals 2 (three groups in total, counting the noise points), the silhouette score is calculated using scikit-learn's silhouette_score function. The combination of eps and min_samples that yields the highest silhouette score is selected as the final choice, and the values are stored in final_eps and final_sample.
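The eps/min_samples behaviour is easy to verify on toy data with one far-away outlier (illustrative values, not the project data): dense groups become clusters, while the isolated point is marked as noise (label -1).

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(0, 0.2, size=(30, 2)),   # dense group near (0, 0)
                 rng.normal(5, 0.2, size=(30, 2)),   # dense group near (5, 5)
                 [[100.0, 100.0]]])                  # one isolated outlier
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(toy)
n_clusters = len(np.unique(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
```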
from sklearn.cluster import DBSCAN
np.random.seed(0)
db = DBSCAN(eps=0.8818181818181817, min_samples=9).fit(pca_df)
cluster3 = db.labels_  # cluster label of each sample; -1 marks noise
  • This code fits DBSCAN with the tuned values eps=0.8818181818181817 and min_samples=9 to the pca_df dataset. The fitted model is stored in db, and its labels_ attribute supplies the per-sample cluster assignments used below as cluster3.
unique_cluster, counts = np.unique(cluster3, return_counts=True)
percentages = counts / len(cluster3) * 100

print("Number of samples:")
for i, label in enumerate(unique_cluster):  # number of samples in each cluster
    print(f"Cluster {label}: {counts[i]}")
print("")

print("Percentage:")
for i, label in enumerate(unique_cluster):
    print(f"Cluster {label}: {percentages[i]:.2f}%")
  • calculated the unique clusters and their counts, as for the previous algorithms
df_cluster3 = data.copy()
df_cluster3['cluster'] = cluster3
df_cluster3
  • assigned the DBSCAN cluster labels to a copy of the original dataset
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (15,5))

plt.subplot(1,2,1)
sns.boxplot(x='cluster', y='child_mort', data=df_cluster3);
plt.title('child_mort vs cluster')

plt.subplot(1,2,2)
sns.boxplot(x='cluster', y='income', data=df_cluster3);
plt.title('income vs cluster')

plt.show()
  • plotted box plots similar to before
df_cluster3.loc[df_cluster3['cluster'] == -1, 'cluster'] = 'Might Need Help'
df_cluster3.loc[df_cluster3['cluster'] == 0, 'cluster'] = 'Need Help'
df_cluster3.loc[df_cluster3['cluster'] == 1, 'cluster'] = 'No Help Needed'
  • assigned names to the clusters as before (the noise label -1 is treated as Might Need Help)
sns.pairplot(df_cluster3, hue = "cluster")
  • visualised the clusters using a pairplot
import plotly.express as px
fig = px.choropleth(df_cluster3[['country', 'cluster']],
                    locationmode='country names',
                    locations='country',
                    color='cluster',
                    color_discrete_map={'Need Help': 'Red',
                                        'Might Need Help': 'Yellow',
                                        'No Help Needed': 'Blue'})
fig.update_geos(fitbounds="locations", visible=True)
fig.show(engine='kaleido')
  • plotted the countries as before; similarly plotted a graph for each continent

All the steps are then repeated for the dataset without PCA.