Python code for data exploration and analysis of the submitted paper: Assessing the physicochemical and microbiological condition of surface waters in Urabá-Colombia: Impact of human activities and agro-industry (2024)¶
Code developed by Victor H. Aristizabal-Tique
Introduction¶
Brief description of the data set and a summary of its attributes:¶
The following table shows the sampling points (P1 to P10), one in each of the León, Chigorodó, Carepa, Zungo, Apartadó, Grande, Arcua, Currulao, Guadualito, and Turbo Rivers in the central area of Urabá, Antioquia-Colombia. The recommendations of Standard Methods, 23rd edition, were followed for collecting, preserving, and managing the water samples from each river. Temperature and the physicochemical and microbiological parameters were measured at each sampling point.
Sample point | Location | Latitude | Longitude | Altitude (m) |
---|---|---|---|---|
P1 | Turbo River | 8.1266660 | -76.6936676 | 21.4 |
P2 | Guadualito River | 8.0641056 | -76.6585601 | 28.6 |
P3 | Currulao River | 7.9878032 | -76.6490078 | 23.4 |
P4 | Arcua River | 7.9658303 | -76.6240585 | 25.1 |
P5 | Grande River | 7.9278369 | -76.6214059 | 27.6 |
P6 | Apartadó River | 7.8962484 | -76.6480077 | 24.7 |
P7 | Zungo River | 7.8241638 | -76.6484759 | 39.3 |
P8 | Carepa River | 7.7647969 | -76.6639725 | 31.3 |
P9 | Chigorodó River | 7.6729423 | -76.6841424 | 34.0 |
P10 | León River | 7.5716390 | -76.7110000 | 16.6 |
In general, the data are structured and complete, so data cleaning and feature engineering are not necessary.
Initial plan for data exploration and analysis:¶
First, the original data are loaded from an Excel file into a pandas DataFrame called OrigData. The number of elements, rows, and columns is reviewed, and the data types and the row and column labels are explored. A general visualization of the data is then performed to identify outliers; the tools used for this purpose include histograms, boxplots, and scatter plots with fitted linear regression models and 95% confidence intervals. Finally, the Spearman correlation and p-value matrices are calculated to estimate the strength of association between variables.
Second, once the outliers are identified, they are removed, a new pandas DataFrame called NewData is created, and the steps described above are repeated.
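In this notebook the outlier-free dataset is reloaded from a second Excel file, but the same NewData frame could also be built directly in pandas by dropping the offending row. A minimal sketch, using a hypothetical two-column miniature of the dataset:

```python
import pandas as pd

# Hypothetical miniature of the dataset: only the sampling-point label and one parameter.
OrigData = pd.DataFrame({
    'Sampling point': ['P5', 'P6', 'P7'],
    'DO (mg/L)': [7.62, 0.03, 8.00],
})

# Drop the P6 row and reset the index so NewData is indexed 0..n-1 again.
NewData = OrigData[OrigData['Sampling point'] != 'P6'].reset_index(drop=True)
```

Building NewData in code keeps the original Excel file as the single source of truth, at the cost of the removal step being less visible to a reader of the data folder.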
Key Findings and Insights, which synthesizes the results of Exploratory Data Analysis in an insightful and actionable manner:¶
In general, the data exploration made it possible to identify the main source of outliers, which led to a second analysis that excludes this source. A non-parametric analysis is also needed because some variables depart from normality. Finally, comparing the Spearman Rank Correlation Coefficient (Rho) matrices with and without outliers shows that most coefficients retained their significance. The correlations between population and the other variables are a notable exception: some lost significance when the outliers were removed, which is striking given the strong theoretical basis for an association between microbiological variables and human activity in water sources. This indicates that great care must be taken before deciding to remove an outlier without both theoretical and statistical support.
Here starts the python script¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
#import os
#from scipy import stats
#from sklearn.impute import SimpleImputer
#from sklearn.impute import KNNImputer
Original Data input and loading¶
Load and display the data and its properties from an Excel file.
#data file path
#OrigInputFilePath = 'data/SamplingPointsPhysicochemicalParameters.xlsx' #In desktop version Jupyter
OrigInputFilePath = '../data/SamplingPointsPhysicochemicalParameters.xlsx' #In Code Ocean platform
#data is loaded from an Excel file into a pandas DataFrame
OrigData = pd.read_excel(OrigInputFilePath, dtype='object')
#display all columns
pd.set_option('display.max_columns', OrigData.shape[1])
#pd.set_option('display.max_rows', 6) #display 6 rows
#display all rows
pd.set_option('display.max_rows', OrigData.shape[0])
#display Data
OrigData
Sampling point | Latitude | Longitude | Altitude (m) | Distance (m) | Population (hab) | Ambient Temp. (°C) | Water Temp. (°C) | DO (mg/L) | pH | Conductivity (mS/cm) | Ammonia nitrogen (mg/L) | Ortho phosphate (mg/L) | BOD5 (mg/L) | COD (mg/L) | Viable heterotrophs (CFU/mL) | E. coli (MPN/100 mL) | Total coliforms (MPN/100 mL) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | P1 | 8.126666 | -76.693668 | 21.4 | 6595.6 | 4033 | 26.6 | 27.4 | 7.71 | 8.2 | 847 | 7.6 | 0.108 | 220 | 287.1 | 34000 | 540 | 1600 |
1 | P2 | 8.064106 | -76.65856 | 28.6 | 9747.4 | 10955 | 28.1 | 27.3 | 7.62 | 8.2 | 774 | 6.8 | 0.128 | 215 | 239 | 57000 | 920 | 1600 |
2 | P3 | 7.987803 | -76.649008 | 23.4 | 10783.6 | 23344 | 30.4 | 28.8 | 7.55 | 7.9 | 714 | 8 | 0.131 | 191.7 | 255.6 | 200000 | 1600 | 1600 |
3 | P4 | 7.96583 | -76.624058 | 25.1 | 15754.4 | 9682 | 29.8 | 26.7 | 7.36 | 8 | 464 | 7.9 | 0.193 | 188.3 | 209.7 | 200000 | 920 | 1600 |
4 | P5 | 7.927837 | -76.621406 | 27.6 | 15562 | 7490 | 28 | 27.4 | 7.62 | 7.9 | 428 | 8.3 | 0.136 | 154.3 | 176.7 | 200000 | 1600 | 1600 |
5 | P6 | 7.896248 | -76.648008 | 24.7 | 15194.3 | 98454 | 30.8 | 28.6 | 0.03 | 7.5 | 579 | 17.7 | 0.316 | 241.7 | 324.5 | 200000 | 1600 | 1600 |
6 | P7 | 7.824164 | -76.648476 | 39.3 | 19587.1 | 9093 | 29.9 | 27.8 | 8 | 7.7 | 305 | 8.5 | 0.07 | 113.3 | 151.3 | 150000 | 735 | 1600 |
7 | P8 | 7.764797 | -76.663972 | 31.3 | 34983.3 | 33009 | 30.1 | 27.5 | 7.04 | 7.9 | 428 | 15.1 | 0.085 | 46.2 | 90 | 200000 | 1600 | 1600 |
8 | P9 | 7.672942 | -76.684142 | 34 | 40356.1 | 47046 | 27.1 | 26.7 | 7.74 | 7.1 | 186 | 11.3 | 0.09 | 120 | 121.9 | 200000 | 1600 | 1600 |
9 | P10 | 7.571639 | -76.711 | 16.6 | 53325.7 | 4597 | 28 | 27.8 | 6.73 | 7.5 | 169 | 7.9 | 0.126 | 78.3 | 104.5 | 57000 | 1000 | 1600 |
# displays number of elements
print('number of elements:', OrigData.size)
# displays number of rows and columns
print('\nnumber of rows and columns:', OrigData.shape, '\n')
# Examine the columns, look at missing data
OrigData.info()
number of elements: 180

number of rows and columns: (10, 18)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Sampling point                10 non-null     object
 1   Latitude                      10 non-null     object
 2   Longitude                     10 non-null     object
 3   Altitude (m)                  10 non-null     object
 4   Distance (m)                  10 non-null     object
 5   Population (hab)              10 non-null     object
 6   Ambient Temp. (°C)            10 non-null     object
 7   Water Temp. (°C)              10 non-null     object
 8   DO (mg/L)                     10 non-null     object
 9   pH                            10 non-null     object
 10  Conductivity (mS/cm)          10 non-null     object
 11  Ammonia nitrogen (mg/L)       10 non-null     object
 12  Ortho phosphate (mg/L)        10 non-null     object
 13  BOD5 (mg/L)                   10 non-null     object
 14  COD (mg/L)                    10 non-null     object
 15  Viable heterotrophs (CFU/mL)  10 non-null     object
 16  E. coli (MPN/100 mL)          10 non-null     object
 17  Total coliforms (MPN/100 mL)  10 non-null     object
dtypes: object(18)
memory usage: 1.5+ KB
# list of Column names
ColumnsList = OrigData.columns.tolist()
print('\nColumns list:', ColumnsList)
Columns list: ['Sampling point', 'Latitude', 'Longitude', 'Altitude (m)', 'Distance (m)', 'Population (hab)', 'Ambient Temp. (°C)', 'Water Temp. (°C)', 'DO (mg/L)', 'pH', 'Conductivity (mS/cm)', 'Ammonia nitrogen (mg/L)', 'Ortho phosphate (mg/L)', 'BOD5 (mg/L)', 'COD (mg/L)', 'Viable heterotrophs (CFU/mL)', 'E. coli (MPN/100 mL)', 'Total coliforms (MPN/100 mL)']
Original Data exploration¶
Complete and fully structured data is explored
Histograms and Shapiro-Wilk test of normality are performed
The results show that some variables are not normally distributed according to Shapiro-Wilk test.
M = 5;
#graph the last 15 columns of array Data
#number of elements in (ColumnsList - 3) divided by M
N = int((len(ColumnsList)-3)/M)
#create a NxM grid of plots
fig1, axes1 = plt.subplots(N, M, figsize=(18, 6))
#counter for ColumnsList
p = 3;
for i in range(0,N):
for j in range(0,M):
#extract only the data from the column under analysis
AUX_data = OrigData[ColumnsList[p]];
#compute width of bars for histogram
Width = (max(AUX_data) - min(AUX_data)) / 5
if Width < 3:
Width = round(Width, 2);
else:
Width = int(Width);
#extract the integer part of the minimum value of the column under analysis
AUX_Min= int(min(AUX_data));
#define intervals for histogram
Ranges = [AUX_Min+0*Width, AUX_Min+Width, AUX_Min+2*Width, AUX_Min+3*Width, AUX_Min+4*Width, AUX_Min+5*Width, AUX_Min+6*Width, AUX_Min+7*Width]
#create a histogram at position [i, j] of the plots grid
sns.histplot(x=AUX_data, bins=Ranges, element='bars', kde=True, ax=axes1[i, j]);
# Perform Shapiro-Wilk test
Statistic, Pvalue = stats.shapiro(AUX_data.astype('float64'))
#Print the results in titles
#Interpret the results
alpha = 0.05
if Pvalue > alpha:
#Sample looks Gaussian (fail to reject H0)
axes1[i, j].set_title(f'StatisticSW={Statistic:.3f}, P-value={Pvalue:.3f}', color='green')
else:
#Sample does not look Gaussian (reject H0)
axes1[i, j].set_title(f'StatisticSW={Statistic:.3f}, P-value={Pvalue:.3f}', color='red')
p += 1;
# Adjust layout
plt.tight_layout()
#save the plot as a JPG image with specified DPI
#plt.savefig('results/OrigData_Histograms.jpg', format='jpg', dpi=300) #in desktop version Jupyter
plt.savefig('../results/OrigData_Histograms.jpg', format='jpg', dpi=300) #in Code Ocean platform
plt.show()
/opt/conda/lib/python3.10/site-packages/scipy/stats/_morestats.py:1879: UserWarning: Input data for shapiro has range zero. The results may not be accurate. warnings.warn("Input data for shapiro has range zero. The results "
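The warning above is expected: the 'Total coliforms (MPN/100 mL)' column takes the value 1600 at every sampling point, so its range is zero and the Shapiro-Wilk statistic is not meaningful for it. A sketch of how such constant columns could be detected up front (df is a hypothetical miniature of the dataset):

```python
import pandas as pd

# Hypothetical miniature: 'Total coliforms' is identical at every sampling point.
df = pd.DataFrame({
    'DO (mg/L)': [7.71, 7.62, 0.03],
    'Total coliforms (MPN/100 mL)': [1600, 1600, 1600],
})

# A column with a single unique value has zero range, which triggers the warning.
constant_cols = [c for c in df.columns if df[c].nunique() <= 1]
print(constant_cols)  # → ['Total coliforms (MPN/100 mL)']
```

Columns flagged this way could simply be skipped in the normality-test loop.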
Boxplots are performed to identify outliers
Five of the seven outliers in the boxplots correspond to the P6 data; the other two correspond to P3 (water temperature) and P8 (ammonia nitrogen). Accordingly, it is convenient to ignore the P6 data.
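The boxplot whiskers follow Tukey's rule: a point is flagged when it lies more than 1.5×IQR beyond the quartiles. The same rule can be applied numerically as a cross-check; the sketch below uses the ammonia nitrogen values from the data table above and flags the P6 (17.7) and P8 (15.1) values:

```python
import pandas as pd

# Ammonia nitrogen (mg/L) at P1..P10, taken from the data table above.
s = pd.Series([7.6, 6.8, 8.0, 7.9, 8.3, 17.7, 8.5, 15.1, 11.3, 7.9])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# Tukey's fences: values beyond 1.5*IQR from the quartiles are flagged as outliers.
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers.tolist())  # → [17.7, 15.1], i.e. P6 and P8
```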
#create a NxM grid of plots
fig2, axes2 = plt.subplots(N, M, figsize=(12, 6))
#counter for ColumnsList
p = 3;
for i in range(0,N):
for j in range(0,M):
#create a boxplot at position [i, j] of the plots grid
sns.boxplot(x=OrigData[ColumnsList[p]], ax=axes2[i, j])
p += 1;
# Adjust layout
plt.tight_layout()
#save the plot as a JPG image with specified DPI
#plt.savefig('results/OrigData_Boxplots.jpg', format='jpg', dpi=300) #in desktop version Jupyter
plt.savefig('../results/OrigData_Boxplots.jpg', format='jpg', dpi=300) #in Code Ocean platform
plt.show()
Scatter plots with fitted linear regression models and 95% confidence intervals are produced to identify outliers
The scatter plots and fitted linear regression models with 95% confidence intervals show that several P6 data points lie outside or far from the confidence interval, so P6 is identified as the main source of outliers. This supports the conclusion from the boxplot section above that the P6 data should be ignored.
# Define colors based on row position for each sampling point in regplots
row_positions = range(len(OrigData[ColumnsList[0]]));
point_colors = sns.color_palette("Paired", n_colors=len(row_positions));
#plot a bar chart to visualize the colors of the sampling points in regplots
plt.figure(figsize=(8, 2))
sns.barplot(x=['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8', 'P9', 'P10'], y=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], palette=point_colors)
# Remove y-axis and labels for better visualization
plt.yticks([])
# Adjust layout
plt.tight_layout()
#save the plot as a JPG image with specified DPI
#plt.savefig('results/OrigData_color_palette_regplots.jpg', format='jpg', dpi=300) #in desktop version Jupyter
plt.savefig('../results/OrigData_color_palette_regplots.jpg', format='jpg', dpi=300) #in Code Ocean platform
#create scatter plots with linear regression fits and 95% confidence intervals
L = len(ColumnsList);
# Create a (L-3)x(L-3) grid of plots
#graph the last 15 (L-3) columns of array Data
fig3, axes3 = plt.subplots(L-3, L-3, figsize=(40, 40))
for i in range(3,L):
for j in range(3,L):
#create a plot data and a linear regression model fit with a size of the confidence interval of 95% at position [i, j] of the plots grid
sns.regplot(x = OrigData[ColumnsList[i]].astype('float64'), y = OrigData[ColumnsList[j]].astype('float64'),
ci=95, scatter_kws={'color': point_colors}, ax=axes3[i-3, j-3])
# Adjust layout
plt.tight_layout()
#save the plot as a JPG image with specified DPI
#plt.savefig('results/OrigData_Regplots.jpg', format='jpg', dpi=300) #in desktop version Jupyter
plt.savefig('../results/OrigData_Regplots.jpg', format='jpg', dpi=300) #in Code Ocean platform
Original Data analysis¶
Observation of data distributions:¶
In the previous pairwise scatter plots of the dataset, many variables appear to be strongly associated, so it is convenient to examine the correlations between them. Since some variables are not normally distributed according to the Shapiro-Wilk test, a non-parametric measure, Spearman's Rank Correlation Coefficient (Rho), is used.
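Spearman's Rho is Pearson's correlation computed on the ranks of the data, so it captures any monotonic relation without assuming normality or linearity. A minimal sketch with hypothetical toy data:

```python
from scipy import stats

# Hypothetical toy data: y grows monotonically but non-linearly with x,
# the situation where a rank correlation is preferable to Pearson's.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

rho, p_value = stats.spearmanr(x, y)
print(rho)  # → 1.0, since the ranks of x and y agree perfectly
```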
Conducting a formal significance test for one of the hypotheses and discuss the results:¶
If the p-value is < 0.05, the null hypothesis is rejected and Spearman’s coefficient (Rho) estimates the strength of association between the two variables, as follows:
Rho value | Strength of association |
---|---|
0.00 - 0.19 | very weak |
0.20 - 0.39 | weak |
0.40 - 0.59 | moderate |
0.60 - 0.79 | strong |
0.80 - 1.00 | very strong |
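The interpretation table above can be expressed as a small helper function (a sketch; rho_strength is not part of the notebook) that maps the absolute value of Rho to its strength label:

```python
def rho_strength(rho):
    """Map |Rho| to the strength-of-association labels in the table above."""
    r = abs(rho)
    if r < 0.20:
        return 'very weak'
    if r < 0.40:
        return 'weak'
    if r < 0.60:
        return 'moderate'
    if r < 0.80:
        return 'strong'
    return 'very strong'

print(rho_strength(-0.85))  # → very strong
```

Taking the absolute value means the label describes the strength of a negative association as well as a positive one.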
Heatmap of the Spearman correlation matrix and p-values are performed
# Heatmap of correlation
InterestData = OrigData.iloc[:,3:17].astype('float64')
#invert the order of columns for better visualization in the correlation matrix
InterestData = InterestData[InterestData.columns[::-1]]
#Calculate the Spearman correlation matrix and p-values
correlation_matrix, p_values = stats.spearmanr(InterestData)
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool)) # Mask to hide the upper triangle for better visualization
#mask = np.tril(np.ones_like(correlation_matrix, dtype=bool)) # Mask to hide the lower triangle for better visualization
fig4, ax4 = plt.subplots(figsize=(12, 9))
# Create a heatmap using Seaborn
ax4=sns.heatmap(correlation_matrix, square=True, annot=False, fmt=".2f", cmap='bwr', cbar_kws={"shrink": 0.7}, ax=ax4,
mask=mask, xticklabels=InterestData.columns, yticklabels=InterestData.columns) #, annot_kws = {'size': 6})
# Add asterisks and coefficients to indicate significance levels
for i in range(correlation_matrix.shape[0]):
for j in range(i+1, correlation_matrix.shape[1]):
p_val = p_values[i, j]
correlation_coefficient = correlation_matrix[i, j]
# Format the annotation text with correlation coefficient and asterisks for significance
annotation_text = f"{correlation_coefficient:.2f}\n"
if p_val < 0.001:
annotation_text += '***'
elif p_val < 0.01:
annotation_text += '**'
elif p_val < 0.05:
annotation_text += '*'
            #plt.text(j + 0.5, i + 0.5, annotation_text, ha='center', va='center', color='black', fontsize=10) #for upper triangle
            plt.text(i + 0.5, j + 0.5, annotation_text, ha='center', va='center', color='black', fontsize=10) #for lower triangle
plt.setp(ax4.get_xticklabels(), rotation = 25, ha = "right", rotation_mode = "anchor")
plt.setp(ax4.get_yticklabels(), rotation = 10, ha = "right", rotation_mode = "anchor")
plt.title('Spearman’s Rho matrix with Significance Levels of Original Data, P-value: * p < 0.05, ** p < 0.01, ***p < 0.001');
#save the plot as a JPG image with specified DPI
#plt.savefig('results/OrigData_SpearmanRho_Pvalue_Matrix.jpg', format='jpg', dpi=300) #in desktop version Jupyter
plt.savefig('../results/OrigData_SpearmanRho_Pvalue_Matrix.jpg', format='jpg', dpi=300) #in Code Ocean platform
# Convert the NumPy array to a DataFrame
df_correlation_matrix = pd.DataFrame(correlation_matrix, index=ColumnsList[3:17], columns=ColumnsList[3:17])
df_Pvalue_matrix = pd.DataFrame(p_values, index=ColumnsList[3:17], columns=ColumnsList[3:17])
#save the pandas DataFrames as Excel files in desktop version Jupyter
#df_correlation_matrix.to_excel('results/OrigData_SpearmanRhoMatrix.xlsx')
#df_Pvalue_matrix.to_excel('results/OrigData_PvalueMatrix.xlsx')
#save the pandas DataFrames as Excel files in Code Ocean platform
df_correlation_matrix.to_excel('../results/OrigData_SpearmanRhoMatrix.xlsx')
df_Pvalue_matrix.to_excel('../results/OrigData_PvalueMatrix.xlsx')
New Modified Data input and loading¶
Load and display the data without outliers and its properties from an Excel file. Here, the data of P6 are ignored.
#data file path
#NewInputFilePath = 'data/SamplingPointsPhysicochemicalParameters_WithoutOutliers.xlsx' #In desktop version Jupyter
NewInputFilePath = '../data/SamplingPointsPhysicochemicalParameters_WithoutOutliers.xlsx' #In Code Ocean platform
#data is loaded from an Excel file into a pandas DataFrame
NewgData = pd.read_excel(NewInputFilePath, dtype='object')
#display all columns
pd.set_option('display.max_columns', NewgData.shape[1])
#pd.set_option('display.max_rows', 6) #display 6 rows
#display all rows
pd.set_option('display.max_rows', NewgData.shape[0])
#display Data
NewgData
Sampling point | Latitude | Longitude | Altitude (m) | Distance (m) | Population (hab) | Ambient Temp. (°C) | Water Temp. (°C) | DO (mg/L) | pH | Conductivity (mS/cm) | Ammonia nitrogen (mg/L) | Ortho phosphate (mg/L) | BOD5 (mg/L) | COD (mg/L) | Viable heterotrophs (CFU/mL) | E. coli (MPN/100 mL) | Total coliforms (MPN/100 mL) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | P1 | 8.126666 | -76.693668 | 21.4 | 6595.6 | 4033 | 26.6 | 27.4 | 7.71 | 8.2 | 847 | 7.6 | 0.108 | 220 | 287.1 | 34000 | 540 | 1600 |
1 | P2 | 8.064106 | -76.65856 | 28.6 | 9747.4 | 10955 | 28.1 | 27.3 | 7.62 | 8.2 | 774 | 6.8 | 0.128 | 215 | 239 | 57000 | 920 | 1600 |
2 | P3 | 7.987803 | -76.649008 | 23.4 | 10783.6 | 23344 | 30.4 | 28.8 | 7.55 | 7.9 | 714 | 8 | 0.131 | 191.7 | 255.6 | 200000 | 1600 | 1600 |
3 | P4 | 7.96583 | -76.624058 | 25.1 | 15754.4 | 9682 | 29.8 | 26.7 | 7.36 | 8 | 464 | 7.9 | 0.193 | 188.3 | 209.7 | 200000 | 920 | 1600 |
4 | P5 | 7.927837 | -76.621406 | 27.6 | 15562 | 7490 | 28 | 27.4 | 7.62 | 7.9 | 428 | 8.3 | 0.136 | 154.3 | 176.7 | 200000 | 1600 | 1600 |
5 | P7 | 7.824164 | -76.648476 | 39.3 | 19587.1 | 9093 | 29.9 | 27.8 | 8 | 7.7 | 305 | 8.5 | 0.07 | 113.3 | 151.3 | 150000 | 735 | 1600 |
6 | P8 | 7.764797 | -76.663972 | 31.3 | 34983.3 | 33009 | 30.1 | 27.5 | 7.04 | 7.9 | 428 | 15.1 | 0.085 | 46.2 | 90 | 200000 | 1600 | 1600 |
7 | P9 | 7.672942 | -76.684142 | 34 | 40356.1 | 47046 | 27.1 | 26.7 | 7.74 | 7.1 | 186 | 11.3 | 0.09 | 120 | 121.9 | 200000 | 1600 | 1600 |
8 | P10 | 7.571639 | -76.711 | 16.6 | 53325.7 | 4597 | 28 | 27.8 | 6.73 | 7.5 | 169 | 7.9 | 0.126 | 78.3 | 104.5 | 57000 | 1000 | 1600 |
# displays number of elements
print('number of elements:', NewgData.size)
# displays number of rows and columns
print('\nnumber of rows and columns:', NewgData.shape, '\n')
# Examine the columns, look at missing data
NewgData.info()
number of elements: 162

number of rows and columns: (9, 18)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   Sampling point                9 non-null      object
 1   Latitude                      9 non-null      object
 2   Longitude                     9 non-null      object
 3   Altitude (m)                  9 non-null      object
 4   Distance (m)                  9 non-null      object
 5   Population (hab)              9 non-null      object
 6   Ambient Temp. (°C)            9 non-null      object
 7   Water Temp. (°C)              9 non-null      object
 8   DO (mg/L)                     9 non-null      object
 9   pH                            9 non-null      object
 10  Conductivity (mS/cm)          9 non-null      object
 11  Ammonia nitrogen (mg/L)       9 non-null      object
 12  Ortho phosphate (mg/L)        9 non-null      object
 13  BOD5 (mg/L)                   9 non-null      object
 14  COD (mg/L)                    9 non-null      object
 15  Viable heterotrophs (CFU/mL)  9 non-null      object
 16  E. coli (MPN/100 mL)          9 non-null      object
 17  Total coliforms (MPN/100 mL)  9 non-null      object
dtypes: object(18)
memory usage: 1.4+ KB
# list of Column names
ColumnsList = NewgData.columns.tolist()
print('\nColumns list:', ColumnsList)
Columns list: ['Sampling point', 'Latitude', 'Longitude', 'Altitude (m)', 'Distance (m)', 'Population (hab)', 'Ambient Temp. (°C)', 'Water Temp. (°C)', 'DO (mg/L)', 'pH', 'Conductivity (mS/cm)', 'Ammonia nitrogen (mg/L)', 'Ortho phosphate (mg/L)', 'BOD5 (mg/L)', 'COD (mg/L)', 'Viable heterotrophs (CFU/mL)', 'E. coli (MPN/100 mL)', 'Total coliforms (MPN/100 mL)']
New Modified Data exploration¶
Complete and fully structured data without outliers is explored.
Histograms and Shapiro-Wilk test of normality are performed
The results show that some variables are not normally distributed according to Shapiro-Wilk test.
M = 5;
#graph the last 15 columns of array Data
#number of elements in (ColumnsList - 3) divided by M
N = int((len(ColumnsList)-3)/M)
#create a NxM grid of plots
fig1, axes1 = plt.subplots(N, M, figsize=(18, 6))
#counter for ColumnsList
p = 3;
for i in range(0,N):
for j in range(0,M):
#extract only the data from the column under analysis
AUX_data = NewgData[ColumnsList[p]];
#compute width of bars for histogram
Width = (max(AUX_data) - min(AUX_data)) / 4
if Width < 3:
Width = round(Width, 2);
else:
Width = int(Width);
#extract the integer part of the minimum value of the column under analysis
AUX_Min= int(min(AUX_data));
#define intervals for histogram
Ranges = [AUX_Min+0*Width, AUX_Min+Width, AUX_Min+2*Width, AUX_Min+3*Width, AUX_Min+4*Width, AUX_Min+5*Width, AUX_Min+6*Width, AUX_Min+7*Width]
#create a histogram at position [i, j] of the plots grid
sns.histplot(x=AUX_data, bins=Ranges, element='bars', kde=True, ax=axes1[i, j]);
# Perform Shapiro-Wilk test
Statistic, Pvalue = stats.shapiro(AUX_data.astype('float64'))
#Print the results in titles
#Interpret the results
alpha = 0.05
if Pvalue > alpha:
#Sample looks Gaussian (fail to reject H0)
axes1[i, j].set_title(f'StatisticSW={Statistic:.3f}, P-value={Pvalue:.3f}', color='green')
else:
#Sample does not look Gaussian (reject H0)
axes1[i, j].set_title(f'StatisticSW={Statistic:.3f}, P-value={Pvalue:.3f}', color='red')
p += 1;
# Adjust layout
plt.tight_layout()
#save the plot as a JPG image with specified DPI
#plt.savefig('results/NewData_Histograms.jpg', format='jpg', dpi=300) #in desktop version Jupyter
plt.savefig('../results/NewData_Histograms.jpg', format='jpg', dpi=300) #in Code Ocean platform
plt.show()
/opt/conda/lib/python3.10/site-packages/scipy/stats/_morestats.py:1879: UserWarning: Input data for shapiro has range zero. The results may not be accurate. warnings.warn("Input data for shapiro has range zero. The results "
Boxplots are performed to identify outliers
Scatter plots with fitted linear regression models and 95% confidence intervals are produced to identify outliers
The scatter plots and fitted linear regression models with 95% confidence intervals show that most of the data points lie within or close to the confidence interval.
# Define colors based on row position for each sampling point in regplots
row_positions = range(len(NewgData[ColumnsList[0]]));
point_colors = sns.color_palette("Paired", n_colors=len(row_positions));
#plot a bar chart to visualize the colors of the sampling points in regplots
plt.figure(figsize=(8, 2))
sns.barplot(x=['P1', 'P2', 'P3', 'P4', 'P5', 'P7', 'P8', 'P9', 'P10'], y=[1, 1, 1, 1, 1, 1, 1, 1, 1], palette=point_colors)
# Remove y-axis and labels for better visualization
plt.yticks([])
# Adjust layout
plt.tight_layout()
#save the plot as a JPG image with specified DPI
#plt.savefig('results/NewData_color_palette_regplots.jpg', format='jpg', dpi=300) #in desktop version Jupyter
plt.savefig('../results/NewData_color_palette_regplots.jpg', format='jpg', dpi=300) #in Code Ocean platform
#create scatter plots with linear regression fits and 95% confidence intervals
L = len(ColumnsList);
# Create a (L-3)x(L-3) grid of plots
#graph the last 15 (L-3) columns of array Data
fig3, axes3 = plt.subplots(L-3, L-3, figsize=(40, 40))
for i in range(3,L):
for j in range(3,L):
#create a plot data and a linear regression model fit with a size of the confidence interval of 95% at position [i, j] of the plots grid
sns.regplot(x = NewgData[ColumnsList[i]].astype('float64'), y = NewgData[ColumnsList[j]].astype('float64'),
ci=95, scatter_kws={'color': point_colors}, ax=axes3[i-3, j-3])
# Adjust layout
plt.tight_layout()
#save the plot as a JPG image with specified DPI
#plt.savefig('results/NewData_Regplots.jpg', format='jpg', dpi=300) #in desktop version Jupyter
plt.savefig('../results/NewData_Regplots.jpg', format='jpg', dpi=300) #in Code Ocean platform
New Modified Data analysis¶
Observation of data distributions:¶
In the previous pairwise scatter plots of the dataset, many variables appear to be strongly associated, so it is convenient to examine the correlations between them. Since some variables are not normally distributed according to the Shapiro-Wilk test, a non-parametric measure, Spearman's Rank Correlation Coefficient (Rho), is used.
Conducting a formal significance test for one of the hypotheses and discuss the results:¶
If the p-value is < 0.05, the null hypothesis is rejected and Spearman’s coefficient (Rho) estimates the strength of association between the two variables, as follows:
Rho value | Strength of association |
---|---|
0.00 - 0.19 | very weak |
0.20 - 0.39 | weak |
0.40 - 0.59 | moderate |
0.60 - 0.79 | strong |
0.80 - 1.00 | very strong |
Heatmap of the Spearman correlation matrix and p-values are performed
# Heatmap of correlation
InterestData = NewgData.iloc[:,3:17].astype('float64')
#invert the order of columns for better visualization in the correlation matrix
InterestData = InterestData[InterestData.columns[::-1]]
#Calculate the Spearman correlation matrix and p-values
correlation_matrix, p_values = stats.spearmanr(InterestData)
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool)) # Mask to hide the upper triangle for better visualization
#mask = np.tril(np.ones_like(correlation_matrix, dtype=bool)) # Mask to hide the lower triangle for better visualization
fig4, ax4 = plt.subplots(figsize=(12, 9))
# Create a heatmap using Seaborn
ax4=sns.heatmap(correlation_matrix, square=True, annot=False, fmt=".2f", cmap='bwr', cbar_kws={"shrink": 0.7}, ax=ax4,
mask=mask, xticklabels=InterestData.columns, yticklabels=InterestData.columns) #, annot_kws = {'size': 6})
# Add asterisks and coefficients to indicate significance levels
for i in range(correlation_matrix.shape[0]):
for j in range(i+1, correlation_matrix.shape[1]):
p_val = p_values[i, j]
correlation_coefficient = correlation_matrix[i, j]
# Format the annotation text with correlation coefficient and asterisks for significance
annotation_text = f"{correlation_coefficient:.2f}\n"
if p_val < 0.001:
annotation_text += '***'
elif p_val < 0.01:
annotation_text += '**'
elif p_val < 0.05:
annotation_text += '*'
            #plt.text(j + 0.5, i + 0.5, annotation_text, ha='center', va='center', color='black', fontsize=10) #for upper triangle
            plt.text(i + 0.5, j + 0.5, annotation_text, ha='center', va='center', color='black', fontsize=10) #for lower triangle
plt.setp(ax4.get_xticklabels(), rotation = 25, ha = "right", rotation_mode = "anchor")
plt.setp(ax4.get_yticklabels(), rotation = 10, ha = "right", rotation_mode = "anchor")
plt.title('Spearman’s Rho matrix with Significance Levels of New Data, P-value: * p < 0.05, ** p < 0.01, ***p < 0.001');
#save the plot as a JPG image with specified DPI
#plt.savefig('results/NewData_SpearmanRho_Pvalue_Matrix.jpg', format='jpg', dpi=300) #in desktop version Jupyter
plt.savefig('../results/NewData_SpearmanRho_Pvalue_Matrix.jpg', format='jpg', dpi=300) #in Code Ocean platform
# Convert the NumPy array to a DataFrame
df_correlation_matrix = pd.DataFrame(correlation_matrix, index=ColumnsList[3:17], columns=ColumnsList[3:17])
df_Pvalue_matrix = pd.DataFrame(p_values, index=ColumnsList[3:17], columns=ColumnsList[3:17])
#save the pandas DataFrames as Excel files in desktop version Jupyter
#df_correlation_matrix.to_excel('results/NewData_SpearmanRhoMatrix.xlsx')
#df_Pvalue_matrix.to_excel('results/NewData_PvalueMatrix.xlsx')
#save the pandas DataFrames as Excel files in Code Ocean platform
df_correlation_matrix.to_excel('../results/NewData_SpearmanRhoMatrix.xlsx')
df_Pvalue_matrix.to_excel('../results/NewData_PvalueMatrix.xlsx')