Python code for data exploration and analysis of the submitted paper: Assessing the physicochemical and microbiological condition of surface waters in Urabá-Colombia: Impact of human activities and agro-industry (2024)¶

Code developed by Victor H. Aristizabal-Tique

Introduction¶

Brief description of the data set and a summary of its attributes:¶

The following table shows the sampling points (P1 to P10) on the León, Chigorodó, Carepa, Zungo, Apartadó, Grande, Arcua, Currulao, Guadualito, and Turbo rivers in the central area of Urabá, Antioquia, Colombia. Water samples from each river were collected, preserved, and handled following the recommendations of Standard Methods, 23rd edition. Temperature and the physicochemical and microbiological parameters were measured at each sampling point.

| Sample point | Location | Latitude | Longitude | Altitude (m) |
|---|---|---|---|---|
| P1 | Turbo River | 8.1266660 | -76.6936676 | 21.4 |
| P2 | Guadualito River | 8.0641056 | -76.6585601 | 28.6 |
| P3 | Currulao River | 7.9878032 | -76.6490078 | 23.4 |
| P4 | Arcua River | 7.9658303 | -76.6240585 | 25.1 |
| P5 | Grande River | 7.9278369 | -76.6214059 | 27.6 |
| P6 | Apartadó River | 7.8962484 | -76.6480077 | 24.7 |
| P7 | Zungo River | 7.8241638 | -76.6484759 | 39.3 |
| P8 | Carepa River | 7.7647969 | -76.6639725 | 31.3 |
| P9 | Chigorodó River | 7.6729423 | -76.6841424 | 34.0 |
| P10 | León River | 7.5716390 | -76.7110000 | 16.6 |

In general, the data are structured and complete, so data cleaning and feature engineering are not necessary.
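Completeness can be confirmed with a quick check (a minimal sketch, assuming the same Excel file that is loaded below):

import pandas as pd

df = pd.read_excel('../data/SamplingPointsPhysicochemicalParameters.xlsx')
print(df.isna().sum())  # all zeros: no missing values to impute
print(df.dtypes)        # inspect the inferred column types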

Initial plan for data exploration and analysis:¶

First, the original data are loaded from an Excel file into a Pandas DataFrame called OrigData. The number of elements, rows, and columns is reviewed, and the data types and the column and row labels are explored. A general visualization of the data is then performed to identify outliers, using histograms, boxplots, and scatter plots with linear regression fits and 95% confidence intervals. Finally, the Spearman correlation and p-value matrices are calculated to estimate the strength of association between variables.

Second, once the outliers are identified, they are removed, a new Pandas DataFrame called NewData is created, and the steps described above are repeated.
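As an alternative to loading a second Excel file (the approach used below), the outlier rows could be dropped directly from the original DataFrame (a sketch, assuming OrigData is loaded as below and P6 is the identified outlier source):

NewData = OrigData[OrigData['Sampling point'] != 'P6'].reset_index(drop=True)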

Key Findings and Insights, which synthesizes the results of Exploratory Data Analysis in an insightful and actionable manner:¶

In general, the data exploration made it possible to identify the main source of outliers, which motivated a second analysis excluding that source. A non-parametric analysis is also required because some variables are not normally distributed. Finally, comparing the Spearman rank correlation coefficient (Rho) matrices with and without outliers shows that most coefficients retained their significance. However, some of the correlations between population and the other variables lost significance when the outliers were removed, which is striking given the strong theoretical basis for an association between microbiological variables and human activity in water sources. This indicates that one must be very careful before deciding to remove an outlier without both theoretical and statistical support.
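Such a comparison can be automated; the sketch below flags variable pairs that lose significance once the outliers are removed. The names df_Pvalue_orig and df_Pvalue_new are hypothetical and stand for the p-value DataFrames computed for the original and outlier-free data in the cells below:

alpha = 0.05
# hypothetical: p-value matrices from the analyses with and without outliers
lost = (df_Pvalue_orig < alpha) & (df_Pvalue_new >= alpha)
print(lost[lost.any(axis=1)])  # pairs that were significant only with outliers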

Here starts the Python script¶

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
#import os
#from scipy import stats
#from sklearn.impute import SimpleImputer
#from sklearn.impute import KNNImputer

Original Data input and loading¶

Load and display the data and its properties from an Excel file.

In [2]:
#data file path
#OrigInputFilePath = 'data/SamplingPointsPhysicochemicalParameters.xlsx' #in desktop Jupyter
OrigInputFilePath = '../data/SamplingPointsPhysicochemicalParameters.xlsx' #in Code Ocean platform

#load the data from the Excel file into a Pandas DataFrame
OrigData = pd.read_excel(OrigInputFilePath, dtype='object')

#display all columns
pd.set_option('display.max_columns', OrigData.shape[1])
#pd.set_option('display.max_rows', 6) #display 6 rows

#display all rows
pd.set_option('display.max_rows', OrigData.shape[0])

#display Data
OrigData
Out[2]:
Sampling point Latitude Longitude Altitude (m) Distance (m) Population (hab) Ambient Temp. (°C) Water Temp. (°C) DO (mg/L) pH Conductivity (mS/cm) Ammonia nitrogen (mg/L) Ortho phosphate (mg/L) BOD5 (mg/L) COD (mg/L) Viable heterotrophs (CFU/mL) E. coli (MPN/100 mL) Total coliforms (MPN/100 mL)
0 P1 8.126666 -76.693668 21.4 6595.6 4033 26.6 27.4 7.71 8.2 847 7.6 0.108 220 287.1 34000 540 1600
1 P2 8.064106 -76.65856 28.6 9747.4 10955 28.1 27.3 7.62 8.2 774 6.8 0.128 215 239 57000 920 1600
2 P3 7.987803 -76.649008 23.4 10783.6 23344 30.4 28.8 7.55 7.9 714 8 0.131 191.7 255.6 200000 1600 1600
3 P4 7.96583 -76.624058 25.1 15754.4 9682 29.8 26.7 7.36 8 464 7.9 0.193 188.3 209.7 200000 920 1600
4 P5 7.927837 -76.621406 27.6 15562 7490 28 27.4 7.62 7.9 428 8.3 0.136 154.3 176.7 200000 1600 1600
5 P6 7.896248 -76.648008 24.7 15194.3 98454 30.8 28.6 0.03 7.5 579 17.7 0.316 241.7 324.5 200000 1600 1600
6 P7 7.824164 -76.648476 39.3 19587.1 9093 29.9 27.8 8 7.7 305 8.5 0.07 113.3 151.3 150000 735 1600
7 P8 7.764797 -76.663972 31.3 34983.3 33009 30.1 27.5 7.04 7.9 428 15.1 0.085 46.2 90 200000 1600 1600
8 P9 7.672942 -76.684142 34 40356.1 47046 27.1 26.7 7.74 7.1 186 11.3 0.09 120 121.9 200000 1600 1600
9 P10 7.571639 -76.711 16.6 53325.7 4597 28 27.8 6.73 7.5 169 7.9 0.126 78.3 104.5 57000 1000 1600
In [3]:
# displays number of elements
print('number of elements:', OrigData.size)

# displays number of rows and columns
print('\nnumber of rows and columns:', OrigData.shape, '\n')

# Examine the columns, look at missing data
OrigData.info()
number of elements: 180

number of rows and columns: (10, 18) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Sampling point                10 non-null     object
 1   Latitude                      10 non-null     object
 2   Longitude                     10 non-null     object
 3   Altitude (m)                  10 non-null     object
 4   Distance (m)                  10 non-null     object
 5   Population (hab)              10 non-null     object
 6   Ambient Temp. (°C)            10 non-null     object
 7   Water Temp. (°C)              10 non-null     object
 8   DO (mg/L)                     10 non-null     object
 9   pH                            10 non-null     object
 10  Conductivity (mS/cm)          10 non-null     object
 11  Ammonia nitrogen (mg/L)       10 non-null     object
 12  Ortho phosphate (mg/L)        10 non-null     object
 13  BOD5 (mg/L)                   10 non-null     object
 14  COD (mg/L)                    10 non-null     object
 15  Viable heterotrophs (CFU/mL)  10 non-null     object
 16  E. coli (MPN/100 mL)          10 non-null     object
 17  Total coliforms (MPN/100 mL)  10 non-null     object
dtypes: object(18)
memory usage: 1.5+ KB
In [4]:
# list of Column names
ColumnsList = OrigData.columns.tolist()
print('\nColumns list:', ColumnsList)
Columns list: ['Sampling point', 'Latitude', 'Longitude', 'Altitude (m)', 'Distance (m)', 'Population (hab)', 'Ambient Temp. (°C)', 'Water Temp. (°C)', 'DO (mg/L)', 'pH', 'Conductivity (mS/cm)', 'Ammonia nitrogen (mg/L)', 'Ortho phosphate (mg/L)', 'BOD5 (mg/L)', 'COD (mg/L)', 'Viable heterotrophs (CFU/mL)', 'E. coli (MPN/100 mL)', 'Total coliforms (MPN/100 mL)']

Original Data exploration¶

Complete and fully structured data is explored

Histograms and Shapiro-Wilk test of normality are performed

The results show that some variables are not normally distributed according to the Shapiro-Wilk test.

In [5]:
M = 5
#graph the last 15 columns of the DataFrame:
#N rows of M plots each, where N = (number of columns - 3) / M
N = int((len(ColumnsList)-3)/M)

#create an NxM grid of plots
fig1, axes1 = plt.subplots(N, M, figsize=(18, 6))

#counter for ColumnsList
p = 3

for i in range(0,N):
    for j in range(0,M):
        
        #extract only the data from the column under analysis
        AUX_data = OrigData[ColumnsList[p]]
        
        #compute the bar width for the histogram
        Width = (max(AUX_data) - min(AUX_data)) / 5
        
        if Width < 3:
            Width = round(Width, 2)
        else:
            Width = int(Width)
        
        #extract the integer part of the minimum value of the column under analysis
        AUX_Min = int(min(AUX_data))
        
        #define the bin edges for the histogram
        Ranges = [AUX_Min + k*Width for k in range(8)]
        
        #create a histogram at position [i, j] of the plots grid
        sns.histplot(x=AUX_data, bins=Ranges, element='bars', kde=True, ax=axes1[i, j])
        
        # Perform Shapiro-Wilk test
        Statistic, Pvalue = stats.shapiro(AUX_data.astype('float64'))
        
        #Print the results in titles
        #Interpret the results
        alpha = 0.05
        if Pvalue > alpha:
            #Sample looks Gaussian (fail to reject H0)
            axes1[i, j].set_title(f'StatisticSW={Statistic:.3f}, P-value={Pvalue:.3f}', color='green')
        else:
            #Sample does not look Gaussian (reject H0)
            axes1[i, j].set_title(f'StatisticSW={Statistic:.3f}, P-value={Pvalue:.3f}', color='red')
            
        p += 1
        
# Adjust layout
plt.tight_layout()

#save the plot as a JPG image with specified DPI
#plt.savefig('results/OrigData_Histograms.jpg', format='jpg', dpi=300) #in desktop Jupyter
plt.savefig('../results/OrigData_Histograms.jpg', format='jpg', dpi=300) #in Code Ocean platform

plt.show()
/opt/conda/lib/python3.10/site-packages/scipy/stats/_morestats.py:1879: UserWarning: Input data for shapiro has range zero. The results may not be accurate.
  warnings.warn("Input data for shapiro has range zero. The results "
[Figure: histograms with KDE and Shapiro-Wilk statistics for each variable]
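The UserWarning above is triggered because 'Total coliforms (MPN/100 mL)' is constant (1600 at every sampling point), so its range is zero. A guard such as the following sketch would skip the test for constant columns:

values = AUX_data.astype('float64')
if values.max() > values.min():
    Statistic, Pvalue = stats.shapiro(values)
else:
    # the test is not meaningful for zero-range data
    Statistic, Pvalue = float('nan'), float('nan')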

Boxplots are performed to identify outliers

Five of the seven boxplot outliers correspond to P6; the other two correspond to P3 (Water Temp.) and P8 (Ammonia nitrogen). Accordingly, it is convenient to exclude the P6 data.
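The visual reading can be cross-checked with the 1.5 x IQR rule that boxplots use implicitly (a sketch, assuming OrigData and ColumnsList are defined as above):

numeric = OrigData[ColumnsList[3:]].astype('float64')
Q1, Q3 = numeric.quantile(0.25), numeric.quantile(0.75)
IQR = Q3 - Q1
outlier_mask = (numeric < Q1 - 1.5*IQR) | (numeric > Q3 + 1.5*IQR)
# for each variable with outliers, list the flagged sampling points
for col in numeric.columns[outlier_mask.any()]:
    print(col, OrigData.loc[outlier_mask[col], 'Sampling point'].tolist())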

In [6]:
#create a NxM grid of plots
fig2, axes2 = plt.subplots(N, M, figsize=(12, 6))

#counter for ColumnsList
p = 3

for i in range(0,N):
    for j in range(0,M):
        #create a boxplot at position [i, j] of the plots grid
        sns.boxplot(x=OrigData[ColumnsList[p]], ax=axes2[i, j])
        p += 1

# Adjust layout
plt.tight_layout()

#save the plot as a JPG image with specified DPI
#plt.savefig('results/OrigData_Boxplots.jpg', format='jpg', dpi=300) #in desktop Jupyter
plt.savefig('../results/OrigData_Boxplots.jpg', format='jpg', dpi=300) #in Code Ocean platform

plt.show()
[Figure: boxplots for each variable]

Scatter plots with linear regression fits and 95% confidence intervals are performed to identify outliers

The scatter plots with linear regression fits and 95% confidence intervals show that several P6 data points lie outside or far from the confidence interval; P6 is therefore identified as the main source of outliers. This supports the boxplot-based decision above to exclude the P6 data.
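The same idea can be quantified for any single pair of variables via standardized residuals from a linear fit (a hypothetical check; the DO vs. ammonia nitrogen pair is chosen only as an illustration):

x = OrigData['Ammonia nitrogen (mg/L)'].astype('float64')
y = OrigData['DO (mg/L)'].astype('float64')
slope, intercept, r, p, se = stats.linregress(x, y)
residuals = y - (slope*x + intercept)
standardized = (residuals - residuals.mean()) / residuals.std()
# points more than 2 standard deviations from the fit are candidate outliers
print(OrigData.loc[standardized.abs() > 2, 'Sampling point'].tolist())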

In [7]:
# Define colors based on row position for each sampling point in regplots
row_positions = range(len(OrigData[ColumnsList[0]]))
point_colors = sns.color_palette("Paired", n_colors=len(row_positions))

#plot a bar chart to visualize the colors of the sampling points in regplots
plt.figure(figsize=(8, 2))
sns.barplot(x=['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8', 'P9', 'P10'], y=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], palette=point_colors)

# Remove y-axis and labels for better visualization
plt.yticks([])

# Adjust layout
plt.tight_layout()

#save the plot as a JPG image with specified DPI
#plt.savefig('results/OrigData_color_palette_regplots.jpg', format='jpg', dpi=300) #in desktop Jupyter
plt.savefig('../results/OrigData_color_palette_regplots.jpg', format='jpg', dpi=300) #in Code Ocean platform
[Figure: color palette legend for sampling points P1-P10]
In [8]:
#create scattered data and a linear regression model fit with a size of the confidence interval of 95%

L = len(ColumnsList)
#graph the last 15 (L-3) columns of the data in a (L-3)x(L-3) grid of plots
fig3, axes3 = plt.subplots(L-3, L-3, figsize=(40, 40))

for i in range(3,L):
    for j in range(3,L):
        
        #scatter plot and linear regression fit with a 95% confidence interval at position [i-3, j-3] of the plots grid
        sns.regplot(x = OrigData[ColumnsList[i]].astype('float64'), y = OrigData[ColumnsList[j]].astype('float64'),
                    ci=95, scatter_kws={'color': point_colors}, ax=axes3[i-3, j-3])

# Adjust layout
plt.tight_layout()

#save the plot as a JPG image with specified DPI
#plt.savefig('results/OrigData_Regplots.jpg', format='jpg', dpi=300) #in desktop Jupyter
plt.savefig('../results/OrigData_Regplots.jpg', format='jpg', dpi=300) #in Code Ocean platform
[Figure: pairwise scatter plots with linear regression fits and 95% confidence intervals]

Original Data analysis¶

Observation of data distributions:¶

The previous pairwise scatter plots of the dataset suggest that many variables are strongly associated, so it is worth computing the correlation between them. Since some variables are not normally distributed according to the Shapiro-Wilk test, a non-parametric measure, Spearman's rank correlation coefficient (Rho), is used.

Conducting a formal significance test for one of the hypotheses and discussing the results:¶

If the p-value is < 0.05, we reject the null hypothesis and Spearman's coefficient (Rho) estimates the strength of association between two variables, as follows (a helper encoding this scale is sketched after the table):

| Rho value | Strength of association |
|---|---|
| 0.00 - 0.19 | very weak |
| 0.20 - 0.39 | weak |
| 0.40 - 0.59 | moderate |
| 0.60 - 0.79 | strong |
| 0.80 - 1.00 | very strong |
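This scale can be encoded as a small helper function (a hypothetical convenience, not part of the original analysis):

def rho_strength(rho):
    #map |Rho| to the qualitative labels of the table above
    r = abs(rho)
    if r < 0.20:
        return 'very weak'
    elif r < 0.40:
        return 'weak'
    elif r < 0.60:
        return 'moderate'
    elif r < 0.80:
        return 'strong'
    return 'very strong'

print(rho_strength(0.72))  # 'strong'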

A heatmap of the Spearman correlation matrix with p-values is generated

In [9]:
# Heatmap of correlation
InterestData = OrigData.iloc[:,3:17].astype('float64')

#invert the order of columns for better visualization in the correlation matrix
InterestData = InterestData[InterestData.columns[::-1]]

#Calculate the Spearman correlation matrix and p-values
correlation_matrix, p_values = stats.spearmanr(InterestData)

mask = np.triu(np.ones_like(correlation_matrix, dtype=bool)) # Mask to hide the upper triangle for better visualization
#mask = np.tril(np.ones_like(correlation_matrix, dtype=bool)) # Mask to hide the lower triangle for better visualization

fig4, ax4 = plt.subplots(figsize=(12, 9))

# Create a heatmap using Seaborn
ax4=sns.heatmap(correlation_matrix, square=True, annot=False, fmt=".2f", cmap='bwr', cbar_kws={"shrink": 0.7}, ax=ax4,
               mask=mask, xticklabels=InterestData.columns, yticklabels=InterestData.columns) #, annot_kws = {'size': 6})

# Add asterisks and coefficients to indicate significance levels
for i in range(correlation_matrix.shape[0]):
    for j in range(i+1, correlation_matrix.shape[1]):
        p_val = p_values[i, j]
        correlation_coefficient = correlation_matrix[i, j]
        
        # Format the annotation text with correlation coefficient and asterisks for significance
        annotation_text = f"{correlation_coefficient:.2f}\n"
        if p_val < 0.001:
            annotation_text += '***'
        elif p_val < 0.01:
            annotation_text += '**'
        elif p_val < 0.05:
            annotation_text += '*'
        
        #plt.text(j + 0.5, i + 0.5, annotation_text, ha='center', va='center', color='black', fontsize=10) #for upper triangle
        plt.text(i + 0.5, j + 0.5, annotation_text, ha='center', va='center', color='black', fontsize=10) #for lower triangle

plt.setp(ax4.get_xticklabels(), rotation = 25, ha = "right", rotation_mode = "anchor")
plt.setp(ax4.get_yticklabels(), rotation = 10, ha = "right", rotation_mode = "anchor")
plt.title('Spearman’s Rho matrix with Significance Levels of Original Data, P-value: * p < 0.05, ** p < 0.01, ***p < 0.001');

#save the plot as a JPG image with specified DPI
#plt.savefig('results/OrigData_SpearmanRho_Pvalue_Matrix.jpg', format='jpg', dpi=300) #in desktop Jupyter
plt.savefig('../results/OrigData_SpearmanRho_Pvalue_Matrix.jpg', format='jpg', dpi=300) #in Code Ocean platform

# Convert the NumPy arrays to DataFrames; labels must follow the reversed column order of InterestData
df_correlation_matrix = pd.DataFrame(correlation_matrix, index=InterestData.columns, columns=InterestData.columns)
df_Pvalue_matrix = pd.DataFrame(p_values, index=InterestData.columns, columns=InterestData.columns)

#save the DataFrames as Excel files in desktop Jupyter
#df_correlation_matrix.to_excel('results/OrigData_SpearmanRhoMatrix.xlsx') 
#df_Pvalue_matrix.to_excel('results/OrigData_PvalueMatrix.xlsx')
#save the DataFrames as Excel files in Code Ocean platform
df_correlation_matrix.to_excel('../results/OrigData_SpearmanRhoMatrix.xlsx') 
df_Pvalue_matrix.to_excel('../results/OrigData_PvalueMatrix.xlsx')
[Figure: heatmap of Spearman's Rho matrix with significance levels for the original data]

New Modified Data input and loading¶

Load and display the data without outliers and its properties from an Excel file. Here, the P6 data are excluded.

In [10]:
#data file path
#NewInputFilePath = 'data/SamplingPointsPhysicochemicalParameters_WithoutOutliers.xlsx' #in desktop Jupyter
NewInputFilePath = '../data/SamplingPointsPhysicochemicalParameters_WithoutOutliers.xlsx' #in Code Ocean platform

#load the data without outliers from the Excel file into a Pandas DataFrame
NewData = pd.read_excel(NewInputFilePath, dtype='object')

#display all columns
pd.set_option('display.max_columns', NewData.shape[1])
#pd.set_option('display.max_rows', 6) #display 6 rows

#display all rows
pd.set_option('display.max_rows', NewData.shape[0])

#display Data
NewData
Out[10]:
Sampling point Latitude Longitude Altitude (m) Distance (m) Population (hab) Ambient Temp. (°C) Water Temp. (°C) DO (mg/L) pH Conductivity (mS/cm) Ammonia nitrogen (mg/L) Ortho phosphate (mg/L) BOD5 (mg/L) COD (mg/L) Viable heterotrophs (CFU/mL) E. coli (MPN/100 mL) Total coliforms (MPN/100 mL)
0 P1 8.126666 -76.693668 21.4 6595.6 4033 26.6 27.4 7.71 8.2 847 7.6 0.108 220 287.1 34000 540 1600
1 P2 8.064106 -76.65856 28.6 9747.4 10955 28.1 27.3 7.62 8.2 774 6.8 0.128 215 239 57000 920 1600
2 P3 7.987803 -76.649008 23.4 10783.6 23344 30.4 28.8 7.55 7.9 714 8 0.131 191.7 255.6 200000 1600 1600
3 P4 7.96583 -76.624058 25.1 15754.4 9682 29.8 26.7 7.36 8 464 7.9 0.193 188.3 209.7 200000 920 1600
4 P5 7.927837 -76.621406 27.6 15562 7490 28 27.4 7.62 7.9 428 8.3 0.136 154.3 176.7 200000 1600 1600
5 P7 7.824164 -76.648476 39.3 19587.1 9093 29.9 27.8 8 7.7 305 8.5 0.07 113.3 151.3 150000 735 1600
6 P8 7.764797 -76.663972 31.3 34983.3 33009 30.1 27.5 7.04 7.9 428 15.1 0.085 46.2 90 200000 1600 1600
7 P9 7.672942 -76.684142 34 40356.1 47046 27.1 26.7 7.74 7.1 186 11.3 0.09 120 121.9 200000 1600 1600
8 P10 7.571639 -76.711 16.6 53325.7 4597 28 27.8 6.73 7.5 169 7.9 0.126 78.3 104.5 57000 1000 1600
In [11]:
# displays number of elements
print('number of elements:', NewData.size)

# displays number of rows and columns
print('\nnumber of rows and columns:', NewData.shape, '\n')

# Examine the columns, look at missing data
NewData.info()
number of elements: 162

number of rows and columns: (9, 18) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Sampling point                9 non-null      object
 1   Latitude                      9 non-null      object
 2   Longitude                     9 non-null      object
 3   Altitude (m)                  9 non-null      object
 4   Distance (m)                  9 non-null      object
 5   Population (hab)              9 non-null      object
 6   Ambient Temp. (°C)            9 non-null      object
 7   Water Temp. (°C)              9 non-null      object
 8   DO (mg/L)                     9 non-null      object
 9   pH                            9 non-null      object
 10  Conductivity (mS/cm)          9 non-null      object
 11  Ammonia nitrogen (mg/L)       9 non-null      object
 12  Ortho phosphate (mg/L)        9 non-null      object
 13  BOD5 (mg/L)                   9 non-null      object
 14  COD (mg/L)                    9 non-null      object
 15  Viable heterotrophs (CFU/mL)  9 non-null      object
 16  E. coli (MPN/100 mL)          9 non-null      object
 17  Total coliforms (MPN/100 mL)  9 non-null      object
dtypes: object(18)
memory usage: 1.4+ KB
In [12]:
# list of Column names
ColumnsList = NewData.columns.tolist()
print('\nColumns list:', ColumnsList)
Columns list: ['Sampling point', 'Latitude', 'Longitude', 'Altitude (m)', 'Distance (m)', 'Population (hab)', 'Ambient Temp. (°C)', 'Water Temp. (°C)', 'DO (mg/L)', 'pH', 'Conductivity (mS/cm)', 'Ammonia nitrogen (mg/L)', 'Ortho phosphate (mg/L)', 'BOD5 (mg/L)', 'COD (mg/L)', 'Viable heterotrophs (CFU/mL)', 'E. coli (MPN/100 mL)', 'Total coliforms (MPN/100 mL)']

New Modified Data exploration¶

Complete and fully structured data without outliers is explored.

Histograms and Shapiro-Wilk test of normality are performed

The results show that some variables are not normally distributed according to the Shapiro-Wilk test.

In [13]:
M = 5
#graph the last 15 columns of the DataFrame:
#N rows of M plots each, where N = (number of columns - 3) / M
N = int((len(ColumnsList)-3)/M)

#create an NxM grid of plots
fig1, axes1 = plt.subplots(N, M, figsize=(18, 6))

#counter for ColumnsList
p = 3

for i in range(0,N):
    for j in range(0,M):
        
        #extract only the data from the column under analysis
        AUX_data = NewData[ColumnsList[p]]
        
        #compute the bar width for the histogram
        Width = (max(AUX_data) - min(AUX_data)) / 4
        
        if Width < 3:
            Width = round(Width, 2)
        else:
            Width = int(Width)
        
        #extract the integer part of the minimum value of the column under analysis
        AUX_Min = int(min(AUX_data))
        
        #define the bin edges for the histogram
        Ranges = [AUX_Min + k*Width for k in range(8)]
        
        #create a histogram at position [i, j] of the plots grid
        sns.histplot(x=AUX_data, bins=Ranges, element='bars', kde=True, ax=axes1[i, j])
        
        # Perform Shapiro-Wilk test
        Statistic, Pvalue = stats.shapiro(AUX_data.astype('float64'))
        
        #Print the results in titles
        #Interpret the results
        alpha = 0.05
        if Pvalue > alpha:
            #Sample looks Gaussian (fail to reject H0)
            axes1[i, j].set_title(f'StatisticSW={Statistic:.3f}, P-value={Pvalue:.3f}', color='green')
        else:
            #Sample does not look Gaussian (reject H0)
            axes1[i, j].set_title(f'StatisticSW={Statistic:.3f}, P-value={Pvalue:.3f}', color='red')
            
        p += 1
        
# Adjust layout
plt.tight_layout()

#save the plot as a JPG image with specified DPI
#plt.savefig('results/NewData_Histograms.jpg', format='jpg', dpi=300) #in desktop Jupyter
plt.savefig('../results/NewData_Histograms.jpg', format='jpg', dpi=300) #in Code Ocean platform

plt.show()
/opt/conda/lib/python3.10/site-packages/scipy/stats/_morestats.py:1879: UserWarning: Input data for shapiro has range zero. The results may not be accurate.
  warnings.warn("Input data for shapiro has range zero. The results "
[Figure: histograms with KDE and Shapiro-Wilk statistics for each variable, without outliers]


Scatter plots with linear regression fits and 95% confidence intervals are performed to identify outliers

The scatter plots with linear regression fits and 95% confidence intervals show that most data points now lie within or close to the confidence interval.

In [14]:
# Define colors based on row position for each sampling point in regplots
row_positions = range(len(NewData[ColumnsList[0]]))
point_colors = sns.color_palette("Paired", n_colors=len(row_positions))

#plot a bar chart to visualize the colors of the sampling points in regplots
plt.figure(figsize=(8, 2))
sns.barplot(x=['P1', 'P2', 'P3', 'P4', 'P5', 'P7', 'P8', 'P9', 'P10'], y=[1, 1, 1, 1, 1, 1, 1, 1, 1], palette=point_colors)

# Remove y-axis and labels for better visualization
plt.yticks([])

# Adjust layout
plt.tight_layout()

#save the plot as a JPG image with specified DPI
#plt.savefig('results/NewData_color_palette_regplots.jpg', format='jpg', dpi=300) #in desktop Jupyter
plt.savefig('../results/NewData_color_palette_regplots.jpg', format='jpg', dpi=300) #in Code Ocean platform
[Figure: color palette legend for sampling points P1-P5 and P7-P10]
In [15]:
#create scattered data and a linear regression model fit with a size of the confidence interval of 95%

L = len(ColumnsList)
#graph the last 15 (L-3) columns of the data in a (L-3)x(L-3) grid of plots
fig3, axes3 = plt.subplots(L-3, L-3, figsize=(40, 40))

for i in range(3,L):
    for j in range(3,L):
        
        #scatter plot and linear regression fit with a 95% confidence interval at position [i-3, j-3] of the plots grid
        sns.regplot(x = NewData[ColumnsList[i]].astype('float64'), y = NewData[ColumnsList[j]].astype('float64'),
                    ci=95, scatter_kws={'color': point_colors}, ax=axes3[i-3, j-3])

# Adjust layout
plt.tight_layout()

#save the plot as a JPG image with specified DPI
#plt.savefig('results/NewData_Regplots.jpg', format='jpg', dpi=300) #in desktop Jupyter
plt.savefig('../results/NewData_Regplots.jpg', format='jpg', dpi=300) #in Code Ocean platform
[Figure: pairwise scatter plots with linear regression fits and 95% confidence intervals, without outliers]

New Modified Data analysis¶

Observation of data distributions:¶

As with the original data, the pairwise scatter plots suggest that many variables are strongly associated, so it is worth computing the correlation between them. Since some variables are not normally distributed according to the Shapiro-Wilk test, a non-parametric measure, Spearman's rank correlation coefficient (Rho), is used.

Conducting a formal significance test for one of the hypotheses and discussing the results:¶

If the p-value is < 0.05, we reject the null hypothesis and Spearman's coefficient (Rho) estimates the strength of association between two variables, as follows:

| Rho value | Strength of association |
|---|---|
| 0.00 - 0.19 | very weak |
| 0.20 - 0.39 | weak |
| 0.40 - 0.59 | moderate |
| 0.60 - 0.79 | strong |
| 0.80 - 1.00 | very strong |

A heatmap of the Spearman correlation matrix with p-values is generated

In [16]:
# Heatmap of correlation
InterestData = NewData.iloc[:,3:17].astype('float64')

#invert the order of columns for better visualization in the correlation matrix
InterestData = InterestData[InterestData.columns[::-1]]

#Calculate the Spearman correlation matrix and p-values
correlation_matrix, p_values = stats.spearmanr(InterestData)

mask = np.triu(np.ones_like(correlation_matrix, dtype=bool)) # Mask to hide the upper triangle for better visualization
#mask = np.tril(np.ones_like(correlation_matrix, dtype=bool)) # Mask to hide the lower triangle for better visualization

fig4, ax4 = plt.subplots(figsize=(12, 9))

# Create a heatmap using Seaborn
ax4=sns.heatmap(correlation_matrix, square=True, annot=False, fmt=".2f", cmap='bwr', cbar_kws={"shrink": 0.7}, ax=ax4,
               mask=mask, xticklabels=InterestData.columns, yticklabels=InterestData.columns) #, annot_kws = {'size': 6})

# Add asterisks and coefficients to indicate significance levels
for i in range(correlation_matrix.shape[0]):
    for j in range(i+1, correlation_matrix.shape[1]):
        p_val = p_values[i, j]
        correlation_coefficient = correlation_matrix[i, j]
        
        # Format the annotation text with correlation coefficient and asterisks for significance
        annotation_text = f"{correlation_coefficient:.2f}\n"
        if p_val < 0.001:
            annotation_text += '***'
        elif p_val < 0.01:
            annotation_text += '**'
        elif p_val < 0.05:
            annotation_text += '*'
        
        #plt.text(j + 0.5, i + 0.5, annotation_text, ha='center', va='center', color='black', fontsize=10) #for upper triangle
        plt.text(i + 0.5, j + 0.5, annotation_text, ha='center', va='center', color='black', fontsize=10) #for lower triangle

plt.setp(ax4.get_xticklabels(), rotation = 25, ha = "right", rotation_mode = "anchor")
plt.setp(ax4.get_yticklabels(), rotation = 10, ha = "right", rotation_mode = "anchor")
plt.title('Spearman’s Rho matrix with Significance Levels of New Data, P-value: * p < 0.05, ** p < 0.01, ***p < 0.001');

#save the plot as a JPG image with specified DPI
#plt.savefig('results/NewData_SpearmanRho_Pvalue_Matrix.jpg', format='jpg', dpi=300) #in desktop Jupyter
plt.savefig('../results/NewData_SpearmanRho_Pvalue_Matrix.jpg', format='jpg', dpi=300) #in Code Ocean platform

# Convert the NumPy arrays to DataFrames; labels must follow the reversed column order of InterestData
df_correlation_matrix = pd.DataFrame(correlation_matrix, index=InterestData.columns, columns=InterestData.columns)
df_Pvalue_matrix = pd.DataFrame(p_values, index=InterestData.columns, columns=InterestData.columns)

#save the DataFrames as Excel files in desktop Jupyter
#df_correlation_matrix.to_excel('results/NewData_SpearmanRhoMatrix.xlsx') 
#df_Pvalue_matrix.to_excel('results/NewData_PvalueMatrix.xlsx')
#save the DataFrames as Excel files in Code Ocean platform
df_correlation_matrix.to_excel('../results/NewData_SpearmanRhoMatrix.xlsx') 
df_Pvalue_matrix.to_excel('../results/NewData_PvalueMatrix.xlsx')
[Figure: heatmap of Spearman's Rho matrix with significance levels for the new data]