Normalisation and correlations




In this post, we will cover:

1. Normalizing a dataset
2. Visualizing normalization
3. Determining correlation between variables
4. Visualizing correlation

Find the dataset to work on here: raw_data.csv

I. Normalizing dataset


The goal is to bring all input variables onto a comparable scale, ideally a range such as −1 ≤ x ≤ 1, −0.5 ≤ x ≤ 0.5 or 0 ≤ x ≤ 1. Min-max normalization does this by subtracting the minimum of the input variable from each value and dividing the result by the range (i.e. the maximum value minus the minimum value) of that variable. The normalized variable then runs from 0 to 1, so its new range is exactly 1.
To normalize to a range of 1, perform the following steps:

Step 1: Create a normalizing function:


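## Min-max normalization: rescale x to the range 0 to 1
normalize_function <- function(x) {
  # Subtract the minimum, then divide by the range (max - min)
  return((x - min(x)) / (max(x) - min(x)))
}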
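As a quick sanity check, the function can be tried on a small made-up vector. The values below are purely illustrative and are not taken from raw_data.csv:

```r
# Min-max normalization, same definition as in Step 1
normalize_function <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical example values, chosen only for illustration
x <- c(2, 4, 6, 10)

# min = 2, max = 10, range = 8, so each value becomes (x - 2) / 8
normalized <- normalize_function(x)
print(normalized)
# [1] 0.00 0.25 0.50 1.00
```

Note that a variable whose values are all identical has a range of 0, so this function would return NaN for it.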

Step 2: Apply the function to the desired variables or to the entire dataset.
Note: only continuous variables/attributes need to be normalized; there is no need to normalize discrete/categorical values.


## insurance is the data frame into which we loaded the original dataset
N_data <- as.data.frame(lapply(insurance, normalize_function))
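If the data frame also contains categorical columns, applying the function to every column would fail, because min() and max() are not defined for factors. One possible way to normalize only the numeric columns is sketched below. The data frame toy and its column names are hypothetical stand-ins, not the actual contents of raw_data.csv:

```r
# Min-max normalization, same definition as in Step 1
normalize_function <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical stand-in for the original data frame (made-up columns)
toy <- data.frame(
  age    = c(19, 33, 45, 62),
  charge = c(1200, 4500, 8300, 15000),
  smoker = factor(c("yes", "no", "no", "yes"))
)

# Logical vector marking which columns are numeric
numeric_cols <- sapply(toy, is.numeric)

# Copy the data and overwrite only the numeric columns
N_toy <- toy
N_toy[numeric_cols] <- lapply(toy[numeric_cols], normalize_function)
print(N_toy)
```

The categorical column smoker is carried over unchanged, while age and charge are rescaled to the range 0 to 1.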

II. Visualizing normalization

The effect of normalizing the dataset can be seen by plotting a boxplot.
But first, let’s look at the summary statistics of our dataset.


Take a look at the minimum and maximum of each attribute in both the original and the normalized dataset. In the normalized dataset, the values of every attribute are scaled to a range of 0 to 1.
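A comparison of this kind could be produced with base R's summary() function. The sketch below uses a small hypothetical data frame (toy) in place of the real dataset, so the numbers it prints are not those of raw_data.csv:

```r
# Min-max normalization, same definition as in Step 1
normalize_function <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical data, made up for illustration only
toy <- data.frame(A = c(-5, 0, 3, 20), B = c(1, 2, 4, 8))
N_toy <- as.data.frame(lapply(toy, normalize_function))

# Min, quartiles, median, mean and max of every column
print(summary(toy))
print(summary(N_toy))

# Only the minimum (row 1) and maximum (row 2) of every column
ranges_original   <- sapply(toy, range)
ranges_normalized <- sapply(N_toy, range)
print(ranges_original)
print(ranges_normalized)
```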

Now let’s compare these values by visualizing boxplots for both datasets.




library(ggplot2)
library(gridExtra)
library(reshape2)

## ggplot2 expects the data in long format:
# x = the column names (variable) and y = the corresponding values (value).
# melt() from the reshape2 package reshapes the data frame into this format.

p1 <- ggplot(data = melt(insurance), aes(x = variable, y = value)) +
  geom_boxplot(aes(fill = variable)) +
  ggtitle("Boxplot of original data")
p2 <- ggplot(data = melt(N_data), aes(x = variable, y = value)) +
  geom_boxplot(aes(fill = variable)) +
  ggtitle("Boxplot of normalized data")

## Show both plots side by side
grid.arrange(p1, p2, ncol = 2)




For the original dataset, the values in the boxplot range from about -5 to +20, whereas for the normalized data every attribute lies between 0 and 1. Isn’t it simpler to compare the attributes now?

III. Determining correlation between variables

In statistics, correlation measures how strongly two variables depend on each other. By computing the correlation between attributes, we can determine which attributes vary together linearly, either positively or negatively.

For reference, the correlation coefficient always lies between -1 and +1. As a rule of thumb, once its magnitude approaches or exceeds 0.5, the attributes become more and more strongly correlated, with the sign of the value telling us whether the correlation is positive or negative.

For our dataset, the correlation between attributes A and B is -0.03059086. A value this close to zero tells us that these attributes have practically no linear correlation with each other. Let’s visualize this correlation using a scatterplot and confirm our calculation.
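The post does not show how this coefficient was computed. One common way is base R's cor() function, which returns the Pearson correlation by default. The sketch below assumes that approach and uses randomly generated, hypothetical data rather than the real attributes A and B:

```r
# Min-max normalization, same definition as in Step 1
normalize_function <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical data: two unrelated random columns
set.seed(42)
toy <- data.frame(A = runif(100, 0, 50), B = rnorm(100, 10, 3))
N_toy <- as.data.frame(lapply(toy, normalize_function))

# Pearson correlation between the two attributes
r_original   <- cor(toy$A, toy$B)
r_normalized <- cor(N_toy$A, N_toy$B)
print(c(original = r_original, normalized = r_normalized))

# Correlation matrix for all pairs of columns at once
print(cor(N_toy))
```

Min-max normalization only shifts and rescales each variable by a positive factor, so the Pearson correlation is the same before and after normalizing.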

IV. Visualizing correlation



> coef(lm(N_data$A ~ N_data$B))
(Intercept)    N_data$B 
  0.4914188  -0.0308245 
> ## A is regressed on B, so B goes on the x-axis and A on the y-axis
> ggplot(data = N_data, aes(x = B, y = A)) + geom_point() + geom_abline(intercept = 0.4914188, slope = -0.0308245)
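Rather than copying the coefficients by hand, they can be read from the fitted model. The sketch below again uses hypothetical random data in place of N_data and also checks the link between the slope and the correlation coefficient. For a simple linear regression of A on B, the slope equals r * sd(A) / sd(B):

```r
# Hypothetical data, already in the range 0 to 1
set.seed(1)
toy <- data.frame(A = runif(50), B = runif(50))

# Regress A on B: A = intercept + slope * B
fit <- lm(A ~ B, data = toy)
intercept <- coef(fit)[["(Intercept)"]]
slope     <- coef(fit)[["B"]]

# The slope of a simple linear regression follows from the correlation
slope_from_r <- cor(toy$A, toy$B) * sd(toy$A) / sd(toy$B)
print(c(intercept = intercept, slope = slope, slope_from_r = slope_from_r))
```

These values could then be passed to geom_abline(intercept = intercept, slope = slope) in a plot that has B on the x-axis and A on the y-axis. When both variables have a similar spread, the slope is close to the correlation coefficient itself.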


The scatterplot clearly shows us that the points are randomly distributed over the plot, indicating a negligible correlation between the attributes.




