Normalisation and correlations
1. Normalizing dataset
2. Visualizing normalization
3. Determining correlation between variables
4. Visualizing correlation
Find the dataset to work on here: raw_data.csv
I. Normalizing dataset
The goal is to get all input variables into roughly
the same range, ideally −1 ≤ x ≤ 1 or −0.5 ≤ x ≤ 0.5. Normalizing
a variable here means subtracting its minimum value and then dividing by its range
(i.e. the maximum value minus the minimum value), so that the new values span a range
of just 1, from 0 to 1.
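As a quick worked example of that arithmetic, here is a minimal sketch on a made-up vector (the numbers are illustrative and do not come from raw_data.csv):

```r
# Toy vector, invented for illustration only
x <- c(2, 4, 10)

# Subtract the minimum (2), then divide by the range (10 - 2 = 8)
x_scaled <- (x - min(x)) / (max(x) - min(x))

print(x_scaled)         # 0.00 0.25 1.00
print(range(x_scaled))  # the smallest value maps to 0, the largest to 1
```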
For normalizing to a range of 1, perform the following steps:
Step 1: Create a normalizing function –
## Normalization
normalize_function <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
Step 2: Apply the function to the desired variables or to the entire dataset.
# Remember: only continuous variables/attributes need to be normalized; there is no need to normalize discrete/categorical values.
## insurance is the variable into which the original dataset was loaded
N_data <- as.data.frame(lapply(insurance, normalize_function))
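If the data frame also contains categorical columns, the function should only touch the numeric ones. The following is a minimal sketch of one way to do that; the data frame `toy` and its column names are hypothetical and are not the columns of the real dataset.

```r
# Hypothetical mixed data frame; column names and values are illustrative only
toy <- data.frame(
  age    = c(19, 33, 46, 60),
  charge = c(1200, 4500, 8100, 3000),
  region = c("north", "south", "east", "west")
)

normalize_function <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Logical vector marking which columns are numeric
numeric_cols <- sapply(toy, is.numeric)

# Rescale only the numeric columns; categorical columns are left untouched
toy_scaled <- toy
toy_scaled[numeric_cols] <- lapply(toy[numeric_cols], normalize_function)

print(toy_scaled)
```

Here `region` passes through unchanged, while `age` and `charge` each end up between 0 and 1.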
II. Visualizing normalization
The effects of normalizing the dataset can be seen by plotting a boxplot.
But first, let’s look at the summary statistics of our dataset.
Take a look at the minimum and maximum of each attribute of both original and normalized dataset. The values are scaled to a range of 0 to 1 for the normalized dataset.
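One way to produce such a comparison is with `summary()`. The sketch below uses a synthetic stand-in data frame (`toy`, with random columns A and B) rather than the real dataset, so the printed numbers will not match the ones discussed here; only the pattern of minimum 0 and maximum 1 carries over.

```r
# Synthetic stand-in for the dataset; A and B are random, not the real attributes
set.seed(1)
toy <- data.frame(
  A = rnorm(50, mean = 10, sd = 5),
  B = runif(50, min = -5, max = 20)
)

normalize_function <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
toy_scaled <- as.data.frame(lapply(toy, normalize_function))

# Min. and Max. differ from column to column in the original data
print(summary(toy))

# After normalization, Min. is 0 and Max. is 1 for every column
print(summary(toy_scaled))

# The same check as a compact matrix: row 1 = minimum, row 2 = maximum
ranges <- sapply(toy_scaled, range)
print(ranges)
```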
Now let’s compare these values by visualizing boxplots for both datasets.
library(ggplot2)
library(gridExtra)
library(reshape2)

## ggplot2 requires data in a specific format.
# Here, you need x= and y=, where y will be the values and x will be the corresponding column ids.
# Use melt() from the reshape2 package to get the data into this format and then plot.
p1 <- ggplot(data = melt(insurance), aes(x = variable, y = value)) +
  geom_boxplot(aes(fill = variable)) +
  ggtitle("Boxplot of original data")
p2 <- ggplot(data = melt(N_data), aes(x = variable, y = value)) +
  geom_boxplot(aes(fill = variable)) +
  ggtitle("Boxplot of normalized data")
grid.arrange(p1, p2, ncol = 2)
For the original dataset, the values in our boxplot range from about -5 to +20, whereas for the normalized data every attribute lies between 0 and 1. Isn’t it simpler to compare the attributes now?
III. Determining correlation between variables
In statistics, correlation is defined as the interdependence of variable quantities. By finding the correlation between attributes, we can determine which attributes vary with each other linearly, either positively or negatively.
For reference, the correlation coefficient lies between -1 and +1. As its absolute value approaches or exceeds 0.5, the attributes are more and more strongly correlated with each other, and the sign of the value tells us whether the correlation is positive or negative.
For our dataset, we see that the value of correlation for attributes A and B is -0.03059086. Numerically, this is so close to zero that we can say these attributes are essentially not linearly correlated with each other. Let’s visualize this correlation using a scatterplot and confirm our calculation.
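A value like this can be computed with base R's `cor()`, which uses the Pearson coefficient by default. The sketch below is an assumption about how the number could be obtained, not the original command. It runs on synthetic, independent columns (a hypothetical `toy` data frame), so it will not reproduce -0.03059086. It also shows that min-max normalization leaves the correlation unchanged.

```r
# Synthetic stand-in: two independent random columns, not the real attributes
set.seed(42)
toy <- data.frame(A = runif(100), B = runif(100, min = -5, max = 20))

# Pearson correlation between two columns (the default method of cor())
r <- cor(toy$A, toy$B)
print(r)

# Correlation matrix for all columns at once
print(cor(toy))

# Min-max normalization is a positive linear rescaling,
# so the Pearson correlation is the same before and after it
normalize_function <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
r_scaled <- cor(normalize_function(toy$A), normalize_function(toy$B))
print(all.equal(r, r_scaled))
```

For the real data, the corresponding call would be `cor(N_data$A, N_data$B)`.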
IV. Visualizing correlation
> coef(lm(N_data$A ~ N_data$B))
(Intercept)    N_data$B
  0.4914188  -0.0308245
> ## lm(A ~ B) models A as a function of B, so B goes on the x-axis and A on the y-axis
> ggplot(data = N_data, aes(x = B, y = A)) +
+   geom_point() +
+   geom_abline(intercept = 0.4914188, slope = -0.0308245)
The scatterplot clearly shows us that the points are randomly distributed over the plot, indicating a negligible correlation between the attributes.