More visualizations

In this blog post, I will introduce a few more visualizations (graphs) that help us understand the data better. These are:


  1. Pie chart
  2. Waffle plot
  3. Kernel density chart

We will gain more insight into the data by examining:
  1. The quality of the data
  2. The distribution of its attributes

Let's begin by visualizing the distribution of two attributes from our dataset. Find the dataset to work on here: M01_quasi_twitter.csv

The first step is to load the dataset into an appropriately named variable. Next, view the dataset to get a feel for it: how many rows and columns it has, what the attributes are, whether each attribute is categorical, continuous, or discrete, what kind of information is available for analysis, and how this data can be useful to us.
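A minimal sketch of this first step, assuming the CSV has been downloaded into the working directory. The variable name twitter_data matches the plots that follow; the fallback rows are made-up stand-ins (not real data) so that the sketch still runs when the file is absent.

```r
# Assumption: the CSV sits in the working directory; adjust the path if not.
csv_path <- "M01_quasi_twitter.csv"

if (!file.exists(csv_path)) {
  # Hypothetical stand-in rows (made-up values), used only when the file is missing.
  csv_path <- tempfile(fileext = ".csv")
  write.csv(
    data.frame(friends_count = c(10, 250, 4000),
               created_at_year = c(2009, 2011, 2013)),
    csv_path, row.names = FALSE
  )
}

twitter_data <- read.csv(csv_path)

dim(twitter_data)      # number of rows and columns
str(twitter_data)      # column names and types
head(twitter_data)     # first few rows
summary(twitter_data)  # per-column summary statistics
```

str() is usually the quickest way to see which attributes are numeric and which are text or categorical.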

After viewing the dataset, we will plot the density graph of a specific attribute, 'friends_count'. We will load libraries as we need them in the analysis. For a more modern and appealing density plot, we load the 'ggplot2' library.

Let's find the data distribution of the friends_count variable.

library(ggplot2)
# Histogram on a density scale with a kernel density overlay (log10 x axis)
ggplot(twitter_data, aes(x = friends_count)) +
  geom_histogram(aes(y = after_stat(density)), color = "black", fill = "white") +
  scale_x_log10() +
  geom_density(alpha = .2, fill = "#FF6666")




As can be seen from the density plot, the friends_count variable looks roughly normally distributed on the log10 scale used for the x axis, with the density peaking at around 1000 friends. From the summary statistics, the mean friends count is around 1058 whereas the median is 324. A mean this far above the median indicates that the raw (untransformed) counts are right-skewed.
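The gap between the mean and the median is what one would expect from a right-skewed variable. A small sketch on simulated data (not the Twitter dataset; the log-normal parameters are arbitrary assumptions chosen only for illustration) shows the same pattern: a few very large values pull the mean far above the median, while the log-transformed values look roughly symmetric.

```r
set.seed(1)  # for reproducibility

# Simulated right-skewed counts; meanlog and sdlog are illustrative assumptions.
simulated_counts <- rlnorm(10000, meanlog = 6, sdlog = 1.5)

mean(simulated_counts)    # pulled upward by the long right tail
median(simulated_counts)  # much closer to the bulk of the values

# On the log10 scale the same values are roughly symmetric,
# so their mean and median nearly coincide.
log_counts <- log10(simulated_counts)
mean(log_counts)
median(log_counts)
```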


Now, let's analyze the quality of data.

Data quality: here we judge the quality of the data by whether all of its values are present. R has a built-in function for this check. is.na() returns TRUE or FALSE for each element, indicating whether that value is missing. Summing the result gives the number of missing values in the dataset, which helps us judge the quality of our data.
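A minimal sketch of this check on a toy vector (the values are hypothetical and chosen to include two missing entries; they are not taken from the dataset):

```r
# Hypothetical values with two missing entries (not taken from the dataset).
friends <- c(120, NA, 324, 1058, NA)

missing_flags <- is.na(friends)  # TRUE where a value is missing
missing_flags

n_missing <- sum(missing_flags)  # TRUE counts as 1, FALSE as 0
n_missing

# Share of missing values, as a quick quality indicator.
mean(missing_flags)
```

For the real data, the same check would be sum(is.na(twitter_data$friends_count)), and colSums(is.na(twitter_data)) gives the count for every column at once.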


From the analysis, we can conclude that there are no missing values in our variable friends_count.

Now, let's look at a new type of chart: the percentage pie chart.


The percentage pie chart is a pretty fascinating graph that makes it easy to see what proportion of the whole circle each component takes up. It is easy to relate to, because it reminds us of food (pie), and who doesn't want to relate work to life!

# Pie chart with percentages, without ggplot
slices <- c(650, 1000, 900, 300, 14900)
lbls <- c("UK", "Canada", "India", "Australia", "US")
pct <- round(slices / sum(slices) * 100)  # each slice's share, in whole percent
lbls <- paste0(lbls, " ", pct, "%")       # add percentages to the labels
pie(slices, labels = lbls, col = rainbow(length(lbls)),
    main = "Pie Chart of Countries")
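One detail worth checking in the labels above: because each percentage is rounded separately, the labels need not add up to exactly 100. A quick check using the same slices values (the names exact_pct, pct_whole and pct_one_decimal exist only in this sketch):

```r
slices <- c(650, 1000, 900, 300, 14900)
exact_pct <- slices / sum(slices) * 100

pct_whole <- round(exact_pct)           # whole-number labels, as in the chart
sum(pct_whole)                          # 101, not 100

pct_one_decimal <- round(exact_pct, 1)  # keep one decimal place instead
sum(pct_one_decimal)                    # 100, up to floating-point error
```

Keeping one decimal place in the labels is one simple way to avoid the mismatch here.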





Let's plot a fancier version of the pie chart: the 3D pie chart.


# 3D pie chart (uses pie3D() from the plotrix package)
library(plotrix)
slices <- c(650, 1000, 900, 300, 14900)
lbls <- c("UK", "Canada", "India", "Australia", "US")
pct <- round(slices / sum(slices) * 100)  # each slice's share, in whole percent
lbls <- paste0(lbls, " ", pct, "%")       # add percentages to the labels
pie3D(slices, labels = lbls,
      main = "Pie Chart of Countries in 3D")


Next, let's plot a waffle plot.

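## Waffle plot
library(waffle)
library(ggthemes)
par(mfrow = c(1, 1))  # reset the base-graphics layout to a single panel

# One square per unit; these are roughly the pie-chart values divided by 100
vals <- c(6, 10, 9, 3, 149)
# Label each country with its share of the total
val_names <- sprintf("%s (%s)",
                     c("UK", "Canada", "India", "Australia", "US"),
                     scales::percent(round(vals / sum(vals), 2)))
names(vals) <- val_names

waffle::waffle(vals) +
  ggthemes::scale_fill_tableau(name = NULL)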


The next important graph for gaining insight into our dataset is the kernel density plot.

# Histogram on a density scale with a kernel density overlay (log10 x axis)
ggplot(twitter_data, aes(x = created_at_year)) +
  geom_histogram(aes(y = after_stat(density)), color = "black", fill = "white") +
  scale_x_log10() +
  geom_density(alpha = .2, fill = "#FF6666")






The kernel density plot of the created_at_year variable gives us a pretty wavy distribution. The observations are not concentrated at one particular year or range of years; instead, the density rises and falls across the values. The density curve is a smoothed version of the histogram, which is why we see a wave rather than separate bars.
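How wavy the curve looks depends on how strongly the kernel density estimate smooths the data. The sketch below uses base R's density() on simulated integer years (made-up data, not the Twitter dataset; the year range and both bandwidths are assumptions for illustration) and counts the local peaks for a narrow and a wide bandwidth.

```r
set.seed(1)  # for reproducibility

# Simulated integer years; the range 2006-2015 is an illustrative assumption.
years <- sample(2006:2015, 5000, replace = TRUE)

# Count the local maxima of a density curve.
count_peaks <- function(d) sum(diff(sign(diff(d$y))) == -2)

narrow <- density(years, bw = 0.1)  # little smoothing: separate bumps at the years
wide   <- density(years, bw = 2)    # heavy smoothing: the bumps merge

count_peaks(narrow)
count_peaks(wide)

# Overlay both curves to compare them visually.
plot(narrow, main = "Effect of bandwidth on a kernel density estimate")
lines(wide, lty = 2)
```

In ggplot2, geom_density() offers comparable control through its bw and adjust arguments.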

With this, I end my analysis and visualizations. We see that analyzing a dataset becomes more and more comfortable once we plot more insightful and appealing graphs.
