Visualizing data

June 04, 2018

Data science is the art of making discoveries from data using statistical analysis and machine learning methods. Beyond describing what has already happened, it gives us enough information to predict the likely outcome of a given situation.

Before drawing conclusions from data, it is important to understand it: its structure, highs, lows and averages, distributions, correlations, and so on. Understanding data involves not only numbers, percentages, and proportions but also a very important component: visualization. This is where data visualization comes into the picture, making the data and its different components easier to interpret.

Find the dataset to work on here: forestfires.csv

Attribute Information:

X: x-axis spatial coordinate within the Montesinho park map: 1 to 9

Y: y-axis spatial coordinate within the Montesinho park map: 2 to 9

month: month of the year: 'jan' to 'dec'

day: day of the week: 'mon' to 'sun'

FFMC: FFMC index from the FWI system: 18.7 to 96.20
DMC: DMC index from the FWI system: 1.1 to 291.3
DC: DC index from the FWI system: 7.9 to 860.6
ISI: ISI index from the FWI system: 0.0 to 56.10
temp: temperature in Celsius degrees: 2.2 to 33.30

RH: relative humidity in %: 15.0 to 100

wind: wind speed in km/h: 0.40 to 9.40

rain: outside rain in mm/m2 : 0.0 to 6.4

area: the burned area of the forest (in ha): 0.00 to 1090.84

Step 1: Import libraries

library(ggplot2)
library(gridExtra)
library(car)

I. ggplot2 -

i. Implements the grammar of graphics. It breaks graphics down into their elements.

ii. These elements include the geometry of objects, scales and transformations, coordinate systems, and annotations.

iii. Building graphics from these elementary components gives greater precision and flexibility.

iv. Visualizations can be built and then modified incrementally, i.e. one bit at a time.

II. gridExtra - To arrange multiple ggplot2 graphs on the same plane

III. car - functions for applied regression, linear models, and generalized linear models and plotting scatterplot matrices.
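
The "one bit at a time" idea can be sketched as follows. This is only an illustration: the data frame `toy` and its values are made up and are not taken from forestfires.csv.

```r
library(ggplot2)

# Hypothetical toy data (illustrative values only, not from the dataset)
toy <- data.frame(
  temp = c(5, 12, 18, 22, 27, 31),
  RH   = c(80, 65, 55, 42, 33, 25)
)

# Start with the data and the aesthetic mappings only (draws empty axes)
p <- ggplot(toy, aes(x = temp, y = RH))

# Then add one component at a time
p <- p + geom_point()                                            # geometry: points
p <- p + geom_smooth(method = "lm", formula = y ~ x, se = FALSE) # linear trend line
p <- p + labs(x = "Temperature (Celsius)", y = "Relative humidity (%)")  # axis labels

print(p)
```

Because every step returns a plot object, any intermediate `p` can be printed to see what a single layer contributes.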

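Step 2: Load the file - forestfires.csv

# Adjust the path to your local copy; later steps assume capitalized column names (Temp, Area, Month, Wind)
dataset <- read.csv("/Users/csv files/Forest Fires Data.csv")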

Step 3: Compute summary statistics

# Summary statistics
summary(dataset)
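
Since R is case-sensitive and the later steps refer to columns by name, it can help to check names and types alongside the summary. A minimal sketch, using a made-up stand-in data frame `toy` rather than the real file:

```r
# Hypothetical stand-in for the loaded data (values are illustrative only)
toy <- data.frame(
  Month = c("aug", "sep", "mar", "aug"),
  Temp  = c(22.1, 18.4, 9.8, 25.3),
  Area  = c(0.00, 4.61, 0.00, 12.30)
)

names(toy)           # exact column names (case matters)
str(toy)             # type of each column
colSums(is.na(toy))  # number of missing values per column
summary(toy$Temp)    # min, quartiles, mean and max of one column
```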

Step 4: Relation of attributes with the dependent variable (Area)

# Plot area vs. temp, area vs. month, area vs. DC and area vs. RH (all months combined) in one 2 x 2 grid.
# log(1 + Area) compresses the wide range of burned areas (0 to 1090.84 ha).

p1 <- ggplot(dataset, aes(x = Temp, y = log(1 + Area), colour = Temp)) + geom_point()
p2 <- ggplot(dataset, aes(x = Month, y = log(1 + Area), colour = Month)) + geom_point()
p3 <- ggplot(dataset, aes(x = DC, y = log(1 + Area), colour = DC)) + geom_point()
p4 <- ggplot(dataset, aes(x = RH, y = log(1 + Area), colour = RH)) + geom_point()

grid.arrange(p1, p2, p3, p4, ncol = 2)

i. As can be seen from the graph, the burned area shows no clear relationship with temperature.

ii. More area burned during August and September than in the other months.

iii. Drought code (DC) has very little impact on the burned area.

iv. However, as relative humidity (RH) increases, the burned area decreases.
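
Readings like these can be cross-checked with a few numbers. The sketch below assumes columns named Month, RH and Area as in the plotting code; the values in `toy` are invented for illustration and say nothing about the real dataset.

```r
# Hypothetical toy data (illustrative values only)
toy <- data.frame(
  Month = c("aug", "aug", "sep", "sep", "mar", "mar"),
  RH    = c(30, 45, 40, 55, 70, 85),
  Area  = c(10.5, 3.2, 6.4, 0.0, 0.5, 0.0)
)
toy$logArea <- log(1 + toy$Area)   # same transform as in the plots

# Mean transformed area per month
by_month <- aggregate(logArea ~ Month, data = toy, FUN = mean)
by_month

# Rank correlation between RH and transformed area
rho <- cor(toy$RH, toy$logArea, method = "spearman")
rho
```

A negative `rho` would agree with point iv; a value near zero would suggest the visual trend is weak.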

Step 5: Histogram plot of wind speed

# Histogram of wind speed (km/h) with a density curve overlaid.
# after_stat(density) puts the histogram on the same scale as the density curve.
ggplot(dataset, aes(x = Wind)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30) +
  geom_density(colour = "blue")

# Density curve on its own
ggplot(dataset, aes(x = Wind)) + geom_density()

# The same histogram and density curve in base R
hist(dataset$Wind, freq = FALSE)
lines(density(dataset$Wind), col = "blue", lwd = 2)

From the plot, we can say that most of the recorded wind speeds fall between roughly 1.25 and 3.75 km/h.
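
A statement about where most values lie can be checked by computing the share of observations inside the interval. A minimal sketch with made-up wind speeds (the vector `wind` is hypothetical):

```r
# Illustrative wind speeds in km/h (made-up values)
wind <- c(0.9, 1.8, 2.2, 2.7, 3.1, 3.6, 4.0, 4.9, 5.4, 8.5)

# TRUE for every observation inside the interval of interest
in_range <- wind >= 1.25 & wind <= 3.75

# The mean of a logical vector is the proportion of TRUE values
share <- mean(in_range)
share
```

With the real data, `dataset$Wind` would take the place of `wind`.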

Step 6: Density plot of months

# Density plot of months
# qplot() draws quick plots without building up a full ggplot() call
# (it is deprecated as of ggplot2 3.4.0)

qplot(Month, data = dataset, geom = "density", colour = Month)

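From the plot, the tallest density peaks appear for November, April, August and then January and May. Note that each density curve is normalised to integrate to 1, so its height does not show how many fires occurred; the number of fires per month is better read from a count of rows per month.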
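
To compare how many fires were recorded in each month, counting rows is more direct than a density curve. A minimal sketch in base R, with a made-up `month` vector standing in for the Month column:

```r
# Illustrative month labels (made-up values)
month <- c("aug", "sep", "aug", "mar", "sep", "aug", "jul")

# Order the levels by calendar month instead of alphabetically
month <- factor(month, levels = tolower(month.abb))

# Number of records per month
counts <- table(month)
counts[counts > 0]

# Bar chart of the counts
barplot(counts, xlab = "Month", ylab = "Number of fires")
```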

Step 7: scatterplot matrix

# Scatterplot matrix for temp, RH, DC and DMC (columns 9, 10, 7 and 6),
# used to judge the correlation among these variables.

pairs(dataset[, c(9, 10, 7, 6)])


# Scatterplot matrix with the car package (loaded in Step 1)

scatterplotMatrix(~ Temp + RH + DC + DMC, data = dataset)

From the plot, we can visualize the correlation between temperature, relative humidity (RH), drought code (DC) and duff moisture code (DMC).

Interpretation:

i. As relative humidity increases, temperature decreases roughly linearly (a clear negative correlation).

ii. Temperature increases roughly linearly with drought code (see the green line).

iii. Relative humidity shows no correlation with DC or DMC: RH stays roughly constant as DC and DMC increase.

iv. DC and DMC are strongly and positively correlated with each other.
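
The visual impression from a scatterplot matrix can be backed by a correlation matrix. A minimal sketch, assuming the column names used above; the values in `toy` are invented and do not come from the dataset.

```r
# Hypothetical toy data (illustrative values only)
toy <- data.frame(
  Temp = c(8, 12, 15, 19, 23, 28),
  RH   = c(78, 70, 61, 50, 41, 30),
  DC   = c(100, 250, 300, 520, 610, 700),
  DMC  = c(20, 60, 75, 130, 160, 210)
)

# Pearson correlation between every pair of columns
r <- cor(toy)
round(r, 2)
```

Entries close to 1 or -1 indicate a strong linear relationship, and entries near 0 indicate little or none.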

Step 8: Boxplot

# Boxplots for wind, ISI and DC, to check for anomalies/outliers.
p10 <- ggplot(dataset, aes(x = factor(0), y = Wind)) + geom_boxplot() + xlab(NULL)
p11 <- ggplot(dataset, aes(x = factor(0), y = ISI)) + geom_boxplot() + xlab(NULL)
p12 <- ggplot(dataset, aes(x = factor(0), y = DC)) + geom_boxplot() + xlab(NULL)

grid.arrange(p10, p11, p12, ncol = 3)

# Base R boxplot for a single variable
boxplot(dataset$Wind)

A box plot displays the range and distribution of data along a number line. The line that divides the box into two parts represents the median of the data. The ends of the box show the upper and lower quartiles. The extreme lines (whiskers) show the highest and lowest values excluding outliers.

i. From the plots of wind, ISI, and DC we can say that wind has 3 outliers above 8 km/h.

ii. ISI has around 8 outliers at and above 20, and one outlier near 0.

iii. DC has 3 outliers below the lower whisker, near 0.
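
The points a box plot draws as outliers can also be listed directly. A minimal sketch using base R's `boxplot.stats()`, which by default flags values more than 1.5 times the box length beyond the hinges; the `wind` vector is made up for illustration.

```r
# Illustrative wind speeds in km/h (made-up values)
wind <- c(1.8, 2.2, 2.7, 3.1, 3.6, 4.0, 4.5, 4.9, 5.4, 9.4)

stats <- boxplot.stats(wind)

stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out     # values drawn as individual outlier points
```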

Step 9: Log of attributes (DMC)

# Histograms of DMC and log(DMC), side by side

g1 <- ggplot(dataset, aes(x = DMC)) + geom_histogram()
g2 <- ggplot(dataset, aes(x = log(DMC))) + geom_histogram()

grid.arrange(g1, g2, ncol = 2)

By taking the log, a wide range of DMC values can be represented within a much smaller range. The distribution also becomes easier to explain.

Initially, DMC varied from 0-300 on the x-axis and the count from 0-50.

After taking the logarithm, the variation is compressed to 0-7.5 on the x-axis and the count has increased to a range of 0-150.

The small ups and downs (increments and decrements) are no longer visible after taking the logarithm of DMC. Only the major changes remain, which makes the distribution easier to interpret.
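
The compression described above can be seen by comparing ranges before and after the transform. A minimal sketch: the two end points of `dmc` are the minimum and maximum listed in the attribute information, and the values in between are made up.

```r
# End points from the attribute list (1.1 and 291.3); middle values are made up
dmc <- c(1.1, 5, 20, 80, 150, 291.3)

range(dmc)        # spread on the original scale
range(log(dmc))   # spread after the natural log transform

# The log keeps the ordering of the values but shrinks large gaps
diff(dmc)
diff(log(dmc))
```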
