In this section, you will learn about the types of data in R and how to manipulate data, including importing, exporting, and reshaping it.
There are several ways to find the included datasets in R:
Use data() to list the datasets of all loaded packages (not only those from the datasets package); the datasets are ordered by package.
Use data(package = .packages(all.available = TRUE)) to list all datasets in the packages installed on your computer, including those that are not loaded.
Use data(package = "packagename") to list the datasets in a given package; for example, data(package = "plyr") lists the datasets in the plyr package.
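As a small illustration of the last form, the sketch below lists the datasets of the built-in datasets package (chosen here only because it ships with every R installation) and then loads one of them. The variable name `available` is an assumption made for this example.

```r
# List the datasets that ship with the built-in "datasets" package.
available <- data(package = "datasets")

# The returned object keeps one row per dataset in its `results` matrix.
print(head(available$results[, c("Item", "Title")]))

# Load one of the listed datasets into the workspace and look at it.
data("mtcars")
print(head(mtcars))
```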
In R you can read data from files stored outside the R environment. You can also write data to files that are stored on disk and can be accessed outside R. R can read and write various file formats such as CSV, Excel, and XML.
By default, R looks for files in the current working directory, so a file referenced by name alone should be present there. You can also set your own directory and read files from there. Use the getwd() function to check the current working directory and the setwd() function to set a new one.
[1] "/home/tank/Desktop/ecodatasci/_datexpl/2023-10-15-datamaniplation"
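A path like the one above is what getwd() prints. The following minimal sketch shows both functions together; it switches to a temporary directory only as a stand-in for a folder of your own, and the name `original_dir` is an assumption.

```r
# Remember the current working directory so it can be restored later.
original_dir <- getwd()
print(original_dir)

# Switch to a temporary directory (a stand-in for your own data folder).
setwd(tempdir())
print(getwd())

# Restore the original working directory.
setwd(original_dir)
```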
A CSV file is a text file in which the values in each row are separated by commas. You can use the read.csv() function to read it into R.
Microsoft Excel is the most widely used spreadsheet program and stores data in the .xls or .xlsx format. R can read these files directly using specific packages, such as the xlsx package.
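To make the import and export steps concrete, here is a minimal round trip: a small made-up data frame is written with write.csv() and read back with read.csv(). The file name, the temporary directory, and the data values are assumptions for illustration only.

```r
# A small, made-up data frame to round-trip through a CSV file.
employees <- data.frame(
  id = 1:3,
  name = c("Rick", "Dan", "Michelle"),
  salary = c(623.30, 515.20, 611.00),
  stringsAsFactors = FALSE
)

# Export: row.names = FALSE avoids writing an extra index column.
csv_path <- file.path(tempdir(), "employees.csv")
write.csv(employees, csv_path, row.names = FALSE)

# Import: the header row becomes the column names of the data frame.
imported <- read.csv(csv_path, stringsAsFactors = FALSE)
print(imported)

# A few basic checks on the imported data.
print(is.data.frame(imported))
print(nrow(imported))
print(max(imported$salary))
```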
[1] TRUE
Many websites provide data. For example, the WHO provides reports on health and medical information in the form of CSV, TXT, and XML files. Using R, you can programmatically extract data from such websites. R packages such as "RCurl", "XML", and "stringr" are used to connect to the URLs, identify the required links, and download the data into the R environment.
For example, you can visit the weather data URL and use R to download the CSV files for the year 2015.
Data in relational database systems is stored in a normalized format. So, to carry out statistical computing, you would need very advanced and complex SQL queries. But R can connect easily to many relational databases such as MySQL, Oracle, and SQL Server, and fetch records from them as a data frame. Once the data is available in the R environment, it becomes a normal R data set and can be manipulated or analyzed using packages and functions. Below, MySQL is used as the reference database for connecting to R.
Once the RMySQL package is installed, you create a connection object in R to connect to the database. It takes the username, password, database name, and host name as input.
You can query the database tables in MySQL using the dbSendQuery() function. The query executes in MySQL and the result is returned using the fetch() function. Finally, it is stored as a data frame in R.
You can pass any valid SELECT query to get the result.
You can update the rows in a MySQL table by passing the UPDATE query to the dbSendQuery() function.
You can create tables in MySQL using the dbWriteTable() function, which takes a data frame as input. With overwrite = TRUE, it replaces the table if it already exists.
You can drop tables in a MySQL database by passing the DROP TABLE statement to dbSendQuery(), in the same way as for querying data from tables.
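The database steps above can be sketched in one helper function. This is only an outline under several assumptions: the DBI and RMySQL packages are installed, a MySQL server is reachable, and the function name, table name, and credentials are placeholders. It uses the DBI convenience functions dbGetQuery() and dbExecute() in place of the dbSendQuery() and fetch() pair described in the text.

```r
# Sketch only: needs the DBI and RMySQL packages and a reachable MySQL server.
# The function name, table name, and connection details are placeholders.
run_mysql_demo <- function(user, password, dbname, host = "localhost") {
  con <- DBI::dbConnect(RMySQL::MySQL(), user = user, password = password,
                        dbname = dbname, host = host)
  # Always close the connection, even if a statement fails.
  on.exit(DBI::dbDisconnect(con), add = TRUE)

  # Create (or replace) a table from a data frame.
  DBI::dbWriteTable(con, "mtcars_demo", mtcars[1:10, ], overwrite = TRUE)

  # Update rows by sending an UPDATE statement.
  DBI::dbExecute(con, "UPDATE mtcars_demo SET disp = 168.5 WHERE hp = 110")

  # Run a SELECT query; the result comes back as a data frame.
  four_cyl <- DBI::dbGetQuery(con, "SELECT * FROM mtcars_demo WHERE cyl = 4")

  # Drop the table again.
  DBI::dbExecute(con, "DROP TABLE IF EXISTS mtcars_demo")

  four_cyl
}

# Example call (not run here because it needs real credentials):
# result <- run_mysql_demo("user", "password", "mydb")
```

Wrapping the steps in a function keeps the connection handling in one place, so the connection is closed by on.exit() whichever statement fails.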
Variables can hold information of various data types, such as character, integer, floating point, and Boolean. Based on the data type of a variable, the operating system allocates memory and decides what can be stored. There are many types of R-objects. The frequently used ones are vectors, lists, matrices, arrays, factors, and data frames.
In R, the most basic data objects are vectors. To create a vector with more than one element, use the c() function, which combines the elements into a vector.
Two vectors of the same length can be added, subtracted, multiplied, or divided, giving the result as a vector.
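The four results printed below are consistent with the following pair of vectors. The variable names are assumptions, since the original code is not shown.

```r
# Two numeric vectors of the same length.
v1 <- c(3, 8, 4, 5, 0, 11)
v2 <- c(4, 11, 0, 8, 1, 2)

# Element-wise arithmetic: each result is again a vector of length 6.
add.result <- v1 + v2
sub.result <- v1 - v2
multi.result <- v1 * v2
divi.result <- v1 / v2  # dividing by zero gives Inf rather than an error

print(add.result)
print(sub.result)
print(multi.result)
print(divi.result)
```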
[1] 7 19 4 13 1 13
[1] -1 -3 4 -3 -1 9
[1] 12 88 0 40 0 22
[1] 0.7500000 0.7272727 Inf 0.6250000 0.0000000 5.5000000
Elements in a vector can be sorted using the sort() function.
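A short sketch of both sort directions follows. The input order of the vector is an assumption; only the sorted results are shown in the output below.

```r
# An unsorted numeric vector (the original ordering is assumed).
v <- c(3, 8, 4, 5, 0, 11, -9, 304)

# Ascending order is the default.
sort.result <- sort(v)
print(sort.result)

# Use decreasing = TRUE for descending order.
revsort.result <- sort(v, decreasing = TRUE)
print(revsort.result)
```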
[1] -9 0 3 4 5 8 11 304
[1] 304 11 8 5 4 3 0 -9
A list is an R-object that can contain many different types of elements, such as vectors, functions, and even other lists.
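A list like the one printed below can be built as follows. The variable name `list1` is an assumption.

```r
# A list holding a numeric vector, a single number, and a function.
list1 <- list(c(2, 5, 3), 21.3, sin)
print(list1)

# Elements are accessed with double brackets; the third one is callable.
print(list1[[1]])
print(list1[[3]](0))
```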
[[1]]
[1] 2 5 3
[[2]]
[1] 21.3
[[3]]
function (x) .Primitive("sin")
A list can be converted to a vector so that its elements can be used for further manipulation. All the arithmetic operations on vectors can be applied once the list has been converted. To do this conversion, use the unlist() function, which takes the list as input and produces a vector.
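The output below matches this sequence of steps: two single-element lists are printed, converted with unlist(), and then added. The variable names are assumptions.

```r
# Two lists, each wrapping one integer sequence.
list1 <- list(1:5)
list2 <- list(10:14)
print(list1)
print(list2)

# Convert the lists to plain vectors.
v1 <- unlist(list1)
v2 <- unlist(list2)
print(v1)
print(v2)

# Vector arithmetic now works element by element.
result <- v1 + v2
print(result)
```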
[[1]]
[1] 1 2 3 4 5
[[1]]
[1] 10 11 12 13 14
[1] 1 2 3 4 5
[1] 10 11 12 13 14
[1] 11 13 15 17 19
A matrix is a two-dimensional rectangular R-object. It can be created using a vector input to the matrix function.
# Create a matrix.
M = matrix(c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
[,1] [,2] [,3]
[1,] "a" "a" "b"
[2,] "c" "b" "a"
Create a matrix taking a vector of numbers as input.
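The three matrices printed below correspond to filling by row, filling by column, and adding row and column names. The variable names in this sketch are assumptions.

```r
# Fill the matrix row by row.
M <- matrix(3:14, nrow = 4, byrow = TRUE)
print(M)

# Fill the matrix column by column (the default).
N <- matrix(3:14, nrow = 4, byrow = FALSE)
print(N)

# Attach row and column names through the dimnames argument.
rownames <- c("row1", "row2", "row3", "row4")
colnames <- c("col1", "col2", "col3")
P <- matrix(3:14, nrow = 4, byrow = TRUE, dimnames = list(rownames, colnames))
print(P)

# Elements can be accessed by position or by name.
print(P[2, 3])
print(P["row4", "col1"])
```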
[,1] [,2] [,3]
[1,] 3 4 5
[2,] 6 7 8
[3,] 9 10 11
[4,] 12 13 14
[,1] [,2] [,3]
[1,] 3 7 11
[2,] 4 8 12
[3,] 5 9 13
[4,] 6 10 14
col1 col2 col3
row1 3 4 5
row2 6 7 8
row3 9 10 11
row4 12 13 14
Various mathematical operations can be performed on matrices using the R operators. The matrices involved must have the same dimensions, and the result of the operation is also a matrix.
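The two input matrices and the addition and subtraction results shown below can be reproduced like this. The variable names are assumptions.

```r
# Two 2x3 matrices, filled column by column.
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)
print(matrix1)
print(matrix2)

# Element-wise addition.
add.result <- matrix1 + matrix2
cat("Result of addition", "\n")
print(add.result)

# Element-wise subtraction.
sub.result <- matrix1 - matrix2
cat("Result of subtraction", "\n")
print(sub.result)
```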
[,1] [,2] [,3]
[1,] 3 -1 2
[2,] 9 4 6
[,1] [,2] [,3]
[1,] 5 0 3
[2,] 2 9 4
Result of addition
[,1] [,2] [,3]
[1,] 8 -1 5
[2,] 11 13 10
Result of subtraction
[,1] [,2] [,3]
[1,] -2 -1 -1
[2,] 7 -5 2
While matrices are confined to two dimensions, arrays can have any number of dimensions. The array() function takes a dim argument that specifies the size of each dimension. Below is an example array made up of two 3x3 matrices.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2)) # 2 matrices each with 3 rows and 3 columns
print(a)
, , 1
[,1] [,2] [,3]
[1,] "green" "yellow" "green"
[2,] "yellow" "green" "yellow"
[3,] "green" "yellow" "green"
, , 2
[,1] [,2] [,3]
[1,] "yellow" "green" "yellow"
[2,] "green" "yellow" "green"
[3,] "yellow" "green" "yellow"
Arrays are the R-objects that can store data in more than two dimensions. For example, an array of dimension (2, 3, 4) consists of 4 rectangular matrices, each with 2 rows and 3 columns. Arrays can store only one data type.
An array is created using the array() function. It takes vectors as input and uses the values in the dim parameter to create an array.
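The array printed below can be built from two vectors of different lengths. The nine values are recycled to fill both 3x3 matrices. The variable names are assumptions.

```r
# Two input vectors; together they hold 9 values.
vector1 <- c(5, 9, 3)
vector2 <- c(10, 11, 12, 13, 14, 15)

# dim = c(3, 3, 2) asks for two 3x3 matrices; the 9 values are reused for the second one.
result <- array(c(vector1, vector2), dim = c(3, 3, 2))
print(result)

# Index with [row, column, matrix].
print(result[3, 3, 2])
```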
, , 1
[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15
, , 2
[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15
Factors are R-objects created from a vector. A factor stores the vector along with the distinct values of its elements as labels. The labels are always character, irrespective of whether the input vector is numeric, character, or Boolean. Factors are useful in statistical modeling.
Factors are created using the factor() function. The nlevels() function gives the count of levels.
# Create a vector
apple_colors <- c('green','green','yellow','red','red','red','green')
# Create a factor object
factor_apple <- factor(apple_colors)
# Print the factor
print(factor_apple)
# Print the number of levels
print(nlevels(factor_apple))
[1] green green yellow red red red green
Levels: green red yellow
[1] 3
Factors are the R-objects that are used to categorize data and store it as levels. They can store both strings and integers. They are useful for columns that have a limited number of unique values, such as "Male" and "Female" or TRUE and FALSE. They are useful in data analysis for statistical modeling.
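The output below follows this pattern: a character vector is checked with is.factor(), converted with factor(), and checked again. The variable names are assumptions.

```r
# A character vector with a small number of distinct values.
regions <- c("East", "West", "East", "North", "North", "East",
             "West", "West", "West", "East", "North")
print(regions)
print(is.factor(regions))

# Convert the vector to a factor; the distinct values become the levels.
factor_regions <- factor(regions)
print(factor_regions)
print(is.factor(factor_regions))

# Count how often each level occurs.
print(table(factor_regions))
```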
[1] "East" "West" "East" "North" "North" "East" "West" "West"
[9] "West" "East" "North"
[1] FALSE
[1] East West East North North East West West West East North
Levels: East North West
[1] TRUE
Data frames are tabular data objects. Unlike in a matrix, each column of a data frame can contain a different mode of data: the first column can be numeric while the second is character and the third is logical. A data frame is a list of vectors of equal length.
Data frames are created using the data.frame() function.
# Create the data frame.
BMI <- data.frame(
gender = c("Male", "Male","Female"),
height = c(152, 171.5, 165),
weight = c(81,93, 78),
Age = c(42,38,26)
)
print(BMI)
gender height weight Age
1 Male 152.0 81 42
2 Male 171.5 93 38
3 Female 165.0 78 26
A data frame is a table, or two-dimensional array-like structure, in which each column contains values of one variable and each row contains one set of values from each column. You can use the data.frame() function to create a data frame.
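The employee table printed below can be created as follows. The name `emp.data` is taken from the column headers that appear later in the output; the exact original call is an assumption.

```r
# Create the data frame; start_date is stored as a Date column.
emp.data <- data.frame(
  emp_id = 1:5,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
                         "2014-05-11", "2015-03-27")),
  stringsAsFactors = FALSE
)
print(emp.data)
```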
emp_id emp_name salary start_date
1 1 Rick 623.30 2012-01-01
2 2 Dan 515.20 2013-09-23
3 3 Michelle 611.00 2014-11-15
4 4 Ryan 729.00 2014-05-11
5 5 Gary 843.25 2015-03-27
The structure of the data frame can be seen by using the str() function.
'data.frame': 5 obs. of 4 variables:
$ emp_id : int 1 2 3 4 5
$ emp_name : chr "Rick" "Dan" "Michelle" "Ryan" ...
$ salary : num 623 515 611 729 843
$ start_date: Date, format: "2012-01-01" ...
The statistical summary and nature of the data can be obtained by applying the summary() function.
emp_id emp_name salary start_date
Min. :1 Length:5 Min. :515.2 Min. :2012-01-01
1st Qu.:2 Class :character 1st Qu.:611.0 1st Qu.:2013-09-23
Median :3 Mode :character Median :623.3 Median :2014-05-11
Mean :3 Mean :664.4 Mean :2014-01-14
3rd Qu.:4 3rd Qu.:729.0 3rd Qu.:2014-11-15
Max. :5 Max. :843.2 Max. :2015-03-27
Extract specific columns from a data frame using the column names.
emp.data.emp_name emp.data.salary
1 Rick 623.30
2 Dan 515.20
3 Michelle 611.00
4 Ryan 729.00
5 Gary 843.25
Extract the first two rows with all columns.
emp_id emp_name salary start_date
1 1 Rick 623.3 2012-01-01
2 2 Dan 515.2 2013-09-23
Extract the 3rd and 5th rows with the 2nd and 4th columns.
emp_name start_date
3 Michelle 2014-11-15
5 Gary 2015-03-27
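The inspection and extraction steps above can be reproduced with the sketch below. It rebuilds the employee table so that it runs on its own; the result variable names are assumptions.

```r
# Rebuild the employee table so this example is self-contained.
emp.data <- data.frame(
  emp_id = 1:5,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
                         "2014-05-11", "2015-03-27")),
  stringsAsFactors = FALSE
)

# Structure and statistical summary.
str(emp.data)
print(summary(emp.data))

# Specific columns, selected by name with the $ operator.
name_salary <- data.frame(emp.data$emp_name, emp.data$salary)
print(name_salary)

# First two rows, all columns (empty column index).
first_two <- emp.data[1:2, ]
print(first_two)

# 3rd and 5th rows with the 2nd and 4th columns.
picked <- emp.data[c(3, 5), c(2, 4)]
print(picked)
```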
A dataframe can be expanded by adding columns and rows.
emp_id emp_name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
To add more rows permanently to an existing data frame, you need to bring in the new rows in the same structure as the existing data frame and use the rbind() function.
emp_id emp_name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
6 6 Rasmi 578.00 2013-05-21 IT
7 7 Pranab 722.50 2013-07-30 Operations
8 8 Tusar 632.80 2014-06-17 Finance
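Both expansions shown above, adding a column and then appending rows, can be sketched as follows. The table is rebuilt here so the example runs on its own, and the names `emp.newdata` and `emp.finaldata` are assumptions.

```r
# Rebuild the employee table so this example is self-contained.
emp.data <- data.frame(
  emp_id = 1:5,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
                         "2014-05-11", "2015-03-27")),
  stringsAsFactors = FALSE
)

# Add a column by assigning a vector with one value per row.
emp.data$dept <- c("IT", "Operations", "IT", "HR", "Finance")
print(emp.data)

# New rows must have the same columns as the existing data frame.
emp.newdata <- data.frame(
  emp_id = 6:8,
  emp_name = c("Rasmi", "Pranab", "Tusar"),
  salary = c(578.0, 722.5, 632.8),
  start_date = as.Date(c("2013-05-21", "2013-07-30", "2014-06-17")),
  dept = c("IT", "Operations", "Finance"),
  stringsAsFactors = FALSE
)

# Append the new rows below the existing ones.
emp.finaldata <- rbind(emp.data, emp.newdata)
print(emp.finaldata)
```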
Data reshaping in R is about changing the way data is organized into rows and columns. Most of the time, data processing in R is done by taking the input data as a data frame. It is easy to extract data from the rows and columns of a data frame, but there are situations when you need the data frame in a format that is different from the format in which you received it. R has many functions to split, merge, and change rows to columns and vice versa in a data frame.
You can join multiple vectors column-wise using the cbind() function; the result is a matrix, which can be converted with as.data.frame(). You can also append the rows of one data frame to another using the rbind() function.
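The three tables printed below can be produced along these lines. The variable names are assumptions, and the explicit as.data.frame() conversion is a choice made for this sketch rather than something the original code is known to do.

```r
# Three vectors of equal length. The zip codes are numbers,
# so a leading zero cannot be kept (Hartford shows as 6161).
city <- c("Tampa", "Seattle", "Hartford", "Denver")
state <- c("FL", "WA", "CT", "CO")
zipcode <- c(33602, 98104, 6161, 80294)

# cbind() joins the vectors as columns of a character matrix.
addresses <- cbind(city, state, zipcode)
cat("# # # The first data frame\n")
print(addresses)

# A second table with the same column names.
new.address <- data.frame(
  city = c("Lowry", "Charlotte"),
  state = c("CO", "FL"),
  zipcode = c("80230", "33949"),
  stringsAsFactors = FALSE
)
cat("# # # The second data frame\n")
print(new.address)

# rbind() stacks the rows; the matrix is converted to a data frame first.
all.addresses <- rbind(as.data.frame(addresses), new.address)
cat("# # # The combined data frame\n")
print(all.addresses)
```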
# # # # The First data frame
city state zipcode
[1,] "Tampa" "FL" "33602"
[2,] "Seattle" "WA" "98104"
[3,] "Hartford" "CT" "6161"
[4,] "Denver" "CO" "80294"
# # # The Second data frame
city state zipcode
1 Lowry CO 80230
2 Charlotte FL 33949
# # # The combined data frame
city state zipcode
1 Tampa FL 33602
2 Seattle WA 98104
3 Hartford CT 6161
4 Denver CO 80294
5 Lowry CO 80230
6 Charlotte FL 33949
You can merge two data frames by using the merge() function. The data frames must share the column names on which the merge happens.
In the example below, consider the two data sets about diabetes in the MASS package. The two data sets are merged based on the values of blood pressure ("bp") and body mass index ("bmi"). When these two columns are chosen for merging, the records in which the values of both variables match in both data sets are combined to form a single data frame.
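A merge of this kind can be sketched as follows. The dataset names Pima.te and Pima.tr are an assumption: they are the diabetes tables in MASS whose columns match the output below, but the original call is not shown.

```r
library(MASS)

# Merge on the two shared columns; other columns that exist in both
# tables get the suffixes .x (first table) and .y (second table).
merged.Pima <- merge(x = Pima.te, y = Pima.tr, by = c("bp", "bmi"))

print(merged.Pima)
print(nrow(merged.Pima))
```

Only rows whose bp and bmi values occur in both tables are kept, which is the default inner-join behavior of merge().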
bp bmi npreg.x glu.x skin.x ped.x age.x type.x npreg.y glu.y
1 60 33.8 1 117 23 0.466 27 No 2 125
2 64 29.7 2 75 24 0.370 33 No 2 100
3 64 31.2 5 189 33 0.583 29 Yes 3 158
4 64 33.2 4 117 27 0.230 24 No 1 96
5 66 38.1 3 115 39 0.150 28 No 1 114
6 68 38.5 2 100 25 0.324 26 No 7 129
7 70 27.4 1 116 28 0.204 21 No 0 124
8 70 33.1 4 91 32 0.446 22 No 9 123
9 70 35.4 9 124 33 0.282 34 No 6 134
10 72 25.6 1 157 21 0.123 24 No 4 99
11 72 37.7 5 95 33 0.370 27 No 6 103
12 74 25.9 9 134 33 0.460 81 No 8 126
13 74 25.9 1 95 21 0.673 36 No 8 126
14 78 27.6 5 88 30 0.258 37 No 6 125
15 78 27.6 10 122 31 0.512 45 No 6 125
16 78 39.4 2 112 50 0.175 24 No 4 112
17 88 34.5 1 117 24 0.403 40 Yes 4 127
skin.y ped.y age.y type.y
1 20 0.088 31 No
2 23 0.368 21 No
3 13 0.295 24 No
4 27 0.289 21 No
5 36 0.289 21 No
6 49 0.439 43 Yes
7 20 0.254 36 Yes
8 44 0.374 40 No
9 23 0.542 29 Yes
10 17 0.294 28 No
11 32 0.324 55 No
12 38 0.162 39 No
13 38 0.162 39 No
14 31 0.565 49 Yes
15 31 0.565 49 Yes
16 40 0.236 38 No
17 11 0.598 28 No
[1] 17
One of the most interesting aspects of R is changing the shape of data in multiple steps to get a desired shape. The functions used to do this are melt() and cast(), from the reshape package. Consider the dataset called ships in the MASS package.
You can cast the molten data into a new form in which the aggregate for each type of ship for each year is computed. This is done using the cast() function.
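To show the idea behind melting and casting without depending on an extra package, the sketch below imitates both steps with base R only. It does not use melt() or cast() themselves, and the small table is made-up data loosely modeled on the ships columns.

```r
# A made-up wide table: two id columns and two measured columns.
ships_small <- data.frame(
  type = c("A", "A", "B", "B"),
  year = c(60, 60, 60, 65),
  service = c(10, 20, 30, 40),
  incidents = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# "Melt": stack the measured columns into variable/value pairs,
# repeating the id columns on every row.
id_cols <- c("type", "year")
measure_cols <- c("service", "incidents")
molten <- do.call(rbind, lapply(measure_cols, function(v) {
  data.frame(ships_small[id_cols], variable = v, value = ships_small[[v]],
             stringsAsFactors = FALSE)
}))
print(molten)

# "Cast": sum the values for each type/year/variable combination.
recast <- aggregate(value ~ type + year + variable, data = molten, FUN = sum)
print(recast)
```

The molten form has one row per id combination and measured variable, which is what makes the later aggregation a single call. Unlike cast(), aggregate() returns the sums in long form, with one row per combination.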
In R, a pie chart is created using the pie() function, which takes a vector of positive numbers as input.
A bar chart is created using the barplot() function, and the chart can be saved to a file in the current R working directory.
Boxplots are created in R by using the boxplot() function.
R creates histograms using the hist() function. This function takes a vector as input and accepts additional parameters to control the plot.
A line chart is a graph that connects a series of points by drawing line segments between them. The plot() function is used to create the line graph.
Scatterplots show many points plotted in the Cartesian plane. Each point represents the values of two variables. One variable is plotted on the horizontal axis and the other on the vertical axis. A simple scatterplot is created using the plot() function.
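The chart functions above can be tried together in one short script. This is a sketch under assumptions: the city and revenue numbers are made up, the built-in mtcars dataset is used for the other charts, and everything is written to a PDF file in a temporary directory rather than to the working directory.

```r
# Open a PDF device; each chart below goes onto its own page.
chart_file <- file.path(tempdir(), "charts.pdf")
pdf(chart_file)

# Pie chart from a vector of positive numbers (made-up values).
visits <- c(21, 62, 10, 53)
cities <- c("London", "New York", "Singapore", "Mumbai")
pie(visits, labels = cities, main = "City pie chart")

# Bar chart (made-up monthly revenue).
revenue <- c(7, 12, 28, 3, 41)
months <- c("Mar", "Apr", "May", "Jun", "Jul")
barplot(revenue, names.arg = months, xlab = "Month", ylab = "Revenue",
        main = "Revenue chart")

# Boxplot of mileage grouped by number of cylinders.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of cylinders",
        ylab = "Miles per gallon", main = "Mileage data")

# Histogram of a numeric vector.
hist(mtcars$mpg, xlab = "Miles per gallon", main = "Mileage histogram")

# Line chart: type = "o" draws both the points and the connecting lines.
plot(revenue, type = "o", xlab = "Month index", ylab = "Revenue",
     main = "Revenue line chart")

# Scatterplot of two variables.
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon",
     main = "Weight vs mileage")

# Close the device so the file is written to disk.
dev.off()
```

Replacing pdf() with png(file = "name.png") around a single plotting call saves that chart as an image instead, provided PNG support is available in your R build.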