Lesson 1: Data manipulation

In this section, you will learn which data types R provides and how to manipulate data, including importing, exporting, and reshaping it.

1. Acquiring data in R

1.1 The data within R

There are several ways to find the included datasets in R:
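One such way, shown here as a minimal sketch, is the data() function from base R. The choice of the "datasets" package and of mtcars as the example is an assumption made for illustration.

```r
# List the datasets that ship with the base "datasets" package
catalog <- data(package = "datasets")$results
print(head(catalog[, c("Item", "Title")]))

# Load one of them by name and look at the first rows
data("mtcars")
print(head(mtcars, 3))
```

Calling data() with no arguments lists the datasets of all currently attached packages.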

1.2 The data outside R

R can read data from files stored outside the R environment, and it can write data to files that other programs can then open. It reads and writes many file formats, such as CSV, Excel, and XML.

A file given by a relative path must be present in the current working directory so that R can read it. You can also set your own directory and read files from there. Use the getwd() function to check the current working directory and the setwd() function to set a new one.

[1] "/home/tank/Desktop/ecodatasci/_datexpl/2023-10-15-datamaniplation"
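The path printed above is what getwd() returned on the author's machine. A minimal sketch of checking and changing the working directory might look like this; switching to tempdir() is an assumption chosen so that the example runs anywhere.

```r
# Remember the current working directory so it can be restored later
old_wd <- getwd()
print(old_wd)

# Switch to another directory (tempdir() is only an illustrative choice)
setwd(tempdir())
print(getwd())

# Go back to where we started
setwd(old_wd)
```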

A CSV or Excel file

A CSV file is a text file in which the values in the columns are separated by commas. You can use the read.csv() function to read it into R.
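As a small sketch, the round trip below writes a data frame to a CSV file and reads it back. The file name input.csv and the sample values are hypothetical, and the file is placed in tempdir() rather than the working directory.

```r
# Hypothetical sample data written to a temporary CSV file
csv_path <- file.path(tempdir(), "input.csv")
sample_data <- data.frame(
  id = 1:3,
  name = c("Rick", "Dan", "Michelle"),
  salary = c(623.3, 515.2, 611)
)
write.csv(sample_data, csv_path, row.names = FALSE)

# Read the file back; read.csv() returns a data frame
loaded <- read.csv(csv_path)
print(loaded)
print(is.data.frame(loaded))
```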

Microsoft Excel is the most widely used spreadsheet program and stores data in the .xls or .xlsx format. R can read these files directly using specific packages, such as the xlsx package.

[1] TRUE
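A result like the TRUE above typically comes from checking that the package is installed. The sketch below assumes the xlsx package (which needs a working Java installation) and uses a hypothetical file name; it skips the Excel step when the package is unavailable.

```r
# Check whether the xlsx package can be loaded on this machine
has_xlsx <- requireNamespace("xlsx", quietly = TRUE)
print(has_xlsx)

if (has_xlsx) {
  # Hypothetical workbook written to a temporary location
  xlsx_path <- file.path(tempdir(), "input.xlsx")
  cities <- data.frame(id = 1:3, city = c("Tampa", "Seattle", "Denver"))
  xlsx::write.xlsx(cities, xlsx_path, sheetName = "Sheet1", row.names = FALSE)

  # Read the first worksheet back into a data frame
  sheet <- xlsx::read.xlsx(xlsx_path, sheetIndex = 1)
  print(sheet)
}
```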

From a website

Many websites provide data. For example, the WHO provides reports on health and medical information in the form of CSV, txt, and XML files. Using R, you can extract data from websites programmatically. Some R packages, such as “RCurl”, “XML”, and “stringr”, are used to connect to the URLs, identify the required links for the data, and download them to the R environment.

For example, you can visit a weather data URL and use R to download the CSV files for the year 2015.
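The link-filtering step can be sketched with base R alone. Everything named here is hypothetical: the link names stand in for what would be scraped from the page, and download_all() with its base_url argument is an illustrative helper that is defined but not called, since the actual URL is not given.

```r
# Hypothetical link names, as they might be extracted from a web page
links <- c("weather_2014_01.csv", "weather_2015_01.csv",
           "weather_2015_02.csv", "readme.txt")

# Keep only the CSV files whose name mentions 2015
files_2015 <- links[grepl("2015", links) & grepl("\\.csv$", links)]
print(files_2015)

# Illustrative helper: download each file below a placeholder base URL
download_all <- function(base_url, files, dest_dir = tempdir()) {
  for (f in files) {
    download.file(paste0(base_url, f),
                  destfile = file.path(dest_dir, f), mode = "wb")
  }
  invisible(file.path(dest_dir, files))
}
```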

From databases

Data in relational database systems is stored in a normalized format, so statistical computing directly in the database would require advanced and complex SQL queries. R, however, can connect easily to many relational databases such as MySQL, Oracle, and SQL Server, and fetch records from them as a data frame. Once the data is available in the R environment, it becomes a normal R data set and can be manipulated or analyzed using packages and functions. The examples below use MySQL as the reference database for connecting to R.

Once the RMySQL package is installed, you create a connection object in R to connect to the database. It takes the username, password, database name, and host name as input.
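A minimal sketch of such a connection follows. connect_mysql() is a hypothetical wrapper, and the credentials in the commented usage are placeholders; nothing is executed against a server here because that would require a running MySQL instance.

```r
# Hypothetical wrapper around DBI::dbConnect() for a MySQL database
connect_mysql <- function(user, password, dbname, host = "localhost") {
  if (!requireNamespace("RMySQL", quietly = TRUE)) {
    stop("Package 'RMySQL' is not installed")
  }
  DBI::dbConnect(RMySQL::MySQL(),
                 user = user, password = password,
                 dbname = dbname, host = host)
}

# Usage with placeholder credentials (needs a reachable MySQL server):
# con <- connect_mysql("my_user", "my_password", "my_database")
# print(DBI::dbListTables(con))
# DBI::dbDisconnect(con)
```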

You can query the database tables in MySQL using the dbSendQuery() function. The query executes in MySQL, and the result is returned using the fetch() function. Finally, it is stored as a data frame in R.

You can pass any valid select query to get the result.

You can update the rows in a MySQL table by passing the update query to the dbSendQuery() function.

You can create tables in MySQL using the dbWriteTable() function, which takes a data frame as input. With overwrite = TRUE, it replaces the table if it already exists.

You can drop a table in the MySQL database by passing the drop table statement to dbSendQuery(), in the same way as for querying data from tables.
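The query, update, create, and drop steps described above can be sketched as small helpers that take an open DBI connection. The helper names and the table and column names in the commented usage are hypothetical; dbFetch() is the current DBI name for fetch().

```r
# Run a select query and return the rows as a data frame
run_select <- function(con, sql, n = -1) {
  res <- DBI::dbSendQuery(con, sql)
  on.exit(DBI::dbClearResult(res))
  DBI::dbFetch(res, n = n)
}

# Run an update or drop statement that returns no rows
run_statement <- function(con, sql) {
  res <- DBI::dbSendQuery(con, sql)
  DBI::dbClearResult(res)
}

# Create (or replace) a table from a data frame
write_table <- function(con, name, df) {
  DBI::dbWriteTable(con, name, df, overwrite = TRUE)
}

# Usage, assuming `con` is an open connection and the tables exist:
# first_rows <- run_select(con, "select * from my_table", n = 5)
# run_statement(con, "update my_table set flag = 1 where id = 3")
# write_table(con, "mtcars_copy", mtcars[1:10, ])
# run_statement(con, "drop table if exists mtcars_copy")
```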

2. Data types and manipulation

In general, you store information as data of various types, such as character, integer, floating point, and Boolean. Based on the data type of a variable, the operating system allocates memory and decides what can be stored. R has many types of R-objects; the frequently used ones are vectors, lists, matrices, arrays, factors, and data frames, each covered below.

2.1 Vectors and manipulation

Vectors

In R, the most basic data types are the R-objects called vectors. To create a vector with more than one element, use the c() function, which combines the elements into a vector.

# Create a vector.
apple <- c('red','green',"yellow")
print(apple)
[1] "red"    "green"  "yellow"

Vector manipulation

Two vectors of the same length can be added, subtracted, multiplied, or divided element by element; the result is a vector of that length.
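The code for this step is not shown in the text. The sketch below uses two vectors chosen so that the four operations give the results printed next; the values are a reconstruction, not necessarily the author's originals.

```r
# Two numeric vectors of equal length (reconstructed values)
v1 <- c(3, 8, 4, 5, 0, 11)
v2 <- c(4, 11, 0, 8, 1, 2)

print(v1 + v2)   # element-wise addition
print(v1 - v2)   # element-wise subtraction
print(v1 * v2)   # element-wise multiplication
print(v1 / v2)   # element-wise division; 4 / 0 gives Inf
```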

[1]  7 19  4 13  1 13
[1] -1 -3  4 -3 -1  9
[1] 12 88  0 40  0 22
[1] 0.7500000 0.7272727       Inf 0.6250000 0.0000000 5.5000000

Elements in a vector can be sorted using the sort() function.
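A short sketch of both sort orders follows; the input vector is reconstructed from the sorted results printed next and is therefore an assumption.

```r
# Unsorted input (reconstructed values)
v <- c(3, 8, 4, 5, 0, 11, -9, 304)

# Ascending order is the default
print(sort(v))

# Descending order
print(sort(v, decreasing = TRUE))
```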

[1]  -9   0   3   4   5   8  11 304
[1] 304  11   8   5   4   3   0  -9

2.2 Lists and manipulation

Lists

A list is an R-object that can contain elements of many different types, such as vectors, functions, and even other lists.

# Create a list.
list1 <- list(c(2,5,3),21.3,sin)

# Print the list.
print(list1)
[[1]]
[1] 2 5 3

[[2]]
[1] 21.3

[[3]]
function (x)  .Primitive("sin")

List manipulation

A list can be converted to a vector so that its elements can be used for further manipulation. Once the list is converted, all the arithmetic operations on vectors can be applied. The conversion is done with the unlist() function, which takes the list as input and produces a vector.
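A minimal sketch of this conversion is given here. The two lists are reconstructed from the output that follows, so their contents are an assumption.

```r
# Two lists, each holding a single integer vector (reconstructed values)
list1 <- list(1:5)
list2 <- list(10:14)
print(list1)
print(list2)

# Convert the lists to plain vectors
v1 <- unlist(list1)
v2 <- unlist(list2)
print(v1)
print(v2)

# Vector arithmetic now works
print(v1 + v2)
```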

[[1]]
[1] 1 2 3 4 5

[[1]]
[1] 10 11 12 13 14

[1] 1 2 3 4 5
[1] 10 11 12 13 14
[1] 11 13 15 17 19

2.3 Matrices and manipulation

Matrices

A matrix is a two-dimensional rectangular R-object. It can be created by passing a vector to the matrix() function.

# Create a matrix.
M = matrix(c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
     [,1] [,2] [,3]
[1,] "a"  "a"  "b" 
[2,] "c"  "b"  "a" 

Matrix manipulation

A matrix can be created from a vector of numbers, filled either by row or by column, and its rows and columns can be given names.
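The three matrices printed next can be produced by a sketch like the following. The variable names M, N, and P are assumptions; the values follow from the output.

```r
# Fill the matrix row by row
M <- matrix(3:14, nrow = 4, byrow = TRUE)
print(M)

# Fill the matrix column by column (the default)
N <- matrix(3:14, nrow = 4, byrow = FALSE)
print(N)

# Attach row and column names through dimnames
rownames <- c("row1", "row2", "row3", "row4")
colnames <- c("col1", "col2", "col3")
P <- matrix(3:14, nrow = 4, byrow = TRUE,
            dimnames = list(rownames, colnames))
print(P)
```

Individual elements can then be read by position or by name, for example P["row3", "col2"].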

     [,1] [,2] [,3]
[1,]    3    4    5
[2,]    6    7    8
[3,]    9   10   11
[4,]   12   13   14
     [,1] [,2] [,3]
[1,]    3    7   11
[2,]    4    8   12
[3,]    5    9   13
[4,]    6   10   14
     col1 col2 col3
row1    3    4    5
row2    6    7    8
row3    9   10   11
row4   12   13   14

Mathematical operations are performed on matrices using the R operators. Operators such as + and - work element by element on matrices of the same dimensions, and the result is also a matrix.
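A sketch that is consistent with the output below follows; the two input matrices are reconstructed from the printed values.

```r
# Two 2x3 matrices, filled column by column (reconstructed values)
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
print(matrix1)

matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)
print(matrix2)

# Element-wise addition
cat("Result of addition", "\n")
print(matrix1 + matrix2)

# Element-wise subtraction
cat("Result of subtraction", "\n")
print(matrix1 - matrix2)
```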

     [,1] [,2] [,3]
[1,]    3   -1    2
[2,]    9    4    6
     [,1] [,2] [,3]
[1,]    5    0    3
[2,]    2    9    4
Result of addition 
     [,1] [,2] [,3]
[1,]    8   -1    5
[2,]   11   13   10
Result of subtraction 
     [,1] [,2] [,3]
[1,]   -2   -1   -1
[2,]    7   -5    2

2.4 Arrays and manipulation

Arrays

While matrices are confined to two dimensions, arrays can have any number of dimensions. The array() function takes a dim argument that specifies the size of each dimension. Below is an example array made of two 3x3 matrices.

# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2)) #  2 matrices each with 3 rows and 3 columns
print(a)
, , 1

     [,1]     [,2]     [,3]    
[1,] "green"  "yellow" "green" 
[2,] "yellow" "green"  "yellow"
[3,] "green"  "yellow" "green" 

, , 2

     [,1]     [,2]     [,3]    
[1,] "yellow" "green"  "yellow"
[2,] "green"  "yellow" "green" 
[3,] "yellow" "green"  "yellow"

Array Manipulation

Arrays are R-objects that can store data in more than two dimensions. For example, an array of dimension (2, 3, 4) consists of 4 rectangular matrices, each with 2 rows and 3 columns. Arrays can store only one data type.

An array is created using the array() function. It takes vectors as input and uses the values in the dim parameter to create an array.
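As a sketch, the array printed below can be built from two vectors of different lengths. The vector values are reconstructed from the output; the nine combined values are recycled to fill the second 3x3 layer.

```r
# Two input vectors (reconstructed values)
vector1 <- c(5, 9, 3)
vector2 <- c(10, 11, 12, 13, 14, 15)

# Combine them and shape the result into two 3x3 matrices
result <- array(c(vector1, vector2), dim = c(3, 3, 2))
print(result)
```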

, , 1

     [,1] [,2] [,3]
[1,]    5   10   13
[2,]    9   11   14
[3,]    3   12   15

, , 2

     [,1] [,2] [,3]
[1,]    5   10   13
[2,]    9   11   14
[3,]    3   12   15

2.5 Factors and manipulation

Factors

Factors are R-objects created from a vector. A factor stores the vector along with its distinct values as levels. The level labels are always character strings, whether the input vector is numeric, character, or logical. Factors are useful in statistical modeling.

Factors are created using the factor() function. The nlevels() function gives the number of levels.

# Create a vector
apple_colors <- c('green','green','yellow','red','red','red','green')

# Create a factor object
factor_apple <- factor(apple_colors)

# Print the factor
print(factor_apple)
[1] green  green  yellow red    red    red    green 
Levels: green red yellow
print(nlevels(factor_apple))
[1] 3

Factor Manipulation

Factors categorize data and store the categories as levels. They can be built from both strings and integers, and they are useful for columns that have a limited number of unique values, such as “Male” and “Female” or TRUE and FALSE. They are widely used in data analysis for statistical modeling.
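The output below can be reproduced by a sketch like this one. The variable names are assumptions, and the region values are taken from the printed output.

```r
# A plain character vector of regions (values taken from the output)
regions <- c("East", "West", "East", "North", "North", "East",
             "West", "West", "West", "East", "North")
print(regions)
print(is.factor(regions))

# Convert the vector to a factor; the levels are sorted alphabetically
factor_regions <- factor(regions)
print(factor_regions)
print(is.factor(factor_regions))
```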

 [1] "East"  "West"  "East"  "North" "North" "East"  "West"  "West" 
 [9] "West"  "East"  "North"
[1] FALSE
 [1] East  West  East  North North East  West  West  West  East  North
Levels: East North West
[1] TRUE

2.6 Data frames and manipulation

Data Frames

Data frames are tabular data objects. Unlike in a matrix, each column of a data frame can contain a different mode of data: the first column can be numeric, the second character, and the third logical. A data frame is a list of vectors of equal length.

Data Frames are created using the data.frame() function.

# Create the data frame.
BMI <- data.frame(
   gender = c("Male", "Male", "Female"),
   height = c(152, 171.5, 165),
   weight = c(81, 93, 78),
   Age = c(42, 38, 26)
)
print(BMI)
  gender height weight Age
1   Male  152.0     81  42
2   Male  171.5     93  38
3 Female  165.0     78  26

Data frame Manipulation

A data frame is a table or a two-dimensional array-like structure in which each column contains the values of one variable and each row contains one set of values from each column. You can use the data.frame() function to create a data frame.
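A sketch that builds the employee table printed below is given here. The name emp.data is inferred from later output, and the values are copied from the printed table.

```r
# Employee data frame (values taken from the printed table)
emp.data <- data.frame(
  emp_id = 1:5,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
                         "2014-05-11", "2015-03-27")),
  stringsAsFactors = FALSE
)
print(emp.data)
```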

  emp_id emp_name salary start_date
1      1     Rick 623.30 2012-01-01
2      2      Dan 515.20 2013-09-23
3      3 Michelle 611.00 2014-11-15
4      4     Ryan 729.00 2014-05-11
5      5     Gary 843.25 2015-03-27

The structure of the data frame can be inspected with the str() function.

'data.frame':   5 obs. of  4 variables:
 $ emp_id    : int  1 2 3 4 5
 $ emp_name  : chr  "Rick" "Dan" "Michelle" "Ryan" ...
 $ salary    : num  623 515 611 729 843
 $ start_date: Date, format: "2012-01-01" ...

The statistical summary and nature of the data can be obtained by applying the summary() function.
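Both inspection calls can be sketched together as follows. The data frame is rebuilt here from the printed table so that the example is self-contained.

```r
# Rebuild the employee data frame (values taken from the printed table)
emp.data <- data.frame(
  emp_id = 1:5,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
                         "2014-05-11", "2015-03-27")),
  stringsAsFactors = FALSE
)

# Column types and the first few values of each column
str(emp.data)

# Per-column summary statistics
print(summary(emp.data))
```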

     emp_id    emp_name             salary        start_date        
 Min.   :1   Length:5           Min.   :515.2   Min.   :2012-01-01  
 1st Qu.:2   Class :character   1st Qu.:611.0   1st Qu.:2013-09-23  
 Median :3   Mode  :character   Median :623.3   Median :2014-05-11  
 Mean   :3                      Mean   :664.4   Mean   :2014-01-14  
 3rd Qu.:4                      3rd Qu.:729.0   3rd Qu.:2014-11-15  
 Max.   :5                      Max.   :843.2   Max.   :2015-03-27  

Extract specific columns from a data frame using the column names.

  emp.data.emp_name emp.data.salary
1              Rick          623.30
2               Dan          515.20
3          Michelle          611.00
4              Ryan          729.00
5              Gary          843.25

Extract the first two rows with all columns.

  emp_id emp_name salary start_date
1      1     Rick  623.3 2012-01-01
2      2      Dan  515.2 2013-09-23

Extract the 3rd and 5th rows with the 2nd and 4th columns.

  emp_name start_date
3 Michelle 2014-11-15
5     Gary 2015-03-27
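The three extractions above can be sketched as follows. The data frame is rebuilt from the printed table, and the result variable names are assumptions.

```r
# Rebuild the employee data frame (values taken from the printed table)
emp.data <- data.frame(
  emp_id = 1:5,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
                         "2014-05-11", "2015-03-27")),
  stringsAsFactors = FALSE
)

# Columns selected by name with the $ operator
by_name <- data.frame(emp.data$emp_name, emp.data$salary)
print(by_name)

# First two rows, all columns (empty column index)
first_two <- emp.data[1:2, ]
print(first_two)

# Rows 3 and 5, columns 2 and 4
subset_35 <- emp.data[c(3, 5), c(2, 4)]
print(subset_35)
```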

A dataframe can be expanded by adding columns and rows.

  emp_id emp_name salary start_date       dept
1      1     Rick 623.30 2012-01-01         IT
2      2      Dan 515.20 2013-09-23 Operations
3      3 Michelle 611.00 2014-11-15         IT
4      4     Ryan 729.00 2014-05-11         HR
5      5     Gary 843.25 2015-03-27    Finance

To add more rows permanently to an existing data frame, we need to bring in the new rows in the same structure as the existing data frame and use the rbind() function.

  emp_id emp_name salary start_date       dept
1      1     Rick 623.30 2012-01-01         IT
2      2      Dan 515.20 2013-09-23 Operations
3      3 Michelle 611.00 2014-11-15         IT
4      4     Ryan 729.00 2014-05-11         HR
5      5     Gary 843.25 2015-03-27    Finance
6      6    Rasmi 578.00 2013-05-21         IT
7      7   Pranab 722.50 2013-07-30 Operations
8      8    Tusar 632.80 2014-06-17    Finance
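Adding the dept column and then appending the three new rows can be sketched as below. The names emp.newdata and emp.finaldata are assumptions, and the values are copied from the printed tables.

```r
# Rebuild the employee data frame (values taken from the printed table)
emp.data <- data.frame(
  emp_id = 1:5,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
                         "2014-05-11", "2015-03-27")),
  stringsAsFactors = FALSE
)

# Add a column by assigning a vector of matching length
emp.data$dept <- c("IT", "Operations", "IT", "HR", "Finance")

# New rows must have the same columns as the existing data frame
emp.newdata <- data.frame(
  emp_id = 6:8,
  emp_name = c("Rasmi", "Pranab", "Tusar"),
  salary = c(578.0, 722.5, 632.8),
  start_date = as.Date(c("2013-05-21", "2013-07-30", "2014-06-17")),
  dept = c("IT", "Operations", "Finance"),
  stringsAsFactors = FALSE
)

# Stack the two data frames row-wise
emp.finaldata <- rbind(emp.data, emp.newdata)
print(emp.finaldata)
```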

Data reshape

Data reshaping in R is about changing the way data is organized into rows and columns. Most data processing in R takes the input data as a data frame. It is easy to extract data from the rows and columns of a data frame, but there are situations when you need the data frame in a format different from the one in which you received it. R has many functions to split, merge, and change rows to columns and vice versa in a data frame.

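You can join multiple vectors column-wise using the cbind() function. Note that cbind() applied to plain vectors returns a matrix, as the [1,] row labels in the first output below show. You can also stack two data frames row-wise using the rbind() function.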
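A sketch of both steps follows, with values taken from the output below and variable names that are assumptions. The matrix is converted explicitly with as.data.frame() before rbind(), a choice made here to keep the column types predictable.

```r
# Three vectors joined column-wise; the result is a character matrix
city <- c("Tampa", "Seattle", "Hartford", "Denver")
state <- c("FL", "WA", "CT", "CO")
zipcode <- c(33602, 98104, 6161, 80294)
addresses <- cbind(city, state, zipcode)
cat("# # # # The First data frame\n")
print(addresses)

# A second table with the same columns
new.address <- data.frame(
  city = c("Lowry", "Charlotte"),
  state = c("CO", "FL"),
  zipcode = c("80230", "33949"),
  stringsAsFactors = FALSE
)
cat("# # # The Second data frame\n")
print(new.address)

# Stack the rows of both tables
all.addresses <- rbind(as.data.frame(addresses, stringsAsFactors = FALSE),
                       new.address)
cat("# # # The combined data frame\n")
print(all.addresses)
```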

# # # # The First data frame
     city       state zipcode
[1,] "Tampa"    "FL"  "33602"
[2,] "Seattle"  "WA"  "98104"
[3,] "Hartford" "CT"  "6161" 
[4,] "Denver"   "CO"  "80294"
# # # The Second data frame
       city state zipcode
1     Lowry    CO   80230
2 Charlotte    FL   33949
# # # The combined data frame
       city state zipcode
1     Tampa    FL   33602
2   Seattle    WA   98104
3  Hartford    CT    6161
4    Denver    CO   80294
5     Lowry    CO   80230
6 Charlotte    FL   33949

You can merge two data frames by using the merge() function. By default, the merge happens on the columns that have the same name in both data frames.

The example below uses the diabetes data sets in the MASS package. The two data sets are merged on the values of blood pressure (“bp”) and body mass index (“bmi”): records in which both variables match in the two data sets are combined into a single data frame.
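A sketch of this merge follows. The data set names Pima.te and Pima.tr are an inference from the column names in the output below, not something the text states, and merged.Pima is a hypothetical name.

```r
library(MASS)

# Join the two diabetes data sets on blood pressure and body mass index;
# columns present in both inputs get the suffixes .x and .y
merged.Pima <- merge(x = Pima.te, y = Pima.tr, by = c("bp", "bmi"))
print(merged.Pima)

# Number of matching record pairs
print(nrow(merged.Pima))
```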

   bp  bmi npreg.x glu.x skin.x ped.x age.x type.x npreg.y glu.y
1  60 33.8       1   117     23 0.466    27     No       2   125
2  64 29.7       2    75     24 0.370    33     No       2   100
3  64 31.2       5   189     33 0.583    29    Yes       3   158
4  64 33.2       4   117     27 0.230    24     No       1    96
5  66 38.1       3   115     39 0.150    28     No       1   114
6  68 38.5       2   100     25 0.324    26     No       7   129
7  70 27.4       1   116     28 0.204    21     No       0   124
8  70 33.1       4    91     32 0.446    22     No       9   123
9  70 35.4       9   124     33 0.282    34     No       6   134
10 72 25.6       1   157     21 0.123    24     No       4    99
11 72 37.7       5    95     33 0.370    27     No       6   103
12 74 25.9       9   134     33 0.460    81     No       8   126
13 74 25.9       1    95     21 0.673    36     No       8   126
14 78 27.6       5    88     30 0.258    37     No       6   125
15 78 27.6      10   122     31 0.512    45     No       6   125
16 78 39.4       2   112     50 0.175    24     No       4   112
17 88 34.5       1   117     24 0.403    40    Yes       4   127
   skin.y ped.y age.y type.y
1      20 0.088    31     No
2      23 0.368    21     No
3      13 0.295    24     No
4      27 0.289    21     No
5      36 0.289    21     No
6      49 0.439    43    Yes
7      20 0.254    36    Yes
8      44 0.374    40     No
9      23 0.542    29    Yes
10     17 0.294    28     No
11     32 0.324    55     No
12     38 0.162    39     No
13     38 0.162    39     No
14     31 0.565    49    Yes
15     31 0.565    49    Yes
16     40 0.236    38     No
17     11 0.598    28     No
[1] 17

One of the most interesting aspects of R is changing the shape of the data in multiple steps to get a desired shape. The functions used to do this are melt() and cast() from the reshape package. The example uses the data set called ships in the MASS package.

Once melt() has turned every column other than the identifier columns into rows, you can use the cast() function to reshape the molten data into a new form in which the aggregate for each type of ship in each year is computed.
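The melt and cast steps might be sketched as follows, assuming the reshape package is installed; the block skips the reshaping when it is not. The variable names and the use of sum as the aggregate are assumptions.

```r
library(MASS)

# The ships data: type, year, period, service, incidents
print(head(ships))

if (requireNamespace("reshape", quietly = TRUE)) {
  # Melt: keep type and year as identifiers, stack the other columns
  molten.ships <- reshape::melt(ships, id.vars = c("type", "year"))
  print(head(molten.ships))

  # Cast: one row per type and year, summing each measured variable
  recasted.ship <- reshape::cast(molten.ships, type + year ~ variable, sum)
  print(head(recasted.ship))
}
```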

3. Data visualization in R

3.1 Pie charts

In R, a pie chart is created using the pie() function, which takes a vector of positive numbers as input.
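A minimal sketch follows; the city names and values are hypothetical sample data. When run non-interactively, R draws to its default file device.

```r
# Hypothetical sample data
x <- c(21, 62, 10, 53)
cities <- c("London", "New York", "Singapore", "Mumbai")

# Share of each slice in percent, used in the labels
pct <- round(100 * x / sum(x), 1)
slice_labels <- paste0(cities, " (", pct, "%)")

# Draw the pie chart
pie(x, labels = slice_labels, main = "City pie chart")
```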

3.2 Bar charts

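Bar charts are created with the barplot() function. When a file device such as png() is opened with a relative file name before plotting, the chart is saved in the current R working directory.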
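A sketch of creating and saving a bar chart follows. The revenue figures and the file name are hypothetical; a PDF device is used here because it is available in every R build, and png() is used in the same way.

```r
# Hypothetical sample data
H <- c(7, 12, 28, 3, 41)
M <- c("Mar", "Apr", "May", "Jun", "Jul")

# Open a file device; a relative name lands in the working directory
pdf(file = "barchart.pdf")

# Draw the bars; barplot() returns the bar midpoints
mids <- barplot(H, names.arg = M, xlab = "Month", ylab = "Revenue",
                col = "blue", border = "red", main = "Revenue chart")

# Close the device so the file is written
dev.off()
print(file.exists("barchart.pdf"))
```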

3.3 Boxplots

Boxplots are created in R by using the boxplot() function.
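As an illustration, the sketch below draws one box per cylinder count from the built-in mtcars data set; the choice of data set and labels is an assumption.

```r
# Miles per gallon grouped by number of cylinders
bp <- boxplot(mpg ~ cyl, data = mtcars,
              xlab = "Number of cylinders",
              ylab = "Miles per gallon",
              main = "Mileage data")

# boxplot() also returns the statistics it plotted
print(bp$names)
print(bp$n)
```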

3.4 Histograms

R creates a histogram using the hist() function. This function takes a vector as input and accepts further parameters that control how the histogram is plotted.
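A short sketch with hypothetical sample values is shown here; xlab, col, and border are standard graphical parameters of hist().

```r
# Hypothetical sample data
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)

# Draw the histogram; hist() returns the bins and counts it used
h <- hist(v, xlab = "Weight", col = "yellow", border = "blue",
          main = "Histogram of weights")
print(h$breaks)
print(h$counts)
```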

3.5 Line Charts

A line chart is a graph that connects a series of points by drawing line segments between them. The plot() function is used to create the line graph.
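A minimal sketch with hypothetical rainfall values follows. The type argument selects how the points are drawn: "o" draws both points and connecting lines.

```r
# Hypothetical sample data: rainfall for five consecutive months
rain <- c(7, 12, 28, 3, 41)

# Points joined by line segments
plot(rain, type = "o", col = "red",
     xlab = "Month", ylab = "Rain fall", main = "Rain fall chart")

# A second series can be added to the same chart with lines()
rain_next_year <- c(14, 7, 6, 19, 3)
lines(rain_next_year, type = "o", col = "blue")
```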

3.6 Scatterplots

Scatterplots show many points plotted in the Cartesian plane. Each point represents the values of two variables: one variable is chosen for the horizontal axis and another for the vertical axis. A simple scatterplot is created using the plot() function.
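For illustration, the sketch below plots car weight against mileage from the built-in mtcars data set; the choice of data set and axis limits is an assumption.

```r
# Two variables from mtcars: weight (x axis) and mileage (y axis)
weight <- mtcars$wt
mileage <- mtcars$mpg

plot(x = weight, y = mileage,
     xlab = "Weight", ylab = "Mileage",
     xlim = c(1.5, 5.5), ylim = c(10, 35),
     main = "Weight vs mileage")
```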