R Programming: dplyr package basics

dplyr is one of the most commonly used R packages for data manipulation. The following are the major dplyr verbs for data curation and analysis:

  • filter()
  • select()
  • summarize()
  • arrange()
  • mutate()

We will look at use cases for these verbs with a simulated dataset:

#load libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#Simulate a dataset from normal distributions
set.seed(1234)  #set the seed so that results are reproducible on reruns

#Generate two vectors of normally distributed random values
x <- rnorm(n=10, sd=2, mean=25)
y <- rnorm(n=10, sd=3, mean=30)

#Combine both simulated vectors to create a data frame
our_data<- data.frame(x,y)

Now let's use the verbs listed above to see how they work in data processing.

#filter keeps the rows that satisfy a condition;
#we want to subset the data to only those observations whose x is greater than 25

filter(our_data, x > 25)
##          x        y
## 1 25.55486 27.00484
## 2 27.16888 27.67124
## 3 25.85825 32.87848
## 4 26.01211 29.66914
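
Conditions can also be combined. The following is a minimal sketch on the same simulated data (the names both and either are only illustrative): conditions separated by a comma must all hold, while | keeps rows that satisfy at least one of them.

```r
library(dplyr)

# Recreate the simulated data used in this post
set.seed(1234)
x <- rnorm(n = 10, sd = 2, mean = 25)
y <- rnorm(n = 10, sd = 3, mean = 30)
our_data <- data.frame(x, y)

# Comma-separated conditions are combined with AND
both <- filter(our_data, x > 25, y < 30)

# The | operator combines conditions with OR
either <- filter(our_data, x > 25 | y > 35)

nrow(both)    # 3 rows
nrow(either)  # 5 rows
```

Of the four rows with x above 25, one has y above 30, so the AND version drops it; the OR version adds the single row whose y exceeds 35.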
#select chooses columns from our data;
#we want to keep only x as a column in the dataset

select(our_data, x)
##           x
## 1  22.58587
## 2  25.55486
## 3  27.16888
## 4  20.30860
## 5  25.85825
## 6  26.01211
## 7  23.85052
## 8  23.90674
## 9  23.87110
## 10 23.21992
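
select can also drop, reorder, or rename columns. A short sketch on the same simulated data follows; the object names and the new column name x_values are made up for illustration.

```r
library(dplyr)

# Recreate the simulated data used in this post
set.seed(1234)
x <- rnorm(n = 10, sd = 2, mean = 25)
y <- rnorm(n = 10, sd = 3, mean = 30)
our_data <- data.frame(x, y)

# A minus sign drops a column and keeps the rest
no_x <- select(our_data, -x)

# Listing columns in a different order reorders them
reordered <- select(our_data, y, x)

# new_name = old_name renames a column while selecting it
renamed <- select(our_data, x_values = x)

names(no_x)       # "y"
names(reordered)  # "y" "x"
names(renamed)    # "x_values"
```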
#summarise (also spelled summarize) computes summary statistics of variables in the dataset and collapses the result to a single row;
#here we are calculating the mean of x

summarise(our_data, x=mean(x))
##          x
## 1 24.23369
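
Several summaries can be computed in one call. As a small sketch on the same simulated data (the column names mean_x, sd_x and n_obs are chosen here only for illustration), n() is dplyr's helper that counts rows:

```r
library(dplyr)

# Recreate the simulated data used in this post
set.seed(1234)
x <- rnorm(n = 10, sd = 2, mean = 25)
y <- rnorm(n = 10, sd = 3, mean = 30)
our_data <- data.frame(x, y)

# Each name = expression pair becomes one column of the one-row result
stats <- summarise(our_data,
                   mean_x = mean(x),  # average of x
                   sd_x   = sd(x),    # standard deviation of x
                   n_obs  = n())      # number of rows
stats
```

The result is still a single row, now with one column per statistic.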
#arrange sorts the rows by the values of a variable, in ascending or descending order;
#here we are sorting by x in descending order

arrange(our_data, desc(x))
##           x        y
## 1  27.16888 27.67124
## 2  26.01211 29.66914
## 3  25.85825 32.87848
## 4  25.55486 27.00484
## 5  23.90674 27.26641
## 6  23.87110 27.48848
## 7  23.85052 28.46697
## 8  23.21992 37.24751
## 9  22.58587 28.56842
## 10 20.30860 30.19338
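
Without desc(), arrange sorts in ascending order, and any column can serve as the sort key. A brief sketch on the same simulated data (the object names are illustrative only):

```r
library(dplyr)

# Recreate the simulated data used in this post
set.seed(1234)
x <- rnorm(n = 10, sd = 2, mean = 25)
y <- rnorm(n = 10, sd = 3, mean = 30)
our_data <- data.frame(x, y)

# Ascending order is the default
by_x <- arrange(our_data, x)

# Sort by a different column, largest y first
by_y_desc <- arrange(our_data, desc(y))

head(by_x, 3)
head(by_y_desc, 3)
```

Whole rows move together, so each x stays paired with its original y.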
#mutate creates a new variable and keeps the existing ones;
#here we are creating a new variable x_twice by multiplying x by 2

mutate(our_data, x_twice=2*x)
##           x        y  x_twice
## 1  22.58587 28.56842 45.17174
## 2  25.55486 27.00484 51.10972
## 3  27.16888 27.67124 54.33776
## 4  20.30860 30.19338 40.61721
## 5  25.85825 32.87848 51.71650
## 6  26.01211 29.66914 52.02422
## 7  23.85052 28.46697 47.70104
## 8  23.90674 27.26641 47.81347
## 9  23.87110 27.48848 47.74219
## 10 23.21992 37.24751 46.43985
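
mutate can create several variables in one call, and a later expression may use a variable created earlier in the same call. Like the other verbs, it returns a new data frame and leaves our_data unchanged unless the result is assigned. A sketch under those assumptions, with the hypothetical column names total and x_share:

```r
library(dplyr)

# Recreate the simulated data used in this post
set.seed(1234)
x <- rnorm(n = 10, sd = 2, mean = 25)
y <- rnorm(n = 10, sd = 3, mean = 30)
our_data <- data.frame(x, y)

# total is created first and then reused to compute x_share
with_new <- mutate(our_data,
                   total   = x + y,
                   x_share = x / total)

names(our_data)  # "x" "y" (the original is untouched)
names(with_new)  # "x" "y" "total" "x_share"
```

To keep the new columns, assign the result to a name, as done with with_new here.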

I have shown only a few basics of dplyr here to get you started quickly. For more detailed information on this package, see the introductory vignette: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html