Data Structures in R Programming : DataFrame, Factors



Data Structures –

 

a. DataFrame – 


A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.

 

Following are the characteristics of a data frame.

 

The column names should be non-empty.

The row names should be unique.

The data stored in a data frame can be of numeric, factor or character type, as well as other types such as logical or Date. Each column should contain the same number of data items.
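The characteristics listed above can be checked directly in R. The following is a minimal sketch; the data frame `df` and its column names are hypothetical and exist only for illustration.

```r
# Hypothetical example data frame (names are assumptions for illustration).
df <- data.frame(
  id    = 1:3,
  name  = c("A", "B", "C"),
  score = c(9.5, 7.25, 8.0),
  stringsAsFactors = FALSE
)

# Column names are non-empty and row names are unique.
print(names(df))
print(anyDuplicated(rownames(df)) == 0)

# Each column has its own type, but all columns share the same length.
print(sapply(df, class))
print(nrow(df))

# Columns of unequal length that cannot be recycled raise an error,
# which is caught here so the script keeps running.
bad <- tryCatch(
  data.frame(id = 1:3, name = c("A", "B")),
  error = function(e) "error: differing number of rows"
)
print(bad)
```

The `tryCatch()` call is only there to show the failure without stopping the script; in normal use the error itself is the signal that the columns do not line up.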

 

Create Data Frame –

# Create the data frame.
emp.data <- data.frame(
   emp_id = c(1:5),
   emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
   salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
      "2014-05-11", "2015-03-27")),
   stringsAsFactors = FALSE
)

# Print the data frame.
print(emp.data)

 

When we execute the above code, it produces the following result –

  emp_id emp_name salary start_date
1      1     Rick 623.30 2012-01-01
2      2      Dan 515.20 2013-09-23
3      3 Michelle 611.00 2014-11-15
4      4     Ryan 729.00 2014-05-11
5      5     Gary 843.25 2015-03-27

 

Get the Structure of the Data Frame –


The structure of a data frame (the type and first few values of each column) can be seen by using the str() function.

 

# Create the data frame.
emp.data <- data.frame(
   emp_id = c(1:5),
   emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
   salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
      "2014-05-11", "2015-03-27")),
   stringsAsFactors = FALSE
)

# Get the structure of the data frame.
str(emp.data)

 

When we execute the above code, it produces the following result –

'data.frame':   5 obs. of  4 variables:
 $ emp_id    : int  1 2 3 4 5
 $ emp_name  : chr  "Rick" "Dan" "Michelle" "Ryan" ...
 $ salary    : num  623 515 611 729 843
 $ start_date: Date, format: "2012-01-01" "2013-09-23" "2014-11-15" "2014-05-11" ...

 

Summary of Data in Data Frame –


A statistical summary of each column, along with the nature of its data, can be obtained by applying the summary() function.

 

# Create the data frame.
emp.data <- data.frame(
   emp_id = c(1:5),
   emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
   salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
      "2014-05-11", "2015-03-27")),
   stringsAsFactors = FALSE
)

# Print the summary.
print(summary(emp.data))

 

 

When we execute the above code, it produces the following result –

     emp_id    emp_name             salary        start_date
 Min.   :1   Length:5           Min.   :515.2   Min.   :2012-01-01
 1st Qu.:2   Class :character   1st Qu.:611.0   1st Qu.:2013-09-23
 Median :3   Mode  :character   Median :623.3   Median :2014-05-11
 Mean   :3                      Mean   :664.4   Mean   :2014-01-14
 3rd Qu.:4                      3rd Qu.:729.0   3rd Qu.:2014-11-15
 Max.   :5                      Max.   :843.2   Max.   :2015-03-27

 

Extract Data from Data Frame –


Extract specific columns from a data frame using the column names.

 

# Create the data frame.
emp.data <- data.frame(
   emp_id = c(1:5),
   emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
   salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
      "2014-05-11", "2015-03-27")),
   stringsAsFactors = FALSE
)

# Extract specific columns.
result <- data.frame(emp.data$emp_name, emp.data$salary)
print(result)

 

When we execute the above code, it produces the following result –

  emp.data.emp_name emp.data.salary
1              Rick          623.30
2               Dan          515.20
3          Michelle          611.00
4              Ryan          729.00
5              Gary          843.25

 

Extract the first two rows with all columns.

# Create the data frame.
emp.data <- data.frame(
   emp_id = c(1:5),
   emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
   salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
      "2014-05-11", "2015-03-27")),
   stringsAsFactors = FALSE
)

# Extract first two rows.
result <- emp.data[1:2, ]
print(result)

 

When we execute the above code, it produces the following result –

  emp_id emp_name salary start_date
1      1     Rick  623.3 2012-01-01
2      2      Dan  515.2 2013-09-23

 

Extract the 3rd and 5th rows with the 2nd and 4th columns.

# Create the data frame.
emp.data <- data.frame(
   emp_id = c(1:5),
   emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
   salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
      "2014-05-11", "2015-03-27")),
   stringsAsFactors = FALSE
)

# Extract 3rd and 5th row with 2nd and 4th column.
result <- emp.data[c(3, 5), c(2, 4)]
print(result)

 

When we execute the above code, it produces the following result –

  emp_name start_date
3 Michelle 2014-11-15
5     Gary 2015-03-27
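Besides numeric positions, the same bracket syntax accepts column names and logical conditions. The sketch below reuses the employee data from the examples above; the variable names `by_name` and `high_paid` and the salary threshold of 700 are assumptions chosen for illustration.

```r
# Same employee data as in the examples above.
emp.data <- data.frame(
  emp_id = 1:5,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
                         "2014-05-11", "2015-03-27")),
  stringsAsFactors = FALSE
)

# Select columns by name instead of by position.
by_name <- emp.data[, c("emp_name", "salary")]
print(by_name)

# Select rows with a logical condition: salary above 700 (arbitrary threshold).
high_paid <- emp.data[emp.data$salary > 700, c("emp_name", "salary")]
print(high_paid)
```

Selecting by name tends to be more robust than selecting by position, since it keeps working if columns are later reordered.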

 

Expand Data Frame


A data frame can be expanded by adding columns and rows.

 

Add Column

Just add the column vector using a new column name.

 

# Create the data frame.
emp.data <- data.frame(
   emp_id = c(1:5),
   emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
   salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
      "2014-05-11", "2015-03-27")),
   stringsAsFactors = FALSE
)

# Add the "dept" column.
emp.data$dept <- c("IT", "Operations", "IT", "HR", "Finance")
v <- emp.data
print(v)

 

When we execute the above code, it produces the following result –

  emp_id   emp_name    salary    start_date       dept

1         1    Rick        623.30    2012-01-01       IT

2         2    Dan         515.20    2013-09-23       Operations

3         3    Michelle    611.00    2014-11-15       IT

4         4    Ryan        729.00    2014-05-11       HR

5         5    Gary        843.25    2015-03-27       Finance
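A new column does not have to be typed in by hand; it can be computed from an existing column, or attached with cbind(). The following is a small sketch on a reduced version of the employee data. The column name `raised_salary` and the 10% raise are hypothetical and used only for illustration.

```r
# Reduced version of the employee data used above.
emp.data <- data.frame(
  emp_id = 1:5,
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25)
)

# Add a column computed from an existing one (hypothetical 10% raise).
emp.data$raised_salary <- emp.data$salary * 1.10

# cbind() adds a column too; the new vector needs one value per row.
emp.data <- cbind(emp.data,
                  dept = c("IT", "Operations", "IT", "HR", "Finance"))

print(emp.data)
print(ncol(emp.data))
```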

 

Add Row


To add more rows permanently to an existing data frame, we need to bring in the new rows in the same structure as the existing data frame and use the rbind() function.

 

In the example below we create a data frame with new rows and merge it with the existing data frame to create the final data frame.

 

# Create the first data frame.
emp.data <- data.frame(
   emp_id = c(1:5),
   emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
   salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
      "2014-05-11", "2015-03-27")),
   dept = c("IT", "Operations", "IT", "HR", "Finance"),
   stringsAsFactors = FALSE
)

# Create the second data frame.
emp.newdata <- data.frame(
   emp_id = c(6:8),
   emp_name = c("Rasmi", "Pranab", "Tusar"),
   salary = c(578.0, 722.5, 632.8),
   start_date = as.Date(c("2013-05-21", "2013-07-30", "2014-06-17")),
   dept = c("IT", "Operations", "Finance"),
   stringsAsFactors = FALSE
)

# Bind the two data frames.
emp.finaldata <- rbind(emp.data, emp.newdata)
print(emp.finaldata)

 

When we execute the above code, it produces the following result –

  emp_id     emp_name    salary     start_date       dept

1          1     Rick        623.30     2012-01-01       IT

2          2     Dan         515.20     2013-09-23       Operations

3          3     Michelle    611.00     2014-11-15       IT

4          4     Ryan        729.00     2014-05-11       HR

5          5     Gary        843.25     2015-03-27       Finance

6          6     Rasmi       578.00     2013-05-21       IT

7          7     Pranab      722.50     2013-07-30       Operations

8          8     Tusar       632.80     2014-06-17       Finance
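The requirement that the new rows have "the same structure" means, in particular, that the column names must match. The sketch below illustrates this with small hypothetical data frames (`existing`, `matching` and `mismatched` are names made up for this example).

```r
existing <- data.frame(emp_id = 1:2, emp_name = c("Rick", "Dan"),
                       stringsAsFactors = FALSE)

# Same column names and types: rbind() succeeds.
matching <- data.frame(emp_id = 3L, emp_name = "Michelle",
                       stringsAsFactors = FALSE)
combined <- rbind(existing, matching)
print(combined)

# Different column names: rbind() raises an error, caught here for display.
mismatched <- data.frame(id = 4L, name = "Ryan", stringsAsFactors = FALSE)
result <- tryCatch(
  rbind(existing, mismatched),
  error = function(e) "error: column names do not match"
)
print(result)
```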

 

 

b. Factors –


Factors are data objects used to categorize data and store it as levels. They can store both strings and integers. They are useful for columns that have a limited number of unique values, such as "Male" and "Female" or TRUE and FALSE, and they are widely used in data analysis for statistical modeling.

 

Factors are created using the factor() function, which takes a vector as input.

Example

# Create a vector as input.
data <- c("East", "West", "East", "North", "North", "East", "West", "West",
   "West", "East", "North")

 

print(data)

print(is.factor(data))

 

# Apply the factor function.

factor_data <- factor(data)

 

print(factor_data)

print(is.factor(factor_data))

 

When we execute the above code, it produces the following result –

 [1] "East"  "West"  "East"  "North" "North" "East"  "West"  "West"  "West"  "East"  "North"
[1] FALSE
 [1] East  West  East  North North East  West  West  West  East  North
Levels: East North West
[1] TRUE
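To see what "storing data as levels" means in practice, a factor can be inspected with a few base R functions. This is a brief sketch using the same vector as above; nothing here goes beyond base R.

```r
data <- c("East", "West", "East", "North", "North", "East",
          "West", "West", "West", "East", "North")
factor_data <- factor(data)

# The distinct categories, in sorted order by default.
print(levels(factor_data))
print(nlevels(factor_data))

# Internally, each value is stored as an integer index into the levels.
print(as.integer(factor_data))

# Count how many times each level occurs.
print(table(factor_data))
```

The integer codes are what make factors compact and what modeling functions rely on; the labels are only looked up when the factor is printed.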

 

Factors in Data Frame –

In R versions before 4.0.0, creating a data frame with a column of text data converts that column to a factor by default. Since R 4.0.0 the default is stringsAsFactors = FALSE, so text columns stay as character vectors unless stringsAsFactors = TRUE is passed, as in the code here.

# Create the vectors for data frame.
height <- c(132, 151, 162, 139, 166, 147, 122)
weight <- c(48, 49, 66, 53, 67, 52, 40)
gender <- c("male", "male", "female", "female", "male", "female", "male")

# Create the data frame, converting text columns to factors.
input_data <- data.frame(height, weight, gender, stringsAsFactors = TRUE)
print(input_data)

# Test if the gender column is a factor.
print(is.factor(input_data$gender))

# Print the gender column to see the levels.
print(input_data$gender)

 

When we execute the above code, it produces the following result –

  height weight gender
1    132     48   male
2    151     49   male
3    162     66 female
4    139     53 female
5    166     67   male
6    147     52 female
7    122     40   male
[1] TRUE
[1] male   male   female female male   female male
Levels: female male
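A text column can also be converted to a factor explicitly with factor(), which does not depend on how the data frame was created. The sketch below uses a shortened, hypothetical version of the data above.

```r
# Shortened version of the data above, created with character columns.
input_data <- data.frame(
  height = c(132, 151, 162),
  gender = c("male", "male", "female"),
  stringsAsFactors = FALSE
)
was_factor <- is.factor(input_data$gender)
print(was_factor)

# Convert the column explicitly to a factor.
input_data$gender <- factor(input_data$gender)
print(is.factor(input_data$gender))
print(levels(input_data$gender))
```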

 

Changing the Order of Levels –


The order of the levels in a factor can be changed by applying the factor() function again with the levels listed in the new order.

 

data <- c("East", "West", "East", "North", "North", "East", "West",
   "West", "West", "East", "North")

# Create the factors.
factor_data <- factor(data)
print(factor_data)

# Apply the factor function with required order of the level.
new_order_data <- factor(factor_data, levels = c("East", "West", "North"))
print(new_order_data)

 

When we execute the above code, it produces the following result –

 [1] East  West  East  North North East  West  West  West  East  North
Levels: East North West
 [1] East  West  East  North North East  West  West  West  East  North
Levels: East West North
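Reordering the levels leaves the values untouched and changes only the integer codes behind them. The following short sketch, on a shortened vector chosen for illustration, makes that visible.

```r
data <- c("East", "West", "North", "East")
factor_data <- factor(data)
new_order_data <- factor(factor_data, levels = c("East", "West", "North"))

# The values themselves are unchanged by the reordering.
print(identical(as.character(factor_data), as.character(new_order_data)))

# Only the underlying integer codes differ, because the level order differs.
print(as.integer(factor_data))
print(as.integer(new_order_data))
```

The level order matters mainly for display and modeling, for example the order of bars in a plot or which level is treated as the reference category.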

 

Generating Factor Levels –


We can generate factor levels by using the gl() function. It takes two integers as input: the number of levels and the number of times each level is repeated.

 

Syntax

gl(n, k, labels)

 

Following is the description of the parameters used −

 

n is an integer giving the number of levels.

 

k is an integer giving the number of replications.

 

labels is a vector of labels for the resulting factor levels.

 

Example


v <- gl(3, 4, labels = c("Tampa", "Seattle", "Boston"))
print(v)

 

When we execute the above code, it produces the following result –

 [1] Tampa   Tampa   Tampa   Tampa   Seattle Seattle Seattle Seattle Boston
[10] Boston  Boston  Boston
Levels: Tampa Seattle Boston
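One way to understand gl() is to build the same factor by hand with rep() and factor(). The sketch below compares the two; the variable names `city_labels` and `manual` are hypothetical.

```r
city_labels <- c("Tampa", "Seattle", "Boston")

# Factor generated by gl(): 3 levels, each repeated 4 times.
v <- gl(3, 4, labels = city_labels)

# The same factor built by hand: repeat each label 4 times and
# keep the label order as the level order.
manual <- factor(rep(city_labels, each = 4), levels = city_labels)

# Both have the same values and the same levels.
print(all(as.character(v) == as.character(manual)))
print(identical(levels(v), levels(manual)))
print(table(v))
```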

 
