Data Structures in R Programming : DataFrame, Factors

by Chaitanya Patil • November 13, 2022

Data Structures –

a. DataFrame –

A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.

Following are the characteristics of a data frame.

The column names should be non-empty.

The row names should be unique.

The data stored in a data frame can be of numeric, factor, or character type (other types, such as the Date column used in the examples, are also allowed). Each column should contain the same number of data items.
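The characteristics listed above can be checked directly in R. The following is a minimal sketch, assuming a small hypothetical data frame `df` that is not part of the original examples; it inspects the column names, column lengths, and row names, and shows that columns of incompatible lengths are rejected.

```r
# Hypothetical example data frame used only for illustration.
df <- data.frame(id = 1:3, label = c("a", "b", "c"))

# Every column has a name, and every column holds the same number of items.
print(names(df))
print(sapply(df, length))

# Row names are unique; by default they are the row numbers as strings.
print(rownames(df))
print(anyDuplicated(rownames(df)) == 0)

# Columns of incompatible lengths (3 vs. 2) are rejected with an error.
result <- tryCatch(
  data.frame(x = 1:3, y = 1:2),
  error = function(e) "error: differing number of rows"
)
print(result)
```

Note that R recycles a shorter column when its length divides the longer one evenly, so the lengths 3 and 2 were chosen here to trigger the error.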

Create Data Frame –

# Create the data frame.

emp.data <- data.frame( emp_id = c(1:5), emp_name = c("Rick","Dan","Michelle","Ryan","Gary"), salary = c(623.3,515.2,611.0,729.0,843.25), start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11", "2015-03-27")), stringsAsFactors = FALSE )

# Print the data frame.

print(emp.data)

When we execute the above code, it produces the following result –

  emp_id emp_name salary start_date
1      1     Rick 623.30 2012-01-01
2      2      Dan 515.20 2013-09-23
3      3 Michelle 611.00 2014-11-15
4      4     Ryan 729.00 2014-05-11
5      5     Gary 843.25 2015-03-27

Get the Structure of the Data Frame –

The structure of the data frame can be seen by using the str() function.

# Create the data frame.

emp.data <- data.frame( emp_id = c (1:5), emp_name = c("Rick","Dan","Michelle","Ryan","Gary"), salary = c(623.3,515.2,611.0,729.0,843.25),start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11", "2015-03-27")), stringsAsFactors = FALSE)

# Get the structure of the data frame.

str(emp.data)

When we execute the above code, it produces the following result –

'data.frame': 5 obs. of 4 variables:

$ emp_id : int 1 2 3 4 5

$ emp_name : chr "Rick" "Dan" "Michelle" "Ryan" ...

$ salary : num 623 515 611 729 843

$ start_date: Date, format: "2012-01-01" "2013-09-23" "2014-11-15" "2014-05-11" ...
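The pieces of the str() output can also be retrieved individually. As a small sketch using the same emp.data frame, dim() returns the row and column counts and sapply() with class() returns the type of each column; the variable name `col_classes` is introduced here only for illustration.

```r
# Recreate the data frame from the example above.
emp.data <- data.frame(
  emp_id = 1:5,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
                         "2014-05-11", "2015-03-27")),
  stringsAsFactors = FALSE
)

# Number of rows and columns, matching "5 obs. of 4 variables".
print(dim(emp.data))

# Class of each column, matching the int / chr / num / Date labels.
col_classes <- sapply(emp.data, class)
print(col_classes)
```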

Summary of Data in Data Frame –

The statistical summary and nature of the data can be obtained by applying the summary() function.

# Create the data frame.

emp.data <- data.frame( emp_id = c (1:5), emp_name = c("Rick","Dan","Michelle","Ryan","Gary"), salary = c(623.3,515.2,611.0,729.0,843.25), start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11", "2015-03-27")), stringsAsFactors = FALSE)

# Print the summary.

print(summary(emp.data))

When we execute the above code, it produces the following result –

     emp_id    emp_name             salary        start_date
 Min.   :1   Length:5           Min.   :515.2   Min.   :2012-01-01
 1st Qu.:2   Class :character   1st Qu.:611.0   1st Qu.:2013-09-23
 Median :3   Mode  :character   Median :623.3   Median :2014-05-11
 Mean   :3                      Mean   :664.4   Mean   :2014-01-14
 3rd Qu.:4                      3rd Qu.:729.0   3rd Qu.:2014-11-15
 Max.   :5                      Max.   :843.2   Max.   :2015-03-27
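The figures in the salary column of this summary can be reproduced one at a time, which may help when only a single statistic is needed. The sketch below works on the salary values alone and uses the base functions mean(), median(), and quantile().

```r
# Salary values from the example data frame.
salary <- c(623.3, 515.2, 611.0, 729.0, 843.25)

# The same six-number summary, for this one column only.
print(summary(salary))

# Individual statistics that make up the summary.
print(mean(salary))
print(median(salary))
print(quantile(salary, c(0.25, 0.75)))
```

The exact mean is 664.35; summary() displays it as 664.4 because it rounds to four significant digits by default.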

Extract Data from Data Frame –

Extract specific columns from a data frame using the column names.

# Create the data frame.

emp.data <- data.frame( emp_id = c(1:5), emp_name = c("Rick","Dan","Michelle","Ryan","Gary"), salary = c(623.3,515.2,611.0,729.0,843.25), start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11", "2015-03-27")), stringsAsFactors = FALSE)

# Extract Specific columns.

result <- data.frame(emp.data$emp_name,emp.data$salary)

print(result)

When we execute the above code, it produces the following result –

  emp.data.emp_name emp.data.salary
1              Rick          623.30
2               Dan          515.20
3          Michelle          611.00
4              Ryan          729.00
5              Gary          843.25

Extract the first two rows with all columns.

# Create the data frame.

emp.data <- data.frame( emp_id = c (1:5), emp_name = c("Rick","Dan","Michelle","Ryan","Gary"), salary = c(623.3,515.2,611.0,729.0,843.25),start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11", "2015-03-27")), stringsAsFactors = FALSE)

# Extract first two rows.

result <- emp.data[1:2,]

print(result)

When we execute the above code, it produces the following result –

  emp_id emp_name salary start_date
1      1     Rick  623.3 2012-01-01
2      2      Dan  515.2 2013-09-23

Extract the 3rd and 5th rows with the 2nd and 4th columns.

# Create the data frame.

emp.data <- data.frame( emp_id = c(1:5), emp_name = c("Rick","Dan","Michelle","Ryan","Gary"), salary = c(623.3,515.2,611.0,729.0,843.25), start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11", "2015-03-27")), stringsAsFactors = FALSE)

# Extract 3rd and 5th row with 2nd and 4th column.

result <- emp.data[c(3,5),c(2,4)]

print(result)

When we execute the above code, it produces the following result –

  emp_name start_date
3 Michelle 2014-11-15
5     Gary 2015-03-27
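Besides numeric positions, rows can be selected with a logical condition and columns by name. The sketch below assumes an arbitrary salary threshold of 620, chosen only for illustration, and also shows `[[ ]]` for pulling one column out as a plain vector.

```r
# Recreate the data frame from the examples above.
emp.data <- data.frame(
  emp_id = 1:5,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
                         "2014-05-11", "2015-03-27")),
  stringsAsFactors = FALSE
)

# Rows where salary exceeds 620, keeping two columns selected by name.
high_paid <- emp.data[emp.data$salary > 620, c("emp_name", "salary")]
print(high_paid)

# A single column as a plain character vector, using [[ ]].
names_vec <- emp.data[["emp_name"]]
print(names_vec)
```

The filtered result keeps the original row names (1, 4, and 5), just as the position-based extraction above keeps 3 and 5.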

Expand Data Frame

A data frame can be expanded by adding columns and rows.

Add Column

Just add the column vector using a new column name.

# Create the data frame.

emp.data <- data.frame( emp_id = c (1:5), emp_name = c("Rick","Dan","Michelle","Ryan","Gary"), salary = c(623.3,515.2,611.0,729.0,843.25), start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11","2015-03-27")), stringsAsFactors = FALSE)

# Add the "dept" column.

emp.data$dept <- c("IT","Operations","IT","HR","Finance")

v <- emp.data

print(v)

When we execute the above code, it produces the following result –

emp_id emp_name salary start_date dept

1 1 Rick 623.30 2012-01-01 IT

2 2 Dan 515.20 2013-09-23 Operations

3 3 Michelle 611.00 2014-11-15 IT

4 4 Ryan 729.00 2014-05-11 HR

5 5 Gary 843.25 2015-03-27 Finance
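A new column does not have to be typed in by hand; it can also be computed from an existing one. As a sketch, the hypothetical `bonus` column below (10% of salary, an assumption made purely for illustration) is derived from the salary column of a trimmed-down version of emp.data.

```r
# A reduced version of the example data frame (start_date omitted for brevity).
emp.data <- data.frame(
  emp_id = 1:5,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  stringsAsFactors = FALSE
)

# Hypothetical derived column: 10% of each salary.
emp.data$bonus <- emp.data$salary * 0.10

print(emp.data)
```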

Add Row

To add more rows permanently to an existing data frame, we need to bring in the new rows in the same structure as the existing data frame and use the rbind() function.

In the example below we create a data frame with new rows and merge it with the existing data frame to create the final data frame.

# Create the first data frame.

emp.data <- data.frame( emp_id = c (1:5), emp_name = c("Rick","Dan","Michelle","Ryan","Gary"), salary = c(623.3,515.2,611.0,729.0,843.25),start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11", "2015-03-27")), dept =c("IT","Operations","IT","HR","Finance"), stringsAsFactors = FALSE)

# Create the second data frame

emp.newdata <- data.frame( emp_id = c(6:8), emp_name = c("Rasmi","Pranab","Tusar"), salary = c(578.0,722.5,632.8), start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")), dept = c("IT","Operations","Finance"), stringsAsFactors = FALSE)

# Bind the two data frames.

emp.finaldata <- rbind(emp.data,emp.newdata)

print(emp.finaldata)

When we execute the above code, it produces the following result –

emp_id emp_name salary start_date dept

1 1 Rick 623.30 2012-01-01 IT

2 2 Dan 515.20 2013-09-23 Operations

3 3 Michelle 611.00 2014-11-15 IT

4 4 Ryan 729.00 2014-05-11 HR

5 5 Gary 843.25 2015-03-27 Finance

6 6 Rasmi 578.00 2013-05-21 IT

7 7 Pranab 722.50 2013-07-30 Operations

8 8 Tusar 632.80 2014-06-17 Finance
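The requirement that new rows share the structure of the existing data frame can be seen in a small experiment. The sketch below uses three hypothetical two-column frames (`a`, `b`, and `c_bad`, none of which appear in the original example): binding frames with matching column names works, while a frame with a differently named column is rejected.

```r
# Two frames with the same column names bind without trouble.
a <- data.frame(emp_id = 1:2, emp_name = c("Rick", "Dan"),
                stringsAsFactors = FALSE)
b <- data.frame(emp_id = 3L, emp_name = "Michelle",
                stringsAsFactors = FALSE)
combined <- rbind(a, b)
print(combined)

# A frame whose second column is called "name" cannot be bound to "a".
c_bad <- data.frame(emp_id = 4L, name = "Ryan", stringsAsFactors = FALSE)
outcome <- tryCatch(
  rbind(a, c_bad),
  error = function(e) "error: names do not match"
)
print(outcome)
```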

b. Factors –

Factors are data objects used to categorize data and store it as levels. They can store both strings and integers. They are useful for columns that have a limited number of unique values, such as "Male" and "Female" or TRUE and FALSE, and they are used in data analysis for statistical modeling.

Factors are created using the factor() function, which takes a vector as input.

Example

# Create a vector as input.

data <- c("East","West","East","North","North","East","West","West","West","East","North")

print(data)

print(is.factor(data))

# Apply the factor function.

factor_data <- factor(data)

print(factor_data)

print(is.factor(factor_data))

When we execute the above code, it produces the following result –

[1] "East" "West" "East" "North" "North" "East" "West" "West" "West" "East" "North"

[1] FALSE

[1] East West East North North East West West West East North

Levels: East North West

[1] TRUE
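To see what "storing data as levels" means in practice, the factor can be taken apart with a few base functions. The sketch below reuses the same direction vector and shows the levels, the number of levels, the integer codes behind each element, and the count per level.

```r
data <- c("East", "West", "East", "North", "North", "East",
          "West", "West", "West", "East", "North")
factor_data <- factor(data)

# Levels are the distinct values, sorted alphabetically by default.
print(levels(factor_data))
print(nlevels(factor_data))

# Internally each element is an integer index into the levels.
print(as.integer(factor_data))

# Count how many times each level occurs.
print(table(factor_data))
```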

Factors in Data Frame –

A text column in a data frame is stored as a factor only when stringsAsFactors is TRUE. That was the default before R 4.0.0; from R 4.0.0 onward the default is FALSE, so the argument is passed explicitly here to have R treat the text column as categorical data and create a factor from it.

# Create the vectors for data frame.

height <- c(132,151,162,139,166,147,122)

weight <- c(48,49,66,53,67,52,40)

gender <- c("male","male","female","female","male","female","male")

# Create the data frame.

input_data <- data.frame(height,weight,gender, stringsAsFactors = TRUE)

print(input_data)

# Test if the gender column is a factor.

print(is.factor(input_data$gender))

# Print the gender column to see the levels.

print(input_data$gender)

When we execute the above code, it produces the following result –

  height weight gender
1    132     48   male
2    151     49   male
3    162     66 female
4    139     53 female
5    166     67   male
6    147     52 female
7    122     40   male

[1] TRUE

[1] male male female female male female male

Levels: female male
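One reason a factor column is useful is that it can drive group-wise calculations. As a sketch using the same height and gender values, table() counts the rows per level and tapply() computes the mean height per level; the variable names `counts` and `mean_height` are introduced here only for illustration.

```r
height <- c(132, 151, 162, 139, 166, 147, 122)
gender <- c("male", "male", "female", "female", "male", "female", "male")

# stringsAsFactors = TRUE makes the gender column a factor in any R version.
input_data <- data.frame(height, gender, stringsAsFactors = TRUE)

# Number of rows in each gender level.
counts <- table(input_data$gender)
print(counts)

# Mean height within each gender level.
mean_height <- tapply(input_data$height, input_data$gender, mean)
print(mean_height)
```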

Changing the Order of Levels –

The order of the levels in a factor can be changed by applying the factor() function again with the new order of the levels.

data <- c("East","West","East","North","North","East","West",

"West","West","East","North")

# Create the factors.

factor_data <- factor(data)

print(factor_data)

# Apply the factor function with required order of the level.

new_order_data <- factor(factor_data,levels = c("East","West","North"))

print(new_order_data)

When we execute the above code, it produces the following result –

[1] East West East North North East West West West East North

Levels: East North West

[1] East West East North North East West West West East North

Levels: East West North
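Reordering the levels changes how the values are coded internally, not the values themselves. The sketch below uses a shortened three-element vector (an assumption made to keep the output small) and compares the labels and integer codes before and after reordering.

```r
# Shortened input vector, used only for illustration.
data <- c("East", "West", "North")

f1 <- factor(data)                                        # default alphabetical levels
f2 <- factor(f1, levels = c("East", "West", "North"))     # custom level order

# The labels seen by the user are unchanged.
print(as.character(f1))
print(as.character(f2))

# The underlying integer codes follow the level order.
print(as.integer(f1))
print(as.integer(f2))
```

The level order matters wherever R iterates over levels, for example in the ordering of table() output or of groups in a plot.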

Generating Factor Levels –

We can generate factor levels by using the gl() function. It takes two integers as input: the number of levels and the number of times each level is repeated.

Syntax

gl(n, k, labels)

Following is the description of the parameters used −

n is an integer giving the number of levels.

k is an integer giving the number of replications.

labels is a vector of labels for the resulting factor levels.

Example

v <- gl(3, 4, labels = c("Tampa", "Seattle","Boston"))

print(v)

When we execute the above code, it produces the following result –

 [1] Tampa   Tampa   Tampa   Tampa   Seattle Seattle Seattle Seattle Boston
[10] Boston  Boston  Boston
Levels: Tampa Seattle Boston
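In addition to n, k, and labels, gl() accepts a length argument that sets the total length of the result, repeating the pattern as needed. The sketch below assumes two hypothetical labels, "Low" and "High", and contrasts an alternating pattern (k = 1) with a blocked pattern (k = 3).

```r
# k = 1 with length = 6 cycles through the levels one at a time.
alternating <- gl(2, 1, length = 6, labels = c("Low", "High"))
print(alternating)

# k = 3 repeats each level three times in a block.
blocks <- gl(2, 3, labels = c("Low", "High"))
print(blocks)
```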

Tags: R Programming
