I often come across a coded, or categorized, dataset where a particular category's values are all stuffed into a single column. Although there may be reasons for this type of data structure, it does not lend itself nicely to analysis in R.
Here is a simplified example, tracking the sport interests of four individuals:
# data:
ID | Name | Interests
-----------------------
1 | Jon | Football
2 | Sue | Soccer,Basketball
3 | Beth | Football,Baseball,Soccer
4 | Steve | Soccer
As you can see, this format can quickly pose a problem. What if we wanted to know how many people liked Soccer?
We could string match on the column values:
soccer = grep("Soccer", data$Interests) # c(2, 3, 4)
However, depending on the search string, this can break very easily. For example, imagine if instead this was our original data:
# data:
ID | Name | Interests
-----------------------
1 | Jon | American Football
2 | Sue | Football,Basketball
3 | Beth | American Football,Baseball,Football
4 | Steve | Football
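To follow along, this dataset can be built as a data.frame (the variable name data and the stringsAsFactors setting are my own choices, not part of the original):

```r
# Rebuild the example data; stringsAsFactors=FALSE keeps
# Interests as plain character rather than factor
data = data.frame(
  ID = 1:4,
  Name = c("Jon", "Sue", "Beth", "Steve"),
  Interests = c("American Football",
                "Football,Basketball",
                "American Football,Baseball,Football",
                "Football"),
  stringsAsFactors = FALSE
)
```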
So, if we were now interested in the number of people who like Football, this would give us:
football = grep("Football", data$Interests) # c(1, 2, 3, 4)
Notice that we now get a positive match on the first row: since Jon is interested in American Football, our regular expression matches both sports.
We could use a more sophisticated pattern, but across larger datasets you won't always know what problems to look for, which makes these regular expressions difficult and time-consuming to maintain.
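For this small example, one such workaround (my own sketch, not part of the original workflow) is to anchor the pattern so that Football only matches as a complete comma-separated entry:

```r
# Match "Football" only when bounded by the start/end of the
# string or a comma, so "American Football" does not match
football = grep("(^|,)Football($|,)", data$Interests) # c(2, 3, 4)
```

This still breaks down if entries contain stray spaces after commas, which is why restructuring the data is the more robust fix.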
Instead, what if we just created individual columns for the unique values from the Interests column? The resulting data would look like this:
# data:
ID | Name | American Football | Football | Basketball | Baseball
-----------------------------------------------------------------------
1 | Jon | 1 | 0 | 0 | 0
2 | Sue | 0 | 1 | 1 | 0
3 | Beth | 1 | 1 | 0 | 1
4 | Steve | 0 | 1 | 0 | 0
This format makes things much easier. We can find all Football fans (1), or quickly find the sport with the most interest (2):
# 1: Find football fans
football = data[data$Football == 1,]
# 2:
sports = c("American Football","Football","Basketball","Baseball")
sport_sums = colSums(data[,sports])
sports[which.max(sport_sums)] # "Football": which.max(sport_sums) == 2
Now that we've seen some of the benefits of translating the initial Interests column, let's see how to do it.
Step 1: Unique Values
To get the values to be used for the new columns, we will first split all of the values in the Interests column, and then reduce them to the unique values.
interests = as.character(data$Interests)
split_interests = sapply(interests, strsplit, ",", USE.NAMES=FALSE) # USE.NAMES isn't required, but makes the output easier to read
unique_interests = unique(unlist(split_interests)) # c("American Football", "Football", "Basketball", "Baseball")
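As a side note (my own simplification, not part of the steps above), strsplit is already vectorized over its input, so the sapply wrapper can be dropped entirely:

```r
# strsplit returns one character vector per row on its own
split_interests = strsplit(as.character(data$Interests), ",")
unique_interests = unique(unlist(split_interests))
# c("American Football", "Football", "Basketball", "Baseball")
```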
Step 2: Create the Columns
unique_interests now contains the names for our new columns, which we can append to our data.frame, setting all initial values to 0.
Note: This step is only needed for Method 1 in Step 3; I'm including it here mostly for illustrative purposes.
data[,unique_interests] = 0
# data, Interests column truncated for better formatting
ID Name Interests American Football Football Basketball Baseball
1 Jon ... 0 0 0 0
2 Sue ... 0 0 0 0
3 Beth ... 0 0 0 0
4 Steve ... 0 0 0 0
Step 3: Populate Each Row
Now it's a matter of populating each row according to its respective values in the Interests column. If we take a look at split_interests from above, we have a list of vectors representing the per-row interests:
# split_interests
[[1]] # Jon
[1] "American Football"
[[2]] # Sue
[1] "Football" "Basketball"
[[3]] # Beth
[1] "American Football" "Baseball" "Football"
[[4]] # Steve
[1] "Football"
There are many methods that could be used here. Two of them are to 1.) iterate split_interests, find the corresponding new columns in data, and set them to 1, or 2.) convert our list to a matrix and append it to data.
Method 1
Using a simple loop, we can iterate the interests and set each value to 1.
Warning: This is NOT the preferred approach. I'm including this method more as a means to illustrate the process than as an example to implement; the reason will become clear after we examine both methods.
for (row in seq_along(split_interests)) {
data[row, split_interests[[row]]] = 1
}
Method 2
Create an (N, M) matrix, where N == nrow(data) and M == length(unique_interests). We will then append this matrix to our original data.
An easy way to do this is to use sapply, giving it our split_interests list and a function in which current_interests represents the interest values for the given row (or entry in the list).
interests_per_row = sapply(split_interests, function(current_interests) {
# Using the full vector of new columns, unique_interests,
# we see which are in the current row's interest vector.
# This will return a logical vector with a length equal
# to length(unique_interests)
current = (unique_interests %in% current_interests)
# multiply the logical vector by 1 and return. The
# multiplication will convert to a numeric vector,
# with FALSE == 0, and TRUE == 1
return(1 * current)
})
# interests_per_row
[,1] [,2] [,3] [,4]
[1,] 1 0 1 0
[2,] 0 1 1 1
[3,] 0 1 0 0
[4,] 0 0 1 0
However, there is a catch, and we can't append this to the data quite yet: it's actually in the wrong format. The rows from our original data are represented here as the columns, and the interests are the rows. There's an easy fix, though:
# transpose the matrix using t()
interests_per_row = t(interests_per_row)
[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 0 1 1 0
[3,] 1 1 0 1
[4,] 0 1 0 0
Now we can append this to our original data.
Note: When following this method, Step 2 becomes unnecessary.
data[,unique_interests] = interests_per_row
# data, Interests column truncated for better formatting
ID Name Interests American Football Football Basketball Baseball
1 Jon ... 1 0 0 0
2 Sue ... 0 1 1 0
3 Beth ... 1 1 0 1
4 Steve ... 0 1 0 0
Step 4: Why Method 2?
Although the first method may seem more straightforward, depending on your background, it is in fact much slower. The problem really shows itself as the dataset gets larger, but even with a dataset of only four rows, it's noticeably slower:
> microbenchmark::microbenchmark(method1(), method2(), times=1000)
Unit: microseconds
expr min lq mean median uq max neval cld
method1() 184.157 198.042 220.934 205.239 220.66 1608.19 1000 b
method2() 149.452 157.264 179.171 164.378 175.62 1568.67 1000 a
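The method1 and method2 functions being benchmarked aren't defined above; a plausible sketch, simply wrapping the two approaches from Step 3 (my own reconstruction, not the code actually timed), might look like:

```r
method1 = function() {
  d = data
  d[, unique_interests] = 0
  # Loop over rows, setting matching interest columns to 1
  for (row in seq_along(split_interests)) {
    d[row, split_interests[[row]]] = 1
  }
  d
}

method2 = function() {
  d = data
  # Build the full indicator matrix in one pass, then transpose
  m = t(sapply(split_interests, function(current_interests) {
    1 * (unique_interests %in% current_interests)
  }))
  d[, unique_interests] = m
  d
}
```

The key difference is that method1 performs a data.frame assignment per row, while method2 does a single vectorized assignment, which is what the benchmark rewards.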