R - Split text string in a data.table column into new columns
I have a script that reads data from a CSV file into a data.table and then splits the text in one column into several new columns. I am currently using the lapply and strsplit functions to do this. Here's an example:
    library("data.table")
    df = data.table(prefix = c("a_b","a_c","a_d","b_a","b_c","b_d"),
                    value  = 1:6)
    dt = as.data.table(df)

    # split prefix into new columns
    dt$px = as.character(lapply(strsplit(as.character(dt$prefix), split="_"), "[", 1))
    dt$py = as.character(lapply(strsplit(as.character(dt$prefix), split="_"), "[", 2))

    dt
    #    prefix value px py
    # 1:    a_b     1  a  b
    # 2:    a_c     2  a  c
    # 3:    a_d     3  a  d
    # 4:    b_a     4  b  a
    # 5:    b_c     5  b  c
    # 6:    b_d     6  b  d
In the example above, the column prefix is split into two new columns, px and py, on the "_" character.
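To make the mechanics of that split explicit, here is a minimal sketch (not part of the original script) that looks at the list strsplit() returns and extracts each piece with the "[" function. It assumes every value of prefix contains exactly one "_"; using vapply() and fixed = TRUE is my own variation for illustration, not something the original code does.

```r
library(data.table)

# Same toy data as in the example above
dt <- data.table(prefix = c("a_b", "a_c", "a_d", "b_a", "b_c", "b_d"),
                 value  = 1:6)

# strsplit() returns a list with one character vector per row,
# e.g. "a_b" becomes c("a", "b")
parts <- strsplit(dt$prefix, split = "_", fixed = TRUE)
str(parts[1:2])

# `[` is the subsetting function, so `[`(x, 1) is the same as x[1].
# vapply() returns a character vector directly, so no as.character() is needed.
px <- vapply(parts, `[`, character(1), 1)
py <- vapply(parts, `[`, character(1), 2)

# Add both columns by reference; the string is split only once
dt[, c("px", "py") := list(px, py)]
print(dt)
```

Splitting once and reusing the result avoids calling strsplit() twice on the same column, which is what the two dt$... assignments in the example do.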
Even though this works fine, I was wondering if there is a better (more efficient) way to do this using data.table. My real datasets have 10M+ rows, so time/memory efficiency becomes important.
Update:
Following @Frank's suggestion, I created a larger test case and used the suggested commands, but stringr::str_split_fixed takes a lot longer than the original method.
    library("data.table")
    library("stringr")

    system.time({
      df = data.table(prefix = rep(c("a_b","a_c","a_d","b_a","b_c","b_d"), 1000000),
                      value  = rep(1:6, 1000000))
      dt = data.table(df)
    })
    #    user  system elapsed
    #   0.682   0.075   0.758

    system.time({
      dt[, c("px","py") := data.table(str_split_fixed(prefix,"_",2))]
    })
    #    user  system elapsed
    # 738.283   3.103 741.674

    rm(dt)
    system.time({
      df = data.table(prefix = rep(c("a_b","a_c","a_d","b_a","b_c","b_d"), 1000000),
                      value  = rep(1:6, 1000000))
      dt = as.data.table(df)
    })
    #    user  system elapsed
    #   0.123   0.000   0.123

    # split prefix into new columns
    system.time({
      dt$px = as.character(lapply(strsplit(as.character(dt$prefix), split="_"), "[", 1))
      dt$py = as.character(lapply(strsplit(as.character(dt$prefix), split="_"), "[", 2))
    })
    #    user  system elapsed
    #  33.185   0.000  33.191
So the str_split_fixed method takes about 20 times longer (roughly 742 s versus 33 s elapsed).
Update: From version 1.9.6 (on CRAN as of Sep '15), we can use the function tstrsplit() to get the results directly (and in a more efficient manner):
    require(data.table) ## v1.9.6+
    dt[, c("px", "py") := tstrsplit(prefix, "_", fixed=TRUE)]
    #    prefix value px py
    # 1:    a_b     1  a  b
    # 2:    a_c     2  a  c
    # 3:    a_d     3  a  d
    # 4:    b_a     4  b  a
    # 5:    b_c     5  b  c
    # 6:    b_d     6  b  d
tstrsplit() is a wrapper for transpose(strsplit()), where the transpose() function, also implemented in data.table, transposes a list. Please see ?tstrsplit() and ?transpose() for examples.
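As a rough illustration of that relationship, the sketch below runs both forms on the toy prefixes from the question. The variable names and the ragged input at the end are made up for this example; the padding with NA relies on the default fill argument as I understand it from the documentation.

```r
library(data.table)  # assumes v1.9.6 or later

x <- c("a_b", "a_c", "a_d", "b_a", "b_c", "b_d")

# strsplit() gives one character vector per input string
pieces <- strsplit(x, "_", fixed = TRUE)

# transpose() regroups the pieces into one vector per position
by_position <- transpose(pieces)
str(by_position)

# tstrsplit() does both steps in a single call
direct <- tstrsplit(x, "_", fixed = TRUE)
str(direct)

# With inputs of unequal length, missing pieces are padded with NA
ragged <- tstrsplit(c("a_b", "c"), "_", fixed = TRUE)
str(ragged)
```

Each element of the resulting list lines up with one new column, which is why the list can be assigned directly with := as in the answer.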
See the edit history for older answers.