R - Split text string in a data.table columns


I have a script that reads data from a CSV file into a data.table and then splits the text in one column into several new columns. I am currently using the lapply and strsplit functions to do this. Here's an example:

    library("data.table")

    df = data.table(prefix = c("a_b","a_c","a_d","b_a","b_c","b_d"),
                    value  = 1:6)
    dt = as.data.table(df)

    # split prefix into new columns
    dt$px = as.character(lapply(strsplit(as.character(dt$prefix), split="_"), "[", 1))
    dt$py = as.character(lapply(strsplit(as.character(dt$prefix), split="_"), "[", 2))

    dt
    #    prefix value px py
    # 1:    a_b     1  a  b
    # 2:    a_c     2  a  c
    # 3:    a_d     3  a  d
    # 4:    b_a     4  b  a
    # 5:    b_c     5  b  c
    # 6:    b_d     6  b  d

In the example above, the column prefix is split into two new columns, px and py, on the "_" character.

Even though this works fine, I was wondering if there is a better (more efficient) way to do this using data.table. My real datasets have 10M+ rows, so time and memory efficiency become important.


Update:

Following @frank's suggestion, I created a larger test case and used the suggested commands, but stringr::str_split_fixed takes a lot longer than the original method.

    library("data.table")
    library("stringr")

    system.time ({
        df = data.table(prefix = rep(c("a_b","a_c","a_d","b_a","b_c","b_d"), 1000000),
                        value  = rep(1:6, 1000000))
        dt = data.table(df)
    })
    #   user  system elapsed
    #  0.682   0.075   0.758

    system.time({ dt[, c("px","py") := data.table(str_split_fixed(prefix,"_",2))] })
    #    user  system elapsed
    # 738.283   3.103 741.674

    rm(dt)
    system.time ({
        df = data.table(prefix = rep(c("a_b","a_c","a_d","b_a","b_c","b_d"), 1000000),
                        value  = rep(1:6, 1000000))
        dt = as.data.table(df)
    })
    #    user  system elapsed
    #   0.123   0.000   0.123

    # split prefix into new columns
    system.time ({
        dt$px = as.character(lapply(strsplit(as.character(dt$prefix), split="_"), "[", 1))
        dt$py = as.character(lapply(strsplit(as.character(dt$prefix), split="_"), "[", 2))
    })
    #    user  system elapsed
    #  33.185   0.000  33.191

So the str_split_fixed method takes about 20 times longer.

Update: From version 1.9.6 (on CRAN as of Sep '15), we can use the function tstrsplit() to get the results directly (and in a more efficient manner):

    require(data.table) ## v1.9.6+
    dt[, c("px", "py") := tstrsplit(prefix, "_", fixed=TRUE)]
    #    prefix value px py
    # 1:    a_b     1  a  b
    # 2:    a_c     2  a  c
    # 3:    a_d     3  a  d
    # 4:    b_a     4  b  a
    # 5:    b_c     5  b  c
    # 6:    b_d     6  b  d

tstrsplit() is basically a wrapper for transpose(strsplit()), where the transpose() function, also recently implemented, transposes a list. Please see ?tstrsplit and ?transpose for examples.
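To make the wrapper relationship concrete, here is a minimal sketch that performs the two steps by hand and then with tstrsplit(). It assumes data.table v1.9.6 or later; the names parts, cols_manual and cols_direct are made up for illustration.

```r
library(data.table)  # assumes v1.9.6 or later

dt <- data.table(prefix = c("a_b", "a_c", "a_d", "b_a", "b_c", "b_d"),
                 value  = 1:6)

# Step 1: strsplit() returns a list with one character vector per row
parts <- strsplit(dt$prefix, "_", fixed = TRUE)

# Step 2: transpose() flips that list so there is one vector per output column
cols_manual <- transpose(parts)

# tstrsplit() does both steps in a single call
cols_direct <- tstrsplit(dt$prefix, "_", fixed = TRUE)

# Assign the resulting columns by reference
dt[, c("px", "py") := cols_direct]
```

Comparing cols_manual with cols_direct is a quick way to check that the two routes agree on your own data before switching over.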

See history for old answers.

