Removing non-English text from Corpus in R using tm()
I am using tm() and wordcloud() for some basic data mining in R, but I am running into difficulties because there are non-English characters in my dataset (even though I've tried to filter out other languages based on background variables).
Let's say that some of the lines in my TXT file (saved as UTF-8 in TextWrangler) look like this:
special satisfação happy sad potential für
I then read my TXT file into R:
words <- Corpus(DirSource("~/temp", encoding = "UTF-8"), readerControl = list(language = "lat"))
This yields the warning message:
Warning message:
In readLines(y, encoding = x$encoding) : incomplete final line found on '/temp/file.txt'
But since it's a warning, not an error, I continue to push forward.
words <- tm_map(words, stripWhitespace)
words <- tm_map(words, tolower)
This then yields the error:
Error in FUN(X[[1L]], ...) : invalid input 'satisfa��o' in 'utf8towcs'
I'm open to finding ways to filter out the non-English characters either in TextWrangler or in R, whichever is most expedient. Thanks for any help!
Here's a method to remove words with non-ASCII characters before making a corpus:
# remove words with non-ASCII characters
# assuming you read your txt file in as a vector, e.g.
# dat <- readLines('~/temp/dat.txt')
dat <- "special, satisfação, happy, sad, potential, für"
# convert the string to a vector of words
dat2 <- unlist(strsplit(dat, split = ", "))
# find the indices of words with non-ASCII characters:
# iconv() replaces every non-ASCII byte with the marker "dat2", which grep() then finds
dat3 <- grep("dat2", iconv(dat2, "latin1", "ASCII", sub = "dat2"))
# subset the original vector of words to exclude words with non-ASCII characters
dat4 <- dat2[-dat3]
# convert the vector back to a string
dat5 <- paste(dat4, collapse = ", ")
# make the corpus
require(tm)
words1 <- Corpus(VectorSource(dat5))
inspect(words1)

A corpus with 1 text document

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID

[[1]]
special, happy, sad, potential
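One edge case to watch for in the snippet above: if no word contains a non-ASCII character, grep() returns integer(0), and dat2[-dat3] then returns an empty vector instead of all the words. A minimal base-R sketch that avoids this with logical indexing follows. The helper name is_ascii is hypothetical (it is not part of tm), and checking code points with utf8ToInt() is a swapped-in technique rather than the iconv() marker trick used above.

```r
# Hypothetical helper (not part of tm): TRUE for strings that contain
# only ASCII characters (code points 0 to 127).
is_ascii <- function(x) {
  vapply(
    enc2utf8(x),                            # make sure the input is UTF-8
    function(s) all(utf8ToInt(s) <= 127L),  # inspect every code point
    logical(1),
    USE.NAMES = FALSE
  )
}

# Non-ASCII letters are written as \u escapes so that the script does not
# depend on the encoding of the source file.
dat <- "special, satisfa\u00e7\u00e3o, happy, sad, potential, f\u00fcr"
tokens <- unlist(strsplit(dat, split = ", ", fixed = TRUE))

# Logical indexing keeps every word when nothing is flagged.
kept <- tokens[is_ascii(tokens)]
cleaned <- paste(kept, collapse = ", ")
print(cleaned)  # "special, happy, sad, potential"

# Edge case: a vector with no non-ASCII words is returned unchanged.
all_ascii <- c("happy", "sad")
print(all_ascii[is_ascii(all_ascii)])  # "happy" "sad"
```

The cleaned string can then be passed to Corpus(VectorSource(cleaned)) exactly as dat5 is above.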