Removing non-English text from Corpus in R using tm()


I am using tm() and wordcloud() for some basic data mining in R, but I am running into difficulties because there are non-English characters in my dataset (even though I've tried to filter out other languages based on background variables).

Let's say some of the lines in my txt file (saved as UTF-8 in TextWrangler) look like this:

special satisfação happy sad potential für 

I then read my txt file into R:

words <- Corpus(DirSource("~/temp", encoding = "UTF-8"), readerControl = list(language = "lat"))

This yields the warning message:

Warning message:
In readLines(y, encoding = x$encoding) :
  incomplete final line found on '/temp/file.txt'

But since it's a warning, not an error, I continue to push forward.
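As a side note on that warning, here is a minimal sketch in base R of where it usually comes from: a file whose last line has no trailing newline. The temporary file and its contents are made up for illustration; the lines themselves are still read in full.

```r
# Write a small file WITHOUT a trailing newline (illustrative data only)
tmp <- tempfile(fileext = ".txt")
cat("happy\nsad", file = tmp)

# Capture the warning message that readLines() raises for such a file
msg <- tryCatch(readLines(tmp), warning = function(w) conditionMessage(w))

# The content is still read completely; warn = FALSE just silences the message
lines <- readLines(tmp, warn = FALSE)
print(lines)
```

Adding a final newline to the file in the editor should make the warning go away without changing what is read.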

words <- tm_map(words, stripWhitespace)
words <- tm_map(words, tolower)

This yields the error:

Error in FUN(X[[1L]], ...) : invalid input 'satisfa��o' in 'utf8towcs'
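One possible reading of this error (an assumption, since the post does not confirm the cause) is that some bytes reaching tolower() are not valid UTF-8. A minimal base R sketch for locating such strings with validUTF8() follows; the example vector is invented, and the bad entry is built from Latin-1 bytes on purpose.

```r
# Bytes of "satisfação" in Latin-1; as a byte sequence this is not valid UTF-8
bad <- rawToChar(as.raw(c(0x73, 0x61, 0x74, 0x69, 0x73, 0x66,
                          0x61, 0xe7, 0xe3, 0x6f)))
lines <- c("special", bad, "happy")

# validUTF8() returns one logical per element: TRUE if the bytes are valid UTF-8
ok <- validUTF8(lines)

# Positions of the entries that would need re-encoding or removal
print(which(!ok))
```

Entries flagged this way can then be inspected, re-encoded, or dropped before any tm_map() step is applied.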

I'm open to finding ways to filter out the non-English characters either in TextWrangler or R, whichever is most expedient. Thanks for any help!

Here's a method to remove words with non-ASCII characters before making the corpus:

# remove words with non-ASCII characters
# assuming you read your txt file in as a vector, eg.
# dat <- readLines('~/temp/dat.txt')
dat <- "special,  satisfação, happy, sad, potential, für"
# convert string to vector of words
dat2 <- unlist(strsplit(dat, split=", "))
# find indices of words with non-ASCII characters
# (iconv replaces every non-convertible byte with the marker "dat2", which grep then finds)
dat3 <- grep("dat2", iconv(dat2, "latin1", "ASCII", sub="dat2"))
# subset original vector of words to exclude words with non-ASCII char
dat4 <- dat2[-dat3]
# convert vector back to a string
dat5 <- paste(dat4, collapse = ", ")
# make corpus
require(tm)
words1 <- Corpus(VectorSource(dat5))
inspect(words1)

A corpus with 1 text document

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID

[[1]]
special, happy, sad, potential
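One thing to watch in the method above: if no word contains non-ASCII characters, dat3 is empty and dat2[-dat3] returns an empty vector rather than all the words. A variation that avoids this by using a logical filter is sketched below. The helper is_ascii is a hypothetical name introduced here, not part of tm, and it checks raw bytes instead of calling iconv.

```r
# Hypothetical helper (not from tm): TRUE when a string contains only ASCII bytes
is_ascii <- function(x) {
  vapply(
    x,
    function(w) all(as.integer(charToRaw(w)) < 128L),
    logical(1),
    USE.NAMES = FALSE
  )
}

# Same sample data as above, written with \u escapes so the script itself is ASCII
dat <- "special,  satisfa\u00e7\u00e3o, happy, sad, potential, f\u00fcr"

# Split on commas and trim the surrounding whitespace of each word
words <- trimws(unlist(strsplit(dat, split = ",")))

# Logical indexing keeps every word when nothing needs to be removed
kept <- words[is_ascii(words)]
print(paste(kept, collapse = ", "))
```

The resulting string can be passed to Corpus(VectorSource(...)) in the same way as dat5.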
