java - Scanner's nextLine(), Only fetching partial -


So, I'm using it like this:

    for (int i = 0; i < files.length; i++) {
        // Only process readable regular files
        if (!files[i].isDirectory() && files[i].canRead()) {
            try {
                Scanner scan = new Scanner(files[i]);
                System.out.println("generating categories " + files[i].toPath());
                while (scan.hasNextLine()) {
                    count++;
                    String line = scan.nextLine();
                    System.out.println("  ->" + line);
                    // Each line is "<key>\t<json>"; keep only the JSON part
                    line = line.split("\t", 2)[1];
                    System.out.println("!- " + line);
                    JsonParser parser = new JsonParser();
                    JsonObject object = parser.parse(line).getAsJsonObject();
                    Set<Entry<String, JsonElement>> entrySet = object.entrySet();
                    exploreSet(entrySet);
                }
                scan.close();
                // System.out.println(keySet);
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            }
        }
    }

As this runs over a Hadoop output file, one of the JSON objects in the middle is breaking, because scan.nextLine() is not fetching the whole line before it is handed to split. That is, the output is:

  ->0   {"flags":"0","transactions":{"totaltransactionamount":"0","totalquantitysold":"0"},"listingstatus":"null","conditionrollupid":"0","photodisplaytype":"0","title":"null","quantityavailable":"0","viewitemcount":"0","visitcount":"0","itemcountryid":"0","itemaspects":{   ...  "sellersiteid":"0","siteid":"0","pictureurl":"http://somewhere.com/45/x/alphanumeric/$(kgrhqr,!rgf!6n5wjstbqo-g4k(ww~~
  !- {"flags":"0","transactions":{"totaltransactionamount":"0","totalquantitysold":"0"},"listingstatus":"null","conditionrollupid":"0","photodisplaytype":"0","title":"null","quantityavailable":"0","viewitemcount":"0","visitcount":"0","itemcountryid":"0","itemaspects":{   ...  "sellersiteid":"0","siteid":"0","pictureurl":"http://somewhere.com/45/x/alphanumeric/$(kgrhqr,!rgf!6n5wjstbqo-g4k(ww~~

Most of the above data has been sanitized (not the URL, for the most part, however).

In the file, the URL continues as: $(kgrhqzhjcgfbso4dc3mbqdc2)y4tg~~60_1.jpg?set_id=8800005007

It's so miffing.

This is entry #112, and I have had other files parse without errors. This one is screwing with my mind, because I don't see how scan.nextLine() isn't working.

Going by the debug output, the JSON error is caused by the string not being split properly.

And I forgot: it works fine if I put the offending line in its own file and parse just that.

Edit: it also blows up in the same place if I remove the offending line.

I have attempted this with JVM 1.6 and 1.7.


Workaround solution: BufferedReader scan = new BufferedReader(new FileReader(files[i])); instead of Scanner.
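A minimal sketch of what that workaround might look like as a full read loop. The class name ReadLineWorkaround and the helper payloads are hypothetical, the JSON parsing step is left out, and UTF-8 is an assumption about the file's encoding. BufferedReader.readLine() only treats \n, \r and \r\n as line terminators.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original code.
class ReadLineWorkaround {

    // Reads "<key>\t<json>" lines and returns the part after the first tab.
    // Lines without a tab are skipped instead of throwing on index [1].
    static List<String> payloads(BufferedReader reader) throws IOException {
        List<String> result = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                result.add(parts[1]);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: args[0] is the path of one Hadoop output file, encoded as UTF-8.
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            for (String json : payloads(reader)) {
                System.out.println("!- " + json);
            }
        }
    }
}
```

Each returned string could then be passed to the JSON parser in place of the line obtained from Scanner.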

Based on your code, the best explanation I can come up with is that the line really does end after the "~~" according to the criteria used by Scanner.nextLine().

The criteria for end-of-line are:

  • Something that matches this regex: "\r\n|[\n\r\u2028\u2029\u0085]" or
  • The end of the input stream

You say that the file continues after the "~~", so let's put EOF aside and look at the regex. It will match any of the following:

The usual line separators:

  • <CR>
  • <NL>
  • <CR><NL>

... and three unusual forms of line separator that Scanner also recognizes:

  • 0x0085 is the <NEL> or "next line" control code in the "ISO C1 Control" group
  • 0x2028 is the Unicode "line separator" character
  • 0x2029 is the Unicode "paragraph separator" character

My theory is that you've got one of the "unusual" forms in your input file, and that it is not showing up in whatever tool you are using to examine the files.
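To illustrate the theory, here is a small sketch that feeds the same text to Scanner and to BufferedReader. The class name LineSeparatorDemo and the sample text are made up for illustration; the U+2028 after the "~~" is the hypothesized culprit, not something confirmed from your file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class; the sample text below is invented.
class LineSeparatorDemo {

    // Splits text the way Scanner.nextLine() does.
    static List<String> scannerLines(String text) {
        List<String> lines = new ArrayList<>();
        Scanner scan = new Scanner(text);
        while (scan.hasNextLine()) {
            lines.add(scan.nextLine());
        }
        scan.close();
        return lines;
    }

    // Splits text the way BufferedReader.readLine() does (\n, \r, \r\n only).
    static List<String> readerLines(String text) throws IOException {
        List<String> lines = new ArrayList<>();
        BufferedReader reader = new BufferedReader(new StringReader(text));
        String line;
        while ((line = reader.readLine()) != null) {
            lines.add(line);
        }
        reader.close();
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // One logical record with a U+2028 hidden in the middle of the URL.
        String text = "0\t{\"pictureurl\":\"abc~~\u2028rest.jpg\"}\n";
        System.out.println(scannerLines(text).size()); // prints 2
        System.out.println(readerLines(text).size());  // prints 1
    }
}
```

If this is what is happening, it would also explain why switching to BufferedReader makes the problem go away.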


I suggest that you examine the input file using a tool that can show you the actual bytes of the file; e.g. the od utility on a Linux / UNIX system. Also, check that this isn't caused by some kind of character encoding mismatch, or by trying to read or write binary data as text.

If these don't help, the next step should be to run your application using your IDE's Java debugger, and single-step through the Scanner.hasNextLine() and nextLine() calls to find out what the code is doing.
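If od is not to hand, a rough Java equivalent for this particular check is sketched below. The class name FindUnusualSeparators and its find method are hypothetical, and decoding the file as UTF-8 is an assumption. Note that new Scanner(File) decodes with the platform default charset unless you pass one explicitly, so a file that looks clean here could still be decoded differently by your program.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic class, not part of the original code.
class FindUnusualSeparators {

    // Reports every U+0085, U+2028 or U+2029 with its character offset.
    static List<String> find(Reader reader) throws IOException {
        List<String> hits = new ArrayList<>();
        long offset = 0;
        int c;
        while ((c = reader.read()) != -1) {
            if (c == 0x0085 || c == 0x2028 || c == 0x2029) {
                hits.add(String.format("U+%04X at char offset %d", c, offset));
            }
            offset++;
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: args[0] is the suspect file and it is encoded as UTF-8.
        // A MalformedInputException here would itself hint at an encoding mismatch.
        try (Reader reader =
                 Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            for (String hit : find(reader)) {
                System.out.println(hit);
            }
        }
    }
}
```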


And I forgot: it works fine if I put the offending line in its own file and parse just that.

That's interesting. But if the tool you are using to extract the line is the same one that is not showing the (hypothesized) unusual line separator, then this evidence is not reliable. The process of extraction may be altering the "stuff" that is causing the problems.

