node.js - Scrape a webpage and navigate by clicking buttons
I want to perform the following actions on the server side:

1) scrape a webpage
2) simulate a click on the page, then navigate to the new page
3) scrape the new page
4) simulate button clicks on the new page
5) send the data back to the client via JSON or something similar

I am thinking of using node.js, but I am confused about which module I should use:

a) Zombie
b) node.io
c) PhantomJS
d) jsdom
e) anything else

I have installed node.io, but I am not able to run it via the command prompt.

PS: I am working on Windows 2008 Server.
Zombie.js and node.io run on jsdom, hence your options are either going with jsdom (or an equivalent wrapper), a headless browser (PhantomJS, SlimerJS), or Cheerio.
- jsdom is slow because it has to recreate the DOM and CSSOM in node.js.
- PhantomJS/SlimerJS are proper headless browsers; performance is OK and they are reliable.
- Cheerio is a lightweight alternative to jsdom. It doesn't recreate an entire page in node.js (it just downloads and parses the DOM - no JavaScript is executed). Therefore you can't click on buttons/links, but it's very fast for scraping webpages (see the sketch after this list).
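Just to illustrate the Cheerio tradeoff, here is a minimal sketch; the request and cheerio modules, the URL and the selector are assumptions for illustration. It covers step 1), but there is no way to do steps 2) and 4) with it:

var request = require('request');
var cheerio = require('cheerio');

// Download the page; Cheerio itself only parses markup
request('https://www.domain.com/page1', function (error, response, body) {
    if (error) return console.error(error);
    // Load the static HTML; no JavaScript on the page is executed
    var $ = cheerio.load(body);
    // jQuery-style selectors work for extraction
    console.log($('h1#foobar').text());
    // But there is no browser here, so clicking #button1 and
    // navigating to a new page is impossible with Cheerio
});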
Given your requirements, I'd probably go with a headless browser. In particular, I'd choose CasperJS because it has a nice and expressive API, it's fast and reliable (it doesn't need to reinvent the wheel on how to parse and render the DOM or CSS, as jsdom does), and it's very easy to interact with elements such as buttons and links.
Your workflow in CasperJS should look more or less like this:
casper.start();

casper
    .then(function () {
        console.log("Start:");
    })
    .thenOpen("https://www.domain.com/page1")
    .then(function () {
        // Scrape something
        this.echo(this.getHTML('h1#foobar'));
    })
    .thenClick("#button1")
    .then(function () {
        // Scrape something else
        this.echo(this.getHTML('h2#foobar'));
    })
    .thenClick("#button2")
    .thenOpen("http://myserver.com", {
        method: "post",
        data: {
            my: 'data'
        }
    }, function () {
        this.echo("Data sent back to the server");
    });

casper.run();
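On the receiving side, here is a minimal sketch of what "myserver.com" could run to cover step 5); Express (4.16+), the port and the /data route are my assumptions, not part of the original setup:

var express = require('express');
var app = express();

// CasperJS's thenOpen posts the data form-encoded by default
app.use(express.urlencoded({ extended: false }));

var latest = null; // last payload received from the scraper

// The CasperJS script above posts to the root of myserver.com
app.post('/', function (req, res) {
    latest = req.body;
    res.sendStatus(200);
});

// Clients can then fetch the scraped data as JSON
app.get('/data', function (req, res) {
    res.json(latest);
});

app.listen(3000);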