web crawler - Scrapy/Python Issue: [count] does not collect entire table -

- January 15, 2012

This is a spin-off of my previous XPath thread (someone told me it's not XPath related).

I'm trying to scrape this web page: http://www.baseball-reference.com/teams/bos/2013.shtml for rank, position, name, age, etc.

Whenever I use:

item['rank'] = stat.select('//table[@id="team_batting"]/tbody/tr/td[1]//text()')[count].extract()
item['position'] = stat.select('//table[@id="team_batting"]/tbody/tr/td[2]//text()')[count].extract()

it gives me only 5 results:

[{"position": "c", "rank": "1"}, {"position": "1b", "rank": "2"}, {"position": "2b", "rank": "3"}, {"position": "ss", "rank": "4"}, {"position": "3b", "rank": "5"}]

If I remove [count], it gives me all of the ranks and positions, but not in the correct format, and it also gives me 4 duplicate lines of the following (I condensed it to fit in here; there are 44):

{"position": ["c", "1b", "2b", "ss", "3b", "lf"], "rank": ["1", "2", "3", "4", "5", "6", "7"]}

Here is the parser code, in case it helps:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    stats = hxs.select('//div[@class="table_container"]')
    count = 0
    items = []
    for stat in stats:
        item = mlbrefitem()
        items.append(item)
        count += 1
    return items

I'm confused. When I did this tutorial: http://davidwalsh.name/python-scrape, [count] pulled all of the results. I had no issue trying it on a different website. This is the website I need.

Any help is appreciated. Thanks.

Edit: here is the actual code (ignore the comments): https://www.dropbox.com/s/rk29kotd09kioeg/mlbref_spyder.py

The code you have posted does not include the definition of the mlbrefitem class. Also, it appears that the posted code has incorrect indentation at the following line:

items.append(item)

You are appending the item object outside the loop. This means you end up appending only the final instance of type mlbrefitem. Please try the code snippet below for the for-loop, and let me know if it works:

for stat in stats:
    item = mlbrefitem()
    items.append(item)
    count += 1
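The effect of that indentation can be checked in isolation. The sketch below is only an illustration: it uses plain dicts as stand-ins for mlbrefitem instances, and the function names are made up for this example.

```python
def append_outside_loop(count):
    """Build one item per iteration but append only after the loop ends."""
    items = []
    item = None
    for index in range(count):
        item = {"index": index}  # stand-in for mlbrefitem()
    if item is not None:
        items.append(item)  # runs once, so only the last item survives
    return items


def append_inside_loop(count):
    """Build and append one item per iteration."""
    items = []
    for index in range(count):
        item = {"index": index}  # stand-in for mlbrefitem()
        items.append(item)  # runs on every iteration
    return items


outside = append_outside_loop(5)
inside = append_inside_loop(5)
print(len(outside), outside)
print(len(inside))
```

With five iterations, the first function returns a single item (the last one built), while the second returns all five.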

A big edit based on the discussion with angrywhopper:

There are multiple aspects to consider here:

There are 5 divs on the page with the class table_container, so the statement stats = hxs.select('//div[@class="table_container"]') selects these 5 elements.

The source code at the reference link you provided is shown below. I have added a comment to the code, beginning with my user name, at the point where you increment count (count += 1). Since this is within the loop, it happens 5 times. Hence, on each of the 5 iterations, you end up selecting a single element from the list of rows returned by the XPath selectors for position and rank. This explains why you get 5 pairs when you use the count variable within the code.

stats = hxs.select('//div[@class="table_container"]')
count = 0
items = []
for stat in stats:
    item = mlbrefitem()
    item['rank'] = stat.select('//table[@id="team_batting"]/tbody/tr/td[1]//text()').extract()
    item['position'] = stat.select('//table[@id="team_batting"]/tbody/tr/td[2]//text()').extract()
    items.append(item)
    count += 1  # prahaladdeshpande - the count variable is incremented 5 times since it is within the loop
return items
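To make both symptoms concrete, here is a small stand-alone simulation that uses plain Python lists in place of the selector results. It assumes what the thread reports: 44 table rows and 5 table_container divs. It also relies on the fact that an XPath starting with // searches the whole document even when it is called on stat, so every container gets the same full lists. Only the first six positions are quoted in the question, so the rest are placeholders.

```python
ROWS = 44        # number of table rows reported in the question
CONTAINERS = 5   # number of div.table_container elements on the page

# Stand-ins for what the two XPath queries return on every iteration.
rank_list = [str(n) for n in range(1, ROWS + 1)]
position_list = ["c", "1b", "2b", "ss", "3b", "lf"] + ["?"] * (ROWS - 6)


def with_count():
    """Index both lists with a counter that advances once per container."""
    items = []
    count = 0
    for _ in range(CONTAINERS):
        items.append({"rank": rank_list[count], "position": position_list[count]})
        count += 1
    return items


def without_count():
    """Store the full lists on every item, once per container."""
    return [
        {"rank": list(rank_list), "position": list(position_list)}
        for _ in range(CONTAINERS)
    ]


indexed = with_count()
full = without_count()
print(len(indexed), indexed[-1])
print(len(full), len(full[0]["rank"]))
```

The first call stops at rank 5 because the counter only advances once per container. The second produces five identical items, that is, one result plus four duplicates.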

Without the count variable, you get all the rows, but not in the correct format. Well, here you have to do some work. I tried some experiments in the Scrapy terminal, and the code snippet below should work for you:

items = []
stats = hxs.select('//div[@class="table_container"]')
for stat in stats:
    rank_list = stat.select('//table[@id="team_batting"]/tbody/tr/td[1]//text()').extract()
    position_list = stat.select('//table[@id="team_batting"]/tbody/tr/td[2]//text()').extract()
    # TODO: verify that rank_list and position_list have the same length; otherwise raise an error
    for i in range(0, len(rank_list)):
        item = mlbrefitem()
        item['rank'] = rank_list[i]
        item['position'] = position_list[i]
        items.append(item)

You can then return the items variable from the method.
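An alternative to pairing two separate lists is to loop over the table rows and read both cells from the same row, which keeps rank and position together without a length check. The sketch below is not the Scrapy code from this thread: it uses the standard library's xml.etree.ElementTree on a tiny hand-written table so that it runs on its own. The sample markup and the helper names are invented for illustration and only mirror the team_batting id and the tbody/tr/td layout from the XPath expressions above.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, NOT the real page.
SAMPLE = """
<div class="table_container">
  <table id="team_batting">
    <tbody>
      <tr><td>1</td><td><strong>c</strong></td></tr>
      <tr><td>2</td><td>1b</td></tr>
      <tr><td>3</td><td>2b</td></tr>
    </tbody>
  </table>
</div>
"""


def cell_text(cell):
    """Join all text inside a cell, similar in spirit to td//text()."""
    return "".join(cell.itertext()).strip()


def parse_rows(markup):
    """Return one dict per table row, reading both cells from that row."""
    root = ET.fromstring(markup.strip())
    items = []
    for row in root.findall(".//table[@id='team_batting']/tbody/tr"):
        cells = row.findall("td")
        if len(cells) < 2:
            continue  # skip rows that lack a rank or position cell
        items.append({"rank": cell_text(cells[0]), "position": cell_text(cells[1])})
    return items


items = parse_rows(SAMPLE)
print(items)
```

In the spider, the same idea would be to select the tr elements once and then query each row with a path relative to that row (for example td[1]//text() without a leading //). I have not run that against the live page, so treat it as an assumption to verify.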

I hope this points you towards a solution.
