web crawler - Scrapy/Python issue: [count] does not collect the entire table
This is a spin-off of my previous XPath thread (someone there told me the problem is not XPath-related).
I am trying to scrape this web page: http://www.baseball-reference.com/teams/bos/2013.shtml for rank, position, name, age, etc.
Whenever I use:
    item['rank'] = stat.select('//table[@id="team_batting"]/tbody/tr/td[1]//text()')[count].extract()
    item['position'] = stat.select('//table[@id="team_batting"]/tbody/tr/td[2]//text()')[count].extract()
it gives me 5 results:
[{"position": "c", "rank": "1"}, {"position": "1b", "rank": "2"}, {"position": "2b", "rank": "3"}, {"position": "ss", "rank": "4"}, {"position": "3b", "rank": "5"}]
If I remove [count], it gives me all of the ranks and positions, but not in the correct format, and it also gives me 4 duplicate lines of the following (I condensed it to fit in here; there are 44):
{"position": ["c", "1b", "2b", "ss", "3b", "lf"], "rank": ["1", "2", "3", "4", "5", "6", "7"]}
Here is the parser code, in case it helps:
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        stats = hxs.select('//div[@class="table_container"]')
        count = 0
        items = []
        for stat in stats:
            item = mlbrefitem()
            items.append(item)
            count += 1
        return items
I am confused. When I did this tutorial (http://davidwalsh.name/python-scrape), [count] pulled all of the results. I had no issue trying it on a different website, only on the website I need.
Any help is appreciated. Thanks.
Edit: here is the actual code (ignore the comments): https://www.dropbox.com/s/rk29kotd09kioeg/mlbref_spyder.py
The code you have posted does not include the definition of the mlbrefitem class. Also, it appears that the posted code has incorrect indentation at the following line:
    items.append(item)
You are appending the item object outside the loop. This means you end up appending only the final instance of type mlbrefitem. Please try the for-loop in the code snippet below, and let me know if it works:
    for stat in stats:
        item = mlbrefitem()
        items.append(item)
        count += 1
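To see why the placement of the append call matters, here is a minimal sketch in plain Python. It does not use Scrapy: plain dictionaries are hypothetical stand-ins for mlbrefitem, and the three-entry stats list is made up for illustration.

```python
# Stand-ins (assumptions): dicts instead of mlbrefitem, strings instead of selectors.
stats = ["c", "1b", "2b"]

# append placed outside the loop: only the last item created survives
items_outside = []
for stat in stats:
    item = {"position": stat}
items_outside.append(item)

# append placed inside the loop: one item is kept per iteration
items_inside = []
for stat in stats:
    item = {"position": stat}
    items_inside.append(item)

print(items_outside)
print(items_inside)
```

The first list ends up with a single entry, the second with one entry per element of stats.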
A big edit, based on the discussion with angrywhopper.
There are multiple aspects to consider here:
There are 5 divs on the page with the class table_container. The statement stats = hxs.select('//div[@class="table_container"]') selects these 5 elements.
The source code at the reference link you provided is shown below. I have added comments to the code, beginning with my user name. You increment count (count += 1) inside the loop, so the increment happens 5 times. Hence, on each of the 5 iterations, you end up selecting a single element from the list of rows returned by the XPath selectors for position and rank. This explains why you get 5 pairs when you use the count variable within the code.
    stats = hxs.select('//div[@class="table_container"]')
    count = 0
    items = []
    for stat in stats:
        item = mlbrefitem()
        item['rank'] = stat.select('//table[@id="team_batting"]/tbody/tr/td[1]//text()').extract()
        item['position'] = stat.select('//table[@id="team_batting"]/tbody/tr/td[2]//text()').extract()
        items.append(item)
        count += 1  # prahaladdeshpande - count is incremented 5 times since it is within the loop
    return items
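The effect of [count] can be reproduced without Scrapy. In the sketch below, the five containers and the seven-entry rank list are made-up stand-ins, and select_ranks is a hypothetical helper that mimics an XPath starting with //. In Scrapy, such a path is evaluated against the whole document even when it is called on a sub-selector, so every container sees the same full list.

```python
# Stand-in data (assumption): 5 containers on the page, 7 rank cells in the table.
containers = ["div1", "div2", "div3", "div4", "div5"]
all_ranks = ["1", "2", "3", "4", "5", "6", "7"]


def select_ranks(container):
    """Hypothetical helper mimicking an absolute '//' XPath: it ignores the container."""
    return list(all_ranks)


# With [count]: one element is picked per container, so only 5 items come out.
with_count = []
count = 0
for container in containers:
    with_count.append({"rank": select_ranks(container)[count]})
    count += 1

# Without [count]: each of the 5 items holds the complete list, so 4 are duplicates.
without_count = [{"rank": select_ranks(container)} for container in containers]

print(with_count)
print(len(without_count), without_count[0])
```

Under these assumptions the number of items is tied to the number of containers, not to the number of table rows, which matches both symptoms described in the question.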
Without the count variable, you get all the rows, but not in the correct format. Here is what you have to do to make it work. I tried some experiments in the Scrapy shell, and the code snippet below should work for you:
    items = []
    stats = hxs.select('//div[@class="table_container"]')
    for stat in stats:
        rank_list = stat.select('//table[@id="team_batting"]/tbody/tr/td[1]//text()').extract()
        position_list = stat.select('//table[@id="team_batting"]/tbody/tr/td[2]//text()').extract()
        # TODO: verify that rank_list and position_list have the same length; otherwise raise an error
        for i in range(0, len(rank_list)):
            item = mlbrefitem()
            item['rank'] = rank_list[i]
            item['position'] = position_list[i]
            items.append(item)
You can then return the items variable from the method.
Hope this points you towards a solution.
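The length check left as a TODO and the index-based pairing can also be written with zip. This is only a sketch under stated assumptions: build_items is a hypothetical helper, plain dictionaries stand in for mlbrefitem, and the input lists are sample values rather than scraped data.

```python
def build_items(rank_list, position_list):
    """Pair each rank with its position, producing one dict per table row."""
    if len(rank_list) != len(position_list):
        raise ValueError(
            "rank_list has %d entries but position_list has %d"
            % (len(rank_list), len(position_list))
        )
    return [
        {"rank": rank, "position": position}
        for rank, position in zip(rank_list, position_list)
    ]


# Sample values (assumption), shaped like the lists returned by extract().
items = build_items(["1", "2", "3"], ["c", "1b", "2b"])
print(items)
```

In a spider, a helper like this could be called once on the lists extracted from the table, rather than once per container, which would likely avoid the duplicate rows.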