python - Multithreaded screen scraping help needed
I'm relatively new to Python, and I'm working on a screen-scraping application that gathers data from multiple financial sites. I have 4 procedures now. Two run in a couple of minutes, and the other two take hours each. Those two look up information on particular stock symbols that I have in a csv file. There are 4,000+ symbols I'm using. I know enough to know that the vast majority of the time is spent in IO on the wire. It's essential that I get these down to 1/2 hour each (or better; is that too ambitious?) for this to be of practical use to me. I'm using Python 3 and BeautifulSoup.
I have the general structure of what I'm doing below. I've abbreviated the conceptually non-essential sections. I'm reading many threads on making multiple calls/threads at once to speed things up, and it seems there are a lot of options. Can anyone point me in the right direction on which I should pursue, based on the structure of what I have so far? It'd be a huge help. I'm sure it's obvious, but this procedure gets called along with the other data download procs in a main driver module. Thanks in advance...
    from bs4 import BeautifulSoup
    import csv
    import datetime
    import urllib.request
    import urllib.error
    # import misc modules

    class StockOption:
        def __init__(self, datedownloaded, optiondata):
            self.datedownloaded = datedownloaded
            self.optiondata = optiondata

        def forcsv(self):
            return [self.datedownloaded, self.optiondata]

    def extract_options(tablerowsfrombeautifulsoup):
        optionslist = []
        for opt in range(0, len(tablerowsfrombeautifulsoup)):
            optionslist.append(StockOption(...))  # data parsed from the tablerows arg
        return optionslist

    def run_proc():
        optionsrows = []
        symbollist = ...  # read in csv file of tickers
        for symb in symbollist:
            webstr = ...  # build the connection string for this symbol
            try:
                with urllib.request.urlopen(webstr) as url:
                    page = url.read()
                soup = BeautifulSoup(page)
                if soup.text.find('There are no markets results for') == -1:
                    tbls = soup.findAll('table')
                    if len(tbls[9]) > 1:
                        expstrings = soup.findAll('td', text=True, attrs={'align': 'right'})[0].contents[0].split()
                        expdate = datetime.date(int(expstrings[6]), int(currmonth), int(expstrings[5].replace(',', '')))
                        calls = extract_options(tbls[9], symb, 'call', expdate)
                        puts = extract_options(tbls[13], symb, 'put', expdate)
                        optionsrows = optionsrows + calls
                        optionsrows = optionsrows + puts
            except urllib.error.HTTPError as err:
                if err.code == 404:
                    pass
                else:
                    raise

        opts = [0] * len(optionsrows)
        for option in range(0, len(optionsrows)):
            opts[option] = optionsrows[option].forcsv()

        # Write to csv file.
        with open('c:/optionschains.csv', 'a', newline='') as fp:
            a = csv.writer(fp, delimiter=',')
            a.writerows(opts)

    if __name__ == '__main__':
        run_proc()
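From what I've read so far, I'm guessing the shape is something like the sketch below using the standard library's concurrent.futures, but I don't know if that's the right direction. (fetch_page here is a hypothetical stand-in for my urlopen-plus-parse logic, and the URL and max_workers values are placeholders.)

    import concurrent.futures
    import urllib.error
    import urllib.request

    def fetch_page(symb):
        # Hypothetical helper: build the URL for one symbol, download the
        # raw page, and return it (BeautifulSoup parsing would go here).
        webstr = 'http://finance.example.com/options?s=' + symb  # placeholder URL
        with urllib.request.urlopen(webstr) as url:
            return url.read()

    def run_proc_threaded(symbollist, max_workers=20):
        pages = {}
        # Keep up to max_workers downloads in flight at once so the waiting
        # on the wire overlaps instead of adding up serially over 4,000 symbols.
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = {pool.submit(fetch_page, symb): symb for symb in symbollist}
            for future in concurrent.futures.as_completed(futures):
                symb = futures[future]
                try:
                    pages[symb] = future.result()
                except urllib.error.HTTPError as err:
                    if err.code != 404:  # same 404-skipping policy as run_proc
                        raise
        return pages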
There are mistakes in the abbreviated code you have given, so it is a little hard to understand your code. If you show more code and check it, it will be easier to understand your problem.
From your code and the problem description, I have some advice to share with you:
In the run_proc() function, you read a webpage for every symbol. If the urls are the same or some urls are repeated, how about reading the webpages only one time and writing them to memory or to disk, then analyzing the page contents for every symbol? It saves the repeated downloads.
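As a rough sketch of that idea (fetch_page_cached is a hypothetical name, not from your code), a small in-memory cache keyed by URL could look like this:

    import urllib.request

    _page_cache = {}  # url -> raw page bytes

    def fetch_page_cached(webstr):
        # Download each distinct URL only once; repeated requests for the
        # same URL come back from memory instead of going over the wire again.
        if webstr not in _page_cache:
            with urllib.request.urlopen(webstr) as url:
                _page_cache[webstr] = url.read()
        return _page_cache[webstr]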
BeautifulSoup is easy to write code with, but it is a little slow in performance. If lxml can do your work, it will save a lot of time on analyzing the webpage contents.
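Assuming lxml is installed, the change can be as small as passing it as the soup's parser, or you can use lxml directly for the hot path; page here is the raw bytes you already download in run_proc():

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(page, 'lxml')   # same soup API, lxml does the parsing

    # or skip BeautifulSoup for the hot path entirely:
    import lxml.html
    tree = lxml.html.fromstring(page)
    tbls = tree.findall('.//table')      # roughly equivalent to soup.findAll('table')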
Hope this helps.