python - Multithreaded screen scraping help needed
I'm relatively new to Python, and I'm working on a screen-scraping application that gathers data from multiple financial sites. I have 4 procedures now. Two run in a couple of minutes, and the other two take hours each. Those two look up information on particular stock symbols that I have in a csv file. There are 4,000+ symbols I'm using. I know enough to know that the vast majority of the time is spent in IO on the wire. It's essential that I get these down to 1/2 hour each (or better; is that too ambitious?) for this to be of practical use to me. I'm using Python 3 and BeautifulSoup.
I have the general structure of what I'm doing below. I've abbreviated the conceptually non-essential sections. I'm reading many threads on making multiple calls/threads at once to speed things up, and it seems there are a lot of options. Can anyone point me in the right direction on which I should pursue, based on the structure of what I have so far? It'd be a huge help. I'm sure it's obvious, but this procedure gets called along with the other data download procs in a main driver module. Thanks in advance...
    from bs4 import BeautifulSoup
    import csv
    import datetime
    import urllib.request
    import urllib.error
    # import misc modules

    class StockOption:
        def __init__(self, datedownloaded, optiondata):
            self.datedownloaded = datedownloaded
            self.optiondata = optiondata

        def forcsv(self):
            return [self.datedownloaded, self.optiondata]

    def extract_options(tablerowsfrombeautifulsoup):
        optionslist = []
        for opt in range(0, len(tablerowsfrombeautifulsoup)):
            optionslist.append(StockOption(...))  # data parsed from the tablerows arg
        return optionslist

    def run_proc():
        optionsrows = []
        symbollist = ...  # read in csv file of tickers
        for symb in symbollist:
            webstr = ...  # build the connection string for this symbol
            try:
                with urllib.request.urlopen(webstr) as url:
                    page = url.read()
                soup = BeautifulSoup(page)
                if soup.text.find('There are no markets results for') == -1:
                    tbls = soup.findAll('table')
                    if len(tbls[9]) > 1:
                        expstrings = soup.findAll('td', text=True, attrs={'align': 'right'})[0].contents[0].split()
                        expdate = datetime.date(int(expstrings[6]), int(currmonth), int(expstrings[5].replace(',', '')))
                        calls = extract_options(tbls[9], symb, 'call', expdate)
                        puts = extract_options(tbls[13], symb, 'put', expdate)
                        optionsrows = optionsrows + calls
                        optionsrows = optionsrows + puts
            except urllib.error.HTTPError as err:
                if err.code == 404:
                    pass
                else:
                    raise

        opts = [0] * len(optionsrows)
        for option in range(0, len(optionsrows)):
            opts[option] = optionsrows[option].forcsv()

        # Write to csv file.
        with open('c:/optionschains.csv', 'a', newline='') as fp:
            a = csv.writer(fp, delimiter=',')
            a.writerows(opts)

    if __name__ == '__main__':
        run_proc()
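From what I've read so far, I'm guessing the shape is something like the sketch below using the standard library's concurrent.futures, but I don't know if that's the right direction. (fetch_page here is a hypothetical stand-in for my urlopen-plus-parse logic, and the URL and max_workers values are placeholders.)

    import concurrent.futures
    import urllib.error
    import urllib.request

    def fetch_page(symb):
        # Hypothetical helper: build the URL for one symbol, download the
        # raw page, and return it (BeautifulSoup parsing would go here).
        webstr = 'http://finance.example.com/options?s=' + symb  # placeholder URL
        with urllib.request.urlopen(webstr) as url:
            return url.read()

    def run_proc_threaded(symbollist, max_workers=20):
        pages = {}
        # Keep up to max_workers downloads in flight at once so the waiting
        # on the wire overlaps instead of adding up serially over 4,000 symbols.
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = {pool.submit(fetch_page, symb): symb for symb in symbollist}
            for future in concurrent.futures.as_completed(futures):
                symb = futures[future]
                try:
                    pages[symb] = future.result()
                except urllib.error.HTTPError as err:
                    if err.code != 404:  # same 404-skipping policy as run_proc
                        raise
        return pages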
There are mistakes in the abbreviated code you have given, so it is a little hard to understand your code. If you show more code and check it, it will be easier to understand your problem.
From your code and the problem description, I have some advice to share with you:
In the run_proc() function, you read a webpage for every symbol. If the urls are the same or some urls are repeated, how about reading the webpages only one time and writing them to memory or to disk, then analyzing the page contents for every symbol? It saves the repeated downloads.
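As a rough sketch of that idea (fetch_page_cached is a hypothetical name, not from your code), a small in-memory cache keyed by URL could look like this:

    import urllib.request

    _page_cache = {}  # url -> raw page bytes

    def fetch_page_cached(webstr):
        # Download each distinct URL only once; repeated requests for the
        # same URL come back from memory instead of going over the wire again.
        if webstr not in _page_cache:
            with urllib.request.urlopen(webstr) as url:
                _page_cache[webstr] = url.read()
        return _page_cache[webstr]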
BeautifulSoup is easy to write code with, but it is a little slow in performance. If lxml can do your work, it will save a lot of time on analyzing the webpage contents.
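Assuming lxml is installed, the change can be as small as passing it as the soup's parser, or you can use lxml directly for the hot path; page here is the raw bytes you already download in run_proc():

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(page, 'lxml')   # same soup API, lxml does the parsing

    # or skip BeautifulSoup for the hot path entirely:
    import lxml.html
    tree = lxml.html.fromstring(page)
    tbls = tree.findall('.//table')      # roughly equivalent to soup.findAll('table')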
Hope this helps.