web scraping - Making a basic web scraper in Python with only built-in libraries - Python newbie
I'm quite new to Python, and instead of reading tutorials I felt that completing an actual project would be a better way to learn. However, that means I'll be needing quite a bit of assistance (:
Now to the point.
I'm trying to make a web scraper without any 3rd party libraries, so that the process isn't simplified for me and I actually know what I'm doing. I've looked through several online resources, but all of them have left me confused about certain things.
The HTML looks something like this:

    <html>
    <head>...</head>
    <body>
        *lots of other <div> tags*
        <div class = "want" style="font-family:verdana;font-size:12px;letter-spacing:normal"">
        <form class ="subform">...</form>
        <div class = "subdiv1" >...</div>
        <div class = "subdiv2" >...</div>
        *lots of other <div> tags*
    </body>
    </html>

I want my scraper to extract the <div class = "want"...>*content*</div> and save it to an HTML file.
I have a basic idea of how I need to go about this.

    import urllib
    from urllib import request
    #import re
    #from html.parser import HTMLParser

    response = urllib.request.urlopen("http://website.com")
    html = response.read()

    #Somehow extract the wanted data

    f = open('page.html', 'w')
    f.write(data)
    f.close()

Thank you very much!
The standard library comes with a variety of Structured Markup Processing Tools, which you can use for parsing the HTML and then searching it to extract your div.
There's a whole lot of choices there. What should you use?
html.parser looks like the obvious choice, but I'd actually start with ElementTree instead. It's a very nice and very powerful API, there's tons of documentation and sample code all over the web to get you started, and a lot of experts use it on a daily basis and can help with your problems. If it turns out that etree can't parse your HTML, you will have to use something else… but try it first.
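In case ElementTree does reject the page (real-world HTML is often not well-formed XML), here is a rough sketch of what an html.parser fallback might look like. The names `DivGrabber` and `grab_div` are made up for illustration, and the sketch assumes you only want the first matching div and don't mind losing HTML comments inside it. It counts nested `<div>` tags so it knows which `</div>` closes the one you want.

```python
from html.parser import HTMLParser


class DivGrabber(HTMLParser):
    """Hypothetical helper: collect the markup of the first <div> with a given class."""

    def __init__(self, css_class):
        # convert_charrefs=False keeps entities like &amp; as written in the source
        super().__init__(convert_charrefs=False)
        self.css_class = css_class
        self.depth = 0       # how many <div>s deep we are inside the wanted div
        self.done = False    # set once the wanted div has been closed
        self.chunks = []     # pieces of markup copied from the wanted div

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            classes = (dict(attrs).get("class") or "").split()
            if tag != "div" or self.css_class not in classes:
                return       # still outside the wanted div
        self.chunks.append(self.get_starttag_text())
        if tag == "div":
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied without changing the depth
        if self.depth and not self.done:
            self.chunks.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth and not self.done:
            self.chunks.append("</%s>" % tag)
            if tag == "div":
                self.depth -= 1
                if self.depth == 0:
                    self.done = True

    def handle_data(self, data):
        if self.depth and not self.done:
            self.chunks.append(data)

    def handle_entityref(self, name):
        if self.depth and not self.done:
            self.chunks.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth and not self.done:
            self.chunks.append("&#%s;" % name)


def grab_div(html_text, css_class="want"):
    """Return the first matching div as a string, or None if there is none."""
    parser = DivGrabber(css_class)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.chunks) if parser.chunks else None
```

Note that `response.read()` gives you bytes, so you would decode it (for example with `.decode("utf-8")`, assuming the page is UTF-8) before passing it to `grab_div`. The rest of this answer sticks with ElementTree.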
For example, with a few minor fixes to your snipped HTML so that it's actually valid, and so that there's actually some text worth getting out of your div:

    <html>
    <head>...</head>
    <body>
        *lots of other <div /> tags*
        <div class = "want" style="font-family:verdana;font-size:12px;letter-spacing:normal">spam spam spam
        <form class ="subform">...</form>
        <div class = "subdiv1" >...</div>
        <div class = "subdiv2" >...</div>
        *lots of other <div /> tags*
        </div>
    </body>
    </html>

You can use code like this (I'm assuming you know, or are willing to learn, XPath):

    from xml.etree import ElementTree

    # page holds the HTML text shown above
    tree = ElementTree.fromstring(page)
    mydiv = tree.find('.//div[@class="want"]')

Now you've got a reference to the div with class "want". You can get its direct text with this:

    print(mydiv.text)

But if you want to extract the whole subtree, that's even easier:

    data = ElementTree.tostring(mydiv)

If you want to wrap that up in a valid <html> and <body> and/or remove the <div> itself, you'll have to do that part manually. The documentation explains how to build up elements using a simple tree API: you create a head and a body to put in the html, then stick the div in the body, then tostring the html, and that's about it.
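Putting those steps together, a minimal sketch might look like the following. The function names `extract_div`, `wrap_in_html` and `save_fragment` are just illustrative, and the page is inlined as a string so the example runs without a network connection; for a real site you would get it from `urllib.request.urlopen(...).read()`. One thing to watch for: `tostring` returns bytes by default, so the sketch passes `encoding="unicode"` to get a str that can be written to a file opened in text mode.

```python
from xml.etree import ElementTree

# The cleaned-up sample HTML from above, inlined so the sketch runs offline.
page = """<html>
<head>...</head>
<body>
    *lots of other <div /> tags*
    <div class = "want" style="font-family:verdana;font-size:12px;letter-spacing:normal">spam spam spam
    <form class ="subform">...</form>
    <div class = "subdiv1" >...</div>
    <div class = "subdiv2" >...</div>
    *lots of other <div /> tags*
    </div>
</body>
</html>"""


def extract_div(page_text, css_class="want"):
    """Return the first <div> whose class attribute equals css_class, or None."""
    tree = ElementTree.fromstring(page_text)
    return tree.find('.//div[@class="%s"]' % css_class)


def wrap_in_html(element):
    """Build <html><head/><body>element</body></html> and serialize it to a str."""
    html = ElementTree.Element("html")
    ElementTree.SubElement(html, "head")
    body = ElementTree.SubElement(html, "body")
    body.append(element)
    # encoding="unicode" returns str instead of bytes;
    # method="html" writes <head></head> rather than the XML-style <head />
    return ElementTree.tostring(html, encoding="unicode", method="html")


def save_fragment(text, path):
    """Write the serialized HTML to a file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


mydiv = extract_div(page)
data = wrap_in_html(mydiv)
```

After that, `save_fragment(data, "page.html")` writes the result to disk. If no matching div is found, `extract_div` returns None, so it is worth checking for that before calling `wrap_in_html`.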