web scraping - Making a basic web scraper in Python with only built-in libraries - Python newbie
I'm quite new to Python, and instead of reading tutorials I felt that completing an actual project would be a better way to learn. However, that means I'll be needing quite a bit of assistance (:
Now, to the point.
I'm trying to make a web scraper without any 3rd party libraries, so that the process isn't simplified for me and I know what I'm doing. I've looked through several online resources, but all of them have left me confused about certain things.
The HTML looks something like this:
<html>
<head>...</head>
<body>
    *lots of other <div> tags*
    <div class = "want" style="font-family:verdana;font-size:12px;letter-spacing:normal"">
        <form class ="subform">...</form>
        <div class = "subdiv1" >...</div>
        <div class = "subdiv2" >...</div>
    *lots of other <div> tags*
</body>
</html>
I want the scraper to extract the <div class = "want"...>*content*</div> and save it to an HTML file.
I have a very basic idea of how I need to go about this.
import urllib
from urllib import request
#import re
#from html.parser import HTMLParser

response = urllib.request.urlopen("http://website.com")
html = response.read()

#somehow extract the wanted data

f = open('page.html', 'w')
f.write(data)
f.close()
Thank you very much!
The standard library comes with a variety of structured markup processing tools, which you can use for parsing the HTML and then searching it to extract your div.
There's a whole lot of choices there. What do you use?
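html.parser looks like the obvious choice, but I'd actually start with ElementTree instead. It's a very nice and very powerful API, there's tons of documentation and sample code all over the web to get you started, and a lot of experts use it on a daily basis and can help with your problems. If it turns out that etree can't parse your HTML (it's an XML parser, so it needs well-formed markup), you'll have to use something else… but try it first.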
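In case etree does choke on the real page, here's a rough sketch of what the html.parser route might look like. DivExtractor and extract_div are names I made up for illustration (they aren't part of the standard library), the sample string is invented, and the sketch assumes the target is the first div whose class list contains "want". It doesn't try to cope with badly unbalanced tags, and comments inside the div are dropped.

```python
from html.parser import HTMLParser


class DivExtractor(HTMLParser):
    """Hypothetical helper: collects the markup of the first <div> with a given class."""

    def __init__(self, wanted_class):
        # convert_charrefs=False so entities like &amp; are passed through unchanged
        super().__init__(convert_charrefs=False)
        self.wanted_class = wanted_class
        self.depth = 0      # how many <div>s deep we are inside the target (0 = outside)
        self.done = False   # set once the target div has been closed
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            # outside the target: only react to a div carrying the wanted class
            classes = (dict(attrs).get("class") or "").split()
            if tag != "div" or self.wanted_class not in classes:
                return
        if tag == "div":
            self.depth += 1
        self.pieces.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # self-closing tags such as <br/> do not change the nesting depth
        if self.depth:
            self.pieces.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        self.pieces.append("</%s>" % tag)
        if tag == "div":
            self.depth -= 1
            self.done = self.depth == 0

    def handle_data(self, data):
        if self.depth:
            self.pieces.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.pieces.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth:
            self.pieces.append("&#%s;" % name)


def extract_div(html, wanted_class="want"):
    """Return the target div as a string, or '' if no such div was found."""
    parser = DivExtractor(wanted_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.pieces)


# made-up sample input, just to show the call
sample = ('<html><body><div class="other">no</div>'
          '<div class="want">spam <div class="subdiv1">eggs</div></div>'
          '</body></html>')
print(extract_div(sample))
```

The depth counter is there because the div you want contains other divs, so you can't just stop at the first closing </div> you see.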
For example, with a few minor fixes to your snipped HTML so that it's valid, and so there's actually some text worth getting out of your div:
<html>
<head>...</head>
<body>
    *lots of other <div /> tags*
    <div class = "want" style="font-family:verdana;font-size:12px;letter-spacing:normal">spam spam spam
        <form class ="subform">...</form>
        <div class = "subdiv1" >...</div>
        <div class = "subdiv2" >...</div>
        *lots of other <div /> tags*
    </div>
</body>
</html>
You can use code like this (I'm assuming you know, or are willing to learn, XPath):
from xml.etree import ElementTree

tree = ElementTree.fromstring(page)
mydiv = tree.find('.//div[@class="want"]')
Now you've got a reference to the div with class "want". You can get its direct text with this:
print(mydiv.text)
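Putting those two steps together, a self-contained sketch might look like this. The page string is a trimmed, made-up stand-in for the fixed-up HTML (in your script it would come from response.read()), and it assumes the page really is well-formed.

```python
from xml.etree import ElementTree

# Stand-in for the downloaded page; the contents are invented for the example.
page = """<html>
<head><title>demo</title></head>
<body>
<div class="other">skip me</div>
<div class="want" style="font-size:12px">spam spam spam
  <form class="subform">form stuff</form>
  <div class="subdiv1">one</div>
  <div class="subdiv2">two</div>
</div>
</body>
</html>"""

tree = ElementTree.fromstring(page)
mydiv = tree.find('.//div[@class="want"]')

if mydiv is None:
    # find() returns None when nothing matches
    print("no matching div")
else:
    # .text is only the text before the first child element
    print(repr(mydiv.text))
    # itertext() yields every piece of text in the subtree
    print("".join(mydiv.itertext()))
```

Note that [@class="want"] compares the whole attribute value, so a div with class="want big" would not match this expression.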
But if you want to extract the whole subtree, that's even easier:
data = ElementTree.tostring(mydiv)  # bytes by default; pass encoding="unicode" to get a str
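One thing to watch when you write the result out: tostring gives you bytes unless you ask for encoding="unicode", so a file opened with 'w' will reject it. Here's a minimal sketch of the extract-and-save step. save_div is a hypothetical helper name, and it assumes the page parses as XML.

```python
from xml.etree import ElementTree


def save_div(page, path, wanted_class="want"):
    """Write the first matching div (tags included) to path.

    Hypothetical helper. Returns True if a div was found and written,
    False otherwise. `page` may be str or bytes, but must be well-formed.
    """
    tree = ElementTree.fromstring(page)
    mydiv = tree.find('.//div[@class="%s"]' % wanted_class)
    if mydiv is None:
        return False
    # encoding="unicode" returns a str; method="html" writes empty elements
    # as <div></div> rather than <div />, which browsers handle better
    data = ElementTree.tostring(mydiv, encoding="unicode", method="html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(data)
    return True
```

With your original script, the call would be something like save_div(response.read(), 'page.html').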
If you want to wrap that in a valid <html> and <body> and/or remove the <div> itself, you'll have to do that part manually. The documentation explains how to build up elements using a simple tree API: you create a head and a body to put in the html, then stick the div in the body, then tostring the html, and that's about it.
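Roughly, that manual wrapping might look like the sketch below. wrap_in_html is a made-up helper name, the title text is a placeholder I picked, and the page string is an invented sample.

```python
from xml.etree import ElementTree


def wrap_in_html(div, title="Extracted content"):
    """Hypothetical helper: return a new <html> element with div inside <body>."""
    html = ElementTree.Element("html")
    head = ElementTree.SubElement(html, "head")
    # the title is a placeholder, not something taken from the scraped page
    ElementTree.SubElement(head, "title").text = title
    body = ElementTree.SubElement(html, "body")
    body.append(div)
    return html


# made-up sample input
page = '<html><body><div class="want">spam <div class="subdiv1">eggs</div></div></body></html>'
mydiv = ElementTree.fromstring(page).find('.//div[@class="want"]')

document = wrap_in_html(mydiv)
print(ElementTree.tostring(document, encoding="unicode", method="html"))
```

If you'd rather drop the outer div and keep only its contents, one way is to set body.text = div.text and call body.extend(div) instead of body.append(div).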