In the last post we covered how to batch-crawl all the images on a single page. You may still feel that isn't powerful enough, since doing it by hand can be faster than writing the code. Today I'll introduce the advanced version: downloading the tens of thousands of photos behind several hundred URLs, sorted into folders by post title, in about a minute.
First, a quick review of the complete code from last time:
#encoding = utf-8
from urllib.request import urlretrieve
import requests   # missing from the original listing but required below
import os

url = 'http://blog.sina.com.cn/s/blog_4d79089f0102xojl.html'
S = requests.Session()
html = S.get(url)
c = html.content

def down_image(url):
    name = url[-12:] + '.jpg'
    urlretrieve(url, name)

rs = str(c)
e = 1
while e > 0:
    e = rs.find('&690')              # end marker of an image URL
    s = rs[:e].rfind('real_src')     # start marker of an image URL
    ad = rs[s+11:e] + '&690'
    if len(ad) <= 80:                # skip matches too long to be a real image URL
        down_image(ad)
        rs = rs[e+1:]
    else:
        rs = rs[e+1:]
I then modified the code above to use the BeautifulSoup library, which improves fault tolerance. The functionality is unchanged; the new code looks like this:
#encoding = utf-8
from urllib.request import urlretrieve
import requests
import os
from bs4 import BeautifulSoup

os.chdir('E:\\python-code\\photo')
S = requests.Session()

def down_image(im_url):
    name = im_url[-12:] + '.jpg'
    urlretrieve(im_url, name)

def get_path(url):
    html = S.get(url)
    c = html.content
    soup = BeautifulSoup(c, 'lxml')
    title = soup.find('title').string
    if not os.path.isdir(title):
        os.makedirs(title)
    os.chdir(title)                  # save this post's images into its own folder
    path = soup.find('div', attrs={'id': 'sina_keyword_ad_area2'}).find_all('img')
    for i in path:
        x = i.attrs['real_src']
        down_image(x)

url = 'http://blog.sina.com.cn/s/blog_4d79089f0102xojl.html'
get_path(url)
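Since the point of the rewrite was fault tolerance, here is a minimal sketch of how the download step itself could be hardened as well. This helper is my own addition, not part of the original code; it reuses the same requests session and simply skips an image that fails to download instead of crashing the whole crawl:

# Hedged sketch: download via the shared requests session instead of urlretrieve,
# so one broken image URL does not abort the whole run.
def down_image_safe(im_url, timeout=10):
    name = im_url[-12:] + '.jpg'      # same naming rule as the original code
    try:
        r = S.get(im_url, timeout=timeout)
        r.raise_for_status()
        with open(name, 'wb') as f:
            f.write(r.content)
    except Exception as e:            # log the failure and keep going
        print('skip', im_url, e)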
The modified code has one extra feature: images are saved into a folder named after the post title. The result is shown in the screenshot:
Now for the big move. Using the same blog as last time, we'll crawl the images from every one of its posts with this method (for technical discussion only; do not use it for anything illegal):
Step 1: collect the URLs of all blog posts
This follows the same idea as batch-collecting the image URLs. The code is as follows:
#encoding = utf-8
import requests
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool as ThreadPool

S = requests.Session()
pool = ThreadPool(5)        # thread pool with 5 workers
no = range(1, 11)           # article-list pages 1 to 10

def get_url(url):
    html = S.get(url)
    c = html.content
    soup = BeautifulSoup(c, 'lxml')
    url = soup.find_all('span', attrs={'class': 'atc_title'})
    for i in url:
        x = i.a.attrs['href']
        print(x)

url = ['http://blog.sina.com.cn/s/articlelist_1299777695_0_' + str(j) + '.html' for j in no]
data = pool.map(get_url, url)
The result:
This version uses multithreading (a thread pool); a quick sketch of the pattern follows.
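For readers unfamiliar with it, multiprocessing.dummy exposes a Pool backed by threads rather than processes, with the same map interface. Here is a minimal sketch of the pattern used above; the worker fetch_page and its status-code return value are illustrative placeholders, not part of the original script:

from multiprocessing.dummy import Pool as ThreadPool
import requests

def fetch_page(url):
    # placeholder worker: fetch one article-list page and return its HTTP status
    return requests.get(url).status_code

urls = ['http://blog.sina.com.cn/s/articlelist_1299777695_0_%d.html' % j for j in range(1, 11)]
pool = ThreadPool(5)                  # 5 worker threads
results = pool.map(fetch_page, urls)  # blocks until every page has been fetched
pool.close()
pool.join()
print(results)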
Step 2: merge the code
#encoding = utf-8
from urllib.request import urlretrieve
import requests
import os
from bs4 import BeautifulSoup

os.chdir('E:\\python-code\\photo')
S = requests.Session()
no = range(1, 11)

def down_image(im_url):
    name = im_url[-12:] + '.jpg'
    urlretrieve(im_url, name)

def get_path(url):
    html = S.get(url)
    c = html.content
    soup = BeautifulSoup(c, 'lxml')
    title = soup.find('h2', attrs={'class': 'titName SG_txta'}).string
    if not os.path.isdir(title):
        os.chdir('E:\\python-code\\photo')   # go back to the base folder first
        os.makedirs(title)
    os.chdir(title)
    path = soup.find('div', attrs={'id': 'sina_keyword_ad_area2'}).find_all('img')
    for i in path:
        x = i.attrs['real_src']
        down_image(x)

def get_url(url):
    html = S.get(url)
    c = html.content
    soup = BeautifulSoup(c, 'lxml')
    url = soup.find_all('span', attrs={'class': 'atc_title'})
    for i in url:
        x = i.a.attrs['href']
        get_path(x)

for j in no:
    url = 'http://blog.sina.com.cn/s/articlelist_1299777695_0_' + str(j) + '.html'
    get_url(url)
The result:
Multithreading made images end up in the wrong folders, because every thread shares the process's current working directory, so the multithreaded version was dropped here; a small demonstration of the problem follows.
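The underlying reason is that the working directory belongs to the process, not to any single thread, so one thread's os.chdir silently redirects every other thread. A tiny self-contained demonstration (illustrative only, using temporary folders rather than the blog data):

import os, threading, tempfile, time

base = tempfile.mkdtemp()
for name in ('a', 'b'):
    os.makedirs(os.path.join(base, name))

def worker(folder):
    os.chdir(os.path.join(base, folder))   # global side effect shared by all threads
    time.sleep(0.1)                        # give the other thread time to chdir too
    # by now the cwd may already belong to the other worker
    print(folder, '->', os.path.basename(os.getcwd()))

threads = [threading.Thread(target=worker, args=(f,)) for f in ('a', 'b')]
for t in threads:
    t.start()
for t in threads:
    t.join()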
Outcome:
Of roughly 500 posts, only 39 were crawled before it crashed. The error log showed:
My gut said the colon in the title was to blame (colons are problematic in Windows folder names), so I changed the code to replace ':' in the title with '-', which solved the problem:
title=soup.find('h2',attrs={'class':'titName SG_txta'}).string.replace(':','-')
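The colon is not the only character that can break folder creation on Windows. If you want to be thorough, a small sanitizer along these lines (my own sketch, not in the original code) covers the full set of forbidden characters as well as the full-width colon handled above:

import re

def safe_title(title):
    # replace the full-width colon plus every character Windows forbids in folder names
    return re.sub(r'[::\\/*?"<>|]', '-', title).strip()

print(safe_title('旅行日記:風景*照片'))   # -> 旅行日記-風景-照片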
Without multithreading the code was too slow, so I brought it back, and switched to absolute paths so images can no longer be saved to the wrong place. The final code:
#encoding = utf-8
from urllib.request import urlretrieve
import requests
import os
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool as ThreadPool

p = 'E:\\python-code\\photo'
S = requests.Session()
pool = ThreadPool(10)       # threads for the article-list pages
pool_url = ThreadPool(20)   # threads for the individual posts
no = range(1, 11)

def down_image(im_url, s):
    name = s + '\\' + im_url[-12:] + '.jpg'
    urlretrieve(im_url, name)

def get_path(url):
    html = S.get(url)
    c = html.content
    soup = BeautifulSoup(c, 'lxml')
    title = soup.find('h2', attrs={'class': 'titName SG_txta'}).string.replace(':', '-')
    s = p + '\\' + title            # absolute folder path for this post
    if not os.path.isdir(s):
        os.makedirs(s)
    path = soup.find('div', attrs={'id': 'sina_keyword_ad_area2'}).find_all('img')
    for i in path:
        down_image(i.attrs['real_src'], s)

def get_url(url):
    html = S.get(url)
    c = html.content
    soup = BeautifulSoup(c, 'lxml')
    url = soup.find_all('span', attrs={'class': 'atc_title'})
    url_l = []
    for i in url:
        x = i.a.attrs['href']
        url_l.append(x)
    data = pool_url.map(get_path, url_l)

url = ['http://blog.sina.com.cn/s/articlelist_1299777695_0_' + str(j) + '.html' for j in no]
data = pool.map(get_url, url)
The final run took about 120 seconds and crawled 494 posts and 7,480 images.
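If you want to reproduce those numbers, a rough sketch (assuming the final script above, with p as the base folder) is to time the last pool.map call and then count the files that actually landed on disk:

import os, time

start = time.perf_counter()
data = pool.map(get_url, url)      # the last line of the final script
elapsed = time.perf_counter() - start

total = sum(len(files) for _, _, files in os.walk(p))   # images written under the base folder
print('downloaded %d images in %.0f s' % (total, elapsed))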
That's all for today's share. See you next time.