In the last post we covered how to batch-crawl all the images on a single page. You may still feel that isn't powerful enough, since doing it by hand can be faster than writing the code. Today I'll introduce the advanced version: downloading the tens of thousands of photos behind several hundred URLs, sorted into folders by post title, in about a minute.
First, a quick review of the complete code from last time:
#encoding = utf-8
from urllib.request import urlretrieve
import requests   # missing from the original listing but required below
import os

url = 'http://blog.sina.com.cn/s/blog_4d79089f0102xojl.html'
S = requests.Session()
html = S.get(url)
c = html.content

def down_image(url):
    name = url[-12:] + '.jpg'
    urlretrieve(url, name)

rs = str(c)
e = 1
while e > 0:
    e = rs.find('&690')              # end marker of an image URL
    s = rs[:e].rfind('real_src')     # start marker of an image URL
    ad = rs[s+11:e] + '&690'
    if len(ad) <= 80:                # skip matches too long to be a real image URL
        down_image(ad)
        rs = rs[e+1:]
    else:
        rs = rs[e+1:]
I then modified the code above to use the BeautifulSoup library, which improves fault tolerance. The functionality is unchanged; the new code looks like this:
#encoding = utf-8
from urllib.request import urlretrieve
import requests
import os
from bs4 import BeautifulSoup

os.chdir('E:\\python-code\\photo')
S = requests.Session()

def down_image(im_url):
    name = im_url[-12:] + '.jpg'
    urlretrieve(im_url, name)

def get_path(url):
    html = S.get(url)
    c = html.content
    soup = BeautifulSoup(c, 'lxml')
    title = soup.find('title').string
    if not os.path.isdir(title):
        os.makedirs(title)
    os.chdir(title)                  # save this post's images into its own folder
    path = soup.find('div', attrs={'id': 'sina_keyword_ad_area2'}).find_all('img')
    for i in path:
        x = i.attrs['real_src']
        down_image(x)

url = 'http://blog.sina.com.cn/s/blog_4d79089f0102xojl.html'
get_path(url)
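Since the point of the rewrite was fault tolerance, here is a minimal sketch of how the download step itself could be hardened as well. This helper is my own addition, not part of the original code; it reuses the same requests session and simply skips an image that fails to download instead of crashing the whole crawl:

# Hedged sketch: download via the shared requests session instead of urlretrieve,
# so one broken image URL does not abort the whole run.
def down_image_safe(im_url, timeout=10):
    name = im_url[-12:] + '.jpg'      # same naming rule as the original code
    try:
        r = S.get(im_url, timeout=timeout)
        r.raise_for_status()
        with open(name, 'wb') as f:
            f.write(r.content)
    except Exception as e:            # log the failure and keep going
        print('skip', im_url, e)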
The modified code has one extra feature: images are saved into a folder named after the post title. The result is shown in the screenshot:
Now for the big move. Using the same blog as last time, we'll crawl the images from every one of its posts with this method (for technical discussion only; do not use it for anything illegal):
Step 1: collect the URLs of all blog posts
This follows the same idea as batch-collecting the image URLs. The code is as follows:
#encoding = utf-8
import requests
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool as ThreadPool

S = requests.Session()
pool = ThreadPool(5)        # thread pool with 5 workers
no = range(1, 11)           # article-list pages 1 to 10

def get_url(url):
    html = S.get(url)
    c = html.content
    soup = BeautifulSoup(c, 'lxml')
    url = soup.find_all('span', attrs={'class': 'atc_title'})
    for i in url:
        x = i.a.attrs['href']
        print(x)

url = ['http://blog.sina.com.cn/s/articlelist_1299777695_0_' + str(j) + '.html' for j in no]
data = pool.map(get_url, url)
The result:
This version uses multithreading (a thread pool); a quick sketch of the pattern follows.
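For readers unfamiliar with it, multiprocessing.dummy exposes a Pool backed by threads rather than processes, with the same map interface. Here is a minimal sketch of the pattern used above; the worker fetch_page and its status-code return value are illustrative placeholders, not part of the original script:

from multiprocessing.dummy import Pool as ThreadPool
import requests

def fetch_page(url):
    # placeholder worker: fetch one article-list page and return its HTTP status
    return requests.get(url).status_code

urls = ['http://blog.sina.com.cn/s/articlelist_1299777695_0_%d.html' % j for j in range(1, 11)]
pool = ThreadPool(5)                  # 5 worker threads
results = pool.map(fetch_page, urls)  # blocks until every page has been fetched
pool.close()
pool.join()
print(results)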
Step 2: merge the code
#encoding = utf-8
from urllib.request import urlretrieve
import requests
import os
from bs4 import BeautifulSoup

os.chdir('E:\\python-code\\photo')
S = requests.Session()
no = range(1, 11)

def down_image(im_url):
    name = im_url[-12:] + '.jpg'
    urlretrieve(im_url, name)

def get_path(url):
    html = S.get(url)
    c = html.content
    soup = BeautifulSoup(c, 'lxml')
    title = soup.find('h2', attrs={'class': 'titName SG_txta'}).string
    if not os.path.isdir(title):
        os.chdir('E:\\python-code\\photo')   # go back to the base folder first
        os.makedirs(title)
    os.chdir(title)
    path = soup.find('div', attrs={'id': 'sina_keyword_ad_area2'}).find_all('img')
    for i in path:
        x = i.attrs['real_src']
        down_image(x)

def get_url(url):
    html = S.get(url)
    c = html.content
    soup = BeautifulSoup(c, 'lxml')
    url = soup.find_all('span', attrs={'class': 'atc_title'})
    for i in url:
        x = i.a.attrs['href']
        get_path(x)

for j in no:
    url = 'http://blog.sina.com.cn/s/articlelist_1299777695_0_' + str(j) + '.html'
    get_url(url)
The result:
Multithreading made images end up in the wrong folders, because every thread shares the process's current working directory, so the multithreaded version was dropped here; a small demonstration of the problem follows.
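The underlying reason is that the working directory belongs to the process, not to any single thread, so one thread's os.chdir silently redirects every other thread. A tiny self-contained demonstration (illustrative only, using temporary folders rather than the blog data):

import os, threading, tempfile, time

base = tempfile.mkdtemp()
for name in ('a', 'b'):
    os.makedirs(os.path.join(base, name))

def worker(folder):
    os.chdir(os.path.join(base, folder))   # global side effect shared by all threads
    time.sleep(0.1)                        # give the other thread time to chdir too
    # by now the cwd may already belong to the other worker
    print(folder, '->', os.path.basename(os.getcwd()))

threads = [threading.Thread(target=worker, args=(f,)) for f in ('a', 'b')]
for t in threads:
    t.start()
for t in threads:
    t.join()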
Outcome:
Of roughly 500 posts, only 39 were crawled before it crashed. The error log showed:
My gut said the colon in the title was to blame (colons are problematic in Windows folder names), so I changed the code to replace ':' in the title with '-', which solved the problem:
title=soup.find('h2',attrs={'class':'titName SG_txta'}).string.replace(':','-')
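The colon is not the only character that can break folder creation on Windows. If you want to be thorough, a small sanitizer along these lines (my own sketch, not in the original code) covers the full set of forbidden characters as well as the full-width colon handled above:

import re

def safe_title(title):
    # replace the full-width colon plus every character Windows forbids in folder names
    return re.sub(r'[::\\/*?"<>|]', '-', title).strip()

print(safe_title('旅行日記:風景*照片'))   # -> 旅行日記-風景-照片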
Without multithreading the code was too slow, so I brought it back, and switched to absolute paths so images can no longer be saved to the wrong place. The final code:
#encoding = utf-8
from urllib.request import urlretrieve
import requests
import os
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool as ThreadPool

p = 'E:\\python-code\\photo'
S = requests.Session()
pool = ThreadPool(10)       # threads for the article-list pages
pool_url = ThreadPool(20)   # threads for the individual posts
no = range(1, 11)

def down_image(im_url, s):
    name = s + '\\' + im_url[-12:] + '.jpg'
    urlretrieve(im_url, name)

def get_path(url):
    html = S.get(url)
    c = html.content
    soup = BeautifulSoup(c, 'lxml')
    title = soup.find('h2', attrs={'class': 'titName SG_txta'}).string.replace(':', '-')
    s = p + '\\' + title            # absolute folder path for this post
    if not os.path.isdir(s):
        os.makedirs(s)
    path = soup.find('div', attrs={'id': 'sina_keyword_ad_area2'}).find_all('img')
    for i in path:
        down_image(i.attrs['real_src'], s)

def get_url(url):
    html = S.get(url)
    c = html.content
    soup = BeautifulSoup(c, 'lxml')
    url = soup.find_all('span', attrs={'class': 'atc_title'})
    url_l = []
    for i in url:
        x = i.a.attrs['href']
        url_l.append(x)
    data = pool_url.map(get_path, url_l)

url = ['http://blog.sina.com.cn/s/articlelist_1299777695_0_' + str(j) + '.html' for j in no]
data = pool.map(get_url, url)
The final run took about 120 seconds and crawled 494 posts and 7,480 images.
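If you want to reproduce those numbers, a rough sketch (assuming the final script above, with p as the base folder) is to time the last pool.map call and then count the files that actually landed on disk:

import os, time

start = time.perf_counter()
data = pool.map(get_url, url)      # the last line of the final script
elapsed = time.perf_counter() - start

total = sum(len(files) for _, _, files in os.walk(p))   # images written under the base folder
print('downloaded %d images in %.0f s' % (total, elapsed))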
That's all for today's share. See you next time.