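After reading:
[Summary] The logic, workflow, and caveats of crawling web pages, analyzing page content, and simulating website login
you know the general workflow for crawling a web page. Combined with the earlier post:
[Summary] Developer tools in the browser (F12 in IE9 and Ctrl+Shift+I in Chrome): a powerful aid for web page analysis
it should be clear how to use these tools to crawl a web page, analyze its source, and obtain the content you need.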
Below, a practical example shows how to crawl a web page and extract the needed content in C#:
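Suppose the requirement is this: starting from my (crifan's) page on Songtaste:
http://www.songtaste.com/user/351979/
first fetch the page's html source, then extract my Songtaste user name from it: crifan
The corresponding html code is:
    <h1 class="h1user">crifan</h1>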
This task is fairly simple. Here is how to implement it in C#.
Create a new C# project targeting .NET Framework 2.0 and add a few basic controls for display.
First, write the code that fetches the html:
    using System.IO;
    using System.Net;
    using System.Text;

    // step 1: get the html from the url
    string urlToCrawl = txbUrlToCrawl.Text;
    // generate the http request
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(urlToCrawl);
    // use the GET method to fetch the url's html
    req.Method = "GET";
    // songtaste's html uses charset GB2312; GBK is a superset of it,
    // so decode with GBK, otherwise the text comes back garbled
    string htmlCharset = "GBK";
    Encoding htmlEncoding = Encoding.GetEncoding(htmlCharset);
    string respHtml;
    // send the request; dispose of the response and reader when done
    using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
    using (StreamReader sr = new StreamReader(resp.GetResponseStream(), htmlEncoding))
    {
        // read out the returned html
        respHtml = sr.ReadToEnd();
    }
    rtbExtractedHtml.Text = respHtml;
In the UI, click the "Crawl web page html source" button,
and the corresponding html is returned.
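Note:
Whether you need to care about the html's character encoding (charset) depends on your needs.
If it is unclear why GBK is used here, see:
[Summary] An explanation of the character encoding (charset) formats of HTML page source (GB2312, GBK, UTF-8, ISO8859-1, etc.)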
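When the charset is not known in advance, one option is to read it from the response's Content-Type header and fall back to a default. The following is only a sketch under that assumption: the class CharsetHelper and both of its methods are hypothetical names introduced for illustration, not part of the original project.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, introduced only for illustration.
public static class CharsetHelper
{
    // Pull the charset name out of a Content-Type header value such as
    // "text/html; charset=gb2312". Returns null when none is present.
    public static string CharsetFromContentType(string contentType)
    {
        if (string.IsNullOrEmpty(contentType))
        {
            return null;
        }
        Match m = Regex.Match(
            contentType,
            @"charset\s*=\s*[""']?([^\s;""']+)",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Map a charset name to an Encoding; use the fallback when the name
    // is missing or not supported by the runtime.
    public static Encoding ResolveEncoding(string charsetName, Encoding fallback)
    {
        if (string.IsNullOrEmpty(charsetName))
        {
            return fallback;
        }
        try
        {
            return Encoding.GetEncoding(charsetName);
        }
        catch (ArgumentException)
        {
            return fallback;
        }
    }

    public static void Main()
    {
        string charset = CharsetFromContentType("text/html; charset=utf-8");
        Encoding enc = ResolveEncoding(charset, Encoding.UTF8);
        Console.WriteLine(charset + " -> " + enc.WebName);
    }
}
```

In the crawling code, the header value would come from resp.ContentType. On .NET Core and later, GBK and GB2312 are only available after registering the code pages encoding provider, so the fallback matters there; on .NET Framework 2.0, as used in this article, they are built in.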
Once the html has been obtained, use Regex, C#'s regular expression class, to extract the data we want:
    using System.Text.RegularExpressions;

    // step 2: extract the expected info
    // <h1 class="h1user">crifan</h1>
    string h1userP = @"<h1\s+class=""h1user"">(?<h1user>.+?)</h1>";
    Match foundH1user = (new Regex(h1userP)).Match(rtbExtractedHtml.Text);
    if (foundH1user.Success)
    {
        // extracted the expected h1user's value
        txbExtractedInfo.Text = foundH1user.Groups["h1user"].Value;
    }
    else
    {
        txbExtractedInfo.Text = "Not found h1 user !";
    }
Click "Extract the needed info" to extract the h1user value we want, crifan.
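The extraction step can also be tried without the UI. The following console sketch applies the same pattern to a hard-coded sample string; the class H1UserExtractor, the method ExtractH1User, and the sample html are assumptions made for illustration, not part of the original project.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, introduced only for illustration.
public static class H1UserExtractor
{
    // Same pattern as in the article: the named group "h1user" captures,
    // non-greedily, the text between the h1 tags.
    private const string H1UserPattern =
        @"<h1\s+class=""h1user"">(?<h1user>.+?)</h1>";

    // Returns the captured user name, or null when the pattern is not found.
    public static string ExtractH1User(string html)
    {
        Match found = Regex.Match(html, H1UserPattern);
        return found.Success ? found.Groups["h1user"].Value : null;
    }

    public static void Main()
    {
        // Sample input, standing in for the html fetched in step 1.
        string sampleHtml =
            "<div><h1 class=\"h1user\">crifan</h1><p>other content</p></div>";
        string name = ExtractH1User(sampleHtml);
        Console.WriteLine(name ?? "Not found h1 user !");
    }
}
```

The non-greedy `.+?` stops at the first closing `</h1>`, so later h1 elements on the page do not get swallowed into the captured value.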
The complete C# code is:
    using System;
    using System.Collections.Generic;
    using System.ComponentModel;
    using System.Data;
    using System.Drawing;
    using System.Text;
    using System.Windows.Forms;
    using System.Net;
    using System.IO;
    using System.Text.RegularExpressions;

    namespace crawlWebsiteAndExtractInfo
    {
        public partial class frmCrawlWebsite : Form
        {
            public frmCrawlWebsite()
            {
                InitializeComponent();
            }

            private void btnCrawlAndExtract_Click(object sender, EventArgs e)
            {
                // step 1: get the html from the url
                string urlToCrawl = txbUrlToCrawl.Text;
                // generate the http request
                HttpWebRequest req = (HttpWebRequest)WebRequest.Create(urlToCrawl);
                // use the GET method to fetch the url's html
                req.Method = "GET";
                // songtaste's html uses charset GB2312; GBK is a superset of it,
                // so decode with GBK, otherwise the text comes back garbled
                string htmlCharset = "GBK";
                Encoding htmlEncoding = Encoding.GetEncoding(htmlCharset);
                string respHtml;
                // send the request; dispose of the response and reader when done
                using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
                using (StreamReader sr = new StreamReader(resp.GetResponseStream(), htmlEncoding))
                {
                    // read out the returned html
                    respHtml = sr.ReadToEnd();
                }
                rtbExtractedHtml.Text = respHtml;
            }

            private void btnExtractInfo_Click(object sender, EventArgs e)
            {
                // step 2: extract the expected info
                // <h1 class="h1user">crifan</h1>
                string h1userP = @"<h1\s+class=""h1user"">(?<h1user>.+?)</h1>";
                Match foundH1user = (new Regex(h1userP)).Match(rtbExtractedHtml.Text);
                if (foundH1user.Success)
                {
                    // extracted the expected h1user's value
                    txbExtractedInfo.Text = foundH1user.Groups["h1user"].Value;
                }
                else
                {
                    txbExtractedInfo.Text = "Not found h1 user !";
                }
            }

            private void lklTutorialUrl_LinkClicked(object sender, LinkLabelLinkClickedEventArgs e)
            {
                // tutorialUrl is not declared in this listing; it is assumed to be
                // a string field defined elsewhere in this partial class
                System.Diagnostics.Process.Start(tutorialUrl);
            }
        }
    }
The complete VS2010 project can be downloaded here:
crawlWebsiteAndExtractInfo_csharp_2012-11-07.7z
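[Summary]
Overall, crawling a website with C# and extracting the needed content from the returned html source is somewhat more complex than the earlier Python version,
because many http-related details, such as the request, the response, the stream, and the encoding, have to be handled by hand.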
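One way to keep that manual handling in a single place is a small helper that owns the request, response, stream, and encoding steps. This is a minimal sketch, assuming the same HttpWebRequest API used in this article; the class HtmlFetcher and its methods ReadAll and FetchHtml are hypothetical names, not part of the original project.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper, introduced only for illustration.
public static class HtmlFetcher
{
    // Decode everything in the stream with the given encoding.
    // The reader (and the underlying stream) is disposed on exit.
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        using (StreamReader reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    // Send a GET request and return the decoded response body.
    public static string FetchHtml(string url, Encoding encoding)
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
        req.Method = "GET";
        using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
        {
            return ReadAll(resp.GetResponseStream(), encoding);
        }
    }

    public static void Main()
    {
        // Offline demonstration of the decoding step: no network access needed.
        byte[] bytes = Encoding.UTF8.GetBytes("<h1 class=\"h1user\">crifan</h1>");
        Console.WriteLine(ReadAll(new MemoryStream(bytes), Encoding.UTF8));
    }
}
```

With such a helper, the click handler would reduce to a call like HtmlFetcher.FetchHtml(txbUrlToCrawl.Text, Encoding.GetEncoding("GBK")).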