【Tutorial】Crawling a web page and extracting the required information: C# version | 在路上

2014.10.24


Posted 2012-11-23, 2:26 PM, by crifan

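Having read:

【Notes】The logic, workflow, and caveats of crawling web pages, analyzing page content, and simulating website login

you should understand the general workflow of crawling a web page. Combined with the tools introduced earlier in:

【Summary】Developer tools in the browser (F12 in IE9 and Ctrl+Shift+I in Chrome): a powerful aid for web page analysis

it should be clear how to use those tools to crawl a page, analyze its source, and get the content you need.

The following walks through a concrete example of implementing this crawl-and-extract process in C#: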

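Suppose the requirement is to start from my (crifan's) page on Songtaste:

http://www.songtaste.com/user/351979/

first fetch the page's HTML source, and then extract my Songtaste user name from it: crifan

The corresponding HTML is:

    <h1 class="h1user">crifan</h1>

The task is fairly simple. Here is how to implement it in C#.

Create a new C# project targeting .NET Framework 2.0 and add a few basic controls for display.

First, the code that fetches the HTML: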

    using System.Net;
    using System.IO;
    using System.Text; //needed for Encoding

    //step1: get html from url
    //http://www.songtaste.com/user/351979/
    string urlToCrawl = txbUrlToCrawl.Text;
    //generate http request
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(urlToCrawl);
    //use GET method to get url's html
    req.Method = "GET";
    //use request to get response
    HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
    //songtaste's html uses charset GB2312; GBK is a superset of GB2312,
    //so decode with GBK, otherwise the Chinese text comes out garbled
    string htmlCharset = "GBK";
    Encoding htmlEncoding = Encoding.GetEncoding(htmlCharset);
    StreamReader sr = new StreamReader(resp.GetResponseStream(), htmlEncoding);
    //read out the returned html
    string respHtml = sr.ReadToEnd();
    //release the reader and the underlying connection
    sr.Close();
    resp.Close();
    rtbExtractedHtml.Text = respHtml;

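In the UI, clicking the "Fetch page HTML source" button retrieves the corresponding HTML.

Note:
Whether you need to care about the HTML's character encoding (charset) depends on your needs.
For why GBK is used here, and for anything else that is unclear, see:
【Notes】An explanation of the character encodings (charset) of HTML page source (GB2312, GBK, UTF-8, ISO8859-1, etc.)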
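As a rough illustration of the charset point, the encoding does not have to be hard-coded. The sketch below assumes the server announces its charset in the Content-Type header (on a real response that value would come from `resp.ContentType`). `CharsetHelper`, `ExtractCharset`, and `ResolveEncoding` are hypothetical names introduced here, not part of the original project.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class CharsetHelper
{
    // Hypothetical helper: pulls the charset name out of a Content-Type value
    // such as "text/html; charset=gb2312". Returns null if none is present.
    public static string ExtractCharset(string contentType)
    {
        if (string.IsNullOrEmpty(contentType)) return null;
        Match m = Regex.Match(contentType,
            @"charset\s*=\s*[""']?(?<cs>[\w\-]+)", RegexOptions.IgnoreCase);
        return m.Success ? m.Groups["cs"].Value : null;
    }

    // Maps a charset name to an Encoding; uses the fallback when the name
    // is missing or not known to the runtime.
    public static Encoding ResolveEncoding(string charset, Encoding fallback)
    {
        if (string.IsNullOrEmpty(charset)) return fallback;
        try { return Encoding.GetEncoding(charset); }
        catch (ArgumentException) { return fallback; }
        catch (NotSupportedException) { return fallback; }
    }

    public static void Main()
    {
        // Made-up header value standing in for resp.ContentType
        string cs = ExtractCharset("text/html; charset=utf-8");
        Console.WriteLine(cs);                                          // prints "utf-8"
        Console.WriteLine(ResolveEncoding(cs, Encoding.ASCII).WebName); // prints "utf-8"
    }
}
```

The fallback matters because not every runtime knows every charset name: on .NET Framework "GBK" resolves directly, while on .NET Core and later it is only available after registering `CodePagesEncodingProvider`.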

With the HTML in hand, use C#'s regular expression class, Regex, to extract the data we want:

    using System.Text.RegularExpressions;

    //step2: extract expected info
    //<h1 class="h1user">crifan</h1>
    string h1userP = @"<h1\s+class=""h1user"">(?<h1user>.+?)</h1>";
    Match foundH1user = (new Regex(h1userP)).Match(rtbExtractedHtml.Text);
    if (foundH1user.Success)
    {
        //extracted the expected h1user's value
        txbExtractedInfo.Text = foundH1user.Groups["h1user"].Value;
    }
    else
    {
        txbExtractedInfo.Text = "Not found h1 user !";
    }

Click "Extract the required info" to extract the h1user value we want, crifan.
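To try the extraction step without the form or a network connection, the same pattern can be run against a hard-coded string. This is only a sketch: `H1UserDemo`, `ExtractH1User`, and the sample HTML are made up for illustration, while the regular expression is the one used in the tutorial.

```csharp
using System;
using System.Text.RegularExpressions;

public static class H1UserDemo
{
    // Same pattern as in the tutorial; the named group h1user captures
    // the text between the opening and closing h1 tags.
    private const string H1UserPattern =
        @"<h1\s+class=""h1user"">(?<h1user>.+?)</h1>";

    // Returns the captured name, or null when the tag is not present.
    public static string ExtractH1User(string html)
    {
        Match found = Regex.Match(html, H1UserPattern);
        return found.Success ? found.Groups["h1user"].Value : null;
    }

    public static void Main()
    {
        // Made-up sample HTML standing in for the downloaded page
        string sampleHtml =
            "<div><h1 class=\"h1user\">crifan</h1><h1 class=\"other\">x</h1></div>";
        Console.WriteLine(ExtractH1User(sampleHtml));                // prints "crifan"
        Console.WriteLine(ExtractH1User("<p>no match</p>") == null); // prints "True"
    }
}
```

The lazy quantifier `.+?` is what keeps the capture from running on to a later `</h1>` in the page.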

The complete C# code is:

    using System;
    using System.Collections.Generic;
    using System.ComponentModel;
    using System.Data;
    using System.Drawing;
    using System.Text;
    using System.Windows.Forms;
    using System.Net;
    using System.IO;
    using System.Text.RegularExpressions;

    namespace crawlWebsiteAndExtractInfo
    {
        public partial class frmCrawlWebsite : Form
        {
            public frmCrawlWebsite()
            {
                InitializeComponent();
            }

            private void btnCrawlAndExtract_Click(object sender, EventArgs e)
            {
                //step1: get html from url
                //http://www.songtaste.com/user/351979/
                string urlToCrawl = txbUrlToCrawl.Text;
                //generate http request
                HttpWebRequest req = (HttpWebRequest)WebRequest.Create(urlToCrawl);
                //use GET method to get url's html
                req.Method = "GET";
                //use request to get response
                HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
                //songtaste's html uses charset GB2312; GBK is a superset of GB2312,
                //so decode with GBK, otherwise the Chinese text comes out garbled
                string htmlCharset = "GBK";
                Encoding htmlEncoding = Encoding.GetEncoding(htmlCharset);
                StreamReader sr = new StreamReader(resp.GetResponseStream(), htmlEncoding);
                //read out the returned html
                string respHtml = sr.ReadToEnd();
                //release the reader and the underlying connection
                sr.Close();
                resp.Close();
                rtbExtractedHtml.Text = respHtml;
            }

            private void btnExtractInfo_Click(object sender, EventArgs e)
            {
                //step2: extract expected info
                //<h1 class="h1user">crifan</h1>
                string h1userP = @"<h1\s+class=""h1user"">(?<h1user>.+?)</h1>";
                Match foundH1user = (new Regex(h1userP)).Match(rtbExtractedHtml.Text);
                if (foundH1user.Success)
                {
                    //extracted the expected h1user's value
                    txbExtractedInfo.Text = foundH1user.Groups["h1user"].Value;
                }
                else
                {
                    txbExtractedInfo.Text = "Not found h1 user !";
                }
            }

            private void lklTutorialUrl_LinkClicked(object sender, LinkLabelLinkClickedEventArgs e)
            {
                //the tutorial's URL is missing from this copy of the post;
                //fill it in before using this handler
                string tutorialUrl = "";
                System.Diagnostics.Process.Start(tutorialUrl);
            }
        }
    }

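The complete VS2010 project can be downloaded here:

crawlWebsiteAndExtractInfo_csharp_2012-11-07.7z

【Summary】

Overall, crawling a site with C# and extracting the required content from the returned HTML is somewhat more complicated than the earlier Python version, because you have to handle many HTTP-related details yourself: the request, the response, the stream, the character encoding, and so on.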
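Some of that manual stream and encoding handling can be collected in one small helper. The following is a minimal sketch, not the original project's code: `StreamTextReader` and `ReadAllText` are hypothetical names, and a `MemoryStream` stands in for `resp.GetResponseStream()` so the example runs offline.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamTextReader
{
    // Hypothetical helper: decodes everything in the stream with the given
    // encoding. The using block closes the reader, and with it the
    // underlying stream, even if ReadToEnd throws.
    public static string ReadAllText(Stream stream, Encoding encoding)
    {
        using (StreamReader reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Bytes that a server might have sent, encoded as UTF-8 for this demo
        byte[] bytes = Encoding.UTF8.GetBytes("<h1 class=\"h1user\">crifan</h1>");
        string html = ReadAllText(new MemoryStream(bytes), Encoding.UTF8);
        Console.WriteLine(html);
    }
}
```

In the crawler itself, the call would look something like `ReadAllText(resp.GetResponseStream(), Encoding.GetEncoding("GBK"))`, leaving only the request and response objects to manage by hand.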
