How to Make a Web Crawler in C#

A web crawler is an application that downloads pages from the web. Web crawlers have various applications in IT, business and government agencies. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. Crawlers can also be used to automate maintenance tasks on a website, such as checking links or validating HTML code. A web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds.
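To make the idea of seeds a bit more concrete, here is a minimal sketch of the visiting loop. It is only an illustration and not part of the project built in this post: the names Frontier, Crawl, getLinks and maxPages are made up for this example, and the "web" is a small in-memory dictionary instead of real pages.

```csharp
using System;
using System.Collections.Generic;

public static class Frontier
{
    // Visits pages breadth-first starting from the seeds, never visiting a URL twice.
    // getLinks is a hypothetical callback that returns the URLs found on a page.
    public static List<string> Crawl(IEnumerable<string> seeds,
                                     Func<string, IEnumerable<string>> getLinks,
                                     int maxPages)
    {
        var toVisit = new Queue<string>();
        var seen = new HashSet<string>();
        var visited = new List<string>();

        // Queue each seed once, even if the seed list contains duplicates.
        foreach (string seed in seeds)
        {
            if (seen.Add(seed))
                toVisit.Enqueue(seed);
        }

        while (toVisit.Count > 0 && visited.Count < maxPages)
        {
            string url = toVisit.Dequeue();
            visited.Add(url);

            foreach (string link in getLinks(url))
            {
                // HashSet.Add returns false if the URL was already queued or visited.
                if (seen.Add(link))
                    toVisit.Enqueue(link);
            }
        }
        return visited;
    }

    public static void Main()
    {
        // A tiny in-memory "web" standing in for real pages (assumption for the demo).
        var web = new Dictionary<string, string[]>
        {
            { "a", new[] { "b", "c" } },
            { "b", new[] { "a", "c" } },
            { "c", new string[0] }
        };
        List<string> order = Crawl(new[] { "a" }, url => web[url], 10);
        Console.WriteLine(string.Join(", ", order)); // prints "a, b, c"
    }
}
```

In a real crawler, getLinks would download the page and pull the URLs out of it, and maxPages keeps the loop from running forever.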

Let's start:

  1. Create a Windows Forms project in C#.
  2. Drag a TextBox, a Label and a Button onto the form.
  3. The TextBox takes the URL from the user.
  4. The Label will be used for the output.

Now add the following namespaces at the top of your code file:

using System.Collections;
using System.Net;
using System.Text.RegularExpressions;

Then go to the Click event of the button and write the following code:

string url = textBox1.Text;

// WebClient is a class provided by the .NET Framework that does all the work needed to establish a connection:

WebClient webClient = new WebClient();

// Use a string to store all the data from the website

string strSource = webClient.DownloadString(url);
webClient.Dispose();
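The download can fail (bad URL, no connection, server error), and in that case DownloadString throws an exception. The following is one possible way to guard against that, not the only one. PageDownloader and TryDownload are names invented for this sketch, and returning null on failure is just a choice made for the example.

```csharp
using System;
using System.Net;

public static class PageDownloader
{
    // Hypothetical helper: returns the page source, or null if the download fails.
    public static string TryDownload(string url)
    {
        try
        {
            // 'using' disposes the WebClient even if DownloadString throws.
            using (var webClient = new WebClient())
            {
                return webClient.DownloadString(url);
            }
        }
        catch (WebException)
        {
            // Network problem, unknown host, HTTP error status, and so on.
            return null;
        }
        catch (ArgumentException)
        {
            // Null, empty or otherwise unusable address.
            return null;
        }
    }

    public static void Main()
    {
        // An empty address is used here so the demo does not need a network connection.
        string source = TryDownload("");
        Console.WriteLine(source == null ? "Download failed" : "Downloaded " + source.Length + " characters");
    }
}
```

In the button's Click event you could then call TryDownload(textBox1.Text) and show a message in the Label when it returns null. Note that newer .NET versions mark WebClient as obsolete in favor of HttpClient; it still works, but the compiler may show a warning.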

This string now contains all the HTML tags, JavaScript, etc. To remove the unnecessary text from the string, use the following syntax:

strSource = Regex.Replace(strSource, "<script.*?</script>", string.Empty, RegexOptions.Singleline | RegexOptions.IgnoreCase);

strSource = Regex.Replace(strSource, "<style.*?</style>", string.Empty, RegexOptions.Singleline | RegexOptions.IgnoreCase);

strSource = Regex.Replace(strSource, @"<(.|\n)*?>", string.Empty);

label1.Text = strSource;
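If you want to try the tag removal without a form or a network connection, here is a small console sketch that wraps the same three regular expressions in one method. HtmlText and Strip are names made up for this example. It also goes a little beyond the code above: tags are replaced with a space so neighbouring words do not stick together, entities such as &amp; are decoded with WebUtility.HtmlDecode, and extra whitespace is collapsed. Regular expressions are fine for a quick demo like this, but they are not a full HTML parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper: removes script/style blocks and tags, then tidies whitespace.
    public static string Strip(string html)
    {
        // Remove <script>...</script> and <style>...</style> blocks, including their contents.
        string text = Regex.Replace(html, "<script.*?</script>", string.Empty,
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
        text = Regex.Replace(text, "<style.*?</style>", string.Empty,
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        // Remove every remaining tag; a space is used so adjacent words do not fuse.
        text = Regex.Replace(text, @"<(.|\n)*?>", " ");

        // Decode entities such as &amp; and collapse runs of whitespace.
        text = WebUtility.HtmlDecode(text);
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string html = "<html><head><style>p { color: red; }</style></head>"
                    + "<body><p>Hello &amp; welcome</p><script>alert('x');</script></body></html>";
        Console.WriteLine(Strip(html)); // prints "Hello & welcome"
    }
}
```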

Part 2 coming soon (how to extract URLs from web pages).

Feel free to ask questions. If you have any suggestions or run into any problems, kindly leave a comment.
