A web crawler is an application that downloads pages from the Web. Web crawlers have various applications in IT, business, and government agencies. Many sites, in particular search engines, use crawling (also called spidering) as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. Crawlers can also be used to automate maintenance tasks on a website, such as checking links or validating HTML code. A web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds.
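To make the idea of seeds concrete, here is a minimal sketch of how a crawler could work through its list of URLs. It is only an illustration under assumptions: the class name CrawlFrontier, the maxPages limit, and the getLinks callback (which stands in for the link extraction that this article does not cover yet) are all hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class CrawlFrontier
{
    // Visits URLs breadth-first, starting from the seeds.
    // getLinks is a hypothetical callback that returns the URLs found on a page.
    public static List<string> Crawl(
        IEnumerable<string> seeds,
        Func<string, IEnumerable<string>> getLinks,
        int maxPages)
    {
        var frontier = new Queue<string>();   // URLs waiting to be visited
        var seen = new HashSet<string>();     // URLs already queued, to avoid repeats
        var visited = new List<string>();     // visit order, returned to the caller

        foreach (string seed in seeds)
        {
            // HashSet.Add returns false for a duplicate, so each seed is queued once.
            if (seen.Add(seed))
            {
                frontier.Enqueue(seed);
            }
        }

        while (frontier.Count > 0 && visited.Count < maxPages)
        {
            string url = frontier.Dequeue();
            visited.Add(url);

            foreach (string link in getLinks(url))
            {
                if (seen.Add(link))
                {
                    frontier.Enqueue(link);
                }
            }
        }

        return visited;
    }
}
```

The set of seen URLs is what keeps the crawler from visiting the same page again when two pages link to each other.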

Let's start:

  1. Create a Windows Forms project in C#.
  2. Drag a TextBox, a Label, and a Button onto the form.
  3. The TextBox takes the URL from the user.
  4. The Label will be used for the output.

Now add the following using directives at the top of your code file:

using System.Collections;
using System.Net;
using System.Text.RegularExpressions;

Then go to the Click event handler of the button and write the following code:

// Read the URL typed by the user (textBox1 is the default designer name)
string url = textBox1.Text;

// WebClient is a class provided by .NET that does all the work needed to establish a connection:

WebClient webClient = new WebClient();

// Use a string to hold all the data downloaded from the website

string strSource = webClient.DownloadString(url);
webClient.Dispose();
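The code above assumes the user typed a valid address and that the download succeeds. As a rough sketch of a more defensive version, the helper below checks the input first and catches network errors. The names PageDownloader, TryParseWebUrl, and TryDownload are made up for this example, and returning null on failure is just one possible choice.

```csharp
using System;
using System.Net;

public static class PageDownloader
{
    // Returns true and an absolute http/https Uri when the input looks like a web address.
    public static bool TryParseWebUrl(string input, out Uri uri)
    {
        return Uri.TryCreate((input ?? string.Empty).Trim(), UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }

    // Downloads the page source, or returns null if the URL is invalid or the request fails.
    public static string TryDownload(string input)
    {
        if (!TryParseWebUrl(input, out Uri uri))
        {
            return null;
        }

        try
        {
            // The using block disposes the WebClient even if DownloadString throws.
            using (var webClient = new WebClient())
            {
                return webClient.DownloadString(uri);
            }
        }
        catch (WebException)
        {
            // DNS failure, timeout, HTTP error status, and similar problems end up here.
            return null;
        }
    }
}
```

In the click handler you could then write string strSource = PageDownloader.TryDownload(textBox1.Text); and show an error message when the result is null.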

This string now contains all the HTML tags, JavaScript, and so on. To remove the unnecessary text from it, use the following syntax:

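// Remove script blocks together with their contents
strSource = Regex.Replace(strSource, "<script.*?</script>", string.Empty, RegexOptions.Singleline | RegexOptions.IgnoreCase);

// Remove style blocks together with their contents
strSource = Regex.Replace(strSource, "<style.*?</style>", string.Empty, RegexOptions.Singleline | RegexOptions.IgnoreCase);

// Remove every remaining HTML tag
strSource = Regex.Replace(strSource, @"<(.|\n)*?>", string.Empty);

// Show the remaining text (label1 is the default designer name)
label1.Text = strSource;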
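If you want to reuse the cleanup outside the click handler, the three Regex.Replace calls can be gathered into one method. The sketch below is a variation, not the exact code from above: it replaces tags with a space instead of an empty string so that neighbouring words do not run together, decodes entities such as &amp;amp; with WebUtility.HtmlDecode, and collapses extra whitespace. The names HtmlText and Strip are only placeholders. Keep in mind that regular expressions handle simple pages well but are not a full HTML parser.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    public static string Strip(string html)
    {
        const RegexOptions Options = RegexOptions.Singleline | RegexOptions.IgnoreCase;

        // Remove script and style blocks together with their contents.
        string text = Regex.Replace(html, @"<script\b.*?</script\s*>", " ", Options);
        text = Regex.Replace(text, @"<style\b.*?</style\s*>", " ", Options);

        // Replace each remaining tag with a space so adjacent words stay separated.
        text = Regex.Replace(text, @"<[^>]*>", " ");

        // Turn entities such as &amp; into the characters they stand for.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace into single spaces and trim the ends.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

For example, HtmlText.Strip("<h1>Hello</h1><p>Tom &amp;amp; Jerry</p>") returns "Hello Tom & Jerry".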

Part 2 is coming soon (how to extract URLs from web pages).

Feel free to ask questions. If you have any suggestions or run into any problems, kindly leave a comment.