Web Scraper (Information Extraction)

Brief:

Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding a fully-fledged web browser.

Web scraping focuses more on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to web automation, which simulates human browsing using computer software. Uses of web scraping include online price comparison, contact scraping, weather data monitoring, website change detection, research, web mashup and web data integration.
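As an illustration of turning HTML into structured data, here is a minimal sketch that uses only the Python standard library. The class name TableExtractor, the function html_table_to_csv and the sample HTML are assumptions made up for this example; a real page will usually need more careful handling (nested tables, malformed markup, and so on).

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_csv(html):
    """Return the table cells found in `html` as CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Hypothetical sample page fragment, not taken from any real site.
SAMPLE_HTML = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
"""

if __name__ == "__main__":
    print(html_table_to_csv(SAMPLE_HTML))
```

The CSV text can then be saved to a file and opened in a spreadsheet, or the rows can be inserted into a database instead.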

Need:

A great deal of information is available on the internet, but it is often not presented in the form we need. In many cases the information also sits behind an authentication screen. This is where a web scraper comes to the rescue. A web scraper can be coded to log in to the website, navigate to the relevant page, and extract the desired information. Once the information is extracted, it can be presented as required. The whole process is performed in a few seconds, whereas doing the same thing manually takes much longer.

How It Works:

To do web scraping, we must understand the concepts below.

1. HTTP Sniffer tools - Many tools are available on the market, such as Http Watch, Http Analyzer, and Fiddler. They integrate with the Internet Explorer and Firefox browsers to show you the HTTP and HTTPS traffic that is generated when you access a web page.

2. Raw Request - The string form of everything sent to the server in a request. It contains the headers, cookies, query string, POST data, etc.

3. Raw Response - The string form of the response received from the server.

4. Headers - Headers consist of fields such as Host, User-Agent, Cookie, Content-Length, and Cache-Control.

5. POST Data - Data sent to the server in the body of the request rather than in the URL. For example, when you fill in a form or enter login details and then submit, all the fields are sent to the server as key-value pairs in the POST data. It is not shown in the browser address bar while being transferred.

6. Querystring - The query string is similar to POST data, but it is visible in the browser address bar as part of the URL.

7. Cookies - Cookies are key-value pairs of data stored in the client browser. The server sends cookies to the client, and the client sends them back with every request to the server until they expire. Most of the time cookies are used for authentication.

8. User-Agent - A header that contains information about the browser being used on the client side.
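To make the concepts above concrete, the following sketch assembles a raw request from its parts (headers, cookies, query string, POST data, User-Agent) and splits a raw response back into its pieces. The function names build_raw_request and parse_raw_response, the host example.com, the form fields and the cookie name are all hypothetical and chosen only for illustration; a sniffer tool shows the same kind of text for real traffic.

```python
from http.cookies import SimpleCookie
from urllib.parse import urlencode


def build_raw_request(method, host, path, query=None, form=None,
                      cookies=None, user_agent="ExampleScraper/0.1"):
    """Return the raw HTTP/1.1 request text for the given parts."""
    # The query string travels in the URL, after a "?".
    target = path + ("?" + urlencode(query) if query else "")
    # POST data travels in the request body as key=value pairs.
    body = urlencode(form) if form else ""
    lines = [
        f"{method} {target} HTTP/1.1",
        f"Host: {host}",
        f"User-Agent: {user_agent}",
    ]
    if cookies:
        pairs = "; ".join(f"{k}={v}" for k, v in cookies.items())
        lines.append("Cookie: " + pairs)
    if body:
        lines.append("Content-Type: application/x-www-form-urlencoded")
        lines.append(f"Content-Length: {len(body.encode('ascii'))}")
    # A blank line separates the headers from the body.
    return "\r\n".join(lines) + "\r\n\r\n" + body


def parse_raw_response(raw):
    """Split a raw response into (status_code, headers, cookies, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    status_code = int(status_line.split(" ", 2)[1])
    headers = []
    cookies = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        name, value = name.strip(), value.strip()
        headers.append((name, value))
        if name.lower() == "set-cookie":
            jar = SimpleCookie()
            jar.load(value)
            for key, morsel in jar.items():
                cookies[key] = morsel.value
    return status_code, headers, cookies, body


if __name__ == "__main__":
    print(build_raw_request(
        "POST", "example.com", "/login",
        query={"next": "/home"},
        form={"user": "alice", "pass": "s3cret"},
        cookies={"sessionid": "abc123"},
    ))
```

The cookies returned by parse_raw_response are the ones a scraper would send back in the Cookie header of its next request.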

The main principle of web scraping is that the server cannot tell whether a request was sent by code or by a real browser, provided we send all the required information along with the request.

The main challenge in web scraping is identifying which information the server expects in the request.
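One possible way to put this principle into practice is sketched below with Python's standard urllib and http.cookiejar modules. The header values, the login URL, the form field names and the helper names (make_session, build_login_request, fetch_after_login) are assumptions for illustration only; the actual fields a given server expects have to be found by inspecting the real traffic with a sniffer tool.

```python
import http.cookiejar
import urllib.request
from urllib.parse import urlencode

# Browser-like headers; the exact values are an assumption and would
# normally be copied from a request captured with a sniffer tool.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) "
                  "Gecko/20100101 Firefox/115.0",
    "Accept": "text/html,application/xhtml+xml",
}


def make_session():
    """Return an opener that stores cookies and sends them back."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, fields):
    """Build a POST request that carries the login form fields."""
    data = urlencode(fields).encode("ascii")
    return urllib.request.Request(
        login_url, data=data, headers=BROWSER_HEADERS, method="POST")


def fetch_after_login(login_url, fields, page_url, timeout=10):
    """Log in, then fetch a page using the cookies the server set."""
    opener, _jar = make_session()
    # The login response usually sets the authentication cookie.
    with opener.open(build_login_request(login_url, fields),
                     timeout=timeout):
        pass
    page_request = urllib.request.Request(page_url, headers=BROWSER_HEADERS)
    with opener.open(page_request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Hypothetical URL and field names; nothing is sent over the network here.
    request = build_login_request(
        "https://example.com/login",
        {"username": "alice", "password": "s3cret"},
    )
    print(request.get_method(), request.full_url)
    print(request.data)
```

Keeping one opener for the whole session matters because the cookie jar attached to it is what carries the authentication cookie from the login response to the later page requests.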
