Advanced Level

Python Web Scraper

Build an ethical, polite web scraper in Python using the built-in urllib module (or the third-party requests library) to extract headlines and structured data from sample HTML pages, write the results to CSV, and respect robots.txt rules.

The Problem

Real-World Use Case

Technology Stack

Understanding of HTML structure and DOM elements

Prerequisite

Familiarity with HTTP GET requests

Prerequisite

Understanding of string parsing or regular expressions

Prerequisite

Familiarity with Python CSV module

Prerequisite

Architecture & Design

Folder Structure

web_scraper/
├── scraper.py
├── headings.csv
└── README.md

Step-by-Step Implementation

Verify target URL permits crawling by inspecting robots.txt.
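The robots.txt check can also be done in code rather than by eye. The following is a minimal sketch using the standard library's `urllib.robotparser`; the helper names `robots_url_for`, `is_allowed`, and `is_allowed_live` are hypothetical, and the sample rules are made up for illustration rather than taken from any real site.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Assumption: the same agent string the scraper sends in its request headers.
USER_AGENT = "TechIdea Education Scraper Bot"


def robots_url_for(page_url: str) -> str:
    """Build the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def is_allowed(robots_lines: list[str], page_url: str, agent: str = USER_AGENT) -> bool:
    """Check page_url against robots.txt rules that are already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, page_url)


def is_allowed_live(page_url: str, agent: str = USER_AGENT) -> bool:
    """Download the site's robots.txt, then check page_url (needs network access)."""
    parser = RobotFileParser(robots_url_for(page_url))
    parser.read()
    return parser.can_fetch(agent, page_url)


if __name__ == "__main__":
    # Made-up rules, used only to show both outcomes without a network call.
    sample_rules = ["User-agent: *", "Disallow: /private/"]
    print(is_allowed(sample_rules, "http://example.com/index.html"))         # True
    print(is_allowed(sample_rules, "http://example.com/private/data.html"))  # False
```

Splitting the rule check (`is_allowed`) from the download (`is_allowed_live`) keeps the logic testable offline; in the scraper, the live variant would run once before the first request to a site.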

### Step 1: Project Setup

Create the folder `web_scraper` and the script `scraper.py`. We use `urllib.request` and `csv`.

```python
"""
Complete Solution Code: Ethical Python Web Scraper
"""
import urllib.request
import urllib.error
import re
import csv


def scrape_site(url: str, output_csv: str) -> None:
    """Fetch url, extract h1-h3 heading text, and save it to output_csv."""
    print(f"Connecting to: {url}...")
    try:
        # Identify the bot honestly so site owners can see who is calling.
        req = urllib.request.Request(
            url,
            headers={'User-Agent': 'TechIdea Education Scraper Bot'}
        )
        with urllib.request.urlopen(req, timeout=10) as response:
            html_content = response.read().decode('utf-8')

        print("HTML fetched successfully. Parsing content...")

        # Simple regex extraction of h1, h2 and h3 headings
        matches = re.findall(r'<h[1-3][^>]*>(.*?)</h[1-3]>', html_content, re.DOTALL)

        # Strip nested HTML tags and skip very short fragments
        records = []
        for match in matches:
            clean_text = re.sub(r'<[^>]+>', '', match).strip()
            if clean_text and len(clean_text) > 3:
                records.append([clean_text])

        if not records:
            print("No headings found on the page.")
            return

        # Write to CSV: one header row, then one heading per row
        with open(output_csv, 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow(['Extracted Heading'])
            writer.writerows(records)

        print(f"➔ Scrape complete! {len(records)} records saved to {output_csv}")

    # HTTPError is a subclass of URLError, so it must be caught first.
    except urllib.error.HTTPError as e:
        print(f"HTTP Error: {e.code} - {e.reason}")
    except urllib.error.URLError as e:
        print(f"URL Connection Error: {e.reason}")
    except Exception as e:
        print(f"Unexpected Error: {e}")


if __name__ == "__main__":
    # Test on a public educational benchmark or example.com
    target = "http://example.com"
    scrape_site(target, "headings.csv")
```


Send HTTP GET request with standard User-Agent headers.

### Step 2: Core Logic Implementation

Implement the HTML fetching function with a custom User-Agent header to ensure polite requests.

```python
import urllib.request


def fetch_html(url: str) -> str:
    """Fetch a page and return its body decoded as UTF-8 text."""
    req = urllib.request.Request(
        url,
        headers={'User-Agent': 'Mozilla/5.0 (Education Bot)'}
    )
    # A timeout keeps the script from hanging on an unresponsive server.
    with urllib.request.urlopen(req, timeout=10) as response:
        return response.read().decode('utf-8')
```
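Decoding every page as UTF-8 fails on sites that serve another encoding. As a hedged variant of the fetch step, the sketch below prefers the charset declared in the response's Content-Type header and falls back to UTF-8. The names `decode_body` and `fetch_html_with_charset` are hypothetical, not part of the original script.

```python
import urllib.request
from typing import Optional


def decode_body(raw: bytes, declared_charset: Optional[str]) -> str:
    """Decode a response body, preferring the charset the server declared."""
    charset = declared_charset or "utf-8"
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset label: fall back to UTF-8 and replace bad bytes.
        return raw.decode("utf-8", errors="replace")


def fetch_html_with_charset(url: str, timeout: float = 10.0) -> str:
    """Variant of fetch_html that honours the declared charset (needs network access)."""
    req = urllib.request.Request(
        url,
        headers={"User-Agent": "Mozilla/5.0 (Education Bot)"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as response:
        declared = response.headers.get_content_charset()
        return decode_body(response.read(), declared)
```

Keeping the decoding in its own function means it can be checked with plain byte strings, without making a request.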


Receive HTML response body as a text string.

### Step 3: Parsing & Exporting

Extract headlines and write to CSV.

```python
import re


def parse_headlines(html: str) -> list:
    """Return the text of every plain <h2> element, with nested tags stripped."""
    # Extracting standard h2 tags using regex for demonstration.
    # The pattern only matches <h2> tags that carry no attributes.
    matches = re.findall(r'<h2>(.*?)</h2>', html, re.DOTALL)
    clean = [re.sub(r'<.*?>', '', m).strip() for m in matches]
    return clean
```
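The export half of this step can be sketched with the `csv` module in the same way the complete solution writes its rows. The function name `write_headlines_csv` is an assumption made for this example; the column header matches the one used in the full script.

```python
import csv


def write_headlines_csv(headlines: list[str], output_csv: str) -> int:
    """Write one headline per row under a header line; return the number of data rows."""
    rows = [[text] for text in headlines if text]  # skip empty strings
    # newline='' lets the csv module control line endings on every platform.
    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Extracted Heading"])
        writer.writerows(rows)
    return len(rows)
```

The list returned by a parser such as `parse_headlines` can be passed straight in; `csv.writer` quotes any headline that contains a comma, so the file stays one column wide.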


Parse specific HTML tags (e.g., <h2> or <li> items) using string methods or regex.

### Step 4: Politeness & Exceptions

Always catch HTTP errors (404, 500) and respect crawl delays.
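One way to combine both rules is a small loop that pauses between requests and records failures instead of stopping. This is a sketch under assumptions: `fetch_politely` is a hypothetical helper, the one-second default delay is an arbitrary illustrative value, and the `fetch` argument stands for any function that takes a URL and returns HTML.

```python
import time
import urllib.error
from typing import Callable, Optional


def fetch_politely(
    urls: list[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 1.0,
) -> dict[str, Optional[str]]:
    """Fetch each URL in turn, pausing between requests; failed URLs map to None."""
    results: dict[str, Optional[str]] = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause between requests, not before the first
        try:
            results[url] = fetch(url)
        except urllib.error.HTTPError as e:
            # The server answered with an error status such as 404 or 500.
            print(f"HTTP Error for {url}: {e.code} - {e.reason}")
            results[url] = None
        except urllib.error.URLError as e:
            # No usable response: DNS failure, refused connection, and so on.
            print(f"URL Connection Error for {url}: {e.reason}")
            results[url] = None
    return results
```

`HTTPError` is caught before `URLError` because it is a subclass of it; reversing the order would hide the status code. If a site's robots.txt declares a Crawl-delay, that value is a better choice for `delay_seconds` than the default.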


Clean extracted text and write structured rows into the CSV file (headings.csv in the sample run).


Common Errors

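HTTP 403 Forbidden error: some servers reject requests that carry Python's default User-Agent.

Fix: add a descriptive User-Agent header to the HTTP request.

UnicodeEncodeError when writing to CSV: the platform's default file encoding cannot represent some of the scraped characters.

Fix: pass encoding='utf-8' when opening the file.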

Security & Performance

Run script against example.com.

Verify headings.csv is generated with extracted text.

Check error handling by inputting a broken URL.

Use BeautifulSoup or lxml for robust DOM parsing instead of regex.

Add pagination scraping across multiple sequential pages.

Implement automatic robots.txt verification before fetching.
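To illustrate the first enhancement without adding a dependency, the sketch below swaps in the standard library's `html.parser` in place of BeautifulSoup or lxml. It is an event-based parser rather than a full DOM library, but unlike the regex it handles tag attributes, nested inline tags, and character references. The class name `HeadingParser` and the function `extract_headings` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional


class HeadingParser(HTMLParser):
    """Collect the text inside h1, h2 and h3 elements."""

    HEADING_TAGS = {"h1", "h2", "h3"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.headings: list[str] = []
        self._open_tag: Optional[str] = None  # heading currently being read
        self._parts: list[str] = []           # text chunks seen inside it

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS and self._open_tag is None:
            self._open_tag = tag
            self._parts = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            # Join the chunks and collapse runs of whitespace.
            text = " ".join("".join(self._parts).split())
            if text:
                self.headings.append(text)
            self._open_tag = None


def extract_headings(html: str) -> list[str]:
    """Return heading texts in document order."""
    parser = HeadingParser()
    parser.feed(html)
    parser.close()
    return parser.headings
```

For example, `extract_headings('<h1 class="t">Example <em>Domain</em></h1>')` returns `['Example Domain']`. BeautifulSoup or lxml remain the better choice for badly broken markup or for selecting elements by CSS class.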

Interview Questions

Q: Is web scraping legal?

A: Scraping publicly accessible data is generally permitted for educational and research purposes, but the answer depends on the jurisdiction and on what data is collected, and you must respect the website's terms of service and robots.txt rules.

Q: Why use urllib instead of requests?

A: urllib is part of Python's standard library, so the script runs without any pip installs.
