urllib

urllib is a package in the Python standard library for accessing and working with URLs (Uniform Resource Locators). It is divided into four modules that cover opening and reading URLs, parsing and modifying URLs, handling errors, and parsing robots.txt files.

Components and Modules

Request

This module is used for opening and reading URLs. In other words, users can retrieve data from URLs over the HTTP, HTTPS, and FTP protocols. It supports basic and digest authentication, redirections, cookies, and more. The “.urlopen()” function opens a connection to a given URL and returns a response object, and that object's “.read()” method returns the response body (for a web page, its HTML source code) as bytes.

This module also provides other functions, such as “.install_opener(),” “.build_opener(),” “.pathname2url()”, “.url2pathname()”, and “.getproxies(),” as well as classes such as “.Request()”, “.OpenerDirector()”, “.BaseHandler()”, and more. For more information about these functions and classes, see the documentation.
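
As a minimal sketch of how some of these pieces fit together, the snippet below describes a request with the “Request” class and creates an opener with “build_opener()”, without sending anything over the network. The URL and the User-Agent value are illustrative assumptions, not values from the documentation.

```python
from urllib.request import OpenerDirector, Request, build_opener

# Describe the request first; nothing is sent until an opener opens it.
request = Request(
    "https://www.python.org/",                     # illustrative URL
    headers={"User-Agent": "example-client/0.1"},  # hypothetical client name
    method="GET",
)

# build_opener() returns an OpenerDirector with the default handlers chained.
opener = build_opener()

print(request.get_method())               # GET
print(request.host)                       # www.python.org
print(request.get_header("User-agent"))   # header names are stored capitalized
print(isinstance(opener, OpenerDirector)) # True
```

Calling opener.open(request) would then perform the actual request.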

Parse

The “urllib.parse” module was designed to match the Internet RFC (Request for Comments) on Relative Uniform Resource Locators. It supports a range of URL schemes, including ftp, gopher, hdl, http, https, imap, mailto, and more.

In general, this module breaks a URL into components such as the addressing scheme, network location, path, and query string. It can also combine components back into a URL and resolve a relative URL against a “base URL” to produce a new absolute URL.
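
A short sketch of both ideas, splitting a URL and building new URLs from a base, might look like this. The URLs are illustrative assumptions; “urlsplit” and “urljoin” are functions of “urllib.parse”.

```python
from urllib.parse import urljoin, urlsplit

# Split a URL into its components.
parts = urlsplit("https://www.python.org/search/?q=urllib#results")
print(parts.scheme)    # https
print(parts.netloc)    # www.python.org
print(parts.path)      # /search/
print(parts.query)     # q=urllib
print(parts.fragment)  # results

# Resolve relative references against a base URL.
base = "https://www.python.org/about/help/"
print(urljoin(base, "faq.html"))     # https://www.python.org/about/help/faq.html
print(urljoin(base, "../apps/"))     # https://www.python.org/about/apps/
print(urljoin(base, "/downloads/"))  # https://www.python.org/downloads/
```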

The functions in this module fall into two categories: URL parsing and URL quoting. The URL parsing functions split URLs into components or combine components into new URLs. The URL quoting functions make data safe to use as URL components by escaping special characters, and they also support the reverse process of recovering the original data from URL components.
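
To illustrate the quoting category, here is a small sketch using “quote”, “quote_plus”, and their inverse functions from “urllib.parse”. The sample strings are made up for the example.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

# quote() escapes special characters for use in a URL path; "/" is kept by default.
path = quote("reports/2023 summary.txt")
print(path)  # reports/2023%20summary.txt

# Passing safe="" escapes every reserved character, including "&" and "=".
value = quote("a&b=c", safe="")
print(value)  # a%26b%3Dc

# quote_plus() is meant for query strings: spaces become "+".
term = quote_plus("url parsing & quoting")
print(term)  # url+parsing+%26+quoting

# The unquote functions reverse the process.
print(unquote(path))       # reports/2023 summary.txt
print(unquote_plus(term))  # url parsing & quoting
```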

Error

The “urllib.error” module defines the exception classes raised by the “urllib.request” module. It includes three classes: “.URLError()”, the base exception raised when a request fails (for example, when the server cannot be reached); “.HTTPError()”, a subclass of “.URLError()” raised for HTTP error responses (e.g., authentication requests); and “.ContentTooShortError()”, raised when the “.urlretrieve()” function detects that the amount of downloaded data is less than the expected amount.
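
The relationship between these classes matters when deciding which one to check first. The sketch below uses a hypothetical helper, “describe”, and hand-built exception objects (created only for illustration, not raised by a real request) to show it.

```python
from urllib.error import ContentTooShortError, HTTPError, URLError

def describe(exc):
    """Return a short label for an exception (hypothetical helper)."""
    # Check the subclasses first: a URLError check would match them too.
    if isinstance(exc, HTTPError):
        return f"HTTP {exc.code}: {exc.reason}"
    if isinstance(exc, ContentTooShortError):
        return f"incomplete download: {exc.reason}"
    if isinstance(exc, URLError):
        return f"URL error: {exc.reason}"
    return "not a urllib error"

# Hand-built exceptions, for illustration only.
not_found = HTTPError("https://www.python.org/notfound", 404, "Not Found", None, None)
unreachable = URLError("name resolution failed")

print(issubclass(HTTPError, URLError))             # True
print(issubclass(ContentTooShortError, URLError))  # True
print(describe(not_found))                         # HTTP 404: Not Found
print(describe(unreachable))                       # URL error: name resolution failed
```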

Robotparser

This module parses robots.txt files, which tell web crawlers (such as those run by search engines) which pages of a site they may fetch. It provides a single class called “.RobotFileParser()”, which determines whether a specific user agent may fetch a URL on a website based on the content of its robots.txt file.
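
As a rough sketch, the parser can be fed the lines of a robots.txt file directly, which avoids any network access. The rules, the “ExampleBot” user agent, and the example.com URLs below are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: every crawler is barred from /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed = parser.can_fetch("ExampleBot", "https://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "https://example.com/private/report.html")

print(allowed)  # True
print(blocked)  # False
```

For a live site, the usual pattern is to call “set_url()” with the address of the robots.txt file and then “read()” to download and parse it.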

Urllib vs. Similar Python Packages

When developers search for tools specialized in URL management, they often come across several libraries with similar names that do not necessarily share the same origin or functionality. The best-known libraries are urllib, urllib2, and urllib3.

urllib

This is the original urllib module, introduced in Python 1.4 and now obsolete. It was initially built to provide access to URLs and their subsequent reading. This package comprised the functions “.urlopen(),” “.urlretrieve(),” “.urlcleanup(),” “.quote(),” and “.unquote().”

urllib2

urllib2 was built to replace its predecessor, providing extra features such as the Request, HTTPRedirectHandler, and OpenerDirector classes. It first shipped with Python 1.6.

urllib in Python 3

The urllib package in Python 3 shares only its name with the first version. It combines and improves on the functionality of the older urllib and urllib2 modules, making it more efficient for working with HTTP requests and responses. More details are covered in this article.

urllib3

Despite the similar name, this is a third-party package that is not part of the Python standard library. Andrey Petrov built it as a powerful and user-friendly HTTP client for Python, providing features missing from the standard library such as thread safety, connection pooling, client-side SSL/TLS verification, and more.

Quick Installation Guide

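Starting with Python 3.0, urllib is part of the standard library, so you don’t need to install any additional packages to use it. Because it ships with the interpreter rather than as a separate distribution, it does not appear under the name “urllib” in package listings; what you can verify is that a Python 3 interpreter is installed in your environment. For conda users, use the following command in the Anaconda prompt:

conda list

It will print a list of the packages in the active environment, where you can check the “python” entry and its version. An entry named “urllib3”, if present, is the separate third-party package described above. Look at these tutorials to install Anaconda in Windows and Ubuntu. Alternatively, you can use the pip package manager.

pip list

This command lists only third-party packages, so urllib itself will not appear there either. In practice, the simplest check is to import the package into your code using the import statement; if the import succeeds, urllib is available.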

from urllib.request import urlopen
from urllib.parse import urlparse 
Import statements for the urllib package.
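
As an additional check, a short sketch like the following confirms that the modules can be located, without making any network request. The printed labels are illustrative.

```python
import importlib.util
import sys

# find_spec() returns None when a module cannot be located.
request_spec = importlib.util.find_spec("urllib.request")
parse_spec = importlib.util.find_spec("urllib.parse")

print("Python version:", sys.version.split()[0])
print("urllib.request available:", request_spec is not None)
print("urllib.parse available:", parse_spec is not None)
```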

Urllib Fundamentals

Using urllib.request

Use the “urlopen” function from the “urllib.request” module to open a connection and make a request to a specific URL. Once the connection is successfully established, call the “.read()” method of the returned response object to retrieve the HTML source code of the specified URL as bytes.

#PYNOMIAL - 2023
# Import the "urlopen" function from the "urllib.request" module
from urllib.request import urlopen

myURL = "http://www.google.com/"

# Open the URL; the "with" block closes the connection when it ends
with urlopen(myURL) as URL_data:
    # read() returns the content of the webpage as bytes
    print(URL_data.read())
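
Because “.read()” returns bytes, a common next step is decoding them into text. The sketch below wraps that in a hypothetical helper, “fetch_text”; the 10-second timeout and the UTF-8 fallback are assumptions chosen for the example. It is demonstrated with a “data:” URL, which “urlopen” can handle without touching the network.

```python
from urllib.request import urlopen

def fetch_text(url, timeout=10.0):
    """Return the response body decoded to str (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        # Prefer the charset declared in the Content-Type header;
        # fall back to UTF-8 when none is declared (an assumption).
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")

# A data: URL carries its content inline, so no network access is needed.
text = fetch_text("data:text/plain;charset=utf-8,hello%20world")
print(text)  # hello world
```

The same helper could be called with an “http://” or “https://” URL to fetch a real page.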

Using urllib.parse

Use the “urlparse” function to extract information from a URL. The components that this function extracts include the protocol scheme (scheme), such as “http,” “https,” or “ftp”; the network location (netloc), which specifies the domain name or IP address; the path of the URL (path), which identifies the resource on the server and can include directories or filenames; and the query string (query), which is used to pass additional data to the server. The example URL below has no query string, so the last value printed is empty.

#PYNOMIAL - 2023 
# import the "urlparse" function from the "urllib.parse" module  
from urllib.parse import urlparse

myURL="https://www.python.org/about/help/"
# Parse the URL into its components
parsedURL = urlparse(myURL)

# Extract the scheme, netloc, path, and query components
scheme = parsedURL.scheme
netloc = parsedURL.netloc
path = parsedURL.path
query = parsedURL.query

# Print the extracted components
print('Scheme:', scheme)
print('Netloc:', netloc)
print('Path:', path)
print('Query:', query)
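
When a URL does carry a query string, “parse_qs” turns it into a dictionary, and the parsed result can be rebuilt with modified components. This is a small sketch with an invented search URL; “_replace” and “geturl” are methods of the object returned by “urlparse”.

```python
from urllib.parse import parse_qs, urlencode, urlparse

url = "https://www.python.org/search/?q=urllib&page=2"  # illustrative URL
parsed = urlparse(url)

# parse_qs() maps each parameter name to a list of values.
params = parse_qs(parsed.query)
print(params)  # {'q': ['urllib'], 'page': ['2']}

# Change one parameter and rebuild the URL with the new query string.
params["page"] = ["3"]
next_url = parsed._replace(query=urlencode(params, doseq=True)).geturl()
print(next_url)  # https://www.python.org/search/?q=urllib&page=3
```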

Using urllib.error

Use the “HTTPError” and “URLError” exception classes to handle errors when working with URLs. In the following example, we have intentionally used a URL that returns an HTTP error (404 Not Found). Catching these exceptions makes code more robust and resilient to unexpected situations.

#PYNOMIAL - 2023
# Import the "urlopen" function and the "HTTPError" and "URLError" exception classes
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

myURL = "https://www.python.org/notfound"

try:
    # Attempt to open a connection to the URL
    URL_data = urlopen(myURL)

except HTTPError as e:
    # Handle HTTP errors (e.g., 404 Not Found); this clause comes first
    # because HTTPError is a subclass of URLError
    print('HTTP Error:', e.code, e.reason)

except URLError as e:
    # Handle URL errors (e.g., unreachable server or unknown host)
    print('URL Error:', e.reason)

else:
    # If no error occurred, read and print the content of the response
    print(URL_data.read())
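
The same pattern can be packaged as a reusable function. The sketch below is one possible shape, not a standard API: “try_fetch” is a hypothetical helper that returns a (body, error) pair instead of printing, and the timeout value is an assumption. The demonstration inputs are chosen so that no network access is needed.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def try_fetch(url, timeout=10.0):
    """Return (body, error): bytes on success, a message on failure (hypothetical helper)."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read(), None
    except HTTPError as e:
        # The server answered, but with an error status such as 404.
        return None, f"HTTP error: {e.code} {e.reason}"
    except URLError as e:
        # The request could not be completed (unknown scheme, unreachable host, ...).
        return None, f"URL error: {e.reason}"
    except ValueError as e:
        # urlopen raises ValueError for strings without a scheme, e.g. "python.org".
        return None, f"invalid URL: {e}"

print(try_fetch("data:,ok"))                 # (b'ok', None)
print(try_fetch("unknownscheme://example"))  # (None, 'URL error: ...')
print(try_fetch("python.org"))               # (None, 'invalid URL: ...')
```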

Highlights

Project Background

  • Project: urllib
  • Author: –
  • Initial Release: 2008 (as part of Python 3.0)
  • Type: URL handling package in the Python standard library
  • License: –
  • Contains: request, parse, error, and robotparser
  • Language: Python 3
  • GitHub: –
  • Runs On: Linux, Windows, macOS
  • Twitter: –

Main Features

  • urllib allows users to access and interact with websites from their Python environment.
  • It makes it possible to connect to a web page with a short code snippet.
  • The parse module simplifies the manipulation of a URL and its components.
  • It provides exception classes for handling HTTP and URL errors.
  • urllib in Python 3 contains enhancements over its older versions.

Prior Knowledge Requirements

To use this library effectively, it helps to:

  • Have a basic understanding of the Python programming language.
  • Know basic URL concepts such as protocol scheme, domain name, path, and query string.
  • Have a basic understanding of the HTTP and FTP protocols.

Versions

  • urllib in Python 3 – It is the newest package, developed from older versions.
  • urllib2 in Python 1.6 – It replaces its predecessor, providing extra features.
  • urllib in Python 1.4 – It provides access to URLs and their subsequent reading. It is an obsolete package.

References

Home page

Internet Protocols and Support
