Python Web Scraping

Opening Webpage
Making Requests
1. Requests
2. Responses
3. Status Codes
Parsing Web Documents
1. Initializing HTML
2. CSS Selectors
3. Tag Features

Opening Webpage

The webbrowser module allows displaying web-based documents to users.

webbrowser.open(url, new=0, autoraise=True) opens the page at the url string in the default browser. The new argument controls where the page appears:

If new is 0, in the same browser window, if possible;
If new is 1, in a new browser window, if possible;
If new is 2, in a new browser tab, if possible.
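
As a minimal sketch of the three values of new, the wrapper below gives them readable names. The names NEW_MODES and open_page, and the example URL, are assumptions made for illustration and are not part of the webbrowser module.

```python
import webbrowser

# Hypothetical labels for the three documented values of `new`.
NEW_MODES = {"same_window": 0, "new_window": 1, "new_tab": 2}


def open_page(url, mode="new_tab"):
    """Open url in the default browser using one of the NEW_MODES labels."""
    return webbrowser.open(url, new=NEW_MODES[mode], autoraise=True)


# Usage (left commented out so that importing this sketch opens nothing):
# open_page("https://example.com/", mode="new_window")
```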

Making Requests

The requests module is provided by the third-party package requests, a Python HTTP library. Install it from the terminal with:

python -m pip install requests

Requests

requests.get(url, params=None) sends a GET request.
- params can be a dictionary, list of tuples, or bytes to send in the query string.
requests.head(url) sends a HEAD request.
requests.post(url, data=None, json=None) sends a POST request.
- data can be a dictionary, list of tuples, bytes, or file object to send in the request body.
- json should be a JSON serializable Python object to send in the request body.
requests.put(url, data=None) sends a PUT request.
requests.delete(url) sends a DELETE request.
requests.patch(url, data=None) sends a PATCH request.
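
To see what params, data, and json do without touching the network, one option is to build a request with requests.Request and call prepare() instead of sending it. This is only a sketch: the example.com URLs and the field values are placeholders, and requests.get or requests.post would normally be called directly.

```python
import requests

# params is encoded into the query string of the URL.
get_req = requests.Request(
    "GET", "https://example.com/search", params={"q": "python", "page": 2}
).prepare()

# data given as a dictionary is form-encoded into the request body.
form_req = requests.Request(
    "POST", "https://example.com/submit", data={"name": "Ada"}
).prepare()

# json is serialized into the body and sets a JSON content type.
json_req = requests.Request(
    "POST", "https://example.com/submit", json={"name": "Ada"}
).prepare()

print(get_req.url)
print(form_req.body, form_req.headers["Content-Type"])
print(json_req.body, json_req.headers["Content-Type"])
```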

Responses

Every request function in the requests module returns a requests.Response object, referred to as res below.

Status

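res.url is the final URL of the response, after any redirects.
res.status_code is the integer HTTP status code of the response.
res.reason is the textual reason phrase of the HTTP status, such as "Not Found".
res.raise_for_status() raises requests.HTTPError if the status code indicates a client or server error (4xx or 5xx), and does nothing otherwise.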
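
The status attributes can be tried offline with a hand-built Response object. This is an illustrative shortcut, not how responses are normally obtained: in practice res comes from a call such as requests.get(url), and the URL and status values here are made up.

```python
import requests

# Build a response by hand purely for demonstration purposes.
res = requests.Response()
res.status_code = 404
res.reason = "Not Found"
res.url = "https://example.com/missing"

try:
    res.raise_for_status()
    error_message = None
except requests.HTTPError as err:
    # The message includes the status code, the reason, and the URL.
    error_message = str(err)

print(res.status_code, res.reason)
print(error_message)
```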

Content

res.encoding is the encoding used to decode the body when accessing the text property.
res.text is the content of the response, as a string.
res.content is the content of the response, in bytes.
res.iter_content(chunk_size=1) iterates over the response data, yielding at most chunk_size bytes at a time so that the whole body need not be held in memory at once.
res.json() parses the JSON-encoded content of the response and returns the resulting Python object.
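
The content attributes can be compared side by side against a throwaway local server. This is a sketch under stated assumptions: DemoHandler, its JSON payload, and the loopback address are invented for the example, and the standard-library http.server stands in for a real website.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a small JSON body."""

    def do_GET(self):
        body = json.dumps({"greeting": "hello", "count": 3}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

res = requests.get(f"http://127.0.0.1:{server.server_port}/")
encoding = res.encoding                       # taken from the charset header
text = res.text                               # body decoded to str
raw = res.content                             # body as bytes
chunks = list(res.iter_content(chunk_size=8)) # body in pieces of up to 8 bytes
data = res.json()                             # body parsed as JSON

server.shutdown()
server.server_close()
```

Joining the chunks from iter_content reproduces res.content, which is why chunked iteration is convenient for writing large downloads to a file.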

Status Codes

requests.codes is an object defining a mapping from common names for HTTP statuses to numerical codes.

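import requests

url = 'https://example.com/'  # placeholder URL, replace with the page to download
path = 'page.html'            # placeholder output path

res = requests.get(url)
if res.status_code == requests.codes['ok']:
    # Write the body to disk in 1024-byte chunks; the with block closes the file.
    with open(path, 'wb') as file:
        for chunk in res.iter_content(1024):
            file.write(chunk)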
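
As a small offline illustration, requests.codes can be read by key or by attribute. The particular statuses below are picked only as examples.

```python
import requests

# Key lookup and attribute lookup return the same integer code.
ok_by_key = requests.codes["ok"]
ok_by_attr = requests.codes.ok
not_found = requests.codes.not_found
forbidden = requests.codes.forbidden

print(ok_by_key, ok_by_attr, not_found, forbidden)
```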

Parsing Web Documents

Beautiful Soup, imported as the bs4 module, is a library for extracting information from HTML pages. Install it with:

python -m pip install beautifulsoup4

Initializing HTML

import bs4
soup = bs4.BeautifulSoup(document, 'html.parser')

bs4.BeautifulSoup(document, parser) takes in a string or an open file object and returns a BeautifulSoup object. The parser argument names the parser to use, such as 'html.parser' from the standard library.
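
A minimal sketch of both input forms follows. The HTML string is made up, and io.StringIO stands in for an open file object so the example needs no file on disk.

```python
import io

import bs4

html = "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"

# From a string.
soup_from_str = bs4.BeautifulSoup(html, "html.parser")

# From an open file-like object.
with io.StringIO(html) as fh:
    soup_from_file = bs4.BeautifulSoup(fh, "html.parser")

print(type(soup_from_str).__name__)
print(soup_from_file.select("title")[0].get_text())
```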

CSS Selectors

soup.select(selector) takes in a string of a CSS selector and returns a list of Tag objects.

| Purpose | Syntax |
| --- | --- |
| Type selector | `tag` |
| Class selector | `'.' + class` |
| ID selector | `'#' + id` |
| Attribute selector | `f'[{attr}]'`, `f'[{attr}={value}]'` |
| Grouping selectors | `','` |
| Descendant combinator | `' '` |
| Child combinator | `' > '` |
| General sibling combinator | `' ~ '` |
| Adjacent sibling combinator | `' + '` |
| Pseudo classes | `':'` |
| Pseudo elements | `'::'` |
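
Several of the selectors in the table can be tried on a small document. This is a sketch: the HTML, ids, and class names are invented for the example, and selector support depends on the installed Beautiful Soup version.

```python
import bs4

document = """
<html><body>
  <div id="main">
    <p class="intro">Hello</p>
    <p>World</p>
    <a href="https://example.com/" title="Example">link</a>
  </div>
  <p class="intro">Outside</p>
</body></html>
"""
soup = bs4.BeautifulSoup(document, "html.parser")

by_type = soup.select("p")                   # type selector
by_class = soup.select(".intro")             # class selector
by_id = soup.select("#main")                 # ID selector
by_attr = soup.select("[href]")              # attribute presence
by_value = soup.select('[title="Example"]')  # attribute with value
grouped = soup.select("a, .intro")           # grouping
descendants = soup.select("#main p")         # descendant combinator
children = soup.select("#main > p")          # child combinator
later = soup.select("#main ~ p")             # general sibling combinator
adjacent = soup.select("p.intro + p")        # adjacent sibling combinator

print([tag.get_text() for tag in by_type])
print([tag.get_text() for tag in adjacent])
```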

Tag Features

Name

tag.name is the name of the HTML element, such as 'a' or 'p'.

Attributes

tag.attrs is a dictionary mapping the tag's attribute names to their values.

For convenience, a Tag object can be indexed like a dictionary to access its attributes, for example tag['href'].
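
A short sketch of these features on a single made-up link element follows. Note that Beautiful Soup returns the class attribute as a list, because an element can carry several classes.

```python
import bs4

html = '<a id="home" class="nav link" href="/index.html">Home</a>'
soup = bs4.BeautifulSoup(html, "html.parser")
tag = soup.select("a")[0]

name = tag.name          # element name
attrs = tag.attrs        # all attributes as a dictionary
href = tag["href"]       # dictionary-style access, raises KeyError if absent
title = tag.get("title") # returns None when the attribute is missing

print(name, attrs, href, title)
```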

