- Opening Webpage
- Making Requests
- Requests
- Responses
- Status Codes
- Parsing Web Documents
- Initializing HTML
- CSS Selectors
- Tag Features
The webbrowser module allows displaying web-based documents to users.
webbrowser.open(url, new=0, autoraise=True) opens the webpage specified by the url string in the default browser. The new argument controls where the page opens:
- If new is 0, in the same browser window, if possible;
- If new is 1, in a new browser window, if possible;
- If new is 2, in a new browser tab, if possible.
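As a minimal sketch of the call above, the following builds a URL and opens it in a new tab. The example.com address and the helper names build_search_url and open_in_new_tab are hypothetical, chosen only for illustration.

```python
import webbrowser
from urllib.parse import urlencode


def build_search_url(query):
    """Build a URL with the query safely encoded (example.com is a placeholder)."""
    return "https://www.example.com/search?" + urlencode({"q": query})


def open_in_new_tab(url):
    """Ask the default browser to open url in a new tab (new=2).

    Returns True if a browser could be launched, False otherwise.
    """
    return webbrowser.open(url, new=2)
```

Calling open_in_new_tab(build_search_url("python docs")) should open the page, although the browser is free to ignore the new-tab preference.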
The requests module is offered by the third-party package requests,
which is a Python HTTP library. Install through the terminal by:

    python -m pip install requests

- requests.get(url, params=None) sends a GET request.
  - params can be a dictionary, list of tuples, or bytes to send in the query string.
- requests.head(url) sends a HEAD request.
- requests.post(url, data=None, json=None) sends a POST request.
  - data can be a dictionary, list of tuples, bytes, or file object to send in the request body.
  - json should be a JSON serializable Python object to send in the request body.
- requests.put(url, data=None) sends a PUT request.
- requests.delete(url) sends a DELETE request.
- requests.patch(url) sends a PATCH request.
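To see what params, data, and json turn into, the sketch below prepares requests without sending them. It uses requests.Request(...).prepare() instead of requests.get or requests.post so that it runs without network access; the helper functions take the same three arguments. The example.com URLs and the payload values are placeholders.

```python
import json

import requests

# params is encoded into the query string of the URL.
query = requests.Request(
    "GET", "https://example.com/search", params={"q": "python", "page": "2"}
).prepare()

# data given as a dictionary is form-encoded into the request body.
form = requests.Request(
    "POST", "https://example.com/items", data={"name": "widget"}
).prepare()

# json is serialized to JSON and sent with a JSON content type.
payload = requests.Request(
    "POST", "https://example.com/items", json={"name": "widget"}
).prepare()
decoded_payload = json.loads(payload.body)
```

Here query.url carries the encoded parameters, while form.body and payload.body hold the two styles of request body.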
Every request function in the requests module returns a requests.Response object.
Status
- res.url is the final URL location of the response.
- res.status_code is the integer code of the responded HTTP status.
- res.reason is the textual reason of the responded HTTP status.
- res.raise_for_status() raises an HTTPError if the status code indicates a client or server error (4xx or 5xx).
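A minimal sketch of checking the status of a response, assuming the caller wants a short summary string. The names status_summary and fetch_status, and the 10-second timeout, are illustrative choices rather than part of the library.

```python
import requests


def status_summary(res):
    """Return 'code reason' for a response, raising requests.HTTPError on 4xx/5xx."""
    res.raise_for_status()
    return f"{res.status_code} {res.reason}"


def fetch_status(url):
    """Send a GET request and describe the outcome as text."""
    try:
        res = requests.get(url, timeout=10)
        return status_summary(res)
    except requests.HTTPError as err:
        # The error message includes the status code, reason, and URL.
        return f"request failed: {err}"
```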
Content
- res.encoding is the encoding used to decode the body when accessing the text property.
- res.text is the content of the response, in text.
- res.content is the content of the response, in bytes.
- res.iter_content(chunk_size=1) iterates over the response data, reading chunk_size bytes into memory at a time.
- res.json() returns the JSON-encoded content of the response as a Python object, utilizing json.loads.
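The three views of a response body can be sketched as follows, assuming the server returns JSON. The names body_views and fetch_views are hypothetical, and the optional encoding override is only needed when the detected encoding is wrong.

```python
import requests


def body_views(res):
    """Return the raw bytes, the decoded text, and the parsed JSON of a response."""
    raw = res.content   # bytes, exactly as received
    text = res.text     # str, decoded using res.encoding
    data = res.json()   # Python object; raises an error if the body is not JSON
    return raw, text, data


def fetch_views(url, encoding=None):
    """Fetch url and return its body in all three forms."""
    res = requests.get(url, timeout=10)
    res.raise_for_status()
    if encoding is not None:
        # Set the encoding before reading res.text to override the detected one.
        res.encoding = encoding
    return body_views(res)
```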
requests.codes is an object defining a mapping from common names for HTTP statuses to numerical codes.
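A short sketch of the lookups this mapping supports, by attribute or by key. The helper is_success is a hypothetical name used only for illustration.

```python
import requests

# The same code can be looked up as an attribute or as a dictionary key.
ok_by_attr = requests.codes.ok
ok_by_key = requests.codes["ok"]
not_found = requests.codes.not_found


def is_success(status_code):
    """Return True when the status code is exactly 200 OK."""
    return status_code == requests.codes.ok
```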
For example, the following saves a response body to a file only when the status is OK (url and path are placeholders):

    import requests

    res = requests.get(url)
    if res.status_code == requests.codes['ok']:
        # Write the body to disk in 1024-byte chunks.
        with open(path, 'wb') as file:
            for chunk in res.iter_content(1024):
                file.write(chunk)

Beautiful Soup, available through the bs4 module,
is a library for extracting information from an HTML page. Install by:

    python -m pip install beautifulsoup4

bs4.BeautifulSoup(document, parser) takes in a string or an open file object and returns a BeautifulSoup object:

    import bs4
    soup = bs4.BeautifulSoup(document, 'html.parser')
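As a small illustration of both input types, the sketch below parses the same markup once from a string and once from a file-like object. The HTML snippet is made up, and io.StringIO stands in for an open file so the example needs no file on disk.

```python
import io

import bs4

html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# Parse directly from a string.
soup_from_str = bs4.BeautifulSoup(html, "html.parser")

# Parse from a file-like object; an object returned by open() works the same way.
soup_from_file = bs4.BeautifulSoup(io.StringIO(html), "html.parser")

title = soup_from_str.title.string
first_paragraph = soup_from_file.p.get_text()
```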
soup.select(selector) takes in a string of a CSS selector and returns a list of Tag objects.
| Purpose | Syntax |
|---|---|
| Type selector | tag |
| Class selector | '.' + class |
| ID selector | '#' + id |
| Attribute selector | f'[{attr}]' f'[{attr}={value}]' |
| Grouping selectors | ',' |
| Descendant combinator | ' ' |
| Child combinator | ' > ' |
| General sibling combinator | ' ~ ' |
| Adjacent sibling combinator | ' + ' |
| Pseudo classes | ':' |
| Pseudo elements | '::' |
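Several of the selectors in the table can be tried on a small document. The HTML below is invented for illustration; each select call returns a list of matching Tag objects.

```python
import bs4

html = """
<div id="main">
  <p class="intro">First</p>
  <p>Second</p>
  <a href="https://example.com">Link</a>
  <span><a>No href</a></span>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

paragraphs = soup.select("p")                          # type selector
intro = soup.select(".intro")                          # class selector
main = soup.select("#main")                            # ID selector
links = soup.select("a[href]")                         # attribute present
exact = soup.select('a[href="https://example.com"]')   # attribute equals value
grouped = soup.select("p, a")                          # grouping
all_links = soup.select("#main a")                     # descendant combinator
direct_links = soup.select("#main > a")                # child combinator
later_links = soup.select(".intro ~ a")                # general sibling combinator
after_intro = soup.select(".intro + p")                # adjacent sibling combinator
```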
Name
tag.name is the name of the HTML element tag.
Attributes
tag.attrs is a dictionary mapping the attribute names of the tag to their values.
For simplicity, access an attribute by indexing the Tag object like a dictionary, as in tag['href'].
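A brief sketch of these features on a single made-up anchor element. Note that with this parser the class attribute comes back as a list, because an element may carry several classes.

```python
import bs4

soup = bs4.BeautifulSoup(
    '<a id="home" class="nav primary" href="/index.html">Home</a>', "html.parser"
)
tag = soup.select("a#home")[0]

name = tag.name             # name of the element
attrs = tag.attrs           # dictionary of all attributes
href = tag["href"]          # dictionary-style access; raises KeyError if missing
title = tag.get("title")    # like dict.get, returns None when the attribute is absent
classes = tag["class"]      # multi-valued attribute, returned as a list
```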
Sweigart, A. (2015). Automate the Boring Stuff With Python: Practical Programming for Total Beginners. San Francisco, CA: No Starch Press.