Commit 8a93279

committed
First version.
1 parent cd91f5f commit 8a93279

14 files changed

Lines changed: 4742 additions & 10 deletions

12_bs4_requests.ipynb

Lines changed: 251 additions & 0 deletions
@@ -0,0 +1,251 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"id": "daba7cc8-d507-422a-ae76-9cea8809a646",
6+
"metadata": {
7+
"papermill": {},
8+
"tags": []
9+
},
10+
"source": [
11+
"<img width=\"8%\" alt=\"BeautifulSoup.png\" src=\"https://raw.githubusercontent.com/jupyter-naas/awesome-notebooks/master/.github/assets/logos/BeautifulSoup.png\" style=\"border-radius: 15%\">"
12+
]
13+
},
14+
{
15+
"cell_type": "markdown",
16+
"id": "2778f16d-7641-4958-bc78-3dde9c493d65",
17+
"metadata": {
18+
"papermill": {},
19+
"tags": []
20+
},
21+
"source": [
22+
"# BeautifulSoup - List social network links from website\n",
23+
"\n",
24+
"Largely inspired by https://github.com/jupyter-naas/awesome-notebooks/blob/master/BeautifulSoup/BeautifulSoup_List_social_network_links_from_website.ipynb"
25+
]
26+
},
27+
{
28+
"cell_type": "markdown",
29+
"id": "8f1a033f-9362-43d8-8a8f-1878bd2115c4",
30+
"metadata": {
31+
"papermill": {},
32+
"tags": []
33+
},
34+
"source": [
35+
"**Description:** This notebook uses BeautifulSoup to list all the social network links found on a web page. It is useful for organizations that want a quick list of the social networks they are present on."
36+
]
37+
},
38+
{
39+
"cell_type": "markdown",
40+
"id": "5a50c831-2700-4fb5-acee-ed0513446815",
41+
"metadata": {
42+
"papermill": {},
43+
"tags": []
44+
},
45+
"source": [
46+
"**References:**\n",
47+
"- [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)"
48+
]
49+
},
50+
{
51+
"cell_type": "markdown",
52+
"id": "9df24602-3917-4f76-b29a-2be92dea558d",
53+
"metadata": {
54+
"papermill": {},
55+
"tags": []
56+
},
57+
"source": "## Import libraries"
58+
},
59+
{
60+
"cell_type": "code",
61+
"id": "07f7c93e-5f43-4084-a434-d3a591046738",
62+
"metadata": {
63+
"papermill": {},
64+
"tags": [],
65+
"ExecuteTime": {
66+
"end_time": "2024-10-13T12:22:09.049638Z",
67+
"start_time": "2024-10-13T12:22:09.047611Z"
68+
}
69+
},
70+
"source": [
71+
"import requests\n",
72+
"from bs4 import BeautifulSoup\n",
73+
"from pprint import pprint"
74+
],
75+
"outputs": [],
76+
"execution_count": 33
77+
},
78+
{
79+
"cell_type": "markdown",
80+
"id": "f0fc7961-2be2-4c4e-a915-e5683952df41",
81+
"metadata": {
82+
"papermill": {},
83+
"tags": []
84+
},
85+
"source": [
86+
"### Set up variables\n",
87+
"- `url`: The URL of the page you want to extract social network links from\n",
88+
"- `social_network_links`: List of social network links extracted from the page"
89+
]
90+
},
91+
{
92+
"cell_type": "code",
93+
"id": "8f88972c-5b30-4656-9e8f-6e9215665131",
94+
"metadata": {
95+
"papermill": {},
96+
"tags": [],
97+
"ExecuteTime": {
98+
"end_time": "2024-10-13T12:22:09.052633Z",
99+
"start_time": "2024-10-13T12:22:09.050699Z"
100+
}
101+
},
102+
"source": [
103+
"# Inputs\n",
104+
"url = \"https://www.papit.fr/utiles.html\"\n",
105+
"\n",
106+
"# Outputs\n",
107+
"social_network_links = []"
108+
],
109+
"outputs": [],
110+
"execution_count": 34
111+
},
112+
{
113+
"cell_type": "markdown",
114+
"id": "e8d53777-9a93-49ee-a8e7-79bbdf27e029",
115+
"metadata": {
116+
"papermill": {},
117+
"tags": []
118+
},
119+
"source": "## Get social network links"
120+
},
121+
{
122+
"cell_type": "code",
123+
"id": "2577d407-9da1-431a-b11c-2624a8b749e0",
124+
"metadata": {
125+
"papermill": {},
126+
"tags": [],
127+
"ExecuteTime": {
128+
"end_time": "2024-10-13T12:22:09.056374Z",
129+
"start_time": "2024-10-13T12:22:09.053464Z"
130+
}
131+
},
132+
"source": [
133+
"def get_social_network_links(url, social_network_links):\n",
134+
"    # Make a GET request to the URL and get the HTML content\n",
135+
"    response = requests.get(url)\n",
136+
"    html_content = response.text\n",
137+
"\n",
138+
"    # Create a BeautifulSoup object to parse the HTML content\n",
139+
"    soup = BeautifulSoup(html_content, 'html.parser')\n",
140+
"\n",
141+
"    # Find all the links on the page\n",
142+
"    links = soup.find_all('a')\n",
143+
"\n",
144+
"    # Loop through the links and find the social network links\n",
145+
"    social_networks = ['facebook', 'twitter', 'linkedin', 'instagram', 'github', 'youtube']\n",
146+
"    for link in links:\n",
147+
"        href = link.get('href')\n",
148+
"        if href:\n",
149+
"            for social in social_networks:\n",
150+
"                if social in href:\n",
151+
"                    if href not in social_network_links:\n",
152+
"                        social_network_links.append(href)\n",
153+
"    return social_network_links"
154+
],
155+
"outputs": [],
156+
"execution_count": 35
157+
},
158+
{
159+
"cell_type": "markdown",
160+
"id": "cb5902d1-db8b-4fbd-bbf2-4b660c21a5f2",
161+
"metadata": {
162+
"papermill": {},
163+
"tags": []
164+
},
165+
"source": "## Crawl the website and display results"
166+
},
167+
{
168+
"cell_type": "code",
169+
"id": "92c2d4d5-58c6-48da-ae99-ad5bfc5b97b9",
170+
"metadata": {
171+
"papermill": {},
172+
"tags": [],
173+
"ExecuteTime": {
174+
"end_time": "2024-10-13T12:22:09.429263Z",
175+
"start_time": "2024-10-13T12:22:09.057161Z"
176+
}
177+
},
178+
"source": [
179+
"social_network_links = get_social_network_links(url, social_network_links)\n",
180+
"pprint(social_network_links)"
181+
],
182+
"outputs": [
183+
{
184+
"name": "stdout",
185+
"output_type": "stream",
186+
"text": [
187+
"['https://github.com/phe-sto',\n",
188+
" 'https://github.com/pcko1/bscscan-python',\n",
189+
" 'https://github.com/Polve/bitcoin-rpc-client',\n",
190+
" 'https://github.com/HydraCG/Specifications',\n",
191+
" 'https://github.com/mingqian/zigbee-viewer',\n",
192+
" 'https://github.com/CodeforFR/enthic-dataviz',\n",
193+
" 'https://www.youtube.com/@Computerphile',\n",
194+
" 'https://shellchocolat.github.io//',\n",
195+
" 'https://github.com/papit-fr/papit-frontend']\n"
196+
]
197+
}
198+
],
199+
"execution_count": 36
200+
},
201+
{
202+
"cell_type": "markdown",
203+
"id": "e3c324dc-fea0-47da-8f89-2747ab5fa5c0",
204+
"metadata": {
205+
"papermill": {},
206+
"tags": []
207+
},
208+
"source": [
209+
" "
210+
]
211+
}
212+
],
213+
"metadata": {
214+
"kernelspec": {
215+
"display_name": "Python 3",
216+
"language": "python",
217+
"name": "python3"
218+
},
219+
"language_info": {
220+
"codemirror_mode": {
221+
"name": "ipython",
222+
"version": 3
223+
},
224+
"file_extension": ".py",
225+
"mimetype": "text/x-python",
226+
"name": "python",
227+
"nbconvert_exporter": "python",
228+
"pygments_lexer": "ipython3",
229+
"version": "3.9.6"
230+
},
231+
"naas": {
232+
"notebook_id": "200a5dcebfc9ff32f08e84aaba44cb6125fbc8bbde5f686f467b8626c7ef5f78",
233+
"notebook_path": "BeautifulSoup/BeautifulSoup_List_social_network_links_from_website.ipynb"
234+
},
235+
"papermill": {
236+
"default_parameters": {},
237+
"environment_variables": {},
238+
"parameters": {},
239+
"version": "2.4.0"
240+
},
241+
"widgets": {
242+
"application/vnd.jupyter.widget-state+json": {
243+
"state": {},
244+
"version_major": 2,
245+
"version_minor": 0
246+
}
247+
}
248+
},
249+
"nbformat": 4,
250+
"nbformat_minor": 5
251+
}
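
The notebook's `get_social_network_links` matches each keyword as a substring of the whole `href`, which is why `https://shellchocolat.github.io//` appears in the recorded output next to the `github.com` links. If that is not wanted, one possible alternative is to compare the URL's hostname against a list of domains. The sketch below is not part of the commit: `filter_social_links`, `SOCIAL_DOMAINS`, and `sample_hrefs` are hypothetical names introduced for illustration, and it works on a plain list of `href` strings (for example, the values collected from `soup.find_all('a')`) so it runs without network access.

```python
from urllib.parse import urlparse

# Hypothetical domain list for illustration; the notebook itself matches the
# bare keywords 'facebook', 'twitter', ... anywhere in the href.
SOCIAL_DOMAINS = (
    "facebook.com",
    "twitter.com",
    "linkedin.com",
    "instagram.com",
    "github.com",
    "youtube.com",
)


def filter_social_links(hrefs, domains=SOCIAL_DOMAINS):
    """Return unique hrefs whose host is one of `domains` or a subdomain of it."""
    found = []
    for href in hrefs:
        if not href:
            # Skip anchors without an href attribute (None) or with an empty one.
            continue
        # Relative links have no hostname, so fall back to an empty string.
        host = (urlparse(href).hostname or "").lower()
        if any(host == d or host.endswith("." + d) for d in domains):
            if href not in found:
                found.append(href)
    return found


# Sample values: three taken from the notebook's recorded output (one repeated),
# plus a missing href and a relative link.
sample_hrefs = [
    "https://github.com/phe-sto",
    "https://www.youtube.com/@Computerphile",
    "https://shellchocolat.github.io//",
    "https://github.com/phe-sto",
    None,
    "/utiles.html",
]
result = filter_social_links(sample_hrefs)
print(result)
```

Matching on the hostname is stricter than the notebook's keyword test: it keeps `www.youtube.com` (a subdomain of `youtube.com`) but drops `shellchocolat.github.io`, and it also ignores URLs that merely mention a network's name in their path. Whether that trade-off is desirable depends on what should count as a social network link.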
