
Commit 70f1602
1 parent 9fdf00c commit 70f1602
2 files changed: 15 additions & 1 deletion


README.md

Lines changed: 3 additions & 1 deletion

@@ -21,10 +21,12 @@ Surprisingly, the only thing that tells a server the application triggered the r
 
 ## The source code
 
-The project code in this repository is crawling two different public proxy websites:
+The project code in this repository is crawling three different public proxy websites:
 * http://proxyfor.eu/geo.php
 * http://free-proxy-list.net
+* http://rebro.weebly.com/proxy-list.html
 
 After collecting the proxy data and filtering the slowest ones it is randomly selecting one of them to query the target url.
 The request timeout is configured at 30 seconds and if the proxy fails to return a response it is deleted from the application proxy list.
 I have to mention that for each request a different agent header is used. The different headers are stored in the **/data/user_agents.txt** file which contains around 900 different agents.
+
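The loop the README describes — choose a random proxy, attach a random User-Agent header, fire the request with a 30-second timeout, and delete any proxy that fails to respond — can be sketched roughly as below. This is a minimal stand-alone sketch, not the project's actual API: `proxied_get` and its parameters are illustrative names.

```python
import random
import requests

def proxied_get(url, proxy_list, user_agents, req_timeout=30):
    """Sketch: fetch `url` through a randomly chosen proxy with a random
    User-Agent header, dropping any proxy that fails to respond in time."""
    while proxy_list:
        proxy = random.choice(proxy_list)
        headers = {"User-Agent": random.choice(user_agents)}
        try:
            return requests.get(url, headers=headers,
                                proxies={"http": proxy, "https": proxy},
                                timeout=req_timeout)
        except requests.exceptions.RequestException:
            proxy_list.remove(proxy)  # failed proxy: delete it from the list
    raise RuntimeError("no working proxies left")
```

Because `proxy_list` is mutated in place, repeated calls naturally shrink the pool down to proxies that actually answer, which matches the "deleted from the application proxy list" behavior the README mentions.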

project/http/requests/proxy/requestProxy.py

Lines changed: 12 additions & 0 deletions
@@ -21,6 +21,7 @@ def __init__(self, web_proxy_list=[]):
         self.proxy_list = web_proxy_list
         self.proxy_list += self.proxyForEU_url_parser('http://proxyfor.eu/geo.php', 100.0)
         self.proxy_list += self.freeProxy_url_parser('http://free-proxy-list.net')
+        self.proxy_list += self.weebly_url_parser('http://rebro.weebly.com/proxy-list.html')
 
     def get_proxy_list(self):
         return self.proxy_list

@@ -115,6 +116,17 @@ def freeProxy_url_parser(self, web_url):
         #print "ALL: ", curr_proxy_list
         return curr_proxy_list
 
+    def weebly_url_parser(self, web_url):
+        curr_proxy_list = []
+        content = requests.get(web_url).content
+        soup = BeautifulSoup(content, "html.parser")
+        table = soup.find("div", attrs={"class": "paragraph", 'style': "text-align:left;"}).find('font', attrs={'color': '#33a27f'})
+
+        for row in [x for x in table.contents if getattr(x, 'name', None) != 'br']:
+            proxy = "http://" + row
+            curr_proxy_list.append(proxy.__str__())
+        return curr_proxy_list
+
     def generate_proxied_request(self, url, params={}, req_timeout=30):
         #if len(self.proxy_list) < 2:
         #    self.proxy_list += self.proxyForEU_url_parser('http://proxyfor.eu/geo.php')
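The new `weebly_url_parser` relies on a BeautifulSoup detail: `table.contents` interleaves plain text nodes (the proxy addresses) with `<br>` tags, and tags carry a `name` attribute while text nodes generally do not, so `getattr(x, 'name', None) != 'br'` keeps only the addresses. A stdlib-only sketch of that filtering step — the `Tag` class below is a hypothetical stand-in for BeautifulSoup tag objects, not part of the project:

```python
class Tag:
    """Hypothetical stand-in for a BeautifulSoup tag, which carries a `name`."""
    def __init__(self, name):
        self.name = name

# Shape of `table.contents`: address strings interleaved with <br> tags.
contents = ["1.2.3.4:8080", Tag("br"), "5.6.7.8:3128", Tag("br")]

# Plain strings have no `name`, so the getattr default filters out only <br> tags.
rows = [x for x in contents if getattr(x, "name", None) != "br"]
proxies = ["http://" + str(row) for row in rows]
print(proxies)  # → ['http://1.2.3.4:8080', 'http://5.6.7.8:3128']
```

This selector is tightly coupled to the page's markup (`div.paragraph` with an inline style, a `<font color="#33a27f">` wrapper), so the parser will break silently if rebro.weebly.com restyles its proxy list.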
