Commit c7d209d

first shot at adding web drivers for hard urls! (#77)
* first shot at adding web drivers for hard urls!
* tweak to dockerfile and failed tests
* update test
* update fake user agent to be fixed fork
* catch any general error with browser and try driver
  updating test cases with harder urls (e.g., that return 400)
  exception fallback should try driver too
* increase timeout for twitter status

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
1 parent 2416589 commit c7d209d

16 files changed

Lines changed: 324 additions & 46 deletions

.dockerignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+geckodriver
+chromedriver

.github/workflows/main.yml

Lines changed: 24 additions & 21 deletions
@@ -1,31 +1,34 @@
-name: docker-deploy
+name: Build and Deploy containers
 
 on:
+  # Always test on pull request
+  pull_request: []
+
+  # Deploy on merge to main
   push:
-    branches:
-    - master
+    branches:
+    - main
 
 jobs:
-  testing-docker:
+  deploy-test-containers:
     runs-on: ubuntu-latest
+    name: Build Container
     steps:
-    - uses: actions/checkout@v3
-    - name: Build container image
-      run: |
-        docker build -t quay.io/urlstechie/urlchecker .
-        DOCKER_TAG=$(docker run quay.io/urlstechie/urlchecker --version)
-        printf "Docker Tag is ${DOCKER_TAG}\n"
-        echo "DOCKER_TAG=${DOCKER_TAG}" >> $GITHUB_ENV
-    - name: Docker login
-      env:
-        docker_user: ${{ secrets.DOCKER_USERNAME }}
-        docker_pass: ${{ secrets.DOCKER_PASSWORD }}
+    - name: Checkout
+      uses: actions/checkout@v3
+
+    - name: Build
       run: |
-        docker login -u="${docker_user}" -p="${docker_pass}" quay.io
-    - name: Push containers
+        docker build -t ghcr.io/urlstechie/urlchecker .
+        DOCKER_TAG=$(docker run ghcr.io/urlstechie/urlchecker --version)
+        printf "Docker Tag is ${DOCKER_TAG}\n"
+        echo "DOCKER_TAG=${DOCKER_TAG}" >> $GITHUB_ENV
+    - name: Login and Deploy Test Container
+      if: (github.event_name != 'pull_request')
       run: |
+        docker images
         printf "Docker Tag is ${DOCKER_TAG}\n"
-        docker tag quay.io/urlstechie/urlchecker:latest "quay.io/urlstechie/urlchecker:${DOCKER_TAG}"
-        docker push quay.io/urlstechie/urlchecker:latest
-        docker push "quay.io/urlstechie/urlchecker:${DOCKER_TAG}"
-
+        echo ${{ secrets.GITHUB_TOKEN }} | docker login ghcr.io -u ${{ secrets.GHCR_USERNAME }} --password-stdin
+        docker tag ghcr.io/urlstechie/urlchecker:latest "ghcr.io/urlstechie/urlchecker:${DOCKER_TAG}"
+        docker push ghcr.io/urlstechie/urlchecker:latest
+        docker push "ghcr.io/urlstechie/urlchecker:${DOCKER_TAG}"

.github/workflows/test.yml

Lines changed: 8 additions & 2 deletions
@@ -41,7 +41,6 @@ jobs:
         pip install types-requests
         mypy urlchecker
 
-
   testing:
     needs: type_checking
     runs-on: ubuntu-latest
@@ -50,11 +49,18 @@ jobs:
     - name: Setup testing environment
       run: conda create --quiet --name testing pytest
 
+    - name: Download ChromeDriver
+      run: |
+        wget https://chromedriver.storage.googleapis.com/103.0.5060.134/chromedriver_linux64.zip
+        unzip chromedriver_linux64.zip
+        rm chromedriver_linux64.zip
+
     - name: Test
       run: |
         export PATH="/usr/share/miniconda/bin:$PATH"
         source activate testing
-        pip install .
+        pip install git+https://github.com/danger89/fake-useragent.git
+        pip install .[all]
         pip install -r tests/test-requirements.txt
         pytest -vs -x --cov=./urlchecker tests/test_*.py
 

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -3,6 +3,10 @@ __pycache__/
 *.py[cod]
 *$py.class
 
+# Web drivers
+chromedriver
+geckodriver
+
 # C extensions
 *.so
 

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ and **Merged pull requests**. Critical items to know are:
 Referenced versions in headers are tagged on Github, in parentheses are for pypi.
 
 ## [vxx](https://github.com/urlstechie/urlschecker-python/tree/master) (master)
+- adding support for web driver for harder URLs (0.0.31)
 - use ANSI escape sequences for colors, fake-useragent for agents (0.0.30)
 - adding type hints to code, more tests and logging bug fix (0.0.29)
 - decrease verbosity when filename is None (0.0.28)

Dockerfile

Lines changed: 17 additions & 4 deletions
@@ -1,23 +1,36 @@
 FROM bitnami/minideb:buster
-# docker build -t urlschecker .
+# docker build -t ghcr.io/urlstechie/urlchecker .
 WORKDIR /code
 ENV PATH /opt/conda/bin:${PATH}
 ENV LANG C.UTF-8
 ENV SHELL /bin/bash
 RUN apt-get update && \
-    /bin/bash -c "install_packages wget bzip2 ca-certificates git && \
+    /bin/bash -c "install_packages wget bzip2 ca-certificates git unzip gnupg2 && \
+    install_packages libglib2.0-dev libnss3 libfontconfig1 libgconf-2-4 && \
+    install_packages libxcb-randr0-dev libxcb-xtest0-dev libxcb-xinerama0-dev libxcb-shape0-dev libxcb-xkb-dev && \
     wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
     bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda && \
     rm Miniconda3-latest-Linux-x86_64.sh && \
     conda create --name urlchecker && \
     conda clean --all -y"
+
+# Google chrome binary
+RUN /bin/bash -c 'wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
+    && echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list && \
+    apt-get update && apt-get -y install google-chrome-stable'
+
 COPY . /code
 RUN /bin/bash -c "source activate urlchecker && \
     which python && \
     which pip && \
     pip install --upgrade certifi && \
-    pip install ."
+    pip install git+https://github.com/danger89/fake-useragent.git && \
+    pip install .[all]"
+# Download chrome driver for selenium
+RUN /bin/bash -c "wget https://chromedriver.storage.googleapis.com/103.0.5060.134/chromedriver_linux64.zip && \
+    unzip chromedriver_linux64.zip && \
+    rm chromedriver_linux64.zip"
 RUN echo "source activate urlchecker" > ~/.bashrc
-ENV PATH /opt/conda/envs/urlchecker/bin:${PATH}
+ENV PATH /code:/opt/conda/envs/urlchecker/bin:${PATH}
 ENTRYPOINT ["urlchecker"]
 CMD ["check", "--help"]

README.md

Lines changed: 28 additions & 1 deletion
@@ -20,7 +20,14 @@ A detailed documentation of the code is available under [urlchecker-python.readt
 
 ### Install
 
-You can install the urlchecker from [pypi](https://pypi.org/project/urlchecker):
+You can install the urlchecker from [pypi](https://pypi.org/project/urlchecker).
+Before you do, it's recommended to install fake-useragent from:
+
+```bash
+pip install git+https://github.com/danger89/fake-useragent.git
+```
+
+And then urlchecker:
 
 ```bash
 $ pip install urlchecker
@@ -557,6 +564,26 @@ In the "client" folder, for example, the commands that are exposed for the clien
 Functions for Github are be provided in `main/github.py`. This organization should
 be fairly straight forward to always find what you are looking for.
 
+### Drivers
+
+To test more difficult urls, we use a web driver, and you can choose between:
+
+- [Chrome Driver](https://chromedriver.chromium.org/downloads)
+- [Gecko Driver](https://github.com/mozilla/geckodriver/releases) (firefox)
+
+both to be used with selenium. This driver is optional, but will come by default with our action. To install
+it, you can download the driver at either of the links above and ensure you install selenium:
+
+```bash
+$ pip install urlchecker[selenium]
+```
+and either:
+
+1. Add it directly to your path
+2. Export the directory where it lives as `URLCHECKER_DRIVERS_PATH`
+3. Put it in the root of the urlchecker clone (it will be looked for here)
+
+
 ## Support
 
 If you need help, or want to suggest a project for the organization,
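
Note on the Drivers section above: the README lists three places a driver binary can live (on the `PATH`, in the directory named by `URLCHECKER_DRIVERS_PATH`, or in the root of the clone). The diff shown here does not include the lookup code itself, so the following is only a minimal sketch of how such a search could be ordered. The function name `find_driver` and its `repo_root` parameter are hypothetical and not taken from urlchecker.

```python
import os
import shutil
from typing import Optional


def find_driver(name: str = "chromedriver", repo_root: Optional[str] = None) -> Optional[str]:
    """Return the path to an executable driver, or None if none is found.

    Hypothetical helper: the search order mirrors the three README options.
    """
    # 1. The driver is already on the PATH.
    found = shutil.which(name)
    if found:
        return found

    # 2. The directory exported as URLCHECKER_DRIVERS_PATH.
    # 3. The root of the clone (passed in here as an assumed parameter).
    for directory in (os.environ.get("URLCHECKER_DRIVERS_PATH"), repo_root):
        if not directory:
            continue
        # shutil.which accepts a custom search path and checks the executable bit.
        found = shutil.which(name, path=directory)
        if found:
            return found
    return None
```

A caller could then decide to skip the selenium fallback entirely when `find_driver()` returns `None`, which matches the README's statement that the driver is optional.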

setup.py

Lines changed: 11 additions & 7 deletions
@@ -21,12 +21,11 @@ def get_lookup():
 
 # Read in requirements
 def get_reqs(lookup=None, key="INSTALL_REQUIRES"):
-    """get requirements, mean reading in requirements and versions from
-    the lookup obtained with get_lookup"""
-
-    if lookup == None:
-        lookup = get_lookup()
-
+    """
+    Get requirements, mean reading in requirements and versions from
+    the lookup obtained with get_lookup
+    """
+    lookup = lookup or get_lookup()
     install_requires = []
     for module in lookup[key]:
         module_name = module[0]
@@ -67,6 +66,8 @@ def get_reqs(lookup=None, key="INSTALL_REQUIRES"):
 
 INSTALL_REQUIRES = get_reqs(lookup)
 TESTS_REQUIRES = get_reqs(lookup, "TESTS_REQUIRES")
+INSTALL_REQUIRES_ALL = get_reqs(lookup, "INSTALL_REQUIRES_ALL")
+SELENIUM_REQUIRES = get_reqs(lookup, "SELENIUM_REQUIRES")
 
 setup(
     name=NAME,
@@ -87,7 +88,10 @@ def get_reqs(lookup=None, key="INSTALL_REQUIRES"):
     setup_requires=["pytest-runner"],
     install_requires=INSTALL_REQUIRES,
     tests_require=TESTS_REQUIRES,
-    extras_require={},
+    extras_require={
+        "all": INSTALL_REQUIRES_ALL,
+        "selenium": SELENIUM_REQUIRES,
+    },
     classifiers=[
         "Intended Audience :: Developers",
         "License :: OSI Approved :: MIT License",
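
Note on the setup.py change above: the hunk shows that `get_reqs` iterates over `lookup[key]` and reads the module name from `module[0]`, but the rest of the function and the lookup itself are not part of this diff. The sketch below is an illustration under assumptions: the `(name, {"min_version": ...})` entry shape, the `exact_version` key, and the example packages and versions are all hypothetical stand-ins for whatever `get_lookup()` returns.

```python
def get_reqs(lookup, key="INSTALL_REQUIRES"):
    """Turn the entries under lookup[key] into pip requirement strings."""
    requirements = []
    for module in lookup[key]:
        # Assumed entry shape: (name, metadata dict).
        module_name, module_meta = module[0], module[1]
        if module_meta.get("exact_version"):
            requirements.append(f"{module_name}=={module_meta['exact_version']}")
        elif module_meta.get("min_version"):
            requirements.append(f"{module_name}>={module_meta['min_version']}")
        else:
            # No version constraint given.
            requirements.append(module_name)
    return requirements


# Hypothetical lookup; the real one is produced by get_lookup().
lookup = {
    "INSTALL_REQUIRES": (("requests", {"min_version": "2.18.4"}),),
    "SELENIUM_REQUIRES": (("selenium", {"min_version": None}),),
}
lookup["INSTALL_REQUIRES_ALL"] = lookup["INSTALL_REQUIRES"] + lookup["SELENIUM_REQUIRES"]

# Mirrors the extras_require mapping added in this commit.
extras_require = {
    "all": get_reqs(lookup, "INSTALL_REQUIRES_ALL"),
    "selenium": get_reqs(lookup, "SELENIUM_REQUIRES"),
}
```

With extras defined this way, `pip install urlchecker[selenium]` pulls in only the selenium requirements, while `pip install .[all]` (used in the Dockerfile and test workflow) pulls in everything. One behavioral detail of the refactor: `lookup or get_lookup()` also replaces an empty dict, whereas the removed `if lookup == None` check only replaced `None`.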

tests/test_core_check.py

Lines changed: 34 additions & 0 deletions
@@ -37,6 +37,40 @@ def test_check_files(file_paths, print_all, exclude_urls, exclude_patterns):
     )
 
 
+@pytest.mark.parametrize("file_paths", [["tests/test_files/hard_urls.md"]])
+def test_difficult_urls(file_paths):
+    """
+    test difficult urls that likely require selenium.
+    """
+    checker = UrlChecker()
+    results = checker.run(file_paths, timeout=20)
+
+    # This should be the only failing (503)
+    assert (
+        "https://thisurldoesnotexist-pancakes.whatever"
+        in results["failed"]
+    )
+    working = [
+        "https://www.hpcwire.com/2019/01/17/pfizer-hpc-engineer-aims-to-automate-software-stack-testing/",
+        "https://www.sciencedirect.com/science/article/pii/S0013468608005045",
+        "https://doi.org/10.1063/5.0023771",
+        "https://www.linux.org/",
+        "https://drupal.org/",
+        "https://codepen.io/rootwork/",
+        "http://groundwire.org/blog/groundwire-engagement-pyramid/",
+        "https://twig.symfony.com/doc/",
+        "https://groups.drupal.org/node/298298",
+        "https://portland2013.drupal.org/program/sprints.html",
+        "https://twitter.com/wharman",
+        "https://www.progressiveexchange.org",
+        "https://twitter.com/jooy8/status/322734500226412544",
+        "https://www.drupal.org/node/1982024",
+        "https://groups.drupal.org/node/278968",
+    ]
+    for url in working:
+        assert url in results["passed"]
+
+
 @pytest.mark.parametrize("local_folder_path", ["./tests/test_files"])
 @pytest.mark.parametrize("config_fname", ["./tests/_local_test_config.conf"])
 def test_locally(local_folder_path, config_fname):
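
Note on `test_difficult_urls` above: the commit message says that a failing request (for example one that returns 400) and any general exception should both fall back to the web driver. The checker code that does this is not among the hunks shown, so the following is a rough sketch of that control flow only. The names `check_url`, `fetch_status`, and `check_with_driver` are hypothetical, and both checks are injected as callables so the example runs without a network or a browser.

```python
from typing import Callable, Iterable


def check_url(
    url: str,
    fetch_status: Callable[[str], int],
    check_with_driver: Callable[[str], bool],
    ok_codes: Iterable[int] = (200,),
) -> bool:
    """Return True if the url passes, using the driver only as a fallback."""
    try:
        # First attempt: a plain HTTP request (injected, hypothetical).
        if fetch_status(url) in ok_codes:
            return True
    except Exception:
        # Any error from the plain request falls through to the driver.
        pass

    try:
        # Second attempt: load the page in a browser via the web driver.
        return bool(check_with_driver(url))
    except Exception:
        # The driver is missing or the browser raised: count the url as failed.
        return False
```

Under this shape, a URL lands in the failed results only when both the plain request and the driver reject it, which is consistent with the test expecting a single failure for the nonexistent domain.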

tests/test_core_fileproc.py

Lines changed: 2 additions & 0 deletions
@@ -92,10 +92,12 @@ def test_get_file_paths(base_path, file_types):
         [
             "tests/test_files/sample_test_file.md",
             "tests/test_files/sample_test_file.py",
+            "tests/test_files/hard_urls.md",
         ],
         [
             "tests/test_files/sample_test_file.py",
             "tests/test_files/sample_test_file.md",
+            "tests/test_files/hard_urls.md",
         ],
     ]
     # assert
