Commit 252386d

Add support for extra options in the fullDomFetcher (#1173)
2 parents 8a92cbd + 0873204 commit 252386d

6 files changed

Lines changed: 137 additions & 12 deletions

.env.example

Lines changed: 19 additions & 6 deletions
@@ -1,8 +1,21 @@
-OTA_ENGINE_SENDINBLUE_API_KEY='xkeysib-3f51c…'
-OTA_ENGINE_SMTP_PASSWORD='password'
+# Open Terms Archive Engine - Environment Variables Example
+# Copy this file to .env and fill in your actual values
 
-# If both GitHub and GitLab tokens are defined, GitHub takes precedence for dataset publishing
-OTA_ENGINE_GITHUB_TOKEN=ghp_XXXXXXXXX
+OTA_ENGINE_GITHUB_TOKEN=your_github_token_here
+OTA_ENGINE_GITLAB_TOKEN=your_gitlab_token_here
+OTA_ENGINE_GITLAB_RELEASES_TOKEN=your_gitlab_releases_token_here
+OTA_ENGINE_SENDINBLUE_API_KEY=your_sendinblue_api_key_here
+OTA_ENGINE_SMTP_PASSWORD=your_smtp_password_here
 
-OTA_ENGINE_GITLAB_TOKEN=XXXXXXXXXX
-OTA_ENGINE_GITLAB_RELEASES_TOKEN=XXXXXXXXXX
+HTTP_PROXY=http://proxy.example.com:8080
+HTTPS_PROXY=https://proxy.example.com:8080
+
+# Alternative lowercase versions (some systems prefer these)
+# http_proxy=http://proxy.example.com:8080
+# https_proxy=https://proxy.example.com:8080
+
+# Disable headless mode for Puppeteer (shows browser window)
+OTA_ENGINE_FETCHER_NO_HEADLESS=1
+
+# Disable Chrome sandbox (required for some Docker environments)
+OTA_ENGINE_FETCHER_NO_SANDBOX=1

CHANGELOG.md

Lines changed: 10 additions & 0 deletions
@@ -2,6 +2,16 @@
 
 All changes that impact users of this module are documented in this file, in the [Common Changelog](https://common-changelog.org) format with some additional specifications defined in the CONTRIBUTING file. This codebase adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## Unreleased [minor]
+
+> Development of this release was supported by the [European Commission](https://commission.europa.eu/) for its [VLOPs/VLOSEs instance](https://code.europa.eu/dsa/terms-and-conditions-database/vlops-and-vloses/).
+
+### Added
+
+- Add proxy support for fetching documents behind firewalls or restricted networks; configure using `HTTP_PROXY` and `HTTPS_PROXY` (or `http_proxy` and `https_proxy`) environment variables
+- Add debugging options to disable headless mode for visual troubleshooting during development; set `OTA_ENGINE_FETCHER_NO_HEADLESS=1` to show browser window
+- Add sandbox control for improved compatibility with Docker and containerized environments; set `OTA_ENGINE_FETCHER_NO_SANDBOX=1` when running in containers
+
 ## 9.1.2 - 2025-10-30
 
 _Full changeset and discussions: [#1199](https://github.com/OpenTermsArchive/engine/pull/1199)._

CONTRIBUTING.md

Lines changed: 36 additions & 1 deletion
@@ -8,6 +8,7 @@ First of all, thanks for taking the time to contribute! 🎉👍
 - [Commit messages](#commit-messages)
 - [Changelog](#changelog)
 - [Development](#development)
+- [Configuration and environment variables](#configuration-and-environment-variables)
 - [Documentation](#documentation)
 - [Naming](#naming)
 - [Instances and repositories](#instances-and-repositories)
@@ -75,7 +76,7 @@ Changes that require an adjustment in the infrastructure, they are considered as
 
 4. Since each release is produced automatically from a single pull request, the [notice](https://common-changelog.org/#23-notice) links to the source pull request rather than [references](https://common-changelog.org/#242-references), which would always reference the same pull request. References can link to relevant parts of an RFC, decision record, or diff. **This notice is automatically generated by the CI during the release process and should not be added manually.**
 
-5. The [notice](https://common-changelog.org/#23-notice) is also used to present sponsor information and it is required. Since the development of this project is funded by different actors, and following discussions with sponsors, financial contributions are acknowledged in the changelog itself. The format of the notice thus diverges from the Common Changelog specification in that it is not “a single-sentence paragraph”. Sponsor information is in quote format, starts with “Development of this release was supported by <funding_from>”, and provides the name and link to the sponsor, as well as information on the specific funding instrument, as specified by the sponsor itself or as required by law. A short message from the sponsor might also be added, as long as it abides by the community’s [Code of Conduct](./CODE_OF_CONDUCT.md) and aligns with the project’s goals. For volunteer contributions, the sentence should start with: “Development of this release was made on a volunteer basis by <contributor_name>”
+5. The [notice](https://common-changelog.org/#23-notice) is also used to present sponsor information and it is required. Since the development of this project is funded by different actors, as a matter of transparency and recognition, financial contributions and contributions supported by employers are acknowledged in the changelog itself. The format of the notice thus diverges from the Common Changelog specification in that it is not “a single-sentence paragraph”. Sponsor information is in quote format, starts with “Development of this release was supported by <funding_from>”, and provides the name and link to the sponsor, as well as information on the specific funding instrument, as specified by the sponsor itself or as required by law. A short message from the sponsor might also be added, as long as it abides by the community’s [Code of Conduct](./CODE_OF_CONDUCT.md) and aligns with the project’s goals. For volunteer contributions, the sentence should start with: “Development of this release was made on a volunteer basis by <contributor_name>”
 
 #### Changes that do not impact users
 
@@ -91,6 +92,40 @@ This content will be automatically deleted by the CI after merging.
 
 ## Development
 
+### Configuration and environment variables
+
+The choice between environment variables and configuration files should be made based on the nature of the data and how it will be used.
+
+**Use environment variables for:**
+
+- Secrets: API keys, passwords, tokens, or any sensitive data that should not be committed to version control. Examples:
+  - `OTA_ENGINE_GITHUB_TOKEN`: GitHub API token for creating issues and managing repositories
+  - `OTA_ENGINE_SMTP_PASSWORD`: password for SMTP server authentication
+- Debugging flags: toggles for development features. Examples:
+  - `OTA_ENGINE_FETCHER_NO_HEADLESS`: disables headless mode in Puppeteer to show the browser window during fetching
+- Unix standards: system-level settings following Unix conventions. Examples:
+  - `HTTP_PROXY`, `HTTPS_PROXY`, `http_proxy`, `https_proxy`: proxy server configuration for HTTP/HTTPS requests
+- Runtime overrides: container-specific or deployment-specific settings that vary between environments. Examples:
+  - `OTA_ENGINE_FETCHER_NO_SANDBOX`: disables Chrome sandbox (required in some Docker environments)
+
+**Use configuration files for:**
+
+- Engine behavior: Core functionality settings that define how the application operates. Examples:
+  - `trackingSchedule`: Cron expression defining when to track terms (e.g., `"30 */12 * * *"` for every 12 hours)
+  - `fetcher.language`: Language code for Accept-Language header in HTTP requests
+- Service settings: External service endpoints and integration parameters. Examples:
+  - `versionsRepositoryURL`: URL of the GitHub repository storing document versions
+  - `logger.smtp.host`: SMTP server hostname for sending error notifications
+- Static infrastructure: Deployment-independent paths and identifiers. Examples:
+  - `recorder.versions.storage.git.path`: File system path where Git repository for versions is stored
+  - `recorder.versions.storage.git.author`: Git commit author name and email for automated commits
+
+When uncertain whether to use an environment variable or a configuration file, consider:
+
+- Does it contain sensitive information? → Environment variable
+- Should it be version-controlled and reviewed? → Configuration file
+- Is it a stable setting that defines application behavior? → Configuration file
+
 ### Documentation
 
 #### Copywriting
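
The decision rules added to CONTRIBUTING.md can be illustrated with a small sketch. The `loadSettings` helper and the shape of the object it returns are hypothetical and are not part of the engine; only the environment variable names and configuration keys come from the diff above.

```javascript
// Hypothetical helper (not part of the engine): combines secrets and flags read
// from the environment with behavioural settings read from a configuration object.
function loadSettings(env, fileConfig) {
  return {
    // Secret: comes from the environment and is never committed
    githubToken: env.OTA_ENGINE_GITHUB_TOKEN,
    // Debugging flag: any non-empty value enables it
    showBrowserWindow: Boolean(env.OTA_ENGINE_FETCHER_NO_HEADLESS),
    // Stable, reviewable behaviour: comes from the configuration file
    trackingSchedule: fileConfig.trackingSchedule,
    language: fileConfig.fetcher.language,
  };
}

const settings = loadSettings(
  { OTA_ENGINE_GITHUB_TOKEN: 'your_github_token_here' },
  { trackingSchedule: '30 */12 * * *', fetcher: { language: 'en' } },
);

console.log(settings.showBrowserWindow); // false, because the flag is not set
```

Passing the environment and the configuration as arguments is a choice made for this sketch only, so that the split between the two sources is visible in one place.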

src/archivist/fetcher/fullDomFetcher.js

Lines changed: 34 additions & 1 deletion
@@ -1,6 +1,8 @@
 import puppeteer from 'puppeteer-extra';
 import stealthPlugin from 'puppeteer-extra-plugin-stealth';
 
+import { resolveProxyConfiguration, extractProxyCredentials } from './proxyUtils.js';
+
 puppeteer.use(stealthPlugin());
 
 let browser;
@@ -25,6 +27,10 @@ export default async function fetch(url, cssSelectors, config) {
 
     await client.send('Network.clearBrowserCookies'); // Clear cookies to ensure clean state between fetches and prevent session persistence across different URLs
 
+    if (browser.proxyCredentials?.username && browser.proxyCredentials?.password) {
+      await page.authenticate(browser.proxyCredentials);
+    }
+
     response = await page.goto(url, { waitUntil: 'load' }); // Using `load` instead of `networkidle0` as it's more reliable and faster. The 'load' event fires when the page and all its resources (stylesheets, scripts, images) have finished loading. `networkidle0` can be problematic as it waits for 500ms of network inactivity, which may never occur on dynamic pages and then triggers a navigation timeout.
 
     if (!response) {
@@ -86,7 +92,34 @@ export async function launchHeadlessBrowser() {
     return browser;
   }
 
-  browser = await puppeteer.launch({ headless: true });
+  const options = {
+    args: [],
+    headless: !process.env.OTA_ENGINE_FETCHER_NO_HEADLESS,
+  };
+
+  const { httpProxy, httpsProxy } = resolveProxyConfiguration();
+
+  let proxyCredentials = null;
+
+  if (httpProxy) {
+    const httpProxyUrl = new URL(httpProxy);
+    const httpsProxyUrl = new URL(httpsProxy);
+
+    proxyCredentials = extractProxyCredentials(httpProxy, httpsProxy);
+
+    options.args.push(`--proxy-server=http=${httpProxyUrl.host};https=${httpsProxyUrl.host}`);
+  }
+
+  if (process.env.OTA_ENGINE_FETCHER_NO_SANDBOX) {
+    options.args.push('--no-sandbox');
+    options.args.push('--disable-setuid-sandbox');
+  }
+
+  browser = await puppeteer.launch(options);
+
+  if (proxyCredentials) {
+    browser.proxyCredentials = proxyCredentials;
+  }
 
   return browser;
 }
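
To make the flag semantics in `launchHeadlessBrowser` easier to check, here is a minimal sketch that isolates the option-building logic as a pure function. `buildLaunchOptions` is a hypothetical name; it takes the environment and the already-resolved proxies as parameters and does not launch a browser. Because both flags are tested for truthiness, any non-empty string counts as set, including `0`.

```javascript
// Hypothetical extraction of the option-building logic from the diff.
// It mirrors the same branches but reads from arguments instead of process.env.
function buildLaunchOptions(env, { httpProxy, httpsProxy } = {}) {
  const options = {
    args: [],
    headless: !env.OTA_ENGINE_FETCHER_NO_HEADLESS,
  };

  if (httpProxy) {
    const httpHost = new URL(httpProxy).host;
    // In the diff, the fallback to the HTTP proxy happens in resolveProxyConfiguration;
    // it is repeated here so that the sketch stays self-contained.
    const httpsHost = new URL(httpsProxy || httpProxy).host;

    options.args.push(`--proxy-server=http=${httpHost};https=${httpsHost}`);
  }

  if (env.OTA_ENGINE_FETCHER_NO_SANDBOX) {
    options.args.push('--no-sandbox', '--disable-setuid-sandbox');
  }

  return options;
}

const options = buildLaunchOptions(
  { OTA_ENGINE_FETCHER_NO_HEADLESS: '0', OTA_ENGINE_FETCHER_NO_SANDBOX: '1' },
  { httpProxy: 'http://user:secret@proxy.example.com:8080' },
);

console.log(options.headless); // false: the string '0' is truthy, so the flag counts as set
console.log(options.args);
```

As in the diff, the proxy branch is guarded by `httpProxy`, so a proxy defined only through `HTTPS_PROXY` or `https_proxy` is not passed to the browser. `URL.host` excludes the credentials, which is why they are handled separately through `page.authenticate`.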

src/archivist/fetcher/htmlOnlyFetcher.js

Lines changed: 8 additions & 4 deletions
@@ -4,6 +4,8 @@ import HttpProxyAgent from 'http-proxy-agent';
 import HttpsProxyAgent from 'https-proxy-agent';
 import nodeFetch, { AbortError } from 'node-fetch';
 
+import { resolveProxyConfiguration } from './proxyUtils.js';
+
 export default async function fetch(url, config) {
   const controller = new AbortController();
   const timeout = setTimeout(() => controller.abort(), config.navigationTimeout);
@@ -14,10 +16,12 @@ export default async function fetch(url, config) {
     headers: { 'Accept-Language': config.language },
   };
 
-  if (url.startsWith('https:') && process.env.HTTPS_PROXY) {
-    nodeFetchOptions.agent = new HttpsProxyAgent(process.env.HTTPS_PROXY);
-  } else if (url.startsWith('http:') && process.env.HTTP_PROXY) {
-    nodeFetchOptions.agent = new HttpProxyAgent(process.env.HTTP_PROXY);
+  const { httpProxy, httpsProxy } = resolveProxyConfiguration();
+
+  if (url.startsWith('https:') && httpsProxy) {
+    nodeFetchOptions.agent = new HttpsProxyAgent(httpsProxy);
+  } else if (url.startsWith('http:') && httpProxy) {
+    nodeFetchOptions.agent = new HttpProxyAgent(httpProxy);
   }
 
   let response;
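
The agent selection in the hunk above depends only on the scheme of the target URL and on which proxies are defined. A minimal sketch of that decision, without the agent libraries, could look as follows; `selectProxyFor` and its return shape are hypothetical and exist only for illustration.

```javascript
// Hypothetical helper: returns which proxy would apply to a target URL,
// following the same scheme checks as the diff, or null when none applies.
function selectProxyFor(url, { httpProxy, httpsProxy }) {
  if (url.startsWith('https:') && httpsProxy) {
    return { scheme: 'https', proxy: httpsProxy };
  }

  if (url.startsWith('http:') && httpProxy) {
    return { scheme: 'http', proxy: httpProxy };
  }

  return null;
}

const proxies = { httpProxy: 'http://proxy.example.com:8080', httpsProxy: undefined };

console.log(selectProxyFor('http://example.com/terms', proxies));
console.log(selectProxyFor('https://example.com/terms', proxies)); // null: no HTTPS proxy in this object
```

In the engine itself, `resolveProxyConfiguration` falls back to the HTTP proxy when no HTTPS proxy is defined, so the second case would only occur if neither variable is set.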
src/archivist/fetcher/proxyUtils.js

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+export function resolveProxyConfiguration() {
+  const httpProxy = process.env.http_proxy || process.env.HTTP_PROXY;
+  const httpsProxy = process.env.https_proxy || process.env.HTTPS_PROXY || httpProxy;
+
+  return {
+    httpProxy,
+    httpsProxy,
+  };
+}
+
+export function extractProxyCredentials(httpProxy, httpsProxy) {
+  if (!httpProxy) {
+    return null;
+  }
+
+  const httpProxyUrl = new URL(httpProxy);
+  const httpsProxyUrl = new URL(httpsProxy);
+
+  const { username, password } = httpProxyUrl;
+
+  if (!username || !password) {
+    return null;
+  }
+
+  if (httpProxyUrl.username !== httpsProxyUrl.username || httpProxyUrl.password !== httpsProxyUrl.password) {
+    throw new Error('Unsupported proxies specified, http and https proxy should have the same credentials.');
+  }
+
+  return { username, password };
+}
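
The precedence rules in `resolveProxyConfiguration` can be checked with a short sketch. The version below is adapted to read from an `env` argument rather than `process.env`, which is an assumption made for this example so that it runs without touching the real environment. `decodeCredentials` is a hypothetical helper, not part of the commit: the `username` and `password` properties of a `URL` are percent-encoded, so credentials containing reserved characters may need decoding before they are passed on.

```javascript
// Adapted from the diff: same precedence, but reading from an `env` argument.
function resolveProxyConfiguration(env) {
  const httpProxy = env.http_proxy || env.HTTP_PROXY;
  const httpsProxy = env.https_proxy || env.HTTPS_PROXY || httpProxy;

  return { httpProxy, httpsProxy };
}

// Hypothetical helper: decodes the percent-encoded credentials of a proxy URL.
function decodeCredentials(proxy) {
  const { username, password } = new URL(proxy);

  return {
    username: decodeURIComponent(username),
    password: decodeURIComponent(password),
  };
}

// Lowercase variables win over uppercase ones, and the HTTPS proxy
// falls back to the HTTP proxy when it is not defined.
const resolved = resolveProxyConfiguration({
  http_proxy: 'http://a.example.com:3128',
  HTTP_PROXY: 'http://b.example.com:3128',
});

console.log(resolved.httpProxy); // 'http://a.example.com:3128'
console.log(resolved.httpsProxy); // 'http://a.example.com:3128'

const credentials = decodeCredentials('http://user:p%40ss@proxy.example.com:8080');

console.log(credentials.password); // 'p@ss'
```

Note that `extractProxyCredentials` in the commit returns the values as stored on the `URL` object, and returns `null` when either the username or the password is missing.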
