feat/direct get scraper#1
Conversation
…enium fallback, and add configurable politeness delays
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughIntroduces a frontend i18n system (translations dictionary, ChangesBackend authentication and schedule configuration
Frontend i18n system and UI component adoption
Scraper requests-based refactor and legacy Selenium preservation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a lightweight internationalization (i18n) system with English and German translations, refactors several frontend components to use standardized UI elements (Button, Card, Input), and optimizes the scraper backend to support a fast requests-based pipeline alongside the legacy Selenium implementation. The code review feedback highlights several key areas for improvement: 1. Input Validation: Ensure configured scraper delays are validated as non-negative numbers in the backend to prevent runtime crashes. 2. Internationalization & Redundancy: Replace redundant ternary operators and hardcoded dimension labels in the frontend with the newly introduced translation keys. 3. Code Safety: Use replaceAll instead of dynamic RegExp in the translation hook to avoid injection vulnerabilities. 4. Defensive Programming: Guard against potential IndexError and TypeError exceptions in the scraper when parsing malformed URLs or empty JSON-LD scripts. 5. Timezone Safety: Consistently use timezone-aware UTC datetimes instead of naive local datetimes to prevent synchronization mismatches.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| delayBetweenPages: parseFloat(delayBetweenPages !== undefined ? delayBetweenPages : 0.25), | ||
| delayBetweenListings: parseFloat(delayBetweenListings !== undefined ? delayBetweenListings : 0.25) |
There was a problem hiding this comment.
The parsed values for delayBetweenPages and delayBetweenListings are not validated to ensure they are non-negative numbers. If a user configures a negative value, Python's time.sleep() in the scraper will throw a ValueError and crash the background process. Additionally, if parseFloat returns NaN, it will be serialized as null in the JSON file, which might cause issues when read.
Validate that the parsed values are valid, non-negative numbers, falling back to 0.25 if invalid.
| <span className="text-slate-750">|</span> | ||
| <h2 className="text-base font-bold text-slate-200">{campaigns.find(c => c.id === currentCampaignId)?.name} Dashboard</h2> | ||
| <h2 className="text-base font-bold text-slate-200"> | ||
| {campaigns.find(c => c.id === currentCampaignId)?.name} {lang === 'en' ? 'Dashboard' : 'Dashboard'} |
| trustworthiness: lang === 'en' ? 'Trustworthiness' : 'Vertrauenswürdigkeit', | ||
| transparency: lang === 'en' ? 'Transparency' : 'Transparenz', | ||
| conditionConfidence: lang === 'en' ? 'Condition Confidence' : 'Zustandssicherheit', | ||
| documentationQuality: lang === 'en' ? 'Documentation Quality' : 'Dokumentationsqualität', | ||
| hiddenRiskSuspicion: lang === 'en' ? 'Hidden Risk Suspicion' : 'Versteckter Risikoverdacht', | ||
| marketAboveAverageSignal: lang === 'en' ? 'Market Above Average Signal' : 'Überdurchschnittliches Marktsignal' |
| let result = current; | ||
| if (replacements) { | ||
| Object.entries(replacements).forEach(([k, v]) => { | ||
| result = result.replace(new RegExp(`{{${k}}}`, 'g'), String(v)); |
| parts = base_url.split("/") | ||
| domain_part = "/".join(parts[:3]) | ||
| path_part = parts[3] |
| for script in soup.find_all("script", type="application/ld+json"): | ||
| try: | ||
| data = json.loads(script.string) |
There was a problem hiding this comment.
| parsed["detailed_description"], | ||
| json.dumps(parsed["details"]), | ||
| json.dumps(parsed["images"]), | ||
| datetime.datetime.now().isoformat(), |
There was a problem hiding this comment.
There was a problem hiding this comment.
Actionable comments posted: 16
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scraper/scraper.py (1)
331-425:⚠️ Potential issue | 🟠 Major | ⚡ Quick winEnsure SQLite connections are closed on all error paths.
Both functions close
connonly on the happy path. Any exception before the trailingconn.close()can leave the connection open and eventually cause lock contention in subsequent runs.Suggested fix
def harvest_descriptions(campaign_id=None): @@ - try: - conn = sqlite3.connect(db_path) + conn = None + try: + conn = sqlite3.connect(db_path) conn.row_factory = sqlite3.Row @@ - conn.close() except Exception as e: logger.error(f"Error in requests-based harvest_descriptions: {str(e)}") + finally: + if conn is not None: + conn.close() @@ def update_all_descriptions_session(campaign_id=None): @@ - try: - conn = sqlite3.connect(db_path) + conn = None + try: + conn = sqlite3.connect(db_path) conn.row_factory = sqlite3.Row @@ - conn.close() except Exception as e: logger.error( f"Error in requests-based update_all_descriptions_session: {str(e)}" ) + finally: + if conn is not None: + conn.close()Also applies to: 440-532
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scraper/scraper.py` around lines 331 - 425, The SQLite connection opened with sqlite3.connect(db_path) is only closed in the try block's happy path, leaving it open if any exception occurs before reaching conn.close(). Wrap the entire try block's operations in a try-finally structure where the finally block ensures conn.close() is always executed, or alternatively use a with statement for the database connection to automatically manage its lifecycle. This same pattern needs to be applied to the other similar function mentioned in the comment (also applies to lines 440-532).
🧹 Nitpick comments (2)
scraper/scraper_selenium.py (1)
219-226: 💤 Low valuePickle deserialization from local file has limited attack surface.
Static analysis flags
pickle.loadas unsafe (S301). In this context, the cookie file is application-controlled and stored locally, so exploitation requires prior file system access. The current try/except pattern aroundadd_cookieis good defensive practice. Consider adding a comment documenting this risk/tradeoff for future maintainers.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scraper/scraper_selenium.py` around lines 219 - 226, Add a comment above the pickle.load() call in the cookie loading block to document the security consideration and justification. The comment should explain that while pickle.load is flagged as unsafe by static analysis tools (S301), this usage is acceptable because the cookie file is application-controlled and stored locally, which limits the attack surface and requires prior file system access for exploitation. This will help future maintainers understand the deliberate tradeoff being made here.Source: Linters/SAST tools
frontend/src/components/ListingDetailCard.tsx (1)
320-327: ⚡ Quick winCentralize soft-dimension labels in translations instead of inline language branching.
This duplicates i18n logic in the component and makes new-language rollout harder. Use translation keys for these labels like the rest of the card.
♻️ Suggested refactor
-const labels: Record<string, string> = { - trustworthiness: lang === 'en' ? 'Trustworthiness' : 'Vertrauenswürdigkeit', - transparency: lang === 'en' ? 'Transparency' : 'Transparenz', - conditionConfidence: lang === 'en' ? 'Condition Confidence' : 'Zustandssicherheit', - documentationQuality: lang === 'en' ? 'Documentation Quality' : 'Dokumentationsqualität', - hiddenRiskSuspicion: lang === 'en' ? 'Hidden Risk Suspicion' : 'Versteckter Risikoverdacht', - marketAboveAverageSignal: lang === 'en' ? 'Market Above Average Signal' : 'Überdurchschnittliches Marktsignal' -}; -const label = labels[key] || key; +const label = t(`listing.softDimensionsLabels.${key}`);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@frontend/src/components/ListingDetailCard.tsx` around lines 320 - 327, Move the soft-dimension labels (trustworthiness, transparency, conditionConfidence, documentationQuality, hiddenRiskSuspicion, marketAboveAverageSignal) from the inline Record definition with ternary language branching into your translation/i18n system. Replace the hardcoded labels object with translation key references using the same i18n pattern already used elsewhere in the ListingDetailCard component, eliminating the redundant lang === 'en' checks and making it easier to support new languages without modifying the component code.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/server.js`:
- Around line 645-647: The delayBetweenPages and delayBetweenListings properties
are using parseFloat with only undefined checks, which allows invalid values
like NaN or negative numbers to be persisted. Add validation after each
parseFloat call to ensure the parsed number is a valid positive number; if the
value is NaN or negative, use the default value (0.25) instead. Apply this
validation logic consistently to both delay properties to prevent invalid delays
from corrupting the config.
In `@frontend/src/App.tsx`:
- Around line 1012-1016: The select dropdown in the status filter has a mismatch
between the option value and label. The option with value="New" uses the
translation key dashboard.statusEvaluated which suggests an "evaluated only"
filter, but the underlying predicate filters for status === 'New'. Change the
label for the option with value="New" to use a translation key that accurately
describes filtering for new items (e.g., dashboard.statusNew or similar) instead
of dashboard.statusEvaluated to align the UI label with the actual filtering
behavior.
- Around line 948-950: The h2 element in the dashboard header contains a no-op
ternary operator that always renders "Dashboard" regardless of the language,
which bypasses the i18n system. Replace the entire ternary expression `lang ===
'en' ? 'Dashboard' : 'Dashboard'` with a proper i18n translation function call
(such as t('dashboard') or equivalent based on your application's translation
system) to ensure the dashboard title is properly localized.
In `@frontend/src/components/ListingDetailCard.tsx`:
- Line 395: The neutral status in the ternary operator evaluating
evalItem.status is hardcoded as the English string 'NEUTRAL' instead of being
translated like the satisfied and violated statuses. Replace the hardcoded
'NEUTRAL' string with t('listing.neutral').toUpperCase() to apply the same
translation and uppercase formatting pattern used for the other two status cases
in the ternary expression.
In `@frontend/src/components/SettingsView.tsx`:
- Line 300: The button text on line 300 in SettingsView.tsx is using the
translation key `settings.scraperRules`, which is meant for a section title, not
a button action. Replace this translation key with an appropriate save-action
translation key (such as `common.save` or `actions.save`) that properly
describes the submit action for this button, ensuring the UI correctly labels
the primary CTA.
- Line 80: The setTimeout call in SettingsView that clears the success message
does not cancel previously scheduled timers, causing multiple overlapping
timeouts to potentially fire and clear the message prematurely, plus leaving
dangling timers on unmount. Store the timeout ID from each setTimeout in a ref
variable, and before scheduling a new timeout in the setSuccess handler, first
call clearTimeout on any existing timeout ID stored in that ref to ensure only
one active timer exists at any time.
In `@frontend/src/components/ui/Button.tsx`:
- Around line 9-17: The Button component is missing an explicit type attribute,
which causes HTML to default to type="submit" inside forms, leading to
unintended form submissions. Add a type prop with a default value of 'button' to
the Button component's destructured parameters (alongside variant, size,
loading, disabled, className, children, and props), then pass this type prop to
the underlying button element to ensure the button behaves as expected in all
contexts.
In `@frontend/src/components/ui/Input.tsx`:
- Around line 8-18: The Input component in its forwardRef callback needs to
expose the error state semantically for accessibility. Add the aria-invalid
attribute to the input element, setting it to the boolean value of the error
prop. Position the aria-invalid attribute before the {...props} spread operator
so that consumers can override the accessibility attribute if needed through the
props parameter.
In `@frontend/src/hooks/useTranslation.ts`:
- Around line 8-11: The globalLang initialization casts the localStorage value
directly to Language type without validating it matches the supported locales.
Add validation after retrieving the value from localStorage using
localStorage.getItem('ui-lang') to check if it is either 'en' or 'de'. If the
retrieved value is not one of the supported locales, fall back to a default
language (such as 'en'). This ensures only valid language values are assigned to
globalLang.
In `@frontend/src/utils/listingTransformer.ts`:
- Line 149: The type assertion on line 149 using `as Listing['highlights']` does
not validate the runtime shape of the highlights data from the LLM payload.
Create a validation function that checks each highlight object in the
extracted_facts.highlights array to ensure the sentiment, type, and confidence
fields contain only the expected literal values (sentiment: 'positive' |
'negative' | 'neutral', type: 'maintenance' | 'warning' | 'feature', confidence:
'high' | 'med' | 'low'). Apply this validation function to filter or validate
the highlights before the type assertion in the highlights line, ensuring only
valid highlights are passed to the Listing type and preventing silent failures
in downstream UI code like ListingDetailCard.tsx that checks conditions such as
h.sentiment === 'negative'.
In `@scraper/scraper_selenium.py`:
- Around line 721-724: The cookie loading loop that iterates over pickled
cookies and calls driver.add_cookie(cookie) lacks error handling for individual
cookies. Wrap each driver.add_cookie(cookie) call within a try/except block
similar to the pattern used in the load_cookies() helper function. This ensures
that if a single cookie fails to load, the loop continues processing remaining
cookies instead of aborting the entire session.
- Around line 386-393: The exception handler in the except block attempts to use
`current_url` which may not be defined if an exception occurs before the
assignment at line 356, causing a NameError. Additionally, the silent except
handler at lines 391-392 hides navigation failures. Initialize `current_url`
before the try block to ensure it's always defined, and replace the silent
`except: pass` clause with a proper logger.error call to capture and log
navigation failures instead of silently swallowing them.
- Around line 65-201: The database connection created with sqlite3.connect at
the beginning of the harvest_missing_descriptions function is not guaranteed to
be closed if an exception occurs before reaching the conn.close() call at the
end. Replace the current try/except structure with a context manager (using the
with statement) for the sqlite3.connect call to ensure the connection is
automatically closed and resources are properly cleaned up even if an exception
occurs during the listing harvesting loop or database operations.
In `@scraper/scraper.py`:
- Around line 445-461: The current update logic for listings only triggers a
database update when the description text changes, causing specs and images
changes to be ignored if the description remains the same. Modify the
conditional logic that determines whether to update a listing row (in the
section starting around line 478) to check not only if the description has
changed, but also if there are any changes to the specs or images. This ensures
that all changes to listing details are persisted to the database, even when the
description text itself remains unchanged.
In `@scripts/demo_direct_get.py`:
- Around line 123-125: The fallback CSS class selector in the soup.select_one()
call for price_elem contains a typo in the class name format. The selector uses
ad-item-main--middle--price-shipping--price but should use
aditem-main--middle--price-shipping--price (without the hyphen between ad and
item) to match the correct class naming convention used elsewhere in the code.
Update the first selector string in the soup.select_one() call to correct this
class name so that price extraction works properly when the primary id selector
is not available.
In `@scripts/demo_full_pipeline.py`:
- Around line 88-90: The price extraction in the code directly chains
select_one() and get_text() methods, which will raise an exception and skip the
entire listing if the price element is missing. Add a null-safety check after
the select_one() call to verify the element exists before calling get_text() on
it, allowing listings with missing prices to be handled gracefully rather than
being dropped.
---
Outside diff comments:
In `@scraper/scraper.py`:
- Around line 331-425: The SQLite connection opened with
sqlite3.connect(db_path) is only closed in the try block's happy path, leaving
it open if any exception occurs before reaching conn.close(). Wrap the entire
try block's operations in a try-finally structure where the finally block
ensures conn.close() is always executed, or alternatively use a with statement
for the database connection to automatically manage its lifecycle. This same
pattern needs to be applied to the other similar function mentioned in the
comment (also applies to lines 440-532).
---
Nitpick comments:
In `@frontend/src/components/ListingDetailCard.tsx`:
- Around line 320-327: Move the soft-dimension labels (trustworthiness,
transparency, conditionConfidence, documentationQuality, hiddenRiskSuspicion,
marketAboveAverageSignal) from the inline Record definition with ternary
language branching into your translation/i18n system. Replace the hardcoded
labels object with translation key references using the same i18n pattern
already used elsewhere in the ListingDetailCard component, eliminating the
redundant lang === 'en' checks and making it easier to support new languages
without modifying the component code.
In `@scraper/scraper_selenium.py`:
- Around line 219-226: Add a comment above the pickle.load() call in the cookie
loading block to document the security consideration and justification. The
comment should explain that while pickle.load is flagged as unsafe by static
analysis tools (S301), this usage is acceptable because the cookie file is
application-controlled and stored locally, which limits the attack surface and
requires prior file system access for exploitation. This will help future
maintainers understand the deliberate tradeoff being made here.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 87d1f053-bd6e-42e9-9c4f-3b7220aa747d
📒 Files selected for processing (17)
backend/server.jsfrontend/src/App.tsxfrontend/src/components/CriteriaTuner.tsxfrontend/src/components/GuidelinesWizard.tsxfrontend/src/components/ListingDetailCard.tsxfrontend/src/components/SettingsView.tsxfrontend/src/components/ui/Button.tsxfrontend/src/components/ui/Card.tsxfrontend/src/components/ui/Input.tsxfrontend/src/hooks/useTranslation.tsfrontend/src/i18n/translations.tsfrontend/src/utils/listingTransformer.tsscraper/scraper.pyscraper/scraper_selenium.pyscripts/demo_direct_get.pyscripts/demo_discover_listings.pyscripts/demo_full_pipeline.py
| <option value="All">{t('dashboard.statusAll')}</option> | ||
| <option value="High Niceness">{t('dashboard.statusMatches')} (70+)</option> | ||
| <option value="Evaluate with AI">{t('dashboard.statusPending')}</option> | ||
| <option value="New">{t('dashboard.statusEvaluated')}</option> | ||
| </select> |
There was a problem hiding this comment.
Fix status filter label/value mismatch in the dropdown.
The option with value="New" is labeled as AI Evaluated Only, but the predicate for New filters l.status === 'New'. This makes the UI misleading and removes a true evaluated-only filter path.
Minimal alignment fix (label matches current behavior)
<option value="All">{t('dashboard.statusAll')}</option>
<option value="High Niceness">{t('dashboard.statusMatches')} (70+)</option>
<option value="Evaluate with AI">{t('dashboard.statusPending')}</option>
- <option value="New">{t('dashboard.statusEvaluated')}</option>
+ <option value="New">{t('landing.new')}</option>🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@frontend/src/App.tsx` around lines 1012 - 1016, The select dropdown in the
status filter has a mismatch between the option value and label. The option with
value="New" uses the translation key dashboard.statusEvaluated which suggests an
"evaluated only" filter, but the underlying predicate filters for status ===
'New'. Change the label for the option with value="New" to use a translation key
that accurately describes filtering for new items (e.g., dashboard.statusNew or
similar) instead of dashboard.statusEvaluated to align the UI label with the
actual filtering behavior.
| </div> | ||
| <span className={`text-[10px] font-bold px-2 py-0.5 rounded-md ${evalItem.status === 'satisfied' ? 'bg-emerald-500/10 text-emerald-400' : evalItem.status === 'violated' ? 'bg-rose-500/10 text-rose-400' : 'bg-slate-800 text-slate-400'}`}> | ||
| {evalItem.status === 'satisfied' ? 'SATISFIED' : evalItem.status === 'violated' ? 'VIOLATED' : 'NEUTRAL'} | ||
| {evalItem.status === 'satisfied' ? t('listing.satisfied').toUpperCase() : evalItem.status === 'violated' ? t('listing.violated').toUpperCase() : 'NEUTRAL'} |
There was a problem hiding this comment.
Localize the neutral checklist status.
The satisfied/violated statuses are translated, but the neutral branch is still hardcoded as English text.
🌐 Suggested fix
-{evalItem.status === 'satisfied' ? t('listing.satisfied').toUpperCase() : evalItem.status === 'violated' ? t('listing.violated').toUpperCase() : 'NEUTRAL'}
+{evalItem.status === 'satisfied'
+ ? t('listing.satisfied').toUpperCase()
+ : evalItem.status === 'violated'
+ ? t('listing.violated').toUpperCase()
+ : t('listing.neutral').toUpperCase()}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| {evalItem.status === 'satisfied' ? t('listing.satisfied').toUpperCase() : evalItem.status === 'violated' ? t('listing.violated').toUpperCase() : 'NEUTRAL'} | |
| {evalItem.status === 'satisfied' | |
| ? t('listing.satisfied').toUpperCase() | |
| : evalItem.status === 'violated' | |
| ? t('listing.violated').toUpperCase() | |
| : t('listing.neutral').toUpperCase()} |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@frontend/src/components/ListingDetailCard.tsx` at line 395, The neutral
status in the ternary operator evaluating evalItem.status is hardcoded as the
English string 'NEUTRAL' instead of being translated like the satisfied and
violated statuses. Replace the hardcoded 'NEUTRAL' string with
t('listing.neutral').toUpperCase() to apply the same translation and uppercase
formatting pattern used for the other two status cases in the ternary
expression.
| setDelayBetweenPages(data.config.delayBetweenPages ?? 0.25); | ||
| setDelayBetweenListings(data.config.delayBetweenListings ?? 0.25); | ||
| setSuccess(t('settings.saveSuccess')); | ||
| setTimeout(() => setSuccess(''), 4000); |
There was a problem hiding this comment.
Clear existing success timers before scheduling a new one.
Line 80 starts a new timeout every save without cancelling prior ones. Rapid saves can clear the newest message too early, and a pending timer may fire after unmount.
Suggested fix
-import { useState, useEffect } from 'react';
+import { useState, useEffect, useRef } from 'react';
@@
export default function SettingsView({ onBack }: SettingsViewProps) {
+ const successTimerRef = useRef<number | null>(null);
@@
- setSuccess(t('settings.saveSuccess'));
- setTimeout(() => setSuccess(''), 4000);
+ setSuccess(t('settings.saveSuccess'));
+ if (successTimerRef.current) {
+ window.clearTimeout(successTimerRef.current);
+ }
+ successTimerRef.current = window.setTimeout(() => setSuccess(''), 4000);
@@
+ useEffect(() => {
+ return () => {
+ if (successTimerRef.current) {
+ window.clearTimeout(successTimerRef.current);
+ }
+ };
+ }, []);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@frontend/src/components/SettingsView.tsx` at line 80, The setTimeout call in
SettingsView that clears the success message does not cancel previously
scheduled timers, causing multiple overlapping timeouts to potentially fire and
clear the message prematurely, plus leaving dangling timers on unmount. Store
the timeout ID from each setTimeout in a ref variable, and before scheduling a
new timeout in the setSuccess handler, first call clearTimeout on any existing
timeout ID stored in that ref to ensure only one active timer exists at any
time.
| except Exception as e: | ||
| logger.error(f"Error getting detailed description: {str(e)}") | ||
| # Try to go back to the search results | ||
| try: | ||
| driver.get(current_url) | ||
| except Exception: | ||
| pass | ||
| return "" |
There was a problem hiding this comment.
Potential NameError if driver.current_url fails, and silent exception swallows errors.
If the exception originates at line 356 (before current_url is assigned), line 390 raises NameError. Additionally, the silent except: pass at lines 391-392 hides navigation failures.
🛠️ Suggested fix
+ current_url = None
try:
# Store current URL to return to later
current_url = driver.current_url
...
except Exception as e:
logger.error(f"Error getting detailed description: {str(e)}")
# Try to go back to the search results
- try:
- driver.get(current_url)
- except Exception:
- pass
+ if current_url:
+ try:
+ driver.get(current_url)
+ except Exception as nav_err:
+ logger.warning(f"Failed to navigate back: {nav_err}")
return ""📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| except Exception as e: | |
| logger.error(f"Error getting detailed description: {str(e)}") | |
| # Try to go back to the search results | |
| try: | |
| driver.get(current_url) | |
| except Exception: | |
| pass | |
| return "" | |
| current_url = None | |
| try: | |
| # Store current URL to return to later | |
| current_url = driver.current_url | |
| ... | |
| except Exception as e: | |
| logger.error(f"Error getting detailed description: {str(e)}") | |
| # Try to go back to the search results | |
| if current_url: | |
| try: | |
| driver.get(current_url) | |
| except Exception as nav_err: | |
| logger.warning(f"Failed to navigate back: {nav_err}") | |
| return "" |
🧰 Tools
🪛 Ruff (0.15.17)
[warning] 386-386: Do not catch blind exception: Exception
(BLE001)
[warning] 387-387: Use explicit conversion flag
Replace with conversion flag
(RUF010)
[error] 391-392: try-except-pass detected, consider logging the exception
(S110)
[warning] 391-391: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scraper/scraper_selenium.py` around lines 386 - 393, The exception handler in
the except block attempts to use `current_url` which may not be defined if an
exception occurs before the assignment at line 356, causing a NameError.
Additionally, the silent except handler at lines 391-392 hides navigation
failures. Initialize `current_url` before the try block to ensure it's always
defined, and replace the silent `except: pass` clause with a proper logger.error
call to capture and log navigation failures instead of silently swallowing them.
Source: Linters/SAST tools
| with open(cookies_path, "rb") as f: | ||
| cookies = pickle.load(f) | ||
| for cookie in cookies: | ||
| driver.add_cookie(cookie) |
There was a problem hiding this comment.
Cookie loading lacks per-cookie error handling, unlike load_cookies().
The load_cookies() helper (lines 221-226) wraps each add_cookie() in try/except to skip problematic cookies. Here, a single failing cookie will abort the entire session. Apply the same defensive pattern.
🛠️ Suggested fix
with open(cookies_path, "rb") as f:
cookies = pickle.load(f)
for cookie in cookies:
- driver.add_cookie(cookie)
+ try:
+ driver.add_cookie(cookie)
+ except Exception as e:
+ logger.warning(f"Error loading cookie: {e}")
logger.info("Loaded persistent session cookies for description harvesting")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| with open(cookies_path, "rb") as f: | |
| cookies = pickle.load(f) | |
| for cookie in cookies: | |
| driver.add_cookie(cookie) | |
| with open(cookies_path, "rb") as f: | |
| cookies = pickle.load(f) | |
| for cookie in cookies: | |
| try: | |
| driver.add_cookie(cookie) | |
| except Exception as e: | |
| logger.warning(f"Error loading cookie: {e}") |
🧰 Tools
🪛 OpenGrep (1.22.0)
[ERROR] 722-722: pickle.load/loads deserializes arbitrary Python objects and can execute arbitrary code. Use a safe format like JSON instead.
(coderabbit.deserialization.python-pickle)
🪛 Ruff (0.15.17)
[error] 722-722: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue
(S301)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scraper/scraper_selenium.py` around lines 721 - 724, The cookie loading loop
that iterates over pickled cookies and calls driver.add_cookie(cookie) lacks
error handling for individual cookies. Wrap each driver.add_cookie(cookie) call
within a try/except block similar to the pattern used in the load_cookies()
helper function. This ensures that if a single cookie fails to load, the loop
continues processing remaining cookies instead of aborting the entire session.
| if campaign_id is not None: | ||
| cursor.execute( | ||
| """ | ||
| SELECT l.id, l.url, l.title, l.detailed_description FROM listings l | ||
| JOIN searches s ON l.search_id = s.id | ||
| WHERE s.enabled = 1 AND s.campaign_id = ? | ||
| """, | ||
| (campaign_id,), | ||
| ) | ||
| else: | ||
| cursor.execute( | ||
| """ | ||
| SELECT l.id, l.url, l.title, l.detailed_description FROM listings l | ||
| JOIN searches s ON l.search_id = s.id | ||
| WHERE s.enabled = 1 | ||
| """ | ||
| ) |
There was a problem hiding this comment.
Persist details/images updates even when description text is unchanged.
Current logic updates the row only when description changes. If listing specs or images change while description remains the same, DB data stays stale even though fresh values were parsed.
Suggested fix
- SELECT l.id, l.url, l.title, l.detailed_description FROM listings l
+ SELECT l.id, l.url, l.title, l.detailed_description, l.details, l.images FROM listings l
@@
- SELECT l.id, l.url, l.title, l.detailed_description FROM listings l
+ SELECT l.id, l.url, l.title, l.detailed_description, l.details, l.images FROM listings l
@@
old_description = r["detailed_description"] or ""
+ old_details = r["details"] or "{}"
+ old_images = r["images"] or "[]"
@@
detailed_description = parsed["detailed_description"] or ""
+ details_json = json.dumps(parsed["details"], ensure_ascii=False, sort_keys=True)
+ images_json = json.dumps(parsed["images"], ensure_ascii=False)
- if detailed_description.strip() != old_description.strip():
+ if (
+ detailed_description.strip() != old_description.strip()
+ or details_json != old_details
+ or images_json != old_images
+ ):
@@
- json.dumps(parsed["details"]),
- json.dumps(parsed["images"]),
+ details_json,
+ images_json,Also applies to: 478-522
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scraper/scraper.py` around lines 445 - 461, The current update logic for
listings only triggers a database update when the description text changes,
causing specs and images changes to be ignored if the description remains the
same. Modify the conditional logic that determines whether to update a listing
row (in the section starting around line 478) to check not only if the
description has changed, but also if there are any changes to the specs or
images. This ensures that all changes to listing details are persisted to the
database, even when the description text itself remains unchanged.
| price_elem = soup.select_one( | ||
| 'h2.ad-item-main--middle--price-shipping--price, [itemprop="price"]' | ||
| ) |
There was a problem hiding this comment.
Fix fallback price selector typo.
The fallback selector currently uses ad-item..., which does not match the class used elsewhere (aditem...). This can miss price extraction when #viewad-price is absent.
Suggested fix
- price_elem = soup.select_one(
- 'h2.ad-item-main--middle--price-shipping--price, [itemprop="price"]'
- )
+ price_elem = soup.select_one(
+ 'p.aditem-main--middle--price-shipping--price, [itemprop="price"]'
+ )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/demo_direct_get.py` around lines 123 - 125, The fallback CSS class
selector in the soup.select_one() call for price_elem contains a typo in the
class name format. The selector uses ad-item-main--middle--price-shipping--price
but should use aditem-main--middle--price-shipping--price (without the hyphen
between ad and item) to match the correct class naming convention used elsewhere
in the code. Update the first selector string in the soup.select_one() call to
correct this class name so that price extraction works properly when the primary
id selector is not available.
| price = item.select_one( | ||
| "p.aditem-main--middle--price-shipping--price" | ||
| ).get_text(strip=True) |
There was a problem hiding this comment.
Avoid skipping valid listings when the price node is missing.
This direct chained call can raise and drop the whole listing record if the price element is absent. Make extraction null-safe.
Suggested fix
- price = item.select_one(
- "p.aditem-main--middle--price-shipping--price"
- ).get_text(strip=True)
+ price_elem = item.select_one(
+ "p.aditem-main--middle--price-shipping--price"
+ )
+ price = price_elem.get_text(strip=True) if price_elem else "N/A"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| price = item.select_one( | |
| "p.aditem-main--middle--price-shipping--price" | |
| ).get_text(strip=True) | |
| price_elem = item.select_one( | |
| "p.aditem-main--middle--price-shipping--price" | |
| ) | |
| price = price_elem.get_text(strip=True) if price_elem else "N/A" |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/demo_full_pipeline.py` around lines 88 - 90, The price extraction in
the code directly chains select_one() and get_text() methods, which will raise
an exception and skip the entire listing if the price element is missing. Add a
null-safety check after the select_one() call to verify the element exists
before calling get_text() on it, allowing listings with missing prices to be
handled gracefully rather than being dropped.
There was a problem hiding this comment.
🧹 Nitpick comments (2)
scraper/agent_worker.py (1)
301-302: ⚡ Quick winUse one captured UTC timestamp per row update for both datetime columns.
llm_processed_timeandlast_ai_evaluated_atare written as a pair, but they currently use two separatenow()calls, which can drift by microseconds. Capture once and reuse so both fields stay exactly aligned.Suggested patch
@@ - cursor.execute( + processed_at = datetime.datetime.now(datetime.timezone.utc).isoformat() + cursor.execute( @@ - ( - datetime.datetime.now(datetime.timezone.utc).isoformat(), - datetime.datetime.now(datetime.timezone.utc).isoformat(), + ( + processed_at, + processed_at, json.dumps(facts_envelope), listing_id, ), ) @@ - cursor.execute( + processed_at = datetime.datetime.now(datetime.timezone.utc).isoformat() + cursor.execute( @@ - ( - datetime.datetime.now(datetime.timezone.utc).isoformat(), - datetime.datetime.now(datetime.timezone.utc).isoformat(), + ( + processed_at, + processed_at, full_info, json.dumps(facts_envelope), score,Also applies to: 571-572
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scraper/agent_worker.py` around lines 301 - 302, The code at lines 301-302 makes two separate calls to datetime.datetime.now(datetime.timezone.utc).isoformat() for the llm_processed_time and last_ai_evaluated_at columns, which can cause microsecond drift between the two timestamps. Capture the UTC timestamp once into a variable before the update statement and reuse that same variable for both datetime columns instead of calling .now() twice. Apply the same fix at lines 571-572 where this same pattern occurs.scraper/scraper.py (1)
270-274: ⚡ Quick winExcessive file I/O: writing to disk on every single listing.
This writes the entire JSON file after each listing is scraped, causing O(n²) total bytes written and high I/O overhead. For crash recovery, batching once per page would provide similar resilience with much better performance.
♻️ Suggested refactor: batch write per page
Move the file write outside the item loop to write once per page:
existing_ids.add(listing_id) all_scraped_listings.append(listing) scraped_count += 1 - if output_file: - with open(output_file, "w", encoding="utf-8") as f: - json.dump( - existing_listings, f, ensure_ascii=False, indent=4 - ) - except Exception as e: logger.error(f"Error scraping single listing: {str(e)}") continue logger.info( f"Scraped {scraped_count} discovered listings from {current_url}" ) + + # Save after each page for crash recovery + if output_file and scraped_count > 0: + with open(output_file, "w", encoding="utf-8") as f: + json.dump(existing_listings, f, ensure_ascii=False, indent=4) except Exception as e: logger.error(f"Error scraping page {current_url}: {str(e)}")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scraper/scraper.py` around lines 270 - 274, The file write operation using json.dump on existing_listings is currently inside the item loop, causing the entire JSON file to be written after each listing is processed, resulting in excessive I/O overhead. Move the entire block that opens output_file and calls json.dump outside the innermost loop that processes individual listings so that the file is written once per page rather than once per listing. This will maintain crash recovery capabilities while significantly improving performance by reducing write operations.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@scraper/agent_worker.py`:
- Around line 301-302: The code at lines 301-302 makes two separate calls to
datetime.datetime.now(datetime.timezone.utc).isoformat() for the
llm_processed_time and last_ai_evaluated_at columns, which can cause microsecond
drift between the two timestamps. Capture the UTC timestamp once into a variable
before the update statement and reuse that same variable for both datetime
columns instead of calling .now() twice. Apply the same fix at lines 571-572
where this same pattern occurs.
In `@scraper/scraper.py`:
- Around line 270-274: The file write operation using json.dump on
existing_listings is currently inside the item loop, causing the entire JSON
file to be written after each listing is processed, resulting in excessive I/O
overhead. Move the entire block that opens output_file and calls json.dump
outside the innermost loop that processes individual listings so that the file
is written once per page rather than once per listing. This will maintain crash
recovery capabilities while significantly improving performance by reducing
write operations.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 7700d350-353b-49b2-8783-78b0d5bff72e
📒 Files selected for processing (8)
backend/server.jsfrontend/src/App.tsxfrontend/src/components/ListingDetailCard.tsxfrontend/src/hooks/useTranslation.tsfrontend/src/i18n/translations.tsscraper/agent_worker.pyscraper/scraper.pyscraper/scraper_selenium.py
✅ Files skipped from review due to trivial changes (1)
- frontend/src/i18n/translations.ts
🚧 Files skipped from review as they are similar to previous changes (5)
- frontend/src/hooks/useTranslation.ts
- backend/server.js
- scraper/scraper_selenium.py
- frontend/src/components/ListingDetailCard.tsx
- frontend/src/App.tsx
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
backend/server.js (1)
51-53: 💤 Low valueConsider generating a random initial password.
The hardcoded default password
'password'is trivially guessable. For a dev/demo setup, consider generating a random password on first run and printing it once, or at minimum document prominently that this must be changed before any non-local use.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/server.js` around lines 51 - 53, The hardcoded defaultPassword variable is set to the weak default value 'password', which poses a security risk. Replace this with a solution that either generates a random initial password on first run and logs it to the console for the administrator to capture, or add prominent documentation (in comments and README) stating that this hardcoded password must be changed immediately before any non-local deployment. If implementing random generation, ensure the password is only generated once during initial setup and persisted securely.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/package.json`:
- Around line 13-15: The jsonwebtoken package at version 9.0.3 has transitive
dependencies on lodash with known security vulnerabilities. Update jsonwebtoken
to a newer version that resolves these vulnerabilities (consider upgrading to a
more recent stable version), and if necessary, explicitly update lodash to a
patched version in the dependencies. After making these changes, run npm audit
to verify that the vulnerabilities have been resolved and no new issues are
introduced. Additionally, consider setting up automated transitive dependency
scanning in your CI/CD pipeline to catch such vulnerabilities early in future
development.
In `@backend/server.js`:
- Line 64: The console.log statement that logs the seeded default admin user is
including the plaintext defaultPassword value in the output, which poses a
security risk as logs may be persisted or sent to external systems. Remove the
password variable from the log message. Keep the log statement that confirms the
default admin user was seeded but only include the defaultEmail and a success
message without exposing the actual defaultPassword value.
- Line 21: Remove the hardcoded fallback value from the JWT_SECRET constant
definition and implement a fail-fast approach that throws an error if the
environment variable is not set in production environments, or generate a random
ephemeral secret with a warning in development. Create a jwtSigningKey variable
that will hold the resolved secret value, and replace all uses of JWT_SECRET
with jwtSigningKey throughout the codebase for signing and verifying
authentication tokens.
---
Nitpick comments:
In `@backend/server.js`:
- Around line 51-53: The hardcoded defaultPassword variable is set to the weak
default value 'password', which poses a security risk. Replace this with a
solution that either generates a random initial password on first run and logs
it to the console for the administrator to capture, or add prominent
documentation (in comments and README) stating that this hardcoded password must
be changed immediately before any non-local deployment. If implementing random
generation, ensure the password is only generated once during initial setup and
persisted securely.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 51974059-1c2b-4477-8358-d8109b0c1f85
⛔ Files ignored due to path filters (1)
backend/package-lock.jsonis excluded by!**/package-lock.json
📒 Files selected for processing (4)
backend/package.jsonbackend/server.jsfrontend/src/App.tsxfrontend/src/i18n/translations.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- frontend/src/i18n/translations.ts
| "bcrypt": "^6.0.0", | ||
| "express": "^4.21.2", | ||
| "jsonwebtoken": "^9.0.3", |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check for security advisories on bcrypt and jsonwebtoken
echo "=== Checking bcrypt advisories ==="
gh api graphql -f query='
{
securityVulnerabilities(first: 5, ecosystem: NPM, package: "bcrypt") {
nodes {
advisory { summary severity publishedAt }
vulnerableVersionRange
firstPatchedVersion { identifier }
}
}
}'
echo ""
echo "=== Checking jsonwebtoken advisories ==="
gh api graphql -f query='
{
securityVulnerabilities(first: 5, ecosystem: NPM, package: "jsonwebtoken") {
nodes {
advisory { summary severity publishedAt }
vulnerableVersionRange
firstPatchedVersion { identifier }
}
}
}'
echo ""
echo "=== Latest versions on npm ==="
curl -s https://registry.npmjs.org/bcrypt/latest | jq '.version'
curl -s https://registry.npmjs.org/jsonwebtoken/latest | jq '.version'Repository: Infraviored/DealMapper
Length of output: 1996
🌐 Web query:
bcrypt npm 6.0.0 security vulnerabilities advisories
💡 Result:
As of June 19, 2026, there are no known security vulnerabilities associated with bcrypt version 6.0.0 [1][2]. The bcrypt package previously contained a significant security issue known as the "wrap-around" bug (CVE-2020-7689), which caused passwords 255 characters or longer to be truncated, and an additional issue involving improper handling of NUL characters in passwords [3][4][5]. Both of these vulnerabilities were addressed and patched in version 5.0.0 [3][6][4]. Version 6.0.0, released on May 11, 2025, is the latest stable version and remains free of these historical vulnerabilities [1][4]. Users are encouraged to keep their dependencies up to date and use tools like npm audit to monitor for any newly discovered issues [7].
Citations:
- 1: https://security.snyk.io/package/npm/bcrypt
- 2: https://secure.software/npm/packages/bcrypt/6.0.0
- 3: https://security.snyk.io/vuln/SNYK-JS-BCRYPT-572911
- 4: https://www.npmjs.com/package/bcrypt
- 5: https://github.com/kelektiv/node.bcrypt.js/
- 6: https://github.com/kelektiv/node.bcrypt.js/blob/master/CHANGELOG.md
- 7: https://www.sherlockforensics.com/security/npm/bcrypt.html
🌐 Web query:
jsonwebtoken npm 9.0.3 security vulnerabilities CVE advisories
💡 Result:
As of June 19, 2026, the npm package jsonwebtoken version 9.0.3 does not have direct, uniquely assigned CVEs for vulnerabilities within its own codebase [1][2]. While major security advisories (such as CVE-2022-23529, CVE-2022-23540, CVE-2022-23541, and CVE-2022-23539) affected versions prior to 9.0.0, these have been patched [3][4][5]. However, users should be aware of the following security context for version 9.0.3: 1. Transitive Vulnerabilities: As of April 2026, version 9.0.3 has been reported to be flagged by security scanners due to vulnerabilities in its transitive lodash dependencies (e.g., CVE-2026-4800, CVE-2019-10744, CVE-2021-23337, and CVE-2020-8203) [6]. Development teams are tracking these issues in the project repository (e.g., issue #1023), but these are dependency-level concerns rather than flaws in the jsonwebtoken logic itself [6]. 2. Dependency Updates: Version 9.0.3 (released December 4, 2025) updated the jws dependency to version 4.0.1 [3][7]. 3. Recommended Practice: Because jsonwebtoken is a widely used library, always ensure that your application-level usage follows secure coding practices, such as validating algorithms and sanitizing user input, as recommended in the official documentation [8][9]. For the most accurate and up-to-date status, consult the project's official GitHub repository issues and your own security scanning tools to assess the impact of these transitive dependencies on your specific environment [6].
Citations:
- 1: https://security.snyk.io/package/npm/jsonwebtoken
- 2: https://secure.software/npm/packages/jsonwebtoken
- 3: https://github.com/auth0/node-jsonwebtoken/blob/v9.0.3/CHANGELOG.md
- 4: https://nvd.nist.gov/vuln/detail/cve-2022-23539
- 5: GHSA-qwph-4952-7xr6
- 6: Inquiry about lodash dependency updates auth0/node-jsonwebtoken#1023
- 7: auth0/node-jsonwebtoken@v9.0.2...v9.0.3
- 8: GHSA-27h2-hvpr-p74q
- 9: https://npmx.dev/package/jsonwebtoken/v/9.0.3
🌐 Web query:
bcrypt npm latest version 2026
💡 Result:
As of June 19, 2026, the latest version of the bcrypt package on npm is 6.0.0 [1][2]. It was published on May 11, 2025 [1][3].
Citations:
- 1: https://www.npmjs.com/package/bcrypt
- 2: https://www.npmjs.com/package/bcrypt?activeTab=versions
- 3: https://www.npmjs.com/package/bcrypt?activeTab=code
🌐 Web query:
jsonwebtoken npm latest version 2026
💡 Result:
As of June 19, 2026, the latest version of the jsonwebtoken npm package is 9.0.3 [1][2][3]. This version was published on December 4, 2025 [1][3][4].
Citations:
- 1: https://www.npmjs.com/package/jsonwebtoken
- 2: https://npmx.dev/package/jsonwebtoken
- 3: https://security.snyk.io/package/npm/jsonwebtoken
- 4: https://packages.ecosyste.ms/registries/npmjs.org/packages/jsonwebtoken
Address transitive vulnerabilities in jsonwebtoken dependencies.
bcrypt 6.0.0 is current and free from known vulnerabilities. However, jsonwebtoken 9.0.3 depends on lodash with known vulnerabilities (CVE-2026-4800, CVE-2019-10744, CVE-2021-23337, CVE-2020-8203). Run npm audit to assess impact on your application and consider updating lodash or implementing transitive dependency auditing in your CI/CD pipeline.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@backend/package.json` around lines 13 - 15, The jsonwebtoken package at
version 9.0.3 has transitive dependencies on lodash with known security
vulnerabilities. Update jsonwebtoken to a newer version that resolves these
vulnerabilities (consider upgrading to a more recent stable version), and if
necessary, explicitly update lodash to a patched version in the dependencies.
After making these changes, run npm audit to verify that the vulnerabilities
have been resolved and no new issues are introduced. Additionally, consider
setting up automated transitive dependency scanning in your CI/CD pipeline to
catch such vulnerabilities early in future development.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces user authentication using JWT and bcrypt, adds a custom internationalization (i18n) framework supporting English and German, refactors the frontend to use reusable UI components (Button, Card, Input), and optimizes the scraper to use fast, direct HTTP requests instead of Selenium by default. The review feedback highlights critical security vulnerabilities regarding hardcoded production secrets and default credentials, a bug in cookie parsing, and a pagination regex issue. Additionally, valuable performance improvements are suggested to batch disk writes and reuse HTTP connections via requests.Session.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if (req.headers.cookie) { | ||
| const cookies = Object.fromEntries( | ||
| req.headers.cookie.split('; ').map(c => { | ||
| const parts = c.split('='); | ||
| return [parts[0], parts.slice(1).join('=')]; | ||
| }) | ||
| ); | ||
| token = cookies.token; | ||
| } |
There was a problem hiding this comment.
The current cookie parsing logic splits strictly by '; ' (semicolon followed by a space). However, according to RFC 6265, the space after the semicolon is optional. If a client sends cookies separated only by a semicolon (e.g., token=xyz;other=abc), this parsing logic will fail to split them correctly, resulting in a malformed token and authentication failure.
Using a regular expression like /;\s*/ to split the cookies and trimming the keys ensures robust parsing.
| if (req.headers.cookie) { | |
| const cookies = Object.fromEntries( | |
| req.headers.cookie.split('; ').map(c => { | |
| const parts = c.split('='); | |
| return [parts[0], parts.slice(1).join('=')]; | |
| }) | |
| ); | |
| token = cookies.token; | |
| } | |
| if (req.headers.cookie) { | |
| const cookies = Object.fromEntries( | |
| req.headers.cookie.split(/;\s*/).map(c => { | |
| const parts = c.split('='); | |
| return [parts[0].trim(), parts.slice(1).join('=')]; | |
| }) | |
| ); | |
| token = cookies.token; | |
| } |
| const sqlite3 = require('sqlite3').verbose(); | ||
| const jwt = require('jsonwebtoken'); | ||
| const bcrypt = require('bcrypt'); | ||
| const JWT_SECRET = process.env.JWT_SECRET || 'dealmapper_dev_secret_key_12345'; |
There was a problem hiding this comment.
Using a hardcoded fallback for JWT_SECRET in production is a security risk. If the environment variable is missing, the application will run with a known default secret, making it vulnerable to token forgery.
It is highly recommended to enforce that JWT_SECRET is set when running in production, and throw an error or exit if it is missing.
| const JWT_SECRET = process.env.JWT_SECRET || 'dealmapper_dev_secret_key_12345'; | |
| if (process.env.NODE_ENV === 'production' && !process.env.JWT_SECRET) { | |
| console.error("FATAL: JWT_SECRET environment variable is required in production!"); | |
| process.exit(1); | |
| } | |
| const JWT_SECRET = process.env.JWT_SECRET || 'dealmapper_dev_secret_key_12345'; |
| const defaultEmail = 'admin@dealmapper.local'; | ||
| const defaultPassword = 'password'; | ||
| const saltRounds = 10; |
There was a problem hiding this comment.
Seeding a default admin user with a hardcoded password ('password') is a security risk. It is highly recommended to allow configuring the default admin credentials via environment variables, with fallback values only for development environments.
| const defaultEmail = 'admin@dealmapper.local'; | |
| const defaultPassword = 'password'; | |
| const saltRounds = 10; | |
| const defaultEmail = process.env.DEFAULT_ADMIN_EMAIL || 'admin@dealmapper.local'; | |
| const defaultPassword = process.env.DEFAULT_ADMIN_PASSWORD || 'password'; | |
| const saltRounds = 10; |
| app.use('/api', (req, res, next) => { | ||
| if (req.path === '/auth/login' || req.path === '/auth/logout' || req.path === '/auth/me') { | ||
| return next(); | ||
| } | ||
| authenticateToken(req, res, next); | ||
| }); |
There was a problem hiding this comment.
The path checks inside the global API auth protection middleware are redundant because the public routes /api/auth/login and /api/auth/logout are defined before this middleware in the Express routing chain. In Express, requests matching those routes are fully handled and terminated before reaching this middleware.
You can simplify this by removing the redundant path checks and applying the authenticateToken middleware directly.
// Global API Auth Protection (applied to all subsequent /api/* routes)
app.use('/api', authenticateToken);| def parse_listing_details_requests(url): | ||
| """Retrieve and parse detailed specifications, description, and images using direct requests""" | ||
| try: | ||
| response = requests.get(url, headers=HEADERS, timeout=10) |
There was a problem hiding this comment.
The parse_listing_details_requests function performs a direct requests.get call for every listing. When harvesting descriptions for multiple listings sequentially, this establishes a new TCP connection and SSL handshake for every single request, which is highly inefficient.
By utilizing a requests.Session object, you can enable HTTP Keep-Alive connection reuse, significantly speeding up the harvesting process and reducing overhead on both the client and server.
def parse_listing_details_requests(url, session=None):
"""Retrieve and parse detailed specifications, description, and images using direct requests"""
try:
caller = session if session else requests
response = caller.get(url, headers=HEADERS, timeout=10)| if "/seite:" in base_url: | ||
| current_url = re.sub(r"/seite:\d+/", f"/seite:{page}/", base_url) |
There was a problem hiding this comment.
The regular expression r"/seite:\d+/" expects a trailing slash. If the base_url ends with /seite:2 without a trailing slash, the regex will fail to match, and the URL will not be paginated correctly.
Using a more robust regex pattern like r"/seite:\d+(/|$)" handles both cases correctly.
| if "/seite:" in base_url: | |
| current_url = re.sub(r"/seite:\d+/", f"/seite:{page}/", base_url) | |
| if "/seite:" in base_url: | |
| current_url = re.sub(r"/seite:\d+(/|$)", f"/seite:{page}\\1", base_url) |
|
|
||
| if output_file: | ||
| with open(output_file, "w", encoding="utf-8") as f: | ||
| json.dump( | ||
| existing_listings, f, ensure_ascii=False, indent=4 | ||
| ) | ||
|
|
||
| except Exception as e: | ||
| logger.error(f"Error scraping single listing: {str(e)}") | ||
| continue | ||
|
|
||
| logger.info( | ||
| f"Scraped {scraped_count} discovered listings from {current_url}" |
There was a problem hiding this comment.
Writing the entire existing_listings list to the JSON file on every single iteration of the loop is extremely inefficient and causes unnecessary disk I/O overhead.
Instead, you should batch the write operation by moving the json.dump call outside of the inner loop, executing it once per page after all listings on that page have been parsed.
except Exception as e:
logger.error(f"Error scraping single listing: {str(e)}")
continue
if output_file:
with open(output_file, "w", encoding="utf-8") as f:
json.dump(
existing_listings, f, ensure_ascii=False, indent=4
)
logger.info(
f"Scraped {scraped_count} discovered listings from {current_url}"
)|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces user authentication, internationalization (i18n) support, and a fast requests-based scraping pipeline. On the backend, JWT-based authentication and SQLite user seeding have been implemented, while the frontend has been updated with a login portal, reusable UI components, and English/German translations. Additionally, the scraper has been optimized to bypass browser automation using direct HTTP requests, with legacy Selenium paths preserved. The code review feedback highlights critical security vulnerabilities, including a default admin password fallback in production, plain-text logging of seeded credentials, and susceptibility to timing attacks on the login endpoint. A robustness issue was also identified in the JSON-LD parser when handling array-wrapped metadata structures.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| const defaultEmail = process.env.DEFAULT_ADMIN_EMAIL || 'admin@dealmapper.local'; | ||
| const defaultPassword = process.env.DEFAULT_ADMIN_PASSWORD || 'password'; |
There was a problem hiding this comment.
In production, defaulting the admin password to 'password' if DEFAULT_ADMIN_PASSWORD is not set poses a severe security risk. It is highly recommended to require DEFAULT_ADMIN_PASSWORD to be explicitly set in production environments, similar to how JWT_SECRET is handled, and exit the process if it is missing.
const defaultEmail = process.env.DEFAULT_ADMIN_EMAIL || 'admin@dealmapper.local';
let defaultPassword = process.env.DEFAULT_ADMIN_PASSWORD;
if (process.env.NODE_ENV === 'production' && !defaultPassword) {
console.error("FATAL: DEFAULT_ADMIN_PASSWORD environment variable is required in production to seed the default admin user!");
process.exit(1);
}
if (!defaultPassword) {
defaultPassword = 'password';
}| [defaultEmail, hash, 'admin'], | ||
| (err) => { | ||
| if (err) console.error("Failed to seed default admin user:", err); | ||
| else console.log(`Seeded default admin user: ${defaultEmail} (password: ${defaultPassword})`); |
There was a problem hiding this comment.
Logging the seeded admin password in plain text to the console/stdout is a security vulnerability (CWE-532). In production, these logs are often persisted or monitored by third-party systems, exposing sensitive credentials. Remove the password from the log message.
| else console.log(`Seeded default admin user: ${defaultEmail} (password: ${defaultPassword})`); | |
| else console.log(`Seeded default admin user: ${defaultEmail}`); |
| const user = await get("SELECT * FROM users WHERE email = ?", [email.toLowerCase().trim()]); | ||
| if (!user) { | ||
| return res.status(401).json({ error: 'Invalid email or password.' }); | ||
| } | ||
|
|
||
| const isMatch = await bcrypt.compare(password, user.password_hash); | ||
| if (!isMatch) { | ||
| return res.status(401).json({ error: 'Invalid email or password.' }); | ||
| } |
There was a problem hiding this comment.
The login route is vulnerable to user enumeration via a timing attack. If the user is not found in the database, the server returns a 401 response immediately. If the user is found, it performs a computationally expensive bcrypt.compare operation. This timing difference allows an attacker to easily determine which email addresses are registered. To mitigate this, always perform a dummy bcrypt comparison when the user is not found.
| const user = await get("SELECT * FROM users WHERE email = ?", [email.toLowerCase().trim()]); | |
| if (!user) { | |
| return res.status(401).json({ error: 'Invalid email or password.' }); | |
| } | |
| const isMatch = await bcrypt.compare(password, user.password_hash); | |
| if (!isMatch) { | |
| return res.status(401).json({ error: 'Invalid email or password.' }); | |
| } | |
| const user = await get("SELECT * FROM users WHERE email = ?", [email.toLowerCase().trim()]); | |
| // Always run bcrypt.compare to prevent timing attacks (user enumeration) | |
| const passwordHash = user ? user.password_hash : "$2b$10$invalidhashplaceholderdummyvalue"; | |
| const isMatch = await bcrypt.compare(password, passwordHash); | |
| if (!user || !isMatch) { | |
| return res.status(401).json({ error: 'Invalid email or password.' }); | |
| } |
| data = json.loads(script.string) | ||
| if isinstance(data, dict) and data.get("@type") == "ImageObject": | ||
| if data.get("representativeOfPage") is True: | ||
| schema_description = data.get("description") | ||
| schema_main_image = data.get("contentUrl") |
There was a problem hiding this comment.
The JSON-LD parser assumes that the parsed data is always a single dictionary. However, Schema.org metadata can often be wrapped in a JSON array (list) or contain a @graph array of objects. If data is a list, calling data.get will raise an AttributeError and skip the block. Enhance the parser to handle both single objects, lists, and @graph structures to make metadata extraction more robust.
| data = json.loads(script.string) | |
| if isinstance(data, dict) and data.get("@type") == "ImageObject": | |
| if data.get("representativeOfPage") is True: | |
| schema_description = data.get("description") | |
| schema_main_image = data.get("contentUrl") | |
| data = json.loads(script.string) | |
| # Normalize to a list of objects to handle both single objects and arrays | |
| items = [] | |
| if isinstance(data, list): | |
| items = data | |
| elif isinstance(data, dict): | |
| if "@graph" in data and isinstance(data["@graph"], list): | |
| items = data["@graph"] | |
| else: | |
| items = [data] | |
| for item in items: | |
| if isinstance(item, dict) and item.get("@type") == "ImageObject": | |
| if item.get("representativeOfPage") is True: | |
| schema_description = item.get("description") | |
| schema_main_image = item.get("contentUrl") |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scraper/scraper.py (1)
384-423:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winCritical: Database UPDATE is outside the for loop - only the last listing gets updated.
The
cursor.executeblock starting at line 405 is indented at the same level as theforstatement, placing it outside the loop. This means:
- The loop fetches and parses details for every listing
- Only the final iteration's
listing_idandparsedvalues are used in the UPDATE- All other listings' harvested data is silently lost
Proposed fix to move UPDATE inside the loop
parsed = parse_listing_details_requests(url, session=session) if parsed is None: continue - cursor.execute( - """ - UPDATE listings - SET detailed_description = ?, details = ?, images = ?, full_info_obtained = 1, - last_description_changed_at = ? - WHERE id = ? - """, - ( - parsed["detailed_description"], - json.dumps(parsed["details"]), - json.dumps(parsed["images"]), - datetime.datetime.now(datetime.timezone.utc).isoformat(), - listing_id, - ), - ) - conn.commit() - logger.info( - f"Successfully harvested detailed description, {len(parsed['details'])} details, and {len(parsed['images'])} images for ID {listing_id}" - ) + cursor.execute( + """ + UPDATE listings + SET detailed_description = ?, details = ?, images = ?, full_info_obtained = 1, + last_description_changed_at = ? + WHERE id = ? + """, + ( + parsed["detailed_description"], + json.dumps(parsed["details"]), + json.dumps(parsed["images"]), + datetime.datetime.now(datetime.timezone.utc).isoformat(), + listing_id, + ), + ) + conn.commit() + logger.info( + f"Successfully harvested detailed description, {len(parsed['details'])} details, and {len(parsed['images'])} images for ID {listing_id}" + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scraper/scraper.py` around lines 384 - 423, The cursor.execute block that updates the database is currently indented at the same level as the for loop, placing it outside the loop. This causes only the last listing to be updated while all other listings' harvested data is discarded. Move the entire cursor.execute UPDATE statement (which updates the listings table with detailed_description, details, images, and full_info_obtained) inside the for loop by increasing its indentation level so it executes after each listing is parsed. Also ensure the conn.commit() and logger.info calls that follow the UPDATE are moved inside the loop as well.
♻️ Duplicate comments (1)
scraper/scraper.py (1)
485-530:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winCritical: Update logic is outside the for loop - only the last listing is processed.
Lines 507-530 are indented at the
withblock level, not inside theforloop. This results in:
- The loop fetches details for all listings but discards results except the last
- Only the final listing's description is compared and potentially updated
- All other listings are never checked for changes
Proposed fix to move processing inside the loop
parsed = parse_listing_details_requests(url, session=session) if parsed is None: continue - detailed_description = parsed["detailed_description"] or "" - - if detailed_description.strip() != old_description.strip(): - logger.info( - f"Description changed for listing {listing_id}! Updating in DB." - ) - cursor.execute( - """ - UPDATE listings - SET detailed_description = ?, details = ?, images = ?, full_info_obtained = 1, - last_description_changed_at = ? - WHERE id = ? - """, - ( - detailed_description, - json.dumps(parsed["details"]), - json.dumps(parsed["images"]), - datetime.datetime.now(datetime.timezone.utc).isoformat(), - listing_id, - ), - ) - conn.commit() - else: - logger.info(f"No description changes for listing {listing_id}.") + detailed_description = parsed["detailed_description"] or "" + + if detailed_description.strip() != old_description.strip(): + logger.info( + f"Description changed for listing {listing_id}! Updating in DB." + ) + cursor.execute( + """ + UPDATE listings + SET detailed_description = ?, details = ?, images = ?, full_info_obtained = 1, + last_description_changed_at = ? + WHERE id = ? + """, + ( + detailed_description, + json.dumps(parsed["details"]), + json.dumps(parsed["images"]), + datetime.datetime.now(datetime.timezone.utc).isoformat(), + listing_id, + ), + ) + conn.commit() + else: + logger.info(f"No description changes for listing {listing_id}.")Note: Once this indentation is fixed, the prior review feedback about persisting
details/imageschanges even when description is unchanged will still apply.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scraper/scraper.py` around lines 485 - 530, The code block starting with the detailed_description assignment and continuing through the database update and logging logic is indented at the wrong level (at the with statement level instead of inside the for loop). This means the loop iterates through all rows and calls parse_listing_details_requests for each one, but the comparison and update logic only executes once after the loop completes, using only the last parsed result. Move all the code from the detailed_description assignment through the final else block to be indented inside the for loop, at the same indentation level as the parse_listing_details_requests call and the if parsed is None check, so that each listing's details are compared and updated before moving to the next listing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@scraper/scraper.py`:
- Around line 384-423: The cursor.execute block that updates the database is
currently indented at the same level as the for loop, placing it outside the
loop. This causes only the last listing to be updated while all other listings'
harvested data is discarded. Move the entire cursor.execute UPDATE statement
(which updates the listings table with detailed_description, details, images,
and full_info_obtained) inside the for loop by increasing its indentation level
so it executes after each listing is parsed. Also ensure the conn.commit() and
logger.info calls that follow the UPDATE are moved inside the loop as well.
---
Duplicate comments:
In `@scraper/scraper.py`:
- Around line 485-530: The code block starting with the detailed_description
assignment and continuing through the database update and logging logic is
indented at the wrong level (at the with statement level instead of inside the
for loop). This means the loop iterates through all rows and calls
parse_listing_details_requests for each one, but the comparison and update logic
only executes once after the loop completes, using only the last parsed result.
Move all the code from the detailed_description assignment through the final
else block to be indented inside the for loop, at the same indentation level as
the parse_listing_details_requests call and the if parsed is None check, so that
each listing's details are compared and updated before moving to the next
listing.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: edece5e5-363f-4e1a-bfd2-4a9a3c8a07ff
📒 Files selected for processing (2)
backend/server.jsscraper/scraper.py
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/server.js
… harden JSON-LD parser
Summary by CodeRabbit
Release Notes