Skip to content

feat/direct get scraper#1

Open
Infraviored wants to merge 8 commits into
mainfrom
feat/direct-get-scraper
Open

feat/direct get scraper#1
Infraviored wants to merge 8 commits into
mainfrom
feat/direct-get-scraper

Conversation

@Infraviored

@Infraviored Infraviored commented Jun 18, 2026

Copy link
Copy Markdown
Owner
  • feat: migrate to reusable UI components and lightweight translation system
  • style: fix linter errors and typecast compiler constraints
  • feat(scraper): add direct GET scraping and discovery demonstration scripts
  • feat: implement fast requests-based HTTP scraper, preserve legacy Selenium fallback, and add configurable politeness delays

Summary by CodeRabbit

Release Notes

  • New Features
    • Added English/German internationalization across the app.
    • Added cookie-based login/logout, “current user” endpoint, and protected API access.
    • Scheduling now supports configurable delays between pages and listings.
  • Improvements
    • Refreshed key screens with consistent shared UI components (Card, Button, Input) and localized labels.
    • Enhanced scraping using a faster direct-HTTP discovery/enrichment flow, with Selenium kept as a fallback.
  • Bug Fixes
    • Improved highlight handling for non-legacy listings and standardized UTC timestamps.

@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: f853200e-052a-48ec-ab01-a85d7416499b

📥 Commits

Reviewing files that changed from the base of the PR and between 25adb7c and e1295a1.

📒 Files selected for processing (2)
  • backend/server.js
  • scraper/scraper.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • backend/server.js
  • scraper/scraper.py

📝 Walkthrough

Walkthrough

Introduces a frontend i18n system (translations dictionary, useTranslation hook, shared Button/Card/Input primitives) wired across all major views and components, adds JWT/bcrypt authentication to backend API routes, adds delayBetweenPages/delayBetweenListings to the /api/schedule API and SettingsView. Refactors scraper.py to use requests+BeautifulSoup for discovery and harvesting with Selenium as a fallback, extracts the legacy Selenium path into a new scraper_selenium.py module, updates timestamps to UTC across agent_worker.py, and adds three standalone demo scripts.

Changes

Backend authentication and schedule configuration

Layer / File(s) Summary
Auth: JWT/bcrypt initialization and user table
backend/package.json, backend/server.js
Imports jsonwebtoken and bcrypt, defines JWT_SECRET from environment with hardcoded fallback and production-time validation, creates users table on startup, and seeds a default admin user with bcrypt hash when the table is empty.
Auth: authenticateToken middleware and endpoints
backend/server.js
Introduces authenticateToken middleware extracting and verifying JWT from token cookie, fetching user from database, and returning 401 for invalid tokens. Adds POST /api/auth/login (validates email/password, verifies bcrypt hash, issues 7-day JWT, sets httpOnly cookie), POST /api/auth/logout (clears cookie), GET /api/auth/me (returns current user). Installs global /api middleware enforcing auth on all /api/* routes except the auth endpoints.
Schedule API: delay field persistence
backend/server.js
GET /api/schedule default config gains delayBetweenPages and delayBetweenListings (default 0.25). POST /api/schedule reads both from request body, parses as floats with 0.25 fallback for missing/negative/NaN, coerces fullFetchOnStartup to boolean, and persists all fields into schedule_config.json.

Frontend i18n system and UI component adoption

Layer / File(s) Summary
Shared UI primitives: Button, Card, Input
frontend/src/components/ui/Button.tsx, frontend/src/components/ui/Card.tsx, frontend/src/components/ui/Input.tsx
New Button supports variant/size/loading props with inline spinner SVG; Card renders a styled div with conditional interactive hover classes; Input uses forwardRef and applies error-state border styling.
i18n infrastructure: translations and useTranslation hook
frontend/src/i18n/translations.ts, frontend/src/hooks/useTranslation.ts
Defines full en/de translation dictionaries across eight namespaces (common, landing, dashboard, wizard, settings, tuner, listing, auth) with {{placeholder}} interpolation. useTranslation provides module-level globalLang synced via listeners set and localStorage, dot-path t() resolver with English fallback, toggleLanguage, and setLang.
App.tsx: i18n wiring and all-view translation
frontend/src/App.tsx
Wires useTranslation in App(), implements checkAuth() (/api/auth/me), handleLoginSubmit() (/api/auth/login), and handleLogout() (/api/auth/logout) with translated error messages. Displays translated loading screen and login form during auth flow. Refactors header session widget to use Button with translated strings. Replaces all hardcoded labels in landing/dashboard/edit/create-campaign views with t() calls, wraps form areas in Card, localizes dropdown options.
SettingsView: translation, delay fields, and UI component adoption
frontend/src/components/SettingsView.tsx
Adds useTranslation and translates all UI text (back button, title, section headings, toggle labels, help text, errors). Extends fetchConfig to populate delayBetweenPages/delayBetweenListings from GET response (with 0.25 defaults). Updates handleSave to send and read back delay fields from /api/schedule. Refactors inputs to Input component with translated labels. Migrates Save/Cancel to Button with loading={saving}.
CriteriaTuner and ListingDetailCard localization
frontend/src/components/CriteriaTuner.tsx, frontend/src/components/ListingDetailCard.tsx, frontend/src/utils/listingTransformer.ts
Adds useTranslation to each component; translates headings, labels, badges, status text; migrates boolean toggle and AI evaluation controls to Button; wraps ListingDetailCard in Card; localizes date/time formatting using lang for locale-aware display; derives per-dimension labels from translation keys; tightens highlights type cast from any to Listing['highlights'].
GuidelinesWizard localization
frontend/src/components/GuidelinesWizard.tsx
Adds useTranslation and replaces all hardcoded strings across three wizard steps with translation keys; migrates copy-prompt, navigation, and save actions to shared Button components while preserving step logic and status rendering.

Scraper requests-based refactor and legacy Selenium preservation

Layer / File(s) Summary
scraper.py: requests infrastructure and detail parsing
scraper/scraper.py
Adds requests, BeautifulSoup, and HEADERS constant. Introduces parse_listing_details_requests to fetch detail pages and extract title/description from JSON-LD ImageObject schema (with HTML fallback), key/value details, and gallery image URLs. Routes scrape_listings to scrape_listings_requests (primary) or scrape_listings_selenium (fallback) based on INTERACTIVE_LOGIN. Implements scrape_listings_requests with paginated discovery, duplicate-ID detection, incremental JSON save, and per-page delays. Updates preview_url_listings_count to use requests+BeautifulSoup.
scraper.py: requests-based harvesting and update workflows
scraper/scraper.py
Replaces per-listing Selenium harvest in harvest_descriptions with DB-only requests workflow: queries listings missing detailed info, fetches via parse_listing_details_requests, updates SQLite fields (detailed_description, JSON details/images, full_info_obtained, UTC last_description_changed_at) with per-listing error handling. Reworks update_all_descriptions_session to use requests parsing and conditional DB writes: re-fetches detail pages, compares normalized description to stored value, updates only when changed, sets timestamps.
scraper.py: Selenium helper functions and fallback discovery
scraper/scraper.py
save_logged_in_email uses UTC ISO timestamps. check_login_status ensures navigation to projects page before reading login indicator. manual_login replaces quick-check branch with timed watcher loop (up to 300s). scrape_listings_selenium implements interactive-login fallback with persistent Chrome profile, automation-disabling flags, runs manual_login, early-return for validation. Inner discovery mirrors requests path: /seite: pagination URL rewriting, waits for results container, BeautifulSoup listing parsing, dedup checks, max-listing limits, incremental output_file save, per-page delays.
scraper_selenium.py: full legacy Selenium scraper module
scraper/scraper_selenium.py
New module preserving complete Selenium-based scraper: update_progress (JSON status), harvest_missing_descriptions (DB-driven per-listing detail extraction), cookie/session persistence (pickle-based), manual_login with interactive watcher, get_detailed_description, scrape_page (search result parsing with dedup), scrape_listings (full orchestration with persistent profile and anti-detection), preview_url_listings_count, and session wrappers.
Demo scripts: requests-based scraping examples
scripts/demo_direct_get.py, scripts/demo_discover_listings.py, scripts/demo_full_pipeline.py
Three standalone scripts demonstrating requests-based scraping without Selenium: single-listing detail parsing (schema JSON-LD with HTML fallback, key/value details, gallery images, price); paginated search discovery (pagination detection, listing extraction, dedup, preview); and end-to-end pipeline (chained discovery and detail harvesting with formatted summary).
agent_worker.py: UTC timestamp standardization
scraper/agent_worker.py
Updates logging and database writes to use UTC-based ISO timestamps instead of local-time format for log file naming, RUN headers, and SQLite updates to llm_processed_time and last_ai_evaluated_at.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

🐇 Hop, hop! The rabbit translates the land,
With t('landing.title') keys so grand.
Requests replace Selenium's slow crawl,
Button, Card, and Input stand tall.
Auth guards the API with JWT's might—
Two tongues, two workflows, both polished bright! 🌐

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Title check ⚠️ Warning The PR title 'feat/direct get scraper' is partially related to the changeset but does not capture the full scope of changes. While the direct GET scraper is a significant feature, the PR includes substantial frontend i18n/UI component work, backend auth improvements, and multi-platform scraper refactoring that are not reflected in the title. Consider a more comprehensive title such as 'feat: direct GET scraper with auth, i18n, and UI components' or 'feat: HTTP-based scraper, authentication system, and internationalization layer' to better represent the multi-faceted nature of the changes.
Docstring Coverage ⚠️ Warning Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/direct-get-scraper

Comment @coderabbitai help to get the list of available commands.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a lightweight internationalization (i18n) system with English and German translations, refactors several frontend components to use standardized UI elements (Button, Card, Input), and optimizes the scraper backend to support a fast requests-based pipeline alongside the legacy Selenium implementation. The code review feedback highlights several key areas for improvement: 1. Input Validation: Ensure configured scraper delays are validated as non-negative numbers in the backend to prevent runtime crashes. 2. Internationalization & Redundancy: Replace redundant ternary operators and hardcoded dimension labels in the frontend with the newly introduced translation keys. 3. Code Safety: Use replaceAll instead of dynamic RegExp in the translation hook to avoid injection vulnerabilities. 4. Defensive Programming: Guard against potential IndexError and TypeError exceptions in the scraper when parsing malformed URLs or empty JSON-LD scripts. 5. Timezone Safety: Consistently use timezone-aware UTC datetimes instead of naive local datetimes to prevent synchronization mismatches.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread backend/server.js Outdated
Comment on lines +646 to +647
delayBetweenPages: parseFloat(delayBetweenPages !== undefined ? delayBetweenPages : 0.25),
delayBetweenListings: parseFloat(delayBetweenListings !== undefined ? delayBetweenListings : 0.25)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The parsed values for delayBetweenPages and delayBetweenListings are not validated to ensure they are non-negative numbers. If a user configures a negative value, Python's time.sleep() in the scraper will throw a ValueError and crash the background process. Additionally, if parseFloat returns NaN, it will be serialized as null in the JSON file, which might cause issues when read.

Validate that the parsed values are valid, non-negative numbers, falling back to 0.25 if invalid.

Comment thread frontend/src/App.tsx Outdated
<span className="text-slate-750">|</span>
<h2 className="text-base font-bold text-slate-200">{campaigns.find(c => c.id === currentCampaignId)?.name} Dashboard</h2>
<h2 className="text-base font-bold text-slate-200">
{campaigns.find(c => c.id === currentCampaignId)?.name} {lang === 'en' ? 'Dashboard' : 'Dashboard'}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The ternary operator lang === 'en' ? 'Dashboard' : 'Dashboard' is redundant and hardcoded. Since you are migrating to a lightweight translation system, you should use the translation key listing.dashboardTitle from your translations file.

Comment on lines +321 to +326
trustworthiness: lang === 'en' ? 'Trustworthiness' : 'Vertrauenswürdigkeit',
transparency: lang === 'en' ? 'Transparency' : 'Transparenz',
conditionConfidence: lang === 'en' ? 'Condition Confidence' : 'Zustandssicherheit',
documentationQuality: lang === 'en' ? 'Documentation Quality' : 'Dokumentationsqualität',
hiddenRiskSuspicion: lang === 'en' ? 'Hidden Risk Suspicion' : 'Versteckter Risikoverdacht',
marketAboveAverageSignal: lang === 'en' ? 'Market Above Average Signal' : 'Überdurchschnittliches Marktsignal'

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

These soft dimension labels are hardcoded with inline language checks. To align with the lightweight translation system introduced in this PR, these strings should be moved to the translation dictionary (translations.ts) and accessed via the t function.

Comment thread frontend/src/hooks/useTranslation.ts Outdated
let result = current;
if (replacements) {
Object.entries(replacements).forEach(([k, v]) => {
result = result.replace(new RegExp(`{{${k}}}`, 'g'), String(v));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using new RegExp with dynamic keys can be unsafe if the keys contain special regex characters. Since you are performing a simple global string replacement, using replaceAll is safer, cleaner, and more performant.

Comment thread scraper/scraper.py Outdated
Comment on lines +178 to +180
parts = base_url.split("/")
domain_part = "/".join(parts[:3])
path_part = parts[3]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If base_url has fewer than 4 parts when split by / (e.g., if an invalid URL or just the domain is entered), accessing parts[3] will raise an IndexError and crash the scraper. Guard against this by checking the length of parts first.

Comment thread scraper/scraper.py
Comment on lines +77 to +79
for script in soup.find_all("script", type="application/ld+json"):
try:
data = json.loads(script.string)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If a <script> tag of type application/ld+json is empty or has no text content, script.string will be None. Passing None to json.loads() raises a TypeError. While this is caught by the try...except block, it is better to defensively check that script.string is not empty before parsing.

Comment thread scraper/scraper.py Outdated
parsed["detailed_description"],
json.dumps(parsed["details"]),
json.dumps(parsed["images"]),
datetime.datetime.now().isoformat(),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using naive local datetimes (datetime.datetime.now()) can lead to timezone mismatch bugs when compared with backend-generated timestamps (e.g., if the backend runs in UTC and the scraper runs in local time). Use timezone-aware UTC datetimes consistently across the system.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 16

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scraper/scraper.py (1)

331-425: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Ensure SQLite connections are closed on all error paths.

Both functions close conn only on the happy path. Any exception before the trailing conn.close() can leave the connection open and eventually cause lock contention in subsequent runs.

Suggested fix
 def harvest_descriptions(campaign_id=None):
@@
-    try:
-        conn = sqlite3.connect(db_path)
+    conn = None
+    try:
+        conn = sqlite3.connect(db_path)
         conn.row_factory = sqlite3.Row
@@
-        conn.close()
     except Exception as e:
         logger.error(f"Error in requests-based harvest_descriptions: {str(e)}")
+    finally:
+        if conn is not None:
+            conn.close()
@@
 def update_all_descriptions_session(campaign_id=None):
@@
-    try:
-        conn = sqlite3.connect(db_path)
+    conn = None
+    try:
+        conn = sqlite3.connect(db_path)
         conn.row_factory = sqlite3.Row
@@
-        conn.close()
     except Exception as e:
         logger.error(
             f"Error in requests-based update_all_descriptions_session: {str(e)}"
         )
+    finally:
+        if conn is not None:
+            conn.close()

Also applies to: 440-532

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scraper/scraper.py` around lines 331 - 425, The SQLite connection opened with
sqlite3.connect(db_path) is only closed in the try block's happy path, leaving
it open if any exception occurs before reaching conn.close(). Wrap the entire
try block's operations in a try-finally structure where the finally block
ensures conn.close() is always executed, or alternatively use a with statement
for the database connection to automatically manage its lifecycle. This same
pattern needs to be applied to the other similar function mentioned in the
comment (also applies to lines 440-532).
🧹 Nitpick comments (2)
scraper/scraper_selenium.py (1)

219-226: 💤 Low value

Pickle deserialization from local file has limited attack surface.

Static analysis flags pickle.load as unsafe (S301). In this context, the cookie file is application-controlled and stored locally, so exploitation requires prior file system access. The current try/except pattern around add_cookie is good defensive practice. Consider adding a comment documenting this risk/tradeoff for future maintainers.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scraper/scraper_selenium.py` around lines 219 - 226, Add a comment above the
pickle.load() call in the cookie loading block to document the security
consideration and justification. The comment should explain that while
pickle.load is flagged as unsafe by static analysis tools (S301), this usage is
acceptable because the cookie file is application-controlled and stored locally,
which limits the attack surface and requires prior file system access for
exploitation. This will help future maintainers understand the deliberate
tradeoff being made here.

Source: Linters/SAST tools

frontend/src/components/ListingDetailCard.tsx (1)

320-327: ⚡ Quick win

Centralize soft-dimension labels in translations instead of inline language branching.

This duplicates i18n logic in the component and makes new-language rollout harder. Use translation keys for these labels like the rest of the card.

♻️ Suggested refactor
-const labels: Record<string, string> = {
-  trustworthiness: lang === 'en' ? 'Trustworthiness' : 'Vertrauenswürdigkeit',
-  transparency: lang === 'en' ? 'Transparency' : 'Transparenz',
-  conditionConfidence: lang === 'en' ? 'Condition Confidence' : 'Zustandssicherheit',
-  documentationQuality: lang === 'en' ? 'Documentation Quality' : 'Dokumentationsqualität',
-  hiddenRiskSuspicion: lang === 'en' ? 'Hidden Risk Suspicion' : 'Versteckter Risikoverdacht',
-  marketAboveAverageSignal: lang === 'en' ? 'Market Above Average Signal' : 'Überdurchschnittliches Marktsignal'
-};
-const label = labels[key] || key;
+const label = t(`listing.softDimensionsLabels.${key}`);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/src/components/ListingDetailCard.tsx` around lines 320 - 327, Move
the soft-dimension labels (trustworthiness, transparency, conditionConfidence,
documentationQuality, hiddenRiskSuspicion, marketAboveAverageSignal) from the
inline Record definition with ternary language branching into your
translation/i18n system. Replace the hardcoded labels object with translation
key references using the same i18n pattern already used elsewhere in the
ListingDetailCard component, eliminating the redundant lang === 'en' checks and
making it easier to support new languages without modifying the component code.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/server.js`:
- Around line 645-647: The delayBetweenPages and delayBetweenListings properties
are using parseFloat with only undefined checks, which allows invalid values
like NaN or negative numbers to be persisted. Add validation after each
parseFloat call to ensure the parsed number is a valid positive number; if the
value is NaN or negative, use the default value (0.25) instead. Apply this
validation logic consistently to both delay properties to prevent invalid delays
from corrupting the config.

In `@frontend/src/App.tsx`:
- Around line 1012-1016: The select dropdown in the status filter has a mismatch
between the option value and label. The option with value="New" uses the
translation key dashboard.statusEvaluated which suggests an "evaluated only"
filter, but the underlying predicate filters for status === 'New'. Change the
label for the option with value="New" to use a translation key that accurately
describes filtering for new items (e.g., dashboard.statusNew or similar) instead
of dashboard.statusEvaluated to align the UI label with the actual filtering
behavior.
- Around line 948-950: The h2 element in the dashboard header contains a no-op
ternary operator that always renders "Dashboard" regardless of the language,
which bypasses the i18n system. Replace the entire ternary expression `lang ===
'en' ? 'Dashboard' : 'Dashboard'` with a proper i18n translation function call
(such as t('dashboard') or equivalent based on your application's translation
system) to ensure the dashboard title is properly localized.

In `@frontend/src/components/ListingDetailCard.tsx`:
- Line 395: The neutral status in the ternary operator evaluating
evalItem.status is hardcoded as the English string 'NEUTRAL' instead of being
translated like the satisfied and violated statuses. Replace the hardcoded
'NEUTRAL' string with t('listing.neutral').toUpperCase() to apply the same
translation and uppercase formatting pattern used for the other two status cases
in the ternary expression.

In `@frontend/src/components/SettingsView.tsx`:
- Line 300: The button text on line 300 in SettingsView.tsx is using the
translation key `settings.scraperRules`, which is meant for a section title, not
a button action. Replace this translation key with an appropriate save-action
translation key (such as `common.save` or `actions.save`) that properly
describes the submit action for this button, ensuring the UI correctly labels
the primary CTA.
- Line 80: The setTimeout call in SettingsView that clears the success message
does not cancel previously scheduled timers, causing multiple overlapping
timeouts to potentially fire and clear the message prematurely, plus leaving
dangling timers on unmount. Store the timeout ID from each setTimeout in a ref
variable, and before scheduling a new timeout in the setSuccess handler, first
call clearTimeout on any existing timeout ID stored in that ref to ensure only
one active timer exists at any time.

In `@frontend/src/components/ui/Button.tsx`:
- Around line 9-17: The Button component is missing an explicit type attribute,
which causes HTML to default to type="submit" inside forms, leading to
unintended form submissions. Add a type prop with a default value of 'button' to
the Button component's destructured parameters (alongside variant, size,
loading, disabled, className, children, and props), then pass this type prop to
the underlying button element to ensure the button behaves as expected in all
contexts.

In `@frontend/src/components/ui/Input.tsx`:
- Around line 8-18: The Input component in its forwardRef callback needs to
expose the error state semantically for accessibility. Add the aria-invalid
attribute to the input element, setting it to the boolean value of the error
prop. Position the aria-invalid attribute before the {...props} spread operator
so that consumers can override the accessibility attribute if needed through the
props parameter.

In `@frontend/src/hooks/useTranslation.ts`:
- Around line 8-11: The globalLang initialization casts the localStorage value
directly to Language type without validating it matches the supported locales.
Add validation after retrieving the value from localStorage using
localStorage.getItem('ui-lang') to check if it is either 'en' or 'de'. If the
retrieved value is not one of the supported locales, fall back to a default
language (such as 'en'). This ensures only valid language values are assigned to
globalLang.

In `@frontend/src/utils/listingTransformer.ts`:
- Line 149: The type assertion on line 149 using `as Listing['highlights']` does
not validate the runtime shape of the highlights data from the LLM payload.
Create a validation function that checks each highlight object in the
extracted_facts.highlights array to ensure the sentiment, type, and confidence
fields contain only the expected literal values (sentiment: 'positive' |
'negative' | 'neutral', type: 'maintenance' | 'warning' | 'feature', confidence:
'high' | 'med' | 'low'). Apply this validation function to filter or validate
the highlights before the type assertion in the highlights line, ensuring only
valid highlights are passed to the Listing type and preventing silent failures
in downstream UI code like ListingDetailCard.tsx that checks conditions such as
h.sentiment === 'negative'.

In `@scraper/scraper_selenium.py`:
- Around line 721-724: The cookie loading loop that iterates over pickled
cookies and calls driver.add_cookie(cookie) lacks error handling for individual
cookies. Wrap each driver.add_cookie(cookie) call within a try/except block
similar to the pattern used in the load_cookies() helper function. This ensures
that if a single cookie fails to load, the loop continues processing remaining
cookies instead of aborting the entire session.
- Around line 386-393: The exception handler in the except block attempts to use
`current_url` which may not be defined if an exception occurs before the
assignment at line 356, causing a NameError. Additionally, the silent except
handler at lines 391-392 hides navigation failures. Initialize `current_url`
before the try block to ensure it's always defined, and replace the silent
`except: pass` clause with a proper logger.error call to capture and log
navigation failures instead of silently swallowing them.
- Around line 65-201: The database connection created with sqlite3.connect at
the beginning of the harvest_missing_descriptions function is not guaranteed to
be closed if an exception occurs before reaching the conn.close() call at the
end. Replace the current try/except structure with a context manager (using the
with statement) for the sqlite3.connect call to ensure the connection is
automatically closed and resources are properly cleaned up even if an exception
occurs during the listing harvesting loop or database operations.

In `@scraper/scraper.py`:
- Around line 445-461: The current update logic for listings only triggers a
database update when the description text changes, causing specs and images
changes to be ignored if the description remains the same. Modify the
conditional logic that determines whether to update a listing row (in the
section starting around line 478) to check not only if the description has
changed, but also if there are any changes to the specs or images. This ensures
that all changes to listing details are persisted to the database, even when the
description text itself remains unchanged.

In `@scripts/demo_direct_get.py`:
- Around line 123-125: The fallback CSS class selector in the soup.select_one()
call for price_elem contains a typo in the class name format. The selector uses
ad-item-main--middle--price-shipping--price but should use
aditem-main--middle--price-shipping--price (without the hyphen between ad and
item) to match the correct class naming convention used elsewhere in the code.
Update the first selector string in the soup.select_one() call to correct this
class name so that price extraction works properly when the primary id selector
is not available.

In `@scripts/demo_full_pipeline.py`:
- Around line 88-90: The price extraction in the code directly chains
select_one() and get_text() methods, which will raise an exception and skip the
entire listing if the price element is missing. Add a null-safety check after
the select_one() call to verify the element exists before calling get_text() on
it, allowing listings with missing prices to be handled gracefully rather than
being dropped.

---

Outside diff comments:
In `@scraper/scraper.py`:
- Around line 331-425: The SQLite connection opened with
sqlite3.connect(db_path) is only closed in the try block's happy path, leaving
it open if any exception occurs before reaching conn.close(). Wrap the entire
try block's operations in a try-finally structure where the finally block
ensures conn.close() is always executed, or alternatively use a with statement
for the database connection to automatically manage its lifecycle. This same
pattern needs to be applied to the other similar function mentioned in the
comment (also applies to lines 440-532).

---

Nitpick comments:
In `@frontend/src/components/ListingDetailCard.tsx`:
- Around line 320-327: Move the soft-dimension labels (trustworthiness,
transparency, conditionConfidence, documentationQuality, hiddenRiskSuspicion,
marketAboveAverageSignal) from the inline Record definition with ternary
language branching into your translation/i18n system. Replace the hardcoded
labels object with translation key references using the same i18n pattern
already used elsewhere in the ListingDetailCard component, eliminating the
redundant lang === 'en' checks and making it easier to support new languages
without modifying the component code.

In `@scraper/scraper_selenium.py`:
- Around line 219-226: Add a comment above the pickle.load() call in the cookie
loading block to document the security consideration and justification. The
comment should explain that while pickle.load is flagged as unsafe by static
analysis tools (S301), this usage is acceptable because the cookie file is
application-controlled and stored locally, which limits the attack surface and
requires prior file system access for exploitation. This will help future
maintainers understand the deliberate tradeoff being made here.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 87d1f053-bd6e-42e9-9c4f-3b7220aa747d

📥 Commits

Reviewing files that changed from the base of the PR and between e92cd69 and df86c79.

📒 Files selected for processing (17)
  • backend/server.js
  • frontend/src/App.tsx
  • frontend/src/components/CriteriaTuner.tsx
  • frontend/src/components/GuidelinesWizard.tsx
  • frontend/src/components/ListingDetailCard.tsx
  • frontend/src/components/SettingsView.tsx
  • frontend/src/components/ui/Button.tsx
  • frontend/src/components/ui/Card.tsx
  • frontend/src/components/ui/Input.tsx
  • frontend/src/hooks/useTranslation.ts
  • frontend/src/i18n/translations.ts
  • frontend/src/utils/listingTransformer.ts
  • scraper/scraper.py
  • scraper/scraper_selenium.py
  • scripts/demo_direct_get.py
  • scripts/demo_discover_listings.py
  • scripts/demo_full_pipeline.py

Comment thread backend/server.js Outdated
Comment thread frontend/src/App.tsx
Comment thread frontend/src/App.tsx
Comment on lines +1012 to 1016
<option value="All">{t('dashboard.statusAll')}</option>
<option value="High Niceness">{t('dashboard.statusMatches')} (70+)</option>
<option value="Evaluate with AI">{t('dashboard.statusPending')}</option>
<option value="New">{t('dashboard.statusEvaluated')}</option>
</select>

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix status filter label/value mismatch in the dropdown.

The option with value="New" is labeled as AI Evaluated Only, but the predicate for New filters l.status === 'New'. This makes the UI misleading and removes a true evaluated-only filter path.

Minimal alignment fix (label matches current behavior)
                   <option value="All">{t('dashboard.statusAll')}</option>
                   <option value="High Niceness">{t('dashboard.statusMatches')} (70+)</option>
                   <option value="Evaluate with AI">{t('dashboard.statusPending')}</option>
-                  <option value="New">{t('dashboard.statusEvaluated')}</option>
+                  <option value="New">{t('landing.new')}</option>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/src/App.tsx` around lines 1012 - 1016, The select dropdown in the
status filter has a mismatch between the option value and label. The option with
value="New" uses the translation key dashboard.statusEvaluated which suggests an
"evaluated only" filter, but the underlying predicate filters for status ===
'New'. Change the label for the option with value="New" to use a translation key
that accurately describes filtering for new items (e.g., dashboard.statusNew or
similar) instead of dashboard.statusEvaluated to align the UI label with the
actual filtering behavior.

</div>
<span className={`text-[10px] font-bold px-2 py-0.5 rounded-md ${evalItem.status === 'satisfied' ? 'bg-emerald-500/10 text-emerald-400' : evalItem.status === 'violated' ? 'bg-rose-500/10 text-rose-400' : 'bg-slate-800 text-slate-400'}`}>
{evalItem.status === 'satisfied' ? 'SATISFIED' : evalItem.status === 'violated' ? 'VIOLATED' : 'NEUTRAL'}
{evalItem.status === 'satisfied' ? t('listing.satisfied').toUpperCase() : evalItem.status === 'violated' ? t('listing.violated').toUpperCase() : 'NEUTRAL'}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Localize the neutral checklist status.

The satisfied/violated statuses are translated, but the neutral branch is still hardcoded as English text.

🌐 Suggested fix
-{evalItem.status === 'satisfied' ? t('listing.satisfied').toUpperCase() : evalItem.status === 'violated' ? t('listing.violated').toUpperCase() : 'NEUTRAL'}
+{evalItem.status === 'satisfied'
+  ? t('listing.satisfied').toUpperCase()
+  : evalItem.status === 'violated'
+  ? t('listing.violated').toUpperCase()
+  : t('listing.neutral').toUpperCase()}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
{evalItem.status === 'satisfied' ? t('listing.satisfied').toUpperCase() : evalItem.status === 'violated' ? t('listing.violated').toUpperCase() : 'NEUTRAL'}
{evalItem.status === 'satisfied'
? t('listing.satisfied').toUpperCase()
: evalItem.status === 'violated'
? t('listing.violated').toUpperCase()
: t('listing.neutral').toUpperCase()}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/src/components/ListingDetailCard.tsx` at line 395, The neutral
status in the ternary operator evaluating evalItem.status is hardcoded as the
English string 'NEUTRAL' instead of being translated like the satisfied and
violated statuses. Replace the hardcoded 'NEUTRAL' string with
t('listing.neutral').toUpperCase() to apply the same translation and uppercase
formatting pattern used for the other two status cases in the ternary
expression.

setDelayBetweenPages(data.config.delayBetweenPages ?? 0.25);
setDelayBetweenListings(data.config.delayBetweenListings ?? 0.25);
setSuccess(t('settings.saveSuccess'));
setTimeout(() => setSuccess(''), 4000);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clear existing success timers before scheduling a new one.

Line 80 starts a new timeout every save without cancelling prior ones. Rapid saves can clear the newest message too early, and a pending timer may fire after unmount.

Suggested fix
-import { useState, useEffect } from 'react';
+import { useState, useEffect, useRef } from 'react';
@@
 export default function SettingsView({ onBack }: SettingsViewProps) {
+  const successTimerRef = useRef<number | null>(null);
@@
-        setSuccess(t('settings.saveSuccess'));
-        setTimeout(() => setSuccess(''), 4000);
+        setSuccess(t('settings.saveSuccess'));
+        if (successTimerRef.current) {
+          window.clearTimeout(successTimerRef.current);
+        }
+        successTimerRef.current = window.setTimeout(() => setSuccess(''), 4000);
@@
+  useEffect(() => {
+    return () => {
+      if (successTimerRef.current) {
+        window.clearTimeout(successTimerRef.current);
+      }
+    };
+  }, []);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/src/components/SettingsView.tsx` at line 80, The setTimeout call in
SettingsView that clears the success message does not cancel previously
scheduled timers, causing multiple overlapping timeouts to potentially fire and
clear the message prematurely, plus leaving dangling timers on unmount. Store
the timeout ID from each setTimeout in a ref variable, and before scheduling a
new timeout in the setSuccess handler, first call clearTimeout on any existing
timeout ID stored in that ref to ensure only one active timer exists at any
time.

Comment on lines +386 to +393
except Exception as e:
logger.error(f"Error getting detailed description: {str(e)}")
# Try to go back to the search results
try:
driver.get(current_url)
except Exception:
pass
return ""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Potential NameError if driver.current_url fails, and silent exception swallows errors.

If the exception originates at line 356 (before current_url is assigned), line 390 raises NameError. Additionally, the silent except: pass at lines 391-392 hides navigation failures.

🛠️ Suggested fix
+    current_url = None
     try:
         # Store current URL to return to later
         current_url = driver.current_url
         ...
     except Exception as e:
         logger.error(f"Error getting detailed description: {str(e)}")
         # Try to go back to the search results
-        try:
-            driver.get(current_url)
-        except Exception:
-            pass
+        if current_url:
+            try:
+                driver.get(current_url)
+            except Exception as nav_err:
+                logger.warning(f"Failed to navigate back: {nav_err}")
         return ""
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
except Exception as e:
logger.error(f"Error getting detailed description: {str(e)}")
# Try to go back to the search results
try:
driver.get(current_url)
except Exception:
pass
return ""
current_url = None
try:
# Store current URL to return to later
current_url = driver.current_url
...
except Exception as e:
logger.error(f"Error getting detailed description: {str(e)}")
# Try to go back to the search results
if current_url:
try:
driver.get(current_url)
except Exception as nav_err:
logger.warning(f"Failed to navigate back: {nav_err}")
return ""
🧰 Tools
🪛 Ruff (0.15.17)

[warning] 386-386: Do not catch blind exception: Exception

(BLE001)


[warning] 387-387: Use explicit conversion flag

Replace with conversion flag

(RUF010)


[error] 391-392: try-except-pass detected, consider logging the exception

(S110)


[warning] 391-391: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scraper/scraper_selenium.py` around lines 386 - 393, The exception handler in
the except block attempts to use `current_url` which may not be defined if an
exception occurs before the assignment at line 356, causing a NameError.
Additionally, the silent except handler at lines 391-392 hides navigation
failures. Initialize `current_url` before the try block to ensure it's always
defined, and replace the silent `except: pass` clause with a proper logger.error
call to capture and log navigation failures instead of silently swallowing them.

Source: Linters/SAST tools

Comment on lines +721 to +724
with open(cookies_path, "rb") as f:
cookies = pickle.load(f)
for cookie in cookies:
driver.add_cookie(cookie)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Cookie loading lacks per-cookie error handling, unlike load_cookies().

The load_cookies() helper (lines 221-226) wraps each add_cookie() in try/except to skip problematic cookies. Here, a single failing cookie will abort the entire session. Apply the same defensive pattern.

🛠️ Suggested fix
             with open(cookies_path, "rb") as f:
                 cookies = pickle.load(f)
                 for cookie in cookies:
-                    driver.add_cookie(cookie)
+                    try:
+                        driver.add_cookie(cookie)
+                    except Exception as e:
+                        logger.warning(f"Error loading cookie: {e}")
             logger.info("Loaded persistent session cookies for description harvesting")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
with open(cookies_path, "rb") as f:
cookies = pickle.load(f)
for cookie in cookies:
driver.add_cookie(cookie)
with open(cookies_path, "rb") as f:
cookies = pickle.load(f)
for cookie in cookies:
try:
driver.add_cookie(cookie)
except Exception as e:
logger.warning(f"Error loading cookie: {e}")
🧰 Tools
🪛 OpenGrep (1.22.0)

[ERROR] 722-722: pickle.load/loads deserializes arbitrary Python objects and can execute arbitrary code. Use a safe format like JSON instead.

(coderabbit.deserialization.python-pickle)

🪛 Ruff (0.15.17)

[error] 722-722: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue

(S301)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scraper/scraper_selenium.py` around lines 721 - 724, The cookie loading loop
that iterates over pickled cookies and calls driver.add_cookie(cookie) lacks
error handling for individual cookies. Wrap each driver.add_cookie(cookie) call
within a try/except block similar to the pattern used in the load_cookies()
helper function. This ensures that if a single cookie fails to load, the loop
continues processing remaining cookies instead of aborting the entire session.

Comment thread scraper/scraper.py
Comment on lines +445 to +461
if campaign_id is not None:
cursor.execute(
"""
SELECT l.id, l.url, l.title, l.detailed_description FROM listings l
JOIN searches s ON l.search_id = s.id
WHERE s.enabled = 1 AND s.campaign_id = ?
""",
(campaign_id,),
)
else:
cursor.execute(
"""
SELECT l.id, l.url, l.title, l.detailed_description FROM listings l
JOIN searches s ON l.search_id = s.id
WHERE s.enabled = 1
"""
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Persist details/images updates even when description text is unchanged.

Current logic updates the row only when description changes. If listing specs or images change while description remains the same, DB data stays stale even though fresh values were parsed.

Suggested fix
-                SELECT l.id, l.url, l.title, l.detailed_description FROM listings l
+                SELECT l.id, l.url, l.title, l.detailed_description, l.details, l.images FROM listings l
@@
-                SELECT l.id, l.url, l.title, l.detailed_description FROM listings l
+                SELECT l.id, l.url, l.title, l.detailed_description, l.details, l.images FROM listings l
@@
             old_description = r["detailed_description"] or ""
+            old_details = r["details"] or "{}"
+            old_images = r["images"] or "[]"
@@
             detailed_description = parsed["detailed_description"] or ""
+            details_json = json.dumps(parsed["details"], ensure_ascii=False, sort_keys=True)
+            images_json = json.dumps(parsed["images"], ensure_ascii=False)
 
-            if detailed_description.strip() != old_description.strip():
+            if (
+                detailed_description.strip() != old_description.strip()
+                or details_json != old_details
+                or images_json != old_images
+            ):
@@
-                        json.dumps(parsed["details"]),
-                        json.dumps(parsed["images"]),
+                        details_json,
+                        images_json,

Also applies to: 478-522

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scraper/scraper.py` around lines 445 - 461, The current update logic for
listings only triggers a database update when the description text changes,
causing specs and images changes to be ignored if the description remains the
same. Modify the conditional logic that determines whether to update a listing
row (in the section starting around line 478) to check not only if the
description has changed, but also if there are any changes to the specs or
images. This ensures that all changes to listing details are persisted to the
database, even when the description text itself remains unchanged.

Comment on lines +123 to +125
price_elem = soup.select_one(
'h2.ad-item-main--middle--price-shipping--price, [itemprop="price"]'
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix fallback price selector typo.

The fallback selector currently uses ad-item..., which does not match the class used elsewhere (aditem...). This can miss price extraction when #viewad-price is absent.

Suggested fix
-        price_elem = soup.select_one(
-            'h2.ad-item-main--middle--price-shipping--price, [itemprop="price"]'
-        )
+        price_elem = soup.select_one(
+            'p.aditem-main--middle--price-shipping--price, [itemprop="price"]'
+        )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/demo_direct_get.py` around lines 123 - 125, The fallback CSS class
selector in the soup.select_one() call for price_elem contains a typo in the
class name format. The selector uses ad-item-main--middle--price-shipping--price
but should use aditem-main--middle--price-shipping--price (without the hyphen
between ad and item) to match the correct class naming convention used elsewhere
in the code. Update the first selector string in the soup.select_one() call to
correct this class name so that price extraction works properly when the primary
id selector is not available.

Comment on lines +88 to +90
price = item.select_one(
"p.aditem-main--middle--price-shipping--price"
).get_text(strip=True)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid skipping valid listings when the price node is missing.

This direct chained call can raise and drop the whole listing record if the price element is absent. Make extraction null-safe.

Suggested fix
-            price = item.select_one(
-                "p.aditem-main--middle--price-shipping--price"
-            ).get_text(strip=True)
+            price_elem = item.select_one(
+                "p.aditem-main--middle--price-shipping--price"
+            )
+            price = price_elem.get_text(strip=True) if price_elem else "N/A"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
price = item.select_one(
"p.aditem-main--middle--price-shipping--price"
).get_text(strip=True)
price_elem = item.select_one(
"p.aditem-main--middle--price-shipping--price"
)
price = price_elem.get_text(strip=True) if price_elem else "N/A"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/demo_full_pipeline.py` around lines 88 - 90, The price extraction in
the code directly chains select_one() and get_text() methods, which will raise
an exception and skip the entire listing if the price element is missing. Add a
null-safety check after the select_one() call to verify the element exists
before calling get_text() on it, allowing listings with missing prices to be
handled gracefully rather than being dropped.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
scraper/agent_worker.py (1)

301-302: ⚡ Quick win

Use one captured UTC timestamp per row update for both datetime columns.

llm_processed_time and last_ai_evaluated_at are written as a pair, but they currently use two separate now() calls, which can drift by microseconds. Capture once and reuse so both fields stay exactly aligned.

Suggested patch
@@
-            cursor.execute(
+            processed_at = datetime.datetime.now(datetime.timezone.utc).isoformat()
+            cursor.execute(
@@
-                (
-                    datetime.datetime.now(datetime.timezone.utc).isoformat(),
-                    datetime.datetime.now(datetime.timezone.utc).isoformat(),
+                (
+                    processed_at,
+                    processed_at,
                     json.dumps(facts_envelope),
                     listing_id,
                 ),
             )
@@
-            cursor.execute(
+            processed_at = datetime.datetime.now(datetime.timezone.utc).isoformat()
+            cursor.execute(
@@
-                (
-                    datetime.datetime.now(datetime.timezone.utc).isoformat(),
-                    datetime.datetime.now(datetime.timezone.utc).isoformat(),
+                (
+                    processed_at,
+                    processed_at,
                     full_info,
                     json.dumps(facts_envelope),
                     score,

Also applies to: 571-572

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scraper/agent_worker.py` around lines 301 - 302, The code at lines 301-302
makes two separate calls to
datetime.datetime.now(datetime.timezone.utc).isoformat() for the
llm_processed_time and last_ai_evaluated_at columns, which can cause microsecond
drift between the two timestamps. Capture the UTC timestamp once into a variable
before the update statement and reuse that same variable for both datetime
columns instead of calling .now() twice. Apply the same fix at lines 571-572
where this same pattern occurs.
scraper/scraper.py (1)

270-274: ⚡ Quick win

Excessive file I/O: writing to disk on every single listing.

This writes the entire JSON file after each listing is scraped, causing O(n²) total bytes written and high I/O overhead. For crash recovery, batching once per page would provide similar resilience with much better performance.

♻️ Suggested refactor: batch write per page

Move the file write outside the item loop to write once per page:

                     existing_ids.add(listing_id)
                     all_scraped_listings.append(listing)
                     scraped_count += 1
 
-                    if output_file:
-                        with open(output_file, "w", encoding="utf-8") as f:
-                            json.dump(
-                                existing_listings, f, ensure_ascii=False, indent=4
-                            )
-
                 except Exception as e:
                     logger.error(f"Error scraping single listing: {str(e)}")
                     continue
 
             logger.info(
                 f"Scraped {scraped_count} discovered listings from {current_url}"
             )
+
+            # Save after each page for crash recovery
+            if output_file and scraped_count > 0:
+                with open(output_file, "w", encoding="utf-8") as f:
+                    json.dump(existing_listings, f, ensure_ascii=False, indent=4)
 
         except Exception as e:
             logger.error(f"Error scraping page {current_url}: {str(e)}")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scraper/scraper.py` around lines 270 - 274, The file write operation using
json.dump on existing_listings is currently inside the item loop, causing the
entire JSON file to be written after each listing is processed, resulting in
excessive I/O overhead. Move the entire block that opens output_file and calls
json.dump outside the innermost loop that processes individual listings so that
the file is written once per page rather than once per listing. This will
maintain crash recovery capabilities while significantly improving performance
by reducing write operations.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@scraper/agent_worker.py`:
- Around line 301-302: The code at lines 301-302 makes two separate calls to
datetime.datetime.now(datetime.timezone.utc).isoformat() for the
llm_processed_time and last_ai_evaluated_at columns, which can cause microsecond
drift between the two timestamps. Capture the UTC timestamp once into a variable
before the update statement and reuse that same variable for both datetime
columns instead of calling .now() twice. Apply the same fix at lines 571-572
where this same pattern occurs.

In `@scraper/scraper.py`:
- Around line 270-274: The file write operation using json.dump on
existing_listings is currently inside the item loop, causing the entire JSON
file to be written after each listing is processed, resulting in excessive I/O
overhead. Move the entire block that opens output_file and calls json.dump
outside the innermost loop that processes individual listings so that the file
is written once per page rather than once per listing. This will maintain crash
recovery capabilities while significantly improving performance by reducing
write operations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 7700d350-353b-49b2-8783-78b0d5bff72e

📥 Commits

Reviewing files that changed from the base of the PR and between df86c79 and f12871c.

📒 Files selected for processing (8)
  • backend/server.js
  • frontend/src/App.tsx
  • frontend/src/components/ListingDetailCard.tsx
  • frontend/src/hooks/useTranslation.ts
  • frontend/src/i18n/translations.ts
  • scraper/agent_worker.py
  • scraper/scraper.py
  • scraper/scraper_selenium.py
✅ Files skipped from review due to trivial changes (1)
  • frontend/src/i18n/translations.ts
🚧 Files skipped from review as they are similar to previous changes (5)
  • frontend/src/hooks/useTranslation.ts
  • backend/server.js
  • scraper/scraper_selenium.py
  • frontend/src/components/ListingDetailCard.tsx
  • frontend/src/App.tsx

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
backend/server.js (1)

51-53: 💤 Low value

Consider generating a random initial password.

The hardcoded default password 'password' is trivially guessable. For a dev/demo setup, consider generating a random password on first run and printing it once, or at minimum document prominently that this must be changed before any non-local use.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/server.js` around lines 51 - 53, The hardcoded defaultPassword
variable is set to the weak default value 'password', which poses a security
risk. Replace this with a solution that either generates a random initial
password on first run and logs it to the console for the administrator to
capture, or add prominent documentation (in comments and README) stating that
this hardcoded password must be changed immediately before any non-local
deployment. If implementing random generation, ensure the password is only
generated once during initial setup and persisted securely.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/package.json`:
- Around line 13-15: The jsonwebtoken package at version 9.0.3 has transitive
dependencies on lodash with known security vulnerabilities. Update jsonwebtoken
to a newer version that resolves these vulnerabilities (consider upgrading to a
more recent stable version), and if necessary, explicitly update lodash to a
patched version in the dependencies. After making these changes, run npm audit
to verify that the vulnerabilities have been resolved and no new issues are
introduced. Additionally, consider setting up automated transitive dependency
scanning in your CI/CD pipeline to catch such vulnerabilities early in future
development.

In `@backend/server.js`:
- Line 64: The console.log statement that logs the seeded default admin user is
including the plaintext defaultPassword value in the output, which poses a
security risk as logs may be persisted or sent to external systems. Remove the
password variable from the log message. Keep the log statement that confirms the
default admin user was seeded but only include the defaultEmail and a success
message without exposing the actual defaultPassword value.
- Line 21: Remove the hardcoded fallback value from the JWT_SECRET constant
definition and implement a fail-fast approach that throws an error if the
environment variable is not set in production environments, or generate a random
ephemeral secret with a warning in development. Create a jwtSigningKey variable
that will hold the resolved secret value, and replace all uses of JWT_SECRET
with jwtSigningKey throughout the codebase for signing and verifying
authentication tokens.

---

Nitpick comments:
In `@backend/server.js`:
- Around line 51-53: The hardcoded defaultPassword variable is set to the weak
default value 'password', which poses a security risk. Replace this with a
solution that either generates a random initial password on first run and logs
it to the console for the administrator to capture, or add prominent
documentation (in comments and README) stating that this hardcoded password must
be changed immediately before any non-local deployment. If implementing random
generation, ensure the password is only generated once during initial setup and
persisted securely.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 51974059-1c2b-4477-8358-d8109b0c1f85

📥 Commits

Reviewing files that changed from the base of the PR and between f12871c and 5747716.

⛔ Files ignored due to path filters (1)
  • backend/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (4)
  • backend/package.json
  • backend/server.js
  • frontend/src/App.tsx
  • frontend/src/i18n/translations.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • frontend/src/i18n/translations.ts

Comment thread backend/package.json
Comment on lines +13 to +15
"bcrypt": "^6.0.0",
"express": "^4.21.2",
"jsonwebtoken": "^9.0.3",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check for security advisories on bcrypt and jsonwebtoken

echo "=== Checking bcrypt advisories ==="
gh api graphql -f query='
{
  securityVulnerabilities(first: 5, ecosystem: NPM, package: "bcrypt") {
    nodes {
      advisory { summary severity publishedAt }
      vulnerableVersionRange
      firstPatchedVersion { identifier }
    }
  }
}'

echo ""
echo "=== Checking jsonwebtoken advisories ==="
gh api graphql -f query='
{
  securityVulnerabilities(first: 5, ecosystem: NPM, package: "jsonwebtoken") {
    nodes {
      advisory { summary severity publishedAt }
      vulnerableVersionRange
      firstPatchedVersion { identifier }
    }
  }
}'

echo ""
echo "=== Latest versions on npm ==="
curl -s https://registry.npmjs.org/bcrypt/latest | jq '.version'
curl -s https://registry.npmjs.org/jsonwebtoken/latest | jq '.version'

Repository: Infraviored/DealMapper

Length of output: 1996


🌐 Web query:

bcrypt npm 6.0.0 security vulnerabilities advisories

💡 Result:

As of June 19, 2026, there are no known security vulnerabilities associated with bcrypt version 6.0.0 [1][2]. The bcrypt package previously contained a significant security issue known as the "wrap-around" bug (CVE-2020-7689), which caused passwords 255 characters or longer to be truncated, and an additional issue involving improper handling of NUL characters in passwords [3][4][5]. Both of these vulnerabilities were addressed and patched in version 5.0.0 [3][6][4]. Version 6.0.0, released on May 11, 2025, is the latest stable version and remains free of these historical vulnerabilities [1][4]. Users are encouraged to keep their dependencies up to date and use tools like npm audit to monitor for any newly discovered issues [7].

Citations:


🌐 Web query:

jsonwebtoken npm 9.0.3 security vulnerabilities CVE advisories

💡 Result:

As of June 19, 2026, the npm package jsonwebtoken version 9.0.3 does not have direct, uniquely assigned CVEs for vulnerabilities within its own codebase [1][2]. While major security advisories (such as CVE-2022-23529, CVE-2022-23540, CVE-2022-23541, and CVE-2022-23539) affected versions prior to 9.0.0, these have been patched [3][4][5]. However, users should be aware of the following security context for version 9.0.3: 1. Transitive Vulnerabilities: As of April 2026, version 9.0.3 has been reported to be flagged by security scanners due to vulnerabilities in its transitive lodash dependencies (e.g., CVE-2026-4800, CVE-2019-10744, CVE-2021-23337, and CVE-2020-8203) [6]. Development teams are tracking these issues in the project repository (e.g., issue #1023), but these are dependency-level concerns rather than flaws in the jsonwebtoken logic itself [6]. 2. Dependency Updates: Version 9.0.3 (released December 4, 2025) updated the jws dependency to version 4.0.1 [3][7]. 3. Recommended Practice: Because jsonwebtoken is a widely used library, always ensure that your application-level usage follows secure coding practices, such as validating algorithms and sanitizing user input, as recommended in the official documentation [8][9]. For the most accurate and up-to-date status, consult the project's official GitHub repository issues and your own security scanning tools to assess the impact of these transitive dependencies on your specific environment [6].

Citations:


🌐 Web query:

bcrypt npm latest version 2026

💡 Result:

As of June 19, 2026, the latest version of the bcrypt package on npm is 6.0.0 [1][2]. It was published on May 11, 2025 [1][3].

Citations:


🌐 Web query:

jsonwebtoken npm latest version 2026

💡 Result:

As of June 19, 2026, the latest version of the jsonwebtoken npm package is 9.0.3 [1][2][3]. This version was published on December 4, 2025 [1][3][4].

Citations:


Address transitive vulnerabilities in jsonwebtoken dependencies.

bcrypt 6.0.0 is current and free from known vulnerabilities. However, jsonwebtoken 9.0.3 depends on lodash with known vulnerabilities (CVE-2026-4800, CVE-2019-10744, CVE-2021-23337, CVE-2020-8203). Run npm audit to assess impact on your application and consider updating lodash or implementing transitive dependency auditing in your CI/CD pipeline.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/package.json` around lines 13 - 15, The jsonwebtoken package at
version 9.0.3 has transitive dependencies on lodash with known security
vulnerabilities. Update jsonwebtoken to a newer version that resolves these
vulnerabilities (consider upgrading to a more recent stable version), and if
necessary, explicitly update lodash to a patched version in the dependencies.
After making these changes, run npm audit to verify that the vulnerabilities
have been resolved and no new issues are introduced. Additionally, consider
setting up automated transitive dependency scanning in your CI/CD pipeline to
catch such vulnerabilities early in future development.

Comment thread backend/server.js
Comment thread backend/server.js Outdated
@Infraviored

Copy link
Copy Markdown
Owner Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces user authentication using JWT and bcrypt, adds a custom internationalization (i18n) framework supporting English and German, refactors the frontend to use reusable UI components (Button, Card, Input), and optimizes the scraper to use fast, direct HTTP requests instead of Selenium by default. The review feedback highlights critical security vulnerabilities regarding hardcoded production secrets and default credentials, a bug in cookie parsing, and a pagination regex issue. Additionally, valuable performance improvements are suggested to batch disk writes and reuse HTTP connections via requests.Session.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread backend/server.js
Comment on lines +130 to +138
if (req.headers.cookie) {
const cookies = Object.fromEntries(
req.headers.cookie.split('; ').map(c => {
const parts = c.split('=');
return [parts[0], parts.slice(1).join('=')];
})
);
token = cookies.token;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current cookie parsing logic splits strictly by '; ' (semicolon followed by a space). However, according to RFC 6265, the space after the semicolon is optional. If a client sends cookies separated only by a semicolon (e.g., token=xyz;other=abc), this parsing logic will fail to split them correctly, resulting in a malformed token and authentication failure.

Using a regular expression like /;\s*/ to split the cookies and trimming the keys ensures robust parsing.

Suggested change
if (req.headers.cookie) {
const cookies = Object.fromEntries(
req.headers.cookie.split('; ').map(c => {
const parts = c.split('=');
return [parts[0], parts.slice(1).join('=')];
})
);
token = cookies.token;
}
if (req.headers.cookie) {
const cookies = Object.fromEntries(
req.headers.cookie.split(/;\s*/).map(c => {
const parts = c.split('=');
return [parts[0].trim(), parts.slice(1).join('=')];
})
);
token = cookies.token;
}

Comment thread backend/server.js
const sqlite3 = require('sqlite3').verbose();
const jwt = require('jsonwebtoken');
const bcrypt = require('bcrypt');
const JWT_SECRET = process.env.JWT_SECRET || 'dealmapper_dev_secret_key_12345';

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using a hardcoded fallback for JWT_SECRET in production is a security risk. If the environment variable is missing, the application will run with a known default secret, making it vulnerable to token forgery.

It is highly recommended to enforce that JWT_SECRET is set when running in production, and throw an error or exit if it is missing.

Suggested change
const JWT_SECRET = process.env.JWT_SECRET || 'dealmapper_dev_secret_key_12345';
if (process.env.NODE_ENV === 'production' && !process.env.JWT_SECRET) {
console.error("FATAL: JWT_SECRET environment variable is required in production!");
process.exit(1);
}
const JWT_SECRET = process.env.JWT_SECRET || 'dealmapper_dev_secret_key_12345';

Comment thread backend/server.js Outdated
Comment on lines +51 to +53
const defaultEmail = 'admin@dealmapper.local';
const defaultPassword = 'password';
const saltRounds = 10;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Seeding a default admin user with a hardcoded password ('password') is a security risk. It is highly recommended to allow configuring the default admin credentials via environment variables, with fallback values only for development environments.

Suggested change
const defaultEmail = 'admin@dealmapper.local';
const defaultPassword = 'password';
const saltRounds = 10;
const defaultEmail = process.env.DEFAULT_ADMIN_EMAIL || 'admin@dealmapper.local';
const defaultPassword = process.env.DEFAULT_ADMIN_PASSWORD || 'password';
const saltRounds = 10;

Comment thread backend/server.js Outdated
Comment on lines +277 to +282
app.use('/api', (req, res, next) => {
if (req.path === '/auth/login' || req.path === '/auth/logout' || req.path === '/auth/me') {
return next();
}
authenticateToken(req, res, next);
});

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The path checks inside the global API auth protection middleware are redundant because the public routes /api/auth/login and /api/auth/logout are defined before this middleware in the Express routing chain. In Express, requests matching those routes are fully handled and terminated before reaching this middleware.

You can simplify this by removing the redundant path checks and applying the authenticateToken middleware directly.

// Global API Auth Protection (applied to all subsequent /api/* routes)
app.use('/api', authenticateToken);

Comment thread scraper/scraper.py Outdated
Comment on lines +62 to +65
def parse_listing_details_requests(url):
"""Retrieve and parse detailed specifications, description, and images using direct requests"""
try:
response = requests.get(url, headers=HEADERS, timeout=10)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The parse_listing_details_requests function performs a direct requests.get call for every listing. When harvesting descriptions for multiple listings sequentially, this establishes a new TCP connection and SSL handshake for every single request, which is highly inefficient.

By utilizing a requests.Session object, you can enable HTTP Keep-Alive connection reuse, significantly speeding up the harvesting process and reducing overhead on both the client and server.

def parse_listing_details_requests(url, session=None):
    """Retrieve and parse detailed specifications, description, and images using direct requests"""
    try:
        caller = session if session else requests
        response = caller.get(url, headers=HEADERS, timeout=10)

Comment thread scraper/scraper.py Outdated
Comment on lines +177 to +178
if "/seite:" in base_url:
current_url = re.sub(r"/seite:\d+/", f"/seite:{page}/", base_url)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The regular expression r"/seite:\d+/" expects a trailing slash. If the base_url ends with /seite:2 without a trailing slash, the regex will fail to match, and the URL will not be paginated correctly.

Using a more robust regex pattern like r"/seite:\d+(/|$)" handles both cases correctly.

Suggested change
if "/seite:" in base_url:
current_url = re.sub(r"/seite:\d+/", f"/seite:{page}/", base_url)
if "/seite:" in base_url:
current_url = re.sub(r"/seite:\d+(/|$)", f"/seite:{page}\\1", base_url)

Comment thread scraper/scraper.py
Comment on lines +269 to +281

if output_file:
with open(output_file, "w", encoding="utf-8") as f:
json.dump(
existing_listings, f, ensure_ascii=False, indent=4
)

except Exception as e:
logger.error(f"Error scraping single listing: {str(e)}")
continue

logger.info(
f"Scraped {scraped_count} discovered listings from {current_url}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Writing the entire existing_listings list to the JSON file on every single iteration of the loop is extremely inefficient and causes unnecessary disk I/O overhead.

Instead, you should batch the write operation by moving the json.dump call outside of the inner loop, executing it once per page after all listings on that page have been parsed.

                    except Exception as e:
                        logger.error(f"Error scraping single listing: {str(e)}")
                        continue

                if output_file:
                    with open(output_file, "w", encoding="utf-8") as f:
                        json.dump(
                            existing_listings, f, ensure_ascii=False, indent=4
                        )

                logger.info(
                    f"Scraped {scraped_count} discovered listings from {current_url}"
                )

@Infraviored

Copy link
Copy Markdown
Owner Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces user authentication, internationalization (i18n) support, and a fast requests-based scraping pipeline. On the backend, JWT-based authentication and SQLite user seeding have been implemented, while the frontend has been updated with a login portal, reusable UI components, and English/German translations. Additionally, the scraper has been optimized to bypass browser automation using direct HTTP requests, with legacy Selenium paths preserved. The code review feedback highlights critical security vulnerabilities, including a default admin password fallback in production, plain-text logging of seeded credentials, and susceptibility to timing attacks on the login endpoint. A robustness issue was also identified in the JSON-LD parser when handling array-wrapped metadata structures.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread backend/server.js Outdated
Comment on lines +55 to +56
const defaultEmail = process.env.DEFAULT_ADMIN_EMAIL || 'admin@dealmapper.local';
const defaultPassword = process.env.DEFAULT_ADMIN_PASSWORD || 'password';

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

In production, defaulting the admin password to 'password' if DEFAULT_ADMIN_PASSWORD is not set poses a severe security risk. It is highly recommended to require DEFAULT_ADMIN_PASSWORD to be explicitly set in production environments, similar to how JWT_SECRET is handled, and exit the process if it is missing.

        const defaultEmail = process.env.DEFAULT_ADMIN_EMAIL || 'admin@dealmapper.local';
        let defaultPassword = process.env.DEFAULT_ADMIN_PASSWORD;
        if (process.env.NODE_ENV === 'production' && !defaultPassword) {
          console.error("FATAL: DEFAULT_ADMIN_PASSWORD environment variable is required in production to seed the default admin user!");
          process.exit(1);
        }
        if (!defaultPassword) {
          defaultPassword = 'password';
        }

Comment thread backend/server.js Outdated
[defaultEmail, hash, 'admin'],
(err) => {
if (err) console.error("Failed to seed default admin user:", err);
else console.log(`Seeded default admin user: ${defaultEmail} (password: ${defaultPassword})`);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

Logging the seeded admin password in plain text to the console/stdout is a security vulnerability (CWE-532). In production, these logs are often persisted or monitored by third-party systems, exposing sensitive credentials. Remove the password from the log message.

Suggested change
else console.log(`Seeded default admin user: ${defaultEmail} (password: ${defaultPassword})`);
else console.log(`Seeded default admin user: ${defaultEmail}`);

Comment thread backend/server.js
Comment on lines +237 to +245
const user = await get("SELECT * FROM users WHERE email = ?", [email.toLowerCase().trim()]);
if (!user) {
return res.status(401).json({ error: 'Invalid email or password.' });
}

const isMatch = await bcrypt.compare(password, user.password_hash);
if (!isMatch) {
return res.status(401).json({ error: 'Invalid email or password.' });
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

The login route is vulnerable to user enumeration via a timing attack. If the user is not found in the database, the server returns a 401 response immediately. If the user is found, it performs a computationally expensive bcrypt.compare operation. This timing difference allows an attacker to easily determine which email addresses are registered. To mitigate this, always perform a dummy bcrypt comparison when the user is not found.

Suggested change
const user = await get("SELECT * FROM users WHERE email = ?", [email.toLowerCase().trim()]);
if (!user) {
return res.status(401).json({ error: 'Invalid email or password.' });
}
const isMatch = await bcrypt.compare(password, user.password_hash);
if (!isMatch) {
return res.status(401).json({ error: 'Invalid email or password.' });
}
const user = await get("SELECT * FROM users WHERE email = ?", [email.toLowerCase().trim()]);
// Always run bcrypt.compare to prevent timing attacks (user enumeration)
const passwordHash = user ? user.password_hash : "$2b$10$invalidhashplaceholderdummyvalue";
const isMatch = await bcrypt.compare(password, passwordHash);
if (!user || !isMatch) {
return res.status(401).json({ error: 'Invalid email or password.' });
}

Comment thread scraper/scraper.py Outdated
Comment on lines +82 to +86
data = json.loads(script.string)
if isinstance(data, dict) and data.get("@type") == "ImageObject":
if data.get("representativeOfPage") is True:
schema_description = data.get("description")
schema_main_image = data.get("contentUrl")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The JSON-LD parser assumes that the parsed data is always a single dictionary. However, Schema.org metadata can often be wrapped in a JSON array (list) or contain a @graph array of objects. If data is a list, calling data.get will raise an AttributeError and skip the block. Enhance the parser to handle both single objects, lists, and @graph structures to make metadata extraction more robust.

Suggested change
data = json.loads(script.string)
if isinstance(data, dict) and data.get("@type") == "ImageObject":
if data.get("representativeOfPage") is True:
schema_description = data.get("description")
schema_main_image = data.get("contentUrl")
data = json.loads(script.string)
# Normalize to a list of objects to handle both single objects and arrays
items = []
if isinstance(data, list):
items = data
elif isinstance(data, dict):
if "@graph" in data and isinstance(data["@graph"], list):
items = data["@graph"]
else:
items = [data]
for item in items:
if isinstance(item, dict) and item.get("@type") == "ImageObject":
if item.get("representativeOfPage") is True:
schema_description = item.get("description")
schema_main_image = item.get("contentUrl")

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scraper/scraper.py (1)

384-423: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: Database UPDATE is outside the for loop - only the last listing gets updated.

The cursor.execute block starting at line 405 is indented at the same level as the for statement, placing it outside the loop. This means:

  1. The loop fetches and parses details for every listing
  2. Only the final iteration's listing_id and parsed values are used in the UPDATE
  3. All other listings' harvested data is silently lost
Proposed fix to move UPDATE inside the loop
                 parsed = parse_listing_details_requests(url, session=session)
                 if parsed is None:
                     continue
 
-            cursor.execute(
-                """
-                UPDATE listings 
-                SET detailed_description = ?, details = ?, images = ?, full_info_obtained = 1,
-                    last_description_changed_at = ?
-                WHERE id = ?
-            """,
-                (
-                    parsed["detailed_description"],
-                    json.dumps(parsed["details"]),
-                    json.dumps(parsed["images"]),
-                    datetime.datetime.now(datetime.timezone.utc).isoformat(),
-                    listing_id,
-                ),
-            )
-            conn.commit()
-            logger.info(
-                f"Successfully harvested detailed description, {len(parsed['details'])} details, and {len(parsed['images'])} images for ID {listing_id}"
-            )
+                cursor.execute(
+                    """
+                    UPDATE listings 
+                    SET detailed_description = ?, details = ?, images = ?, full_info_obtained = 1,
+                        last_description_changed_at = ?
+                    WHERE id = ?
+                """,
+                    (
+                        parsed["detailed_description"],
+                        json.dumps(parsed["details"]),
+                        json.dumps(parsed["images"]),
+                        datetime.datetime.now(datetime.timezone.utc).isoformat(),
+                        listing_id,
+                    ),
+                )
+                conn.commit()
+                logger.info(
+                    f"Successfully harvested detailed description, {len(parsed['details'])} details, and {len(parsed['images'])} images for ID {listing_id}"
+                )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scraper/scraper.py` around lines 384 - 423, The cursor.execute block that
updates the database is currently indented at the same level as the for loop,
placing it outside the loop. This causes only the last listing to be updated
while all other listings' harvested data is discarded. Move the entire
cursor.execute UPDATE statement (which updates the listings table with
detailed_description, details, images, and full_info_obtained) inside the for
loop by increasing its indentation level so it executes after each listing is
parsed. Also ensure the conn.commit() and logger.info calls that follow the
UPDATE are moved inside the loop as well.
♻️ Duplicate comments (1)
scraper/scraper.py (1)

485-530: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: Update logic is outside the for loop - only the last listing is processed.

Lines 507-530 are indented at the with block level, not inside the for loop. This results in:

  1. The loop fetches details for all listings but discards results except the last
  2. Only the final listing's description is compared and potentially updated
  3. All other listings are never checked for changes
Proposed fix to move processing inside the loop
                 parsed = parse_listing_details_requests(url, session=session)
                 if parsed is None:
                     continue
 
-            detailed_description = parsed["detailed_description"] or ""
-
-            if detailed_description.strip() != old_description.strip():
-                logger.info(
-                    f"Description changed for listing {listing_id}! Updating in DB."
-                )
-                cursor.execute(
-                    """
-                    UPDATE listings 
-                    SET detailed_description = ?, details = ?, images = ?, full_info_obtained = 1,
-                        last_description_changed_at = ?
-                    WHERE id = ?
-                """,
-                    (
-                        detailed_description,
-                        json.dumps(parsed["details"]),
-                        json.dumps(parsed["images"]),
-                        datetime.datetime.now(datetime.timezone.utc).isoformat(),
-                        listing_id,
-                    ),
-                )
-                conn.commit()
-            else:
-                logger.info(f"No description changes for listing {listing_id}.")
+                detailed_description = parsed["detailed_description"] or ""
+
+                if detailed_description.strip() != old_description.strip():
+                    logger.info(
+                        f"Description changed for listing {listing_id}! Updating in DB."
+                    )
+                    cursor.execute(
+                        """
+                        UPDATE listings 
+                        SET detailed_description = ?, details = ?, images = ?, full_info_obtained = 1,
+                            last_description_changed_at = ?
+                        WHERE id = ?
+                    """,
+                        (
+                            detailed_description,
+                            json.dumps(parsed["details"]),
+                            json.dumps(parsed["images"]),
+                            datetime.datetime.now(datetime.timezone.utc).isoformat(),
+                            listing_id,
+                        ),
+                    )
+                    conn.commit()
+                else:
+                    logger.info(f"No description changes for listing {listing_id}.")

Note: Once this indentation is fixed, the prior review feedback about persisting details/images changes even when description is unchanged will still apply.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scraper/scraper.py` around lines 485 - 530, The code block starting with the
detailed_description assignment and continuing through the database update and
logging logic is indented at the wrong level (at the with statement level
instead of inside the for loop). This means the loop iterates through all rows
and calls parse_listing_details_requests for each one, but the comparison and
update logic only executes once after the loop completes, using only the last
parsed result. Move all the code from the detailed_description assignment
through the final else block to be indented inside the for loop, at the same
indentation level as the parse_listing_details_requests call and the if parsed
is None check, so that each listing's details are compared and updated before
moving to the next listing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@scraper/scraper.py`:
- Around line 384-423: The cursor.execute block that updates the database is
currently indented at the same level as the for loop, placing it outside the
loop. This causes only the last listing to be updated while all other listings'
harvested data is discarded. Move the entire cursor.execute UPDATE statement
(which updates the listings table with detailed_description, details, images,
and full_info_obtained) inside the for loop by increasing its indentation level
so it executes after each listing is parsed. Also ensure the conn.commit() and
logger.info calls that follow the UPDATE are moved inside the loop as well.

---

Duplicate comments:
In `@scraper/scraper.py`:
- Around line 485-530: The code block starting with the detailed_description
assignment and continuing through the database update and logging logic is
indented at the wrong level (at the with statement level instead of inside the
for loop). This means the loop iterates through all rows and calls
parse_listing_details_requests for each one, but the comparison and update logic
only executes once after the loop completes, using only the last parsed result.
Move all the code from the detailed_description assignment through the final
else block to be indented inside the for loop, at the same indentation level as
the parse_listing_details_requests call and the if parsed is None check, so that
each listing's details are compared and updated before moving to the next
listing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: edece5e5-363f-4e1a-bfd2-4a9a3c8a07ff

📥 Commits

Reviewing files that changed from the base of the PR and between 5747716 and 25adb7c.

📒 Files selected for processing (2)
  • backend/server.js
  • scraper/scraper.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/server.js

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant