Skip to content

Fix/h1 transactions 429#693

Open
sfc-gh-slonkar wants to merge 1 commit into
v1.10.0from
fix/h1-transactions-429
Open

Fix/h1 transactions 429#693
sfc-gh-slonkar wants to merge 1 commit into
v1.10.0from
fix/h1-transactions-429

Conversation

@sfc-gh-slonkar
Copy link
Copy Markdown

@sfc-gh-slonkar sfc-gh-slonkar commented Apr 28, 2026

fix(h1_collect): fix 429 rate limiting and duplicate row ingestion

Problems Fixed

  1. paginated_insert_transactions() looped over every month from 2012 to
    current year on every run (~165 API calls/run), causing HackerOne to
    return 429 Too Many Requests and halting ingestion entirely.
    Fixed by fetching only current + previous month (2 API calls/run).
    Previous month is included to catch late-posted transactions.

  2. insert_reports() and insert_transactions() used plain db.insert() with
    no deduplication, causing the same rows to be re-inserted on every run.
    With the connector running every ~15 mins, this produced ~96 duplicate
    copies per day (134k rows for only 1,405 unique reports).
    Fixed by replacing INSERT with MERGE:

    • Reports: MERGE ON id
    • Transactions: MERGE ON (id, activity_date, activity_description)

Testing

Tested insert_reports() MERGE against SNOWALERT.SLONKAR.H1_REPORTS_TEST:

  • Seeded table with yesterday's data (134,444 rows, 1,405 unique IDs)
  • MERGE run 1 with today's data: 1,405 rows updated, 1 new report inserted
  • MERGE run 2 with today's data: 0 rows changed (134,445 / 1,406 stable)
  • Confirmed MERGE prevents duplicates while correctly inserting new reports

@sfc-gh-slonkar sfc-gh-slonkar reopened this May 6, 2026
Copy link
Copy Markdown
Collaborator

@sfc-gh-afedorov sfc-gh-afedorov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for jumping on this. The HackerOne changelog confirms the billing transactions endpoint now has a 10 requests/minute limit, so reducing the request volume makes sense.

I don’t think we should merge this version as-is, though. Changing paginated_insert_transactions() to only fetch current + previous month fixes the immediate 429, but it also removes historical/backfill behavior: new installs, rebuilt landing tables, or any connector outage longer than a month will permanently miss older billing transactions. That is a significant semantic change from the current “scan all months since 2012” behavior.

Could we instead make the billing transaction loop rate-limit aware, e.g. throttle to <=10 requests/minute and/or honor Retry-After on 429?

I we can do it via this helper for rate limiting —

class RPM:
    def __init__(self, rate: int, per_seconds: int = 60) -> None:
        self.rate: int = rate
        self.per_seconds: int = per_seconds
        self.calls: Deque[float] = deque()
        self.lock: RLock = RLock()

    def __enter__(self) -> None:
        self._acquire()

    def __exit__(self, exc_type: type | None, exc_val: Exception | None, exc_tb: Any | None) -> None:
        pass

    def _acquire(self) -> None:
        with self.lock:
            now: float = time.time()

            while self.calls and self.calls[0] <= now - self.per_seconds:
                self.calls.popleft()

            if len(self.calls) >= self.rate:
                # Small buffer
                sleep_time: float = self.calls[0] + \
                    self.per_seconds - now + 0.01
                time.sleep(sleep_time)
                return self._acquire()

            self.calls.append(now)

and this one for retries —

def retry(
    max_attempts: int = 2,
    delay: int | float = 1,
    backoff: int | float = 2,
    exceptions: Tuple[Type[Exception], ...] = (Exception,),
    capture_to_sentry: bool = True,
) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Retry decorator with optional Sentry error capture.

    Args:
        max_attempts: Maximum number of attempts before giving up.
        delay: Initial delay between retries in seconds.
        backoff: Multiplier for delay after each retry.
        exceptions: Tuple of exception types to catch and retry.
        capture_to_sentry: If True, capture final failure to Sentry.
    """
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            last_exception: Optional[Exception] = None
            current_delay: int | float = delay

            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except exceptions as e:
                    last_exception = e
                    if attempt < max_attempts - 1:
                        print(
                            f"Attempt {attempt + 1} failed: {e}. Retrying in {current_delay}s...")
                        time.sleep(current_delay)
                        current_delay *= backoff
                    else:
                        print(f"All {max_attempts} attempts failed.")
                        # Capture to Sentry on final failure
                        if capture_to_sentry and sentry_sdk is not None and sentry_sdk.is_initialized():
                            sentry_sdk.capture_exception(e)

            if last_exception:
                raise last_exception

        return wrapper
    return decorator

Also, the PR description mentions replacing inserts with MERGE for reports/transactions, but the diff only changes the month loop. Please either include the MERGE changes or update the description so the reviewed behavior matches the patch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants