Based on experience reproducing issues from Kapa queries, we lack coverage of generally useful debugging steps. Improving the debugging docs would also lead to better LLM advice, both from Kapa and from users' own agents, in multi-step conversations where the initial solutions are ineffective.
The main things we typically ask our customers for when it comes to logs are the following:
1. /usr/local/kong/logs/error.log
This is the file for most troubleshooting issues: anything to do with plugins, behaviours, failures, etc. We will almost always want to look at this file, ideally captured at the debug log level.
2. /usr/local/kong/logs/access.log
This is used mainly when a customer says a request wasn't processed by the Gateway and we need to determine whether the Nginx layer of Kong saw the request come in, or what the timestamps were if we're piecing together a multi-step workflow. A request missing from this file is not conclusive proof that it never arrived: it may still have reached Kong at the network layer only, in which case a tcpdump might need to be captured to prove that the request made it to Kong or was sent upstream from Kong.
3. /usr/local/kong/logs/admin_error.log
This is used for general Admin API / Kong Manager troubleshooting, when a customer reports on-prem Kong Manager issues or failing API calls to the Admin portion of the Kong Gateway (as opposed to proxy traffic).
4. /usr/local/kong/logs/admin_access.log
This is rarely needed. Much like the regular proxy traffic access log file, it is used for troubleshooting connections to the Admin API / Kong Manager.
Those are generally the only log files we need to look at for the Kong Gateway. It's nearly always just the main error.log file (#1), and sometimes the access.log file (#2), since most cases are about proxy traffic issues, plugin config/behaviour issues, etc.
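As a first pass over the two files that matter most, a small script can pull out the higher-severity lines from error.log and check whether a given request shows up in access.log within a time window. The following is a minimal sketch, not an official tool. It assumes error.log lines start with the standard Nginx error-log prefix (`YYYY/MM/DD HH:MM:SS [level] ...`) and that access.log uses something close to the Nginx "combined" layout; the actual access log format depends on the Gateway version and configuration, so the regular expressions may need adjusting. The function names `filter_error_log` and `find_requests` are hypothetical.

```python
import re
from datetime import datetime

# Assumed Nginx error-log prefix: "YYYY/MM/DD HH:MM:SS [level] rest of line"
ERROR_RE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[(\w+)\] (.*)$")

# Assumed "combined"-style access line: ... [time] "METHOD path proto" status ...
ACCESS_RE = re.compile(r'\[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3})')


def filter_error_log(lines, levels=("warn", "error", "crit", "alert", "emerg")):
    """Return (timestamp, level, message) for lines logged at the given levels."""
    matches = []
    for line in lines:
        m = ERROR_RE.match(line.rstrip("\n"))
        if m and m.group(2) in levels:
            matches.append((m.group(1), m.group(2), m.group(3)))
    return matches


def find_requests(lines, path_fragment, start, end):
    """Return (time, method, path, status) for requests whose path contains
    path_fragment and whose timestamp falls between start and end.
    start and end must be timezone-aware datetimes."""
    hits = []
    for line in lines:
        m = ACCESS_RE.search(line)
        if not m:
            continue
        try:
            when = datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S %z")
        except ValueError:
            continue  # timestamp is not in the assumed format; skip the line
        if start <= when <= end and path_fragment in m.group(3):
            hits.append((when, m.group(2), m.group(3), int(m.group(4))))
    return hits


# Example usage (paths taken from the list above):
#   with open("/usr/local/kong/logs/error.log", errors="replace") as f:
#       for ts, level, msg in filter_error_log(f):
#           print(ts, level, msg)
```

An empty result from `find_requests` carries the same caveat as a missing access.log entry: it does not prove the request never reached Kong at the network layer.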
On top of logs, we tend to request metrics, if the customer has them, from their OpenTelemetry endpoint, Prometheus endpoint, or whatever tool they use to collect and display metrics about the Kong Gateway data planes and control planes. These metrics help with troubleshooting latency-related issues, where the customer feels something is taking too long to complete or requests are too slow to be processed.
If it's an authentication issue, we'll usually also ask for a JWT, or something similar, that we can decode to see the information held inside the token and confirm that values such as the issuer match what's configured in the OIDC or other related plugins.
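Inspecting a token for this purpose does not require any keys, because the header and payload of a signed JWT are only base64url-encoded. The following is a minimal sketch using the Python standard library; the function name `decode_jwt_unverified` is hypothetical, and the code deliberately does not verify the signature, so it is suitable for reading claims such as `iss` during troubleshooting and nothing more. It assumes a three-part signed token (JWS); an encrypted token (JWE) has five parts and cannot be read this way.

```python
import base64
import json


def decode_jwt_unverified(token):
    """Return (header, payload) of a signed JWT as dicts.
    The signature is NOT verified; use only for inspection."""
    header_b64, payload_b64, _signature = token.split(".")

    def _decode(segment):
        # base64url segments in a JWT omit padding; restore it before decoding
        padded = segment + "=" * (-len(segment) % 4)
        return json.loads(base64.urlsafe_b64decode(padded))

    return _decode(header_b64), _decode(payload_b64)


# Example usage:
#   header, payload = decode_jwt_unverified(token_from_customer)
#   print(payload.get("iss"), payload.get("aud"), payload.get("exp"))
```

The decoded `iss` value can then be compared by eye against the issuer configured in the plugin.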
For any plugin-related issue, we'll always ask for the YAML or JSON of the plugin configuration too (I prefer JSON myself as it's easier to format when a customer pastes it into a case comment; YAML unfortunately loses a lot of its formatting when pasted into Salesforce), or even a decK dump from the workspace if they are able to share it with us.
And for anything related to a browser-side issue (auth, redirects, etc.), we will typically also request a HAR file captured from their browser while reproducing the problem.
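A HAR file is JSON, so the redirect chain in a capture can be summarized without opening it in a browser. The following is a minimal sketch assuming the standard HAR 1.2 layout (`log.entries[]`, each with `request` and `response` objects); the function name `summarize_har` and the file name in the usage comment are hypothetical.

```python
import json


def summarize_har(har):
    """Return (method, url, status, redirect_url) for each entry in a parsed HAR."""
    rows = []
    for entry in har["log"]["entries"]:
        request, response = entry["request"], entry["response"]
        rows.append((
            request["method"],
            request["url"],
            response["status"],
            response.get("redirectURL", ""),  # empty when the response is not a redirect
        ))
    return rows


# Example usage:
#   with open("capture.har", encoding="utf-8") as f:
#       for method, url, status, redirect in summarize_har(json.load(f)):
#           print(status, method, url, "->", redirect)
```

Reading the statuses and redirect targets in order usually shows where an auth or redirect flow deviates from what is expected.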
Definition of done
Information
From support:
Due date (optional)
Size