
Commit 2845982

improvement: Added additional grafana data sources and monitors, documentation, improved setup flow for logging and metrics. (#185)
* improvement: Added additional grafana data sources and monitors, documentation, improved setup flow for logging and metrics.
* Fixed version for iam role module, switched provider version requirement method
2 parents fc203b3 + 4feca67 commit 2845982

15 files changed: 347 additions & 48 deletions


templates/README.md

Lines changed: 33 additions & 3 deletions
@@ -51,6 +51,7 @@ Then you should be able to run commands normally:
 $ kubectl get pods -A
 ```
 
+[Kubernetes Resources](docs/kubernetes.md)
 
 ### Apply Configuration
 To init and apply the terraform configs, simply run the `make` and specify the
@@ -62,10 +63,10 @@ $ make ENVIRONMENT=<environment>
 #### Extra features built into my kubernetes cluster
 Outlines the best practices and utilities that come with your EKS cluster.
 Please see [Link][zero-k8s-guide]
-- Logging
-- Monitoring
+- [Logging and Metrics](docs/logging-and-metrics.md)
 - Ingress / TLS certificates (auto provisioning)
 - AWS IAM integration with Kubernetes RBAC
+- VPN using Wireguard
 ...
 
 #### Sending Email with Sendgrid
@@ -90,6 +91,32 @@ $ curl --request POST \
 ```
 For Application use, see [Sendgrid resources][sendgrid-send-mail] on how to set up templates to send dynamic transactional emails. To set up emailing from your application deployment, you should create a kubernetes secret with your Sendgrid API Key (already stored in [AWS secret-manager](./terraform/bootstrap/secrets/main.tf)) in your application's namespace. Then mount the secret as an environment variable in your deployment.
 
+
+## WireGuard VPN support
+WireGuard® is an extremely simple yet fast and modern VPN that utilizes state-of-the-art cryptography. It allows users to access internal resources securely.
+
+A WireGuard pod will be started inside the cluster, and users can be added to it by appending lines to `kubernetes/terraform/environments/<env>/main.tf`:
+```
+vpn_client_publickeys = [
+  # name, IP, public key
+  ["Your Name", "10.10.199.203/32", "yz6gNspLJE/HtftBwcj5x0yK2XG6+/SHIaZ****vFRc="],
+]
+```
+
+A new user can add themselves to the VPN server easily. Any user with access to the kubernetes cluster should be able to run the script `scripts/add-vpn-user.sh`.
+This will ask for their name and automatically generate a line like the one above, which they can then add to the terraform and apply themselves, or give to an administrator and ask them to apply it.
+The environment they are added to is decided by the current `kubectl` context. You can see your current context with `kubectl config current-context`.
+A user will need to repeat this for each environment they need access to (for example, staging and production).
+
+*Note that this will try to detect the next available IP address for the user, but you should still take care to ensure there are no duplicate IPs in the list.*
+
+It will also generate a WireGuard client config file on their local machine, properly populated with all the values needed to connect to the server.
+
+The WireGuard client can be downloaded at [https://www.wireguard.com/install/](https://www.wireguard.com/install/)
+
+Once connected to the VPN, the user should have direct access to anything running inside the AWS VPC. AWS resources can be referred to by their internal DNS names, and even things inside the kubernetes cluster may be reached directly using their `.svc.cluster.local` names.
+
 #### Application database user creation
 A database user will automatically be created for a backend application with a random password, and the credentials will be stored in a kubernetes secret in the application namespace so they are available to the application.
 
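The `add-vpn-user.sh` script automates the steps above; for reference, an equivalent entry can also be produced by hand with the standard WireGuard CLI. A minimal sketch (the name and IP are placeholders — pick the next unused address in your list):

```sh
# Generate a WireGuard key pair (requires the wireguard-tools package).
wg genkey | tee privatekey | wg pubkey > publickey

# Print a vpn_client_publickeys entry like the one the script produces.
echo "[\"Your Name\", \"10.10.199.204/32\", \"$(cat publickey)\"],"
```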
@@ -145,6 +172,8 @@ Commonly used links in AWS console
 Tearing down the infrastructure requires multiple steps, as some of the resources have protection mechanisms so they're not accidentally deleted
 
 _Note: the following steps are not reversible, tearing down the cluster results in lost data/resources._
+<details>
+<summary>Teardown steps</summary>
 
 ```
 export ENVIRONMENT=stage/prod
@@ -173,10 +202,11 @@ make teardown-secrets
 ```
 make teardown-remote-state
 ```
+</details>
 
 ### Suggested readings
 - [Terraform workflow][tf-workflow]
-- [Why do I want code as infrastructure][why-infra-as-code]
+- [Why do I want infrastructure as code?][why-infra-as-code]
 
 
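As a companion to the Sendgrid section above: a minimal sketch of mounting the API-key secret as an environment variable, assuming a secret named `sendgrid` with key `api-key` already exists in the application's namespace (all names here are hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                     # hypothetical application deployment
spec:
  selector:
    matchLabels: { app: my-app }
  template:
    metadata:
      labels: { app: my-app }
    spec:
      containers:
        - name: my-app
          image: my-app:latest
          env:
            - name: SENDGRID_API_KEY
              valueFrom:
                secretKeyRef:
                  name: sendgrid   # hypothetical secret name
                  key: api-key     # hypothetical key within the secret
```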

templates/docs/logging-and-metrics.md (new file)

Lines changed: 76 additions & 0 deletions

# Logging

## Cloudwatch
More to come...

## Kibana + Elasticsearch
Logs from all the pods in the cluster are collected by FluentD and shipped to AWS Elasticsearch. Kibana is then used as an interface to explore those logs, create graphs, dashboards, etc.

If application logs are printed as JSON, FluentD and Elasticsearch should automatically parse and store the JSON structure as individual fields, which means if you output a log like `{"myField": "1234"}`, you would be able to write a query in Kibana that looks like `log.myField: "1234"`. This even handles deeply nested JSON.

It also supports parsing the [Elastic Common Schema](https://www.elastic.co/guide/en/ecs/current/index.html), which is a great way to keep your logging consistent across multiple applications.

You can view the Kibana dashboard at http://kibana.logging.svc.cluster.local/_plugin/kibana after logging into the VPN.

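For example, a single ECS-style log line might look like the following (an illustrative sketch: `@timestamp`, `log.level`, `message`, and `ecs.version` are standard ECS fields, while `myField` is an arbitrary application field):

```json
{
  "@timestamp": "2021-01-15T12:34:56.789Z",
  "log.level": "info",
  "message": "payment processed",
  "ecs.version": "1.7.0",
  "myField": "1234"
}
```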
### Elasticsearch retention
Retention in Elasticsearch is controlled by [Index Lifecycle Policies](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html) and constrained by the number of nodes and the storage available in the cluster. Index policies are automatically configured when the cluster is created, and are slightly different between staging and production. The production policy has a longer retention period, including a period where indices are kept "warm". The staging policy keeps logs "hot" for a day and then deletes them after a month by default.

These policies may need to be changed depending on how many logs are being generated. They can be found in [/scripts/files/](/scripts/files/) along with some helper scripts, or you can see them in the Kibana UI in the "Index Management" section.

Another option would be to increase the number of nodes in the cluster or the size of the attached storage for each node, though both of these actions come with an associated cost. Both can be changed by modifying the associated values in Terraform in `/terraform/environments/<env>/main.tf`.

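A minimal sketch of what such a policy looks like, roughly matching the staging behavior described above (the real policies live in `/scripts/files/`; the shape follows the Elasticsearch ILM API):

```json
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {}
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```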
# Metrics

## Cloudwatch
More to come...

## Grafana + Prometheus
Prometheus runs in the k8s cluster and collects and stores metrics from various sources.
Grafana also runs in the cluster and connects to various data sources, including Prometheus, Cloudwatch, and Elasticsearch, to pull in and visualize data.
It can create graphs, dashboards, and alerts, which can be sent out via email, Slack, PagerDuty, or many other integrations.

You can view the Grafana dashboard at http://grafana.metrics.svc.cluster.local after logging into the VPN.
The default username is 'admin' and the password is '<% .Name %>'. This account could be shared across multiple team members, you could create accounts per person or per team, or you could add an external auth provider like Google.

The UIs for Grafana and Kibana are only available from inside the private network (via the VPN), so there is already a certain amount of access restriction.

### Default dashboards
There should be a handful of default dashboards that get created by the various prometheus components. For example, there are a number of dashboards related to Kubernetes nodes, workloads, etc.
There's also a site full of [community-created dashboards](https://grafana.com/grafana/dashboards) which are very useful to get started, and to use as examples. Just copy the ID, click Create > Import Dashboard in Grafana, and paste the ID into the box!

Here are some useful ones:
- [NGINX Ingress controller](https://grafana.com/grafana/dashboards/9614)
- [AWS RDS](https://grafana.com/grafana/dashboards/707)
- [Kubernetes Nodes](https://grafana.com/grafana/dashboards/1860)
- [Elasticsearch](https://grafana.com/grafana/dashboards/6483)

### Adding Slack integration
The image renderer plugin is already enabled, which allows Grafana to attach an image to the notifications it sends. To add alerts, go into the Alerting > Notification Channels section of Grafana and add a Slack channel. Fill in the webhook URL. If you want files attached, fill in the "Recipient" channel and "Token", and enable "Include Image".

### Adding Prometheus data sources
There are a number of community-supported exporters available as Helm charts:
https://github.com/prometheus-community/helm-charts/tree/main/charts

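Installing one of those exporters might look like the following sketch. The `es.uri` and `serviceMonitor.enabled` values mirror the ones used elsewhere in this repo; the release name and namespace are placeholders:

```sh
# Add the community chart repository (one-time setup).
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install an Elasticsearch exporter and ask the chart to create a
# ServiceMonitor so Prometheus Operator discovers it automatically.
helm install elasticsearch-exporter prometheus-community/prometheus-elasticsearch-exporter \
  --namespace metrics \
  --set es.uri=http://elasticsearch.logging.svc.cluster.local \
  --set serviceMonitor.enabled=true
```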
### Adding a ServiceMonitor
Prometheus Operator is running in the cluster, which allows you to control the prometheus configuration via native Kubernetes "Custom Resources".
This means if you introduce a new service into the cluster that supports prometheus stat scraping, it is easy to set up. You would need to create a new `ServiceMonitor` in the `metrics` namespace that defines where to find the service that prometheus should watch for stats. It would look something like this:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nginx-ingress-controller-metrics # Name of this monitor, just has to be unique
  namespace: metrics
spec:
  endpoints:
    - interval: 30s
      port: metrics # Which port on the service should be hit. A path can also be added
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx # The label of the service to monitor
  namespaceSelector:
    matchNames:
      - ingress-nginx # The namespace to look for a service in
```

templates/kubernetes/terraform/modules/kubernetes/ingress/main.tf

Lines changed: 10 additions & 5 deletions
@@ -225,10 +225,6 @@ resource "kubernetes_deployment" "nginx_ingress_controller" {
         "app.kubernetes.io/name"    = "ingress-nginx",
         "app.kubernetes.io/part-of" = "ingress-nginx"
       }
-      # annotations = {
-      #   "prometheus.io/port"   = "10254",
-      #   "prometheus.io/scrape" = "true"
-      # }
     }
     spec {
       container {
@@ -250,6 +246,10 @@ resource "kubernetes_deployment" "nginx_ingress_controller" {
           name           = "https"
           container_port = 443
         }
+        port {
+          name           = "metrics"
+          container_port = 10254
+        }
         env {
           name = "POD_NAME"
           value_from {
@@ -319,7 +319,7 @@ resource "kubernetes_service" "ingress_nginx" {
     name      = "ingress-nginx"
     namespace = kubernetes_namespace.ingress_nginx.metadata[0].name
     labels = {
-      "app.kubernetes.io/name"    = "ingress-nginx",
+      "app.kubernetes.io/name"    = "ingress-nginx", # Referenced by prometheus servicemonitor if prometheus is used
       "app.kubernetes.io/part-of" = "ingress-nginx"
     }
   }
@@ -334,6 +334,11 @@ resource "kubernetes_service" "ingress_nginx" {
       port        = 443
       target_port = "https"
     }
+    port {
+      name        = "metrics"
+      port        = 10254
+      target_port = "metrics"
+    }
     selector = {
       "app.kubernetes.io/name"    = "ingress-nginx",
       "app.kubernetes.io/part-of" = "ingress-nginx"

templates/kubernetes/terraform/modules/kubernetes/logging/cloudwatch/main.tf

Lines changed: 2 additions & 2 deletions
@@ -10,7 +10,7 @@ module "iam_assumable_role_cloudwatch" {
   role_name        = "<% .Name %>-k8s-${var.environment}-cloudwatch"
   provider_url     = replace(data.aws_eks_cluster.cluster.identity.0.oidc.0.issuer, "https://", "")
   role_policy_arns = [data.aws_iam_policy.CloudWatchAgentServerPolicy.arn]
-  oidc_fully_qualified_subjects = [ "system:serviceaccount:amazon-cloudwatch:cloudwatch-agent" ]
+  oidc_fully_qualified_subjects = ["system:serviceaccount:amazon-cloudwatch:cloudwatch-agent"]
 }
 
 # Create a role using oidc to map service accounts
@@ -21,7 +21,7 @@ module "iam_assumable_role_fluentd" {
   role_name        = "<% .Name %>-k8s-${var.environment}-fluentd"
   provider_url     = replace(data.aws_eks_cluster.cluster.identity.0.oidc.0.issuer, "https://", "")
   role_policy_arns = [data.aws_iam_policy.CloudWatchAgentServerPolicy.arn]
-  oidc_fully_qualified_subjects = [ "system:serviceaccount:amazon-cloudwatch:fluentd" ]
+  oidc_fully_qualified_subjects = ["system:serviceaccount:amazon-cloudwatch:fluentd"]
 }
 
 
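These modules bind IAM roles to Kubernetes service accounts through the cluster's OIDC provider (IRSA). On the Kubernetes side, the service account carries the role ARN as an annotation; a sketch of what the fluentd service account ends up looking like (the ARN is illustrative):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: amazon-cloudwatch
  annotations:
    # Illustrative ARN - the real one comes from the module's role output.
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/myproject-k8s-stage-fluentd
```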

templates/kubernetes/terraform/modules/kubernetes/logging/kibana/fluentd.tf

Lines changed: 2 additions & 2 deletions
@@ -9,8 +9,8 @@ data "aws_elasticsearch_domain" "logging_cluster" {
 
 resource "kubernetes_service_account" "fluentd" {
   metadata {
-    name = "fluentd"
-    namespace = kubernetes_namespace.logging.metadata[0].name
+    name      = "fluentd"
+    namespace = kubernetes_namespace.logging.metadata[0].name
   }
 }
 

templates/kubernetes/terraform/modules/kubernetes/logging/kibana/main.tf

Lines changed: 11 additions & 19 deletions
@@ -8,7 +8,17 @@ resource "kubernetes_namespace" "logging" {
 }
 
 
-
+# Utility dns record for people using vpn
+resource "kubernetes_service" "elasticsearch" {
+  metadata {
+    namespace = kubernetes_namespace.logging.metadata[0].name
+    name      = "kibana"
+  }
+  spec {
+    type          = "ExternalName"
+    external_name = data.aws_elasticsearch_domain.logging_cluster.endpoint
+  }
+}
 # # Kibana ingress - Allows us to modify the path, but proxies out to elasticsearch
 # resource "kubernetes_ingress" "kibana_ingress" {
 #   metadata {
@@ -47,21 +57,3 @@ resource "kubernetes_namespace" "logging" {
 
 #   depends_on = [kubernetes_namespace.logging]
 # }
-
-
-# # Create prometheus exporter to gather metrics about the elasticsearch cluster
-# resource "helm_release" "elasticsearch_prometheus_exporter" {
-#   name       = "elasticsearch-exporter"
-#   repository = "stable"
-#   chart      = "elasticsearch-exporter"
-#   version    = "3.4.0"
-#   namespace  = "monitoring"
-#   set {
-#     name  = "es.uri"
-#     value = "http://elasticsearch.logging.svc.cluster.local"
-#   }
-#   set {
-#     name  = "serviceMonitor.enabled"
-#     value = "true"
-#   }
-# }
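The ExternalName service added above is what makes `kibana.logging.svc.cluster.local` resolve to the AWS Elasticsearch endpoint. A quick way to verify the record from inside the cluster (a sketch; the throwaway pod name and image are arbitrary):

```sh
# Resolve the utility DNS record from a temporary pod inside the cluster.
kubectl run -it --rm dns-check --image=busybox --restart=Never -- \
  nslookup kibana.logging.svc.cluster.local
```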

templates/kubernetes/terraform/modules/kubernetes/main.tf

Lines changed: 8 additions & 6 deletions
@@ -16,12 +16,14 @@ module "logging_kibana" {
 }
 
 module "metrics_prometheus" {
-  count           = var.metrics_type == "prometheus" ? 1 : 0
-  source          = "./metrics/prometheus"
-  project         = var.project
-  environment     = var.environment
-  region          = var.region
-  internal_domain = var.internal_domain
+  count                = var.metrics_type == "prometheus" ? 1 : 0
+  source               = "./metrics/prometheus"
+  project              = var.project
+  environment          = var.environment
+  region               = var.region
+  cluster_name         = var.cluster_name
+  internal_domain      = var.internal_domain
+  elasticsearch_domain = var.logging_type == "kibana" ? "${var.project}-${var.environment}-logging" : ""
 }
 
 module "ingress" {
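For context, these conditional modules are switched on from each environment's root configuration. A sketch of the relevant inputs, assuming the root module passes `logging_type` and `metrics_type` (the variable names come from the diff above) straight through:

```hcl
# Hypothetical excerpt from terraform/environments/<env>/main.tf -
# enable Kibana logging and Prometheus metrics for this environment.
logging_type = "kibana"
metrics_type = "prometheus"
```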

templates/kubernetes/terraform/modules/kubernetes/metrics/prometheus/files/prometheus_operator_helm_values.yml

Lines changed: 46 additions & 2 deletions
@@ -1,11 +1,30 @@
-
 ## Using default values from https://github.com/grafana/helm-charts/blob/main/charts/grafana/values.yaml
 ##
 grafana:
 
   persistence:
     enabled: true
 
+  ## Configure additional grafana datasources (passed through tpl)
+  ## ref: http://docs.grafana.org/administration/provisioning/#datasources
+  additionalDataSources:
+    - name: Elasticsearch
+      type: elasticsearch
+      database: fluentd-*
+      access: proxy
+      editable: false
+      version: 1
+      jsonData:
+        esVersion: 70
+        timeField: "@timestamp"
+    - name: CloudWatch
+      type: cloudwatch
+      access: proxy
+      editable: false
+      version: 1
+      jsonData:
+        authType: default
+
   # Custom Dashboard
   dashboardProviders:
     dashboardproviders.yaml:
@@ -75,8 +94,11 @@
         "uid": "Example",
         "version": 1
       }
+  # Enable the image-renderer deployment & service - disable this if you don't want to send images via alerts
+  imageRenderer:
+    enabled: true
 
-## Manages Prometheus and Alertmanager components
+## Manages Prometheus and Alertmanager components
 ##
 prometheusOperator:
   securityContext:
@@ -116,3 +138,25 @@
       runAsNonRoot: false
       runAsUser: 0
       fsGroup: 0
+
+    # Don't apply a selector when looking for service and pod monitors, just discover all of them
+    serviceMonitorSelectorNilUsesHelmValues: false
+    podMonitorSelectorNilUsesHelmValues: false
+
+
+  ## Additional ServiceMonitors to create
+  additionalServiceMonitors:
+    - name: "nginx-ingress-controller-metrics"
+
+      ## Label selector for services to which this ServiceMonitor applies
+      selector:
+        matchLabels:
+          app.kubernetes.io/name: ingress-nginx
+
+      namespaceSelector:
+        matchNames: [ ingress-nginx ]
+
+      ## Endpoints of the selected service to be monitored
+      endpoints:
+        - port: "metrics"
+          interval: 30s
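Once the ServiceMonitor above is scraping the ingress controller, its metrics can be queried from Prometheus or graphed in Grafana. A sketch of checking one of them via the Prometheus HTTP API, assuming the operator's default `prometheus-operated` service lives in the `metrics` namespace and using the standard `nginx_ingress_controller_requests` counter:

```sh
# Port-forward the Prometheus API locally...
kubectl port-forward -n metrics svc/prometheus-operated 9090:9090

# ...then query the per-ingress request rate over the last 5 minutes.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(nginx_ingress_controller_requests[5m])) by (ingress)'
```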
