commitdev
diff --git a/templates/README.md b/templates/README.md
Lines changed: 33 additions & 3 deletions
diff --git a/templates/docs/logging-and-metrics.md b/templates/docs/logging-and-metrics.md
Lines changed: 76 additions & 0 deletions
diff --git a/templates/kubernetes/terraform/modules/kubernetes/ingress/main.tf b/templates/kubernetes/terraform/modules/kubernetes/ingress/main.tf
Lines changed: 10 additions & 5 deletions
diff --git a/templates/kubernetes/terraform/modules/kubernetes/logging/cloudwatch/main.tf b/templates/kubernetes/terraform/modules/kubernetes/logging/cloudwatch/main.tf
Lines changed: 2 additions & 2 deletions
diff --git a/templates/kubernetes/terraform/modules/kubernetes/logging/kibana/fluentd.tf b/templates/kubernetes/terraform/modules/kubernetes/logging/kibana/fluentd.tf
Lines changed: 2 additions & 2 deletions
diff --git a/templates/kubernetes/terraform/modules/kubernetes/logging/kibana/main.tf b/templates/kubernetes/terraform/modules/kubernetes/logging/kibana/main.tf
Lines changed: 11 additions & 19 deletions
diff --git a/templates/kubernetes/terraform/modules/kubernetes/main.tf b/templates/kubernetes/terraform/modules/kubernetes/main.tf
Lines changed: 8 additions & 6 deletions
diff --git a/templates/kubernetes/terraform/modules/kubernetes/metrics/prometheus/files/prometheus_operator_helm_values.yml b/templates/kubernetes/terraform/modules/kubernetes/metrics/prometheus/files/prometheus_operator_helm_values.yml
Lines changed: 46 additions & 2 deletions
@@ -51,6 +51,7 @@ Then you should be able to run commands normally:
 $ kubectl get pods -A
 ```
 
+[Kubernetes Resources](docs/kubernetes.md)
 
 ### Apply Configuration
 To init and apply the terraform configs, simply run the `make` and specify the
@@ -62,10 +63,10 @@ $ make ENVIRONMENT=<environment>
 #### Extra features built into my kubernetes cluster
 Outlines and best practices utilities that comes with your EKS cluster.
 Please see [Link][zero-k8s-guide]
-- Logging
-- Monitoring
+- [Logging and Metrics](docs/logging-and-metrics.md)
 - Ingress / TLS certificates (auto provisioning)
 - AWS IAM integration with Kubernetes RBAC
+- VPN using Wireguard
 ...
 
 #### Sending Email with Sendgrid
@@ -90,6 +91,32 @@ $ curl --request POST \
 ```
 For Application use, see [Sendgrid resources][sendgrid-send-mail] on how to setup templates to send dynamic transactional emails. To setup emailing from your application deployment, you should create a kubernetes secret with your Sendgrid API Key(already stored in [AWS secret-manager](./terraform/bootstrap/secrets/main.tf)) in your application's namespace. Then mount the secret as an environment variable in your deployment.
 
+
+## WireGuard VPN support
+WireGuard® is an extremely simple yet fast and modern VPN that utilizes state-of-the-art cryptography. This allows users to access internal resources securely.
+
+A WireGuard pod will be started inside the cluster and users can be added to it by appending lines to `kubernetes/terraform/environments/<env>/main.tf`:
+```
+  vpn_client_publickeys = [
+    # name, IP, public key
+    ["Your Name", "10.10.199.203/32", "yz6gNspLJE/HtftBwcj5x0yK2XG6+/SHIaZ****vFRc="],
+  ]
+```
+
+A new user can easily add themselves to the VPN server. Any user with access to the Kubernetes cluster should be able to run the script `scripts/add-vpn-user.sh`.
+This will ask for their name and automatically generate a line like the one above, which they can either add to the Terraform config and apply themselves or give to an administrator to apply.
+The environment they are added to is determined by the current `kubectl` context. You can see your current context with `kubectl config current-context`.
+A user will need to repeat this for each environment they need access to (for example, staging and production).
+
+*Note that this will try to detect the next available IP address for the user but you should still take care to ensure there are no duplicate IPs in the list.*
+
+It will also generate a WireGuard client config file on their local machine which will be properly populated with all the values to allow them to connect to the server.
+
+The WireGuard client can be downloaded at [https://www.wireguard.com/install/](https://www.wireguard.com/install/)
+
+Once connected to the VPN, the user should have direct access to anything running inside the AWS VPC. AWS resources can be referred to by their internal DNS names, and even things inside the Kubernetes cluster may be reached directly using their `.svc.cluster.local` names.
+
+
 #### Application database user creation
 A database user will automatically be created for a backend application with a random password, and the credentials will be stored in a kubernetes secret in the application namespace so they are available to the application.
 
@@ -145,6 +172,8 @@ Commonly used links in AWS console
 Tearing down the infrastructure requires multiple steps, as some of the resources have protection mechanism so they're not accidentally deleted
 
 _Note: the following steps are not reversible, tearing down the cluster results in lost data/resources._
+<details>
+  <summary>Teardown steps</summary>
 
 ```
 export ENVIRONMENT=stage/prod
@@ -173,10 +202,11 @@ make teardown-secrets
 ```
 make teardown-remote-state
 ```
+</details>
 
 ### Suggested readings
 - [Terraform workflow][tf-workflow]
-- [Why do I want code as infrastructure][why-infra-as-code]
+- [Why do I want infrastructure as code?][why-infra-as-code]
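Regarding the WireGuard note in the README changes above about avoiding duplicate IPs in `vpn_client_publickeys`: a check along the following lines could be run before applying. This is only a sketch and is not part of the repository; the function name `find_duplicate_ips` and all entries in the list are hypothetical placeholders that mirror the `[name, IP, public key]` layout shown in the README.

```python
import ipaddress
from collections import defaultdict

# Hypothetical entries mirroring the [name, IP, public key] layout from the README.
# Names, addresses, and keys are placeholders, not real values.
vpn_client_publickeys = [
    ["Alice Example", "10.10.199.203/32", "PLACEHOLDER_KEY_A="],
    ["Bob Example", "10.10.199.204/32", "PLACEHOLDER_KEY_B="],
    ["Carol Example", "10.10.199.203/32", "PLACEHOLDER_KEY_C="],
]


def find_duplicate_ips(entries):
    """Return {ip: [names]} for every address assigned to more than one user."""
    owners = defaultdict(list)
    for name, cidr, _public_key in entries:
        # ip_interface accepts "address/prefix" and exposes the address part.
        address = ipaddress.ip_interface(cidr).ip
        owners[str(address)].append(name)
    return {ip: names for ip, names in owners.items() if len(names) > 1}


for ip, names in find_duplicate_ips(vpn_client_publickeys).items():
    print(f"{ip} is assigned to multiple users: {', '.join(names)}")
```

Parsing with the standard `ipaddress` module rather than comparing raw strings also rejects malformed addresses early, since `ip_interface` raises `ValueError` on invalid input.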
 
 
 
 
@@ -0,0 +1,76 @@
+# Logging
+
+## Cloudwatch
+More to come...
+
+## Kibana + Elasticsearch
+Logs from all the pods in the cluster are collected by FluentD and shipped to AWS Elasticsearch. Kibana is then used as an interface to explore those logs, create graphs, dashboards, etc.
+
+If application logs are printed as JSON, FluentD and Elasticsearch should automatically parse and store the JSON structure as individual fields. This means that if you output a log like `{"myField": "1234"}`, you would be able to write a query in Kibana that looks like `log.myField: "1234"`. This will even handle deeply nested JSON.
+
+It also supports parsing the [Elasticsearch Common Schema](https://www.elastic.co/guide/en/ecs/current/index.html), which is a great way to keep your logging consistent across multiple applications.
+
+You can view the Kibana dashboard at http://kibana.logging.svc.cluster.local/_plugin/kibana after logging into the VPN.
+
+
+### Elasticsearch retention
+Retention in Elasticsearch is controlled by [Index Lifecycle Policies](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html) and constrained by the number of nodes and the storage available in the cluster. Index policies are automatically configured when the cluster is created, and are slightly different between staging and production. The Production policy has a longer retention period, including a period where indices are kept "warm". The Staging policy keeps logs "hot" for a day and then deletes them after a month by default.
+
+These policies may need to be changed, depending on how many logs are being generated. They can be found in [/scripts/files/](/scripts/files/) along with some helper scripts, or you can see them in the Kibana UI in the "Index Management" section.
+
+Another option would be to increase the number of nodes in the cluster or the size of the attached storage for each node, though both of these actions come with an associated cost. Either can be changed by modifying the associated values in Terraform in `/terraform/environments/<env>/main.tf`.
+
+# Metrics
+
+## Cloudwatch
+More to come...
+
+## Grafana + Prometheus
+Prometheus runs in the k8s cluster and collects and stores metrics from various sources.
+Grafana also runs in the cluster and connects to various data sources including Prometheus, Cloudwatch, and Elasticsearch, to pull in and visualize data.
+It can create graphs, dashboards, and alerts which can be sent out via email, Slack, PagerDuty or many other integrations.
+
+You can view the Grafana dashboard at http://grafana.metrics.svc.cluster.local after logging into the VPN.
+The default username is 'admin' and the password is '<% .Name %>'. This account could be shared across multiple team members, you could create multiple accounts per person or per team, or you could add an external auth provider like Google.
+
+The UIs for Grafana and Kibana are only available from inside the private network (via the VPN) so there is already a certain amount of access restriction.
+
+
+### Default dashboards
+There should be a handful of default dashboards that get created by the various prometheus components. For example, there are a number of dashboards related to Kubernetes nodes, workloads, etc.
+There's also a site full of [community-created dashboards](https://grafana.com/grafana/dashboards) which are very useful for getting started and as examples. Just copy the ID, click Create > Import Dashboard in Grafana, and paste the ID into the box!
+
+Here are some useful ones:
+- [NGINX Ingress controller](https://grafana.com/grafana/dashboards/9614)
+- [AWS RDS](https://grafana.com/grafana/dashboards/707)
+- [Kubernetes Nodes](https://grafana.com/grafana/dashboards/1860)
+- [Elasticsearch](https://grafana.com/grafana/dashboards/6483)
+
+
+### Adding Slack integration
+The image renderer plugin is already enabled, which allows Grafana to attach an image to the notifications it sends. To add alerts, go into the Alerting > Notification Channels section of Grafana and add a Slack channel. Fill in the webhook URL. If you want files attached, fill in the "Recipient" channel and "Token", and enable "Include Image".
+
+### Adding Prometheus data sources
+There are a number of community-supported exporters available here as helm charts:
+https://github.com/prometheus-community/helm-charts/tree/main/charts
+
+### Adding a ServiceMonitor
+Prometheus Operator is running in the cluster, which allows you to control the prometheus configuration via native Kubernetes "Custom Resources".
+This means if you introduce a new service into the cluster that supports prometheus stat scraping, it is easy to set up. You would need to create a new `ServiceMonitor` in the `metrics` namespace that defines where to find the service that prometheus should watch for stats. It would look something like this:
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  name: nginx-ingress-controller-metrics    # Name of this monitor, just has to be unique
+  namespace: metrics
+spec:
+  endpoints:
+    - interval: 30s
+      port: metrics                         # Which port on the service should be hit. A path can also be added
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: ingress-nginx # The label of the service to monitor
+  namespaceSelector:
+    matchNames:
+      - ingress-nginx                       # The namespace to look for a service in
+```
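Returning to the JSON logging behaviour described under "Kibana + Elasticsearch" in the new docs: an application could emit single-line JSON logs roughly as follows. This is a minimal sketch using only the Python standard library, not code from this repository. The logger name `example-app`, the `JsonFormatter` class, and the convention of passing structured data under a `fields` key are all assumptions made for illustration.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Anything passed via extra={"fields": {...}} is merged in as top-level keys.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream=sys.stdout):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("example-app")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
logger.info("order created", extra={"fields": {"myField": "1234"}})
```

If the pipeline behaves as the docs describe, a line emitted this way should be searchable in Kibana with a query like `log.myField: "1234"`.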
@@ -225,10 +225,6 @@ resource "kubernetes_deployment" "nginx_ingress_controller" {
           "app.kubernetes.io/name"    = "ingress-nginx",
           "app.kubernetes.io/part-of" = "ingress-nginx"
         }
-        # annotations = {
-        #   "prometheus.io/port"   = "10254",
-        #   "prometheus.io/scrape" = "true"
-        # }
       }
       spec {
         container {
@@ -250,6 +246,10 @@ resource "kubernetes_deployment" "nginx_ingress_controller" {
             name           = "https"
             container_port = 443
           }
+          port {
+            name           = "metrics"
+            container_port = 10254
+          }
           env {
             name = "POD_NAME"
             value_from {
@@ -319,7 +319,7 @@ resource "kubernetes_service" "ingress_nginx" {
     name      = "ingress-nginx"
     namespace = kubernetes_namespace.ingress_nginx.metadata[0].name
     labels = {
-      "app.kubernetes.io/name"    = "ingress-nginx",
+      "app.kubernetes.io/name"    = "ingress-nginx", # Referenced by prometheus servicemonitor if prometheus is used
       "app.kubernetes.io/part-of" = "ingress-nginx"
     }
   }
@@ -334,6 +334,11 @@ resource "kubernetes_service" "ingress_nginx" {
       port        = 443
       target_port = "https"
     }
+    port {
+      name        = "metrics"
+      port        = 10254
+      target_port = "metrics"
+    }
     selector = {
       "app.kubernetes.io/name"    = "ingress-nginx",
       "app.kubernetes.io/part-of" = "ingress-nginx"
 
@@ -10,7 +10,7 @@ module "iam_assumable_role_cloudwatch" {
   role_name                     = "<% .Name %>-k8s-${var.environment}-cloudwatch"
   provider_url                  = replace(data.aws_eks_cluster.cluster.identity.0.oidc.0.issuer, "https://", "")
   role_policy_arns              = [data.aws_iam_policy.CloudWatchAgentServerPolicy.arn]
-  oidc_fully_qualified_subjects = [ "system:serviceaccount:amazon-cloudwatch:cloudwatch-agent" ]
+  oidc_fully_qualified_subjects = ["system:serviceaccount:amazon-cloudwatch:cloudwatch-agent"]
 }
 
 # Create a role using oidc to map service accounts
@@ -21,7 +21,7 @@ module "iam_assumable_role_fluentd" {
   role_name                     = "<% .Name %>-k8s-${var.environment}-fluentd"
   provider_url                  = replace(data.aws_eks_cluster.cluster.identity.0.oidc.0.issuer, "https://", "")
   role_policy_arns              = [data.aws_iam_policy.CloudWatchAgentServerPolicy.arn]
-  oidc_fully_qualified_subjects = [ "system:serviceaccount:amazon-cloudwatch:fluentd" ]
+  oidc_fully_qualified_subjects = ["system:serviceaccount:amazon-cloudwatch:fluentd"]
 }
 
 
 
@@ -9,8 +9,8 @@ data "aws_elasticsearch_domain" "logging_cluster" {
 
 resource "kubernetes_service_account" "fluentd" {
   metadata {
-    name        = "fluentd"
-    namespace   = kubernetes_namespace.logging.metadata[0].name
+    name      = "fluentd"
+    namespace = kubernetes_namespace.logging.metadata[0].name
   }
 }
 
 
@@ -8,7 +8,17 @@ resource "kubernetes_namespace" "logging" {
 }
 
 
-
+# Utility dns record for people using vpn
+resource "kubernetes_service" "elasticsearch" {
+  metadata {
+    namespace = kubernetes_namespace.logging.metadata[0].name
+    name      = "kibana"
+  }
+  spec {
+    type          = "ExternalName"
+    external_name = data.aws_elasticsearch_domain.logging_cluster.endpoint
+  }
+}
 # # Kibana ingress - Allows us to modify the path, but proxies out to elasticsearch
 # resource "kubernetes_ingress" "kibana_ingress" {
 #   metadata {
@@ -47,21 +57,3 @@ resource "kubernetes_namespace" "logging" {
 
 #   depends_on = [kubernetes_namespace.logging]
 # }
-
-
-# # Create prometheus exporter to gather metrics about the elasticsearch cluster
-# resource "helm_release" "elasticsearch_prometheus_exporter" {
-#   name       = "elasticsearch-exporter"
-#   repository = "stable"
-#   chart      = "elasticsearch-exporter"
-#   version    = "3.4.0"
-#   namespace  = "monitoring"
-#   set {
-#     name  = "es.uri"
-#     value = "http://elasticsearch.logging.svc.cluster.local"
-#   }
-#   set {
-#     name  = "serviceMonitor.enabled"
-#     value = "true"
-#   }
-# }
@@ -16,12 +16,14 @@ module "logging_kibana" {
 }
 
 module "metrics_prometheus" {
-  count           = var.metrics_type == "prometheus" ? 1 : 0
-  source          = "./metrics/prometheus"
-  project         = var.project
-  environment     = var.environment
-  region          = var.region
-  internal_domain = var.internal_domain
+  count                = var.metrics_type == "prometheus" ? 1 : 0
+  source               = "./metrics/prometheus"
+  project              = var.project
+  environment          = var.environment
+  region               = var.region
+  cluster_name         = var.cluster_name
+  internal_domain      = var.internal_domain
+  elasticsearch_domain = var.logging_type == "kibana" ? "${var.project}-${var.environment}-logging" : ""
 }
 
 module "ingress" {
 
@@ -1,11 +1,30 @@
-
 ## Using default values from https://github.com/grafana/helm-charts/blob/main/charts/grafana/values.yaml
 ##
 grafana:
 
   persistence:
     enabled: true
 
+  ## Configure additional grafana datasources (passed through tpl)
+  ## ref: http://docs.grafana.org/administration/provisioning/#datasources
+  additionalDataSources:
+  - name: Elasticsearch
+    type: elasticsearch
+    database: fluentd-*
+    access: proxy
+    editable: false
+    version: 1
+    jsonData:
+      esVersion: 70
+      timeField: "@timestamp"
+  - name: CloudWatch
+    type: cloudwatch
+    access: proxy
+    editable: false
+    version: 1
+    jsonData:
+        authType: default
+
   # Custom Dashboard
   dashboardProviders:
     dashboardproviders.yaml:
@@ -75,8 +94,11 @@ grafana:
             "uid": "Example",
             "version": 1
           }
+  # Enable the image-renderer deployment & service - disable this if you don't want to send images via alerts
+  imageRenderer:
+    enabled: true
 
-## Manages Prometheus and Alertmanager components
+  ## Manages Prometheus and Alertmanager components
 ##
 prometheusOperator:
   securityContext:
@@ -116,3 +138,25 @@ prometheus:
       runAsNonRoot: false
       runAsUser: 0
       fsGroup: 0
+
+    # Don't apply a selector when looking for service and pod monitors, just discover all of them
+    serviceMonitorSelectorNilUsesHelmValues: false
+    podMonitorSelectorNilUsesHelmValues: false
+
+
+  ## Additional ServiceMonitors to create
+  additionalServiceMonitors:
+    - name: "nginx-ingress-controller-metrics"
+
+      ## Label selector for services to which this ServiceMonitor applies
+      selector:
+        matchLabels:
+          app.kubernetes.io/name: ingress-nginx
+
+      namespaceSelector:
+        matchNames: [ ingress-nginx ]
+
+      ## Endpoints of the selected service to be monitored
+      endpoints:
+        - port: "metrics"
+          interval: 30s