Going Mobile: Part 5
In this final post of the Going Mobile series, I'll go over some issues I have faced while deploying the system, and provide a summary of the solutions I chose to resolve them.
Networking (again):
svclb timing out:
Upon deploying the nodes in different locations, I realised that while the k3s svclb (Service Load Balancer) pods were deployed to each node correctly, an HTTP GET request to the address of any node that didn't have a 'locally available' traefik pod running would eventually time out. I put this down to an issue in cluster communications, so I decided to set up cluster communications via the embedded k3s multicloud solution.
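For reference, the embedded k3s multicloud solution boils down to advertising each node's external IP and switching flannel to a WireGuard backend that tunnels over those external addresses. A minimal sketch of the server-side k3s config, assuming the node's public address is known (the IP below is a placeholder):
# /etc/rancher/k3s/config.yaml on the server node (sketch; IP is a placeholder)
node-external-ip: "203.0.113.10"
flannel-backend: "wireguard-native"
flannel-external-ip: true
# agents only need node-external-ip alongside their usual server/token settings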
This got the svclb pods responding correctly over HTTP ('404 page not found', traefik's default), but it revealed another issue: latency between svclb and traefik.
Latency between svclb and Traefik:
Network latency between pods scheduled on nodes in a location where the default k3s traefik pod was not scheduled caused significant flapping for the HAProxy instance, which made accurate health checks from HAProxy impossible in that arrangement.
The solution was to use the svclb node pool functionality, creating a node pool for each location, and to scope a new traefik deployment per location so that each pool has its own location-scoped traefik pods. Below is an example with 2 pools:
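As a rough sketch of that pool scoping (the label keys come from the k3s ServiceLB docs; the pool names, Service name, and pod selector below are hypothetical):
# Nodes are placed into pools with labels, normally applied via
#   kubectl label node <name> svccontroller.k3s.cattle.io/enablelb=true \
#     svccontroller.k3s.cattle.io/lbpool=location-a   # or location-b
# Each LoadBalancer Service then carries the matching pool label:
apiVersion: v1
kind: Service
metadata:
  name: traefik-location-a
  namespace: kube-system
  labels:
    svccontroller.k3s.cattle.io/lbpool: location-a
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: traefik
    location: location-a   # hypothetical label carried by the location-scoped traefik pods
  ports:
    - name: web
      port: 80
      targetPort: web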
While this setup finally allowed svclb to reach the pods in the different pools appropriately, I have had to increase the HAProxy check times to accommodate the latency.
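One way to get the location-scoped traefik pods themselves is to pin a traefik deployment to each pool's nodes with a nodeSelector; for the traefik instance that k3s packages, that can be done with a HelmChartConfig. A sketch, where the zone label is hypothetical and each additional location would get its own traefik release:
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: traefik
  namespace: kube-system
spec:
  valuesContent: |-
    deployment:
      replicas: 2
    nodeSelector:
      topology.kubernetes.io/zone: location-a   # hypothetical per-location node label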
The example above uses the following manifests and HAProxy config:
Codeserver-values.yaml:
image:
  repository: docker.io/codercom/code-server
  pullPolicy: IfNotPresent
  tag: 4.20.1@sha256:94e705de31ca2e6d0c28fec83f004c4159c037338c485ca0b60bba7e90bde8f9
securityContext:
  container:
    readOnlyRootFilesystem: false
    allowPrivilegeEscalation: true
    runAsNonRoot: false
    runAsUser: 0
    runAsGroup: 0
service:
  main:
    ports:
      main:
        port: 10063
        protocol: http
        targetPort: 8080
workload:
  main:
    enabled: true
    type: DaemonSet
    revisionHistoryLimit: 3
    strategy: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
    podSpec:
      containers:
        main:
          probes:
            liveness:
              type: http
              path: /
            readiness:
              type: http
              path: /
            startup:
              type: http
              path: /
          env:
            PROXY_DOMAIN: ""
          args:
            - --user-data-dir
            - "/config/.vscode"
            - --auth
            - none
persistence:
  config:
    enabled: false
    mountPath: /config
portal:
  open:
    enabled: true
codeserver-IngressRoute.yaml:
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: codeserver-ingressroute
  namespace: default
spec:
  entryPoints:
    - web
  routes:
    - match: Host(`codeserver.example.com`)
      kind: Rule
      services:
        - name: codeserver-code-server
          port: 10063
haproxy.cfg:
# Backend for Codeserver
backend codeserver_backend
    mode http
    balance roundrobin
    option httpchk HEAD / HTTP/1.1\r\nHost:codeserver.example.com
    http-request set-header Host %[req.hdr(Host)]
    http-request set-header X-Forwarded-Port %[dst_port]
    http-request add-header X-Forwarded-Proto https if { ssl_fc }
    option forwardfor
    # 'check' enables the httpchk health check above; inter/fall/rise can be raised to tolerate inter-location latency
    server agent1 agent1:80 check
    server agent2 agent2:80 check backup
    server agent3 agent3:80 check backup
While not strictly necessary, running replicas in each svclb node pool is ideal for high availability and reliability.
HAProxy stats:
A feature of HAProxy that I have found instrumental for observability of HAProxy's metrics, as well as the overall health of the load-balancing setup, is the HAProxy stats page. It can be enabled with the following example config:
frontend stats
    mode http
    bind *:8404
    stats enable
    stats uri /stats
    stats refresh 10s
    stats admin if LOCALHOST
An example of the above codeserver_backend taken from my HAProxy stats page is as follows:
As shown above, the stats page provides details on:
- Backend server uptime
- Last Health Check status and latency
- Total service downtime
- Queues, Session Rates, Time since last session.
Essentially everything you would need from your load balancer at a glance.
Note: if you wish to incorporate HAProxy metrics into your Prometheus-based monitoring solution, it is also possible to expose a Prometheus metrics endpoint: See here
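A minimal sketch of such an endpoint, assuming an HAProxy build with the built-in Prometheus exporter (2.0 and later); the port is a placeholder:
frontend prometheus
    mode http
    bind *:8405
    http-request use-service prometheus-exporter if { path /metrics }
    no log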
Final wrap-up:
This project began as a modest concept and has since expanded remarkably, significantly enhancing my understanding of various aspects involved in developing highly available distributed systems.
A major 'aha' moment: Throughout the project, a major issue for me was how to manage the scoping of applications and make them available via their Services (we can deploy them, but they won't all be local without specifically deploying a DaemonSet). As described in this post, the big 'aha' moment was the combination of 1. svclb node pools and 2. location-scoped traefik pods. Once I understood that topology, all the other solutions fell into place.
Next steps: The next steps for this project will be to continue incorporating ideas and tools that I learn about and that help improve the underlying reliability of the infrastructure; one idea is to build the same system using k0s in place of k3s. Something I am hopeful to incorporate in the future is the Traefik failover service for the Kubernetes provider; see the GitHub issue here.
Another interesting idea to try out could be to use the HAProxy Kubernetes Ingress Controller.
Resource list:
- k3s documentation: https://docs.k3s.io/
- Traefik documentation: https://doc.traefik.io/traefik/
- HAProxy blog: https://www.haproxy.com/blog
- Chaos-Mesh docs: https://chaos-mesh.org/docs/
- Tailscale docs: https://tailscale.com/kb
- Headscale documentation: https://headscale.net/
- Truecharts documentation: https://truecharts.org/manual/intro
Thank you for joining me through this journey in the 'Going Mobile' series. I hope you found these insights into distributed Kubernetes clusters both enlightening and practical. My goal with this blog is to delve deep into the realms of Self-hosting, DevOps, and Reliability Engineering, uncovering the latest trends, sharing compelling stories, and offering hands-on advice.
Finally, if you're on the brink of starting your own multi-location cluster or have been tinkering with k3s in your lab, share your story. What challenges are you facing? What successes have you celebrated? Your experiences enrich the community and pave the way for growth.
Feel free to reach out to me on LinkedIn or GitHub.
Explore the series:
This post is part of the 'Going Mobile' series. Explore the series further:
- Going Mobile: Part 1 - Introduction to the series
- Going Mobile: Part 2 - The design.
- Going Mobile: Part 3 - Understanding a change in the scope.
- Going Mobile: Part 4 - Going further into Networking, Monitoring and Chaos engineering
- You are here: Going Mobile: Part 5 - Covering some troubleshooting, wrapping it up, and the future.