On-premises load balancing with BGP

When it comes to exposing a Kubernetes service to external clients, you have several options to choose from. Two commonly used methods are NodePort and LoadBalancer. A NodePort service simply exposes the service on each node’s IP at a static port. The LoadBalancer service type is a Kubernetes abstraction that provisions a network load balancer and exposes your service through a single floating IP (VIP).
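
For reference, a minimal service of type LoadBalancer looks like the following. The name, selector, and ports are placeholders for illustration:

apiVersion: v1
kind: Service
metadata:
  name: my-app # placeholder name
spec:
  type: LoadBalancer # with NodePort instead, the service is exposed on a static port on every node
  selector:
    app: my-app # assumes your pods carry this label
  ports:
    - port: 80 # port exposed on the load balancer VIP
      targetPort: 8080 # port the pods listen on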

Public cloud providers typically offer managed load balancer services and automatically provision a load balancer for you when you create a service of type LoadBalancer. On-premises clusters, however, have no such built-in service to implement Kubernetes’s LoadBalancer service type, so you have to use a third-party component to provide this functionality. If you’re not running on a supported IaaS platform (GCP, AWS, Azure…), services of type LoadBalancer will remain in the “pending” state indefinitely when created.
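
On such a cluster, a freshly created LoadBalancer service typically looks like this (illustrative output; the service name is a placeholder):

$ kubectl get service my-app
NAME     TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)        AGE
my-app   LoadBalancer   10.96.45.12   <pending>     80:31234/TCP   3m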

Bare-metal cluster operators are left with two lesser tools to bring user traffic into their clusters: “NodePort” and “externalIPs” services. Both of these options have significant downsides for production use, which makes bare-metal clusters second-class citizens in the Kubernetes ecosystem.

One popular choice to solve this issue is the MetalLB project, which aims to redress this imbalance by offering a network load balancer implementation that integrates with standard network equipment, so that external services on bare-metal clusters also “just work” as much as possible. MetalLB can operate in either Layer 2 (ARP) mode or BGP mode. In Layer 2 mode, MetalLB responds to ARP requests for the service IP and sends traffic to the backend pods using standard routing. In BGP mode, MetalLB establishes BGP peering with the network equipment and advertises the service IP, and the network equipment then routes traffic to the backend pods.

Layer 2 mode is simpler to set up, but it has its own limitations and drawbacks. Namely, all traffic for a service IP enters the cluster through a single node at a time, which can become a bottleneck and lead to suboptimal routing and performance.

BGP mode is more complex to set up, but it offers more flexibility and scalability. In BGP mode, the service IP is announced from all nodes that have a pod for the service, and the network equipment can route traffic to multiple nodes at the same time. This can lead to better performance and more efficient use of resources.

If you are experienced with MetalLB, you can install it on your CFKE cluster and it will work as expected. However, if you intend to use BGP mode and have no specific reason to stick with MetalLB, we recommend using Cloudfleet’s Cilium-based BGP advertisement feature instead, which Cloudfleet officially supports.

Cloudfleet uses Cilium as the Container Network Interface (CNI) plugin for Kubernetes (see Network Architecture). Cilium supports BGP peering with network equipment as an out-of-the-box feature, which makes it a direct replacement for MetalLB’s BGP mode. Moreover, you can configure Cilium to advertise Pod IPs to the router as well, exposing the Pod network directly to the external network. This is useful when external clients need to connect to pods directly.

BGP announcement configuration

The following is a brief guide on how to configure BGP announcements in Cloudfleet. For a complete description of how the BGP announcement feature works, refer to the Cilium documentation.

First, you need to create an IP pool from which LoadBalancer IPs will be selected. You can use the following manifest as an example:

apiVersion: "cilium.io/v2alpha1"
kind: CiliumLoadBalancerIPPool
metadata:
  name: "example-pool"
spec:
  blocks:
    - start: 172.16.90.2 # Start of the IP range
      stop: 172.16.90.254 # End of the IP range

    - cidr: 172.16.91.0/24 # Alternative notation with CIDR
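
Apply the manifest and verify that the pool was accepted; the file name below is a placeholder:

$ kubectl apply -f lb-ip-pool.yaml
$ kubectl get ciliumloadbalancerippools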

Next, let us configure the BGP advertisement for the LoadBalancer services and, optionally, for the Pod network as well:

apiVersion: "cilium.io/v2alpha1"
kind: CiliumBGPAdvertisement
metadata:
  name: services
  labels:
    advertise: bgp # This label is used in the next step
spec:
  advertisements:
    - advertisementType: PodCIDR # This is optional and can be omitted if you don't want to advertise the Pod network

    - advertisementType: Service
      service:
        addresses:
          - LoadBalancerIP
      selector:
        matchExpressions:
          # To enable BGP advertisement for all LoadBalancer services, you can use the following expression
          # See https://docs.cilium.io/en/latest/network/bgp-control-plane/bgp-control-plane-v2/#multipool-ipam to learn why
          - { key: somekey, operator: NotIn, values: [ 'never-used-value' ] }

Here we have chosen to advertise all LoadBalancer services, but the configuration can be narrowed down to specific services. LoadBalancerIP refers to the IP address assigned to the service from the IP pool we created earlier, and the selector field determines which services are advertised.
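
If you prefer to advertise only selected services, you can match on a label of your own choosing instead. The label key and value below are hypothetical examples; you would add the same label to every service you want announced:

apiVersion: "cilium.io/v2alpha1"
kind: CiliumBGPAdvertisement
metadata:
  name: selected-services
  labels:
    advertise: bgp
spec:
  advertisements:
    - advertisementType: Service
      service:
        addresses:
          - LoadBalancerIP
      selector:
        matchLabels:
          bgp-announce: "true" # hypothetical label; apply it to the services to be advertised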

Next, we need to create the BGP peer configuration that references the advertisement we just defined. You can use the following manifest as an example:

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeerConfig
metadata:
  name: tor-rack-1
spec:
  gracefulRestart:
    enabled: true
    restartTimeSeconds: 15
  families:
    - afi: ipv4
      safi: unicast
      advertisements:
        matchLabels:
          advertise: "bgp"

See the CiliumBGPPeerConfig documentation for more details on the available configuration options. Note that the matchLabels field selects the advertisements to be used for this BGP session: the advertise: bgp label matches the advertisement we created in the previous step.
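
The families list can also carry more than one address family. For instance, if your routers peer over IPv6 as well, an additional entry could be added. This is a sketch under that assumption and is not required for the setup described here:

  families:
    - afi: ipv4
      safi: unicast
      advertisements:
        matchLabels:
          advertise: "bgp"
    - afi: ipv6 # assumes the router also peers over IPv6
      safi: unicast
      advertisements:
        matchLabels:
          advertise: "bgp"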

Finally, you configure the BGP peering itself for each router you have in your network. You can use the following manifest as an example:

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPClusterConfig
metadata:
  name: tor-rack-1
spec:
  nodeSelector:
    matchLabels:
      topology.kubernetes.io/region: berlin
      topology.kubernetes.io/zone: rack-1
  bgpInstances:
    - name: "cloudfleet"
      localASN: 65001 # The ASN you want to use for your cluster
      peers:
        - name: unifi
          peerASN: 65000 # The router's ASN
          peerAddress: "172.16.10.1" # The router's IP address
          peerConfigRef:
            name: tor-rack-1 # Reference to the peer configuration we created in the previous step

In this example, we have used the nodeSelector field to select the nodes that take part in this BGP peering. This is particularly important in the Cloudfleet context, as Cloudfleet allows nodes to live in different datacenters and clouds. Here we configure the advertisement for the nodes in the berlin region and the rack-1 zone, where our router is also located. If you have multiple datacenters, you can use different labels to create separate BGP configurations for each of them.
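
For example, a second datacenter could get its own cluster configuration with its own peers. The region, zone, ASNs, and addresses below are hypothetical values for illustration and assume a matching CiliumBGPPeerConfig named tor-rack-2 exists:

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPClusterConfig
metadata:
  name: tor-rack-2 # hypothetical second rack
spec:
  nodeSelector:
    matchLabels:
      topology.kubernetes.io/region: frankfurt # hypothetical region
      topology.kubernetes.io/zone: rack-2
  bgpInstances:
    - name: "cloudfleet"
      localASN: 65002 # hypothetical ASN for this location
      peers:
        - name: router-rack-2
          peerASN: 65010 # hypothetical router ASN
          peerAddress: "172.16.20.1" # hypothetical router IP
          peerConfigRef:
            name: tor-rack-2 # assumes a CiliumBGPPeerConfig with this name exists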

The bgpInstances field configures the BGP instances that run on the selected nodes, the peers field lists the BGP peers of each instance, and the peerConfigRef field references the BGP peer configuration we created in the previous step.

After you have applied all the resources to your cluster and configured the router, you can check the BGP state via the Cilium CLI. To see the peer connections, use:

$ cilium bgp peers

To see the advertisements, use:

$ cilium bgp routes

Make sure you confirm that the advertisements are correctly propagated on the router side as well.

Once everything is configured correctly, you can create a LoadBalancer service in your cluster, and an IP address from the IP pool you created earlier will be assigned to it. This IP address is advertised to the router, which then routes traffic to the correct nodes in the cluster.
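
For example, the external IP of the service should now come from the pool defined above. The output below is illustrative; names and addresses will differ in your cluster:

$ kubectl get service my-app
NAME     TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)        AGE
my-app   LoadBalancer   10.96.45.12   172.16.90.2   80:31234/TCP   1m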

Further reading