Prefix Mode for Windows¶
In Amazon EKS, each Pod that runs on a Windows host is assigned a secondary IP address by the VPC resource controller by default. This IP address is a VPC-routable address that is allocated from the host's subnet. On Linux, each ENI attached to the instance has multiple slots that can be populated by a secondary IP address or a /28 CIDR (a prefix). Windows hosts, however, only support a single ENI and its available slots. Using only secondary IP addresses can artifically limit the number of pods you can run on a Windows host, even when there is an abundance of IP addresses available for assignment.
In order to increase the pod density on Windows hosts, especially when using smaller instance types, you can enable Prefix Delegation for Windows nodes. When prefix delegation is enabled, /28 IPv4 prefixes are assigned to ENI slots rather than secondary IP addresses. Prefix delegation can be enabled by adding the
enable-windows-prefix-delegation: "true" entry to the
amazon-vpc-cni config map. This is the same config map where you need to set
enable-windows-ipam: "true" entry for enabling Windows support.
Please follow the instructions mentioned in the EKS user guide to enable Prefix Delegation mode for Windows nodes.
Figure: Comparison of Secondary IP mode with Prefix Delegation mode
The maximum number of IP addresses you can assign to a network interface depends on the instance type and its size. Each prefix assigned to a network interface consumes an available slot. For example, a
c5.large instance has a limit of
10 slots per network interface. The first slot on a network interface is always consumed by the interface's primary IP address, leaving you with 9 slots for prefixes and/or secondary IP addresses. If these slots are assigned prefixes, the node can support (9 * 16) 144 IP address whereas if they're assigned secondary IP addresses it can only support 9 IP addresses. See the documentation on IP addresses per network interface per instance type and assigning prefixes to network interfaces for further information.
During worker node initialization, the VPC Resource Controller assigns one or more prefixes to the primary ENI for faster pod startup by maintaining a warm pool of the IP addresses. The number of prefixes to be held in warm pool can be controlled by setting the following configuration parameters in
amazon-vpc-cni config map.
warm-prefix-target, the number of prefixes to be allocated in excess of current need.
warm-ip-target, the number of IP addresses to be allocated in excess of current need.
minimum-ip-target, the minimum number of IP addresses to be available at any time.
minimum-ip-targetif set will override
As more Pods are scheduled on the node, additional prefixes will be requested for the existing ENI. When a Pod is scheduled on the node, VPC Resource Controller would first try to assign an IPv4 address from the existing prefixes on the node. If that is not possible, then a new IPv4 prefix will be requested as long as the subnet has the required capacity.
Figure: Workflow during assignment of IPv4 address to the Pod
Use Prefix Delegation when¶
Use prefix delegation if you are experiencing Pod density issues on the worker nodes. To avoid errors, we recommend examining the subnets for contiguous block of addresses for /28 prefix before migrating to prefix mode. Please refer “Use Subnet Reservations to Avoid Subnet Fragmentation (IPv4)” section for Subnet reservation details.
By default, the
max-pods on Windows nodes is set to
110. For the vast majority of instance types, this should be sufficient. If you want to increase or decrease this limit, then add the following to the bootstrap command in your user data:
Avoid Prefix Delegation when¶
If your subnet is very fragmented and has insufficient available IP addresses to create /28 prefixes, avoid using prefix mode. The prefix attachment may fail if the subnet from which the prefix is produced is fragmented (a heavily used subnet with scattered secondary IP addresses). This problem may be avoided by creating a new subnet and reserving a prefix.
Configure parameters for prefix delegation to conserve IPv4 addresses¶
minimum-ip-target can be used to fine tune the behaviour of pre-scaling and dynamic scaling with prefixes. By default, the following values are used:
Use Subnet Reservations to Avoid Subnet Fragmentation (IPv4)¶
When EC2 allocates a /28 IPv4 prefix to an ENI, it has to be a contiguous block of IP addresses from your subnet. If the subnet that the prefix is generated from is fragmented (a highly used subnet with scattered secondary IP addresses), the prefix attachment may fail, and you will see the following node event:
It is recommended to create a new subnet, reserve space for prefixes, and enable prefix assignment for worker nodes running in that subnet. If the new subnet is dedicated only to Pods running in your EKS cluster with prefix delegation enabled, then you can skip the prefix reservation step.
Replace all nodes when migrating from Secondary IP mode to Prefix Delegation mode or vice versa¶
It is highly recommended that you create new node groups to increase the number of available IP addresses rather than doing rolling replacement of existing worker nodes.
When using self-managed node groups, the steps for transition would be:
- Increase the capacity in your cluster such that the new nodes would be able to accomodate your workloads
- Enable/Disable the Prefix Delegation feature for Windows
- Cordon and drain all the existing nodes to safely evict all of your existing Pods. To prevent service disruptions, we suggest implementing Pod Disruption Budgets on your production clusters for critical workloads.
- After you confirm the Pods are running, you can delete the old nodes and node groups. Pods on new nodes will be assigned an IPv4 address from a prefix assigned to the node ENI.
When using managed node groups, the steps for transition would be:
- Enable/Disable the Prefix Delegation feature for Windows
- Update the node group using the steps mentioned here. This performs similar steps as above but are managed by EKS.
Run all Pods on a node in the same mode
For Windows, we recommend that you avoid running Pods in both secondary IP mode and prefix delegation mode at the same time. Such a situation can arise when you migrate from secondary IP mode to prefix delegation mode or vice versa with running Windows workloads.
While this will not impact your running Pods, there can be inconsistency with respect to the node's IP address capacity. For example, consider that a t3.xlarge node which has 14 slots for secondary IPv4 addresses. If you are running 10 Pods, then 10 slots on the ENI will be consumed by secondary IP addresses. After you enable prefix delegation the capacity advertised to the kube-api server would be (14 slots * 16 ip addresses per prefix) 244 but the actual capacity at that moment would be (4 remaining slots * 16 addresses per prefix) 64. This inconsistency between the amount of capacity advertised and the actual amount of capacity (remaining slots) can cause issues if you run more Pods than there are IP addresses available for assignment.
That being said, you can use the migration strategy as described above to safely transition your Pods from secondary IP address to addresses obtained from prefixes. When toggling between the modes, the Pods will continue running normally and:
- When toggling from secondary IP mode to prefix delegation mode, the secondary IP addresses assigned to the running pods will not be released. Prefixes will be assigned to the free slots. Once a pod is terminated, the secondary IP and slot it was using will be released.
- When toggling from prefix delegation mode to secondary IP mode, a prefix will be released when all the IPs within its range are no longer allocated to pods. If any IP from the prefix is assigned to a pod then that prefix will be kept until the pods are terminated.
Debugging Issues with Prefix Delegation¶
You can use our debugging guide here to deep dive into the issue you are facing with prefix delegation on Windows.