On-Prem - Ingress High Availability

High Availability

If a vCPE HA Gateway is to act as a "default gateway", the pair will use VRRP to float a VIP (Virtual IP) between them. The VIP is used as the default route. The gateways function in active/standby mode. During normal operation, the primary hosts the VIP. If it fails, the secondary takes over until the primary is recovered.

Introduction

This guide walks you through a practical demonstration of configuring High Availability between customer-hosted edge routers.

Sample Architecture without HA

Onprem_ER_HA.drawio.png

For this demonstration, the network connects an ingress ER running on a VMware VM on-premises (assumed to be the headquarters, HQ) to an egress ER in Azure South India, which hosts the target application. The NetFoundry-hosted edge routers are provisioned in Azure and AWS in India.

The endpoint from headquarters is able to access the web application hosted at Azure. Currently, a route has been configured for the network 10.90.1.0/24, pointing to ingress ER IP 192.168.158.151 at HQ.

Sample Architecture with HA

Onprem_ER_with_HA.drawio.png

A second edge router, HQCustomerER2, has been deployed so that High Availability can be configured between the ingress ERs at HQ.

Step 1: Provision Customer-hosted Edge Routers (On-prem)

Customer self-hosted Edge Routers (CERs) act as egress routers for endpoints / other CERs to reach the services terminated on the CER endpoint.

Create and Register CERs in a private cloud

Use the deployment guides below to provision a customer-hosted Edge Router in a branch office or a private cloud.

https://support.netfoundry.io/hc/en-us/articles/5700949793293-Deployment-guides-for-provisioning-customer-edge-routers-in-a-private-cloud

Create and Register CERs on AWS / Azure / OCI / any Public Cloud

Use the deployment guides below to provision a customer-hosted Edge Router in your AWS / Azure / GCP / OCI environment.

https://support.netfoundry.io/hc/en-us/articles/5701001893133-Deployment-guides-for-provisioning-customer-edge-routers-in-public-clouds

Step 2: Configure the AppWAN

The AppWAN defines the services that one or more client endpoints can access.

To learn more about AppWANs, see the Create and Manage AppWAN article on the NetFoundry support hub.

Add the HQCustomerER2 endpoint to the existing AppWAN to give it access to the web application hosted on Azure.

Note: An endpoint/service/edge router attribute selects all endpoints/services/edge routers that carry that attribute. The @ symbol tags an individual endpoint/service/edge router, and the # symbol tags a group of endpoints/services/edge routers.

Note: Before configuring VRRP, ensure that proper connectivity exists between the participating Edge Routers. Verify that the necessary firewall rules are in place, both on the Edge Router’s UFW and on any on-premises firewalls in the path. Without the required firewall allowances, VRRP communication between Edge Routers may fail, leading to instability or loss of redundancy.
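As a rough illustration of the firewall allowance, the sketch below prints (rather than runs) a UFW rule that admits the peer's VRRP advertisements. It assumes the auth_type AH setting used in the Keepalived configuration in this guide, in which case the advertisements travel as IP protocol 51 (AH); with plain VRRP they would use IP protocol 112 instead. The function name vrrp_ufw_rule is hypothetical, and the rule should be checked against your UFW version before use.

```shell
#!/bin/sh
# vrrp_ufw_rule PEER_IP: print the UFW command that would allow AH-wrapped
# VRRP advertisements from the peer ER. Nothing is executed here.
# ASSUMPTION: auth_type AH is in use, so the packets are IP protocol "ah".
vrrp_ufw_rule() {
    peer_ip="$1"
    echo "sudo ufw allow from ${peer_ip} to any proto ah"
}

# On HQCustomerER1 the peer is HQCustomerER2 (192.168.158.152), and vice versa.
vrrp_ufw_rule 192.168.158.152
```

Review the printed command, run it on each ER with the other ER's address, and mirror the allowance on any on-premises firewall between the two.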

Step 3: Install and Configure Keepalived

Keepalived is a framework for high availability. It achieves failover through the Virtual Router Redundancy Protocol (VRRP), the basic building block for router failover. To learn more, see the Keepalived documentation.

Update and install Keepalived

Run the following on both ERs.

Use the command below to update the package index

    $ sudo apt-get update

Use the command below to install Keepalived

    $ sudo apt-get install keepalived

Configure IP forwarding and non-local binding

Apply the following on both ERs.

First switch to the root user using the command below

    $ sudo su -

So that the ER can forward the packets it receives on the VIP, enable IP forwarding. Run this command on both ERs. It takes effect only if /etc/sysctl.conf contains the commented-out line #net.ipv4.ip_forward=1.

    # sed -i 's/#net.ipv4.ip_forward=1/net.ipv4.ip_forward=1/' /etc/sysctl.conf

Similarly, allow Keepalived to bind to a non-local IP address, that is, to the failover IP address (floating IP).

    # echo "net.ipv4.ip_nonlocal_bind=1" >> /etc/sysctl.conf

Reload sysctl settings

    # sysctl -p
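Because the sed edit changes nothing when the commented-out line is missing, it can be worth confirming the result. A minimal sketch, assuming a hypothetical helper named sysctl_enabled that scans a sysctl-style file and reports whether a key is actively set to 1 (the last uncommented entry for the key decides):

```shell
#!/bin/sh
# sysctl_enabled KEY FILE: succeed if FILE's last uncommented entry for KEY is 1.
sysctl_enabled() {
    awk -F= -v key="$1" '
        { gsub(/[ \t]/, "") }                  # ignore spaces around "="
        $1 == key { enabled = ($2 == "1") }    # commented lines start with "#", so they never match
        END { exit enabled ? 0 : 1 }
    ' "$2"
}

# Usage (after running the commands above):
#   sysctl_enabled net.ipv4.ip_forward /etc/sysctl.conf && echo "forwarding configured"
#   sysctl_enabled net.ipv4.ip_nonlocal_bind /etc/sysctl.conf && echo "non-local bind configured"
```

If either check fails, add the missing key=1 line to /etc/sysctl.conf by hand and reload with sysctl -p.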

Configure Keepalived

Check the network interface name and IP address on both ERs that will be part of the HA setup

    $ ip a

Edge Router	Interface	IP Address
HQCustomerER1	enp3s0	192.168.158.151
HQCustomerER2	enp3s0	192.168.158.152

Keepalived reads its configuration from /etc/keepalived/keepalived.conf.

This file is not created by default, so create it with the content shown below for each ER.

Exit the root shell first

    # exit

Configure HQCustomerER1

Create a new conf file

    $ sudo nano /etc/keepalived/keepalived.conf

Add the following configuration

vrrp_instance VI_1 {
    interface enp3s0
    state MASTER
    virtual_router_id 51
    priority 101
    advert_int 2
    unicast_src_ip 192.168.158.151 # IP of this device
    unicast_peer {
        192.168.158.152 # IP of the peer device
    }
    authentication {
        auth_type AH
        auth_pass N3tF0undry # must match on both ERs; only the first 8 characters are used
    }
    virtual_ipaddress {
        192.168.158.150 dev enp3s0 label enp3s0:vip
    }
}

Configure HQCustomerER2

Create a new conf file

    $ sudo nano /etc/keepalived/keepalived.conf

Add the following configuration

vrrp_instance VI_1 {
    interface enp3s0
    state BACKUP
    virtual_router_id 51
    priority 100
    advert_int 2
    unicast_src_ip 192.168.158.152 # IP of this device
    unicast_peer {
        192.168.158.151 # IP of the peer device
    }
    authentication {
        auth_type AH
        auth_pass N3tF0undry # must match on both ERs; only the first 8 characters are used
    }
    virtual_ipaddress {
        192.168.158.150 dev enp3s0 label enp3s0:vip
    }
}

Start Keepalived

Start and enable Keepalived on both ERs.

Start Keepalived using the command below

    $ sudo service keepalived start

Enable it to run at system boot

    $ sudo systemctl enable --now keepalived

Use the command below to check the status

    $ sudo systemctl status keepalived

HQCustomerER1 keepalived status

HQCustomerER2 keepalived status

Check Virtual IPs

By default, the virtual IP is assigned to the master ER. If the master goes down, the VIP is automatically reassigned to the backup ER.

Use the following command to show the assigned virtual IP on the interface.

    $ ip addr show enp3s0

HQCustomerER1 assigned VIP

HQCustomerER2 assigned VIP
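The same check can be scripted. The sketch below assumes a hypothetical helper named holds_vip that reads the one-line-per-address output of ip -o -4 addr show on standard input and succeeds only when the given address is listed.

```shell
#!/bin/sh
# holds_vip VIP: read `ip -o -4 addr show` output on stdin and succeed
# if VIP is one of the listed addresses.
holds_vip() {
    awk -v vip="$1" '
        {
            for (i = 1; i < NF; i++) {
                if ($i == "inet") {
                    split($(i + 1), addr, "/")   # drop the prefix length
                    if (addr[1] == vip) found = 1
                }
            }
        }
        END { exit found ? 0 : 1 }
    '
}

# Usage on either ER:
#   ip -o -4 addr show dev enp3s0 | holds_vip 192.168.158.150 && echo "this ER holds the VIP"
```

During normal operation the check should succeed on HQCustomerER1 only.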

Step 4: Change Route

Currently, a route has been configured for the network 10.90.1.0/24, pointing to ingress ER IP 192.168.158.151 at HQ. You need to change the route for the network 10.90.1.0/24 pointing to VIP 192.168.158.150 of ingress ERs.

Use the command below to delete the route pointing to ER1

    >route delete 10.90.1.0

Add a persistent route to the VIP using the command below

    >route -p add 10.90.1.0 mask 255.255.255.0 192.168.158.150

Use the command below to verify the route

    >route print

Note: In this demonstration the route is configured on the local PC. For larger networks, change the route on the gateways, routers, or L3 switches, depending on the network architecture.

Step 5: Validate

Validate Failover

Shut down the interface on the master, ER1, and check that the VIP is automatically assigned to the backup, ER2.

Use the command below to shut down the ER1 interface. Run it from the VM console, because an SSH session over this interface will drop.

    $ sudo ip link set enp3s0 down

To check the status of the ER2 interface, use the command below

    $ ip addr show enp3s0

You can see that the VIP is automatically assigned to the backup, ER2.
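A rough estimate of the takeover delay can be worked out from the configuration. This sketch assumes Keepalived follows VRRPv2 as specified in RFC 3768, where the backup declares the master down after 3 x advert_int plus a skew of (256 - priority) / 256 seconds. The helper name master_down_interval is illustrative.

```shell
#!/bin/sh
# master_down_interval ADVERT_INT BACKUP_PRIORITY: print the VRRPv2
# master-down interval in seconds (3 * advert_int + skew time).
master_down_interval() {
    awk -v adv="$1" -v prio="$2" 'BEGIN { printf "%.2f\n", 3 * adv + (256 - prio) / 256 }'
}

# Values from the HQCustomerER2 configuration: advert_int 2, priority 100.
master_down_interval 2 100
```

With the values from this guide this prints 6.61, so the VIP should appear on ER2 roughly seven seconds after ER1 stops advertising.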

Validate Service Accessibility

The application or server is accessed via a private hostname or address that is not reachable via the internet. The application is therefore dark to the outside world and reachable only within the NetFoundry network.

Ability to track loss of controller and/or fabric to trigger local switchover

This feature is available from OpenZiti v0.28.1 onward; the corresponding CloudZiti version is v7.3.91. Each router must have the health check endpoint enabled. A customer-hosted router registered with the NetFoundry-provided registration app has it enabled. To check, view the router config file.

ziggy@ziti-edge-router:~$ cat /opt/netfoundry/ziti/ziti-router/config.yml
v: 3
...

healthChecks:
  ctrlPingCheck:
    interval: 10s
    timeout: 15s
    initialDelay: 15s
  linkCheck:
    minLinks: 1
    interval: 10s
    initialDelay: 15s
...

Using this health check (HC) endpoint, you can detect whether a router has lost communication with the controller or the fabric and use that information to trigger a VRRP switchover.
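To make the idea concrete, here is a minimal sketch of what a tracking script could do with the endpoint's reply. It is not the implementation of erhchecker.pyz. It assumes the reply is JSON carrying "healthy": true/false fields and that the endpoint is reachable at the example URL in the usage comment; both are assumptions to verify against your router. Keepalived treats exit status 0 from a tracking script as success and any other status as failure.

```shell
#!/bin/sh
# payload_healthy: read a health-check reply on stdin; succeed only if it
# reports at least one passing check and no failing ones.
# ASSUMPTION: the reply is JSON containing "healthy": true/false fields.
payload_healthy() {
    body="$(cat)"
    [ -n "$body" ] || return 1      # empty reply: endpoint unreachable
    if printf '%s' "$body" | grep -Eq '"healthy"[[:space:]]*:[[:space:]]*false'; then
        return 1                    # some check failed
    fi
    printf '%s' "$body" | grep -Eq '"healthy"[[:space:]]*:[[:space:]]*true'
}

# Usage as a tracking script body (the URL is an assumed example):
#   curl -sk --max-time 5 https://127.0.0.1:8081/health-checks | payload_healthy
```

This text match is deliberately crude. It cannot delay the switchover to let sessions drain or ignore selected links, both of which the script described in this section does.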

Note: The recommendation is to align the tracking script interval with the health check interval values in the router configuration file shown above. Set the script interval slightly higher, so that at least one health check completes before VRRP runs the script again.

Here is a sample configuration that adds script tracking to the VRRP configuration.

Note: Best practice is to run the script as a non-root user. The script section must be added before the main instance configuration.

global_defs {
        script_user ziggy ziggy
        enable_script_security
}

vrrp_script wan_check {
        script "/home/ziggy/erhchecker.pyz" # path to script
        interval 12
        fall 3 # times to wait to failover to standby, default 3
        rise 6 # times to wait before clear the failure, default 2 x fall
        user ziggy ziggy

}

vrrp_instance VI_1 {

        state MASTER
        ...
        track_script {
                wan_check
        }
        ...
}
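With the sample values above, the time Keepalived needs to react can be estimated as interval x fall for a failure and interval x rise for a recovery. A small sketch of that arithmetic follows; the function name script_timings is illustrative, and the estimate ignores the script's own run time.

```shell
#!/bin/sh
# script_timings INTERVAL FALL RISE: estimate how long the tracked script
# must keep failing (or succeeding) before keepalived changes state.
script_timings() {
    interval="$1"; fall="$2"; rise="$3"
    echo "fail after $((interval * fall))s, recover after $((interval * rise))s"
}

# Values from the wan_check block: interval 12, fall 3, rise 6.
script_timings 12 3 6
```

For the sample block this prints "fail after 36s, recover after 72s", which is the trade-off to weigh when tuning fall and rise.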

Script Options

Options can be set as environment variables or passed on the command line. Using environment variables is recommended.

./erhchecker.pyz --help
usage: erhchecker.pyz [-h] [-c ROUTERCONFIGFILEPATH] [-t SWITCHTIMEOUT] [-r NOTFLAGROUTERSFILEPATH] [-l {INFO,ERROR,WARNING,DEBUG,CRITICAL}] [-f LOGFILE] [-v]

options:
  -h, --help show this help message and exit
  -c ROUTERCONFIGFILEPATH, --routerConfigFilePath ROUTERCONFIGFILEPATH Specify the edge router config file
  -t SWITCHTIMEOUT, --switchTimeout SWITCHTIMEOUT Time to pass to allow for sessions drainage
  -r NOTFLAGROUTERSFILEPATH, --noTFlagRoutersFilePath NOTFLAGROUTERSFILEPATH Specify yaml file containing list of router ids that have no-traversable flag set
  -l {INFO,ERROR,WARNING,DEBUG,CRITICAL}, --logLevel {INFO,ERROR,WARNING,DEBUG,CRITICAL} Set the logging level
  -f LOGFILE, --logFile LOGFILE Specify the log file
  -v, --version show program's version number and exit

Environment variable names with default values:

'ROUTER_CONFIG_FILE_PATH', '/opt/netfoundry/ziti/ziti-router/config.yml'
'SWITCH_TIMEOUT', 600
'NO_T_FLAG_ROUTERS_FILE_PATH', ""
'LOG_FILE', ""
'LOG_LEVEL', "INFO"
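The defaults behave like ordinary shell parameter defaults: a value set in the environment wins, otherwise the default applies. The sketch below only mirrors the names and defaults listed above to show the effective values; the function name print_effective_settings is hypothetical and this is not how erhchecker.pyz itself reads them.

```shell
#!/bin/sh
# print_effective_settings: show the value each option would take, using
# the environment when set and the documented default otherwise.
print_effective_settings() {
    echo "ROUTER_CONFIG_FILE_PATH=${ROUTER_CONFIG_FILE_PATH:-/opt/netfoundry/ziti/ziti-router/config.yml}"
    echo "SWITCH_TIMEOUT=${SWITCH_TIMEOUT:-600}"
    echo "NO_T_FLAG_ROUTERS_FILE_PATH=${NO_T_FLAG_ROUTERS_FILE_PATH:-}"
    echo "LOG_FILE=${LOG_FILE:-}"
    echo "LOG_LEVEL=${LOG_LEVEL:-INFO}"
}

print_effective_settings
```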

SWITCH_TIMEOUT delays the switchover to the protection router when the router has lost communication with the controller but at least one fabric link is still active. This gives the active sessions still terminated on the router more time to drain during the controller communication failure. The value is configurable, so you can decide how long those sessions are allowed to continue. Keep in mind that while the router cannot communicate with the controller, new sessions cannot be established.
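The decision described here can be sketched as a small function. This is an illustration of the rule as stated, not the script's actual code. The function name should_fail_over and its arguments are hypothetical, and the handling of the "no fabric links" case is an assumption.

```shell
#!/bin/sh
# should_fail_over CTRL_OK LINKS DOWN_FOR [TIMEOUT]
#   CTRL_OK   "yes" if the controller is reachable, anything else otherwise
#   LINKS     number of active fabric links
#   DOWN_FOR  seconds since the controller was last reachable
#   TIMEOUT   drain window in seconds (defaults to SWITCH_TIMEOUT, else 600)
# Succeeds (exit 0) when the router should hand over to its peer.
should_fail_over() {
    ctrl_ok="$1"; links="$2"; down_for="$3"
    timeout="${4:-${SWITCH_TIMEOUT:-600}}"
    # ASSUMPTION: with no fabric links there is nothing left to drain.
    [ "$links" -ge 1 ] || return 0
    # Controller and fabric both fine: stay active.
    [ "$ctrl_ok" != "yes" ] || return 1
    # Controller lost but links remain: hold on until the drain window expires.
    [ "$down_for" -ge "$timeout" ]
}

# Example: controller lost 30 s ago, one link still up, 120 s drain window.
if should_fail_over no 1 30 120; then echo "switch over"; else echo "keep draining"; fi
```

A shorter window moves traffic to the peer sooner at the cost of cutting off sessions that were still working.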

NO_T_FLAG_ROUTERS_FILE_PATH passes the script the IDs of routers that have the no-traversable flag set, e.g. the NC router used for Salt communication. If the script finds links associated with these routers, it excludes them from the link health lookup. The link to the router on the controller is excluded by comparing its IP address, provided as part of the health check details, with the controller IP/FQDN given in the router configuration file.

The file content must use the following YAML syntax:

$ cat router_file.yml
routerIds:
 - { router id 1}   # e.g. falWSf-KgH
 - { router id 2}
...

Enable environment variables with VRRP

Add the environment variables to the file /etc/default/keepalived and restart the keepalived service.

Note: The log file must be created in advance and be writable by at least the user that runs the script, as configured in the VRRP configuration.

$ sudo cat /etc/default/keepalived
# Options to pass to keepalived
NO_T_FLAG_ROUTERS_FILE_PATH="/home/ziggy/router_file.yml"
LOG_FILE="/var/log/cloudziti_healthchecks.log"
LOG_LEVEL="DEBUG"
SWITCH_TIMEOUT=120
# DAEMON_ARGS are appended to the keepalived command-line
DAEMON_ARGS="-D"

$ sudo systemctl restart keepalived.service
