VyOS 1.2.0 development news in July

Despite the slow news season and the RAID incident that luckily slowed us down only for a couple of days, I think we've made good progress in July.

First, Kim Hagen got cloud-init to work, even though it didn't make it to the mainline image, and WAAgent required for Azure is not working yet. Some more work, and VyOS will get a much wider cloud platform support. He's also working on Wireguard integration and it's expected to be merged into current soon.

The new VRRP CLI and IPv6 support is another big change, but it's got its own blog post, so I won't stop there and cover things that did not get their own blog posts instead.

IPsec and VTI

While I regard VTI as the most leaky abstraction ever created and always suggest using honest GRE/IPsec instead, I know many people don't really have any choice because their partners or service providers are using it. In older StrongSWAN versions it used to just work.

Updating StrongSWAN to the latest version had an unforeseen and very unpleasant side effect: VTI tunnels stopped working. A workaround in form of "install_routes = no" in /etc/strongswan.d/charon.conf was discovered, but it has an equally bad side effect: site to site tunnels stop working when it's applied.

The root cause of the problem is that for VTI tunnels to work, their traffic selectors have to be set to 0.0.0.0/0 for traffic to match the tunnel, even though actual routing decision is made according to netfilter marks. Unless route insertion is disabled entirely, StrongSWAN thus mistakenly inserts a default route through the VTI peer address, which makes all traffic routed to nowhere.

This is a hard problem without a workaround that is easy and effective. It's an architectural problem in the new StrongSWAN, according to our investigation of its source code and its developer responses, there is simply no way to control route insertion per peer. One developer responded to it with "why, site to site and VTI tunnels are never used on the same machine anyway" — yeah, people are reporting bugs just out of curiosity.

While there is no clean solution within StrongSWAN, this definitely has been a blocker for the release candidate. Reimplementing route insertion with an up/down script proved to be a hard problem since there are lots of cases to handle and complete information about the intended SA may not always be available to scripts. Switching to another IKE implementation seems like an attractive option, but needs a serious evaluation of the alternatives, and a complete rewrite of the IPsec config scripts — which is planned, but will take a while because the legacy scripts is an unmaintainable mess.

I think I've found a workable (even if far from perfect workaround) — instead of inserting missing routes, delete the bad routes. I've made a test setup and it seems to work reasonably well. The obvious issue is that it doesn't prevent bad things from happening, but rather undoes the damage, so there may still be a brief traffic disruption when VTI tunnels go up. Another problem is a possible race condition between StrongSWAN inserting routes and the script deleting them, though I haven't seen it in practice yet and I hope it doesn't exist. But, at least you can now use both VTI and site to site tunnels on the same machine.

For people who want to use VTI exclusively, there is now "set vpn ipsec options disable-route-autoinstall" option that disables route insertion globally, thus removing the possible disruption, at cost of making site to site tunnels impossible to use. That option is disabled by default.

I hope it will be good enough until we find a better solution. Your testing is needed to confirm that it is!

On "run reset vrrp master"

I've just exorcised a ghost of the old VRRP CLI implementation — the "reset vrrp master" command. I thought it would go away with the vyatta-vrrp package, but in fact it was in vyatta-op. It made me remember that I was going to write about it in the original blog post, but somehow I forgot about it.

Here is why that command was not reimplemented. First, it never worked with preempt to begin with, and with preempt being the default, its usefulness was already limited.

A more serious reason, however, is that it was a rather horrible (even if ingenious) kludge. This is how it worked: first it tried to locate the VRRP group in keepalived.conf, then it would remove it from the config, restart keepalived, insert it again, and restart keepalived again. It sort of worked, but you can see how fragile this approach is. If anything at any stage would go wrong, it would leave VRRP in an inconsistent state.

A much cleaner and general way to do it is to just disable the VRRP group in conf mode (set high-availability vrrp group Foo disable) and commit the change.



New VRRP CLI is here (with IPv6 support)

Ever since I started with Vyatta, I've had a problem with commands for features unrelated to interfaces being defined inside interfaces. I'm sure the person who came up with that arrangement meant well and thought it would be familiar for Cisco and Juniper users, but the more I lived with it, the more I thought it creates more problems than it solves.

From the user perspective, it's hard to easily view the complete configuration of those features. It's also much harder to clone a feature config to another machine. And if you ever want to move some connection to a different NIC, things get even more fun.

For developers, however, it's even worse. First, it means commands for those features needs to be duplicated for every interface type, which makes adding new interfaces much harder. Second, configuration scripts end up more complex due to paths that can be nested quite deeply. Third, with the current config backend specifically, lack of nested end nodes can lead to very interesting tracking of the state to avoid repeated service restarts.

Until recently there was a token excuse for leaving unfortunate UI decisions alone — the difficulty of writing migration scripts. Luckily, it's no longer the case, so we can start cleaning it up. Ok, it is hard and you need to take care of many details, but at least you are not wrestling with a library that is simply inadequate for the task. Now we can go on a quest to remove excessive nesting and redesign the UI is an easier to use, more logical fashion.

VRRP looked a good feature to start the clean up with — we need to get  IPv6 VRRP support to work in the end, its scripts have accumulated quite some cruft, and, well, it really has nothing to do with interface settings since it's a protocol of its own implemented by a userspace daemon.

Today I've rolled out the new implementation and it is already in the latest rolling release image, ready for your testing. Let's walk through the changes.

Making first boot scripts just got easier (but building vyos-1x got a bit harder)

As you probably know already, we are working on integrating cloud-init into VyOS, which will allow us to support multiple cloud platforms, and get rid of the custom script for EC2. The hard part of this project is that just allowing cloud-init to do what it normally does in Debian would not produce desired results, we need to make it modify the config.

This raises a question when this should occur and how it should be done. Since modifying running config with scripts has its difficulties in the current backend, and even if it didn't, it still could potentially clash with user's commits, we thought we may want to modify the config.boot file before it's loaded instead.

One advantage is that once we have common functionality implemented, it can be reused not only in cloud-init, but also in the installer, and in custom first boot scripts if someone wants them.

To test this concept, I've added a library names vyos.initialsetup that includes a collection of functions for common settings such as user passwords and keys, host name, default route, name servers, and interface addresses.

Here's an example of a script you can run on your system for demonstration (adjust user name and do ssh-keygen if necessary):

#!/usr/bin/env python3

import vyos.configtree as vct
import vyos.initialsetup as vis

with open('/opt/vyatta/etc/config.boot.default') as f:
    config_string = f.read()

with open('/home/dmbaturin/.ssh/id_rsa.pub') as f:
    key_string = f.read()

config = vct.ConfigTree(config_string)

vis.set_user_password(config, 'vyos', 'qwerty')
vis.set_user_ssh_key(config, 'vyos', key_string)

# Default level is admin
vis.create_user(config, 'dmbaturin', password=None, key=key_string)

# Default type is ethernet
vis.set_interface_address(config, 'eth0', '192.0.2.10/24')

vis.set_default_gateway(config, '192.0.2.1')

vis.set_name_servers(config, ['203.0.113.10', '203.0.113.20'])

vis.set_host_name(config, 'vyos-test')

print(str(config))

The script will print a customized config based on the default config.

Building vyos-1x

This is the good thing. The bad, or rather somewhat inconvenient thing is that vyos-1x package build now depends on the libvyosconfig0 package that provides the library behind the vyos.configtree module, and it's essential for running unit tests for those modules.

You should add the "deb http://dev.packages.vyos.net/repositories/current/vyos/ current main" repository to the sources.list on your build machine and install libvyosconfig0 with APT, or simply take the file from the repo and install it by hand with dpkg.

I hope the increased reliability we gain from those unit test outweighs the inconvenience of additional setup.


Infrastructure failure resolved, downloads.vyos.io is back now

We have managed to resolve the infrastructure problems and bring the affected machines back intact. Now downloads.vyos.io and ci.vyos.net are back online, and our build hosts with all the build dependencies setup and uncommited code are back online too, so luckily we can resume the work from the point where it was stopped by the RAID issue.

We are not ready to give a complete post mortem on the actual issue yet (may not even be able to at all) since we have limited data and the service provider support was not exactly helpful. What we can say now is that what was thought to be RAID5 actually was a RAID1 with an additional hot spare drive, and the failure mode included a RAID controller glitch apart from drive failure. The hot spare drive is what allowed us to bring the data back intact.

While this issue was resolved without any data loss, it definitely prompts us to reconsider many things about our infrastructure, including backup strategy, deployment mechanisms, and service provider choice.

We are going to write new posts as we decide upon the options and roll out improvements. Infrastructure deployment is also a good area for contributions, and some people already offered help.

Infrastructure failure

Hi everyone,

We've had a catastrophic failure on one of our hosts: two drives in a RAID5 failed simultaneously. A number of VMs related to the development infrastructure are permanently lost and need to be restored since we didn't have backups of them.

The only piece public infrastructure affected by it is downloads.vyos.io. You can use the old packages.vyos.net server or one of the mirrors to download the stable (1.1.8) or historical VyOS releases.

The other pieces of infrastructure that need to be restored are the Jenkins server, the build machines, and the repositories server. We'll have to restore them before we can continue development. I hope we'll get it done by Monday.

Building VyOS images with custom packages just got simpler

While the new build scripts first introduced when we migrated the development branch to jessie made things much simpler for developers and for people who just want to build the latest VyOS image from source, building an image even simply with a package available from Debian Jessie repos but not present in the VyOS package set by default was still quite an ordeal for a person not familiar with live-build and the structure of our build scripts.

Well, until now. Yesterday I've added ./configure script options that should allow everyone to build a custom image without ever touching the plumbing of the build scripts.

The simplest example, building an image with packages available in Debian Jessie:

./configure --custom-packages "bsdgames robotfindskitten"
sudo make iso

A more interesting example, adding a package from a third party repo signed with its own key. In this case, salt-minion:

wget https://repo.saltstack.com/apt/debian/8/amd64/2017.7/SALTSTACK-GPG-KEY.pub
./configure --custom-apt-entry "deb http://repo.saltstack.com/apt/debian/8/amd64/2017.7 jessie main" --custom-packages "salt-minion" --custom-apt-key ./SALTSTACK-GPG-KEY.pub
sudo make iso

Of course it doesn't guarantee that your image will build or work, but at least it will get you to the debugging phase faster,

VyOS 1.2.0 development news in June

If we haven't written any new blog posts in a while, that's because we've all been busy with development. Here is what happened since the last status update.

RADIUS authentication and authorization

We've had nominal support for RADIUS authentication for system login for a long time, but it was essentially useless because it required that the user account to already exist in VyOS to actually work, it was just a password checking method.

Kim Hagen found a way to implement real support for it and it appears to work fine (but of course needs testing). Apart from authentication, his implementation also supports authorization through the priv-lvl attribute. Privilege level 15 is mapped to admin users, anything below it is operator. This seems to be the most common way to do that, after Cisco.

Correct naming of PPTP and L2TP interfaces works again

The idea to switch to the upstream PPPD turned out to be premature — while it does have support for pre-up scripts, it still makes an assumption that the interface name is not changed by pre-up script, so remote access VPN interfaces were named pppX, which broke the "run show vpn remote-access" commands, and would break zone-based firewalls of people who rely on that naming.

We've re-applied the old patches for it to the latest upstream PPPD and opened pull requests for them, so if they are merged, we can finally stop building our own, and if not, we have clean patches that should be easy to apply to future releases.

Writing migration scripts (and manipulating VyOS config files outside VyOS) just got easier

Long story short

VyOS 1.2.0-rolling (starting with the next nightly build) includes a library for parsing and manipulating config files without loading them into the system config. It can be used for automatically converting configs from old versions in case an incompatible change was made, and for standalone utilities. Motivation and history are discussed below.

Here is an example of interacting with the new library:
>>> from vyos import configtree

>>> c = configtree.ConfigTree("system { host-name vyos \n } interfaces { dummy dum0 { address 192.0.2.1/24 \n address 192.0.2.20/24 \n disable \n } }  /* version: 1.2.0 */")

>>> print(c.to_string())
system {
    host-name vyos
}
interfaces {
    dummy dum0 {
        address 192.0.2.1/24
        address 192.0.2.20/24
        disable { }
    }
}

 /* version: 1.2.0 */

>>> c.set(['interfaces', 'dummy', 'dum0', 'address'], value='293.0.113.3/32', replace=False)

>>> c.delete_value(['interfaces', 'dummy', 'dum0', 'address'], '192.0.2.1/24')

>>> c.delete(['interfaces', 'dummy', 'dum0', 'disable'])

>>> c.is_tag(['interfaces', 'dummy'])
True

>>> c.exists(['interfaces', 'dummy', 'dum0', 'disable'])
False

>>> c.list_nodes(['interfaces', 'dummy'])
['dum0']

>>> print(c.to_string())
system {
    host-name vyos
}
interfaces {
    dummy dum0 {
        address 192.0.2.20/24
        address 293.0.113.3/32
    }
}

 /* version: 1.2.0 */

As you can see, it largely mimics the API you get for the running config. The only notable differences are that the "set" method requires that you specify the path and the value separately, and to have nodes formatted as tag nodes (i.e. "ethernet eth0 { ..." as opposed to "ethernet { eth0 { ..." you need to mark them as such with "set_tag", unless they were originally formatted that way in the config you parsed.

Things the new style configuration mode definitions intentionally do not support

I've made three important changes to the design of the configuration command definitions, and later I realized that I never wrote down a complete explanation of the changes and the motivation behind them.

So, let's make it clear: these changes are intentional and they shouldn't be reintroduced. Here's the details:

The "type" option

In the old definitions, you can see the "type:" option in the node.def files very often. In the new style XML definitions, there's no equivalent of it, and the type is always set to "txt" in autogenerated node.def's for tag and leaf nodes (which means "anything" to the configuration backend).

I always felt that the "type" option suffers from two problems: scope creep and redundancy.

The scope creep is in the fact that "type" was used for both value validation and generating completion help in "val_help:" option. Also, the "u32" type (32-bit unsigned integer) has a little known undocumented feature: it could be used for range validation in form of "type: u32:$lower-$upper" (e.g. u32:0-65535). It has never been used consistently even by the original Vyatta Core authors, plenty of node.def's use additional validation statements instead.

Now to the redundancy: there are two parallel mechanisms for validations in the old style definitions. Or three, depending on the way you count them. There are "syntax:expression:" statements that are used for validating values at set time, and "commit:expression:" that are checked at commit time.

My feeling from working with the system for scary amount of time was that the "type" option alone is almost never suffucient, and thus useless, since actual, detailed validation is almost always done elsewhere, in those "syntax/commit:expression:" statements or in the configuration scripts. Sometimes a "commit:expression:" is used where "syntax:expression:" would be more appropriate (i.e. validation is delayed) but let's focus on set-time validation only.

But without data to back it up, a feeling is just a feeling, so I made up a quick and dirty script to do some analysis. You can repeat what I've done easily with "find /opt/vyatta/share/vyatta-cfg/templates/ -type f -maxdepth 100 -name 'node.def' | nodecheck.py".

On VyOS 1.1.8 (which doesn't include any rewritten code) the output is:
Has type: 2737
Has type txt: 1387
Has type other than txt: 1350
Has commit or syntax expression: 1700
Has commit or syntax expression and type txt: 740
Has commit or syntax expression and type other than txt: 960

While irrelevant to the problem on hand, the total count of node.def's is 4293). In other words, of all nodes that have the type option, 50% have it set to "txt". Some of them are genuinely "anything goes" nodes such as "description" options, but most use it as a placeholder.

68% of all nodes that have a type are also using either "syntax:expression:" or "commit:expression:". Of all nodes that have a type more specific than "txt", 73% have additional validation. This means that even for supposedly specific types, type alone is enough only in 23% cases. This raises the question whether we need types at all.

Sure, we could introduce more types and add support for something of a sum type, but is it worth the trouble if validation can be easily delegated to external scripts? Besides, right now types are built in the config backend, which means adding a new one requires modifying it starting from the node.def parser.

In the new style definitions, I felt like the only special case that is special enough is regular expression. This is how value constraints checked at set time are defined:

<leafNode name="foo">
  <properties>
    <constraint>
      <regex>(baz|baz)</regex>
      <validator>quux</validator>
    </constraint>
  </properties>
<leafNode>

Here the "validator" tag contains a reference to a script stored in /usr/libexec/vyos/validators/. Since adding a new validator is easy, there's no reason to hesitate to add new ones for common (and even not so common) cases. Note that "regex" option is automatically wrapped in ^$, so there's no need to do it by hand every time.

Default values

The old definitions used to support "default:" option for setting default values for nodes. It looks innocous on the surface, but things get complicated when you look deeper into its behaviour.

You may think a node either exists, or it does not. What is the value of a node that doesn't exist? Sounds rather like a Zen koan, but here's cheap enlightenment for you: it depends on whether it has a default value or not.

Thus, nodes effectively have three states: "doesn't exist", "exists", and "default". As you can already guess, it's hard to tell the latter two apart. It's also very hard to see if a node was deleted from a config or just reset to a default value. It also means that every node lookup cannot operate on the running config tree alone and has to consult the command definitions as well, which is very bad if you plan to use the same code for the CLI and for standalone config handling programs such as migration scripts.

Last time people tried to introduce rollback without reboot, the difficulties of handling the third virtual "default" state was one of the biggest problems, and it's still one of the reasons we don't have a real rollback. VyConf has no support for default values for this reason, so we should eliminate them as we rewrite the code.

Defaults should be handled by config scripts now. Sure, we lose "show -all" and the ability to view defaults, but the complications that come with it hardly make it worth the trouble. There are also many implicit defaults that come from underlying software options anyway.

Embedded shell scripts

That's just a big "no". Have you ever tried to debug code that is spread across multiple node.def's in nested directories and that cannot be executed separately or stepped through?

While it's tempting to allow that for "trivial" scripts, the code tends to grow and things get ugly. Look the the implementation of PPPoE or tunnel interfaces in VyOS 1.1.8.

If it's more than one command, make it an external script, and you'll never regret the decision when it begins to grow.