VyOS 2.0 development digest #2

In the previous post we talked about the reasons for rewrite, design and implementation issues, and basic ideas. Now it's time to get to details. Today we'll mostly talk about command definitions, or rather interface definitions, since set commands is just one way to access the configuration interface.

Let's review the VyConf architecture (I included a few things in the diagram that we haven't discussed yet, ignore them for now):

At startup, VyConf will load the main config (or the fallback config, if that fails). But to know whether the config is valid, and to know what programs to call to actually configure the target applications, it needs additional data. We'll call that data "interface definitions" since it defines the configuration interface. Specifically, it defines:

  1. What config nodes (paths) are allowed (e.g. "interfaces ethernet", or "protocols ospf")
  2. What values are valid for that nodes (e.g. any IPv4 or IPv6 address for "system name-server")
  3. What script/program should be called when this or that part of the config is changed

The old way

Before we get to the new way, let's review the old way, the way it's one in the current VyOS implementation. In the current VyOS, those definitions are called "templates", no one remembers why.

This is a typical template file:
vyos@vyos# cat /opt/vyatta/share/vyatta-cfg/templates/interfaces/ethernet/node.tag/speed/node.def 
type: txt
help: Link speed
default: "auto"
syntax:expression: $VAR(@) in "auto", "10", "100", "1000", "2500", "10000"; "Speed must be auto, 10, 100, 1000, 2500, or 10000"
allowed: echo auto 10 100 1000 2500 10000

commit:expression: exec "\
	/opt/vyatta/sbin/vyatta-interfaces.pl --dev=$VAR(../@) \
	--check-speed $VAR(@) $VAR(../duplex/@)"

update: if [ ! -f /tmp/speed-duplex.$VAR(../@) ]; then
	   /opt/vyatta/sbin/vyatta-interfaces.pl --dev=$VAR(../@) \
	   	--speed-duplex $VAR(@) $VAR(../duplex/@)
	   touch /tmp/speed-duplex.$VAR(../@)
	fi

val_help: auto; Auto negotiation (default)
val_help: 10; 10 Mbit/sec
val_help: 100; 100 Mbit/sec
val_help: 1000; 1 Gbit/sec
val_help: 2500; 2.5 Gbit/sec
val_help: 10000; 10 Gbit/sec

We can spot a few issues with it already. First, the set of definitions is a huge directory tree where each directory represents a config node, e.g. "interfaces ethernet address" will be "interfaces/ethernet/node.tag/address/node.def". This makes them hard to navigate, you need to read a lot of small files to get the whole picture, and you need to do a lot of file/directory hopping to edit them. Now, try mass editing, or checking for mistakes before release...

Next, they use a custom syntax, which needs custom lexer and parser, and custom documentation (in practice, the source is your best bet, though I did write something of a syntax reference). No effortless parsing, no effortless analysis or transformation either.

Next, the value checking has its peculiarities. There is concept of type (it can be txt, u32, ipv4, ipv4net, ipv6, ipv6net, and macaddr), and there is "syntax:expression:" thing which partially duplicate each other. The types are hardcoded in the config backend and cannot be added without modifying it, even though they only define validation procedure and are not used in any other way. The "syntax:expression:" can be either "foo in bar baz quux", or "pattern $regex", or "exec $externalScript".

But, the "original sin" of those files is that they allow embedded shell scripts, as you can see. Mixing data with logic is rarely a good idea, and in this case it's especially annoying because a) you cannot test such code other than on a live system b) you have to read every single file in a package to get the complete picture, since any of them may have embedded shell.

Now to the new way.

VyOS 2.0 development digest #1

I keep talking about the future VyOS 2.0 and how we all should be doing it, but I guess my biggest mistake is not being public enough, and not being structured enough.

In the early days of VyOS, I used to post development updates, which no one would read or comment upon, so I gave up on it. Now that I think of it, I shouldn't have expected much as the size of the community was very small at the time, and there were hardly many people to read it in the first place, even though it was a critical time for the project, and input from the readers would have been very valuable.

Well, this is a critical time for the project too, and we need your input and your contributions more than ever, so I need to get to fixing my mistakes and try to make it easy for everyone to see what's going on and what we need help with.

Getting a steady stream of contributions is a very important goal. While the commercial support thing we are doing may let the maintainers focus on VyOS and ensure that things like security fixes and release builds get guaranteed attention in time, without occasional contributors who add things they personally need (while maintainers may not, I think myself I'm using maybe 30% of all VyOS features any often) the project will never realize its full potential, and may go stale.

But to make the project easy to manage and easy to contribute to, we need to solve multiple hard problems. It can be hard to get oneself to do things that promise no immediate returns, but if you looks at it the other way, we have a chance to build a system of our dreams together. As of 1.1.x and 1.2.x (the jessie branch), we'll figure it out how to maintain it until we solve those problems, but that's for another post. Right now we are talking about VyOS 2.0, which gets to be a cleanroom rewrite.

Why VyOS isn't as good as it could be, and can't be improved

I considered using "Why VyOS sucks" to catch reader's attention. It's a harsh word, and it may not be all that true, given that VyOS in its current state is way ahead of many other systems that don't even have system-wide config consistency checks, or revisions, or safe upgrades, but there are multiple problems that are so fundamental that they are impossible to fix without rewriting at least a very large part of the code.

I'll state the design problems that cannot be fixed in the current system. They affect both end users and contributors, sometimes indirectly, but very seriously.

Design problem #1: partial commits

You've seen it. You commit, there's an error somewhere, and one part of the config is applied, while the other isn't. Most of the time it's just a nuisance, you fix the issue and commit again, but if you, say, change interface address and firewall rule that is supposed to allow SSH to it, you can get locked out of your system.

The worst case, however, is when commit fails at boot. While it's good to have SSH at least, debugging it can be very frustrating, when something doesn't work, and you have no idea why, until you inspect the running config and see that something is simply missing (if you run into it in VyOS 1.x, do "load /config/config.boot" and commit, this will either work or show you why it failed). It's made worse by lack of notifications about config load failure for remote users, you can only see that error on the console.

The feature that can't be implemented due to it is what goes by "commit check" in JunOS. You can't test if your configuration will apply cleanly without actually commiting it.

It's because in the scripts, the logic for consistency checking and generating real configs (and sometimes applying them too) is mixed together. Regardless of the backend issues, every script needs to be taken apart and rewritten to separate that logic. We'll talk more about it later.

Design problem #2: read and write operations disparity

Config reads and writes are implemented in completely different ways. There is no easy programmatic API for modifying the config, and it's very hard to implement because binaries that do it rely on specific environment setup. Not impossible, but very hard to do right, and to maintain afterwards.

This blocks many things: network API and thus an easy to implement GUI, modifying the config script scripts in sane ways (we do have the script-template which does the trick, kinda, but it could be a lot better).

Design problem #3: internal representation

Now we are getting to really bad stuff. The running config is represented as a directory tree in tmpfs. If you find it hard to believe, browse /opt/vyatta/config/active, e.g. /opt/vyatta/config/active/system/time-zone/node.val

Config levels are directories, and node values are in node.val files. For every config session, a copy of the active directory is made, and mounted together with the original directory in union mount through UnionFS.

There are lots of reasons why it's bad:

  • It relies on behaviour of UnionFS, OverlayFS or another filesystem won't do. We are at mercy of unionfs-fuse developers now, and if they stop maintaining it (and I can see why they may, OverlayFS has many advantages over it), things will get interesting for us
  • It requires watching file ownership and permissions. Scripts that modify the config need to run as vyattacfg group, and if you forget to sg, you end up with a system where no one but you (or root) can make any new commits, until you fix it by hand or reboot
  • It keeps us from implementing role-based access control, since config permissions are tied to UNIX permissions, and we'd have to map it to POSIX ACLs or SELinux and re-create those access rules at boot since the running config dir is populated by loading the config
  • For large configs, it creates a fair amount of system calls and context switches, which may make system run slower than it could

Design problem #3: rollback mechanism

Due to certain details (mostly handling of default values), and the way config scripts work too, rollback cannot be done without reboot. Same issue once made Vyatta developers revert activate/deactivate feature.

It makes confirmed commit a lot less useful than it should be, especially in telecom where routers cannot be rebooted at random even in maintenance windows.

Implementation problem #1: untestable logic

We already discussed it a bit. The logic for reading the config, validating it, and generating application configs is mixed in most of the scripts. It may not look like a big deal, but for the maintainers and contributors it is. It's also amplified by the fact that there is not way to create and manipulate configs separately, the only way you can test anything is to build a complete image, boot it, and painstakingly test everything by hand, or have expect-like tool emulate testing it by hand.

You never know if your changes may possibly work until you get them to a live system. This allows syntax errors in command definitions and compilation errors in scripts to make it into builds, and it make it into a release more than one time when it wasn't immediately apparent and only appread with certain combination of options.

This can be improved a lot by testing components in isolation, but this requires that the code is written in appropriate way. If you write a calculator and start with add(), sub(), mul() etc. functions, and use them in a GUI form, you can test the logic on its own automatically, e.g. does add(2,3) equal 5, and does mul(9, 0) equal 0, does sqrt(-3) raise an exception and so on. But if you embed that logic in button event handlers, you are out of luck. That's how VyOS is for the most part, even if you mock the config subsystem so that config read functions return the test data, you need to redo the script so that every function does exactly one thing testable in isolation.

This is one of the reasons 1.2.0 is taking so long, without tests, or even ability to add them, we don't even know what's not working until we stumble upon it in manual testing.

Implementation problem #2: command definitions

This is a design problem too, but it's not so fundamental. Now we use custom syntax for command definitions (aka "templates"), which have tags such as help: or type: and embedded shell scripts. There are multiple problem with it. For example, it's not so easy to automatically generate at least a command reference from them, and you need a complete live system for that, since part of the templates is autogenerated. The other issue is that right now some components feature very extensive use of embedded shell, and some things are implemented in embedded shell scripts inside templates entirely, which makes testing even harder than it already is.

We could talk about upgrade mechanism too, but I guess I'll leave it for another post. Right now I'd like to talk about proposed solutions, and what's being done already, and what kind of work you can join.

VyOS Project news

Hello, Community!

We have some great news to share! 

As some of you may already know, we are planning to run virtual meeting event for VyOS devs and users in near future.

So in case you want to participate, just fill up this form and of course join us on our dev. portal to stay in touch.

May admit that this summer is productive in all aspects:

Tremendous work towards VyOS 1.2 done and we going to present 1.2.0 beta 2 in some weeks! Thanks, to our super team!

We revived OVA distribution of VyOS some months ago and continued work in the direction of extensive VMWare Platform support; we also plan to deliver supported images for all other standard hypervisors like KVM, MS Hyper-V and VyOS on clouds markets for AWS, Azure, GCE.

VyOS is now available on SolidRun ClearFog device, credits for this great work to UnicronNL, this opens new applications for VyOS.

Ansible starting from version 2.2 able to configure VyOS, so if you using Ansible, give it a try! We on other hand working on standalone management library for Python.

We receive many requests regarding GUI, and yes we are listening to them. Last four months we dedicated a lot of time to studies regarding possible ways of delivery such  UI. Apart from technical challenges, there are other problems to take into account(like service providers don't want GUI, while SMBs and ROBOs show high demand). We are working to bring satisfaction to all users without sacrifices from any side. So basically yes, it will be GUI in near future, but there are not ETAs for now.  

We believe that affordable open source routing platform and NFV should be accessible to everyone out there

Want to participate???

Join us on social networks and spread the word about VyOS - Twitter, RedditFaceBook, Google+, Linkedin 

Participate in discussions on forum and of course join us on development portal 

VyOS is a community-driven Linux-based network operating system for routers and firewalls

Saludos,

Yuriy