User Space Bootstrap in EdgeRouter Firmware 2.0

Background

Ubiquiti released firmware v2.0.0 for EdgeRouter models at the turn of 2019. FW 2.0 is based on Debian Stretch and had been in development for about two years. Cavium models come with Linux kernel 4.9.y. The ERX series comes with kernel 4.14.y (newer than Cavium. Yay!). One notable cause of the delay was reportedly the effort of adapting to Systemd: Debian Jessie, the original baseline of FW 2.0, had migrated the bootstrapping of services from init scripts to Systemd. To placate a user base grown increasingly anxious about the long delay, Ubiquiti surprised the crowd by shipping Debian Stretch instead of the planned Jessie. This article looks at some aspects of how services are started in FW 2.0. It'll be interesting to see how Systemd is integrated into the firmware, or rather how Vyatta fits into Systemd.

Systemd

Some people like it. Some don't. Others don't care. Systemd bootstraps the user space on Linux systems, replacing the old UNIX SysV and BSD style init scripts. Comments from one ArchLinux developer perhaps shed some light on the reasons that Systemd has been adopted by major distributions. My experience with Systemd is limited but enough to poke around in FW 2.0. This chart in SVG format sums up the bootstrap process pretty well. I generated the chart with 'systemd-analyze plot > erx-systemd.svg' on ERX.

The first thing I noticed about FW 2.0 was that it took a VERY LONG time to start up. The chart explains it. We see the kernel takes about eight seconds and Systemd takes more than three minutes. EdgeRouter X has neither a fast CPU nor speedy flash memory, but three minutes is too long by any measure. Major benefits of Systemd include easy management of dependencies and parallel initialisation, which speeds up the overall boot time. For example, Systemd claims boot times of one to two seconds on modern laptops.

Of the three-plus minutes spent in user space, vyatta-router.service takes the lion's share: 2 min 43 s. If this time is spent on actual work rather than on unnecessary hangs or waits due to poor coding, then it's reasonable to assert that most of the start-up tasks are not taking advantage of parallelisation (yet). Let's look into vyatta-router.service and some other notable services in chronological order.
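
The per-unit numbers in the chart can also be read in text form with 'systemd-analyze blame', which prints time spans such as "2min 43s" next to each unit. As a minimal sketch (the helper name to_seconds is my own, not part of Systemd), the function below converts such a time span to plain seconds so that units can be compared or summed:

```shell
#!/bin/sh
# Hypothetical helper: convert a systemd-style time span such as
# "2min 43s", "850ms" or "1h 2min 3.5s" into seconds.
to_seconds() {
  echo "$1" | awk '{
    total = 0
    for (i = 1; i <= NF; i++) {
      n = $i + 0                          # awk keeps the leading numeric part
      if      ($i ~ /ms$/)  total += n / 1000
      else if ($i ~ /us$/)  total += n / 1000000
      else if ($i ~ /min$/) total += n * 60
      else if ($i ~ /h$/)   total += n * 3600
      else if ($i ~ /s$/)   total += n
    }
    printf "%.3f\n", total
  }'
}

to_seconds "2min 43s"    # prints 163.000
```

On the router itself, 'systemd-analyze blame | head' should list the slowest units first, and the time column of each line can be fed to this helper.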

ubnt-rcS.service

This Systemd unit (/lib/systemd/system/ubnt-rcS.service) runs '/etc/ubnt/ubnt-rcS/ubnt-rcS.sh start'. Here is an excerpt of do_start() in this shell script:

do_start () {
  mkdir -p /dev/pts
  mount -a -n >/dev/null >/dev/null 2>&1

  max_inodes=300000
  tmpfs_opts="nosuid,nodev,mode=775,nr_inodes=$max_inodes"
  cfg_dir=/opt/vyatta/config
  mount -o $tmpfs_opts -t tmpfs none $cfg_dir && chgrp vyattacfg $cfg_dir

  if [ ! -e "$LAST_TIME_FILE" ]; then
    echo '2015-01-01 00:00' >$LAST_TIME_FILE
  fi
  /bin/date -s "$(cat $LAST_TIME_FILE)" >/dev/null 2>&1

  mkdir -p /run/lock /run/sendsigs.omit.d /dev/shm/network
  touch /var/log/wtmp
  touch /var/log/lastlog
  for d in apt fsck mgetty ntpstats squid squid3 user; do
    mkdir -p /var/log/$d
  done
  chown ntp:ntp /var/log/ntpstats
  chown proxy:adm /var/log/squid
  chown proxy:proxy /var/log/squid3
  mkdir -p -m 750 /var/log/lighttpd; chown www-data:www-data /var/log/lighttpd
  uf=/var/run/beaker
  mkdir -pm 750 $uf/container_file; chown -R www-data:www-data $uf
  rm -f /etc/resolv.conf.dhclient*

  hostname "$(cat /etc/hostname)"
  mkdir -p /run/network
  >/run/network/ifstate

  for file in /etc/sysctl.conf /etc/sysctl.d/*.conf; do
    /sbin/sysctl -q -p "$file"
  done

  uf=/var/run/utmp
  >$uf && chmod 644 $uf && chgrp utmp $uf

  uf=/var/run/ubnt-rtr-ui
  mkdir -p -m 770 $uf && chgrp www-data $uf

  uf=/root.dev/www
  mkdir -p $uf && chown www-data $uf

  mkdir -p /run/ravpn-sessions /run/pppoe-server-sessions
  mkdir -p /run/pppoe-client-sessions

  /usr/bin/find /var/tmp -mindepth 1 -maxdepth 1 | /usr/bin/xargs rm -rf

# Do not create '/etc/nologin' for following reasons (it used to forbid 
# login while ER was committing initial 'config.boot' configuration):
#  1) Console, WebUI and SSH login prompts will be available only after 
#     config.boot commit is over;
#  2) If initial boot will stuck or is somehow screwed-up then system admin 
#     will be able to login to recovery SSH instance with link-local eth0 address 
#
#  echo 'System is still booting, please wait' >/etc/nologin

  #must load ubnt-platform before octeon_ethernet
  # Move modeprbe ubnt-platform into /etc/modules-load.d/edgeos.conf
  #rmmod $OCT_ETH_NAME > /dev/null 2>&1
  #modprobe $UBNT_MOD_NAME

  id=$(/usr/sbin/ubnt-hal-e getBoardIdE)
  script="/etc/ubnt/ubnt-rcS/start-e.$id"
  [ -x "$script" ] && $script

  if [ ! -e "$FIRSTBOOT" ]; then
    touch $FIRSTBOOT

    if [ -d "$FIRSTBOOT_PKGS_DIR" ]; then
        pkgs=($(ls $FIRSTBOOT_PKGS_DIR/*.deb 2>/dev/null))
        if [ -n "${pkgs[*]}" ]; then
            echo 'Installing additional packages on first boot'
            dpkg -i "${pkgs[@]}"
            echo Done
        fi
    fi

    if [ -d "$FIRSTBOOT_DIR" ]; then
      run-parts --report --regex '^[a-zA-Z0-9._-]+$' "$FIRSTBOOT_DIR"
    fi
  fi

  usepeerdns=/etc/ppp/ip-down.d/0000usepeerdns
  [ -x $usepeerdns ] && $usepeerdns

  script="/etc/ubnt/ubnt-rcS/start.$id"
  [ -x "$script" ] && $script

  return 0
}

We can tell it's a collection of miscellaneous, small, low-level tasks: for example, creating directories, updating directory permissions, applying settings from /etc/sysctl.d, and loading platform-specific kernel modules (each platform has one script, as in /etc/ubnt/ubnt-rcS/start-e.*). This script also does some work on the first boot after a firmware upgrade, such as installing packages from /config/data/firstboot/install-packages and running scripts from /config/scripts/firstboot.d. All of these tasks prepare the system for launching more services. The script itself does not start any daemons.
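
One detail worth noting in the first-boot branch is the '--regex' passed to run-parts: only file names made up entirely of letters, digits, dots, underscores and hyphens are run. A small sketch of that name filter follows (the function would_run and the sample file names are made up for illustration; run-parts additionally requires the file to be executable):

```shell
#!/bin/sh
# Hypothetical helper: succeed if a file name matches the regex that
# ubnt-rcS.sh passes to run-parts for the first-boot scripts directory.
would_run() {
  printf '%s\n' "$1" | grep -Eq '^[a-zA-Z0-9._-]+$'
}

# Sample names, invented for this example.
for name in 10-install-tools.sh setup_vpn "my script" notes.txt~; do
  if would_run "$name"; then
    echo "accepted: $name"
  else
    echo "skipped:  $name"
  fi
done
```

The first two names are accepted, while the name containing a space and the editor backup ending in '~' are skipped.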

ubnt-init.service

This Systemd unit (/lib/systemd/system/ubnt-init.service) runs '/etc/ubnt/ubnt-init/ubnt-init.sh start'. Here is an excerpt of do_start() in this script:

do_start () {
  # do some cleanup
  rm -rf /root.dev/w*~

  # rngd
  if [ -c /dev/hwrng ]; then
    /usr/sbin/rngd --rng-device=/dev/hwrng
    ln -sf $rngd_pidfile /run/sendsigs.omit.d/rngd
  elif [ -c /dev/hwrandom ]; then
    /usr/sbin/rngd --rng-device=/dev/hwrandom
    ln -sf $rngd_pidfile /run/sendsigs.omit.d/rngd
  fi

  # # rsyslog
  # pidfile=/var/run/rsyslogd.pid
  # start-stop-daemon --start --quiet --pidfile $pidfile \
  #   --exec /usr/sbin/rsyslogd -- -c4
  # ln -sf $pidfile /run/sendsigs.omit.d/rsyslog

  # # vyatta-pppoe
  # rm -f /etc/ppp/peers/pppoe*  
  
  # # at
  # start-stop-daemon --start --quiet --oknodo --chdir "$PWD" \
  #   --pidfile /var/run/atd.pid --exec /usr/sbin/atd

  # bootlogs
  cat /etc/motd.tail >/var/run/motd
  dlog=/var/log/dmesg
  if [ -f $dlog ]; then
    savelog -q -p -c 5 $dlog
  else
    dmesg -s 524288 >$dlog
    chgrp adm $dlog
  fi

  # cron
  #start-stop-daemon --start --quiet --pidfile /var/run/crond.pid --name cron \
  #  --startas /usr/sbin/cron

  # Start SSH recovery service which allows user to access shell 
  # via IPv6 link-local address if router is otherwise inaccessible

  start-stop-daemon --start --background --no-close --exec /usr/bin/ssh-recovery -- start

  # daemons
  /usr/sbin/ubnt-infctld
  builtin set +eu
  source /etc/default/vyatta
  builtin set -eu
  /usr/sbin/ubnt-daemon
  [ -e /tmp/ubnt.socket.platd ] || sleep 1
}

It looks interesting, as it tries to feed rngd from /dev/hwrng or /dev/hwrandom. Some Cavium-based routers might have such a hardware random number generator implemented(?). This script also starts the following daemons:

  • SSH recovery console
  • ubnt-infctld, ubnt-daemon, ubnt-cfgd, ubnt-util

The SSH recovery console is a facility that allows users to log into the system and perform emergency operations, e.g. reverting the default boot to the previous FW image. So it is started pretty early, right after the kernel has loaded and minimal system initialization is done. Apparently this daemon is killed later in the start-up sequence, perhaps inside vyatta-router.service.

Also, surprisingly, ubnt-infctld always gets started. It looks like it is meant to be killed later, depending on the ubnt-discover setting in config.boot. As of FW v2.0.0 the kill doesn't happen, and setting ubnt-discover to disable is ignored. A bug?

ubnt-daemon is basically a watchdog daemon that starts and restarts the other two daemons, ubnt-cfgd and ubnt-util. It deals with the frequent crashes of ubnt-util that some people have spotted and reported. I recall that some versions of FW 1.10.x used monit in the watchdog role, but monit was overkill for the task and for devices with little memory. Systemd has a built-in feature to restart a failed daemon, which should make ubnt-daemon redundant.
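
To make the last point concrete, here is a rough sketch of what Systemd supervision could look like. Restart= and RestartSec= are standard Systemd service options. Everything else is an assumption on my part: the unit name, the /usr/sbin/ubnt-util path, and the idea that ubnt-util can run in the foreground without ubnt-daemon. The function only writes the example unit into a directory of your choice and does not touch the running system:

```shell
#!/bin/sh
# Write a hypothetical unit file that lets Systemd restart ubnt-util
# when it crashes. The unit name and the ExecStart path are assumptions,
# not taken from the firmware.
write_unit() {
  dir=$1
  mkdir -p "$dir"
  cat >"$dir/ubnt-util-example.service" <<'EOF'
[Unit]
Description=Example: ubnt-util supervised by Systemd (hypothetical)

[Service]
ExecStart=/usr/sbin/ubnt-util
Restart=on-failure
RestartSec=2

[Install]
WantedBy=multi-user.target
EOF
}
```

Calling 'write_unit /tmp/unit-demo' produces the file for inspection. Whether the real daemon would behave under such a unit is untested.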

imi.service, nsm.service, ribd.service

These Systemd units start imi, nsm and ribd daemons respectively.

vyatta-router.service

This Systemd unit (/lib/systemd/system/vyatta-router.service) runs one script, /etc/ubnt/init/vyatta-router. Here is an excerpt of start() in this script:

start ()
{
    disabled bootfile || init_bootfile

    disabled migrate || migrate_bootfile
    run_sub rl-system start

    run_parts_dir /etc/ubnt/pre-config.d

    run_parts_dir /config/scripts/pre-config.d

    disabled configure || load_bootfile

    [ -n "$DEFCFG_FILE" ] && touch $DEFCFG_FILE >&/dev/null

    local id=`/usr/sbin/ubnt-hal-e getBoardIdE`
    local script="/etc/ubnt/post-config.$id"
    [ -x "$script" ] && $script

    rm -f /etc/nologin

    run_parts_dir /config/scripts/post-config.d

    chmod g-w,o-w /
}

In other words, it performs the following tasks:

  1. Run '/opt/vyatta/sbin/rl-system.init start' to initialize miscellaneous aspects of the system that are not exposed or configurable through the boot file, /config/config.boot, e.g. tuning TCP/IP parameters.
  2. Run scripts inside /etc/ubnt/pre-config.d and then /config/scripts/pre-config.d.
  3. Load the boot file and set up the router. How is it done? An educated guess: a C++ binary, /opt/vyatta/sbin/my_cli_shell_api, implements those Vyatta CLI commands that need to execute quickly. It has one API known as loadFile that reads and parses the boot file. By the end of loadFile, the router is initialized accordingly, with disabled daemons not running and enabled services, e.g. ddclient (for dynamic DNS updates), up and alive. my_cli_shell_api is tiny, and most of its logic is believed to reside in /usr/lib/libvyatta-cfg.so, a 1.4 MB shared library.
  4. Run the platform-specific /etc/ubnt/post-config.$id script if it exists, then scripts inside /config/scripts/post-config.d.
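
The excerpt does not show run_parts_dir itself. As a guess at the behaviour it provides, the sketch below (function name and details are my own) runs every executable file in a directory one at a time in sorted name order, which is also why the hook directories add to the serial boot time:

```shell
#!/bin/sh
# Hypothetical stand-in for run_parts_dir: run each executable file in a
# directory, one after another, in the sorted order the shell glob yields.
run_dir_serially() {
  dir=$1
  [ -d "$dir" ] || return 0            # a missing hook directory is not an error
  for f in "$dir"/*; do
    [ -f "$f" ] && [ -x "$f" ] || continue
    "$f" || echo "warning: $f exited with status $?" >&2
  done
}
```

With this sketch, 'run_dir_serially /config/scripts/post-config.d' would run a script named 10-first before one named 20-second, and a failing script would not stop the ones after it.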

We could further guess that the bulk of the nearly 3-minute start-up time of this Systemd unit is spent parsing the boot file and setting up the router. Its serial nature makes it slow, ironically just like the init scripts that Systemd has replaced. Perhaps some parts of the process inside loadFile could be handled in parallel, or results cached to speed up subsequent reboots when the config has not changed. I expect some low-hanging fruit might be harvested in future updates to FW 2.x. A major change is unlikely, given how slowly bugs get fixed and new features implemented. After all, a router that stays always on is the norm, and that takes priority over a fast reboot...
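
Any caching scheme would first need a cheap way to tell whether the boot file has changed since the last boot. The sketch below shows only that detection step, using the POSIX cksum utility. The function name and the stamp file are hypothetical, this is not how EdgeOS works today, and which set-up steps could safely be skipped on an unchanged config remains an open question:

```shell
#!/bin/sh
# Hypothetical helper: return 0 if the config file differs from the
# checksum recorded in the stamp file (and record the new checksum),
# or 1 if it is unchanged since the last call.
config_changed() {
  cfg=$1
  stamp=$2
  new=$(cksum <"$cfg")
  old=$(cat "$stamp" 2>/dev/null || true)
  if [ "$new" = "$old" ]; then
    return 1
  fi
  printf '%s\n' "$new" >"$stamp"
  return 0
}
```

A caller could then write 'if config_changed /config/config.boot /some/stamp/file; then ...' and do the full parse only in that branch. The stamp path here is a placeholder.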

GUI Service and Conclusion

lighttpd.service is the Systemd unit that starts the web GUI. However, I don't find it being started by Systemd at boot. In fact, I haven't figured out how the GUI gets started at boot. All I know is that this unit does run the GUI. For example, you can stop and start the GUI service with 'systemctl stop lighttpd.service' and 'systemctl start lighttpd.service'.
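
One way to narrow down what launches the GUI is to search the start-up scripts for references to the service. The helper below is a small sketch of that search (the function name is my own). It simply lists the files under the given directories that mention a name:

```shell
#!/bin/sh
# Hypothetical helper: list files under the given directories whose
# contents mention a name, e.g. a service that something must be starting.
find_mentions() {
  name=$1
  shift
  grep -rl -- "$name" "$@" 2>/dev/null | sort
}
```

For example, 'find_mentions lighttpd /etc/ubnt /opt/vyatta/sbin' would print candidate scripts to read through. 'systemctl is-enabled lighttpd.service' is also worth a look, since it tells whether Systemd itself is set to start the unit at boot.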

That's my initial deep dive into the start-up process inside FW 2.0. Thanks for reading this far. When I find time, a future article may discuss how to strip out the resource hogs in FW 2.0. That should give ERX more breathing room, or let it run custom services that are useful to some of us.
