Upgrade of a mail server to Leap 15.1 – problems with SSSD and clamd

I run mail servers based on Postfix (SMTP) and Cyrus (IMAP), combined with an LDAP server for authentication and fetchmail for access to external mail providers. Both the mail servers and the LDAP server are virtualized guests on KVM hosts with LUKS-encrypted disks/partitions. Due to a series of security measures required for compliance with customer contracts based on the DSGVO / EU GDPR, the whole setup is relatively complicated. Authentication of the mail clients against the different servers is of central importance. Each mail server communicates with the LDAP server via a TLS connection and SSSD. The mail client systems access the mail servers via TLS; login to the client systems partially also depends on LDAP.

Whenever a full upgrade of the servers is required, I therefore first test it on copies of the KVM host installation and of each KVM instance. (The "dd" command is of good service during these tests.) One experiences unwelcome surprises from time to time - and then a quick restoration of a working system may be needed.
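
The copy step can be sketched as follows. This is only a minimal sketch under assumptions: the helper name clone_image and the example paths are hypothetical, the guest is shut down while its disk is copied, and the target is a regular file or a device of exactly the same size as the source (otherwise the final comparison reports a mismatch).

```shell
#!/bin/sh
# Sketch: raw copy of a disk image or block device with dd, then verify.
# clone_image is a hypothetical helper name, not part of any package.
clone_image() {
    src=$1
    dst=$2
    # bs=4M only speeds up the copy; it does not change the result.
    dd if="$src" of="$dst" bs=4M 2>/dev/null || return 1
    # Flush buffered data to disk before comparing.
    sync
    # Byte-by-byte comparison; non-zero exit status on any difference.
    cmp -s "$src" "$dst"
}
```

A call could look like "clone_image /dev/vg0/mailguest /backup/mailguest.img" (both paths are made up for illustration). Restoring is the same call with the two arguments swapped.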

When I switched everything to Opensuse Leap 15.1 some days ago I stumbled once again across small problems. It is interesting that one of the problems had to do with SSSD - again.

Previous problems with SSSD during upgrades to Opensuse Leap 15.0

Some time ago I described a problem with the PAM control files for the imap and smtp services on the mail server when I upgraded to Leap 15.0. See:
Mail-server-upgrade to Opensuse Leap 15 – and some hours with authentication trouble
The PAM files included directives for SSSD. Unfortunately, the files were replaced (without backups) during the upgrade from OS 42.3 to Leap 15.0. This broke all authentication of mail clients, because the requests of the imap and smtp services no longer reached the LDAP system. The cause of the resulting problems on the side of the email clients, namely authentication trouble, was not easy to identify.

New SSSD problem during upgrades to Opensuse Leap 15.1

This time I once again ran into authentication trouble - and suspected some mess with the PAM files again. Yet, this was not the case - the PAM files were all intact and correct. (SuSE learns!) However, after an hour of testing I saw that the SSSD service did not do what it should. When I checked the status of the service with "systemctl status sssd.service", the final status line said "Backend is offline".

What did this mean? I had no real clue. One would naturally assume that LDAP is the backend in my server configuration; this is reflected in the file /etc/sssd/sssd.conf:

[sssd]
config_file_version = 2
services = nss,pam
domains = default
[nss]
filter_groups = root
filter_users = root
[pam]
[domain/default]
ldap_uri = ldap://myldap.mydomain.de
ldap_search_base = dc=mydc,dc=de
ldap_schema = rfc2307bis
id_provider = ldap
ldap_user_uuid = entryuuid
ldap_group_uuid = entryuuid
ldap_id_use_start_tls = True
enumerate = True
cache_credentials = False
ldap_tls_cacertdir = /etc/ssl/certs
ldap_tls_cacert = /etc/ssl/certs/mydomainCA.pem
chpass_provider = ldap
auth_provider = ldap

I checked - the LDAP service was active in its KVM machine. Of course, NSS must also be working for SSSD to become functional. No problem there. I checked whether the LDAP service could be reached through the firewalls of the different KVM instances and their hosts. Yes, this worked, too. So, what the heck was wrong?
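
Such a reachability check can be done with the same values SSSD uses. The following is a minimal sketch, not the procedure I followed at the time: the helper names sssd_conf_value and check_ldap_starttls are hypothetical, the parser assumes plain "key = value" lines as in the listing above, and the test assumes that the LDAP server permits an anonymous search on the base entry and that the OpenLDAP client tools (ldapsearch) are installed.

```shell
#!/bin/sh
# Sketch: print the value of a key from sssd.conf (first match only).
# Assumes simple "key = value" lines; sssd_conf_value is a hypothetical name.
sssd_conf_value() {
    key=$1
    file=${2:-/etc/sssd/sssd.conf}
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$file" | head -n 1
}

# Sketch: anonymous base-scope search over StartTLS.
# -x  simple bind, -ZZ  require a successful StartTLS,
# -s base  look only at the base entry, 1.1  request no attributes.
check_ldap_starttls() {
    file=${1:-/etc/sssd/sssd.conf}
    uri=$(sssd_conf_value ldap_uri "$file")
    base=$(sssd_conf_value ldap_search_base "$file")
    cacert=$(sssd_conf_value ldap_tls_cacert "$file")
    # LDAPTLS_CACERT tells the OpenLDAP client which CA file to trust.
    LDAPTLS_CACERT=$cacert ldapsearch -x -ZZ -H "$uri" -b "$base" -s base 1.1
}
```

Run as root (sssd.conf is normally readable by root only), "check_ldap_starttls" fails if the host name cannot be resolved, if the firewall blocks the port, or if TLS cannot be negotiated. If it succeeds while SSSD still reports "Backend is offline", the problem is more likely one of timing than of connectivity.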

Eventually, I found an interesting contribution in a Fedora mailing list (see the Fedora archive entry under Links). What if the problem really had its origin in some systemd glitch? It wouldn't be the first time.

So, I first made a copy of the original file "/usr/lib/systemd/system/sssd.service" and then modified the original file, which is the target of the link "sssd.service" in "/etc/systemd/system/multi-user.target.wants". I simply added a line "After=network.service" to guarantee a full network setup before sssd is started.

[Unit]
Description=System Security Services Daemon
# SSSD must be running before we permit user sessions
Before=systemd-user-sessions.service nss-user-lookup.target
Wants=nss-user-lookup.target
After=network.service

[Service]
Environment=DEBUG_LOGGER=--logger=files
EnvironmentFile=-/etc/sysconfig/sssd
ExecStart=/usr/sbin/sssd -i ${DEBUG_LOGGER}
Type=notify
NotifyAccess=main
PIDFile=/var/run/sssd.pid

[Install]
WantedBy=multi-user.target

And guess what? This was successful! The reason is that at the point in time when sssd.service starts, name resolution (i.e. the evaluation of resolv.conf and access to the DNS servers) may not yet be guaranteed!

Hint:

Note that there may be multiple reasons for such a delay; one you could think of is a firewall which is started at some point and requires time to establish all its rules. Your server may not get access to any of the defined DNS servers until the firewall rules are in place. Then, depending on when exactly you start your firewall service, you may have to use a different "After" rule than mine.
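
To find out how long name resolution actually takes to become available on a given system, a small retry loop can help. This is a minimal sketch; the helper name retry_until_ok is hypothetical, and the command to be probed is passed in as arguments.

```shell
#!/bin/sh
# Sketch: run a command until it succeeds, at most $1 times (at least once),
# sleeping $2 seconds between attempts. Prints how many attempts were needed.
# retry_until_ok is a hypothetical helper name.
retry_until_ok() {
    tries=$1
    delay=$2
    shift 2
    n=1
    while :; do
        # "$@" is the probe command, e.g. a host name lookup.
        if "$@" >/dev/null 2>&1; then
            echo "succeeded after $n attempt(s)"
            return 0
        fi
        [ "$n" -ge "$tries" ] && break
        n=$((n + 1))
        sleep "$delay"
    done
    echo "gave up after $tries attempt(s)"
    return 1
}
```

For the setup described here the probe would be a lookup of the LDAP host, e.g. "retry_until_ok 30 2 getent hosts myldap.mydomain.de", started as early as possible during boot. If more than one attempt is needed, a service that depends on DNS and starts at the same point in time has a good chance of finding its backend offline.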

Important point:
You should not permanently change the files in "/usr/lib/systemd". So, after such a test as described you should restore the original systemd file for a specific service in "/usr/lib/systemd/system/" with all its attributes! The correct mechanism to add modifications to systemd service configuration files is e.g. described here "askubuntu.com : how-do-i-override-or-configure-systemd-services".

So, in my case we need to execute "systemctl edit sssd" on the command line and then (in the editor window) add the lines

[Unit]
After=network.service

This leads to the creation of a directory "/etc/systemd/system/sssd.service.d" with a file "override.conf", which contains the required entries for the modification of the service startup.
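
On servers that are set up by scripts, the same drop-in file can also be written without the interactive editor. The following is a minimal sketch under assumptions: the helper name write_dropin is hypothetical, and the optional third argument exists only so that the function can be tried out in a scratch directory instead of "/etc/systemd/system".

```shell
#!/bin/sh
# Sketch: create <root>/<unit>.d/override.conf with the given content.
# write_dropin is a hypothetical helper name; root defaults to the
# directory systemd reads administrator drop-ins from.
write_dropin() {
    unit=$1
    content=$2
    root=${3:-/etc/systemd/system}
    dir="$root/$unit.d"
    mkdir -p "$dir" || return 1
    printf '%s\n' "$content" > "$dir/override.conf"
}
```

As root, the call for the case above would be: write_dropin sssd.service "$(printf '[Unit]\nAfter=network.service')". Other than "systemctl edit", this does not make systemd reload its configuration, so "systemctl daemon-reload" and "systemctl restart sssd.service" have to follow. "systemctl cat sssd.service" then shows the original unit file together with the drop-in.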

An additional problem with clamd - timeout during the start of the clamd service

One of my anti-virus engines integrated with amavis is clamav; more precisely, the daemon-based variant, i.e. the "clamd" service. However, when I tested amavis for mail scanning I saw that it used two job instances of "clamscan" instead of "clamdscan". The impact of amavis using two parallel clamscan instances was an almost 100% CPU utilization for some time.

It took me a while to find out what the cause of this problem was: clamd requires time to start up. And for whatever reason this time is now somewhat longer on my mail system than the standard timeout of 90 secs which systemd provides. So the start of the service was aborted and amavis fell back to clamscan. This can be compensated by executing "systemctl edit clamd" and adding the lines

[Service]
TimeoutSec=3min

After this change clamd ran again as usual. Note, however, that clamav does not provide sufficient protection on professional mail servers, especially when your email clients are based on Windows installations. Then you need at least one more advanced (and probably costly) antivirus solution.

Links

how-to-troubleshoot-backend.html
fedora archive contribution
www.clearos.com community : clamd-start-up-times-out
unix.stackexchange.com : how-to-change-systemd-service-timeout-value