Borrowing from the future

I noticed Kenny Tilton's recent Solve the First Problem post on reddit a few days ago. It's a really simple rule: when faced with a complicated problem, start with the first thing you can fix — the others often turn out to be side-effects. This is one of my favorite "debugging magic" techniques - over the years I've run into many problems which seemed far more intimidating than they actually were because the bug spewed out hundreds of errors or the assumption that the bug must be a mistake on their part because the program in question must be far too widely used to have a bug like that.

Today I fixed a problem which has been annoying us for awhile using what I think of as the zero-th law of debugging: if you can't find the problem, your first bug is poor observability.

A little background: we've been working on a more scalable upgrade to our LDAP server infrastructure. Since we're a Debian shop the process was really simple: clone our base Debian VM, apt-get install slapd, run cfengine, a quick slapcat/slapadd and we're up and running 15 minutes later with an LDAP master & a couple slaves (have I mentioned lately how much I love having invested in automation)? A little load-testing later and I'm ready to unleash this thing on our internal group — and then the guy down the hall mentions that he can't login…

After a quick look, this is even more confusing - the problem appears to be account specific but the LDAP records are very similar and copied from the current production server. Exhaustive comparisons between the two accounts turn up nothing. Soon one of the other admins discovers another interesting wrinkle: the problem only happens if pam_ldap uses SSL or STARTTLS - insecure connections work correctly for all accounts! Contributing to the confusion are the error messages insinuating that the problem is the LDAP server:

2008-03-11T10:30:58+07:00 su[17545]: pam_ldap: ldap_search_s Can't contact LDAP server
2008-03-11T10:31:01+07:00 su[17545]: pam_ldap: ldap_simple_bind Can't contact LDAP server
2008-03-11T10:31:03+07:00 su[17545]: pam_authenticate: Authentication service cannot retrieve authentication info.

A quick check with debug 65535 in /etc/pam_ldap.conf shows that the LDAP server is sending back all of the requested data:

ldap_read: want=44954, got=16376
  0000:  82 af 97 04 26 75 69 64  3d 6a 62 2c 6f 75 3d 50   ....&uid=jb,ou=P
  0010:  65 6f 70 6c 65 2c 64 63  3d 73 6e 6c 2c 64 63 3d   eople,dc=snl,dc=
  0020:  73 61 6c 6b 2c 64 63 3d  65 64 75 30 82 af 6b 30   salk,dc=edu0..k0
  0030:  47 04 0b 6f 62 6a 65 63  74 43 6c 61 73 73 31 38   G..objectClass18
  0040:  04 0c 70 6f 73 69 78 41  63 63 6f 75 6e 74 04 0d   ..posixAccount..
  0050:  73 68 61 64 6f 77 41 63  63 6f 75 6e 74 04 0d 69   shadowAccount..i
  0060:  6e 65 74 4f 72 67 50 65  72 73 6f 6e 04 0a 61 70   netOrgPerson..ap
  0070:  70 6c 65 2d 75 73 65 72  30 0b 04 03 75 69 64 31   ple-user0...uid1
  0080:  04 04 02 6a 62 30 13 04  09 75 69 64 4e 75 6d 62   ...jb0...uidNumb
  0090:  65 72 31 06 04 04 31 30  30 32 30 1b 04 0d 68 6f   er1...10020...ho
  00a0:  6d 65 44 69 72 65 63 74  6f 72 79 31 0a 04 08 2f   meDirectory1.../
  00b0:  68 6f 6d 65 2f 6a 62 30  18 04 0a 6c 6f 67 69 6e   home/jb0...login
  00c0:  53 68 65 6c 6c 31 0a 04  08 2f 62 69 6e 2f 63 73   Shell1.../bin/cs
  00d0:  68 30 10 04 02 63 6e 31  0a 04 08 4a 6f 65 20 42   h0...cn1...Joe B
…
ldap_read: want=28578 error=Success
ber_get_next failed.
ldap_err2string

After about the fifth time comparing strace results the real problem dawns on me: the main thing making this hard is because pam_ldap and OpenLDAP have pretty rough debugging code which is really only used by the developers and so we're stuck guessing at what might be happening. This has been an inconvenience for years and I realized that I've spent more time scrutinizing high debug-level dumps and strace output than it would take to add proper logging to pam_ldap - and anyway, how many times can you not fix a mocking "error=Success" telling you that something clobbered a useful errno before you could see it?

A quick apt-get source libpam-ldap later and I had a huge pile of innovatively-indented code to slog through (thank FSM for indent). The general process was to find each function in pam_ldap.c and add syslog() at each point where an error can cause an early return. I also cleaned up a number of the error messages and made sure that the important ones included the actual error codes since it was frequently the case that a meaningful low-level error was turned into an uninformative return -1 before it was logged. The resulting works-on-my-system tested patch is part of Debian bug report #470509.

Armed with my shiny new syslog information I quickly wound up in _get_user_info() where ldap_search_s() is used to retrieve the user's information. A quick apt-get source libldap2 and not-so-quick patch to add better logging later and I finally had a real error message:

SSL_read() failed: gnutls_read() returned -59: GnuTLS internal error

Huh? Although that does kind of mesh with that clue in the big dump above: ldap_read only received part of the data it needed (suspiciously close to a 16KB chunk in fact). A quick check confirms that the problem is size related - both of the failing accounts have jpegPhoto attributes which are larger than the jpegPhotos for the working accounts. A quick test later and we've confirmed that size is the problem by telling ldap_search_s() to return only attribute names and not the values:

  rc = ldap_search_s (session->ld, ssd->base, ssd->scope, filter, NULL, 1, &res); 
  rc = ldap_search_s (session->ld, ssd->base, ssd->scope, filter, NULL, 0, &res); 

This isn't quite the approach I ended up putting in the patch (which queries for a single attribute) and ultimately this led to a second bug report for the OpenLDAP client library since ldap_search_s() should never fail in this fashion. Since this post has already grown a bit large, though, I'm going to abbreviate those details in favor of the ever-popular bulleted list:

  1. OpenLDAP's logging is an operational cost: libldap is used all over the place and every single user has to live with logging which does not help isolate common operational problems and requires far more knowledge of the library internals than most users are going to have (or want).
  2. The saving grace was that everything involved is open source software: OSS is certainly not immune to unpleasant bugs and poor observability but I've run into similar bugs from all of the major commercial software vendors. The difference here is that Open Source means that my systems are patched a day after shaking off my "traditional ISV support"-induced apathy, leading to the final and most important point:
  3. I was stupid not to do this earlier: time spent improving observability is really time borrowed from the future. It saved me some time now which I otherwise would have spent in single-stepping in gdb but while that may cost me an “Extreme Sysadmin™” t-shirt it's a good investment in my future sanity. This class of problems has been widely discussed in the log-analysis community and something which most sysadmins (myself included) tend to view as more intractable than they really are. Fixing something the right way really isn't as time-consuming as it may seem at first glance and it sure beats scrutinizing chicken entrails the next time something breaks.
comments powered by Disqus