Kerberos is a simple, ticket-passing security protocol. But you can stop it cold with relative ease, and its error handling is often poorly implemented, strangling you and your software. These five troubleshooting suggestions may save you teeth-gnashing and get your system working right again.
Kerberos is a simple way to keep track of interactions between authenticated users and the programs and services that require that authentication. Its use is usually transparent, but Kerberos is occasionally considered a curse, and for good reason: it's supposed to stay hidden, so nobody looks at it until it fails.
Most people use Kerberos every day, but they don't realize it because its workings are behind-the-scenes until something breaks, which is rare. When Kerberos doesn't work, it can stop users cold, and deliver error messages that can rile the most Type B of helpdesk and administrative support personnel. The screams are harrowing.
When it's working, Kerberos operates as a time-sensitive, cross-platform (on good days) mutual authentication ticketing system that proves a relationship between a user request and something the user (or a user's application) desires. Kerberos issues and checks tickets at each step of the processes users undertake using this authenticating system. There are two kinds of ticket: a longer-lived ticket-granting ticket (TGT) that you are granted when you or your app logs you in, and then lots of little short-term service tickets for session-like uses.
Kerberos comes from MIT and the mythology of Cerberus, the multi-headed dog that guarded the gates of Hades. In our case, our guard dog security system has three heads: a user, a server, and the Key Distribution Center (KDC). The key for a user might have a secondary key from a SmartCard, fingerprint reader, single sign-on (SSO) software, or proxy authenticators. It's still the user key.
The server (actually an application running somewhere else) wants that key, which must be issued by the KDC. The KDC is often an Active Directory domain controller of some rank, but it doesn't have to be. One benefit of Kerberos is that, although the Windows version can be slightly different (and ornery), the security system runs on Solaris/SunOS, Linux, BSD, and almost all Unix derivatives. Interoperability issues are few and far between. When they do arise, however, they can be a headache.
In reality, Kerberos is simple, and most of its issues surround communications problems. Rarely, something goes wrong, and for the help desk the forensics of a Kerberos failure come down to finding out what caused it, which is usually either operating system-specific or application-specific. Here's some guidance that may help.
You must be able to talk to the KDC server to get a ticket.
Without a path of communications to the Kerberos server or one of its authorized associates, you can't get the ticket.
This means your communications path to the server must be intact. There's not a DNS or IP address or AD/SMB way to find the salient Kerberos server? No path, no ticket. Error messages usually are somewhat articulate at this point, but it looks to the user as though login failed. Don't lose the keys, or the ability to present the keys to a server located on a reliable path.
This problem is more often found in organizations with Linux, BSD, and Solaris, as the Windows implementations of Kerberos use a service that mandates you authenticate through the Active Directory. Often, you are blocked by the Active Directory first; surmounting that problem gets Kerberos working because the path to the same physical server where both are hosted is now working. The Windows Kerberos services can be shut off, usually by mistake but occasionally by malware. Check that, too.
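The first question above (can you talk to the KDC at all?) can be sketched in a few lines of Python. This is a minimal probe, not a diagnosis: it only tests whether a TCP connection to port 88 opens, and Kerberos traffic may also travel over UDP, which this does not cover. The function name and the host name in the usage note are hypothetical.

```python
import socket


def kdc_reachable(host: str, port: int = 88, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout.

    Assumption: the KDC accepts Kerberos over TCP on this port. A True result
    says the path is open, not that the KDC will issue a ticket.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts alike.
        return False
```

Calling `kdc_reachable("kdc.example.com")` from the affected client (the host name is a placeholder) separates "no path to the KDC" from "the KDC answered and said no," which are very different help desk tickets.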
Time is of the essence.
With Kerberos, all of the devices have to be in comparatively perfect time sync; the tolerance is five minutes by default. Some Kerberos implementations and changes from default configurations allow a wider tolerance, but Kerberos tickets must be able to expire, or they're good forever and ever.
No real sloppiness is allowed here, which means that clients and servers need to be synchronized with an actual and highly available time server. Usually, to prevent fooling with ticket expiration, that means you need an external time keeper/Network Time Protocol server so nothing can tamper with time settings.
This also means headaches on the nine days of the year when time zones fall out of sync with each other as the pain we know as Daylight Saving Time takes place, or when UTC/Zulu time adds a leap second. Kerberos itself compares timestamps in UTC, so the trouble comes from machines whose clocks are adjusted by hand or that carry the wrong time zone rules. What once worked refuses. Maybe, if you're lucky, an articulate error message points to your time-adjusting sloth. Don't count on it.
The downside here is that if one of the servers in your chain falls out of time sync, nothing that uses Kerberos works between the correct servers and the devices whose clocks have drifted. The only correction is to re-synchronize the time. This means thorough attention to time across locations and time zones, and attention to the increasing number of Daylight Saving Time corrections that have to be made at each occasion. If one of them doesn't correct for DST (such as the corporate site in Arizona), then you can set your clock by the number of failed processes or user help desk phone system insanities. Sync your clocks.
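The skew rule can be illustrated with a small Python sketch. It assumes the five-minute default tolerance mentioned above and compares timezone-aware timestamps, which Python normalizes to UTC before subtracting; the function name is hypothetical, and a real deployment would read the tolerance from its own Kerberos configuration.

```python
from datetime import datetime, timedelta, timezone

# The default tolerance cited above; your realm may be configured differently.
DEFAULT_TOLERANCE = timedelta(minutes=5)


def within_skew(client: datetime, server: datetime,
                tolerance: timedelta = DEFAULT_TOLERANCE) -> bool:
    """Return True if two timezone-aware clocks agree within the tolerance."""
    if client.tzinfo is None or server.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    # Aware datetimes are compared as absolute instants, so the time zone
    # offset itself never counts as skew.
    return abs(client - server) <= tolerance


if __name__ == "__main__":
    arizona = timezone(timedelta(hours=-7))  # no DST shift
    client = datetime(2024, 3, 10, 9, 0, tzinfo=arizona)
    server = datetime(2024, 3, 10, 16, 2, tzinfo=timezone.utc)
    print(within_skew(client, server))                       # two minutes apart
    print(within_skew(client, server + timedelta(hours=1)))  # an hour off
```

The point of the example is that a correctly configured Arizona clock is fine; a clock that someone nudged by an hour "to fix DST" is sixty minutes outside a five-minute window.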
You have too many Active Directory group memberships.
Windows adds attributes, including group memberships, to the Kerberos token at authentication time, that is, when you get the token at logon or at some other authentication. If a user is a member of too many Active Directory groups, the padding exceeds the space available, and very strange things happen. Sometimes nothing happens on one machine, while logging in from another generates a fat token: just slightly too fat, and authentication gets strange.
To hit this problem, you need either a larger organization with many group membership possibilities (100+), or an overactive system administrator who loves Group Policy Objects and used very finely grained object control in a fit of security insanity. Fortunately, a set of hotfixes from Microsoft cures this problem. Third-party single sign-on (SSO) software, for Windows as well as for Mac and Linux users, can trigger this craziness too.
The hotfix goes on the server, but the best cure is to cut group membership possibilities to a sane level. That may mean re-organizing group policy objects and groups to stanch the membership possibilities.
The symptoms aren't necessarily repeatable. This makes troubleshooting this problem even more of an excuse for Zantac consumption; there is a fallback to NTLM authentication, but it's an ill-advised move. NTLM authentication is weak like cooked pasta.
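To judge whether a user is close to the limit, Microsoft's support documentation publishes a rule-of-thumb estimate, TokenSize = 1200 + 40d + 8s. Here d counts domain local groups, universal groups outside the user's account domain, and SID history entries, and s counts global groups plus universal groups inside the account domain. The sketch below applies it in Python. The formula and the 12,000-byte default limit (48,000 on Windows Server 2012 and later) come from that documentation rather than from this article, so treat both as assumptions to verify against your own Windows versions; the function names are hypothetical.

```python
# Assumed defaults from Microsoft's documentation; verify for your versions.
LEGACY_MAX_TOKEN_SIZE = 12000   # older Windows releases
MODERN_MAX_TOKEN_SIZE = 48000   # Windows Server 2012 and later


def estimate_token_size(d: int, s: int) -> int:
    """Estimate token size in bytes using TokenSize = 1200 + 40d + 8s.

    d: domain local groups + universal groups outside the account domain
       + SID history entries.
    s: global groups + universal groups inside the account domain.
    """
    return 1200 + 40 * d + 8 * s


def token_fits(d: int, s: int, limit: int = LEGACY_MAX_TOKEN_SIZE) -> bool:
    """Return True if the estimated token stays within the given limit."""
    return estimate_token_size(d, s) <= limit
```

It is only an estimate, but it turns "plentiful" into a number you can compare across users before and after a group cleanup.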
You blocked 88 or 464; now you shall suffer.
Seems simple enough. Port 88 (no, not port 80, the http port) and 464 need to be open. Close them, and weep. Or worse: the ports are open, but your DNS infrastructure has been mangled, corrupted, or hacked, in which case the server ticket no longer matches the client ticket, because the fully qualified domain names (FQDNs) now map to the wrong IP addresses. There is an entire 50-gallon can of worms that can be opened up regarding DNS problems that we won't open here.
It may not be deliberate. You might have installed a new firewall application. Port 464 isn't often needed, and Microsoft has tried to deprecate it, but kpasswd, the Kerberos password-change service, needs port 464 wide open and ready to gleefully respond. It's obscure. Yet it's still out there, and still used, and a blocked 464 can still make you fall on the floor in convulsions trying to find it.
The answer here is to think like Kerberos, and determine, as in Problem #1, what's blocking the communications. Then, pry it open. Use hydraulic pressure. Swear words are allowed. Go into your DNS server, and evaporate all of the duplicate and ancient reverse DNS (rDNS) records. Include the old ones that cause bad reverse pointers to servers whose FQDN has changed, or whose IP address changed. Never, of course, court disaster by allowing servers to get their address from DHCP, no matter how long a DHCP lease you gave them.
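The reverse DNS cleanup above can be approached with a small audit. The Python sketch below looks up a server's addresses, asks for each address's reverse name, and reports the ones that do not point back at the original FQDN. It uses the standard library's IPv4 lookups as defaults; the function names are hypothetical, and the resolvers are passed in as parameters so the check can be exercised without touching real DNS.

```python
import socket
from typing import Callable, List, Tuple


def default_forward(name: str) -> List[str]:
    # gethostbyname_ex returns (canonical_name, aliases, ipv4_addresses).
    return socket.gethostbyname_ex(name)[2]


def default_reverse(ip: str) -> str:
    # gethostbyaddr returns (primary_hostname, aliases, addresses).
    return socket.gethostbyaddr(ip)[0]


def mismatched_reverse_records(
    fqdn: str,
    forward: Callable[[str], List[str]] = default_forward,
    reverse: Callable[[str], str] = default_reverse,
) -> List[Tuple[str, str]]:
    """Return (ip, reverse_name) pairs whose reverse name is not fqdn."""
    wanted = fqdn.rstrip(".").lower()
    bad = []
    for ip in forward(fqdn):
        try:
            ptr = reverse(ip)
        except OSError:
            ptr = ""  # no reverse record at all
        if ptr.rstrip(".").lower() != wanted:
            bad.append((ip, ptr))
    return bad
```

An empty list means forward and reverse agree for that name. Anything else is a candidate for the evaporation described above.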
Your encryption or naming convention method is ancient.
Users might get a ticket, but it's encrypted with DES, which lives on (if zombie-like) in older systems. An administrator can narrow the encryption supported to just a handful of options, and all systems using Kerberos must support that method. Otherwise, those that don't will be rejected like a bad blind date. Among all the participants, new and old, every encryption method must be in revision sync. Change one, and all will laugh at you, wag their tongues, and leave you to discover which one fell out of sync.
Encryption support must be the same among the three heads of Cerberus: user, server, and KDC. If there is more than one KDC in the system, and the user finds the wrong one by mistake, then the wrong KDC might issue the ticket, too. This ticket is dutifully issued, and all works well until the server sees the ticket, becomes horrified, and balks. This can be a symptom of another problem: your naming conventions for servers are bad, DNS is hallucinating, or there are multiple forests where an admin duplicated names, and traffic can find those dupes and cause everyone premature baldness from tearing their hair out.
The best routes towards troubleshooting Kerberos problems start with the specific operating system version where the KDC service/daemon is hosted. Problems in Kerberos 5 (the current version, usually written krb5) don't necessarily have the same error message types or even symptoms when used in heterogeneous installations. Kerberos Web problems are often unique to the specific version of httpd host, and items used with it, such as Tomcat, language handlers, or middleware processes.
There are no rules against caching Kerberos information. Each operating system has its own cache-flushing commands (kdestroy on MIT Kerberos and Heimdal systems, klist purge on Windows) that discard cached tickets and force the ticketing process to start over, getting rid of problems caused by client/server/KDC/browser caching of bad or stale data. Often, this gets rid of all mysteries mysteriously. Somehow, the cache was old or corrupted, or an OS or app reached for the cache instead of getting refreshed information. Flushing the user-side Kerberos cache means searching for the specific OS flushing commands; a reboot may not clear the user's invalid cache state. A flush can also restore sanity in ways that prevent users from throwing their notebooks across a lobby. No one thinks about what happens when there's a Kerberos patch until an application throws insane error messages.
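The "everyone must agree" rule can be pictured as a set intersection. The Python sketch below is a deliberately simplified model, not how a KDC actually negotiates: it takes the encryption types each participant supports and returns the ones common to all, in the first participant's order of preference. The function name is hypothetical; the type names follow the usual krb5 spellings.

```python
from typing import List, Sequence


def common_enctypes(*supported: Sequence[str]) -> List[str]:
    """Return the encryption types every participant lists.

    Order follows the first participant's list. An empty result means
    at least one head of the dog cannot talk to the others.
    """
    if not supported:
        return []
    first, rest = supported[0], supported[1:]
    return [e for e in first if all(e in other for other in rest)]


if __name__ == "__main__":
    client = ["aes256-cts-hmac-sha1-96", "aes128-cts-hmac-sha1-96", "rc4-hmac"]
    kdc = ["aes256-cts-hmac-sha1-96", "aes128-cts-hmac-sha1-96"]
    old_server = ["des-cbc-crc", "des-cbc-md5"]  # the zombie
    print(common_enctypes(client, kdc))              # both AES types
    print(common_enctypes(client, kdc, old_server))  # nothing in common
```

Collecting the three lists from the real systems is the hard part; once you have them, an empty intersection tells you which upgrade or configuration change fell out of sync.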
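The flush step above can be sketched as a small helper that picks a command per platform. It assumes `klist purge` on Windows and `kdestroy` on systems with MIT Kerberos or Heimdal client tools installed; the function name is hypothetical, and other Kerberos stacks may need something else entirely.

```python
import platform
from typing import List, Optional


def cache_flush_command(system: Optional[str] = None) -> List[str]:
    """Return the command that clears the current user's ticket cache.

    Assumption: Windows uses `klist purge`; everything else is treated as
    having `kdestroy` on the PATH.
    """
    system = system or platform.system()
    if system == "Windows":
        return ["klist", "purge"]
    return ["kdestroy"]
```

To run it, pass the result to `subprocess.run(cache_flush_command(), check=False)` as the affected user, then have the user log in again to obtain fresh tickets.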
I wish you luck, as the variants represent especially difficult forensic maneuvers, and situation-specific troubleshooting. Then Kerberos goes on working again, transparently, perhaps forever--or at least until your shift ends.
Tom Henderson is principal researcher for ExtremeLabs, Inc., of Bloomington, IN.