Healthy Person Trust Manager

If you have any kind of backend or infrastructure, you probably have to deal with TLS certificates. Life is good when all your servers are reachable from the Internet and you can install certificates from Let’s Encrypt or a similar service on them. In that case you run certbot every day; it checks the validity period of your certificates, reissues them in good time, and installs them. You only need to set up a hook so that you are notified of any error, and you will always have enough time to fix things.

It’s another matter when you have dozens of servers and Kubernetes clusters isolated from the Internet, hosting hundreds of microservices written at different times by different development teams. Then you need your own CA whose certificate is in every trust store and which issues certificates to everyone for their endpoints. Typically, such setups have no unified system for automatically reissuing and installing certificates; everyone configures TLS in their own way. The people who run the CA hand out certificates on flash drives, and developers handle them however they like. Some, for example, bake certificates together with their keys into a Docker image.

In line with current practice, certificates are issued for 1 year. As a result, if you have more than 2000 certificates, some certificate somewhere has to be replaced every day. Developers keep missing deadlines here and there, and something breaks. The certificate on this Postgres instance expires in a month? There is still time, we will handle it in the next sprint. Then the DevOps engineer falls ill a week later, the team lead goes on vacation, the certificate expires, access to the database is lost, and by the time everything is sorted out the system has been down for an hour. And this kind of thing happens (almost) every day.

We completely solved this problem with a special Java library. In Java, TLS certificate validation is handled by the so-called TrustManager. Its logic is roughly this:

// X509Certificate.getNotAfter() returns a java.util.Date
if (new Date().after(cert.getNotAfter())) {
    throw new CertificateExpiredException();
}

Main idea

Somewhere after the third drop in sales, I had the idea of replacing the standard certificate validity check with this:

Instant notAfter = cert.getNotAfter().toInstant();
if (Instant.now().isAfter(notAfter)) {
    throw new CertificateExpiredException();
} else if (Duration.between(Instant.now(), notAfter).toDays() < 30) {
    notifyEverybody(cert);  // warn while there is still time to reissue
}

That is, what gets analyzed is the certificate actually used when establishing a TLS connection: not some file in a directory, which every so often turns out to be the wrong one, but the real certificate received over the network. During certificate verification, instead of merely throwing an exception when, as the saying goes, it is too late to drink Borjomi, we issue a warning in advance. The result is a very useful module called omni-tls-starter, which I am pleased to present today.

Registration of a security provider

To register a trust manager, you need to implement a security provider. Our provider cannot implement all the necessary algorithms itself, so it delegates the main work to the standard providers and only adds its own behavior to some methods.

To take priority over the others, the custom provider must be installed at the beginning of the provider list. In some cases this is undesirable: if we do not want our provider used everywhere, we install it at the end of the list and refer to it by name where needed. We also have to guard against repeated installation caused by misconfiguration. And it is best to initialize the provider lazily, because we do not know exactly when it will be needed, and it may not be needed at all. I implemented all this logic in the class OmniSecurityProvider. Here is its main registration method:

public static void registerOnTop() {
    Provider[] providerArray = Security.getProviders();
    int targetPos = 1; // provider positions are 1-based
    // Already first in the list: nothing to do (guards against repeated registration).
    if (providerArray != null && providerArray.length >= targetPos
            && providerArray[targetPos - 1].equals(getInstance())) {
        return;
    }
    int pos = Security.insertProviderAt(getInstance(), targetPos);
    if (pos != targetPos) {
        String msg = String.format(Locale.ROOT, "Failed to register security provider %s at position %d",
                OmniSecurityProvider.class.getSimpleName(), targetPos);
        throw new IllegalStateException(msg);
    }
}

This method must be called at application startup so that our trust manager receives all TLS connections for validation. And here lies the first ambush. In modern frameworks, many integrations are launched during startup, and with the standard autostart mechanisms many of them may establish TLS connections before our provider is registered. Therefore, provider registration is implemented through an ApplicationListener. If the library is used without Spring, registerOnTop() must be called at the very beginning of the main method.
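For the alternative mentioned earlier, where the provider is appended to the end of the list and addressed by name, a minimal sketch could look like the following. DemoProvider and the name "DemoOmni" are hypothetical stand-ins for illustration and are not part of omni-tls-starter; the sketch registers no algorithms and only shows lazy creation and idempotent registration.

```java
import java.security.Provider;
import java.security.Security;

// Hypothetical stand-in for a custom provider; it registers no algorithms.
final class DemoProvider extends Provider {
    private static final long serialVersionUID = 1L;

    static final String NAME = "DemoOmni"; // assumed name, used for lookups

    private DemoProvider() {
        super(NAME, "1.0", "Demo provider for illustration only");
    }

    // Initialization-on-demand holder: the instance is created on first use only.
    private static final class Holder {
        static final DemoProvider INSTANCE = new DemoProvider();
    }

    static DemoProvider getInstance() {
        return Holder.INSTANCE;
    }

    // Appends the provider to the end of the list; repeated calls are harmless.
    static void registerAtEnd() {
        if (Security.getProvider(NAME) == null) {
            Security.addProvider(getInstance());
        }
    }
}
```

Code that wants this provider explicitly can then pass its name to the standard factory methods that accept a provider name, such as SSLContext.getInstance(protocol, providerName), provided the provider actually registers that service.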

Implementation of trust manager and key manager

Our implementations of the trust manager and the key manager are quite simple. They just delegate all calls to the standard implementations obtained from OmniSecurityProvider.

In fact, there are pitfalls. First, the main interfaces come in pairs: there is X509TrustManager, and there is also X509ExtendedTrustManager. We have to check for both, since there is no guarantee which one we will receive. Second, intercepting the creation of the standard managers is not so easy; for that you need to implement the internal interfaces TrustManagerFactorySpi and KeyManagerFactorySpi, whose standard implementations are not accessible.

Since we need to monitor not only server but also client certificates, we immediately implement both TrustManager and KeyManager, and place the code common to both implementations in the class OmniX509Commons.
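The delegation described above can be sketched as follows. This is an illustration under assumptions, not the library's actual class: InspectingTrustManager and the inspector callback are hypothetical names, and the sketch inspects the chain only after the delegate has accepted it.

```java
import java.net.Socket;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;
import java.util.function.Consumer;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.X509ExtendedTrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical wrapper: delegates each check to a standard trust manager,
// then hands the peer chain to an inspector callback (e.g. an expiry check).
final class InspectingTrustManager extends X509ExtendedTrustManager {
    private final X509TrustManager delegate;
    private final X509ExtendedTrustManager extended; // null if the delegate is the basic interface
    private final Consumer<X509Certificate[]> inspector;

    InspectingTrustManager(X509TrustManager delegate, Consumer<X509Certificate[]> inspector) {
        this.delegate = delegate;
        this.extended = delegate instanceof X509ExtendedTrustManager
                ? (X509ExtendedTrustManager) delegate : null;
        this.inspector = inspector;
    }

    @Override
    public void checkClientTrusted(X509Certificate[] chain, String authType) throws CertificateException {
        delegate.checkClientTrusted(chain, authType);
        inspector.accept(chain);
    }

    @Override
    public void checkServerTrusted(X509Certificate[] chain, String authType) throws CertificateException {
        delegate.checkServerTrusted(chain, authType);
        inspector.accept(chain);
    }

    @Override
    public void checkClientTrusted(X509Certificate[] chain, String authType, Socket socket)
            throws CertificateException {
        if (extended != null) {
            extended.checkClientTrusted(chain, authType, socket);
        } else {
            delegate.checkClientTrusted(chain, authType);
        }
        inspector.accept(chain);
    }

    @Override
    public void checkServerTrusted(X509Certificate[] chain, String authType, Socket socket)
            throws CertificateException {
        if (extended != null) {
            extended.checkServerTrusted(chain, authType, socket);
        } else {
            delegate.checkServerTrusted(chain, authType);
        }
        inspector.accept(chain);
    }

    @Override
    public void checkClientTrusted(X509Certificate[] chain, String authType, SSLEngine engine)
            throws CertificateException {
        if (extended != null) {
            extended.checkClientTrusted(chain, authType, engine);
        } else {
            delegate.checkClientTrusted(chain, authType);
        }
        inspector.accept(chain);
    }

    @Override
    public void checkServerTrusted(X509Certificate[] chain, String authType, SSLEngine engine)
            throws CertificateException {
        if (extended != null) {
            extended.checkServerTrusted(chain, authType, engine);
        } else {
            delegate.checkServerTrusted(chain, authType);
        }
        inspector.accept(chain);
    }

    @Override
    public X509Certificate[] getAcceptedIssuers() {
        return delegate.getAcceptedIssuers();
    }
}
```

A delegate can be obtained in the usual way from TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm()) after calling init((KeyStore) null) and picking the X509TrustManager out of getTrustManagers().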

Logging

A characteristic feature of the omni-tls-starter library is that, on the one hand, its main function is logging, and on the other hand, it cannot use the usual logging mechanisms. The classes that perform logging can, and usually do, establish TLS connections themselves: you may have an appender that writes logs to a database, Elasticsearch, or Kafka. If low-level classes trigger logging while a TLS connection is being established, the result is recursion. To avoid it, our security provider puts all notifications into a queue and starts a separate thread, which takes messages off the queue and sends them to the logs.

The message producer and the message consumer are not coupled to each other. The implementation is structured so that the logger can be moved to a separate library. For now, for the sake of simplicity, both live in one module, since there has been no need for different logging implementations for the custom security provider.
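The queue-plus-thread arrangement can be illustrated with a minimal sketch. AsyncNotifier, the String message type, and the sink callback are assumptions made for this example; only the notify and isFull method names mirror the notifier interface shown later in the article.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Hypothetical sketch: the TLS code path only enqueues; a single daemon
// thread drains the queue and is the only place where logging happens.
final class AsyncNotifier {
    private final BlockingQueue<String> queue;

    AsyncNotifier(int capacity, Consumer<String> sink) {
        BlockingQueue<String> q = new ArrayBlockingQueue<>(capacity);
        this.queue = q;
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    String message = q.take(); // blocks until a message arrives
                    try {
                        sink.accept(message);
                    } catch (RuntimeException e) {
                        // A failing sink must not kill the worker thread.
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "tls-notifier");
        worker.setDaemon(true); // must not keep the JVM alive
        worker.start();
    }

    // Never blocks: returns false and drops the message if the queue is full.
    boolean notify(String message) {
        return queue.offer(message);
    }

    boolean isFull() {
        return queue.remainingCapacity() == 0;
    }
}
```

Because the producer side uses offer rather than put, a slow or broken log sink can at worst cause messages to be dropped; it cannot stall a TLS handshake.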

The method below is where we would have to get creative if we ever needed to choose dynamically between different logging implementations. The main point is that the security provider does not depend on the logging implementation.

/**
 * Returns the logger instance.
 * @return the logger instance
 */
static OmniSecurityNotifier getInstance() {
    try {
        // The security provider must have no external dependencies, to prevent cyclic calls.
        // Therefore, the implementation of the interface is looked up via reflection.
        return (OmniSecurityNotifier) Class.forName("ru.github.seregaizsbera.tls.starter.OmniSecurityNotifierImpl")
                .getDeclaredMethod("getInstance")
                .invoke(null);
    } catch (ReflectiveOperationException e) {
        return new OmniSecurityNotifier() {
            @Override
            public boolean notify(OmniX509EventModel event) {
                return false;
            }
            @Override
            public boolean isFull() {
                return false;
            }
        };
    }
}

Limiting the volume of diagnostics

So, suppose that somewhere in a RabbitMQ instance we did not even know we had, there is a certificate that expires in a month. It is time we found out about it: omni-tls-starter notices it when a TLS connection is established and starts writing about it to the log. Very soon the whole log is littered with identical messages like “such and such a certificate expires in 29 days.” The certificate will, of course, be reissued and installed, but the annoying logger will most likely be switched off along the way to stop the spam. Therefore, the number of messages has to be limited.

For this purpose, the omni-tls-starter module implements EventLimiter. This class is generic and knows nothing about certificates. It takes event keys as input, remembers when each event occurred, and says whether a given event should be accepted or skipped. To avoid accidentally filling up memory, it also cleans up its internal containers itself. The logger uses the certificate’s serial number as the key, so within a given period (1 hour by default) at most one message per certificate is logged.
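A limiter of this kind can be sketched in a few lines. This is an illustration under assumptions and not the library's EventLimiter: the class name SimpleEventLimiter, its methods, and the explicit nowMillis parameter (used instead of a clock to keep the example deterministic) are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-key limiter: at most one accepted event per key per period.
final class SimpleEventLimiter<K> {
    private final long periodMillis;
    private final Map<K, Long> lastAccepted = new ConcurrentHashMap<>();

    SimpleEventLimiter(long periodMillis) {
        this.periodMillis = periodMillis;
    }

    // Returns true if the event should be reported, false if it should be skipped.
    boolean tryAccept(K key, long nowMillis) {
        boolean[] accepted = {false};
        // compute() runs atomically per key, so concurrent callers cannot both win.
        lastAccepted.compute(key, (k, last) -> {
            if (last == null || nowMillis - last >= periodMillis) {
                accepted[0] = true;
                return nowMillis;
            }
            return last;
        });
        return accepted[0];
    }

    // Drops entries older than one period so the map cannot grow without bound.
    void cleanup(long nowMillis) {
        lastAccepted.values().removeIf(last -> nowMillis - last >= periodMillis);
    }

    int size() {
        return lastAccepted.size();
    }
}
```

With a period of one hour and the certificate serial number as the key, the first sighting of a certificate is reported and later sightings within the same hour are skipped.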

Advanced diagnostics

Often, simply reporting a problem with a certificate is not enough; you also need to understand where the certificate is installed. For example, a wildcard certificate may be installed on hundreds of servers; it seems to have been updated everywhere, yet the logs still say an old copy is in use somewhere. We would find it quickly once it expires and some microservice breaks, but our task is to prevent exactly that. Therefore, additional diagnostics are written to the log. Here is how they are retrieved from the socket:

private static String getConnectionInfo(Socket socket, SSLEngine engine) {
    var socketInfo = Optional.ofNullable(socket)
            .map(s -> String.format(Locale.ROOT, "%s:%d:%s:%d", s.getLocalAddress().getHostAddress(),
                    s.getLocalPort(), s.getInetAddress().getHostAddress(), s.getPort()))
            .orElse("");
    var engineInfo = Optional.ofNullable(engine)
            .map(s -> String.format(Locale.ROOT, "%s:%d", s.getPeerHost(), s.getPeerPort()))
            .orElse("");
    return socketInfo + engineInfo;
}

At first I threw an exception at myself and saved the stack trace to see where the event occurred, but that method is too expensive, and it became pointless once the L3 (network-level) information was available.

All to no avail

So now, for a whole month, a message saying it is time to replace the certificate appears in our logs once an hour. But who reads them? I had to take certain measures. Look at this list to appreciate the depth of the problem:

  1. At the INFO level, the message starts appearing 90 days in advance. Let’s Encrypt, with its 90-day certificates, would not tolerate this.

  2. At the WARNING level, the message starts appearing 30 days in advance.

  3. At the ERROR level, it is written to the log starting 7 days in advance.

  4. Do you think that is all? Think again.
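The escalation in the first three items can be expressed as a small function. This is a hedged sketch: ExpirySeverity and forDaysLeft are hypothetical names, and treating the thresholds as strict "fewer than N days left" comparisons is an assumption modeled on the 30-day check shown earlier.

```java
import java.util.Optional;

// Hypothetical mapping from days remaining to a log level.
final class ExpirySeverity {
    enum Level { INFO, WARNING, ERROR }

    // Returns the level to log at, or empty if the expiry date is still far away.
    static Optional<Level> forDaysLeft(long daysLeft) {
        if (daysLeft < 7) {
            return Optional.of(Level.ERROR);
        }
        if (daysLeft < 30) {
            return Optional.of(Level.WARNING);
        }
        if (daysLeft < 90) {
            return Optional.of(Level.INFO);
        }
        return Optional.empty();
    }
}
```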

Finishing touch

It’s an amazing thing: in our country a whole generation of consumers has grown up who … [care about] the shelf life of products to the day. In Soviet times, expiration dates were indicated only on dairy products and other perishables where hours counted (cakes, fudge).

Now a buyer picks up a jar of canned food that can last for three years and is indignant: “It’s overdue! It expired a week ago!”

It is curious that in the West, where consumers are loved more than here, they write “Best before”, which translated means “the product retains all its best qualities before”. Not “good to”, but “great to” or “best to” (if you can pronounce it).

Why do our manufacturers drive themselves into such Procrustean frameworks? Because someone approved this for the first time, and today any manufacturer who decided to rebel against consumer fascism would be declared a g… [bad person] and with… [bad person].

Can you eat cookies that are three days out of date?

https://tema.livejournal.com/1742206.html Artemy Lebedev

In principle, such logic has a right to exist for certificates too. Customers who place orders on our website should not have to care that the certificate on one of our internal servers expired yesterday.

Therefore, the omni-tls-starter module implements 3 operating modes:

  1. The standard STRICT mode. On any error, the connection is terminated.

  2. ALLOW_EXPIRED mode. In this mode, the notAfter field of the X.509 certificate takes on the semantics of “best before”. The message in the logs is then not “the certificate expires in -5 days,” as many would have written, but “such and such a certificate expired 5 days ago.”

  3. And finally, INSECURE mode. In principle, all programs have such a mode; the curl utility, for example, has the -k option. Unlike other programs, though, the omni-tls-starter module writes detailed diagnostics about all errors to the log. It is an ideal mode for debugging settings, a kind of ssllabs.com for the internal network.
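The three modes can be summarized as a decision table in code. This is only a sketch of the behavior described above: the class ValidationModes, the Problem enum, and both method names are hypothetical, and the real module may classify errors differently.

```java
import java.util.Locale;

// Hypothetical decision logic for the three modes described above.
final class ValidationModes {
    enum Mode { STRICT, ALLOW_EXPIRED, INSECURE }

    // Simplified classification of a validation result (assumption for this sketch).
    enum Problem { NONE, EXPIRED, OTHER }

    // Returns true if the connection may proceed under the given mode.
    static boolean allowsConnection(Mode mode, Problem problem) {
        switch (mode) {
            case STRICT:
                return problem == Problem.NONE;
            case ALLOW_EXPIRED:
                return problem != Problem.OTHER;
            case INSECURE:
                return true;
            default:
                throw new IllegalArgumentException("unknown mode: " + mode);
        }
    }

    // Words the message so that an expired certificate never "expires in -5 days".
    static String describeExpiry(String subject, long daysLeft) {
        if (daysLeft < 0) {
            return String.format(Locale.ROOT, "certificate %s expired %d days ago", subject, -daysLeft);
        }
        return String.format(Locale.ROOT, "certificate %s expires in %d days", subject, daysLeft);
    }
}
```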

In short

Anyone who wants to is welcome to use the library. I think many people may find both the library itself and the ideas implemented in it useful.

In general, once I had this library, it turned out that many interesting additional checks could be built into it. This article does not cover all of them. Some can be found in my repository on GitHub.
