Jason Sultana

Follow this space for writings (and ramblings) about interesting things related to software development.

Threadsafe is not enough

30 Mar 2021 » other

G’day guys!

Today I wanted to talk a little bit about a realisation that I had recently, that kind of goes against conventional wisdom or the apparent best practice of the industry. No doubt you’re not only familiar with the concept of being thread-safe, but (like myself) you also actively try to identify potential areas where thread-safety is needed and apply appropriate fixes before they turn into bugs. If you’re not familiar with thread safety, here’s a quick wiki article on the concept, another MSDN article on thread safety best practices, and if you do a quick Google search, you’ll find dozens more. I guess the point I’m trying to make here is that thread safety is a well-known, widely discussed concept - which I believe is no longer enough on its own. I’ll be arguing this from a backend webserver point-of-view, but I think that the arguments still hold up in some other areas as well.

In the past (say pre-2015 or so), I’d argue that it was a much simpler time as far as webservers were concerned. Containerisation wasn’t as common as it is today, the Cloud was still in its infancy (or not around at all, depending on how far back you go) and a well-engineered project would probably be something like an ASP.NET MVC application sitting on a single server. Ah, the good old days. Today, that same application will probably have been converted into a Web API with a separate SPA front-end, cut up into microservices, containerised, and placed in the Cloud in an AWS Auto-Scaling group, or whatever the vendor-specific alternative is. Once you’ve done this, it’s no longer enough to think in terms of threads, since there are quite likely multiple instances of an API (or backend process) running.

Consider the following pseudo-code, for example.

function increment_counter
    lock some_static_object
        var contents = load_from_file("counters.xml")
        var xml = parse_xml(contents)
        xml.hitCount += 1

        save_to_file("counters.xml", stringify_xml(xml))
    end lock
end function

This is a thread-safe function that increments a hit counter. It utilises a lock on a static object that’s shared by all threads in the program, meaning that if two threads call the function at the same time, one of them will obtain the lock, and the other will wait until the lock is released before it continues. Typical stuff here, nothing fancy. But it’s wrong - at least by today’s standards.
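If you'd rather see something runnable, here's roughly how that pseudo-code could look in Python. Treat it as a sketch only: the file layout (a `<counters>` root element with a `hitCount` attribute), the `COUNTER_FILE` path and the `_counter_lock` name are assumptions made for illustration.

```python
import threading
import xml.etree.ElementTree as ET

# Hypothetical path; in a real deployment this might sit on a shared file server.
COUNTER_FILE = "counters.xml"

# One lock object per process: every thread in THIS process shares it.
_counter_lock = threading.Lock()


def increment_counter(path=COUNTER_FILE):
    """Read-modify-write the hit count while holding the process-wide lock."""
    with _counter_lock:
        tree = ET.parse(path)
        root = tree.getroot()
        # Assumed layout: <counters hitCount="0"/>
        hits = int(root.get("hitCount", "0"))
        root.set("hitCount", str(hits + 1))
        tree.write(path)
```

Within a single process this does its job: only one thread at a time can be between the read and the write.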

Now consider some caveats about the above code. The XML file is stored on a central file server, accessed by a Web API. The API is run in a container, with multiple containers running on multiple virtual machines, all of them behind a load balancer in an Auto Scaling Group. As the load of the application grows, more VMs and containers are spun up. As the load drops in off-peak periods, the suite of containers and VMs scales back down. Since there are now multiple instances of the API running, each with its own copy of the lock, the fact that the function is thread-safe doesn’t help against the swarm of API instances attacking the XML file like they’re Uruk-hai marching on Helm’s Deep.
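To make the failure concrete, here's a tiny simulation, under some heavy assumptions: two `ApiInstance` objects stand in for two containers, a plain dict stands in for the shared file, and the unlucky interleaving is forced by hand rather than left to chance. All of the names are made up for illustration.

```python
import threading


class ApiInstance:
    """Stands in for one container: it has its OWN lock, like a separate process would."""

    def __init__(self, shared_store):
        self.lock = threading.Lock()  # private to this instance
        self.store = shared_store     # stands in for the shared XML file

    def read(self):
        return self.store["hitCount"]

    def write(self, value):
        self.store["hitCount"] = value


shared_store = {"hitCount": 0}
a = ApiInstance(shared_store)
b = ApiInstance(shared_store)

# Both instances "hold the lock" at once, because each lock only guards its own process.
with a.lock, b.lock:
    seen_by_a = a.read()    # A reads 0
    seen_by_b = b.read()    # B reads 0, before A has written anything
    a.write(seen_by_a + 1)  # A writes 1
    b.write(seen_by_b + 1)  # B also writes 1, overwriting A's update

print(shared_store["hitCount"])  # 1, even though two hits came in
```

Neither instance ever waited, and one hit was silently lost. That's the classic lost update, and no amount of in-process locking prevents it.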

Okay, maybe there’s not a swarm of them, and maybe an XML file on a file server wasn’t the most real-world example for this use case. Replace it with a caching service or a database instead - the problem is still there. As soon as there is more than one instance of your program running (which is almost always nowadays), being thread-safe is simply not enough. So what’s the answer, then?

Two Lord of the Rings memes in one post? You bet I did! In all seriousness though, greater minds can probably do better, but I can think of two solutions. Before we get to them though, I’d like to define a new term. Thread-safe describes an algorithm that has no race conditions, even when executed by multiple threads in the same program. Concurrency-safe (my team lead coined it) describes an algorithm that has no race conditions, even when executed by multiple threads in the same program, multiple instances of the same program, serverless functions, or any other asynchronous workflow that you can think of. Now, onto my promised solutions!

Option 1: Atomicity

I can hear you all saying duh!, but if you can make the process atomic, then it is inherently thread-safe and concurrency-safe. If you’re literally dealing with an XML file here, this may not be an option, since at some level of abstraction you’ll still need to open the file, read its contents, increment the count and write it back. But if you’re dealing with a more robust implementation like Redis, DynamoDB or some other database, chances are there’s an atomic option for you to consider.
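As a small illustration of the atomic route, here's a sketch using SQLite from Python's standard library. I've picked it only because it runs without any setup; the `counters` table, its columns and the `'home'` row are assumptions for the example. The increment happens inside a single `UPDATE` statement, so the database does the read-modify-write for you instead of your code doing it in three separate steps.

```python
import sqlite3


def open_counters(path):
    """Open (and if needed create) a hypothetical counters table."""
    conn = sqlite3.connect(path, timeout=5.0)  # wait up to 5s if another writer is busy
    conn.execute(
        "CREATE TABLE IF NOT EXISTS counters ("
        "name TEXT PRIMARY KEY, hits INTEGER NOT NULL)"
    )
    conn.execute("INSERT OR IGNORE INTO counters (name, hits) VALUES ('home', 0)")
    conn.commit()
    return conn


def increment_counter(conn, name="home"):
    # The database performs the read, add and write as one statement,
    # so no other instance can slip in between them with a stale value.
    with conn:  # commits on success, rolls back on error
        conn.execute("UPDATE counters SET hits = hits + 1 WHERE name = ?", (name,))


def read_counter(conn, name="home"):
    row = conn.execute("SELECT hits FROM counters WHERE name = ?", (name,)).fetchone()
    return row[0]
```

Redis and DynamoDB have their own equivalents (Redis has an `INCR` command, and DynamoDB supports an `ADD` action in update expressions), so it's worth checking what your data store offers before reaching for a lock.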

Option 2: Distributed Locks

The basic idea here is that instead of locking on a static object that’s stored on the heap within the address space of the executing program, you lock on some remote object that’s accessible by all instances of your program. A much deeper discussion on this is available at https://redis.io/topics/distlock. As the linked article explains, there are essentially two types of distributed locks: expiring and non-expiring.

Expiring locks will automatically expire after a certain timeframe, so they’re arguably a bit more flexible and fault-tolerant. If the data center that’s housing the machine that’s running the VM that’s hosting the container that’s running your program explodes in a glorious display of fire and brimstone, the lock will eventually expire and the rest of your system might still function, if it wasn’t already destroyed.
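Here's a minimal sketch of an expiring lock, to show the moving parts. It is not the algorithm from the Redis article; it assumes every instance can reach the same SQLite database file, and the `locks` table, `try_acquire` and `release` are names invented for the example. It also trusts each machine's wall clock, which a production-grade lock shouldn't do blindly.

```python
import sqlite3
import time
import uuid


def open_lock_store(path):
    """Open (and if needed create) a hypothetical shared lock table."""
    conn = sqlite3.connect(path, timeout=5.0)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS locks ("
        "name TEXT PRIMARY KEY, owner TEXT NOT NULL, expires_at REAL NOT NULL)"
    )
    conn.commit()
    return conn


def try_acquire(conn, name, ttl_seconds):
    """Return an owner token if the lock was obtained, otherwise None."""
    owner = uuid.uuid4().hex
    now = time.time()
    with conn:  # both statements run in one transaction
        # Clear out the lock if its previous holder let it expire.
        conn.execute(
            "DELETE FROM locks WHERE name = ? AND expires_at <= ?", (name, now)
        )
        # The primary key lets only one row per lock name exist.
        cur = conn.execute(
            "INSERT OR IGNORE INTO locks (name, owner, expires_at) VALUES (?, ?, ?)",
            (name, owner, now + ttl_seconds),
        )
    return owner if cur.rowcount == 1 else None


def release(conn, name, owner):
    """Release the lock, but only if we are still the one holding it."""
    with conn:
        conn.execute("DELETE FROM locks WHERE name = ? AND owner = ?", (name, owner))
```

A caller would do something like `token = try_acquire(conn, "hit-counter", 30)`, do its work only if `token` is not `None`, and then call `release(conn, "hit-counter", token)`. The owner token matters: without it, an instance whose lock had already expired could release a lock that now belongs to somebody else.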

Non-expiring locks are more bossy, er, assertive. Once a component obtains a lock, it holds it indefinitely until it is done - much like my wife with the TV remote. Here you’d need to have some kind of failsafe just in case the component did die or become non-responsive, such that the lock will get released at some point, so the rest of your system can continue to function.

In either case though, it’s important to have a strategy for distributed locking, so that you’re able to go beyond thread-safety and achieve concurrency-safety. In today’s world where containers, load balancers and auto scaling groups are now the norm, this can’t be an afterthought - it’s a necessity. And it needs to be better known.

Anyhoo, that’s all from me for now. Have you run into a concurrency-safety problem, or implemented a different solution to make a process concurrency-safe? Feel free to let me know in the comments!

Catch ya!