Preventing Race Condition When Testing Cast Requests in Elixir

2017-07-09

This post assumes understanding of how GenServer.cast/2 and GenServer.call/2 work.

TL;DR

Call a GenServer.call/2 function after GenServer.cast/2. This prevents your caller process from executing any more code until it receives a reply to the call/2 function from the receiver process. The receiver process handles and sends the reply to call/2 only after it has handled the message from cast/2. This can ensure sequential execution of code. If you need a generic function, :sys.get_state/1 can be used for this purpose.

Initial Solution

I discovered this issue when writing a simple test for cast/2 function. The test calls the function and then makes an assertion about its expected result. It failed because of a classic race condition case: the assertion was made before the cast/2 function was handled by the receiver process.

I had to find a way to ensure that the assertion is made only after cast/2 was handled. The crudest solution was to call Process.sleep/1 to wait for a specified amount of time between cast/2 and ExUnit.Assertions.assert/1. This did make the test pass, but I wanted to write something better than that.

Better Solution

The next solution I found was to call call/2 between cast/2 and assert/1. This makes use of how BEAM processes and their mailboxes work, and how call/2 works.

A process mailbox concurrently receives messages sent by other processes, but sequentially handles those messages. So a mailbox serves as a sort of synchronizing point for messages. For example, if a process receives messages A and B in that order, it will handle A first and then handle B.

Now remember that call/2 suspends the caller process until it receives the reply from the receiver process. Calling call/2 after cast/2 ensures that any code following call/2 will be executed only after the caller process receives the reply sent by the receiver process. And the receiver process will handle the message from cast/2 first, and then handle the message from call/2, then send the reply. This creates an order of execution specific enough to prevent race condition.

You can see an example from my toy project here. I was using ETS (Erlang Term Storage) as a cache, but needed a solution to a race condition because some functions for interacting with ETS were using cast/2.

Avoid Race Condition in the First Place

But maybe there’s an even better solution. If we can circumvent the whole race condition issue, we can get rid of the problem it causes. This can be done by just testing handle_cast/2 that always accompanies cast/2 function. Directly testing handle_cast/2 spares us from having to deal with message passing among processes and race conditions that it causes. I believe that this is the best approach for unit tests.

Unfortunately, that’s not always an option. Integration tests require testing interaction among multiple processes. In that case, it might be necessary to synchronize processes with call/2 functions to simulate how they are expected to work in production. On the other hand, sometimes we don’t have access to callback functions. In fact, most modules keep their callback functions private. In that case there is no other way but to test both the message passing and callback handling in a single test case.

An Afterthought

By design, each BEAM process could be running on separate machines. This means that message passing among BEAM processes should be able to deal with classic network problems like availability, disconnection, unresponsiveness, and so on. But my tests take none of these into accounts - they assume that everything will be alright, which is never the case in the real world.

Is this okay? I guess so in this simple case, because the processes run within the same BEAM instance. But if I do write more advanced distributed software, I suspect I will have to write tests for potential network issues too.