Naive agent-based backends

Seeing the popularity of the Elmish architecture and attempts to implement backends based on it (or on any unsupervised cooperative actor loops), I thought I’d share my thoughts.

Desirable characteristics

Most naive implementations of synchronous (HTTP/RPC) APIs are perfectly content to return some equivalent of a 500 status code at the drop of a hat. My DB timed out? 500! My message broker node has crashed and I need to reconnect? 500! Etc. etc. You get the idea.
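In code, that pattern tends to look something like the following minimal sketch (the types and the handler are hypothetical and not tied to any particular web framework):

// hypothetical request/response types, for illustration only
type Request  = { OrderId : int }
type Response = { Status : int; Body : string }

// a naive synchronous handler: every transient failure becomes a 500,
// even though a reconnect or a short wait would very likely have fixed it
let handleRequest (loadOrder : int -> string) (req : Request) : Response =
    try
        { Status = 200; Body = loadOrder req.OrderId }   // may throw on a DB timeout
    with
    | :? System.TimeoutException                         // DB timed out? 500!
    | :? System.Net.Sockets.SocketException ->           // broker connection dropped? 500!
        { Status = 500; Body = "Internal Server Error" }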

But these and other kinds of recoverable problems happen all the time in distributed computing and in scenarios with unpredictable load. Remember, the network is unreliable, as is your cloud VM and the physical machine under it. If your service/agent is on the edge, the constrained device might be under load doing whatever its primary job is.

Here’s another consideration: what if your service is in the middle of a chain of APIs, as is common with microservices? Whether you return an error or not, what is the recovery strategy for the earlier nodes in the chain? Retry? Then something like this should look familiar:

[image from reddit]

What about timeout constraints higher up the call chain? At some point you’d end up returning the error all the way up the call chain, and not only is this overhead you could have avoided, it’s now your user’s problem.
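Here is a rough sketch of why blind retries don’t compose (the two services, the timings and the Result-returning calls are all made up for illustration): every hop adds its own back-off on top of the hops below it, and the combined delay easily blows past whatever deadline the original caller had.

// a hypothetical retry helper: keep calling until it succeeds or we give up
let retry attempts (delayMs : int) (call : unit -> Result<'a, string>) =
    let rec loop n =
        match call () with
        | Ok v -> Ok v
        | Error _ when n > 1 ->
            System.Threading.Thread.Sleep delayMs
            loop (n - 1)
        | Error e -> Error e
    loop attempts

// a stand-in for a database call that keeps timing out
let callDatabase () : Result<string, string> = Error "db timeout"

// service B retries its DB 3 times with a 200 ms pause, service A retries B
// 3 times with a 500 ms pause; in the worst case A now spends roughly 2.2 s
// in back-off alone (3 x 2 x 200 ms inside B plus 2 x 500 ms in A), and the
// error still comes out at the top, only later and at a higher cost
let serviceB () = retry 3 200 callDatabase
let serviceA () = retry 3 500 serviceB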

Conversely, is it OK to just drop the error? This is what unsupervised actors and the Elmish dispatch loop do. In Elmish, at least, the API tries to prepare you for dealing with the possibility of errors, but if something does slip through, the best we can do is log it.

And logging it is an example of data loss. The context of the call is gone, so the data that was passed in and is required to resolve the problem may or may not still be around, even if you find out about the problem at a later date. This might be OK in the UI, but on the backend/edge it can lead to business problems that are hard to diagnose and solve.
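To make the data loss concrete, here is a minimal sketch of an unsupervised dispatch loop in the spirit of Elmish (not its actual API; the message type and handlers are made up). Once the exception is caught and logged, the message that caused it is gone for good:

type Msg = PlaceOrder of orderId : int * amount : decimal

let dispatchLoop (update : Msg -> unit) =
    MailboxProcessor.Start(fun inbox ->
        let rec loop () = async {
            let! msg = inbox.Receive()
            try
                update msg
            with ex ->
                // all the data the message carried is lost here, along with
                // any chance of retrying or compensating for the failure
                printfn "update failed: %s" ex.Message
            return! loop () }
        loop ())

let agent = dispatchLoop (fun (PlaceOrder (id, _)) -> failwithf "charge failed for order %d" id)
agent.Post (PlaceOrder (42, 99.95m))   // the failure is logged, order 42 is never charged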

So, one desirable characteristic that’s different for backends and edge nodes is resilience. If we can recover from an error, we should do so without making it our client’s responsibility.

Achieving resilience

Yes, I’m going to talk about store-and-forward architecture, which may seem ironic considering the picture above comes from reddit – one of the most prominent users of RabbitMQ – but the fact is that reddit scales far better than most websites could.

The essential ingredients of resilience are asynchronous APIs (messaging) and the ack/nack functionality of the messaging infrastructure.

Asynchronous messaging allows us to change the problem of timing from “it’s time-sensitive” to “it’s time-relevant” – we just have to process the messages in a certain order, not on a strict timeline.

Ack/nack allows us to guarantee that a message is processed completely, which is what makes an asynchronous API trustworthy – we’ll never lose the message, or the fact that there might have been a problem processing it.

Ack/nack is also what makes it “fire-and-forget” – an often-misinterpreted analogy. Imagine if the military fired rockets thinking “whether it hits the target or not, we can forget about it once it’s fired”. No, of course we care what happens, and it’s because there are recovery mechanisms in place that we can kinda forget about it – mechanisms that logging and unsupervised actors don’t provide.
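Put together, a consumer built on these two ingredients looks roughly like the sketch below. The broker interface is hypothetical, but real messaging clients expose equivalent ack/nack operations:

type Delivery<'msg> = { Tag : uint64; Payload : 'msg }

type IBrokerChannel<'msg> =
    abstract Receive : unit -> Async<Delivery<'msg>>
    abstract Ack     : uint64 -> unit   // fully processed: remove it from the queue
    abstract Nack    : uint64 -> unit   // not processed: redeliver it (or dead-letter it)

let consume (channel : IBrokerChannel<'msg>) (handle : 'msg -> unit) = async {
    while true do
        let! delivery = channel.Receive()
        try
            handle delivery.Payload
            channel.Ack delivery.Tag    // acknowledged only once the work is actually done
        with _ ->
            channel.Nack delivery.Tag   // neither the message nor the failure is ever lost
    }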

Speaking of actors

If you read the original papers on actors, they were meant to be a reasoning tool in the face of high processing complexity. The original constraints got watered down with the ability to address any actor in the cluster, and the approach has become a tactical tool for relatively safe cooperative concurrency. However…

Your single actor node can process 300K messages/s? I’m not impressed. The number is meaningless not only because your problem doesn’t translate into mine, but also because your actors are unsupervised. Show me the supervised numbers, where an individual message can be shown to be fully processed and, when necessary, replayed (whether successfully or as part of a compensation workflow handling a non-recoverable error) – and, while we are at it, can it throttle the sources of events so that my downstream processing is not overwhelmed?
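Supervision is easier to show than to describe. Below is a toy sketch (the work-item type, the attempt counter and both handlers are made up): a failed message is replayed a bounded number of times and otherwise handed to a compensation step, never silently dropped. Frameworks like Storm/FsShelter achieve the same guarantee by tracking acks per tuple rather than by wrapping the handler like this.

type WorkItem = { Id : int; Attempts : int }

let supervised maxAttempts (handle : WorkItem -> unit) (compensate : WorkItem -> unit) =
    MailboxProcessor.Start(fun inbox ->
        let rec loop () = async {
            let! item = inbox.Receive()
            try
                handle item                    // fully processed, nothing more to do
            with
            | _ when item.Attempts < maxAttempts ->
                // recoverable failure: replay the same message
                inbox.Post({ item with Attempts = item.Attempts + 1 })
            | _ ->
                // non-recoverable: route it to the compensation workflow
                compensate item
            return! loop () }
        loop ())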

Higher-level abstractions

That brings me to “streaming” – modern-day abstractions that define the concepts of sources, data streams and sinks. But it doesn’t stop there: if you take a look at Apache Beam, you’ll see a standard defining not only these essentials but also higher-level operators like grouping, windowing and many others. The standard is implemented by several different data processing frameworks, so if we are going to build something resilient, let’s focus on stream processing.
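As a toy picture of the source/stream/sink shape (the event source and the sink below are stand-ins; real frameworks like Beam add windowing, triggers and checkpointing on top of operators like the grouping shown here):

let source () : seq<string * int> =
    Seq.ofList [ "clicks", 1; "views", 3; "clicks", 2 ]   // stand-in event source

let sink (key : string, total : int) =
    printfn "%s: %d" key total                            // stand-in sink

source ()
|> Seq.groupBy fst                                        // grouping operator
|> Seq.map (fun (key, events) -> key, events |> Seq.sumBy snd)
|> Seq.iter sink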

Edit: Conclusion

Elmish abstractions lack a way to implement ack/nack or to define a data stream, as well as any means of facilitating update parallelism. If none of that matters – for example, in a locally hosted node.js backend supporting user interaction – then you are OK. For a more robust implementation look elsewhere… maybe check out FsShelter, which I’ll be presenting at OpenFSharp in September.
