NATS Weekly #33

Byron Ruth

Published on Jul 4th, 2022

Week of June 27 - July 3, 2022

🗞 Announcements, writings, and projects

A short list of  announcements, blog posts, projects updates and other news.

️ 💁 News

News or announcements that don't fit into the other categories.

⚡Releases

Official releases from NATS repos and others in the ecosystem.

🎬 Media

Audio or video recordings about or referencing NATS.

📖 Articles

Blog posts, tutorials, or any other text-based content about NATS.

💬 Discussions

Github Discussions from various NATS repositories.

💡 Recently asked questions

Questions sourced from Slack, Twitter, or individuals. Responses and examples are in my own words, unless otherwise noted.

How can I make a pull consumer that is cleaned up after the subscriber disconnects?

As of NATS server v2.7.0 (released January 2022, at the time of this writing v2.8.4 is available), ephemeral pull consumers are supported.

Quick tip: As with many new server-side features, clients need to be updated to take advantage of new options or APIs. So which clients support it?? Fortunately each new significant feature has a corresponding ADR which tracks the client library updates.

An ephemeral consumer (push or pull) implicitly has one subscriber and exists only for the lifetime of the subscription. Once the consumer is determined inactive based on the lack of interaction with a subscription, the consumer is auto-deleted.

Ephemeral push consumers have been supported prior to v2.7.0, but with advantage of the pull consumer is the deliberate, client-managed flow control using batches of appropriate size for the use case.

For reference, here is how to create an ephemeral consumer in Go, assuming there is a stream which is bound to the subject foo.>.

sub, err := js.PullSubscribe("foo.>", "", nats.DeliverAll())
defer sub.Unsubscribe()

The notable difference from the typical use is the use of an empty string for the durable name. This indicates to the server the intention to create an ephemeral consumer that will be then be cleaned up after some period of (configurable) inactivity.

Note, that the sub.Unsubscribe() call is good practice to prevent resource leaks and deliberately unsubscribe, but if a program crashes, the server will still detect the inactivity and delete the consumer.

How can implement the dead-letter queue pattern with JetStream?

I wanted to capture the great information (as usual) by Todd Beets in a response on Slack.

NATS provides advisory subjects that are published to when various events occur, one of which is $JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES indicating a message was not successfully ack'ed after all delivery attempts. The result of this situation is that the message is simply skipped.

For some situations, these unprocessable messages should be inspected by an operator, typically out-of-band of the standard processing. A dead-letter queue (DLQ) is a dedicated stream that these messages can go with a separate workflow for fixing the underlying issue.

One approach could be to have a subscription on the advisory stream for this event and then lookup the skipped message in the original stream and doing something with it. This can work, however Todd note's an important point.

be aware that if the underlying stream is set to Work Queue or Interest-based policy, you can't look up the message by sequence ID that you get in the Advisory thus will not have access to the message body of the failed message.  Also, be aware that the Advisory message itself is emitted under At Most Once QoS.

Todd recommends two approaches:

  1. Don't have your workers process off of the original stream.  Leave that stream (limits policy) intact as an archive/longer-lived buffer.  MIRROR a Work Queue stream off the upstream and work off that.  Then you can always look up the message sequence ID in the Advisory off of the upstream.

or..

  1. Put code in your workers that checks the Num Deliveries metadata of each incoming message.  If it reaches a certain number, don't process it but re-publish in a dedicated "Dead Letter Stream" of your choice.  Set your Max Deliveries setting on the working consumer to a higher number than your code's DLQ threshold. [...] don't forget to ACK the original stream after your code has successfully published to DLQ...

What exactly does exactly-once mean?

The depth of this topic far exceeds a short Q&A format like this newsletter provides. However, there was a recent question on Stack Overflow on how to achieve this with NATS, so we will focus on a succinct response where the details of why can be a fun exercise for the reader. Thanks to Tomasz Pietrek and R.I. Pienaar for the discussion on Slack.

Terms like "exactly-once delivery", "exactly-once processing", or just "exactly once" are pretty common phrases that come up when designing for or using any kind of messaging solution. For those less familiar with these concepts, we can draw from two very common examples of where this desired guarantee arises every day: SMS/messaging/chat apps and email.

As an end user, how terrible would it be if you received the same text message multiple times? Or the same email? You expect to get the message or email at-most-once, but you also expect the underlying system to never lose anything either, so you requireat-least-once. The combination of these two guarantees result in the exactly-once expectation. If you here the term "quality of service" (QoS), these are the three levels.

Without getting into the weeds, achieving exactly-once delivery over a network is impossible at a technical level. It has been researched and discussed, ad nauseum, and proven impossible.

However, when most people talk about exactly-once, they don't care about how this is achieved, but rather this is the desired observed outcome from a user/business perspective, just like our text messaging/email example. All that said, as the person implementing these exactly-once semantics, it is useful to know what facilities JetStream provides to acheive this desired outcome.

If we think about the touchpoints of a message, there is the publisher of the message, the NATS server, and the subscription handler that processes the message. As the publisher we want to ensure that the message has been received (ack'ed) by the server. In the happy path, we can do a js.Publish(..) and get the ack back and all is well.

But what if we get a timeout or a network error? The natural thing to do is to retry the publish. The problem with networks and acks is that the server may have received the message and responded with the ack, but this message dropped. So the server actually does already have it and the publisher doesn't know this. For the at-least-once guarantee, we need to retry until we get an ack back.

To handle this specific use case of intermittent interruptions while publishing, deduplication within a time window can be used. This relies on a consistent and unique Nats-Msg-Id header value being set on published messages. The server will then ignore duplicate messages it has received.

Now that the message has been stored exactly once (within this allowed time window), how can we guarantee the message is delivered and processed only once?

Well similarily to how we couldn't guarantee publishing only once, we also can't gaurantee physical deliver of a message to a client (subscriber) only once. On receipt of the message, the subscriber must ack the message before the ack wait time has elapsed. If not, the server assumes it wasn't delivered or the subscriber failed, and it will redeliver the message.

A related subtlely is that even if the subscriber handled the message and sent the ack back to the server, that message could fail.. and we get into the same situation of redundant delivery. For subscribers, deduplication or, generally, idempotency needs to be used to deal with redeliveries if the message already resulted in a side effect.