Lessons learned: Event sourcing & PostgreSQL
This is part of a series about lessons I learn week by week. This time, we will talk about event sourcing.
For the last 6 months, I have been working on projects that adopt the event sourcing pattern. The main technologies are Node.js, PostgreSQL, and Redux. In this article, I would like to note some of the evolution decisions and the reasons behind them.
Before starting, let's take a look at the state of the project when I joined.
The server receives a request to book an order. At this step, it performs some actions: authentication, data verification and normalization, etc.
In the second step, the bookOrder command is invoked with the normalized data and context. The command runs its business checks before handling. It then makes an API call to a 3rd party (the payment provider) to record usage for later payment calculation. If the recording fails, the server responds with a 500 to the client.
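To make that flow concrete, here is a minimal sketch of what such a command could look like. The names, types, and checks are hypothetical, not the project's actual code:

```typescript
// Hypothetical sketch of the original synchronous bookOrder command (not the project's actual code).

interface OrderData {
  orderId: string;
  amount: number;
}

interface Context {
  userId: string;
}

// Placeholder for the 3rd-party payment provider call (assumed endpoint and shape).
async function recordUsage(data: OrderData): Promise<void> {
  // e.g. POST the usage record to the payment provider here
}

async function bookOrder(data: OrderData, ctx: Context): Promise<{ status: number }> {
  // Business checks run before the command is handled.
  if (data.amount <= 0) {
    return { status: 400 };
  }

  try {
    // The request blocks on the payment provider call.
    await recordUsage(data);
  } catch {
    // If recording usage fails, the client gets a 500 straight away.
    return { status: 500 };
  }

  // Events are persisted after this point (next step).
  return { status: 201 };
}
```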
Then the system persists events to the events table in the database (PostgreSQL).
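A sketch of that append step, assuming a simple events table and the node-postgres client (the schema here is an assumption):

```typescript
// Hypothetical sketch: appending an event to the events table with node-postgres.
import { Pool } from "pg";

// Connection settings are read from the standard PG* environment variables.
const pool = new Pool();

interface DomainEvent {
  type: string;
  payload: unknown;
}

// Assumed schema: events(id bigserial primary key, type text, payload jsonb, created_at timestamptz default now())
async function appendEvent(event: DomainEvent): Promise<void> {
  await pool.query("INSERT INTO events (type, payload) VALUES ($1, $2)", [
    event.type,
    JSON.stringify(event.payload),
  ]);
}
```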
Application nodes listen on a specific channel in the database. When a new event is appended to the table, PostgreSQL notifies the nodes.
The project uses Redux, an in-memory state management library, to manage local state. When a node receives an event from the database, a corresponding reducer updates the local state.
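Roughly, each node does something like the sketch below: LISTEN on a channel and dispatch the notified event into a Redux store. The channel name, reducer, and event shape are all assumptions:

```typescript
// Hypothetical sketch: keeping the in-memory Redux state in sync via LISTEN/NOTIFY.
import { Client } from "pg";
import { createStore } from "redux";

interface OrdersState {
  orders: Record<string, unknown>;
}

const initialState: OrdersState = { orders: {} };

// A reducer that applies persisted events to the local state.
function ordersReducer(
  state: OrdersState = initialState,
  action: { type: string; payload?: any }
): OrdersState {
  switch (action.type) {
    case "ORDER_BOOKED":
      return { orders: { ...state.orders, [action.payload.orderId]: action.payload } };
    default:
      return state;
  }
}

const store = createStore(ordersReducer);

async function subscribe(): Promise<void> {
  const client = new Client();
  await client.connect();
  await client.query("LISTEN events_channel");

  client.on("notification", (msg) => {
    // The notification payload is assumed to carry the serialized event.
    const event = JSON.parse(msg.payload ?? "{}");
    store.dispatch({ type: event.type, payload: event.payload });
  });
}

subscribe().catch(console.error);
```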
This architecture worked pretty well at the beginning, until the system reached a larger volume of data and (concurrent) transactions. I will explain the issues and the solutions for each of the red circles below.
Idempotency
We encountered incidents caused by duplicated orders, which were created by retried requests from other services or 3rd parties. The issue was resolved by persisting inbox message IDs: for a duplicate, the API handler returns OK with a specific outcome to indicate that the message was already ingested.
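A minimal sketch of the deduplication, assuming an inbox table keyed by the message ID (table and function names are mine, not the project's):

```typescript
// Hypothetical sketch: deduplicating retried requests with a persisted inbox.
import { Pool } from "pg";

const pool = new Pool();

// Assumed schema: inbox_messages(message_id text primary key, received_at timestamptz default now())
async function tryIngest(messageId: string): Promise<boolean> {
  const result = await pool.query(
    "INSERT INTO inbox_messages (message_id) VALUES ($1) ON CONFLICT (message_id) DO NOTHING",
    [messageId]
  );
  // rowCount is 0 when the message ID already exists, i.e. the request is a retry.
  return result.rowCount === 1;
}

async function handleBooking(messageId: string): Promise<{ status: number; outcome: string }> {
  if (!(await tryIngest(messageId))) {
    // Duplicate: still respond OK, with an explicit outcome for the caller.
    return { status: 200, outcome: "already_ingested" };
  }
  // ... invoke the bookOrder command as usual ...
  return { status: 200, outcome: "processed" };
}
```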
Pros:
- Ensures the same message won't be handled more than once
Cons:
- Doesn't ensure a message will be handled at least once
- Additional write queries to the database, which can fail
Make the system reliable
At steps 3.1 and 3.2, the response time and result depend on 3rd-party services that are out of our control (network, rate limiting, availability, etc.). The idea is to record changes in the system and respond to clients as soon as possible; the async requests can be done in the background, and clients are notified when they are done.
As you can see, commands now persist events to the database and publish an event to Pub/Sub. By using the Pub/Sub mechanism, the system can:
- retry if a request to a 3rd party fails
- raise alerts if there are UNACKED messages, so we can debug and handle them properly
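Concretely, the payment-provider call moves out of the request path into a subscriber that can retry. The article does not name a specific Pub/Sub product, so the interfaces below are deliberately abstract:

```typescript
// Hypothetical sketch: the 3rd-party call handled by a Pub/Sub subscriber with retries.

interface QueueMessage {
  event: { type: string; payload: any };
  ack(): void;  // mark as processed
  nack(): void; // leave unacked so the broker redelivers it
}

// Abstract subscription; the real project would use a concrete Pub/Sub client here.
interface Subscription {
  onMessage(handler: (msg: QueueMessage) => Promise<void>): void;
}

// Same payment-provider call as before, now performed in the background.
async function recordUsage(payload: any): Promise<void> {
  // POST the usage record to the payment provider
}

function startUsageWorker(subscription: Subscription): void {
  subscription.onMessage(async (msg) => {
    try {
      await recordUsage(msg.event.payload);
      msg.ack();
    } catch (err) {
      // Leaving the message unacked lets the broker redeliver it, and a pile-up of
      // UNACKED messages is what triggers the alerting mentioned above.
      msg.nack();
    }
  });
}
```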
Polling instead of PG Notify
The last red circle is about pg_notify. This is an awesome feature, but it does not guarantee delivery: if it fails to deliver a message to a node, it simply skips that message. Therefore, instead of reactively receiving changes from the database, the server proactively polls for new events.
Pros:
- Ensures the local states of all nodes are eventually identical
Cons:
- Puts more pressure on the database, since nodes keep asking whether there are new events
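A sketch of the polling loop, assuming a monotonically increasing id column as the cursor (the interval and batch size are arbitrary):

```typescript
// Hypothetical sketch: each node polls for events it has not applied yet.
import { Pool } from "pg";

const pool = new Pool();

// In the real system this cursor would be the id of the last event applied to local state.
let lastSeenId = 0;

function applyEvent(event: { type: string; payload: any }): void {
  // store.dispatch(event) in the real code, reusing the same reducers as before.
}

async function pollNewEvents(): Promise<void> {
  const { rows } = await pool.query(
    "SELECT id, type, payload FROM events WHERE id > $1 ORDER BY id ASC LIMIT 100",
    [lastSeenId]
  );
  for (const row of rows) {
    applyEvent({ type: row.type, payload: row.payload });
    lastSeenId = row.id;
  }
}

// Poll on a short interval instead of relying on NOTIFY delivery.
setInterval(() => {
  pollNewEvents().catch(console.error);
}, 1000);
```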
BRIN index
As you can see from the cons of the polling mechanism, we need to improve query performance. Based on the characteristics of the events table:
- append-only
- created_at timestamps increase sequentially
- high volume of data
Given these characteristics, a BRIN index is a good alternative to the B-tree. There is an awesome article explaining what a BRIN index is and when to use it: When does the BRIN index win?
TL;DR
The BRIN index can be a useful alternative to the BTree, for specific cases:
For tables with an "insert only" data pattern, and a correlated column (like a timestamp)
For use cases with very large tables (too large for Btree) or very high velocity (expensive to update) or both
For queries that pull quite large (100K+) result sets from tables
The low cost to build and low memory footprint make a BRIN index an option worth looking at, particularly for larger tables with correlated column data
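For this events table, that translates into something like the following (the index name and the range query are illustrative):

```typescript
// Hypothetical sketch: a BRIN index on created_at for the append-only events table.
import { Pool } from "pg";

const pool = new Pool();

async function createBrinIndex(): Promise<void> {
  // created_at grows with insertion order, which is exactly the physical
  // correlation a BRIN index needs to be effective.
  await pool.query(
    "CREATE INDEX IF NOT EXISTS events_created_at_brin ON events USING brin (created_at)"
  );
}

// Range scans over recent events can then use the much smaller BRIN index.
async function eventsSince(since: Date) {
  const { rows } = await pool.query(
    "SELECT id, type, payload FROM events WHERE created_at >= $1 ORDER BY id",
    [since]
  );
  return rows;
}
```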
Conclusion
So far, I've introduced the architectural decisions and the reasons behind them.
- Ensure idempotency when handling messages
- Make the system reliable by reducing dependency on 3rd-party services
- PG Notify doesn't guarantee delivery, so don't rely on it alone to keep state in sync
- The BRIN index is a useful alternative to the B-tree in specific event sourcing cases
An important lesson is to avoid trying to anticipate too many issues in advance. Instead, focus on delivering good enough solutions and launching products as soon as possible. Over time, user feedback and system data will provide opportunities to fine-tune our approach. By prioritizing issues based on their frequency and severity, we can accurately diagnose the root causes of problems before proposing effective solutions and tools.
As the business grows, the number of users and traffic will increase. Therefore, the system needs to be continually tuned and developed to keep up with the demand. Stay tuned for more updates in future articles. Keep calm and enjoy programming! 😊