ARTICLE
Redesign long-running computation outside the transactional context
24 April 2023
Situation and challenges
There are many situations in which computations of high complexity put the IT infrastructure under stress. This is a common issue in the IT field: software written in the past often doesn’t scale well with the growing amount of data or operations.
The execution time for such computations can be lengthy, and this aspect can cause multiple problems:
- Increased cost of having the infrastructure up and running for many hours;
- Services that are waiting for the results of those computations are idle and consume resources;
- Connection or database timeouts caused by connections staying open for extended periods of time.
With more than 10 years on the market and a product that keeps growing to accommodate our customers’ needs, such challenges are not unknown to us. In this article, we will consider the following scenario: compute the schedule for a loan account with a large number of loan transactions.
Initially, the type of accounts that we supported had a relatively low number of transactions. This allowed us to compute the schedule for every account in a short amount of time. This all changed when a new type of account was added — the revolving account, which has, on average, thousands of times more transactions than the already-existing types of accounts.
A revolving loan is a loan that can be associated with a credit card, which the end customer uses for daily purchases. At the end of each month, the customer has to pay back the entire debt or a portion of it, depending on the terms of the loan contract. This means that, at the end of each month, a new round of computations needs to be performed for each loan account to show the customer how much they should pay: a new instalment is generated with this information. The collection of all instalments for a customer is called a “schedule.”
For a loan account with a large number of transactions, instalment generation is a processor- and memory-intensive operation due to the large number of mathematical calculations involved.
Computing the instalments of a loan account involves both read and write operations: the algorithm loads the loan transactions from the database, derives additional data where needed through computations, and at the end saves the result back to the database.
Our main challenge is to have consistent data after the schedule is computed. During schedule computation, an exclusive lock on the loan account is acquired to ensure that no new loan transactions are logged. After storing the schedule, additional processes are executed for the loan account to ensure consistency. The new loan account state is stored and the exclusive lock is released, permitting other actions on the loan account. This approach was no longer viable for revolving accounts with a high number of loan transactions, where the schedule computation took longer than the maximum duration of a database connection.
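To make the constraint concrete, here is a minimal sketch of how such a flow looks when everything runs inside a single transaction. The class, helper methods and domain types below are hypothetical placeholders, not our actual code; the point is that the exclusive lock and the open database connection span the entire computation.

```java
import javax.jdo.PersistenceManager;
import javax.jdo.Transaction;

public class LegacyScheduleJob {

    /** Placeholder domain types, for illustration only. */
    static class LoanAccount {}
    static class Schedule {}

    private final PersistenceManager pm;

    public LegacyScheduleJob(PersistenceManager pm) {
        this.pm = pm;
    }

    public void run(long loanAccountId) {
        Transaction tx = pm.currentTransaction();
        tx.begin();
        try {
            // Exclusive lock prevents new loan transactions from being logged
            // while the schedule is recomputed (hypothetical helper).
            LoanAccount account = loadAndLockAccount(loanAccountId);

            // CPU- and memory-intensive part; for revolving accounts this can
            // outlive the maximum duration of a database connection.
            Schedule schedule = computeSchedule(account);

            pm.makePersistent(schedule);
            runConsistencyProcesses(account);  // e.g. loan account appraisal
            tx.commit();                       // lock released only here
        } finally {
            if (tx.isActive()) {
                tx.rollback();
            }
        }
    }

    // Placeholder helpers, for illustration only.
    private LoanAccount loadAndLockAccount(long id) { return new LoanAccount(); }
    private Schedule computeSchedule(LoanAccount account) { return new Schedule(); }
    private void runConsistencyProcesses(LoanAccount account) {}
}
```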
Solution
Addressing the challenges in computing the schedule of a loan account was not an easy task. After assessing multiple solutions, we concluded that extracting the schedule computation outside the transactional context was the best trade-off.
We deployed a strategy for tackling this challenge in four steps:
- Isolate schedule computations from other operations, especially database interactions;
- Extract schedule computation as a separate unit of work from the collection of all loan accounts end-of-day processes;
- Execute schedule computations outside the transactional context;
- Ensure data integrity after the schedule is computed.
Isolating the schedule computations from other operations was a challenging refactoring that required a long period of time. The main idea was not to have any interactions with other systems (storage engine, notifications, etc.) when computing the schedule. This way the schedule computation becomes a distinct domain in our architecture. This domain has its own models — more specialised ones that can be used seamlessly within schedule computation algorithms. The communication between the schedule computation domain and other domains is done via an anti-corruption layer that converts existing models (mainly JDO models) into schedule computation models and vice versa.
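As a rough illustration of this idea, assuming hypothetical entity and model names, the anti-corruption layer can be thought of as a translator between persistence models and schedule computation models:

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.List;
import java.util.stream.Collectors;

public class ScheduleModelTranslator {

    /** JDO-mapped entity as seen by the rest of the system (placeholder). */
    public static class LoanTransactionEntity {
        public BigDecimal amount;
        public LocalDate bookingDate;
    }

    /** Immutable model used only inside the schedule computation domain. */
    public record ScheduleTransaction(BigDecimal amount, LocalDate bookingDate) {}

    /** Persistence model -> computation model; the reverse mapping is analogous. */
    public List<ScheduleTransaction> toComputationModel(List<LoanTransactionEntity> entities) {
        return entities.stream()
                .map(e -> new ScheduleTransaction(e.amount, e.bookingDate))
                .collect(Collectors.toList());
    }
}
```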
Having the distinct domain of schedule computation enabled us to progress with further refactoring. Initially, every loan account was processed each day by multiple jobs, including the one that computes the schedule. All jobs were executed in the same JDO transaction, so if one job failed, the changes were reverted. We’ve extracted the schedule computation job and created a new category of jobs that should run independently. The schedule computation job was now part of this category, and so it was independent, adhering to the unit of work pattern.
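A simplified sketch of this split, using hypothetical interface names: the schedule computation job moves into a category of jobs that own their unit of work, so a failure there no longer reverts, and is not reverted by, the other end-of-day jobs.

```java
/** Common contract for all end-of-day jobs (hypothetical names). */
public interface EndOfDayJob {
    void process(long loanAccountId);
}

/** Jobs that still share the single per-account JDO transaction. */
interface SharedTransactionJob extends EndOfDayJob {}

/** Jobs that manage their own unit of work, independently of the others. */
interface IndependentJob extends EndOfDayJob {}

/** The schedule computation now belongs to the independent category. */
class ScheduleComputationJob implements IndependentJob {
    @Override
    public void process(long loanAccountId) {
        // read inputs, compute the schedule in memory, store the results,
        // each persistence step in its own dedicated JDO transaction
    }
}
```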
This separation of the schedule computation as an independent unit facilitated its refactoring. We first isolated the interactions with the database in two methods: one that reads the initial data, and one that stores the results. Between those methods the actual schedule computation takes place using the data that is already loaded in memory.
Having the computations use only data already loaded in memory, we could further refactor the code to restrict the interactions with external parties. Before the actual computation is executed, all of the input data is retrieved in a dedicated JDO transaction. The same applies to saving the results of the schedule computation: this is done in a new JDO transaction. This way, we clearly separate the code that interacts with external systems from the code that actually computes the schedule. No lock is held on the loan account while the schedule is computed.
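Below is a minimal sketch of the resulting three-phase flow, assuming hypothetical type and method names: a short JDO transaction that only reads the input data, a pure in-memory computation during which no connection or lock is held, and another short JDO transaction that only stores the result.

```java
import java.util.function.Function;

import javax.jdo.PersistenceManager;
import javax.jdo.PersistenceManagerFactory;
import javax.jdo.Transaction;

public class ScheduleComputationUnitOfWork {

    // Placeholder input/result types, for illustration only.
    record ScheduleInput() {}
    record ScheduleResult() {}

    private final PersistenceManagerFactory pmf;

    public ScheduleComputationUnitOfWork(PersistenceManagerFactory pmf) {
        this.pmf = pmf;
    }

    public void execute(long loanAccountId) {
        // Phase 1: short transaction that only reads the input data.
        ScheduleInput input = inTransaction(pm -> readInput(pm, loanAccountId));

        // Phase 2: pure in-memory computation; no connection or lock is held.
        ScheduleResult result = computeSchedule(input);

        // Phase 3: short transaction that only stores the result.
        inTransaction(pm -> { storeResult(pm, result); return null; });
    }

    // Runs a piece of work in its own, short-lived JDO transaction.
    private <T> T inTransaction(Function<PersistenceManager, T> work) {
        PersistenceManager pm = pmf.getPersistenceManager();
        Transaction tx = pm.currentTransaction();
        try {
            tx.begin();
            T value = work.apply(pm);
            tx.commit();
            return value;
        } finally {
            if (tx.isActive()) {
                tx.rollback();
            }
            pm.close();
        }
    }

    // Placeholder persistence and computation steps, for illustration only.
    private ScheduleInput readInput(PersistenceManager pm, long loanAccountId) { return new ScheduleInput(); }
    private ScheduleResult computeSchedule(ScheduleInput input) { return new ScheduleResult(); }
    private void storeResult(PersistenceManager pm, ScheduleResult result) {}
}
```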
The next problem was ensuring data integrity when computing schedules outside the transactional context. Our solution was to use optimistic locking and validate data integrity before storing the result. Defining the integrity condition was challenging because it has to guarantee that the computation results can be stored safely. To achieve this, we used the transactions on the loan account as the primary source of truth, giving us a safe way to check whether any changes were performed on the loan account in the meantime.
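One way to express such an integrity condition, with hypothetical type and field names, is to capture a fingerprint of the loan transactions when the inputs are read and compare it with the current database state just before storing the result:

```java
import java.util.List;
import java.util.Objects;

public class ScheduleIntegrityCheck {

    /** Identifier plus version of a persisted loan transaction (placeholder). */
    public record TransactionStamp(long id, long version) {}

    /**
     * The loan transactions are the primary source of truth: the result may be
     * stored only if the set observed when the inputs were read is still
     * identical to what is currently in the database.
     */
    public boolean isStillConsistent(List<TransactionStamp> observedAtRead,
                                     List<TransactionStamp> currentInDatabase) {
        return Objects.equals(observedAtRead, currentInDatabase);
    }
}
```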
Another aspect of data integrity is the synchronisation between multiple entities, in our case between the loan account and its schedule. The schedule and the loan account balances are updated simultaneously. To maintain this data integrity constraint, we execute the loan account appraisal process after storing the new schedule, before closing the JDO transaction. If the data integrity verification fails, we retry computing the schedule with the newest data, but at most twice. If the recomputed schedule still cannot be stored because of data integrity issues, the loan account is logged to be investigated manually by an engineer.
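The retry behaviour could be sketched roughly as follows; the types and helper methods are hypothetical placeholders, and storing the schedule, running the appraisal and verifying integrity are assumed to happen together inside one JDO transaction behind storeIfConsistent.

```java
public class ScheduleRetryPolicy {

    /** Two retries on top of the initial attempt, as described above. */
    private static final int MAX_RETRIES = 2;

    public void computeSchedule(long loanAccountId) {
        for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
            // Always recompute with the newest data on every attempt.
            ScheduleInput input = readInput(loanAccountId);
            ScheduleResult result = compute(input);

            // Store the schedule, run the loan account appraisal and verify
            // integrity in a single JDO transaction; true means it committed.
            if (storeIfConsistent(loanAccountId, input, result)) {
                return;
            }
        }
        // Integrity could not be guaranteed: hand the account over to an engineer.
        flagForManualInvestigation(loanAccountId);
    }

    // Placeholder types and helpers, for illustration only.
    record ScheduleInput() {}
    record ScheduleResult() {}
    private ScheduleInput readInput(long loanAccountId) { return new ScheduleInput(); }
    private ScheduleResult compute(ScheduleInput input) { return new ScheduleResult(); }
    private boolean storeIfConsistent(long id, ScheduleInput in, ScheduleResult r) { return true; }
    private void flagForManualInvestigation(long loanAccountId) { /* log for manual review */ }
}
```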
When extracting a functionality outside its original processing context, it is mandatory to ensure the data integrity of the new functionality. With the steps above covered, the long-running schedule computation process is executed in an isolated way, so our objectives are accomplished.
Results
The extraction of the schedule computation brought major benefits to the overall reliability of our business processes. We managed to correctly compute the schedule for all loan accounts (more than 500k) of one of our main customers, and the daily appraisal process finished successfully. The failure rate due to timeouts dropped to almost zero, increasing the reliability of the system.
The change also uncovered other hidden issues and improved the observability of the system: the team gained better visibility into the correctness of the schedule computation algorithm.
The system’s availability also increased: because the exclusive lock on loan accounts was removed during schedule computation, customers could register new actions on their loan accounts while the system took the time to process the schedule.
Extracting long-running computations outside main business processes can benefit reliability, availability and correctness. These key principles are a must in any core banking engine.