Yes, this sort of explains my situation. A requirement appears down the line that just completely breaks the service boundaries.
An example being something like an online gun store: you have a perfect service that handles orders. It's completely isolated and works fine. But now, 2 years later, some local government asks you: "whenever someone buys a gun, you need to call into our webservice the moment the order is placed so we know a gun was sold, and you need to do it successfully, or you can't sell the item."
Now you've got a situation where you need an atomic operation: place the order and call the web service, or don't do either at all. You could just place the order, do the web service call asynchronously, and delete the order afterwards if the call fails. But you might not have that choice, depending on what the regulations say. And you can't make the call before you place the order, because what if payment fails?
The order service should not have any idea about blocking orders in specific scenarios. And now the architecture has broken down. Do you add this to the order service and break its single responsibility? Will this be a bigger problem in the future, and do you need to completely rearchitect your solution?
I would say this is another problem. If an external call to a web service is involved, then you can NEVER have an atomic call in the first place. One always needs to just have a state machine to navigate these cases.
Even with a monolith, what if you have a power-off at the wrong moment?
What you are describing here is pretty much the job description of a backend programmer to me -- think through and prepare for what happens if power disappears between code line N and code line N+1, in all situations.
In your specific example one would probably use a reserve/capture flow with the payment services provider: first get a reservation for the amount, then do the external webservice call, then finally do a capture call.
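The point of that ordering is that no money moves until the mandatory call has succeeded. A rough sketch, where `psp` and `regulator` are hypothetical clients standing in for whatever payment provider and government API you actually have:

```python
# Sketch of a reserve/capture flow wrapped around a mandatory external
# call. `psp` and `regulator` are hypothetical clients, not a real API.

def place_gun_order(psp, regulator, order):
    # 1. Reserve (authorize) the amount -- no money moves yet.
    reservation = psp.reserve(order.amount, order.card_token)
    try:
        # 2. Report the sale to the government webservice.
        regulator.report_sale(order.item_id, order.buyer_id)
    except Exception:
        # 3a. Reporting failed: release the reservation, refuse the sale.
        psp.release(reservation)
        raise
    # 3b. Only capture the money once the mandatory call succeeded.
    psp.capture(reservation)
```

If the process dies between reserve and capture, the reservation simply expires and no money was taken, which is exactly the failure mode you want.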
In our code we pretty much always write "I am about to call external webservice" to our database in one DB transaction (as an event), then call the external webservice, and finally, if we get a response, write "I am done calling external webservice" as an event. And then there's a background worker that sits and monitors for cases of "about-to-call events without a matching completed-event within 5 minutes", and takes the required actions to clean up.
If a monolith "solves" this problem then I would say the monolith is buggy. A monolith should also be able to always have a sudden power-off without misbehaving.
A power-off between line N and N+1 in a monolith is pretty much the same as a call between two microservices failing at the wrong moment. Not a qualitative difference, only a quantitative one (in that a power-off MAY be rarer than network errors).
Where the difference lies is in the things that an ACID database allows you to commit atomically (changes to your internal data either all happening or none happening).
Well that's the thing, isn't it. As soon as you move away from the atomicity of a relational database you can't guarantee anything. And then we, like you do too, resort to cleanup jobs everywhere trying to rectify problems.
I think that's one of the things people rarely think of when moving to microservices. Just how much effort needs to be made to rectify errors.
You can always guarantee atomicity. You will just have to implement it yourself (which is not easy, but always possible, unless there are conflicting requirements around performance and network distribution).
And yes, the cleanup jobs are part of how you implement it. But you shouldn't be "trying to rectify the problems", you should be rectifying the problems, with certainty.
Create the order, and set the status to pending.
Keep checking the web service until it allows the transaction.
Set the status to authorized and move on to payment.
Keep trying the payment until it succeeds, then officially place the order and set the status to ordered.
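Those steps amount to a small state machine. A rough sketch, where `authorize` and `pay` are hypothetical stand-ins for the government webservice and the payment provider (real code would persist each status change so a crash can resume from the last state):

```python
# Rough sketch of the pending -> authorized -> ordered flow above.
# `authorize` and `pay` are hypothetical callables that return True
# on success; the retry caps stand in for real backoff/retry policy.

def run_order(order, authorize, pay, max_tries=5):
    order["status"] = "pending"           # create the order as pending
    # Keep checking the webservice until it allows the transaction.
    for _ in range(max_tries):
        if authorize(order):
            order["status"] = "authorized"
            break
    else:
        order["status"] = "cancelled"     # never got authorization
        return order
    # Keep trying the payment until it succeeds.
    for _ in range(max_tries):
        if pay(order):
            order["status"] = "ordered"   # officially placed
            return order
    order["status"] = "cancelled"         # payment never went through
    return order
```

The status column is doing the work here: at any point a restarted process (or the cleanup worker) can look at the status and know exactly which step to retry or roll back.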
I really find it hard to believe that the regulations won't allow you to check the web service for authorization before creating the order. If that's really the case, then create the order and check, and if it doesn't work, cancel the order and retry. It's only a few rows in the database. If this happens often, then show the data to your local politician or whatnot and tell them they need to add more flexibility to the regulation.