> It’s also common to find long-running idle queries in PostgreSQL. Configuring timeouts like idle_in_transaction_session_timeout is essential to prevent them from blocking autovacuum.
Idle transactions have been a huge footgun at $DAYJOB… our code base is full of “connect, start a transaction, do work, if successful, commit.” It means you’re consuming a connection slot for all work, even while you’re not using the database, and not releasing it until you’re done. We had to bump the Postgres connection limits by an order of magnitude, multiple times, and before you know it Postgres takes up more RAM than anything else just to support the number of connections we need.
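To make the shape of the problem concrete, here's a minimal sketch of the pattern using sqlx (the function names and the query are made up, not our actual code):

```rust
use sqlx::PgPool;

// Hypothetical stand-in for a slow, non-database call (HTTP, RPC, etc.)
async fn call_slow_external_service() -> String {
    // imagine ~500ms of network latency here
    "response".to_string()
}

// The anti-pattern: the transaction (and its pooled connection) is held
// for the whole function, including the slow call that never touches Postgres.
async fn handle_request(pool: &PgPool) -> Result<(), sqlx::Error> {
    let mut txn = pool.begin().await?; // connection checked out here

    let result = call_slow_external_service().await; // connection sits idle in a txn

    sqlx::query("INSERT INTO results (payload) VALUES ($1)")
        .bind(&result)
        .execute(&mut *txn)
        .await?;

    txn.commit().await // connection only released now
}
```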
The problem permeated enough of our (Rust) codebase that I had to come up with a compile-time check that makes sure you’re not awaiting any async functions while a Postgres connection is in your scope. Using the .await keyword on an async function call, but not passing the Postgres connection to that function, ends up being a nearly perfect proxy for “doing unrelated work while not releasing a connection”. It worked extremely well: the compiler now just straight up tells us where we’re doing it wrong (in 100+ places, in fact).
Actually getting away from that pattern has been the hard part, but we’re almost rid of every place we’re doing it, and I can now run with a 32-connection pool in load testing instead of a 10,000-connection pool with no real slowdown. (Not that we’d go that low in production, but it’s nice to know we can!)
Just decreasing the timeout for idle transactions would have probably been the backup option, but some of the code that holds long transactions is very rarely hit, and it would have taken a lot of testing to eliminate all of it if we didn’t have the static check.
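For reference, if you do go the timeout route, one way to apply it is per connection as the pool opens them. A sketch using sqlx’s after_connect hook; the 30-second value is purely illustrative, and you can also set the same parameter server-wide in postgresql.conf:

```rust
use sqlx::postgres::PgPoolOptions;
use sqlx::Executor;

async fn make_pool(url: &str) -> Result<sqlx::PgPool, sqlx::Error> {
    PgPoolOptions::new()
        .max_connections(32)
        .after_connect(|conn, _meta| Box::pin(async move {
            // Have Postgres kill any session that sits idle inside a
            // transaction for too long. 30s is an example, not a recommendation.
            conn.execute("SET idle_in_transaction_session_timeout = '30s'")
                .await?;
            Ok(())
        }))
        .connect(url)
        .await
}
```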
Why don’t you change the order to “do work, if successful, grab a connection from the Postgres connection pool, start a transaction, do the database writes, commit, release the connection to the pool”?
That’s what we should do, yes. The problem is that we were just sorta careless with interleaving database calls in with the “work” we were doing. So that function that calls that slow external service, also takes a &PgConnection as an argument, because it wants to bump a timestamp in a table somewhere after the call is complete. Which means you need to already have a connection open to even call that function, etc etc.
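In code, the entanglement looks roughly like this (hypothetical names; sqlx actually wants &mut PgConnection):

```rust
use sqlx::PgConnection;

// Hypothetical example of the entanglement: the slow external call and the
// timestamp bump are fused into one function, so every caller must already
// hold a connection just to do the non-database work.
async fn sync_widget(conn: &mut PgConnection, widget_id: i64) -> Result<(), sqlx::Error> {
    call_slow_external_service(widget_id).await; // connection idle during this

    sqlx::query("UPDATE widgets SET synced_at = now() WHERE id = $1")
        .bind(widget_id)
        .execute(conn)
        .await?;
    Ok(())
}

async fn call_slow_external_service(_widget_id: i64) { /* HTTP call */ }
```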
If the codebase is large, and full of that kind of pattern (interleaving db writes with other work), the compiler plugin is nice for (a) giving you a TODO list of all the places you’re doing it wrong, and (b) preventing any new code from doing this while you’re fixing all the existing cases.
One idea was to bulk-replace everything so that we pass a reference to the pool itself around, instead of a checked-out connection/transaction, and then we would only use a connection for each query on-demand, but that’s dangerous… some of these functions are doing writes, and you may be relying on transaction rollback behavior if something fails. So if you were doing 3 pieces of “work” with a single db transaction before, and the third one failed, the transaction was getting rolled back for all 3. But if you split that into 3 different short-lived connections, now only the last of the 3 db operations is rolled back. So you can’t just find/replace, you need to go through and consider how to re-order the code so that the database calls happen “logically last”, but are still grouped together into a single transaction as before, to avoid subtle consistency bugs.
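Concretely, the safe version of the refactor ends up looking something like this sketch (all names made up): do the slow work first with no connection in scope, then group the writes into one short transaction so the old rollback behavior is preserved.

```rust
use sqlx::PgPool;

async fn process(pool: &PgPool) -> Result<(), sqlx::Error> {
    // 1. All the slow, non-database "work" happens first, with no
    //    connection checked out of the pool.
    let a = do_work_a().await;
    let b = do_work_b().await;
    let c = do_work_c().await;

    // 2. Then one short-lived transaction groups the writes, so a failure
    //    in any of them still rolls back all three, as before.
    let mut txn = pool.begin().await?;
    sqlx::query("INSERT INTO a_results (v) VALUES ($1)").bind(&a).execute(&mut *txn).await?;
    sqlx::query("INSERT INTO b_results (v) VALUES ($1)").bind(&b).execute(&mut *txn).await?;
    sqlx::query("INSERT INTO c_results (v) VALUES ($1)").bind(&c).execute(&mut *txn).await?;
    txn.commit().await
}

async fn do_work_a() -> String { "a".into() }
async fn do_work_b() -> String { "b".into() }
async fn do_work_c() -> String { "c".into() }
```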
We have a similar check in our Haskell codebase, after running into two issues:
1. Nested database transactions could exhaust the transaction pool and deadlock
2. The same thing you described: doing e.g. HTTP calls during transactions
We now have a compile-time guarantee that no IO can be done outside of whitelisted things, like logging or getting the current time. It’s worked great! Definitely a good amount of work, though.
I figured it’d be Haskell that is able to do this sort of thing really well. :-D
I had this realization while writing the rustc plugin that this is basically another shade of “function coloring”, but done intentionally. Now I wish I could have a language that lets me intentionally “color” my functions such that certain functions can only be called from certain blessed contexts… not unlike how async functions can only be awaited by other async functions, but for arbitrary domain-specific abstractions, in particular database connections in this case. I want to make it so HTTP calls are “purple”, and any function that gets a database connection is “pink”, and make it so purple can call pink but not vice-versa.
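You can fake a version of this today with capability tokens: “color” each function by the zero-sized token it demands, and control who can mint which token. A rough sketch, all names hypothetical:

```rust
// Zero-sized capability tokens; the private fields mean code outside this
// module can't construct them directly, only receive or derive them.
pub struct HttpCap(()); // "purple": may do HTTP
pub struct DbCap(());   // "pink": may touch the database

impl HttpCap {
    // Purple can call pink: an HTTP-capable context may mint a DB capability...
    pub fn db(&self) -> DbCap { DbCap(()) }
}
// ...but DbCap exposes no way to mint an HttpCap, so pink can't call purple.

pub async fn fetch_remote(_cap: &HttpCap, url: &str) -> String {
    // HTTP call would go here
    format!("fetched {url}")
}

pub async fn bump_timestamp(_cap: &DbCap, _id: i64) {
    // database write would go here
}

pub async fn handler(cap: HttpCap) {
    let _body = fetch_remote(&cap, "https://example.com").await; // purple: ok
    bump_timestamp(&cap.db(), 42).await; // purple -> pink: ok
    // Inside bump_timestamp there's no HttpCap in scope, so a "pink"
    // function simply can't name the capability fetch_remote requires.
}
```

It’s module privacy doing the enforcement rather than a first-class language feature, but it gets surprisingly far.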
The rule I ended up with in the lint is basically “if you have a connection in scope, you can only .await a function if you’re passing said connection to that function” (either by reference or by moving it). It works with Rust’s knowledge of lifetimes and drop semantics, so that if you call txn.commit() (which moves the connection out of scope, marking the storage as dead) you’re then free to do unrelated async calls after that line of code. It’s not perfect, though… if you wrap the connection in a struct and hold that in your scope, the lint can’t see that you’re holding a connection. Luckily we’re not really doing that anywhere: connections are always passed around explicitly. But even if we did, you can also configure the lint with a list of “connection types” that will trigger it.
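To make the rule concrete, here’s roughly what the lint flags and allows (a sketch with made-up function names, not our actual code):

```rust
use sqlx::PgPool;

async fn example(pool: &PgPool) -> Result<(), sqlx::Error> {
    let mut txn = pool.begin().await?;

    do_unrelated_io().await; // FLAGGED: txn in scope, not passed to the call

    sqlx::query("SELECT 1").execute(&mut *txn).await?; // ok: txn is passed

    txn.commit().await?; // moves the connection out of scope

    do_unrelated_io().await; // ok: no live connection anymore
    Ok(())
}

async fn do_unrelated_io() { /* e.g. an HTTP call */ }
```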
Your idea and implementation for await and transactions sound super cool. With my limited Rust knowledge, it’s hard for me to judge how difficult such a plugin was to implement.
Also, your idea of using different domain specific colors is interesting. It might be possible to express this via some kind of effect system. I'm not aware of any popular Rust libraries for that, but it could be worth borrowing some ideas from Scala libraries.
It’s a compile-time check, and yeah, it’s a lint rule. In fact it goes a little deeper than a normal lint can go, because it uses data from earlier compiler phases (in order to get access to what the borrow checker knows). The correct terminology is a “rustc driver”, from what I’ve heard. Lints like clippy’s run as a “LateLintPass”, which doesn’t have access to certain MIR data that is intentionally deleted in earlier phases to lower memory requirements.
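The driver APIs are nightly-only and shift between toolchains, so take this as the general shape rather than something guaranteed to compile on today’s nightly, but the skeleton looks roughly like:

```rust
#![feature(rustc_private)]
// Nightly-only sketch; rustc_driver/rustc_interface are unstable and change
// between toolchains, so treat this as the general shape only.
extern crate rustc_driver;
extern crate rustc_interface;

use rustc_driver::{Callbacks, Compilation, RunCompiler};
use rustc_interface::interface::Compiler;
use rustc_interface::Queries;

struct ConnectionLint;

impl Callbacks for ConnectionLint {
    fn after_analysis<'tcx>(
        &mut self,
        _compiler: &Compiler,
        queries: &'tcx Queries<'tcx>,
    ) -> Compilation {
        queries.global_ctxt().unwrap().enter(|tcx| {
            // At this point MIR and borrow-check results are still available,
            // which is what a clippy-style LateLintPass no longer gets to see.
            for def_id in tcx.hir().body_owners() {
                let _mir = tcx.optimized_mir(def_id.to_def_id());
                // ...walk the MIR, track live connection locals, flag awaits...
            }
        });
        Compilation::Continue
    }
}

fn main() {
    let args: Vec<String> = std::env::args().collect();
    // Run as a drop-in rustc replacement (e.g. via RUSTC_WRAPPER).
    std::process::exit(rustc_driver::catch_with_exit_code(|| {
        RunCompiler::new(&args, &mut ConnectionLint).run()
    }));
}
```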
Hopefully it’s something I can open source soon (I may upstream it to the sqlx project, as that is what we’re using for db connections.)
I would love to see how you implemented that (and also the lint itself). I so far haven't found a solid way to implement custom lints for Rust, so if you have any resources to share at some point, I would love to see them!