Features I wish PostgreSQL had as a developer (bytebase.com)
28 points by cloudsql on April 5, 2024 | 62 comments


> To prevent deleting data by mistake, we invent soft-delete pattern by having a is_deleted column. However, this brings extra complexities around foreign key, unique index enforcement. Ideally, PostgreSQL could allow users to configure an archived table. The removed data is moved to the archived table first and purged after a configured retention period. This simplifies application logic and data compliance work.

You can get the same result by changing your is_deleted boolean field to a date_deleted date field. Then use a cron to purge all the records with dates older than the configured retention period.
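
A minimal sketch of that approach, assuming a hypothetical orders table with a reference_code column, a partial unique index so uniqueness is only enforced for live rows, and a 90-day retention period:

    -- soft-delete column instead of is_deleted
    ALTER TABLE orders ADD COLUMN deleted_at timestamptz;

    -- "delete" a row
    UPDATE orders SET deleted_at = now() WHERE id = 42;

    -- uniqueness only enforced for live rows
    CREATE UNIQUE INDEX orders_reference_live_uq
        ON orders (reference_code) WHERE deleted_at IS NULL;

    -- nightly cron purge past the retention window
    DELETE FROM orders WHERE deleted_at < now() - interval '90 days';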


if you're using an ORM that has an updated_at field, you can keep the boolean and rely on the updated date.


> The typical way to do schema migration is to compose a list of ALTER TABLE statements. This becomes hard to track the latest schema state as the migration accumulates. It's more intuitive for the developers to specify the desired state. Ideally, PostgreSQL could allow developers to specify the desired CREATE TABLE schema, the engine then reconcile it with the latest schema, figure out the diff, and plan the migration path internally.

This is a terrible idea. The behaviour of engines that do this is unpredictable and suddenly you lose data because the engine deleted a column.


If a tool blindly drops columns, that's just a bad tool! It doesn't mean the concept is flawed.

Thousands of companies successfully use declarative schema management. Google and Facebook are two examples at a large scale, but it's equally beneficial at smaller scales too. As long as the workflow has sufficient guardrails, it's safe and it speeds up development time.

Some companies use it to auto-generate migrations (which are then reviewed/edited), while others use a fully declarative flow (no "migrations", but automated guardrails and human review).

I'm the author of Skeema (https://github.com/skeema/skeema) which has provided declarative flow for MySQL and MariaDB since 2016. Hundreds of companies use it, including GitHub, SendGrid, Cash App, Wix, Etsy, and many others you have likely heard of. Safety is the primary consideration throughout all of Skeema's design: https://www.skeema.io/docs/features/safety/

Meanwhile a few declarative solutions that support Postgres include sqldef, Migra, Tusker (which builds on Migra), and Atlas.


This is exactly how auto migrations in things like Django and Prisma work. Yes you need to check the changes before you apply them but most of the time they are entirely sensible.


Wouldn't it be more interesting to have this as an external tool?

You input your create table statement and it issues you back the migration statements? Then you can check it against your development database or whatever and if you feel fine use it?

This way you could check and modify the migration path without writing the alter statements.

This is one of the most frustrating things with sqlite for me. Changing a table doesn't always work with an ALTER statement; sometimes you need to drop and recreate it with the new columns. Why can't it do the magic for me? It's really frustrating, and was often enough the sole reason I used Postgres for private projects.


There are actually a bunch of external tools that offer a declarative schema management flow. Personally I agree that the overall flow should be external to the DB, but it would be useful if databases could offer more built-in functionality in this area to make these tools easier to write.

For sqlite in particular, check out https://david.rothlis.net/declarative-schema-migration-for-s... and https://sqlite-utils.datasette.io/en/stable/python-api.html#...


We do exactly this, but with our own "ground truth" schema in the form of an XML file defining the various tables, views, triggers and so on.

We then have a program which compares the latest schema from XML to a given database, and performs a series of CREATE, ALTER and so on to update the database so it conforms.

Since we've written it ourselves we have full control over what it does and how it does it, for example it never issues DROP on non-empty tables/columns or similar destructive actions.

We've had it for a long time now and it's worked very well for us, allowing for painless autonomous upgrades of our customers' on-prem databases.


I find this interesting! Is there anything similar in the open source world? I have built projects with Supabase in the past, and one of my gripes about it is that it becomes obnoxious to track what happens to the schema over time in version control with dozens of migration files. For example, you have no idea what the current state of a table or trigger is, because there might be several updates that occurred across four dozen migration files, and you may not even be sure which files those changes are in. You have to just look at the database directly rather than the code.


I haven't stumbled over anything, but I wouldn't be surprised if it exists. It's not magic, my colleague wrote the first version in a few days, and we've iterated on it since.

It just requires that there are some system views or similar that you can use to extract the current database schema, so you have something to compare against.
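
For PostgreSQL at least, the standard information_schema views are enough for that; a rough sketch of the kind of per-table introspection query involved (table name hypothetical):

    SELECT column_name, data_type, character_maximum_length,
           is_nullable, column_default
    FROM information_schema.columns
    WHERE table_schema = 'public' AND table_name = 'mytable'
    ORDER BY ordinal_position;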

Our tool goes through the XML file and, for each table, runs a query to find the current columns, and for each column the current configuration. Then it compares against the columns in the XML file and decides what to do for each: ALTER, DROP, or ignore (because of possible data loss), etc. Datatype changed from "int" to "varchar(50)"? Not a problem, since 50 chars are enough to store the largest possible int, so issue ALTER TABLE. Column no longer present? Check if the existing column has any data; if not, we can safely DROP the column, otherwise keep it and issue a warning.

Views, triggers and stored procs are replaced if different. We minimize logic in the database, so our triggers and stored procs are few and minimal.

Materialized views require a bit of extra handling with the database we use, in that we can't alter but have to drop and recreate. So we need to keep track of this.

As you say it's very nice to use as a developer, as you only have to care about what the database should look like at the end of the day, not how it got there. Especially since almost all of our customers skip some versions (we release monthly).


It’s not inherently a bad idea, though. I do agree it would likely be poorly implemented.


Why would you think so? The PostgreSQL developers aren't exactly known for delivering bad software or features so far. If they implemented this, I would actually expect a pretty good approach, with even better documentation.


The creation of the backing migration plan would have to be interactive, certainly. But that doesn't mean the concept as a whole is destined for failure.


For me, it’d be more control over the query planner, ideally down to the level of submitting my own physical query plan.

I was looking at a query yesterday that joined 3 tables and we were unable to convince Postgres to use the optimal plan, which was to scan an index on Table1 that matched its ORDER BY order, and filter by joining against the other tables.

With a CTE for Table1 that had a limit on it, Postgres would pick the optimal plan, but then may underflow the desired limit after filtering. However the query plan is optimal and it will finish in 3ms.

Without the limited CTE, Postgres would run the Table2 and Table3 join in parallel with Table1, and then try to hash join something like 3,000,000 rows to produce the final tuple set. This query plan takes 40 seconds.

It’s just so frustrating knowing the system has the capability to serve a query in fractions of a second but you can’t extract that performance directly. Instead you need to write some truly bizarre code you hope the query planner will like, and then live the rest of your life in fear the query planner changes its mind.
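
For anyone curious, the shape of the workaround looks roughly like this (table and column names hypothetical, MATERIALIZED needs PG 12+); the inner LIMIT is what makes the planner take the index scan, and also what can make the final result come up short after filtering:

    WITH candidates AS MATERIALIZED (
        SELECT id, t2_id, t3_id, created_at
        FROM table1
        ORDER BY created_at DESC   -- matches an index on table1
        LIMIT 500
    )
    SELECT c.*
    FROM candidates c
    JOIN table2 t2 ON t2.id = c.t2_id
    JOIN table3 t3 ON t3.id = c.t3_id
    WHERE t2.status = 'active'
    ORDER BY c.created_at DESC
    LIMIT 100;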


A few days late here (joys of opening multiple tabs then only getting to them when bored), here's hoping you look back and see this:

The difference between using a LIMIT and not hints this is happening because of a low correlation statistic between that index and the order on disk. Postgres is avoiding random-access lookups in what looks to you like the optimal plan, because this blows away the disk cache (paging cache) and would result in a much slower overall time than you think it will, given other configuration values.

The CLUSTER command will reorder the table data to match a chosen index, after which postgres will use that index with an index scan the way you want it to, even if you retrieved the whole table and have a slow disk, because now with a high correlation statistic the query planner knows it'll be working with the disk cache instead of against it.

If you're on an SSD or are otherwise sure it can all be held in memory, check out the "random_page_cost" setting on https://www.postgresql.org/docs/current/runtime-config-query... which may allow the index scan without having to run CLUSTER.
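
Concretely, the two knobs being described (index name hypothetical):

    -- physically reorder the table to match the index, then refresh statistics
    CLUSTER table1 USING table1_created_at_idx;
    ANALYZE table1;

    -- or, on SSDs / well-cached data, tell the planner random I/O is cheap
    SET random_page_cost = 1.1;   -- default is 4.0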


Yes this is completely my experience too. I spent most of yesterday battling with the query planner. Sometimes you can't express with table statistics something that you know to be true for the exact query that you're making.

There are a lot of arguments about query hints etc becoming stale and the performance changing as the table grows but I'm less worried about that - it would be a gradual degradation of performance. What worries me is that the table stats cross a threshold at 3am and suddenly the query planner chooses something crazy.

I've also wondered if an approach of trying a bunch of query plans could be fun. Get it to pick the top 10 possibilities and just run them all and record stats on which was fastest. Doubly so if it ever decides to run a seq scan where there is an index. Please. Just try the index! That said yesterday I was definitely at the point where I just wanted to express the exact query plan myself.
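
For what it's worth, you can at least bully the planner into showing you the index plan with the enable_* cost settings (these are planner discouragements, not real hints; the query below is hypothetical):

    BEGIN;
    SET LOCAL enable_seqscan = off;   -- only for this transaction
    EXPLAIN ANALYZE
    SELECT * FROM events
    WHERE user_id = 42
    ORDER BY created_at DESC
    LIMIT 20;
    ROLLBACK;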


Very memorable post about this happening at Bluesky, about being utterly unable to trust Postgres to keep doing what it's doing.

> Bluesky scaling lesson: Postgres is not a great choice if your data can be irregular.

> The query planner will switch over to a new plan that consumes 100% CPU in the middle of the night whenever its table stats flip the heuristics the wrong way.

> Postgres badly needs query hints like MySQL has!

https://bsky.app/profile/jacob.gold/post/3kn4v67wyjt2l


I'd like to have a way of disabling certain (legacy?) functionality. They even list some of it on their wiki [1]. Maybe we could even get speedups by not covering the legacy cases.

A plan to phase out some of these legacy features would be even better. Don't know if they do this in general.

[1]: https://wiki.postgresql.org/wiki/Don't_Do_This


This comment should be at the top. That list is pretty much the first thing I add to the Readme.md of any new project (that uses Postgresql).


WDYT about a "don't do this" feature that INFO/WARN/ERROR you when you try to use a given feature?


I'd like to see an option for automatically adding indexes for performance.

Perhaps a background process could re-run the query planner for frequent queries and add an appropriate index if there's a big speedup.
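
Today you can at least find those frequent, expensive queries yourself with pg_stat_statements (it has to be listed in shared_preload_libraries; column names as of PG 13). A sketch of the data such a background process would need:

    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

    -- most expensive queries overall; good candidates for a new index
    SELECT query, calls, mean_exec_time, rows
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 20;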


As a developer who is not a DB expert, I always wonder why indexes can't be auto-added based on queries. So let's say I have a table, and once it starts filling with data and the typical queries start coming in (say, based on user id or email), it just adds the index?


There are databases that provide this, e.g., Oracle and SQL Server. See: https://www.oracle.com/news/connect/oracle-database-automati...


Adding an index means the write path suddenly became slower. Sometimes, a covering index will be very beneficial. « It depends » is usually the right answer. I’m fine with the engine telling me « this query is usually slow » and I investigate why. I’m not ok with the engine adding indices willy-nilly and suddenly writing to the DB is 10x slower.


sounds like a non-default option would satisfy you both


Indexes aren't free. They take up storage space and reduce the performance of inserts. On top of that, if an index gets too big it can be really slow, the table needs partitioning instead.

In some applications you might value the performance of insert a lot more than select, and not want to pay for extra storage of the index.


I work on building an RDBMS, but more importantly, I've watched people misuse RDBMSes for more than 20 years.

People write terrible schemas. ORMs hide complexities and make terrible queries.

Some people refuse to use sophisticated features like DateTime types for storing dates, using strings instead (I'm not talking about formatting!).

They also don't use JOINs, VIEWs, or FUNCTIONs, or even add indexes.

They build MongoDB-style schemas on top of PostgreSQL, and then build their own query engine on top, which is not the kind of workload an RDBMS is happy to deal with.

In short: "Trash-in Trash-out" but for queries, and that is something will be triggered by a system like this.

Plus:

Query engine optimization IS the HARDEST aspect of building a DB. Example:

https://db.in.tum.de/~radke/papers/hugejoins.pdf

> The largest real-world query that we are aware of accesses more than 4,000 relations.

And you have a lot of constraints:

- More indexes means more data, less RAM, less SPACE.

- More indexes means more query optimization paths, and more complex query optimization.

- Adding more indexes could create suboptimal access patterns.

- And that changes over time.

- And those changes can happen within a second. And then you have created an unintentional delay that WILL hit somebody's wallet.

Doing this stuff "silently" is a sure way to add problems that will be very hard to solve or debug. Sure, without this some of the problems still remain, but at least it will be possible to see why, and to know when it gets solved.

If anyone manages to create a solid implementation of this, they will deserve a Nobel Prize, an Oscar, a place in the hall of fame, and they will be rich.


I've always found that this is a great idea that doesn't hold up to the reality of DB tuning. There's no "typical" query: should it optimize for the more common but fairly performant query, or the rarer, very slow one? What type of client are you most concerned about? Both? Is that possible? Which queries are worth the (total) cost of the index? Does a specific index hurt some other workload, like automated integrations, or is it OK to have slow updates in the middle of the night? You can definitely profile over time, and there are tools that help you capture data and make recommendations, but auto-indexing would be a limited-to-poor solution a lot of the time.


> Archived Table

> To prevent deleting data by mistake, we invent soft-delete pattern by having a is_deleted column. However, this brings extra complexities around foreign key, unique index enforcement. Ideally, PostgreSQL could allow users to configure an archived table. The removed data is moved to the archived table first and purged after a configured retention period. This simplifies application logic and data compliance work.

Partitioned tables work pretty well for this. https://www.postgresql.org/docs/current/ddl-partitioning.htm...
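
A rough sketch of that shape (names hypothetical; UPDATEs that flip the flag move rows between partitions as of PG 11):

    CREATE TABLE orders (
        id          bigint NOT NULL,
        customer_id bigint NOT NULL,
        is_deleted  boolean NOT NULL DEFAULT false
    ) PARTITION BY LIST (is_deleted);

    CREATE TABLE orders_live    PARTITION OF orders FOR VALUES IN (false);
    CREATE TABLE orders_deleted PARTITION OF orders FOR VALUES IN (true);

    -- retention: detach (and later drop or archive) the deleted partition
    ALTER TABLE orders DETACH PARTITION orders_deleted;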


Also of interest: you can use the CLUSTER command to reorder rows physically on disk according to an index. This helps you cluster all active records together and all deleted records together. This way you can keep the records in one table, but query more efficiently since you need to read fewer database blocks.


> Partitioned tables works pretty well for this.

Pretty interesting idea. Do you have any stories or experience to share when going with this approach?


um, I've done it in production for years?


Interesting list. I think the items related to schema migrations are probably bad ideas, as the logic and process around even simple schema migrations can be extremely complex and varied. It seems much better to keep this as an external component than to build features into the db engine around a narrow set of approaches.

I like several of the other ideas, though. Branching is at least partially implemented in other RDBMSes. MS SQL Server has been able to do instant read-only database snapshots (which capture a point in time and can be queried just like the original DB) for decades.


A built in connection pooler would be nice! My first database tech was Microsoft SQL Server, and ADO.NET supports it out of the box, so when I made the move it was the most obvious missing feature to me.


As someone who's never used those technologies - is that not a client-side concern? What's different about it being "built in" vs adopting something like pgpool or Supavisor when you have those requirements?


>Rollback. If a new schema change introduces issues, versioning allows for safer rollbacks to previous stable states, minimizing downtime and impact on users.

Wonderful intention, but not always easy or possible to pull off, because you're not just managing the stateless structure but also the state stored by that structure. Migrations that drop columns by merging their data into another may well lose information, and may do so in a way that makes the original data unrecoverable... not to mention deciding what to do with any records added after the dropping migration but prior to running the rollback.

Sure, there are ways to handle these kinds of scenarios, and sometimes in reversible ways... but there comes a point where a rollback looks more like moving forward and adding a feature/capability than it does a return to a previous state.

Defining the "one true way" for migrations may cause more harm than good despite how convenient it may be. How migrators help you manage scenarios like rollback can vary from tool to tool and is part of the selection process when finding the right tool for your project. I've built a database migrator for my project which implements a non-traditional approach to migrations which works very well for my project, but would cause others to cringe with all sorts of objections. Sometimes an external tool is a better answer than an internal one.


I like the idea of having migration progress.

I'd like to extend the idea and ask for a more ingrained concept of versions and version migration.

Basically what flyway does but on DB level.

You could ask your DB from any backend whether it's in a migration state or not. It would be a first-class citizen, where it's absolutely clear that a DB migration should NOT disrupt production, and it could also be much more long-running.

Like, let me schedule a migration over 1 or 2 days through the DB, and thanks to its statistics and knowledge it knows to do the big chunks at night, for example.


The single most important thing for me, as someone who wishes to move away from Oracle, is packages.


I wish it had logical locks where you didn’t need to lock a particular row. Instead you’d just specify a string and lock on it and then execute your application code. Would be useful for distributed locks, or for transactions that may conflict with another so you take the logical lock in the application code before proceeding.



That is the way, but the UX is pretty ass because the lock ID is a 64 bit number instead of a string. How the heck are you supposed to keep track of what lock ID you should be checking in a given situation across multiple client apps?


You get the same problem for strings. How do you know that you should lock "user_update" and not "update_user" for example? And how do you avoid name collision when client A wants to check for a lock that is used by client B for other purposes?

The solution to both cases is to define them as either static constants or use an Enum. Then you would not care if the end result is a string or a number.

At my work place we simply have a static class with lock names that we use.


Good point, that’s a better approach. I guess for a multi-repo situation at work, you would need to create a base project like “postgres-lock-ids” so you can synchronize the lock across everything.


hashtext() works well


It's not exactly string-based as it accepts bigint key, but I guess it's possible to hash a string when you pass it to the function.
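
Something like this sketch (lock name hypothetical; note hashtext() returns int4, so distinct names can in principle collide):

    BEGIN;
    SELECT pg_advisory_xact_lock(hashtext('cron:send-invoices'));
    -- ... do the critical-section work here ...
    COMMIT;   -- the xact-scoped lock is released automatically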


That's exactly what I did for a fairly basic distributed cron, and it worked fine.


Have you passed a string into Postgres and then hashed it into bigint with another function? If yes, what function did you use?

I assume that if you do it this way, then you see a string key in logs, views of current/locked queries, etc. Which should immensely help when debugging any kind of problems.


I have a long-time humble wish: an auto-updated timestamp for row updates.

On MySQL you can just declare it on the column: `ON UPDATE CURRENT_TIMESTAMP`.

On PostgreSQL you have to set up triggers on each table individually, and trying to automate that with DDL event triggers requires admin privileges, which you don't always have when applying migrations.
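
For reference, the per-table boilerplate this takes today (a sketch; table name hypothetical, EXECUTE FUNCTION needs PG 11+, older versions use EXECUTE PROCEDURE):

    CREATE OR REPLACE FUNCTION set_updated_at() RETURNS trigger AS $$
    BEGIN
        NEW.updated_at := now();
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    -- ...and then one of these per table
    CREATE TRIGGER orders_set_updated_at
        BEFORE UPDATE ON orders
        FOR EACH ROW EXECUTE FUNCTION set_updated_at();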


My only wish is that JSON in Postgres would be not sticking out like a sore thumb.

For example, why not address JSON paths as WHERE a.b.name = "king"
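
For comparison, what you have to write today (assuming a table a with a jsonb column b; the second form needs PG 12+ SQL/JSON path support):

    SELECT * FROM a WHERE b->>'name' = 'king';

    SELECT * FROM a WHERE jsonb_path_exists(b, '$.name ? (@ == "king")');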


welllll for one thing that's the wrong type of quote!! (maybe my least favourite sql feature)


Ha, I've been using PSQL for nearly 30 years and I still make the same mistake :-)


Add to the list: Ability to reorder columns :)


Genuinely curious, why is this an issue? I don't think I've ever looked at a table and thought "oh, it'd be nice if I could move these around a bit".


Just annoying when the attributes aren't in a logical order because they have been added to incrementally.

I've got a table that contains the columns length, width, height, and cube (which is length x width x height), but cube is like 5 columns down, next to the 'short description' field (which would ideally be next to the description field!).

The order can also impact postgres query performance, but for me it's mostly a look & feel thing.



Thanks, that was an interesting read. I'm curious how much real-world storage it would save for your average developer. Wonder if there's a nice little script out there you can run against an existing instance to get some kind of stats.


It would be easy to create a script like that. You'd save quite a significant storage space, especially when you have tables with hundreds of millions of rows. The real problem though is with further migration of your schema down the line, where you're going to add/remove columns, as it's almost always the case.


I think that all the time.


You can, but you have to play towers of Hanoi.
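
A sketch of that dance with a hypothetical table (whole-table rewrite under a lock, and every index, constraint, and grant has to be recreated afterwards):

    BEGIN;
    CREATE TABLE products_new (
        id                bigint PRIMARY KEY,
        length            numeric,
        width             numeric,
        height            numeric,
        cube              numeric,
        short_description text,
        description       text
    );
    INSERT INTO products_new
        SELECT id, length, width, height, cube, short_description, description
        FROM products;
    DROP TABLE products;
    ALTER TABLE products_new RENAME TO products;
    COMMIT;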


Now I’m waiting for:

“Show HN: Multiplayer Towers of Hanoi in Postgres”


> Git Integration

This is it for me. Does anyone know a good way of version controlling views in Postgres?


dbt comes to mind




