Boosting the performance of PostgreSQL’s COPY command by dropping indexes (californiacivicdata.org)
37 points by palewire on Jan 25, 2018 | hide | past | favorite | 46 comments


Is it not common knowledge that dropping indexes improves database insert performance?

Of course, that's often not an option when you're loading records into a live database that's also serving queries; you usually don't want every query to result in a full table scan.

This was well known 20+ years ago when I was an entry-level DBA, and I assumed it was still well known today.


I think what the article is proposing is that it can be quicker to drop the indexes and recreate them than to load a lot of data in an indexed table.
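To make the pattern concrete, here's a minimal sketch using Python's built-in sqlite3 as a stand-in for PostgreSQL (the table and index names are invented for illustration): drop the index, bulk-load in one transaction, then rebuild the index over the finished table.

```python
import sqlite3

# Sketch of the drop-then-recreate pattern (sqlite3 as a stand-in for
# PostgreSQL; table/index names are made up for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE filings (id INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_filings_amount ON filings (amount)")

rows = [(i, i * 1.5) for i in range(10_000)]

# 1. Drop the index so each insert skips the index-maintenance work.
conn.execute("DROP INDEX idx_filings_amount")

# 2. Bulk-load inside a single transaction.
with conn:
    conn.executemany("INSERT INTO filings VALUES (?, ?)", rows)

# 3. Rebuild the index once, over the fully loaded table.
conn.execute("CREATE INDEX idx_filings_amount ON filings (amount)")

count = conn.execute("SELECT COUNT(*) FROM filings").fetchone()[0]
```

In PostgreSQL the equivalent would be DROP INDEX / COPY / CREATE INDEX, as the linked docs describe.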


I thought this was pretty common knowledge, and I don't even do much with databases.


and I think that's what johnny555 is questioning - isn't this common knowledge already? apparently not.

here's another tip, at least on mysql, but possibly other databases that have memory tables: import stuff into memory tables, then insert from the memory table into a disk-based table. I took a process that was naively importing data via SQL commands, which took close to 24 hours, down to around 20 minutes by breaking it up: chunking imports into memory tables, then copying those to permanent disk. This was years ago (12?), and mysql is probably better about insert handling than it was, but that approach (plus the drop/recreate indexes) turned a 24-hour import cycle into a smallish process.
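The staging-table idea can be sketched with sqlite3 (standing in for MySQL's MEMORY engine; all names here are invented): load each chunk into an attached in-memory table, then copy it to the permanent table in one set-based statement.

```python
import sqlite3

# Sketch of the memory-staging pattern (sqlite3 stand-in for MySQL's
# MEMORY engine; table names are made up for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS staging")
conn.execute("CREATE TABLE main_events (id INTEGER, payload TEXT)")
conn.execute("CREATE TABLE staging.events (id INTEGER, payload TEXT)")

# Load a chunk into the fast staging table first...
chunk = [(i, f"row-{i}") for i in range(10_000)]
with conn:
    conn.executemany("INSERT INTO staging.events VALUES (?, ?)", chunk)
    # ...then copy it to the permanent table in one statement
    # and clear the staging table for the next chunk.
    conn.execute("INSERT INTO main_events SELECT * FROM staging.events")
    conn.execute("DELETE FROM staging.events")

total = conn.execute("SELECT COUNT(*) FROM main_events").fetchone()[0]
```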


> apparently not.

Just because one blog post makes it seem that way, does that make it not known?


not just because of that - cuchoi also seemed to imply that.

relatedly/anecdotally, I still run into people who aren't aware of this, as they don't understand what happens when you insert data into a database, or what indexes are in the first place.


"Batch-mode processing is always more efficient."

Not sure where I first heard that, but it applies here. Essentially it is almost the same thing as saying that computers are often set up to exploit economies of scale.

Thus building an index all at once after a large set of changes are made is more efficient than incrementally updating an index as each change is made.


It is more accurate to say that sequential I/O is always more efficient than random, even if there is no physical seek time anymore.


The reasons why batch mode processing can be more efficient go beyond just sequential I/O.

If I do a little of something now and a little later and yet more even later, I will probably have to deal with caches that don't have my data in them because other things happen in the intervening time. If I do it all at once, then a lot of the work benefits from already-warm caches. (This can apply to disk caches, CPU data caches, and even instruction caches.)

Not to mention that batch-mode processing sometimes opens up opportunities to use a more efficient algorithm (even if only by a constant factor). For example, if you maintain an index in a balanced tree and keep adding to it piecemeal, you do extra work continually rebalancing that tree. In theory, if you built the index all at once, you could collect all the data, sort it with a fast sort like mergesort, and then write out a final tree that is already balanced and never needs to be rearranged as data arrives in random order.
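A toy illustration of incremental vs. batch building: keeping a sorted list up to date one insert at a time (each insert pays a shift cost, analogous to ongoing tree rebalancing) versus sorting the whole batch once at the end. Both end up in the same state; the batch path does less total work.

```python
import bisect
import random

random.seed(0)
data = [random.randrange(1_000_000) for _ in range(5_000)]

# Incremental: maintain sorted order on every insert,
# paying an O(n) shift each time (like piecemeal index updates).
incremental = []
for x in data:
    bisect.insort(incremental, x)

# Batch: one O(n log n) sort over the full dataset at the end.
batch = sorted(data)
```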



It's weird but doing less work requires less time. Who'd have thought that?


I must be getting old.

Kids, many years ago, even before jQuery, software would come with documentation that you could read and it would tell you how to use it effectively.

I know, crazy right? But to this day some of that old software, of which PostgreSQL is an example, still has this documentation that you can read, even before you use the software in a production system.

Yeah, yeah, I know Agile and Docker solved the problem of ever having to document anything, but this is the way things used to be and a few of us are stuck in our ways and still like it.


I second reading the Postgres docs. They're fabulous resources. Any time I'm attempting to apply a new SQL syntax feature I'm not 100% fluent with, I'll give the PG docs a read-through—not just for usage, but also for idiomatic examples, performance analysis, edge-cases to consider, and more.


>> even before jQuery

If you're getting old then I must be ancient! I remember when all the software documentation had to be printed on this white stuff made out of dead trees.


And then people wonder why Agile never produces quality results.

Intellisense has replaced the need to read the docs and Agile has replaced the need to understand what you're doing.

It's no surprise that basic knowledge found in the documentation is later "discovered" when the project is already running in production.


To be fair, the honest among us will admit that it's not unusual to miss something glaringly obvious for an embarrassingly long amount of time.

The difference is really in whether you recognize the issue and quietly hope no one finds out how dumb you really are, or whether you make a big celebratory blog post about the secret behind your "pioneering" work, making sure that your title and first and last name are clearly attached. And of course, we can't fail to highlight the further brilliance of accomplishing this marvelous feat by employing "rarely used, low-level" commands from within the framework's ORM.

Hold on to your butts, because next week he's going to learn that you can execute commands directly on the server, without even having to use the "low-level" elements of an ORM! I can't wait for the field to be revolutionized by Lead Developer James Gordon's next discovery.


Maybe cut Lead Developer James Gordon some slack. He may not have wanted this attention, which came from Ben Welsh. Welsh describes himself as a "data journalist" and a "hack computer programmer". Perhaps he just thought this improvement made by the developer was really cool, and wanted to blog about it, not understanding the snark's nest he was stepping into.


You're right, of course. I considered mentioning the possibility that this was meant more as a test PR style or tongue-in-cheek "Haha this is obviously super important"-style post, since these are definitely viable explanations, but I felt the rant was already long enough and I couldn't find an easy way to work it in.

One of the most dangerous things about sharing stuff with others, especially isolated items from unknown authors with a worldwide audience, is that you never really know how much of their own context the recipient will read in, or how much of the assumed / pre-requisite context they'll fail to infer (or infer differently than intended).

You only get better at this through repeated practice, but you can't ever be perfect at it. Especially in a world of complex social interactions where people don't always mean what they say or say what they mean, and the lack of body language and facial expression in written language silently corrupts the signal.

All readers should always remember, it is easy to criticize. It is much harder to do. Critics especially need to remember this, because it's very easy, automatic in fact, to fall into a pattern of judgment and criticism when we're regularly exposed to so much stuff from so many sources. But it's good to get out there and try, because it's often much harder than it looks, and especially if you've been on the sidelines judging for a long time, it can be jarring how much harder it is to do than to say (or, particularly, deride).

A great way to test this: if there's some radio program you listen to regularly where callers can share a brief anecdote or story, call in. As a regular listener, you've been silently evaluating callers on a daily basis for years. You likely have some opinion about the practices of good callers and bad callers. Call in and you'll be surprised how nerve-wracking it can be, even for someone as "experienced" as yourself, and I think you'll be disappointed in your overall performance (unless you've thoroughly rehearsed ahead of time).

This is just a tiny, irrelevant thing that most people wouldn't give any thought to, a quick ~60 second phone call. You certainly don't give much thought to it every day when you dismiss anecdotes or stories told by amateurs. But try it yourself and you'll gain a great deal of perspective on the difficulty gap between doing vs. criticizing.

The peanut gallery can, and will, always find something to nitpick. Don't take it personally. Use that data to hone your interactions and get a better-tuned result next time.


This is where the rampant ageism in the industry has led us; millennials who expect a prize for doing something totally obvious to anyone with experience, who was never considered for the job because they were “too old”


I understand your sentiment, but I think your wording is a bit mean spirited. I don't think this person "expects a prize", nor do I think this is a trait that can be applied to millennials as a whole. My sister, who has hired literally a hundred people bucketed as "millennials", has had the exact opposite experience in dealing with them than you describe. She thinks they are some of the hardest-working people she has ever worked with. But we digress.


Wow. You've gotta love the audacity in making a big announcement like this based on the developer finally getting around to reading the docs. "Drop constraints and indexes for faster imports" is mass import 101.

The entirety of the "Why We Did It" section:

-----

> This improvement was pioneered by James Gordon, the Coalition’s lead developer.

> He drew instruction from PostgreSQL’s official documentation, which reads:

>> [snipping quoted sections from PostgreSQL manual at https://www.postgresql.org/docs/10/static/populate.html#POPU... ]

>Gordon’s code handles this task using rarely utilized, low-level tools in Django’s database manager.

-----

Sadly, in the current day and age, a developer actually taking the time to RTFM may indeed qualify as "pioneering" work!

Perhaps the rest of us need to start trumpeting our accomplishments when we find some clearly-stated performance gain in the manual, rather than hiding our heads in embarrassment for not finding out until we released version 2.2 of our mass DB import tool.


Yeah, it's OK to not know about this, and get excited about discovering it, but to categorize it as "pioneering" is, as you say, audacious. This announcement would have been better as a minor bullet-point in the release notes. If and when someone publicly points out how much faster it is, the appropriate response would be, "heh, this is embarrassing, but we didn't know you should drop indexes before importing"


Heads are going to explode when they discover the arcane rarely-utilized tool known, nay, merely whispered of, only as "transactions".


Just today someone asked me "what's MVCC[1]?". One of today's lucky 10,000 [2].

[1]: https://en.wikipedia.org/wiki/Multiversion_concurrency_contr... [2]: https://xkcd.com/1053/


Do you have any numbers on how much extra time (that is, time spent servicing queries above the normal query times when your tables are indexed) application queries take after the loads, but before the index rebuilds are complete?

If so, how does that compare, in aggregate, to the time saved in the loads?

Or are you simply not putting the application back into service until the index rebuilds have finished? How long does that take, compared to the time saved?

EDIT: I'm mostly asking these questions to nudge people to think about them in the course of trying this in their own environments. It's my day job to think about these kinds of things; I've worn the PostgreSQL DBA hat for over a decade now.


Yes, you would never do this unless writes were completely disallowed to your app. You will save time building the indices in one-shot.


My point is: this is not necessarily true. The Fine Article even mentions that the benefits are situational.


Do you also disable autovacuum while this is running? That is another good trick for speeding up large database imports.


I would, since it would be a waste on a table loaded right after a create.
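For reference, per-table autovacuum can be toggled with a storage parameter; a hedged sketch (the table name is invented for illustration):

```sql
-- Disable autovacuum for the duration of the load (table name made up):
ALTER TABLE filings SET (autovacuum_enabled = false);
-- ... bulk load here ...
ALTER TABLE filings SET (autovacuum_enabled = true);
ANALYZE filings;  -- refresh planner statistics after the load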


N̶o̶ ̶s̶h̶i̶t̶,̶ ̶S̶h̶e̶r̶l̶o̶c̶k̶!̶ You don't say?


It was a bit abrupt but seriously it is kinda disheartening to read that a _lead_ developer _discovered_ such a basic thing.


Reminds me of that funny article from Uber on why they switched from one DB to another. It seems they had failed to discover a number of things.


The article is proposing that it can be quicker to drop the indexes and recreate them than to load a lot of data into an indexed table.


Yep, and this is common knowledge for most folks who do this kind of work. Heck, if you can stop the database, a whole host of things become quicker by dropping and recreating them. Most ALTER commands are quicker done as drops and creates, even if a copy of the table needs to be made.

Maybe we are missing something by getting rid of the DBAs.


It's not a proposal, it's common knowledge. Each INSERT causes the index to be updated: you have to search for the right location, then insert the new pointer. If you do 1,000,000 inserts, that's 1,000,000 searches and writes to the index.


Sure INSERTs do, but I personally would have guessed COPY does its work in a single transaction, which under the hood I would hope would avoid rebuilding the index until the transaction is committed?

I admit there’s no reason to expect this other than “so you don’t have to do the dumb drop/recreate indexes” trick... maybe I just expected COPY to be smarter than just expanding into a set of insert statements.


From personal experience: PostgreSQL's COPY commands aren't really all that performant, indexes or no.

Our project saw SIGNIFICANTLY better performance with batched multi-threaded INSERTs. If you can run a few hundred load threads and manage the concurrency correctly (not trivial), it will chew through big loads like a monster.

If I ever have the time/excuse, I want to go back and try a multi-threaded COPY. But if you need speed and have a choice between multi-threaded INSERTs or a single-threaded COPY, go with the INSERTs every time.


COPY, of course, is single threaded. If you can split your input data and run multiple COPY operations, you can get similar increases, although I would never suggest hundreds of threads. Depending on your I/O system, something like twice as many threads as CPU threads will probably work better.

On a one for one basis COPY IN will be faster than inserts:

- COPY uses a special optimization in the access method: instead of running the full insert logic (find a target page, lock it, insert, unlock page) per row, it batches all the rows that will fit on the target page.

- COPY overall has shorter code paths than regular inserts.
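The split-the-input idea from above boils down to a chunking step before handing each slice to its own COPY worker. A minimal sketch (the helper name and chunk size are made up; the worker pool itself is left out):

```python
from itertools import islice

# Break an iterable of rows into fixed-size chunks; each chunk could be
# fed to a separate connection running its own COPY.
def chunked(rows, size):
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

chunks = list(chunked(range(10), 4))
```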


Why not just use regular batched inserts with unnest, turning several array parameters into a table, instead of using some arcane, hard-to-use SQL command?
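For anyone unfamiliar with the pattern, a hedged sketch of what this looks like in PostgreSQL (table and column names are invented): each array parameter becomes a column, so a single statement inserts one row per array element.

```sql
-- Hypothetical example of the unnest() approach (names invented):
INSERT INTO filings (id, amount)
SELECT * FROM unnest($1::int[], $2::numeric[]) AS t(id, amount);
```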


Why is everyone acting so pissy about the fact that someone happened to find a performance improvement trick by reading the docs? Isn't this what's supposed to happen?

None of you read all the performance "tricks" to Postgres before writing your first SQL statement.

Every day, somebody's born who doesn't know how to boost the performance of COPY by dropping indexes.


Most people don’t brag about being “pioneers” for having read the docs for a basic function. That’s what’s causing all the mirth.


It looks like the article is directed at those doing bulk uploads to their system, not necessarily at developers. This may not be intuitive to people who aren't database admins or developers.


They should have stuck with bog-standard MySQL to get this time saving for free: if you restore a SQL dump created with phpMyAdmin or mysqldump, all the disable-index commands are already in there, good to go, and in SQL.

Whoever wrote the Django bit didn't really do a good job on the defaults.


I don't know what makes MySQL more "bog standard" than Postgres, but pg_dump takes `-F format`, one of which is plain SQL statements.

But that's not what they're dealing with. They're dealing with CSV, presumably from some external source. It'd also be faster if they were dealing with pre-formed database files that they could just rsync. But they're not.


yep




