The title is quite clickbait, isn’t it? But honestly that’s the result I get. Only with a 2 core 4 GB RAM virtual machine. Interesting, right?
We in Tokopedia use many kind of databases. Our main database is PostgreSQL because of its robustness. But some cases we need high write concurrency that even standard Insert Into’s PostgreSQL isn’t capable to handle that, even with a high spec machine.
Our strongest candidate was move to ScyllaDB, but moving to another storage isn’t easy. There are some points that we need to think of, like how to migrate the data, how about the learning curve, and how to handle the issue. While, PostgreSQL is used by everyone in Tokopedia, and also we have DB Administrator team that really understand PostgreSQL. Oh and also, we already use PostgreSQL for 10+ years!
And then we found another insertion method in PostgreSQL called COPY. At first, Copy is usually used by DBA to import CSV files to a table, commonly used for testing and simulation. Copy method itself is very fast, compared with Insert Into. And then we think, is Copy can be used in transaction based?
The answer is yes.
PostgreSQL library in Go (lib/pq) already support for executing Copy command. And then I create a little script to benchmark and compare Copy with other insertion method in PostgreSQL.
Virtual Machine spec that I used:
- CPU 2 Core — (Intel 10th Gen)
- RAM 4 GB
- SSD SATA III 10 GB
- Bridged network
- INSERT INTO
- INSERT INTO PREPARED STATEMENT
- INSERT INTO BULK 5 ROWS
- INSERT INTO BULK 10 ROWS
- COPY PER 1 ROW
- COPY PER 5 ROWS
- COPY PER 10 ROWS
With each method, there’s 3 treatments:
- Async using 50 workers
- Rows per second
And I have CSV files that ranged from 10K rows up to 1M rows.
First Test: Synchronous Insert 10K rows CSV
The first test is synchronous: Insert 10K rows of CSV with synchronously.
If we check from Total Time only, Copy command is very fast. It’s only takes 280 ms to do insertion 10K rows. Second and third position is Insert Into Bulk 10 rows and Copy 10 rows. But if we check from CPU usage, Copy 10 rows get first place with score only 4.8%, followed by Insert Into Prepared Statement by 5.6%. Of course it’s possible to be margin error. Let’s continue to next treatment.
The second test: Asynchronous Insert using 50 Workers
In this test I’m using CSV file with 100K rows. Also for the script I have 50 workers to do insertion to simulate production environment.
By the way, Copy All is just comparison. Because Copy All method cannot use asynchronous — because it’s copying entire CSV to the table.
Copy 10 rows and Insert Into Bulk 10 rows both have good scores for total time comparison. It’s really close, two of them already under 5 secs. But if we check from CPU usage itself, it seems Insert Into Prepared Statement is winner here, even lower than Copy All.
After we try with these two treatments, we are aware that in production environment is not inserts 1 big CSV file to a table but transaction based with requests per second / RPS. To simulate that, in third test I use insertion rows per second treatment to simulate the production environment.
The Final Test: Insertion Rows per Second
In the final test I’m using a CSV file with 1 million rows to avoid CSV file run out while insertion. Also I’m using 50 workers and check CPU usage average in 60 secs.
In 500 rows per second Insert Into and Insert Into Prepared Statement is not having a good result, even they can’t finish 1000 rows per second test. The other four methods seems can handle this test. Let’s scroll down to check 3000 and 5000 rows per second.
In 3000 rows per second we can see the winner here. Insert Into Bulk 5 rows is having a pressure here, and it can’t finish 5000 rows per second. On the other hand, Copy 5 rows outperforms Insert Into Bulk 10 rows in 5000 rows per second.
Copy command is a powerful insertion method in PostgreSQL, even better than commonly used Insert Into. Its power in bulk insert with a lot of rows. But Copy itself is not good even worse than Insert Into if we compare for single row insertion.
Also Copy (AFAIK) cannot returning a column value (for example: RETURNING id;). On the other hand, Insert Into can handle that case.
Thank you for reading.