Which is faster in Oracle: UPDATE or INSERT?
Say I have a simple table that has the following fields:
- ID: int, auto-incrementing (identity), primary key
- Name: varchar(50), unique, has unique index
- Tag: int
I never use the ID field for lookup, because my application is always based on working with the Name field.
I need to change the Tag value from time to time. I'm using the following trivial SQL code:
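The original snippet is not shown here; presumably it was a single keyed UPDATE along these lines (table name and values are placeholders):

update MyTable set Tag = 4 where Name = 'foo';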
I wondered if anyone knows whether the above is always faster than:
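That snippet is missing too; the alternative under discussion is presumably a DELETE followed by a re-INSERT of the same logical row:

delete from MyTable where Name = 'foo';
insert into MyTable (Name, Tag) values ('foo', 4);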
Again - I know that in the second example the ID is changed, but it does not matter for my application.
@KM: I agree, this is a simplification of my real table, where all the lookup is done on a unique string field that is not the primary key. I do have a primary key int value that is completely irrelevant so I removed it from the example (it's automatically created and does not take part in the lookup at all)
UPDATE also has the benefit of not breaking any foreign key relationships your table might have, as long as the key field being referenced doesn't change. If you DELETE + INSERT, some of your constraints might be violated and thus the DELETE might fail.
A bit too late with this answer, but since I faced a similar question, I made a test with JMeter and a MySQL server on the same machine, where I used:
- A Transaction Controller (generating the parent sample) that contained two JDBC Requests: a DELETE and an INSERT statement
- A separate JDBC Request containing the UPDATE statement
After running the test for 500 loops, I have obtained the following results:
DEL + INSERT - Average: 62ms
Update - Average: 30ms
Hi Michael. That test was performed 5 years ago against localhost (on a machine that doesn't exist anymore). For any accuracy, the test should be repeated on identical tables and on indexed tables. The initial idea was just to get a flavor of the performance difference. If time allows I may repeat it and update the post. I am glad this post still sparks people's curiosity :)
The bigger the table (in number and size of columns), the more expensive it becomes to delete and insert rather than update, because you have to pay the price of UNDO and REDO. DELETEs consume more UNDO space than UPDATEs, and your REDO contains twice as many statements as necessary.
Besides, it is plain wrong from a business point of view: consider how much harder it would be to understand a notional audit trail on that table.
There are some scenarios involving bulk updates of all the rows in a table where it is faster to create a new table using CTAS from the old table (applying the update in the projection of the SELECT clause), dropping the old table, and renaming the new table. The side effects are recreating indexes, managing constraints, and renewing privileges, but the approach is worth considering.
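A minimal sketch of that CTAS pattern, assuming a table t with primary key pk and a column col being rewritten (all names are illustrative):

-- apply the "update" in the projection of the SELECT
create table t_new as
  select pk, upper(col) as col
  from   t;

-- swap the tables; indexes, constraints and grants must be recreated
drop table t;
rename t_new to t;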
One command on the same row should always be faster than two on that same row. So the UPDATE only would be better.
EDIT: set up the table:
Run this, which takes 1 second on my system (SQL Server 2005):
Run this, which took 2 seconds on my system:
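The scripts themselves are not reproduced above; a hedged T-SQL reconstruction of that kind of test (loop count and names are assumptions):

-- test table (SQL Server)
create table MyTable (
    ID   int identity(1,1) primary key,
    Name varchar(50) not null unique,
    Tag  int
);
insert into MyTable (Name, Tag) values ('foo', 0);

-- approach 1: UPDATE in a loop
declare @i int;
set @i = 0;
while @i < 10000
begin
    update MyTable set Tag = @i where Name = 'foo';
    set @i = @i + 1;
end;

-- approach 2: DELETE + INSERT in a loop
set @i = 0;
while @i < 10000
begin
    delete from MyTable where Name = 'foo';
    insert into MyTable (Name, Tag) values ('foo', @i);
    set @i = @i + 1;
end;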
@Rax Olgud, how do you answer that? You haven't even said what database you are using. You asked a conceptual question, yet you want concrete data. If you want actual data then you will need to write a while loop (on your system) that updates the row 1,000 times, write another loop that deletes/inserts it 1,000 times, and see which is faster.
@Rax Olgud, there is some overhead in removing and creating an index value and checking any constraints. If you are just updating a data column, you avoid all of that overhead.
@Mohy66, the measurement is the time it takes to run; totaling the rowcount is to verify the amount of work that was done during the test. Thanks for the down vote.
I am afraid the body of your question is unrelated to the title question.
To answer the title:
In SQL, is UPDATE always faster than DELETE+INSERT?
then answer is NO!
Just google for
- "Expensive direct update"* "sql server"
- "deferred update"* "sql server"
Such updates result in a more costly (more processing) realization of the update, via an internal delete + insert, rather than a direct in-place update. These are the cases when:
- you update a field with a unique (or primary) key, or
- the new data does not fit into (is bigger than) the row space allocated before the update (or even the maximum row size), resulting in fragmentation,
- etc.
My quick (non-exhaustive) search, not pretending to be comprehensive, gave me [1], [2]
Keep in mind that the actual fragmentation that occurs when DELETE+INSERT is issued, as opposed to a correctly implemented UPDATE, will make a great difference over time.
That's why, for instance, the REPLACE INTO that MySQL implements is discouraged, as opposed to using the INSERT INTO ... ON DUPLICATE KEY UPDATE ... syntax.
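For illustration, a minimal MySQL sketch of the two forms (table and column names are assumptions):

-- REPLACE INTO deletes the conflicting row, then inserts a new one
replace into t (name, tag) values ('foo', 4);

-- ON DUPLICATE KEY UPDATE changes the existing row in place
insert into t (name, tag) values ('foo', 4)
on duplicate key update tag = values(tag);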
I just tried updating 43 fields on a table with 44 fields; the remaining field was the primary clustered key.
The update took 8 seconds.
A DELETE + INSERT completed faster than the minimum time interval that the "Client Statistics" reports via SQL Server Management Studio.
In your case, I believe the update will be faster.
You have defined a primary key, which will likely automatically become a clustered index (at least SQL Server does this). A clustered index means the records are physically laid out on disk according to the index. The DELETE operation itself won't cause much trouble; even after one record goes away, the index stays correct. But when you INSERT a new record, the DB engine has to put this record in the correct location, which in some circumstances causes some "reshuffling" of the old records to "make room" for the new one. That is where it will slow down the operation.
An index (especially a clustered one) works best if the values are ever-increasing, so new records simply get appended at the tail. Maybe you can add an extra INT IDENTITY column to become the clustered index; this would simplify insert operations.
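A sketch of that suggestion in T-SQL (names are illustrative; this assumes the existing primary key is declared nonclustered, since a table can have only one clustered index):

alter table MyTable add RowID int identity(1,1) not null;
create clustered index IX_MyTable_RowID on MyTable (RowID);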
In this example, the new record would go at the end of the table (based on the PK), because the user is not specifying the PK. If the "name" index were clustered, then that would be a problem, but it's unlikely to be clustered.
The question of speed is irrelevant without a specific speed problem.
If you are writing SQL code to make a change to an existing row, you UPDATE it. Anything else is incorrect.
If you're going to break the rules of how code should work, then you'd better have a damn good, quantified reason for it, and not a vague idea of "This way is faster", when you don't have any idea what "faster" is.
What if you have a few million rows? Each row starts with one piece of data, perhaps a client name. As you collect data for clients, their entries must be updated. Now, let's assume that the collection of client data is distributed across numerous other machines, from which it is later collected and put into the database. If each client has unique information, then you cannot perform a bulk update; i.e., there is no where-clause criterion you can use to update multiple clients in one shot. On the other hand, you could perform bulk inserts. So the question might be better posed as follows: is it better to perform millions of single updates, or is it better to compile them into large bulk deletes and inserts? In other words, instead of "update [table] set field=data where clientid=123" a million times, you do "delete from [table] where clientid in ([all clients to be updated]); insert into [table] values (data for client1), (data for client2), etc."
Is either choice better than the other, or are you screwed both ways?
Obviously, the answer varies based on what database you are using, but UPDATE can always be implemented faster than DELETE+INSERT. Since in-memory ops are mostly trivial anyway, given a hard-drive-based database, an UPDATE can change a database field in place on the hdd, while a delete would remove a row (leaving an empty space) and an insert would write a new row, perhaps at the end of the table (again, it's all in the implementation).
The other, minor, issue is that when you UPDATE a single value in a single row, the other columns in that row remain the same. If you DELETE and then do an INSERT, you run the risk of forgetting about the other columns and consequently leaving them behind (in which case you would have to do a SELECT before your DELETE to temporarily store your other columns before writing them back with the INSERT).
I'm not sure I agree with you regarding the first point, especially when using variable length string types. Updating those may indeed require HD writes in "new places".
Delete + Insert is almost always faster, because an Update has far more steps involved.
Update:
- Look up the row using the PK.
- Read the row from disk.
- Check which values have changed.
- Fire the ON UPDATE trigger with populated :NEW and :OLD variables.
- Write the new values to disk (the entire row).
(This repeats for every row you're updating.)
Delete + Insert:
- Mark the rows as deleted (only in the PK).
- Insert the new rows at the end of the table.
- Update the PK index with the locations of the new records.
(This doesn't repeat; all of it can be performed in a single block of operations.)
Using Delete + Insert will fragment your file system, but not that fast. Doing a lazy optimization in the background will always free unused blocks and pack the table.
This answer over-simplifies the operations and misses a lot of steps for the main commercial RDBMS products. Deleting a row by just altering the PK (and nothing else) is not how the main commercial RDBMSs work. Your information on triggers is incorrect and one-sided: for a start, the delete/insert could/would also fire triggers, but you fail to include those. And unless you specify a per-row trigger, it will fire just once for the update but twice for the delete/insert.
It depends on the product. A product could be implemented that (under the covers) converts all UPDATEs into a (transactionally wrapped) DELETE and INSERT. Provided the results are consistent with the UPDATE semantics.
I'm not saying I'm aware of any product that does this, but it's perfectly legal.
...perfectly legal, as long as the foreign-key constraint checking is deferred until after the insert, which may not be legal.
I am not sure but I have heard that SQL Server does a DELETE+INSERT for UPDATE, internally. If that is the case, will it make any difference in case of SQL Server?
@Faiz - as with everything, the only way to be sure is to test with your data, in your environment. The underlying cost of these operations is unlikely to be your bottleneck - ever. With SQL Server, if you've got a trigger, it certainly resembles a delete/insert, but whether that's what the system actually does, who needs to know :-)
Every write to the database has lots of potential side effects.
- Delete: a row must be removed, indexes updated, foreign keys checked and possibly cascade-deleted, etc.
- Insert: a row must be allocated - this might be in place of a deleted row, or might not be; indexes must be updated, foreign keys checked, etc.
- Update: one or more values must be updated; perhaps the row's data no longer fits into that block of the database, so more space must be allocated, which may cascade into multiple blocks being re-written or lead to fragmented blocks; if the value has foreign key constraints, they must be checked, etc.
For a very small number of columns, or if the whole row is updated, delete + insert might be faster, but the FK constraint problem is a big one. Sure, maybe you have no FK constraints now, but will that always be true? And if you have a trigger, it's easier to write code that handles updates if the update operation is truly an update.
Another issue to think about is that sometimes inserting and deleting hold different locks than updating. The DB might lock the entire table while you are inserting or deleting, as opposed to just locking a single record while you are updating that record.
In the end, I'd suggest just updating a record if you mean to update it. Then check your DB's performance statistics and the statistics for that table to see if there are performance improvements to be made. Anything else is premature.
An example from the ecommerce system I work on: we were storing credit-card transaction data in the database in a two-step approach. First, write a partial transaction to indicate that we've started the process. Then, when the authorization data is returned from the bank, update the record. We COULD have deleted and then re-inserted the record, but instead we just used update. Our DBA told us that the table was fragmented because the DB was only allocating a small amount of space for each row, and the update caused block chaining since it added a lot of data. However, rather than switch to DELETE + INSERT, we just tuned the database to always allocate the whole row; this meant the update could use the pre-allocated empty space with no problems. No code change was required, and the code remains simple and easy to understand.
Nowadays there is also the MERGE operator, but it is hard to use in stored procedures when rows have to be processed one at a time and are not necessarily produced by a query.
I'm interested in the following:
- Which is more efficient ((1) or (2)) in terms of the volume of redo data? Does the server generate redo for updates that touch 0 rows and for inserts that fail because of a constraint violation?
- Is the MERGE operator efficient enough that I can start using it, first inserting the data into a global temporary table? Can a PL/SQL table be used instead of the temporary table? Can PL/SQL variables be used as the data source for MERGE?
Tom Kyte's answer
You can optimize the processing if you know the characteristics of your data:
- If you are sure that MOST rows will be updated, update first, and if sql%rowcount = 0, insert the data.
- If you are sure that MOST rows will be inserted, insert first, and on error update the existing data.
If you don't know for sure, you will have to pick one; the volume of redo data will be the same in both cases.
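A minimal PL/SQL sketch of the update-first variant (table and column names are placeholders):

begin
    update t set val = :new_val where pk = :pk;
    if sql%rowcount = 0 then
        insert into t (pk, val) values (:pk, :new_val);
    end if;
end;
/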
The MERGE operator is wonderful, but it appeared in version 9i; it did not exist yet in version 8.1.7. Version 9i also has pipelined functions (search for the keyword "pipelined" in the materials on my site), so yes, you can use PL/SQL as a data source, alongside temporary tables.
Now, here is an example of using the MERGE operator and a "temporary" table:
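The example itself did not survive; a hedged sketch of a MERGE driven by a global temporary table (column names are assumptions; merge_data is the staging table discussed below):

create global temporary table merge_data (
    pk  number,
    val varchar2(30)
) on commit delete rows;

-- load merge_data, then merge it into the target in one statement
merge into t1
using merge_data md
on (t1.pk = md.pk)
when matched     then update set t1.val = md.val
when not matched then insert (pk, val) values (md.pk, md.val);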
So, for all intents and purposes, tables t1 and t2 are identical: we built both of them with an upsert over the data from AO.
The MERGE operator runs faster than the procedural code, and it does less work: it generates only 75% of the corresponding redo volume.
Next, I did the same thing with a pipelined function (two MERGE operators: a MERGE against the results of a MERGE), and it turned out that running a MERGE from a table built over the results of a pipelined function (via TABLE) gives roughly the same results.
It is the removal of procedural, programmer-written code that makes the difference. Try to solve problems AT THE SET LEVEL (non-procedurally). In general, the less procedural code you write, the better.
A good comparison
The results shown are very impressive. But I could not find the structure of the RUN_STATS table or of the STATS table/view; is it a data dictionary table/view?
Tom Kyte's answer
The MERGE operator and variables
Can variables be used instead of tables in a MERGE statement? For example, if the table being merged into has 4 columns, can those 4 columns be passed as parameters? How can this be done? Doing it like this:
does not work when a single table is used. I tried using the dual table in place of the missing one, and although the procedure compiled, it did not work.
Tom Kyte's answer
Identifiers cannot be replaced with parameters in ANY statement; the plan would change fundamentally. You will have to use dynamic SQL.
Which is faster: delete/insert or upsert?
If it's expected that all columns may change, would it be faster to do a bulk delete and bulk inserts, or to upsert using the MERGE operator as you described earlier?
Tom Kyte's answer
truncate + insert /*+ append */ into a table with the nologging option (followed, of course, by a backup of the corresponding tablespace right after you finish) will probably be the fastest method, especially if you do it like this:
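The original example is not shown; a sketch of the pattern (table and source names are assumptions):

alter table t nologging;
truncate table t;

insert /*+ append */ into t
select pk, val from staging;

commit;
-- back up the affected tablespace immediately afterwards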
A refinement of the previous question
What if I can't truncate the table? What if the table currently holds 100 million rows, and now, say, a million rows need to be inserted or updated?
Right now we delete all the duplicated rows from the table and then add the million records with bulk inserts.
This turned out to be faster than doing an update and, when an exception is raised because there is no matching row, inserting it.
Tom Kyte's answer
The reason is that row-by-row operations are almost always slower (there are exceptions, but that's the general case). It makes sense to use MERGE. I have a table of 3.8 million rows, and I want to use MERGE to add 1% more rows:
And here is what happens if you do a delete and insert:
Even allowing for the /*+ APPEND */ hint, which you would hardly use in a real situation (that 1% of space would not be reused when appending).
A great example
Your example perfectly illustrates your point. I want to ask one last question on this topic. It seems to me that inserting the data into the merge_data table carries unnecessary overhead. If the merge_data table is a global temporary table, may I take the following statement to be true?
A simple yes or no will do.
Tom Kyte's answer
I assumed the merge_data table already existed. But in general, I believe that populating merge_data + MERGE will be more efficient than a DELETE by key followed by an INSERT.
I don't like giving simple yes-or-no answers ;)
The MERGE operator and the TABLE(CAST...) construct
Is the MERGE operator supported in the following case, or is there another way to do this:
Everything works if I use a regular table instead of the collection. I tested this on Oracle 9.2.0.1.0.
Tom Kyte's answer
You have to "select" from the function:
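The snippet is missing, but the usual shape of the fix is to wrap the collection in a SELECT inside the USING clause (the collection bind and type names are assumptions):

merge into t
using (select *
       from table(cast(:my_collection as my_tab_type))) src
on (t.pk = src.pk)
when matched     then update set t.val = src.val
when not matched then insert (pk, val) values (src.pk, src.val);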
The MERGE operator
I need help with the MERGE operator.
I created the following table:
This statement works as expected, since there is data satisfying the criterion, and it changes the value in the currno column from 31 to 32.
If I change the year so that the "when not matched" branch should fire, the statement does nothing. I changed the year from 2002 to 2003:
Logically this is a "not matched" situation, so I expected a new record with the values PBM, GE, 2003, 1 to be inserted into the table. But that did not happen. Why?
Tom Kyte's answer
The query in your USING clause returned no data at all; that's why. There is nothing to compare against.
Reader's comment from July 29, 2003
How can I apply MERGE in this case? I have to check whether a record exists by primary key and, if it does not, do an INSERT, otherwise an UPDATE. We currently do the same thing with a PL/SQL code block. It would be great to use MERGE, because this operation will run very often.
Tom Kyte's answer
If the set you are merging with (the MERGE source) CONTAINS NO DATA, then there is nothing to compare and nothing to insert. That is the problem: you are searching an EMPTY set. The server will not "make up" the data for you.
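A common way to guarantee a non-empty source (a hedged sketch; the column names are placeholders, the values come from the example above) is to select the candidate values themselves from dual in the USING clause:

merge into t
using (select 'PBM' code, 'GE' org, 2003 yr from dual) src
on (t.code = src.code and t.org = src.org and t.yr = src.yr)
when matched     then update set t.currno = t.currno + 1
when not matched then insert (code, org, yr, currno)
                      values (src.code, src.org, src.yr, 1);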
Reader's comment from August 20, 2003
Can the MERGE operator be used if, on a 9i server, the USING clause of the MERGE selects its data over a database link (DB Link) from an 8i server?
Tom Kyte's answer
Using sql%rowcount to update information about user activity
I use your way of doing an "upsert": update the records, then check sql%rowcount, and insert the data if sql%rowcount = 0.
I want to avoid records like:
I hope I have described the problem clearly.
Tom Kyte's answer
You could simply enable auditing and log connections. That will give you the "start and end times" of sessions. A simple outer join to the v$session view, or better yet a scalar subquery against v$session, will fill in the missing end times.
Please give an example.
Thanks for the tip about auditing.
Right now, if the update doesn't take effect, I insert a row. The problem is that I ultimately end up with the following:
And I can't tell that user1 was not active in the system from 6 pm to 7 pm. I'm not using auditing because I also want to track other events that auditing doesn't cover, for example periods when the number of transactions per second exceeds 50.
Tom Kyte's answer
With auditing you will get the correct result without any problems. I can't solve the 6-pm-to-7-pm problem any other way; only with auditing.
How do I track periods when a file is unavailable?
I can use auditing to track user activity. But how do I track whether a data file is offline (or other database-level events)? How do I get a single record per interval during which the data file was unavailable, rather than writing 12 records per hour (checking every 5 minutes)?
Suppose we don't have foreign keys defined between my tables. In order to update a row in a table, we can use the UPDATE statement, as everybody knows, or we can come up with the alternative of DELETEing the row first and INSERTing a new row.
What's the difference?
If it's a huge OLTP system, what are the impacts on Oracle? Any ideas?
Blocking Locks
- Clashes on primary or unique keys for inserts
- Another session updating or deleting the same rows
It can be hard to completely avoid this problem. For example, if two users need to update the same rows at the same time, the second must wait for the first to end their transaction. To minimize the effect, ensure you commit or roll back transactions as soon as possible. If these continue to be an issue, you may need to redesign how you process data changes.
Comments
Hi,
My understanding is that when you UPDATE, Oracle is actually deleting and inserting the row. So I don't think there is any difference involved. I expect Oracle gurus to shed more light on this and correct me if I am wrong.
From the performance perspective, update is better than delete and insert. When you delete a row, the whole row goes to the rollback segments and is also written to the redo log; for an insert statement the redo log info is rather small, while an update statement generates redo information only for the changed columns.
Oracle replaces the existing row in the database block in place only when the size of the updated row is the same as before the update. When it is smaller or bigger, Oracle relocates the row to a new position allocated from the block's free space. If there is not enough contiguous free space in the block, Oracle performs a block reorganization. When there is not enough free space left in the block to accommodate the row after the update, the row is migrated to another block and the old block holds a pointer to the new location; in that case, whenever this row is accessed, Oracle has to access two blocks. See the Concepts manual for further explanation.
That would be a very silly idea. Just run a small test with autotrace on in SQL*Plus and look at the results.
create table t (id number, a varchar2(100), b date);
table created.
insert into t select rownum, lpad('a', 100), sysdate
from all_objects;
9783 rows created.
set autotrace on
-- updating only one column
update t set b = sysdate where rownum < 2;
Statistics
-----------------------------------------------------
0 recursive calls
1 db block gets
4 consistent gets
0 physical reads
244 redo size
796 bytes sent via SQL*Net to client
912 bytes received via SQL*Net from client
4 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
1 rows processed
-- so look at redo size
-- and what if we delete and then insert again
delete t where rownum < 2;
Statistics
----------------------------------------------------
0 recursive calls
2 db block gets
4 consistent gets
0 physical reads
508 redo size
789 bytes sent via SQL*Net to client
902 bytes received via SQL*Net from client
4 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
1 rows processed
insert into t values (1, lpad('a', 100), sysdate);
Statistics
----------------------------------------------------
0 recursive calls
3 db block gets
1 consistent gets
0 physical reads
344 redo size
797 bytes sent via SQL*Net to client
926 bytes received via SQL*Net from client
4 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
1 rows processed
Sum the last two redo sizes and compare them with the first one, and think about what happens if you have thousands and millions of such operations each day.
And there's more: you'll have to parse more statements, you'll have to maintain (update and delete entries in) all the indexes on the table, you'll take more locks and latches, you'll mess up integrity constraints, etc.
As with selects, you can get execution plans for inserts, updates, and deletes. The process is the same as for queries. Run the statement then call DBMS_XPlan as shown:
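A typical invocation (the tutorial's exact snippet is not reproduced here) fetches the plan of the last statement run in the session:

select *
from   table(dbms_xplan.display_cursor());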
For INSERT-AS-SELECT, UPDATE, and DELETE, review the plan to see if you can make changes to help the optimizer find the rows faster.
But even if you have an optimal execution plan for a statement, rewriting DML processes can often give order-of-magnitude performance gains.
Triggers
Triggers run extra code when processing your insert, update or delete. You can check if there are any triggers on your tables by querying the *_TRIGGERS views:
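For example, to list the triggers on the bricks table used in these tutorials (the exact query isn't shown above):

select trigger_name, triggering_event, status
from   user_triggers
where  table_name = 'BRICKS';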
If there are, review the code in the trigger. It may be that the write is slow because of the work done in the trigger. If so, see if you can rewrite the process, ideally without using triggers!
Single Inserts vs Batch Performance Comparison
This compares methods for inserting 100,000 rows:
- A for loop with 100,000 iterations, each adding one row
- One FORALL INSERT, adding all 100k rows
Due to the shared nature of Live SQL, there can be a lot of variation in run times for the processes in this tutorial. Run the comparisons 3-4 times to get a feel for the relative performance of the methods shown.
The difference between these methods is large. The FORALL method completes in a few tenths of a second. Single-row inserts in a loop often take 10x longer!
Tuning Delete Using DDL: Removing Many Rows
Partitioning a table splits it up into smaller sub-tables. You can do operations that affect all the rows in a partition, leaving the other partitions unaffected. For example, dropping a partition.
This syntax (added in 12.2) changes the table to be partitioned into batches of 10,000 rows:
You can now remove all the rows in a partition by dropping or truncating it. This truncates the first partition:
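The statements themselves are omitted above; a sketch of the idea, assuming bricks is keyed by brick_id (the interval width and partition name are assumptions):

-- convert to a range-partitioned table in 10,000-row batches (12.2+)
alter table bricks modify
  partition by range (brick_id) interval (10000) (
    partition p1 values less than (10001)
  ) online;

-- metadata-only removal of every row in the first partition
alter table bricks truncate partition p1;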
Partitioning a table on insert date is a common strategy. This makes it fast and easy to remove all the rows added before a certain date.
Remember that partitioning a table impacts all statements you run against it. Operations that read many partitions may be slower than against the equivalent non-partitioned table. Ensure you test the whole application workload before diving in with partitioning!
Tuning Updates Using DDL
An UPDATE with a where clause is much faster than looping through rows and running an UPDATE for each row. But the single update can still take a long time to finish. And it locks all the affected rows from the time the UPDATE starts until the transaction finishes.
This could take unacceptably long on large tables. If you're changing most or all of the rows in a table, you can make the process faster by doing an "update" in DDL with this process:
- Create a holding table using CREATE-TABLE-AS-SELECT (CTAS). Provide the new column values in this select statement
- Switch the real and temporary table over
It's rare you need to use this trick. Reserve this for one-off migrations or other cases where speed is of the essence.
For example, this creates BRICKS_UPDATE, "updating" the colour and shape of every row to yellow prism by selecting these values instead of the columns in the table:
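The statement isn't reproduced above; a sketch based on that description (the column list and datatypes are assumptions):

create table bricks_update as
  select brick_id,
         cast('yellow' as varchar2(10)) colour,
         cast('prism'  as varchar2(10)) shape,
         weight
  from   bricks;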
Note that when selecting literal values, you need to CAST them to ensure the new table has the same data types as the original.
To complete the "update", you need to:
- Copy any indexes, grants, triggers, etc. from the old table to the new
- Switch the tables over
Adding the dependent objects to the new table - particularly indexes - can take a while. Be sure to measure the total time to complete the process!
This completes the process by dropping the original table, then renaming the new one:
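A sketch of that final step (assuming the new table is named bricks_update as above):

drop table bricks;
alter table bricks_update rename to bricks;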
This destroys your rollback position! If there's an error in the process, you'll have to restore from backup. It's safer to rename the original table (e.g. BRICKS_OLD), then rename the new table.
Whichever method you use to swap the tables, there will be a brief period where there is no BRICKS table! You need to take the application(s) offline to complete this process safely.
Tuning Updates Using DDL Performance Comparison
This compares the run time of an update changing the shape and colour of every row to doing the "update" using CREATE-TABLE-AS-SELECT and swapping the tables over:
The speed gains of using a "DDL update" are small in this case. Given the extra complexity and risks, it's only worth considering when changing millions of rows or more. In most cases UPDATE is safer, easier, and fast enough.
Tuning Delete as DDL: Removing Most of the Rows
- CTAS to save rows, then switch the rows
- CTAS to save rows, then switch the tables
- A filtered table move
With create-table-as-select, write a query fetching the rows you want to keep. Then wrap this in a create-table statement:
From here, you can either switch just the rows over by truncating the original table and re-inserting the rows:
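Sketches of those two steps (the keep-predicate and names are assumptions):

-- save the rows you want to keep
create table bricks_keep as
  select * from bricks
  where  insert_date >= date '2024-01-01';

-- switch the rows over
truncate table bricks;
insert into bricks select * from bricks_keep;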
Or you could switch the tables themselves over by dropping or renaming the original table, then renaming the new table to the old name.
As with UPDATE, if you switch the tables over, you also need to copy any indexes, constraints, etc. from the old table to the new one. Ensure you test the runtime of the complete process!
Both of these methods need an outage to complete safely, making them unusable except in extreme cases. Fortunately, Oracle Database 12.2 added an online DDL method to remove lots of data: a filtered table move.
This added the "including rows" clause to ALTER TABLE MOVE. You place a where clause after this to state the rows you want to keep. The database discards the non-matching rows in the process.
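A sketch of a filtered table move (the keep-predicate is an assumption):

alter table bricks
  move online
  including rows where colour = 'red';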
Tuning Delete Using DDL: Removing All Rows
You can also make deletion processes faster by changing DML to DDL. The easiest case is when you remove all the rows from a table.
Instead of a DELETE without a WHERE clause, use TRUNCATE. This is a meta-data only operation, so is "instant":
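For example (table name from the tutorials):

truncate table bricks;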
This gives big performance gains, as this test shows:
But TRUNCATE commits, so you can't roll it back. And it has some restrictions that don't apply to DELETE. So this is not always appropriate.
And it's rare you want to remove all the rows in a table. Even when archiving most of the data, usually you want to keep some rows. There are several other DDL tricks you can use to do this.
Tuning Delete Using DDL: Removing Many Rows Performance Comparison
This compares removing 20,000 rows with DELETE to truncating the first two partitions:
Note: 20,000 rows is tiny in modern database terms. You may notice little or no performance difference between the DELETE and truncating the partitions in the above process.
Try It!
This code loops through all the rows with a weight less than six and sets their weight to one. Run it a few times to get a feel for how long it takes to execute:
Replace /* TODO */ in the code below to turn the update above into one statement:
How does this affect the runtime of the process?
Try It Challenge!
This process starts with 50,000 rows in BRICKS. It then loops through the data to:
- Insert 20,000 more rows
- Update the weight of blue bricks
- Delete all the green bricks
Run it a few times to see how long it takes. Then rewrite the process using the template below to make this faster:
Am I correct to assume that an UPDATE query takes more resources than an INSERT query?
Why would you compare those? They serve completely different purposes, so usually you have no choice - just use the one that does the job.
@Lukasz Milewski You may choose to delete and insert vs. update if the speed is quicker and you are in effect getting the same result. Sometimes it is quicker to delete and then rewrite all rows vs. having MySQL figure out whether a row needs an update.
The question should mention whether the UPDATE concerns a single row and whether the INSERT creates a single row. The answer may also depend on whether the operation happens inside a transaction.
Delete Performance Comparison
- 100k DELETE statements removing one row
- One DELETE removing all 100k rows
- One FORALL DELETE wiping all 100k rows
Again, the single statement is considerably faster. This time ~2-3x faster than deleting all 100k rows one-by-one. The runtime for bulk deletion typically falls somewhere between the two other methods.
Update Performance Comparison
This compares running:
- 100k UPDATE statements changing one row
- One UPDATE changing 100k rows
- One FORALL UPDATE changing 100k rows
In this example the single update statement is typically 3-4x faster than looping through 100k rows and updating each one. Bulk processing gives similar performance to plain SQL, though is usually slightly slower.
Many One Row Inserts vs. One Many Row Insert
You can also use INSERT to copy rows from one table to another. This code simulates loading 10,000 rows from one table to another using single-row inserts inside a cursor-for loop:
As with the previous example, this means the database has to run the INSERT statement 10,000 times.
When changing data, it's faster to run one statement that changes all the rows. Putting DML inside a loop that changes one row on each iteration is a sure way to slow SQL!
Instead, have one SQL statement that inserts all the rows. This uses INSERT-AS-SELECT to add all 10,000 rows in one go:
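A sketch, assuming a source table bricks_staging with matching columns:

insert into bricks (brick_id, colour, shape, weight)
select brick_id, colour, shape, weight
from   bricks_staging;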
You can also use bulk processing to copy data. To use this, BULK COLLECT the query into an array. Then use a FORALL INSERT:
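A hedged sketch of that pattern (the staging table is an assumption):

declare
    type t_bricks is table of bricks%rowtype;
    l_bricks t_bricks;
begin
    -- fetch the source rows into an array in one call
    select * bulk collect into l_bricks from bricks_staging;

    -- insert them all with one bulk operation
    forall i in 1 .. l_bricks.count
        insert into bricks values l_bricks(i);
end;
/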
In most cases it's quicker and easier to use INSERT-SELECT. Reserve BULK COLLECT ... FORALL for cases where you need to do procedural (non-SQL) processing of the data before loading it into the target table.
Insert Performance Comparison
This compares methods for copying 100,000 rows from one table to another:
- A cursor-for loop with 100,000 iterations, each adding one row
- One INSERT-SELECT statement, inserting all 100,000 rows
- One BULK COLLECTING the rows into an array, followed by a FORALL INSERT, adding all 100k rows
The single statement can be up to 10x faster than the looped approach: a huge saving! Copying the data using BULK COLLECT then a FORALL INSERT is also much faster than the loop, though typically marginally slower than plain SQL.
Single Row Inserts vs Batch Inserts
This code inserts 10,000 rows in the bricks table:
This means the database has to run the INSERT statement 10,000 times. Although each individual execution is fast, they add up quickly, making the overall process slow. This is particularly problematic if the SQL comes from the mid-tier application, as this makes a round trip to the database for each row.
If you have a commit inside the loop, the process will be even slower!
You can make this process faster using bulk processing.
To use this, load an array with data, then use the FORALL statement. This looks like a FOR LOOP, but unlike a loop, it processes the DML statement inside only once:
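A minimal sketch (the row count and column list are assumptions):

declare
    type t_ids is table of pls_integer index by pls_integer;
    l_ids t_ids;
begin
    -- load the array
    for i in 1 .. 10000 loop
        l_ids(i) := i;
    end loop;

    -- one FORALL: the INSERT runs as a single bulk operation
    forall i in 1 .. l_ids.count
        insert into bricks (brick_id, colour) values (l_ids(i), 'red');
end;
/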
You can also execute bulk or batch DML statements from the middle tier in most programming languages. Check the documentation for your technology to find out how to do this.
Many One Row Deletes vs. One Many Row Delete
Finally, DELETE. As with the other DML statements, running a single DELETE removing all the rows is faster than lots of statements removing one at a time.
This removes all the rows with BRICK_IDs between 1 and 1,000, one at a time:
It's much better to run a single DELETE with a WHERE clause that removes the desired rows:
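For example (predicate taken from the text above):

delete from bricks
where  brick_id between 1 and 1000;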
As with UPDATE and INSERT, you can use bulk/batch processing to process many rows in one call.
Tuning Delete as DDL: Removing Most of the Rows Performance Comparison
This compares the relative performance of these methods and a regular DELETE to remove 90% of the rows from a table:
All are notably faster than a regular DELETE. This data set is small, so there's little performance difference between them.
These methods are most effective when removing a large fraction (50%+) of the rows from a table. But they can be slower than a plain DELETE when removing a large absolute number of rows (millions or more) that are only a small fraction of the table. This kind of operation is common when archiving data, such as removing the oldest month of data.
When doing this, partitioning is a great way to remove large numbers of rows fast.
Check for Side Effects
Many One Row Updates vs. One Many Row Update
This runs a cursor-for loop to get all the red rows. For each row fetched, it updates its weight to one:
But the loop is unnecessary! Adding a WHERE clause to UPDATE changes all the rows where the condition is true. Instead of a loop, search for the rows where the colour is red:
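For example (column names from the tutorial's bricks table):

update bricks
set    weight = 1
where  colour = 'red';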
As with INSERT, you can also use FORALL or other batch processing methods to change many rows in one call.
This uses BULK COLLECT to populate an array with BRICK_IDs for the red rows:
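A hedged sketch of that array-based update (names as in the tutorial):

declare
    type t_ids is table of bricks.brick_id%type;
    l_ids t_ids;
begin
    -- collect the ids of the red rows into an array
    select brick_id bulk collect into l_ids
    from   bricks
    where  colour = 'red';

    -- one bulk UPDATE over the array
    forall i in 1 .. l_ids.count
        update bricks set weight = 1 where brick_id = l_ids(i);
end;
/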
Typically you only need bulk updates when either:
- You receive arrays with different new column values for each row from the application
- You use the array to do other procedural (non-SQL) processing
I am not a database guru, but here are my two cents:
Personally I don't think you have much choice in this regard; even if INSERT were faster (which remains to be proven), can you convert an update into an insert?! Frankly, I don't think you can do it all the time.
During an INSERT you don't usually have to use a WHERE clause to identify a row, but depending on the indexes on that table the operation can have some cost.
During an UPDATE, if you do not change any column included in any index, you could have quick execution, provided the WHERE clause is easy and fast enough.
Nothing is written in stone, and really I would imagine it depends on the whole database setup, indexes and so on.
Anyway, found this one as a reference:
Sometimes you can use INSERT ... ON DUPLICATE KEY UPDATE to partially simulate an UPDATE via INSERT. But I believe that in that case MySQL does the INSERT and then the UPDATE if there is a duplicate, so you end up with two operations, which should be slower than a single UPDATE.
If you plan to perform a large processing (such as rating or billing for a cellular company), this question has a huge impact on system performance.
Performing large-scale updates vs. making many new tables and indexes has been proven to reduce my company's billing process from 26 hours to 1 hour!
I tried it on 2 million records for 100,000 customers.
I first created the billing table, and then for every customer's summary calls I updated the billing table with the duration, price, discount... a total of 10 fields.
In the second option I created 4 phases.
Each phase reads the previous table(s), creates an index (after the table insert completes), and uses "insert into ... select ..." to create the next table for the next phase.
Summary
Although the second alternative requires much more disk space (all views and temporary tables are deleted at the end), there are 3 main advantages to this option:
- It was 4 times faster than option 1.
- If there was a problem in the middle of the process, I could restart it from the point where it failed, as all the tables from the beginning of the phase were ready. If the process fails under the first option, you need to start the whole process all over again.
- It made development and QA work much faster, as they could work in parallel.
The key resource here is disk access (IOPS, to be precise), and we should evaluate which approach results in the minimum of it.
I agree with others that it is impossible to give a generic answer, but here are some thoughts to lead you in the right direction. Assume a simple key-value store where the key is indexed. Insertion means inserting a new key; update means updating the value of an existing key.
If that is the case (a very common one), an update would be faster than an insertion, because an update involves an indexed lookup and a change to an existing value without touching the index. You can assume that is one disk read to get the data and possibly one disk write. An insertion, on the other hand, involves two disk writes: one for the index, one for the data. And there is another hidden cost: the B-tree node splitting and new node creation that happen in the background during insertion, leading to more disk accesses on average.
You cannot compare an INSERT and an UPDATE in general. Give us an example (with a schema definition) and we will explain which one costs more and why. Also, you can compare a concrete INSERT and UPDATE by checking their plans and execution times.
Some rules of thumbs though:
- if you update only one field, which is not indexed, and you update only one record, and you use the rowid/primary key to find that record, then this UPDATE will cost less than
- an INSERT that also affects only one row, but where the row has many NOT NULL constrained, indexed fields, and all those indexes have to be maintained (e.g., adding a new leaf)
It depends. A simple UPDATE that uses a primary key in the WHERE clause and updates only a single non-indexed field would likely be less costly than an INSERT on the same table. But even that depends on the database engine involved. An UPDATE that involved modifying many indexed fields, however, might be more costly than the INSERT on that table because more index key modifications would be required. An UPDATE with a poorly constructed WHERE clause that required a table scan of millions of records would certainly be more expensive than an INSERT on that table.
These statements can take many forms, but if you limit the discussion to their "basic" forms that involve a single record, then the larger portion of the cost will usually be dedicated to modifying the indexes. Each indexed field that is modified during an UPDATE would typically involve two basic operations (delete the old key and add the new key) whereas the INSERT would require one (add the new key). Of course, a clustered index would then add some other dynamics as would locking issues, transaction isolation, etc. So, ultimately, the comparison between these statements in a general sense is not really possible and would probably require benchmarking of specific statements if it actually mattered.
Typically, though, it makes sense to just use the correct statement and not worry about it since it is usually not an option to choose between an UPDATE and an INSERT.