BEGIN;
LOCK TABLE mytable IN SHARE ROW EXCLUSIVE MODE;
DELETE FROM mytable
WHERE ctid NOT IN
      (SELECT min(ctid)
       FROM mytable
       -- GROUP BY must list every column, so only exact duplicates match
       GROUP BY col1, col2, …, colN);
COMMIT;
The preceding query works by grouping together all the rows with identical values and then
keeping only the row with the lowest ctid value in each group. The lowest ctid is the one
nearest to the start of the table, so duplicates are removed from the far end of the table.
When we run VACUUM, we may find that the table gets smaller, because we have removed
rows from the far end.
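As a quick illustration (a hypothetical table and values; the ctids shown are what a fresh,
empty table would normally assign), each group of identical rows keeps only its lowest ctid:

CREATE TABLE mytable (col1 integer, col2 text);
INSERT INTO mytable VALUES (1, 'a'), (1, 'a'), (2, 'b');

SELECT ctid, * FROM mytable;
--  ctid  | col1 | col2
-- (0,1)  |    1 | a      <- min(ctid) for its group, so it is kept
-- (0,2)  |    1 | a      <- deleted by the query above
-- (0,3)  |    2 | b      <- only row in its group, so it is kept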
The BEGIN and COMMIT commands wrap the LOCK and DELETE commands into a
single transaction. The LOCK command applies a sufficiently high level of lock against
the table to prevent UPDATE and DELETE commands from being executed against it while
we remove the duplicates.
There's more...
Locking the table against changes for long periods may not be acceptable while we remove
duplicate rows, which poses some fairly hard problems on large tables. In that case, we need
to do things slightly differently:
- Identify the rows to be deleted, and save them into a side table (one way to do this
  is sketched just after this list)
- Build an index on the main table to speed up access to the rows
- Write a program that reads the rows from the side table in a loop, performing a series
  of smaller transactions (a sketch of such a program appears at the end of this section);
  each iteration should:
  - Start a new transaction
  - Read a set of rows from the side table that match rows in the main table
  - Select those rows from the main table FOR UPDATE, relying on the index to make those
    accesses happen quickly
  - Delete the appropriate rows
  - Commit, and then loop again
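For the first two steps, something such as the following could work (the side-table name
dup_keys and the columns col1 and col2 are illustrative, not fixed by the recipe):

CREATE TABLE dup_keys AS
SELECT col1, col2
FROM mytable
GROUP BY col1, col2
HAVING count(*) > 1;

CREATE INDEX ON mytable (col1, col2);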
The aforementioned program can't be written as a database function, as we can't have
multiple transactions in a function. We need multiple transactions to ensure we hold locks on
each row for the shortest possible duration.
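As an illustration, here is a minimal sketch of such a batch-deletion program in Python with
psycopg2, continuing the hypothetical dup_keys side table from above (the language, driver,
batch size, and all object names are assumptions, not part of the recipe):

import psycopg2

BATCH_SIZE = 1000

conn = psycopg2.connect("dbname=mydb")   # illustrative connection string

while True:
    with conn:                           # one short transaction per batch
        with conn.cursor() as cur:
            # Take the next batch of duplicate keys from the side table,
            # removing them so no batch is processed twice.
            cur.execute("""
                DELETE FROM dup_keys
                WHERE ctid IN (SELECT ctid FROM dup_keys LIMIT %s)
                RETURNING col1, col2
            """, (BATCH_SIZE,))
            keys = cur.fetchall()
            if not keys:
                break                    # side table exhausted; we're done

            for col1, col2 in keys:
                # Lock the matching rows, using the index for fast access.
                cur.execute("""
                    SELECT ctid FROM mytable
                    WHERE col1 = %s AND col2 = %s
                    ORDER BY ctid
                    FOR UPDATE
                """, (col1, col2))
                ctids = [row[0] for row in cur.fetchall()]

                # Delete every copy except the one with the lowest ctid.
                for ctid in ctids[1:]:
                    cur.execute("DELETE FROM mytable WHERE ctid = %s",
                                (ctid,))
    # The transaction commits here, releasing the row locks before
    # the next batch begins.

conn.close()

Because each batch commits before the next begins, the FOR UPDATE row locks are held only
briefly. If the program fails mid-batch, that batch's transaction rolls back, restoring its
keys to the side table, so the program can simply be restarted.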
Preventing duplicate rows
Preventing duplicate rows is one of the most important aspects of data quality for any
database. PostgreSQL offers some useful features in this area, going beyond what most
other relational databases provide.
 