In the preceding case, we can't tell the data apart at all, so we can remove duplicate rows
without any manual process. SQL is a set-based language, so picking just one row out of a set
is slightly harder than most people want it to be. Use a query block like the following to delete
all the exact duplicate rows, leaving just one row from each set of duplicates:
BEGIN;
LOCK TABLE new_cust IN SHARE ROW EXCLUSIVE MODE; --block concurrent changes while we delete
DELETE FROM new_cust
WHERE ctid NOT IN
(SELECT min(ctid)
FROM new_cust
WHERE customer_id IN (4) --specify exact duplicate ids
GROUP BY customer_id);
COMMIT;
and then follow that with
VACUUM new_cust;
to clean up the table after the deletions.
How it works...
The first query works by grouping together the rows on the unique column and counting rows.
Anything with more than one row must be caused by duplicate values.
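For a single unique column, that query takes a form like the following (a minimal sketch, assuming the new_cust table and customer_id column used in this recipe):
SELECT customer_id, count(*)
FROM new_cust
GROUP BY customer_id
HAVING count(*) > 1;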
If we're looking for duplicates across more than one column (or even all columns), then we have to use SQL of
the following form:
SELECT *
FROM mytable
WHERE (col1, col2, … ,colN) IN
(SELECT col1, col2, … ,colN
FROM mytable
GROUP BY col1, col2, … ,colN
HAVING count(*) > 1);
with (col1, col2, … , colN) as the list of key columns.
Note that this type of query will need to sort the complete table on all of the key columns. That
will require sort space equal to the size of the table, so you'd better think first before running
that SQL on very large tables. You'll probably benefit from a large work_mem setting for this
query, perhaps 128 MB or more.
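You can raise the setting for just your session before running the query, for example:
SET work_mem = '128MB';
and return to the server default afterwards with RESET work_mem.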
The DELETE query that we showed only works with PostgreSQL, because it uses the ctid
value, which is the internal identifier of each row in the table. If you wanted to run that query
against more than one column, as we did earlier in the chapter, you'd need to extend the
query along the following lines (a sketch, with col1, col2, … , colN again standing in for your
key columns):
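BEGIN;
LOCK TABLE mytable IN SHARE ROW EXCLUSIVE MODE;
DELETE FROM mytable
WHERE ctid NOT IN
(SELECT min(ctid) --keep one row from each set of duplicates
FROM mytable
GROUP BY col1, col2, … , colN);
COMMIT;
and then VACUUM the table afterwards, as before, to clean up after the deletions.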
 