In the preceding case, we can't tell the data apart at all, so we can remove duplicate rows
without any manual process. SQL is a set-based language, so picking just one row out of a set
is slightly harder than most people want it to be. Use a query block like the following to delete
all the exact duplicate rows, leaving just one row from each set of duplicates:
BEGIN;
LOCK TABLE new_cust IN SHARE ROW EXCLUSIVE MODE; --block concurrent changes while we delete
DELETE FROM new_cust
WHERE ctid NOT IN
(SELECT min(ctid)
FROM new_cust
WHERE customer_id IN (4) --specify exact duplicate ids
GROUP BY customer_id);
COMMIT;
and then follow that with
VACUUM new_cust;
to clean up the table after the deletions.
How it works...
The first query works by grouping together the rows on the unique column and counting rows.
Anything with more than one row must be caused by duplicate values.
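For a single unique column, that query takes a form like the following (a minimal sketch, assuming the new_cust table and customer_id column used in this recipe):
SELECT customer_id, count(*)
FROM new_cust
GROUP BY customer_id
HAVING count(*) > 1;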
If we're looking for duplicates across more than one column (or even all columns), then we have to use SQL of
the following form:
SELECT *
FROM mytable
WHERE (col1, col2, … ,colN) IN
(SELECT col1, col2, … ,colN
FROM mytable
GROUP BY col1, col2, … ,colN
HAVING count(*) > 1);
with (col1, col2, … , colN) as the list of key columns.
Note that this type of query will need to sort the complete table on all of the key columns. That
will require sort space equal to the size of the table, so you'd better think first before running
that SQL on very large tables. You'll probably benefit from a large work_mem setting for this
query, perhaps 128 MB or more.
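You can raise the setting for just your session before running the query, for example:
SET work_mem = '128MB';
and return to the server default afterwards with RESET work_mem.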
The DELETE query that we showed only works with PostgreSQL, because it uses the ctid
value, which is the internal identifier of each row in the table. If you wanted to run that query
against more than one column, as we did earlier in the chapter, you'd need to extend the
query along the following lines (a sketch, with col1, col2, … , colN again standing in for your
key columns):
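BEGIN;
LOCK TABLE mytable IN SHARE ROW EXCLUSIVE MODE;
DELETE FROM mytable
WHERE ctid NOT IN
(SELECT min(ctid) --keep one row from each set of duplicates
FROM mytable
GROUP BY col1, col2, … , colN);
COMMIT;
and then VACUUM the table afterwards, as before, to clean up after the deletions.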
 