postgres=# SELECT count(*) FROM mybigtable
WHERE random() < 0.01;
count
-------
95
(1 row)
postgres=# SELECT count(*) FROM mybigtable
WHERE random() < 0.01;
count
-------
106
(1 row)
The clause WHERE random() < 0.01 generates a random number between 0.0 and 1.0 for
each row, and then checks whether that number is less than 0.01. In other words, this WHERE clause
returns approximately a 1% random sample of the rows in the table; as the two counts above show,
the exact number of rows varies from run to run. You can change the threshold to sample
whatever percentage you choose. Easy.
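For example, as a sketch using the same mybigtable as above, lowering the threshold to 0.001 takes roughly a 0.1% sample instead:
SELECT count(*) FROM mybigtable
WHERE random() < 0.001;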
Now we need to get the sampled data out of the database, which is tricky for a few reasons.
First, there is no option to specify a WHERE clause for pg_dump. Second, if you create a view
that contains the WHERE clause, pg_dump dumps only the view definition, not the data the view
would return. However, you can use pg_dump to dump everything in a database apart from a chosen
set of tables, so you can produce a sampled dump like the following:
pg_dump --exclude-table=MyBigTable > db.dmp
pg_dump --table=MyBigTable --schema-only > mybigtable.schema
psql -c '\copy (SELECT * FROM MyBigTable WHERE random() < 0.01) to mybigtable.dat'
and then reload it into a separate database using
psql -f db.dmp
psql -f mybigtable.schema
psql -c '\copy mybigtable from mybigtable.dat'
Overall, my advice is to avoid sampling if you can. Otherwise, at least restrict it to a few very
large tables. This avoids both the mathematical issues surrounding sample design and the
difficulty of extracting the data.
How it works...
The extract mechanism shows off the capabilities of the PostgreSQL command-line tools,
psql and pg_dump: pg_dump allows you to include or exclude tables and to dump the
whole table or just its schema, whereas psql allows you to dump the result of an
arbitrary query to a file.
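As a sketch of that last point, \copy accepts the same options as COPY, so if a CSV file with a header row suits your downstream tools better, you could write the sample like this (again assuming mybigtable):
psql -c '\copy (SELECT * FROM mybigtable WHERE random() < 0.01) to mybigtable.csv csv header'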
I haven't discussed how random the random() function is. This isn't the right place for such
details; if you prefer another mechanism, you can find an external random number generator,
and call out to it from SQL using a C language function.
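As a very rough sketch, once such a generator has been compiled into a shared library (the library name my_rng and symbol my_random below are hypothetical), you could expose it to SQL along these lines and use it in place of random():
CREATE FUNCTION my_random() RETURNS double precision
    AS 'my_rng', 'my_random'
    LANGUAGE C STRICT;
SELECT count(*) FROM mybigtable WHERE my_random() < 0.01;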
 