Tables & Data - PostgreSQL 9 Administration

pgloader is written in Python and connects to PostgreSQL through the standard Python client interface. Yes, pgloader is less efficient than loading data files using a COPY command, but running a COPY has many more restrictions: the file has to already be in the right place on the server, has to be in the right format, and must be unlikely to throw errors on load. pgloader has additional overhead, but it can also load data using multiple parallel threads, so it can be faster in practice as well. pgloader's ability to call out to reformat functions written in Python is often essential; straight COPY is just too simple. pgloader also allows loading from fixed-width files, which COPY cannot do.
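To illustrate the kind of reformat function mentioned above, here is a minimal sketch in Python. The function name `reformat_date` and the DD/MM/YYYY input format are hypothetical, and the exact signature that pgloader expects of its reformat functions is not shown here, so treat this as a plain Python function rather than a drop-in pgloader module.

```python
from datetime import datetime


def reformat_date(value):
    """Convert a DD/MM/YYYY string (assumed input format) to ISO YYYY-MM-DD.

    Blank input is returned as None so that it can be loaded as NULL.
    """
    value = value.strip()
    if not value:
        return None
    # strptime raises ValueError on malformed input, which a loader
    # could catch in order to reject the row instead of loading bad data.
    return datetime.strptime(value, "%d/%m/%Y").date().isoformat()


print(reformat_date("25/12/2010"))
```

A transformation like this cannot be expressed in a plain COPY, which is why a per-field hook in the loader is useful.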

There's more...

If you need to reload the table completely from scratch, then specify --truncate on the pgloader command line.

After loading, if we had load errors, then there will be some junk left in the PostgreSQL tables. It is not junk you can see, or that gives any semantic errors; think of it more like fragmentation. You should consider whether you need to run with --vacuum as an additional option, though this may make the load take much longer.

We need to be careful to avoid loading data twice. The only easy way of doing that is to make sure there is at least one unique index defined on every table that you load. A second load of the same data should then fail very quickly.
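The effect of a unique index on an accidental second load can be sketched as follows. This is only an illustration: it uses SQLite from the Python standard library as a stand-in for PostgreSQL so that it runs without a server, and the table, index, and function names are hypothetical.

```python
import sqlite3

# SQLite (in memory) stands in for PostgreSQL here; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, item TEXT)")
conn.execute("CREATE UNIQUE INDEX orders_order_id_idx ON orders (order_id)")

rows = [(1, "widget"), (2, "gadget")]


def load(connection, batch):
    """Insert a batch in one transaction; return False if it is rejected."""
    try:
        with connection:  # commits on success, rolls back on error
            connection.executemany("INSERT INTO orders VALUES (?, ?)", batch)
        return True
    except sqlite3.IntegrityError:
        return False


first = load(conn, rows)
second = load(conn, rows)  # same data again: the unique index rejects it
count = conn.execute("SELECT count(*) FROM orders").fetchone()[0]
print(first, second, count)
```

The second load fails on its first duplicate row and is rolled back, so the table still holds only the original rows.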

String handling can often be difficult, because of the presence of formatting or non-printable characters. In PostgreSQL 9.0 and earlier, the parameter standard_conforming_strings is off by default (from 9.1 onward the default is on). With it off, backslashes are assumed to be escape characters. Put another way, the string '\n' means linefeed, which can cause data to appear truncated. You'll need to set standard_conforming_strings = on, or you'll need to specify an escape character in the load-parameter file.
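The backslash problem described above can be sketched with a deliberately simplified model. The function `interpret_with_escapes` below is not PostgreSQL's parser: it is a hypothetical stand-in that handles only the \n, \t, and \\ escapes, and `protect_backslashes` is likewise a hypothetical helper showing one way to keep the data intact.

```python
def interpret_with_escapes(text):
    """Simplified model of backslash-escape processing.

    Handles only the newline, tab and backslash escapes; a real server
    recognizes more. Any other escaped character is taken literally.
    """
    escapes = {"n": "\n", "t": "\t", "\\": "\\"}
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            out.append(escapes.get(nxt, nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def protect_backslashes(text):
    """Double every backslash so it survives escape processing."""
    return text.replace("\\", "\\\\")


raw = r"C:\new\table"  # hypothetical input value containing backslashes
mangled = interpret_with_escapes(raw)
restored = interpret_with_escapes(protect_backslashes(raw))
print(repr(mangled))
print(restored == raw)
```

In this model the unprotected value gains a line break and a tab where the backslash sequences were, which is the kind of corruption that makes loaded data appear truncated.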

If you are reloading data that has been unloaded from PostgreSQL, then you may want to use the pg_restore utility instead. The pg_restore utility has an option to reload data in parallel, -j number_of_threads, though this is only possible if the dump was produced using the custom pg_dump format. Refer to the recipes in the Backup chapter for more details. This can be useful for reloading dumps, though it lacks almost all of the other pgloader features discussed here.
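A parallel restore of the kind described above could be driven from Python roughly as follows. This is a sketch under assumptions: the dump file name, the database name, and the job count are hypothetical, and the helper `pg_restore_command` only builds the command line rather than running it.

```python
import shlex


def pg_restore_command(dump_path, dbname, jobs):
    """Build an argv list for a parallel pg_restore of a custom-format dump."""
    if jobs < 1:
        raise ValueError("jobs must be at least 1")
    # -j sets the number of parallel jobs, -d names the target database.
    return ["pg_restore", "-j", str(jobs), "-d", dbname, dump_path]


cmd = pg_restore_command("mydb.dump", "mydb", 4)  # hypothetical names
print(" ".join(shlex.quote(part) for part in cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the argument list separately from executing it makes the command easy to inspect or log before it touches a real database.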