On Tablespaces, Large Objects and Replication

Exploring alternative storage strategy while not making compromises on the advantages of large objects (replication, dump consistency…)
par Georges Racinet, mis à jour le 11/12/2014

The problem: moving large objects to dedicated storage

Today I've been asked the question whether large objects (of which we are active proponents in Odoo context) could be set aside on a different partition (typically bigger and slower).

Since PostgreSQL has the notion of tablespace, precisely to dispatch physical storage, and large objects are stored in a regular table anyway, I said « sure why not, do you want me to test that explicitely ? »

Turns out it's a bit more complicated (on a 9.3.5 server):

withtablespace=# alter table pg_catalog.pg_largeobject set tablespace testgr;
ERROR:  permission denied: "pg_largeobject" is a system catalog

Actually, it's been thought of for a very long time, but relaxing the general protection of pg_catalog tables for the special case of large objects (it's a bit weird for this table to be in pg_catalog, if you ask me)

A solution

I'm assuming here that the cluster already exists, and that we aren't allowed to take it offline, and ever less so to move it entirely on the bigger drives.

As Tom Lane mentions in another thread, it's possible to have the whole of pg_catalog in the separate tablespace, but at the cost of setting it for the entire database:

postgres=# ALTER DATABASE withtablespace SET TABLESPACE testgr;

Then we can switch back the default tablespace used by the Odoo role to the pg_default tablespace:

postgres=# ALTER USER odoo SET default_tablespace TO pg_default;

Then one needs to give explicit permission to create objects in the tablespace, because pg_default is not the default tablespace of the database:

postgres=# GRANT CREATE ON TABLESPACE pg_default TO odoo;

Note that this does not mean that the user odoo would be allowed to create objects in pg_default inconditionnaly. A user having CONNECT grant to a database can create tables anyway.

It's better to do all this before the creation of the tables, otherwise you'll have to switch them back to pg_default one after the other, (from PG 9.4 onwards, ALTER TABLE ALL will make this less painful).

Note

this has the drawback of having the sequences also on the partition meant for the large objects, as well as the whole pg_catalog schema, but I don't think this matters much in terms of performance.

Checking that it works

So let's see how this behaves:

withtablespace=> create table regtable (id serial, a text NOT NULL);
CREATE TABLE
withtablespace=> create index regtable_idx on regtable(a);
CREATE INDEX

withtablespace=> select relname, reltablespace from pg_class where relname like '%regtable%';
     relname     | reltablespace
-----------------+---------------
 regtable_id_seq |             0
 regtable        |          1663
 regtable_idx    |          1663
(3 rows)

In the latter output, reltablespace=0 means that it's the default tablespace, i.e, the one we selected to put the large objects in. And 1663 is nothing but pg_default:

withtablespace=> select oid, spcname from pg_tablespace;
   oid   |  spcname
---------+------------
    1664 | pg_global
 2209893 | testgr
    1663 | pg_default
(3 rows)

So far, so good, now let's see where the pg_largeobject table lies:

withtablespace=> select relname, reltablespace from pg_class where relname like '%pg_largeobject%';
              relname              | reltablespace
-----------------------------------+---------------
 pg_largeobject_metadata_oid_index |             0
 pg_largeobject_loid_pn_index      |             0
 pg_largeobject_metadata           |             0
 pg_largeobject                    |             0
(4 rows)

Let's now create a large object, and check where it's been stored.

We'll do it from the pyscopg API, because server-side functions for large objects are restricted to the superuser and we want to check what happens with the Odoo user.

>>> import psycopg2
>>> conn = psycopg2.connect('dbname=withtablespace')
>>> handle = conn.lobject(0)
>>> handle.oid
2214429
>>> for i in xrange(10000): _ = handle.write("ten bytes ")
>>> handle.close()
>>> conn.commit()

So now let's see where the storage is:

$ sudo du -hs  /tmp/testtblspc/PG_9.3_201306121/2212668
505M  /tmp/testtblspc/PG_9.3_201306121/2212668

$ sudo du -hs /var/lib/postgresql/9.3/main/base/2212668
20K   /var/lib/postgresql/9.3/main/base/2212668

By the way, 2212668 is the simply the OID of the database:

postgres=> select oid from pg_database where datname='withtablespace';
   oid
---------
 2212668
(1 row)

A lower level hack

If you can afford to stop the cluster, then you may apply a sequence of commands in single-used mode to move just the large objects.

Here's the source, from the brazilian mailing-list

But I don't know how it'll behave with replication. I suppose one may need to reconstruct the replicate afresh.

Replication

Tablespace definitions are replicated, but beware that the directory exist on all replicates, otherwise the wole replicate server will crash !.

Seen on a 9.3.5 hot standby after a CREATE TABLESPACE on the master:

2014-12-11 15:11:32 GMT [] FATAL:  directory "/tmp/testtblspc" does not exist
2014-12-11 15:11:32 GMT [] HINT:  Create this directory for the tablespace before restarting the server.
2014-12-11 15:11:32 GMT [] CONTEXT:  xlog redo create tablespace: 2209893 "/tmp/testtblspc"
2014-12-11 15:11:32 GMT [] LOG:  startup process (PID 1703) exited with exit code 1
2014-12-11 15:11:32 GMT [] LOG:  terminating any other active server processes

(oops)

Actually, the whole session above had been done with replication active, and I get the exact same figures on filesystem footprints on the replicate. It works.

Dump/restore

How does this behave if I dump and restore the database ? Here's the answer for a pure SQL dump:

$ pg_dump withtablespace -f wts.sql
$ grep -i TABLESPACE wts.sql
SET default_tablespace = pg_default;
-- Name: regtable; Type: TABLE; Schema: public; Owner: odoo; Tablespace: pg_default
-- Name: regtable_idx; Type: INDEX; Schema: public; Owner: odoo; Tablespace: pg_default

Unsurprisingly, all the commands that I executed in the above SQL sessions are dumped (the SET being probably just the default parameter for this role), so, to restore, we shall be careful and recreate the database with the TABLESPACE parameter set, or simply decide that all databases on the system should have this parameter by setting it on template1.

Let's try a "custom format dump":

$ pg_dump -Fc withtablespace -f wts.dump
$ psql -c "CREATE DATABASE restoretblspc TABLESPACE testgr"

$  psql -c "CREATE DATABASE restoretblspc TABLESPACE testgr" postgres
CREATE DATABASE
$ pg_restore -d restoretblspc wts.dump
(...) usual warnings

$ psql -q restoretblspc
restoretblspc=> \dS regtable;
                         Table "public.regtable"
 Column |  Type   |                       Modifiers
--------+---------+-------------------------------------------------------
 id     | integer | not null default nextval('regtable_id_seq'::regclass)
 a      | text    | not null
Indexes:
    "regtable_idx" btree (a), tablespace "pg_default"
Tablespace: "pg_default"

Now I have a hard time believing the large objects are in, because the dump is very small, but it's just the compression being efficient:

>>> import psycopg2
>>> conn = psycopg2.connect('dbname=restoretblspc')
>>> handle = conn.lobject(2214429)
>>> handle.read(100)
'ten bytes ten bytes ten bytes ten bytes ten bytes ten bytes ten bytes ten bytes ten bytes ten bytes '

One can also notice how much garbage collecting the dump/restore does:

$ sudo du -hs /tmp/testtblspc/PG_9.3_201306121/2215425
6.7M  /tmp/testtblspc/PG_9.3_201306121/2215425

PostgreSQL elephant logo under BSD licence (By Jeff MacDonald (http://pgfoundry.org/docman/?group_id=1000089), via Wikimedia Commons