Migration to PostgreSQL

Why
PostgreSQL (PG) has much better geospatial support.

Concept
All of our wikis have to use the same database (DB), because PostgreSQL does not allow JOINs between tables in different DBs. However, PG has an elaborate schema concept that can be used to divide the big DB into distinct namespaces. Thus, every wiki will store its tables in a schema of its own within the same DB.
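For illustration, this is what a cross-schema JOIN might look like once everything below is set up (wikione is a hypothetical wiki schema used only for this example; zsamm is the shared schema introduced below):

SELECT r.rev_id, u.user_name
  FROM wikione.revision r                        -- a wiki's own schema
  JOIN zsamm.mwuser u ON u.user_id = r.rev_user; -- the shared schema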

In order to provide the single sign-on feature, we need a commonly used schema that stores only the mwuser table.

The DB and all schemas and tables in it are owned by the DB user wikiadmin, which is a purely administrative account and is not used by any wiki.

wikiadmin and all DB wiki users are members of the role (group) wikigroup.

One database for all wikis
First of all, we need to create the wikiadmin user and the group wikigroup of which all wiki users are members. Log in as DB user postgres and run:

wikis=# CREATE ROLE wikiadmin WITH CREATEROLE LOGIN PASSWORD 'secret';
wikis=# CREATE ROLE wikigroup WITH NOLOGIN ADMIN wikiadmin;

or, alternatively, run on the shell:

root# su postgres
postgres:$ cd ~
postgres:~$ createuser wikiadmin -l -W
postgres:~$ createuser wikigroup -L

The first way is preferred since wikiadmin becomes admin of the wikigroup and is allowed to grant membership to other users. It also has the privilege to create new DB users, which makes it quite a powerful account.

The DB should have the PostGIS enhancement enabled. Make a DB template as described in the PostGIS documentation and create the DB from the shell like this:

root# su postgres
postgres:$ cd ~
postgres:~$ createdb wikis -O wikiadmin -T template_gis
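If you do not have such a template yet, its creation follows the PostGIS 1.x documentation and looks roughly like this (a sketch only; the .sql file locations vary by distribution and PostGIS version):

root# su postgres
postgres:~$ createdb template_gis
postgres:~$ createlang plpgsql template_gis
postgres:~$ psql -d template_gis -f lwpostgis.sql
postgres:~$ psql -d template_gis -f spatial_ref_sys.sql
postgres:~$ psql -c "UPDATE pg_database SET datistemplate='true' WHERE datname='template_gis'"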

This creates the DB named wikis with the owner wikiadmin. It uses the previously created template template_gis to obtain the PostGIS functionality.

TSearch2
Make sure tsearch2 is installed on the system. On Debian, it is in the package postgresql-contrib-8.1.
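On Debian, installing it boils down to (assuming the 8.1 packages used throughout this page):

root# apt-get install postgresql-contrib-8.1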

Run this on the shell as user postgres (wikiadmin would perhaps be the better option, but it would then need additional privileges):

postgres:~$ psql wikis < /usr/share/postgresql/8.1/contrib/tsearch2.sql

Now TSearch2 is actually ready to use on the DB wikis. However, the default locale is set to the system default C. It is probably a good idea to update the table pg_ts_cfg and set the default to en_US.UTF-8 or whatever you need:

wikis=# UPDATE pg_ts_cfg SET locale='en_US.UTF-8' WHERE ts_name='default';
wikis=# SELECT * FROM pg_ts_cfg;
     ts_name     | prs_name |    locale
-----------------+----------+--------------
 default_russian | default  | ru_RU.KOI8-R
 simple          | default  |
 default         | default  | en_US.UTF-8

We also need to make the new tables pg_ts_cfg, pg_ts_cfgmap, pg_ts_dict and pg_ts_parser SELECTable for the wiki DB users. Since all wiki users will become members of the group wikigroup, we only need to GRANT the SELECT privilege to this group:

wikis=# GRANT SELECT ON pg_ts_cfg, pg_ts_cfgmap, pg_ts_dict, pg_ts_parser TO wikigroup;

PG versions below 8.2 use a different search ranking nomenclature. MW 1.11 and above knows about this and handles it correctly. For MW versions prior to 1.11, replace the 5 by 1 in the following line in includes/SearchPostgres.php:

"rank($fulltext, to_tsquery('default',$searchstring),5) AS score "

Single sign-on
There is one unique mwuser table used by all wikis to provide the single sign-on feature. This table needs to be stored in a schema of its own, which we call zsamm (Bavarian for "together" or "common"). Log into the DB wikis as wikiadmin and run:

wikis=> CREATE SCHEMA zsamm;

In order to initialize the mwuser table in schema zsamm, we have to extract the appropriate parts from maintenance/postgres/tables.sql. The result might look like this:

BEGIN;
SET client_min_messages = 'ERROR';

CREATE SEQUENCE user_user_id_seq MINVALUE 0 START WITH 0;
CREATE TABLE mwuser ( -- replace reserved word 'user'
  user_id                   INTEGER  NOT NULL  PRIMARY KEY DEFAULT nextval('user_user_id_seq'),
  user_name                 TEXT     NOT NULL  UNIQUE,
  user_real_name            TEXT,
  user_password             TEXT,
  user_newpassword          TEXT,
  user_newpass_time         TIMESTAMPTZ,
  user_token                CHAR(32),
  user_email                TEXT,
  user_email_token          CHAR(32),
  user_email_token_expires  TIMESTAMPTZ,
  user_email_authenticated  TIMESTAMPTZ,
  user_options              TEXT,
  user_touched              TIMESTAMPTZ,
  user_registration         TIMESTAMPTZ,
  user_editcount            INTEGER
);
CREATE INDEX user_email_token_idx ON mwuser (user_email_token);

-- Create a dummy user to satisfy fk constraints especially with revisions
INSERT INTO mwuser
  VALUES (DEFAULT,'Anonymous','',NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,now(),now());

For MW version 1.10.0, you can download the file tables-zsamm.sql from our svn repository. Assuming the above commands are stored in the file tables-zsamm.sql, we can create the mwuser table:

wikis=> SET search_path TO zsamm;
wikis=> \i tables-zsamm.sql
wikis=> COMMIT;

Creating a wiki schema
Run this after having logged into the DB as wikiadmin:

wikis=> CREATE SCHEMA wikischema;

where wikischema is the schema used for a certain wiki.

All the rest of maintenance/postgres/tables.sql that does not create the mwuser table must be extracted and stored in tables-wiki.sql. Additionally, all REFERENCES that point to the table mwuser must be changed to zsamm.mwuser, as sketched below. Also, we omit the SQL commands marked as "TSearch2 stuff", since triggers that update the search vectors make no sense during the import and would only slow it down. The TSearch2-related stuff will be added once the import has completed.
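For example, a foreign key clause referencing mwuser, such as rev_user in the revision table, changes only in its schema qualification (a sketch; take the exact column definitions from your tables.sql):

-- in maintenance/postgres/tables.sql:
rev_user  INTEGER  NOT NULL  REFERENCES mwuser(user_id) ON DELETE RESTRICT,
-- becomes, in tables-wiki.sql:
rev_user  INTEGER  NOT NULL  REFERENCES zsamm.mwuser(user_id) ON DELETE RESTRICT,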

For MW version 1.10.0, a ready-to-use tables-wiki.sql is available from our svn repository.

Set wikiadmin's search path temporarily to the wiki's schema and create the basic table layout for MediaWiki in this schema:

wikis=> SET search_path TO wikischema, public;
wikis=> SHOW search_path;
    search_path
--------------------
 wikischema, public
(1 row)

wikis=> \i tables-wiki.sql
wikis=> \i tables-zsamm.sql

The creation of the mwuser table in the wikischema only serves to provide an empty dummy table for MW's update.php script, which insists on finding the table in the schema defined in $wgDBmwschema.

Who owns what
The DB should be owned completely by the general wikiadmin user, which is not used by any wiki. It is a purely administrative account, used only by humans (or manually started maintenance scripts).

Each wiki will use its own DB user. For security reasons, this user should probably not own the schema that holds the wiki's tables. The user only needs the SELECT, INSERT, UPDATE and DELETE privileges on all tables in the schema.

The short script grantToWikiuser CREATEs a new DB user for the wiki, sets the appropriate search path and GRANTs all necessary privileges on tables and sequences. Run it against your DB as follows:

wikiadmin$ ./grantToWikiuser wikischema commonschema wikiuser wupassword | psql wikis

where wikischema, commonschema, wikiuser and wupassword are to be substituted by the wiki's schema to create, the common schema that holds mwuser, the new DB user for the wiki and his password, respectively.

This should be all, and the rest of this section may be skipped if everything works fine. For future versions of MW, it might be necessary to add the names of newly introduced tables or sequences to the corresponding commands in the script. However, should you run into trouble, do everything manually as explained below.

First of all, let's create the new user for the wiki. Log in as DB user postgres and run:

wikis=# CREATE ROLE wikiuser LOGIN PASSWORD 'secret' IN ROLE wikigroup;

Unfortunately, there seems to be no way to GRANT table privileges with wildcards. Thus, we need to list all tables in the schema when GRANTing. Get a list of all tables in schema wikischema like this:

wikis=> \t
Showing only tuples.
wikis=> \o tempfile
wikis=> SELECT tablename || ',' FROM pg_tables WHERE schemaname='wikischema';
wikis=> \o
wikis=> \t
Tuples only is off.

Edit the file tempfile and copy&paste the list into the GRANT command:

wikis=> SET search_path TO zsamm, wikischema, public;
wikis=> GRANT SELECT, UPDATE, INSERT, DELETE ON archive, categorylinks, ... TO wikiuser;
wikis=> GRANT SELECT, UPDATE ON filearchive_fa_id_seq, ipblocks_ipb_id_val, ... TO wikiuser;

The last line GRANTs privileges on sequences. Note that zsamm is the first schema in the search path; thus, we GRANT the privileges on the mwuser table in schema zsamm, not in schema wikischema. In fact, wikischema.mwuser has become obsolete and could be deleted. We still have to allow the new user to use his schemas at all:

wikis=> GRANT USAGE ON SCHEMA wikischema TO wikiuser;
wikis=> GRANT USAGE ON SCHEMA zsamm TO wikigroup;

The last command only needs to be invoked once, at initial setup.

Now, how do we tell the wiki's DB user that the mwuser table has to be looked up in another schema? This is very PG specific: set the wiki user's search path to "zsamm,wikischema,public", in exactly this order. When looking for the mwuser table, PG will find the one in the zsamm schema and will not search the subsequent schemas any more. Run these SQL commands to set the search path permanently:

wikis=# ALTER ROLE wikiuser SET search_path TO 'zsamm,wikischema,public';
wikis=# ALTER ROLE wikiuser SET timezone TO "GMT";

The second command sets the user's time zone to GMT, which is very important for MediaWiki timestamps, since MW expects the PostgreSQL timestamp format in GMT without any verification.

Importing the Data from MySQL
Back up your MySQL DB first! The following procedure will modify your MySQL DB and may break it if something goes wrong.

If you use the experimental single sign-on feature, it is not necessary to copy the user table from your single sign-on DB into the wiki's DB.

Cleaning up the MySQL DB
Since MySQL does not enforce foreign keys, there will be some inconsistencies that PG will complain about. So, better clean up the MySQL DB before dumping it. (Have you really backed up?)

Copy weedforFK.php and weedforFK.inc into the maintenance folder of the wiki running on MySQL. First, investigate your tables by running:

wikiadmin$ php weedforFK.php --check

It checks all tables that are involved in foreign keys in the corresponding PG version. If everything is OK, you are done.

For more information, add the option --verbose. For weeding out unused rows that violate PostgreSQL's foreign key constraints, run:

wikiadmin$ php weedforFK.php --san --verbose > analysefile

There is a known bug with respect to the table user_newtalk. Hopefully you won't need to weed it.

Use these queries for manual investigation of inconsistencies, if necessary:

SELECT DISTINCT ug_user FROM user_groups LEFT JOIN user ON user_id=ug_user WHERE user_id IS NULL;
SELECT DISTINCT user_newtalk.user_id FROM user_newtalk LEFT JOIN user ON user_newtalk.user_id=user.user_id WHERE user.user_id IS NULL;
SELECT rev_id FROM revision LEFT JOIN page ON page_id=rev_page WHERE page_id IS NULL;
SELECT rev_id, rev_user, rev_user_text FROM revision LEFT JOIN user ON user_id=rev_user WHERE user_id IS NULL AND rev_user != 0;
SELECT pr_id FROM page_restrictions LEFT JOIN page ON page_id=pr_page WHERE page_id IS NULL;
SELECT ar_user, ar_namespace, ar_title FROM archive LEFT JOIN user ON user_id=ar_user WHERE user_id IS NULL AND ar_user != 0;
SELECT rd_from FROM redirect LEFT JOIN page ON page_id=rd_from WHERE page_id IS NULL;
SELECT pl_from FROM pagelinks LEFT JOIN page ON page_id=pl_from WHERE page_id IS NULL;
SELECT tl_from, tl_namespace, tl_title FROM templatelinks LEFT JOIN page ON page_id=tl_from WHERE page_id IS NULL;
SELECT il_from, il_to FROM imagelinks LEFT JOIN page ON page_id=il_from WHERE page_id IS NULL;
SELECT cl_from, cl_to FROM categorylinks LEFT JOIN page ON page_id=cl_from WHERE page_id IS NULL;
SELECT el_from, el_to FROM externallinks LEFT JOIN page ON page_id=el_from WHERE page_id IS NULL;
SELECT ll_from, ll_lang FROM langlinks LEFT JOIN page ON page_id=ll_from WHERE page_id IS NULL;
SELECT ipb_id FROM ipblocks LEFT JOIN user ON user_id=ipb_by WHERE user_id IS NULL;
SELECT wl_user FROM watchlist LEFT JOIN user ON user_id=wl_user WHERE user_id IS NULL;
SELECT tb_id FROM trackbacks LEFT JOIN page ON page_id=tb_page WHERE page_id IS NULL;
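If one of these checks turns up orphaned rows, the corresponding manual cleanup can use MySQL's multi-table DELETE syntax. A sketch for the first query only (weedforFK.php is the supported way; this merely illustrates the idea):

-- remove group memberships whose user no longer exists
DELETE user_groups FROM user_groups
  LEFT JOIN user ON user_id=ug_user
  WHERE user_id IS NULL;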

Dumping the encoding chaos
I am not sure whether I had misconfigured the MySQL DB, but the encoding in our MySQL dumps looks somewhat strange.

In most tables, we have a quite strange encoding in text fields. The DB encoding is set to latin1, which means cp1252 in MySQL nomenclature. UTF-8 strings have been interpreted as (not converted into!) cp1252 strings before being stored. The MediaWiki software knows that such a pseudo-cp1252 string is actually UTF-8, so it is displayed correctly in the wikis. The DB dumps, however, do not know this. By default, mysqldump produces UTF-8 encoded output. Thus, all pseudo-cp1252 strings are regarded as real cp1252 strings and are converted into (not interpreted as!) UTF-8.
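To see what this double interpretation does to a single character, the following query reproduces it (for illustration only; convert_from/convert_to require PostgreSQL 8.3 or later):

-- 'ü' is the UTF-8 byte sequence 0xC3 0xBC. Read as cp1252, these two
-- bytes are the characters 'Ã' and '¼', which a re-encoding pass then
-- turns into the four UTF-8 bytes 0xC3 0x83 0xC2 0xBC.
SELECT convert_from(convert_to('ü', 'UTF8'), 'WIN1252');  -- returns 'Ã¼'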

What makes things even worse: real UTF-8 and pseudo-cp1252 converted into UTF-8 are used simultaneously in different fields of the same table, obviously depending on their MySQL type.

The maintenance script maintenance/postgres/mediawiki_mysql2postgres.pl shipped with MediaWiki relies on properly encoded UTF-8 database dumps. It looks quite elegant, but it does not work with oddly encoded dumps.

There are basically two approaches: either fix the encoding of the MySQL DB dumps with a script, or write a dump script from scratch. If the wiki is not too large, the latter seems to be the easier way: let MediaWiki's DB interface deal with its own strange encoding and dump the tables in PG format to stdout.

The PHP script mkPGdump.php uses the basic ideas of the data import from mediawiki_mysql2postgres.pl but does its own dumping instead of working with mysqldump's output. It might be useful for medium-sized wikis. Copy it into your maintenance folder and make sure that AdminSettings.php is configured correctly. Adapt the control variables in the head of the script to your needs. Once done, invoke it like this:

wikiadmin$ php mkPGdump.php > dumpfile

Depending on the wiki's size, the dump script will run for anything from a few minutes up to hours. Beware of the comparatively high memory usage!

Since MySQL actually sees latin1 instead of UTF-8, it cannot do any sanity checks on UTF-8 characters. I did find some malformed multi-byte characters in very rare cases; still enough for PG to abort its data import. The simple Perl script saneUTF8 checks for malformed UTF-8 characters and removes byte after byte until the sequence is valid UTF-8 again. It is used as an import filter like this:

wikiadmin$ ./saneUTF8 < dumpfile | psql wikis

If you only have a small wiki or a lot of memory, you can also concatenate all the above commands into this single one:

wikiadmin$ php mkPGdump.php | ./saneUTF8 | psql wikis

Completing TSearch2 features
Now it's time to add the previously skipped TSearch2 features. Two tables are affected, page and pagecontent. I'm not so happy with the layout shipped with MW 1.11 and before: it seems impossible to UPDATE the titlevector or the textvector, respectively, with a NULL value, since the trigger function always replaces it with to_tsvector's output. However, it does not make any sense to index old revisions. Thus, I have slightly modified the column definitions for titlevector and textvector and the corresponding trigger functions. See tsearch2triggers.sql in our repository.

If you use PostgreSQL version 8.2 or above, switch "gist" to "gin" in the two commands that create the indexes ts2_page_title and ts2_page_text, as explained in the comment above these lines.

wikis=> SET search_path TO wikischema, public;
wikis=> \i tsearch2triggers.sql

TSearch2 is based on the reduction of text to lexemes. This is achieved by different approaches; the most important are stemmer algorithms and dictionaries. You need to create a suitable configuration for every language you want to search in. See http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ for more details. Assuming you have created a new configuration for the language of the new wiki, we have to find a way to make MW use it. Probably the most robust way is to hide the two functions to_tsvector(text,text) and to_tsquery(text,text) in the public schema by defining your own simple wrapper functions in the wikischema:

CREATE FUNCTION to_tsvector( l text, t text ) RETURNS tsvector LANGUAGE plpgsql AS $mw$
BEGIN
  RETURN public.to_tsvector( 'default_german', t );
END;
$mw$;

CREATE FUNCTION to_tsquery( l text, t text ) RETURNS tsquery LANGUAGE plpgsql AS $mw$
BEGIN
  RETURN public.to_tsquery( 'default_german', t );
END;
$mw$;

Of course, you have to replace "default_german" by the configuration appropriate to the wiki.

This relies on the hope that MW always calls these two functions with two arguments. There also exist one-argument versions that use the "default" TSearch2 configuration. If you want to be on the safe side, write wrappers for them as well, as sketched below.
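A minimal sketch of such one-argument wrappers, built exactly like the two-argument ones above (again, replace default_german by your configuration):

CREATE FUNCTION to_tsvector( t text ) RETURNS tsvector LANGUAGE plpgsql AS $mw$
BEGIN
  -- delegate to the public schema with the wiki's configuration hard-wired
  RETURN public.to_tsvector( 'default_german', t );
END;
$mw$;

CREATE FUNCTION to_tsquery( t text ) RETURNS tsquery LANGUAGE plpgsql AS $mw$
BEGIN
  RETURN public.to_tsquery( 'default_german', t );
END;
$mw$;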

Now you have to fill the search vectors behind the newly created indexes ts2_page_text and ts2_page_title. In order to update your indices and REINDEX them, run this as wikiadmin:

wikis=> UPDATE pagecontent SET textvector = NULL;
wikis=> UPDATE page SET titlevector = NULL;
wikis=> UPDATE pagecontent SET textvector = to_tsvector('default',old_text) WHERE old_id IN (SELECT rev_text_id FROM revision, page WHERE rev_id=page_latest AND page_is_redirect=0);
wikis=> UPDATE page SET titlevector = to_tsvector('default',page_title) WHERE page_is_redirect=0;
wikis=> REINDEX INDEX ts2_page_text;
wikis=> REINDEX INDEX ts2_page_title;

Even after the wiki has been running for a while: if you ever feel that your searches run very slowly, you probably have to clean up the TSearch2 indexes ts2_page_text and ts2_page_title again. Due to MW bugs, it is possible that old page revisions remain indexed by ts2_page_text. In fact, we only need to index the current revision of each page.
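A sketch of such a cleanup, combining the UPDATE statements from above (it relies on the modified triggers accepting NULL vectors):

wikis=> UPDATE pagecontent SET textvector = NULL WHERE old_id NOT IN (SELECT rev_text_id FROM revision, page WHERE rev_id=page_latest);
wikis=> REINDEX INDEX ts2_page_text;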