Database dumps

Providing database dumps to the public is one of our important goals. Several kinds of useful dumps are possible, and we have to find both general and technical solutions.

Respecting our users' privacy
Some personal data stored in the database must not be published. Most important are the users' e-mail addresses and perhaps the users' full real names. However, the full name is displayed in the footer of every page the user has contributed to, and also in the corresponding RDF meta information. Thus, the real name is in fact no longer a secret.

There may also be more subtle information that could be extracted from the database with data mining techniques. This seems to be the most difficult question.

MySQL dumps
SQL dumps are no longer available. We have switched to XML dumps.

XML dumps
Generating XML dumps needs somewhat more processor power, but it remains within the limits of our resources. However, it would probably not make much sense to export all "articles" in the database, since even things like categories or MediaWiki messages count as articles. We would first have to select those articles that might be of interest. For the user of the dumps, it would not be easy to reassemble the XML dumps and the images into a working wiki. Reimporting images is somewhat tricky.

On the other hand, we could be quite sure that the dumps contain no personal information other than what normal users can already get from our web pages.

XML dumps are now available at http://dumps.wikivoyage.org/XML-dumps/.
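
The selection of "articles that might be of interest" mentioned above could, for instance, be done by page title. The following is only a minimal sketch: the prefix list and the function names are assumptions made for illustration and are not taken from our actual export setup.

```python
# Hypothetical namespace prefixes to leave out of an export.
# Adjust the list to the namespaces the wiki actually uses.
EXCLUDED_PREFIXES = ("Category:", "MediaWiki:")


def is_of_interest(title, excluded_prefixes=EXCLUDED_PREFIXES):
    """Return True if the title does not start with an excluded prefix."""
    return not title.startswith(tuple(excluded_prefixes))


def select_titles(titles, excluded_prefixes=EXCLUDED_PREFIXES):
    """Keep only the titles worth exporting, preserving their order."""
    return [t for t in titles if is_of_interest(t, excluded_prefixes)]
```

Called with a list such as "Berlin", "Category:Germany" and "MediaWiki:Sidebar", `select_titles` would keep only "Berlin". Localized namespace names (for example on the German branch) would need their own prefixes.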

HTML dumps
This would be a way to provide data for off-line browsing or for people who intend to set up a mirror. It would probably be enough to provide only the current versions rather than the full article history.

We could adapt the 'edit' links so that they link back to Wikivoyage.

HTML dumps are now available for the German branch.

Images
This is simple and already automated for backup purposes. All image files are stored in the file system (not in the database), so we just pack them into an archive and are done.

Size: 450MB
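
Packing the image directory could look roughly like the sketch below. It is not our actual backup script. The function name and both paths are assumptions, and the archive should be written outside the image directory so that it does not try to include itself.

```python
import tarfile
from pathlib import Path


def pack_images(image_dir, archive_path):
    """Pack everything below image_dir into a gzip-compressed tar archive."""
    image_dir = Path(image_dir)
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname stores paths relative to the directory's own name,
        # so no absolute server paths end up in the archive.
        archive.add(image_dir, arcname=image_dir.name)
    return archive_path
```

A call such as `pack_images("/path/to/images", "/path/to/images.tar.gz")` (both paths hypothetical) would produce a single downloadable file.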

Comments
I suppose it would be safe enough to make MySQL dumps public. This would be my favourite way of publishing dumps because it is the easiest for us. Later, we also could think about HTML dumps. -- Hansm 18:41, 12 December 2006 (CET)


 * For sql-dumps, all tables except: (user), user_newtalk, objectcache, querycache, searchindex, watchlist, querycache_info -- Hansm 17:08, 23 December 2006 (CET)


 * For XML, there is a maintenance/importDump.php to import the text; there's also an importImages but it has the limitation of the re-imported images appearing to have been uploaded under username "Image Import script". XML does allow generation of --current (last revision only for each page) or --full (entire revision history) from dumpBackup.php. The existence of category descriptions and MediaWiki messages may not be an issue; yes, there are potentially a thousand MediaWiki messages in the database, but if they're just one line each, nobody cares. It would likely be possible to modify the dumpBackup script to exclude those namespaces (#8 and #14 IIRC) if needed, but why bother? XML is used in Wikipedia's dumps and was used in Wikia's dumps the last time they were still being run (July 2?). It seems to work well enough, if an archive of images is available as a separate download. --Carlb 22:08, 18 August 2007 (CEST)


 * Thank you for your comment. Hansm is currently on tour. As our Techbee, he will certainly have a look after he returns. --Der Reisende 12:05, 20 August 2007 (CEST)
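
The two command-line approaches discussed in the comments above can be sketched as follows. This is only an illustration: the helper names and the database name are hypothetical, and whether the "user" table (given in parentheses above) should really be excluded is an assumption. The sketch only builds the argument lists and does not run anything.

```python
# Tables left out of a public SQL dump, as listed in the comment above.
# "user" appears in parentheses there, so including it is an assumption.
EXCLUDED_TABLES = [
    "user", "user_newtalk", "objectcache", "querycache",
    "searchindex", "watchlist", "querycache_info",
]


def mysqldump_command(database, excluded_tables=EXCLUDED_TABLES):
    """Build a mysqldump argument list that skips the excluded tables."""
    command = ["mysqldump"]
    for table in excluded_tables:
        # --ignore-table expects the table qualified with its database name.
        command.append(f"--ignore-table={database}.{table}")
    command.append(database)
    return command


def dump_backup_command(full_history=False):
    """Build a dumpBackup.php call using --current or --full."""
    mode = "--full" if full_history else "--current"
    return ["php", "maintenance/dumpBackup.php", mode]
```

For example, `mysqldump_command("wikidb")` (database name hypothetical) yields a list that could be passed to `subprocess.run`, with the dump redirected to a file. If the wiki uses a table prefix, the table names would have to be adjusted accordingly.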