Chapter 21: Importing and exporting

There are two ways of using one system's data in another: you can either query system A's data from within system B, or you can import the data from A into B. The right answer depends on where you want the data to "live" in the future. Ideally, no piece of data should be stored in more than one place, because data redundancy can quickly lead to inconsistent data: the data can get modified anywhere it resides. So if you import data from one system into another, presumably the data will be removed from the old system at the same time.

(There are exceptions to that rule, and one extension, “Push”, lets you dynamically move content between two wikis; we’ll get to that later.)

The last chapter covered both the querying of data within MediaWiki and the querying of MediaWiki's data from other systems. This chapter instead focuses on actually moving content into, out of, or between wikis. There are essentially three kinds of possible moves: moving contents between two MediaWiki wikis; moving contents from another system, wiki or otherwise, into MediaWiki; and moving contents from MediaWiki into another system. Let's go through these one at a time.

Moving content between two MediaWiki systems

This is conceptually the simplest of the three kinds of moves. If you're just trying to move a wiki to a different location, then the best approach is to archive all the MediaWiki files, PHP and otherwise (archiving from the top-level directory will get them all), do a dump of the wiki's database, and then un-archive both in the new location. Give or take a few modifications to LocalSettings.php, everything should just work.
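Here is a minimal sketch of that process, assuming a MySQL backend, a wiki installed at /var/www/wiki, and a database named wikidb; all of those names are placeholders for your own setup:

# on the old server: archive the MediaWiki directory and dump the database
tar -czf wiki-files.tar.gz /var/www/wiki
mysqldump -u wikiuser -p wikidb > wiki-db.sql

# on the new server: unpack the files and load the dump
# (create the wikidb database first if it doesn't already exist)
tar -xzf wiki-files.tar.gz -C /var/www
mysql -u wikiuser -p wikidb < wiki-db.sql

Afterwards, update LocalSettings.php as needed: typically $wgServer and the database connection settings.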

If, however, you’re trying to move contents from one wiki to an existing one (i.e., merging the two), it’s more complicated. Thankfully, MediaWiki provides two great special pages that can be used for this purpose: Special:Export and Special:Import.

Special:Export (see Figure 21.1) lets you generate an XML file that holds the contents from any number of pages in the wiki. The XML is in a custom format defined by MediaWiki. Special:Export unfortunately doesn’t provide an automatic way to export all pages at once. Instead, you assemble a list of pages to be exported; the page lets you add whole categories and namespaces to the list. Once you’ve assembled the full list, you can choose whether to export just the most recent revision of each page, or all of them. The latter is certainly recommended when exporting from one MediaWiki wiki to another.

Figure 21.1 Special:Export page
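The exported XML looks roughly like the following. This is a trimmed sketch: a real file begins with a <siteinfo> section, the root tag carries schema and version attributes, and each revision includes more metadata than is shown here.

<mediawiki xml:lang="en">
  <page>
    <title>Main Page</title>
    <revision>
      <timestamp>2013-01-01T00:00:00Z</timestamp>
      <contributor>
        <username>Admin</username>
      </contributor>
      <text xml:space="preserve">Welcome to the wiki!</text>
    </revision>
  </page>
  ...
</mediawiki>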

Once you have the XML file generated and downloaded, it's time for the second step, which, as you might have guessed, is to use Special:Import on the other wiki. That page is simpler: there, you just upload the XML file, and it generates all the relevant pages. See Figure 21.2 for the Special:Import interface.

Figure 21.2 Special:Import page

In order to access Special:Import, you need to have the 'import' permission, which by default is given to users in the "sysop" (administrator) group.

For large sets of pages, using Special:Import might not work; the page might time out. In that case, you can instead use an import script to take in the XML file. There are a variety of scripts, each with its own strengths and weaknesses; you can see the full list here:

http://meta.wikimedia.org/wiki/Data_dumps#Tools
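One commonly used option is importDump.php, the command-line importer that ships with MediaWiki itself in the /maintenance directory. A typical run looks like the following (the dump file name is just a placeholder):

# import an XML dump generated by Special:Export
php maintenance/importDump.php /path/to/export.xml
# then rebuild the recent-changes data, which the importer leaves stale
php maintenance/rebuildrecentchanges.php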

There's the issue of what to do with pages that have the same name in both the source and destination wikis; if both wikis are in English, for instance, there's a good chance that both will have a page called "Main Page". That's a question that has to be decided on a case-by-case basis; there's no mechanism for merging two wiki pages into one.

The use of Special:Export and Special:Import takes care of wiki pages, but that still leaves the transfer of uploaded files, the wiki's users, and the wiki's settings.

To import uploaded files, there's a very helpful script in MediaWiki called "importImages.php", in the /maintenance directory, that takes in a folder of files and imports them into the wiki. You can use this script to import files (images and otherwise) both from another MediaWiki installation and from other sources (which we'll get to in the next section). To import uploaded files from one MediaWiki instance to another, you just need to get the /images folder from the "source" wiki onto the same server as the "destination" wiki, if it's not there already. (It may help to archive the folder first, into a zip file or the like.) The script is then called in the following way:

php maintenance/importImages.php /path/to/images/directory png jpg gif bmp PNG JPG GIF BMP

The group of file extensions at the end is necessary because every type of file that you want imported has to be specified manually. There are various other options you can call importImages.php with; you can see a full description here:

https://www.mediawiki.org/wiki/Manual:ImportImages.php

Unfortunately, this script will not import old revisions of files, just the most recent ones.

Note that this is a separate process from importing the wiki pages for those files, i.e. pages in the "File:" namespace. If you have any special content in those pages, you should import them using the standard Special:Export/Special:Import process.

That leaves the issue of importing users, but that, unfortunately, seems to be impossible. You will simply have to ask users of the old wiki to re-register on the new one.

It should also be noted that the Special:Export/Special:Import approach can be used to back up a wiki's contents, to protect against data loss if something happens to the wiki. Archiving the database and files is the better approach, since it also includes user data and everything else, but using Special:Export is the lighter-weight solution, and one that can be used even if you don't have access to the underlying database.
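Relatedly, if you do have command-line access, the /maintenance directory includes a script that produces the same style of XML dump for the entire wiki in one step, sidestepping Special:Export's lack of an "export all pages" option:

# dump every page, with full revision history, as Special:Export-style XML
php maintenance/dumpBackup.php --full > full-dump.xml

(Using --current instead of --full dumps only the latest revision of each page.)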

There can also be cases where you want to move content from one MediaWiki wiki into another while keeping the "source" wiki around. This might happen if, for instance, your organization has two wikis with different levels of privacy, and you want some of the content generated in the more private wiki to also show up in the more public one. For that, you can use the "Push" extension, which lets you export content from one MediaWiki wiki to another, including regular pages, templates and images. You can read more about this extension here:

https://www.mediawiki.org/wiki/Extension:Push

Importing contents from another system into MediaWiki

Importing content from outside MediaWiki into MediaWiki is probably the trickiest of the three types of content transfers, because the outside content can take any form whatsoever, and must be converted into MediaWiki’s specific structure of pages and wikitext. There are various utilities you can use, depending on the nature of the source content.

If you have HTML content that needs to be converted to MediaWiki, there’s a convenient online tool, the HTML2Wiki Converter, where you can paste in HTML and retrieve corresponding wikitext:

https://toolserver.org/~magnus/html2wiki.php

There's also a MediaWiki extension to convert Microsoft Word documents to wiki pages, called Word2MediaWikiPlus, though its usefulness seems to depend on which exact versions of Word and MediaWiki you have:

https://www.mediawiki.org/wiki/Extension:Word2MediaWikiPlus

If the type of information you're importing is structured (e.g., if it comes from a database), or if it consists of simple text that has already been assigned to page names, then the Data Transfer extension may be the right tool. This extension allows both the import and export of structured data; for now, we'll just look at how you can import data with it.

To import data, Data Transfer provides two special pages: Special:ImportXML and Special:ImportCSV. These pages take in a file in XML or CSV format, respectively, and create a series of wiki pages based on its contents. You may be wondering what the purpose of Special:ImportXML is, given that turning XML into wiki pages is already what Special:Import does. The difference is that Special:ImportXML takes in an entirely different XML format than Special:Import does. The XML format of Special:Import is wiki-based, holding the entire structure of a wiki page, while the XML format of Special:ImportXML is data-based, holding each field of structured data separately; the same is true for Special:ImportCSV. To put it briefly: Data Transfer supports the import (and, as we'll see later, the export) of structured data by translating that structured data into the individual parameters of template calls.

First, let's cover Special:ImportCSV, which is the simpler, and much more popular, of the two. CSV, which was mentioned earlier in the book, is a standard format for storing a table of data; "CSV" stands for "comma-separated values". To use Special:ImportCSV, you must have a table of data that you want to import into the wiki, with one page per row. Special:ImportCSV requires a specific format for the top, "header" row.

Here is a sample CSV file that could be imported via Special:ImportCSV:

Title,Superhero[Alias],Superhero[Special powers],Free Text
Superman,Clark Kent,"Flying, Super strength",He's from Krypton.
Batman,Bruce Wayne,None,He just has a bunch of gadgets.
Wonder Woman,Diana Prince,"Flying (sometimes), Super strength",She also has cool gadgets.

This page requires a specific syntax for the header row of the CSV files it imports. Each column header must take one of the following forms:

Title - sets the title of each page (required)

Template name[Field name] - defines the value for a single field, within a call to the named template

Free Text - sets the "free", non-template part of each page

And here is the resulting text for the page "Superman", created from the first row of data (i.e., the second row of the file):

{{Superhero
|Alias=Clark Kent
|Special powers=Flying, Super strength
}}
He's from Krypton.

The strings "Title" and "Free Text" are in place for English-language wikis, but they can also be used in any other language. In addition, many (over 50) languages have their own counterpart for "Title" and "Free Text" ­ if your wiki is in a language other than English, you can see these values (and, if you want, even modify them) at the pages MediaWiki:dt_xml_title and MediaWiki:dt_xml_freetext.

"Title" is the only required column. If the data you want to import contains no structured fields, you can simply put all the text you want for each page in the "Free Text" column.

Any columns whose names don't conform to one of these standard structures will simply be ignored.

Before or after the import is done, you will of course need to create the template or templates (if any) used by the pages resulting from the import.
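For the superhero CSV above, even a minimal page at "Template:Superhero" would be enough for the import to produce readable pages. The layout here is purely illustrative; any template that handles the "Alias" and "Special powers" parameters will do:

'''Alias:''' {{{Alias|}}}

'''Special powers:''' {{{Special powers|}}}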

The page Special:ImportXML works in the same way, but takes in a file in XML format. See the next section for the XML format that Special:ImportXML would take.

The advantage of using XML over CSV is that it can hold more than one type of page in a single file: with CSV, you would need a different file for each combination of templates, whereas with XML you can put everything in one big file. The disadvantage of XML, on the other hand, is that the file is harder to create.

The homepage for the Data Transfer extension can be found here:

https://www.mediawiki.org/wiki/Extension:Data_Transfer

To import files, you can use the importImages.php script, described in the previous section.

You can also import users via the ImportUsers extension, if you first put the user information from the other system into a CSV file:

https://www.mediawiki.org/wiki/Extension:ImportUsers

Exporting contents from MediaWiki to another system

Finally, we get to the third type of transfer: one where the source is MediaWiki and the destination isn't. This may be done for a variety of reasons: it could be because of an abandonment of MediaWiki (which of course we would never recommend). It could be done to get the wiki's data into an automated system, for some sort of analysis or display. (If that's the case, Semantic MediaWiki is actually the ideal way to do it; see Chapter 16 for more details.) Or it could be done in order to view the contents in a format like PDF or LaTeX.

If you want to export the entire contents in a structured format, the Data Transfer extension is the best approach. Unlike with import, where Data Transfer supports two formats (CSV and XML), for export only one format is supported: XML, via the page Special:ViewXML.

Special:ViewXML lets you view structured XML for different sets of pages in the wiki, by selecting any number of different categories and namespaces. The XML is in the same template-based structure as the one used for Special:ImportXML, although it can also be displayed in a simplified format.

The page is called "ViewXML", not "ExportXML", because, when accessed by a browser, the XML is displayed directly on the screen instead of being downloaded. Here is an example of the XML that could show up on the screen:

<Pages>
  <Category Name="Local scripts">
    <Page ID="747" Title="A3J0TestSuite">
      <Template Name="Local script">
        <Field Name="Author">Frode Fredriksen</Field>
        <Field Name="Status">In use</Field>
        <Field Name="Path">/home/fredriks/Utilities/A3J0TestSuite.pl</Field>
      </Template>
      <Free_Text id="1">A testing script for performance of our A3J0 RAID servers.</Free_Text>
    </Page>
    ...

The exact appearance will depend on one’s browser and/or relevant browser plugins.

There is also a "Simplified format" checkbox, which lets you do that same export to a simpler XML format. Here is how that same data would appear with the "Simplified format" option:

<Pages>
  <Local_scripts>
    <Page>
      <ID>747</ID>
      <Title>A3J0TestSuite</Title>
      <Local_script>
        <Author>Frode Fredriksen</Author>
        <Status>In use</Status>
        <Path>/home/fredriks/Utilities/A3J0TestSuite.pl</Path>
      </Local_script>
      <Free_Text id="1">A testing script for performance of our A3J0 RAID servers.</Free_Text>
    </Page>
    ...

Instead of the <Template> and <Field> tags, the actual template and field names are used as tag names.

As you might expect, it’s then up to the external system to parse this XML, in either form, and import it into its data structure.
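As a trivial illustration, even a command-line XML tool can pull useful data out of the simplified format. Here is a sketch using xmllint, from the widely available libxml2 package; the file name is a placeholder for wherever you saved the output of Special:ViewXML:

# print the titles of all exported pages from the simplified-format XML
xmllint --xpath '//Page/Title/text()' local_scripts.xml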

Unlike with the wiki’s text content, there’s no standard mechanism for exporting either a wiki’s uploaded files or its user data to a non-MediaWiki system.

PDF export

You can also save some or all of a wiki's pages into other file formats, for display purposes. There are several solutions for saving to PDF; the most popular one is the Collection extension:

https://www.mediawiki.org/wiki/Extension:Collection

Collection was originally developed for printing Wikipedia articles as books, but it's used for a variety of purposes. With the extension, one can define "collections" of wiki pages, which then, in conjunction with other extensions, can be exported to PDF, .odt (the file format used by OpenOffice and LibreOffice), and the lesser-known DocBook and ZIM formats (ZIM is used by Kiwix; see the next section).

PdfBook is another extension that can create PDF files from wiki pages:

https://www.mediawiki.org/wiki/Extension:PdfBook

PdfBook is easier to set up and run than Collection, although the display of the PDF files that PdfBook generates is not nearly as nice.

Another extension is Wiki2LaTeX, which can export wiki pages to LaTeX format, and then additionally from LaTeX to PDF:

https://www.mediawiki.org/wiki/Extension:Wiki2LaTeX

Offline viewing

In the context of the web, offline viewing means being able to do an initial download of some or all of a site's contents, and afterwards view it locally, whether on a mobile device, a desktop or a laptop. If your wiki's content is relatively small, a PDF file of everything (see above) may be the simplest solution. For larger wikis, and for a more MediaWiki-like browsing interface, the Kiwix application (http://kiwix.org) is the way to go. Kiwix was mostly developed for offline viewing of Wikipedia, but it can also handle other MediaWiki sites, as well as (in theory) any HTML site at all. You can use the Collection extension (see above) to generate the ZIM files used by Kiwix.

There's also the Miga Data Viewer, or Miga (http://migadv.com), which, like Kiwix, was developed with a specific focus on Wikipedia and MediaWiki but is in fact much more generic. Miga is intended more for structured data, of the kind found in infoboxes, for instance; it offers automated ways to browse and drill down through that data, with an interface somewhat similar to that of the Semantic Drilldown extension.

Figure 21.3 A screenshot of Miga, used to drill down through structured data from Wikipedia