BlikExp


<languages/> Template:PD Help Page <translate> Wiki pages can be exported in a special XML format, either to [[<tvar|import>Special:MyLanguage/Help:Import</>|import]] them into another MediaWiki installation (if this function is enabled on the destination wiki and the user is a sysop there) or to use them otherwise, for instance to analyse the content.</translate> <translate> See also [[<tvar|meta>m:Syndication feeds</>|m:Syndication feeds]] for exporting information other than pages, and <tvar|import>Template:Ll</> on importing pages.</translate>

<translate>

How to export

There are at least four ways to export pages:

<translate>

  • The backup script <tvar|dumpBackup>dumpBackup.php</> dumps all the wiki pages into an XML file.</translate> <translate> <tvar|dumpBackup>dumpBackup.php</> only works on MediaWiki 1.5 or newer.</translate> <translate> You need to have direct access to the server to run this script.</translate> <translate> Dumps of Wikimedia projects are regularly made available at <tvar|url>https://dumps.wikimedia.org/</>.</translate>

<translate>

    • Note: you might need to configure AdminSettings.php in order to run dumpBackup.php successfully.</translate> <translate> See <tvar|1>m:MediaWiki</> for more information.</translate>

<translate>

  • There is an OAI-PMH interface to regularly fetch pages that have been modified since a specific time.</translate> <translate> For Wikimedia projects this interface is not publicly available; see <tvar|1>Template:Ll</>.</translate> <translate> OAI-PMH wraps the actual exported articles in its own container format.</translate>

<translate>

By default only the current version of a page is included.</translate> <translate> Optionally you can get all versions with date, time, user name and edit summary.</translate> <translate> Optionally the latest version of every template called directly or indirectly is also exported.</translate>

<translate> Additionally you can copy the SQL database.</translate> <translate> This is how dumps of the database were made available before MediaWiki 1.5 and it won't be explained here further.</translate>

<translate>

Using 'Special:Export'

The following example shows how to export all pages of a namespace.

1. Get the names of pages to export

  1. Go to [[<tvar|allpages>Special:Allpages</>|Special:Allpages]] and choose the desired article/file.</translate>

<translate>

  1. Copy the list of page names to a text editor</translate>

<translate>

  1. Put all page names on separate lines</translate>

<translate>

    1. You can do this fairly quickly by copying the part of the rendered page with the desired names and pasting it into, say, MS Word using Paste Special as unformatted text. Then open the Replace dialog (Ctrl+H), enter ^t in "Find what" and ^p in "Replace with", and click Replace All.</translate> (<translate> This relies on tabs between the page names; these typically appear because the page names are inside td tags in the HTML source.</translate>)

<translate>

    1. The text editor Vim also allows for a quick way to fix line breaks: after pasting the whole list, run the command <tvar|code1>:1,$s/\t/\r/g</> to replace all tabs by carriage returns and then <tvar|code2>:1,$s/^\n//g</> to remove every line containing only a newline character.</translate>

<translate>

    1. Another approach is to copy the formatted text into any editor that exposes the HTML.</translate> <translate> Remove all <tvar|tr1><tr></> and <tvar|tr2></tr></> tags, then replace every <tvar|td1><td></> tag with <tvar|trtd><tr><td></> and every <tvar|td2></td></> tag with <tvar|tdtr></td></tr></>; the HTML will then be rendered in the needed format.</translate>

<translate>

    1. If you have shell and mysql access to your server, you can use this script:

</translate>

# List the titles of all pages in the main namespace (namespace 0).
mysql -umike -pmikespassword -hlocalhost wikidbname <<EOF
select page_title from wiki_page where page_namespace=0
EOF

<translate> Note: replace mike and mikespassword with your own user name and password. This example assumes tables with the prefix wiki_.</translate>

<translate>

  1. Prefix the namespace to the page names (e.g. 'Help:Contents'), unless the selected namespace is the main namespace.</translate>

<translate>

  1. Repeat the steps above for other namespaces (e.g. Category:, Template:, etc.)

A similar script for PostgreSQL databases looks like this: </translate>

$ psql -At -U wikiuser -h localhost wikidb -c "select page_title from mediawiki.page"

<translate> Note: replace wikiuser with your own user name; the database will prompt you for a password.</translate> <translate> This example shows tables without the prefix wiki_ and with the schema (mediawiki) specified as part of the table name.</translate>
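The manual clean-up steps above (one page name per line, no empty lines, namespace prefix added) can also be sketched in Python. This is only an illustration and not part of MediaWiki; the function name normalize_page_names and its parameters are hypothetical.

```python
def normalize_page_names(raw_text, namespace_prefix=""):
    """Turn pasted, tab- or newline-separated page names into a clean list.

    namespace_prefix is e.g. "Help" (hypothetical parameter); leave it empty
    for the main namespace, which has no prefix.
    """
    names = []
    # Treat tabs like line breaks, then walk the resulting lines.
    for line in raw_text.replace("\t", "\n").splitlines():
        name = line.strip()
        if not name:
            continue  # the export form should not receive empty lines
        if namespace_prefix:
            name = f"{namespace_prefix}:{name}"
        names.append(name)
    return names


if __name__ == "__main__":
    pasted = "Contents\tEditing pages\n\nLinks\t\n"
    print("\n".join(normalize_page_names(pasted, "Help")))
```

The printed list can be pasted directly into the Special:Export textbox.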

<translate> Alternatively, a quick approach for those with access to a machine with Python installed:</translate>

<translate>

  1. Go to [[<tvar|allpages>Special:Allpages</>|Special:Allpages]] and choose the desired namespace.</translate>

<translate>

  1. Save the entire webpage as index.php.htm.</translate> <translate> Some wikis may have more pages than will fit on one screen of AllPages; you will need to save each of those pages.</translate>

<translate>

  1. Run <tvar|export>export_all_helper.py</> in the same directory as the saved file.</translate> <translate> You may wish to pipe the output to a file; e.g. <tvar|code>python export_all_helper.py > main</> to send it to a file named "main".</translate>

<translate>

  1. Save the page names output by the script.
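The helper script itself is not reproduced here. As a rough sketch of the idea only, the following Python class collects link titles from a saved Allpages page. It assumes that the page list sits inside an element with the class mw-allpages-chunk; that class name, and the class AllPagesTitleParser itself, are assumptions that may not match every MediaWiki version or skin.

```python
from html.parser import HTMLParser


class AllPagesTitleParser(HTMLParser):
    """Collect link titles found inside the page-list container."""

    def __init__(self, container_class="mw-allpages-chunk"):
        super().__init__()
        self.container_class = container_class  # assumed class name
        self.container_tag = None  # tag name of the open container, if any
        self.depth = 0             # nesting depth of that tag name
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.container_tag is None:
            classes = (attrs.get("class") or "").split()
            if self.container_class in classes:
                self.container_tag, self.depth = tag, 1
            return
        if tag == self.container_tag:
            self.depth += 1
        elif tag == "a" and attrs.get("title"):
            self.titles.append(attrs["title"])

    def handle_endtag(self, tag):
        if tag == self.container_tag:
            self.depth -= 1
            if self.depth == 0:
                self.container_tag = None  # left the page list


if __name__ == "__main__":
    with open("index.php.htm", encoding="utf-8") as saved_page:
        parser = AllPagesTitleParser()
        parser.feed(saved_page.read())
    print("\n".join(parser.titles))
```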

2. Perform the export

  • Go to [[<tvar|export>Special:Export</>|Special:Export]] and paste all your page names into the textbox, making sure there are no empty lines.</translate>

<translate>

  • Click 'Submit query'</translate>

<translate>

  • Save the resulting XML to a file using your browser's save facility.

and finally...

  • Open the XML file in a text editor.</translate> <translate> Scroll to the bottom to check for error messages.</translate>

<translate> Now you can use this XML file to [[<tvar|import>Special:MyLanguage/Help:Import</>|perform an import]].

Exporting the full history

A checkbox in the [[<tvar|export>Special:Export</>|Special:Export]] interface selects whether to export the full history (all versions of an article) or only the most recent version.</translate> <translate> A maximum of 100 revisions are returned; other revisions can be requested as detailed in <tvar|1>Template:Ll</>.</translate>

<translate>

Export format

</translate> <translate> The format of the XML file you receive is the same whichever export method you use.</translate> <translate> It is codified in XML Schema at <tvar|url>https://www.mediawiki.org/xml/export-0.10.xsd</>.</translate> <translate> This format is not intended for viewing in a web browser.</translate> <translate> Some browsers show you pretty-printed XML with "+" and "-" links to view or hide selected parts.</translate> <translate> Alternatively, the XML source can be viewed using the "view source" feature of the browser or, after saving the XML file locally, with a program of your choice.</translate> <translate> If you read the XML source directly, it won't be difficult to find the actual wikitext.</translate> <translate> If you don't use a special XML editor, "<" and ">" appear as &lt; and &gt; to avoid a conflict with XML tags; to avoid ambiguity, "&" is coded as "&amp;".</translate>
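The escaping described above can be reproduced with Python's standard library, which may help when reading raw export text by hand. This is a small sketch; the sample wikitext is made up for illustration.

```python
from xml.sax.saxutils import escape, unescape

# Made-up wikitext containing the three characters that must be escaped.
wikitext = "a < b & [[c]] > d"

encoded = escape(wikitext)   # the form in which it appears in the XML source
decoded = unescape(encoded)  # back to the text as seen when editing the page

print(encoded)
print(decoded == wikitext)
```

An XML parser performs the decoding automatically, so this step is only needed when the file is processed as plain text.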

<translate> In the current version the export format does not contain an XML replacement of wiki markup (see Wikipedia DTD for an older proposal).</translate> <translate> You only get the wikitext as you get when editing the article.</translate>

<translate>

Example

</translate>

  <mediawiki xml:lang="en">
    <page>
      <title>Page title</title>
      <restrictions>edit=sysop:move=sysop</restrictions>
      <revision>
        <timestamp>2001-01-15T13:15:00Z</timestamp>
        <contributor><username>Foobar</username></contributor>
        <comment>I have just one thing to say!</comment>
        <text>A bunch of [[Special:MyLanguage/text|text]] here.</text>
        <minor />
      </revision>
      <revision>
        <timestamp>2001-01-15T13:10:27Z</timestamp>
        <contributor><ip>10.0.0.2</ip></contributor>
        <comment>new!</comment>
        <text>An earlier [[Special:MyLanguage/revision|revision]].</text>
      </revision>
    </page>
 
    <page>
      <title>Talk:Page title</title>
      <revision>
        <timestamp>2001-01-15T14:03:00Z</timestamp>
        <contributor><ip>10.0.0.2</ip></contributor>
        <comment>hey</comment>
        <text>WHYD YOU LOCK PAGE??!!! i was editing that jerk</text>
      </revision>
    </page>
  </mediawiki>
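For a small export like the example above, the revisions can be listed with Python's xml.etree.ElementTree. The sketch below is one possible approach, not an official tool; the helper names local and list_revisions are hypothetical. It strips any XML namespace from the tag names, because real exports declare one while the example above does not.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def list_revisions(xml_text):
    """Return (title, timestamp, contributor, wikitext) for every revision."""
    rows = []
    root = ET.fromstring(xml_text)
    for page in root:
        if local(page.tag) != "page":
            continue  # skip <siteinfo> and anything else
        title = None
        for child in page:
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "revision":
                fields = {local(e.tag): e for e in child}
                who = None
                contributor = fields.get("contributor")
                if contributor is not None:
                    # Either <username> or <ip> identifies the editor.
                    for e in contributor:
                        if local(e.tag) in ("username", "ip"):
                            who = e.text
                timestamp = fields["timestamp"].text
                text = fields["text"].text or ""
                rows.append((title, timestamp, who, text))
    return rows
```

Calling list_revisions with the contents of a saved export file returns one tuple per revision, in document order.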

<translate>

DTD

</translate> <translate> Here is an unofficial, short Document Type Definition version of the format.</translate> <translate> If you don't know what a DTD is just ignore it.</translate>

<!ELEMENT mediawiki (siteinfo,page*)>
<!-- version contains the version number of the format (currently 0.3) -->
<!ATTLIST mediawiki
  version  CDATA  #REQUIRED 
  xmlns CDATA #FIXED "https://www.mediawiki.org/xml/export-0.3/"
  xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation CDATA #FIXED
    "https://www.mediawiki.org/xml/export-0.3/ https://www.mediawiki.org/xml/export-0.3.xsd"
  xml:lang  CDATA #IMPLIED
>
<!ELEMENT siteinfo (sitename,base,generator,case,namespaces)>
<!ELEMENT sitename (#PCDATA)>      <!-- name of the wiki -->
<!ELEMENT base (#PCDATA)>          <!-- url of the main page -->
<!ELEMENT generator (#PCDATA)>     <!-- MediaWiki version string -->
<!ELEMENT case (#PCDATA)>          <!-- how cases in page names are handled -->
   <!-- possible values: 'first-letter' | 'case-sensitive'
                         'case-insensitive' option is reserved for future -->
<!ELEMENT namespaces (namespace+)> <!-- list of namespaces and prefixes -->
  <!ELEMENT namespace (#PCDATA)>     <!-- contains namespace prefix -->
  <!ATTLIST namespace key CDATA #REQUIRED> <!-- internal namespace number -->
<!ELEMENT page (title,id?,restrictions?,(revision|upload)*)>
  <!ELEMENT title (#PCDATA)>         <!-- Title with namespace prefix -->
  <!ELEMENT id (#PCDATA)> 
  <!ELEMENT restrictions (#PCDATA)>  <!-- optional page restrictions -->
<!ELEMENT revision (id?,timestamp,contributor,minor?,comment?,text)>
  <!ELEMENT timestamp (#PCDATA)>     <!-- according to ISO8601 -->
  <!ELEMENT minor EMPTY>             <!-- minor flag -->
  <!ELEMENT comment (#PCDATA)> 
  <!ELEMENT text (#PCDATA)>          <!-- Wikisyntax -->
  <!ATTLIST text xml:space CDATA  #FIXED "preserve">
<!ELEMENT contributor ((username,id) | ip)>
  <!ELEMENT username (#PCDATA)>
  <!ELEMENT ip (#PCDATA)>
<!ELEMENT upload (timestamp,contributor,comment?,filename,src,size)>
  <!ELEMENT filename (#PCDATA)>
  <!ELEMENT src (#PCDATA)>
  <!ELEMENT size (#PCDATA)>

<translate>

Processing XML export

</translate> <translate> Many tools can process the exported XML.</translate> <translate> If you process a large number of pages (for instance a whole dump), you probably won't be able to fit the document in main memory, so you will need a parser based on SAX or other event-driven methods.</translate>
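As a sketch of event-driven processing, the following Python generator uses xml.etree.ElementTree.iterparse (an incremental parser, used here instead of SAX) to read page titles one page at a time. The function name iter_page_titles is hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_page_titles(source):
    """Yield page titles from an export without loading it all into memory.

    source is a file name or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Tags may carry a '{namespace}' prefix; compare the local name only.
        if elem.tag.rsplit("}", 1)[-1] != "page":
            continue
        for child in elem:
            if child.tag.rsplit("}", 1)[-1] == "title":
                yield child.text
                break
        elem.clear()  # discard the finished page's children to save memory


if __name__ == "__main__":
    demo = io.BytesIO(
        b"<mediawiki><page><title>A</title></page>"
        b"<page><title>Talk:A</title></page></mediawiki>"
    )
    for title in iter_page_titles(demo):
        print(title)
```

Each page element is emptied as soon as it has been handled, so memory use stays roughly proportional to the largest single page rather than to the whole dump.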

<translate> You can also use regular expressions to directly process parts of the XML code.</translate> <translate> This may be faster than other methods but is not recommended because it is difficult to maintain.</translate>

<translate> Please list methods and tools for processing XML export here:

<translate>

<translate>

  • [[<tvar|meta>m:Processing MediaWiki XML with STX</>|m:Processing MediaWiki XML with STX]] - Stream based XML transformation</translate>

<translate>

<translate>

Details and practical advice

  • To determine the namespace of a page you have to match its title to the prefixes defined in

<tvar|ns>/mediawiki/siteinfo/namespaces/namespace</></translate> <translate>

  • Possible restrictions are</translate>
    • sysop - <translate> protected pages</translate>
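The two points above can be sketched in Python as follows. Both function names are hypothetical. The namespace lookup compares prefixes exactly and ignores case or underscore variants, and the restrictions parser only handles the action=level form shown in the example export, so treat this as a starting point.

```python
def namespace_key(title, namespaces):
    """Return the namespace number for a title.

    namespaces maps key -> prefix, as read from
    /mediawiki/siteinfo/namespaces/namespace (main namespace: empty prefix).
    """
    prefix, sep, _rest = title.partition(":")
    if sep:
        for key, name in namespaces.items():
            if name and name == prefix:
                return key
    return 0  # no known prefix: main namespace


def parse_restrictions(value):
    """Split e.g. 'edit=sysop:move=sysop' into {'edit': 'sysop', ...}."""
    result = {}
    for part in value.split(":"):
        action, sep, level = part.partition("=")
        if sep:
            result[action] = level
    return result


if __name__ == "__main__":
    known = {0: "", 1: "Talk", 12: "Help"}  # sample values for illustration
    print(namespace_key("Talk:Page title", known))
    print(parse_restrictions("edit=sysop:move=sysop"))
```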

<translate>

Why export

Why not just use a dynamic database download? </translate>

<translate> Suppose you are building a piece of software that at certain points displays information that came from Wikipedia.</translate> <translate> If you want your program to display the information in a different way than can be seen in the live version, you'll probably need the wikicode that is used to enter it, instead of the finished html.</translate>

<translate> Also if you want to get all of the data, you'll probably want to transfer it in the most efficient way that's possible.</translate> <translate> The Wikimedia servers need to do quite a bit of work to convert the wikicode into html.</translate> <translate> That's time consuming both for you and for the Wikimedia servers, so simply spidering all pages is not the way to go.</translate>

<translate> To access any article in XML, one at a time, link to:

Special:Export/Title_of_the_article
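A link of this form can be built in Python as sketched below. The function name export_url is hypothetical, and the base address https://wiki.example.org/wiki/ is a placeholder; the real article path depends on how the wiki is configured.

```python
from urllib.parse import quote


def export_url(wiki_base, title):
    """Build the Special:Export link for one page.

    wiki_base is the article path prefix, e.g. 'https://wiki.example.org/wiki/'
    (a placeholder; the real prefix depends on the wiki's configuration).
    """
    page = title.replace(" ", "_")  # titles use underscores in URLs
    return wiki_base.rstrip("/") + "/Special:Export/" + quote(page, safe=":/")


if __name__ == "__main__":
    print(export_url("https://wiki.example.org/wiki/", "Talk:Page title"))
```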

See also

</translate>

[[Category:Help{{#translation:}}]]