BlikExp
<languages/> Template:PD Help Page <translate> Wiki pages can be exported in a special XML format to [[<tvar|import>Special:MyLanguage/Help:Import</>|import]] into another MediaWiki installation (if this function is enabled on the destination wiki, and the user is a sysop there), or the export can be used elsewhere, for instance for analysing the content.</translate> <translate> See also [[<tvar|meta>m:Syndication feeds</>|m:Syndication feeds]] for exporting information other than pages, and <tvar|import>Template:Ll</> on importing pages.</translate>
<translate>
How to export
There are at least four ways to export pages:
- Paste the names of the articles in the box in [[<tvar|export>Special:Export</>|Special:Export]] or use <tvar|page>http://blik.tf/index.php/Special:Export/FULLPAGENAME</>.</translate>
<translate>
- The backup script <tvar|dumpBackup>dumpBackup.php</> dumps all the wiki pages into an XML file.</translate> <translate> <tvar|dumpBackup>dumpBackup.php</> only works on MediaWiki 1.5 or newer.</translate> <translate> You need to have direct access to the server to run this script; see the example command after this list.</translate> <translate> Dumps of Wikimedia projects are regularly made available at <tvar|url>https://dumps.wikimedia.org/</>.</translate>
<translate>
- Note: you might need to configure AdminSettings.php in order to run dumpBackup.php successfully.</translate> <translate> See <tvar|1>m:MediaWiki</> for more information.</translate>
<translate>
- There is an OAI-PMH interface to regularly fetch pages that have been modified since a specific time.</translate> <translate> For Wikimedia projects this interface is not publicly available; see <tvar|1>Template:Ll</>.</translate> <translate> OAI-PMH contains a wrapper format around the actual exported articles.</translate>
<translate>
- Use the [<tvar|url>http://pywikipediabot.sourceforge.net/</> Python Wikipedia Robot Framework]. This won't be explained here.
By default only the current version of a page is included.</translate> <translate> Optionally you can get all versions with date, time, user name and edit summary.</translate> <translate> Optionally the latest versions of all templates called directly or indirectly are also exported.</translate>
<translate> Additionally you can copy the SQL database.</translate> <translate> This is how dumps of the database were made available before MediaWiki 1.5 and it won't be explained here further.</translate>
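<translate> As a rough sketch of the <tvar|dumpBackup>dumpBackup.php</> route mentioned above (the script path and the flag names are typical, but they may differ between MediaWiki versions, so check the script's --help output on your installation):</translate>
# run from the wiki's installation directory
php maintenance/dumpBackup.php --current > pages-current.xml    # latest revision of every page
php maintenance/dumpBackup.php --full > pages-history.xml       # full revision history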
<translate>
Using 'Special:Export'
For example, to export all pages of a namespace:
1. Get the names of pages to export
- Go to [[<tvar|allpages>Special:Allpages</>|Special:Allpages]] and choose the desired namespace.</translate>
<translate>
- Copy the list of page names to a text editor</translate>
<translate>
- Put all page names on separate lines</translate>
<translate>
- You can do this relatively quickly by copying the part of the rendered page with the desired names and pasting it into, say, MS Word (use Paste Special as unformatted text), then opening the Replace function (CTRL+H), entering ^t in "Find what" and ^p in "Replace with", and clicking Replace All.</translate> (<translate> This relies on tabs between the page names; these are typically present because the page names are inside td tags in the HTML source.</translate>)
<translate>
- The text editor Vim also allows for a quick way to fix line breaks: after pasting the whole list, run the command <tvar|code1>:1,$s/\t/\r/g</> to replace all tabs by carriage returns and then <tvar|code2>:1,$s/^\n//g</> to remove every line containing only a newline character.</translate>
<translate>
- Another approach is to copy the formatted text into any editor exposing the HTML.</translate> <translate> Remove all <tvar|tr1><tr></> and <tvar|tr2></tr></> tags and replace all <tvar|td1><td></> tags with <tvar|trtd><tr><td></> and all <tvar|td2></td></> tags with <tvar|tdtr></td></tr></>; the HTML will then be parsed into the needed format.</translate>
<translate>
- If you have shell and mysql access to your server, you can use this script:
</translate>
mysql -umike -pmikespassword -hlocalhost wikidbname <<EOF
select page_title from wiki_page where page_namespace=0;
EOF
<translate> Note: replace mike and mikespassword with your own username and password. Also, this example shows tables with the prefix wiki_.</translate>
<translate>
- Prefix the namespace to the page names (e.g. 'Help:Contents'), unless the selected namespace is the main namespace.</translate>
<translate>
- Repeat the steps above for other namespaces (e.g. Category:, Template:, etc.)
A similar script for PostgreSQL databases looks like this: </translate>
$ psql -At -U wikiuser -h localhost wikidb -c "select page_title from mediawiki.page"
<translate> Note: replace wikiuser with your own username; the database will prompt you for a password.</translate> <translate> This example shows tables without the prefix wiki_ and with the schema ("mediawiki") specified as part of the table name.</translate>
<translate> Alternatively, a quick approach for those with access to a machine with Python installed:</translate>
<translate>
- Go to [[<tvar|allpages>Special:Allpages</>|Special:Allpages]] and choose the desired namespace.</translate>
<translate>
- Save the entire webpage as index.php.htm.</translate> <translate> Some wikis may have more pages than will fit on one screen of AllPages; you will need to save each of those pages.</translate>
<translate>
- Run <tvar|export>export_all_helper.py</> in the same directory as the saved file (a sketch of such a helper is shown after this list).</translate> <translate> You may wish to pipe the output to a file; e.g. <tvar|code>python export_all_helper.py > main</> to send it to a file named "main".</translate>
<translate>
- Save the page names output by the script.
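The helper script itself is not reproduced here; the following is only a minimal sketch of what such a helper might do, assuming the saved Special:Allpages HTML lists each page as a link whose title attribute holds the page name (navigation links may need additional filtering):
# export_all_helper.py (sketch, not the original script): print the page names
# found in a saved Special:Allpages HTML file.
import re
import sys

# assumed default file name; pass another name as the first argument
fname = sys.argv[1] if len(sys.argv) > 1 else "index.php.htm"
with open(fname, encoding="utf-8") as f:
    html = f.read()

# Assumption: each listed page appears as <a href="..." title="Page name">.
for title in re.findall(r'<a href="[^"]*" title="([^"]+)"', html):
    print(title)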
2. Perform the export
- Go to [[<tvar|export>Special:Export</>|Special:Export]] and paste all your page names into the textbox, making sure there are no empty lines.</translate>
<translate>
- Click 'Submit query'</translate>
<translate>
- Save the resulting XML to a file using your browser's save facility.
and finally...
- Open the XML file in a text editor.</translate> <translate> Scroll to the bottom to check for error messages.</translate>
<translate> Now you can use this XML file to [[<tvar|import>Special:MyLanguage/Help:Import</>|perform an import]].
Exporting the full history
A checkbox in the [[<tvar|export>Special:Export</>|Special:Export]] interface selects whether to export the full history (all versions of an article) or only the most recent version.</translate> <translate> A maximum of 100 revisions is returned; other revisions can be requested as detailed in <tvar|1>Template:Ll</> and sketched below.</translate>
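<translate> For a single page, the same choice can also be made directly in the URL. This is only a sketch: the parameter names (history, offset, limit) are long-standing Special:Export parameters but should be checked against your MediaWiki version, and some wikis disable full-history export via plain URL requests:</translate>
http://blik.tf/index.php?title=Special:Export&pages=FULLPAGENAME&history=1
http://blik.tf/index.php?title=Special:Export&pages=FULLPAGENAME&offset=1&limit=100&action=submit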
<translate>
Export format
</translate> <translate> The format of the XML file you receive is the same in all cases.</translate> <translate> It is codified in XML Schema at <tvar|url>https://www.mediawiki.org/xml/export-0.10.xsd</>.</translate> <translate> This format is not intended for viewing in a web browser.</translate> <translate> Some browsers show you pretty-printed XML with "+" and "-" links to view or hide selected parts.</translate> <translate> Alternatively, the XML source can be viewed using the "view source" feature of the browser, or, after saving the XML file locally, with a program of choice.</translate> <translate> If you read the XML source directly, it won't be difficult to find the actual wikitext.</translate> <translate> If you don't use a special XML editor, "<" and ">" appear as &lt; and &gt; to avoid a conflict with XML tags, and "&" is coded as &amp; to avoid ambiguity.</translate>
<translate> In the current version the export format does not contain an XML replacement of wiki markup (see Wikipedia DTD for an older proposal).</translate> <translate> You only get the wikitext, as you would when editing the article.</translate>
<translate>
Example
</translate>
<mediawiki xml:lang="en">
  <page>
    <title>Page title</title>
    <restrictions>edit=sysop:move=sysop</restrictions>
    <revision>
      <timestamp>2001-01-15T13:15:00Z</timestamp>
      <contributor><username>Foobar</username></contributor>
      <comment>I have just one thing to say!</comment>
      <text>A bunch of [[Special:MyLanguage/text|text]] here.</text>
      <minor />
    </revision>
    <revision>
      <timestamp>2001-01-15T13:10:27Z</timestamp>
      <contributor><ip>10.0.0.2</ip></contributor>
      <comment>new!</comment>
      <text>An earlier [[Special:MyLanguage/revision|revision]].</text>
    </revision>
  </page>
  <page>
    <title>Talk:Page title</title>
    <revision>
      <timestamp>2001-01-15T14:03:00Z</timestamp>
      <contributor><ip>10.0.0.2</ip></contributor>
      <comment>hey</comment>
      <text>WHYD YOU LOCK PAGE??!!! i was editing that jerk</text>
    </revision>
  </page>
</mediawiki>
<translate>
DTD
</translate> <translate> Here is an unofficial, short Document Type Definition version of the format.</translate> <translate> If you don't know what a DTD is just ignore it.</translate>
<!ELEMENT mediawiki (siteinfo,page*)>
<!-- version contains the version number of the format (currently 0.3) -->
<!ATTLIST mediawiki
  version            CDATA #REQUIRED
  xmlns              CDATA #FIXED "https://www.mediawiki.org/xml/export-0.3/"
  xmlns:xsi          CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation CDATA #FIXED "https://www.mediawiki.org/xml/export-0.3/ https://www.mediawiki.org/xml/export-0.3.xsd"
  xml:lang           CDATA #IMPLIED
>
<!ELEMENT siteinfo (sitename,base,generator,case,namespaces)>
  <!ELEMENT sitename (#PCDATA)>      <!-- name of the wiki -->
  <!ELEMENT base (#PCDATA)>          <!-- url of the main page -->
  <!ELEMENT generator (#PCDATA)>     <!-- MediaWiki version string -->
  <!ELEMENT case (#PCDATA)>          <!-- how cases in page names are handled -->
    <!-- possible values: 'first-letter' | 'case-sensitive'
         'case-insensitive' option is reserved for future -->
  <!ELEMENT namespaces (namespace+)> <!-- list of namespaces and prefixes -->
    <!ELEMENT namespace (#PCDATA)>   <!-- contains namespace prefix -->
    <!ATTLIST namespace key CDATA #REQUIRED> <!-- internal namespace number -->
<!ELEMENT page (title,id?,restrictions?,(revision|upload)*)>
  <!ELEMENT title (#PCDATA)>         <!-- Title with namespace prefix -->
  <!ELEMENT id (#PCDATA)>
  <!ELEMENT restrictions (#PCDATA)>  <!-- optional page restrictions -->
<!ELEMENT revision (id?,timestamp,contributor,minor?,comment?,text)>
  <!ELEMENT timestamp (#PCDATA)>     <!-- according to ISO8601 -->
  <!ELEMENT minor EMPTY>             <!-- minor flag -->
  <!ELEMENT comment (#PCDATA)>
  <!ELEMENT text (#PCDATA)>          <!-- Wikisyntax -->
  <!ATTLIST text xml:space CDATA #FIXED "preserve">
<!ELEMENT contributor ((username,id) | ip)>
  <!ELEMENT username (#PCDATA)>
  <!ELEMENT ip (#PCDATA)>
<!ELEMENT upload (timestamp,contributor,comment?,filename,src,size)>
  <!ELEMENT filename (#PCDATA)>
  <!ELEMENT src (#PCDATA)>
  <!ELEMENT size (#PCDATA)>
<translate>
Processing XML export
</translate> <translate> Many tools can process the exported XML.</translate> <translate> If you process a large number of pages (for instance a whole dump), you probably won't be able to fit the document in main memory, so you will need a parser based on SAX or other event-driven methods.</translate>
<translate> You can also use regular expressions to directly process parts of the XML code.</translate> <translate> This may be faster than other methods, but it is not recommended because it is difficult to maintain.</translate>
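<translate> As an illustration, here is a minimal event-driven sketch using Python's standard xml.sax module. It assumes an export file named pages.xml and the element names shown in the example above; adjust both for your own dump:</translate>
# sax_pages.py (sketch): stream an XML export and print each page title together
# with the number of <revision> elements it contains, without loading the whole
# file into memory.
import xml.sax

class ExportHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.revisions = 0

    def startElement(self, name, attrs):
        if name == "page":
            self.title = ""
            self.revisions = 0
        elif name == "title":
            self.in_title = True
        elif name == "revision":
            self.revisions += 1

    def characters(self, content):
        if self.in_title:
            self.title += content

    def endElement(self, name):
        if name == "title":
            self.in_title = False
        elif name == "page":
            # a page is complete as soon as its closing tag arrives
            print(f"{self.title}\t{self.revisions} revision(s)")

xml.sax.parse("pages.xml", ExportHandler())  # assumed file name
<translate> Replace the print call with whatever processing you need; because the handler sees each page as soon as it is closed, memory use stays flat even for a full dump.</translate>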
<translate> Please list methods and tools for processing XML export here:
- Parse MediaWiki Dump (crates.io) is a Rust crate to parse XML dumps.</translate> <translate> Parse Wiki Text (crates.io) is a Rust crate to parse wiki text into a tree of elements.</translate>
<translate>
- Parse::MediaWikiDump is a Perl module for processing the XML dump file.</translate>
<translate>
- [[<tvar|meta>m:Processing MediaWiki XML with STX</>|m:Processing MediaWiki XML with STX]] - Stream based XML transformation</translate>
<translate>
- The [[<tvar|meta>m:IBM History flow project</>|m:IBM History flow project]] can read it after applying a small Python program, <tvar|py>export-historyflow-expand.py</>.</translate>
- Template:Ll
<translate>
Details and practical advice
- To determine the namespace of a page, you have to match its title against the prefixes defined in <tvar|ns>/mediawiki/siteinfo/namespaces/namespace</> (see the sketch after this list).</translate>
<translate>
- Possible restrictions are:</translate>
  - sysop (<translate> protected pages</translate>)
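<translate> A minimal sketch of the namespace matching mentioned above, using Python's ElementTree; the file name pages.xml is an assumption, and the XML namespace of the export format is stripped for brevity:</translate>
# namespace_lookup.py (sketch): build a prefix -> key map from <siteinfo> and use
# it to report the namespace of every page title in an export file.
import xml.etree.ElementTree as ET

def local(tag):
    # strip the "{...}" XML-namespace part that ElementTree prepends to tags
    return tag.rsplit("}", 1)[-1]

prefixes = {}   # wiki namespace prefix -> numeric key, e.g. "Talk" -> "1"
titles = []

for _, elem in ET.iterparse("pages.xml"):   # assumed file name
    name = local(elem.tag)
    if name == "namespace":
        # the main namespace (key 0) has an empty prefix
        prefixes[(elem.text or "").strip()] = elem.get("key")
    elif name == "title":
        titles.append(elem.text)

for title in titles:
    prefix, _, rest = title.partition(":")
    if rest and prefix in prefixes:
        print(f"{title} -> namespace key {prefixes[prefix]}")
    else:
        print(f"{title} -> main namespace (key {prefixes.get('', '0')})")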
<translate>
Why to export
Why not just use a dynamic database download? </translate>
<translate> Suppose you are building a piece of software that at certain points displays information that came from Wikipedia.</translate> <translate> If you want your program to display the information in a different way than can be seen in the live version, you'll probably need the wikicode that is used to enter it, instead of the finished html.</translate>
<translate> Also if you want to get all of the data, you'll probably want to transfer it in the most efficient way that's possible.</translate> <translate> The Wikimedia servers need to do quite a bit of work to convert the wikicode into html.</translate> <translate> That's time consuming both for you and for the Wikimedia servers, so simply spidering all pages is not the way to go.</translate>
<translate> To access any article in XML, one at a time, link to:
Special:Export/Title_of_the_article
See also
</translate>
[[Category:Help{{#translation:}}]]