Difference between revisions of "YK: Chapter 14: Protecting content"
Latest revision as of 23:06, 13 June 2016
14 Protecting content
Handling spam
The web 2.0 revolution of user-generated content, for all its positive impact, has also been a godsend for spammers. Never before in human history has it been so easy to sell dubious merchandise or services, whether it’s prescription drugs, life insurance, college essay writing, or more of those prescription drugs. So spam has infiltrated countless blog comments, Twitter feeds, and wiki pages. If your wiki is not public, or is public but has closed registration, then you have nothing to fear and you can skip ahead to the next chapter. If, however, your wiki allows contributions from the general public, then chances are good that, at some point, one or more groups of wiki-spammers will find it and will start trying to turn it into a free advertising platform.
MediaWiki already performs one important anti-spam task, which is to add a “nofollow” attribute to the HTML of every external link. This tells search engines not to follow that link, which greatly reduces the benefit that adding links can provide to spammers. You can undo that behavior, by the way, by adding the following to LocalSettings.php, though you really shouldn’t:
$wgNoFollowLinks = false;
Still, for whatever reason, some spammers really like to hit MediaWiki sites. Thankfully, there are a number of effective extensions that let you fight back against spam. The two most important ones, which are recommended for every publicly-editable wiki, are ConfirmEdit and SpamBlacklist; we’ll get to those in the next sections.
ConfirmEdit
The ConfirmEdit extension comes bundled in with every MediaWiki install. Its documentation can be found here:
https://www.mediawiki.org/wiki/Extension:ConfirmEdit
It sets up a CAPTCHA tool for page edits, user registration and user login. A CAPTCHA (which stands for "Completely Automated Public Turing test to tell Computers and Humans Apart") is any input that’s designed so that a human can enter it but a software program can’t: its most common variety is those now-ubiquitous tests in online forms that ask you to look at an image with distorted numbers and letters and type them in. ConfirmEdit provides, at the moment, six different CAPTCHA options. They are:
SimpleCaptcha: the default option. Displays a simple math problem.
FancyCaptcha: displays an image of a stylized set of letters that users have to decipher (this option is most like the standard CAPTCHAs).
MathCaptcha: like SimpleCaptcha, but the math problem is displayed as an image.
QuestyCaptcha: asks a question, out of a pre-defined set (the administrator has to create the questions, and their allowed answers).
ReCaptcha: uses the reCAPTCHA service, which uses distorted text from scanned-in books.
Asirra: asks users to select only the cat photos out of a list of 12 photos of cats and dogs.
All of these options, including SimpleCaptcha, are better than nothing; though there’s a big range of effectiveness among all of them. For a while, the ReCaptcha option was a very popular anti-spam tool (as was the reCAPTCHA service in general, on non-wiki sites). It’s the nicest-looking of the ConfirmEdit options, it performs a useful service (helping to digitize old books), and it’s also the only one that provides an alternate, audio-based CAPTCHA, for use by blind or visually-impaired users. However, it may have become a victim of its own popularity: sometime around 2010, spammers apparently figured out en masse how to get around it, and now it’s no longer very useful.
At the moment, the two most effective of these options appear to be QuestyCaptcha and Asirra. With QuestyCaptcha, there’s no automated software that can figure out the right answer to your questions, so even simple questions are generally effective. (Though it’s helpful to replace the set of questions every once in a while, if spam starts getting through.) Asirra appears to be effective in that there’s currently no software that can easily distinguish between cats and dogs. That presumably won’t be the case forever, but it is at the moment.
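As a rough illustration of the QuestyCaptcha approach, the following LocalSettings.php sketch selects that module and defines two questions. It assumes the $wgCaptchaClass and $wgCaptchaQuestions settings described in the ConfirmEdit documentation, along with the older require_once loading style; the questions and answers are invented for the example, and the loading lines and array format may differ in your version of the extension.

```php
// Load ConfirmEdit and its QuestyCaptcha module (older require_once style;
// an assumption here, so check the extension page for your version).
require_once "$IP/extensions/ConfirmEdit/ConfirmEdit.php";
require_once "$IP/extensions/ConfirmEdit/QuestyCaptcha.php";
$wgCaptchaClass = 'QuestyCaptcha';

// Example questions only; replace them with ones specific to your wiki.
$wgCaptchaQuestions[] = array( 'question' => 'What is the name of this wiki?', 'answer' => 'Example Wiki' );
$wgCaptchaQuestions[] = array( 'question' => 'What color is a clear daytime sky?', 'answer' => 'blue' );
```

Questions that depend on knowledge of your particular wiki tend to be harder to answer in bulk than generic ones, and swapping them out later only requires editing these lines.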
Whichever CAPTCHA module you go with, ConfirmEdit offers the same standard set of additional options. First, it lets you customize which user groups will see CAPTCHAs, using the 'skipcaptcha' permission type. By default, only the 'bot' and 'sysop' user groups are exempted from CAPTCHAs (in other words, they have 'skipcaptcha' set to true). If you want to, say, exempt registered users as well, you could add the following to LocalSettings.php:
$wgGroupPermissions['user']['skipcaptcha'] = true;
That may seem like a reasonable change, but actually it’s not necessary or recommended, as we’ll see soon.
ConfirmEdit also lets you configure which actions result in a CAPTCHA test. The relevant actions are:
'edit': any attempted page edit
'create': the creation of a new page
'addurl': any edit which results in a new URL being added to the page
'createaccount': user registration
'badlogin': when a user tries to log in after already having given an incorrect password (this is useful to guard against bots that try to guess passwords)
By default, 'addurl', 'createaccount' and 'badlogin' are checked, while 'edit' and 'create' are not. Why is that? Surely every edit is worth checking? Actually, it’s not usually necessary, because of the presence of the 'addurl' action. Spam almost always involves the addition of one or more URLs. (Not always, though, because, bizarrely, some "pseudo-spammers" like to just add random text to pages.) Meanwhile, in regular wiki editing, new external URLs get added to pages only occasionally. So checking the addition of URLs works to ward off most automated spammers, while being only a minor inconvenience to real users.
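If you do decide that every edit and page creation should be checked, for instance because of the "pseudo-spammers" who add no URLs, the defaults can be overridden in LocalSettings.php. This is a minimal sketch that assumes ConfirmEdit's $wgCaptchaTriggers setting, keyed by the action names listed above:

```php
// Also require a CAPTCHA for every edit and every new page.
// The other triggers ('addurl', 'createaccount', 'badlogin') keep their defaults.
$wgCaptchaTriggers['edit'] = true;
$wgCaptchaTriggers['create'] = true;
```

Users in groups with the 'skipcaptcha' permission are still exempt, so this mainly affects anonymous and newly registered editors.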
In an ideal world, a CAPTCHA system would block all spam. But spammers have figured out how to bypass CAPTCHAs, most likely by hiring humans to enter the inputs (the going rate, according to Wikipedia, is an absurd 0.1 cents for every completed CAPTCHA). Still, ConfirmEdit does seem to cut down significantly on spam: it stops the waves of automated spam that spammers sometimes like to unleash, where hundreds of spam pages can be created in a few hours.
SpamBlacklist
Another, complementary tool is the SpamBlacklist extension, which can block edits based on two criteria: what URLs they add to the page, and what IP address they originate from. The URLs that spammers add tend to belong to a very large, but finite, set of known websites. The SpamBlacklist extension lets you use as many URL "blacklists" as you want, each containing a set of domains (actually, it’s a set of regular expressions for domains, so that, for instance, every domain containing the string "casino-online" can get blocked with one line). By default, SpamBlacklist uses a single blacklist, the Wikimedia Meta-Wiki blacklist, located at:
http://meta.wikimedia.org/wiki/Spam_blacklist
It’s an impressively long list, and it seems to be fairly effective at blocking the spam edits that ConfirmEdit doesn’t. The set of websites that spammers use is always growing, but thankfully there’s no shortage of additional blacklists available; the SpamBlacklist documentation lists a number of these.
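Using more than one blacklist might look something like the following in LocalSettings.php. This is only a sketch: it assumes the $wgSpamBlacklistFiles setting used by versions of SpamBlacklist from around the time of writing (later versions may configure sources differently), and the local file path is a hypothetical example.

```php
// Sources are read in order; each holds one regular-expression fragment per line.
$wgSpamBlacklistFiles = array(
    // The Wikimedia Meta-Wiki blacklist, fetched as raw text
    "https://meta.wikimedia.org/w/index.php?title=Spam_blacklist&action=raw&sa=1",
    // A hypothetical local file with site-specific entries, e.g. a line reading: casino-online
    "/srv/wiki/local_spam_blacklist.txt",
);
```

Keeping site-specific patterns in a separate source makes it easier to add a domain quickly when a new wave of spam arrives, without waiting for a shared list to catch up.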
There’s more functionality in SpamBlacklist, including the ability to create "whitelists", of domains that match some anti-spam criteria but are actually fine. You can read more on the extension’s web page:
https://www.mediawiki.org/wiki/Extension:SpamBlacklist
Other anti-spam tools
Even ConfirmEdit and SpamBlacklist, as helpful as they both are, don’t block all spam. Most perniciously, some spammers simply link to the URLs of pages they’ve created on other wikis, which themselves contain spam. There’s no real way to block such URLs, since they point to innocent domains. There’s a third way of guarding against spam, though, which is to check attributes like the IP address and email address (assuming they’ve registered) of the user making the edit. Two extensions that do that are "Bad Behavior" and "Check Spambots":
https://www.mediawiki.org/wiki/Extension:Bad_Behavior
https://www.mediawiki.org/wiki/Extension:Check_Spambots
If you do get hit with spam, there are three useful tools for getting rid of it quickly and easily: "Nuke", “Block and Nuke” and "DeleteBatch".
Nuke is an extension that’s bundled in with MediaWiki that lets you delete all the pages created by a single user or IP address. If a spammer sticks to just a few user accounts or IP addresses, and they only create new pages, Nuke will work very well. You just need to enter the username or IP address, and it does the rest. You can get the extension here:
https://www.mediawiki.org/wiki/Extension:Nuke
An extension that’s probably more useful overall is “Block and Nuke”, which, instead of going after one account, goes after all of them, except for usernames that have been added to a whitelist file (it’s the “guilty until proven innocent” model). In most cases, wiki spammers seem to create many accounts, abandoning each one after only one or two edits; and “Block and Nuke” will work much better in those cases:
https://www.mediawiki.org/wiki/Extension:BlockandNuke
Unfortunately, there’s no current extension that does something similar with bad edits to pages that already existed; in other words, one that does a mass revert instead of a mass deletion. The closest thing is this JavaScript code, which you can add to MediaWiki:Common.js to do mass reverts from the browser. It’s not nearly as efficient, but it’s certainly better than nothing:
http://en.wikipedia.org/wiki/User:John254/mass_rollback.js
MediaWiki has a script, deleteBatch.php, that provides a different approach to undoing spam and vandalism: it lets administrators delete a large group of pages at once, by supplying a text file containing all the page names:
https://www.mediawiki.org/wiki/Manual:DeleteBatch.php
Additionally, there’s the DeleteBatch extension, which lets you do essentially the same thing from the browser interface:
https://www.mediawiki.org/wiki/Extension:DeleteBatch
mediawiki.org has an entire “Combating spam” page that lists these and other extensions, as well as other, more involved ways of avoiding spam. You can see it here:
https://www.mediawiki.org/wiki/Manual:Combating_spam
Restricting registration
Finally, there’s an alternate approach to preventing spam, which is to control users’ ability to register, and then to restrict editing to just logged-in users. It makes registration more difficult for users, but it may well be the most foolproof approach against spam.
There are two extensions that can be used for this purpose: OpenID and ConfirmAccount. The OpenID extension lets users register and log in via the OpenID protocol, and can be set to only allow registration via OpenID. OpenID, until now at least, seems to have been impossible for spammers to crack. You can see here for more information.
The other extension is ConfirmAccount, in which every user registration has to be approved by an administrator; which also works quite well against spam. You can read about it here.
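Whichever registration mechanism you choose, the second half of this approach, restricting editing to logged-in users, is done with a standard permission setting in LocalSettings.php. A minimal sketch:

```php
// Anonymous visitors ('*') can no longer edit; registered users still can,
// since the 'user' group has the 'edit' permission by default.
$wgGroupPermissions['*']['edit'] = false;
```

This line only closes off anonymous editing. How account creation itself is restricted depends on the extension in use, so that part is best taken from the extension's own documentation.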
Access control and protecting pages
There are two kinds of access control: restricting the ability of certain users to read certain content, and restricting their ability to edit certain content. In MediaWiki, these are very different from one another, and we’ll handle them in two separate sections.
Controlling read access
MediaWiki was never designed, and never re-designed, to allow for restricting read-access. If you go to the page on mediawiki.org for almost any access-control extension, you will see the following friendly message:
If you need per-page or partial page access restrictions, you are advised to install an appropriate content management package. MediaWiki was not written to provide per-page access restrictions, and almost all hacks or patches promising to add them will likely have flaws somewhere, which could lead to exposure of confidential data. We are not responsible for anything being leaked, leading to loss of funds or one’s job.
In reality, there are hooks in the MediaWiki code to allow extensions to restrict viewing of pages; using any of the access-control extensions, a user who is not allowed to view a page will most likely only see an error message if they go directly to the page. However, the warning is still appropriate, because, for whatever reason, there are places in the MediaWiki code that ignore these read restrictions. Currently there are two known ones: the search page, and the "Recent changes" page. If a user does a search on text contained in a restricted page, they will be able to see the name of the page, the fact that it contains the search text, and the text fragment around the search text. And any changes to restricted pages will show up in the "Recent changes" page, where at least the edit summary will be viewable by everyone.
In addition, for those using the Semantic MediaWiki extension, SMW poses a third security gap, because it, too, ignores read restrictions, so any data stored semantically within a restricted page will be viewable by everyone.
It could be that all of the current issues will be fixed in future versions of the software. Nevertheless, even then, trying to restrict people’s ability to access content in MediaWiki still seems like it would be a bad idea. This being a wiki, anyone who can read a certain protected page can easily copy over its contents to another, unprotected, page; or make some mistake in editing the page that leads to it no longer being in a protected category; etc. Even if the mistake lasts for no more than five minutes, that’s still enough time for someone to see the material and have a permanent copy of it. And you might never find out if such a breach happens.
The other big issue is that every extension you use has to restrict read-access permissions. If even one doesn’t, like Semantic MediaWiki, then all your restriction work may be in vain.
So what do you do if you want to store confidential information in your wiki? Probably the most foolproof solution for that case is to simply have a second wiki, one which is restricted to only the small group of people with preferred access (most likely, top-level managers or the like), which will hold all the confidential data. Then you can have an additional element of "security through obscurity": people who don’t have access to the wiki may not even know about it, or may not know its web address, so there’s less chance of any sort of breach. It’s much safer to prevent someone from reading a wiki entirely than to prevent them from reading only certain parts of it.
How do you prevent people from getting to a wiki? If you’re on an internal network, and there’s already some server that only the intended readers have access to, then the easiest solution is to put the wiki on that server. Otherwise, the best way to restrict viewing of the wiki is via LocalSettings.php settings. That’s covered here, but here are the relevant lines again:
$wgGroupPermissions['*']['read'] = false;
$wgGroupPermissions['user']['read'] = true;
The variable $wgWhitelistRead can also be useful in this case, because it lets you define certain pages that non-logged-in users can see, even if they can’t view the rest of the wiki. If you want everyone to be able to see the front page, for instance, you could add the following:
$wgWhitelistRead = array( 'Main Page' );
And if you’re using the ConfirmAccount extension (see here), which lets people request a user account, then at least the following would be necessary if the wiki is private:
$wgWhitelistRead = array( 'Special:RequestAccount' );
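Note that each assignment to $wgWhitelistRead replaces the whole array, so if you want both of the pages above to be readable, they need to go into a single list. Put together with the read restrictions, a private wiki's settings might look like the following sketch (the page names are simply the two examples already mentioned):

```php
// Only logged-in users can read the wiki...
$wgGroupPermissions['*']['read'] = false;
$wgGroupPermissions['user']['read'] = true;

// ...except for these pages, which anyone can view.
$wgWhitelistRead = array( 'Main Page', 'Special:RequestAccount' );
```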
In addition to using $wgGroupPermissions, there are also general web-based solutions, like using an .htaccess file.
What about more complex requirements: for instance, a system where regular users can only read and edit their own contributions, while administrators can read and edit everything? There may be extensions intended to support specific configurations ("KeepYourEyesToYourself" is intended for just that case), but good general advice is to echo the warning message: "you are advised to install an appropriate content management package."
Controlling write access
Thankfully, all the many issues related to restricting reading ability don’t apply to restricting writing ability. Unlike read restrictions, write restrictions in MediaWiki work quite well; and even if a security breach occurs, it can be easily undone. If you’re an administrator, you can restrict the writing of any particular page just via the "Protect" tab (or dropdown selection). In Figure 14.1, you can see an example of the interface shown after clicking on that tab/selection. As you can see, an administrator can set different protection levels for editing and moving pages, and they can set expirations on that protection.
Figure 14.1 “Protect page” interface
All of the access-control extensions also let you restrict write access. These generally provide a way to restrict all pages in a category and/or namespace to editing by one or more user groups. Of these, the safest choices at the moment seem to be the extensions Lockdown and SemanticACL; they both work with the most recent versions of MediaWiki. You can view them here:
https://www.mediawiki.org/wiki/Extension:Lockdown
https://www.mediawiki.org/wiki/Extension:SemanticACL
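To give a sense of what a namespace-wide write restriction looks like, here is a short sketch for Lockdown. It assumes the $wgNamespacePermissionLockdown setting described in that extension's documentation; the choice of namespace and group is just an example.

```php
// Only members of the 'sysop' group may edit pages in the project namespace.
// Everyone who can read the wiki can still read those pages.
$wgNamespacePermissionLockdown[NS_PROJECT]['edit'] = array( 'sysop' );
```

Because this restricts only the 'edit' action, it avoids the read-restriction pitfalls described in the previous section.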
In addition, you can practice a "nicer" form of write-restriction, by using one of the extensions that let you mark a certain revision of the page as approved; anyone can then modify the page further, but the newer edits won’t be displayed to users until they, too, are approved. We’ll cover the two extensions that allow that in the next section.
FlaggedRevs and Approved Revs
Running a wiki can be a frightening experience: you’re in charge of a set of documents that are meant to reflect some sort of official view of things, but sometimes many people, and sometimes everyone, can change anything on the wiki at any time. Which means that you can check the wiki in the morning and find out that a page about some software utility has, for the last four hours, held nothing but a string of obscenities, or some incorrect information about the software, or a nice recipe for chocolate mousse. Or perhaps you find that that bad edit has been in place for a week.
This fear tends to be overblown, because (a) with the (major) exception of spam, edits that are malicious or obviously incorrect are usually pretty rare, and (b) to the extent that users are reading the wiki and have editing power, they can usually be trusted to find and revert such changes on their own. Still, the threat is there; and beyond that, some administrators simply want to have final control over what gets displayed at all times. Editorial control, in many cases, can be a nice thing to have.
The natural solution to this problem is one that has its roots in software version control: having one or more versions of a wiki page that are certified as "stable" or approved, and then having the most recent stable version be the one shown by default to users. That way you don’t interfere with the process of wiki-editing, while at the same time ensuring a certain level of security for the content.
FlaggedRevs
This view of things has reached Wikipedia, where vandalism has always been a problem. The FlaggedRevs extension (sometimes also referred to as PendingChanges) was developed for that purpose. It was first used on the German-language Wikipedia in 2008, and it is currently used on some other language Wikipedias, though not yet on the English-language one.
FlaggedRevs can be used on any MediaWiki-based wiki, though it takes some effort to install and use it because it’s a substantial piece of software. It does more than simply enable setting a stable revision of a page: it provides a whole framework for users to review different revisions of a page according to different criteria, so that the decision about which revision(s) to go with can be made by consensus.
You can read more about FlaggedRevs here:
https://www.mediawiki.org/wiki/Extension:FlaggedRevs
Approved Revs
FlaggedRevs makes sense for Wikipedia, although it may be overkill for small-to-medium-sized wikis, where decisions can just be made by one or a few people without the need for a full, open discussion. In such a case, the Approved Revs extension may be the better solution.
Approved Revs is an extension that essentially was created to be a much simpler alternative to FlaggedRevs. It basically just lets administrators do one thing: select a single revision of a page as the approved one. When a user goes to a page that has an approved revision, that approved revision is what they will see by default (though they can still see any other revision if they go to the "history" page).
If a page has no approved revision, users will, by default, just see the latest revision; Approved Revs will have no impact. However, the wiki can be set to instead show a blank page if there’s no approved revision. This can be done by adding the following to LocalSettings.php:
$egApprovedRevsBlankIfUnapproved = true;
If normal users edit a page that already has an approved revision, their changes won’t show up on the default page until another approval happens. But by default, if anyone who has revision-approval permission edits a page, their edit (and thus, the latest revision of the page) will automatically be marked as approved. That usually makes sense, since such editors presumably wouldn’t make changes that they wouldn’t themselves authorize. However, you can change that default behavior by adding the following to LocalSettings.php:
$egApprovedRevsAutomaticApprovals = false;
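The revision-approval permission mentioned above is an ordinary MediaWiki user right, so it can be granted to groups other than administrators. The following sketch assumes the right is named 'approverevisions', as in the extension's documentation, and uses a hypothetical 'editor' group:

```php
// 'editor' is a hypothetical group name; 'approverevisions' is assumed
// to be the name of the right defined by Approved Revs.
$wgGroupPermissions['editor']['approverevisions'] = true;
```

Combined with automatic approvals, this means edits by members of that group become the approved revision as soon as they are saved.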
Besides protecting content, Approved Revs can also be used to turn MediaWiki into more of a publishing platform, or a traditional CMS, where “draft” versions of a wiki page exist before the page gets “published”. For this case, the $egApprovedRevsBlankIfUnapproved setting becomes quite useful. It’s different from standard publishing schemes because readers can still see all the draft versions, through the history page (although those can be hidden if necessary; see here), but the basic concept of a page that’s kept hidden until it’s reviewed and approved is there.
You can also set the group of namespaces for which Approved Revs is applied, via the $egApprovedRevsNamespaces variable. By default it comprises four namespaces: NS_MAIN (the main namespace), NS_HELP (help pages), NS_TEMPLATE (templates) and NS_PROJECT (the project namespace). And you can set Approved Revs to apply to specific individual pages, using the “__APPROVEDREVS__” behavior switch. This is best done via a template.
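For instance, to extend approval to user pages in addition to the default namespaces, something like the following could be added to LocalSettings.php. This sketch assumes that $egApprovedRevsNamespaces is a plain array of namespace constants, as in versions of the extension from around the time of writing; later versions may use a different variable or format.

```php
// Add the user namespace to the set of namespaces handled by Approved Revs.
$egApprovedRevsNamespaces[] = NS_USER;
```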
As an example, let’s say you only wanted approval to apply to the set of pages in the category “Essays”. You would first turn off Approved Revs in general, by adding the following to LocalSettings.php:
$egApprovedRevsNamespaces = array();
You would then create an infobox template, to be used for every "essay" page, that defines pages as being in the “Essays” category; and you would add to that template the “__APPROVEDREVS__” behavior switch, so that it was added automatically to every such page.
Approved Revs also defines a new special page, Special:ApprovedRevs, that provides several lists of pages: all pages that have an approved revision, all the ones that don’t, and all the ones whose approved revision is not their latest.
You can read more about Approved Revs on its homepage: