Different approaches to fight WikiSpam

Wikispam is getting more and more annoying. Wiki pages get high rankings in search engines because of the strong linking between pages (and between wikis via InterWiki links). This makes them a valuable target for raising the ranking of other pages.

Some of these ideas are categorized as Helpful (improves the experience of well-intentioned users), Transparent (well-intentioned users never even notice), or Annoying (but perhaps less annoying than the spam it blocks).

Current Anti-spam Features

TextChas

Textual puzzle questions appearing at the top of the page. The idea is to prove that a user is a human, and not a bot, before they are allowed to save an edit.

See TextCha page for details, and also HelpOnTextChas
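
For reference, TextChas live in the wiki configuration as a mapping from language to question/answer pairs; the following is only a sketch (check HelpOnTextChas for the exact option names and defaults):

# wikiconfig.py -- rough TextCha sketch; see HelpOnTextChas for the exact options
textchas = {
    'en': {
        u"How many legs does a spider have?": u"8",
        u"What is the name of this wiki engine?": u"MoinMoin",
    },
}
# Members of this (assumed) group are trusted and skip the TextCha entirely.
textchas_disabled_group = u"TrustedGroup"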

SurgeProtection

A limit to the rate with which a user is allowed to edit pages, since bots often try to edit rapid-fire.

See SurgeProtection page for details
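
Surge protection is configured per action as a (request count, time window) pair; the numbers below are illustrative, not the shipped defaults -- see the SurgeProtection page:

# wikiconfig.py -- sketch of rate limits: action -> (max requests, per seconds)
surge_action_limits = {
    'edit': (10, 120),   # at most 10 edit requests per 2 minutes
    'all':  (30, 30),    # everything else: 30 requests per 30 seconds
}
surge_lockout_time = 3600    # seconds an offender stays locked out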

'BadContent' list

You can ban certain content within contributions by listing RegularExpressions on your 'BadContent' page.

If any of these regular expressions produces a match when a user edits a page, that page cannot be saved. The only solution for the user is to remove the offending links from the page before saving.

This feature is more effective at blocking WikiSpam if the list is kept up to date (blocking known spammers before they even reach your wiki). You can manually update your list by taking entries from our BadContent page. Alternatively, you can try the AntiSpamGlobalSolution, which automates this process.
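
Conceptually the check is simple: every expression on BadContent is matched against the submitted text, and a single hit blocks the save. A minimal stand-alone sketch of that logic (not MoinMoin's actual code):

import re

def find_bad_content(new_text, bad_patterns):
    """Return the first BadContent pattern that matches the submitted text, or None."""
    for pattern in bad_patterns:
        if re.search(pattern, new_text, re.IGNORECASE):
            return pattern
    return None

# Example entries as they might appear on a BadContent page.
bad_patterns = [r"casino-online", r"cheap\s+viagra", r"\.texas-holdem-[a-z]+\.com"]
hit = find_bad_content("Buy cheap   viagra now!", bad_patterns)
if hit:
    print("Edit rejected, matched BadContent entry: %s" % hit)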

Problems


Ideas

Please keep discussion in this section, and describe/document all ACTUAL FEATURES above

General considerations

"Spam is only useful, if the bot reads the edited page and gets the link." (But spam is still annoying to readers of my wiki, whether or not it has any effect on Google PageRank.)

Such as

Counterarguments:

This also means that any unmodified external links already present in the last version of the wiki page should be preserved when the user submits; a diff against the last version may be required.

The server could validate the markup rule by decrypting a server-encrypted value carried in a hidden form control.
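
One way to implement the "preserve existing links" idea is to extract the external URLs from both the stored revision and the submitted text and only scrutinize the difference. A rough stand-alone sketch (the regex and function name are made up for illustration):

import re

URL_RE = re.compile(r"https?://[^\s<>\"']+")

def new_external_links(old_text, new_text):
    """External links present in the submitted text but not in the last saved revision."""
    return set(URL_RE.findall(new_text)) - set(URL_RE.findall(old_text))

old = "See http://moinmo.in for details."
new = "See http://moinmo.in for details. Also http://spam.example/casino"
print(new_external_links(old, new))   # only the newly added link needs scrutiny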

Allow Humans, discriminate against Scripts

High-volume spamming (hitting hundreds or thousands of wikis) requires scripts. Disallowing editing by scripts would limit the number of wikis spammers can target.

Nevertheless, spam from humans is annoying too; anti-script measures alone will surely not be enough.

Set up a HoneypotPage to detect spambots.
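
The honeypot could be as simple as a page no human is ever asked to edit (linked in a way only robots follow); any edit to it marks the source IP as a bot. A hypothetical sketch with made-up page and file names:

HONEYPOT_PAGES = {"DoNotEditThisPage"}   # hypothetical honeypot page name

def record_honeypot_hit(pagename, remote_ip, banlist_path="honeypot_banned_ips.txt"):
    """If someone edits a honeypot page, remember their IP so later edits can be refused."""
    if pagename in HONEYPOT_PAGES:
        with open(banlist_path, "a") as banlist:
            banlist.write(remote_ip + "\n")
        return True     # caller should reject this edit and future ones from that IP
    return False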

Identify Human

Detecting spam by content

Problems:

Usually transparent.

Detecting spam by content redundancy

Black Lists -- detecting spam by source

If you want to block only a few addresses, there is also a configuration method to block IPs and subnets, like this:

hosts_deny = ['61.55.22.51', '220.188.',]

in moin_config.py

Distributed Blacklist

These will not give spammers any ranking at Google.

Who is spamming? Detecting spam by source: SpamAssassin for email and wiki.

Could we potentially use SpamAssassin to block spam? I suppose the spammers' mail accounts are often on the same subnets as the computers they use. This could be especially effective if your wiki is far from international and there are edits from China. I think spam is generally the same, mail or wiki, so I would suggest gathering information about the different IP blocks it is distributed from. We should also be able to exclude known-good IPs from a blocked range. See also BlogSpamAssassin.
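
If a local SpamAssassin daemon is available, an edit could be piped through spamc -c, which prints a score/threshold pair and exits non-zero for spam; whether rules trained on mail transfer well to wiki text is an open question. A hedged sketch:

import subprocess

def spamassassin_score(text):
    """Pipe text through a local spamd via `spamc -c`; returns (score, threshold)."""
    proc = subprocess.run(["spamc", "-c"], input=text.encode("utf-8"),
                          capture_output=True)
    score, threshold = proc.stdout.decode().strip().split("/")   # e.g. "5.2/5.0"
    return float(score), float(threshold)

score, threshold = spamassassin_score("Buy cheap pills at http://spam.example/")
if score >= threshold:
    print("Edit looks like spam (%.1f >= %.1f); hold it for review" % (score, threshold))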

Staging Revisions

A system I've seen in use on other wikis "stages" anonymous edits so they don't get seen until either a known user verifies the page (http://wikifeatures.wiki.taoriver.net/moin.cgi/StagedCommits) or a certain time period has elapsed (http://wikifeatures.wiki.taoriver.net/moin.cgi/DelayedCommits). An advantage of this approach is that spammed pages won't get seen by a search engine. A disadvantage is that it requires registered users to "moderate" RecentChanges, and it complicates the whole edit-view-edit cycle. There is more on this topic at WardsWiki, MeatBall and ProWiki -- I'll dig up some specific links when I'm near my offline wiki.

A hybrid of this approach would let most edits through directly; only edits that are (a) anonymous and (b) fit a certain spam profile would get tagged as "needs-review".

Of course, grafting this onto any wiki engine is going to be ugly and it's a flagrant violation of DoTheSimplestThingThatCouldPossiblyWork ...

DelayedCommits is generally transparent to most users.

The ApproveChanges action (and associated event handler) implements staged revisions using an approval queue as a subpage of the affected page.

Easier Restoration

Make it useless

Spam is only useful if the Google bot reads the edited page and gets the link. Instead of identifying the spam bot, identify Google and mask out all links that are not at least x days old. Alternatively, create a robots.txt that disallows Google on recently edited pages. Then trust the WikiGnomes to correct the page within those x days. Take additional measures to identify/ban spammers. If spamming your wiki is useless, the spammer will not take the effort.
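
One way to read the masking idea: when rendering for a crawler, hide or de-weight (rel="nofollow") external links on pages edited less than x days ago. The function and parameter names below are illustrative, not MoinMoin's actual rendering hooks:

import time

NOFOLLOW_AGE_DAYS = 14   # the "x days" threshold; illustrative

def render_external_link(url, label, page_mtime, is_crawler):
    """Emit rel="nofollow" for crawlers while the page is still freshly edited."""
    age_days = (time.time() - page_mtime) / 86400.0
    if is_crawler and age_days < NOFOLLOW_AGE_DAYS:
        return '<a rel="nofollow" href="%s">%s</a>' % (url, label)
    return '<a href="%s">%s</a>' % (url, label)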

Google has announced a "nofollow" attribute for links to prevent comment spam in blogs. This could be used for any external link in a wiki, too -- it should be easy to implement and quite effective.

I'm afraid that it's hard to implement, and won't be effective. A wiki is one big comment -- we can't add the "nofollow" attribute to all wiki links, as this would kill the good links in the wiki. Using this feature requires that we know how to catch the spammers' links, and if we know how to do that, we can simply prevent their edits, as AntiSpam does. It's good that search engines are trying to fight this problem, though. -- NirSoffer 2005-01-19 15:18:03

What about using an editable whitelist? InterWiki links are on the whitelist by default, of course. Most wikis don't have that many external links, and most of those links do not rely on getting Google rank. We could even offer a "prove this link" icon in front of new links. With ACLs it should be easy to implement a TrustedEditorsGroup which is allowed to do this.

I think that's an excellent idea, and very similar to what I was thinking about this problem. InterWiki and IntraWiki links would be exempt, and the rest would have to be added to a whitelist before their nofollow attributes were dropped. -- Omnifarious 2005-05-16 19:54:15

Slightly annoying.

The HelpOnRobots page describes how "nofollow" is applied to links for different kinds of user agents.

Report them

Send abuse reports to spammer's ISPs.

Deny their ACL rights

Since most spam is of the Pagerank variety, would it be beneficial to add an "externallink" ACL right that would permit or deny the addition of external links to pages? E.g. the ACL

 #acl Known:read,write,externallink All:read,write

would let all users edit pages, but only let known users add external links.

Usually transparent (since most edits don't add external links).

Implement AKISMET

Akismet (Automattic Kismet) is a collaborative anti-spam effort implemented in blogging software such as WordPress 2.0, where it has been extremely effective, making comment and trackback spam almost a non-issue. Assuming that wiki spam is similar (if not identical) to blog spam, implementing Akismet functionality in MoinMoin could almost eliminate wiki spam. Sharing spam detection results with other platforms (such as the large install base of WordPress users) would greatly benefit spam prevention in MoinMoin installations.
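
Akismet exposes a small HTTP API, so a wiki could submit each edit together with the editor's IP and user agent and get back a spam verdict. A minimal sketch against the public comment-check endpoint; the API key and blog URL are placeholders, and the requests library is assumed to be available:

import requests   # third-party HTTP library

API_KEY = "your-akismet-api-key"          # placeholder
BLOG_URL = "http://wiki.example.org/"     # placeholder

def akismet_is_spam(content, user_ip, user_agent):
    """Ask Akismet's comment-check endpoint whether an edit looks like spam."""
    resp = requests.post(
        "https://%s.rest.akismet.com/1.1/comment-check" % API_KEY,
        data={
            "blog": BLOG_URL,
            "user_ip": user_ip,
            "user_agent": user_agent,
            "comment_type": "wiki-edit",
            "comment_content": content,
        },
    )
    return resp.text.strip() == "true"    # Akismet answers "true" (spam) or "false"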

Detecting spam bots

Ned Batchelder has an interesting spam-bot prevention technique using hashed tickets, hashed field names, and invisible form elements. See http://www.nedbatchelder.com/text/stopbots.html.
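
The core of that technique is a server-signed ticket embedded in the edit form, plus an invisible field that humans leave empty; bots that replay old forms or fill in every field fail the check. A simplified sketch, not Batchelder's exact scheme:

import hashlib
import hmac
import time

SECRET = b"server-side secret"   # placeholder; never exposed in the page source

def make_ticket(now=None):
    """Ticket = timestamp + HMAC(timestamp); embedded as a hidden form field."""
    ts = str(int(now or time.time()))
    sig = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
    return "%s.%s" % (ts, sig)

def check_form(ticket, honeypot_value, max_age=3600):
    """Valid only if the signature matches, the ticket is fresh, and the invisible field is empty."""
    try:
        ts, sig = ticket.split(".")
    except ValueError:
        return False
    expected = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
    fresh = time.time() - int(ts) < max_age
    return hmac.compare_digest(sig, expected) and fresh and honeypot_value == ""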

Don't advertise orphans

Most spam I've seen targets orphan pages.

Disallow: /OrphanedPages

in robots.txt should stop these pages from being indexed.

Discussion

anti-spam proposal

  1. The first step to stopping spam is to enable users to delete spam pages. We could add some kind of checkbox-based batch delete for pages in RecentChanges. This would allow a user to delete 30 newly created spam pages at once.

  2. A SpamAssassin-like program, combined with a delete/revert comment of "SPAM":

    1. If a page is deleted and the comment says SPAM, the username that created the page is disabled and marked as a spammer (and cannot be re-enabled).

    2. The page information should be submitted to the SpamAssassin-like program, which would enable the following:

      1. Run regexes over the page content and put the results into a database. (I guess score-based solutions could be implemented.)
      2. Based on the score, existing pages can be deleted (needs some administrative page).
      3. New pages cannot be saved if the SpamAssassin-like program gives them too high a score.

    3. An RBL solution (this should stop 85% of the spam, as it does for email spam); see the DNSBL lookup sketch after this list.
      1. The way it works: the editor's/saver's IP address is checked against an RBL anti-spam server. (If it is listed, they cannot edit/save the page.)
      2. The IP address should be submitted to an RBL moin anti-spam server if the page was deleted or reverted with the comment SPAM; after some threshold is passed, the server adds the IP to its list.
      3. The RBL anti-spam server should be an option to enable, and it would work great. (I wonder if moin spammers also do email spam; this could mean we could use existing databases.)
  3. Captcha when creating new usernames (this is in 1.6 already)
  4. kitten auth, Asirra, by Microsoft.
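
For the RBL idea above, a DNS blacklist lookup is just a DNS query for the reversed octets under the blacklist zone; the zone below is a well-known public DNSBL used purely as an example, not a dedicated moin anti-spam server:

import socket

def listed_in_dnsbl(ip, zone="zen.spamhaus.org"):
    """True if the IPv4 address is listed in the given DNS blacklist."""
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        socket.gethostbyname(query)   # any A record means "listed"
        return True
    except socket.gaierror:
        return False

if listed_in_dnsbl("127.0.0.2"):      # 127.0.0.2 is the conventional DNSBL test entry
    print("Editor's IP is on the blacklist; refuse the save.")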

Cannot annoy users

I personally think that we cannot deploy an anti-spam system that annoys even anonymous users on every edit without damaging the wiki idea as a whole. So I would prefer a solution that tries to detect spam and bothers the user only if the wiki suspects an edit of being spam. -- FlorianFesti 2004-08-18 17:11:50

I wrote up a proposal for a HoneypotPage feature over at WikiFeatures. I'll copy it here and see what you guys think. - GoofRider 2005-05-02

Spam prevention is next to useless

This is an ever-escalating battle with no end in sight. Why not focus on ways to help undo the damage instead? The Despam action is a good start but badly underdeveloped. If the Despam action also deleted the user, it would go a long way toward helping manage spam. Right now the process for removing a user account borders on making your eyes bleed. Hint: if it's easier to ssh into a wiki, change to an obscure directory, grep for some text and manually delete files than it is to use the web interface, then there is something seriously wrong with your GUI ;-)

(!) As you can see from the recent amount of spam (0) in this (moin 1.6) wiki, it is not useless. But you may be right about the endless battle; we'll see when the spammers adapt. I recently made some improvements to Despam in 1.5 and I'll commit and port them to 1.6/1.7 soon. If you want to help improve the UI, feel welcome. :) -- ThomasWaldmann 2008-01-18 19:47:55

How does BadContent work?

I'm more of a MediaWiki expert, but I was trying to help sort out a spam problem on this wiki by adding a few regexps to the top of the BadContent page: http://wiki.freemap.in/moin.cgi/BadContent I was quite surprised that I was able to add to that page (as a normal user, not a sysop). Actually that seemed quite nifty, but to my disappointment, it doesn't seem to be picking up my entries. For example, I simply added "warcraft", but still the warcraft spammer comes back. Maybe the BadContent feature isn't activated, but then why would that page be populated with regexes?

Also, do you guys have any other tips for ways in which normal users can help fight spam on a MoinMoin wiki? I tried reverting the spam manually, but it takes a while because of the (anti-spam) edit throttling!


CategorySpam
