antiSPAMnews



Twitter As A Marketing Tool?
The SEO/SEM community loves Twitter. The term "loves" in this sense is used in the strongest possible manner. Guy Kawasaki loves Twitter also, but views it...

Recent Articles

Checking Outbound Links For Content Spam
I still love the linkfromdomain command on Live.com. Like MSN / Live (perhaps Kumo?!) search platform, it's often forgotten about. But actually it still has much value for SEO. Here are 3 quick reasons why. 1) Check...

Using CAPTCHA Images To Help Prevent Spam
My system administrator is telling me that I need to add a "capcha" [actually, it's "captcha"] system to my site so that I get less spam. What's a captcha system and why would I want it? Dave's Answer: Ah, spam, the...

Why Does Spam Effect Some People More?
I'm a German national who lives in the Philippines for more than 17 years. I use Gmail for a long time already and I really like it. No problems so far. Just a couple of days back, though, my brother in Germany who...

Can Spammers Take Advantage Of Redirects
Google is warning that spammers can take advantage of your site without even making use of your server! They do so by abusing open redirect URLs. In this case the spammers or the hackers take advantage of your...

AVG Gets Acknowledged For Excellence
AVG Technologies, an anti-virus and security software provider with over 80 million users in 167 countries, today announced that its Internet Security Network Edition (NE) solution has received the...


04.10.09

How To Detect Web Content Spam

By David Harry

For starters, what is web spam and what's its function? The patents we're looking at today describe spam as websites constructed with random or targeted content and links in order "to trick the analysis algorithms used by search engines" into ranking the pages higher than they should (bit of an oxymoron play there). The end game, of course, is to monetize said traffic with varying forms of advertising… yada yada… we know the deal. And the fly in the ointment?

"However, achieving this is complicated because it can be difficult to identify spam hosts without manually reviewing the content of each host and classifying it as a spam or non-spam host."

So what's a search engine to do? Welcome to the world of rare AIR (Adversarial Information Retrieval). Last time out CJ was walking us through some methods of Paid Link detection, and in the past we've covered link spam, phrase-based and temporal spam detection methods (to name a few). This time we're going to look at Host Level Spam Detection.

Here are the patents:

- System and method for identifying spam hosts using stacked graphical learning
- Method of detecting spam hosts based on propagating prediction labels
- Detecting spam hosts based on clustering the host graph

Note: They say the detection can be applied at the server (IP) level, the site level, or even the page level. This is worth bearing in mind whenever the term 'host' is used.

Host/Site Level Spam Detection
As with a lot of learning models these days, they start off with a seed set. In this case it is a set of hosts deemed spam or non-spam by a baseline classifier. Then, in one instance, they describe a random walk that is "modified in order to obtain a weighted or skewed characterization of the host". Essentially, it looks for linking anomalies common to spam hosts, which can then be used in weighting the results.
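To make the idea of a skewed random walk more concrete, here is a minimal Python sketch. It is not taken from the patents: it assumes the "modification" is a walk that restarts at the spam seeds (in the spirit of a personalized PageRank), and the function name, the `restart` value and the toy host graph are all made up for illustration.

```python
def biased_walk_scores(links, spam_seeds, restart=0.15, iterations=50):
    """Score hosts with a random walk that keeps restarting at spam seeds.

    links:      dict of host -> list of hosts it links to (toy data, assumed)
    spam_seeds: hosts a baseline classifier has already labelled as spam
    restart:    chance of jumping back to a seed at each step (assumed value)
    """
    hosts = set(links) | {t for targets in links.values() for t in targets}
    seeds = sorted(h for h in spam_seeds if h in hosts)
    if not seeds:
        raise ValueError("need at least one spam seed that appears in the graph")

    # All restart probability is concentrated on the seed hosts.
    restart_mass = {h: 0.0 for h in hosts}
    for seed in seeds:
        restart_mass[seed] = 1.0 / len(seeds)

    scores = dict(restart_mass)
    for _ in range(iterations):
        nxt = {h: restart * restart_mass[h] for h in hosts}
        for host in hosts:
            targets = links.get(host, [])
            if targets:
                # Split this host's score evenly across its out-links.
                share = (1.0 - restart) * scores[host] / len(targets)
                for target in targets:
                    nxt[target] += share
            else:
                # Hosts with no out-links hand their score back to the seeds.
                for seed in seeds:
                    nxt[seed] += (1.0 - restart) * scores[host] / len(seeds)
        scores = nxt
    return scores


# Hypothetical host graph: a small spam ring, one host it links to, two unrelated hosts.
toy_links = {
    "spam-a": ["spam-b", "spam-c"],
    "spam-b": ["spam-a"],
    "spam-c": ["spam-a", "victim"],
    "victim": [],
    "news": ["blog"],
    "blog": ["news"],
}
walk_scores = biased_walk_scores(toy_links, {"spam-a"})
for host, score in sorted(walk_scores.items(), key=lambda item: -item[1]):
    print(f"{host:8s} {score:.3f}")
```

Hosts that the walk can reach from the seed pick up a share of the score, while hosts with no link path from the seed stay at zero. A real system would tune the restart probability and the seed set rather than use the values assumed here.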


Now, using this modified random walk (RW) means they can either follow links from a known/classified spam host or use a probabilistic model to choose other hosts that match the profile of likely/known spam hosts. The working assumption is that spam hosts link to other spam hosts in a higher proportion than non-spam hosts do. They describe this scoring as a 'characterization value'. To get a better feel for Yahoo's evolved TrustRank, see last year's post on HarmonicRank.
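The assumption that spam links to spam can be turned into a simple number. The Python sketch below is only an illustration, not the patents' formula: for each host it averages the baseline spam probability of the hosts it links to, then blends that with the host's own baseline score. The function names, the 50/50 `weight` and the sample probabilities are all hypothetical.

```python
def neighbor_spam_fraction(links, baseline):
    """Average baseline spam probability over each host's out-links.

    Returns None for a host with no scored out-links.
    """
    fractions = {}
    for host, targets in links.items():
        known = [baseline[t] for t in targets if t in baseline]
        fractions[host] = sum(known) / len(known) if known else None
    return fractions


def combined_score(baseline, fractions, weight=0.5):
    """Blend a host's own score with its neighbours' average (assumed weighting)."""
    combined = {}
    for host, own in baseline.items():
        neighbours = fractions.get(host)
        if neighbours is None:
            combined[host] = own  # nothing to learn from the links
        else:
            combined[host] = (1.0 - weight) * own + weight * neighbours
    return combined


# Hypothetical baseline classifier output (probability that a host is spam).
demo_links = {"a": ["b", "c"], "b": ["c"], "c": [], "d": ["a"]}
demo_baseline = {"a": 0.2, "b": 0.9, "c": 0.7, "d": 0.1}

demo_fractions = neighbor_spam_fraction(demo_links, demo_baseline)
demo_combined = combined_score(demo_baseline, demo_fractions)
for host in sorted(demo_combined):
    print(host, round(demo_combined[host], 2))
```

In this toy data host "a" looks clean on its own, but it links only to high-scoring hosts, so its blended score rises. That is the sort of signal the linking assumption is meant to capture.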

From there they look at clusters of hosts based on how each host is linked to others. These clusters can then be analyzed to establish whether each one is a spam or non-spam cluster (of websites). The hosts in the cluster can then be reclassified based on the overall scoring. If a site within the cluster does not meet a minimum threshold of the cluster then its spam score stays the same.
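Here is one possible reading of that clustering step as a short Python sketch. Several things are assumptions rather than details from the patents: clusters are taken to be plain connected components of the link graph, the cluster score is a simple average, and the two cutoffs (0.7 and 0.3) are invented. Clusters whose average falls between the cutoffs leave their hosts' scores untouched.

```python
from collections import defaultdict


def connected_clusters(links):
    """Group hosts into clusters, treating every link as an undirected edge."""
    neighbours = defaultdict(set)
    for host, targets in links.items():
        neighbours[host]  # make sure hosts without links still get an entry
        for target in targets:
            neighbours[host].add(target)
            neighbours[target].add(host)

    seen, clusters = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        seen.add(start)
        stack, cluster = [start], set()
        while stack:
            host = stack.pop()
            cluster.add(host)
            for other in neighbours[host]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        clusters.append(cluster)
    return clusters


def reclassify(scores, clusters, spam_cutoff=0.7, clean_cutoff=0.3):
    """Pull hosts toward the cluster average when the cluster is clearly
    spam or clearly clean; otherwise keep each host's original score."""
    updated = dict(scores)
    for cluster in clusters:
        members = [h for h in cluster if h in scores]
        if not members:
            continue
        average = sum(scores[h] for h in members) / len(members)
        if average >= spam_cutoff or average <= clean_cutoff:
            for host in members:
                updated[host] = average
    return updated


# Hypothetical host graph and per-host spam scores.
cluster_links = {
    "s1": ["s2"], "s2": ["s3"], "s3": ["s1"],   # mostly spammy ring
    "n1": ["n2"], "n2": [],                     # mostly clean pair
    "m1": ["m2"], "m2": [],                     # mixed pair
}
host_scores = {"s1": 0.9, "s2": 0.8, "s3": 0.55,
               "n1": 0.1, "n2": 0.4,
               "m1": 0.2, "m2": 0.8}

found_clusters = connected_clusters(cluster_links)
new_scores = reclassify(host_scores, found_clusters)
for host in sorted(new_scores):
    print(host, round(new_scores[host], 2))
```

In the toy data, "s3" is borderline on its own but is pulled up by its spammy cluster, "n2" is pulled down by its clean one, and the mixed pair keeps its original scores.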

Essentially, by classifying and clustering spam hosts based on inter-linkage, a spamicity score can be calculated at the host and cluster level… think of it as PageRank for spam.

Note: They do mention classifications of spam sites based on links AND content, but I didn't find much on content spam detection outside of a brief section on hidden text and cloaking. This makes me believe there is another patent in this series we've not found.

Continue reading this article.


About the Author:
David Harry is the President of Reliable SEO and has been building and marketing websites since 1998. He can be found writing about search and internet marketing on the Fire Horse Trail and is the author of the SEO Handbook series.

http://www.reliable-seo.com
http://www.huomah.com
http://www.the-seo-handbook

About antiSPAMnews
News and updates for the fight against spam.



-- antiSPAMnews is an iEntry, Inc. publication --
iEntry, Inc. 2549 Richmond Rd. Lexington KY, 40509
2009 iEntry, Inc. All Rights Reserved | Privacy Policy | Legal | Contact


