04.10.09

How To Detect Web Content Spam

By David Harry

For starters, what is web spam and what's its function? The patents we're looking at today describe spam as websites constructed with random or targeted content and links in order "to trick the analysis algorithms used by search engines" into ranking the pages higher than they should (bit of an oxymoron play there). The end game, of course, is to monetize said traffic with varying forms of advertising… yada yada… we know the deal.

And the fly in the ointment?

"However, achieving this is complicated because it can be difficult to identify spam hosts without manually reviewing the content of each host and classifying it as a spam or non-spam host."

So what's a search engine to do? Welcome to the world of rare AIR (Adversarial Information Retrieval).

Last time out CJ was walking us through some methods of Paid Link detection, and in the past we've covered link spam, phrase based and temporal spam detection methods (to name a few). This time we're going to look at Host Level Spam Detection.

Here are the Patents
- System and method for identifying spam hosts using stacked graphical learning
- Method of detecting spam hosts based on propagating prediction labels
- Detecting spam hosts based on clustering the host graph

Note: they say the methods can be applied at a server (IP) level, a site level or even a page level. This is worth bearing in mind whenever the term 'host' is used.

Host/Site Level Spam Detection

As with a lot of learning models these days, they start off with a seed set. In this case it is a set of hosts deemed spam or non-spam by a baseline classifier. Then, in one instance, they describe a random walk that is "modified in order to obtain a weighted or skewed characterization of the host". Essentially it would look for linking anomalies common to spam hosts, which can then be used in weighting the results.
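To make the idea of a skewed random walk a little more concrete, here is a minimal sketch. The article does not spell out how the patents modify the walk, so this swaps in a well-known stand-in: a random walk with restart, where the walker follows outgoing links and periodically jumps back to the seed spam hosts. The function name `spam_characterization`, the `damping` and `iterations` parameters, and the example host names are all assumptions made for illustration, not anything taken from the patents.

```python
def spam_characterization(links, seed_spam, damping=0.85, iterations=50):
    """Random walk with restart, biased toward seed spam hosts.

    links:     dict mapping a host to the list of hosts it links to.
    seed_spam: set of hosts labelled spam by some baseline classifier.
    Returns a dict host -> score; the scores sum to 1.
    NOTE: an illustrative stand-in, not the patented method.
    """
    # Collect every host that appears as a source or a target.
    hosts = set(links)
    for targets in links.values():
        hosts.update(targets)
    hosts = sorted(hosts)

    seeds = [h for h in hosts if h in seed_spam]
    if not seeds:
        raise ValueError("need at least one seed spam host in the graph")

    # The restart distribution puts all of its mass on the seed spam hosts.
    restart = {h: (1.0 / len(seeds) if h in seed_spam else 0.0) for h in hosts}
    score = dict(restart)

    for _ in range(iterations):
        # With probability (1 - damping) the walker jumps back to a seed.
        nxt = {h: (1.0 - damping) * restart[h] for h in hosts}
        for h in hosts:
            out = links.get(h, [])
            if out:
                # Otherwise it follows one of the host's outgoing links.
                share = damping * score[h] / len(out)
                for target in out:
                    nxt[target] += share
            else:
                # A host with no outgoing links sends the walker back to the seeds.
                share = damping * score[h] / len(seeds)
                for s in seeds:
                    nxt[s] += share
        score = nxt
    return score


# Hypothetical host graph: three interlinked spam hosts and two unrelated hosts.
example_links = {
    "spam-a.example": ["spam-b.example", "spam-c.example"],
    "spam-b.example": ["spam-a.example", "spam-c.example"],
    "spam-c.example": ["spam-a.example"],
    "news.example": ["blog.example"],
    "blog.example": ["news.example"],
}
example_scores = spam_characterization(example_links, {"spam-a.example"})
for host, value in sorted(example_scores.items(), key=lambda kv: -kv[1]):
    print(f"{host:16s} {value:.4f}")
```

In this toy graph the hosts that the seed links into pick up score, while the two hosts that no spam host links to stay at zero. A real system would combine a score like this with the baseline classifier's output rather than use it on its own.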
Using this modified random walk (RW), they can either follow links from a known/classified spam host or use a probabilistic model to choose other hosts that fit the profile of likely or known spam hosts. Essentially, it is a given that spam hosts link to other spam hosts in a higher proportion than non-spam hosts do. They describe this scoring as a 'characterization value'. To get a better feel for Yahoo's evolved TrustRank, see last year's post on HarmonicRank.

From there they look at clusters of hosts based on how each host is linked to others. These clusters can then be analyzed to establish whether each one is a spam or non-spam cluster (of websites). The hosts in a cluster can then be reclassified based on the cluster's overall scoring. If a site within the cluster does not meet a minimum threshold for the cluster, then its spam score stays the same.

Essentially, by classifying and clustering spam hosts based on inter-linkage, a spamicity score can be calculated at the host and cluster level… think of it as PageRank for spam.

Note: they do mention classifications of spam sites based on links AND content, but I didn't find much on content spam detection outside of a brief section on hidden text and cloaking. This makes me believe there is another patent in this series that we've not found.

About the Author: David Harry is the President of Reliable SEO and has been building and marketing websites since 1998. He can be found writing about search and internet marketing on the Fire Horse Trail and is the author of the SEO Handbook series.

http://www.reliable-seo.com
http://www.huomah.com
http://www.the-seo-handbook
-- antiSPAMnews is an iEntry, Inc. publication --
iEntry, Inc., 2549 Richmond Rd., Lexington KY, 40509
2009 iEntry, Inc. All Rights Reserved