Website Duplicate Content – A Search Engine’s View

by Maura Stouffer

Duplicate Content Detection, the process of detecting and scoring the correlation between two or more documents on the Web, is one of the more complex problems that Search Engines solve. Webmasters tend to overlook this issue because of its complex nature. In this article, I will briefly discuss how a Search Engine views Duplicate Content, and explain how you can take steps toward eliminating Duplicate Content issues, thus elevating your Webpages in the Search Engine rankings!

How Do Search Engines Detect Duplicate Content?

Search Engines use various methods to detect Duplicate Content. Basic approaches include the so-called Levenshtein distance, or “edit distance”: the number of changes needed to turn one document into the other. More complex approaches identify and eliminate common duplicate elements (Navigation Bars, Footers, and Headers) before running a vector-based comparison between two or more documents. On a large scale, sorting documents into similar groupings and comparing those groupings tends to provide similar results.
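
To make the “edit distance” idea concrete, here is a minimal sketch in Python. It is only an illustration of the textbook Levenshtein calculation, not how any particular Search Engine implements it, and the helper name edit_similarity and its 0-to-1 scaling are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the rows short
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Hypothetical helper: scale the distance to 0..1 (1.0 = identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

Because the running time grows with the product of the two document lengths, a character-level comparison like this is only practical for small sets of Webpages, which is one reason the more complex approaches exist.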

Are Duplicate Content Penalties Fair?

From a Webmaster or SEO professional’s point of view, Duplicate Content is typically seen as something that is “unfairly” penalized; after all, Websites rarely contain duplicate Webpages on purpose. For a Search Engine, when two or more Webpages are identified as “too similar”, one or more of those Webpages usually disappear from the Search Engine Results Pages (SERPs), because duplicate information is not beneficial to the Search Engine user. This often leaves Webmasters asking: what went wrong? Unfortunately, the typical corrective approach to a Webpage’s disappearance tends to be overly simplistic and inadequate.

How Can I Easily Correct Duplicate Content Penalties?

Fortunately, there are tools out there that can help to identify duplicate content within a Website and assist in restoring the Webpage in the SERPs.

Here are some important factors to consider when selecting a tool:

Choose tools that report the exact percentage of correlation between large sets of documents, and that allow you to compare each Webpage to every other Webpage in the Website. Stay away from tools that only allow you to compare two Webpages at a time.
- TIP: Correlation is best done in the context of the larger picture of the Website as a whole, as most Search Engines process at a macro scale and not a micro scale.
Use a tool that can quickly re-analyze the entire grouping of Webpages as a whole after you make the changes.
Test tools with your own set of test Webpages. Create a bunch of similar webpages and play around with content to see how the tool’s algorithm works. This will let you see the limitations of the tool.
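
As a rough illustration of what a site-wide comparison involves, the sketch below compares every Webpage with every other Webpage and reports the pairs above a cutoff. The Jaccard word-overlap measure, the 80 percent threshold, and the names jaccard_percent and site_report are all assumptions chosen for this example; a real tool will use its own scoring.

```python
import re
from itertools import combinations


def words(text):
    """Lower-case the text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_percent(a, b):
    """Shared words divided by all words across both texts, as a percentage."""
    wa, wb = words(a), words(b)
    if not wa and not wb:
        return 100.0  # two empty pages are treated as identical
    return 100.0 * len(wa & wb) / len(wa | wb)


def site_report(pages, threshold=80.0):
    """Compare every page in `pages` (a dict of URL -> text) with every
    other page and return (url_a, url_b, percent) for each pair at or
    above the threshold, most similar first."""
    flagged = []
    for (url_a, text_a), (url_b, text_b) in combinations(pages.items(), 2):
        score = jaccard_percent(text_a, text_b)
        if score >= threshold:
            flagged.append((url_a, url_b, round(score, 1)))
    return sorted(flagged, key=lambda row: row[2], reverse=True)
```

Re-running site_report on the whole set after each round of edits mirrors the advice above: you see how a change to one Webpage shifts its correlation with all of the others, not just with one.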

Here are some simple things to keep in mind:

Do not write additional content and place it in a common area that is shared. Examples of this pitfall are content placed in the Footer, Header, or Navigation Bar of a Website. Shared content makes your Webpages look more alike, which only narrows your margin of error for Duplicate Content penalties.
Avoid simply changing words to their synonym counterparts. Search Engines solved this issue long ago, and the word “bike” is typically counted as a duplicate of “bicycle”.
Simply re-arranging words will not help either. Search Engines calculate Duplicate Content with a vector-based approach – that is, when the calculation is executed it is usually done without respect to the ordering of words.
Make sure each Webpage has content! Less content on each Webpage narrows your margin of error because there are fewer words to compare. Common areas like the Header, Footer, and Navigation Bar naturally add to the correlation between Webpages, so more unique content is needed to offset them.
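
Two of the points above, that re-arranging words does not help and that thin Webpages are dominated by their shared areas, can be seen in a small bag-of-words sketch. This is a simplified stand-in for a vector-based comparison; the sample text and the function names are invented for illustration, and real Search Engines will weight and filter terms in ways not shown here.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Count how often each word occurs; word order is discarded."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    va, vb = term_vector(a), term_vector(b)
    dot = sum(count * vb[term] for term, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty page has nothing to compare
    return dot / (norm_a * norm_b)


# Re-arranged words produce the same vector, so the score stays at 1.0
# (up to floating-point rounding).
reordered = cosine_similarity("cheap red bike for sale",
                              "sale for bike red cheap")

# Hypothetical shared Navigation Bar and Footer text.
shared = "home products about contact copyright "

# Thin pages: the shared words make up most of each page.
thin = cosine_similarity(shared + "red bike", shared + "blue kayak")

# Richer pages: the same shared words, but more unique content on each.
rich = cosine_similarity(
    shared + "red bike steel frame disc brakes trail tested warranty included",
    shared + "blue kayak plastic hull foam seat river ready paddle supplied",
)
```

In this sketch, thin comes out higher than rich even though the shared text is identical in both cases, because the unique words carry more of the weight once there are more of them.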

How Serious Is Duplicate Content?

There is an even more insidious problem with Duplicate Content than the wrong Webpage being shown in the SERPs, or penalties being assessed on a Website-level. In my opinion, Link Loss is the major reason why you should avoid Duplicate Content within a Website.

Link Loss is simply the loss of Link Flow™ within a given set of Webpages (typically within a given Website or set of sub-domains). When hundreds of duplicate Webpages exist, hundreds of Link Flow “black holes” are created.

For example: If you have 100 Webpages that are too similar, Search Engines will typically show the Webpage with the highest Total Link Flow in their keyword results. The other 99 Webpages are relegated to the “scrap heap” – in other words, the other 99 Webpages will never see the light of day on the SERPs. But where does all of this scrapped Total Link Flow go? The answer is NOWHERE! The Total Link Flow of the other 99 Webpages (which mathematically could be well more than the Total Link Flow of the 1 Webpage being shown at the top) is essentially wasted.
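
The arithmetic behind that claim is easy to check. The sketch below uses made-up numbers, since Total Link Flow values are not given in this article: one Webpage with a score of 10 and 99 duplicates with a score of 1 each. The function name wasted_link_flow is likewise only for illustration.

```python
def wasted_link_flow(flows):
    """Given hypothetical Total Link Flow values for a group of duplicate
    Webpages, return (flow of the page shown, combined flow of the rest)."""
    shown = max(flows)            # the page with the highest flow is shown
    wasted = sum(flows) - shown   # everything else is on the scrap heap
    return shown, wasted


# Illustrative values only: 1 strong page and 99 weak duplicates.
flows = [10] + [1] * 99
shown, wasted = wasted_link_flow(flows)
```

With these assumed values, the 99 hidden Webpages together hold far more Link Flow than the single Webpage that is actually shown.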

So watch your Duplicate Content, not from a human’s perspective, but from a Search Engine’s perspective. After all, it is a Search Engine that ultimately looks at your content, and if it doesn’t like what it sees, neither will you!

Maura Stouffer is co-Founder and President, Marketing & Operations for SEOENG®, a Search Engine that discloses its ranking algorithms and provides tools like the new Duplicate Content Detection feature to improve your Website’s SEO. This article is reprinted with permission. Originally posted at Addme.com at http://www.addme.com/newsletters/website-duplicate-content-search-engine.htm.