Search engines dislike duplicate content for a few reasons. One is that major search engines such as Google, Yahoo, MSN, and Ask aim to provide searchers with a diverse cross-section of unique content, and duplicate content often results in duplicate listings that impair the searcher’s experience. Another reason is that search engines don’t want to spend the resources (bandwidth) on indexing pages that are very similar.
In some instances, pages containing duplicate content are filtered at the time search engine results are sorted, so there is no guarantee as to which version of a page will appear in results and which won’t. Duplicate content may even hinder some sites and web pages from getting indexed by search engines, and there are some cases in which a search engine crawler will stop indexing all of the pages of a site because it finds too many copies of the same pages under different URLs.
While content duplication is sometimes used in an attempt to manipulate search engine rankings and garner more website traffic, in most cases it occurs without ill intent on the part of the site owner or webmaster. The following is a list of duplicate content scenarios that could be burdening your site.
Scenario #1: Ecommerce sites that include product descriptions from manufacturers, producers, and publishers
Product distribution websites often use text from the manufacturer or producer of the product as the description for the item on their own pages. Because the product name, creator, manufacturer, writer, or recording artist also appears on the page, a considerable amount of content ends up duplicated across pages on different websites. Here are some examples:
http://www.amazon.com/Sony-VGN-TXN15P-B-Notebook-Processor/dp/B000J43MR0
http://www.crowdstorm.com/Sony_VAIO_11_1_Widescreen_Notebook_PC_VGN_TXN15P_B+2973.html
http://www.clearanceclub.com/products/6495-VAIO-VGN-TXN15P-B
http://www.provantage.com/sony-vgntxn15p-b~7SONN0UX.htm
Scenario #2: Printer-friendly pages
Many sites offer “printer friendly” versions of their content on different pages. Without the application of robots.txt disallow statements or meta “noindex” tags on these pages to keep search engines from indexing them, they may be indexed as duplicate content. See these samples:
http://www.constructionbook.com/xq/ASP/productid.5395/qx/printable_view_product.htm
http://www.tigerdirect.com/applications/searchtools/item-details-print.asp?EdpNo=1556143&Sku=H24-PX849%20SB
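As a rough illustration of the robots.txt approach, the following Python sketch checks whether a set of disallow rules would keep a crawler away from a printer-friendly page. The rules, the /print/ directory, and the example.com URLs are assumptions made up for this example; they are not taken from the sites listed above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules: keep all crawlers out of a /print/ directory
# that holds the printer-friendly copies of each page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""


def is_crawlable(url, robots_txt=ROBOTS_TXT, agent="ExampleBot"):
    """Return True if the given robots.txt rules let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # The printer-friendly copy is blocked; the regular page is not.
    print(is_crawlable("http://www.example.com/print/article-42.html"))
    print(is_crawlable("http://www.example.com/articles/article-42.html"))
```

The meta tag alternative needs no rule file: a printer-friendly page can carry `<meta name="robots" content="noindex">` in its head section instead.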
Scenario #3: Websites that create session IDs
A session ID lets you create customized applications for a more personalized user experience, thus increasing the appeal of your website. A visitor to your site would be assigned a unique session ID which is either stored in a cookie on the user side or is propagated in the URL.
Websites that propagate session IDs in their URLs carry tracking information from page to page as visitors move through the site. When search engine crawlers encounter this tracking information, they may index the same page several times under different URLs. A good example of this is www.staples.com.
Search engine guidelines advise you to allow bots or spiders to crawl your sites without session IDs that track their path through the site. While this technique is great for tracking individual user behavior, the access pattern of bots is entirely different. Since bots cannot always decipher URLs that look different but point to the same page, the use of session IDs may result in incomplete indexing of your site.
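One way to picture the fix is a small routine that removes the session parameter from a URL before it is handed to a bot or written into a link. This is a minimal Python sketch, not a description of how any particular site does it: the parameter names in SESSION_PARAMS and the example.com URLs are assumptions, and it only handles session IDs carried in the query string.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed session parameter names; replace with whatever your platform uses.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def strip_session_id(url):
    """Return `url` with any session-tracking query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    # Two visits to the same page with different session IDs...
    first = strip_session_id("http://www.example.com/product?id=7&sid=abc123")
    second = strip_session_id("http://www.example.com/product?id=7&sid=zzz999")
    # ...collapse to one URL once the session ID is removed.
    print(first == second, first)
```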
Scenario #4: URLs that include multiple data variables
When a URL carries multiple data variables, bots may crawl and index the same page under several different URLs. Here are some examples of sites that show different data variables in their URLs.
http://www.homedepot.com/webapp/wcs/stores/servlet/ProductDisplay?storeId=10051&langId=-1&catalogId=10053&productId=100022126&categoryID=502813
storeId=10051
catalogId=10053
productId=100022126
categoryID=5028
http://www1.macys.com/catalog/index.ognc?CategoryID=30977&PageID=30977*1*24*-1*-1&kw=Hugo%20Boss&LinkType=EverGreen
CategoryID=30977
PageID=30977
LinkType=EverGreen
URLs like those listed above are difficult for a search engine bot or spider to crawl. If this scenario applies to your website, you may want to use URL rewriting (for example, Apache’s mod_rewrite module) to serve simpler URLs.
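Whatever mechanism you use on the server, the underlying idea is to map every variant of a URL to a single form. The Python sketch below illustrates that idea under stated assumptions: you decide which parameters actually determine the page content (here storeId and productId, chosen only for illustration), everything else is dropped, and the remaining parameters are sorted so that their order no longer matters. The example.com URLs are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url, content_params):
    """Keep only the parameters in `content_params`, sorted by name."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = sorted((name, value) for name, value in pairs if name in content_params)
    # Drop the fragment too, since it never reaches the server.
    return urlunsplit(parts._replace(query=urlencode(kept), fragment=""))


if __name__ == "__main__":
    # Assumption for this example: only these two parameters select the page.
    content = {"storeId", "productId"}
    a = "http://www.example.com/ProductDisplay?storeId=10051&langId=-1&productId=100022126"
    b = "http://www.example.com/ProductDisplay?productId=100022126&storeId=10051&categoryID=5028"
    # Both variants reduce to the same canonical URL.
    print(canonical_url(a, content) == canonical_url(b, content))
```

A function like this is also handy for auditing: run it over the URLs in your server logs and any canonical URL that appears under many raw variants is a likely source of duplicate indexing.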
Scenario #5: Pages sharing similar elements
Some websites have elements that are repeated from one page to another, such as titles, meta descriptions, headings, navigation, and text that is shared sitewide. This can be a problem since bots might consider it to be duplicate content. Beware of this scenario if you own an ecommerce site that includes your brand name and information about that brand in every title on every page of your site. In addition, content management systems that do not allow a distinct meta description tag on each page of a website can cause a similar dilemma.
Here are two well-known websites that use their brand names on every page:
http://www.barnesandnoble.com
http://www.officemax.com
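To check your own site for this pattern, you could compare the title and meta description of each page and flag any combination that appears more than once. The following Python sketch shows one way to do that with the standard library. It assumes you already have the HTML of each page in a dictionary keyed by URL; the function name, the sample pages, and the "Acme" brand are invented for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser


class HeadExtractor(HTMLParser):
    """Collect the <title> text and the meta description of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def find_shared_heads(pages):
    """Map each (title, description) pair used by 2+ pages to their URLs."""
    groups = defaultdict(list)
    for url, html in pages.items():
        extractor = HeadExtractor()
        extractor.feed(html)
        groups[(extractor.title.strip(), extractor.description)].append(url)
    return {key: urls for key, urls in groups.items() if len(urls) > 1}


if __name__ == "__main__":
    generic = ("<html><head><title>Acme Store</title>"
               "<meta name='description' content='Acme sells things.'></head></html>")
    unique = ("<html><head><title>Blue Widget | Acme Store</title>"
              "<meta name='description' content='A blue widget.'></head></html>")
    # Pages /a and /b share the same title and description; /c does not.
    print(find_shared_heads({"/a": generic, "/b": generic, "/c": unique}))
```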
These five scenarios represent situations in which search engine crawlers may perceive your website to have duplicate content. Although it is probably inadvertent on your part, you should take steps to resolve these issues to ensure that all of your web pages are properly indexed by the search engines.