Duplicate Content and Robots

Spider Traps

These are sets of pages that may, intentionally or unintentionally, cause a web crawler or search bot to make an infinite number of requests, or cause a poorly constructed crawler to crash. Spider traps may be used to “catch” spambots or other crawlers that waste a website’s bandwidth.

Politeness

A spider trap makes the web crawler enter an infinite loop, which wastes the spider’s resources, lowers its productivity, and can even crash it. Polite spiders alternate requests between different hosts and don’t request the same document from the same server more than once every few seconds. A “polite” spider is much less affected by spider traps than are “impolite” spiders.

Sites with spider traps usually have a robots.txt telling bots not to go into the trap, so a legitimate “polite” bot would not fall into the trap, but an “impolite” one would disregard the robots.txt file and fall in.
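For example, a site whose calendar script can generate an endless chain of “next month” pages might keep well-behaved crawlers out of that section with a couple of lines in robots.txt (the /calendar/ path here is only a hypothetical example):

# Keep well-behaved crawlers out of the trap directory
User-agent: *
Disallow: /calendar/

A polite spider reads this file before requesting anything else on the host and stays out; an impolite one ignores it and falls into the trap.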

Site Elements That Are Problematic for Spiders

  1. Search and web forms.
  2. Java, images, audio, video, and Flash: alt attributes are a good way to give the engines at least some idea of what the content is, but adding captions and text descriptions is better (see the markup example after this list).
  3. Ajax and JavaScript
  4. Frames
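
As a minimal sketch of the second point, an image wrapped in a figure with alt text and a visible caption gives spiders readable text even though they cannot interpret the image itself (the file name and caption here are invented for illustration):

<figure>
  <!-- The alt attribute gives crawlers and screen readers a short description -->
  <img src="/images/spider-trap-diagram.png"
       alt="Diagram of a crawler stuck in an infinite calendar loop">
  <!-- A visible caption adds indexable text beyond the alt attribute -->
  <figcaption>How a calendar script can trap a crawler in endless pages.</figcaption>
</figure>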

SEO-Friendly Navigation Guidelines

  1. Implement a text-link-based navigation structure, especially if your main navigation is JS- or Flash-based (a minimal example follows this list).
  2. Beware of “spider traps.” Avoid looping 301 or 302 server codes and other redirection protocols.
  3. Watch out for session IDs and cookies. If you limit the ability of a user to view pages or redirect pages based on a cookie or session ID, search engines may be unable to crawl your content.
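
A sketch of the first guideline: even when the visible menu is built with JavaScript or Flash, a plain HTML list of anchor links gives spiders a crawlable path through the site (the URLs and labels are hypothetical):

<nav>
  <ul>
    <!-- Plain text links remain crawlable even if a scripted menu sits on top -->
    <li><a href="/products/">Products</a></li>
    <li><a href="/blog/">Blog</a></li>
    <li><a href="/about/">About Us</a></li>
  </ul>
</nav>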

Root Domains, Subdomains, and Microsites

Common questions when structuring a website:

  1. Whether to host content on a new domain.
  2. When to use subfolders.
  3. When to employ microsites.

As search engines scour the internet, they look at and place value on four different web structures (illustrated with example URLs after this list):

  1. Individual pages/URLs
  2. Subfolders: the folder structures websites use carry different levels of value in the eyes of the spiders; a subfolder that is used heavily can rise in value.
  3. Subdomains (third-level domains): these can receive individual assignments of importance, trustworthiness, and value from the engines, independent of their second-level domains.
  4. Complete root domains
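
Using site.com as a placeholder, the four structures look like this:

  Individual page/URL:   http://www.site.com/blog/how-spiders-crawl/
  Subfolder:             http://www.site.com/blog/
  Subdomain:             http://blog.site.com/
  Complete root domain:  site.com (everything hosted under it)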

When to Use a Subfolder
If a subfolder will work, it is the best choice 99.9% of the time. Keeping content on a single root domain and in a single subfolder gives the maximum SEO benefit, as the engines will maintain all the positive metrics the site earns around links, authority, and trust, and will apply them to every page of the site.

Subdomains are a popular choice for hosting content, but are not recommended from an SEO standpoint. Subdomains may or may not inherit the ranking benefits and positive metrics of the root domain.

When to Use a Subdomain
When you need a completely unique URL, a subdomain can look more authoritative to users. Subdomains may also be a reasonable choice if keyword usage in the domain name is of critical importance.

When to Use a Separate Root Domain
This is rarely recommended: it overly segments your content and brand. Switching to a new domain forces you to rebrand and to earn positive ranking signals all over again.

Microsites
If your site is likely to gain more traction and interest with webmasters and bloggers by being separated from your main site, create a microsite. Never implement a microsite that acts as a doorway page to your main site, or that has substantially the same content as your main site; do it only if the microsite has rich, original content and you can promote it as a genuinely separate site.

Duplicate Content Issues

[youtube=http://youtu.be/6hSoXutuj0g]

Duplicate content can result from many causes:

  • Licensing of content to and from your site
  • Site architecture flaws
  • Plagiarism, also called scraping: spammers desperate for content scrape it from legitimate sources, scrambling the words and repurposing the text to appear on their own pages in the hope of attracting long-tail searches, serving contextual ads, etc.

Search engines filter out as much duplicate content as possible for a better overall user experience.

Consequences of Duplicate Content

  • A spider comes to a site with a crawl budget: the number of pages it plans to crawl in each particular session. If you have duplicate content, you waste that budget and fewer of your original pages can be crawled.
  • Links to duplicate content are a waste of link juice.
  • You can’t tell which version of the content the search engine will pick.

How to Fix the Problem
Apply the “canonical” URL tag on each duplicate page, pointing to the version you deem the original:

<link rel="canonical" href="http://www.site.com">

This tells the search engine that the page in question should be treated as though it were a copy of www.site.com, and that all of the link and content metrics the engine applies should technically flow back to that URL.

This is like a 301 redirect from an SEO perspective. Essentially, you’re telling the search engines that multiple pages should be considered as one without actually redirecting visitors to the new URL, saving you development time. A 301 redirect tells search engines that the page has moved permanently and is the most efficient and search-engine friendly method of page redirection.
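
As a concrete sketch, a print-friendly duplicate might carry the canonical tag in its head, pointing at the page you deem the original (the paths are placeholders):

<!-- In the <head> of the duplicate page http://www.site.com/print/article-1/ -->
<link rel="canonical" href="http://www.site.com/article-1/">

A true 301, by contrast, is configured on the server. On an Apache server with mod_alias enabled, for example, a single line in .htaccess is enough (again with placeholder paths):

Redirect 301 /old-article-1.html http://www.site.com/article-1/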

Licensed Content
If you syndicate content to third parties, the search engine may filter out your copy of the article in favor of the republished one. Ways to fix this include:

  • Have the syndication partner add a NoIndex meta tag.
  • Have the syndication partner link back to the original source article (both approaches are sketched below).
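
A rough sketch of what the partner’s republished page could contain; the URLs are placeholders:

<!-- Keep the republished copy out of the index -->
<meta name="robots" content="noindex, follow">

<!-- Credit the original so the engines can identify the source -->
<p>This article originally appeared at
  <a href="http://www.site.com/original-article/">www.site.com</a>.</p>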

How Search Engines Identify Duplicate Content

  • It is duplicate content even if it is on the same site.
  • Search engines do not reveal the percentage of duplicate content needed to trip the wire.
  • Pages do not have to be identical to be considered duplicates.

How to Monitor Whether Your Content Is Being Ripped Off

  1. Use Copyscape.com
  2. Don’t worry if the pages using your content are way below yours in the rankings.
  3. If, on the other hand, you have a new site with relatively few incoming links and the scrapers are consistently ranking ahead of you, file a DMCA Infringement Request with Google, Yahoo, and Bing. Also, you may have grounds to sue, depending on the type of content scraped.

How to Avoid Duplicate Content on Your Own Site

  • First of all, fix your site structure so that these pages aren’t created in the first place.
  • Use the canonical tag.
  • Use robots.txt to block search engine spiders from crawling the dupes.
  • Use the robots NoIndex meta tag.
  • NoFollow all the links on the dupe pages (the snippet below sketches the last three options).
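
A minimal sketch of those last three options, assuming a hypothetical duplicate section under /print/:

# robots.txt: keep spiders out of the duplicate section entirely
User-agent: *
Disallow: /print/

<!-- Or, in the <head> of the duplicate page: allow crawling but block indexing -->
<meta name="robots" content="noindex, follow">

<!-- And on the duplicate page itself: keep its links from passing link juice -->
<a href="http://www.site.com/article-1/" rel="nofollow">Read the original article</a>

Note that blocking a URL in robots.txt means the spider never sees a NoIndex tag on that page, so treat these as alternatives rather than a combination.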

[youtube=http://youtu.be/8V34o9GnwDk]
