West Virginia Web Scraping

Saturday, 24 January 2015

How to Deal with Content Scrapers

There are a few approaches that people take when dealing with content scrapers: the Do Nothing approach, the Kill Them All approach, and the Take Advantage of Them approach.

The Do Nothing Approach

This is by far the easiest approach you can take. The most popular bloggers usually recommend it because fighting the scrapers takes A LOT of time. This approach simply recommends that “instead of fighting them, spend your time producing even more quality content and having fun”. Now obviously, a well-known blog like Smashing Magazine, CSS-Tricks, Problogger, or others does not have to worry about it. They are authority sites in Google’s eyes.

However, we know that during the Panda update some good sites got flagged as scrapers because Google thought the scrapers’ copies were the original content. So this approach is not always the best, in our opinion.

The Kill Them All Approach

This is the exact opposite of the “Do Nothing Approach”. In this approach, you simply contact the scraper and ask them to take the content down. If they refuse or simply do not reply to your requests, then you file a DMCA (Digital Millennium Copyright Act) takedown notice with their host. In our experience, the majority of scraping websites do not have a contact form available. If they do, then use it. If they do not, then you need to do a Whois lookup.

Whois Lookup

A Whois lookup shows the contact info under the administrative contact. Usually the administrative and technical contacts are the same. The Whois record also shows the domain registrar. Most well-known web hosting companies and domain registrars have DMCA forms or emails. In the example we looked up, the nameservers show that this specific person is with HostGator, which has a form for DMCA complaints. If the nameserver is something like ns1.theirdomain.com, then you have to dig deeper by doing reverse IP lookups and searching for IPs.

You can also use a third-party service such as DMCA.com for takedowns.
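The nameserver check described above can be sketched in Python. This is only an illustration under stated assumptions: the Whois text, the domain scraper-example.com, and the host example-host.com are all made up, and real Whois output varies by registry, so the "Name Server:" label may differ.

```python
import re

# Assumption: nameservers appear on lines like "Name Server: NS1.EXAMPLE.COM".
NAME_SERVER = re.compile(r"^[ \t]*name server:[ \t]*(\S+)", re.IGNORECASE | re.MULTILINE)


def nameservers(whois_text):
    """Return the lower-cased nameserver hostnames found in Whois text."""
    return [m.group(1).rstrip(".").lower() for m in NAME_SERVER.finditer(whois_text)]


def uses_own_nameservers(domain, servers):
    """True if every nameserver is a subdomain of the domain itself.

    In that case the nameservers do not reveal the host, and you need
    to dig deeper (for example with a reverse IP lookup).
    """
    suffix = "." + domain.lower()
    return bool(servers) and all(ns.endswith(suffix) for ns in servers)


# Hypothetical Whois output, not a real record.
SAMPLE = """\
Domain Name: SCRAPER-EXAMPLE.COM
Registrar: Example Registrar, Inc.
Name Server: NS1.EXAMPLE-HOST.COM
Name Server: NS2.EXAMPLE-HOST.COM
"""

servers = nameservers(SAMPLE)
print(servers)
print(uses_own_nameservers("scraper-example.com", servers))
```

If the second value is False, the nameserver names point at a third party, most likely the web host, and that is where the DMCA complaint would go.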

Jeff Starr, in his article, suggests that you block the bad guys’ IPs. Check your access logs for their IP address, and then block it with something like this in your root .htaccess file:

    # 123.456.789 is a placeholder; use the address prefix from your logs
    Deny from 123.456.789
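To find the address worth blocking in the first place, you can count which IPs request your feed most often. The following is a minimal Python sketch, assuming your server writes the common or combined log format and that your feed lives under /feed. The sample log lines and addresses are invented for illustration.

```python
import re
from collections import Counter

# Start of a common/combined log format line:
# client IP, identity, user, [timestamp], "METHOD path ..."
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|HEAD) (\S+)')


def feed_hits_by_ip(lines, feed_path="/feed"):
    """Count requests per client IP whose path starts with feed_path."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(2).startswith(feed_path):
            counts[match.group(1)] += 1
    return counts


# Hypothetical log lines using documentation-only IP ranges.
SAMPLE_LOG = [
    '203.0.113.7 - - [24/Jan/2015:10:00:00 +0000] "GET /feed/ HTTP/1.1" 200 5120',
    '203.0.113.7 - - [24/Jan/2015:10:05:00 +0000] "GET /feed/ HTTP/1.1" 200 5120',
    '198.51.100.4 - - [24/Jan/2015:10:06:00 +0000] "GET /about/ HTTP/1.1" 200 2048',
    '198.51.100.4 - - [24/Jan/2015:10:07:00 +0000] "GET /feed/ HTTP/1.1" 304 0',
    '203.0.113.7 - - [24/Jan/2015:10:10:00 +0000] "GET /feed/ HTTP/1.1" 200 5120',
]

# Print the heaviest feed consumers first.
for ip, hits in feed_hits_by_ip(SAMPLE_LOG).most_common():
    print(ip, hits)
```

A high count alone does not prove scraping, since legitimate feed readers also poll often. Check the address against the scraper site before you block it.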

You can also redirect them to a dummy feed by doing something like this:

    # 123.456.789 is a placeholder; ^ anchors the match to the start of the address
    RewriteCond %{REMOTE_ADDR} ^123\.456\.789\.
    RewriteRule .* http://dummyfeed.com/feed [R,L]
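The address in these rules is only a placeholder (its octets are larger than 255), so you need to substitute a real one and escape its dots for the regular expression. As a rough sketch, a small Python helper could validate the address and print the two lines for you. The function name htaccess_rules and the sample address are hypothetical, and the dummy feed URL is the one from the example above.

```python
import ipaddress
import re


def htaccess_rules(ip, dummy_feed_url):
    """Return RewriteCond/RewriteRule lines that redirect one exact IP.

    Raises ValueError if ip is not a valid IPv4 or IPv6 address.
    """
    addr = ipaddress.ip_address(ip)
    # Escape the dots and anchor both ends so only this address matches.
    pattern = "^" + re.escape(str(addr)) + "$"
    return [
        f"RewriteCond %{{REMOTE_ADDR}} {pattern}",
        f"RewriteRule .* {dummy_feed_url} [R,L]",
    ]


# 203.0.113.7 is a documentation-only address used for illustration.
for line in htaccess_rules("203.0.113.7", "http://dummyfeed.com/feed"):
    print(line)
```

Unlike the prefix match in the example above, this matches a single address. Block a whole range only if your logs show the scraper moving around inside it.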

You can get really creative here, as Jeff suggests. Send them to really large text feeds full of Lorem Ipsum. You can send them some disgusting images of bad things. You can also send them right back to their own server, causing an infinite loop that can crash their site.

The last approach is to take advantage of them.

