West Virginia Web Scraping: 2015

Tuesday 30 June 2015

Data Scraping - Enjoy the Appeal of the Hand Scraped Flooring

Hand scraped flooring is appreciated for the character it brings into the home. This style of flooring relies on hand scraped planks of wood rather than precisely milled boards. The irregularities in the planks provide a certain degree of charm and help to create a more unique feature in the home.

Distressed vs. Hand scraped

There are two types of flooring on the market that have an aged and unique charm with a non-perfect finish. However, there is a significant difference in the process used to manufacture the planks. The more standard distressed flooring is cut on a factory production line. The grooves, scratches, dents, or other irregularities in these planks are part of the manufacturing process and are achieved by rolling or pressing the wood onto a patterned surface.

The real hand scraped planks are made by craftsmen who work on each plank individually. By using this working technique, there is complete certainty that each plank will be unique in appearance.

Scraping the planks

The hand scraping process on the highest-quality planks is completed by trained carpenters or craftsmen who take great care in their workmanship and produce a high-quality end product. It can help to ask the flooring supplier who completes the work.

Besides the well scraped lumber, there are also planks that have been bought from less than desirable sources. This is caused by the increased demand for this type of flooring. At the lower end of the market unskilled workers are used and the end results aren't so impressive.

The high-quality plank has the distinctive look that feels and functions perfectly well as solid flooring, while the low-quality work can appear quite ugly and cheap.

Even though it might cost a little bit more, it benefits to source the hardwood floor dealers that rely on the skilled workers to complete the scraping process.

Buying the right lumber

Once a genuine supplier is found, it is necessary to determine the finer aspects of the wooden flooring. This hand scraped flooring is available in several hardwoods, such as oak, cherry, hickory, and walnut. Plus, it comes in many different sizes and widths. A further aspect relates to the finish with darker colored woods more effective at highlighting the character of the scraped boards. This makes the shadows and lines appear more prominent once the planks have been installed at home.

Why not visit Bellacerafloors.com for the latest collection of luxury floor materials, including the Handscraped Hardwood Flooring.

Source: http://ezinearticles.com/?Enjoy-the-Appeal-of-the-Hand-Scraped-Flooring&id=8995784

Tuesday 23 June 2015

Migrating Table-oriented Web Scraping Code to rvest w/XPath & CSS Selector Examples

My intrepid colleague (@jayjacobs) informed me of this (and didn’t gloat too much). I’ve got a “pirate day” post coming up this week that involves scraping content from the web and thought folks might benefit from another example that compares the “old way” and the “new way” (Hadley excels at making lots of “new ways” in R :-) I’ve left the output in with the code to show that you get the same results.

The following shows old/new methods for extracting a table from a web site, including how to use either XPath selectors or CSS selectors in rvest calls. To stave off some potential comments: due to the way this table is set up and the need to extract only certain components from the td blocks and elements from tags within the td blocks, a simple readHTMLTable would not suffice.

The old/new approaches are very similar, but I especially like the ability to chain output ala magrittr/dplyr and not having to mentally switch gears to XPath if I’m doing other work targeting the browser (i.e. prepping data for D3).

The code (sans output) is in this gist, and IMO the rvest package is going to make working with web site data so much easier.

library(XML)
library(httr)
library(rvest)
library(magrittr)

# setup connection & grab HTML the "old" way w/httr

freak_get <- GET("http://torrentfreak.com/top-10-most-pirated-movies-of-the-week-130304/")

freak_html <- htmlParse(content(freak_get, as="text"))

# do the same the rvest way, using "html_session" since we may need connection info in some scripts

freak <- html_session("http://torrentfreak.com/top-10-most-pirated-movies-of-the-week-130304/")

# extracting the "old" way with xpathSApply

xpathSApply(freak_html, "//*/td[3]", xmlValue)[1:10]

## [1] "Silver Linings Playbook "           "The Hobbit: An Unexpected Journey " "Life of Pi (DVDscr/DVDrip)"

## [4] "Argo (DVDscr)"                      "Identity Thief "                    "Red Dawn "

## [7] "Rise Of The Guardians (DVDscr)"     "Django Unchained (DVDscr)"          "Lincoln (DVDscr)"

## [10] "Zero Dark Thirty "

xpathSApply(freak_html, "//*/td[1]", xmlValue)[2:11]

## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"

xpathSApply(freak_html, "//*/td[4]", xmlValue)

## [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer" "8.2 / trailer" "5.3 / trailer" "7.5 / trailer"

## [8] "8.8 / trailer" "8.2 / trailer" "7.6 / trailer"

xpathSApply(freak_html, "//*/td[4]/a[contains(@href,'imdb')]", xmlAttrs, "href")

##                                    href                                    href                                    href

## "http://www.imdb.com/title/tt1045658/" "http://www.imdb.com/title/tt0903624/" "http://www.imdb.com/title/tt0454876/"

##                                    href                                    href                                    href

## "http://www.imdb.com/title/tt1024648/" "http://www.imdb.com/title/tt2024432/" "http://www.imdb.com/title/tt1234719/"

##                                    href                                    href                                    href

## "http://www.imdb.com/title/tt1446192/" "http://www.imdb.com/title/tt1853728/" "http://www.imdb.com/title/tt0443272/"

##                                    href

## "http://www.imdb.com/title/tt1790885/?"

# extracting with rvest + XPath

freak %>% html_nodes(xpath="//*/td[3]") %>% html_text() %>% .[1:10]

## [1] "Silver Linings Playbook "           "The Hobbit: An Unexpected Journey " "Life of Pi (DVDscr/DVDrip)"

## [4] "Argo (DVDscr)"                      "Identity Thief "                    "Red Dawn "

## [7] "Rise Of The Guardians (DVDscr)"     "Django Unchained (DVDscr)"          "Lincoln (DVDscr)"

## [10] "Zero Dark Thirty "

freak %>% html_nodes(xpath="//*/td[1]") %>% html_text() %>% .[2:11]

## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"

freak %>% html_nodes(xpath="//*/td[4]") %>% html_text() %>% .[1:10]

## [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer" "8.2 / trailer" "5.3 / trailer" "7.5 / trailer"

## [8] "8.8 / trailer" "8.2 / trailer" "7.6 / trailer"

freak %>% html_nodes(xpath="//*/td[4]/a[contains(@href,'imdb')]") %>% html_attr("href") %>% .[1:10]

## [1] "http://www.imdb.com/title/tt1045658/" "http://www.imdb.com/title/tt0903624/"

## [3] "http://www.imdb.com/title/tt0454876/" "http://www.imdb.com/title/tt1024648/"

## [5] "http://www.imdb.com/title/tt2024432/" "http://www.imdb.com/title/tt1234719/"

## [7] "http://www.imdb.com/title/tt1446192/" "http://www.imdb.com/title/tt1853728/"

## [9] "http://www.imdb.com/title/tt0443272/" "http://www.imdb.com/title/tt1790885/?"

# extracting with rvest + CSS selectors

freak %>% html_nodes("td:nth-child(3)") %>% html_text() %>% .[1:10]

## [1] "Silver Linings Playbook "           "The Hobbit: An Unexpected Journey " "Life of Pi (DVDscr/DVDrip)"

## [4] "Argo (DVDscr)"                      "Identity Thief "                    "Red Dawn "

## [7] "Rise Of The Guardians (DVDscr)"     "Django Unchained (DVDscr)"          "Lincoln (DVDscr)"

## [10] "Zero Dark Thirty "

freak %>% html_nodes("td:nth-child(1)") %>% html_text() %>% .[2:11]

## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"

freak %>% html_nodes("td:nth-child(4)") %>% html_text() %>% .[1:10]

## [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer" "8.2 / trailer" "5.3 / trailer" "7.5 / trailer"

## [8] "8.8 / trailer" "8.2 / trailer" "7.6 / trailer"

freak %>% html_nodes("td:nth-child(4) a[href*='imdb']") %>% html_attr("href") %>% .[1:10]

## [1] "http://www.imdb.com/title/tt1045658/" "http://www.imdb.com/title/tt0903624/"

## [3] "http://www.imdb.com/title/tt0454876/" "http://www.imdb.com/title/tt1024648/"

## [5] "http://www.imdb.com/title/tt2024432/" "http://www.imdb.com/title/tt1234719/"

## [7] "http://www.imdb.com/title/tt1446192/" "http://www.imdb.com/title/tt1853728/"

## [9] "http://www.imdb.com/title/tt0443272/" "http://www.imdb.com/title/tt1790885/?"

# building a data frame (which is kinda obvious, but hey)

data.frame(movie=freak %>% html_nodes("td:nth-child(3)") %>% html_text() %>% .[1:10],

           rank=freak %>% html_nodes("td:nth-child(1)") %>% html_text() %>% .[2:11],

           rating=freak %>% html_nodes("td:nth-child(4)") %>% html_text() %>% .[1:10],

           imdb.url=freak %>% html_nodes("td:nth-child(4) a[href*='imdb']") %>% html_attr("href") %>% .[1:10],

           stringsAsFactors=FALSE)

##                                 movie rank        rating                              imdb.url

## 1            Silver Linings Playbook     1 7.4 / trailer http://www.imdb.com/title/tt1045658/

## 2 The Hobbit: An Unexpected Journey     2 8.2 / trailer http://www.imdb.com/title/tt0903624/

## 3          Life of Pi (DVDscr/DVDrip)    3 8.3 / trailer http://www.imdb.com/title/tt0454876/

## 4                       Argo (DVDscr)    4 8.2 / trailer http://www.imdb.com/title/tt1024648/

## 5                     Identity Thief     5 8.2 / trailer http://www.imdb.com/title/tt2024432/

## 6                           Red Dawn     6 5.3 / trailer http://www.imdb.com/title/tt1234719/

## 7      Rise Of The Guardians (DVDscr)    7 7.5 / trailer http://www.imdb.com/title/tt1446192/

## 8           Django Unchained (DVDscr)    8 8.8 / trailer http://www.imdb.com/title/tt1853728/

## 9                    Lincoln (DVDscr)    9 8.2 / trailer http://www.imdb.com/title/tt0443272/

## 10                  Zero Dark Thirty    10 7.6 / trailer http://www.imdb.com/title/tt1790885/?

Source: http://www.r-bloggers.com/migrating-table-oriented-web-scraping-code-to-rvest-wxpath-css-selector-examples/

Thursday 18 June 2015

Web Scraping Services : Data Discovery vs. Data Extraction

Looking at screen-scraping at a simplified level, there are two primary stages involved: data discovery and data extraction. Data discovery deals with navigating a web site to arrive at the pages containing the data you want, and data extraction deals with actually pulling that data off of those pages. Generally when people think of screen-scraping they focus on the data extraction portion of the process, but my experience has been that data discovery is often the more difficult of the two.

The data discovery step in screen-scraping might be as simple as requesting a single URL. For example, you might just need to go to the home page of a site and extract out the latest news headlines. On the other side of the spectrum, data discovery may involve logging in to a web site, traversing a series of pages in order to get needed cookies, submitting a POST request on a search form, traversing through search results pages, and finally following all of the "details" links within the search results pages to get to the data you're actually after. In cases of the former a simple Perl script would often work just fine. For anything much more complex than that, though, a commercial screen-scraping tool can be an incredible time-saver. Especially for sites that require logging in, writing code to handle screen-scraping can be a nightmare when it comes to dealing with cookies and such.
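To make the more involved end of that spectrum concrete, here is a minimal Python sketch using only the standard library. The /login and /search paths, the form field names passed in, and the "/details" link marker are all hypothetical; a real site will use its own, and paging through several results pages is left out.

```python
import http.cookiejar
import urllib.parse
import urllib.request
from html.parser import HTMLParser


class DetailLinkParser(HTMLParser):
    """Collects hrefs of <a> tags whose href contains a marker substring."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.links = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href and self.marker in href:
            self.links.append(href)


def find_detail_links(html, base_url, marker="/details"):
    """Return absolute URLs of the "details" links found in a results page."""
    parser = DetailLinkParser(marker)
    parser.feed(html)
    return [urllib.parse.urljoin(base_url, href) for href in parser.links]


def discover(base_url, login_fields, search_fields):
    """Log in, submit the search form, and return the detail-page URLs."""
    # One opener with a cookie jar keeps the session cookies across requests.
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

    # Passing data= turns the request into a POST (hypothetical /login path).
    login_data = urllib.parse.urlencode(login_fields).encode()
    opener.open(urllib.parse.urljoin(base_url, "/login"), data=login_data).close()

    # POST the search form (hypothetical /search path) and read the results.
    search_data = urllib.parse.urlencode(search_fields).encode()
    with opener.open(urllib.parse.urljoin(base_url, "/search"), data=search_data) as resp:
        html = resp.read().decode("utf-8", errors="replace")

    return find_detail_links(html, base_url)
```

Only find_detail_links can be tried without a live site; discover shows the order of the steps rather than something to run as-is.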

In the data extraction phase you've already arrived at the page containing the data you're interested in, and you now need to pull it out of the HTML. Traditionally this has involved creating a series of regular expressions that match the pieces of the page you want (e.g., URLs and link titles). Regular expressions can be a bit complex to deal with, so most screen-scraping applications will hide these details from you, even though they may use regular expressions behind the scenes.
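As a rough illustration of the regular-expression approach, the Python sketch below pulls URLs and link titles out of a page. The pattern is an assumption that only covers simple, double-quoted href attributes; real-world HTML is often messier than this.

```python
import re

# Matches <a ... href="URL" ...>TITLE</a>; assumes double-quoted hrefs.
LINK_RE = re.compile(
    r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")


def extract_links(html):
    """Return (url, title) pairs, with any nested tags stripped from the title."""
    return [
        (url, TAG_RE.sub("", title).strip())
        for url, title in LINK_RE.findall(html)
    ]


print(extract_links('<p><a href="/news/1">First <b>headline</b></a></p>'))
```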

As an addendum, I should probably mention a third phase that is often ignored, and that is, what do you do with the data once you've extracted it? Common examples include writing the data to a CSV or XML file, or saving it to a database. In the case of a live web site you might even scrape the information and display it in the user's web browser in real-time. When shopping around for a screen-scraping tool you should make sure that it gives you the flexibility you need to work with the data once it's been extracted.
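For the CSV case, a small Python sketch with the standard csv module might look like the following. The field names "title" and "url" are made up for the example.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialise a list of dicts to CSV text, header row first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


scraped = [{"title": "Latest news, today", "url": "/news/1"}]
print(rows_to_csv(scraped, ["title", "url"]))
```

The csv module takes care of quoting values that contain commas, which is easy to get wrong when building the lines by hand.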

Source: http://ezinearticles.com/?Data-Discovery-vs.-Data-Extraction&id=165396

Saturday 6 June 2015

WordPress Titles: scraping with search url

I’ve blogged for a few years now, and I’ve used several tools along the way. zachbeauvais.com began as a Drupal site, until I worked out that it’s a bit overkill, and switched to WordPress. Recently, I’ve been toying with the idea of using a static site generator (à la Jekyll or Hyde), or even pulling together a kind of ebook of ramblings. I also want to be able to arrange the posts based on the keywords they contain, regardless of how they’re categorised or tagged.

Whatever I wanted to do, I ended up with a single point of messiness: individual blog posts, and how they’re formatted. When I started, I seem to remember using Drupal’s truly awful WYSIWYG editor, and tweaking the HTML soup it produced. Then, when I moved over to WordPress, it pulled all the posts and metadata through via RSS, and I tweaked with the visual and text tools which are baked into the engine.

A couple years ago, I started to write in Markdown, and completely apart from the blog (thanks to full-screen writing and loud music). This gives me a local .md file, and I copy/paste into WordPress using a plugin to get rid of the visual editor entirely.

So, I wrote a scraper to return a list of blog posts containing a specific term. What I hope is that this very simple scraper is useful to others—WordPress is pretty common, after all—and to get some ideas for improving it, and handle post content. If you haven’t used ScraperWiki before, you might not know that you can see the raw scraper by clicking “view source” from the scraper’s overview page (or going here if you’re lazy).

This scraper is based on WordPress’ built-in search, which can be used by passing the search terms to a url, then scraping the resulting page:

http://zachbeauvais.com/?s=search_term&submit=Search

The scraper uses three Python libraries:

    Requests
    ScraperWiki
    lxml.html

There are two variables which can be changed to search for other terms, or using a different WordPress site:

term = "coffee"

site = "http://www.zachbeauvais.com"

The rest of the script is really simple: it creates a dictionary called “payload” containing the letter “s”, the keyword, and the instruction to search. The “s” is in there to make up the search url: /?s=coffee …

Requests then GETs the site, passing payload as url parameters, and I use Requests’ .text attribute to get the page’s html as text, which I then parse with lxml into the new variable “root”.

payload = {'s': str(term), 'submit': 'Search'}

r = requests.get(site, params=payload) # This'll be the results page

html = r.text

root = lxml.html.fromstring(html) # parsing the HTML into the var root
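To see which URL that payload corresponds to, here is a small standard-library sketch that builds the search URL by hand. The helper name search_url is made up for the illustration; it is not part of the scraper.

```python
from urllib.parse import urlencode


def search_url(site, term):
    """Build the WordPress search URL for a site and a search term."""
    payload = {'s': str(term), 'submit': 'Search'}
    return site.rstrip('/') + '/?' + urlencode(payload)


print(search_url("http://www.zachbeauvais.com", "coffee"))
# http://www.zachbeauvais.com/?s=coffee&submit=Search
```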

Now, my WordPress theme renders the titles of the retrieved posts in <h1> tags with the CSS class “entry-title”, so I loop through the html text, pulling out the links and text from all the resulting h1.entry-title items. This part of the script would need tweaking, depending on the CSS class and h-tag your theme uses.

# each match is the <a> element inside an h1.entry-title
results = root.cssselect("h1.entry-title a")

if not results:
    print("No results.")

for link in results:
    data = {
        'uri': link.attrib['href'],
        'post-title': link.text_content(),
        'search-term': term
    }
    print(data)
    # 'uri' is the unique key, so re-running updates rows instead of duplicating them
    scraperwiki.sqlite.save(unique_keys=['uri'], data=data)

These return into an sqlite database via the ScraperWiki library, and I have a resulting database with the title and link to every blog post containing the keyword.

So, this could, in theory, run on any WordPress instance which uses the same search pattern URL—just change the site variable to match.

Also, you can run this again and again, changing the term to any new keyword. These will be stored in the DB with the keyword in its own column to identify what you were looking for.

See? Pretty simple scraping.

So, what I’d like next is to have a local copy of every post in a single format.

Has anyone got any ideas how I could improve this? And, has anyone used WordPress’ JSON API? It might be a logical next step to call the API to get the posts directly from the MySQL DB… but that would be a new blog post!

Source: https://scraperwiki.wordpress.com/2013/03/11/wordpress-titles-scraping-with-search-url/

Sunday 31 May 2015

Data Scraping Services - Web Scraping Video Tutorial Collection for All Programming Language

Web scraping is a mechanism in which a request is made to a website URL to get the HTML document text, and that text is then parsed to extract data from the HTML code. Scraping websites for data is a general approach and can be implemented in any programming language, such as PHP, Java, C#, Python and many others.

There are many web scraping tools on the market with which you can extract data with no coding knowledge. In many cases these tools don’t help because the scraping needs a custom crawling flow, and in that case you have to build your own web scraping application in one of the programming languages you know. In this post I have collected scraping video tutorials for all programming languages.

I am mostly familiar with web scraping using PHP, C# and some other scraping tools, and I provide a web scraping service. If you have any scraping requirement, send me your requirements and I will get back to you with a sample data scrape and the best price.

Web Scraping Using PHP

You can do web scraping in PHP using the cURL library and the Simple HTML DOM parsing library. The PHP function file_get_contents() can also be useful for making web requests. One drawback of scraping with PHP is that it can’t execute JavaScript, so scraping AJAX-based pages isn’t possible with PHP alone.

Web Scraping Using C#

There are many libraries available in .NET for HTML parsing and data scraping. I have used the WebBrowser control and HTML Agility Pack for data extraction in .NET using C#.

I haven’t done web scraping in Java, Perl or Python. I have learned web scraping in Node.js using the CasperJS and PhantomJS libraries. But I thought the tutorials below would be helpful for those who work in Java and Python.

Web Scraping Using Jsoup in Java

Scraping Stock Data Using Python

Develop Web Crawler Using PERL

Web Scraping Using Node.Js

If you find any other good web scraping video tutorial, you can share the link in a comment so other readers benefit from it.

Source: http://webdata-scraping.com/web-scraping-video-tutorial-collection-programming-language/

Thursday 28 May 2015

Web Scraping Services - Extracting Business Data You Need

Would you like to have someone collect, extract, find or scrape contact details, stats, lists, data, or information from websites, online stores, directories, and more?

"Hi-Tech BPO Services offers 100% risk-free, quick, accurate and affordable web scraping, data scraping, screen scraping, data collection, data extraction, and website scraping services to worldwide organizations ranging from medium-sized business firms to Fortune 500 companies."

At Hi-Tech BPO Services we help global businesses build their own databases and mailing lists, generate leads, and get access to the vast resources of unstructured data available on the World Wide Web.

We scrape data from various sources such as websites, blogs, podcasts, and online directories, and convert it into structured formats such as Excel, CSV, Access, text and MySQL using automated and manual scraping technologies. Through our web data scraping services, we crawl through websites and gather sales leads, competitor’s product details, new offers, pricing methodologies, and various other types of information from the web.

Our web scraping services scrape data such as name, email, phone number, address, country, state, city, product, and pricing details among others.

Areas of Expertise in Web Scraping:

•    Contact Details
•    Statistics data from websites
•    Classifieds
•    Real estate portals
•    Social networking sites
•    Government portals
•    Entertainment sites
•    Auction portals
•    Business directories
•    Job portals
•    Email ids and Profiles
•    URLs in an excel spreadsheet
•    Market place portals
•    Search engine and SEO
•    Accessories portals
•    News portals
•    Online shopping portals
•    Hotels and restaurant
•    Event portals
•    Lead generation

Industries we Serve:

Our web scraping services are suitable for industries including real estate, information technology, university, hospital, medicine, property, restaurant, hotels, banking, finance, insurance, media/entertainment, automobiles, marketing, human resources, manufacturing, healthcare, academics, travel, telecommunication and many more.

Why Hi-Tech BPO Services for Web Scraping?

•    Skilled and committed scraping experts
•    Accurate solutions
•    Highly cost-effective pricing strategies
•    Presence of satisfied clients worldwide
•    Using latest and effectual web scraping technologies
•    Ensures timely delivery
•    Round the clock customer support and technical assistance

Get Quick Cost and Time Estimate

Source: http://www.hitechbposervices.com/web-scraping.php

Monday 25 May 2015

Improving performance for web scraping code

I have a website in which the code scrapes other websites to get accurate data. The code works well, but there is a noticeable lag in performance because the code first downloads the HTML stream from various sites (sometimes 9 websites), extracts the relevant part and then renders the HTML page.

What should I do to get optimal performance? Should I change from shared hosting (GoDaddy) to my own server, or does it have nothing to do with my hosting and I need to make changes to my code?

1 Answer

API/CSV

Ask those websites if they provide an API or, if you don't need up-to-date information or the information you need doesn't change frequently, if they can sell you (or give you for free) the data itself, for example in a CSV file. Some small websites may have fancier ways to access data, like a CSV file for the older information and an RSS feed for the changed one.

Those websites would probably be happy to help you, since providing you with an API would reduce their own CPU and bandwidth usage by you.

Profile

Screen scraping is really ugly when it comes to performance and scaling. You may be limited by:

your machine performance, since parsing a sometimes invalid HTML file takes time,
your network speed,
their network speed usage, i.e. how fast you can access the pages of their website depending on the restrictions they set, like DoS protection and the number of requests per second allowed for screen scrapers and search engine crawlers,
their machine performance: if they spend 500 ms generating every page, you can't do anything to reduce this delay.

If, despite your requests to them, those websites cannot provide any convenient way to access their data, but they give you a written consent to screen scrape their website, then profile your code to determine the bottleneck. It may be the internet speed. It may be your database queries. It may be anything.
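Before reaching for a full profiler, a rough first pass is to time each stage separately. The Python sketch below is one way to do that with the standard library. The stage names and the stand-in workloads are made up; a real run would wrap the actual download, parsing and database calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)  # stage name -> total seconds spent


@contextmanager
def timed(stage):
    """Add the wall-clock time of the enclosed block to the given stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] += time.perf_counter() - start


def report():
    """Return (stage, seconds) pairs, slowest stage first."""
    return sorted(timings.items(), key=lambda item: item[1], reverse=True)


with timed("download"):
    time.sleep(0.05)   # stand-in for a network request
with timed("parse"):
    sum(range(1000))   # stand-in for HTML parsing

for stage, seconds in report():
    print(f"{stage}: {seconds:.3f}s")
```

If the totals point at your own code rather than the network, Python's cProfile module can then narrow it down to individual functions.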

For example, you may discover that you spend too much time finding with regular expressions the relevant information in the received HTML. In that case, you would want to stop doing it wrong and use a parser instead of regular expressions, then see how this improves the performance.
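As an illustration of the parser route, here is a minimal Python sketch built on the standard html.parser module. It collects the text of every element carrying a given CSS class. The class name "price" and the sample markup are invented for the example, and a dedicated library would be more robust on badly broken HTML.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextByClass(HTMLParser):
    """Collects the text content of every element with a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0       # > 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1              # nested tag inside a match
        elif self.css_class in classes:
            self.depth = 1               # entering a matching element
            self.results.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def extract_text_by_class(html, css_class):
    parser = TextByClass(css_class)
    parser.feed(html)
    return [text.strip() for text in parser.results]


sample = '<div><span class="price">12.50 <b>EUR</b></span><span class="name">x</span></div>'
print(extract_text_by_class(sample, "price"))
```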

You may also find that the bottleneck is the time the remote server spends generating every page. In this case, there is nothing to do: you may have the fastest server, the fastest connection and the most optimized code, the performance will be the same.

Do things in parallel:

Remember to use parallel computing wisely and to always profile what you're doing, instead of doing premature optimization, in hope that you're smarter than the profiler.

Especially when it comes to using network, you may be very surprised. For example, you may believe that making more requests in parallel will be faster, but as Steve Gibson explains in episode 345 of Security Now, this is not always the case.
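If profiling does suggest that the downloads are worth parallelising, a minimal Python sketch with a small thread pool could look like this. The default of 4 workers is an arbitrary assumption; measure before and after rather than trusting that more workers is faster.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_url(url, timeout=10):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_all(urls, fetch=fetch_url, max_workers=4):
    """Fetch several URLs concurrently and return a {url: result} dict."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(fetch, urls)))
```

Passing the fetch function as a parameter makes it easy to swap in a fake one when timing or testing the rest of the code without touching the network.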

Legal aspects

Also note that screen scraping is explicitly forbidden by the conditions of use on many websites (like on IMDB). And if nothing is said on this subject in the conditions of use, it doesn't mean that you can screen scrape those websites.

The fact that the information is publicly available on the internet doesn't give you the right to copy and reuse it this way either.

Why? you may ask. For two reasons:

Most websites rely on advertisement and marketing. When people use one of those websites directly, they use up some of the website's CPU and network bandwidth, but in return they may click on an ad or buy something sold on the website. When you screen scrape, your bot uses up their CPU and network bandwidth, but will never click on an ad or buy something.
Displaying the information you screen scraped on your website can have even worse effects. Example: in France, there are two major websites selling hardware. The first one is easy and fast to use, has a nice visual design, better SEO, and in general is very well done. The second one is crap, but the prices are lower. If you screen scrape them and give the raw results (prices with links) to your users, they will obviously click on the lower price every time, which means that the website with the pretty design will have fewer chances to sell its products.
People made an effort in collecting, processing and displaying some data. Sometimes they paid to get it. Why would they enjoy seeing you pulling this data conveniently and for free?

Source: http://programmers.stackexchange.com/questions/141403/improving-performance-for-web-scraping-code/141406#141406

Sunday 24 May 2015

How to prevent getting blacklisted while scraping

Crawlers can retrieve data much quicker and in greater depth than human searchers, so bad scraping practices can have some impact on the performance of the site.

Needless to say, if a single crawler is performing multiple requests per second and/or downloading large files, an underpowered server would have a hard time keeping up with requests from multiple crawlers.

Since spiders don’t bring direct organic traffic and seemingly affect the performance of the site, most site admins hate spiders and do their best to prevent them.

Let’s go through how websites detect and block spiders, and the techniques to overcome those barriers.

Most websites don’t have anti scraping mechanisms since it would affect the user experience, but some sites do not believe in open data access.

Before going through this article always keep in mind that

    A GOOD SPIDER MUST OBEY A WEBSITE’S CRAWLING POLICIES.

HOW DOES DETECTING ‘SPIDER ACTIVITY’ WORK?

A web server can use different mechanisms to tell a spider from a normal user. Here are some methods used by a site to detect a spider:

•    Unusual traffic/high download rate especially from a single client/or IP address within a short time span raises a bot alert.

•    Repetitive tasks performed on the website, on the assumption that a human user won’t perform the same repetitive tasks all the time.

•    The site has honeypot traps inside its pages. These honeypots are usually links which aren’t visible to a normal user but only to a spider. When a scraper/spider tries to access the link, the alarms are tripped.

Spend some time investigating the anti-scraping mechanisms used by a site and build the spider accordingly. It will provide a better outcome in the long run and increase the longevity and robustness of your work.

EASIEST WAY TO FIND IF A SITE HATES BOTS

Check the robots.txt file. If it contains lines like these, it means the site doesn’t like bots. However, since most sites want to be on Google (arguably the largest scraper of websites globally ;-)), they do allow access to bots and spiders.

User-agent: *
Disallow: /

These lines are meant to keep out well-behaved bots, that is, the bots which respect robots.txt.

Another sign is the irritating presence of CAPTCHAs on pages other than the authentication page.
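A quick way to run this check from code is Python's built-in urllib.robotparser. The sketch below parses robots.txt text that has already been downloaded; the bot name "examplebot" and the example.com URL are placeholders.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = "User-agent: *\nDisallow: /"
print(is_allowed(blocked, "examplebot", "http://example.com/page"))
```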

WHAT HAPPENS WHEN YOU GET BANNED

There are two ways to ban a web spider: either by banning all accesses from a particular IP, or by banning all accesses that use a specific ID to access the server. Most browsers and web spiders identify themselves with a user agent string whenever they request a page. The Chrome browser, for example, uses Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.149 Safari/537.36

The banning can be temporary or permanent. Temporary blocks can last minutes or hours.

HOW DO WE KNOW A SITE HAS BLOCKED US?

If any of the following symptoms appear on the site that you are crawling, it is a sign of being blocked or banned.

•    Showing CAPTCHA pages
•    Unusual content delivery delay
•    Frequent responses with 404, 301 or 500 status codes

Frequent appearance of the following status codes is also an indication of blocking:

•    401 Unauthorized
•    403 Forbidden
•    404 Not Found
•    408 Request Timeout
•    429 Too Many Requests
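One way to watch for these symptoms in a crawler is to track how many of the most recent responses look suspicious. The Python sketch below does that. The window size, the 50% threshold and the plain "captcha" substring check are assumptions picked for illustration, not values from any particular site.

```python
from collections import deque

# Status codes listed above as possible signs of blocking.
SUSPICIOUS_STATUS = {401, 403, 404, 408, 429}


class BlockMonitor:
    """Tracks recent responses and flags a likely block."""

    def __init__(self, window=20, threshold=0.5):
        self.recent = deque(maxlen=window)   # True for each suspicious response
        self.threshold = threshold

    def record(self, status_code, body=""):
        suspicious = status_code in SUSPICIOUS_STATUS or "captcha" in body.lower()
        self.recent.append(suspicious)

    def probably_blocked(self):
        if not self.recent:
            return False
        return sum(self.recent) / len(self.recent) >= self.threshold


monitor = BlockMonitor(window=4)
for status in (200, 200, 429, 403):
    monitor.record(status)
print(monitor.probably_blocked())
```

A single 404 is usually just a missing page, which is why the sketch looks at a ratio over several responses instead of reacting to one status code.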

WEB CRAWLING BEST PRACTICES

These are the best practices we can follow to overcome the detection.

1. MAKE CRAWLING SLOWER, DO NOT DDoS THE SERVER, TREAT THEM NICELY

Use auto-throttling mechanisms, which automatically adjust the crawling speed based on the load on both the spider and the website you are crawling, and tune the spider to an optimum crawling speed. The faster you crawl, the worse it is for everyone.

Put some random sleeps in between requests, and add some delays after crawling a certain number of pages. Choose the lowest number of concurrent requests possible. These techniques make the spider look like a human being.
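A simple Python sketch of the random sleeps and periodic longer pauses follows. The numbers (2 to 3 seconds between requests, an extra 30 seconds after every 10 pages) are arbitrary assumptions and should be tuned to what the target site can comfortably handle.

```python
import random
import time


def polite_delays(count, base=2.0, jitter=1.0, pause_every=10, long_pause=30.0):
    """Yield one delay (in seconds) per request: base plus random jitter,
    with an extra long pause after every `pause_every` pages."""
    for page_number in range(1, count + 1):
        delay = base + random.uniform(0, jitter)
        if page_number % pause_every == 0:
            delay += long_pause
        yield delay


def crawl(urls, fetch):
    """Fetch the URLs one at a time, sleeping between requests."""
    results = []
    for url, delay in zip(urls, polite_delays(len(urls))):
        results.append(fetch(url))
        time.sleep(delay)
    return results
```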

2. DISGUISE YOUR REQUESTS BY ROTATING IP/PROXY

A server can easily detect a bot by checking the requests coming from a single IP address, so we use different IPs for making requests to a server, which lowers the detection rate. Make a pool of IPs that you can use and pick a random one for each request.

There are several methods that can be used to change the IP. Services like VPNs, shared proxies and TOR can help, and some third parties also provide services for IP rotation.

3. USER-AGENT SPOOFING

Every request made from a client contains a user-agent header, so using the same user agent many times leads to the detection of a bot. User agent spoofing is the best solution for this. Spoof the user agent by making a list of user agents and picking a random one for each request.

Websites do not want to block genuine users so you should try to look like one. Set your user-agent to a common web browser instead of using the library default (such as wget/version or urllib/version). You could even pretend to be the Google Bot: Googlebot/2.1; (http://www.google.com/bot.html)

You can check your user-agent string here:

http://www.whatsmyuseragent.com/

A good user-agent string list can be found here:

http://www.useragentstring.com/pages/useragentstring.php

4. BE AWARE OF HONEYPOTS

Some site designers put honeypot traps inside websites to detect web spiders. These may be links that a normal user can’t see but a spider can.

When following links, always check that the link is properly visible and has no nofollow tag. Some honeypot links meant to detect spiders have the CSS style display:none or are colored to blend in with the page’s background color.
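A rough filter for the two easiest cases above, inline display:none styles and nofollow links, can be sketched with Python's built-in HTML parser. The class and function names are assumptions, and the check is deliberately limited: links hidden through external stylesheets or background-matching colors would need a rendered page to detect.

```python
from html.parser import HTMLParser


class VisibleLinkCollector(HTMLParser):
    """Collect href values from <a> tags that are not obviously hidden."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # Normalize the inline style so "display: none" and "display:none" match.
        style = (attr.get("style") or "").replace(" ", "").lower()
        rel = (attr.get("rel") or "").lower().split()
        hidden = "display:none" in style or "visibility:hidden" in style
        if hidden or "nofollow" in rel:
            return  # skip links that look like traps
        self.links.append(href)


def safe_links(html):
    """Return the hrefs in html that pass the visibility and nofollow checks."""
    parser = VisibleLinkCollector()
    parser.feed(html)
    return parser.links
```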

5. DO NOT ALWAYS FOLLOW THE SAME CRAWLING PATTERN

Only robots follow the same crawling pattern every time. Sites with intelligent anti-crawling mechanisms can easily detect spiders by finding patterns in their actions, because humans won’t repeat the same task many times over. Incorporate some random clicks on the page, mouse movements, and other random actions that make a spider look like a human client.

6. ALWAYS RESPECT THE robots.txt

All web spiders are supposed to follow the rules that a site places in its robots.txt file, such as how frequently they may request pages and which directories they may crawl. They should also supply a consistent, valid User-Agent string that identifies the requests as coming from a bot.
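Python's standard library includes a robots.txt parser, so a check before each request could be sketched as below. The robots.txt content and the bot name are made up so the example runs offline. A real crawler would load the file from the site itself.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, parsed from text so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

BOT_NAME = "example-bot"  # hypothetical crawler name


def load_rules(robots_text):
    """Parse robots.txt text and return the populated parser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
allowed_home = rules.can_fetch(BOT_NAME, "http://example.com/index.html")
allowed_private = rules.can_fetch(BOT_NAME, "http://example.com/private/data.html")
delay_seconds = rules.crawl_delay(BOT_NAME)  # seconds to wait between requests, if set
```

Here allowed_home is True and allowed_private is False, and the crawl delay can be fed into whatever sleep logic the spider uses.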

Source: http://learn.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/

Friday 22 May 2015

How Web Data Extraction Services Impact Startups

Starting a business has its fair share of ebbs and flows – it can be extremely challenging to get a new business off the blocks, and extremely rewarding when everything goes according to plan and yields desired results. For startups, it is important to get the nuances of running a business right from day one. To succeed in an immensely competitive space, startups need to perform above and beyond expectation right from the start, and one of the factors that can be of great help during the growing years of a startup is web data extraction.

Web data extraction through crawling and scraping, a highly efficient information gathering process, can be used in many creative ways to bring about major change in the performance graph of a startup. With effective web data extraction services acquired by outsourcing to a reputed company, the business intelligence gathered and the numerous possibilities associated with it, web crawling and extraction services can indeed become the difference maker for a startup, propelling it to the heights of success.

What drives the success of web data extraction?

When it comes to figuring out the perfect, balanced web data collection methodology for startups, there are a lot of crucial factors that come into play. Some of these are associated with the technical aspects of data collection, the approach used, the time invested, and the tools involved. Others have more to do with the processing and analysis of collected information and its judicious use in formulating strategies to take things forward.

Web Crawling Services & Web Scraping Services

With the advent of highly professional web data extraction services providers, massive amounts of structured, relevant data can be gathered and stored in real time, and in time, productively used to further the business interests of a startup. As a new business owner, it is important to have a high-level knowledge of the modern and highly functional web scraping tools available for use. This will help to utilize the prowess of competent data extraction services. This in turn can assist both in the immediate and long-term revenue generation context.

Web Data Extraction for Startups

From the very beginning, the dynamics of startups are different from those of older, well-established businesses. The time a new business takes to prove its capabilities and market position needs to be used completely and effectively. Every day of growth and learning needs to add up to make a substantial difference. In this period, every plan and strategy, every execution effort, and every move needs to be properly thought out.

In such a trying situation where there is little margin for error, it pays to have accurate, reliable, relevant and actionable business intelligence. This can put you in firm control of things by allowing you to make informed business decisions and formulate targeted, relevant and growth oriented business strategies. With powerful web crawling, the volume of data gathered is varied, accurate and relevant. This data can then be studied minutely, analyzed in detail and arranged into meaningful clusters. With this weapon in your arsenal, you can take your startup a long way with smart decisions and clever implementations.

Web data extraction is a task best handled by professionals who have had rich experience in the field. Often, in-house web scraping teams are difficult to assemble and not economically viable to maintain, especially for startups. For a better solution, you can outsource your web scraping needs to a reliable web data extraction service for data collection. This way, you can get all the relevant intelligence you need without overstraining your workforce or having to employ additional personnel to handle web scraping. The company you outsource your work to can easily scrape data from multiple sources as per your requirements, and furnish you with actionable business intelligence that can help you take a lead in a competitive market.

Different Ways for Startups to use Web Data Extraction

Web scraping can be employed for many different purposes to yield different kinds of relevant data that generate actionable insights. For a startup, the important decision is how to use this powerful technique to provide valuable information that can make a difference for the future prospects of the company. Here are some interesting possibilities when it comes to impactful web data extraction for startups –

Fishing for Social Rankings and Backlinks

One of the most important business processes for a startup is competition analysis. This is one area where web data extraction can come across as an invaluable enabler. In the past, many startups have effectively used web scraping to fish for backlinks and social rankings related to competing companies.

Backlinks are important to reach a greater mass of better-targeted audiences, which can go on to increase customer base with minimal efforts. Social ranking is also an immensely important factor, as social actions on the internet are building blocks of opinion and reputation generation in this day and age. Keeping this in mind, you can use web data extraction to scrape for social rankings and backlinks related to content generated by your competing companies. After careful analysis, it is possible to arrive at concrete conclusions regarding what your competitors are doing well, and what sells the best.

This information is gold for marketers and sales personnel, and can be used to discern exactly what needs to be done to increase social buzz, generate favorable opinion, and win over customers from your competitors. You can also use this technique to develop high authority backlinks that help with SEO, targeted reach and organic traffic for your business website. For competition analysis, web scraping is a formidable tool.

Sourcing Contact Information

Another important aspect of business that startups can never ignore is good networking. Whether it is with customers, prospective customers, industry peers, partners, or competitors, excellent networking and open, transparent communication is essential for the success of your startup. For effective communication and networking, you need a large, solid list of contact information pertaining to your exact requirements.

Scraping data from multiple web sources gives you the perfect method of achieving this. With automated, fast web scraping, you can in a short time collect a wealth of important contact information that can be leveraged in many different ways. Whether it is the formation of lasting business relationships or making potential customers aware of what you have on offer, this information has the power to propel your startup to new levels of recognition.

For Ecommerce

If you sell your products and services online and want to stay on top of the competition when it comes to variety, pricing analysis, and special deals and offers, web scraping is the way to go. For many e-commerce startups, the problem of high CTR and low conversion is a stumbling block to higher bottom lines. To remedy problems like these and to ensure better sales, it is always a good idea to have a clear insight about your competition.

With web data extraction, you can be always aware of what competing companies are doing in terms of pricing strategies, product diversity and special customer offers. By considering that information while evaluating and cementing your own strategies, you can always ensure that you provide better value and range of products and services than your competitors, and therefore stay ahead of the competition.

For Marketing, Brand Promotion and Advertisement

For startups, the first wave of promotion and marketing is the one that holds the key to your long-term business success. It is during this phase that the first and most important public perception of your company is formed, and the rudiments of public opinion start taking shape. For this reason, it is crucial to be on point with your marketing and promotion during the early, formative years of your business.

To achieve this, you need a clear, in-depth understanding of your target audience. You need to categorize your target audience on the basis of many factors like age, gender, demographics, income groups and tastes and preferences. Such detailed understanding can only be possible when you have a large wealth of social data pertaining to your target audience. There is no better way of achieving this than by web data extraction.

With the help of data extraction services, you can gather large chunks of relevant data regarding your target audience, which can help you accurately evaluate the potential of each prospective customer as a possible addition to your business family. To ensure that you have a steady, early wave of customers to take your business off the blocks at a rapid pace, you need to devise marketing campaigns, promotional strategies and advertisements in accordance with the customer knowledge you derive through your web scraping efforts. This is a foolproof strategy to have marketing and promotional plans in place that achieve goals, bring in new business and provide your company with enough initial momentum to carry it through the later years of success.

To conclude, web data extraction can be a valuable tool in the hands of a startup. With the proper use and leveraging of this technique, your startup can gather the required business intelligence to shine in a competitive market and become a favorite with the customer base. Working with the right web data extraction company can be one of the most important business decisions you make as a startup owner.

Source: https://www.promptcloud.com/blog/web-data-extraction-services-for-startups/

Monday 18 May 2015

What is Blog Scraping Service?

Blog scraping is one of the best services for increasing a site’s traffic. In SEO terms, it means commenting on blogs or writing reviews of them. Most blogs allow their readers to post reviews, comments, suggestions, ideas, or thoughts.

Nowadays the internet has a great number of blogs and sites covering all kinds of topics and products. The main concept of this service is to increase a website’s traffic by commenting on other people’s blogs. This is a very simple and easy method. The main difficulty is getting approval from the site’s moderator, which may take a long time or may never come.

Hence Web Scraping SEO plans to provide this blog scraping service without waiting for approval, as many moderators do not have the time to read and approve every comment written by visitors. We will find high-PR pages on blogs related to your website’s content, write our own comments on those blogs, and include a link to your website or anchor text. We have no way to track whether a comment has been approved by the moderator. As a report, we will give you the links along with the comments we posted on the blogs. This will increase your backlinks and your traffic.

What are the features of Blog scraping Service?

•    Comments or reviews are posted on blogs in a niche related to your product.
•    Comments are written only on high-density or high-ranking blogs.
•    Faster and more accurate promotion compared to other services.
•    Each blog is read carefully and commented on accordingly.
•    The service is optimized and SEO friendly.

What are the benefits of Blog scraping Service?

•    The service takes very little of your time.
•    It is one of the best methods to increase your site traffic with minimal effort and cost.
•    Increases your website’s rank in all search engines.
•    Brings your site to a larger audience.
•    Increases your product sales.
•    More results, faster.

What are the advantages of using this service from Web Scraping SEO?

•    Web Scraping SEO is one of the top SEO service providers in the SEO market.
•    Experts working on the blog commenting service continually analyze blogs to find the high-traffic ones.
•    Web Scraping SEO gets approval from blog administrators easily.
•    Provides high-quality service at a reasonable price.
•    Provides on-time delivery.
•    More flexible for clients.
•    Always meets client expectations and provides quality service.

Frequently Asked Questions

Q: Will you get approval from the blog site’s moderator for each comment you post?

A: No. We are only responsible for creating comments for your website, and we won’t wait for moderation. Approval is the moderator’s responsibility and may take time, at the moderator’s discretion. We will give you only the blog links and the comments as a report.

Q: Do you have any system or software to track the approval of comments?

A: We don’t have any system or software to track approval. We comment on top blog sites that match your keywords. That is our only job; approval is on the moderator’s side.

Q: Why can’t you get approval for comments from the moderator?

A: Nowadays everyone is busy, blog site moderators in particular, so our comments are approved late. We are not going to wait for that, because we have a lot of work to do. But we assure you that you will receive a final report, in MS Excel format, listing the sites where we have posted your comments.

Q: How do you select the blogs for commenting?

A: We select top-ranking blog sites related to your keywords and write suitable, attractive comments based on the benefits of your product.

Source: http://www.Web Scrapingseo.com/blog-scraping-service.aspx

Thursday 14 May 2015

Web Scraping: Startups, Services & Market

I recently became interested in startups that use web scraping in one way or another, and since I find the topic very interesting I wanted to share some thoughts with you. [Note that I’m not an expert. To correct me / share your knowledge please use the comment section]

Web scraping is anything but a new technique. However, with more and more data shared on the internet (from user generated content like social networks & review websites to public/government data and the growing number of online services), the amount of data collected and the number of possible use cases are increasing at an incredible pace.

We’ve entered the age of “Big Data” and web scraping is one of the sources to feed big data engines with fresh new data, be it for predictive analytics, competition monitoring or simply to steal data.

From what I could see the startups and services which are using “web scraping” at their core can be divided into three categories:

•    the shovel sellers (a.k.a we sell you the technology to do web scraping)

•    the shovel users (a.k.a we use web scraping to extract gold and sell it to our users)

•    the shovel police (a.k.a the security services which are here to protect website owners from these bots)

The shovel sellers

From a technology point of view, efficient web scraping is quite complicated. There are a number of open source projects (like Beautiful Soup) which enable anyone to get a web scraper up and running by themselves. However, it’s a whole different story when scraping has to be the core of your business and you need not only to maintain your scrapers but also to scale them and to extract the data you need smartly.

This is the reason why more and more services are selling “web scraping” as a service. Their job is to take care of the technical aspects so you can get the data you need without any technical knowledge. Here are some examples of such services:

•    Grepsr
•    Krakio
•    import.io
•    promptcloud
•    80legs
•    Proxymesh (funny service: it provides a proxy rotator for web scraping. A shovel seller for shovel sellers, in a way)
•    scrapingHub
•    mozenda

The shovel users

It’s the layer above. Web scraping is the technical layer. What is interesting is to make sense of the data you collect. The number of business applications for web scraping is only increasing and some startups are really using it in a truly innovative way to provide a lot of value to their customers.

Basically these startups take care of collecting data, then extract the value out of it to sell to their customers. Here are some examples:

Sales intelligence. The scrapers screen marketplaces, competitors, data from public markets, online directories (and more) to find leads. Datanyze, for example, tracks websites which add or drop javascript tags from your competitors so you can contact them as qualified leads.

Marketing. Web scraping can be used to monitor how your competitors are performing. From reviews they get on marketplaces to press coverage and published financial data, you can learn a lot. Concerning marketing, there is even a growth hacking class on udemy that teaches you how to leverage scraping for marketing purposes.

Price Intelligence. A very common use case is price monitoring. Whether it’s in the travel, e-commerce or real-estate industry monitoring your competitors’ prices and adjusting yours accordingly is often key. These services not only monitor prices but with their predictive algorithms they can give you advice on where the puck will be. Ex: WisePricer, Pricing Assistant.

Economic intelligence, finance intelligence, etc. With more and more economic, financial and political data available online, a new breed of services which collect and make sense of it is rising. Ex: connotate.

The shovel police

Web scraping lies in a gray area. Depending on the country or the terms of service of each website, automatically collecting data via robots can be illegal. Whatever the laws say it becomes crucial for some services to try to block these crawlers to protect themselves. The IT security industry has understood it and some startups are starting to tackle this problem. Here are 3 services which claim to provide solutions to stop bots from crawling your website:

•    Distil
•    ScrapeSentry
•    Fireblade

From a market point of view

A couple of points on the market to conclude:

•    It’s hard to assess how big the “web scraping economy” is since it is at the intersection of several big industries (billion dollars): IT security, sales, marketing & finance intelligence. This technique is of course a small component of these industries but is likely to grow in the years to come.

•    A whole underground economy also exists since a lot of web scraping is done through “botnets” (networks of infected computers)

•    It’s a safe bet to say that more and more SaaS (like Datanyze or Pricing Assistant) will find innovative applications for web scraping. And more and more startups will tackle web scraping from the security point of view.

•    Since these startups are often entering big markets through a niche product / approach (web scraping is not the solution to everything, it is more of a feature) they are likely to be acquired by bigger players (in the security, marketing or sales tools industries). The technological barriers are there.

Source: http://clementvouillon.com/article/web-scraping-startups-services-market/

Monday 4 May 2015

Earn Money From Price Comparison Through Web Scraping

Many individuals discover the pot of gold just within their reach. They have realized that there is money in the web. Cyber technology has blessed mankind with so many benefits that earning money is possible with just a few clicks of the mouse and taps on the keyboard. Building a price comparison website is an effective way of helping clients find their desired products while you as the owner earn money at the same time.

Building price comparison websites

There is indeed much money in building price comparison websites but it is not an easy task especially for a novice in maintaining a website of one’s own. Since this entails some serious programming and ample familiarity with data feeds, you have to have a good working plan. In addition, what you are venturing into is greater than the usual blogs about just anything that you can think of. Furthermore, you are stepping into the vast field of electronic marketing, therefore you must be ready.

The first point of consideration is to identify which products or services you are going to include in your website. Choose a product or service that both you and a majority of clients are interested in. Suppose you choose sports as your theme; then you can include items and prices for sports gear, clothing such as uniforms, training videos, books, and safety equipment. You need to do some research and even a survey to determine whether the goods and services you are promoting on your website are in demand and are what most people want to know about. Moreover, it is at this stage that you may need the help of experts and veterans in building such sites to be assured that you are on the right track.

In addition, be willing to change in case your chosen category is not gaining readership or visitors. Then evaluate whether you need to expand or to be more specific in your description of the products and the comparison of the prices. Make your site prominent through search engine optimization (SEO), and bear in mind that not many people visit a site that is not free.

Helping visitors choose the best product/services

Good marketing strategy starts with knowing who your target audience is. There is indeed a need to do a lot of planning and research in order to understand your clients’ needs and preferences. Moreover, knowing them thoroughly leads to achieving 100% consumer satisfaction. When you have provided everything they need to know about certain products, they will not need to look elsewhere, which will also gain you more regular visitors. Remember that your audience are members of communities and social networks, so there is a great possibility that they will spread the word about the good services you are offering.

If there is a need to conduct a survey in addition to research, you should resort to it. In this manner you can discover what goods and services are not yet completely exhausted by the other websites or web creators. Ample knowledge about your potential visitors and consumers will surely make you effectively provide them with adequate statistics for their needs.

Your site will then look like a complete guidebook for them that will give them the best value for their money. Therefore, it must be thoroughly filled with product details, uses, options, and prices.

Making money as affiliate of eCommerce websites

Maintaining a price comparison website gives you less worry about getting paid or having your products bought and sold because income comes in through advertising and affiliate sales. Affiliate marketing is a way of earning money online by serving as a publisher for promotion of products, services or sites of businesses. The affiliate receives rewards from businesses for each visitor or client that comes to the business website or buys its product through the efforts of the advertising and promotion that is made by the affiliate. This is the online version of the concept of agent or referral fee sales channel. Aside from website owners, bloggers as well as members of community forums can also serve as affiliates. The affiliate earns money in three ways: through pay per link; pay per sale and pay per lead.

Trust in the reliability of the product - You should have a personal belief or confidence in the product you are promoting not only because it makes you sound more convincing, but also because you need to maintain your clients and establish credibility in your blog or website. In other words, don’t just pick any product. If you cannot use them personally, they should at least have several positive reviews and no negative ones.

Maintain credibility with readers and fellow bloggers - Befriend your readers and your co-bloggers by answering their queries sincerely and quickly. Your friendly attitude can win you their trust which is a very vital element of affiliate marketing.

Do reviews - In addition to publishing price comparisons, you can gain more visitors by writing about the product and doing proper SEO (Search Engine Optimization). As you would expect, the more prominent the product becomes online, the higher your income will be.

Link with friends through social media - Your friends have friends, and their friends have friends too. Just think of how powerful your social media presence can be when you post your link on your account on Facebook, Twitter, MySpace and others. Since trust is built on friendship, it is easy to get clients from among your friends and their friends.

Overall, you get all pertinent information about certain products through web data mining or web scraping. All you need to do is stay attentive to the needs of your clients and use web content extraction efficiently.
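Once prices have been scraped, the comparison step itself is simple. The Python sketch below picks the cheapest store for each product. The function name is hypothetical, and the (product, store, price) tuple layout is an assumption about how scraped offers might be stored.

```python
def cheapest_offers(offers):
    """Map each product to its lowest-priced (store, price) pair.

    `offers` is an iterable of (product, store, price) tuples.
    """
    best = {}
    for product, store, price in offers:
        # Keep the offer only if it is the first or the cheapest seen so far.
        if product not in best or price < best[product][1]:
            best[product] = (store, price)
    return best
```

For example, given offers for a football at 19.99 from one store and 17.50 from another, the function keeps only the 17.50 entry for that product.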

Source: http://www.loginworks.com/earn-money-price-comparison-web-scraping/

Wednesday 29 April 2015

A Guide to Web Scraping Tools

Web scrapers are tools designed to extract or gather data from a website via a crawling engine, usually written in Java, Python, Ruby, or another programming language. Web scrapers are also called web data extractors, data harvesters, crawlers, and so on; most of them are web-based or can be installed on a local desktop.

Their main purpose is to enable webmasters, bloggers, journalists, and virtual assistants to harvest data from a website, whether text, numbers, contact details, or images, in a structured way that cannot easily be done through manual copy and paste. Typically, a scraper transforms unstructured data on the web from HTML into structured data stored in a local database or spreadsheet, or automates human web browsing.
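As a small illustration of turning HTML into spreadsheet-ready data, the Python sketch below reads the cells of an HTML table and writes them out as CSV, using only the standard library. The class and function names are assumptions, and real pages usually need more robust parsing than this.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html):
    """Return the table rows found in html as CSV text."""
    parser = TableRows()
    parser.feed(html)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()
```

The CSV text can be saved to a file and opened in any spreadsheet program.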

Web Scraper Usage

Web scrapers are also used by SEO and online marketing analysts to privately pull data from a competitor’s website, such as highly targeted keywords, valuable links, emails, and traffic sources, much as SEOClerk, Google, and many other web crawling sites do.

Other uses include:

•    Price comparison
•    Weather data monitoring
•    Website change detection
•    Research
•    Web mashups
•    Infographics
•    Web data integration
•    Web indexing & rank checking
•    Analysis of a website’s quality links
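Of the uses listed above, website change detection is easy to sketch: store a hash of the page content and compare it on the next visit. The Python example below is a minimal version with hypothetical function names. It ignores whitespace differences but would still flag any change in markup or text.

```python
import hashlib


def fingerprint(page_text):
    """Hash the page text after collapsing runs of whitespace."""
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint, page_text):
    """Return True when the page no longer matches the stored fingerprint."""
    return fingerprint(page_text) != previous_fingerprint
```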

List of Popular Web Scrapers

There are hundreds of web scrapers available today for both commercial and personal use. If you’ve never done any web scraping before, there are basic web scraping tools like Yahoo Pipes, Google Web Scrapers, and the OutWit Firefox extensions that are good to start with. If you need something more flexible with extra functionality, check out the following:

HarvestMan [ Free Open Source]

HarvestMan is a web crawler application written in the Python programming language. HarvestMan can be used to download files from websites, according to a number of user-specified rules. The latest version of HarvestMan supports more than 60 customization options. HarvestMan is a console (command-line) application. HarvestMan is the only open source, multithreaded web-crawler program written in the Python language. HarvestMan is released under the GNU General Public License. Like Scrapy, HarvestMan is truly flexible; however, your first installation will not be easy.

Scraperwiki [Commercial]

With minimal programming you will be able to extract anything. Of course, you can also request a private scraper if there’s an exclusive in there you want to protect. In other words, it’s a marketplace for data scraping.

Scraperwiki is a site that encourages programmers, journalists and anyone else to take online information and turn it into legitimate datasets. It’s a great resource for learning how to do your own “real” scrapes using Ruby, Python or PHP. But it’s also a good way to cheat the system a little bit. You can search the existing scrapes to see if your target website has already been done. But there’s another cool feature where you can request new scrapers be built. All in all, a fantastic tool for learning more about scraping and getting the desired results while sharpening your own skills.

Best use: Request help with a scrape, or find a similar scrape to adapt for your purposes.

FiveFilters.org [Commercial]

FiveFilters.org is an online web scraper available for commercial use. It provides easy content extraction using its Full-Text RSS tool, which can identify and extract web content (news articles, blog posts, Wikipedia entries, and more) and return it in an easy-to-parse format. Advantages: speedy article extraction, multi-page support, autodetection, and deployment on a cloud server with no database required.

Kimono

Produced by Kimono Labs, this tool lets you convert data into APIs for automated export. Benjamin Spiegel did a great YouMoz post on how to build a custom ranking tool with Kimono, well worth checking out!

Mozenda [Commercial]

This is a unique tool for web data extraction or web scraping, designed to be the easiest and fastest way for anyone to get data from the web. It has a point & click interface, and with the power of the cloud you can scrape, store, and manage your data, all with Mozenda’s incredible back-end hardware. Going further, you can automate your data extraction without leaving a trace by using Mozenda’s anonymous proxy feature, which can rotate through tons of IPs.

Need that data on a schedule? Every day? Each hour? Mozenda takes the hassle out of automating and publishing extracted data. Tell Mozenda what data you want once, and then get it however frequently you need it. Plus, it allows advanced programming through a REST API, so the user can connect directly to a Mozenda account.

Mozenda’s Data Mining Software is packed full of useful applications, especially for salespeople. You can do things such as lead generation, forecasting, acquiring information for establishing budgets, and competitor pricing analysis. This software is a great companion for creating marketing and sales plans.

Using the Refine Capture Text tool, Mozenda is smart enough to filter the text so that it stays clean, extract specific text, or split it into pieces.

80Legs [Commercial]

The first time I heard about 80Legs, I was really confused about what this software does. 80Legs, like Mozenda, is a web-based data extraction tool with customizable features:

•    Select which websites to crawl by entering URLs or uploading a seed list
•    Specify what data to extract by using a pre-built extractor or creating your own
•    Run a directed or general web crawler
•    Select how many web pages you want to crawl
•    Choose specific file types to analyze

80legs offers customized web crawling that lets you get very specific about your crawling parameters, which tell 80legs what web pages you want to crawl and what data to collect from those web pages. It also offers general web crawling, which can collect data like web page content, outgoing links and other data. Large web crawls take advantage of 80legs’ ability to run massively parallel crawls.

It also crawls data feeds and offers web extraction design services. (No installation needed.)

ScrapeBox [Commercial]

ScrapeBox is one of the most popular web scraping tools among SEO experts, online marketers and even spammers. With its very user-friendly interface you can easily harvest data from a website:

•    Grab Emails
•    Check page rank
•    Check high-value backlinks
•    Export URLs
•    Check index status
•    Verify working proxies
•    Powerful RSS Submission

Using thousands of rotating proxies, you will be able to sneak a look at a competitor’s site keywords, do research on .gov sites, harvest data, and comment without getting blocked.

The latest updates allow users to spin comments and anchor text to avoid getting detected by search engines.

You can also check out my guide to using Scrapebox for finding guest posting opportunities:

Scrape.it [Commercial]

Using a simple point & click Chrome Extension tool, you can extract data from websites that render in JavaScript. You can automate filling out forms, extract data from popups, navigate and crawl links across multiple pages, and extract images from even the most complex websites with very little learning curve. Schedule jobs to run at regular intervals.

When a website changes layout or your web scraper stops working, scrape.it will fix it automatically so that you can continue to receive data uninterrupted and without the need for you to recreate or edit it yourself.

They work with enterprises, using a tool they built in-house to deliver fully managed solutions for competitive pricing analysis, business intelligence, market research, lead generation, process automation and compliance & risk management requirements.

Features:

    Very easy web data extraction with a Windows Explorer-like interface

    Lets you extract text, images and files from modern Web 2.0 and HTML5 websites that use JavaScript & AJAX

    Users can select which features they are going to pay for

    Lifetime upgrades and support at no extra charge on the premium license

Scrapy [Free Open Source]

Of course, the list would not be cool without Scrapy. It is a fast, high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Features:

•         Designed with simplicity: just write the rules to extract the data from web pages and let Scrapy crawl the entire web site. It can crawl 500 retailers’ sites daily.

•         Ability to attach new code for extensibility without having to touch the framework core

•         Portable, open-source, 100% Python: Scrapy is completely written in Python and runs on Linux, Windows, Mac and BSD

•         Scrapy comes with lots of functionality built in.

•         Scrapy is extensively documented and has a comprehensive test suite with very good code coverage

•         Good community and commercial support

Cons: The installation process is hard to perfect especially for beginners

Needlebase [Commercial]

Many organizations, from private companies to government agencies, store their info in a searchable database that requires you to navigate a list page listing results, and a detail page with more information about each result. Grabbing all this information could result in thousands of clicks, but as long as it fits the same formula, Needlebase can do it for you. Point and click on example data from one page once to show Needlebase how your site is structured, and it will use that pattern to extract the information you’re looking for into a dataset. You can query the data through Needle’s site, or you can output it as a CSV or other file format of your choice. Needlebase can also rerun your scraper every day to continuously update your dataset.

OutwitHub [Free]

This Firefox extension is one of the more robust free products that exist. Write your own formula to help it find information you’re looking for, or just tell it to download all the PDFs listed on a given page. It will suggest certain pieces of information it can extract easily, but it’s flexible enough for you to be very specific in directing it. The documentation for Outwit is especially well written, and they even have a number of tutorials for what you might be looking to do. So if you can’t easily figure out how to accomplish what you want, investing a little time to push it further can go a long way.

Best use: more text

irobotsoft [Free]

This is a free program that is essentially a GUI for web scraping. There’s a pretty steep learning curve to figure out how to work it, and the documentation appears to reference an old version of the software. It’s the latest in a long tradition of tools that let a user click through the logic of web scraping. Generally, these are a good way to wrap your head around the moving parts of a scrape, but the products have drawbacks of their own that make them little easier than doing the same thing with scripts.

Cons: The documentation seems outdated

Best use: Slightly complex scrapes involving multiple layers.

iMacros [Free]

Following the same ethos as Microsoft macros, iMacros automates repetitive tasks. Whether you choose the website, Firefox extension, or Internet Explorer add-on flavor of this tool, it can automate navigating through the structure of a website to get to the piece of info you care about. Record your actions once, navigating to a specific page, and entering a search term or username where appropriate. Especially useful for navigating to a specific stock you care about, or campaign contribution data that’s mired deep in an agency website and lacks a unique Web address. Extract that key piece (pieces) of info into a usable form. Can also help convert Web tables into usable data, but OutwitHub is really more suited to that purpose. Helpful video and text tutorials enable you to get up to speed quickly.

Best use: Eliminate repetition in navigating to a particular datapoint in a website that you’re checking up on often by recording a repeatable action that pulls the datapoint out of the clutter it’s naturally surrounded by.

InfoExtractor [Commercial]

This is a neat little web service that generates all sorts of information given a list of urls. Currently, it only works for YouTube video pages, YouTube user profile pages, Wikipedia entries, Huffingtonpost posts, Blogcatalog blog posts and The Heritage Foundation blog (The Foundry). Given a url, the tool will return structured information including title, tags, view count, comments and so on.

Google Web Scraper [Free]

A browser-based web scraper that works like Firefox’s Outwit Hub, it’s designed for plain text extraction from any online page and export to spreadsheets via Google Docs. Google Web Scraper can be downloaded as an extension and installed in your Chrome browser within seconds. To use it: highlight a part of the webpage you’d like to scrape, right-click and choose “Scrape similar…”. Anything that’s similar to what you highlighted will be rendered in a table ready for export, compatible with Google Docs™. The latest version still had some bugs on spreadsheets.

Cons: It doesn’t work for images, and sometimes it can’t perform well on huge volumes of text, but it’s easy and fast to use.

Tutorials:

Scraping Website Images Manually using Google Inspect Elements

The main purpose of Google Inspect Elements is debugging, like Firebug in Firefox. However, if you’re flexible, you can also use this tool for harvesting images from a website. Your main goal is to get specific images like web backgrounds, buttons, banners, header images and product images, which is very useful for web designers.

Now, this is a very easy task. First, you will definitely need to download and install the Google Chrome browser on your computer. After the installation, do the following:

1. Open the desired webpage in Google Chrome

2. Highlight any part of the website and right click > choose Google Inspect Elements

3. In the Google Inspect Elements, go to Resources tab

4. Under Resources tab, expand all folders. You will eventually see script folders and IMAGES folders

5. In the Images folders, just use arrow keys to find the images you need to have (see the screenshot above)

6. Next, right click the images and choose Open the Image in New Tab

7. Finally, right click the image > choose Save Image As… . (save to your local folder)

You’re done!

How to Extract Links from a Web Page with OutWit Hub

In this tutorial we are going to learn how to extract links from a webpage with OutWit Hub.

Sometimes it can be useful to extract all links from a given web page. OutWit Hub is the easiest way to achieve this goal.

1. Launch OutWit Hub

If you haven’t installed OutWit Hub yet, please refer to the Getting Started with OutWit Hub tutorial.

Begin by launching OutWit Hub from Firefox. Open Firefox then click on the OutWit Button in the toolbar.

If the icon is not visible go to the menu bar and select Tools -> OutWit -> OutWit Hub

OutWit Hub will open displaying the Web page currently loaded on Firefox.

2. Go to the Desired Web Page

In the address bar, type the URL of the Website.

Go to the Page view where you can see the Web page as it would appear in a traditional browser.

Now, select “Links” from the view list.

In the “Links” widget, OutWit Hub displays all the links from the current page.

If you want to export results to Excel, just select all links using ctrl/cmd + A, then copy using ctrl/cmd + C and paste it in Excel (ctrl/cmd + V).
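The same link extraction can also be sketched in a few lines of code. This is only a rough illustration using Python's standard library, not how OutWit Hub works internally; the names `LinkCollector`, `extract_links` and `links_to_csv`, and the example.com addresses, are made up for the example. It assumes the page's HTML has already been downloaded into a string.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn relative links such as "/about" into absolute URLs.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return all link targets found in html, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def links_to_csv(links):
    """Return the links as one-column CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["url"])
    for link in links:
        writer.writerow([link])
    return buffer.getvalue()


# Hypothetical sample page used only for demonstration.
sample_page = '<a href="/about">About</a> <a href="http://other.example/x">X</a>'
print(links_to_csv(extract_links(sample_page, "http://example.com/")))
```

Saving the returned text to a .csv file gives roughly the same result as the copy and paste into Excel described above.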

Source: http://www.garethjames.net/a-guide-to-web-scrapping-tools/

Monday 27 April 2015

Social Media Crawling & Scraping services for Brand Monitoring

Crawling social media sites for extracting information is a fairly new concept – mainly due to the fact that most of the social media networking sites have cropped up in the last decade or so. But it’s equally (if not more) important to grab this ever-expanding User-Generated-Content (UGC) as this is the data that companies are interested in the most – such as product/service reviews, feedback, complaints, brand monitoring, brand analysis, competitor analysis, overall sentiment towards the brand, and so on.

Scraping social networking sites such as Twitter, Linkedin, Google Plus, Instagram etc. is not an easy task for in-house data acquisition departments of most companies as these sites have complex structures and also restrict the amount and frequency of the data that they let out to crawlers. This kind of a task is best left to an expert, such as PromptCloud’s Social Media Data Acquisition Service – which can take care of your end-to-end requirements and provide you with the desired data in a minimal turnaround time. Most of the popular social networking sites such as Twitter and Facebook let crawlers extract data only through their own API (Application Programming Interface), so as to control the amount of information about their users and their activities.

PromptCloud respects all these restrictions with respect to access to content and frequency of hitting their servers to make sure that user information is not compromised and their experience with the site is unhindered.

Social Media Scraping Experts

At PromptCloud, we have developed an expertise in crawling and scraping social media data in real-time. Such data can be from diverse sources such as – Twitter, Linkedin groups, blogs, news, reviews etc. Popular usage of this data is in brand monitoring, trend watching, sentiment/competitor analysis & customer service, among others.

Our low-latency component can extract data on the basis of specific keywords, categories, geographies, or a combination of these. We can also take care of complexities such as multiple languages as well as tweets and profiles of specific users (based on keywords or geographies). Sample XML data can be accessed through this link – demo.promptcloud.com.

Structured data is delivered via a single REST-based API and every time new content is published, the feed gets updated automatically. We also provide data in any other preferred formats (XML, CSV, XLS etc.).

If you have a social media data acquisition problem that you want to get solved, please do get in touch with us.

Source: https://www.promptcloud.com/social-media-networking-sites-crawling-service/

Tuesday 21 April 2015

What is HTML Scraping and how it works

There are many reasons why there may be a requirement to pull data or information from other sites, and usually the process begins after checking whether the site has an official API. Very few people are aware of the structured data that every website supports automatically. We are basically talking about pulling data right from the HTML, also referred to as HTML scraping. This is an awesome way of gleaning data and information from third party websites.

Any content that can be viewed on a webpage can be scraped. If a website gives a visitor's browser a way to download content and render it in a structured manner, then that content can also be accessed programmatically.

Before starting on HTML scraping, you can inspect network traffic in the browser. Site owners have a couple of tricks up their sleeve to thwart this access, but the majority of them can be worked around.

Before moving on to how HTML scraping works, we must understand the reasons behind it. Why is scraping needed? Once you get a satisfactory answer to this question, you can start looking for RSS or API feeds or various other traditional structured data forms. It is important to understand that, to the organizations that run them, websites are more important than APIs.

For site owners, the priority is maintaining the website that their many visitors see, not safeguarding structured data feeds. Twitter showed this publicly when it clamped down on its developer ecosystem. API feeds often change or move without any prior warning. Sometimes this is deliberate, but mostly such problems arise because nobody in the organization maintains or takes care of the structured data. If a feed gets severely mangled or goes offline, it is rarely noticed. If the website itself has issues or stops working, on the other hand, the problem is obvious and gets dealt with without losing any time.

Rate limiting is another factor worth thinking about; on public websites it virtually doesn’t exist. Besides some occasional sign-up pages or captchas, many business websites fail to build defenses against unwarranted automated access. Many times, a single website can be scraped for four hours straight without anyone noticing. Unless you are making concurrent requests, there is little chance you will be seen as a DDoS attack. You will just be seen as an avid visitor or an enthusiast in the logs, and that only in case anyone is looking.

Another factor in HTML scraping is that you can easily access any website anonymously. The website's administrator has only a few ways to track your behavior, which is beneficial if you want to gather data privately. With APIs, registration is often required in order to get a key, and this key needs to be sent with every request. With simple and straightforward HTTP requests, however, the visitor can stay anonymous apart from cookies and IP address, which can again be spoofed.

HTML scraping is universally available: there is no need to wait for a site to open up an API or to contact anyone in the organization. You simply need to spend some time browsing the website until you find the data you want, and then work out the basic patterns for accessing it.

Now you need to don the hat of a professional scraper and simply dive in. Initially, it may take some time to figure out how the data is structured and how it can be accessed, just as it does when reading API documentation. Unlike with APIs, there is no documentation, so you need to be a little smarter about it and use clever tricks.

Some of the most used tricks are:

Data Fetching
The first thing that is required is data fetching. Begin by finding the endpoints, that is, the URLs that return the data you need. If you know which data you want and how it should be structured, you will only need a particular subset of the site, which you can reach by browsing with the site's navigation tools.

GET Parameter

Pay attention to the URLs and see how they change as you click between sections and as sections divide into subsections. Alternatively, before starting, you can go straight to the site's search functionality. Type in some terms and watch how the URL changes depending on what you search for. You will probably see a GET parameter like q that changes with the search term you use. Remove the other GET parameters from the URL one at a time until only the ones needed to load the data are left. A query string always begins with a “?”.
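To make the idea concrete, here is a small sketch of trimming and building query strings with Python's standard library. The function names `keep_params` and `search_url`, the example.com address and the `utm_source` and `session` parameters are all invented for illustration; only `q` comes from the text above, and a real site may name its search parameter differently.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def keep_params(url, wanted):
    """Return url with only the query parameters named in wanted."""
    parts = urlsplit(url)
    kept = [(key, value) for key, value in parse_qsl(parts.query) if key in wanted]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def search_url(base, term):
    """Build a search URL whose q parameter carries the search term."""
    return base + "?" + urlencode({"q": term})


# A URL as it might be copied from the address bar, with extra parameters.
noisy = "http://example.com/search?q=shoes&utm_source=mail&session=abc"
print(keep_params(noisy, {"q"}))
print(search_url("http://example.com/search", "red shoes"))
```

Once the minimal URL still returns the data, swapping the value of `q` is all that is needed to run a different search.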

By now you should be seeing the data that you want to access, but there may be pagination to deal with, which keeps you from seeing the data in its entirety on one page. Many APIs likewise limit single requests, to avoid slamming the database. Often, clicking through to the next page adds an offset parameter to the URL that controls which results are shown on the page. All these steps will help you succeed in HTML scraping.
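Walking through paginated results can be sketched as follows. This assumes the site uses a parameter literally named `offset` next to `q`, which is only a guess; real sites may call it `start`, `page` or something else. The helper names `paged_urls` and `collect_all` are made up, and `fetch_page` stands for whatever function you write to download one page and return the items found on it.

```python
from urllib.parse import urlencode


def paged_urls(base, term, page_size, max_pages):
    """Yield one URL per results page, advancing the offset each time."""
    for page in range(max_pages):
        query = urlencode({"q": term, "offset": page * page_size})
        yield f"{base}?{query}"


def collect_all(fetch_page, urls):
    """Call fetch_page on each URL in turn and stop at the first empty page."""
    results = []
    for url in urls:
        items = fetch_page(url)
        if not items:
            break
        results.extend(items)
    return results


for url in paged_urls("http://example.com/search", "shoes", 20, 3):
    print(url)
```

Stopping at the first empty page keeps the scraper from requesting pages that do not exist, and `max_pages` acts as a safety cap.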

Source: https://www.promptcloud.com/blog/what-is-html-scraping-and-how-it-works/

Wednesday 8 April 2015

Thoughts on scraping SERPs and APIs

Google says that scraping keyword rankings is against their policy from what I've read. Bummer. We compile a lot of reports, and manual finding and entry was a pain. Enter Moz! We still manually check and compare, but it's nice having that tool. I'm confused now though about practices and getting SERPs in an automated way. Here are my questions:

Is it against policy to get SERPs from an automated method? If that is the case, isn't Moz breaking this policy with its awesome keyword tracker?
If it's not, and we wanted to grab that kind of data, how would we do it? Right now, Moz's API doesn't offer this data. I thought Raven Tools at one point offered this, but they don't now from what I've read. Are there any APIs out there that we can grab this data from and do what we want with it (let's say, build our own dashboard)?

Thanks for any clarification and input!

Source: http://moz.com/community/q/thoughts-on-scraping-serps-and-apis

Monday 6 April 2015

Some Traps to know and avoid in Web Scraping

In the present day and age, web scraping comes across as a handy tool in the right hands. In essence, web scraping means quickly crawling the web for specific information, using pre-written programs. Scraping efforts are designed to crawl and analyze the data of entire websites, and save the parts that are needed. Many industries have successfully used web scraping to create massive banks of relevant, actionable data that they use on a daily basis to further their business interests and provide better service to customers. This is the age of Big Data, and web scraping is one of the ways in which businesses can tap into this huge data repository and come up with relevant information that aids them in every way.

Web scraping, however, does come with its own share of problems and roadblocks. With every passing day, a growing number of websites are trying to actively minimize the instance of scraping and protect their own data to stay afloat in today’s situation of immense competition. There are several other complications which might arise and several traps that can slow you down during your web scraping pursuits. Knowing about these traps and how to avoid them can be of great help if you want to successfully accomplish your web scraping goals and get the amount of data that you require.

Complications in Web Scraping

Over time, various complications have arisen in the field of web scraping. Many websites have started to get paranoid about data duplication and data security problems and have begun to protect their data in many ways. Some websites are not generally agreeable to the moral and ethical implications of web scraping, and do not want their content to be scraped. There are many places where website owners can set traps and roadblocks to slow down or stop web scraping activities. Major search engines also have a system in place to discourage scraping of search engine results. Last but not least, many websites and web services announce a blanket ban on web scraping and say the same in their terms and conditions, potentially leading to legal issues in the event of any scraping.

Here are some of the most common complications that you might face during your web scraping efforts which you should be particularly aware of –

•    Some locations on the internet might discourage web scraping to prevent data duplication or data theft.

•    Many websites have in place a number of different traps to detect and ban web scraping tools and programs.

•    Certain websites make it clear in their terms and conditions that they consider web scraping an infringement of their privacy and might even consider legal redress.

•    In a number of locations, simple measures are implemented to prevent non-human traffic to websites, making it difficult for web scraping tools to go on collecting data at a fast pace.

To surmount these difficulties, you need a deeper and more insightful understanding of the way web scraping works and also the attitude of website owners towards web scraping efforts. Most major issues can be subverted or quietly avoided if you maintain good working practice during your web scraping efforts and understand the mentality of the people whose sites you are scraping.

Common Problems

With automated scraping, you might face a number of common problems. The behavior of web scraping programs or spiders presents a certain picture to the target website. The website then uses this behavior to distinguish between human users and web scraping spiders. Depending on that information, a website may or may not employ particular web scraping traps to stop your efforts. Some of the commonly employed traps are –

Crawling Pattern Checks – Some websites detect scraping activities by analyzing crawling patterns. Web scraping robots follow a distinct crawling pattern which incorporates repetitive tasks like visiting links and copying content. By carefully analyzing these patterns, websites can determine that they are being caused by a web scraping robot and not a human user, and can take preventive measures.

Honeypots – Some websites have honeypots in their webpages to detect and block web scraping activities. These can be in the form of links that are not visible to human users, being disguised in a certain way. Since your web crawler program does not operate the way a human user does, it can try and scrape information from that link. As a result, the website can detect the scraping effort and block the source IP addresses.

Policies – Some websites make it absolutely apparent in their terms and conditions that they are particularly averse to web scraping activities on their content. This can act as a deterrent and expose you to possible ethical and legal consequences.

Infinite Loops – Your web scraping program can be tricked into visiting the same URL again and again by using certain URL building techniques.
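A common guard against the infinite loop trap is to remember every URL already visited, in a normalized form, and to cap how deep the crawl may go. The sketch below is one possible way to do that with the standard library; the names `canonical` and `Frontier` and the default depth of 5 are arbitrary choices for illustration, and the normalization rules are deliberately simple.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical(url):
    """Reduce equivalent spellings of a URL to a single form."""
    parts = urlsplit(url)
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 compare equal.
    query = urlencode(sorted(parse_qsl(parts.query)))
    # Ignore a trailing slash and drop the #fragment entirely.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))


class Frontier:
    """Decides whether a URL is worth visiting during a crawl."""

    def __init__(self, max_depth=5):
        self.max_depth = max_depth
        self.seen = set()

    def should_visit(self, url, depth):
        key = canonical(url)
        if depth > self.max_depth or key in self.seen:
            return False
        self.seen.add(key)
        return True


frontier = Frontier(max_depth=2)
print(frontier.should_visit("http://example.com/a?x=1&y=2", 0))
print(frontier.should_visit("http://EXAMPLE.com/a/?y=2&x=1#top", 1))
```

The visited set stops the crawler from going around in circles, while the depth cap limits the damage from sites that generate an endless supply of new, unique URLs.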

These traps in web scraping can prove to be detrimental to your efforts and you need to find innovative and effective ways to overcome these problems. Learning some web crawler tips to avoid traps and judiciously using them is a great way of making sure that your web scraping requirements are met without any hassle.
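As an illustration of one such tip, a crawler can refuse to follow links that a human visitor could never see, which is how the honeypots described earlier are usually set up. The sketch below is only a rough heuristic with a made-up class name, `VisibleLinkCollector`. It looks at the `hidden` attribute and at inline styles only, so links hidden through an external stylesheet or script would still get through.

```python
from html.parser import HTMLParser


class VisibleLinkCollector(HTMLParser):
    """Collects hrefs, skipping anchors hidden by attribute or inline style."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # Normalize the inline style so "display: none" and "DISPLAY:NONE" match.
        style = (attributes.get("style") or "").replace(" ", "").lower()
        hidden = (
            "hidden" in attributes
            or "display:none" in style
            or "visibility:hidden" in style
        )
        href = attributes.get("href")
        if href and not hidden:
            self.links.append(href)


page = (
    '<a href="/real">Real</a>'
    '<a href="/trap" style="display: none">x</a>'
    '<a href="/trap2" hidden>y</a>'
)
collector = VisibleLinkCollector()
collector.feed(page)
print(collector.links)
```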

What you can do

The first and foremost rule of thumb about web scraping is that you have to make your efforts as inconspicuous as possible. This way you will not arouse suspicion and negative behavior from your target websites. To this end, you need a well-designed web scraping program with a human touch. Such a program can operate in flexible ways so as to not alert website owners through the usual traffic criteria used to spot scraping tools.

Some of the measures that you can implement to ensure that you steer clear of common web scraping traps are –

•    The first thing that you need to do is to ascertain if a particular website that you are trying to scrape has any particular dislike towards web scraping tools. If you see any indication in their terms and conditions, tread cautiously and stop scraping their website if you receive any notification regarding their lack of approval. Being polite and honest can help you get away with a lot.

•    Try and minimize the load on every single website that you visit for scraping. Putting a high load on websites can alert them towards your intentions and often might cause them to develop a negative attitude. To decrease the overall load on a particular website, there are many techniques that you can employ.

•    Start by caching the pages that you have already crawled to ensure that you do not have to load them again.

•    Also store the URLs of crawled pages.

•    Take things slow and do not flood the website with multiple parallel requests that put a strain on their resources.

•    Handle your scraping in gentle phases and take only the content you require.

•    Your scraping spider should be able to diversify its actions, change its crawling pattern and present a polymorphic front to websites, so as not to cause an alarm and put them on the defensive.

•    Arrive at an optimum crawling speed, so as to not tax the resources and bandwidth of the target website. Use auto throttling mechanisms to optimize web traffic and put random breaks in between page requests, with the lowest possible number of concurrent requests that you can work with.

•    Use multiple IP addresses for your scraping efforts, or take advantage of proxy servers and VPN services. This will help to minimize the danger of getting trapped and blacklisted by a website.

•    Be prepared to understand and respect the express wishes and policies of a website regarding web scraping by taking a good look at the target ‘robots.txt’ file. This file contains clear instructions on the exact pages that you are allowed to crawl, and the requisite intervals between page requests. It might also specify that you use a pre-determined user agent identification string that classifies you as a scraping bot. Adhering to these instructions minimizes the chance of getting on the bad side of website owners and risking bans.
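Checking a robots.txt file does not have to be done by hand. Python's standard library ships a parser for it, and the sketch below feeds it a made-up robots.txt so that it runs without touching the network. The file contents, the `load_rules` helper and the `example-bot` user agent are invented for the example; against a real site you would load the file from its own address instead.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt used only for demonstration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def load_rules(robots_text):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
# Ask whether our (hypothetical) bot may fetch each page.
print(rules.can_fetch("example-bot", "http://example.com/products/1"))
print(rules.can_fetch("example-bot", "http://example.com/private/data"))
# The requested pause between page requests, in seconds, if one is given.
print(rules.crawl_delay("example-bot"))
```

Calling `can_fetch` before every request, and waiting at least `crawl_delay` seconds between requests when a delay is given, covers the two instructions mentioned above.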

Use an advanced tool for web scraping which can store and check data, URLs and patterns. Whether your web scraping needs are confined to one domain or spread over many, you need to appreciate that many website owners do not take kindly to scraping. The trick here is to ensure that you maintain industry best practices while extracting data from websites. This prevents any incident of misunderstanding, and allows you a clear pathway to most of the data sources that you want to leverage for your requirements.
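Two of the load-reducing measures listed above, caching pages already crawled and putting random breaks between requests, can be combined in a small wrapper. This is a minimal sketch rather than a full tool: `PoliteFetcher` is a made-up name, the one to three second delay range is an arbitrary default, and `download` stands for whatever function you use to actually fetch a page.

```python
import random
import time


class PoliteFetcher:
    """Wraps a download function with a page cache and random pauses."""

    def __init__(self, download, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
        self.download = download
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.sleep = sleep
        self.cache = {}

    def get(self, url):
        # Pages already crawled are served from the cache, not reloaded.
        if url in self.cache:
            return self.cache[url]
        # Pause for a random interval before every download after the first.
        if self.cache:
            self.sleep(random.uniform(self.min_delay, self.max_delay))
        body = self.download(url)
        self.cache[url] = body
        return body
```

Requests are made one at a time, so the target site never sees parallel traffic from this fetcher. The `sleep` argument exists only so the pause can be swapped out when trying the class without waiting.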

Hope this article helps in understanding the different traps and roadblocks that you might face during your web scraping endeavors. This will help you in figuring out smart, sensible ways to work around them and make sure that your experience remains smooth. This way, you can keep receiving the important information that you need with web scraping. Following these basic guidelines can help you avoid getting banned or blacklisted and stay in the good books of website owners. This will allow you to continue with your web scraping activities unencumbered.

Source: https://www.promptcloud.com/blog/some-traps-to-avoid-in-web-scraping/