West Virginia Web Scraping

West Virginia Data Scraping, Web Scraping Tennessee, Data Extraction Tennessee, Scraping Web Data, Website Data Scraping, Email Scraping Tennessee, Email Database, Data Scraping Services, Scraping Contact Information, Data Scrubbing

Wednesday, 31 December 2014

Data Extraction, Web Screen Scraping Tool, Mozenda Scraper

Web Scraping

Web scraping, also known as Web data extraction or Web harvesting, is a software method of extracting data from websites. Web scraping is closely related and similar to Web indexing, which indexes Web content. Web indexing is the method used by most search engines. The difference with Web scraping is that it focuses more on the translation of unstructured content on the Web, characteristically in rich text format like that of HTML, into controlled data that can be analyzed stored and in a spreadsheet or database. Web scraping also makes Web browsing more efficient and productive for users. For example, Web scraping automates weather data monitoring, online price comparison, and website change recognition and data integration. 

This clever method that uses specially coded software programs is also used by public agencies. Government operations and Law enforcement authorities use data scrape methods to develop information files useful against crime and evaluation of criminal behaviors. Medical industry researchers get the benefit and use of Web scraping to gather up data and analyze statistics concerning diseases such as AIDS and the most recent strain of influenza like the recent swine flu H1N1 epidemic.

Data scraping is an automatic task performed by a software program that extracts data output from another program, one that is more individual friendly. Data scraping is a helpful device for programmers who have to generate a line through a legacy system when it is no longer reachable with up to date hardware. The data generated with the use of data scraping takes information from something that was planned for use by an end user.

One of the top providers of Web Scraping software, Mozenda, is a Software as a Service company that provides many kinds of users the ability to affordably and simply extract and administer web data. Using Mozenda, individuals will be able to set up agents that regularly extract data then store this data and finally publish the data to numerous locations. Once data is in the Mozenda system, individuals may format and repurpose data and use it in other applications or just use it as intelligence. All data in the Mozenda system is safe and sound and is hosted in a class A data warehouses and may be accessed by users over the internet safely through the Mozenda Web Console.

One other comparative software is called the Djuggler. The Djuggler is used for creating web scrapers and harvesting competitive intelligence and marketing data sought out on the web. With Dijuggles, scripts from a Web scraper may be stored in a format ready for quick use. The adaptable actions supported by the Djuggler software allows for data extraction from all kinds of webpages including dynamic AJAX, pages tucked behind a login, complicated unstructured HTML pages, and much more. This software can also export the information to a variety of formats including Excel and other database programs.

Web scraping software is a ground-breaking device that makes gathering a large amount of information fairly trouble free. The program has many implications for any person or companies who have the need to search for comparable information from a variety of places on the web and place the data into a usable context. This method of finding widespread data in a short amount of time is relatively easy and very cost effective. Web scraping software is used every day for business applications, in the medical industry, for meteorology purposes, law enforcement, and government agencies.


Monday, 29 December 2014

How to scrape address from Google Maps

If you want to build a new online directory based website and want it to be popular with latest web contents, then you need the help of web scraping services from iWeb scraping. If you want to scrape address from maps.google.com, there is a specialized web scraping tool developed by iWeb scraping which can do the job for you. There are plenty of benefits with web scraping which includes market research, gathering customer information, managing product catalogs, compare prices, gather real estate data, gather job posting information etc. Web scraping technology is very popular nowadays and it saves lot of time and effort involved in manual extraction of data from websites.

The web scraping tools developed iWeb Scraping is very user-friendly and can extract specific information from targeted websites. It converts data from HTML web pages to useful formats like Excel spread sheets or Access database. Whatever web scraping requirements you have, you can contact iWeb Scraping as they have more than 3.5 years of web data extraction experience and offer the best prices in the industry. Also their services are available in 24x7 basis and free pilot projects will be done based on request.

Companies which require specific web data and look for an application which can automate the process and export the HTML data in structured format could benefit greatly from web scraping applications of iWeb scraping. You can easily extract data from multiple target websites, parse and re-assemble the information in HTML format to database or spread sheets as you wish. The application has simple point-and-click user-interface and any beginner can use it scrape address from Google Maps. If you want to gather address of people in particular region from Google maps, you can do it with help of web scraping application developed by iWebscraping.

Web Scraping is a technology that able to digest target website databases that are visible only as HTML web pages, and create a local, identical replica of those databases as a information or result. With our web scraping & web data extraction service we can capture web pages, then pin-point specific pieces of data/information you'd like to extract from web pages. What is needed in this process is much more than a Website crawler and set of Website wrappers. The time required to do web data extraction goes down in comparison to manually data copying and pasting job.


Saturday, 27 December 2014

So What Exactly Is A Private Data Scraping Services To Use You?

If your computer connects to the Internet or resources on the request for this information, and queries to different servers. If you have a website to introduce to the site server recognizes your computer's IP address and displays the data and much more. Many e - commerce sites use to log your IP address, and the browsing patterns for marketing purposes.

Related Articles

Follow Some Tips For Data Scraping Services

Web Data Scraping Assuring Scraping Success Proxy Data Services

Data Scraping Services with Proxy Data Scraping

Web Data Extraction Services for Data Collection - Screen Scrapping Services, Data Mining Services

The  Scraping server you connect to your destination or to process your information and make a filter. For example, IP address or protocol filtering traffic through a  Scraping service. As you might guess, there are many types of  Scraping services. including the ability to a high demand for the software. Email messages are quickly sent to businesses and companies to help you search for contacts.

Although there are Sanding free  Scraping IP addresses in this way can work, the use of payment services, and automatic user interface (plug and play) are easy to give.  Scraping web information services, thus offering a variety of relevant sources of data.  Scraping information service organizations are generally used where large amounts of data every day. It is possible for you to receive efficient, high precision is also affordable.

Information on the various strategies that companies,  Scraping excellent information services, and use the structure planned out and has led to the introduction of more rapid relief of the Earth.

In addition, the application software that has flexibility as a priority. In addition, there is a software that can be tailored to the needs of customers, and satisfy various customer requirements play a major role. Particular software, allows businesses to sell, a customer provides the features necessary to provide the best experience.

If you do not use a private Data Scraping Services suggest that you immediately start your Internet marketing. It is an inexpensive but vital to your marketing company. To choose how to set up a private  Scraping service, visit my blog for more information. Data Scraping Services software as the activity data and provides a large amount of information, Sorting. In this way, the company reduced the cost and time savings and greater return on investment will be a concept.

Without the steady stream of data from these sites to get stopped? Scraping HTML page requests sent by argument on the web server, depending on changes in production, it is very likely to break their staff. 

Data Scraping Services is common in the respective outsourcing company. Many companies outsource  Data Scraping Services service companies are increasingly outsourcing these services, and generally dealing with the Internet business-related activities, in particular a lot of money, can earn.

Web  Data Scraping Services, pull information from a structured plan format. Informal or semi-structured data source from the source.They are there to just work on your own server to extract data to execute. IP blocking is not a problem for them when they switch servers in minutes and back on track, scraping exercise. Try this service and you'll see what I mean.

It is an inexpensive but vital to your marketing company. To choose how to set up a private  Scraping service, visit my blog for more information. Data Scraping Services software as the activity data and provides a large amount of information, Sorting. In this way, the company reduced the cost and time savings and greater return on investment will be a concept.


Friday, 26 December 2014

Limitations and Challenges in Effective Web Data Mining

Web data mining and data collection is critical process for many business and market research firms today. Conventional Web data mining techniques involve search engines like Google, Yahoo, AOL, etc and keyword, directory and topic-based searches. Since the Web's existing structure cannot provide high-quality, definite and intelligent information, systematic web data mining may help you get desired business intelligence and relevant data.

Factors that affect the effectiveness of keyword-based searches include:

• Use of general or broad keywords on search engines result in millions of web pages, many of which are totally irrelevant.

• Similar or multi-variant keyword semantics my return ambiguous results. For an instant word panther could be an animal, sports accessory or movie name.

• It is quite possible that you may miss many highly relevant web pages that do not directly include the searched keyword.

The most important factor that prohibits deep web access is the effectiveness of search engine crawlers. Modern search engine crawlers or bot can not access the entire web due to bandwidth limitations. There are thousands of internet databases that can offer high-quality, editor scanned and well-maintained information, but are not accessed by the crawlers.

Almost all search engines have limited options for keyword query combination. For example Google and Yahoo provide option like phrase match or exact match to limit search results. It demands for more efforts and time to get most relevant information. Since human behavior and choices change over time, a web page needs to be updated more frequently to reflect these trends. Also, there is limited space for multi-dimensional web data mining since existing information search rely heavily on keyword-based indices, not the real data.

Above mentioned limitations and challenges have resulted in a quest for efficiently and effectively discover and use Web resources. Send us any of your queries regarding Web Data mining processes to explore the topic in more detail.

Source: http://ezinearticles.com/?Limitations-and-Challenges-in-Effective-Web-Data-Mining&id=5012994

Tuesday, 23 December 2014

GScholarXScraper: Hacking the GScholarScraper function with XPath

Kay Cichini recently wrote a word-cloud R function called GScholarScraper on his blog which when given a search string will scrape the associated search results returned by Google Scholar, across pages, and then produce a word-cloud visualisation.

This was of interest to me because around the same time I posted an independent Google Scholar scraper function  get_google_scholar_df() which does a similar job of the scraping part of Kay’s function using XPath (whereas he had used Regular Expressions). My function worked as follows: when given a Google Scholar URL it will extract as much information as it can from each search result on the URL webpage  into different columns of a dataframe structure.

In the comments of his blog post I figured it’d be fun to hack his function to provide an XPath alternative, GScholarXScraper. Essensially it’s still the same function he wrote and therefore full credit should go to Kay on this one as he fully deserves it – I certainly had no previous idea how to make a word cloud, plus I hadn’t used the tm package in ages (to the point where I’d forgotten most of it!). The main changes I made were as follows:

    Restructure internal code of GScholarScraper into a series of local functions which each do a seperate job (this made it easier for me to hack because I understood what was doing what and why).

    As far as possible, strip out Regular Expressions and replace with XPath alternatives (made possible via the XML package). Hence the change of name to GScholarXScraper. Basically, apart from a little messing about with the generation of the URLs I just copied over my get_google_scholar_df() function and removed the Regular Expression alternatives. I’m not saying one is better than the other but f0r me personally, I find XPath shorter and quicker to code but either is a good approach for web scraping like this (note to self: I really need to lean more about regular expressions!) :)

•    Vectorise a few of the loops I saw (it surprises me how second nature this has become to me – I used to find the *apply family of functions rather confusing but thankfully not so much any more!).
•    Make use of getURL from the RCurl package (I was getting some mutibyte string problems originally when using readLines but this approach automatically fixed it for me).
•    Add option to make a word-cloud from either the “title” or the “description” fields of the Google Scholar search results
•    Added steaming via the Rstem package because I couldn’t get the Snowball package to install with my version of java. This was important to me because I was getting word clouds with variations of the same word on it e.g. “game”, “games”, “gaming”.
•    Forced use of URLencode() on generation of URLs to automatically avoid problems with search terms like “Baldur’s Gate” which would otherwise fail.

I think that’s pretty much everything I added. Anyway, here’s how it works (link to full code at end of post):

<div id="LC198"># #EXAMPLE 1: Display word cloud based on the title field of each Google Scholar search result returned</div>
<div id="LC199"># GScholarXScraper(search.str = "Baldur's Gate", field = "title", write.table = FALSE, stem = TRUE)</div>
<div id="LC200">#</div>
<div id="LC201"># # word freq</div>
<div id="LC202"># # game game 71</div>
<div id="LC203"># # comput comput 22</div>
<div id="LC204"># # video video 13</div>
<div id="LC205"># # learn learn 11</div>
<div id="LC206"># # [TRUNC...]</div>
<div id="LC207"># #</div>
<div id="LC208"># #</div>
<div id="LC209"># # Number of titles submitted = 210</div>
<div id="LC210"># #</div>
<div id="LC211"># # Number of results as retrieved from first webpage = 267</div>
<div id="LC212"># #</div>
<div id="LC213"># # Be aware that sometimes titles in Google Scholar outputs are truncated - that is why, i.e., some mandatory intitle-search strings may not be contained in all titles</div>


// image

I think that’s kind of cool and corresponds to what I would expect for a search about the legendary Baldur’s Gate computer role playing game :)  The following is produced if we look at the ‘description’ filed instead of the ‘title’ field:


<div id="LC215"># # EXAMPLE 2: Display word cloud based on the description field of each Google Scholar search result returned</div>
<div id="LC216">GScholarXScraper(search.str = "Baldur's Gate", field = "description", write.table = FALSE, stem = TRUE)</div>
<div id="LC217">#</div>
<div id="LC218"># # word freq</div>
<div id="LC219"># # page page 147</div>
<div id="LC220"># # gate gate 132</div>
<div id="LC221"># # game game 130</div>
<div id="LC222"># # baldur baldur 129</div>
<div id="LC223"># # roleplay roleplay 21</div>
<div id="LC224"># # [TRUNC...]</div>
<div id="LC225"># #</div>
<div id="LC226"># # Number of titles submitted = 210</div>
<div id="LC227"># #</div>
<div id="LC228"># # Number of results as retrieved from first webpage = 267</div>
<div id="LC229"># #</div>
<div id="LC230"># # Be aware that sometimes titles in Google Scholar outputs are truncated - that is why, i.e., some mandatory intitle-search strings may not be contained in all titles</div>


Not bad. I could see myself using the text mining and word cloud functionality with other projects I’ve been playing with such as Facebook, Google+, Yahoo search pages, Google search pages, Bing search pages… could be fun!

Many thanks again to Kay for making his code publicly available so that I could play with it and improve my programming skill set.


Full code for GScholarXScraper can be found here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/GScholarXScraper/GScholarXScraper

Original GSchloarScraper code is here: https://docs.google.com/document/d/1w_7niLqTUT0hmLxMfPEB7pGiA6MXoZBy6qPsKsEe_O0/edit?hl=en_US

Full code for just the XPath scraping function is here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/googleScholarXScraper/googleScholarXScraper.R


Sunday, 21 December 2014

Extractions and Skin Care

As an esthetician or skin care professional, you may have heard some controversy over the matter of performing extractions during a routine facial service. What may seem like a relatively simple procedure can actually raise great controversy in the world of esthetics. Some estheticians regard extractions as a matter of providing a complete service while others see this as inflicting trauma to the skin. Learning more about both sides of the issue can help you as a professional in making an informed decision and explaining the issue to your clients.

What is an extraction?

As a basic review, an extraction is removing impurity (plug of dead skin or oil) from a pore or pimple. It is the removal of both blackheads and whiteheads from the skin. Extractions occur after the skin has been thoroughly cleansed, exfoliated and sometimes steamed to soften the area prior to extraction.

Why Do It?

Extractions are considered a "must" by many estheticians when performing a routine facial because they want to leave their clients skin looking and feeling it's best. When done correctly, a simple extraction should be quick and relatively painless. As a trained esthetician it is important to know if your client has sensitive skin which would make them more prone to the damage that can be caused by extractions.

Why Not?

Extractions should only be performed by a trained esthetician and should not be done in excess. Extractions can cause broken capillaries or sin irritations that can lead to more (not less) breakouts. Extractions can also cause discomfort for your client when done incorrectly so you should seek their permission before performing any type of extraction during their facial. Remember your client has the right to know any product or procedure being performed on their skin and make an informed choice.

Who Decides?

As an esthetician it may be entirely up to you or it may be a procedure within your salon to do or not do extractions. It is important to check the guidelines of your employer and know their policies before performing any procedure. Remember to explain extractions and their benefits and possible complications to your client. Trust is an important part of any relationship and your client needs to know you are being open and honest with them. The last thing you want as a professional is a reputation for inflicting unnecessary and unwanted procedures or damage to your client's skin.

Bellanina Institute's owner and director, Nina Howard, is a multi-talented, forward-thinking entrepreneur who has built the Bellanina brand form the ground up to a successful million-dollar spa, spa training business, and skin care product line. Nina is a Licensed Esthetician with Para-Medical studies, Massage Therapist, Polarity Therapist, Skin Care Educator, Artist, and Professional Interior Designer.


Wednesday, 17 December 2014

Benefits of Predictive Analytics and Data Mining Services

Predictive Analytics is the process of dealing with variety of data and apply various mathematical formulas to discover the best decision for a given situation. Predictive analytics gives your company a competitive edge and can be used to improve ROI substantially. It is the decision science that removes guesswork out of the decision-making process and applies proven scientific guidelines to find right solution in the shortest time possible.

Predictive analytics can be helpful in answering questions like:

•    Who are most likely to respond to your offer?
•    Who are most likely to ignore?
•    Who are most likely to discontinue your service?
•    How much a consumer will spend on your product?
•    Which transaction is a fraud?
•    Which insurance claim is a fraudulent?
•    What resource should I dedicate at a given time?

Benefits of Data mining include:

•    Better understanding of customer behavior propels better decision
•    Profitable customers can be spotted fast and served accordingly
•    Generate more business by reaching hidden markets
•    Target your Marketing message more effectively
•    Helps in minimizing risk and improves ROI.
•    Improve profitability by detecting abnormal patterns in sales, claims, transactions etc
•    Improved customer service and confidence
•    Significant reduction in Direct Marketing expenses

Basic steps of Predictive Analytics are as follows:

•    Spot the business problem or goal
•    Explore various data sources such as transaction history, user demography, catalog details, etc)
•    Extract different data patterns from the above data
•    Build a sample model based on data & problem
•    Classify data, find valuable factors, generate new variables
•    Construct a Predictive model using sample
•    Validate and Deploy this Model

Standard techniques used for it are:

•    Decision Tree
•    Multi-purpose Scaling
•    Linear Regressions
•    Logistic Regressions
•    Factor Analytics
•    Genetic Algorithms
•    Cluster Analytics
•    Product Association

Should you have any queries regarding Data Mining or Predictive Analytics applications, please feel free to contact us. We would be pleased to answer each of your queries in detail.


Tuesday, 16 December 2014

RAM Scraping a New Old Favorite For Hackers

Some of the best stories involve a conflict with an old enemy: a friend-turned-foe, long thought dead, returning from the grave for violent retribution; an ancient order of dark siders from the distant reaches of the galaxy, hiding in plain sight and waiting to seize power for themselves; a dark lord thought destroyed millennia ago, only to rise again and seek his favorite piece of jewelry.  The list goes on.

Granted, 2011 isn’t quite “millennia,” and this story isn’t meant for entertainment, but the old foe in this instance is nonetheless dangerous in its own right.  That is the year when RAM scraping malware first made major headlines: originating as an advanced version of the Trackr malware, controlled through a botnet, it was discovered in the compromised Point of Sale (POS) systems of a university and several hotels.  And while it seemed recently that this method had dwindled in popularity, the Target and other retail breaches saw it return with a vengeance.  With 110 million Target customers having their information compromised, it was easily one the largest incidents involving memory scrapers.

How does it work?  First, the malware has to be introduced into the POS network, which can happen via any machine that is connected to the network, or unsecured wireless networks.  Even with firewalls, an infected laptop could serve as a vector.  Once installed, the malware can hide in the shadows, employing encryption or antivirus-avoiding tools to prevent its identification until it’s ready to strike.  Then, when a customer’s card gets used at a POS machine, the data contained within—name, card number, security code, etc.—gets sent to the system memory.  “There is that opportunity to steal the credit card information when it is in memory, perhaps even before your payment has even been authorized, and the data hasn't even been written to the hard drive yet,” says security researcher Graham Cluley.

So, why not encrypt the system’s memory, when it’s at its most vulnerable?  Not that simple, sadly: “No matter how strong your encryption is, if the system needs to process data or process the code, everything needs to be decrypted in memory,” Chris Elisan, principal malware scientist at security firm RSA, explained to Dark Reading.

There are certain steps a company can take, of course, and should take, to reduce the risk.  Strong passwords to access the POS machines, firewalls to isolate the POS network from the Internet, disabling remote access to POS systems, to name a few.  All the same, while these measures are vital and should be used, I don’t think, in light of recent breaches, they are sufficient.  Now, I wrote a short time ago about the impending October 2014 deadline imposed by the credit card industry, regarding the systematic switch to chipped credit card technology; adopting this standard will definitely assist in eradicating this problem.  But, until such a time when a widespread implementation of new systems comes about, always be vigilant to protect your data from attack, because what’s old is new again, and a colossal data breach is a story consumers are liable to seek financial restitution for.


Monday, 15 December 2014

Handling exceptions in scrapers

When requesting and parsing data from a source with unknown properties and random behavior (in other words, scraping), I expect all kinds of bizarrities to occur. Managing exceptions is particularly helpful in such cases.

Here is some ways that an exception might be raised.
[][0] #The list has no zeroth element, so this raises an IndexError
{}['foo'] #The dictionary has no foo element, so this raises a KeyError

Catching the exception is sometimes cleaner than preventing it from happening in the first place. Here are some examples handling bizarre exceptions in scrapers.

Example 1: Inconsistant date formats

Let’s say we’re parsing dates.
import datetime
This doesn’t raise an error.
datetime.datetime.strptime('2012-04-19', '%Y-%m-%d')
But this does.
datetime.datetime.strptime('April 19, 2012', '%Y-%m-%d')

It raises a ValueError because the date formats don’t match. So what do we do if we’re scraping a data source with multiple date formats?

Ignoring unexpected date formats

A simple thing is to ignore the date formats that we didn’t expect.

import lxml.html
import datetime
def parse_date1(source):
    rawdate = lxml.html.fromstring(source).get_element_by_id('date').text
         cleandate = datetime.datetime.strptime(rawdate, '%Y-%m-%d')
    except ValueError:
         cleandate = None
    return cleandate

print parse_date1('<div id="date">2012-04-19</div>')

If we make a clean date column in a database and put this in there, we’ll have some rows with dates and some rows with nulls. If there are only a few nulls, we might just parse those by hand.

Trying multiple date formats

Maybe we have determined that this particular data source uses three different date formats. We can try all three.

import lxml.html
import datetime

def parse_date2(source):

    rawdate = lxml.html.fromstring(source).get_element_by_id('date').text

    for date_format in ['%Y-%m-%d', '%B %d, %Y', '%d %B, %Y']:

             cleandate = datetime.datetime.strptime(rawdate, date_format)
             return cleandate
        except ValueError:
    return None

print parse_date2('<div id="date">19 April, 2012</div>')

This loops through three different date formats and returns the first one that doesn’t raise the error.

Example 2: Unreliable HTTP connection

If you’re scraping an unreliable website or you are behind an unreliable internet connection, you may sometimes get HTTPErrors or URLErrors for valid URLs. Trying again later might help.

import urllib2
def load(url):
    retries = 3
    for i in range(retries):
            handle = urllib2.urlopen(url)
            return handle.read()
        except urllib2.URLError:
            if i + 1 == retries:
    # never get here

print load('http://thomaslevine.com')

This function tries to download the page thee times. On the first two fails, it waits 42 seconds and tries again. On the third failure, it raises the error. On a success, it returs the content of the page.

Example 3: Logging errors rather than raising them

For more complicated parses, you might find loads of errors popping up in weird places, so you might want to go through all of the documents before deciding which to fix first or whether to do some of them manually.

import scraperwiki
for document_name in document_names:
    except Exception as e:
        scraperwiki.sqlite.save([], {
            'documentName': document_name,
            'exceptionType': str(type(e)),
            'exceptionMessage': str(e)
        }, 'errors')

This catches any exception raised by a particular document, stores it in the database and then continues with the next document. Looking at the database afterwards, you might notice some trends in the errors that you can easily fix and some others where you might hard-code the correct parse.

Example 4: Exiting gracefully

When I’m scraping over 9000 pages and my script fails on page 8765, I like to be able to resume where I left off. I can often figure out where I left off based on the previous row that I saved to a database or file, but sometimes I can’t, particularly when I don’t have a unique index.

for bar in bars:
        print('Failure at bar = "%s"' % bar)

This will tell me which bar I left off on. It’s fancier if I save the information to the database, so here is how I might do that with ScraperWiki.

import scraperwiki
resume_index = scraperwiki.sqlite.get_var('resume_index', 0)
for i, bar in enumerate(bars[resume_index:]):
        scraperwiki.sqlite.save_var('resume_index', i)
scraperwiki.sqlite.save_var('resume_index', 0)

ScraperWiki has a limit on CPU time, so an error that often concerns me is the scraperwiki.CPUTimeExceededError. This error is raised after the script has used 80 seconds of CPU time; if you catch the exception, you have two CPU seconds to clean up. You might want to handle this error differently from other errors.

import scraperwiki
resume_index = scraperwiki.sqlite.get_var('resume_index', 0)
for i, bar in enumerate(bars[resume_index:]):
    except scraperwiki.CPUTimeExceededError:
        scraperwiki.sqlite.save_var('resume_index', i)
    except Exception as e:
        scraperwiki.sqlite.save_var('resume_index', i)
        scraperwiki.sqlite.save([], {
            'bar': bar,
            'exceptionType': str(type(e)),
            'exceptionMessage': str(e)
        }, 'errors')
scraperwiki.sqlite.save_var('resume_index', 0)


Expect exceptions to occur when you are scraping a randomly unreliable website with randomly inconsistent content, and consider handling them in ways that allow the script to keep running when one document of interest is bizarrely formatted or not available.

Source: https://blog.scraperwiki.com/2012/05/handling-exceptions-in-scrapers

Friday, 12 December 2014

Content Scraping Reuses Blog Posts without Permission

What do popular blogs and websites such as Social Media Examiner, Copy Blogger, CNN.com, Mashable, and Type A Parent have in common? No, it’s not traffic and a loyal online community, each was a victim of the content scraping site “BuzzMyFx.” Although most bloggers fall victim to content scrapers at least once, the offending website was such an extreme case the backlash against it was fast and furious. Thanks to the quick action of many angry bloggers, BuzzMyFix was taken down in a matter of days.

If you’re not familiar with content scraping sites and aren’t sure why they’re bad and what you can do if you fall prey, read on. Not knowing what steps you can take to remove your content from a scraping site can mean someone else is profiting from your hard work.

What is content scraping?

Content scraping is when a blog or website pulls in other bloggers’ content without permission, in many cases passing it off as their own. Instead of stocking their sites with unique content, they steal entire blog posts. Some do leave the original authors’ bylines, but there are plenty that don’t provide attribution at all. This is not a good thing at all.

If you don’t care about someone taking your content and putting it on their blogs and websites without your permission, you should. These sites are stealing traffic, search engine rankings, and even advertising revenue from bloggers. Moreover, by ignoring scraping sites you’re giving the message that this practice is OK.

It’s not OK.

How was BuzzMyFx different?

BuzzMyFx was a little different from your usual scrapers. Bloggers didn’t just find their content had been posted on this site, they learned their entire blogs — down to the design and comments — had been cloned. Plus, any bloggers checking to see if their blogs were being cloned immediately found themselves being scraped as well. Dozens, if not hundreds of blogs were affected. However, bloggers didn’t take this incident sitting down. They spread the word and contacted the site’s host en masse. Thanks to their swift action, and the high number of complaints, the site was removed quickly.

How can I tell if my content is being scraped?

Fortunately for content creators, scrapers are a lazy bunch. Because their sites are automated, and they don’t check or read the content being pulled, they don’t take many precautions to ensure the people they scrape from don’t find their sites. In fact, they may not even care. Fortunately, this makes it easy to learn if your content is being stolen.

    Link to your own articles — When you write a blog post and link to other (of your own) blog posts within that post, it’s not only good SEO. You also will get pingbacks whenever someone else steals your content because of your interlinks. You’re alerted when someone links to your content, and when content is published with your links, you’ll get that alert.

    Google Alerts — If your name, blog’s name, or other unique keywords are set up as Google Alerts, you’ll receive an e-mail every time content is published with these keywords.

    Analytics — When people click on your links that are in scraped content, it will show up as referring traffic in your analytics program. You should always check referring traffic so you can thank the referring site owner, but also to make sure no one is stealing your content.

What steps can I take to remove my content from a scraper?

If you find your content is being stolen, know you have several options. First, you’ll need to find out who owns the scraping site. You can find this out by doing a WHOis domain lookup, which will enable you to search for the website’s details, including the name of the webmaster, contact info, and the name of the site’s host.

Keep in mind that sometimes the website’s owner will pay extra to have his or her name kept private, but you will always be able to find the name of the host. Once you have this information, you can take the necessary steps to have your content removed.

    Contact the site’s owner personally: Your first step should always be a polite request to remove your content immediately. Let the website owner know he or she is in violation of the Digital Millennium Copyright Act (DMCA), and you will take the necessary steps to report him if he doesn’t comply.

    Contact the site’s host: If you can’t find the name of the person who owns the site, or if he won’t comply with your takedown request, contact the website’s host. You’ll have to prove your content is being stolen. As the host can be held liable for allowing the content theft, it’s in their best interest to contact the website owner and request removal.

    Contact Google: You can contact Google and fill out a form to have them remove the website from their search engines.

    Spread the word: Let all your blogging friends know about content scrapers when you come across them. The more people who take action against content scrapers, the less likely they are to do it again.

Contacting the webmaster with a takedown notice doesn’t have to be an intimidating process, either. The website Plagiarism Today has a wonderful set of stock letters to use to contact webmasters, web hosts, and even Google. All you have to do is insert the necessary information.

Content scrapers and cloners may try to steal your content, but you don’t have to let them. Stand up for what’s yours.

Source: http://www.dummies.com/how-to/content/content-scraping-reuses-blog-posts-without-permiss.html

Wednesday, 10 December 2014

Finding & Removing Spam Blogs Who Scrape Content Onto Free Hosted Blogs

The more popular you become in the blogging world, the more crap you have to deal with!
Content scraping is one chore that can be dealt with swiftly once you understand what to do.
This post contains links which you can use to quickly and easily report content scrapers and spam blogs.
Please share this post and help clean up spam blogs and punish content scrapers.
First step is to find your url’s which have been scraped of content and then get the scrapers spam blog removed.

Some of the tools i use to do this are:

    Google Webmaster Tools
    Google Alerts

Finding Scraped Content
Login to your Google Webmaster Tools account and go to traffic > links to your site.
You should see something like this:
Webmaster Tools Links to Your Site

The first domain is a site which has copied and embedded my homepage which i have already dealt with.
The second site is a search engine.
The third domain is the one i want to deal with.

A common method scrapers use is to post the scraped content from your rss feed on to a free hosted blog like WordPress.com or blogger.com.

Once you click the WordPress.com link in webmaster tools, you’ll find all the url’s which have been scraped.
Links to Your Site

There’s 32 url’s which have been linked to so its simply a matter of clicking each of your links and finding the culprits.

The first link is my homepage which has been linked to by legit domains like WordPress developers.
The others are mainly linked to by spam blogs who have scraped the content and used a free hosted service which in this case is WordPress.com.
WordPress.com Links to Your Site
 Reporting & Removing Spam Blogs

Once you have the url’s of the content scraping blogs as seen in the screenshot above:

    Fill in this basic form to report spam to WordPress.com
    Fill in this form to report copyright content to WordPress.com
    Use this form to report Blogspot and Blogger.com content which has been scraped.
    Fill in one of these forms to remove content from Google

Google Alerts

Its very easy to setup a Google alert to find your post titles when they get scraped.
If you’ve setup the WordPress SEO plugin correctly, you should have included your site title at the end of all your post titles.
Then all you need to do is setup a Google alert for your site title and you’ll be notified every time a scraper links to your content.

Link Notifications

You may also receive a pingback or trackback if you have this feature enabled in your discussion settings.

Link Notifications
RSS Feed Links

Most content scrapers use automated software to scrape the content from RSS feeds.
Make sure you configure your Reading settings so only a summary is displayed.
Reading Settings Feed Summary

Next step is to configure the settings in Yoast’s SEO plugin so links back to your site are included in all RSS feed post summaries.

RSS Feed Links

This will help search engines identify you and your domain as the original author of the content.
There’s other services like copyscape and dmca which can help you protect your sites content if you’re prepared to pay a premium.
That’s it folks.
Its easy to find and get spam sites removed once you know what to do.
Hope you don’t have to deal with this garbage to often.
Ever found out your content has been scraped?
What did you do about it?

Source: http://wpsites.net/blogging/content-scraping-monitoring-and-prevention-tips/

Monday, 1 December 2014

What you have to know before requesting web scraping services?

Before you request web scraping services you have to know what are your needs (what data you need, structure of it and where you can find this data).

Step 1: Define what data you need?

Data needs depending on purpose, if you want to find new customers you probably need contact data from players in your industry. Also if you want to study your competitors you need to define who are they. Only after that you can select data sources (websites feeds or other electronic sources) for this extraction.

In many cases for discovering and defining data sources are used search engines like Google, Bing, Yahoo, and others.

Step 2: Structure of data

Data structure it’s directly linked to usage purpose. In many cases data structure it’s a table where a row represents an entity and a cell of this row represents a property of this entity. In other cases Data structure is a a chart or another graphic representation builder with data extracted from a web source.

Step 3: Number of data extraction

In many cases is needed one time data extraction. In other cases when you need a regular report, are needed periodically extractions.

If you have defined all of above points you are ready to request a quote and an amount estimation from this contact form.

Source: http://thewebminer.com/blog/2013/08/

Friday, 28 November 2014

Scraping XML Tables with R

A couple of my good friends also recently started a sports analytics blog. We’ve decided to collaborate on a couple of studies revolving around NBA data found at www.basketball-reference.com. This will be the first part of that project!

Data scientists need data. The internet has lots of data. How can I get that data into R? Scrape it!

People have been scraping websites for as long as there have been websites. It’s gotten pretty easy using R/Python/whatever other tool you want to use. This post shows how to use R to scrape the demographic information for all NBA and ABA players listed at www.basketball-reference.com.

Here’s the code:

###### Settings


 ###### URLs



 ###### Reading data


 for (i in 2:len)


 ###### Formatting data


tbl$BirthDate<-as.Date(tbl$BirthDate[1],format="%B %d, %Y")

Created by Pretty R at inside-R.org

And here’s the result:Result

Source: http://www.r-bloggers.com/scraping-xml-tables-with-r/

Thursday, 27 November 2014

Data Mining KNN Classifier


Suppose a data analyst working for an insurance company was asked to build a predictive model for predicting weather a customer will buy a mobile home insurance policy. S/he tried kNN classifier with different number of neighbours (k=1,2,3,4,5). S/he got the following F-scores measured on the training data: (1.0; 0.92; 0.90; 0.85; 0.82). Based on that the analyst decided to deploy kNN with k=1. Was it a good choice? How would you select an optimal number of neighbours in this case?

1 Answer

It is not a good idea to select a parameter of a prediction algorithm using the whole training set as the result will be biased towards this particular training set and has no information about generalization performance (i.e. performance towards unseen cases). You should apply a cross-validation technique e.g. 10-fold cross-validation to select the best K (i.e. K with largest F-value) within a range. This involves splitting your training data in 10 equal parts retain 9 parts for training and 1 for validation. Iterate such that each part has been left out for validation. If you take enough folds this will allow you as well to obtain statistics of the F-value and then you can test whether these values for different K values are statistically significant.

See e.g. also: http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Falg_knn_training_crossvalidation.htm

The subtlety here however is that there is likely a dependency between the number of data points for prediction and the K-value. So If you apply cross-validation you use 9/10 of the training set for training...Not sure whether any research has been performed on this and how to correct for that in the final training set. Anyway most software packages just use the abovementioned techniques e.g. see SPSS in the link. A solution is to use leave-one-out cross-validation (each data samples is left out once for testing) in that case you have N-1 training samples(the original training set has N).


Tuesday, 25 November 2014

A Content Marketer's Guide to Data Scraping

As digital marketers, big data should be what we use to inform a lot of the decisions we make. Using intelligence to understand what works within your industry is absolutely crucial within content campaigns, but it blows my mind to know that so many businesses aren't focusing on it.

One reason I often hear from businesses is that they don't have the budget to invest in complex and expensive tools that can feed in reams of data to them. That said, you don't always need to invest in expensive tools to gather valuable intelligence — this is where data scraping comes in.

Just so you understand, here's a very brief overview of what data scraping is from Wikipedia:

    "Data scraping is a technique in which a computer program extracts data from human-readable output coming from another program."

Essentially, it involves crawling through a web page and gathering nuggets of information that you can use for your analysis. For example, you could search through a site like Search Engine Land and scrape the author names of each of the posts that have been published, and then you could correlate this to social share data to find who the top performing authors are on that website.

Hopefully, you can start to see how this data can be valuable. What's more, it doesn't require any coding knowledge — if you're able to follow my simple instructions, you can start gathering information that will inform your content campaigns. I've recently used this research to help me get a post published on the front page of BuzzFeed, getting viewed over 100,000 times and channeling a huge amount of traffic through to my blog.

Disclaimer: One thing that I really need to stress before you read on is the fact that scraping a website may breach its terms of service. You should ensure that this isn't the case before carrying out any scraping activities. For example, Twitter completely prohibits the scraping of information on their site. This is from their Terms of Service:

    "crawling the Services is permissible if done in accordance with the provisions of the robots.txt file, however, scraping the Services without the prior consent of Twitter is expressly prohibited"

Google similarly forbids the scraping of content from their web properties:

    Google's Terms of Service do not allow the sending of automated queries of any sort to our system without express permission in advance from Google.

So be careful, kids.
Content analysis

Mastering the basics of data scraping will open up a whole new world of possibilities for content analysis. I'd advise any content marketer (or at least a member of their team) to get clued up on this.

Before I get started on the specific examples, you'll need to ensure that you have Microsoft Excel on your computer (everyone should have Excel!) and also the SEO Tools plugin for Excel (free download here). I put together a full tutorial on using the SEO tools plugin that you may also be interested in.

Alongside this, you'll want a web crawling tool like Screaming Frog's SEO Spider or Xenu Link Sleuth (both have free options). Once you've got these set up, you'll be able to do everything that I outline below.

So here are some ways in which you can use scraping to analyse content and how this can be applied into your content marketing campaigns:

1. Finding the different authors of a blog

Analysing big publications and blogs to find who the influential authors are can give you some really valuable data. Once you have a list of all the authors on a blog, you can find out which of those have created content that has performed well on social media, had a lot of engagement within the comments and also gather extra stats around their social following, etc.

I use this information on a daily basis to build relationships with influential writers and get my content placed on top tier websites. Here's how you can do it:

Step 1: Gather a list of the URLs from the domain you're analysing using Screaming Frog's SEO Spider. Simply add the root domain into Screaming Frog's interface and hit start (if you haven't used this tool before, you can check out my tutorial here).

Once the tool has finished gathering all the URLs (this can take a little while for big websites), simply export them all to an Excel spreadsheet.

Step 2: Open up Google Chrome and navigate to one of the article pages of the domain you're analysing and find where they mention the author's name (this is usually within an author bio section or underneath the post title). Once you've found this, right-click their name and select inspect element (this will bring up the Chrome developer console).

Within the developer console, the line of code associated to the author's name that you selected will be highlighted (see the below image). All you need to do now is right-click on the highlighted line of code and press Copy XPath.

For the Search Engine Land website, the following code would be copied:


This may not make any sense to you at this stage, but bear with me and you'll see how it works.

Step 3: Go back to your spreadsheet of URLs and get rid of all the extra information that Screaming Frog gives you, leaving just the list of raw URLs – add these to the first column (column A) of your worksheet.

Step 4: In cell B2, add the following formula:


Just to break this formula down for you, the function XPathOnUrl allows you to use the XPath code directly within (this is with the SEO Tools plugin installed; it won't work without this). The first element of the function specifies which URL we are going to scrape. In this instance I've selected cell A2, which contains a URL from the crawl I did within Screaming Frog (alternatively, you could just type the URL, making sure that you wrap it within quotation marks).

Finally, the last part of the function is our XPath code that we gathered. One thing to note is that you have to remove the quotation marks from the code and replace them with apostrophes. In this example, I'm referring to the "leftCol" section, which I've changed to ‘leftCol' — if you don't do this, Excel won't read the formula correctly.

Once you press enter, there may be a couple of seconds delay whilst the SEO Tools plugin crawls the page, then it will return a result. It's worth mentioning that within the example I've given above, we're looking for author names on article pages, so if I try to run this on a URL that isn't an article (e.g. the homepage) I will get an error.

For those interested, the XPath code itself works by starting at the top of the code of the URL specified and following the instructions outlined to find on-page elements and return results. So, for the following code:


We're telling it to look for any element (//*) that has an id of leftCol (@id='leftCol') and then go down to the second div tag after this (div[2]), followed by a p tag, a span tag and finally, an a tag (/p/span/a). The result returned should be the text within this a tag.

Don't worry if you don't understand this, but if you do, it will help you to create your own XPath. For example, if you wanted to grab the output of an a tag that has rel=author attached to it (another great way of finding page authors), then you could use some XPath that looked a little something like this:


As a full formula within Excel it would look something like this:


Once you've created the formula, you can drag it down and apply it to a large number of URLs all at once. This is a huge time-saver as you'd have to manually go through each website and copy/paste each author to get the same results without scraping – I don't need to explain how long this would take.

Now that I've explained the basics, I'll show you some other ways in which scraping can be used…

2. Finding extra details around page authors

So, we've found a list of author names, which is great, but to really get some more insight into the authors we will need more data. Again, this can often be scraped from the website you're analysing.

Most blogs/publications that list the names of the article author will actually have individual author pages. Again, using Search Engine Land as an example, if you click my name at the top of this post you will be taken to a page that has more details on me, including my Twitter profile, Google+ profile and LinkedIn profile. This is the kind of data that I'd want to gather because it gives me a point of contact for the author I'm looking to get in touch with.

Here's how you can do it.

Step 1: First we need to get the author profile URLs so that we can scrape the extra details off of them. To do this, you can use the same approach to find the author's name, with just a little addition to the formula:

=XPathOnUrl(A2,"//a[@rel='author']", <strong>"href"</strong>)

The addition of the "href" part of the formula will extract the output of the href attribute of the atag. In Lehman terms, it will find the hyperlink attached to the author name and return that URL as a result.

Step 2: Now that we have the author profile page URLs, you can go on and gather the social media profiles. Instead of scraping the article URLs, we'll be using the profile URLs.

So, like last time, we need to find the XPath code to gather the Twitter, Google+ and LinkedIn links. To do this, open up Google Chrome and navigate to one of the author profile pages, right-click on the Twitter link and select Inspect Element.

Once you've done this, hover over the highlighted line of code within Chrome's developer tools, right-click and select Copy XPath.

Step 3: Finally, open up your Excel spreadsheet and add in the following formula (using the XPath that you've copied over):

=XPathOnUrl(C2,"//*[@id='leftCol']/div[2]/p/a[2]", "href")

Remember that this is the code for scraping Search Engine Land, so if you're doing this on a different website, it will almost certainly be different. One important thing to highlight here is that I've selected cell C2 here, which contains the URL of the author profile page and not just the article page. As well as this, you'll notice that I've included "href" at the end because we want the actual Twitter profile URL and not just the words ‘Twitter'.

You can now repeat this same process to get the Google+ and LinkedIn profile URLs and add it to your spreadsheet. Hopefully you're starting to see the value in this, and how it can be used to gather a lot of intelligence that can be used for all kinds of online activity, not least your SEO and social media campaigns.

3. Gathering the follower counts across social networks

Now that we have the author's social media accounts, it makes sense to get their follower counts so that they can be ranked based on influence within the spreadsheet.

Here are the final XPath formulae that you can plug straight into Excel for each network to get their follower counts. All you'll need to do is replace the text INSERT SOCIAL PROFILE URL with the cell reference to the Google+/LinkedIn URL:





4. Scraping page titles

Once you've got a list of URLs, you're going to want to get an idea of what the content is actually about. Using this quick bit of XPath against any URL will display the title of the page:


To be fair, if you're using the SEO Tools plugin for Excel then you can just use the built-in feature to scrape page titles, but it's always handy to know how to do it manually!

A nice extra touch for analysis is to look at the number of words used within the page titles. To do this, use the following formula:


From this you can get an understanding of what the optimum title length of a post within a website is. This is really handy if you're pitching an article to a specific publication. If you make the post the best possible fit for the site and back up your decisions with historical data, you stand a much better chance of success.

Taking this a step further, you can gather the social shares for each URL using the following functions:







Note: You can also use a tool like URL Profiler to pull in this data, which is much better for large data sets. The tool also helps you to gather large chunks of data from other social networks, link data sources like Ahrefs, Majestic SEO and Moz, which is awesome.

If you want to get even more social stats then you can use the SharedCount API, and this is how you go about doing it…

Firstly, create a new column in your Excel spreadsheet and add the following formula (where A2 is the URL of the webpage you want to gather social stats for):


You should now have a cell that contains your webpage URL prefixed with the SharedCount API URL. This is what we will use to gather social stats. Now here's the Excel formula to use for each network (where B2 is the cell that contaiins the formula above):













Facebook Shares:


Facebook Comments:


Once you have this data, you can start looking much deeper into the elements of a successful post. Here's an example of a chart that I created around a large sample of articles that I analysed within Upworthy.com.

The chart looks at the average number of social shares that an article on Upworthy receives vs the number of words within its title. This is invaluable data that can be used across a whole host of different on-page elements to get the perfect article template for the site you're pitching to.

See, big data is useful!

5. Date/time the post was published

Along with analysing the details of headlines that are working within a site, you may want to look at the optimal posting times for best results. This is something that I regularly do within my blogs to ensure that I'm getting the best possible return from the time I spend writing.

Every site is different, which makes it very difficult for an automated, one-size-fits-all tool to gather this information. Some sites will have this data within the <head> section of their webpages, but others will display it directly under the article headline. Again, Search Engine Land is a perfect example of a website doing this…

So here's how you can scrape this information from the articles on Search Engine Land:


Now you've got the date and time of the post. You may want to trim this down and reformat it for your data analysis, but you've got it all in Excel so that should be pretty easy.

Extra reading

Data scraping is seriously powerful, and once you've had a bit of a play around with it you'll also realise that it's not that complicated. The examples that I've given are just a starting point but once you get your creative head on, you'll soon start to see the opportunities that arise from this intelligence.

Here's some extra reading that you might find useful:






    Start using actual data to inform your content campaigns instead of going on your gut feeling.

    Gather intelligence around specific domains you want to target for content placement and create the perfect post for their audience.

    Get clued up on XPath and JSON through using the SEO Tools plugin for Excel.

    Spend more time analysing what content will get you results as opposed to what sites will give you links!

    Check the website's ToS before scraping.


Friday, 21 November 2014

Is It Time to End Screen Scraping?

As the industry works to improve the way online banking information is shared with personal financial management apps, a debate is brewing over whether to end the decades-old practice of screen scraping.

Proponents of the popular method say it is a valuable supplement to direct data feeds that may be incomplete or out-of-date. But screen scraping also raises risk concerns, since like other data collection methods it requires consumers to cough up their banking credentials.

"I have not talked to a bank that hasn't confirmed it's a growing problem in their organization," said Jim Routh, the chairman of the products and services committee at Financial Services Information Sharing and Analysis Center.

Financial institutions worry that data aggregators may not take all the appropriate security precautions. According to the FS-ISAC, an industry organization, startups are entering the aggregation market without making security a higher priority.

Routh, who is Aetna's chief information security officer and a former global head of application and mobile security for JPMorgan Chase, said the upstarts do some things well, but "protecting credentials isn't necessarily high on their priorities." The problem is worsened by data aggregators that collect marketing data, such as the device a consumer is using, to understand their behaviors across channels, he said.

The FS-ISAC has proposed creating a standard application programming interface to share information from bank accounts. The API would serve as the conduit for data when consumers wish to use a web or mobile app to receive push bill reminders, to verify their bank accounts or for numerous other PFM use cases.

The proposed API would also be designed to reduce the storage of financial data. But if the industry embraces the model, it would be harder for aggregators to do screen-scraping.

For years, PFM companies have used this tool to obtain customers' banking account information. With consumers' permission, aggregators log in with the customer's user name and password to grab financial data and use it to populate the mobile or web app of the customer's choice — whether or not the bank supports the technique.

Yodlee, which works with more than 300 banks as well as startups, argues that there is a place and a need for aggregators to collect data through various techniques to provide the best customer experience.

Brian Costello, vice president of operations and security at Yodlee, said his company uses a combination of methods to gather customer account data. If it couldn't get data from a direct feed, it could also screen scrape.

If the industry moved to embracing only one data exchange method, Yodlee could be more vulnerable to the problem of receiving outdated information from the banks.

When a bank changes an annual percentage rate, if it doesn't update the data feed it sends to the aggregator right away, the PFM services that rely on that data will appear stale. (Services like Credit Karma, Mint and Wallaby, for example, rely on aggregation technology to recommend financial products to consumers according to price, among other things.)

Proper maintenance of data feeds, of course, takes time and money — resources many banks are short on. But delays could also result from the bankers' dilemma: On the one hand, they want to let customers aggregate their accounts to gather intelligence on their competitors. On the other hand, they may have reservations about their rivals collecting that same data in the battle for wallet share.

"Banks are under tremendous pressure to retain and obtain more clients," said Costello.

Screen scraping also has maintenance requirements, though. The FS-ISAC white paper draft said the approach "requires some coordination from the FI to allow what appears to be an automated attack against their application. To avoid blocking the aggregator's attempt to screen scrape the financial institution's application with this or other current security controls, a whitelist of aggregator IPs are set up and maintained by the FIs."

Like Costello, Marc West, president of digital channels at Fiserv, said a combination of data collection methods is better than a standard data exchange approach that might fail to extract the necessary information. Any data feed, said West, offers a limited set of data and information, while a scrape can enable a custom data extract.

But Aetna's Routh said moving to a real-time API model would improve a recurring issue caused by screen scraping: customer service hiccups. A consumer may call the company behind the personal financial app when a link to an account is broken. The PFM provider might tell him to call the bank, when the problem could lie with the aggregator not knowing of an update to the bank's code.

"The consumer gets in the middle of a customer service issue that is thorny at best and unsolvable at worst," Routh said. "Unfortunately that happens more frequently than anyone would like to it happen.

The new model, then, is "inevitable" in Routh's point of view because of the risk and economics involved. "This won't happen overnight," he said. "It needs some legs."

Kristin Moyer, a research vice president in industry advisory services and banking and investment services at Gartner, said she expects more banks to embrace APIs as a way to compete in a digital world.

Already financial institutions like Capital One, Agricole Bank and Fidor Bank are piloting and testing the OAuth specification, which lets banks keep ownership of the customer log-in data but requires them to make available an API. (The FS-ISAC is also promoting OAuth 2.0 as a way to strengthen aggregation security.)

"It's something we will see a lot more of in the next two to three years," said Moyer. "It's an exciting time…I think the use of APIs will enable us as an industry [to do things] that we never really imagined possible before."


The move away from screen scraping has already happened in some countries that lack a data exchange standard. Regulators in Poland, for example, recently recommended the practice halt. Responding to the guidance, mBank is one of the banks that changed its aggregation roadmap.

The bank, which spun off from BRE Bank, had been piloting a PFM service with friends and family and has now suspended the pilot. It had, however, already made use of aggregation technology so consumers, who weren't customers of the bank, could get loan decisions from mBank within half an episode of "Modern Family." Indeed, the bank would screen scrape consumers' external bank accounts to make a loan decision within five to 15 minutes. Now, loan decisions have to be made at a branch or for a smaller dollar amount after a consumer sends the bank a copy of an electronic statement.

"Right now we have to put it on the shelf. We haven't killed it. We want to resurrect it," said Michal Panowicz, senior director at mBank.

Overall, he sounds calm about the setback. "This is a regulator decision," said Panowicz. "We have to respect that. …We have to live with them on good footing."

But that doesn't mean it has given up on aggregation. Payday lenders can continue to screen scrape financial data in order to make loan decisions in Poland — which makes it an uneven playing field.

"We will try to convey the logic that [screen scraping] cannot be stopped," said Panowicz.

He views it as a longer term game for something he believes is valuable to consumers. mBank like other banks wants to realize the true aggregation dream: letting customers quickly switch bank accounts and products if they wish.

"To be honest, it's the most exciting part about aggregation... to move accounts to us without spending a minute of physical labor," he said.


Tuesday, 18 November 2014

Data Scraping Guide for SEO & Analytics

Data scraping can help you a lot in competitive analysis as well as pulling out data from your client’s website like extracting the titles, keywords and content categories.

You can quickly get an idea of which keywords are driving traffic to a website, which content categories are attracting links and user engagement, what kind of resources will it take to rank your site…………and the list goes on…

 Scraping Organic Search Results

By scraping organic search results you can quickly find out your SEO competitors for a particular search term. You can determine the title tags and the keywords they are targeting.

    The easiest way to scrape organic search results is by using the SERPs Redux bookmarklet.

For e.g if you scrape organic listings for the search term ‘seo tools’ using this bookmarklet, you may see the following results:

You can copy paste the websites URLs and title tags easily into your spreadsheet from the text boxes.

    Pro Tip by Tahir Fayyaz:

    Just wanted to add a tip for people using the SERPs Redux bookmarklet.

    If you have a data separated over multiple pages that you want to scrape you can use AutoPager for Firefox or Chrome to loads x amount of pages all on one page and then scrape it all using the bookmarklet.

Scraping on page elements from a web document

Through this Excel Plugin by Niels Bosma you can fetch several on-page elements from a URL or list of URLs like:

    Title tag
    Meta description tag
    Meta keywords tag
    Meta robots tag
    H1 tag
    H2 tag
    HTTP Header
    Facebook likes etc.

Scraping data through Google Docs

Google docs provide a function known as importXML through which you can import data from web documents directly into Google Docs spreadsheet. However to use this function you must be familiar with X-path expressions.

    Syntax: =importXML(URL,X-path-query)

    url=> URL of the web page from which you want to import the data.

    x-path-query => A query language used to extract data from web pages.

You need to understand following things about X-path in order to use importXML function:

1. Xpath terminology- What are nodes and kind of nodes like element nodes, attribute nodes etc.

2. Relationship between nodes- How different nodes are related to each other. Like parent node, child node, siblings etc.

3. Selecting nodes- The node is selected by following a path known as the path expression.

4. Predicates – They are used to find a specific node or a node that contains a specific value. They are always embedded in square brackets.

If you follow the x-path tutorial then it should not take you more than an hour to understand how X path expressions works.

Understanding path expressions is easy but building them is not. That’s is why i use a firefbug extension named ‘X-Pather‘ to quickly generate path expressions while browsing HTML and XML documents.

Since X-Pather is a firebug extension, it means you first need to install firebug in order to use it.

 How to scrape data using importXML()

Step-1: Install firebug – Through this add on you can edit & monitor CSS, HTML, and JavaScript while you browse.

Step-2: Install X-pather – Through this tool you can generate path expressions while browsing a web document. You can also evaluate path expressions.

Step-3: Go to the web page whose data you want to scrape. Select the type of element you want to scrape. For e.g. if you want to scrape anchor text, then select one anchor text.

Step-4: Right click on the selected text and then select ‘show in Xpather’ from the drop down menu.

Then you will see the Xpather browser from where you can copy the X-path.

Here i have selected the text ‘Google Analytics’, that is why the xpath browser is showing ‘Google Analytics’ in the content section. This is my xpath:


Pretty scary huh. It can be even more scary if you try to build it manually. I want to scrape the name of all the analytic tools from this page: killer seo tools. For this i need to modify the aforesaid path expression into a formula.

This is possible only if i can determine static and variable nodes between two or more path expressions. So i determined the path expression of another element ‘Google Analytics Help center’ (second in the list) through X-pather:


Now we can see that the node which has changed between the original and new path expression is the final ‘li’ element: li[1] to li[2]. So i can come up with following final path expression:


Now all i have to do is copy-paste this final path expression as an argument to the importXML function in Google Docs spreadsheet. Then the function will extract all the names of Google Analytics tool from my killer SEO tools page.

This is how you can scrape data using importXML.

    Pro Tip by Niels Bosma: “Anything you can do with importXML in Google docs you can do with XPathOnUrl directly in Excel.”

    To use XPathOnUrl function you first need to install the Niels Bosma’s Excel plugin. It is not a built in function in excel.

Note:You can also use a free tool named Scrapy for data scraping. It is an an open source web scraping framework and is used to extract structured data from web pages & APIs. You need to know Python (a programming language) in order to use scrapy.

Scraping on-page elements of an entire website

There are two awesome tools which can help you in scraping on-page elements (title tags, meta descriptions, meta keywords etc) of an entire website. One is the evergreen and free Xenu Link Sleuth and the other is the mighty Screaming Frog SEO Spider.

What make these tools amazing is that you can scrape the data of entire website and download it into excel. So if you want to know the keywords used in the title tag on all the web pages of your competitor’s website then you know what you need to do.

Note: Save the Xenu data as a tab separated text file and then open the file in Excel.

 Scraping organic and paid keywords of an entire website

The tool that i use for scraping keywords is SEMRush. Through this awesome tool i can determine which organic and paid keyword are driving traffic to my competitor’s website and then can download the whole list into excel for keyword research. You can get more details about this tool through this post: Scaling Keyword Research & Competitive Analysis to new heights

Scraping keywords from a webpage

Through this excel macro spreadsheet from seogadget you can fetch keywords from the text of a URL(s). However you need an Alchemy API key to use this macro.

You can get the Alchemy API key from here

Scraping keywords data from Google Adwords API

If you have access to Google Adwords API then you can install this plugin from seogadget website. This plugin creates a series of functions designed to fetch keywords data from the Google Adwords API like:

getAdWordAvg()- returns average search volume from the adwords API.

getAdWordStats() – returns local search volume and previous 12 months separated by commas

getAdWordIdeas() – returns keyword suggestions based on API suggest service.

Check out this video to know how this plug-in works

Scraping Google Adwords Ad copies of any website

I use the tool SEMRush to scrape and download the Google Adwords ad copies of my competitors into excel and then mine keywords or just get ad copy ideas.  Go to semrush, type the competitor website URL and then click on ‘Adwords Ad texts’ link on the left hand side menu. Once you see the report you can download it into excel.

Scraping back links of an entire website

The tool that you can use to scrape and download the back links of an entire website is: open site explorer

Scraping Outbound links from web pages

Garrett French of citation Labs has shared an excellent tool: OBL Scraper+Contact Finder which can scrape outbound links and contact details from a URL or URL list. This tool can help you a lot in link building. Check out this video to know more about this awesome tool:

Scraper – Google chrome extension

This chrome extension can scrape data from web pages and export it to Google docs. This tool is simple to use. Select the web page element/node you want to scrape. Then right click on the selected element and select ‘scrape similar’.

Any element/node that’s similar to what you have selected will be scraped by the tool which you can later export to Google Docs. One big advantage of this tool is that it reduces our dependency on building Xpath expressions and make scraping easier.

See how easy it is to scrape name and URLs of all the Analytics tools without using Xpath expressions.

Source: http://www.optimizesmart.com/data-scraping-guide-for-seo/

Monday, 17 November 2014

Screenscraping from Java using jsoup – effective data gathering from websites

In a recent article I discussed screenscraping in a in hindsight fairly clumsy way (http://technology.amis.nl/blog/12786/building-java-object-graph-with-tour-de-france-results-using-screen-scraping-java-util-parser-and-assorted-facilities). While preparing for a series of articles on data visualizations, I had need of statistics regarding the Olympic Games – more specifically: the overall medal count per country during the 2008 Bejing Olympic Games. This information is readily available from dozens of websites. However, I could not find one hat offered the data in easy to process XML or CSV format – all websites had human consumers in mind.

Using screenscraping – we use a programmatic facility to consume the content that is intended to be displayed on screen to human users and subsequently process that content by extracting the required data from it. Some web-pages are easier to scrape than others – this depends on the richness of the HTML (the poorer the better for scraping), the required interactivity (JavaScript, AJAX – the less the better) and the structure used to present the data (tables, frequently despised by web developers, work rather well).

I came across a tool for screenscraping from Java, called jsoup – http://jsoup.org/. It turned out to be so incredibly easy to use – that I thouht I should share it.

Getting going with jsoup is as easy as can be:

1. download jsoup-1.6.1.jar (or whatever the latest version is) from http://jsoup.org/download

2. add this jar as a dependency in your project and/or application CLASSPATH

3. make use of jsoup in the code that does the screenscraping.

A simple example of code that uses jsoup (more examples on: http://jsoup.org/cookbook/):

One of the websites offering the overall medal count is http://www.databaseolympics.com/games/gamesyear.htm?g=26. The page looks as follows:


Well, more importantly, the page looks like this:


This means in terms of screenscraping: I will find the medal count for each country inside a TABLE element with styleclass pt8. Each country has a TR element. Only the first TR element does not represent a country score, as it is the table header. The first TD element in the TR represents the country. The name of the country can be retrieved as the text content from the A element in the TD. The next TD elements contain the numbers of medals in Gold, Silver, Bronze and Total.

The corresponding Java code with jsoup boils down to:

public static void main(String[] args) throws IOException, SQLException, InterruptedException {

        Document doc = Jsoup.connect(OlympicMedalMirrorProcessor.baseUrl + "?g=26").get();
        String title = doc.title();
        Element table = doc.select("table.pt8").get(0);
        Elements trs = table.select("tr");
        Iterator trIter = trs.iterator();
        boolean firstRow = true;
        while (trIter.hasNext()) {

            Element tr = (Element)trIter.next();
            if (firstRow) {
                firstRow = false;
            Elements tds = tr.select("td");
            Iterator tdIter = tds.iterator();
            int tdCount = 1;
            String country = null;
            Integer gold = null;
            Integer silver = null;
            Integer bronze = null;
            Integer total = null;
            // process new line
            while (tdIter.hasNext()) {

                Element td = (Element)tdIter.next();
                switch (tdCount++) {
                case 1:
                    country = td.select("a").text();
                case 2:
                    gold = Integer.parseInt(td.text());
                case 3:
                    silver = Integer.parseInt(td.text());
                case 4:
                    bronze = Integer.parseInt(td.text());
                case 5:
                    total = Integer.parseInt(td.text());

            System.out.println(country + ": gold " + gold + " silver " + silver + " bronze " + bronze + " total " +
        } //table rows


Friday, 14 November 2014

The PromptCloud Advantage- Web Scraping with an Edge

The global market is now more aware of its data scraping needs. And so with the demand, the list of suppliers has grown too. This post is dedicated to bringing out the PromptCloud Advantage among such providers.

PromptCloud-Winning-The Race

1. The know-how- Crawling the web, as mundane as it may sound, is a fairly complex task. No one is to be blamed for overlooking the complexity as these things surface only after you’ve tried it yourself and delved into the nitty-gritty. The design decisions you take sit at the core of what you build and eventually monetize. And the long-term effects of such architectural choices are as pleasing if you’ve done it right as disturbing they might turn out if you’re not far-sighted.

Although the expertise of building the tech stack for such large-scale data acquisition, distributing your clusters (and putting thoughts into their geographical locations), maintaining queues, databases and backups, does come from ‘been there done that’, we have been lucky to have the tech advantage imbibed into us since inception. Not that we got it right the first time, but our systems have evolved with technologies, improving each day. Now that we have been there in this business for the last 56 months, it does feel like a long journey for our stack and yes, we do know better :)

2. SLAs- SLAs are what bolsters the data itself. PromptCloud’s key SLAs are scale and quality; while not compromising the data coverage or the politeness policies on your sources. Since we perform focused crawls, there’s no dilution of data and you can consume it all or ask us to index it in order to search using logical combinations in queries. For your reference, here’s a list of all SLAs to visit while picking your data service provider.


3. The Experience- There are many scraping tools and crawling services in the market which might just serve the need. What PromptCloud provides is a data acquisition experience; and we go as many number of extra miles as you’d like us to go for it. By leveraging our DaaS platform, we make sure you get what you need from the time you start your research for a data provider through importing the data feeds into your database. We hear your requirements in detail, make sure we’ve got it right by sharing samples and going multiple iterations of reprocessing the data to match your needs while you battle internally on freezing your requirements. But what’s more magical is the way all these feeds get delivered to you, at the intervals you requested; programatically.

It might be evident for the SLAs and the know-how fusing to provide the experience, but it’s that additional human touch that actually aids in sustaining it. We make sure you’re at peace while our systems handle the roadblocks and sort out the messiness on the web.