Thursday, 28 April 2016

Python or R: Which one is best?

Python is a versatile programming language that can do everything from data mining to plotting graphs. Its design philosophy emphasizes readability and simplicity, as summed up in the Zen of Python:
  • Beautiful is better than ugly.
  • Explicit is better than implicit.
  • Simple is better than complex.
  • Complex is better than complicated.
  • Flat is better than nested.
  • Sparse is better than dense.
  • Readability counts.
As you can imagine, algorithms in Python are designed to be easy to read and write. Blocks of Python code are delimited by indentation, and within each block you’ll find a syntax that wouldn’t be out of place in a technical handbook.
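For example, here is a short, hypothetical snippet showing how indentation alone defines each block and how the result reads almost like plain English:

# Indentation, not braces or keywords, marks where each block starts and ends.
def describe_numbers(numbers):
    for n in numbers:
        if n % 2 == 0:
            print(n, "is even")
        else:
            print(n, "is odd")

describe_numbers([1, 2, 3, 4])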


Benefits of R
R is a programming environment specifically designed for data analysis, and it is very popular in the data science community. You’ll need to understand R if you want to go far in a data science career.

PYTHON VS R

Usage

Python, as we noted above, is often used by computer programmers since it is the Swiss Army knife of programming languages: versatile enough that you can build websites and do data analysis with the same tool.
R is primarily used by researchers and academics who don’t necessarily have a background in computer science.

Syntax

Python has a nice clear “English-like” syntax that makes debugging and understanding code easier, while R has unconventional syntax that can be tricky to understand, especially if you have learned another programming language.

Learning curve

R is slightly harder to pick up, especially since it doesn’t follow the normal conventions other common programming languages have. Python is simple enough that it makes for a really good first programming language to learn.

Popularity

Python has consistently ranked among the top five most popular programming languages on GitHub, a widely used code-hosting service whose repository statistics give a reasonable picture of what programmers are actually using, while R typically hovers below the top ten.
Python is versatile, simple, easier to learn, and useful in a variety of contexts, some of which have nothing to do with data science. R is a specialized environment optimized for data analysis, but it is harder to learn. That said, some salary surveys suggest that sticking it out with R tends to pay more than working with Python.



Wednesday, 20 April 2016

Crawl Your Ecommerce Site with Python

Ecommerce business owners and managers have many good reasons to crawl their own websites, including monitoring pages, tracking site performance, ensuring the site is accessible to customers with disabilities, and looking for optimization opportunities.
For each of these, there are discrete tools, web crawlers, and services you could purchase to help monitor your site. While these solutions can be effective, with a relatively small amount of development work you can create your own site crawler and site-monitoring system.
The first step toward building your own, custom site-crawling and monitoring application is to simply get a list of all of the pages on your site. In this article, I’ll review how to use the Python programming language and a tidy web crawling framework called Scrapy to easily generate a list of those pages.


You’ll Need a Server, Python, and Scrapy

This is a development project. While it is relatively easy to complete, you will still need a server with Python and Scrapy installed. You will also want command line access to that server via a terminal application or an SSH client.
In a July 2015 article, “Monitor Competitor Prices with Python and Scrapy,” I described in some detail how to install Python and Scrapy on a Linux server or OS X machine. You can also get information about installing Python from the documentation section of Python.org. Scrapy also has good installation documentation.
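If you are setting up a fresh machine, installing Scrapy with pip is usually all that is needed once Python itself is in place (exact steps vary by platform, so treat this as a sketch and consult the Scrapy installation docs if anything fails):

pip install scrapy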
Given all of these available resources, I’ll start with the assumption that you have your server ready to go with both Python and Scrapy installed.


Create a Scrapy Project

Using an SSH client like PuTTY for Windows or the terminal application on a Mac or Linux computer, navigate to the directory where you want to keep your Scrapy projects. Using a built-in Scrapy command, startproject, we can quickly generate the basic files we need.
For this article, I am going to be crawling a website called Business Idea Daily, so I am naming the project “bid.”
scrapy startproject bid
Scrapy will generate several files and directories.
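The exact files vary a little between Scrapy versions, but for a project named “bid” the generated layout will look roughly like this:

bid/
    scrapy.cfg          # project deploy configuration
    bid/
        __init__.py
        items.py        # Item definitions; we will edit this shortly
        pipelines.py
        settings.py
        spiders/
            __init__.py # generated spiders will live in this directory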

Generate a New Scrapy Web Spider

For your convenience, Scrapy has another command line tool that will generate a new web spider automatically.
scrapy genspider -t crawl getbid businessideadaily.com
Let’s look at this command piece by piece.
The first term, scrapy, references the Scrapy framework. Next, we have the genspider command that tells Scrapy we want a new web spider or, if you prefer, a new web crawler.
The -t tells Scrapy that we want to choose a specific template. The genspider command can generate any one of four generic web spider templates: basic, crawl, csvfeed, and xmlfeed. Directly after the -t, we specify the template we want; in this example, we will be creating what Scrapy calls a CrawlSpider.
The term, getbid, is simply the name of the spider; this could have been any reasonable name.
The final portion of the command tells Scrapy what website we want to crawl. The framework will use this to populate a couple of the new spider’s parameters.


Define Items

In Scrapy, Items are mini models or ways of organizing the things our spider collects when it crawls a specific website. While we could easily complete our aim — getting a list of all of the pages on a specific website — without using Items, not using Items might limit us if we wanted to expand our crawler later.
To define an Item, simply open the items.py file Scrapy created when we generated the project. In it, there will be a class called BidItem. The class name is based on the name we gave our project.
class BidItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
Replace pass with a definition for a new field called url.
    url = scrapy.Field()
Save the file and you’re done.
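After that edit, the working part of items.py should look roughly like this:

import scrapy

class BidItem(scrapy.Item):
    url = scrapy.Field()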

Build the Web Spider

Next, open the spiders directory in your project and look for the new spider Scrapy generated. In this example, the spider is called getbid, so the file is getbid.py.
When you open this file in an editor, you should see something like the following.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from bid.items import BidItem


class GetbidSpider(CrawlSpider):
    name = 'getbid'
    allowed_domains = ['businessideadaily.com']
    start_urls = ['http://www.businessideadaily.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = BidItem()
        # i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        # i['name'] = response.xpath('//div[@id="name"]').extract()
        # i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
We need to make a few minor changes to the code Scrapy generated for us. First, we need to modify the arguments for the LinkExtractor under rules; we are simply going to delete everything inside the parentheses.
Rule(LinkExtractor(), callback='parse_item', follow=True),
With this update, our spider will find every link on the start page (home page), follow those links, hand each response to the parse_item method, and keep following links on subsequent pages to ensure we reach every linked page on the site.
Next, we need to update the parse_item method. We will remove all of the commented lines. These lines were just examples that Scrapy included for us.
def parse_item(self, response):
    i = BidItem()
    return i
I like to use variable names that have meaning. So I am going to change the i to href, which is the name of the attribute in an HTML link that holds, if you will, the target link’s address.
def parse_item(self, response):
    href = BidItem()
    return href
Now for the magic. We will capture the page URL as an Item.
def parse_item(self, response):
    href = BidItem()
    href['url'] = response.url
    return href
That is it. The new spider is ready to crawl.
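To start the crawl, run Scrapy’s crawl command from the project directory; the -o option tells Scrapy to export the collected Items, so the URLs end up in a file (the filename here is just an example):

scrapy crawl getbid -o pages.csv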
