Scraping Instagram users with Scrapy

25 November 2016

In this tutorial I will show you how to scrape user accounts on Instagram with Scrapy.

What is Scrapy?

From the official website: “An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.”

Creating a new project

We can use Scrapy by importing the library via import scrapy; however, Scrapy also provides a nice way to generate boilerplate templates.

scrapy startproject instagram

This will create a folder named instagram with the following tree:

instagram
├── instagram
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg
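
Scrapy can also generate a spider skeleton inside the new project with its genspider command. We will write our spider by hand below, but for reference the command looks like this:

scrapy genspider instagramspider instagram.com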

We won’t go into the details of each of these files in this tutorial. However, feel free to check the official docs: Scrapy Documentation

Creating the item object

As the name suggests, items.py lets us define the structure of the items we want to save.

import scrapy


class UserItem(scrapy.Item):
    # Profile-level fields scraped from the account page
    username = scrapy.Field()
    follows_count = scrapy.Field()
    followed_by_count = scrapy.Field()
    is_verified = scrapy.Field()
    biography = scrapy.Field()
    external_link = scrapy.Field()
    full_name = scrapy.Field()
    posts_count = scrapy.Field()
    posts = scrapy.Field()


class PostItem(scrapy.Item):
    # Per-post fields, stored as plain dicts inside UserItem["posts"]
    code = scrapy.Field()
    likes = scrapy.Field()
    thumbnail = scrapy.Field()
    caption = scrapy.Field()
    hashtags = scrapy.Field()
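
A Scrapy Item behaves much like a dictionary that only accepts its declared fields, which is why the spider below can build one up key by key. A quick illustrative sketch:

item = UserItem()
item["username"] = "example"
item["follows_count"] = 42
print(dict(item))  # {'username': 'example', 'follows_count': 42}
# Assigning an undeclared field raises a KeyError:
# item["unknown"] = 1  -> KeyError: 'UserItem does not support field: unknown'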

Creating the spider

Create a new file named spider.py in the spiders folder.

import json

import scrapy

from instagram.items import UserItem
from instagram.items import PostItem


class InstagramSpider(scrapy.Spider):

    name = 'instagramspider'
    allowed_domains = ['instagram.com']
    start_urls = ['https://www.instagram.com/_spataru/?__a=1']

    def parse(self, response):
        # The ?__a=1 endpoint returns JSON, so parse the body directly
        try:
            json_response = json.loads(response.body_as_unicode())
        except ValueError:
            self.logger.info("%s doesn't exist", response.url)
            return

        # Skip private accounts: their media is not exposed in the JSON
        if json_response.get("user", {}).get("is_private"):
            return

        # Check if the username even worked
        try:
            json_response = json_response["user"]

            item = UserItem()

            # Get the user info
            item["username"] = json_response["username"]
            item["follows_count"] = json_response["follows"]["count"]
            item["followed_by_count"] = json_response["followed_by"]["count"]
            item["is_verified"] = json_response["is_verified"]
            item["biography"] = json_response.get("biography")
            item["external_link"] = json_response.get("external_url")
            item["full_name"] = json_response.get("full_name")
            item["posts_count"] = json_response.get("media").get("count")

            # Iterate through each post
            item["posts"] = []

            json_response = json_response.get("media").get("nodes")
            if json_response:
                for post in json_response:
                    items_post = PostItem()
                    items_post["code"] = post["code"]
                    items_post["likes"] = post["likes"]["count"]
                    items_post["caption"] = post["caption"]
                    items_post["thumbnail"] = post["thumbnail_src"]
                    item["posts"].append(dict(items_post))

            return item
        except KeyError:
            self.logger.info("Error during parsing %s", response.url)

Run the Scraper

From the top-level project directory (the one containing scrapy.cfg) you can run the spider:

scrapy crawl instagramspider

Now it should be scraping accounts.
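
If you want to keep the results, Scrapy’s built-in feed exports can write the scraped items straight to a file with the -o flag; this is standard Scrapy, nothing specific to this project:

scrapy crawl instagramspider -o users.json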