Scraping Instagram users with Scrapy

25 November 2016

In this tutorial I will show you how to scrape Instagram user accounts with Scrapy.

What is Scrapy?

From the official website: "An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way."

Creating a new project

We could use Scrapy simply by importing the library via import scrapy, but Scrapy also has a nice way to generate a boilerplate project:

scrapy startproject instagram

This will create a folder named instagram with the following tree:

├── instagram
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

We won't go into the details of each of these files in this tutorial. However, feel free to check the official docs: Scrapy Documentation

Creating the item object

As the name suggests, the items.py file allows us to define the structure of the items we want to save.

import scrapy

class UserItem(scrapy.Item):    
    username = scrapy.Field()
    follows_count = scrapy.Field()
    followed_by_count = scrapy.Field()
    is_verified = scrapy.Field()
    biography = scrapy.Field()
    external_link = scrapy.Field()
    full_name = scrapy.Field()
    posts_count = scrapy.Field()
    posts = scrapy.Field()

class PostItem(scrapy.Item):
    code = scrapy.Field()
    likes = scrapy.Field()
    thumbnail = scrapy.Field()
    caption = scrapy.Field()
    hashtags = scrapy.Field()

Creating the spider

Create a new file in the spiders folder, for example instagram_spider.py, with the following content:

import scrapy
import json

from instagram.items import UserItem
from instagram.items import PostItem

class InstagramSpider(scrapy.Spider):

    name = 'instagramspider'
    allowed_domains = ['instagram.com']
    start_urls = []

    def __init__(self):
        # replace <username> with the account you want to scrape;
        # appending ?__a=1 makes Instagram return the profile page as JSON
        self.start_urls = ["https://www.instagram.com/<username>/?__a=1"]

    def parse(self, response):
        #get the json file
        json_response = {}
        try:
            json_response = json.loads(response.body_as_unicode())
        except ValueError:
            self.logger.info('%s doesnt exist', response.url)
            return

        try:
            #check if the username even worked and the account is public
            if not json_response["user"]["is_private"]:
                json_response = json_response["user"]

                item = UserItem()

                #get User Info
                item["username"] = json_response["username"]
                item["follows_count"] = json_response["follows"]["count"]
                item["followed_by_count"] = json_response["followed_by"]["count"]
                item["is_verified"] = json_response["is_verified"]
                item["biography"] = json_response.get("biography")
                item["external_link"] = json_response.get("external_url")
                item["full_name"] = json_response.get("full_name")
                item["posts_count"] = json_response.get("media").get("count")

                #iterate through each post
                item["posts"] = []

                json_response = json_response.get("media").get("nodes")
                if json_response:
                    for post in json_response:
                        items_post = PostItem()
                        #field names assume the layout of the media nodes
                        #in the JSON response
                        items_post["code"] = post.get("code")
                        items_post["likes"] = post.get("likes", {}).get("count")
                        items_post["thumbnail"] = post.get("thumbnail_src")
                        items_post["caption"] = post.get("caption")
                        #hashtags live inside the caption text
                        if items_post["caption"]:
                            items_post["hashtags"] = [word[1:] for word in
                                                      items_post["caption"].split()
                                                      if word.startswith("#")]
                        item["posts"].append(items_post)

                return item
        except KeyError:
            self.logger.info("Error during parsing %s", response.url)
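
The PostItem above declares a hashtags field, but Instagram's JSON does not return hashtags as a separate field — they live inside the caption text. A small sketch of one way to pull them out with a regex (the caption string here is made up):

```python
import re

# a made-up caption, standing in for the caption text of a post
caption = "Sunset at the beach #sunset #travel #nofilter"

# '#' followed by word characters captures each hashtag without the '#'
hashtags = re.findall(r"#(\w+)", caption)
print(hashtags)  # ['sunset', 'travel', 'nofilter']
```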

Run the Scraper

From the project's top-level directory you can now run the scraper:

scrapy crawl instagramspider

It should now be scraping accounts.