In this tutorial I will show you how to scrape Instagram user accounts with Scrapy.

What is Scrapy?

From the official website: “An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.”

Creating a new project

We could use Scrapy by simply importing the library via import scrapy; however, Scrapy also has a convenient way to generate a boilerplate project.

scrapy startproject instagram

This will create a folder named instagram with the following tree:

instagram
├── instagram
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

We won’t go into the details of each of these files in this tutorial. However, feel free to check the Scrapy documentation: Scrapy Documentation

Creating the item object

As the name suggests, items.py lets us define the structure of the items we want to save.

import scrapy


class UserItem(scrapy.Item):
    username = scrapy.Field()
    follows_count = scrapy.Field()
    followed_by_count = scrapy.Field()
    is_verified = scrapy.Field()
    biography = scrapy.Field()
    external_link = scrapy.Field()
    full_name = scrapy.Field()
    posts_count = scrapy.Field()
    posts = scrapy.Field()


class PostItem(scrapy.Item):
    code = scrapy.Field()
    likes = scrapy.Field()
    thumbnail = scrapy.Field()
    caption = scrapy.Field()
    hashtags = scrapy.Field()
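
Scrapy items behave like dictionaries, so you can sanity-check the definition quickly, for example in a Python shell inside the project (the values below are made up purely for illustration):

from instagram.items import UserItem

# fields can be set via keyword arguments or dict-style access
user = UserItem(username="example", is_verified=False)
user["posts"] = []

# items convert cleanly to plain dicts
print(dict(user))
# {'username': 'example', 'is_verified': False, 'posts': []}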

Creating the spider

Create a new file named spider.py in the spiders folder.

import scrapy
import json

from instagram.items import UserItem
from instagram.items import PostItem


class InstagramSpider(scrapy.Spider):

    name = 'instagramspider'
    allowed_domains = ['instagram.com']
    start_urls = ['https://www.instagram.com/_spataru/?__a=1']

    def parse(self, response):
        # get the JSON payload; a non-JSON response means the profile doesn't exist
        try:
            json_response = json.loads(response.text)
        except ValueError:
            self.logger.info("%s doesn't exist", response.url)
            return

        # skip private profiles
        if json_response.get("user", {}).get("is_private"):
            return

        # check if the username even worked
        try:
            json_response = json_response["user"]

            item = UserItem()

            # get user info
            item["username"] = json_response["username"]
            item["follows_count"] = json_response["follows"]["count"]
            item["followed_by_count"] = json_response["followed_by"]["count"]
            item["is_verified"] = json_response["is_verified"]
            item["biography"] = json_response.get("biography")
            item["external_link"] = json_response.get("external_url")
            item["full_name"] = json_response.get("full_name")
            item["posts_count"] = json_response.get("media").get("count")

            # iterate through each post
            item["posts"] = []

            posts = json_response.get("media").get("nodes")
            if posts:
                for post in posts:
                    items_post = PostItem()
                    items_post["code"] = post["code"]
                    items_post["likes"] = post["likes"]["count"]
                    items_post["caption"] = post["caption"]
                    items_post["thumbnail"] = post["thumbnail_src"]
                    item["posts"].append(dict(items_post))

            return item
        except KeyError:
            self.logger.info("Error during parsing %s", response.url)

Run the Scraper

From the project's top-level directory you can run the spider:

scrapy crawl instagramspider
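
To persist the results, Scrapy's built-in feed export can write the scraped items straight to a file via the -o flag; the filename here is just an example:

scrapy crawl instagramspider -o users.json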

If you have any problems running the script, leave a comment below :)