How To Get Facebook Data With Python

By using the Facebook Graph API, we can get the feed of posts and links published by a specific page, or by others on that page, as well as the likes and comments (the feed API). I have written a Python script that scrapes the feed info in JSON format and turns it into structured tables. Once the data is in tabular format, we can load it into a relational database or use common analytical tools (like Excel) for further analysis.

Data Model

We can split the feed data into three tables. Each post can have many likes and comments, and this data model nicely accommodates the one-to-many relationship. In the Feed table, Page_Name and Id form the composite key. Likes and Comments can be joined to Feed on Page_Name and Post_Id.
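
For a quick sanity check of the model, the three CSV files the script produces can be joined with pandas (a minimal sketch; pandas is not used by the script itself, and the file paths are just examples):

import pandas as pd

# Load the three tables generated by the script
feed = pd.read_csv("feed.csv")
likes = pd.read_csv("likes.csv")
comments = pd.read_csv("comments.csv")

# Join Likes and Comments back to Feed on the composite key
feed_likes = feed.merge(likes, left_on=["page_name", "id"],
                        right_on=["page_name", "post_id"], how="left")
feed_comments = feed.merge(comments, left_on=["page_name", "id"],
                           right_on=["page_name", "post_id"], how="left")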

Facebook Graph API

Facebook offers different authentication methods depending on which API function you want to use. In this example, all we need are the App ID and App Secret. We can use a neat trick to create the access token by concatenating the App ID and App Secret with “|”.

First of all, we need to create an app and generate API credentials.

  • Log in to Facebook and go to https://developers.facebook.com/.
  • Select ‘Add New App’ from the top left corner.
  • Enter Display Name and hit ‘Create App ID’.
  • Get the App ID and App Secret from the dashboard.
  • Access Token = <App ID>|<App Secret>
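
In code, the token is just that concatenation (a minimal sketch; the App ID and App Secret values are placeholders):

app_id = "1234567890"            # placeholder App ID from the dashboard
app_secret = "abcdef0123456789"  # placeholder App Secret from the dashboard

# App access token in the "<App ID>|<App Secret>" form
access_token = "{}|{}".format(app_id, app_secret)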

Python has a Facebook SDK and it works fine. However, I am using the requests and json packages to make API calls and process the data. In my opinion, the requests package is the best thing that has happened to building REST clients with Python. To make a GET request, we simply pass the URL and the access token as a parameter to the get() function. Then we convert the response to a JSON object for further processing.
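
The pattern looks roughly like this (a minimal sketch of the calls used in the script; the page name and field list here are just examples):

import requests

access_token = "<App ID>|<App Secret>"  # token built as described above
url = "https://graph.facebook.com/v2.10/CocaCola/feed"
params = {
    "access_token": access_token,
    "limit": "100",
    "fields": "id,created_time,message",
}

r = requests.get(url, params=params)
data = r.json()  # equivalent to json.loads(r.text)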

The Script

The script takes 7 arguments: Access Token, Page Name (e.g. CocaCola), JSON file name, Feed CSV file path, Likes CSV file path, Comments CSV file path, and Since date (the date from which to pull the data).

Example Call

python facebookScrapeFeed.py <Access Token> CocaCola \
feed.json feed.csv likes.csv comments.csv 2017-10-31

Key Points

The Since date has to be converted to a Unix timestamp. I created a method, convert_to_epochtime(), to convert a regular date string to a Unix timestamp.
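
The conversion itself is a one-liner with the time module (this is the core of convert_to_epochtime() in the script):

import time

# '2017-10-31' -> Unix timestamp (seconds since the epoch, local time)
epoch = int(time.mktime(time.strptime("2017-10-31", "%Y-%m-%d")))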

The maximum number of feed records per request is 100. To obtain more than 100 records, we loop the GET request, incrementing the offset parameter each time.
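
The script does this in its main block; a self-contained sketch of the looping idea (the function name and parameters here are illustrative only) looks like this:

import requests

def get_all_feed_pages(token, page, fields, since_epoch):
    '''Keep requesting pages of 100 feed records, bumping the offset each time.'''
    url = "https://graph.facebook.com/v2.10/{}/feed".format(page)
    offset = 0
    pages = []
    while True:
        params = {"access_token": token, "limit": "100", "offset": str(offset),
                  "fields": fields, "since": since_epoch}
        data = requests.get(url, params=params).json()
        pages.append(data)
        if len(data.get("data", [])) < 100:
            break       # a short page means there is nothing left to fetch
        offset += 100   # full page: move the window forward by 100 records
    return pages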

The maximum number of Likes and Comments records returned inside the feed JSON is 25 per post. If there are more than 25 records, we follow the URL in the next node until no next URL comes back in the data.
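
The same idea as a self-contained sketch for likes (the helper name is illustrative; the script inlines this logic in convert_likes_data() and convert_comments_data()):

import requests

def collect_likes(post):
    '''Follow the paging.next URLs of a post's likes until none comes back.'''
    likes = post.get("likes", {})
    rows = [(like["id"], like["name"]) for like in likes.get("data", [])]
    next_url = likes.get("paging", {}).get("next")
    while next_url:
        # Ask for 100 records per page instead of the default 25
        page = requests.get(next_url.replace("limit=25", "limit=100")).json()
        rows.extend((like["id"], like["name"]) for like in page.get("data", []))
        next_url = page.get("paging", {}).get("next")
    return rows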

The script works with both Python 2.7 and 3.x. Because each version handles Unicode differently, the Python 2-only setup (reload(sys) and setdefaultencoding) is guarded by a version check, and for Python 3 the output files should be opened with encoding='utf-8' as noted in the script.
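
A minimal sketch of the difference (the same version check appears near the top of the script):

import sys

if sys.version_info[0] == 2:
    # Python 2.7: make the default encoding utf-8 before writing the CSV files
    reload(sys)
    sys.setdefaultencoding('utf8')
    f = open("feed.csv", "w")
else:
    # Python 3.x: pass the encoding when opening the output file instead
    f = open("feed.csv", "w", encoding="utf-8")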

Now, here comes the code!

Code

import requests
import json
import sys
import time
'''
Reloading sys to set the default encoding to utf-8 only applies to Python 2.7,
so it is guarded by a version check below.
In Python 3, specify the encoding when opening a file instead, e.g.
f = open("file.csv", "w", encoding='utf-8')
'''

if sys.version_info[0] == 2:
    reload(sys)
    sys.setdefaultencoding('utf8')

class FacebookScraper:
    '''
    FacebookScraper class to scrape facebook info
    '''


    def __init__(self, token):
        self.token = token

    @staticmethod
    def convert_to_epochtime(date_string):
        '''Enter date_string in 2000-01-01 format and convert to epochtime'''
        try:
            epoch = int(time.mktime(time.strptime(date_string, '%Y-%m-%d')))
            return epoch
        except ValueError:
            print('Invalid string format. Make sure to use %Y-%m-%d')
            quit()

    def get_feed_data(self, target_page, offset, fields, json_path, date_string):
        """
        This method will get the feed data
        """

        url = "https://graph.facebook.com/v2.10/{}/feed".format(target_page)
        param = dict()
        param["access_token"] = self.token
        param["limit"] = "100"
        param["offset"] = offset
        param["fields"] = fields
        param["since"] = self.convert_to_epochtime(date_string)

        r = requests.get(url, param)
        data = json.loads(r.text)
        f = open(json_path, "w")
        f.write(json.dumps(data, indent=4))
        print("json file has been generated")

        f.close()

        return data
   
    def create_table(self, list_rows, file_path, page_name, table_name):
        '''This method will create a table according to header and table name'''

        if table_name == "feed" :
            header = ["page_name", "id", "type", "created_time", "message", "name",\
            "description", "actions_link", "actions_name", "share_count",\
            "comment_count", "like_count"]
        elif table_name == "likes":
            header = ["page_name", "post_id", "user_id", "name"]
        elif table_name == "comments":
            header = ["page_name", "post_id", "created_time", "message",\
             "user_id", "name", "message_id"]
        else:
            print("Specified table name is not valid.")
            quit()

        file = open(file_path, 'w')
        file.write(','.join(header) + '\n')
        for i in list_rows:
            file.write('"' + page_name + '",')
            for j in range(len(i)):
                row_string = ''
                if j < len(i) -1 :
                    row_string += '"' + str(i[j]).replace('"', '').replace('\n', '') + '"' + ','
                else:
                    row_string += '"' + str(i[j]).replace('"', '').replace('\n', '') + '"' + '\n'
                file.write(row_string)
        file.close()
        print("Generated {} table csv File for {}".format(table_name, page_name))

    def convert_feed_data(self, response_json_list):
        '''This method takes response json data and convert to csv'''
        list_all = []
        for response_json in response_json_list:
            data = response_json["data"]

            for i in range(len(data)):
                list_row = []
                row = data[i]
                id = row["id"]
                try:
                    type = row["type"]
                except KeyError:
                    type = ""
                try:
                    created_time = row["created_time"]
                except KeyError:
                    created_time = ""
                try:
                    message = row["message"]
                except KeyError:
                    message = ""
                try:
                    name = row["name"]
                except KeyError:
                    name = ""
                try:
                    description = row["description"]
                except KeyError:
                    description = ""
                try:
                    actions_link = row["actions"][0]["link"]
                except KeyError:
                    actions_link = ""
                try:
                    actions_name = row["actions"][0]["name"]
                except KeyError:
                    actions_name = ""
                try:
                    share_count = row["shares"]["count"]
                except KeyError:
                    share_count = ""
                try:
                    comment_count = row["comments"]["summary"]["total_count"]
                except KeyError:
                    comment_count = ""
                try:
                    like_count = row["likes"]["summary"]["total_count"]
                except KeyError:
                    like_count = ""
               
                list_row.extend((id, type, created_time, message, name, \
                description, actions_link, actions_name, share_count, comment_count, like_count))
                list_all.append(list_row)
       
        return list_all
   
    def convert_likes_data(self, response_json_list):
        '''This will get the list of people who liked post,
        which can be joined to the feed table by post_id. '''

        list_all = []
        for response_json in response_json_list:
            data = response_json["data"]
            # like_list = []
            for i in range(len(data)):
                likes_count = 0
                row = data[i]
                post_id = row["id"]
                try:
                   like_count = row["likes"]["summary"]["total_count"]
                except KeyError:
                    like_count = 0
                if like_count > 0:
                    likes = row["likes"]["data"]
                    for like in likes:
                        row_list = []
                        user_id = like["id"]
                        name = like["name"]
                        row_list.extend((post_id, user_id, name))
                        list_all.append(row_list)
                # Check if the next link exists
                try:
                    next_link = row["likes"]["paging"]["next"]
                except KeyError:
                    next_link = None
                    continue

                if next_link is not None:
                    r = requests.get(next_link.replace("limit=25", "limit=100"))
                    likes_data = json.loads(r.text)
                    while True:
                        for i in range(len(likes_data["data"])):
                            row_list = []
                            row = likes_data["data"][i]
                            user_id = row["id"]
                            name = row["name"].encode("latin1", "ignore")
                            row_list.extend((post_id, user_id, name))
                            list_all.append(row_list)
                        try:
                            next = likes_data["paging"]["next"]
                            r = requests.get(next.replace("limit=25", "limit=100"))
                            likes_data = json.loads(r.text)
                        except KeyError:
                            print("Likes for the post {} completed".format(post_id))
                            break
        return list_all

    def convert_comments_data(self, response_json_list):
        '''This will get the list of people who commented on the post,
        which can be joined to the feed table by post_id. '''

        list_all = []
        for response_json in response_json_list:
            data = response_json["data"]
            # like_list = []
            for i in range(len(data)):
                likes_count = 0
                row = data[i]
                post_id = row["id"]
                try:
                   comment_count = row["comments"]["summary"]["total_count"]
                except KeyError:
                    comment_count = 0
                if comment_count > 0:
                    comments = row["comments"]["data"]
                    for comment in comments:
                        row_list = []
                        created_time = comment["created_time"]
                        message = comment["message"].encode('latin1', 'ignore')
                        user_id = comment["from"]["id"]
                        name = comment["from"]["name"].encode('latin1', 'ignore')
                        message_id = comment["id"]
                        row_list.extend((post_id, created_time, message,\
                        user_id, name, message_id))
                        list_all.append(row_list)
               
                # Check if the next link exists
                try:
                    next_link = row["comments"]["paging"]["next"]
                except KeyError:
                    next_link = None
                    continue
               
                if next_link is not None:
                    r = requests.get(next_link.replace("limit=25", "limit=100"))
                    comments_data = json.loads(r.text)
                    while True:
                        for i in range(len(comments_data["data"])):
                            row_list = []
                            comment = comments_data["data"][i]
                            created_time = comment["created_time"]
                            message = comment["message"].encode('latin1', 'ignore')
                            user_id = comment["from"]["id"]
                            name = comment["from"]["name"].encode('latin1', 'ignore')
                            message_id = comment["id"]
                            row_list.extend((post_id, created_time, message,\
                            user_id, name, message_id))
                            list_all.append(row_list)
                        try:
                            next = comments_data["paging"]["next"]
                            r = requests.get(next.replace("limit=25", "limit=100"))
                            comments_data = json.loads(r.text)
                        except KeyError:
                            print("Comments for the post {} completed".format(post_id))
                            break
        return list_all

if __name__ == "__main__":

    token_input = sys.argv[1]
    target_page_input = sys.argv[2]
    json_path_input = sys.argv[3]
    csv_feed_path_input = sys.argv[4]
    csv_likes_path_input = sys.argv[5]
    csv_comments_path_input = sys.argv[6]
    date_since_input = sys.argv[7]
    # Input check
    print(token_input)
    print(target_page_input)
    field_input = ('id,created_time,name,message,comments.summary(true),'
                   'shares,type,published,link,likes.summary(true),actions,place,tags,'
                   'object_attachment,targeting,feed_targeting,scheduled_publish_time,'
                   'backdated_time,description')


    fb = FacebookScraper(token_input)

    offset = 0
    json_list = []
    while True:
        path = str(offset) + "_" + json_path_input
        try:
            data = fb.get_feed_data(target_page_input, str(offset), field_input, path, date_since_input)
            check = data['data']
            if (len(check) >= 100):
                json_list.append(data)
                offset += 100
            else:
                json_list.append(data)
                print("End of loop for obtaining more than 100 feed records.")
                break
        except KeyError:
            print("Error with get request.")
            quit()

    feed_table_list = fb.convert_feed_data(json_list)
    likes_table_list = fb.convert_likes_data(json_list)
    comments_table_list = fb.convert_comments_data(json_list)
    # Record check
    print(feed_table_list[0])
    print(likes_table_list[0])
    print(comments_table_list[0])

    fb.create_table(feed_table_list, csv_feed_path_input, target_page_input, "feed")
    fb.create_table(likes_table_list, csv_likes_path_input, target_page_input, "likes")
    fb.create_table(comments_table_list, csv_comments_path_input, target_page_input, "comments")