How To Get Data From Google Analytics With Python

When you ingest data from Google Analytics, you need to create a series of reports based on GA dimensions and metrics. The granularity is determined by the dimensions you add to the report. The most important thing is to understand the business requirements before you start ingesting data. Good requirements analysis will enable you to drill up and down metrics at the right granularity and slice them by the dimensions that are critical to the business.

The free version of Google Analytics only allows up to 7 dimensions and 10 metrics in a single report. You also need to note that not all dimensions and metrics can be queried together. In my example, I created reports over the key metrics and grouped dimensions according to the GA Dimensions & Metrics Explorer. For example, I created the ga_dim_geo table, which has the dimensions grouped under Geo Network. In my code, you can see how I queried the same metrics against different dimension groups (such as geo location, traffic source, device, channels and so on). GA also has a really neat query tool (Query Explorer) where you can create ad hoc reports.
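
One way to keep these groupings manageable is to think of each report as a set of report_name, dimensions and metrics. Here is a minimal sketch of that idea (the report names and dimension groups mirror the script below; adapt them to your own property):

# A hedged sketch: one entry per report (each report becomes one table).
# The date dimensions give hourly granularity across every report.
DATE_DIM = ',ga:date,ga:hour'
COMMON_METRICS = 'ga:users,ga:newUsers,ga:sessions,ga:bounces,ga:sessionDuration,ga:hits,ga:pageviews'

REPORTS = [
    ('ga_dim_geo', 'ga:country,ga:region,ga:city,ga:continent,ga:language' + DATE_DIM, COMMON_METRICS),
    ('ga_dim_traffic_source', 'ga:source,ga:medium,ga:socialNetwork' + DATE_DIM, COMMON_METRICS),
    ('ga_dim_device', 'ga:browser,ga:browserVersion,ga:screenResolution,ga:deviceCategory' + DATE_DIM, COMMON_METRICS),
]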

The Python script uses the Core Reporting API v3.0 and OAuth 2.0. The details on how to obtain API keys and the example scripts can be found here.

The report comes back in JSON format. Many databases allow you to insert a JSON file into a table. You can also fetch the JSON file with an ETL tool and do the conversion there. In this example, the script converts the JSON to CSV.
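
As a minimal sketch of that conversion (the full script below does it by hand with string concatenation), each GA result carries columnHeaders and rows arrays that map directly onto Python's csv module; the file names here are illustrative:

import csv
import json

# A hedged sketch: write the "rows" array of a GA v3 JSON result to CSV.
with open('ga_dim_geo_1.json') as jf:    # illustrative input file
    result = json.load(jf)

with open('ga_dim_geo.csv', 'w') as cf:  # illustrative output file
    writer = csv.writer(cf)
    # Column headers come from the report's dimensions and metrics.
    writer.writerow([h['name'].replace('ga:', '') for h in result['columnHeaders']])
    writer.writerows(result['rows'])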

There is also example code in Java to get data from GA here: https://www.mydatahack.com/how-to-get-data-from-google-analytics-with-java/

Key Points

The maximum number of records you can get from this API in a single call is 10,000. When you do a historical load, you will easily exceed this limit. I am using the start_index parameter to iterate the API call until we get all the records.
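
The pagination logic boils down to the loop below (a simplified sketch of the get_results function in the full script; service, profile_id and the report parameters are assumed to be set up already):

# A hedged sketch of the pagination loop in get_results below.
start_index = 1
while True:
    result = service.data().ga().get(
        ids='ga:' + profile_id, start_date=start, end_date=end,
        metrics=metrics, dimensions=dimensions,
        start_index=start_index, max_results=10000).execute()
    handle(result)                    # illustrative placeholder per page
    if 'nextLink' not in result:      # no more pages to fetch
        break
    start_index += 10000              # the script parses this from nextLink instead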

You can easily add a report to the script by adding report_name, dimensions and metrics in the main method, as shown below. In this design, each report becomes a table.
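
For example, to add a page report (a hedged sketch; the table name and dimension choices are illustrative), you would append another block like the existing ones in main():

    # A hypothetical additional report block, mirroring the others in main().
    report_name = 'ga_dim_page'
    date_dim = ',ga:date,ga:hour'
    dimensions = 'ga:pagePath,ga:landingPagePath' + date_dim
    metrics = 'ga:users,ga:sessions,ga:pageviews'

    header = create_header(dimensions, metrics, report_name)
    results_list = get_results(service, profile_id, start_date, end_date, metrics, dimensions, report_name, json_path)
    create_table(results_list, csv_path, header, report_name)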

Credentials and file paths need to be passed as arguments when you call the script, as below:

python script.py <key file location e.g. /tmp/key/key.p12> <service-email>
<json_path e.g. /tmp/gadata/> <start-date e.g. 2017-11-01> <end-date e.g. 2017-11-02>
<csv_path e.g. /tmp/gadata/>
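
A concrete invocation might look like this (the key path, service account email, dates and output folders are illustrative):

python script.py /tmp/key/key.p12 my-service@my-project.iam.gserviceaccount.com /tmp/gadata/ 2017-11-01 2017-11-02 /tmp/gadata/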

Here comes the code! Have fun!

Code

import json
from apiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials
import httplib2
from oauth2client import client
from oauth2client import file
from oauth2client import tools

import sys
'''
Reloading sys for utf8 encoding is for Python 2.7.
These lines should be removed for Python 3.
In Python 3, we need to specify the encoding when opening a file:
f = open("file.csv", encoding='utf-8')
'''

reload(sys)
sys.setdefaultencoding('utf8')

# (1) Get Credentials and File Paths from Argument
key_file = sys.argv[1]
service_email = sys.argv[2]
json_path = sys.argv[3]
start_date = sys.argv[4]
end_date = sys.argv[5]
csv_path = sys.argv[6]

scope=['https://www.googleapis.com/auth/analytics.readonly']
api_name = 'analytics'
api_version = 'v3'

# (2) Create Credentials and build the Service Object
def get_service(key_file, service_email):
    credentials = ServiceAccountCredentials.from_p12_keyfile(\
        service_email, key_file, scopes=scope)
    http = credentials.authorize(httplib2.Http())
    # Build the service object.
    service = build(api_name, api_version, http=http)
    return service

# (3) Use the Analytics service object to get the first profile id
def get_first_profile_id(service):
    # Get a list of all Google Analytics accounts for this user
    accounts = service.management().accounts().list().execute()

    if accounts.get('items'):
        # Get the first Google Analytics account.
        account = accounts.get('items')[0].get('id')

        # Get a list of all the properties for the first account.
        properties = service.management().webproperties().list(
            accountId=account).execute()

        if properties.get('items'):
            # Get the first property id.
            property = properties.get('items')[0].get('id')

        # Get a list of all views (profiles) for the first property.
        profiles = service.management().profiles().list(
            accountId=account,
            webPropertyId=property).execute()

        if profiles.get('items'):
            # return the first view (profile) id.
            return profiles.get('items')[0].get('id')
    return None

# (4) Find the start-index for the next record
def find_next_index(url_string):
    next_link = url_string
    start_index_pos = next_link.find('start-index=')
    end_pos = next_link.find('&', start_index_pos)
    return int(next_link[start_index_pos + 12: end_pos])

# (5) Convert Result to csv
def create_table(results_list, csv_path, header, report_name):
    csv_file = csv_path + report_name + '.csv'
    f = open(csv_file, "w")
    f.write(header + '\n')
    for result in results_list:
        for row in result["rows"]:
            f.write(','.join(row) + '\n')
    f.close()
    print("{}.csv has been generated".format(report_name))


# (6) Get results, generate json file and return a list of results
def get_results(service, profile_id, start, end, metrics, dimensions, report_name, json_path):

    counter = 0
    start_index = 1
    results_list = []

    while True:
        f_name = report_name + '_' + str(start_index) + '.json'
        json_file_path = json_path + f_name

        result = service.data().ga().get(\
            ids='ga:'+profile_id, start_date=start, end_date=end,\
            metrics=metrics, dimensions=dimensions, start_index=start_index,\
            max_results=10000).execute()
        print("Total Record Number for the %s query result no.%s: %s" \
        % (report_name, str(counter), len(result['rows'])))

        results_list.append(result)
        # Create Json Files
        pretty = json.dumps(result, indent=1)
        f = open(json_file_path, "w")
        f.write(pretty)
        f.close()
        print("{} has been successfully generated.".format(f_name))
        try:
            next_link = result["nextLink"]
            start_index = find_next_index(next_link)
            counter += 1
        except KeyError:
            print("No more records to retrieve for {}".format(f_name))
            break
    return results_list

# (7) Create Header for csv file
def create_header(dimensions, metrics, report_name):
    header = dimensions.replace('ga:', '') + ',' + metrics.replace('ga:', '')
    header = header.replace(' ', '')
    print("Table Name: {}".format(report_name))
    print(header)
    return header

# Main Method
def main():

    # Authentication
    service = get_service(key_file, service_email)
    profile_id = get_first_profile_id(service)
    print("Obtained Profile ID: " + profile_id)

    # Set up Reports
    # (1) Metrics based on Geographical Dimensions
    report_name = 'ga_dim_geo'
    date_dim = ',ga:date,ga:hour'

    dimensions = ('ga:country,'
                  'ga:region,'
                  'ga:city,'
                  'ga:continent,'
                  'ga:language' + date_dim)

    metrics = ('ga:users,'
               'ga:newUsers,'
               'ga:sessions,'
               'ga:bounces,'
               'ga:sessionDuration,'
               'ga:hits,'
               'ga:pageviews')


    header = create_header(dimensions, metrics, report_name)
    results_list = get_results(service, profile_id, start_date, end_date, metrics, dimensions, report_name, json_path)
    create_table(results_list, csv_path, header, report_name)

    # (2) Metrics based on Traffic Source Dimensions
    report_name = "ga_dim_traffic_source"
    date_dim = ',ga:date,ga:hour'

    dimensions = ('ga:source,'
                  'ga:medium,'
                  'ga:socialNetwork' + date_dim)

    metrics = ('ga:users,'
               'ga:newUsers,'
               'ga:sessions,'
               'ga:bounces,'
               'ga:sessionDuration,'
               'ga:hits,'
               'ga:pageviews')


    header = create_header(dimensions, metrics, report_name)
    results_list = get_results(service, profile_id, start_date, end_date, metrics, dimensions, report_name, json_path)
    create_table(results_list, csv_path, header, report_name)

    # (3) Metrics based on Device Dimensions
    report_name = "ga_dim_device"
    date_dim = ',ga:date,ga:hour'

    dimensions = ('ga:browser,'
                  'ga:browserVersion,'
                  'ga:screenResolution,'
                  'ga:deviceCategory' + date_dim)

    metrics = ('ga:users,'
               'ga:newUsers,'
               'ga:sessions,'
               'ga:bounces,'
               'ga:sessionDuration,'
               'ga:hits,'
               'ga:pageviews')


    header = create_header(dimensions, metrics, report_name)
    results_list = get_results(service, profile_id, start_date, end_date, metrics, dimensions, report_name, json_path)
    create_table(results_list, csv_path, header, report_name)

    # (4) Metrics based on Channel Grouping and Event Action
    report_name = 'ga_dim_channel_event'

    date_dim = ',ga:date,ga:hour'

    dimensions = ('ga:channelGrouping,'
                  'ga:eventAction' + date_dim)

    metrics = ('ga:uniqueEvents,'
               'ga:users,'
               'ga:newUsers,'
               'ga:sessions,'
               'ga:bounces,'
               'ga:sessionDuration,'
               'ga:hits,'
               'ga:pageviews')


    header = create_header(dimensions, metrics, report_name)
    results_list = get_results(service, profile_id, start_date, end_date, metrics, dimensions, report_name, json_path)
    create_table(results_list, csv_path, header, report_name)

    # (5) Goal Metrics by Channel Grouping
    report_name = 'ga_goal_dim_channel'

    date_dim = ',ga:date,ga:hour'

    dimensions = 'ga:channelGrouping' + date_dim

    metrics = ('ga:goal15Completions,'
               'ga:goal2Completions,'
               'ga:goal7Completions,'
               'ga:goal11Completions,'
               'ga:goal12Completions,'
               'ga:goal10Completions,'
               'ga:goal19Completions,'
               'ga:transactions')


    header = create_header(dimensions, metrics, report_name)
    results_list = get_results(service, profile_id, start_date, end_date, metrics, dimensions, report_name, json_path)
    create_table(results_list, csv_path, header, report_name)

# Execute
if __name__ == "__main__":
    main()