How To Get Data From Google Analytics With Python
- By : Mydatahack
- Category : Data Engineering, Data Ingestion
- Tags: Python API Data Ingestion
When you ingest data from Google Analytics, you need to create a series of reports based on GA dimensions and metrics. The granularity is determined by dimensions you add in the report. The most important thing is to understand business requirements before start ingesting data. Good requirement analysis will enable you to drill up and down metrics at the right granularity and slice them with the dimensions that are critical to the business.
Google Analytics (for free version) only allows us to have up to 7 dimensions and 10 metrics in a single report. You also need to note that not all dimensions and metrics can be queried together. In my example, I created reports over the key metrics and grouped dimensions according to the GA Dimensions & Metrics Explorer . For example, I created the ga_dim_geo table, which have the dimensions grouped under Geo Network. In my code, you can see how I grouped the same metrics against different dimension groups (like geo location, traffic source, device, channels and so on). GA has a really neat query tool (Query Explorer) where you can create ad hoc reports.
The python script uses the Core Reporting API v3.0 and OAuth2.0. The details on how to obtain API keys and the example scripts can be found here.
The report comes in the JSON format. Many databases allow you to insert JSON file to a table. You can also fetch the json file with an ETL tool and do the conversion. In this example, the script converts JSON to CSV.
There is also an example code in Java to get data from GA (a href=”https://www.mydatahack.com/how-to-get-data-from-google-analytics-with-java/” target=”_blank”>here.
Key Points
The maximum number of records you can get from this API is 10000. When you are trying to do a historical load, you will easily exceed the limit. I am using the start_index attribute for to iterate the API call until we get all the records.
You can easily add a report in the script by adding report_name, dimensions and metrics in the main method. In this design, each report becomes a table.
Credentials and file paths need to be in the arguments when you call the script as below:
<json_path e.g. /tmp/gadata/> <start-date e.g. 2017-11-01> <end-date e.g. 2017-11-02>
<csv_path e.g. /tmp/gadata/>
Here comes the code! Have fun!
Code
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 | import json from apiclient.discovery import build from oauth2client.service_account import ServiceAccountCredentials import httplib2 from oauth2client import client from oauth2client import file from oauth2client import tools import sys ''' reloading sys for utf8 encoding is for Python 2.7 This line should be removed for Python 3 In Python 3, we need to specify encoding when open a file f = open("file.csv", encoding='utf-8') ''' reload(sys) sys.setdefaultencoding('utf8') # (1) Get Credentials and File Paths from Argument key_file = sys.argv[1] service_email = sys.argv[2] json_path = sys.argv[3] start_date = sys.argv[4] end_date = sys.argv[5] csv_path = sys.argv[6] scope=['https://www.googleapis.com/auth/analytics.readonly'] api_name = 'analytics' api_version = 'v3' # (2) Create Credentials and build the Service Object def get_service(key_file, service_email): credentials = ServiceAccountCredentials.from_p12_keyfile(\ service_email, key_file, scopes=scope) http = credentials.authorize(httplib2.Http()) # Build the service object. service = build(api_name, api_version, http=http) return service # (3) Use the Analytics service object to get the first profile id def get_first_profile_id(service): # Get a list of all Google Analytics accounts for this user accounts = service.management().accounts().list().execute() if accounts.get('items'): # Get the first Google Analytics account. account = accounts.get('items')[0].get('id') # Get a list of all the properties for the first account. properties = service.management().webproperties().list( accountId=account).execute() if properties.get('items'): # Get the first property id. property = properties.get('items')[0].get('id') # Get a list of all views (profiles) for the first property. profiles = service.management().profiles().list( accountId=account, webPropertyId=property).execute() if profiles.get('items'): # return the first view (profile) id. return profiles.get('items')[0].get('id') return None # (4) Find the start-index for the next record def find_next_index(url_string): next_link = url_string start_index_pos = next_link.find('start-index=') end_pos = next_link.find('&', start_index_pos) return int(next_link[start_index_pos + 12: end_pos]) # (5) Convert Result to csv def create_table(results_list, csv_path, header, report_name): csv_file = csv_path + report_name + '.csv' f = open(csv_file, "w") f.write(header + '\n') for result in results_list: result_row = result["rows"] for i in range(len(result_row)): row_i = result_row[i] row_string = '' for j in range(len(row_i)): row_string += row_i[j] + ',' f.write(row_string[:-1] + '\n') f.close() print("{}.csv has been generated".format(report_name)) # (6) Get results, generate json file and return a list of results def get_results(service, profile_id, start, end, metrics, dimensions, report_name, json_path): counter = 0 start_index = 1 results_list = [] while True: f_name = report_name + '_' + str(start_index) + '.json' json_file_path = json_path + f_name result = service.data().ga().get(\ ids='ga:'+profile_id, start_date=start, end_date=end,\ metrics=metrics, dimensions=dimensions, start_index=start_index,\ max_results=10000).execute() print("Total Record Number for the %s query result no.%s: %s" \ % (report_name, str(counter), len(result['rows']))) results_list.append(result) # Create Json Files pretty = json.dumps(result, indent=1) f = open(json_file_path, "w") f.write(pretty) f.close() print("{} has been successfully generated.".format(f_name)) try: next_link = result["nextLink"] start_index = find_next_index(next_link) counter += 1 except KeyError: print("No more records to retrieve for {}".format(f_name)) break return results_list # (7) Create Header for csv file def create_header(dimensions, metrics, report_name): header = dimensions.replace('ga:', '') + ',' + metrics.replace('ga:', '') header = header.replace(' ', '') print("Table Name: {}".format(report_name)) print(header) return header # Main Method def main(): # Authentication service = get_service(key_file, service_email) profile_id = get_first_profile_id(service) print("Obtained Profile ID: " + profile_id) # Set up Reports # (1) Metrics based on Geographical Dimensions report_name = 'ga_dim_geo' date_dim = ',ga:date,ga:hour' dimensions = 'ga:country,\ ga:region,\ ga:city,\ ga:continent,\ ga:language' + date_dim metrics = 'ga:users,\ ga:newusers,\ ga:sessions,\ ga:bounces,\ ga:sessionDuration,\ ga:hits,\ ga:pageviews' header = create_header(dimensions, metrics, report_name) results_list = get_results(service, profile_id, start_date, end_date, metrics, dimensions, report_name, json_path) create_table(results_list, csv_path, header, report_name) # (2) Metrics based on Traffic Source Dimensions report_name = "ga_dim_traffic_source" date_dim = ',ga:date,ga:hour' dimensions = 'ga:source,\ ga:medium,\ ga:socialNetwork' + date_dim metrics = 'ga:users,\ ga:newusers,\ ga:sessions,\ ga:bounces,\ ga:sessionDuration,\ ga:hits,\ ga:pageviews' header = create_header(dimensions, metrics, report_name) results_list = get_results(service, profile_id, start_date, end_date, metrics, dimensions, report_name, json_path) create_table(results_list, csv_path, header, report_name) # (3) Metrics based on Device Dimensions report_name = "ga_dim_device" date_dim = ',ga:date,ga:hour' dimensions = 'ga:browser,\ ga:browserVersion,\ ga:screenResolution,\ ga:deviceCategory' + date_dim metrics = 'ga:users,\ ga:newusers,\ ga:sessions,\ ga:bounces,\ ga:sessionDuration,\ ga:hits,\ ga:pageviews' header = create_header(dimensions, metrics, report_name) results_list = get_results(service, profile_id, start_date, end_date, metrics, dimensions, report_name, json_path) create_table(results_list, csv_path, header, report_name) # (4) Metrics based on Channel Grouping and Event Action report_name = 'ga_dim_channel_event' date_dim = ',ga:date,ga:hour' dimensions = 'ga:channelGrouping,\ ga:eventAction' + date_dim metrics = 'ga:uniqueEvents, \ ga:users,\ ga:newusers,\ ga:sessions,\ ga:bounces,\ ga:sessionDuration,\ ga:hits,\ ga:pageviews' header = create_header(dimensions, metrics, report_name) results_list = get_results(service, profile_id, start_date, end_date, metrics, dimensions, report_name, json_path) create_table(results_list, csv_path, header, report_name) # (5) Goal Metrics by Channel Grouping report_name ='ga_goal_dim_channel' date_dim = ',ga:date,ga:hour' dimensions = 'ga:channelGrouping' + date_dim metrics = 'ga:goal15Completions,\ ga:goal2Completions, \ ga:goal7Completions, \ ga:goal11Completions, \ ga:goal12Completions, \ ga:goal10Completions, \ ga:goal19Completions, \ ga:goal15Completions, \ ga:transactions' header = create_header(dimensions, metrics, report_name) results_list = get_results(service, profile_id, start_date, end_date, metrics, dimensions, report_name, json_path) create_table(results_list, csv_path, header, report_name) # Execute if __name__ == "__main__": main() |