How Postgres JSON Queries Handle Missing Keys

When we transform JSON into a structured format with a programming language during data ingestion, we have to handle missing keys. JSON is schema-less, so unlike relational database tables, its records do not always contain the same keys.

In Python, missing-key handling typically looks like this.

try:
    tmp.append(record['info']['postcode'])
except KeyError:
    tmp.append('NA')

In the previous post, we examined Postgres JSON support and discussed how we could use it as a new JSON ingestion strategy (New JSON Data Ingestion Strategy By Using the Power of Postgres).

If you are used to transforming JSON in a programming language, the first question that comes to mind is how Postgres handles missing keys.

Let me start with the conclusion: Postgres returns null when a key is missing. It also returns null when the key's value is null. This is awesome because the database handles missing keys for you, with no explicit key-error handling required.
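You can confirm this behaviour with a couple of throwaway queries before loading any real data (the literals here are just for illustration):

SELECT '{"a": 1}'::jsonb ->> 'b';    -- missing key: returns null
SELECT '{"a": null}'::jsonb ->> 'a'; -- JSON null value: also returns null

-- If you ever need to tell the two cases apart, the ? operator
-- checks whether the key exists at the top level
SELECT '{"a": null}'::jsonb ? 'a';   -- true
SELECT '{"a": null}'::jsonb ? 'b';   -- false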

Let’s take a look.

Record Example

We have 6 JSON records.

  1. Complete record
  2. Note is null
  3. Note key is missing
  4. Payment is null
  5. Payment Type key is missing
  6. Payment key is missing

JSON File

[
  {
    "Id": 100,
    "Name": "John",
    "TransactionId": "tran1",
    "Transaction": [
      {
        "ItemId": "a100",
        "price": 200
      },
      {
        "ItemId": "a110",
        "price": 200
      }
    ],
    "Subscriber": true,
    "Payment": {
      "Type": "Credit-Card",
      "Total": 400,
      "Success": true
    },
    "Note": "1st Complete Record"
  },
  {
    "Id": 101,
    "Name": "Tom",
    "TransactionId": "tran2",
    "Transaction": [
      {
        "ItemId": "a100",
        "price": 200
      },
      {
        "ItemId": "a110",
        "price": 200
      }
    ],
    "Subscriber": true,
    "Payment": {
      "Type": "Debit-Card",
      "Total": 400,
      "Success": true
    },
    "Note": null
  },
  {
    "Id": 102,
    "Name": "Margaret",
    "TransactionId": "tran3",
    "Transaction": [
      {
        "ItemId": "a100",
        "price": 200
      },
      {
        "ItemId": "a110",
        "price": 200
      }
    ],
    "Subscriber": true,
    "Payment": {
      "Type": "Credit-Card",
      "Total": 400,
      "Success": true
    }
  },
  {
    "Id": 103,
    "Name": "Dylan",
    "TransactionId": "tran4",
    "Transaction": [
      {
        "ItemId": "a100",
        "price": 200
      },
      {
        "ItemId": "a110",
        "price": 200
      }
    ],
    "Subscriber": true,
    "Payment": null,
    "Note": "Payment is Null"
  },
  {
    "Id": 104,
    "Name": "Oliver",
    "TransactionId": "tran5",
    "Transaction": [
      {
        "ItemId": "a100",
        "price": 200
      },
      {
        "ItemId": "a110",
        "price": 200
      }
    ],
    "Subscriber": true,
    "Payment": {
      "Total": 400,
      "Success": true
    },
    "Note": "Payment Type is missing"
  },
  {
    "Id": 105,
    "Name": "Sarah",
    "TransactionId": "tran6",
    "Transaction": [
      {
        "ItemId": "a100",
        "price": 200
      },
      {
        "ItemId": "a110",
        "price": 200
      }
    ],
    "Subscriber": true,
    "Note": "Payment is missing"
  }
]

Loading to Postgres with Python

Let’s load this into Postgres with usermanaged.transaction as the target table.

Make sure to create the table with the jsonb data type, which is more optimised for querying than plain json.

CREATE TABLE usermanaged.transaction
(
DATA jsonb
);
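jsonb also supports indexing, which plain json does not. If you later need to filter on keys or values, you could add a GIN index (the index name below is just an example):

CREATE INDEX idx_transaction_data ON usermanaged.transaction USING GIN (data);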

You can use the Python code below to load the data into the table.

import json
import psycopg2

connection_string = "dbname='<dbname>' user='<user>' host='<uri>' password='<pw>'"
table_name = "usermanaged.transaction"

def pg_insert(connection_string, table_name, json_obj):
    try:
        conn = psycopg2.connect(connection_string)
        print("Connecting to Database")
        cur = conn.cursor()

        # Start from a clean table
        cur.execute("Truncate {} Cascade;".format(table_name))
        print("Truncated {}".format(table_name))

        # Insert each JSON record as one jsonb row; the parameterised
        # query lets psycopg2 escape any quotes inside the JSON safely
        for record in json_obj:
            cur.execute("INSERT INTO {} VALUES (%s)".format(table_name),
                        (json.dumps(record),))
        conn.commit()

        print("Inserted data into {}".format(table_name))
        conn.close()
        print("DB connection closed.")
    except Exception as e:
        print('Error {}'.format(str(e)))

with open('./transaction_example.json') as f:
    f_json = json.load(f)

pg_insert(connection_string, table_name, f_json)
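A quick sanity check after the load: we inserted six JSON records, so a simple count should return 6.

SELECT count(*) FROM usermanaged.transaction;
-- returns 6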

Postgres Query

Let’s query the table to see how Postgres handles missing keys. As a reminder, -> returns a jsonb value (handy for chaining into nested objects), while ->> returns text.

SELECT
data->>'Id' AS id
,data->>'Name' AS name
,data->>'TransactionId' AS transactionid
,jsonb_array_elements(data->'Transaction')->>'ItemId' AS itemid
,jsonb_array_elements(data->'Transaction')->>'price' AS price
,data->>'Subscriber' AS subscriber
,data->'Payment'->>'Type' AS payment_type
,data->'Payment'->>'Total' AS payment_total
,data->'Payment'->>'Success' AS payment_success
,data->>'Note' AS note
,data->>'Whatever' AS non_existing
FROM usermanaged.transaction;
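As a side note, calling a set-returning function like jsonb_array_elements twice in the SELECT list works (on Postgres 10 and later the two calls are evaluated in lockstep), but the more conventional way to unnest the array is a lateral join, sketched here with a hypothetical alias t(elem):

SELECT
data->>'Id' AS id
,t.elem->>'ItemId' AS itemid
,t.elem->>'price' AS price
FROM usermanaged.transaction
CROSS JOIN LATERAL jsonb_array_elements(data->'Transaction') AS t(elem);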

Results

As you can see, the missing keys come back as null. Even a key that exists in no record at all (Whatever above) simply produces a column full of nulls. And when a key's value is JSON null, Postgres returns null for it too.
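For reference, the rows for records 104 and 105 should come out roughly like this (two rows per record, because jsonb_array_elements expands the two-element Transaction array):

 id  | name   | itemid | price | payment_type | payment_total | note                    | non_existing
-----+--------+--------+-------+--------------+---------------+-------------------------+--------------
 104 | Oliver | a100   | 200   |              | 400           | Payment Type is missing |
 104 | Oliver | a110   | 200   |              | 400           | Payment Type is missing |
 105 | Sarah  | a100   | 200   |              |               | Payment is missing      |
 105 | Sarah  | a110   | 200   |              |               | Payment is missing      |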

Good times!
