ETL Developers are Dead

I remember when I was a DataStage developer, circa 2014. All I did was make DataStage jobs. I was working on an enterprise data warehouse; the company also used DataStage for batch integration between systems, so I maintained those jobs as well. It was my first ETL development job and I didn’t know anything better. I was a happy ETL developer. DataStage is pretty intense, and you really need to specialise in the tool. Moving a table from one database to another takes a serious development effort. I remember spending days just figuring out character-set mismatches or datetime transformations between databases in order to copy a table.

Making a data mart with DataStage is even more intense. I needed to learn the tool really well to be effective. Back then, I had a lot of fun playing with it. It was OK for me to work only on DataStage because the company just had a traditional data warehouse that crunched data overnight.

Although traditional data warehouse development with an ETL tool is still important, it’s not the only job you need to do as a developer in a data team. You are also required to handle real-time ingestion, API data ingestion, big data, and transforming data for ML or AI applications. That requires more skills than ETL alone, and you cannot afford to use only an ETL tool. Nowadays, you need to be a data engineer who can use different tools to ingest and transform data in many different ways.

Because of these diverse data ingestion and transformation requirements, it is better for a company to choose a lightweight tool for ETL. There are heaps of ETL tools that copy tables with a few clicks. We don’t really need a complex and expensive tool like DataStage; we can use more affordable options. If you are building a data warehouse in a cloud environment, it is worth looking into the native tooling that comes with that environment.
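To make the contrast concrete, here is a minimal sketch of what "lightweight" table copying can look like today, using nothing but Python's standard library and SQLite. The function name and table names are made up for illustration; a real pipeline would point at your actual source and target databases, but the point stands: the kind of table copy that took days of DataStage work can often be a few lines of code.

```python
import sqlite3

def copy_table(source_path: str, target_path: str, table: str) -> None:
    """Copy one table from a source SQLite database to a target one.

    A hypothetical, illustrative helper -- not a production ETL job.
    """
    src = sqlite3.connect(source_path)
    dst = sqlite3.connect(target_path)
    try:
        # Recreate the table on the target using the source's own DDL,
        # so we don't have to hand-map column types between databases.
        ddl = src.execute(
            "SELECT sql FROM sqlite_master WHERE type='table' AND name=?",
            (table,),
        ).fetchone()[0]
        dst.execute(f"DROP TABLE IF EXISTS {table}")
        dst.execute(ddl)

        # Stream the rows across in one pass.
        rows = src.execute(f"SELECT * FROM {table}")
        placeholders = ",".join("?" * len(rows.description))
        dst.executemany(
            f"INSERT INTO {table} VALUES ({placeholders})", rows
        )
        dst.commit()
    finally:
        src.close()
        dst.close()
```

Cloud warehouses and managed ingestion services reduce even this to configuration, which is exactly why the specialised-ETL-tool skill set matters less than it used to.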

We have to say goodbye to the good old days of ETL specialists. ETL developers are dead.
