Welcome to MyDatahack!
My passions are Programming, Data Engineering, Data Science, Mathematics, Databases, Data Warehousing, Business Intelligence, and IT Infrastructure and Architecture. I consider myself a full-stack data specialist with skills and experience in infrastructure design and setup, data integration, DWH and BI development, DBA work, big data engineering, and data science application development and deployment. My favourite tools are Java, Python, R, Spark, Node.js, JavaScript, DataStage, Informatica, Talend, AWS, Linux, and all the relational and non-relational databases.
I love writing about things that I think are cool. MyDatahack is a collection of what is cool to me. I also love sharing my knowledge and helping others, so you will find many practical examples of code and solutions that work. If any of my posts helps you solve a problem, I am super happy.
Please leave comments if you have any feedback or questions. I would love to bounce ideas off everyone! You can see the latest posts in each category below, or search by keyword if you are looking for something specific.
Enjoy!
In the previous post, we used grid search to find the best hyperparameters for a neural network model with R's caret package. Here, let's use Python and the scikit-learn package to optimise a neural network model. Just like the caret package, scikit-learn has a pre-built function for hyperparameter search. …
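As a small, hedged illustration of that pattern (the dataset and parameter grid below are placeholders, not necessarily the ones from the post), scikit-learn's GridSearchCV can wrap a neural network like this:

# Grid search over neural network hyperparameters with scikit-learn.
# Dataset and grid values are illustrative only.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)

param_grid = {
    'hidden_layer_sizes': [(5,), (10,), (10, 5)],
    'alpha': [1e-4, 1e-3, 1e-2],
}

# cv=5 runs five-fold cross-validation over every combination in the grid.
search = GridSearchCV(MLPClassifier(max_iter=2000, random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)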
Once you finish training the model and are happy with it, you may need to consider saving the model. Otherwise, you will lose it once you close the session. The model you create in an R session is not persistent; it exists only in memory temporarily. Most of the time, …
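The post itself covers R, where saveRDS() and readRDS() are the usual route. As a loose analogy only, in Python (used for the other sketches on this page) the same idea of persisting a trained model to disk looks like this, with the standard-library pickle module and an illustrative model and file name:

# Persist a trained model so it survives the session (Python analogy).
import pickle
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)        # save the model to disk

with open('model.pkl', 'rb') as f:
    restored = pickle.load(f)    # load it back in a later session
print(restored.predict(X[:3]))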
Writing code to do machine learning is easy. What makes it difficult is the optimisation. By and large, there are two ways to optimise your model: feature selection & transformation, and model parameter optimisation. Both are hard-core topics and neither can be fully covered in this post. Feature selection and transformation often require …
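As a quick sketch of the first of those two routes (the pipeline below is illustrative, not the post's own), scaling features and keeping only the strongest ones can be chained in scikit-learn:

# Feature transformation (scaling) plus feature selection in one pipeline.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = Pipeline([
    ('scale', StandardScaler()),               # transformation
    ('select', SelectKBest(f_classif, k=2)),   # feature selection
    ('model', LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
print(pipe.score(X, y))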
How to get data from MongoDB with Python
MongoDB is one of the most popular NoSQL databases, used as a backend database for web and mobile applications. Data is stored in MongoDB as BSON, a binary format that looks like JSON. Once you understand the way MongoDB stores data, all you …
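A minimal pymongo sketch of that idea (the connection string, database, collection and query are all placeholders):

# Querying MongoDB from Python with pymongo; names are placeholders.
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
collection = client['mydatabase']['mycollection']

# find() returns a cursor; each BSON document comes back as a Python dict.
for doc in collection.find({'status': 'active'}).limit(5):
    print(doc)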
Apache Spark is a powerful framework that utilises cluster computing for data processing, streaming and machine learning. Its native language is Scala, and it also has multi-language support for Python, Java and R. Spark is easy to use and considerably faster than MapReduce. For example, you can write Spark on the Hadoop …
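For a flavour of the Python API (PySpark), here is a minimal, hedged sketch; the file path and column name are placeholders:

# Minimal PySpark job: read a CSV into a DataFrame and aggregate it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('example').getOrCreate()
df = spark.read.csv('data.csv', header=True, inferSchema=True)
df.groupBy('category').count().show()
spark.stop()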
I hate 404 errors. Whenever I get one, I stare at the screen with disdain because most of the time it cannot be fixed. What cannot be found cannot be found. But fixing this one is easy! By default, Anaconda serves notebooks from the local directory C:\ on Windows. To resolve …
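The excerpt above cuts off before the fix, but one common approach (a sketch only; the path is a placeholder) is to point Jupyter at the folder you actually use via its config file:

# Generate the config file first (creates ~/.jupyter/jupyter_notebook_config.py):
#   jupyter notebook --generate-config
# Then set the notebook directory in that file; the path is a placeholder.
c.NotebookApp.notebook_dir = 'C:/Users/yourname/notebooks'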
Joiner is the transformation for joining tables in Informatica Cloud (see a quick introduction to the Joiner Transformation here). If you have a large volume of data, the joiner transformation becomes very slow without performance optimisation. In this post, we will show you a few tricks that you can use to …
By default, the secure agent can run 2 data synchronisation tasks at a time. This constraint can quickly become limiting, especially when multiple developers are building and testing data synchronisation tasks at the same time. By adding a custom property on the secure agent, you can run more than …
Informatica does not have a dedicated Postgres database connector, so we need to use the ODBC connector. In this post, I will discuss how to configure Postgres ODBC on both Linux and Windows servers for the Informatica Cloud ODBC connector.
Linux Server (Red Hat)
There are a few instructions, but …
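As a hedged sketch of what the Linux side typically involves (the DSN name, driver path, host and database below are placeholders, and the driver path depends on where your distribution installs psqlODBC), an odbc.ini entry for unixODBC might look like:

# /etc/odbc.ini entry for the psqlODBC driver; all values are placeholders.
[PostgresDSN]
Driver = /usr/lib64/psqlodbcw.so
Servername = dbhost.example.com
Port = 5432
Database = mydb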
In modern web development, we try to avoid building a website from scratch if we can. Websites are usually built on top of a platform like Sitecore, Drupal, WordPress and so on. When I first thought about creating this blog, I tried to code everything from scratch, going against the …