In this article I will discuss how to build a cloud-agnostic Big Data processing and storage solution running entirely in Kubernetes. This design avoids vendor lock-in by using only open-source technologies and avoiding cloud-managed products such as S3 and Amazon Elastic MapReduce (EMR) in favour of MinIO and Apache Spark.
Build a Data Lake with Trino, Kubernetes, Helm, and Glue
How to create a Data Lake in AWS using S3 as the storage layer, Glue as the metastore, and Trino on Kubernetes as the query engine.
Run Trino/Presto on Minikube on AWS
This is the beginning of a new series centering on the use of Kubernetes to host Big Data infrastructure. In this article I will run a single-node Trino cluster in a local Kubernetes cluster created with minikube.
Setup a Kafka cluster on Amazon EC2
I will demonstrate how to set up a Kafka Broker on a single EC2 instance. We will first set up and configure Zookeeper and the Kafka Broker, then I will demonstrate how to create topics and how to publish and consume logs. Finally, I will demonstrate an example of publishing application logfiles to a Kafka topic and then consuming from the same topic.
Create a JupyterLab notebook for Spark
Imagine this - you've created a pipeline to clean your company's raw data and enrich it according to business requirements. You've documented each table and column in excruciating detail. Finally, you've built a dashboard brimming with charts and insights which tell a compelling narrative of the business's health and direction. How do you share and present your work?
Set up Spark and Hive for data warehousing and processing
This is the second article in a series to build a Big Data development environment in AWS. If you've not read the first article, you'll likely be confused. Please go read "Create a single node Hadoop cluster" first. Setup Spark and Hive in Hadoop cluster: We've set up the storage service HDFS and the resource manager …
Create a single node Hadoop cluster
Starting out in Data Engineering: Hadoop on EC2. When I cut my teeth in Data Engineering in 2018, Apache Spark was all the rage. Spark's in-memory processing made it lightning-fast and made older frameworks such as Apache Pig obsolete. You couldn't call yourself a Data Engineer without knowing Spark. I was a fledgling Data Engineer …
How to use Presto to query an AWS MySQL RDS instance
We will discuss how to use a multi-node Presto cluster to query data in an AWS MySQL RDS instance.
How to create a multi-node Presto cluster on AWS EC2
What's wrong with a single-node Presto cluster? In a previous post, I created a single-node Presto cluster where the coordinator and worker processes run on the same node. That's a bad idea in large clusters. Processing work on the coordinator can starve the coordinator process of resources and negatively impact scheduling work and monitoring …
How I passed the AWS Certified Solutions Architect Professional exam
This is a continuation of "What is the AWS Certified Solutions Architect - Professional exam". If you don't know what an AWS Certified Solutions Architect is or why you should become one, read part 1 or check out the AWS website. How did I prepare for it? The AWS Certified Solutions Architect exam asks 75 multiple-choice questions in …