In this article I will discuss how to build a cloud-agnostic Big Data processing and storage solution running entirely in Kubernetes. This design avoids vendor lock-in by using only open-source technologies, favouring MinIO and Apache Spark over cloud-managed products such as Amazon S3 and Amazon Elastic MapReduce (EMR).
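To make Spark talk to MinIO instead of AWS S3, you point the S3A filesystem at the MinIO endpoint. A minimal sketch, assuming a MinIO service reachable at `minio:9000` inside the cluster; the endpoint, bucket, and credentials below are placeholders:

```properties
# spark-defaults.conf — route Spark's s3a:// paths to MinIO rather than AWS S3
# (endpoint and credentials are illustrative placeholders)
spark.hadoop.fs.s3a.endpoint           http://minio:9000
spark.hadoop.fs.s3a.access.key         minio-access-key
spark.hadoop.fs.s3a.secret.key         minio-secret-key
# MinIO serves buckets as path prefixes, not DNS subdomains
spark.hadoop.fs.s3a.path.style.access  true
spark.hadoop.fs.s3a.impl               org.apache.hadoop.fs.s3a.S3AFileSystem
```

With this in place, `spark.read.parquet("s3a://my-bucket/data/")` resolves against MinIO with no application-code changes.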
How to create a Data Lake in AWS using S3 as the storage layer, Glue as the metastore, and Trino on Kubernetes as the query engine.
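The glue (pun intended) between the three layers is a Trino catalog that uses the Hive connector with Glue as its metastore. A minimal sketch — the catalog name is illustrative:

```properties
# etc/catalog/datalake.properties — Trino Hive connector backed by AWS Glue
connector.name=hive
hive.metastore=glue
```

Trino then reads table locations (S3 paths) and schemas from Glue, so `SELECT * FROM datalake.mydb.mytable` queries Parquet or ORC files in S3 directly.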
This is the beginning of a new series centering on the use of Kubernetes to host Big Data infrastructure. In this article I will run a single-node Trino cluster in a local Kubernetes cluster called minikube.
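A minimal sketch of the setup, assuming minikube and Helm are installed; the release name is arbitrary, and `server.workers=0` is a chart value that keeps the deployment to a coordinator-only single node:

```shell
# Start a local Kubernetes cluster
minikube start

# Install the official Trino Helm chart (repo URL from the trinodb/charts project)
helm repo add trino https://trinodb.github.io/charts
helm repo update
helm install my-trino trino/trino --set server.workers=0
```

Once the pod is ready, `kubectl port-forward` exposes the Trino web UI and CLI endpoint on localhost.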
I will demonstrate how to set up a Kafka Broker on a single EC2 instance. We will first set up and configure Zookeeper and the Kafka Broker, then I will demonstrate how to create topics and publish and consume logs. Finally, I will demonstrate an example of publishing application logfiles to a Kafka topic and then consuming from the same topic.
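The whole flow can be sketched with the scripts that ship in the Kafka distribution, run from the installation directory; the topic name `app-logs` and logfile `app.log` are illustrative:

```shell
# Start Zookeeper, then the broker, as background daemons
bin/zookeeper-server-start.sh -daemon config/zookeeper.properties
bin/kafka-server-start.sh -daemon config/server.properties

# Create a topic for application logs
bin/kafka-topics.sh --create --topic app-logs \
  --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1

# Publish a logfile to the topic, then consume it back from the beginning
bin/kafka-console-producer.sh --topic app-logs \
  --bootstrap-server localhost:9092 < app.log
bin/kafka-console-consumer.sh --topic app-logs \
  --bootstrap-server localhost:9092 --from-beginning
```

A single partition with replication factor 1 is fine on one broker; in production you would raise both.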
Imagine this - you've created a pipeline to clean your company's raw data and enrich it according to business requirements. You've documented each table and column in excruciating detail. Finally you built a dashboard brimming with charts and insights which tell a compelling narrative of the business' health and direction. How do you share and present your work?
This is the second article in a series to build a Big Data development environment in AWS. If you've not read the first article, Create a single node Hadoop cluster, you'll likely be confused, so please read it first. We've set up the storage service HDFS and the resource manager … Continue reading Set up Spark and Hive for data warehousing and processing
Starting out in Data Engineering Hadoop on EC2 When I cut my teeth in Data Engineering in 2018, Apache Spark was all the rage. Spark's in-memory processing made it lightning-fast and made older frameworks such as Apache Pig obsolete. You couldn't call yourself a Data Engineer without knowing Spark. I was a fledgling Data Engineer … Continue reading Create a single node Hadoop cluster
We will discuss how to use a multi-node Presto cluster to query data in an AWS MySQL RDS instance.
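Presto reaches MySQL through its MySQL connector, configured as a catalog on every node. A minimal sketch — the RDS hostname and credentials are placeholders:

```properties
# etc/catalog/mysql.properties — Presto MySQL connector pointing at the RDS endpoint
connector.name=mysql
connection-url=jdbc:mysql://my-rds-instance.us-east-1.rds.amazonaws.com:3306
connection-user=admin
connection-password=changeme
```

After restarting the cluster, `presto-cli --catalog mysql --execute 'SHOW SCHEMAS;'` confirms the connection, and RDS tables become queryable as `mysql.<schema>.<table>`.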
What's wrong with a single node Presto cluster? In a previous post, I created a single-node Presto cluster where the coordinator and worker processes run on the same node. That's a bad idea in large clusters. Processing work on the coordinator can starve the coordinator process of resources and negatively impact scheduling work and monitoring … Continue reading How to create a multi-node Presto cluster on AWS EC2
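The fix described above is a configuration split: the coordinator refuses to schedule query work on itself, and workers point at its discovery URI. A minimal sketch, with `coordinator-host` as a placeholder:

```properties
# etc/config.properties on the coordinator — don't run worker tasks here
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://coordinator-host:8080

# etc/config.properties on each worker node:
#   coordinator=false
#   http-server.http.port=8080
#   discovery.uri=http://coordinator-host:8080
```

With `node-scheduler.include-coordinator=false`, the coordinator spends its resources purely on parsing, planning, scheduling, and monitoring.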
This is a continuation of What is the AWS Certified Solutions Architect - Professional exam. If you don't know what an AWS Certified Solutions Architect is or why you should become one, read part 1 or check out the AWS website. How did I prepare for it? The AWS Certified Solutions Architect exam asks 75 multiple-choice questions in … Continue reading How I passed the AWS Certified Solutions Architect Professional exam