In this article I will discuss how to build a cloud-agnostic Big Data processing and storage solution that runs entirely in Kubernetes. This design avoids vendor lock-in by using only open-source technologies, choosing MinIO and Apache Spark over cloud-managed products such as Amazon S3 and Amazon EMR (Elastic MapReduce).
Imagine this: you've created a pipeline to clean your company's raw data and enrich it according to business requirements. You've documented each table and column in excruciating detail. Finally, you've built a dashboard brimming with charts and insights that tell a compelling narrative of the business's health and direction. How do you share and present your work?
This is the second article in a series on building a Big Data development environment in AWS. If you've not read the first article, you'll likely be confused. Please go read "Create a single node Hadoop cluster" first.

Setup Spark and Hive in Hadoop cluster

We've set up the storage service HDFS and the resource manager …