2020-11-13
Deploying Starburst on GCP with Hive, Storage and Postgres connectors.
Hello folks, grski here. Today we are in for some Ops/Big Data fun instead of the usual plain old Python. Why is that?
For the past year and a half, I've immersed myself in work related mostly to designing and implementing APIs, with some Product Ownership/Management/Mentoring on the side. The challenge in the projects/products I've worked on was mostly understanding the business context and doing client-facing requirements work. Technologically speaking, nothing advanced usually, or should I say: nothing interesting. Plain old Django + DRF, both of which are amazing, but you know. Stuff gets boring, especially on a small scale. So while I've grown my manager/product-owner knowledge, my tech skills have stagnated.
I've noticed that and decided to change it, so STUFF ISN'T SO BORING ANYMORE, GOSH. The text you are reading is the result of this - me trying out new stuff and learning things completely outside my usual specialization/comfort zone. Without further ado, let's get on with the technical stuff, but first a small note: the solutions shown in this article are probably not perfect - almost certainly not. I'm a newbie in this topic who just sat down to it today and started doing things. This should NOT be an inspiration for anything production-related. Also, I've simplified some things, as the target audience for this article is greenhorns like me or even non-technical people, so bear with me.
Data is monnies
So what's this Starburst Enterprise for Presto thing? Why is it important?
Nowadays we basically live in the age of data. Data == money quite often. Most of us want more monnies, right? So do all the different companies around the globe. Companies that often have loads of data they don't know how to use. Okay, data == money, but just selling it is the fool's way. Sometimes the better approach is to do some number crunching on that data, do some analytics, gather insights and then act on them. This is where the potential to make a killing lies, this is where the miracle happens. Stuff like influencing the US elections, voting certain people out, predicting the outbreak of a pandemic or how it'll spread. What ties these things together? Data.
Okay, so now we know that data is very important and all, right? Right. Great. Now think about it - all of that data must be stored somewhere and it indeed is. Usually in some kind of a database.
Where does Presto fit into this?
The de-facto standard in the industry for dealing with that data, understanding it, running some queries, is SQL. It's kind of a language, you could say, a language that databases understand and that tells them what to do with the data we have. Almost everyone in the data world knows it, it's not THAT hard, and it's been with us for years, so it's battle-tested. It's magnificent. So far so good, stuff is nice and easy.
Here comes the boom though. SQL and relational databases are not the only thing out there, nor should they be. They are good in certain use cases, not so much in others. Let's call these our other data sources. In some applications you'll find dozens of these data source types, some of which do not understand SQL at all, making it way harder to process them together with traditional DB data, to gather insights and so on.
Here comes this PrestoSQL (now known as Trino) thing though. Ah, btw - it was initially developed by Facebook. What is it? Presto is a layer of abstraction unifying all these data sources. It lets you run SQL queries on almost any type of data source. It takes a lot of hassle away from the developer, making things easier. It's also designed with scale in mind, which means that handling loads of data won't be an issue. How much is loads? Well, basically petabytes or exabytes. Which is a lot. A lot lot. Now, thanks to Presto, you can query all these different data sources with SQL and scale your setup to suit your needs; whether you need to query gigabytes, terabytes or petabytes, all of that is a breeze.
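To make the "one SQL over many data sources" idea a bit more concrete, here's a purely illustrative query. The catalog and table names (hive.default.orders, postgresql.public.customers) are made up for this example, they're not anything we deploy later:

-- join data sitting in object storage (via Hive) with data living in Postgres, in one query
SELECT c.name, count(*) AS orders_count
FROM hive.default.orders o
JOIN postgresql.public.customers c ON o.customer_id = c.id
GROUP BY c.name
ORDER BY orders_count DESC
LIMIT 10;

Presto figures out which connector to fetch each table from; you just write SQL.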
And Starburst?
The part about Presto is clear, but what about Starburst? Well, basically it's a company that specializes in providing solutions built on the Presto engine, striving to be the best at it. They also provide a product with the same name, which is something like a bundle of cool packages for Presto: more connectors for new data sources, improvements to existing ones, better performance. In simple terms, this product is Presto on steroids, plus support if you need it. They have a couple of different "versions" of this software; in this article we will be going through the Starburst Enterprise setup.
Kubernetes?
This one you just have to go and google yourself.
Let's get our hands dirty
So, let's start then. We will begin by setting up the project on GCP and all of that.
Except we won't. I won't bore you with the details of how to download the CLI, register at GCP and initialize your CLI. Google has magnificent docs regarding this, so help yourself (a rough sketch of the gcloud bits follows the list). I expect that you'll have:
- A project created in your Google Cloud Platform console.
- The gcloud CLI installed
- kubectl added to gcloud
- Project ID set in the CLI config
- Region set in the CLI
- Proper services enabled in Google Cloud.
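For reference, the gcloud part of that checklist looks more or less like this. Treat it as a hedged sketch rather than gospel: the project ID, region and zone are placeholders, and the exact services you need enabled may differ:

# install kubectl as a gcloud component
gcloud components install kubectl
# point the CLI at your project and preferred location
gcloud config set project <your project id>
gcloud config set compute/region <your region>
gcloud config set compute/zone <your zone>
# enable the Kubernetes Engine API
gcloud services enable container.googleapis.com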
Now, once you have all that done, we can roll with our cluster. To do that you need to:
gcloud container clusters create starburst
It'll probably take a few minutes to create the cluster. After it's done, you can see what was deployed, e.g. using:
gcloud container node-pools list
gcloud compute instances list
The first command will list all of your node-pools, which are like a group for the machines you'll have. It's a very simplified description, but bear with me or go learn k8s. As for the second one, it lists all the instances, so the machines/computers that your cluster will run on.
Okay, our cluster is more or less running. What to do now? Head over here to Starburst's docs and download the files listed at the beginning of the linked page. After that navigate to the place where you have these files and apply these configs to our k8s cluster. How?
kubectl apply -f service_account.yaml
kubectl apply -f role.yaml
kubectl apply -f role_binding.yaml
kubectl apply -f presto_v1_crd.yaml
kubectl apply -f operator.yaml
kubectl apply -f example_presto_v1_cr.yaml
After all of that is done, try:
kubectl get pods
which should get you all of your pods.
NAME READY STATUS RESTARTS AGE
pod/hive-metastore-example-presto-8fc4787d8-phg8d 0/1 Pending 0 27m
pod/hive-postgresql-example-presto-5694696897-6whjr 0/1 Pending 0 27m
pod/presto-coordinator-example-presto-798cb57c7-mrfx9 2/2 Running 0 27m
pod/presto-operator-549d58bd9f-9wrgd 1/1 Running 0 27m
pod/presto-worker-example-presto-6dc67485f-czmbb 1/1 Running 0 27m
pod/presto-worker-example-presto-6dc67485f-dwfzd 1/1 Running 0 27m
pod/presto-worker-example-presto-6dc67485f-qst95 1/1 Running 0 27m
You should see a couple of pods running; some of them might have the status of PENDING. For now, that's all fine.
The world is yours to take
So our Presto cluster is more or less running. It'd be good to access it though, right? A web UI comes bundled with Starburst Enterprise; it exposes itself on port 8080, so all we have to do is expose it to the world. How to do that? Well, k8s has a solution for it: stuff like LoadBalancer services and Ingress.
What are they exactly? Just a smart way of naming the service that faces the world and routes requests to the proper resources. More or less. How should we do it? The k8s docs tell us how. Considering that my k8s knowledge is, first of all, very very shallow and, second of all, rusty as hell, since I last touched it 1.5 yrs ago, I tried doing it the naive way and created a file called lb.yml inside the same directory as my other deployment files, with the following contents:
apiVersion: v1
kind: Service
metadata:
  name: loadbalancer
spec:
  selector:
    app: loadbalancer
  ports:
    - port: 8080
      targetPort: 8080
  type: LoadBalancer
If you know k8s even slightly, you should see what's wrong with this approach. Anyway. I applied this deployment and the load balancer service came up properly, so far so good. I checked it using:
kubectl get services
You need to wait till your LB gets the EXTERNAL-IP column assigned. The IP you see there is the one you should try connecting to. And I tried just that: connecting to <external ip of my lb>:8080 in my browser. That didn't work, I got an error right away. Weird. Let's try connecting to just the IP, without the port. That behaved differently. How? The connection was made, but it just hung till timeout. Interesting. That suggested to me that the LB was working correctly overall, but the config was wrong: the service tried to map our request to a resource that didn't exist. The part with selector.app: loadbalancer was wrong. I had to know the name of the resource I needed to point to. But which one to pick? I have a couple of pods already. Hm. Let's be smart here and do:
kubectl get all
This will list all of our resources, in my case:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/example-presto NodePort 10.3.253.30 <none> 8080:31234/TCP 27m
service/hive-metastore-example-presto ClusterIP 10.3.244.139 <none> 9083/TCP 27m
service/hive-postgresql-example-presto ClusterIP 10.3.254.181 <none> 5432/TCP 27m
service/kubernetes ClusterIP 10.3.240.1 <none> 443/TCP 36m
service/presto-operator-metrics ClusterIP 10.3.241.122 <none> 8686/TCP,8383/TCP 27m
service/prometheus-coordinator-example-presto ClusterIP 10.3.254.83 <none> 8081/TCP 27m
service/prometheus-worker-example-presto ClusterIP 10.3.248.236 <none> 8081/TCP 27m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/hive-metastore-example-presto 0/1 1 0 27m
deployment.apps/hive-postgresql-example-presto 0/1 1 0 27m
deployment.apps/presto-coordinator-example-presto 1/1 1 1 27m
deployment.apps/presto-operator 1/1 1 1 27m
deployment.apps/presto-worker-example-presto 3/3 3 3 27m
NAME DESIRED CURRENT READY AGE
replicaset.apps/hive-metastore-example-presto-8fc4787d8 1 1 0 27m
replicaset.apps/hive-postgresql-example-presto-5694696897 1 1 0 27m
replicaset.apps/presto-coordinator-example-presto-798cb57c7 1 1 1 27m
replicaset.apps/presto-operator-549d58bd9f 1 1 1 27m
replicaset.apps/presto-worker-example-presto-6dc67485f 3 3 3 27m
Hmmmm. The UI was supposed to be exposed on port 8080, riiight? Well, if you look through this output, you'll spot:
service/example-presto NodePort 10.3.253.30 <none> 8080:31234/TCP 27m
Gotcha. I now know the proper name of the resource I need to refer to, time to fix this and also get rid of the file holding the config - deploy the lb using just a command. Why? To try another way and learn more basically!
But before I did that, maybe you'd like to try and kill/remove our previous load-balancer service? You don't strictly need to, since the new one will get a different name, but try it anyway. How? A hedged sketch follows, and right after it we'll bring up the proper LB.
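Assuming the old service still carries the name loadbalancer from the manifest above, removing it should be as simple as:

kubectl delete service loadbalancer

Re-run kubectl get services afterwards to confirm it's gone. With that out of the way, bring up the proper LB like this: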
kubectl expose service/example-presto --port=8080 --target-port=8080 --name=load-balancer --type=LoadBalancer
Now, if you go to <external ip of my lb>:8080, magic will happen.
Let's do some queries
Okay, our Presto cluster is properly exposed, everything is fine and dandy. Let's connect to the cluster, maybe create some tables or query Hive. Let's go. How to do that? Install the Presto CLI. How? See Installing CLI for Starburst Enterprise Presto - I won't get into that, as the docs are enough. Now, let's run:
presto --server <external ip of my lb>:8080 --user test
and we see:
presto>
SUCCESS! What to do now? Let's maybe query Hive a bit? Just to check if it's working properly. Let's start by typing this command in the Presto console:
USE hive.default;
In my case it caused an error. If you head over to the UI wrapper for your Presto and filter for failed queries, you'll be able to get more details about why it failed. In my case it was a connectivity error, as if it couldn't connect or there was nothing to connect to. Hm... Remember that PENDING status we got after running kubectl get pods? Here lies the issue. Both Hive's metastore and its internal postgres were still pending. Something's wrong.
Getting logs with kubectl logs <pod name> won't work, since the pod hasn't even started yet, so it won't return anything insightful, but describing the pod should. How do those two differ? kubectl logs shows what the pod's containers print out while running, while kubectl describe pod tells us more about the pod's configuration, its events and stuff like that. Let's do it then.
kubectl describe pod <name of the pod with hive>
What did it return? Something along the lines of:
...
Warning FailedScheduling 43s (x2 over 43s) default-scheduler 0/3 nodes are available: 3 Insufficient cpu.
Warning FailedScheduling 2m30s (x7 over 9m31s) default-scheduler 0/3 nodes are available: 3 Insufficient cpu.
Warning FailedScheduling 97s (x6 over 116s) default-scheduler 0/4 nodes are available: 4 Insufficient cpu.
...
Everything became clear.
Plan your resources smartly
As you can see, our pods with hive and postgres couldn't get enough resources, therefore they didn't start. What to do? Well, my initial idea was to just add more machines, i.e. scale up the number of nodes. How to do that? You can read about that in the Google Cloud docs, or peek at the rough sketch below. I increased the number of my nodes to 5, just in case; by default I had 3.
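Resizing should look roughly like this. A minimal sketch, assuming the cluster is still called starburst and you're growing its default pool (GKE names it default-pool):

gcloud container clusters resize starburst --node-pool=default-pool --num-nodes=5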
Let's see now. kubectl get pods -> bang! The Hive metastore started properly, but postgres... STILL PENDING, although in theory the node-pool still had lots of free resources. What to do? Let's dig with kubectl describe pod <name of the pod with postgres for hive> again.
...
Requests:
cpu: 2
memory: 2Gi
...
Warning FailedScheduling 30s default-scheduler 0/4 nodes are available: 1 Insufficient memory, 4 Insufficient cpu.
...
This part of the config got my attention. Why? Because I checked the machines on which our nodes run. They were of type n1-standard-1, which means each of them had 1 vCPU and 3.75 GB of RAM. Now, it can be quite hard to schedule a pod that requests 2 vCPUs and 2 Gi of RAM on a machine with just 1 vCPU. How do we deal with this? Well, we need to resize our instances. I have no idea if there are more sophisticated ways of doing that, but I settled for just deleting the old node pool and creating a new one with bigger machines.
First - get the name of your node-pool
gcloud container node-pools list
and then delete your old one, while also creating a new one:
gcloud container node-pools delete <your node pool name>
# after it's done then create a new one:
gcloud container node-pools create starburst-pool --machine-type=n1-standard-4 --num-nodes=3
This will create a node-pool with the name starburst-pool, running on machine type n1-standard-4, with 3 nodes, so 3 machines more or less. For the example deployment you can probably roll with just 2 nodes to save money, but I'd rather go with 3. Let's now see if our pods are running correctly.
NAME READY STATUS RESTARTS AGE
hive-metastore-example-presto-75bcb5954b-vgcpj 1/1 Running 0 20m
hive-postgresql-example-presto-5694696897-bgshb 1/1 Running 0 21m
presto-coordinator-example-presto-7688499cb8-pxvbx 2/2 Running 0 20m
presto-operator-549d58bd9f-mbgn6 1/1 Running 0 20m
presto-worker-example-presto-784f5db44-clxgp 1/1 Running 0 20m
presto-worker-example-presto-784f5db44-gxfnr 1/1 Running 0 20m
presto-worker-example-presto-784f5db44-wz98m 1/1 Running 0 20m
EVERYTHING IS WORKING, YAY!
Try the initial query that we did a while ago. It should work now. If it does, congrats. Your Hive with its internal postgres is running correctly.
External sources of data
Okay. We have the cluster set up finally. Everything is perfect. Except for one thing. We have no data to run queries against. Which is sad. Let's maybe load some. Why not?
How can we do that? Lots of ways, but let's go with one that's quite common: reading an ORC file from an object storage service. What is this? ORC is a columnar file format developed for big data, basically for performance reasons and so on - you don't have to bother yourself too much with it. Object storage is something like Amazon's S3 or Google Cloud Platform's Google Cloud Storage, which is basically a fancy way of saying it's kind of a hard drive, but in the cloud.
I decided to roll with GCP's solution as to have everything in one place, plus I know S3 already, so let's try something new.
How to create a bucket? It's easy, google it. Then do the same for the information on how to create a service key, or peek at the hedged sketch below. Just watch out not to create a public bucket, as then all your files will be available on the internet. Also, try to limit the access you give to your service key. Research these two topics more on your own.
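If you prefer the terminal, something roughly like this should work. It's only a sketch: the bucket and service-account names are made up, and roles/storage.objectAdmin is broader than strictly necessary, so narrow it down if you can:

# create a bucket (bucket names are global, pick your own)
gsutil mb -l <your region> gs://<your bucket name>
# create a service account and let it access storage objects
gcloud iam service-accounts create starburst-gcs
gcloud projects add-iam-policy-binding <your project id> --member="serviceAccount:starburst-gcs@<your project id>.iam.gserviceaccount.com" --role="roles/storage.objectAdmin"
# generate and download the key as a .json file
gcloud iam service-accounts keys create gcs-key.json --iam-account=starburst-gcs@<your project id>.iam.gserviceaccount.com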
Once you have a service key generated and its .json file downloaded, great. Let's proceed.
Object Storage Service - GCS
Do you remember the file example_presto_v1_cr.yaml that we downloaded before? Open it in some kind of text editor/IDE and find the hive section. In my case it looks like this:
hive:
  internalMetastore:
    image:
      pullPolicy: Always
    internalPostgreSql:
      enabled: true
      memory: 0.5Gi
      cpu: 0.5
What we need to do here is allow Hive to somehow authorize with GCP. How? Remember the .json service key file we downloaded? Good. Move it to the same directory as example_presto_v1_cr.yaml and name it gcs-key.json. Then edit the deployment to look like this:
hive:
  gcs:
    json-key-file-path: "./gcs-key.json"
  internalMetastore:
    image:
      pullPolicy: Always
    internalPostgreSql:
      enabled: true
      memory: 0.5Gi
      cpu: 0.5
After that:
kubectl apply -f example_presto_v1_cr.yaml
Bang, done. Now Hive will be able to authenticate properly, provided you've set up the service access correctly, which I think you did. But wait a second. If we want to read in some data, we need the data, and we don't have it.
There are two solutions here: find some example ORC file on the internet and just use that, or prepare your own random set of test data. What are we going to do? The latter, of course! How? With Python. Let's first install the one package we will need:
pip install pyorc
Use a virtualenv if you want to, I didn't (a quick sketch below if you do).
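If you do go the virtualenv route, it's the standard routine; a sketch assuming a Unix-like shell:

python -m venv .venv
source .venv/bin/activate
pip install pyorc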
Now, a simple script shall do:
# filename: generate_orc.py
from random import choice

import pyorc

first_names = ("Jacob", "Mat", "Demon", "Lucifer", "Jaroslaw", "Kunaal", "Rajan", "Taro")
last_names = ("Sasin", "Smith", "Test", "Apple", "Leaf", "Shitsu", "Kowalski", "Górski")

# write 1000 rows of (id, first_name, last_name) into an ORC file
with open("./users.orc", "wb") as data:
    with pyorc.Writer(data, "struct<col0:int,col1:string,col2:string>") as writer:
        for i in range(1000):
            writer.write((i + 1, choice(first_names), choice(last_names)))
This will generate a list of 1000 users with ids, first names and last names. It's a very simple example, but it will do. Now run it with:
python generate_orc.py
and bam, there we go. You should see a new file in your dir called users.orc. Now go to your bucket, create a directory there called import and upload the file there (a quick sketch of how below).
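You can upload through the console UI or, roughly, with gsutil; the bucket name is a placeholder:

gsutil cp users.orc gs://<your bucket name>/import/users.orc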
Now, in order to import this file, we need to enter the Presto shell again. If you exited it before, here's a little reminder of how to reach it:
presto --server <external ip of my lb>:8080 --user test
once you are there do:
USE hive.default;
CREATE TABLE people (id bigint, first_name varchar(60), last_name varchar(60)) WITH (external_location = 'gs://{your bucket name}/import/', format='ORC');
This will make Hive point at that bucket location and treat ALL the files there as the data backing the newly created table people.
Done. Now try:
SELECT * FROM hive.default.people WHERE id < 10;
to print out the rows with an id lower than 10. Does it work? If so, congrats.
Nice. We can now query stuff from ORC files stored in an object storage service. What about getting data from a more popular, more normal data source, like a regular DB, e.g. postgres? Let's do that.
Postgres
How can we query postgres from inside our Presto cluster? First of all, we need to have such a DB. For the purpose of this article, you can just provision a managed DB on GCP. How to do that? Google it again, or peek at the rough sketch below. Just remember to authorize your cluster's IP address to access the DB, or use 0.0.0.0/0 to allow ANY IP to access it.
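Provisioning a small Cloud SQL Postgres instance from the terminal could look roughly like this. It's a hedged sketch: the instance name, database name, password and version are placeholders I picked for the example, and opening the instance to 0.0.0.0/0 is only acceptable for a throwaway test:

# create a tiny managed Postgres instance
gcloud sql instances create starburst-pg --database-version=POSTGRES_12 --tier=db-f1-micro --region=<your region>
# set the password for the default postgres user
gcloud sql users set-password postgres --instance=starburst-pg --password=very-secure
# create a database and (testing only!) allow connections from anywhere
gcloud sql databases create demo --instance=starburst-pg
gcloud sql instances patch starburst-pg --authorized-networks=0.0.0.0/0

With the database up, what's next?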
Go back to the same file that you used to add the Hive GCS connector. Open it and find the section with additionalCatalogs. Configure it more or less like this:
additionalCatalogs:
  postgresql: |
    connector.name=postgresql
    connection-url=jdbc:postgresql://<your ip>:5432/<db name>
    connection-user=postgres
    connection-password=very-secure
Now, the first issue here: A PASSWORD STORED IN PLAIN TEXT, OMG OMG. Usually I'd use secrets to manage it, as k8s has a mechanism for that (a tiny sketch below), but this ain't no production setup guide. Plus, if someone has access to the repo with your deployment code, it's usually too late for security anyway, but I get your worry. This should NOT be done like that in real-life applications.
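Just to show what the first step of the secrets route would look like - a minimal sketch only, the secret isn't wired into the Starburst CR here and the name postgres-credentials is made up:

kubectl create secret generic postgres-credentials --from-literal=password=very-secure

You'd then reference that secret from your deployment config instead of pasting the password inline.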
Okay. You added that, what now? Apply the changes with the same command as in the Hive case, the thing with kubectl apply -f example_presto_v1_cr.yaml.
Open your Presto console and try to do something, e.g. list the available schemas:
SHOW SCHEMAS FROM postgresql;
If that works for you, you are officially done. You can now query all you want. Your queries will be nicely summed up on the UI, you can see the details there.
Summary
Well, we are more or less done. That was a fun exploration for me, as I did not have any experience with either Presto, Starburst or GCP. Had a blast overall, which is nice, especially in these daunting times. Anyway.
We've barely scratched the surface of what Presto can do, but there's so much more to try.
This is all for today's episode of grski's ramblings!