mktintel.nl

Connecting Apache Airflow to Trino for a Development Environment

So you want to use Apache Airflow to orchestrate data operations in Trino? This guide explains how to connect Apache Airflow version 2.7.1 to Trino (the latest version as of 2024-05-12). I explain how to connect them for a personal development environment in which Trino does not ask for authentication. Never use this kind of unauthenticated connection on prod! In any case, this is your baseline for more secure authentication methods, or for having fun developing with these two open source packages. Make sure you do not expose your server ports to the Internet: that is risky, because Trino will not require authentication on HTTP connections. One way to avoid exposing them is to forward the ports to your local laptop (e.g. SSH port forwarding on AWS EC2, so that your laptop can reach the Trino and Airflow web interfaces).

Prerequisites: Airflow installed; Trino installed. Airflow is already configured to have the TrinoOperator. Trino already works for selecting data, etc.

I only cover the difficult parts here, which are: the Airflow connection configuration (you have to reference the connection name in your TrinoOperator); the Trino config.properties that allows non-authenticated connections; and the Airflow DAG TrinoOperator code. For everything else, such as how to configure Airflow to have the TrinoOperator, search elsewhere, as it is easy to find and follow.

Airflow connection definition

The connection below won't work if you don't set the login to trino, even though it is not really authenticated. I describe the connection in some detail so that search engines pick it up: the Airflow connection is of type Trino. The host is localhost, as this is a dev environment where Airflow runs next to Trino on the same host. The port is 8081 because Airflow already takes port 8080, so I had to bring my Trino up on port 8081. The extra JSON is: { "protocol": "http", "verify": "false", "auth": "None" }.
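
The same connection can also be supplied through an environment variable instead of the Airflow UI. The sketch below is a minimal illustration, not the only way to do it. It assumes Airflow 2.3 or later, which accepts JSON-serialized connections in `AIRFLOW_CONN_<CONN_ID>` variables, and it reuses the connection id from the TrinoOperator example in this article. The helper name `trino_dev_connection_env` is made up for this example.

```python
from __future__ import annotations

import json

# Connection id referenced by trino_conn_id in the TrinoOperator.
CONN_ID = "trino_connection_http_verify_false_auth_none"


def trino_dev_connection_env(conn_id: str = CONN_ID) -> tuple[str, str]:
    """Return (env var name, JSON value) describing the dev connection."""
    connection = {
        "conn_type": "trino",
        "login": "trino",     # must be set, even though nothing is authenticated
        "host": "localhost",  # Airflow and Trino share one host
        "port": 8081,         # Airflow already occupies 8080
        "extra": {"protocol": "http", "verify": "false", "auth": "None"},
    }
    return f"AIRFLOW_CONN_{conn_id.upper()}", json.dumps(connection)


if __name__ == "__main__":
    name, value = trino_dev_connection_env()
    # Paste the printed line into the shell that starts Airflow.
    print(f"export {name}='{value}'")
```

Connections defined this way are read from the environment when a task runs, so they are not expected to show up in the connection list of the Airflow UI.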

Airflow TrinoOperator example Python code (you should know how to build DAGs)

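from airflow.providers.trino.operators.trino import TrinoOperator

# Runs the templated SQL file on Trino through the Airflow connection described above.
move_data_trino = TrinoOperator(
    task_id="move_data_trino",
    trino_conn_id="trino_connection_http_verify_false_auth_none",
    sql="templates/trino_data_move_sensor_data_to_logs.sql",
    handler=list,
)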

Trino config.properties

# Single node install config
coordinator=true
node-scheduler.include-coordinator=true
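# Airflow already takes port 8080, so Trino listens on 8081 (the port used in the Airflow connection).
http-server.http.port=8081
discovery.uri=http://localhost:8081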

# This is not recommended on prod.
http-server.authentication.allow-insecure-over-http=true

# The whole block below worked to enable SSL with a self-signed certificate.
# On 2024-05-12 I changed the authentication type below to None and Trino failed to restart.
#http-server.authentication.type=CERTIFICATE
# Setting the type to None does not work: Trino fails to restart.
#http-server.authentication.type=None
#
# This is required when authentication is enabled.
#internal-communication.shared-secret=blablablablablaieieiei
#http-server.https.included-cipher=TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_128_CBC_SHA256
#http-server.https.excluded-cipher=
#http-server.https.enabled=true
#http-server.https.port=8443
#http-server.https.keystore.path=etc/server.pem
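
Before wiring up the DAG, it can help to confirm that Trino really answers over plain HTTP on the port given in the Airflow connection. The sketch below is one possible check, under stated assumptions: that the coordinator serves a `/v1/info` endpoint whose JSON response carries a `starting` flag, and that the host and port are the ones from the connection described in this article. The function names are made up for this example.

```python
import json
import urllib.request


def trino_info_url(host: str = "localhost", port: int = 8081) -> str:
    """Build the URL of Trino's server-info endpoint over plain HTTP."""
    return f"http://{host}:{port}/v1/info"


def trino_is_ready(host: str = "localhost", port: int = 8081, timeout: float = 5.0) -> bool:
    """Return True when the coordinator answers and reports that startup is finished."""
    try:
        with urllib.request.urlopen(trino_info_url(host, port), timeout=timeout) as response:
            info = json.load(response)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a body that is not JSON.
        return False
    # Assumption: the response contains a boolean "starting" field.
    return not info.get("starting", True)


if __name__ == "__main__":
    print("Trino ready:", trino_is_ready())
```

If this prints False while Trino is running, the usual suspects are a port mismatch between config.properties and the Airflow connection, or a port forward that is not active.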