Sadayuki Furuhashi
Founder & Software Architect
Treasure Data, inc.
internals
PostgreSQL protocol gateway for Presto
A little about me...
> Sadayuki Furuhashi
> github/twitter: @frsyuki
> Treasure Data, Inc.
> Founder & Software Architect
...
Today’s talk
1. What’s Presto?
2. Prestogres design
3. Prestogres implementation
4. Prestogres hacks
5. Prestogres future ...
1. What’s Presto?
What’s Presto?
A distributed SQL query engine

for interactive data analisys

against GBs to PBs of data.
What’s the problems to solve?
> We couldn’t visualize data in HDFS directly using
dashboards or BI tools
> because Hive is...
HDFS
Hive
PostgreSQL, etc.
Daily/Hourly Batch
Interactive query
Commercial

BI Tools
Batch analysis platform Visualization...
HDFS
Hive
PostgreSQL, etc.
Daily/Hourly Batch
Interactive query
✓ Less scalable
✓ Extra cost
Commercial

BI Tools
Dashboar...
HDFS
Hive Dashboard
Presto
PostgreSQL, etc.
Daily/Hourly Batch
HDFS
Hive
Dashboard
Daily/Hourly Batch
Interactive query
In...
Presto
HDFS
Hive
Dashboard
Daily/Hourly Batch
Interactive query
Cassandra PostgreSQL Commertial DBs
SQL on any data sets
D...
Presto
HDFS
Hive
Dashboard
Daily/Hourly Batch
Interactive query
Cassandra PostgreSQL Commertial DBs
SQL on any data sets C...
Presto
HDFS
Dashboard
Interactive query
Commercial

BI Tools
✓ IBM Cognos

✓ Tableau

✓ ...
Prestogres
Today’s topic!
dashboard on chart.io: https://chartio.com/
What can Presto do?
> Query interactively (in milli-seconds to minues)
> MapReduce and Hive are still necessary for ETL
> ...
Presto’s deployment
> Facebook
> Multiple geographical regions
> scaled to 1,000 nodes
> actively used by 1,000+ employees...
Prestogres design of the ODBC/JDBC gateway
The problems to use Presto with
BI tools
> BI tools need ODBC or JDBC connectivity
> Tableau, IBM Cognos, QlickView, Chart...
A solution
> Creates a PostgreSQL protocol gateway
> Uses PostgreSQL’s stable ODBC / JDBC driver
Other possible designs were…
a) MySQL protocol + libdrizzle:
> Drizzle provides a well-designed library to implement
MySQL...
Other possible designs were…
c) PostgreSQL + H2 database + patch:
> H2 is an embedded database engine written in Java
> H2...
Prestogres design
pgpool-II + PostgreSQL + PL/Python
> pgpool-II is a PostgreSQL protocol middleware for
replication, fail...
Prestogres implementation
psql
pgpool-IIodbc
jdbc
PostgreSQL Presto
Authentication Create faked system

catalogs for meta-queries
1. 2.
Rewriting qu...
psql
pgpool-IIodbc
jdbc
PostgreSQL Presto
Authentication Create faked system

catalogs for meta-queries
1. 2.
Rewriting qu...
pgpool-IIpsql PostgreSQL Presto
$ psql -U me mydb
StartupPacket {
database = “mydb”,
user = “me”,
…
}
pgpool-IIpsql PostgreSQL Presto
$ psql -U me mydb
prestogres.conf
system_db_dbname = ‘postgres’
system_db_user = ‘prestogr...
pgpool-IIpsql PostgreSQL Presto
StartupPacket {
database = “mydb”,
user = “me”,
…
}
$ psql -U me mydb
prestogres.conf
syst...
pgpool-IIpsql PostgreSQL Presto
$ psql -U me mydb
prestogres_hba.conf
host mydb me 0.0.0.0/0 trust

presto_server presto.l...
system catalog!
pgpool-IIpsql PostgreSQL Presto
$ psql -U me mydb
“Q” SELECT * FROM pg_class;
"Query against a system cata...
system catalog!
pgpool-IIpsql PostgreSQL Presto
$ psql -U me mydb
SELECT setup_system_catalog(‘presto.local:8080’, ‘hive’)...
pgpool-IIpsql PostgreSQL Presto
$ psql -U me mydb
SELECT setup_system_catalog(‘presto.local:8080’, ‘hive’)“Q” SELECT * FRO...
pgpool-IIpsql PostgreSQL Presto
$ psql -U me mydb
“Q” SELECT * FROM pg_class; “Q” SELECT * FROM pg_class;
Meta-query
"Quer...
pgpool-IIpsql PostgreSQL Presto
$ psql -U me mydb
“Q” select count(*) from access_logs;
regular table!
Presto Query
"Query...
pgpool-IIpsql PostgreSQL Presto
$ psql -U me mydb
“Q” select count(*) from access_logs; SELECT start_presto_query(…

‘sele...
pgpool-IIpsql PostgreSQL Presto
$ psql -U me mydb
“Q” select count(*) from access_logs; SELECT start_presto_query(…

‘sele...
pgpool-IIpsql PostgreSQL Presto
$ psql -U me mydb
“Q” select count(*) from access_logs; “Q” SELECT * FROM fetch_results();...
Prestogres hacks
Multi-statement queries
BEGIN; SELECT 1; COMMIT;
Supporting Cursors
DECLARE CURSOR xyz FOR select …; FETCH
Security
security definer
Error message handling
raise exception ‘%’, E’…’ using errcode = …;
Faked current_database()
delete from pg_catalog.pg_proc where
proname=‘current_database’;
create function pg_catalog.curre...
Future works
Future works
Rewriting CAST syntax
Extended query
CREATE TEMP TABLE
DROP TABLE
Check: www.treasuredata.com
Cloud service for the entire data pipeline,
including Presto. We’re hiring!
Prestogres internals
Prestogres internals
of 47

Prestogres internals

Prestogres Internals JPUG PostgreSQL meetup in Japan #30.5
Published on: Mar 4, 2016
Published in: Presentations & Public Speaking      
Source: www.slideshare.net


Transcripts - Prestogres internals

  • 1. Sadayuki Furuhashi Founder & Software Architect Treasure Data, inc. internals PostgreSQL protocol gateway for Presto
  • 2. A little about me... > Sadayuki Furuhashi > github/twitter: @frsyuki > Treasure Data, Inc. > Founder & Software Architect > Open-source hacker > MessagePack - Efficient object serializer > Fluentd - An unified data collection tool > ServerEngine - A Ruby framework to build multiprocess servers > Prestogres - PostgreSQL protocol gateway for Presto > LS4 - A distributed object storage with cross-region replication > kumofs - A distributed strong-consistent key-value data store
  • 3. Today’s talk 1. What’s Presto? 2. Prestogres design 3. Prestogres implementation 4. Prestogres hacks 5. Prestogres future works
  • 4. 1. What’s Presto?
  • 5. What’s Presto? A distributed SQL query engine
 for interactive data analisys
 against GBs to PBs of data.
  • 6. What’s the problems to solve? > We couldn’t visualize data in HDFS directly using dashboards or BI tools > because Hive is too slow (not interactive) > or ODBC connectivity is unavailable/unstable > We needed to store daily-batch results to an interactive DB for quick response
 (PostgreSQL, Redshift, etc.) > Interactive DB costs more and less scalable by far > Some data are not stored in HDFS > We need to copy the data into HDFS to analyze
  • 7. HDFS Hive PostgreSQL, etc. Daily/Hourly Batch Interactive query Commercial
 BI Tools Batch analysis platform Visualization platform Dashboard
  • 8. HDFS Hive PostgreSQL, etc. Daily/Hourly Batch Interactive query ✓ Less scalable ✓ Extra cost Commercial
 BI Tools Dashboard ✓ Extra work to manage
 2 platforms ✓ Can’t query against
 “live”data directly Batch analysis platform Visualization platform
  • 9. HDFS Hive Dashboard Presto PostgreSQL, etc. Daily/Hourly Batch HDFS Hive Dashboard Daily/Hourly Batch Interactive query Interactive query Data analysis platform
  • 10. Presto HDFS Hive Dashboard Daily/Hourly Batch Interactive query Cassandra PostgreSQL Commertial DBs SQL on any data sets Data analysis platform
  • 11. Presto HDFS Hive Dashboard Daily/Hourly Batch Interactive query Cassandra PostgreSQL Commertial DBs SQL on any data sets Commercial
 BI Tools ✓ IBM Cognos
 ✓ Tableau
 ✓ ... Data analysis platform Prestogres
  • 12. Presto HDFS Dashboard Interactive query Commercial
 BI Tools ✓ IBM Cognos
 ✓ Tableau
 ✓ ... Prestogres Today’s topic!
  • 13. dashboard on chart.io: https://chartio.com/
  • 14. What can Presto do? > Query interactively (in milli-seconds to minues) > MapReduce and Hive are still necessary for ETL > Query using commercial BI tools or dashboards > Reliable ODBC/JDBC connectivity through Prestogres > Query across multiple data sources such as
 Hive, HBase, Cassandra, or even internal DBs > Plugin mechanism > Integrate batch analisys + visualization
 into a single data analysis platform
  • 15. Presto’s deployment > Facebook > Multiple geographical regions > scaled to 1,000 nodes > actively used by 1,000+ employees > who run 30,000+ queries every day > processing 1PB/day > Netflix, Dropbox, Treasure Data, Airbnb, Qubole > Presto as a Service
  • 16. Prestogres design of the ODBC/JDBC gateway
  • 17. The problems to use Presto with BI tools > BI tools need ODBC or JDBC connectivity > Tableau, IBM Cognos, QlickView, Chart.IO, ... > JasperSoft, Pentaho, MotionBoard, ... > ODBC/JDBC is VERY COMPLICATED > Matured implementation needs LONG time • psqlODBC: 58,000 lines • postgresql-jdbc: 62,000 lines • mysql-connctor-odbc: 27,000 lines • mysql-connector-j: 101,000 lines
  • 18. A solution > Creates a PostgreSQL protocol gateway > Uses PostgreSQL’s stable ODBC / JDBC driver
  • 19. Other possible designs were… a) MySQL protocol + libdrizzle: > Drizzle provides a well-designed library to implement MySQL protocol server. > Proof-of-concept worked well: • trd-gateway - MySQL protocol gateway for Hive > Difficulties: clients assumes the server is MySQL but, • syntax mismatches: MySQL uses `…` while Presto “…” • function mismatches: DAYOFMONTH(…) vs EXTRACT(day…) b) PostgreSQL + Foreign Data Wrapper (FDW): > JOIN and aggregation pushdown is not available
  • 20. Other possible designs were… c) PostgreSQL + H2 database + patch: > H2 is an embedded database engine written in Java > H2 has a PostgreSQL protocol implementation in Java > Difficulties: • System catalog implementation is incomplete
 (pg_class, pg_namespace, pg_proc, etc.) d) Reusing PostgreSQL protocol impl.: > Difficulties: • complete implementation of system catalogs was too difficult
  • 21. Prestogres design pgpool-II + PostgreSQL + PL/Python > pgpool-II is a PostgreSQL protocol middleware for replication, failover, load-balancing, etc. > pgpool-II originally has some useful code
 (parsing SQL, rewriting SQL, hacking system catalogs, …) > Basic idea: • Rewrite queries at pgpool-II and run Presto queries using PL/Python select count(1)
 from access select * from
 python_func(‘select count(1) from access’) rewrite!
  • 22. Prestogres implementation
  • 23. psql pgpool-IIodbc jdbc PostgreSQL Presto Authentication Create faked system
 catalogs for meta-queries 1. 2. Rewriting queries Executing queries using PL/Python 3. 4. Overview Patched!
  • 24. psql pgpool-IIodbc jdbc PostgreSQL Presto Authentication Create faked system
 catalogs for meta-queries 1. 2. Rewriting queries Executing queries using PL/Python 3. 4. Overview Patched! Prestogres
  • 25. pgpool-IIpsql PostgreSQL Presto $ psql -U me mydb StartupPacket { database = “mydb”, user = “me”, … }
  • 26. pgpool-IIpsql PostgreSQL Presto $ psql -U me mydb prestogres.conf system_db_dbname = ‘postgres’ system_db_user = ‘prestogres’ prestogres_hba.conf host mydb me 0.0.0.0/0 trust
 presto_server presto.local:8080, presto_catalog hive,
 pg_database hive StartupPacket { database = “mydb”, user = “me”, … }
  • 27. pgpool-IIpsql PostgreSQL Presto StartupPacket { database = “mydb”, user = “me”, … } $ psql -U me mydb prestogres.conf system_db_dbname = ‘postgres’ system_db_user = ‘prestogres’ > CREATE DATABASE hive; > CREATE ROLE me; > CREATE FUNCTION setup_system_catalog; > CREATE FUNCTION start_presto_query; libpq host=‘localhost’, dbname=‘postgres’,
 user=‘prestogres’ (system_db) prestogres_hba.conf host mydb me 0.0.0.0/0 trust
 presto_server presto.local:8080, presto_catalog hive,
 pg_database hive
  • 28. pgpool-IIpsql PostgreSQL Presto $ psql -U me mydb prestogres_hba.conf host mydb me 0.0.0.0/0 trust
 presto_server presto.local:8080, presto_catalog hive,
 pg_database hive prestogres.conf system_db_dbname = ‘postgres’ system_db_user = ‘prestogres’ StartupPacket { database = “hive”, user = “me”, … } StartupPacket { database = “mydb”, user = “me”, … }
  • 29. system catalog! pgpool-IIpsql PostgreSQL Presto $ psql -U me mydb “Q” SELECT * FROM pg_class; "Query against a system catalog!” Meta-query
  • 30. system catalog! pgpool-IIpsql PostgreSQL Presto $ psql -U me mydb SELECT setup_system_catalog(‘presto.local:8080’, ‘hive’)“Q” SELECT * FROM pg_class; "Query against a system catalog!” Meta-query PL/Python function
 defined at prestogres.py
  • 31. pgpool-IIpsql PostgreSQL Presto $ psql -U me mydb SELECT setup_system_catalog(‘presto.local:8080’, ‘hive’)“Q” SELECT * FROM pg_class; > CREATE TABLE access_logs; > CREATE TABLE users; > CREATE TABLE events; … Meta-query SELECT * FROM
 information_schema.columns "Query against a system catalog!”
  • 32. pgpool-IIpsql PostgreSQL Presto $ psql -U me mydb “Q” SELECT * FROM pg_class; “Q” SELECT * FROM pg_class; Meta-query "Query against a system catalog!”
  • 33. pgpool-IIpsql PostgreSQL Presto $ psql -U me mydb “Q” select count(*) from access_logs; regular table! Presto Query "Query against a regular table!”
  • 34. pgpool-IIpsql PostgreSQL Presto $ psql -U me mydb “Q” select count(*) from access_logs; SELECT start_presto_query(…
 ‘select count(*) from access_logs’) regular table! Presto Query "Query against a regular table!” PL/Python function
 defined at prestogres.py
  • 35. pgpool-IIpsql PostgreSQL Presto $ psql -U me mydb “Q” select count(*) from access_logs; SELECT start_presto_query(…
 ‘select count(*) from access_logs’) > CREATE TYPE result_type (c0_ BIGINT); > CREATE FUNCTION fetch_results 
 RETURNS SETOF result_type … regular table! Presto Query "Query against a regular table!” 1. start the query on Presto 2. define a function
 to fetch the result
  • 36. pgpool-IIpsql PostgreSQL Presto $ psql -U me mydb “Q” select count(*) from access_logs; “Q” SELECT * FROM fetch_results(); Presto Query "Query against a regular table!” PL/Python function
 defined by start_presto_query “Q” RAISE EXCEPTION …
  • 37. Prestogres hacks
  • 38. Multi-statement queries BEGIN; SELECT 1; COMMIT;
  • 39. Supporting Cursors DECLARE CURSOR xyz FOR select …; FETCH
  • 40. Security security definer
  • 41. Error message handling raise exception ‘%’, E’…’ using errcode = …;
  • 42. Faked current_database() delete from pg_catalog.pg_proc where proname=‘current_database’; create function pg_catalog.current_database()
 returns name as $$
 begin
 return ‘faked_name’::name;
 end
 $$ language plpgsql stable strict;
  • 43. Future works
  • 44. Future works Rewriting CAST syntax Extended query CREATE TEMP TABLE DROP TABLE
  • 45. Check: www.treasuredata.com Cloud service for the entire data pipeline, including Presto. We’re hiring!

Related Documents