Polyglot Persistence & Big Data in the Cloud

Published on: Mar 4, 2016
Published in: Technology      
Source: www.slideshare.net


Transcripts - Polyglot Persistence & Big Data in the Cloud

  • 1. Polyglot Persistence
      Big Data in the Cloud
      Andrei Savu / andrei.savu@cloudsoftcorp.com
  • 2. Overview
      • Introduction
      • Databases
      • Search
      • Processing
      • Deployment
  • 3. Polyglot Persistence
      “Polyglot Persistence, like polyglot programming, is all about choosing the right persistence option for the task at hand”
      http://www.nearinfinity.com/blogs/scott_leberknight/polyglot_persistence.html
      http://martinfowler.com/bliki/PolyglotPersistence.html
  • 4. It all started from ... a set of papers released by Google & Amazon
  • 5. The papers
      • Google Filesystem (2003) http://research.google.com/archive/gfs.html
      • Google MapReduce (2004) http://research.google.com/archive/mapreduce.html
      • Google BigTable (2006) http://research.google.com/archive/bigtable.html
      • Amazon Dynamo (2007) http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
  • 6. Databases
  • 7. Apache HBase
      • Java
      • designed to be able to store massive amounts of data
      • speaks HTTP / REST, Thrift, Avro
      • based on Google BigTable
      • persistence through HDFS (Hadoop)
      • Map/Reduce with Hadoop
      • designed for real time workloads
      • https://hbase.apache.org/
  • 8. Apache Cassandra
      • Java
      • inspired by Google BigTable and Amazon Dynamo
      • tunable trade-offs
      • query by column and range of keys
      • really fast writes
      • excellent for a large number of high speed counters
      • Map/Reduce possible with Hadoop
      • http://cassandra.apache.org/
  • 9. MongoDB
      • C++
      • document database (BSON) with rich indexing
      • master / slave replication
      • built-in sharding
      • auto failover with replica sets
      • map/reduce with JavaScript
      • server side JavaScript
      • journaling
      • fast in-place updates
      • http://www.mongodb.org/
  • 10. Apache CouchDB
      • Erlang
      • document database (JSON)
      • bi-directional replication
      • advanced conflict resolution
      • MVCC: writes do not block reads
      • exposes a stream of realtime updates
      • needs compacting
      • indexing via views (JS)
      • attachment handling
      • https://couchdb.apache.org/
  • 11. Riak (Basho)
      • Erlang, C, JavaScript
      • key, value store
      • focus on fault tolerance and cross datacenter replication
      • speaks HTTP/REST or custom binary
      • tunable trade-offs (N, R, W)
      • mapreduce in JS or Erlang
      • full-text indexing with riak search
      • http://wiki.basho.com/
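The "tunable trade-offs (N, R, W)" bullet on the Riak slide follows the Dynamo paper: N is the number of replicas of a key, R the number of replicas that must answer a read, and W the number that must acknowledge a write. A minimal Python sketch of the usual overlap rule is shown here; the function names are hypothetical and this is not Riak's client API.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """Return True when every read quorum must share a replica with every write quorum."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    # R + W > N means any R replicas and any W replicas have at least one in common.
    return r + w > n


def tolerated_failures(n: int, r: int, w: int) -> dict:
    """Number of replicas that can be unavailable while a quorum can still be reached."""
    return {"reads": n - r, "writes": n - w}
```

With N=3, choosing R=2 and W=2 keeps read and write quorums overlapping while tolerating one unavailable replica; R=1 and W=1 favours latency and availability but gives up the overlap.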
  • 12. Neo4j
      • Java
      • graph database
      • speaks HTTP/REST
      • standalone or embeddable in Java apps
      • full ACID
      • web admin interface
      • nodes & relationships can have metadata
      • indexing
      • http://neo4j.org/
  • 13. Redis
      • C/C++
      • disk-backed data structure server
      • master-slave replication
      • supports: strings, lists, sets, hashes, sorted sets
      • batch operations
      • values can be expired
      • Pub/Sub for messaging
      • ideal for rapidly changing data that fits in memory
      • http://redis.io/
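The "values can be expired" bullet on the Redis slide can be illustrated with a small in-memory sketch. This is not Redis code or its client API: `ExpiringStore` is a hypothetical class, and it only drops expired keys lazily when they are read.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class ExpiringStore:
    """Toy key/value store where each value may carry a time-to-live (hypothetical)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._data: Dict[str, Tuple[Any, Optional[float]]] = {}

    def set(self, key: str, value: Any, ttl_seconds: Optional[float] = None) -> None:
        # A deadline of None means the value never expires.
        deadline = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._data[key] = (value, deadline)

    def get(self, key: str) -> Any:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, deadline = entry
        if deadline is not None and self._clock() >= deadline:
            del self._data[key]  # lazy removal: the key is dropped on first read after expiry
            return None
        return value
```

A session token stored with `ttl_seconds=10` would be readable for ten seconds and then behave as if it had never been set, which is why this pattern suits rapidly changing data such as caches and sessions.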
  • 14. Search
  • 15. elasticsearch
      • Java
      • based on Apache Lucene
      • distributed by design
      • cloud aware (Amazon)
      • understands JSON objects
      • no schema required
      • simple multi-tenancy
      • real-time search
      • scale to 100s of machines
      • http://www.elasticsearch.org/
  • 16. Apache SolrCloud
      • Java
      • based on Apache Lucene (share the same repo)
      • adds distributed capabilities to Solr
      • based on ZooKeeper for coordination & config
      • automatic management of multiple shards
      • automatic fail-over
      • durable writes
      • https://wiki.apache.org/solr/SolrCloud
  • 17. Processing
  • 18. Apache Hadoop
      • Java, C/C++
      • set of distributed systems (hdfs, mr etc.)
      • framework for distributed data processing
      • simple programming model (map / reduce)
      • can scale to 1000s of machines
      • designed to be highly available at the application level
      • https://hadoop.apache.org/
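The "simple programming model (map / reduce)" bullet on the Hadoop slide can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce steps using word count as the example job; it does not use Hadoop's API, and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three steps in sequence on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

In a real cluster the map and reduce functions run on many machines in parallel and the shuffle moves data over the network; the job author only writes the two functions.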
  • 19. Hadoop Ecosystem
      • HDFS (Storage)
      • MapReduce (Processing)
      • Hive, Pig (high level languages)
      • HBase (database)
      • ZooKeeper (coordination)
      • Oozie (workflow)
      • Mahout (machine learning)
      • Flume (log streaming)
      • Sqoop (data import)
      • Whirr (deployment)
  • 20. Deployment on Cloud Infrastructure (using jclouds)
  • 21. Apache Whirr https://whirr.apache.org/ * disclaimer: I am a member of the PMC
  • 22. First Steps
      • Download
          $ curl -O http://www.apache.org/dist/whirr/whirr-0.7.1/whirr-0.7.1.tar.gz
          $ tar zxf whirr-0.7.1.tar.gz; cd whirr-0.7.1
      • Use
          # export credentials
          $ bin/whirr launch-cluster --config ...
          $ bin/whirr destroy-cluster --config ...
      https://whirr.apache.org/docs/latest/whirr-in-5-minutes.html
  • 23. Deploy Hadoop
      whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker, 10 hadoop-datanode+hadoop-tasktracker
      https://whirr.apache.org/docs/0.7.1/quick-start-guide.html
  • 24. With Mahout
      whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker+mahout-client, 10 hadoop-datanode+hadoop-tasktracker
  • 25. Or with HBase
      whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker+hbase-master+zookeeper, 10 hadoop-datanode+hadoop-tasktracker+hbase-regionserver
  • 26. Or Cassandra
      whirr.instance-templates=10 cassandra
  • 27. And elasticsearch
      whirr.instance-templates=10 elasticsearch
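The `whirr.instance-templates` values on the deployment slides share one shape: comma-separated groups, each an instance count followed by roles joined with `+`. The Python sketch below reads that shape as it appears on the slides, which can help when sanity-checking a cluster size before launching. It is not Whirr's own parser, and the function names are hypothetical.

```python
from typing import List, Tuple


def parse_instance_templates(value: str) -> List[Tuple[int, List[str]]]:
    """Split a template string into (instance_count, [roles]) groups."""
    templates = []
    for group in value.split(","):
        group = group.strip()
        if not group:
            continue
        # The first whitespace-separated token is the count; the rest is the role list.
        count_text, roles_text = group.split(None, 1)
        roles = [role.strip() for role in roles_text.split("+") if role.strip()]
        templates.append((int(count_text), roles))
    return templates


def total_instances(value: str) -> int:
    """Total number of machines the template string asks for."""
    return sum(count for count, _ in parse_instance_templates(value))
```

For the Hadoop template on the slides this yields one master group with two roles and one worker group of ten machines, eleven instances in total.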
  • 28. Thanks!
      andrei.savu@cloudsoftcorp.com
