
Presentation: Big Data Demystified

Big Data: what is it? What opportunities does it offer? How can we use it? Find the answers in this presentation given by Pentalog at the ALT Festival event, organized by the ALT Brasov Cluster for Innovation and Technology. http://www.altbrasov.org/
Published on: Mar 4, 2016
Published in: Engineering      
Source: www.slideshare.net


Transcript - Presentation: Big Data Demystified (Nov 2014)

  • 1. What is Big Data? * Data so large and complex that it becomes difficult to process with traditional systems * First coined in a 1997 NASA report * Petabytes and exabytes of data
  • 2. Big data is everywhere * Every 2 days we create as much information as we did from the beginning of time until 2003 * Google processes over 40 thousand search queries per second, making it over 3.5 billion in a single day * Around 100 hours of video are uploaded to YouTube every minute, and it would take you around 15 years to watch every video uploaded by users in one day * Every minute we send 204 million emails, generate 1.8 million Facebook likes, send 278 thousand Tweets, and upload 200,000 photos to Facebook * Trillions of sensors monitor, track, and communicate with each other, populating the IoT with real-time data
  • 3. Big data is not new
  • 4. Characteristics
  • 5. Volume * More data beats a better model * Scalable storage and a distributed approach to querying
  • 6. Variety * Big data includes all data * Data no longer fits into neatly structured tables
  • 7. Velocity * Frequency at which data is generated, captured, stored and processed * Need for real-time processing
  • 8. Data sources
  • 9. Importance of Big Data * Media * Retailing * Public service * Health * Industry
  • 10. Importance of Big Data * Gaining a more complete understanding of: business, customers, products, competitors * Which can lead to: efficiency improvements, increased sales, lower costs, better customer service, improved products
  • 11. The problem * Of the overall information available, 10% is structured data used in decision making; 90% is unstructured data that is wasted, not captured or analyzed * Valuable information vs. data that is best left ignored * 37.5% of large organizations said that analyzing big data is their biggest challenge * More than 90% said that Big Data is a top-ten priority
  • 12. It’s not only the size * Collect -> Analyze -> Understand -> Generate Value * Find a meaning * Find interconnections * Find hidden data
  • 13. Purpose * Take more precise actions that bring value and reduce costs * Make the right decision within the right amount of time
  • 14. How big will big data get? * From 3.2 zettabytes today to 40 zettabytes in only six years * More than 30 billion devices will be wirelessly connected by 2020
  • 15. Challenges * Storing data * Analysis * Search * Sharing * Transfer * Visualization
  • 16. NoSQL and Big Data Analytics * Storing data * Distribution * Processing
  • 17. NoSQL * Scalability / cluster-friendly * Availability / fault tolerance * Schema-less * Low latency * High performance * Open source
  • 18. Dynamic scaling * Adding/removing nodes dynamically → storage/performance capacity can grow or shrink as needed
  • 19. Auto-sharding * Natively and automatically spread data across servers * Data and query load automatically balanced across servers
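Auto-sharding can be pictured as a hash of each key selecting the server that owns it. The following is a minimal Python sketch of that idea; the `ShardedStore` class and its `put`/`get` methods are invented for illustration and do not correspond to any real NoSQL API:

```python
import hashlib

# Toy illustration of auto-sharding: each key is hashed and mapped to one
# of N shards, so data and query load spread across servers automatically.

class Shard:
    def __init__(self, name):
        self.name = name
        self.store = {}

class ShardedStore:
    def __init__(self, shard_count=4):
        self.shards = [Shard(f"node-{i}") for i in range(shard_count)]

    def _shard_for(self, key):
        # Deterministic hash so reads route to the shard that took the write.
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key).store[key] = value

    def get(self, key):
        return self._shard_for(key).store.get(key)

db = ShardedStore()
for user_id in range(1000):
    db.put(f"user:{user_id}", {"id": user_id})
print(db.get("user:42"))  # routed to the same shard it was written to
```

Real systems refine this with consistent hashing so that adding or removing a node moves only a fraction of the keys.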
  • 20. Replication * Support automatic replication → high availability → disaster recovery → no need for separate applications to manage these tasks
  • 21. Schemaless * No predefined schema * Insertion of aggregates → puts together data that is commonly accessed together
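The "insertion of aggregates" idea can be shown with plain Python dicts standing in for documents; no real database is involved, and the collection and field names below are made up:

```python
# Schemaless aggregate storage: each record is a self-contained aggregate
# (an order together with its customer and lines), and records in the same
# collection need not share a schema.

orders = []  # one "collection", no schema declared anywhere

orders.append({
    "id": 1,
    "customer": {"name": "Ana", "city": "Brasov"},
    "lines": [{"sku": "book", "qty": 2}],  # data accessed together
})                                         # is stored together
orders.append({
    "id": 2,
    "customer": {"name": "Ion"},
    "gift_wrap": True,                     # new field, no migration needed
})

# The whole aggregate comes back in one read; no joins required.
order = next(o for o in orders if o["id"] == 1)
print(order["customer"]["city"])
```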
  • 22. NoSQL vanillas
  • 23. NoSQL vanillas * Key-value store → Amazon DynamoDB, Redis → Content caching (focus on scaling to huge amounts of data, designed to handle massive load), logging, etc. * Document store → CouchDB, MongoDB → Web applications * Column family store → Cassandra, HBase → Distributed file systems * Graph store → Neo4j, InfoGrid, InfiniteGraph → Social networking, recommendations (focus on modeling the structure of data - interconnectivity)
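The four families above shape data differently; plain Python literals can illustrate each shape (illustration only, with invented keys and values, not the actual APIs of the products named):

```python
# Key-value: an opaque value looked up by key (Redis/DynamoDB style).
cache = {"session:9f2": b"serialized-blob"}

# Document: a nested, queryable structure per key (MongoDB/CouchDB style).
articles = {"a1": {"title": "Big Data", "tags": ["nosql", "hadoop"]}}

# Column family: rows hold sparse, named columns (Cassandra/HBase style).
users = {"row:ana": {"info:name": "Ana", "stats:logins": 12}}

# Graph: nodes plus explicit relationships (Neo4j style).
follows = {"ana": ["ion", "dan"], "ion": ["dan"]}

# Each model favors a different access pattern:
assert cache["session:9f2"]                # fetch blob by exact key
assert "nosql" in articles["a1"]["tags"]   # query inside a document
assert users["row:ana"]["stats:logins"]    # read one column of one row
assert "dan" in follows["ana"]             # traverse a relationship
```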
  • 24. Reasons for choosing NoSQL * Working on large amounts of data * Scaling out with ease * Need of: → high availability → low-latency systems with eventual consistency * The aggregate model fits: → as a natural choice → a structure that changes over time
  • 25. … and associates
  • 26. What is Hadoop? ● Distributed file system ● Distributed processing system ● Batch / offline oriented ● Open source
  • 27. In the beginning... ● Created by Doug Cutting and Mike Cafarella ● Intended as distribution support for the Nutch search engine ● Built on Google's MapReduce and Google File System papers
  • 28. Who uses Hadoop? Most notable users are … + many others
  • 29. Hadoop in the real world ● Recommendation systems ● Data warehousing ● Financial analysis ● Market research/forecasting ● Log analysis ● Threat analysis ● Image processing ● Social networking ● Advertising
  • 30. Why Hadoop? ● Scalable ● Cost effective ● Flexible ● Efficient ● Resilient to failure ● Schema on read
  • 31. Why not Hadoop? ● Inefficient when used at small scale ● Not good for real-time systems
  • 32. Hadoop major components ● Hadoop Common ● YARN ● HDFS ● MapReduce
  • 33. Architecture
  • 34. Architecture
  • 35. Architecture
  • 36. Architecture
  • 37. Architecture
  • 38. MapReduce ● Split input files ● Operate on key/value pairs ● Mappers filter & transform input data ● Reducers aggregate the mappers' output ● Move code to data
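The MapReduce flow on this slide (split, map to key/value pairs, shuffle by key, reduce) can be sketched with the classic word-count example. This is a single-process, pure-Python toy; real Hadoop distributes each phase across the cluster:

```python
from collections import defaultdict

def mapper(line):
    # Mappers filter & transform: emit one (word, 1) pair per word.
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    # Reducers aggregate the mappers' output per key.
    return word, sum(counts)

def map_reduce(lines):
    # Shuffle phase: group all mapped values by key.
    groups = defaultdict(list)
    for line in lines:                  # "split input files"
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

print(map_reduce(["big data big", "data everywhere"]))
# {'big': 2, 'data': 2, 'everywhere': 1}
```

"Move code to data" refers to shipping these small mapper/reducer functions to the nodes that already hold the input blocks, instead of moving terabytes of input to the code.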
  • 39.
  • 40. … and associates
  • 41. Apache Ambari The project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters
  • 42. Apache Pig Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs
  • 43. Apache Hive The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL
  • 44. Apache Chukwa A data collection system for monitoring large distributed systems. Chukwa comes with a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data
  • 45. Apache Avro A remote procedure call and data serialization framework
  • 46. Apache HBase Apache HBase offers random, real-time read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows x millions of columns -- atop clusters of commodity hardware
  • 47. Apache Mahout The Apache Mahout™ project's goal is to build a scalable machine learning library
  • 48. Apache Spark Apache Spark™ is a fast and general engine for large-scale data processing
  • 49. Apache ZooKeeper ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
  • 50. Big data – in the future ● 87% of enterprises believe Big Data analytics will redefine the competitive landscape of their industries within the next three years ● 89% believe that companies that do not adopt a Big Data analytics strategy in the next year risk losing market share and momentum
  • 51. Big data – in the future
  • 52. Thank you!
