AWS EMR Flashcards Preview

Flashcards in AWS EMR Deck (144)
1
Q

What is AWS EMR?

A

It is Amazon Elastic MapReduce (EMR). Data is broken up, some calculation or code is run over each piece, and the results are then compiled. Suppose you had the text of every book in the world and had to look for the word "dog". You would split the books across the map nodes, have each map node search its books for the word "dog", and once each map node is finished you would combine the returned results in the reduce node.
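
The book-search example above can be sketched in plain Python. This is only an illustration of the map and reduce idea, not EMR or Hadoop code; the function names and the sample "books" are made up for the example.

```python
from collections import Counter

def map_count(book_text, word):
    """Map step: count how often `word` appears in one book."""
    words = (w.strip(".,!?").lower() for w in book_text.split())
    return Counter(words)[word]

def reduce_counts(partial_counts):
    """Reduce step: combine the per-book counts into one total."""
    return sum(partial_counts)

# Hypothetical input: each string stands in for one book.
books = ["The dog barked.", "A cat and a dog. Dog!", "No pets here."]

# On a cluster each map call would run on a different map node.
partials = [map_count(book, "dog") for book in books]
total = reduce_counts(partials)
print(partials, total)
```

Each map call only sees its own book, which is what lets the work be spread over many nodes.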

2
Q

As a developer what two components of code do you have to give an EMR?

A
  • Map code component
  • Reduce code component

3
Q

What is a split size?

A

This is the size of each chunk when the data is split across the map nodes.
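
A minimal sketch of splitting by size, assuming the split size is measured in records (Hadoop itself works in bytes); `split_records` is a hypothetical helper, not an EMR API.

```python
def split_records(records, split_size):
    """Cut `records` into consecutive chunks of at most `split_size` items."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    return [records[i:i + split_size]
            for i in range(0, len(records), split_size)]

# Ten records with a split size of four give three splits,
# so three map tasks would be needed.
splits = split_records(list(range(10)), 4)
print(splits)
```

The last split is smaller than the others because the data does not divide evenly.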

4
Q

Is there input and output data from EMR?

A

Yes, data comes from a persistent data store and, once processed, is pushed to a persistent data store; S3 is a candidate.

5
Q

Outside of AWS, what is EMR known as?

A

Hadoop

6
Q

What are the two frameworks that EMR can run?

A

Hadoop and Spark. It also supports Hive, Pig, HBase, Hue and ZooKeeper.

7
Q

If we were to see Hive, what would it relate to?

A

EMR

8
Q

What types of node does an EMR cluster have?

A
  • Master node
  • Core node
  • Task node
9
Q

What is the master node's job?

A

The master node controls the cluster, distributes the workload and monitors the health of the cluster.

10
Q

What does the EMR cluster run on?

A

It runs on EC2 instances.

11
Q

What nodes do the work in an EMR cluster?

A

Core Nodes

12
Q

Other than processing, what else do the core nodes do?

A

They provide the HDFS file system.

13
Q

Is data replicated between core nodes?

A

Yes 100%

14
Q

What is the difference between task nodes and core nodes?

A

Task nodes process data but do not have HDFS.

15
Q

Where can we get and put data for EMR?

A

From S3

16
Q

Where is HDFS run in EMR?

A

On the core nodes

17
Q

What is EMRFS?

A

It is an S3 backed file system and can be used to replace HDFS

18
Q

What advantage does using EMRFS have?

A

It is in S3 so it lives beyond the life of the cluster.

19
Q

Is EMR a fully managed service that does not use a VPC with nodes?

A

No, EMR is not fully managed but is a managed service that is deployed in your VPC.

20
Q

Is EMR highly available across all availability zones?

A

No, for speed of processing, EMR (Hadoop) nodes are deployed into a single AZ.

21
Q

What does Spark do?

A

It is a batch and stream processing engine for data; it competes against Hadoop MapReduce in the area of batch processing.

22
Q

Who uses EMR (Hadoop) and Spark?

A
  • Financial sector: if you are looking for fraud
  • Health: scoring potential health risks

23
Q

What is Hive?

A

It complements the HDFS file system. It enables you to use SQL-like queries that are converted into MapReduce jobs to be run on a Hadoop cluster.

24
Q

Is Hive a good fit for OLTP or relational data?

A

No

25
Q

Is EMR good for use with OLTP and relational data?

A

No

26
Q

What is PIG used for?

A

Before Pig, people using EMR (Hadoop) had to interact with the cluster through low-level tasks written in Java. Pig is a sort of scripting language.

27
Q

What is the minimum size of an EMR cluster?

A

One node, but this is for development only.

28
Q

I am thinking of running the master node on a spot instance, is there any potential issue and why?

A

Yes, the master node is used to control the EMR (Hadoop) cluster, if it fails the cluster is failed, spot instances can and will go away at any point in time.

29
Q

What EMR (Hadoop) nodes should I use spot instance for?

A

Use the spot instances for Task nodes

30
Q

Can I use instance fleets with EMR nodes?

A

Yes, this gives you the ability to select up to five different instance types. With a fleet you specify the desired number of nodes and price, and the fleet will try to make it happen.

31
Q

How do I secure the EMR (Hadoop) cluster?

A

Using security groups and NACLs.

32
Q

Do you want to use spot instances for core nodes?

A

You can, but you could lose the node and part of the HDFS file system.

33
Q

What should I use to run my EMR task nodes?

A

Spot instances as the task nodes have no data.

34
Q

If I am using instance fleets, how are my fleets used for the different node types in EMR?

A

You will have three fleet types:

  • Master node fleet
  • Task node fleet
  • Core node fleet
35
Q

I have data in us-east-1 region in s3, where should I run my EMR cluster?

A

As close to the data's region as possible; in this case, us-east-1. The reason for this is latency: as a rule of thumb, you get roughly 1 ms per 90 miles of distance.

36
Q

I am calculating pi. Should I use a general purpose, compute optimised or memory optimised node?

A

Compute optimised, as it is going to use a lot of CPU.

37
Q

What is the recommended instance type for Hadoop cluster nodes?

A

m4.large for a cluster with fewer than 50 nodes; for a cluster with more than 50 nodes, step up to the next size, m4.xlarge.

38
Q

For EMR, when should I use reserved instances?

A

When you know the cluster is going to be used long term (1, 2 or 3 years).

39
Q

For long-running EMR, or where EMR is a data warehouse, how should I set up the cost of the nodes?

A
  • Master node = On-demand
  • Core node = on-demand or fleet
  • Task node = on-demand or fleet
40
Q

For cost driven EMR how should I set up the cost of the nodes?

A
  • Master node = Spot
  • Core node = Spot
  • Task node = Spot
41
Q

For data warehouse critical EMR how should I set up the cost of the nodes?

A
  • Master node = On-demand
  • Core node = On-demand
  • Task node = on-demand or fleet
42
Q

For app testing EMR how should I set up the cost of the nodes?

A
  • Master node = Spot
  • Core node = Spot
  • Task node = Spot
43
Q

Do I have to provide code, or does EMR just do the map-reduce for me and generate the code?

A

You have to provide the map and reduce code. This is the code EMR will push to the map and reduce nodes, and it is the code that will run on the nodes to perform the map and reduce processing.

44
Q

What are a split and a split size?

A

A split is a chunk of the input data, and the split size is the size of each chunk. The data is split into chunks that are stored on separate nodes.

45
Q

What is a map job?

A

The map phase takes records (say a date, a name and an address) and stores them on the nodes, splitting the data by, say, the date. This way each node has a subset of the data.

46
Q

What is the reduce job?

A

Data is shuffled into the reduce phase, where it is, for example, counted.
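
The map, shuffle and reduce steps can be sketched in plain Python as below. This is an illustration only, assuming records of (date, name, address) as in the map card; the sample records and function names are invented for the example and this is not Hadoop code.

```python
from collections import defaultdict

# Hypothetical input records: (date, name, address).
records = [
    ("2020-01-01", "Ann", "1 Main St"),
    ("2020-01-02", "Bob", "2 Oak Ave"),
    ("2020-01-01", "Cid", "3 Elm Rd"),
]

def map_record(record):
    """Map: emit a (key, value) pair keyed by the date."""
    date, _name, _address = record
    return (date, 1)

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_group(key, values):
    """Reduce: count the records seen for one key."""
    return (key, sum(values))

pairs = [map_record(r) for r in records]
counts = dict(reduce_group(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

The shuffle step is what guarantees that every value for a given date reaches the same reducer.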

47
Q

I require Hadoop, how do I configure Redshift?

A

You do not. Redshift is a data warehouse; you need EMR. EMR is the AWS implementation of MapReduce and Hadoop.

48
Q

I require Spark, what product in AWS should I be configuring?

A

EMR

49
Q

What is HIVE?

A

Hive is a data warehouse on top of Hadoop; it gives you SQL query abilities. It has a metadata store and ODBC and JDBC drivers to enable you to easily query from your apps.

50
Q

What is PIG?

A

Pig is a high-level language to analyze data in Hadoop. For example, you can use Pig to:

  • Load CSV file: LOAD k.csv USING PigStorage as id:int, date:chararray
  • Create new data listings: FOREACH listings GENERATE list_id, ToDate
51
Q

How are EMR clusters created?

A

An EMR cluster can be created by you through the console, CLI or API, or by another product such as AWS Data Pipeline. When you create a cluster yourself, it is a long-running cluster.

52
Q

Can I ssh to the master node?

A

Yes, you can ssh to the master node.

53
Q

I am using Hadoop and Hive and I want to use ODBC. Do I need to move the data to Redshift?

A

No, Hive is a data warehouse on top of Hadoop; one of the features of Hive is its ability to use ODBC.

54
Q

I am using Hadoop and Hive and I want to use JDBC. Do I need to move the data to Redshift?

A

No, Hive is a data warehouse on top of Hadoop; one of the features of Hive is its ability to use JDBC.

55
Q

What is HBase?

A

HBase is like Google's Bigtable database; it runs on top of Hadoop HDFS.

56
Q

I have to write some code. Are map and reduce one application?

A

Two separate applications: there is a map app and a reduce app.

57
Q

Does EMR support Spark?

A

Yes

58
Q

How are EMR clusters created?

A
  • You create the cluster (long-running cluster)
  • Another product like AWS Data Pipeline creates the cluster

59
Q

I want to get a deeper understanding of what my EMR cluster is doing. What can I do?

A

When creating a cluster you have the option to enable logging and have the logs saved to S3.

60
Q

When creating a Hadoop cluster what software configuration do I have available?

A
  • Hadoop
  • HBase
  • Spark
  • Presto
61
Q

When creating a Hadoop cluster, do I have the option to select instance size?

A

Yes, 100%, you are getting a managed cluster of nodes, loaded with software for you, you can size the nodes as you need.

62
Q

As EMR is a service, do you get nodes, and if so what are they called?

A

Master node and core nodes

63
Q

If I create a cluster of 4, what nodes am I getting?

A

You are getting one master and 3 core nodes.

64
Q

My org has a policy of encrypting everything, how can we apply this to EMR?

A

EMR supports encryption:

  • In transit
  • At rest (EBS and HDFS encryption)
  • In transit between nodes
65
Q

I am concerned about data security between EMR nodes. What can I do?

A

You can have the data encrypted between EMR nodes.

66
Q

I want the EMR nodes to access S3. What do I have to provide?

A

An IAM role, this role is used to grant the EMR nodes access to S3.

67
Q

Can I SSH to the EMR nodes?

A

Yes 100%, you can ssh to the nodes.

68
Q

For EMR, how is security for the cluster configured?

A

You use a ‘Security Configuration’

69
Q

EMR uses S3, is it possible to have the EMR S3 data encrypted?

A

Yes 100%, this is done when you create a ‘security configuration’

70
Q

What is EMRFS?

A

It is an implementation of HDFS for reading and writing to S3.

71
Q

Can I use EMR with a VPC and security groups?

A

Yes 100%, EMR is just a cluster of EC2 nodes and is deployed into a VPC.

72
Q

What is the default cluster size?

A

3 nodes, a master and two core nodes.

73
Q

I want to have the ability to use SQL over my Hadoop data. What do I need to do?

A

You need to select the Hive configuration when setting up the cluster. Hive is a data warehouse on top of Hadoop and enables you to run SQL queries and connect over ODBC and JDBC.

74
Q

What does the master node do?

A

It manages the cluster, distributes workloads to the cluster and monitors the health of the cluster.

75
Q

I want to run SQL-like queries and I have deployed a cluster. How do I run the queries?

A

SSH to the master node and you can run the queries there; you can also connect to Hive using ODBC and JDBC.

76
Q

Where can the cluster store its data?

A

On HDFS or on EMRFS (S3)

77
Q

Inside an EMR cluster, where is the HDFS storage?

A

It runs on the core nodes.

78
Q

I am architecting an EMR cluster and I am concerned about availability, so I want to architect it to be in multiple availability zones. How do I do this?

A

You do not; EMR clusters are architected and deployed to be in one availability zone.

79
Q

What is a task node?

A

A task node is an optional node for running tasks; it does not host any HDFS storage.

80
Q

I am concerned about cost and the task nodes. What can I do to improve the cost of task nodes, and why?

A

A task node can use a spot instance as it can be stopped at any point in time without causing issues.

81
Q

When loading and storing data for EMR, what should I be thinking with regard to S3?

A

How I use storage classes: data that can be recreated should be stored in low-cost storage like One Zone-IA.

82
Q

If the master node fails, what happens to the cluster?

A

The cluster will fail, it is the most important node.

83
Q

How many master nodes can I have?

A

One by default; the single master node is responsible for cluster management. For high availability, EMR can be configured with up to 3 master nodes.

84
Q

Can I have a spot instance for the master node, explain?

A

No!!, the master node is responsible for managing the cluster and always needs to be present.

85
Q

Can I have a spot instance for the core node?

A

You can, but if the spot instance is taken away you lose the node and part of the HDFS file system.

86
Q

What is the difference between a Redshift data node and an EMR core node?

A

Redshift data node stores data and searches the data.

EMR core nodes perform jobs on the data like map and reduce.

87
Q

What instance sizes am I limited to in EMR?

A

Unlike Redshift, you can use almost all the instance types.

88
Q

Can I change the master node type after the cluster is created?

A

No, it is fixed after the cluster is provisioned.

89
Q

I want to optimise EMR performance, I have my data in S3 us-east-1 and am thinking of putting EMR in us-west-1, is this a good choice and explain why?

A

It is not a good choice. You will want to keep the data close to the EMR cluster; in this case the data is in S3 in us-east-1, so you will want to create the EMR cluster in us-east-1.

90
Q

What is a good starting size of EMR nodes?

A

m4.large

91
Q

How would I figure out the best instance type for EMR?

A

Select something like m4.large, then evaluate using CloudWatch and resize the nodes.

92
Q

When I am optimizing my core nodes what should I be looking for?

A

Memory -> select a memory-optimised instance type (r)
Compute -> select a compute-optimised instance type (c)
Storage -> select a storage-optimised instance type
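
As a rough sketch of the evaluate-then-resize idea, the function below picks an instance family from utilisation figures. The 80% threshold, the metric names and `suggest_family` itself are assumptions made up for this example; real sizing should be based on what CloudWatch shows for your own workload.

```python
def suggest_family(cpu_pct, mem_pct, disk_pct, threshold=80.0):
    """Suggest an instance family from the busiest resource.

    The threshold is a hypothetical cut-off, not an AWS recommendation.
    """
    usage = {
        "compute-optimised (c)": cpu_pct,
        "memory-optimised (r)": mem_pct,
        "storage-optimised": disk_pct,
    }
    family, value = max(usage.items(), key=lambda item: item[1])
    # If nothing is under pressure, stay on general purpose (e.g. m4.large).
    return family if value >= threshold else "general purpose (m)"

print(suggest_family(cpu_pct=95, mem_pct=40, disk_pct=30))
print(suggest_family(cpu_pct=30, mem_pct=90, disk_pct=20))
print(suggest_family(cpu_pct=10, mem_pct=20, disk_pct=30))
```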

93
Q

I have a cost-sensitive situation, what instances should I use for EMR?

A

Consider spot for master, core and task nodes.

94
Q

I am using EMR for a test situation. What instance types should I be using for EMR?

A

Consider spot for master, core and task nodes.

95
Q

I am using EMR for a data-critical application. What should I be considering for instances?

A
  • on-demand for master node
  • on-demand for core-node
  • spot for task-node
96
Q

I am using EMR for a data-critical application. What should I be considering for instances?

A
  • on-demand for master node
  • on-demand for core-node
  • spot for task-node or instance fleet
97
Q

I am using EMR as a long-running data warehouse. What should I be considering for instances?

A
  • on-demand for master node
  • on-demand for core-node or instance fleet
  • spot for task-node or instance fleet
98
Q

I am using EMR as a long-running data warehouse. What should I be considering for instances?

A
  • on-demand for the master node
  • on-demand for core-node or instance fleet
  • spot for task-node or instance fleet
99
Q

I want to stream and analyze data in real time. How can I set up the EMR streaming and analytics service?

A

There is no streaming and analytics service in EMR. EMR is a batch processing service for data that is already captured and stored in storage like S3. For streaming and real-time analytics, it is best to use Kinesis.

100
Q

What are the use cases for EMR?

A
  • web indexing, data mining
  • logfile analysis
  • machine learning
  • financial analysis
  • scientific simulation
  • bioinformatics research
101
Q

You require the ability to analyze a large amount of data, which is stored on Amazon S3, using Amazon Elastic MapReduce. You are using the cc2.8xlarge instance type, whose CPUs are mostly idle during processing. Which of the below would be the most cost-efficient way to reduce the runtime of the job?

A

Use instances with a better balance of CPU to I/O for this type of analysis. Smaller instances with higher I/O would be better.

102
Q

I have data in DynamoDB and I want to use EMR. How can I do this?

A

Amazon DynamoDB is integrated with Apache Hive, a data warehousing application that runs on Amazon EMR. Hive can read and write data in DynamoDB tables, allowing you to:

Query live DynamoDB data using a SQL-like language (HiveQL).

Copy data from a DynamoDB table to an Amazon S3 bucket, and vice-versa.

Copy data from a DynamoDB table into Hadoop Distributed File System (HDFS), and vice-versa.

Perform join operations on DynamoDB tables.

103
Q

Where can I load data into EMR from?

A
  • S3
  • Elasticsearch
  • Amazon RDS
  • DynamoDB
  • Redshift
  • Kafka
  • Kinesis
104
Q

Can I install my own libraries on EMR?

A

Yes 100%, you can load your own libs and code on the nodes in EMR.

105
Q

Can I use Hive on EMR to query tables in DynamoDB?

A

Yes 100% there is a connector

106
Q

I need to transfer data from RDS to EMR. How can I do this?

A

Apache Sqoop is a tool for transferring data between Amazon S3, Hadoop, HDFS, and RDBMS databases

107
Q

Need a question here.

A

Spark -> Redshift connector

108
Q

I am using EMR with data stored on S3. How can I get better bandwidth?

A

Compress the data

109
Q

Is EMR a batch or streaming service?

A

It is a batch service, but if you add the Kinesis connector you can have EMR query and process data coming from a Kinesis stream using Hive, Pig or MapReduce.

110
Q

What is the EMR Kinesis connector?

A

It enables you to connect EMR to a Kinesis stream for processing and querying; you can query and process the stream using Pig, Hive or MapReduce.

111
Q

I want an analytics platform where I can install and run my own code and customize it in every way I need. What is my best option?

A

EMR, With EMR you a

112
Q

What is a slave node?

A

Slave nodes are core nodes, just another name.

113
Q

With core nodes, is it just one group of nodes?

A

No, you can have several groups of nodes; each group could use different node sizes, and on-demand or spot instances.

114
Q

Could I use EMR for machine learning?

A

Yes 100%. AWS has machine learning platforms that make it easy to perform ML, but you can also run ML on top of EMR.

115
Q

Which instance family should I choose when performing ML with EMR?

A

As ML is compute-intensive, you would use compute-optimized instances (C family).

116
Q

What are some use cases for EMR?

A
  • Log processing
  • Genomics
  • Clickstream
117
Q

How can I improve master node availability?

A

During deployment, you can use the advanced options to deploy more than a single master node.

118
Q

When deploying an EMR cluster, how can I automatically carry out functions?

A

You can opt to run a boot script (a bootstrap action).

119
Q

What language is used for developing map-reduce applications?

A

Java, but most any language can be used.

120
Q

How can I get my large data sets into EMR for processing?

A
You can use:
  • Snowball
  • Import/Export
  • AWS CLI (S3)
  • DataSync
  • Direct Connect
121
Q

Can I add an EC2 keypair to the master node, explain?

A

Yes 100%, you can ssh to this node.

122
Q

Is it the master node that carries out the map?

A

No, the core and task nodes carry out the map.

123
Q

What is a task node used for?

A

A task node is used for running the map and reduce functions, but the task node does not hold HDFS data; it works on data from HDFS or from EMRFS (S3).

124
Q

I will be receiving 20 TB of data that I want to load into EMR. Once loaded, I will only need to reload this data once a year, but when I do need to load the data from S3, I need it instantly. What is the best S3 storage type to use?

A

You will need to use Infrequent Access (IA). IA gives you a good cost because your data is stored untouched for a long period and read only once a year. You cannot use Glacier, as the data needs to be instantly available.

125
Q

How many task nodes do I need to run the HDFS?

A

HDFS does not run on the task nodes; it runs on the core nodes.

126
Q

If the master node fails, what would happen to the cluster?

A

The cluster will fail, as the master node is the node that takes care of the entire cluster.

127
Q

I am concerned about the master node failing, how can I ensure my cluster is highly available?

A

You can configure EMR to use up to 3 master nodes

128
Q

I want my EMR cluster to have 11 x 9s durable data. How could I do this?

A

Use EMRFS (S3) to store the data.

129
Q

Can I read and write to EMRFS from the EMR cluster?

A

Yes 100%

130
Q

Can I encrypt EMRFS?

A

Yes 100%.

131
Q

Is EMR deployed as a service or as a cluster?

A

Cluster, you can deploy EMR as nodes in your VPC.

132
Q

I am concerned about high availability. How do I deploy EMR across availability zones?

A

You can't; EMR is deployed into a single availability zone.

133
Q

What are the application sets you can deploy when creating EMR?

A
  • Hadoop
  • Spark
  • HBase
  • Presto
134
Q

What is Hive?

A

It is a data warehouse on top of EMR.

135
Q

What is HBase?

A

Implementation of Google’s Bigtable

136
Q

What is PIG?

A

It is a high-level language for creating applications on Hadoop.

137
Q

What is the smallest EMR you can have?

A

A single node. A single-node cluster is only for demos and will host the master node, HDFS and the map and reduce processing.

138
Q

I am concerned about cost, should I use the master node on a spot instance?

A

No, the master node should not be on a spot instance: if the spot instance is taken back by AWS, the cluster will die.

139
Q

I have a base workload where I process clickstream batch data. Every six months I get 50 times more data than usual, and it still has to be processed in one day. How should I architect EMR?

A

You should run EMR with the master node as reserved or on-demand and the core nodes as on-demand or reserved. The task nodes can be bumped up to deal with the 50x surge by using spot instances as needed.
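
The layout in this answer can be written down as instance groups. The sketch below only builds plain Python dictionaries; the field names follow the shape of the `InstanceGroups` list accepted by boto3's EMR `run_job_flow` call, while the helper name, the node counts and the m4.large type are assumptions chosen for illustration. No AWS call is made.

```python
def build_instance_groups(core_count, task_count, instance_type="m4.large"):
    """Return instance groups: on-demand master and core, spot task nodes."""
    return [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
        {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
         "InstanceType": instance_type, "InstanceCount": task_count},
    ]

# Hypothetical sizes: a small baseline, and 50 times more task nodes
# for the twice-yearly surge.
baseline = build_instance_groups(core_count=3, task_count=2)
surge = build_instance_groups(core_count=3, task_count=100)
print([(g["InstanceRole"], g["Market"], g["InstanceCount"]) for g in surge])
```

Only the task group changes between the two layouts, so HDFS on the core nodes is not affected when the spot capacity comes and goes.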

140
Q

How can you get the cheapest instances?

A

You can use spot instances.

141
Q

Can I use spot instances with autoscaling?

A

Yes 100%

142
Q

I am doing development on an EMR cluster. What should all my instance types be?

A

Spot, yes spot for master, core and task, lowest price.

143
Q

I have a cost-sensitive workload using an EMR cluster. What should all my instance types be?

A

Spot, yes spot for master, core and task, lowest price.

144
Q

I have a long-running workload. What instance types should I be using with my EMR cluster?

A

Master = on-demand
Core = on-demand or instance fleet
Task = spot or instance fleet