Big Data Analytics

My BDA Lab Experiments & Assignments

Module -1: Introduction to Big Data


Data is the quantities, characters, or symbols on which operations are performed by a computer. 

It may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.

Big Data:   

Big data refers to the large, diverse sets of information that grow at ever-increasing rates.

Characteristics of Big Data: 


1. Volume

2. Variety

3. Velocity

4. Variability

(i) Volume – 

The name Big Data itself is related to a size which is enormous. The size of data plays a very crucial role in determining value out of data. Also, whether particular data can actually be considered Big Data or not depends upon its volume. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data.

(ii) Variety – 

The next aspect of Big Data is its variety. Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. During earlier days, spreadsheets and databases were the only sources of data considered by most of the applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. are also being considered in the analysis applications. This variety of unstructured data poses certain issues for storage, mining and analyzing data.

(iii) Velocity – 

The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet demands determines the real potential in the data. Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.

(iv) Variability – 

This refers to the inconsistency which can be shown by the data at times, thus hampering the process of being able to handle and manage the data effectively. 

Types of Big Data: 

1. Structured 

2. Unstructured 

3. Semi-Structured

1. Structured: 

Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data.

Examples Of Structured Data

An 'Employee' table in a database is an example of Structured Data

2. Unstructured

Any data with an unknown form or structure is classified as unstructured data.

Examples Of Unstructured Data

The output returned by 'Google Search'

3. Semi-Structured

Semi-structured data can contain both forms of data. Semi-structured data appears structured in form, but it is not actually defined by, for example, a table definition in a relational DBMS. 

An example of semi-structured data is data represented in an XML file.

Examples of Big Data:

1. The New York Stock Exchange generates about one terabyte of new trade data per day.

2. Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated from photo and video uploads, message exchanges, comments, etc.

3. A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.

Challenges in Big Data:

1. Dealing with data growth

2. Generating insights in a timely manner

3. Recruiting and retaining big data talent

4. Integrating disparate data sources

5. Validating data

6. Securing big data

7. Organizational resistance

Applications of Big Data

1. Banking and Securities
2. Communications, Media and Entertainment
3. Healthcare Providers
4. Education
5. Manufacturing and Natural Resources
6. Government
7. Insurance
8. Retail and Wholesale trade
9. Transportation
10. Energy and Utilities

Traditional vs. Big Data business approach


Module -2:     Introduction to Big Data Frameworks: Hadoop, NoSQL


Apache Hadoop is an open-source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data. 

Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyze massive datasets in parallel more quickly.

Hadoop consists of four main modules:

1. Hadoop Distributed File System (HDFS) – 

A distributed file system that runs on standard or low-end hardware. HDFS provides better data throughput than traditional file systems, in addition to high fault tolerance and native support of large datasets.

2. Yet Another Resource Negotiator (YARN) –

Manages and monitors cluster nodes and resource usage. It schedules jobs and tasks.

3. MapReduce – 

A framework that helps programs do the parallel computation on data. The map task takes input data and converts it into a dataset that can be computed in key-value pairs. The output of the map task is consumed by reduce tasks to aggregate output and provide the desired result.

4. Hadoop Common – 

Provides common Java libraries that can be used across all modules.

Hadoop Ecosystem: 

The Hadoop ecosystem includes many tools and applications to help collect, store, process, analyze, and manage big data. Some of the most popular applications are:

Apache Spark – 

An open-source, distributed processing system commonly used for big data workloads. Apache Spark uses in-memory caching and optimized execution for fast performance, and it supports general batch processing, streaming analytics, machine learning, graph processing, and ad hoc queries.

Pig -

Pig is commonly used for complex use cases that require multiple data operations. Pig is a procedural language for developing parallel processing applications for large data sets in the Hadoop environment. Pig is an alternative to Java programming for MapReduce and automatically generates MapReduce functions. Pig includes Pig Latin, which is a scripting language. Pig translates Pig Latin scripts into MapReduce, which can then run on YARN and process data in the HDFS cluster. Pig is popular because it automates some of the complexity in MapReduce development.

Hive -

Allows users to leverage Hadoop MapReduce using a SQL interface, enabling analytics at a massive scale, in addition to distributed and fault-tolerant data warehousing.


HBase – 

An open-source, non-relational, versioned database that runs on top of Amazon S3 (using EMRFS) or the Hadoop Distributed File System (HDFS). HBase is a massively scalable, distributed big data store built for random, strictly consistent, real-time access for tables with billions of rows and millions of columns.


Sqoop – 

Apache Sqoop is a tool in the Hadoop ecosystem designed to transfer data between HDFS (Hadoop storage) and relational databases such as MySQL. Apache Sqoop imports data from relational databases into HDFS and exports data from HDFS to relational databases. It efficiently transfers bulk data between Hadoop and external data stores such as enterprise data warehouses, relational databases, etc.


NoSQL:

NoSQL databases ("not only SQL") are non-tabular and store data differently than relational tables. NoSQL databases come in a variety of types based on their data model. The main types are document, key-value, wide-column, and graph. They provide flexible schemas and scale easily with large amounts of data and high user loads.

Why NoSQL?

NoSQL databases are a great fit for many modern applications such as mobile, web, and gaming that require flexible, scalable, high-performance, and highly functional databases to provide great user experiences.

NoSQL Data Architecture Pattern:

A data architecture pattern is a consistent way of representing data in a regular structure that will be stored in memory. The four types of data architecture patterns are shown below:

1. Key-Value Stores:

The data is stored in the form of Key-Value Pairs. The key is usually a sequence of strings, integers or characters but can also be a more advanced data type. The value is typically linked or co-related to the key. The key-value pair storage databases generally store data as a hash table where each key is unique. The value can be of any type (JSON, BLOB(Binary Large Object), strings, etc). This type of pattern is usually used in shopping websites or e-commerce applications.


Examples:

  • DynamoDB
  • Berkeley DB
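As an illustration of the key-value model described above (not the API of any particular product), a few lines of Python can stand in for a store whose keys are unique and whose values may be of any type:

```python
# Minimal in-memory key-value store sketch (illustration only, not a real
# database): each key is unique, as in the hash-table model above, and the
# value can be of any type (a JSON-like dict here).
class KeyValueStore:
    def __init__(self):
        self._table = {}          # hash table keyed by unique keys

    def put(self, key, value):
        self._table[key] = value  # a duplicate key overwrites the old value

    def get(self, key, default=None):
        return self._table.get(key, default)

# Hypothetical e-commerce use: a shopping-cart entry stored under one key
store = KeyValueStore()
store.put("cart:user42", {"items": ["pen", "notebook"], "total": 55})
print(store.get("cart:user42")["total"])  # -> 55
```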

2. Column Stores:

The data is stored in individual cells that are further grouped into columns. Column-oriented databases work on columns and store large amounts of data in columns together. The format and titles of the columns can differ from one row to another, and every column is treated separately. Still, each individual column may contain multiple other columns, as in traditional databases. Basically, columns are the unit of storage in this pattern.


Example: Bigtable by Google

3. Document Stores:

The document database fetches and accumulates data in the form of key-value pairs, but here the values are called documents. A document can be stated as a complex data structure and can take the form of text, arrays, strings, JSON, XML, or any such format. The use of nested documents is also very common. This pattern is very effective, as most of the data created today is in the form of JSON and is unstructured. Example: MongoDB.



4. Graph Stores:

This architecture pattern deals with the storage and management of data in graphs. Graphs are basically structures that depict connections between two or more objects in some data. The objects or entities are called nodes and are joined together by relationships called edges. Each edge has a unique identifier. Each node serves as a point of contact for the graph. This pattern is very commonly used in social networks, where there are a large number of entities and each entity has one or more characteristics which are connected by edges. Whereas the relational pattern has tables which are only loosely connected, the relationships in a graph are explicit and strongly connected.


Example: FlockDB (used by Twitter)

MongoDB:

MongoDB is a general-purpose, document-based, distributed database built for modern application developers and for the cloud era.

Any relational database has a typical schema design that shows a number of tables and the relationships between these tables. In MongoDB, by contrast, there is no fixed schema, and relationships are not enforced by the database.

Advantages of MongoDB over RDBMS

  • Schemaless − MongoDB is a document database in which one collection holds different documents. A number of fields, content and size of the document can differ from one document to another.

  • Structure of a single object is clear.

  • No complex joins.

  • Deep query-ability. MongoDB supports dynamic queries on documents using a document-based query language that's nearly as powerful as SQL.

  • Tuning.

  • Ease of scale-out − MongoDB is easy to scale.

  • Conversion/mapping of application objects to database objects not needed.

  • Uses internal memory for storing the (windowed) working set, enabling faster access to data.
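The schemaless point above can be illustrated with plain Python dicts standing in for documents in one collection; the names and fields below are made up, and a real deployment would insert these through a driver such as pymongo:

```python
# Illustration of MongoDB's schemaless model: one collection holds
# documents whose fields, content, and size differ from one another.
# A plain Python list stands in for the collection here.
collection = []

# Two documents with different fields in the same collection
collection.append({"name": "Asha", "email": "asha@example.com"})
collection.append({"name": "Ravi", "phones": ["98765", "43210"], "age": 30})

# A dynamic query: find documents that have an "age" field greater than 25
result = [doc for doc in collection if doc.get("age", 0) > 25]
print([doc["name"] for doc in result])  # -> ['Ravi']
```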

Why Use MongoDB?

  • Document Oriented Storage − Data is stored in the form of JSON style documents.

  • Index on any attribute

  • Replication and high availability

  • Auto-Sharding

  • Rich queries

  • Fast in-place updates

  • Professional support by MongoDB

Where to Use MongoDB?

  • Big Data

  • Content Management and Delivery

  • Mobile and Social Infrastructure

  • User Data Management

  • Data Hub

Module-3: MapReduce Paradigm 

MapReduce:


MapReduce is a software framework and programming model used for processing huge amounts of data. MapReduce programs work in two phases, namely Map and Reduce. Map tasks deal with splitting and mapping the data, while Reduce tasks shuffle and reduce the data.

The input to each phase is key-value pairs. In addition, every programmer needs to specify two functions: map function and reduce function.
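The two user-supplied functions can be sketched in plain Python for the classic word-count example. This is an illustration of the key-value flow only, not Hadoop's actual API (a real job would implement Mapper and Reducer classes in Java); the shuffle is simulated with a dictionary:

```python
from collections import defaultdict

# Word count as a map function and a reduce function, the two functions
# every MapReduce programmer must specify.
def map_fn(_, line):
    for word in line.split():
        yield (word, 1)            # emit one key-value pair per word

def reduce_fn(word, counts):
    yield (word, sum(counts))      # aggregate all values for one key

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):   # Map phase
            groups[k].append(v)           # simulated shuffle: group by key
    out = {}
    for k, vs in groups.items():          # Reduce phase
        for rk, rv in reduce_fn(k, vs):
            out[rk] = rv
    return out

lines = [(0, "big data big ideas"), (1, "big data")]
print(run_mapreduce(lines, map_fn, reduce_fn))  # {'big': 3, 'data': 2, 'ideas': 1}
```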


The Map Task : 

This is the very first phase in the execution of a MapReduce program. In this phase, the data in each split is passed to a mapping function to produce output values. The map output is then shuffled, i.e., grouped by key, before it reaches the reducers.

Reduce Task :

In this phase, output values from the Shuffling phase are aggregated. This phase combines values from the Shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset.

MapReduce Execution :

  • One map task is created for each split which then executes map function for each record in the split.
  • It is always beneficial to have multiple splits, because the time taken to process a split is small compared to the time taken to process the whole input. When the splits are smaller, the processing is better load-balanced, since we are processing the splits in parallel.
  • However, it is also not desirable to have splits that are too small. When splits are too small, the overhead of managing the splits and of map task creation begins to dominate the total job execution time.
  • For most jobs, it is better to make the split size equal to the size of an HDFS block (128 MB by default in Hadoop 2 and later; 64 MB in Hadoop 1).
  • Execution of map tasks results in writing output to a local disk on the respective node and not to HDFS.
  • Reason for choosing local disk over HDFS is, to avoid replication which takes place in case of HDFS store operation.
  • Map output is intermediate output which is processed by reduce tasks to produce the final output.
  • Once the job is complete, the map output can be thrown away. So, storing it in HDFS with replication becomes overkill.
  • In the event of a node failure, before the map output is consumed by the reduce task, Hadoop reruns the map task on another node and re-creates the map output.
  • Reduce task doesn't work on the concept of data locality. An output of every map task is fed to the reduce task. Map output is transferred to the machine where reduce task is running.
  • On this machine, the output is merged and then passed to the user-defined reduce function.
  • Unlike the map output, reduce output is stored in HDFS (the first replica is stored on the local node and the other replicas are stored on off-rack nodes). 

Coping with Node Failures :

The worst thing that can happen is that the compute node at which the Master is executing fails. In this case, the entire MapReduce job must be restarted. Only a failure of this one node can bring the entire process down; other failures will be managed by the Master, and the MapReduce job will eventually complete.

Suppose the compute node at which a Map worker resides fails. This failure will be detected by the Master, because it periodically pings the Worker processes. All the Map tasks that were assigned to this Worker will have to be redone, even if they had completed. The reason for redoing completed Map tasks is that their output, destined for the Reduce tasks, resides on that compute node and is now unavailable to the Reduce tasks. The Master sets the status of each of these Map tasks to idle and schedules them on a Worker when one becomes available. The Master must also inform each Reduce task that the location of its input from that Map task has changed.

Dealing with a failure at the node of a Reduce worker is simpler. The Master simply sets the status of its currently executing Reduce tasks to idle. These will be rescheduled on another Reduce worker later.

Algorithms using MapReduce:

Matrix-Vector Multiplication by MapReduce:

Suppose we have an n×n matrix M, whose element in row i and column j is denoted m_ij. Suppose we also have a vector v of length n, whose jth element is v_j. Then the matrix-vector product is the vector x of length n, whose ith element x_i is given by

x_i = Σ (j = 1 to n) m_ij × v_j

If n = 100, we do not want to use a DFS or MapReduce for this calculation. But this sort of calculation is at the heart of the ranking of Web pages that goes on at search engines, and there, n is in the tens of billions.

Let us first assume that n is large, but not so large that the vector v cannot fit in main memory and thus be available to every Map task. The matrix M and the vector v will each be stored in a file of the DFS. We assume that the row-column coordinates of each matrix element are discoverable, either from the element's position in the file or because it is stored with explicit coordinates, as a triple (i, j, m_ij). We also assume that the position of element v_j in the vector v is discoverable in the analogous way.

The Map Function: The Map function is written to apply to one element of M. However, if v is not already read into main memory at the compute node executing a Map task, then v is first read in its entirety and is subsequently available to all applications of the Map function performed at this Map task. Each Map task operates on a chunk of the matrix M. From each matrix element m_ij it produces the key-value pair (i, m_ij × v_j). Thus, all terms of the sum that make up the component x_i of the matrix-vector product will get the same key, i.

The Reduce Function: The Reduce function simply sums all the values associated with a given key i. The result will be a pair (i, x_i).
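A minimal single-machine Python sketch of this algorithm, with M as a list of rows and v assumed to fit in memory (as the text assumes); the dictionary of partial products simulates the grouping by key i:

```python
from collections import defaultdict

# Matrix-vector multiplication in the MapReduce style: the map step turns
# each element m_ij into the pair (i, m_ij * v_j); the reduce step sums
# the values for each key i to obtain x_i.
def matrix_vector_mapreduce(M, v):
    partial = defaultdict(list)
    for i, row in enumerate(M):
        for j, m_ij in enumerate(row):
            partial[i].append(m_ij * v[j])       # Map: emit (i, m_ij * v_j)
    return [sum(partial[i]) for i in sorted(partial)]  # Reduce: x_i = sum

M = [[1, 2],
     [3, 4]]
v = [10, 1]
print(matrix_vector_mapreduce(M, v))  # -> [12, 34]
```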

We can divide the matrix into vertical stripes of equal width and divide the vector into an equal number of horizontal stripes of the same height. Our goal is to use enough stripes so that the portion of the vector in one stripe can fit conveniently into main memory at a compute node. The figure suggests what the partition looks like if the matrix and vector are each divided into five stripes.


Map: for each input element m_ij, emit (i, p_s), where p_s = Σ m_ij × v_j is the partial sum over the stripe.

Reduce: compute x_i = Σ p_s.

Relational Algebra Operations:

  • Selection.
  • Projection.
  • Union & Intersection.
  • Natural Join.
  • Grouping & Aggregation.

1. Selection:

Apply a condition C to each tuple in the relation and produce as output only those tuples that satisfy C.

The result of this selection is denoted σC(R).

Selections really do not need the full power of MapReduce.

They can be done most conveniently in the map portion alone, although they could also be done in the reduce portion.

The pseudo-code is as follows:

Map(key, value):
    for tuple in value:
        if tuple satisfies C:
            emit(tuple, tuple)

Reduce(key, values):
    emit(key, key)
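The selection pseudo-code above can be sketched as runnable Python; here a relation is simply a list of tuples, and the condition C is passed in as a function:

```python
# Selection by MapReduce: the condition C is applied in the map phase,
# which emits (t, t) for each qualifying tuple; the reduce phase is the
# identity and just passes each surviving tuple through.
def selection_mapreduce(tuples, condition):
    mapped = [(t, t) for t in tuples if condition(t)]  # Map: emit (t, t)
    return [key for key, _ in mapped]                  # Reduce: emit (t, t)

# Example relation (made-up sample data): (name, age) tuples
employees = [("a", 30), ("b", 45), ("c", 28)]
print(selection_mapreduce(employees, lambda t: t[1] > 29))  # [('a', 30), ('b', 45)]
```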

2. Projection:

For some subset S of the attributes of the relation, produce from each tuple only the components for the attributes in S.

The result of this projection is denoted πS(R).

Projection is performed similarly to selection.

As projection may cause the same tuple to appear several times, the reduce function eliminates duplicates.

The pseudo-code for projection is as follows:

Map(key, value):
    for tuple in value:
        ts = tuple with only the components for the attributes in S
        emit(ts, ts)

Reduce(key, values):
    emit(key, key)
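The projection pseudo-code can likewise be sketched in Python; S is given as a tuple of attribute positions, and the grouping dictionary plays the role of the reduce phase that eliminates duplicates:

```python
# Projection by MapReduce: the map phase keeps only the attributes in S,
# which may create duplicate tuples; grouping by key and emitting each
# key once plays the role of the duplicate-eliminating reduce phase.
def projection_mapreduce(tuples, S):
    groups = {}
    for t in tuples:
        ts = tuple(t[i] for i in S)        # Map: components for attributes in S
        groups.setdefault(ts, []).append(ts)
    return sorted(groups)                  # Reduce: one output tuple per key

R = [("a", 1), ("a", 2), ("b", 1)]
print(projection_mapreduce(R, S=(0,)))  # -> [('a',), ('b',)]
```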

Computing Selections by MapReduce:

-In the map task, each tuple t of R for which the condition C holds is output as the key-value pair (t, t); tuples that do not satisfy C produce nothing.

-The reduce task is the identity: it simply passes each (t, t) through to the output.

Computing Projections by MapReduce:

-In the map task, from each tuple t in R the attributes not present in S are eliminated and a new tuple t' is constructed. The output of the map tasks is the key-value pairs (t', t').

-The main job of the reduce task is to eliminate the duplicate t' tuples, as the output of the projection operation cannot have duplicates.


Union, Intersection, and Difference by MapReduce

Union with MapReduce:

-For the union operation R U S, the two relations R and S must have the same schema. Those tuples which are present in either R or S or both must be present in the output.

-The only responsibility of the map phase is of converting each tuple t into the key-value pair (t, t).

-The Reduce phase eliminates the duplicates just as in the case of projection operation. Here for a key t, there can be either 1 value if it is present in only one of the relations or t can have 2 values if it is present in both the relations. In either case, the output produced by the Reduce task will be (t, t).
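A minimal Python sketch of this union logic, with relations represented as lists of tuples and the grouping dictionary simulating the shuffle:

```python
# Union by MapReduce: both relations map each tuple t to (t, t); the
# reducer emits every key exactly once, whether it had one value (present
# in only one relation) or two (present in both), eliminating duplicates.
def union_mapreduce(R, S):
    groups = {}
    for t in list(R) + list(S):       # Map over both relations: emit (t, t)
        groups.setdefault(t, []).append(t)
    return sorted(groups)             # Reduce: output (t, t) once per key

R = [("a",), ("b",)]
S = [("b",), ("c",)]
print(union_mapreduce(R, S))  # -> [('a',), ('b',), ('c',)]
```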


Intersection by MapReduce:

-The map phase again turns each tuple t, from either R or S, into the key-value pair (t, t).

-In the reduce phase, a key t that has two values must appear in both R and S, so the reducer outputs (t, t); a key with only one value appears in just one relation and produces no output.

Difference by MapReduce:

-For R − S, the map phase must record which relation each tuple came from: a tuple t of R becomes the pair (t, R) and a tuple t of S becomes (t, S).

-The reduce phase outputs (t, t) only when the value list associated with t is [R], i.e., when t appeared in R but not in S.
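Intersection and difference follow the same map-then-group pattern as union, differing only in the reduce logic; a sketch in which the mapper tags each tuple with the name of its source relation:

```python
from collections import defaultdict

# Intersection: a key is output only when its tag set shows it appeared
# in both R and S. Difference: a key is output only when it appeared in
# R but not in S (its tag set is exactly {"R"}).
def intersection_mapreduce(R, S):
    groups = defaultdict(set)
    for t in R:
        groups[t].add("R")            # Map: emit (t, "R")
    for t in S:
        groups[t].add("S")            # Map: emit (t, "S")
    return sorted(t for t, tags in groups.items() if tags == {"R", "S"})

def difference_mapreduce(R, S):
    groups = defaultdict(set)
    for t in R:
        groups[t].add("R")
    for t in S:
        groups[t].add("S")
    return sorted(t for t, tags in groups.items() if tags == {"R"})

R = [("a",), ("b",)]
S = [("b",), ("c",)]
print(intersection_mapreduce(R, S))  # -> [('b',)]
print(difference_mapreduce(R, S))    # -> [('a',)]
```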

Computing Natural Join by MapReduce:

MapReduce JOIN operations are used to combine two large datasets. However, this process involves writing lots of code to perform the actual join. Joining two datasets begins by comparing the size of each dataset. If one dataset is smaller than the other, the smaller dataset is distributed to every data node in the cluster.

Once it is distributed, either Mapper or Reducer uses the smaller dataset to perform a lookup for matching records from the large dataset and then combine those records to form output records.

1. Map-side joins - When the join is performed by the mapper, it is called a map-side join. In this type, the join is performed before the data is actually consumed by the map function. It is mandatory that the input to each map be in the form of a partition and in sorted order. There must also be an equal number of partitions, and the data must be sorted by the join key.

2. Reduce-side join - When the join is performed by the reducer, it is called a reduce-side join. There is no necessity in this join to have a dataset in a structured form (or partitioned).

Here, map-side processing emits the join key and the corresponding tuples of both tables. As an effect of this processing, all the tuples with the same join key fall into the same reducer, which then joins the records with that key.
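A single-machine Python sketch of a reduce-side join on a common key; the grouping dictionary stands in for the shuffle phase, and the employee/department tuples are made-up sample data:

```python
from collections import defaultdict

# Reduce-side join: the map phase tags each tuple with its table of
# origin ("L" or "R") and emits the join key; the shuffle sends tuples
# with the same key to one reducer, which pairs them up.
def reduce_side_join(left, right, key_left, key_right):
    groups = defaultdict(lambda: {"L": [], "R": []})
    for t in left:
        groups[t[key_left]]["L"].append(t)    # Map: emit (join_key, ("L", t))
    for t in right:
        groups[t[key_right]]["R"].append(t)   # Map: emit (join_key, ("R", t))
    out = []
    for k, sides in groups.items():           # Reduce: pair matching tuples
        for l in sides["L"]:
            for r in sides["R"]:
                out.append(l + r)
    return sorted(out)

emp = [("e1", "dept1"), ("e2", "dept2")]
dept = [("dept1", "Sales")]
print(reduce_side_join(emp, dept, key_left=1, key_right=0))
# -> [('e1', 'dept1', 'dept1', 'Sales')]
```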

Grouping and Aggregation by MapReduce:

-From each tuple (a, b, c) of the relation, the map function produces the key-value pair (a, b), where a is the grouping attribute and b is the attribute to be aggregated.

-The reduce function applies the aggregation operator θ (for example SUM, COUNT, MAX, or AVG) to the list of b-values associated with each key a and emits the pair (a, θ(b)).
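A minimal Python sketch of grouping and aggregation: the map phase emits (grouping attribute, value) pairs, and the reduce phase applies an aggregation function (SUM by default) to each group; the sales tuples are made-up sample data:

```python
from collections import defaultdict

# Grouping and aggregation by MapReduce: map emits (a, b) for each tuple,
# where a is the grouping attribute and b the value to aggregate; reduce
# applies the aggregation operator to each group's list of b-values.
def group_aggregate(tuples, group_idx, value_idx, agg=sum):
    groups = defaultdict(list)
    for t in tuples:
        groups[t[group_idx]].append(t[value_idx])    # Map: emit (a, b)
    return {k: agg(vs) for k, vs in groups.items()}  # Reduce: (a, agg(b))

sales = [("east", 10), ("west", 5), ("east", 7)]
print(group_aggregate(sales, group_idx=0, value_idx=1))  # {'east': 17, 'west': 5}
```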

Matrix Multiplication:

Matrix multiplication is a binary operation that produces a matrix from two matrices. For matrix multiplication, the number of columns in the first matrix must be equal to the number of rows in the second matrix. The resulting matrix, known as the matrix product, has the number of rows of the first and the number of columns of the second matrix. The product of matrices A and B is then denoted simply as AB.

Matrix multiplication was first described by the French mathematician Jacques Philippe Marie Binet in 1812, to represent the composition of linear maps that are represented by matrices. Matrix multiplication is thus a basic tool of linear algebra, and as such has numerous applications in many areas of mathematics, as well as in applied mathematics, statistics, physics, economics, and engineering. Computing matrix products is a central operation in all computational applications of linear algebra.

Matrix Multiplication with one MapReduce Step:

MapReduce is a technique in which a huge program is subdivided into small tasks that run in parallel to make computation faster and save time; it is mostly used in distributed systems. It has 2 important parts:

Mapper: It takes raw data as input and organizes it into key-value pairs. For example, in a dictionary you search for the word “Data” and its associated meaning is “facts and statistics collected together for reference or analysis”. Here the key is “Data” and the value associated with it is “facts and statistics collected together for reference or analysis”.

Reducer: It is responsible for processing data in parallel and produces the final output.

Let us consider the matrix multiplication example to visualize MapReduce. Consider two 2×2 matrices, A and B.

Here matrix A is a 2×2 matrix, which means the number of rows (i) = 2 and the number of columns (j) = 2. Matrix B is also a 2×2 matrix, where the number of rows (j) = 2 and the number of columns (k) = 2. Each cell of matrix A is labelled A_ij and each cell of matrix B is labelled B_jk; e.g., element 3 in matrix A is called A21, i.e., 2nd row, 1st column. One-step matrix multiplication has 1 mapper and 1 reducer. The formulas are:

Mapper for matrix A: for each element A_ij, emit ((i, k), (A, j, A_ij)) for all k.

Mapper for matrix B: for each element B_jk, emit ((i, k), (B, j, B_jk)) for all i.

The formula for the Reducer is:

Reducer: for each key (i, k), sort the values into Alist and Blist, then compute

sum = Σ_j (A_ij × B_jk)

Output: ((i, k), sum)

Therefore the Final Matrix is:

[Figure: result of matrix multiplication with one MapReduce step]
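The mapper/reducer formulas above can be checked with a small single-machine sketch; the matrices are stored as dicts keyed by (row, column) pairs, a representation chosen for illustration, and the values below are made-up 2×2 examples:

```python
from collections import defaultdict

# One-step matrix multiplication by MapReduce. A has I rows, B has K
# columns; the shared dimension j is carried inside each emitted value.
def matmul_mapreduce(A, B, I, K):
    groups = defaultdict(list)
    for (i, j), a in A.items():
        for k in range(K):
            groups[(i, k)].append(("A", j, a))   # ((i,k), (A, j, A_ij)) for all k
    for (j, k), b in B.items():
        for i in range(I):
            groups[(i, k)].append(("B", j, b))   # ((i,k), (B, j, B_jk)) for all i
    result = {}
    for (i, k), values in groups.items():        # Reduce: sum A_ij * B_jk over j
        a_vals = {j: v for tag, j, v in values if tag == "A"}
        b_vals = {j: v for tag, j, v in values if tag == "B"}
        result[(i, k)] = sum(a_vals[j] * b_vals[j] for j in a_vals if j in b_vals)
    return result

A = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
B = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}
print(matmul_mapreduce(A, B, I=2, K=2))
# -> {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}
```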

Illustrating the use of MapReduce with real-life databases and applications.
