There are many different data sources with different formats that need to be input into HDFS. Just as there are many vendors that your organization may have, there are many different mechanisms to get your data loaded into HDFS. There are many tools for assisting in the ease of getting the data into the correct structure to suit your specific goals.
Call out Vendor Connectors and HDFS APIs which are available for Java and C. As you see in this slide the list of vendors is growing quickly just as corporate needs change.
Multiple Ingestion Workflows
LAMBDA Architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-based architecture methods. The approach balances use batch processing to provide comprehensive and accurate views of batch data while real-time stream processing to provide views of online data. The inception of the lambda architecture has paralleled the growth of big data, real-time analytics, and the drive to mitigate the latencies of MapReduce. (Schuster, Werner. “Nathan Marz on Storm, Immutability in the Lambda Architecture, Clojure”. www.infoq.com. Interview with Nathan Marz, 6 April 2014)
Streaming in Hadoop helps capture new business opportunities with low-latency dashboards, security alerts, and operational enhancements integrated with other applications running in a Hadoop cluster. To realize these benefits, an enterprise should integrate real-time processing with normal Hadoop batch processing. Data derived from batch processing can be used to inform real-time processing dashboards and applications.
For example, data derived from batch processing is commonly used to create the event models used by the real-time system. These event models define the schemas of incoming event data, such as records of calls into the customer contact center, copies of customer order transactions, or external market data that might affect any action taken.
Real-Time vs. Batch Ingestion
Batch and real-time data processing are very different. The differences include the characteristics of the data, the requirements for processing the data, and the clients who use or consume the data.
One of the primary differences between batch and real-time processing is that real-time systems are always running and therefore typically require automated applications or dashboards to consume the data. These applications or dashboards are used to affect current operations while batch processing is commonly used for historical data analysis.
Batch and Bulk Ingestion Options
The Ambari Files View is an Ambari Web UI plug-in providing a graphical user interface to HDFS files and directories.
The Files View can create a directory, rename a directory, browse to and list a directory, download a zipped directory, and delete a directory. It can also upload a file, list files, open files, download files, and delete a file.
The Hadoop Client
Hadoop Client machines are neither Master nor Slave. Their role is to load data into the cluster, submit jobs describing how that data should be processed, and then retrieve or view the results of the job when its finished.
Use the put command to upload data to HDFS
Are perfect for inputting local files into HDFS
Are useful in batch scripts
hdfs dfs –put mylocalfile /some/hdfs/path
Typically, local files are loaded into the HDFS environment by writing or creating scripts In the language of choice.
In the above usage example:
The file system is hdfs
The type of file system is dfs (distributed file system)
The command being used to load data is –put which loads data
The name of the file is mylocalfile
An absolute file path name is then given
– The the leading “/” denotes an absolute versus current directory where file can be found /some hdfs/path
WebHDFS provides an external client that does not necessarily run on the Hadoop cluster itself a standard method of executing Hadoop filesystem operations.
Using WebHDFS provides a method to identify the host that must be connected to in the cluster. It is based on based on HTTP operations such as:
Operations such as OPEN, GETFILESTATUS, LISTSTATUS use HTTP GET; other operations such as CREATE, MKDIRS, RENAME, SETPERMISSIONS use HTTP PUT. APPEND operations are based on HTTP POST.
The above example shows http://host:port and identifies the server that contains the information for each one of the files.
OPEN – request is redirected to a datanode where the file can be read
MKDIRS – request is redirected to a datanode where the directory can be found
APPEND – request is redirected to a datanode where the file mydata.txt can is to be be appended.
The NFS Gateway for HDFS allows clients to mount HDFS and interact with it through NFS, as if it were part of their local file system. The gateway supports NFSv3.
After mounting HDFS, a user can:
Browse the HDFS file system through their local file system on NFSv3 client-compatible operating systems.
Upload and download files between the HDFS file system and their local file system.
Stream data directly to HDFS through the mount point. (File append is supported, but random write is not supported.)
The NFS Gateway machine must be running all components that are necessary for running an HDFS client, such as a Hadoop core JAR file and a HADOOP_CONF directory.
The NFS Gateway can be installed on any DataNode, NameNode, or HDP client machine. Start the NFS server on that machine.
As the diagram above shows files are written by an application user to the NFS Client. The NFS v3 protocol is used as the mechanism to contact the NFS gateway. On the the right side of the diagram the DFS (Distributive File System) Client is used as traffic cop to send/receive data to the:
Domain name via the data transfer protocol
Node Name via the Client protocol
Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB
Apache Sqoop does the following to integrate bulk data movement between Hadoop and structured datastores:
Import sequential datasets from mainframe Satisfies the growing need to move data from mainframe to HDFS
Import direct to ORCFiles Improved compression and light-weight indexing for improved query performance
Data imports Moves certain data from external stores and EDWs into Hadoop to optimize cost-effectiveness of combined data storage and processing
Parallel data transfer For faster performance and optimal system utilization
Fast data copies From external systems into Hadoop
Efficient data analysis Improves efficiency of data analysis by combining structured data with unstructured data in a schema-on-read data lake
Load balancing Mitigates excessive storage and processing loads to other systems YARN coordinates data ingest from Apache Sqoop and other services that deliver data into the Enterprise Hadoop cluster.
Streaming Framework Frameworks
Streaming alternatives for ingestion of data include:
Hortonworks Data Flow (HDF)
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows; and is robust and fault tolerant with tunable reliability mechanisms for failover and recovery.YARN coordinates data ingest from Apache Flume and other services that deliver raw data into an Enterprise Hadoop cluster.
Flume lets Hadoop users ingest high-volume streaming data into HDFS for storage. Specifically, Flume allows users to:
Stream data Ingest streaming data from multiple sources into Hadoop for storage and analysis
Insulate systems Buffer storage platform from transient spikes, when the rate of incoming data exceeds the rate at which data can be written to the destination
Guarantee data delivery Flume NG uses channel-based transactions to guarantee reliable message delivery. When a message moves from one agent to another, two transactions are started, one on the agent that delivers the event and the other on the agent that receives the event. This ensures guaranteed delivery semantics
Scale horizontally To ingest new data streams and additional volume as needed Enterprises use Flume’s powerful streaming capabilities to land data from high-throughput streams in the Hadoop Distributed File System (HDFS).
Typical sources of these streams are application logs, sensor and machine data, geo-location data and social media. These different types of data can be landed in Hadoop for future analysis using interactive queries in Apache Hive. Or they can feed business dashboards served ongoing data by Apache HBase.
Apache™ Storm adds reliable real-time data processing capabilities to Enterprise Hadoop. Storm on YARN is powerful for scenarios requiring real-time analytics, machine learning and continuous monitoring of operations.
Storm integrates with YARN via Apache Slider, YARN manages Storm while also considering cluster resources for data governance, security and operations components of a modern data architecture.
Storm is a distributed real-time computation system for processing large volumes of high-velocity data. Storm is extremely fast, with the ability to process over a million records per second per node on a cluster of modest size. Enterprises harness this speed and combine it with other data access applications in Hadoop to prevent undesirable events or to optimize positive outcomes.
Some of specific new business opportunities include: real-time customer service management, data monetization, operational dashboards, or cyber security analytics and threat detection.
There are four abstractions in Storm: spouts, bolts, streams, and topologies. Storm data processing occurs in a topology. A topology consists of spout and bolt components. Spouts and bolts run on the systems in a Storm cluster. Multiple topologies can co-exist to process different data sets in different ways.
The core abstraction in Storm is the “stream”. A stream is an unbounded sequence of tuples. Storm provides the primitives for transforming a stream into a new stream in a distributed and reliable way. For example, you may transform a stream of tweets into a stream of trending topics. The basic primitives Storm provides for doing stream transformations are “spouts” and “bolts”. Spouts and bolts have interfaces that you implement to run your application-specific logic.
A spout is a source of streams. For example, a spout may read tuples off of a Kestrel queue and emit them as a stream. Or a spout may connect to the Twitter API and emit a stream of tweets.
A bolt consumes any number of input streams, does some processing, and possibly emits new streams. Complex stream transformations, like computing a stream of trending topics from a stream of tweets, require multiple steps and thus multiple bolts. Bolts can do anything from run functions, filter tuples, do streaming aggregations, do streaming joins, talk to databases, and more.
Networks of spouts and bolts are packaged into a “topology” which is the top-level abstraction that you submit to Storm clusters for execution. A topology is a graph of stream transformations where each node is a spout or bolt. Edges in the graph indicate which bolts are subscribing to which streams. When a spout or bolt emits a tuple to a stream, it sends the tuple to every bolt that subscribed to that
The source of Storm data is often a message queue. For example, an operating system, service, or application will send log entries, event information, or other messages to a queue. The queue is read by Storm.
Storm integrates with many queuing systems. Example queue integrations include Kestrel, RabbitMQ, Advanced Message Queuing Protocol (AMQP), Kafka, Java Message Service (JMS), and Amazon Kinesis. Storm’s abstractions make it easy to integrate with a new queuing system.
As a general rule, new frameworks don’t introduce very many new concepts. Spark Streaming is an exception, as the new concepts of a receiver and Dstream are introduced. A streaming application is composed of a receiver, Core Spark, and a Dstream.
In Spark Streaming:
Receivers listen to data streams, and create batches of data called Dstreams
Dstreams are then processed by the Spark Core engine
Once data is ingested, you can use all the Spark frameworks; you are not limited to using only Spark streaming functionalities.
Dstreams are created as batches of data from a streaming source by the receiver at regular time intervals.
1) When a streaming source begins communicating with the Spark streaming application, the receiver begins filling up a bucket.
2 ) At a predetermined time interval, the bucket is shipped off to be processed.
Each of these buckets is a single Dstream. Once the Dstream is created it is conceptually very similar to an RDD.