Batch processing is defined as a series of interconnected jobs executed one after another, in sequence or in parallel. After these jobs run, their outputs are consolidated to produce a final result. Input data is collected over a period of time and grouped into lots (chunks), and the output produced by each lot may serve as the input to the next batch. Batch processing is also called discrete data processing, and it typically handles large collections of data files (gigabytes, terabytes, or even petabytes). Because of the high latency between tasks, fast response time is not a critical factor for batch processing workloads.
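As a rough illustration of this chunked model, the following Python sketch (all names here are hypothetical, chosen only for this example) groups input records into fixed-size lots, runs a job per lot, and consolidates the per-lot outputs into a final result:

```python
from itertools import islice

def make_lots(records, lot_size):
    """Group an iterable of records into fixed-size lots (chunks)."""
    it = iter(records)
    while True:
        lot = list(islice(it, lot_size))
        if not lot:
            break
        yield lot

def process_lot(lot):
    """Toy per-lot job: here we just sum the values in the lot."""
    return sum(lot)

def run_batch(records, lot_size=3):
    """Run every lot, then consolidate the per-lot outputs."""
    partials = [process_lot(lot) for lot in make_lots(records, lot_size)]
    return sum(partials)  # consolidation step producing the final result

print(run_batch(range(10)))  # consolidates lots of 0..9 -> 45
```

In a real pipeline the consolidation step could itself be the input to the next batch, as described above.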
Each task in the batch is associated with a time "window" within which it must run. This window is usually assigned during periods of less intensive system activity or in off-hours schedules. In many cases, batch tasks are scheduled to run at predefined intervals, such as at a certain time of day, month, or year. Some examples of batch processing tasks:
a) Log analysis: logs from systems and applications (servers, OLTP software, etc.) are collected over a certain period (a day, week, or year), and the analysis, that is, the processing of the log data, is performed at a distinct time (the time window) to derive a number of key performance indicators (KPIs) for the system in question. (Source: http://en.wikipedia.org/wiki/Log_analysis, accessed 11-Mar-2016.)
b) Billing applications: billing applications calculate the use of a service provided over a period of time and generate billing information; credit card companies, for example, produce billing statements at the end of each month.
c) Backups: a series of tasks performed at times that are not critical to the operation of the system (the time window), creating backups of systems that are critical to the organization.
d) Data warehouses: data warehouses consolidate management information as a static snapshot of the collected timeframe and aggregate it into views such as weekly, monthly, and quarterly reports.
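To make the log-analysis example in (a) concrete, here is a hypothetical sketch (names and log format invented for illustration) that scans a collected batch of log lines in one pass and derives a simple KPI, the error rate:

```python
def error_rate(log_lines):
    """Derive a simple KPI from collected log lines:
    the fraction of entries marked as ERROR."""
    total = errors = 0
    for line in log_lines:
        total += 1
        if " ERROR " in line:
            errors += 1
    return errors / total if total else 0.0

# A (tiny) day's worth of collected logs for the time window
day_of_logs = [
    "2016-03-11 10:00:01 INFO request served",
    "2016-03-11 10:00:02 ERROR timeout talking to database",
    "2016-03-11 10:00:03 INFO request served",
    "2016-03-11 10:00:04 INFO request served",
]
print(error_rate(day_of_logs))  # 0.25
```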
The examples above may give the impression that batch processing systems are not critical by nature; this perception is wrong. Batch systems are critical to the business, even though their users do not expect instant or real-time answers. For example, a system that recommends product offerings to prospective customers may run every night, performing calculations and storing the results in a database. The processing of such a system can become critical because it must be completed within a certain period (minutes or hours); if it is not, its users will not have access to an updated set of offers and recommendations, with a negative impact on the business.
Let’s talk a little about the complexities involved in batch processing systems:
a) Large data sets: the masses of data are usually very large, and producing results within the defined time windows requires a lot of computing resources.
b) Scalability: unlike vertical scaling, which adds more computational resources to a central machine, we need to adopt a horizontally scalable architecture, in which growing data processing demands are met by adding new computing resources (for example, new processing nodes) without affecting the existing architecture.
c) Distributed processing: there are physical limits on how much of certain computing resources, such as RAM and CPU, can be added to a single machine. Eventually, as the volume of data grows, jobs can be aborted for lack of resources. Batch processing should therefore support horizontal scalability, with processing distributed across a cluster (group) of servers.
d) Fault tolerance: faults always occur, and when tasks are performed on multiple servers, even more failures may arise. There may be hardware or network failures, or many other adverse external events that are not under the control of the user or the system. For example, some nodes in a cluster may fail while batch processing activities are being performed.
The system should be fault-tolerant, so that it resumes processing only the tasks that failed or were not completed, instead of reprocessing everything again.
e) Business restrictions: companies formalize service level agreements (SLAs) among themselves to ensure the operation and availability of their systems. Under these SLAs, batch processing must be completed within a stipulated time, so that afterwards the computational resources can be used for other purposes. There are other restrictions as well, such as scheduling, reprocessing, clustering, persistence, and so on.
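The fault-tolerance requirement in (d), reprocessing only the chunks that failed, can be sketched with a simple checkpoint set. This is a minimal in-memory illustration with invented names; a real system would persist the checkpoint so it survives a crash:

```python
def run_with_checkpoint(chunks, process, completed, max_retries=3):
    """Process each chunk at most once; on a rerun, chunks already in
    `completed` are skipped instead of being reprocessed."""
    results = {}
    for chunk_id, chunk in chunks.items():
        if chunk_id in completed:
            continue  # already done in a previous (interrupted) run
        for attempt in range(max_retries):
            try:
                results[chunk_id] = process(chunk)
                completed.add(chunk_id)  # checkpoint: mark chunk as done
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise  # give up only after max_retries attempts

    return results

chunks = {"c1": [1, 2], "c2": [3, 4], "c3": [5, 6]}
completed = {"c1"}  # pretend c1 finished before the previous run crashed
out = run_with_checkpoint(chunks, sum, completed)
print(out)  # only c2 and c3 are processed: {'c2': 7, 'c3': 11}
```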
The descriptions above should be sufficient to show that designing and architecting a batch application is not a simple task. It requires the skills needed to develop a distributed system that is scalable and offers high performance.
Real-time data processing (stream processing)
Real-time data processing continuously receives data that is under constant change, such as information relating to air traffic control, travel booking systems, and so on. Stream processing must be fast enough to keep up with the rate at which events arrive from the various data sources. The required response time is near-instantaneous: events should be processed in milliseconds (or microseconds). A stream processor is classified as real-time only when it produces logically correct results within a given time slice (in milliseconds) and can guarantee the system's requirements.
Stream processing often uses the term latency: the interval between a stimulus and a response, or the difference between the time at which the data was received and the time a response is generated. The lower the latency, the better the flow, or "throughput", of the system. Real-time data processing is also often referred to as near real time, because of latency introduced by the system or because of relaxed SLAs for producing the desired results. Here are some examples of real-time systems that receive data in near real time, process it, and send back results:
Bank ATMs: receive input from users and instantly apply the transactions (withdrawals, transfers, or any other operation) to a centralized account.
Real-time monitoring: captures and analyzes data emitted by various sources such as sensors, news feeds, clicks on Web pages, etc.
Real-time business intelligence: a process that delivers business intelligence and decision-making information while operations are taking place.
Operational intelligence: uses real-time data processing and complex event processing (CEP) to extract value from operations by analyzing queries submitted against events introduced into the system. (http://en.wikipedia.org/wiki/Complex_event_processing, http://en.wikipedia.org/wiki/Operational_intelligence)
Point-of-sale (POS) systems: update stock and provide an inventory history of items, allowing an organization to register payments in real time.
Assembly lines: process data in real time to reduce time and cost and to mitigate errors. Errors are instantly captured and appropriate actions are taken without impacting the business, increasing product quality and business productivity.
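The latency and throughput notions discussed earlier can be illustrated with a toy measurement loop (hypothetical handler and names; real stream processors report these metrics themselves):

```python
import time

def handle(event):
    """Toy event handler; a real system would do actual work here."""
    return event * 2

events = list(range(1000))
latencies = []
start = time.perf_counter()
for event in events:
    t0 = time.perf_counter()
    handle(event)
    latencies.append(time.perf_counter() - t0)  # stimulus-to-response interval
elapsed = time.perf_counter() - start

avg_latency_ms = 1000 * sum(latencies) / len(latencies)
throughput = len(events) / elapsed  # events processed per second
print(f"avg latency: {avg_latency_ms:.4f} ms, throughput: {throughput:.0f} ev/s")
```

Lowering the per-event latency directly raises the throughput the loop can sustain, which is the trade-off described above.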
Let’s address some of the complexities involved in streaming (real-time) processing systems:
Responsiveness of the system: real-time processing systems are expected to process data as it is introduced into the system (on the order of microseconds or milliseconds) without any delay in producing results.
Fault tolerance: failures can occur, but real-time processing systems cannot afford to lose a single event.
Scalability: a scale-out architecture must be adopted, so that the growing demands of stream processing are met by adding new nodes or computing resources without having to rebuild the entire environment.
In-memory processing: because disk reads and writes introduce high latency, real-time systems cannot tolerate them on the processing path, and data processing must be performed entirely in main memory. For this to work, the systems need to ensure a sufficient amount of memory to hold the ingested input data in the distributed memory of the servers in a cluster.
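A minimal sketch of this in-memory style of stream processing, assuming a fixed-size sliding window held entirely in RAM (class and variable names are invented for this example; no disk I/O occurs on the hot path):

```python
from collections import deque

class SlidingWindowAverage:
    """Keep the last `size` events entirely in main memory and
    maintain their running average without touching disk."""
    def __init__(self, size):
        self.window = deque(maxlen=size)
        self.total = 0.0

    def ingest(self, value):
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]  # evict the oldest event
        self.window.append(value)
        self.total += value
        return self.total / len(self.window)  # answer from memory only

avg = SlidingWindowAverage(size=3)
for reading in [10, 20, 30, 40]:
    latest = avg.ingest(reading)
print(latest)  # average over the last 3 readings (20, 30, 40) -> 30.0
```

Keeping the running total alongside the window makes each ingest O(1), which is what allows the per-event response times in the millisecond (or microsecond) range mentioned above.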
The descriptions above should by themselves be sufficient to show that, just as with batch processing, designing and building a real-time stream processing system is not a simple task.
Text: Luis Cláudio R. da Silveira
Revision: Pedro Carneiro Jr.