Fast OLAP on Hadoop With Apache Kylin

[Image: Apache Kylin logo]

There are many OLAP tools available in the market today, and they provide powerful query capabilities, so data analysts can easily pivot or drill down OLAP cubes and deliver better results for decision-making processes. However, OLAP can also mean a long wait for results, especially when large data sets are involved. Complex ad hoc queries may take hours to process. This is even more critical in big data scenarios with Apache Hadoop, where writing MapReduce jobs to read the data has proved to be a daunting task. Apache Hive emerged as a data warehouse and query solution for Hadoop, but Hive does not solve the query latency problem: Hive queries may take hours to run, since they must be translated into and executed as a set of MapReduce operations.

Apache Kylin vs. the Latency of SQL Queries Over Hadoop Clusters

The Apache Kylin project was designed to reduce the latency of SQL queries over Hadoop clusters holding massive data sets. It is an open-source Apache project that originated at eBay Inc. Its core is a distributed analytics engine that provides a SQL interface and a multi-dimensional analysis (MOLAP) tool over Hadoop, supporting extremely large data sets (10+ billion rows). Furthermore, Kylin is compatible with ANSI SQL and supports a good part of the ANSI SQL analytic functions (some of them still under development). Users can query massive data sets through Kylin with sub-second latency, far better than Hive queries over the same data.

The OLAP Process With Apache Kylin

First of all, the OLAP designer must identify a star schema on Hadoop. Usually that structure lives in a Hive data warehouse: the Hive database should contain what we call a star schema model, a set of large fact tables and their dimension tables. That star schema is the heart of any data warehouse solution, designed and optimized for query purposes. Next, Kylin users access the Hive database and build OLAP cubes from the star schema. Each cube designed in Kylin is pre-calculated and stored in HBase for later access; Kylin was designed to exploit HBase features related to query optimization and cube manipulation. Finally, users can query the data with ANSI SQL and get results with sub-second latency via ODBC, JDBC or a simple RESTful API. Another important feature is Kylin's seamless integration with BI tools such as Tableau, Power BI, Excel and MicroStrategy (under development).
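For instance, querying a Kylin cube over JDBC looks just like querying any relational database. The sketch below is a minimal, hypothetical example: the host, project, table and column names (kylin-host, sales_project, kylin_sales, part_dt, price) are placeholders, and the default ADMIN/KYLIN credentials are assumed.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class KylinQueryExample {
    public static void main(String[] args) throws Exception {
        // Kylin ships a JDBC driver; the URL points at a Kylin project, not a database
        Class.forName("org.apache.kylin.jdbc.Driver");

        Properties props = new Properties();
        props.put("user", "ADMIN");      // default Kylin credentials; adjust for your install
        props.put("password", "KYLIN");

        // "kylin-host" and "sales_project" are placeholders for a real host and project
        try (Connection conn = DriverManager.getConnection(
                "jdbc:kylin://kylin-host:7070/sales_project", props);
             Statement stmt = conn.createStatement();
             // An ordinary ANSI SQL aggregation; Kylin answers it from the pre-built cube
             ResultSet rs = stmt.executeQuery(
                 "SELECT part_dt, SUM(price) AS total " +
                 "FROM kylin_sales GROUP BY part_dt ORDER BY part_dt")) {
            while (rs.next()) {
                System.out.println(rs.getString("part_dt") + " -> " + rs.getDouble("total"));
            }
        }
    }
}
```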

Other Kylin Features

Some other Kylin features:

– Job management and monitoring
– Compression and encoding support
– Incremental refresh of cubes
– Leverages HBase coprocessors to reduce query latency
– Approximate distinct count queries (HyperLogLog)
– Easy web interface to manage, build, monitor and query cubes
– Security: ACLs can be set at the cube/project level
– LDAP integration support

Kylin Solution Architecture

[Diagram: Apache Kylin solution architecture]

Visit Apache Kylin website at kylin.apache.org.

Text: Luis Cláudio R. da Silveira
Revision: Pedro Carneiro Jr.

 

Batch Processing vs Stream Processing

Batch processing is defined as a series of jobs that are connected to each other and executed one after another, in sequence or in parallel. After these jobs execute, their outputs are consolidated to produce a final result. Input data is collected and grouped into lots (chunks) over a period of time, and the output produced by one lot may be the input to the next batch. It is also called discrete data processing, and its role is to process collections of large data files (GB/TB/PB). Because of the high latency between tasks, fast response time should not be a critical factor for the business.

Each batch task is associated with a time “window” in which it is allowed to run, usually scheduled for periods of less intensive system activity or off-hours. In many cases, batch tasks are scheduled to run at predefined intervals, such as at a certain time of day (usually at night), month or year. Some examples of batch processing tasks:

a) Log analysis: logs from servers, applications, OLTP software, etc. are collected over a certain period of time (a day, a week or a year), and the processing of those logs is performed at a distinct time (the time window) to derive a set of key performance indicators (KPIs) for the system in question; a minimal sketch of this idea appears after these examples. (Source: http://en.wikipedia.org/wiki/Log_analysis, accessed on 11-Mar-2016.)

b) Billing applications: billing applications calculate the use of a service provided over a period of time and generate billing information, as credit card companies do when producing billing statements at the end of each month.

c) Backups: a series of tasks performed at times that are not critical to the operation of the system (the time window), backing up the systems that are critical to the organization.

d) Data warehouses: data warehouses consolidate management information as static snapshots of the collected timeframe and as aggregate views such as weekly, monthly or quarterly reports.
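To make the log-analysis case (a) concrete, here is a toy sketch of a batch job. The file name, log format and the chosen KPI are all invented for the example; the point is that the job makes one pass over a whole chunk of data collected for a time window and produces a consolidated result, and a real job would be scheduled to run once per window (for example, nightly).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

public class DailyErrorRateBatchJob {
    public static void main(String[] args) throws IOException {
        // "app-2016-03-11.log" is a placeholder path for one day's collected logs
        Path logFile = Paths.get(args.length > 0 ? args[0] : "app-2016-03-11.log");

        long total = 0;
        Map<String, Long> byLevel = new TreeMap<>();

        // One pass over the whole chunk collected for the window: the essence of a batch job
        try (BufferedReader reader = Files.newBufferedReader(logFile)) {
            String line;
            while ((line = reader.readLine()) != null) {
                total++;
                // assume each line starts with a severity level, e.g. "ERROR Connection refused"
                String level = line.split("\\s+", 2)[0];
                byLevel.merge(level, 1L, Long::sum);
            }
        }

        // Consolidated output: the KPIs derived for this time window
        long errors = byLevel.getOrDefault("ERROR", 0L);
        System.out.printf("lines=%d errors=%d error_rate=%.2f%%%n",
                total, errors, total == 0 ? 0.0 : 100.0 * errors / total);
    }
}
```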

The examples above may give the impression that batch processing systems are not critical by nature, but that perception is wrong. Batch systems are critical to the business, even though their users do not expect instant or real-time answers. For example, a system that recommends product offers to prospective customers may run every night, performing its calculations and storing the results in a database. The processing of such a system becomes critical because it needs to be completed within a certain period (minutes or hours); if it is not, its users will not have access to an updated set of offers and recommendations, with a negative impact on the business.

Let’s talk a little about the complexities involved in batch processing systems:

a) Large data sets: the data volumes are usually very large, and producing results within the defined time windows requires a lot of computing resources.

b) Scalability: unlike vertical scaling, which adds more computational resources to a single machine, we need to adopt a horizontally scaled architecture, in which growing data processing demands are met by adding new computing resources (for example, new processing nodes) without affecting the existing architecture.

c) Distributed processing: there are physical limits to how much RAM and CPU can be added to a single machine. Eventually, as the volume of data grows, jobs may be aborted for lack of resources. Batch processing should therefore support horizontal scalability, distributing the work across a cluster (group) of servers.

d) Fault tolerance: faults always occur, and when tasks run on multiple servers, even more failures can happen. There may be hardware or network failures, or many other adverse external events outside the control of the user or the system. For example, some nodes in a cluster may fail while batch processing activities are being performed.

The system should be fault-tolerant, so that it resumes only the tasks that were allocated but failed or did not complete, instead of reprocessing everything from scratch.

e) Business restrictions: companies formalize service level agreements (SLAs) among themselves to ensure the operation and availability of their systems. Under these SLAs, batch processing must be completed within a stipulated time, so that, after it finishes, the computational resources can be used for other purposes. There are other restrictions as well, such as scheduling, reprocessing, clustering, persistence and so on.

The descriptions above should be sufficient to show that batch application design and architecture are not simple tasks. They require the skills to develop a scalable, distributed system that delivers high performance.

Real-Time Data Processing (Stream Processing)

Real-time data processing continuously receives data that is under constant change, such as information from air traffic control, travel booking systems, and the like. Stream processing must be fast enough to keep up with the events being consumed from the various data sources. The required response time is near-instantaneous: events should be processed within milliseconds (or microseconds). A stream processor is classified as real time only when it produces logically correct results within a given time slice (in milliseconds) and can guarantee the system's requirements.

Stream processing often uses the term latency: the time interval between a stimulus and a response, or the difference between the moment the data is received and the moment a response is generated. The lower the latency, the better the flow, or “throughput”, of the system. Real-time data processing is also often referred to as near real time when latency is introduced by the system or when SLAs are relaxed in producing the desired results. Here are some examples of systems that receive data in near real time, process the data and send back the results:

Bank ATMs: They receive input from users and instantly apply the transactions (withdrawals, transfers or any other transaction) to a centralized account.

Real-time monitoring: capture and analyze data emitted by various sources such as sensors, news feeds, clicks on Web pages, etc.

Business Intelligence in real time: process that involves delivering business intelligence and decision-making information as operations are taking place.

Operational intelligence: uses real-time data processing and complex event processing (CEP) to extract value from operations by analyzing queries submitted against the events introduced into the system. (http://en.wikipedia.org/wiki/Complex_event_processing, http://en.wikipedia.org/wiki/Operational_intelligence)

Point-of-sale (POS) systems: update stock and provide item inventory history, allowing an organization to register payments in real time.

Assembly lines: process data in real time to reduce time and cost and to mitigate errors. Errors are captured instantly and appropriate actions are taken without impacting the business, increasing product quality and productivity.

Let’s address some of the complexities involved in (real-time) stream processing systems:

Responsiveness of the system: expectations for real-time processing systems revolve around their ability to process data as it is introduced into the system (on the order of microseconds or milliseconds) without introducing any delay in producing results.

Fault tolerance: failures can occur, but real-time processing systems cannot afford to lose a single event.

Scalability: the need to adopt a scale-out architecture, so that the growing demands of stream processing are met by adding new nodes or computing resources without rebuilding the entire environment.

In-memory processing: because disk reads and writes are too slow for the required latency, real-time systems must perform their data processing entirely in main memory. For this to work, the systems need to guarantee enough memory to hold the ingested data in the distributed memory of the servers in a cluster.

The descriptions above should be sufficient in themselves to show that, just as with batch processing, designing and building a real-time stream processing system is not a simple task.

Text: Luis Cláudio R. da Silveira
Revision: Pedro Carneiro Jr.

 

What is Data Science?

I am a bit dissatisfied with the multiple definitions that data science has been receiving and with the lack of at least one clear, scientific approach to a definition for it, as exists for computer science, software development science, and many other subjects. So I decided to write this post hoping to produce some findings and/or spark some discussion around it. Who knows, we may reach a more scientific definition in the future.

“The field of data science is emerging at the intersection of the fields of social science and statistics, information and computer science, and design. The UC Berkeley School of Information is ideally positioned to bring these disciplines together and to provide students with the research and professional skills to succeed in leading edge organizations.” – https://datascience.berkeley.edu/about/what-is-data-science/, accessed on January 13th, 2016.

Data Science Happens Not Only In California

Many people quote the line that a data scientist is “a data analyst who lives in San Francisco”. That alone might indicate the importance of data analysts and data practitioners in California, but it also seems enough to show that what we know as data science has more of a practical or commercial appeal than a proper scientific definition. Anyhow, we should not deny that data science already has an identity: a fast-paced, rapidly evolving one, just like any other field directly involved with modern technologies. But the distinct personality of data science is still a bit confusing.

Is Statistics Data Science Itself?

Many argue that data science might be statistics itself, or whatever modern statistics does by computational means. That happens even in the academic ecosystem, on a large scale, propelled by the popularity of big data, machine learning, et cetera. Does statistics make up the whole of data science? Does data science make up the whole of statistics? In other words, are statistics and data science different sets, different sciences? The known truth so far is that statistics makes use of data science.

Data Science, According To Wikipedia

Many professors would not accept a Wikipedia definition as the basis for a scientific argument.

Anyhow, let us ease things a little by using it. In my opinion, Wikipedia reflects what the majority thinks, or at least tends toward an average of the prevailing mindset.

Data Science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, data mining, and predictive analytics, similar to Knowledge Discovery in Databases (KDD). – Wikipedia, https://en.wikipedia.org/wiki/Data_science, accessed on January 12th, 2016.

Wikipedia, at this moment at least, defines data science as an interdisciplinary field. That is true. Another point of view affirms this too and provides the famous Data Science Venn Diagram. My question is: must a field be a science? A field is a subset or part of a science, but the reciprocal is not necessarily true. In the citation above, Wikipedia affirms that statistics is a field too, and yet we consider statistics a science.

A Data Science Visualization, According to Drew Conway

One of the opinions that comes closest to common ground for a definition of data science is Drew Conway's. Although I have not yet seen any statement that it is a definition, his visualization presents data science as the intersection of hacking skills, statistics, and substantive expertise in the areas of application: the famous Data Science Venn Diagram. It seems to miss key areas such as databases, data governance, and so on, but I think he has put all the computer science and database material into the set called “hacking skills”.

That probably happens because the world has many more programmers (people with hacking skills) than computer scientists, or because results-oriented people with hacking skills are in greater demand than computer scientists. Perhaps computer science is so closed in on itself (difficult to enter or to communicate with), or becomes so boring at university, that there are more “people with hacking skills” from other areas behind desks typing R command lines than good computer scientists doing the same.

“As I have said before, I think the term “data science” is a bit of a misnomer, but I was very hopeful after this discussion; mostly because of the utter lack of agreement on what a curriculum on this subject would look like. The difficulty in defining these skills is that the split between substance and methodology is ambiguous, and as such it is unclear how to distinguish among hackers, statisticians, subject matter experts, their overlaps and where data science fits. What is clear, however, is that one needs to learn a lot as they aspire to become a fully competent data scientist.

Unfortunately, simply enumerating texts and tutorials does not untangle the knots.

Therefore, in an effort to simplify the discussion, and add my own thoughts to what is already a crowded market of ideas, I present the Data Science Venn Diagram.” – Drew Conway, http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram, accessed on January 13th, 2016.

According to Drew Conway, author of the Data Science Venn Diagram, the term “data science”, coined for the recent uses of data, may be a bit of a misnomer, and I agree with him.

Data Science vs Data Science

We should then ask: within what science does the so-called data science field sit? Information science, the “Data Science”, statistics, computer science…? Wikipedia's data science definition also says that DS is similar to KDD, but shouldn't KDD be encompassed by DS, simply because databases deal with data? That raises another question: is the real data science “the science of data” or “the science that extracts knowledge or insights from data in various forms”?

Here we encounter two definitions and only one of them is the real Data Science.

“Data science is the study of where information comes from, what it represents and how it can be turned into a valuable resource in the creation of business and IT strategies… Mining large amounts of structured and unstructured data to identify patterns can help an organization rein in costs, increase efficiencies, recognize new market opportunities and increase the organization’s competitive advantage. Some companies are hiring data scientists to help them turn raw data into information. To be effective, such individuals must possess emotional intelligence in addition to education and experience in data analytics.” – http://searchcio.techtarget.com/definition/data-science, accessed on January 13th, 2016.

The Data Science Venn Diagram helps a lot with that, but there is more to be discovered. In my opinion, the “data science” that Wikipedia, data analysts, statisticians, programmers, and business people talk about is more about what these data practitioners have been doing with statistics, substantive expertise and hacking skills to turn raw data into information than about, for example, the science that studies data: a systematically organized body of knowledge on the particular subject of data; in other words, the science that studies data frames, data sets, databases, metadata, data flows, data cubes, data models, and the entire domain the subject of data might encompass, along with its frontiers. That makes us go after the definition of science.

“There is much debate among scholars and practitioners about what data science is, and what it isn’t. Does it deal only with big data?

What constitutes big data? Is data science really that new? How is it different from statistics and analytics?… In virtually all areas of intellectual inquiry, data science offers a powerful new approach to making discoveries.

By combining aspects of statistics, computer science, applied mathematics, and visualization, data science can turn the vast amounts of data the digital age generates into new insights and new knowledge.”, http://datascience.nyu.edu/what-is-data-science/, accessed on January 13th, 2016.

What Science Is

I went after a classic definition of science, and the first thing that came to me was, again, a Wikipedia definition. Such are modern times, professors.

Anyway, trying to be fair to the investigation, I looked for other online sources and found some other definitions, including one that comes close to what is best used when one wants to establish a science, and that may be helpful in our future reasoning.

Science, According to Wikipedia

Wikipedia defines science as “a systematic enterprise that creates, builds and organizes knowledge in the form of testable explanations and predictions about the universe”.

Science, According to Google’s Definition

According to Google, science is “the intellectual and practical activity encompassing the systematic study of the structure and behavior of the physical and natural world through observation and experiment. (‘the science of criminology’)”; “a particular area of this. (‘veterinary science’)”;”a systematically organized body of knowledge on particular subject. (‘the science of criminology’)”; “synonyms: physics, chemistry, biology; physical sciences, life sciences (‘he teaches science at the high school’)”.

Science, According to Merriam-Webster

At Merriam-Webster we read that science is “knowledge about or study of the natural world based on facts learned through experiments and observation; a particular area of scientific study (such as biology, physics, or chemistry); a particular branch of science; a subject that is formally studied in a college, university, etc.”

Science, According to BusinessDictionary.com

The BusinessDictionary.com defines science as “Body of knowledge comprising of measurable or verifiable facts acquired through application of the scientific method, and generalized into scientific laws or principles. While all sciences are founded on valid reasoning and conform to the principles of logic, they are not concerned with the definitiveness of their assertions or findings”. And adds, “In the words of the US paleontologist Stephen Jay Gould (1941-), ‘Science is all those things which are confirmed to such a degree that it would be unreasonable to withhold one’s provisional consent.'”

This one seems to be the best definition of science we have found so far, as it mentions the scientific method as the way to measure and verify the facts and the laws or principles that compose a science.

A Raw First Definition of Data Science

This is raw, perhaps unsophisticated and prone to errors (we are not using the scientific method yet; let us keep that for future posts), but let us imagine what a definition of data science would look like based on the definitions of science listed above.

A Wikipedia-Would-Be Definition of Data Science

“A systematic enterprise that creates, builds and organizes knowledge in the form of testable explanations and predictions about data“.

Do we have a systematic enterprise that creates, builds and organizes knowledge in the form of testable explanations and predictions about data? Is what we have today about data, or about other things that merely use data as their main support?

A Google-Would-Be Definition of Data Science

From Google’s definition of Science, it looks like our data science definition should at least become something like:

1. “the intellectual and practical activity encompassing the systematic study of the structure and behavior of the physical and natural world through observation and experiment of data. (‘the science of data’)”;

2. or “the intellectual and practical activity encompassing the systematic study of the structure and behavior of data in the physical and natural world through observation and experiment. (‘the science of data’)”.

We now have a first and a second definition, based on Google's definition of science.

From recent practice and reading, I would bet that the first Google-would-be definition (1) is what most people involved have in mind when they think of data science. I think that is why many people tend to confuse data science with statistics: definition number one expresses very well what statistics does. But is that actually the proper definition of data science?

Other Google-based definitions would read like “a particular area of this. (‘data science’)”; “a systematically organized body of knowledge on the data subject. (‘the science of data‘)”.

Do we have a systematically organized body of knowledge on the data subject? As far as I know, we have systematically organized bodies of knowledge on many subjects, and they use data as a foundation.

A Merriam-Webster-Would-Be Definition of Data Science

“Knowledge about or study of data based on facts learned through experiments and observation; a particular area of scientific study (such as “DATA-o-logy”, biology, physics, or chemistry); a particular branch of science (data science); a subject that is formally studied in a college, university, etc.”

A BusinessDictionary-Would-Be Definition of Data Science

“Body of knowledge comprising of measurable or verifiable facts about data acquired through application of the scientific method, and generalized into scientific laws or principles.”

We are not here to state a precise definition of data science yet, but to throw the ball to the kicker.

Nowadays (it is January 2016), it is possible to find many definitions of data science, and many (or all) of them still lack precision or point to a practice that may be a misnomer for something people do with data for scientific and commercial reasons. As a science, there are people studying it and defining it (which is what we are trying to do), not only using it. As a practice, people do not mind whether it is a science or not, since the tool set works for them. As many try to define it according to their own observations and experiences, it seems that everybody, while succeeding in a good definition for specific purposes, fails to find common ground for the definition. As every scientist knows, the proper common ground for the definition of any science is Science itself.

If one were to say that data science is “the science of data”, that would be vague and imprecise, but that innocence would throw light on a different perspective.

What is science and what is data? Answering that might help us reach better, more common-sense definitions for both the practice of extracting knowledge or insights from data and the science of data, and, who knows, allow us to say whether there is a great difference, or none at all, between the two.

Just an important note: I searched http://www.sciencecouncil.org/ and http://www.businessdictionary.com/ and found no definition of data science on either website.

Other Readings

Other sources to find popular or different perspectives about data science are:

  • http://www.kdnuggets.com/2013/12/what-is-wrong-with-definition-data-science.html
  • http://datascience.nyu.edu/what-is-data-science/
  • http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
  • https://datascience.berkeley.edu/about/what-is-data-science/
  • https://en.wikipedia.org/wiki/Data_science

Text: Pedro Carneiro Jr.
Revision: Luis Cláudio R. da Silveira.

 

 

About The Software Design Science (Part 1 of 2)

As the first person to claim a science of software design, Kanat-Alexander (2012), when speaking about what he calls “the missing science”, explores the concept of software design. The entire foundation of the software design laws depends on that conceptualization. The missing science is the science of software design. The approach defined by Kanat-Alexander (2012), transcribed below, reflects practice and the facts. The science of software design acts before the programming phase begins, remains during development, and continues after programming is finished and the program enters operation, throughout its maintenance.

Every programmer is a designer. (KANAT-ALEXANDER, 2012, p.6).

The title of this particular work by Kanat-Alexander, Code Simplicity, represents a fundamental truth to be followed by software developers: the simplicity of the code.

Software design, as it is practiced in the world today, is not a science. What is a science? The dictionary definition is a bit complex, but basically, in order for a subject to be a science it has to pass certain tests. (KANAT-ALEXANDER, 2012, p.7).

In this defense of a new science are elements long perceived, but never organized, by the more experienced programmers. He thus lists the tests that software design must pass to be considered a science:

  • A science must be composed of facts, not opinions, and these facts must have been gathered somewhere (such as in a book).
  • That knowledge must have some sort of organization: it must be divided into categories, and its various parts must be properly linked to each other in terms of importance, etc.
  • A science must contain general truths or basic laws.
  • A science should tell you how to do something in the physical universe and be somehow applicable at work or in life.
  • Typically, a science is discovered and proven by means of the scientific method, which involves observing the physical universe, piecing together a theory about how the universe works, performing experiments to verify the theory, and showing that the same experiment works everywhere, demonstrating that the theory is a general truth and not just a coincidence or something that worked only for someone.

The whole software community knows there is a lot of knowledge recorded and collected in books, in a well-organized manner. Despite that, we still lack clearly stated laws. Experienced software developers may know what the right thing to do is, but nobody knows for sure why those decisions are the right ones. Therefore, Kanat-Alexander (2012) lists definitions, facts, rules and laws for this science.

The whole art of practical programming grew organically, more like college students teaching themselves to cook than like NASA engineers building the space shuttle…. After that came a flurry of software development methods: the Rational Unified Process, the Capability Maturity Model, Agile Software Development, and many others. None of these claimed to be a science—they were just ways of managing the complexity of software development. And that, basically, brings us up to where we are today: lots of methods, but no real science. (KANAT-ALEXANDER, 2012, p. 10).

Kanat-Alexander (2012) affirms that all the definitions below are applicable when we talk about software design:

  • When you “design software”, you plan it: the structure of the code, what technologies to use, etc. There are many technical decisions to be made. Often one makes them only mentally; other times one also jots plans down or draws a few diagrams;
  • Once that is done, there is a “software design” (the plan that was elaborated), whether it is a written document or just a set of decisions kept in mind;
  • Code that already exists also has “a design” (“design” as the plan that an existing creation follows), which is the structure it has or the plan it seems to follow. Between “no design” and “a design” there are also many possibilities, such as “a partial design” or “several conflicting designs in a piece of code”. There are also designs that are effectively worse than having no design at all, such as code that is deliberately disorganized or complex: code with an effectively bad design.

The science presented here is not computer science. That’s a mathematical study. Instead, this book contains the beginnings of a science for the “working programmer” — a set of fundamental laws and rules to follow when writing a program in any language… The primary source of complexity in software is, in fact, the lack of this science. (KANAT-ALEXANDER, 2012, p. 11).

The science of software design is a science for developing plans and making decisions about software. It helps in deciding on the ideal structure of a program's code, in choosing between execution speed and ease of understanding, and in determining which programming language is most appropriate for the case. We thus note a new point of view on what we call software design, through the prism of the programmer: one that involves not only the activities that follow requirements analysis, but that persists throughout programming and the whole product life cycle, including maintenance, because good maintenance requires a good design as a reference, taking its fundamental laws into account.

Text: Pedro Carneiro Jr.
Revision: Luis Cláudio R. da Silveira.

 

When to use Enums?

Let us consider this short real-life example:

One day, I was helping the guys at our local Housing Agency to build a new online form that was going to feed the database with candidates for the government’s low-cost housing benefit. Months later, after the solution was on-air and people had fed in their information, the same small Scrum team was responsible for building reports on that data.

One of our team members had left the project for another job, and up to that moment we had been fine implementing everything as demanded from above. But one thing caught my attention: there was some data that, at first sight, had no correspondence between the report and the database.

We went up and down the documentation trying to figure out how to map all the report fields to the database we had (one born for the other, for a very simple task). Although the database was well documented, apparently very little could be done to solve the problem and deliver the reports on time. Where the reports asked for marital status, income range, and the kind of information that usually goes in combo boxes, all we could find in the database were numbers (integers)!

After some hours struggling with that, I decided to open the code and track down whatever might be the data structure I was looking for.

The answer quickly came in the form of a word: enum. Some of the data structures were written within the code as “enums” (enumerations/enumerators), and what was being persisted in the database was only their corresponding integer values.

An enumeration is a complete, ordered listing of all the items in a collection. The term is commonly used in mathematics and theoretical computer science (as well as applied computer science) to refer to a listing of all of the elements of a set. – Enumeration – Wikipedia, the free encyclopedia (https://en.wikipedia.org/wiki/Enumeration, accessed on December 23rd, 2015.)
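To make the situation concrete, here is a minimal, hypothetical sketch of how it happens. The enum, table and column names (MaritalStatus, candidate, marital_status), the connection string and the credentials are all invented for illustration; the point is that the labels live only in the application code, while the database receives a bare integer.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class EnumPersistenceSketch {

    // The labels exist only in the application code...
    enum MaritalStatus { SINGLE, MARRIED, DIVORCED, WIDOWED }

    public static void main(String[] args) throws Exception {
        MaritalStatus status = MaritalStatus.MARRIED;

        // ...while the database column receives nothing but the ordinal (1 in this case).
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/housing", "app", "secret");
             PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO candidate (name, marital_status) VALUES (?, ?)")) {
            ps.setString(1, "Jane Doe");
            ps.setInt(2, status.ordinal());   // persisted as 1, with no label in the database
            ps.executeUpdate();
        }
        // A report writer querying candidate.marital_status sees only the number 1
        // and cannot recover the label "MARRIED" without reading the source code.
    }
}
```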

I believe in the benefits of Scrum, agree with and support its values, and really think of it as the closest approach to good effectiveness for a small team dealing with multiple projects. I would also never consider the colleague who left the team a bad or average programmer (actually, he is one of the best programmers I have met in my journey). The fact is that pressure and heat generated some bad gases in that event. And the exhaust valve for most OO programmers is called “enum”.

That situation inspired me to look more closely at this “enum” thing, and after some other experiences with that whatchamacallit datatype, I became convinced I could do a little research and maybe bring something useful to the scientific and software development communities (to those whom my words convince, of course).

For that reason, I am going to start a series of posts presenting, in parts, my monograph “THE PERSISTENCE OF ENUMERATIONS IN POSTGRESQL DATABASES AND THE QUALITY IN BUSINESS INTELLIGENCE”, freely translated from the original in Portuguese. In it I expect to introduce the views of some authors on software design science, business intelligence, THE ENUM, and some other things usually related in a BI environment and, above all, to decipher enumerations and when and how it is better to use them.

As database people, we sometimes feel uncomfortable when developers favor certain methods, so my proposal to answer the question “When to use enums?” came up after some debate within our professional circles. Some colleagues support my point of view and some avoid it or do not like it. All in all, there is still a gap between code and data. Shall we explore it?


ATTRIBUTION

The images used in this post, edited or not, are Creative Commons (CC) and the originals are credited to their creators and can be found at:

Text: Pedro Carneiro Jr.
Revision: Luis Cláudio R. da Silveira.

Apache Flink – Part I

Why the name Flink? The Flink logo is a squirrel. But what do squirrels have to do with intensive data processing? Perhaps the fact that squirrels are rodents and can crack nuts very quickly.

But you may ask: what is Apache Flink? Apache Flink is an extremely fast and reliable platform for large-scale, distributed processing of data flows. Flink was born to meet an ever-increasing demand to extract analytical results from big data sources.

Similar to Apache Spark (from Berkeley's AMPLab), it came to address some drawbacks found in Apache Hadoop, where analytical results are produced by a series of batch operations interspersed with reads and writes to storage. Beyond batch processing in the way Hadoop does it, Flink processes large data streams and delivers analytical results in real time. Furthermore, Flink has a very robust back-end implementation and a powerful API that is relatively easy to work with, saving development time compared to programming scheduled map/reduce tasks on Hadoop systems.

But where can we apply Flink? Apache Flink is an analytical system for general-purpose big data projects. Application areas include stream processing for real-time systems that require events from sources to be ordered by timestamp, ETL (extract, transform, load: complex tasks heavily used in data warehouse workloads), analysis of massive graphs derived from traffic or social networks, machine learning applications that predict trends from massive data sets, and so on.

In terms of system architecture, Flink's core is an engine that provides data distribution, fault tolerance and scalability for distributed computations over data flows. Unlike Hadoop, Flink executes batch processing on top of a streaming engine, offering a single API for the development of both batch and streaming applications. Moreover, this layering natively supports flow iterations, better memory management and code optimization.
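To give a feel for that streaming API, here is the classic word-count sketch using Flink's Java DataStream API of the time (Flink 1.x style). The host and port of the text source are placeholders; the program counts words continuously as lines arrive, instead of waiting for a finished batch.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // "localhost:9999" is a placeholder text source (e.g. started with: nc -lk 9999)
        DataStream<String> text = env.socketTextStream("localhost", 9999);

        DataStream<Tuple2<String, Integer>> counts = text
                .flatMap(new Tokenizer())  // split each incoming line into (word, 1) pairs
                .keyBy(0)                  // group the stream by the word
                .sum(1);                   // keep a running count per word

        counts.print();
        env.execute("Streaming WordCount");
    }

    public static class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    out.collect(new Tuple2<>(word, 1));
                }
            }
        }
    }
}
```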

Flink can be installed in many environments: embedded in an application, standalone, on Hadoop YARN, or on cloud platforms such as Amazon or Google Compute Engine.

To learn more about the Apache Flink project, access flink.apache.org.

Text: Luis Cláudio R. da Silveira.
Revision: Pedro Carneiro Jr.

 

Hello Data!

We are all, in one form or another, aware that, besides being human beings, we also exist in the form of data, be it as a single telephone record in a friend's smartphone or as a simple paper record in the notebooks of an old church or civil registry office on a distant island somewhere in the world.

Besides that, everything discovered up to now has been registered and mapped in some form: from underground waters to every meter of land and the frontiers of outer space, from the subjective interpretations of the human sciences to the objective discoveries of engineering, physics and mathematics. It is true that, in terms of science, there is still much to discover, but every time a discovery is made, new records are added, not to mention the new data created along the way. Indeed, new data is created on the go: while sisters share their selfies on Facebook, while engineers and scientists manage their projects away from their team members, and while I am here writing this post.

Anyhow, civilization has always been subject to its tools, from generation to generation, and currently, in the information age, every person is affected by the way data is obtained, processed and presented.

This website intends to talk about data and its related subjects from every perspective we find: giving examples, sharing tutorials, commenting on experiences, and doing whatever is possible and necessary to contribute positively to the scientific, software development and data professional communities, by receiving feedback (intellectual contributions) from its readers and by generating discussion and reflection.

DataOnScale.com is an endeavor of Luis Cláudio and Pedro Jr., two Brazilian IT professionals and data science enthusiasts.