Apache NiFi, a big data processing engine with a graphical WebUI, was created to give non-programmers the ability to create data pipelines swiftly and without code, and to free them from those dirty, text-based methods of implementation. Unfortunately, we live in a world of trade-offs, and those features come with a price. The purpose of our blog series is to present our experience and lessons learned when working with production NiFi pipelines. The series is organised into the following articles:
Apache NiFi - why do data engineers love it and hate it at the same time?
Part I - Fast development, painful maintenance
Part II - We have deployed, but at what cost… - CI/CD of NiFi flow
Part IV - Universe made out of flow files - NiFi architecture
Part V - It’s fast and easy, what could possibly go wrong - one year history of certain NiFi flow
I have only one rule and that’s … - recommendations for using Apache NiFi
In this post we focus on creating data flows with ready-to-go processors, the limitations of such an approach and the solutions we have applied. You can read the previous posts on our blog.
We are genuine fans of Lego, which is typical for many engineers ;-) Lego provides different brick series for different age categories, and each series brings different capabilities to what can be built. One can create almost anything with Lego Technic, but building something complex takes time, even for a grown-up. On the other hand, with Lego Duplo, a three-year-old can create tall buildings really fast. The only issue arises when one wants to add more detail, because the huge Duplo cuboids make it hard to create custom things. Fortunately, one can mix different Lego series and, for instance, use Lego Classic on top of Duplo.
If data flows were built out of Lego, then NiFi would definitely be Duplo. You can build great things really fast, and there are some options for creating custom logic when no out-of-the-box NiFi processor is available.
A really nice extension point is the ExecuteGroovyScript processor, which allows writing scripts in Groovy. It's well integrated with NiFi, and just a couple of lines of Groovy code can solve really complex problems. The processor allows you to put a script body into a text property and you are done. The only disadvantage is the manual testing within NiFi each time the script is modified. As the approach solves many issues quickly, it can get really popular within a NiFi flow. At some point, you realize that the flow contains dozens of inline Groovy scripts that share some common logic.
This can be solved with a Groovy project that contains classes instead of scripts; the classes are fully covered with unit tests and packaged into a jar file. Once the jar file is deployed on all NiFi nodes, it is included in the classpath of the ExecuteGroovyScript processors. Class methods from the jar file are then used instead of inline scripts. The serious disadvantage we encountered was that, after uploading new jars onto the NiFi nodes, we had to manually reload the classpath for each processor to get the new version loaded. Another was that the Groovy code can read flowfile attributes, so the two systems became tightly coupled. In other words, if you want to change an attribute in NiFi, not only do all the NiFi processors need to be checked, but you also need to make sure that the change won't break the Groovy code stored in another repository. That's how the monolithiest of monoliths get built.
While scripts provide a great interface for extending the functionality of NiFi, they have some limitations, both from a usability and a maintenance perspective. To maximize customization, we can create our own processors, which have a few notable advantages over scripts. The first is better abstraction: with a processor, the user can look into the built-in documentation, check the help messages next to each property name and so on, whereas with a script, looking into the script code is almost always necessary. We can also define as many output connections as we want instead of just success and failure. In addition, because every custom processor is just a Maven project, it can make use of all the traditional programming practices: versioning with a VCS like Git, using test frameworks and creating CI/CD pipelines. What's more, NiFi provides an interface for adding new components in a plugin-like manner, so there is no need to recompile anything. Since version 1.8 there is even the option of dynamically adding new components at runtime, and switching between versions of a component is available from the WebUI. Unfortunately, NiFi will ignore components with the same version as ones previously loaded, so it's impossible to dynamically replace a jar with one that carries an already existing version.
All those mechanisms are great for programmers, but since NiFi is a tool designed for people who do not necessarily like to code, the additional complexity of creating components is a major downside compared to scripts. Everything has to be done in accordance with the NiFi framework, and just the necessity of using Maven or a similar build system is a major complication, especially if the task executed by a component is fairly simple. Another disadvantage is that you need access to the NiFi cluster via ssh, or a configured CI/CD pipeline, to put a custom processor into NiFi, which might be a problem security-wise. As with scripts, it's just adding additional parts to one monolithic system.
No one plans to build a monolith monster. It grows out of tiny bricks of tightly coupled things added one by one each day. The best way to avoid tight coupling is to use state-of-the-art engineering methods such as... microservices. Microservices encapsulate business logic in elegant, tiny components with a clear definition of the API they expose. This is something that really worked in our projects. Whenever some complex logic is required, instead of dozens of untestable NiFi processors, it is really worth creating a REST service endpoint. We favour that approach most of all because NiFi can easily send HTTP/HTTPS requests and handle JSON responses. There are plenty of mature frameworks for writing REST services in a language of your preference. The lack of unit tests in NiFi is a serious limitation. When you build complex things and have unit tests, you can easily refactor your code and continuously make it better each day. Without them, making improvements is risky and is often avoided, so the code base or NiFi flow becomes difficult to maintain. Moreover, microservices can be used by other systems to communicate with NiFi.
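To make the idea concrete, here is a minimal sketch of such a service using only the Python standard library. Everything in it is an assumption made for illustration: the business rule `normalize_order`, the `/normalize` path, port 8080 and the shape of the order JSON are all hypothetical, and a real service would more likely be built on one of the mature frameworks mentioned above. The point is only that the logic lives in a plain function that can be unit tested without NiFi, while the HTTP handler stays a thin wrapper around it.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def normalize_order(order: dict) -> dict:
    """Hypothetical business rule: clean the customer name and total the items."""
    items = order.get("items", [])
    total = sum(item["price"] * item["quantity"] for item in items)
    return {
        "customer": order.get("customer", "").strip().title(),
        "item_count": len(items),
        "total": round(total, 2),
    }


class OrderHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper: parse JSON, call the pure function, return JSON."""

    def do_POST(self):
        if self.path != "/normalize":  # hypothetical endpoint path
            self.send_error(404, "Unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            body = json.dumps(normalize_order(payload)).encode("utf-8")
        except (ValueError, KeyError, TypeError, AttributeError):
            # Malformed JSON or an order that does not match the expected shape.
            self.send_error(400, "Invalid order payload")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Start the service; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), OrderHandler).serve_forever()
```

Calling `serve()` starts the endpoint. On the NiFi side, a processor such as InvokeHTTP could POST the flowfile content to it and route the JSON response onwards. Because `normalize_order` takes a plain dictionary and knows nothing about flowfiles, it can be tested and refactored on its own.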
The approach with microservices works well unless large amounts of data are sent through the network. In other words, it suits scenarios where the complex logic can be kept separate from the data volumes at scale. In other cases, Apache Spark jobs can be triggered from NiFi.
What we loved? NiFi is like Lego Duplo, and it's great that it can be extended with other Lego bricks like Groovy scripts, custom processors or offloading logic to microservices. Each of the approaches has its pros and cons. It's always good to have multiple options and pick the one that serves your needs best.
What we hated? When working with real-life business logic, we prefer using Apache Spark for bigger data volumes or REST services for smaller amounts of data. In other words, for custom logic we prefer avoiding NiFi.