Tutorial

11 min read

NiFi Scripted Components - the missing link between scripts and fully custom stuff

Custom components

As we probably know, the biggest strength of Apache Nifi is the large amount of ready-to-use components. There are, of course, situations where we need something custom that we can’t do with the use of available processors or it will make our flow unreadable, messy, ugly, the kind that you wouldn’t like to introduce to your parents. If you have any experience with Nifi then you have probably found some ways to handle this particular problem and (some of them are presented in the article here). The solution depends largely on the complexity of the processing. Available external components are also an important factor, but today we will focus on the internal features of Nifi. If it’s something simple, you could do with one or two bash commands then the ExecuteStreamCommand processor will do the rest of the job for you. It's very simple, it usually only requires one line of code and you’re good to go. If that’s not enough, it’s time to get out the big guns. ExecuteGroovyScript or ExecuteScript both provide a way to insert code directly into the processor and make it do whatever we want it to. It’s almost the perfect solution… Almost.

So what’s wrong with the scripts?

Even though script executors are great tools, they do have some downsides. We can more or less separate them into two kinds, based on their nature, explicitness and phase of development in which they appear.

Development phase, functional limitations

These are some that you can encounter while designing your flow or in the implementation, to name a few:

only two outgoing relationships
dependency updates issue
dynamic properties only
working in processor-like manner

As we can see, these are strictly related to the development phase. If we want to have more relationships, we need to set some attribute with status and later route the flowfile with RouteOnAttribute processor and as a rule we would like to keep the size of our flow to a minimum. If we want to update dependency jars, we also need to invalidate the processors that use them, otherwise they will use a cached (old) version of the jar. Dynamic properties make it impossible for us to use sensitive properties. Finally, the manner of working is problematic if we want to achieve functionality of e.g. a controller service.

It’s possible to make workarounds for those but they’re more like hacks rather than solutions.

Maintenance phase, good practice hurdles

Here instead of clear and explicit lack in functionality, we can see secondary consequences of the design. While they may not seem that important to the developer that creates the first PoC version of the flow, they definitely can make someone trying to make changes in the flow a year later question the qualifications and sanity of the creator. Among others:

lack of testing frameworks or good practices
incentive for using implicit arguments
no way of describing the properties

The first one is quite obvious, we want to test our changes and do regression tests. Frameworks for automatic tests are a blessing that we often don't appreciate enough in normal programming. Two of the following require more explanation.

The devil lies in the implicit

As we know the Nifi is full of implicit values that we pass on throughout the flow in the form of attributes. Using attributes is convenient, but we need to remember that if we create a custom component, we need to make it visible somehow, that we are in fact using them.

Let’s imagine a situation where you are a maintainer of the flow and you want to make some change which will modify the value of one attribute. You need to check if it's not used anywhere in the flow, fortunately each processor has documentation that contains the information which attributes it uses… except for those scripts. You need to find all the scripts that use the attribute inside the script body. Even more-so, they might call some method that has flowfile reference as an argument. Then you need to find the dependency code and check there.

It can happen with any custom component in Nifi but in the case of scripts, there is no incentive to make values explicit right away. What’s more, taking value from attributes is in fact the most convenient way to get it.

Documentation matters

Imagine a different scenario; you want to use a script someone else created in your flow. The creator was reasonable enough to take all the values from dynamic properties, so you can see which ones are used right away, but what for? The script is big and properties have generic names. Guess it’s time to study the code and figure out the author's intentions. That’s usually not a pleasant experience.

Custom components to the rescue

Apache Nifi provides API for all types of components, so the user can easily create custom processors, controller services and so on. To put it bluntly, they solve all the problems mentioned above. The question is then, why don’t we use them by default?

So what’s the problem?

We can divide this question into two separate questions that are easier to answer. First of all - why use scripts in the first place and secondly, why not migrate to custom components?

Why scripts?

The reason is simple, they’re faster to implement. To write a custom processor you need to create a project, compile it and make sure all the libraries are added correctly. It’s quite a lot of hassle. On the other hand if it’s something simple, the script will be more than enough. The problem is, we start with scripts because the logic they implement is simple, but later it gets more and more complicated and we have already made the decision to use scripts. The only way then, is migration.

Hurdles of making non-functional improvements

Migrations to better solutions and non-functional improvements are something we would all like to have done in our project. I could say that we all would like to do it but let's face it, we don’t necessarily want to… It will not improve functionality of the solution and will take a lot of time, the only people who will see this are the developers, so a business will not notice. What’s more, they want new functionalities so we have a thousand items in the backlog. If that wasn’t enough, a new solution means changes in deployment so it’s not only a job for developers but for devopses as well… and we don’t even know whether it will be worth it or not.

Some of the factors mentioned above will not change, but if we could create a quick PoC that works, we would have more arguments to push for such migration.

Salvation for migration - scripted components

Nifi provides several components that can be a half-way solution that is able to solve a few of the migration issues.

InvokeScriptedProcessor
ScriptedTransformRecord
SimpleScriptedLookupService
ScriptedActionHandler
ScriptedLookupService
ScriptedReader
ScriptedRecordSetWriter
ScriptedRecordSink
ScriptedRulesEngine
ScriptedReportingTask

These components work in a straightforward way. You provide custom implementation in the component body and then it transforms accordingly.

How do you use it?

Let's go with the example of the processor that has two properties and two relationships.

class ExampleProc implements Processor {

    public static final PropertyDescriptor REQUIRED_PROPERTY = new PropertyDescriptor.Builder()
            .name("required property")
            .displayName("Required Property")
            .description("Description of the required property, can be as detailed as we want it to be")
            .required(true)
            .addValidator(StandardValidators.NON_BLANK_VALIDATOR)
            .expressionLanguageSupported(ExpressionLanguageScope.VARIABLE_REGISTRY)
            .build()

    public static final PropertyDescriptor OPTIONAL_PROPERTY = new PropertyDescriptor.Builder()
            .name("optional property")
            .displayName("Optional Property")
            .description("Description of the optional property, can be as detailed as we want it to be")
            .required(false)
            .addValidator(StandardValidators.NON_BLANK_VALIDATOR)
            .expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES)
            .build()

    public static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("Description of the success relationship, can be as detailed as we want it to be")
            .build()
    public static final Relationship REL_FAILURE = new Relationship.Builder()
            .name("failure")
            .description("Description of the failure relationship, can be as detailed as we want it to be")
            .build()

    @Override
    void initialize(ProcessorInitializationContext processorInitializationContext) {}

    @Override
    Set<Relationship> getRelationships() {
        return new HashSet<>(Arrays.asList(REL_SUCCESS, REL_FAILURE));
    }

    @Override
    void onTrigger(ProcessContext processContext, ProcessSessionFactory processSessionFactory) throws ProcessException {
    }

    @Override
    Collection<ValidationResult> validate(ValidationContext validationContext) {
        return validationContext.getProperties().entrySet().stream()
                .map{e -> e.getKey().validate(e.getValue(), validationContext)}
                .collect(Collectors.toSet())
    }

    @Override
    PropertyDescriptor getPropertyDescriptor(String s) {
        return getPropertyDescriptors().find {p -> p.getName().equalsIgnoreCase(s)}
    }

    @Override
    void onPropertyModified(PropertyDescriptor propertyDescriptor, String s, String s1) {}

    @Override
    List<PropertyDescriptor> getPropertyDescriptors() {
        return Arrays.asList(REQUIRED_PROPERTY, OPTIONAL_PROPERTY)
    }

    @Override
    String getIdentifier() {
        return "Example Processor"›
    }
}

This is the code of a processor that does literally nothing, but if we paste it into a Script Body property of InvokeScriptedProcessor we will see this:

nifi scripted components tomasz nazarewicz getindata — Before (on the left) and After (on the right)

So what are the changes? All the properties defined in the code are visible in the processor, they are not dynamic and have documentation in the Nifi UI. We can also see relationships defined in the code, also with documentation.

What about testing?

If we assume that scripted processors are placed somewhere between scripts and custom components, then testing frameworks are between scripted and custom components. It’s because you can’t do them in the processor's body, you have to set up a project with unit tests. Example:

class ExampleProcSpec extends Specification {

    def setup() {
        runner = TestRunners.newTestRunner(new ExampleProcessor())
        runner.setProperty(ExampleProcessor.REQUIRED_PROPERTY, “some-value”)
    }

    def "test"(){
        given:
        runner.enqueue(input.getBytes("UTF-8"))
        when:
        runner.run(1)
        then:
        runner.getFlowFilesForRelationship(ExampleProc.REL_SUCCESS).size() == 1
    }
}

Show me what you’re made of!

Ok, looks good but now what? So as stated previously this is a step towards custom components, for it to be good it has to have two features:

make migration to custom components easier (the end goal would be to have all things we want as custom)
be in some way better than the scripts (show the benefits of the migration)

Migration aids

There are few benefits on the migration side. Let’s go through them

We don't have to create new deploy pipelines - if we can either paste the code in the script body or use any jar on the worker just like in the case of the scripts.
Less overhead on first development - normally we would have to implement the component class, create some maven project, configure dependencies and deploy it on Nifi + handle any types of issues. With this solution we can skip the project creation and deploy, which sometimes can be the most time consuming part.

Let’s also have a look at what improvements can be made compared to using scripts.

Not only dynamic properties - we can set up validators for property values, make property optional or required and everything that comes with creating custom components.
More types of components - although we can't have all of them, we’re no longer bound to processors only
Easier to follow good practices, easier to keep things explicit - documentation, clear requirements for properties help to keep the flow structure clean

And that’s where the good parts end

The moment you think that maybe InvokeScripted* could replace the custom components, well… here is your bucket of cold water.

Dependency update issue - still present and going strong
No sensitive values - because of the way it’s implemented, it cannot really store sensitive values, more than that if we try to use sensitive parameters, we will probably break the whole flow
Testing - Custom components are still much better for unit testing.

So what do we do now?

In conclusion, if you feel like your scripts could use an upgrade, there you have few arguments to back it up. Remember to take into consideration that scripts are in general faster to create and have their place in the Nifi ecosystem. In the end, every project has its own specification and the decision rests on yours and your teammates shoulders. Cheers!

Would you like to read something more about Apache NiFi? Check out our blog series NiFi Ingestion Blog Series

big data

technology

apache nifi

Last updated: 7 September 2021

Written by

Tomasz Nazarewicz

Data Engineer

Like this post?
Spread the word

Want more? Check our articles

Big Data Event

A Review of the Presentations at the DataMass Gdańsk Summit 2022

The 4th edition of DataMass, and the first one we have had the pleasure of co-organizing, is behind us. We would like to thank all the speakers for…

getindator data engineer as a pirate behind the blue steering w d0d036e9 2016 48da b7bb 6f6c9e6523f0

Tutorial

Kubecost: Cross Charging Costs of Data Processing Pipelines in Data Mesh Architecture

Introduction As organizations increasingly adopt cloud-native technologies like Kubernetes, managing costs becomes a growing concern. With multiple…

Tutorial

Automated Machine Learning (AutoML) with BigQuery ML. Start Machine Learning easily and validate if ML is worth investing in or not.

Machine learning is becoming increasingly popular in many industries, from finance to marketing to healthcare. But let's face it, that doesn't mean ML…

Tech News

6 Big Data Trends For 2021

2020 was a very tough year for everyone. It was a year full of emotions, constant adoption and transformation - both in our private and professional…

Tutorial

Data Mesh as a proper way to organise data world

Data Mesh as an answer In more complex Data Lakes, I usually meet the following problems in organizations that make data usage very inefficient: Teams…

getindata intelligent health modern data platform story 2

Success Stories

How the GID Modern Data Platform’s good practices help us address Intelligent Health data analytics needs in 6 weeks?

Can you build an automated infrastructure setup, basic data pipelines, and a sample analytics dashboard in the first two weeks of the project? The…

Check All

Contact us

Interested in our solutions?
Contact us!

Together, we will select the best Big Data solutions for your organization and build a project that will have a real impact on your organization.

What did you find most impressive about GetInData?

They did a very good job in finding people that fitted in Acast both technically as well as culturally.

Type the form or send a e-mail: hello@getindata.com

NiFi Scripted Components - the missing link between scripts and fully custom stuff

Custom components

So what’s wrong with the scripts?

Development phase, functional limitations

Maintenance phase, good practice hurdles

The devil lies in the implicit

Documentation matters

Custom components to the rescue

So what’s the problem?

Why scripts?

Hurdles of making non-functional improvements

Salvation for migration - scripted components

How do you use it?

What about testing?

Show me what you’re made of!

Migration aids

And that’s where the good parts end

So what do we do now?

Like this post?Spread the word

Want more? Check our articles

A Review of the Presentations at the DataMass Gdańsk Summit 2022

Kubecost: Cross Charging Costs of Data Processing Pipelines in Data Mesh Architecture

Automated Machine Learning (AutoML) with BigQuery ML. Start Machine Learning easily and validate if ML is worth investing in or not.

6 Big Data Trends For 2021

Data Mesh as a proper way to organise data world

How the GID Modern Data Platform’s good practices help us address Intelligent Health data analytics needs in 6 weeks?

Contact us

Interested in our solutions?Contact us!

Like this post?
Spread the word

Interested in our solutions?
Contact us!