In this episode of the RadioData Podcast, Adam Kawa talks with Yetunde Dada & Ivan Danov about QuantumBlack, Kedro and trends in the MLOps landscape, such as the sheer number of MLOps tools and the rise of LLMOps. We encourage you to listen to the whole podcast or, if you prefer reading, skip to the key takeaways listed below.
_________________
Host: Adam Kawa, GetInData | Part of Xebia CEO
Since 2010, Adam has been working with Big Data at Spotify (where he proudly operated one of the largest and fastest-growing Hadoop clusters in Europe), Truecaller and as a Cloudera Training Partner. Nine years ago, he co-founded GetInData | Part of Xebia – a company that helps its customers to become data-driven and builds custom Big Data solutions. Adam is also the creator of many community initiatives such as the RadioData podcast, Big Data meetups and the DATA Pill newsletter.
Guests: Yetunde Dada & Ivan Danov
Yetunde is a Product Director at QuantumBlack and has been in the company for almost 4 years.
Ivan is a Software Engineer and has been working for QuantumBlack for 6 years. He has been working on Kedro since the beginning.
_________________
QuantumBlack, a McKinsey company, is a data science and advanced analytics company that works with customers from various industries. QuantumBlack was founded in 2009 and has its headquarters in London, United Kingdom. The company became a part of McKinsey & Company, a global management consulting firm in 2015, and now operates as part of McKinsey's global analytics practice.
_________________
Kedro is an open-source, Python workflow development framework that helps ML practitioners write maintainable and modular analytics code which is production ready. It achieves this by enabling teams to adopt software engineering best practices.
Most companies have separate research and production units. Research units often work with Jupyter notebooks and are responsible for inventing new solutions, whereas production units try to implement their work and run it in a production environment.
Kedro tries to give everyone, regardless of the team, the same level of software engineering practices, which makes the code production ready right from the start, or with much less refactoring than would be needed without those practices.
As a result, many of the data engineering and data science prototypes that teams write are production ready right from the start, or become production ready with a little bit of work. In the end this approach brings more value to the company.
If you are a data scientist, you may want to pick up Kedro because you collaborate with other team members and want to write well structured code that is easy to share, maintain and understand.
ML engineers often pick up Kedro because it helps them to create an environment where other team members can write prototypes in a specific, well structured way. It allows the users to build software that is easily scalable and can be run in different environments.
If you are a data engineer, you are involved in creating large scale feature engineering pipelines or some form of data cleaning. Kedro can provide a well structured workflow for those types of tasks.
The last group that benefits from Kedro is project leads. Kedro Viz can give you a bird's-eye view of the pipeline structure.
The code that is produced with Kedro is more modular and usable across different projects.
The example from QuantumBlack is that before Kedro, each team developed its part of the code in a different programming language, and the subsequent integration was a nightmare. The code was hard to read and understand, and it was complicated to move it between different projects and environments.
When Kedro was introduced, it provided a common language that everyone could use to communicate with each other about the project. It offered a level of abstraction that helped with communication about data engineering tasks. Teams could suddenly talk about developing a "node" or a "pipeline" and everyone had a common understanding of what that meant.
It also presented a common code base structure which was cleaner and easier to work with across multiple teams.
As a consequence of that, they started to build bigger and bigger projects which were more usable. They have been able to industrialize the way that they write machine learning code at QuantumBlack thanks to Kedro.
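To make the "node" and "pipeline" vocabulary more concrete, here is a minimal conceptual sketch in plain Python. It deliberately does not use Kedro's own API: the `Node` class and the `run_pipeline` function are hypothetical names introduced only for illustration. The idea it shows is that a node is a plain function with named inputs and a named output, and a pipeline is a set of nodes whose execution order follows from those names.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Node:
    """Hypothetical stand-in for a node: a function plus named inputs and one named output."""
    func: Callable[..., Any]
    inputs: List[str]
    output: str


def run_pipeline(nodes: List[Node], data: Dict[str, Any]) -> Dict[str, Any]:
    """Run nodes in dependency order, regardless of the order they were declared in."""
    data = dict(data)  # do not mutate the caller's datasets
    pending = list(nodes)
    while pending:
        # A node is ready once every dataset it reads is available.
        ready = [n for n in pending if all(name in data for name in n.inputs)]
        if not ready:
            raise ValueError("a dataset is missing or the nodes form a cycle")
        for n in ready:
            data[n.output] = n.func(*(data[name] for name in n.inputs))
            pending.remove(n)
    return data


def clean(raw: List[float]) -> List[float]:
    """Drop negative readings."""
    return [x for x in raw if x >= 0]


def average(values: List[float]) -> float:
    return sum(values) / len(values)


# Declared out of order on purpose: the dataset names decide the execution order.
pipeline = [
    Node(average, ["clean_readings"], "mean_reading"),
    Node(clean, ["raw_readings"], "clean_readings"),
]

result = run_pipeline(pipeline, {"raw_readings": [4.0, -1.0, 6.0]})
print(result["mean_reading"])  # prints 5.0
```

Because each step is an ordinary function wired up only by dataset names, individual nodes can be tested, reused and discussed in isolation, which is the kind of shared structure described above.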
One of the companies that benefited greatly from Kedro is Telkomsel, Indonesia's largest communications company. Telkomsel used Kedro in several of their data engineering projects, and the benefits they emphasize are improved collaboration, configuration management and visualization of data pipelines.
Data scientists who did not have a software engineering background become better software engineers by using Kedro.
Kedro has over 8 thousand stars on GitHub. The growth of the project was largely organic. There are over 1.6 thousand projects that depend on Kedro, and the number is growing. The project also has almost 180 contributors.
Hundreds of companies use Kedro. Some of the most notable ones are listed in the README on the project's main GitHub page, for example: Absa, AXA UK, NASA, ING, GetInData and AMAI GmbH.
If you are collaborating with others and are building a data engineering or data science pipeline, then Kedro is for you. Kedro supports creating code that should be deployed to production.
Kedro is designed to be platform agnostic. It provides freedom in writing data pipeline code without having to worry about which cloud provider it is going to be used with. There are a number of plugins (some of which, like Kedro VertexAI and Kedro AzureML, are developed by GetInData) which enable different data sources and data platforms / cloud providers to be used with Kedro, and which provide a level of abstraction that helps to write more modular and reusable code.
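The platform-agnostic idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Kedro's API: the `catalog` dictionaries, the `run` helper and the `users` dataset name are all hypothetical. The point it shows is that the pipeline logic asks for a dataset by name, and each environment decides where that dataset actually comes from.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def count_active(users: List[dict]) -> int:
    """Pipeline logic: knows nothing about where the data is stored."""
    return sum(1 for user in users if user["active"])


def run(catalog: Dict[str, Callable[[], List[dict]]]) -> int:
    """Load the 'users' dataset by name and apply the same logic to it."""
    return count_active(catalog["users"]())


# Environment 1 (hypothetical): data held in memory, e.g. during prototyping.
memory_catalog = {"users": lambda: [{"active": True}, {"active": False}]}

# Environment 2 (hypothetical): the same dataset name backed by a JSON file.
users_file = Path(tempfile.mkdtemp()) / "users.json"
users_file.write_text(
    json.dumps([{"active": True}, {"active": True}, {"active": False}])
)
file_catalog = {"users": lambda: json.loads(users_file.read_text())}

print(run(memory_catalog), run(file_catalog))  # prints 1 2
```

Swapping the storage backend changes only the catalog entry, while `count_active` stays untouched. A plugin for a particular cloud platform plays a comparable role at a larger scale.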
Kedro wants to be a bridge between data scientists and production. They want data scientists to have a uniform experience, regardless of what they are developing and which cloud provider is going to be used to run the code.
Kedro is supposed to be governed by the community and all of the QuantumBlack work is done in public. You can see the GitHub issues that they are working on and the milestones that they are currently trying to achieve.
Kedro has an opt-in telemetry plugin that sends usage data back to the team's database, so that they can see which commands are used more often and which are not. This helps the Kedro development team decide what to focus on next.
As for what is coming next in the Kedro project, QuantumBlack is working on improving templating and configuration management in newly created Kedro projects.
They also want to improve existing features and make sure that they are working as intended. They will also focus more on integration with Databricks, SageMaker and AzureML. They want to equip their users with appropriate tools to work with those services.
The Kedro Viz project is also supposed to see improvements in visualizing dashboards and pipelines.
They also plan to improve Kedro online courses and documentation that will explain the basics of Kedro and how to take advantage of its features.
Regarding MLOps tooling, it seems that there are too many tools, and they predict either convergence or clear dominant players taking the stage in certain areas of MLOps.
They also expect to see new literature about best practices and code quality in Data Science projects, similar to what already exists for Software Engineering.
They also cannot ignore that right now there is a lot of talk about ChatGPT and new language models, which is probably going to be a trend in the upcoming years.
At QuantumBlack they have a product that covers a similar field to the one created by Iguazio, and they plan to join forces to create a better solution together.
They want Kedro to run natively together with Iguazio's solution. Their goal is to achieve such a level of integration that as few steps as possible are needed from both groups of users to transfer a project from one environment to another.
Another benefit is that they have acquired another platform that Kedro runs on, which brings more experience and a better understanding of what Kedro should look like in order to be more flexible and useful in different scenarios. Even so, Kedro is still going to be a platform-agnostic tool.
___________________
These are just snippets from the entire conversation which you can listen to here:
Want to learn more about Kedro? Check out the following articles, tutorials and case-studies: