
Data Journey with Michał Wróbel (RenoFi) - Doing more with less with a Modern Data Platform and ML at home


In this episode of the RadioData Podcast, Adam Kawa talks with Michał Wróbel about business use cases at RenoFi (a U.S.-based FinTech), its Modern Data Platform built on top of the Google Cloud Platform, and advanced ML/AI models. We also get an insight into the specifics of startup projects.


Host: Adam Kawa, CEO of GetInData | Part of Xebia

Since 2010, Adam has been working with Big Data at Spotify (where he proudly operated one of the largest and fastest-growing Hadoop clusters in Europe), Truecaller and as a Cloudera Training Partner. Nine years ago, he co-founded GetInData | Part of Xebia – a company that helps its customers to become data-driven and builds custom Big Data solutions. Adam is also the creator of many community initiatives like the RadioData podcast, Big Data meetups and the DATA Pill newsletter.

Guest: Michał Wróbel, Lead Software Engineer

Michał has worked as a data engineer since 2015 and has deep expertise in AWS, analytics engineering, AI/ML and end-to-end projects. At RenoFi, he was responsible for delivering several data products and for managing RenoFi’s data platform. He now works at Embedded Insurance as a Lead Software Engineer.


RenoFi Use Case

RenoFi is a platform that helps people who are trying to renovate their properties to borrow more money with the lowest possible monthly repayment. What makes RenoFi different? Usually, when institutions like banks grant loans, they use the present value of the property, without taking into account the value after renovation. RenoFi tries to calculate the post-renovation value of the property, which significantly affects the terms of the loan.


Key quotes

  1. dbt

“I think dbt triggered the biggest change in my data career, because it simplified so many things for so many people. It was revolutionary. I remember the times when we had custom SQL scripts and ran them on a schedule in Airflow: you had to execute them in order, or you had one big file, thousands of lines of sequential SQL. Now, with dbt, you have all the dependencies and all the docs right there in one place.”
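To make the contrast concrete, here is a minimal sketch of what dbt infers automatically from `ref()` calls: a dependency graph between models, from which it derives a safe execution order. The model names are hypothetical; with hand-written SQL scripts you had to maintain this order yourself.

```python
from graphlib import TopologicalSorter

# Hypothetical model dependency graph, as dbt would infer it from ref() calls:
# each model maps to the set of models it selects from.
deps = {
    "stg_loans": set(),                       # staging models on raw data
    "stg_properties": set(),
    "dim_property_values": {"stg_properties"},
    "fct_loan_offers": {"stg_loans", "dim_property_values"},
}

# dbt runs models in an order where every dependency comes first.
run_order = list(TopologicalSorter(deps).static_order())
print(run_order)
```

`graphlib.TopologicalSorter` (standard library, Python 3.9+) guarantees that `stg_properties` runs before `dim_property_values`, which runs before `fct_loan_offers`, without anyone hand-ordering a thousand-line SQL file.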

“Back in the day, you had to write your own custom scripts and keep the state. Hightouch does this for you: you just connect, say what the data source is and what your destination is, and Hightouch will work out what has changed at the source and what to change at the destination. If it didn't change at the destination, it won't even hit and update it. It has nice notifications and is really easy to integrate with dbt and any data platform.”
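The diff-based sync described in the quote can be sketched in a few lines. This is not Hightouch's actual implementation, just an illustration of the idea: compare each source row against the last state pushed downstream and only send rows that are new or changed. All names and fields are hypothetical.

```python
def plan_sync(source_rows, destination_state, key="id"):
    """Return only the rows that need to be inserted or updated downstream."""
    changes = []
    for row in source_rows:
        previous = destination_state.get(row[key])
        if previous != row:          # unchanged rows are skipped entirely
            changes.append(row)
    return changes

source = [
    {"id": 1, "email": "a@example.com", "stage": "lead"},
    {"id": 2, "email": "b@example.com", "stage": "customer"},
]
# id 1 was already synced in an earlier run, so only id 2 goes out.
state = {1: {"id": 1, "email": "a@example.com", "stage": "lead"}}

print(plan_sync(source, state))
```

Keeping that `state` between runs is exactly the bookkeeping the custom scripts used to do by hand.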

  2. Data

"There is this common problem of training/serving skew, where you train your model on the data you have available in the warehouse. But at prediction time, the client of the model might send a feature in a different shape or format than the model expects. It could be upper case, lower case, not normalised, etc. It could also differ because dbt transformed it from its raw shape and the data scientist wasn't aware of that.

So this is one thing you need to remember. A common solution for this is a feature store, which adds massive complexity to the system, because you need someone to develop and maintain the solution, and you need to change the way you work."
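A minimal sketch of the skew problem and the lightweight fix: share one normalisation function between training and serving. The "model" here is a trivial lookup table standing in for a real one, and all names are hypothetical.

```python
def normalise(value: str) -> str:
    """The single transform applied both in the warehouse and at serving time."""
    return value.strip().lower()

# "Training": features in the warehouse were already normalised by dbt.
training_rows = {"kitchen": 1, "bathroom": 0}

def predict(raw_feature: str) -> int:
    # Without normalise(), a client sending "Kitchen " would miss the table,
    # silently producing the fallback value: that mismatch is the skew.
    return training_rows.get(normalise(raw_feature), -1)

print(predict("Kitchen "))   # matches despite the different raw shape
```

A feature store solves the same problem more thoroughly, at the cost of the extra infrastructure the quote warns about.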

A feature store can solve several machine learning problems. In this ebook you will find out what these problems are and how a well-designed feature store solves them, along with a step-by-step tutorial.


  3. Startups & business

"In startups, you don't usually start with data teams. You need to have a product, develop it, try to improve it and find a market; then, when the startup is successful, maybe set up a data team. At RenoFi this was a little bit different, because our CTO has had a data background from day one and made really good decisions. The data team at RenoFi is pretty small, but it was set up quite early on."

"In startups, things change quicker than in other companies, so change is the main thing in startups." 

"Startups can do more with less, and by less, I mean fewer people. And yeah, sometimes you don't need to go for a brand new shiny solution like ML, you can just use heuristics which should be good enough and provide adequate business results to a company, because the cost of a proper ML solution is really high."

"There's a plethora of external cloud services that we also use. Almost every startup, I would say, has internal databases and external cloud services, and from the business perspective, the CEO and the management team would like to have all that data in one place. Obviously it goes by different names these days, whether it's a warehouse, a data lake or a data lakehouse. Whatever you call it, it's a big database which contains all of your data."

"So having all these solutions in a small team means incurring huge costs. And don't even get us started on maintenance costs! You can set it up and it's all fine, that part is easy. But then you need to update it, make sure it's worth it and set up monitoring. I wouldn't say this is feasible for a small team; if you want it running at a good quality level, you would have to employ someone who doesn't have a life outside of work."

"Thankfully, building real-time solutions or powerful ML / AI models has become simpler and cheaper, thanks to new technologies and new tools. So it's likely that in a few years it will be no problem to use them by default, because the additional costs and efforts to build them will be relatively small."

"A good recommendation is not to build from scratch if you don't have the expertise in-house, but to hire someone, at least for a few months, to set it up and bring you the best practices currently on the market. Then, for the people you keep in-house, you should focus on analytics engineers. By analytics engineers, I mean people who work with dbt and can code well enough to work effectively with standard programming practices and tests. This approach worked great at RenoFi."

"So it's about having a very pragmatic approach and focusing first on the most critical functionalities, because they usually bring the most value. However, there are companies that must develop real-time, online machine learning solutions from day one in order to just exist, because their core business model requires it. One example is Free Now, a multi-mobility company, or Uber, Bolt or a similar app. They need to calculate the price of a ride dynamically, in real time, based on actual supply and demand, and this changes all the time. They also need to predict the estimated time of driver arrival, and the same goes for the estimated duration of your ride, so that you know when you will reach your destination. If the apps do this well, they will obtain drivers and customers and will earn money; if they do this badly, they will simply lose money. So in their case, real-time and machine learning are must-have solutions which need to be invested in and improved constantly, especially at scale."
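The supply-and-demand pricing the quote describes can be illustrated with a toy heuristic. Real ride-hailing platforms use ML models over live signals; this clamped demand/supply ratio is purely illustrative, and all function names and numbers are made up.

```python
def surge_multiplier(open_requests: int, available_drivers: int,
                     floor: float = 1.0, cap: float = 3.0) -> float:
    """Scale price with demand pressure, bounded between floor and cap."""
    if available_drivers == 0:
        return cap                      # no supply at all: maximum surge
    ratio = open_requests / available_drivers
    return max(floor, min(cap, ratio))

def ride_price(base_fare: float, requests: int, drivers: int) -> float:
    return round(base_fare * surge_multiplier(requests, drivers), 2)

print(ride_price(10.0, requests=50, drivers=25))   # demand is 2x supply
print(ride_price(10.0, requests=5, drivers=25))    # slack supply, base fare
```

This is also an example of the earlier point about heuristics: a rule like this can be "good enough" long before a company can afford a proper real-time ML pipeline.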

"Vendors such as dbt Labs or the Google Cloud Platform, if they are successful, have very big leverage for their solutions, because they can have hundreds or even thousands of user companies. So it's cost-efficient and makes sense for them to invest in their solutions, thanks to this economy of scale. They can keep improving them by adding new functionalities, especially ones that you won't be able to implement by yourself, as it would sometimes simply be too expensive to develop some kind of custom-made big feature only for yourself; you don't have the same economy of scale and leverage that they have."


References:

dbt Coalesce 2022 playlist


These are just snippets from the entire conversation which you can listen to here.

Subscribe to the Radio Data podcast to stay up-to-date with the latest technology trends and to discover the most interesting data use cases!

28 February 2023

