
Data Journey with Michał Wróbel (RenoFi) - Doing more with less with a Modern Data Platform and ML at home


In this episode of the RadioData Podcast, Adam Kawa talks with Michał Wróbel about business use cases at RenoFi (a U.S.-based FinTech), the Modern Data Platform built on top of the Google Cloud Platform, and advanced ML/AI models. We also get an insight into the specifics of startup projects.


Host: Adam Kawa, CEO of GetInData | Part of Xebia

Since 2010, Adam has been working with Big Data at Spotify (where he proudly operated one of the largest and fastest-growing Hadoop clusters in Europe), Truecaller and as a Cloudera Training Partner. Nine years ago, he co-founded GetInData | Part of Xebia – a company that helps its customers to become data-driven and builds custom Big Data solutions. Adam is also the creator of many community initiatives like the RadioData podcast, Big Data meetups and the DATA Pill newsletter.

Guest: Michał Wróbel, Lead Software Engineer

Michał has worked as a data engineer since 2015. Michał has a lot of expertise in AWS, analytics engineering, AI/ML and end-to-end projects. At RenoFi, Michał was responsible for delivering several data products and for the management of RenoFi’s data platform. Now Michał works at Embedded Insurance as Lead Software Engineer.


RenoFi Use Case

RenoFi is a platform that helps people who are trying to renovate their properties to borrow more money with the lowest possible monthly repayment. What makes RenoFi different? Usually, when institutions like banks grant loans, they use the present value of the property, without taking into account the value after renovation. RenoFi tries to calculate the post-renovation value of the property, which significantly affects the terms of the loan.


Key quotes

  1. dbt 

“I think dbt triggered the biggest change in my data career, because it simplified so many things for so many people. It was revolutionary. I remember the times when we had custom SQL scripts that we ran on a schedule in Airflow, and you had to execute them in order. You had one big file, thousands of lines - just sequential SQL. Now, with dbt, you have all the dependencies and all the docs right there in one place.”
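To make the point about dependencies concrete, here is a minimal sketch of how a run order can be derived from declared dependencies instead of being maintained by hand in one sequential file. It is not how dbt is implemented internally. The model names are hypothetical, and the dependency map is written out manually, whereas dbt infers it from `ref()` calls in the SQL.

```python
from graphlib import TopologicalSorter

# Hypothetical models: each maps to the set of models it selects from.
# In dbt these edges come from ref() calls inside the SQL; here they are
# written out by hand purely for illustration.
MODEL_DEPENDENCIES = {
    "stg_loans": set(),
    "stg_properties": set(),
    "int_loans_with_properties": {"stg_loans", "stg_properties"},
    "mart_loan_summary": {"int_loans_with_properties"},
}


def execution_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return an order in which every model runs after its upstream models."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for name in execution_order(MODEL_DEPENDENCIES):
        print(name)
```

With the dependencies declared per model, adding a new model only means stating what it reads from. The order of execution follows from the graph rather than from the position of a statement in a long script.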

“Back in the day, you had to write your own custom scripts and keep the state. Hightouch does this for you: you just connect, say what the data source is and what your destination is, and Hightouch will handle what has changed on the source and how it should change on the destination. If something didn't change, then it wouldn't even hit the destination and update it. It has nice notifications and is really easy to integrate with dbt and any data platform.”
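The "keep the state" part of such custom scripts can be sketched as follows. This is only an illustration of state-based change detection in general, not a description of how Hightouch works internally. The function names, the row layout and the idea of storing one fingerprint per record are all assumptions made for the example.

```python
import hashlib
import json


def row_fingerprint(row: dict) -> str:
    """Stable hash of a row's contents, independent of key order."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def plan_sync(source_rows: dict[str, dict], last_synced: dict[str, str]):
    """Compare current source rows with fingerprints saved after the last sync.

    Returns (to_create, to_update, to_delete, new_state). Unchanged rows
    appear in none of the first three lists, so the destination is not
    called for them at all.
    """
    new_state = {key: row_fingerprint(row) for key, row in source_rows.items()}
    to_create = [k for k in new_state if k not in last_synced]
    to_update = [
        k for k in new_state
        if k in last_synced and new_state[k] != last_synced[k]
    ]
    to_delete = [k for k in last_synced if k not in new_state]
    return to_create, to_update, to_delete, new_state


if __name__ == "__main__":
    # Hypothetical records keyed by an ID.
    first_run = {"a": {"email": "a@example.com"}, "b": {"email": "b@example.com"}}
    _, _, _, state = plan_sync(first_run, last_synced={})

    second_run = {"a": {"email": "a@example.com"}, "b": {"email": "new@example.com"}}
    print(plan_sync(second_run, state)[:3])
```

The saved `new_state` is what a hand-written script would have to persist between runs, which is exactly the bookkeeping the quote describes handing over to a tool.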

  2. Data

"There is this common problem with training prediction skew, where you might train your model on the data that you have available in the warehouse.

But when you try to predict, the client of this model might send a feature in a different shape or format than the model expects.

It could be upper case instead of lower case, not normalised, etc. It could also be different because you transformed it from its raw shape in dbt and the data scientist wasn't aware of it.

So this is one thing that you need to remember. A common solution for this is a feature store, which adds massive complexity to the system, because you need to have someone to develop the solution and maintain it and you need to change your experience."
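A lighter-weight way to reduce this kind of skew, short of a full feature store, is to route both the training data and the incoming prediction requests through the same preparation function. The sketch below assumes this approach. The feature names (`state`, `property_type`, `renovation_cost`) are hypothetical and are not taken from RenoFi's models.

```python
def normalise_features(raw: dict) -> dict:
    """Single source of truth for feature preparation (hypothetical fields)."""
    return {
        "state": str(raw["state"]).strip().upper(),
        "property_type": str(raw["property_type"]).strip().lower().replace(" ", "_"),
        "renovation_cost": float(raw["renovation_cost"]),
    }


def build_training_rows(warehouse_rows: list[dict]) -> list[dict]:
    """Prepare rows read from the warehouse before fitting a model."""
    return [normalise_features(row) for row in warehouse_rows]


def prepare_request(payload: dict) -> dict:
    """Prepare a client payload at prediction time with the same logic."""
    return normalise_features(payload)


if __name__ == "__main__":
    warehouse_row = {"state": "pa", "property_type": "Single Family", "renovation_cost": "50000"}
    client_payload = {"state": " PA ", "property_type": "single family ", "renovation_cost": 50000}
    # Both paths produce the same feature values despite different raw shapes.
    print(build_training_rows([warehouse_row])[0] == prepare_request(client_payload))
```

This does not remove the need to agree on where transformations live (for example in dbt versus in the model code), but it keeps the casing and formatting rules in one place that both sides call.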

A Feature Store could be a solution for several machine learning problems. Our ebook explains what these problems are and how a well-designed feature store can solve them, and it includes a step-by-step tutorial.


  3. Startups & business

"In startups, you don't usually start with data teams. You need to have a product and then develop it, try to improve it, find a market and then when a startup is successful, maybe set up a data team. At RenoFi this was a little bit different because our CTO has had a data background from day one and made really good decisions. At RenoFi, the data team is pretty small, but we started quite early from the beginning."

"In startups, things change quicker than in other companies, so change is the main thing in startups." 

"Startups can do more with less, and by less, I mean fewer people. And yeah, sometimes you don't need to go for a brand new shiny solution like ML, you can just use heuristics which should be good enough and provide adequate business results to a company, because the cost of a proper ML solution is really high."

"There's a plethora of cloud services and external cloud services that we also use. So almost every startup, I would say, has internal databases and external cloud services, from the business perspective and from the CEO perspective, the management team - would like to have all that data in one place. So obviously you have different names these days, whether it's a warehouse, whether it's a data lake or data lakehouse. Whatever, it's a big database which contains all of your data."

"So having all the solutions in small teams means incurring huge costs.  And don't even get us started on maintenance costs!  So you can set it up and it's all fine. It's easier. But then you need to update it. You need to make sure it's worth it. You need to set up monitoring. I wouldn't say this is feasible for a small team, furthermore if you want to have it running on a good quality level, then you would have to employ someone that doesn't have a life outside of work."

"Thankfully, building real-time solutions or powerful ML / AI models has become simpler and cheaper, thanks to new technologies and new tools. So it's likely that in a few years it will be no problem to use them by default, because the additional costs and efforts to build them will be relatively small."

"A good recommendation is that you don't build from scratch if you don't have the expertise in-house, but hire someone at least for a few months, to set it up to give you the best practices that are currently on the market. Maybe you would still have someone available from here, and then when required in-house, so what you should focus on are the analytics engineers. And by analytics engineers, I mean people that work with dbt and know enough in order to be able to code, who can work effectively within common line standard programming practices and tests. Therefore,  RenoFi is just great."

"So it's about having a very pragmatic approach and focusing first on the most critical functionalities, because they usually bring the most value. However, there are companies that must develop real-time online machine learning solutions from day one in order to just exist, because their core business model requires them to do so. So one example is Free Now, a multi-mobility company, or Uber or Bolt or a similar app. They need to calculate the price of a ride dynamically in real-time, based on actual supply and demand, and this changes all the time. They also need to predict the estimated time of driver arrival. The same goes for the estimated duration of your ride, so that you know when you will reach your destination and so on. If the apps do this well, they will obtain drivers, customers and will earn money. But if they do this badly, they will simply lose money. So in their case, real-time and machine learning is a must-have solution which needs to be invested in and improved on constantly, especially at scale."

"Usually, if you have vendors such as dbt Labs or the Google Cloud Platform, then if they are successful, they have a very big leverage for their solutions because they can have like hundreds or even thousands of user companies. So it's cost efficient and makes sense to them to invest in their solutions, thanks to this economy of scale. So they can keep improving them by adding new functionalities, especially the ones that you won't be able to implement by yourself, as sometimes it would be simply too expensive to develop some kind of custom-made or big feature only for yourself, because you won't have this economy of scale and the same leverage as they have."


References:

dbt Coalesce 2022 playlist


These are just snippets from the entire conversation which you can listen to here.

Subscribe to the RadioData Podcast to stay up-to-date with the latest technology trends and to discover the most interesting data use cases!

28 February 2023
