In this episode of the RadioData Podcast, Adam Kawa talks with Michał Wróbel about business use cases at RenoFi (a U.S.-based FinTech), the Modern Data Platform on top of the Google Cloud Platform and advanced ML/AI models. We will also get an insight into what the specifics of startup projects are.
Host: Adam Kawa, CEO of GetInData | Part of Xebia
Since 2010, Adam has been working with Big Data at Spotify (where he proudly operated one of the largest and fastest-growing Hadoop clusters in Europe), Truecaller and as a Cloudera Training Partner. Nine years ago, he co-founded GetInData | Part of Xebia – a company that helps its customers to become data-driven and builds custom Big Data solutions. Adam is also the creator of many community initiatives like the RadioData podcast, Big Data meetups and the DATA Pill newsletter.
Guest: Michał Wróbel, Lead Software Engineer
Michał has worked as a data engineer since 2015 and has a lot of expertise in AWS, analytics engineering, AI/ML and end-to-end projects. At RenoFi, he was responsible for delivering several data products and for managing RenoFi's data platform. He now works at Embedded Insurance as a Lead Software Engineer.
RenoFi is a platform that helps people who are trying to renovate their properties to borrow more money with the lowest possible monthly repayment. What makes RenoFi different? Usually, when institutions like banks grant loans, they use the present value of the property, without taking into account the value after renovation. RenoFi tries to calculate the post-renovation value of the property, which significantly affects the terms of the loan.
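To make the difference concrete, here is a minimal sketch of why the post-renovation value matters for borrowing power. All numbers and the 80% loan-to-value ratio are illustrative assumptions, not RenoFi's actual underwriting rules:

```python
# Illustrative sketch: lending against post-renovation value vs. current value.
# The 80% LTV ratio and all dollar amounts are assumptions for illustration only.

def max_loan(property_value: float, ltv_ratio: float = 0.80) -> float:
    """Maximum borrowable amount at a given loan-to-value ratio."""
    return property_value * ltv_ratio

current_value = 300_000          # appraised value today
post_renovation_value = 400_000  # estimated value after renovation

traditional = max_loan(current_value)           # lender uses today's value
renofi_style = max_loan(post_renovation_value)  # lender uses post-renovation value
extra_borrowing_power = renofi_style - traditional
```

Under these assumed numbers, using the post-renovation value raises the maximum loan from $240,000 to $320,000, which is exactly the kind of difference that changes the terms of the loan.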
“I think dbt triggered the biggest change in my data career, because it simplified so many things for so many people. It was revolutionary. I remember the times when we had custom SQL scripts and ran them on a schedule in Airflow: you had to execute them in order, or you had one big file of thousands of lines - just sequential SQL. Now, with dbt you have all the dependencies and all the docs right there in one place.”
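Under the hood, what dbt does with the dependencies declared via `ref()` amounts to a topological sort of the model graph. Here is a minimal sketch of that idea (the model names are hypothetical, and this is a simplification of what dbt actually does):

```python
# Sketch of dependency-ordered execution, the core idea behind dbt's DAG.
# Model names are hypothetical; in dbt, edges come from ref() calls in SQL.
from graphlib import TopologicalSorter  # Python 3.9+

deps = {
    "stg_loans": set(),
    "stg_customers": set(),
    "loans_enriched": {"stg_loans", "stg_customers"},  # depends on both staging models
    "monthly_report": {"loans_enriched"},
}

# static_order() yields each model only after all of its dependencies.
run_order = list(TopologicalSorter(deps).static_order())
```

Instead of one long sequential SQL file, each model declares what it depends on, and a valid execution order falls out automatically.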
“Back in the day, you had to write your own custom scripts and keep the state. Hightouch does this for you: you just connect, say what the data source is and what your destination is, and Hightouch will work out what has changed on the source and how it changed on the destination. If a row hasn't changed on the destination, it won't even hit and update it. It also has nice notifications and is really easy to integrate with dbt and any data platform.”
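The state-keeping that Hightouch handles boils down to diffing source rows against destination rows and only applying the changes. A minimal sketch of that idea, with made-up row data (not Hightouch's actual API or internals):

```python
# Sketch of incremental sync planning: diff source vs. destination, keyed by id,
# and emit only the changes. Row data below is made up for illustration.

def plan_sync(source: dict, destination: dict) -> dict:
    """Return only the changes to apply; unchanged rows are skipped entirely."""
    inserts = {k: v for k, v in source.items() if k not in destination}
    updates = {k: v for k, v in source.items()
               if k in destination and destination[k] != v}
    deletes = [k for k in destination if k not in source]
    return {"insert": inserts, "update": updates, "delete": deletes}

source = {1: {"email": "a@x.com"}, 2: {"email": "b@x.com"}}
dest = {1: {"email": "a@x.com"}, 3: {"email": "old@x.com"}}
plan = plan_sync(source, dest)
# Row 1 is unchanged, so it is never touched; row 2 is new; row 3 is stale.
```

Writing and maintaining exactly this kind of state-tracking by hand is the custom-script work the quote describes.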
"There is this common problem of training/serving skew, where you train your model on the data that you have available in the warehouse. But when you try to predict, the client of this model might send a feature in a different shape or format than the model expects. It could be upper case instead of lower case, not normalised, and so on. It could also be different because in dbt you transformed it from its raw shape and the data scientist wasn't aware of that. So this is one thing that you need to remember. A common solution for this is a feature store, but it adds massive complexity to the system, because you need someone to develop and maintain it, and you need to change how you work."
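A lighter-weight mitigation than a full feature store is to share one normalization function between training and serving, so both sides see features in the same shape. A minimal sketch with hypothetical feature names:

```python
# Sketch of avoiding training/serving skew: one shared normalization step.
# Feature names ("state", "loan_amount") are hypothetical examples.

def normalize_features(raw: dict) -> dict:
    """Single transformation applied both when building the training set
    and when handling a prediction request."""
    return {
        "state": raw["state"].strip().upper(),      # e.g. " ca " -> "CA"
        "loan_amount": float(raw["loan_amount"]),   # "250000" and 250000 both -> 250000.0
    }

# Training reads from the warehouse, serving reads from the request payload,
# but both go through the same function:
train_row = normalize_features({"state": " ca ", "loan_amount": "250000"})
serve_row = normalize_features({"state": "CA", "loan_amount": 250000})
```

The skew described in the quote arises precisely when these two code paths diverge; a feature store solves the same problem at much greater cost.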
A feature store could be a solution for several machine learning problems. In this ebook you will find out what these problems are and how a well-designed feature store can solve them, along with a step-by-step tutorial.
"In startups, you don't usually start with data teams. You need to have a product, then develop it, try to improve it and find a market, and then, when the startup is successful, maybe set up a data team. At RenoFi this was a little bit different, because our CTO had a data background and made really good decisions from day one. The data team at RenoFi is pretty small, but we started it quite early."
"In startups, things change quicker than in other companies, so change is the main thing in startups."
"Startups can do more with less, and by less, I mean fewer people. And yeah, sometimes you don't need to go for a brand new shiny solution like ML, you can just use heuristics which should be good enough and provide adequate business results to a company, because the cost of a proper ML solution is really high."
"There's also a plethora of external cloud services that we use. Almost every startup, I would say, has internal databases and external cloud services, and from the business perspective, the CEO and the management team would like to have all that data in one place. You have different names for it these days, whether it's a warehouse, a data lake or a data lakehouse. Whatever you call it, it's a big database which contains all of your data."
"So having all these solutions in a small team means incurring huge costs. And don't even get us started on maintenance costs! You can set it up and it's all fine, that part is easier. But then you need to update it, make sure it's worth it and set up monitoring. I wouldn't say this is feasible for a small team, and if you want to have it running at a good quality level, you would have to employ someone who doesn't have a life outside of work."
"Thankfully, building real-time solutions or powerful ML / AI models has become simpler and cheaper, thanks to new technologies and new tools. So it's likely that in a few years it will be no problem to use them by default, because the additional costs and efforts to build them will be relatively small."
"A good recommendation is: don't build from scratch if you don't have the expertise in-house. Hire someone, at least for a few months, to set it up and bring you the best practices that are currently on the market, and then keep someone available when required. In-house, what you should focus on are analytics engineers. And by analytics engineers, I mean people who work with dbt and know enough to be able to code, who can work effectively with standard programming practices and tests. That approach has worked really well at RenoFi."
"So it's about having a very pragmatic approach and focusing first on the most critical functionalities, because they usually bring the most value. However, there are companies that must develop real-time online machine learning solutions from day one in order to just exist, because their core business model requires them to do so. So one example is Free Now, a multi-mobility company, or Uber or Bolt or a similar app. They need to calculate the price of a ride dynamically in real-time, based on actual supply and demand, and this changes all the time. They also need to predict the estimated time of driver arrival. The same goes for the estimated duration of your ride, so that you know when you will reach your destination and so on. If the apps do this well, they will obtain drivers, customers and will earn money. But if they do this badly, they will simply lose money. So in their case, real-time and machine learning is a must-have solution which needs to be invested in and improved on constantly, especially at scale."
"Usually, if you have vendors such as dbt Labs or the Google Cloud Platform, then if they are successful, they have a very big leverage for their solutions because they can have like hundreds or even thousands of user companies. So it's cost efficient and makes sense to them to invest in their solutions, thanks to this economy of scale. So they can keep improving them by adding new functionalities, especially the ones that you won't be able to implement by yourself, as sometimes it would be simply too expensive to develop some kind of custom-made or big feature only for yourself, because you won't have this economy of scale and the same leverage as they have."
These are just snippets from the entire conversation which you can listen to here.
Subscribe to the RadioData podcast to stay up-to-date with the latest technology trends and to discover the most interesting data use cases!