Tutorial
5 min read

From spreadsheets to automated data pipelines - and how this can be achieved with support of Google Cloud

CSVs and XLSXs files are one of the most common file formats used in business to store and analyze data. Unfortunately, such an approach is not scalable and it becomes more and more difficult to provide all team members access to one common file in which they can cooperate and share the results of their work with different teams.

Surely, there are available solutions to implement  in the spreadsheet files in real-time, but it’s still difficult to share data with different teams, process it or decide about the target data format, especially when using autodetection data formatting.

Where should we start?

The first aspect is about preparing the landscape of all processed spreadsheets. We start with the most important thing: understanding the data, verifying which information is important for target users and therefore  possibly reducing  the amount of processed data by deleting unused data.

The second part is focused on understanding how to create new spreadsheets - can we automate that step? Can it be only done by users manually? How frequently is data uploaded? How can we verify if there are any changes in the data schema?

When we have defined the logistics and we know what the input and desired output will be, we can hop over to the next step to define how to process data, create some aggregations for example, save data in the target database or storage, clean data and how to deliver output to the target users.

It is necessary to mention that we must know if there are any tools used by different teams to analyze or visualize data further. If users utilise one solution, it might be worth implementing in our new process to simplify the users’ onboarding process.

The last step is about adding a monitoring layer. Who can take this action if there's a problem with source data and how should we notify the analysts? How can we check data quality? What should we do to avoid human error in the case of a manual process? We should implement metrics reporters to our application and queries to detect incorrect records or those  with data that is too varied. We can create alerts and dashboards based on the findings.

The multistage process 

Public cloud such as Google Cloud Platform helps companies to improve their data pipelines and move quickly from local Excel development to scalable tools. It makes work faster, more efficient and with no human errors or problems with data formatting.

The first step is about data ingestion. The perfect place to store raw, unprocessed data is Google Cloud Storage. Users can upload data there or add a sync script between Cloud Storage and some remote drives. Here we start the journey with process alignment and data integration from multiple sources.

For data processing pipelines, we can go with multiple solutions. Due to the different use cases in each project, the best way is to create a custom Python script(s) to process data while the scripts themselves can be scheduled by tools like Google Cloud Composer (managed Apache Airflow), self-managed Apache Airflow, Google Cloud Tasks, Google Cloud Scheduler or even a mix of Cloud Pub/Sub with Cloud Functions.

In the example scenario, we use Composer with Python scripts executed on the Kubernetes pods of the Composer that is the most flexible solution and can easily be extended in the future.

As the final part of the CSVs and XLSXs processing platform, we need to ingest processed data somewhere. It depends on the exact use cases, the most common of which can be solved by inserting data into BigQuery which works perfectly as the data warehouse and can be used as the engine for Business Intelligence. The performance is great.

Last but not least, everything must be managed by the Infrastructure-as-a-Code. A mix of Terraform and CICD tools like GitHub Actions or GitLab Ci helps in making it happen fast and provides possibilities to easily manage infrastructure. If you want to read something more about terraform, check our blog post  “Terraform your Cloud Infrastructure”.

We also need to mention the monitoring layer. It's powered by Cloud Monitoring, Cloud Logging and BigQueries tables in which we can store information about potential errors in the source data. It can be visualized in Data Studio or a similar tool, while alerts can be sent via email to the stakeholders who can then take action.

Automate work and simplify the processes with Google Cloud

Another benefit of this solution is that it's not expensive. It delivers High Availability and can easily be scaled up, depending on the company's needs and the complexity of the next tasks that must be implemented by the processing platform. Here's an example of how we can quickly move from local Excel development to the automated cloud environment to simplify data management and start data-driven development in the cloud.

local-excel-development-the-automated-cloud-environment-simplify-data-management

Would you like to chance your spreadsheet files to automated data pipelines with Google Cloud? Let’s discuss about this, contact us!

big data
technology
google cloud platform
cloud
data pipelines
GCP
8 February 2022

Want more? Check our articles

read mlops snowflake getindata
Tutorial

From 0 to MLOps with ❄️ Part 2: Architecting the cloud-agnostic MLOps Platform for Snowflake Data Cloud

From 0 to MLOps with Snowflake ❄️ In the first part of the blogpost, we presented our kedro-snowflake plugin that enables you to run your Kedro…

Read more
getindator illustration of squirrel holding a trophy and standi 537810f1 a5a2 4f1a a701 18b280cf6acf 720

3 Apache Flink Blogs That Will Revolutionize Your Streaming Game

Streaming analytics is no longer just a buzzword—it’s a must-have for modern businesses dealing with dynamic, real-time data. Apache Flink has emerged…

Read more
propozycja2
Tutorial

Deploying efficient Kedro pipelines on GCP Composer / Airflow with node grouping & MLflow

Airflow is a commonly used orchestrator that helps you schedule, run and monitor all kinds of workflows. Thanks to Python, it offers lots of freedom…

Read more
getindata monitoring alert data streaming platfrorm
Use-cases/Project

How to build continuous processing for real-time data streaming platform?

Real-time data streaming platforms are tough to create and to maintain. This difficulty is caused by a huge amount of data that we have to process as…

Read more
noweobszar roboczy 1 3

GetInData in 2022 - achievements and challenges in Big Data world

Time flies extremely fast and we are ready to summarize our achievements in 2022. Last year we continued our previous knowledge-sharing actions and…

Read more
1716380755877
Big Data Event

Overview of InfoShare 2024 - Part 1: Demystifying AI Buzzwords, Gamified Security Training

The 2024 edition of InfoShare was a landmark two-day conference for IT professionals, attracting data and platform engineers, software developers…

Read more

Contact us

Interested in our solutions?
Contact us!

Together, we will select the best Big Data solutions for your organization and build a project that will have a real impact on your organization.


What did you find most impressive about GetInData?

They did a very good job in finding people that fitted in Acast both technically as well as culturally.
Type the form or send a e-mail: hello@getindata.com
The administrator of your personal data is GetInData Poland Sp. z o.o. with its registered seat in Warsaw (02-508), 39/20 Pulawska St. Your data is processed for the purpose of provision of electronic services in accordance with the Terms & Conditions. For more information on personal data processing and your rights please see Privacy Policy.

By submitting this form, you agree to our Terms & Conditions and Privacy Policy