10 min read

5 questions you need to answer before starting a big data project


For project managers, development teams and whole organizations, making the first step into the Big Data world might be a big challenge. In most cases, it really is. Handle a large amount of data in your system and not over engineer at the same time. Build a performant solution but not over invest into infrastructure. Choose the right technology to process and store your data. These are just a few you could imagine.

Over the last 7 years working on distributed data platforms I have seen many common mistakes made by organizations in various industries, like unnecessarily spending fortune on cluster infrastructure, starting big data projects without people with real experience or building ‘data lake’ with no use-case. Please check my colleague’s post to find out How to avoid wasting money with Big Data technologies and get some ROI - while it was written a few years ago, many of the mistakes are still repeatedly made.

In this post, I will show you that answering a few crucial questions before you start your first big data project, may let you avoid failure.

1. What’s the expected value to be delivered by your use case(s)?

How large data sets could be collected and stored? How could data sources be integrated? In most cas,es technical questions are not the right ones to ask at the beginning. Rather than that, for your project, (...) it is always important to focus on the value it will provide to its users and other stakeholders. Building the platform with no valid use cases, with the hope of finding out some in the future (could be a potentially valid strategy which big corps may afford) frequently leads to underestimated investments which don’t pay back.

Usually, Big Data & distributed processing comes with relatively increased complexity, infrastructure costs, experts are harder to find etc. That’s why defining proper use cases from a business value delivery perspective is crucial. What is the final value big data project is going to deliver? How would it improve business processes? Would it generate income? Do we have all data sources defined or is the initial research phase required? These are very important questions to answer honestly before starting your investment. 

Assure that all involved stakeholders understand answers to above questions. They need to know what the expected outcomes and project goals are, especially if cross department coordination is needed.

2. What is the nature of your data?

Data is collected at different frequencies. That could be daily snapshot files, monthly aggregates or incoming events in real-time. The nature of the data source will determine the design of your system. Nevertheless, data sources with real-time nature (e.g. like button clicks, song plays etc) could be accessed in less frequent manner (e.g. downloading daily snapshots). It’s important to keep in mind that usually lower latency is harder to achieve. That may mean infrastructure and development costs.

Nevertheless, in reality, most data sources are streams as data points appear over time, so it seems to be natural to collect and store them in a real-time manner. Even if your application shows only aggregated values now, streaming could be a trade-off not to close your architecture to future real-time use cases. 

How should you deliver your data? Focus on your use case and how data is presented to the end user, but also consider potential future extensions in your architecture. Value delivered to the final user should lead to non functional requirements definition.

3. What are the key factors for your big data project? 

Having a first look at the list of available Big Data tools, processing & ingestion frameworks, search engines, libraries etc gives a huge headache. Many of them developed in the open-source community and are available for free. On the other hand, more and more companies compete with their commercial products and enterprise support. Many tools are quite new on the market, lacking maturity, good quality documentation or reference implementations. Development resources may not be easy to acquire, as technology is fresh and not many experienced people are available on the market.

Selecting key project factors may help you make a decision:

  • performance
  • maturity
  • support (community or commercial)
  • portability 
  • flexibility
  • licencing

Before you select concrete tools/technologies, you may select strategy - open-source or commercial. Working with open source may require more development work and building a knowledge base in the organization. It also gives more flexibility, but sometimes lacks maturity and support (for some open source projects, vendor support is available, e.g Spark, Kafka, Hadoop). Paying for commercial tools licences you get the promise to get more straight forward onboarding and high-quality support.

Decisions should always be made referring to the selected key factors and organization resources. 


4. What is your data model?

With no experience in Big Data, when thinking about databases usually people have in mind RDBMS. The relational data model is something everyone is familiar with. With large data volumes, usually, it’s necessary to go out of that comfort zone and consider NoSQL solutions. That’s important to keep in mind nothing comes for free - you can achieve better performance, store more data, solve problems harder or even impossible to solve with RDBMS, but need to pay in more development, harder maintenance and less generic model. Different types of NoSQLs are dedicated to solving a specific range of problems. You have a choice of:

  • document-based databases (MongoDB, Couchbase, ElasticSearch)
  • graph databases (JanusGraph, ArangoDB)
  • columnar databases (HBase, Cassandra, Google Bigtable)
  • distributed OLAP solutions (Druid, ClickHouse)
  • key-value stores (AeroSpike, Redis)

Additionally, the characteristics of different implementations from the same category may vary. That means two different document databases might be good to solve different problems (e.g. MongoDB vs ElasticSearch).

How to select the right one? Again, focus on your use case and business needs - how does the data need to be served to the end-user? What are the questions (queries) you will answer with your data? Once you collect those facts you will be able to model your data, decide the level of denormalization etc. Your model will let you select the right data store. For instance, in online content base systems (e.g. e-commerce ) majority of the queries is single page/content retrieval. That could be kept even in a key-value store or in a document database (for better search capabilities) in denormalized form. For reports or aggregated data OLAP solutions (e.g. Druid) could apply, where you can perform ad-hoc analytical queries.

5. Where your solution is going to run?

Finally, the system you build requires infrastructure to run. Nowadays, you have a choice of highly developed cloud platforms such as Google Cloud Platform, Amazon Web Services, Microsoft Azure or Alibaba Cloud to name a few. At the same time, running your own on-premise environment is still a valid strategy. It’s important to understand the differences and take the approach suitable for your project and organization. You need to consider different pricing models, as well as initial costs. For on-premise hardware cost for Hadoop cluster might be tremendous, while you can start in the cloud with cost close to zero and pay as you go (usage based pricing). Cloud providers offer dedicated, fully managed services with reduced maintenance costs. On the other hand that locks you in with the concrete vendor, which could be avoided with open source technologies. Cloud and on-premise also vary in terms of security, deployments, upgrades and the way you control your system. 


How to start a big data project?

Start simple 

Keep your architecture simple and open to extension. Setting up large infrastructure and installing lots of tools at the beginning is a common mistake. Complicated architecture, especially if you just entered the Big Data area may lead to unnecessary over engineering and delivery delays. Select tools and frameworks that are good enough rather than the most performant/efficient. Focus on what meets your requirements, at potentially lower cost.

Focus on the right data

Drive your decision by the goal of the project, not by the way it could be achieved. That will let you focus on the right data, which gives value to the business. It’s easy to be trapped by trendy buzzwords. 

Check a few reference architectures

Even if your big data project is very innovative you have quite a high chance that something similar was already built. You should not necessarily get anything as it is but rather treat that as reference architectures. Many companies are happy to publish their success stories with a good level of technical detail. As these are already released and working solutions you have the advantage to know their findings and learnings, so you can even use that to improve. 

You can definitely find much more on data-driven companies tech blogs such as Spotify, Allegro, ING, GetInData and more.

Find big data experts

Years of experience exploring technology and developing production data driven systems may be priceless. Advices you can get from industry experts may change your perspective and put your project on the success track. Find, recruit and hire architects and data engineers with a range of experience in Big Data projects, so their expertise will cover every aspect of your system. 

Deliver fast a first successful project

Complex systems, especially in the Big Data area require huge efforts. It may be very hard to find promoters for long-running and expensive projects. At GetInData we always try to isolate use cases where we can deliver fast. When a client is able to see good achievements and delivery in a short time, they're more eager to invest in further projects. Please find our story with KCell where with relatively small incremental steps we went into bigger cooperation.

At GetInData you have access to 50+ distributed systems and cloud experts working with big data systems from the very beginning since Apache Hadoop was released. Do not hesitate to contact us, our team will be happy to discuss your first big data project.

big data
big data project
data model
big data experts
17 February 2021

Want more? Check our articles

getindata 1000 followers

5 reasons to follow us on Linkedin. Celebrating 1,000 followers on our profile!

We are excited to announce that we recently hit the 1,000+ followers on our profile on Linkedin. We would like to send a special THANK YOU :) to…

Read more
blog1obszar roboczy 1 4
Tech News

Is my company data-driven? Here’s how you can find out

Planning any journey requires some prerequisites. Before you decide on a route and start packing your clothes, you need to know where you are and what…

Read more
blogobszar roboczy 1 4

Power of Big Data: MLOps for business.

Welcome to the next instalment of the “Power of Big Data” series. The entire series aims to make readers aware of how much Big Data is needed and how…

Read more
getindata big data blog apache spark iceberg

Apache Spark with Apache Iceberg - a way to boost your data pipeline performance and safety

SQL language was invented in 1970 and has powered databases for decades. It allows you not only to query the data, but also to modify it easily on the…

Read more
getindata big data tech main 1
Big Data Event

A Review of the Presentations at the Big Data Technology Warsaw Summit 2022!

The 8th edition of the Big Data Tech Summit is already over, and we would like to thank all of the attendees for joining us this year. It was a real…

Read more
getindator design a vibrant and engaging scene showcasing real  76ab8269 a013 4120 b722 f95e879d333c

Stream enrichment with Flink SQL

In today's world, real-time data processing is essential for businesses that want to remain competitive and responsive. The ability to obtain results…

Read more

Contact us

Interested in our solutions?
Contact us!

Together, we will select the best Big Data solutions for your organization and build a project that will have a real impact on your organization.

What did you find most impressive about GetInData?

They did a very good job in finding people that fitted in Acast both technically as well as culturally.
Type the form or send a e-mail:
The administrator of your personal data is GetInData Poland Sp. z o.o. with its registered seat in Warsaw (02-508), 39/20 Pulawska St. Your data is processed for the purpose of provision of electronic services in accordance with the Terms & Conditions. For more information on personal data processing and your rights please see Privacy Policy.

By submitting this form, you agree to our Terms & Conditions and Privacy Policy