Data Is Everywhere, But Few Know What It Actually Is

Data is more abundant than ever, but trust in it varies. Three out of four executives don’t trust their organisation’s data, and only 10% feel they have mastered its quality. The reason isn’t technical; it’s human. We don’t understand where data comes from, how it’s created, or which choices shape it before it becomes the foundation for our decision-making.

Martti Asikainen & Umair Ali Khan, 27.2.2026 | Photo: Adobe Stock


Imagine an ordinary Monday meeting. The sales director opens their presentation and confidently states that the data shows your customers want faster delivery. Your team nods around the table, but no one asks where the data came from, how it was collected, whom it represents, or what it fails to tell us, because data speaks for itself. Moments later, a decision is made, a budget is allocated, and direction changes, because that’s what the data said.

This scenario isn’t a rare exception; it is a fairly typical description of almost every data-driven company today. The figures reveal a perplexing contradiction. Data is everywhere, but its nature seems increasingly shrouded in mystery, even for those who rely on it most in their work (Mayer-Schönberger & Cukier, 2013). Perhaps this is precisely why data researcher Rob Kitchin (2014) has often noted that if data is treated as objective and self-evident, its origins, structure, and purpose easily go unexamined.

According to a 2019 report by KPMG and Forrester Consulting, up to 60% of data and analytics decision-makers said they were not confident in their analytics insights, and only 10% considered their organisation excellent at managing data quality (KPMG/Forrester 2019, n=2,165). Meanwhile, HFS Research reveals that as many as three out of four executives don’t trust their own organisation’s data (HFS Research 2022). In other words, the issue is not only whether data is objectively correct, but whether decision-makers believe it is reliable, complete, timely, and relevant enough to support action.

And perhaps they shouldn’t. Data quality can vary significantly. Data quality expert Thomas Redman (2018) notes in his article published in Harvard Business Review that poor-quality data doesn’t just weaken analytics results; it can render even your most advanced machine learning models practically useless. This lack of trust isn’t just a feeling but a genuine structural problem.

What Data Is and How It Works

Inspired by this knowledge, it’s perhaps appropriate to ask what data actually is. At its simplest, it’s observations of reality that have been converted into measurable form. When you press the like button on social media, a data point is created. If you enter a shop and your mobile application registers your location, that’s also a data point.

The same applies to the loyalty card you use in the shop, from which you receive bonuses when paying, and the text you type into a search engine: “best pizza Helsinki”. In other words, every interaction in a digital environment leaves a trace, and those traces are part of data.

However, data doesn’t arise spontaneously, nor should it ever be mistaken for neutral (see Noble 2018). Behind every data point is a choice: what to measure, how to measure, when to measure, who measures, and how the observation is classified (Bowker & Star 1999). These choices don’t occur in a vacuum either; they always reflect the values, perspective, and objectives of some person, organisation, or system (Gitelman 2013).

Broadly speaking, data can be divided into three categories. Structured data is organised and machine-readable, such as sales history or customer records stored in a database, spreadsheet, or CRM system. Semi-structured data contains recognisable fields or markers that provide some organisational structure. For instance, emails or social media posts are partially organised but free-form. Unstructured data, meanwhile, is everything that hasn’t been categorised in advance, such as images, videos, audio files, and handwritten notes. According to various estimates, approximately 80–90% of all the world’s data is unstructured (IDC 2018; Gartner 2023).

Five Different Creation Stories

Data is created in five different ways, and distinguishing among them is important because the method of creation directly affects what can be done with the data and how much it can be trusted. The first and oldest method is collecting data through a procedure. A company conducts a customer survey, a researcher interviews test subjects, or a counter counts visitors at the door.

Collected data always carries purpose and intention. The data exists because someone decided to collect it for a specific purpose. This purposefulness is simultaneously its strength and weakness. A strength, because the data answers a specific, predetermined question. A weakness, because it only answers the question that someone knew to ask in advance.

The second is passively generated data, often called exhaust data or digital footprint. It’s created as a by-product of other activities, such as clicks, search terms, payment transactions, location data, and time spent reading. Your smartphone registers, a mobile application saves, and an online shop remembers. Most of the data companies collect belongs to this category, and it’s precisely this data that has fuelled the growth of major technology companies over the past two decades (Zuboff 2019).

The third category is generative data, which people actively produce without thinking they’re producing data. Social media posts, reviews, comments, and blog texts are all generative data.

The fourth is data produced by sensors and devices in the Internet of Things (IoT): factory machines, traffic lights, pacemakers, and weather stations. Predictions suggest that by 2030, there will be approximately 39 billion connected devices in the world, all producing data continuously (IoT Analytics 2025).

The fifth category is synthetic data, i.e., artificial data generated by generative AI models. Instead of being directly collected from real-world events, synthetic data is produced computationally to resemble the statistical properties and patterns of actual datasets.

Organisations use synthetic data for purposes such as training machine learning models when real data is scarce, sensitive, or restricted due to privacy regulations, as well as for simulation, testing, and scenario analysis.

While synthetic data can expand data availability and reduce privacy risks, it also reflects the assumptions, biases, and limitations of the models and source data used to generate it. Therefore, decisions based on synthetic data still require critical evaluation of how and why that data was created.
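The idea behind synthetic data can be made concrete with a toy sketch. The example below, written in Python with invented illustrative numbers, fits nothing more than a mean and standard deviation to a tiny “real” dataset and then draws artificial values that mimic those statistics. A real synthetic-data pipeline would use far richer generative models; this is a minimal sketch of the principle, and it also shows why the synthetic sample can never contain more truth than the model and source data it was built from.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# A small "real" dataset: hypothetical delivery times in hours
real = [26.1, 24.8, 25.5, 27.2, 23.9, 25.0, 26.4, 24.2]

# Fit the simplest possible statistical model: mean and standard deviation
mu = statistics.mean(real)
sigma = statistics.stdev(real)

# Generate synthetic data points that mimic those statistical properties
synthetic = [random.gauss(mu, sigma) for _ in range(1000)]

# In aggregate, the synthetic sample resembles the real one,
# but it inherits every assumption of the model (normality,
# independence) and every bias carried by the eight real values.
print(f"real mean: {mu:.2f}, synthetic mean: {statistics.mean(synthetic):.2f}")
```

The design point is that nothing in `synthetic` is new evidence about the world; it is a restatement of the fitted model, which is why the article’s warning about evaluating how and why the data was created applies to it in full.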

Data Doesn't Arise Ready-Made

Data never arises ready-made. It goes through a process that can be called the data lifecycle. First, it’s created and collected, then it’s stored, after which it’s cleaned and transformed into usable form, analysed, and finally presented or utilised in decision-making. Each stage brings the possibility of errors and distortions.

During the storage stage, data can be lost or recorded incorrectly. During the cleaning stage, choices are made about which observations are “outliers” and removed, which can mean blurring the edges of reality. During the analysis stage, a method is chosen, which in turn affects the result. During the presentation stage, the scale of a graph can make a small change appear dramatic or a large change insignificant.
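How much a cleaning choice can matter is easy to demonstrate. The sketch below uses invented customer ratings and one common (and here deliberately crude) rule of thumb, dropping values more than two standard deviations from the mean, to show how a single “outlier” decision shifts the headline number and, with it, the story the data tells.

```python
import statistics

# Hypothetical customer satisfaction ratings (1-10); one extreme value
ratings = [7, 8, 6, 7, 9, 8, 7, 1]

mean_all = statistics.mean(ratings)  # 6.625: "customers are lukewarm"

# A cleaning rule: discard anything more than 2 standard deviations
# from the mean. The lone rating of 1 falls outside that band.
mu = statistics.mean(ratings)
sigma = statistics.stdev(ratings)
cleaned = [r for r in ratings if abs(r - mu) <= 2 * sigma]

mean_cleaned = statistics.mean(cleaned)  # ~7.43: "customers are happy"

print(f"before cleaning: {mean_all:.2f}, after cleaning: {mean_cleaned:.2f}")
```

Neither number is wrong in a technical sense; the point is that someone chose the rule, and that choice, not the raw observations alone, determines what the Monday meeting hears.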

For this reason, the philosophy of science has long established that raw and objective data doesn’t exist, as every data point has already been interpreted before it reaches analysis (e.g. Latour 1987; Bowker 2005; Gitelman 2013). Therefore, the claim we made at the beginning, that data speaks for itself, is misleading and false. Data never speaks for itself, but always with someone’s voice, and through someone’s decisions and choices.

The age of AI has made the data question more acute than ever before. AI models learn from data, and they reproduce the structures contained in the data, including its deficiencies and biases. When Amazon’s recruitment algorithm learned from historical recruitment data, it learned that successful applicants were primarily men. The algorithm didn’t discriminate against women out of malice, but because it had learned only part of reality through the data (Dastin 2018).

The same logic applies everywhere data is used to support decision-making. In healthcare, for instance, cases have been observed where AI models recognise illnesses less effectively in women than in men or engage in ethnic profiling regarding care needs, because the models have been trained on data that doesn’t sufficiently account for demographic differences (Straw & Wu 2022; Obermeyer et al. 2019). Data used in credit risk assessment, meanwhile, reflects previous credit decisions, which can systematically exclude certain population groups from financing (Barocas & Selbst 2016; O’Neil 2016; Kleinberg et al. 2018).

Ask the Right Questions

Let’s return briefly to the Monday meeting and the sales director whose data indicates customers want faster delivery. Now that you know where data comes from and how it’s created, you know to ask the right questions. From whom was this data collected? Do they represent all your customers? Are there differences between customer segments that should be considered when interpreting the data? When was the data collected? Has the situation changed since? What was asked? Did the framing of the questions steer the answers? And what’s missing from the data, and whose voice isn’t heard at all?

These aren’t technical questions. They’re management questions. And asking them doesn’t mean doubting the data, but understanding it. Data isn’t just numbers in a table, but a narrative about reality that someone has written, from some perspective, and for some purpose. Harvard business professor Thomas Davenport and former Accenture executive Jeanne Harris have aptly noted that competitive advantage doesn’t arise merely from having the most data, but from knowing how to ask data the right questions (Davenport & Harris 2007).

The technical solutions for data collection have developed rapidly and largely become standardised, which is why that part of the problem can be considered almost solved. A far more difficult question and greater challenge is determining what to do with collected data and how to recognise its limitations before it guides decisions. Without this understanding, analyses can be technically flawless yet still misleading. This is a skill upon which all other data competency depends; without it, everything else is built on rather weak foundations.

References

Barocas, S., & Selbst, A. D. (2016). Big data’s disparate impact. California Law Review, 104(3), 671–732.

Bowker, G. C. (2005). Memory practices in the sciences. MIT Press.

Bowker, G. C., & Star, S. L. (1999). Sorting things out: Classification and its consequences. MIT Press.

Criado Perez, C. (2019). Invisible women: Data bias in a world designed for men. Chatto & Windus.

Dastin, J. (2018, October 11). Amazon scraps secret AI recruiting tool that showed bias against women. Reuters. https://www.reuters.com

Davenport, T. H., & Harris, J. G. (2007). Competing on analytics: The new science of winning. Harvard Business School Press.

Forrester Consulting, & KPMG. (2019). Guardians of trust: Who is responsible for trusted analytics? KPMG International.

Gartner. (2023). Data and analytics trends. Gartner Research.

Gitelman, L. (Ed.). (2013). “Raw data” is an oxymoron. MIT Press.

HFS Research. (2022). 75% of executives don’t trust their data. HFS Research.

IDC. (2018). The data age 2025: The digitization of the world. International Data Corporation.

IoT Analytics. (2025). Number of connected IoT devices growing 14% to 21.1 billion. IoT Analytics Research.

Kitchin, R. (2014). The data revolution: Big data, open data, data infrastructures and their consequences. SAGE.

Kleinberg, J., Mullainathan, S., & Raghavan, M. (2018). Human decisions and machine predictions. The Quarterly Journal of Economics, 133(1), 237–293.

Latour, B. (1987). Science in action: How to follow scientists and engineers through society. Harvard University Press.

Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt.

Noble, S. U. (2018). Algorithms of oppression: How search engines reinforce racism. NYU Press.

O’Neil, C. (2016). Weapons of math destruction: How big data increases inequality and threatens democracy. Crown.

Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447–453.

Redman, T. C. (2018). If your data is bad, your machine learning tools are useless. Harvard Business Review. https://hbr.org

Straw, I., & Wu, H. (2022). Investigating for bias in healthcare algorithms: A sex disparity in liver disease AI performance. BMJ Health & Care Informatics, 29(1), e100457.

Zuboff, S. (2019). The age of surveillance capitalism: The fight for a human future at the new frontier of power. PublicAffairs.

Authors

Martti Asikainen

Communications Lead
Finnish AI Region
+358 44 920 7374
martti.asikainen@haaga-helia.fi


Dr. Umair Ali Khan

Senior Researcher
Finnish AI Region
+358 294471413
umairali.khan@haaga-helia.fi


Finnish AI Region
2022-2025.