Armchair Architects: How do you get meaning from your data?

brauerblogs · Mar 15 2023

Welcome back to another episode of Armchair Architects, part of the Azure Enablement Show. Today we will be discussing how you get meaning from your data. Our host is David Blank-Edelman, joined by our armchair architects Uli Homann and Eric Charran.

What is usually missing from raw data is context.

I am swimming in data, there's some over there and some over there and then there’s some data in this other repository. I have so much data, I don't know how to make meaning out of it all. How do you think about getting meaning from your data?

This topic is near and dear to Eric’s heart because it is something he is doing right now for multiple teams here at Microsoft. You may recall the previous blog on dark data, which asked, “I know I'm capturing data, but how do I find it, use it, and actually synthesize it to answer my questions?” Eric thinks this discussion is a little different from that one, because let's assume that you don't have a data swamp anymore. Let's assume that you have a relatively streamlined way of extracting data from hundreds of thousands of transactions or mysterious funny messages coming from little devices that you've connected to a cloud provider. What is usually lost is context. In the IoT space, how do I understand a time series message that I got from a funny little device which is only identified by a GUID [globally unique identifier]? How do I know where this thing was, who manufactured it, when its last firmware update was, what it's making, and where it sits in physical space? All of those things are inherently absent from the little message that it sends into the cloud, which eventually finds its way into your data lake.
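The gap Eric describes can be illustrated with a minimal Python sketch. It assumes a device registry keyed by GUID that holds the context the raw message lacks. The registry, its field names, and the sample values are hypothetical and were not discussed in the episode.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DeviceContext:
    """Context that the raw telemetry message does not carry (hypothetical fields)."""
    manufacturer: str
    last_firmware_update: date
    product_made: str
    location: str


DEVICE_ID = "3f2b8c1e-0000-4000-8000-000000000001"  # made-up GUID

# Hypothetical registry; in practice this might live in a database or a digital twin.
DEVICE_REGISTRY = {
    DEVICE_ID: DeviceContext(
        manufacturer="Contoso Sensors",
        last_firmware_update=date(2023, 1, 10),
        product_made="brake pads",
        location="Plant A / Line 3 / Station 7",
    ),
}


def contextualize(message: dict) -> dict:
    """Return a copy of the raw message with registry context attached."""
    context: Optional[DeviceContext] = DEVICE_REGISTRY.get(message["device_id"])
    enriched = dict(message)
    enriched["context"] = context  # None when the device is not registered
    return enriched


# A raw message as it might arrive from the device: an ID, a timestamp, a value.
raw = {"device_id": DEVICE_ID, "ts": "2023-03-15T10:00:00Z", "value": 71.3}
enriched = contextualize(raw)
```

The raw message stays untouched and the context travels alongside it, so an unregistered device shows up as a missing context rather than being silently dropped.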

Many of our customers want to answer questions such as: What's happening right now? How efficient am I? What's going to happen, and what should I do about it? To answer them, the context associated with data is super important. In Eric's estimation, this is commonly referred to as semantics, which is the meaning that data holds.

Data designers need to be clear in defining the data that’s being shared.

Uli believes meaning and contextualization are obviously very important, but there is another piece to think about, and it often shows up when you look at data models. Oftentimes the data designers don't really specify what they meant. Is this pressure in pounds, is this in kilowatts, or whatever the measurement unit is? The unit is often omitted because designers assume that everyone counts the same way or uses the same unit of measurement. That's another element of contextualization that often gets overlooked, because people should “know” what is meant. One example is the famous NASA event where one party thought they were working in metric and the other thought they were working in imperial, which led to the loss of an expensive rocket and satellite. Uli thinks it's really important to think about all the aspects that Eric talked about, but also to make sure that people understand which unit of measurement is being used.
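One way to act on Uli's point is to never store a bare number. The Python sketch below is illustrative only: the `Pressure` class and the set of supported units are assumptions, not something from the episode. It keeps the unit with the value and makes every conversion explicit.

```python
from dataclasses import dataclass

# Conversion factors from each supported unit to pascals.
_TO_PASCAL = {"Pa": 1.0, "kPa": 1000.0, "psi": 6894.757293168}


@dataclass(frozen=True)
class Pressure:
    """A pressure reading that always carries its unit of measurement."""
    value: float
    unit: str

    def __post_init__(self) -> None:
        # Reject readings whose unit is missing or unknown instead of guessing.
        if self.unit not in _TO_PASCAL:
            raise ValueError(f"unknown pressure unit: {self.unit!r}")

    def to(self, unit: str) -> "Pressure":
        """Convert to another supported unit via pascals."""
        if unit not in _TO_PASCAL:
            raise ValueError(f"unknown pressure unit: {unit!r}")
        pascals = self.value * _TO_PASCAL[self.unit]
        return Pressure(pascals / _TO_PASCAL[unit], unit)


reading = Pressure(30.0, "psi")
reading_kpa = reading.to("kPa")
```

Because construction fails for an unknown unit, a producer that "assumes everyone knows" the unit cannot write a reading at all.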

How do architects think about relationships between data?

That's an example of definition being important to context, in addition to relationships. For example, I may be in an auto or a chemical factory where one device is sending one set of data next to another device that is sending another set of data. I don't really know what they're doing, but there is a relationship between those two IoT devices. How do we think about relationships when it comes to meaning, data, and context, and how do we describe them?

Eric thinks about it in three ways. There are entities, attributes, and relationships between the entities. From an architecture perspective and from a data modeling perspective, that's your foundation. So there has to be a technology or a model in which you can express those things. The way you do that is to use languages like RDF [resource description framework], a labeled property graph, or any of these modeling languages to say: this is what my world looks like, and these are the things that I care about. Some of these things correlate to physical objects, for example in a traditional digital twins use case. Some of them might be business objects like work orders, supply contracts, or things of that nature. All of them exist as entities which have attributes, which are properties. Those properties have meaning, like the units of measurement in Uli’s example, and then there are relationships between them. That is what we want to present to our customers and our end users as architects and solution builders. We don't want to say that this table has an associative entity which links to that table, and these are the primary keys and foreign keys associated with it, while the end user still doesn’t know the unit of measure for the temperature. We want to be able to say, “hey, here's an experience that says these are the entities and attributes and relationships that you as business users or regular users understand, and if you're not sure what something is, you can mouse over it and get semantic meaning.” This is the unit of measurement for this temperature, and that brings together context and contextualization. But to get there you must be able to tie these sets of data together. You have to conduct data fusion between multiple data sources in order to achieve this.
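The entities, attributes, and relationships Eric lists can be sketched as a tiny labeled property graph in plain Python. This is not RDF or any specific graph product. The entity names, attribute layout, and `describe` helper are hypothetical and only show how a "mouse over" description could be answered from the model.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A node: a physical or business object with described attributes."""
    id: str
    label: str
    attributes: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Relationship:
    """A typed, directed edge between two entities."""
    source: str
    type: str
    target: str


# Each attribute carries its value, unit, and a human-readable description.
entities = {
    "oven-1": Entity("oven-1", "Oven"),
    "sensor-1": Entity("sensor-1", "TemperatureSensor", {
        "temperature": {"value": 71.3, "unit": "degC",
                        "description": "Oven air temperature"},
    }),
    "wo-42": Entity("wo-42", "WorkOrder", {
        "status": {"value": "open", "unit": None,
                   "description": "Work order status"},
    }),
}

relationships = [
    Relationship("sensor-1", "MONITORS", "oven-1"),
    Relationship("wo-42", "TARGETS", "oven-1"),
]


def related(entity_id: str, rel_type: str) -> list:
    """Entities reachable from entity_id over edges of the given type."""
    return [entities[r.target] for r in relationships
            if r.source == entity_id and r.type == rel_type]


def describe(entity_id: str, attribute: str) -> str:
    """Text a UI could show on hover: description, value, and unit."""
    attr = entities[entity_id].attributes[attribute]
    unit = f" {attr['unit']}" if attr["unit"] else ""
    return f"{attr['description']}: {attr['value']}{unit}"
```

The point of the shape is that the unit and description live on the attribute itself, so the user-facing experience never has to expose tables, primary keys, or foreign keys.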

How do architects think about relationships that are yet unknown?

One scenario architects encounter is that not all of the relationships needed to get meaning from your data are known ahead of time. Sometimes you're constructing relationships after the fact or in process. What are some considerations to think about in this situation?

Uli offers two models for thinking about this scenario. The first one is temporal alignment, where the time measure is the key relationship edge, to use the graph language that Eric mentioned. Graphs are one way of describing things. An example is the Boston Dynamics Spot robot, which looks like a dog and can host multiple sensor packs, but only one at a time. There can be three different sensor packs: one for vision, one for audio, and a third one. You put one on Spot, Spot runs around and does its thing, and then an hour later it picks up another sensor pack. How do you model that, and how do you make sure that you understand that in the first hour Spot was using the vision pack and in the second hour Spot was using the audio pack? In this example, Spot is always the same and each of the three sensor packs is always the same item, but the combination of the two is interesting because it varies by hour. This is not a data discovery problem. It is something that you schedule and set up, but you must now reflect it in your system. Your system must know what the robot was carrying at hour X of day Y and so forth, because your data analytics will be skewed if you don't understand that. There is also a pattern in the way Spot is walking, where you need to understand locality and location, and then there are other things that come from outside data.
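The temporal alignment Uli describes can be sketched in Python as an interval lookup: given a timestamp, find which sensor pack the robot carried. The schedule, timestamps, and readings below are made up for illustration.

```python
from datetime import datetime
from typing import Optional

# Hypothetical schedule: (start inclusive, end exclusive, sensor pack carried).
SCHEDULE = [
    (datetime(2023, 3, 15, 9, 0), datetime(2023, 3, 15, 10, 0), "vision"),
    (datetime(2023, 3, 15, 10, 0), datetime(2023, 3, 15, 11, 0), "audio"),
]


def pack_at(ts: datetime) -> Optional[str]:
    """Return the sensor pack carried at time ts, or None if nothing was scheduled."""
    for start, end, pack in SCHEDULE:
        if start <= ts < end:
            return pack
    return None


# Raw readings only know when they were taken, not which pack produced them.
readings = [
    (datetime(2023, 3, 15, 9, 30), 0.82),
    (datetime(2023, 3, 15, 10, 15), 0.40),
]

# Time is the relationship edge: it ties each reading to the pack in use.
labeled = [(ts, value, pack_at(ts)) for ts, value in readings]
```

Half-open intervals keep a reading taken exactly at a changeover from matching two packs at once.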

In a different scenario, humidity plays a very big role in manufacturing, especially when you do process flow. Humidity really impacts what's going on, as does whether it is dry, hot, or cold. These factors are really important in manufacturing. You have to understand which data you care about and what influences it. Then you have to expand your view beyond the raw data that you have to get to a contextualization that actually makes sense. That's the part of discovery which really influences your data set. You can have a data set that says what the temperature is inside, but does it really matter? Is it impacted by the surroundings, such as the outside temperature? If it's -15 degrees Celsius outside, does that have an impact on how much you must heat the pipe in order to keep the oil flowing? All these things are super important, and that is “discovery” from Uli's perspective. Then you have to model it so that you can persist it and bring up the data when somebody asks the question “what happened to this pipe or this oil in that pipe at that time?” You then know exactly which elements of data you need to bring back. Unfortunately, that comes from experience and experimentation.

Earlier, Uli mentioned that there were two things associated with the question about relationships. One is temporal alignment and the second is causality. This is where you have causal data like the outside temperature, humidity, and other factors that you can't influence, but that do influence what you measure, what that measurement actually means, and what its impact is.
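As a rough illustration of bringing causal, outside data alongside a measurement, the Python sketch below attaches the most recent outside temperature to each pipe reading. The data, names, and the "latest reading at or before the timestamp" rule are assumptions chosen for the example, not a method from the episode.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Optional

# Hypothetical outside-temperature feed in degrees Celsius, sorted by time.
ambient = [
    (datetime(2023, 1, 20, 6, 0), -15.0),
    (datetime(2023, 1, 20, 7, 0), -12.5),
]
ambient_times = [t for t, _ in ambient]


def ambient_at(ts: datetime) -> Optional[float]:
    """Latest outside temperature recorded at or before ts, if any."""
    i = bisect_right(ambient_times, ts)
    return ambient[i - 1][1] if i else None


# Hypothetical pipe temperature readings in degrees Celsius.
pipe_readings = [
    (datetime(2023, 1, 20, 6, 30), 42.0),
    (datetime(2023, 1, 20, 7, 0), 44.5),
]

# Persist the influencer next to the measurement so later questions can use both.
contextualized = [
    {"ts": ts, "pipe_temp_c": temp, "outside_temp_c": ambient_at(ts)}
    for ts, temp in pipe_readings
]
```

Storing the influencer with the measurement is what lets you later answer "what happened to this pipe at that time?" without re-deriving the surroundings.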

Eric cites an example of a ‘slowly changing type 2 problem’.

Uli’s example made Eric think of something that, for many of our viewers who have been in the data warehousing space, is not a new problem. For those folks who are data warehouse toolkit fans, whether you are a Bill Inmon or a Ralph Kimball fan, depending on which side of the world you're on, we used to do this in the data warehousing days. We took a fact table, which is the raw data, let’s say sales data, and then we had dimensions around it. We had a salesperson dimension, a time dimension, and a product dimension so we could figure out who sold what, when, and in what quantity. Uli’s example reminded Eric of one of the challenges associated with this.

Let's say that you and I are working for the same company, and I inherit your sales territory. Now we have a history problem, because I can't take credit for all of your historical sales. If we make that change in the salesperson dimension, we have what we used to call a slowly changing type two problem (a type 2 slowly changing dimension). What that means is that, as much as you define the entities, relationships, and attributes that will help you contextualize raw transactional data, you must also focus on how those models themselves will change over time, so that you're not looking back in history with today's understanding of how those entities, attributes, and relationships are implemented. It's a very old problem that is only going to be exacerbated by the different speeds, feeds, and data sources that we have in today's modern world.
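The territory hand-over can be sketched in Python as a type 2 slowly changing dimension: instead of overwriting the owner, each assignment row gets a validity range, and sales are credited to whoever owned the territory on the sale date. The names, dates, and amounts are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TerritoryAssignment:
    """One versioned row of the salesperson dimension."""
    territory: str
    salesperson: str
    valid_from: date          # inclusive
    valid_to: Optional[date]  # exclusive; None marks the current row


# The hand-over closes the old row and adds a new one; history is kept.
assignments = [
    TerritoryAssignment("Northwest", "Alex", date(2020, 1, 1), date(2023, 1, 1)),
    TerritoryAssignment("Northwest", "Sam", date(2023, 1, 1), None),
]


def owner_on(territory: str, day: date) -> Optional[str]:
    """Salesperson who owned the territory on the given day."""
    for row in assignments:
        if (row.territory == territory
                and row.valid_from <= day
                and (row.valid_to is None or day < row.valid_to)):
            return row.salesperson
    return None


# Fact rows: (sale date, territory, amount).
sales = [
    (date(2022, 6, 1), "Northwest", 1000),
    (date(2023, 2, 1), "Northwest", 500),
]

# Credit each sale using the dimension as it was at the time of the sale.
credit = {}
for day, territory, amount in sales:
    owner = owner_on(territory, day)
    credit[owner] = credit.get(owner, 0) + amount
```

Had the dimension simply been updated in place, both sales would be credited to the new owner, which is exactly the "looking back with today's understanding" problem.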

Recommended Next Steps:

If you’d like to learn more about the general principles prescribed by Microsoft, we recommend the Microsoft Cloud Adoption Framework for platform and environment-level guidance, and the Azure Well-Architected Framework. You can also register for an upcoming workshop led by Azure partners on cloud migration and adoption topics. The workshops incorporate click-through labs to ensure effective, pragmatic training.

To hear the whole conversation, you can watch the video below.
