#19. What is InfoSum's decentralised clean room solution?
Notes from an interview with Alistair Bastian, CTO at InfoSum
One of InfoSum's selling points as a data clean room solution lies in its decentralized technology. I was introduced to this concept during my tenure at Ocado, where I led the retail media business. I was always curious to understand more about it: what it means, and how it works differently from clean room providers that may not have a decentralised solution. Recently, I had the opportunity to interview Alistair Bastian, CTO at InfoSum, to learn more about this technology. Below are brief notes from the interview.
Q1: When you think of your clean room technology, what are the different technology components that make it up and what is the purpose of each of those components?
Alistair:
InfoSum considers there to be three layers to any collaboration stack. Starting at the bottom, we have the infrastructure layer. It provides the foundational data environment for secure and compliant data collaboration. This layer includes the hardware, the software, and the networking architecture to manage data access, storage and processing. It's this layer that's responsible for security and for scalability - how performant the system is, how much data you can load, and how efficiently collaborations and data processing can be executed. This is where the data lives, where the data is onboarded, and where the data is typically transformed and standardized. And this is the layer that enables deep integrations - whether you're bringing in data from a CRM system, other cloud providers, data warehouses or any other data source.
The next layer that sits on top of that is what we call the trust layer. Trust is a word that I'm going to be using a lot; it's a theme that is really important any time anyone collaborates on sensitive data. The trust layer is the framework for secure and privacy-safe data collaboration. The bottom infrastructure layer can be something you build yourself, something you leverage from a data warehouse, or components you assemble from a cloud provider. But on top of those fundamental capabilities, you need to build a trust layer encompassing a set of protocols, technologies like privacy-enhancing technologies, and policies like permissioning controls (e.g., who can access what data). This layer ensures data privacy and that data use adheres to all legal and ethical standards.
The final layer is the intelligence layer. Once you've got your data in the infrastructure layer and worked out all the controls, permissions and technologies to ensure it's used in a compliant way through the trust layer, you build your intelligence layer on top. This layer is responsible for harnessing all of that aggregated, privacy-compliant data. It employs various data processing techniques to extract patterns and then takes actions based on those patterns/insights. This is where your activation will sit, as will your business intelligence and interoperability, because you should only be activating data that has passed through permission controls, privacy-enhancing technologies and other protocols - in other words, data that has passed through the trust layer.
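To make the three layers a bit more concrete, here is a minimal Python sketch of data flowing from the infrastructure layer, through the trust layer, to the intelligence layer. This is my own illustration of the idea, not InfoSum's implementation or API; every function and field name here is an assumption.

```python
# Illustrative sketch of the three-layer collaboration stack (not InfoSum's code).
import hashlib
from dataclasses import dataclass

@dataclass
class Record:
    email: str
    attributes: dict

def infrastructure_layer(raw_rows: list[dict]) -> list[Record]:
    """Onboard, store and standardize raw data from any source (CRM, warehouse, etc.)."""
    return [Record(email=r["email"].strip().lower(), attributes=r) for r in raw_rows]

def trust_layer(records: list[Record], permitted_fields: set[str]) -> list[dict]:
    """Apply permissioning controls and privacy-enhancing steps (here: pseudonymizing
    the identifier and filtering to permitted fields) before anything moves upstream."""
    return [
        {"id": hashlib.sha256(rec.email.encode()).hexdigest(),  # pseudonymized identifier
         **{k: v for k, v in rec.attributes.items() if k in permitted_fields}}
        for rec in records
    ]

def intelligence_layer(privacy_safe_rows: list[dict]) -> dict:
    """Derive insights / drive activation only from data that has passed the trust layer."""
    return {"audience_size": len(privacy_safe_rows)}

rows = [{"email": " Alice@Example.com ", "segment": "gardening"}]
print(intelligence_layer(trust_layer(infrastructure_layer(rows), {"segment"})))
```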
Q2: Using the framework of three layers you described above, can you explain how the tech works for a clean room to be interoperable with cloud providers versus to be interoperable with other clean rooms?
Alistair:
Ingesting data from multiple cloud providers is more of an integration task. It requires integrating raw data from different sources into the infrastructure layer, which is then passed to the trust layer and used by the business through the intelligence layer. That's integration.
When I talk about interoperability, I'm talking about interoperability with other clean room providers. What that means is you're not sharing raw data at that point; you are working with privacy-safe data that has been through the trust layer of another clean room, with all of the collaboration controls (e.g. what data can be shared) and privacy-enhancing technologies (e.g. pseudonymization) applied. Once the data has passed through the trust layer, it is sent to another clean room, where the user can use the intelligence layer of that clean room to pull insights from that data in a privacy-safe way. Google's PAIR and IAB's OPJA are both standards for interoperability among clean rooms; they define how clean rooms can pass data that has been through the trust layer between each other.
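As a deliberately simplified illustration of that distinction, the sketch below shows two clean rooms matching only on pseudonymized identifiers - trust-layer output - rather than raw emails. This is not the PAIR or OPJA protocol (both define their own, more sophisticated encoding and matching schemes); the shared salt and the function names are assumptions made purely for the example.

```python
# Toy illustration of clean-room interoperability: only trust-layer output
# (pseudonymized identifiers) crosses between providers, never raw data.
import hashlib

def trust_layer_output(raw_emails: list[str], shared_salt: bytes) -> set[str]:
    """Pseudonymize identifiers with an agreed salt so two clean rooms can match
    them without ever exchanging the underlying emails (an assumed scheme)."""
    return {
        hashlib.sha256(shared_salt + email.strip().lower().encode()).hexdigest()
        for email in raw_emails
    }

SALT = b"collaboration-specific-secret"  # agreed out of band (illustrative)

clean_room_a = trust_layer_output(["alice@example.com", "bob@example.com"], SALT)
clean_room_b = trust_layer_output(["Bob@Example.com", "carol@example.com"], SALT)

# Clean room B's intelligence layer only ever sees A's privacy-safe output.
print(len(clean_room_a & clean_room_b))  # 1 overlapping user
```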
Q3: Over the years, we have heard the term data ‘Bunkers’ from InfoSum. What are ‘Bunkers’?
Alistair:
Bunkers are standalone private cloud instances. Each Bunker is unique to a single company, and only the data owner has access to it. Contractually, even InfoSum can't access a Bunker unless the owner has explicitly instructed us to do so under certain exceptional circumstances, which is very rare. A Bunker is a secure environment that wraps and isolates a customer's hashed and encrypted data, alongside InfoSum's patented tech, to ensure very high standards of security and privacy.
There are two types of Bunkers, depending on the use case. The first is an Insights Bunker, which supports use cases around planning and measurement. Once raw data is put into this Bunker, it is sealed; no raw data can go out of it. The only thing that comes out of that Bunker is a mathematical model, which is the core of our decentralization tech. The second is an Activation Bunker, because at some point you need data to come out of the Bunker to the data owner, or for the data owner to redirect it to somewhere they are happy for that data to go. So the only difference with the Activation Bunker is that it has the ability for the encrypted data to be returned to the owner of the Bunker based on the result of a collaboration.
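The difference between the two Bunker types can be sketched roughly as follows. This is a toy illustration of the behaviour Alistair describes, not InfoSum's code: the class names, the use of a plain hash as the "model", and the method names are all assumptions.

```python
# Toy sketch: an Insights Bunker only releases a one-way model; an Activation
# Bunker can additionally return matching (still hashed/encrypted) records to its owner.
import hashlib

class InsightsBunker:
    """Sealed: the data placed inside never leaves; only a one-way model does."""
    def __init__(self, hashed_ids: set[str]):
        self._data = set(hashed_ids)

    def export_model(self) -> set[str]:
        # Stand-in for the abstract mathematical model: a further one-way hash.
        return {hashlib.sha256(x.encode()).hexdigest() for x in self._data}

class ActivationBunker(InsightsBunker):
    """Same as an Insights Bunker, plus the ability to return the owner's own
    records that match a collaboration result, for activation."""
    def return_matches(self, model_from_other_party: set[str]) -> set[str]:
        return {x for x in self._data
                if hashlib.sha256(x.encode()).hexdigest() in model_from_other_party}

publisher = InsightsBunker({"h_alistair", "h_keshav", "h_nicola"})
brand = ActivationBunker({"h_keshav", "h_nicola", "h_other"})
print(brand.return_matches(publisher.export_model()))  # {'h_keshav', 'h_nicola'}
```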
There are three main properties of a Bunker, which InfoSum believes are important, and that's why we've created them.
First is control. It ensures that the original data remains within the organization's (data owner's) control. No one, including InfoSum, has access to that data physically or otherwise.
Second is decentralization. Bunkers allow data to be processed in a decentralized manner. Once the raw data goes into a Bunker, it doesn't leave it. So, without sharing data, without moving data between parties, without having to move data across regions, or even commingling any data, we enable parties to derive insights from each other's data. This helps companies comply with all local laws and regulations around data security.
Third is data interoperability. A large area of friction in collaboration is that different datasets may adhere to different data standards and formats. For example, 6th March 2024 might be represented in one dataset as 06/03/2024 and in another dataset as 03/06/2024. Or emails may be lowercase in one dataset and uppercase in another. The Bunker encapsulates all of the tech to standardize and normalize the datasets to the same convention (a simple illustration follows below). That means as soon as you've put your data into a Bunker, it's instantly compatible with any other Bunker on the InfoSum system, and you can collaborate instantly.
So those are the three fundamental properties of a Bunker - control, decentralization, and data interoperability.
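To make the third property concrete, here is a minimal sketch of the kind of standardization described, using the date and email examples above. The functions and target formats are my own assumptions about what such normalization might look like, not InfoSum's actual transformation logic.

```python
# Illustrative normalization: dates in different conventions and mixed-case emails
# are mapped to one canonical form so any two datasets become directly comparable.
from datetime import datetime

def normalize_date(value: str, convention: str) -> str:
    """Parse a date using the dataset's declared convention, output ISO 8601."""
    fmt = "%d/%m/%Y" if convention == "day-first" else "%m/%d/%Y"
    return datetime.strptime(value, fmt).date().isoformat()

def normalize_email(value: str) -> str:
    return value.strip().lower()

# 6th March 2024, written two different ways, becomes the same value:
print(normalize_date("06/03/2024", "day-first"))    # 2024-03-06
print(normalize_date("03/06/2024", "month-first"))  # 2024-03-06
print(normalize_email("  Alistair@Example.COM "))   # alistair@example.com
```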
Q4: Do you need your clients to load their raw data into your Bunker, or can clients keep their data with their cloud service provider while you enable querying directly against data stored in the cloud?
Alistair:
The former. A huge reason is what we discussed above - data in a Bunker gets standardized in a few minutes and is ready for collaboration with other datasets.
Q5: Can you explain step-by-step how collaboration happens between two parties who have their data in their respective Bunkers? And how does your decentralized technology come into play?
Alistair:
Let me give two collaboration scenarios - a simple one and a complex one - to describe how the tech works. In the simple scenario, let's say two parties - a brand and a publisher - want to collaborate. They both license our Bunkers and put their data into their respective Bunkers. Their data is then standardized, pseudonymized, hashed, encrypted, and isolated in their Bunkers.
They then use our user interface to select how they want to collaborate. This is where the trust layer comes in, and there are three steps. First, they might just want to view the magnitude of the overlap in the data - no access to querying or to any other attributes - just to see if there is sufficient overlap to pursue the collaboration. Second, if the overlap is sufficiently high, the parties determine what they want to do with the data. For example, they might agree to work with only the overlapping users. This is the part of the trust layer where both parties establish the permissions for how the other party can use their data. The third step is enabling one party or the other to query that data and move that encrypted data out of the Bunker back to the owner, as part of activation.
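The controls agreed in those steps might look something like the configuration sketched below. The field names, values and structure are purely illustrative assumptions on my part, not InfoSum's configuration format.

```python
# Hypothetical trust-layer collaboration controls between a brand and a publisher.
collaboration_config = {
    "parties": ["brand", "publisher"],
    "step_1_overlap_check": {
        "allowed": True,
        "output": "aggregate_count_only",     # only the size of the overlap is visible
    },
    "step_2_permissions": {
        "scope": "overlapping_users_only",    # agree to work only with the intersection
        "queryable_attributes": ["age_band", "interest_segment"],
        "minimum_audience_size": 100,         # e.g. suppress results below a threshold
    },
    "step_3_activation": {
        "allowed_destination": "brand",       # who may receive the encrypted results
        "bunker_type_required": "activation",
    },
}
```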
Now, how can all of these joins between datasets happen when the data is in different Bunkers, potentially in different regions, without centralizing data in the same place? Within a Bunker, InfoSum creates an abstract mathematical model of the encrypted data. Imagine a Bunker had three users - Alistair, Keshav, and Nicola. The Bunker takes the hashed, pseudonymized, and encrypted values, combines them, and transforms them into an abstract mathematical model. In academia, the technical term for such a model is a data sketch. Essentially, it is a one-way transformation of the data: you cannot retrieve the original values from the mathematical model. All you can do is test whether certain values exist in it.
I will give you the simple analogy of a cake here. Think of the mathematical model as a cake. You have the ingredients like eggs, flour, and sugar. You use these ingredients and bake them into a cake. Now, all you see is the cake (the output) - you cannot see or retrieve the individual ingredients. But you could test the cake in a lab to determine if it contains a particular ingredient.
Once the data is converted into an abstract mathematical model, we transfer that model into another Bunker. We then test the model against the data in that Bunker to find the overlap in users. At this point, the brand might say they want all overlapping individuals to be activated in their media campaign at the publisher. But the important thing in this process is that it is the models that move between the Bunkers, not raw data identifiers like hashed emails or names.
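For a feel of how a data sketch can support this, here is a toy Bloom-filter-style version. It is a generic data-sketch illustration, not InfoSum's patented model: original values cannot be recovered from the bit array, you can only test whether a value might be present, and only the sketch - never the hashed identifiers themselves - has to move between the two sides.

```python
# Toy Bloom-filter-style data sketch: a one-way, membership-testable model.
import hashlib

class ToySketch:
    def __init__(self, size: int = 1024, num_hashes: int = 4):
        self.size, self.num_hashes = size, num_hashes
        self.bits = [0] * size

    def _positions(self, value: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, value: str):
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(value))

# Bunker A builds a sketch of its (already hashed/pseudonymized) identifiers...
sketch_a = ToySketch()
for user in ["alistair", "keshav", "nicola"]:
    sketch_a.add(user)

# ...and only the sketch travels to Bunker B, which tests its own users against it.
bunker_b_users = ["keshav", "nicola", "someone_else"]
print([u for u in bunker_b_users if sketch_a.might_contain(u)])
# ['keshav', 'nicola'] (with a small false-positive probability)
```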
The same process works in a more complex scenario where you have more than two parties collaborating (e.g., along with brands and publishers, you could have identity resolution providers and measurement partners). But through our Bunkers and decentralized tech, we enable all parties to collaborate without friction and without ceding control of their data.
Q6: How does this process work for clean rooms that do not have decentralized technology?
Alistair:
It is possible to do data collaboration without decentralized tech, but to do so the data needs to be centralized or co-mingled. This can lead to increased security risks, as it creates a single point of failure, and to increased privacy risks due to the compounded complexities when several parties collaborate. It also limits the potential collaboration scenarios. Specifically, centralizing or co-mingling data could mean that cross-border collaborations are not possible without copying or moving the data to another region. And in certain circumstances, such as in healthcare and finance, where data cannot leave the premises or firewalls of an entity, collaboration with other entities under similar conditions (e.g. across hospitals or financial institutions) is either not possible or not cost-effective.