Transcript
Dennis Talley, Executive Director, Enterprise Data Services
Niels Hanson, Director, KPMG Lighthouse
Dennis Talley:
Hi everyone, my name is Dennis Talley and I'm an Executive Director in KPMG's Enterprise Data Services organization. I work in the CDO function.
Niels Hanson:
And I'm Niels Hanson. I'm a Director in KPMG Lighthouse, within the cloud and engineering practice.
Dennis Talley:
We'll be talking today about Multi-cloud Enterprise Delta Sharing and Governance using Unity Catalog. We worked with S&P Global on this.
So just a quick overview: today we're going to talk about data sharing challenges. We'll talk a little bit about Delta Sharing and Unity Catalog. We'll talk about the pilot that we did with S&P Global, which was a cross-cloud Delta Sharing initiative. And we'll also talk about the validation criteria that we had for the pilot. Finally, Niels will finish up with a discussion of our Modern Data Platform and the tie-in to the SFDR ESG model that we had as the capstone for this initiative.
So, sharing data is a challenge that a lot of organizations face. We'd like to use data as an asset, but there are a lot of constraints that we have to work with. Homegrown solutions are fault prone. Commercial platforms often come with more lock-in than we'd like to see. And then direct cloud storage sharing can lead to compliance issues. In a firm like KPMG, compliance issues are top of mind for us, and so it's something that we need to bring to bear every time we're working with a technology such as this.
I'm going to talk a little bit about the pilot, but I'm going to leave the bulk of the work to Niels. It was at this conference a year ago, at the executive forum, where I heard Mona Soni and Gary Jones from the S&P Global Sustainable One Program. They were talking about an internal Delta Sharing initiative that they had amongst S&P Global business units. KPMG buys a lot of data from S&P Global, and so after the session I approached Gary and Mona and said, "I'm an existing customer looking for a more efficient and effective way to receive data. I know you're not working on external sharing, but when you are, please give me a call. I'd love to be your first pilot." That was in June. In September or October they gave me a call back, in January we were planning sprints, and in the first week of March we started our pilot. That pilot lasted for about five weeks, and we ran through information security on both sides. Five weeks seems like an awfully short amount of time to do this work, and that's an illustration of just how easy this can be. Databricks, through Unity Catalog, has made Delta Sharing really, really easy, and we'll talk a little bit about that today. I don't need to talk about the Data and AI Summit, you all know where you are, but it really is exciting to hear the momentum around Delta Sharing this morning in the keynote. And it's nice to be in a place where the world is coming to us, meeting us where we are. So, looking forward to hearing any questions you might have. I'll turn it over to Niels.
Niels Hanson:
Alright, so we heard about these exciting technologies this morning, but I'll just remind you again why we are really excited about Delta Sharing. It's really the first open standard for secure data sharing, across clouds in particular. And so if you have data on multiple hyperscalers, this is the open platform by which you can share it. So we'll go through the benefits a little bit. We have open, cross-platform data sharing. You're able to share live data with no replication, which is really a big deal. There's centralized governance across these clouds now; like we saw, blob storage and data warehouses are now coming together with the same storage governance paradigm. Another exciting thing is that, now that we have this great sharing protocol, there's a great marketplace to make all this data available, from providers like S&P, but there's a whole multitude of them that we heard about this morning. And finally, this protocol, by allowing private organizations to share mutual data sets, enables the concept of the data clean room, where you can do these kinds of sensitive co-modeling and co-analysis in a really safe and secure way.
So, another thing is Unity Catalog. This is the catalog that exists across clouds, enabled by Databricks. Essentially it gives you centralized governance for data and AI on Databricks, built-in data search and discovery, performance and scalability for this data across clouds, and automated lineage of workloads across everything. And finally, it also integrates with a number of existing tools that we are really excited to use. So that's a bit about the features and why we're excited about it. Now a little bit about the how.
So, on the left-hand side here, we have the S&P Global platform, and on the right-hand side, we have our KPMG environment. Essentially what they do is, from their existing internal data products, they create these delivery packages, which are the data sets that they want to share with a certain collaborator. Then, as a Databricks admin, they create a share in their Unity Catalog. That share, when provided to the KPMG catalog, comes across as just another table in that Unity Catalog. And from there, the KPMG admin can provide that dataset to the downstream data consumers within KPMG, for different use cases, and apply governance there.
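As an illustration of the flow just described (not something shown in the talk), here is a minimal Python sketch that assembles the kind of Databricks SQL statements a provider admin and a recipient admin might run. The share, table, recipient, provider, and catalog names are all hypothetical, and the sharing identifier is a placeholder; the statement forms are assumed to follow the Databricks SQL `CREATE SHARE`, `ALTER SHARE`, `CREATE RECIPIENT`, `GRANT ... ON SHARE`, and `CREATE CATALOG ... USING SHARE` syntax and should be checked against current documentation.

```python
from typing import List


def provider_statements(share: str, table: str, recipient: str, sharing_id: str) -> List[str]:
    """SQL a provider admin might run to publish one table (names are hypothetical)."""
    return [
        f"CREATE SHARE IF NOT EXISTS {share}",
        f"ALTER SHARE {share} ADD TABLE {table}",
        # sharing_id stands in for the recipient metastore's sharing identifier
        f"CREATE RECIPIENT IF NOT EXISTS {recipient} USING ID '{sharing_id}'",
        f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}",
    ]


def recipient_statements(catalog: str, provider: str, share: str) -> List[str]:
    """SQL a recipient admin might run to mount the share as a local catalog."""
    return [f"CREATE CATALOG IF NOT EXISTS {catalog} USING SHARE {provider}.{share}"]


if __name__ == "__main__":
    for stmt in provider_statements(
        "esg_share", "sp_data.esg.scores", "kpmg_recipient", "<sharing-identifier>"
    ):
        print(stmt + ";")
    for stmt in recipient_statements("sp_shared", "sp_provider", "esg_share"):
        print(stmt + ";")
```

Once the catalog is mounted on the recipient side, the shared table is addressed like any other three-level name, for example `sp_shared.esg.scores` in this hypothetical setup.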
So, if it sounds really, really simple, it's because it actually is kind of that simple. The data actually doesn't come across; you're merely pushing your queries into the S&P Global data platform, and that's where things are being performed. And so that's a great thing: the data doesn't have to come across the wire and land in downstream storage within KPMG. Sometimes you do need to bring the data across for downstream consumption in other systems, but we want to encourage more and more use cases to live within the Databricks environment. So, in terms of what we did in the pilot, we had a set of success criteria. We just really wanted to make sure that it actually did what it said it did.
So we evaluated data consistency, performance, security, auditability, scalability, and integration. Consistency is really about whether the data is what it says it is in both environments. For performance, we looked at read and write performance around different things, and they met our internal SLAs. For security, we wanted to validate that when we share something, it's shared to KPMG and not anybody else. And then within KPMG, we want to be able to apply row-level and column-level security, dividing those data sets up for different kinds of use cases. We also wanted to validate that when we removed access, it was actually removed. Auditability is very important. We wanted to show that every action is recorded, and also see how updates, deletes, and inserts are all tracked within the environment.
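The talk does not say how data consistency was checked. One common approach, sketched here under that assumption, is to compare a row count and an order-independent fingerprint computed on each side. The function name and sample rows are hypothetical.

```python
import hashlib
from typing import Iterable, Tuple

Row = Tuple[object, ...]


def table_fingerprint(rows: Iterable[Row]) -> Tuple[int, str]:
    """Return (row_count, digest), where the digest ignores row order.

    Each row is hashed on its own and the hashes are summed modulo 2**256,
    so the result does not depend on the order in which rows arrive, and
    duplicate rows still change the total (unlike an XOR combination).
    """
    count = 0
    combined = 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
        combined = (combined + int.from_bytes(digest, "big")) % (1 << 256)
        count += 1
    return count, f"{combined:064x}"


if __name__ == "__main__":
    # Hypothetical rows as read on the provider side and on the recipient side.
    provider_rows = [(1, "ACME", 2021), (2, "Globex", 2022)]
    recipient_rows = [(2, "Globex", 2022), (1, "ACME", 2021)]
    print(table_fingerprint(provider_rows) == table_fingerprint(recipient_rows))
```

Matching fingerprints only suggest that both sides see the same rows at the moment of the check; they say nothing about performance or access control, which would need their own tests.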
Finally, we did a couple of heavy queries to ensure that these things do indeed scale within the Databricks environment: that even though we're sharing from the S&P environment, things indeed worked within the KPMG environment. And finally there was integration. We tested a number of tools within our own environment, because it's great if you can work within the Databricks environment, but if you don't need to, or you're not able to, you can use other kinds of integrations to get things done. So, now we're going to go to a little bit of a demo, and bear with me because it kind of goes on its own. Here, we're in the KPMG environment and we're going to run this query. This query's not going to find any data, because this data hasn't been provided to us from S&P yet. Wait, so nothing happened. There are no results. Now on the S&P side, we're going to create the share, and then they're going to create the table that is to be shared. Then we're going to run a query on the S&P side, and there isn't any data in this table just yet. So, we haven't done anything yet. Now we're going to insert a record into that table.
Alright, great. Now we're going to go back to the KPMG side and make sure that that record is found. And indeed, it is. Now we're going to update that record so that the key instance for that record is going to be 004. We're going to do another select and we should be able to see 004 in the output. Alright, there you go. So, it's really one-to-one. Just to make sure that it wasn't a fluke, we're going to do another update. So that's now going to be 003. Back to the KPMG side, we're going to run another query and it should be 003, and indeed it is. Now they're going to delete the table from the S&P side. We're going to go back to the KPMG side, and we should find that there are no more records to be found.
Alright, no results. So, the table has been successfully removed. Alright, so now we're going to look at the audit logs. Here's the way that you can ask the table to give you all the updates, inserts, and deletes. And so here you see all the actions that we just performed. You have the key values and the commit timestamps. And so, this provides a full audit log for you to see who accessed what, when. And indeed, this is detail that you wouldn't normally be able to get at the table level. This is what Unity Catalog and Delta Sharing are providing. This is just going through talking to the table. Finally, we have a quick little performance query to show that on the KPMG side this thing is indeed performant. Just let it go. We will just run a quick count star on this table to show that indeed it's getting the number of records. Now we're going to run some group bys to show another scalability query. So, this should be basically a simple group by and count.
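The talk does not name the feature behind this change history. Assuming it is the Delta Lake Change Data Feed, each returned row carries `_change_type`, `_commit_version`, and `_commit_timestamp` columns next to the table's own columns. The Python sketch below replays a hypothetical feed shaped like the demo (one insert, two updates, one delete) to show how such a log reconstructs table state at any point. The `id` and `key_instance` column names and the initial value `001` are illustrative and not taken from the demo.

```python
from typing import Dict, Iterable, Mapping


def replay_changes(changes: Iterable[Mapping[str, object]]) -> Dict[object, object]:
    """Rebuild {id: key_instance} by applying change rows in commit order."""
    state: Dict[object, object] = {}
    for change in sorted(changes, key=lambda c: c["_commit_version"]):
        kind = change["_change_type"]
        if kind in ("insert", "update_postimage"):
            state[change["id"]] = change["key_instance"]
        elif kind == "delete":
            state.pop(change["id"], None)
        # update_preimage rows hold the old value and do not alter the state
    return state


# Hypothetical feed mirroring the demo: insert, update to 004, update to 003, delete.
CHANGES = [
    {"id": "r1", "key_instance": "001", "_change_type": "insert", "_commit_version": 1},
    {"id": "r1", "key_instance": "001", "_change_type": "update_preimage", "_commit_version": 2},
    {"id": "r1", "key_instance": "004", "_change_type": "update_postimage", "_commit_version": 2},
    {"id": "r1", "key_instance": "004", "_change_type": "update_preimage", "_commit_version": 3},
    {"id": "r1", "key_instance": "003", "_change_type": "update_postimage", "_commit_version": 3},
    {"id": "r1", "key_instance": "003", "_change_type": "delete", "_commit_version": 4},
]

if __name__ == "__main__":
    print(replay_changes(CHANGES[:3]))  # state after the first update
    print(replay_changes(CHANGES))      # state after the delete
```

Replaying a prefix of the feed gives the table as of that commit, which is what makes a log like this usable for auditing what changed and when.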
And that returned relatively quickly. And so indeed these kinds of queries are meeting our SLAs on Delta Sharing, and we're actually very, very impressed with the performance of this. So indeed, we didn't need to bring the data across the wire, and we can do these really, really performant queries. Finally, we're going to show that you can have different security grants at the catalog level, the database schema level, and the table level, so that you can divide your permissions for this table within the KPMG environment along these lines.
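To make the three grant levels concrete, here is a deliberately simplified Python model of how a read check might resolve them. It assumes the Unity Catalog behavior that a privilege granted on a catalog or schema is inherited by the objects beneath it, and that reading a table requires `USE CATALOG`, `USE SCHEMA`, and `SELECT`. Ownership, `ALL PRIVILEGES`, and other real-world details are left out, and every principal and object name is hypothetical.

```python
from typing import List, Set, Tuple

Grant = Tuple[str, str, str]  # (principal, privilege, securable)


def ancestors(securable: str) -> List[str]:
    """'cat.sch.tbl' -> ['cat', 'cat.sch', 'cat.sch.tbl']."""
    parts = securable.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def holds(grants: Set[Grant], principal: str, privilege: str, securable: str) -> bool:
    """True if the privilege was granted on the securable or any of its parents."""
    return any((principal, privilege, s) in grants for s in ancestors(securable))


def can_select(grants: Set[Grant], principal: str, table: str) -> bool:
    """Check the three privileges needed to read catalog.schema.table."""
    catalog, schema, _ = table.split(".")
    return (
        holds(grants, principal, "USE CATALOG", catalog)
        and holds(grants, principal, "USE SCHEMA", f"{catalog}.{schema}")
        and holds(grants, principal, "SELECT", table)
    )


GRANTS: Set[Grant] = {
    # analysts: access narrowed down to a single table
    ("analysts", "USE CATALOG", "sp_shared"),
    ("analysts", "USE SCHEMA", "sp_shared.esg"),
    ("analysts", "SELECT", "sp_shared.esg.scores"),
    # data_admins: everything granted once at the catalog level
    ("data_admins", "USE CATALOG", "sp_shared"),
    ("data_admins", "USE SCHEMA", "sp_shared"),
    ("data_admins", "SELECT", "sp_shared"),
}

if __name__ == "__main__":
    print(can_select(GRANTS, "analysts", "sp_shared.esg.scores"))
    print(can_select(GRANTS, "analysts", "sp_shared.esg.holdings"))
    print(can_select(GRANTS, "data_admins", "sp_shared.esg.holdings"))
```

The design point is that broad access can be granted once high in the hierarchy, while narrow access is granted table by table.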
And then finally, we're going to look at row-level security and column-level security. So, these are two different KPMG users on the left and the right. The one on the left has many more permissions than the one on the right, so the one on the left will get more data. Basically, they can look at data from any year, and the one on the right can only see things that are more recent than 2019. And so this is just to show that indeed the same query by two different users yields two different results. See, on the left we got records going back to 2004, and on the right we only see things that are fairly recent. So essentially that was our validation. It was just a very cute and quick little pilot, but it indeed showed that Delta Sharing and Unity Catalog can work across clouds, and can work across clouds in the enterprise environment.
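The mechanism used in the demo is not described, so the following is only a plain-Python simulation of the observable behavior: the same query returns different rows and columns depending on who runs it. The user names, sample rows, and the choice of `score` as the masked column are all hypothetical; only the "more recent than 2019" cutoff comes from the talk.

```python
from typing import Dict, List, Optional, Set

ROWS: List[Dict[str, object]] = [
    {"year": 2004, "company": "ACME", "score": 61.0},
    {"year": 2018, "company": "Globex", "score": 72.5},
    {"year": 2020, "company": "Initech", "score": 58.3},
    {"year": 2022, "company": "Umbrella", "score": 80.1},
]

# Per-user policy: an optional year cutoff (row level) and masked columns (column level).
POLICIES: Dict[str, Dict[str, object]] = {
    "full_access_user": {"after_year": None, "masked_columns": set()},
    "restricted_user": {"after_year": 2019, "masked_columns": {"score"}},
}


def query_as(user: str, rows: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Run the same 'select all' for a given user, applying that user's policy."""
    policy = POLICIES[user]
    after_year: Optional[int] = policy["after_year"]  # type: ignore[assignment]
    masked: Set[str] = policy["masked_columns"]  # type: ignore[assignment]
    result = []
    for row in rows:
        # Row-level rule: keep only rows more recent than the cutoff year.
        if after_year is not None and row["year"] <= after_year:
            continue
        # Column-level rule: masked columns come back as None.
        result.append({k: (None if k in masked else v) for k, v in row.items()})
    return result


if __name__ == "__main__":
    print([r["year"] for r in query_as("full_access_user", ROWS)])
    print([r["year"] for r in query_as("restricted_user", ROWS)])
```

In a real deployment the policy would be attached to the table or view in the catalog rather than applied in client code, so that every query path is covered.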
In terms of what this enables, one of the things that we do within KPMG with this data is the SFDR Reporting Solution. Essentially, a number of fund managers need to assess whether their investments are truly green or not. KPMG has an accelerator where we can do these kinds of calculations very, very quickly. And this is one of the things that we're really, really interested in doing with S&P, because they're a great provider of this kind of data. Databricks is now a really great place for fund managers to do these kinds of calculations. And the results are more efficient and up to date because the data didn't have to come across the wire and is continuously updated on the S&P side. So, there's no lag or data inconsistency. Finally, more broadly, I lead KPMG's Modern Data Platform from the technical side. This is our scalable cloud offering. And if you need to talk about cloud strategy or cloud implementation, within Databricks or beyond, great, come and talk to me about it. We do this for a number of clients across a number of industries.
So, you've heard our side of the story. Unfortunately, Gary got food poisoning earlier today, so you only get to hear our side of the story. But you can come to our booth later and we can talk more about data transformation and other opportunities. And also, McGregor's a really, really talented guy and he gave a great talk earlier today.
Speaker Unknown:
Okay. That's all we have. Thank you so much.
Dennis Talley:
Thank you all for coming.
Niels Hanson:
Great. Thank you.