Databricks

Video transcripts

Dennis Talley, Executive Director, Enterprise Data Services

Niels Hanson, Director, KPMG Lighthouse

Dennis Talley:

Hi everyone, my name is Dennis Talley and I'm an Executive Director in KPMG's Enterprise Data Services organization. I work in the CDO function.

Niels Hanson:

And I'm Niels Hanson. I'm a Director in KPMG Lighthouse, within the cloud and engineering practice.

Dennis Talley:

We'll be talking today about Multi-cloud Enterprise Delta Sharing and Governance using Unity Catalog. We worked with S&P Global on this.

So just a quick overview: today we're going to talk about data sharing challenges. We'll talk a little bit about Delta Sharing and Unity Catalog. We'll talk about the pilot that we did with S&P Global, which was a cross-cloud Delta Sharing initiative. And we'll also talk about the validation criteria that we had for the pilot. Finally, Niels will finish up with a discussion of our Modern Data Platform and the tie-in to the SFDR ESG model that was the capstone for this initiative.

So, sharing data is a challenge that a lot of organizations face. We'd like to use data as an asset, but there are a lot of constraints that we have to work with. Homegrown solutions are fault prone. Commercial platforms often come with more lock-in than we'd like to see. And direct cloud storage sharing can lead to compliance issues. In a firm like KPMG, compliance issues are top of mind for us, so it's something we need to bring to bear every time we're working with a technology such as this.

I'm going to talk a little bit about the pilot, but I'm going to leave the bulk of the work to Niels. It was at this conference a year ago, at the executive forum, where I heard Mona Soni and Gary Jones from the S&P Global Sustainable One program. They were talking about an internal Delta Sharing initiative that they had amongst S&P Global business units. KPMG buys a lot of data from S&P Global, and so after the session I approached Gary and Mona and said, I'm an existing customer looking for a more efficient and effective way to receive data. I know you're not working on external sharing, but when you are, please give me a call. I'd love to be your first pilot. That was in June. In September or October, they gave me a call back. In January we were planning sprints, and in the first week of March we started our pilot. That pilot lasted for about five weeks, and we ran through information security on both sides. Five weeks seems like an awfully short amount of time to do this work, and that's an illustration of just how easy this can be. Databricks, through Unity Catalog, has made Delta Sharing really, really easy, and we'll talk a little bit about that today. I don't need to talk about the Data and AI Summit, you all know where you are, but it really is exciting to hear the momentum around Delta Sharing this morning in the keynote. And it's nice to be in a place where the world is coming to us, meeting us where we are. So, looking forward to hearing any questions you might have. I'll turn it over to Niels.

Niels Hanson:

Alright, so we heard about these exciting technologies this morning, but I'll just remind you again why we really, really are excited about Delta Sharing. It's really the first open standard for secure sharing across data clouds in particular. And so if you have data on multiple hyperscalers, this is the open platform by which you can share it. So we'll go through the benefits a little bit. We have open, cross-platform data sharing. You're able to share live data with no replication, which is really a big deal. There's centralized governance across these clouds now; like we saw, blob storage and data warehouses are now coming together under the same governance paradigm. And then another exciting thing is that, now that we have this great sharing protocol, there's a great marketplace making all this data available from providers like S&P, and there's a whole multitude of them that we heard about this morning. And finally, this protocol, by allowing private organizations to share mutual data sets, enables the concept of the data clean room, where you can do these kinds of sensitive co-modeling and co-analysis in a really safe and secure way.

So, another thing: with Unity Catalog, this is the catalog that exists across clouds, enabled by Databricks. It's essentially centralized governance for data and AI on Databricks, with built-in data search and discovery, performance and scalability across that data, and automated lineage of workloads across everything. And finally, it also integrates with a number of existing tools that we are really excited to use. So that's a bit about the features and why we're excited about it. And now a little bit about the how.

So, on the left-hand side here, we have the S&P Global platform, and on the right-hand side, we have our KPMG environment. Essentially what they do is, from their existing internal data products, they create these delivery packages, which are the kinds of data sets that they want to share with a certain recipient, a certain collaborator. And what they do is create a share through their Databricks admin, in their Unity Catalog. And that share, when provided to the KPMG catalog, comes across as just another table in that Unity Catalog. From there, the KPMG admin can provide that dataset to the downstream data consumers within KPMG, and provide it to different use cases with governance applied there.

So, if it sounds really, really simple, it's because it actually is kind of that simple. The data actually doesn't come across; you're merely pushing your queries into the S&P Global data platform, and that's where things are being performed. And that's a great thing: instead of the data coming across the wire, and us having to keep downstream storage for it within KPMG, it stays where it is. Sometimes you do need to bring the data across for downstream consumption in other systems, but we want to encourage more and more use cases to live within the Databricks environment. So, in terms of what we did in the pilot, we had a set of success criteria. We just really wanted to make sure that it actually did what it said it did.
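As a rough illustration of the sharing flow described above, here is a minimal pure-Python sketch. It is a toy model, not the Delta Sharing protocol or any Databricks API: the `ProviderTable` and `Share` classes, the share name `esg_pilot`, and the table name `scores` are all hypothetical. It only shows the behavior the speakers describe: the recipient reads the provider's current data at query time rather than holding a replicated copy, and a revoke takes effect on the next read.

```python
from dataclasses import dataclass, field


@dataclass
class ProviderTable:
    """Hypothetical stand-in for a table held in the provider's storage."""
    rows: dict = field(default_factory=dict)


class Share:
    """Toy share: recipients query the provider's table, they hold no copy."""

    def __init__(self, name):
        self.name = name
        self._tables = {}
        self._recipients = set()

    def add_table(self, table_name, table):
        self._tables[table_name] = table

    def grant(self, recipient):
        self._recipients.add(recipient)

    def revoke(self, recipient):
        self._recipients.discard(recipient)

    def read(self, recipient, table_name):
        # Access is checked on every read, so a revoke applies immediately.
        if recipient not in self._recipients:
            raise PermissionError(f"{recipient} has no access to {self.name}")
        # The result is computed from the provider's rows as they are right now.
        return dict(self._tables[table_name].rows)


table = ProviderTable()
share = Share("esg_pilot")
share.add_table("scores", table)
share.grant("kpmg")

table.rows["k1"] = "004"                    # provider writes a record
first_read = share.read("kpmg", "scores")
table.rows["k1"] = "003"                    # provider updates it
second_read = share.read("kpmg", "scores")

share.revoke("kpmg")
try:
    share.read("kpmg", "scores")
    revoked_ok = False
except PermissionError:
    revoked_ok = True

print(first_read, second_read, revoked_ok)
```

The recipient never stores the rows between reads, which is why the second read reflects the provider's update without any refresh step.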

So we evaluated data consistency, performance, security, auditability, scalability, and integration. Data consistency is really about whether the data is what it says it is in both environments. For performance, we looked at read and write performance across different things, and they met our internal SLAs. For security, we wanted to validate that when we share something, it's shared to KPMG and not anybody else. And then within KPMG, we want to be able to apply row-level and column-level security, dividing those data sets up for different kinds of use cases. We also wanted to validate that when we removed access, it was actually removed. Auditability is very important. We wanted to show that every action is recorded, and also see how updates, deletes, and inserts are all tracked within the environment.

For scalability, we did a couple of heavy queries to ensure that these things do indeed scale within the Databricks environment, and that even though we're sharing from the S&P environment, things indeed worked within the KPMG environment. And finally there was integration. We tested a number of tools within our own environment, because it's great if you can work within the Databricks environment, but if you don't need to, or you're not able to, you can use other kinds of integrations to get things done. So, we're going to go to a little bit of a demo, and bear with me because it kind of goes on its own. Here, we're in the KPMG environment and we're going to run this query. And this query's not going to find any data, because this data hasn't been provided to us from S&P yet. Wait, so nothing happened. There are no results. Now on the S&P side, we're going to create the share, and then they're going to create the table that is going to be shared. Then we run a query on the S&P side, and there isn't any data in this table just yet. So, we haven't done anything yet. Now we're going to insert a record into that table.

Alright, great. Now we're going to go back to the KPMG side and make sure that that record is found. And indeed, it is. Now we're going to update that record so that the key value for that record is going to be 004. We're going to do another select and we should be able to see 004 in the output. Alright, there you go. So, it's really one-to-one. Just to make sure that it wasn't a fluke, we're going to do another update. So that's now going to be 003. Back on the KPMG side, we're going to run another query and it should be 003, and indeed it is. Now they're going to delete the table from the S&P side. We're going to go back to the KPMG side, and we should find that there are no more records to be found.

Alright, no results. So, the table has been successfully removed. Alright, so now we're going to look at the audit logs. Here's the way that you can ask the table to give you all the updates, inserts, and deletes. And so here you see all the actions that we just performed. You have the key values and the commit timestamps. This provides a full audit log for you to see who accessed what, when. And indeed, this is detail that you wouldn't normally be able to get at the table level, and this is what Unity Catalog and Delta Sharing are providing. This is just going through talking to the table. Finally, we have a quick little performance query to show that on the KPMG side this thing is indeed performant. Just let it go. We will just run a quick count star on this table to show that it's indeed getting the number of records. Now we're going to run some group bys as another scalability query. So, this should be basically a simple group by and count.
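To make the audit-log idea concrete, here is a small pure-Python sketch of a table that records one change entry per write. It is an assumption-laden toy, not the mechanism Unity Catalog or Delta Sharing actually uses: the `AuditedTable` class, its field names, and the initial value `001` are hypothetical. The 004 and 003 updates mirror the demo sequence described above.

```python
from datetime import datetime, timezone


class AuditedTable:
    """Toy table that appends one change record per write (hypothetical)."""

    def __init__(self):
        self.rows = {}
        self.changes = []      # append-only change log
        self._version = 0

    def _log(self, change_type, key, value):
        self._version += 1
        self.changes.append({
            "key": key,
            "value": value,
            "change_type": change_type,
            "commit_version": self._version,
            "commit_timestamp": datetime.now(timezone.utc),
        })

    def insert(self, key, value):
        self.rows[key] = value
        self._log("insert", key, value)

    def update(self, key, value):
        if key not in self.rows:
            raise KeyError(key)
        self.rows[key] = value
        self._log("update", key, value)

    def delete(self, key):
        value = self.rows.pop(key)
        self._log("delete", key, value)


t = AuditedTable()
t.insert("k1", "001")   # initial value is made up for this sketch
t.update("k1", "004")
t.update("k1", "003")
t.delete("k1")

history = [(c["commit_version"], c["change_type"], c["value"]) for c in t.changes]
for entry in history:
    print(entry)
```

Reading `history` back gives the ordered list of inserts, updates, and deletes with their commit versions, which is the kind of view the speaker describes asking the table for.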

And that returned relatively quickly. So indeed these kinds of queries are meeting our SLAs on Delta Sharing, and we're actually very, very impressed with the performance of this. We didn't need to bring the data across the wire, and we can still do these really performant queries. Next, we're going to show that you can have different security grants at the catalog level, the database schema level, and the table level, so that you can divide your permissions along these lines within Delta Sharing, or divide this table within the KPMG environment along these lines.
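The idea of grants at the catalog, schema, and table level can be sketched as a lookup that walks up the object hierarchy. This is a simplified illustration only: the `GRANTS` set, the group names, and the `sp_global.esg.scores` path are hypothetical, and a real catalog has more privilege types than the single "can select" check modeled here.

```python
# Hypothetical grants: (principal, securable) pairs, securables as dotted paths.
GRANTS = {
    ("data_engineers", "sp_global"),              # catalog level
    ("esg_analysts", "sp_global.esg"),            # schema level
    ("fund_reporting", "sp_global.esg.scores"),   # table level
}


def can_select(principal, table_path):
    """Return True if the principal holds a grant on the table or an ancestor."""
    parts = table_path.split(".")
    # Check the table itself, then its schema, then its catalog.
    securables = [".".join(parts[:i]) for i in range(len(parts), 0, -1)]
    return any((principal, s) in GRANTS for s in securables)


print(can_select("data_engineers", "sp_global.esg.scores"))
print(can_select("esg_analysts", "sp_global.other.prices"))
print(can_select("fund_reporting", "sp_global.esg.scores"))
```

A catalog-level grant covers every table beneath it, while the table-level grant covers exactly one table, which is the division of permissions the speaker refers to.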

And then finally, we're going to look at row-level security and column-level security. So, these are two different KPMG users on the left and the right. The one on the left has many more permissions than the one on the right, so the one on the left will get more data. Basically, they can look at data from any year, and the one on the right can only see things that are more recent than 2019. This is just to show that the same query by two different users yields two different results. See, on the left, we got records going back to 2004, and on the right, we only see things that are fairly recent. So essentially that was our validation. It was just a very cute and quick little pilot, but it indeed showed that Delta Sharing and Unity Catalog can work across clouds, and can work across clouds in the enterprise environment.
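The "same query, two users, two results" behavior can be illustrated with a per-user row filter and column list. This is a minimal sketch under stated assumptions: the sample rows, the user names `analyst_full` and `analyst_recent`, and the policy structure are invented for illustration. Only the rule that the restricted user sees years after 2019, and that the other user sees records back to 2004, comes from the talk.

```python
ROWS = [
    {"year": 2004, "company": "A", "score": 61},
    {"year": 2018, "company": "B", "score": 72},
    {"year": 2020, "company": "C", "score": 55},
    {"year": 2022, "company": "D", "score": 80},
]

# Hypothetical per-user policies: a row predicate plus the visible columns.
POLICIES = {
    "analyst_full": {
        "row_filter": lambda r: True,
        "columns": ["year", "company", "score"],
    },
    "analyst_recent": {
        "row_filter": lambda r: r["year"] > 2019,   # row-level security
        "columns": ["year", "company"],             # column-level security
    },
}


def run_query(user, rows):
    """Apply the user's row filter, then project only the allowed columns."""
    policy = POLICIES[user]
    return [
        {c: r[c] for c in policy["columns"]}
        for r in rows
        if policy["row_filter"](r)
    ]


full = run_query("analyst_full", ROWS)
recent = run_query("analyst_recent", ROWS)
print(len(full), min(r["year"] for r in full))
print(recent)
```

Both users call the same `run_query`; the difference in output comes entirely from the policy attached to the user, not from the query text.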

In terms of what this enables, one of the things that we do within KPMG with this data is the SFDR Reporting Solution. Essentially, a number of fund managers need to assess whether their investments are truly green or not. And KPMG has an accelerator where we can do these kinds of calculations very, very quickly. This is one of the things that we're really interested in doing with S&P, because they're a great provider of this kind of data. And Databricks is now a really great place for fund managers to do these kinds of calculations. The results are more efficient and up to date because the data didn't have to come across the wire and is continuously updated on the S&P side. So, there's no lag or data inconsistency. Finally, more broadly, I lead KPMG's Modern Data Platform from the technical side. This is our scalable cloud offering. If you need to talk about cloud strategy or cloud implementation, within Databricks or beyond, great, come and talk to me about it. We do this for a number of clients across a number of industries.

So, you've heard our side of the story. Unfortunately, Gary got food poisoning earlier today, so you only get to hear our side of the story. But you can come to our booth later and we can talk more about data transformation and other opportunities. And also, MacGregor's a really, really talented guy, and he gave a great talk earlier today.

Speaker Unknown:

Okay. That's all we have. Thank you so much.

Dennis Talley:

Thank you all for coming.

Niels Hanson:

Great. Thank you.

Apache Spark video transcript

MacGregor Winegard, Associate Data Engineer, KPMG

Yeah, thank you. I'm super excited to tell you all how we're using Spark Streaming and Delta Live Tables to accelerate our KPMG clients for real-time IoT insights, right? So really quickly, my name is MacGregor Winegard. I'm an Associate Data Engineer at KPMG. I've been on our Modern Data Platform team for about six or seven months now. I do a lot of work managing our data pipelines and infrastructure in Databricks, supporting various dashboards across Tableau and Power BI, as well as managing all of our cloud infrastructure in Terraform. So, why do you need a Modern Data Platform, right? At KPMG, we're working with companies of all sizes and shapes, right? And we see a lot of companies that implement a Modern Data Platform really well, and we definitely see some where there's a reason that they're coming to us for help, right?

I think a lot of their struggles can be summed up in two major categories. The first is that they don't have the data they need, right? Maybe they don't have the right quality of data, or they're not getting it as quickly as they need to. They don't have the verification checks in place to make sure everything's correct, or maybe they just don't even know what data they need, so they're not answering the right question, right? I think the other thing we see with clients is that they didn't think about how all the pieces are going to fit together, right? Maybe they didn't think about how to scale their data up and down, and, as they're ingesting all of it, how the different pieces are going to fit together. And this is maybe just driven by a lack of expertise, because they didn't have the right people in place.

And it's the 21st century, right? So security and privacy are always at the forefront. On the flip side, I think the clients that implement a Modern Data Platform really well have sat down and said, alright, what is our vision for our data and our business, and then what is each individual piece we need to get there from start to finish? So, we view our Modern Data Platform as really bridging this gap between your tech needs and your data, right? This is where your business people can come together and say, all right, these are the questions we want to answer. These are the ML models and everything that we want to build, right? And now your tech people can say, all right, great. We have one cohesive platform from start to finish where we can work with you, show you the different information and data points we want to collect, and support that from ingestion and cleaning all the way to where we're feeding it out to Power BI and Tableau, right? With our Modern Data Platform, we're really encouraging an iterative approach, right? So we're emphasizing time to market, making sure we're getting all of our business and tech people in that room figuring out the problem, and then iteratively scaling up and up and up so that we now have a whole expansive data ecosystem for everybody to work in.

So, our Modern Data Platform is a service catalog, right? And we can offer it to you in three different ways. The first way is that we'll give you our reference architecture, right? So, you can get an understanding of, alright, what are all the different pieces we're using for ingestion and cleaning, and how are we fitting Databricks and all of our other technologies together. And then you have a lot of freedom to go ahead and implement this in your cloud, in your tenant, the way you want to. The second way is that we'll give you our Terraform modules. Like I said, I've done a lot of work managing our Terraform infrastructure, so we can go ahead and give this to you and you can actually deploy it on your tenant. We'll be there along the way to help and assist you, but now it's on your tenant and you have complete control and oversight of it.

So you have the comfort of that. The third way we'll give you our Modern Data Platform is that we'll actually host it for you on our tenant, right? So again, we'll deploy the Terraform infrastructure, but we already have a lot of the foundational pieces in place. So we're really able to quickly accelerate you to market, get you ready to go, and then give you the access you need to go ahead and have your cloud. The other thing we have is a whole bunch of different services, right? We have analytics, integration, and data science, and we have development and engineering teams who are super knowledgeable with Databricks and the rest of our tech stack across Azure, so we can be here right along the way, helping with all your different challenges. And the great thing about KPMG is that we have a whole bunch of different partnerships with different companies who are giving us access to their data, and in turn, we can turn around and give that to you, right?

One example of this is that we're actually using Unity Catalog and Delta Sharing to have access to S&P Global's ESG data. And the great thing about this is that we can now give you access to this and support things like our ESG accelerator. We will have a presentation at 5:10 over on stage number three. So, if you want to hear more about our S&P Delta Sharing initiative, please come and see that. Now, for my use case today we're going to be talking about IoT analytics in manufacturing, but I want to emphasize that we support a whole bunch of different clients, a bunch of different industries, and a plethora of different use cases, right? So just because you maybe don't see your specific application here doesn't mean that we don't support it. We'd love to see you at booth number 724 later and talk to you about how we can be a real catalyst for change in your data environment.

So we found that unplanned downtime in manufacturing costs Fortune 500 firms up to a trillion dollars per year, right? This comes from the fact that they have roughly 800 hours per year of unplanned downtime: 15 hours a week, a little over two hours per day, right? Most of us probably sit on our laptops, but think about how you would feel if your internet was unexpectedly cutting out all the time and you weren't prepared to download files or anything like that. Now you're waiting until your internet service provider gets you back online, and you're a bit lost. The costs of downtime are going to spiral to well over $260,000 per hour, because now you have your machine operators standing by idle, waiting to get back online. You've got your products waiting, not going anywhere, because they're not going into machines.

You've got clients who are getting a little antsy, right? They want their products, and you're now rushing in your service and maintenance people, so they're going to be jacking up those costs for you, right? On the flip side, we find that companies that implement some sort of predictive maintenance plan see an average reduction in their costs of 25 to 30%. So, we think this is a great opportunity to implement IoT, or Internet of Things, right? With IoT we're connecting a whole bunch of internet-enabled devices. In some applications you can be pushing information or commands to these devices, but in our case of predictive analytics, we're actually collecting data off of all these different machines and getting all those readings, right? When you're implementing an IoT solution, though, you're opening the door to a whole bunch of other challenges. You're going to need to think about how to process, interpret, and operationalize all of this data.
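As a quick check on the downtime figures quoted above (roughly 800 hours per year, and over $260,000 per hour), the arithmetic can be worked through in a few lines of Python. The weekly and daily conversions match the "15 hours a week, a little over two hours per day" in the talk. The final product is an inference from multiplying the two quoted figures for a single operation, not a number stated by the speaker.

```python
HOURS_DOWN_PER_YEAR = 800      # figure quoted in the talk
COST_PER_HOUR_USD = 260_000    # figure quoted in the talk (stated as a lower bound)

hours_per_week = HOURS_DOWN_PER_YEAR / 52
hours_per_day = HOURS_DOWN_PER_YEAR / 365

# Implied annual cost for one operation with that much downtime (an inference).
annual_cost_usd = HOURS_DOWN_PER_YEAR * COST_PER_HOUR_USD

print(f"{hours_per_week:.1f} hours/week")
print(f"{hours_per_day:.1f} hours/day")
print(f"${annual_cost_usd:,} per year at the quoted hourly rate")
```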

There's going to be a lot of data constantly coming in. It can be gigabytes, terabytes, petabytes, right? So, we're going to need a system that can scale up and down and handle all this different data as it's constantly streaming in. You're going to need that uptime 24/7. With IoT, you're going to be collecting information from a whole bunch of different machines, so you're going to need to consider how we're handling our structured data, our unstructured data, and our semi-structured data. We're going to need a plan to bring that together and really give us one unified picture of what's going on across our environment. You're going to want to think about the veracity of this data, right? You might have some interference with your wireless connections coming from different machines in a plant. You might just have bad internet communications overall.

So, we're going to need a plan for how we're going to handle those errors when we have significant gaps or corrupted messages coming from our different IoT devices. In any data solution in the 21st century we have to be thinking about security and governance, right? We don't want to be opening ourselves up to potential cyber-attacks and all the risks that can come with that. And once you have all this security and strong infrastructure in place, it's kind of hard to have a way where you can support these AI models to get those real-time insights. So, we think our IoT accelerator is a great way to handle this challenge, and we're using Databricks and the Databricks Lakehouse to support this, right? We're using Spark Streaming to give you a scalable and performant solution that allows you to analyze all of your data in near real time and gives you those machine learning model predictions. Thanks to Spark, you're able to ingest, process, and transform all this data performantly at scale.

Thanks to Databricks Lakehouse, we have the reliability, the strong governance, and the performance that you've come to know and love from a data warehouse, while also having the openness, the flexibility, and the machine learning capabilities of a data lake. The solution that I'm showing you today can be applied across any environment where you can use IoT streaming for preventative maintenance, in domains such as mining, remote sensing, medical devices, and network security. Anything in that field is a great application for our accelerator. Now I also want to emphasize that IoT can be applied to more than just manufacturing. I think it's a safe bet to say that probably every one of you in this room has a cell phone on you right now, right? Most of you probably have a smartwatch or some other internet-capable device on you besides just your cell phone.

And you probably got here with some form of public transportation, right? Maybe a plane, maybe you took an Uber, anything in this category of cars and trucks and things that go, right? These are all great applications for us to go ahead and figure out how we can glean these natural insights from the data in the world around us. So our use case today is looking at detecting faults in ion mill etching tools used during the semiconductor fabrication process. Basically we're manufacturing computer chips. We have these silicon wafers that we need to cut up, and then we create all the little tracks for our semiconductors to do what they do. We are going to collect all the data from the machines as the wafers are going through, and make sure that the measurements for all of our products are coming off in the right shape.

We're going to want to make sure that we have real-time mass consumption of our machine events by leveraging Azure Event Hubs. We're going to need a way to seamlessly extract, clean, and transform all of our machine data. We're going to use Databricks Lakehouse to have a performant and scalable storage solution. And then of course we're going to want those machine learning predictions of all of our etching tools' health status, and then we're going to want a dashboard, right? Because if you just have a bunch of data, it's kind of hard to figure out what's going on. But when we give ourselves a clean picture that we can quickly look at to analyze our problem, we're able to get to our solution faster. The outcome of our solution is that we have a real-time view of each etching tool's operating characteristics, and all the telemetry information for our product.

And we're able to predict the time to failure for each machine. So basically, how long we have from this exact moment in time to the point where our machine is code red and we're rushing in people to operate on it and get it back online. So, for a second, I'm going to put you in the shoes of a plant operator, right? If you want to know what's going on with your plant, this is the view that we give you from our Tableau dashboard. On the top there, you're able to see the individual variance from the target measurement of each product that's coming off the line. In blue we have our individual measurements, and in red is a three-hour moving average, very similar to the seven-day rolling averages we used to have for COVID cases.

At the bottom, you're able to get an overall view of, alright, how many of my machines are healthy, how many are critical, and how many need attention? And in the bottom right there you're able to see the number of measurements we have coming off of each machine. If you're a plant operator, you're also probably going to want to know which specific machines are continually giving you problems, and what the typical issues coming up from each machine are. So on this view, you're able to filter by specific tool and by the specific type of issue arising. You're able to get a breakdown of all your specific faults, and at the bottom there you can actually dig into an individual data point pretty quickly if you want to do that as well.

So this is our architecture from start to finish. First, we have our IoT data generation. For the use case that we're showing you, we're simulating our IoT events. So we have a file that's being dropped into our Azure blob storage. This event is going to trigger our Azure function, which is going to go ahead and stream all these raw events to Azure Event Hubs. From there, we're going to ingest and transform all of our data. Azure Event Hubs is going to stream all the event information into our Databricks medallion architecture, right? And we're using Spark Streaming to ingest that. So we're using the medallion architecture: your bronze, silver, and gold tables. First, we're bringing the data into our bronze table. This is where we have our historical archive of the source, with complete data lineage and auditability. So we can reprocess this if we have any issues downstream.

From there we're going to use our silver Spark stream to read from the bronze table, transform all the events, drop duplicates, add any derived columns like our three-hour rolling average as necessary, and load this data into our silver table, right? Because we're using Spark, any operation that you've come to know and love is something that we can naturally support here. Spark is also giving us exactly-once, fault-tolerant guarantees. And because we're using Databricks Workflows, we have fault-tolerant retries. So there's pretty consistent data there. And we are using Delta Lake as well, so we're getting those ACID transactions, scalable metadata handling, schema enforcement, and time travel, which I think is a great tool for our data scientists, right? As this data is going to be constantly streaming and changing, we want to make sure that they're able to get reproducible results, so they can look at a consistent view of the data over time, and they can upsert all of their data as well.
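The silver-layer steps named above (drop duplicates, then add a three-hour rolling average) can be sketched in plain Python. This is a stand-in for the Spark streaming job, not the actual pipeline: the field names `tool_id`, `ts`, and `value`, the tool name `etch-01`, the choice of `(tool_id, ts)` as the duplicate key, and the trailing-window definition are all assumptions made for illustration.

```python
from collections import deque
from datetime import datetime, timedelta


def drop_duplicates(events):
    """Keep the first event seen for each (tool_id, ts) pair."""
    seen = set()
    unique = []
    for e in events:
        key = (e["tool_id"], e["ts"])
        if key not in seen:
            seen.add(key)
            unique.append(e)
    return unique


def add_rolling_average(events, window=timedelta(hours=3)):
    """Attach a trailing time-window mean of `value` to each event, per tool."""
    windows = {}   # tool_id -> deque of (ts, value) still inside the window
    out = []
    for e in sorted(events, key=lambda e: (e["tool_id"], e["ts"])):
        w = windows.setdefault(e["tool_id"], deque())
        w.append((e["ts"], e["value"]))
        # Evict readings at or before the start of the trailing window.
        while w[0][0] <= e["ts"] - window:
            w.popleft()
        out.append({**e, "avg_3h": sum(v for _, v in w) / len(w)})
    return out


t0 = datetime(2023, 6, 1, 0, 0)
bronze = [
    {"tool_id": "etch-01", "ts": t0, "value": 10.0},
    {"tool_id": "etch-01", "ts": t0, "value": 10.0},   # duplicate delivery
    {"tool_id": "etch-01", "ts": t0 + timedelta(hours=1), "value": 20.0},
    {"tool_id": "etch-01", "ts": t0 + timedelta(hours=4), "value": 40.0},
]

silver = add_rolling_average(drop_duplicates(bronze))
averages = [row["avg_3h"] for row in silver]
print(averages)
```

The bronze list is left untouched, so the silver rows can always be recomputed from it, which is the reprocessing property the medallion layout is meant to give.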

From there we go into our machine learning model and failure predictions stage. This is where we're going to actually generate our ML model and get those insights into what we think is going to happen to our machines. And we now have it as consumption-ready data for our business use cases, right? We're taking that three-hour computed moving average on the sensor data, and we're retrieving our features from the Databricks Feature Store, joining this with our enriched data from our silver table, and producing our machine learning model input. Thanks to Databricks MLflow, we're managing the versioning, experiments, and deployments of all our models, which is going to accelerate our machine learning modeling process. Throughout all this, we're able to create initial models and have a whole bunch of experiments over time as our situation changes around our plant. Our machine learning model is going to generate the time-to-failure prediction classes for each machine. And now, again, we're using Spark Structured Streaming to load all this data into our curated gold table.
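To illustrate what "time-to-failure prediction classes" might look like once they land in a gold table, here is a small Python sketch that buckets predicted hours-to-failure into the health statuses shown on the dashboard. It does not reproduce the model from the talk: the thresholds, the tool names, and the predicted values are hypothetical, and only the three status labels (healthy, critical, needs attention) come from the transcript.

```python
from collections import Counter

# Hypothetical thresholds (hours until predicted failure); not from the talk.
CRITICAL_BELOW_HOURS = 24
ATTENTION_BELOW_HOURS = 72


def health_class(hours_to_failure):
    """Map a predicted time to failure onto a dashboard status."""
    if hours_to_failure < CRITICAL_BELOW_HOURS:
        return "critical"
    if hours_to_failure < ATTENTION_BELOW_HOURS:
        return "needs attention"
    return "healthy"


# Made-up model outputs: predicted hours to failure per etching tool.
predictions = {"etch-01": 6.5, "etch-02": 48.0, "etch-03": 300.0, "etch-04": 150.0}

gold = {tool: health_class(hours) for tool, hours in predictions.items()}
summary = Counter(gold.values())

print(gold)
print(dict(summary))
```

The `summary` counts correspond to the "how many machines are healthy, critical, or need attention" panel described in the dashboard walkthrough.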

Lastly, we're going to serve all of our curated data to our end users. This is where we support the Tableau dashboard that I was showing you all earlier. Thanks to Databricks' connections with Tableau, we're actually able to directly query our Databricks Lakehouse via the SQL Warehouse. So we are using Spark SQL, and we're facilitating a live connection to ingest all that data. Again, any sort of transformation or calculation that you've come to know and love from Spark is something that we can do here, because we're naturally using Apache Spark. Your Tableau dashboard is giving you the real-time machine telemetry data and the health forecast for each machine via time series graphs. And you have tables with varying levels of filters, like I was showing you earlier, right? You're able to look at past machine fault occurrences and the reported sensor values at the time of failure for each machine via a separate table that can be queried with filters.

And we have the reliability engineering, right? So now our machine operators can get visibility into their ongoing operations. They can proactively plan maintenance to get ahead of things before machines go down, and we can predict when those machines are at a critical health level. Because we're able to go back and look at the historical fault data, we're able to understand our failure trends, view machine faults at a granular level, and get the full understanding of that, right? We now have a past, present, and future view of our machines, and we're able to get that full understanding of what's going on with our plants. The other great thing about our gold tables, right, is that this is always data that's ready for business consumption. So, if you have your C-suite or somebody coming up with an ad hoc query, you're able to quickly go to your gold table. You don't have to go and find some crazy join; you know that it's all ready to go.

So, what have we accomplished? We've taken our plant operators from reactive to proactive, right? We've enabled them to get ahead of their maintenance needs and fix their problems before they arise. We're allowing them to process over 400,000 rows of data across three tables in less than three minutes. We're giving our plant operators a real-time view and insight into their plant's operations through their Tableau dashboard, and we're helping them save up to a trillion dollars a year by eliminating up to 800 hours of downtime, which can cost them over $260,000 per hour. So thank you so much for being here. We'd love to see you at booth number 724 to talk about how we can support you in all of your data needs, and also come see us again later at 5:10 on stage three to hear about Delta Sharing. And with that, I think we have some time for questions.


