The Observatory
July 23, 2023

Dan Shiebler, Abnormal

Dan Shiebler, head of ML and AI at Abnormal, and Kyle Kirwan, CEO and co-founder of Bigeye, discuss resilient ML, cybersecurity, and more.

Read on for a lightly edited version of the transcript.

Kyle: Welcome back to another episode of the Observatory. I'm your host, Kyle from Bigeye, and today we are talking with Dan Shiebler from Abnormal, where he is the head of machine learning and leads the detection team. Dan, welcome!

Dan: Thanks, Kyle. Happy to be here.

Kyle: Abnormal does some pretty cool stuff in threat detection. They work with some pretty exciting companies, like Splunk and Xerox. Some of the world's largest organizations are using Abnormal.

I'm not from the security space. I'm sure plenty of people watching are not either. Dan, could you give us just an intro? What does Abnormal do and why is that important? Why are companies like this working with you? And then what does the detection team do within that?

Dan: So Abnormal Security fights cybercrime. We do this by detecting cyber threats coming in through messaging applications or through compromised accounts. Our primary product is email security: detecting and mitigating threats like phishing, vendor fraud, ransomware, and business email compromise.

Essentially, threat actors will send malicious messages or compromise accounts, and then take steps to increase their access and the amount of damage they can do to an organization, whether that's stealing and exploiting credentials, installing ransomware, or carrying out a number of other malicious actions.

The Abnormal Security product functions by detecting and mitigating these kinds of attacks. The detection team utilizes machine learning, artificial intelligence, and data science in order to identify these attacks, distinguish them from normal business behavior, and mitigate them effectively.

Kyle: As the lead on the detection team, I assume you have information around, say, the user's inbox you're currently looking at, what's in the body of the email, who's the sender of the email, et cetera. Those are all inputs. Then I assume from that, you're producing some threat level or categorization or warning? What is your team translating from and into on the detection side?

Dan: Yeah, exactly. I mean, we can ultimately think of this as a very fundamental classification problem in a lot of ways. This is a very classic ML problem. You have a particular message or a particular account, and you want to determine whether or not this message is an attack or this account is compromised.

We make that determination given the information that we have about this message or this account. And that requires collecting as much data as possible: who is the person receiving this message? Who is the person sending it? What are their historical patterns of behavior? And how is this different from what we're observing here?

What are all the different indicators of compromise that might be attached to this account sign-in event or to this message event? And how do those indicators of compromise compare to things that we've seen before?
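To make that framing concrete, here is a minimal sketch of the shape of the problem as Dan describes it: per-message signals combined with historical behavior, scored by a binary classifier. The feature names, data structures, and model interface are illustrative assumptions, not Abnormal's actual system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

@dataclass
class MessageEvent:
    sender: str
    recipient: str
    subject: str
    body: str

def extract_features(msg: MessageEvent,
                     history: Dict[Tuple[str, str], dict]) -> dict:
    """Combine per-message signals with what we know about this sender/recipient pair."""
    prior = history.get((msg.sender, msg.recipient), {})
    return {
        "sender_seen_before": float(bool(prior)),
        "messages_from_sender_90d": float(prior.get("count_90d", 0)),
        "subject_mentions_invoice": float("invoice" in msg.subject.lower()),
        "body_contains_urgency": float("urgent" in msg.body.lower()),
    }

def score(msg: MessageEvent,
          history: Dict[Tuple[str, str], dict],
          model: Callable[[dict], float]) -> float:
    """`model` stands in for any trained binary classifier returning P(attack)."""
    return model(extract_features(msg, history))
```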

Kyle: So, you have a bunch of different signals here. And you're processing these in real time. So an email comes in and then you guys need to have this pre-trained algorithm. You're gonna feed the email through that, and then you're gonna get a warning.

Does that surface directly to the user? Do you guys block the email from going into the inbox, or what's the result from the model?

Dan: That’s a great question. There are actually a bunch of different outcomes depending on the customer integration. Different customers and different security teams have different degrees of sophistication and different desires for control over what kinds of remediation to apply. The default is what we call an API-based solution. So basically when there's a sign-in or an email, we receive a notification. And we can take action to pull that email or to block that sign-in ourselves, and then surface the information to the customer's security team to take further action.

But there's a wide range of different kinds of remediation actions that we can take: moving a message to a different folder, or tagging it with a banner, or presenting a notification to a security team without notifying the customer. We take these actions based on the size of the security team, the circumstances, the business judgment of the risk level, et cetera.

Kyle: And how large is this team? What’s the mix between data scientists, traditional software engineers, machine learning engineers, et cetera? What does it take, team-wise, to build this product?

Dan: The team as a whole is about 40 people. That’s split between more traditional backend software engineers, machine learning engineers, and data scientists. Rather than focus on the individual roles, it's helpful to explore the kinds of problems that come up when we're building this kind of system.

There is, of course, a tremendous amount of data to process. We have to understand the context of an individual email, and consume it, process it, and take action on it fast enough, which is itself a very difficult problem. But we also have to accumulate all the information about historical emails and incorporate it into the judgment of what to do with a particular message.

For example, messaging often takes place in conversational threads. It's important to know what has happened in all of the threads leading up to a particular message when judging whether or not there might be invoice fraud or a conversation injection. A very common type of attack, for instance, is when someone compromises an account and then tries to leverage ongoing threads into attacks.

And so detecting that effectively requires having access to these previously processed messages when making this judgment. That's a difficult backend engineering challenge: processing each message in a way that makes it available to future processing in an extremely low-latency fashion. So there's a lot of intense backend engineering that powers that.

And then there's machine learning engineering: actually building and deploying machine learning models, and training them on the new data that we see. And there's data science: a humongous amount of work in understanding the performance of the models that we're deploying.

We work to understand how to optimally tweak the parameters of the system, in order to adapt to each of the different customers and understand when things are going well and when things need to be improved.

Kyle: You mentioned “models” in the plural. Should we assume there isn’t one mega model that can handle all of this?

Are we talking about distinct models for different types of threats that you need to process, or do you have different models for different customers? What is the landscape of models that your team actually owns?

Dan: So there are models for different kinds of threats, certainly. We can think about models broken down at a number of different levels.

There are models broken down by what’s going on with accounts and sign-ins versus what’s going on with emails and messages, and then by different kinds of messages as well. Even within the realm of messages, models break down in many different ways. We have general centralized models that try to decide whether something is an attack or safe at a very high level. But we often find it valuable to build submodels that dig into individual sub-cases.

For instance, the characteristics of an attack from a sender that's never been seen before, something that's coming in completely from the outside, are very different from those of a vendor who's been compromised. For a vendor who's been compromised, the type of information that you have to use to make a decision is very, very targeted.

You need to look not only at the normal characteristics of an attack, but specifically at the characteristics of this vendor and how this message differs from what we usually see from them, when trying to determine whether this is a vendor compromise. And so, given how different the signals are that we want the model to take optimal advantage of, we often find it valuable to break the problem down into individual submodels for different attack types.

So at the level of each individual attack type, we have different kinds of submodels to handle it, as well as general ensemble models that try to make general decisions. We take a very defense-in-depth approach to being able to drill down into the most important and most vicious kinds of attacks.

Kyle: Do you have ensembles layered over all of that ultimately? Or do models drive individual features? Are you architecting it so that a model can drive a specific feature or response, or does everything fit under a larger umbrella or abstraction?

Dan: So, we can think about things as feeding into a detection layer, where the detection layer makes the final decision of whether or not a remediation action should be taken on a particular input. And you can think of that decision as a big “or”: a giant “or” statement over all of the individual detectors that are individually equipped to make the final decision that this is bad and we should take action on it. Of course, as you add more and more detectors, you run the risk of decreasing your precision. We have extremely high precision requirements in order to produce a customer experience that doesn't require security teams to spend hours looking for missing messages and locked-out accounts. So we need to be very careful when we add new detectors or modify detectors to maintain this precision.

But ultimately, each individual kind of detector feeds into this general “or” statement. Certain detectors are ensembles of lots of different models. Certain detectors are a single model that’s equipped to make a certain decision. Sometimes one detector is an individual model that says: if this model's prediction is above some very high threshold, then we're confident enough to make this decision.

But a different detector might take that model as an input into a more general decision and be able to take action even if that model happens to be less confident. This happens a lot when you have an individual model that looks for a particular pattern, but attackers modify that pattern very frequently to try to get around it. So we have general ensemble models that will look for a more general combination of different attack types.
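As a rough illustration of that “big or” structure, here is a minimal sketch in which each detector, whether a single thresholded model or an ensemble, independently decides whether to flag, and the detection layer takes the disjunction. The detector names and thresholds are hypothetical, not Abnormal's real detectors.

```python
from typing import Callable, Dict, List, NamedTuple

class Verdict(NamedTuple):
    flagged: bool
    reasons: List[str]

# A detector maps extracted features to a yes/no decision.
Detector = Callable[[dict], bool]

def high_confidence_model(features: dict) -> bool:
    # A single model behind a very high threshold: fires only when very confident.
    return features.get("attack_score", 0.0) >= 0.99

def vendor_compromise(features: dict) -> bool:
    # A targeted submodel: compares this message against the vendor's history.
    return features.get("vendor_deviation_score", 0.0) >= 0.9

DETECTORS: Dict[str, Detector] = {
    "high_confidence_model": high_confidence_model,
    "vendor_compromise": vendor_compromise,
}

def detection_layer(features: dict) -> Verdict:
    """The 'big or': flag if any individual detector says to take action."""
    reasons = [name for name, det in DETECTORS.items() if det(features)]
    return Verdict(flagged=bool(reasons), reasons=reasons)
```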

Kyle: All right, so you have a fairly large portfolio here that your team is responsible for maintaining, and we'll talk about this in a moment. But, you have a background at Twitter as well, working on problems at scale there.

Anything specific that you've learned about the management of this many models? Obviously you're working in an environment that's constantly changing and evolving, so I assume that there's quite a lot of effort in keeping those models up to date and reacting to the world and the way it’s changing. Anything specific that you've learned about what it takes to manage a portfolio that large, with that many models and keep everything fresh?

Dan: One thing that I've gone back and forth on, both during my time at Twitter and my time here at Abnormal, is how to think about the complexity that exists in a modeling stack: what kinds of complexity are malignant and what kinds are benign.

If we have a hundred models all in production, but there's only three or four of them that we're constantly thinking about and seeing issues from, then it's only really those three or four that are the complexity that people need to think about. If the other models are running independently and not feeding into each other and performing with high enough precision that they don't need to be taken down, then they're not actively contributing to complexity that people are managing.

The key to this is monitoring, and monitoring the precision of each individual detector on its own. For this type of federated system where we have each of these individual detectors operating independently and capable of making a final decision by themselves, it’s critical to understand the criteria for keeping a particular detector live.

If you launch something and keep it live for as long as the unique precision of this individual detector is high enough, and then take it down as soon as that precision drops, what you've done is largely separate the complexity of this detector from the complexity of the other detectors. This is a strategy that makes sense when we have teams of a certain size controlling the deployment of different types of models.
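A minimal sketch of that keep-alive rule, assuming each detector's unique flags and their labeled outcomes are already being tracked; the threshold here is made up for illustration.

```python
# Hypothetical keep-alive check for a single detector in a federated system:
# it stays live only while the flags it uniquely contributes remain precise enough.
UNIQUE_PRECISION_FLOOR = 0.95  # illustrative bar, not Abnormal's real threshold

def should_stay_live(unique_true_positives: int, unique_flags: int) -> bool:
    """Decide whether a detector has earned its complexity budget."""
    if unique_flags == 0:
        # Nothing uniquely flagged: it adds no unique value, so it is a
        # candidate for retirement (or at least human review).
        return False
    return unique_true_positives / unique_flags >= UNIQUE_PRECISION_FLOOR
```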

The alternative solution, of course, is trying to wrap everything into a single model that serves as the single point making all final decisions; everything feeds into a single ensemble. The problem with doing that is that you create only a single on-ramp for changes to be made. It's the same as the decision of whether you want a system filled with microservices or a system built as a monolith.

A monolith is easier to maintain as a small organization where you are trying to avoid the overhead of lots of different services, but it's also much harder for many people to make different kinds of changes to it. Utilizing lots of smaller models, all federated together through a simple “or” or ensembling layer, enables engineers to make changes or improvements to smaller parts of the system independently.

At Twitter, I worked with a lot of different teams that were working with different kinds of model deployments. Very often I found that the single model approach would freeze progress because the amount of work and effort required to make an improvement to the model requires you to shut the previous model down and turn on your new one.

And so that's a very, very large and heavy process, filled with humongous amounts of operational burden and validation burden. Whereas when you have a very small thing, being able to make incremental improvements is much faster and easier. That's one of the reasons we're opting for this more federated architecture: different engineers and different parts of the team can make incremental improvements by themselves.

Kyle: That's really cool. Microservices versus monolith has been an ongoing debate. Anybody logging into Hacker News once in a while will see at least one thread a day complaining about one or the other. I hadn't really thought about that when it comes to model management, but one thing that you said stood out to me. You mentioned some models contributing to complexity or overhead, or requiring thoughtful care, and then the long tail of everything else, where you've got some monitoring on it, and as long as the monitoring isn't throwing off any alerts, you just assume that it's okay. To borrow a phrase from DevOps, you treat those models like cattle, not like pets.

And those high complexity models, those are your pets and you have to pay attention to them a little more closely. Is that an accurate characterization of the general way that you treat them?

Dan: I think it is. I think that one of the keys to this is being confident that they actually are cattle and they don't need to be treated like pets. You need to have the appropriate monitoring in place in order to know when their precision is going down or when they're no longer performing up to snuff.

Ultimately, for us, a lot of it comes down to what we're relying on models to do. In a lot of ways it's really just two things. One is to flag enough things uniquely that it's contributing to the overall performance of the system.

And the other is to flag things with high precision so that they're not creating false positives and creating customer pain. And as long as those two things are true, then we are able to feel confident that something is performing well.

And so we have a centralized process for understanding and evaluating how well these two metrics maintain their performance at the level of each of the individual detectors that we deploy. And so that lets us understand which models need to be treated like cattle and which need to be treated like pets.

And this is fairly constant over time. There are many models that remain unchanged for a year. Usually what will happen is we'll launch a model and it will have some amount of performance. If it is a cattle model, it will slowly degrade over the course of perhaps six months to a year, until eventually it reaches the point where enough other, better things have been launched that we can turn it off and send it away to be turned into steak.

Kyle: I've never heard of a model turning into steak, but there's a first time for everything! So the monitoring that you're applying, I assume you have some sort of basic template or a rule book. This becomes systematic at this point where, for example, you mentioned precision.

You track the precision of the model over time, and you track some other characteristics on an ongoing basis. And then there's some sort of threshold or basic indicator that says somebody needs to possibly decommission this model or spin it down?

Dan: Yep, that's right. So basically we have a guideline for the unique contribution of any individual detector that needs to be maintained in order to stay alive. That’s directly measured in terms of precision. If this is flagging things and nothing else is flagging things, what is the precision of the things that this is uniquely flagging?

If this is uniquely flagging things with really low precision, someone needs to look into that. Sometimes we'll need to keep something like that on because it's flagging valuable and important things. But when you have something that's flagging valuable and important things with really low precision, that's something you need to take action on: either you increase its precision, or you get something else with higher precision to start flagging those important things.

As long as the set of things that this is flagging uniquely has high precision, and we're not worried that it was previously flagging lots of things and is now flagging fewer, so that we're missing things, there's no reason to let the complexity of this individual thing add to the complexity we have to manage overall.
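Here is one way that "unique precision" metric might be computed, as a sketch: restrict each detector's flags to the items no other detector flagged, then measure precision on that restricted set. The data structures are assumptions; in practice this would run over logged verdicts and labels.

```python
from typing import Dict, Optional, Set

def unique_precision(detector: str,
                     flags: Dict[str, Set[str]],
                     true_attacks: Set[str]) -> Optional[float]:
    """Precision of `detector` on the message ids that only it flagged.

    `flags` maps detector name -> set of flagged message ids;
    `true_attacks` is the set of ids later confirmed as attacks.
    """
    others: Set[str] = set()
    for name, ids in flags.items():
        if name != detector:
            others |= ids
    unique = flags[detector] - others
    if not unique:
        return None  # nothing uniquely flagged; unique contribution is zero
    return len(unique & true_attacks) / len(unique)
```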

Kyle: Got it. So monitoring brings human attention to something, so someone can go evaluate it and decide. And once somebody on your team actually looks into it, you can take appropriate action from there.

Dan: Yep.

Kyle: So, Dan, you've spoken in the past about what you've phrased as "resilient machine learning", and that at least sounds related to the processes you just described. Can you tell us a bit more about what resilient machine learning is, and how it's related to what we just talked about?

Dan: Yeah, absolutely. So resilient machine learning is the design of machine learning systems that are capable of handling the kinds of challenges that software systems often experience, whether that's catastrophic failures, like features suddenly disappearing, services going down, or data distributions changing, or softer, slower failures, like the onboarding of new customers or new kinds of users whose behavior is very different from what's been seen in the past.

Designing resilient machine learning systems requires anticipating the set of possible changes and stresses your system may undergo, and testing your system so that it is resilient and you understand how it will behave in those settings. It means making design decisions, both in your machine learning models and in how you deploy them, so that you're confident your models and systems will perform well, and not catastrophically badly, when these problems inevitably occur.

Kyle: It sounds like there's some degree to which, if you're designing machine learning systems this way, they could be blind to these types of problems: the problem can occur and the model can just continue to work at some level of reliability. Is that correct, or is monitoring also a piece of the resiliency strategy? How does that actually work in a running system?

Dan: I would say that they're complementary rather than being closely tied to each other. The federated design of our system is intended to be an element of resiliency, in order to support the operation of very different kinds of customers and very different kinds of attacks. One thing that we need to think about a ton is cyber attackers coming up with new kinds of attacks. This happens every day, and the efficacy of our product is determined not by how well we can catch attacks that have happened in the past, but by how well we can catch attacks that will happen in the future.

So we need to be resilient to those kinds of changes in data distribution, which are adversarially selected based on what attackers have seen work in the past. That kind of adaptability is key. But of course there are also certain types of signals that are not available in certain customer environments or under certain types of changes. And we also need to be resilient when services go down.

And so the federated design allows us to build and ship detectors that rely on very different kinds of signals. Some rely on signals that are easy for attackers to spoof; some don't. Some only rely on signals that attackers have no control over, or that are not as susceptible to feature-serving failures or customer distribution changes. This federated design is built to anticipate the different kinds of failures and execute detection in a way that is resilient. But this federated system requires having a lot of monitoring built in to support it.

So designing a resilient system requires making design decisions that are very monitoring-heavy, in terms of what types of metrics we need to be able to track and how we need to be able to track them.
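One concrete flavor of the resiliency Dan describes, sketched under assumed names and structure: each detector declares the signals it depends on, and the detection layer skips detectors whose signals are unavailable (say, because a feature service is down or a customer environment doesn't provide them) rather than failing outright.

```python
from typing import Callable, List, NamedTuple

class ResilientDetector(NamedTuple):
    name: str
    required_signals: frozenset
    decide: Callable[[dict], bool]  # signals -> flag or not

def run_detectors(detectors: List[ResilientDetector],
                  signals: dict) -> List[str]:
    """Run every detector whose required signals are present; skip the rest."""
    flagged_by = []
    for det in detectors:
        if det.required_signals - set(signals):
            # A feature service is down, or this customer environment simply
            # doesn't provide the signal: skip this detector. Detectors that
            # rely on other signals still get their vote.
            continue
        if det.decide(signals):
            flagged_by.append(det.name)
    return flagged_by
```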

Kyle: Got it. Okay. I do want to get to the rapid-fire questions before we close out the episode. But before that, if people want to learn more about resiliency in machine learning or hear more from you on this, what's the best way for people to find out that information?

Dan: I have a post on my website, where I've written a little bit about it and I'm always open to people reaching out.

Kyle: Great. So we'll include a link down below in the description. I think it's a fascinating topic. I think about it a lot in terms of other systems as well. We would expect to see resiliency in physical infrastructure, and we would expect to see it in data pipelines. So seeing it applied in machine learning is exciting.

Now…are you ready for some rapid fire questions?

Dan: Shoot.

Kyle: Okay. Number one: what is your favorite word?

Dan: Majestic.

Kyle: Okay. That's good. Number two: favorite history podcast?

Dan: History of Byzantium.

Kyle: That's a specific podcast, all about Byzantium?

Dan: That's right.

Kyle: Last question, Dan. What's one subject you wish you could learn more about?

Dan: Probably philosophy.

Kyle: Okay. That's a good one. Well, Dan, it's been awesome having you on. We talked about resiliency in machine learning. We talked about threat detection models. We talked about managing a portfolio of a large number of models and what's involved in doing that successfully. We heard a little bit about Abnormal and what you guys do in the security space. Thanks for being on the Observatory today.

Dan: Thanks Kyle. I had a great time.
