Over the past few years, we have had the opportunity to conduct several Purple Teaming exercises together with our customers. Some customers have their own Blue Team; others use an external provider for this service. Sometimes it is a mix, where an external company supports the internal Blue Team in its daily tasks.
Particularly after Purple Teaming exercises involving external providers, we often see a mismatch between the customer’s expectations and the service provided. This does not necessarily mean that the service provider has done a poor job, but rather that the customer expected something more, something different.
We have had many heated discussions in our office about how this mismatch between customer expectations and the actual service provided comes about. From these discussions, combined with our experience from past Purple Teaming exercises, we compiled this blog post to share our take on how to prevent the most prevalent issues as early as possible.
Before we get into the details, we would like to point out that this blog post is not intended to bash such service providers! On the contrary, we believe that these services are essential for companies that lack the size and capabilities to operate their own Blue Team. Rather, we want to turn a lose-lose situation into a win-win situation for both the customer and the service provider.
Where it all starts – Evaluation of a Service Provider
Nomen est omen?
tl;dr:
Understand the technical detection capabilities and limitations of the service being offered.
Key questions:
- What are the provided detection capabilities?
- Is the service (solely) based on a commercial product?
- Is it possible to implement custom detection logic?
- What kind of log sources can be ingested by the provider?
As you may have noticed, in the introduction to this post we simply spoke of a “service”. The reason for this is that there are many different names for such a “service”. Here are some examples we have come across:
- Managed Detection and Response (MDR)
- Security Operations Center (SOC)
- Security Operations Center as a Service (SOCaaS)
- Managed Security Service Provider (MSSP)
As far as we know, there is no clear definition of these terms and what they encompass, so we were never sure what to expect. It is probably fair to assume that customers feel the same way. So how can these services be compared and what can be expected?
In our experience, one of the key differentiators between various provider models is the scope of the underlying detection capabilities. We mainly encountered two types:
- Detection capabilities based solely on a commercial EDR product
- EDR-based detection plus additional custom detection capabilities developed by the service provider
Both types of service have their place. Services based solely on an EDR are more cost-effective, but lack the ability to implement complex custom detection rules. Also, existing security devices such as firewalls, web application firewalls, proxies, etc. may not integrate into such a solution, restricting the coverage of possible detections. Nevertheless, depending on the size, complexity and threat model of your environment, this type of service may be more than enough. The important thing is that you, as the customer, have a clear understanding of the provided detection capabilities.
Does it fit?
tl;dr:
The provided coverage of the service should fit your environment.
Key questions:
- Does the service reflect your threat model?
- Are all key aspects of your IT infrastructure and critical assets covered by the service?
- Does the time coverage of the service match your business model and availability?
Now that you understand the potential detection capabilities that can be provided, the big question is whether it fits your environment or not. There are three main points to consider:
- The service should be chosen according to your threat model
- The service should cover all major/relevant aspects of your IT infrastructure and especially your critical assets
- The service availability should align with your business model
Threat Model
To choose the right service and service level, you need to understand your threat model. Is the main threat a ransomware attack, or do you fear a more targeted attack on your infrastructure and users? Does your main concern lie with the availability of your services or rather with the confidentiality of your data (or both)? What assets are crucial to your company’s ability to operate?
Depending on your threat model, the required service might look different.
IT Infrastructure Coverage
As implied previously, an EDR-only approach may be sufficient for your environment and threat model. Still, you need to check whether the EDR solution can be installed on all your different operating systems. If a substantial part of your infrastructure runs on Linux, but your provider’s EDR solution only supports Windows, this could prove problematic. Another example might be a provider that offers extensive coverage of cloud infrastructures, but has little to no detection capabilities for on-premise systems.
However, if you have a more complex environment and critical assets to protect, you may need to consider further detection capabilities beyond an EDR solution.
Suppose you have a database server with sensitive customer data that is accessible through a web application. You want to know if someone has gained unauthorized access to the database and extracted this sensitive data. How hard can that be, right? Well, do you know which log sources you would need to build detection logic for that?
- Operating system logs
- Database logs
- Web server logs
- Firewall logs
- Web Application Firewall logs
- etc.
The point is that writing custom detection logic is not a simple task. So if you require such use cases, the provided service must at least be able to ingest these logs. In the best case, the service already provides such use cases, which can then be implemented or adapted to your environment.
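To make this a bit more concrete, here is a minimal sketch of what one building block of such detection logic could look like, assuming hypothetical JSON-formatted database audit logs that record a source IP and the number of rows returned per query. The field names and the threshold are made up and would have to be tailored to your environment; this is an illustration, not a production-ready rule.

```python
# Minimal sketch: flag potential bulk data extraction from database
# audit logs. Log format and threshold are hypothetical assumptions.
import json
from collections import defaultdict

ROWS_THRESHOLD = 10_000  # assumed "unusual" number of rows per source IP


def parse_events(path):
    """Each line is assumed to be a JSON event with 'src_ip' and 'rows_returned'."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)


def detect_bulk_extraction(db_log_path):
    rows_by_ip = defaultdict(int)
    for event in parse_events(db_log_path):
        rows_by_ip[event["src_ip"]] += event.get("rows_returned", 0)
    # Flag every source that pulled more rows than the threshold
    return {ip: n for ip, n in rows_by_ip.items() if n > ROWS_THRESHOLD}


if __name__ == "__main__":
    for ip, rows in detect_bulk_extraction("db_audit.jsonl").items():
        print(f"ALERT: {ip} extracted {rows} rows")
```

A real use case would of course correlate this with the web server, firewall and operating system logs listed above, which is exactly why being able to ingest all of these sources matters.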
Service Availability
We encountered many different service availability models during our Purple Teaming engagements:
- 7×24
- 7×10
- 5×11
- 5×10
- 5×8
- etc.
Obviously, these different models have different price tags attached, and from a security perspective more coverage is usually better. However, the main point here is that the chosen model should fit your business model and your own availability.
Let’s assume you have chosen the 7×24 model. On a Saturday at 03:00, the external provider notices suspicious activity on one of your clients. They send you an email with the incident and the corresponding details. Will you see this email in time to benefit from the 7×24 model, or will it only be picked up on Monday morning? You may have set up an on-call service and be able to react accordingly, but these things are your responsibility as a customer. Maybe a 5×10 model would better fit your business model and availability?
Communication is Key
tl;dr:
- Understand which alerts and incidents are handled by the service provider.
- Understand your responsibilities and those of the service provider.
Key questions:
- Is it clear which and how alerts and incidents are handled by the service provider?
- Are responsibilities clearly defined?
- Are means of communication clearly defined, including fallback solutions?
In addition to the detection capabilities discussed above, another major tripwire is the communication between the service provider and the customer. Let us give you a few examples.
During a Purple Team exercise, we typically execute known malware on a provided test client. This triggers the EDR (or whatever anti-virus solution is in place) to block and/or quarantine the malware on the host. Great! But often, service providers do not open an incident for such an alert. Why not? Well, sometimes the reason given to us is that the malicious software has been successfully blocked, so the customer does not need to take any action. Another reason could be that the alert does not reach a required severity threshold. For example, it is a low-rated alert and only high and critical alerts are handled by the service provider. However, you as a customer might still want to know about such incidents, since malware on a client could be an indication of something bigger happening.
In another instance, a customer had installed some new servers for us to run our tests on, but forgot to report these systems to the service provider. As a result, they were never correctly on-boarded and no alerts were ever raised for them.
What this shows is that it is vital that communication works in both directions and that everyone knows what to expect and where the responsibilities lie.
Clearly define which alerts, incidents, etc. will be handled by the service provider and which will be forwarded to you. Furthermore, clearly define the responsibilities of both parties, e.g. the reporting of new devices, users and systems.
Lastly, make sure that the chosen communication channels fit your environment and enable you, as a customer, to react to incidents with as little delay as possible. Also think about fallback solutions and escalation chains in case of emergencies.
Incident Happened – Now What?
tl;dr:
- Understand which information is provided by the service provider in case of an incident.
- Understand which reactions the service provider can perform in case of an incident.
Key questions:
- What information is provided in case of an incident?
- Is the provided incident information comprehensive and useful for you?
- Should a service provider be able to take actions autonomously in case of an incident?
Let’s stick with the above example of malware running on the test client. You have clearly defined that you want an incident for such an alert, and the service provider will open one for you.
But what kind of information will the service provider present to you? Will it just be the auto-generated alert from the EDR solution? Will they add context to the auto-generated alert? Will they provide you with recommended remediation steps? Will they perform an initial analysis and provide this information to you?
It is important to understand that there is no golden rule here. But whatever the details of the incident look like, they should enable you to quickly and correctly understand the incident and the associated risk, so that you can decide on an adequate response. If you operate your own internal Blue Team and your employees have a SOC background, not much additional information might be required. If, on the other hand, incidents raised by your provider are handled by your regular service desk, they might need additional guidance.
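To illustrate the difference, here is a hypothetical comparison of a raw, auto-generated alert and an enriched incident record. All field names and values are made-up examples and do not reflect any particular provider’s format.

```python
# Minimal sketch: raw EDR alert vs. enriched incident record.
# All fields and values are hypothetical examples.
raw_alert = {
    "rule": "malware_detected",
    "host": "CLIENT042",
    "severity": "low",
    "action": "quarantined",
}

enriched_incident = {
    **raw_alert,
    "context": "Host belongs to the finance department; user opened an email attachment.",
    "initial_analysis": "Single detection, no lateral movement visible in EDR telemetry.",
    "recommended_steps": [
        "Verify that the quarantine was successful",
        "Search the user's mailbox for similar attachments",
        "Review authentication logs of the affected user",
    ],
}
```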
You may also want the service provider to be able to take immediate action, such as isolating a client. Be sure to check what reactions a service provider can offer, and weigh the pros and cons carefully.
Transparency Creates Trust
tl;dr:
Understand how the service provider documents their services and how this information is made available to you.
Key questions:
- Are custom use cases documented in a comprehensive way?
- Is the documentation accessible to you at any time?
- Is it possible to track decisions taken by the service provider in a comprehensive way?
- Is the provided dashboard comprehensive and usable?
If you need custom or extended use cases, or if the service provider brings their own, it is important to understand how these use cases are implemented. This does not mean that you need access to the underlying query and detection logic itself; rather, you need the documentation of such use cases. This helps you understand whether a given use case even makes sense in your environment. In addition, in the event of an incident, this information may be critical to understanding the incident.
You probably also want to be able to track the decisions and assessments made by the service provider for your alerts and incidents. For example, you may want to know why a certain alert was classified as a false positive. Most service providers offer this to their customers in the form of a dashboard. Check that all the information you need is available and that the portal meets your needs.
Implementation Hell
tl;dr:
Avoid common pitfalls during the implementation phase.
Key questions:
- Are all log sources correctly integrated?
- Are all use cases tailored for your environment? (Thresholds, localization, etc.)
- Are all use cases regularly tested and verified in your environment?
- Is a clear exception handling process defined?
- Are exceptions crafted as narrowly as possible?
- Is the implementation clearly documented?
Once you have evaluated a service provider, it is time to start the integration and on-boarding process. This is one of the phases in which we, as a Purple Team, uncover the largest number of technical pitfalls. Here are the ones we encounter most often.
Missing Log Collection
A very common problem is that not all log sources are on-boarded into the service. Sometimes the agent is not installed on a specific system, or a particular log source was simply never considered for on-boarding.
An accurate inventory of your assets will help you keep track of what is missing. Regularly checking your inventory and the sources you have on-boarded will also help you to reduce drift.
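A simple way to detect such drift is to regularly diff your asset inventory against the hosts that actually deliver logs. The following sketch assumes a CSV inventory with a hostname column and a plain-text export of reporting hosts from your SIEM or provider portal; both file formats are hypothetical.

```python
# Minimal sketch: find inventoried hosts that never report logs.
# File names and formats are assumptions, not a standard.
import csv


def load_inventory(path):
    """Inventory is assumed to be a CSV with a 'hostname' column."""
    with open(path, newline="") as f:
        return {row["hostname"].lower() for row in csv.DictReader(f)}


def load_reporting_hosts(path):
    """One hostname per line, e.g. exported from your SIEM."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}


if __name__ == "__main__":
    missing = load_inventory("inventory.csv") - load_reporting_hosts("siem_hosts.txt")
    for host in sorted(missing):
        print(f"Not on-boarded: {host}")
```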
Missing Tailoring for Customer Environment
Custom use cases can add value to your detection capabilities. Unfortunately, we often find that these use cases do not work as expected in our customers’ environments. This is mostly because the use cases are not tailored to the customer’s environment.
For example, if a use case requires a certain threshold to be reached, say login attempts against 100 different accounts in a short period of time, but there are fewer than 100 accounts in the customer environment, this use case may actually never be triggered.
Another example might be a use case to detect whether a local administrator account is used on a client. If the use case matches on a specific user name, e.g. “administrator”, but the account has been renamed to “admin” in the customer environment, this use case will never be triggered.
It is therefore vital that all use cases are challenged and tailored to your environment.
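One way to do this systematically is to treat use case parameters as data and sanity-check them against facts about the customer environment before go-live. The sketch below replays the two examples above; all use case fields and environment values are made-up assumptions.

```python
# Minimal sketch: sanity-check use case parameters against the
# customer environment. All names and values are hypothetical.
USE_CASES = [
    {"name": "password_spray", "account_threshold": 100},
    {"name": "local_admin_logon", "admin_account": "administrator"},
]

ENVIRONMENT = {
    "total_accounts": 80,            # fewer accounts than the spray threshold
    "local_admin_names": {"admin"},  # the built-in account was renamed
}


def check_use_case(uc, env):
    issues = []
    threshold = uc.get("account_threshold")
    if threshold is not None and threshold > env["total_accounts"]:
        issues.append(
            f"threshold {threshold} exceeds the {env['total_accounts']} existing accounts; the rule can never fire"
        )
    admin = uc.get("admin_account")
    if admin is not None and admin not in env["local_admin_names"]:
        issues.append(f"account '{admin}' does not exist in this environment")
    return issues


for uc in USE_CASES:
    for issue in check_use_case(uc, ENVIRONMENT):
        print(f"{uc['name']}: {issue}")
```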
Unclear Exception Handling
In any reasonably large environment, exceptions will quickly accumulate, especially during the integration phase. It is easy to lose track of what has been excluded from which use case, for what reason, and so on. Exceptions can also be set too broadly, which can render a use case ineffective.
Having a clear process for how exceptions are handled and documented is critical to keeping track.
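One lightweight approach is to record every exception as a structured entry with a use case, a scope, a reason, an owner and an expiry date, and to review the list automatically. The sketch below illustrates the idea; the fields are our suggestion, not an established standard.

```python
# Minimal sketch: track detection exceptions and flag stale or overly
# broad entries. Field names are an assumption, not a standard.
from dataclasses import dataclass
from datetime import date


@dataclass
class DetectionException:
    use_case: str
    scope: str  # as narrow as possible, e.g. a single host or path
    reason: str
    owner: str
    expires: date


EXCEPTIONS = [
    DetectionException("local_admin_logon", "host=BACKUP01", "scheduled backup job", "j.doe", date(2024, 6, 30)),
    DetectionException("password_spray", "*", "integration phase noise", "j.doe", date(2023, 1, 31)),
]

for exc in EXCEPTIONS:
    if exc.expires < date.today():
        print(f"Expired exception on '{exc.use_case}' owned by {exc.owner}: review or remove")
    if exc.scope == "*":
        print(f"Exception on '{exc.use_case}' matches everything: the use case is effectively disabled")
```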
Missing Use Case Testing
Far too often we come across use cases that have never been tested in the customer environment. They might work in the lab where they were developed, but for whatever reason they might fail to trigger elsewhere.
During the service acceptance test, each use case should be tested and verified. But don’t stop there. Ideally, use cases should be re-tested regularly, and if possible automatically, for as long as they are in your production environment.
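Such an automated test can be as simple as triggering a harmless test event and checking whether the expected alert shows up within a defined time window. In the sketch below, trigger_test_event() and fetch_alerts() are hypothetical placeholders for your own provider or SIEM integration; no real API is implied.

```python
# Minimal sketch of an automated use case test: trigger a benign test
# event and poll for the matching alert. The two helper functions are
# hypothetical stand-ins for your own integration.
import time


def trigger_test_event():
    """Placeholder: e.g. drop an EICAR file or run a benign flagged command."""


def fetch_alerts(since):
    """Placeholder: query your SIEM/provider API for alerts newer than 'since'."""
    return []


def test_use_case(name, timeout=600, poll=30):
    start = time.time()
    trigger_test_event()
    while time.time() - start < timeout:
        if any(a.get("rule") == name for a in fetch_alerts(since=start)):
            print(f"PASS: '{name}' fired within {int(time.time() - start)}s")
            return True
        time.sleep(poll)
    print(f"FAIL: '{name}' did not fire within {timeout}s")
    return False
```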
Unclear Documentation
In a Purple Team engagement, we often have a lot of questions about use cases, the general implementation, and so on. Unfortunately, it is sometimes difficult to get an informed answer. Either the right person is not available, or the answer is simply unknown and requires a configuration review and deeper investigation.
Having up-to-date and clear documentation of your implementation, use cases, exceptions, etc. will help you understand whether something is missing or unclear. Don’t put documentation off for too long.
Continuous Improvement
tl;dr:
Keep challenging your detection capabilities.
Key questions:
- Are current detection capabilities sufficient and do they cover changes made to your IT environment?
- How can your detection capabilities be improved?
- Do your existing detection capabilities still work as expected?
When the integration project is complete, and you have avoided all the pitfalls and completed the documentation, you can be proud of yourself! Well done!
Don’t take too much of a break, though. Your IT environment is about to change, be it through replacements, updates or the addition of new tools and other gadgets. Maybe you decide to get rid of a substantial number of on-premise systems and move your services into the cloud. Keeping up with these changes can be a challenge. The threat landscape will also change, and you will need to adapt to the new risks.
So getting into the habit of questioning your existing solution is a good thing and should be done regularly.
Final Thoughts
We hope this post has highlighted the most common pitfalls we encounter in our Purple Team exercises and how to avoid them. Running a service like this is no easy task, and there are many more things that can go wrong. Have you come across any of these on your journey? Or are we missing something important? Please leave a comment, we are always keen to discuss.
Authors
This article was co-written by Alex Joss and Felix Aeppli.