You probably think you can see more than you actually can.
That’s not a criticism—it’s just how modern environments work. The assumptions we built our mental models on (servers you own, networks you control, applications you can instrument however you want) don’t hold anymore. But we still operate like they do.
SaaS applications don’t give you the same visibility you’d have if you ran the application yourself. Cloud providers give you logs, but not necessarily the logs you need. Third-party integrations happen at the API layer where your network monitoring can’t see them. Serverless architectures create ephemeral compute that exists for seconds and then disappears.
And somehow we’re supposed to detect threats, investigate incidents, and demonstrate compliance in environments where half of what’s happening is invisible to us.
The gap between what vendors promise and what their APIs actually deliver is real. The difference between “we provide comprehensive logging” and “here’s what you can actually export and how much it costs” is often significant. And most organizations don’t discover this gap until they’re in the middle of an incident and realize they can’t answer basic questions about what happened.
The Old Model (It’s Gone)
Ten to fifteen years ago (roughly 2010-2015 for those reading this in the future), visibility was hard but at least it was straightforward. You owned the servers. They sat in your datacenter. You controlled the network. You could put whatever monitoring and logging you wanted on them.
Want to know what happened on a system? You had the logs. Want to capture network traffic? You owned the infrastructure. Want to instrument an application? You controlled the deployment.
The constraints were mostly technical and resource-based. Storage was expensive, so you couldn’t keep logs forever. Processing power was limited, so you couldn’t analyze everything in real-time. But in theory, if you had the resources, you could see everything that mattered.
That model is mostly dead now.
The New Reality (It’s Complicated)
Modern environments are a mix of SaaS applications you don’t control, cloud infrastructure you sort-of control, on-premises systems you fully control (but are increasingly a minority), and mobile/remote users connecting from everywhere.
Each piece has different visibility characteristics, different logging capabilities, different costs, different APIs, different limitations.
SaaS applications are particularly tricky. You get whatever logging the vendor decides to provide. Sometimes that’s comprehensive. Sometimes it’s basic audit logs that tell you who logged in but not much about what they did. Sometimes it costs extra. Sometimes it’s only available in higher-tier plans. Sometimes the data is there but the API to extract it is rate-limited or poorly documented.
You don’t control the infrastructure, so you can’t just install an agent or capture packets. You’re dependent on what the vendor exposes, and their priorities aren’t always your priorities.
Cloud providers give you a lot of visibility—if you know where to look and how to configure it. But it’s not automatic. CloudTrail in AWS doesn’t log data events by default. Azure Activity Logs don’t capture everything. GCP audit logs need to be configured per-service. And all of this generates massive volumes of data that cost money to store and process—much of it operational noise with limited security value. You’re often paying to retain logs that don’t help you detect or investigate incidents, while the events you actually need might require additional configuration or higher-tier services to capture.
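Here’s a rough sketch of what “deliberately build it” looks like in practice: checking whether a CloudTrail trail is actually configured to capture data events, using boto3. The trail name is a placeholder and the check is illustrative, not exhaustive.

```python
# A minimal sketch, assuming boto3 credentials are already configured.
# Checks whether an existing CloudTrail trail has any data event selectors,
# since data events (S3 object-level, Lambda invocations) are off by default.
import boto3

cloudtrail = boto3.client("cloudtrail")

def trail_logs_data_events(trail_name: str) -> bool:
    """Return True if the trail has at least one data event selector configured."""
    response = cloudtrail.get_event_selectors(TrailName=trail_name)
    # Trails may use classic EventSelectors or newer AdvancedEventSelectors.
    for selector in response.get("EventSelectors", []):
        if selector.get("DataResources"):
            return True
    for selector in response.get("AdvancedEventSelectors", []):
        fields = {f.get("Field"): f.get("Equals", []) for f in selector.get("FieldSelectors", [])}
        if "Data" in fields.get("eventCategory", []):
            return True
    return False

if __name__ == "__main__":
    # "my-org-trail" is a placeholder name for illustration.
    print(trail_logs_data_events("my-org-trail"))
```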
The visibility is there, but you have to deliberately build it. And you have to pay for it, which means someone has to approve that cost.
The Vendor Promise vs. Reality Gap
Here’s a conversation that happens constantly:
Security team: “We need comprehensive logging for [SaaS application].”
Vendor sales: “Absolutely, we take security very seriously. We provide full audit logging of all activities.”
[Six months later, during implementation]
Security team: “We’re ready to integrate your logs into our SIEM.”
Vendor support: “We don’t have a direct SIEM integration. You can manually export logs from the admin console. But don’t worry—if there’s ever an incident, we’ll help you get whatever logs you need.”
[Fourteen months later, during an incident]
Security team: “We need the access logs for this compromised account for the past 90 days.”
Vendor support: “Unfortunately we can’t provide those. Our logging infrastructure commingles customer data and we don’t have a way to filter and export just your logs. We can give you authentication events and admin actions, but detailed access logs aren’t available.”
This isn’t malicious. The vendor’s not lying, exactly—they do have logging. It’s just that what they consider “full audit logging,” what’s actually accessible to you, and what you need for security investigation are three different things.
And you don’t find out about the gap until you need the logs.
What You Can’t See (It’s More Than You Think)
In a typical modern environment, visibility gaps fall into several categories:
SaaS and vendor-controlled systems
Authentication visibility depends on whether you control it. If the application federates through your SSO, you can see logins in your identity provider logs. If it doesn’t, you’re dependent on what the vendor provides—which might be limited to their admin console with no way to export or integrate with your monitoring.
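As a rough sketch of what that identity-provider visibility looks like when you do control authentication: pulling recent sign-in events from Microsoft Entra ID via the Graph API. It assumes you already have an access token with the AuditLog.Read.All permission; token acquisition and paging are omitted.

```python
# A minimal sketch of pulling federated sign-in events from Microsoft Entra ID
# via the Graph API, assuming an OAuth access token with AuditLog.Read.All
# permission is already in hand (token acquisition is omitted here).
import requests

GRAPH_SIGNINS_URL = "https://graph.microsoft.com/v1.0/auditLogs/signIns"

def fetch_signins(access_token: str, top: int = 50) -> list[dict]:
    """Fetch a page of recent sign-in events for downstream SIEM ingestion."""
    headers = {"Authorization": f"Bearer {access_token}"}
    resp = requests.get(GRAPH_SIGNINS_URL, headers=headers, params={"$top": top}, timeout=30)
    resp.raise_for_status()
    return resp.json().get("value", [])

# Each record includes fields like userPrincipalName, appDisplayName, ipAddress,
# and status: enough to see who signed in to which federated app, and from where.
```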
Beyond authentication, detailed user activity—what documents they accessed, what data they downloaded, what API calls they made—is often not available, or only available at premium tiers, or only retained for short periods. And even when these logs exist, they’re often in proprietary formats with no API for automated export, or the API is rate-limited to the point of being useless for real-time monitoring. Manual exports from a web console aren’t a viable solution at scale.
API traffic between applications is similarly opaque. Service-to-service authentication, automated integrations, data exchanges—in an on-premises environment you could at least capture this with network taps, port mirroring, or API gateways and proxies that intercept and log traffic. In SaaS environments, you don’t control the underlying infrastructure, so that’s not an option. You’re entirely dependent on application-level logging that the vendor may or may not provide. The major providers like Microsoft 365 might expose some API activity logs. Most mid-tier and startup SaaS vendors don’t expose these logs to customers at all.
Cloud infrastructure
Someone spins up an EC2 instance, uses it for a few hours, and terminates it. If you’re not capturing those events in real-time and you’re not paying to retain them long-term, they might as well have never happened. Ephemeral resources come and go, and if your logging isn’t configured to catch them, you have no record they existed.
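A minimal sketch of catching those launch and terminate events while they still exist in CloudTrail’s built-in 90-day event history, assuming boto3 credentials are configured. Once those events age out of the window (or were never shipped somewhere durable), this query returns nothing.

```python
# A minimal sketch: query CloudTrail's 90-day event history for instance
# launches and terminations. Useful only while the events are still retained,
# or if you ship them somewhere durable as they happen.
import boto3
from datetime import datetime, timedelta, timezone

cloudtrail = boto3.client("cloudtrail")

def recent_instance_events(hours: int = 24) -> list[dict]:
    """Return RunInstances and TerminateInstances events from the last N hours."""
    start = datetime.now(timezone.utc) - timedelta(hours=hours)
    events = []
    for event_name in ("RunInstances", "TerminateInstances"):
        paginator = cloudtrail.get_paginator("lookup_events")
        for page in paginator.paginate(
            LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": event_name}],
            StartTime=start,
        ):
            events.extend(page.get("Events", []))
    return events
```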
Shadow IT
Departments using applications that IT doesn’t know about. Data being stored in places that aren’t sanctioned. Integrations being set up by business users who don’t think about security implications. By definition, you can’t monitor what you don’t know exists. This is often SaaS applications purchased with credit cards, bypassing procurement and IT approval entirely.
Unmanaged and BYOD endpoints
Users accessing resources from personal devices, from home networks, from coffee shops. You might see the authentication, but the endpoint visibility you’d have on a corporate-managed device isn’t there.
MDM solutions like Intune can give you some visibility—device compliance status, patch levels, whether antivirus is running. But can you get forensically useful logs? Can you see process execution, network connections, file access? Not in the same way you could with a corporate-managed endpoint running full EDR. You know the device met your baseline requirements when it connected, but you don’t have the detailed telemetry you’d need to investigate suspicious activity.
Encrypted traffic
SSL/TLS everywhere is good for security. It’s terrible for visibility if you’re trying to inspect traffic.
TLS interception is becoming more common and easier to implement, but it’s not comprehensive. Some traffic has to be excluded—applications using certificate pinning, mutual TLS authentication, or endpoints that break when you try to intercept. Medical devices, some IoT, certain vendor integrations—these often can’t tolerate interception without breaking functionality. And even when interception is technically possible, the operational overhead of managing it (certificate distribution, exclusion lists, troubleshooting broken applications) means you’re making trade-offs about what you actually inspect.
You get visibility into some encrypted traffic, but you’re still trusting a lot of it.
The Cost Problem
Logging isn’t free, and comprehensive logging is expensive.
Cloud providers charge for log storage. SIEM vendors charge per GB ingested. Analysis tools charge for compute. Every log source you add, every event you capture, every day of retention you want—it all costs money.
And in cloud environments, the costs compound. If you’re in AWS sending logs to a cloud-based SIEM: you’re charged to generate the logs, charged for egress to transmit them, charged per GB to ingest them into your SIEM, and charged to store them—often in multiple places. You might not realize a log source isn’t security-relevant until months in, when you finally get around to normalization and building use cases, only to discover you’ve been paying to collect noise.
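A back-of-the-envelope sketch of how those per-GB charges compound. Every rate below is a made-up placeholder, not anyone’s actual pricing.

```python
# A back-of-the-envelope sketch of how per-GB charges stack up along the
# pipeline. All rates below are made-up placeholders -- substitute your
# provider's and SIEM vendor's actual pricing.
DAILY_LOG_GB = 50                     # raw log volume generated per day
RATES_PER_GB = {
    "egress_from_cloud": 0.09,        # transmitting logs out of the provider
    "siem_ingestion": 1.50,           # per-GB ingest pricing
    "siem_hot_storage_90d": 0.30,     # hot, queryable retention
    "archive_storage_1y": 0.05,       # cheap long-term copy
}

daily_cost = DAILY_LOG_GB * sum(RATES_PER_GB.values())
print(f"~${daily_cost:,.2f}/day, ~${daily_cost * 365:,.0f}/year")
# If a third of that volume turns out to be operational noise with no security
# value, roughly a third of the bill bought you nothing.
```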
So you make trade-offs. You log authentication events but not every API call. You capture critical system changes but not routine operations. You retain logs for 90 days instead of a year because that’s what the budget allows.
These are reasonable decisions based on real constraints. But they create gaps, and you need to understand what those gaps are.
The other cost is operational. More logs mean more noise. More alerts. More false positives. More time spent tuning and managing the logging infrastructure instead of actually using the data.
There’s a balance between “log everything” (expensive, noisy, often impractical) and “log nothing” (cheap, quiet, useless). Finding that balance requires understanding what you actually need versus what would be nice to have.
What You Actually Need (It’s Less Than Everything)
You can’t log everything. But you can make intelligent choices about what matters most.
Take Windows event logs as an example. The out-of-the-box configuration gives you some useful security events, but other valuable events aren’t enabled by default. And if you just ingest all of the Application, System, and Security logs without filtering, you’ll be buried in operational noise with no security value. There are guides for what to collect, but actually implementing that filtering—especially depending on your SIEM’s capabilities—takes time, expertise, and persistence. You have to keep asking: do I need this? Is this useful? It’s not a one-time configuration; it’s ongoing refinement.
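As a sketch of what that filtering decision looks like (Python here, though in practice it usually lives in your forwarder or SIEM configuration): keep a curated set of security-relevant event IDs and drop the rest. The ID list is a starting point, not a standard.

```python
# A minimal sketch of the "do I need this?" filter, assuming events have
# already been parsed into dicts carrying a Windows event ID. The ID list is
# a starting point (logons, explicit credential use, special privileges,
# account/group changes, log clearing), not a definitive standard.
SECURITY_RELEVANT_IDS = {
    4624, 4625,          # successful / failed logon
    4648,                # logon with explicit credentials
    4672,                # special privileges assigned (admin-level logon)
    4720, 4726,          # user account created / deleted
    4728, 4732, 4756,    # member added to a security-enabled group
    1102,                # audit log cleared
}

def keep_event(event: dict) -> bool:
    """Forward only events worth paying to ingest."""
    return event.get("EventID") in SECURITY_RELEVANT_IDS
```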
Core log categories worth prioritizing:
Authentication and authorization events. Who logged in, when, from where. Successful and failed attempts. Privilege changes. Access grants and revocations. This is foundational—you need to know who did what, and that starts with knowing who was authenticated.
Administrative actions. Changes to configurations, policies, permissions. Creation and deletion of resources. Anything that modifies the security posture or operational state. These are high-value events that should basically always be logged.
Access to sensitive data. If you have data that’s particularly valuable or regulated, you need to know who accessed it. This is harder in SaaS environments, but it’s worth fighting for.
Security-relevant events. Firewall blocks. IDS/IPS alerts. Antivirus detections. Authentication failures beyond normal thresholds. Things that might indicate compromise or attack.
Change tracking for critical systems. What changed, when, and who changed it. For production systems, for security infrastructure, for anything where unauthorized changes could cause serious problems.
You don’t necessarily need to log every single read operation in a database. You don’t need verbose debugging output from every application. You don’t need to capture every DNS query (unless you’re doing DNS-based detection, which is legitimate but specialized).
Figure out what your crown jewels are and what your riskiest systems are—make sure you have good visibility for those. Everything else is nice-to-have, and you prioritize based on resources.
The Detection Problem
Logs only matter if you actually use them. And using them means having some way to detect anomalies, threats, or policy violations.
If you’re collecting logs but nobody’s looking at them except during incidents, you’re doing expensive archival, not security monitoring.
But here’s the thing: archival has value too.
Commercial aircraft carry flight data recorders—black boxes that continuously record flight parameters, cockpit audio, and system status. Nobody monitors this data in real-time. The recorder just captures everything. But when something goes wrong, investigators use that data to reconstruct exactly what happened, second by second. Without it, you’re left guessing.
Your SIEM serves the same purpose. Even if you’re not actively monitoring every event in real-time, having that forensic capability when you need to investigate an incident is critical. Understanding what happened, how, and when depends on having those logs available. So don’t dismiss log collection just because you’re not building sophisticated detection on top of it yet.
Detection requires either rules (if X happens, alert) or baselines (if this deviates from normal, alert) or threat intelligence (if we see indicators of known-bad, alert). All of these require effort to build and maintain.
Rules need to be tuned. What generates alerts in your environment? What’s normal activity that looks suspicious? What’s actually suspicious but happens rarely enough that you don’t have good patterns for it? This tuning process never ends.
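A minimal sketch of one such rule, a failed-login threshold per account. The threshold and window are exactly the values that need tuning for your environment.

```python
# A minimal sketch of a threshold rule: alert when one account racks up too
# many failed logins in a short window. THRESHOLD and WINDOW are the knobs
# that need tuning per environment.
from collections import defaultdict, deque
from datetime import timedelta

WINDOW = timedelta(minutes=10)
THRESHOLD = 10

failed_by_user = defaultdict(deque)  # username -> timestamps of recent failures

def process_failed_login(username: str, timestamp) -> bool:
    """Return True if this failure pushes the account over the alert threshold."""
    window = failed_by_user[username]
    window.append(timestamp)
    # Drop failures that have aged out of the rolling window.
    while window and timestamp - window[0] > WINDOW:
        window.popleft()
    return len(window) >= THRESHOLD
```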
Some MDR providers promise to handle this for you, but be careful about the black box approach. If they’re running generic rulesets across all their customers without understanding your specific environment, you’ll get alerts that don’t make sense for your context and miss things that matter in your particular setup. Worse, if your environment is slightly non-standard—a configuration setting that’s just a bit different, a logging format that’s slightly off—their rules might never trigger even when there’s a real problem. The alert they expect to see doesn’t fire because your logs don’t match their assumptions. Effective detection requires knowing your environment, and outsourcing doesn’t eliminate that requirement—it just shifts who needs to know it.
Baselines require understanding what normal looks like, which means having enough data to establish patterns, which means you need to have been logging long enough to know what normal is. And normal changes over time, so baselines drift.
This gets complicated with fragmented identity visibility. If you’re not ingesting and correlating all authentication paths—on-prem AD, Azure AD/Entra, M365, federated SaaS applications—you can’t establish an accurate baseline for how an identity actually behaves. You might see AD authentications and M365 authentications as separate, unrelated streams. Analyzed independently, neither looks unusual. But if you could see them together, the pattern would be obvious: this user never logs into M365 from that location, or this authentication sequence doesn’t match their normal workflow. Hybrid and cloud identities make baseline detection harder because the full picture is spread across multiple log sources that need to be normalized and correlated.
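A rough sketch of the correlation step: normalizing authentication events from multiple sources into one per-identity view before you try to baseline anything. The field names and mapping are assumptions; real sources disagree wildly on schema.

```python
# A rough sketch of stitching authentication events from multiple sources
# (on-prem AD, Entra ID, federated SaaS) into one per-identity view before
# baselining. The field names in normalize() are assumptions, not a schema.
from collections import defaultdict

def normalize(event: dict, source: str) -> dict:
    """Map each source's schema onto a common shape keyed by identity."""
    return {
        "identity": event.get("userPrincipalName") or event.get("TargetUserName"),
        "source": source,
        "ip": event.get("ipAddress") or event.get("IpAddress"),
        "timestamp": event.get("createdDateTime") or event.get("TimeCreated"),
    }

def merge_auth_streams(streams: dict[str, list[dict]]) -> dict[str, list[dict]]:
    """streams: {'ad': [...], 'entra': [...], 'saas': [...]} -> events per identity."""
    per_identity = defaultdict(list)
    for source, events in streams.items():
        for event in events:
            norm = normalize(event, source)
            if norm["identity"]:
                per_identity[norm["identity"].lower()].append(norm)
    return per_identity

# Only with the merged view can you ask "does this user ever authenticate to
# M365 from this location?" rather than baselining each stream in isolation.
```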
Threat intelligence helps with known threats but doesn’t help with novel attacks or insider activity or misconfigurations that create risk without being actively malicious.
And not all threat intelligence is created equal. There are excellent open-source and commercial threat intel sources. There are also terrible ones. I’ve seen threat feeds that included legitimate IPs because someone submitted indicators from a phishing email without filtering out the spoofed sender addresses and legitimate URLs the attacker included to make the email look real. I’ve seen feeds where one bad submission poisoned the entire aggregator. If you just turn on every available threat feed without validation, you’ll either block legitimate traffic or drown in so many false positives that you stop using your SIEM because it’s too noisy.
Threat intel requires curation. You need to understand the source, validate the indicators make sense, and monitor for false positives.
And think carefully about how you use it. The most effective approach isn’t using threat feeds to generate alerts or actively block traffic—that’s where bad intel causes the most damage. Instead, use threat intelligence to influence risk scoring. An authentication from an IP that appears in multiple trusted threat feeds gets weighted higher in your risk calculation. A file hash match adds context to other suspicious behaviors. The key is weighting based on source quality and combining threat intel with other correlations and detections you’ve built. Threat feeds are contextual enrichment, not triggers for action.
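A minimal sketch of that approach: feed hits contribute weighted points to a risk score instead of triggering blocks. The feed names and weights are illustrative, not recommendations.

```python
# A minimal sketch of using threat intel for risk scoring rather than blocking.
# Feed names and weights are illustrative assumptions; weight by how much you
# trust each source, and let the score add context to other detections.
FEED_WEIGHTS = {
    "internal_blocklist": 40,   # curated by your own team, high trust
    "commercial_feed_a": 25,
    "open_source_feed_b": 10,   # noisier source, weighted accordingly
}

def threat_intel_score(indicator_hits: dict[str, bool]) -> int:
    """Sum the weights of the feeds that flagged this indicator, capped at 100."""
    score = sum(weight for feed, weight in FEED_WEIGHTS.items() if indicator_hits.get(feed))
    return min(score, 100)

# Example: a source IP seen in two feeds contributes 65 points of context to
# that authentication event's overall risk score -- it doesn't block anything.
print(threat_intel_score({"internal_blocklist": True, "commercial_feed_a": True}))
```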
The detection problem is honestly harder than the logging problem in many ways. Logs are just data. Detection is turning that data into actionable information, and that’s where a lot of organizations struggle.
The Retention Question
How long do you keep logs? It depends on what you’re trying to accomplish.
For incident investigation, you need logs that go back far enough to reconstruct what happened. Breaches often aren’t discovered immediately. The median dwell time (time between initial compromise and detection) is measured in weeks or months. If you’re only retaining logs for 30 days, you might miss the early indicators entirely.
But there’s a practical limit to how far back logs remain useful for investigation. Within 30 days, you can definitively say what happened based on logs. Within 90 days, same—maybe with a bit more uncertainty as context fades. Six months or more? Things get fuzzy. Systems have been patched or reconfigured. People who made decisions have left. The environment has changed enough that you’re making educated guesses rather than definitive statements. Keeping logs forever isn’t always useful—there’s a point where the forensic value diminishes because the context that makes them interpretable is gone.
For compliance, you need whatever the regulation or framework requires. PCI-DSS wants a year. Some frameworks want more. This is non-negotiable if you’re in scope.
For threat hunting, you need historical data to identify patterns over time. “Show me all authentication attempts from this IP over the last six months” isn’t a question you can answer if you only have 90 days of logs.
But retention is expensive. A year of detailed logs for a medium-sized environment can run into serious money. So you make decisions: keep authentication logs for a year, keep detailed application logs for 90 days, keep packet captures for a week.
Hot storage (fast, queryable, expensive) versus cold storage (cheap, slower to retrieve but still readily available, good enough for compliance) versus archival/glacial storage (AWS Glacier, Azure Archive tier—extremely cheap, but retrieval takes hours and costs money). Different retention periods for different log types based on their investigative value.
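As a sketch of what per-log-type tiering can look like, here’s an S3 lifecycle configuration via boto3 that keeps authentication logs for a year (archiving after 90 days) and expires detailed application logs at 90 days. The bucket name, prefixes, and day counts are placeholders.

```python
# A minimal sketch of per-log-type tiering in S3, assuming logs land under
# prefixes like auth/ and app/. Bucket name, prefixes, and day counts are
# placeholders for illustration.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-security-logs",
    LifecycleConfiguration={
        "Rules": [
            {   # Authentication logs: hot for 90 days, archived to one year, then deleted.
                "ID": "auth-logs",
                "Filter": {"Prefix": "auth/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            },
            {   # Detailed application logs: kept 90 days, never archived.
                "ID": "app-logs",
                "Filter": {"Prefix": "app/"},
                "Status": "Enabled",
                "Expiration": {"Days": 90},
            },
        ]
    },
)
```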
If you’re using archival storage, test the restoration process before you need it. How long does it take to retrieve? What does it cost? Can you query it in place or do you have to restore to hot storage first? Does it require a different interface or query language? Learning a completely new system during an active forensic investigation is miserable. Know how it works ahead of time.
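A minimal sketch of rehearsing that restore ahead of time, assuming logs archived to a Glacier storage class in S3. The bucket and key are placeholders; retrieval time and cost depend on the tier you choose.

```python
# A minimal sketch of rehearsing an archive restore before an incident forces
# the issue. Bucket and key are placeholders; restores from Glacier-class
# storage take hours (tier-dependent) and incur retrieval charges.
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "example-security-logs", "auth/2024/06/01/events.json.gz"

# Kick off a restore that keeps the temporary copy available for 2 days.
s3.restore_object(
    Bucket=BUCKET,
    Key=KEY,
    RestoreRequest={"Days": 2, "GlacierJobParameters": {"Tier": "Standard"}},
)

# Later, poll the object's Restore metadata to see whether the copy is ready.
status = s3.head_object(Bucket=BUCKET, Key=KEY).get("Restore", "")
print("ready" if 'ongoing-request="false"' in status else "still restoring")
```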
There’s no one right answer, but there are wrong answers. Thirty days is probably too short for most purposes. Seven years is probably excessive unless you have specific regulatory requirements.
Blind Spots You Didn’t Know You Had
The dangerous thing about visibility gaps is that you often don’t know they exist until you need the data and discover it’s not there.
You think you’re logging all administrative changes, but it turns out changes made through a particular API endpoint aren’t captured. You think you’re monitoring file access, but only on the primary file server, not the secondary one that got set up six months ago. You think you’re capturing authentication events, but only for interactive logins, not for service account activity.
The time to discover these gaps is not during an incident.
Periodically test your visibility. Run tabletop exercises where you simulate an incident and walk through what logs you’d need. Can you answer basic investigative questions? If not, you’ve found a gap.
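One way to make that test routine is a simple freshness check: for every log source you believe you have, confirm that events actually arrived recently. The last_event_time() lookup below is a hypothetical stand-in for however you query your SIEM, and the source names and tolerances are examples.

```python
# A minimal sketch of a recurring visibility check. last_event_time() is a
# hypothetical stand-in for a saved search or API call that returns the
# newest event timestamp your SIEM has for a given source.
from datetime import datetime, timedelta, timezone

EXPECTED_SOURCES = {
    # source name -> maximum tolerable silence before it counts as a gap
    "entra_signin_logs": timedelta(hours=1),
    "cloudtrail_management": timedelta(hours=1),
    "windows_security_events": timedelta(hours=6),
    "saas_vendor_audit_export": timedelta(days=1),
}

def find_visibility_gaps(last_event_time) -> list[str]:
    """Return the sources whose most recent event is older than allowed."""
    now = datetime.now(timezone.utc)
    gaps = []
    for source, max_silence in EXPECTED_SOURCES.items():
        newest = last_event_time(source)  # hypothetical SIEM lookup
        if newest is None or now - newest > max_silence:
            gaps.append(source)
    return gaps
```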
Review your logging configurations regularly. Environments change. New resources get added. Vendors update their logging capabilities (sometimes removing features, sometimes adding them). What was true six months ago might not be true now.
Talk to vendors about their logging roadmap. If there’s a capability gap that matters to you, raise it. Sometimes vendors don’t prioritize logging features because customers don’t ask for them. Be the customer who asks.
Building Visibility Deliberately
You can’t fix everything at once, but you can make deliberate progress.
Start with authentication. If you can’t see who’s logging in and from where, you’re operating blind. This should be foundational.
Add administrative activity logging. Changes to security configurations, user permissions, infrastructure. This is high-value data that’s usually feasible to capture.
Layer in access to sensitive data when possible. This is harder in SaaS environments, but for systems you control, implement it. For SaaS, push vendors to provide it or find compensating controls.
Build detection gradually. Don’t try to alert on everything at once. Pick a few high-value detection use cases and implement those well. Then add more over time.
Document what you can see and what you can’t. Be explicit about blind spots. This helps with risk discussions and helps prioritize what to fix next.
And accept that visibility will never be complete. But better visibility than you had last quarter is still progress.
Practical Takeaways
Understand what logging your SaaS vendors actually provide before you commit to them. “Comprehensive audit logging” means different things to different vendors.
Cloud (IaaS/PaaS) environments require deliberate configuration. Logging isn’t automatic. Know what you need to enable and what it costs.
Prioritize authentication, authorization, and administrative actions. These are foundational and usually feasible to capture.
Test your visibility periodically. Simulate incidents and see if you can answer investigative questions with the logs you have.
Document blind spots explicitly. Knowing what you can’t see is itself valuable information.
Balance retention costs against investigative and compliance needs. Not everything needs to be kept forever, but 30 days is usually too short.
Build detection incrementally. Start with high-value use cases and expand over time rather than trying to alert on everything at once.
Visibility is expensive and imperfect, but deliberate investment in the right areas makes a real difference when you need it.