
Splunk + Snowflake: Building a Hybrid Security Data Lake with Federated Queries


You can find all the code and demo scripts for this article on GitHub.


Better Together: Why Splunk + Snowflake

Splunk is the gold standard for real-time security monitoring. Its detection engine, Enterprise Security’s 1,400+ pre-built correlation searches, threat intelligence framework, and sub-second alerting are unmatched. Security teams rely on Splunk because it works — and it keeps getting better.

But as organizations grow, so does log volume. And not every log source needs sub-second alerting. Web access logs, DNS queries, cloud audit trails, netflow data — these are invaluable for investigations and compliance, but they don’t all need to live in the real-time detection tier.

This is where the Splunk + Snowflake partnership shines. Snowflake provides a complementary tier for cost-effective, long-term, queryable storage. Splunk and Snowflake recognized this together — it’s a better together story, not a replacement story.

The architecture is straightforward: send everything to Snowflake for deep storage and analytics. Send the critical, time-sensitive subset to Splunk for real-time detection. When Splunk analysts need historical context, they query Snowflake directly via federated search — without ever leaving the Splunk interface.

You keep Splunk’s world-class detection capabilities. You keep all your logs for years. And you cut your overall security spend by 70-80%.


Architecture

                         ┌─────────────────────────────┐
                         │         Log Sources         │
                         │  (EDR, Firewall, DNS, WAF,  │
                         │   Cloud Audit, Web Access,  │
                         │   Netflow, Auth Logs)       │
                         └──────────────┬──────────────┘
                                        │
                         ┌──────────────▼──────────────┐
                         │        Routing Layer        │
                         │  (Vector / Kafka / Cribl /  │
                         │         Fluent Bit)         │
                         └──────┬──────────────┬───────┘
                                │              │
                     Critical   │              │  ALL logs
                    (20-30%)    │              │  (100%)
                                ▼              ▼
                 ┌──────────────────┐  ┌──────────────────┐
                 │      Splunk      │  │    Snowflake     │
                 │                  │  │                  │
                 │  Real-time       │  │  Historical      │
                 │  detection &     │  │  investigation,  │
                 │  alerting        │  │  compliance,     │
                 │                  │  │  threat hunting  │
                 │  ~$150-200/GB/d  │  │  ~$23-40/TB/mo   │
                 └────────┬─────────┘  └───────▲──────────┘
                          │                    │
                          │     DB Connect     │
                          │     (dbxquery)     │
                          └────────────────────┘
                          Federated Search:
                          SQL runs on Snowflake compute,
                          only results return to Splunk

The routing layer is the key component. Tools like Vector, Cribl, Kafka, or Fluent Bit sit between your log sources and your destinations. They evaluate each event and route it to one or both destinations based on rules you define.

All logs flow to Snowflake — no exceptions. Only the critical subset flows to Splunk. This means Snowflake becomes your system of record, and Splunk becomes your real-time detection engine.
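The routing rule is simple enough to express as a function. This is a sketch of the logic only — a real deployment would express it in Vector, Cribl, or Kafka configuration, and the field names, source list, and severity values here are hypothetical:

```python
# Hypothetical routing rule mirroring the split described above:
# every event goes to Snowflake; only critical sources or severities
# are duplicated to Splunk. Field names are illustrative.

CRITICAL_SOURCES = {"edr", "auth", "firewall", "dlp"}
CRITICAL_SEVERITIES = {"high", "critical"}

def route(event: dict) -> set[str]:
    """Return the set of destinations for one log event."""
    destinations = {"snowflake"}  # Snowflake always gets everything
    if (event.get("source") in CRITICAL_SOURCES
            or event.get("severity") in CRITICAL_SEVERITIES):
        destinations.add("splunk")
    return destinations

if __name__ == "__main__":
    print(route({"source": "edr", "severity": "high"}))
    print(route({"source": "web_access", "severity": "info"}))
```

The important property is that "splunk" is only ever *added* to the destination set — no rule can ever remove "snowflake", which is how the system-of-record guarantee is enforced.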


Cost Comparison

Here is what the numbers look like for a typical 100 GB/day ingestion scenario:

| | Splunk-Only | Hybrid (Splunk + Snowflake) |
|---|---|---|
| Daily ingestion | 100 GB to Splunk | 20-30 GB to Splunk, 100 GB to Snowflake |
| Splunk license cost | ~$15,000-20,000/mo | ~$3,000-6,000/mo |
| Snowflake storage | | ~$70-120/mo (~3 TB/mo) |
| Snowflake compute | | ~$200-500/mo (ad-hoc queries) |
| Total monthly cost | ~$15,000-20,000 | ~$3,500-6,500 |
| Retention | 30-90 days (typical) | 30 days Splunk + years in Snowflake |
| Cost reduction | | 70-80% |

The savings come from routing: Snowflake storage costs ~$23-40/TB/month with no ingestion fee, while Splunk’s value is in real-time detection where it commands a premium. By routing only the 20-30% of data that needs sub-second alerting to Splunk, you optimize spend for both platforms.
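The 70-80% figure can be sanity-checked with a few lines, using the midpoints of the figures above. The flat Snowflake cost of ~$600/mo is an assumption taken from the middle of the storage + compute ranges in the table:

```python
def hybrid_monthly_cost(splunk_only_cost: float, splunk_share: float,
                        snowflake_flat: float = 600.0) -> float:
    """Rough hybrid cost model: Splunk license scales ~linearly with
    ingested volume; Snowflake storage + compute is treated as a
    comparatively small flat cost (assumed midpoint, not a quote)."""
    return splunk_only_cost * splunk_share + snowflake_flat

splunk_only = 17_500  # midpoint of ~$15,000-20,000/mo
hybrid = hybrid_monthly_cost(splunk_only, splunk_share=0.25)
savings = 1 - hybrid / splunk_only
print(f"hybrid ~${hybrid:,.0f}/mo, savings ~{savings:.0%}")
```

With a 25% Splunk share this lands around $5,000/mo — inside the ~$3,500-6,500 band from the table, and roughly a 72% reduction.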

Detailed Snowflake Cost Breakdown

| Component | Monthly Cost | Notes |
|---|---|---|
| Storage (100 GB/day × 30 days = 3 TB) | ~$70-120 | At $23-40/TB/month |
| Compute: Dynamic Tables pipeline (Small warehouse, ~8 hrs effective/day) | ~$250-500 | Auto-suspends between refreshes |
| Compute: Splunk DB Connect queries (XS warehouse, ~2 hrs/day) | ~$60-120 | On-demand for federated queries |
| Compute: Threat hunting (Medium warehouse, ~10 hrs/month) | ~$40-80 | Ad-hoc analyst queries |
| Snowpipe ingestion | ~$50-100 | Serverless, pay per file |
| Total Snowflake | ~$500-900/mo | For 100 GB/day with active detection |

Multi-Year Retention — The Compliance Multiplier

Compliance frameworks (GDPR, DORA, PCI DSS, HIPAA) often require 1-7 years of log retention. This is where the hybrid model delivers the biggest financial impact:

| Retention | Splunk-Only (100 GB/day) | Snowflake Storage |
|---|---|---|
| 90 days (3 TB) | Included in license | ~$70-120/mo |
| 1 year (36 TB) | Significantly increases license | ~$830-1,440/mo |
| 2 years (73 TB) | Often cost-prohibitive | ~$1,680-2,920/mo |
| 5 years (182 TB) | Not practical for most orgs | ~$4,200-7,300/mo |

With Snowflake, keeping 5 years of security logs is economically feasible — and those logs are queryable at any time, either directly or via Splunk’s federated search.
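The rows in the retention table are simple multiplication — monthly storage cost is volume times the per-TB rate:

```python
def snowflake_storage_cost(gb_per_day: float, retention_days: int,
                           price_per_tb_month: float = 23.0) -> float:
    """Monthly storage bill once the retention window is full,
    at the low ($23/TB/month) end of the quoted range."""
    total_tb = gb_per_day * retention_days / 1000
    return total_tb * price_per_tb_month

# 5 years at 100 GB/day: ~182 TB retained
print(f"${snowflake_storage_cost(100, 5 * 365):,.0f}/mo")
```

At the $23/TB end this reproduces the ~$4,200/mo figure for 5 years; plugging in $40/TB gives the upper end of each range.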

Snowflake Storage Lifecycle Policies — Automatic Tiering

Since November 2025, Snowflake offers Storage Lifecycle Policies (GA) that automatically move aging data to lower-cost storage tiers. This further reduces the cost of long-term security log retention:

| Tier | Access Pattern | Cost | Availability |
|---|---|---|---|
| Standard | Frequently accessed (real-time queries, dashboards) | ~$23-40/TB/month | All clouds |
| COOL | Infrequently accessed (quarterly investigations, audits) | Significantly lower than standard | AWS + Azure |
| COLD | Rarely accessed (annual compliance, legal hold) | Lowest cost tier | AWS only |

Storage lifecycle policies are schema-level objects that archive or expire rows based on conditions you define — for example, move security logs older than 90 days to COOL, and logs older than 1 year to COLD:

-- Example: Tiered retention for security logs
CREATE STORAGE LIFECYCLE POLICY security_tiering
    ON TABLE security_data.access_logs
    ARCHIVE_TIER = 'COOL'
    ARCHIVE_AFTER_DAYS = 90
    EXPIRE_AFTER_DAYS = 730;  -- 2 years total retention

This means your first 90 days of security data are in standard (hot) storage for fast queries and detection pipelines. After 90 days, data automatically moves to COOL storage at reduced cost — still queryable but optimized for less frequent access. The data expires after 2 years (or whatever your compliance framework requires).
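To estimate what tiering saves, a blended-cost sketch helps. Note that the COOL price below is a placeholder assumption — the tier table above only says it is "significantly lower than standard":

```python
def tiered_storage_cost(gb_per_day: float, hot_days: int, total_days: int,
                        hot_price: float = 30.0,
                        cool_price: float = 10.0) -> float:
    """Blended monthly storage cost once the retention window is full.
    cool_price is a hypothetical figure, NOT a published Snowflake rate."""
    hot_tb = gb_per_day * hot_days / 1000
    cool_tb = gb_per_day * (total_days - hot_days) / 1000
    return hot_tb * hot_price + cool_tb * cool_price

# 2-year retention at 100 GB/day: 90 days hot, remainder in COOL
print(round(tiered_storage_cost(100, 90, 730)))
```

Under these assumptions, 2 years of retention costs roughly $910/mo blended versus ~$2,190/mo if everything stayed in standard storage at the same $30/TB rate — the bulk of a multi-year archive spends most of its life in the cheap tier.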

For security data lakes with multi-year retention, combining storage lifecycle policies with Apache Iceberg tables provides even more flexibility — Iceberg tables store data in your own cloud storage (S3, ADLS, GCS) while remaining fully queryable through Snowflake.



What Goes Where: Data Routing Strategy

The routing decision comes down to one question: does this event need sub-minute detection and alerting?

Splunk (Critical — 20-30% of volume)

| Log Type | Why It Needs Splunk |
|---|---|
| EDR / endpoint alerts | Real-time threat detection, kill chain correlation |
| Authentication failures | Brute force detection, account lockout alerting |
| Firewall denies / blocks | Active attack detection, immediate response |
| High-severity SIEM alerts | Triage and escalation workflows |
| Privilege escalation events | Immediate investigation required |
| DLP alerts | Data exfiltration response |

Snowflake (Everything — 100% of volume)

| Log Type | Why It Goes to Snowflake |
|---|---|
| Web access logs | High volume, needed for historical investigation |
| DNS query logs | Threat hunting, C2 detection over time |
| Netflow / traffic metadata | Forensics, lateral movement analysis |
| Cloud audit trails (AWS CloudTrail, Azure Activity, GCP Audit) | Compliance, long-term investigation |
| Email gateway logs | Phishing campaign correlation |
| Proxy / URL filtering logs | Historical threat hunting |
| All of the above critical logs too | System of record, backup, long-term retention |

The critical point: Snowflake gets everything. It is not an either/or. The routing layer duplicates critical events to both destinations. Snowflake is always the complete record.


Feature Comparison

Each platform brings distinct strengths. The key is using each for what it does best.

| Capability | Splunk | Snowflake | Notes |
|---|---|---|---|
| Real-time alerting (< 1 min) | Excellent | Improving | Splunk excels here; Snowflake Dynamic Tables target ~1 min lag, triggered tasks can go sub-minute |
| SPL query language | Native | N/A | SPL is purpose-built for security; SQL requires different patterns but is more widely known |
| Splunk ES (SIEM framework) | Full support | Complementary | Notable events, risk-based alerting, adaptive response — Splunk-native. Snowflake enriches, not replaces |
| Threat intelligence | Built-in (ES), 9 intel collections | Marketplace (growing) | Splunk ES threat intel framework is mature; Snowflake Marketplace adds Criminal IP, Cybersixgill, and others |
| Historical investigation | Good (within retention window) | Excellent | Snowflake’s strength — cost-effective queries across years of data |
| Ad-hoc SQL analytics | SPL-based | Full SQL | Snowflake’s SQL engine excels at complex analytical queries and joins |
| Multi-year retention | Not its primary use case | Purpose-built | ~$23-40/TB/month in Snowflake |
| Data sharing / marketplace | Splunkbase ecosystem | Snowflake Marketplace | Both have growing ecosystems; different strengths |
| ML / anomaly detection | MLTK, AI Toolkit, DSDL | Snowpark ML, Cortex AI | Splunk expanded ML with AI Toolkit and Deep Learning (DSDL); Snowflake offers Cortex LLMs and Snowpark ML |
| Compliance reporting | Good for event-based | Excellent for structured | SQL + BI tools complement Splunk’s event-based compliance monitoring |
| Dashboards | Purpose-built SOC dashboards | Streamlit, BI tools | Splunk is the standard for real-time SOC operations dashboards |
| Native cross-platform search | Federated Search for Snowflake (2025) | N/A | Splunk announced native Federated Search for Snowflake — query Snowflake data directly from SPL without DB Connect |

The takeaway: Splunk Enterprise Security’s notable events, risk-based alerting, and adaptive response actions are core to any SOC that runs Splunk. The hybrid approach does not replace Splunk ES — it reduces the volume of data that needs to flow through the real-time tier, while making all historical data accessible to Splunk analysts via federated search.

New in 2025: Splunk announced Federated Search for Snowflake — a native integration that lets analysts query Snowflake data directly from the Splunk UI using SPL-like syntax, without DB Connect. This deepens the partnership and makes the hybrid architecture even more seamless.


Setting Up Splunk DB Connect with Snowflake

Before running federated queries, you need to connect Splunk Cloud to Snowflake via DB Connect. Here’s the step-by-step setup:

Step 1: Install DB Connect and JDBC Driver

Install Splunk DB Connect (v3.x or 4.x) from Splunkbase. Then install the Snowflake JDBC driver:

| | Splunk Cloud | Splunk On-Premise |
|---|---|---|
| DB Connect install | Install via Splunk Web → Apps | Install via Splunk Web or CLI |
| JDBC driver | Pre-packaged as a Splunk add-on. During DB Connect connection setup, install the JDBC Driver for Snowflake add-on directly from within DB Connect — no manual download needed. Alternatively, install it from Splunkbase as an app. | Download the Snowflake JDBC JAR (v3.14.x+) from Maven and place it in $SPLUNK_HOME/etc/apps/splunk_app_db_connect/drivers/, then restart Splunk. |
| JRE | Pre-configured (Java 17) | Must install Java 17 or 21 and set the JRE path |
| Network access | Splunk Cloud can reach Snowflake over the internet by default | May require firewall rules to allow outbound HTTPS to *.snowflakecomputing.com:443 |
| Authentication | Username/password or OAuth (v1.2.3+ supports OAuth Authorization Code) | Username/password, key-pair, or OAuth |

Step 2: Configure General Settings

DB Connect requires Java 17 or 21. The JVM settings differ between Splunk Cloud and on-premise deployments.

Splunk Cloud (validated working configuration):

| Setting | Value |
|---|---|
| JRE Path | /usr/lib/jvm/java-17-openjdk-amd64 |
| Task Server Port | 9998 |
| Task Server JVM Options | -Ddw.server.applicationConnectors[0].port=9998 --add-opens java.base/java.nio=ALL-UNNAMED |
| Query Server JVM Options | -Dport=9999 --add-opens java.base/java.nio=ALL-UNNAMED |

Splunk On-Premise — settings will vary depending on your environment:

| Setting | Notes |
|---|---|
| JRE Path | Path to your Java 17 or 21 installation (e.g., /opt/java/jdk-17/ on Linux, C:\Program Files\Java\jdk-17\ on Windows) |
| Task Server Port | Default 9998 unless another service uses that port |
| Task/Query Server JVM Options | Same --add-opens flag is required. You may also need to adjust heap size: -Xms512m -Xmx2048m for large query workloads |

Critical for both: The --add-opens java.base/java.nio=ALL-UNNAMED flag is required for Apache Arrow library compatibility with Java 17+. Without it, you’ll get NoClassDefFoundError exceptions when querying Snowflake.

Step 3: Create an Identity

Go to Configuration and create an Identity with the credentials for your Snowflake service account:

  • Username: SVC_SPLUNK (or your service account name)
  • Password: The service account password

Step 4: Create a Connection

Create a new connection with these parameters:

| Parameter | Value | Notes |
|---|---|---|
| Connection Name | snowflake_security | Your choice |
| Identity | Select the identity from Step 3 | |
| Connection Type | Snowflake | |
| Host | <account_identifier> | Account ID only — do NOT include .snowflakecomputing.com |
| Default Database | SECURITY_DATA | Your security database |
| Default Schema | PUBLIC | Or your schema |

Step 5: Set Snowflake User Defaults

DB Connect doesn’t support passing warehouse, role, or namespace via JDBC URL parameters. Instead, set defaults on the Snowflake user:

-- Run this in Snowflake
ALTER USER SVC_SPLUNK SET
    DEFAULT_WAREHOUSE = 'SPLUNK_WH'
    DEFAULT_NAMESPACE = 'SECURITY_DATA.PUBLIC'
    DEFAULT_ROLE = 'SPLUNK_READER';

-- Grant minimum required privileges
CREATE ROLE IF NOT EXISTS SPLUNK_READER;
GRANT USAGE ON WAREHOUSE SPLUNK_WH TO ROLE SPLUNK_READER;
GRANT USAGE ON DATABASE SECURITY_DATA TO ROLE SPLUNK_READER;
GRANT USAGE ON SCHEMA SECURITY_DATA.PUBLIC TO ROLE SPLUNK_READER;
GRANT SELECT ON ALL TABLES IN SCHEMA SECURITY_DATA.PUBLIC TO ROLE SPLUNK_READER;
GRANT SELECT ON FUTURE TABLES IN SCHEMA SECURITY_DATA.PUBLIC TO ROLE SPLUNK_READER;
GRANT ROLE SPLUNK_READER TO USER SVC_SPLUNK;

Step 6: Test the Connection

In Splunk, run a test query:

| dbxquery connection="snowflake_security" query="SELECT CURRENT_USER(), CURRENT_ROLE(), CURRENT_WAREHOUSE()"

If you see your user, role, and warehouse returned, the connection is working.

Troubleshooting

| Issue | Solution |
|---|---|
| NoClassDefFoundError | Add --add-opens java.base/java.nio=ALL-UNNAMED to JVM options |
| Queries time out | Increase heap: add -Xms512m -Xmx2048m to JVM options |
| Wrong warehouse/role | Set defaults on the Snowflake user (Step 5) |
| Connection refused | Verify the account identifier format — no .snowflakecomputing.com suffix |

Dynamic Tables: Near-Real-Time Detection in Snowflake

While Splunk handles sub-second alerting, Snowflake’s Dynamic Tables provide a powerful near-real-time detection layer for the data that lives in the Snowflake tier. Dynamic Tables automatically refresh based on a target lag — typically 1-5 minutes for security workloads — and chain together to form multi-stage detection pipelines.

Why Dynamic Tables Matter for Security

Traditional task-based pipelines require you to write scheduling logic, manage dependencies, and handle failures. Dynamic Tables handle all of this declaratively — you define the SQL transformation and the target lag, and Snowflake manages the rest.

Raw Logs (Snowpipe)
    ▼  Target lag: 1 minute
┌─────────────────────────┐
│  Stage 1: Normalized     │  Parse, standardize timestamps, extract fields
│  (Dynamic Table)         │
└─────────────────────────┘
    ▼  Target lag: 2 minutes
┌─────────────────────────┐
│  Stage 2: Enriched       │  Add geo-IP, threat intel, user context
│  (Dynamic Table)         │
└─────────────────────────┘
    ▼  Target lag: 5 minutes
┌─────────────────────────┐
│  Stage 3: Alerts         │  Threshold detection, anomaly flags
│  (Dynamic Table)         │
└─────────────────────────┘
  Webhook notification → Splunk / SIEM / PagerDuty

Example: Brute Force Detection Pipeline

-- Stage 1: Normalize login events (target lag: 1 minute)
CREATE OR REPLACE DYNAMIC TABLE security_data.login_events_normalized
    TARGET_LAG = '1 minute'
    WAREHOUSE = SECURITY_WH
AS
SELECT
    event_timestamp,
    user_name,
    client_ip,
    is_success,
    first_authentication_factor,
    reported_client_type
FROM SNOWFLAKE.ACCOUNT_USAGE.LOGIN_HISTORY
WHERE event_timestamp >= DATEADD('day', -1, CURRENT_TIMESTAMP());

-- Stage 2: Aggregate for brute force detection (target lag: 2 minutes)
CREATE OR REPLACE DYNAMIC TABLE security_data.brute_force_candidates
    TARGET_LAG = '2 minutes'
    WAREHOUSE = SECURITY_WH
AS
SELECT
    client_ip,
    user_name,
    COUNT(*) AS failed_attempts,
    MIN(event_timestamp) AS first_attempt,
    MAX(event_timestamp) AS last_attempt,
    DATEDIFF('minute', MIN(event_timestamp), MAX(event_timestamp)) AS window_minutes
FROM security_data.login_events_normalized
WHERE is_success = 'NO'
  AND event_timestamp >= DATEADD('hour', -1, CURRENT_TIMESTAMP())
GROUP BY client_ip, user_name
HAVING failed_attempts >= 5;

Warehouse Sizing for Security Workloads

| Warehouse Size | Use Case | Approximate Cost/Hour |
|---|---|---|
| X-Small | Light queries, Splunk DB Connect reads | ~$1-2/hour |
| Small | Dynamic Table refreshes for moderate log volume | ~$2-4/hour |
| Medium | High-volume detection pipelines, threat hunting | ~$4-8/hour |

Cost optimization tip: Use a dedicated SECURITY_WH warehouse with AUTO_SUSPEND = 60 (1 minute). Dynamic Table refreshes will auto-resume the warehouse when needed and it suspends automatically between refresh cycles. For periodic threat hunting from Splunk via DB Connect, use a separate SPLUNK_WH so costs are tracked independently.

CREATE WAREHOUSE SECURITY_WH
    WAREHOUSE_SIZE = 'SMALL'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
    INITIALLY_SUSPENDED = TRUE
    COMMENT = 'Security detection pipeline - Dynamic Tables';

CREATE WAREHOUSE SPLUNK_WH
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
    INITIALLY_SUSPENDED = TRUE
    COMMENT = 'Splunk DB Connect federated queries';

Federated Queries from Splunk via DB Connect

With the connection set up, Splunk analysts can query Snowflake directly from the Splunk UI using the dbxquery command.

The key benefit is aggregation pushdown. The SQL query runs on Snowflake’s compute, and only the result set comes back to Splunk. You are not pulling raw logs into Splunk — you are pulling aggregated answers.
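A local analogue makes the difference concrete. This hypothetical rollup mirrors the kind of CASE/GROUP BY aggregation used in the examples below — the SIEM receives one row per (hour, category) rather than every raw event:

```python
from collections import Counter

def pushdown_rollup(events):
    """What the warehouse sends back over the wire: a small
    (hour, category) -> count map instead of the raw event rows.
    Field names are illustrative."""
    counts = Counter()
    for e in events:
        hour = e["timestamp"][:13]  # e.g. '2025-11-02T14'
        if 400 <= e["status_code"] < 500:
            category = "client_error"
        elif e["status_code"] >= 500:
            category = "server_error"
        else:
            category = "normal"
        counts[(hour, category)] += 1
    return counts

raw = [{"timestamp": "2025-11-02T14:05:00", "status_code": 404},
       {"timestamp": "2025-11-02T14:07:00", "status_code": 200},
       {"timestamp": "2025-11-02T14:09:00", "status_code": 404}]
print(pushdown_rollup(raw))
```

Three raw rows collapse to two aggregate rows here; at production scale, millions of rows collapse to a few hundred — which is why federated dashboards stay responsive.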

Example 1: Basic Historical Fetch

Pull the last 24 hours of web access logs from Snowflake for a specific source IP under investigation:

| dbxquery
    connection="snowflake_security"
    query="
        SELECT timestamp, src_ip, dst_ip, method, uri, status_code, user_agent
        FROM security_lake.web_access_logs
        WHERE src_ip = '10.0.45.12'
          AND timestamp > DATEADD('hour', -24, CURRENT_TIMESTAMP())
        ORDER BY timestamp DESC
        LIMIT 10000
    "
| table timestamp, src_ip, dst_ip, method, uri, status_code, user_agent

Example 2: Security Overview Dashboard (Aggregation Pushdown)

Build a SOC dashboard panel that shows threat categories over time — the heavy aggregation runs on Snowflake, and Splunk just renders the results:

| dbxquery
    connection="snowflake_security"
    query="
        SELECT
            DATE_TRUNC('hour', timestamp) AS hour,
            CASE
                WHEN status_code >= 400 AND status_code < 500 THEN 'client_error'
                WHEN status_code >= 500 THEN 'server_error'
                WHEN uri ILIKE '%/admin%' THEN 'admin_access'
                ELSE 'normal'
            END AS category,
            COUNT(*) AS event_count
        FROM security_lake.web_access_logs
        WHERE timestamp > DATEADD('day', -7, CURRENT_TIMESTAMP())
        GROUP BY 1, 2
        ORDER BY 1 DESC
    "
| timechart span=1h sum(event_count) by category

Example 3: Threat Classification (SQL Injection, XSS Detection)

Run a threat hunting query across months of historical logs — something that would be impractical in Splunk at this retention depth:

| dbxquery
    connection="snowflake_security"
    query="
        SELECT
            src_ip,
            CASE
                WHEN uri ILIKE '%UNION%SELECT%' OR uri ILIKE '%1=1%' OR uri ILIKE '%DROP%TABLE%'
                    THEN 'sql_injection'
                WHEN uri ILIKE '%<script%' OR uri ILIKE '%javascript:%' OR uri ILIKE '%onerror=%'
                    THEN 'xss'
                WHEN uri ILIKE '%../%' OR uri ILIKE '%/etc/passwd%'
                    THEN 'path_traversal'
                WHEN uri ILIKE '%/admin%' AND status_code = 200
                    THEN 'admin_access'
                ELSE 'other'
            END AS threat_type,
            COUNT(*) AS attempt_count,
            MIN(timestamp) AS first_seen,
            MAX(timestamp) AS last_seen,
            COUNT(DISTINCT uri) AS unique_uris
        FROM security_lake.web_access_logs
        WHERE timestamp > DATEADD('month', -6, CURRENT_TIMESTAMP())
          AND (
              uri ILIKE '%UNION%SELECT%'
              OR uri ILIKE '%<script%'
              OR uri ILIKE '%../%'
              OR uri ILIKE '%/etc/passwd%'
              OR (uri ILIKE '%/admin%' AND status_code = 200)
          )
        GROUP BY 1, 2
        HAVING attempt_count > 5
        ORDER BY attempt_count DESC
        LIMIT 100
    "
| table src_ip, threat_type, attempt_count, first_seen, last_seen, unique_uris
| sort -attempt_count

Example 4: Hybrid Correlation (Splunk + Snowflake Join)

Correlate real-time Splunk alerts with historical Snowflake data. For example, take today’s Splunk EDR alerts and look up whether those source IPs have a history of suspicious activity in Snowflake:

index=edr_alerts severity="critical" earliest=-1h
| stats count AS alert_count, values(alert_name) AS alerts BY src_ip
| map search="
    | dbxquery
        connection=\"snowflake_security\"
        query=\"
            SELECT
                '$src_ip$' AS src_ip,
                COUNT(*) AS historical_events,
                SUM(CASE WHEN uri ILIKE '%UNION%SELECT%' OR uri ILIKE '%<script%' THEN 1 ELSE 0 END) AS suspicious_requests,
                MIN(timestamp) AS first_seen_in_logs,
                MAX(timestamp) AS last_seen_in_logs
            FROM security_lake.web_access_logs
            WHERE src_ip = '$src_ip$'
              AND timestamp > DATEADD('month', -3, CURRENT_TIMESTAMP())
        \"
    "
| table src_ip, alert_count, alerts, historical_events, suspicious_requests, first_seen_in_logs, last_seen_in_logs

Example 5: Scheduled Compliance Report

Run a weekly compliance query that counts authentication events by outcome — useful for auditors who need evidence of monitoring:

| dbxquery
    connection="snowflake_security"
    query="
        SELECT
            DATE_TRUNC('day', timestamp) AS day,
            event_type,
            outcome,
            COUNT(*) AS event_count,
            COUNT(DISTINCT src_ip) AS unique_sources
        FROM security_lake.auth_events
        WHERE timestamp > DATEADD('day', -30, CURRENT_TIMESTAMP())
        GROUP BY 1, 2, 3
        ORDER BY 1 DESC, 4 DESC
    "
| table day, event_type, outcome, event_count, unique_sources

Demo Walkthrough

The GitHub repository includes a complete demo that generates realistic security log data, loads it into Snowflake, and provides the queries to federate from Splunk.

What the Demo Does

  1. Generates 100,000 realistic web access log events using Python — including normal traffic, automated scanners, and injected attack patterns
  2. Injects realistic attack signatures including SQL injection attempts, XSS payloads, path traversal, and privilege escalation patterns
  3. Uploads the data to Snowflake for querying
  4. Provides ready-to-use SPL queries for federated search from Splunk

Running the Demo

The project uses pixi for dependency management. To get started:

git clone https://github.com/sfc-gh-kkeller/splunk-snowflake-security-datalake.git
cd splunk-snowflake-security-datalake
pixi install

Generate the sample data:

pixi run generate

This creates a dataset with events spread across multiple days, including:

  • Normal web traffic — GET/POST requests to typical endpoints with realistic user agents and status codes
  • SQL injection attempts — `UNION SELECT`, `1=1`, `DROP TABLE`, `OR 1=1--` patterns injected into URI parameters
  • XSS payloads — `<script>`, `javascript:`, `onerror=` patterns in request URIs
  • Path traversal — `../../etc/passwd`, `..\\windows\\system32` attempts
  • Privilege escalation — repeated access attempts to /admin, /api/admin, /console endpoints with mixed success/failure status codes
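The generator’s approach can be sketched like this — the repo’s actual implementation may differ, and the payload lists, field names, and attack rate here are illustrative:

```python
import random

# Hypothetical attack signature pools, mirroring the categories above.
ATTACKS = {
    "sql_injection": ["/search?q=1' UNION SELECT password FROM users--",
                      "/login?user=admin' OR 1=1--"],
    "xss": ["/comment?text=<script>alert(1)</script>"],
    "path_traversal": ["/download?file=../../etc/passwd"],
}

def make_event(rng: random.Random, attack_rate: float = 0.05) -> dict:
    """One synthetic web access log event; a small fraction carry
    injected attack signatures, labeled for later validation."""
    if rng.random() < attack_rate:
        kind = rng.choice(sorted(ATTACKS))
        uri = rng.choice(ATTACKS[kind])
        status = rng.choice([200, 403, 404])
    else:
        kind = "normal"
        uri = rng.choice(["/", "/index.html", "/api/items"])
        status = 200
    return {"src_ip": f"10.0.{rng.randint(0, 255)}.{rng.randint(1, 254)}",
            "uri": uri, "status_code": status, "label": kind}

rng = random.Random(42)  # seeded for reproducible datasets
events = [make_event(rng) for _ in range(1000)]
print(sum(e["label"] != "normal" for e in events), "attack events of 1000")
```

Keeping a ground-truth `label` on each synthetic event is what lets you verify that the threat-classification SQL in Example 3 actually finds the injected attacks.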

Upload to Snowflake:

pixi run upload

This creates the security_lake.web_access_logs table in your Snowflake account and loads the generated events. You will need Snowflake credentials configured (the repo README covers the details).

Once the data is in Snowflake, configure Splunk DB Connect with your Snowflake connection and run the federated queries shown above.


When to Use This Pattern (and When Not To)

Good fit

  • High log volume (50+ GB/day) — the cost savings scale linearly with volume; below 50 GB/day the complexity may not be worth it
  • Compliance requirements for multi-year retention — regulators want 1-3 years of logs; Snowflake makes this economically feasible
  • Threat hunting on historical data — your analysts need to search across months of logs, not just the Splunk retention window
  • Budget pressure on SIEM costs — if your Splunk renewal is the single largest line item in your security budget, this is worth exploring
  • Teams already using Snowflake — if your org has Snowflake for data warehousing, adding security logs is a natural extension

Not a good fit

  • Small log volumes (< 10 GB/day) — the operational complexity of running two platforms is not justified at low volumes
  • Heavy reliance on Splunk ES notable events — if your detection logic depends on Splunk ES’s correlation searches and risk framework running against all log sources, splitting data will break those detections
  • Real-time-only use case — if you never do historical investigation and only care about real-time alerts, the Snowflake side adds cost without value
  • No engineering capacity for the routing layer — someone needs to build and maintain the routing rules; if your team is already stretched thin, this is not a weekend project

Trade-offs to understand

  • Federated query latency: dbxquery calls to Snowflake take seconds, not milliseconds. This is fine for investigation and dashboards but not for real-time correlation rules.
  • Two platforms to operate: You now have two systems to monitor, secure, and keep running. Make sure you have the operational capacity.
  • Routing logic complexity: Deciding what is “critical” requires ongoing tuning. Start conservative (send more to Splunk) and reduce over time as you gain confidence.
  • Splunk ES content dependencies: Some Splunk ES correlation searches reference specific log sources. Before routing a source to Snowflake-only, audit which ES searches depend on it and adjust your routing strategy accordingly.


Author: Kevin Keller
Personal blog about AI, Observability & Data Sovereignty. Snowflake-related articles explore the art of the possible and are not official Snowflake solutions or endorsed by Snowflake unless explicitly stated. Opinions are my own. Content is meant as educational inspiration, not production guidance.