# Wastewater Viral Activity Levels (WVALs)

## COVID

| Range        | Level     |
| ------------ | --------- |
| <= 1.5       | Very Low  |
| \> 1.5, <= 3 | Low       |
| \> 3, <= 4.5 | Moderate  |
| \> 4.5, <= 8 | High      |
| \> 8         | Very High |

## Calculations

**Source:** [About Wastewater Data \| National Wastewater Surveillance System \| CDC](https://www.cdc.gov/nwss/about-data.html)

### Data Normalization

- Data are normalized according to what each site submits.
- If both flow-population and microbial normalization values are available, flow-population normalization is used.
- After normalization, all concentration data are log-transformed.

### Baseline Calculations

- For each combination of site, data submitter, PCR target, lab method, and normalization method, a baseline is established. The “baseline” is the 10th percentile of the log-transformed, normalized concentration data within a specific time frame. Details on the baseline calculation by pathogen are below:
  - SARS-CoV-2
    - For site and method combinations (as listed above) with more than six months of data, baselines are re-calculated every six calendar months (January 1st and July 1st) using the past 12 months of data.
    - For site and method combinations with less than six months of data, baselines are computed weekly until reaching six months, after which they remain unchanged until the next January 1st or July 1st, when they are re-calculated.
  - Influenza A and RSV
    - For site and method combinations (as listed above) with more than twelve months of data, baselines are re-calculated every August 1st using all available data from the previous 18 months.
    - For site and method combinations with less than twelve months of data, baselines are computed weekly until reaching twelve months, after which they remain unchanged until the next August 1st, when they are re-calculated.
- The standard deviation for each site and method combination is calculated over the same time frame as the baseline.
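The re-calculation schedules above can be sketched as a small helper that maps a sample date to its baseline anchor date and window start. This is my illustrative reading of the rules, not CDC code: it ignores the weekly re-computation rule for newer sites, approximates 18 months as 548 days, and the function name and pathogen labels are mine.

```python
from datetime import datetime, timedelta


def baseline_window(sample_date: datetime,
                    pathogen: str) -> tuple[datetime, datetime]:
    """Return (window_start, anchor_date) for a sample's baseline period."""
    if pathogen == 'sars-cov-2':
        # Anchored to the most recent January 1st or July 1st;
        # the window is the past 12 months.
        anchor = (datetime(sample_date.year, 1, 1)
                  if sample_date.month < 7
                  else datetime(sample_date.year, 7, 1))
        window = timedelta(days=365)
    else:  # Influenza A and RSV.
        # Anchored to the most recent August 1st;
        # the window is the past 18 months (approximated in days).
        year = (sample_date.year
                if sample_date.month >= 8
                else sample_date.year - 1)
        anchor = datetime(year, 8, 1)
        window = timedelta(days=548)
    return anchor - window, anchor
```

For example, a SARS-CoV-2 sample from March 2024 uses the baseline anchored on 2024-01-01, while an influenza A sample from the same date still uses the baseline anchored on 2023-08-01.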
### WVAL Level Calculation

- The number of standard deviations that each log-transformed concentration value deviates from the baseline (positive if above, negative if below) is calculated.
- This value (x) is then converted back to a linear scale (by calculating e^x) to form the Wastewater Viral Activity Level for the site and method combination.
- The Wastewater Viral Activity Levels from a site are averaged by week for all figures.

# Implementation

Hopefully correct (this follows the SARS-CoV-2 schedule only and skips the weekly re-computation rule for sites with under six months of data):

```python
import math
import statistics
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Any

# Minimum number of usable samples required to establish a baseline.
# The exact threshold is a choice; stdev() needs at least two points.
MIN_BASELINE_COUNT = 2

reader = ...  # Read the historical wastewater data.

# Parse the numeric and date fields.
rows_data: list[dict[str, Any]] = [
    {
        **row,
        'pcr_target_avg_conc': float(row['pcr_target_avg_conc']),
        'sample_collect_date': datetime.strptime(
            row['sample_collect_date'], '%Y-%m-%d'),
    }
    for row in reader
]

# Group the data by site and method combination.
grouped_data = defaultdict(list)
for row in rows_data:
    key = (
        row['site_id'],
        row['data_source'],
        row['pcr_target'],
        row['major_lab_method'],
        row['pcr_target_units'],  # Units stand in for the normalization method.
    )
    grouped_data[key].append(row)

# Compute the wastewater levels.
results = []
for key, rows in grouped_data.items():
    rows.sort(key=lambda row: row['sample_collect_date'])

    # Build a lookup by anchor date to avoid recomputing baselines
    # repeatedly.
    anchor_windows: dict[datetime, tuple[float, float] | None] = {}

    for row in rows:
        sample_date = row['sample_collect_date']
        year = sample_date.year
        if sample_date.month < 7:
            anchor_date = datetime(year, 1, 1)
        else:
            anchor_date = datetime(year, 7, 1)
        baseline_window_start = anchor_date - timedelta(days=365)

        # Cache baseline stats per anchor period.
        if anchor_date not in anchor_windows:
            baseline_rows = [
                x for x in rows
                if (baseline_window_start
                    <= x['sample_collect_date']
                    < anchor_date)
            ]
            log_concs = [
                math.log10(x['pcr_target_avg_conc'])
                for x in baseline_rows
                if x['pcr_target_avg_conc'] > 0
            ]
            if len(log_concs) < MIN_BASELINE_COUNT:
                anchor_windows[anchor_date] = None  # Not enough data.
            else:
                # 10th percentile with linear interpolation.
                log_concs.sort()
                percentile_index = len(log_concs) * 0.10
                lower = int(math.floor(percentile_index))
                upper = int(math.ceil(percentile_index))
                if lower == upper:
                    baseline = log_concs[lower]
                else:
                    frac = percentile_index - lower
                    baseline = (
                        (log_concs[lower] * (1 - frac))
                        + (log_concs[upper] * frac)
                    )
                std_dev = statistics.stdev(log_concs)
                anchor_windows[anchor_date] = (baseline, std_dev)

        baseline_data = anchor_windows.get(anchor_date)
        if baseline_data is None:
            # Skip this sample, no valid baseline.
            continue
        baseline, std_dev = baseline_data

        if row['pcr_target_avg_conc'] <= 0:
            # Invalid value.
            continue
        log_val = math.log10(row['pcr_target_avg_conc'])

        # Standard deviations above the baseline. The log base cancels
        # out of this ratio, so log10 works as well as ln here.
        if std_dev != 0:
            z = (log_val - baseline) / std_dev
        else:
            z = 0.0
        wval = math.exp(z)

        if wval <= 1.5:
            level = 'Very Low'
        elif wval <= 3:
            level = 'Low'
        elif wval <= 4.5:
            level = 'Moderate'
        elif wval <= 8:
            level = 'High'
        else:
            level = 'Very High'

        results.append({
            **row,
            'WVAL': round(wval, 3),
            'Z': round(z, 3),
            'activity_level': level,
            'baseline_anchor': anchor_date.date(),
            'baseline_window_start': baseline_window_start.date(),
            'log_conc': round(log_val, 3),
        })
```

# Write-Ups

* A [really useful post](https://www.reddit.com/r/COVID19_Pandemic/comments/1i8oc1g/cdc_wastewater_data_baselines_wval_and/) about WVAL concentration measurements and methodology.

# Datasets

* CDC:
  * [NWSS Public SARS-CoV-2 Concentration in Wastewater Data \| Data \| Centers for Disease Control and Prevention](https://data.cdc.gov/Public-Health-Surveillance/NWSS-Public-SARS-CoV-2-Concentration-in-Wastewater/g653-rqe2/data_preview)
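The `reader` placeholder in the implementation above can be any iterable of dicts. A minimal sketch for a local CSV export of the dataset, assuming a hypothetical file name and that the export carries the column names the implementation references (`pcr_target_avg_conc`, `sample_collect_date`, etc.):

```python
import csv


def load_rows(path: str) -> list[dict[str, str]]:
    # Each row comes back as a dict keyed by the CSV header row; the
    # implementation above parses the numeric and date fields itself.
    with open(path, newline='') as f:
        return list(csv.DictReader(f))


# Hypothetical local export of the NWSS concentration dataset:
# reader = load_rows('nwss_concentration.csv')
```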