Server Architecture Deep Dive
This document is the authoritative technical planning reference for the OpenAstro server. It builds on what's established in General overview.md, Details on componets.md, science_cases_and_scheduler.md, and the Gap Analysis. Read those first. This document goes deeper: it is intended to be comprehensive enough that a developer can build the system directly from it.
1. Component Breakdown
The server is composed of seven logical subsystems. They share a database and communicate through it (no internal message bus at MVP scale). Each subsystem can be a separate Python module or FastAPI router.
                              OPENASTRO SERVER

 +-----------+   +--------------+   +---------------+   +-----------------+
 | Scheduler |   | Alert Ingest |   | Data Pipeline |   | Auth (API keys) |
 +-----+-----+   +------+-------+   +-------+-------+   +--------+--------+
       |                |                   |                    |
 +-----v----------------v-------------------v--------------------v---------+
 |                           PostgreSQL Database                           |
 |  sites | targets | observations | campaigns | alerts |                  |
 |  instruments | calibrations | light_curves | heartbeats                 |
 +------------------------------------+-------------------------------------+
                                      |
 +------------------------------------v-------------------------------------+
 |                            FastAPI REST Layer                            |
 |   /api/v1/targets     /api/v1/observations     /api/v1/alerts            |
 |   /api/v1/sites       /api/v1/campaigns        /api/v1/status            |
 +------------------------------------+-------------------------------------+
                                      |
 +------------+    +------------------v----------+    +------------------+
 | Caddy      |    | Redis Cache                 |    | Backblaze B2     |
 | (HTTPS /   |    | (target list cache,         |    | (FITS storage)   |
 |  reverse   |    |  heartbeat state)           |    |                  |
 |  proxy)    |    +-----------------------------+    +------------------+
 +------------+
                                      | HTTPS
                +---------------------+---------------------+
                v                     v                     v
          +-----------+         +-----------+         +-----------+
          |  Site A   |         |  Site B   |         |  Site C   |
          |  (poll)   |         |  (poll)   |         |  (alert)  |
          +-----------+         +-----------+         +-----------+
1.1 Scheduler
The brain of the network. Runs as a background task (Celery beat or APScheduler) every 2 minutes, and also on-demand when the /targets endpoint is hit. Detailed in Section 2.
1.2 Alert Ingestion Service
Polls or subscribes to ZTF, Gaia Alerts, GCN, TNS, and MPC. Normalizes alerts into the internal format. Creates new targets or updates existing ones. Detailed in Alert Ingestion Design.md.
1.3 Data Pipeline
Processes incoming FITS files and photometry submissions. Runs plate solving, photometry extraction, quality flagging, and light curve assembly. Detailed in Stage 1 Data Pipeline Design.md.
1.4 Auth (API Keys)
Simple API key auth for all telescope clients. No OAuth, no sessions, no user accounts at MVP. Human dashboard access can use HTTP Basic Auth behind Caddy for now. See Section 8.
1.5 Dashboard
Jinja2-rendered server-side HTML. Shows: active sites map, recent observations, campaign progress, alert feed, per-site contribution stats. No JavaScript framework at MVP: plain HTML tables updated every 60s with a meta-refresh tag. Graduate to HTMX when interactivity is needed.
1.6 REST API
FastAPI application exposing versioned endpoints. See Section 8 for full design.
1.7 Background Worker
APScheduler embedded in the FastAPI process at MVP (avoid Celery overhead until needed). Runs:
- Scheduler tick every 2 minutes
- Alert ingestion polls every 5 minutes (ZTF/TNS) or 30 seconds (GCN)
- Data pipeline jobs as observation submissions arrive (async task queue via FastAPI BackgroundTasks)
- Daily backup job
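A minimal wiring sketch for these jobs, assuming APScheduler's AsyncIOScheduler inside the FastAPI lifespan hook; the job functions named here (scheduler_tick, poll_ztf_tns, poll_gcn, run_daily_backup) are placeholders for the implementations described in the rest of this document, and the backup cron hour is arbitrary:
from contextlib import asynccontextmanager
from apscheduler.schedulers.asyncio import AsyncIOScheduler
from fastapi import FastAPI

async def scheduler_tick(): ...      # recompute per-site ranked lists (Section 2)
async def poll_ztf_tns(): ...        # slow alert pollers (Section 1.2)
async def poll_gcn(): ...            # fast GCN poller
async def run_daily_backup(): ...    # pg_dump + upload to B2 (Section 6)

scheduler = AsyncIOScheduler(timezone="UTC")

@asynccontextmanager
async def lifespan(app: FastAPI):
    scheduler.add_job(scheduler_tick, "interval", minutes=2, id="scheduler_tick")
    scheduler.add_job(poll_ztf_tns, "interval", minutes=5, id="alert_poll_slow")
    scheduler.add_job(poll_gcn, "interval", seconds=30, id="alert_poll_gcn")
    scheduler.add_job(run_daily_backup, "cron", hour=12, id="daily_backup")
    scheduler.start()
    yield
    scheduler.shutdown()

app = FastAPI(lifespan=lifespan)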
2. Scheduler Design
2.1 What the Scheduler Solves
The scheduler's job is to answer one question: given a specific site at a specific moment, what should it observe? The answer must be:
- Scientifically optimal (highest priority targets first)
- Geometrically correct (target is actually above the horizon)
- Network-aware (avoid redundant simultaneous coverage for non-simultaneous science cases, maximize it for simultaneous ones like occultations)
- Responsive to alerts (urgent alerts must propagate to clients within one heartbeat cycle, i.e. 30s)
The scheduler is not a global optimizer that produces a schedule for all sites for the night. That's too complex, requires weather prediction, and is unnecessary. Instead it produces a per-request ranked target list for a single site at the moment of request. Each site pulls its own list every 60s. This is the poll-based model agreed in the architecture decisions.
2.2 Scoring Architecture
The combined score is a weighted sum of three independent sub-scores, each on a 0–100 scale:
combined_score = (priority_score × 0.50) + (observability_score × 0.30) + (capability_score × 0.20)
- Priority score (0–100): How urgently does the network need this observation?
- Observability score (0–100): How good is the geometry right now from this site?
- Capability score (0–100): How well is this site equipped for this target?
Targets returning capability_score == 0 or observability_score == 0 are hard-filtered out before scoring; they cannot be observed regardless of priority.
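A sketch of the combining step, under the assumption that the three sub-scores exist as functions implementing Sections 2.3–2.5; the function names are illustrative, not an implemented API:
WEIGHTS = {"priority": 0.50, "observability": 0.30, "capability": 0.20}

def combined_score(site, target, now):
    obs = observability_score(site, target, now)     # Section 2.4
    cap = capability_score(site, target)             # Section 2.5
    if obs == 0 or cap == 0:
        return None                                  # hard filter: never rank this pairing
    pri = priority_score(target, now)                # Section 2.3
    return pri * WEIGHTS["priority"] + obs * WEIGHTS["observability"] + cap * WEIGHTS["capability"]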
2.3 Priority Score Components
Priority is dynamic and recomputes every request. Components:
Base priority (set when target is created): The science case determines this.
- Occultation: 80–100 (time-critical, irreplaceable)
- Transit in progress: 85
- GRB afterglow < 1 hour old: 95
- GRB afterglow 1–6 hours: 75
- TTV exoplanet monitoring: 60
- Long-term variable monitoring: 40
- Archival-quality target: 20
Time criticality boost: Applied to targets with event windows.
# Tiers are exclusive per target type: only the largest applicable boost is applied.
if target.type == 'occultation':
    if time_to_event < timedelta(hours=1):    score += 50
    elif time_to_event < timedelta(hours=24): score += 30
elif target.type == 'transit_ingress' and in_window:
    score += 40
elif target.type == 'grb':
    if age < timedelta(minutes=10): score += 60
    elif age < timedelta(hours=1):  score += 40
Coverage deficit boost: How behind is this target on its required cadence?
hours_since_last_obs = (now - last_obs_time).total_seconds() / 3600  # network-wide, not per-site
cadence_deficit = hours_since_last_obs / target.cadence_hours        # >1 = behind
if cadence_deficit > 2.0:   score += 25
elif cadence_deficit > 1.0: score += 15
elif cadence_deficit < 0.5: score -= 20  # Recently observed, lower urgency
Campaign boost: Active campaigns multiply priority.
for campaign in target.campaigns:
    if campaign.is_active:
        score = min(score * campaign.priority_multiplier, 100)
Redundancy penalty for non-simultaneous targets: If another site is currently observing this same target AND it's not an occultation/kilonova tiling scenario, penalize.
active_observers = count_sites_heartbeating_target(target.id, window=timedelta(minutes=5))
if active_observers > 0 and not target.requires_simultaneous:
    score -= 20 * active_observers  # Diminishing marginal value
Reverse redundancy bonus for simultaneous targets: Occultations and kilonova error box tiling WANT simultaneous coverage.
if target.requires_simultaneous:
    if active_observers < target.min_simultaneous_sites:
        score += 30  # We need more coverage, urgent to join
    else:
        score += 10  # Redundancy is still science here
2.4 Observability Score Components
# Observability starts at 100 and is reduced by geometric penalties.
score = 100

# Altitude component (45-75 deg is the ideal band and takes no penalty)
if altitude < 15: return 0  # hard filter
if altitude < 30:   score -= 30
elif altitude < 45: score -= 15
elif altitude > 80: score -= 10  # zenith tracking issues

# Airmass penalty
airmass = 1 / cos(radians(90 - altitude))
if airmass > 2.5:   score -= 30
elif airmass > 2.0: score -= 20
elif airmass > 1.5: score -= 10

# Sun: hard filter, then twilight penalties
if sun_alt > -6: return 0   # civil twilight or brighter
if sun_alt > -12:   score -= 40  # nautical twilight
elif sun_alt > -18: score -= 20  # astronomical twilight

# Moon: soft penalty, scaled by illumination (worse when full)
moon_illumination = get_moon_illumination()
if moon_up and target_moon_separation < 30:
    score -= 30 * moon_illumination
elif moon_up and target_moon_separation < 60:
    score -= 15 * moon_illumination

# Rising vs. setting: prefer rising targets
if target_is_rising(site, target): score += 5
if target_sets_within_minutes(site, target, minutes=20): score -= 10
2.5 Capability Score Components
Already documented in science_cases_and_scheduler.md (the calculate_site_capability function). Key additions:
- GPS timing requirement: If target.requires_gps_timing and the site does not have GPS, capability = 0 for that target. This is a hard filter for occultation science (see the sketch after this list).
- Filter match: Partial match (site has some of the required filters) gives partial score. Complete mismatch gives score -= 50 but not 0, because unfiltered observations have science value for some targets.
- Pixel scale: The existing formula in science_cases_and_scheduler.md is correct. Add: if the target is an extended object (comet, galaxy), penalize undersampled sites less harshly.
- Limiting magnitude margin: The "mag_margin" penalty in the existing code is correct. Extend it: if target brightness varies (variable star, transient), use the brightest expected magnitude for the capability check.
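A sketch of how the GPS hard filter and the partial filter match could sit on top of the existing calculate_site_capability function from science_cases_and_scheduler.md; the proportional partial-credit formula is an illustrative assumption, not a settled design:
def capability_score(site, target):
    if target.requires_gps_timing and not site.has_gps:
        return 0                                        # hard filter: occultation timing needs GPS
    score = calculate_site_capability(site, target)     # existing function, documented elsewhere
    if target.required_filters:
        matched = set(site.filters or []) & set(target.required_filters)
        if not matched:
            score -= 50                                 # unfiltered data still has some value
        else:
            # Illustrative partial credit: scale a smaller penalty by the fraction of missing filters.
            missing = 1 - len(matched) / len(target.required_filters)
            score -= 25 * missing
    return max(score, 0)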
2.6 Visibility Windows
The scheduler computes not just "is it visible now?" but also "when does visibility start and end tonight?" This is used to:
1. Show observers a preview of their night's opportunities
2. Pre-generate target lists so the first /targets poll of the night returns instantly (cached)
3. Calculate handover windows for time-critical targets
def compute_visibility_window(site, target, date):
    """
    Returns (rise_time, set_time) in UTC for tonight, above min_altitude.
    Uses astroplan's target rise/set calculation; `date` is an astropy Time
    for the night in question.
    """
    from astropy.coordinates import EarthLocation, SkyCoord
    from astroplan import Observer, FixedTarget
    import astropy.units as u

    observer = Observer(
        location=EarthLocation(lat=site.latitude * u.deg,
                               lon=site.longitude * u.deg,
                               height=site.elevation_m * u.m),
        timezone='UTC'
    )
    target_coord = FixedTarget(SkyCoord(ra=target.ra_deg * u.deg, dec=target.dec_deg * u.deg))

    # Compute rise/set for tonight
    night_start = observer.twilight_evening_astronomical(date)
    night_end = observer.twilight_morning_astronomical(date + 1 * u.day)
    rise_time = observer.target_rise_time(night_start, target_coord, horizon=30 * u.deg)
    set_time = observer.target_set_time(night_start, target_coord, horizon=30 * u.deg)
    # Note: circumpolar (always-up) or never-rising targets return masked times
    # and need special-casing before the comparisons below.

    # Clip to astronomical night
    rise_time = max(rise_time, night_start)
    set_time = min(set_time, night_end)
    if rise_time >= set_time:
        return None  # Not visible tonight
    return (rise_time, set_time)
Visibility windows are precomputed nightly at sunset for all active targets × all active sites and cached in Redis. Cache TTL: 8 hours. Key format: vis:{site_id}:{target_id}:{date_utc}.
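A sketch of that nightly precompute using redis-py, assuming the compute_visibility_window function above and an already-configured Redis connection; the key format and TTL follow the text:
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def precompute_visibility(sites, targets, night):
    # `night` is an astropy Time near local sunset; it drives both the rise/set
    # computation and the date component of the cache key.
    date_utc = night.strftime("%Y-%m-%d")
    for site in sites:
        for target in targets:
            window = compute_visibility_window(site, target, night)
            if window is None:
                continue
            rise, set_ = window
            key = f"vis:{site.site_id}:{target.target_id}:{date_utc}"
            r.set(key, json.dumps({"rise": rise.isot, "set": set_.isot}), ex=8 * 3600)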
2.7 Campaign Management
A campaign is a science program with defined goals, time windows, and participating targets. Campaigns have:
- start_date / end_date: Hard boundaries. Targets outside this range get campaign priority boost removed.
- priority_multiplier: Applied to all member targets (float, 1.0 = no boost, 2.0 = double priority)
- coverage_goal: Number of observations or observation-hours needed
- requires_simultaneous: Whether targets in this campaign need multi-site simultaneous coverage
- campaign_type: enum of 'occultation', 'transit', 'grb', 'monitoring', 'survey'
The campaign system does not pre-assign targets to specific sites. Sites pull their ranked target lists and the campaign priority boost makes campaign targets float to the top naturally. This keeps the architecture simple.
Exception: GRB follow-up and occultations. For these, the server actively pushes abort signals via the heartbeat response (see Section 2.8). Sites in the middle of other observations get a 205 Reset Content on their next heartbeat if a higher-priority event triggers.
2.8 Handover Logic
When a time-critical target (GRB, bright transient) is activated, the server needs all available sites on it within the next 30-second heartbeat cycle. The handover mechanism:
1. An alert is ingested and a new target is created with priority = 95 and type = 'grb_afterglow'.
2. A Redis key abort_signal:{site_id} is set for every site that is: currently active (heartbeated within the last 2 minutes); has the GRB target visible above 30°; and is not already observing a higher-priority target (another GRB, an active occultation).
3. On the next heartbeat request from that site, the server checks for abort_signal:{site_id} and returns HTTP 205 Reset Content with the new target in the body (sketched below).
4. Client-side: the client must handle 205 by aborting the current observation and slewing to the returned target.
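A sketch of the server-side check in step 3, assuming a synchronous redis-py client r and a record_heartbeat helper that writes the heartbeats row; the response shape mirrors Section 4.6:
import json
from fastapi import APIRouter, Response

router = APIRouter()

@router.put("/api/v1/sites/{site_id}/heartbeat")
async def heartbeat(site_id: str, payload: dict, response: Response):
    await record_heartbeat(site_id, payload)        # assumed helper: insert into heartbeats table
    raw = r.get(f"abort_signal:{site_id}")          # r: redis client configured elsewhere
    if raw is None:
        return {"status": "ok", "abort": False}
    signal = json.loads(raw)
    r.delete(f"abort_signal:{site_id}")             # deliver the abort exactly once
    # Note: RFC 7231 expects 205 to carry no body; per Section 2.8 the target is returned anyway.
    response.status_code = 205
    return {"status": "abort", "abort": True,
            "reason": signal["reason"], "priority_target_id": signal["target_id"]}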
For occultations: handover is pre-planned (prediction is hours in advance). No abort needed. Sites see occultation in their ranked list with rising priority as event approaches. The scheduler handles this naturally via the time criticality boost.
For TTV transits: the ingress is predictable. The scheduler pre-boosts the transit target starting 2 hours before ingress. No abort mechanism is needed for most transits; the only case is a higher-priority opportunistic event arriving mid-transit, in which case the transit is probably best abandoned (a missed transit is a gap in the data, not a total loss).
2.9 Performance at Scale
At 1000 sites polling every 60 seconds: ~17 requests per second to the /targets endpoint. Each request triggers the scoring algorithm over all active targets.
If there are 500 active targets and 1000 sites, naive operation requires 500 × 1000 = 500,000 score computations per 60-second window. Each computation is cheap (astropy coordinate transform + a few arithmetic ops), but this adds up.
Optimizations:
1. Cache visibility windows in Redis (computed nightly, refreshed every 2 hours). The altitude/sun/moon check is the expensive part. With cached windows, the per-request overhead drops to cache lookups + priority arithmetic.
2. Two-tier target list: Pre-filter targets to those visible from the site's latitude band (±30°). A site at 40°N latitude cannot see targets with dec < -50°; pre-filter at registration, not at query time.
3. Redis hash per site: Store the last computed ranked target list per site. Serve it directly if polled within the last 30 seconds, recompute if older. targets:{site_id} → JSON list, TTL 30s.
4. Async DB queries: The FastAPI endpoint is async; all DB queries go through asyncpg (not synchronous SQLAlchemy). This lets the server handle many concurrent requests without thread-blocking.
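A sketch of optimization 3 combined with the async endpoint, assuming a redis-py client r, an authenticated_site dependency that resolves the API key to a site row, and a rank_targets_for_site coroutine that runs the Section 2 scoring over asyncpg; all three names are assumptions, not implemented modules:
import json
from fastapi import APIRouter, Depends

router = APIRouter()

@router.get("/api/v1/targets")
async def get_targets(site=Depends(authenticated_site)):
    key = f"targets:{site.site_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)                   # served within the 30 s freshness window
    ranked = await rank_targets_for_site(site)      # asyncpg queries + Section 2 scoring
    body = {"site_id": str(site.site_id), "targets": ranked, "next_poll_seconds": 60}
    r.set(key, json.dumps(body), ex=30)             # TTL 30 s, per item 3 above
    return body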
3. Database Schema
This section supersedes the schemas in Details on componets.md and science_cases_and_scheduler.md. It is the canonical schema going into production.
3.1 Core Tables
-- =========================================================
-- SITES
-- =========================================================
CREATE TABLE sites (
site_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name VARCHAR(255) NOT NULL,
owner_name VARCHAR(255),
owner_email VARCHAR(255),
-- Location (geodetic WGS84; sky coordinates elsewhere are ICRS/J2000)
latitude DOUBLE PRECISION NOT NULL, -- decimal degrees
longitude DOUBLE PRECISION NOT NULL, -- decimal degrees
elevation_m DOUBLE PRECISION,
timezone VARCHAR(64), -- IANA tz name e.g. 'America/New_York'
-- Equipment (used by scheduler capability scoring)
aperture_mm INTEGER,
focal_length_mm INTEGER,
pixel_size_um DOUBLE PRECISION, -- microns
sensor_width_mm DOUBLE PRECISION,
sensor_height_mm DOUBLE PRECISION,
camera_model VARCHAR(128),
filters TEXT[], -- e.g. ARRAY['B','V','R','I','Clear']
bortle_class SMALLINT, -- 1–9
typical_seeing DOUBLE PRECISION, -- arcsec
has_gps BOOLEAN DEFAULT false,
-- Automation level
automation_level VARCHAR(16) DEFAULT 'manual', -- 'manual','semi','robotic'
-- Derived (cached, updated by scheduler)
limiting_mag_v DOUBLE PRECISION, -- estimated V-band limiting mag (60s)
-- Status
is_active BOOLEAN DEFAULT true,
api_key_hash VARCHAR(128) NOT NULL UNIQUE, -- bcrypt hash
created_at TIMESTAMPTZ DEFAULT NOW(),
last_seen TIMESTAMPTZ,
last_obs_at TIMESTAMPTZ
);
CREATE INDEX idx_sites_active ON sites (is_active) WHERE is_active = true;
CREATE INDEX idx_sites_coords ON sites USING GIST (ll_to_earth(latitude, longitude));
CREATE INDEX idx_sites_last_seen ON sites (last_seen DESC);
-- =========================================================
-- TARGETS
-- =========================================================
CREATE TABLE targets (
target_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name VARCHAR(255) NOT NULL,
aliases TEXT[], -- alternative names / catalog IDs
-- Coordinates (ICRS J2000, decimal degrees)
ra_deg DOUBLE PRECISION NOT NULL,
dec_deg DOUBLE PRECISION NOT NULL,
-- Classification
target_type VARCHAR(64) NOT NULL,
-- values: 'variable_star','exoplanet_host','asteroid','transient',
-- 'occultation_star','grb_afterglow','kilonova_candidate',
-- 'microlensing_event','comet','blazar','eclipsing_binary'
-- Photometric properties
magnitude DOUBLE PRECISION, -- expected V or clear-band magnitude
magnitude_band VARCHAR(8) DEFAULT 'V',
-- Priority and scheduling
base_priority SMALLINT DEFAULT 50, -- 1–100
cadence_hours DOUBLE PRECISION, -- desired observation cadence
min_observations INTEGER DEFAULT 1,
-- Filter requirements
required_filters TEXT[], -- can be empty (any filter OK)
required_fov_arcmin DOUBLE PRECISION, -- minimum FOV needed
-- Special requirements
requires_gps_timing BOOLEAN DEFAULT false, -- occultations
requires_simultaneous BOOLEAN DEFAULT false,
min_simultaneous_sites SMALLINT DEFAULT 1,
requires_high_cadence BOOLEAN DEFAULT false,
max_exposure_sec DOUBLE PRECISION, -- for high-cadence targets
-- Time constraints
event_time TIMESTAMPTZ, -- occultation event time
ingress_time TIMESTAMPTZ, -- transit ingress
egress_time TIMESTAMPTZ, -- transit egress
trigger_time TIMESTAMPTZ, -- GRB/transient trigger
expires_at TIMESTAMPTZ, -- when target becomes inactive
-- Source and provenance
source VARCHAR(64), -- 'manual','ztf','gaia','gcn','tns','mpc'
external_id VARCHAR(128), -- original ID in source system
alert_id UUID, -- FK to alerts(alert_id); constraint added after alerts exists (circular reference)
-- Status
active BOOLEAN DEFAULT true,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
notes TEXT
);
CREATE INDEX idx_targets_active ON targets (active) WHERE active = true;
CREATE INDEX idx_targets_coords ON targets USING GIST (ll_to_earth(dec_deg, ra_deg));
CREATE INDEX idx_targets_type ON targets (target_type);
CREATE INDEX idx_targets_priority ON targets (base_priority DESC) WHERE active = true;
CREATE INDEX idx_targets_expires ON targets (expires_at) WHERE expires_at IS NOT NULL;
CREATE INDEX idx_targets_event_time ON targets (event_time) WHERE event_time IS NOT NULL;
-- =========================================================
-- CAMPAIGNS
-- =========================================================
CREATE TABLE campaigns (
campaign_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name VARCHAR(255) NOT NULL,
description TEXT,
campaign_type VARCHAR(32), -- 'occultation','transit','monitoring','survey','grb'
start_date DATE,
end_date DATE,
priority_multiplier DOUBLE PRECISION DEFAULT 1.0,
requires_simultaneous BOOLEAN DEFAULT false,
coverage_goal INTEGER, -- total observations needed across network
is_active BOOLEAN DEFAULT true,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE campaign_targets (
campaign_id UUID REFERENCES campaigns(campaign_id) ON DELETE CASCADE,
target_id UUID REFERENCES targets(target_id) ON DELETE CASCADE,
added_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (campaign_id, target_id)
);
CREATE INDEX idx_ct_target ON campaign_targets (target_id);
-- =========================================================
-- OBSERVATIONS
-- =========================================================
CREATE TABLE observations (
obs_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
target_id UUID NOT NULL REFERENCES targets(target_id),
site_id UUID NOT NULL REFERENCES sites(site_id),
-- Timing (all UTC)
obs_start TIMESTAMPTZ NOT NULL,
obs_mid TIMESTAMPTZ NOT NULL, -- BJD correction should be applied upstream
obs_end TIMESTAMPTZ NOT NULL,
-- Exposure
exposure_sec DOUBLE PRECISION NOT NULL,
num_frames INTEGER DEFAULT 1,
filter_used VARCHAR(16),
-- Observation geometry
airmass DOUBLE PRECISION,
altitude_deg DOUBLE PRECISION,
azimuth_deg DOUBLE PRECISION,
moon_sep_deg DOUBLE PRECISION,
moon_illumination DOUBLE PRECISION,
-- Photometric result
magnitude DOUBLE PRECISION,
mag_error DOUBLE PRECISION,
flux_adu DOUBLE PRECISION, -- raw ADU if useful
snr DOUBLE PRECISION,
-- Astrometric result
ra_measured_deg DOUBLE PRECISION, -- from plate solution
dec_measured_deg DOUBLE PRECISION,
-- Comparison stars
comparison_stars JSONB, -- [{catalog_id, ra, dec, mag_catalog, mag_inst}, ...]
zero_point DOUBLE PRECISION,
-- Data quality
quality_flags TEXT[], -- ['CLOUDY','FOCUS_DRIFT','SATELLITE_TRAIL',...]
quality_score SMALLINT DEFAULT 0, -- 0=good, 1=warn, 2=bad
-- Raw data
fits_path VARCHAR(512), -- path in B2 object storage
fits_header JSONB, -- full FITS header as JSON
-- Metadata
pipeline_version VARCHAR(32), -- which version of pipeline processed this
submitted_at TIMESTAMPTZ DEFAULT NOW(),
processed_at TIMESTAMPTZ -- when data pipeline finished
);
-- Index strategy: the observations table will be the largest.
-- Primary query patterns:
-- 1. Light curve for a target: target_id + obs_mid time range
-- 2. Recent obs by site: site_id + submitted_at
-- 3. Quality filtering: quality_score
-- 4. Campaign analysis: target_id + campaign join
CREATE INDEX idx_obs_target_time ON observations (target_id, obs_mid DESC);
CREATE INDEX idx_obs_site_time ON observations (site_id, submitted_at DESC);
CREATE INDEX idx_obs_quality ON observations (quality_score) WHERE quality_score > 0;
CREATE INDEX idx_obs_submitted ON observations (submitted_at DESC);
-- Partial index for unprocessed observations (pipeline work queue)
CREATE INDEX idx_obs_unprocessed ON observations (submitted_at)
WHERE processed_at IS NULL;
-- For time-partitioned queries (after migration to partitioned table):
-- Partition observations by month on obs_mid. At 1000 sites × 10 obs/night =
-- ~10,000 obs/night = ~300k/month. One partition per month.
-- PostgreSQL declarative partitioning:
-- CREATE TABLE observations_2026_01 PARTITION OF observations
-- FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
-- =========================================================
-- HEARTBEATS
-- =========================================================
-- Tracks the 30s heartbeat signals from active clients.
-- Separate from 'last_seen' in sites to keep sites table clean.
-- Retain only last 2 hours; older rows purged by cron.
CREATE TABLE heartbeats (
heartbeat_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
site_id UUID NOT NULL REFERENCES sites(site_id),
received_at TIMESTAMPTZ DEFAULT NOW(),
current_target_id UUID REFERENCES targets(target_id), -- what they're observing
status VARCHAR(16), -- 'idle','observing','slewing','error'
weather_ok BOOLEAN,
sky_quality DOUBLE PRECISION -- mag/arcsec² if available
);
CREATE INDEX idx_hb_site_time ON heartbeats (site_id, received_at DESC);
CREATE INDEX idx_hb_recent ON heartbeats (received_at DESC);
-- =========================================================
-- ALERTS (External Event Ingestion)
-- =========================================================
CREATE TABLE alerts (
alert_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
source VARCHAR(64) NOT NULL, -- 'ztf','gaia','gcn','tns','mpc','ligo'
external_id VARCHAR(256), -- original alert ID in source system
alert_type VARCHAR(64), -- 'supernova','grb','transient','occultation','neo'
-- Coordinates (may be imprecise for GRBs)
ra_deg DOUBLE PRECISION,
dec_deg DOUBLE PRECISION,
position_error_deg DOUBLE PRECISION, -- 1σ localization error
priority DOUBLE PRECISION DEFAULT 50,
-- Raw payload from source (for re-parsing if schema changes)
raw_payload JSONB NOT NULL,
-- Processing state
processed BOOLEAN DEFAULT false,
target_id UUID REFERENCES targets(target_id), -- created target, if any
received_at TIMESTAMPTZ DEFAULT NOW(),
processed_at TIMESTAMPTZ
);
CREATE INDEX idx_alerts_source ON alerts (source, received_at DESC);
CREATE INDEX idx_alerts_unprocessed ON alerts (received_at) WHERE processed = false;
-- Now that alerts exists, add the deferred FK from targets (see targets.alert_id above)
ALTER TABLE targets
    ADD CONSTRAINT fk_targets_alert FOREIGN KEY (alert_id) REFERENCES alerts(alert_id);
-- =========================================================
-- INSTRUMENTS (Stage 1 registry)
-- =========================================================
-- Populated by the calibration pipeline when processing archival data.
-- Each unique (observer_code, telescope, camera) combination gets a record.
CREATE TABLE instruments (
instrument_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
-- Provenance
observer_code VARCHAR(32), -- AAVSO observer code, MPC code, etc.
source VARCHAR(32), -- 'aavso','mpc','astrobin','manual'
-- Hardware
aperture_mm INTEGER,
focal_length_mm INTEGER,
camera_model VARCHAR(128),
pixel_size_um DOUBLE PRECISION,
-- Calibration (derived from running against Gaia/APASS)
zero_point_v DOUBLE PRECISION, -- V-band zero point (mag)
zero_point_r DOUBLE PRECISION, -- R-band
color_term_bv DOUBLE PRECISION, -- B-V color term coefficient
noise_floor_mmag DOUBLE PRECISION, -- floor below which precision can't improve
calibration_epoch TIMESTAMPTZ, -- when calibration was derived
num_calibration_obs INTEGER, -- how many obs used to derive calibration
-- Location (if known; not required for archival instruments)
latitude DOUBLE PRECISION,
longitude DOUBLE PRECISION,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- =========================================================
-- LIGHT CURVES (Assembled by pipeline)
-- =========================================================
-- Denormalized summary of assembled light curve per target per filter.
-- Rebuilt by pipeline on each new observation; cached result.
CREATE TABLE light_curve_points (
lcp_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
target_id UUID NOT NULL REFERENCES targets(target_id),
obs_id UUID NOT NULL REFERENCES observations(obs_id),
-- Time (BJD_TDB preferred for variable star/exoplanet science)
bjd_tdb DOUBLE PRECISION, -- Barycentric Julian Date
filter_used VARCHAR(16),
magnitude DOUBLE PRECISION NOT NULL,
mag_error DOUBLE PRECISION NOT NULL,
-- Ensemble normalization state
ensemble_corrected BOOLEAN DEFAULT false,
zero_point_applied DOUBLE PRECISION,
UNIQUE (obs_id) -- one point per observation
);
CREATE INDEX idx_lcp_target_filter ON light_curve_points (target_id, filter_used, bjd_tdb);
3.2 Indexing Strategy
The query patterns that dominate at scale:
| Query | Frequency | Index Used |
|---|---|---|
| Get targets for site (scheduler) | 17 req/s at 1000 sites | idx_targets_active, Redis cache |
| Submit observation | 100–1000/day initially | PK insert |
| Get light curve for target | On-demand | idx_lcp_target_filter |
| Recent observations by site | Dashboard | idx_obs_site_time |
| Unprocessed observations (pipeline) | Every 5 min | idx_obs_unprocessed |
| Active heartbeating sites | Every 30s | idx_hb_recent |
| Spatial: sites within FOV of alert | On alert | idx_sites_coords (GIST) |
The observations table is the growth driver. At 1000 sites × 10 obs/night average = 3.65M observations/year. Without partitioning, queries on target_id + time range will degrade as the table grows past ~10M rows. Partition by month when row count approaches 5M (roughly year 2 at that activity level).
The ll_to_earth function from the earthdistance extension is used for spatial queries. Ensure the earthdistance and cube extensions are installed:
CREATE EXTENSION IF NOT EXISTS earthdistance CASCADE;
CREATE EXTENSION IF NOT EXISTS cube CASCADE;
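For reference, the "sites near a position" lookup that the idx_sites_coords GIST index supports could look like the following asyncpg sketch; the helper name is illustrative and the degree-to-metre conversion is approximate:
import asyncpg

async def sites_near(pool: asyncpg.Pool, lat_deg: float, lon_deg: float, radius_deg: float):
    # Coarse cut with earth_box (uses the GIST index), exact cut with earth_distance.
    radius_m = radius_deg * 111_200   # ~111.2 km per degree of great circle
    return await pool.fetch(
        """
        SELECT site_id, name, latitude, longitude
        FROM sites
        WHERE is_active
          AND earth_box(ll_to_earth($1, $2), $3) @> ll_to_earth(latitude, longitude)
          AND earth_distance(ll_to_earth($1, $2), ll_to_earth(latitude, longitude)) < $3
        """,
        lat_deg, lon_deg, radius_m,
    )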
3.3 JSON Fields Policy
Use JSONB for:
- fits_header: full FITS header dump; schema varies per instrument, no need to normalize
- comparison_stars: list of comparison star photometry; variable length
- raw_payload: external alert payloads; schema controlled by source, not us
Do not use JSONB for fields that are queried (indexed) or that have consistent structure. Those should be proper columns. The temptation to put everything in JSONB should be resisted.
4. API Design
4.1 Versioning
All endpoints under /api/v1/. When breaking changes are needed, add /api/v2/ and maintain v1 for at least 6 months. Never break existing clients silently. The minimum version support commitment to volunteers: current version + one previous version.
4.2 Authentication
All site/client endpoints require Authorization: Bearer {api_key} header. The API key is a 32-byte random hex string (secrets.token_hex(32)) stored as a bcrypt hash in the database. Never store keys in plaintext.
Human/dashboard endpoints use separate scoped tokens or HTTP Basic Auth behind Caddy. No mixing of site keys and human auth.
Alert ingest endpoints (from our own ingestion services) use service-to-service shared secrets, not site keys.
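A sketch of key issuance and verification with the bcrypt package; verification is a single hash check because site-scoped endpoints already know the site_id from the path and can load that site's stored hash:
import secrets
import bcrypt

def issue_api_key():
    key = secrets.token_hex(32)                                    # 32 random bytes -> 64 hex chars
    key_hash = bcrypt.hashpw(key.encode(), bcrypt.gensalt()).decode()
    return key, key_hash                                           # store key_hash; show key exactly once

def verify_api_key(presented_key, stored_hash):
    return bcrypt.checkpw(presented_key.encode(), stored_hash.encode())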
4.3 Rate Limiting
Per-site rate limits enforced by Redis counters (INCR + EXPIRE pattern):
| Endpoint | Limit |
|---|---|
| GET /api/v1/targets | 2 req/min (client polls every 60s; 2 gives some tolerance) |
| POST /api/v1/observations | 100 req/hour |
| PUT /api/v1/sites/{id}/heartbeat | 4 req/min (30s heartbeat; 4 = tolerance) |
| POST /api/v1/files (FITS upload) | 20 req/hour |
| GET /api/v1/lightcurve/{id} | 30 req/min (dashboard access) |
| Public /status | 60 req/min, no auth required |
Exceeding limit: HTTP 429 with Retry-After header. Client must honor this.
Implementation: FastAPI middleware using slowapi (wraps Redis counters). Key: ratelimit:{api_key}:{endpoint_slug}:{minute_bucket}.
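Independent of slowapi, the underlying mechanism is just INCR + EXPIRE on a minute bucket; a sketch (in practice the key would use a key fingerprint or site_id rather than the raw secret):
import time

def allow_request(r, api_key, endpoint_slug, limit_per_min):
    bucket = int(time.time() // 60)
    key = f"ratelimit:{api_key}:{endpoint_slug}:{bucket}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, 120)            # keep the bucket a little beyond its minute
    return count <= limit_per_min     # False -> respond 429 with Retry-After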
4.4 Pagination
All list endpoints paginate with cursor-based pagination (not offset). Offset pagination degrades as pages grow. Cursor pagination is consistent and fast with indexed queries.
Response shape:
{
"data": [...],
"cursor": "eyJ0aW1lIjogIjIwMjYtMDMtMjBUMTA6MzA6MDBaIiwgImlkIjogImFiYzEyMyJ9",
"has_more": true
}
The cursor is a base64-encoded JSON {"time": "<iso>", "id": "<uuid>"} pointing to the last item in the current page. Next page: GET /api/v1/observations?cursor=<token>&limit=100.
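A sketch of encoding and decoding that cursor, plus the keyset condition it translates to; the field names assume the observations listing:
import base64
import json

def encode_cursor(last_item):
    payload = {"time": last_item["submitted_at"], "id": last_item["obs_id"]}
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()

def decode_cursor(token):
    return json.loads(base64.urlsafe_b64decode(token.encode()))

# The decoded cursor becomes a keyset condition for the next page (newest first):
#   WHERE (submitted_at, obs_id) < ($1, $2)
#   ORDER BY submitted_at DESC, obs_id DESC
#   LIMIT $3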
Exception: /api/v1/targets for site polling returns a fixed-size ranked list (default 20), not paginated. Pagination doesn't make sense here; you want the top N targets, not all of them.
4.5 Full Endpoint List
# Site operations
POST /api/v1/sites/register # New site registration
GET /api/v1/sites/{site_id} # Get site details
PATCH /api/v1/sites/{site_id} # Update site metadata
PUT /api/v1/sites/{site_id}/heartbeat # 30s heartbeat (includes current target)
GET /api/v1/sites/{site_id}/stats # Contribution stats for dashboard
# Scheduling
GET /api/v1/targets # Ranked target list for this site
?count=20&type=occultation # optional filters
GET /api/v1/targets/{target_id} # Full target details
GET /api/v1/targets/{target_id}/plan # Observation plan (exposure, timing, comp stars)
# Observations
POST /api/v1/observations # Submit observation (JSON photometry)
POST /api/v1/observations/{obs_id}/fits # Upload FITS file for observation
GET /api/v1/observations/{obs_id} # Get observation details
GET /api/v1/observations # Query observations (auth required)
?target_id=&site_id=&after=&before=&quality=0
# Light curves
GET /api/v1/lightcurve/{target_id} # Assembled light curve
?filter=V&after=&before=
# Campaigns
GET /api/v1/campaigns # List active campaigns
GET /api/v1/campaigns/{campaign_id} # Campaign details + targets
# Alerts
POST /api/v1/alerts # Internal: submit external alert
GET /api/v1/alerts # Recent alerts (dashboard)
# Public
GET /api/v1/status # Network health (no auth)
GET /api/v1/network/map # Active site locations (no auth)
4.6 Key Response Shapes
GET /api/v1/targets (the most frequently hit endpoint):
{
  "timestamp": "2026-03-20T03:42:00Z",
  "site_id": "...",
  "targets": [
    {
      "target_id": "...",
      "name": "GRB 260320A",
      "ra_deg": 145.23,
      "dec_deg": 28.44,
      "type": "grb_afterglow",
      "magnitude": 14.2,
      "score": 94.1,
      "score_breakdown": {
        "priority": 95.0,
        "observability": 87.0,
        "capability": 90.0
      },
      "observation_settings": {
        "recommended_exposure_sec": 60,
        "recommended_filter": "R",
        "cadence_sec": 120,
        "total_duration_min": 30
      },
      "timing": {
        "optimal_window_start": "2026-03-20T03:40:00Z",
        "optimal_window_end": "2026-03-20T07:15:00Z",
        "event_time": null
      }
    }
  ],
  "next_poll_seconds": 60,
  "abort_signal": false
}
PUT /api/v1/sites/{id}/heartbeat response (normally 200):
{"status": "ok", "abort": false}
When abort is triggered:
{"status": "abort", "abort": true, "reason": "grb_alert", "priority_target_id": "..."}
Client implementation note: the client MUST check the abort field on every heartbeat response. If true, it should gracefully stop the current exposure (if interruptible), slew to the priority target, and add the interrupted observation to a local retry queue (submit with incomplete=true flag).
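A sketch of that client-side contract; stop_exposure, queue_for_retry and slew_to stand in for real client functions:
import requests

def send_heartbeat(base_url, api_key, site_id, payload):
    resp = requests.put(
        f"{base_url}/api/v1/sites/{site_id}/heartbeat",
        json=payload,
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=10,
    )
    body = resp.json()
    if body.get("abort"):
        partial = stop_exposure()                      # stop gracefully if interruptible
        if partial is not None:
            queue_for_retry(partial, incomplete=True)  # resubmit later with incomplete=true
        slew_to(body["priority_target_id"])
    return body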
5. Scaling Path
5.1 Phase 1: Single VPS (0–50 sites)
Infrastructure: Hetzner CX21 (2 vCPU, 4GB RAM, 40GB SSD, €5.49/month).
What runs on it:
- FastAPI app (Gunicorn, 2 workers, uvicorn worker class)
- PostgreSQL 15 (local)
- Redis (local, persistence off; all cache data can be rebuilt)
- Caddy (reverse proxy + automatic HTTPS)
- APScheduler embedded in the FastAPI process
What does NOT need to run separately:
- Celery (APScheduler is enough)
- Separate database server
- Load balancer
Storage: B2 for FITS files (cold, cheap). Local SSD for database only.
Expected load: < 1 req/s at 50 sites polling every 60s. A single Gunicorn worker handles this without breaking a sweat.
Failure risk at this phase: Single point of failure everywhere. That's OK. Documented recovery procedure matters more than redundancy at this scale. If the VPS goes down for an hour, volunteer sites just get "server unavailable" errors and retry. No science is lost β they still have their own local FITS files.
5.2 Phase 2: Optimized Single Node (50–200 sites)
When: top shows > 50% CPU during peak polling periods, or PostgreSQL query time exceeds 500ms.
Changes:
- Upgrade to Hetzner CX31 or CX41 (4–8 vCPU, 8–16GB RAM)
- Add Redis caching for target lists (already designed for this in Section 2.9)
- Add pgBouncer for connection pooling (PostgreSQL max_connections is not free)
- Move from embedded APScheduler to a separate Celery worker process (same machine) for isolation
- Enable PostgreSQL WAL archiving for crash recovery without full restore
5.3 Phase 3: Separated Services (200–1000 sites)
When: read-heavy load on targets endpoint saturates the database.
Changes:
- Separate PostgreSQL to its own Hetzner managed database (or dedicated VPS)
- Add a PostgreSQL read replica for dashboard queries (not site polling β site polling needs fresh data)
- Add a second application server (Hetzner CX21 × 2) behind Caddy load balancing
- Redis becomes a separate dedicated instance (Hetzner or managed Redis)
- Alert ingestion becomes a separate service with its own process
- Partition the observations table by month (add to migration script)
Total cost at this phase: ~$80–120/month. Still very cheap for the load.
5.4 Phase 4: Multi-Region (1000+ sites)
At 1000+ sites: ~17 req/s to /targets continuously. This is still a single-machine load (modern web servers handle 1000s of req/s). Multi-region is more about latency than throughput.
Latency matters for:
- Sites in Asia/Pacific polling a Europe server: ~200ms round trip. Not a problem for 60s polling.
- The 30s heartbeat abort signal: 200ms is fine.
So multi-region is not actually necessary even at 1000+ sites for our architecture. The main reason to go multi-region would be if we add a real-time streaming component for GRB alerts (sub-second latency matters there) or if we have local data storage compliance requirements.
If/when multi-region is needed:
- Primary in Europe (Hetzner Falkenstein)
- Read replicas in US-East and Singapore
- Writes always go to the primary
- Reads (target lists) served from the nearest replica
- Alert ingest and scheduler run on the primary only
6. Failure Modes and Resilience
6.1 Client-Side Failures (Sites)
Site goes offline mid-observation: No impact on server. The observation is simply not submitted. The scheduler notes the coverage gap at next cadence check. If it was a time-critical event (occultation), nothing can be done; other sites covered it or they didn't. This is inherent to distributed volunteer networks.
Client crashes mid-submission: Clients should buffer observations locally and retry. Retry with exponential backoff (see failure_analysis_guide.md). The server's POST /observations endpoint is idempotent if the client sends a client-generated obs_id (UUID). If the server has already stored this obs_id, it returns 200 (idempotent success) instead of creating a duplicate.
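A sketch of the idempotent insert with asyncpg, assuming the client-supplied obs_id is the primary key as in the Section 3 schema (only a subset of columns shown):
async def store_observation(pool, obs):
    result = await pool.execute(
        """
        INSERT INTO observations (obs_id, target_id, site_id, obs_start, obs_mid, obs_end,
                                  exposure_sec, filter_used, magnitude, mag_error)
        VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)
        ON CONFLICT (obs_id) DO NOTHING
        """,
        obs["obs_id"], obs["target_id"], obs["site_id"], obs["obs_start"], obs["obs_mid"],
        obs["obs_end"], obs["exposure_sec"], obs["filter_used"], obs["magnitude"], obs["mag_error"],
    )
    return result == "INSERT 0 1"   # False means the row already existed (idempotent retry)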
Site submits garbage data: Quality flagging pipeline rejects it. The site's quality_score history is tracked; after 10 consecutive low-quality submissions, the site is automatically throttled (lower capability score) until a human reviews.
Mass simultaneous polling (thundering herd): Unlikely at our scale (sites are globally distributed). But if 100 sites come online at once after server restart, Redis cache absorbs the spike β first site warms the cache, subsequent 99 get the cached response.
6.2 Server-Side Failures
Database down: FastAPI returns 503. Clients get HTTP error, log it, retry next cycle. No data is lost on the client side (they have local FITS). When database recovers, clients resume normally.
Redis down: Fall back to the database for all cache operations. Slower but correct. The server logs Redis failures and alerts via an UptimeRobot webhook. For the abort signal specifically, if Redis is down, abort signals cannot be sent; document this as an accepted risk, since Redis outages are usually brief.
Application server crash: systemd restarts it (Restart=always in unit file). Typical recovery: < 30 seconds. During the restart window, clients get connection refused errors and retry.
Full VPS failure: This is the most serious failure mode. Recovery:
1. Restore from the daily pg_dump backup to a new VPS (< 1 hour recovery time)
2. Restore Redis state (not needed; fully recomputable)
3. Update DNS to point to the new IP
4. Total downtime: 2–4 hours in the worst case
5. Data loss: up to 24 hours of observations, since backups are daily
To improve RTO and RPO: configure WAL archiving to B2 (continuous backup). Recovery time drops to < 30 minutes, data loss to < 5 minutes.
Alert ingestion service crashes: Alerts are missed during the downtime window. For GCN specifically, this means missed GRB targets. Mitigation: GCN supports backfilling missed alerts. The ingestion service, on restart, should pull the last 2 hours of alerts and process any it hasn't seen before.
6.3 Data Quality Failures
The most insidious failure mode: data that looks correct but isn't. Primary examples:
Clock drift / wrong timestamps: An incorrectly synchronized clock on a site produces observations timestamped with seconds or minutes of error. For TTV science this is catastrophic. Detection: compare observation submission time (server clock) with the obs_mid in the submission. If they differ by more than exposure_sec + 120s, flag with TIMESTAMP_SUSPECT. For critical timing science (occultations), require GPS timestamps and validate against submission time.
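A sketch of that check as a pipeline quality flag; the rule is exactly the one stated above and takes no account of deliberately delayed (buffered) submissions, which would need separate handling:
from datetime import datetime, timezone, timedelta

def timestamp_flags(obs_mid, exposure_sec, received_at=None):
    # Compare the client-reported mid-exposure time with the server receive time,
    # allowing for the exposure itself plus 120 s of slack for upload latency.
    received_at = received_at or datetime.now(timezone.utc)
    allowed = timedelta(seconds=exposure_sec + 120)
    if abs(received_at - obs_mid) > allowed:
        return ["TIMESTAMP_SUSPECT"]
    return []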
Wrong target: Site plate-solves incorrectly and observes the wrong star. Detection: cross-match reported RA/Dec with expected target position. If > 60 arcsec offset, flag with WRONG_FIELD. If FITS is submitted, run plate solve on the server side to verify.
Systematic offsets from uncalibrated instruments: An instrument with a consistent 0.3-magnitude offset from standard isn't producing garbage, but it's producing biased photometry that will corrupt heterogeneous light curves. Detection: compare the site's photometry of comparison stars against catalog values. Track the zero-point offset per site over time. Alert if zero-point shifts by > 0.1 mag between sessions.
7. Dashboard Design
The dashboard is secondary to the API but important for volunteer retention (see failure_analysis_guide.md). Design principles:
- Refreshes every 60 seconds (meta-refresh tag, no WebSockets needed)
- Mobile-readable: Observers often check from phone
- Shows the network doing useful work: Not just uptime stats
Pages:
1. Home / Status: Active site count, observations last 24h, current campaigns, alert feed
2. Map: Leaflet.js map of active sites (green = heartbeat in last 2 min, grey = inactive). GRB and occultation events shown as overlays.
3. Targets: List of active targets, priority, last obs time, coverage %
4. My Observatory: Per-site view after entering an API key. Shows contribution stats, recent observations, quality scores, light curves of recent targets.
5. Light Curves: Assembled light curves for active campaign targets. Interactive (zoom/pan via Plotly.js, the one JS dependency that is justified).
6. Campaigns: Status of active campaigns, progress bars, sites participating.
8. Security Considerations
API key rotation: Sites can rotate their API key via the dashboard. Old key remains valid for 24 hours during rotation window to allow client-side update without disruption.
Key compromise: If a key is suspected compromised (observed from unexpected IP ranges, anomalous submission rate), admin can immediately revoke it via dashboard. The site owner is notified by email.
Input validation: All submitted photometry goes through Pydantic validation. Magnitude values outside [-5, 25] are rejected. RA/Dec outside valid ranges are rejected. Exposure time > 7200s is rejected. These are hard scientific limits, not arbitrary restrictions.
FITS file uploads: Limit file size to 100MB per file. Validate that uploaded files are valid FITS before storing in B2. Do not execute any code from uploaded files (FITS is not executable, but be paranoid about filenames and paths).
Rate limit all endpoints: Documented in Section 4.3. DoS from a single malicious API key is mitigated by per-key rate limits. DoS from many keys is mitigated by Cloudflare (put the domain behind Cloudflare CDN for free DDoS protection).
No SQL injection surface: Use parameterized queries everywhere (SQLAlchemy handles this). Never format SQL strings with user input.
Last updated: 2026-03-20
Status: Planning document; not yet implemented
Next: See Alert Ingestion Design.md and Stage 1 Data Pipeline Design.md