Server Architecture Deep Dive
This document is the authoritative technical planning reference for the OpenAstro server. It builds on what's established in General overview.md, Details on componets.md, science_cases_and_scheduler.md, and the Gap Analysis. Read those first. This document goes deeper: it is intended to be comprehensive enough that a developer can build the system directly from it.
1. Component Breakdown
The server is composed of seven logical subsystems. They share a database and communicate through it (no internal message bus at MVP scale). Each subsystem can be a separate Python module or FastAPI router.
                              OPENASTRO SERVER

 +-----------+   +--------------+   +---------------+   +-----------------+
 | Scheduler |   | Alert Ingest |   | Data Pipeline |   | Auth (API keys) |
 +-----+-----+   +------+-------+   +-------+-------+   +--------+--------+
       |                |                   |                    |
 +-----v----------------v-------------------v--------------------v---------+
 |                           PostgreSQL Database                           |
 |  sites | targets | observations | campaigns | alerts |                  |
 |  instruments | calibrations | light_curves | heartbeats                 |
 +------------------------------------+-------------------------------------+
                                      |
 +------------------------------------v-------------------------------------+
 |                            FastAPI REST Layer                            |
 |   /api/v1/targets     /api/v1/observations     /api/v1/alerts            |
 |   /api/v1/sites       /api/v1/campaigns        /api/v1/status            |
 +------------------------------------+-------------------------------------+
                                      |
 +------------+    +------------------v----------+    +------------------+
 | Caddy      |    | Redis Cache                 |    | Backblaze B2     |
 | (HTTPS /   |    | (target list cache,         |    | (FITS storage)   |
 |  reverse   |    |  heartbeat state)           |    |                  |
 |  proxy)    |    +-----------------------------+    +------------------+
 +------------+
                                      | HTTPS
                +---------------------+---------------------+
                v                     v                     v
          +-----------+         +-----------+         +-----------+
          |  Site A   |         |  Site B   |         |  Site C   |
          |  (poll)   |         |  (poll)   |         |  (alert)  |
          +-----------+         +-----------+         +-----------+
1.1 Scheduler
The brain of the network. Runs as a background task (Celery beat or APScheduler) every 2 minutes, and also on-demand when the /targets endpoint is hit. Detailed in Section 2.
1.2 Alert Ingestion Service
Polls or subscribes to ZTF, Gaia Alerts, GCN, TNS, and MPC. Normalizes alerts into the internal format. Creates new targets or updates existing ones. Detailed in Alert Ingestion Design.md.
1.3 Data Pipeline
Processes incoming FITS files and photometry submissions. Runs plate solving, photometry extraction, quality flagging, and light curve assembly. Detailed in Stage 1 Data Pipeline Design.md.
1.4 Auth (API Keys)
Simple API key auth for all telescope clients. No OAuth, no sessions, no user accounts at MVP. Human dashboard access can use HTTP Basic Auth behind Caddy for now. See Section 8.
1.5 Dashboard
Jinja2-rendered server-side HTML. Shows: active sites map, recent observations, campaign progress, alert feed, per-site contribution stats. No JavaScript framework at MVP: plain HTML tables updated every 60s with a meta-refresh tag. Graduate to HTMX when interactivity is needed.
1.6 REST API
FastAPI application exposing versioned endpoints. See Section 8 for full design.
1.7 Background Worker
APScheduler embedded in the FastAPI process at MVP (avoid Celery overhead until needed). Runs:
- Scheduler tick every 2 minutes
- Alert ingestion polls every 5 minutes (ZTF/TNS) or 30 seconds (GCN)
- Data pipeline jobs as observation submissions arrive (async task queue via FastAPI BackgroundTasks)
- Daily backup job
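A minimal wiring sketch for these jobs, assuming APScheduler's AsyncIOScheduler inside the FastAPI lifespan hook; the job functions named here (scheduler_tick, poll_ztf_tns, poll_gcn, run_daily_backup) are placeholders for the implementations described in the rest of this document, and the backup cron hour is arbitrary:
from contextlib import asynccontextmanager
from apscheduler.schedulers.asyncio import AsyncIOScheduler
from fastapi import FastAPI

async def scheduler_tick(): ...      # recompute per-site ranked lists (Section 2)
async def poll_ztf_tns(): ...        # slow alert pollers (Section 1.2)
async def poll_gcn(): ...            # fast GCN poller
async def run_daily_backup(): ...    # pg_dump + upload to B2 (Section 6)

scheduler = AsyncIOScheduler(timezone="UTC")

@asynccontextmanager
async def lifespan(app: FastAPI):
    scheduler.add_job(scheduler_tick, "interval", minutes=2, id="scheduler_tick")
    scheduler.add_job(poll_ztf_tns, "interval", minutes=5, id="alert_poll_slow")
    scheduler.add_job(poll_gcn, "interval", seconds=30, id="alert_poll_gcn")
    scheduler.add_job(run_daily_backup, "cron", hour=12, id="daily_backup")
    scheduler.start()
    yield
    scheduler.shutdown()

app = FastAPI(lifespan=lifespan)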
2. Scheduler Design
2.1 What the Scheduler Solves
The scheduler's job is to answer one question: given a specific site at a specific moment, what should it observe? The answer must be:
- Scientifically optimal (highest priority targets first)
- Geometrically correct (target is actually above the horizon)
- Network-aware (avoid redundant simultaneous coverage for non-simultaneous science cases, maximize it for simultaneous ones like occultations)
- Responsive to alerts (urgent alerts must propagate to clients within one heartbeat cycle, i.e. 30s)
The scheduler is not a global optimizer that produces a schedule for all sites for the night. That's too complex, requires weather prediction, and is unnecessary. Instead it produces a per-request ranked target list for a single site at the moment of request. Each site pulls its own list every 60s. This is the poll-based model agreed in the architecture decisions.
2.2 Scoring Architecture
The combined score is a weighted sum of three independent sub-scores, each on a 0–100 scale:
combined_score = (priority_score × 0.50) + (observability_score × 0.30) + (capability_score × 0.20)
- Priority score (0–100): How urgently does the network need this observation?
- Observability score (0–100): How good is the geometry right now from this site?
- Capability score (0–100): How well is this site equipped for this target?
Targets returning capability_score == 0 or observability_score == 0 are hard-filtered out before scoring; they cannot be observed regardless of priority.
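A sketch of the combining step, under the assumption that the three sub-scores exist as functions implementing Sections 2.3–2.5; the function names are illustrative, not an implemented API:
WEIGHTS = {"priority": 0.50, "observability": 0.30, "capability": 0.20}

def combined_score(site, target, now):
    obs = observability_score(site, target, now)     # Section 2.4
    cap = capability_score(site, target)             # Section 2.5
    if obs == 0 or cap == 0:
        return None                                  # hard filter: never rank this pairing
    pri = priority_score(target, now)                # Section 2.3
    return pri * WEIGHTS["priority"] + obs * WEIGHTS["observability"] + cap * WEIGHTS["capability"]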
2.3 Priority Score Components
Priority is dynamic and recomputes every request. Components:
Base priority (set when target is created): The science case determines this.
- Occultation: 80–100 (time-critical, irreplaceable)
- Transit in progress: 85
- GRB afterglow < 1 hour old: 95
- GRB afterglow 1–6 hours: 75
- TTV exoplanet monitoring: 60
- Long-term variable monitoring: 40
- Archival-quality target: 20
Time criticality boost: Applied to targets with event windows.
# Tiers are exclusive per target type: only the largest applicable boost is applied.
if target.type == 'occultation':
    if time_to_event < timedelta(hours=1):    score += 50
    elif time_to_event < timedelta(hours=24): score += 30
elif target.type == 'transit_ingress' and in_window:
    score += 40
elif target.type == 'grb':
    if age < timedelta(minutes=10): score += 60
    elif age < timedelta(hours=1):  score += 40
Coverage deficit boost: How behind is this target on its required cadence?
hours_since_last_obs = (now - last_obs_time).total_seconds() / 3600  # network-wide, not per-site
cadence_deficit = hours_since_last_obs / target.cadence_hours        # >1 = behind
if cadence_deficit > 2.0:   score += 25
elif cadence_deficit > 1.0: score += 15
elif cadence_deficit < 0.5: score -= 20  # Recently observed, lower urgency
Campaign boost: Active campaigns multiply priority.
for campaign in target.campaigns:
    if campaign.is_active:
        score = min(score * campaign.priority_multiplier, 100)
Redundancy penalty for non-simultaneous targets: If another site is currently observing this same target AND it's not an occultation/kilonova tiling scenario, penalize.
active_observers = count_sites_heartbeating_target(target.id, window=timedelta(minutes=5))
if active_observers > 0 and not target.requires_simultaneous:
    score -= 20 * active_observers  # Diminishing marginal value
Reverse redundancy bonus for simultaneous targets: Occultations and kilonova error box tiling WANT simultaneous coverage.
if target.requires_simultaneous:
    if active_observers < target.min_simultaneous_sites:
        score += 30  # We need more coverage, urgent to join
    else:
        score += 10  # Redundancy is still science here
2.4 Observability Score Components
# Observability starts at 100 and is reduced by geometric penalties.
score = 100

# Altitude component (45-75 deg is the ideal band and takes no penalty)
if altitude < 15: return 0  # hard filter
if altitude < 30:   score -= 30
elif altitude < 45: score -= 15
elif altitude > 80: score -= 10  # zenith tracking issues

# Airmass penalty
airmass = 1 / cos(radians(90 - altitude))
if airmass > 2.5:   score -= 30
elif airmass > 2.0: score -= 20
elif airmass > 1.5: score -= 10

# Sun: hard filter, then twilight penalties
if sun_alt > -6: return 0   # civil twilight or brighter
if sun_alt > -12:   score -= 40  # nautical twilight
elif sun_alt > -18: score -= 20  # astronomical twilight

# Moon: soft penalty, scaled by illumination (worse when full)
moon_illumination = get_moon_illumination()
if moon_up and target_moon_separation < 30:
    score -= 30 * moon_illumination
elif moon_up and target_moon_separation < 60:
    score -= 15 * moon_illumination

# Rising vs. setting: prefer rising targets
if target_is_rising(site, target): score += 5
if target_sets_within_minutes(site, target, minutes=20): score -= 10
2.5 Capability Score Components
Already documented in science_cases_and_scheduler.md (the calculate_site_capability function). Key additions:
- GPS timing requirement: If target.requires_gps_timing and the site does not have GPS, capability = 0 for that target. This is a hard filter for occultation science (see the sketch after this list).
- Filter match: Partial match (site has some of the required filters) gives partial score. Complete mismatch gives score -= 50 but not 0, because unfiltered observations have science value for some targets.
- Pixel scale: The existing formula in science_cases_and_scheduler.md is correct. Add: if the target is an extended object (comet, galaxy), penalize undersampled sites less harshly.
- Limiting magnitude margin: The "mag_margin" penalty in the existing code is correct. Extend it: if target brightness varies (variable star, transient), use the brightest expected magnitude for the capability check.
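A sketch of how the GPS hard filter and the partial filter match could sit on top of the existing calculate_site_capability function from science_cases_and_scheduler.md; the proportional partial-credit formula is an illustrative assumption, not a settled design:
def capability_score(site, target):
    if target.requires_gps_timing and not site.has_gps:
        return 0                                        # hard filter: occultation timing needs GPS
    score = calculate_site_capability(site, target)     # existing function, documented elsewhere
    if target.required_filters:
        matched = set(site.filters or []) & set(target.required_filters)
        if not matched:
            score -= 50                                 # unfiltered data still has some value
        else:
            # Illustrative partial credit: scale a smaller penalty by the fraction of missing filters.
            missing = 1 - len(matched) / len(target.required_filters)
            score -= 25 * missing
    return max(score, 0)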
2.6 Visibility Windows
The scheduler computes not just "is it visible now?" but also "when does visibility start and end tonight?" This is used to:
1. Show observers a preview of their night's opportunities
2. Pre-generate target lists so the first /targets poll of the night returns instantly (cached)
3. Calculate handover windows for time-critical targets
def compute_visibility_window(site, target, date):
    """
    Returns (rise_time, set_time) in UTC for tonight, above min_altitude.
    Uses astroplan's target rise/set calculation; `date` is an astropy Time
    for the night in question.
    """
    from astropy.coordinates import EarthLocation, SkyCoord
    from astroplan import Observer, FixedTarget
    import astropy.units as u

    observer = Observer(
        location=EarthLocation(lat=site.latitude * u.deg,
                               lon=site.longitude * u.deg,
                               height=site.elevation_m * u.m),
        timezone='UTC'
    )
    target_coord = FixedTarget(SkyCoord(ra=target.ra_deg * u.deg, dec=target.dec_deg * u.deg))

    # Compute rise/set for tonight
    night_start = observer.twilight_evening_astronomical(date)
    night_end = observer.twilight_morning_astronomical(date + 1 * u.day)
    rise_time = observer.target_rise_time(night_start, target_coord, horizon=30 * u.deg)
    set_time = observer.target_set_time(night_start, target_coord, horizon=30 * u.deg)
    # Note: circumpolar (always-up) or never-rising targets return masked times
    # and need special-casing before the comparisons below.

    # Clip to astronomical night
    rise_time = max(rise_time, night_start)
    set_time = min(set_time, night_end)
    if rise_time >= set_time:
        return None  # Not visible tonight
    return (rise_time, set_time)
Visibility windows are precomputed nightly at sunset for all active targets × all active sites and cached in Redis. Cache TTL: 8 hours. Key format: vis:{site_id}:{target_id}:{date_utc}.
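A sketch of that nightly precompute using redis-py, assuming the compute_visibility_window function above and an already-configured Redis connection; the key format and TTL follow the text:
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def precompute_visibility(sites, targets, night):
    # `night` is an astropy Time near local sunset; it drives both the rise/set
    # computation and the date component of the cache key.
    date_utc = night.strftime("%Y-%m-%d")
    for site in sites:
        for target in targets:
            window = compute_visibility_window(site, target, night)
            if window is None:
                continue
            rise, set_ = window
            key = f"vis:{site.site_id}:{target.target_id}:{date_utc}"
            r.set(key, json.dumps({"rise": rise.isot, "set": set_.isot}), ex=8 * 3600)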
2.7 Campaign Management
A campaign is a science program with defined goals, time windows, and participating targets. Campaigns have:
- start_date / end_date: Hard boundaries. Targets outside this range get campaign priority boost removed.
- priority_multiplier: Applied to all member targets (float, 1.0 = no boost, 2.0 = double priority)
- coverage_goal: Number of observations or observation-hours needed
- requires_simultaneous: Whether targets in this campaign need multi-site simultaneous coverage
- campaign_type: enum of 'occultation', 'transit', 'grb', 'monitoring', 'survey'
The campaign system does not pre-assign targets to specific sites. Sites pull their ranked target lists and the campaign priority boost makes campaign targets float to the top naturally. This keeps the architecture simple.
Exception: GRB follow-up and occultations. For these, the server actively pushes abort signals via the heartbeat response (see Section 2.8). Sites in the middle of other observations get a 205 Reset Content on their next heartbeat if a higher-priority event triggers.
2.8 Handover Logic
When a time-critical target (GRB, bright transient) is activated, the server needs all available sites on it within the next 30-second heartbeat cycle. The handover mechanism:
1. An alert is ingested and a new target is created with priority = 95 and type = 'grb_afterglow'.
2. A Redis key abort_signal:{site_id} is set for every site that is: currently active (heartbeated within the last 2 minutes); has the GRB target visible above 30°; and is not already observing a higher-priority target (another GRB, an active occultation).
3. On the next heartbeat request from that site, the server checks for abort_signal:{site_id} and returns HTTP 205 Reset Content with the new target in the body (sketched below).
4. Client-side: the client must handle 205 by aborting the current observation and slewing to the returned target.
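A sketch of the server-side check in step 3, assuming a synchronous redis-py client r and a record_heartbeat helper that writes the heartbeats row; the response shape mirrors Section 4.6:
import json
from fastapi import APIRouter, Response

router = APIRouter()

@router.put("/api/v1/sites/{site_id}/heartbeat")
async def heartbeat(site_id: str, payload: dict, response: Response):
    await record_heartbeat(site_id, payload)        # assumed helper: insert into heartbeats table
    raw = r.get(f"abort_signal:{site_id}")          # r: redis client configured elsewhere
    if raw is None:
        return {"status": "ok", "abort": False}
    signal = json.loads(raw)
    r.delete(f"abort_signal:{site_id}")             # deliver the abort exactly once
    # Note: RFC 7231 expects 205 to carry no body; per Section 2.8 the target is returned anyway.
    response.status_code = 205
    return {"status": "abort", "abort": True,
            "reason": signal["reason"], "priority_target_id": signal["target_id"]}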
For occultations: handover is pre-planned (prediction is hours in advance). No abort needed. Sites see occultation in their ranked list with rising priority as event approaches. The scheduler handles this naturally via the time criticality boost.
For TTV transits: the ingress is predictable. The scheduler pre-boosts the transit target starting 2 hours before ingress. No abort mechanism is needed for most transits; the only case is a higher-priority opportunistic event arriving mid-transit, in which case the transit is probably best abandoned (a missed transit is a gap in the data, not a total loss).
2.9 Performance at Scale
At 1000 sites polling every 60 seconds: ~17 requests per second to the /targets endpoint. Each request triggers the scoring algorithm over all active targets.
If there are 500 active targets and 1000 sites, naive operation requires 500 × 1000 = 500,000 score computations per 60-second window. Each computation is cheap (astropy coordinate transform + a few arithmetic ops), but this adds up.
Optimizations:
1. Cache visibility windows in Redis (computed nightly, refreshed every 2 hours). The altitude/sun/moon check is the expensive part. With cached windows, the per-request overhead drops to cache lookups + priority arithmetic.
2. Two-tier target list: Pre-filter targets to those visible from the site's latitude band (±30°). A site at 40°N latitude cannot see targets with dec < -50°; pre-filter at registration, not at query time.
3. Redis hash per site: Store the last computed ranked target list per site. Serve it directly if polled within the last 30 seconds, recompute if older. targets:{site_id} → JSON list, TTL 30s.
4. Async DB queries: The FastAPI endpoint is async; all DB queries go through asyncpg (not synchronous SQLAlchemy). This lets the server handle many concurrent requests without thread-blocking.
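A sketch of optimization 3 combined with the async endpoint, assuming a redis-py client r, an authenticated_site dependency that resolves the API key to a site row, and a rank_targets_for_site coroutine that runs the Section 2 scoring over asyncpg; all three names are assumptions, not implemented modules:
import json
from fastapi import APIRouter, Depends

router = APIRouter()

@router.get("/api/v1/targets")
async def get_targets(site=Depends(authenticated_site)):
    key = f"targets:{site.site_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)                   # served within the 30 s freshness window
    ranked = await rank_targets_for_site(site)      # asyncpg queries + Section 2 scoring
    body = {"site_id": str(site.site_id), "targets": ranked, "next_poll_seconds": 60}
    r.set(key, json.dumps(body), ex=30)             # TTL 30 s, per item 3 above
    return body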
3. Database Schema
This section supersedes the schemas in Details on componets.md and science_cases_and_scheduler.md. It is the canonical schema going into production.
3.1 Core Tables
-- =========================================================
-- SITES
-- =========================================================
CREATE TABLE sites (
site_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name VARCHAR(255) NOT NULL,
owner_name VARCHAR(255),
owner_email VARCHAR(255),
-- Location (geodetic WGS84; sky coordinates elsewhere are ICRS/J2000)
latitude DOUBLE PRECISION NOT NULL, -- decimal degrees
longitude DOUBLE PRECISION NOT NULL, -- decimal degrees
elevation_m DOUBLE PRECISION,
timezone VARCHAR(64), -- IANA tz name e.g. 'America/New_York'
-- Equipment (used by scheduler capability scoring)
aperture_mm INTEGER,
focal_length_mm INTEGER,
pixel_size_um DOUBLE PRECISION, -- microns
sensor_width_mm DOUBLE PRECISION,
sensor_height_mm DOUBLE PRECISION,
camera_model VARCHAR(128),
filters TEXT[], -- e.g. ARRAY['B','V','R','I','Clear']
bortle_class SMALLINT, -- 1–9
typical_seeing DOUBLE PRECISION, -- arcsec
has_gps BOOLEAN DEFAULT false,
-- Automation level
automation_level VARCHAR(16) DEFAULT 'manual', -- 'manual','semi','robotic'
-- Derived (cached, updated by scheduler)
limiting_mag_v DOUBLE PRECISION, -- estimated V-band limiting mag (60s)
-- Status
is_active BOOLEAN DEFAULT true,
api_key_hash VARCHAR(128) NOT NULL UNIQUE, -- bcrypt hash
created_at TIMESTAMPTZ DEFAULT NOW(),
last_seen TIMESTAMPTZ,
last_obs_at TIMESTAMPTZ
);
CREATE INDEX idx_sites_active ON sites (is_active) WHERE is_active = true;
CREATE INDEX idx_sites_coords ON sites USING GIST (ll_to_earth(latitude, longitude));
CREATE INDEX idx_sites_last_seen ON sites (last_seen DESC);
-- =========================================================
-- TARGETS
-- =========================================================
CREATE TABLE targets (
target_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name VARCHAR(255) NOT NULL,
aliases TEXT[], -- alternative names / catalog IDs
-- Coordinates (ICRS J2000, decimal degrees)
ra_deg DOUBLE PRECISION NOT NULL,
dec_deg DOUBLE PRECISION NOT NULL,
-- Classification
target_type VARCHAR(64) NOT NULL,
-- values: 'variable_star','exoplanet_host','asteroid','transient',
-- 'occultation_star','grb_afterglow','kilonova_candidate',
-- 'microlensing_event','comet','blazar','eclipsing_binary'
-- Photometric properties
magnitude DOUBLE PRECISION, -- expected V or clear-band magnitude
magnitude_band VARCHAR(8) DEFAULT 'V',
-- Priority and scheduling
base_priority SMALLINT DEFAULT 50, -- 1–100
cadence_hours DOUBLE PRECISION, -- desired observation cadence
min_observations INTEGER DEFAULT 1,
-- Filter requirements
required_filters TEXT[], -- can be empty (any filter OK)
required_fov_arcmin DOUBLE PRECISION, -- minimum FOV needed
-- Special requirements
requires_gps_timing BOOLEAN DEFAULT false, -- occultations
requires_simultaneous BOOLEAN DEFAULT false,
min_simultaneous_sites SMALLINT DEFAULT 1,
requires_high_cadence BOOLEAN DEFAULT false,
max_exposure_sec DOUBLE PRECISION, -- for high-cadence targets
-- Time constraints
event_time TIMESTAMPTZ, -- occultation event time
ingress_time TIMESTAMPTZ, -- transit ingress
egress_time TIMESTAMPTZ, -- transit egress
trigger_time TIMESTAMPTZ, -- GRB/transient trigger
expires_at TIMESTAMPTZ, -- when target becomes inactive
-- Source and provenance
source VARCHAR(64), -- 'manual','ztf','gaia','gcn','tns','mpc'
external_id VARCHAR(128), -- original ID in source system
alert_id UUID, -- FK to alerts(alert_id); constraint added after alerts exists (circular reference)
-- Status
active BOOLEAN DEFAULT true,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
notes TEXT
);
CREATE INDEX idx_targets_active ON targets (active) WHERE active = true;
CREATE INDEX idx_targets_coords ON targets USING GIST (ll_to_earth(dec_deg, ra_deg));
CREATE INDEX idx_targets_type ON targets (target_type);
CREATE INDEX idx_targets_priority ON targets (base_priority DESC) WHERE active = true;
CREATE INDEX idx_targets_expires ON targets (expires_at) WHERE expires_at IS NOT NULL;
CREATE INDEX idx_targets_event_time ON targets (event_time) WHERE event_time IS NOT NULL;
-- =========================================================
-- CAMPAIGNS
-- =========================================================
CREATE TABLE campaigns (
campaign_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name VARCHAR(255) NOT NULL,
description TEXT,
campaign_type VARCHAR(32), -- 'occultation','transit','monitoring','survey','grb'
start_date DATE,
end_date DATE,
priority_multiplier DOUBLE PRECISION DEFAULT 1.0,
requires_simultaneous BOOLEAN DEFAULT false,
coverage_goal INTEGER, -- total observations needed across network
is_active BOOLEAN DEFAULT true,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE campaign_targets (
campaign_id UUID REFERENCES campaigns(campaign_id) ON DELETE CASCADE,
target_id UUID REFERENCES targets(target_id) ON DELETE CASCADE,
added_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (campaign_id, target_id)
);
CREATE INDEX idx_ct_target ON campaign_targets (target_id);
-- =========================================================
-- OBSERVATIONS
-- =========================================================
CREATE TABLE observations (
obs_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
target_id UUID NOT NULL REFERENCES targets(target_id),
site_id UUID NOT NULL REFERENCES sites(site_id),
-- Timing (all UTC)
obs_start TIMESTAMPTZ NOT NULL,
obs_mid TIMESTAMPTZ NOT NULL, -- BJD correction should be applied upstream
obs_end TIMESTAMPTZ NOT NULL,
-- Exposure
exposure_sec DOUBLE PRECISION NOT NULL,
num_frames INTEGER DEFAULT 1,
filter_used VARCHAR(16),
-- Observation geometry
airmass DOUBLE PRECISION,
altitude_deg DOUBLE PRECISION,
azimuth_deg DOUBLE PRECISION,
moon_sep_deg DOUBLE PRECISION,
moon_illumination DOUBLE PRECISION,
-- Photometric result
magnitude DOUBLE PRECISION,
mag_error DOUBLE PRECISION,
flux_adu DOUBLE PRECISION, -- raw ADU if useful
snr DOUBLE PRECISION,
-- Astrometric result
ra_measured_deg DOUBLE PRECISION, -- from plate solution
dec_measured_deg DOUBLE PRECISION,
-- Comparison stars
comparison_stars JSONB, -- [{catalog_id, ra, dec, mag_catalog, mag_inst}, ...]
zero_point DOUBLE PRECISION,
-- Data quality
quality_flags TEXT[], -- ['CLOUDY','FOCUS_DRIFT','SATELLITE_TRAIL',...]
quality_score SMALLINT DEFAULT 0, -- 0=good, 1=warn, 2=bad
-- Raw data
fits_path VARCHAR(512), -- path in B2 object storage
fits_header JSONB, -- full FITS header as JSON
-- Metadata
pipeline_version VARCHAR(32), -- which version of pipeline processed this
submitted_at TIMESTAMPTZ DEFAULT NOW(),
processed_at TIMESTAMPTZ -- when data pipeline finished
);
-- Index strategy: the observations table will be the largest.
-- Primary query patterns:
-- 1. Light curve for a target: target_id + obs_mid time range
-- 2. Recent obs by site: site_id + submitted_at
-- 3. Quality filtering: quality_score
-- 4. Campaign analysis: target_id + campaign join
CREATE INDEX idx_obs_target_time ON observations (target_id, obs_mid DESC);
CREATE INDEX idx_obs_site_time ON observations (site_id, submitted_at DESC);
CREATE INDEX idx_obs_quality ON observations (quality_score) WHERE quality_score > 0;
CREATE INDEX idx_obs_submitted ON observations (submitted_at DESC);
-- Partial index for unprocessed observations (pipeline work queue)
CREATE INDEX idx_obs_unprocessed ON observations (submitted_at)
WHERE processed_at IS NULL;
-- For time-partitioned queries (after migration to partitioned table):
-- Partition observations by month on obs_mid. At 1000 sites × 10 obs/night =
-- ~10,000 obs/night = ~300k/month. One partition per month.
-- PostgreSQL declarative partitioning:
-- CREATE TABLE observations_2026_01 PARTITION OF observations
-- FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
-- =========================================================
-- HEARTBEATS
-- =========================================================
-- Tracks the 30s heartbeat signals from active clients.
-- Separate from 'last_seen' in sites to keep sites table clean.
-- Retain only last 2 hours; older rows purged by cron.
CREATE TABLE heartbeats (
heartbeat_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
site_id UUID NOT NULL REFERENCES sites(site_id),
received_at TIMESTAMPTZ DEFAULT NOW(),
current_target_id UUID REFERENCES targets(target_id), -- what they're observing
status VARCHAR(16), -- 'idle','observing','slewing','error'
weather_ok BOOLEAN,
sky_quality DOUBLE PRECISION -- mag/arcsec² if available
);
CREATE INDEX idx_hb_site_time ON heartbeats (site_id, received_at DESC);
CREATE INDEX idx_hb_recent ON heartbeats (received_at DESC);
-- =========================================================
-- ALERTS (External Event Ingestion)
-- =========================================================
CREATE TABLE alerts (
alert_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
source VARCHAR(64) NOT NULL, -- 'ztf','gaia','gcn','tns','mpc','ligo'
external_id VARCHAR(256), -- original alert ID in source system
alert_type VARCHAR(64), -- 'supernova','grb','transient','occultation','neo'
-- Coordinates (may be imprecise for GRBs)
ra_deg DOUBLE PRECISION,
dec_deg DOUBLE PRECISION,
position_error_deg DOUBLE PRECISION, -- 1σ localization error
priority DOUBLE PRECISION DEFAULT 50,
-- Raw payload from source (for re-parsing if schema changes)
raw_payload JSONB NOT NULL,
-- Processing state
processed BOOLEAN DEFAULT false,
target_id UUID REFERENCES targets(target_id), -- created target, if any
received_at TIMESTAMPTZ DEFAULT NOW(),
processed_at TIMESTAMPTZ
);
CREATE INDEX idx_alerts_source ON alerts (source, received_at DESC);
CREATE INDEX idx_alerts_unprocessed ON alerts (received_at) WHERE processed = false;
-- Now that alerts exists, add the deferred FK from targets (see targets.alert_id above)
ALTER TABLE targets
    ADD CONSTRAINT fk_targets_alert FOREIGN KEY (alert_id) REFERENCES alerts(alert_id);
-- =========================================================
-- INSTRUMENTS (Stage 1 registry)
-- =========================================================
-- Populated by the calibration pipeline when processing archival data.
-- Each unique (observer_code, telescope, camera) combination gets a record.
CREATE TABLE instruments (
instrument_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
-- Provenance
observer_code VARCHAR(32), -- AAVSO observer code, MPC code, etc.
source VARCHAR(32), -- 'aavso','mpc','astrobin','manual'
-- Hardware
aperture_mm INTEGER,
focal_length_mm INTEGER,
camera_model VARCHAR(128),
pixel_size_um DOUBLE PRECISION,
-- Calibration (derived from running against Gaia/APASS)
zero_point_v DOUBLE PRECISION, -- V-band zero point (mag)
zero_point_r DOUBLE PRECISION, -- R-band
color_term_bv DOUBLE PRECISION, -- B-V color term coefficient
noise_floor_mmag DOUBLE PRECISION, -- floor below which precision can't improve
calibration_epoch TIMESTAMPTZ, -- when calibration was derived
num_calibration_obs INTEGER, -- how many obs used to derive calibration
-- Location (if known; not required for archival instruments)
latitude DOUBLE PRECISION,
longitude DOUBLE PRECISION,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- =========================================================
-- LIGHT CURVES (Assembled by pipeline)
-- =========================================================
-- Denormalized summary of assembled light curve per target per filter.
-- Rebuilt by pipeline on each new observation; cached result.
CREATE TABLE light_curve_points (
lcp_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
target_id UUID NOT NULL REFERENCES targets(target_id),
obs_id UUID NOT NULL REFERENCES observations(obs_id),
-- Time (BJD_TDB preferred for variable star/exoplanet science)
bjd_tdb DOUBLE PRECISION, -- Barycentric Julian Date
filter_used VARCHAR(16),
magnitude DOUBLE PRECISION NOT NULL,
mag_error DOUBLE PRECISION NOT NULL,
-- Ensemble normalization state
ensemble_corrected BOOLEAN DEFAULT false,
zero_point_applied DOUBLE PRECISION,
UNIQUE (obs_id) -- one point per observation
);
CREATE INDEX idx_lcp_target_filter ON light_curve_points (target_id, filter_used, bjd_tdb);
3.2 Indexing Strategy
The query patterns that dominate at scale:
| Query | Frequency | Index Used |
|---|---|---|
| Get targets for site (scheduler) | 17 req/s at 1000 sites | idx_targets_active, Redis cache |
| Submit observation | 100–1000/day initially | PK insert |
| Get light curve for target | On-demand | idx_lcp_target_filter |
| Recent observations by site | Dashboard | idx_obs_site_time |
| Unprocessed observations (pipeline) | Every 5 min | idx_obs_unprocessed |
| Active heartbeating sites | Every 30s | idx_hb_recent |
| Spatial: sites within FOV of alert | On alert | idx_sites_coords (GIST) |
The observations table is the growth driver. At 1000 sites × 10 obs/night average = 3.65M observations/year. Without partitioning, queries on target_id + time range will degrade as the table grows past ~10M rows. Partition by month when row count approaches 5M (roughly year 2 at that activity level).
The ll_to_earth function from the earthdistance extension is used for spatial queries. Ensure the earthdistance and cube extensions are installed:
CREATE EXTENSION IF NOT EXISTS earthdistance CASCADE;
CREATE EXTENSION IF NOT EXISTS cube CASCADE;
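For reference, the "sites near a position" lookup that the idx_sites_coords GIST index supports could look like the following asyncpg sketch; the helper name is illustrative and the degree-to-metre conversion is approximate:
import asyncpg

async def sites_near(pool: asyncpg.Pool, lat_deg: float, lon_deg: float, radius_deg: float):
    # Coarse cut with earth_box (uses the GIST index), exact cut with earth_distance.
    radius_m = radius_deg * 111_200   # ~111.2 km per degree of great circle
    return await pool.fetch(
        """
        SELECT site_id, name, latitude, longitude
        FROM sites
        WHERE is_active
          AND earth_box(ll_to_earth($1, $2), $3) @> ll_to_earth(latitude, longitude)
          AND earth_distance(ll_to_earth($1, $2), ll_to_earth(latitude, longitude)) < $3
        """,
        lat_deg, lon_deg, radius_m,
    )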
3.3 JSON Fields Policy
Use JSONB for:
- fits_header: full FITS header dump; schema varies per instrument, no need to normalize
- comparison_stars: list of comparison star photometry; variable length
- raw_payload: external alert payloads; schema controlled by source, not us
Do not use JSONB for fields that are queried (indexed) or that have consistent structure. Those should be proper columns. The temptation to put everything in JSONB should be resisted.
4. API Design
4.1 Versioning
All endpoints under /api/v1/. When breaking changes are needed, add /api/v2/ and maintain v1 for at least 6 months. Never break existing clients silently. The minimum version support commitment to volunteers: current version + one previous version.
4.2 Authentication
All site/client endpoints require Authorization: Bearer {api_key} header. The API key is a 32-byte random hex string (secrets.token_hex(32)) stored as a bcrypt hash in the database. Never store keys in plaintext.
Human/dashboard endpoints use separate scoped tokens or HTTP Basic Auth behind Caddy. No mixing of site keys and human auth.
Alert ingest endpoints (from our own ingestion services) use service-to-service shared secrets, not site keys.
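A sketch of key issuance and verification with the bcrypt package; verification is a single hash check because site-scoped endpoints already know the site_id from the path and can load that site's stored hash:
import secrets
import bcrypt

def issue_api_key():
    key = secrets.token_hex(32)                                    # 32 random bytes -> 64 hex chars
    key_hash = bcrypt.hashpw(key.encode(), bcrypt.gensalt()).decode()
    return key, key_hash                                           # store key_hash; show key exactly once

def verify_api_key(presented_key, stored_hash):
    return bcrypt.checkpw(presented_key.encode(), stored_hash.encode())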
4.3 Rate Limiting
Per-site rate limits enforced by Redis counters (INCR + EXPIRE pattern):
| Endpoint | Limit |
|---|---|
| GET /api/v1/targets | 2 req/min (client polls every 60s; 2 gives some tolerance) |
| POST /api/v1/observations | 100 req/hour |
| PUT /api/v1/sites/{id}/heartbeat | 4 req/min (30s heartbeat; 4 = tolerance) |
| POST /api/v1/files (FITS upload) | 20 req/hour |
| GET /api/v1/lightcurve/{id} | 30 req/min (dashboard access) |
| Public /status | 60 req/min, no auth required |
Exceeding limit: HTTP 429 with Retry-After header. Client must honor this.
Implementation: FastAPI middleware using slowapi (wraps Redis counters). Key: ratelimit:{api_key}:{endpoint_slug}:{minute_bucket}.
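Independent of slowapi, the underlying mechanism is just INCR + EXPIRE on a minute bucket; a sketch (in practice the key would use a key fingerprint or site_id rather than the raw secret):
import time

def allow_request(r, api_key, endpoint_slug, limit_per_min):
    bucket = int(time.time() // 60)
    key = f"ratelimit:{api_key}:{endpoint_slug}:{bucket}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, 120)            # keep the bucket a little beyond its minute
    return count <= limit_per_min     # False -> respond 429 with Retry-After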
4.4 Pagination
All list endpoints paginate with cursor-based pagination (not offset). Offset pagination degrades as pages grow. Cursor pagination is consistent and fast with indexed queries.
Response shape:
{
"data": [...],
"cursor": "eyJ0aW1lIjogIjIwMjYtMDMtMjBUMTA6MzA6MDBaIiwgImlkIjogImFiYzEyMyJ9",
"has_more": true
}
The cursor is a base64-encoded JSON {"time": "<iso>", "id": "<uuid>"} pointing to the last item in the current page. Next page: GET /api/v1/observations?cursor=<token>&limit=100.
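A sketch of encoding and decoding that cursor, plus the keyset condition it translates to; the field names assume the observations listing:
import base64
import json

def encode_cursor(last_item):
    payload = {"time": last_item["submitted_at"], "id": last_item["obs_id"]}
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()

def decode_cursor(token):
    return json.loads(base64.urlsafe_b64decode(token.encode()))

# The decoded cursor becomes a keyset condition for the next page (newest first):
#   WHERE (submitted_at, obs_id) < ($1, $2)
#   ORDER BY submitted_at DESC, obs_id DESC
#   LIMIT $3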
Exception: /api/v1/targets for site polling returns a fixed-size ranked list (default 20), not paginated. Pagination doesn't make sense here; you want the top N targets, not all of them.
4.5 Full Endpoint List
# Site operations
POST /api/v1/sites/register # New site registration
GET /api/v1/sites/{site_id} # Get site details
PATCH /api/v1/sites/{site_id} # Update site metadata
PUT /api/v1/sites/{site_id}/heartbeat # 30s heartbeat (includes current target)
GET /api/v1/sites/{site_id}/stats # Contribution stats for dashboard
# Scheduling
GET /api/v1/targets # Ranked target list for this site
?count=20&type=occultation # optional filters
GET /api/v1/targets/{target_id} # Full target details
GET /api/v1/targets/{target_id}/plan # Observation plan (exposure, timing, comp stars)
# Observations
POST /api/v1/observations # Submit observation (JSON photometry)
POST /api/v1/observations/{obs_id}/fits # Upload FITS file for observation
GET /api/v1/observations/{obs_id} # Get observation details
GET /api/v1/observations # Query observations (auth required)
?target_id=&site_id=&after=&before=&quality=0
# Light curves
GET /api/v1/lightcurve/{target_id} # Assembled light curve
?filter=V&after=&before=
# Campaigns
GET /api/v1/campaigns # List active campaigns
GET /api/v1/campaigns/{campaign_id} # Campaign details + targets
# Alerts
POST /api/v1/alerts # Internal: submit external alert
GET /api/v1/alerts # Recent alerts (dashboard)
# Public
GET /api/v1/status # Network health (no auth)
GET /api/v1/network/map # Active site locations (no auth)
4.6 Key Response Shapes
GET /api/v1/targets (the most frequently hit endpoint):
{
  "timestamp": "2026-03-20T03:42:00Z",
  "site_id": "...",
  "targets": [
    {
      "target_id": "...",
      "name": "GRB 260320A",
      "ra_deg": 145.23,
      "dec_deg": 28.44,
      "type": "grb_afterglow",
      "magnitude": 14.2,
      "score": 94.1,
      "score_breakdown": {
        "priority": 95.0,
        "observability": 87.0,
        "capability": 90.0
      },
      "observation_settings": {
        "recommended_exposure_sec": 60,
        "recommended_filter": "R",
        "cadence_sec": 120,
        "total_duration_min": 30
      },
      "timing": {
        "optimal_window_start": "2026-03-20T03:40:00Z",
        "optimal_window_end": "2026-03-20T07:15:00Z",
        "event_time": null
      }
    }
  ],
  "next_poll_seconds": 60,
  "abort_signal": false
}
PUT /api/v1/sites/{id}/heartbeat response (normally 200):
{"status": "ok", "abort": false}
When abort is triggered:
{"status": "abort", "abort": true, "reason": "grb_alert", "priority_target_id": "..."}
Client implementation note: the client MUST check the abort field on every heartbeat response. If true, it should gracefully stop the current exposure (if interruptible), slew to the priority target, and add the interrupted observation to a local retry queue (submit with incomplete=true flag).
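A sketch of that client-side contract; stop_exposure, queue_for_retry and slew_to stand in for real client functions:
import requests

def send_heartbeat(base_url, api_key, site_id, payload):
    resp = requests.put(
        f"{base_url}/api/v1/sites/{site_id}/heartbeat",
        json=payload,
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=10,
    )
    body = resp.json()
    if body.get("abort"):
        partial = stop_exposure()                      # stop gracefully if interruptible
        if partial is not None:
            queue_for_retry(partial, incomplete=True)  # resubmit later with incomplete=true
        slew_to(body["priority_target_id"])
    return body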
5. Scaling Path
5.1 Phase 1: Single VPS (0–50 sites)
Infrastructure: Hetzner CX21 (2 vCPU, 4GB RAM, 40GB SSD, €5.49/month).
What runs on it:
- FastAPI app (Gunicorn, 2 workers, uvicorn worker class)
- PostgreSQL 15 (local)
- Redis (local, persistence off; all cache data can be rebuilt)
- Caddy (reverse proxy + automatic HTTPS)
- APScheduler embedded in the FastAPI process
What does NOT need to run separately:
- Celery (APScheduler is enough)
- Separate database server
- Load balancer
Storage: B2 for FITS files (cold, cheap). Local SSD for database only.
Expected load: < 1 req/s at 50 sites polling every 60s. A single Gunicorn worker handles this without breaking a sweat.
Failure risk at this phase: Single point of failure everywhere. That's OK. Documented recovery procedure matters more than redundancy at this scale. If the VPS goes down for an hour, volunteer sites just get "server unavailable" errors and retry. No science is lost β they still have their own local FITS files.
5.2 Phase 2: Optimized Single Node (50–200 sites)
When: top shows > 50% CPU during peak polling periods, or PostgreSQL query time exceeds 500ms.
Changes:
- Upgrade to Hetzner CX31 or CX41 (4–8 vCPU, 8–16GB RAM)
- Add Redis caching for target lists (already designed for this in Section 2.9)
- Add pgBouncer for connection pooling (PostgreSQL max_connections is not free)
- Move from embedded APScheduler to a separate Celery worker process (same machine) for isolation
- Enable PostgreSQL WAL archiving for crash recovery without full restore
5.3 Phase 3: Separated Services (200–1000 sites)
When: read-heavy load on targets endpoint saturates the database.
Changes:
- Separate PostgreSQL to its own Hetzner managed database (or dedicated VPS)
- Add a PostgreSQL read replica for dashboard queries (not site polling β site polling needs fresh data)
- Add a second application server (Hetzner CX21 × 2) behind Caddy load balancing
- Redis becomes a separate dedicated instance (Hetzner or managed Redis)
- Alert ingestion becomes a separate service with its own process
- Partition the observations table by month (add to migration script)
Total cost at this phase: ~$80–120/month. Still very cheap for the load.
5.4 Phase 4: Multi-Region (1000+ sites)
At 1000+ sites: ~17 req/s to /targets continuously. This is still a single-machine load (modern web servers handle 1000s of req/s). Multi-region is more about latency than throughput.
Latency matters for:
- Sites in Asia/Pacific polling a Europe server: ~200ms round trip. Not a problem for 60s polling.
- The 30s heartbeat abort signal: 200ms is fine.
So multi-region is not actually necessary even at 1000+ sites for our architecture. The main reason to go multi-region would be if we add a real-time streaming component for GRB alerts (sub-second latency matters there) or if we have local data storage compliance requirements.
If/when multi-region is needed:
- Primary in Europe (Hetzner Falkenstein)
- Read replicas in US-East and Singapore
- Writes always go to the primary
- Reads (target lists) served from the nearest replica
- Alert ingest and scheduler run on the primary only
6. Failure Modes and Resilience
6.1 Client-Side Failures (Sites)
Site goes offline mid-observation: No impact on server. The observation is simply not submitted. The scheduler notes the coverage gap at next cadence check. If it was a time-critical event (occultation), nothing can be done; other sites covered it or they didn't. This is inherent to distributed volunteer networks.
Client crashes mid-submission: Clients should buffer observations locally and retry. Retry with exponential backoff (see failure_analysis_guide.md). The server's POST /observations endpoint is idempotent if the client sends a client-generated obs_id (UUID). If the server has already stored this obs_id, it returns 200 (idempotent success) instead of creating a duplicate.
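A sketch of the idempotent insert with asyncpg, assuming the client-supplied obs_id is the primary key as in the Section 3 schema (only a subset of columns shown):
async def store_observation(pool, obs):
    result = await pool.execute(
        """
        INSERT INTO observations (obs_id, target_id, site_id, obs_start, obs_mid, obs_end,
                                  exposure_sec, filter_used, magnitude, mag_error)
        VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)
        ON CONFLICT (obs_id) DO NOTHING
        """,
        obs["obs_id"], obs["target_id"], obs["site_id"], obs["obs_start"], obs["obs_mid"],
        obs["obs_end"], obs["exposure_sec"], obs["filter_used"], obs["magnitude"], obs["mag_error"],
    )
    return result == "INSERT 0 1"   # False means the row already existed (idempotent retry)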
Site submits garbage data: Quality flagging pipeline rejects it. The site's quality_score history is tracked; after 10 consecutive low-quality submissions, the site is automatically throttled (lower capability score) until a human reviews.
Mass simultaneous polling (thundering herd): Unlikely at our scale (sites are globally distributed). But if 100 sites come online at once after server restart, Redis cache absorbs the spike β first site warms the cache, subsequent 99 get the cached response.
6.2 Server-Side Failures
Database down: FastAPI returns 503. Clients get HTTP error, log it, retry next cycle. No data is lost on the client side (they have local FITS). When database recovers, clients resume normally.
Redis down: Fall back to the database for all cache operations. Slower but correct. The server logs Redis failures and alerts via an UptimeRobot webhook. For the abort signal specifically, if Redis is down, abort signals cannot be sent; document this as an accepted risk, since Redis outages are usually brief.
Application server crash: systemd restarts it (Restart=always in unit file). Typical recovery: < 30 seconds. During the restart window, clients get connection refused errors and retry.
Full VPS failure: This is the most serious failure mode. Recovery:
1. Restore from the daily pg_dump backup to a new VPS (< 1 hour recovery time)
2. Restore Redis state (not needed; fully recomputable)
3. Update DNS to point to the new IP
4. Total downtime: 2–4 hours in the worst case
5. Data loss: up to 24 hours of observations, since backups are daily
To improve RTO and RPO: configure WAL archiving to B2 (continuous backup). Recovery time drops to < 30 minutes, data loss to < 5 minutes.
Alert ingestion service crashes: Alerts are missed during the downtime window. For GCN specifically, this means missed GRB targets. Mitigation: GCN supports backfilling missed alerts. The ingestion service, on restart, should pull the last 2 hours of alerts and process any it hasn't seen before.
6.3 Data Quality Failures
The most insidious failure mode: data that looks correct but isn't. Primary examples:
Clock drift / wrong timestamps: An incorrectly synchronized clock on a site produces observations timestamped with seconds or minutes of error. For TTV science this is catastrophic. Detection: compare observation submission time (server clock) with the obs_mid in the submission. If they differ by more than exposure_sec + 120s, flag with TIMESTAMP_SUSPECT. For critical timing science (occultations), require GPS timestamps and validate against submission time.
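A sketch of that check as a pipeline quality flag; the rule is exactly the one stated above and takes no account of deliberately delayed (buffered) submissions, which would need separate handling:
from datetime import datetime, timezone, timedelta

def timestamp_flags(obs_mid, exposure_sec, received_at=None):
    # Compare the client-reported mid-exposure time with the server receive time,
    # allowing for the exposure itself plus 120 s of slack for upload latency.
    received_at = received_at or datetime.now(timezone.utc)
    allowed = timedelta(seconds=exposure_sec + 120)
    if abs(received_at - obs_mid) > allowed:
        return ["TIMESTAMP_SUSPECT"]
    return []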
Wrong target: Site plate-solves incorrectly and observes the wrong star. Detection: cross-match reported RA/Dec with expected target position. If > 60 arcsec offset, flag with WRONG_FIELD. If FITS is submitted, run plate solve on the server side to verify.
Systematic offsets from uncalibrated instruments: An instrument with a consistent 0.3-magnitude offset from standard isn't producing garbage, but it's producing biased photometry that will corrupt heterogeneous light curves. Detection: compare the site's photometry of comparison stars against catalog values. Track the zero-point offset per site over time. Alert if zero-point shifts by > 0.1 mag between sessions.
7. Dashboard Design
The dashboard is secondary to the API but important for volunteer retention (see failure_analysis_guide.md). Design principles:
- Refreshes every 60 seconds (meta-refresh tag, no WebSockets needed)
- Mobile-readable: Observers often check from phone
- Shows the network doing useful work: Not just uptime stats
Pages:
1. Home / Status: Active site count, observations last 24h, current campaigns, alert feed
2. Map: Leaflet.js map of active sites (green = heartbeat in last 2 min, grey = inactive). GRB and occultation events shown as overlays.
3. Targets: List of active targets, priority, last obs time, coverage %
4. My Observatory: Per-site view after entering an API key. Shows contribution stats, recent observations, quality scores, light curves of recent targets.
5. Light Curves: Assembled light curves for active campaign targets. Interactive (zoom/pan via Plotly.js, the one JS dependency that is justified).
6. Campaigns: Status of active campaigns, progress bars, sites participating.
8. Security Considerations
API key rotation: Sites can rotate their API key via the dashboard. Old key remains valid for 24 hours during rotation window to allow client-side update without disruption.
Key compromise: If a key is suspected compromised (observed from unexpected IP ranges, anomalous submission rate), admin can immediately revoke it via dashboard. The site owner is notified by email.
Input validation: All submitted photometry goes through Pydantic validation. Magnitude values outside [-5, 25] are rejected. RA/Dec outside valid ranges are rejected. Exposure time > 7200s is rejected. These are hard scientific limits, not arbitrary restrictions.
FITS file uploads: Limit file size to 100MB per file. Validate that uploaded files are valid FITS before storing in B2. Do not execute any code from uploaded files (FITS is not executable, but be paranoid about filenames and paths).
Rate limit all endpoints: Documented in Section 4.3. DoS from a single malicious API key is mitigated by per-key rate limits. DoS from many keys is mitigated by Cloudflare (put the domain behind Cloudflare CDN for free DDoS protection).
No SQL injection surface: Use parameterized queries everywhere (SQLAlchemy handles this). Never format SQL strings with user input.
Last updated: 2026-03-20
Status: Planning document; not yet implemented
Next: See Alert Ingestion Design.md and Stage 1 Data Pipeline Design.md