Unified Healthcare Data & AI Ecosystem

A real-time, AI-ready architecture that unifies fragmented health data into a continuously learning, sovereign platform for clinical, operational, and research intelligence.

Google Cloud Reference Architecture


1

Unified Intelligent Data Pipeline

Ingestion & Harmonization
▽ ▽ ▽
2

AI-Ready Healthcare Data Lakehouse

BigQuery + Dataplex
▽ ▽ ▽
3

AI-Intelligible & Semantic Layer

Vectors • RAG • Knowledge Graphs
▽ ▽ ▽
4

Data Fabric, Security & Sovereignty

Governed Access • Zero Trust
▽ ▽ ▽
5

Agentic & Reasoning AI Layer

Vertex AI Agents • Continuous Learning

Agent Orchestration

  • Vertex AI Agent tools as orchestration layer
  • Connected to: RAG, Vector Search, Knowledge Graph, FHIR APIs
  • Governed write/act paths with safety constraints

MLOps & Continuous Learning

  • Vertex AI Pipelines for lifecycle management
  • Model Monitoring: drift, bias, safety
  • Clinician overrides & outcomes feed retraining
  • A/B experiments & feedback loops
▽ ▽ ▽
6

Clinician & Operations Experiences

Dashboards • EHR Integration • Patient Apps

↻ Continuous Learning Feedback Loop

Clinical outcomes, clinician corrections, and user interactions feed back through the pipeline — retraining models, updating knowledge graphs, and improving AI accuracy over time. Every data point makes the system smarter.


EHR / HL7v2 — Deep Dive

Electronic Health Record systems communicating via HL7 Version 2 messaging — the dominant real-time clinical data source feeding the unified pipeline.

EHR Source Systems

Epic

Inpatient & Ambulatory EHR
Bridges, Care Everywhere

Oracle Health (Cerner)

Millennium platform
Real-time feeds via HCI

MEDITECH

Expanse / 6.x
NPR & DR interfaces

Allscripts / Veradigm

TouchWorks, Sunrise
Ambulatory & acute


athenahealth

Cloud-native ambulatory
athenaClinicals APIs

Other / Legacy

VA VistA, DoD MHS GENESIS,
regional & specialty EHRs

▼ ▼ ▼ ▼ ▼
Generate HL7v2 messages on clinical events
HL7v2 Message Types & Trigger Events

ADT — Admit / Discharge / Transfer

ADT^A01, A02, A03, A04, A08 ...
  • Patient admissions & registrations
  • Transfers between units
  • Discharges & leave of absence
  • Demographics updates (A31)
  • Merge patient records (A40)

ORM / OML — Orders

ORM^O01, OML^O21, OML^O33
  • Laboratory test orders
  • Radiology & imaging orders
  • Medication orders
  • Procedure orders
  • Order status changes & cancellations

ORU — Results / Observations

ORU^R01, ORU^R30
  • Lab results (chemistry, hematology, micro)
  • Radiology reports
  • Pathology reports
  • Vital signs & flowsheet data
  • Transcribed documents

SIU — Scheduling

SIU^S12, S14, S15, S26
  • Appointment creation & updates
  • Cancellations & no-shows
  • Resource allocation
  • Clinic schedule management

MDM — Document Management

MDM^T01, T02, T11
  • Clinical notes (H&P, progress, consult)
  • Discharge summaries
  • Operative notes
  • Document status tracking

RDE / RAS — Pharmacy

RDE^O11, RAS^O17, RDS^O13
  • Encoded pharmacy orders (prescription details)
  • Medication administration events
  • Dispense notifications
  • Formulary interactions

DFT — Financial / Charges

DFT^P03, DFT^P11
  • Charge postings (CPT, HCPCS)
  • Procedure-level billing events
  • Insurance verification triggers
  • Claim adjudication signals

VXU — Immunizations

VXU^V04
  • Vaccination administration records
  • Immunization history queries
  • Registry submissions
  • Adverse event reporting
▼ ▼ ▼
HL7v2 Message Anatomy — Segment Structure

Example: ADT^A01 (Patient Admission)

MSH SendingApp | SendingFac | ReceivingApp | ReceivingFac | DateTime | ADT^A01 | MsgCtrlID | 2.5.1
EVN A01 | EventDateTime | PlannedDateTime | ReasonCode
PID PatientID (MRN) | Name (Last^First) | DOB | Sex | Race | Address | Phone | SSN
PV1 PatientClass (I/O/E) | Location (Unit^Room^Bed) | AttendingDr | AdmitType | VisitNumber | AdmitDateTime
DG1 DiagnosisCode (ICD-10) | Description | Type (A/W/F) | DiagnosingClinician
OBX ValueType (NM/ST/CE) | ObsIdentifier (LOINC) | Value | Units | RefRange | AbnormalFlag
IN1 InsurancePlanID | CompanyName | GroupNumber | PolicyNumber | EffectiveDate
MSH — Message Header
EVN — Event Type
PID — Patient Identity
PV1 — Patient Visit
DG1 — Diagnosis
OBX — Observation
IN1 — Insurance
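The segment layout above can be walked with a few lines of field splitting. The sketch below is illustrative only (it ignores the encoding characters declared in MSH-2, repeating fields, and continuation segments — a production pipeline would use a proper HL7v2 library or the Cloud Healthcare API parser); the sample message values are made up.

```python
# Minimal HL7v2 segment/field splitter -- sketch, not a full parser.
# Segments are separated by carriage returns, fields by "|", components by "^".

def parse_hl7(message: str) -> dict:
    """Return {segment_id: [fields...]} for the first occurrence of each segment."""
    segments = {}
    for line in message.strip().split("\r"):
        if not line:
            continue
        fields = line.split("|")
        segments.setdefault(fields[0], fields)
    return segments

# Illustrative ADT^A01 with the segments described above (fake data).
ADT_A01 = "\r".join([
    "MSH|^~\\&|EPIC|HOSP|HDE|GCP|20240101120000||ADT^A01|MSG0001|P|2.5.1",
    "EVN|A01|20240101120000",
    "PID|1||123456^^^HOSP^MR||DOE^JANE||19800101|F",
    "PV1|1|I|ICU^201^A|||||||MED||||||||V001",
])

msg = parse_hl7(ADT_A01)
mrn = msg["PID"][3].split("^")[0]   # PID-3, first component -> MRN
event = msg["MSH"][8]               # MSH-9 (MSH-1 is the separator itself)
```

Note the MSH off-by-one: because MSH-1 is the field separator, `fields[8]` is MSH-9 (the message type).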
▼ ▼ ▼
HL7v2 to FHIR R4 Resource Mapping
HL7v2 Segment Trigger Event FHIR R4 Resource Key Fields Mapped
PID ADT^A01/A04/A08 Patient MRN, name, DOB, gender, address, telecom, identifiers
PV1 + PV2 ADT^A01/A02/A03 Encounter Class, location, period, participant (attending), status
ORC + OBR ORM^O01 / OML^O21 ServiceRequest Code, requester, status, priority, specimen requirements
OBX ORU^R01 Observation LOINC code, value, units, reference range, interpretation
OBR (Radiology) ORU^R01 DiagnosticReport Study code, results, conclusion, imaging references
RXE / RXA RDE^O11 / RAS^O17 MedicationRequest / MedicationAdministration Drug (RxNorm), dose, route, frequency, prescriber
DG1 ADT^A01/A03 Condition ICD-10 code, category (encounter/problem-list), onset
AL1 ADT^A01/A08 AllergyIntolerance Substance, reaction, severity, clinical status
SCH + AIS SIU^S12 Appointment DateTime, participant, location, status, serviceType
TXA + OBX MDM^T02 DocumentReference Type, author, date, content (base64 / URL), status
IN1 + IN2 ADT^A01 Coverage Payor, subscriber, group, period, type
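The first row of the table (PID → Patient) can be sketched as a plain field-to-field transform. This is a simplified illustration of what the Cloud Healthcare API / Healthcare Data Engine mapping performs, handling only one identifier, one name, and two gender codes:

```python
# Sketch: map a parsed PID segment to a minimal FHIR R4 Patient resource.
# Positions follow HL7v2: PID-3 identifiers, PID-5 name, PID-7 DOB, PID-8 sex.

def pid_to_patient(pid: list) -> dict:
    mrn = pid[3].split("^")[0]
    last, _, first = pid[5].partition("^")
    dob = pid[7]
    gender = {"M": "male", "F": "female"}.get(pid[8], "unknown")
    return {
        "resourceType": "Patient",
        "identifier": [{"type": {"text": "MRN"}, "value": mrn}],
        "name": [{"family": last, "given": [first] if first else []}],
        "birthDate": f"{dob[:4]}-{dob[4:6]}-{dob[6:8]}" if len(dob) == 8 else None,
        "gender": gender,
    }

pid = "PID|1||123456^^^HOSP^MR||DOE^JANE||19800101|F".split("|")
patient = pid_to_patient(pid)
```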
▼ ▼ ▼
Integration Patterns — EHR to Google Cloud

Real-Time Streaming Primary

HL7v2 messages streamed as they occur for near-zero-latency ingestion. Supports ADT, ORU, ORM events with sub-second delivery.

EHR Interface Engine
MLLP/HTTPS Adapter
Cloud Healthcare API (HL7v2 Store)
Pub/Sub Notification
Dataflow Pipeline

FHIR Native Modern EHRs

EHRs with FHIR R4 APIs (Epic USCDI, Cerner Ignite) push resources directly, bypassing HL7v2 translation.

EHR FHIR Server
SMART Backend Auth
Cloud Healthcare API (FHIR Store)
BigQuery Streaming

Interface Engine Mediated Common

Rhapsody, Mirth Connect, or InterSystems HealthShare handles routing, filtering, and protocol translation before cloud ingestion.

EHR (HL7v2)
Mirth / Rhapsody / HealthShare
Transform & Route
Cloud Healthcare API

Batch / Bulk Export Historical

Initial data migration and periodic bulk refreshes using FHIR $export or flat-file extracts for historical backfill.

EHR Bulk Export / CSV
Cloud Storage (GCS)
Dataflow Batch Job
BigQuery (Raw Zone)
▼ ▼ ▼
Typical Data Volume & Throughput (Large Health System)
2-5M
HL7v2 Messages / Day
ADT + ORM + ORU combined
50-200
Messages / Second (Peak)
Morning admit surge, shift changes
1-4 KB
Avg Message Size
ORU with embedded results can be 10-50 KB
15-30
Segments per Message (Avg)
Repeating OBX for multi-result ORU
< 2s
End-to-End Latency Target
EHR event → BigQuery availability
99.99%
Uptime Requirement
Clinical systems = mission-critical
▼ ▼ ▼
Key Challenges & Considerations

Version Variability

HL7v2 versions 2.3 through 2.8 coexist. Z-segments (custom extensions) vary per vendor and site, requiring per-source mapping.

Patient Identity

MRN fragmentation across facilities. Requires MPI (Master Patient Index) or EMPI resolution before deduplication in the lakehouse.

Terminology Gaps

Local codes vs. standard terminologies (LOINC, SNOMED, RxNorm). Healthcare Data Engine handles mapping but requires curation.

Message Ordering

Out-of-order delivery and duplicate messages. Pipeline must handle idempotency, sequencing, and late-arriving corrections (A08).

PHI & Compliance

Every message contains PHI. Must enforce encryption in transit (TLS/MLLP-S), at rest (CMEK), and de-identification for research.

Downtime & Recovery

EHR downtimes require message queuing and replay. Dead-letter queues and reconciliation jobs ensure zero data loss.

▼ ▼ ▼

Downstream: Into the Unified Pipeline

Once ingested and parsed, EHR/HL7v2 data flows through harmonization into the lakehouse, becoming AI-ready records that power clinical agents, analytics, and research.

HL7v2 Store
Pub/Sub
Dataflow (Harmonize)
BigQuery Raw Zone
Curated Zone
Enriched + Embeddings
AI Agents

Imaging / DICOM — Deep Dive

Medical imaging systems communicating via DICOM protocol — the primary source for radiology, cardiology, and pathology pixel data feeding the unified pipeline.

Imaging Source Systems

GE Healthcare PACS

Centricity / Edison
Enterprise imaging archive


Philips PACS

IntelliSpace PACS
Multi-modality support


Siemens Healthineers

syngo.via / teamplay
AI-ready platform


VNA (Vendor Neutral Archive)

Hyland / Fuji / IBM
Long-term image storage

Modalities

CT, MRI, US, XR, PET
Mammo, Path slides

Specialty Systems

Cardiology CVIS, Derm
Ophthalmology, Dental

▼ ▼ ▼ ▼ ▼
Generate DICOM objects on acquisition / post-processing
DICOM Object Model — Hierarchy

Patient → Study → Series → Instance

Patient PatientID | PatientName | DOB | Sex
Study StudyInstanceUID | StudyDate | AccessionNumber | ReferringPhysician | StudyDescription
Series SeriesInstanceUID | Modality | SeriesNumber | BodyPartExamined | SeriesDescription
Instance SOPInstanceUID | SOPClassUID | InstanceNumber | PixelData | TransferSyntax
Patient Level
Study Level
Series Level
Instance Level
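The Patient → Study → Series → Instance hierarchy above is reconstructed from the three UIDs. A sketch, assuming instance metadata has already been extracted into plain dicts (the keys mirror the DICOM attribute names; this is not a DICOM file parser):

```python
# Sketch: group flat instance metadata into Study -> Series -> Instance
# using StudyInstanceUID / SeriesInstanceUID / SOPInstanceUID. UIDs are fake.
from collections import defaultdict

def build_hierarchy(instances: list[dict]) -> dict:
    studies: dict = defaultdict(lambda: defaultdict(list))
    for inst in instances:
        studies[inst["StudyInstanceUID"]][inst["SeriesInstanceUID"]].append(
            inst["SOPInstanceUID"]
        )
    return {study: dict(series) for study, series in studies.items()}

instances = [
    {"StudyInstanceUID": "1.2.3", "SeriesInstanceUID": "1.2.3.1", "SOPInstanceUID": "1.2.3.1.1"},
    {"StudyInstanceUID": "1.2.3", "SeriesInstanceUID": "1.2.3.1", "SOPInstanceUID": "1.2.3.1.2"},
    {"StudyInstanceUID": "1.2.3", "SeriesInstanceUID": "1.2.3.2", "SOPInstanceUID": "1.2.3.2.1"},
]
tree = build_hierarchy(instances)   # one study, two series, three instances
```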
▼ ▼ ▼
Common SOP Classes

CT Image Storage

1.2.840.10008.5.1.4.1.1.2
  • Axial slices, 512x512 typical
  • Hounsfield units, 16-bit depth
  • 100-5000+ slices per study

MR Image Storage

1.2.840.10008.5.1.4.1.1.4
  • Multiple sequences (T1, T2, FLAIR)
  • Variable matrix sizes
  • Multi-planar reconstructions

US Image Storage

1.2.840.10008.5.1.4.1.1.6.1
  • Multi-frame / cine clips
  • Doppler overlays
  • Measurements embedded

Secondary Capture

1.2.840.10008.5.1.4.1.1.7
  • Scanned documents, screenshots
  • ECG waveform captures
  • Non-DICOM source images

Structured Report (SR)

1.2.840.10008.5.1.4.1.1.88.x
  • Coded measurements & findings
  • CAD results, dose reports
  • AI inference outputs

Presentation State

1.2.840.10008.5.1.4.1.1.11.x
  • Window/level settings
  • Annotations, overlays
  • Hanging protocol references
▼ ▼ ▼
Key DICOM Tags
Tag Name Level Notes
(0010,0020) PatientID Patient MRN; critical for cross-system matching
(0020,000D) StudyInstanceUID Study Globally unique study identifier
(0020,000E) SeriesInstanceUID Series Groups images by acquisition sequence
(0008,0018) SOPInstanceUID Instance Unique per image/object
(0008,0060) Modality Series CT, MR, US, XR, PT, MG, SM
(0008,0020) StudyDate Study Date of imaging examination
(0008,0090) ReferringPhysician Study Ordering clinician name
(0018,0015) BodyPartExamined Series CHEST, HEAD, ABDOMEN, etc.
(7FE0,0010) PixelData Instance Bulk pixel data; largest element
(0002,0010) TransferSyntaxUID Meta Encoding: Explicit VR, JPEG2000, etc.
▼ ▼ ▼
DICOM to FHIR R4 Resource Mapping
DICOM Source FHIR R4 Resource Key Fields Mapped
Study ImagingStudy StudyInstanceUID, modality list, numberOfSeries/Instances, started, endpoint
Structured Report (SR) DiagnosticReport / Observation Coded findings, measurements, conclusion, performer
Patient Tags Patient PatientID, name, DOB, gender mapped to FHIR Patient resource
Order (AccessionNumber) ServiceRequest Accession, requested procedure, referring physician, priority
Series ImagingStudy.series Modality, body site, laterality, number of instances, UID
Instance ImagingStudy.series.instance SOPClass, instance number, WADO-RS endpoint for retrieval
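Following the Study and Series rows of the table, a minimal ImagingStudy can be assembled from extracted metadata. A hedged sketch — field names and values are illustrative, and a real mapping would also populate `endpoint` with the WADO-RS URL of the DICOM Store:

```python
# Sketch: build a minimal FHIR R4 ImagingStudy from study/series metadata.
# All UIDs, dates, and counts below are made up.

def to_imaging_study(study: dict, series: list) -> dict:
    return {
        "resourceType": "ImagingStudy",
        "identifier": [{"system": "urn:dicom:uid",
                        "value": "urn:oid:" + study["StudyInstanceUID"]}],
        "started": study["StudyDate"],
        "numberOfSeries": len(series),
        "numberOfInstances": sum(s["numInstances"] for s in series),
        "series": [
            {"uid": s["SeriesInstanceUID"],
             "modality": {"code": s["Modality"]},
             "numberOfInstances": s["numInstances"]}
            for s in series
        ],
    }

study = {"StudyInstanceUID": "1.2.3", "StudyDate": "2024-01-01"}
series = [{"SeriesInstanceUID": "1.2.3.1", "Modality": "CT", "numInstances": 512}]
res = to_imaging_study(study, series)
```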
▼ ▼ ▼
Integration Patterns — Imaging to Google Cloud

DICOMweb to Cloud Healthcare API Primary

Native DICOMweb (STOW-RS, WADO-RS, QIDO-RS) direct to Cloud Healthcare API DICOM Store. RESTful, standards-based.

PACS / VNA
DICOMweb (STOW-RS)
Cloud Healthcare API (DICOM Store)
Pub/Sub Notification

DIMSE Gateway Legacy

For legacy PACS using C-STORE/C-FIND. DIMSE proxy translates traditional DICOM networking to DICOMweb for cloud ingestion.

Legacy PACS (DIMSE)
DICOM Gateway (C-STORE SCP)
STOW-RS Adapter
Cloud Healthcare API

Cloud Storage Bulk Import Migration

Bulk DICOM archive migration. Upload Part 10 files to Cloud Storage, then import into DICOM Store via batch job.

DICOM Archive (Part 10)
Cloud Storage (GCS)
DICOM Store Import
Cloud Healthcare API

Pub/Sub Event-Driven Automation

Pub/Sub notifications on new DICOM instances trigger downstream pipelines: metadata extraction, de-identification, AI inference.

DICOM Store
Pub/Sub (new study event)
Cloud Functions / Dataflow
Vertex AI Inference
▼ ▼ ▼
Typical Data Volume & Throughput (Large Health System)
2-5 TB
New Imaging / Day
All modalities combined
500K-1M
Studies / Year
Across all departments
50-500 MB
Per Study (Avg)
CT thin-slice can exceed 1 GB
PB-Scale
Archive Size
10+ years of historical data
100-1000
Instances / Study (Avg)
CT: 500-5000 slices
< 30s
Ingestion Latency Target
Study available for AI after store
▼ ▼ ▼
Key Challenges & Considerations

Large File Sizes

CT/MR studies can be 1+ GB. Requires chunked uploads, resumable transfers, and efficient network utilization for cloud migration.

Compression Trade-offs

Lossy vs. lossless compression (JPEG2000, JPEG-LS). Lossy acceptable for viewing but not for AI training or primary diagnosis.

Burned-In PHI

Patient name/DOB baked into pixel data (ultrasound overlays, scanned docs). Requires OCR-based pixel scrubbing for de-identification.

Multi-Frame & Cine

Ultrasound clips, cardiac cine MRI, fluoroscopy — multi-frame objects need special handling for storage, viewing, and AI processing.

AI on Pixel Data

Vertex AI inference requires pixel extraction, normalization, and pre-processing. Transfer syntax conversion may be needed.

Cross-Site Reconciliation

Patients imaged at multiple facilities. StudyInstanceUIDs differ; requires MPI matching and study linking across PACS systems.

▼ ▼ ▼

Downstream: Into the Unified Pipeline

Imaging data splits into metadata (structured) and pixel data (binary) paths, converging in the lakehouse for AI-powered radiology and clinical analytics.

DICOM Store
Pub/Sub
Dataflow (Metadata Extract)
BigQuery (Metadata)
+
Cloud Storage (Pixel Data)
Vertex AI (Imaging AI)

Labs / LIS — Deep Dive

Laboratory Information Systems generating orders, specimens, and results via HL7v2 and FHIR — the highest-volume discrete clinical data source.

Lab Source Systems

Sunquest

Enterprise LIS
Chemistry, Heme, Micro, BB

Beaker (Epic)

Integrated with Epic EHR
AP & CP modules

Cerner PathNet

Oracle Health LIS
General & Anatomic Path

SoftLab / MEDITECH

MEDITECH integrated lab
Expanse & legacy


Reference Labs

Quest Diagnostics, LabCorp
Send-out results via HL7v2

POC Devices

i-STAT, glucometers, ABG
Bedside testing, rapid results

▼ ▼ ▼ ▼ ▼
Generate lab orders, specimens, and results
Lab Data Model — Order-to-Result Hierarchy

Order → Specimen → Result → Component

Order OrderID | OrderCode | OrderingProvider | Priority | OrderDateTime
Specimen SpecimenID | Type (Blood/Urine/CSF) | CollectionTime | Source (Venous/Arterial)
Result TestCode (LOINC) | TestName | Status (P/F/C) | ResultDateTime
Component Value (7.4) | Units (mg/dL) | RefRange (3.5-10.5) | Flag (H/L/A/C)
Order Level
Specimen Level
Result Level
Component Level
▼ ▼ ▼
Lab Order Types & Departments

Chemistry

BMP, CMP, LFTs, Lipids, A1c
  • Highest volume department
  • Discrete numeric results
  • Automated analyzer output

Hematology

CBC, Diff, Coags (PT/INR, PTT)
  • Multi-component panels
  • Automated cell counts
  • Manual differential when flagged

Microbiology

Culture & Sensitivity, AFB, Fungal
  • Progressive results over days
  • Organism ID + antibiotic MICs
  • Complex multi-step workflow

Blood Bank

Type & Screen, Crossmatch, Antibody ID
  • Critical for transfusion safety
  • Antigen/antibody panels
  • Regulatory traceability required

Anatomic Pathology

Surgical Path, Cytology, Autopsy
  • Narrative & synoptic reports
  • IHC stains, special stains
  • Cancer staging (CAP protocols)

Molecular / Genetics

NGS Panels, PCR, FISH, Karyotype
  • Variant-level results
  • Pharmacogenomics (PGx)
  • Turnaround: days to weeks
▼ ▼ ▼
HL7v2 Lab Messages — OBR + OBX Segment Structure

Example: ORU^R01 (Lab Result)

MSH LIS | LAB_FAC | EHR | ORU^R01 | 2.5.1
PID MRN | Name | DOB | Sex
ORC OrderControl (RE) | PlacerOrderNum | FillerOrderNum | OrderStatus
OBR UniversalServiceID (LOINC) | RequestedDateTime | ObservationDateTime | ResultStatus (F/P/C)
OBX-1 NM | 2823-3 (K+) | 4.2 | mmol/L | 3.5-5.1 | N
OBX-2 NM | 2951-2 (Na+) | 148 | mmol/L | 136-145 | H
OBX-3 NM | 2160-0 (Creat) | 2.8 | mg/dL | 0.7-1.3 | HH
MSH — Header
PID — Patient
ORC — Order Control
OBR — Observation Request (Panel)
OBX — Observation Result (per analyte)
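The repeating OBX pattern above is what the harmonization pipeline flattens into one row per analyte. A sketch using plain field splitting (the panel LOINC code in the sample is illustrative; a real parser would also carry OBX-7 reference ranges and handle component/sub-ID repetition):

```python
# Sketch: flatten the repeating OBX segments of an ORU^R01 into result rows
# keyed to their OBR panel. Sample data is fake; codes 2823-3 / 2951-2 are
# the LOINC K+ / Na+ codes used in the example above.

def extract_results(message: str) -> list[dict]:
    rows, panel = [], None
    for seg in message.strip().split("\r"):
        f = seg.split("|")
        if f[0] == "OBR":
            panel = f[4].split("^")[0]        # OBR-4 universal service ID
        elif f[0] == "OBX":
            rows.append({
                "panel": panel,
                "loinc": f[3].split("^")[0],  # OBX-3 observation identifier
                "value": f[5],                # OBX-5 value
                "units": f[6],                # OBX-6 units
                "flag": f[8],                 # OBX-8 abnormal flag
            })
    return rows

ORU = "\r".join([
    "MSH|^~\\&|LIS|LAB|EHR|HOSP|20240101||ORU^R01|MSG0002|P|2.5.1",
    "PID|1||123456^^^HOSP^MR||DOE^JANE",
    "OBR|1|||BMP^Basic Metabolic Panel",     # panel code illustrative
    "OBX|1|NM|2823-3^K+|1|4.2|mmol/L|3.5-5.1|N|||F",
    "OBX|2|NM|2951-2^Na+|1|148|mmol/L|136-145|H|||F",
])
results = extract_results(ORU)
```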
▼ ▼ ▼
Lab to FHIR R4 Resource Mapping
HL7v2 Segment FHIR R4 Resource Code System Key Fields Mapped
OBR DiagnosticReport LOINC Panel code, status, effectiveDateTime, performer, conclusion
OBX Observation LOINC Code, valueQuantity, referenceRange, interpretation (H/L/A), status
SPM Specimen SNOMED Type, collection dateTime, source site, condition, container
ORC + OBR ServiceRequest LOINC / local Code, requester, priority, status, authoredOn, specimen requirements
OBX (Micro) Observation (component) SNOMED Organism, antibiotic, MIC value, interpretation (S/I/R)
OBX (AP narrative) DiagnosticReport.presentedForm LOINC Pathology report text (synoptic/narrative), attachment
▼ ▼ ▼
Integration Patterns — Lab to Google Cloud

HL7v2 Streaming Primary

ORU^R01 results streamed in real time. ORM^O01 orders captured for order-result linkage. Sub-second delivery for critical values.

LIS Interface Engine
MLLP Adapter
Cloud Healthcare API (HL7v2 Store)
Pub/Sub → Dataflow

LIS FHIR APIs Modern

Modern LIS platforms expose FHIR R4 endpoints. DiagnosticReport and Observation resources pulled or pushed directly.

LIS FHIR Server
SMART Backend Auth
Cloud Healthcare API (FHIR Store)
BigQuery Export

Reference Lab Results External

Quest, LabCorp, and specialty reference labs return results via HL7v2 or FHIR. Routed through interface engine for normalization.

Reference Lab (Quest/LabCorp)
HL7v2/FHIR
Cloud Healthcare API
Dataflow Pipeline

POC Device Data Bedside

Point-of-care devices (i-STAT, glucometers) transmit results via device middleware to Pub/Sub for real-time capture.

POC Devices
Device Middleware
Pub/Sub
Dataflow → BigQuery
▼ ▼ ▼
Typical Data Volume & Throughput (Large Health System)
500K-2M
Results / Day
Individual OBX observations
10-50
OBX per Panel
CMP=14, CBC+Diff=20+
< 5 min
Critical Result Latency
K+ > 6.0, Troponin > threshold
3-7 days
Micro Culture Duration
Progressive preliminary results
50-100
Messages / Second (Peak)
Morning draw results arrive 7-10 AM
1-4 KB
Avg ORU Message Size
Micro/AP can be 10-50 KB narrative
▼ ▼ ▼
Key Challenges & Considerations

Amendments & Corrections

Result status F → C (corrected). Pipeline must handle updates, maintain audit trail of original vs. corrected values.

Micro Progressive Results

Culture results arrive over days: preliminary → organism ID → susceptibilities. Must link all updates to single order.

Delta Checks & Critical Values

Real-time alerting on critical values (K+ > 6.5, Hgb < 7). Delta checks detect instrument errors. Pipeline must support < 5 min latency.

Discrete vs. Narrative

Chemistry/heme = discrete numeric. Pathology = narrative text. NLP/embeddings required for AP reports to be AI-queryable.

LOINC Mapping Completeness

Local test codes must map to LOINC for interoperability. 60-80% auto-mapped; remainder requires manual curation. Ongoing maintenance.

Reference Range Variability

Ranges differ by lab, instrument, age, sex. Must capture per-result ranges, not global defaults, for accurate interpretation.
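The "Amendments & Corrections" requirement above (F → C with an audit trail) can be sketched as an upsert that never discards the superseded value. The dict stands in for durable storage; in practice this would be a BigQuery table with an amendment-history column, keyed on filler order number plus analyte:

```python
# Sketch: apply a corrected lab result (status F -> C) while preserving the
# original value for audit. Key and values below are made up.

def apply_result(store: dict, key: tuple, value: str, status: str) -> None:
    cur = store.get(key)
    if cur is None:
        store[key] = {"value": value, "status": status, "history": []}
    else:
        # Push the superseded value onto the audit trail before overwriting.
        cur["history"].append({"value": cur["value"], "status": cur["status"]})
        cur.update(value=value, status=status)

store: dict = {}
key = ("FILLER123", "2160-0")          # filler order number + creatinine LOINC
apply_result(store, key, "2.8", "F")   # final result arrives
apply_result(store, key, "1.8", "C")   # correction supersedes, original kept
```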

▼ ▼ ▼

Downstream: Into the Unified Pipeline

Lab data flows into structured BigQuery tables for discrete results and embedding pipelines for narrative pathology reports, powering clinical decision support and AI agents.

HL7v2 Store
Dataflow
BigQuery (Structured Results)
+
Embeddings (Narrative Path)
Vertex AI Search
AI Agents

Wearables / IoT — Deep Dive

Consumer wearables, remote patient monitoring devices, and hospital IoT sensors generating continuous time-series health data at massive scale.

Device Categories

Continuous Glucose Monitors

Dexcom G7, Abbott Libre
Reading every 5 min, 288/day

Cardiac Monitors

Apple Watch, AliveCor, Zio Patch
ECG, rhythm detection


Activity Trackers

Fitbit, Garmin, Oura
Steps, sleep, HRV, calories

RPM Kits

BP cuffs, pulse ox, scales
Cellular-connected home devices


Hospital Bedside Monitors

Philips IntelliVue, GE CARESCAPE
HR, SpO2, BP, temp, 1/sec

Infusion Pumps & Vents

Alaris, Baxter, Draeger
Drug rates, vent settings, alarms

▼ ▼ ▼ ▼ ▼
Continuous streaming of vitals, waveforms, and activity data
Data Types & Formats

Time-Series Vitals

HR, SpO2, BP, Temp, Glucose
  • Numeric readings with timestamps
  • 1/sec (hospital) to 1/5min (CGM)
  • FHIR Observation with effectiveDateTime

Waveforms

ECG (250-500Hz), EEG, Pleth
  • High-frequency continuous data
  • Multi-lead (12-lead ECG)
  • IEEE 11073 / SCP-ECG format

Activity Metrics

Steps, Sleep Stages, Calories, HRV
  • Aggregated epochs (1-min, 5-min)
  • Apple HealthKit / Google Health Connect
  • Daily summaries + granular data

Alerts & Alarms

Threshold violations, arrhythmia, apnea
  • Real-time event notifications
  • Severity levels (advisory/warning/crisis)
  • Alarm context (parameter, limit, value)
▼ ▼ ▼
Standards & Protocols

Device Data Pathway

Protocol IEEE 11073 | BLE Health Profiles | HL7v2 ORU | FHIR Observation
Platform Apple HealthKit | Google Health Connect | Manufacturer Cloud API
Payload DeviceID | PatientID | Timestamp (UTC) | Metric Code (LOINC) | Value + Units
Meta Device Model | Firmware Version | Battery Level | Signal Quality
Transport Protocol
Platform / Aggregator
Observation Payload
Device Metadata
▼ ▼ ▼
Ingestion Patterns — Devices to Google Cloud

Manufacturer Cloud API Consumer

Dexcom, Fitbit, Withings expose REST APIs. Cloud Functions poll or receive webhooks, normalize, and push to Pub/Sub.

Device → Mfg Cloud
Cloud Functions (webhook/poll)
Pub/Sub
Dataflow Streaming

Hospital IoT Gateway Inpatient

Bedside monitors → local IoT gateway (Capsule, Bernoulli) → HL7v2 or MQTT to Pub/Sub for real-time streaming.

Bedside Monitor
IoT Gateway (On-Prem)
Pub/Sub
Dataflow → Bigtable

Patient App / FHIR RPM

Patient-facing apps write FHIR Observation resources (BP, weight, glucose) directly to Cloud Healthcare API FHIR Store.

Patient App
FHIR Observation
Cloud Healthcare API (FHIR Store)
BigQuery Export

Bulk CSV / JSON Historical

Batch export from device platforms (Fitbit data export, CGM CSV downloads). Cloud Storage → Dataflow batch processing.

CSV / JSON Export
Cloud Storage (GCS)
Dataflow Batch
BigQuery
▼ ▼ ▼
Time-Series Processing on GCP

Dataflow Streaming Pipeline

Ingest Pub/Sub (raw readings) Dataflow: Parse & Validate Dedup & Timestamp Align
Process Downsample (5-min windows) Noise Filter / Smooth Anomaly Detection (z-score)
Store Bigtable (raw high-freq) + BigQuery (aggregated) + Cloud Storage (waveforms)
Ingestion Layer
Stream Processing
Computation
Hot Storage
Analytical Store
Cold Storage
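The "Process" row above (downsample to 5-minute windows, then detect anomalies) can be sketched in plain Python. This stands in for the Dataflow windowing and z-score stages; window size, threshold, and readings are illustrative:

```python
# Sketch: downsample 1/sec vitals into 5-minute mean windows, then flag
# outliers by z-score. In the pipeline this runs as Dataflow transforms.
from statistics import mean, stdev

def downsample(readings, window_s=300):
    """readings: [(epoch_seconds, value)] -> sorted [(window_start, mean_value)]"""
    buckets = {}
    for ts, v in readings:
        buckets.setdefault(ts - ts % window_s, []).append(v)
    return sorted((w, mean(vs)) for w, vs in buckets.items())

def zscore_flags(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    if len(values) < 3 or stdev(values) == 0:
        return [False] * len(values)
    m, s = mean(values), stdev(values)
    return [abs(v - m) / s > threshold for v in values]

hr = [(t, 72) for t in range(0, 600)]   # 10 minutes of steady HR=72, 1/sec
windows = downsample(hr)                # -> two 5-minute windows
```

A threshold of 3.0 is a common starting point; in practice it would be tuned per parameter and patient to manage the alert-fatigue problem noted below.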
▼ ▼ ▼
Typical Data Volume & Throughput
288
CGM Readings / Day
1 reading every 5 minutes
250-500 Hz
ECG Waveform Rate
12-lead = 3000-6000 samples/sec
1/sec
Hospital Monitor Rate
HR, SpO2, BP per patient per sec
Millions
Data Points / Day (RPM)
Thousands of patients in program
10-100 GB
Daily Waveform Data
ICU with 50+ monitored beds
< 5s
Alert Latency Target
Critical deterioration detection
▼ ▼ ▼
Key Challenges & Considerations

Data Quality & Noise

Motion artifacts, poor sensor contact, environmental interference. Requires signal quality scoring and filtering before clinical use.

Connectivity Gaps

Bluetooth dropouts, Wi-Fi dead zones, cellular coverage gaps. Must handle store-and-forward with gap reconciliation.

Timestamp Synchronization

Device clocks drift. Multiple devices per patient with different time sources. Must normalize to UTC with known accuracy.

Alert Fatigue

90%+ of monitor alarms are non-actionable. AI must filter noise, detect true deterioration patterns, and suppress false positives.

Patient Compliance

Wearable adherence drops over time. Missing data windows must be flagged, not treated as normal. Engagement tracking needed.

Massive Volume, Low Signal

99%+ of readings are normal. Storage cost optimization via tiered storage (hot/warm/cold) and intelligent downsampling is essential.

▼ ▼ ▼

Downstream: Into the Unified Pipeline

Device data streams through real-time processing into dual storage (Bigtable for raw, BigQuery for aggregated), feeding Vertex AI for anomaly detection and clinical deterioration alerting.

Pub/Sub
Dataflow (Filter/Aggregate)
BigQuery + Bigtable
Vertex AI (Anomaly Detection)
Clinical Agents (Alerts)

Genomics — Deep Dive

Sequencing platforms and bioinformatics pipelines generating variant calls, gene expression profiles, and pharmacogenomic data for precision medicine.

Source Systems

Illumina NovaSeq / NextSeq

Short-read sequencing
WGS, WES, targeted panels


PacBio Revio

Long-read HiFi sequencing
Structural variants, phasing


Oxford Nanopore

Real-time long-read
Rapid turnaround, portable

Bioinformatics Pipelines

GATK, BWA-MEM2, DRAGEN
Alignment + variant calling


LIMS & Tumor Boards

Sample tracking, clinical
interpretation, reporting

Pharmacogenomics

PGx platforms (CPIC)
Drug-gene interaction testing

▼ ▼ ▼ ▼ ▼
Generate raw reads, aligned sequences, variant calls, and clinical reports
Data Types & Formats

FASTQ (Raw Reads)

@readID / sequence / +quality
  • Raw base calls from sequencer
  • WGS: ~100 GB per sample
  • Paired-end (R1 + R2 files)

BAM / CRAM (Aligned)

Binary Alignment Map / Compressed
  • Reads aligned to reference genome
  • BAM: 50-100 GB; CRAM: 30-60 GB
  • Indexed for region queries (.bai)

VCF / gVCF (Variants)

CHROM POS ID REF ALT QUAL FILTER INFO
  • SNVs, indels, structural variants
  • VCF: 100-500 MB per WGS
  • gVCF includes reference confidence

RNA-seq / Expression

Gene expression quantification
  • TPM / FPKM normalized counts
  • Differential expression analysis
  • Fusion gene detection

NGS Panels

50-500 gene targeted panels
  • Oncology (Foundation, Tempus)
  • Hereditary risk (BRCA, Lynch)
  • Carrier screening panels

PGx Results

Star alleles: CYP2D6 *1/*4
  • Metabolizer phenotype classification
  • Drug-gene interaction pairs
  • CPIC guideline recommendations
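The VCF record layout (CHROM POS ID REF ALT QUAL FILTER INFO) can be illustrated with a minimal line parser. A sketch only — real workflows use Variant Transforms or pysam, and the record below is fabricated:

```python
# Sketch: parse the eight fixed VCF columns of a single data line.
# Handles multi-allelic ALT and key=value INFO pairs; ignores samples.

def parse_vcf_line(line: str) -> dict:
    chrom, pos, vid, ref, alt, qual, filt, info = line.split("\t")[:8]
    return {
        "chrom": chrom, "pos": int(pos), "id": vid,
        "ref": ref, "alt": alt.split(","),
        "qual": float(qual), "filter": filt,
        "info": dict(kv.split("=", 1) if "=" in kv else (kv, True)
                     for kv in info.split(";")),
    }

rec = parse_vcf_line("chr1\t12345\t.\tA\tG\t99\tPASS\tAF=0.01;DP=120")
```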
▼ ▼ ▼
Genomics Processing on GCP

End-to-End Pipeline: Sequencer → Clinical Insight

Raw Data FASTQ / BAM Upload Cloud Storage (Multi-Region) Lifecycle to Nearline/Archive
Pipeline Cloud Batch BWA-MEM2 (Align) GATK / DeepVariant (Call) VCF Output
Annotate Variant Transforms BigQuery (Variants Table) ClinVar + gnomAD Annotation
Cohort Hail on Dataproc GWAS / Burden Tests Population Analytics
Storage & Lifecycle
Pipeline Execution
Annotation & Query
Cohort Analysis
▼ ▼ ▼
Genomics to FHIR R4 Resource Mapping
Genomic Source FHIR R4 Resource Key Fields Mapped
VCF Variant Observation (variant) Gene, DNA change (HGVS), protein change, zygosity, allele frequency
Sequence Data MolecularSequence Reference sequence, coordinate system, quality scores, repository
PGx Star Alleles Observation (haplotype) Gene (CYP2D6), allele name (*1/*4), metabolizer phenotype
PGx Recommendation Task (medication-recommendation) Drug, action (adjust dose/avoid), evidence level, CPIC guideline
Clinical Report DiagnosticReport (genetics) Conclusion, variant list, interpretation (P/LP/VUS/LB/B), performer
Panel / Test Order ServiceRequest Panel code, specimen, requester, reason (condition), priority
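The PGx rows above (star-allele diplotype → metabolizer phenotype) follow the CPIC activity-score convention. A sketch with a deliberately truncated allele table — production systems should use CPIC's published allele-functionality tables, which are periodically revised:

```python
# Sketch: CYP2D6 diplotype -> metabolizer phenotype via activity scores.
# Allele scores and phenotype cut points follow the CPIC convention, but
# this table is illustrative and incomplete.

ACTIVITY = {"*1": 1.0, "*2": 1.0, "*4": 0.0, "*5": 0.0, "*10": 0.25, "*17": 0.5}

def cyp2d6_phenotype(diplotype: str) -> str:
    a1, a2 = diplotype.split("/")
    score = ACTIVITY[a1] + ACTIVITY[a2]
    if score == 0:
        return "Poor Metabolizer"
    if score < 1.25:
        return "Intermediate Metabolizer"
    if score <= 2.25:
        return "Normal Metabolizer"
    return "Ultrarapid Metabolizer"

pm = cyp2d6_phenotype("*4/*4")   # no functional allele
im = cyp2d6_phenotype("*1/*4")   # one functional allele
```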
▼ ▼ ▼
Integration Patterns — Genomics to Google Cloud

Raw File Storage Foundation

FASTQ and BAM files uploaded to Cloud Storage with lifecycle policies. Multi-region for durability, Nearline/Archive for cost optimization.

Sequencer Output
gsutil / Transfer Service
Cloud Storage (Standard)
Lifecycle → Nearline/Archive

Pipeline Execution Compute

Cloud Batch runs GATK/DeepVariant workflows. Auto-scaling VMs, preemptible instances for cost. WDL/Nextflow orchestration.

Cloud Storage (FASTQ)
Cloud Batch (GATK/DeepVariant)
Cloud Storage (VCF/BAM)
Variant Transforms

Variant Analysis in BigQuery Analytics

Variant Transforms loads VCF into BigQuery. Join with ClinVar, gnomAD for annotation. SQL-based variant filtering and cohort queries.

VCF Files
Variant Transforms
BigQuery (Variants + Annotations)
Cohort Analytics

Cohort Analysis with Hail Population

Hail on Dataproc for large-scale cohort analysis: GWAS, burden tests, PCA. Scales to millions of variants across thousands of samples.

BigQuery / VCF
Dataproc (Hail)
GWAS / PCA / Burden
Results → BigQuery
▼ ▼ ▼
Typical Data Volume (Large Academic Center)
~100 GB
FASTQ per WGS Sample
30x coverage, paired-end
100-500 MB
VCF per WGS Sample
4-5M variants per genome
5-10 GB
WES per Sample
~60K variants (exome only)
1-5 GB
NGS Panel per Sample
50-500 genes, high depth
10K-50K
Samples / Year
Large academic / research center
2-24 hrs
Pipeline Runtime
WGS alignment + variant calling
▼ ▼ ▼
Key Challenges & Considerations

Massive File Sizes

Single WGS = 100+ GB raw. 50K samples/year = 5+ PB. Requires tiered storage, compression (CRAM), and efficient transfer.

Long Pipeline Runtimes

WGS alignment + calling: 2-24 hours per sample. Requires auto-scaling compute (Cloud Batch) and spot/preemptible instances for cost.

Variant Interpretation (VUS)

40-60% of variants are VUS (Variants of Uncertain Significance). Requires ongoing reclassification as databases update.

Re-Analysis Requirements

As reference databases (ClinVar, gnomAD) update, prior results need re-annotation. Must maintain pipeline versioning and audit trail.

Consent & Return of Results

Incidental findings, right not to know, family implications. Consent management and result disclosure policies vary by institution.

Population Reference Gaps

Reference genomes biased toward European ancestry. gnomAD coverage varies by population. Equity implications for variant calling accuracy.

▼ ▼ ▼

Downstream: Into the Unified Pipeline

Genomic data flows from raw storage through compute pipelines into BigQuery variant tables, joined with knowledge graphs for AI-powered PGx recommendations and tumor profiling.

Cloud Storage (Raw)
Cloud Batch (Pipelines)
BigQuery (Variants)
Knowledge Graph (ClinVar, gnomAD)
AI Agents (PGx, Tumor Profiling)

Claims / SDoH — Deep Dive

Administrative claims data and Social Determinants of Health providing the financial, utilization, and social context layer for population health and equity analytics.

Source Systems — Claims

Clearinghouses

Availity, Change Healthcare
X12 837/835 transaction hub


Medicare / Medicaid

CMS Blue Button 2.0 (FHIR)
State Medicaid feeds


Commercial Payers

UHC, Anthem, Aetna, Cigna
EDI 837P/837I feeds

State HIEs

Regional health exchanges
ADT notifications, claims

Source Systems — SDoH

Census / ACS Data

Demographics, income, education
By FIPS / ZIP / tract


USDA Food Access

Food desert research atlas
Low access / low income tracts


CDC PLACES / SVI

County health estimates
Social Vulnerability Index


ADI / HUD / 211

Area Deprivation Index
Housing, social services

▼ ▼ ▼ ▼ ▼
Claims transactions + geocoded social determinant indices
Claims Data Model — X12 Transactions

X12 837 Professional / Institutional Claim Structure

Header SubmitterID | ReceiverID | TransactionDate | ClaimType (P/I)
Patient MemberID | Name | DOB | GroupNumber | PayerID
Claim ClaimID | DOS (From-To) | PlaceOfService | TotalCharge | DRG
DX Codes ICD-10-CM (Primary) | ICD-10 (DX2..DX12) | ICD-10-PCS (Procedures)
Lines CPT/HCPCS Code | Modifiers | Units | Allowed Amount | NPI (Rendering)
Remit PaidAmount | AdjustmentReason | PatientResponsibility | CheckDate
Transaction Header
Patient / Member
Claim Level
Diagnosis Codes
Service Lines
Remittance (835)
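The 837 structure above can be sketched as a minimal segment walk. This is illustrative only: a production parser is loop-aware (2000A/2300/2400 loops) and reads the actual segment/element delimiters from the ISA header rather than assuming `~` and `*`. The sample transaction and field positions shown are simplified.

```python
# Minimal X12 837 segment walk (illustrative; real 837s need loop-aware
# parsing, and the ISA header can override the '~' / '*' delimiters assumed here).
def parse_837_claims(x12: str) -> list[dict]:
    claims, current = [], None
    for seg in filter(None, (s.strip() for s in x12.split("~"))):
        parts = seg.split("*")
        if parts[0] == "CLM":               # claim level: CLM01 = claim ID, CLM02 = total charge
            current = {"claim_id": parts[1], "total_charge": float(parts[2]),
                       "dx": [], "lines": []}
            claims.append(current)
        elif parts[0] == "HI" and current:  # diagnosis codes, e.g. HI*ABK:E119*ABF:I10
            current["dx"] += [p.split(":")[1] for p in parts[1:] if ":" in p]
        elif parts[0] == "SV1" and current: # professional service line: SV101 = HC:<CPT>
            current["lines"].append(parts[1].split(":")[1])
    return claims

sample = "CLM*A123*450.00~HI*ABK:E119*ABF:I10~SV1*HC:99213*125.00*UN*1~"
print(parse_837_claims(sample))
# [{'claim_id': 'A123', 'total_charge': 450.0, 'dx': ['E119', 'I10'], 'lines': ['99213']}]
```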
▼ ▼ ▼
SDoH Data Model

Z-Codes (ICD-10 Z55-Z65)

Z59.0 Homelessness, Z56.0 Unemployment
  • Captured in EHR problem list / claims
  • Under-coded (5-10% capture rate)
  • Maps to FHIR Condition resource

Screening Tools

AHC-HRSN, PRAPARE, PHQ-2/9
  • Standardized questionnaires
  • Food, housing, transport, safety domains
  • FHIR QuestionnaireResponse

Geocoded Indices

ADI, SVI, Food Access Research Atlas
  • Census tract / ZIP level scores
  • Deprivation, vulnerability rankings
  • Joined by FIPS/ZIP to patient address

SDoH Domains

Gravity Project SDOH Categories
  • Food insecurity
  • Housing instability / homelessness
  • Transportation, employment, education
  • Interpersonal safety, social isolation
▼ ▼ ▼
Claims & SDoH to FHIR R4 Resource Mapping
Source | FHIR R4 Resource | IG / Profile | Key Fields Mapped
X12 837 | Claim | CARIN BB | Type, provider, diagnosis, procedure, total, item lines
X12 835 (EOB) | ExplanationOfBenefit | CARIN BB | Payment, adjudication, adjustments, patient responsibility
X12 270/271 | Coverage | DaVinci PDex | Payor, subscriber, group, period, type, beneficiary
Payer / Provider | Organization | US Core | NPI, name, type, address, active status
SDoH Screening | QuestionnaireResponse | Gravity SDOH | Questionnaire ref, items, answers, authored date
SDoH Need | Condition | Gravity SDOH | Category (sdoh), code (Z-code), clinicalStatus, evidence
SDoH Referral | ServiceRequest / Task | Gravity SDOH | Category, code, status, requester, performer (CBO), for (patient)
▼ ▼ ▼
Integration Patterns — Claims & SDoH to Google Cloud

Claims Flat File Ingestion Primary

X12 837/835 or CSV flat files from clearinghouses. Batch upload to Cloud Storage, parsed by Dataflow, loaded into BigQuery.

Clearinghouse (X12/CSV)
Cloud Storage (GCS)
Dataflow (Parse X12)
BigQuery (Claims Tables)

Payer FHIR APIs CMS Mandate

CMS Blue Button 2.0, payer Patient Access APIs. ExplanationOfBenefit resources pulled via FHIR R4 into Cloud Healthcare API.

Payer FHIR API
Cloud Functions (OAuth2 flow)
Cloud Healthcare API (FHIR Store)
BigQuery Export

SDoH Public Datasets Geocoded

Census/ACS, ADI, SVI, USDA datasets loaded into BigQuery. Joined to patient records by FIPS code, ZIP, or census tract.

Census API / Public Datasets
Cloud Storage / BQ Public
BigQuery (SDoH Tables)
JOIN by FIPS/ZIP
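The FIPS/ZIP join can be sketched in-memory. The index values, tract FIPS codes, and the tract-first/ZIP-fallback rule below are illustrative stand-ins for the BigQuery join against the loaded ADI/SVI tables:

```python
# Hypothetical in-memory sketch of the BigQuery SDoH join: attach an
# area-level ADI decile to each patient by census-tract FIPS, falling
# back to ZIP when geocoding to tract failed. Values are made up.
adi_by_tract = {"17031839100": 7, "17031080100": 2}   # illustrative ADI national deciles
adi_by_zip = {"60622": 5}

def attach_sdoh(patient: dict) -> dict:
    adi = adi_by_tract.get(patient.get("tract_fips"))
    if adi is None:                       # ZIP is only a fallback — tract is more precise
        adi = adi_by_zip.get(patient.get("zip"))
    return {**patient, "adi_decile": adi}

print(attach_sdoh({"patient_id": "p1", "tract_fips": "17031839100", "zip": "60622"}))
print(attach_sdoh({"patient_id": "p2", "zip": "60622"}))
```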

EHR SDoH Screening Clinical

AHC-HRSN / PRAPARE screening responses flow from EHR via HL7v2 or FHIR. Z-codes captured in ADT/DG1 segments.

EHR (HL7v2 / FHIR)
Cloud Healthcare API
Dataflow (Extract Z-codes)
BigQuery (SDoH Screening)
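The Z-code extraction step in that flow can be sketched as a DG1 segment scan. A minimal sketch assuming the standard `|` field and `^` component separators; a real Dataflow DoFn would work on the Healthcare API's parsed-segment JSON rather than raw text:

```python
# Sketch: pull SDoH Z-codes (ICD-10-CM Z55–Z65) out of HL7v2 DG1 segments.
def extract_z_codes(hl7_message: str) -> list[str]:
    codes = []
    for segment in hl7_message.split("\r"):
        fields = segment.split("|")
        if fields[0] == "DG1" and len(fields) > 3:
            code = fields[3].split("^")[0]          # DG1-3.1 = diagnosis code
            if code.startswith("Z") and code[1:3].isdigit() and 55 <= int(code[1:3]) <= 65:
                codes.append(code)
    return codes

msg = ("MSH|^~\\&|EHR|FAC|||20250101||ADT^A01|123|P|2.5\r"
       "DG1|1||Z59.0^Homelessness^I10\r"
       "DG1|2||E11.9^T2DM^I10")
print(extract_z_codes(msg))   # ['Z59.0']
```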
▼ ▼ ▼
Typical Data Volume & Refresh Cadence
Millions
Claims / Month
Large health system, all payers
30-90 Days
Claims Lag
From service date to adjudication
Real-Time
Eligibility Checks
270/271 transactions, sub-second
Quarterly
SDoH Index Refresh
ADI, SVI, PLACES updated periodically
Annual
Census / ACS Data
5-year ACS estimates, decennial census
5-10%
Z-Code Capture Rate
SDoH under-documented in claims/EHR
▼ ▼ ▼
Key Challenges & Considerations

Claims Lag & Adjustments

30-90 day delay from service to paid claim. Denials, resubmissions, and adjustments create multiple versions. Must handle retroactive changes.

SDoH Data Sparsity

Z-code capture is 5-10%. Screening adoption is uneven. Geocoded indices are proxies, not individual-level data. Gaps in rural areas.

Geocoding Accuracy

Patient addresses may be PO boxes, shelters, or outdated. Census tract assignment requires geocoding services and address standardization.

Social Risk vs. Social Need

Area-level deprivation indices (ADI) measure social risk; individual screening captures social need. Both are required, but they answer different questions — area-level risk is not a proxy for any one patient's lived experience.

Cross-Payer Linkage

Patients have claims across multiple payers (Medicare + commercial). No universal patient ID. Requires probabilistic matching and deduplication.

Consent for SDoH

Patients may not consent to sharing social needs data. Sensitive categories (domestic violence, substance use). Must respect preferences.

▼ ▼ ▼

Downstream: Into the Unified Pipeline

Claims and SDoH data join with clinical data in BigQuery, enabling population health analytics, risk stratification, care gap detection, and health equity analysis powered by AI agents.

BigQuery (Claims + SDoH)
JOIN with Clinical Data
Population Health Analytics
AI Agents (Risk, Gaps, Equity)

Cloud Healthcare API + Pub/Sub + Dataflow + HDE

The unified ingestion pipeline: healthcare-native APIs, event-driven messaging, Apache Beam processing, and clinical data harmonization on GCP.

Component Overview

Cloud Healthcare API

  • HL7v2 Store — MLLP ingestion
  • FHIR Store (R4) — CRUD + search
  • DICOM Store — DICOMweb
  • Managed, HIPAA-compliant

Cloud Pub/Sub

  • Serverless event bus
  • Topic-per-data-type routing
  • Exactly-once delivery
  • Dead-letter queue support

Cloud Dataflow

  • Apache Beam (Java/Python)
  • Streaming + batch unified
  • Autoscaling workers
  • Exactly-once processing

Healthcare Data Engine

  • FHIR harmonization
  • Patient matching (EMPI)
  • Terminology normalization
  • Data quality rules
▼ ▼ ▼
Cloud Healthcare API — Store Details

FHIR Store R4

  • Full CRUD on FHIR R4 resources
  • Search: _include, _revinclude, chained params
  • $everything — full patient record
  • Bulk export to BigQuery (streaming + scheduled)
  • Conditional create/update (If-None-Exist)
  • Bundle transactions (up to 100 entries)
  • SMART on FHIR scopes for access control

HL7v2 Store v2.x

  • MLLP adapter — on-prem to GCP bridge
  • Message parsing with configurable schemas
  • Pub/Sub notification on each message
  • Segment-level field extraction
  • ACK/NAK response handling
  • Supports v2.1 through v2.9

DICOM Store DICOMweb

  • STOW-RS — store instances
  • WADO-RS — retrieve studies/series/instances
  • QIDO-RS — query studies by metadata
  • De-identification profiles built-in
  • Integration with Cloud Storage for bulk

Common Capabilities Platform

  • HIPAA BAA, HITRUST, SOC 2 compliant
  • CMEK encryption at rest
  • IAM + SMART on FHIR access control
  • Audit logging to Cloud Logging
  • Regional and multi-regional deployments
▼ ▼ ▼
Pub/Sub as Event Bus

Topic Architecture

One topic per data type enables independent scaling, filtering, and consumer isolation.

hl7v2-messages
ADT, ORM, ORU events
fhir-notifications
FHIR Store changes
dicom-studies
New study arrivals
iot-events
Device telemetry
claims-ingest
X12 835/837 events
dlq-*
Dead-letter per topic
Feature | Configuration | Purpose
Exactly-Once Delivery | enable_exactly_once_delivery: true | No duplicate processing downstream
Message Ordering | ordering_key: patient_id | In-order per patient for ADT events
Dead-Letter Topics | max_delivery_attempts: 5 | Failed messages routed for triage
Push Subscriptions | push_endpoint: Cloud Run URL | Low-latency alert triggers
Pull Subscriptions | ack_deadline: 60s | Dataflow streaming consumption
Retention | message_retention: 7d | Replay window for reprocessing
▼ ▼ ▼
Dataflow Pipelines — Streaming & Batch

Streaming Jobs Always-On

  • HL7v2 parse → FHIR R4 transform → BigQuery write
  • Real-time feature engineering (vitals, alerts)
  • Terminology mapping (local → SNOMED/LOINC)
  • Patient ID resolution via EMPI lookup
  • IoT device stream aggregation (1-min windows)
  • Clinical event enrichment & routing

Batch Jobs Scheduled

  • Historical backfill from bulk FHIR exports
  • Claims file processing (X12 835/837)
  • Genomic pipeline output integration
  • Monthly terminology table refresh
  • Data quality reconciliation reports
  • Feature store batch materialization
Apache Beam Concepts

PCollections

Immutable distributed datasets — each step produces a new PCollection

ParDo / DoFn

Element-wise transforms — HL7v2 parsing, FHIR mapping, validation

Windowing

Fixed (1-min), sliding (5-min/1-min), session (30-min gap) windows

Watermarks

Event-time progress tracking — handle late data with allowed lateness

Side Inputs

Broadcast lookup tables — terminology maps, facility configs

Dead Letters

Failed elements routed to BigQuery error table + DLQ topic
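The windowing concept above can be shown without Beam. A pure-Python sketch of fixed event-time windowing — bucketing vitals into 1-minute windows and averaging per (patient, window) — which a real pipeline would express as `beam.WindowInto(FixedWindows(60))` plus a combiner:

```python
# Pure-Python sketch of Beam-style fixed windowing: assign each event to
# the 1-minute window containing its event time, then average per key.
from collections import defaultdict

def fixed_window_mean(events, window_sec=60):
    buckets = defaultdict(list)                      # (patient_id, window_start) -> values
    for patient_id, event_ts, value in events:
        window_start = event_ts - (event_ts % window_sec)
        buckets[(patient_id, window_start)].append(value)
    return {k: sum(v) / len(v) for k, v in buckets.items()}

vitals = [("p1", 0, 72), ("p1", 30, 76), ("p1", 65, 80)]   # (patient, epoch sec, heart rate)
print(fixed_window_mean(vitals))
# {('p1', 0): 74.0, ('p1', 60): 80.0}
```

Note this sketch ignores watermarks and late data — in Beam, allowed lateness decides whether a straggler event still updates its (already fired) window.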

▼ ▼ ▼
Healthcare Data Engine (HDE)
Capability | Detail | Output
Patient Matching (EMPI) | Probabilistic + deterministic matching on name, DOB, SSN, MRN | Golden patient_id
FHIR Harmonization | Normalize heterogeneous FHIR into canonical R4 profiles | Conformant FHIR bundles
Terminology Normalization | Map local codes → SNOMED CT, LOINC, RxNorm, ICD-10 | Standard coded values
Data Quality Rules | Completeness, validity, consistency checks per resource type | Quality score + flags
Longitudinal Assembly | Merge records across sources into single patient timeline | Unified patient record
De-identification | Safe Harbor / Expert Determination for research datasets | De-identified FHIR
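The deterministic-then-probabilistic matching shape can be illustrated with a toy scorer. HDE's actual EMPI algorithm is a managed service; the field weights and the 1.0 deterministic short-circuit below are hypothetical, chosen only to show the pattern:

```python
# Toy EMPI match score: deterministic pass on SSN, else a weighted
# probabilistic score on name + DOB. Weights are illustrative, not HDE's.
def match_score(a: dict, b: dict) -> float:
    if a.get("ssn") and a.get("ssn") == b.get("ssn"):
        return 1.0                                   # deterministic match wins outright
    points = 0                                       # integer points avoid float drift
    points += 35 if a["family_name"].lower() == b["family_name"].lower() else 0
    points += 25 if a["given_name"].lower() == b["given_name"].lower() else 0
    points += 40 if a["dob"] == b["dob"] else 0
    return points / 100

rec1 = {"given_name": "Ana", "family_name": "Silva", "dob": "1980-03-02"}
rec2 = {"given_name": "ANA", "family_name": "Silva", "dob": "1980-03-02"}
print(match_score(rec1, rec2))   # 1.0 — would merge into one golden patient_id
```

In practice the score is compared against two thresholds: auto-merge above the upper one, human review in between, distinct records below the lower one.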
▼ ▼ ▼
Pipeline Architecture — End-to-End Flow

Streaming Path

EHR / HL7v2
MLLP Adapter
Cloud Healthcare API
Pub/Sub
Dataflow (Streaming)
HDE (Harmonize)
BigQuery Raw Zone

FHIR Native Path

EHR FHIR R4
FHIR Store
Pub/Sub
Dataflow (Streaming)
BigQuery Raw Zone

Imaging Path

PACS / VNA
DICOM Store
Pub/Sub
Dataflow (Metadata)
BigQuery + Cloud Storage

Batch Path

Bulk Export / Files
Cloud Storage (GCS)
Dataflow (Batch)
HDE (Harmonize)
BigQuery Raw Zone
▼ ▼ ▼
Monitoring & Observability

Cloud Monitoring Dashboards

  • Dataflow job metrics: throughput, element count, backlog size
  • Pub/Sub: unacked messages, publish latency, subscription age
  • Healthcare API: request count, error rate, latency p50/p99
  • BigQuery: streaming insert rate, slot utilization, query performance

Alerting & SLOs

  • SLO: end-to-end latency < 5s (p99), availability 99.95%
  • Alert: Pub/Sub backlog > 10K messages
  • Alert: Dataflow error rate > 0.1%
  • Alert: DLQ message count > 0
  • Alert: Healthcare API 5xx rate > 1%
  • Weekly SLO burn-rate reports via Cloud Monitoring
▼ ▼ ▼
Scaling Characteristics
Auto
Dataflow Workers
1 → 1000+ based on backlog
Pub/Sub Throughput
No provisioned capacity needed
1M
BQ Streaming Inserts/sec
Per project, expandable
Regional
Deployment Model
us-central1 primary, failover ready
CMEK
Encryption
Customer-managed keys everywhere
99.95%
Pipeline SLA Target
End-to-end availability
▼ ▼ ▼

Downstream: Into the Lakehouse

Ingested and harmonized data lands in the BigQuery lakehouse, progressing through Raw, Curated, and Enriched zones to become AI-ready.

Ingestion Pipeline
BigQuery Raw Zone
Curated Zone
Enriched Zone
Vertex AI + Agents

Raw Zone — Deep Dive

Immutable landing zone in BigQuery and Cloud Storage. Source-of-truth copies for audit, compliance, and reprocessing.

Purpose & Principles
🔒

Immutable Landing

Data written once, never modified. Append-only ingestion preserves original fidelity.

📄

Source of Truth

Exact copy of upstream data. All downstream zones derive from raw — enables full recompute.

🔍

Audit Trail

Every record timestamped with ingestion metadata. Supports HIPAA audit and regulatory review.

Reprocessing

When transformation logic changes, replay from raw. No need to re-extract from source systems.

▼ ▼ ▼
BigQuery Storage Layout — Datasets by Source
raw_ehr
HL7v2 messages + FHIR resources
raw_imaging_meta
DICOM metadata (studies, series)
raw_labs
Lab results, micro, path
raw_claims
X12 835/837, ERA, eligibility
raw_iot
Device telemetry, wearables
raw_genomics
VCF variants, annotations
▼ ▼ ▼
Raw FHIR Tables

Schema: raw_ehr.fhir_resources Auto-populated via Healthcare API Export

Column | Type | Description
resource_type | STRING | Patient, Encounter, Observation, Condition, etc.
id | STRING | FHIR resource ID (server-assigned UUID)
meta_last_updated | TIMESTAMP | Server-side last modified timestamp
meta_version_id | STRING | Resource version for optimistic concurrency
resource_json | JSON | Full FHIR R4 resource payload
source_fhir_store | STRING | Cloud Healthcare API FHIR store path
ingestion_timestamp | TIMESTAMP | Pipeline ingestion time (partition key)
▼ ▼ ▼
Raw HL7v2 Tables

Schema: raw_ehr.hl7v2_messages Parsed from HL7v2 Store

Column | Type | Description
message_id | STRING | Unique message control ID (MSH-10)
message_type | STRING | ADT, ORM, ORU, SIU, MDM, etc.
trigger_event | STRING | A01, A03, O01, R01, etc.
sending_facility | STRING | MSH-4 sending facility identifier
sending_application | STRING | MSH-3 sending application name
raw_message | STRING | Original pipe-delimited HL7v2 message
parsed_segments | JSON | Structured JSON of all segments (MSH, PID, PV1, OBX...)
message_datetime | TIMESTAMP | MSH-7 message date/time
ingestion_timestamp | TIMESTAMP | Pipeline arrival time (partition key)
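The MSH field extraction behind this table can be sketched directly. One wrinkle worth showing: the field separator itself counts as MSH-1, so after splitting on `|`, list element *i* holds field MSH-(i+1):

```python
# Sketch of the extraction behind raw_ehr.hl7v2_messages: pick the MSH
# fields the schema stores. After split('|'), element i is MSH-(i+1),
# because the '|' separator itself is MSH-1.
def msh_to_row(raw_message: str) -> dict:
    msh = raw_message.split("\r")[0].split("|")
    msg_type, _, trigger = msh[8].partition("^")     # MSH-9 = type^trigger
    return {
        "message_id": msh[9],            # MSH-10 message control ID
        "message_type": msg_type,        # MSH-9.1
        "trigger_event": trigger,        # MSH-9.2
        "sending_application": msh[2],   # MSH-3
        "sending_facility": msh[3],      # MSH-4
        "message_datetime": msh[6],      # MSH-7
        "raw_message": raw_message,
    }

msg = "MSH|^~\\&|EPIC|HOSP_A|||20250101120000||ADT^A01|MSG0001|P|2.5"
print(msh_to_row(msg)["message_id"])   # MSG0001
```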
▼ ▼ ▼
Raw Claims Tables

Schema: raw_claims.claims_raw X12 835/837

Column | Type | Description
claim_id | STRING | Payer-assigned claim identifier
claim_type | STRING | Professional (837P), Institutional (837I), Dental (837D)
service_date_from | DATE | Service start date
service_date_to | DATE | Service end date
dx_codes | ARRAY<STRING> | ICD-10-CM diagnosis codes (primary + secondary)
px_codes | ARRAY<STRING> | CPT/HCPCS procedure codes
billed_amount | NUMERIC | Total billed amount
allowed_amount | NUMERIC | Payer-allowed amount
payer_name | STRING | Insurance payer identifier
raw_x12 | STRING | Original X12 transaction content
ingestion_timestamp | TIMESTAMP | Pipeline arrival time (partition key)
▼ ▼ ▼
Cloud Storage — Large Binary Objects

Storage Organization GCS Buckets

Prefix structure: gs://project-raw/{source}/{type}/{YYYY}/{MM}/{DD}/

  • DICOM files — original imaging studies (.dcm)
  • Genomics — FASTQ, BAM, VCF files
  • Clinical documents — scanned PDFs, CDA/C-CDA
  • Large HL7v2 batches — bulk file drops

Object Metadata Labels

  • source_system — originating system ID
  • data_type — dicom, genomics, document
  • phi_flag — true/false
  • ingestion_date — ISO 8601 arrival date
  • retention_class — hot, warm, cold
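The prefix scheme above is simple enough to encode directly. A sketch of the prefix builder, using the placeholder bucket name `project-raw` from the spec above:

```python
# Build the raw-zone object prefix gs://project-raw/{source}/{type}/{YYYY}/{MM}/{DD}/
# from ingestion metadata. Bucket name is the placeholder from the prefix spec.
from datetime import date

def raw_object_prefix(source: str, data_type: str, ingest_date: date) -> str:
    return f"gs://project-raw/{source}/{data_type}/{ingest_date:%Y/%m/%d}/"

print(raw_object_prefix("pacs", "dicom", date(2025, 3, 7)))
# gs://project-raw/pacs/dicom/2025/03/07/
```

Date-partitioned prefixes like this let lifecycle rules and batch jobs target a day's arrivals with a single prefix match.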
▼ ▼ ▼
Data Quality at Rest — Dataplex DQ
Check Type | Tool | Example Rule
Schema Validation | Dataplex Data Quality | All required columns present, correct types
Completeness | Dataplex Data Quality | patient_id NOT NULL, message_type NOT NULL
Duplicate Detection | Dataform assertion | COUNT(DISTINCT message_id) = COUNT(*)
Freshness Monitoring | Dataplex Data Quality | MAX(ingestion_timestamp) within last 15 minutes
Range Validation | Dataplex Data Quality | ingestion_timestamp between source_time and NOW()
Volume Anomaly | Cloud Monitoring | Daily row count within 2 stddev of trailing 30-day mean
▼ ▼ ▼
Retention & Lifecycle Policies

Tiered Storage Strategy

BigQuery (Hot)
0 – 90 days
Cloud Storage Nearline
90 days – 1 year
Coldline
1 – 3 years
Archive
3 – 7+ years

BigQuery Lifecycle

  • Long-term storage pricing after 90 days (auto)
  • Partition expiration for non-critical staging tables
  • Time travel: 7-day query snapshots for recovery
  • Fail-safe: additional 7 days (Google-managed)

Cloud Storage Lifecycle

  • Object Lifecycle rules auto-transition classes
  • Bucket Lock for WORM compliance
  • Legal holds for litigation / regulatory freezes
  • Object versioning enabled for accidental overwrite
▼ ▼ ▼
Governance — Dataplex & Data Catalog

Dataplex Asset Registration

Every raw dataset/bucket registered as a Dataplex asset within the healthcare lake. Auto-discovery scans for new tables.

Data Catalog Tags

Automated tagging: source system, data classification (PHI/PII/public), ingestion date, owner team, retention policy.

Column-Level Security

BigQuery policy tags on PHI columns (SSN, name, DOB). Data Catalog taxonomy enforces access via IAM.

Lineage Tracking

Dataplex lineage captures raw → curated → enriched provenance. Integrated with Dataform DAGs.

▼ ▼ ▼

Downstream: Raw → Curated Zone

Scheduled Dataflow and Dataform jobs transform raw data into normalized, deduplicated, quality-controlled records in the Curated Zone.

Raw Zone (BigQuery + GCS)
Dataform / Dataflow
Curated Zone
Enriched Zone
AI / Analytics

Curated Zone — Deep Dive

Normalized, deduplicated, quality-controlled healthcare data ready for analytics and downstream enrichment.

Purpose

Normalized

Standard terminologies (SNOMED, LOINC, RxNorm, ICD-10). Consistent schemas across sources.

🔗

Deduplicated

EMPI-resolved patient identity. One golden record per patient, encounter, observation.

Quality-Controlled

Dataform assertions + Dataplex DQ rules enforce integrity. Quality score per record.

📈

Analytics-Ready

Flat, queryable tables optimized for BigQuery. Partitioned and clustered for performance.

▼ ▼ ▼
Transformation Pipeline — Raw to Curated

Processing Steps

Raw Zone
Deduplication
Code Normalization
EMPI Resolution
Business Rules
Flatten FHIR
DQ Validation
Curated Tables

Dataform SQL-based

  • SQL transformations with dependency DAGs
  • Incremental models — process only new/changed rows
  • Built-in assertions for data testing
  • Auto-generated documentation
  • Git-integrated versioning in Cloud Source Repos

Dataflow Complex Logic

  • EMPI matching (probabilistic + deterministic)
  • Cross-source record linkage
  • Terminology mapping with large lookup tables
  • Nested FHIR JSON → flat BigQuery schemas
  • Scheduled via Cloud Composer (Airflow)
▼ ▼ ▼
Core Curated Tables
patient_master
Golden record
encounters
Visits & admissions
observations
Vitals & labs
conditions
Diagnoses
medications
Orders & admin
procedures
Surgical & clinical
allergies
Intolerances
immunizations
Vaccine records
documents
Clinical notes
claims_adjudicated
Processed claims
appointments
Scheduling
▼ ▼ ▼
Patient Master Table — Golden Record
Column | Type | Description
patient_id | STRING | EMPI-resolved universal patient identifier
mrns | ARRAY<STRUCT> | All known MRNs [{mrn, facility, active}]
given_name | STRING | Patient first name (best-known)
family_name | STRING | Patient last name (best-known)
date_of_birth | DATE | Date of birth
gender | STRING | Administrative gender
race | STRING | OMB race category
ethnicity | STRING | OMB ethnicity category
address | STRUCT | Primary address (line, city, state, zip)
primary_pcp | STRING | Primary care provider NPI
risk_scores | STRUCT | {hcc_score, lace_score, cci_score}
last_encounter_date | DATE | Most recent encounter date
insurance | ARRAY<STRUCT> | Active coverage [{payer, plan, member_id, type}]
is_deceased | BOOLEAN | Deceased flag
updated_at | TIMESTAMP | Last curated-zone update timestamp
▼ ▼ ▼
Encounter Table
Column | Type | Description
encounter_id | STRING | Unique encounter identifier
patient_id | STRING | FK to patient_master
encounter_type | STRING | ambulatory, emergency, inpatient, virtual
encounter_class | STRING | AMB, EMER, IMP, VR (FHIR class codes)
facility_id | STRING | Facility / location identifier
department | STRING | Department name
admit_date | TIMESTAMP | Admission or check-in time
discharge_date | TIMESTAMP | Discharge or check-out time
attending_npi | STRING | Attending provider NPI
diagnoses | ARRAY<STRUCT> | [{icd10, description, rank, type}]
procedures | ARRAY<STRUCT> | [{cpt, description, date}]
disposition | STRING | Discharge disposition code
▼ ▼ ▼
Observations Table — Vitals & Lab Results
Column | Type | Description
observation_id | STRING | Unique observation identifier
patient_id | STRING | FK to patient_master
encounter_id | STRING | FK to encounters (nullable for ambulatory)
loinc_code | STRING | LOINC observation code
display_name | STRING | Human-readable observation name
value_numeric | FLOAT64 | Numeric result (if applicable)
value_text | STRING | Text result (if non-numeric)
units | STRING | UCUM unit of measure
reference_range | STRING | Normal reference range
abnormal_flag | STRING | H, L, HH, LL, A, N
effective_date | TIMESTAMP | Clinically relevant date/time
source_system | STRING | Originating system identifier
▼ ▼ ▼
Data Quality Rules
Rule Type | Tool | Example
Not Null | Dataform assertion | patient_id, encounter_id, loinc_code must be non-null
Valid Range | Dataform assertion | Heart rate 20-300, temp 90-110F, SpO2 50-100%
Referential Integrity | Dataform assertion | All encounter.patient_id exists in patient_master
Code System Validation | Dataform assertion | loinc_code matches LOINC reference table
Completeness Score | Dataplex DQ | % of required fields populated per record
Timeliness | Dataplex DQ | Curated table refresh < 30 min after raw arrival
▼ ▼ ▼
Dataform DAG — Example Patient Pipeline

raw_fhir → stg_patients → curated_patient_master

raw_ehr.fhir_resources
stg_patients_dedup
stg_patients_empi
curated.patient_master
assert_patient_pk_unique
raw_ehr.hl7v2_messages
stg_encounters_parsed
stg_encounters_normalized
curated.encounters
assert_encounter_fk_valid
raw_ehr.fhir_resources
stg_observations_flat
stg_observations_loinc
curated.observations
assert_loinc_valid
▼ ▼ ▼
Partitioning & Clustering Strategy

Partition by Date

All curated tables partitioned on primary date column (admit_date, effective_date, updated_at). Enables efficient time-range queries.

Cluster by patient_id

Clustering on patient_id collocates patient data for fast $everything-style queries across encounters, observations, conditions.

Materialized Views

Pre-computed aggregations: active_patients, recent_admissions, pending_results. Auto-refreshed by BigQuery.

BI Engine Acceleration

BigQuery BI Engine reservations on high-traffic curated tables for sub-second Looker dashboard queries.

▼ ▼ ▼

Downstream: Curated → Enriched Zone

Curated records feed feature engineering, embeddings generation, cohort building, and research marts in the Enriched Zone.

Curated Zone
Feature Engineering
Enriched Zone
Vertex AI + Looker

Enriched Zone — Deep Dive

ML features, embeddings, cohorts, and research marts — the AI-ready layer of the healthcare lakehouse.

Purpose

ML Features

Pre-computed risk scores, utilization metrics, temporal aggregations ready for model training and inference.

🔬

Embeddings

Vector representations of clinical notes, imaging, and lab panels for semantic search and similarity.

👥

Cohorts

Pre-built patient cohorts for clinical trials, quality measures, and population health programs.

📊

Research Marts

Disease-specific and operational data marts optimized for analytics and Looker dashboards.

▼ ▼ ▼
Feature Engineering — Computed Features
LACE Score
Readmission risk (LOS, Acuity, Comorbidities, ED visits)
CCI
Charlson Comorbidity Index
APACHE II
ICU severity scoring
Med Adherence
PDC / MPR calculations
Utilization (7d/30d/90d)
ED visits, admissions, procedures
Longitudinal Trends
Lab value slopes, vital trajectories

Dataform Features SQL-based

  • Window functions for temporal aggregations
  • 7-day, 30-day, 90-day rolling windows
  • Incremental updates — only recompute changed patients
  • Scheduled via Cloud Composer DAGs

Dataflow Features Complex

  • Streaming feature computation (real-time vitals)
  • Cross-table joins for composite scores
  • External API enrichment (NPI registry, geocoding)
  • SDoH feature derivation from address data
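Of the computed scores above, LACE is simple enough to show in full. A sketch using the published point bands (length of stay, acute admission, Charlson comorbidity index, ED visits in the prior 6 months); in the lakehouse this would be a Dataform window-function query over encounters, not Python:

```python
# LACE readmission score, following the standard point bands:
# L = length of stay, A = acute admission, C = Charlson index, E = ED visits.
def lace(los_days: int, acute_admission: bool, charlson: int, ed_visits_6mo: int) -> int:
    if los_days < 1:       los = 0
    elif los_days <= 3:    los = los_days        # 1, 2, 3 days -> 1, 2, 3 points
    elif los_days <= 6:    los = 4
    elif los_days <= 13:   los = 5
    else:                  los = 7
    acuity = 3 if acute_admission else 0         # emergent admission
    comorbidity = 5 if charlson >= 4 else charlson   # Charlson capped at 5 points
    ed = min(ed_visits_6mo, 4)                   # 1 point per ED visit, max 4
    return los + acuity + comorbidity + ed

print(lace(los_days=5, acute_admission=True, charlson=2, ed_visits_6mo=1))   # 4+3+2+1 = 10
```

Scores of 10 or more are conventionally treated as high readmission risk.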
▼ ▼ ▼
Vertex AI Feature Store
Entity Type | Key Features | Online Serving | Offline Serving
patient | risk_scores, demographics, utilization_30d, med_count, last_a1c, insurance_type | < 10ms (Bigtable) | BigQuery export
encounter | los_hours, icu_flag, diagnosis_count, procedure_count, ed_to_admit_min | < 10ms (Bigtable) | BigQuery export
provider | panel_size, avg_los, readmit_rate, specialty, quality_scores | < 10ms (Bigtable) | BigQuery export

Online Serving Low-Latency

  • Sub-10ms lookups for clinical agents
  • Backed by Bigtable for high throughput
  • Used by real-time inference endpoints
  • Auto-synced from BigQuery feature tables

Point-in-Time Correctness Training

  • Prevent data leakage in ML training
  • Feature values as-of prediction timestamp
  • Temporal join logic built into Feature Store SDK
  • Critical for readmission / mortality models
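The point-in-time rule can be shown in a few lines: for each training example, use the latest feature value observed at or before the prediction timestamp, never a later one. The Feature Store SDK does this temporal join for you; the sketch below only illustrates the semantics:

```python
# Point-in-time feature lookup: the value "as of" the prediction timestamp.
# Taking any later value would leak the future into training data.
from bisect import bisect_right

def point_in_time_value(history, as_of):
    """history: list of (timestamp, value) sorted ascending by timestamp."""
    timestamps = [t for t, _ in history]
    i = bisect_right(timestamps, as_of)          # count of entries with t <= as_of
    return history[i - 1][1] if i else None

a1c_history = [(1, 9.1), (5, 8.4), (9, 7.2)]     # (event time, A1c value)
print(point_in_time_value(a1c_history, as_of=6))   # 8.4 — the value known at prediction time
print(point_in_time_value(a1c_history, as_of=0))   # None — no measurement existed yet
```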
▼ ▼ ▼
Embedding Tables
Table | Key Columns | Embedding Model | Dimensions
clinical_note_embeddings | note_id, patient_id, encounter_id, embedding_vector, model_version, note_type | Med-PaLM / Gemini | 768 / 1024
imaging_embeddings | study_id, series_id, patient_id, embedding_vector, modality, body_part | Med-PaLM Vision | 1024
lab_panel_embeddings | patient_id, panel_date, embedding_vector, panel_type, lab_count | Custom Vertex AI | 256
patient_summary_embeddings | patient_id, embedding_vector, summary_date, model_version | Gemini | 768

Vector Search Integration

  • BigQuery VECTOR_SEARCH for analytics-time similarity queries
  • Vertex AI Vector Search for low-latency online retrieval (RAG)
  • Cosine similarity for clinical note search, patient matching
  • Used by clinical agents for context retrieval
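The cosine-similarity scoring is the same whether BigQuery or Vertex AI Vector Search runs it at scale. A minimal in-memory sketch with toy 3-dimensional vectors (real embeddings are 768/1024-dim):

```python
# Cosine-similarity top-k retrieval over in-memory vectors — the scoring
# that VECTOR_SEARCH / Vector Search apply with ANN indexing at scale.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query, corpus, k=2):
    """corpus: {doc_id: vector} -> top-k (doc_id, score) by cosine similarity."""
    scored = sorted(((cosine(query, v), d) for d, v in corpus.items()), reverse=True)
    return [(d, round(s, 3)) for s, d in scored[:k]]

notes = {"note_a": [1.0, 0.0, 0.0], "note_b": [0.9, 0.1, 0.0], "note_c": [0.0, 1.0, 0.0]}
print(top_k([1.0, 0.05, 0.0], notes, k=2))   # note_a ranks first, note_b second
```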
▼ ▼ ▼
Cohort Tables
Column | Type | Description
cohort_id | STRING | Unique cohort identifier
cohort_name | STRING | Human-readable name (e.g., "T2DM A1c > 9")
criteria_definition | JSON | Structured inclusion/exclusion criteria
patient_ids | ARRAY<STRING> | Matching patient IDs
patient_count | INT64 | Cohort size
creation_date | TIMESTAMP | When cohort was computed
irb_number | STRING | Associated IRB protocol (if research)
refresh_schedule | STRING | daily, weekly, one-time
created_by | STRING | Requesting user / team
▼ ▼ ▼
Research & Operational Marts

Oncology Mart

Tumor registry, staging, treatment lines, genomic variants, outcomes by regimen.

Cardiology Mart

Echo metrics, cath lab data, LVEF trends, HF readmissions, anticoagulation adherence.

Diabetes Mart

A1c trajectories, insulin dosing, complication rates, eye/foot exam compliance.

Operational — Throughput

ED wait times, OR utilization, bed turnover, discharge delays, staffing ratios.

Operational — Readmissions

30-day readmission rates by DRG, payer, provider. Risk-stratified cohorts.

Population Health

Risk stratification tiers, care gaps (screenings, vaccines), SDoH indices, HEDIS measures.

▼ ▼ ▼
ML Training Datasets
Component | Detail | Storage
Feature Snapshots | Point-in-time feature values at prediction timestamp | BigQuery (versioned)
Label Tables | readmission_30d, mortality_inpatient, sepsis_onset, deterioration_6h | BigQuery
Train/Val/Test Splits | Temporal split (train < 2024, val = 2024-H1, test = 2024-H2) | BigQuery + GCS
Dataset Versioning | Dataplex lineage tracks dataset provenance per model version | Dataplex metadata
Data Cards | Dataset documentation: size, demographics, label distribution, known biases | Vertex AI Metadata
▼ ▼ ▼
Downstream — From Enriched to AI & Analytics

ML/AI Path

Feature Store
Vertex AI Training
Model Registry
Vertex AI Endpoints
Clinical Agents

Analytics Path

Research Marts
BigQuery
Looker
Dashboards & Reports

RAG Path

Embedding Tables
Vertex AI Vector Search
Gemini / Agents
Clinical Summaries
▼ ▼ ▼

Full Pipeline: Ingestion → Raw → Curated → Enriched → AI

The Enriched Zone is the final lakehouse layer. From here, data powers Vertex AI models, clinical agents, Looker dashboards, and research workflows.

Ingestion
Raw Zone
Curated Zone
Enriched Zone
Vertex AI
Agents + Looker

Embeddings — Deep Dive

Convert all healthcare data types into dense vector representations enabling semantic search, similarity matching, and cross-modality reasoning across the clinical data ecosystem.

Embedding Models on Vertex AI
Model | Data Type | Dimensions | Use Case
text-embedding-005 | General text | 768 | Clinical notes, discharge summaries, guidelines
text-multilingual-embedding-002 | Multilingual text | 768 | Patient-facing materials, consent forms
Med-PaLM Embeddings | Clinical text | 768 | H&P notes, radiology/pathology reports, medical Q&A
Health AI Dev Foundations | Medical imaging | 1024 | X-ray, CT, pathology slide embeddings
Custom Fine-Tuned (Vertex AI Training) | Domain-specific | 768/1024 | Org-specific terminology, specialty notes, lab panels
multimodalembedding@001 | Image + text | 1408 | Cross-modal search: text query → image results
▼ ▼ ▼
Healthcare Data Types Embedded

Clinical Notes

  • History & Physical (H&P)
  • Progress notes
  • Discharge summaries
  • Consult notes

Diagnostic Reports

  • Radiology reports
  • Pathology reports
  • Operative notes
  • Procedure findings

Lab Panel Signatures

  • Vectorized result patterns
  • Multi-analyte panels (BMP, CBC)
  • Trending abnormal results
  • Reference range deviations

Encounter Summaries

  • Visit-level patient summaries
  • Problem list snapshots
  • Medication reconciliation
  • Care plan abstracts

Guidelines & Protocols

  • Clinical pathways
  • Formulary policies
  • Order set documentation
  • Institutional SOPs

Research Literature

  • PubMed abstracts
  • Internal publications
  • Trial protocols
  • Systematic reviews
▼ ▼ ▼
Data flows into embedding pipeline
Embedding Pipeline Architecture

Batch Embedding Pipeline Primary

Scheduled and event-driven embedding generation for all new and updated clinical data via Dataflow orchestration.

BigQuery / Cloud Storage
Dataflow (Orchestration)
Vertex AI Batch Prediction
BigQuery (VECTOR columns)

Real-Time Embedding Low Latency

On-demand embedding for new documents and agent queries via Vertex AI online prediction endpoints.

Pub/Sub (New Doc Event)
Cloud Run Function
Vertex AI Online Prediction
Vertex AI Vector Search

Continuous Indexing Streaming

Incremental updates to vector indices as new embeddings arrive, ensuring near-real-time search availability.

BigQuery (New Vectors)
Dataflow Streaming
Vertex AI Vector Search (Index Update)
Deployed Index Endpoint
▼ ▼ ▼
Vector Storage & Search Options
Service | Vector Capability | Search Algorithm | Best For | Latency
BigQuery | VECTOR type, VECTOR_SEARCH() | Cosine / Dot Product / Euclidean | Analytical queries, cohort similarity, SQL joins with vectors | Seconds (analytical)
Vertex AI Vector Search | Managed ANN index, deployed endpoints | ScaNN (Scalable Nearest Neighbors) | Low-latency serving, real-time agent retrieval, RAG pipeline | < 10ms (p99)
AlloyDB | pgvector extension, ANN index | IVFFlat / HNSW | Transactional + vector hybrid, app-embedded search | < 50ms
Spanner | K-Nearest Neighbors (approx) | Cosine distance built-in | Global-scale transactional with vector search | < 20ms
▼ ▼ ▼
Embedding Table Schema (BigQuery)

healthcare_embeddings.document_embeddings

document_id STRING (PK) patient_id STRING (FK) encounter_id STRING (FK) source_type STRING
text_chunk STRING chunk_index INT64 chunk_token_count INT64
embedding_vector ARRAY<FLOAT64> (768/1024 dim)
model_id STRING model_version STRING created_at TIMESTAMP metadata_json JSON
Primary Key
Foreign Key
Data Fields
Vector Column
Metadata / Lineage
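The chunking that populates `text_chunk` / `chunk_index` / `chunk_token_count` can be sketched with overlapping word windows. The window and overlap sizes are illustrative, and the "token" counts here are naive whitespace counts rather than model-tokenizer counts:

```python
# Sketch of chunking for document_embeddings: overlapping word windows,
# emitting one row per chunk in the shape of the schema above.
def chunk_document(document_id, text, max_words=120, overlap=20):
    words, rows, start, idx = text.split(), [], 0, 0
    while start < len(words):
        chunk = words[start:start + max_words]
        rows.append({"document_id": document_id, "chunk_index": idx,
                     "text_chunk": " ".join(chunk), "chunk_token_count": len(chunk)})
        if start + max_words >= len(words):
            break                                    # last window reached the end
        start += max_words - overlap                 # slide with overlap for context
        idx += 1
    return rows

rows = chunk_document("doc1", "word " * 200, max_words=120, overlap=20)
print([r["chunk_token_count"] for r in rows])   # [120, 100]
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk.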
▼ ▼ ▼
Similarity Operations

Patient Matching

  • Nearest-neighbor patient similarity
  • Find patients with similar conditions
  • Treatment outcome cohort matching

Similar Case Retrieval

  • Query by clinical scenario
  • Historical case lookup for CDS
  • Cross-facility case matching

Note Deduplication

  • Detect copy-paste notes (high cosine)
  • Version diff across encounters
  • Identify template-derived content

Literature Matching

  • Patient context → relevant studies
  • Guideline-to-case alignment
  • Evidence gap identification

Anomaly Detection

  • Distance threshold outlier detection
  • Unusual lab result patterns
  • Atypical documentation flagging
▼ ▼ ▼
Quality & Lifecycle Management

Embedding Drift Monitoring

Track distribution shift in embedding space over time. Alert when new data deviates significantly from training distribution.

Model Version Management

Track model_id + model_version per vector. Support side-by-side versions during migration. Vertex AI Model Registry for lineage.

Re-Embedding Pipeline

On model update, trigger batch re-embedding of existing corpus. Dataflow job with BigQuery source, write-back with new model_version.

A/B Evaluation

Compare embedding quality across models using retrieval precision/recall on curated eval sets. Vertex AI Experiments for tracking.

Dimension Reduction

UMAP / t-SNE projections for visualization and debugging. Stored as 2D/3D coordinates for dashboarding in Looker.

TTL & Retention

Embedding expiration aligned with source data retention policies. Automated cleanup via BigQuery scheduled queries.

▼ ▼ ▼

Downstream: Powering Search, RAG & AI Agents

Embeddings feed into vector search for retrieval, RAG pipelines for grounded generation, and BigQuery for cohort analytics and research similarity analysis.

Vertex AI Vector Search
RAG Pipeline Retrieval
AI Agents (Grounded)
Clinical Decisions
BigQuery Vectors
Cohort Similarity Analysis
Research & Population Health

Knowledge Graph — Deep Dive

Encode medical ontologies, clinical pathways, and operational rules as a graph to validate and contextualize AI reasoning with structured medical knowledge.

Graph Database on GCP
Service | Type | Query Language | Best For
Neo4j Aura on GCP | Managed graph DB (GCP Marketplace) | Cypher | Full ontology encoding, multi-hop traversals, pathway validation
Spanner Graph | Graph layer on Cloud Spanner | Spanner Graph Query | Global-scale, strongly consistent graph + relational hybrid
Memorystore (Redis Graph) | In-memory graph | Cypher subset | Cached frequent traversals, low-latency lookups at inference
BigQuery + Graph Analytics | Analytical graph | SQL + GRAPH_PATH() | Batch graph analytics on large-scale clinical datasets
▼ ▼ ▼
Core Ontologies Encoded

SNOMED CT

~350K active concepts
  • Clinical terms & hierarchies
  • IS_A relationships (subsumption)
  • Finding-site, causative-agent edges
  • Concept model attributes

ICD-10-CM / PCS

~72K diagnosis + 78K procedure codes
  • Diagnosis code hierarchies
  • Procedure classification
  • SNOMED ↔ ICD-10 mappings
  • HCC risk groupings

LOINC

~100K lab/observation codes
  • Lab test codes & panels
  • Component + method axes
  • Panel → member relationships
  • Units of measure mappings

RxNorm

~115K drug concepts
  • Medications & ingredients
  • Dose forms & strengths
  • NDC ↔ RxNorm mappings
  • Clinical drug → ingredient edges

CPT / HCPCS

~10K+ procedure codes
  • Procedure billing codes
  • Category I, II, III codes
  • Modifier relationships
  • CPT ↔ ICD-10-PCS mappings
▼ ▼ ▼
Ontologies encoded as graph nodes and edges
Graph Schema — Nodes & Edges

Node Types

:Concept :Medication :Condition :LabTest :Procedure :Pathway :Guideline :Contraindication :GeneVariant

Edge Types (Relationships)

IS_A HAS_FINDING TREATS CONTRAINDICATED_WITH ORDERED_FOR MAPS_TO PART_OF_PATHWAY INTERACTS_WITH HAS_INGREDIENT ASSOCIATED_WITH
Concept (Generic)
Medication
Condition
LabTest
Procedure
Pathway / Guideline
GeneVariant
▼ ▼ ▼
Clinical Pathway Graphs & Drug Interaction Graph

Clinical Pathway Graph Evidence-Based

Encode pathways (sepsis bundle, ACS protocol, diabetes management) as directed graphs with decision nodes, time constraints, and required actions.

Trigger Condition
Decision Node
Required Action (Time-Bound)
Next Decision
Outcome Node

Drug Interaction Graph Safety

Medication → ingredient → interaction edges with severity levels (critical, major, moderate, minor). Used at inference to validate AI medication recommendations.

Medication A
Ingredient X
INTERACTS_WITH [severity: critical]
Ingredient Y
Medication B

Sepsis Bundle Example

Time-zero recognition → lactate draw (30 min) → blood cultures (before abx) → broad-spectrum antibiotics (1 hr) → fluid resuscitation (30 mL/kg if hypotensive) → reassess.

ACS Protocol Example

Chest pain → 12-lead ECG (10 min) → troponin draw → STEMI pathway (cath lab activation) or NSTEMI pathway (risk stratification) → anticoagulation → cardiology consult.

▼ ▼ ▼
Graph-Powered Validation at Inference

AI Action → Graph Validation → Safe Output

AI Agent Proposes Action → Query Graph → Validation Result

Contraindication Check

  • Traverse CONTRAINDICATED_WITH edges
  • Check patient allergy list vs proposed med
  • Block or warn on match

Guideline Concordance

  • Match proposed action to PART_OF_PATHWAY edges
  • Verify action aligns with clinical pathway
  • Flag deviations with explanation

Terminology Correctness

  • Validate codes via IS_A / MAPS_TO traversals
  • Ensure SNOMED, ICD-10, LOINC accuracy
  • Resolve ambiguous terms to correct concepts

Pathway Completeness

  • Check all required pathway steps are addressed
  • Identify missing actions in protocol
  • Verify time constraints are met
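The Contraindication Check above amounts to a two-hop traversal: medication → HAS_INGREDIENT → ingredient, then INTERACTS_WITH between ingredients. A toy in-memory sketch — in production this would be a Cypher or Spanner Graph query against the real ontology; the medications, ingredients, and severity here are illustrative:

```python
# Toy in-memory graph; production traverses Neo4j / Spanner Graph.
HAS_INGREDIENT = {
    "warfarin 5mg tab": {"warfarin"},
    "ibuprofen 400mg tab": {"ibuprofen"},
}
INTERACTS_WITH = {  # undirected ingredient-pair edges with severity
    frozenset({"warfarin", "ibuprofen"}): "major",
}

def check_interactions(proposed_med, active_meds):
    """Traverse HAS_INGREDIENT then INTERACTS_WITH edges for a proposed order.

    Returns a list of findings; an empty list means no known interaction.
    """
    findings = []
    for med in active_meds:
        for a in HAS_INGREDIENT.get(proposed_med, ()):
            for b in HAS_INGREDIENT.get(med, ()):
                severity = INTERACTS_WITH.get(frozenset({a, b}))
                if severity:
                    findings.append({"with": med, "pair": (a, b), "severity": severity})
    return findings
```

A real validator would also walk CONTRAINDICATED_WITH edges against the patient's AllergyIntolerance list before returning a block/warn decision.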
▼ ▼ ▼
Graph APIs & Integration
Access Method GCP Service Protocol Use Case
Direct graph queries Neo4j Aura (Bolt) Bolt protocol / Cypher Complex traversals, ontology exploration, ad-hoc queries
REST endpoints Cloud Run HTTPS / JSON Agent tool calls: validate_medication, check_pathway, lookup_code
Cached lookups Memorystore (Redis) Redis protocol Frequent traversals cached: drug interactions, code lookups
Agent tool integration Vertex AI Agent Builder Tool / Function Calling Graph queries exposed as callable tools for Gemini agents
Batch analytics BigQuery + Dataflow SQL + Graph export Bulk ontology analysis, mapping coverage reports
▼ ▼ ▼
Graph Maintenance & Lifecycle

Automated Ontology Updates

SNOMED CT releases twice a year, RxNorm monthly, and ICD-10 annually. Automated pipelines ingest each new release and update graph nodes/edges.

Versioned Graph Snapshots

Every ontology update creates a versioned snapshot. Enables rollback and point-in-time queries. Stored in Cloud Storage as Neo4j dumps.

Clinical Review Workflows

Pathway updates require clinical committee review. Approval workflow in Cloud Workflows with human-in-the-loop before graph promotion.

Lineage Tracking

Every node/edge tracks: source ontology, version, last_updated, provenance. Queryable for audit and compliance.

Consistency Validation

Scheduled Cypher queries detect orphan nodes, broken relationships, and circular hierarchies. Alerts via Cloud Monitoring.

Cross-Ontology Alignment

Maintain MAPS_TO edges across ontologies (SNOMED↔ICD-10, LOINC↔CPT). Validate mapping coverage on each release cycle.
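The circular-hierarchy check in the Consistency Validation card is plain cycle detection over IS_A edges. A minimal DFS sketch (in production this would be a scheduled Cypher query; the toy concept names are illustrative):

```python
def has_circular_hierarchy(is_a_edges):
    """Detect cycles in IS_A edges (child -> list of parents) via DFS.

    Returns True if any circular subsumption chain exists.
    """
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited / on current path / done
    color = {}

    def visit(node):
        color[node] = GRAY
        for parent in is_a_edges.get(node, ()):
            c = color.get(parent, WHITE)
            if c == GRAY:          # back edge to the current path = cycle
                return True
            if c == WHITE and visit(parent):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in is_a_edges)
```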

▼ ▼ ▼
Graph Scale
  • 800K+ total nodes across all encoded ontologies
  • 3M+ total edges (IS_A, TREATS, etc.)
  • <5ms cached lookup — Redis-cached drug interaction check
  • <50ms multi-hop traversal — 3-hop pathway validation query
  • 5 ontologies encoded: SNOMED, ICD-10, LOINC, RxNorm, CPT
  • Monthly update cadence, aligned with the fastest ontology (RxNorm)
▼ ▼ ▼

Downstream: Validation, Grounding & Enrichment

The Knowledge Graph serves as a real-time validation layer for AI agents, provides concept expansion for RAG grounding, and enriches embeddings with ontology-aware relationships.

Knowledge Graph → Agent Tool Calls (Validation) → Safe Clinical Recommendations
Knowledge Graph → RAG Grounding (Concept Expansion) → Embedding Enrichment (Ontology Vectors)

Data Fabric, IAM, DLP, VPC-SC & Policies

Intelligent data fabric providing consistent, policy-controlled access to distributed healthcare data while appearing unified to AI agents and users.

Data Fabric Components — Dataplex Logical Domains

Clinical Domain

Patient records, encounters, observations, conditions, medications. FHIR-native views.


Imaging Domain

DICOM metadata, radiology reports, pathology slides. Linked to clinical context.

Operations Domain

ADT census, scheduling, staffing, supply chain, billing. Real-time event streams.


Research Domain

De-identified cohorts, OMOP CDM tables, trial registries. IRB-controlled access.


Genomics Domain

VCF files, variant annotations, pharmacogenomics panels. Stored in Cloud Storage + BigQuery.

Data Catalog & Data Products

Data Catalog Dataplex

  • Automated metadata discovery across BigQuery, Cloud Storage, FHIR stores
  • Data lineage tracking — source to consumption
  • Business glossary: standardized healthcare term definitions
  • Sensitivity tags (PHI, PII, de-identified) auto-classified by DLP
  • Search and discovery for analysts and AI agents

Data Products Contracts & SLAs

  • Longitudinal Patient Record — unified view, <5min freshness SLA
  • ICU Telemetry Stream — real-time vitals, <10s latency SLA
  • Oncology Registry — curated staging/treatment data, daily refresh
  • Claims Mart — adjudicated claims + denials, T+1 SLA
  • Each product has owner, schema contract, quality checks, access policy
▼ ▼ ▼
Standardized Access APIs
Access Pattern | GCP Service | Consumers | Use Cases
FHIR R4 REST | Cloud Healthcare API | EHR apps, SMART-on-FHIR, CDS Hooks | Patient read/write, clinical data exchange
REST / GraphQL | Cloud Run + Hasura/Apollo | Internal apps, dashboards | Flexible queries over BigQuery curated views
Search & RAG | Vertex AI Search | AI agents, clinician search | Semantic search across clinical documents + notes
Feature Serving | Vertex AI Feature Store | ML models, prediction agents | Low-latency feature vectors for real-time inference
External API Gateway | Apigee | External partners, HIEs, payers | Rate limiting, auth, analytics for external consumers
▼ ▼ ▼
IAM Architecture — Hierarchy & Roles

GCP Resource Hierarchy

Organization (healthcare-corp.com)
├─ Folder: US-East Region | Folder: US-West Region | Folder: Research
├─ Project: prod-clinical | Project: prod-imaging | Project: prod-ops
└─ Project: staging-clinical | Project: dev-sandbox
IAM Roles & Group Mapping
Role | IAM Binding | Access Scope | Mapped Group
Clinician Viewer | roles/healthcare.fhirResourceReader | Own patients, assigned unit | grp-cardiology, grp-oncology, etc.
Researcher Analyst | roles/bigquery.dataViewer | De-identified datasets only | grp-research-approved
Operations Admin | roles/bigquery.dataEditor | Operational tables, dashboards | grp-ops-managers
AI Agent Service Account | roles/aiplatform.user + custom | Scoped per agent type, purpose-bound | sa-clinical-agent@proj.iam
External Partner | roles/healthcare.fhirResourceReader | Specific FHIR resources via Apigee | grp-external-payer-feeds
▼ ▼ ▼
VPC Service Controls

Perimeter Architecture

VPC-SC Perimeter: healthcare-prod
Services inside perimeter: BigQuery • Cloud Storage • Cloud Healthcare API • Vertex AI • Cloud KMS
Ingress Rules: authorized corporate networks, VPN, specific service accounts
Egress Rules: restricted to approved external APIs (Apigee, HIE endpoints)
Bridge Perimeter: cross-project AI pipelines (prod-clinical ↔ prod-ai)
▼ ▼ ▼
DLP — Data Loss Prevention

PHI Detection Cloud DLP

  • Inspection jobs scan BigQuery tables + Cloud Storage objects
  • Detects: MRN, SSN, DOB, patient names, addresses, phone numbers
  • Custom infoTypes for institution-specific identifiers
  • Continuous inspection on new data via Dataflow integration
  • Findings exported to BigQuery for audit and dashboards

De-identification Transforms Automated

  • Masking — replace PHI with redacted tokens
  • Tokenization — deterministic crypto-hash for re-linkage
  • Date shifting — random offset preserving intervals
  • K-anonymity — generalize quasi-identifiers (age buckets, zip3)
  • Automated DLP transforms in Dataflow pipelines for research datasets
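The tokenization and date-shifting transforms above can be sketched in a few lines; Cloud DLP provides both natively, so this is only a model of the behavior. The key would come from Cloud KMS, and every name and value here is illustrative:

```python
import hashlib
import hmac
from datetime import date, timedelta

SECRET = b"per-project tokenization key"  # in production: fetched from Cloud KMS

def tokenize(identifier):
    """Deterministic keyed hash: same input -> same token.

    Re-linkage is possible only for holders of the key (honest broker).
    """
    return hmac.new(SECRET, identifier.encode(), hashlib.sha256).hexdigest()[:16]

def shift_dates(dates, patient_id, max_days=365):
    """Shift all of one patient's dates by a single deterministic per-patient
    offset, preserving the intervals between clinical events."""
    offset = int(tokenize(patient_id), 16) % (2 * max_days) - max_days
    return [d + timedelta(days=offset) for d in dates]
```

Because the offset is constant per patient, longitudinal analyses (time between admission and readmission, lab trends) remain valid on the de-identified data.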
▼ ▼ ▼
Encryption & Confidential Computing

CMEK — Customer-Managed Keys

  • All data stores encrypted with Cloud KMS keys
  • Key hierarchy: org root → project keys → dataset keys
  • Automatic key rotation (90-day policy)
  • Key access audit via Cloud Audit Logs

Confidential VMs

  • AMD SEV / Intel TDX memory encryption
  • Used for sensitive AI inference workloads
  • Data encrypted in use — not just at rest / in transit
  • Attestation reports for compliance evidence

Column-Level Encryption

  • Ultra-sensitive fields (SSN, genomic data) encrypted at column level
  • Separate CMEK per sensitivity tier
  • Decrypt only with explicit IAM grant + purpose justification
  • BigQuery policy tags enforce column-level access
▼ ▼ ▼
Policy Engine — Attribute-Based Access Control (ABAC)

Agent Call Evaluation Flow

Agent / User Request → Policy Engine (Cloud Run) → Evaluate Attributes → Allow / Deny + Log
Attribute | Source | Examples
User Role | IAM + Google Groups | clinician, researcher, ops-admin, AI-agent
Purpose | Request header / token claim | treatment, payment, operations, research
Data Sensitivity | Dataplex tags + DLP classification | PHI, de-identified, public, restricted
Patient Consent | FHIR Consent resource | opt-in research, restrict substance-abuse records
Context | IAM Conditions | Time of day, IP range, device posture
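The evaluation flow above can be modeled as a small rule matcher over the attribute table. This is a conceptual sketch, not a real Cloud Run API — the policy shape, roles, and purposes are assumptions:

```python
# Hypothetical policy rules: each rule names the attribute values it permits.
POLICIES = [
    {"role": "clinician", "purpose": "treatment",
     "sensitivity": {"PHI", "de-identified"}},
    {"role": "researcher", "purpose": "research",
     "sensitivity": {"de-identified"}},
]

def evaluate(role, purpose, sensitivity, consent_ok=True):
    """ABAC decision: allow only if consent permits AND some rule matches
    every attribute of the request; otherwise deny (default-deny)."""
    if not consent_ok:
        return "DENY"
    for rule in POLICIES:
        if (rule["role"] == role
                and rule["purpose"] == purpose
                and sensitivity in rule["sensitivity"]):
            return "ALLOW"
    return "DENY"
```

In the real system each decision would also be written to Cloud Audit Logs with the evaluated attributes, as the flow above indicates.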
▼ ▼ ▼
Audit & Compliance

Cloud Audit Logs

  • Data access logs (who read what, when)
  • Admin activity logs (config changes)
  • Exported to BigQuery for long-term retention
  • Real-time alerting on anomalous access

Compliance Frameworks

  • HIPAA BAA executed with Google Cloud
  • SOC 2 Type II — continuous controls
  • FedRAMP High (GovCloud for federal)
  • HITRUST CSF certification

Access Reviews

  • Quarterly IAM access recertification
  • Automated unused-permission detection (IAM Recommender)
  • Breach detection via Security Command Center
  • SIEM integration for SOC workflows
▼ ▼ ▼
Sovereignty Controls

Regional Deployment

  • Org policies restrict resource locations (us-east1, us-central1)
  • Assured Workloads for IL4/IL5 regulated environments
  • Data residency enforcement — no cross-region replication without policy

Jurisdiction Controls

  • State-specific data handling (California CCPA, Texas HB 300)
  • Cross-border interop layers for international sites
  • Consent enforcement varies by jurisdiction
▼ ▼ ▼
Security Monitoring Stack

Security Command Center

Threat detection, vulnerability scanning, compliance posture management

Policy Intelligence

IAM Recommender: least-privilege suggestions, unused role alerts

VPC Flow Logs

Network traffic analysis, anomaly detection, forensic investigation

DLP Dashboard

Looker dashboard: PHI findings, de-id coverage, inspection job status

Chronicle SIEM

Centralized security analytics, correlation rules, incident response

▼ ▼ ▼

Cross-Cutting: Spans All Pipeline Layers

Data Fabric and Security controls wrap every component — from ingestion through AI agent execution. Every data access is policy-evaluated, logged, and auditable.

Ingestion → Harmonization → Lakehouse → Knowledge + Embeddings → AI Agents → Consumer Apps

Data Fabric + IAM + VPC-SC + DLP + Audit = enforced at every arrow above


Clinical AI Agents — Deep Dive

Autonomous AI agents that monitor, reason, and recommend within clinical workflows — integrated into EHR, always grounded, always auditable.

Agent Architecture on Vertex AI

Core Platform Stack

Vertex AI Agent Builder • Gemini (Reasoning Engine) • LangChain / LangGraph Orchestration
Agent Tools: RAG Search • FHIR API • Knowledge Graph • Feature Store • EHR Write-Back
▼ ▼ ▼
Clinical Agent Types

Deterioration Prediction Agent

  • Continuous monitoring of vitals, labs, nursing assessments
  • NEWS2 / MEWS scoring with ML-enhanced prediction
  • Real-time alerts to care team via FHIR CommunicationRequest
  • Escalation protocols: nurse → charge → rapid response → code team
  • 6-hour, 12-hour, 24-hour deterioration risk windows

Clinical Documentation Agent

  • Ambient listening via speech-to-text (Chirp on Vertex AI)
  • Generates structured clinical notes (SOAP, H&P, progress)
  • Extracts structured data: diagnoses, medications, procedures
  • ICD-10 and CPT coding suggestions with confidence scores
  • Clinician review and sign-off before EHR commit

Diagnostic Support Agent

  • Differential diagnosis from symptoms, labs, imaging findings
  • Guideline-matched recommendations (AHA, NCCN, IDSA)
  • Literature evidence retrieval via RAG over PubMed + UpToDate
  • Imaging interpretation assist (radiology, pathology)
  • Confidence scoring with supporting evidence chain

Medication Safety Agent

  • Real-time drug-drug interaction check via Knowledge Graph
  • Dose adjustment for renal impairment (CrCl-based) and hepatic function
  • Formulary compliance and therapeutic alternatives
  • Allergy cross-check against documented AllergyIntolerance
  • High-alert medication double-check enforcement

Care Gap Agent

  • Identifies missing screenings (colonoscopy, mammogram, A1c)
  • Vaccination gap detection (influenza, pneumococcal, COVID)
  • Follow-up tracking per HEDIS / CMS quality measures
  • Patient outreach generation (secure message, letter, call list)
  • Quality measure impact scoring for value-based contracts
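The CrCl-based renal dose adjustment under the Medication Safety Agent is commonly computed with the Cockcroft-Gault equation. A minimal sketch — the review threshold is illustrative, not drug-specific, and real logic would be per-medication:

```python
def creatinine_clearance(age, weight_kg, serum_cr_mg_dl, female):
    """Cockcroft-Gault estimate of creatinine clearance (mL/min):
    ((140 - age) * weight) / (72 * SCr), * 0.85 if female."""
    crcl = ((140 - age) * weight_kg) / (72 * serum_cr_mg_dl)
    return crcl * 0.85 if female else crcl

def renal_dose_flag(crcl, threshold=30):
    """Flag an order for pharmacist review below a per-drug CrCl threshold
    (threshold here is illustrative)."""
    return "REVIEW: renal dose adjustment" if crcl < threshold else "OK"
```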
▼ ▼ ▼
Agent Interaction Pattern — Event-Driven Flow

Clinical Event to EHR Notification

Clinical Event → Pub/Sub → Agent Trigger → RAG Retrieval + Knowledge Graph + Feature Store
└─→ Recommendation → Safety Check → EHR Notification
Triggers: new lab result, vital sign, order entry, admission, discharge, scheduled interval
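The event-driven flow above can be sketched as a Pub/Sub push handler. Everything here is illustrative: the tool implementations (RAG retrieval, Gemini reasoning, Knowledge Graph safety check, FHIR write-back) are stubbed behind a `tools` dict, and the event schema is an assumption:

```python
import json

TRIGGER_TYPES = {"lab_result", "vital_sign", "order_entry", "admission", "discharge"}

def handle_event(message, tools):
    """Route a clinical event through retrieve -> reason -> validate -> notify.

    message: JSON string as delivered by a Pub/Sub push subscription.
    tools:   dict of callables standing in for the agent's tool layer.
    """
    event = json.loads(message)
    if event["type"] not in TRIGGER_TYPES:
        return {"status": "ignored"}
    context = tools["retrieve_context"](event["patient_id"])  # RAG + Feature Store
    rec = tools["reason"](event, context)                     # Gemini reasoning
    if not tools["safety_check"](rec):                        # KG validation gate
        return {"status": "suppressed"}
    tools["notify_ehr"](rec)                                  # FHIR CommunicationRequest
    return {"status": "delivered", "recommendation": rec}
```

The key property this models is the mandatory safety gate: no recommendation reaches the EHR without passing validation, matching the Safety Constraints section.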
▼ ▼ ▼
Safety Constraints

Confidence Thresholds

Recommendations suppressed below configurable confidence threshold. Low-confidence outputs routed to human review queue.

Knowledge Graph Validation

Every clinical recommendation validated against curated Knowledge Graph (drug interactions, contraindications, guidelines).

Human-in-the-Loop

High-risk actions (medication changes, code blue alerts, diagnosis) require clinician confirmation before execution.

Explanation Generation

Every recommendation includes reasoning chain: evidence sources, feature contributions, guideline references.

Override Tracking

Clinician overrides logged with reason. Override patterns analyzed for model improvement and safety signal detection.

▼ ▼ ▼
Integration with EHR
Integration Pattern | Standard | Use Case | Direction
App Launch | SMART-on-FHIR | Agent UI embedded in EHR context (patient, encounter) | EHR → Agent
Decision Support | CDS Hooks | patient-view, order-select, order-sign hook triggers | EHR → Agent → EHR
Event Subscription | FHIR Subscriptions | New lab result, admission, medication order triggers agent | EHR → Agent
Notification Write-Back | FHIR CommunicationRequest | In-basket messages, alerts, task assignments to care team | Agent → EHR
Documentation Write-Back | FHIR DocumentReference | AI-generated notes posted back for clinician review | Agent → EHR
▼ ▼ ▼
Model Stack
Model | Platform | Role | Use Cases
Gemini | Vertex AI | Reasoning & generation | Agent orchestration, note generation, differential diagnosis
Med-PaLM | Vertex AI | Clinical Q&A | Medical knowledge retrieval, clinical question answering
Custom ML Models | Vertex AI Training | Specialized prediction | Deterioration, readmission, sepsis, LOS prediction
Ensemble Scoring | Vertex AI Endpoints | Combined inference | Multi-model consensus for high-stakes clinical decisions
▼ ▼ ▼
Monitoring & Feedback Loop

Agent Action Logs BigQuery

  • Every agent invocation: input, tools used, output, latency
  • Clinician acceptance / override rates per agent type
  • Outcome correlation: did the alert prevent an adverse event?
  • Alert fatigue metrics: suppress rate, snooze rate

Performance Dashboards Looker

  • Model accuracy: sensitivity, specificity, PPV per agent
  • Bias monitoring across demographics (age, race, sex)
  • Drift detection: feature distribution shifts over time
  • A/B comparison for model version rollouts
▼ ▼ ▼

Agent Execution Flow

Clinical AI agents consume data from the unified lakehouse, reason with Gemini, validate against the Knowledge Graph, and deliver actionable insights back into the EHR.

Lakehouse (BigQuery) → Feature Store → Agent Builder → Gemini Reasoning → Safety Validation → EHR Action

Operations AI Agents — Deep Dive

AI agents optimizing hospital operations — bed management, staffing, throughput, supply chain, and revenue cycle.

Operations Agent Types

Bed Management Agent

  • Real-time census from ADT feed (Cloud Healthcare API → Pub/Sub)
  • Predicted discharges via ML model (Vertex AI custom training)
  • Admission demand forecasting — ED, surgical, transfer
  • Bed assignment optimization (unit, isolation, acuity matching)
  • EVS coordination: auto-trigger room cleaning on discharge

Staffing Agent

  • Acuity-based staffing models (nurse-to-patient ratio)
  • Shift optimization: minimize gaps, balance workload
  • Float pool allocation based on predicted census
  • Overtime prediction and premium labor cost alerts
  • Skill-mix matching: certifications to unit requirements

Throughput Agent

  • ED boarding detection and escalation alerts
  • OR turnover optimization (case duration prediction)
  • Discharge barrier identification (pending consults, transport, Rx)
  • Patient flow bottleneck analysis across departments
  • Discharge-before-noon tracking and nudge notifications

Supply Chain Agent

  • Inventory forecasting using time-series models
  • Par level optimization per unit and item category
  • Expiration tracking with waste reduction alerts
  • Vendor order automation (PO generation via ERP API)
  • Pandemic stockpile monitoring and surge readiness scoring

Revenue Cycle Agent

  • Charge capture validation — missing charges flagged at discharge
  • Coding accuracy review (ICD-10/CPT vs documentation)
  • Denial prediction — flag claims likely to be denied pre-submission
  • Prior authorization automation (fax → AI extraction → status tracking)
  • A/R aging prioritization — rank accounts by recovery likelihood
▼ ▼ ▼
Data Sources — All Via BigQuery Enriched Zone
Data Source | Feed Type | GCP Ingestion | Agents Consuming
ADT Feed (Real-time Census) | HL7v2 ADT^A01-A03 | Cloud Healthcare API → Pub/Sub → BigQuery | Bed, Throughput, Staffing
Scheduling Systems | SIU messages / API | Dataflow → BigQuery | Throughput, Staffing
HR / Timekeeping | Batch / API (Kronos, Workday) | Cloud Storage → Dataflow → BigQuery | Staffing
Materials Management | ERP API (Infor, SAP) | Cloud Run connector → BigQuery | Supply Chain
Billing / Claims | 837/835 EDI, DFT | Dataflow → BigQuery | Revenue Cycle
Patient Satisfaction | Survey API (Press Ganey) | Cloud Functions → BigQuery | Throughput, All
▼ ▼ ▼
Agent Architecture

Orchestration Pattern

Event (Pub/Sub) → Vertex AI Agent Builder → Tools: BigQuery, FHIR API, Scheduling API, ERP API
Schedule (Cloud Scheduler) → Vertex AI Agent Builder → Optimization Models (OR-Tools on Cloud Run)
└─→ Recommendation / Action → Looker Dashboard / Push Notification / ERP Update
▼ ▼ ▼
Optimization Models

Demand Forecasting Time-Series

  • Vertex AI AutoML Forecasting for admission volume
  • Features: day-of-week, seasonality, flu trends, census history
  • Horizons: 4-hour, 24-hour, 7-day predictions
  • Per-unit and per-service-line granularity

Constraint Optimization OR-Tools

  • OR-Tools on Cloud Run for bed assignment and staff scheduling
  • Constraints: acuity, isolation, gender, unit capacity
  • Objective: minimize transfers, maximize utilization
  • Solves in <30s for a 500-bed facility

Simulation What-If

  • Discrete event simulation for patient flow scenarios
  • Test impact of: adding beds, changing discharge criteria, OR block changes
  • Monte Carlo runs for probabilistic outcomes
  • Results visualized in Looker dashboards

Reinforcement Learning Dynamic

  • RL agents for dynamic staffing decisions
  • Environment: real-time census, acuity, upcoming admits
  • Reward: patient outcomes + cost efficiency balance
  • Trained on Vertex AI, deployed to Cloud Run
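The constraint-optimization card above can be illustrated with a toy greedy assigner that respects the named hard constraints (unit match, isolation rooms) and serves high-acuity patients first. A production solver would use OR-Tools CP-SAT; all field names here are assumptions:

```python
def assign_beds(patients, beds):
    """Greedy sketch of constraint-aware bed assignment.

    Hard constraints: bed must be on the patient's unit, and isolation
    patients need isolation-capable rooms. Highest acuity is placed first
    so scarce beds go to the sickest patients.
    """
    assignment, free = {}, list(beds)
    for p in sorted(patients, key=lambda p: -p["acuity"]):
        for bed in free:
            if bed["unit"] == p["unit"] and (not p["isolation"] or bed["isolation_room"]):
                assignment[p["id"]] = bed["id"]
                free.remove(bed)
                break
    return assignment
```

Greedy placement is only a heuristic; a CP-SAT model additionally encodes the stated objective (minimize transfers, maximize utilization) and proves optimality within the solve budget.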
▼ ▼ ▼
Key Metrics Tracked
  • ED Door-to-Doc: <20min — target for throughput agent
  • Bed Turnaround: <45min — discharge to next admit
  • OR Utilization: >85% — prime-time block usage
  • Discharge Before Noon: >30% — early discharge target
  • Premium Labor: <5% — agency / overtime spend
  • Clean Claim Rate: >95% — first-pass acceptance
  • Days in A/R: <35 — revenue collection speed
  • Supply Stockout Rate: <1% — critical item availability
▼ ▼ ▼
Integration & Outputs

Looker Dashboards

Real-time ops command center: census, throughput, staffing, revenue cycle KPIs. Role-based views for CNO, CMO, CFO.

Push Notifications

Alerts to charge nurses, bed managers, department directors via mobile (Firebase Cloud Messaging).

Automated Work Orders

EVS cleaning triggers, patient transport requests, equipment setup — auto-generated on discharge/transfer events.

ERP System Updates

Purchase orders, inventory adjustments, staffing schedule changes pushed to ERP systems via Cloud Run connectors.

▼ ▼ ▼
ROI Indicators

ED Boarding Reduction

Target 40% reduction in boarding hours through predictive bed assignment and discharge acceleration.

OR Utilization Gains

5-10% improvement in prime-time OR utilization via case duration prediction and turnover optimization.

Labor Cost Savings

15-25% reduction in premium labor (agency, overtime) through predictive staffing and float pool optimization.

Supply Waste Reduction

20-30% reduction in expired supplies through demand-driven par levels and expiration alerts.

Revenue Recovery

2-5% increase in net revenue via charge capture improvement, denial prevention, and faster A/R collection.

▼ ▼ ▼

Operations Agent Execution Flow

Operations AI agents consume real-time operational feeds, apply forecasting and optimization models, and deliver actions to staff and systems.

ADT + Scheduling + HR + Supply → BigQuery Enriched Zone → Agent Builder + OR-Tools → Looker + Notifications + ERP

Research AI Agents — Deep Dive

AI agents that accelerate clinical research — cohort discovery, literature analysis, trial matching, and population health pattern recognition.

Research Agent Types

Cohort Discovery Agent

  • Natural language queries: "patients with HFrEF, A1c >9, on SGLT2i, seen in last year"
  • Text-to-SQL via Gemini against BigQuery research tables
  • Validates queries against data dictionary and OMOP concept sets
  • Returns counts, demographics breakdown, feasibility assessment
  • Iterative refinement: agent suggests criteria modifications

Literature Agent

  • Semantic search across PubMed + institutional publications via RAG
  • Evidence summarization with citation chain
  • Systematic review assistance: screen abstracts, extract PICO elements
  • Citation network analysis: influential papers, research trends
  • Grounded answers with source references and confidence

Trial Matching Agent

  • Ingests active trials from ClinicalTrials.gov API
  • Extracts inclusion/exclusion criteria via NLU
  • Screens patient records against eligibility criteria
  • Generates pre-screening lists ranked by match confidence
  • Notifies investigators and coordinators of eligible patients

Anomaly Detection Agent

  • Continuous surveillance on population data in BigQuery
  • Detects emerging disease clusters (geo-temporal patterns)
  • Unusual lab result trends across patient populations
  • Adverse event signal detection (drug safety surveillance)
  • Infection outbreak pattern recognition (syndromic surveillance)

Hypothesis Generation Agent

  • Identifies correlations in multimodal data (labs + meds + outcomes)
  • Suggests research questions based on data patterns
  • Proposes study designs (RCT, cohort, case-control)
  • Estimates sample sizes and statistical power
  • Cross-references findings with existing literature
▼ ▼ ▼
Cohort Discovery Agent — Interaction Flow

Natural Language to SQL to Results

Researcher Query (NL) → Gemini Text-to-SQL → Data Dictionary Validation → OMOP Concept Expansion
└─→ BigQuery Execution → Results (Count, Demographics, Feasibility) → Export / Refine
All queries execute against de-identified research tables. IRB approval verified before data export.
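The data-dictionary validation step in the flow above acts as a guardrail on LLM-generated SQL before it ever reaches BigQuery. A minimal sketch, assuming a whitelist of de-identified research tables (the table names and rules are hypothetical):

```python
import re

# Hypothetical whitelist of de-identified research tables.
ALLOWED_TABLES = {"research.person", "research.condition_occurrence", "research.measurement"}

def validate_generated_sql(sql):
    """Guardrails for LLM-generated cohort SQL: read-only statements only,
    and only tables from the approved de-identified schema.

    Returns (ok, reason).
    """
    if re.search(r"\b(INSERT|UPDATE|DELETE|DROP|MERGE)\b", sql, re.I):
        return False, "write statements are not allowed"
    tables = set(re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.I))
    unknown = tables - ALLOWED_TABLES
    if unknown:
        return False, f"unapproved tables: {sorted(unknown)}"
    return True, "ok"
```

A full validator would also resolve column names against the institutional data dictionary and expand OMOP concept sets, as the flow describes.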
▼ ▼ ▼
Data Access & Privacy Controls
Access Tier | Data Type | Controls | Use Case
De-identified (Safe Harbor) | 18 identifiers removed via DLP | Open to approved researchers | Cohort discovery, feasibility, population analytics
Limited Dataset | Dates + zip3 retained | DUA required, IRB-approved | Longitudinal studies, temporal pattern analysis
Honest Broker | Re-linkable via broker only | Broker intermediary, audit trail | Multi-source data linkage, registry enrollment
Synthetic Data | Generated via Vertex AI | No restrictions | Model development, algorithm testing, education
Identified (PHI) | Full patient data | IRB + patient consent + CISO approval | Interventional trials, direct patient contact
▼ ▼ ▼
Research Data Platform — GCP Stack

Query & Analysis Primary

  • BigQuery — primary SQL engine for cohort queries, population analytics
  • Dataproc (Spark) — large-scale genomics analysis, variant processing
  • Vertex AI Workbench — managed Jupyter for interactive R/Python analysis
  • Vertex AI Training — custom model development (survival analysis, NLP)

Storage & Compute Infrastructure

  • Cloud Storage — raw genomic files (VCF, BAM, FASTQ)
  • BigQuery — OMOP CDM tables, de-identified research warehouse
  • Vertex AI Feature Store — pre-computed research features
  • Artifact Registry — versioned model artifacts and containers
▼ ▼ ▼
External Integrations
System | Integration | GCP Connector | Purpose
REDCap | REST API | Cloud Run connector | Electronic data capture for prospective studies
i2b2 / OMOP CDM | BigQuery views | Native BigQuery tables | Standard research data models, OHDSI tool compatibility
OHDSI Tools (Atlas, Achilles) | WebAPI | Cloud Run + BigQuery OMOP | Cohort definitions, data quality, characterization
SAS / R / Python | BigQuery connectors | bigrquery, pandas-gbq, SAS/ACCESS | Statistical analysis in researcher's preferred tool
ClinicalTrials.gov | REST API | Cloud Functions → BigQuery | Trial eligibility criteria ingestion for matching
PubMed | E-utilities API | RAG index (Vertex AI Search) | Literature search, evidence retrieval for agents
▼ ▼ ▼
Knowledge Sources for Agent Grounding

Knowledge Graph

Ontologies (SNOMED, LOINC, RxNorm, ICD-10) for query expansion and concept mapping.

PubMed Index

36M+ biomedical abstracts indexed in Vertex AI Search for RAG-powered literature retrieval.

ClinicalTrials.gov

400K+ trial records with structured eligibility criteria for automated patient matching.

Institutional Data Dictionary

Local table schemas, field definitions, valid values. Ensures accurate text-to-SQL generation.

OMOP Concept Sets

Standardized phenotype definitions for reproducible cohort queries across institutions.

▼ ▼ ▼
Research Data Governance

Purpose-Based Access

Access granted per approved research protocol. Treatment data vs. research data separated at IAM and VPC-SC level.

Full Audit Trail

Every query logged in BigQuery audit tables: who, what, when, which dataset, under which IRB protocol.

Minimum Necessary

Column-level access: researchers see only fields required by their protocol. Enforced via BigQuery policy tags.

Re-identification Risk

Automated risk assessment before data export. K-anonymity and l-diversity checks via Cloud DLP.

Consent Management

FHIR Consent resources integrated: patients opting out of research excluded from query results automatically.
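The k-anonymity portion of the Re-identification Risk card reduces to counting how many rows share each quasi-identifier combination (Cloud DLP offers this natively through its risk-analysis jobs). A minimal sketch with illustrative field names:

```python
from collections import Counter

def k_anonymity_violations(rows, quasi_identifiers, k=5):
    """Return quasi-identifier combinations shared by fewer than k rows.

    rows: list of dicts (one per record in the export candidate).
    An empty result means the dataset satisfies k-anonymity for these fields.
    """
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return [combo for combo, n in groups.items() if n < k]
```

Combinations returned here would be generalized further (wider age buckets, zip3 instead of zip5) or suppressed before the export is approved.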

▼ ▼ ▼
Example: End-to-End Trial Matching Workflow

From Protocol to Pre-Screening List

New Trial Protocol → NLU Criteria Extraction (Gemini) → OMOP Concept Mapping → BigQuery Patient Screen
└─→ Match Scoring + Ranking → Pre-Screening List → Coordinator Review + REDCap Enrollment
Reduces manual chart review from weeks to hours. Average 3x increase in enrollment rate.
▼ ▼ ▼

Research Agent Execution Flow

Research AI agents operate on de-identified data, leverage ontologies for precision, and feed results back to investigators through familiar tools.

De-identified Lakehouse → Knowledge Graph + Ontologies → Agent Builder (Gemini) → BigQuery + RAG → Workbench / REDCap / OHDSI

Dashboards (Looker) — Deep Dive

Real-time operational, clinical, and executive analytics powered by BigQuery, surfaced through Looker and Looker Studio across the health system.

Looker Architecture on GCP

Looker (SaaS / Core)

Managed Looker instance
or Looker Core on GKE


LookML Semantic Layer

Git-managed models on top of
BigQuery curated & enriched zones


Looker Studio

Lightweight self-service
dashboards & ad-hoc reports


Embed SDK

Embedded analytics in
EHR portals & custom apps

BigQuery BI Engine

In-memory acceleration
sub-second query response


Vertex AI Integration

ML predictions surfaced
as Looker metrics (risk scores)

▼ ▼ ▼ ▼ ▼
LookML models connect to BigQuery curated & enriched zones
Dashboard Categories

Clinical Quality

Quality & Safety
  • HEDIS measures compliance tracking
  • 30-day readmission rates by DRG
  • Mortality indices (O/E ratios)
  • Hospital-acquired infection rates
  • Patient safety indicators (PSIs)
  • Core measures compliance (SEP-1, VTE, etc.)

Operational Command Center

Real-Time Ops
  • Real-time inpatient census by unit
  • ED throughput: door-to-doc, boarding hours
  • OR utilization & turnover time
  • Bed turnaround & discharge tracker
  • Staffing ratios vs. patient acuity
  • Transfer center volume & capacity

Financial / Revenue Cycle

Finance
  • Clean claim rate & denial rate trends
  • Days in A/R by payer
  • Case mix index (CMI) by service line
  • Cost per case & margin analysis
  • Payer mix & contract performance
  • Charge capture leakage detection

Population Health

Pop Health
  • Risk stratification panels (high/med/low)
  • Care gap closure rates by measure
  • Chronic disease registries (DM, CHF, COPD)
  • SDoH impact analysis (food, housing, transport)
  • Health equity metrics by demographics
  • ACO/VBC performance tracking

Executive / Board

Leadership
  • Balanced scorecard (quality, finance, ops, people)
  • Trend analysis with rolling 12-month views
  • Peer benchmark comparisons (Vizient, CMS)
  • Strategic KPI tracking & goal progress
  • Service line growth & volume trends
▼ ▼ ▼
LookML Semantic Layer
LookML Component BigQuery Target Purpose Key Details
model: clinical bq_curated.clinical_* Clinical domain explores Encounters, conditions, observations, medications
model: operations bq_curated.ops_* Operational metrics Census, throughput, capacity, staffing tables
model: finance bq_curated.rev_cycle_* Revenue cycle & cost Claims, charges, payments, denials, A/R aging
model: research bq_enriched.research_* De-identified research cohorts Cohort tables, genomic summaries, trial enrollment
derived_table (PDT) bq_scratch.pdt_* Expensive computed metrics Readmission flags, risk scores, rolling aggregates
access_filter user_attributes Row-level security Filter by facility_id, department, user role
aggregate_awareness bq_curated.agg_* Query acceleration Pre-aggregated daily/weekly/monthly rollups
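Aggregate awareness works by routing each query to the smallest pre-aggregated table that can still answer it, falling back to the raw table only when needed. A minimal sketch of that selection logic — table names, row counts, and field sets here are illustrative, not the actual LookML configuration:

```python
# Sketch: how aggregate awareness might pick the cheapest covering rollup.
# Tables, row-count estimates, and field sets are hypothetical examples.
ROLLUPS = [
    ("bq_curated.agg_daily_census",   400_000,    {"date", "facility", "unit", "census"}),
    ("bq_curated.agg_monthly_census", 15_000,     {"month", "facility", "census"}),
    ("bq_curated.ops_census_raw",     90_000_000, {"date", "facility", "unit", "bed", "census"}),
]

def pick_table(requested_fields: set) -> str:
    """Return the smallest table whose fields cover the requested query."""
    candidates = [(rows, name) for name, rows, fields in ROLLUPS
                  if requested_fields <= fields]
    if not candidates:
        raise ValueError("no table covers the requested fields")
    return min(candidates)[1]  # fewest rows wins
```

A monthly census query would resolve to the monthly rollup, while a bed-level query falls through to the raw table — which is why the pre-aggregated `agg_*` tables pay off for the high-traffic dashboards.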
▼ ▼ ▼
Real-Time Capabilities

Streaming Refresh 30s – 5min

BigQuery streaming buffer ingests events in real time. Looker dashboards auto-refresh at configurable intervals for operational views.

Pub/Sub Events
BigQuery Streaming
BI Engine Cache
Looker Auto-Refresh

Alerts & Scheduled Delivery Proactive

Threshold-based alerts (e.g., ED boarding > 4h) delivered via email, Slack, or PagerDuty. Scheduled report PDFs for leadership.

  • Conditional alerts on metric thresholds
  • Scheduled Look delivery (email, Slack, SFTP)
  • PagerDuty integration for critical operational alerts
  • Mobile-optimized views for on-call managers
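The alerting pattern above is simple threshold evaluation with severity-based routing. A minimal sketch — the specific thresholds and channel assignments are illustrative policy, not production configuration:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    metric: str
    value: float
    threshold: float
    channel: str

def evaluate_alerts(metrics: dict) -> list:
    """Compare current operational metrics to thresholds; route breaches by channel.
    Thresholds and channels below are example policy only."""
    rules = [
        ("ed_boarding_hours",  4.0,  "pagerduty"),  # ED boarding > 4h is critical
        ("denial_rate_pct",    12.0, "slack"),
        ("bed_turnaround_min", 90.0, "email"),
    ]
    alerts = []
    for metric, threshold, channel in rules:
        value = metrics.get(metric)
        if value is not None and value > threshold:
            alerts.append(Alert(metric, value, threshold, channel))
    return alerts
```

In Looker itself this is configured declaratively per tile; the sketch just shows the decision logic that fires a PagerDuty page for ED boarding while routine trends go to email or Slack.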

Drill-Down Navigation Interactive

System-level → facility → unit → patient-level drill paths. Cross-dashboard linking for root cause analysis.

  • Click census count → unit breakdown → patient list
  • Click denial rate → denial reason → individual claims
  • Cross-filter between related dashboards

AI-Powered Insights Vertex AI

Vertex AI predictions (risk scores, demand forecasts, readmission probability) written to BigQuery and surfaced as Looker metrics.

Vertex AI Models
BigQuery (predictions)
Looker Dashboards
▼ ▼ ▼
End-to-End Data Flow

BigQuery to User

Data flows from the curated/enriched lakehouse through the LookML semantic layer to dashboards consumed by clinicians, operators, and executives.

BigQuery (Curated/Enriched)
LookML Models
Looker Explores
Dashboards / Looks
Users / Embedded Apps
▼ ▼ ▼
Access Control & Security
Layer Mechanism GCP Service Details
Authentication SSO / SAML 2.0 Google Workspace / Cloud Identity Federated login, MFA enforced
Looker Roles role-based access Looker IAM Groups Admin, developer, viewer, embed-user roles
Model Access model_set LookML project Users see only permitted models (clinical, finance, etc.)
Row-Level Security access_filter BigQuery + LookML User sees only their facility/department data
Content Access folder permissions Looker folders/boards Dashboard visibility controlled by folder ACLs
Query Guardrails query cost limits BigQuery Reservations Slot-based quotas, per-user query byte limits
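Row-level security via `access_filter` amounts to injecting a WHERE predicate derived from the user's attributes into every generated query. A simplified sketch of that predicate construction — the attribute names mirror the table above, but the SQL shape is a stand-in for what Looker actually generates:

```python
def access_filter_sql(user_attrs: dict) -> str:
    """Build the row-level predicate that would be injected from user attributes.
    Simplified stand-in for Looker's generated WHERE clause."""
    clauses = []
    if "facility_id" in user_attrs:
        ids = ", ".join(f"'{f}'" for f in user_attrs["facility_id"])
        clauses.append(f"facility_id IN ({ids})")
    if "department" in user_attrs:
        clauses.append(f"department = '{user_attrs['department']}'")
    # No attributes set (e.g., system admins): no row restriction applied
    return " AND ".join(clauses) if clauses else "1 = 1"
```

Because the filter is applied in the semantic layer, every explore, dashboard, and scheduled delivery inherits it — a user scoped to two facilities can never query a third, regardless of which dashboard they open.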
▼ ▼ ▼
Embedding & Extensions

Looker Embed SDK Embed

Embed interactive dashboards directly in EHR portals, custom web apps, and patient portals with SSO pass-through.

Host App (EHR/Portal)
Looker Embed SDK
Signed URL / SSO Embed
Interactive Dashboard
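Signed-URL embedding works by having the host app's backend sign the embed parameters with a shared secret, so the iframe can authenticate without a separate login. A simplified sketch of the signing step — the real Looker SSO embed signature covers more fields (permissions, models, session length) in a vendor-specified order, so treat this as the shape of the mechanism, not the exact protocol:

```python
import base64, hashlib, hmac, json, time
from urllib.parse import quote

def sign_embed_url(host: str, embed_path: str, user_id: str, secret: str) -> str:
    """Simplified Looker-style signed embed: HMAC-SHA1 over newline-joined
    params, base64-encoded. Field list and order are abbreviated here."""
    nonce = "abc123"                      # would be random per request
    ts = str(int(time.time()))
    string_to_sign = "\n".join([host, embed_path, nonce, ts, json.dumps(user_id)])
    sig = base64.b64encode(
        hmac.new(secret.encode(), string_to_sign.encode(), hashlib.sha1).digest()
    ).decode()
    return (f"https://{host}{embed_path}?nonce={nonce}&time={ts}"
            f"&external_user_id={quote(json.dumps(user_id))}&signature={quote(sig)}")
```

The EHR portal's backend builds this URL per session; the embed secret never reaches the browser, and the nonce plus timestamp prevent replay.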

Looker Actions Workflow

Trigger downstream workflows from dashboard data: generate outreach lists, push to CRM, create tasks in care management systems.

  • Send care gap list → outreach CRM
  • Export high-risk panel → care coordinator queue
  • Push denial data → billing worklist

Extension Framework Custom

Build custom React-based applications hosted inside Looker for specialized workflows (e.g., clinical registry management).

  • Custom visualizations (D3.js, Vega)
  • In-Looker workflow tools
  • Looker API & SDK for programmatic access

Looker Studio Self-Service

Lightweight self-service dashboards for business users who need ad-hoc exploration without LookML complexity.

BigQuery
Looker Studio Connector
Self-Service Reports
▼ ▼ ▼
Performance & Optimization
< 3s
Dashboard Load Target
BI Engine + Looker caching
1 GB
BI Engine Reservation
In-memory acceleration per project
30s
Min Auto-Refresh
Operational command center dashboards
PDTs
Persistent Derived Tables
Pre-compute expensive aggregations
Agg Aware
Aggregate Awareness
Auto-select pre-aggregated tables
Slots
BQ Reservations
Dedicated compute for dashboard queries
← Back to Overview

EHR Integration (SMART-on-FHIR) — Deep Dive

Surface AI insights and platform capabilities directly within the clinician's EHR workflow — zero context switching.

SMART-on-FHIR Framework
🔒

EHR Launch

App launched from EHR context
Patient + encounter pre-populated

🌐

Standalone Launch

App launched independently
User selects patient context

🔐

OAuth 2.0 Scopes

patient/*.read, user/*.read
launch/patient, launch/encounter

👤

FHIR Context

Patient ID, Encounter ID
User identity & role

📝

App Registration

Registered with EHR vendor
Client ID, redirect URIs, scopes

Token Validation

Short-lived access tokens
Refresh token rotation
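Scope enforcement is the core of the SMART model: the app only ever sees what its granted scopes permit. A minimal checker for the v1 wildcard forms listed above (`patient/*.read`, `user/*.read`):

```python
def scope_permits(granted: list, context: str, resource: str, action: str) -> bool:
    """Check whether a granted SMART v1 scope list permits an operation.
    Handles wildcard resource (*) and action (*) forms; a sketch, not a
    full implementation of the SMART App Launch scope grammar."""
    for scope in granted:
        ctx, _, rest = scope.partition("/")
        res, _, act = rest.partition(".")
        if ctx == context and res in ("*", resource) and act in ("*", action):
            return True
    return False
```

So a token carrying only `patient/*.read` can fetch Observations for the launched patient but is rejected on any write — the "minimum data principle" enforced mechanically at the token layer.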

▼ ▼ ▼ ▼ ▼
EHR fires CDS Hooks on clinical events
CDS Hooks Integration

Hook Points & Request/Response Flow

patient-view (Chart Open) → Cloud Run CDS → Vertex AI Agent → Cards: risk scores, care gaps
order-select (Ordering Med/Lab) → Cloud Run CDS → Vertex AI Agent → Cards: suggestions, alerts
order-sign (Before Signing) → Cloud Run CDS → Vertex AI Agent → Cards: contraindications, prior auth
encounter-start (Visit Begins) → Cloud Run CDS → Vertex AI Agent → Cards: patient summary, prep checklist
appointment-book (Scheduling) → Cloud Run CDS → Vertex AI Agent → Cards: prep orders, pre-visit tasks
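Each hook returns "cards" in the CDS Hooks response format: a summary, an indicator severity, and a source attribution. A sketch of the Cloud Run service assembling a card for the patient-view hook — the risk thresholds and wording are illustrative, not clinical policy:

```python
def risk_card(score: float) -> dict:
    """Build a CDS Hooks response card for a patient-view hook.
    Card shape (summary, indicator, source, detail) follows the CDS Hooks
    spec; the thresholds and text are example values only."""
    indicator = "critical" if score >= 0.8 else "warning" if score >= 0.5 else "info"
    return {
        "cards": [{
            "summary": f"Sepsis risk score: {score:.0%}",
            "indicator": indicator,
            "source": {"label": "Vertex AI sepsis model"},
            "detail": "Score computed from streaming vitals and labs. "
                      "Review before acting; this is decision support, not a diagnosis.",
        }]
    }
```

Keeping card assembly this thin matters for the < 500ms response target: the expensive inference runs ahead of time and the hook endpoint mostly reads precomputed scores from BigQuery.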
▼ ▼ ▼
Deployed SMART App Types

AI Insights Panel

EHR Sidebar — iframe embed
  • Active risk scores (sepsis, readmission, fall)
  • Pending care gap alerts
  • Recent agent recommendations
  • Trend sparklines from BigQuery enriched zone

Documentation Assistant

SMART launch from note editor
  • Ambient note generation (Vertex AI Gemini)
  • Structured data extraction from dictation
  • ICD-10 / CPT suggestion
  • Quality measure documentation prompts

Diagnostic Decision Support

CDS Hooks — order-select trigger
  • Differential diagnosis ranking
  • Evidence-based order recommendations
  • Literature links & clinical guidelines
  • Drug-drug interaction checks

Prior Authorization Tool

SMART launch — order workflow
  • Automated payer criteria matching
  • Clinical document assembly
  • Electronic submission (X12 278)
  • Status tracking & appeal support

Patient Timeline

FHIR facade on BigQuery enriched zone
  • Unified longitudinal view across all encounters
  • Lab trends, medication history, problem list
  • External records via Carequality / CommonWell
  • AI-generated visit summaries
▼ ▼ ▼
Architecture — End-to-End

EHR to AI Backend

SMART launch triggers authentication, then Cloud Run hosts the app and orchestrates calls to AI and data services. All responses formatted as FHIR resources.

EHR (SMART Launch)
OAuth (Google Identity / EHR)
Cloud Run (App Host)
Vertex AI Agents
BigQuery / Cloud Healthcare API
FHIR Response
▼ ▼ ▼
FHIR Facade Pattern

Read Path FHIR R4

BigQuery curated data exposed as standard FHIR endpoints. The EHR reads enriched/computed data as if it were a native FHIR server.

EHR FHIR Client
Cloud Healthcare API
FHIR Facade (Cloud Run)
BigQuery Curated Zone

Write-Back Path Human-in-Loop

AI recommendations require clinician approval before writing back. Audit trail and undo capability enforced on every write.

AI Recommendation
Clinician Review/Approve
FHIR Write (Cloud Healthcare API)
HL7v2 Outbound to EHR
▼ ▼ ▼
EHR Vendor Specifics
EHR Vendor App Marketplace SMART Support CDS Hooks Key Notes
Epic App Orchard / Gallery Full (Hyperdrive web) Supported USCDI v3, Bulk FHIR, embedded via Hyperspace/Hyperdrive
Oracle Health (Cerner) code Console Full (Ignite APIs) Supported Ignite FHIR R4 APIs, Millennium HL7v2 feeds; code Console parallels Epic's open.epic program
MEDITECH Greenfield SMART R1 (expanding) Limited Expanse FHIR R4, Greenfield SMART for Expanse web
athenahealth Marketplace FHIR R4 Roadmap Cloud-native, API-first; REST and FHIR R4 APIs
▼ ▼ ▼
Write-Back FHIR Resources
Use Case FHIR Resource Source Safety Controls
Risk score documentation Observation Vertex AI prediction Clinician approval required, audit log
Care plan creation CarePlan AI agent recommendation Human review, undo within 24h
Order recommendation ServiceRequest CDS Hook suggestion Clinician must sign, no auto-ordering
Note generation DocumentReference Ambient documentation Clinician edits & co-signs before commit
Problem list update Condition Diagnostic decision support Suggestion only, clinician confirms
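The write-back rule in the table — no FHIR write without clinician approval — can be enforced in code rather than convention. A sketch of the risk-score case, assembling the Observation only after an approval gate; the coding system and code here are placeholders, not a standard terminology binding:

```python
from typing import Optional

def build_risk_observation(patient_id: str, score: float,
                           approved_by: Optional[str]) -> dict:
    """Assemble the FHIR R4 Observation for a risk-score write-back.
    Raises unless a clinician has approved, mirroring the human-in-loop rule.
    The coding system/code below are illustrative placeholders."""
    if not approved_by:
        raise PermissionError("clinician approval required before FHIR write")
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {"coding": [{"system": "https://example.org/ai-scores",
                             "code": "readmission-risk"}]},
        "subject": {"reference": f"Patient/{patient_id}"},
        "valueQuantity": {"value": round(score, 3), "unit": "probability"},
        "performer": [{"display": f"AI model, approved by {approved_by}"}],
    }
```

Placing the gate in the resource builder means no downstream path — Cloud Healthcare API write or HL7v2 outbound — can be reached with an unapproved recommendation, and the approver lands in the resource itself for the audit trail.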
▼ ▼ ▼
Security & Performance
< 500ms
CDS Hook Response Target
Sync path for simple lookups
OAuth 2.0
Token Validation
Short-lived, patient-context scoped
Async
Long AI Inference
Progressive loading, background fetch
Minimum
Data Principle
Request only needed FHIR scopes
Audit
Every Action Logged
Cloud Logging → BigQuery audit
Consent
Enforcement
Patient consent checked before AI use
← Back to Overview

Patient Portals & Virtual Agents — Deep Dive

Patient-facing digital experiences grounded in the AI platform — safe, personalized, and accessible across all channels.

Portal Capabilities

Health Record Access

Cloud Healthcare API — FHIR R4 (US Core)
  • Lab results with trend charts
  • Medications & allergies
  • Immunization records
  • Visit summaries & discharge instructions
  • Imaging reports & pathology

AI Health Assistant

Vertex AI Conversation + Dialogflow CX
  • Symptom triage with safety routing
  • Medication questions & interactions
  • Appointment scheduling via conversation
  • Test result explanation (plain language)
  • Health education content delivery

Care Plan Tracker

FHIR CarePlan + Observation
  • View active care plans & goals
  • Track goal progress (A1c, BP, weight)
  • Log self-reported data (PROs, vitals)
  • Medication reminders & adherence

Secure Messaging

Firestore + Cloud Run
  • Patient-provider secure messaging
  • AI-assisted message routing to right team
  • Draft response suggestions for staff
  • Smart reply for common questions

Appointment & Scheduling

Vertex AI Optimization + FHIR Appointment
  • Self-scheduling with AI slot optimization
  • Pre-visit questionnaires (FHIR Questionnaire)
  • Digital check-in & insurance verification
  • Telehealth launch (video visit integration)
▼ ▼ ▼ ▼ ▼
Patient query enters the Virtual Agent pipeline
Virtual Agent Architecture

Conversational AI Pipeline

STEP 1 Patient Query Dialogflow CX (Intent/Entity) Vertex AI Agent (Reasoning)
STEP 2 Vertex AI Agent RAG: Patient FHIR Data + RAG: Health Education Corpus
STEP 3 Knowledge Graph Validation Safety Checks & Guardrails Confidence Scoring
STEP 4 Response + Citations + Disclaimers Patient (Mobile / Web / SMS)
▼ ▼ ▼
Grounding & Safety Guardrails

No Diagnosis

Agent never provides a diagnosis. Symptom triage routes to appropriate care level (ER, urgent care, PCP, self-care) with disclaimers.

No Prescribing

Agent cannot prescribe, adjust, or recommend stopping medications. All medication queries reference existing prescription data only.

Emergency Redirect

Keywords (chest pain, suicidal, can't breathe) trigger immediate 911/crisis line redirect. No further conversation on emergency topics.

Human Escalation

Clinical concerns beyond agent scope escalated to nurse triage line or provider message. Patient can request human at any time.

Grounded Responses

All answers grounded in patient's FHIR data and vetted content (MedlinePlus, institutional patient education). Hallucination detection active.

Mandatory Disclaimers

Every clinical response includes disclaimer: "This is not medical advice. Contact your provider for medical decisions." Confidence score shown.
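The emergency redirect is deliberately the first and dumbest check in the pipeline: keyword matching runs before any LLM turn, so a model failure cannot swallow an emergency. A minimal sketch — the keyword list is illustrative, and production routing would pair it with an intent classifier:

```python
# Illustrative keyword list; a real deployment would maintain a reviewed,
# multilingual lexicon and combine it with intent classification.
EMERGENCY_TERMS = {"chest pain", "suicidal", "can't breathe", "overdose"}

def route_message(text: str) -> str:
    """First-pass safety router: emergency phrases short-circuit to a
    911/crisis redirect before the conversational AI ever sees the message."""
    lowered = text.lower()
    if any(term in lowered for term in EMERGENCY_TERMS):
        return "emergency_redirect"   # 911 / crisis line; conversation ends
    return "agent"                    # continue to Dialogflow/Vertex pipeline
```

Ordering matters: because this check precedes Dialogflow and Vertex AI, the "no further conversation on emergency topics" rule holds even if the downstream agent is misconfigured or unavailable.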

▼ ▼ ▼
Technology Stack
Layer GCP Service Purpose Details
Frontend Firebase Hosting Web portal (React / Flutter Web) CDN-backed, HTTPS, responsive design
Mobile Flutter (iOS + Android) Cross-platform native app Push notifications via Firebase Cloud Messaging
API Backend Cloud Run Serverless APIs Auto-scaling, min instances for latency
FHIR Data Cloud Healthcare API Patient records (US Core FHIR) FHIR R4, SMART scopes, consent-aware
Conversational AI Dialogflow CX + Vertex AI NLU + reasoning Multi-turn, multilingual, context-aware
Session State Firestore Conversation history & context Real-time sync, TTL-based expiration
Async Tasks Cloud Tasks + Pub/Sub Notifications, reminders, background jobs Scheduled medication reminders, follow-ups
Authentication Google Identity Platform Patient login (OIDC) MFA, ID proofing, social login, SMS OTP
Translation Cloud Translation API Multi-language support 140+ languages, medical term-aware
▼ ▼ ▼
Accessibility & Health Equity

Multi-Language Cloud Translation + Gemini

Real-time translation of portal content and agent conversations. Multilingual Gemini handles complex medical term translation.

  • Cloud Translation API for UI text
  • Gemini multilingual for conversational AI
  • Clinician-reviewed translations for key content

Health Literacy Auto-Simplify

Medical jargon automatically simplified to patient-friendly language. Reading level targeting (6th-8th grade).

  • Gemini-powered jargon-to-plain-language
  • Configurable reading level per patient preference
  • Visual aids and diagrams where applicable
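Checking output against the 6th–8th grade target is typically done with a readability formula such as Flesch-Kincaid grade level. A rough sketch — syllables are approximated by vowel groups, so the score is a screen on generated text, not an exact measure:

```python
import re

def fk_grade(text: str) -> float:
    """Rough Flesch-Kincaid grade estimate for screening simplified output
    against a 6th-8th grade target. Syllables are approximated as vowel
    groups, so expect a few tenths of a grade of error."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59
```

A pipeline might regenerate any response scoring above the patient's configured grade level, feeding the score back into the Gemini simplification prompt.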

WCAG 2.1 AA Accessible

Screen reader compatible, keyboard navigable, high contrast mode, resizable text. Tested with assistive technologies.

  • ARIA labels on all interactive elements
  • Color contrast ratios verified
  • Voice navigation support

Low-Tech Channels Equity

SMS and voice channel fallback for patients without smartphones or reliable internet. Caregiver proxy access with verified authorization.

  • SMS-based appointment reminders & triage
  • IVR voice agent (Dialogflow CX phone gateway)
  • Caregiver proxy access with HIPAA authorization
▼ ▼ ▼
Data Flow

Patient Interaction to Data Pipeline

Patient interactions flow through the conversational AI stack; patient-generated data (PROs, vitals) flows back into the platform as FHIR resources for clinical use.

Patient (App/Web/SMS)
Dialogflow CX
Vertex AI + Cloud Healthcare API
BigQuery (Enriched Zone)
Personalized Response
Patient-Generated Data (PROs, Vitals)
FHIR Observation
Cloud Healthcare API
Pub/Sub → Pipeline → BigQuery
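Patient-generated data entering this path is packaged as standard FHIR resources so the clinical side consumes it like any other vitals feed. A sketch for a self-reported blood pressure — the LOINC panel and component codes (85354-9, 8480-6, 8462-4) are the standard BP codes, while the rest of the resource is simplified:

```python
def pro_blood_pressure(patient_id: str, systolic: int, diastolic: int) -> dict:
    """Package a patient-reported blood pressure as a FHIR R4 Observation
    for the Cloud Healthcare API → Pub/Sub path above. LOINC codes are the
    standard BP panel/component codes; other details are simplified."""
    return {
        "resourceType": "Observation",
        "status": "final",
        "category": [{"coding": [{
            "system": "http://terminology.hl7.org/CodeSystem/observation-category",
            "code": "vital-signs"}]}],
        "code": {"coding": [{"system": "http://loinc.org", "code": "85354-9",
                             "display": "Blood pressure panel"}]},
        "subject": {"reference": f"Patient/{patient_id}"},
        "component": [
            {"code": {"coding": [{"system": "http://loinc.org", "code": "8480-6"}]},
             "valueQuantity": {"value": systolic, "unit": "mmHg"}},
            {"code": {"coding": [{"system": "http://loinc.org", "code": "8462-4"}]},
             "valueQuantity": {"value": diastolic, "unit": "mmHg"}},
        ],
    }
```

Because the resource uses standard codes, the same Observation serves the clinician's trend view in the EHR and the enriched-zone analytics in BigQuery without per-consumer translation.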
▼ ▼ ▼
Privacy & Consent

HIPAA Compliance

All data encrypted in transit (TLS 1.3) and at rest (CMEK). BAA with Google Cloud. PHI access logged and auditable.

Consent Management

Opt-in required for AI features. Granular preferences: AI assistant on/off, data sharing, research participation.

Right to Access/Export

FHIR $export for patient data download. Machine-readable format (FHIR JSON, C-CDA). Compliant with 21st Century Cures Act.

Minor/Guardian Controls

Age-appropriate access. Guardian proxy with verified authorization. Adolescent confidentiality rules per state law.

Data Sharing Preferences

Patient controls data sharing scope: within health system only, HIE participation, research opt-in/out. Preferences enforced at API layer.

Breach Notification

Automated breach detection (Cloud DLP, Security Command Center). Notification workflows per HIPAA Breach Notification Rule (60-day window).

▼ ▼ ▼
Portal Analytics
BigQuery
Usage Metrics
Page views, feature adoption, session duration
Dialogflow
Conversation Analytics
Intent match rate, fallback rate, containment
NPS / CSAT
Patient Satisfaction
Post-interaction surveys, star ratings
Engagement
Activation & Retention
Portal activation %, monthly active users
Outcomes
Health Correlation
Portal engagement vs. care gap closure
Looker
Executive Dashboards
Digital health program KPIs in Looker