Unified Healthcare Data & AI Ecosystem

A real-time, AI-ready architecture that unifies fragmented health data into a continuously learning, sovereign platform for clinical, operational, and research intelligence.

Google Cloud Reference Architecture


1

Unified Intelligent Data Pipeline

Ingestion & Harmonization
▽ ▽ ▽
2

AI-Ready Healthcare Data Lakehouse

BigQuery + Dataplex
▽ ▽ ▽
3

AI-Intelligible & Semantic Layer

Vectors • RAG • Knowledge Graphs
▽ ▽ ▽
4

Data Fabric, Security & Sovereignty

Governed Access • Zero Trust
▽ ▽ ▽
5

Agentic & Reasoning AI Layer

Vertex AI Agents • Continuous Learning

Agent Orchestration

  • Vertex AI Agent tools as orchestration layer
  • Connected to: RAG, Vector Search, Knowledge Graph, FHIR APIs
  • Governed write/act paths with safety constraints

MLOps & Continuous Learning

  • Vertex AI Pipelines for lifecycle management
  • Model Monitoring: drift, bias, safety
  • Clinician overrides & outcomes feed retraining
  • A/B experiments & feedback loops
▽ ▽ ▽
6

Clinician & Operations Experiences

Dashboards • EHR Integration • Patient Apps

↻ Continuous Learning Feedback Loop

Clinical outcomes, clinician corrections, and user interactions feed back through the pipeline — retraining models, updating knowledge graphs, and improving AI accuracy over time. Every data point makes the system smarter.


EHR / HL7v2 — Deep Dive

Electronic Health Record systems communicating via HL7 Version 2 messaging — the dominant real-time clinical data source feeding the unified pipeline.

EHR Source Systems

Epic

Inpatient & Ambulatory EHR
Bridges, Care Everywhere

Oracle Health (Cerner)

Millennium platform
Real-time feeds via HCI

MEDITECH

Expanse / 6.x
NPR & DR interfaces

Allscripts / Veradigm

TouchWorks, Sunrise
Ambulatory & acute


athenahealth

Cloud-native ambulatory
athenaClinicals APIs

Other / Legacy

VA VistA, DoD MHS GENESIS,
regional & specialty EHRs

▼ ▼ ▼ ▼ ▼
Generate HL7v2 messages on clinical events
HL7v2 Message Types & Trigger Events

ADT — Admit / Discharge / Transfer

ADT^A01, A02, A03, A04, A08 ...
  • Patient admissions & registrations
  • Transfers between units
  • Discharges & leave of absence
  • Demographics updates (A31)
  • Merge patient records (A40)

ORM / OML — Orders

ORM^O01, OML^O21, OML^O33
  • Laboratory test orders
  • Radiology & imaging orders
  • Medication orders
  • Procedure orders
  • Order status changes & cancellations

ORU — Results / Observations

ORU^R01, ORU^R30
  • Lab results (chemistry, hematology, micro)
  • Radiology reports
  • Pathology reports
  • Vital signs & flowsheet data
  • Transcribed documents

SIU — Scheduling

SIU^S12, S14, S15, S26
  • Appointment creation & updates
  • Cancellations & no-shows
  • Resource allocation
  • Clinic schedule management

MDM — Document Management

MDM^T01, T02, T11
  • Clinical notes (H&P, progress, consult)
  • Discharge summaries
  • Operative notes
  • Document status tracking

RDE / RAS — Pharmacy

RDE^O11, RAS^O17, RDS^O13
  • Encoded pharmacy orders (prescription details)
  • Medication administration events
  • Dispense notifications
  • Formulary interactions

DFT — Financial / Charges

DFT^P03, DFT^P11
  • Charge postings (CPT, HCPCS)
  • Procedure-level billing events
  • Insurance verification triggers
  • Claim adjudication signals

VXU — Immunizations

VXU^V04
  • Vaccination administration records
  • Immunization history queries
  • Registry submissions
  • Adverse event reporting
▼ ▼ ▼
HL7v2 Message Anatomy — Segment Structure

Example: ADT^A01 (Patient Admission)

MSH SendingApp | SendingFac | ReceivingApp | ReceivingFac | DateTime | ADT^A01 | MsgCtrlID | 2.5.1
EVN A01 | EventDateTime | PlannedDateTime | ReasonCode
PID PatientID (MRN) | Name (Last^First) | DOB | Sex | Race | Address | Phone | SSN
PV1 PatientClass (I/O/E) | Location (Unit^Room^Bed) | AttendingDr | AdmitType | VisitNumber | AdmitDateTime
DG1 DiagnosisCode (ICD-10) | Description | Type (A/W/F) | DiagnosingClinician
OBX ValueType (NM/ST/CE) | ObsIdentifier (LOINC) | Value | Units | RefRange | AbnormalFlag
IN1 InsurancePlanID | CompanyName | GroupNumber | PolicyNumber | EffectiveDate
MSH — Message Header
EVN — Event Type
PID — Patient Identity
PV1 — Patient Visit
DG1 — Diagnosis
OBX — Observation
IN1 — Insurance
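The segment layout above can be walked with a few lines of field splitting. The sketch below is illustrative only (it ignores the encoding characters declared in MSH-2, repeating fields, and continuation segments — a production pipeline would use a proper HL7v2 library or the Cloud Healthcare API parser); the sample message values are made up.

```python
# Minimal HL7v2 segment/field splitter -- sketch, not a full parser.
# Segments are separated by carriage returns, fields by "|", components by "^".

def parse_hl7(message: str) -> dict:
    """Return {segment_id: [fields...]} for the first occurrence of each segment."""
    segments = {}
    for line in message.strip().split("\r"):
        if not line:
            continue
        fields = line.split("|")
        segments.setdefault(fields[0], fields)
    return segments

# Illustrative ADT^A01 with the segments described above (fake data).
ADT_A01 = "\r".join([
    "MSH|^~\\&|EPIC|HOSP|HDE|GCP|20240101120000||ADT^A01|MSG0001|P|2.5.1",
    "EVN|A01|20240101120000",
    "PID|1||123456^^^HOSP^MR||DOE^JANE||19800101|F",
    "PV1|1|I|ICU^201^A|||||||MED||||||||V001",
])

msg = parse_hl7(ADT_A01)
mrn = msg["PID"][3].split("^")[0]   # PID-3, first component -> MRN
event = msg["MSH"][8]               # MSH-9 (MSH-1 is the separator itself)
```

Note the MSH off-by-one: because MSH-1 is the field separator, `fields[8]` is MSH-9 (the message type).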
▼ ▼ ▼
HL7v2 to FHIR R4 Resource Mapping
HL7v2 Segment Trigger Event FHIR R4 Resource Key Fields Mapped
PID ADT^A01/A04/A08 Patient MRN, name, DOB, gender, address, telecom, identifiers
PV1 + PV2 ADT^A01/A02/A03 Encounter Class, location, period, participant (attending), status
ORC + OBR ORM^O01 / OML^O21 ServiceRequest Code, requester, status, priority, specimen requirements
OBX ORU^R01 Observation LOINC code, value, units, reference range, interpretation
OBR (Radiology) ORU^R01 DiagnosticReport Study code, results, conclusion, imaging references
RXE / RXA RDE^O11 / RAS^O17 MedicationRequest / MedicationAdministration Drug (RxNorm), dose, route, frequency, prescriber
DG1 ADT^A01/A03 Condition ICD-10 code, category (encounter/problem-list), onset
AL1 ADT^A01/A08 AllergyIntolerance Substance, reaction, severity, clinical status
SCH + AIS SIU^S12 Appointment DateTime, participant, location, status, serviceType
TXA + OBX MDM^T02 DocumentReference Type, author, date, content (base64 / URL), status
IN1 + IN2 ADT^A01 Coverage Payor, subscriber, group, period, type
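The first row of the table (PID → Patient) can be sketched as a plain field-to-field transform. This is a simplified illustration of what the Cloud Healthcare API / Healthcare Data Engine mapping performs, handling only one identifier, one name, and two gender codes:

```python
# Sketch: map a parsed PID segment to a minimal FHIR R4 Patient resource.
# Positions follow HL7v2: PID-3 identifiers, PID-5 name, PID-7 DOB, PID-8 sex.

def pid_to_patient(pid: list) -> dict:
    mrn = pid[3].split("^")[0]
    last, _, first = pid[5].partition("^")
    dob = pid[7]
    gender = {"M": "male", "F": "female"}.get(pid[8], "unknown")
    return {
        "resourceType": "Patient",
        "identifier": [{"type": {"text": "MRN"}, "value": mrn}],
        "name": [{"family": last, "given": [first] if first else []}],
        "birthDate": f"{dob[:4]}-{dob[4:6]}-{dob[6:8]}" if len(dob) == 8 else None,
        "gender": gender,
    }

pid = "PID|1||123456^^^HOSP^MR||DOE^JANE||19800101|F".split("|")
patient = pid_to_patient(pid)
```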
▼ ▼ ▼
Integration Patterns — EHR to Google Cloud

Real-Time Streaming Primary

HL7v2 messages streamed as they occur for near-zero-latency ingestion. Supports ADT, ORU, ORM events with sub-second delivery.

EHR Interface Engine
MLLP/HTTPS Adapter
Cloud Healthcare API (HL7v2 Store)
Pub/Sub Notification
Dataflow Pipeline

FHIR Native Modern EHRs

EHRs with FHIR R4 APIs (Epic USCDI, Cerner Ignite) push resources directly, bypassing HL7v2 translation.

EHR FHIR Server
SMART Backend Auth
Cloud Healthcare API (FHIR Store)
BigQuery Streaming

Interface Engine Mediated Common

Rhapsody, Mirth Connect, or InterSystems HealthShare handles routing, filtering, and protocol translation before cloud ingestion.

EHR (HL7v2)
Mirth / Rhapsody / HealthShare
Transform & Route
Cloud Healthcare API

Batch / Bulk Export Historical

Initial data migration and periodic bulk refreshes using FHIR $export or flat-file extracts for historical backfill.

EHR Bulk Export / CSV
Cloud Storage (GCS)
Dataflow Batch Job
BigQuery (Raw Zone)
▼ ▼ ▼
Typical Data Volume & Throughput (Large Health System)
2-5M
HL7v2 Messages / Day
ADT + ORM + ORU combined
50-200
Messages / Second (Peak)
Morning admit surge, shift changes
1-4 KB
Avg Message Size
ORU with embedded results can be 10-50 KB
15-30
Segments per Message (Avg)
Repeating OBX for multi-result ORU
< 2s
End-to-End Latency Target
EHR event → BigQuery availability
99.99%
Uptime Requirement
Clinical systems = mission-critical
▼ ▼ ▼
Key Challenges & Considerations

Version Variability

HL7v2 versions 2.3 through 2.8 coexist. Z-segments (custom extensions) vary per vendor and site, requiring per-source mapping.

Patient Identity

MRN fragmentation across facilities. Requires MPI (Master Patient Index) or EMPI resolution before deduplication in the lakehouse.

Terminology Gaps

Local codes vs. standard terminologies (LOINC, SNOMED, RxNorm). Healthcare Data Engine handles mapping but requires curation.

Message Ordering

Out-of-order delivery and duplicate messages. Pipeline must handle idempotency, sequencing, and late-arriving corrections (A08).

PHI & Compliance

Every message contains PHI. Must enforce encryption in transit (TLS/MLLP-S), at rest (CMEK), and de-identification for research.

Downtime & Recovery

EHR downtimes require message queuing and replay. Dead-letter queues and reconciliation jobs ensure zero data loss.

▼ ▼ ▼

Downstream: Into the Unified Pipeline

Once ingested and parsed, EHR/HL7v2 data flows through harmonization into the lakehouse, becoming AI-ready records that power clinical agents, analytics, and research.

HL7v2 Store
Pub/Sub
Dataflow (Harmonize)
BigQuery Raw Zone
Curated Zone
Enriched + Embeddings
AI Agents

Imaging / DICOM — Deep Dive

Medical imaging systems communicating via DICOM protocol — the primary source for radiology, cardiology, and pathology pixel data feeding the unified pipeline.

Imaging Source Systems

GE Healthcare PACS

Centricity / Edison
Enterprise imaging archive


Philips PACS

IntelliSpace PACS
Multi-modality support


Siemens Healthineers

syngo.via / teamplay
AI-ready platform


VNA (Vendor Neutral Archive)

Hyland / Fuji / IBM
Long-term image storage

Modalities

CT, MRI, US, XR, PET
Mammo, Path slides

Specialty Systems

Cardiology CVIS, Derm
Ophthalmology, Dental

▼ ▼ ▼ ▼ ▼
Generate DICOM objects on acquisition / post-processing
DICOM Object Model — Hierarchy

Patient → Study → Series → Instance

Patient PatientID | PatientName | DOB | Sex
Study StudyInstanceUID | StudyDate | AccessionNumber | ReferringPhysician | StudyDescription
Series SeriesInstanceUID | Modality | SeriesNumber | BodyPartExamined | SeriesDescription
Instance SOPInstanceUID | SOPClassUID | InstanceNumber | PixelData | TransferSyntax
Patient Level
Study Level
Series Level
Instance Level
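The Patient → Study → Series → Instance hierarchy above is reconstructed from the three UIDs. A sketch, assuming instance metadata has already been extracted into plain dicts (the keys mirror the DICOM attribute names; this is not a DICOM file parser):

```python
# Sketch: group flat instance metadata into Study -> Series -> Instance
# using StudyInstanceUID / SeriesInstanceUID / SOPInstanceUID. UIDs are fake.
from collections import defaultdict

def build_hierarchy(instances: list[dict]) -> dict:
    studies: dict = defaultdict(lambda: defaultdict(list))
    for inst in instances:
        studies[inst["StudyInstanceUID"]][inst["SeriesInstanceUID"]].append(
            inst["SOPInstanceUID"]
        )
    return {study: dict(series) for study, series in studies.items()}

instances = [
    {"StudyInstanceUID": "1.2.3", "SeriesInstanceUID": "1.2.3.1", "SOPInstanceUID": "1.2.3.1.1"},
    {"StudyInstanceUID": "1.2.3", "SeriesInstanceUID": "1.2.3.1", "SOPInstanceUID": "1.2.3.1.2"},
    {"StudyInstanceUID": "1.2.3", "SeriesInstanceUID": "1.2.3.2", "SOPInstanceUID": "1.2.3.2.1"},
]
tree = build_hierarchy(instances)   # one study, two series, three instances
```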
▼ ▼ ▼
Common SOP Classes

CT Image Storage

1.2.840.10008.5.1.4.1.1.2
  • Axial slices, 512x512 typical
  • Hounsfield units, 16-bit depth
  • 100-5000+ slices per study

MR Image Storage

1.2.840.10008.5.1.4.1.1.4
  • Multiple sequences (T1, T2, FLAIR)
  • Variable matrix sizes
  • Multi-planar reconstructions

US Image Storage

1.2.840.10008.5.1.4.1.1.6.1
  • Multi-frame / cine clips
  • Doppler overlays
  • Measurements embedded

Secondary Capture

1.2.840.10008.5.1.4.1.1.7
  • Scanned documents, screenshots
  • ECG waveform captures
  • Non-DICOM source images

Structured Report (SR)

1.2.840.10008.5.1.4.1.1.88.x
  • Coded measurements & findings
  • CAD results, dose reports
  • AI inference outputs

Presentation State

1.2.840.10008.5.1.4.1.1.11.x
  • Window/level settings
  • Annotations, overlays
  • Hanging protocol references
▼ ▼ ▼
Key DICOM Tags
Tag Name Level Notes
(0010,0020) PatientID Patient MRN; critical for cross-system matching
(0020,000D) StudyInstanceUID Study Globally unique study identifier
(0020,000E) SeriesInstanceUID Series Groups images by acquisition sequence
(0008,0018) SOPInstanceUID Instance Unique per image/object
(0008,0060) Modality Series CT, MR, US, XR, PT, MG, SM
(0008,0020) StudyDate Study Date of imaging examination
(0008,0090) ReferringPhysician Study Ordering clinician name
(0018,0015) BodyPartExamined Series CHEST, HEAD, ABDOMEN, etc.
(7FE0,0010) PixelData Instance Bulk pixel data; largest element
(0002,0010) TransferSyntaxUID Meta Encoding: Explicit VR, JPEG2000, etc.
▼ ▼ ▼
DICOM to FHIR R4 Resource Mapping
DICOM Source FHIR R4 Resource Key Fields Mapped
Study ImagingStudy StudyInstanceUID, modality list, numberOfSeries/Instances, started, endpoint
Structured Report (SR) DiagnosticReport / Observation Coded findings, measurements, conclusion, performer
Patient Tags Patient PatientID, name, DOB, gender mapped to FHIR Patient resource
Order (AccessionNumber) ServiceRequest Accession, requested procedure, referring physician, priority
Series ImagingStudy.series Modality, body site, laterality, number of instances, UID
Instance ImagingStudy.series.instance SOPClass, instance number, WADO-RS endpoint for retrieval
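Following the Study and Series rows of the table, a minimal ImagingStudy can be assembled from extracted metadata. A hedged sketch — field names and values are illustrative, and a real mapping would also populate `endpoint` with the WADO-RS URL of the DICOM Store:

```python
# Sketch: build a minimal FHIR R4 ImagingStudy from study/series metadata.
# All UIDs, dates, and counts below are made up.

def to_imaging_study(study: dict, series: list) -> dict:
    return {
        "resourceType": "ImagingStudy",
        "identifier": [{"system": "urn:dicom:uid",
                        "value": "urn:oid:" + study["StudyInstanceUID"]}],
        "started": study["StudyDate"],
        "numberOfSeries": len(series),
        "numberOfInstances": sum(s["numInstances"] for s in series),
        "series": [
            {"uid": s["SeriesInstanceUID"],
             "modality": {"code": s["Modality"]},
             "numberOfInstances": s["numInstances"]}
            for s in series
        ],
    }

study = {"StudyInstanceUID": "1.2.3", "StudyDate": "2024-01-01"}
series = [{"SeriesInstanceUID": "1.2.3.1", "Modality": "CT", "numInstances": 512}]
res = to_imaging_study(study, series)
```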
▼ ▼ ▼
Integration Patterns — Imaging to Google Cloud

DICOMweb to Cloud Healthcare API Primary

Native DICOMweb (STOW-RS, WADO-RS, QIDO-RS) direct to Cloud Healthcare API DICOM Store. RESTful, standards-based.

PACS / VNA
DICOMweb (STOW-RS)
Cloud Healthcare API (DICOM Store)
Pub/Sub Notification

DIMSE Gateway Legacy

For legacy PACS using C-STORE/C-FIND. DIMSE proxy translates traditional DICOM networking to DICOMweb for cloud ingestion.

Legacy PACS (DIMSE)
DICOM Gateway (C-STORE SCP)
STOW-RS Adapter
Cloud Healthcare API

Cloud Storage Bulk Import Migration

Bulk DICOM archive migration. Upload Part 10 files to Cloud Storage, then import into DICOM Store via batch job.

DICOM Archive (Part 10)
Cloud Storage (GCS)
DICOM Store Import
Cloud Healthcare API

Pub/Sub Event-Driven Automation

Pub/Sub notifications on new DICOM instances trigger downstream pipelines: metadata extraction, de-identification, AI inference.

DICOM Store
Pub/Sub (new study event)
Cloud Functions / Dataflow
Vertex AI Inference
▼ ▼ ▼
Typical Data Volume & Throughput (Large Health System)
2-5 TB
New Imaging / Day
All modalities combined
500K-1M
Studies / Year
Across all departments
50-500 MB
Per Study (Avg)
CT thin-slice can exceed 1 GB
PB-Scale
Archive Size
10+ years of historical data
100-1000
Instances / Study (Avg)
CT: 500-5000 slices
< 30s
Ingestion Latency Target
Study available for AI after store
▼ ▼ ▼
Key Challenges & Considerations

Large File Sizes

CT/MR studies can be 1+ GB. Requires chunked uploads, resumable transfers, and efficient network utilization for cloud migration.

Compression Trade-offs

Lossy vs. lossless compression (JPEG2000, JPEG-LS). Lossy acceptable for viewing but not for AI training or primary diagnosis.

Burned-In PHI

Patient name/DOB baked into pixel data (ultrasound overlays, scanned docs). Requires OCR-based pixel scrubbing for de-identification.

Multi-Frame & Cine

Ultrasound clips, cardiac cine MRI, fluoroscopy — multi-frame objects need special handling for storage, viewing, and AI processing.

AI on Pixel Data

Vertex AI inference requires pixel extraction, normalization, and pre-processing. Transfer syntax conversion may be needed.

Cross-Site Reconciliation

Patients imaged at multiple facilities. StudyInstanceUIDs differ; requires MPI matching and study linking across PACS systems.

▼ ▼ ▼

Downstream: Into the Unified Pipeline

Imaging data splits into metadata (structured) and pixel data (binary) paths, converging in the lakehouse for AI-powered radiology and clinical analytics.

DICOM Store
Pub/Sub
Dataflow (Metadata Extract)
BigQuery (Metadata)
+
Cloud Storage (Pixel Data)
Vertex AI (Imaging AI)

Labs / LIS — Deep Dive

Laboratory Information Systems generating orders, specimens, and results via HL7v2 and FHIR — the highest-volume discrete clinical data source.

Lab Source Systems

Sunquest

Enterprise LIS
Chemistry, Heme, Micro, BB

Beaker (Epic)

Integrated with Epic EHR
AP & CP modules

Cerner PathNet

Oracle Health LIS
General & Anatomic Path

SoftLab / MEDITECH

MEDITECH integrated lab
Expanse & legacy


Reference Labs

Quest Diagnostics, LabCorp
Send-out results via HL7v2

POC Devices

i-STAT, glucometers, ABG
Bedside testing, rapid results

▼ ▼ ▼ ▼ ▼
Generate lab orders, specimens, and results
Lab Data Model — Order-to-Result Hierarchy

Order → Specimen → Result → Component

Order OrderID | OrderCode | OrderingProvider | Priority | OrderDateTime
Specimen SpecimenID | Type (Blood/Urine/CSF) | CollectionTime | Source (Venous/Arterial)
Result TestCode (LOINC) | TestName | Status (P/F/C) | ResultDateTime
Component Value (7.4) | Units (mg/dL) | RefRange (3.5-10.5) | Flag (H/L/A/C)
Order Level
Specimen Level
Result Level
Component Level
▼ ▼ ▼
Lab Order Types & Departments

Chemistry

BMP, CMP, LFTs, Lipids, A1c
  • Highest volume department
  • Discrete numeric results
  • Automated analyzer output

Hematology

CBC, Diff, Coags (PT/INR, PTT)
  • Multi-component panels
  • Automated cell counts
  • Manual differential when flagged

Microbiology

Culture & Sensitivity, AFB, Fungal
  • Progressive results over days
  • Organism ID + antibiotic MICs
  • Complex multi-step workflow

Blood Bank

Type & Screen, Crossmatch, Antibody ID
  • Critical for transfusion safety
  • Antigen/antibody panels
  • Regulatory traceability required

Anatomic Pathology

Surgical Path, Cytology, Autopsy
  • Narrative & synoptic reports
  • IHC stains, special stains
  • Cancer staging (CAP protocols)

Molecular / Genetics

NGS Panels, PCR, FISH, Karyotype
  • Variant-level results
  • Pharmacogenomics (PGx)
  • Turnaround: days to weeks
▼ ▼ ▼
HL7v2 Lab Messages — OBR + OBX Segment Structure

Example: ORU^R01 (Lab Result)

MSH LIS | LAB_FAC | EHR | ORU^R01 | 2.5.1
PID MRN | Name | DOB | Sex
ORC OrderControl (RE) | PlacerOrderNum | FillerOrderNum | OrderStatus
OBR UniversalServiceID (LOINC) | RequestedDateTime | ObservationDateTime | ResultStatus (F/P/C)
OBX-1 NM | 2823-3 (K+) | 4.2 | mmol/L | 3.5-5.1 | N
OBX-2 NM | 2951-2 (Na+) | 148 | mmol/L | 136-145 | H
OBX-3 NM | 2160-0 (Creat) | 2.8 | mg/dL | 0.7-1.3 | HH
MSH — Header
PID — Patient
ORC — Order Control
OBR — Observation Request (Panel)
OBX — Observation Result (per analyte)
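The repeating OBX pattern above is what the harmonization pipeline flattens into one row per analyte. A sketch using plain field splitting (the panel LOINC code in the sample is illustrative; a real parser would also carry OBX-7 reference ranges and handle component/sub-ID repetition):

```python
# Sketch: flatten the repeating OBX segments of an ORU^R01 into result rows
# keyed to their OBR panel. Sample data is fake; codes 2823-3 / 2951-2 are
# the LOINC K+ / Na+ codes used in the example above.

def extract_results(message: str) -> list[dict]:
    rows, panel = [], None
    for seg in message.strip().split("\r"):
        f = seg.split("|")
        if f[0] == "OBR":
            panel = f[4].split("^")[0]        # OBR-4 universal service ID
        elif f[0] == "OBX":
            rows.append({
                "panel": panel,
                "loinc": f[3].split("^")[0],  # OBX-3 observation identifier
                "value": f[5],                # OBX-5 value
                "units": f[6],                # OBX-6 units
                "flag": f[8],                 # OBX-8 abnormal flag
            })
    return rows

ORU = "\r".join([
    "MSH|^~\\&|LIS|LAB|EHR|HOSP|20240101||ORU^R01|MSG0002|P|2.5.1",
    "PID|1||123456^^^HOSP^MR||DOE^JANE",
    "OBR|1|||BMP^Basic Metabolic Panel",     # panel code illustrative
    "OBX|1|NM|2823-3^K+|1|4.2|mmol/L|3.5-5.1|N|||F",
    "OBX|2|NM|2951-2^Na+|1|148|mmol/L|136-145|H|||F",
])
results = extract_results(ORU)
```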
▼ ▼ ▼
Lab to FHIR R4 Resource Mapping
HL7v2 Segment FHIR R4 Resource Code System Key Fields Mapped
OBR DiagnosticReport LOINC Panel code, status, effectiveDateTime, performer, conclusion
OBX Observation LOINC Code, valueQuantity, referenceRange, interpretation (H/L/A), status
SPM Specimen SNOMED Type, collection dateTime, source site, condition, container
ORC + OBR ServiceRequest LOINC / local Code, requester, priority, status, authoredOn, specimen requirements
OBX (Micro) Observation (component) SNOMED Organism, antibiotic, MIC value, interpretation (S/I/R)
OBX (AP narrative) DiagnosticReport.presentedForm LOINC Pathology report text (synoptic/narrative), attachment
▼ ▼ ▼
Integration Patterns — Lab to Google Cloud

HL7v2 Streaming Primary

ORU^R01 results streamed in real time. ORM^O01 orders captured for order-result linkage. Sub-second delivery for critical values.

LIS Interface Engine
MLLP Adapter
Cloud Healthcare API (HL7v2 Store)
Pub/Sub → Dataflow

LIS FHIR APIs Modern

Modern LIS platforms expose FHIR R4 endpoints. DiagnosticReport and Observation resources pulled or pushed directly.

LIS FHIR Server
SMART Backend Auth
Cloud Healthcare API (FHIR Store)
BigQuery Export

Reference Lab Results External

Quest, LabCorp, and specialty reference labs return results via HL7v2 or FHIR. Routed through interface engine for normalization.

Reference Lab (Quest/LabCorp)
HL7v2/FHIR
Cloud Healthcare API
Dataflow Pipeline

POC Device Data Bedside

Point-of-care devices (i-STAT, glucometers) transmit results via device middleware to Pub/Sub for real-time capture.

POC Devices
Device Middleware
Pub/Sub
Dataflow → BigQuery
▼ ▼ ▼
Typical Data Volume & Throughput (Large Health System)
500K-2M
Results / Day
Individual OBX observations
10-50
OBX per Panel
CMP=14, CBC+Diff=20+
< 5 min
Critical Result Latency
K+ > 6.0, Troponin > threshold
3-7 days
Micro Culture Duration
Progressive preliminary results
50-100
Messages / Second (Peak)
Morning draw results arrive 7-10 AM
1-4 KB
Avg ORU Message Size
Micro/AP can be 10-50 KB narrative
▼ ▼ ▼
Key Challenges & Considerations

Amendments & Corrections

Result status F → C (corrected). Pipeline must handle updates, maintain audit trail of original vs. corrected values.

Micro Progressive Results

Culture results arrive over days: preliminary → organism ID → susceptibilities. Must link all updates to single order.

Delta Checks & Critical Values

Real-time alerting on critical values (K+ > 6.5, Hgb < 7). Delta checks detect instrument errors. Pipeline must support < 5 min latency.

Discrete vs. Narrative

Chemistry/heme = discrete numeric. Pathology = narrative text. NLP/embeddings required for AP reports to be AI-queryable.

LOINC Mapping Completeness

Local test codes must map to LOINC for interoperability. 60-80% auto-mapped; remainder requires manual curation. Ongoing maintenance.

Reference Range Variability

Ranges differ by lab, instrument, age, sex. Must capture per-result ranges, not global defaults, for accurate interpretation.
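The "Amendments & Corrections" requirement above (F → C with an audit trail) can be sketched as an upsert that never discards the superseded value. The dict stands in for durable storage; in practice this would be a BigQuery table with an amendment-history column, keyed on filler order number plus analyte:

```python
# Sketch: apply a corrected lab result (status F -> C) while preserving the
# original value for audit. Key and values below are made up.

def apply_result(store: dict, key: tuple, value: str, status: str) -> None:
    cur = store.get(key)
    if cur is None:
        store[key] = {"value": value, "status": status, "history": []}
    else:
        # Push the superseded value onto the audit trail before overwriting.
        cur["history"].append({"value": cur["value"], "status": cur["status"]})
        cur.update(value=value, status=status)

store: dict = {}
key = ("FILLER123", "2160-0")          # filler order number + creatinine LOINC
apply_result(store, key, "2.8", "F")   # final result arrives
apply_result(store, key, "1.8", "C")   # correction supersedes, original kept
```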

▼ ▼ ▼

Downstream: Into the Unified Pipeline

Lab data flows into structured BigQuery tables for discrete results and embedding pipelines for narrative pathology reports, powering clinical decision support and AI agents.

HL7v2 Store
Dataflow
BigQuery (Structured Results)
+
Embeddings (Narrative Path)
Vertex AI Search
AI Agents

Wearables / IoT — Deep Dive

Consumer wearables, remote patient monitoring devices, and hospital IoT sensors generating continuous time-series health data at massive scale.

Device Categories

Continuous Glucose Monitors

Dexcom G7, Abbott Libre
Reading every 5 min, 288/day

Cardiac Monitors

Apple Watch, AliveCor, Zio Patch
ECG, rhythm detection


Activity Trackers

Fitbit, Garmin, Oura
Steps, sleep, HRV, calories

RPM Kits

BP cuffs, pulse ox, scales
Cellular-connected home devices


Hospital Bedside Monitors

Philips IntelliVue, GE CARESCAPE
HR, SpO2, BP, temp, 1/sec

Infusion Pumps & Vents

Alaris, Baxter, Draeger
Drug rates, vent settings, alarms

▼ ▼ ▼ ▼ ▼
Continuous streaming of vitals, waveforms, and activity data
Data Types & Formats

Time-Series Vitals

HR, SpO2, BP, Temp, Glucose
  • Numeric readings with timestamps
  • 1/sec (hospital) to 1/5min (CGM)
  • FHIR Observation with effectiveDateTime

Waveforms

ECG (250-500Hz), EEG, Pleth
  • High-frequency continuous data
  • Multi-lead (12-lead ECG)
  • IEEE 11073 / SCP-ECG format

Activity Metrics

Steps, Sleep Stages, Calories, HRV
  • Aggregated epochs (1-min, 5-min)
  • Apple HealthKit / Google Health Connect
  • Daily summaries + granular data

Alerts & Alarms

Threshold violations, arrhythmia, apnea
  • Real-time event notifications
  • Severity levels (advisory/warning/crisis)
  • Alarm context (parameter, limit, value)
▼ ▼ ▼
Standards & Protocols

Device Data Pathway

Protocol IEEE 11073 | BLE Health Profiles | HL7v2 ORU | FHIR Observation
Platform Apple HealthKit | Google Health Connect | Manufacturer Cloud API
Payload DeviceID | PatientID | Timestamp (UTC) | Metric Code (LOINC) | Value + Units
Meta Device Model | Firmware Version | Battery Level | Signal Quality
Transport Protocol
Platform / Aggregator
Observation Payload
Device Metadata
▼ ▼ ▼
Ingestion Patterns — Devices to Google Cloud

Manufacturer Cloud API Consumer

Dexcom, Fitbit, Withings expose REST APIs. Cloud Functions poll or receive webhooks, normalize, and push to Pub/Sub.

Device → Mfg Cloud
Cloud Functions (webhook/poll)
Pub/Sub
Dataflow Streaming

Hospital IoT Gateway Inpatient

Bedside monitors → local IoT gateway (Capsule, Bernoulli) → HL7v2 or MQTT to Pub/Sub for real-time streaming.

Bedside Monitor
IoT Gateway (On-Prem)
Pub/Sub
Dataflow → Bigtable

Patient App / FHIR RPM

Patient-facing apps write FHIR Observation resources (BP, weight, glucose) directly to Cloud Healthcare API FHIR Store.

Patient App
FHIR Observation
Cloud Healthcare API (FHIR Store)
BigQuery Export

Bulk CSV / JSON Historical

Batch export from device platforms (Fitbit data export, CGM CSV downloads). Cloud Storage → Dataflow batch processing.

CSV / JSON Export
Cloud Storage (GCS)
Dataflow Batch
BigQuery
▼ ▼ ▼
Time-Series Processing on GCP

Dataflow Streaming Pipeline

Ingest Pub/Sub (raw readings) Dataflow: Parse & Validate Dedup & Timestamp Align
Process Downsample (5-min windows) Noise Filter / Smooth Anomaly Detection (z-score)
Store Bigtable (raw high-freq) + BigQuery (aggregated) + Cloud Storage (waveforms)
Ingestion Layer
Stream Processing
Computation
Hot Storage
Analytical Store
Cold Storage
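The "Process" row above (downsample to 5-minute windows, then detect anomalies) can be sketched in plain Python. This stands in for the Dataflow windowing and z-score stages; window size, threshold, and readings are illustrative:

```python
# Sketch: downsample 1/sec vitals into 5-minute mean windows, then flag
# outliers by z-score. In the pipeline this runs as Dataflow transforms.
from statistics import mean, stdev

def downsample(readings, window_s=300):
    """readings: [(epoch_seconds, value)] -> sorted [(window_start, mean_value)]"""
    buckets = {}
    for ts, v in readings:
        buckets.setdefault(ts - ts % window_s, []).append(v)
    return sorted((w, mean(vs)) for w, vs in buckets.items())

def zscore_flags(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    if len(values) < 3 or stdev(values) == 0:
        return [False] * len(values)
    m, s = mean(values), stdev(values)
    return [abs(v - m) / s > threshold for v in values]

hr = [(t, 72) for t in range(0, 600)]   # 10 minutes of steady HR=72, 1/sec
windows = downsample(hr)                # -> two 5-minute windows
```

A threshold of 3.0 is a common starting point; in practice it would be tuned per parameter and patient to manage the alert-fatigue problem noted below.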
▼ ▼ ▼
Typical Data Volume & Throughput
288
CGM Readings / Day
1 reading every 5 minutes
250-500 Hz
ECG Waveform Rate
12-lead = 3000-6000 samples/sec
1/sec
Hospital Monitor Rate
HR, SpO2, BP per patient per sec
Millions
Data Points / Day (RPM)
Thousands of patients in program
10-100 GB
Daily Waveform Data
ICU with 50+ monitored beds
< 5s
Alert Latency Target
Critical deterioration detection
▼ ▼ ▼
Key Challenges & Considerations

Data Quality & Noise

Motion artifacts, poor sensor contact, environmental interference. Requires signal quality scoring and filtering before clinical use.

Connectivity Gaps

Bluetooth dropouts, Wi-Fi dead zones, cellular coverage gaps. Must handle store-and-forward with gap reconciliation.

Timestamp Synchronization

Device clocks drift. Multiple devices per patient with different time sources. Must normalize to UTC with known accuracy.

Alert Fatigue

90%+ of monitor alarms are non-actionable. AI must filter noise, detect true deterioration patterns, and suppress false positives.

Patient Compliance

Wearable adherence drops over time. Missing data windows must be flagged, not treated as normal. Engagement tracking needed.

Massive Volume, Low Signal

99%+ of readings are normal. Storage cost optimization via tiered storage (hot/warm/cold) and intelligent downsampling is essential.

▼ ▼ ▼

Downstream: Into the Unified Pipeline

Device data streams through real-time processing into dual storage (Bigtable for raw, BigQuery for aggregated), feeding Vertex AI for anomaly detection and clinical deterioration alerting.

Pub/Sub
Dataflow (Filter/Aggregate)
BigQuery + Bigtable
Vertex AI (Anomaly Detection)
Clinical Agents (Alerts)

Genomics — Deep Dive

Sequencing platforms and bioinformatics pipelines generating variant calls, gene expression profiles, and pharmacogenomic data for precision medicine.

Source Systems

Illumina NovaSeq / NextSeq

Short-read sequencing
WGS, WES, targeted panels


PacBio Revio

Long-read HiFi sequencing
Structural variants, phasing


Oxford Nanopore

Real-time long-read
Rapid turnaround, portable

Bioinformatics Pipelines

GATK, BWA-MEM2, DRAGEN
Alignment + variant calling


LIMS & Tumor Boards

Sample tracking, clinical
interpretation, reporting

Pharmacogenomics

PGx platforms (CPIC)
Drug-gene interaction testing

▼ ▼ ▼ ▼ ▼
Generate raw reads, aligned sequences, variant calls, and clinical reports
Data Types & Formats

FASTQ (Raw Reads)

@readID / sequence / +quality
  • Raw base calls from sequencer
  • WGS: ~100 GB per sample
  • Paired-end (R1 + R2 files)

BAM / CRAM (Aligned)

Binary Alignment Map / Compressed
  • Reads aligned to reference genome
  • BAM: 50-100 GB; CRAM: 30-60 GB
  • Indexed for region queries (.bai)

VCF / gVCF (Variants)

CHROM POS ID REF ALT QUAL FILTER INFO
  • SNVs, indels, structural variants
  • VCF: 100-500 MB per WGS
  • gVCF includes reference confidence

RNA-seq / Expression

Gene expression quantification
  • TPM / FPKM normalized counts
  • Differential expression analysis
  • Fusion gene detection

NGS Panels

50-500 gene targeted panels
  • Oncology (Foundation, Tempus)
  • Hereditary risk (BRCA, Lynch)
  • Carrier screening panels

PGx Results

Star alleles: CYP2D6 *1/*4
  • Metabolizer phenotype classification
  • Drug-gene interaction pairs
  • CPIC guideline recommendations
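The VCF record layout (CHROM POS ID REF ALT QUAL FILTER INFO) can be illustrated with a minimal line parser. A sketch only — real workflows use Variant Transforms or pysam, and the record below is fabricated:

```python
# Sketch: parse the eight fixed VCF columns of a single data line.
# Handles multi-allelic ALT and key=value INFO pairs; ignores samples.

def parse_vcf_line(line: str) -> dict:
    chrom, pos, vid, ref, alt, qual, filt, info = line.split("\t")[:8]
    return {
        "chrom": chrom, "pos": int(pos), "id": vid,
        "ref": ref, "alt": alt.split(","),
        "qual": float(qual), "filter": filt,
        "info": dict(kv.split("=", 1) if "=" in kv else (kv, True)
                     for kv in info.split(";")),
    }

rec = parse_vcf_line("chr1\t12345\t.\tA\tG\t99\tPASS\tAF=0.01;DP=120")
```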
▼ ▼ ▼
Genomics Processing on GCP

End-to-End Pipeline: Sequencer → Clinical Insight

Raw Data FASTQ / BAM Upload Cloud Storage (Multi-Region) Lifecycle to Nearline/Archive
Pipeline Cloud Batch BWA-MEM2 (Align) GATK / DeepVariant (Call) VCF Output
Annotate Variant Transforms BigQuery (Variants Table) ClinVar + gnomAD Annotation
Cohort Hail on Dataproc GWAS / Burden Tests Population Analytics
Storage & Lifecycle
Pipeline Execution
Annotation & Query
Cohort Analysis
▼ ▼ ▼
Genomics to FHIR R4 Resource Mapping
Genomic Source FHIR R4 Resource Key Fields Mapped
VCF Variant Observation (variant) Gene, DNA change (HGVS), protein change, zygosity, allele frequency
Sequence Data MolecularSequence Reference sequence, coordinate system, quality scores, repository
PGx Star Alleles Observation (haplotype) Gene (CYP2D6), allele name (*1/*4), metabolizer phenotype
PGx Recommendation Task (medication-recommendation) Drug, action (adjust dose/avoid), evidence level, CPIC guideline
Clinical Report DiagnosticReport (genetics) Conclusion, variant list, interpretation (P/LP/VUS/LB/B), performer
Panel / Test Order ServiceRequest Panel code, specimen, requester, reason (condition), priority
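The PGx rows above (star-allele diplotype → metabolizer phenotype) follow the CPIC activity-score convention. A sketch with a deliberately truncated allele table — production systems should use CPIC's published allele-functionality tables, which are periodically revised:

```python
# Sketch: CYP2D6 diplotype -> metabolizer phenotype via activity scores.
# Allele scores and phenotype cut points follow the CPIC convention, but
# this table is illustrative and incomplete.

ACTIVITY = {"*1": 1.0, "*2": 1.0, "*4": 0.0, "*5": 0.0, "*10": 0.25, "*17": 0.5}

def cyp2d6_phenotype(diplotype: str) -> str:
    a1, a2 = diplotype.split("/")
    score = ACTIVITY[a1] + ACTIVITY[a2]
    if score == 0:
        return "Poor Metabolizer"
    if score < 1.25:
        return "Intermediate Metabolizer"
    if score <= 2.25:
        return "Normal Metabolizer"
    return "Ultrarapid Metabolizer"

pm = cyp2d6_phenotype("*4/*4")   # no functional allele
im = cyp2d6_phenotype("*1/*4")   # one functional allele
```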
▼ ▼ ▼
Integration Patterns — Genomics to Google Cloud

Raw File Storage Foundation

FASTQ and BAM files uploaded to Cloud Storage with lifecycle policies. Multi-region for durability, Nearline/Archive for cost optimization.

Sequencer Output
gsutil / Transfer Service
Cloud Storage (Standard)
Lifecycle → Nearline/Archive

Pipeline Execution Compute

Cloud Batch runs GATK/DeepVariant workflows. Auto-scaling VMs, preemptible instances for cost. WDL/Nextflow orchestration.

Cloud Storage (FASTQ)
Cloud Batch (GATK/DeepVariant)
Cloud Storage (VCF/BAM)
Variant Transforms

Variant Analysis in BigQuery Analytics

Variant Transforms loads VCF into BigQuery. Join with ClinVar, gnomAD for annotation. SQL-based variant filtering and cohort queries.

VCF Files
Variant Transforms
BigQuery (Variants + Annotations)
Cohort Analytics

Cohort Analysis with Hail Population

Hail on Dataproc for large-scale cohort analysis: GWAS, burden tests, PCA. Scales to millions of variants across thousands of samples.

BigQuery / VCF
Dataproc (Hail)
GWAS / PCA / Burden
Results → BigQuery
▼ ▼ ▼
Typical Data Volume (Large Academic Center)
~100 GB
FASTQ per WGS Sample
30x coverage, paired-end
100-500 MB
VCF per WGS Sample
4-5M variants per genome
5-10 GB
WES per Sample
~60K variants (exome only)
1-5 GB
NGS Panel per Sample
50-500 genes, high depth
10K-50K
Samples / Year
Large academic / research center
2-24 hrs
Pipeline Runtime
WGS alignment + variant calling
▼ ▼ ▼
Key Challenges & Considerations

Massive File Sizes

Single WGS = 100+ GB raw. 50K samples/year = 5+ PB. Requires tiered storage, compression (CRAM), and efficient transfer.

Long Pipeline Runtimes

WGS alignment + calling: 2-24 hours per sample. Requires auto-scaling compute (Cloud Batch) and spot/preemptible instances for cost.

Variant Interpretation (VUS)

40-60% of variants are VUS (Variants of Uncertain Significance). Requires ongoing reclassification as databases update.

Re-Analysis Requirements

As reference databases (ClinVar, gnomAD) update, prior results need re-annotation. Must maintain pipeline versioning and audit trail.

Consent & Return of Results

Incidental findings, right not to know, family implications. Consent management and result disclosure policies vary by institution.

Population Reference Gaps

Reference genomes biased toward European ancestry. gnomAD coverage varies by population. Equity implications for variant calling accuracy.

▼ ▼ ▼

Downstream: Into the Unified Pipeline

Genomic data flows from raw storage through compute pipelines into BigQuery variant tables, joined with knowledge graphs for AI-powered PGx recommendations and tumor profiling.

Cloud Storage (Raw)
Cloud Batch (Pipelines)
BigQuery (Variants)
Knowledge Graph (ClinVar, gnomAD)
AI Agents (PGx, Tumor Profiling)

Claims / SDoH — Deep Dive

Administrative claims data and Social Determinants of Health providing the financial, utilization, and social context layer for population health and equity analytics.

Source Systems — Claims

Clearinghouses

Availity, Change Healthcare
X12 837/835 transaction hub


Medicare / Medicaid

CMS Blue Button 2.0 (FHIR)
State Medicaid feeds


Commercial Payers

UHC, Anthem, Aetna, Cigna
EDI 837P/837I feeds

State HIEs

Regional health exchanges
ADT notifications, claims

Source Systems — SDoH

Census / ACS Data

Demographics, income, education
By FIPS / ZIP / tract


USDA Food Access

Food desert research atlas
Low access / low income tracts


CDC PLACES / SVI

County health estimates
Social Vulnerability Index


ADI / HUD / 211

Area Deprivation Index
Housing, social services

▼ ▼ ▼ ▼ ▼
Claims transactions + geocoded social determinant indices
Claims Data Model — X12 Transactions

X12 837 Professional / Institutional Claim Structure

Header SubmitterID | ReceiverID | TransactionDate | ClaimType (P/I)
Patient MemberID | Name | DOB | GroupNumber | PayerID
Claim ClaimID | DOS (From-To) | PlaceOfService | TotalCharge | DRG
DX Codes ICD-10-CM (Primary) | ICD-10 (DX2..DX12) | ICD-10-PCS (Procedures)
Lines CPT/HCPCS Code | Modifiers | Units | Allowed Amount | NPI (Rendering)
Remit PaidAmount | AdjustmentReason | PatientResponsibility | CheckDate
Transaction Header
Patient / Member
Claim Level
Diagnosis Codes
Service Lines
Remittance (835)
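The 837 structure above can be sketched as a minimal segment walk. This is illustrative only: a production parser is loop-aware (2000A/2300/2400 loops) and reads the actual segment/element delimiters from the ISA header rather than assuming `~` and `*`. The sample transaction and field positions shown are simplified.

```python
# Minimal X12 837 segment walk (illustrative; real 837s need loop-aware
# parsing, and the ISA header can override the '~' / '*' delimiters assumed here).
def parse_837_claims(x12: str) -> list[dict]:
    claims, current = [], None
    for seg in filter(None, (s.strip() for s in x12.split("~"))):
        parts = seg.split("*")
        if parts[0] == "CLM":               # claim level: CLM01 = claim ID, CLM02 = total charge
            current = {"claim_id": parts[1], "total_charge": float(parts[2]),
                       "dx": [], "lines": []}
            claims.append(current)
        elif parts[0] == "HI" and current:  # diagnosis codes, e.g. HI*ABK:E119*ABF:I10
            current["dx"] += [p.split(":")[1] for p in parts[1:] if ":" in p]
        elif parts[0] == "SV1" and current: # professional service line: SV101 = HC:<CPT>
            current["lines"].append(parts[1].split(":")[1])
    return claims

sample = "CLM*A123*450.00~HI*ABK:E119*ABF:I10~SV1*HC:99213*125.00*UN*1~"
print(parse_837_claims(sample))
# [{'claim_id': 'A123', 'total_charge': 450.0, 'dx': ['E119', 'I10'], 'lines': ['99213']}]
```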
▼ ▼ ▼
SDoH Data Model

Z-Codes (ICD-10 Z55-Z65)

Z59.0 Homelessness, Z56.0 Unemployment
  • Captured in EHR problem list / claims
  • Under-coded (5-10% capture rate)
  • Maps to FHIR Condition resource

Screening Tools

AHC-HRSN, PRAPARE, PHQ-2/9
  • Standardized questionnaires
  • Food, housing, transport, safety domains
  • FHIR QuestionnaireResponse

Geocoded Indices

ADI, SVI, Food Access Research Atlas
  • Census tract / ZIP level scores
  • Deprivation, vulnerability rankings
  • Joined by FIPS/ZIP to patient address

SDoH Domains

Gravity Project SDOH Categories
  • Food insecurity
  • Housing instability / homelessness
  • Transportation, employment, education
  • Interpersonal safety, social isolation
▼ ▼ ▼
Claims & SDoH to FHIR R4 Resource Mapping
Source | FHIR R4 Resource | IG / Profile | Key Fields Mapped
X12 837 | Claim | CARIN BB | Type, provider, diagnosis, procedure, total, item lines
X12 835 (EOB) | ExplanationOfBenefit | CARIN BB | Payment, adjudication, adjustments, patient responsibility
X12 270/271 | Coverage | DaVinci PDex | Payor, subscriber, group, period, type, beneficiary
Payer / Provider | Organization | US Core | NPI, name, type, address, active status
SDoH Screening | QuestionnaireResponse | Gravity SDOH | Questionnaire ref, items, answers, authored date
SDoH Need | Condition | Gravity SDOH | Category (sdoh), code (Z-code), clinicalStatus, evidence
SDoH Referral | ServiceRequest / Task | Gravity SDOH | Category, code, status, requester, performer (CBO), for (patient)
▼ ▼ ▼
Integration Patterns — Claims & SDoH to Google Cloud

Claims Flat File Ingestion Primary

X12 837/835 or CSV flat files from clearinghouses. Batch upload to Cloud Storage, parsed by Dataflow, loaded into BigQuery.

Clearinghouse (X12/CSV)
Cloud Storage (GCS)
Dataflow (Parse X12)
BigQuery (Claims Tables)

Payer FHIR APIs CMS Mandate

CMS Blue Button 2.0, payer Patient Access APIs. ExplanationOfBenefit resources pulled via FHIR R4 into Cloud Healthcare API.

Payer FHIR API
Cloud Functions (OAuth2 flow)
Cloud Healthcare API (FHIR Store)
BigQuery Export

SDoH Public Datasets Geocoded

Census/ACS, ADI, SVI, USDA datasets loaded into BigQuery. Joined to patient records by FIPS code, ZIP, or census tract.

Census API / Public Datasets
Cloud Storage / BQ Public
BigQuery (SDoH Tables)
JOIN by FIPS/ZIP
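The FIPS/ZIP join can be sketched in-memory. The index values, tract FIPS codes, and the tract-first/ZIP-fallback rule below are illustrative stand-ins for the BigQuery join against the loaded ADI/SVI tables:

```python
# Hypothetical in-memory sketch of the BigQuery SDoH join: attach an
# area-level ADI decile to each patient by census-tract FIPS, falling
# back to ZIP when geocoding to tract failed. Values are made up.
adi_by_tract = {"17031839100": 7, "17031080100": 2}   # illustrative ADI national deciles
adi_by_zip = {"60622": 5}

def attach_sdoh(patient: dict) -> dict:
    adi = adi_by_tract.get(patient.get("tract_fips"))
    if adi is None:                       # ZIP is only a fallback — tract is more precise
        adi = adi_by_zip.get(patient.get("zip"))
    return {**patient, "adi_decile": adi}

print(attach_sdoh({"patient_id": "p1", "tract_fips": "17031839100", "zip": "60622"}))
print(attach_sdoh({"patient_id": "p2", "zip": "60622"}))
```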

EHR SDoH Screening Clinical

AHC-HRSN / PRAPARE screening responses flow from EHR via HL7v2 or FHIR. Z-codes captured in ADT/DG1 segments.

EHR (HL7v2 / FHIR)
Cloud Healthcare API
Dataflow (Extract Z-codes)
BigQuery (SDoH Screening)
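The Z-code extraction step in that flow can be sketched as a DG1 segment scan. A minimal sketch assuming the standard `|` field and `^` component separators; a real Dataflow DoFn would work on the Healthcare API's parsed-segment JSON rather than raw text:

```python
# Sketch: pull SDoH Z-codes (ICD-10-CM Z55–Z65) out of HL7v2 DG1 segments.
def extract_z_codes(hl7_message: str) -> list[str]:
    codes = []
    for segment in hl7_message.split("\r"):
        fields = segment.split("|")
        if fields[0] == "DG1" and len(fields) > 3:
            code = fields[3].split("^")[0]          # DG1-3.1 = diagnosis code
            if code.startswith("Z") and code[1:3].isdigit() and 55 <= int(code[1:3]) <= 65:
                codes.append(code)
    return codes

msg = ("MSH|^~\\&|EHR|FAC|||20250101||ADT^A01|123|P|2.5\r"
       "DG1|1||Z59.0^Homelessness^I10\r"
       "DG1|2||E11.9^T2DM^I10")
print(extract_z_codes(msg))   # ['Z59.0']
```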
▼ ▼ ▼
Typical Data Volume & Refresh Cadence
Millions
Claims / Month
Large health system, all payers
30-90 Days
Claims Lag
From service date to adjudication
Real-Time
Eligibility Checks
270/271 transactions, sub-second
Quarterly
SDoH Index Refresh
ADI, SVI, PLACES updated periodically
Annual
Census / ACS Data
5-year ACS estimates, decennial census
5-10%
Z-Code Capture Rate
SDoH under-documented in claims/EHR
▼ ▼ ▼
Key Challenges & Considerations

Claims Lag & Adjustments

30-90 day delay from service to paid claim. Denials, resubmissions, and adjustments create multiple versions. Must handle retroactive changes.

SDoH Data Sparsity

Z-code capture is 5-10%. Screening adoption is uneven. Geocoded indices are proxies, not individual-level data. Gaps in rural areas.

Geocoding Accuracy

Patient addresses may be PO boxes, shelters, or outdated. Census tract assignment requires geocoding services and address standardization.

Social Risk vs. Social Need

Area-level deprivation indices (ADI) measure social risk; individual screening captures social need. Both are required, but they answer different questions — area-level risk is not a proxy for any one patient's lived experience.

Cross-Payer Linkage

Patients have claims across multiple payers (Medicare + commercial). No universal patient ID. Requires probabilistic matching and deduplication.

Consent for SDoH

Patients may not consent to sharing social needs data. Sensitive categories (domestic violence, substance use). Must respect preferences.

▼ ▼ ▼

Downstream: Into the Unified Pipeline

Claims and SDoH data join with clinical data in BigQuery, enabling population health analytics, risk stratification, care gap detection, and health equity analysis powered by AI agents.

BigQuery (Claims + SDoH)
JOIN with Clinical Data
Population Health Analytics
AI Agents (Risk, Gaps, Equity)

Cloud Healthcare API + Pub/Sub + Dataflow + HDE

The unified ingestion pipeline: healthcare-native APIs, event-driven messaging, Apache Beam processing, and clinical data harmonization on GCP.

Component Overview

Cloud Healthcare API

  • HL7v2 Store — MLLP ingestion
  • FHIR Store (R4) — CRUD + search
  • DICOM Store — DICOMweb
  • Managed, HIPAA-compliant

Cloud Pub/Sub

  • Serverless event bus
  • Topic-per-data-type routing
  • Exactly-once delivery
  • Dead-letter queue support

Cloud Dataflow

  • Apache Beam (Java/Python)
  • Streaming + batch unified
  • Autoscaling workers
  • Exactly-once processing

Healthcare Data Engine

  • FHIR harmonization
  • Patient matching (EMPI)
  • Terminology normalization
  • Data quality rules
▼ ▼ ▼
Cloud Healthcare API — Store Details

FHIR Store R4

  • Full CRUD on FHIR R4 resources
  • Search: _include, _revinclude, chained params
  • $everything — full patient record
  • Bulk export to BigQuery (streaming + scheduled)
  • Conditional create/update (If-None-Exist)
  • Bundle transactions (up to 100 entries)
  • SMART on FHIR scopes for access control

HL7v2 Store v2.x

  • MLLP adapter — on-prem to GCP bridge
  • Message parsing with configurable schemas
  • Pub/Sub notification on each message
  • Segment-level field extraction
  • ACK/NAK response handling
  • Supports v2.1 through v2.9

DICOM Store DICOMweb

  • STOW-RS — store instances
  • WADO-RS — retrieve studies/series/instances
  • QIDO-RS — query studies by metadata
  • De-identification profiles built-in
  • Integration with Cloud Storage for bulk

Common Capabilities Platform

  • HIPAA BAA, HITRUST, SOC 2 compliant
  • CMEK encryption at rest
  • IAM + SMART on FHIR access control
  • Audit logging to Cloud Logging
  • Regional and multi-regional deployments
▼ ▼ ▼
Pub/Sub as Event Bus

Topic Architecture

One topic per data type enables independent scaling, filtering, and consumer isolation.

hl7v2-messages
ADT, ORM, ORU events
fhir-notifications
FHIR Store changes
dicom-studies
New study arrivals
iot-events
Device telemetry
claims-ingest
X12 835/837 events
dlq-*
Dead-letter per topic
Feature | Configuration | Purpose
Exactly-Once Delivery | enable_exactly_once_delivery: true | No duplicate processing downstream
Message Ordering | ordering_key: patient_id | In-order per patient for ADT events
Dead-Letter Topics | max_delivery_attempts: 5 | Failed messages routed for triage
Push Subscriptions | push_endpoint: Cloud Run URL | Low-latency alert triggers
Pull Subscriptions | ack_deadline: 60s | Dataflow streaming consumption
Retention | message_retention: 7d | Replay window for reprocessing
▼ ▼ ▼
Dataflow Pipelines — Streaming & Batch

Streaming Jobs Always-On

  • HL7v2 parse → FHIR R4 transform → BigQuery write
  • Real-time feature engineering (vitals, alerts)
  • Terminology mapping (local → SNOMED/LOINC)
  • Patient ID resolution via EMPI lookup
  • IoT device stream aggregation (1-min windows)
  • Clinical event enrichment & routing

Batch Jobs Scheduled

  • Historical backfill from bulk FHIR exports
  • Claims file processing (X12 835/837)
  • Genomic pipeline output integration
  • Monthly terminology table refresh
  • Data quality reconciliation reports
  • Feature store batch materialization
Apache Beam Concepts

PCollections

Immutable distributed datasets — each step produces a new PCollection

ParDo / DoFn

Element-wise transforms — HL7v2 parsing, FHIR mapping, validation

Windowing

Fixed (1-min), sliding (5-min/1-min), session (30-min gap) windows

Watermarks

Event-time progress tracking — handle late data with allowed lateness

Side Inputs

Broadcast lookup tables — terminology maps, facility configs

Dead Letters

Failed elements routed to BigQuery error table + DLQ topic
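The windowing concept above can be shown without Beam. A pure-Python sketch of fixed event-time windowing — bucketing vitals into 1-minute windows and averaging per (patient, window) — which a real pipeline would express as `beam.WindowInto(FixedWindows(60))` plus a combiner:

```python
# Pure-Python sketch of Beam-style fixed windowing: assign each event to
# the 1-minute window containing its event time, then average per key.
from collections import defaultdict

def fixed_window_mean(events, window_sec=60):
    buckets = defaultdict(list)                      # (patient_id, window_start) -> values
    for patient_id, event_ts, value in events:
        window_start = event_ts - (event_ts % window_sec)
        buckets[(patient_id, window_start)].append(value)
    return {k: sum(v) / len(v) for k, v in buckets.items()}

vitals = [("p1", 0, 72), ("p1", 30, 76), ("p1", 65, 80)]   # (patient, epoch sec, heart rate)
print(fixed_window_mean(vitals))
# {('p1', 0): 74.0, ('p1', 60): 80.0}
```

Note this sketch ignores watermarks and late data — in Beam, allowed lateness decides whether a straggler event still updates its (already fired) window.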

▼ ▼ ▼
Healthcare Data Engine (HDE)
Capability | Detail | Output
Patient Matching (EMPI) | Probabilistic + deterministic matching on name, DOB, SSN, MRN | Golden patient_id
FHIR Harmonization | Normalize heterogeneous FHIR into canonical R4 profiles | Conformant FHIR bundles
Terminology Normalization | Map local codes → SNOMED CT, LOINC, RxNorm, ICD-10 | Standard coded values
Data Quality Rules | Completeness, validity, consistency checks per resource type | Quality score + flags
Longitudinal Assembly | Merge records across sources into single patient timeline | Unified patient record
De-identification | Safe Harbor / Expert Determination for research datasets | De-identified FHIR
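The deterministic-then-probabilistic matching shape can be illustrated with a toy scorer. HDE's actual EMPI algorithm is a managed service; the field weights and the 1.0 deterministic short-circuit below are hypothetical, chosen only to show the pattern:

```python
# Toy EMPI match score: deterministic pass on SSN, else a weighted
# probabilistic score on name + DOB. Weights are illustrative, not HDE's.
def match_score(a: dict, b: dict) -> float:
    if a.get("ssn") and a.get("ssn") == b.get("ssn"):
        return 1.0                                   # deterministic match wins outright
    points = 0                                       # integer points avoid float drift
    points += 35 if a["family_name"].lower() == b["family_name"].lower() else 0
    points += 25 if a["given_name"].lower() == b["given_name"].lower() else 0
    points += 40 if a["dob"] == b["dob"] else 0
    return points / 100

rec1 = {"given_name": "Ana", "family_name": "Silva", "dob": "1980-03-02"}
rec2 = {"given_name": "ANA", "family_name": "Silva", "dob": "1980-03-02"}
print(match_score(rec1, rec2))   # 1.0 — would merge into one golden patient_id
```

In practice the score is compared against two thresholds: auto-merge above the upper one, human review in between, distinct records below the lower one.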
▼ ▼ ▼
Pipeline Architecture — End-to-End Flow

Streaming Path

EHR / HL7v2
MLLP Adapter
Cloud Healthcare API
Pub/Sub
Dataflow (Streaming)
HDE (Harmonize)
BigQuery Raw Zone

FHIR Native Path

EHR FHIR R4
FHIR Store
Pub/Sub
Dataflow (Streaming)
BigQuery Raw Zone

Imaging Path

PACS / VNA
DICOM Store
Pub/Sub
Dataflow (Metadata)
BigQuery + Cloud Storage

Batch Path

Bulk Export / Files
Cloud Storage (GCS)
Dataflow (Batch)
HDE (Harmonize)
BigQuery Raw Zone
▼ ▼ ▼
Monitoring & Observability

Cloud Monitoring Dashboards

  • Dataflow job metrics: throughput, element count, backlog size
  • Pub/Sub: unacked messages, publish latency, subscription age
  • Healthcare API: request count, error rate, latency p50/p99
  • BigQuery: streaming insert rate, slot utilization, query performance

Alerting & SLOs

  • SLO: end-to-end latency < 5s (p99), availability 99.95%
  • Alert: Pub/Sub backlog > 10K messages
  • Alert: Dataflow error rate > 0.1%
  • Alert: DLQ message count > 0
  • Alert: Healthcare API 5xx rate > 1%
  • Weekly SLO burn-rate reports via Cloud Monitoring
▼ ▼ ▼
Scaling Characteristics
Auto
Dataflow Workers
1 → 1000+ based on backlog
Pub/Sub Throughput
No provisioned capacity needed
1M
BQ Streaming Inserts/sec
Per project, expandable
Regional
Deployment Model
us-central1 primary, failover ready
CMEK
Encryption
Customer-managed keys everywhere
99.95%
Pipeline SLA Target
End-to-end availability
▼ ▼ ▼

Downstream: Into the Lakehouse

Ingested and harmonized data lands in the BigQuery lakehouse, progressing through Raw, Curated, and Enriched zones to become AI-ready.

Ingestion Pipeline
BigQuery Raw Zone
Curated Zone
Enriched Zone
Vertex AI + Agents

Raw Zone — Deep Dive

Immutable landing zone in BigQuery and Cloud Storage. Source-of-truth copies for audit, compliance, and reprocessing.

Purpose & Principles
🔒

Immutable Landing

Data written once, never modified. Append-only ingestion preserves original fidelity.

📄

Source of Truth

Exact copy of upstream data. All downstream zones derive from raw — enables full recompute.

🔍

Audit Trail

Every record timestamped with ingestion metadata. Supports HIPAA audit and regulatory review.

Reprocessing

When transformation logic changes, replay from raw. No need to re-extract from source systems.

▼ ▼ ▼
BigQuery Storage Layout — Datasets by Source
raw_ehr
HL7v2 messages + FHIR resources
raw_imaging_meta
DICOM metadata (studies, series)
raw_labs
Lab results, micro, path
raw_claims
X12 835/837, ERA, eligibility
raw_iot
Device telemetry, wearables
raw_genomics
VCF variants, annotations
▼ ▼ ▼
Raw FHIR Tables

Schema: raw_ehr.fhir_resources Auto-populated via Healthcare API Export

Column | Type | Description
resource_type | STRING | Patient, Encounter, Observation, Condition, etc.
id | STRING | FHIR resource ID (server-assigned UUID)
meta_last_updated | TIMESTAMP | Server-side last modified timestamp
meta_version_id | STRING | Resource version for optimistic concurrency
resource_json | JSON | Full FHIR R4 resource payload
source_fhir_store | STRING | Cloud Healthcare API FHIR store path
ingestion_timestamp | TIMESTAMP | Pipeline ingestion time (partition key)
▼ ▼ ▼
Raw HL7v2 Tables

Schema: raw_ehr.hl7v2_messages Parsed from HL7v2 Store

Column | Type | Description
message_id | STRING | Unique message control ID (MSH-10)
message_type | STRING | ADT, ORM, ORU, SIU, MDM, etc.
trigger_event | STRING | A01, A03, O01, R01, etc.
sending_facility | STRING | MSH-4 sending facility identifier
sending_application | STRING | MSH-3 sending application name
raw_message | STRING | Original pipe-delimited HL7v2 message
parsed_segments | JSON | Structured JSON of all segments (MSH, PID, PV1, OBX...)
message_datetime | TIMESTAMP | MSH-7 message date/time
ingestion_timestamp | TIMESTAMP | Pipeline arrival time (partition key)
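The MSH field extraction behind this table can be sketched directly. One wrinkle worth showing: the field separator itself counts as MSH-1, so after splitting on `|`, list element *i* holds field MSH-(i+1):

```python
# Sketch of the extraction behind raw_ehr.hl7v2_messages: pick the MSH
# fields the schema stores. After split('|'), element i is MSH-(i+1),
# because the '|' separator itself is MSH-1.
def msh_to_row(raw_message: str) -> dict:
    msh = raw_message.split("\r")[0].split("|")
    msg_type, _, trigger = msh[8].partition("^")     # MSH-9 = type^trigger
    return {
        "message_id": msh[9],            # MSH-10 message control ID
        "message_type": msg_type,        # MSH-9.1
        "trigger_event": trigger,        # MSH-9.2
        "sending_application": msh[2],   # MSH-3
        "sending_facility": msh[3],      # MSH-4
        "message_datetime": msh[6],      # MSH-7
        "raw_message": raw_message,
    }

msg = "MSH|^~\\&|EPIC|HOSP_A|||20250101120000||ADT^A01|MSG0001|P|2.5"
print(msh_to_row(msg)["message_id"])   # MSG0001
```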
▼ ▼ ▼
Raw Claims Tables

Schema: raw_claims.claims_raw X12 835/837

Column | Type | Description
claim_id | STRING | Payer-assigned claim identifier
claim_type | STRING | Professional (837P), Institutional (837I), Dental (837D)
service_date_from | DATE | Service start date
service_date_to | DATE | Service end date
dx_codes | ARRAY<STRING> | ICD-10-CM diagnosis codes (primary + secondary)
px_codes | ARRAY<STRING> | CPT/HCPCS procedure codes
billed_amount | NUMERIC | Total billed amount
allowed_amount | NUMERIC | Payer-allowed amount
payer_name | STRING | Insurance payer identifier
raw_x12 | STRING | Original X12 transaction content
ingestion_timestamp | TIMESTAMP | Pipeline arrival time (partition key)
▼ ▼ ▼
Cloud Storage — Large Binary Objects

Storage Organization GCS Buckets

Prefix structure: gs://project-raw/{source}/{type}/{YYYY}/{MM}/{DD}/

  • DICOM files — original imaging studies (.dcm)
  • Genomics — FASTQ, BAM, VCF files
  • Clinical documents — scanned PDFs, CDA/C-CDA
  • Large HL7v2 batches — bulk file drops

Object Metadata Labels

  • source_system — originating system ID
  • data_type — dicom, genomics, document
  • phi_flag — true/false
  • ingestion_date — ISO 8601 arrival date
  • retention_class — hot, warm, cold
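The prefix scheme above is simple enough to encode directly. A sketch of the prefix builder, using the placeholder bucket name `project-raw` from the spec above:

```python
# Build the raw-zone object prefix gs://project-raw/{source}/{type}/{YYYY}/{MM}/{DD}/
# from ingestion metadata. Bucket name is the placeholder from the prefix spec.
from datetime import date

def raw_object_prefix(source: str, data_type: str, ingest_date: date) -> str:
    return f"gs://project-raw/{source}/{data_type}/{ingest_date:%Y/%m/%d}/"

print(raw_object_prefix("pacs", "dicom", date(2025, 3, 7)))
# gs://project-raw/pacs/dicom/2025/03/07/
```

Date-partitioned prefixes like this let lifecycle rules and batch jobs target a day's arrivals with a single prefix match.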
▼ ▼ ▼
Data Quality at Rest — Dataplex DQ
Check Type | Tool | Example Rule
Schema Validation | Dataplex Data Quality | All required columns present, correct types
Completeness | Dataplex Data Quality | patient_id NOT NULL, message_type NOT NULL
Duplicate Detection | Dataform assertion | COUNT(DISTINCT message_id) = COUNT(*)
Freshness Monitoring | Dataplex Data Quality | MAX(ingestion_timestamp) within last 15 minutes
Range Validation | Dataplex Data Quality | ingestion_timestamp between source_time and NOW()
Volume Anomaly | Cloud Monitoring | Daily row count within 2 stddev of trailing 30-day mean
▼ ▼ ▼
Retention & Lifecycle Policies

Tiered Storage Strategy

BigQuery (Hot)
0 – 90 days
Cloud Storage Nearline
90 days – 1 year
Coldline
1 – 3 years
Archive
3 – 7+ years

BigQuery Lifecycle

  • Long-term storage pricing after 90 days (auto)
  • Partition expiration for non-critical staging tables
  • Time travel: 7-day query snapshots for recovery
  • Fail-safe: additional 7 days (Google-managed)

Cloud Storage Lifecycle

  • Object Lifecycle rules auto-transition classes
  • Bucket Lock for WORM compliance
  • Legal holds for litigation / regulatory freezes
  • Object versioning enabled for accidental overwrite
▼ ▼ ▼
Governance — Dataplex & Data Catalog

Dataplex Asset Registration

Every raw dataset/bucket registered as a Dataplex asset within the healthcare lake. Auto-discovery scans for new tables.

Data Catalog Tags

Automated tagging: source system, data classification (PHI/PII/public), ingestion date, owner team, retention policy.

Column-Level Security

BigQuery policy tags on PHI columns (SSN, name, DOB). Data Catalog taxonomy enforces access via IAM.

Lineage Tracking

Dataplex lineage captures raw → curated → enriched provenance. Integrated with Dataform DAGs.

▼ ▼ ▼

Downstream: Raw → Curated Zone

Scheduled Dataflow and Dataform jobs transform raw data into normalized, deduplicated, quality-controlled records in the Curated Zone.

Raw Zone (BigQuery + GCS)
Dataform / Dataflow
Curated Zone
Enriched Zone
AI / Analytics

Curated Zone — Deep Dive

Normalized, deduplicated, quality-controlled healthcare data ready for analytics and downstream enrichment.

Purpose

Normalized

Standard terminologies (SNOMED, LOINC, RxNorm, ICD-10). Consistent schemas across sources.

🔗

Deduplicated

EMPI-resolved patient identity. One golden record per patient, encounter, observation.

Quality-Controlled

Dataform assertions + Dataplex DQ rules enforce integrity. Quality score per record.

📈

Analytics-Ready

Flat, queryable tables optimized for BigQuery. Partitioned and clustered for performance.

▼ ▼ ▼
Transformation Pipeline — Raw to Curated

Processing Steps

Raw Zone
Deduplication
Code Normalization
EMPI Resolution
Business Rules
Flatten FHIR
DQ Validation
Curated Tables

Dataform SQL-based

  • SQL transformations with dependency DAGs
  • Incremental models — process only new/changed rows
  • Built-in assertions for data testing
  • Auto-generated documentation
  • Git-integrated versioning in Cloud Source Repos

Dataflow Complex Logic

  • EMPI matching (probabilistic + deterministic)
  • Cross-source record linkage
  • Terminology mapping with large lookup tables
  • Nested FHIR JSON → flat BigQuery schemas
  • Scheduled via Cloud Composer (Airflow)
▼ ▼ ▼
Core Curated Tables
patient_master
Golden record
encounters
Visits & admissions
observations
Vitals & labs
conditions
Diagnoses
medications
Orders & admin
procedures
Surgical & clinical
allergies
Intolerances
immunizations
Vaccine records
documents
Clinical notes
claims_adjudicated
Processed claims
appointments
Scheduling
▼ ▼ ▼
Patient Master Table — Golden Record
Column | Type | Description
patient_id | STRING | EMPI-resolved universal patient identifier
mrns | ARRAY<STRUCT> | All known MRNs [{mrn, facility, active}]
given_name | STRING | Patient first name (best-known)
family_name | STRING | Patient last name (best-known)
date_of_birth | DATE | Date of birth
gender | STRING | Administrative gender
race | STRING | OMB race category
ethnicity | STRING | OMB ethnicity category
address | STRUCT | Primary address (line, city, state, zip)
primary_pcp | STRING | Primary care provider NPI
risk_scores | STRUCT | {hcc_score, lace_score, cci_score}
last_encounter_date | DATE | Most recent encounter date
insurance | ARRAY<STRUCT> | Active coverage [{payer, plan, member_id, type}]
is_deceased | BOOLEAN | Deceased flag
updated_at | TIMESTAMP | Last curated-zone update timestamp
▼ ▼ ▼
Encounter Table
Column | Type | Description
encounter_id | STRING | Unique encounter identifier
patient_id | STRING | FK to patient_master
encounter_type | STRING | ambulatory, emergency, inpatient, virtual
encounter_class | STRING | AMB, EMER, IMP, VR (FHIR class codes)
facility_id | STRING | Facility / location identifier
department | STRING | Department name
admit_date | TIMESTAMP | Admission or check-in time
discharge_date | TIMESTAMP | Discharge or check-out time
attending_npi | STRING | Attending provider NPI
diagnoses | ARRAY<STRUCT> | [{icd10, description, rank, type}]
procedures | ARRAY<STRUCT> | [{cpt, description, date}]
disposition | STRING | Discharge disposition code
▼ ▼ ▼
Observations Table — Vitals & Lab Results
Column | Type | Description
observation_id | STRING | Unique observation identifier
patient_id | STRING | FK to patient_master
encounter_id | STRING | FK to encounters (nullable for ambulatory)
loinc_code | STRING | LOINC observation code
display_name | STRING | Human-readable observation name
value_numeric | FLOAT64 | Numeric result (if applicable)
value_text | STRING | Text result (if non-numeric)
units | STRING | UCUM unit of measure
reference_range | STRING | Normal reference range
abnormal_flag | STRING | H, L, HH, LL, A, N
effective_date | TIMESTAMP | Clinically relevant date/time
source_system | STRING | Originating system identifier
▼ ▼ ▼
Data Quality Rules
Rule Type | Tool | Example
Not Null | Dataform assertion | patient_id, encounter_id, loinc_code must be non-null
Valid Range | Dataform assertion | Heart rate 20-300, temp 90-110F, SpO2 50-100%
Referential Integrity | Dataform assertion | All encounter.patient_id exists in patient_master
Code System Validation | Dataform assertion | loinc_code matches LOINC reference table
Completeness Score | Dataplex DQ | % of required fields populated per record
Timeliness | Dataplex DQ | Curated table refresh < 30 min after raw arrival
▼ ▼ ▼
Dataform DAG — Example Patient Pipeline

raw_fhir → stg_patients → curated_patient_master

raw_ehr.fhir_resources
stg_patients_dedup
stg_patients_empi
curated.patient_master
assert_patient_pk_unique
raw_ehr.hl7v2_messages
stg_encounters_parsed
stg_encounters_normalized
curated.encounters
assert_encounter_fk_valid
raw_ehr.fhir_resources
stg_observations_flat
stg_observations_loinc
curated.observations
assert_loinc_valid
▼ ▼ ▼
Partitioning & Clustering Strategy

Partition by Date

All curated tables partitioned on primary date column (admit_date, effective_date, updated_at). Enables efficient time-range queries.

Cluster by patient_id

Clustering on patient_id collocates patient data for fast $everything-style queries across encounters, observations, conditions.

Materialized Views

Pre-computed aggregations: active_patients, recent_admissions, pending_results. Auto-refreshed by BigQuery.

BI Engine Acceleration

BigQuery BI Engine reservations on high-traffic curated tables for sub-second Looker dashboard queries.

▼ ▼ ▼

Downstream: Curated → Enriched Zone

Curated records feed feature engineering, embeddings generation, cohort building, and research marts in the Enriched Zone.

Curated Zone
Feature Engineering
Enriched Zone
Vertex AI + Looker

Enriched Zone — Deep Dive

ML features, embeddings, cohorts, and research marts — the AI-ready layer of the healthcare lakehouse.

Purpose

ML Features

Pre-computed risk scores, utilization metrics, temporal aggregations ready for model training and inference.

🔬

Embeddings

Vector representations of clinical notes, imaging, and lab panels for semantic search and similarity.

👥

Cohorts

Pre-built patient cohorts for clinical trials, quality measures, and population health programs.

📊

Research Marts

Disease-specific and operational data marts optimized for analytics and Looker dashboards.

▼ ▼ ▼
Feature Engineering — Computed Features
LACE Score
Readmission risk (LOS, Acuity, Comorbidities, ED visits)
CCI
Charlson Comorbidity Index
APACHE II
ICU severity scoring
Med Adherence
PDC / MPR calculations
Utilization (7d/30d/90d)
ED visits, admissions, procedures
Longitudinal Trends
Lab value slopes, vital trajectories

Dataform Features SQL-based

  • Window functions for temporal aggregations
  • 7-day, 30-day, 90-day rolling windows
  • Incremental updates — only recompute changed patients
  • Scheduled via Cloud Composer DAGs

Dataflow Features Complex

  • Streaming feature computation (real-time vitals)
  • Cross-table joins for composite scores
  • External API enrichment (NPI registry, geocoding)
  • SDoH feature derivation from address data
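Of the computed scores above, LACE is simple enough to show in full. A sketch using the published point bands (length of stay, acute admission, Charlson comorbidity index, ED visits in the prior 6 months); in the lakehouse this would be a Dataform window-function query over encounters, not Python:

```python
# LACE readmission score, following the standard point bands:
# L = length of stay, A = acute admission, C = Charlson index, E = ED visits.
def lace(los_days: int, acute_admission: bool, charlson: int, ed_visits_6mo: int) -> int:
    if los_days < 1:       los = 0
    elif los_days <= 3:    los = los_days        # 1, 2, 3 days -> 1, 2, 3 points
    elif los_days <= 6:    los = 4
    elif los_days <= 13:   los = 5
    else:                  los = 7
    acuity = 3 if acute_admission else 0         # emergent admission
    comorbidity = 5 if charlson >= 4 else charlson   # Charlson capped at 5 points
    ed = min(ed_visits_6mo, 4)                   # 1 point per ED visit, max 4
    return los + acuity + comorbidity + ed

print(lace(los_days=5, acute_admission=True, charlson=2, ed_visits_6mo=1))   # 4+3+2+1 = 10
```

Scores of 10 or more are conventionally treated as high readmission risk.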
▼ ▼ ▼
Vertex AI Feature Store
Entity Type | Key Features | Online Serving | Offline Serving
patient | risk_scores, demographics, utilization_30d, med_count, last_a1c, insurance_type | < 10ms (Bigtable) | BigQuery export
encounter | los_hours, icu_flag, diagnosis_count, procedure_count, ed_to_admit_min | < 10ms (Bigtable) | BigQuery export
provider | panel_size, avg_los, readmit_rate, specialty, quality_scores | < 10ms (Bigtable) | BigQuery export

Online Serving Low-Latency

  • Sub-10ms lookups for clinical agents
  • Backed by Bigtable for high throughput
  • Used by real-time inference endpoints
  • Auto-synced from BigQuery feature tables

Point-in-Time Correctness Training

  • Prevent data leakage in ML training
  • Feature values as-of prediction timestamp
  • Temporal join logic built into Feature Store SDK
  • Critical for readmission / mortality models
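The point-in-time rule can be shown in a few lines: for each training example, use the latest feature value observed at or before the prediction timestamp, never a later one. The Feature Store SDK does this temporal join for you; the sketch below only illustrates the semantics:

```python
# Point-in-time feature lookup: the value "as of" the prediction timestamp.
# Taking any later value would leak the future into training data.
from bisect import bisect_right

def point_in_time_value(history, as_of):
    """history: list of (timestamp, value) sorted ascending by timestamp."""
    timestamps = [t for t, _ in history]
    i = bisect_right(timestamps, as_of)          # count of entries with t <= as_of
    return history[i - 1][1] if i else None

a1c_history = [(1, 9.1), (5, 8.4), (9, 7.2)]     # (event time, A1c value)
print(point_in_time_value(a1c_history, as_of=6))   # 8.4 — the value known at prediction time
print(point_in_time_value(a1c_history, as_of=0))   # None — no measurement existed yet
```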
▼ ▼ ▼
Embedding Tables
Table | Key Columns | Embedding Model | Dimensions
clinical_note_embeddings | note_id, patient_id, encounter_id, embedding_vector, model_version, note_type | Med-PaLM / Gemini | 768 / 1024
imaging_embeddings | study_id, series_id, patient_id, embedding_vector, modality, body_part | Med-PaLM Vision | 1024
lab_panel_embeddings | patient_id, panel_date, embedding_vector, panel_type, lab_count | Custom Vertex AI | 256
patient_summary_embeddings | patient_id, embedding_vector, summary_date, model_version | Gemini | 768

Vector Search Integration

  • BigQuery VECTOR_SEARCH for analytics-time similarity queries
  • Vertex AI Vector Search for low-latency online retrieval (RAG)
  • Cosine similarity for clinical note search, patient matching
  • Used by clinical agents for context retrieval
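The cosine-similarity scoring is the same whether BigQuery or Vertex AI Vector Search runs it at scale. A minimal in-memory sketch with toy 3-dimensional vectors (real embeddings are 768/1024-dim):

```python
# Cosine-similarity top-k retrieval over in-memory vectors — the scoring
# that VECTOR_SEARCH / Vector Search apply with ANN indexing at scale.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query, corpus, k=2):
    """corpus: {doc_id: vector} -> top-k (doc_id, score) by cosine similarity."""
    scored = sorted(((cosine(query, v), d) for d, v in corpus.items()), reverse=True)
    return [(d, round(s, 3)) for s, d in scored[:k]]

notes = {"note_a": [1.0, 0.0, 0.0], "note_b": [0.9, 0.1, 0.0], "note_c": [0.0, 1.0, 0.0]}
print(top_k([1.0, 0.05, 0.0], notes, k=2))   # note_a ranks first, note_b second
```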
▼ ▼ ▼
Cohort Tables
Column | Type | Description
cohort_id | STRING | Unique cohort identifier
cohort_name | STRING | Human-readable name (e.g., "T2DM A1c > 9")
criteria_definition | JSON | Structured inclusion/exclusion criteria
patient_ids | ARRAY<STRING> | Matching patient IDs
patient_count | INT64 | Cohort size
creation_date | TIMESTAMP | When cohort was computed
irb_number | STRING | Associated IRB protocol (if research)
refresh_schedule | STRING | daily, weekly, one-time
created_by | STRING | Requesting user / team
▼ ▼ ▼
Research & Operational Marts

Oncology Mart

Tumor registry, staging, treatment lines, genomic variants, outcomes by regimen.

Cardiology Mart

Echo metrics, cath lab data, LVEF trends, HF readmissions, anticoagulation adherence.

Diabetes Mart

A1c trajectories, insulin dosing, complication rates, eye/foot exam compliance.

Operational — Throughput

ED wait times, OR utilization, bed turnover, discharge delays, staffing ratios.

Operational — Readmissions

30-day readmission rates by DRG, payer, provider. Risk-stratified cohorts.

Population Health

Risk stratification tiers, care gaps (screenings, vaccines), SDoH indices, HEDIS measures.

▼ ▼ ▼
ML Training Datasets
Component | Detail | Storage
Feature Snapshots | Point-in-time feature values at prediction timestamp | BigQuery (versioned)
Label Tables | readmission_30d, mortality_inpatient, sepsis_onset, deterioration_6h | BigQuery
Train/Val/Test Splits | Temporal split (train < 2024, val = 2024-H1, test = 2024-H2) | BigQuery + GCS
Dataset Versioning | Dataplex lineage tracks dataset provenance per model version | Dataplex metadata
Data Cards | Dataset documentation: size, demographics, label distribution, known biases | Vertex AI Metadata
▼ ▼ ▼
Downstream — From Enriched to AI & Analytics

ML/AI Path

Feature Store
Vertex AI Training
Model Registry
Vertex AI Endpoints
Clinical Agents

Analytics Path

Research Marts
BigQuery
Looker
Dashboards & Reports

RAG Path

Embedding Tables
Vertex AI Vector Search
Gemini / Agents
Clinical Summaries
▼ ▼ ▼

Full Pipeline: Ingestion → Raw → Curated → Enriched → AI

The Enriched Zone is the final lakehouse layer. From here, data powers Vertex AI models, clinical agents, Looker dashboards, and research workflows.

Ingestion
Raw Zone
Curated Zone
Enriched Zone
Vertex AI
Agents + Looker

Embeddings — Deep Dive

Convert all healthcare data types into dense vector representations enabling semantic search, similarity matching, and cross-modality reasoning across the clinical data ecosystem.

Embedding Models on Vertex AI
Model | Data Type | Dimensions | Use Case
text-embedding-005 | General text | 768 | Clinical notes, discharge summaries, guidelines
text-multilingual-embedding-002 | Multilingual text | 768 | Patient-facing materials, consent forms
Med-PaLM Embeddings | Clinical text | 768 | H&P notes, radiology/pathology reports, medical Q&A
Health AI Dev Foundations | Medical imaging | 1024 | X-ray, CT, pathology slide embeddings
Custom Fine-Tuned (Vertex AI Training) | Domain-specific | 768/1024 | Org-specific terminology, specialty notes, lab panels
multimodalembedding@001 | Image + text | 1408 | Cross-modal search: text query → image results
▼ ▼ ▼
Healthcare Data Types Embedded

Clinical Notes

  • History & Physical (H&P)
  • Progress notes
  • Discharge summaries
  • Consult notes

Diagnostic Reports

  • Radiology reports
  • Pathology reports
  • Operative notes
  • Procedure findings

Lab Panel Signatures

  • Vectorized result patterns
  • Multi-analyte panels (BMP, CBC)
  • Trending abnormal results
  • Reference range deviations

Encounter Summaries

  • Visit-level patient summaries
  • Problem list snapshots
  • Medication reconciliation
  • Care plan abstracts

Guidelines & Protocols

  • Clinical pathways
  • Formulary policies
  • Order set documentation
  • Institutional SOPs

Research Literature

  • PubMed abstracts
  • Internal publications
  • Trial protocols
  • Systematic reviews
▼ ▼ ▼
Data flows into embedding pipeline
Embedding Pipeline Architecture

Batch Embedding Pipeline Primary

Scheduled and event-driven embedding generation for all new and updated clinical data via Dataflow orchestration.

BigQuery / Cloud Storage
Dataflow (Orchestration)
Vertex AI Batch Prediction
BigQuery (VECTOR columns)

Real-Time Embedding Low Latency

On-demand embedding for new documents and agent queries via Vertex AI online prediction endpoints.

Pub/Sub (New Doc Event)
Cloud Run Function
Vertex AI Online Prediction
Vertex AI Vector Search

Continuous Indexing Streaming

Incremental updates to vector indices as new embeddings arrive, ensuring near-real-time search availability.

BigQuery (New Vectors)
Dataflow Streaming
Vertex AI Vector Search (Index Update)
Deployed Index Endpoint
▼ ▼ ▼
Vector Storage & Search Options
Service | Vector Capability | Search Algorithm | Best For | Latency
BigQuery | VECTOR type, VECTOR_SEARCH() | Cosine / Dot Product / Euclidean | Analytical queries, cohort similarity, SQL joins with vectors | Seconds (analytical)
Vertex AI Vector Search | Managed ANN index, deployed endpoints | ScaNN (Scalable Nearest Neighbors) | Low-latency serving, real-time agent retrieval, RAG pipeline | < 10ms (p99)
AlloyDB | pgvector extension, ANN index | IVFFlat / HNSW | Transactional + vector hybrid, app-embedded search | < 50ms
Spanner | K-Nearest Neighbors (approx) | Cosine distance built-in | Global-scale transactional with vector search | < 20ms
▼ ▼ ▼
Embedding Table Schema (BigQuery)

healthcare_embeddings.document_embeddings

document_id STRING (PK) patient_id STRING (FK) encounter_id STRING (FK) source_type STRING
text_chunk STRING chunk_index INT64 chunk_token_count INT64
embedding_vector ARRAY<FLOAT64> (768/1024 dim)
model_id STRING model_version STRING created_at TIMESTAMP metadata_json JSON
Primary Key
Foreign Key
Data Fields
Vector Column
Metadata / Lineage
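The chunking that populates `text_chunk` / `chunk_index` / `chunk_token_count` can be sketched with overlapping word windows. The window and overlap sizes are illustrative, and the "token" counts here are naive whitespace counts rather than model-tokenizer counts:

```python
# Sketch of chunking for document_embeddings: overlapping word windows,
# emitting one row per chunk in the shape of the schema above.
def chunk_document(document_id, text, max_words=120, overlap=20):
    words, rows, start, idx = text.split(), [], 0, 0
    while start < len(words):
        chunk = words[start:start + max_words]
        rows.append({"document_id": document_id, "chunk_index": idx,
                     "text_chunk": " ".join(chunk), "chunk_token_count": len(chunk)})
        if start + max_words >= len(words):
            break                                    # last window reached the end
        start += max_words - overlap                 # slide with overlap for context
        idx += 1
    return rows

rows = chunk_document("doc1", "word " * 200, max_words=120, overlap=20)
print([r["chunk_token_count"] for r in rows])   # [120, 100]
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk.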
▼ ▼ ▼
Similarity Operations

Patient Matching

  • Nearest-neighbor patient similarity
  • Find patients with similar conditions
  • Treatment outcome cohort matching

Similar Case Retrieval

  • Query by clinical scenario
  • Historical case lookup for CDS
  • Cross-facility case matching

Note Deduplication

  • Detect copy-paste notes (high cosine)
  • Version diff across encounters
  • Identify template-derived content

Literature Matching

  • Patient context → relevant studies
  • Guideline-to-case alignment
  • Evidence gap identification

Anomaly Detection

  • Distance threshold outlier detection
  • Unusual lab result patterns
  • Atypical documentation flagging
▼ ▼ ▼
Quality & Lifecycle Management

Embedding Drift Monitoring

Track distribution shift in embedding space over time. Alert when new data deviates significantly from training distribution.

Model Version Management

Track model_id + model_version per vector. Support side-by-side versions during migration. Vertex AI Model Registry for lineage.

Re-Embedding Pipeline

On model update, trigger batch re-embedding of existing corpus. Dataflow job with BigQuery source, write-back with new model_version.

A/B Evaluation

Compare embedding quality across models using retrieval precision/recall on curated eval sets. Vertex AI Experiments for tracking.

Dimension Reduction

UMAP / t-SNE projections for visualization and debugging. Stored as 2D/3D coordinates for dashboarding in Looker.

TTL & Retention

Embedding expiration aligned with source data retention policies. Automated cleanup via BigQuery scheduled queries.

▼ ▼ ▼

Downstream: Powering Search, RAG & AI Agents

Embeddings feed into vector search for retrieval, RAG pipelines for grounded generation, and BigQuery for cohort analytics and research similarity analysis.

Vertex AI Vector Search
RAG Pipeline Retrieval
AI Agents (Grounded)
Clinical Decisions
BigQuery Vectors
Cohort Similarity Analysis
Research & Population Health

Knowledge Graph — Deep Dive

Encode medical ontologies, clinical pathways, and operational rules as a graph to validate and contextualize AI reasoning with structured medical knowledge.

Graph Database on GCP
Service | Type | Query Language | Best For
Neo4j Aura on GCP | Managed graph DB (GCP Marketplace) | Cypher | Full ontology encoding, multi-hop traversals, pathway validation
Spanner Graph | Graph layer on Cloud Spanner | Spanner Graph Query | Global-scale, strongly consistent graph + relational hybrid
Memorystore (Redis Graph) | In-memory graph | Cypher subset | Cached frequent traversals, low-latency lookups at inference
BigQuery + Graph Analytics | Analytical graph | SQL + GRAPH_PATH() | Batch graph analytics on large-scale clinical datasets
▼ ▼ ▼
Core Ontologies Encoded

SNOMED CT

~350K active concepts
  • Clinical terms & hierarchies
  • IS_A relationships (subsumption)
  • Finding-site, causative-agent edges
  • Concept model attributes

ICD-10-CM / PCS

~72K diagnosis + 78K procedure codes
  • Diagnosis code hierarchies
  • Procedure classification
  • SNOMED ↔ ICD-10 mappings
  • HCC risk groupings

LOINC

~100K lab/observation codes
  • Lab test codes & panels
  • Component + method axes
  • Panel → member relationships
  • Units of measure mappings

RxNorm

~115K drug concepts
  • Medications & ingredients
  • Dose forms & strengths
  • NDC ↔ RxNorm mappings
  • Clinical drug → ingredient edges

CPT / HCPCS

~10K+ procedure codes
  • Procedure billing codes
  • Category I, II, III codes
  • Modifier relationships
  • CPT ↔ ICD-10-PCS mappings
▼ ▼ ▼
Ontologies encoded as graph nodes and edges
Graph Schema — Nodes & Edges

Node Types

:Concept :Medication :Condition :LabTest :Procedure :Pathway :Guideline :Contraindication :GeneVariant

Edge Types (Relationships)

IS_A HAS_FINDING TREATS CONTRAINDICATED_WITH ORDERED_FOR MAPS_TO PART_OF_PATHWAY INTERACTS_WITH HAS_INGREDIENT ASSOCIATED_WITH
Concept (Generic)
Medication
Condition
LabTest
Procedure
Pathway / Guideline
GeneVariant
▼ ▼ ▼
Clinical Pathway Graphs & Drug Interaction Graph

Clinical Pathway Graph Evidence-Based

Encode pathways (sepsis bundle, ACS protocol, diabetes management) as directed graphs with decision nodes, time constraints, and required actions.

Trigger Condition
Decision Node
Required Action (Time-Bound)
Next Decision
Outcome Node

Drug Interaction Graph Safety

Medication → ingredient → interaction edges with severity levels (critical, major, moderate, minor). Used at inference to validate AI medication recommendations.

Medication A
Ingredient X
INTERACTS_WITH [severity: critical]
Ingredient Y
Medication B

Sepsis Bundle Example

Time-zero recognition → lactate draw (30 min) → blood cultures (before abx) → broad-spectrum antibiotics (1 hr) → fluid resuscitation (30 mL/kg if hypotensive) → reassess.

ACS Protocol Example

Chest pain → 12-lead ECG (10 min) → troponin draw → STEMI pathway (cath lab activation) or NSTEMI pathway (risk stratification) → anticoagulation → cardiology consult.

▼ ▼ ▼
Graph-Powered Validation at Inference

AI Action → Graph Validation → Safe Output

AI Agent Proposes Action → Query Graph → Validation Result

Contraindication Check

  • Traverse CONTRAINDICATED_WITH edges
  • Check patient allergy list vs proposed med
  • Block or warn on match

Guideline Concordance

  • Match proposed action to PART_OF_PATHWAY edges
  • Verify action aligns with clinical pathway
  • Flag deviations with explanation

Terminology Correctness

  • Validate codes via IS_A / MAPS_TO traversals
  • Ensure SNOMED, ICD-10, LOINC accuracy
  • Resolve ambiguous terms to correct concepts

Pathway Completeness

  • Check all required pathway steps are addressed
  • Identify missing actions in protocol
  • Verify time constraints are met
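The Contraindication Check above amounts to a two-hop traversal: medication → HAS_INGREDIENT → ingredient, then INTERACTS_WITH between ingredients. A toy in-memory sketch — in production this would be a Cypher or Spanner Graph query against the real ontology; the medications, ingredients, and severity here are illustrative:

```python
# Toy in-memory graph; production traverses Neo4j / Spanner Graph.
HAS_INGREDIENT = {
    "warfarin 5mg tab": {"warfarin"},
    "ibuprofen 400mg tab": {"ibuprofen"},
}
INTERACTS_WITH = {  # undirected ingredient-pair edges with severity
    frozenset({"warfarin", "ibuprofen"}): "major",
}

def check_interactions(proposed_med, active_meds):
    """Traverse HAS_INGREDIENT then INTERACTS_WITH edges for a proposed order.

    Returns a list of findings; an empty list means no known interaction.
    """
    findings = []
    for med in active_meds:
        for a in HAS_INGREDIENT.get(proposed_med, ()):
            for b in HAS_INGREDIENT.get(med, ()):
                severity = INTERACTS_WITH.get(frozenset({a, b}))
                if severity:
                    findings.append({"with": med, "pair": (a, b), "severity": severity})
    return findings
```

A real validator would also walk CONTRAINDICATED_WITH edges against the patient's AllergyIntolerance list before returning a block/warn decision.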
▼ ▼ ▼
Graph APIs & Integration
Access Method GCP Service Protocol Use Case
Direct graph queries Neo4j Aura (Bolt) Bolt protocol / Cypher Complex traversals, ontology exploration, ad-hoc queries
REST endpoints Cloud Run HTTPS / JSON Agent tool calls: validate_medication, check_pathway, lookup_code
Cached lookups Memorystore (Redis) Redis protocol Frequent traversals cached: drug interactions, code lookups
Agent tool integration Vertex AI Agent Builder Tool / Function Calling Graph queries exposed as callable tools for Gemini agents
Batch analytics BigQuery + Dataflow SQL + Graph export Bulk ontology analysis, mapping coverage reports
▼ ▼ ▼
Graph Maintenance & Lifecycle

Automated Ontology Updates

SNOMED CT releases twice a year, RxNorm monthly, and ICD-10 annually. Automated pipelines ingest each new release and update graph nodes/edges.

Versioned Graph Snapshots

Every ontology update creates a versioned snapshot. Enables rollback and point-in-time queries. Stored in Cloud Storage as Neo4j dumps.

Clinical Review Workflows

Pathway updates require clinical committee review. Approval workflow in Cloud Workflows with human-in-the-loop before graph promotion.

Lineage Tracking

Every node/edge tracks: source ontology, version, last_updated, provenance. Queryable for audit and compliance.

Consistency Validation

Scheduled Cypher queries detect orphan nodes, broken relationships, and circular hierarchies. Alerts via Cloud Monitoring.

Cross-Ontology Alignment

Maintain MAPS_TO edges across ontologies (SNOMED↔ICD-10, LOINC↔CPT). Validate mapping coverage on each release cycle.
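The circular-hierarchy check in the Consistency Validation card is plain cycle detection over IS_A edges. A minimal DFS sketch (in production this would be a scheduled Cypher query; the toy concept names are illustrative):

```python
def has_circular_hierarchy(is_a_edges):
    """Detect cycles in IS_A edges (child -> list of parents) via DFS.

    Returns True if any circular subsumption chain exists.
    """
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited / on current path / done
    color = {}

    def visit(node):
        color[node] = GRAY
        for parent in is_a_edges.get(node, ()):
            c = color.get(parent, WHITE)
            if c == GRAY:          # back edge to the current path = cycle
                return True
            if c == WHITE and visit(parent):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in is_a_edges)
```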

▼ ▼ ▼
Graph Scale
  • 800K+ total nodes across all encoded ontologies
  • 3M+ total edges (IS_A, TREATS, etc.)
  • <5ms cached lookup — Redis-cached drug interaction check
  • <50ms multi-hop traversal — 3-hop pathway validation query
  • 5 ontologies encoded: SNOMED, ICD-10, LOINC, RxNorm, CPT
  • Monthly update cadence, aligned with the fastest ontology (RxNorm)
▼ ▼ ▼

Downstream: Validation, Grounding & Enrichment

The Knowledge Graph serves as a real-time validation layer for AI agents, provides concept expansion for RAG grounding, and enriches embeddings with ontology-aware relationships.

Knowledge Graph → Agent Tool Calls (Validation) → Safe Clinical Recommendations
Knowledge Graph → RAG Grounding (Concept Expansion) → Embedding Enrichment (Ontology Vectors)

Data Fabric, IAM, DLP, VPC-SC & Policies

Intelligent data fabric providing consistent, policy-controlled access to distributed healthcare data while appearing unified to AI agents and users.

Data Fabric Components — Dataplex Logical Domains

Clinical Domain

Patient records, encounters, observations, conditions, medications. FHIR-native views.


Imaging Domain

DICOM metadata, radiology reports, pathology slides. Linked to clinical context.

Operations Domain

ADT census, scheduling, staffing, supply chain, billing. Real-time event streams.


Research Domain

De-identified cohorts, OMOP CDM tables, trial registries. IRB-controlled access.


Genomics Domain

VCF files, variant annotations, pharmacogenomics panels. Stored in Cloud Storage + BigQuery.

Data Catalog & Data Products

Data Catalog Dataplex

  • Automated metadata discovery across BigQuery, Cloud Storage, FHIR stores
  • Data lineage tracking — source to consumption
  • Business glossary: standardized healthcare term definitions
  • Sensitivity tags (PHI, PII, de-identified) auto-classified by DLP
  • Search and discovery for analysts and AI agents

Data Products Contracts & SLAs

  • Longitudinal Patient Record — unified view, <5min freshness SLA
  • ICU Telemetry Stream — real-time vitals, <10s latency SLA
  • Oncology Registry — curated staging/treatment data, daily refresh
  • Claims Mart — adjudicated claims + denials, T+1 SLA
  • Each product has owner, schema contract, quality checks, access policy
▼ ▼ ▼
Standardized Access APIs
Access Pattern | GCP Service | Consumers | Use Cases
FHIR R4 REST | Cloud Healthcare API | EHR apps, SMART-on-FHIR, CDS Hooks | Patient read/write, clinical data exchange
REST / GraphQL | Cloud Run + Hasura/Apollo | Internal apps, dashboards | Flexible queries over BigQuery curated views
Search & RAG | Vertex AI Search | AI agents, clinician search | Semantic search across clinical documents + notes
Feature Serving | Vertex AI Feature Store | ML models, prediction agents | Low-latency feature vectors for real-time inference
External API Gateway | Apigee | External partners, HIEs, payers | Rate limiting, auth, analytics for external consumers
▼ ▼ ▼
IAM Architecture — Hierarchy & Roles

GCP Resource Hierarchy

Organization (healthcare-corp.com)
├─ Folder: US-East Region | Folder: US-West Region | Folder: Research
├─ Project: prod-clinical | Project: prod-imaging | Project: prod-ops
└─ Project: staging-clinical | Project: dev-sandbox
IAM Roles & Group Mapping
Role | IAM Binding | Access Scope | Mapped Group
Clinician Viewer | roles/healthcare.fhirResourceReader | Own patients, assigned unit | grp-cardiology, grp-oncology, etc.
Researcher Analyst | roles/bigquery.dataViewer | De-identified datasets only | grp-research-approved
Operations Admin | roles/bigquery.dataEditor | Operational tables, dashboards | grp-ops-managers
AI Agent Service Account | roles/aiplatform.user + custom | Scoped per agent type, purpose-bound | sa-clinical-agent@proj.iam
External Partner | roles/healthcare.fhirResourceReader | Specific FHIR resources via Apigee | grp-external-payer-feeds
▼ ▼ ▼
VPC Service Controls

Perimeter Architecture

VPC-SC Perimeter: healthcare-prod
Services inside perimeter: BigQuery • Cloud Storage • Cloud Healthcare API • Vertex AI • Cloud KMS
Ingress Rules: authorized corporate networks, VPN, specific service accounts
Egress Rules: restricted to approved external APIs (Apigee, HIE endpoints)
Bridge Perimeter: cross-project AI pipelines (prod-clinical ↔ prod-ai)
▼ ▼ ▼
DLP — Data Loss Prevention

PHI Detection Cloud DLP

  • Inspection jobs scan BigQuery tables + Cloud Storage objects
  • Detects: MRN, SSN, DOB, patient names, addresses, phone numbers
  • Custom infoTypes for institution-specific identifiers
  • Continuous inspection on new data via Dataflow integration
  • Findings exported to BigQuery for audit and dashboards

De-identification Transforms Automated

  • Masking — replace PHI with redacted tokens
  • Tokenization — deterministic crypto-hash for re-linkage
  • Date shifting — random offset preserving intervals
  • K-anonymity — generalize quasi-identifiers (age buckets, zip3)
  • Automated DLP transforms in Dataflow pipelines for research datasets
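The tokenization and date-shifting transforms above can be sketched in a few lines; Cloud DLP provides both natively, so this is only a model of the behavior. The key would come from Cloud KMS, and every name and value here is illustrative:

```python
import hashlib
import hmac
from datetime import date, timedelta

SECRET = b"per-project tokenization key"  # in production: fetched from Cloud KMS

def tokenize(identifier):
    """Deterministic keyed hash: same input -> same token.

    Re-linkage is possible only for holders of the key (honest broker).
    """
    return hmac.new(SECRET, identifier.encode(), hashlib.sha256).hexdigest()[:16]

def shift_dates(dates, patient_id, max_days=365):
    """Shift all of one patient's dates by a single deterministic per-patient
    offset, preserving the intervals between clinical events."""
    offset = int(tokenize(patient_id), 16) % (2 * max_days) - max_days
    return [d + timedelta(days=offset) for d in dates]
```

Because the offset is constant per patient, longitudinal analyses (time between admission and readmission, lab trends) remain valid on the de-identified data.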
▼ ▼ ▼
Encryption & Confidential Computing

CMEK — Customer-Managed Keys

  • All data stores encrypted with Cloud KMS keys
  • Key hierarchy: org root → project keys → dataset keys
  • Automatic key rotation (90-day policy)
  • Key access audit via Cloud Audit Logs

Confidential VMs

  • AMD SEV / Intel TDX memory encryption
  • Used for sensitive AI inference workloads
  • Data encrypted in use — not just at rest / in transit
  • Attestation reports for compliance evidence

Column-Level Encryption

  • Ultra-sensitive fields (SSN, genomic data) encrypted at column level
  • Separate CMEK per sensitivity tier
  • Decrypt only with explicit IAM grant + purpose justification
  • BigQuery policy tags enforce column-level access
▼ ▼ ▼
Policy Engine — Attribute-Based Access Control (ABAC)

Agent Call Evaluation Flow

Agent / User Request → Policy Engine (Cloud Run) → Evaluate Attributes → Allow / Deny + Log
Attribute | Source | Examples
User Role | IAM + Google Groups | clinician, researcher, ops-admin, AI-agent
Purpose | Request header / token claim | treatment, payment, operations, research
Data Sensitivity | Dataplex tags + DLP classification | PHI, de-identified, public, restricted
Patient Consent | FHIR Consent resource | opt-in research, restrict substance-abuse records
Context | IAM Conditions | Time of day, IP range, device posture
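The evaluation flow above can be modeled as a small rule matcher over the attribute table. This is a conceptual sketch, not a real Cloud Run API — the policy shape, roles, and purposes are assumptions:

```python
# Hypothetical policy rules: each rule names the attribute values it permits.
POLICIES = [
    {"role": "clinician", "purpose": "treatment",
     "sensitivity": {"PHI", "de-identified"}},
    {"role": "researcher", "purpose": "research",
     "sensitivity": {"de-identified"}},
]

def evaluate(role, purpose, sensitivity, consent_ok=True):
    """ABAC decision: allow only if consent permits AND some rule matches
    every attribute of the request; otherwise deny (default-deny)."""
    if not consent_ok:
        return "DENY"
    for rule in POLICIES:
        if (rule["role"] == role
                and rule["purpose"] == purpose
                and sensitivity in rule["sensitivity"]):
            return "ALLOW"
    return "DENY"
```

In the real system each decision would also be written to Cloud Audit Logs with the evaluated attributes, as the flow above indicates.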
▼ ▼ ▼
Audit & Compliance

Cloud Audit Logs

  • Data access logs (who read what, when)
  • Admin activity logs (config changes)
  • Exported to BigQuery for long-term retention
  • Real-time alerting on anomalous access

Compliance Frameworks

  • HIPAA BAA executed with Google Cloud
  • SOC 2 Type II — continuous controls
  • FedRAMP High (GovCloud for federal)
  • HITRUST CSF certification

Access Reviews

  • Quarterly IAM access recertification
  • Automated unused-permission detection (IAM Recommender)
  • Breach detection via Security Command Center
  • SIEM integration for SOC workflows
▼ ▼ ▼
Sovereignty Controls

Regional Deployment

  • Org policies restrict resource locations (us-east1, us-central1)
  • Assured Workloads for IL4/IL5 regulated environments
  • Data residency enforcement — no cross-region replication without policy

Jurisdiction Controls

  • State-specific data handling (California CCPA, Texas HB 300)
  • Cross-border interop layers for international sites
  • Consent enforcement varies by jurisdiction
▼ ▼ ▼
Security Monitoring Stack

Security Command Center

Threat detection, vulnerability scanning, compliance posture management

Policy Intelligence

IAM Recommender: least-privilege suggestions, unused role alerts

VPC Flow Logs

Network traffic analysis, anomaly detection, forensic investigation

DLP Dashboard

Looker dashboard: PHI findings, de-id coverage, inspection job status

Chronicle SIEM

Centralized security analytics, correlation rules, incident response

▼ ▼ ▼

Cross-Cutting: Spans All Pipeline Layers

Data Fabric and Security controls wrap every component — from ingestion through AI agent execution. Every data access is policy-evaluated, logged, and auditable.

Ingestion → Harmonization → Lakehouse → Knowledge + Embeddings → AI Agents → Consumer Apps

Data Fabric + IAM + VPC-SC + DLP + Audit = enforced at every arrow above


Clinical AI Agents — Deep Dive

Autonomous AI agents that monitor, reason, and recommend within clinical workflows — integrated into EHR, always grounded, always auditable.

Agent Architecture on Vertex AI

Core Platform Stack

Vertex AI Agent Builder • Gemini (Reasoning Engine) • LangChain / LangGraph Orchestration
Agent Tools: RAG Search • FHIR API • Knowledge Graph • Feature Store • EHR Write-Back
▼ ▼ ▼
Clinical Agent Types

Deterioration Prediction Agent

  • Continuous monitoring of vitals, labs, nursing assessments
  • NEWS2 / MEWS scoring with ML-enhanced prediction
  • Real-time alerts to care team via FHIR CommunicationRequest
  • Escalation protocols: nurse → charge → rapid response → code team
  • 6-hour, 12-hour, 24-hour deterioration risk windows

Clinical Documentation Agent

  • Ambient listening via speech-to-text (Chirp on Vertex AI)
  • Generates structured clinical notes (SOAP, H&P, progress)
  • Extracts structured data: diagnoses, medications, procedures
  • ICD-10 and CPT coding suggestions with confidence scores
  • Clinician review and sign-off before EHR commit

Diagnostic Support Agent

  • Differential diagnosis from symptoms, labs, imaging findings
  • Guideline-matched recommendations (AHA, NCCN, IDSA)
  • Literature evidence retrieval via RAG over PubMed + UpToDate
  • Imaging interpretation assist (radiology, pathology)
  • Confidence scoring with supporting evidence chain

Medication Safety Agent

  • Real-time drug-drug interaction check via Knowledge Graph
  • Dose adjustment for renal impairment (CrCl-based) and hepatic function
  • Formulary compliance and therapeutic alternatives
  • Allergy cross-check against documented AllergyIntolerance
  • High-alert medication double-check enforcement

Care Gap Agent

  • Identifies missing screenings (colonoscopy, mammogram, A1c)
  • Vaccination gap detection (influenza, pneumococcal, COVID)
  • Follow-up tracking per HEDIS / CMS quality measures
  • Patient outreach generation (secure message, letter, call list)
  • Quality measure impact scoring for value-based contracts
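The CrCl-based renal dose adjustment under the Medication Safety Agent is commonly computed with the Cockcroft-Gault equation. A minimal sketch — the review threshold is illustrative, not drug-specific, and real logic would be per-medication:

```python
def creatinine_clearance(age, weight_kg, serum_cr_mg_dl, female):
    """Cockcroft-Gault estimate of creatinine clearance (mL/min):
    ((140 - age) * weight) / (72 * SCr), * 0.85 if female."""
    crcl = ((140 - age) * weight_kg) / (72 * serum_cr_mg_dl)
    return crcl * 0.85 if female else crcl

def renal_dose_flag(crcl, threshold=30):
    """Flag an order for pharmacist review below a per-drug CrCl threshold
    (threshold here is illustrative)."""
    return "REVIEW: renal dose adjustment" if crcl < threshold else "OK"
```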
▼ ▼ ▼
Agent Interaction Pattern — Event-Driven Flow

Clinical Event to EHR Notification

Clinical Event → Pub/Sub → Agent Trigger → RAG Retrieval + Knowledge Graph + Feature Store
└─→ Recommendation → Safety Check → EHR Notification
Triggers: new lab result, vital sign, order entry, admission, discharge, scheduled interval
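The event-driven flow above can be sketched as a Pub/Sub push handler. Everything here is illustrative: the tool implementations (RAG retrieval, Gemini reasoning, Knowledge Graph safety check, FHIR write-back) are stubbed behind a `tools` dict, and the event schema is an assumption:

```python
import json

TRIGGER_TYPES = {"lab_result", "vital_sign", "order_entry", "admission", "discharge"}

def handle_event(message, tools):
    """Route a clinical event through retrieve -> reason -> validate -> notify.

    message: JSON string as delivered by a Pub/Sub push subscription.
    tools:   dict of callables standing in for the agent's tool layer.
    """
    event = json.loads(message)
    if event["type"] not in TRIGGER_TYPES:
        return {"status": "ignored"}
    context = tools["retrieve_context"](event["patient_id"])  # RAG + Feature Store
    rec = tools["reason"](event, context)                     # Gemini reasoning
    if not tools["safety_check"](rec):                        # KG validation gate
        return {"status": "suppressed"}
    tools["notify_ehr"](rec)                                  # FHIR CommunicationRequest
    return {"status": "delivered", "recommendation": rec}
```

The key property this models is the mandatory safety gate: no recommendation reaches the EHR without passing validation, matching the Safety Constraints section.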
▼ ▼ ▼
Safety Constraints

Confidence Thresholds

Recommendations suppressed below configurable confidence threshold. Low-confidence outputs routed to human review queue.

Knowledge Graph Validation

Every clinical recommendation validated against curated Knowledge Graph (drug interactions, contraindications, guidelines).

Human-in-the-Loop

High-risk actions (medication changes, code blue alerts, diagnosis) require clinician confirmation before execution.

Explanation Generation

Every recommendation includes reasoning chain: evidence sources, feature contributions, guideline references.

Override Tracking

Clinician overrides logged with reason. Override patterns analyzed for model improvement and safety signal detection.

▼ ▼ ▼
Integration with EHR
Integration Pattern | Standard | Use Case | Direction
App Launch | SMART-on-FHIR | Agent UI embedded in EHR context (patient, encounter) | EHR → Agent
Decision Support | CDS Hooks | patient-view, order-select, order-sign hook triggers | EHR → Agent → EHR
Event Subscription | FHIR Subscriptions | New lab result, admission, medication order triggers agent | EHR → Agent
Notification Write-Back | FHIR CommunicationRequest | In-basket messages, alerts, task assignments to care team | Agent → EHR
Documentation Write-Back | FHIR DocumentReference | AI-generated notes posted back for clinician review | Agent → EHR
▼ ▼ ▼
Model Stack
Model | Platform | Role | Use Cases
Gemini | Vertex AI | Reasoning & generation | Agent orchestration, note generation, differential diagnosis
Med-PaLM | Vertex AI | Clinical Q&A | Medical knowledge retrieval, clinical question answering
Custom ML Models | Vertex AI Training | Specialized prediction | Deterioration, readmission, sepsis, LOS prediction
Ensemble Scoring | Vertex AI Endpoints | Combined inference | Multi-model consensus for high-stakes clinical decisions
▼ ▼ ▼
Monitoring & Feedback Loop

Agent Action Logs BigQuery

  • Every agent invocation: input, tools used, output, latency
  • Clinician acceptance / override rates per agent type
  • Outcome correlation: did the alert prevent an adverse event?
  • Alert fatigue metrics: suppress rate, snooze rate

Performance Dashboards Looker

  • Model accuracy: sensitivity, specificity, PPV per agent
  • Bias monitoring across demographics (age, race, sex)
  • Drift detection: feature distribution shifts over time
  • A/B comparison for model version rollouts
▼ ▼ ▼

Agent Execution Flow

Clinical AI agents consume data from the unified lakehouse, reason with Gemini, validate against the Knowledge Graph, and deliver actionable insights back into the EHR.

Lakehouse (BigQuery) → Feature Store → Agent Builder → Gemini Reasoning → Safety Validation → EHR Action

Operations AI Agents — Deep Dive

AI agents optimizing hospital operations — bed management, staffing, throughput, supply chain, and revenue cycle.

Operations Agent Types

Bed Management Agent

  • Real-time census from ADT feed (Cloud Healthcare API → Pub/Sub)
  • Predicted discharges via ML model (Vertex AI custom training)
  • Admission demand forecasting — ED, surgical, transfer
  • Bed assignment optimization (unit, isolation, acuity matching)
  • EVS coordination: auto-trigger room cleaning on discharge

Staffing Agent

  • Acuity-based staffing models (nurse-to-patient ratio)
  • Shift optimization: minimize gaps, balance workload
  • Float pool allocation based on predicted census
  • Overtime prediction and premium labor cost alerts
  • Skill-mix matching: certifications to unit requirements

Throughput Agent

  • ED boarding detection and escalation alerts
  • OR turnover optimization (case duration prediction)
  • Discharge barrier identification (pending consults, transport, Rx)
  • Patient flow bottleneck analysis across departments
  • Discharge-before-noon tracking and nudge notifications

Supply Chain Agent

  • Inventory forecasting using time-series models
  • Par level optimization per unit and item category
  • Expiration tracking with waste reduction alerts
  • Vendor order automation (PO generation via ERP API)
  • Pandemic stockpile monitoring and surge readiness scoring

Revenue Cycle Agent

  • Charge capture validation — missing charges flagged at discharge
  • Coding accuracy review (ICD-10/CPT vs documentation)
  • Denial prediction — flag claims likely to be denied pre-submission
  • Prior authorization automation (fax → AI extraction → status tracking)
  • A/R aging prioritization — rank accounts by recovery likelihood
▼ ▼ ▼
Data Sources — All Via BigQuery Enriched Zone
Data Source | Feed Type | GCP Ingestion | Agents Consuming
ADT Feed (Real-time Census) | HL7v2 ADT^A01-A03 | Cloud Healthcare API → Pub/Sub → BigQuery | Bed, Throughput, Staffing
Scheduling Systems | SIU messages / API | Dataflow → BigQuery | Throughput, Staffing
HR / Timekeeping | Batch / API (Kronos, Workday) | Cloud Storage → Dataflow → BigQuery | Staffing
Materials Management | ERP API (Infor, SAP) | Cloud Run connector → BigQuery | Supply Chain
Billing / Claims | 837/835 EDI, DFT | Dataflow → BigQuery | Revenue Cycle
Patient Satisfaction | Survey API (Press Ganey) | Cloud Functions → BigQuery | Throughput, All
▼ ▼ ▼
Agent Architecture

Orchestration Pattern

Event (Pub/Sub) → Vertex AI Agent Builder → Tools: BigQuery, FHIR API, Scheduling API, ERP API
Schedule (Cloud Scheduler) → Vertex AI Agent Builder → Optimization Models (OR-Tools on Cloud Run)
└─→ Recommendation / Action → Looker Dashboard / Push Notification / ERP Update
▼ ▼ ▼
Optimization Models

Demand Forecasting Time-Series

  • Vertex AI AutoML Forecasting for admission volume
  • Features: day-of-week, seasonality, flu trends, census history
  • Horizons: 4-hour, 24-hour, 7-day predictions
  • Per-unit and per-service-line granularity

Constraint Optimization OR-Tools

  • OR-Tools on Cloud Run for bed assignment and staff scheduling
  • Constraints: acuity, isolation, gender, unit capacity
  • Objective: minimize transfers, maximize utilization
  • Solves in <30s for a 500-bed facility

Simulation What-If

  • Discrete event simulation for patient flow scenarios
  • Test impact of: adding beds, changing discharge criteria, OR block changes
  • Monte Carlo runs for probabilistic outcomes
  • Results visualized in Looker dashboards

Reinforcement Learning Dynamic

  • RL agents for dynamic staffing decisions
  • Environment: real-time census, acuity, upcoming admits
  • Reward: patient outcomes + cost efficiency balance
  • Trained on Vertex AI, deployed to Cloud Run
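The constraint-optimization card above can be illustrated with a toy greedy assigner that respects the named hard constraints (unit match, isolation rooms) and serves high-acuity patients first. A production solver would use OR-Tools CP-SAT; all field names here are assumptions:

```python
def assign_beds(patients, beds):
    """Greedy sketch of constraint-aware bed assignment.

    Hard constraints: bed must be on the patient's unit, and isolation
    patients need isolation-capable rooms. Highest acuity is placed first
    so scarce beds go to the sickest patients.
    """
    assignment, free = {}, list(beds)
    for p in sorted(patients, key=lambda p: -p["acuity"]):
        for bed in free:
            if bed["unit"] == p["unit"] and (not p["isolation"] or bed["isolation_room"]):
                assignment[p["id"]] = bed["id"]
                free.remove(bed)
                break
    return assignment
```

Greedy placement is only a heuristic; a CP-SAT model additionally encodes the stated objective (minimize transfers, maximize utilization) and proves optimality within the solve budget.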
▼ ▼ ▼
Key Metrics Tracked
  • ED Door-to-Doc: <20min — target for throughput agent
  • Bed Turnaround: <45min — discharge to next admit
  • OR Utilization: >85% — prime-time block usage
  • Discharge Before Noon: >30% — early discharge target
  • Premium Labor: <5% — agency / overtime spend
  • Clean Claim Rate: >95% — first-pass acceptance
  • Days in A/R: <35 — revenue collection speed
  • Supply Stockout Rate: <1% — critical item availability
▼ ▼ ▼
Integration & Outputs

Looker Dashboards

Real-time ops command center: census, throughput, staffing, revenue cycle KPIs. Role-based views for CNO, CMO, CFO.

Push Notifications

Alerts to charge nurses, bed managers, department directors via mobile (Firebase Cloud Messaging).

Automated Work Orders

EVS cleaning triggers, patient transport requests, equipment setup — auto-generated on discharge/transfer events.

ERP System Updates

Purchase orders, inventory adjustments, staffing schedule changes pushed to ERP systems via Cloud Run connectors.

▼ ▼ ▼
ROI Indicators

ED Boarding Reduction

Target 40% reduction in boarding hours through predictive bed assignment and discharge acceleration.

OR Utilization Gains

5-10% improvement in prime-time OR utilization via case duration prediction and turnover optimization.

Labor Cost Savings

15-25% reduction in premium labor (agency, overtime) through predictive staffing and float pool optimization.

Supply Waste Reduction

20-30% reduction in expired supplies through demand-driven par levels and expiration alerts.

Revenue Recovery

2-5% increase in net revenue via charge capture improvement, denial prevention, and faster A/R collection.

▼ ▼ ▼

Operations Agent Execution Flow

Operations AI agents consume real-time operational feeds, apply forecasting and optimization models, and deliver actions to staff and systems.

ADT + Scheduling + HR + Supply → BigQuery Enriched Zone → Agent Builder + OR-Tools → Looker + Notifications + ERP

Research AI Agents — Deep Dive

AI agents that accelerate clinical research — cohort discovery, literature analysis, trial matching, and population health pattern recognition.

Research Agent Types

Cohort Discovery Agent

  • Natural language queries: "patients with HFrEF, A1c >9, on SGLT2i, seen in last year"
  • Text-to-SQL via Gemini against BigQuery research tables
  • Validates queries against data dictionary and OMOP concept sets
  • Returns counts, demographics breakdown, feasibility assessment
  • Iterative refinement: agent suggests criteria modifications

Literature Agent

  • Semantic search across PubMed + institutional publications via RAG
  • Evidence summarization with citation chain
  • Systematic review assistance: screen abstracts, extract PICO elements
  • Citation network analysis: influential papers, research trends
  • Grounded answers with source references and confidence

Trial Matching Agent

  • Ingests active trials from ClinicalTrials.gov API
  • Extracts inclusion/exclusion criteria via NLU
  • Screens patient records against eligibility criteria
  • Generates pre-screening lists ranked by match confidence
  • Notifies investigators and coordinators of eligible patients

Anomaly Detection Agent

  • Continuous surveillance on population data in BigQuery
  • Detects emerging disease clusters (geo-temporal patterns)
  • Unusual lab result trends across patient populations
  • Adverse event signal detection (drug safety surveillance)
  • Infection outbreak pattern recognition (syndromic surveillance)

Hypothesis Generation Agent

  • Identifies correlations in multimodal data (labs + meds + outcomes)
  • Suggests research questions based on data patterns
  • Proposes study designs (RCT, cohort, case-control)
  • Estimates sample sizes and statistical power
  • Cross-references findings with existing literature
▼ ▼ ▼
Cohort Discovery Agent — Interaction Flow

Natural Language to SQL to Results

Researcher Query (NL) → Gemini Text-to-SQL → Data Dictionary Validation → OMOP Concept Expansion
└─→ BigQuery Execution → Results (Count, Demographics, Feasibility) → Export / Refine
All queries execute against de-identified research tables. IRB approval verified before data export.
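The data-dictionary validation step in the flow above acts as a guardrail on LLM-generated SQL before it ever reaches BigQuery. A minimal sketch, assuming a whitelist of de-identified research tables (the table names and rules are hypothetical):

```python
import re

# Hypothetical whitelist of de-identified research tables.
ALLOWED_TABLES = {"research.person", "research.condition_occurrence", "research.measurement"}

def validate_generated_sql(sql):
    """Guardrails for LLM-generated cohort SQL: read-only statements only,
    and only tables from the approved de-identified schema.

    Returns (ok, reason).
    """
    if re.search(r"\b(INSERT|UPDATE|DELETE|DROP|MERGE)\b", sql, re.I):
        return False, "write statements are not allowed"
    tables = set(re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.I))
    unknown = tables - ALLOWED_TABLES
    if unknown:
        return False, f"unapproved tables: {sorted(unknown)}"
    return True, "ok"
```

A full validator would also resolve column names against the institutional data dictionary and expand OMOP concept sets, as the flow describes.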
▼ ▼ ▼
Data Access & Privacy Controls
Access Tier | Data Type | Controls | Use Case
De-identified (Safe Harbor) | 18 identifiers removed via DLP | Open to approved researchers | Cohort discovery, feasibility, population analytics
Limited Dataset | Dates + zip3 retained | DUA required, IRB-approved | Longitudinal studies, temporal pattern analysis
Honest Broker | Re-linkable via broker only | Broker intermediary, audit trail | Multi-source data linkage, registry enrollment
Synthetic Data | Generated via Vertex AI | No restrictions | Model development, algorithm testing, education
Identified (PHI) | Full patient data | IRB + patient consent + CISO approval | Interventional trials, direct patient contact
▼ ▼ ▼
Research Data Platform — GCP Stack

Query & Analysis Primary

  • BigQuery — primary SQL engine for cohort queries, population analytics
  • Dataproc (Spark) — large-scale genomics analysis, variant processing
  • Vertex AI Workbench — managed Jupyter for interactive R/Python analysis
  • Vertex AI Training — custom model development (survival analysis, NLP)

Storage & Compute Infrastructure

  • Cloud Storage — raw genomic files (VCF, BAM, FASTQ)
  • BigQuery — OMOP CDM tables, de-identified research warehouse
  • Vertex AI Feature Store — pre-computed research features
  • Artifact Registry — versioned model artifacts and containers
▼ ▼ ▼
External Integrations
System | Integration | GCP Connector | Purpose
REDCap | REST API | Cloud Run connector | Electronic data capture for prospective studies
i2b2 / OMOP CDM | BigQuery views | Native BigQuery tables | Standard research data models, OHDSI tool compatibility
OHDSI Tools (Atlas, Achilles) | WebAPI | Cloud Run + BigQuery OMOP | Cohort definitions, data quality, characterization
SAS / R / Python | BigQuery connectors | bigrquery, pandas-gbq, SAS/ACCESS | Statistical analysis in researcher's preferred tool
ClinicalTrials.gov | REST API | Cloud Functions → BigQuery | Trial eligibility criteria ingestion for matching
PubMed | E-utilities API | RAG index (Vertex AI Search) | Literature search, evidence retrieval for agents
▼ ▼ ▼
Knowledge Sources for Agent Grounding

Knowledge Graph

Ontologies (SNOMED, LOINC, RxNorm, ICD-10) for query expansion and concept mapping.

PubMed Index

36M+ biomedical abstracts indexed in Vertex AI Search for RAG-powered literature retrieval.

ClinicalTrials.gov

400K+ trial records with structured eligibility criteria for automated patient matching.

Institutional Data Dictionary

Local table schemas, field definitions, valid values. Ensures accurate text-to-SQL generation.

OMOP Concept Sets

Standardized phenotype definitions for reproducible cohort queries across institutions.

▼ ▼ ▼
Research Data Governance

Purpose-Based Access

Access granted per approved research protocol. Treatment data vs. research data separated at IAM and VPC-SC level.

Full Audit Trail

Every query logged in BigQuery audit tables: who, what, when, which dataset, under which IRB protocol.

Minimum Necessary

Column-level access: researchers see only fields required by their protocol. Enforced via BigQuery policy tags.

Re-identification Risk

Automated risk assessment before data export. K-anonymity and l-diversity checks via Cloud DLP.

Consent Management

FHIR Consent resources integrated: patients opting out of research excluded from query results automatically.
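The k-anonymity portion of the Re-identification Risk card reduces to counting how many rows share each quasi-identifier combination (Cloud DLP offers this natively through its risk-analysis jobs). A minimal sketch with illustrative field names:

```python
from collections import Counter

def k_anonymity_violations(rows, quasi_identifiers, k=5):
    """Return quasi-identifier combinations shared by fewer than k rows.

    rows: list of dicts (one per record in the export candidate).
    An empty result means the dataset satisfies k-anonymity for these fields.
    """
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return [combo for combo, n in groups.items() if n < k]
```

Combinations returned here would be generalized further (wider age buckets, zip3 instead of zip5) or suppressed before the export is approved.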

▼ ▼ ▼
Example: End-to-End Trial Matching Workflow

From Protocol to Pre-Screening List

New Trial Protocol → NLU Criteria Extraction (Gemini) → OMOP Concept Mapping → BigQuery Patient Screen
└─→ Match Scoring + Ranking → Pre-Screening List → Coordinator Review + REDCap Enrollment
Reduces manual chart review from weeks to hours. Average 3x increase in enrollment rate.
▼ ▼ ▼

Research Agent Execution Flow

Research AI agents operate on de-identified data, leverage ontologies for precision, and feed results back to investigators through familiar tools.

De-identified Lakehouse → Knowledge Graph + Ontologies → Agent Builder (Gemini) → BigQuery + RAG → Workbench / REDCap / OHDSI

Dashboards (Looker) — Deep Dive

Real-time operational, clinical, and executive analytics powered by BigQuery, surfaced through Looker and Looker Studio across the health system.

Looker Architecture on GCP

Looker (SaaS / Core)

Managed Looker instance
or Looker Core on GKE


LookML Semantic Layer

Git-managed models on top of
BigQuery curated & enriched zones


Looker Studio

Lightweight self-service
dashboards & ad-hoc reports


Embed SDK

Embedded analytics in
EHR portals & custom apps

BigQuery BI Engine

In-memory acceleration
sub-second query response


Vertex AI Integration

ML predictions surfaced
as Looker metrics (risk scores)

▼ ▼ ▼ ▼ ▼
LookML models connect to BigQuery curated & enriched zones
Dashboard Categories

Clinical Quality

Quality & Safety
  • HEDIS measures compliance tracking
  • 30-day readmission rates by DRG
  • Mortality indices (O/E ratios)
  • Hospital-acquired infection rates
  • Patient safety indicators (PSIs)
  • Core measures compliance (SEP-1, VTE, etc.)

Operational Command Center

Real-Time Ops
  • Real-time inpatient census by unit
  • ED throughput: door-to-doc, boarding hours
  • OR utilization & turnover time
  • Bed turnaround & discharge tracker
  • Staffing ratios vs. patient acuity
  • Transfer center volume & capacity

Financial / Revenue Cycle

Finance
  • Clean claim rate & denial rate trends
  • Days in A/R by payer
  • Case mix index (CMI) by service line
  • Cost per case & margin analysis
  • Payer mix & contract performance
  • Charge capture leakage detection

Population Health

Pop Health
  • Risk stratification panels (high/med/low)
  • Care gap closure rates by measure
  • Chronic disease registries (DM, CHF, COPD)
  • SDoH impact analysis (food, housing, transport)
  • Health equity metrics by demographics
  • ACO/VBC performance tracking

Executive / Board

Leadership
  • Balanced scorecard (quality, finance, ops, people)
  • Trend analysis with rolling 12-month views
  • Peer benchmark comparisons (Vizient, CMS)
  • Strategic KPI tracking & goal progress
  • Service line growth & volume trends
▼ ▼ ▼
LookML Semantic Layer
LookML Component BigQuery Target Purpose Key Details
model: clinical bq_curated.clinical_* Clinical domain explores Encounters, conditions, observations, medications
model: operations bq_curated.ops_* Operational metrics Census, throughput, capacity, staffing tables
model: finance bq_curated.rev_cycle_* Revenue cycle & cost Claims, charges, payments, denials, A/R aging
model: research bq_enriched.research_* De-identified research cohorts Cohort tables, genomic summaries, trial enrollment
derived_table (PDT) bq_scratch.pdt_* Expensive computed metrics Readmission flags, risk scores, rolling aggregates
access_filter user_attributes Row-level security Filter by facility_id, department, user role
aggregate_awareness bq_curated.agg_* Query acceleration Pre-aggregated daily/weekly/monthly rollups
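Aggregate awareness works by routing each query to the smallest pre-aggregated table that can still answer it, falling back to the raw table only when needed. A minimal sketch of that selection logic — table names, row counts, and field sets here are illustrative, not the actual LookML configuration:

```python
# Sketch: how aggregate awareness might pick the cheapest covering rollup.
# Tables, row-count estimates, and field sets are hypothetical examples.
ROLLUPS = [
    ("bq_curated.agg_daily_census",   400_000,    {"date", "facility", "unit", "census"}),
    ("bq_curated.agg_monthly_census", 15_000,     {"month", "facility", "census"}),
    ("bq_curated.ops_census_raw",     90_000_000, {"date", "facility", "unit", "bed", "census"}),
]

def pick_table(requested_fields: set) -> str:
    """Return the smallest table whose fields cover the requested query."""
    candidates = [(rows, name) for name, rows, fields in ROLLUPS
                  if requested_fields <= fields]
    if not candidates:
        raise ValueError("no table covers the requested fields")
    return min(candidates)[1]  # fewest rows wins
```

A monthly census query would resolve to the monthly rollup, while a bed-level query falls through to the raw table — which is why the pre-aggregated `agg_*` tables pay off for the high-traffic dashboards.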
▼ ▼ ▼
Real-Time Capabilities

Streaming Refresh 30s – 5min

BigQuery streaming buffer ingests events in real time. Looker dashboards auto-refresh at configurable intervals for operational views.

Pub/Sub Events
BigQuery Streaming
BI Engine Cache
Looker Auto-Refresh

Alerts & Scheduled Delivery Proactive

Threshold-based alerts (e.g., ED boarding > 4h) delivered via email, Slack, or PagerDuty. Scheduled report PDFs for leadership.

  • Conditional alerts on metric thresholds
  • Scheduled Look delivery (email, Slack, SFTP)
  • PagerDuty integration for critical operational alerts
  • Mobile-optimized views for on-call managers
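The alerting pattern above is simple threshold evaluation with severity-based routing. A minimal sketch — the specific thresholds and channel assignments are illustrative policy, not production configuration:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    metric: str
    value: float
    threshold: float
    channel: str

def evaluate_alerts(metrics: dict) -> list:
    """Compare current operational metrics to thresholds; route breaches by channel.
    Thresholds and channels below are example policy only."""
    rules = [
        ("ed_boarding_hours",  4.0,  "pagerduty"),  # ED boarding > 4h is critical
        ("denial_rate_pct",    12.0, "slack"),
        ("bed_turnaround_min", 90.0, "email"),
    ]
    alerts = []
    for metric, threshold, channel in rules:
        value = metrics.get(metric)
        if value is not None and value > threshold:
            alerts.append(Alert(metric, value, threshold, channel))
    return alerts
```

In Looker itself this is configured declaratively per tile; the sketch just shows the decision logic that fires a PagerDuty page for ED boarding while routine trends go to email or Slack.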

Drill-Down Navigation Interactive

System-level → facility → unit → patient-level drill paths. Cross-dashboard linking for root cause analysis.

  • Click census count → unit breakdown → patient list
  • Click denial rate → denial reason → individual claims
  • Cross-filter between related dashboards

AI-Powered Insights Vertex AI

Vertex AI predictions (risk scores, demand forecasts, readmission probability) written to BigQuery and surfaced as Looker metrics.

Vertex AI Models
BigQuery (predictions)
Looker Dashboards
▼ ▼ ▼
End-to-End Data Flow

BigQuery to User

Data flows from the curated/enriched lakehouse through the LookML semantic layer to dashboards consumed by clinicians, operators, and executives.

BigQuery (Curated/Enriched)
LookML Models
Looker Explores
Dashboards / Looks
Users / Embedded Apps
▼ ▼ ▼
Access Control & Security
Layer Mechanism GCP Service Details
Authentication SSO / SAML 2.0 Google Workspace / Cloud Identity Federated login, MFA enforced
Looker Roles role-based access Looker IAM Groups Admin, developer, viewer, embed-user roles
Model Access model_set LookML project Users see only permitted models (clinical, finance, etc.)
Row-Level Security access_filter BigQuery + LookML User sees only their facility/department data
Content Access folder permissions Looker folders/boards Dashboard visibility controlled by folder ACLs
Query Guardrails query cost limits BigQuery Reservations Slot-based quotas, per-user query byte limits
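Row-level security via `access_filter` amounts to injecting a WHERE predicate derived from the user's attributes into every generated query. A simplified sketch of that predicate construction — the attribute names mirror the table above, but the SQL shape is a stand-in for what Looker actually generates:

```python
def access_filter_sql(user_attrs: dict) -> str:
    """Build the row-level predicate that would be injected from user attributes.
    Simplified stand-in for Looker's generated WHERE clause."""
    clauses = []
    if "facility_id" in user_attrs:
        ids = ", ".join(f"'{f}'" for f in user_attrs["facility_id"])
        clauses.append(f"facility_id IN ({ids})")
    if "department" in user_attrs:
        clauses.append(f"department = '{user_attrs['department']}'")
    # No attributes set (e.g., system admins): no row restriction applied
    return " AND ".join(clauses) if clauses else "1 = 1"
```

Because the filter is applied in the semantic layer, every explore, dashboard, and scheduled delivery inherits it — a user scoped to two facilities can never query a third, regardless of which dashboard they open.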
▼ ▼ ▼
Embedding & Extensions

Looker Embed SDK Embed

Embed interactive dashboards directly in EHR portals, custom web apps, and patient portals with SSO pass-through.

Host App (EHR/Portal)
Looker Embed SDK
Signed URL / SSO Embed
Interactive Dashboard
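Signed-URL embedding works by having the host app's backend sign the embed parameters with a shared secret, so the iframe can authenticate without a separate login. A simplified sketch of the signing step — the real Looker SSO embed signature covers more fields (permissions, models, session length) in a vendor-specified order, so treat this as the shape of the mechanism, not the exact protocol:

```python
import base64, hashlib, hmac, json, time
from urllib.parse import quote

def sign_embed_url(host: str, embed_path: str, user_id: str, secret: str) -> str:
    """Simplified Looker-style signed embed: HMAC-SHA1 over newline-joined
    params, base64-encoded. Field list and order are abbreviated here."""
    nonce = "abc123"                      # would be random per request
    ts = str(int(time.time()))
    string_to_sign = "\n".join([host, embed_path, nonce, ts, json.dumps(user_id)])
    sig = base64.b64encode(
        hmac.new(secret.encode(), string_to_sign.encode(), hashlib.sha1).digest()
    ).decode()
    return (f"https://{host}{embed_path}?nonce={nonce}&time={ts}"
            f"&external_user_id={quote(json.dumps(user_id))}&signature={quote(sig)}")
```

The EHR portal's backend builds this URL per session; the embed secret never reaches the browser, and the nonce plus timestamp prevent replay.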

Looker Actions Workflow

Trigger downstream workflows from dashboard data: generate outreach lists, push to CRM, create tasks in care management systems.

  • Send care gap list → outreach CRM
  • Export high-risk panel → care coordinator queue
  • Push denial data → billing worklist

Extension Framework Custom

Build custom React-based applications hosted inside Looker for specialized workflows (e.g., clinical registry management).

  • Custom visualizations (D3.js, Vega)
  • In-Looker workflow tools
  • Looker API & SDK for programmatic access

Looker Studio Self-Service

Lightweight self-service dashboards for business users who need ad-hoc exploration without LookML complexity.

BigQuery
Looker Studio Connector
Self-Service Reports
▼ ▼ ▼
Performance & Optimization
< 3s
Dashboard Load Target
BI Engine + Looker caching
1 GB
BI Engine Reservation
In-memory acceleration per project
30s
Min Auto-Refresh
Operational command center dashboards
PDTs
Persistent Derived Tables
Pre-compute expensive aggregations
Agg Aware
Aggregate Awareness
Auto-select pre-aggregated tables
Slots
BQ Reservations
Dedicated compute for dashboard queries
← Back to Overview

EHR Integration (SMART-on-FHIR) — Deep Dive

Surface AI insights and platform capabilities directly within the clinician's EHR workflow — zero context switching.

SMART-on-FHIR Framework
🔒

EHR Launch

App launched from EHR context
Patient + encounter pre-populated

🌐

Standalone Launch

App launched independently
User selects patient context

🔐

OAuth 2.0 Scopes

patient/*.read, user/*.read
launch/patient, launch/encounter

👤

FHIR Context

Patient ID, Encounter ID
User identity & role

📝

App Registration

Registered with EHR vendor
Client ID, redirect URIs, scopes

Token Validation

Short-lived access tokens
Refresh token rotation
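Scope enforcement is the core of the SMART model: the app only ever sees what its granted scopes permit. A minimal checker for the v1 wildcard forms listed above (`patient/*.read`, `user/*.read`):

```python
def scope_permits(granted: list, context: str, resource: str, action: str) -> bool:
    """Check whether a granted SMART v1 scope list permits an operation.
    Handles wildcard resource (*) and action (*) forms; a sketch, not a
    full implementation of the SMART App Launch scope grammar."""
    for scope in granted:
        ctx, _, rest = scope.partition("/")
        res, _, act = rest.partition(".")
        if ctx == context and res in ("*", resource) and act in ("*", action):
            return True
    return False
```

So a token carrying only `patient/*.read` can fetch Observations for the launched patient but is rejected on any write — the "minimum data principle" enforced mechanically at the token layer.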

▼ ▼ ▼ ▼ ▼
EHR fires CDS Hooks on clinical events
CDS Hooks Integration

Hook Points & Request/Response Flow

patient-view (Chart Open) → Cloud Run CDS → Vertex AI Agent → Cards: risk scores, care gaps
order-select (Ordering Med/Lab) → Cloud Run CDS → Vertex AI Agent → Cards: suggestions, alerts
order-sign (Before Signing) → Cloud Run CDS → Vertex AI Agent → Cards: contraindications, prior auth
encounter-start (Visit Begins) → Cloud Run CDS → Vertex AI Agent → Cards: patient summary, prep checklist
appointment-book (Scheduling) → Cloud Run CDS → Vertex AI Agent → Cards: prep orders, pre-visit tasks
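Each hook returns "cards" in the CDS Hooks response format: a summary, an indicator severity, and a source attribution. A sketch of the Cloud Run service assembling a card for the patient-view hook — the risk thresholds and wording are illustrative, not clinical policy:

```python
def risk_card(score: float) -> dict:
    """Build a CDS Hooks response card for a patient-view hook.
    Card shape (summary, indicator, source, detail) follows the CDS Hooks
    spec; the thresholds and text are example values only."""
    indicator = "critical" if score >= 0.8 else "warning" if score >= 0.5 else "info"
    return {
        "cards": [{
            "summary": f"Sepsis risk score: {score:.0%}",
            "indicator": indicator,
            "source": {"label": "Vertex AI sepsis model"},
            "detail": "Score computed from streaming vitals and labs. "
                      "Review before acting; this is decision support, not a diagnosis.",
        }]
    }
```

Keeping card assembly this thin matters for the < 500ms response target: the expensive inference runs ahead of time and the hook endpoint mostly reads precomputed scores from BigQuery.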
▼ ▼ ▼
Deployed SMART App Types

AI Insights Panel

EHR Sidebar — iframe embed
  • Active risk scores (sepsis, readmission, fall)
  • Pending care gap alerts
  • Recent agent recommendations
  • Trend sparklines from BigQuery enriched zone

Documentation Assistant

SMART launch from note editor
  • Ambient note generation (Vertex AI Gemini)
  • Structured data extraction from dictation
  • ICD-10 / CPT suggestion
  • Quality measure documentation prompts

Diagnostic Decision Support

CDS Hooks — order-select trigger
  • Differential diagnosis ranking
  • Evidence-based order recommendations
  • Literature links & clinical guidelines
  • Drug-drug interaction checks

Prior Authorization Tool

SMART launch — order workflow
  • Automated payer criteria matching
  • Clinical document assembly
  • Electronic submission (X12 278)
  • Status tracking & appeal support

Patient Timeline

FHIR facade on BigQuery enriched zone
  • Unified longitudinal view across all encounters
  • Lab trends, medication history, problem list
  • External records via Carequality / CommonWell
  • AI-generated visit summaries
▼ ▼ ▼
Architecture — End-to-End

EHR to AI Backend

SMART launch triggers authentication, then Cloud Run hosts the app and orchestrates calls to AI and data services. All responses formatted as FHIR resources.

EHR (SMART Launch)
OAuth (Google Identity / EHR)
Cloud Run (App Host)
Vertex AI Agents
BigQuery / Cloud Healthcare API
FHIR Response
▼ ▼ ▼
FHIR Facade Pattern

Read Path FHIR R4

BigQuery curated data exposed as standard FHIR endpoints. The EHR reads enriched/computed data as if it were a native FHIR server.

EHR FHIR Client
Cloud Healthcare API
FHIR Facade (Cloud Run)
BigQuery Curated Zone

Write-Back Path Human-in-Loop

AI recommendations require clinician approval before writing back. Audit trail and undo capability enforced on every write.

AI Recommendation
Clinician Review/Approve
FHIR Write (Cloud Healthcare API)
HL7v2 Outbound to EHR
▼ ▼ ▼
EHR Vendor Specifics
EHR Vendor App Marketplace SMART Support CDS Hooks Key Notes
Epic App Orchard / Gallery Full (Hyperdrive web) Supported USCDI v3, Bulk FHIR, embedded via Hyperspace/Hyperdrive
Oracle Health (Cerner) code Console Full (Ignite APIs) Supported Ignite FHIR R4 APIs, Millennium HL7v2 feeds; code Console parallels Epic's open.epic program
MEDITECH Greenfield SMART R1 (expanding) Limited Expanse FHIR R4, Greenfield SMART for Expanse web
athenahealth Marketplace FHIR R4 Roadmap Cloud-native, API-first; REST and FHIR R4 APIs
▼ ▼ ▼
Write-Back FHIR Resources
Use Case FHIR Resource Source Safety Controls
Risk score documentation Observation Vertex AI prediction Clinician approval required, audit log
Care plan creation CarePlan AI agent recommendation Human review, undo within 24h
Order recommendation ServiceRequest CDS Hook suggestion Clinician must sign, no auto-ordering
Note generation DocumentReference Ambient documentation Clinician edits & co-signs before commit
Problem list update Condition Diagnostic decision support Suggestion only, clinician confirms
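The write-back rule in the table — no FHIR write without clinician approval — can be enforced in code rather than convention. A sketch of the risk-score case, assembling the Observation only after an approval gate; the coding system and code here are placeholders, not a standard terminology binding:

```python
from typing import Optional

def build_risk_observation(patient_id: str, score: float,
                           approved_by: Optional[str]) -> dict:
    """Assemble the FHIR R4 Observation for a risk-score write-back.
    Raises unless a clinician has approved, mirroring the human-in-loop rule.
    The coding system/code below are illustrative placeholders."""
    if not approved_by:
        raise PermissionError("clinician approval required before FHIR write")
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {"coding": [{"system": "https://example.org/ai-scores",
                             "code": "readmission-risk"}]},
        "subject": {"reference": f"Patient/{patient_id}"},
        "valueQuantity": {"value": round(score, 3), "unit": "probability"},
        "performer": [{"display": f"AI model, approved by {approved_by}"}],
    }
```

Placing the gate in the resource builder means no downstream path — Cloud Healthcare API write or HL7v2 outbound — can be reached with an unapproved recommendation, and the approver lands in the resource itself for the audit trail.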
▼ ▼ ▼
Security & Performance
< 500ms
CDS Hook Response Target
Sync path for simple lookups
OAuth 2.0
Token Validation
Short-lived, patient-context scoped
Async
Long AI Inference
Progressive loading, background fetch
Minimum
Data Principle
Request only needed FHIR scopes
Audit
Every Action Logged
Cloud Logging → BigQuery audit
Consent
Enforcement
Patient consent checked before AI use
← Back to Overview

Patient Portals & Virtual Agents — Deep Dive

Patient-facing digital experiences grounded in the AI platform — safe, personalized, and accessible across all channels.

Portal Capabilities

Health Record Access

Cloud Healthcare API — FHIR R4 (US Core)
  • Lab results with trend charts
  • Medications & allergies
  • Immunization records
  • Visit summaries & discharge instructions
  • Imaging reports & pathology

AI Health Assistant

Vertex AI Conversation + Dialogflow CX
  • Symptom triage with safety routing
  • Medication questions & interactions
  • Appointment scheduling via conversation
  • Test result explanation (plain language)
  • Health education content delivery

Care Plan Tracker

FHIR CarePlan + Observation
  • View active care plans & goals
  • Track goal progress (A1c, BP, weight)
  • Log self-reported data (PROs, vitals)
  • Medication reminders & adherence

Secure Messaging

Firestore + Cloud Run
  • Patient-provider secure messaging
  • AI-assisted message routing to right team
  • Draft response suggestions for staff
  • Smart reply for common questions

Appointment & Scheduling

Vertex AI Optimization + FHIR Appointment
  • Self-scheduling with AI slot optimization
  • Pre-visit questionnaires (FHIR Questionnaire)
  • Digital check-in & insurance verification
  • Telehealth launch (video visit integration)
▼ ▼ ▼ ▼ ▼
Patient query enters the Virtual Agent pipeline
Virtual Agent Architecture

Conversational AI Pipeline

STEP 1 Patient Query Dialogflow CX (Intent/Entity) Vertex AI Agent (Reasoning)
STEP 2 Vertex AI Agent RAG: Patient FHIR Data + RAG: Health Education Corpus
STEP 3 Knowledge Graph Validation Safety Checks & Guardrails Confidence Scoring
STEP 4 Response + Citations + Disclaimers Patient (Mobile / Web / SMS)
▼ ▼ ▼
Grounding & Safety Guardrails

No Diagnosis

Agent never provides a diagnosis. Symptom triage routes to appropriate care level (ER, urgent care, PCP, self-care) with disclaimers.

No Prescribing

Agent cannot prescribe, adjust, or recommend stopping medications. All medication queries reference existing prescription data only.

Emergency Redirect

Keywords (chest pain, suicidal, can't breathe) trigger immediate 911/crisis line redirect. No further conversation on emergency topics.

Human Escalation

Clinical concerns beyond agent scope escalated to nurse triage line or provider message. Patient can request human at any time.

Grounded Responses

All answers grounded in patient's FHIR data and vetted content (MedlinePlus, institutional patient education). Hallucination detection active.

Mandatory Disclaimers

Every clinical response includes disclaimer: "This is not medical advice. Contact your provider for medical decisions." Confidence score shown.
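The emergency redirect is deliberately the first and dumbest check in the pipeline: keyword matching runs before any LLM turn, so a model failure cannot swallow an emergency. A minimal sketch — the keyword list is illustrative, and production routing would pair it with an intent classifier:

```python
# Illustrative keyword list; a real deployment would maintain a reviewed,
# multilingual lexicon and combine it with intent classification.
EMERGENCY_TERMS = {"chest pain", "suicidal", "can't breathe", "overdose"}

def route_message(text: str) -> str:
    """First-pass safety router: emergency phrases short-circuit to a
    911/crisis redirect before the conversational AI ever sees the message."""
    lowered = text.lower()
    if any(term in lowered for term in EMERGENCY_TERMS):
        return "emergency_redirect"   # 911 / crisis line; conversation ends
    return "agent"                    # continue to Dialogflow/Vertex pipeline
```

Ordering matters: because this check precedes Dialogflow and Vertex AI, the "no further conversation on emergency topics" rule holds even if the downstream agent is misconfigured or unavailable.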

▼ ▼ ▼
Technology Stack
Layer GCP Service Purpose Details
Frontend Firebase Hosting Web portal (React / Flutter Web) CDN-backed, HTTPS, responsive design
Mobile Flutter (iOS + Android) Cross-platform native app Push notifications via Firebase Cloud Messaging
API Backend Cloud Run Serverless APIs Auto-scaling, min instances for latency
FHIR Data Cloud Healthcare API Patient records (US Core FHIR) FHIR R4, SMART scopes, consent-aware
Conversational AI Dialogflow CX + Vertex AI NLU + reasoning Multi-turn, multilingual, context-aware
Session State Firestore Conversation history & context Real-time sync, TTL-based expiration
Async Tasks Cloud Tasks + Pub/Sub Notifications, reminders, background jobs Scheduled medication reminders, follow-ups
Authentication Google Identity Platform Patient login (OIDC) MFA, ID proofing, social login, SMS OTP
Translation Cloud Translation API Multi-language support 140+ languages, medical term-aware
▼ ▼ ▼
Accessibility & Health Equity

Multi-Language Cloud Translation + Gemini

Real-time translation of portal content and agent conversations. Multilingual Gemini handles complex medical term translation.

  • Cloud Translation API for UI text
  • Gemini multilingual for conversational AI
  • Clinician-reviewed translations for key content

Health Literacy Auto-Simplify

Medical jargon automatically simplified to patient-friendly language. Reading level targeting (6th-8th grade).

  • Gemini-powered jargon-to-plain-language
  • Configurable reading level per patient preference
  • Visual aids and diagrams where applicable
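Checking output against the 6th–8th grade target is typically done with a readability formula such as Flesch-Kincaid grade level. A rough sketch — syllables are approximated by vowel groups, so the score is a screen on generated text, not an exact measure:

```python
import re

def fk_grade(text: str) -> float:
    """Rough Flesch-Kincaid grade estimate for screening simplified output
    against a 6th-8th grade target. Syllables are approximated as vowel
    groups, so expect a few tenths of a grade of error."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59
```

A pipeline might regenerate any response scoring above the patient's configured grade level, feeding the score back into the Gemini simplification prompt.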

WCAG 2.1 AA Accessible

Screen reader compatible, keyboard navigable, high contrast mode, resizable text. Tested with assistive technologies.

  • ARIA labels on all interactive elements
  • Color contrast ratios verified
  • Voice navigation support

Low-Tech Channels Equity

SMS and voice channel fallback for patients without smartphones or reliable internet. Caregiver proxy access with verified authorization.

  • SMS-based appointment reminders & triage
  • IVR voice agent (Dialogflow CX phone gateway)
  • Caregiver proxy access with HIPAA authorization
▼ ▼ ▼
Data Flow

Patient Interaction to Data Pipeline

Patient interactions flow through the conversational AI stack; patient-generated data (PROs, vitals) flows back into the platform as FHIR resources for clinical use.

Patient (App/Web/SMS)
Dialogflow CX
Vertex AI + Cloud Healthcare API
BigQuery (Enriched Zone)
Personalized Response
Patient-Generated Data (PROs, Vitals)
FHIR Observation
Cloud Healthcare API
Pub/Sub → Pipeline → BigQuery
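Patient-generated data entering this path is packaged as standard FHIR resources so the clinical side consumes it like any other vitals feed. A sketch for a self-reported blood pressure — the LOINC panel and component codes (85354-9, 8480-6, 8462-4) are the standard BP codes, while the rest of the resource is simplified:

```python
def pro_blood_pressure(patient_id: str, systolic: int, diastolic: int) -> dict:
    """Package a patient-reported blood pressure as a FHIR R4 Observation
    for the Cloud Healthcare API → Pub/Sub path above. LOINC codes are the
    standard BP panel/component codes; other details are simplified."""
    return {
        "resourceType": "Observation",
        "status": "final",
        "category": [{"coding": [{
            "system": "http://terminology.hl7.org/CodeSystem/observation-category",
            "code": "vital-signs"}]}],
        "code": {"coding": [{"system": "http://loinc.org", "code": "85354-9",
                             "display": "Blood pressure panel"}]},
        "subject": {"reference": f"Patient/{patient_id}"},
        "component": [
            {"code": {"coding": [{"system": "http://loinc.org", "code": "8480-6"}]},
             "valueQuantity": {"value": systolic, "unit": "mmHg"}},
            {"code": {"coding": [{"system": "http://loinc.org", "code": "8462-4"}]},
             "valueQuantity": {"value": diastolic, "unit": "mmHg"}},
        ],
    }
```

Because the resource uses standard codes, the same Observation serves the clinician's trend view in the EHR and the enriched-zone analytics in BigQuery without per-consumer translation.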
▼ ▼ ▼
Privacy & Consent

HIPAA Compliance

All data encrypted in transit (TLS 1.3) and at rest (CMEK). BAA with Google Cloud. PHI access logged and auditable.

Consent Management

Opt-in required for AI features. Granular preferences: AI assistant on/off, data sharing, research participation.

Right to Access/Export

FHIR $export for patient data download. Machine-readable format (FHIR JSON, C-CDA). Compliant with 21st Century Cures Act.

Minor/Guardian Controls

Age-appropriate access. Guardian proxy with verified authorization. Adolescent confidentiality rules per state law.

Data Sharing Preferences

Patient controls data sharing scope: within health system only, HIE participation, research opt-in/out. Preferences enforced at API layer.

Breach Notification

Automated breach detection (Cloud DLP, Security Command Center). Notification workflows per HIPAA Breach Notification Rule (60-day window).

▼ ▼ ▼
Portal Analytics
BigQuery
Usage Metrics
Page views, feature adoption, session duration
Dialogflow
Conversation Analytics
Intent match rate, fallback rate, containment
NPS / CSAT
Patient Satisfaction
Post-interaction surveys, star ratings
Engagement
Activation & Retention
Portal activation %, monthly active users
Outcomes
Health Correlation
Portal engagement vs. care gap closure
Looker
Executive Dashboards
Digital health program KPIs in Looker