
Database Architect

Schema design, optimization, and migration scripts for all major databases.

65% tokens saved
4hrs per schema saved
85% popularity

Quick Info

  • Version: 2.0
  • Last Updated: 2025-01
  • Difficulty: Advanced
  • Category: Development & Coding

Use Cases

  • Schema design
  • Query optimization
  • Migration planning
  • Index strategies

Features

  • ERD generation
  • Normalization
  • Performance tuning
  • Migration scripts

System Prompt

You are a database architecture expert with deep knowledge of:
- Relational databases (PostgreSQL, MySQL, Oracle, SQL Server)
- NoSQL databases (MongoDB, Cassandra, DynamoDB, Redis)
- Time-series databases (InfluxDB, TimescaleDB)
- Graph databases (Neo4j, Amazon Neptune)
- Database design patterns and anti-patterns
- Normalization and denormalization strategies
- Indexing strategies and query optimization
- Sharding and partitioning
- Replication and high availability
- ACID properties and CAP theorem
- Migration strategies and tools
- Performance tuning and monitoring

You design database schemas that are scalable, performant, and maintainable while considering data integrity, consistency, and business requirements.

Main Prompt

Design a comprehensive database architecture based on the requirements below. Provide schema design, optimization strategies, and implementation details.

## 📊 Database Requirements
### Application Context:
[DESCRIBE_APPLICATION]

### Data Requirements:
- Entities: [LIST_MAIN_ENTITIES]
- Relationships: [DESCRIBE_RELATIONSHIPS]
- Data Volume: [EXPECTED_RECORDS]
- Growth Rate: [RECORDS_PER_DAY]
- Read/Write Ratio: [RATIO]

### Performance Requirements:
- Query Response Time: [MILLISECONDS]
- Concurrent Users: [NUMBER]
- Transactions Per Second: [TPS]

## 🏗️ Database Architecture Design

### Technology Selection
**Primary Database**: [PostgreSQL/MySQL/MongoDB/etc.]
**Reasoning**: [Why this database fits the requirements]

**Supporting Technologies**:
- Cache Layer: [Redis/Memcached]
- Search Engine: [Elasticsearch/Solr]
- Analytics: [ClickHouse/BigQuery]

### Schema Design

#### Entity-Relationship Diagram
```mermaid
erDiagram
    ENTITY_1 ||--o{ ENTITY_2 : relationship
    ENTITY_1 {
        uuid id PK
        timestamp created_at
        timestamp updated_at
        string field_1
        integer field_2
    }
    ENTITY_2 {
        uuid id PK
        uuid entity_1_id FK
        string field_3
        json metadata
    }
```

#### DDL Statements
```sql
-- Main tables with optimal data types and constraints
CREATE TABLE entity_1 (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    field_1 VARCHAR(255) NOT NULL,
    field_2 INTEGER NOT NULL DEFAULT 0,
    field_3 TEXT,
    metadata JSONB,
    status VARCHAR(50) NOT NULL DEFAULT 'active',
    
    -- Constraints
    CONSTRAINT chk_status CHECK (status IN ('active', 'inactive', 'deleted')),
    CONSTRAINT chk_field_2_positive CHECK (field_2 >= 0)
);

-- Add comments for documentation
COMMENT ON TABLE entity_1 IS 'Main entity for storing...';
COMMENT ON COLUMN entity_1.metadata IS 'Flexible JSON storage for additional attributes';

-- Child entity table (columns per the ERD above), so the junction table's foreign keys resolve
CREATE TABLE entity_2 (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    entity_1_id UUID NOT NULL REFERENCES entity_1(id) ON DELETE CASCADE,
    field_3 VARCHAR(255),
    metadata JSONB
);

-- Junction table for many-to-many relationships
CREATE TABLE entity_1_entity_2 (
    entity_1_id UUID NOT NULL REFERENCES entity_1(id) ON DELETE CASCADE,
    entity_2_id UUID NOT NULL REFERENCES entity_2(id) ON DELETE CASCADE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    metadata JSONB,
    
    PRIMARY KEY (entity_1_id, entity_2_id)
);

-- Audit table for tracking changes
CREATE TABLE audit_log (
    id BIGSERIAL PRIMARY KEY,
    table_name VARCHAR(100) NOT NULL,
    record_id UUID NOT NULL,
    action VARCHAR(10) NOT NULL,
    changed_by UUID,
    changed_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    old_values JSONB,
    new_values JSONB
);
```

### Indexing Strategy

```sql
-- Primary indexes for common queries
CREATE INDEX idx_entity_1_status_created ON entity_1(status, created_at DESC);
CREATE INDEX idx_entity_1_field_1_lower ON entity_1(LOWER(field_1)) WHERE status = 'active';

-- Partial indexes for specific conditions
CREATE INDEX idx_entity_1_active ON entity_1(id) WHERE status = 'active';

-- Composite indexes for complex queries
CREATE INDEX idx_entity_1_composite ON entity_1(field_1, field_2, created_at DESC);

-- GIN index for JSONB queries
CREATE INDEX idx_entity_1_metadata ON entity_1 USING GIN (metadata);

-- Full-text search index
CREATE INDEX idx_entity_1_search ON entity_1 USING GIN (to_tsvector('english', field_1 || ' ' || COALESCE(field_3, '')));

-- Analyze index usage
COMMENT ON INDEX idx_entity_1_status_created IS 'Used for: Dashboard queries, pagination';
```

### Partitioning Strategy

```sql
-- Partition large tables by date
CREATE TABLE events (
    id BIGSERIAL,
    created_at TIMESTAMPTZ NOT NULL,
    event_type VARCHAR(50) NOT NULL,
    data JSONB,
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

-- Create monthly partitions
CREATE TABLE events_2025_01 PARTITION OF events
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

CREATE TABLE events_2025_02 PARTITION OF events
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');

-- Automated partition management: creates the current month's partition (schedule monthly, e.g. via pg_cron)
CREATE OR REPLACE FUNCTION create_monthly_partitions()
RETURNS void AS $$
DECLARE
    start_date DATE;
    end_date DATE;
    partition_name TEXT;
BEGIN
    start_date := DATE_TRUNC('month', CURRENT_DATE);
    end_date := start_date + INTERVAL '1 month';
    partition_name := 'events_' || TO_CHAR(start_date, 'YYYY_MM');
    
    EXECUTE format('CREATE TABLE IF NOT EXISTS %I PARTITION OF events FOR VALUES FROM (%L) TO (%L)',
        partition_name, start_date, end_date);
END;
$$ LANGUAGE plpgsql;
```

### Query Optimization

#### Common Query Patterns
```sql
-- Optimized pagination with cursor
WITH paginated AS (
    SELECT *,
           COUNT(*) OVER() AS total_count  -- counts all matching rows; omit if a total is not needed
    FROM entity_1
    WHERE status = 'active'
      AND created_at < $1  -- cursor
    ORDER BY created_at DESC
    LIMIT $2
)
SELECT * FROM paginated;

-- Efficient aggregation with window functions
SELECT 
    DATE_TRUNC('day', created_at) AS day,
    COUNT(*) AS daily_count,
    SUM(COUNT(*)) OVER (ORDER BY DATE_TRUNC('day', created_at)) AS running_total,
    AVG(COUNT(*)) OVER (ORDER BY DATE_TRUNC('day', created_at) ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg
FROM entity_1
WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY DATE_TRUNC('day', created_at);

-- Optimized search with trigrams
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_entity_1_field_1_trgm ON entity_1 USING GIN (field_1 gin_trgm_ops);

SELECT * FROM entity_1
WHERE field_1 % 'search_term'  -- Fuzzy search
ORDER BY similarity(field_1, 'search_term') DESC
LIMIT 10;
```

### Performance Tuning

#### Database Configuration
```ini
# PostgreSQL performance tuning (postgresql.conf)
# Example values for a dedicated 32 GB RAM server with SSD storage
shared_buffers = 8GB                    # ~25% of RAM
effective_cache_size = 24GB             # ~75% of RAM
maintenance_work_mem = 256MB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1                  # For SSD
effective_io_concurrency = 200          # For SSD
work_mem = 4MB                          # Per sort/hash operation, per connection
max_connections = 200
```

#### Query Performance Analysis
```sql
-- Enable query logging for slow queries
ALTER SYSTEM SET log_min_duration_statement = 100;  -- Log queries > 100ms (apply with SELECT pg_reload_conf())

-- Analyze query performance
EXPLAIN (ANALYZE, BUFFERS) 
SELECT ... FROM entity_1 WHERE ...;

-- Find missing indexes
SELECT schemaname, tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'public'
  AND n_distinct > 100
  AND correlation < 0.1
ORDER BY n_distinct DESC;

-- Monitor table bloat
SELECT 
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size,
    n_live_tup,
    n_dead_tup,
    round(n_dead_tup::numeric / NULLIF(n_live_tup, 0), 3) AS dead_ratio
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
```

### Data Migration Strategy

```sql
-- Zero-downtime migration approach
-- Step 1: Add new column without NOT NULL
ALTER TABLE entity_1 ADD COLUMN new_field VARCHAR(100);

-- Step 2: Backfill data in batches
-- (COMMIT inside a DO block requires PostgreSQL 11+ and must not run inside an outer transaction)
DO $$
DECLARE
    batch_size INTEGER := 1000;
    rows_updated INTEGER := 0;
BEGIN
    LOOP
        WITH batch AS (
            SELECT id FROM entity_1
            WHERE new_field IS NULL
            LIMIT batch_size
            FOR UPDATE SKIP LOCKED
        )
        UPDATE entity_1
        SET new_field = 'default_value'
        WHERE id IN (SELECT id FROM batch);

        GET DIAGNOSTICS rows_updated = ROW_COUNT;
        EXIT WHEN rows_updated = 0;

        COMMIT;                 -- Release locks after each batch
        PERFORM pg_sleep(0.1);  -- Brief pause to reduce load and contention
    END LOOP;
END $$;

-- Step 3: Add constraint after backfill
ALTER TABLE entity_1 ALTER COLUMN new_field SET NOT NULL;
```

### Backup & Recovery

```bash
#!/bin/bash
# Automated backup script
BACKUP_DIR="/backups/postgres"
DB_NAME="production"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Full backup with compression (custom format)
pg_dump -Fc -d "$DB_NAME" > "$BACKUP_DIR/backup_$TIMESTAMP.dump"

# Point-in-time recovery setup (postgresql.conf settings, not shell commands):
#   archive_mode = on
#   archive_command = 'test ! -f /archive/%f && cp %p /archive/%f'
#   wal_level = replica

# Restore procedure (4 parallel jobs)
# pg_restore -d "$DB_NAME" -j 4 backup_file.dump
```

### Monitoring & Maintenance

```sql
-- Key metrics to monitor
CREATE VIEW database_health AS
SELECT 
    (SELECT count(*) FROM pg_stat_activity) AS active_connections,
    (SELECT count(*) FROM pg_stat_activity WHERE state = 'active') AS active_queries,
    (SELECT max(now() - query_start) FROM pg_stat_activity WHERE state = 'active') AS longest_query,
    (SELECT pg_database_size(current_database())) AS database_size,
    (SELECT count(*) FROM pg_stat_user_tables WHERE n_dead_tup > 1000) AS tables_need_vacuum;

-- Automated maintenance tasks
CREATE EXTENSION IF NOT EXISTS pg_cron;  -- requires pg_cron in shared_preload_libraries

-- Daily vacuum analyze
SELECT cron.schedule('vacuum-analyze', '0 2 * * *', 'VACUUM ANALYZE;');

-- Weekly reindex
SELECT cron.schedule('reindex', '0 3 * * 0', 'REINDEX DATABASE production;');

-- Monthly partition cleanup
SELECT cron.schedule('partition-cleanup', '0 4 1 * *', 'CALL cleanup_old_partitions();');
```

### Security Best Practices
- Use role-based access control (RBAC)
- Implement row-level security (RLS) for multi-tenant data (see the sketch after this list)
- Encrypt sensitive data at rest and in transit
- Regular security audits
- Implement connection pooling (PgBouncer)
- Use prepared statements to prevent SQL injection
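
A minimal sketch of the row-level security item above, assuming a `tenant_id UUID` column on the tenant-scoped table and an application-set session variable (both are assumptions, not part of the schema earlier in this prompt):

```sql
-- Row-level security sketch (assumes a tenant_id column; add one if it does not exist)
ALTER TABLE entity_1 ADD COLUMN IF NOT EXISTS tenant_id UUID;
ALTER TABLE entity_1 ENABLE ROW LEVEL SECURITY;

-- Each connection only sees rows for the tenant it declared via a session variable
CREATE POLICY tenant_isolation ON entity_1
    USING (tenant_id = current_setting('app.current_tenant')::uuid);

-- The application sets the tenant per connection or transaction
SET app.current_tenant = '00000000-0000-0000-0000-000000000001';
SELECT * FROM entity_1;  -- returns only this tenant's rows (table owners bypass RLS unless FORCE ROW LEVEL SECURITY is set)
```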

### Scalability Roadmap
1. **Phase 1** (0-100K records): Single primary database
2. **Phase 2** (100K-1M records): Add read replicas
3. **Phase 3** (1M-10M records): Implement caching layer
4. **Phase 4** (10M+ records): Horizontal sharding
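
As a rough sketch of how Phase 4 might begin, hash partitioning by the future shard key keeps rows co-located before they are split across nodes; the `orders` table, `tenant_id` key, and modulus of 4 below are illustrative assumptions rather than part of the schema above:

```sql
-- Hash-partition a high-volume table by its future shard key (illustrative)
CREATE TABLE orders (
    id BIGSERIAL,
    tenant_id UUID NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    total_amount NUMERIC(12, 2) NOT NULL,
    PRIMARY KEY (id, tenant_id)  -- the partition key must be part of the primary key
) PARTITION BY HASH (tenant_id);

-- Four hash partitions; each could later be moved to its own node
CREATE TABLE orders_p0 PARTITION OF orders FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE orders_p1 PARTITION OF orders FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE orders_p2 PARTITION OF orders FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE orders_p3 PARTITION OF orders FOR VALUES WITH (MODULUS 4, REMAINDER 3);
```

Once a single node is no longer sufficient, each hash range can be routed to its own server, either in the application layer or with an extension such as Citus.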

Variables

DESCRIBE_APPLICATION (Required)

Type and purpose of the application

Example: E-commerce platform, SaaS application, Analytics system

LIST_MAIN_ENTITIES (Required)

Core entities/tables needed

Example: Users, Orders, Products, Payments

Pro Tips

  • Provide clear business requirements and constraints
  • Specify expected data volumes and growth patterns
  • Mention any compliance requirements (GDPR, HIPAA)
  • Include typical query patterns if known
  • Specify if multi-tenancy is required