Schema design, optimization, and migration scripts for all major databases.
2.0
2025-01
advanced
Development & Coding
You are a database architecture expert with deep knowledge of:
- Relational databases (PostgreSQL, MySQL, Oracle, SQL Server)
- NoSQL databases (MongoDB, Cassandra, DynamoDB, Redis)
- Time-series databases (InfluxDB, TimescaleDB)
- Graph databases (Neo4j, Amazon Neptune)
- Database design patterns and anti-patterns
- Normalization and denormalization strategies
- Indexing strategies and query optimization
- Sharding and partitioning
- Replication and high availability
- ACID properties and CAP theorem
- Migration strategies and tools
- Performance tuning and monitoring

You design database schemas that are scalable, performant, and maintainable while considering data integrity, consistency, and business requirements.
Design a comprehensive database architecture based on the requirements below. Provide schema design, optimization strategies, and implementation details.
## 📊 Database Requirements
### Application Context:
[DESCRIBE_APPLICATION]
### Data Requirements:
- Entities: [LIST_MAIN_ENTITIES]
- Relationships: [DESCRIBE_RELATIONSHIPS]
- Data Volume: [EXPECTED_RECORDS]
- Growth Rate: [RECORDS_PER_DAY]
- Read/Write Ratio: [RATIO]
### Performance Requirements:
- Query Response Time: [MILLISECONDS]
- Concurrent Users: [NUMBER]
- Transactions Per Second: [TPS]
## 🏗️ Database Architecture Design
### Technology Selection
**Primary Database**: [PostgreSQL/MySQL/MongoDB/etc.]
**Reasoning**: [Why this database fits the requirements]
**Supporting Technologies**:
- Cache Layer: [Redis/Memcached]
- Search Engine: [Elasticsearch/Solr]
- Analytics: [ClickHouse/BigQuery]
### Schema Design
#### Entity-Relationship Diagram
```mermaid
erDiagram
    ENTITY_1 ||--o{ ENTITY_2 : relationship
    ENTITY_1 {
        uuid id PK
        timestamp created_at
        timestamp updated_at
        string field_1
        integer field_2
    }
    ENTITY_2 {
        uuid id PK
        uuid entity_1_id FK
        string field_3
        json metadata
    }
```
#### DDL Statements
```sql
-- Main tables with optimal data types and constraints
CREATE TABLE entity_1 (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    field_1 VARCHAR(255) NOT NULL,
    field_2 INTEGER NOT NULL DEFAULT 0,
    field_3 TEXT,
    metadata JSONB,
    status VARCHAR(50) NOT NULL DEFAULT 'active',
    -- Constraints
    CONSTRAINT chk_status CHECK (status IN ('active', 'inactive', 'deleted')),
    CONSTRAINT chk_field_2_positive CHECK (field_2 >= 0)
);
-- Add comments for documentation
COMMENT ON TABLE entity_1 IS 'Main entity for storing...';
COMMENT ON COLUMN entity_1.metadata IS 'Flexible JSON storage for additional attributes';
-- Referenced table (matches ENTITY_2 in the ER diagram; required by the FKs below)
CREATE TABLE entity_2 (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    entity_1_id UUID NOT NULL REFERENCES entity_1(id) ON DELETE CASCADE,
    field_3 VARCHAR(255) NOT NULL,
    metadata JSONB,
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Junction table for many-to-many relationships
CREATE TABLE entity_1_entity_2 (
    entity_1_id UUID NOT NULL REFERENCES entity_1(id) ON DELETE CASCADE,
    entity_2_id UUID NOT NULL REFERENCES entity_2(id) ON DELETE CASCADE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    metadata JSONB,
    PRIMARY KEY (entity_1_id, entity_2_id)
);
-- Audit table for tracking changes
CREATE TABLE audit_log (
    id BIGSERIAL PRIMARY KEY,
    table_name VARCHAR(100) NOT NULL,
    record_id UUID NOT NULL,
    action VARCHAR(10) NOT NULL,
    changed_by UUID,
    changed_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    old_values JSONB,
    new_values JSONB
);
```
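The `audit_log` table above needs something to populate it. Below is a minimal sketch of a generic row-change trigger; the function and trigger names are illustrative, and `EXECUTE FUNCTION` assumes PostgreSQL 11+.
```sql
-- Generic audit trigger: records old/new row images as JSONB
CREATE OR REPLACE FUNCTION audit_entity_changes()
RETURNS trigger AS $$
BEGIN
    INSERT INTO audit_log (table_name, record_id, action, old_values, new_values)
    VALUES (
        TG_TABLE_NAME,
        CASE WHEN TG_OP = 'DELETE' THEN OLD.id ELSE NEW.id END,
        TG_OP,
        CASE WHEN TG_OP IN ('UPDATE', 'DELETE') THEN to_jsonb(OLD) END,
        CASE WHEN TG_OP IN ('INSERT', 'UPDATE') THEN to_jsonb(NEW) END
    );
    RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_entity_1_audit
AFTER INSERT OR UPDATE OR DELETE ON entity_1
FOR EACH ROW EXECUTE FUNCTION audit_entity_changes();
```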
### Indexing Strategy
```sql
-- Primary indexes for common queries
CREATE INDEX idx_entity_1_status_created ON entity_1(status, created_at DESC);
CREATE INDEX idx_entity_1_field_1_lower ON entity_1(LOWER(field_1)) WHERE status = 'active';
-- Partial indexes for specific conditions
CREATE INDEX idx_entity_1_active ON entity_1(id) WHERE status = 'active';
-- Composite indexes for complex queries
CREATE INDEX idx_entity_1_composite ON entity_1(field_1, field_2, created_at DESC);
-- GIN index for JSONB queries
CREATE INDEX idx_entity_1_metadata ON entity_1 USING GIN (metadata);
-- Full-text search index
-- Full-text search index (COALESCE prevents a NULL field_3 from nulling the whole vector)
CREATE INDEX idx_entity_1_search ON entity_1 USING GIN (to_tsvector('english', field_1 || ' ' || COALESCE(field_3, '')));
-- Analyze index usage
COMMENT ON INDEX idx_entity_1_status_created IS 'Used for: Dashboard queries, pagination';
```
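To back the "analyze index usage" comment with actual data, a query against `pg_stat_user_indexes` can surface indexes that are rarely scanned. The `idx_scan < 50` threshold is an arbitrary starting point, not a rule.
```sql
-- Indexes that are rarely used relative to their size (removal candidates)
SELECT schemaname, relname AS table_name, indexrelname AS index_name,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan < 50
ORDER BY pg_relation_size(indexrelid) DESC;
```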
### Partitioning Strategy
```sql
-- Partition large tables by date
CREATE TABLE events (
    id BIGSERIAL,
    created_at TIMESTAMPTZ NOT NULL,
    event_type VARCHAR(50) NOT NULL,
    data JSONB,
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);
-- Create monthly partitions
CREATE TABLE events_2025_01 PARTITION OF events
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE events_2025_02 PARTITION OF events
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
-- Automated partition management: create next month's partition ahead of time,
-- so it exists before the first row for that month arrives
CREATE OR REPLACE FUNCTION create_monthly_partitions()
RETURNS void AS $$
DECLARE
    start_date DATE;
    end_date DATE;
    partition_name TEXT;
BEGIN
    start_date := DATE_TRUNC('month', CURRENT_DATE + INTERVAL '1 month');
    end_date := start_date + INTERVAL '1 month';
    partition_name := 'events_' || TO_CHAR(start_date, 'YYYY_MM');
    EXECUTE format('CREATE TABLE IF NOT EXISTS %I PARTITION OF events FOR VALUES FROM (%L) TO (%L)',
        partition_name, start_date, end_date);
END;
$$ LANGUAGE plpgsql;
```
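The monitoring section below schedules a `cleanup_old_partitions()` job; here is one possible sketch of that procedure, assuming a 12-month retention window and the `events_YYYY_MM` naming used above (procedures require PostgreSQL 11+).
```sql
-- Drop event partitions older than the retention window
CREATE OR REPLACE PROCEDURE cleanup_old_partitions()
LANGUAGE plpgsql AS $$
DECLARE
    part RECORD;
    cutoff TEXT := 'events_' || TO_CHAR(CURRENT_DATE - INTERVAL '12 months', 'YYYY_MM');
BEGIN
    FOR part IN
        SELECT c.relname
        FROM pg_inherits i
        JOIN pg_class c ON c.oid = i.inhrelid
        JOIN pg_class p ON p.oid = i.inhparent
        WHERE p.relname = 'events'
          AND c.relname < cutoff  -- works because events_YYYY_MM names sort chronologically
    LOOP
        EXECUTE format('DROP TABLE IF EXISTS %I', part.relname);
    END LOOP;
END;
$$;
```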
### Query Optimization
#### Common Query Patterns
```sql
-- Keyset (cursor) pagination: stable under concurrent inserts and index-friendly
-- Note: COUNT(*) OVER () forces a scan of every matching row;
-- drop total_count when an exact total is not required
WITH paginated AS (
    SELECT *,
        COUNT(*) OVER () AS total_count
    FROM entity_1
    WHERE status = 'active'
      AND created_at < $1 -- cursor: created_at of the last row on the previous page
    ORDER BY created_at DESC
    LIMIT $2
)
SELECT * FROM paginated;
-- Efficient aggregation with window functions over grouped rows
SELECT
    DATE_TRUNC('day', created_at) AS day,
    COUNT(*) AS daily_count,
    SUM(COUNT(*)) OVER (ORDER BY DATE_TRUNC('day', created_at)) AS running_total,
    AVG(COUNT(*)) OVER (ORDER BY DATE_TRUNC('day', created_at)
                        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg
FROM entity_1
WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY DATE_TRUNC('day', created_at)
ORDER BY day;
-- Optimized search with trigrams
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_entity_1_field_1_trgm ON entity_1 USING GIN (field_1 gin_trgm_ops);
SELECT * FROM entity_1
WHERE field_1 % 'search_term' -- Fuzzy search
ORDER BY similarity(field_1, 'search_term') DESC
LIMIT 10;
```
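For completeness, the keyset pagination pattern above can be exercised with `PREPARE`/`EXECUTE`; the parameter values here are made up for illustration.
```sql
PREPARE next_page(timestamptz, int) AS
    SELECT id, field_1, created_at
    FROM entity_1
    WHERE status = 'active'
      AND created_at < $1
    ORDER BY created_at DESC
    LIMIT $2;

-- $1 = created_at of the last row on the previous page, $2 = page size
EXECUTE next_page('2025-01-15 00:00:00+00', 25);
```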
### Performance Tuning
#### Database Configuration
```ini
# PostgreSQL performance tuning (postgresql.conf)
# Absolute values below assume a dedicated 32GB server; scale to your hardware
shared_buffers = 8GB                  # ~25% of RAM
effective_cache_size = 24GB           # ~75% of RAM
maintenance_work_mem = 256MB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1                # SSD storage
effective_io_concurrency = 200        # SSD storage
work_mem = 4MB                        # per sort/hash operation, per query node
max_connections = 200
```
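After editing `postgresql.conf` (or using `ALTER SYSTEM`), it is worth verifying what the server actually applied, for example:
```sql
-- Confirm effective values and whether a restart is still pending
SELECT name, setting, unit, pending_restart
FROM pg_settings
WHERE name IN ('shared_buffers', 'effective_cache_size', 'work_mem',
               'random_page_cost', 'max_connections');
```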
#### Query Performance Analysis
```sql
-- Enable logging for queries slower than 100 ms, then reload the config to apply it
ALTER SYSTEM SET log_min_duration_statement = 100;
SELECT pg_reload_conf();
-- Analyze query performance
EXPLAIN (ANALYZE, BUFFERS)
SELECT ... FROM entity_1 WHERE ...;
-- Heuristic for index candidates: high-cardinality, weakly correlated columns
-- (verify against actual query patterns before adding indexes)
SELECT schemaname, tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'public'
  AND n_distinct > 100
  AND correlation < 0.1
ORDER BY n_distinct DESC;
-- Monitor table bloat (pg_stat_user_tables exposes relname, not tablename)
SELECT
    schemaname,
    relname AS tablename,
    pg_size_pretty(pg_total_relation_size(relid)) AS size,
    n_live_tup,
    n_dead_tup,
    round(n_dead_tup::numeric / NULLIF(n_live_tup, 0), 3) AS dead_ratio
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
```
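Beyond per-query `EXPLAIN`, the `pg_stat_statements` extension aggregates statistics across all executions. A typical "slowest statements" query follows; the column names assume PostgreSQL 13+, where `mean_exec_time` replaced `mean_time`.
```sql
-- Requires pg_stat_statements in shared_preload_libraries
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

SELECT query, calls, mean_exec_time, rows
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```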
### Data Migration Strategy
```sql
-- Zero-downtime migration approach
-- Step 1: Add new column without NOT NULL
ALTER TABLE entity_1 ADD COLUMN new_field VARCHAR(100);
-- Step 2: Backfill data in batches (transaction control inside DO blocks
-- requires PostgreSQL 11+ and must run outside an explicit transaction)
DO $$
DECLARE
    batch_size INTEGER := 1000;
    rows_updated INTEGER := 0;
BEGIN
    LOOP
        WITH batch AS (
            SELECT id FROM entity_1
            WHERE new_field IS NULL
            LIMIT batch_size
            FOR UPDATE SKIP LOCKED
        )
        UPDATE entity_1
        SET new_field = 'default_value'
        WHERE id IN (SELECT id FROM batch);
        GET DIAGNOSTICS rows_updated = ROW_COUNT;
        EXIT WHEN rows_updated = 0;
        COMMIT;                -- release locks before pausing
        PERFORM pg_sleep(0.1); -- throttle to limit I/O and replication pressure
    END LOOP;
END $$;
-- Step 3: Enforce NOT NULL after backfill
-- (SET NOT NULL takes an exclusive lock while scanning the table;
-- see the NOT VALID constraint variant after this block for very large tables)
ALTER TABLE entity_1 ALTER COLUMN new_field SET NOT NULL;
```
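On very large tables, the scan in Step 3 can be avoided on PostgreSQL 12+ by validating a `CHECK` constraint first; a sketch of that variant (the constraint name is illustrative):
```sql
-- Add the constraint without validating existing rows (brief lock only)
ALTER TABLE entity_1 ADD CONSTRAINT chk_new_field_not_null
    CHECK (new_field IS NOT NULL) NOT VALID;
-- Validate with a weaker lock that allows concurrent reads and writes
ALTER TABLE entity_1 VALIDATE CONSTRAINT chk_new_field_not_null;
-- SET NOT NULL now reuses the validated constraint instead of rescanning
ALTER TABLE entity_1 ALTER COLUMN new_field SET NOT NULL;
ALTER TABLE entity_1 DROP CONSTRAINT chk_new_field_not_null;
```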
### Backup & Recovery
```bash
#!/bin/bash
# Automated backup script
BACKUP_DIR="/backups/postgres"
DB_NAME="production"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Full backup in compressed custom format
pg_dump -Fc -d "$DB_NAME" -f "$BACKUP_DIR/backup_$TIMESTAMP.dump"

# Point-in-time recovery requires these settings in postgresql.conf:
#   archive_mode = on
#   archive_command = 'test ! -f /archive/%f && cp %p /archive/%f'
#   wal_level = replica

# Restore procedure (run manually against a prepared database):
# pg_restore -d "$DB_NAME" -j 4 backup_file.dump
```
### Monitoring & Maintenance
```sql
-- Key metrics to monitor
CREATE VIEW database_health AS
SELECT
    (SELECT count(*) FROM pg_stat_activity) AS total_connections,
    (SELECT count(*) FROM pg_stat_activity WHERE state = 'active') AS active_queries,
    (SELECT max(now() - query_start) FROM pg_stat_activity WHERE state = 'active') AS longest_query,
    (SELECT pg_database_size(current_database())) AS database_size,
    (SELECT count(*) FROM pg_stat_user_tables WHERE n_dead_tup > 1000) AS tables_need_vacuum;
-- Automated maintenance tasks
-- pg_cron must be listed in shared_preload_libraries before this works
CREATE EXTENSION IF NOT EXISTS pg_cron;
-- Daily vacuum analyze
SELECT cron.schedule('vacuum-analyze', '0 2 * * *', 'VACUUM ANALYZE;');
-- Weekly reindex (CONCURRENTLY, PostgreSQL 12+, avoids blocking writes)
SELECT cron.schedule('reindex', '0 3 * * 0', 'REINDEX DATABASE CONCURRENTLY production;');
-- Monthly partition cleanup (cleanup_old_partitions is sketched in the partitioning section)
SELECT cron.schedule('partition-cleanup', '0 4 1 * *', 'CALL cleanup_old_partitions();');
```
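pg_cron keeps its schedule and run history in metadata tables, which makes it easy to confirm the jobs above are actually firing:
```sql
-- Registered jobs
SELECT jobid, schedule, command, active FROM cron.job;

-- Recent executions and their outcomes
SELECT jobid, status, return_message, start_time
FROM cron.job_run_details
ORDER BY start_time DESC
LIMIT 20;
```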
### Security Best Practices
- Use role-based access control (RBAC)
- Implement row-level security (RLS) for multi-tenant data (see the sketch after this list)
- Encrypt sensitive data at rest and in transit
- Regular security audits
- Implement connection pooling (PgBouncer)
- Use prepared statements to prevent SQL injection
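A minimal sketch of row-level security for multi-tenant isolation; the `tenant_id` column and the `app.tenant_id` session setting are assumptions for illustration, not part of the schema above.
```sql
-- Illustrative only: assumes a tenant_id column on entity_1
ALTER TABLE entity_1 ADD COLUMN tenant_id UUID;
ALTER TABLE entity_1 ENABLE ROW LEVEL SECURITY;

-- Each session sees only its own tenant's rows; current_setting(..., true)
-- returns NULL (matching nothing) when the setting is absent
CREATE POLICY tenant_isolation ON entity_1
    USING (tenant_id = current_setting('app.tenant_id', true)::uuid);

-- The application sets the tenant per connection or transaction:
-- SET app.tenant_id = '00000000-0000-0000-0000-000000000000';

-- Note: table owners bypass RLS unless FORCE ROW LEVEL SECURITY is enabled
```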
### Scalability Roadmap
1. **Phase 1** (0-100K records): Single primary database
2. **Phase 2** (100K-1M records): Add read replicas
3. **Phase 3** (1M-10M records): Implement caching layer
4. **Phase 4** (10M+ records): Horizontal sharding
### Template Variables
| Variable | Required | Description | Example |
|----------|----------|-------------|---------|
| DESCRIBE_APPLICATION | Required | Type and purpose of the application | E-commerce platform, SaaS application, Analytics system |
| LIST_MAIN_ENTITIES | Required | Core entities/tables needed | Users, Orders, Products, Payments |