Indexes & Performance

Phần này bao gồm tất cả kiến thức về Index và tối ưu hiệu suất trong SQL Server - từ cơ bản đến nâng cao.

Các chủ đề

Index Cơ Bản: Clustered/Non-Clustered Index, B-tree, SARGable predicates, chọn clustered key, covering index, fragmentation, index maintenance
Index Nâng Cao: Filtered Index, Columnstore Index, Full-Text, XML, Spatial, Unique, Composite, In-Memory OLTP Indexes
Query Optimization: Query Pipeline, Cost-Based Optimizer, Anti-Patterns, JOIN Algorithms, Parameter Sniffing, TempDB, Parallelism, DMVs, Extended Events
Execution Plans: Đọc và phân tích execution plans, các operators phổ biến

Q&A - Phỏng vấn

Junior Level

Q1: Index là gì và tại sao cần dùng index?

A: Index là cấu trúc dữ liệu bổ sung (B-tree) giúp SQL Server tìm kiếm dữ liệu nhanh hơn mà không cần đọc toàn bộ bảng (Table Scan). Không có index, SQL Server phải đọc tất cả các trang dữ liệu. Với index, SQL Server đi theo B-tree từ root đến leaf để tìm đúng records - tương tự mục lục của cuốn sách.

Q2: Clustered Index và Non-Clustered Index khác nhau thế nào?

Clustered Index: quyết định thứ tự vật lý lưu trữ dữ liệu trong bảng. Leaf pages chứa chính data rows. Mỗi bảng chỉ có 1 clustered index.
Non-Clustered Index: cấu trúc B-tree riêng biệt với bảng. Leaf pages chứa key columns + row locator (clustered key hoặc RID nếu bảng là heap). Mỗi bảng có thể có đến 999 non-clustered indexes.

Q3: Khi nào xảy ra Table Scan và khi nào Index Seek?

A: Table Scan xảy ra khi không có index phù hợp, predicate không SARGable, hoặc optimizer ước tính phần lớn bảng sẽ được đọc (thường > 5-30%). Index Seek xảy ra khi có SARGable predicate matching leading key column của index. SARGable có nghĩa là cột không bị bao bọc trong function hay biểu thức.

Q4: Covering Index là gì?

A: Covering Index là index chứa tất cả các cột mà một query cần - cả cột dùng để filter (WHERE) và cột cần lấy (SELECT). Khi query được “covered” bởi index, SQL Server không cần quay lại bảng gốc để lấy thêm dữ liệu (tránh Key Lookup), giúp giảm I/O đáng kể.

Q5: UNION và UNION ALL khác nhau thế nào?

A: UNION loại bỏ duplicate rows (thực hiện DISTINCT ngầm, tốn thêm sort/hash). UNION ALL giữ tất cả rows kể cả duplicate và nhanh hơn. Dùng UNION ALL khi biết chắc không có duplicate hoặc không cần loại bỏ chúng để tối ưu hiệu suất.

Q6: Fill Factor là gì? Giá trị nên chọn là bao nhiêu?

A: Fill Factor là % không gian lấp đầy trên mỗi leaf page khi tạo/rebuild index. Để lại không gian trống giúp giảm page split khi INSERT. Giá trị phổ biến:

100 (0): read-only tables
80-90: mixed workload bình thường
60-70: heavy insert/update vào giữa index

Q7: Heap là gì trong SQL Server?

A: Heap là bảng không có clustered index - data pages không được tổ chức theo thứ tự. SQL Server dùng IAM (Index Allocation Map) pages để tìm data pages. Heap tốt khi insert nhiều và không cần range scan. Non-clustered index trên Heap dùng RID (Row ID = file + page + slot) làm row locator thay vì clustered key.

Q8: Fragmentation là gì? Khi nào REBUILD vs REORGANIZE?

A: Fragmentation xảy ra khi pages không còn liên tục về mặt vật lý sau nhiều INSERT/UPDATE/DELETE. Ảnh hưởng: sequential I/O trở thành random I/O, chậm hơn.

REORGANIZE: fragmentation 10-30%, online (không block reads/writes), ít log hơn
REBUILD: fragmentation > 30%, có thể offline hoặc ONLINE (Enterprise), reset fill factor, update statistics đầy đủ

Mid Level

Q9: Tại sao không nên tạo quá nhiều indexes trên một bảng?

A: Mỗi index tốn thêm write overhead: khi INSERT/UPDATE/DELETE một hàng, SQL Server phải cập nhật tất cả indexes liên quan. Ngoài ra, nhiều indexes tốn storage và RAM (buffer pool). Index cũng cần maintenance (REBUILD/REORGANIZE). Rule of thumb: OLTP tables thường không nên quá 5-7 indexes, DWH/fact tables có thể nhiều hơn.

Q10: SARGable predicate là gì? Cho ví dụ non-SARGable và cách sửa?

A: SARGable (Search ARGument ABLE) là predicate mà SQL Server có thể dùng để Index Seek. Predicate trở thành non-SARGable khi cột nằm bên trong function hoặc biểu thức.

-- ❌ Non-SARGable
WHERE YEAR(OrderDate) = 2024
WHERE UPPER(LastName) = 'NGUYEN'

-- ✅ SARGable alternatives
WHERE OrderDate >= '2024-01-01' AND OrderDate < '2025-01-01'
WHERE LastName = 'Nguyen'   -- CI collation tự xử lý case-insensitive

Q11: Parameter Sniffing là gì? Khi nào nó gây vấn đề và cách giải quyết?

A: Parameter Sniffing xảy ra khi SQL Server compile stored procedure lần đầu với một parameter value cụ thể và cache plan đó. Nếu distribution data không đồng đều, plan tốt cho một giá trị có thể rất tệ cho giá trị khác.

Giải pháp:

OPTION (RECOMPILE) - compile lại mỗi lần (mất cache benefit)
OPTION (OPTIMIZE FOR UNKNOWN) - dùng average statistics
Local variable trick - DECLARE @Local = @Param
Query Store - detect regression và force good plan

Q12: Composite Index: thứ tự cột ảnh hưởng thế nào?

A: SQL Server chỉ có thể seek trên composite index nếu bắt đầu từ leading column. Index (A, B, C) hỗ trợ seek trên A, A+B, A+B+C nhưng không seek được trên chỉ B hay C. Nguyên tắc: equality columns trước, range column sau cùng, selectivity cao ưu tiên trước.

Q13: Khi nào dùng Filtered Index?

A: Filtered Index hữu ích khi:

Sparse data: cột có nhiều NULL (chỉ index hàng không NULL)
Soft deletes: WHERE IsDeleted = 0 - chỉ index active records
Status filtering: chỉ query một trạng thái cụ thể thường xuyên
Partial uniqueness: unique email chỉ trong active users

Filtered Index nhỏ hơn full index → ít RAM, ít I/O, ít maintenance. Nhưng chỉ được dùng khi query predicate tương thích với filter predicate của index.

Q14: Hash Join vs Nested Loops vs Merge Join - khi nào optimizer chọn cái nào?

Nested Loops: outer table nhỏ, inner table có index trên join key. Tốt cho OLTP lookups.
Hash Join: cả hai inputs lớn, không có index phù hợp. Tốn memory (có thể spill TempDB).
Merge Join: cả hai inputs đã sorted theo join key (từ index). Rất hiệu quả nhưng cần sorted input.

Optimizer chọn dựa trên estimated row counts và available indexes. Nếu estimates sai (statistics lỗi thời), wrong algorithm có thể được chọn.

Q15: Implicit conversion là gì và tại sao gây vấn đề hiệu suất?

A: Implicit conversion xảy ra khi SQL Server tự động convert kiểu dữ liệu để so sánh. Ví dụ: cột NVARCHAR so sánh với VARCHAR parameter → SQL Server phải convert toàn bộ cột sang VARCHAR, làm mất khả năng dùng index. Giải pháp: đảm bảo parameter và column có cùng kiểu dữ liệu, dùng N'' prefix cho Unicode strings.

Q16: Covering Index và Key Lookup liên quan thế nào?

A: Key Lookup xảy ra khi SQL Server tìm rows qua Non-Clustered Index nhưng cần thêm cột không có trong index → phải quay về Clustered Index/Heap để lấy. Đây là Bookmark Lookup. Mỗi Key Lookup là một round-trip tốn kém. Covering Index loại bỏ Key Lookup bằng cách thêm các cột cần thiết vào INCLUDE.

Q17: Columnstore Index phù hợp cho workload nào?

A: Columnstore Index tối ưu cho OLAP/analytical workloads: aggregations (SUM, AVG, COUNT), GROUP BY, wide table scans. Lý do: compression tốt hơn (5-10x), batch mode execution, column pruning (chỉ đọc cột cần). Không phù hợp cho OLTP với nhiều point lookups và single-row updates/inserts (delta store overhead).

Senior Level

Q18: Giải thích cơ chế Plan Cache và khi nào plan bị evict hoặc recompile?

A: SQL Server lưu execution plans trong Plan Cache (một phần của buffer pool). Plan được identify bằng hash của query text. Khi query mới match hash, dùng lại plan (avoid expensive compilation).

Plan bị evict khi:

Memory pressure (cần RAM cho dữ liệu)
DBCC FREEPROCCACHE
Đủ điều kiện memory pressure eviction (least recently used)

Plan bị recompile khi:

Schema change (ALTER TABLE, CREATE INDEX)
Statistics change (threshold: 20% rows + 500 modified)
SET options thay đổi
sp_recompile hoặc OPTION (RECOMPILE) hint

Q19: Giải thích Columnstore Delta Store và Tuple Mover hoạt động thế nào?

A: Khi INSERT vào Columnstore table:

Rows mới đi vào Delta Store (row-store B-tree nhỏ trong TempDB/data file)
Delta Store tích lũy đến ~1 triệu rows
Tuple Mover (background process) compress và chuyển Delta Store → Column Segments (read-only, highly compressed)
Deleted rows được đánh dấu trong Delete Bitmap (không xóa ngay khỏi segments)

Khi REBUILD Columnstore Index → Delta Stores và Delete Bitmaps được flush, tất cả merge thành segments mới, reclaim không gian.

Q20: Wait statistics là gì? Cách dùng để diagnose performance vấn đề?

A: SQL Server track thời gian mỗi task phải chờ resource (sys.dm_os_wait_stats). Phân tích top waits giúp xác định bottleneck:

PAGEIOLATCH: I/O bottleneck → thêm RAM, optimize queries, faster disk
LCK_M_X: lock contention → shorten transactions, optimize isolation level
CXPACKET: parallelism overhead → tăng cost threshold, giảm MAXDOP
ASYNCH_NETWORK_IO: client chậm đọc → reduce result set, pagination
SOS_SCHEDULER_YIELD: CPU pressure → optimize CPU-heavy queries, thêm CPU

Approach: reset waits (DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR)), run workload 30 phút, check top waits lại → targeted investigation.

Q21: Hash Join Spill là gì và làm thế nào để tránh?

A: Hash Join Spill xảy ra khi SQL Server không có đủ memory để giữ hash table trong RAM → phải spill sang TempDB (disk). Điều này tăng I/O dramatically và làm query chậm.

Nguyên nhân:

Cardinality estimate quá thấp (statistics lỗi thời) → cấp ít memory grant
Query thực sự xử lý dữ liệu rất lớn

Giải pháp:

Update statistics
Thêm/sửa indexes để reduce rows trước khi join
Tăng memory grant hint: OPTION (MIN_GRANT_PERCENT = 50)
Tăng server memory
Dùng Query Store để detect và force better plan

Q22: Query Store có gì tốt hơn Extended Events để track performance?

A: Query Store (SQL 2016+) tự động persist query text, plans và runtime stats vào database catalog - không cần setup trace session. Cho phép:

So sánh plan thay đổi theo thời gian (plan regression detection)
Force plan cụ thể khi có regression
Identify “regressed queries” tự động (SQL 2022 Intelligent Query Processing)
View qua SSMS GUI hoặc sys.query_store_* views

Extended Events tốt hơn khi cần real-time capture, filter chi tiết (theo database, user, host), capture events không có trong Query Store (locking, deadlock, etc.).

Q23: In-Memory OLTP indexes khác traditional indexes thế nào?

Hash Index: dùng cho equality lookups, O(1), không sort. PHẢI ước tính BUCKET_COUNT chính xác (too low → collision, too high → wasted memory). Không hỗ trợ range, ORDER BY.
Nonclustered Range Index: Bw-tree (lock-free B-tree variant). Hỗ trợ range và ORDER BY. O(log N).
Không có clustered index: Memory-optimized tables không có clustered index concept. Data là heap trong memory, access qua indexes.
Lock-free: All indexes dùng optimistic concurrency, MVCC - không có page latches hay traditional locks.

Q24: Giải thích MAXDOP và Cost Threshold for Parallelism, nên set về giá trị bao nhiêu?

MAXDOP (Max Degree of Parallelism): số CPU threads tối đa cho một query. Default 0 = tất cả CPUs.
Cost Threshold for Parallelism: estimated cost threshold để SQL Server mới xem xét parallel plan. Default 5 (rất thấp, gây nhiều small queries đi parallel không cần thiết).

Recommendations:

Cost Threshold: tăng lên 25-50 cho OLTP workloads (giảm unnecessary parallelism overhead)
MAXDOP OLTP: = số cores per NUMA node hoặc 8 (lấy số nhỏ hơn)
MAXDOP DWH: 0 hoặc số cores per socket

Từ SQL Server 2022: ALTER DATABASE SCOPED CONFIGURATION cho phép set per-database.

Q25: Tại sao NOT IN nguy hiểm khi có NULL? Giải thích bằng SQL logic ba giá trị?

A: SQL dùng three-valued logic: TRUE, FALSE, và UNKNOWN. Khi so sánh với NULL, kết quả là UNKNOWN (không phải FALSE).

x NOT IN (1, 2, NULL) được evaluate như: NOT (x = 1 OR x = 2 OR x = NULL) = NOT (x = 1 OR x = 2 OR UNKNOWN) = NOT (UNKNOWN) (khi x ≠ 1 và x ≠ 2) = UNKNOWN

WHERE chỉ cho qua rows với UNKNOWN = TRUE. UNKNOWN rows bị loại. Kết quả: không hàng nào được trả về dù logic bạn muốn là “x không có trong danh sách”. Luôn dùng NOT EXISTS thay thế.

Q26: Giải thích cơ chế Statistics trong SQL Server và khi nào bị outdated?

A: Statistics là histogram mô tả phân phối dữ liệu của một column. SQL Server dùng statistics để ước tính cardinality (số rows mỗi operator trả về), từ đó chọn join algorithm, index, và memory grant.

Statistics tự động update khi:

20% của rows trong bảng bị modify (+ 500 rows cho bảng nhỏ)
Với trace flag 2371 hoặc SQL 2016+: dynamic threshold cho bảng lớn

Statistics có thể không đủ chính xác khi:

Data distribution lệch (skewed) - histogram chỉ có 200 steps
Nhiều NULL values
Correlated columns (stats độc lập nhau, không track correlation)
Ascending key problem (new values ngoài histogram range)

Giải pháp: UPDATE STATISTICS ... WITH FULLSCAN, enable Auto Update Statistics Async.

Q27: Concurrent index rebuild có thể làm gì? Hạn chế là gì?

A: ALTER INDEX ... REBUILD WITH (ONLINE = ON) (Enterprise Edition) cho phép rebuild không block reads/writes:

Dùng row versioning để maintain cả hai copies trong quá trình rebuild
Schema Modification Lock (SCH-M) chỉ lấy ở đầu và cuối (rất ngắn)
Tốn thêm disk space (phải giữ cả old và new index cùng lúc)
Chậm hơn offline rebuild
Không available trên Standard Edition
Không available cho Columnstore indexes (dùng REORGANIZE hoặc offline)

Q28: Giải thích sự khác biệt giữa Seek Predicate và Predicate trong execution plan?

Seek Predicate: phần predicate SQL Server dùng để navigate B-tree (true seek). Rất hiệu quả - chỉ đọc relevant leaf pages.
Predicate (Residual predicate): phần predicate được apply sau khi đã seek đến leaf pages, để filter thêm các rows đọc được. Kém hiệu quả hơn vì phải đọc rồi mới filter.

Ví dụ với Index (CustomerId, Status):

WHERE CustomerId = 100 AND Status = 'Active' AND TotalAmount > 500
-- Seek Predicate: CustomerId = 100 AND Status = 'Active' (cả hai key columns)
-- Residual Predicate: TotalAmount > 500 (không phải key column)

Mục tiêu tối ưu: đưa càng nhiều predicate vào Seek Predicate càng tốt (thêm cột vào index key hoặc adjust index).

Q29: Khi nào nên dùng Table Variable thay Temp Table và ngược lại?

A: Table Variable tốt hơn khi:

< 100 rows (thường)
Trong function (bắt buộc dùng table variable)
Cần transactions riêng (không bị rollback khi outer TX rollback)
Đơn giản, không cần indexes phức tạp

Temp Table tốt hơn khi:

1000 rows (statistics giúp optimizer)
Cần explicit indexes
JOIN phức tạp với temp data (optimizer có statistics ước tính đúng hơn)
Cần dùng trong nhiều batches/procedures

Vấn đề chính của Table Variable: optimizer luôn ước tính 1 row (SQL 2017 trở lên có Deferred Compilation cải thiện phần nào). Điều này gây wrong join algorithm nếu thực tế có nhiều rows.

Q30: Làm thế nào phân tích và fix một query đột ngột chạy chậm trên production?

A: Quy trình điều tra:

Query Store (SQL 2016+): kiểm tra “Regressed Queries” trong SSMS hoặc:

SELECT q.query_id, qt.query_sql_text, rs.avg_duration, p.plan_id
FROM sys.query_store_query q
INNER JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
INNER JOIN sys.query_store_plan p ON q.query_id = p.query_id
INNER JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
WHERE rs.last_execution_time > DATEADD(HOUR, -2, GETUTCDATE())
ORDER BY rs.avg_duration DESC;

So sánh plans: xem plan mới vs plan cũ có gì khác (join algorithm thay đổi, index mới/bị xóa)
Check statistics: sys.dm_db_stats_properties - xem modification_counter cao → update statistics
Check blocking/locks: sys.dm_exec_requests với blocking_session_id
Check execution plan hiện tại: sys.dm_exec_query_plan - có key lookup không cần thiết, bad join không?
Quick fix: EXEC sp_recompile 'ProcName' hoặc force plan qua Query Store
Root cause: update statistics, thêm/sửa index, fix parameter sniffing, hoặc code fix