While overhauling many of the queries and functions from a PostgreSQL 8.3 database, I noticed different elapsed times with minor changes to code structure and decided to benchmark my observations. Simple modifications such as combining separate IF
and ELSEIF
expressions into one expression, or replacing them entirely with conditional expressions resulted in surprising performance boosts.
This very primitive benchmark function will determine the number of rows in which the first column (a) is the lowest number, the second column (b) contains the second-lowest number, and the middle column (c) is lower than both columns (d) and (e). It is not efficient and is not supposed to be, but it should be enough to provide a basis for more extensive formal testing, if needed.
There are much faster ways to do this; for instance, you could simply have a simple SELECT statement:
SELECT COUNT(*) FROM foo WHERE a <= b AND b <= c AND c <= d AND c <= e;
count: 17994
Time: 209.326 ms
Or even slightly faster:
SELECT COUNT(*) FROM foo WHERE b BETWEEN a AND c AND c <= d AND c <= e;
Time: 196.504 ms
Again, the purpose of this exercise is to test the impact of code decisions over millions of operations inside Postgres 8.3 functions, not to optimize a query for the fastest result, or for comparing features, functions, or performance in Postgres 9.x, or higher.
Database Setup
Start by generating a simple one-million-row table with five columns of random integers from 0 to 100:
CREATE TABLE foo(a int, b int, c int, d int, e int);
INSERT INTO foo SELECT random()*100, random()*100, random()*100, random()*100, random()*100 FROM generate_series(1, 1000000);
Create a simple function to mimic the min() or least() function available in many languages:
CREATE OR REPLACE FUNCTION foo_min(integer[]) RETURNS integer AS $$
DECLARE result int; i int;
BEGIN
FOR i IN 1 .. array_upper($1, 1) LOOP
IF result IS NULL THEN result := $1[i];
ELSEIF $1[i] < result THEN result := $1[i];
END IF;
END LOOP;
RETURN result;
END;
$$ LANGUAGE plpgsql IMMUTABLE;
Unless specifically noted, new code snippets for this function will only replace the code between the BEGIN
and END
statements. Similarly, unless noted, the SELECT
statement for additional benchmarks is the same as the initial benchmark below. The test queries return identical counts (unless noted), so the count will be generally omitted after the initial benchmark. Average times in benchmarks are the truncated mean of several groups of elapsed times in order to ensure reasonable consistency; fractional milliseconds are omitted from the average. All benchmarks are performed on the same dedicated hardware and software, and with no significant changes to the underlying database. All benchmarks were performed manually using the psql shell.
Get a starting benchmark.
SELECT COUNT(*) FROM foo WHERE a = foo_min(ARRAY[a,b,c,d,e]) AND b = foo_min(ARRAY[b,c,d,e]) AND c = foo_min(ARRAY[e,d,c]);
count = 17994
Time: 6958.124 ms (Average: 6959 ms)
The third call to the foo_min function in the above benchmark statement specifies the array elements in order of (e), (d), and (c). While I would have preferred presenting it in alphabetical order (like the first two calls), a bug within the third-party plugin that displays formatted code required rearranging the fields. The re-ordering did not impact metrics.
Wow. Seven seconds. Pretty bad — as intended. Let’s try making some changes.
TEST: Reduce IF Expressions
Consolidate the two IF
and ELSEIF
expressions into one. Average results were unexpectedly 13.8% faster. That’s quite a bit of an improvement for such a simple change.
FOR i IN 1 .. array_upper($1, 1) LOOP
IF result IS NULL OR $1[i] < result THEN result := $1[i];
END IF;
END LOOP;
RETURN result;
Time: 5984.493 ms (Average: 5996 ms)
TEST: Replace IF with a Conditional Expression
Eliminate the IF/THEN/END IF
expression and replace it with CASE WHEN/THEN/ELSE/END
. Average results were 22% faster than the original benchmark. This is somewhat expected as SQL statements are generally faster than their PL/pgSQL equivalents.
FOR i IN 1 .. array_upper($1, 1) LOOP
result := CASE WHEN $1[i] < result THEN $1[i] ELSE COALESCE(result, $1[i]) END;
END LOOP;
RETURN result;
Time: 5436.668 ms (Average: 5418 ms)
TEST: Use a Temporary Variable for Element Value
Since there were several references to $1[i]
, I thought I’d check the impact of assigning the value to a new local variable (e
) and reference that instead. The full function code snippet is included below due to the additional variable declaration above the BEGIN
statement. Results were rather poor, 21% slower than the previous benchmark, and only 1.4% faster than the initial benchmark — too close to be a real differentiation.
CREATE OR REPLACE FUNCTION foo_min(integer[]) RETURNS integer AS $$
DECLARE result int; i int; e int;
BEGIN
FOR i IN 1 .. array_upper($1, 1) LOOP
e = $1[i];
result := CASE WHEN e < result THEN e ELSE COALESCE(result, e) END;
END LOOP;
RETURN result;
END;
$$ LANGUAGE plpgsql IMMUTABLE;
Time: 6858.071 ms (Average: 6859 ms)
TEST: Use Aliases for Function Parameters
Since the temporary variable was a bust but I’d still like to avoid the $n
parameters, let’s try the CASE WHEN
with a parameter alias for increased readability. The full function code snippet is included below due to the altered method signature. Results were virtually identical to the CASE WHEN
example, the difference too negligible (roughly 0.5%) to consider an improvement.
CREATE OR REPLACE FUNCTION foo_min(a integer[]) RETURNS integer AS $$
DECLARE result int; i int;
BEGIN
FOR i IN 1 .. array_upper(a, 1) LOOP
result := CASE WHEN a[i] < result THEN a[i] ELSE COALESCE(result, a[i]) END;
END LOOP;
RETURN result;
END;
$$ LANGUAGE plpgsql IMMUTABLE;
Time: 5436.733 ms (Average 5390 ms)
TEST: Designate the Function as VOLATILE
Reuse the initial benchmark function and omit the last word, IMMUTABLE
, so that the last line matches the following:
$$ LANGUAGE plpgsql;
An IMMUTABLE
function cannot modify the database and is guaranteed to return the same results given the same arguments forever. If the CREATE FUNCTION
command does not specify a volatility category, a default of VOLATILE
is used. When IMMUTABLE
is omitted from this function that would otherwise benefit from pre-evaluation by the query optimizer, performance drops dramatically, over 45% slower than the benchmark.
Time: 10141.483 ms (Average 10136 ms)
TEST: Use SQL instead of PL/pgSQL
For this test, a different paradigm, a completely different function, and a new initial benchmark are needed. The function in the initial language benchmark assumes the array will always have five non-NULL integer values, and there is no validation or error handling. The same database is used in these tests, but since this is not an “apples to apples” comparison, do not compare results for these two language-specific benchmarks against any of the benchmarks above.
The new, rather ugly and non-readable function for initial benchmarking with PL/pgSQL:
CREATE OR REPLACE FUNCTION foo_min(integer[]) RETURNS integer AS $$
BEGIN
RETURN (SELECT CASE WHEN $1[1]<$1[2] AND $1[1]<$1[3] AND $1[1]<$1[4] AND $1[1]<$1[5] THEN $1[1] WHEN $1[2]<$1[3] AND $1[2]<$1[4] AND $1[2]<$1[5] THEN $1[2] WHEN $1[3]<$1[4] AND $1[3]<$1[5] THEN $1[3] WHEN $1[4]<$1[5] THEN $1[4] ELSE $1[5] END);
END;
$$ LANGUAGE plpgsql IMMUTABLE;
The new SQL query that assumes exactly five input values, and the results for the PL/pgSQL function:
SELECT COUNT(*) FROM foo WHERE a = foo_min(ARRAY[a,b,c,d,e]);
count = 204543
Time: 23380.717 ms (Average 23505 ms)
Yikes. For the heck of it, I modified the PL/pgSQL version of the function above to use a parameter alias in the hope that it might somehow improve performance. Nope. The result was even worse — over 16% slower, averaging 28 seconds in duration.
The pure SQL equivalent (simply created by removing the BEGIN
, END
and RETURN
statements, and changing plpgsql
to sql
) was almost six times (568.9%) faster:
CREATE OR REPLACE FUNCTION foo_min(integer[]) RETURNS integer AS $$
SELECT CASE WHEN $1[1]<$1[2] AND $1[1]<$1[3] AND $1[1]<$1[4] AND $1[1]<$1[5] THEN $1[1] WHEN $1[2]<$1[3] AND $1[2]<$1[4] AND $1[2]<$1[5] THEN $1[2] WHEN $1[3]<$1[4] AND $1[3]<$1[5] THEN $1[3] WHEN $1[4]<$1[5] THEN $1[4] ELSE $1[5] END;
$$ LANGUAGE sql IMMUTABLE;
Time: 4171.685 ms (Average 4132 ms)
Conclusions
Significant performance improvements can sometimes be made with small, subtle changes. Use the IMMUTABLE
volatility category where appropriate (and ONLY where appropriate), use the closest-to-native query language that can obtain desired results, and rethink the use of control structures where practical. Use function parameter aliases when benchmarking tests prove they have no impact, but avoid the unnecessary declaration and assignment of ephemeral variables whenever possible.
Recent Comments