Pythian Blog: Technical Track

Oracle UTF8 Encoding and String Manipulation Overhead

For one of our customers, I recently reviewed a strategy for migrating from a single-byte encoding to the variable-length multi-byte UTF8 (AL32UTF8 in Oracle naming). These types of projects come up again and again, so I think they must be common for many of you and this bit might be useful. I'm also interested in your experience: perhaps you can run this simple simulation on your platforms and post the results in the comments?

Back to the project… One area was estimation of the string manipulation overhead. Based on Pythian experience as well as the feedback from my network, I could conclude that nobody has observed any measurable performance degradation or significant CPU usage growth clearly attributed to UTF8 migration.

Thus, I decided to simulate at least some operations and measure them. One of the concerns was sub-string extraction. With a fixed-length encoding, extracting a sub-string is super easy to implement because the offset and length in bytes are known up front. A variable-length character set requires scanning the string from the very beginning, because the byte offset and byte length are not known until the string has been traversed character by character.
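To make the character-versus-byte distinction concrete, here is a minimal sketch (not part of the original test) that compares character and byte positions in a short string. It assumes the database character set can represent U+00E9 (é), which is true for both WE8ISO8859P1 and AL32UTF8, and that server output is enabled in your client. The variable names are illustrative only.

[sql]
declare
  -- assumption: U+00E9 is representable in the database character set
  e varchar2(1 char) := unistr('\00e9');
  s varchar2(10 char);
begin
  s := 'a' || e || 'b';
  -- character-based functions count characters
  dbms_output.put_line('length  (chars): ' || length(s));
  dbms_output.put_line('instr   of b   : ' || instr(s, 'b'));
  -- byte-based functions count bytes in the database character set
  dbms_output.put_line('lengthb (bytes): ' || lengthb(s));
  dbms_output.put_line('instrb  of b   : ' || instrb(s, 'b'));
end;
[/sql]

On a single-byte database the character and byte figures should match. On AL32UTF8, I would expect the byte figures to be one higher, since é takes two bytes there. That gap is why a character offset cannot be turned into a byte offset without walking the string.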

This is the PL/SQL block I came up with:

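[sql]
declare
  s1 varchar2(4000 char);
  s2 varchar2(4000 char);
  i  integer;  -- not actually used: the FOR loop declares its own index
begin
  s1 := rpad('a', 3999, 'a');
  for i in 1..10000000 loop
    null;  -- case 1: baseline, empty loop
    --s2 := 'a' || 'b' || 'c' || 'd';  -- case 2: concatenation only
    --s2 := substr(s1,3000,1) || substr(s1,3001,1) || substr(s1,3002,1) || substr(s1,3003,1);  -- case 3: substr + concatenation
  end loop;
  dbms_output.put_line(s2);
end;
[/sql]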

There are 3 cases and only one of them needs to be un-commented when running the block. null; is there to just capture the baseline and measure time required to run an empty loop. Concatenation s2 := 'a' || 'b' || 'c' || 'd'; is one scenario and, finally, substring extraction s2 := substr(s1,3000,1) || ... is the second one.

I've run it in VMs on my laptop as well as on the customer's own physical servers to confirm the timing. Below are the results from my virtual machines; the relative differences on the physical servers were pretty much the same. The platform is Linux, and the Oracle versions are 10.2 and 11.1. Note that 11.2 is smarter and optimizes the block with the NULL loop, so you would need to work harder to trick the PL/SQL optimizer.

Character set    null (s)    concat (s)    substr (s)
WE8ISO8859P1     0.5         1.8           8.0
AL32UTF8         0.5         2.2           9.9

The time in the table is the number of seconds the PL/SQL block took to execute. I ran it a few times and averaged the results, but they were very consistent anyway. Also, skip the very first measurement to avoid the impact of block parsing.
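If you want to reproduce the timings, one possible way to capture the elapsed time is sketched below. The post does not say how the original numbers were measured, so using DBMS_UTILITY.GET_TIME (which returns hundredths of a second) is my assumption here, as is the variable name t0. The sketch times the substring case; swap the loop body for the other two cases.

[sql]
declare
  s1 varchar2(4000 char);
  s2 varchar2(4000 char);
  t0 number;  -- hypothetical helper: start time in hundredths of a second
begin
  s1 := rpad('a', 3999, 'a');
  t0 := dbms_utility.get_time;
  for i in 1..10000000 loop
    s2 := substr(s1,3000,1) || substr(s1,3001,1) || substr(s1,3002,1) || substr(s1,3003,1);
  end loop;
  -- convert hundredths of a second to seconds
  dbms_output.put_line('elapsed (s): ' || (dbms_utility.get_time - t0) / 100);
  dbms_output.put_line(s2);
end;
[/sql]

Run it several times in the same session and discard the first run, as described above. In SQL*Plus, server output needs to be enabled (set serveroutput on) to see the result.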

To interpret the results: string concatenation (excluding the 0.5s of looping itself) accounts for 1.3s vs 1.7s, which is roughly 30% UTF8 overhead. The SUBSTR function calls (excluding the loop and the concatenation itself) account for 6.2s vs 7.7s, which is only about 25% overhead.
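The arithmetic behind those percentages can be spelled out as a small block. This is only a sketch that replays the numbers from the table; the constant names are mine.

[sql]
declare
  -- elapsed seconds copied from the table above
  base_null  constant number := 0.5;  -- empty loop, same for both character sets
  we_concat  constant number := 1.8;
  utf_concat constant number := 2.2;
  we_substr  constant number := 8.0;
  utf_substr constant number := 9.9;
begin
  -- concatenation cost = concat run minus the empty-loop baseline
  dbms_output.put_line('concat overhead %: ' ||
    round(((utf_concat - base_null) / (we_concat - base_null) - 1) * 100, 1));  -- 30.8
  -- substr cost = substr run minus the concat run (which already includes the loop)
  dbms_output.put_line('substr overhead %: ' ||
    round(((utf_substr - utf_concat) / (we_substr - we_concat) - 1) * 100, 1)); -- 24.2
end;
[/sql]

So the rounded figures quoted above correspond to about 30.8% for concatenation and about 24.2% for SUBSTR.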

I found the substring overhead rather small; I expected an order-of-magnitude difference, to be honest. However, 30% concatenation overhead seemed a little too much, and I don't see why it should be that high.

Now, while the overhead is rather noticeable, I haven't seen databases that do mostly string manipulation in SQL and PL/SQL. I think that's why, in the big picture, we generally don't see much performance impact from moving to a variable-length character set.

If you have access to databases with fixed-length and variable-length character sets on similar hardware, feel free to post your results below.
