Why is there a performance penalty for nested subroutines in Delphi?
















A static analyzer we use has a report that says:




Subprograms with local subprograms (OPTI7)



This section lists subprograms that themselves have local subprograms.
Especially when these subprograms share local variables, it can have a
negative effect on performance.




This guide says:




Do not use nested routines

Nested routines (routines within other
routines; also known as "local procedures") require some special stack
manipulation so that the variables of the outer routine can be seen by
the inner routine. This results in a good bit of overhead. Instead of
nesting, move the procedure to the unit scoping level and pass the
necessary variables - if necessary by reference (use the var keyword)
- or make the variable global at the unit scope.




We were interested in knowing if we should take this report into consideration when validating our code. The answers to this question suggest that one should profile one's application to see if there is any performance difference, but not much is said about the difference between nested routines and normal subroutines.



What is the actual difference between nested routines and normal routines and how may it cause a performance penalty?



























  • Any code quality tool that encourages you to refactor into global variables should be treated with a severe degree of scepticism. With that said, it's good to see that the compiler, as expected, is not quite so incompetent as this would suggest.

    – J...
    Apr 24 at 14:46

  • @J..., the bit about global variables is not from Peganza.

    – Uli Gerhardt
    Apr 24 at 14:52

  • @UliGerhardt I see that now. I guess the second lesson is: don't trust code quality advice that was written almost two decades ago. Unless, of course, we're meaning to micro-optimize our code for an ancient compiler and the "new" Pentium II and its advanced core features 0_o

    – J...
    Apr 24 at 15:00

  • The differences are never that big, and these kinds of optimizations only count when you are invoking that code very often. In loops of a billion iterations, it could matter, but in a standard business application where you're reading and writing some stuff to files and databases, and now and then have a piece of logic with a nested procedure that is executed for each of the 100 rows you just queried, there is no way you're going to spot the difference. That doesn't mean, though, that you shouldn't refactor them, if only for quality metrics like readability and re-usability.

    – GolezTrol
    Apr 24 at 17:15

  • @GolezTrol: I would actually take it into consideration. But I'm a freak. <g>

    – Rudy Velthuis
    Apr 25 at 12:20

















delphi
















asked Apr 24 at 14:19









afarah








1 Answer














tl;dr



  • There are extra push/pops for nested subroutines

  • Turning on optimizations may strip those away, such that the generated code is the same for both nested subroutines and normal subroutines

  • Inlining results in the same code being generated for both nested and normal subroutines

  • For simple routines with few parameters and local variables, we measured no significant performance difference even with optimizations turned off

I wrote a little test to determine this. GetRTClock returns the current time in nanoseconds; it is just a wrapper around Windows' QueryPerformanceCounter, whose documented resolution is only "<1us":



function subprogram_main(z : Integer) : Int64;
var
  n : Integer;
  s : Int64;

  function subprogram_aux(n, z : Integer) : Integer;
  var
    i : Integer;
  begin
    // Do some useless work on the aux program
    for i := 0 to n - 1 do begin
      if (i > z) then
        z := z + i
      else
        z := z - i;
    end;
    Result := z;
  end;

begin
  s := GetRTClock;

  // Do some minor work on the main program
  n := z div 100 * 100 + 100;
  // Call the aux program
  z := subprogram_aux(n, z);

  Result := GetRTClock - s;
end;

function normal_aux(n, z : Integer) : Integer;
var
  i : Integer;
begin
  // Do some useless work on the aux program
  for i := 0 to n - 1 do begin
    if (i > z) then
      z := z + i
    else
      z := z - i;
  end;
  Result := z;
end;

function normal_main(z : Integer) : Int64;
var
  n : Integer;
  s : Int64;
begin
  s := GetRTClock;

  // Do some minor work on the main program
  n := z div 100 * 100 + 100;
  // Call the aux program
  z := normal_aux(n, z);

  Result := GetRTClock - s;
end;
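
The test relies on GetRTClock, which is not shown. The following is only a guess at what such a helper might look like, assuming (as stated in the comments on this answer) that it wraps Windows' QueryPerformanceCounter and reports nanoseconds. The helper name TicksToNs and the conversion strategy are assumptions, not the original code.

```pascal
program RTClockSketch;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows; // plain "Windows" in older Delphi versions

// Hypothetical helper: converts a tick count to nanoseconds for a
// counter running at Freq ticks per second. The division is split in
// two so that large tick counts do not overflow Int64.
function TicksToNs(Ticks, Freq: Int64): Int64;
begin
  Result := (Ticks div Freq) * 1000000000
          + ((Ticks mod Freq) * 1000000000) div Freq;
end;

// Assumed reconstruction of GetRTClock: reads the performance counter
// and scales it to nanoseconds. The real resolution is whatever the
// counter frequency provides, not necessarily 1 ns.
function GetRTClock: Int64;
var
  Ticks, Freq: Int64;
begin
  QueryPerformanceFrequency(Freq);
  QueryPerformanceCounter(Ticks);
  Result := TicksToNs(Ticks, Freq);
end;

var
  T0: Int64;
begin
  T0 := GetRTClock;
  Sleep(10);
  Writeln('Elapsed ns: ', GetRTClock - T0);
end.
```

With a 10 MHz counter, for instance, one tick corresponds to 100 ns, so differences of a few dozen nanoseconds between two runs are below what such a clock can resolve.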


This compiles to:



subprogram_main:

MyFormU.pas.41: begin
005CE7D0 55 push ebp
005CE7D1 8BEC mov ebp,esp
005CE7D3 83C4E0 add esp,-$20
005CE7D6 8945FC mov [ebp-$04],eax
MyFormU.pas.42: s := GetRTClock;
...
MyFormU.pas.45: n := z div 100 * 100 + 100;
...
MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7F8 55 push ebp
005CE7F9 8B55FC mov edx,[ebp-$04]
005CE7FC 8B45EC mov eax,[ebp-$14]
005CE7FF E880FFFFFF call subprogram_aux
005CE804 59 pop ecx
005CE805 8945FC mov [ebp-$04],eax
MyFormU.pas.49: Result := GetRTClock - s;
...

normal_main:

MyFormU.pas.70: begin
005CE870 55 push ebp
005CE871 8BEC mov ebp,esp
005CE873 83C4E0 add esp,-$20
005CE876 8945FC mov [ebp-$04],eax
MyFormU.pas.71: s := GetRTClock;
...
MyFormU.pas.74: n := z div 100 * 100 + 100;
...
MyFormU.pas.76: z := normal_aux(n, z);
005CE898 8B55FC mov edx,[ebp-$04]
005CE89B 8B45EC mov eax,[ebp-$14]
005CE89E E881FFFFFF call normal_aux
005CE8A3 8945FC mov [ebp-$04],eax
MyFormU.pas.78: Result := GetRTClock - s;
...

subprogram_aux:

MyFormU.pas.31: begin
005CE784 55 push ebp
005CE785 8BEC mov ebp,esp
005CE787 83C4EC add esp,-$14
005CE78A 8955F8 mov [ebp-$08],edx
005CE78D 8945FC mov [ebp-$04],eax
MyFormU.pas.33: for i := 0 to n - 1 do begin
005CE790 8B45FC mov eax,[ebp-$04]
005CE793 48 dec eax
005CE794 85C0 test eax,eax
005CE796 7C29 jl $005ce7c1
005CE798 40 inc eax
005CE799 8945EC mov [ebp-$14],eax
005CE79C C745F000000000 mov [ebp-$10],$00000000
MyFormU.pas.34: if (i > z) then
005CE7A3 8B45F0 mov eax,[ebp-$10]
005CE7A6 3B45F8 cmp eax,[ebp-$08]
005CE7A9 7E08 jle $005ce7b3
MyFormU.pas.35: z := z + i
005CE7AB 8B45F0 mov eax,[ebp-$10]
005CE7AE 0145F8 add [ebp-$08],eax
005CE7B1 EB06 jmp $005ce7b9
MyFormU.pas.37: z := z - i;
005CE7B3 8B45F0 mov eax,[ebp-$10]
005CE7B6 2945F8 sub [ebp-$08],eax

normal_aux:

MyFormU.pas.55: begin
005CE824 55 push ebp
005CE825 8BEC mov ebp,esp
005CE827 83C4EC add esp,-$14
005CE82A 8955F8 mov [ebp-$08],edx
005CE82D 8945FC mov [ebp-$04],eax
MyFormU.pas.57: for i := 0 to n - 1 do begin
005CE830 8B45FC mov eax,[ebp-$04]
005CE833 48 dec eax
005CE834 85C0 test eax,eax
005CE836 7C29 jl $005ce861
005CE838 40 inc eax
005CE839 8945EC mov [ebp-$14],eax
005CE83C C745F000000000 mov [ebp-$10],$00000000
MyFormU.pas.58: if (i > z) then
005CE843 8B45F0 mov eax,[ebp-$10]
005CE846 3B45F8 cmp eax,[ebp-$08]
005CE849 7E08 jle $005ce853
MyFormU.pas.59: z := z + i
005CE84B 8B45F0 mov eax,[ebp-$10]
005CE84E 0145F8 add [ebp-$08],eax
005CE851 EB06 jmp $005ce859
MyFormU.pas.61: z := z - i;
005CE853 8B45F0 mov eax,[ebp-$10]
005CE856 2945F8 sub [ebp-$08],eax


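The only difference is one push and one pop around the call. The push ebp passes the outer routine's stack frame pointer to the nested routine as a hidden argument, so that the nested routine can reach the outer routine's locals, and the pop ecx removes it from the stack afterwards. What happens if we turn on optimizations?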



MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7C5 8BD3 mov edx,ebx
005CE7C7 8BC6 mov eax,esi
005CE7C9 E8B6FFFFFF call subprogram_aux

MyFormU.pas.76: z := normal_aux(n, z);
005CE82D 8BD3 mov edx,ebx
005CE82F 8BC6 mov eax,esi
005CE831 E8B6FFFFFF call normal_aux


Both compile to exactly the same code.



What happens when inlining?



normal_main:

MyFormU.pas.76: z := normal_aux(n, z);
005CE804 8BD3 mov edx,ebx
005CE806 8BC8 mov ecx,eax
005CE808 49 dec ecx
005CE809 85C9 test ecx,ecx
005CE80B 7C11 jl $005ce81e
005CE80D 41 inc ecx
005CE80E 33C0 xor eax,eax
005CE810 3BD0 cmp edx,eax
005CE812 7D04 jnl $005ce818
005CE814 03D0 add edx,eax
005CE816 EB02 jmp $005ce81a
005CE818 2BD0 sub edx,eax
005CE81A 40 inc eax
005CE81B 49 dec ecx
005CE81C 75F2 jnz $005ce810

subprogram_main:

MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7A8 8BD3 mov edx,ebx
005CE7AA 8BC8 mov ecx,eax
005CE7AC 49 dec ecx
005CE7AD 85C9 test ecx,ecx
005CE7AF 7C11 jl $005ce7c2
005CE7B1 41 inc ecx
005CE7B2 33C0 xor eax,eax
005CE7B4 3BD0 cmp edx,eax
005CE7B6 7D04 jnl $005ce7bc
005CE7B8 03D0 add edx,eax
005CE7BA EB02 jmp $005ce7be
005CE7BC 2BD0 sub edx,eax
005CE7BE 40 inc eax
005CE7BF 49 dec ecx
005CE7C0 75F2 jnz $005ce7b4


Again, no difference.



I also profiled this little example, taking an average of 30 executions for each (normal and subprogram), called in random order:



constructor TForm1.Create(AOwner: TComponent);
const
  c_nSamples = 60;
  rnd_sample : array[0..c_nSamples - 1] of byte = (
    1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0,
    1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,
    1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0);
var
  subprogram_gt_ns : Int64;
  normal_gt_ns : Int64;
  rnd_input : Integer;
  i : Integer;
begin
  inherited Create(AOwner);

  normal_gt_ns := 0;
  subprogram_gt_ns := 0;

  rnd_input := Random(1000);

  for i := 0 to c_nSamples - 1 do
    if (rnd_sample[i] = 1) then
      Inc(subprogram_gt_ns, subprogram_main(rnd_input))
    else
      Inc(normal_gt_ns, normal_main(rnd_input));

  OutputDebugString(PChar(' Normal ' + FloatToStr(normal_gt_ns / 30) + ' Subprogram ' + FloatToStr(subprogram_gt_ns / 30)));
end;


There is no significant difference even with optimizations turned off:



Debug Output: Normal 1166,66666666667 Subprogram 1203,33333333333 Process MyProject.exe (1824)


Finally, both texts that warn about performance mention something about shared local variables.



If we do not pass z to subprogram_aux and instead let it access the outer routine's z directly, we get:



MyFormU.pas.47: z := subprogram_aux(n);
005CE7D2 55 push ebp
005CE7D3 8BC3 mov eax,ebx
005CE7D5 E8AAFFFFFF call subprogram_aux
005CE7DA 59 pop ecx
005CE7DB 8945FC mov [ebp-$04],eax


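In this case the extra push and pop remain even with optimizations turned on, because the nested routine needs the outer frame pointer to reach z.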
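
To make the shared-variable case concrete, here is a minimal sketch of the two styles the quoted guide contrasts: a nested routine that reads and writes a local of its outer routine, and a unit-level routine that receives the same variable through a var parameter. All routine names are hypothetical and not part of the test above.

```pascal
program NestedVsUnitLevel;

{$APPTYPE CONSOLE}

// Unit-level variant: the accumulator is passed explicitly by
// reference, so the callee needs no access to the caller's frame.
procedure AddToTotal(var Total: Integer; Value: Integer);
begin
  Inc(Total, Value);
end;

function SumUnitLevel(N: Integer): Integer;
var
  Total, I: Integer;
begin
  Total := 0;
  for I := 1 to N do
    AddToTotal(Total, I);
  Result := Total;
end;

function SumNested(N: Integer): Integer;
var
  Total, I: Integer;

  // Nested variant: Add touches Total, a local of the outer routine.
  // This is the case in which the caller must hand over its frame
  // pointer on every call.
  procedure Add(Value: Integer);
  begin
    Inc(Total, Value);
  end;

begin
  Total := 0;
  for I := 1 to N do
    Add(I);
  Result := Total;
end;

begin
  Writeln(SumNested(100));    // 5050
  Writeln(SumUnitLevel(100)); // 5050
end.
```

Note that the unit-level version is not free either: the var parameter is an address that has to be passed on each call. Whether either form is measurably faster depends on the compiler and its settings, so profiling remains the only reliable guide.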


























  • Would be interesting to see what the other compilers do with this, e.g. the Win64 compiler.

    – David Heffernan
    Apr 24 at 14:26

  • Would also be interesting to see your RTClock (or RTC lock?). Does that require a specific piece of hardware, or is it available for everyone?

    – Rudy Velthuis
    Apr 24 at 21:17

  • @RudyVelthuis It's just a wrapper to Windows' QueryPerformanceCounter. I was told the resolution is 1ns, but the documentation only says "<1us".

    – afarah
    Apr 24 at 22:30

  • @afarah: Ah, OK. That's a bit like Java's "nanosecond" resolution for its timers. <g>

    – Rudy Velthuis
    Apr 25 at 6:21











Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55832314%2fwhy-is-there-a-performance-penalty-for-nested-subroutines-in-delphi%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









19














tl;dr



  • There are extra push/pops for nested subroutines

  • Turning on optimizations may strip those away, such that the generated code is the same for both nested subroutines and normal subroutines

  • Inlining results in the same code being generated for both nested and normal subroutines

  • For simple routines with few parameters and local variables we perceived no performance difference even with optimizations turned off

I wrote a little test to determine this, where GetRTClock is measuring the current time with a precision of 1ns:



function subprogram_main(z : Integer) : Int64;
var
n : Integer;
s : Int64;

function subprogram_aux(n, z : Integer) : Integer;
var
i : Integer;
begin
// Do some useless work on the aux program
for i := 0 to n - 1 do begin
if (i > z) then
z := z + i
else
z := z - i;
end;
Result := z;
end;
begin
s := GetRTClock;

// Do some minor work on the main program
n := z div 100 * 100 + 100;
// Call the aux program
z := subprogram_aux(n, z);

Result := GetRTClock - s;
end;

function normal_aux(n, z : Integer) : Integer;
var
i : Integer;
begin
// Do some useless work on the aux program
for i := 0 to n - 1 do begin
if (i > z) then
z := z + i
else
z := z - i;
end;
Result := z;
end;

function normal_main(z : Integer) : Int64;
var
n : Integer;
s : Int64;
begin
s := GetRTClock;

// Do some minor work on the main program
n := z div 100 * 100 + 100;
// Call the aux program
z := normal_aux(n, z);

Result := GetRTClock - s;
end;


This compiles to:



subprogram_main

MyFormU.pas.41: begin
005CE7D0 55 push ebp
005CE7D1 8BEC mov ebp,esp
005CE7D3 83C4E0 add esp,-$20
005CE7D6 8945FC mov [ebp-$04],eax
MyFormU.pas.42: s := GetRTClock;
...
MyFormU.pas.45: n := z div 100 * 100 + 100;
...
MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7F8 55 push ebp
005CE7F9 8B55FC mov edx,[ebp-$04]
005CE7FC 8B45EC mov eax,[ebp-$14]
005CE7FF E880FFFFFF call subprogram_aux
005CE804 59 pop ecx
005CE805 8945FC mov [ebp-$04],eax
MyFormU.pas.49: Result := GetRTClock - s;
...

normal_main

MyFormU.pas.70: begin
005CE870 55 push ebp
005CE871 8BEC mov ebp,esp
005CE873 83C4E0 add esp,-$20
005CE876 8945FC mov [ebp-$04],eax
MyFormU.pas.71: s := GetRTClock;
...
MyFormU.pas.74: n := z div 100 * 100 + 100;
...
MyFormU.pas.76: z := normal_aux(n, z);
005CE898 8B55FC mov edx,[ebp-$04]
005CE89B 8B45EC mov eax,[ebp-$14]
005CE89E E881FFFFFF call normal_aux
005CE8A3 8945FC mov [ebp-$04],eax
MyFormU.pas.78: Result := GetRTClock - s;
...

subprogram_aux:

MyFormU.pas.31: begin
005CE784 55 push ebp
005CE785 8BEC mov ebp,esp
005CE787 83C4EC add esp,-$14
005CE78A 8955F8 mov [ebp-$08],edx
005CE78D 8945FC mov [ebp-$04],eax
MyFormU.pas.33: for i := 0 to n - 1 do begin
005CE790 8B45FC mov eax,[ebp-$04]
005CE793 48 dec eax
005CE794 85C0 test eax,eax
005CE796 7C29 jl $005ce7c1
005CE798 40 inc eax
005CE799 8945EC mov [ebp-$14],eax
005CE79C C745F000000000 mov [ebp-$10],$00000000
MyFormU.pas.34: if (i > z) then
005CE7A3 8B45F0 mov eax,[ebp-$10]
005CE7A6 3B45F8 cmp eax,[ebp-$08]
005CE7A9 7E08 jle $005ce7b3
MyFormU.pas.35: z := z + i
005CE7AB 8B45F0 mov eax,[ebp-$10]
005CE7AE 0145F8 add [ebp-$08],eax
005CE7B1 EB06 jmp $005ce7b9
MyFormU.pas.37: z := z - i;
005CE7B3 8B45F0 mov eax,[ebp-$10]
005CE7B6 2945F8 sub [ebp-$08],eax

normal_aux:

MyFormU.pas.55: begin
005CE824 55 push ebp
005CE825 8BEC mov ebp,esp
005CE827 83C4EC add esp,-$14
005CE82A 8955F8 mov [ebp-$08],edx
005CE82D 8945FC mov [ebp-$04],eax
MyFormU.pas.57: for i := 0 to n - 1 do begin
005CE830 8B45FC mov eax,[ebp-$04]
005CE833 48 dec eax
005CE834 85C0 test eax,eax
005CE836 7C29 jl $005ce861
005CE838 40 inc eax
005CE839 8945EC mov [ebp-$14],eax
005CE83C C745F000000000 mov [ebp-$10],$00000000
MyFormU.pas.58: if (i > z) then
005CE843 8B45F0 mov eax,[ebp-$10]
005CE846 3B45F8 cmp eax,[ebp-$08]
005CE849 7E08 jle $005ce853
MyFormU.pas.59: z := z + i
005CE84B 8B45F0 mov eax,[ebp-$10]
005CE84E 0145F8 add [ebp-$08],eax
005CE851 EB06 jmp $005ce859
MyFormU.pas.61: z := z - i;
005CE853 8B45F0 mov eax,[ebp-$10]
005CE856 2945F8 sub [ebp-$08],eax


The only difference is one push and one pop. What happens if we turn on optimizations?



MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7C5 8BD3 mov edx,ebx
005CE7C7 8BC6 mov eax,esi
005CE7C9 E8B6FFFFFF call subprogram_aux

MyFormU.pas.76: z := normal_aux(n, z);
005CE82D 8BD3 mov edx,ebx
005CE82F 8BC6 mov eax,esi
005CE831 E8B6FFFFFF call normal_aux


Both compile exactly to the same thing.



What happens when inlining?



MyFormU.pas.76: z := normal_aux(n, z);
005CE804 8BD3 mov edx,ebx
005CE806 8BC8 mov ecx,eax
005CE808 49 dec ecx
005CE809 85C9 test ecx,ecx
005CE80B 7C11 jl $005ce81e
005CE80D 41 inc ecx
005CE80E 33C0 xor eax,eax
005CE810 3BD0 cmp edx,eax
005CE812 7D04 jnl $005ce818
005CE814 03D0 add edx,eax
005CE816 EB02 jmp $005ce81a
005CE818 2BD0 sub edx,eax
005CE81A 40 inc eax
005CE81B 49 dec ecx
005CE81C 75F2 jnz $005ce810

subprogram_main:

MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7A8 8BD3 mov edx,ebx
005CE7AA 8BC8 mov ecx,eax
005CE7AC 49 dec ecx
005CE7AD 85C9 test ecx,ecx
005CE7AF 7C11 jl $005ce7c2
005CE7B1 41 inc ecx
005CE7B2 33C0 xor eax,eax
005CE7B4 3BD0 cmp edx,eax
005CE7B6 7D04 jnl $005ce7bc
005CE7B8 03D0 add edx,eax
005CE7BA EB02 jmp $005ce7be
005CE7BC 2BD0 sub edx,eax
005CE7BE 40 inc eax
005CE7BF 49 dec ecx
005CE7C0 75F2 jnz $005ce7b4


Again, no difference.



I also profiled this little example, taking an average of 30 executions for each (normal and subprogram), called in random order:



constructor TForm1.Create(AOwner: TComponent);
const
  c_nSamples = 60;
  // Fixed, pre-shuffled call order: 1 = subprogram_main, 0 = normal_main (30 of each)
  rnd_sample : array[0..c_nSamples - 1] of byte = (1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0);
var
  subprogram_gt_ns : Int64;
  normal_gt_ns : Int64;
  rnd_input : Integer;
  i : Integer;
begin
  inherited Create(AOwner);

  // Accumulated elapsed time for each variant
  normal_gt_ns := 0;
  subprogram_gt_ns := 0;

  // The same input is used for every call
  rnd_input := Random(1000);

  for i := 0 to c_nSamples - 1 do
    if (rnd_sample[i] = 1) then
      Inc(subprogram_gt_ns, subprogram_main(rnd_input))
    else
      Inc(normal_gt_ns, normal_main(rnd_input));

  // Each variant ran 30 times, so divide each total by 30 to get the average
  OutputDebugString(PChar(' Normal ' + FloatToStr(normal_gt_ns / 30) + ' Subprogram ' + FloatToStr(subprogram_gt_ns / 30)));
end;
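The elapsed times summed here come from GetRTClock, which the author describes in a comment as a wrapper around Windows' QueryPerformanceCounter. Its source is not shown, so the following is only a sketch of what such a wrapper might look like, assuming the counter value is converted to nanoseconds with QueryPerformanceFrequency. The function body and the Winapi.Windows unit name (plain Windows in older Delphi versions) are assumptions, not the author's code.

```delphi
program RTClockSketch;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows;

// Hypothetical reconstruction of GetRTClock, not the author's code.
// Returns the performance counter converted to nanoseconds.
function GetRTClock : Int64;
var
  freq, ticks : Int64;
begin
  QueryPerformanceFrequency(freq);  // ticks per second
  QueryPerformanceCounter(ticks);   // current tick count
  // Convert whole seconds and the remainder separately so the
  // multiplication by 1e9 cannot overflow Int64.
  Result := (ticks div freq) * 1000000000
          + (ticks mod freq) * 1000000000 div freq;
end;

var
  t0, t1 : Int64;
begin
  t0 := GetRTClock;
  Sleep(10);
  t1 := GetRTClock;
  WriteLn('elapsed ns: ', t1 - t0);
end.
```

With a wrapper like this, the smallest measurable step is one counter tick (1/freq seconds), so values reported in nanoseconds do not imply 1 ns resolution.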


There is no significant difference between the two averages (in nanoseconds), even with optimizations turned off:



Debug Output: Normal 1166,66666666667 Subprogram 1203,33333333333 Process MyProject.exe (1824)


Finally, both texts that warn about performance mention something about shared local variables.



If we do not pass z to subprogram_aux and instead let it access the parent's z directly, we get:



MyFormU.pas.47: z := subprogram_aux(n);
005CE7D2 55 push ebp
005CE7D3 8BC3 mov eax,ebx
005CE7D5 E8AAFFFFFF call subprogram_aux
005CE7DA 59 pop ecx
005CE7DB 8945FC mov [ebp-$04],eax


The push and pop remain, even with optimizations turned on.
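The shared-variable variant can be sketched as follows. This is an assumed reconstruction, since the modified source is not shown: subprogram_main_shared and normal_main_explicit are hypothetical names, and the timing calls are left out so the example runs on its own. In the first version z lives in the parent's stack frame, and the retained push ebp is consistent with the caller handing the nested routine its frame pointer so that z can be reached.

```delphi
program NestedAccessSketch;

{$APPTYPE CONSOLE}

// Hypothetical variant: the nested routine reads and writes the
// parent's local z directly instead of receiving it as a parameter.
function subprogram_main_shared(z : Integer) : Integer;
var
  n : Integer;

  function subprogram_aux(n : Integer) : Integer;
  var
    i : Integer;
  begin
    // z is not a local here; it belongs to subprogram_main_shared
    for i := 0 to n - 1 do begin
      if (i > z) then
        z := z + i
      else
        z := z - i;
    end;
    Result := z;
  end;

begin
  n := z div 100 * 100 + 100;
  z := subprogram_aux(n);
  Result := z;
end;

// Same work with z passed explicitly to a non-nested routine
function normal_aux(n, z : Integer) : Integer;
var
  i : Integer;
begin
  for i := 0 to n - 1 do begin
    if (i > z) then
      z := z + i
    else
      z := z - i;
  end;
  Result := z;
end;

function normal_main_explicit(z : Integer) : Integer;
var
  n : Integer;
begin
  n := z div 100 * 100 + 100;
  Result := normal_aux(n, z);
end;

begin
  // Both variants perform the same computation on the same input
  if subprogram_main_shared(42) = normal_main_explicit(42) then
    WriteLn('same result')
  else
    WriteLn('mismatch');
end.
```

Both versions compute the same value; the difference lies only in how the nested routine reaches z, which is the case the performance warnings about shared local variables refer to.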


























  • 2
    Would be interesting to see what the other compilers do with this, e.g. Win64 compiler
    – David Heffernan
    Apr 24 at 14:26

  • 1
    Would also be interesting to see your RTClock (or RTC lock?). Does that require a specific piece of hardware, or is it available for everyone?
    – Rudy Velthuis
    Apr 24 at 21:17

  • @RudyVelthuis It's just a wrapper to Windows' QueryPerformanceCounter, I was told the resolution is 1ns but the documentation only says "<1us".
    – afarah
    Apr 24 at 22:30

  • @afarah: Ah, OK. That's a bit like Java's "nanosecond" resolution for its timers. <g>
    – Rudy Velthuis
    Apr 25 at 6:21














































































edited Apr 24 at 14:28

























answered Apr 24 at 14:19









afarah

























