Useless test instruction?
I got the below assembly list as result for JIT compilation for my java program.
mov 0x14(%rsp),%r10d
inc %r10d
mov 0x1c(%rsp),%r8d
inc %r8d
test %eax,(%r11) ; <--- this instruction
mov (%rsp),%r9
mov 0x40(%rsp),%r14d
mov 0x18(%rsp),%r11d
mov %ebp,%r13d
mov 0x8(%rsp),%rbx
mov 0x20(%rsp),%rbp
mov 0x10(%rsp),%ecx
mov 0x28(%rsp),%rax
movzbl 0x18(%r9),%edi
movslq %r8d,%rsi
cmp 0x30(%rsp),%rsi
jge 0x00007fd3d27c4f17
My understanding the test
instruction is useless here because the main idea of the test is
The flags SF, ZF, PF are modified while the result of the AND is discarded.
and here we don't use these result flags.
Is it a bug in JIT or do I miss something?
If it is, where the best place for reporting it?
Thanks!
java assembly jvm jit jvm-hotspot
|
show 4 more comments
I got the below assembly list as result for JIT compilation for my java program.
mov 0x14(%rsp),%r10d
inc %r10d
mov 0x1c(%rsp),%r8d
inc %r8d
test %eax,(%r11) ; <--- this instruction
mov (%rsp),%r9
mov 0x40(%rsp),%r14d
mov 0x18(%rsp),%r11d
mov %ebp,%r13d
mov 0x8(%rsp),%rbx
mov 0x20(%rsp),%rbp
mov 0x10(%rsp),%ecx
mov 0x28(%rsp),%rax
movzbl 0x18(%r9),%edi
movslq %r8d,%rsi
cmp 0x30(%rsp),%rsi
jge 0x00007fd3d27c4f17
My understanding the test
instruction is useless here because the main idea of the test is
The flags SF, ZF, PF are modified while the result of the AND is discarded.
and here we don't use these result flags.
Is it a bug in JIT or do I miss something?
If it is, where the best place for reporting it?
Thanks!
java assembly jvm jit jvm-hotspot
2
This instruction does indeed seem useless.
– fuz
Jan 5 at 18:23
6
FWIW, it implicitly checks that r11 contains a valid pointer, and raises an exception if not. Is that intentional? I don't know, out of context.
– another-dave
Jan 5 at 18:45
2
Now that we know the answer, if the JVM had more time to analyze the surrounding code it could have usedmov (%r11), %r9d
becauser9
is about to be written by another instruction. MOV is the same number of code bytes, but it's a pure load without an ALU uop. This is a minor optimization because ALU port pressure is almost certainly not a problem here, and modern x86 CPUs keep the load micro-fused into a single uop with the ALU instruction through most of the pipeline so it doesn't hurt front-end throughput.
– Peter Cordes
2 days ago
But it does take an extra scheduler entry until the load is ready so the ALU uop can execute, and 2 ROB entries on Sandybridge and earlier Intel. IvyBridge & later have fused-domain ROB, but SnB has an unfused-domain ReOrder Buffer. Source: Mentioned in a row in table 3 in this paper: publications.vpw.me/publications/2015_uop_flow_simulation.pdf. See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths
– Peter Cordes
2 days ago
@PeterCordes That's pretty counterintuitive and strange. I always thought the microfused uops will keep fused until dispatching to execution port. I double check Agner Fog's manual, they also say the uop will keep fused to RS. They even say in page 92 that saving an ROB entry is an advantage of micro fusion since PM, which is quite reasonable. Are you sure ROB is an unfused-domain until IvyBridge?
– liliscent
2 days ago
|
show 4 more comments
I got the below assembly list as result for JIT compilation for my java program.
mov 0x14(%rsp),%r10d
inc %r10d
mov 0x1c(%rsp),%r8d
inc %r8d
test %eax,(%r11) ; <--- this instruction
mov (%rsp),%r9
mov 0x40(%rsp),%r14d
mov 0x18(%rsp),%r11d
mov %ebp,%r13d
mov 0x8(%rsp),%rbx
mov 0x20(%rsp),%rbp
mov 0x10(%rsp),%ecx
mov 0x28(%rsp),%rax
movzbl 0x18(%r9),%edi
movslq %r8d,%rsi
cmp 0x30(%rsp),%rsi
jge 0x00007fd3d27c4f17
My understanding the test
instruction is useless here because the main idea of the test is
The flags SF, ZF, PF are modified while the result of the AND is discarded.
and here we don't use these result flags.
Is it a bug in JIT or do I miss something?
If it is, where the best place for reporting it?
Thanks!
java assembly jvm jit jvm-hotspot
I got the below assembly list as result for JIT compilation for my java program.
mov 0x14(%rsp),%r10d
inc %r10d
mov 0x1c(%rsp),%r8d
inc %r8d
test %eax,(%r11) ; <--- this instruction
mov (%rsp),%r9
mov 0x40(%rsp),%r14d
mov 0x18(%rsp),%r11d
mov %ebp,%r13d
mov 0x8(%rsp),%rbx
mov 0x20(%rsp),%rbp
mov 0x10(%rsp),%ecx
mov 0x28(%rsp),%rax
movzbl 0x18(%r9),%edi
movslq %r8d,%rsi
cmp 0x30(%rsp),%rsi
jge 0x00007fd3d27c4f17
My understanding the test
instruction is useless here because the main idea of the test is
The flags SF, ZF, PF are modified while the result of the AND is discarded.
and here we don't use these result flags.
Is it a bug in JIT or do I miss something?
If it is, where the best place for reporting it?
Thanks!
java assembly jvm jit jvm-hotspot
java assembly jvm jit jvm-hotspot
edited 2 days ago
Henrik Schumacher
1433
1433
asked Jan 5 at 18:09
QIvanQIvan
16816
16816
2
This instruction does indeed seem useless.
– fuz
Jan 5 at 18:23
6
FWIW, it implicitly checks that r11 contains a valid pointer, and raises an exception if not. Is that intentional? I don't know, out of context.
– another-dave
Jan 5 at 18:45
2
Now that we know the answer, if the JVM had more time to analyze the surrounding code it could have usedmov (%r11), %r9d
becauser9
is about to be written by another instruction. MOV is the same number of code bytes, but it's a pure load without an ALU uop. This is a minor optimization because ALU port pressure is almost certainly not a problem here, and modern x86 CPUs keep the load micro-fused into a single uop with the ALU instruction through most of the pipeline so it doesn't hurt front-end throughput.
– Peter Cordes
2 days ago
But it does take an extra scheduler entry until the load is ready so the ALU uop can execute, and 2 ROB entries on Sandybridge and earlier Intel. IvyBridge & later have fused-domain ROB, but SnB has an unfused-domain ReOrder Buffer. Source: Mentioned in a row in table 3 in this paper: publications.vpw.me/publications/2015_uop_flow_simulation.pdf. See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths
– Peter Cordes
2 days ago
@PeterCordes That's pretty counterintuitive and strange. I always thought the microfused uops will keep fused until dispatching to execution port. I double check Agner Fog's manual, they also say the uop will keep fused to RS. They even say in page 92 that saving an ROB entry is an advantage of micro fusion since PM, which is quite reasonable. Are you sure ROB is an unfused-domain until IvyBridge?
– liliscent
2 days ago
|
show 4 more comments
2
This instruction does indeed seem useless.
– fuz
Jan 5 at 18:23
6
FWIW, it implicitly checks that r11 contains a valid pointer, and raises an exception if not. Is that intentional? I don't know, out of context.
– another-dave
Jan 5 at 18:45
2
Now that we know the answer, if the JVM had more time to analyze the surrounding code it could have usedmov (%r11), %r9d
becauser9
is about to be written by another instruction. MOV is the same number of code bytes, but it's a pure load without an ALU uop. This is a minor optimization because ALU port pressure is almost certainly not a problem here, and modern x86 CPUs keep the load micro-fused into a single uop with the ALU instruction through most of the pipeline so it doesn't hurt front-end throughput.
– Peter Cordes
2 days ago
But it does take an extra scheduler entry until the load is ready so the ALU uop can execute, and 2 ROB entries on Sandybridge and earlier Intel. IvyBridge & later have fused-domain ROB, but SnB has an unfused-domain ReOrder Buffer. Source: Mentioned in a row in table 3 in this paper: publications.vpw.me/publications/2015_uop_flow_simulation.pdf. See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths
– Peter Cordes
2 days ago
@PeterCordes That's pretty counterintuitive and strange. I always thought the microfused uops will keep fused until dispatching to execution port. I double check Agner Fog's manual, they also say the uop will keep fused to RS. They even say in page 92 that saving an ROB entry is an advantage of micro fusion since PM, which is quite reasonable. Are you sure ROB is an unfused-domain until IvyBridge?
– liliscent
2 days ago
2
2
This instruction does indeed seem useless.
– fuz
Jan 5 at 18:23
This instruction does indeed seem useless.
– fuz
Jan 5 at 18:23
6
6
FWIW, it implicitly checks that r11 contains a valid pointer, and raises an exception if not. Is that intentional? I don't know, out of context.
– another-dave
Jan 5 at 18:45
FWIW, it implicitly checks that r11 contains a valid pointer, and raises an exception if not. Is that intentional? I don't know, out of context.
– another-dave
Jan 5 at 18:45
2
2
Now that we know the answer, if the JVM had more time to analyze the surrounding code it could have used
mov (%r11), %r9d
because r9
is about to be written by another instruction. MOV is the same number of code bytes, but it's a pure load without an ALU uop. This is a minor optimization because ALU port pressure is almost certainly not a problem here, and modern x86 CPUs keep the load micro-fused into a single uop with the ALU instruction through most of the pipeline so it doesn't hurt front-end throughput.– Peter Cordes
2 days ago
Now that we know the answer, if the JVM had more time to analyze the surrounding code it could have used
mov (%r11), %r9d
because r9
is about to be written by another instruction. MOV is the same number of code bytes, but it's a pure load without an ALU uop. This is a minor optimization because ALU port pressure is almost certainly not a problem here, and modern x86 CPUs keep the load micro-fused into a single uop with the ALU instruction through most of the pipeline so it doesn't hurt front-end throughput.– Peter Cordes
2 days ago
But it does take an extra scheduler entry until the load is ready so the ALU uop can execute, and 2 ROB entries on Sandybridge and earlier Intel. IvyBridge & later have fused-domain ROB, but SnB has an unfused-domain ReOrder Buffer. Source: Mentioned in a row in table 3 in this paper: publications.vpw.me/publications/2015_uop_flow_simulation.pdf. See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths
– Peter Cordes
2 days ago
But it does take an extra scheduler entry until the load is ready so the ALU uop can execute, and 2 ROB entries on Sandybridge and earlier Intel. IvyBridge & later have fused-domain ROB, but SnB has an unfused-domain ReOrder Buffer. Source: Mentioned in a row in table 3 in this paper: publications.vpw.me/publications/2015_uop_flow_simulation.pdf. See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths
– Peter Cordes
2 days ago
@PeterCordes That's pretty counterintuitive and strange. I always thought the microfused uops will keep fused until dispatching to execution port. I double check Agner Fog's manual, they also say the uop will keep fused to RS. They even say in page 92 that saving an ROB entry is an advantage of micro fusion since PM, which is quite reasonable. Are you sure ROB is an unfused-domain until IvyBridge?
– liliscent
2 days ago
@PeterCordes That's pretty counterintuitive and strange. I always thought the microfused uops will keep fused until dispatching to execution port. I double check Agner Fog's manual, they also say the uop will keep fused to RS. They even say in page 92 that saving an ROB entry is an advantage of micro fusion since PM, which is quite reasonable. Are you sure ROB is an unfused-domain until IvyBridge?
– liliscent
2 days ago
|
show 4 more comments
1 Answer
1
active
oldest
votes
That must be the thread-local handshake poll.
Look where %r11
is read from. If it is read from some offset off the %r15
(thread-local storage), that's the guy. See the example here:
0.31% ↗ ...70: movzbl 0x94(%r9),%r10d
0.19% │ ...78: mov 0x108(%r15),%r11 ; read the thread-local page addr
25.62% │ ...7f: add $0x1,%rbp
35.10% │ ...83: test %eax,(%r11) ; thread-local handshake poll
34.91% │ ...86: test %r10d,%r10d
╰ ...89: je ...70
It is not useless, it would cause SEGV once the guard page is marked non-readable, and that would transfer control to JVM's SEGV handler. This is part of JVM's mechanics to safepoint Java threads, e.g. for GC.
UPD: Hopefully, more details here.
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54054782%2fuseless-test-instruction%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
That must be the thread-local handshake poll.
Look where %r11
is read from. If it is read from some offset off the %r15
(thread-local storage), that's the guy. See the example here:
0.31% ↗ ...70: movzbl 0x94(%r9),%r10d
0.19% │ ...78: mov 0x108(%r15),%r11 ; read the thread-local page addr
25.62% │ ...7f: add $0x1,%rbp
35.10% │ ...83: test %eax,(%r11) ; thread-local handshake poll
34.91% │ ...86: test %r10d,%r10d
╰ ...89: je ...70
It is not useless, it would cause SEGV once the guard page is marked non-readable, and that would transfer control to JVM's SEGV handler. This is part of JVM's mechanics to safepoint Java threads, e.g. for GC.
UPD: Hopefully, more details here.
add a comment |
That must be the thread-local handshake poll.
Look where %r11
is read from. If it is read from some offset off the %r15
(thread-local storage), that's the guy. See the example here:
0.31% ↗ ...70: movzbl 0x94(%r9),%r10d
0.19% │ ...78: mov 0x108(%r15),%r11 ; read the thread-local page addr
25.62% │ ...7f: add $0x1,%rbp
35.10% │ ...83: test %eax,(%r11) ; thread-local handshake poll
34.91% │ ...86: test %r10d,%r10d
╰ ...89: je ...70
It is not useless, it would cause SEGV once the guard page is marked non-readable, and that would transfer control to JVM's SEGV handler. This is part of JVM's mechanics to safepoint Java threads, e.g. for GC.
UPD: Hopefully, more details here.
add a comment |
That must be the thread-local handshake poll.
Look where %r11
is read from. If it is read from some offset off the %r15
(thread-local storage), that's the guy. See the example here:
0.31% ↗ ...70: movzbl 0x94(%r9),%r10d
0.19% │ ...78: mov 0x108(%r15),%r11 ; read the thread-local page addr
25.62% │ ...7f: add $0x1,%rbp
35.10% │ ...83: test %eax,(%r11) ; thread-local handshake poll
34.91% │ ...86: test %r10d,%r10d
╰ ...89: je ...70
It is not useless, it would cause SEGV once the guard page is marked non-readable, and that would transfer control to JVM's SEGV handler. This is part of JVM's mechanics to safepoint Java threads, e.g. for GC.
UPD: Hopefully, more details here.
That must be the thread-local handshake poll.
Look where %r11
is read from. If it is read from some offset off the %r15
(thread-local storage), that's the guy. See the example here:
0.31% ↗ ...70: movzbl 0x94(%r9),%r10d
0.19% │ ...78: mov 0x108(%r15),%r11 ; read the thread-local page addr
25.62% │ ...7f: add $0x1,%rbp
35.10% │ ...83: test %eax,(%r11) ; thread-local handshake poll
34.91% │ ...86: test %r10d,%r10d
╰ ...89: je ...70
It is not useless, it would cause SEGV once the guard page is marked non-readable, and that would transfer control to JVM's SEGV handler. This is part of JVM's mechanics to safepoint Java threads, e.g. for GC.
UPD: Hopefully, more details here.
edited Jan 5 at 22:48
answered Jan 5 at 19:07
Aleksey ShipilevAleksey Shipilev
13.9k23770
13.9k23770
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54054782%2fuseless-test-instruction%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
2
This instruction does indeed seem useless.
– fuz
Jan 5 at 18:23
6
FWIW, it implicitly checks that r11 contains a valid pointer, and raises an exception if not. Is that intentional? I don't know, out of context.
– another-dave
Jan 5 at 18:45
2
Now that we know the answer, if the JVM had more time to analyze the surrounding code it could have used
mov (%r11), %r9d
becauser9
is about to be written by another instruction. MOV is the same number of code bytes, but it's a pure load without an ALU uop. This is a minor optimization because ALU port pressure is almost certainly not a problem here, and modern x86 CPUs keep the load micro-fused into a single uop with the ALU instruction through most of the pipeline so it doesn't hurt front-end throughput.– Peter Cordes
2 days ago
But it does take an extra scheduler entry until the load is ready so the ALU uop can execute, and 2 ROB entries on Sandybridge and earlier Intel. IvyBridge & later have fused-domain ROB, but SnB has an unfused-domain ReOrder Buffer. Source: Mentioned in a row in table 3 in this paper: publications.vpw.me/publications/2015_uop_flow_simulation.pdf. See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths
– Peter Cordes
2 days ago
@PeterCordes That's pretty counterintuitive and strange. I always thought the microfused uops will keep fused until dispatching to execution port. I double check Agner Fog's manual, they also say the uop will keep fused to RS. They even say in page 92 that saving an ROB entry is an advantage of micro fusion since PM, which is quite reasonable. Are you sure ROB is an unfused-domain until IvyBridge?
– liliscent
2 days ago