Useless test instruction?

I got the assembly listing below as the result of JIT compilation of my Java program.



mov 0x14(%rsp),%r10d
inc %r10d

mov 0x1c(%rsp),%r8d
inc %r8d

test %eax,(%r11) ; <--- this instruction

mov (%rsp),%r9
mov 0x40(%rsp),%r14d
mov 0x18(%rsp),%r11d
mov %ebp,%r13d
mov 0x8(%rsp),%rbx
mov 0x20(%rsp),%rbp
mov 0x10(%rsp),%ecx
mov 0x28(%rsp),%rax

movzbl 0x18(%r9),%edi
movslq %r8d,%rsi

cmp 0x30(%rsp),%rsi
jge 0x00007fd3d27c4f17


My understanding is that the test instruction is useless here, because the main idea of test is:

"The flags SF, ZF, PF are modified while the result of the AND is discarded."

and here we don't use the resulting flags.

Is it a bug in the JIT, or am I missing something?
If it is, where is the best place to report it?
Thanks!

java assembly jvm jit jvm-hotspot






edited 2 days ago by Henrik Schumacher
asked Jan 5 at 18:09 by QIvan

  • This instruction does indeed seem useless.
    – fuz
    Jan 5 at 18:23

  • FWIW, it implicitly checks that r11 contains a valid pointer, and raises an exception if not. Is that intentional? I don't know, out of context.
    – another-dave
    Jan 5 at 18:45

  • Now that we know the answer, if the JVM had more time to analyze the surrounding code it could have used mov (%r11), %r9d because r9 is about to be written by another instruction. MOV is the same number of code bytes, but it's a pure load without an ALU uop. This is a minor optimization because ALU port pressure is almost certainly not a problem here, and modern x86 CPUs keep the load micro-fused into a single uop with the ALU instruction through most of the pipeline, so it doesn't hurt front-end throughput.
    – Peter Cordes
    2 days ago

  • But it does take an extra scheduler entry until the load is ready so the ALU uop can execute, and 2 ROB entries on Sandybridge and earlier Intel. IvyBridge and later have a fused-domain ROB, but SnB has an unfused-domain ReOrder Buffer. Source: mentioned in a row of table 3 in this paper: publications.vpw.me/publications/2015_uop_flow_simulation.pdf. See also "Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths".
    – Peter Cordes
    2 days ago

  • @PeterCordes That's pretty counterintuitive and strange. I always thought micro-fused uops stay fused until they are dispatched to an execution port. I double-checked Agner Fog's manual, which also says the uop stays fused up to the RS. It even says on page 92 that saving a ROB entry has been an advantage of micro-fusion since the PM, which is quite reasonable. Are you sure the ROB is unfused-domain until IvyBridge?
    – liliscent
    2 days ago

1 Answer

That must be the thread-local handshake poll.
Look at where %r11 is loaded from. If it is read from some offset off %r15 (thread-local storage), that's the one. See this example:



  0.31%  ↗  ...70: movzbl 0x94(%r9),%r10d
  0.19%  │  ...78: mov    0x108(%r15),%r11   ; read the thread-local page addr
 25.62%  │  ...7f: add    $0x1,%rbp
 35.10%  │  ...83: test   %eax,(%r11)        ; thread-local handshake poll
 34.91%  │  ...86: test   %r10d,%r10d
         ╰  ...89: je     ...70


It is not useless. The flags it sets are indeed ignored; what matters is the memory read. Once the guard page is marked non-readable, that read causes a SEGV, which transfers control to the JVM's SEGV handler. This is part of the JVM's mechanism for bringing Java threads to a safepoint, e.g. for GC.
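To make the idea of a poll on the loop back-edge more concrete, here is a rough model in plain Java. It is only a sketch of the cooperative protocol, not HotSpot's implementation: a boolean flag stands in for "the page has been made non-readable", and an ordinary branch stands in for the fault and the SEGV handler. The names PollWord, arm, poll and spinUntilPolled are made up for this illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

/** Conceptual model of a per-thread safepoint poll; NOT how HotSpot implements it. */
final class SafepointPollModel {

    /** Hypothetical stand-in for the thread-local polling page. */
    static final class PollWord {
        // "armed" plays the role of "the page has been marked non-readable".
        private final AtomicBoolean armed = new AtomicBoolean(false);
        private final CountDownLatch reached = new CountDownLatch(1);

        /** Called by the requesting side: ask the worker to stop at its next poll. */
        void arm() {
            armed.set(true);
        }

        /** The poll: one cheap read on the fast path, extra work only when armed. */
        boolean poll() {
            if (armed.get()) {        // in the real JVM the load itself faults here
                reached.countDown();  // in the real JVM the SEGV handler takes over
                return true;
            }
            return false;
        }

        /** Blocks until the worker has observed the request. */
        void awaitReached() throws InterruptedException {
            reached.await();
        }
    }

    /** Hot loop that polls once per iteration, like the loop in the listing. */
    static long spinUntilPolled(PollWord word) {
        long iterations = 0;
        while (!word.poll()) {
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) throws Exception {
        PollWord word = new PollWord();
        long[] result = new long[1];
        Thread worker = new Thread(() -> result[0] = spinUntilPolled(word));
        worker.start();

        Thread.sleep(10);     // let the loop spin for a moment
        word.arm();           // models "protect the page"
        word.awaitReached();  // the worker has noticed the request
        worker.join();        // makes result[0] safely visible to this thread

        System.out.println("worker stopped at its poll after " + result[0] + " iterations");
    }
}
```

The appeal of the real trick over this model is that the fast path is a single load with no compare and no branch, which is why it shows up as a lone test whose flags nobody reads.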



UPD: Hopefully, more details here.






edited Jan 5 at 22:48
answered Jan 5 at 19:07 by Aleksey Shipilev