
Missed optimization of boolean computations for the usize type #140103

Open
ZhonghaoPan-nju opened this issue Apr 21, 2025 · 7 comments
Labels
A-LLVM  Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
C-optimization  Category: An issue highlighting optimization opportunities or PRs implementing such
I-slow  Issue: Problems and improvements with respect to performance of generated code.
T-compiler  Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@ZhonghaoPan-nju

The following example shows that rustc does not optimize boolean computations for the usize type.

I tried this code (compiled with opt-level=3):
https://www.godbolt.org/z/zY3Kqj8ax

#[no_mangle]
fn even(x: usize) -> bool {
    if !(x >= (0 + 2)) {
        return false;
    } else if x == 2 {
        return true;
    } else {
        return even(x - 2);
    }
}

I expected to see this happen:

even:
        cmp     rdi, 2
        jae     .LBB0_2
        xor     eax, eax
        ret
.LBB0_2:
        mov     rax, rdi
        add     rdi, -2
        and     rax, -2
        cmp     rax, 2
        jne     .LBB0_2
        test    rdi, rdi
        sete    al
        ret

Instead, this happened:

even:
        cmp     rdi, 2
        jb      .LBB0_1
        lea     rcx, [rdi - 2]
        mov     eax, ecx
        shr     eax
        inc     eax
        and     eax, 7
        je      .LBB0_3
        add     eax, eax
        xor     edx, edx
.LBB0_5:
        add     rdx, 2
        cmp     rax, rdx
        jne     .LBB0_5
        sub     rdi, rdx
        sete    al
        cmp     rcx, 14
        jae     .LBB0_8
.LBB0_10:
        and     al, 1
        ret
.LBB0_1:
        xor     eax, eax
        and     al, 1
        ret
.LBB0_3:
        cmp     rcx, 14
        jb      .LBB0_10
.LBB0_8:
        add     rdi, -16
        cmp     rdi, 1
        ja      .LBB0_8
        test    rdi, rdi
        sete    al
        and     al, 1
        ret

However, when I change the type from usize to i32, the compiler generates the desired assembly code. I can also get the optimized result by rewriting if !(x >= (0 + 2)) as if x < 2.
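
For comparison, here is a sketch of the rewritten guard described above (the name even_rewritten is only for illustration); per the observation above, this form produces the compact assembly shown under "I expected to see this happen":

#[no_mangle]
fn even_rewritten(x: usize) -> bool {
    // Identical logic to the reproducer; only the guard is written
    // directly as x < 2 instead of !(x >= (0 + 2)).
    if x < 2 {
        return false;
    } else if x == 2 {
        return true;
    } else {
        return even_rewritten(x - 2);
    }
}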

Therefore, I suspect that something goes wrong when optimizing boolean computations for the usize type. If addressed, this could improve rustc's code generation for cases like this one.

Could you please review the situation? Thank you!

Meta

rustc 1.85.0-nightly (d117b7f21 2024-12-31)
binary: rustc
commit-hash: d117b7f211835282b3b177dc64245fff0327c04c
commit-date: 2024-12-31
host: x86_64-unknown-linux-gnu
release: 1.85.0-nightly
LLVM version: 19.1.6
@ZhonghaoPan-nju ZhonghaoPan-nju added the C-bug Category: This is a bug. label Apr 21, 2025
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Apr 21, 2025
@jieyouxu jieyouxu added C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such and removed C-bug Category: This is a bug. labels Apr 21, 2025
@saethlin
Member

I clicked on the godbolt link and I see the good assembly. So I'm confused.

@ZhonghaoPan-nju
Author

@saethlin I have checked the godbolt link https://www.godbolt.org/z/zY3Kqj8ax. The editor pane Rust Resource #2 contains the example with the missed optimization, and the compiler pane rustc 1.86.0 (Editor #2) shows the corresponding assembly. The other panes in the link are for comparison.

To show the problem clearly, here is another godbolt link that reproduces exactly the bad assembly from this issue:
https://www.godbolt.org/z/fEW4zExWf

Thank you!

@saethlin
Member

saethlin commented Apr 21, 2025

More fool me. I tried to open the link on mobile and got confused by how godbolt renders on mobile 🤦

Alive2 says this optimization is valid: https://alive2.llvm.org/ce/z/LaPXQ2

Also, the only diff in the IR we generate is whether the comparison is signed or unsigned: https://www.godbolt.org/z/GnTrWErcs
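
A minimal Rust sketch of where that signed/unsigned difference comes from (guard_unsigned and guard_signed are illustrative names, not from the issue): the same guard on a usize lowers to an unsigned comparison in the IR, while on an i32 it lowers to a signed one.

// usize is unsigned, so the guard becomes an unsigned comparison (icmp ult/uge).
fn guard_unsigned(x: usize) -> bool {
    !(x >= (0 + 2))
}

// i32 is signed, so the same guard becomes a signed comparison (icmp slt/sge).
fn guard_signed(x: i32) -> bool {
    !(x >= (0 + 2))
}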

@saethlin saethlin added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Apr 21, 2025
@ZhonghaoPan-nju
Author

@saethlin Thank you for your review. I’ll learn it and keep following this issue. :)

@nikic
Contributor

nikic commented Apr 21, 2025

As far as I can tell the difference here is that the loop gets runtime unrolled in the usize case. Is the resulting code actually slower (for large inputs)?

Just because you see less assembly doesn't mean that the code is less optimized.

@nikic
Contributor

nikic commented Apr 21, 2025

I'll also add that, based on extensive experience in the LLVM project, missed optimization issues that are based around some kind of automated identification approach tend to provide significant negative value to the project. They result in contributors wasting a substantial amount of time implementing, reviewing and maintaining optimizations that do not provide any real-world value.

Making such reports actually useful requires significantly more effort on behalf of the reporter -- you need to understand what the root cause of the optimization difference is and then consider whether fixing that root cause can plausibly benefit real-world code. The usually more reliable approach is to instead directly work on a corpus of real-world code, and identify missed optimization opportunities in it.

I'd appreciate it if missed optimization reports that are not directly derived from unmodified real-world code were at least tagged as such, to help prioritize them appropriately.

@ZhonghaoPan-nju
Author

@nikic Thank you very much for your comment, which has given me a deeper understanding of project maintenance regarding missed optimization opportunities.

Indeed, in complex systems, the handling of missed optimization issues tends to be more conservative. Through my research on detecting optimization oversights in various complex systems, I’ve observed that developers often do not prioritize minor optimization opportunities, as the associated costs outweigh the benefits.

In academia, some scholars adopt a relatively straightforward criterion: test cases whose assembly output differs by more than 100 lines under differential testing are reported. However, this standard is not entirely accurate, since the size of the program itself also strongly influences the number of assembly lines; for smaller programs, even smaller differences warrant attention. In this issue, the bad example is more than twice as long as the good example in assembly lines, which is why it caught my attention.

More importantly, as you mentioned, the root cause of the missed optimization and its actual performance impact are what truly determine the value of the issue. In this case, the root cause may lie in the handling of boolean operations on the usize type, which keeps the recursion from being compiled into the compact loop shown above and thereby slows down the program. There is also a noticeable performance difference in practice. Below are the perf reports for the two programs with an input of 333:

Bad example:

         2,048,659      cpu_core/cycles/                                            
     <not counted>      cpu_atom/cycles/                                              (0.00%)
         1,878,650      cpu_core/instructions/                                      
     <not counted>      cpu_atom/instructions/                                        (0.00%)
            20,471      cpu_core/cache-misses/                                      
     <not counted>      cpu_atom/cache-misses/                                        (0.00%)
           371,849      cpu_core/branches/                                          
     <not counted>      cpu_atom/branches/                                            (0.00%)
            10,912      cpu_core/branch-misses/                                     
     <not counted>      cpu_atom/branch-misses/                                       (0.00%)

       0.000708355 seconds time elapsed
       0.000692000 seconds user
       0.000000000 seconds sys  

Good example:

         1,806,808      cpu_core/cycles/                                            
     <not counted>      cpu_atom/cycles/                                              (0.00%)
         1,841,896      cpu_core/instructions/                                      
     <not counted>      cpu_atom/instructions/                                        (0.00%)
            19,881      cpu_core/cache-misses/                                      
     <not counted>      cpu_atom/cache-misses/                                        (0.00%)
           365,033      cpu_core/branches/                                          
     <not counted>      cpu_atom/branches/                                            (0.00%)
            11,035      cpu_core/branch-misses/                                     
     <not counted>      cpu_atom/branch-misses/                                       (0.00%)

       0.000700774 seconds time elapsed
       0.000657000 seconds user
       0.000000000 seconds sys  

The CPU cycle counts show a significant difference, indicating that this optimization does have meaningful impact.
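
For context, a minimal sketch of the kind of driver that could produce such measurements (the exact harness is not included in this issue, so the binary layout and the use of black_box here are assumptions):

fn main() {
    // black_box keeps the compiler from folding the call away at compile time.
    let result = even(std::hint::black_box(333));
    std::hint::black_box(result);
    // Build one binary per variant and compare them with `perf stat ./<binary>`.
}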

Finally, does this missed optimization opportunity matter in the real world? I believe it does. Although the even-number check in this issue may not be the best implementation, usize and recursion are widely used in real programs. For large-scale computations, if more recursive patterns can be optimized into loops, it could lead to substantial performance improvements.

Thank you again for your response—it has been very enlightening. I hope my comment can be helpful to rustc, and please correct me if I’ve made any mistakes. Thank you!
