ARM NEON Optimization no faster than C++ Pointer Implementation

I have two functions for splitting a YUYV frame into independent Y, U, and V planes. I do this so that I can upload three textures containing the Y/U/V data to the GPU and convert the YUYV video frame to RGBA in an OpenGL ES 2.0 shader. One function is written in C++ and the other in ARM NEON assembly. My target is a Cortex-A15 (AM57xx Sitara).
I expected the NEON code to outperform the C++ code, but they perform the same. One possibility is that I am memory I/O bound. Another is that I am not great at writing NEON code.
Why do these two functions perform the same? Are there any glaring optimizations that could be made to either function?
NEON function:
/// This structure is passed to ARM Assembly code
/// to split the YUV frame into separate planes for
/// OpenGL Consumption
typedef struct {
    char *input_data;
    int input_size;
    char *y_plane;
    char *u_plane;
    char *v_plane;
} yuvSplitStruct;

void TopOpenGL::splitYuvPlanes(yuvSplitStruct *yuvStruct)
{

    __asm__ volatile(

                "PUSH {r4}\n"                            /* Save callee-save registers R4 and R5 on the stack */
                "PUSH {r5}\n"                            /* r1 is the pointer to the input structure ( r0 is 'this' because c++ ) */
                "ldr r0 , [r1]\n"                        /* reuse r0 scratch register for the address of our frame input */
                "ldr r2 , [r1, #4]\n"                    /* use r2 scratch register to store the size in bytes of the YUYV frame */
                "ldr r3 , [r1, #8]\n"                    /* use r3 scratch register to store the destination Y plane address */
                "ldr r4 , [r1, #12]\n"                   /* use r4 register to store the destination U plane address */
                "ldr r5 , [r1, #16]\n"                   /* use r5 register to store the destination V plane address */
                "/* pld [r0, #192] PLD Does not seem to help */"
                    "mov r2, r2, lsr #5\n"               /* Divide number of bytes by 32 because we process 16 pixels at a time */
                    "loopYUYV:\n"
                        "vld4.8 {d0-d3}, [r0]!\n"        /* Load 8 YUYV elements from our frame into d0-d3, increment frame pointer */
                        "vst2.8 {d0,d2}, [r3]!\n"        /* Store both Y elements into destination y plane, increment plane pointer */
                        "vmov.F64 d0, d1\n"              /* Duplicate U value */
                        "vst2.8 {d0,d1}, [r4]!\n"        /* Store both U elements into destination u plane, increment plane pointer */
                        "vmov.F64 d1, d3\n"              /* Duplicate V value */
                        "vst2.8 {d1,d3}, [r5]!\n"        /* Store both V elements into destination v plane, increment plane pointer */
                        "subs r2, r2, #1\n"              /* Decrement the loop counter */
                    "bgt loopYUYV\n"                     /* Loop until entire frame is processed */
                "POP {r5}\n"                             /* Restore callee-save registers */
                "POP {r4}\n"
    );

}
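
The inline assembly above hard-codes r0 to r5 and d0 to d3 and has no operand or clobber list, so the compiler is not told which registers and memory it touches. One way to express the same load/deinterleave/store pattern without naming registers is NEON intrinsics. The following is only a sketch under assumptions: `splitYuyvSketch` is a hypothetical name, each output plane is assumed to hold `size / 2` bytes, and the scalar loop doubles as a fallback so the code also builds on non-NEON targets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper (not from the original post). Splits a YUYV buffer of
// `size` bytes into a Y plane and U/V planes in which every chroma sample is
// written twice, i.e. the same output layout as the assembly loop.
void splitYuyvSketch(const std::uint8_t *src, std::size_t size,
                     std::uint8_t *y, std::uint8_t *u, std::uint8_t *v)
{
    assert(src && y && u && v);
    std::size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // 32 input bytes (16 pixels) per iteration, like the vld4.8 loop.
    for (; i + 32 <= size; i += 32) {
        // Deinterleave: val[0] = Y0, val[1] = U, val[2] = Y1, val[3] = V.
        const uint8x8x4_t in = vld4_u8(src + i);

        uint8x8x2_t yy, uu, vv;
        yy.val[0] = in.val[0]; yy.val[1] = in.val[2];  // Y0, Y1 interleaved
        uu.val[0] = in.val[1]; uu.val[1] = in.val[1];  // each U twice
        vv.val[0] = in.val[3]; vv.val[1] = in.val[3];  // each V twice

        vst2_u8(y, yy);
        vst2_u8(u, uu);
        vst2_u8(v, vv);
        y += 16; u += 16; v += 16;
    }
#endif

    // Scalar path: handles the tail on NEON builds and the whole buffer elsewhere.
    for (; i + 4 <= size; i += 4) {
        y[0] = src[i];
        y[1] = src[i + 2];
        u[0] = u[1] = src[i + 1];
        v[0] = v[1] = src[i + 3];
        y += 2; u += 2; v += 2;
    }
}
```

Because the compiler allocates the registers here, it is free to unroll or schedule the loop, which may or may not change the timing on a given core.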
 
C++ Function: 
void TopOpenGL::splitYuvPlanes(unsigned char *data, int size, unsigned char *y, unsigned char *u, unsigned char *v)
{

    for ( int c = 0 ; c < ( size - 4 ) ; c+=4 ) {

        *y = *data; // Y0
        data++;
        *u = *data; // U0
        u++;
        *u = *data; // U0
        data++;
        y++;
        *y = *data; // Y1
        data++;
        *v = *data; // V0
        v++;
        *v = *data; // V0

        data++;
        y++;
        u++;
        v++;
    }

}
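
One way to probe the memory-bound hypothesis raised above is to compare the split's throughput with a plain `memcpy` of the same frame. The sketch below is an assumption-laden illustration, not a rigorous benchmark: `splitScalar`, `keepAlive`, `secondsPerCall`, and `compareWithMemcpy` are hypothetical names, `splitScalar` stands in for whichever routine is being measured, and the `keepAlive` barrier relies on GCC/Clang inline assembly.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for either split routine; swap in the real one to measure it.
void splitScalar(const std::uint8_t *src, std::size_t size,
                 std::uint8_t *y, std::uint8_t *u, std::uint8_t *v)
{
    for (std::size_t i = 0; i + 4 <= size; i += 4) {
        *y++ = src[i];       // Y0
        *u++ = src[i + 1];   // U0
        *u++ = src[i + 1];   // U0 duplicated
        *y++ = src[i + 2];   // Y1
        *v++ = src[i + 3];   // V0
        *v++ = src[i + 3];   // V0 duplicated
    }
}

// Discourages the optimizer from discarding stores to p (GCC/Clang only).
inline void keepAlive(void *p)
{
#if defined(__GNUC__)
    __asm__ volatile("" : : "r"(p) : "memory");
#else
    (void)p;
#endif
}

// Average wall-clock seconds per call of fn over reps runs, after one warm-up call.
template <typename Fn>
double secondsPerCall(Fn fn, int reps)
{
    fn();
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count() / reps;
}

struct Throughput { double splitGBps; double memcpyGBps; };

Throughput compareWithMemcpy(std::size_t frameBytes, int reps)
{
    assert(frameBytes % 4 == 0 && reps > 0);
    std::vector<std::uint8_t> src(frameBytes, 0x80), copy(frameBytes);
    std::vector<std::uint8_t> y(frameBytes / 2), u(frameBytes / 2), v(frameBytes / 2);

    const double tSplit = secondsPerCall([&] {
        splitScalar(src.data(), frameBytes, y.data(), u.data(), v.data());
        keepAlive(y.data()); keepAlive(u.data()); keepAlive(v.data());
    }, reps);
    const double tCopy = secondsPerCall([&] {
        std::memcpy(copy.data(), src.data(), frameBytes);
        keepAlive(copy.data());
    }, reps);

    // Bytes read plus bytes written per call: the split reads N and writes
    // 1.5 N, while memcpy reads N and writes N.
    return { 2.5 * frameBytes / tSplit / 1e9, 2.0 * frameBytes / tCopy / 1e9 };
}
```

For example, `compareWithMemcpy(1280 * 720 * 2, 100)` would time a frame of an assumed 1280x720 size. If both figures come out close to each other, the loop is probably limited by memory traffic and not by the instructions used to shuffle the bytes.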

Source: https://stackoverflow.com/questions/37353984

Updated: 2024-01-24 10:01

Best answer

Sure. Just start from an empty plot and then use legend as you would if there were a plot.  
plot(NULL ,xaxt='n',yaxt='n',bty='n',ylab='',xlab='', xlim=0:1, ylim=0:1)
legend("topleft", legend =c('Sugar maple', 'White ash', 'Black walnut',
    'Red oak', 'Eastern hemlock'), pch=16, pt.cex=3, cex=1.5, bty='n',
    col = c('orange', 'red', 'green', 'blue', 'purple'))
mtext("Species", at=0.2, cex=2)
