ARM NEON Optimization no faster than C++ Pointer Implementation

I have 2 functions for splitting a YUYV frame into Y/U/V independent planes. I am doing this in order to perform format conversion from a YUYV video frame to RGBA in an OpenGL ES 2.0 Shader by uploading 3 textures containing the Y/U/V data to the GPU. One of these functions is written in C++ and one is written in ARM NEON. My target is the Cortex-A15 AM57xx Sitara.

I expected the NEON code to outperform the C++ code, but they perform the same. One possibility is that I am memory I/O bound. Another possibility is that I am not great at writing NEON code.

Why do these 2 functions perform the same? Are there any glaring optimizations that could be made to either function?

NEON Function:

/// This structure is passed to ARM Assembly code
/// to split the YUV frame into separate planes for
/// OpenGL Consumption
typedef struct {
    char *input_data;
    int input_size;
    char *y_plane;
    char *u_plane;
    char *v_plane;
} yuvSplitStruct;

void TopOpenGL::splitYuvPlanes(yuvSplitStruct *yuvStruct)
{

    __asm__ volatile(

                "PUSH {r4}\n"                            /* Save callee-save registers R4 and R5 on the stack */
                "PUSH {r5}\n"                            /* r1 is the pointer to the input structure ( r0 is 'this' because c++ ) */
                "ldr r0 , [r1]\n"                        /* reuse r0 scratch register for the address of our frame input */
                "ldr r2 , [r1, #4]\n"                    /* use r2 scratch register to store the size in bytes of the YUYV frame */
                "ldr r3 , [r1, #8]\n"                    /* use r3 scratch register to store the destination Y plane address */
                "ldr r4 , [r1, #12]\n"                   /* use r4 register to store the destination U plane address */
                "ldr r5 , [r1, #16]\n"                   /* use r5 register to store the destination V plane address */
                "/* pld [r0, #192] PLD Does not seem to help */"
                    "mov r2, r2, lsr #5\n"               /* Divide number of bytes by 32 because we process 16 pixels at a time */
                    "loopYUYV:\n"
                        "vld4.8 {d0-d3}, [r0]!\n"        /* Load 8 YUYV elements from our frame into d0-d3, increment frame pointer */
                        "vst2.8 {d0,d2}, [r3]!\n"        /* Store both Y elements into destination y plane, increment plane pointer */
                        "vmov.F64 d0, d1\n"              /* Duplicate U value */
                        "vst2.8 {d0,d1}, [r4]!\n"        /* Store both U elements into destination u plane, increment plane pointer */
                        "vmov.F64 d1, d3\n"              /* Duplicate V value */
                        "vst2.8 {d1,d3}, [r5]!\n"        /* Store both V elements into destination v plane, increment plane pointer */
                        "subs r2, r2, #1\n"              /* Decrement the loop counter */
                    "bgt loopYUYV\n"                     /* Loop until entire frame is processed */
                "POP {r5}\n"                             /* Restore callee-save registers */
                "POP {r4}\n"
    );

}
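
For readers less familiar with the NEON structure loads and stores, here is a scalar C++ sketch of what one iteration of the loop above appears to do: `vld4.8` de-interleaves 32 bytes with a stride of 4, and each `vst2.8` interleaves two registers on the way out. The function name `splitBlockModel` and the local arrays `d0` to `d3` are hypothetical and only mirror the register names; this is a model for checking the data layout, not a replacement for the assembly.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model of ONE iteration of the NEON loop (hypothetical helper).
// Consumes 32 bytes of YUYV (16 pixels) and writes 16 bytes to each plane.
void splitBlockModel(const std::uint8_t *in, std::uint8_t *y,
                     std::uint8_t *u, std::uint8_t *v)
{
    // vld4.8 {d0-d3}, [r0]! : de-interleave with stride 4.
    // d0 = first Y of each pair, d1 = U, d2 = second Y of each pair, d3 = V.
    std::array<std::uint8_t, 8> d0{}, d1{}, d2{}, d3{};
    for (std::size_t i = 0; i < 8; ++i) {
        d0[i] = in[4 * i + 0];
        d1[i] = in[4 * i + 1];
        d2[i] = in[4 * i + 2];
        d3[i] = in[4 * i + 3];
    }

    // vst2.8 {a,b}, [rN]! : interleave two registers on store (a0 b0 a1 b1 ...).
    for (std::size_t i = 0; i < 8; ++i) {
        y[2 * i] = d0[i];      // vst2.8 {d0,d2}: Y0 Y1 Y2 Y3 ...
        y[2 * i + 1] = d2[i];
        u[2 * i] = d1[i];      // vst2.8 of U with its copy: each U written twice
        u[2 * i + 1] = d1[i];
        v[2 * i] = d3[i];      // vst2.8 of V with its copy: each V written twice
        v[2 * i + 1] = d3[i];
    }
}
```

Feeding this model a block whose bytes are 0 through 31 makes the layout easy to read off: the Y plane receives the even input bytes, while the U and V planes each receive every fourth byte, written twice.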

C++ Function:

void TopOpenGL::splitYuvPlanes(unsigned char *data, int size, unsigned char *y, unsigned char *u, unsigned char *v)
{

    for ( int c = 0 ; c < ( size - 4 ) ; c+=4 ) {

        *y = *data; // Y0
        data++;
        *u = *data; // U0
        u++;
        *u = *data; // U0
        data++;
        y++;
        *y = *data; // Y1
        data++;
        *v = *data; // V0
        v++;
        *v = *data; // V0

        data++;
        y++;
        u++;
        v++;
    }

}
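
One way to probe the memory-bound hypothesis raised in the question is to time a plain `memcpy` over a buffer of the same size as the frame and compare its throughput with that of either split function. The sketch below is only a rough baseline under stated assumptions: the helper name `memcpyBytesPerSecond`, the frame size, and the repetition count are all hypothetical, and a serious measurement would also control for cache state and CPU frequency scaling.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: measures plain memcpy throughput in bytes per second
// for a buffer of `bytes` bytes, averaged over `reps` copies.
double memcpyBytesPerSecond(std::size_t bytes, int reps)
{
    std::vector<std::uint8_t> src(bytes, 0x80), dst(bytes, 0);
    volatile std::uint8_t sink = 0;  // discourages the compiler from dropping the copies

    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        src[0] = static_cast<std::uint8_t>(r);  // make each copy observably different
        std::memcpy(dst.data(), src.data(), bytes);
        sink = dst[0];
    }
    const auto stop = std::chrono::steady_clock::now();
    (void)sink;

    const double seconds = std::chrono::duration<double>(stop - start).count();
    if (seconds <= 0.0) {
        return 0.0;  // timer resolution too coarse for this workload
    }
    return static_cast<double>(bytes) * reps / seconds;
}
```

For example, `memcpyBytesPerSecond(1280 * 720 * 2, 20)` would model a 720p YUYV frame (an assumed size, not one given in the question). Note that the split reads `size` bytes and writes `size / 2` bytes to each of three planes, so it moves somewhat more data than a same-size `memcpy`. If both split functions land near the scaled `memcpy` figure, that would be consistent with the loop being limited by memory bandwidth rather than by instruction throughput.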

Source: https://stackoverflow.com/questions/37353984
Updated: 2024-01-24 10:01

Accepted Answer

Sure. Just start from an empty plot and then use legend as you would if there were a plot.

plot(NULL ,xaxt='n',yaxt='n',bty='n',ylab='',xlab='', xlim=0:1, ylim=0:1)
legend("topleft", legend =c('Sugar maple', 'White ash', 'Black walnut',
    'Red oak', 'Eastern hemlock'), pch=16, pt.cex=3, cex=1.5, bty='n',
    col = c('orange', 'red', 'green', 'blue', 'purple'))
mtext("Species", at=0.2, cex=2)

Legend without a plot
