ARM NEON Optimization no faster than C++ Pointer Implementation
I have 2 functions for splitting a YUYV frame into Y/U/V independent planes. I am doing this in order to perform format conversion from a YUYV video frame to RGBA in an OpenGL ES 2.0 Shader by uploading 3 textures containing the Y/U/V data to the GPU. One of these functions is written in C++ and one is written in ARM NEON. My target is the Cortex-A15 AM57xx Sitara.
I expected the NEON code to outperform the C++ code, but they perform the same. One possibility is that I am memory I/O bound. Another possibility is that I am not great at writing NEON code.
Why do these 2 functions perform the same? Are there any glaring optimizations that could be made to either function?
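One way to probe the memory-bound hypothesis is to time a plain `memcpy` over a buffer the size of one frame and compare its throughput with that of the two split functions. If all three land in the same range, memory traffic rather than arithmetic is the likely limit. The following is only a minimal sketch: the helper name `memcpyBytesPerSecond` and its parameters are hypothetical, and `memcpy` writes one byte per byte read while the split writes 1.5, so the comparison is approximate.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper (not from the original post): measures how many bytes
// per second std::memcpy moves for a buffer of frameBytes bytes.
double memcpyBytesPerSecond(std::size_t frameBytes, int repeats)
{
    assert(repeats > 0);
    if (frameBytes == 0) {
        return 0.0;
    }

    std::vector<unsigned char> src(frameBytes, 0x80);
    std::vector<unsigned char> dst(frameBytes);
    volatile unsigned char sink = 0;

    // Warm-up copy so page faults are not counted in the timed loop.
    std::memcpy(dst.data(), src.data(), frameBytes);

    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        std::memcpy(dst.data(), src.data(), frameBytes);
        // Read one copied byte so the copy has an observable effect.
        sink = dst[static_cast<std::size_t>(r) % frameBytes];
    }
    const auto stop = std::chrono::steady_clock::now();
    (void)sink;

    const double seconds = std::chrono::duration<double>(stop - start).count();
    const double bytes = static_cast<double>(frameBytes) * repeats;
    // Guard against a timer too coarse to resolve the loop.
    return seconds > 0.0 ? bytes / seconds : 0.0;
}
```

Timing `splitYuvPlanes` the same way, with the same frame size and repeat count, gives figures that can be set side by side with this baseline.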
Neon Function:
/// This structure is passed to ARM Assembly code
/// to split the YUV frame into separate planes for
/// OpenGL Consumption
typedef struct {
    char *input_data;
    int input_size;
    char *y_plane;
    char *u_plane;
    char *v_plane;
} yuvSplitStruct;

void TopOpenGL::splitYuvPlanes(yuvSplitStruct *yuvStruct)
{
    __asm__ volatile(
        "PUSH {r4}\n"              /* Save callee-save registers R4 and R5 on the stack */
        "PUSH {r5}\n"              /* r1 is the pointer to the input structure ( r0 is 'this' because c++ ) */
        "ldr r0 , [r1]\n"          /* reuse r0 scratch register for the address of our frame input */
        "ldr r2 , [r1, #4]\n"      /* use r2 scratch register to store the size in bytes of the YUYV frame */
        "ldr r3 , [r1, #8]\n"      /* use r3 scratch register to store the destination Y plane address */
        "ldr r4 , [r1, #12]\n"     /* use r4 register to store the destination U plane address */
        "ldr r5 , [r1, #16]\n"     /* use r5 register to store the destination V plane address */
        "/* pld [r0, #192] PLD Does not seem to help */"
        "mov r2, r2, lsr #5\n"     /* Divide number of bytes by 32 because we process 16 pixels at a time */
        "loopYUYV:\n"
        "vld4.8 {d0-d3}, [r0]!\n"  /* Load 8 YUYV elements from our frame into d0-d3, increment frame pointer */
        "vst2.8 {d0,d2}, [r3]!\n"  /* Store both Y elements into destination y plane, increment plane pointer */
        "vmov.F64 d0, d1\n"        /* Duplicate U value */
        "vst2.8 {d0,d1}, [r4]!\n"  /* Store both U elements into destination u plane, increment plane pointer */
        "vmov.F64 d1, d3\n"        /* Duplicate V value */
        "vst2.8 {d1,d3}, [r5]!\n"  /* Store both V elements into destination v plane, increment plane pointer */
        "subs r2, r2, #1\n"        /* Decrement the loop counter */
        "bgt loopYUYV\n"           /* Loop until entire frame is processed */
        "POP {r5}\n"               /* Restore callee-save registers */
        "POP {r4}\n"
    );
}
C++ Function:
void TopOpenGL::splitYuvPlanes(unsigned char *data, int size, unsigned char *y, unsigned char *u, unsigned char *v)
{
    for ( int c = 0 ; c < ( size - 4 ) ; c+=4 )
    {
        *y = *data; // Y0
        data++;
        *u = *data; // U0
        u++;
        *u = *data; // U0
        data++;
        y++;
        *y = *data; // Y1
        data++;
        *v = *data; // V0
        v++;
        *v = *data; // V0
        data++;
        y++;
        u++;
        v++;
    }
}
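For comparison, the same split can be sketched with NEON intrinsics from `<arm_neon.h>` instead of inline assembly, which leaves register allocation to the compiler rather than assuming which register holds the argument. This is an illustrative sketch, not the poster's code: the free-function name `splitYuyvPlanes` is hypothetical, the NEON path is compiled only when the compiler defines `__ARM_NEON`, and a scalar loop handles whatever the vector loop leaves over. It assumes `size` is a multiple of 4 and that each output plane holds `size / 2` bytes. Unlike the posted C++ loop, whose condition `c < (size - 4)` stops one 4-byte group early, the scalar loop here processes the final group too.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical free function: splits a YUYV buffer into Y, U and V planes,
// duplicating each U and V sample so all three planes have size / 2 bytes.
void splitYuyvPlanes(const std::uint8_t *src, std::size_t size,
                     std::uint8_t *y, std::uint8_t *u, std::uint8_t *v)
{
    assert(size % 4 == 0); // whole Y0 U Y1 V groups only

    std::size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // 32 input bytes (16 pixels) per iteration, de-interleaved on load:
    // val[0] = Y0, val[1] = U, val[2] = Y1, val[3] = V.
    for (; i + 32 <= size; i += 32) {
        const uint8x8x4_t px = vld4_u8(src + i);

        uint8x8x2_t yy;
        yy.val[0] = px.val[0];
        yy.val[1] = px.val[2];

        uint8x8x2_t uu;
        uu.val[0] = px.val[1];
        uu.val[1] = px.val[1];

        uint8x8x2_t vv;
        vv.val[0] = px.val[3];
        vv.val[1] = px.val[3];

        // Interleaving stores write 16 bytes to each plane.
        vst2_u8(y + i / 2, yy);
        vst2_u8(u + i / 2, uu);
        vst2_u8(v + i / 2, vv);
    }
#endif
    // Scalar path: the whole buffer without NEON, or the remainder with it.
    for (; i + 4 <= size; i += 4) {
        const std::size_t o = i / 2;
        y[o]     = src[i];
        y[o + 1] = src[i + 2];
        u[o]     = src[i + 1];
        u[o + 1] = src[i + 1];
        v[o]     = src[i + 3];
        v[o + 1] = src[i + 3];
    }
}
```

Because the loop does no arithmetic beyond loads and stores, a version like this may still run no faster than the scalar code if memory bandwidth is the limit.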
Original: https://stackoverflow.com/questions/37353984
Best answer
Sure. Just start from an empty plot and then use legend as you would if there were a plot.
plot(NULL, xaxt='n', yaxt='n', bty='n', ylab='', xlab='', xlim=0:1, ylim=0:1)
legend("topleft",
       legend = c('Sugar maple', 'White ash', 'Black walnut', 'Red oak', 'Eastern hemlock'),
       pch=16, pt.cex=3, cex=1.5, bty='n',
       col = c('orange', 'red', 'green', 'blue', 'purple'))
mtext("Species", at=0.2, cex=2)