Discussion:
[OpenJDK 2D-Dev] sun.java2D.Pisces renderer Performance and Memory enhancements
Laurent Bourgès
2013-04-09 13:02:17 UTC
Dear Java2D members,

Could someone review the following webrev concerning Java2D Pisces, which
enhances its performance and reduces its memory footprint (RendererContext
stored in a thread local or a concurrent queue):
http://jmmc.fr/~bourgesl/share/java2d-pisces/webrev-1/

FYI I fixed file headers in this patch and signed my OCA 3 weeks ago.
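For readers unfamiliar with the idea, the RendererContext reuse strategy (thread local vs. concurrent queue) can be sketched roughly as follows; all class and method names here are illustrative, not the actual patch code:

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of per-thread / pooled rendering-context reuse.
final class RendererContext {
    // large scratch arrays that are expensive to re-allocate per paint
    final float[] edgeBuffer = new float[4096];
    int edgeCount;

    void init() {
        edgeCount = 0; // cheap reset instead of a fresh allocation
    }
}

final class RendererContextPool {
    // Strategy 1: one context per thread; no contention at all, but
    // memory grows with the number of live rendering threads.
    private static final ThreadLocal<RendererContext> TL =
            ThreadLocal.withInitial(RendererContext::new);

    // Strategy 2: a shared lock-free pool; memory is bounded by the
    // number of threads rendering concurrently.
    private static final ConcurrentLinkedQueue<RendererContext> POOL =
            new ConcurrentLinkedQueue<>();

    static RendererContext acquireThreadLocal() {
        final RendererContext ctx = TL.get();
        ctx.init();
        return ctx;
    }

    static RendererContext acquirePooled() {
        RendererContext ctx = POOL.poll();
        if (ctx == null) {
            ctx = new RendererContext(); // pool empty: allocate
        }
        ctx.init();
        return ctx;
    }

    static void releasePooled(final RendererContext ctx) {
        POOL.offer(ctx); // return to pool for the next acquire
    }
}
```

Either way, the expensive arrays survive across paint operations instead of being re-allocated and re-zeroed for every shape.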

Remaining work:
- cleanup (comments ...)
- statistics to perform auto-tuning
- cache / memory cleanup (SoftReference ?): use hints or System properties
to adapt it to use cases
- another problem: fix clipping performance in Dasher / Stroker for
segments out of bounds

Could somebody support me? I.e., help me work on these tasks, or just
discuss the Pisces algorithm / implementation?

Best regards,
Laurent Bourgès
Andrea Aime
2013-04-14 15:37:07 UTC
On Tue, Apr 9, 2013 at 3:02 PM, Laurent Bourgès wrote:
Post by Laurent Bourgès
Could somebody support me ? ie help me working on these tasks or just to
discuss on Pisces algorithm / implementation ?
Hi,
I would like to express my support for this patch.
Given that micro-benchmarks have already been run, I took the patch for a
spin in a large, real-world benchmark instead: the OSGeo WMS Shootout 2010
benchmark, whose results you can see here:
http://www.slideshare.net/gatewaygeomatics.com/wms-performance-shootout-2010

The presentation is long, but suffice it to say that all Java-based
implementations took quite a beating due to the poor scalability of Ductus
with antialiased rendering of vector data (for an executive summary, just
look at slides 27 and 66, where GeoServer, Oracle MapViewer and
Constellation SDI were the Java-based ones).

I took the same tests and ran them again on my machine (different hardware
than in the shootout, so don't try to compare the absolute values), using
Oracle JDK 1.7.0_17, OpenJDK 8 (a checkout a couple of weeks old), and the
same OpenJDK 8 with Laurent's patches applied.
Here are the results: throughput (in maps generated per second) with the
load generator (JMeter) going up from 1 client to 64 concurrent clients:

Threads   JDK 1.7.0_17   OpenJDK 8, vanilla   OpenJDK 8 + Pisces improvements   Pisces gain, %
1         13,97          12,43                13,03                             4,75%
2         22,08          20,60                20,77                             0,81%
4         34,36          33,15                33,36                             0,62%
8         39,39          40,51                41,71                             2,96%
16        40,61          44,57                46,98                             5,39%
32        41,41          44,73                48,16                             7,66%
64        37,09          42,19                45,28                             7,32%

Well, first of all, congratulations to the JDK developers; I don't know
what changed in JDK 8, but GeoServer seems to like it quite a bit :-).
That said, Laurent's patch also gives a visible boost, especially when
several concurrent clients are
asking for the maps.

Mind, as I said, this is no micro-benchmark: it is a real application load
doing a lot of I/O (from the operating system file cache), plus other
processing before the data reaches the rendering pipeline, and it then has
to PNG-encode the output BufferedImage (PNG encoding being rather
expensive), so getting this speedup from just a change in the rendering
pipeline is significant.

Long story short... personally, I'd be very happy if this patch were
integrated into Java 8 :-)

Cheers
Andrea
--
==
GeoServer training in Milan, 6th & 7th June 2013! Visit
http://geoserver.geo-solutions.it for more information.
==

Ing. Andrea Aime
@geowolf
Technical Lead

GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it

-------------------------------------------------------
Laurent Bourgès
2013-04-15 10:01:54 UTC
Jim, Andrea,

I updated MapBench to provide test statistics (avg, median, stddev, rms,
med + stddev, min, max) and CSV output (tab separator):
http://jmmc.fr/~bourgesl/share/java2d-pisces/MapBench/


Here are the results (OpenJDK8 Ref vs Patched):
http://jmmc.fr/~bourgesl/share/java2d-pisces/ref_det.log
http://jmmc.fr/~bourgesl/share/java2d-pisces/patch_det.log

test              threads  ops  Tavg     Tmed     stdDev    rms      Med+Stddev  min      max
boulder_17        1        20   180,22%  181,08%  1186,01%  181,17%  185,92%     176,35%  170,36%
boulder_17        2        20   183,15%  183,80%  162,68%   183,78%  183,17%     174,01%  169,89%
boulder_17        4        20   216,62%  218,03%  349,31%   218,87%  226,68%     172,15%  167,54%
shp_alllayers_47  1        20   243,90%  244,86%  537,92%   244,87%  246,39%     240,64%  231,00%
shp_alllayers_47  2        20   286,42%  287,07%  294,87%   287,07%  287,23%     277,19%  272,23%
shp_alllayers_47  4        20   303,08%  302,15%  168,19%   301,90%  295,90%     462,70%  282,41%

PATCH:
test              threads  ops  Tavg      Tmed      stdDev   rms       Med+Stddev  min       max
boulder_17        1        20   110,196   109,244   0,529    109,246   109,773     108,197   129,327
boulder_17        2        40   127,916   127,363   3,899    127,423   131,262     125,262   151,561
boulder_17        4        80   213,085   212,268   14,988   212,796   227,256     155,512   334,407
shp_alllayers_47  1        20   1139,452  1134,858  5,971    1134,873  1140,829    1125,859  1235,746
shp_alllayers_47  2        40   1306,889  1304,598  28,157   1304,902  1332,755    1280,49   1420,351
shp_alllayers_47  4        80   2296,487  2303,81   112,816  2306,57   2416,626    1390,31   2631,455

REF:
test              threads  ops  Tavg      Tmed      stdDev  rms       Med+Stddev  min       max
boulder_17        1        20   198,591   197,816   6,274   197,916   204,091     190,805   220,319
boulder_17        2        40   234,272   234,09    6,343   234,176   240,433     217,967   257,485
boulder_17        4        80   461,579   462,8     52,354  465,751   515,153     267,712   560,254
shp_alllayers_47  1        20   2779,133  2778,823  32,119  2779,009  2810,943    2709,285  2854,557
shp_alllayers_47  2        40   3743,255  3745,111  83,027  3746,031  3828,138    3549,364  3866,612
shp_alllayers_47  4        80   6960,23   6960,948  189,75  6963,533  7150,698    6432,945  7431,541
Linux 64-bit, server VM
JVM: -Xms128m -Xmx128m (low mem)

Laurent
Jim Graham
2013-04-16 21:16:47 UTC
If I'm reading this correctly, your patch is faster even for a single
thread? That's great news.

One of the problems we've had with replacing Ductus is that it has been
faster in a single thread situation than the open source versions we've
created. One of its drawbacks is that it had been designed to take
advantage of some AA-accelerating hardware that never came to be. With
the accelerator it would have been insanely fast, but hardware went in a
different direction. The problem was that this early design goal caused
the entire library to be built around an abstraction layer that allowed
for a single "tile producer" internally (because there would be only one
- insanely fast - hardware chip available) and the software version of
the abstraction layer thus had a lot of native "static" data structures
(there's only one of me, right?) that prevented MT access. It was
probably solvable, but I'd be happier if someone could come up with a
faster rasterizer, imagining that there must have been some sort of
advancements in the nearly 2 decades since the original was written.

If I'm misinterpreting and single thread is still slower than Ductus (or
if it is still slower on some other platforms), then <frowny face>.

Also, this is with the Java version, right? We got a decent 2x speedup
in FX by porting the version of Open Pisces that you started with to C
code (all except on Linux where we couldn't find the right gcc options
to make it faster than Hotspot). So, we have yet to investigate a
native version in the JDK which might provide even more gains...

...jim
Laurent Bourgès
2013-04-17 20:14:13 UTC
Jim,

thanks for taking an interest in my efforts!
As I got almost no feedback, I felt quite disappointed and was starting to
think that improving Pisces was not important ...

Here are the Ductus results and comparison (OpenOffice format):
http://jmmc.fr/~bourgesl/share/java2d-pisces/ductus_det.log
http://jmmc.fr/~bourgesl/share/java2d-pisces/compareRef_Patch.ods

test              threads  ops  Tavg     Tmed     stdDev   rms      Med+Stddev  min      max
boulder_17        1        20   73,92%   69,34%   27,98%   69,34%   *69,14%*    69,81%   146,89%
boulder_17        2        20   110,86%  110,48%  613,80%  112,01%  *125,43%*   94,71%   136,35%
boulder_17        4        20   135,28%  135,86%  226,61%  136,46%  *141,85%*   125,14%  111,32%
shp_alllayers_47  1        20   82,25%   82,49%   47,50%   82,48%   *82,30%*    82,64%   78,08%
shp_alllayers_47  2        20   115,87%  115,67%  315,45%  115,85%  *119,89%*   109,33%  128,71%
shp_alllayers_47  4        20   218,59%  218,76%  169,38%  218,65%  *216,45%*   220,17%  206,17%
Ductus vs Patch (by thread count):
  1: 80,74%    2: 120,69%   4: 205,92%
Reminder, Ref vs Patch:
  1: 237,71%   2: 271,68%   4: 286,15%

Note: I only have 2 cores + HT on my laptop, so I did not test with more
threads (up to 64 like Andrea).
Post by Jim Graham
If I'm reading this correctly, your patch is faster even for a single
thread? That's great news.
Not yet, but Ductus is now only 20% faster than my patch single-threaded,
and 20% to 2x slower with 2 and 4 threads.
I still hope to beat it by applying a few more optimizations:
- Renderer.ScanLine iterator / Renderer.endRendering can be improved
- avoid a few more array zero fills (resets) if possible
- add statistics to better set initial array sizes ...
- use SunGraphics2D to hold a hard reference (temporarily) on the
RendererContext (to avoid many ThreadLocal.get() calls)
- cache eviction (WeakReference or SoftReference) ?
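As a rough illustration of the SoftReference idea in the last bullet (hypothetical names, not the actual patch code), the cached array can be reclaimed by the GC under memory pressure and is lazily rebuilt on the next access:

```java
import java.lang.ref.SoftReference;

// Hypothetical sketch: a growable scratch-array cache with SoftReference
// eviction. The GC may clear the reference under memory pressure, in
// which case the array is simply re-allocated on the next request.
final class ArrayCache {
    private SoftReference<int[]> cached = new SoftReference<>(null);

    int[] getIntArray(final int minSize) {
        int[] arr = cached.get();
        if (arr == null || arr.length < minSize) {
            arr = new int[minSize];            // evicted or too small
            cached = new SoftReference<>(arr);
        }
        return arr; // reused arrays keep stale contents; callers must reset
    }
}
```

A WeakReference would behave the same way structurally, but would be cleared at the next GC cycle rather than only under memory pressure, so it trades lower footprint for a lower cache hit rate.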

Why not use divide and conquer (a thread pool) to boost single-thread
rendering when the machine has more CPU cores?
It would also be helpful if the AATileGenerator had access to SunGraphics2D
to get rendering hints or share variables (cache ...).
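The divide-and-conquer question above could, in principle, take the shape of a fork/join decomposition over horizontal bands of the destination surface. Pisces has no such hook today; every name below is purely illustrative:

```java
import java.util.concurrent.RecursiveAction;

// Hypothetical sketch: split the surface into horizontal bands and
// rasterize them in parallel, recursing until bands are small enough.
final class BandRasterizer extends RecursiveAction {
    static final int MIN_BAND = 64;  // stop splitting below this height

    private final int y0, y1;
    private final int[] coverage;    // stand-in for per-scanline output

    BandRasterizer(final int y0, final int y1, final int[] coverage) {
        this.y0 = y0;
        this.y1 = y1;
        this.coverage = coverage;
    }

    @Override
    protected void compute() {
        if (y1 - y0 <= MIN_BAND) {
            for (int y = y0; y < y1; y++) {
                coverage[y]++;       // stand-in for real scanline work
            }
            return;
        }
        final int mid = (y0 + y1) >>> 1;
        invokeAll(new BandRasterizer(y0, mid, coverage),
                  new BandRasterizer(mid, y1, coverage));
    }
}
```

Each band writes to a disjoint slice of the output, so no synchronization is needed inside compute(); whether the fork/join overhead pays off for typical shape sizes is exactly the open question.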

For the moment, I did not modify the algorithms themselves, but I will do
so to enhance clipping (dasher segments / stroker) ...
Post by Jim Graham
If I'm misinterpreting and single thread is still slower than Ductus (or
if it is still slower on some other platforms), then <frowny face>.
Not yet: slower than Ductus by 20%, but faster than the current Pisces by
2x!
Post by Jim Graham
Also, this is with the Java version, right?
Yes, my patch is pure Java, provided as a webrev:
http://jmmc.fr/~bourgesl/share/java2d-pisces/webrev-1/
Post by Jim Graham
We got a decent 2x speedup in FX by porting the version of Open Pisces
that you started with to C code (all except on Linux where we couldn't find
the right gcc options to make it faster than Hotspot). So, we have yet to
investigate a native version in the JDK which might provide even more
gains...
Personally, I prefer working on Java code, as HotSpot performs so many
optimizations for free, there are no pointers to deal with, and, more
importantly, there are concurrent primitives (thread locals, collections)!

Laurent
Richard Bair
2013-04-17 20:26:11 UTC
Post by Jim Graham
Also, this is with the Java version, right?
http://jmmc.fr/~bourgesl/share/java2d-pisces/webrev-1/
We got a decent 2x speedup in FX by porting the version of Open Pisces that you started with to C code (all except on Linux where we couldn't find the right gcc options to make it faster than Hotspot). So, we have yet to investigate a native version in the JDK which might provide even more gains…
Oleg did more analysis on this, and it appears the reason HotSpot on Linux was faster than the C version is that on Linux it uses the server compiler (C2), whereas on Windows / Mac it is the client compiler (C1). Possibly using -server on Windows / Mac would also have HotSpot beating the C version, although that hasn't been tested.

Richard
Phil Race
2013-04-17 21:06:17 UTC
Post by Richard Bair
whereas on Windows / Mac it is client compiler (c1).
For Mac we only have a 64-bit VM, which AFAIK should be C2 as well, yet in
that case native was presumably still faster. So it's also a matter of
factoring in how good the code generated by the C compiler is.

-phil.
Clemens Eisserer
2013-04-17 21:15:53 UTC
Hi Laurent,

Post by Laurent Bourgès
thanks for having some interest for my efforts !
As I got almost no feedback, I felt quite disappointed and was thinking
that improving pisces was not important ...
Glad to see work is ongoing to improve pisces's performance :)

I had a look at the patch just out of curiosity (I don't have reviewer
status), but to be honest I had trouble finding the relevant parts.
Having not followed the discussion that closely, it was almost impossible
for me to extract the real modifications from the boilerplate/refactoring,
as your patch weighs in at almost 3000 modified lines.
I am just an interested observer without any official status, yet
personally I would prefer smaller pieces with a clear description/title.
However, I wouldn't want to cause you additional work, and it's just a
single opinion.

Thanks for working on pisces!

Regards, Clemens
Laurent Bourgès
2013-04-18 04:52:54 UTC
Hi Clemens,
Post by Clemens Eisserer
Glad to see work is ongoing to improve pisces's performance :)
Thanks a lot: I am working on Pisces in my spare time, and encouragement
is very important for my motivation.
Post by Clemens Eisserer
I had a look at the patch just to be curious (I don't have reviewer
status), but to be honest I had troubles finding the relevant parts.

I agree; I modified the code to help me understand it: @Override, grouped
fields, constants first, debug logs ...

I can revert most of this boilerplate ... during cleanup.

I sent the patch as a webrev to let other people evaluate its performance
using their own platform, workload, benchmarks ...
Post by Clemens Eisserer
Having not followed the discussion that closely, it was almost impossible
for me to extract the real modifications from boilerplate/refactoring as
your patch weights almost 3000 modified lines.

I looked at the webrev and I admit I can discard many of the line changes.
As I use NetBeans, it modified the code so easily... never mind.
Post by Clemens Eisserer
I am just an intrested observer without any official state, yet
personally I would prefer smaller pieces with clear description/title.
Post by Clemens Eisserer
However, I wouldn't want to cause you additional work and it's just a
single opinion.
OK. I mainly modified memory handling: I use a renderer cache to reuse
arrays and Pisces class instances (stored in a thread local or a
concurrent queue) to avoid too many allocations and resizings.
Post by Clemens Eisserer
Thanks for working on pisces!
Thanks for your feedback too.

Laurent
Post by Clemens Eisserer
Regards, Clemens
Jim Graham
2013-04-22 21:05:21 UTC
One thing about modifying the iterators. I originally commented that I
thought an object there was a bit much but it helped Denis, who was
doing all the work after all, keep the code organized in his mind. I'm
not sure what kind of measurements he made, but he didn't feel that they
were hurting performance at the time. If you are finding that the
allocation of the iterators is costing us then I'd just as much like to
see that paradigm go away, though your fix to allocate them once and
then re-init them does save the memory costs.

Other renderers I've worked on and written have done essentially the same
work inline rather than farming it out to another object.
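The allocate-once-and-re-init style described above can be sketched as follows (illustrative names, not the actual Pisces classes):

```java
// Hypothetical sketch of a reusable iterator: allocated once per
// renderer, then re-initialized per shape via init() instead of being
// re-allocated, which is the memory saving referred to above.
final class ScanlineIterator {
    private int y;
    private int yMax;

    ScanlineIterator init(final int yStart, final int yEnd) {
        this.y = yStart;   // cheap reset for reuse
        this.yMax = yEnd;
        return this;
    }

    boolean hasNext() { return y < yMax; }

    int next() { return y++; }
}

final class DemoRenderer {
    // one iterator per renderer instance, reused across shapes
    private final ScanlineIterator scanlines = new ScanlineIterator();

    int countScanlines(final int yStart, final int yEnd) {
        final ScanlineIterator it = scanlines.init(yStart, yEnd);
        int n = 0;
        while (it.hasNext()) {
            it.next();
            n++;
        }
        return n;
    }
}
```

The inline alternative would drop the iterator object entirely and loop over the scanline state directly inside the renderer, trading a little code organization for one less indirection and no per-object state to manage.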

I also think it's a good suggestion to keep fixes on code like this more
focused on isolating the fix itself rather than reorganizing the code to
a personal style to fix a few issues. So, as far as the code
organization changes I agree with Clemens there. When it comes down to
"a quick fix that doesn't touch many lines of code" like your init()
solution vs. "rewriting that part of the code to no longer need the
operations that were causing problems" then it gets grayer. I would
have probably eliminated the iterators altogether because I would view
it as reducing the complexity of the code, but your solution would
probably touch fewer lines of code (if you factor out the
reorganization). I at least wanted to throw that idea out as a
suggestion...
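The "allocate once and re-init" idea discussed here can be sketched roughly as follows; the class and method names are hypothetical, not the actual Pisces iterator.

```java
// Sketch of the allocate-once / re-init pattern: instead of creating a new
// iterator object per shape, one instance is kept and its state reset.
public class ReusableIterator {
    private float[] coords = new float[0];
    private int index;

    // Re-initialize in place: no new object is allocated on reuse.
    public ReusableIterator init(float[] c) {
        this.coords = c;
        this.index = 0;   // reset the cursor instead of reallocating
        return this;
    }

    public boolean hasNext() { return index < coords.length; }
    public float next()      { return coords[index++]; }
}
```

The alternative Jim describes, inlining the iteration into the caller, removes even this one object at the cost of a larger method body.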

...jim
Post by Laurent Bourgès
Hi clemens
Post by Clemens Eisserer
Hi Laurent,
Post by Laurent Bourgès
thanks for having some interest for my efforts !
As I got almost no feedback, I felt quite disappointed and was thinking that improving pisces was not important ...
Glad to see work is ongoing to improve pisces's performance :)
Thanks a lot: I am working on pisces during my spare time and
congratulations is very important for my motivation.
Post by Clemens Eisserer
I had a look at the patch just to be curious (I don't have reviewer status), but to be honest I had troubles finding the relevant parts.
grouped fields, constant first, debug logs ...
I can revert most of these boilerplates ... during cleanup.
I sent the patch as webrev to let other people evaluate its performance
using their own platform, work load, benchmarks ...
Post by Clemens Eisserer
Having not followed the discussion that closely, it was almost impossible for me to extract the real modifications from boilerplate/refactoring as your patch weights almost 3000 modified lines.
I looked at the webrev and I advocate I can discard many line changes.
As I use netbeans, it modified the code so easily... nevermind.
Post by Clemens Eisserer
I am just an intrested observer without any official state, yet personally I would prefer smaller pieces with clear description/title.
However, I wouldn't want to cause you additional work and it's just a single opinion.
Ok. I mainly modified memory handling: use a renderer cache to reuse
arrays and pisces class instances to avoid too much allocations and
resizing ... stored in thread local or concurrent queue.
Post by Clemens Eisserer
Thanks for working on pisces!
Thanks for your feedback too.
Laurent
Post by Clemens Eisserer
Regards, Clemens
Laurent Bourgès
2013-04-23 21:58:28 UTC
Permalink
Jim,

I am preparing an updated patch that will contain less syntax changes and
boiler plates. Sorry again.

I have optimized a few array cleanups and it performs as fast as Ductus on the
biggest map test: equal on average!

I will send results and comparisons tomorrow.

Hint: alpha array cleaning in the renderer's endRendering can be optimized by
filling only the dirty part, as I already do in the given patch. This small
change can be applied easily...
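The "fill only the dirty part" idea amounts to tracking the touched index range and zeroing just that slice. A minimal sketch, with hypothetical names:

```java
import java.util.Arrays;

// Sketch of clearing only the dirty region of a reused alpha array:
// Arrays.fill on a sub-range costs O(to - from) instead of O(alpha.length).
public class DirtyClear {
    public static void clearDirty(int[] alpha, int from, int to) {
        Arrays.fill(alpha, from, to, 0);
    }
}
```

The caller would record [from, to) while writing coverage values, then clear exactly that span before the next shape.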
Post by Jim Graham
One thing about modifying the iterators. I originally commented that I
thought an object there was a bit much but it helped Denis, who was doing
all the work after all, keep the code organized in his mind. I'm not sure
what kind of measurements he made, but he didn't feel that they were
hurting performance at the time. If you are finding that the allocation of
the iterators is costing us then I'd just as much like to see that paradigm
go away, though your fix to allocate them once and then re-init them does
save the memory costs.
Post by Jim Graham
Other renderers I've worked on an written have done essentially the same
work inline rather than farming it out to another object.
Post by Jim Graham
I also think it's a good suggestion to keep fixes on code like this more
focused on isolating the fix itself rather than reorganizing the code to a
personal style to fix a few issues. So, as far as the code organization
changes I agree with Clemens there. When it comes down to "a quick fix
that doesn't touch many lines of code" like your init() solution vs.
"rewriting that part of the code to no longer need the operations that were
causing problems" then it gets grayer. I would have probably eliminated
the iterators altogether because I would view it as reducing the complexity
of the code, but your solution would probably touch fewer lines of code (if
you factor out the reorganization). I at least wanted to throw that idea
out as a suggestion..

Agreed. I will try to provide smaller changes but it is difficult to
isolate efficient changes.

Last idea: I think reusing arrays is efficient because they stay resident in
the CPU cache... and it means less GC.

I would like to try performing the rendering / cache / tile generation in
chunks corresponding to 32 scan lines, to have / reuse a smaller rowAARLE
array...

Finally I looked a bit at clipping shapes and I think it should be done
first in the Stroker segment processing.

Regards,
Laurent
Post by Jim Graham
...jim
Post by Laurent Bourgès
Hi clemens
Post by Clemens Eisserer
Hi Laurent,
Post by Laurent Bourgès
thanks for having some interest for my efforts !
As I got almost no feedback, I felt quite disappointed and was
thinking that improving pisces was not important ...
Post by Jim Graham
Post by Laurent Bourgès
Post by Clemens Eisserer
Glad to see work is ongoing to improve pisces's performance :)
Thanks a lot: I am working on pisces during my spare time and
congratulations is very important for my motivation.
Post by Clemens Eisserer
I had a look at the patch just to be curious (I don't have reviewer
status), but to be honest I had troubles finding the relevant parts.
Post by Jim Graham
Post by Laurent Bourgès
grouped fields, constant first, debug logs ...
I can revert most of these boilerplates ... during cleanup.
I sent the patch as webrev to let other people evaluate its performance
using their own platform, work load, benchmarks ...
Post by Clemens Eisserer
Having not followed the discussion that closely, it was almost
impossible for me to extract the real modifications from
boilerplate/refactoring as your patch weights almost 3000 modified lines.
Post by Jim Graham
Post by Laurent Bourgès
I looked at the webrev and I advocate I can discard many line changes.
As I use netbeans, it modified the code so easily... nevermind.
Post by Clemens Eisserer
I am just an intrested observer without any official state, yet
personally I would prefer smaller pieces with clear description/title.
Post by Jim Graham
Post by Laurent Bourgès
Post by Clemens Eisserer
However, I wouldn't want to cause you additional work and it's just a single opinion.
Ok. I mainly modified memory handling: use a renderer cache to reuse
arrays and pisces class instances to avoid too much allocations and
resizing ... stored in thread local or concurrent queue.
Post by Clemens Eisserer
Thanks for working on pisces!
Thanks for your feedback too.
Laurent
Post by Clemens Eisserer
Regards, Clemens
Jim Graham
2013-04-24 01:24:45 UTC
Permalink
Hi Laurent,

Originally the version that was used in embedded used RLE because it
stored the results in the shape itself. On desktop I never found that
to be a necessary optimization especially because it actually wastes
memory for no gain during animations, but that was why they used RLE as
a storage format. Would it speed up the code to use a different storage
format?

Also, in the version we use in JavaFX we removed the tiling altogether
and return one alpha array for the entire rasterization. We might
consider doing that for this code as well if it allows us to get rid of
Ductus - it was a Ductus design constraint that forced the tiling (it
was again based on the expected size of the hardware AA engine)...

...jim
Post by Laurent Bourgès
Jim,
I am preparing an updated patch that will contain less syntax changes
and boiler plates. Sorry again.
I have optimized few array cleaning and it perform as fast as ductus on
the biggest map test : in average equals !!
i will send results and comparisons tomorrow.
Hint: alpha array cleaning in renderer endRendering can be optimized by
filling only dirty part as i already do in the given patch. This small
change can be applied easily...
One thing about modifying the iterators. I originally commented that I thought an object there was a bit much but it helped Denis, who was doing all the work after all, keep the code organized in his mind. I'm not sure what kind of measurements he made, but he didn't feel that they were hurting performance at the time. If you are finding that the allocation of the iterators is costing us then I'd just as much like to see that paradigm go away, though your fix to allocate them once and then re-init them does save the memory costs.
Other renderers I've worked on an written have done essentially the same work inline rather than farming it out to another object.
I also think it's a good suggestion to keep fixes on code like this more focused on isolating the fix itself rather than reorganizing the code to a personal style to fix a few issues. So, as far as the code organization changes I agree with Clemens there. When it comes down to "a quick fix that doesn't touch many lines of code" like your init() solution vs. "rewriting that part of the code to no longer need the operations that were causing problems" then it gets grayer. I would have probably eliminated the iterators altogether because I would view it as reducing the complexity of the code, but your solution would probably touch fewer lines of code (if you factor out the reorganization). I at least wanted to throw that idea out as a suggestion..
Agreed. I will try to provide smaller changes but it is difficult to
isolate efficient changes.
Last idea: I think reusing arrays is efficient because they reside in
the cpu cache... and less gc.
I would like to try perform the rendering / cache / tile generation by
chunks corresponding to 32 scan lines to have / reuse a smaller rowAARLE
array...
Finally I looked a bit at clipping shapes and I think it should be done
first in the Stroker segment processing.
Regards,
Laurent
...jim
Post by Laurent Bourgès
Hi clemens
Post by Clemens Eisserer
Hi Laurent,
Post by Laurent Bourgès
thanks for having some interest for my efforts !
As I got almost no feedback, I felt quite disappointed and was thinking that improving pisces was not important ...
Glad to see work is ongoing to improve pisces's performance :)
Thanks a lot: I am working on pisces during my spare time and
congratulations is very important for my motivation.
Post by Clemens Eisserer
I had a look at the patch just to be curious (I don't have reviewer status), but to be honest I had troubles finding the relevant parts.
grouped fields, constant first, debug logs ...
I can revert most of these boilerplates ... during cleanup.
I sent the patch as webrev to let other people evaluate its performance
using their own platform, work load, benchmarks ...
Post by Clemens Eisserer
Having not followed the discussion that closely, it was almost impossible for me to extract the real modifications from boilerplate/refactoring as your patch weights almost 3000 modified lines.
I looked at the webrev and I advocate I can discard many line changes.
As I use netbeans, it modified the code so easily... nevermind.
Post by Clemens Eisserer
I am just an intrested observer without any official state, yet personally I would prefer smaller pieces with clear description/title.
However, I wouldn't want to cause you additional work and it's just a single opinion.
Ok. I mainly modified memory handling: use a renderer cache to reuse
arrays and pisces class instances to avoid too much allocations and
resizing ... stored in thread local or concurrent queue.
Post by Clemens Eisserer
Thanks for working on pisces!
Thanks for your feedback too.
Laurent
Post by Clemens Eisserer
Regards, Clemens
Laurent Bourgès
2013-04-24 08:59:58 UTC
Permalink
Jim,

First, here are both updated webrev and benchmark results:
- results: http://jmmc.fr/~bourgesl/share/java2d-pisces/patch_opt_night.log
- webrev: http://jmmc.fr/~bourgesl/share/java2d-pisces/webrev-2/

Note: the webrev is partially "cleaner" - work still in progress !

Changes made:
- optimized cleanup of alpha / edges arrays
- TileState HARD reference stored in SunGraphics2D to avoid repeated
ThreadLocal or ConcurrentQueue accesses
- TileState propagated in RenderingEngine to PiscesRenderingEngine:
warning: interface compatibility issues
- minor tuning.

Now the ArrayCache (IntArrayCache, Dirty... and FloatArrayCache) are
effectively unused during MapBench tests, as the RendererContext already
stores large arrays (16K int or float arrays) + rowAARLE (2Mb).
However, I keep the array caching for very high workloads ... to be
discussed later.

Comparison (open office format):
http://jmmc.fr/~bourgesl/share/java2d-pisces/compareRef_Patch.ods

Patch2 vs ductus (average speedup per thread count):

  threads   speedup
  1         102,11%
  2         144,49%
  4         263,13%

On average, patch2 is equal to or better than ductus: +44% with 2 threads and
2.6 times faster with 4 threads!

In the following table, you can see how the gain varies depending on the test
(work load): my patch performs better than ductus for the complex test case
only.

  test              threads   Tavg      Tmed      Med+Stddev
  boulder_17        1          82,54%    77,68%    76,99%
  boulder_17        2         119,57%   120,24%   128,56%
  boulder_17        4         149,95%   150,39%   161,98%
  shp_alllayers_47  1         107,26%   107,18%   107,02%
  shp_alllayers_47  2         144,24%   144,18%   147,00%
  shp_alllayers_47  4         288,05%   289,10%   286,04%
Post by Jim Graham
Originally the version that was used in embedded used RLE because it
stored the results in the shape itself. On desktop I never found that to
be a necessary optimization especially because it actually wastes memory
for no gain during animations, but that was why they used RLE as a storage
format. Would it speed up the code to use a different storage format?
Maybe it could be a very good idea: compressing the alpha array to RLE and then
decompressing it to fill the byte[] tile array seems a bad idea. However,
keeping RLE encoding may help keep the arrays smaller when storing a complete
tile line, as I want: width = 4096 (or more) x height = 32.

As memory is cheap nowadays, I could try having a large 1D array to store
alpha values for complete tile line: 512K only !
Post by Jim Graham
Also, in the version we use in JavaFX we removed the tiling altogether and
return one alpha array for the entire rasterization. We might consider
doing that for this code as well if it allows us to get rid of Ductus - it
was a Ductus design constraint that forced the tiling (it was again based
on the expected size of the hardware AA engine)...
I think tiling is still interesting as such small arrays stay in the CPU
cache! However, I could try tuning the tile width to be larger (256x32
instead of 32x32 tiles) ...


Finally,
Who could help me working on pisces ? Could we form a tiger team ?
or at least could denis and you have some time to help me ?

Laurent
Jim Graham
2013-04-24 17:38:36 UTC
Permalink
Hi Laurent,
Post by Jim Graham
Originally the version that was used in embedded used RLE because it
stored the results in the shape itself. On desktop I never found
that to be a necessary optimization especially because it actually
wastes memory for no gain during animations, but that was why they
used RLE as a storage format. Would it speed up the code to use a
different storage format?
Maybe it could be a very good idea: compressing alpha array to RLE and
then decompressing it to fill byte[] tile array seems a bad idea.
However, keeping RLE encoding may help having smaller arrays to store a
complete tile line as I want: width = 4096 (or more) x height = 32.
Or storing the crossings more directly. It's a little obtuse to go from
crossings to a 1D linear array of alpha deltas to RLE alphas to alphas.
For one thing, the crossings were sort of compact and then they got
flattened into the deltas and then the RLE part has to do a linear
search to compute the run lengths. It would seem that it would be
smaller and faster to do a more direct "crossing to RLE" step or just
storing the crossings themselves (which are very analogous to RLE
anyway), and then generating the tile directly from the packed stored
crossings.
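For reference, the "flattened into the deltas" step mentioned above works as a running sum: each span contributes +coverage at its left crossing and -coverage at its right crossing, and a prefix sum over the delta array yields per-pixel alpha. A minimal sketch with illustrative names (not the actual Pisces code):

```java
// Sketch: convert a 1D array of coverage deltas into absolute alpha values.
// deltas[x] holds +cov at span starts and -cov at span ends; the running
// sum gives the coverage of every pixel in between.
public class DeltaToAlpha {
    public static int[] toAlpha(int[] deltas) {
        int[] alpha = new int[deltas.length];
        int sum = 0;
        for (int x = 0; x < deltas.length; x++) {
            sum += deltas[x];
            alpha[x] = sum;
        }
        return alpha;
    }
}
```

The RLE step then has to scan this flat array for runs, which is the linear search Jim notes could be skipped by going from crossings to RLE (or to tiles) directly.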
Post by Jim Graham
Also, in the version we use in JavaFX we removed the tiling
altogether and return one alpha array for the entire rasterization.
We might consider doing that for this code as well if it allows us
to get rid of Ductus - it was a Ductus design constraint that forced
the tiling (it was again based on the expected size of the hardware
AA engine)...
I think tiling is still interesting as such small arrays stay in the cpu
cache ! however, I could try tuning the tile width to be larger (256x32)
instead of (32x32 tiles) ...
An issue with that is that there are some hard-coded pieces of the HW
rasterizers that expect a 32x32 tile and would need to be rewritten to
adjust to different tile sizes. Once they are fixed then it might make
sense to generate tiles by multiples of scanlines and dynamically
determine the horizontal size on the fly depending on the shape being
rasterized. If you rasterize a shape that is 260 pixels wide then your
256x32 tile would involve way more up-and-down in the pipeline than a
512x16. It might be better to have a 4K tile and then have the
dimensions be W x (4K/W) and have the rendering loops allow an arbitrary
tile size...
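One way to realize the "W x (4K/W)" suggestion is to round the shape width up to a power of two, cap it at the pixel budget, and derive the height from the budget. This is a hedged sketch under my own rounding assumptions, not the actual JavaFX/Pisces policy:

```java
// Sketch: pick tile dimensions for a fixed per-tile pixel budget so that a
// wide shape gets a wide, shallow tile instead of many narrow columns.
public class TileDims {
    static final int TILE_PIXELS = 4096; // fixed budget: 4K pixels per tile

    // Returns { width, height } with width * height == TILE_PIXELS.
    public static int[] dimsFor(int shapeWidth) {
        int w = Integer.highestOneBit(Math.max(1, shapeWidth));
        if (w < shapeWidth) {
            w <<= 1;                      // round up to a power of two
        }
        w = Math.min(w, TILE_PIXELS);     // never exceed the budget
        return new int[] { w, TILE_PIXELS / w };
    }
}
```

For the 260-pixel-wide example above, this policy yields a 512 x 8 tile; the exact trade-off (e.g. 512x16 with a larger budget) would be tuned against the rendering loops.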
Post by Jim Graham
Finally,
Who could help me working on pisces ? Could we form a tiger team ?
or at least could denis and you have some time to help me ?
It would be great to have a tiger team, but I'm not full-time on Java2D
any more and Denis was only working on it during an internship so I'm
not sure what he is up to now.

We have a "graphics-rasterizer" mailing list/project that was created to
come up with an OS alternative to Ductus for OpenJDK. It went silent
when we got the first version of Pisces in and so that project has
technically been finished. If we can get a group of more than, well,
you working and me tossing out suggestions then we can discuss if we
want to revive that group or just continue on here...

...jim
Laurent Bourgès
2013-04-29 16:34:27 UTC
Permalink
Jim,

Sorry for the delay: I was busy reworking the tile generation to cache
only a single line of tiles: it allows me to use rowAARLE[32][] and a 1D
touchedTile array.

I simply do the rendering when nextTile() detects a tile line end !

Now it is working and the thread local cache is smaller: 32 x 4096 ~ 512k
for rowAARLE.

I think less memory is wasted and such arrays best fit into cpu cache.

Next step could be to use directly alpha data instead of RLE encoding /
decoding ...

More details asap

Laurent
Post by Clemens Eisserer
Hi Laurent,
Post by Jim Graham
Originally the version that was used in embedded used RLE because it
stored the results in the shape itself. On desktop I never found
that to be a necessary optimization especially because it actually
wastes memory for no gain during animations, but that was why they
used RLE as a storage format. Would it speed up the code to use a
different storage format?
Maybe it could be a very good idea: compressing alpha array to RLE and
then decompressing it to fill byte[] tile array seems a bad idea.
However, keeping RLE encoding may help having smaller arrays to store a
complete tile line as I want: width = 4096 (or more) x height = 32.
Or storing the crossings more directly. It's a little obtuse to go from
crossings to a 1D linear array of alpha deltas to RLE alphas to alphas.
For one thing, the crossings were sort of compact and then they got
flattened into the deltas and then the RLE part has to do a linear search
to compute the run lengths. It would seem that it would be smaller and
faster to do a more direct "crossing to RLE" step or just storing the
crossings themselves (which are very analogous to RLE anyway), and then
generating the tile directly from the packed stored crossings.
Post by Clemens Eisserer
Post by Jim Graham
Also, in the version we use in JavaFX we removed the tiling
altogether and return one alpha array for the entire rasterization.
We might consider doing that for this code as well if it allows us
to get rid of Ductus - it was a Ductus design constraint that forced
the tiling (it was again based on the expected size of the hardware
AA engine)...
I think tiling is still interesting as such small arrays stay in the cpu
cache ! however, I could try tuning the tile width to be larger (256x32)
instead of (32x32 tiles) ...
An issue with that is that there are some hard-coded pieces of the HW
rasterizers that expect a 32x32 tile and would need to be rewritten to
adjust to different tile sizes. Once they are fixed then it might make
sense to generate tiles by multiples of scanlines and dynamically determine
the horizontal size on the fly depending on the shape being rasterized. If
you rasterize a shape that is 260 pixels wide then your 256x32 tile would
involve way more up-and-down in the pipeline than a 512x16. It might be
better to have a 4K tile and then have the dimensions be W x (4K/W) and
have the rendering loops allow an arbitrary tile size...
Post by Clemens Eisserer
Post by Jim Graham
Finally,
Who could help me working on pisces ? Could we form a tiger team ?
or at least could denis and you have some time to help me ?
It would be great to have a tiger team, but I'm not full-time on Java2D
any more and Denis was only working on it during an internship so I'm not
sure what he is up to now.
Post by Clemens Eisserer
We have a "graphics-rasterizer" mailing list/project that was created to
come up with an OS alternative to Ductus for OpenJDK. It went silent when
we got the first version of Pisces in and so that project has technically
been finished. If we can get a group of more than, well, you working and
me tossing out suggestions then we can discuss if we want to revive that
group or just continue on here...
Post by Clemens Eisserer
...jim
Jim Graham
2013-04-24 18:22:40 UTC
Permalink
Post by Laurent Bourgès
Jim,
- results: http://jmmc.fr/~bourgesl/share/java2d-pisces/patch_opt_night.log
- webrev: http://jmmc.fr/~bourgesl/share/java2d-pisces/webrev-2/
Some comments on the webrev:

- This caching of pipeline data in SG2D is a new precedent and I'm wary
of it for a couple of reasons:
- It gets tricky to satisfy all pipelines using that kind of technique
- It breaks encapsulation, but at least it is isolated to internal code
- SG2D will need to deal with the new field in clone().
- Rendering a hierarchy of Swing objects uses clone() a lot so
caching storage in SG2D may not help much in that case and thread local
may be more of a win
- Thread local storage doesn't really have those issues
- It's only used if that pipeline is used
- No encapsulation issues
- Don't need to modify clone() and it will work across clones

- Curve iterator uses auto-boxing, is that causing any churn?
- RenderingEngine may want to provide "forwarding methods" for Ductus as
a temporary solution until we fix Ductus to be aware of the new signatures
- new constants in Dasher were moved out of the class they are used from...?
- Why get rid of the final modifiers in Dasher? Was it not as effective
as copying to local variables? Note that the manual copying to local
variables is prone to maintenance issues if someone inserts a method
call in a block where the fields are stale.
- PRE.pathTo() could still be static, no? Also, it would be nice to
make this change to the existing RE helper method directly (and possibly
provide a backwards compatibility method with the old argument list that
simply forwards with a "new float[6]").
- Perhaps it is time to get rid of the try/catch pairs in PTG.getAlpha()
entirely?
- When you have a field cached in a local variable and you compute a new
value for it, then "field = var = ..." is probably better than "var =
field = ..." since the latter implies that the answer gets stored in the
field and then read back which is an extra memory operation. I noticed
this in a couple of places in Renderer, but I know I saw the local
variable caching in other files as well.

- A lot of undoing of some very carefully planned indentation and code
alignment issues. Auto-formatting is evil... ;)
- A personal rant - I'm a big fan of the { on a line by itself if it
follows an indented line-continued method declaration or control
statement. It helps the eye quickly find the start of the body because
it stands out. Your autoformatting got rid of a bunch of those and I
made a frowny face... :( (It may not be true to the JDK code style
guidelines, but we've been using that style throughout the 2D code for
years...)

- We switched to a new "within" technique in the JavaFX version of
Pisces based upon this paper:

http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm

The C version of the implementation in that paper is:

    // Usable AlmostEqual function
    // See http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm
    jboolean Helpers_withinULP(const jfloat A, const jfloat B, const int maxUlps) {
        // Make sure maxUlps is non-negative and small enough that the
        // default NAN won't compare as equal to anything.
        // assert(maxUlps > 0 && maxUlps < 4 * 1024 * 1024);
        jint aInt, bInt;

        // Make aInt lexicographically ordered as a twos-complement int
        // This cast can induce "false positive" warnings from various compilers
        // or bug checking tools, but is correct as sizeof(jint) == sizeof(jfloat)
        aInt = *((jint *) &A);
        if (aInt < 0) {
            aInt = 0x80000000 - aInt;
        }

        // Make bInt lexicographically ordered as a twos-complement int
        // This cast can induce "false positive" warnings from various compilers
        // or bug checking tools, but is correct as sizeof(jint) == sizeof(jfloat)
        bInt = *((jint *) &B);
        if (bInt < 0) {
            bInt = 0x80000000 - bInt;
        }

        // aInt,bInt are in the range [-0x7fffffff, +0x7fffffff]
        // assuming maxUlps is much smaller than 0x7fffffff
        // (<negative number> + maxUlps) will never overflow
        // (<positive number> - maxUlps) will never overflow
        if (aInt < bInt) {
            return (aInt < 0) ? aInt + maxUlps >= bInt : bInt - maxUlps <= aInt;
        } else {
            return (bInt < 0) ? bInt + maxUlps >= aInt : aInt - maxUlps <= bInt;
        }
    }

It avoids multiple calls to ULP and the complexity of figuring out
whether to use the ULP of the larger or smaller number. The Java
version would have to use Float.floatToIntBits() where we used funny
pointer casting...
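A possible Java transcription of the C routine above, using Float.floatToIntBits in place of the pointer cast. This is a sketch, not the actual JavaFX code:

```java
// Java sketch of the "within N ULPs" comparison: map each float to a
// lexicographically ordered twos-complement int, then compare int distance.
public class FloatUlp {
    public static boolean withinULP(float a, float b, int maxUlps) {
        int aInt = Float.floatToIntBits(a);
        if (aInt < 0) {
            aInt = 0x80000000 - aInt;  // order negative floats correctly
        }
        int bInt = Float.floatToIntBits(b);
        if (bInt < 0) {
            bInt = 0x80000000 - bInt;
        }
        // Compare without overflow, as in the C original.
        if (aInt < bInt) {
            return (aInt < 0) ? aInt + maxUlps >= bInt : bInt - maxUlps <= aInt;
        } else {
            return (bInt < 0) ? bInt + maxUlps >= aInt : aInt - maxUlps <= bInt;
        }
    }
}
```

Note that Float.floatToIntBits collapses all NaNs to a canonical bit pattern, which is a minor behavioral difference from the raw C cast.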

...jim
Laurent Bourgès
2013-04-30 16:29:11 UTC
Permalink
Jim,

here is the latest webrev:
http://jmmc.fr/~bourgesl/share/java2d-pisces/webrev-3/

I tried to revert some syntax mistakes (indentation, moved constants +
cleanup): code cleanup is still in progress;
so please don't have a frowny face while looking at the code!

I fixed also Ductus and Jules rendering engines (TileState added in
interface)

In this patch, I modified the rowAARLE[][] array to only contain a single
tile line [32][4096] (512K). Each line is quite big already because I want
to store the alpha data there instead of RLE values:
RLE values are very interesting for horizontal lines, but for dashed lines
RLE becomes bigger than the alpha data. For example, a dashed line [2-1] will
produce many segments that will be encoded like [01][x2]...

I have few questions regarding renderer edge handling and float ceil calls:
    private void addLine(float x1, float y1, float x2, float y2) {
        ...
        // LBO: why ceil ie upper integer ?
        final int firstCrossing = Math.max(ceil(y1), boundsMinY); // upper integer
        final int lastCrossing  = Math.min(ceil(y2), boundsMaxY); // upper integer

=> firstCrossing / lastCrossing should use lower and upper integer values
(rounding issues) ?

    boolean endRendering() {
        // TODO: perform shape clipping to avoid dealing with segments out of bounding box

        // Ensure shape edges are within bbox:
        if (edgeMinX > edgeMaxX || edgeMaxX < 0f) {
            return false; // undefined X bounds or negative Xmax
        }
        if (edgeMinY > edgeMaxY || edgeMaxY < 0f) {
            return false; // undefined Y bounds or negative Ymax
        }

        // why use upper integer for edge min values ?

=> Here is the question: why use ceil instead of floor ?

        final int eMinX = ceil(edgeMinX); // upper positive int
        if (eMinX > boundsMaxX) {
            return false; // Xmin > bbox maxX
        }

        final int eMinY = ceil(edgeMinY); // upper positive int
        if (eMinY > boundsMaxY) {
            return false; // Ymin > bbox maxY
        }

        int spminX = Math.max(eMinX, boundsMinX);
        int spmaxX = Math.min(ceil(edgeMaxX), boundsMaxX); // upper positive int
        int spminY = Math.max(eMinY, boundsMinY);
        int spmaxY = Math.min(ceil(edgeMaxY), boundsMaxY); // upper positive int

        int pminX = spminX >> SUBPIXEL_LG_POSITIONS_X;
        int pmaxX = (spmaxX + SUBPIXEL_MASK_X) >> SUBPIXEL_LG_POSITIONS_X;
        int pminY = spminY >> SUBPIXEL_LG_POSITIONS_Y;
        int pmaxY = (spmaxY + SUBPIXEL_MASK_Y) >> SUBPIXEL_LG_POSITIONS_Y;

        // store BBox to answer ptg.getBBox():
        this.cache.init(pminX, pminY, pmaxX, pmaxY);

In PiscesCache:
    void init(int minx, int miny, int maxx, int maxy) {
        ...
        // LBO: why add +1 ??
        bboxX1 = maxx /* + 1 */;
        bboxY1 = maxy /* + 1 */;

=> piscesCache previously added +1 to maximum (upper loop condition y < y1)
but the pmaxX already uses ceil (+1) and (spmaxY + SUBPIXEL_MASK_Y) ensures
the last pixel is reached.

I fixed these limits to have correct rendering due to dirty rowAARLE reuse.

I think that the thread local's RendererContext is now small enough (< 1
Mb) to be used by web containers but if the number of threads is very high,
concurrent queue could help minimizing wasted memory.
Jim,
- results: http://jmmc.fr/~bourgesl/share/java2d-pisces/patch_opt_night.log
- webrev: http://jmmc.fr/~bourgesl/share/java2d-pisces/webrev-2/
- This caching of pipeline data in SG2D is a new precedent and I'm wary of
it for a couple of reasons:
- It gets tricky to satisfy all pipelines using that kind of technique
- It breaks encapsulation, but at least it is isolated to internal code
- SG2D will need to deal with the new field in clone().
- Rendering a hierarchy of Swing objects uses clone() a lot so caching
storage in SG2D may not help much in that case and thread local may be more
of a win
- Thread local storage doesn't really have those issues
- It's only used if that pipeline is used
- No encapsulation issues
- Don't need to modify clone() and it will work across clones
Ok, never mind: I first thought it was workable, but I agree it is not a
proper solution.
- Curve iterator uses auto-boxing, is that causing any churn?
No, integers are in the Integer cache.
- RenderingEngine may want to provide "forwarding methods" for Ductus as a
temporary solution until we fix Ductus to be aware of the new signatures
Done in the new patch.
- new constants in Dasher were moved out of the class they are used from...?
To be done (TBD)
- Why get rid of the final modifiers in Dasher? Was it not as effective
as copying to local variables? Note that the manual copying to local
variables is prone to maintenance issues if someone inserts a method call
in a block where the fields are stale.
I have to fix it.
- PRE.pathTo() could still be static, no? Also, it would be nice to make
this change to the existing RE helper method directly (and possibly provide
a backwards compatibility method with the old argument list that simply
forwards with a "new float[6]").
Agreed: TBD
- Perhaps it is time to get rid of the try/catch pairs in PTG.getAlpha()
entirely?
Done.
- When you have a field cached in a local variable and you compute a new
value for it, then "field = var = ..." is probably better than "var = field
= ..." since the latter implies that the answer gets stored in the field
and then read back which is an extra memory operation. I noticed this in a
couple of places in Renderer, but I know I saw the local variable caching
in other files as well.
Fixed.
- A lot of undoing of some very carefully planned indentation and code
alignment issues. Auto-formatting is evil... ;)
Sorry, I may tune netbeans formatting settings.
- A personal rant - I'm a big fan of the { on a line by itself if it
follows an indented line-continued method declaration or control statement.
It helps the eye quickly find the start of the body because it stands out.
Your autoformatting got rid of a bunch of those and I made a frowny
face... :( (It may not be true to the JDK code style guidelines, but we've
been using that style throughout the 2D code for years...)
Sorry again; however, I like auto-formatting because it lets me work more on
the code and less on syntax / indentation.
- We switched to a new "within" technique in the JavaFX version of Pisces
http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm
Good idea, but I think there are many places where float <-> int conversions
/ tests should be improved ...

Do you have any faster implementation for Math.ceil/floor ? I now use
casting 1 + (int) / (int) but I know it is only correct for positive
numbers (> 0).

I have found few bugs:
- rendering an infinite [vertical] line draws nothing: I still have to
diagnose it ...
- clipping issues:
for a very large dashed rectangle, millions of segments are emitted from
the dasher (x1) to the stroker (x4 segments) to the renderer (x4 segments).

However, I would like to add a minimal clipping algorithm but the rendering
pipeline seems to be handling affine transforms between stages:
/*
* Pipeline seems to be:
* shape.getPathIterator
* -> inverseDeltaTransformConsumer
* -> Dasher
* -> Stroker
* -> deltaTransformConsumer OR transformConsumer
* -> pc2d = Renderer (bounding box)
*/

It is complicated to perform clipping in the Renderer, and maybe too late, as
a complete stroker segment must be considered as a whole (4 segments
without caps ...).

Could you give me advices how to hack / add clipping algorithm in the
rendering pipeline (dasher / stroker / renderer) ?

Should I try to derive a bounding box for the dasher / stroker depending on
the Affine transforms ?

Laurent
Jim Graham
2013-05-01 00:43:18 UTC
Permalink
Hi Laurent,

There's a lot here to go over... ;)
Post by Laurent Bourgès
Jim,
http://jmmc.fr/~bourgesl/share/java2d-pisces/webrev-3/
I'll answer questions first and then have a look...
I have a personal philosophy for all rendering engines that I've
produced, but I'm not sure the philosophy was adopted for this code. I
like to use the same rounding for every operation and have all ranges be
specified as semi-open regions - x1 <= x < x2 - for example.

The "insideness" rule is most naturally expressed by semi-open intervals
since pixel centers that fall on a vertical boundary are inside if the
interior region is to their right (i.e. the x1 <= x half of that
equation above) and they are outside if the space to their right is
non-interior (i.e. the x < x2 half of that equation). A similar rule is
also used vertically.

The rounding operation which computes this would be:

xc = computed crossing value for the current scan line;
ceil(xc - 0.5);

This maps (for some integer k):

k+0.4999 to k
k+0.5000 to k
k+0.5001 to k+1

Which is exactly what you want for the beginning of a region. If the
crossing is exactly at the pixel center then that is the furthest right
that you can cross the scan line and still include pixel k. If you just
miss the center of a pixel to the right, then the first pixel to be
included is k+1. That function computes it correctly.
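The mapping above can be sketched directly (a minimal illustration of the rounding rule, not patch code; the helper name is mine):

```java
public final class CrossingRound {
    // First pixel included by a crossing, per the half-open
    // "insideness" rule: ceil(xc - 0.5).
    static int firstIncludedPixel(float xc) {
        return (int) Math.ceil(xc - 0.5f);
    }

    public static void main(String[] args) {
        int k = 7;
        System.out.println(firstIncludedPixel(k + 0.4999f)); // k
        System.out.println(firstIncludedPixel(k + 0.5000f)); // k     (exact center still included)
        System.out.println(firstIncludedPixel(k + 0.5001f)); // k + 1 (just missed the center)
    }
}
```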

Similarly, if you apply the same formula to the right edge of a region
then you compute "the first pixel to not be included in the region.
Consider that if you miss a pixel center on that right edge by falling
slightly to the left of it, then that pixel is "outside" and it is the
first such pixel to be "outside". If you hit its pixel center exactly,
then the insideness rule also says that it is "outside". Only when you
just miss it by falling slightly to the right of its center would that
pixel still be included. The above formula computes that value if you
take its answer to be the "first pixel not included in the region".

This lets you perform the same rounding on every value and treat every
pair of "inside trigger" to "outside trigger" as a half-open interval.
Note that otherwise you'd have to compute every floating point crossing,
then run the EvenOdd vs. ZeroWinding rule on them, and finally only do
the rounding after you knew which were left edges and which were right
(which you don't know until you've computed every crossing on the scan
line). Since this rounding rule is symmetric given half-open intervals
then you can round everything before you've analyzed the results of the
winding rules.

I then extend that philosophy to every set of coordinates I deal with,
be it horizontal ranges, vertical ranges, shape rasterized results, or
clip results, and the math usually works out in the simplest and most
elegant way possible.

For an AA renderer, it isn't clear that the sub-pixel computations
should be as sub-sub-pixely accurate, but I'd like to think that the
philosophy should be kept down to the lowest layers if we can,
especially as this code can be tuned down to a single sub-pixel per real
pixel and then it definitely should be as accurate as it can with the
insideness rule (even though I doubt it will ever be set up that way).

Another consideration is the vertices on an enclosing polygon. You
compute one segment to where it ends at xe1,ye1 and start computing the
next segment from where it starts at xs2,ys2, but for adjacent segments
in an outline, they start and end on a shared common point so xe1==xs2
and ye1==ys2. If you compute crossings for both segments at that shared
point then you have a chance of recording the crossings twice for that
tiny chunk of the outline. You could special case the last point in a
segment, but that gets into lots of tricky tests. On the other hand, if
all intervals everywhere are half-open then anything you compute for
xe1,ye1 would represent "the first line/pixel that segment 1 does not
cause to be included" and computing the same values for xs2,ys2 would
naturally generate "the first line/pixel that segment 2 causes to be
included" and only one of them would induce inclusion on that scan line
or pixel. This gets even more valuable when you consider all cases of
segment 1 going down to xe1,ye1 and segment 2 going back up from that
point, or up/down or up/up or down/down - and also left/left or
left/right or right/right or right/left. All possible cases of how one
segment arrives at xe1,ye1 and the next segment leaves from that same
point just naturally fall out from using symmetric rounding and
half-open intervals.
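A small sketch of how the shared vertex falls out (hypothetical helper, assuming both segments are ordered so y0 <= y1): the end of one half-open scanline range is exactly the start of the next, so no scanline is counted twice.

```java
public final class HalfOpenRanges {
    // Half-open scanline interval [start, end) covered by a segment
    // from y0 to y1, using the same ceil(y - 0.5) rounding at both ends.
    static int[] scanlines(float y0, float y1) {
        return new int[] { (int) Math.ceil(y0 - 0.5f),
                           (int) Math.ceil(y1 - 0.5f) };
    }

    public static void main(String[] args) {
        // Two adjacent outline segments sharing the vertex y = 12.3:
        int[] seg1 = scanlines(10.0f, 12.3f); // [10, 12)
        int[] seg2 = scanlines(12.3f, 15.0f); // [12, 15)
        // The shared vertex produces exactly one boundary, never a
        // double crossing:
        System.out.println(seg1[1] == seg2[0]); // true
    }
}
```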

I wanted to get that out to see if we were on the same page or if there
was another way of doing the math that you are familiar with that might
also make things easier.

I'm pretty sure that most of the code in the current version of Pisces
uses that same philosophy, but I seem to recall that there were a couple
of places that Denis was more comfortable with fully closed intervals.
I don't remember the details, but be on the watch for it. Also, I don't
recall if it used a half-sub-pixel bias anywhere or simply punted on it
as not being visible enough to worry about strict "center of sub-pixel"
comparisons.
Post by Laurent Bourgès
private void addLine(float x1, float y1, float x2, float y2) {
...
// LBO: why ceil ie upper integer ?
final int firstCrossing = Math.max(ceil(y1), boundsMinY); //
upper integer
final int lastCrossing = Math.min(ceil(y2), boundsMaxY); //
upper integer
Perhaps lastCrossing is misnamed. I think it represents the end of the
interval which is half-open. Does that match the rest of the operations
done on it?

If every coordinate has already been biased by the -0.5 then ceil is
just the tail end of the rounding equation I gave above.

If not, then the half-open considerations still apply if you use the
same rounding for both the top and the bottom of the segment. If
lastCrossing represents the "bottom of the shape" then it should not be
included. If the next segment continues on from it, then its y1 will be
computed in the first equation and will represent the first scanline
that the next segment participates in.
Post by Laurent Bourgès
=> firstCrossing / lastCrossing should use lower and upper integer
values (rounding issues) ?
Do they make sense now given my explanation above? Perhaps they should
be renamed to indicate their half-openness?
Post by Laurent Bourgès
boolean endRendering() {
// TODO: perform shape clipping to avoid dealing with segments
out of bounding box
if (edgeMinX > edgeMaxX || edgeMaxX < 0f) {
return false; // undefined X bounds or negative Xmax
}
if (edgeMinY > edgeMaxY || edgeMaxY < 0f) {
return false; // undefined Y bounds or negative Ymax
}
I'd use min >= max since if min==max then I think nothing gets generated
as a result of all edges having both the in and out crossings on the
same coordinate. Also, why not test against the clip bounds instead?
The code after that will clip the edgeMinMaxXY values against the
boundsMinMax values. If you do this kind of test after that clipping is
done (on spminmaxxy) then you can discover if all of the coordinates are
outside the clip or the region of interest.
Post by Laurent Bourgès
// why use upper integer for edge min values ?
Because an edge falling between sample points includes only the sample
point "above" it. (Or "un-includes" it if it is a right/bottom edge.)
Post by Laurent Bourgès
=> Here is the question: why use ceil instead of floor ?
final int eMinX = ceil(edgeMinX); // upper positive int
if (eMinX > boundsMaxX) {
return false; // Xmin > bbox maxX
}
final int eMinY = ceil(edgeMinY); // upper positive int
if (eMinY > boundsMaxY) {
return false; // Ymin > bbox maxY
}
int spminX = Math.max(eMinX, boundsMinX);
int spmaxX = Math.min(ceil(edgeMaxX), boundsMaxX); // upper
positive int
int spminY = Math.max(eMinY, boundsMinY);
int spmaxY = Math.min(ceil(edgeMaxY), boundsMaxY); // upper
positive int
int pminX = spminX >> SUBPIXEL_LG_POSITIONS_X;
int pmaxX = (spmaxX + SUBPIXEL_MASK_X) >> SUBPIXEL_LG_POSITIONS_X;
int pminY = spminY >> SUBPIXEL_LG_POSITIONS_Y;
int pmaxY = (spmaxY + SUBPIXEL_MASK_Y) >> SUBPIXEL_LG_POSITIONS_Y;
this.cache.init(pminX, pminY, pmaxX, pmaxY);
void init(int minx, int miny, int maxx, int maxy) {
...
// LBO: why add +1 ??
bboxX1 = maxx /* + 1 */;
bboxY1 = maxy /* + 1 */;
I think the values passed in are "the smallest and largest values we
encountered" and so adding one computes the first coordinate that is
"not inside the sample region" as per the half-open philosophy.

I tend to use and encourage "min/max" naming for closed-interval values
and xy0/xy1 or xy1/xy2 for half-open-interval values.
Post by Laurent Bourgès
=> piscesCache previously added +1 to maximum (upper loop condition y < y1)
but the pmaxX already uses ceil (+1) and (spmaxY + SUBPIXEL_MASK_Y)
ensures the last pixel is reached.
spmaxY + SUBPIXEL_MASK_Y computes the last sub-pixel piece of the last
pixel. It is essentially equivalent to "lastrow.9999999". So, the first
coordinate not included would be "lastrow+1" which is simply adding +1
to the sub-pixel "max" value that was given.
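The mask-and-shift arithmetic here is integer ceiling division; a minimal sketch, assuming 8 subpixel rows per pixel (SUBPIXEL_LG_POSITIONS_Y = 3, so MASK = 7):

```java
public final class SubpixelBounds {
    static final int SUBPIXEL_LG_POSITIONS_Y = 3;                       // assumption: 8 rows/pixel
    static final int SUBPIXEL_MASK_Y = (1 << SUBPIXEL_LG_POSITIONS_Y) - 1;

    // Floor division: the pixel row containing subpixel row sp.
    static int pixelFloor(int sp) {
        return sp >> SUBPIXEL_LG_POSITIONS_Y;
    }

    // Ceiling division: adding the mask before shifting rounds any
    // partially covered pixel up ("lastrow.9999999" style).
    static int pixelCeil(int sp) {
        return (sp + SUBPIXEL_MASK_Y) >> SUBPIXEL_LG_POSITIONS_Y;
    }

    public static void main(String[] args) {
        System.out.println(pixelFloor(48)); // 6: subpixel 48 sits in pixel row 6
        System.out.println(pixelCeil(48));  // 6: exact multiple of 8, nothing to round up
        System.out.println(pixelCeil(49));  // 7: one extra subpixel pushes the bound up
    }
}
```

This is the same 6-vs-7 discrepancy Laurent raises further down in the thread.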
Post by Laurent Bourgès
I fixed these limits to have correct rendering due to dirty rowAARLE reuse.
If there was a bug before I'd prefer to have it fixed within the
philosophy of half-open intervals rather than trying to translate into
closed intervals - it keeps all of the code on the same page.
Post by Laurent Bourgès
Do you have any faster implementation for Math.ceil/floor ? I now use
casting 1 + (int) / (int) but I know it is only correct for positive
numbers (> 0).
ceil(integer) == integer, but your technique would return "integer+1".
I don't remember a decent technique for doing ceil with a cast.
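One common branch-based variant that handles the exact-integer case correctly (a sketch, not from the patch; valid only for 0 <= x < Integer.MAX_VALUE):

```java
public final class FastCeil {
    // (int) truncates toward zero, which equals floor for non-negative x;
    // bump by one only when x actually has a fractional part.
    static int fastCeil(float x) {
        final int i = (int) x;
        return (x > i) ? i + 1 : i;
    }

    public static void main(String[] args) {
        System.out.println(fastCeil(17.2f)); // 18
        System.out.println(fastCeil(17.0f)); // 17 (the naive "1 + (int) x" would give 18)
        System.out.println(fastCeil(0.0f));  // 0
    }
}
```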
Post by Laurent Bourgès
for a very large dashed rectangle, millions of segments are emitted from
the dasher (x1) to the stroker (x4 segments) to the renderer (x4 segments).
However, I would like to add a minimal clipping algorithm but the
/*
* shape.getPathIterator
* -> inverseDeltaTransformConsumer
* -> Dasher
* -> Stroker
* -> deltaTransformConsumer OR transformConsumer
* -> pc2d = Renderer (bounding box)
*/
If we could teach the stroker to be able to apply an elliptical pen then
we could do the transforms once up front and simplify that quite a lot.
That optimization remains TBD. We currently at least detect uniform
scales and those can be handled with a single transform up front and
multiplying all dash segment values and the line width by the uniform
scale factor. But, anything that transforms a circle into an ellipse
requires us to use the multi-stage silliness above.
Post by Laurent Bourgès
It is complicated to perform clipping in the Renderer and maybe too late
as a complete stroker's segment must be considered as a whole (4
segments without caps ...).
Could you give me advices how to hack / add clipping algorithm in the
rendering pipeline (dasher / stroker / renderer) ?
Should I try to derive a bounding box for the dasher / stroker depending
on the Affine transforms ?
Yes, that sounds right so far, but it's pretty high-level and I'm not
sure where your level of understanding here falls short...? I guess
what I would do is:

- compute an expanded "clip box" for the stroker based on the original
clip, the transform, the line width, and the miter limit. Note that the
transform applied to the stroke coordinates varies depending on whether
we use the "uniform scaling" optimization or not

- test coordinates for being in the expanded clip box in the
PathConsumer2D methods of Stroker. If the segment is rejected, be sure
to update all of the internal computations (normal values, last moveto
coords, etc.) that might be needed for stroking the following segment

- unfortunately this means that dasher doesn't get the advantages, you
could add similar code to the dasher, but then you have to be sure an
entire dash is omitted before you can omit the data going on, otherwise
the stroker will not get proper incoming coordinates from "the previous
segment that was out of bounds". That could be managed with proper
communication between the dasher and stroker, but then it gets more
complicated.
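The first step above can be sketched as follows (a hypothetical helper, names and margin formula are mine, assuming the uniform-scaling case; the margin must cover half the line width plus the miter extension, in device space):

```java
import java.awt.geom.Rectangle2D;

public final class StrokeClip {
    // Expand the device clip so the stroker stage can reject segments
    // that cannot touch any visible pixel.
    static Rectangle2D.Float expandedClipBox(Rectangle2D.Float clip,
                                             float lineWidth,
                                             float miterLimit,
                                             float uniformScale) {
        // Miter joins can extend up to miterLimit * lineWidth / 2
        // from the path; round caps/joins stay within lineWidth / 2.
        float margin = 0.5f * lineWidth * uniformScale
                     * Math.max(1f, miterLimit);
        return new Rectangle2D.Float(clip.x - margin, clip.y - margin,
                                     clip.width + 2f * margin,
                                     clip.height + 2f * margin);
    }

    public static void main(String[] args) {
        Rectangle2D.Float clip = new Rectangle2D.Float(0f, 0f, 100f, 100f);
        Rectangle2D.Float box = expandedClipBox(clip, 4f, 10f, 1f);
        System.out.println(box.x + " " + box.width); // -20.0 140.0
    }
}
```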

Those are the caveats that I would keep in mind while I was designing a
solution - hope it helps some...

...jim
Laurent Bourgès
2013-05-02 08:48:25 UTC
Permalink
Jim,

thanks for this long explanations and I agree you approach using semi-open
intervals.

I have the feeling that pisces was not so accurate implementing it.
I have a personal philosophy for all rendering engines that I've produced,
but I'm not sure the philosophy was adopted for this code. I like to use
the same rounding for every operation and have all ranges be specified as
semi-open regions - x1 <= x < x2 - for example.
The "insideness" rule is most naturally expressed by semi-open intervals
since pixel centers that fall on a vertical boundary are inside if the
interior region is to their right (i.e. the x1 <= x half of that equation
above) and they are outside if the space to their right is non-interior
(i.e. the x < x2 half of that equation). A similar rule is also used
vertically.
xc = computed crossing value for the current scan line;
ceil(xc - 0.5);
Agreed; however, pisces code never performs such rounding: for me, ceil(x -
0.5) means (int) Math.round(x).
k+0.4999 to k
k+0.5000 to k
k+0.5001 to k+1
Which is exactly what you want for the beginning of a region. If the
crossing is exactly at the pixel center then that is the furthest right
that you can cross the scan line and still include pixel k. If you just
miss the center of a pixel to the right, then the first pixel to be
included is k+1. That function computes it correctly.
Similarly, if you apply the same formula to the right edge of a region
then you compute "the first pixel to not be included in the region.
Consider that if you miss a pixel center on that right edge by falling
slightly to the left of it, then that pixel is "outside" and it is the
first such pixel to be "outside". If you hit its pixel center exactly,
then the insideness rule also says that it is "outside". Only when you
just miss it by falling slightly to the right of its center would that
pixel still be included. The above formula computes that value if you take
its answer to be the "first pixel not included in the region".
This lets you perform the same rounding on every value and treat every
pair of "inside trigger" to "outside trigger" as a half-open interval. Note
that otherwise you'd have to compute every floating point crossing, then
run the EvenOdd vs. ZeroWinding rule on them, and finally only do the
rounding after you knew which were left edges and which were right (which
you don't know until you've computed every crossing on the scan line).
Since this rounding rule is symmetric given half-open intervals then you
can round everything before you've analyzed the results of the winding
rules.
I then extend that philosophy to every set of coordinates I deal with, be
it horizontal ranges, vertical ranges, shape rasterized results, or clip
results, and the math usually works out in the simplest and most elegant
way possible.
For an AA renderer, it isn't clear that the sub-pixel computations should
be as sub-sub-pixely accurate, but I'd like to think that the philosophy
should be kept down to the lowest layers if we can, especially as this code
can be tuned down to a single sub-pixel per real pixel and then it
definitely should be as accurate as it can with the insideness rule (even
though I doubt it will ever be set up that way).
Another consideration is the vertices on an enclosing polygon. You
compute one segment to where it ends at xe1,ye1 and start computing the
next segment from where it starts at xs2,ys2, but for adjacent segments in
an outline, they start and end on a shared common point so xe1==xs2 and
ye1==ys2. If you compute crossings for both segments at that shared point
then you have a chance of recording the crossings twice for that tiny chunk
of the outline. You could special case the last point in a segment, but
that gets into lots of tricky tests. On the other hand, if all intervals
everywhere are half-open then anything you compute for xe1,ye1 would
represent "the first line/pixel that segment 1 does not cause to be
included" and computing the same values for xs2,ys2 would naturally
generate "the first line/pixel that segment 2 causes to be included" and
only one of them would induce inclusion on that scan line or pixel. This
gets even more valuable when you consider all cases of segment 1 going down
to xe1,ye1 and segment 2 going back up from that point, or up/down or up/up
or down/down - and also left/left or left/right or right/right or
right/left. All possible cases of how one segment arrives at xe1,ye1 and
the next segment leaves from that same point just naturally fall out from
using symmetric rounding and half-open intervals.
I wanted to get that out to see if we were on the same page or if there
was another way of doing the math that you are familiar with that might
also make things easier.
Perfect and very clear. I will then check pisces rounding either on pixel
coordinates (px, py) or AA (virtual) pixel coordinates (spx, spy).
I'm pretty sure that most of the code in the current version of Pisces
uses that same philosophy, but I seem to recall that there were a couple of
places that Denis was more comfortable with fully closed intervals. I don't
remember the details, but be on the watch for it. Also, I don't recall if
it used a half-sub-pixel bias anywhere or simply punted on it as not being
visible enough to worry about strict "center of sub-pixel" comparisons.
I am perplexed, and I am going to check the pisces code against your
approach.
private void addLine(float x1, float y1, float x2, float y2) {
Post by Laurent Bourgès
...
// LBO: why ceil ie upper integer ?
final int firstCrossing = Math.max(ceil(y1), boundsMinY); //
upper integer
final int lastCrossing = Math.min(ceil(y2), boundsMaxY); //
upper integer
Perhaps lastCrossing is misnamed. I think it represents the end of the
interval which is half-open. Does that match the rest of the operations
done on it?
If every coordinate has already been biased by the -0.5 then ceil is just
the tail end of the rounding equation I gave above.
That's not the case => buggy: x1, y1 and x2, y2 are directly the point
coordinates as float values.

FYI, firstCrossing is used to compute the first X intersection for that Y
coordinate and determine the bucketIdx:
_edges[ptr+CURX] = x1 + (firstCrossing - y1) * slope;
...
final int bucketIdx = firstCrossing - boundsMinY;
...
_edges[ptr+NEXT] = edgeBuckets[bucketIdx];

lastCrossing is used to give the YMax edge coordinate and edgedBucketCounts:
_edges[ptr+YMAX] = lastCrossing;
...
edgeBucketCounts[lastCrossing - boundsMinY] |= 1;

I think rounding errors can lead to pixel / shape rasterization
deformations ... ?
If not, then the half-open considerations still apply if you use the same
rounding for both the top and the bottom of the segment. If lastCrossing
represents the "bottom of the shape" then it should not be included. If
the next segment continues on from it, then its y1 will be computed in the
first equation and will represent the first scanline that the next segment
participates in.
=> firstCrossing / lastCrossing should use lower and upper integer
Post by Laurent Bourgès
values (rounding issues) ?
Do they make sense now given my explanation above? Perhaps they should be
renamed to indicate their half-openness?
I am having some trouble verifying the math performed in ScanLineIterator.next(),
which uses the edges and edgeBucketCounts arrays ... could you have a look ?
Apparently, it uses the following for loop that respects the semi-open
interval philosophy:
for (int i = 0, ecur, j, k; i < count; i++) {
...

However, I am not sure if the edges "linked list" is traversed correctly ...

boolean endRendering() {
Post by Laurent Bourgès
// TODO: perform shape clipping to avoid dealing with segments
out of bounding box
if (edgeMinX > edgeMaxX || edgeMaxX < 0f) {
return false; // undefined X bounds or negative Xmax
}
if (edgeMinY > edgeMaxY || edgeMaxY < 0f) {
return false; // undefined Y bounds or negative Ymax
}
I'd use min >= max since if min==max then I think nothing gets generated
as a result of all edges having both the in and out crossings on the same
coordinate. Also, why not test against the clip bounds instead? The code
after that will clip the edgeMinMaxXY values against the boundsMinMax
values. If you do this kind of test after that clipping is done (on
spminmaxxy) then you can discover if all of the coordinates are outside the
clip or the region of interest.
I tried here to perform a few "fast" checks before doing float-to-int
conversions (costly because millions are performed): I think it can still be
improved: edgeMinX > edgeMaxX only ensures edgeMinX is defined and
both are positive !

TODO: improve boundary checks here.
// why use upper integer for edge min values ?
Because an edge falling between sample points includes only the sample
point "above" it. (Or "un-includes" it if it is a right/bottom edge.)
=> Here is the question: why use ceil instead of floor ?
Post by Laurent Bourgès
final int eMinX = ceil(edgeMinX); // upper positive int
if (eMinX > boundsMaxX) {
return false; // Xmin > bbox maxX
}
final int eMinY = ceil(edgeMinY); // upper positive int
if (eMinY > boundsMaxY) {
return false; // Ymin > bbox maxY
}
int spminX = Math.max(eMinX, boundsMinX);
int spmaxX = Math.min(ceil(edgeMaxX), boundsMaxX); // upper
positive int
int spminY = Math.max(eMinY, boundsMinY);
int spmaxY = Math.min(ceil(edgeMaxY), boundsMaxY); // upper
positive int
int pminX = spminX >> SUBPIXEL_LG_POSITIONS_X;
int pmaxX = (spmaxX + SUBPIXEL_MASK_X) >>
SUBPIXEL_LG_POSITIONS_X;
int pminY = spminY >> SUBPIXEL_LG_POSITIONS_Y;
int pmaxY = (spmaxY + SUBPIXEL_MASK_Y) >>
SUBPIXEL_LG_POSITIONS_Y;
this.cache.init(pminX, pminY, pmaxX, pmaxY);
void init(int minx, int miny, int maxx, int maxy) {
...
// LBO: why add +1 ??
bboxX1 = maxx /* + 1 */;
bboxY1 = maxy /* + 1 */;
I think the values passed in are "the smallest and largest values we
encountered" and so adding one computes the first coordinate that is "not
inside the sample region" as per the half-open philosophy.
I tend to use and encourage "min/max" naming for closed-interval values
and xy0/xy1 or xy1/xy2 for half-open-interval values.
=> piscesCache previously added +1 to maximum (upper loop condition y <
Post by Laurent Bourgès
y1)
but the pmaxX already uses ceil (+1) and (spmaxY + SUBPIXEL_MASK_Y)
ensures the last pixel is reached.
spmaxY + SUBPIXEL_MASK_Y computes the last sub-pixel piece of the last
pixel. It is essentially equivalent to "lastrow.9999999". So, the first
coordinate not included would be "lastrow+1" which is simply adding +1 to
the sub-pixel "max" value that was given.
I understand, but I figured out that pmaxX or pmaxY (pixel coordinates) may
already land on the upper integer:
- spmaxY >> SUBPIXEL_LG_POSITIONS_Y
=> 6
- (spmaxY + SUBPIXEL_MASK_Y) >> SUBPIXEL_LG_POSITIONS_Y => 7

So adding +1 in piscesCache will give maxY = 8 !!
I fixed these limits to have correct rendering due to dirty rowAARLE
Post by Laurent Bourgès
reuse.
If there was a bug before I'd prefer to have it fixed within the
philosophy of half-open intervals rather than trying to translate into
closed intervals - it keeps all of the code on the same page.
I think it was not a real bug, as the new rowAARLE[][] was big enough and
zero-filled, so it only led to larger boundaries and tiles.
Post by Laurent Bourgès
Do you have any faster implementation for Math.ceil/floor ? I now use
casting 1 + (int) / (int) but I know it is only correct for positive
numbers (> 0).
ceil(integer) == integer, but your technique would return "integer+1". I
don't remember a decent technique for doing ceil with a cast.
I disagree:
- ceil(float) => the smallest (closest to negative infinity) floating-point
value that is greater than or equal to the argument and is equal to a
mathematical integer.
- floor(float) => the largest (closest to positive infinity) floating-point
value that is less than or equal to the argument and is equal to a
mathematical integer.

i.e. floor(x) <= x <= ceil(x)

Here are numbers comparing the Math.ceil / floor operations and my casting
implementation (only valid for >0):

floats = [-134758.4, -131.5, -17.2, -1.9, -0.9, -1.0E-8, -1.0E-23,
-0.0, 0.0, 134758.4, 131.5, 17.2, 1.9, 0.9, 1.0E-8, 1.0E-23]
mathCeil = [-134758, -131, -17, -1, 0, 0, 0, 0, 0, 134759, 132, 18, 2,
1, 1, 1]
fastMathCeil = [-134757, -130, -16, 0, 1, 1, 1, 1, 1, 134759, 132, 18, 2,
1, 1, 1]
mathFloor = [-134759, -132, -18, -2, -1, -1, -1, 0, 0, 134758, 131, 17,
1, 0, 0, 0]
fastMathFloor= [-134758, -131, -17, -1, 0, 0, 0, 0, 0, 134758, 131, 17, 1,
0, 0, 0]

Micro benchmark results:
# FastMathCeil: run duration: 10 000 ms
            4 threads, Tavg = 44,95 ns/op (σ = 0,39 ns/op), Total ops
= 890974137 [ 44,54 (224799696), 44,76 (223695800), 45,59
(219626233), 44,93 (222852408)]

# MathCeil: run duration: 10 000 ms
            4 threads, Tavg = 125,59 ns/op (σ = 1,71 ns/op), Total ops
= 318415673 [ 125,81 (79406107), 128,11 (78061562), 125,22
(79862351), 123,33 (81085653)]
Post by Laurent Bourgès
for a very large dashed rectangle, millions of segments are emitted from
the dasher (x1) to the stroker (x4 segments) to the renderer (x4 segments).
However, I would like to add a minimal clipping algorithm but the
/*
* shape.getPathIterator
* -> inverseDeltaTransformConsumer
* -> Dasher
* -> Stroker
* -> deltaTransformConsumer OR transformConsumer
* -> pc2d = Renderer (bounding box)
*/
If we could teach the stroker to be able to apply an elliptical pen then
we could do the transforms once up front and simplify that quite a lot.
That optimization remains TBD. We currently at least detect uniform
scales and those can be handled with a single transform up front and
multiplying all dash segment values and the line width by the uniform scale
factor. But, anything that transforms a circle into an ellipse requires us
to use the multi-stage silliness above.
Could you give me more details about the "elliptical pen", i.e. shearing shapes ?

Another idea to be discussed: round caps are quite expensive (curves broken
into lines), and an optimization could determine if the cap is invisible
(out of the clip bounds) or visually useless, i.e. too small: the half
circle spans less than 1 pixel (true pixel or AA pixels). Any idea ?
Maybe I should only check if the line width is too small to render caps
in such situations...
It is complicated to perform clipping in the Renderer and maybe too late
Post by Laurent Bourgès
as a complete stroker's segment must be considered as a whole (4
segments without caps ...).
Could you give me advices how to hack / add clipping algorithm in the
rendering pipeline (dasher / stroker / renderer) ?
Should I try to derive a bounding box for the dasher / stroker depending
Post by Laurent Bourgès
on the Affine transforms ?
Yes, that sounds right so far, but it's pretty high-level and I'm not sure
where your level of understanding here falls short...? I guess what I
- compute an expanded "clip box" for the stroker based on the original
clip, the transform, the line width, and the miter limit. Note that the
transform applied to the stroke coordinates varies depending on whether we
use the "uniform scaling" optimization or not
- test coordinates for being in the expanded clip box in the
PathConsumer2D methods of Stroker. If the segment is rejected, be sure to
update all of the internal computations (normal values, last moveto coords,
etc.) that might be needed for stroking the following segment
That's the plan, but I would have to do it twice: in the dasher and in the
stroker ... maybe it should be coded once in a new PiscesClipUtils class or
in Helpers ...
- unfortunately this means that dasher doesn't get the advantages, you
could add similar code to the dasher, but then you have to be sure an
entire dash is omitted before you can omit the data going on, otherwise the
stroker will not get proper incoming coordinates from "the previous segment
that was out of bounds". That could be managed with proper communication
between the dasher and stroker, but then it gets more complicated.
Agreed, but maybe it should be analyzed more deeply to evaluate whether it
is too hard ...
Those are the caveats that I would keep in mind while I was designing a
solution - hope it helps some...
...jim
PS: I have more free time until Sunday (the children are away), so I would
like to talk to you via Skype if possible ?
If yes, could you give me your skype account and work times (GMT time
please) ?
I am living in Grenoble, France (GMT+1)

Thanks a lot,
Laurent
Jim Graham
2013-05-03 22:57:38 UTC
Permalink
Hi Laurent,
Post by Jim Graham
xc = computed crossing value for the current scan line;
ceil(xc - 0.5);
Agreed; however, pisces code never performs such rounding: for me,
ceil(x - 0.5) means (int) Math.round(x).
(int) Math.round(x) is almost the same function, but the 0.5's map the
wrong way. 3.5 gets mapped to 4 which implies that pixel #4 is the
first on, or first off. But, the first pixel to change is pixel #3 in
that case. For all other values it is the same. ceil(x-0.5) reverses
the "open interval of rounding" to bias halfway points down instead of up.
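The difference shows up only at exact halfway points, e.g.:

```java
public final class HalfwayBias {
    public static void main(String[] args) {
        float xc = 3.5f;
        System.out.println(Math.round(xc));              // 4: halfway rounds up
        System.out.println((int) Math.ceil(xc - 0.5f));  // 3: halfway biases down
        // For any non-halfway value the two agree:
        System.out.println(Math.round(3.4f));              // 3
        System.out.println((int) Math.ceil(3.4f - 0.5f)); // 3
    }
}
```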
Post by Jim Graham
I wanted to get that out to see if we were on the same page or if
there was another way of doing the math that you are familiar with
that might also make things easier.
Perfect and very clear. I will then check pisces rounding either on
pixel coordinates (px, py) or AA (virtual) pixel coordinates (spx, spy).
With respect to applying the "center of pixel insideness" rules to the
subpixels...

I have a vague memory of having the same discussion with Denis, but it
seems that either he found a clever way of implementing it that I can't
find, or he didn't find it helped on the quality vs. performance scale.
Keep in mind that our "pixel centers include pixels" rules are
described in a context that is considering a binary "is this pixel
inside or not" decision, but AA processing is a different animal and an
attempt to represent approximate coverages. So, applying that rule on
the sub-pixel sample level of an AA rasterizer is on the gray line of
blind dogmatic adherence to a philosophy from a related area. I'm not
convinced that one could consider Pisces "non-compliant" with that rule
since I don't think it was ever promised that the rule would apply to
this case.
Post by Jim Graham
I am perplexed and I am going to check the pisces code against your given
approach.
If for no other reason that to make sure that there aren't two parts of
the system trying to communicate with different philosophies. You don't
want a caller to hand a closed interval to a method which treats the
values as half-open, for instance. If the rounding is "different, but
consistent", then I think we can leave it for now and treat it as a
future refinement to check if it makes any practical difference and
correct. But, if it shows up a left-hand-not-talking-to-right-hand bug,
then that would be good to fix sooner rather than later.

I think it is OK to focus on your current task of performance and memory
turmoil, but I wanted to give you the proper background to try to
understand what you were reading primarily, and possibly to get you
interested in cleaning up the code as you went as a secondary consideration.
Post by Jim Graham
Perhaps lastCrossing is misnamed. I think it represents the end of
the interval which is half-open. Does that match the rest of the
operations done on it?
If every coordinate has already been biased by the -0.5 then ceil is
just the tail end of the rounding equation I gave above.
That's not the case => buggy: x1, y1 and x2, y2 are directly the point
coordinates as float values.
Then using the ceil() on both is still consistent with half-open
intervals, it just has a different interpretation of where the sampling
cut-off lies within the subpixel sample. When you determine where the
"crossings" lie, then it would be proper to do the same ceil(y +/- some
offset) operation to compute the first crossing that is included and the
first crossing that is excluded. In this case it appears that the
offset is just 0.0 which doesn't really meet my expectations, but is a
minor issue. These crossings then become a half-open interval of
scanline indices in which to compute the value.
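As a sketch of that half-open convention (hypothetical helper names, assuming the ceil(y - 0.5) rule discussed above; this is not Pisces code):

```java
public class ScanlineRange {
    // Map an edge's float y-range [y1, y2) to a half-open range of
    // scanline indices: firstScanline included, endScanline excluded.
    static int firstScanline(float y1) {
        return (int) Math.ceil(y1 - 0.5f);
    }

    static int endScanline(float y2) {
        return (int) Math.ceil(y2 - 0.5f);
    }

    public static void main(String[] args) {
        float y1 = 1.3f, y2 = 4.5f;
        // Scanlines sampled at pixel centers: y = 1.5, 2.5, 3.5 fall
        // inside [1.3, 4.5); y = 4.5 is excluded by the half-open rule.
        for (int y = firstScanline(y1); y < endScanline(y2); y++) {
            System.out.println("scanline " + y);
        }
        // prints scanline 1, scanline 2, scanline 3
    }
}
```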
Post by Jim Graham
FYI, firstCrossing is used to compute the first X intersection for that scanline:
_edges[ptr+CURX] = x1 + (firstCrossing - y1) * slope;
...
final int bucketIdx = firstCrossing - boundsMinY;
...
_edges[ptr+NEXT] = edgeBuckets[bucketIdx];
_edges[ptr+YMAX] = lastCrossing;
...
edgeBucketCounts[lastCrossing - boundsMinY] |= 1;
I think rounding errors can lead to pixel / shape rasterization
deformations ... ?
As long as the test is "y < _edges[ptr+YMAX]" then that is consistent
with a half-open interval sampled at the top of every sub-pixel region,
isn't it? I agree with the half-open part of it, but would have
preferred a "center of sub-pixel" offset for the actual sampling.
Post by Jim Graham
If not, then the half-open considerations still apply if you use the
same rounding for both the top and the bottom of the segment. If
lastCrossing represents the "bottom of the shape" then it should not
be included. If the next segment continues on from it, then its y1
will be computed in the first equation and will represent the first
scanline that the next segment participates in.
=> firstCrossing / lastCrossing should use lower and upper integer
values (rounding issues) ?
Do they make sense now given my explanation above? Perhaps they
should be renamed to indicate their half-openness?
I am having a bit of trouble verifying the maths performed in
ScanLineIterator.next(), which uses the edges and edgeBucketCounts arrays ...
could you have a look ?
Apparently, it uses the following for loop that respects the semi-open interval:
for (int i = 0, ecur, j, k; i < count; i++) {
...
I'll come back to that at a later time, but it sounds like you are
starting to get a handle on the design here.
Post by Jim Graham
However, I am not sure if the edges "linked list" is traversed correctly ...
That one took me a few tries to understand myself as well.
Post by Jim Graham
boolean endRendering() {
    // TODO: perform shape clipping to avoid dealing with segments out of bounding box
    if (edgeMinX > edgeMaxX || edgeMaxX < 0f) {
        return false; // undefined X bounds or negative Xmax
    }
    if (edgeMinY > edgeMaxY || edgeMaxY < 0f) {
        return false; // undefined Y bounds or negative Ymax
    }
I'd use min >= max since if min==max then I think nothing gets
generated as a result of all edges having both the in and out
crossings on the same coordinate. Also, why not test against the
clip bounds instead? The code after that will clip the edgeMinMaxXY
values against the boundsMinMax values. If you do this kind of test
after that clipping is done (on spminmaxxy) then you can discover if
all of the coordinates are outside the clip or the region of interest.
I tried here to perform a few "fast" checks before doing float-to-int
conversions (costly because millions are performed): I think it can
still be improved: edgeMinX > edgeMaxX only ensures edgeMinX is defined and
both are positive !
endRendering is called once per shape. I don't think moving tests above
its conversions to int will affect our throughput compared to
calculations done per-vertex.

I'm going to have to review the rest of this email at a future time, my
apologies...
Post by Jim Graham
PS: I have more free time until Sunday (children are away), so I would
like to talk to you using Skype if possible.
If yes, could you give me your Skype account and working hours (GMT time
please)?
I am living in Grenoble, France (GMT+1).
Unfortunately, this time frame isn't good for me. :(

...jim
Laurent Bourgès
2013-05-06 13:43:12 UTC
Permalink
Jim, Andrea & java2d members,

I am happy to announce an updated Pisces patch that is faster again:

- Patched Pisces vs OpenJDK Pisces (ref): ~2.5 to 4.5 times faster

score small:
  1 thread  / 20: 248,04% 247,90% 464,65% 248,04% 253,49% 232,64% 207,77%
  2 threads / 40: 276,49% 276,09% 1317,15% 279,32% 308,52% 251,96% 288,31%
  4 threads / 80: 295,18% 295,49% 629,06% 298,08% 316,24% 269,51% 181,64%

score big:
  1 thread  / 20: 356,13% 356,44% 1862,18% 356,47% 360,04% 345,63% 360,26%
  2 threads / 40: 413,56% 414,14% 350,96% 414,06% 411,88% 412,23% 385,51%
  4 threads / 80: 458,96% 459,48% 941,17% 459,68% 467,40% 425,12% 450,10%

- Patched Pisces vs Oracle JDK 8 (ductus): ~equal (1T), ~60% faster (2T),
~2 to 3 times faster (4T)

score small:
  1 thread  / 20: 94,02% 93,58% 61,96% 93,53% 92,77% 93,69% 128,83%
  2 threads / 40: 138,06% 137,95% 763,67% 140,09% 157,44% 102,14% 183,03%
  4 threads / 80: 179,10% 179,17% 494,78% 182,03% 198,80% 119,86% 176,89%

score big:
  1 thread  / 20: 122,67% 122,69% 112,98% 122,69% 122,67% 122,70% 122,23%
  2 threads / 40: 173,02% 173,17% 335,41% 173,50% 178,99% 160,51% 175,63%
  4 threads / 80: 325,52% 326,50% 574,24% 326,59% 330,57% 226,20% 321,69%
JAVA_OPTS="-server -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal
-XX:-TieredCompilation "
JAVA_TUNING=" -Xms128m -Xmx128m"

Full results:
http://jmmc.fr/~bourgesl/share/java2d-pisces/compareRef_Patch_2.ods

http://jmmc.fr/~bourgesl/share/java2d-pisces/patch_opt_05_05_20s.log
http://jmmc.fr/~bourgesl/share/java2d-pisces/ductus_tests_10s.log
http://jmmc.fr/~bourgesl/share/java2d-pisces/ref_test_long.log

Here is the updated pisces patch:
http://jmmc.fr/~bourgesl/share/java2d-pisces/webrev-4/

Changes:
- PiscesCache: use rowAAStride[32][x0; x1; alpha sum(x)] to use alpha data
directly instead of encoding / decoding RLE data
- fixed PiscesTileGenerator.getAlpha() to use directly and optimally
rowAAStride
- Renderer: edges array split into edges [CURX, SLOPE] / edgesInt [NEXT,
YMAX, OR] to avoid float / int conversions
- added "monitors" ie custom cpu / stats probes to gather usage statistics
and cpu timings (minimal overhead): enable them using
PiscesConst.doMonitors flag
- minor tweaks

Remaining tasks:
- basic clipping algorithm to handle trivial shape or line rejection when
no affine transform or simple one (scaling) is in use
- enhance curve / round caps and joins handling to take into account the
spatial resolution: for example, round caps representing less than 2 AA
pixels are visually useless and counterproductive (CPU): to be discussed
- cleanup / indentation STILL IN PROGRESS

Jim, I found a few bugs / mistakes related to bbox + 1 (piscesCache) and
the alpha array (+ 1):
I agree pixel coordinates / edges / crossings should be converted to
integers in a uniform manner (consistent and more accurate): to be
discussed; they remain to be fixed.


Finally I updated MapBench & MapDisplay:
http://jmmc.fr/~bourgesl/share/java2d-pisces/MapBench/

New features:
- each test runs at least 5s (configurable as first CLI arg) to ensure
enough test runs to compute accurate average and statistics
- perform scaling / translation tests (affineTransform) (clipping tests in
progress)
- flush monitors after each test
Post by Laurent Bourgès
I am perplexed and I am going to check the pisces code against your given
Post by Laurent Bourgès
approach.
If for no other reason that to make sure that there aren't two parts of
the system trying to communicate with different philosophies. You don't
want a caller to hand a closed interval to a method which treats the values
as half-open, for instance. If the rounding is "different, but
consistent", then I think we can leave it for now and treat it as a future
refinement to check if it makes any practical difference and correct. But,
if it shows up a left-hand-not-talking-to-right-hand bug, then that
would be good to fix sooner rather than later.
As said before, minor bugs:
- alpha array (Renderer) handling seems to go over its upper limit: I need
to clear it up to pix_to + 1 + 1 (pix_to inclusive) !
- edge / crossing coordinate rounding: fix bias to 0.5 => ceil (x - 0.5)
Post by Laurent Bourgès
I think it is OK to focus on your current task of performance and memory
turmoil, but I wanted to give you the proper background to try to
understand what you were reading primarily, and possibly to get you
interested in cleaning up the code as you went as a secondary consideration.
Agreed.

Could you explain to me a bit the renderer's scan line algorithm related to
crossings in the next() and _endRendering() methods ?
Post by Laurent Bourgès
If every coordinate has already been biased by the -0.5 then ceil is
Post by Laurent Bourgès
just the tail end of the rounding equation I gave above.
That's not the case => buggy: x1, y1 and x2, y2 are directly the point
coordinates as float values.
Then using the ceil() on both is still consistent with half-open
intervals, it just has a different interpretation of where the sampling
cut-off lies within the subpixel sample. When you determine where the
"crossings" lie, then it would be proper to do the same ceil(y +/- some
offset) operation to compute the first crossing that is included and the
first crossing that is excluded.
Ok.
Post by Laurent Bourgès
In this case it appears that the offset is just 0.0 which doesn't really
meet my expectations, but is a minor issue. These crossings then become a
half-open interval of scanline indices in which to compute the value.
To be fixed soon.
Post by Laurent Bourgès
I think rounding errors can lead to pixel / shape rasterization
Post by Laurent Bourgès
deformations ... ?
As long as the test is "y < _edges[ptr+YMAX]" then that is consistent with
a half-open interval sampled at the top of every sub-pixel region, isn't
it?
Ok.
Post by Laurent Bourgès
I agree with the half-open part of it, but would have preferred a "center
of sub-pixel" offset for the actual sampling.
Agreed again.
Post by Laurent Bourgès
I am having a bit of trouble verifying the maths performed in
Post by Laurent Bourgès
ScanLineIterator.next() which use edges and edgeBucketCounts arrays ...
could you have a look ?
Apparently, it uses the following for loop that respects the semi-open interval:
for (int i = 0, ecur, j, k; i < count; i++) {
...
I'll come back to that at a later time, but it sounds like you are
starting to get a handle on the design here.
Thanks.
Post by Laurent Bourgès
boolean endRendering() {
Post by Laurent Bourgès
    // TODO: perform shape clipping to avoid dealing with segments out of bounding box
    if (edgeMinX > edgeMaxX || edgeMaxX < 0f) {
        return false; // undefined X bounds or negative Xmax
    }
    if (edgeMinY > edgeMaxY || edgeMaxY < 0f) {
        return false; // undefined Y bounds or negative Ymax
    }
I'd use min >= max since if min==max then I think nothing gets
generated as a result of all edges having both the in and out
crossings on the same coordinate. Also, why not test against the
clip bounds instead? The code after that will clip the edgeMinMaxXY
values against the boundsMinMax values. If you do this kind of test
after that clipping is done (on spminmaxxy) then you can discover if
all of the coordinates are outside the clip or the region of interest.
I tried here to perform a few "fast" checks before doing float-to-int
conversions (costly because millions are performed): I think it can
still be improved: edgeMinX > edgeMaxX only ensures edgeMinX is defined and
both are positive !
endRendering is called once per shape. I don't think moving tests above
its conversions to int will affect our throughput compared to calculations
done per-vertex.
Agreed but having a small gain for each shape is still interesting when
thousands or millions of shapes are rendered !
Post by Laurent Bourgès
I'm going to have to review the rest of this email at a future time, my
apologies...
Looking forward to hearing from you soon and getting more feedback on the latest changes.

Regards,
Laurent
Andrea Aime
2013-05-06 13:48:35 UTC
Permalink
On Mon, May 6, 2013 at 3:43 PM, Laurent Bourgès
Post by Laurent Bourgès
Jim, Andrea & java2d members,
Laurent, I cannot comment from the code point of view (haven't really read
the patches), but from
the performance standpoint: amazing work!
I plan to give your patch another test next weekend on a full blown
GeoServer to see how it behaves in a full application.

Cheers
Andrea
--
==
GeoServer training in Milan, 6th & 7th June 2013! Visit
http://geoserver.geo-solutions.it for more information.
==

Ing. Andrea Aime
@geowolf
Technical Lead

GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it

-------------------------------------------------------
Laurent Bourgès
2013-05-06 13:58:32 UTC
Permalink
Andrea,

Thanks and I am impatient to see your benchmark results.

If you have time, could you perform two benchmark runs to compare
ThreadLocal vs ConcurrentLinkedQueue modes:
In RenderingEngine class:
/** use ThreadLocal or ConcurrentLinkedQueue to get the thread
RendererContext */
/* TODO: use System property to determine which mode use */
public static final boolean useThreadLocal = true;

Just change the flag (true/false) and rebuild your OpenJDK.
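For reference, the two modes amount to roughly the following sketch (RendererContext is simplified to a stand-in here; the real class holds the reusable edge/crossing arrays):

```java
import java.util.concurrent.ConcurrentLinkedQueue;

public class ContextReuseDemo {
    // Simplified stand-in for Pisces' per-thread RendererContext.
    static final class RendererContext {
        final float[] scratch = new float[4096];
    }

    static final boolean useThreadLocal = true; // the flag being benchmarked

    // Strategy 1: one context per thread, never contended.
    static final ThreadLocal<RendererContext> tlCtx =
            ThreadLocal.withInitial(RendererContext::new);

    // Strategy 2: a shared pool; contexts are borrowed and returned,
    // so idle threads don't each pin a context in memory.
    static final ConcurrentLinkedQueue<RendererContext> pool =
            new ConcurrentLinkedQueue<>();

    static RendererContext acquire() {
        if (useThreadLocal) {
            return tlCtx.get();
        }
        RendererContext ctx = pool.poll();
        return (ctx != null) ? ctx : new RendererContext();
    }

    static void release(RendererContext ctx) {
        if (!useThreadLocal) {
            pool.offer(ctx); // only the pooled mode returns contexts
        }
    }

    public static void main(String[] args) {
        RendererContext ctx = acquire();
        try {
            // ... render one shape using ctx.scratch ...
        } finally {
            release(ctx);
        }
        System.out.println("ok");
    }
}
```

The ThreadLocal mode avoids all contention but keeps one context alive per rendering thread; the queue mode shares a smaller set of contexts at the price of two atomic operations per shape.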

Laurent
Post by Andrea Aime
Post by Laurent Bourgès
Jim, Andrea & java2d members,
Laurent, I cannot comment from the code point of view (haven't really read
the patches), but from
the performance standpoint: amazing work!
I plan to give your patch another test next weekend on a full blown
GeoServer to see how it behaves in a full application.
Cheers
Andrea
Jim Graham
2013-05-08 02:26:44 UTC
Permalink
This is amazing work, Laurent! I'll look over the code changes soon.
Note that the "2 edge arrays" issue goes away if we can use native
methods and C structs. It may be faster still in that case...

...jim
Post by Laurent Bourgès
Jim, Andrea & java2d members,
Laurent Bourgès
2013-05-10 06:50:17 UTC
Permalink
Jim,

FYI, I am working on optimizing the 2 hotspot methods annotated by oprofile
(see specific emails):
- ScanLineIterator.next() ~ 35%
- Renderer.endRendering(...) ~ 20%

I think that the ScanLineIterator class is no longer useful and could be
merged into Renderer directly: I am trying to optimize these 2 code paths
(crossing / crossing -> alpha) but it seems quite difficult as I must
understand hotspot optimizations (assembly code)...

For now I want to keep pisces in Java code as hotspot is efficient enough
and probably the algorithm can be reworked a bit;
few questions:
- should edges be sorted by Ymax ONCE to avoid complete edges traversal to
count crossings for each Y value:

if ((bucketcount & 0x1) != 0) {
    int newCount = 0;
    for (int i = 0, ecur; i < count; i++) {
        ecur = ptrs[i];
        if (_edgesInt[ecur + YMAX] > cury) {
            ptrs[newCount++] = ecur;
        }
    }
    count = newCount;
}

- why multiply by 2 and divide by 2 the crossings (+ rounding issues)?

for (int i = 0, ecur, j; i < count; i++) {
    ecur = ptrs[i];
    curx = _edges[ecur /* + CURX */];
    _edges[ecur /* + CURX */] = curx + _edges[ecur + SLOPE];

    cross = ((int) curx) << 1;
    if (_edgesInt[ecur + OR] != 0 /* > 0 */) {
        cross |= 1;
    }
...
int lowx = crossings[0] >> 1;
int highx = crossings[numCrossings - 1] >> 1;
...
for (int i = 0; i < numCrossings; i++) {
    int curxo = crossings[i];
    int curx = curxo >> 1;
- last x pixel processing: could you explain it to me?
int pix_xmax = x1 >> SUBPIXEL_LG_POSITIONS_X;
int tmp = (x0 & SUBPIXEL_MASK_X);
alpha[pix_x] += SUBPIXEL_POSITIONS_X - tmp;
alpha[pix_x + 1] += tmp;
tmp = (x1 & SUBPIXEL_MASK_X);
alpha[pix_xmax] -= SUBPIXEL_POSITIONS_X - tmp;
alpha[pix_xmax + 1] -= tmp;

Finally, it seems that hotspot settings (-XX:CompileThreshold=1000 and
-XX:+AggressiveOpts) are able to compile these hotspots better ...
This is amazing work, Laurent! I'll look over the code changes soon. Note
that the "2 edge arrays" issue goes away if we can use native methods and C
structs. It may be faster still in that case...
Thanks; probably the edgeBucket / edgeBucketCount arrays could be merged
into a single one to improve cache affinity.

Let's stay in Java ... as hotspot is so efficient (until the contrary is
proven).

FYI, I can write C/C++ code but I have never practiced JNI.
Could somebody help us port only these 2 hotspot methods ?

PS: I am attending a conference next week (Germany) so I will be less available
to work on code but I will read my emails.

Laurent
Jim Graham
2013-05-10 23:13:11 UTC
Permalink
Hi Laurent,
Post by Laurent Bourgès
Jim,
I think that the ScanLineIterator class is no longer useful and could be
merged into Renderer directly: I am trying to optimize these 2 code paths
(crossing / crossing -> alpha) but it seems quite difficult as I must
understand hotspot optimizations (assembly code)...
I was originally skeptical about breaking that part of the process into
an iterator class as well.
Post by Laurent Bourgès
For now I want to keep pisces in Java code as hotspot is efficient
enough and probably the algorithm can be reworked a bit;
- should edges be sorted by Ymax ONCE to avoid complete edges traversal
if ((bucketcount & 0x1) != 0) {
    int newCount = 0;
    for (int i = 0, ecur; i < count; i++) {
        ecur = ptrs[i];
        if (_edgesInt[ecur + YMAX] > cury) {
            ptrs[newCount++] = ecur;
        }
    }
    count = newCount;
}
This does not traverse all edges, just the edges currently "in play" and
it only does it for scanlines that had a recorded ymax on them (count is
multiplied by 2 and then optionally the last bit is set if some edge
ends on that scanline so we know whether or not to do the "remove
expired edges" processing).
Post by Laurent Bourgès
- why multiply x2 and divide /2 the crossings (+ rounding issues) ?
for (int i = 0, ecur, j; i < count; i++) {
    ecur = ptrs[i];
    curx = _edges[ecur /* + CURX */];
    _edges[ecur /* + CURX */] = curx + _edges[ecur + SLOPE];

    cross = ((int) curx) << 1;
    if (_edgesInt[ecur + OR] != 0 /* > 0 */) {
        cross |= 1;
The line above sets the bottom bit if the crossing is one orientation
vs. the other so we know whether to add one or subtract one to the
winding count. The crossings can then be sorted and the orientation
flag is carried along with the values as they are sorted. The cost of
this trick is having to shift the actual crossing coordinates by 1 to
vacate the LSBit.
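A minimal sketch of that encode/sort/decode trick (standalone demo, not the actual Pisces code):

```java
import java.util.Arrays;

public class CrossingLsbDemo {
    // Encode a crossing: x coordinate shifted left by 1, with the LSB
    // carrying the edge orientation (1 = one winding direction, 0 = other).
    static int encode(int x, boolean up) {
        int cross = x << 1;
        return up ? (cross | 1) : cross;
    }

    public static void main(String[] args) {
        int[] crossings = { encode(7, false), encode(3, true), encode(5, false) };
        // Sorting the encoded values sorts by x; the orientation bit
        // rides along in the LSB without disturbing the order.
        Arrays.sort(crossings);
        for (int c : crossings) {
            int x = c >> 1;                      // decode the coordinate
            int delta = ((c & 1) != 0) ? 1 : -1; // winding delta from LSB
            System.out.println("x=" + x + " delta=" + delta);
        }
        // prints x=3 delta=1, x=5 delta=-1, x=7 delta=-1
    }
}
```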
Post by Laurent Bourgès
- last x pixel processing: could you explain me ?
int pix_xmax = x1 >> SUBPIXEL_LG_POSITIONS_X;
int tmp = (x0 & SUBPIXEL_MASK_X);
alpha[pix_x] += SUBPIXEL_POSITIONS_X - tmp;
alpha[pix_x + 1] += tmp;
tmp = (x1 & SUBPIXEL_MASK_X);
alpha[pix_xmax] -= SUBPIXEL_POSITIONS_X - tmp;
alpha[pix_xmax + 1] -= tmp;
Are you referring to the 2 += and the 2 -= for each end of the span? If
an edge crosses in a given pixel at 5 subpixel positions after the start
of that pixel, then it contributes a coverage of "SUBPIXEL_POS_X minus
5" in that pixel. But, starting with the following pixel, the total
coverage it adds for those pixels until it reaches the right edge of the
span is "SUBPIXEL_POSITIONS_X". However, we are recording deltas and
the previous pixels only bumped our total coverage by "S_P_X - 5". So,
we now need to bump the accumulated coverage by 5 in the following pixel
so that the total added coverage is "S_P_X".

Basically the pair of += lines adds a total S_P_X to the coverage, but
it splits that sum over two pixels - the one where the left edge first
appeared and its following pixel. Similarly, the two -= statements
subtract a total of S_P_X from the coverage total, and do so spread
across 2 pixels. If the crossing happened right at the left edge of the
pixel then tmp would be 0 and the second += or -= would be wasted, but
that only happens 1 out of S_P_X times and the cost of testing is
probably less than just adding tmp to the second pixel even if it is 0.

Also note that we need to have an array entry for alpha[max_x + 1] so
that the second += and -= don't store off the end of the array. We
don't need to use that value since we will stop our alpha accumulations
at the entry for max_x, but testing to see if the "second pixel delta
value" is needed is more expensive than just accumulating it into an
unused array entry.
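The delta scheme can be sketched as follows (standalone demo assuming SUBPIXEL_POSITIONS_X = 8; the names mirror the snippet above but this is not the actual Pisces code):

```java
public class CoverageDeltaDemo {
    static final int SUBPIXEL_LG_POSITIONS_X = 3;
    static final int SUBPIXEL_POSITIONS_X = 1 << SUBPIXEL_LG_POSITIONS_X; // 8
    static final int SUBPIXEL_MASK_X = SUBPIXEL_POSITIONS_X - 1;

    public static void main(String[] args) {
        // A span in subpixel coords: enters at x0, exits at x1.
        int x0 = 13, x1 = 37; // pixel 1, subpixel 5 -> pixel 4, subpixel 5
        int[] alpha = new int[8 + 1]; // one extra entry for pix_xmax + 1

        int pixX = x0 >> SUBPIXEL_LG_POSITIONS_X;
        int pixXmax = x1 >> SUBPIXEL_LG_POSITIONS_X;

        // Left edge: the S_P_X total is split over two pixels.
        int tmp = x0 & SUBPIXEL_MASK_X;
        alpha[pixX]     += SUBPIXEL_POSITIONS_X - tmp;
        alpha[pixX + 1] += tmp;
        // Right edge: subtract S_P_X, also spread across two pixels.
        tmp = x1 & SUBPIXEL_MASK_X;
        alpha[pixXmax]     -= SUBPIXEL_POSITIONS_X - tmp;
        alpha[pixXmax + 1] -= tmp;

        // Prefix-summing the deltas recovers per-pixel coverage:
        // pixel 1 -> 3, pixels 2-3 -> 8 (full), pixel 4 -> 5.
        int sum = 0;
        for (int i = 0; i < 8; i++) {
            sum += alpha[i];
            System.out.println("pixel " + i + " coverage " + sum);
        }
    }
}
```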
Post by Laurent Bourgès
Finally, it seems that hotspot settings (CompileThreshold=1000 and
-XX:aggressiveopts) are able to compile theses hotspots better ...
What about if we use the default settings as would most non-server apps?
Post by Laurent Bourgès
Thanks; probably the edgeBucket / edgeBucketCount arrays could be merged
into a single one to improve cache affinity.
Interesting.
Post by Laurent Bourgès
FYI, I can write C/C++ code but I never practised JNI code.
Does somebody could help us to port only these 2 hotspot methods ?
Port 2 Hotspot methods? I'm not sure what you are referring to here?
Post by Laurent Bourgès
PS: I am attending a conference next week (Germany) so I will be less
available to work on code but I will read my emails.
Thanks for the heads up - as long as you don't time out and switch off
of this project permanently - that would be a shame... :(

...jim
Laurent Bourgès
2013-05-29 09:01:45 UTC
Permalink
Andrea, Jim and java2d members,

I am going back to my work on java2d pisces improvements (scanline
rendering optimization).

Two comments:
- hotspot takes some time to optimize the very large
Renderer.endRendering() and ScanLineIterator.next() methods: probably
because these methods represent too much bytecode (many loops ...)
- I tried merging them but it seems that it becomes slower (too big a
method? or hotspot produces less optimized code according to its
heuristics ... warmup issue?)
- Do you think it could help hotspot if such methods are split into smaller
ones ?
- Many float operations are used to compute crossings that could be
replaced by DDA (integer ops) to improve performance ...
- does anybody have other ideas to improve algorithm efficiency like
winding rule handling (<< 1, >> 1, bit mask), edge eviction from active
edge list, float rounding operations) ?
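As a sketch of what a fixed-point DDA for the x crossings could look like (the 16.16 format is an arbitrary choice here; this is not Pisces code):

```java
public class FixedPointDdaDemo {
    // Track an edge's per-scanline x crossing in 16.16 fixed point instead
    // of a float, so the inner loop uses only integer adds and shifts.
    static final int FIX_SHIFT = 16;
    static final int FIX_ONE = 1 << FIX_SHIFT;

    public static void main(String[] args) {
        float x1 = 2.25f, y1 = 0f, x2 = 10.25f, y2 = 4f;
        float slope = (x2 - x1) / (y2 - y1); // 2.0 pixels per scanline

        int curx = (int) (x1 * FIX_ONE);    // 2.25 in 16.16
        int dxdy = (int) (slope * FIX_ONE); // 2.0 in 16.16

        for (int y = (int) y1; y < (int) y2; y++) {
            System.out.println("y=" + y + " x=" + (curx / (float) FIX_ONE));
            curx += dxdy; // the float add becomes one integer add
        }
        // prints x = 2.25, 4.25, 6.25, 8.25
    }
}
```

The trade-off is precision: 16 fractional bits bound the error per step, and very long edges accumulate it, so the format would need to be validated against the subpixel grid.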

Moreover, could anybody help me work on pisces (code, tests, benchmarks),
or at least review the last patch ?

Remaining items:
- clipping issues (dasher emits many segments even out of bounds)
- rounding issues: float to int conversions (biais or not)

Finally I propose to focus on optimizing these 2 algorithms (methods) as
they represent 55% of CPU time:
- optimize / modify algorithms: integer instead of float ops, reduce number
of condition statements (if) to improve branch prediction ...
- or / and port these methods into C code (JNI)
Post by Laurent Bourgès
I think that the ScanLineIterator class is no longer useful and could be
merged into Renderer directly: I am trying to optimize these 2 code paths
(crossing / crossing -> alpha) but it seems quite difficult as I must
understand hotspot optimizations (assembly code)...
I was originally skeptical about breaking that part of the process into an
iterator class as well.
I tried to merge these methods into Renderer.endRendering() but benchmark
results seem worse: probably hotspot generates less optimized code ...
For now I want to keep pisces in Java code as hotspot is efficient
Post by Laurent Bourgès
enough and probably the algorithm can be reworked a bit;
- should edges be sorted by Ymax ONCE to avoid complete edges traversal
if ((bucketcount & 0x1) != 0) {
    int newCount = 0;
    for (int i = 0, ecur; i < count; i++) {
        ecur = ptrs[i];
        if (_edgesInt[ecur + YMAX] > cury) {
            ptrs[newCount++] = ecur;
        }
    }
    count = newCount;
}
This does not traverse all edges, just the edges currently "in play" and
it only does it for scanlines that had a recorded ymax on them (count is
multiplied by 2 and then optionally the last bit is set if some edge ends
on that scanline so we know whether or not to do the "remove expired edges"
processing).
Agreed, this algorithm is the well-known edge eviction from the active edge
list => move it into a dedicated method ?
- why multiply x2 and divide /2 the crossings (+ rounding issues) ?
Post by Laurent Bourgès
for (int i = 0, ecur, j; i < count; i++) {
    ecur = ptrs[i];
    curx = _edges[ecur /* + CURX */];
    _edges[ecur /* + CURX */] = curx + _edges[ecur + SLOPE];

    cross = ((int) curx) << 1;
    if (_edgesInt[ecur + OR] != 0 /* > 0 */) {
        cross |= 1;
The line above sets the bottom bit if the crossing is one orientation vs.
the other so we know whether to add one or subtract one to the winding
count. The crossings can then be sorted and the orientation flag is
carried along with the values as they are sorted. The cost of this trick
is having to shift the actual crossing coordinates by 1 to vacate the LSBit.
Two comments:
- float ops used to compute the x crossing => use DDA (integer ops ?)
- store the orientation flag into another byte array to avoid shift
operations << 1 and >> 1 (later in the code) but the insertion sort should
then be smart to provide a sorted index mapping ...
- last x pixel processing: could you explain me ?
Post by Laurent Bourgès
int pix_xmax = x1 >> SUBPIXEL_LG_POSITIONS_X;
int tmp = (x0 & SUBPIXEL_MASK_X);
alpha[pix_x] += SUBPIXEL_POSITIONS_X - tmp;
alpha[pix_x + 1] += tmp;
tmp = (x1 & SUBPIXEL_MASK_X);
alpha[pix_xmax] -= SUBPIXEL_POSITIONS_X - tmp;
alpha[pix_xmax + 1] -= tmp;
Are you referring to the 2 += and the 2 -= for each end of the span? If
an edge crosses in a given pixel at 5 subpixel positions after the start of
that pixel, then it contributes a coverage of "SUBPIXEL_POS_X minus 5" in
that pixel. But, starting with the following pixel, the total coverage it
adds for those pixels until it reaches the right edge of the span is
"SUBPIXEL_POSITIONS_X". However, we are recording deltas and the previous
pixels only bumped our total coverage by "S_P_X - 5". So, we now need to
bump the accumulated coverage by 5 in the following pixel so that the total
added coverage is "S_P_X".
Basically the pair of += lines adds a total S_P_X to the coverage, but it
splits that sum over two pixels - the one where the left edge first
appeared and its following pixel. Similarly, the two -= statements
subtract a total of S_P_X from the coverage total, and do so spread across
2 pixels. If the crossing happened right at the left edge of the pixel
then tmp would be 0 and the second += or -= would be wasted, but that only
happens 1 out of S_P_X times and the cost of testing is probably less than
just adding tmp to the second pixel even if it is 0.
Also note that we need to have an array entry for alpha[max_x + 1] so that
the second += and -= don't store off the end of the array. We don't need
to use that value since we will stop our alpha accumulations at the entry
for max_x, but testing to see if the "second pixel delta value" is needed
is more expensive than just accumulating it into an unused array entry.
Thanks for the explanations: I understand now why I should clean the alpha
array more than I thought: xmax + 2 !
Finally, it seems that hotspot settings (CompileThreshold=1000 and
Post by Laurent Bourgès
-XX:+AggressiveOpts) are able to compile these hotspots better ...
What about if we use the default settings as would most non-server apps ?
As my Linux is 64-bit, I can only run benchmarks with the server compiler
(hotspot C2) ... I could also test with / without tiered compilation
(between client and server compiler) ...
Thanks; probably the edgeBucket / edgeBucketCount arrays could be merged
Post by Laurent Bourgès
into a single one to improve cache affinity.
Interesting.
It could be interesting to evaluate the CPU cache affinity related to all
these arrays: maybe 2D arrays could be packed into 1D arrays to improve
cache efficiency; likewise for the edge "struct":
is it better to keep int / float arrays instead of having an Edge object
(with an instance pool) ?
FYI, I can write C/C++ code but I have never practiced JNI.
Post by Laurent Bourgès
Could somebody help us port only these 2 hotspot methods ?
Port 2 Hotspot methods? I'm not sure what you are referring to here?
I mean implement the 'hotspot' methods i.e. implement the rendering
algorithms in C code using JNI to access Renderer fields.
Such C code could be compiled with optimizations (gcc -O2) but I have
doubts that it will be faster than the optimized code produced by the
hotspot C2 compiler ...

Maybe somebody could look at the assembly code generated by hotspot to
ensure these methods are 'well' optimized: I looked at it but I am not able
to evaluate if it is good (efficient) code.

PS: Andrea, did you run some benchmarks using the last patch ?

Laurent
Andrea Aime
2013-05-29 09:05:39 UTC
Permalink
Post by Laurent Bourgès
Finally I propose to focus on optimizing these 2 algorithms (methods) as
- optimize / modify algorithms: integer instead of float ops, reduce
number of condition statements (if) to improve branch prediction ...
I have been fancying a port to Java of the AGG lightweight rasterizer (
http://www.antigrain.com/download/index.html)
and indeed that code, besides producing high-quality antialiased output,
works with rescaled integers.
Might be worth having a look.

Cheers
Andrea
Laurent Bourgès
2013-05-29 12:05:24 UTC
Permalink
Seems very interesting ... I just looked at the documentation.

The AGG code would be worth looking at, but somebody should check whether
its license is compatible with the OpenJDK license terms.

Apparently AGG uses integers to convert float values (24.8 ie float are
multiplied by 256 and then converted to integers)
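
As a rough standalone sketch of such a 24.8 scheme (the helper names are
hypothetical, this is not AGG code): the value is scaled by 256, the high 24
bits hold the integer part and the low 8 bits the 1/256 fraction.

```java
public class Fixed24_8 {
    // 24.8 fixed point: 24 bits of integer part, 8 bits (1/256) of fraction.
    static int toFixed(double v) {
        return (int) Math.round(v * 256.0);
    }

    static double toDouble(int f) {
        return f / 256.0;
    }

    public static void main(String[] args) {
        int x = toFixed(3.125);   // 3.125 * 256 = 800
        int ipart = x >> 8;       // integer part: 3
        int frac = x & 0xFF;      // fraction in 1/256 units: 32
        System.out.println(ipart + " + " + frac + "/256 = " + toDouble(x));
    }
}
```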

Laurent
On Wed, May 29, 2013 at 11:01 AM, Laurent Bourgès <
Post by Laurent Bourgès
Finally I propose to focus on optimizing these 2 algorithms (methods) as
- optimize / modify algorithms: integer instead of float ops, reduce
number of condition statements (if) to improve branch prediction ...
I have been fancying a port to Java of the AGG lightweight rasterizer (
http://www.antigrain.com/download/index.html)
and indeed that code, besides producing high-quality antialiased output,
works with rescaled integers.
Might be worth having a look.
Cheers
Andrea
--
==
GeoServer training in Milan, 6th & 7th June 2013! Visit
http://geoserver.geo-solutions.it for more information.
==
Ing. Andrea Aime
@geowolf
Technical Lead
GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549
http://www.geo-solutions.it
http://twitter.com/geosolutions_it
-------------------------------------------------------
Andrea Aime
2013-05-29 12:19:15 UTC
Permalink
On Wed, May 29, 2013 at 2:05 PM, Laurent Bourgès
Post by Laurent Bourgès
Seems very interesting ... I just looked at the documentation.
AGG code would be worth to look at but somebody should check if their
license is compatible with openjdk license terms.
I'm not a lawyer, so take this with a grain of salt, but the most up-to-date
code base does not seem compatible; however,
the lightweight rasterizer I pointed you at is under the LGPL, so it
_should_ be compatible.
I recall that the rasterization technique comes from
libfreetype, so that's another one that might be worth
looking at (and I guess OpenJDK already depends on it to handle fonts under
Linux? not sure)

Cheers
Andrea
--
==
GeoServer training in Milan, 6th & 7th June 2013! Visit
http://geoserver.geo-solutions.it for more information.
==

Ing. Andrea Aime
@geowolf
Technical Lead

GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549

http://www.geo-solutions.it
http://twitter.com/geosolutions_it

-------------------------------------------------------
Phil Race
2013-05-29 17:12:00 UTC
Permalink
Laurent Bourgès
2013-05-30 07:38:53 UTC
Permalink
Phil,

As I am currently working hard on improving the Pisces rendering engine
(performance, and quality if possible), I am looking for interesting
algorithms and optimization tricks, i.e. I need to study different
approaches (subpixel rendering, rasterization, scanline rendering, coordinate
conversions).

I do not want openjdk to use the AGG or cairo libs, but I wanted to know
whether AGG is open source so that I am able (allowed) to look at its code to
get new ideas and tricks ...

I am still looking for volunteers to help me improve the pisces code, as only
Jim and Andrea participate in this topic and have some interest in
improving pisces.

Could someone explain to me who maintains the Java2D rendering engine and what
the roadmap is for new implementations (glg2d, jules using cairo ...) ?

Laurent

2013/5/29 Phil Race <***@oracle.com>
Laurent Bourgès
2013-06-11 12:21:57 UTC
Permalink
Dear Java2D members,

Sorry to ask, but does anybody care about improving Java2D rendering
(quality, performance) ?

I sent several emails for one month and nobody answered.

Should I consider that there is no interest and give up ?


I am still waiting for feedback and support before going on working on my
patch on pisces renderer.

Two more comments:
- coordinate conversions: float or integer computations (DDA) related to
subpixel coordinates: ceil(), floor() ...
Pisces uses 3x3 subpixels but it provides poor quality: many research
papers are using 4x4 (1/16 error) or 8x8 (1/64 error) subpixel masks to
increase the coverage precision (ratio of the pixel covered by the polygon)
Moreover, Pisces does not take into account the distance / error
between the mathematical edge position and the pixel grid.
Ideally the subpixel mask should be 16x16 => 1/256 coverage error but
it will lead to higher processing time.
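
To make the precision argument concrete, here is a small hand-rolled sketch
(not Pisces code) that estimates the coverage of one pixel cut by the vertical
line x = 0.30 by counting covered subpixel centers; the exact coverage is 0.30,
and the error shrinks as the grid gets finer:

```java
public class SubpixelCoverage {
    // Fraction of a unit pixel covered by the half-plane x < cutX,
    // estimated by counting covered subpixel centers on a sub x sub grid.
    static double coverage(double cutX, int sub) {
        int covered = 0;
        for (int sy = 0; sy < sub; sy++) {
            for (int sx = 0; sx < sub; sx++) {
                if ((sx + 0.5) / sub < cutX) { // subpixel center inside?
                    covered++;
                }
            }
        }
        return covered / (double) (sub * sub);
    }

    public static void main(String[] args) {
        System.out.println(coverage(0.30, 8));  // 0.25   -> error 0.05
        System.out.println(coverage(0.30, 16)); // 0.3125 -> error 0.0125
    }
}
```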

There is a compromise to be done between quality and speed: that's
exactly what java RenderingHints are for !

For these reasons, I propose to use integer maths (instead of floating
point operations) on edges in Renderer._endRendering():

    for (i = 0; i < numCrossings; i++) {
        ecur = _edgePtrs[i];
        f_curx = _edges[ecur /* + CURX */];

        // LBO: float ops: maybe use long integer ops (24.8)
        _edges[ecur /* + CURX */] = f_curx + _edges[ecur + _SLOPE];

        // note: left shift on crossing to store orientation flag
        // int cross = ((int) f_curx) << 1;
        cross = (ceil(f_curx)) << 1;
Comments and alternatives are welcome ...

- coverage / alpha mask computation: incorrect or too imprecise
A few software renderers recommend using 16x16 subpixels, or at least
1/256 position accuracy, and then computing edge coverage using line functions
with enough precision:
http://www.gris.uni-tuebingen.de/publics/paper/Schilling-1991-ANewSimple.pdf
http://www.antigrain.com/doc/introduction/introduction.agdoc.html#toc0002
http://en.literateprograms.org/Polygon_Rasterization_%28C%29

I will look at increasing the Renderer subpixel resolution (3x3, 4x4,
16x16) and see the performance impact but also the output quality.

An efficient optimization consists of processing multiple scan lines when
possible (~90% of the time), i.e. processing real pixels (3 or more scan
lines at a time) while the edge set is fixed (no edge added or removed) and
computing the triangle coverage between 2 edge functions => fewer loops and
more accurately computed coverage

Laurent
Laurent Bourgès
2013-06-17 15:18:06 UTC
Permalink
Jim,

I think I found the source of the 'poor' quality of line rendering:
the alpha coverage is only computed for the 2 subpixels (x0, x1) at the
current x-coordinate of an edge, i.e. it does not take into account the span
of a line having a very flat slope:

    for (i = 0, sum = 0, prev = bboxx0; i < numCrossings; i++) {
        curxo = _crossings[i];
        curx = curxo >> 1;

        // LBO: TODO: explain alpha computation: Jim, please ?
        ...
        if ((sum & mask) != 0) {
            x0 = (prev > bboxx0) ? prev : bboxx0; // Math.max(prev, bboxx0);
            x1 = (curx < bboxx1) ? curx : bboxx1; // Math.min(curx, bboxx1);

            if (x0 < x1) {
                x0 -= bboxx0; // turn x0, x1 from coords to indices
                x1 -= bboxx0; // in the alpha array.

                pix_x = x0 >> _SUBPIXEL_LG_POSITIONS_X;
                pix_xmaxm1 = (x1 - 1) >> _SUBPIXEL_LG_POSITIONS_X;

                if (pix_x == pix_xmaxm1) {
                    // Start and end in same pixel
                    tmp = (x1 - x0); // number of subpixels
                    _alpha[pix_x] += tmp;
                    _alpha[pix_x + 1] -= tmp;
                } else {
                    tmp = (x0 & _SUBPIXEL_MASK_X);
                    _alpha[pix_x] += _SUBPIXEL_POSITIONS_X - tmp;
                    _alpha[pix_x + 1] += tmp;

                    pix_xmax = x1 >> _SUBPIXEL_LG_POSITIONS_X;
                    tmp = (x1 & _SUBPIXEL_MASK_X);
                    _alpha[pix_xmax] -= _SUBPIXEL_POSITIONS_X - tmp;
                    _alpha[pix_xmax + 1] -= tmp;
                }
            }
        }

        // to turn {0, 1} into {-1, 1}, multiply by 2 and subtract 1.
        // int crorientation = ((curxo & 0x1) << 1) - 1;
        sum += ((curxo & 0x1) << 1) - 1; // crorientation;
        prev = curx;
    }
}

Here is a line test using GeneralPath(Line2D.float) to use pisces instead
of FillParallelogram renderer:
- pisces (8x8):
http://jmmc.fr/~bourgesl/share/java2d-pisces/linetest/LineTest_3.png
- pisces (256x256):
http://jmmc.fr/~bourgesl/share/java2d-pisces/linetest/LineTest_8.png

The artefacts come from the fact that the line spans several
subpixels, and the slope and the span width are not used at all !

I think it is possible to compute a better coverage for all alpha pixels
involved in a span (trapezoid):
for each edge at scanline y, it only needs curx and the previous curx
(to know how many subpixels the span crosses)

http://upload.wikimedia.org/wikipedia/commons/3/38/PolygonFillTrapezoidExample.png

Comments are welcome ...
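
The idea of spreading coverage over every pixel a flat span crosses can be
sketched as an exact interval overlap per pixel column (a hypothetical
standalone helper, not Pisces code):

```java
import java.util.Arrays;

public class SpanCoverage {
    // Distribute the coverage of a 1-subpixel-tall span [x0, x1[ over pixel
    // columns: each column gets the length of its overlap with the span,
    // instead of snapping everything to a single crossing x.
    static double[] spanCoverage(double x0, double x1, int width) {
        double[] alpha = new double[width];
        for (int px = 0; px < width; px++) {
            double left = Math.max(x0, px);
            double right = Math.min(x1, px + 1);
            if (right > left) {
                alpha[px] = right - left;
            }
        }
        return alpha;
    }

    public static void main(String[] args) {
        // A flat span entering at x=0.25 and leaving at x=2.5 covers
        // pixel 0 at 75%, pixel 1 fully and pixel 2 at 50%.
        System.out.println(Arrays.toString(spanCoverage(0.25, 2.5, 4)));
    }
}
```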
Post by Laurent Bourgès
- coordinate conversions: float or integer computations (DDA) related to
subpixel coordinates: ceil(), floor() ...
Pisces uses 3x3 subpixels but it provides poor quality: many
research papers are using 4x4 (1/16 error) or 8x8 (1/64 error) subpixel
masks to increase the coverage precision (ratio of the pixel covered by the
polygon)
Moreover, Pisces does not take into account the distance / error
between the mathematical edge position and the pixel grid.
Ideally the subpixel mask should be 16x16 => 1/256 coverage error
but it will lead to higher processing time.
I misunderstood the code: pisces uses an 8x8 subpixel grid (1 << 3), so every
coordinate has 1/8 precision (low), far from the 1/256 (like AGG) which is the
ultimate precision => many rasterizers use 24.8 (24 bits for the integer
coordinate, 8 bits for 1/256 precision) => DDA (32-bit integer
computations)

I will try soon to use 24.8 fixed point DDA to compute x-coordinates of
edge segments.

Laurent
Jim Graham
2013-07-02 02:17:47 UTC
Permalink
I'd have to look, but I think we could raise our precision in the X
direction without costing too much in performance since it just affects
the scale of our sub-pixel contributions per pixel. But, raising it in
the Y direction is where the performance starts to hurt us because it
means more differencing calculations per pixel-line.

In your examples were you raising both the X and Y sub-pixel resolutions
to 256, or just the X resolution? If you raised both to 256 then I
would think that you would end up with a better result...? I don't
think the X resolution will help much for "nearly horizontal" lines.

I agree that it would be nice if we had a more precise calculation of
sub-pixel coverages that did not rely on multi-sampling, and I believe
that a decent algorithm to calculate the trapezoidal coverage of a given
pixel would give us much better answers without any sample-count
performance drain.

I am skeptical about using fixed point for desktop situations. Often
the FPU section of the processor sits idle most of the time and the
integer unit is busy with performing calculations for the logic side of
the process. So, converting FP instructions to integer instructions
just increases that bottleneck rather than sharing it. On the other
hand, if you are constantly converting between FP and integer then you
have a different problem. We sort of run into that latter problem
because of the Java homogenous arrays which require us to store many
integer values in our single floating point array, but I think you had a
patch to stop doing that which would help there.

Also, 24.8 is a poor choice for fixed point. A 256 pixel long line
could be off by a full pixel by the end of tracing its points. If you
are doing 8-sub-sample AA then you would be off by a pixel after only 32
pixels.

When I've done fixed point in the past, I've usually used longs and done
32.32 processing. That would require 4 billion pixels before you are
off by a pixel. I think the conversion to "int" is also fairly fast
because it simply involves using the Upper Word of the result without
much need for shifting...

...jim
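
Jim's point about 24.8 vs 32.32 precision can be checked numerically (a
standalone sketch, with an arbitrarily chosen slope of 1/3): the quantized
24.8 slope drifts by about a full pixel over a 768-scanline edge, while the
32.32 slope stays essentially exact.

```java
public class FixedPointDrift {
    public static void main(String[] args) {
        double trueSlope = 1.0 / 3.0;
        int steps = 768; // scanlines traced by the DDA

        long slope24 = Math.round(trueSlope * (1L << 8));  // 85 in 24.8
        long slope32 = Math.round(trueSlope * (1L << 32)); // ~2^32/3 in 32.32

        // accumulated x error after 'steps' additions of the quantized slope
        double err24 = Math.abs(steps * (slope24 / 256.0 - trueSlope));
        double err32 = Math.abs(steps * (slope32 / 4294967296.0 - trueSlope));

        System.out.println(err24); // about 1.0 pixel off
        System.out.println(err32); // well under a millionth of a pixel
    }
}
```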
Post by Laurent Bourgès
Jim,
the alpha coverage is only computed for the 2 sub pixels (x0, x1) at the
current x-coordinate of an edge ie it does not take into account the span
for (i = 0, sum = 0, prev = bboxx0; i < numCrossings; i++) {
curxo = _crossings[i];
curx = curxo >> 1;
// LBO: TODO: explain alpha computation: Jim, please ?
...
if ((sum & mask) != 0) {
x0 = (prev > bboxx0) ? prev : bboxx0; //
Math.max(prev, bboxx0);
x1 = (curx < bboxx1) ? curx : bboxx1; //
Math.min(curx, bboxx1);
if (x0 < x1) {
x0 -= bboxx0; // turn x0, x1 from coords to
indices
x1 -= bboxx0; // in the alpha array.
pix_x = x0 >> _SUBPIXEL_LG_POSITIONS_X;
pix_xmaxm1 = (x1 - 1) >>
_SUBPIXEL_LG_POSITIONS_X;
if (pix_x == pix_xmaxm1) {
// Start and end in same pixel
tmp = (x1 - x0); // number of subpixels
_alpha[pix_x] += tmp;
_alpha[pix_x + 1] -= tmp;
} else {
tmp = (x0 & _SUBPIXEL_MASK_X);
* _alpha[pix_x] += _SUBPIXEL_POSITIONS_X -
tmp;
* * _alpha[pix_x + 1] += tmp;
*
pix_xmax = x1 >> _SUBPIXEL_LG_POSITIONS_X;
tmp = (x1 & _SUBPIXEL_MASK_X);
* _alpha[pix_xmax] -= _SUBPIXEL_POSITIONS_X
- tmp;
_alpha[pix_xmax + 1] -= tmp;
* }
}
}
// to turn {0, 1} into {-1, 1}, multiply by 2 and
subtract 1.
// int crorientation = ((curxo & 0x1) << 1) - 1;
sum += ((curxo & 0x1) << 1) - 1; // crorientation;
prev = curx;
}
}
Here is a line test using GeneralPath(Line2D.float) to use pisces instead
http://jmmc.fr/~bourgesl/share/java2d-pisces/linetest/LineTest_3.png
http://jmmc.fr/~bourgesl/share/java2d-pisces/linetest/LineTest_8.png
The artefacts comes from the fact that the line spans over several
subpixels and the slope and the span width is not used at all !
I think it is possible to compute a better coverage for all alpha pixels
for each edge at scanline y: it only needs to have curx and previous curx
(to know how many subpixel the span crosses)
http://upload.wikimedia.org/wikipedia/commons/3/38/PolygonFillTrapezoidExample.png
Comments are welcome ...
Post by Laurent Bourgès
- coordinate conversions: float or integer computations (DDA) related to
subpixel coordinates: ceil(), floor() ...
Pisces uses 3x3 subpixels but it provides poor quality: many
research papers are using 4x4 (1/16 error) or 8x8 (1/64 error) subpixel
masks to increase the coverage precision (ratio of the pixel covered by the
polygon)
Moreover, Pisces does not take into account the distance / error
between the mathematical edge position and the pixel grid.
Ideally the subpixel mask should be 16x16 => 1/256 coverage error
but it will lead to higher processing time.
I misunderstood the code: pisces uses 8x8 subpixel grid (1 << 3) so every
coordinate has a 1/8 precision (low) far from 1/256 (like AGG) which is the
ultimate precision => many rasterizer uses 24.8 (24 bits for integer
coordinates, 8 bits for 1/256 precision) => DDA (32 bits integer
computations)
I will try soon to use 24.8 fixed point DDA to compute x-coordinates of
edge segments.
Laurent
Jim Graham
2013-07-02 02:06:38 UTC
Permalink
Hi Laurent,
Post by Laurent Bourgès
Dear Java2D members,
Sorry to ask but does anybody care about improving java2D rendering
(quality, performance) ?
I sent several emails for one month and nobody answered.
Should I consider that there is no interest and give up ?
I apologize - I missed many of your emails as they auto-sort into my JDK
folders that I don't check very often these days. I'll have to work out
better sorting filters.

Also, I'd like to point out that there may not be many people who can
devote time to these efforts right now, but that doesn't mean that
nobody is interested in seeing performance gains on this code, so I hope
you don't get discouraged. I'll answer some of your specific questions
in other emails...

...jim
Jim Graham
2013-07-02 02:56:27 UTC
Permalink
Hi Laurent,
Post by Laurent Bourgès
- hotspot takes some time to optimize the very large
Renderer.endRendering() and ScalineIterator.next() methods: probably
because these methods represents too many byte codes (many loops ...)
- I tried merging them but it seems that it becomes slower (too big
method ? or hotspot produce less optimized code according to its
heuristics ... warmup issue ?)
- Do you think it could help hotspot if such methods are split in
smaller ones ?
Perhaps it is another nudge to consider a native port where we could
statically compile these critical methods with a high degree of
optimization?
Post by Laurent Bourgès
- Many float operations are used to compute crossings that could be
replaced by DDA (integer ops) to improve performance ...
Are you referring to fixed point implementations? Or something more
precise? I'm worried about error accumulation with fixed point
differencing across a large number of pixels.
Post by Laurent Bourgès
- clipping issues (dasher emits many segments even out of bounds)
- rounding issues: float to int conversions (biais or not)
Finally I propose to focus on optimizing these 2 algorithms (methods) as
- optimize / modify algorithms: integer instead of float ops, reduce
number of condition statements (if) to improve branch prediction ...
- or / and port these methods into C code (JNI)
Porting to C might help with a lot of the integer conversions that
happen when we try to stuff "all data into floats".
Post by Laurent Bourgès
I think that the ScanLineIterator class is no more useful and could be
merged into Renderer directly: I try to optimize these 2 code paths
(crossing / crossing -> alpha) but it seems quite difficult as I must
understand hotspot optimizations (assembler code)...
I was originally skeptical about breaking that part of the process
into an iterator class as well.
I tried to merge these methods into Renderer.endRendering() but the
benchmark results seem worse: probably hotspot generates less optimized
code ...
D'oh!
Post by Laurent Bourgès
For now I want to keep pisces in Java code as hotspot is efficient
enough and probably the algorithm can be reworked a bit;
- should edges be sorted by Ymax ONCE to avoid complete edges traversal
    if ((bucketcount & 0x1) != 0) {
        int newCount = 0;
        for (int i = 0, ecur; i < count; i++) {
            ecur = ptrs[i];
            if (_edgesInt[ecur + YMAX] > cury) {
                ptrs[newCount++] = ecur;
            }
        }
        count = newCount;
    }
This does not traverse all edges, just the edges currently "in play"
and it only does it for scanlines that had a recorded ymax on them
(count is multiplied by 2 and then optionally the last bit is set if
some edge ends on that scanline so we know whether or not to do the
"remove expired edges" processing).
Agreed, this algorithm is the well-known edge eviction from the active edge
list => move it into a dedicated method ?
That would depend on the "hotspot thinks this method is too big" issue
that we don't really fully understand, but it sounds like a good point
to investigate.
Post by Laurent Bourgès
- why multiply x2 and divide /2 the crossings (+ rounding issues) ?
    for (int i = 0, ecur, j; i < count; i++) {
        ecur = ptrs[i];
        curx = _edges[ecur /* + CURX */];
        _edges[ecur /* + CURX */] = curx + _edges[ecur + SLOPE];

        cross = ((int) curx) << 1;
        if (_edgesInt[ecur + OR] != 0 /* > 0 */) {
            cross |= 1;
The line above sets the bottom bit if the crossing is one
orientation vs. the other so we know whether to add one or subtract
one to the winding count. The crossings can then be sorted and the
orientation flag is carried along with the values as they are
sorted. The cost of this trick is having to shift the actual
crossing coordinates by 1 to vacate the LSBit.
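A standalone illustration of that packing trick (the crossing values here are
hypothetical): the orientation rides in the LSB, so a single int[] sort keeps
each flag attached to its coordinate.

```java
import java.util.Arrays;

public class CrossingPack {
    public static void main(String[] args) {
        // Pack each crossing x with its winding orientation in the LSB.
        int[] crossings = {
            (25 << 1) | 1, // x = 25, orientation +1
            (7 << 1),      // x = 7,  orientation -1
            (12 << 1) | 1, // x = 12, orientation +1
        };
        Arrays.sort(crossings); // sorts by x; the LSB only breaks exact ties

        for (int curxo : crossings) {
            int curx = curxo >> 1;                 // undo the shift
            int orient = ((curxo & 0x1) << 1) - 1; // {0,1} -> {-1,+1}
            System.out.println(curx + " " + orient);
        }
    }
}
```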
- float ops used to compute the x crossing => use DDA (integer ops ?)
Provided you can keep precision high enough to avoid errors. Also,
sometimes offloading some of your calculations to a float unit keeps
added parallelism in the code and moving the instructions back to int
can actually slow it down.
Post by Laurent Bourgès
- store the orientation flag into another byte array to avoid shift
operations << 1 and >> 1 (later in the code) but the insertion sort
should then be smart to provide a sorted index mapping ...
You then have added memory accesses and remember that we sort these
values so you'd have to move the flags along with the crossings. A
C-struct that had an int and a flag would avoid the shifts while making
it reasonable to sort just one array at the cost of storage complexity
and wasted memory for the flag to keep the structs aligned.
Post by Laurent Bourgès
Finally, it seems that hotspot settings (CompileThreshold=1000 and
-XX:+AggressiveOpts) are able to compile these hotspots better ...
What about if we use the default settings as would most non-server apps ?
As my linux is 64-bit, I can only run benchmarks with the server
compiler (hotspot c2) ... I could also test with / without tiered
compilation (between the client and server compilers) ...
We have to understand the performance on a variety of compilers.
Post by Laurent Bourgès
Thanks; probably the edgeBucket / edgeBucketCount arrays could be merged
into a single one to improve cache affinity.
Interesting.
It could be interesting to evaluate the cpu cache affinity related to
all these arrays: maybe 2D arrays could be packed into 1D arrays to
it is better to keep int / float arrays instead of having an Edge object
(with instance pool) ?
I'm a bit lost with that, but I think we could only know by prototyping
it. Also, perhaps we could store floats into int arrays using
toFloat/IntBits() faster than we could cast to integers? Note that
those methods for conversion to intbits are likely handled internally in
compiled code as simple "casts" (where the compiled code doesn't really
do anything to "cast" a value).
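A minimal sketch of that idea (field names hypothetical): a float field and an
int field stored side by side in one int[] via the raw bit view, with an exact
round trip and no lossy cast.

```java
public class FloatBitsPacking {
    // Hypothetical 2-slot edge record packed into a single int[]:
    // slot 0 holds a float (e.g. CURX) as raw bits, slot 1 a plain int (YMAX).
    public static void main(String[] args) {
        int[] edge = new int[2];
        edge[0] = Float.floatToRawIntBits(3.25f); // store the float bit pattern
        edge[1] = 42;                             // store the int directly

        float curx = Float.intBitsToFloat(edge[0]); // exact round trip
        System.out.println(curx + " " + edge[1]);
    }
}
```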
Post by Laurent Bourgès
FYI, I can write C/C++ code but I never practised JNI code.
Could somebody help us port just these 2 hotspot methods ?
Port 2 Hotspot methods? I'm not sure what you are referring to here?
I mean implement the 'hotspot' methods i.e. implement the rendering
algorithms in C code using JNI to access Renderer fields.
Such C code could be compiled with optimizations (gcc -O2) but I have
doubts that it will be faster than the optimized code produced by the hotspot
c2 compiler ...
I tried for JavaFX and it was a grueling process and I ended up with a
bastardized half-baked port. I would recommend doing it right by simply
implementing C code using the Java code as a guide.

...jim

Laurent Bourgès
2013-05-04 11:59:25 UTC
Permalink
Dear Clemens & Jim,

As you know I am currently improving pisces performance (a lot) and I would
like to compare its performance with your JulesRenderingEngine (cairo).

FYI, my latest pisces patch performs a bit slower than the native ductus
renderer with only 1 thread (only 10-20% slower), but it is much faster with
multiple threads:

pisces patch (1 hour ago):
test                                      threads  ops      Tavg      Tmed   stdDev       rms  Med+Stddev      min       max  TotalOps [ms/op]
dc_boulder_2013-13-30-06-13-17.ser              1   63    88.731    88.484    0.779    88.488      89.264   88.052   104.455        63
dc_boulder_2013-13-30-06-13-17.ser              2  126   103.945   103.760    1.776   103.775     105.536  102.465   128.479       126
dc_boulder_2013-13-30-06-13-17.ser              4  252   173.820   173.655    3.762   173.696     177.417  170.028   218.739       252
dc_shp_alllayers_2013-00-30-07-00-47.ser        1   25   762.982   760.039    3.484   760.047     763.523  755.093   838.546        25
dc_shp_alllayers_2013-00-30-07-00-47.ser        2   50   894.738   892.648   20.819   892.891     913.467  878.795  1011.004        50
dc_shp_alllayers_2013-00-30-07-00-47.ser        4  100  1510.789  1511.463   98.003  1514.637    1609.467  924.624  2030.892       100

Note: this latest patch uses rowAA optimizations = no more RLE encoding /
decoding: rowAA[32][] stores {x0 x1} {alpha sum values for range[x0; x1[ }

ductus:
test                                      threads  ops      Tavg      Tmed   stdDev       rms  Med+Stddev       min       max  TotalOps [ms/op]
dc_boulder_2013-13-30-06-13-17.ser              1  142    75.860    75.749    1.111    75.757      76.860    74.853    92.387       142
dc_boulder_2013-13-30-06-13-17.ser              2  284   131.802   131.508   21.045   133.181     152.553   103.277   243.345       284
dc_boulder_2013-13-30-06-13-17.ser              4  568   284.610   284.757   35.984   287.021     320.741    80.782   405.189       568
dc_shp_alllayers_2013-00-30-07-00-47.ser        1   20   912.910   912.634    1.827   912.635     914.460   909.023   921.782        20
dc_shp_alllayers_2013-00-30-07-00-47.ser        2   40  1484.377  1482.126   84.938  1484.558    1567.063  1371.135  1683.175        40
dc_shp_alllayers_2013-00-30-07-00-47.ser        4   80  5075.550  5086.851  133.047  5088.591    5219.898  3945.514  5324.156        80


Could you give me instructions on how to make the Jules renderer work ?

Apparently JulesRenderingEngine requires the xrender pipeline, the cairo
system library (linux), and it also loads a native library, jules.

How can I get the source code / build this jules library ?
Or could you send me a linux x64 binary package ?

Cheers,
Laurent