It's been a while, and we've been continuing to make progress...
The big news is that we - wait for it - doubled the number of digits we can calculate in a given amount of RAM! This is remarkable, and very satisfying. I think we now compute 1 bit of pi for each bit of spare RAM: that's a bit more than 2 digits per byte, which is about 10x more capacity than the usual spigot, which takes something like 5 bytes per digit.
How can we do that, when we have 2 big numbers, one being the Sum and the other the current Numerator? The first step was to make one of them big-endian, leaving the other little-endian. A quirk of the 6502 means that change was actually a small speedup.
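As a quick check on those figures, here is a back-of-envelope calculation in Python. It is only arithmetic on the numbers quoted in this post; the variable names are mine, and the 5 bytes per digit for the usual spigot is the rough figure given above, not a measurement.

```python
import math

# One decimal digit carries log2(10) bits of information (about 3.32).
bits_per_digit = math.log2(10)

# If every bit of spare RAM yields one bit of pi, a byte yields this many digits.
digits_per_byte = 8 / bits_per_digit

# Rough figure quoted in the post for the usual spigot.
spigot_bytes_per_digit = 5
capacity_ratio = spigot_bytes_per_digit * digits_per_byte

# RAM needed for the 135,035-digit run reported later in this post,
# counting only the bignum storage, not the program itself.
bytes_needed = math.ceil(135035 / digits_per_byte)

print(f"{digits_per_byte:.2f} digits per byte")
print(f"{capacity_ratio:.1f}x the capacity of a 5-bytes-per-digit spigot")
print(f"{bytes_needed} bytes of bignum storage for 135035 digits")
```

The last figure is consistent with the whole run fitting into somewhat less than 64k, program included.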
You can see the two numbers, in the top part of a Mode 4 screen, in this video that Dave made:
Bellard Pi Calculation for 750 Digits (57 seconds)
The top number is the Sum, the changing pixels in the first character line of the screen, nearly 320 bytes. It starts out full size but you'll notice the left hand end becomes static, ever more of it as the calculation proceeds. The black area below the left hand end - which is purely there to help us see what's going on - grows accordingly. So as time goes on, the left hand end of Sum becomes free space.
The shortening of the Sum is what makes the calculation proceed progressively more rapidly: there are ever fewer remaining decimal digits to be computed and to be wrung out of the binary Sum, and so we can leave off computing the least significant bytes, one by one, as they cease to be meaningful.
The other number, the Numerator, takes up character line 7 of the screen. In this case the fizzing pixels start out minimal, on the left, and grow towards the right, into black space. That black space isn't unused, but it's all zeros, so if we keep track of where the numbers stop and the zeros start, we don't need to use it.
If you look carefully in the later part of the computation, you'll see the active part of the Numerator starts to shrink too, this time towards the right. Again, it's a case of the least significant bytes ceasing to be significant - we won't be needing them. This is an extra speedup. We've been doing it for a while, but now we can see it happening.
In fact this visualisation was very useful - very nearly essential - to help us get our heads around what's going on and what we can do next.
We chose to patch the division routine in place, at the instant it needs to stop reading numbers and start working with zeros instead. In fact, we were able to patch the branch back too, to avoid executing a NOP needed to make the code the same length. A tiny speedup.
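The idea of "stop reading numbers and start working with zeros" can be sketched in Python. This is not our 6502 routine: it is a minimal model which assumes a big-endian number whose trailing bytes are known to be zero, and the function name and parameters are made up for illustration. Where the real code patches the loop in place, the sketch simply has two loops.

```python
def divide_with_implicit_zeros(active: bytes, total_len: int, divisor: int) -> bytes:
    """Divide a big-endian, fixed-width number by a small divisor.

    Only `active` (the stored prefix) is ever read; the remaining
    total_len - len(active) bytes are treated as zeros without being loaded.
    """
    quotient = bytearray(total_len)
    remainder = 0

    # Phase 1: bytes that really exist in memory.
    for i, byte in enumerate(active):
        remainder = (remainder << 8) | byte
        quotient[i], remainder = divmod(remainder, divisor)

    # Phase 2: past the end of the stored bytes, the input is all zeros,
    # so only the running remainder feeds the division.
    for i in range(len(active), total_len):
        remainder <<= 8
        quotient[i], remainder = divmod(remainder, divisor)

    return bytes(quotient)


# Usage: 0x01000000 divided by 3, reading only the single non-zero byte.
result = divide_with_implicit_zeros(bytes([1]), 4, 3)
print(result.hex())
```

The second loop never touches the number's storage, which is why the all-zero part of the Numerator does not need any RAM of its own.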
Not only does the interesting part of the Numerator start small, it also grows relatively slowly. So there is scope at this point for using a single area of RAM: the Numerator first, a tiny gap, the Sum. Making that change doubles our digits per RAM byte.
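Here is a toy Python model of that layout, just to show the bookkeeping. It assumes the Numerator sits at the low end of a single buffer and the Sum at the high end, as described above; the class, its method names, the buffer size and the growth rates in the usage example are all invented for illustration and are not taken from our code.

```python
class SharedBignumArea:
    """Toy model: Numerator at the low end, Sum at the high end of one buffer."""

    def __init__(self, size: int, numer_len: int, sum_len: int) -> None:
        if numer_len + sum_len > size:
            raise ValueError("the two numbers do not fit")
        self.ram = bytearray(size)
        self.numer_end = numer_len        # Numerator lives in ram[0:numer_end]
        self.sum_start = size - sum_len   # Sum lives in ram[sum_start:size]

    @property
    def gap(self) -> int:
        """Free bytes between the end of the Numerator and the start of the Sum."""
        return self.sum_start - self.numer_end

    def retire_sum_bytes(self, count: int) -> None:
        """The Sum's left-hand bytes have gone static: release them."""
        self.sum_start = min(self.sum_start + count, len(self.ram))

    def grow_numerator(self, count: int) -> None:
        """The Numerator needs more bytes on its right-hand side."""
        if count > self.gap:
            raise MemoryError("Numerator would overwrite live Sum bytes")
        self.numer_end += count


# Usage with made-up rates: the Sum frees 2 bytes per step, the Numerator takes 1.
area = SharedBignumArea(size=322, numer_len=1, sum_len=320)
for step in range(100):
    area.retire_sum_bytes(2)
    area.grow_numerator(1)
print(area.numer_end, area.sum_start, area.gap)
```

The scheme works as long as the Sum gives up bytes at least as fast as the Numerator claims them, which is the condition `grow_numerator` checks.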
Here's a video where the two bignums are overlaid: look carefully and you can see where they are active and how the gap moves.
Bellard Pi Calculation with Overlapping Bignums (49 seconds)
Dave was also able to juggle our memory usage so we can load down at 0400 on a second processor - normally 0800. On a Beeb we normally load at 1900 but can safely load at 1100 with DFS/MMFS. We can get more digits from the code by loading the program lower.
For more digits, smaller code is better, so we were able to replace the separate addition and subtraction versions of the division with a single version which is patched before execution. This even gave us a speedup because we saw a better way to subtract!
Another way to have smaller code is to strip out some speedups, so we made that configurable: instead of up to 4 division routines for different sizes of numerator, we can always use the biggest.
We can also swap back to the slower but slightly smaller program which uses the BBP algorithm, which gives us slightly more digits because it is smaller.
In fact we can now compute so many digits that we had to upgrade some of the maths routines to allow an extra byte.
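For anyone wanting to compare the two series, here is a short Python sketch of the standard published BBP and Bellard formulas, using exact fractions. It bears no resemblance to the 6502 code, which works with fixed-size bignums rather than rationals; the function names are mine. Each Bellard term contributes about 10 bits against 4 for BBP, at the cost of seven small divisions per term instead of four.

```python
from fractions import Fraction


def bbp_partial_sum(terms: int) -> Fraction:
    """Partial sum of the Bailey-Borwein-Plouffe series for pi."""
    total = Fraction(0)
    for k in range(terms):
        total += Fraction(1, 16**k) * (
            Fraction(4, 8 * k + 1) - Fraction(2, 8 * k + 4)
            - Fraction(1, 8 * k + 5) - Fraction(1, 8 * k + 6))
    return total


def bellard_partial_sum(terms: int) -> Fraction:
    """Partial sum of Bellard's series for pi."""
    total = Fraction(0)
    for n in range(terms):
        total += Fraction((-1) ** n, 2 ** (10 * n)) * (
            -Fraction(32, 4 * n + 1) - Fraction(1, 4 * n + 3)
            + Fraction(256, 10 * n + 1) - Fraction(64, 10 * n + 3)
            - Fraction(4, 10 * n + 5) - Fraction(4, 10 * n + 7)
            + Fraction(1, 10 * n + 9))
    return total / 64


def leading_digits(value: Fraction, decimals: int) -> int:
    """Return floor(value * 10**decimals) as an integer."""
    return (value.numerator * 10**decimals) // value.denominator


# 20 decimal places need about 67 bits: 10 Bellard terms or 25 BBP terms is plenty.
print(leading_digits(bellard_partial_sum(10), 20))
print(leading_digits(bbp_partial_sum(25), 20))
```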
Dave reported on 21 July:
To be clear, the Pi is just running as a 64k second processor - we've fit the 135000 digits calculation into somewhat less than 64k.
The BBP run has just finished: 135,035 digits (on a Pi Zero 2 W), all correct!
52762887687825004850548005973230753265227792552419913159617911522069419685479187
34156699781096702562993993208164507174173490564339865219986639055709352119852439
06798615021448623928438739820187602285471230394945966157258750965032007124766575
93813721248011341535506167547203695791055974610671125417117453695430147191419937
31972279716902116135726252431164722893666441426212438549813623694963571282116036
85441607108231775107801298304253814190892249208595364610821395648113205316073707
77207605599349815034240640775123315121589992462974978454743857855952270892671024
791991996450430401660056217629623401492821816115205046438140512010176327979
6986.08 secs
Summary for BBP Pi Spigot
100 0.07 secs
1000 0.72 secs
3000 3.71 secs
135035 6986.08 secs
BASIC
>
I think I calculated this as just over two weeks, if run on a back-in-the-day Master Turbo. Or two hours today with PiTubeDirect.
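The timings in that summary can be checked with a little arithmetic. This Python snippet uses only the figures printed above; treating "two weeks" as exactly 14 days is my rounding, so the implied speed ratio is only a rough indication.

```python
import math

# (digits, seconds) pairs copied from the "Summary for BBP Pi Spigot" output.
timings = [(100, 0.07), (1000, 0.72), (3000, 3.71), (135035, 6986.08)]

# Estimate how run time scales with digit count from the two largest runs:
# time ~ digits ** exponent.
(d1, t1), (d2, t2) = timings[-2], timings[-1]
exponent = math.log(t2 / t1) / math.log(d2 / d1)

# The long run in hours, and the speed ratio implied by a two-week estimate.
hours = t2 / 3600
two_weeks_in_seconds = 14 * 24 * 3600
implied_speedup = two_weeks_in_seconds / t2

print(f"scaling exponent: {exponent:.2f}")
print(f"long run: {hours:.2f} hours")
print(f"implied speed ratio: {implied_speedup:.0f}x")
```

An exponent close to 2 is what you would expect when both the number of terms and the length of the bignums grow in proportion to the digit count.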
Our aim is always to calculate correct digits of pi - but around this point we noticed that some of the time - less than 10% - the final digit could be out by one. The next instalment will (probably) explain what we did next.
Statistics: Posted by BigEd — Sun Aug 04, 2024 2:58 pm