https://github.com/blixt/gsm8k-math-solver

An experiment to solve GSM8K with local models.

# gsm8k-math-solver

To run, first make sure you have [Yarn](https://yarnpkg.com/) and
[Ollama](https://ollama.com/) installed.

Then, clone the repository with submodules:

```sh
git clone --recurse-submodules https://github.com/blixt/gsm8k-math-solver.git
```

Install dependencies:

```sh
yarn install
```

Finally, open the folder in VS Code, go to "Run and Debug" and then "Run" and
you should see output in the Debug Console.

If you don't use VS Code, just run with `yarn run start`.

You can also pass a question as an argument to the script, for example:

```sh
yarn run start "How many words does this prompt contain?"
```

Which should result in output like this:

[Image: screenshot of the script's output]

## The theory

By making the LLM write executable JavaScript and also show how that code
evaluates step by step, we can solve more complex problems: code naturally
splits a problem into logical units, and the LLM has seen enough code to
predict its behavior.

Furthermore, the comments contain the LLM's own answers, while actually
evaluating the code gives an independent result. This allows a kind of
cross-referencing to verify that the LLM is consistent. If there is a mismatch,
one could conceivably get the LLM to self-correct.
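A minimal sketch of that cross-referencing, not the code from this repository: the `checkFinalPrediction` helper is hypothetical, and it assumes the LLM ends its program with a `console.log(...)` line whose trailing comment has the form `// [n] "predicted output"`, as in the examples further down.

```javascript
// Hypothetical helper: run LLM-generated code and compare what it actually
// prints with the output the LLM predicted in its trailing comment.
function checkFinalPrediction(source) {
  // Capture console.log output instead of printing it.
  const printed = []
  const fakeConsole = { log: (...args) => printed.push(args.join(' ')) }

  // new Function runs the code in its own function scope, with our fake
  // console shadowing the real one. Only do this with code you trust, or
  // inside a proper sandbox.
  new Function('console', source)(fakeConsole)

  // Assumed comment format: console.log(...) // [n] "predicted output"
  const pattern = /console\.log\(.*\).*\/\/ \[\d+\] "(.*)"[ \t]*$/gm
  const matches = [...source.matchAll(pattern)]
  const predicted = matches.length > 0 ? matches[matches.length - 1][1] : null
  const actual = printed.length > 0 ? printed[printed.length - 1] : null

  return { predicted, actual, consistent: predicted !== null && predicted === actual }
}

// A shortened, made-up stand-in for what the LLM might generate.
const generated = [
  'const maxLoad = 700 // [1] 700',
  'const totalWeight = 9 * 80 // [2] 9 * 80 === 720',
  'const excessLoad = totalWeight - maxLoad // [3] 720 - 700 === 20',
  'console.log(`Exceeded by ${excessLoad} kg.`) // [4] "Exceeded by 20 kg."',
].join('\n')

const result = checkFinalPrediction(generated)
console.log(result.consistent ? 'Prediction matches the real output.' : 'Mismatch!')
```

When `consistent` is `false`, the `predicted` and `actual` values could be fed back to the LLM as a prompt for self-correction.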

## The reality

It seems the most recent LLMs (as of writing this: Llama 3, GPT-4 Turbo, Claude
3 Opus) have learned to navigate multi-step reasoning well enough to answer
most of the GSM8K problems as correctly as any framework would. The code that
the LLM is encouraged to write with the framework presented here does sometimes
end up being more correct than what the LLM can reason its way to. However, the
main reason LLMs fail, equally with or without this framework, is
misinterpreting nuances in the problem (such as whether a value should be
compounding or not).

Even less capable LLMs are right some percentage of the time, so it seems
reasonable to assume that some form of sampling would work, though most likely
not majority sampling. Most frameworks will therefore probably only raise the
model's capability by a relatively small percentage, while improvements from
one model version to the next seem to raise it by a lot more.
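As a rough illustration of why majority sampling may not help a weaker model, here is a minimal sketch. The `majorityVote` helper and the sample answers are made up for illustration and are not measurements from this experiment.

```javascript
// Count how often each sampled answer occurs and return the most common one.
function majorityVote(answers) {
  const counts = new Map()
  for (const answer of answers) {
    counts.set(answer, (counts.get(answer) ?? 0) + 1)
  }
  let best = null
  let bestCount = 0
  for (const [answer, count] of counts) {
    if (count > bestCount) {
      best = answer
      bestCount = count
    }
  }
  return best
}

// Made-up samples for a problem whose correct answer is assumed to be 2.
const samples = [1, 1, 2, 1, 2]
const voted = majorityVote(samples)
const correctWasSampled = samples.includes(2)

console.log(voted) // 1: the vote picks the more common mistake
console.log(correctWasSampled) // true: the correct answer was among the samples
```

If the model is right less than half of the time and its mistakes cluster on one wrong answer, the vote amplifies the mistake, so a different selection step (for example, checking each sample by running its code) would be needed.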

Finally, as far as I can tell, the GSM8K data set has by now mostly been
learned by LLMs, so it's not even a good benchmark anymore. For example, if you
just ask Llama 3 70B one of the problems in the GSM8K test set, you will not
only get a reasonable-looking answer, you will even get the characteristic
`#### 123` line at the end, which is exactly how solutions in GSM8K are
formatted.
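One way to spot that marker automatically is sketched below. The `extractGsm8kAnswer` helper and the sample replies are hypothetical, and finding the marker is only a hint that the model may have seen the data set, not proof.

```javascript
// Hypothetical helper: if the last line of a reply looks like "#### 123",
// return the number, otherwise return null.
function extractGsm8kAnswer(reply) {
  const lines = reply.trim().split('\n')
  const lastLine = lines[lines.length - 1].trim()
  const match = lastLine.match(/^#### (-?[\d,]+(?:\.\d+)?)$/)
  // Strip thousands separators before converting to a number.
  return match ? Number(match[1].replace(/,/g, '')) : null
}

// Made-up replies for illustration.
const suspicious = 'The total weight is 9 * 80 = 720 kg.\nThey exceed the limit by 20 kg.\n#### 20'
const ordinary = 'They will exceed the maximum load by 20 kg.'

console.log(extractGsm8kAnswer(suspicious)) // 20
console.log(extractGsm8kAnswer(ordinary)) // null
```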

## Further investigation

It would be interesting to see this framework used for other types of problems,
such as using the evaluation comments to get the LLM to verify the code that it
writes, for example a unit test, which should be easy to run and check.

The GSM8K data set contains a lot of algebraic problems, which are not a good
match for JavaScript. A better idea might have been to use Python with a
library like SymPy available to the LLM, so that it doesn't have to use loops.

## Examples

### Question 1

(680) The elevator in Jack's building supports a maximum load of 700 kg. An
adult weighs an average of 80 kg. If Jack rides the elevator with 8 other
adults, by how much will they have exceeded the maximum load of the elevator?

### Solution 1 by Llama 3 70B 🟢

```javascript
const maxLoad = 700 // [1] 700
const adultWeight = 80 // [2] 80
const totalAdults = 9 // [3] 9 (Jack + 8 others)
const totalWeight = totalAdults * adultWeight // [4] 9 * 80 === 720
const excessLoad = totalWeight - maxLoad // [5] 720 - 700 === 20
console.log(`They will have exceeded the maximum load by ${excessLoad} kg.`) // [6] "They will have exceeded the maximum load by 20 kg."
```

Note that both the code and the comments have been generated by the LLM. In
theory this code can also be run to verify that the LLM "evaluated" the code
correctly.

### Question 2

(1103) A jellyfish tank has numerous jellyfish in it. A fifth of the jellyfish
are large, and a third of the large jellyfish change color from green to blue
under UV light. The other jellyfish are small and always stay blue. When a UV
light turned on, 6 jellyfish changed color. How many jellyfish are in the tank?

### Solution 2 by Llama 3 70B 🟡

In this solution, the LLM chose to use a loop to search for the answer, which
made the output very verbose. Most likely this could've been done much more
elegantly. Note that the `(...)` is also generated by the LLM, and that the LLM
doesn't always solve this question.

```javascript
let totalJellyfish = 0; // [1] 0
while (true) { // [2] (true)
  const largeJellyfish = totalJellyfish / 5; // [3] totalJellyfish / 5 === (totalJellyfish is unknown)
  const blueLargeJellyfish = largeJellyfish / 3; // [4] largeJellyfish / 3 === (largeJellyfish is unknown)
  if (blueLargeJellyfish === 6) { // [5] (blueLargeJellyfish === 6) === false
    break; // [6] n/a
  }
  totalJellyfish++; // [7] totalJellyfish + 1 === (totalJellyfish is unknown)
}
// Loop continues:
// [2] while (true) === true
// [3] largeJellyfish = totalJellyfish / 5 === 15 / 5 === 3
// [4] blueLargeJellyfish = 3 / 3 === 1
// [5] if (1 === 6) === false
// [6] n/a
// [7] totalJellyfish = 15 + 1 === 16
// (...)
// Loop continues:
// [2] while (true) === true
// [3] largeJellyfish = totalJellyfish / 5 === 30 / 5 === 6
// [4] blueLargeJellyfish = 6 / 3 === 2
// [5] if (2 === 6) === false
// [6] n/a
// [7] totalJellyfish = 30 + 1 === 31
// (...)
// Loop continues:
// [2] while (true) === true
// [3] largeJellyfish = totalJellyfish / 5 === 45 / 5 === 9
// [4] blueLargeJellyfish = 9 / 3 === 3
// [5] if (3 === 6) === false
// [6] n/a
// [7] totalJellyfish = 45 + 1 === 46
// (...)
// Loop continues:
// [2] while (true) === true
// [3] largeJellyfish = totalJellyfish / 5 === 60 / 5 === 12
// [4] blueLargeJellyfish = 12 / 3 === 4
// [5] if (4 === 6) === false
// [6] n/a
// [7] totalJellyfish = 60 + 1 === 61
// (...)
// Loop continues:
// [2] while (true) === true
// [3] largeJellyfish = totalJellyfish / 5 === 75 / 5 === 15
// [4] blueLargeJellyfish = 15 / 3 === 5
// [5] if (5 === 6) === false
// [6] n/a
// [7] totalJellyfish = 75 + 1 === 76
// (...)
// Loop continues:
// [2] while (true) === true
// [3] largeJellyfish = totalJellyfish / 5 === 90 / 5 === 18
// [4] blueLargeJellyfish = 18 / 3 === 6
// [5] if (6 === 6) === true
// [6] break
// Loop ended.
console.log(`There are ${totalJellyfish} jellyfish.`); // [8] "There are 90 jellyfish."
```
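For comparison, a hand-written version (not LLM output) can work backwards from the 6 jellyfish that changed color without any loop. This is only a sketch of the more direct approach hinted at above.

```javascript
const changedColor = 6 // jellyfish that changed color under UV light
const largeJellyfish = changedColor * 3 // a third of the large ones change color
const totalJellyfish = largeJellyfish * 5 // a fifth of all jellyfish are large
console.log(`There are ${totalJellyfish} jellyfish.`) // "There are 90 jellyfish."
```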

### Question 3

(1176) A teacher uses a 5-inch piece of chalk to write math equations on a
chalkboard for his students. The teacher likes to conserve chalk, so he tries to
only use 20% of the chalk each day. Since the teacher cannot write with a very
small piece of chalk, he recycles the chalk when it is smaller than 2 inches. On
Monday the teacher used a new piece of chalk. His students need extra help that
day, so he ended up writing more than usual. He used up 45% of the chalk by the
end of the day. If the teacher goes back to using only 20% of the chalk each
day, how many days does he have before he has to recycle this piece?

### Solution 3 by Llama 3 70B 🔴

While the code below does output the correct answer (2 days) when run, the LLM
failed to "evaluate" the code correctly: its trace invents a chalk length of
1.42 and its final comment predicts 3 days. Furthermore, the LLM often fails to
solve this problem, as it tends to miss that the chalk length should decrease by
20% of its current length, not of the original length.

```javascript
const originalChalkLength = 5 // [1] 5
let chalkLength = originalChalkLength // [2] 5
chalkLength *= (1 - 0.45) // [3] 5 * (1 - 0.45) === 2.75
let daysBeforeRecycling = 0 // [4] 0

while (chalkLength >= 2) { // [5] while (2.75 >= 2) === true
  chalkLength *= (1 - 0.20) // [6] 2.75 * (1 - 0.20) === 2.2
  daysBeforeRecycling++ // [7] 0 + 1 === 1
  // Loop continues:
  // [5] while (2.2 >= 2) === true
  // [6] 2.2 * (1 - 0.20) === 1.76
  // [7] 1 + 1 === 2
  // (...);
  // Loop continues:
  // [5] while (1.42 >= 2) === false
  // [8] break;
  // Loop ended.
}
console.log(`The teacher has ${daysBeforeRecycling} days before he has to recycle this piece.`); // [9] "The teacher has 3 days before he has to recycle this piece."
```
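To make the two readings of "20% of the chalk" concrete, here is a hand-written sketch (not LLM output) that counts the days under each interpretation. The `daysUntilRecycle` helper is hypothetical and follows the same loop structure as the solution above.

```javascript
const RECYCLE_BELOW = 2 // inches; the chalk is recycled when it is shorter than this

// Hypothetical helper: count days of use while the chalk is still long enough.
function daysUntilRecycle(startLength, useChalk) {
  let length = startLength
  let days = 0
  while (length >= RECYCLE_BELOW) {
    length = useChalk(length)
    days++
  }
  return days
}

const original = 5
const afterMonday = original * (1 - 0.45) // about 2.75 inches left

// Intended reading: each day uses 20% of what is currently left.
const compounding = daysUntilRecycle(afterMonday, (length) => length * (1 - 0.2))
// Common misreading: each day uses 20% of the original 5 inches (1 inch).
const fixed = daysUntilRecycle(afterMonday, (length) => length - original * 0.2)

console.log({ compounding, fixed }) // { compounding: 2, fixed: 1 }
```

Only the compounding reading gives the expected answer of 2 days, which is why a small misinterpretation of the problem text is enough to produce a wrong result even when the code itself runs correctly.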