Two years ago, we built a benchmark to evaluate multistep reasoning, tool use, and logical capabilities in language models. It includes a quality measure for scoring model performance and is built on a plugin system we developed for SymbolicAI.
- Benchmark & Plugin System: https://github.com/ExtensityAI/benchmark
- Example Eval: https://github.com/ExtensityAI/benchmark/blob/main/src/evals...
We've also implemented some interesting concepts in our framework:
- C#-style Extension Methods in Python: using GlobalSymbolPrimitive to extend existing types with new functionality (a minimal sketch of the pattern follows the link below).
- https://github.com/ExtensityAI/benchmark/blob/main/src/func.py#L146
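Conceptually, the pattern works like the snippet below: a free function is registered as a method on an existing class, so call sites read as if it were defined there, much like C#'s `this`-parameter extension methods. This is a minimal sketch; the `Symbol` stand-in and `extension` helper are hypothetical, not the actual GlobalSymbolPrimitive API (see func.py above for that):

    class Symbol:
        # Stand-in for the framework's core Symbol type.
        def __init__(self, value):
            self.value = value

    def extension(cls):
        # Hypothetical helper: registers a free function as a
        # method on an existing class, C#-extension-method style.
        def register(func):
            setattr(cls, func.__name__, func)
            return func
        return register

    @extension(Symbol)
    def shout(self) -> str:
        # Callable as if it were defined on Symbol itself.
        return str(self.value).upper() + "!"

    print(Symbol("hello").shout())  # -> HELLO!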
- Symbolic <> Sub-symbolic Conversion: we use this for quality metrics, e.g. deriving a reward signal from the path integral of multistep generations (a toy sketch follows the link below).
- https://github.com/ExtensityAI/benchmark/blob/main/src/func...
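To make the reward idea concrete, here is a toy sketch (not the benchmark's actual metric): score each step of a generation against a reference, then aggregate the log-scores over the whole trajectory, so the result behaves like a path integral where any weak step drags the total down. A Jaccard word overlap stands in for a real embedding similarity:

    import math

    def step_similarity(generated: str, reference: str) -> float:
        # Toy per-step score in (0, 1]; a real system would compare
        # embeddings of the generated and reference steps instead.
        g = set(generated.lower().split())
        r = set(reference.lower().split())
        return max(len(g & r) / len(g | r), 1e-6)

    def trajectory_reward(gen_steps, ref_steps):
        # Sum log-scores along the path, then exponentiate the mean:
        # a geometric mean over the trajectory, so one near-zero step
        # collapses the reward for the whole generation.
        logs = [math.log(step_similarity(g, r))
                for g, r in zip(gen_steps, ref_steps)]
        return math.exp(sum(logs) / len(logs))

    gen = ["define the symbols", "apply the rule", "simplify result"]
    ref = ["define all symbols", "apply the rule", "simplify the result"]
    print(round(trajectory_reward(gen, ref), 3))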
For fun, we integrated LLM-based tools into a customizable shell. Check out the Rick & Morty-styled rickshell:
- RickShell: https://github.com/ExtensityAI/rickshell
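The core wiring is simpler than it sounds: a REPL that routes some inputs to a model and the rest to the system shell. A bare-bones sketch follows; the `llm` stub stands in for whatever chat-completion client you use, and rickshell itself layers theming and richer command handling on top:

    import shlex
    import subprocess

    def llm(prompt: str) -> str:
        # Stub: replace with a real chat-completion call.
        return f"(model reply to: {prompt})"

    def repl():
        # Lines starting with '?' go to the model; everything
        # else runs as an ordinary shell command.
        while True:
            try:
                line = input("rick> ").strip()
            except EOFError:
                break
            if not line:
                continue
            if line.startswith("?"):
                print(llm(line[1:].strip()))
            else:
                try:
                    subprocess.run(shlex.split(line))
                except FileNotFoundError:
                    print(f"command not found: {line}")

    if __name__ == "__main__":
        repl()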
We were also among the first to generate a full research paper from a single prompt and continue to push the boundaries of AI-generated research:
- End-to-End Paper Generation (Examples): https://drive.google.com/drive/folders/1vUg2Y7TgZRRiaPzC83pQ...
- Recent AI Research Generation:
- Three-Body Problem: https://github.com/ExtensityAI/three-body_problem
- Primality Test: https://github.com/ExtensityAI/primality_test
- Twitter/X Post: https://x.com/DinuMariusC/status/1915521724092743997
Finally, for those interested in building similar services, we've had an open-source, MCP-like API endpoint service available for over a year:
- SymbolicAI API: https://github.com/ExtensityAI/symbolicai/blob/main/symai/en...
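If you want a feel for the shape of such a service, here is a hedged sketch using FastAPI: a single POST endpoint that would dispatch prompts to the framework's engines. The route and payload names are illustrative, not the actual SymbolicAI API surface (see the linked file for that):

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class Query(BaseModel):
        prompt: str

    @app.post("/v1/query")  # illustrative route, not the real one
    def query(q: Query):
        # The real service dispatches to the framework's engines;
        # here we just echo to keep the sketch self-contained.
        return {"result": f"processed: {q.prompt}"}

    # Run with: uvicorn server:app --port 8000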