
106 points | decodebytes | 1 comment
dcreater No.45390565
Are there good synthetic datasets generated with DeepFabric publicly available?
replies(1): >>45390593 #
decodebytes No.45390593
Sure, just starting to get some up on HF. A good example might be GSM8K, as it shows the structured output where every result is strictly formatted. I am using this right now to train models and managing to get a small Qwen model up into the 60% range, which, wildly, is higher than Llama 2 and xAI's Grok 1 (rough loading/format-check sketch after the links below).

GSM8K: https://huggingface.co/datasets/lukehinds/deepfabric-GSM8K-c...

Also some others:

Infra failures reasoning / CoT: https://huggingface.co/datasets/lukehinds/deepfabric-devops-...

Medical (multi-turn): https://huggingface.co/datasets/lukehinds/deepfabric-7k-medi...

Programming challenges: https://huggingface.co/datasets/lukehinds/programming-challe...
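
A rough sketch of loading one of these and sanity-checking the strict GSM8K-style "#### <answer>" formatting; the repo id is a placeholder for one of the full links above, and the split name and the question/answer field names are assumptions, not confirmed from the dataset cards:

    # Sketch only: the repo id below is a placeholder for one of the
    # datasets linked above, and the "train" split plus the
    # "question"/"answer" field names are assumptions.
    from datasets import load_dataset

    ds = load_dataset("lukehinds/<deepfabric-dataset>", split="train")

    def final_answer(text):
        # GSM8K convention: the worked solution ends with "#### <number>"
        return text.split("####")[-1].strip() if "####" in text else None

    row = ds[0]
    print(row["question"][:200])
    print("final answer:", final_answer(row["answer"]))

    # Strictness check: every row should carry a parseable final answer
    missing = sum(1 for r in ds if final_answer(r["answer"]) is None)
    print(f"{missing} of {len(ds)} rows lack a '#### <answer>' line")
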

If there is anything in particular you need, drop me a message or feel free to open an issue and I can create something for you.

replies(1): >>45392383 #
dcreater No.45392383
Thanks, what LLMs were used to create these?
replies(1): >>45404324 #
decodebytes No.45404324
I think it was gpt4-mini, but local models do surprisingly well too.
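
On the "local models" point, a minimal, hypothetical sketch of generating one GSM8K-style sample from a local model behind an OpenAI-compatible endpoint; this is not DeepFabric's own pipeline, and the endpoint URL, model tag, and prompt are all assumptions for illustration:

    # Not the DeepFabric pipeline: a generic illustration of prompting a
    # local model via an OpenAI-compatible server (e.g. Ollama on its
    # default port). Endpoint, model tag, and prompt are assumptions.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

    prompt = (
        "Write one grade-school math word problem, then a step-by-step "
        "solution that ends with the final numeric answer on its own "
        "line as '#### <answer>'."
    )

    resp = client.chat.completions.create(
        model="qwen2.5:7b",  # assumed local model tag
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    print(resp.choices[0].message.content)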