

As a fan of dense New York Times-style crosswords, I challenged myself to create topic-specific puzzles. It turns out that generating crosswords and efficiently placing words is a non-trivial computational problem.

I started a project, "Joystick Jargon," combining traditional crossword elements with gaming-related vocabulary. Here's the technical process behind it:

1. Data Source: Used a 3.8-million-row Reddit dataset from Hugging Face (https://huggingface.co/datasets/webis/tldr-17).

2. Data Filtering: Narrowed down to gaming-related subreddits (r/gaming, r/dota2, r/leagueoflegends).

3. Keyword Extraction: Used BERT embeddings and cosine similarity to extract gaming keywords from the filtered posts (steps 1-3 are sketched in the first snippet after this list).

4. Data Preprocessing: Cleaned up data unsuitable for crossword puzzles.

5. Grid Generation: Implemented a heuristic crossword algorithm to create grids and place words efficiently (second snippet below).

6. Clue Generation: Used a large language model to generate context-aware clues for the placed words (third snippet below).
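
Roughly, steps 1-3 look like this. It's a simplified sketch, not the exact pipeline: the embedding model, the frequency cutoff, and the "video games" topic anchor are illustrative stand-ins.

    import re
    from collections import Counter
    from datasets import load_dataset
    from sentence_transformers import SentenceTransformer, util

    # Subreddit names as they appear in the data (capitalization may vary).
    GAMING_SUBS = {"gaming", "DotA2", "leagueoflegends"}

    # Steps 1-2: stream webis/tldr-17 and keep posts from gaming subreddits.
    dataset = load_dataset("webis/tldr-17", split="train", streaming=True)
    counts = Counter()
    for row in dataset:
        if row["subreddit"] in GAMING_SUBS:
            counts.update(re.findall(r"[a-z]{4,}", row["content"].lower()))

    # Step 3: rank frequent candidate words by cosine similarity to a topic
    # anchor; a sentence-transformers model stands in here for the BERT
    # embeddings used in the real pipeline.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    topic_vec = model.encode("video games", convert_to_tensor=True)
    candidates = [w for w, _ in counts.most_common(2000)]
    vecs = model.encode(candidates, convert_to_tensor=True)
    scores = util.cos_sim(vecs, topic_vec).squeeze(1)
    keywords = sorted(zip(candidates, scores.tolist()), key=lambda p: -p[1])[:200]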
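
The grid builder (step 5) is a greedy heuristic at heart. Here is a minimal sketch of the general idea, ignoring the black-square and adjacency rules a real generator must also enforce: seed the grid with the longest word, then place each remaining word wherever it crosses the most already-placed letters.

    def place_words(words, size=15):
        """Greedy placement sketch; assumes a non-empty word list whose
        words all fit within the grid."""
        grid = [[None] * size for _ in range(size)]
        words = sorted(words, key=len, reverse=True)

        def fits(word, r, c, dr, dc):
            # Return the number of crossings, or -1 if the word runs off
            # the grid or conflicts with an existing letter.
            if not (0 <= r + dr * (len(word) - 1) < size
                    and 0 <= c + dc * (len(word) - 1) < size):
                return -1
            crossings = 0
            for i, ch in enumerate(word):
                cell = grid[r + dr * i][c + dc * i]
                if cell == ch:
                    crossings += 1      # shares a letter with a placed word
                elif cell is not None:
                    return -1           # conflicting letter
            return crossings

        def write(word, r, c, dr, dc):
            for i, ch in enumerate(word):
                grid[r + dr * i][c + dc * i] = ch

        # Seed with the longest word across the middle row.
        write(words[0], size // 2, (size - len(words[0])) // 2, 0, 1)

        for word in words[1:]:
            best = None
            for r in range(size):
                for c in range(size):
                    for dr, dc in ((0, 1), (1, 0)):   # across, down
                        score = fits(word, r, c, dr, dc)
                        if score > 0 and (best is None or score > best[0]):
                            best = (score, r, c, dr, dc)
            if best:
                write(word, *best[1:])
        return grid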
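
Conceptually, clue generation (step 6) boils down to one LLM call per placed word with its Reddit context attached. A hedged sketch using the OpenAI Python client (v1+); the model name and prompt wording are placeholders, not the exact production values:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def make_clue(word: str, context: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system",
                 "content": "Write a concise, specific crossword clue. "
                            "Never use the answer word in the clue."},
                {"role": "user",
                 "content": f"Answer: {word}\nReddit context: {context}"},
            ],
        )
        return resp.choices[0].message.content.strip()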

The resulting system creates crossword puzzles that blend traditional elements with gaming terminology, achieving about a 50-50 mix.

This project is admittedly overengineered for its purpose, but it was an interesting exploration into natural language processing, optimization algorithms, and the intersection of traditional word games with modern gaming culture.

A note on content: Since the data source is Reddit, some mature language may appear in the puzzles. Manual filtering was minimal to preserve authenticity.

You can try the puzzles here: <https://capsloq.de/crosswords/joystick-jargon>

I'm curious about the HN community's thoughts on this approach to puzzle generation. What other domains might benefit from similar computational techniques for content creation?

vunderba No.41881403
Nice work. I've also experimented with procedurally generated crossword puzzles, though I really wanted to constrain them to symmetric layouts like those in the New York Times, which made it more difficult.
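
For concreteness, NYT-style symmetry means the pattern of black squares is unchanged by a 180-degree rotation, which is easy to check but hard to satisfy during placement. A minimal check (hypothetical helper, with '#' marking a black square):

    def is_rotationally_symmetric(grid):
        # grid: list of equal-length strings; '#' marks a black square.
        n, m = len(grid), len(grid[0])
        return all(
            (grid[r][c] == "#") == (grid[n - 1 - r][m - 1 - c] == "#")
            for r in range(n)
            for c in range(m)
        )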

There's one outstanding issue: from what I can tell, at least 75% of the answers are relatively generic nouns or verbs.

Part of the deep satisfaction in solving a crossword puzzle is the specificity of the answer. It's far more gratifying to answer with "Hawking" than with "scientist", or with "Mandelbrot" rather than "shape".

It might be worth going back and looking up a compendium of games released in the last couple of decades, cross-referencing them with their manuals, GameFAQs, etc., and peppering that information into the crossword.

zomh No.41884536
Wanted to thank you again. I am currently working on an improved version that provides more context. Your sentence "Part of the deep satisfaction in solving ..." made it into the prompt's rule set. For the moment I am using only the r/dota2 dataset to make testing easier, and here is the very first result with the new prompt:

Generated words and clues:

heroes: Characters with unique abilities in Dota 2, tasked with defeating the enemy's Ancient.

ragers: Players who overly react to in-game frustrations, often ruining the fun for everyone.

rage: A common emotion experienced by players sometimes leading to poor decision-making.

tachyons: Hypothetical particles that travel faster than light, having no place in an Ancient's mechanics.

healing: Essential support function often provided by certain heroes like Treant Protector.

burn: Refers to a mechanism used to deplete an opponent's mana, crucial in trilane strategies.

matters: In Dota 2, every decision, including hero picks, can significantly change the outcome.

fault: What a player will often blame when losing, rather than acknowledging their own mistakes.

support: Role in Dota 2 focused on helping the team, often with abilities to aid and sustain.

team: Group of players working together to win, where synergy and composition are key to victory.

Note that the words themselves were not picked by OpenAI; they were pre-selected by the BERT-embeddings step, but this time with more than just the word itself as context.

This is definitely going in the right direction. It's only a sample size of 1, but I had to share it with you!

xrisk No.41888393
Pretty reasonable. I don’t know where it pulled “tachyons” and its clue from, though; that’s funny.
zomh No.41892054
They talked about it in r/dota2. Don't ask me why, but they did :D