
52 points zomh | 2 comments

As a fan of dense New York Times-style crosswords, I challenged myself to create topic-specific puzzles. It turns out that generating crosswords and efficiently placing words is a non-trivial computational problem.

I started the project "Joystick Jargon", combining traditional crossword elements with gaming-related vocabulary. Here's the technical process behind it:

1. Data Source: Used a Reddit dataset of about 3.8 million rows from Hugging Face (https://huggingface.co/datasets/webis/tldr-17).

2. Data Filtering: Narrowed down to gaming-related subreddits (r/gaming, r/dota2, r/leagueoflegends).

3. Keyword Extraction: Employed ML techniques, specifically BERT embeddings and cosine similarity, to extract keywords from the subreddits.

4. Data Preprocessing: Cleaned up data unsuitable for crossword puzzles.

5. Grid Generation: Implemented a heuristic crossword algorithm to create grids and place words efficiently.

6. Clue Generation: Utilized a Large Language Model to generate context-aware clues for the placed words.
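
As a rough illustration of steps 3 and 4 (a sketch, not the project's actual code): candidate words can be ranked by the cosine similarity between their embedding and a document embedding, after dropping entries that cannot go into a grid. The function names, the length bounds, and the tiny 3-dimensional vectors below are all made up for the example; a real pipeline would take its vectors from a BERT-style encoder.

```python
import math
import re


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def crossword_safe(word, min_len=3, max_len=15):
    """Keep plain A-Z entries within assumed (hypothetical) length bounds."""
    return bool(re.fullmatch(r"[A-Za-z]+", word)) and min_len <= len(word) <= max_len


def rank_keywords(doc_vec, candidates, top_n=3):
    """Return the top_n words whose vectors are closest to doc_vec.

    `candidates` maps word -> embedding vector.
    """
    scored = [(cosine_similarity(doc_vec, vec), word)
              for word, vec in candidates.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:top_n]]


# Toy vectors invented for illustration only.
doc_vec = [0.9, 0.1, 0.3]
candidates = {
    "respawn": [0.8, 0.2, 0.3],
    "cooldown": [0.7, 0.0, 0.4],
    "weather": [0.0, 1.0, 0.1],
    "gg": [0.9, 0.1, 0.2],  # too short for a grid entry under the assumed bounds
}

# Step 4 (cleanup) first, then step 3 (ranking) on what remains.
usable = {w: v for w, v in candidates.items() if crossword_safe(w)}
keywords = rank_keywords(doc_vec, usable, top_n=2)
print(keywords)
```

With these toy numbers the off-topic word scores far below the two gaming terms, which is the whole point of ranking by similarity rather than by raw frequency.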

The resulting system creates crossword puzzles that blend traditional elements with gaming terminology, achieving about a 50-50 mix.

This project is admittedly overengineered for its purpose, but it was an interesting exploration into natural language processing, optimization algorithms, and the intersection of traditional word games with modern gaming culture.

A note on content: Since the data source is Reddit, some mature language may appear in the puzzles. Manual filtering was minimal to preserve authenticity.

You can try the puzzles here: <https://capsloq.de/crosswords/joystick-jargon>

I'm curious about the HN community's thoughts on this approach to puzzle generation. What other domains might benefit from similar computational techniques for content creation?

1. Suppafly ◴[] No.41882163[source]
>5. Grid Generation: Implemented a heuristic crossword algorithm to create grids and place words efficiently.

I always think about doing something similar for a similar project. Are you able to do it completely automatically or do you have to help finesse the words to fit?

replies(1): >>41882325 #
2. zomh ◴[] No.41882325[source]
The algo does allow for fully automatic crossword generation. Let me try to summarize the general flow:

1. It begins with an empty grid and starts placing words horizontally from the top-left corner

2. For each word placement, it verifies that valid words can be formed vertically at each intersection point

3. It maintains a list of possible letters for each cell to ensure all constraints are satisfied

4. The generator consults a dictionary to find valid words that fit the current grid state, allowing for diverse solutions

5. If no valid word can be placed, it may decide to insert a black square, carefully checking that this doesn't violate any crossword rules

6. When it reaches a dead end, the system backtracks and tries different options

7. It employs smart heuristics to guide word selection, such as favoring longer words in certain positions

8. Throughout the process, it automatically adjusts parameters like word length and black square placement to find a valid solution

There is no manual intervention; however, the quality depends heavily on the input dictionary and tunable parameters.
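
A heavily simplified sketch of that flow, for anyone who wants to experiment (this is not the actual generator): it fills a small square grid row by row, keeps a set of allowed letters per cell derived from the column prefixes, and backtracks on dead ends. Black squares, parameter tuning, and real word-selection heuristics are left out, and alphabetical order stands in for the heuristic ordering. The names `allowed_letters` and `fill_grid` are hypothetical.

```python
def allowed_letters(prefix, words):
    """Letters that can follow `prefix` in at least one dictionary word."""
    return {w[len(prefix)] for w in words
            if w.startswith(prefix) and len(w) > len(prefix)}


def fill_grid(words, size):
    """Fill a size x size grid (no black squares) by backtracking.

    Rows are placed top to bottom. Every column must remain a prefix of
    some dictionary word, so the finished columns are dictionary words too.
    Returns a list of row strings, or None if no fill exists.
    """
    # Alphabetical order is a placeholder for a real selection heuristic.
    pool = sorted(w for w in set(words) if len(w) == size)
    rows = []

    def backtrack():
        if len(rows) == size:
            return True
        # Per-cell candidate letters for the next row, from the column prefixes.
        col_prefixes = ["".join(r[c] for r in rows) for c in range(size)]
        options = [allowed_letters(p, pool) for p in col_prefixes]
        for word in pool:
            if all(ch in opts for ch, opts in zip(word, options)):
                rows.append(word)
                if backtrack():
                    return True
                rows.pop()  # dead end: undo this placement, try the next word
        return False

    return rows if backtrack() else None


grid = fill_grid(["bat", "bit", "cat", "ice", "ten"], 3)
print(grid)  # ['bit', 'ice', 'ten']
```

In this example "bat" is rejected immediately because no dictionary word starts with "a", so the second column could never be completed. That early pruning through the per-cell letter sets is what keeps the backtracking from exploring hopeless branches.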