
52 points | zomh | 1 comment

As a fan of dense New York Times-style crosswords, I challenged myself to create topic-specific puzzles. It turns out that generating crosswords and efficiently placing words is a non-trivial computational problem.

I started the project "Joystick Jargon", which combines traditional crossword elements with gaming-related vocabulary. Here's the technical process behind it:

1. Data Source: Used a 3.8-million-row Reddit dataset from Hugging Face (https://huggingface.co/datasets/webis/tldr-17).

2. Data Filtering: Narrowed down to gaming-related subreddits (r/gaming, r/dota2, r/leagueoflegends).

3. Keyword Extraction: Employed ML techniques, specifically BERT embeddings and cosine similarity, to extract keywords from the subreddits.

4. Data Preprocessing: Cleaned up data unsuitable for crossword puzzles.

5. Grid Generation: Implemented a heuristic crossword algorithm to create grids and place words efficiently.

6. Clue Generation: Utilized a Large Language Model to generate context-aware clues for the placed words.
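
For readers curious what step 3 might look like, here is a minimal sketch of the ranking idea only: score each candidate keyword by the cosine similarity between its embedding and the embedding of the source text, then keep the top few. The post does not show its code, so everything here is an assumption: the function names are hypothetical, and the tiny hand-written vectors stand in for real BERT embeddings, which would come from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_keywords(doc_vector, candidate_vectors, top_n=3):
    """Return the top_n (word, score) pairs most similar to the document."""
    scored = [
        (word, cosine_similarity(doc_vector, vector))
        for word, vector in candidate_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Made-up 3-dimensional vectors (assumption): placeholders for the
# embeddings a BERT-style model would produce for a post and its words.
doc_vector = [0.9, 0.1, 0.3]
candidates = {
    "respawn": [0.8, 0.2, 0.4],
    "weather": [0.1, 0.9, 0.0],
    "lobby": [0.7, 0.0, 0.5],
}

top = rank_keywords(doc_vector, candidates, top_n=2)
for word, score in top:
    print(f"{word}: {score:.3f}")
```

With these toy vectors, "respawn" and "lobby" rank above "weather" because their vectors point in nearly the same direction as the document vector.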

The resulting system creates crossword puzzles that blend traditional elements with gaming terminology, achieving about a 50-50 mix.
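
The grid-generation step is the part that makes this a non-trivial problem. The post does not describe its heuristic, so the following is only one common greedy approach, sketched under assumptions: place the longest word first, then attach each remaining word perpendicular to an already placed word at a shared letter, rejecting positions that clash with or sit flush against other words. The names `fits` and `build_grid` and the word list are hypothetical.

```python
def fits(grid, word, row, col, horizontal):
    """Return the number of crossings if `word` can go here, else -1."""
    dr, dc = (0, 1) if horizontal else (1, 0)
    # The cells just before and just after the word must be empty.
    before = (row - dr, col - dc)
    after = (row + dr * len(word), col + dc * len(word))
    if before in grid or after in grid:
        return -1
    crossings = 0
    for i, letter in enumerate(word):
        r, c = row + dr * i, col + dc * i
        existing = grid.get((r, c))
        if existing is None:
            # A newly filled cell must not touch another word side-on.
            if (r + dc, c + dr) in grid or (r - dc, c - dr) in grid:
                return -1
        elif existing != letter:
            return -1  # letter conflict
        else:
            crossings += 1
    return crossings


def build_grid(words):
    """Greedily place words; returns (grid, placed, skipped)."""
    grid = {}     # (row, col) -> letter; coordinates may be negative
    placed = []   # (word, row, col, horizontal)
    skipped = []  # words that found no valid position
    for word in sorted(words, key=len, reverse=True):
        candidates = []
        if not placed:
            candidates.append((0, 0, 0, True))  # seed word at the origin
        for other, orow, ocol, ohoriz in placed:
            for j, other_letter in enumerate(other):
                for i, letter in enumerate(word):
                    if letter != other_letter:
                        continue
                    # Cell of other[j]; the new word crosses it at word[i].
                    cell_row = orow + (0 if ohoriz else j)
                    cell_col = ocol + (j if ohoriz else 0)
                    horizontal = not ohoriz
                    row = cell_row - (0 if horizontal else i)
                    col = cell_col - (i if horizontal else 0)
                    score = fits(grid, word, row, col, horizontal)
                    if score > 0:
                        candidates.append((score, row, col, horizontal))
        if not candidates:
            skipped.append(word)
            continue
        # Prefer the position with the most crossings (denser grids).
        _, row, col, horizontal = max(candidates, key=lambda cand: cand[0])
        dr, dc = (0, 1) if horizontal else (1, 0)
        for i, letter in enumerate(word):
            grid[(row + dr * i, col + dc * i)] = letter
        placed.append((word, row, col, horizontal))
    return grid, placed, skipped


grid, placed, skipped = build_grid(["CONTROLLER", "RESPAWN", "LOOT", "NPC"])
rows = [r for r, _ in grid]
cols = [c for _, c in grid]
for r in range(min(rows), max(rows) + 1):
    print("".join(grid.get((r, c), ".") for c in range(min(cols), max(cols) + 1)))
```

A greedy pass like this is fast but never backtracks, so it tends to produce sparse grids. Getting close to New York Times-style density would likely need backtracking or repeated randomized attempts scored by fill ratio.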

This project is admittedly overengineered for its purpose, but it was an interesting exploration into natural language processing, optimization algorithms, and the intersection of traditional word games with modern gaming culture.

A note on content: Since the data source is Reddit, some mature language may appear in the puzzles. Manual filtering was minimal to preserve authenticity.

You can try the puzzles here: <https://capsloq.de/crosswords/joystick-jargon>

I'm curious about the HN community's thoughts on this approach to puzzle generation. What other domains might benefit from similar computational techniques for content creation?

mvdtnz ◴[] No.41881862[source]
Unportal? What the fuck is unportal? I have played games for 30 years and I've never heard the term "unportal". Google gives no useful results. That clue made me angrier than any crossword clue I've ever seen.
replies(3): >>41882069 #>>41882158 #>>41882774 #
zomh ◴[] No.41882774[source]
One more thing: besides the obvious point that clues and answers should make sense, can you share some insight into what makes a good crossword to you? I'd love to hear that from someone who has been playing games for 30 years!
replies(2): >>41884777 #>>41886375 #
carstenhag ◴[] No.41886375[source]
For me, a crossword that mixes all games doesn't make sense. You could give me 20 crosswords with Dota stuff in them and I would be able to solve 0; even with hints they wouldn't be doable.

So maybe rather have some for League, some for CS, etc? Maybe you can do a mixed indie one with very popular games, or mixed Shooter. But then the questions have to be less difficult :D

replies(1): >>41887224 #
1. zomh ◴[] No.41887224[source]
TY for the feedback. I see what you are saying about the crossword getting too niche. For now this is mostly a technical decision, since it's easier to train on a more specific dataset. However, I agree with you that the dataset should be expanded once the quality is there.