Google's “Director of Engineering” Hiring Test

1. jblow ◴[13 Oct 16 16:07 UTC] No.12701702[source]▶

#9 is especially stupid because it's so context-dependent. SSE4 gives you a popcount instruction, for example, which would be easily the fastest way to do this, if available.
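A minimal sketch of reaching the hardware instruction from C, assuming GCC or Clang. `__builtin_popcountll` is a compiler builtin, not standard C; it is typically lowered to a single POPCNT when the target enables it (for example with `-mpopcnt` or `-msse4.2`) and to a software fallback otherwise. The wrapper name `count_bits_hw` is made up for illustration.

```c
#include <stdint.h>

/* Hypothetical wrapper name. __builtin_popcountll is a GCC/Clang builtin
   that returns the number of set bits in an unsigned long long. */
static inline unsigned count_bits_hw(uint64_t x)
{
    return (unsigned)__builtin_popcountll(x);
}
```

Whether this actually beats the alternatives still depends on the target CPU and compiler flags, which is the context-dependence the comment is pointing at.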

replies(2): >>12702641 #>>12702679 #

2. mortehu ◴[13 Oct 16 17:50 UTC] No.12702641[source]▶

>>12701702 (TP) #

Which is why you ask follow-up questions instead of just giving the optimal solution for UltraSPARC and rejecting what would be the optimal solution for other CPUs.

3. greyman ◴[13 Oct 16 17:56 UTC] No.12702679[source]▶

>>12701702 (TP) #

Yes, but without that instruction, the algorithm mentioned by the recruiter is really the quickest way. I coded chess algorithm in the past and this was exactly the method top chess open-source engines used. But IMHO it is hard to figure that out without prior experience with this problem.
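For readers who have not seen the lookup approach, here is a rough sketch of one common form: a 256-entry table holding the bit count of every byte, consulted once per byte of the input. The byte-wide table size and the names `bits_in_byte` and `popcount64_lut` are assumptions for illustration; the thread does not say which table width the recruiter or the chess engines used.

```c
#include <stdint.h>

/* Macros that expand to the bit counts of all 256 byte values:
   B2 covers 2 bits, B4 covers 4 bits, B6 covers 6 bits. */
#define B2(n) n, n + 1, n + 1, n + 2
#define B4(n) B2(n), B2(n + 1), B2(n + 1), B2(n + 2)
#define B6(n) B4(n), B4(n + 1), B4(n + 1), B4(n + 2)

/* bits_in_byte[b] is the number of set bits in the byte value b. */
static const uint8_t bits_in_byte[256] = { B6(0), B6(1), B6(1), B6(2) };

/* Sum the table entries for each of the eight bytes of x. */
static unsigned popcount64_lut(uint64_t x)
{
    unsigned total = 0;
    for (int i = 0; i < 8; i++) {
        total += bits_in_byte[x & 0xFF];
        x >>= 8;
    }
    return total;
}
```

The table costs 256 bytes of cache, so its real-world speed depends on how much else is competing for that cache.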

replies(3): >>12703390 #>>12703834 #>>12708561 #

4. monocasa ◴[13 Oct 16 19:31 UTC] No.12703390[source]▶

>>12702679 #

Yep. Went and tried the lookup method against a 5 step parallel shift and add method (which is the fastest bitwise way I know of without a dedicated instruction), and the lookup is ~5% faster than the bitwise way.
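As an illustration of what a parallel shift-and-add popcount looks like, here is a sketch for a 32-bit word, where five steps suffice (a 64-bit word needs a sixth). This is not the code from the linked gist, which is truncated here; the function name `popcount32_parallel` is hypothetical.

```c
#include <stdint.h>

/* Each step adds neighbouring fields of the word in parallel,
   doubling the field width: 1 -> 2 -> 4 -> 8 -> 16 -> 32 bits. */
static unsigned popcount32_parallel(uint32_t x)
{
    x = (x & 0x55555555u) + ((x >> 1)  & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2)  & 0x33333333u);
    x = (x & 0x0F0F0F0Fu) + ((x >> 4)  & 0x0F0F0F0Fu);
    x = (x & 0x00FF00FFu) + ((x >> 8)  & 0x00FF00FFu);
    x = (x & 0x0000FFFFu) + ((x >> 16) & 0x0000FFFFu);
    return x;
}
```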

https://gist.github.com/monocasa/1d44a03cbd0170bfffc6a4a5c37...

replies(1): >>12703823 #

5. gcp ◴[13 Oct 16 20:35 UTC] No.12703823{3}[source]▶

>>12703390 #

Your code has 6 shifts, 6 adds/subs and 6 ANDs.

You can do it with 4 shifts, 3 adds, 1 MUL and 4 ANDs.

Your code is simply suboptimal.

replies(1): >>12704051 #

6. gcp ◴[13 Oct 16 20:37 UTC] No.12703834[source]▶

>>12702679 #

> I coded chess algorithm in the past and this was exactly the method top chess open-source engines used.

Your statement is rather vague in time, but for example Stockfish did certainly use the hardware intrinsic at some point. Some of the top closed source engines were using SWAR approaches mixed with loops (when the expected population is 0).

The answer is very dependent on the exact HW architecture and the cache pressure in the surrounding algorithms.
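A sketch of the kind of loop that pays off when very few bits are expected to be set: it clears the lowest set bit on each pass, so it iterates once per set bit and not once per bit position. This is the well-known `x &= x - 1` trick, offered as an example of the loop style mentioned above, not as code taken from any particular engine; the name `popcount64_sparse` is made up.

```c
#include <stdint.h>

/* Runs one iteration per set bit, so an all-zero word costs only
   the initial comparison. */
static unsigned popcount64_sparse(uint64_t x)
{
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

On dense inputs the same loop can take up to 64 iterations, which is why it would be mixed with a SWAR path and not used alone.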

7. monocasa ◴[13 Oct 16 21:09 UTC] No.12704051{4}[source]▶

>>12703823 #

For a 64bit quantity? I'm curious to see your algorithm in actual code.
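For reference, a widely circulated 64-bit SWAR popcount does match the operation counts quoted upthread (4 shifts, 3 adds/subs, 1 MUL, 4 ANDs). Whether this is exactly the code gcp had in mind is an assumption; the function name `popcount64_swar` is hypothetical.

```c
#include <stdint.h>

static unsigned popcount64_swar(uint64_t x)
{
    /* 2-bit fields: 1 shift, 1 AND, 1 sub */
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    /* 4-bit fields: 1 shift, 2 ANDs, 1 add */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    /* 8-bit fields: 1 shift, 1 AND, 1 add */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    /* The multiply sums all eight byte counts into the top byte:
       1 MUL, 1 shift */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}
```

The multiply replaces the last three shift-and-add steps, so this version is only a win on hardware where a 64-bit multiply is cheap.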

8. Const-me ◴[14 Oct 16 15:08 UTC] No.12708561[source]▶

>>12702679 #

No, the algorithm mentioned by the recruiter is among the slowest ways.

I recently tested different approaches. I’ve been working on some code that downsamples a large set of 1-bit voxels to get shades of gray on the edges. For that, I had to count gigabytes of those bits as fast as possible.

Advanced manually-vectorized SIMD code worked several times faster, esp. on the hardware that supports SSSE3 or XOP instructions.

And even when the hardware doesn’t have SSE4, doesn’t have SSSE3, doesn’t have XOP — SSE2-only backup plan is still faster than lookup tables. Here’s the code: http://stackoverflow.com/a/17355341/126995