Show HN: Single-Header Profiler for C++17

General rundown of the logic can be found in this comment on reddit: https://www.reddit.com/r/cpp/comments/1jy6ver/comment/mmze20...

About linear IDs: A call graph in general case is a tree of nodes, each node has a single parent and an arbitrary amount of children. Each node accumulates time spend in the "lower" branches. A neat property of the callgraph relative to a generic tree, is that every node can be associated with a callsite. For example, if a some function f() calls itself 3 recursively, there will be multiple nodes corresponding to it, but in terms of callsite there is still only one. So lets take some simple call graph as an example:

  Callgraph:         f() -> f() -> f() -> g()
                                       -> h()
  Node id:           0      1      2      3,4

Let's say f() has callsite id '0', g() has callsite id '1', h() has callsite id '2'. The callgraph will then consist of N=5 nodes with M=3 different callsites:

  Node id:         { 0   1   2   3   4 }
  Callsite id:     { 0   0   0   1   2 }

We can then encode all "prev."" nodes as a single N vector, and all "next" nodes as a MxN matrix, which has some kind of sentinel value (like -1) in places with no connection. For this example this will result in following:

  Node id:         { 0   1   2   3   4 }
  Prev. id:        { x   0   1   2   2 }
  Next id matrix:  [ 1   2   3   x   x ]
                   [ x   x   4   x   x ]

Every thread has a thread-local callgraph object that keeps track of all this graph traversal, it holds 'current_node_id'. Traversing backwards on the graph is a single array lookup:

  current_node_id = prev_node_ids[current_node_id];

Traversing forwards to an existing callgraph node is a lookup & branch:

  next_node_id = next_node_ids[callsite_id, current_node_id]
  if (next_node_id = x) create_node(next_node_id); // will be usually predicted
  else                  current_node_id = next_node_id;

New nodes can be created pretty cheaply too, but too verbose for a comment. The key to tracking the callsites and assigning them IDs is thread_local local variables generated by the macro:

https://github.com/DmitriBogdanov/UTL/blob/master/include/UT...

When callsite marker initializes (which only happens once), it gets a new ID. Timer then gets this 'callsite_id' an passes it to the forwards-traversal. The way we get function names is by simply remembering __FILE__, __func__, __LINE__ pointers in another array of the call graph, they get saved during the callsite marker initialization too. As far as performance goes everything we do is cheap & simple operations, at this point the main overhead is just from taking the timestamps.