
Dolt is Git for data

(www.dolthub.com)
358 points | timsehn | 1 comment
aantix ◴[] No.22734941[source]
Slightly related - how does ML track new data input and ensure that the data hasn't introduced a regression?

I would assume there's an automated test suite, but is there also some way of diffing large amounts of input data and visualizing those additions relative to model classifications?

What are the common tools for this?

replies(1): >>22735230 #
1. visarga ◴[] No.22735230[source]
You generally can't attribute the accuracy of an ML system to individual examples in the training set. Each batch of examples changes the model slightly, and these updates interact and combine during training, so it becomes extremely difficult to isolate the contribution of any single example. Of course you could retrain the model leaving one example out, but that would be exceedingly slow, and the result of a single run would be inconclusive because the stochastic noise of the training process is larger than the effect of removing or adding one example.
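To make the leave-one-out point concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption for illustration: the synthetic data, the tiny logistic regression, and the helper names (`make_data`, `train`, `loo_effect`) are hypothetical stand-ins, not anyone's real pipeline. It retrains with and without one example over several seeds and reports the mean accuracy change alongside the seed-to-seed spread, so you can compare the two.

```python
import math
import random
import statistics


def make_data(n, rng):
    """Toy binary data: label is 1 when x0 + x1 > 0, with 10% of labels flipped."""
    data = []
    for _ in range(n):
        x0, x1 = rng.gauss(0, 1), rng.gauss(0, 1)
        y = 1 if x0 + x1 > 0 else 0
        if rng.random() < 0.1:
            y = 1 - y
        data.append(((x0, x1), y))
    return data


def predict(model, x):
    """Probability of class 1 under a logistic regression model (w0, w1, b)."""
    w0, w1, b = model
    z = w0 * x[0] + w1 * x[1] + b
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(data, seed, epochs=5, lr=0.1):
    """Plain SGD; the seed controls both the initial weights and the shuffle order."""
    rng = random.Random(seed)
    w0, w1, b = rng.gauss(0, 0.1), rng.gauss(0, 0.1), 0.0
    order = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = data[i]
            grad = predict((w0, w1, b), x) - y  # d(log loss)/dz
            w0 -= lr * grad * x[0]
            w1 -= lr * grad * x[1]
            b -= lr * grad
    return (w0, w1, b)


def accuracy(model, data):
    """Fraction of examples classified correctly at a 0.5 threshold."""
    correct = sum((predict(model, x) >= 0.5) == (y == 1) for x, y in data)
    return correct / len(data)


def loo_effect(train_set, test_set, index, seeds):
    """Return (mean accuracy change from dropping one example, seed-to-seed std dev)."""
    full = [accuracy(train(train_set, s), test_set) for s in seeds]
    reduced = train_set[:index] + train_set[index + 1:]
    loo = [accuracy(train(reduced, s), test_set) for s in seeds]
    return statistics.mean(full) - statistics.mean(loo), statistics.stdev(full)


if __name__ == "__main__":
    rng = random.Random(0)
    train_set, test_set = make_data(200, rng), make_data(200, rng)
    effect, noise = loo_effect(train_set, test_set, index=0, seeds=range(10))
    print(f"effect of dropping example 0: {effect:+.4f}")
    print(f"std dev across seeds:         {noise:.4f}")
```

Note that this needs two full sets of training runs per example you want to score, which is the "exceedingly slow" part: scoring every example in a real dataset this way multiplies the training cost by the dataset size times the number of seeds.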

Related areas are confidence calibration, active learning and hard example detection during training. Another approach is to synthesise a new, much smaller dataset that would train a neural net to the same accuracy as the original, larger dataset.