
    282 points antidnan | 12 comments
    1. folli ◴[] No.41917385[source]
    From the paper's method section, a bit more about which type of ML algo was used:

    An RF machine-learning model was developed to predict lithium concentrations in Smackover Formation brines throughout southern Arkansas. The model was developed by (i) assigning explanatory variables to brine samples collected at wells, (ii) tuning the RF model to make predictions at wells and assess model performance, (iii) mapping spatially continuous predictions of lithium concentrations across the Reynolds oolite unit of the Smackover Formation in southern Arkansas, and (iv) inspecting the model for explanatory variable importance and influence. Initial model tuning used the tidymodels framework (52) in R (53) to test XGBoost, K-nearest neighbors, and RF algorithms; RF models consistently had higher accuracy and lower bias, so they were used to train the final model and predict lithium.

    Explanatory variables used to tune the RF model included geologic, geochemical, and temperature information for Jurassic and Cretaceous units. The geologic framework of the model domain is expected to influence brine chemistry both spatially and with depth. Explanatory variables used to train the RF model must be mapped across the model domain to create spatially continuous predictions of lithium. Thus, spatially continuous subsurface geologic information is key, although these digital resources are often difficult to acquire.

    Interesting to me that RF performed better than XGBoost; I would have expected at least a similar outcome if tuned correctly.
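
    The paper's comparison step used tidymodels in R; a rough, illustrative Python/scikit-learn analogue of that "try a few algorithms, keep the best" step might look like this (synthetic stand-in data, and sklearn's GradientBoostingRegressor standing in for XGBoost):

    ```python
    # Illustrative sketch only: the paper used tidymodels in R. This mimics
    # the model-comparison step with scikit-learn on synthetic data.
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

    candidates = {
        "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
        "gradient_boosting": GradientBoostingRegressor(random_state=0),  # XGBoost stand-in
        "knn": KNeighborsRegressor(n_neighbors=5),
    }

    # Score each candidate with 5-fold cross-validation and keep the best.
    scores = {name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
              for name, model in candidates.items()}
    best = max(scores, key=scores.get)
    print(scores, best)
    ```

    Which model "wins" here depends entirely on the synthetic data; the point is only the shape of the selection loop.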

    replies(5): >>41918303 #>>41918430 #>>41918858 #>>41919565 #>>41921426 #
    2. jandrese ◴[] No.41918303[source]
    Did they actually verify the predictions? In my reading of the article I didn't see any core samples being made to verify the model is correct.
    replies(1): >>41919610 #
    3. tomrod ◴[] No.41918430[source]
    RF is a heavy hitter when it comes to tabular data. XGBoost is good as well, but more often than not needs an autotuner (e.g., PyCaret) to really unlock it.
    4. lordgrenville ◴[] No.41918858[source]
    So it turns out that there's no theoretical reason that gradient boosting will always outperform RF (which would violate the "no free lunch" theorem). But it does usually seem to be the case in practice, even with small and noisy data.

    I would hazard a guess that with better tuning, XGBoost would still have won. (The paper notes that the authors chose a suboptimal set of hyperparameters out of fear of overfitting - maybe the same logic justifies choosing a suboptimal model type...)

    replies(2): >>41919082 #>>41923532 #
    5. levocardia ◴[] No.41919082[source]
    That's been my experience. RF tends to do quite well out of the box, and is very fast to fit. It's less of a pain to cross-validate too, with fewer tuning parameters. XGBoost has a huge number of knobs to tune, and its performance varies from god-awful with bad hyperparameters to somewhat better than RF with good ones. Giant PITA with nested cross-validation, etc. though.

    I haven't read in detail what their validation strategy is, but this seems like the kind of problem where it's not as easy as you'd think -- you need to be very careful about how you stratify your train, dev, and test sets. A random 80/10/10 split would be way too optimistic: your model would just learn to interpolate between geographically proximate locations. You'd probably need to cross-validate across different geographic areas.
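
    A hedged sketch of that "cross-validate across geographic areas" idea, using scikit-learn's GroupKFold with made-up cluster labels standing in for real geology:

    ```python
    # Sketch of spatially grouped cross-validation: wells are assigned to
    # geographic clusters, and whole clusters are held out together so the
    # model can't score well just by interpolating between nearby wells.
    # All data here is synthetic; KMeans regions stand in for real geography.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GroupKFold, cross_val_score

    rng = np.random.default_rng(0)
    coords = rng.uniform(0, 100, size=(200, 2))          # well locations (x, y)
    X = np.hstack([coords, rng.normal(size=(200, 3))])   # coords + fake covariates
    y = np.sin(coords[:, 0] / 20) + 0.1 * rng.normal(size=200)  # smooth target

    # Derive "regions" by clustering well coordinates.
    regions = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(coords)

    model = RandomForestRegressor(n_estimators=100, random_state=0)
    grouped_scores = cross_val_score(model, X, y, cv=GroupKFold(n_splits=5),
                                     groups=regions, scoring="r2")
    print(grouped_scores)
    ```

    Grouped scores like these are typically much lower (and more honest) than a random 80/10/10 split would suggest.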

    This also seems like an application that would benefit from "active learning": given that drilling and testing are expensive, you'd want to choose where to collect new data based on where it would best improve your model's accuracy. A similar-ish ML story comes from Flint, MI [1], though the ending is not so happy.

    [1] https://www.theatlantic.com/technology/archive/2019/01/how-m...

    replies(2): >>41922140 #>>41922930 #
    6. jofer ◴[] No.41919565[source]
    Put another way, this is pretty similar to the interpolation approaches that would normally be used for datasets like this in the world of mineral exploration. Kriging/co-kriging (i.e. Gaussian processes) is the more commonly used approach in this particular field, due to both the long history and the available hyperparameters for things like spatial anisotropy.

    However, kriging is really quite difficult to use with non-continuous inputs. RF is a lot more forgiving there. You don't need to develop a covariance model for discrete values (or a covariance model for how the different inputs relate, either).
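
    A small synthetic illustration of that forgiveness: a random forest happily mixes continuous coordinates with a discrete rock-unit code, with no covariance model required for the categorical input. All values below are invented.

    ```python
    # Synthetic illustration: RF handles mixed continuous + discrete inputs
    # (well coordinates plus a categorical rock-unit code) without needing
    # a covariance model, unlike kriging. All data here is made up.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)
    n = 300
    coords = rng.uniform(0, 50, size=(n, 2))      # continuous well coordinates
    unit = rng.integers(0, 4, size=n)             # discrete rock-unit code 0..3
    X = np.column_stack([coords, unit])
    # Target varies smoothly with x but jumps by rock unit (discontinuous).
    y = coords[:, 0] / 50 + np.array([0.0, 1.0, -0.5, 2.0])[unit]

    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    scores = cross_val_score(rf, X, y, cv=5, scoring="r2")
    print(scores.mean())
    ```

    One caveat: feeding the code in as an integer implicitly treats it as ordered; trees can still isolate each value with splits, but one-hot encoding is the cleaner option when the categories have no order.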

    7. jofer ◴[] No.41919610[source]
    There wouldn't be any core for this. Validation would use a holdout of the brine samples used in training. The thing being produced is brine, so lithium concentrations in brine samples serve as the validation dataset as well. In other words, this is spatial interpolation.
    8. aaronblohowiak ◴[] No.41921426[source]
    For other folks wondering what the acronym means: RF in this context is Random Forest.
    replies(1): >>41923255 #
    9. dwattttt ◴[] No.41922140{3}[source]
    > your model would just learn to interpolate between geographically proximate locations

    At a particular scale, this is entirely correct; if what I'm looking for is 'large', a measurement 1m away from a known hit would also be likely to be a hit.

    That particular issue sounds like it should be addressed with more negative samples.

    10. youoy ◴[] No.41922930{3}[source]
    The drilling and active learning part reminded me of this very nice article on Bayesian Optimization from Distill publication [0].

    They explain it for selecting the hyper parameters for ML models:

    > In this article, we talk about Bayesian Optimization, a suite of techniques often used to tune hyperparameters. More generally, Bayesian Optimization can be used to optimize any black-box function.

    But the example at the beginning of the article is mining gold:

    > Let us start with the example of gold mining. Our goal is to mine for gold in an unknown land. For now, we assume that the gold is distributed about a line. We want to find the location along this line with the maximum gold while only drilling a few times (as drilling is expensive).

    [0] https://distill.pub/2020/bayesian-optimization/

    11. f_devd ◴[] No.41923255[source]
    For a moment I was excited that they had done surveys entirely on RF backscattering and ML.
    12. eru ◴[] No.41923532[source]
    The 'no free lunch' theorem is almost useless, because no real world data set is made of white noise.