
    282 points antidnan | 12 comments
    1. folli ◴[] No.41917385[source]
    From the paper's method section, a bit more about which type of ML algo was used:

    An RF machine-learning model was developed to predict lithium concentrations in Smackover Formation brines throughout southern Arkansas. The model was developed by (i) assigning explanatory variables to brine samples collected at wells, (ii) tuning the RF model to make predictions at wells and assess model performance, (iii) mapping spatially continuous predictions of lithium concentrations across the Reynolds oolite unit of the Smackover Formation in southern Arkansas, and (iv) inspecting the model for explanatory variable importance and influence. Initial model tuning used the tidymodels framework (52) in R (53) to test XGBoost, K-nearest neighbors, and RF algorithms; RF models consistently had higher accuracy and lower bias, so they were used to train the final model and predict lithium.

    Explanatory variables used to tune the RF model included geologic, geochemical, and temperature information for Jurassic and Cretaceous units. The geologic framework of the model domain is expected to influence brine chemistry both spatially and with depth. Explanatory variables used to train the RF model must be mapped across the model domain to create spatially continuous predictions of lithium. Thus, spatially continuous subsurface geologic information is key, although these digital resources are often difficult to acquire.

    Interesting to me that RF performed better than XGBoost; I would have expected at least a similar outcome if tuned correctly.
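
    The paper's comparison step used tidymodels in R; a rough, illustrative Python/scikit-learn analogue of that "try a few algorithms, keep the best" step might look like this (synthetic stand-in data, and sklearn's GradientBoostingRegressor standing in for XGBoost):

    ```python
    # Illustrative sketch only: the paper used tidymodels in R. This mimics
    # the model-comparison step with scikit-learn on synthetic data.
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

    candidates = {
        "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
        "gradient_boosting": GradientBoostingRegressor(random_state=0),  # XGBoost stand-in
        "knn": KNeighborsRegressor(n_neighbors=5),
    }

    # Score each candidate with 5-fold cross-validation and keep the best.
    scores = {name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
              for name, model in candidates.items()}
    best = max(scores, key=scores.get)
    print(scores, best)
    ```

    Which model "wins" here depends entirely on the synthetic data; the point is only the shape of the selection loop.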

    replies(5): >>41918303 #>>41918430 #>>41918858 #>>41919565 #>>41921426 #
    2. jandrese ◴[] No.41918303[source]
    Did they actually verify the predictions? In my reading of the article I didn't see any core samples being made to verify the model is correct.
    replies(1): >>41919610 #
    3. tomrod ◴[] No.41918430[source]
    RF is a heavy hitter when it comes to tabular data. XGBoost is good as well, but more often than not needs an autotuner (e.g., PyCaret) to really unlock it.
    4. lordgrenville ◴[] No.41918858[source]
    So it turns out that there's no theoretical reason that gradient boosting will always outperform RF (which would violate the "no free lunch" theorem). But it does usually seem to be the case in practice, even with small and noisy data.

    I would hazard a guess that with better tuning, XGBoost would still have won. (The paper notes that the authors chose a suboptimal set of hyperparameters out of fear of overfitting - maybe the same logic justifies choosing a suboptimal model type...)

    replies(2): >>41919082 #>>41923532 #
    5. levocardia ◴[] No.41919082[source]
    That's been my experience. RF tends to do quite well out of the box, and is very fast to fit. It's less of a pain to cross-validate too, with fewer tuning parameters. XGBoost has a huge number of knobs to tune, and its performance varies from god-awful with bad hyperparameters to somewhat better than RF with good ones. Giant PITA with nested cross-validation, etc. though.

    I haven't read in detail what their validation strategy is, but this seems like the kind of problem where it's not as easy as you'd think -- you need to be very careful about how you stratify your train, dev, and test sets. A random 80/10/10 split would be way too optimistic: your model would just learn to interpolate between geographically proximate locations. You'd probably need to cross-validate across different geographic areas.
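
    A hedged sketch of that "cross-validate across geographic areas" idea, using scikit-learn's GroupKFold with made-up cluster labels standing in for real geology:

    ```python
    # Sketch of spatially grouped cross-validation: wells are assigned to
    # geographic clusters, and whole clusters are held out together so the
    # model can't score well just by interpolating between nearby wells.
    # All data here is synthetic; KMeans regions stand in for real geography.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GroupKFold, cross_val_score

    rng = np.random.default_rng(0)
    coords = rng.uniform(0, 100, size=(200, 2))          # well locations (x, y)
    X = np.hstack([coords, rng.normal(size=(200, 3))])   # coords + fake covariates
    y = np.sin(coords[:, 0] / 20) + 0.1 * rng.normal(size=200)  # smooth target

    # Derive "regions" by clustering well coordinates.
    regions = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(coords)

    model = RandomForestRegressor(n_estimators=100, random_state=0)
    grouped_scores = cross_val_score(model, X, y, cv=GroupKFold(n_splits=5),
                                     groups=regions, scoring="r2")
    print(grouped_scores)
    ```

    Grouped scores like these are typically much lower (and more honest) than a random 80/10/10 split would suggest.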

    This also seems like an application that would benefit from "active learning": given that drilling and testing are expensive, you'd want to choose where to collect new data based on where it would best improve your model's accuracy. A similar-ish ML story comes from Flint, MI [1], though the ending is not so happy.

    [1] https://www.theatlantic.com/technology/archive/2019/01/how-m...

    replies(2): >>41922140 #>>41922930 #
    6. jofer ◴[] No.41919565[source]
    Put another way, this is pretty similar to the interpolation approaches that would normally be used for datasets like this in the world of mineral exploration. Kriging/co-kriging (i.e. Gaussian processes) is the more commonly used approach in this particular field, due to both the long history and the available hyperparameters for things like spatial anisotropy.

    However, kriging is really quite difficult to use with non-continuous inputs. RF is a lot more forgiving there. You don't need to develop a covariance model for discrete values (or a covariance model for how the different inputs relate, either).
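
    A small synthetic illustration of that forgiveness: a random forest happily mixes continuous coordinates with a discrete rock-unit code, with no covariance model required for the categorical input. All values below are invented.

    ```python
    # Synthetic illustration: RF handles mixed continuous + discrete inputs
    # (well coordinates plus a categorical rock-unit code) without needing
    # a covariance model, unlike kriging. All data here is made up.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)
    n = 300
    coords = rng.uniform(0, 50, size=(n, 2))      # continuous well coordinates
    unit = rng.integers(0, 4, size=n)             # discrete rock-unit code 0..3
    X = np.column_stack([coords, unit])
    # Target varies smoothly with x but jumps by rock unit (discontinuous).
    y = coords[:, 0] / 50 + np.array([0.0, 1.0, -0.5, 2.0])[unit]

    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    scores = cross_val_score(rf, X, y, cv=5, scoring="r2")
    print(scores.mean())
    ```

    One caveat: feeding the code in as an integer implicitly treats it as ordered; trees can still isolate each value with splits, but one-hot encoding is the cleaner option when the categories have no order.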

    7. jofer ◴[] No.41919610[source]
    There wouldn't be any core for this. Validation would use a holdout of the brine samples used in training. The thing being produced is brine, so lithium concentrations in brine samples serve as the validation dataset as well. In other words, this is spatial interpolation.
    8. aaronblohowiak ◴[] No.41921426[source]
    For other folks wondering what the acronym means: RF in this context is Random Forest.
    replies(1): >>41923255 #
    9. dwattttt ◴[] No.41922140{3}[source]
    > your model would just learn to interpolate between geographically proximate locations

    At a particular scale, this is entirely correct; if what I'm looking for is 'large', a measurement 1m away from a known hit would also be likely to be a hit.

    That particular issue sounds like it should be addressed with more negative samples.

    10. youoy ◴[] No.41922930{3}[source]
    The drilling and active learning part reminded me of this very nice article on Bayesian Optimization from Distill publication [0].

    They explain it for selecting the hyper parameters for ML models:

    > In this article, we talk about Bayesian Optimization, a suite of techniques often used to tune hyperparameters. More generally, Bayesian Optimization can be used to optimize any black-box function.

    But the example at the beginning of the article is mining gold:

    > Let us start with the example of gold mining. Our goal is to mine for gold in an unknown land. For now, we assume that the gold is distributed about a line. We want to find the location along this line with the maximum gold while only drilling a few times (as drilling is expensive).

    [0] https://distill.pub/2020/bayesian-optimization/

    11. f_devd ◴[] No.41923255[source]
    For a moment I was excited that they had done surveys entirely on RF backscattering and ML.
    12. eru ◴[] No.41923532[source]
    The 'no free lunch' theorem is almost useless, because no real world data set is made of white noise.