Reverse engineering the obfuscated TikTok VM

1. godelski ◴[21 Apr 25 05:11 UTC] No.43748662[source]▶

This seems like quite a lot of work to hide the code. What would the legitimate reasons for this be? Because it looks like it would make the program less optimized and more complexity just leads to more errors.

I understand the desire to make it harder for bots, but 1) it doesn't seem to be effective and bots seem to be going a very different route 2) there's got to be better ways that are more effective. It's not like you're going to stop clones through this because clones can replicate by just seeing how things work and reverse engineer blackbox style.

replies(8): >>43748681 #>>43748712 #>>43748741 #>>43748839 #>>43749167 #>>43749282 #>>43750130 #>>43752385 #

2. davidsojevic ◴[21 Apr 25 05:16 UTC] No.43748681[source]▶

>>43748662 (TP) #

Making it harder for bots usually means that it drives up the cost for the bots to operate; so if they need to run in a headless browser to get around the anti-bot measures it might mean that it takes, for example, 1.5 seconds to execute a request as compared to the 0.1 seconds it would without them in place.

On top of that 1.5 seconds, there is also a much larger CPU and memory cost from having to run that browser, compared to a simple direct HTTP request whose cost is near negligible.

So while you'll never truly defeat a sufficiently motivated actor, you may be able to drive their costs up high enough that it makes it difficult to enter the space or difficult to turn a profit if they're so inclined.
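
To make the arithmetic concrete, here is a back-of-the-envelope sketch using only the example latencies from this comment (1.5 s and 0.1 s are illustrative figures, not measurements, and the function name is made up). It ignores the extra CPU and memory per request, so the real gap would be larger.

```python
# Illustrative figures from the comment above; not measured values.
def requests_per_hour(seconds_per_request: float) -> float:
    """Sequential requests a single worker can complete in one hour."""
    return 3600 / seconds_per_request

direct = requests_per_hour(0.1)    # plain HTTP client
headless = requests_per_hour(1.5)  # headless browser
slowdown = direct / headless       # extra worker-hours needed for the same volume

print(f"direct: {direct:.0f}/h, headless: {headless:.0f}/h, slowdown: {slowdown:.0f}x")
```

Under these assumptions a bot operator needs about 15 times as many worker-hours for the same request volume, before counting the heavier per-worker hardware.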

replies(1): >>43756558 #

3. noduerme ◴[21 Apr 25 05:25 UTC] No.43748712[source]▶

>>43748662 (TP) #

A generous take would be that they have their own internal GUI tools that make it easier for non-programmers to set up visual elements in this. That was historically the reason to invent VMs like Flash. A less generous take would account for the enormous potential for hiding nefarious code inside such a thing, and account for the nature of the government which deployed it, and conclude that it was a national security / defense project disguised as a candy-coated trojan horse.

replies(1): >>43748994 #

4. throwaway48476 ◴[21 Apr 25 05:33 UTC] No.43748741[source]▶

>>43748662 (TP) #

Makes it easier to hide code that does browser fingerprinting.

5. supriyo-biswas ◴[21 Apr 25 06:37 UTC] No.43748994[source]▶

>>43748712 #

VM-based architectures are really common in the obfuscation space, which is why you have executable packers[1], JS packers[2] and bot management products[3][4] leveraging similar techniques.

As for why the obfuscation is needed: bot management products suffer from a fundamental weakness in that ultimately, all of them simply collect static data from the environment, therefore it would make much more sense to make the steps involved as difficult to reverse engineer as possible. Once that is done, all you need to do is slightly change the schematics of your script every few weeks and publish a new bundle, and you've got yourself a pretty unsubvertible* protection scheme.

Regarding the "trojan horse", I think someone is yet to show proof that it's a Javascript exploit.

(*Unsubvertible is obviously relative, but raising the cost of the attack from, say, $0.01/1000 requests to $10/1000 requests would massively cut down on abuse.)

[1] https://vmpsoft.com/

[2] https://jscrambler.com/

[3] https://github.com/neuroradiology/InsideReCaptcha

[4] https://www.zenrows.com/blog/bypass-cloudflare#_qEu5MvVdnILJ...

replies(1): >>43750071 #

6. ◴[21 Apr 25 07:14 UTC] No.43749167[source]▶

>>43748662 (TP) #

7. rfoo ◴[21 Apr 25 07:36 UTC] No.43749282[source]▶

>>43748662 (TP) #

Google has been doing this since forever for recaptcha. And, to be fair, it seems to be fairly effective for bot detection.

https://github.com/neuroradiology/InsideReCaptcha

> bots seem to be going a very different route

If the "very different route" means running a headless browser, then it's a success for this tech. Because the bot must run a blackbox JS now, and this gives people a whole new range of ways to run bot detection, using the bot's CPU.

replies(1): >>43756584 #

8. Scaevolus ◴[21 Apr 25 10:00 UTC] No.43750130[source]▶

>>43748662 (TP) #

Obfuscation is one part of defense in depth. Tiktok also has a variety of captchas to block scrapers, independent of this.

None of it's perfect, and they can be worked around, but by providing a barrier you've restricted some of the bad actors (spambots, scrapers) from acting at all.

It's easier to deal with 100 spambots than 1000!

replies(1): >>43752843 #

9. supriyo-biswas ◴[21 Apr 25 10:00 UTC] No.43750136{4}[source]▶

>>43750071 #

> Packers and obfuscators are not a "VM". It may look like a VM and act like a VM, in that it has opcodes that you can write to in another higher-level "language", but that is not a VM. It's in the same sandbox as the thing you're trying to obfuscate.

Indignant, caustic comments such as this, which simply discard all presented evidence, are the primary reason why people with the relevant experience have reduced their contributions on HN. I suspect I'll join their ranks too.

replies(3): >>43750173 #>>43750254 #>>43752054 #

10. saagarjha ◴[21 Apr 25 10:04 UTC] No.43750164{4}[source]▶

>>43750071 #

Amazing. Every word of what you said was wrong.

replies(1): >>43750267 #

11. esseph ◴[21 Apr 25 10:06 UTC] No.43750173{5}[source]▶

>>43750136 #

If you explain what you disagree with, maybe people would learn something.

If you join "their ranks", you've simply gone down the bit of the "social media + expertise" bell curve, where the more expertise you get on a particular topic, the less you want to engage about it in public.

This is not unique to any one field or realm of knowledge.

replies(3): >>43750216 #>>43750331 #>>43750929 #

12. supriyo-biswas ◴[21 Apr 25 10:16 UTC] No.43750216{6}[source]▶

>>43750173 #

This is all correct, just that I'm lamenting the decline of technical discussions over uninformed positions and polarizing diatribes at a place I loved to come to be informed. /soapbox

replies(1): >>43776118 #

13. noduerme ◴[21 Apr 25 10:24 UTC] No.43750254{5}[source]▶

>>43750136 #

I'm not trying to be caustic or indignant. Not at all! I'm very surprised someone would take it that way. I have a different definition of what a "VM" is. I've been writing code for 30+ years and this is my understanding of what that means. Please, by all means, if I am completely wrong then take the time to explain what your view is.

14. noduerme ◴[21 Apr 25 10:27 UTC] No.43750267{5}[source]▶

>>43750164 #

Well, then correct me.

replies(1): >>43758816 #

15. noduerme ◴[21 Apr 25 10:33 UTC] No.43750303{4}[source]▶

>>43750071 #

I seem to be misunderstood. In a sibling part of this thread that was responding to someone who was later flagged, I wrote this:

But that's basically an emulator of a VM, isn't it? It's like rewriting the Flash AVM2 into JS... it's still running in JS whereas the original VM was C++. It could JIT compile stuff but only because it literally was reserving memory that could overflow, and (semi-technical take here) from that advantage, of being closer to the metal, flowed all of the flaws in AVM2 that precipitated most of Adobe's woes with Flash. A VM implant in a web page that uses a plugin like Java or Flash, to get around running browser-sandboxed code, which can take over physical memory, is far different from just emulating a VM in Javascript. I wouldn't call writing a ton of opcodes in JS, which resolved to JS functions, a "virtual machine", because it isn't reserving anything or doing anything that Javascript can't do. Someone correct me here if I'm wrong... this is just heavy-duty obfuscation.

Also, one major purpose of a VM is to improve performance over what's available in the browser. If you use that as a measurement, this clearly doesn't fit that goal.

replies(1): >>43751788 #

16. noduerme ◴[21 Apr 25 10:37 UTC] No.43750331{6}[source]▶

>>43750173 #

It's also a learning experience on both sides if people who have the knowledge share it with others. Teaching is a great way to learn; I exercise a great deal in my mind when I'm not working by trying to help other people solve problems.

replies(1): >>43750634 #

17. ◴[21 Apr 25 11:19 UTC] No.43750634{7}[source]▶

>>43750331 #

18. oefrha ◴[21 Apr 25 12:00 UTC] No.43750929{6}[source]▶

>>43750173 #

> I can't think of a scenario where you'd need to deconstruct their front-end to mimic the calls. Just observe the calls and mimic them.

Just about everything in that wall of text is wrong, but it’s rather pointless to engage someone who clearly has never reverse engineered a single thing with a modicum of defense, yet has tons of opinions on the subject. It’s like debating the pros and cons of programming languages with someone whose claim of expertise is having used computers. I totally get gp’s frustration. Denial of service with too many absurd points is real.

> If you explain what you disagree with, maybe people would learn something.

TFA explains in detail why reverse engineering is necessary, and specifically what is achieved by reverse engineering which part. ggp clearly has zero interest in learning when much of what they wrote has been refuted in the damn post being discussed. Other people can RTFA to learn as well, it’s a good technical post, and we don’t get enough of those these days.

replies(2): >>43750994 #>>43751315 #

19. supriyo-biswas ◴[21 Apr 25 12:07 UTC] No.43750994{7}[source]▶

>>43750929 #

> Denial of service with too many absurd points is real.

Otherwise also known as sealioning: https://en.wikipedia.org/wiki/Sealioning

replies(1): >>43751351 #

20. noduerme ◴[21 Apr 25 12:39 UTC] No.43751315{7}[source]▶

>>43750929 #

k, you're saying you can't understand the obfuscated bullet they're sending to the backend unless you understand how it's turned into a particular nut on the front, and understand both sides of the request. I see how that makes sense, if you can't even figure out how to decipher the requests as you read them going out. Is that what you're trying to say? If so, say it, rather than this "wall of text" about why technical posts are scarce and bla bla bla. I'm more interested in the reason for this kind of obfuscation (and no, I don't think it's just to protect against bots, as there are many ways to do that).

replies(2): >>43751829 #>>43751832 #

21. noduerme ◴[21 Apr 25 12:42 UTC] No.43751351{8}[source]▶

>>43750994 #

[flagged]

22. lxgr ◴[21 Apr 25 13:26 UTC] No.43751788{5}[source]▶

>>43750303 #

> But that's basically an emulator of a VM, isn't it? It's like rewriting the Flash AVM2 into JS... it's still running in JS whereas the original VM was C++.

I think you're using a different definition of the term VM than most other people here. An "emulated VM" is a VM too.

> one major purpose of a VM is to improve performance over what's available in the browser.

That's definitely a very nonstandard interpretation. Many VMs are, intentionally, much less capable (in a permissions sense; in a computational sense, they're almost always exactly as capable) than the host environment they run in.

23. oefrha ◴[21 Apr 25 13:31 UTC] No.43751829{8}[source]▶

>>43751315 #

You see a parameter called signature in the request with a random looking value, you try to "mimic" it (how?) and you always get 403 back. How do you proceed? TFA tells you how, and tells you why reverse engineering is necessary (TFA in fact goes one step further than what is necessary, but you have to do at least half of the work there — I have done exactly that myself in the past).

All you've posted so far is "I don't think <other people's points>" while being wildly wrong. It's on you to explain handwaves like "just observe the calls and mimic them", but I don't think you'll do that, plus anyone with experience here can tell it's nonsense anyway, so here's where I disengage.

Edit: I'll add another point of view as someone who has implemented my own obfuscation scheme in a product where throwing up a third party CAPTCHA isn't an option (the above is from having studied and worked around other people's obfuscation schemes, including TikTok's). Obfuscation is an arms race so there's no 100% winning, but my implementation, while vastly simpler and probably unable to stop LukasOgunfeitimi, reduced the observable abuse of our product down to effectively zero. Turns out most hackers are pretty dumb. So, this shit works, "I don't think" be damned.

24. gruez ◴[21 Apr 25 13:31 UTC] No.43751832{8}[source]▶

>>43751315 #

>I'm more interested in the reason for this kind of obfuscation (and no, I don't think it's just to protect against bots, as there are many ways to do that).

see: https://news.ycombinator.com/item?id=43748681 and https://news.ycombinator.com/item?id=43749282

all of those can't really work if the javascript payload is easily comprehensible, because you can just write your own implementation in python or whatever.

25. ◴[21 Apr 25 14:25 UTC] No.43752385[source]▶

>>43748662 (TP) #

26. wzdd ◴[21 Apr 25 14:42 UTC] No.43752561{4}[source]▶

>>43750071 #

> I can't think of a scenario where you'd need to deconstruct their front-end to mimic the calls.

The article mentioned that important API calls are signed. So you would need at least partially to deconstruct their front-end to invoke the calls.

Any other nefarious purposes aside, this seems explicitly anti-bot because you can change the obfuscation whenever you like, forcing another RE effort.

27. like_any_other ◴[21 Apr 25 15:11 UTC] No.43752843[source]▶

>>43750130 #

Unless the scrapers are DDoSing the site, I refuse to consider the downloading of publicly posted data as malicious. It shows how captured the conversation has become by corporate interests, that viewing or storing data posted free of charge, publicly, by their users, in a way not approved by that corporation, is seen as malicious, and the only morally allowed way to view it is to use their spyware-laden client.

replies(2): >>43754292 #>>43755727 #

28. Scaevolus ◴[21 Apr 25 17:24 UTC] No.43754292{3}[source]▶

>>43752843 #

What if the user has disabled downloads of a video? Should the creator (and copyright owner) of a piece of media not be allowed even token attempts to prevent copying?

replies(2): >>43755467 #>>43769393 #

29. ndriscoll ◴[21 Apr 25 19:20 UTC] No.43755467{4}[source]▶

>>43754292 #

No because that interferes with fair use. If someone publicly posts a video, everyone has the right to copy it without any permission or awareness from the original author for things like commentary/criticism (it would be silly to require the copyright owner's permission to criticise a work!).

30. areyourllySorry ◴[21 Apr 25 19:49 UTC] No.43755727{3}[source]▶

>>43752843 #

this is also a measure against bots that write, not just those that read

31. godelski ◴[21 Apr 25 21:14 UTC] No.43756558[source]▶

>>43748681 #

I understand the argument. You can't have perfect defense and speedbumps are quite effective. I'm not trying to disagree with that.

But it does not seem like the solution is effective at mitigating bots. Presumably bots are going a different route considering how prolific they are, which warrants another solution. If they are going through this route then it certainly isn't effective either and also warrants another solution.

It seems like this obfuscation requires a fair amount of work, especially since you need to frequently update the code to rescramble it. Added complexity also increases risks for bugs and vulnerabilities, which ultimately undermine the whole endeavor.

I'm trying to understand why this level of effort is worth the cost. (Other than nefarious reasons. Those ones are rather obvious)

32. godelski ◴[21 Apr 25 21:16 UTC] No.43756584[source]▶

>>43749282 #

Okay... but those bots exist... and in high numbers... By "very different route" I mean "measure to effectively stop the bots" (or dramatically reduce). It seems like if they're using a headless browser then they're still being quite effective in accomplishing their goals.

replies(1): >>43764108 #

33. saagarjha ◴[22 Apr 25 03:21 UTC] No.43758816{6}[source]▶

>>43750267 #

Your definition of VM diverges from the standard one. This is a VM:

> it has opcodes that you can write to in another higher-level "language"

VMs aren't just VirtualBox.
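
For readers unfamiliar with the term as used here, a minimal sketch of what "a VM with opcodes" means in this sense: a bytecode list plus a dispatch loop written in the host language. The opcode numbers and names below are hypothetical and have nothing to do with TikTok's actual instruction set.

```python
# A toy stack-based VM. Bytecode is a flat list of integers and each opcode
# is dispatched to a handler in the host language. Opcodes are hypothetical.
PUSH, ADD, MUL, HALT = 0, 1, 2, 3

def run(bytecode: list[int]) -> int:
    stack: list[int] = []
    pc = 0  # program counter: index of the next instruction
    while True:
        op = bytecode[pc]
        pc += 1
        if op == PUSH:
            stack.append(bytecode[pc])  # the operand follows the opcode
            pc += 1
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == MUL:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == HALT:
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode {op}")

# Encodes (2 + 3) * 4
program = [PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT]
result = run(program)
```

Nothing here does anything the host language could not do directly, which is the point: the indirection exists to hide what the program does, not to add capability.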

TikTok obfuscates their frontend because they fingerprint the device and send it to their server, likely as an anti-fraud thing. Generally these kinds of things will make it so that you can't treat the frontend as a black box and replay requests or do anything simple like that. For example, if they add an incrementing counter to each request and then encrypt it, and then they see you send the same request again, then they will flag you for doing something fishy. You'd only know what they were doing if you reverse engineered the frontend. This isn't security per se but it makes people have to reverse engineer the code to mimic genuine behavior, which is their goal.
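
The counter example can be sketched roughly as follows. This is an assumption-laden illustration, not TikTok's scheme: it authenticates the counter with an HMAC instead of encrypting it (to stay within the Python standard library), and `SECRET`, `sign`, and `Server` are made-up names.

```python
import hashlib
import hmac

SECRET = b"demo-key"  # hypothetical key; in practice hidden inside the obfuscated client

def sign(counter: int, body: bytes, key: bytes = SECRET) -> str:
    """Client side: bind the request body to an incrementing counter."""
    msg = counter.to_bytes(8, "big") + body
    return hmac.new(key, msg, hashlib.sha256).hexdigest()

class Server:
    def __init__(self, key: bytes = SECRET) -> None:
        self.key = key
        self.last_counter = -1  # highest counter accepted so far (single session)

    def accept(self, counter: int, body: bytes, signature: str) -> bool:
        expected = sign(counter, body, self.key)
        if not hmac.compare_digest(expected, signature):
            return False  # forged or tampered request
        if counter <= self.last_counter:
            return False  # same or older counter seen again: treat as a replay
        self.last_counter = counter
        return True

server = Server()
sig = sign(0, b"like=123")
first = server.accept(0, b"like=123", sig)     # genuine request
replayed = server.accept(0, b"like=123", sig)  # identical bytes sent again
```

A client that treats the frontend as a black box and resends captured requests fails the counter check, while producing a fresh valid signature requires knowing how `sign` works, which is what the obfuscation is hiding.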

As for TikTok being a cyberweapon: you're going to have to back up that claim. Obfuscating an app and then installing it on people's phones is not as clever an idea as you think it is.

34. mike_hearn ◴[22 Apr 25 16:50 UTC] No.43764108{3}[source]▶

>>43756584 #

Google's obfuscating VM based anti-bot system (BotGuard) was very effective. Source: I wrote it. We used it to completely wipe out numerous botnets that were abusing Google's products e.g. posting spam, clickfraud, phishing campaigns. BotGuard is still deployed on basically every Google product and they later did similar systems for Android and iOS, so I guess it continues to work well.

AFAIK Google was the first to use VM based obfuscation in JavaScript. Nobody was using this technique at the time for anti-spam so I was inspired primarily by the work Nate Lawson did on BluRay.

What most people didn't realize back then is that if you can force your adversary to run a full blown web browser there are numerous tricks to detect that the browser is being automated. When BotGuard was new most of those tricks were specific to Internet Explorer, none were already known (I had to discover them myself) and I never found any evidence that any of them were rediscovered outside of Google. The original bag of tricks is obsolete now of course, nobody is using Internet Explorer anymore. I don't know what it does these days.

The VM isn't merely about protecting the tricks, though. That's useful but not the main reason for it. The main reason is to make it easier to generate random encrypted programs for the VM, and thus harder to write a static analysis. If you can't write a static analysis for the program supplied by your adversary you're forced to actually execute it and therefore can't write a "safe" bot. If the program changes in ways that are designed to detect your bot, done well there's no good way to detect this and bring the botnet to a safe halt because you don't know what the program is actually doing at the semantic level. Therefore the generated programs can detect your bot and then report back to the server what it found, triggering delayed IP/account/phone number bans. It's very expensive for abusers to go through these bans but because they have to blindly execute the generated programs they can't easily reduce the risk. Once the profit margin shrinks below the margin from abusing a different website, they leave and you win.
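
One narrow way to picture "random programs for the VM" is per-build opcode shuffling, sketched below. This is only an illustration under assumed names (`OPS`, `make_build`, `assemble`, `run`); it is not a description of how BotGuard works, and it leaves out encryption entirely.

```python
import random

OPS = ["PUSH", "ADD", "MUL", "HALT"]  # hypothetical instruction set

def make_build(seed: int) -> dict[str, int]:
    """Assign each instruction a distinct random opcode number for one build."""
    rng = random.Random(seed)
    codes = rng.sample(range(256), len(OPS))
    return dict(zip(OPS, codes))

def assemble(source: list, table: dict[str, int]) -> list[int]:
    """Map mnemonics to this build's opcode numbers; integers pass through as operands."""
    return [table[x] if isinstance(x, str) else x for x in source]

def run(bytecode: list[int], table: dict[str, int]) -> int:
    decode = {code: name for name, code in table.items()}
    stack: list[int] = []
    pc = 0
    while True:
        op = decode[bytecode[pc]]
        pc += 1
        if op == "PUSH":
            stack.append(bytecode[pc])
            pc += 1
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:  # HALT
            return stack.pop()

source = ["PUSH", 2, "PUSH", 3, "ADD", "PUSH", 4, "MUL", "HALT"]
build_a, build_b = make_build(1), make_build(2)
code_a, code_b = assemble(source, build_a), assemble(source, build_b)
result_a, result_b = run(code_a, build_a), run(code_b, build_b)
```

Both builds compute the same value, but their bytecode will almost always differ, so a pattern written against one build's numbers is unlikely to match the next build. A real system would presumably vary far more than the opcode table.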

35. hoseja ◴[23 Apr 25 07:03 UTC] No.43769393{4}[source]▶

>>43754292 #

Here's a great way to prevent people copying your precious video: don't post it on the internet.

36. esseph ◴[23 Apr 25 20:05 UTC] No.43776118{7}[source]▶

>>43750216 #

You're the one doing the informing on this topic, congrats, you've forum-peaked.