You install a "free" duplicate file finder. You point it at your Pictures folder. You hit scan.
What you don't see is the network connection it just opened. The small batches of file metadata being uploaded somewhere for "analysis." The fact that the company behind the tool now knows you've got 14,000 photos, that some of them are tagged with GPS coordinates from your home address, and that one folder is called "tax_returns_2024."
This isn't a hypothetical. It's how a lot of free file utilities actually work. And in 2026, with AI training data being scraped from anywhere it can be reached, it matters more than it used to.
This post is about what "local" actually means, why it's worth checking, and how to test whether a tool is telling you the truth.
There's a phrase that gets repeated until it sounds like a cliché: if you're not paying for the product, you are the product. It's been true of search engines, social networks, and email. It's now true of a lot of file utilities too.
The business models break down roughly like this:
Cloud-backed scanners. The tool uploads your file hashes (or sometimes the files themselves) to a server for "faster processing" or "cloud-powered AI matching." That metadata gets stored. It might be sold. It almost certainly trains something.
Telemetry-funded freemium. The free tier collects usage data — what folders you scan, what file types, how big they are, how often you use it — and that data gets aggregated, anonymised (allegedly), and sold to data brokers or used to fuel a paid analytics product.
Bundled adware. The classic. The scanner is free because the installer also drops a browser toolbar, a "PC optimiser," or three Chrome extensions you didn't ask for.
The bait-and-switch. Free for the first scan. Then pay £40 to actually delete the duplicates it found. Some of these tools are decent, but the model trains users to expect surprise paywalls in software that's meant to clean up their machine.
None of these are illegal. Most are disclosed somewhere in a 47-page terms-of-service document nobody reads. But "free" almost always means something is being extracted from you, and with file scanners, what's being extracted is information about what's on your hard drive.
You might think file metadata is harmless. It's just filenames and sizes, right?
Not really. A file scan reveals:
Folder names like medical/, divorce_papers/, kids_school/ tell a story before any file is opened.
Aggregated across millions of users, this is enormously valuable data. It feeds AI training sets. It feeds advertising profiles. It feeds whatever the next thing is that nobody has invented yet but will want training data for.
This bit is new, and it's why we think the privacy conversation around file utilities is overdue for a refresh.
Every major AI company is hungry for training data. Public web scraping has more or less hit its limit — the obvious sources have been hoovered up, and the legal pushback from publishers is starting to bite. The next frontier is private data: emails, documents, file structures, the actual stuff on real people's machines.
Companies that already have a foothold on your computer — antivirus vendors, "system optimiser" suites, cloud sync tools, file utilities — are sitting on exactly the kind of data the AI industry would happily pay for. And the licensing terms most of these tools use are vague enough to cover "improving our services," which is the same wording cloud providers used for years before everyone realised it meant "feeding it into models."
You don't have a meaningful way to opt out of this once the data has left your machine. Once it's uploaded, it's uploaded. The only reliable defence is to use tools that don't upload in the first place.
You'll see "zero telemetry" used as a marketing line a lot now. Sometimes it's true. Sometimes it's not. Here's what it should mean if it's used honestly:
Some of these things have legitimate uses. Crash reports help developers fix bugs. Update checks keep you secure. The point isn't that all telemetry is evil — it's that for a tool that's reading every file on your hard drive, the bar for "phoning home" should be very high, and you should be able to verify what's actually being sent.
The good news: you don't have to take anyone's word for it. There are ways to check whether a tool is doing what it claims.
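Pull the network cable. The simplest test. Disconnect from the internet (or turn on aeroplane mode) and run a full scan. If the tool works normally, the scan itself runs locally. If it errors out, throws warnings, or refuses to start, something needs the internet to function, and you should ask why. Bear in mind this only shows the tool can work offline, not that it stays quiet when you're back online. The checks below cover that.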
Use Windows Resource Monitor. Press Win + R, type resmon, hit enter. Go to the Network tab. Run your scan. Watch the "Network Activity" section. A truly local tool will show no network usage during scanning. If you see your file scanner sending data to an external IP, that's the answer.
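If you'd rather have a record than watch a live window, the same information is available from the built-in `netstat -ano` command, which lists connections alongside the owning process ID. Below is a minimal sketch, assuming you've saved that output as text while the scan was running and looked up the scanner's PID (Task Manager's Details tab shows it). The function name and the sample data are ours, purely for illustration.

```python
import ipaddress

def external_connections(netstat_output: str, pid: int) -> list[tuple[str, str]]:
    """Return (protocol, remote address) pairs owned by `pid` that point off-machine."""
    found = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows have 5 columns (including State); UDP rows have 4 (no State).
        if len(fields) not in (4, 5) or fields[0] not in ("TCP", "UDP"):
            continue
        # The last column is the owning PID; skip rows for other processes.
        if not fields[-1].isdigit() or int(fields[-1]) != pid:
            continue
        remote = fields[2]
        host = remote.rsplit(":", 1)[0].strip("[]")
        try:
            addr = ipaddress.ip_address(host.split("%")[0])
        except ValueError:
            continue  # placeholders such as "*:*" are not real remote hosts
        # Loopback and unspecified addresses never leave the machine.
        if addr.is_loopback or addr.is_unspecified:
            continue
        found.append((fields[0], remote))
    return found

# Made-up sample in the layout `netstat -ano` prints on Windows.
SAMPLE = """
Active Connections

  Proto  Local Address          Foreign Address        State           PID
  TCP    192.168.1.10:50123     93.184.216.34:443      ESTABLISHED     4321
  TCP    127.0.0.1:5354         127.0.0.1:49670        ESTABLISHED     4321
  TCP    192.168.1.10:50200     142.250.0.1:443        ESTABLISHED     999
  UDP    0.0.0.0:5353           *:*                                    4321
"""

for proto, remote in external_connections(SAMPLE, pid=4321):
    print(proto, remote)
```

An empty result during a full scan is what you'd hope to see. Anything listed is a remote address worth looking up.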
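Use Wireshark for the proper test. This is the heavyweight option. Wireshark is free, captures the packets crossing your network interface, and lets you filter by IP address, port, or protocol. It doesn't label traffic by application, so close other network-heavy programs first, or match what you capture against the remote addresses Resource Monitor showed for the scanner. If you want to be certain, install it, start a capture, run your file scanner, and look at what came out. It's geeky but it's definitive.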
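Wireshark display filters are written against packet fields such as `ip.addr` and `ipv6.addr`. One way to narrow a busy capture, sketched below, is to take the remote addresses you've already seen the scanner talk to and turn them into a filter string you can paste into Wireshark's filter bar. The helper name and the example addresses are hypothetical.

```python
import ipaddress

def display_filter(remote_hosts: list[str]) -> str:
    """Build a Wireshark display filter matching traffic to or from the given IPs."""
    clauses = []
    for host in remote_hosts:
        addr = ipaddress.ip_address(host)  # raises ValueError on anything that isn't an IP
        field = "ip.addr" if addr.version == 4 else "ipv6.addr"
        clauses.append(f"{field} == {addr}")
    # "||" is the display-filter OR operator.
    return " || ".join(clauses)

print(display_filter(["93.184.216.34", "2001:db8::1"]))
```

If the filter comes back with packets while your scan is running, open a few and look at the destination and payload size. That tells you far more than a marketing page will.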
Check the firewall. Windows Firewall lets you block specific applications from accessing the internet. Block your file scanner. Try to use it. If everything still works, it didn't need the network. If the scanner breaks, you've learned something.
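The same block can be set up from an administrator command prompt with `netsh advfirewall firewall add rule`, which is handy if you want to repeat the test across several tools. The sketch below only builds the commands; the rule name and program path are placeholders we made up, and actually running them needs an elevated prompt.

```python
def block_outbound_rule(program_path: str, rule_name: str = "BlockFileScannerTest") -> list[str]:
    """Arguments for a Windows Firewall rule that blocks outbound traffic from one program."""
    return [
        "netsh", "advfirewall", "firewall", "add", "rule",
        f"name={rule_name}",
        "dir=out",        # outbound traffic only
        "action=block",
        f"program={program_path}",
        "enable=yes",
    ]

def remove_rule(rule_name: str = "BlockFileScannerTest") -> list[str]:
    """Arguments that delete the test rule again once you're done."""
    return ["netsh", "advfirewall", "firewall", "delete", "rule", f"name={rule_name}"]

# Hypothetical install path; substitute the scanner's real .exe location.
cmd = block_outbound_rule(r"C:\Program Files\ExampleScanner\scanner.exe")
print(" ".join(cmd))
# To apply it from an elevated Python session (assumption: Windows, admin rights):
#   import subprocess; subprocess.run(cmd, check=True)
```

Remember to remove the rule afterwards if you decide to keep the tool and want its update checks to work.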
Read the privacy policy. This is the boring option but it's worth ten minutes. Search the document for the words "share," "third party," "advertising," "training," and "improve our services." Those are the phrases that tend to hide the data flows.
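If the policy is long, a few lines of script will do the searching for you. This is a rough sketch, assuming you've pasted the policy into a text string. It looks for the phrases listed above and prints a little surrounding context for each hit. The function name is ours, and the matching is deliberately loose: "share" will also catch "shared", and "third party" also matches "third-party".

```python
import re

PHRASES = ["share", "third party", "advertising", "training", "improve our services"]

def find_phrases(policy_text: str, phrases=PHRASES, context: int = 60) -> dict[str, list[str]]:
    """Map each phrase found in the policy to short snippets showing where it appears."""
    hits = {}
    for phrase in phrases:
        # Allow whitespace or a hyphen between words, case-insensitively.
        pattern = re.compile(r"[\s-]+".join(map(re.escape, phrase.split())), re.IGNORECASE)
        snippets = []
        for match in pattern.finditer(policy_text):
            start = max(0, match.start() - context)
            end = min(len(policy_text), match.end() + context)
            # Collapse line breaks so each snippet prints on one line.
            snippets.append(" ".join(policy_text[start:end].split()))
        if snippets:
            hits[phrase] = snippets
    return hits

# Invented example sentence, not taken from any real policy.
example = "We may share usage data with third-party partners to improve our services."
for phrase, snippets in find_phrases(example).items():
    print(f"{phrase!r}: {len(snippets)} hit(s)")
```

A hit isn't proof of anything on its own. It just tells you which paragraphs deserve a careful read.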
Full disclosure: we built K8 precisely because we'd done these tests on other tools and didn't like what we saw.
A few specifics:
If any of that ever changes, we'll tell you. It would also break the patent we've filed, which is partly built on the local-only architecture.
The duplicate finder market is a small slice of a much bigger conversation. The same questions apply to:
The principle is the same: anything that runs with broad access to your filesystem is a privileged tool. It should be held to a privileged standard. Local-by-default, transparent about what leaves the machine, and easy to verify.
Your hard drive is the most personal computing surface you own. The contents reveal more about your life than your browser history or your search queries. Tools that scan it deserve the same scrutiny you'd apply to anything else with that level of access.
When you pick a duplicate finder, a disk cleaner, or any utility that touches your files at scale:
K8 is one option. There are others. The point is to make the choice deliberately, knowing what you're trading.
K8 runs 100% locally. No cloud, no accounts, no telemetry. Verify it yourself with Wireshark.
K8 is available at lilbuba.ai. One-time purchase. No subscription. No cloud. Your files stay on your machine, where they belong.