Art Theft 2.0 - Overhauling the System
As far as I'm aware, no other site has an art theft prevention system. Waterfall's is the first, and I'm quite proud of it. However, it's an imperfect system, and I have some improvements I want to make to it.
Art Theft 1.0 - An Overview
Art Theft 1.0 (or, to be more accurate, the current version is 1.3 or so) is an extraordinarily simple piece of code. Unfortunately, extraordinarily simple means extraordinarily easy to defeat.
When a piece is marked as art, three things happen in order. First, the image is hashed with MD5. Then, a list of MD5 hashes for all previously uploaded art that doesn't belong to the uploading user is retrieved, and the new hash is checked against each one. Quick and painless.
MD5 is best described as a "signature" that can be computed for a file to check whether it's the same thing as something else. It's old, and collision attacks against it have been demonstrated, but the real problem for our purposes is that changing a single byte in the file results in a completely different MD5 signature.
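For the curious, the current check boils down to something like this (a minimal Python sketch, not the actual site code - the function names are mine):

```python
import hashlib

def md5_of_file(path: str) -> str:
    """Return the MD5 hex digest of a file's raw bytes."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def is_duplicate(upload_path: str, existing_hashes: set[str]) -> bool:
    """Check an upload against the stored hashes of other users' art."""
    return md5_of_file(upload_path) in existing_hashes
```

One flipped bit anywhere in the file and is_duplicate returns False, which is exactly the weakness described above.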
This means that lazy thieves get caught, but anyone else doesn't. I've been thinking about this a lot over the last 6 months, and I think that - after the app is done - it's time to address these shortcomings.
Before I continue - while the system is easy to defeat, you'll still be banned if you crop out a watermark and upload it as your own thing.
Art Theft 2.0 - Rise of the Machines
The current system is instantaneous. You press upload, and it tells you within a couple of seconds whether you're naughty or not. After a brief bit of thinking, I realised this can't be the case with any sufficiently advanced system that could be called "decent". So first things first - when Art Theft 2.0 rolls out, art posts will have two states: unverified and verified. The main difference is that unverified posts will have a yellow icon instead of a green one for the art symbol in the corner of the header. If a post passes verification - no problem, nothing happens. If a post fails verification, it works the same way as it does now: it'll be silently converted into a reblog of the artist's original post. The major difference is that, since there's a chance the post will be reblogged while it's awaiting verification, any reblogs of that post will need to be converted too. Luckily, the way the site stores post chains means this is not a problem at all.
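Conceptually, the flow looks something like this (a simplified sketch - the real post model and reblog chain handling on the site are more involved, and every name here is illustrative):

```python
from dataclasses import dataclass, field
from enum import Enum

class ArtStatus(Enum):
    UNVERIFIED = "unverified"  # yellow icon - still being checked in the background
    VERIFIED = "verified"      # green icon - passed all checks

@dataclass
class ArtPost:
    id: int
    status: ArtStatus = ArtStatus.UNVERIFIED
    reblog_of: "ArtPost | None" = None               # what this post points at in the chain
    reblogs: list["ArtPost"] = field(default_factory=list)

def apply_verification(post: ArtPost, passed: bool, original: "ArtPost | None" = None) -> None:
    """Apply the outcome of background verification to a post and its reblogs."""
    if passed:
        post.status = ArtStatus.VERIFIED
        return
    # Failed: silently convert the post into a reblog of the artist's original,
    # and repoint any reblogs made while verification was still pending.
    post.reblog_of = original
    for reblog in post.reblogs:
        reblog.reblog_of = original
```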
While a post is in the unverified state, the site will be running the verification process in the background. Let's go over what that process will be. It's SIGNIFICANTLY more complex than the current method, and requires some special hardware, so before the cutoff, I'm going to link our Patreon. Ordinarily it's hidden down in the site footer because we feel weird about taking money without giving anything in return (other than... the site, I guess?), but the faster we get the hardware to run this (and the more of it), the faster we can improve things.
The first stage of the process will be to divide the image into a grid of squares. For each square in the grid, an MD5 (or perhaps a SHA-256) hash is generated, and the system will then search for existing images with matching square hashes. In theory, there should never be any matches unless the image is a straight duplicate. This is the low-hanging-fruit part - the process can stop here if it finds a full match.
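As a rough illustration, here's how splitting and hashing might look (a Python sketch using Pillow; the grid size and hash choice are placeholders, not the final values):

```python
import hashlib
from PIL import Image

def grid_hashes(path: str, grid: int = 8) -> list[str]:
    """Split an image into a grid x grid set of squares and MD5 each one."""
    img = Image.open(path).convert("RGBA")
    w, h = img.size
    cell_w, cell_h = w // grid, h // grid
    hashes = []
    for row in range(grid):
        for col in range(grid):
            box = (col * cell_w, row * cell_h, (col + 1) * cell_w, (row + 1) * cell_h)
            square = img.crop(box)
            hashes.append(hashlib.md5(square.tobytes()).hexdigest())
    return hashes

# A straight duplicate produces an identical list of hashes,
# so a full match lets the process stop early - the low-hanging fruit.
```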
But what if we modify a square?
Say we draw a line through one of the squares. That square now gets a different MD5. But collisions CAN happen - for example, if someone uses a plain white background or transparency, and whatever grid size we use ends up capturing fully transparent or fully white squares - and that's enough to throw a square-by-square check off. So instead, the system will go off how many squares are the same. This is a good time to introduce confidence scoring.
Being Confident
If all but one square returns an MD5 match against an existing image, we can say pretty confidently that it's a repost. Let's say 98% confident. The system will reclassify that post as stolen and change it to a reblog.
But what about more complicated scenarios? The grid system isn't the only method we'll be using, and the others are less clear cut. We need to assign a cutoff or two: how confident should the system be before it acts autonomously? If it's less confident than that, it should ask a mod to review the post manually. At the same time, below some point the system should be confident enough that it's not a repost that it doesn't bother us at all. Until we've refined the system, we'll settle on 90% confidence for autonomous action and 20% confidence for not bothering us. Now, let's go over the other methods we'll be using.
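In code terms, the decision logic might look something like this (a sketch with the thresholds above baked in; the confidence formula itself is a placeholder we'll keep tuning):

```python
def grid_confidence(upload_hashes: list[str], existing_hashes: list[str]) -> float:
    """Rough confidence that an upload is a repost, based on how many squares match."""
    matches = sum(1 for a, b in zip(upload_hashes, existing_hashes) if a == b)
    return matches / len(upload_hashes)

def decide(confidence: float) -> str:
    """Act autonomously, ask a mod, or let the post through."""
    if confidence >= 0.90:
        return "convert to reblog"      # confident enough to act on its own
    if confidence > 0.20:
        return "queue for mod review"   # suspicious, but not certain
    return "mark verified"              # confident enough that it's original work
```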
Size Matters
Resizing something is a common way to get around filters of this kind, so we need to keep records of different sizes of the art too. We'll also need to check for flipped images - the same trick people use when uploading TV episodes to YouTube to get around the copyright filters. Adding borders is something we need to check for as well.
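One plausible way to handle this (again, just a sketch - the real system would keep precomputed records rather than generating variants on the fly, and the canonical size is a made-up value):

```python
from PIL import Image, ImageOps

# Canonical size used before hashing, so resized reposts line up (placeholder value).
CANONICAL = (512, 512)

def candidate_variants(path: str) -> list[Image.Image]:
    """Normalise size and generate flipped variants for comparison.

    Added borders would need a separate cropping pass before this step.
    """
    base = Image.open(path).convert("RGBA").resize(CANONICAL)
    return [
        base,                   # resized to the canonical dimensions
        ImageOps.mirror(base),  # flipped left-right - the YouTube-upload trick
        ImageOps.flip(base),    # flipped top-bottom
    ]
```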
Why so Blue?
Another common way of getting around filters is changing the colours of something - either the colour itself, or the saturation. Checking this will be a pain in the ass, but it's essential to a comprehensive theft prevention system.
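The basic idea is to strip colour out before comparing, so hue and saturation shifts stop mattering. A minimal sketch of that normalisation (the exact approach we settle on may well differ):

```python
from PIL import Image, ImageOps

def colour_normalised(path: str) -> Image.Image:
    """Reduce an image to equalised greyscale so recolouring, saturation tweaks,
    and simple brightness shifts don't change what gets compared."""
    grey = Image.open(path).convert("L")   # throw away hue and saturation
    return ImageOps.equalize(grey)         # flatten brightness/contrast shifts
```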
I See You
Finally, the part that'll take the longest - visual comparisons. If all the above fail, there's a chance the thief is skilled. So, we pass the image to an AI to look at. Once we've got it working right, this is all of the above on steroids. It'll be able to look at an image and say whether it's seen it before, along with a confidence score. If it gets to this stage, unless the system is 100% certain, a mod will likely be required to intervene - after all, we've seen what visual recognition is like with Tumblr's porn filter.
This step has some nice bonuses, however. It'll be able to see if an image has been blurred, is a cropped version blown up to full size, or is something it's seen before with the watermarks removed or the text altered. It might even be able to tell whether a piece is a trace of someone else's work, or what's been used as a reference - however, we're consciously choosing not to intervene on that stuff, and it would be a waste of resources to try.
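To give a sense of what "passing it to an AI" can mean in practice, here's a sketch of one common approach - comparing embeddings from a pretrained vision model. This is purely illustrative: we haven't committed to a model or framework, and ResNet (via a recent torchvision) is just a stand-in here.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Pretrained backbone used purely as a feature extractor (illustrative choice).
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = torch.nn.Identity()   # drop the classifier head, keep the feature vector
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path: str) -> torch.Tensor:
    """Turn an image into a feature vector that tolerates blurring, crops, etc."""
    img = Image.open(path).convert("RGB")
    with torch.no_grad():
        return model(preprocess(img).unsqueeze(0)).squeeze(0)

def visual_similarity(path_a: str, path_b: str) -> float:
    """Cosine similarity between embeddings - higher means more visually alike."""
    a, b = embed(path_a), embed(path_b)
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()
```

In a setup like this, the similarity score would feed into the same confidence thresholds described earlier, with anything short of near-certainty going to a mod.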
Here's what it makes of our test piece (rendered in paint - right now, we get a text readout of pixel areas that I've had to translate into something easy to read). As you can see, it's far from perfect, which is why manual mod intervention will be mandatory for this stage.
Closing Thoughts
The above is about half of the system we're implementing, excluding experiments that are more about curiosity than anything we seriously plan to include. We're not listing everything here, partly because the post would drag on, and partly because we want some element of secrecy so you can't find holes to defeat it.
It's a pretty complex system, and our aim is that any given art piece should take no longer than 20 minutes to verify. In an ideal world it'd be 5 minutes - but we're unlikely to have the budget for that any time soon, and as more art is uploaded, verification will inherently take longer.
Suffice it to say - the art theft system, while unique, is flawed. We want to fix it, and we want to share with you what we're doing to improve on it.
Thanks for reading!