Saving and Verifying the Internet

The big problem with web archive sites is that they must comply with DMCA takedowns on a regular basis; they are susceptible to censorship.

The big problem with saving the page yourself is that it can't be verified; nobody trusts you.

This made me think: what if a site was created that does nothing but archive and share hashes? The site downloads the web page and all its content like a normal archiver would, but instead of sharing the page, it simply stores and shares the hash and deletes the content.

A user would create the archive on their own, in a format that matches the archive verification site. The user would then request that the site verify and archive the hash. The user would not upload their version of the archive; the site would generate it on its own. The user would then check that their hash matches the verification site's hash (which it should if everything went smoothly).

The user would then be able to freely share this web page and avoid censorship, and the shared page could be verified as accurate by anyone who wants to, just as archive.is and the rest are assumed to be accurate with the contents of their archives, except here the archive site only holds and verifies the hash.
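
Roughly, the client side of that check could look like the sketch below. It assumes both sides already produce a byte-identical archive file; the verify.example endpoint and its JSON field are made up for illustration.

```python
# Hypothetical client-side verification: hash the local archive, ask the
# verification site for the hash it computed on its own fetch, compare.
import hashlib
import json
import urllib.parse
import urllib.request

def sha256_of_file(path, chunk_size=1 << 20):
    # Hash the local archive in chunks so large archives don't need to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def request_server_hash(page_url):
    # Ask the (hypothetical) verification site to fetch and hash the page itself;
    # only the hash ever comes back, never the content.
    api = "https://verify.example/api/archive?url=" + urllib.parse.quote(page_url, safe="")
    with urllib.request.urlopen(api) as resp:
        return json.load(resp)["sha256"]

local = sha256_of_file("my_archive.tar")
remote = request_server_hash("https://example.com/some/page")
print("verified" if local == remote else "mismatch")
```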

I wanted to bounce this idea.

The big key point here is whether or not a hash is considered copyrighted content. If I hash a proprietary file, or website, or anything else proprietary, is the hash still protected by the original file's copyright? Everything I've been able to find on this says no.

The other issue is matching the format between the user's browser and the verification site, so that the hashes line up in the first place. MAFF and MHT look like they are dead. This would require a good amount of work, but there are already extensions that save a web page into a single file:
addons.mozilla.org/en-US/firefox/addon/save-page-we/
chrome.google.com/webstore/detail/save-page-we/dhhpefjklgkmgeafimnjhojgjamoafof?hl=en-US

There aren't as many as I thought there would be, but they do exist. Save Page WE is GPLv2, so the method it uses to create the single-page archive could be copied verbatim on the verification site, or a new format could be created (I'm thinking something like just tarring the output of the browser's "save complete page", the folder and the file; not sure if that would line up cross-browser) for use on the verification site and in the user's browser via a WebExtension.
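
If the tar route were taken, the tar itself would have to be made deterministic or the hashes would never line up. A rough sketch of that idea, with placeholder file names:

```python
# Sketch of a deterministic "tar the saved page" format. Tar is only
# reproducible if ordering and metadata are normalized, so this zeroes
# out mtimes/owners and sorts the paths. "page.html"/"page_files" are examples.
import hashlib
import io
import os
import tarfile

def _normalize(info):
    # Strip everything that varies between machines and runs.
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    info.mtime = 0
    info.mode = 0o644 if info.isfile() else 0o755
    return info

def deterministic_tar_hash(entries):
    # entries: the top-level paths from "save complete page", e.g. the file and its folder.
    files = []
    for entry in entries:
        if os.path.isdir(entry):
            for root, _, names in os.walk(entry):
                files.extend(os.path.join(root, n) for n in names)
        else:
            files.append(entry)
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for path in sorted(files):
            tar.add(path, arcname=path, recursive=False, filter=_normalize)
    return hashlib.sha256(buf.getvalue()).hexdigest()

print(deterministic_tar_hash(["page.html", "page_files"]))
```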

Is this idea worth exploring? Do you think anyone would bother to use it, or just continue to deal with the censorship of archive.is and the like?

Attached: save_page_we.png (254x148, 4.54K)

this is why I hate buzzwords

Something like this might be a legitimate use of a blockchain.

Extremely useful, get to work! I’ll pay you when you’re done if you post your wallet address or a patreon or something. I want this done now. This project is vital to free speech. If you do a good job I’ll seriously send you good money like 1-2Gs here.

i'm not into the blockchain meme; that seems like a lot of bloat to simply match hashes between a client and a server

that's good to know. I'm sure I would take donations, because I don't hate money, but the whole thing would be GPL, even the server side. that's the only way to avoid the censorship: even if (and i have to do more research on this) the hashes are not considered copyrighted content, that doesn't mean the site couldn't get sued offline. GPL client and server side would make that irrelevant.

Thinking about this further, a browser-side archiver would cause nothing but problems due to websites changing their output based on user-agent and other things. The client side would likely be a separate client.

-archiver core: creates the archive and the hashes, CLI front-end

-client side: GUI for the archiver core

-server side: website that uses the archiver core, deletes the archive and only uses the hash afterwards, stores it in the database, search functions, etc.

does the EFF answer questions like this regarding the copyright status of a hash?

What about ads? Every person who tries archiving website X will be shown a different set of ads, thus creating non-matching hashes. Obviously you could block ads by default, but then you wouldn't be making a perfect copy of the website. The other archivers make it possible to check who was advertising on a given website at a given point in time, for example.

Following up on what was said:
You would need to apply some heuristics for what to cull from your hash data.
You have things like ads / tracking IDs, as suggested before, but you have much more benign things as well:
- Dates
- Customization for specific user agents (not common in modern times)
- Paywalls
- Membership walls
- Username of a logged-in user

One solution would be to take the DOM tree and, for all of the nodes below height X that have children, compute a hash for them.

that's a good point, i didn't think about that. 3rd-party domains could be blocked from the archive; most ads come off a different domain, but that might also block out other stuff.

The only way I can think of to deal with that is to hash all of the individual files in the archive, and then provide all of the hashes. Providing the page's file layout might be approaching copyrighted territory, but I don't see a way around it. The client could then have an option to remove any files that don't match the server's hashes (or save another copy), which would leave only a verified page; whatever was deleted would just fail to load.
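
A minimal sketch of that per-file option, assuming the server hands back a manifest of {relative path: sha256}; the manifest shape and paths are made up:

```python
# Per-file verification against a (hypothetical) server manifest.
# Anything missing or mismatched is removed so only verified files remain;
# those files then simply fail to load in the archived page.
import hashlib
import os

def file_sha256(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def prune_unverified(archive_dir, server_manifest):
    removed = []
    for root, _, names in os.walk(archive_dir):
        for name in names:
            full = os.path.join(root, name)
            rel = os.path.relpath(full, archive_dir)
            if server_manifest.get(rel) != file_sha256(full):
                os.remove(full)
                removed.append(rel)
    return removed
```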

Another possibility is to hash only the source, or extract all of the text content and hash only that. Hashing only the source still might cause problems with dynamic pages and ad content; you could remove all the javascript, but then half the internet isn't going to load at all.

Another consideration I saw checking out archive.is is GeoIP targeting. archive.is sends the user's IP address to the site in a header (didn't know that), to match the user's location instead of the server's. I don't know any way to deal with that except to replicate the functionality on the server, delete the user's IP address afterward, and allow for multiple correct hashes for the same website at the same exact time.

I think all of these options might need to be implemented to deal with these problems.
-Text Only
-Block 3rd Party
-No Javascript
-Individual File Hashes

That's becoming a lot of combinations.
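
Whichever combination ends up being used, the options themselves would have to be bound into the final identifier, or two parties using different settings could never agree. A small sketch of that idea; the option and function names are made up:

```python
# Hypothetical: record which archiving options were used alongside the hash,
# since a text-only hash and a full-page hash of the same URL will never match.
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class ArchiveOptions:
    text_only: bool = False
    block_third_party: bool = False
    no_javascript: bool = False
    per_file_hashes: bool = False

def tag_hash(archive_hash, options):
    # Bind the options into the identifier so verification compares like with like.
    blob = json.dumps({"hash": archive_hash, "options": asdict(options)},
                      sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()
```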


if there's a paywall or membership wall or any kind of required login, the disclaimer could be that it's flat out not going to work and not expected to. there's no way a server is going to be able to replicate something behind a login wall. the page has to be publicly accessible in the first place.

Hashing every child in the DOM tree is interesting, but at that point you might as well provide the entire source yourself; you're now providing the entire DOM layout, albeit hashed. I think that gets extremely copyright risky. It's a good idea though.

I was just being comprehensive when listing stuff like a paywall.
That is not what I said. Please reread it. I just noticed I accidentally said height instead of depth.
To reiterate my idea: you take all the nodes with depth below a threshold and then create a tree that mirrors the DOM tree but only includes the nodes you have hashes for. In this new tree you store the hashes you calculated. In order to cut down on mostly pointless hashes, I suggest culling nodes which do not have any children.
One problem with this approach is if a website's DOM tree is bloated. In that case you might not get results that are as good. A solution to this problem is, instead of doing this from the root of the DOM tree, to have an algorithm determine the proper root of the website and then measure depth from that node.
Storing hashes like this allows you to compare with the server to see which sections you have match the server and which don't. Hopefully, if the dynamic stuff ends up in a different subtree than the static content, you'll be able to verify specific content as being correct. Often, when you want to verify an archived page, you couldn't care less what the footer of the website says, so if your footer doesn't match: whatever.
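
Something like the sketch below, assuming BeautifulSoup is available; it hashes the serialized subtree of every element above a depth cutoff that has element children, and mirrors the DOM shape so subtrees can be compared individually:

```python
# Depth-limited subtree hashing (sketch). bs4 is an assumed dependency;
# "page.html" is a placeholder file name.
import hashlib
from bs4 import BeautifulSoup
from bs4.element import Tag

def subtree_hashes(node, depth=0, max_depth=3):
    # Return a nested dict mirroring the DOM: {tag, hash, children}.
    if depth > max_depth or not isinstance(node, Tag):
        return None
    children = [c for c in node.children if isinstance(c, Tag)]
    if not children:  # cull leaf nodes, as suggested above
        return None
    return {
        "tag": node.name,
        "hash": hashlib.sha256(node.encode()).hexdigest(),
        "children": [h for c in children
                     if (h := subtree_hashes(c, depth + 1, max_depth))],
    }

soup = BeautifulSoup(open("page.html", "rb").read(), "html.parser")
tree = subtree_hashes(soup.html)
```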

I don't think picking and choosing which parts of the DOM tree to hash is going to work. The entire DOM tree is going to have to be hashed item by item, children and parents. The way I see this DOM tree walking going (websites are going to have different depths):

1. Client hashes every element in the DOM tree.
2. Client sends the server its DOM tree hash list.
3. Server hashes every element in its DOM tree.
4. Server removes conflicts (this will get fun; parents can't be removed just because children don't match).
5. Server sends the client back the new, filtered DOM tree hash list.
6. Client filters its DOM tree to match the server.
7. Client and server now have identical DOM trees, then download the page requisites, images, etc.
8. The server will still have to provide hashes for every file in the archive, because the page requisites may be different (ad images, etc.); this type of back and forth could happen again to verify/delete page requisites that don't match.
9. Eventually, with this back and forth, the client and server will have a matching hash or hashes, depending on whether it's a single file now being hashed with mismatched page requisites removed, or multiple file hashes (but the source html/DOM tree should always match).

Walking the DOM could turn into a serious mess, and I'm not sure how to go about it without back and forth between the client and server with hashes, but it could be done (rough sketch of the reconciliation step below).
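
One possible shape for step 4, assuming both sides produce hash trees like the earlier DOM sketch and (optimistically) that the two trees have the same shape. Nodes whose hashes match are verified wholesale; a mismatched node is kept only as "partial" if some child still matches:

```python
# Reconcile two DOM hash trees (sketch). Zipping children assumes identical
# tree shapes on both sides, which heavily dynamic pages will break.
def reconcile(client_node, server_node):
    if client_node is None or server_node is None:
        return None
    if client_node["hash"] == server_node["hash"]:
        # Whole subtree matches: verified, no need to descend further.
        return {"tag": client_node["tag"], "status": "verified",
                "hash": client_node["hash"], "children": []}
    # Parent mismatches: keep it only if some child subtree still agrees.
    children = [r for c, s in zip(client_node["children"], server_node["children"])
                if (r := reconcile(c, s)) is not None]
    return ({"tag": client_node["tag"], "status": "partial", "children": children}
            if children else None)
```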

A simpler alternative:
1. Server is given a URL by the client (no dedicated client in this case; it's just the browser).
2. Server creates the archive and the browser downloads it.
3. Server records the hash and deletes the archive.
4. Archive and hash now obviously match.
5. The archive can still be verified by the server later, but there is no download left that can be DMCA'd.

This would operate more like a caching proxy. You don't hear about proxy servers being hit with DMCAs.
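
A bare-bones sketch of that flow, with wget and tar standing in for the real archiver and a record() callback standing in for whatever stores the hash:

```python
# "Caching proxy" flow (sketch): fetch the page once, tar it, hash the bytes,
# hand them to the requester, keep only the hash. wget flags are one option,
# not a fixed choice; error handling is omitted.
import hashlib
import os
import subprocess
import tempfile

def archive_once(url, record):
    with tempfile.TemporaryDirectory() as tmp:
        site_dir = os.path.join(tmp, "site")
        tar_path = os.path.join(tmp, "archive.tar")
        subprocess.run(["wget", "--quiet", "--page-requisites", "--convert-links",
                        "--directory-prefix", site_dir, url], check=True)
        subprocess.run(["tar", "-cf", tar_path, "-C", site_dir, "."], check=True)
        with open(tar_path, "rb") as f:
            data = f.read()
    digest = hashlib.sha256(data).hexdigest()
    record(url, digest)   # only the hash survives; the temp dir is already gone
    return data, digest   # the one-time download handed back to the user
```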

I'm not sure which is better here. DOM walking seems more takedown-resistant and less centralized, but it really isn't. If you're sending whole DOM tree hashes back and forth to agree on a common tree, you might as well send the whole archive, and as far as centralization goes, the DOM walking option is still totally centralized.

The only question here is whether the one-time archive download is more DMCA-able than the hash back and forth between client and server, and whether or not DOM walking is going to produce a useful archive for heavily dynamic sites.

If you wanted to be lazy, you would not use a tree but just a list of all the text nodes, and store a hash of each one. You could also add support for images easily.

Can you even DMCA a hash? Technically you wouldn't even need to distribute these hashes to others. You could just have the server send back the bad hashes.

actually the simpler alternative, while it has the same DMCA risk as DOM walking, has an additional risk. it's going to have the same problem proxy servers have, only more so: someone tries to archive a site with illegal shit on it, and it's not just a proxy any more, it's packaging those files and offering a download. i could see that being a major problem.

i'm not sure, this is the only real source i have on an answer to that

lennartb.home.xs4all.nl/fingerprint.html
which is not exactly reputable.

sending back the bad hashes would work also. I think ideally what would happen is the client-server interaction would end up agreeing on everything and removing anything that disagrees; if the images are dynamic, they would be removed. this way the server isn't really giving anything out to DMCA, not even the URL structure of the content; it's only the hash.

the text-only mode could be even lazier: just strip all the html and hash the text. client and server just need to use the same method of stripping the html so the hashes match.

the problem with that is these javascript-heavy sites now insert text and build the entire dom on the fly (i.e. this cancer, reactjs.org/)
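
For what it's worth, a minimal sketch of the text-only extraction using the stdlib html.parser; both sides would have to use exactly these stripping rules or the hashes never line up, and anything javascript injects at runtime is simply not captured:

```python
# Text-only hashing (sketch): drop tags, scripts, and styles, normalize
# whitespace, hash what's left.
import hashlib
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def text_hash(html_bytes):
    p = TextOnly()
    p.feed(html_bytes.decode("utf-8", errors="replace"))
    p.close()
    return hashlib.sha256(" ".join(p.chunks).encode()).hexdigest()
```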

It's not illegal to have image hashes of CP. Many large websites use them to block image uploads of common CP automatically.

that supports the dom walking, but what if the site packages the entire archive and offers it for download, even if it was a one-time download? I think that would be a major problem, which kills the easy-mode alternative.

Please tell me you're not a native English speaker.

trusting the site is an issue, but there's nothing that can be done about that. people trust archive.is not to fuck with the content. the user already has the content in this case; all the site does is provide a 3rd party to verify that it came from the original site at that time. the only thing that can be done as far as trust goes is to make all the code GPL, but that doesn't prove a lot server side.

why do people think reddit spacing is the norm?
it's fucking unreadable

This is never going to work with javascript.

On every single test page I've run, just with wget, the html hash has been fucked due to javascript bloat. Unique IDs, timestamps, CSRF tokens; half of the internet now creates the DOM dynamically, or at least heavily tweaks it with javascript, and the javascript itself is generated dynamically. It's not just the javascript either; some of these pages are inserting unique IDs into the html in custom tags, which i'm sure are used later for tracking and javascript.

maybe the only way is text-only, at least for the html. on a side note, reddit pulls 140 files for 1 thread.

I believe user is pointing out OP's grammar

Attached: 1465503543634.png (521x243, 136.87K)

Sounds like a personal problem to me.

But having those hashes is proof that the site possessed CP. They're not getting only the hashes from the government like major sites do. Although it doesn't prove intent to get CP and the site would delete the content immediately after hashing (two things that would make courts lenient), it would probably be enough for a CIAnigger to shut down the site.

...

Implementing the hash check client-side would keep it from ever touching the site's servers, although a court could construe the client-side code as being part of the domain of the site owner.

That's true. Although OP said they wanted some centralized way to verify the hashes. I guess it's either have a central server prone to takedown or have decentralized verification prone to poisoning. It's poz all the way down.

Another way to think about this: just like you have reproducible builds, where the resulting executable should be identical, you have reproducible page archives, where one copy just has its hash in a central server for 3rd-party verification. Multiple clients that archive the same page at the same time should get the same archive.

Even with a bunch of back and forth between clients (or the server) to strip out differences, there's no guarantee it's going to catch all of them, even with the dom walking; two clients might agree but a 3rd, 4th, and 5th might not. Clients could also poison each other:
Client 1: I don't have this, or it's different.
Client 2 (or server): Okay, let's delete this then. Does it match now?
Client 1: It matches now.
(The deleted content was the important thing all along, now intentionally deleted.)
This could happen easily even with a text-only mode.

The only way to do this is going to be the simpler alternative:

The server generates the "official" version, offers a one-time download, and then maintains the hash for verification later. I don't think this is especially takedown-resistant. Offering copyrighted content for download once is the same as offering it for download a million times. If a site wanted to get butt-hurt, all it would take is one site with one lawyer and it's game over, unless a bunch of money was dumped into legal fees trying to fight it.

Google and archive seem to be running with the excuse that they are a "direct agent of a human user" which allows them to ignore robots.txt and I guess copyright?

The question then is: is the "simpler alternative", offering one-time downloads and hash verification, worth building?

Attached: archive_excuse.png (596x274 11.96 KB, 26.25K)

a client-side hash check, while keeping any form of verification, would require checking that the client hadn't been altered, which sounds like DRM. The client could act as a simple proxy for the server when it grabs the page, but that doesn't sound appealing at all.

Just drop it already, it is simply not going to work with non-static (shit) sites. Any workaround you can come up with will be too complex.

Maybe I'll just start saving pages to .TXT files with "lynx -dump" (possibly also with -nolist, if links aren't needed).

unless i'm wrong, trying to use SSL to verify this archive data, if it were saved encrypted, is also a dead end. you could record the client's pre-master or session keys, but it's pointless: the client ultimately generates the encryption keys, then uses the server's public cert to send them to the server. there's no way to verify that the encrypted data actually came from the server, just that the data was encrypted with that key.

Attached: deadend.png (993x1294, 168.63K)

content-based addressing is an old idea and already solves this. the IETF have been making some official, watered-down version of this over the last decade, and soon it will replace the web

no, blockchain isn't even applicable

This is an interesting side discussion, but it doesn't strictly invalidate the utility of OP's clever idea, even if you realize you'd only be able to hash the same archive once.

For instance:
This would allow something like archive.is to exist, too, by keeping the hash Db separate from a Db of full archives, which (depending on takedowns) could consist of anything from a normal website like archive.is to a cloud of torrents or a blockchain.

Attached: le reddit spacing meme.png (652x2245, 420.45K)

I think I am going to write some proof-of-concept code for the simpler alternative method; none of the code is revolutionary here, it'll probably literally use wget to pull the site archive.

I came up with a few extra options today:
-Instead of providing the archive, provide diff files
----Client hashes all binary files, sends the file listing and hashes to the server (this is going to require a custom client).
----Client sends the full text/html files to the server.
----Server checks the binary hashes for mismatches, missing files, extra files, etc.
----Server runs diff on the text/html files, generates diff and patch files.
----Server sends back the binary hash mismatch data, along with the text/html diff and patch files.
----Client creates a single tar archive with the client's original archive + mismatch file + diff/patch files, and hashes it.
----Server does the same; the hashes should match, since it now has the same data as the client.
----Server stores this hash for verification.
Advantages:
-The client can either patch the text/html files to generate a fucked copy or not; the archive should be verifiable either way.
-The server doesn't provide illegal content to the client in case the client is an asshole, which is going to happen.
-The server isn't providing the website's files directly, which seems like it should be less DMCA-able.
Disadvantages:
-Even if the hashes aren't, the diff files on the text/html source are probably considered a derivative work. The server only provides them during the exchange, so there's nothing to DMCA afterwards but the hashes, but during the exchange they might be able to claim there's copyright infringement going on. (A rough sketch of the server side of this exchange is below, after the second option.)

-Instead of providing hashes, sign the archive with GPG
Advantages:
-Hashes can't be DMCA'd because they don't exist; the verification isn't with the site itself, it's with a 3rd-party keyserver.
-If the site goes down, the archives can still be verified.
Disadvantages:
-Server must provide the full archive, signed. If the diff method above is used, it would have to return the full archive to the client instead of just the diffs/mismatch file.
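
A sketch of what the server side of the diff exchange might look like, using difflib; the request/response dict shapes are invented for illustration. Both sides would then tar the client's archive together with this response and hash that.

```python
# Server side of the diff exchange (sketch): compare the client's binary
# hashes and text files against the server's own fetch, return only diffs
# and a mismatch list, never the server's own files.
import difflib
import hashlib

def build_response(client_binary_hashes, client_text_files, server_files):
    # client_binary_hashes: {path: sha256}
    # client_text_files:    {path: str}   (full text/html the client sent)
    # server_files:         {path: bytes} (the server's own fetch)
    mismatched, missing, diffs = [], [], {}
    for path, data in server_files.items():
        if path in client_text_files:
            server_text = data.decode("utf-8", errors="replace")
            patch = "".join(difflib.unified_diff(
                client_text_files[path].splitlines(keepends=True),
                server_text.splitlines(keepends=True),
                fromfile="client/" + path, tofile="server/" + path))
            if patch:
                diffs[path] = patch
        elif path in client_binary_hashes:
            if client_binary_hashes[path] != hashlib.sha256(data).hexdigest():
                mismatched.append(path)
        else:
            missing.append(path)
    return {"diffs": diffs, "mismatched": mismatched, "missing": missing}
```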

How could copyright violation be said to occur during the window in which a server-side archiver would provide data? That would also be during the time when whatever's being archived is still publicly accessible (otherwise the server obviously couldn't see it). Has, for instance, a proxy server been prosecuted for some reason?

Also, I'm not seeing how a diff would be useful for the primary purpose of verifying that a shady random archive file is an authentic copy of something that a certain URL served at some point in time.

tangential, but mozilla is going to delete all non-quantum addons soon; maybe we can fix that somehow.

I'm not sure; I'm assuming the lawyers could come up with practically anything, and think of all the sites that have a ToS 30 pages long. The archive sites and google have lawyers who can smash bs immediately; whoever hosts this system likely will not. This isn't even taking fair use into account, or the lawyer-speak google and the archive sites use: "direct agent of a human user"

The goal of this system should be that even if it's forced to comply with a DMCA, the archive can still be verified. Complying could mean blocking future access to whatever, hopefully not removing the hash, though gpg would solve that problem anyway. That, and minimizing as much as possible what might be considered copyrighted to begin with: diffs and hash mismatches are less than a full archive of the site coming from the server.

I'm really not confident, though, that the extra back and forth between client and server to generate these diffs/binary hash mismatches would even be worth it. An archive a server sends out is clearly copyrighted content, but the diff/binary hash list could definitely be considered a derivative work, which is the exact same problem if anyone gets pissy. The big benefit here, I think, is with illegal content. It seems like it would be easier to defend the system if a user uploaded the illegal content themselves, as opposed to the site reaching out and grabbing it on behalf of the user; but at the same time, the server would be doing that anyway to generate the diffs, it just wouldn't be giving its copy back to the user, it would be sending the user's copy back. Maybe this distinction is pointless, I don't know.

The diff by itself wouldn't be. The diff + the user's original archive, combined into a single file and then hashed or signed with gpg, would. The purpose of the diff + binary mismatch information would be to avoid the site having to provide the archive at all; the users themselves would submit it. The server would still generate its own copy, but the user would only get back diffs and mismatches, and, with gpg signing, a signed copy of the archive they sent to the server.

The html/txt diffs and/or patch files, combined with the client's original html/txt, would reproduce the server's version. Binary files (images/videos) would simply not be provided; it would just be indicated that those files could not be verified due to a hash mismatch.

ya, i think this is beyond using another add-on. it's either going to work with no client side at all, in the case of the server providing the full signed/hashed archive, or it's going to require a dedicated client; it'll need more permissions than webext allows. I don't think a webext can modify files. You might be able to do it with native messaging, but at that point you still have to install a binary on the client's system, with the additional requirement of the webext being certified kosher by mozilla/google. It seems stupid at that point; the client side shouldn't require a browser webext front-end to operate in that case, it should just operate independently.

Again, this should be legally identical to a proxy, and I'm not aware of proxies ever having been held liable for the actions of their users.
I don't think so. This isn't even like a magnet URI, where it can be used to find a file copy, its only possible use is to verify a file's provenance.
Eh, it might be useful for some or other reason, like server bandwidth savings.
Ah, got it.

i worded that wrong. the diff would not be a hash. the diff would be an actual diff of the text/html, or a patch file; this is what i think would be a derivative work. the binaries would just be hashed, and i think there's strong support that that is not considered a derivative work, because no traces of the original remain, but the diff/patch file would have snippets of the original site.
hopefully that's a strong defense right there. i think this is where the archive sites' and google's use of the lingo "direct agent of a human user" comes from.

Nothing should be archived

why not?

fuck you

web.archive.org/web/20180328221121/https://8ch.net/tech/res/883963.html#q890152

I get it, you ideafag.

Problems that arise:
- the content might not be the same as the true original
- how do you verify it?
- are you imagining a crawler with torrent?
- who verifies the content, is it native or a certain server (like torrent trackers)?

this whole thing is reliant on a central trusted server.
i don't think it would happen with add-ons; this would be either a dedicated client (method 2) or a straight website (method 1).
-user1 downloads a gpg-signed archive (method 1) from the trusted server or another user.
-user1 generates the archive themselves, uploads it to the trusted server/user, and receives a gpg-signed archive back which includes their original archive + diffs from what the server retrieved (method 2).
-user1 shares their gpg-signed archive with user2.
-user2 verifies its authenticity with gpg.
it's up to user1 to share the archive however they want; the trusted server/user only provides it once. the whole point of this is to be resilient to takedowns.
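
The signing/verification step itself is just detached signatures; a minimal sketch shelling out to the gpg binary, with key IDs and file names as placeholders:

```python
# GPG sign/verify (sketch). The server signs in method 1; in method 2 it
# signs the combined archive it sends back to the user.
import subprocess

def sign_archive(archive_path, key_id):
    # Produces archive_path + ".asc" (ASCII-armored detached signature).
    subprocess.run(["gpg", "--batch", "--local-user", key_id,
                    "--armor", "--detach-sign", archive_path], check=True)

def verify_archive(archive_path, signature_path):
    # True if the signature checks out against a key already in the keyring.
    result = subprocess.run(["gpg", "--batch", "--verify",
                             signature_path, archive_path])
    return result.returncode == 0

# user2's side:
# ok = verify_archive("thread_archive.tar", "thread_archive.tar.asc")
```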

have you checked github.com/webrecorder/webrecorder

that's interesting. you can pretty much run that and sign it with gpg, and there's method 1. method 1 isn't that complicated to begin with though; this just does it differently. it's more bloat than I had planned, but it's good to look at, haven't seen that before.

my plan for this is to make it completely independent of the website.
url -> archiver -> archive file (method 1)
url + client archive -> archiver -> archive file (method 2)
how it gets the input and how it deals with the output should be outside the scope, and a separate project
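
As an interface, that could be as small as the sketch below; the two helpers are stand-ins for the wget/tar and diff steps sketched earlier, and input/output handling stays out of scope as described:

```python
# Archiver-core surface (sketch): two entry points, both returning an
# archive blob plus its hash. The helpers are placeholders, not real code.
import hashlib

def fetch_and_package(url: str) -> bytes:
    raise NotImplementedError("wget/tar step, see earlier sketches")

def merge_with_diffs(client_archive: bytes, server_archive: bytes) -> bytes:
    raise NotImplementedError("diff/mismatch step, see earlier sketches")

def archive_url(url: str):
    # Method 1: url -> archiver -> (archive bytes, sha256).
    blob = fetch_and_package(url)
    return blob, hashlib.sha256(blob).hexdigest()

def archive_with_client_copy(url: str, client_archive: bytes):
    # Method 2: url + client archive -> archiver -> (archive bytes, sha256).
    blob = merge_with_diffs(client_archive, fetch_and_package(url))
    return blob, hashlib.sha256(blob).hexdigest()
```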