We get scattered threads here and on other boards about creating an alternative Internet, since the current Internet is just a glorified surveillance apparatus controlled by cultural gatekeepers like Facebook.
I've proposed before that the way to begin building another proto-Internet is to scrape content from the current Internet on a massive, coordinated scale, organize it, and build our own informal networks for content sharing. My line of thought has gone way beyond this, into how such a system could monetize itself, finance its own physical infrastructure (mesh networks), attract its own content creators, and become a proper Internet. But my idea starts with saying "fuck copyright, scrape everything that isn't nailed down, organize it and share it on decentralized, informal networks".
I'll gladly talk about the rest of what I have in mind if anyone's interested, but for now I'd like to network with anyone who's into data scraping. What software do you use, what do you scrape, share pastebins of what you've collected, etc.
Scraping anything is relevant. Not just the obvious stuff like YouTube videos and image galleries, but even stuff we'd likely have no interest in, like food-recipe blogs. Content is content. This thread is about how we can cannibalize the Internet in order to build our own.
I just use the MHTML saving feature of my browser (QupZilla) to save all the 8ch threads I read, all the Wikipedia I read, and miscellaneous other things. 2 GB so far. I've also downloaded most of the YouTube videos I've watched, as well as entire channels that I find interesting. 500+ GB so far.
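For the channel rips, a tool like youtube-dl will grab everything in one go if you point it at the channel URL, e.g. youtube-dl -ci "https://www.youtube.com/user/EXAMPLE" (-c resumes interrupted downloads, -i skips videos that error out; the URL is just a placeholder). That's one way to do it, anyway.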
Nah I just want it scraped so I can import it into a database and categorize it & shit, not download all the css & everything.
Looks like the same kind of thing as PhantomJS. I'll try it to find out how well they compare. I only just started coding in PhantomJS/CasperJS a few days ago.
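For anyone else just starting, the basic CasperJS skeleton is about this much (standard hello-world shape from their docs; the URL is a placeholder):

var casper = require('casper').create();
casper.start('https://example.com/', function () {
    // runs once the page has loaded
    this.echo(this.getTitle());
});
casper.run();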
Say hi to your wife and my kid.
Brody Ramirez
Maybe you should focus on making your own Internet 2.0 before you try to scrape the entire existing one.
Jackson Adams
Yeah okay buddy I'll fund it with kickstarter. How much do you want to donate?
Nobody needs to scrape the entire existing Internet. There are exabytes of trash that nobody gives a shit about. I'm talking about scraping interesting, fresh content, not the video blogs & LiveJournals of angsty teenagers.
I'm talking about building a community based around scraping, curating and sharing content. Sharing it on darknet sites, sharing it with p2p software, and growing the community to the point where content creators would see it as a viable venue for hosting their own content.
Luis Thomas
Shit, I'll get started on that logo. How about a silhouette of two trannies molesting a baby? One with hal's thumb up the baby's asshole and the other with the baby's genitals in hal's mouth?
And it will be free as in muh gnu/freedom. How about we call it "The baby fucker"?
Jason Green
I'd like to interject for a moment. What you are referring to as "The baby fucker" is actually "The GNU/baby fucker".
Connor Rodriguez
nothing wrong with that
Ian Jones
The thing would be to create more sites without tracking. The issue with this, however, is that you need money to keep them running.
The other thing the world needs is honest ISPs, ones that give out less information about everyone's business.
No
Easton Murphy
I'm not a commie. I'm a pirate. I'm openly saying we should just outright steal every bit of content we can and then use it to build our own darknet sites. You think you're going to build a second Internet by not breaking any rules? Fuck that.
Ryan Lopez
Python solutions are generally trash; maybe Selenium is decent. PhantomJS is better, and NightmareJS looks like it improves on Phantom tenfold, although I haven't used it yet.
Xavier Clark
How many of these things are there? PhantomJS, CasperJS, SpookyJS, ZombieJS, and now NightmareJS. How are they related to each other? All I know is PhantomJS is a headless browser and CasperJS is to PhantomJS what jQuery is to JavaScript.
Jeremiah Jackson
Far too many. Such is the way of the Node hipster. But Nightmare looks like it only needs about 1/10 of the syntax of PhantomJS, so it's worth checking out if you have some simple scraping in mind.
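To illustrate, loading a page and grabbing its title in Nightmare is roughly this, going off their README (untested, so treat it as a sketch):

var Nightmare = require('nightmare');
Nightmare({ show: false })
  .goto('https://example.com/')
  .evaluate(function () {
    // runs inside the page after it loads
    return document.title;
  })
  .end()
  .then(function (title) {
    console.log(title);
  });

The whole Phantom boilerplate of page.open callbacks collapses into one chain.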
Wyatt Price
Why not preserve the presentation format? Does your method save a shitton of space or something? Also, how do you query your "database" and find out, say, "how many seasons in Psych"? How would that answer come back to you? In a pseudo-terminal or something?
Elijah Jenkins
how do I get chaturbate videos
William Baker
refer to
Xavier Roberts
Feed the Chaturbate URL to streamlink, then into a media player, and record. camwhores has a lot of older recordings for free if you also want those.
Jack Richardson
So I'm just starting to cut my teeth on all these scraper solutions. From what I've gathered, the combination of Electron + NightmareJS sounds like one of the best.
Then there's either Scrapy or Mechanize + Beautiful Soup. These can't directly read AJAX-generated content, but I'm guessing that limitation comes with a trade-off of higher speed. So if I'm scraping static content these would be the way to go, while Electron + NightmareJS would be best for sites that are generated via JavaScript. Does that sound about right?
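For the static case, I picture it being this bare in Node — no browser at all (untested sketch; the regex is just a stand-in for whatever you'd actually extract):

const https = require('https');

https.get('https://example.com/', function (res) {
  let html = '';
  res.on('data', function (chunk) { html += chunk; });
  res.on('end', function () {
    // no JavaScript ever runs here, so AJAX-built content never appears,
    // but there's also zero browser overhead
    const match = html.match(/<title>([^<]*)<\/title>/);
    console.log(match ? match[1] : 'no title found');
  });
});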
The point of scraping is to extract the raw data. Then you can give it whatever presentation you want, or mash it up with data you've scraped from other websites.
Definitely.
SELECT MAX(season_number) FROM tv_shows WHERE name = 'Psych';
Whatever way I wanted. In my case it would display on my website after an SQL query.
For example, WARC (WebARChive) seems to be a fairly standard format for saving things. Here's some info about the format and the programs which deal with it: archiveteam.org/index.php?title=WARC
WARC contains the data plus a few headers that record the original URL, the hash of the data, etc. The page informs me that wget supports WARC as well.
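Assuming wget behaves the way that page describes, archiving a site would be as simple as wget --mirror --warc-file=example https://example.com/ (the "example" part is a placeholder prefix; wget appends the .warc.gz extension itself).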
Cameron Lee
I want something that can save the HTML after JavaScript evaluation, plus all external resources, as a static page. I haven't figured out how to do that with things like NightmareJS and similar, but I might just be dense.
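The closest I've gotten is dumping the post-JavaScript DOM with Nightmare, something like the sketch below (untested), but it doesn't pull in the external resources, which is the part I'm stuck on:

const Nightmare = require('nightmare');
const fs = require('fs');

Nightmare({ show: false })
  .goto('https://example.com/')
  .wait('body')
  .evaluate(function () {
    // the DOM as it exists after scripts have run
    return document.documentElement.outerHTML;
  })
  .end()
  .then(function (html) {
    fs.writeFileSync('page.html', html);
  })
  .catch(function (err) {
    console.error(err);
  });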
Chase Wright
>>>/oven/
Lucas Jenkins
stealing someone else's content and hosting it as your own?
that's like going into a store, stealing everything, and selling it in your own store and feeling completely entitled to doing so
Chase Russell
wew haven't seen one of these niggers on Holla Forums in a while this shit is too retarded to be bait
Jaxon Hill
Does WARC include the SSL transactions when a site supports HTTPS? Basically, I'm wondering if we could have a decentralized archive using IPFS and prove the content wasn't tampered with when a site supports SSL. Obviously there'd still be a timestamp issue, unless WARC somehow includes a signed version of that as well?
Kevin Butler
...
Leo Moore
kys if you're not trolling. there's a whole race of idiots who think this who need to be purged
Joseph Gutierrez
Oh, so you won't mind posting your bank details and complete set of medical records here. After all, there's no such thing as private information.
Nathan Wood
Something that isn't physical can't be property, you nonce
Dominic Wilson
...
Carson Robinson
Private information exists, but it sure as fuck isn't property. And as soon as it's voluntarily distributed under copyright, it sure as fuck isn't private anymore.
Kayden Gonzalez
Someone else's house can be your property, tho
Julian Stewart
how is that even an argument? you must be autistic
exactly this, i couldn't even be bothered to explain it myself
William Turner
pls respond
John Nguyen
This question might not be appropriate here, but would it behoove us to delete individual Facebook posts before deleting a Facebook account? Or is it better not to spend days on end deleting individual posts and to just delete the cancer outright and be done with it?
Levi Campbell
It doesn't really matter either way. Facebook already has that information whether or not you take the time to delete individual posts. So why would someone take the time to delete every individual post in their history? One possible reason is to send a message: I no longer want to be associated with Facebook, and I will take the effort to tell Facebook to remove my information one piece at a time. Whether an employee of Facebook ever sees that kind of message is something we'll never know.
Adrian Hernandez
intellectual property rights != private property rights
Carson Taylor
They keep metrics on everything else, why not that too? If only there were a way to mass-delete because this process is forever for an old-fag like me.
Blake Diaz
...
Owen Allen
What is scraping and how can I help?
Xavier Ortiz
Hi, sorry. I just used wget with the --warc-file option on Wikipedia (with SSL) and it doesn't look like that data (handshakes, certificates, etc.) is stored. All I know is that all of it is available with this command, though not in WARC format: curl --trace [file] [url].
Do you mean that we would have proof that there was no MITM attack or something? Hmm. You could take the output of the above curl command and sign it with a GPG key perhaps. Apart from that, I don't know.
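For example, gpg --detach-sign trace.txt would drop a trace.txt.sig next to the trace that anyone holding the public key can verify (trace.txt being whatever filename you passed to curl).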
Isaiah Howard
OP here. Just giving an update on what I've found. I've been searching for other scraping solutions besides the ones we've already talked about, since I found CasperJS way too slow, and NightmareJS has barely any tutorials, so I'm dead in the water trying to use it.
I found Navalia github.com/joelgriffith/navalia which automates Chrome and lets you open multiple tabs, unlike Nightmare/Casper/Zombie/Spooky, etc. But there's still a sparse supply of tutorials/examples to help n00bs get started.
However Navalia is currently merging with yet another scraper called chromeless github.com/graphcool/chromeless so further development of Navalia has stopped. The merge is expected to take the next few weeks.
I haven't had a chance to dig deep into chromeless yet, but at first blush it looks fairly well documented and supported, and it has its own Slack community. I'll go sperg on their documentation later and see what I think of them.
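From skimming their README, basic usage looks something like this (copied shape, not tested, so take it as a sketch):

const { Chromeless } = require('chromeless');

async function run() {
  const chromeless = new Chromeless();

  // load a page and read the title out of the live DOM
  const title = await chromeless
    .goto('https://example.com/')
    .evaluate(() => document.title);

  console.log(title);
  await chromeless.end();
}

run().catch(console.error);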
Logan Barnes
...
Daniel Murphy
Just mount siterips instead, much better content:work ratio
Isaac Collins
Most sites are already scraped. Focus on getting the siterips usable, for example by doing work on IPFS.
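If I remember the CLI right, ipfs add -r siterip/ prints a hash for the whole directory that you can pass around, and anyone who runs ipfs pin add on that hash helps keep it hosted (siterip/ is just a placeholder path).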
Gabriel Rivera
Examples?
Nathan Perez
Check'd and kek'd, dubs of a professional shitpost. Truth'd, dubs of confirmed truth.
I thought Perl was the original language lauded for this kind of scraping? (An example from 2003: perl.com/pub/2003/01/22/mechanize.html.) There have been so many since, jQuery being one of the most notable.
Noah Walker
How does this fuck anything up, nigger monkey? It's just scraping some sites, you retard.