Scraping the Internet Thread

We get scattered threads here and on other boards about creating an alternative Internet, since the current one is just a glorified surveillance apparatus controlled by cultural gatekeepers like Facebook.

I've proposed before that the way to begin building another proto-Internet is to scrape content from the current Internet on a mass, coordinated scale, organize it, and build our own informal networks for content sharing. My line of thought has gone way beyond this, into how such a system could monetize itself, finance its own physical infrastructure (mesh networks), attract its own content creators and become a proper Internet. But my idea starts with saying "fuck copyright, scrape everything that isn't nailed down, organize it and share it on decentralized, informal networks".

I'll gladly talk about the rest of what I have in mind if anyone's interested, but for now I'd like to network with anyone who's into data scraping. What software do you use, what do you scrape, share pastebins of what you've collected, etc.

Scraping anything is relevant. Not just the obvious stuff like youtube videos and image galleries, but even stuff we'd likely have no interest in, like food recipe blogs. Content is content. This thread is about how we can cannibalize the Internet in order to build our own.

SCRAPING SOFTWARE I USE (SO FAR)

Youtube
youtube-dl.org/

Tumblr
jzab.de/content/tumblthree

General
phantomjs.org/
casperjs.org/
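
To give n00bs a flavor of casperjs, here's a minimal sketch (untested; example.com is a stand-in for whatever you're scraping) that echoes a page's title and dumps every image URL on it:

var casper = require('casper').create();

casper.start('https://example.com/', function () {
    this.echo(this.getTitle());
});

casper.then(function () {
    // pull the src attribute off every img on the page
    var imgs = this.getElementsAttribute('img', 'src');
    require('utils').dump(imgs);
});

casper.run();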

COMMIE!!!!!!!!!!!!!!!!!1111111111111

Just use HTTrack

I just use the MHTML saving feature of my browser (QupZilla) to save all 8ch threads I read, all the Wikipedia articles I read, and miscellaneous other things. 2GB so far.
I've also downloaded most of the youtube videos I have watched, as well as entire channels that I find interesting. 500+GB so far.

KIKE!!!!!!!!!!!!!!!!!!!!1111111111111

scrapy.org

i only archive stuff that's interesting to me

thanks i have aids now

Nah I just want it scraped so I can import it into a database and categorize it & shit, not download all the css & everything.


Looks like the same kind of thing as phantomjs. I'll try it to find out how well they compare. I only just started coding in phantomjs/casperjs a few days ago.


Say hi to your wife and my kid.

Maybe you should focus on making your own Internet 2.0 before you try to scrape the entire existing one.

Yeah okay buddy I'll fund it with kickstarter. How much do you want to donate?

Nobody needs to scrape the entire existing Internet. There are exabytes of trash that nobody gives a shit about. I'm talking about scraping interesting, fresh content, not the video blogs & livejournals of angsty teenagers.

I'm talking about building a community based around scraping, curating and sharing content. Sharing it on darknet sites, sharing it with p2p software, and growing the community to the point where content creators would see it as a viable venue for hosting their own content.

Shit, I'll get started on that logo. How about a silhouette of two trannies molesting a baby? One with his thumb up the baby's asshole and the other with the baby's genitals in his mouth?

And it will be free as in muh gnu/freedom. How about we call it "The baby fucker"?

I'd like to interject for a moment. What you are referring to as "The baby fucker" is actually "The GNU/baby fucker".

nothing wrong with that

The thing would be to create more sites without tracking. However, the issue with this is that you need money to keep them running.

The other thing the world needs is honest ISPs.
Ones that give out less information about everyone's business.


No

I'm not a commie. I'm a pirate. I'm openly saying we should just outright steal every bit of content we can and then use it to build our own darknet sites. You think you're going to build a second Internet by not breaking any rules? Fuck that.

Python solutions are generally trash, though maybe selenium is decent. PhantomJS is better, and NightmareJS looks like it improves on Phantom tenfold, although I haven't used it yet.

How many of these things are there? Phantomjs, Casperjs, Spookyjs, Zombiejs and now Nightmarejs. How are they related to each other? All I know is phantomjs is a headless browser and casperjs is to phantomjs what jquery is to javascript.

Far too many. Such is the way of the node hipster. But Nightmare looks like it only uses about 1/10 of the syntax of Phantomjs so it's worth checking out if you have some simple scraping in mind.
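
To illustrate the syntax difference, a minimal Nightmare scrape looks something like this (untested sketch; assumes nightmare is installed via npm, example.com as a stand-in):

const Nightmare = require('nightmare');

Nightmare({ show: false })            // headless, no visible window
  .goto('https://example.com/')
  .evaluate(function () {
    return document.title;            // runs inside the page context
  })
  .end()
  .then(function (title) { console.log(title); })
  .catch(function (err) { console.error(err); });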

Why not preserve the presentation format? Does your method save a shitton of space or what? Also, how do you query your "database" and find out, say, "how many seasons in Psych"? How would that answer come back to you? In a pseudo-terminal or something?

how do i get chaturbate videos

refer to

Pipe chaturbate through streamlink to a media player and record.
camwhores has a lot of older recordings for free if you also want that.

So I'm just starting to cut my teeth on all these scraper solutions. From what I've gathered, it sounds like the combination of Electron + NightmareJS is one of the best.

Then there's either Scrapy or Mechanize + Beautiful Soup. These can't directly read AJAX-generated content, but I'm guessing that limitation comes with a trade-off of higher speed. So if I'm scraping static content these would be the way to go, while Electron + NightmareJS would be best for scraping sites that are generated via javascript. Does that sound about right?
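
For reference, the static approach exists on the Node side too. A sketch using request + cheerio (cheerio being a jQuery-style parser, roughly Node's answer to Beautiful Soup; both modules assumed installed, example.com a stand-in):

const request = require('request');
const cheerio = require('cheerio');

request('https://example.com/', function (err, res, body) {
  if (err) return console.error(err);
  var $ = cheerio.load(body);          // parses the raw HTML; no javascript runs
  $('a').each(function (i, el) {
    console.log($(el).attr('href'));   // dump every link on the page
  });
});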


The point of scraping is to extract the raw data. Then you can give it whatever presentation you want, or mash it up with data you've scraped from other websites.

Definitely.

SELECT MAX(season_number) FROM tv_shows WHERE name = 'Psych';

Whatever way I wanted. In my case it would display on my website after an SQL query.


Haw Haw Haw Haw Haw! Pee Pee Poo Poo!
Here, have a free youtube video youtube.com/watch?v=qNYNMC68kq4


Looks like somebody's too good to steal pornography off the Internet.

There are some places on the web that have been archiving various sites for years. You'll be able to get good ideas from these pros.

Archive Team - archiveteam.org
Internet Archive - archive.org

For example, WARC (WebARChive) seems to be a fairly standard format for saving things. Here's some info about the format and the programs which deal with it: archiveteam.org/index.php?title=WARC

WARC contains the data plus a few headers that record the original URL, the hash of the data, etc. The page informs me that wget supports WARC as well.
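
For illustration, the headers of a single response record look roughly like this (values made up), followed by the raw HTTP response itself:

WARC/1.0
WARC-Type: response
WARC-Target-URI: https://example.com/
WARC-Date: 2017-08-15T12:00:00Z
WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>
WARC-Payload-Digest: sha1:EXAMPLEDIGESTEXAMPLEDIGESTEXAMPL
Content-Type: application/http; msgtype=response
Content-Length: 1234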

I want something that can save the HTML after Javascript evaluation and all external resources as a static page. I haven't figured out how to do that with things like NightmareJS and similar, but I might just be dense.
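
The closest I've gotten is evaluating the DOM after the page loads and writing it out myself — a Nightmare sketch (untested; this gets the post-javascript HTML, but not the external resources):

const Nightmare = require('nightmare');
const fs = require('fs');

Nightmare({ show: false })
  .goto('https://example.com/')
  .wait(2000)                                    // crude: give page scripts time to run
  .evaluate(function () {
    return document.documentElement.outerHTML;   // the DOM as it stands after JS ran
  })
  .end()
  .then(function (html) { fs.writeFileSync('page.html', html); })
  .catch(function (err) { console.error(err); });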

>>>/oven/

stealing someone else's content and hosting it as your own?

that's like going into a store, stealing everything, and selling it in your own store while feeling completely entitled to do so

wew haven't seen one of these niggers on Holla Forums in a while
this shit is too retarded to be bait

Does WARC include the SSL transactions when a site supports HTTPS? Basically, I'm wondering if we could have a decentralized archive using IPFS and could prove non-tampering when a site supports SSL.
Obviously, there'd still be a timestamp issue. Unless WARC somehow includes a signed version of that as well?

kys if you're not trolling. there's a whole race of idiots who think this, and they need to be purged

Oh, so you won't mind posting your bank details and complete medical records here. After all, there's no such thing as private information.

Something that isn't physical can't be property, you nonce

Private information exists, but it sure as fuck isn't property.
And as soon as it's voluntarily distributed under copyright, it sure as fuck isn't private anymore.

Someone else's house can be your property tho

how is that even an argument? you must be autistic

exactly this, i couldn't even be bothered to explain it myself

pls respond

This question might not be appropriate here, but would it behoove us to delete individual facebook posts before deleting a facebook account? Or is it better not to spend days on end deleting individual posts and to just delete the cancer outright and be done with it?

It doesn't really matter either way. Facebook already has that information whether or not you take the time to delete individual posts. Now why would someone take the time to delete every individual post in their history? One possible reason is to send a message: I no longer want to be associated with Facebook, and I will take the effort to tell Facebook to remove the information one piece at a time. Whether an employee of Facebook ever sees that kind of message is something we will never know.

intellectual property rights != private property rights

They keep metrics on everything else, why not that too? If only there were a way to mass-delete, because this process takes forever for an old-fag like me.

What is scraping and how can I help?

Hi, sorry. I just used wget with the --warc-file option on Wikipedia (with SSL) and it doesn't look like that data (handshakes, certificates, etc.) is stored. All I know is that all of it is available with this command, but not in WARC format: curl --trace [file] [url].

Do you mean that we would have proof that there was no MITM attack or something? Hmm. You could take the output of the above curl command and sign it with a GPG key perhaps. Apart from that, I don't know.
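
One thing you can do, at least, is re-verify a mirrored record's payload against its WARC-Payload-Digest header (conventionally sha1 encoded in base32). It proves nothing about the original server, but it lets anyone check the copy wasn't modified after capture. A Node sketch, assuming the raw payload has already been extracted to payload.bin:

const crypto = require('crypto');
const fs = require('fs');

// hash the extracted payload
const digest = crypto.createHash('sha1')
  .update(fs.readFileSync('payload.bin'))
  .digest();

// base32-encode it (RFC 4648 alphabet); 160 bits is exactly 32 chars, no padding
const ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ234567';
var bits = 0, value = 0, out = '';
for (var i = 0; i < digest.length; i++) {
  value = (value << 8) | digest[i];
  bits += 8;
  while (bits >= 5) {
    out += ALPHABET[(value >>> (bits - 5)) & 31];
    bits -= 5;
  }
}

console.log('sha1:' + out);   // compare against the WARC-Payload-Digest value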

OP here. Just giving an update on what I've found. I've been searching for other scraping solutions besides the ones we've already talked about, since I found casperjs way too slow and nightmarejs has barely any tutorials, so I'm dead in the water trying to use it.

I found Navalia (github.com/joelgriffith/navalia), which automates Chrome and lets you open multiple tabs, unlike nightmare/casper/zombie/spooky, etc. But there's still a sparse supply of tutorials/examples to help n00bs get started.

However, Navalia is currently merging with yet another scraper called chromeless (github.com/graphcool/chromeless), so further development of Navalia has stopped. The merge is expected to take the next few weeks.

I haven't had a chance to dig deep into chromeless yet, but at first blush it looks fairly well documented and supported, & it has its own Slack community. I'll go sperg on their documentation later and see what I think of it.
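
From a quick skim of the chromeless README, basic usage looks something like this (untested; assumes chromeless is installed via npm and Chrome is available locally):

const { Chromeless } = require('chromeless');

async function run() {
  const chromeless = new Chromeless();

  const title = await chromeless
    .goto('https://example.com/')
    .evaluate(() => document.title);
  console.log(title);

  await chromeless.end();   // shut the Chrome session down
}

run().catch(console.error);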

Just mount siterips instead, much better content:work ratio

Most sites are already scraped. Focus on getting the siterips usable, for example by doing work on IPFS.

Examples?

CheKek'd dubs of a professional shitpost.
truth
dubs of confirmed truth
I thought Perl was the original language lauded for this kind of scraping?
(an example from 2003: perl.com/pub/2003/01/22/mechanize.html )
There have been so many since, jQuery being one of the most notable.

How does this fuck anything up, nigger monkey? It's just scraping some sites, you retard.
