Scraping the Internet Thread

We get scattered threads here and on other boards about creating an alternative Internet, since the current one is just a glorified surveillance apparatus controlled by cultural gatekeepers like Facebook.

I've proposed before that the way to begin building another proto-Internet is to scrape content from the current Internet on a mass, coordinated scale, organize it, and build our own informal networks for content sharing. My line of thought has gone way beyond this, into how such a system could monetize itself, finance its own physical infrastructure (mesh networks), attract its own content creators and become a proper Internet. But my idea starts with saying "fuck copyright, scrape everything that isn't nailed down, organize it and share it on decentralized, informal networks".

I'll gladly talk about the rest of what I have in mind if anyone's interested, but for now I'd like to network with anyone who's into data scraping. What software do you use, what do you scrape, share pastebins of what you've collected, etc.

Scraping anything is relevant. Not just the obvious stuff like youtube videos and image galleries, but even stuff we'd likely have no interest in, like food recipe blogs. Content is content. This thread is about how we can cannibalize the Internet in order to build our own.

SCRAPING SOFTWARE I USE (SO FAR)

Youtube
youtube-dl.org/

Tumblr
jzab.de/content/tumblthree

General
phantomjs.org/
casperjs.org/
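
To give n00bs a flavor of casperjs, here's a minimal sketch (untested; example.com is a stand-in for whatever you're scraping) that echoes a page's title and dumps every image URL on it:

var casper = require('casper').create();

casper.start('https://example.com/', function () {
    this.echo(this.getTitle());
});

casper.then(function () {
    // pull the src attribute off every img on the page
    var imgs = this.getElementsAttribute('img', 'src');
    require('utils').dump(imgs);
});

casper.run();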

COMMIE!!!!!!!!!!!!!!!!!1111111111111

Just use HTTrack

I just use the MHTML saving feature of my browser (QupZilla) to save all 8ch threads I read, all the Wikipedia articles I read, and miscellaneous other things. 2GB so far.
I've also downloaded most of the youtube videos I have watched, as well as entire channels that I find interesting. 500+GB so far.

KIKE!!!!!!!!!!!!!!!!!!!!1111111111111

scrapy.org

i only archive stuff that's interesting to me

thanks i have aids now

Nah I just want it scraped so I can import it into a database and categorize it & shit, not download all the css & everything.


Looks like the same kind of thing as phantomjs. I'll try it to find out how well they compare. I only just started coding in phantomjs/casperjs a few days ago.


Say hi to your wife and my kid.

Maybe you should focus on making your own Internet 2.0 before you try to scrape the entire existing one.

Yeah okay buddy I'll fund it with kickstarter. How much do you want to donate?

Nobody needs to scrape the entire existing Internet. There are exabytes of trash that nobody gives a shit about. I'm talking about scraping interesting, fresh content, not the video blogs & livejournals of angsty teenagers.

I'm talking about building a community based around scraping, curating and sharing content. Sharing it on darknet sites, sharing it with p2p software, and growing the community to the point where content creators would see it as a viable venue for hosting their own content.

Shit, I'll get started on that logo. How about a silhouette of two trannies molesting a baby? One with his thumb up the baby's asshole and the other with the baby's genitals in his mouth?

And it will be free as in muh gnu/freedom. How about we call it "The baby fucker"?

I'd like to interject for a moment. What you are referring to as "The baby fucker" is actually "The GNU/baby fucker".

nothing wrong with that

The thing would be to create more sites without tracking. However, the issue with this is that you need money to keep them running.

The other thing the world needs is honest ISPs.
Ones that give out less information about everyone's business.


No

I'm not a commie. I'm a pirate. I'm openly saying we should just outright steal every bit of content we can and then use it to build our own darknet sites. You think you're going to build a second Internet by not breaking any rules? Fuck that.

Python solutions are generally trash, though maybe selenium is decent. PhantomJS is better, and NightmareJS looks like it improves on Phantom tenfold, although I haven't used it yet.

How many of these things are there? Phantomjs, Casperjs, Spookyjs, Zombiejs and now Nightmarejs. How are they related to each other? All I know is phantomjs is a headless browser and casperjs is to phantomjs what jquery is to javascript.

Far too many. Such is the way of the node hipster. But Nightmare looks like it only uses about 1/10 of the syntax of Phantomjs so it's worth checking out if you have some simple scraping in mind.
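
To illustrate the syntax difference, a minimal Nightmare scrape looks something like this (untested sketch; assumes nightmare is installed via npm, example.com as a stand-in):

const Nightmare = require('nightmare');

Nightmare({ show: false })            // headless, no visible window
  .goto('https://example.com/')
  .evaluate(function () {
    return document.title;            // runs inside the page context
  })
  .end()
  .then(function (title) { console.log(title); })
  .catch(function (err) { console.error(err); });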

Why not preserve the presentation format? Does your method save a shitton of space or what? Also, how do you query your "database" and find out, say, "how many seasons in Psych"? How would that answer come back to you? In a pseudo-terminal or something?

how do i get chaturbate videos

refer to

Pipe chaturbate through streamlink to a media player and record.
camwhores has a lot of older recordings for free if you also want that.

So I'm just starting to cut my teeth on all these scraper solutions. From what I've gathered, it sounds like the combination of Electron + NightmareJS is one of the best.

Then there's either Scrapy or Mechanize + Beautiful Soup. These can't directly read AJAX-generated content, but I'm guessing that limitation comes with a trade-off of higher speed. So if I'm scraping static content these would be the way to go, while Electron + NightmareJS would be best for scraping sites that are generated via javascript. Does that sound about right?
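
For reference, the static approach exists on the Node side too. A sketch using request + cheerio (cheerio being a jQuery-style parser, roughly Node's answer to Beautiful Soup; both modules assumed installed, example.com a stand-in):

const request = require('request');
const cheerio = require('cheerio');

request('https://example.com/', function (err, res, body) {
  if (err) return console.error(err);
  var $ = cheerio.load(body);          // parses the raw HTML; no javascript runs
  $('a').each(function (i, el) {
    console.log($(el).attr('href'));   // dump every link on the page
  });
});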


The point of scraping is to extract the raw data. Then you can give it whatever presentation you want, or mash it up with data you've scraped from other websites.

Definitely.

SELECT MAX(season_number) FROM tv_shows WHERE name = 'Psych';

Whatever way I wanted. In my case it would display on my website after an SQL query.


Haw Haw Haw Haw Haw! Pee Pee Poo Poo!
Here, have a free youtube video youtube.com/watch?v=qNYNMC68kq4


Looks like somebody's too good to steal pornography off the Internet.

There are some places on the web that have been archiving various sites for years. You'll be able to get good ideas from these pros.

Archive Team - archiveteam.org
Internet Archive - archive.org

For example, WARC (WebARChive) seems to be a fairly standard format for saving things. Here's some info about the format and the programs which deal with it: archiveteam.org/index.php?title=WARC

WARC contains the data plus a few headers that record the original URL, the hash of the data, etc. The page informs me that wget supports WARC as well.
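
For illustration, the headers of a single response record look roughly like this (values made up), followed by the raw HTTP response itself:

WARC/1.0
WARC-Type: response
WARC-Target-URI: https://example.com/
WARC-Date: 2017-08-15T12:00:00Z
WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>
WARC-Payload-Digest: sha1:EXAMPLEDIGESTEXAMPLEDIGESTEXAMPL
Content-Type: application/http; msgtype=response
Content-Length: 1234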

I want something that can save the HTML after Javascript evaluation and all external resources as a static page. I haven't figured out how to do that with things like NightmareJS and similar, but I might just be dense.
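
The closest I've gotten is evaluating the DOM after the page loads and writing it out myself — a Nightmare sketch (untested; this gets the post-javascript HTML, but not the external resources):

const Nightmare = require('nightmare');
const fs = require('fs');

Nightmare({ show: false })
  .goto('https://example.com/')
  .wait(2000)                                    // crude: give page scripts time to run
  .evaluate(function () {
    return document.documentElement.outerHTML;   // the DOM as it stands after JS ran
  })
  .end()
  .then(function (html) { fs.writeFileSync('page.html', html); })
  .catch(function (err) { console.error(err); });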

>>>/oven/

stealing someone else's content and hosting it as your own?

that's like going into a store, stealing everything, and selling it in your own store while feeling completely entitled to do so

wew haven't seen one of these niggers on Holla Forums in a while
this shit is too retarded to be bait

Does WARC include the SSL transactions when a site supports HTTPS? Basically, I'm wondering if we could have a decentralized archive using IPFS and could prove non-tampering when a site supports SSL.
Obviously, there'd still be a timestamp issue. Unless WARC somehow includes a signed version of that as well?

kys if you're not trolling. there's a whole race of idiots who think this, and they need to be purged

Oh, so you won't mind posting your bank details and complete medical records here. After all, there's no such thing as private information.

Something that isn't physical can't be property, you nonce

Private information exists, but it sure as fuck isn't property.
And as soon as it's voluntarily distributed under copyright, it sure as fuck isn't private anymore.

Someone else's house can be your property tho

how is that even an argument? you must be autistic

exactly this, i couldn't even be bothered to explain it myself

pls respond

This question might not be appropriate here, but would it behoove us to delete individual facebook posts before deleting a facebook account? Or is it better not to spend days on end deleting individual posts and to just delete the cancer outright and be done with it?

It doesn't really matter either way. Facebook already has that information whether or not you take the time to delete individual posts. Now why would someone take the time to delete every individual post in their history? One possible reason is to send a message: I no longer want to be associated with Facebook, and I will take the effort to tell Facebook to remove the information one piece at a time. Whether an employee of Facebook ever sees that kind of message is something we will never know.

intellectual property rights != private property rights

They keep metrics on everything else, why not that too? If only there were a way to mass-delete, because this process takes forever for an old-fag like me.

What is scraping and how can I help?

Hi, sorry. I just used wget with the --warc-file option on Wikipedia (with SSL) and it doesn't look like that data (handshakes, certificates, etc.) is stored. All I know is that all of it is available with this command, but not in WARC format: curl --trace [file] [url].

Do you mean that we would have proof that there was no MITM attack or something? Hmm. You could take the output of the above curl command and sign it with a GPG key perhaps. Apart from that, I don't know.
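
One thing you can do, at least, is re-verify a mirrored record's payload against its WARC-Payload-Digest header (conventionally sha1 encoded in base32). It proves nothing about the original server, but it lets anyone check the copy wasn't modified after capture. A Node sketch, assuming the raw payload has already been extracted to payload.bin:

const crypto = require('crypto');
const fs = require('fs');

// hash the extracted payload
const digest = crypto.createHash('sha1')
  .update(fs.readFileSync('payload.bin'))
  .digest();

// base32-encode it (RFC 4648 alphabet); 160 bits is exactly 32 chars, no padding
const ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ234567';
var bits = 0, value = 0, out = '';
for (var i = 0; i < digest.length; i++) {
  value = (value << 8) | digest[i];
  bits += 8;
  while (bits >= 5) {
    out += ALPHABET[(value >>> (bits - 5)) & 31];
    bits -= 5;
  }
}

console.log('sha1:' + out);   // compare against the WARC-Payload-Digest value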

OP here. Just giving an update on what I've found. I've been searching for other scraping solutions besides the ones we've already talked about, since I found casperjs way too slow and nightmarejs has barely any tutorials, so I'm dead in the water trying to use it.

I found Navalia (github.com/joelgriffith/navalia), which automates Chrome and lets you open multiple tabs, unlike nightmare/casper/zombie/spooky, etc. But there's still a sparse supply of tutorials/examples to help n00bs get started.

However, Navalia is currently merging with yet another scraper called chromeless (github.com/graphcool/chromeless), so further development of Navalia has stopped. The merge is expected to take the next few weeks.

I haven't had a chance to dig deep into chromeless yet, but at first blush it looks fairly well documented and supported, & it has its own Slack community. I'll go sperg on their documentation later and see what I think of it.
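
From a quick skim of the chromeless README, basic usage looks something like this (untested; assumes chromeless is installed via npm and Chrome is available locally):

const { Chromeless } = require('chromeless');

async function run() {
  const chromeless = new Chromeless();

  const title = await chromeless
    .goto('https://example.com/')
    .evaluate(() => document.title);
  console.log(title);

  await chromeless.end();   // shut the Chrome session down
}

run().catch(console.error);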

Just mount siterips instead, much better content:work ratio

Most sites are already scraped. Focus on getting the siterips usable, for example by doing work on IPFS.

Examples?

CheKek'd dubs of a professional shitpost.
truth
dubs of confirmed truth
I thought Perl was the original language lauded for this kind of scraping?
(an example from 2003: perl.com/pub/2003/01/22/mechanize.html )
There have been so many since, jQuery being one of the most notable.

How does this fuck anything up, nigger monkey? It's just scraping some sites, you retard.
