// ==UserScript==
// @name Eza's Tumblr Scrape
// @namespace https://inkbunny.net/ezalias
// @description Creates a new page showing just the images from any Tumblr
// @license Public domain / No rights reserved
// @include http://*/ezastumblrscrape*
// @include http://*.tumblr.com/
// @include http://*.tumblr.com/page/*
// @include http://*.tumblr.com/tagged/*
// @version 2.1
// @downloadURL none
// ==/UserScript==
// ------------------------------------ User Variables ------------------------------------ //
var number_of_pages_at_once = 10; // Default: 10. Don't go above 10 unless you've got oodles of RAM.
// ------------------------------------ User Variables ------------------------------------ //
// Because the cross-domain resource policy is just plain stupid (there is no reason I shouldn't be able to HTTP GET pages and files I can trivially load, or even execute without looking) this script creates an imaginary page at the relevant domain. Thankfully this does save a step: the user is not required to type in the domain they want to rip, because we can just check the URL in the address bar.
// Make it work, make it fast, make it pretty - in that order.
// TODO:
// http://officialbrostrider.tumblr.com/tagged/homestuck/ezastumblrscrape does some seriously wacky shit - even /ezastumblrscrape doesn't wholly work, and it shows some other URL for siteurl sometimes.
// check if http://eleanorappreciates.tumblr.com/post/57980902871/here-is-the-second-sketch-i-got-at-the-chisaii-3#dnr does the same thing, it has snow
// handling dosopod and other redirect-themes might require taking over /archive and directly imitating a theme - e.g. requesting unstyled posts like infinite-scrolling pages and /archive must do
// http://dosopod.tumblr.com/ doesn't redirect anymore, but nor do the images scrape. same problem with http://kavwoshin.tumblr.com/.
// For scrapewholesite, I could test many distant pages asynchronously, wait until they all come back, then search more finely between the last good and first bad page. (pointless, but interesting.)
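// A sketch of that probe (hypothetical, not wired in; binary_search_between stands in for the narrowing step):
/*
var probes = [ 100, 200, 400, 800 ]; // distant pages to test in parallel
var replies = 0;
probes.forEach( function( n ) {
	var xhr = new XMLHttpRequest();
	xhr.onreadystatechange = function() {
		if( xhr.readyState == 4 ) {
			if( xhr.responseText.indexOf( "/page/" + (n+1) ) > -1 ) { highest_known_page = Math.max( highest_known_page, n ); }
			else { lastpage = ( lastpage == 0 ) ? n : Math.min( lastpage, n ); }
			replies++;
			if( replies == probes.length ) { binary_search_between( highest_known_page, lastpage ); }
		}
	};
	xhr.open( "GET", site + "/page/" + n, true ); // true = asynchronous, so all probes fly at once
	xhr.send();
} );
*/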
// scrape for image links, but don't post links that are also images? this would require removing duplicate elements in url_array[n][1] - naively, O(N^2), but for small N. Duplicates hardly matter and happen anyway.
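// If it ever matters, a minimal dedup sketch (hypothetical, not wired in) - an object-as-set makes this O(N) instead of O(N^2):
/*
function remove_duplicate_urls( url_array ) { // url_array elements are 2-element [sort key, URL] arrays
	var seen = {};
	var unique = [];
	for( var i = 0; i < url_array.length; i++ ) {
		var url = url_array[i][1];
		if( ! seen[ url ] ) { seen[ url ] = true; unique.push( url_array[i] ); }
	}
	return unique;
}
*/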
// going one page at a time for /scrapewholesite is dog-slow, especially when there are more than a thousand pages. any balance between synchronicity and speed throttling is desirable.
// maybe grab several pages at once? no, damn, that doesn't work without explicit parallelism. I don't know if JS has that. really, I just need to get some timer function working.
// does setInterval work? the auto-repeat one, I mean.
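// setInterval is the auto-repeating one (setTimeout fires once), and it would do for throttling: fetch one page per tick instead of spamming requests. A sketch (hypothetical; handle_page stands in for the existing response handling):
/*
var queue = []; // page URLs waiting to be fetched
var timer = setInterval( function() {
	if( queue.length == 0 ) { clearInterval( timer ); return; } // all done - stop the timer
	var pageurl = queue.shift();
	var xhr = new XMLHttpRequest();
	xhr.onreadystatechange = function() {
		if( xhr.readyState == 4 ) { handle_page( pageurl, xhr.responseText ); }
	};
	xhr.open( "GET", pageurl, true ); // asynchronous; the interval itself is the throttle
	xhr.send();
}, 250 ); // one request every 250 ms
*/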
// http://ymirsgirlfriend.tumblr.com/ - http://kavwoshin.tumblr.com/ does some ugly nonsense where images go off the left side of the page. wtf.
// Infinite-scrolling tumblrs don't necessarily link to the next page. I need another metric - like if pages only contain the same images as last time. (Empty pages sometimes display foreground images.)
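// One way to phrase that test (hypothetical; the media-URL regex is a guess at Tumblr's image hosts): remember each page's image list and flag a page that adds nothing new.
/*
var previous_images = "";
function page_is_duplicate( responseText ) {
	var images = ( responseText.match( /http:\/\/\d+\.media\.tumblr\.com\/[^"']+/g ) || [] ).join( " " );
	var duplicate = ( images != "" && images == previous_images );
	previous_images = images;
	return duplicate;
}
*/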
// I'll have to add filtering as some kind of text input... and could potentially do multi-tag filtering, if I can reliably identify posts and/or reliably match tag definitions to images and image sets.
// This is a good feature for doing /scrapewholesite to get text links and then paging through them with fancy dynamic presentation nonsense. Also: duplicate elision.
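// A first cut at that filter (hypothetical): case-insensitive substring match against the collected URLs.
/*
function filter_urls( url_array, filtertext ) { // url_array elements are [sort key, URL] pairs
	return url_array.filter( function( pair ) {
		return pair[1].toLowerCase().indexOf( filtertext.toLowerCase() ) > -1;
	} );
}
*/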
// I'd love to do some multi-scrape stuff, e.g. scraping both /tagged/homestuck and /tagged/art, but that requires some communication between divs to avoid constant repetition.
// I should start handling "after the cut" situations somehow, e.g. http://banavalope.tumblr.com/post/72117644857/roachpatrol-punispompouspornpalace-happy-new
// Just grab any link to a specific /post. Occasional duplication is fine, we don't care.
// Wait, shit. Every theme should link to every page. And my banavalope example doesn't even link to the same domain, so we couldn't get it with raw AJAX. Meh. It's just a rare problem we'll have to ignore.
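// Sketch of that grab (hypothetical): pull every same-domain /post/ permalink out of a page's HTML, duplicates and all.
/*
function get_post_links( responseText ) {
	return responseText.match( /http:\/\/[a-z0-9-]+\.tumblr\.com\/post\/[0-9]+[^"']*/gi ) || [];
}
*/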
// http://askleijon.tumblr.com/ezastumblrscrape is a good example - lots of posts link to outside images (mostly imgur)
// I could detect "read more" links if I can identify the text-content portion of posts. links to /post/ pages are universal theme elements, but become special when they're something the user links to intentionally.
// for example: narcisso's dream on http://cute-blue.tumblr.com/ only shows the cover because the rest is behind a break.
// post-level detection would also be great because it'd let me filter out reblogs. fuck all these people with 1000-page tumblrs, shitty animated gifs in their theme, infinite scrolling, and NO FUCKING TAGS. looking at you, http://neuroticnick.tumblr.com/post/16618331343/oh-gamzee#dnr - you prick.
// Look into Tumblr Saviour to see how they handle and filter out text posts.
// Should non-image links wrapped around images be gathered at the top of each 'page' in the image browser? E.g. http://askNSFWcobaltsnow.tumblr.com links to Derpibooru a lot. Should those be listed before the images?
// I worry it'd pick up a lot of crap, like facebook and the main page.
// Using the Back button screws up the favicon. Weird.
// Ah fuck. onError might be linking to the wrong-size images again. That's an oooold bug making a comeback.
// It might just be blimpcat-art, actually. That site had serious problems before switching to /archive?.
// Consider going back to page-matching /thumbnail links for the "scrape" button. Single-tab weirdos may want to go back and forth from the page links on the embedded pages.
// http://playbunny.tumblr.com/archive?/tagged/homestuck/ezastumblrscrape/thumbnails photosets start with a non-image link.
// e.g. http://assets.tumblr.com/assets/styles/mobile_handset/mh_photoset.css?_v=50006b83e288948d62d0251c6d4a77fb#photoset#http://playbunny.tumblr.com/post/96067079633/photoset_iframe/playbunny/tumblr_nb21beiawY1qemks9/500/false
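// A filter sketch for that case (hypothetical): only keep direct image links, which drops e.g. that photoset CSS file.
/*
function looks_like_image( url ) {
	return /\.(jpe?g|png|gif)([?#]|$)/i.test( url );
}
*/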
// ------------------------------------ Script start, general setup ------------------------------------ //
// We need this global variable because GreaseMonkey still can't handle a button activating a function with parameters. It's used in scrape_whole_tumblr.
var lastpage = 0;
// First, determine if we're loading many pages and listing/embedding them, or if we're just adding a convenient button to that functionality.
if( window.location.href.indexOf( 'ezastumblrscrape' ) > -1 ) { // If we're scraping pages:
var subdomain = window.location.href.substring( window.location.href.indexOf( "/" ) + 2, window.location.href.indexOf( "." ) ); // everything between http:// and .tumblr.com
var title = document.title;
document.head.innerHTML = ""; // Delete CSS and content. We'll start with a blank page.
document.title = subdomain + " - " + title;
document.body.outerHTML = "
"; // This is our page. Top stuff, content, bottom stuff.
document.body.style.backgroundColor="#DDDDDD"; // Light grey BG to make image boundaries more obvious
var mydiv = document.getElementById( "maindiv" ); // I apologize for "mydiv." This script used to be a lot simpler.
mydiv.innerHTML = "Not all images are guaranteed to appear. "; // Thanks to Javascript's wacky accomodating nature, mydiv is global despite appearing in an if-else block.
if( window.location.href.indexOf( "/ezastumblrscrape/scrapewholesite" ) < 0 ) {
scrape_tumblr_pages(); // Ten pages of embedded images at a time
} else {
scrape_whole_tumblr(); // Images from every page, presented as text links
}
} else { // If it's just a normal Tumblr page, add a link to the appropriate /ezastumblrscrape URL
// Add link(s) to the standard "+Follow / Dashboard" nonsense. Before +Follow, I think - to avoid messing with users' muscle memory.
// Use regexes to make the last few @includes more concise. /, /page/x, and /tagged/x. (also treat /tagged/x/page/y.)
// The +Follow button is inside tumblr_controls, which is a script in an iframe, not part of the main page. It's a.btn.icon.follow. Can I mess with the DOM enough to add something beside it? The iframe's id is always "tumblr_controls", but its class seems variable. The source for it is http://assets.tumblr.com/assets/html/iframe/o.html plus some metadata after a question mark. Inside the iframe is html.dashboard-context.en_US, which contains the iframe's body, which contains the controls div - and inside that, finally, is the a.btn.icon.follow "Follow" link.
// So I need to locate that iframe and insert a sibling link next to Follow. Until that works, just drop a plain link at the top of the page:
var scrapeurl = window.location.href;
if( scrapeurl.charAt( scrapeurl.length - 1 ) != "/" ) { scrapeurl += "/"; } // make sure there's exactly one trailing slash
document.body.innerHTML += "<a href='" + scrapeurl + "ezastumblrscrape'>Scrape</a>"; // link to image-viewing version, preserving current tags
}
// ------------------------------------ Whole-site scraper ------------------------------------ //
function scrape_whole_tumblr() { // List every image URL from every page as a plain text link
var site = get_site( window.location.href ); // strip the /ezastumblrscrape nonsense, preserve /tagged/whatever
var highest_known_page = 0; // highest page number confirmed to exist; the searches below refine it
// Stopgap fix for finding the last page on infinite-scrolling pages with no "next" link:
var url = window.location.href;
if( url.substring( url.length-1, url.length ) == "/" ) { url = url.substring( 0, url.length - 1 ); } // If the URL has a trailing slash, chomp it.
var pages = parseInt( url.substring( url.lastIndexOf( "/" ) + 1 ) ); // everything past the last slash, which should hopefully be a number
if( ! isNaN( pages ) ) { lastpage = pages; } // if the URL ends something like /scrapewholesite/100, then we scrape 100 pages instead of just the two that the link-to-next-page test will find
// I should probably implement a box and button that redirect to whatever page the user chooses. Maybe it should only appear if the last apparent page is 2.
// Find out how many pages we need to scrape.
if( lastpage == 0 ) {
// What's the least number of fetches to estimate an upper bound? We don't need a specific "last page," but we don't want to grab a thousand extra pages that are empty.
// I expect the best approach is to binary-search down from a generous high estimate. E.g., double toward 1024, then creep back down toward 512.
// This would be pointless if I could figure out how some Tumblr themes know their own page count. E.g., some say "Page 1 of 24." Themes might get backend support.
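// Sketch of that shortcut (hypothetical; responseText is any fetched page's HTML): scrape the theme's own "Page X of Y" text when present.
/*
var counter = responseText.match( /Page\s+\d+\s+of\s+(\d+)/i ); // e.g. "Page 1 of 24"
if( counter ) { lastpage = parseInt( counter[1] ); }
*/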
mydiv.innerHTML += "Finding out how many pages are in " + site.substring( site.indexOf( '/' ) + 2 ) + ":
"; // Telling users what's going on. "site" has http(s):// removed for readability.
for( var n = 2; n > 0 && n < 10000; n *= 2 ) { // 10,000 is an arbitrary upper bound to prevent infinite loops, but some crazy-old Tumblrs might have more pages. This used to stop at 5000.
var siteurl = site + "/page/" + n;
var xmlhttp = new XMLHttpRequest();
xmlhttp.onreadystatechange=function() {
if( xmlhttp.readyState == 4 ) {
// Test for the presence of a link to the next page. Pages at or past the end will only link backwards. (Unfortunately, infinite-scrolling Tumblr themes won't link in either direction.)
if( xmlhttp.responseText.indexOf( "/page/" + (n+1) ) < 0 ) {
// instead of checking for link to next page (which doesn't work on infinite-scrolling-only themes), test if the page has the same content as the previous page?
// Images aren't sufficient for this because some pages will be 100% text posts. That bullshit is why I made this script to begin with.
mydiv.innerHTML += siteurl + " is too high. ";
lastpage = n;
n = -100; // break for(n) loop
} else {
mydiv.innerHTML += siteurl + " exists. ";
highest_known_page = n;
}
}
}
xmlhttp.open("GET", siteurl, false); // false=synchronous, for linear execution. There's no point checking if a page is the last one if we've already sent requests for the next dozen.
xmlhttp.send();
}
// Binary-search closer to the actual last page
while( lastpage > highest_known_page + 10 ) { // Arbitrary cutoff. We're just trying to minimize the range. A couple extra pages is reasonable; a hundred is excessive.
// 1000-page example Tumblr: http://neuroticnick.tumblr.com/
mydiv.innerHTML +="Narrowing down last page: ";
var middlepage = parseInt( (lastpage + highest_known_page) / 2 ); // integer midpoint between highest-known and too-high pages
var siteurl = site + "/page/" + middlepage;
var xmlhttp = new XMLHttpRequest();
xmlhttp.onreadystatechange=function() {
if( xmlhttp.readyState == 4 ) {
if( xmlhttp.responseText.indexOf( "/page/" + (middlepage+1) ) < 0 ) { // Test for the presence of a link to the next page.
mydiv.innerHTML += siteurl + " is high. ";
lastpage = middlepage;
} else {
mydiv.innerHTML += siteurl + " exists. ";
highest_known_page = middlepage;
}
}
}
xmlhttp.open("GET", siteurl, false); // false=synchronous, for linear execution. There's no point checking if a page is the last one if we've already sent requests for the next dozen.
xmlhttp.send();
}
}
// If we suspect infinite scrolling, or if someone silly has entered a negative number in the URL, tell them how to choose their own lastpage value:
if( lastpage < 3 ) {
mydiv.innerHTML += " Infinite-scrolling Tumblr themes will sometimes stop at 2 pages. " // Inform user
mydiv.innerHTML += "Click here to try 100 instead. "; // link to N-page version
}
mydiv.innerHTML += " Last page detected is " + lastpage + " or lower. ";
// Buttons within GreaseMonkey are a huge pain in the ass. I stole this from stackoverflow.com/questions/6480082/ - thanks, Brock Adams.
var button = document.createElement ('div');
button.innerHTML = '<input type="button" id="myButton" value="Scrape whole site" />';
button.setAttribute ( 'id', 'scrape_button' ); // I'm really not sure why this id and the above HTML id aren't the same property.
document.body.appendChild ( button ); // Add button (at the end is fine)
document.getElementById ("myButton").addEventListener ( "click", scrape_all_pages, false ); // Activate button - when clicked, it triggers scrape_all_pages()
}
function scrape_all_pages() { // Example code implies that this function /can/ take a parameter via the event listener, but I'm not sure how.
// First, remove the button. There's no reason it should be clickable twice.
var button = document.getElementById( "scrape_button" );
button.parentNode.removeChild( button ); // The DOM can only remove elements from a higher level. "Elements can't commit suicide, but infanticide is permitted."
// We need to find "site" again, because we can't pass it. Putting a button on the page and making it activate a GreaseMonkey function borders on magic. Adding parameters is straight-up dark sorcery.
var site = get_site( window.location.href );
mydiv.innerHTML += "Scraping page: "; // This makes it easier to track progress, since Firefox / Pale Moon only scrolls with the scroll wheel on pages which are still loading.
// Fetch all pages with content on them
for( var x = 1; x <= lastpage; x++ ) {
var siteurl = site + "/page/" + x;
mydiv.innerHTML += "Page " + x + " fetched ";
document.getElementById( 'pagecounter' ).innerHTML = " " + x;
if( x != lastpage ) {
asynchronous_fetch( siteurl, false ); // Sorry for the function spaghetti. Scrape_all_pages exists so a thousand pages aren't loaded in the background, and asynchronous_fetch prevents race conditions.
} else {
asynchronous_fetch( siteurl, true ); // Stop = true when we're on the last page. No idea if it accomplishes anything at this point. (Probably not, thanks to /archive?.)
document.getElementById( 'pagecounter' ).innerHTML += " Done. Use DownThemAll (or a similar plugin) to grab all these links.";
}
}
}
function asynchronous_fetch( siteurl, stop ) { // separated into another function to prevent a race condition (i.e. variables changing while an asynchronous request is in flight)
var xmlhttp = new XMLHttpRequest(); // AJAX object
xmlhttp.onreadystatechange = function() { // When the request returns, this anonymous function will trigger (repeatedly, for various stages of the reply)
if( xmlhttp.readyState == 4 ) { // Don't do anything until we're done downloading the page.
var thisdiv = document.getElementById( siteurl ); // identify the div we printed for this page
thisdiv.innerHTML += "" + siteurl + " "; // link to page, in case you want to see something in-situ (e.g. for proper sourcing)
var url_array = soft_scrape_page( xmlhttp.responseText ); // turn HTML dump into list of URLs
// Print URLs so DownThemAll (or similar) can grab them
for( var n = 0; n < url_array.length; n++ ) {
var image_url = url_array[n][1]; // url_array is an array of 2-element arrays. Each inner array goes [sort key, image URL].
thisdiv.innerHTML += "<a href='" + image_url + "'>" + image_url + "</a><br>"; // These URLs don't need to be links, but why not? Anyway, lusers don't know what "URL" means.
// Some images are automatically resized. We'll add the maximum-sized link in case it exists - unfortunately, there's no easy way to check if it exists. We'll just post both.
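// e.g. tumblr_abc123_500.jpg -> tumblr_abc123_1280.jpg (hypothetical filename; the _1280 version exists for most, but not all, posts)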
var fixed_url = "";
if( image_url.lastIndexOf( "_500." ) > -1 ) { fixed_url = image_url.replace( "_500.", "_1280." ); }
if( image_url.lastIndexOf( "_400." ) > -1 ) { fixed_url = image_url.replace( "_400.", "_1280." ); }
if( fixed_url.indexOf( "#photoset" ) > 0 ) { fixed_url = ""; } // Photoset image links are never resized. Tumblr did at least this one thing right.
if( fixed_url !== "" ) { thisdiv.innerHTML += "" + fixed_url + " "; }
}
if( stop ) { window.stop(); } // clumsy way to finish up for sites with uncooperative script bullshit that makes everything vanish after loading completes. (not sure this does anything anymore.)
}
}
xmlhttp.open("GET", siteurl, false); // This should probably be "true" for asynchronous at some point, but naively, it spams hundreds of GETs per second. This spider script shouldn't act like a DDOS.
xmlhttp.send();
}
// ------------------------------------ Multi-page scraper with embedded images ------------------------------------ //
// I should probably change page numbers such that ezastumblrscrape/100 starts at /page/100 and goes to /page/(100+numberofpages). Just ignore /page/0.
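// Sketch of that change (hypothetical; fetch_and_embed stands in for the existing per-page embedding code):
/*
var start_page = Math.max( pages, 1 ); // ignore /page/0
for( var p = start_page; p < start_page + number_of_pages_at_once; p++ ) {
	fetch_and_embed( site + "/page/" + p );
}
*/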
function scrape_tumblr_pages() { // Create a page where many images are displayed as densely as seems sensible
// Figure out which site we're scraping
var site = get_site( window.location.href ); // remove /archive? nonsense, remove /ezastumblrscrape nonsense, preserve /tagged/whatever, /chrono, etc.
var thumbnails = window.location.href.indexOf( "/ezastumblrscrape/thumbnails" ) > -1; // true if the "/thumbnails" flag is present, i.e. images should be resized. indexOf returns -1 when absent, so this gives a clean boolean instead of trusting integer truthiness.
// Figure out which pages we're showing, then add navigation links
var scrapetext = "/ezastumblrscrape/";
if( thumbnails ) { scrapetext += "thumbnails/"; } // Maintain whether or not we're in thumbnails mode
var archive_site = insert_archive( site ); // so we don't call this a dozen times in a row
var url = window.location.href;
if( url.substring( url.length-1, url.length ) == "/" ) { url = url.substring( 0, url.length - 1 ); } // If the URL has a trailing slash, chomp it.
var pages = parseInt( url.substring( url.lastIndexOf( "/" ) + 1 ) ); // everything past the last slash, which should hopefully be a number
if( isNaN( pages ) || pages == 1 ) { // If parseInt doesn't work (probably because the URL has no number after it) then just do the first set.
pages = 1;
mydiv.innerHTML += " Next >>>
" ; // No "Previous" link on page 1. Tumblr politely treats negative pages as page 1, but it's pointless.
document.getElementById("bottom_controls_div").innerHTML += " Next >>>
" ;
} else { // It's a testament to modern browsers that these brackets-as-arrows don't break the bracketed tags.
mydiv.innerHTML += " <<< Previous - Next >>>