// ==UserScript==
// @name        Eza's Tumblr Scrape
// @namespace   https://inkbunny.net/ezalias
// @description Creates a new page showing just the images from any Tumblr
// @license     Public domain / No rights reserved
// @include     http://*/ezastumblrscrape*
// @include     http://*.tumblr.com/
// @include     http://*.tumblr.com/page/*
// @include     http://*.tumblr.com/tagged/*
// @version     2.1
// @downloadURL none
// ==/UserScript==

// ------------------------------------ User Variables ------------------------------------ //

var number_of_pages_at_once = 10; // Default: 5. Don't go above 10 unless you've got oodles of RAM.

// ------------------------------------ User Variables ------------------------------------ //

// Because the cross-domain resource policy is just plain stupid (there is no reason I shouldn't be able to HTTP GET pages and files I can trivially load, or even execute without looking), this script creates an imaginary page at the relevant domain. Thankfully this does save a step: the user is not required to type in the domain they want to rip, because we can just check the URL in the address bar.

// Make it work, make it fast, make it pretty - in that order.

// TODO:
// http://officialbrostrider.tumblr.com/tagged/homestuck/ezastumblrscrape does some seriously wacky shit - even /ezastumblrscrape doesn't wholly work, and it shows some other URL for siteurl sometimes.
// Check if http://eleanorappreciates.tumblr.com/post/57980902871/here-is-the-second-sketch-i-got-at-the-chisaii-3#dnr does the same thing; it has snow.
// Handling dosopod and other redirect-themes might require taking over /archive and directly imitating a theme - e.g. requesting unstyled posts like infinite-scrolling pages and /archive must do.
// http://dosopod.tumblr.com/ doesn't redirect anymore, but nor do the images scrape. Same problem with http://kavwoshin.tumblr.com/.
// For scrapewholesite, I could test many distant pages asynchronously, wait until they all come back, then search more finely between the last good and first bad page. (Pointless, but interesting.)
// Scrape for image links, but don't post links that are also images? This would require removing duplicate elements in url_array[n][1] - naively, O(N^2), but for small N. Duplicates hardly matter and happen anyway.
// Going one page at a time for /scrapewholesite is dog-slow, especially when there are more than a thousand pages. Any balance between synchronicity and speed throttling is desirable.
// Maybe grab several pages at once? No, damn, that doesn't work without explicit parallelism. I don't know if JS has that. Really, I just need to get some timer function working.
// Does setInterval work? The auto-repeat one, I mean. (See the throttled-fetch sketch at the bottom of this file.)
// http://ymirsgirlfriend.tumblr.com/ - http://kavwoshin.tumblr.com/ does some ugly nonsense where images go off the left side of the page. WTF.
// Infinite-scrolling tumblrs don't necessarily link to the next page. I need another metric - like if pages only contain the same images as last time. (Empty pages sometimes display foreground images.)
// I'll have to add filtering as some kind of text input... and could potentially do multi-tag filtering, if I can reliably identify posts and/or reliably match tag definitions to images and image sets.
// This is a good feature for doing /scrapewholesite to get text links and then paging through them with fancy dynamic presentation nonsense. Also: duplicate elision.
// I'd love to do some multi-scrape stuff, e.g. scraping both /tagged/homestuck and /tagged/art, but that requires some communication between divs to avoid constant repetition.
// I should start handling "after the cut" situations somehow, e.g. http://banavalope.tumblr.com/post/72117644857/roachpatrol-punispompouspornpalace-happy-new
// Just grab any link to a specific /post. Occasional duplication is fine, we don't care.
// Wait, shit. Every theme should link to every page. And my banavalope example doesn't even link to the same domain, so we couldn't get it with raw AJAX. Meh. It's just a rare problem we'll have to ignore.
// http://askleijon.tumblr.com/ezastumblrscrape is a good example - lots of posts link to outside images (mostly imgur)
// I could detect "read more" links if I can identify the text-content portion of posts. Links to /post/ pages are universal theme elements, but become special when they're something the user links to intentionally.
// For example: narcisso's dream on http://cute-blue.tumblr.com/ only shows the cover because the rest is behind a break.
// Post-level detection would also be great because it'd let me filter out reblogs. Fuck all these people with 1000-page tumblrs, shitty animated gifs in their theme, infinite scrolling, and NO FUCKING TAGS. Looking at you, http://neuroticnick.tumblr.com/post/16618331343/oh-gamzee#dnr - you prick.
// Look into Tumblr Saviour to see how they handle and filter out text posts.
// Should non-image links from images be gathered at the top of each 'page' on the image browser? E.g. http://askNSFWcobaltsnow.tumblr.com links to Derpibooru a lot. Should those be listed before the images?
// I worry it'd pick up a lot of crap, like Facebook and the main page.
// Using the Back button screws up the favicon. Weird.
// Ah fuck. onError might be linking to the wrong-size images again. That's an oooold bug making a comeback.
// It might just be blimpcat-art, actually. That site had serious problems before switching to /archive?.
// Consider going back to page-matching /thumbnail links for the "scrape" button. Single-tab weirdos may want to go back and forth from the page links on the embedded pages.
// http://playbunny.tumblr.com/archive?/tagged/homestuck/ezastumblrscrape/thumbnails photosets start with a non-image link.
// e.g. http://assets.tumblr.com/assets/styles/mobile_handset/mh_photoset.css?_v=50006b83e288948d62d0251c6d4a77fb#photoset#http://playbunny.tumblr.com/post/96067079633/photoset_iframe/playbunny/tumblr_nb21beiawY1qemks9/500/false

// ------------------------------------ Script start, general setup ------------------------------------ //

// We need this global variable because GreaseMonkey still can't handle a button activating a function with parameters. It's used in scrape_whole_tumblr.
var lastpage = 0;

// First, determine if we're loading many pages and listing/embedding them, or if we're just adding a convenient button to that functionality.
if( window.location.href.indexOf( 'ezastumblrscrape' ) > -1 ) { // If we're scraping pages:
	var subdomain = window.location.href.substring( window.location.href.indexOf( "/" ) + 2, window.location.href.indexOf( "." ) ); // everything between http:// and .tumblr.com
	var title = document.title;
	document.head.innerHTML = ""; // Delete CSS and content. We'll start with a blank page.
	document.title = subdomain + " - " + title;
	document.body.outerHTML = "<body><div id='maindiv'></div><div id='bottom_controls_div'></div></body>"; // This is our page. Top stuff, content, bottom stuff. (Stand-in markup: just the two divs the rest of the script looks up by id.)
	document.body.style.backgroundColor="#DDDDDD"; // Light grey BG to make image boundaries more obvious
	var mydiv = document.getElementById( "maindiv" ); // I apologize for "mydiv." This script used to be a lot simpler.
	mydiv.innerHTML = "Not all images are guaranteed to appear.";
"; // Thanks to Javascript's wacky accomodating nature, mydiv is global despite appearing in an if-else block. if( window.location.href.indexOf( "/ezastumblrscrape/scrapewholesite" ) < 0 ) { scrape_tumblr_pages(); // Ten pages of embedded images at a time } else { scrape_whole_tumblr(); // Images from every page, presented as text links } } else { // If it's just a normal Tumblr page, add a link to the appropriate /ezastumblrscrape URL // Add link(s) to the standard "+Follow / Dashboard" nonsense. Before +Follow, I think - to avoid messing with users' muscle memory. // Use regexes to make the last few @includes more concise. /, /page/x, and /tagged/x. (also treat /tagged/x/page/y.) // The +Follow button is inside tumblr_controls, which is a script in an iframe, not part of the main page. It's a.btn.icon.follow. Can I mess with the DOM enough to add something beside it? The iframe's id is always "tumblr_controls", but its class seems variable. The source for it is http://assets.tumblr.com/assets/html/iframe/o.html plus some metadata after a question mark. Inside the iframe is html.dashboard-context.en_US (i.e. ), which contains , which contains
. Inside that, finally, is Follow . // So I need to locate
and insert some right at the start. Presumably the link classes introduce the icons through CSS or something. // This is currently beyond my ability to dick with JS through a script in a plugin. Let's kludge it for immediate usability. // kludge by Ivan - http://userscripts-mirror.org/scripts/review/65725.html url = insert_archive( window.location.href ) + "/ezastumblrscrape/scrapewholesite"; // var controls_iframe = document.getElementById( "tumblr_controls" ); // var controls_div = controls_iframe.getElementById( "iframe_controls" ); // controls_div.innerHTML = "Foobar " + controls_div.innerHTML; // debug // Don't clean this up. It's not permanent. var eLink = document.createElement("a"); eLink.setAttribute("id","edit_link"); eLink.setAttribute("style","position:absolute;top:26px;right:2px;padding:2px 0 0;width:50px;height:18px;display:block;overflow:hidden;-moz-border-radius:3px;background:#777;color:#fff;font-size:8pt;text-decoration:none;font-weight:bold;text-align:center;line-height:12pt;"); eLink.setAttribute("href", url); eLink.appendChild(document.createTextNode("Scrape")); var elBody = document.getElementsByTagName("body")[0]; elBody.appendChild(eLink); } // ------------------------------------ Whole-site scraper for use with DownThemAll ------------------------------------ // // Monolithic scrape-whole-site function, recreating the original intent (before I added pages and made it a glorified multipage image browser) // I still can't determine the existence of _1280 images without downloading them entirely, so there will be some different-size duplicates. Better too much than not enough. // So for archiving, I need some kind of sister Perl script that goes 'foreach filename containing _500, if (regex _1280) exists, delete this _500 file.' function scrape_whole_tumblr() { var highest_known_page = 0; var site = get_site( window.location.href ); mydiv.innerHTML += "

Browse images


"; // link to image-viewing version, preserving current tags // Stopgap fix for finding the last page on infinite-scrolling pages with no "next" link: var url = window.location.href; if( url.substring( url.length-1, url.length ) == "/" ) { url = url.substring( 0, url.length - 1 ); } // If the URL has a trailing slash, chomp it. var pages = parseInt( url.substring( url.lastIndexOf( "/" ) + 1 ) ); // everything past the last slash, which should hopefully be a number if( ! isNaN( pages ) ) { lastpage = pages; } // if the URL ends something like /scrapewholesite/100, then we scrape 100 pages instead of just the two that the link-to-next-page test will find // I should probably implement a box and button that redirect to whatever page the user chooses. Maybe it should only appear if the last apparent page is 2. // Find out how many pages we need to scrape. if( lastpage == 0 ) { // What's the least number of fetches to estimate an upper bound? We don't need a specific "last page," but we don't want to grab a thousand extra pages that are empty. // I expect the best approach is to binary-search down from a generous high estimate. E.g., double toward 1024, then creep back down toward 512. // This would be pointless if I could figure out how some Tumblr themes know their own page count. E.g., some say "Page 1 of 24." Themes might get backend support. mydiv.innerHTML += "Finding out how many pages are in " + site.substring( site.indexOf( '/' ) + 2 ) + ":

"; // Telling users what's going on. "site" has http(s):// removed for readability. for( var n = 2; n > 0 && n < 10000; n *= 2 ) { // 10,000 is an arbitrary upper bound to prevent infinite loops, but some crazy-old Tumblrs might have more pages. This used to stop at 5000. var siteurl = site + "/page/" + n; var xmlhttp = new XMLHttpRequest(); xmlhttp.onreadystatechange=function() { if( xmlhttp.readyState == 4 ) { // Test for the presence of a link to the next page. Pages at or past the end will only link backwards. (Unfortunately, infinite-scrolling Tumblr themes won't link in either direction.) if( xmlhttp.responseText.indexOf( "/page/" + (n+1) ) < 0 ) { // instead of checking for link to next page (which doesn't work on infinite-scrolling-only themes), test if the page has the same content as the previous page? // Images aren't sufficient for this because some pages will be 100% text posts. That bullshit is why I made this script to begin with. mydiv.innerHTML += siteurl + " is too high.
"; lastpage = n; n = -100; // break for(n) loop } else { mydiv.innerHTML += siteurl + " exists.
"; highest_known_page = n; } } } xmlhttp.open("GET", siteurl, false); // false=synchronous, for linear execution. There's no point checking if a page is the last one if we've already sent requests for the next dozen. xmlhttp.send(); } // Binary-search closer to the actual last page while( lastpage > highest_known_page + 10 ) { // Arbitrary cutoff. We're just trying to minimize the range. A couple extra pages is reasonable; a hundred is excessive. // 1000-page example Tumblr: http://neuroticnick.tumblr.com/ mydiv.innerHTML +="Narrowing down last page: "; var middlepage = parseInt( (lastpage + highest_known_page) / 2 ); // integer midpoint between highest-known and too-high pages var siteurl = site + "/page/" + middlepage; var xmlhttp = new XMLHttpRequest(); xmlhttp.onreadystatechange=function() { if( xmlhttp.readyState == 4 ) { if( xmlhttp.responseText.indexOf( "/page/" + (middlepage+1) ) < 0 ) { // Test for the presence of a link to the next page. mydiv.innerHTML += siteurl + " is high.
"; lastpage = middlepage; } else { mydiv.innerHTML += siteurl + " exists.
"; highest_known_page = middlepage; } } } xmlhttp.open("GET", siteurl, false); // false=synchronous, for linear execution. There's no point checking if a page is the last one if we've already sent requests for the next dozen. xmlhttp.send(); } } // If we suspect infinite scrolling, or if someone silly has entered a negative number in the URL, tell them how to choose their own lastpage value: if( lastpage < 3 ) { mydiv.innerHTML += "
Infinite-scrolling Tumblr themes will sometimes stop at 2 pages. " // Inform user mydiv.innerHTML += "Click here to try 100 instead.
"; // link to N-page version } mydiv.innerHTML += "
Last page detected is " + lastpage + " or lower.
"; // Buttons within GreaseMonkey are a huge pain in the ass. I stole this from stackoverflow.com/questions/6480082/ - thanks, Brock Adams. var button = document.createElement ('div'); button.innerHTML = ''; button.setAttribute ( 'id', 'scrape_button' ); // I'm really not sure why this id and the above HTML id aren't the same property. document.body.appendChild ( button ); // Add button (at the end is fine) document.getElementById ("myButton").addEventListener ( "click", scrape_all_pages, false ); // Activate button - when clicked, it triggers scrape_all_pages() } function scrape_all_pages() { // Example code implies that this function /can/ take a parameter via the event listener, but I'm not sure how. // First, remove the button. There's no reason it should be clickable twice. var button = document.getElementById( "scrape_button" ); button.parentNode.removeChild( button ); // The DOM can only remove elements from a higher level. "Elements can't commit suicide, but infanticide is permitted." // We need to find "site" again, because we can't pass it. Putting a button on the page and making it activate a GreaseMonkey function borders on magic. Adding parameters is straight-up dark sorcery. var site = get_site( window.location.href ); mydiv.innerHTML += "Scraping page:

"; // This makes it easier to track progress, since Firefox / Pale Moon only scrolls with the scroll wheel on pages which are still loading. // Fetch all pages with content on them for( var x = 1; x <= lastpage; x++ ) { var siteurl = site + "/page/" + x; mydiv.innerHTML += "Page " + x + " fetched
"; document.getElementById( 'pagecounter' ).innerHTML = " " + x; if( x != lastpage ) { asynchronous_fetch( siteurl, false ); // Sorry for the function spaghetti. Scrape_all_pages exists so a thousand pages aren't loaded in the background, and asynchronous_fetch prevents race conditions. } else { asynchronous_fetch( siteurl, true ); // Stop = true when we're on the last page. No idea if it accomplishes anything at this point. (Probably not, thanks to /archive?. document.getElementById( 'pagecounter' ).innerHTML += "
Done. Use DownThemAll (or a similar plugin) to grab all these links.";
		}
	}
}

function asynchronous_fetch( siteurl, stop ) { // Separated into another function to prevent a race condition (i.e. variables changing while the asynchronous request is in flight)
	var xmlhttp = new XMLHttpRequest(); // AJAX object
	xmlhttp.onreadystatechange = function() { // When the request returns, this anonymous function will trigger (repeatedly, for various stages of the reply)
		if( xmlhttp.readyState == 4 ) { // Don't do anything until we're done downloading the page.
			var thisdiv = document.getElementById( siteurl ); // identify the div we printed for this page
			// The exact markup below is a best guess: a plain link back to the fetched page.
			thisdiv.innerHTML += "<a href='" + siteurl + "'>" + siteurl + "</a>
"; // link to page, in case you want to see something in-situ (e.g. for proper sourcing) var url_array = soft_scrape_page( xmlhttp.responseText ); // turn HTML dump into list of URLs // Print URLs so DownThemAll (or similar) can grab them for( var n = 0; n < url_array.length; n++ ) { var image_url = url_array[n][1]; // url_array is an array of 2-element arrays. each inner array goes . thisdiv.innerHTML += "" + image_url + "
"; // These URLs don't need to be links, but why not? Anyway, lusers don't know what "URL" means. // Some images are automatically resized. We'll add the maximum-sized link in case it exists - unfortunately, there's no easy way to check if it exists. We'll just post both. var fixed_url = ""; if( image_url.lastIndexOf( "_500." ) > -1 ) { fixed_url = image_url.replace( "_500.", "_1280." ); } if( image_url.lastIndexOf( "_400." ) > -1 ) { fixed_url = image_url.replace( "_400.", "_1280." ); } if( fixed_url.indexOf( "#photoset" ) > 0 ) { fixed_url = ""; } // Photoset image links are never resized. Tumblr did at least this one thing right. if( fixed_url !== "" ) { thisdiv.innerHTML += "" + fixed_url + "
"; } if( stop ) { window.stop(); } // clumsy way to finish up for sites with uncooperative script bullshit that makes everything vanish after loading completes. (not sure this does anything anymore.) } } } xmlhttp.open("GET", siteurl, false); // This should probably be "true" for asynchronous at some point, but naively, it spams hundreds of GETs per second. This spider script shouldn't act like a DDOS. xmlhttp.send(); } // ------------------------------------ Multi-page scraper with embedded images ------------------------------------ // // I should probably change page numbers such that ezastumblrscrap/100 starts at /page/100 and goes to /page/(100+numberofpages). Just ignore /page/0. function scrape_tumblr_pages() { // Create a page where many images are displayed as densely as seems sensible // Figure out which site we're scraping var site = get_site( window.location.href ); // remove /archive? nonsense, remove /ezastumblrscrape nonsense, preserve /tagged/whatever, /chrono, etc. var thumbnails = window.location.href.indexOf( "/ezastumblrscrape/thumbnails" ); // look for "/thumbnails" flag to determine whether images get resized or not if( thumbnails > 0 ) { thumbnails = true; } else { thumbnails = false; } // Simplify to true/false. Lord only knows what JS's truth table looks like for integers. // Figure out which pages we're showing, then add navigation links var scrapetext = "/ezastumblrscrape/"; if( thumbnails ) { scrapetext += "thumbnails/"; } // Maintain whether or not we're in thumbnails mode var archive_site = insert_archive( site ); // so we don't call this a dozen times in a row var url = window.location.href; if( url.substring( url.length-1, url.length ) == "/" ) { url = url.substring( 0, url.length - 1 ); } // If the URL has a trailing slash, chomp it. var pages = parseInt( url.substring( url.lastIndexOf( "/" ) + 1 ) ); // everything past the last slash, which should hopefully be a number if( isNaN( pages ) || pages == 1 ) { // If parseInt doesn't work (probably because the URL has no number after it) then just do the first set. pages = 1; mydiv.innerHTML += "
Next >>>

" ; // No "Previous" link on page 1. Tumblr politely treats negative pages as page 1, but it's pointless. document.getElementById("bottom_controls_div").innerHTML += "

Next >>>

" ; } else { // It's a testament to modern browsers that these brackets-as-arrows don't break the bracketed tags. mydiv.innerHTML += "
<<< Previous - Next >>>

" ; document.getElementById("bottom_controls_div").innerHTML += "

<<< Previous - "; document.getElementById("bottom_controls_div").innerHTML += "Next >>>

" ; } // Link to the thumbnail page or full-size-image page as appropriate if( thumbnails ) { mydiv.innerHTML += "Switch to full-sized images
"; } else { mydiv.innerHTML += "Switch to thumbnails
"; } // Grab several pages and extract/embed images. var firstpage = (((pages-1) * number_of_pages_at_once) + 1); // so e.g. 1=1, 2=11, 3=21 for n=10 var lastpage = firstpage + number_of_pages_at_once; // due to using < instead of <=, we don't actually load it, so it's not technically the "last page," but meh for( x = firstpage; x < lastpage; x++ ) { var siteurl = site + "/page/" + x; mydiv.innerHTML += "
Page " + x + " fetched
"; // TODO: Sanitize the URL here and in fetch_page. It's just a unique ID. fetch_page( siteurl, mydiv, thumbnails ); // I'd rather do this right here, but unless the whole AJAX mess is inside its own function, matching a responseText to its siteurl is fucking intractable. } } function fetch_page( siteurl, mydiv, thumbnails ) { // Grab a page, scrape its image URLs, and embed them for easy browsing var xmlhttp = new XMLHttpRequest(); // AJAX object xmlhttp.onreadystatechange = function() { // When the request returns, this anonymous function will trigger (repeatedly, for various stages of the reply) if( xmlhttp.readyState == 4 ) { // Don't do anything until we're done downloading the page. var thisdiv = document.getElementById( siteurl ); // identify the div we printed for this page // TODO: Sanitize, as above. Code execution through this niche script is unlikely, but why keep it possible? thisdiv.innerHTML += "" + siteurl + "
"; // link to page, in case you want to see something in-situ (e.g. for proper sourcing) var url_array = soft_scrape_page( xmlhttp.responseText ); // turn HTML dump into list of URLs // Embed high-res images to be seen, clicked, and saved for( var n = 0; n < url_array.length; n++ ) { var image_url = url_array[n][1]; // For images which might have been automatically resized, assume the highest resolution exists, and change the URL accordingly. var fixed_url = ""; if( image_url.lastIndexOf( "_500." ) > -1 ) { fixed_url = image_url.replace( "_500.", "_1280." ); } if( image_url.lastIndexOf( "_400." ) > -1 ) { fixed_url = image_url.replace( "_400.", "_1280." ); } if( image_url.lastIndexOf( "_250." ) > -1 ) { fixed_url = image_url.replace( "_250.", "_1280." ); } if( image_url.lastIndexOf( "_100." ) > -1 ) { fixed_url = image_url.replace( "_100.", "_1280." ); } if( fixed_url.indexOf( "#photoset" ) > 0 ) { fixed_url = ""; } // Photosets always link to the highest resolution available. if( fixed_url !== "" ) { image_url = fixed_url; } // This clunky function looks for a lower-res image if the high-res version doesn't exist. var on_error = 'if(this.src.indexOf("_1280")>0){this.src=this.src.replace("_1280","_500");}'; // Swap 1280 for 500 on_error += 'else if(this.src.indexOf("_500")>0){this.src=this.src.replace("_500","_400");}'; // Or swap 500 for 400 on_error += 'else if(this.src.indexOf("_400")>0){this.src=this.src.replace("_400","_250");}'; // Or swap 400 for 250 on_error += 'else{this.src=this.src.replace("_250","_100");this.onerror=null;}'; // Or swap 250 for 100, then give up on_error += 'document.getElementById("' + image_url + '").href=this.src;'; // Link the image to itself, regardless of size // Embed images (linked to themselves) and link to photosets if( image_url.indexOf( "#" ) < 0 ) { // if it's just an image, then embed that image, linked to itself if( thumbnails ) { thisdiv.innerHTML += " "; } else { thisdiv.innerHTML += " "; } } else { // but if it's an image from a photoset, also print the photoset link. (is on_error necessary here? these images are already high-res. I guess it's an unintrusive fallback.) var photoset_url = image_url.substring( image_url.lastIndexOf( "#" ) + 1 ); // separate everything past the last hash - it's like http://tumblr.com/image#photoset#http://tumblr.com/photoset_iframe if( photoset_url.substring( 0, 4) == "http" ) { thisdiv.innerHTML += " Set:"; } // if the #photoset tag is followed by an #http URL, link the URL if ( thumbnails ) { thisdiv.innerHTML += "(Wait for image) "; } else { thisdiv.innerHTML += "(Image) "; } } } } } xmlhttp.open("GET", siteurl, true); // True = asynchronous. Finally got the damn thing to work! It's a right bitch to do in an inline function. JS scopes are screwy as hell. xmlhttp.send(); } // ------------------------------------ Universal page-scraping function (and other helped functions) ------------------------------------ // // This scrapes all embedded images, iframe photosets, and linked image files into an array. Including all content is a work in progress. function soft_scrape_page( html_copy ) { var url_array = new Array(); // look for tags, isolate src URLs var string_counter = 0; // this is what we'll use instead of copying and scraping everything. indexOf( "thing", string_counter ). while( html_copy.indexOf( ' -1 ) { // For each tag in the page's HTML // String_counter must ALWAYS be higher at the end of this loop than the beginning, because otherwise, while() fucks us. 
In fact, let's enforce that: // Firefox is aggravatingly susceptible to freezing for infinite loops. In a sandbox! I hope it's because GM is a plugin, because otherwise, yeesh. var string_counter_enforcement = string_counter; // if string_counter isn't higher than this at the end of the while() loop, you done goofed // Seek to next tag, extract source string_counter = html_copy.indexOf( '', string_counter ); if( next_angle_bracket > next_image_src ) { // If this tag contains a src, grab it. (I doubt any tags are malformed, but let's be cautious. string_counter = next_image_src; var quote_type = html_copy.substring( string_counter - 1, string_counter ); // either a singlequote or a doublequote var image_url = html_copy.substring( string_counter, html_copy.indexOf( quote_type, string_counter ) ); } // Exclude a bunch of useless nonsense with a blacklist if( image_url.indexOf( "//assets.tumblr.com" ) > 0 ) { image_url = ""; } // let's ignore avatar icons and Tumblr stuff. if( image_url.indexOf( "//static.tumblr.com" ) > 0 ) { image_url = ""; } if( image_url.indexOf( "//www.tumblr.com" ) > 0 ) { image_url = ""; } if( image_url.indexOf( "/avatar_" ) > 0 ) { image_url = ""; } // Include potentially interesting nonsense with a whitelist // General offsite whitelist would include crap like Facebook buttons, Twitter icons, etc. if( image_url.indexOf( ".tumblr.com" ) < 0 ) { // note that this test is different from the others - we blank image_url if the search term is not found, instead of blanking if it is found var original_image_url = image_url; image_url = ""; if( original_image_url.indexOf( "deviantart.net" ) > 0 ) { image_url = original_image_url; } // this is a sloppy whitelist of non-tumblr domains if( original_image_url.indexOf( "imgur.com" ) > 0 ) { image_url = original_image_url; } if( original_image_url.indexOf( "imageshack.com" ) > 0 ) { image_url = original_image_url; } if( original_image_url.indexOf( "imageshack.us" ) > 0 ) { image_url = original_image_url; } if( original_image_url.indexOf( "tinypic.com" ) > 0 ) { image_url = original_image_url; } // this originally read "tinypic.com1", but I assume I was drunk. if( original_image_url.indexOf( "gifninja.com" ) > 0 ) { image_url = original_image_url; } if( original_image_url.indexOf( "photobucket.com" ) > 0 ) { image_url = original_image_url; } if( original_image_url.indexOf( "dropbox.com" ) > 0 ) { image_url = original_image_url; } // if( original_image_url.indexOf( "" ) > 0 ) { image_url = original_image_url; } // if( original_image_url.indexOf( "" ) > 0 ) { image_url = original_image_url; } } if( image_url !== "" ) { url_array.push( [string_counter, image_url] ); // Push the page location alongside the URL, for a 2D array where the first element (url_array[n][0]) is its display order - for later sorting } if( string_counter_enforcement > string_counter ) { string_counter = string_counter_enforcement + 1; } // Make sure our while() eventually ends. Possibly throw an error here, for debugging. 
} // Look for links to offsite images, isolate URLs string_counter = 0; // reset to scrape for links this time while( html_copy.indexOf( ' -1 ) { var string_counter_enforcement = string_counter; // if string_counter isn't higher than this at the end of the while() loop, you done goofed // I probably don't even need to look for ' 0 ) { image_url = original_image_url; } if( original_image_url.indexOf( ".jpg" ) > 0 ) { image_url = original_image_url; } if( original_image_url.indexOf( ".jpeg" ) > 0 ) { image_url = original_image_url; } if( original_image_url.indexOf( ".png" ) > 0 ) { image_url = original_image_url; } // if( original_image_url.indexOf( "" ) > 0 ) { image_url = original_image_url; } // if( original_image_url.indexOf( "" ) > 0 ) { image_url = original_image_url; } } if( image_url !== "" ) { url_array.push( [parseFloat("0." + string_counter), image_url] ); // We lie about their order on the page (zero point string_counter) to avoid doubling-up when embedded images link to themselves } if( string_counter_enforcement > string_counter ) { string_counter = string_counter_enforcement + 1; } // making sure our while() eventually ends, even if we fuck up } // look for photoset iframes, then fetch them and soft-scrape them string_counter = 0; // reset to scrape for photosets this time while( html_copy.indexOf( 'id="photoset', string_counter ) > -1 ) { string_counter = html_copy.indexOf( 'id="photoset', string_counter ) + 10; // advance to where the next photoset is defined string_counter = html_copy.indexOf( 'src="', string_counter ) + 5; // advance to the source URL (we can assume doublequotes b/c photosets are never themed) var photoset_url = html_copy.substring( string_counter, html_copy.indexOf( '"', string_counter ) ); // grab the doublequote-delimited source URL if( photoset_url.indexOf( "photoset_iframe" ) > 0 ) { // do not attempt to extract photoset links from false-positive id="photoset" hits - it causes this function to fail var photosetxml = new XMLHttpRequest(); photosetxml.onreadystatechange = function() { // this will trigger whenever a photoset request comes back if( photosetxml.readyState == 4 ) { // when we're finally done loading the request var photoset_html = photosetxml.responseText; // I'm not sure you can write to responseText, but this is smarter practice regardless. var photoset_string_counter = 0; // best not to overload string_counter for a different scope. var first_image = true; while( photoset_html.indexOf( 'href="', photoset_string_counter ) > -1 ) { // do photosets need to be singlequote/doublequote-agnostic? I think they're 100% Tumblr-standardized. photoset_string_counter = photoset_html.indexOf( 'href="', photoset_string_counter ) + 6; // advance to next link href var image_url = photoset_html.substring( photoset_string_counter, photoset_html.indexOf( '"', photoset_string_counter ) ); // grab contents of link href // push [string.photoset as a float for sorting, image URL # photoset URL for linking to photoset if( first_image ) { url_array.push( [parseFloat(string_counter + "." + photoset_string_counter), image_url + "#photoset#" + photoset_url] ); first_image = false; // We want the photoset URL attached to just the first image found, so it's only linked once. Other images are only generically marked #photoset. } else { url_array.push( [parseFloat(string_counter + "." 
+ photoset_string_counter), image_url + "#photoset"] ); }
				}
			}
		}
		photosetxml.open("GET", photoset_url, false);
		photosetxml.send();
	}
}

url_array.sort( function(a,b) { return a[0] - b[0]; } ); // given two array elements, each of which is a [string_counter, image_url] array, sort ascending by string_counter - i.e. a comes first when a[0] - b[0] < 0
return url_array;
}

// Now that the URL format is so complicated, it's prudent to have a single canonical URL-unfucker that tells us what pages we should be looking at.
function get_site( site ) {
	site = site.substring( 0, window.location.href.indexOf( "/ezastumblrscrape" ) ); // remove everything after the string that triggers this script
	// Replace "/archive?" with "/"
	if( site.indexOf( "/archive?" ) > 0 ) { // Sanitize /archive URLs
		site = site.substring( 0, site.indexOf( "/archive?" ) ) + site.substring( site.indexOf( "/archive?" ) + 9 ); //site.replace( "/archive", "" ); // doesn't work for some goddamn reason
	}
	return site;
}

// This does the opposite of get_site, by inserting /archive? back into sanitized URLs.
// I'd like to point out that this is all Tumblr's fault, and I never would've started this creeping mess if they didn't completely suck as a gallery site.
function insert_archive( url ) {
	if( url.substring( url.length - 1 ) == '/' ) { url = url.substring( 0, url.length - 1 ); } // remove trailing slash if present
	if( url.indexOf( "/", 9 ) > 0 ) { // if there's anything past .tumblr.com, e.g. /tagged/stuff
		var tld = url.substring( 0, url.indexOf( "/", 9 ) ); // So e.g. http[s]://example.tumblr.com/tagged/stuff loses the /tagged/stuff
		var stuff = url.substring( url.indexOf( "/", 9 ) );
		url = tld + "/archive?" + stuff;
	} else { // otherwise it's just the TLD
		url = url + "/archive?";
	}
	return url;
}
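
// ------------------------------------ Unused sketch: throttled asynchronous fetching ------------------------------------ //
// The TODO notes near the top wonder whether setInterval can balance speed against spamming Tumblr with synchronous GETs.
// This is a minimal sketch of that idea, not wired into anything above. It assumes the caller passes absolute page URLs and a
// callback like function( url, html ){ ... }; the function and variable names here are illustrative, not part of the script proper.
function fetch_pages_throttled( page_urls, ms_between_requests, callback ) {
	var queue = page_urls.slice(); // copy so we can shift() without destroying the caller's array
	var timer = setInterval( function() {
		if( queue.length == 0 ) { clearInterval( timer ); return; } // stop the timer once every URL has been dispatched
		var page_url = queue.shift(); // take the next page off the front of the queue
		var request = new XMLHttpRequest();
		request.onreadystatechange = function() {
			if( request.readyState == 4 ) { callback( page_url, request.responseText ); } // hand the finished page back to the caller
		};
		request.open( "GET", page_url, true ); // true = asynchronous, so a slow page doesn't block the timer
		request.send();
	}, ms_between_requests ); // e.g. 250 would issue at most four requests per second
}
// Example use (commented out): fetch ten pages, a quarter-second apart, and log their sizes.
// var urls = []; for( var p = 1; p <= 10; p++ ) { urls.push( "http://example.tumblr.com/page/" + p ); }
// fetch_pages_throttled( urls, 250, function( url, html ) { console.log( url + ": " + html.length + " bytes" ); } );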
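
// ------------------------------------ Unused sketch: finding the last page with fewer fetches ------------------------------------ //
// scrape_whole_tumblr (above) doubles /page/N until a page stops linking to /page/(N+1), then narrows back down.
// This sketch shows the same doubling-plus-binary-search idea as a standalone function, assuming the caller supplies
// page_exists( n ) - some test that returns true while /page/n still has content. Illustrative only, not called anywhere.
function estimate_last_page( page_exists, hard_limit ) {
	var known_good = 1; // highest page we have seen content on
	var too_high = 0;   // lowest page we know is past the end

	// Double upward until we overshoot (or hit the hard limit, for infinite-scrolling themes that never fail the test).
	for( var n = 2; n <= hard_limit; n *= 2 ) {
		if( page_exists( n ) ) { known_good = n; }
		else { too_high = n; break; }
	}
	if( too_high == 0 ) { return hard_limit; } // never overshot; give up at the cap

	// Binary-search between the last good page and the first bad one. Total fetches stay logarithmic in the page count.
	while( too_high - known_good > 1 ) {
		var middle = Math.floor( (known_good + too_high) / 2 );
		if( page_exists( middle ) ) { known_good = middle; }
		else { too_high = middle; }
	}
	return known_good;
}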