// ==UserScript==
// @name        Eza's Tumblr Scrape
// @namespace   https://inkbunny.net/ezalias
// @description Creates a new page showing just the images from any Tumblr
// @license     MIT
// @license     Public domain / No rights reserved
// @include     http://*?ezastumblrscrape*
// @include     https://*?ezastumblrscrape*
// @include     http://*/ezastumblrscrape*
// @include     http://*.tumblr.com/
// @include     https://*.tumblr.com/
// @include     http://*.tumblr.com/page/*
// @include     https://*.tumblr.com/page/*
// @include     http://*.tumblr.com/tagged/*
// @include     https://*.tumblr.com/tagged/*
// @include     http://*.tumblr.com/search/*
// @include     https://*.tumblr.com/search/*
// @include     http://*.tumblr.com/post/*
// @include     https://*.tumblr.com/post/*
// @include     https://*.media.tumblr.com/*
// @include     https://media.tumblr.com/*
// @include     http://*/archive
// @include     https://*/archive
// @include     http://*.co.vu/*
// @exclude     */photoset_iframe/*
// @exclude     *imageshack.us*
// @exclude     *imageshack.com*
// @exclude     *//scmplayer.*
// @exclude     *//wikplayer.*
// @exclude     *//www.wikplayer.*
// @exclude     *//www.tumblr.com/search*
// @grant       GM_registerMenuCommand
// @version     5.17
// @downloadURL https://update.greasyfork.icu/scripts/4801/Eza%27s%20Tumblr%20Scrape.user.js
// @updateURL   https://update.greasyfork.icu/scripts/4801/Eza%27s%20Tumblr%20Scrape.meta.js
// ==/UserScript==

// Create an imaginary page on the relevant Tumblr domain, mostly to avoid the ridiculous same-origin policy for public HTML pages. Populate page with all images from that Tumblr. Add links to this page on normal pages within the blog.
// This script also works on off-site Tumblrs, by the way - just add /archive?ezastumblrscrape?scrapewholesite after the ".com" or whatever. Sorry it's not more concise.

// Make it work, make it fast, make it pretty - in that order.

// TODO:
// I'll have to add filtering as some kind of text input... and could potentially do multi-tag filtering, if I can reliably identify posts and/or reliably match tag definitions to images and image sets.
// This is a good feature for doing /scrapewholesite to get text links and then paging through them with fancy dynamic presentation nonsense. Also: duplicate elision.
// I'd love to do some multi-scrape stuff, e.g. scraping both /tagged/homestuck and /tagged/art, but that requires communication between divs to avoid constant repetition.
// Post-level detection would also be great because it'd let me filter out reblogs. Fuck all these people with 1000-page tumblrs, shitty animated gifs in their theme, infinite scrolling, and NO FUCKING TAGS. Looking at you, http://neuroticnick.tumblr.com/post/16618331343/oh-gamzee#dnr - you prick.
// Look into Tumblr Saviour to see how they handle and filter out text posts.
// Add a convenient interface for changing options? "Change browsing options" to unhide a div that lists every ?key=value pair, with text-entry boxes or radio buttons as appropriate, and a button that pushes a new URL into the address bar and re-hides the div. Would need to be separate from the thumbnail toggle so long as anything false is suppressed in get_url or whatever.
// Dropdown menus? Thumbnails yes/no, Pages At Once 1-20. These change the options_map settings immediately, so next/prev links will use them. Link to Apply Changes uses the same ?startpage as current.
// Could I generalize that the way I've generalized Image Glutton? E.g., grab all links from a Pixiv gallery page, show all images and all manga pages.
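// A rough sketch of that "apply changes" idea, using this script's own ?key=value URL scheme.
// Everything here is illustrative: build_scrape_url is a hypothetical helper (the script's real
// options_url() does a similar job further down), and the example blog URL is made up.
function build_scrape_url( base, options ) {
	let url = base + "?ezastumblrscrape";
	for( let key in options ) {
		if( options[ key ] === false ) { continue; }             // Anything false is suppressed, per the note above
		else if( options[ key ] === true ) { url += "?" + key; } // Valueless keys parse back as boolean true
		else { url += "?" + key + "=" + options[ key ]; }        // Everything else round-trips as key=value
	}
	return url;
}
// e.g. build_scrape_url( "http://example.tumblr.com/archive", { scrapemode: "scrapewholesite", find: "/tagged/art", pagesatonce: 10 } )
// -> "http://example.tumblr.com/archive?ezastumblrscrape?scrapemode=scrapewholesite?find=/tagged/art?pagesatonce=10"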
// Possibly @include any ?scrapeeverythingdammit to grab all links and embed all pictures found on them. Single-jump recursive web mirroring. (Fucking same-domain policy!)
// Now that I've got key-value mapping, add a link for 'view original posts only (experimental).' Er, 'hide reblogs'? Difficult to accurately convey.
// Make it an element of the post-scraping function. Then it would also work on scrape-whole-tumblr.
// Better yet: call it separately, then use the post-scraping function on each post-level chunk of HTML. I.e. call scrape_without_reblogs from scrape_whole_tumblr, split off each post into strings, and call soft_scrape_page( single_post_string ) to get all the same images.
// Or would it be better to get all images from any post? Doing this by-post means we aren't getting theme nonsense (mostly).
// Maybe just exclude images where a link to another tumblr happens before the next image... no, text posts could screw that up.
// General post detection is about recognizing patterns. Can we automate it heuristically? Bear in mind it'd be done at least once per scrape-page, and possibly once per tumblr-page.
// User b84485 seems to be using the scrape-whole-site option to open image links in tabs, and so is annoyed by the 500/1280 duplicates. Maybe a 'remove duplicates' button after the whole site's done?
// It's a legitimately good idea. Lord knows I prefer opening images in tabs under most circumstances.
// Basically I want a "Browse Links" page instead of just "grab everything that isn't nailed down."
// http://mekacrap.tumblr.com/post/82151443664/oh-my-looks-like-theres-some-pussy-under#dnr - lots of 'read more' stuff, for when that's implemented.
// Eza's tumblr scrape: "read more" might be tumblr standard.
// e.g. a "Read More" link.
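// A rough sketch of the "split off each post into strings" idea from the notes above. The selectors
// for post containers and reblog markers are assumptions - themes vary wildly - and
// scrape_without_reblogs is only a hypothetical name. soft_scrape_page() stands in for this
// script's per-page scraper (soft_scrape_page_promise in the current code).
function scrape_without_reblogs( page_html ) {
	var doc = new DOMParser().parseFromString( page_html, "text/html" ); // Parse the fetched page without running its scripts
	var posts = Array.from( doc.querySelectorAll( "article, .post, [id^='post']" ) ); // Guess at common post containers
	var originals = posts.filter( post => !post.querySelector( ".reblog-header, a.reblog_source" ) ); // Guess at reblog markers
	return originals.map( post => soft_scrape_page( post.innerHTML ) ); // Scrape each original post's chunk of HTML on its own
}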
// http://c-enpai.tumblr.com/ - interesting content visible in /archive, but every page is 'themed' to be a blank front page. Wtf.
// Chokes on multi-thousand-page tumblrs like actual-vriska, at least when listing all pages. It's just link-heavy text. Maybe skip having a div for every page and just append to one div. Or skip divs and append to the raw document innerHTML. It could be a memory thing, if ajax elements are never destroyed.
// Multi-thousand-page tumblrs make "find image links from all pages" choke. Massive memory use, massive CPU load. Ridiculous. It's just text. (Alright, it's links and ajax requests, but it's doggedly linear.)
// Maybe skip individual divs and append the raw pile-of-links hypertext into one div. Or skip divs entirely and append it straight to the document innerHTML.
// Could it be a memory leak thing? Are ajax elements getting properly released and destroyed when their scope ends? Kind of ridiculous either way, considering we're holding just a few kilobytes of text per page.
// Try re-using the same ajax object.

/* Assorted notes from another text file
	. eza's tumblr fixiv? de-style everything by simply erasing the
";
document.body.innerHTML += css_block; // Has to go in all at once or the browser "helpfully" closes the style tag upon evaluation

var mydiv = document.getElementById( "maindiv" ); // I apologize for the generic names. This script used to be a lot simpler.

// Identify options in URL (in the form of ?key=value pairs)
var key_value_array = window.location.href.split( '?' ); // Knowing how to do it the hard way is less impressive than knowing how not to do it the hard way.
key_value_array.shift(); // The first element will be the site URL. Durrrr.
for( dollarsign of key_value_array ) { // forEach( key_value_array ), including clumsy homage to $_
	var this_pair = dollarsign.split( '=' ); // Split key=value into [key,value] (or sometimes just [key])
	if( this_pair.length < 2 ) { this_pair.push( true ); } // If there's no value for this key, make its value boolean True
	if( this_pair[1] == "false" ) { this_pair[1] = false; } // If the value is the string "false" then make it False - note fun with 1-ordinal "length" and 0-ordinal array[element].
	else if( !isNaN( parseInt( this_pair[1] ) ) ) { this_pair[1] = parseInt( this_pair[1] ); } // If the value string looks like a number, make it a number
	options_map[ this_pair[0] ] = this_pair[1]; // options_map.key = value
}
if( options_map.find[ options_map.find.length - 1 ] == "/" ) { options_map.find = options_map.find.substring( 0, options_map.find.length - 1 ); } // Prevents .com//page/2
// if( options_map.find.indexOf( '/chrono' ) > 0 ) { options_map.chrono = true; } else { options_map.chrono = false; } // False case, to avoid unexpected persistence? Hm.

// Convert old URL options to new key-value pairs
if( options_map[ "scrapewholesite" ] ) { options_map.scrapemode = "scrapewholesite"; options_map.scrapewholesite = false; }
if( options_map[ "everypost" ] ) { options_map.scrapemode = "everypost"; options_map.everypost = false; }
if( options_map[ "thumbnails" ] == true ) { options_map.thumbnails = "fixed-width"; } // Replace the original valueless key with the default value
if( options_map[ "notraw" ] ) { options_map.maxres = "1280"; options_map.notraw = false; }
if( options_map[ "usesmall" ] ) { options_map.maxres = "400"; options_map.usesmall = false; }

document.body.className = options_map.thumbnails; // E.g. fixed-width, fixed-height, as matches the CSS. Persistent thumbnail options. Failsafe = original size.
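// Worked example of the parsing above (the URL is illustrative):
// http://example.tumblr.com/archive?ezastumblrscrape?scrapewholesite?find=/tagged/art?pagesatonce=5?thumbnails
// splits on '?' into [ "http://example.tumblr.com/archive", "ezastumblrscrape", "scrapewholesite", "find=/tagged/art", "pagesatonce=5", "thumbnails" ].
// After the shift() and the loop, options_map holds { ezastumblrscrape: true, scrapewholesite: true, find: "/tagged/art", pagesatonce: 5, thumbnails: true },
// and the conversion block above then turns that into scrapemode = "scrapewholesite" and thumbnails = "fixed-width".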
// Oh yeah, we have to do this -after- options_map.find is defined:
site_and_tags = window.location.protocol + "//" + window.location.hostname + options_map.find; // e.g. http: + // + example.tumblr.com + /tagged/sherlock

// Grab an example page so that duplicate-removal hides whatever junk is on every single page
// This remains buggy due to asynchronicity. It's a race condition where the results are, at worst, mildly annoying.
// Previous notes mention jeffmacanolinsfw.tumblr.com for some reason.
// Can just .then this function?
// Can I create cookies? That'd work fine. On load, grab cookies for this site and for this script, use /page/1 crap.
if( options_map.startpage != 1 && options_map.scrapemode != "scrapewholesite" ) // Not on first page, so on-every-page stuff appears somewhere
	{ exclude_content_example( site_and_tags + '/page/1' ); } // Since this doesn't happen on page 1, let's use page 1. Low pages are faster somehow.

// Add tags to title, for archival and identification purposes
document.title += options_map.find.split('/').join(' '); // E.g. /tagged/example/chrono -> "tagged example chrono"

// In Chrome, /archive pages monkey-patch and overwrite Promise.all and Promise.resolve.
// Clunky solution to clunky problem: grab the default property from a fresh iframe.
// Big thanks to inu-no-policeman for the iframe-based solution. Prototypes were not helpful.
var iframe = document.createElement( 'iframe' );
document.body.appendChild( iframe );
window['Promise'] = iframe.contentWindow['Promise'];
document.body.removeChild( iframe );

mydiv.innerHTML = "Not all images are guaranteed to appear.
"; // Thanks to JS's wacky accomodating nature, mydiv is global despite appearing in an if-else block. // Go to image browser or link scraper according to URL options. switch( options_map.scrapemode ) { case "scrapewholesite": scrape_whole_tumblr(); break; case "xml" : scrape_sitemap(); break; case "everypost": setTimeout( new_embedded_display, 500 ); break; // Slight delay increases odds of exclude_content_example actually fucking working case "www": scrape_www_tagged(); break; // www.tumblr.com/tagged, and eventually your dashboard, maybe. default: scrape_tumblr_pages(); // Sensible delays do not work on the original image browser. Shrug. } } else { // If it's just a normal Tumblr page, add a link to the appropriate /ezastumblrscrape URL // Add link(s) to the standard "+Follow / Dashboard" nonsense. Before +Follow, I think - to avoid messing with users' muscle memory. // This is currently beyond my ability to dick with JS through a script in a plugin. Let's kludge it for immediate usability. // kludge by Ivan - http://userscripts-mirror.org/scripts/review/65725.html // Preserve /tagged/tag/chrono, etc. Also preserve http: vs https: via "location.protocol". var find = window.location.pathname; if( find.indexOf( "/page/chrono" ) <= 0 ) { // Basically checking for posts /tagged/page, thanks to Detective-Pony. Don't even ask. if( find.lastIndexOf( "/page/" ) >= 0 ) { find = find.substring( 0, find.lastIndexOf( "/page/" ) ); } // Don't include e.g. /page/2. We'll add that ourselves. if( find.lastIndexOf( "/post/" ) >= 0 ) { find = find.substring( 0, find.lastIndexOf( "/post" ) ); } if( find.lastIndexOf( "/archive" ) >= 0 ) { find = find.substring( 0, find.lastIndexOf( "/archive" ) ); } // On individual posts (and the /archive page), the link should scrape the whole site. } var url = window.location.protocol + "//" + window.location.hostname + standard_landing_page + "?ezastumblrscrape?scrapewholesite?find=" + find; if( window.location.host == "www.tumblr.com" ) { url += "?scrapemode=www?thumbnails=fixed-width"; url = url.replace( "?scrapewholesite", "" ); } // "Don't clean this up. It's not permanent." // Fuck it, it works and it's fragile. Just boost its z-index so it stops getting covered. var scrape_button = document.createElement("a"); scrape_button.setAttribute( "style", "position: absolute; top: 26px; right: 1px; padding: 2px 0 0; width: 50px; height: 18px; display: block; overflow: hidden; -moz-border-radius: 3px; background: #777; color: #fff; font-size: 8pt; text-decoration: none; font-weight: bold; text-align: center; line-height: 12pt; z-index: 1000; " ); scrape_button.setAttribute("href", url); scrape_button.innerHTML = "Scrape"; var body_ref = document.getElementsByTagName("body")[0]; body_ref.appendChild(scrape_button); // Pages where the button gets split (i.e. clicking top half only redirects tiny corner iframe) are probably loading this script separately in the iframe. // Which means you'd need to redirect the window instead of just linking. Bluh. // Greasemonkey supports user commands through its add-on menu! Thus: no more manually typing /archive?ezastumblrscrape?scrapewholesite on uncooperative blogs. GM_registerMenuCommand( "Scrape whole Tumblr blog", go_to_scrapewholesite ); // If a page is missing (post deleted, blog deleted, name changed) a reblog can often be found based on the URL // Ugh, these should only appear on /post/ pages. // Naming these commands is hard. Both look for reblogs, but one is for if the post you're on is already a reblog. 
// Maybe "because this Tumblr changed names / moved / is missing" versus "because this reblog got deleted?" Newbs might not know either way. // "As though" this blog is missing / this post is missing? // "Search for this Tumblr under a different name" versus "search for other blogs that've reblogged this?" GM_registerMenuCommand( "Google for reblogs of this original Tumblr post", google_for_reblogs ); GM_registerMenuCommand( "Google for other instances of this Tumblr reblog", google_for_reblogs_other ); // two hard problems // if( window.location.href.indexOf( '?browse' ) > -1 ) { browse_this_page(); } // Experimental single-page scrape mode - DEFINITELY not guaranteed to stay } function go_to_scrapewholesite() { // let redirect = window.location.protocol + "//" + window.location.hostname + "/archive?ezastumblrscrape?scrapewholesite?find=" + window.location.pathname; let redirect = window.location.protocol + "//" + window.location.hostname + standard_landing_page + "?ezastumblrscrape?scrapewholesite?find=" + window.location.pathname; window.location.href = redirect; } function google_for_reblogs() { let blog_name = window.location.href.split('/')[2].split('.')[0]; // e.g. http//example.tumblr.com -> example let content = window.location.href.split('/').pop(); // e.g. http//example.tumblr.com/post/12345/hey-i-drew-this -> hey-i-drew-this let redirect = "https://google.com/search?q=tumblr " + blog_name + " " + content; window.location.href = redirect; } function google_for_reblogs_other() { let content = window.location.href.split('/').pop().split('-'); // e.g. http//example.tumblr.com/post/12345/hey-i-drew-this -> hey,i,drew,this let blog_name = content.shift(); // e.g. examplename-hey-i-drew-this -> examplename content = content.join('-'); let redirect = "https://google.com/search?q=tumblr " + blog_name + " " + content; window.location.href = redirect; } // ------------------------------------ Whole-site scraper for use with DownThemAll ------------------------------------ // // Monolithic scrape-whole-site function, recreating the original intent (before I added pages and made it a glorified multipage image browser) function scrape_whole_tumblr() { // console.log( page_dupe_hash ); var highest_known_page = 0; options_map.startpage = 1; // Reset to default, because other values do goofy things to the image-browsing links below // Link to image-viewing version, preserving current tags mydiv.innerHTML += "

Browse images (10 pages at once)

"; mydiv.innerHTML += "

(5 pages at once)

"; mydiv.innerHTML += "

(1 page at once)



"; mydiv.innerHTML += "Experimental fetch-every-post image browser (10 pages at once) "; mydiv.innerHTML += "(5 pages at once) "; mydiv.innerHTML += "(1 page at once)

"; mydiv.innerHTML += "Post-by-post images and text <-- New option for saving stories

"; // Find out how many pages we need to scrape. if( isNaN( options_map.lastpage ) ) { // Find upper bound in a small number of fetches. Ideally we'd skip this - some themes list e.g. "Page 1 of 24." I think that requires back-end cooperation. mydiv.innerHTML += "Finding out how many pages are in " + site_and_tags.substring( site_and_tags.indexOf( '/' ) + 2 ) + ":

"; // Returns page number if there's no Next link, or negative page number if there is a Next link. // Only for use on /mobile pages; relies on Tumblr's shitty standard theme function test_next_page( body ) { var link_index = body.indexOf( 'rel="canonical"' ); // var page_index = body.indexOf( '/page/', link_index ); var terminator_index = body.indexOf( '"', page_index ); var this_page = parseInt( body.substring( page_index+6, terminator_index ) ); if( body.indexOf( '>next<' ) > 0 ) { return -this_page; } else { return this_page } } // Generates an array of length "steps" between given boundaries - or near enough, for sanity's sake function array_between_bounds( lower_bound, upper_bound, steps ) { if( lower_bound > upper_bound ) { // Swap if out-of-order. var temp = lower_bound; lower_bound = upper_bound, upper_bound = temp; } var bound_range = upper_bound - lower_bound; if( steps > bound_range ) { steps = bound_range; } // Steps <= bound_range, but steps > 1 to avoid division by zero: var pages_per_test = parseInt( bound_range / steps ); // Steps-1 here, so first element is lower_bound & last is upper_bound. Off-by-one errors, whee... var range = Array( steps ) .fill( lower_bound ) .map( (value,index) => value += index * pages_per_test ); range.push( upper_bound ); return range; } // DEBUG // site_and_tags = 'https://www.tumblr.com/safe-mode?url=http://shittyhorsey.tumblr.com'; // Given a (presumably sorted) list of page numbers, find the last that exists and the first that doesn't exist. function find_reasonable_bound( test_array ) { return Promise.all( test_array.map( pagenum => fetch( site_and_tags + '/page/' + pagenum + '/mobile', { credentials: 'include' } ) ) ) .then( responses => Promise.all( responses.map( response => response.text() ) ) ) .then( pages => pages.map( page => test_next_page( page ) ) ) .then( numbers => { var lower_index = -1; numbers.forEach( (value,index) => { if( value < 0 ) { lower_index++; } } ); // Count the negative numbers (i.e., count the pages with known content) if( lower_index < 0 ) { lower_index = 0; } var bounds = [ Math.abs(numbers[lower_index]), numbers[lower_index+1] ] mydiv.innerHTML += "Last page is between " + bounds[0] + " and " + bounds[1] + ".
"; return bounds; } ) } // Repeatedly narrow down how many pages we're talking about; find a reasonable "last" page find_reasonable_bound( [2, 10, 100, 1000, 10000, 100000] ) // Are we talking a couple pages, or a shitload of pages? .then( pair => find_reasonable_bound( array_between_bounds( pair[0], pair[1], 10 ) ) ) // Narrow it down. Fewer rounds of more fetches works best. .then( pair => find_reasonable_bound( array_between_bounds( pair[0], pair[1], 10 ) ) ) // Time is round count, fetches add up, selectivity is fetches x fetches. // Quit fine-tuning numbers and just conditional in some more testing for wide ranges. .then( pair => { if( pair[1] - pair[0] > 50 ) { return find_reasonable_bound( array_between_bounds( pair[0], pair[1], 10 ) ) } else { return pair; } } ) .then( pair => { if( pair[1] - pair[0] > 50 ) { return find_reasonable_bound( array_between_bounds( pair[0], pair[1], 10 ) ) } else { return pair; } } ) .then( pair => { options_map.lastpage = pair[1]; document.getElementById( 'browse10' ).href += "?lastpage=" + options_map.lastpage; // Add last-page indicator to Browse Images link document.getElementById( 'browse5' ).href += "?lastpage=" + options_map.lastpage; // ... and the 5-pages-at-once link. document.getElementById( 'browse1' ).href += "?lastpage=" + options_map.lastpage; // ... and the 1-page-at-onces link. document.getElementById( 'exp10' ).href += "?lastpage=" + options_map.lastpage; // ... and this fetch-every-post link. document.getElementById( 'exp5' ).href += "?lastpage=" + options_map.lastpage; // ... and this fetch-every-post link. document.getElementById( 'exp1' ).href += "?lastpage=" + options_map.lastpage; // ... and this fetch-every-post link. document.getElementById( 'xml_all' ).href += "?lastpage=" + options_map.lastpage; // ... and this XML post-by-post link, why not. start_scraping_button(); } ); } else { // If we're given the highest page by the URL, just use that start_scraping_button(); } // Add "Scrape" button to the page. This will grab images and links from many pages and list them page-by-page. function start_scraping_button() { if( options_map.grabrange ) { // If we're only grabbing a 1000-page block from a huge-ass tumblr: mydiv.innerHTML += "
This will grab 1000 pages starting at " + options_map.grabrange + ".

"; } else { // If we really are describing the last page: mydiv.innerHTML += "
Last page is " + options_map.lastpage + " or lower. "; mydiv.innerHTML += "Find page count again?

"; } if( options_map.lastpage > 1500 && !options_map.grabrange ) { // If we need to link to 1000-page blocks, and aren't currently inside one: for( let x = 1; x < options_map.lastpage; x += 1000 ) { // For every 1000 pages... let decade_url = window.location.href + "?grabrange=" + x + "?lastpage=" + options_map.lastpage; mydiv.innerHTML += "Pages " + x + "-" + (x+999) + "
"; // ... link a range of 1000 pages. } } // Add button to scrape every page, one after another. // Buttons within GreaseMonkey are a huge pain in the ass. I stole this from stackoverflow.com/questions/6480082/ - thanks, Brock Adams. var button = document.createElement ('div'); button.innerHTML = ''; button.setAttribute ( 'id', 'scrape_button' ); // I'm really not sure why this id and the above HTML id aren't the same property. document.body.appendChild ( button ); // Add button (at the end is fine) document.getElementById ("myButton").addEventListener ( "click", scrape_all_pages, false ); // Activate button - when clicked, it triggers scrape_all_pages() if( options_map.autostart ) { document.getElementById ("myButton").click(); } // Getting tired of clicking on every reload - debug-ish if( options_map.lastpage <= 26 ) { document.getElementById ("myButton").click(); } // Automatic fetch (original behavior!) for a single round } } function scrape_all_pages() { // Example code implies that this function /can/ take a parameter via the event listener, but I'm not sure how. var button = document.getElementById( "scrape_button" ); // First, remove the button. There's no reason it should be clickable twice. button.parentNode.removeChild( button ); // The DOM can only remove elements from a higher level. "Elements can't commit suicide, but infanticide is permitted." mydiv.innerHTML += "Scraping page:


"; // This makes it easier to view progress, // Create divs for all pages' content, allowing asynchronous AJAX fetches var x = 1; var div_end_page = options_map.lastpage; if( !isNaN( options_map.grabrange ) ) { // If grabbing 1000 pages from the middle of 10,000, don't create 0..10,000 divs x = options_map.grabrange; div_end_page = x + 1000; // Should be +999, but whatever, no harm in tiny overshoot } for( ; x <= div_end_page; x++ ) { var siteurl = site_and_tags + "/page/" + x; if( options_map.usemobile ) { siteurl += "/mobile"; } // If ?usemobile is flagged, scrape the mobile version. if( x == 1 && options_map.usemobile ) { siteurl = site_and_tags + "/mobile"; } // Hacky fix for redirect from example.tumblr.com/page/1/anything -> example.tumblr.com var new_div = document.createElement( 'div' ); new_div.id = '' + x; document.body.appendChild( new_div ); } // Fetch all pages with content on them var page_counter_div = document.getElementById( 'pagecounter' ); // Probably minor, but over thousands of laggy page updates, I'll take any optimization. pagecounter.innerHTML = "" + 1; var begin_page = 1; var end_page = options_map.lastpage; if( !isNaN( options_map.grabrange ) ) { // If a range is defined, grab only 1000 pages starting there begin_page = options_map.grabrange; end_page = options_map.grabrange + 999; // NOT plus 1000. Stop making that mistake. First page + 999 = 1000 total. if( end_page > options_map.lastpage ) { end_page = options_map.lastpage; } // Kludge document.title += " " + (parseInt( begin_page / 1000 ) + 1); // Change page title to indicate which block of pages we're saving } // Generate array of URL/pagenum pair-arrays url_index_array = new Array; for( var x = begin_page; x <= end_page; x++ ) { var siteurl = site_and_tags + "/page/" + x; if( options_map.usemobile ) { siteurl += "/mobile"; } // If ?usemobile is flagged, scrape the mobile version. No theme shenanigans... but also no photosets. Sigh. if( x == 1 && options_map.usemobile ) { siteurl = site_and_tags + "/mobile"; } // Hacky fix for redirect from example.tumblr.com/page/1/anything -> example.tumblr.com url_index_array.push( [siteurl, x] ); } // Fetch, scrape, and display all URLs. Uses promises to work in parallel and promise.all to limit speed and memory (mostly for reliability's sake). // Consider privileging first page with single-element fetch, to increase apparent responsiveness. Doherty threshold for frustration is 400ms. var simultaneous_fetches = 25; var chain = Promise.resolve(0); // Empty promise so we can use "then" var order_array = [1]; // We want to show the first page immediately, and this is a callback rat's-nest, so let's make an array of how many pages to take each round for( var x = 1; x < url_index_array.length; x += simultaneous_fetches ) { // E.g. 
[1, simultaneous_fetchs, s_f, s_f, s_f, whatever's left] if( url_index_array.length - x > simultaneous_fetches ) { order_array.push( simultaneous_fetches ); } else { order_array.push( url_index_array.length - x ); } } order_array.forEach( (how_many) => { chain = chain.then( s => { var subarray = url_index_array.splice( 0, how_many ); // Shift a reasonable number of elements into separate array, for partial array.map return Promise.all( subarray.map( page => Promise.all( [ fetch( page[0], { credentials: 'include' } ).then( s => s.text() ), page[1], page[0] ] ) // Return [ body of page, page number, page URL ] ) ) } ) .then( responses => responses.map( s => { // Scrape URLs for links and images, display on page var pagenum = s[1]; var page_url = s[2]; var url_array = soft_scrape_page_promise( s[0] ) // Surprise, this is a promise now .then( urls => { // Sort #link URLs to appear first, because we don't do that in soft-scrape anymore urls.sort( (a,b) => -a.indexOf( "#link" ) ); // Strings containing "#link" go before others - return +1 if not found in 'a.' Should be stable. // Print URLs so DownThemAll (or similar) can grab them var bulk_string = "
" + page_url + "
"; // A digest, so we can update innerHTML just once per div // DEBUG-ish - on theory that 1000-page-tall scraping/rendering fucks my VRAM if( options_map.smalltext ) { bulk_string = "

" + bulk_string; } // If ?smalltext flag is set, render text unusably small, for esoteric reasons urls.forEach( (value,index,array) => { if( options_map.plaintext ) { bulk_string += value + '
'; } else { bulk_string += '' + value + '
'; } } ) document.getElementById( '' + pagenum ).innerHTML = bulk_string; if( parseInt( pagecounter.innerHTML ) < pagenum ) { pagecounter.innerHTML = "" + pagenum; } // Increment pagecounter (where sensible) } ); } ) ) } ) chain = chain.then( s => { document.getElementById( 'afterpagecounter' ).innerHTML = "Done. Use DownThemAll (or a similar plugin) to grab all these links."; // Divulge contents of page_dupe_hash to check for common tags // Ugh, I'm going to have to turn this from an associative array into an array-of-arrays if I want to sort it. let tag_overview = "
" + "Tag overview: " + "
"; let score_tag_list = new Array; // This will hold an array of arrays so we can sort this associative array by its values. Wheee. for( let url in page_dupe_hash ) { if( url.indexOf( '/tagged/' ) > 0 // If it's a tag URL... && page_dupe_hash[ url ] > 1 // and non-unique... && url.indexOf( '/page/' ) < 0 // and not a page link... && url.indexOf( '?og' ) < 0 // and not an opengraph link... && url.indexOf( '?ezas' ) < 0 // and not this script, wtf... ) { // So if it's a TAG, in other words... score_tag_list.push( [ page_dupe_hash[ url ], url ] ); // ... store [ number of times seen, tag URL ] for sorting. } } score_tag_list.sort( (a,b) => a[0] > b[0] ); // Ascending order, now - most common tags at the very bottom, for easier access score_tag_list.map( pair => { pair[1] = pair[1].replace( '/chrono', '' ); // Remove /chrono from sites that append it automatically, since it breaks the 'N pages' autostart links. (/chrono/ might fail.) var this_tag = pair[1].split('/').pop(); // e.g. example.tumblr.com/tagged/my-art -> my-art //if( this_tag == '' ) { let this_tag = pair[1].split('/').pop(); } // Trailing slash screws this up, so get the second-to-last thing instead var scrape_link = options_url( {find: '/tagged/'+this_tag, lastpage: false, grabrange: false, autostart: true} ); // Direct link to ?scrapewholesite for e.g. /tagged/my-art tag_overview += "
" + pair[0] + " posts:\t" + "" + pair[1] + ""; } ) document.body.innerHTML += tag_overview; } ) } // ------------------------------------ Multi-page scraper with embedded images ------------------------------------ // function scrape_tumblr_pages() { // Grab an empty page so that duplicate-removal hides whatever junk is on every single page // This is DEBUG-ish. It might be slow, barring caching. It might not work due to asynchrony. It could block actual content thanks to 'my best posts' sidebars. if( isNaN( parseInt( options_map.startpage ) ) || options_map.startpage <= 1 ) { options_map.startpage = 1; } // Sanity check mydiv.innerHTML += "
" + html_previous_next_navigation() + "
"; document.getElementById("bottom_controls_div").innerHTML += "
" + html_page_count_navigation() + "
" + html_previous_next_navigation(); mydiv.innerHTML += "
" + html_ezastumblrscrape_options() + "
"; mydiv.innerHTML += "
" +image_size_options(); mydiv.innerHTML += "

" + image_resolution_options(); // Fill an array with the page URLs to be scraped (and create per-page divs while we're at it) var pages = new Array( parseInt( options_map.pagesatonce ) ) .fill( parseInt( options_map.startpage ) ) .map( (value,index) => value+index ); pages.forEach( pagenum => { mydiv.innerHTML += "


Page " + pagenum + "
"; } ) pages.map( pagenum => { var siteurl = site_and_tags + "/page/" + pagenum; // example.tumblr.com/page/startpage, startpage+1, startpage+2, etc. if( options_map.usemobile ) { siteurl += "/mobile"; } // If ?usemobile is flagged, scrape mobile version. No theme shenanigans... but also no photosets. Sigh. if( pagenum == 1 && options_map.usemobile ) { siteurl = site_and_tags + "/mobile"; } // Hacky fix for redirect from example.tumblr.com/page/1/anything -> example.tumblr.com fetch( siteurl, { credentials: 'include' } ).then( response => response.text() ).then( text => { document.getElementById( pagenum ).innerHTML += "fetched
" // Immediately indicate the fetch happened. + "" + siteurl + "
"; // Link to page. Useful for viewing things in-situ... and debugging. // For some asinine reason, 'return url_array' causes 'Permission denied to access property "then".' So fake it with ugly nesting. soft_scrape_page_promise( text ) .then( url_array => { var div_digest = ""; // Instead of updating each div's HTML for every image, we'll lump it into one string and update the page once per div. var video_array = new Array; var outlink_array = new Array; var inlink_array = new Array; url_array.forEach( (value,index,array) => { // Shift videos and links to separate arrays, blank out those URLs in url_array if( value.indexOf( '#video' ) > 0 ) { video_array.push( value ); array[index] = '' } if( value.indexOf( '#offsite' ) > 0 ) { outlink_array.push( value ); array[index] = '' } if( value.indexOf( '#local' ) > 0 ) { inlink_array.push( value ); array[index] = '' } } ); url_array = url_array.filter( url => url === "" ? false : true ); // Remove empty elements from url_array // Display video links, if there are any video_array.forEach( value => {div_digest += "Video: " + value + "
"; } ) // Display page links if the ?showlinks flag is enabled if( options_map.showlinks ) { div_digest += "Outgoing links: "; outlink_array.forEach( (value,index) => { div_digest += "O" + (index+1) + " " } ); div_digest += "
" + "Same-Tumblr links: "; inlink_array.forEach( (value,index) => { div_digest += "T" + (index+1) + " " } ); div_digest += "
"; } // Embed high-res images to be seen, clicked, and saved url_array.forEach( image_url => { // Embed images (linked to themselves) and link to photosets if( image_url.indexOf( "#photoset#" ) > 0 ) { // Before the first image in a photoset, print the photoset link. var photoset_url = image_url.split( "#" ).pop(); // URL is like tumblr.com/image#photoset#http://tumblr.com/photoset_iframe - separate past last hash... t. div_digest += " Set:"; } div_digest += "" + "(Waiting for image) "; // div_digest += "" + "(Waiting for image) "; } ) div_digest += "
(End of " + siteurl + ")"; // Another link to the page, because I'm tired of scrolling back up. document.getElementById( pagenum ).innerHTML += div_digest; } ) // End of 'then( url_array => { } )' } ) // End of 'then( text => { } )' } ) // End of 'pages.map( pagenum => { } )' } // ------------------------------------ Whole-site scraper based on post-by-post method ------------------------------------ // // The use of sitemap.xml could be transformative to this script, or even split off into another script entirely. It finally allows a standard theme! // Just do it. Entries look like: // http://leopha.tumblr.com/post/57593339717/hello-a-friend-found-my-blog-somehow-so-url2013-08-07T06:42:58Z // So split on and terminate at . Or see if JS has native XML parsing, for honest object-orientation instead of banging at text files. // Grab normally for now, I guess, just to get it implemented. Make ?usemobile work. Worry about /embed stuff later. // Check if dashboard-only blogs have xml files exposed. // While we're at it, check that the very first for sitemap2 doesn't point to a /page/n URL. Not useful, just interesting. // Simplicate for now. Have a link to grab an individual XML file, get everything inside it, then add images (and tags?) from each page. // For an automatic 'get the whole damn site' mode, maybe just call each XML file sequentially. Not a for() loop - have each page maybe call the next page when finished. // Oh right, these XML files are like 500 posts each. Not much different from grabbing 50 pages at a time - and we currently grab 25 at once - but significant. // If I have to pass the /post URL list to a serial function in order to rate-limit this, I might as well grab all XML files and fill the list -once.- // Count down? So e.g. scrape(x) calls scrape(x-1). It's up to the root function to call the high value initially. // Single-sitemap pages (e.g. scraping and displaying sitemap2) will need prev/next controls (e.g. to sitemap1 & sitemap3). // These will be chronological. That's fine, just mention it somewhere. // I'm almost loathe to publicize this. It's not faster or more reliable than the monolithic scraper. It's not a better image-browsing method, in my opinion. // It's only really useful for people who think a blank theme is the same as erasing their blog... and the less those jerks know, the better. Don't delete art. // I guess it'll be better when I list tags with each post, but at present they're strongly filtered out. // Wait, wtf? I ripped sometipsygnostalgic as a test cast (98 sitemaps, ~50K posts) and the last page scraped is from years ago. // None of the sitemaps past sitemap51 exist. So the first 51 are present - the rest 404. (To her theme, not as a generic tumblr error with animated background.) // It's not a URL-generation error; the same thing happens copy-pasting the XML URLs straight from sitemap.xml. // Just... fuck. Throw a warning for now, see if there's anything to be done later. // If /mobile versions of /post URLs unexpectedly had proper [video] references... do they contain previous/next post info? Doesn't look like it. Nuts. // In addition to tags, this should maybe list sources and 'via' links. It's partially archival, after all. // Sort of have to conditionally unfilter /tagged links from both duplicate-remover and whichever function removes tumblr guff. 'if !xml remove tagged.' bluh. // Better idea: send the list to a 'return only tags' filter first, non-destructively. Then filter them out. 
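// The "see if JS has native XML parsing" idea above, as a sketch. DOMParser ships with the browser;
// parse_sitemap_locs is a hypothetical name, not something the script currently defines. post_urls
// is the script's real global array of /post URLs.
function parse_sitemap_locs( xml_text ) {
	var doc = new DOMParser().parseFromString( xml_text, "text/xml" ); // Honest XML parsing instead of splitting at <loc> tags by hand
	return Array.from( doc.getElementsByTagName( 'loc' ) ).map( node => node.textContent ); // One URL per <loc> entry
}
// e.g. fetch( '/sitemap1.xml', { credentials: 'include' } ).then( r => r.text() ).then( t => { post_urls = post_urls.concat( parse_sitemap_locs( t ) ); } );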
// I guess this is what I'd modify to get multi-tag searches - like homestuck+hs+hamsteak. Scrape by /tagged and /page, populate post_urls, de-dupe and sort. // Of course then I'd want it to 'just work' with the existing post-by-post image browser, which... probably won't happen. I'm loathe to recreate it exactly or inexactly. // This needs some method to find /post numbers from /page and /tagged pages. // Maybe trigger a later post-by-post function from a modified scrapewholesite approach? Code quality is a non-issue right now. // Basically have to fill some array post_urls with /post URLs (strings) and then call scrape_post_urls(). Oh, that array is global. This was already a janky first pass. // Instead of XML files, use /page or /page + /mobile... pages. // Like 'if /tagged/something then grab all links and filter for indexof /post.' // This is the landing function for this mode - it grabs sitemap.xml, parses options, sets any necessary variables, and invokes scrape_sitemap_x for sitemap1 and so on. function scrape_sitemap() { document.title += ' sitemap'; // Flow control. // Two hard problems. Structure now, names later. if( options_map.sitemap ) { mydiv.innerHTML += "
Completion:
0%

"; // This makes it easier to view progress, let final_sitemap = 0; // Default: no recursion. 'Else considered harmful.' if( options_map.xmlcount ) { final_sitemap = options_map.xmlcount; } // If we're doing the whole site, indicate the highest sitemap and recurse until you get there. scrape_sitemap_x( options_map.sitemap, final_sitemap ); } else { // Could create a div for this enumeration of sitemapX files, and always have it at the top. It'd get silly for tumblrs with like a hundred of them. // Maybe at the bottom? fetch( window.location.protocol + "//" + window.location.hostname + '/sitemap.xml', { credentials: 'include' } ) // Grab text-like file .then( r => r.text() ) .then( t => { // Process the text of sitemap.xml let sitemap_list = t.split( '' ); sitemap_list.shift(); // Get rid of data before first location sitemap_list = sitemap_list.map( e => e.split('')[0] ); // Terminate each entry sitemap_list = sitemap_list.filter( e => { return e.indexOf( '/sitemap' ) > 0 && e.indexOf( '.xml' ) > 0; } ); // Remove everything but 'sitemapX.xml' links mydiv.innerHTML += '
' + window.location.hostname + ' has ' + sitemap_list.length + ' sitemap XML file'; if( sitemap_list.length > 1 ) { mydiv.innerHTML += 's'; } // Pluralization! Not sexy, just functional. // mydiv.innerHTML += ' (Up to ' + sitemap_list.length * 500 + ' posts.)
' mydiv.innerHTML += '. (' + Math.ceil(sitemap_list.length / 2) + ',000 posts or fewer.)

' // Kludge math, but I pefer this presentation. if( sitemap_list.length > 50 ) { mydiv.innerHTML += 'Sitemaps past 50 may not work.

'; } // Pluralization! Not sexy, just functional. // List everything: // List sitemap1: or // List sitemap2: or etc. mydiv.innerHTML += 'List everything: ' + 'Links only - Links and text (stories)

'; if( options_map.find && options_map.lastpage ) { mydiv.innerHTML += 'List just this tag (' + options_map.find + '): ' + 'Links only - Links and text (stories)

'; } // mydiv.innerHTML += "

Browse images (10 pages at once)

"; for( n = 1; n <= sitemap_list.length; n++ ) { let text_link = options_url( { sitemap:n } ); let images_link = options_url( { sitemap:n, thumbnails:'xml' } ); let story_link = options_url( { sitemap:n, story:true, usemobile:true } ); mydiv.innerHTML += 'List sitemap' + n + ': ' + 'Links only Links & thumbnails Links & text
' } } ) } } // Text-based scrape mode for sitemap1.xml, sitemap2.xml, etc. function scrape_sitemap_x( sitemap_x, final_sitemap ) { document.title = window.location.hostname + ' - sitemap ' + sitemap_x; // We lose the original tumblr's title, but whatever. if( sitemap_x == final_sitemap ) { document.title = window.location.hostname + ' - sitemap complete'; } if( options_map.story ) { document.title += ' with stories'; } // Last-minute kludge: if options_map.tagscrape, use /page instead of any xml, get /post URLs matching this domain. // Only whole-hog for now - sitemap_x might allow like ten or a hundred pages at once, later. var base_site = window.location.protocol + "//" + window.location.hostname; // e.g. http://example.tumblr.com var sitemap_url = base_site + '/sitemap' + sitemap_x + '.xml'; if( options_map.tagscrape ) { sitemap_url = base_site + options_map.find + '/page/' + sitemap_x; document.title += ' ' + options_map.find.replace( /\//g, ' '); // E.g. 'tagged my-stuff'. } mydiv.innerHTML += "Finding posts from " + sitemap_url + ".
"; fetch( sitemap_url, { credentials: 'include' } ) // Grab text-like file .then( r => r.text() ) // Get txt from HTTP response .then( t => { // Process text to extract links let location_list = t.split( '' ); // Break at each location declaration location_list.shift(); location_list.shift(); // Ditch first two elements. First is preamble, second is base URL (e.g. example.tumblr.com). location_list = location_list.map( e => e.split( '' )[0] ); // Terminate each entry at location close-tag if( options_map.tagscrape ) { // Fill location_list with URLs on this page, matching this domain, containing e.g. /post/12345. // Trailing slash not guaranteed. Quote terminator unknown. Fuck it, just assume it's all doublequotes. location_list = t.split( 'href="' ); location_list.shift(); location_list = location_list.map( e => e.split( '"' )[0] ); // Terminate each entry at location close-tag location_list = location_list.filter( u => u.indexOf( window.location.hostname + '/post' ) > -1 ); location_list = location_list.filter( u => u.indexOf( '/embed' ) < 0 ); location_list = location_list.filter( u => u.indexOf( '#notes' ) < 0 ); // I fucking hate this website. // https://www.facebook.com/sharer/sharer.php?u=https://shamserg.tumblr.com/post/55985088311/knight-of-vengence-by-shamserg // I fucking hate the web. location_list = location_list.map( e => window.location.protocol + '//' + window.location.hostname + e.split( window.location.hostname )[1] ); // location_list = location_list.filter( u => u.indexOf( '?ezastumblrscrape' ) <= 0 ); // Exclude links to this script. (Nope, apparently that's introduced elsewhere. WTF.) // Christ, https versus http bullshit AGAIN. Just get it done. // location_list = location_list.map( e => window.location.protocol + '//' + e.split( '//' )[1] ); // http -> https, or https -> http, as needed. // I don't think I ever remove duplicates. On pages with lots of silly "sharing" bullshit, this might grab pages multiple times. Not sure I care. Extreme go horse. } // location_list should now contain a bunch of /post/12345 URLs. if( options_map.usemobile ) { location_list = location_list.map( (url) => { return url + '/mobile' } ); } console.log( location_list ); // return location_list; // Promises are still weird. (JS's null-by-default is also weird.) post_urls = post_urls.concat( location_list ); // Append these URLs to the global array of /post URLs. (It's not like Array.push() because fuck you.) // console.log( post_urls ); // It seems like bad practice to recurse from here, but for my weird needs it seems to make sense. if( sitemap_x < final_sitemap ) { scrape_sitemap_x( sitemap_x + 1, final_sitemap ); // If there's more to be done, recurse. } else { scrape_post_urls(); // If not, fetch & display all these /post URLs. } } ) } // Fetch all the URLs in the global array of post_urls, then display contents as relevant function scrape_post_urls() { // Had to re-check how I did rate-limited fetches for the monolithic main scraper. It involved safe levels of 'wtf.' // Basically I put n URLs in an array, or an array of fetch promises I guess, then use promise.all to wait until they're all fetched. Repeat as necessary. // Can I use web workers? E.g. instantiate 25 parallel scripts that each shift from post_urls and spit back HTML. Ech. External JS is encouraged and DOM access is limited. // Screw it, batch parallelism will suffice. // Can I define a variable as a function? I mean, can I go var foobar = function_name, so I can later invoke foobar() and get function_name()? 
// If so: I can pick a callback before all of this, fill a global array with e.g. [url, contents] stuff, and jump to a text- or image-based display when that's finished. // Otherwise I think I'm gonna be copying myself a lot in order to display text xor images. // Create divs for each post? This is going to be a lot of divs. // 1000 is fine for the traditional monolithic scrape mode, but that covers ten or twenty times as many posts as a single sitemapX.xml. // And it probably doesn't belong right here. // Fuck it, nobody's using this unless they read the source code. Gotta build something twice to build it right. // Create divs for each post, so they can be fetched and inserted asynchronously post_urls.forEach( (url) => { let new_div = document.createElement( 'div' ); new_div.id = '' + url; document.body.appendChild( new_div ); } ) // Something in here is throwing "Promise rejection value is a non-unwrappable cross-compartment wrapper." I don't even know what the fuck. // Apparently Mozilla doesn't know what the fuck either, because all discussion of this seems to be 'well it ought to throw a better error.' // Okay somehow this happens even with everything past 'console.log( order_array );' commented out, so it's not actually a wrong thing involving the Promise nest. Huh. // It's the for loop. // Is it because we try using location_list.length? Even though it's a global array and not a Promise? And console.log() has no problem... oh, console.log quietly chokes. // I'm trying to grab a sitemap2 that doesn't exist because I was being clever in opposite directions. // Nope, still shits the bed, same incomprehensible error. Console.log shows location_list has... oh goddammit. location_list was the local variable; post_urls is the global. // Also array.concat doesn't work for some reason. post_urls still has zero elements. // Oh for fuck's sake it has no side effects! You have to do thing=thing.concat()! God damn the inconsistent behavior versus array.push! // We now get as far as chain=chain.then and the first console.log('foobar'), but it happens exactly once. // Splicing appears to happen correctly. // We enter responses.map, and the [body of post, post url] array is passed correctly and contains right-looking data. // Ah. links_from_text is not a function. It's links_from_page. Firefox, that's a dead simple error - say something. // Thumbnails? True thumbnails, and only for tumblr images with _100 available. Still inadvisable for the whole damn tumblr at once. // So limit that to single-sitemap modes (like linking out to 1000-pages-at-once monolithic scrapes) and call it good. 500 thumbnails is what /archive does anyway. // Terrible fix for unique tags: can I edit CSS from within the page? Can I count inside that? // I'd be trying to do something like with a corresponding value over one. // Utter hack: I need to get the text of each /post to go above or below it. The key is options_map( 'story' ) == true. // Still pass it as a "link" string. Prepend with \n or something, a non-URL-like control character, which we're okay to print accidentally. Don't link that "link." // /mobile might work? Photosets show up as '[video]', but some text is in the HTML. Dunno if it's everything. // Oh god they're suddenly fucking with mobile. // In /mobile, get everything from '' to '
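// Answering the "can I define a variable as a function" question above: yes, JS functions are
// first-class values, so the display routine can be picked once and invoked later. The names
// display_text_posts and display_image_posts are hypothetical stand-ins for the text- and
// image-based display modes discussed in these notes; options_map.story is the script's real flag.
function display_text_posts( url_content_pairs ) { /* render [url, contents] pairs as text */ }
function display_image_posts( url_content_pairs ) { /* render [url, contents] pairs as embedded images */ }
var display_callback = options_map.story ? display_text_posts : display_image_posts; // Choose the callback up front...
// display_callback( scraped_posts ); // ...then jump to it once every /post URL has been fetched into a global array.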