// ==UserScript==
// @name Eza's Tumblr Scrape
// @namespace https://inkbunny.net/ezalias
// @description Creates a new page showing just the images from any Tumblr
// @license MIT
// @license Public domain / No rights reserved
// @include http://*?ezastumblrscrape*
// @include https://*?ezastumblrscrape*
// @include http://*/ezastumblrscrape*
// @include http://*.tumblr.com/
// @include https://*.tumblr.com/
// @include http://*.tumblr.com/page/*
// @include https://*.tumblr.com/page/*
// @include http://*.tumblr.com/tagged/*
// @include https://*.tumblr.com/tagged/*
// @include http://*.tumblr.com/search/*
// @include https://*.tumblr.com/search/*
// @include http://*.tumblr.com/post/*
// @include https://*.tumblr.com/post/*
// @include https://*.media.tumblr.com/*
// @include https://media.tumblr.com/*
// @include http://*/archive
// @include https://*/archive
// @include http://*.co.vu/*
// @exclude */photoset_iframe/*
// @exclude *imageshack.us*
// @exclude *imageshack.com*
// @exclude *//scmplayer.*
// @exclude *//wikplayer.*
// @exclude *//www.wikplayer.*
// @exclude *//www.tumblr.com/search*
// @grant GM_registerMenuCommand
// @version 5.17
// @downloadURL https://update.greasyfork.icu/scripts/4801/Eza%27s%20Tumblr%20Scrape.user.js
// @updateURL https://update.greasyfork.icu/scripts/4801/Eza%27s%20Tumblr%20Scrape.meta.js
// ==/UserScript==
// Create an imaginary page on the relevant Tumblr domain, mostly to avoid the ridiculous same-origin policy for public HTML pages. Populate page with all images from that Tumblr. Add links to this page on normal pages within the blog.
// This script also works on off-site Tumblrs, by the way - just add /archive?ezastumblrscrape?scrapewholesite after the ".com" or whatever. Sorry it's not more concise.
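// For example (hypothetical blog): http://example.tumblr.com/archive?ezastumblrscrape?scrapewholesite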
// Make it work, make it fast, make it pretty - in that order.
// TODO:
// I'll have to add filtering as some kind of text input... and could potentially do multi-tag filtering, if I can reliably identify posts and/or reliably match tag definitions to images and image sets.
// This is a good feature for doing /scrapewholesite to get text links and then paging through them with fancy dynamic presentation nonsense. Also: duplicate elision.
// I'd love to do some multi-scrape stuff, e.g. scraping both /tagged/homestuck and /tagged/art, but that requires communication between divs to avoid constant repetition.
// post-level detection would also be great because it'd let me filter out reblogs. fuck all these people with 1000-page tumblrs, shitty animated gifs in their theme, infinite scrolling, and NO FUCKING TAGS. looking at you, http://neuroticnick.tumblr.com/post/16618331343/oh-gamzee#dnr - you prick.
// Look into Tumblr Saviour to see how they handle and filter out text posts.
// Add a convenient interface for changing options? "Change browsing options" to unhide a div that lists every ?key=value pair, with text-entry boxes or radio buttons as appropriate, and a button that pushes a new URL into the address bar and re-hides the div. Would need to be separate from thumbnail toggle so long as anything false is suppressed in get_url or whatever.
// Dropdown menus? Thumbnails yes/no, Pages At Once 1-20. These change the options_map settings immediately, so next/prev links will use them. Link to Apply Changes uses same ?startpage as current.
// Could I generalize that the way I've generalized Image Glutton? E.g., grab all links from a Pixiv gallery page, show all images and all manga pages.
// Possibly @include any ?scrapeeverythingdammit to grab all links and embed all pictures found on them. single-jump recursive web mirroring. (fucking same-domain policy!)
// now that I've got key-value mapping, add a link for 'view original posts only (experimental).' er, 'hide reblogs?' difficult to accurately convey.
// make it an element of the post-scraping function. then it would also work on scrape-whole-tumblr.
// better yet: call it separately, then use the post-scraping function on each post-level chunk of HTML. i.e. call scrape_without_reblogs from scrape_whole_tumblr, split off each post into strings, and call soft_scrape_page( single_post_string ) to get all the same images.
// or would it be better to get all images from any post? doing this by-post means we aren't getting theme nonsense (mostly).
// maybe just exclude images where a link to another tumblr happens before the next image... no, text posts could screw that up.
// general post detection is about recognizing patterns. can we automate it heuristically? bear in mind it'd be done at least once per scrape-page, and possibly once per tumblr-page.
// user b84485 seems to be using the scrape-whole-site option to open image links in tabs, and so is annoyed by the 500/1280 duplicates. maybe a 'remove duplicates' button after the whole site's done?
// It's a legitimately good idea. Lord knows I prefer opening images in tabs under most circumstances.
// Basically I want a "Browse Links" page instead of just "grab everything that isn't nailed down."
// http://mekacrap.tumblr.com/post/82151443664/oh-my-looks-like-theres-some-pussy-under#dnr - lots of 'read more' stuff, for when that's implemented.
// eza's tumblr scrape: "read more" might be tumblr standard.
// e.g.
// http://c-enpai.tumblr.com/ - interesting content visible in /archive, but every page is 'themed' to be a blank front page. wtf.
// chokes on multi-thousand-page tumblrs like actual-vriska, at least when listing all pages. it's just link-heavy text. maybe skip having a div for every page and just append to one div. or skip divs and append to the raw document innerHTML. it could be a memory thing, if ajax elements are never destroyed.
// multi-thousand-page tumblrs make "find image links from all pages" choke. massive memory use, massive CPU load. ridiculous. it's just text. (alright, it's links and ajax requests, but it's doggedly linear.)
// maybe skip individual divs and append the raw pile-of-links hypertext into one div. or skip divs entirely and append it straight to the document innerHTML.
// could it be a memory leak thing? are ajax elements getting properly released and destroyed when their scope ends? kind of ridiculous either way, considering we're holding just a few kilobytes of text per page.
// try re-using the same ajax object.
/* Assorted notes from another text file
. eza's tumblr fixiv? de-style everything by simply erasing the "<style>" block? */
document.body.innerHTML += css_block; // Has to go in all at once or the browser "helpfully" closes the style tag upon evaluation
var mydiv = document.getElementById( "maindiv" ); // I apologize for the generic names. This script used to be a lot simpler.
// Identify options in URL (in the form of ?key=value pairs)
var key_value_array = window.location.href.split( '?' ); // Knowing how to do it the hard way is less impressive than knowing how not to do it the hard way.
key_value_array.shift(); // The first element will be the site URL. Durrrr.
for( dollarsign of key_value_array ) { // forEach( key_value_array ), including clumsy homage to $_
var this_pair = dollarsign.split( '=' ); // Split key=value into [key,value] (or sometimes just [key])
if( this_pair.length < 2 ) { this_pair.push( true ); } // If there's no value for this key, make its value boolean True
if( this_pair[1] == "false" ) { this_pair[1] = false; } // If the value is the string "false" then make it False - note fun with 1-ordinal "length" and 0-ordinal array[element].
else if( !isNaN( parseInt( this_pair[1] ) ) ) { this_pair[1] = parseInt( this_pair[1] ); } // If the value string looks like a number, make it a number
options_map[ this_pair[0] ] = this_pair[1]; // options_map.key = value
}
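// Illustration (hypothetical URL): .../archive?ezastumblrscrape?scrapewholesite?find=/tagged/art?lastpage=250
// splits into ["ezastumblrscrape", "scrapewholesite", "find=/tagged/art", "lastpage=250"], and the loop above yields
// options_map = { ezastumblrscrape: true, scrapewholesite: true, find: "/tagged/art", lastpage: 250 }.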
if( options_map.find[ options_map.find.length - 1 ] == "/" ) { options_map.find = options_map.find.substring( 0, options_map.find.length - 1 ); } // Prevents .com//page/2
// if( options_map.find.indexOf( '/chrono' ) > 0 ) { options_map.chrono = true; } else { options_map.chrono = false; } // False case, to avoid unexpected persistence? Hm.
// Convert old URL options to new key-value pairs
if( options_map[ "scrapewholesite" ] ) { options_map.scrapemode = "scrapewholesite"; options_map.scrapewholesite = false; }
if( options_map[ "everypost" ] ) { options_map.scrapemode = "everypost"; options_map.everypost = false; }
if( options_map[ "thumbnails" ] == true ) { options_map.thumbnails = "fixed-width"; } // Replace the original valueless key with the default value
if( options_map[ "notraw" ] ) { options_map.maxres = "1280"; options_map.notraw = false; }
if( options_map[ "usesmall" ] ) { options_map.maxres = "400"; options_map.usesmall = false; }
document.body.className = options_map.thumbnails; // E.g. fixed-width, fixed-height, as matches the CSS. Persistent thumbnail options. Failsafe = original size.
// Oh yeah, we have to do this -after- options_map.find is defined:
site_and_tags = window.location.protocol + "//" + window.location.hostname + options_map.find; // e.g. http: + // + example.tumblr.com + /tagged/sherlock
// Grab an example page so that duplicate-removal hides whatever junk is on every single page
// This remains buggy due to asynchronicity. It's a race condition where the results are, at worst, mildly annoying.
// Previous notes mention jeffmacanolinsfw.tumblr.com for some reason.
// Can just .then this function?
// Can I create cookies? That'd work fine. On load, grab cookies for this site and for this script, use /page/1 crap.
if( options_map.startpage != 1 && options_map.scrapemode != "scrapewholesite" ) // Not on first page, so on-every-page stuff appears somewhere
{ exclude_content_example( site_and_tags + '/page/1' ); } // Since this doesn't happen on page 1, let's use page 1. Low pages are faster somehow.
// Add tags to title, for archival and identification purposes
document.title += options_map.find.split('/').join(' '); // E.g. /tagged/example/chrono -> "tagged example chrono"
// In Chrome, /archive pages monkey-patch and overwrite Promise.all and Promise.resolve.
// Clunky solution to clunky problem: grab the default property from a fresh iframe.
// Big thanks to inu-no-policeman for the iframe-based solution. Prototypes were not helpful.
var iframe = document.createElement( 'iframe' );
document.body.appendChild( iframe );
window['Promise'] = iframe.contentWindow['Promise'];
document.body.removeChild( iframe );
mydiv.innerHTML = "Not all images are guaranteed to appear. "; // Thanks to JS's wacky accomodating nature, mydiv is global despite appearing in an if-else block.
// Go to image browser or link scraper according to URL options.
switch( options_map.scrapemode ) {
case "scrapewholesite": scrape_whole_tumblr(); break;
case "xml" : scrape_sitemap(); break;
case "everypost": setTimeout( new_embedded_display, 500 ); break; // Slight delay increases odds of exclude_content_example actually fucking working
case "www": scrape_www_tagged(); break; // www.tumblr.com/tagged, and eventually your dashboard, maybe.
default: scrape_tumblr_pages(); // Sensible delays do not work on the original image browser. Shrug.
}
} else { // If it's just a normal Tumblr page, add a link to the appropriate /ezastumblrscrape URL
// Add link(s) to the standard "+Follow / Dashboard" nonsense. Before +Follow, I think - to avoid messing with users' muscle memory.
// This is currently beyond my ability to dick with JS through a script in a plugin. Let's kludge it for immediate usability.
// kludge by Ivan - http://userscripts-mirror.org/scripts/review/65725.html
// Preserve /tagged/tag/chrono, etc. Also preserve http: vs https: via "location.protocol".
var find = window.location.pathname;
if( find.indexOf( "/page/chrono" ) <= 0 ) { // Basically checking for posts /tagged/page, thanks to Detective-Pony. Don't even ask.
if( find.lastIndexOf( "/page/" ) >= 0 ) { find = find.substring( 0, find.lastIndexOf( "/page/" ) ); } // Don't include e.g. /page/2. We'll add that ourselves.
if( find.lastIndexOf( "/post/" ) >= 0 ) { find = find.substring( 0, find.lastIndexOf( "/post" ) ); }
if( find.lastIndexOf( "/archive" ) >= 0 ) { find = find.substring( 0, find.lastIndexOf( "/archive" ) ); }
// On individual posts (and the /archive page), the link should scrape the whole site.
}
var url = window.location.protocol + "//" + window.location.hostname + standard_landing_page + "?ezastumblrscrape?scrapewholesite?find=" + find;
if( window.location.host == "www.tumblr.com" ) { url += "?scrapemode=www?thumbnails=fixed-width"; url = url.replace( "?scrapewholesite", "" ); }
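// Example of the resulting link (assuming standard_landing_page is "/archive", per the usage note up top):
// on http://example.tumblr.com/tagged/art/page/3, "find" becomes "/tagged/art", so the button points at
// http://example.tumblr.com/archive?ezastumblrscrape?scrapewholesite?find=/tagged/art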
// "Don't clean this up. It's not permanent."
// Fuck it, it works and it's fragile. Just boost its z-index so it stops getting covered.
var scrape_button = document.createElement("a");
scrape_button.setAttribute( "style", "position: absolute; top: 26px; right: 1px; padding: 2px 0 0; width: 50px; height: 18px; display: block; overflow: hidden; -moz-border-radius: 3px; background: #777; color: #fff; font-size: 8pt; text-decoration: none; font-weight: bold; text-align: center; line-height: 12pt; z-index: 1000; " );
scrape_button.setAttribute("href", url);
scrape_button.innerHTML = "Scrape";
var body_ref = document.getElementsByTagName("body")[0];
body_ref.appendChild(scrape_button);
// Pages where the button gets split (i.e. clicking top half only redirects tiny corner iframe) are probably loading this script separately in the iframe.
// Which means you'd need to redirect the window instead of just linking. Bluh.
// Greasemonkey supports user commands through its add-on menu! Thus: no more manually typing /archive?ezastumblrscrape?scrapewholesite on uncooperative blogs.
GM_registerMenuCommand( "Scrape whole Tumblr blog", go_to_scrapewholesite );
// If a page is missing (post deleted, blog deleted, name changed) a reblog can often be found based on the URL
// Ugh, these should only appear on /post/ pages.
// Naming these commands is hard. Both look for reblogs, but one is for if the post you're on is already a reblog.
// Maybe "because this Tumblr changed names / moved / is missing" versus "because this reblog got deleted?" Newbs might not know either way.
// "As though" this blog is missing / this post is missing?
// "Search for this Tumblr under a different name" versus "search for other blogs that've reblogged this?"
GM_registerMenuCommand( "Google for reblogs of this original Tumblr post", google_for_reblogs );
GM_registerMenuCommand( "Google for other instances of this Tumblr reblog", google_for_reblogs_other ); // two hard problems
// if( window.location.href.indexOf( '?browse' ) > -1 ) { browse_this_page(); } // Experimental single-page scrape mode - DEFINITELY not guaranteed to stay
}
function go_to_scrapewholesite() {
// let redirect = window.location.protocol + "//" + window.location.hostname + "/archive?ezastumblrscrape?scrapewholesite?find=" + window.location.pathname;
let redirect = window.location.protocol + "//" + window.location.hostname + standard_landing_page
+ "?ezastumblrscrape?scrapewholesite?find=" + window.location.pathname;
window.location.href = redirect;
}
function google_for_reblogs() {
let blog_name = window.location.href.split('/')[2].split('.')[0]; // e.g. http//example.tumblr.com -> example
let content = window.location.href.split('/').pop(); // e.g. http//example.tumblr.com/post/12345/hey-i-drew-this -> hey-i-drew-this
let redirect = "https://google.com/search?q=tumblr " + blog_name + " " + content;
window.location.href = redirect;
}
function google_for_reblogs_other() {
let content = window.location.href.split('/').pop().split('-'); // e.g. http//example.tumblr.com/post/12345/hey-i-drew-this -> hey,i,drew,this
let blog_name = content.shift(); // e.g. examplename-hey-i-drew-this -> examplename
content = content.join('-');
let redirect = "https://google.com/search?q=tumblr " + blog_name + " " + content;
window.location.href = redirect;
}
// ------------------------------------ Whole-site scraper for use with DownThemAll ------------------------------------ //
// Monolithic scrape-whole-site function, recreating the original intent (before I added pages and made it a glorified multipage image browser)
function scrape_whole_tumblr() {
// console.log( page_dupe_hash );
var highest_known_page = 0;
options_map.startpage = 1; // Reset to default, because other values do goofy things to the image-browsing links below
// Link to image-viewing version, preserving current tags
mydiv.innerHTML += "
";
// Find out how many pages we need to scrape.
if( isNaN( options_map.lastpage ) ) {
// Find upper bound in a small number of fetches. Ideally we'd skip this - some themes list e.g. "Page 1 of 24." I think that requires back-end cooperation.
mydiv.innerHTML += "Finding out how many pages are in " + site_and_tags.substring( site_and_tags.indexOf( '/' ) + 2 ) + ":
";
// Returns page number if there's no Next link, or negative page number if there is a Next link.
// Only for use on /mobile pages; relies on Tumblr's shitty standard theme
function test_next_page( body ) {
var link_index = body.indexOf( 'rel="canonical"' ); //
var page_index = body.indexOf( '/page/', link_index );
var terminator_index = body.indexOf( '"', page_index );
var this_page = parseInt( body.substring( page_index+6, terminator_index ) );
if( body.indexOf( '>next<' ) > 0 ) { return -this_page; } else { return this_page }
}
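// Illustration: a /mobile body whose canonical link reads href="http://example.tumblr.com/page/7" returns 7,
// or -7 if that same body also contains a '>next<' link (i.e. page 7 exists and more pages follow it).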
// Generates an array of length "steps" between given boundaries - or near enough, for sanity's sake
function array_between_bounds( lower_bound, upper_bound, steps ) {
if( lower_bound > upper_bound ) { // Swap if out-of-order.
var temp = lower_bound; lower_bound = upper_bound, upper_bound = temp;
}
var bound_range = upper_bound - lower_bound;
if( steps > bound_range ) { steps = bound_range; } // Steps <= bound_range, but steps > 1 to avoid division by zero:
var pages_per_test = parseInt( bound_range / steps ); // Steps-1 here, so first element is lower_bound & last is upper_bound. Off-by-one errors, whee...
var range = Array( steps )
.fill( lower_bound )
.map( (value,index) => value += index * pages_per_test );
range.push( upper_bound );
return range;
}
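// Worked example: array_between_bounds( 2, 100, 5 ) -> bound_range = 98, pages_per_test = 19,
// so the result is [2, 21, 40, 59, 78] plus the pushed upper bound: [2, 21, 40, 59, 78, 100].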
// DEBUG
// site_and_tags = 'https://www.tumblr.com/safe-mode?url=http://shittyhorsey.tumblr.com';
// Given a (presumably sorted) list of page numbers, find the last that exists and the first that doesn't exist.
function find_reasonable_bound( test_array ) {
return Promise.all( test_array.map( pagenum => fetch( site_and_tags + '/page/' + pagenum + '/mobile', { credentials: 'include' } ) ) )
.then( responses => Promise.all( responses.map( response => response.text() ) ) )
.then( pages => pages.map( page => test_next_page( page ) ) )
.then( numbers => {
var lower_index = -1;
numbers.forEach( (value,index) => { if( value < 0 ) { lower_index++; } } ); // Count the negative numbers (i.e., count the pages with known content)
if( lower_index < 0 ) { lower_index = 0; }
var bounds = [ Math.abs(numbers[lower_index]), numbers[lower_index+1] ]
mydiv.innerHTML += "Last page is between " + bounds[0] + " and " + bounds[1] + ". ";
return bounds;
} )
}
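// Illustrative trace (hypothetical 347-page blog): the first pass below tests [2, 10, 100, 1000, 10000, 100000];
// pages 2, 10, and 100 come back negative (they have Next links) and the rest come back positive (no Next link),
// so lower_index = 2 and the reported bounds are [100, 1000]. Each later pass re-divides that span until it's small enough.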
// Repeatedly narrow down how many pages we're talking about; find a reasonable "last" page
find_reasonable_bound( [2, 10, 100, 1000, 10000, 100000] ) // Are we talking a couple pages, or a shitload of pages?
.then( pair => find_reasonable_bound( array_between_bounds( pair[0], pair[1], 10 ) ) ) // Narrow it down. Fewer rounds of more fetches works best.
.then( pair => find_reasonable_bound( array_between_bounds( pair[0], pair[1], 10 ) ) ) // Time is round count, fetches add up, selectivity is fetches x fetches.
// Quit fine-tuning numbers and just conditional in some more testing for wide ranges.
.then( pair => { if( pair[1] - pair[0] > 50 ) { return find_reasonable_bound( array_between_bounds( pair[0], pair[1], 10 ) ) } else { return pair; } } )
.then( pair => { if( pair[1] - pair[0] > 50 ) { return find_reasonable_bound( array_between_bounds( pair[0], pair[1], 10 ) ) } else { return pair; } } )
.then( pair => {
options_map.lastpage = pair[1];
document.getElementById( 'browse10' ).href += "?lastpage=" + options_map.lastpage; // Add last-page indicator to Browse Images link
document.getElementById( 'browse5' ).href += "?lastpage=" + options_map.lastpage; // ... and the 5-pages-at-once link.
document.getElementById( 'browse1' ).href += "?lastpage=" + options_map.lastpage; // ... and the 1-page-at-once link.
document.getElementById( 'exp10' ).href += "?lastpage=" + options_map.lastpage; // ... and this fetch-every-post link.
document.getElementById( 'exp5' ).href += "?lastpage=" + options_map.lastpage; // ... and this fetch-every-post link.
document.getElementById( 'exp1' ).href += "?lastpage=" + options_map.lastpage; // ... and this fetch-every-post link.
document.getElementById( 'xml_all' ).href += "?lastpage=" + options_map.lastpage; // ... and this XML post-by-post link, why not.
start_scraping_button();
} );
}
else { // If we're given the highest page by the URL, just use that
start_scraping_button();
}
// Add "Scrape" button to the page. This will grab images and links from many pages and list them page-by-page.
function start_scraping_button() {
if( options_map.grabrange ) { // If we're only grabbing a 1000-page block from a huge-ass tumblr:
mydiv.innerHTML += " This will grab 1000 pages starting at " + options_map.grabrange + ".
";
} else { // If we really are describing the last page:
mydiv.innerHTML += " Last page is " + options_map.lastpage + " or lower. ";
mydiv.innerHTML += "Find page count again?
";
}
if( options_map.lastpage > 1500 && !options_map.grabrange ) { // If we need to link to 1000-page blocks, and aren't currently inside one:
for( let x = 1; x < options_map.lastpage; x += 1000 ) { // For every 1000 pages...
let decade_url = window.location.href + "?grabrange=" + x + "?lastpage=" + options_map.lastpage;
mydiv.innerHTML += "Pages " + x + "-" + (x+999) + " "; // ... link a range of 1000 pages.
}
}
// Add button to scrape every page, one after another.
// Buttons within GreaseMonkey are a huge pain in the ass. I stole this from stackoverflow.com/questions/6480082/ - thanks, Brock Adams.
var button = document.createElement ('div');
button.innerHTML = '<button id="myButton" type="button">Find image links from all pages</button>';
button.setAttribute ( 'id', 'scrape_button' ); // I'm really not sure why this id and the above HTML id aren't the same property.
document.body.appendChild ( button ); // Add button (at the end is fine)
document.getElementById ("myButton").addEventListener ( "click", scrape_all_pages, false ); // Activate button - when clicked, it triggers scrape_all_pages()
if( options_map.autostart ) { document.getElementById ("myButton").click(); } // Getting tired of clicking on every reload - debug-ish
if( options_map.lastpage <= 26 ) { document.getElementById ("myButton").click(); } // Automatic fetch (original behavior!) for a single round
}
}
function scrape_all_pages() { // Example code implies that this function /can/ take a parameter via the event listener, but I'm not sure how.
var button = document.getElementById( "scrape_button" ); // First, remove the button. There's no reason it should be clickable twice.
button.parentNode.removeChild( button ); // The DOM can only remove elements from a higher level. "Elements can't commit suicide, but infanticide is permitted."
mydiv.innerHTML += "Scraping page:
"; // This makes it easier to view progress,
// Create divs for all pages' content, allowing asynchronous AJAX fetches
var x = 1;
var div_end_page = options_map.lastpage;
if( !isNaN( options_map.grabrange ) ) { // If grabbing 1000 pages from the middle of 10,000, don't create 0..10,000 divs
x = options_map.grabrange;
div_end_page = x + 1000; // Should be +999, but whatever, no harm in tiny overshoot
}
for( ; x <= div_end_page; x++ ) {
var siteurl = site_and_tags + "/page/" + x;
if( options_map.usemobile ) { siteurl += "/mobile"; } // If ?usemobile is flagged, scrape the mobile version.
if( x == 1 && options_map.usemobile ) { siteurl = site_and_tags + "/mobile"; } // Hacky fix for redirect from example.tumblr.com/page/1/anything -> example.tumblr.com
var new_div = document.createElement( 'div' );
new_div.id = '' + x;
document.body.appendChild( new_div );
}
// Fetch all pages with content on them
var page_counter_div = document.getElementById( 'pagecounter' ); // Probably minor, but over thousands of laggy page updates, I'll take any optimization.
pagecounter.innerHTML = "" + 1;
var begin_page = 1;
var end_page = options_map.lastpage;
if( !isNaN( options_map.grabrange ) ) { // If a range is defined, grab only 1000 pages starting there
begin_page = options_map.grabrange;
end_page = options_map.grabrange + 999; // NOT plus 1000. Stop making that mistake. First page + 999 = 1000 total.
if( end_page > options_map.lastpage ) { end_page = options_map.lastpage; } // Kludge
document.title += " " + (parseInt( begin_page / 1000 ) + 1); // Change page title to indicate which block of pages we're saving
}
// Generate array of URL/pagenum pair-arrays
url_index_array = new Array;
for( var x = begin_page; x <= end_page; x++ ) {
var siteurl = site_and_tags + "/page/" + x;
if( options_map.usemobile ) { siteurl += "/mobile"; } // If ?usemobile is flagged, scrape the mobile version. No theme shenanigans... but also no photosets. Sigh.
if( x == 1 && options_map.usemobile ) { siteurl = site_and_tags + "/mobile"; } // Hacky fix for redirect from example.tumblr.com/page/1/anything -> example.tumblr.com
url_index_array.push( [siteurl, x] );
}
// Fetch, scrape, and display all URLs. Uses promises to work in parallel and promise.all to limit speed and memory (mostly for reliability's sake).
// Consider privileging first page with single-element fetch, to increase apparent responsiveness. Doherty threshold for frustration is 400ms.
var simultaneous_fetches = 25;
var chain = Promise.resolve(0); // Empty promise so we can use "then"
var order_array = [1]; // We want to show the first page immediately, and this is a callback rat's-nest, so let's make an array of how many pages to take each round
for( var x = 1; x < url_index_array.length; x += simultaneous_fetches ) { // E.g. [1, simultaneous_fetches, s_f, s_f, s_f, whatever's left]
if( url_index_array.length - x > simultaneous_fetches ) { order_array.push( simultaneous_fetches ); } else { order_array.push( url_index_array.length - x ); }
}
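// Illustration: with 103 pages queued and simultaneous_fetches = 25, order_array ends up [1, 25, 25, 25, 25, 2] -
// the first page alone for quick feedback, then batches of 25, then whatever's left over.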
order_array.forEach( (how_many) => {
chain = chain.then( s => {
var subarray = url_index_array.splice( 0, how_many ); // Shift a reasonable number of elements into separate array, for partial array.map
return Promise.all( subarray.map( page =>
Promise.all( [ fetch( page[0], { credentials: 'include' } ).then( s => s.text() ), page[1], page[0] ] ) // Return [ body of page, page number, page URL ]
) )
} )
.then( responses => responses.map( s => { // Scrape URLs for links and images, display on page
var pagenum = s[1];
var page_url = s[2];
var url_array = soft_scrape_page_promise( s[0] ) // Surprise, this is a promise now
.then( urls => {
// Sort #link URLs to appear first, because we don't do that in soft-scrape anymore
urls.sort( (a,b) => -a.indexOf( "#link" ) ); // Strings containing "#link" go before others - return +1 if not found in 'a.' Should be stable.
// Print URLs so DownThemAll (or similar) can grab them
var bulk_string = " " + page_url + " "; // A digest, so we can update innerHTML just once per div
// DEBUG-ish - on theory that 1000-page-tall scraping/rendering fucks my VRAM
if( options_map.smalltext ) { bulk_string = " " + bulk_string; } // If ?smalltext flag is set, render text unusably small, for esoteric reasons
urls.forEach( (value,index,array) => {
if( options_map.plaintext ) {
bulk_string += value + ' ';
} else {
bulk_string += '<a href="' + value + '">' + value + '</a> ';
}
} )
document.getElementById( '' + pagenum ).innerHTML = bulk_string;
if( parseInt( pagecounter.innerHTML ) < pagenum ) { pagecounter.innerHTML = "" + pagenum; } // Increment pagecounter (where sensible)
} );
} )
)
} )
chain = chain.then( s => {
document.getElementById( 'afterpagecounter' ).innerHTML = "Done. Use DownThemAll (or a similar plugin) to grab all these links.";
// Divulge contents of page_dupe_hash to check for common tags
// Ugh, I'm going to have to turn this from an associative array into an array-of-arrays if I want to sort it.
let tag_overview = " " + "Tag overview: " + " ";
let score_tag_list = new Array; // This will hold an array of arrays so we can sort this associative array by its values. Wheee.
for( let url in page_dupe_hash ) {
if( url.indexOf( '/tagged/' ) > 0 // If it's a tag URL...
&& page_dupe_hash[ url ] > 1 // and non-unique...
&& url.indexOf( '/page/' ) < 0 // and not a page link...
&& url.indexOf( '?og' ) < 0 // and not an opengraph link...
&& url.indexOf( '?ezas' ) < 0 // and not this script, wtf...
) { // So if it's a TAG, in other words...
score_tag_list.push( [ page_dupe_hash[ url ], url ] ); // ... store [ number of times seen, tag URL ] for sorting.
}
}
score_tag_list.sort( (a,b) => a[0] - b[0] ); // Ascending order, now - most common tags at the very bottom, for easier access
score_tag_list.map( pair => {
pair[1] = pair[1].replace( '/chrono', '' ); // Remove /chrono from sites that append it automatically, since it breaks the 'N pages' autostart links. (/chrono/ might fail.)
var this_tag = pair[1].split('/').pop(); // e.g. example.tumblr.com/tagged/my-art -> my-art
//if( this_tag == '' ) { let this_tag = pair[1].split('/').pop(); } // Trailing slash screws this up, so get the second-to-last thing instead
var scrape_link = options_url( {find: '/tagged/'+this_tag, lastpage: false, grabrange: false, autostart: true} ); // Direct link to ?scrapewholesite for e.g. /tagged/my-art
tag_overview += " " + pair[0] + " posts:\t" + "" + pair[1] + "";
} )
document.body.innerHTML += tag_overview;
} )
}
// ------------------------------------ Multi-page scraper with embedded images ------------------------------------ //
function scrape_tumblr_pages() {
// Grab an empty page so that duplicate-removal hides whatever junk is on every single page
// This is DEBUG-ish. It might be slow, barring caching. It might not work due to asynchrony. It could block actual content thanks to 'my best posts' sidebars.
if( isNaN( parseInt( options_map.startpage ) ) || options_map.startpage <= 1 ) { options_map.startpage = 1; } // Sanity check
mydiv.innerHTML += " " + html_previous_next_navigation() + " ";
document.getElementById("bottom_controls_div").innerHTML += " " + html_page_count_navigation() + " " + html_previous_next_navigation();
mydiv.innerHTML += " " + html_ezastumblrscrape_options() + " ";
mydiv.innerHTML += " " +image_size_options();
mydiv.innerHTML += "
" + image_resolution_options();
// Fill an array with the page URLs to be scraped (and create per-page divs while we're at it)
var pages = new Array( parseInt( options_map.pagesatonce ) )
.fill( parseInt( options_map.startpage ) )
.map( (value,index) => value+index );
pages.forEach( pagenum => {
mydiv.innerHTML += "
Page " + pagenum + "
";
} )
pages.map( pagenum => {
var siteurl = site_and_tags + "/page/" + pagenum; // example.tumblr.com/page/startpage, startpage+1, startpage+2, etc.
if( options_map.usemobile ) { siteurl += "/mobile"; } // If ?usemobile is flagged, scrape mobile version. No theme shenanigans... but also no photosets. Sigh.
if( pagenum == 1 && options_map.usemobile ) { siteurl = site_and_tags + "/mobile"; } // Hacky fix for redirect from example.tumblr.com/page/1/anything -> example.tumblr.com
fetch( siteurl, { credentials: 'include' } ).then( response => response.text() ).then( text => {
document.getElementById( pagenum ).innerHTML += "fetched " // Immediately indicate the fetch happened.
+ "" + siteurl + " "; // Link to page. Useful for viewing things in-situ... and debugging.
// For some asinine reason, 'return url_array' causes 'Permission denied to access property "then".' So fake it with ugly nesting.
soft_scrape_page_promise( text )
.then( url_array => {
var div_digest = ""; // Instead of updating each div's HTML for every image, we'll lump it into one string and update the page once per div.
var video_array = new Array;
var outlink_array = new Array;
var inlink_array = new Array;
url_array.forEach( (value,index,array) => { // Shift videos and links to separate arrays, blank out those URLs in url_array
if( value.indexOf( '#video' ) > 0 ) { video_array.push( value ); array[index] = '' }
if( value.indexOf( '#offsite' ) > 0 ) { outlink_array.push( value ); array[index] = '' }
if( value.indexOf( '#local' ) > 0 ) { inlink_array.push( value ); array[index] = '' }
} );
url_array = url_array.filter( url => url === "" ? false : true ); // Remove empty elements from url_array
// Display video links, if there are any
video_array.forEach( value => {div_digest += "Video: " + value + " "; } )
// Display page links if the ?showlinks flag is enabled
if( options_map.showlinks ) {
div_digest += "Outgoing links: ";
outlink_array.forEach( (value,index) => { div_digest += "O" + (index+1) + " " } );
div_digest += " " + "Same-Tumblr links: ";
inlink_array.forEach( (value,index) => { div_digest += "T" + (index+1) + " " } );
div_digest += " ";
}
// Embed high-res images to be seen, clicked, and saved
url_array.forEach( image_url => {
// Embed images (linked to themselves) and link to photosets
if( image_url.indexOf( "#photoset#" ) > 0 ) { // Before the first image in a photoset, print the photoset link.
var photoset_url = image_url.split( "#" ).pop();
// URL is like tumblr.com/image#photoset#http://tumblr.com/photoset_iframe - separate past the last hash.
div_digest += " Set: <a href='" + photoset_url + "'>" + photoset_url + "</a> ";
}
div_digest += "" + " ";
// div_digest += "" + " ";
} )
div_digest += " (End of " + siteurl + ")"; // Another link to the page, because I'm tired of scrolling back up.
document.getElementById( pagenum ).innerHTML += div_digest;
} ) // End of 'then( url_array => { } )'
} ) // End of 'then( text => { } )'
} ) // End of 'pages.map( pagenum => { } )'
}
// ------------------------------------ Whole-site scraper based on post-by-post method ------------------------------------ //
// The use of sitemap.xml could be transformative to this script, or even split off into another script entirely. It finally allows a standard theme!
// Just do it. Entries look like:
// http://leopha.tumblr.com/post/57593339717/hello-a-friend-found-my-blog-somehow-so-url followed by a timestamp like 2013-08-07T06:42:58Z
// So split on <loc> and terminate at </loc>. Or see if JS has native XML parsing, for honest object-orientation instead of banging at text files.
// Grab normally for now, I guess, just to get it implemented. Make ?usemobile work. Worry about /embed stuff later.
// Check if dashboard-only blogs have xml files exposed.
// While we're at it, check that the very first for sitemap2 doesn't point to a /page/n URL. Not useful, just interesting.
// Simplicate for now. Have a link to grab an individual XML file, get everything inside it, then add images (and tags?) from each page.
// For an automatic 'get the whole damn site' mode, maybe just call each XML file sequentially. Not a for() loop - have each page maybe call the next page when finished.
// Oh right, these XML files are like 500 posts each. Not much different from grabbing 50 pages at a time - and we currently grab 25 at once - but significant.
// If I have to pass the /post URL list to a serial function in order to rate-limit this, I might as well grab all XML files and fill the list -once.-
// Count down? So e.g. scrape(x) calls scrape(x-1). It's up to the root function to call the high value initially.
// Single-sitemap pages (e.g. scraping and displaying sitemap2) will need prev/next controls (e.g. to sitemap1 & sitemap3).
// These will be chronological. That's fine, just mention it somewhere.
// I'm almost loathe to publicize this. It's not faster or more reliable than the monolithic scraper. It's not a better image-browsing method, in my opinion.
// It's only really useful for people who think a blank theme is the same as erasing their blog... and the less those jerks know, the better. Don't delete art.
// I guess it'll be better when I list tags with each post, but at present they're strongly filtered out.
// Wait, wtf? I ripped sometipsygnostalgic as a test case (98 sitemaps, ~50K posts) and the last page scraped is from years ago.
// None of the sitemaps past sitemap51 exist. So the first 51 are present - the rest 404. (To her theme, not as a generic tumblr error with animated background.)
// It's not a URL-generation error; the same thing happens copy-pasting the XML URLs straight from sitemap.xml.
// Just... fuck. Throw a warning for now, see if there's anything to be done later.
// If /mobile versions of /post URLs unexpectedly had proper [video] references... do they contain previous/next post info? Doesn't look like it. Nuts.
// In addition to tags, this should maybe list sources and 'via' links. It's partially archival, after all.
// Sort of have to conditionally unfilter /tagged links from both duplicate-remover and whichever function removes tumblr guff. 'if !xml remove tagged.' bluh.
// Better idea: send the list to a 'return only tags' filter first, non-destructively. Then filter them out.
// I guess this is what I'd modify to get multi-tag searches - like homestuck+hs+hamsteak. Scrape by /tagged and /page, populate post_urls, de-dupe and sort.
// Of course then I'd want it to 'just work' with the existing post-by-post image browser, which... probably won't happen. I'm loathe to recreate it exactly or inexactly.
// This needs some method to find /post numbers from /page and /tagged pages.
// Maybe trigger a later post-by-post function from a modified scrapewholesite approach? Code quality is a non-issue right now.
// Basically have to fill some array post_urls with /post URLs (strings) and then call scrape_post_urls(). Oh, that array is global. This was already a janky first pass.
// Instead of XML files, use /page or /page + /mobile... pages.
// Like 'if /tagged/something then grab all links and filter for indexof /post.'
// This is the landing function for this mode - it grabs sitemap.xml, parses options, sets any necessary variables, and invokes scrape_sitemap_x for sitemap1 and so on.
function scrape_sitemap() {
document.title += ' sitemap';
// Flow control.
// Two hard problems. Structure now, names later.
if( options_map.sitemap ) {
mydiv.innerHTML += " Completion:
0%
"; // This makes it easier to view progress,
let final_sitemap = 0; // Default: no recursion. 'Else considered harmful.'
if( options_map.xmlcount ) { final_sitemap = options_map.xmlcount; } // If we're doing the whole site, indicate the highest sitemap and recurse until you get there.
scrape_sitemap_x( options_map.sitemap, final_sitemap );
} else {
// Could create a div for this enumeration of sitemapX files, and always have it at the top. It'd get silly for tumblrs with like a hundred of them.
// Maybe at the bottom?
fetch( window.location.protocol + "//" + window.location.hostname + '/sitemap.xml', { credentials: 'include' } ) // Grab text-like file
.then( r => r.text() )
.then( t => { // Process the text of sitemap.xml
let sitemap_list = t.split( '<loc>' );
sitemap_list.shift(); // Get rid of data before first location
sitemap_list = sitemap_list.map( e => e.split('</loc>')[0] ); // Terminate each entry
sitemap_list = sitemap_list.filter( e => { return e.indexOf( '/sitemap' ) > 0 && e.indexOf( '.xml' ) > 0; } ); // Remove everything but 'sitemapX.xml' links
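// Illustration: if sitemap.xml contains <loc>http://example.tumblr.com/sitemap1.xml</loc><loc>http://example.tumblr.com/sitemap2.xml</loc>,
// sitemap_list ends up as [ 'http://example.tumblr.com/sitemap1.xml', 'http://example.tumblr.com/sitemap2.xml' ].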
mydiv.innerHTML += ' ' + window.location.hostname + ' has ' + sitemap_list.length + ' sitemap XML file';
if( sitemap_list.length > 1 ) { mydiv.innerHTML += 's'; } // Pluralization! Not sexy, just functional.
// mydiv.innerHTML += ' (Up to ' + sitemap_list.length * 500 + ' posts.) '
mydiv.innerHTML += '. (' + Math.ceil(sitemap_list.length / 2) + ',000 posts or fewer.) '; // Kludge math, but I prefer this presentation.
if( sitemap_list.length > 50 ) { mydiv.innerHTML += 'Sitemaps past 50 may not work. '; }
// List everything:
// List sitemap1: or
// List sitemap2: or etc.
mydiv.innerHTML += 'List everything: ' + 'Links only - Links and text (stories) ';
if( options_map.find && options_map.lastpage ) {
mydiv.innerHTML += 'List just this tag (' + options_map.find + '): ' + 'Links only - Links and text (stories) ';
}
for( let n = 1; n <= sitemap_list.length; n++ ) {
let text_link = options_url( { sitemap:n } );
let images_link = options_url( { sitemap:n, thumbnails:'xml' } );
let story_link = options_url( { sitemap:n, story:true, usemobile:true } );
mydiv.innerHTML += 'List sitemap' + n + ': ' + '<a href="' + text_link + '">Links only</a> - <a href="' + images_link + '">Links & thumbnails</a> - <a href="' + story_link + '">Links & text</a> <br>';
}
} )
}
}
// Text-based scrape mode for sitemap1.xml, sitemap2.xml, etc.
function scrape_sitemap_x( sitemap_x, final_sitemap ) {
document.title = window.location.hostname + ' - sitemap ' + sitemap_x; // We lose the original tumblr's title, but whatever.
if( sitemap_x == final_sitemap ) { document.title = window.location.hostname + ' - sitemap complete'; }
if( options_map.story ) { document.title += ' with stories'; }
// Last-minute kludge: if options_map.tagscrape, use /page instead of any xml, get /post URLs matching this domain.
// Only whole-hog for now - sitemap_x might allow like ten or a hundred pages at once, later.
var base_site = window.location.protocol + "//" + window.location.hostname; // e.g. http://example.tumblr.com
var sitemap_url = base_site + '/sitemap' + sitemap_x + '.xml';
if( options_map.tagscrape ) {
sitemap_url = base_site + options_map.find + '/page/' + sitemap_x;
document.title += ' ' + options_map.find.replace( /\//g, ' '); // E.g. 'tagged my-stuff'.
}
mydiv.innerHTML += "Finding posts from " + sitemap_url + ". ";
fetch( sitemap_url, { credentials: 'include' } ) // Grab text-like file
.then( r => r.text() ) // Get txt from HTTP response
.then( t => { // Process text to extract links
let location_list = t.split( '<loc>' ); // Break at each location declaration
location_list.shift(); location_list.shift(); // Ditch first two elements. First is preamble, second is base URL (e.g. example.tumblr.com).
location_list = location_list.map( e => e.split( '</loc>' )[0] ); // Terminate each entry at location close-tag
if( options_map.tagscrape ) {
// Fill location_list with URLs on this page, matching this domain, containing e.g. /post/12345.
// Trailing slash not guaranteed. Quote terminator unknown. Fuck it, just assume it's all doublequotes.
location_list = t.split( 'href="' );
location_list.shift();
location_list = location_list.map( e => e.split( '"' )[0] ); // Terminate each entry at location close-tag
location_list = location_list.filter( u => u.indexOf( window.location.hostname + '/post' ) > -1 );
location_list = location_list.filter( u => u.indexOf( '/embed' ) < 0 );
location_list = location_list.filter( u => u.indexOf( '#notes' ) < 0 ); // I fucking hate this website.
// https://www.facebook.com/sharer/sharer.php?u=https://shamserg.tumblr.com/post/55985088311/knight-of-vengence-by-shamserg
// I fucking hate the web.
location_list = location_list.map( e => window.location.protocol + '//' + window.location.hostname + e.split( window.location.hostname )[1] );
// location_list = location_list.filter( u => u.indexOf( '?ezastumblrscrape' ) <= 0 ); // Exclude links to this script. (Nope, apparently that's introduced elsewhere. WTF.)
// Christ, https versus http bullshit AGAIN. Just get it done.
// location_list = location_list.map( e => window.location.protocol + '//' + e.split( '//' )[1] ); // http -> https, or https -> http, as needed.
// I don't think I ever remove duplicates. On pages with lots of silly "sharing" bullshit, this might grab pages multiple times. Not sure I care. Extreme go horse.
}
// location_list should now contain a bunch of /post/12345 URLs.
if( options_map.usemobile ) { location_list = location_list.map( (url) => { return url + '/mobile' } ); }
console.log( location_list );
// return location_list; // Promises are still weird. (JS's null-by-default is also weird.)
post_urls = post_urls.concat( location_list ); // Append these URLs to the global array of /post URLs. (It's not like Array.push() because fuck you.)
// console.log( post_urls );
// It seems like bad practice to recurse from here, but for my weird needs it seems to make sense.
if( sitemap_x < final_sitemap ) {
scrape_sitemap_x( sitemap_x + 1, final_sitemap ); // If there's more to be done, recurse.
} else {
scrape_post_urls(); // If not, fetch & display all these /post URLs.
}
} )
}
// Fetch all the URLs in the global array of post_urls, then display contents as relevant
function scrape_post_urls() {
// Had to re-check how I did rate-limited fetches for the monolithic main scraper. It involved safe levels of 'wtf.'
// Basically I put n URLs in an array, or an array of fetch promises I guess, then use promise.all to wait until they're all fetched. Repeat as necessary.
// Can I use web workers? E.g. instantiate 25 parallel scripts that each shift from post_urls and spit back HTML. Ech. External JS is encouraged and DOM access is limited.
// Screw it, batch parallelism will suffice.
// Can I define a variable as a function? I mean, can I go var foobar = function_name, so I can later invoke foobar() and get function_name()?
// If so: I can pick a callback before all of this, fill a global array with e.g. [url, contents] stuff, and jump to a text- or image-based display when that's finished.
// Otherwise I think I'm gonna be copying myself a lot in order to display text xor images.
// Create divs for each post? This is going to be a lot of divs.
// 1000 is fine for the traditional monolithic scrape mode, but that covers ten or twenty times as many posts as a single sitemapX.xml.
// And it probably doesn't belong right here.
// Fuck it, nobody's using this unless they read the source code. Gotta build something twice to build it right.
// Create divs for each post, so they can be fetched and inserted asynchronously
post_urls.forEach( (url) => {
let new_div = document.createElement( 'div' );
new_div.id = '' + url;
document.body.appendChild( new_div );
} )
// Something in here is throwing "Promise rejection value is a non-unwrappable cross-compartment wrapper." I don't even know what the fuck.
// Apparently Mozilla doesn't know what the fuck either, because all discussion of this seems to be 'well it ought to throw a better error.'
// Okay somehow this happens even with everything past 'console.log( order_array );' commented out, so it's not actually a wrong thing involving the Promise nest. Huh.
// It's the for loop.
// Is it because we try using location_list.length? Even though it's a global array and not a Promise? And console.log() has no problem... oh, console.log quietly chokes.
// I'm trying to grab a sitemap2 that doesn't exist because I was being clever in opposite directions.
// Nope, still shits the bed, same incomprehensible error. Console.log shows location_list has... oh goddammit. location_list was the local variable; post_urls is the global.
// Also array.concat doesn't work for some reason. post_urls still has zero elements.
// Oh for fuck's sake it has no side effects! You have to do thing=thing.concat()! God damn the inconsistent behavior versus array.push!
// We now get as far as chain=chain.then and the first console.log('foobar'), but it happens exactly once.
// Splicing appears to happen correctly.
// We enter responses.map, and the [body of post, post url] array is passed correctly and contains right-looking data.
// Ah. links_from_text is not a function. It's links_from_page. Firefox, that's a dead simple error - say something.
// Thumbnails? True thumbnails, and only for tumblr images with _100 available. Still inadvisable for the whole damn tumblr at once.
// So limit that to single-sitemap modes (like linking out to 1000-pages-at-once monolithic scrapes) and call it good. 500 thumbnails is what /archive does anyway.
// Terrible fix for unique tags: can I edit CSS from within the page? Can I count inside that?
// I'd be trying to do something like with a corresponding value over one.
// Utter hack: I need to get the text of each /post to go above or below it. The key is options_map( 'story' ) == true.
// Still pass it as a "link" string. Prepend with \n or something, a non-URL-like control character, which we're okay to print accidentally. Don't link that "link."
// /mobile might work? Photosets show up as '[video]', but some text is in the HTML. Dunno if it's everything.
// Oh god they're suddenly fucking with mobile.
// In /mobile, get everything from the end of the blog title down to the standard navigation div.
var simultaneous_fetches = 25;
var order_array = new Array; // No reason to show immediate results, so fill with n, n, n until the sum matches the number of URLs.
for( x = post_urls.length; x > simultaneous_fetches; x -= simultaneous_fetches ) {
order_array.push( simultaneous_fetches );
}
if( x != 0 ) { order_array.push( x ); } // Any remainder from counting-down for() loop becomes the last element
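// Illustration: 53 queued posts with simultaneous_fetches = 25 gives order_array = [25, 25, 3] - no privileged first
// element this time, since there's no reason to show immediate results here.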
// console.log( order_array );
// order_array = [5,5]; // Debug - limits fetches to just a few posts
var chain = Promise.resolve(0); // Empty promise so we can use "then"
order_array.forEach( (how_many, which_entry) => {
chain = chain.then( s => {
// console.log( 'foobar' );
var subarray = post_urls.splice( 0, how_many ); // Shift some number of elements into a separate array, for partial array.map
// console.log( 'foobar2', subarray );
return Promise.all( subarray.map( post_url =>
Promise.all( [ fetch( post_url, { credentials: 'include' } ).then( s => s.text() ), post_url ] ) // Return [body of post, post URL]
) )
} )
. then( responses => responses.map( s => {
// console.log( 'foobar3', s );
var post_url = s[1];
// var url_array = soft_scrape_page_promise( s[0] ) // Surprise, this is a promise now
// Wait, do I need it to be a promise? Because otherwise I might prefer to use links_from_text().
var sublinks = links_from_page( s[0] );
let tag_links = sublinks.filter( return_tags ); // Copy tags into some new array. No changes to other filters required! (Test on eixn, ~2000 pages.)
tag_links = tag_links.filter( u => u.indexOf( '?ezastumblrscrape' ) < 0 );
// Do we want to filter down to images for this scrape function? It's not necessarily displaying them.
// Arg, you almost have to, to filter out the links to tumblrs that reblogged each post.
sublinks = sublinks.filter( s => { return s.indexOf( '.jpg' ) > 0 || s.indexOf( '.jpeg' ) > 0 || s.indexOf( '.png' ) > 0 || s.indexOf( '.gif' ) > 0; } );
sublinks = sublinks.filter( tumblr_blacklist_filter ); // Remove avatars and crap
sublinks = sublinks.map( image_standardizer ); // Clean up semi-dupes (e.g. same image in different sizes -> same URL)
sublinks = sublinks.filter( novelty_filter ); // Global duplicate remover
// sublinks = sublinks.concat( tag_links ); // Add tags back in. Not ideal, but should be functional.
if( options_map.story ) {
// Janky last-ditch text ripping, because Tumblr is killing NSFW sites:
// Assume we're using /mobile. Grab everything between the end of the blog name and the standard navigation div.
// let story_start = s[0].indexOf( '</h1>' ) + 5;
let story_start = s[0].indexOf( '<p>' ) + 3; // Better to skip ahead a bit instead of showing the date
let story_end = s[0].indexOf( '<div id="footer">' ); // Assumed marker for the standard /mobile navigation div
let story = s[0].substring( story_start, story_end );
// Problem: images are still showing up. We sort of don't want that. Clunk.
story = story.replace( /<img[^>]*>/gi, '' ); // Crudely strip <img> tags from the story text
sublinks.push( '\n' + story ); // Pass the story along as a newline-flagged "link" string, per the notes above
}
var bulk_string = "<a href='" + post_url + "'>" + post_url + "</a> "; // A digest, so we can update innerHTML just once per div
sublinks.forEach( (link) => {
let contents = link;
if( options_map.thumbnails == 'xml' && link.indexOf( '_1280' ) > -1 ) { // If we're showing thumbnails and this image can be resized, do, then show it
let img = link.replace( '_1280', '_100' );
contents = '<img src="' + img + '"> ' + link; // Thumbnail plus the URL. Deserves a class, for consistent scale.
}
let this_link = '<a href="' + link + '">' + contents + '</a><br> ';
// How is this not what's causing the CSS bleed?
// if( link.substring(0) == '\n' ) { // If this is text
if( link.indexOf( '\n' ) > -1 ) { // If this is text
this_link = link;
}
bulk_string += this_link;
} )
var tag_string = "";
tag_links.forEach( (link) => {
// let tag = link.split( '/tagged/' )[1]
tag_string += '#' + link.split( '/tagged/' )[1] + ' ';
} )
bulk_string += tag_string + " ";
// Tags should be added here. If we have them.
// console.log( bulk_string );
document.getElementById( '' + post_url ).innerHTML = bulk_string; // Yeeeah, I should probably create these div IDs before this happens.
// And here's where I'd increase the page counter, if we had one.
// Let's use order_array to judge how done we are - and make it a percent, not a page count. Use order_array.length and pretend the last element's the same size.
// document.getElementById( 'pagecounter' ).innerHTML = '%';
// No wait, we can't do it here. This is per-post, not per-page. Or... we could do it real half-assed.
} )
)
.then( s => { // I don't think we take any actual data here. This just fires once per 'responses' group, so we can indicate page count etc.
let completion = Math.ceil( 100 * (which_entry+1) / order_array.length ); // Zero-ordinal index to percentage. Blugh.
document.getElementById( 'pagecounter' ).innerHTML = '' + completion + '%';
} )
} )
// "Promises allow a flat execution pattern!" Fuck you, you liars. Look at that rat's nest of alternating braces.
// If you've done all the work in spaghetti functions somewhere else, maybe it's fine, but if you want code to happen where it fucking starts, anonymous functions SUCK.
}
// ------------------------------------ Post-by-post scraper with embedded images ------------------------------------ //
// Scrape each page for /post/ links, scrape each /post/ for content, display in-order with less callback hell
// New layout & new scrape method - not required to be compatible with previous functions
function new_embedded_display() {
if( isNaN( parseInt( options_map.startpage ) ) || options_map.startpage <= 1 ) { options_map.startpage = 1; }
mydiv.innerHTML += "
";
// Links out from this mode - scrapewholesite, original mode, maybe other crap
//mydiv.innerHTML += "This mode is under development and subject to change."; // No longer true. It's basically feature-complete.
// mydiv.innerHTML += " - Return to original image browser" + " " + " ";
mydiv.innerHTML += " " + html_ezastumblrscrape_options() + "
";
// Messy inline function for toggling page breaks - they're optional because we have post permalinks now
mydiv.innerHTML += "Toggle page breaks
";
mydiv.innerHTML += ""; // Empty span for things to be placed after.
posts_placed.push( 0 ); // Because fuck special cases.
// Scrape some pages
for( let x = options_map.startpage; x < options_map.startpage + options_map.pagesatonce; x++ ) {
fetch( site_and_tags + "/page/" + x, { credentials: 'include' } ).then( r => r.text() ).then( text => {
scrape_by_posts( text, x );
} )
}
}
// Take the HTML from a /page, fetch the /post links, display images
// Probably ought to be despaghettified and combined with the above function, but I was fighting callback hell -hard- after the last major version
// Alternately, split it even further and do some .then( do_this ).then( do_that ) kinda stuff above.
function scrape_by_posts( html_copy, page_number ) {
// console.log( page_dupe_hash ); // DEBUG
let posts = links_from_page( html_copy ); // Get links on page
posts = posts.filter( link => { return link.indexOf( '/post/' ) > 0 && link.indexOf( '/photoset' ) < 0; } ); // Keep /post links but not photoset iframes
posts = posts.map( link => { return link.replace( '#notes', '' ); } ); // post/1234 is the same as /post/1234#notes
posts = posts.filter( link => link.indexOf( window.location.host ) > 0 ); // Same-origin filter. Not necessary, but it unclutters the console. Fuckin' CORS.
if( page_number != 1 ) { posts = posts.filter( novelty_filter ); } // Attempt to remove posts linked on every page, e.g. commission info. Suffers a race condition.
posts = remove_duplicates( posts ); // De-dupe
// 'posts' now contains an array of /post URLs
// Display link and linebreak before first post on this page
let first_id = posts.map( u => parseInt( u.split( '/' )[4] ) ).sort( (a,b) => a - b ).pop(); // Grab ID from its place in each URL, sort numerically, take the highest
let page_link = " Page " + page_number + "";
if( posts.length == 0 ) { first_id = 1; page_link += " - No images found."; } // Handle empty pages with dummy content. Out of order, but whatever.
page_link += "
";
display_post( page_link, first_id + 0.5 ); // +/- on the ID will change with /chrono, once that matters
posts.map( link => {
fetch( link, { credentials: 'include' } ).then( r => r.text() ).then( text => {
let sublinks = links_from_page( text );
sublinks = sublinks.filter( s => { return s.indexOf( '.jpg' ) > 0 || s.indexOf( '.jpeg' ) > 0 || s.indexOf( '.png' ) > 0 || s.indexOf( '.gif' ) > 0; } );
sublinks = sublinks.filter( tumblr_blacklist_filter ); // Remove avatars and crap
sublinks = sublinks.map( image_standardizer ); // Clean up semi-dupes (e.g. same image in different sizes -> same URL)
sublinks = sublinks.filter( novelty_filter ); // Global duplicate remover
// Oh. Photosets sort of just... work? That might not be reliable; DownThemAll acts like it can't see the iframes on some themes.
// Yep, they're there. Gonna be hard to notice if/when they fail. Oh well, "not all images are guaranteed to appear."
// Videos will still be weird. (But it does grab their preview thumbnails.)
// Wait, can I filter reblogs here? E.g. with a ?noreblogs flag, and then checking if any given post has via/source links. Hmm. Might be easier in /mobile pages.
// Seem to get a lot of duplicate images? e.g. both
// https://media.tumblr.com/tumblr_m2gktkD7u31qdcy3io1_640.jpg and
// https://media.tumblr.com/tumblr_m2gktkD7u31qdcy3io1_1280.jpg
// Oh! Do I just not handle _640?
// Get ID from post URL, e.g. http://example.tumblr.com/post/12345/title => 12345
let post_id = parseInt( link.split( '/' )[4] ); // 12345 as a NUMBER, not a string, doofus
if( sublinks.length > 0 ) { // If this post has images we're displaying -
let this_post = "";
sublinks.map( url => {
this_post += '';
this_post += '';
this_post += '';
this_post += 'Permalink ';
} )
display_post( this_post, post_id );
}
} )
} )
}
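// A hedged sketch of that .then()-per-stage split - an unused helper, not wired in anywhere yet.
// It leans on the existing helpers (links_from_page, remove_duplicates) and only covers the page -> /post-links stage.
function scrape_by_posts_chained( page_url ) {
	return fetch( page_url, { credentials: 'include' } )
		.then( response => response.text() )
		.then( html => links_from_page( html ) )
		.then( links => links.filter( link => link.indexOf( '/post/' ) > 0 && link.indexOf( '/photoset' ) < 0 ) ) // Keep /post links, skip photoset iframes
		.then( posts => posts.map( link => link.replace( '#notes', '' ) ) ) // /post/1234 is the same as /post/1234#notes
		.then( posts => remove_duplicates( posts ) ); // Resolves to an array of /post URLs; a later .then() would fetch each one and hand it to display_post
}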
// Place content on page in descending order according to post ID number
// Consider rejiggering the old scrape method to use this. Move to 'universal' section if so. Alter or spin off to link posts instead?
// Turns out I never implemented ?chrono or ?reverse, so nevermind that for now.
// Remember to set options_map.chrono if ?find contains /chrono or whatever.
function display_post( content, post_id ) {
let this_node = document.createElement( "span" );
this_node.innerHTML = content;
this_node.id = post_id;
// Find lower-numbered node than post_id
let target_id = posts_placed.filter( n => n <= post_id ).sort( ( a, b ) => a - b ).pop(); // Take the highest number less than (or equal to) post_id
if( options_map.find.indexOf( '/chrono' ) > 0 ) {
target_id = posts_placed.filter( n => n <= post_id ).sort( ( a, b ) => a - b ).shift(); // Take the... fuck... lowest? What am I doing again?
// Fuuuck, this is really inconsistent. Nevermind the looney-toons syntax I used here, =>n<=.
// Screw it, use the old scraper for now.
}
let target_node = document.getElementById( target_id );
// http://stackoverflow.com/questions/4793604/how-to-do-insert-after-in-javascript-without-using-a-library
target_node.parentNode.insertBefore( this_node, target_node ); // Put our span just above the next-lower-ID node, keeping descending order
posts_placed.push( post_id ); // Remember that we added this ID
// No return value
}
// Return ascending or descending order depending on "chrono" setting
// function post_order_sort( a, b )
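// A minimal live version of that comparator - assumes options_map.chrono gets set when ?find contains /chrono, which nothing does yet:
function post_order_sort( a, b ) {
	return options_map.chrono ? a - b : b - a; // Ascending (oldest first) for chrono, descending (newest first) otherwise
}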
// ------------------------------------ Specific handling for www.tumblr.com (tag search, possibly dashboard) ------------------------------------ //
// URLs like https://www.tumblr.com/tagged/wooloo?before=1560014505 don't follow easy sequential pagination, so we have to (a) be linear or (b) guess. First whack is (a).
// Tumblr dashboard is obviously standardized, so we can make assumptions about /post links in relation to images.
// We do not fetch any individual /posts. We can't. They're on different subdomains, and Tumblr CORS remains tight-assed. But we can link them like in the post scrape mode.
// Ooh, I could maybe get "behind the jump" content via /embed URLs. Posts here should contain the blog UUID and a post number.
// Copied notes from above:
// https://www.tumblr.com/tagged/homestuck sort of works. Trouble is, pages go https://www.tumblr.com/tagged/homestuck?before=1558723097 - ugh.
// Next page is https://www.tumblr.com/tagged/homestuck?before=1558720051 - yeah this is a Unix timestamp. It's epoch time.
// https://www.tumblr.com/dashboard/2/185117399018 - but this one -is- based on /post numbers. Tumblr.com: given two choices, take all three.
// Being able to scrape and archive site-wide tags or your own dashboard would be useful. Dammit.
// Just roll another mode into this script. It's already a hot mess. The new code just won't run on individual blogs.
// Okay, so dashboard and site-wide tag modes.
// Dashboard post numbers don't have to be real post numbers. Tag-search timestamps obviously don't have to relate to real posts.
// We de-dupe, so overkill is fine... ish. Tumblr's touchy about "rate limit exceeded" these days.
// Tag scrape would be suuuper useful if we can grab blog posts from www.tumblr.com. Like "hiveswapcomicscontest."
// Still no luck on scraping dashboard-only blogs. Bluh.
// This is so aggravating. I can see the content, obviously. "View page source" just returns the dashboard source. But document.body.outerHTML contains the blog proper.
// Consider /embed again:
// https://embed.tumblr.com/embed/post/4RwtewsxXp-k1ReCcdAgXg/185288559546?width=542&language=en_US&did=a5c973d33a43ace664986204d72d7739de31b614
// This works but provides no previous/next link. (We need the ID, but we can get it from www.tumblr.com, then redirect.)
// Using DaveJaders for testing. https://davejaders.tumblr.com/archive does not redirect, so we can use that for same-origin fetches. Does /page stuff work?
// fetch( '/' ).then( r => r.text() ).then( t => document.body.outerHTML = t ) - CORS failure. "The Same Origin Policy disallows reading the remote resource at https://www.tumblr.com/login_required/davejaders . (Reason: CORS header ‘Access-Control-Allow-Origin’ missing)." Fuck me again, apparently.
// Circle back to dashboard-only blogs by showing the actual content. Avoid clearing body.innerHTML, let Tumblr do its thing, interact with the sidebar deal.
// I could still estimate page count, using a binary search. Assuming constant post rates.
// Or, get fancy, and estimate total posts from time between posts on sparse samples. Each fetch is ten posts in-order.
// It does say "No posts found." See https://www.tumblr.com/tagged/starlight-brigade?before=1503710342
// Between the ?before we fetch and the Next link, we have the span of time it took for ten posts to be made. I would completely exclude the "last" page, if found.
// I guess... integrate in blocks? E.g., given a data point, assume that rate continues to the next data point. Loose approximations are fine.
// E.g. push timestamp/rate pairs into an array, sort by timestamp, find delta between each timestamp, and multiply each rate by each period. One period is bidirectional.
// Lerping is not meaningfully harder. It's the above, plus each period times half the difference between the rates at either end of that period.
// Specific date entry? input type = "date".
// Specific date entry is useful and simple enough to finish before posting a new version. Content estimation can wait.
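// Hedged sketch of that block-integration estimate - hypothetical helper, nothing calls it yet.
// 'samples' would be pairs gathered while paging: { timestamp: <?before value>, rate: <posts per second around that page> }.
function estimate_total_posts( samples ) {
	samples = samples.slice().sort( ( a, b ) => a.timestamp - b.timestamp ); // Oldest first
	let total = 0;
	for( let i = 1; i < samples.length; i++ ) {
		let period = samples[i].timestamp - samples[i - 1].timestamp; // Seconds between data points
		total += samples[i - 1].rate * period; // Assume that rate continues until the next data point
		total += period * ( samples[i].rate - samples[i - 1].rate ) / 2; // Lerp correction: half the rate difference over the period
	}
	return Math.round( total );
}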
// Oh, idiot: make the "on or before" display also the navigation mechanic.
// We now go right here for any ?ezastumblrscrape URL on www.tumblr.com. Sloppy but functional.
function scrape_www_tagged( ) {
// ?find=/tagged/whatever is already populated by existing scrape links, but any ?key=value stuff gets lost.
// (If no ?before=timestamp, fetch first page. ?before=0 works. Some max_int would be preferable for sorting. No exceptions needed, then.)
is_first_page = false; // Implicit global, eat me. Clunk.
if( isNaN( options_map.before ) ) {
options_map.before = parseInt( Date.now() / 1000 ); // Current Unix timestamp in seconds, not milliseconds. The year is not 51413.
is_first_page = true;
}
if( options_map.before > parseInt( Date.now() / 1000 ) ) { is_first_page = true; } // Also handle when going days or years "into the future."
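// (e.g. a Date.now() of 1560014505000 ms becomes ?before=1560014505 - the same epoch-seconds format Tumblr's own links use.)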
// Standard-ish initial steps: clear page (handled), add controls, maybe change window title.
// We can't use truly standard prev/next navigation. Even officially, you only get "next" links. (Page count options should still work.)
let www_tagged_next = "Next >>>";
let pages_www_tagged = "";
pages_www_tagged += " 10, ";
pages_www_tagged += " 5, ";
pages_www_tagged += " 1 - >>>";
// Periods of time, in seconds, because Unix epoch timestamps.
let one_day = 24 * 60 * 60;
let one_week = one_day * 7;
let one_year = one_day * 365.24; // There's no reason not to account for leap years.
let approximate_place = "Posts " + options_map.find.split('/').join(' ') + ""; // options_map.find as relative link. Convenient.
let date_string = new Date( options_map.before * 1000 ).toISOString().slice( 0, 10 );
approximate_place += " on or before ";
// How do I associate this entry with the button so that pressing Enter triggers the button? Eh, minor detail.
let coverage = ""; // Started writing this string, then realized I can't know its value until all pages are fetched.
let time_www_tagged = ""; // Browse forward / backward by day / week / year.
time_www_tagged += "<<< One day";
time_www_tagged += " >>> - ";
time_www_tagged += "<<< One week";
time_www_tagged += " >>> - ";
time_www_tagged += "<<< One year";
time_www_tagged += " >>>";
// Change the displayed date and it'll go there. Play stupid games, win stupid prizes.
let jump_button_code = 'window.location += "?before=" + parseInt( new Date( document.getElementById( "on_or_before" ).value ).getTime() / 1000 );';
let date_jump = "";
mydiv.innerHTML += "
";
// Fetch first page specified by ?before=timestamp.
let tagged_url = "" + options_map.find + "?before=" + options_map.before; // Relative URLs are guaranteed to be same-domain, even if they're garbage.
fetch( tagged_url, { credentials: 'include' } ).then( r => r.text() ).then( text => {
display_www_tagged( text, options_map.before, options_map.pagesatonce ); // ... pagesatonce gets set in the pre-amble, right? It should have a default.
// Optionally we could check here if options_map.before == 0 and instead send max_safe_integer.
} )
}
// Either we need a global variable for how many more pages per... page... or else I should pass a how_many_more value to this recursive function.
function display_www_tagged( content, timestamp, pages_left ) {
// First, grab the Next link - i.e. its ?before=timestamp value.
// let next_timestamp_index = content.lastIndexOf( '?before=' );
// let next_timestamp = content.substring( next_timestamp_index + 8, content.indexOf( '"', next_timestamp_index ) ); // Untested
let next_timestamp = content.split( '?before=' ).pop().split( '"' ).shift(); // The last "?before=12345'" string on the page. Clunky but tolerable.
next_timestamp = "" + parseInt( next_timestamp ); // Guarantee this is a string of a number. (NaN "works.") Pages past the end may return nonsense.
if( pages_left > 1 ) { // If we're displaying more pages, fetch the next one and recurse.
let tagged_url = "" + options_map.find + "?before=" + next_timestamp; // Relative URLs are guaranteed to be same-domain, even if they're garbage.
// console.log( tagged_url );
fetch( tagged_url, { credentials: 'include' } ).then( r => r.text() ).then( text => {
display_www_tagged( text, next_timestamp, pages_left - 1 );
} )
} else { // Otherwise put that timestamp in our constructed Next link(s).
// I guess... get HTMLcollection of elements for "next" links, and change each one.
// Downside: links will only change once the last page is fetched. We could tack on a ?before for every fetch, but it would get silly. Right?
let next_links = Array.from( document.getElementsByClassName( 'www_next' ) ); // I'm not dealing with a live object unless I have to.
for( const link of next_links ) { link.href += "?before=" + next_timestamp; }
// Oh right, and update the header to guesstimate what span of time we're looking at.
let coverage = document.getElementById( 'coverage' );
coverage.innerHTML += "Displaying ";
if( is_first_page ) { coverage.innerHTML += "the most recent "; } // Condition is no longer sensible, but it's a good placeholder.
// We can safely treat ?before as the initial timestamp. Perfect accuracy is not important.
let time_covered = Math.abs( options_map.before - next_timestamp ); // Absolute so I don't care if I have it backwards.
if( time_covered > 48 * 60 * 60 ) { coverage.innerHTML += parseInt( time_covered / (24*60*60) ) + " days of posts"; } // Over two days? Display days.
else if( time_covered > 2 * 60 * 60 ) { coverage.innerHTML += parseInt( time_covered / (60*60) ) + " hours of posts"; } // Over two hours? Display hours.
else if( time_covered > 2 * 60 ) { coverage.innerHTML += parseInt( time_covered / 60 ) + " minutes of posts"; } // Over two minutes? Display minutes.
else { coverage.innerHTML += " several seconds of posts"; } // Otherwise just say it's a damn short time.
if( time_covered == 0 ) { coverage.innerHTML = "Displaying the first available posts"; } // Last page, earlier posts. No "next" page.
}
// Insert div for this timestamp's page.
let new_div = document.createElement( 'span' ); // Span, because divs cause line breaks. Whoops.
new_div.id = "" + timestamp;
let target_node = document.getElementById( 'bottom_controls_div' );
target_node.parentNode.insertBefore( new_div, target_node ); // Insert each page before the footer.
let div_html = "";
// Separate page HTML by posts.
// At least the "li" elements aren't nested, so I can terminate the last one on "". Or... all of them.
let posts = content.split( '
' )[0] ); // Terminate last element at
. Again, not great code, but clunk clunk clunk get it done.
// For each post:
for( const post of posts ) {
// Extract images from each post.
let links = links_from_page( post );
links = links.map( image_standardizer ); // This goes before grabbing the permalink because /post URLs do get standardized. No &media guff.
let permalink = links.filter( s => s.indexOf( '.tumblr.com/post' ) > 0 )[0]; // This has to go before de-duping, or posts linking to posts can leave permalinks blank.
links = links.filter( novelty_filter );
links = links.filter( tumblr_blacklist_filter );
// document.body.innerHTML += links.join( " " ) + " "; // Debug
// Separate the images.
let images = links.filter( s => s.indexOf( 'media.tumblr.com' ) > 0 ); // Note: this will exclude external images, e.g. embedded Twitter stuff.
// If this post has images:
if( images.length > 0 ) { // Build HTML xor insert div for each post, to display images.
// Get /post URL, including blog name etc.
//let permalink = links.filter( s => s.indexOf( '.tumblr.com/post' ) > 0 )[0];
let post_html = "";
for( const image of images ) {
post_html += '';
post_html += '';
post_html += '';
post_html += 'Permalink ';
}
div_html += post_html;
}
}
// Insert accumulated HTML into this div.
new_div.innerHTML = div_html;
}
// ------------------------------------ HTML-returning functions for duplication prevention ------------------------------------ //
// Return HTML for standard Previous / Next controls (<<< Previous - Next >>>)
function html_previous_next_navigation() {
let prev_next_controls = "";
if( options_map.startpage > 1 ) {
prev_next_controls += "<<< Previous - ";
}
prev_next_controls += "Next >>>";
return prev_next_controls;
}
// Return HTML for pages-at-once versions of previous/next page navigation controls (<<< 10, 5, 1 - 1, 5, 10 >>>)
function html_page_count_navigation() {
let prev_next_controls = "";
if( options_map.startpage > 1 ) { // <<< 10, 5, 1 -
prev_next_controls += "<<< ";
prev_next_controls += " 1, ";
prev_next_controls += " 5, ";
prev_next_controls += " 10 - ";
}
prev_next_controls += " 10, ";
prev_next_controls += " 5, ";
prev_next_controls += " 1 - ";
prev_next_controls += ">>>";
return prev_next_controls;
}
// Return HTML for image-size options (changes via CSS or via URL parameters)
// This used to work. It still works, in the new www mode I just wrote. What the fuck do I have to do for some goddamn onclick behavior?
// "Content Security Policy: The page’s settings blocked the loading of a resource at self (“script-src https://lalilalup.tumblr.com https://assets.tumblr.com/pop/ 'nonce-OTA4NjViZmE2MzZkYTFjMjM1OGZkZGM1MzkwYWU4NTA='”)." What the fuck.
// Jesus Christ, it might be yet again because of Tumblr's tightass settings:
// https://stackoverflow.com/questions/37298608/content-security-policy-the-pages-settings-blocked-the-loading-of-a-resource
// The function this still works in is on a "not found" page. The places it will not work are /archive pages.
// Yeah, on /mobile instead of /archive it works fine. Fuck you, Tumblr.
// Jesus, that means even onError can't work.
function image_size_options() {
var html_string = "Immediate: \t"; // Change class to instantly resize images, temporarily
html_string += "Original image sizes - ";
html_string += "Snap columns - ";
html_string += "Snap rows - ";
html_string += "Fit width - ";
html_string += "Fit height - ";
html_string += "Fit both
";
html_string += "Persistent: \t"; // Reload page with different image mode that will stick for previous/next pages
html_string += "Original image sizes - "; // This is the CSS default, so any other value works
html_string += "Snap columns - ";
html_string += "Snap rows - ";
html_string += "Fit width - ";
html_string += "Fit height - ";
html_string += "Fit both";
return html_string;
}
// Return HTML for links to ?maxres versions of the same page, e.g. "_raw" versus "_1280"
function image_resolution_options() {
var html_string = "Maximum resolution: \t";
html_string += "Raw - "; // I'm not 100% sure "_raw" works anymore, but the error function handles it, so whatever.
html_string += "1280 - ";
html_string += "500 - ";
html_string += "400 - ";
html_string += "250 - ";
html_string += "100";
return html_string;
}
// Return links to other parts of Eza's Tumblr Scrape functionality, possibly excluding whatever you're currently doing
// Switch to full-size images - Toggle image size - Show one page at once - Scrape whole Tumblr - (Experimental fetch-every-post image browser)
// This mode is under development and subject to change. - Return to original image browser
function html_ezastumblrscrape_options() {
let html_string = "";
// "You are browsing" text? Tell people where they are and what they're looking at.
html_string += "Scrape whole Tumblr - ";
html_string += "Browse images - "; // Default mode; so any value works
html_string += "(Experimental fetch-every-post image browser) ";
return html_string;
}
function error_function( url ) {
// This clunky function looks for a lower-res image if the high-res version doesn't exist.
// Surprisingly, this does still matter. E.g. http://66.media.tumblr.com/ba99a55896a14a2e083cec076f159956/tumblr_inline_nyuc77wUR01ryfvr9_500.gif
// This might mismatch _100 images and _250 links because of that self-erasing clause... but it's super rare, so meh.
let on_error = 'if(this.src.indexOf("_raw")>0){this.src=this.src.replace("_raw","_1280").replace("//media","//66.media");}'; // Swap _raw for 1280, add CDN number
on_error += 'else if(this.src.indexOf("_1280")>0){this.src=this.src.replace("_1280","_500");}'; // Swap 1280 for 500
on_error += 'else if(this.src.indexOf("_500")>0){this.src=this.src.replace("_500","_400");}'; // Or swap 500 for 400
on_error += 'else if(this.src.indexOf("_400")>0){this.src=this.src.replace("_400","_250");}'; // Or swap 400 for 250
on_error += 'else{this.src=this.src.replace("_250","_100");this.onerror=null;}'; // Or swap 250 for 100, then give up
on_error += 'document.getElementById("' + encodeURI( url ) + '").href=this.src;'; // Link the image to itself, regardless of size
// 2020: This has to get more complicated, or at least more verbose, to support shitty new image URLs.
// https://66.media.tumblr.com/eb7d40a8683e623f173f81fc253056dc/e4277131a4c174c7-a5/s1280x1920/5b3c54b53a09f91b08f9af7fb88e30a253f3c5db.jpg
//'s400x600'
//'s500x750'
//'s640x960'
//'s1280x1900'
// 's2048x3072'
// This has to go in Image Glutton, because Tumblr only ever gets worse
// Start over.
// on_error = 'this.src = this.src.replace( "_250", "_100" ); '; // Unconditional, reverse ladder pattern. Don't Google that. I just made it up.
on_error = "let oldres = [ '_100', '_250', '_400', '_500', '_1280' ]; "; // Inverted quotes, Eh.
on_error += "let newres = [ 's400x600', 's500x750', 's640x960', 's1280x1920' ];"; // Should probably include slashes. Rare collisions.
// on_error += "for( let i = old.length-1; i > 1; i-- ) { this.src.replace( old[i], old[i-1] ); } "; // 1280 -> 500, 500 -> 400... shit.
on_error += "for( let i = 1; i < oldres.length; i++ ) { this.src = this.src.replace( oldres[i+1], oldres[i] ); } "; // 250 -> 100, 400 -> 250, etc. One step per error.
on_error += "for( let i = 1; i < newres.length; i++ ) { this.src = this.src.replace( newres[i+1], newres[i] ); } ";
on_error += 'document.getElementById("' + encodeURI( url ) + '").href=this.src;'; // Original quotes - be cautious when editing
// God dammit. The links work, but the embedded images don't.
// https://66.media.tumblr.com/eb7d40a8683e623f173f81fc253056dc/e4277131a4c174c7-a5/s1280x1920/ba0f0dd5d5e76726639782f57c4b732f156a5219.jpg
// https://66.media.tumblr.com/eb7d40a8683e623f173f81fc253056dc/e4277131a4c174c7-a5/s1280x1920/5b3c54b53a09f91b08f9af7fb88e30a253f3c5db.jpg
/*
x = 'https://66.media.tumblr.com/eb7d40a8683e623f173f81fc253056dc/e4277131a4c174c7-a5/s1280x1920/44b457df0d9826135b062563cd802afdfe496888.jpg'
newres = [ 's400x600', 's500x750', 's640x960', 's1280x1920' ];
for( let i = 1; i < newres.length; i++ ) { this.src.replace( newres[i+1], newres[i] ); }
*/
return on_error;
}
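// Hedged usage sketch (hypothetical, not how the script currently builds its markup): error_function() is meant for an
// <img> onerror handler, paired with an <a> whose id is encodeURI( url ) so the last line above can re-point that link.
// Building it through DOM properties dodges the nested-quote mess; Tumblr's CSP can still block inline handlers (see notes above).
function example_fallback_image( url ) {
	let anchor = document.createElement( 'a' );
	anchor.id = encodeURI( url );
	anchor.href = url;
	let img = document.createElement( 'img' );
	img.src = url;
	img.setAttribute( 'onerror', error_function( url ) ); // Runs the downgrade ladder when the current size 404s
	anchor.appendChild( img );
	return anchor;
}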
// ------------------------------------ Universal page-scraping function (and other helper functions) ------------------------------------ //
// Add URLs from a 'blank' page to page_dupe_hash (without just calling soft_scrape_page_promise and ignoring its results)
function exclude_content_example( url ) {
fetch( url, { credentials: 'include' } ).then( r => r.text() ).then( text => {
let links = links_from_page( text );
links = links.filter( novelty_filter ); // Novelty filter twice, because image_standardizer munges some /post URLs
links = links.map( image_standardizer );
links = links.filter( novelty_filter );
} )
// No return value
}
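// Hypothetical usage: exclude_content_example( 'https://' + window.location.hostname + '/page/9999' ) would pre-poison
// the duplicate tracking with theme images from a page past the end, so they never show up as "new" content later.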
// Spaghetti to reduce redundancy: given a page's text, return a list of URLs.
function links_from_page( html_copy ) {
// Cut off the page at the "More you might like" / "Related posts" footer, on themes that have one
html_copy = html_copy.split( '="related-posts' ).shift();
let http_array = html_copy.split( /['="']http/ ); // Regex split on anything that looks like a source or href declaration
http_array.shift(); // Ditch first element, which is just whatever precedes the first link
http_array = http_array.map( s => { // Theoretically parallel .map instead of maybe-linear .forEach or low-level for() loop
if( s.indexOf( "&" ) > -1 ) { s = htmlDecode( s ); } // Yes a fucking " should match a goddamn regex for terminating on quotes!
s = s.split( /['<>"']/ )[0]; // Terminate each element (split on any terminator, take first subelement)
s = s.replace( /\\/g, '' ); // Remove escaping backslashes (e.g. http\/\/ -> http//)
if( s.indexOf( "%3A%2F%2F" ) > -1 ) { s = decodeURIComponent( s ); } // What is with all the http%3A%2F%2F URLs?
// s = s.split( '"' )[0]; // Yes these count as doublequotes you stupid broken scripting language.
return "http" + s; // Oh yeah, add http back in (regex eats it)
} ) // http_array now contains an array of strings that should be URLs
let post_array = html_copy.split( /['="']\/post/ ); // Regex split on anything that looks like a src="/post" link
post_array.shift(); // Ditch first element, which is just whatever precedes the first /post link
post_array = post_array.map( s => { // Theoretically parallel .map instead of maybe-linear .forEach or low-level for() loop
s = s.split( /['<>"']/ )[0]; // Terminate each element (split on any terminator, take first subelement)
return window.location.protocol + "//" + window.location.hostname + "/post" + s; // Oh yeah, add /post back in (regex eats it)
} ) // post_array now contains an array of strings that should be photoset URLs
http_array = http_array.concat( post_array ); // Photosets are out of order again. Blar.
return http_array;
}
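// Hypothetical example of what this returns (the second URL picks up window.location's host, example.tumblr.com here):
// links_from_page( '<a href="https://66.media.tumblr.com/x/tumblr_abc_500.jpg">pic</a> <a href="/post/12345/title">post</a>' )
//   -> [ "https://66.media.tumblr.com/x/tumblr_abc_500.jpg", "https://example.tumblr.com/post/12345/title" ]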
// Filter: Return false for typical Tumblr nonsense (JS, avatars, RSS, etc.)
function tumblr_blacklist_filter( url ) {
if( url.indexOf( "/reblog/" ) > 0 ||
url.indexOf( "/tagged/" ) > 0 || // Might get removed so the script can track and report tag use. Stupid art tags like 'my-draws' or 'art-poop' are a pain to find.
url.indexOf( ".tumblr.com/avatar_" ) > 0 ||
url.indexOf( ".tumblr.com/image/" ) > 0 ||
url.indexOf( ".tumblr.com/rss" ) > 0 ||
url.indexOf( "srvcs.tumblr.com" ) > 0 ||
url.indexOf( "assets.tumblr.com" ) > 0 ||
url.indexOf( "schema.org" ) > 0 ||
url.indexOf( ".js" ) > 0 ||
url.indexOf( ".css" ) > 0 ||
url.indexOf( "twitter.com/intent" ) > 0 || // Weirdly common now
url.indexOf( "tmblr.co/" ) > 0 ||
//https://66.media.tumblr.com/dea691ec31c9f719dd9b057d0c2be8c3/a533a91e352b4ac7-29/s16x16u_c1/f1781cbe6a6eb7417aa6c97e4286d46372e5ba37.jpg
url.indexOf( "s16x16u" ) > 0 || // Avatars
url.indexOf( "s64x64u" ) > 0 || // Avatars
//https://66.media.tumblr.com/49fac1010ada367bcaee721ea49d0de5/3ddc8ebad67ae55a-ca/s64x64u_c1/592c805c7ce2c44cd389421697e0360b83d89f37.jpg
url.indexOf( "u_c1/" ) > 0 || // Avatars
// https://66.media.tumblr.com/a4fe5011f9e07f95588c789128b60dca/603e412a60dc5227-4a/s64x64u_c1_f1/7b50fe887c756759973d442a89ef86ab1359f6e5.gif
url.indexOf( "ezastumblrscrape" ) > 0 ) // Somehow this script is running on pages being fetched, inserting a link. Okay. Sure.
{ return false } else { return true }
}
// Return standard canonical URL for various resizes of Tumblr images - size of _1280, single CDN
// 10/14 - ?usesmall seems to miss the CDN sometimes?
// e.g. http://mooseman-draws.tumblr.com/archive?startpage=1?pagesatonce=5?thumbnails?ezastumblrscrape?scrapemode=everypost?lastpage=37?usesmall
// https://66.media.tumblr.com/d970fff86185d6a51904e0047de6e764/tumblr_ookdvk7foy1tf83r7o1_400.png sometimes redirects to 78.media and _raw. What?
// Oh, probably not my script. I fucked with the Tumblr redirect script I use, but didn't handle the lack of CDN in _raw sizes.
// 2020: tumblr scrape:
// https://66.media.tumblr.com/b1515c45637955e1e52ec213944db662/08f2b2e47bbce703-70/s400x600/9077891dda98f127c040bd77581014ce4019fe75.gifv
// obviously can go to .gif, but that .gif saves as tumblr_b1515c45637955e1e52ec213944db662_9077891d_400.gif.
// Also, bumping up to /s500x750 works, so I can probably still slam everything to maximum size via 'canonical' urls.
function image_standardizer( url ) {
// Some lower-size images are automatically resized. We'll change the URL to the maximum size just in case, and Tumblr will provide the highest resolution.
// Replace all resizes with _1280 versions. Nearly all _1280 URLs resolve to highest-resolution versions now, so we don't need to e.g. handle GIFs separately.
// Oh hey, Tumblr now has _raw for a no-bullshit as-large-as-possible setting.
// _raw only works without the CDN - so //media.tumblr yes, but //66.media.tumblr no. This complicates things.
// Does //media and _raw always work? No, of course not. So we still need on_error.
// url = url.replace( "_540.", "_1280." ).replace( "_500.", "_1280." ).replace( "_400.", "_1280." ).replace( "_250.", "_1280." ).replace( "_100.", "_1280." );
let maxres = "1280"; // It is increasingly unlikely that _raw still works. Reconsider CDN handling if that's the case.
if( options_map.maxres ) { maxres = options_map.maxres } // If it's set, use it. Should be _100, _250, whatever. ?usesmall should set it to _400. ?notraw, _1280.
maxres = "_" + maxres + "."; // Keep the URL options clean: "400" instead of "_400." etc.
url = url.replace( "_raw", maxres ).replace( "_1280.", maxres ).replace( "_640.", maxres ).replace( "_540.", maxres )
.replace( "_500.", maxres ).replace( "_400.", maxres ).replace( "_250.", maxres ).replace( "_100.", maxres );
// henrythehangman.tumblr.com has doubled images from /image posts in ?scrapemode=everypost. Lots of _1280.jpg?.jpg nonsense.
// Is that typical for tumblrs with this theme? It's one of those annoying magnifying-glass-on-hover deals. If it's just that one weird fetish site, remove this later.
url = url.split('?')[0]; // Ditch anything past the first question mark, if one exists
url = url.split('&')[0]; // Ditch anything past the first ampersand, if one exists - e.g. speikobrarote.tumblr.com
if( url.indexOf( 'tumblr.com' ) > 0 ) { url = url.split( ' ' )[0]; } // Ditch anything past a trailing space, if one exists - e.g. cinnasmut.tumblr.com
// Standardize media subdomain / CDN subsubdomain, to prevent duplicates and fix _1280 vs _raw complications.
if( url.indexOf( '.media.tumblr.com/' ) > 0 ) {
let url_parts = url.split( '/' )
url_parts[2] = '66.media.tumblr.com'; // This came first. Then //media.tumblr.com worked, even for _raw. Then _raw went away. Now it needs a CDN# again. Bluh.
// url_parts[2] = 'media.tumblr.com'; // 2014: write a thing. 2016: comment out old thing, write new thing. 2018: uncomment old thing, comment new thing. This script.
url = url_parts.join( '/' ).replace( 'http:', 'https:' );
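// Hypothetical before/after (with maxres left at the default "1280"):
// image_standardizer( 'http://40.media.tumblr.com/abc/tumblr_xyz_640.jpg?x=1' )
//   -> 'https://66.media.tumblr.com/abc/tumblr_xyz_1280.jpg'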
/* // Doesn't work - URLs resolve to HTML when clicked, but don't embed as images. Fuck this website.
// I need some other method for recognizing duplicates - the initial part of the URL is the same.
// Can HTML5 / CSS suppress certain elements if another element is present?
// E.g. #abcd1234#640 is display:none if #abcd1234#1280 exists.
// https://support.awesome-table.com/hc/en-us/articles/115001399529-Use-CSS-to-change-the-style-of-each-row-depending-on-the-content
// The :empty pseudoselector works like :hover. And you can condition it on attributes, like .picture[src=""]:empty.
// The :empty + otherSelector{} syntax is weird, but it's CSS, so of course it's weird.
// Here we'd... generate a style element? Probably just insert a