// ==UserScript==
// @name          Eza's Tumblr Scrape
// @namespace     https://inkbunny.net/ezalias
// @description   Creates a new page showing just the images from any Tumblr
// @license       MIT
// @license       Public domain / No rights reserved
// @include       http://*?ezastumblrscrape*
// @include       http://*/ezastumblrscrape*
// @include       http://*.tumblr.com/
// @include       http://*.tumblr.com/page/*
// @include       http://*.tumblr.com/tagged/*
// @include       http://*.tumblr.com/archive
// @exclude       *imageshack.us*
// @exclude       *imageshack.com*
// @version       3.6
// @downloadURL   none
// ==/UserScript==

// Create an imaginary page on the relevant Tumblr domain, mostly to avoid the ridiculous same-origin policy for public HTML pages. Populate the page with all images from that Tumblr. Add links to this page on normal pages within the blog.
// This script also works on off-site Tumblrs, by the way - just add /archive?ezastumblrscrape?scrapewholesite after the ".com" or whatever. Sorry it's not more concise.

// Make it work, make it fast, make it pretty - in that order.

// TODO:
// Going one page at a time for /scrapewholesite is dog-slow, especially when there are more than a thousand pages. Any balance between synchronicity and speed throttling is desirable.
// Maybe grab several pages at once? No, damn, that doesn't work without explicit parallelism. I don't know if JS has that. Really, I just need to get some timer function working.
// Does setInterval work? The auto-repeat one, I mean.
// Infinite-scrolling Tumblrs don't necessarily link to the next page. I need another metric - like whether pages only contain the same images as last time. (Empty pages sometimes display foreground images.)
// I'll have to add filtering as some kind of text input... and could potentially do multi-tag filtering, if I can reliably identify posts and/or reliably match tag definitions to images and image sets.
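// The timer-throttling idea above could look like the sketch below: a queue that
// fetches one page per tick, which setInterval(queue.step, delay) would drive in
// the script. step() is exposed so the pacing logic stands alone; the names
// makePageQueue and fetchPage are illustrative, not part of the script.
function makePageQueue(fetchPage) {
	var pending = [];        // page numbers waiting to be fetched
	var results = {};        // page number -> whatever fetchPage returns
	return {
		enqueue: function (page) { pending.push(page); },
		step: function () {  // fetch exactly one page per tick - the throttle
			if (pending.length === 0) { return false; }
			var page = pending.shift();
			results[page] = fetchPage(page);
			return true;
		},
		results: results
	};
}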
// This is a good feature for doing /scrapewholesite to get text links and then paging through them with fancy dynamic presentation nonsense. Also: duplicate elision.
// I'd love to do some multi-scrape stuff, e.g. scraping both /tagged/homestuck and /tagged/art, but that requires some communication between divs to avoid constant repetition.
// I should start handling "after the cut" situations somehow, e.g. http://banavalope.tumblr.com/post/72117644857/roachpatrol-punispompouspornpalace-happy-new
// Just grab any link to a specific /post. Occasional duplication is fine, we don't care.
// Wait, shit. Every theme should link to every page. And my banavalope example doesn't even link to the same domain, so we couldn't get it with raw AJAX. Meh. It's just a rare problem we'll have to ignore.
// http://askleijon.tumblr.com/ezastumblrscrape is a good example - lots of posts link to outside images (mostly imgur)
// I could detect "read more" links if I can identify the text-content portion of posts. Links to /post/ pages are universal theme elements, but become special when they're something the user links to intentionally.
// For example: Narcisso's Dream on http://cute-blue.tumblr.com/ only shows the cover because the rest is behind a break.
// Post-level detection would also be great because it'd let me filter out reblogs. Fuck all these people with 1000-page Tumblrs, shitty animated GIFs in their theme, infinite scrolling, and NO FUCKING TAGS. Looking at you, http://neuroticnick.tumblr.com/post/16618331343/oh-gamzee#dnr - you prick.
// Look into Tumblr Saviour to see how they handle and filter out text posts.
// Should non-image links from images be gathered at the top of each 'page' on the image browser? E.g. http://askNSFWcobaltsnow.tumblr.com links to Derpibooru a lot. Should those be listed before the images?
// I worry it'd pick up a lot of crap, like Facebook and the main page. More blacklists / whitelists. Save it for when individual posts are detected.
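// The "just grab any link to a specific /post" note above could be sketched like
// this: a regex pass over a page's raw HTML that keeps every href pointing at a
// /post/ URL. Occasional duplicates are tolerated, as the note says. The function
// name and exact pattern are assumptions, not taken from the script.
function findPostLinks(html) {
	var matches = html.match(/https?:\/\/[^"']*\/post\/\d+[^"']*/g);
	return matches || [];
}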
// ScrapeWholeSite: 10 pages at once by doing 10 separate xmlhttpwhatever objects, waiting for each to flip some bit in a 10-bool array? Clumsy parallelism. Possibly recursion, if the check for are-we-all-done-yet is in the status==4 callback.
// I should probably implement a box and button for choosing lastpage, just for noob usability's sake. Maybe it'd only appear if pages==2.
// Add a convenient interface for changing options? "Change browsing options" to unhide a div that lists every ?key=value pair, with text-entry boxes or radio buttons as appropriate, and a button that pushes a new URL into the address bar and re-hides the div. Would need to be separate from the thumbnail toggle so long as anything false is suppressed in get_url or whatever.
// Dropdown menus? Thumbnails yes/no, Pages At Once 1-20. These change the options_map settings immediately, so next/prev links will use them. Link to Apply Changes uses the same ?startpage as current.
// Could I generalize that the way I've generalized Image Glutton? E.g., grab all links from a Pixiv gallery page, show all images and all manga pages.
// Possibly @include any ?scrapeeverythingdammit to grab all links and embed all pictures found on them. Single-jump recursive web mirroring. (Fucking same-domain policy!)
// Now that I've got key-value mapping, add a link for 'view original posts only (experimental).' Er, 'hide reblogs?' Difficult to accurately convey.
// Make it an element of the post-scraping function. Then it would also work on scrape-whole-tumblr.
// Better yet: call it separately, then use the post-scraping function on each post-level chunk of HTML. I.e. call scrape_without_reblogs from scrape_whole_tumblr, split off each post into strings, and call soft_scrape_page( single_post_string ) to get all the same images.
// Or would it be better to get all images from any post? Doing this by-post means we aren't getting theme nonsense (mostly).
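// The flip-a-bit-per-request idea above reduces to a completion counter: fire N
// requests, have each status==4 callback report in, and run the done callback once
// all N have reported. A counter stands in for the 10-bool array; the name
// makeCompletionTracker is invented for this sketch.
function makeCompletionTracker(total, done) {
	var finished = 0;
	return function markOneFinished() {  // call this from each request's callback
		finished += 1;
		if (finished === total) { done(); }
	};
}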
// Maybe just exclude images where a link to another Tumblr happens before the next image... no, text posts could screw that up.
// General post detection is about recognizing patterns. Can we automate it heuristically? Bear in mind it'd be done at least once per scrape-page, and possibly once per tumblr-page.
// Add picturepush.com to the whitelist - or just add anything with an image file extension? Once we're filtering duplicates, Facebook buttons won't matter.
// User b84485 seems to be using the scrape-whole-site option to open image links in tabs, and so is annoyed by the 500/1280 duplicates. Maybe a 'remove duplicates' button after the whole site's done?
// It's a legitimately good idea. Lord knows I prefer opening images in tabs under most circumstances.
// Basically I want a "Browse Links" page instead of just "grab everything that isn't nailed down."
// http://mekacrap.tumblr.com/post/82151443664/oh-my-looks-like-theres-some-pussy-under#dnr - lots of 'read more' stuff, for when that's implemented.
// Eza's tumblr scrape: "read more" might be tumblr standard.
// e.g.
// Read More
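// The 500/1280 duplicate problem mentioned above (same image served at several
// sizes, e.g. tumblr_xyz_500.jpg vs tumblr_xyz_1280.jpg) could be handled by
// keying each URL on its filename minus the size suffix and keeping only the
// first URL per key. This is an illustrative sketch, not the script's method.
function removeSizeDuplicates(urls) {
	var seen = {};
	var kept = [];
	urls.forEach(function (url) {
		var key = url.replace(/_\d+(\.\w+)$/, "$1"); // drop _500 / _1280 etc.
		if (!seen[key]) {
			seen[key] = true;
			kept.push(url);
		}
	});
	return kept;
}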
// http://c-enpai.tumblr.com/ - interesting content visible in /archive, but every page is 'themed' to be a blank front page. WTF.
// "Scrape" link should appear in /archive, for consistency. Damn thing's unclickable on some themes.
// Why am I looking for specific domains to sort to the front? imgur, DeviantArt, etc. - just do it for any image that's not on *.tumblr.com, fool.
// Chokes on multi-thousand-page Tumblrs like actual-vriska, at least when listing all pages. It's just link-heavy text. Maybe skip having a div for every page and just append to one div. Or skip divs and append to the raw document innerHTML. It could be a memory thing, if AJAX elements are never destroyed.
// Multi-thousand-page Tumblrs make "find image links from all pages" choke. Massive memory use, massive CPU load. Ridiculous. It's just text. (Alright, it's links and AJAX requests, but it's doggedly linear.)
// Maybe skip individual divs and append the raw pile-of-links hypertext into one div. Or skip divs entirely and append it straight to the document innerHTML.
// Could it be a memory leak thing? Are AJAX elements getting properly released and destroyed when their scope ends? Kind of ridiculous either way, considering we're holding just a few kilobytes of text per page.
// Try re-using the same AJAX object.
// Add HTTPS support, ya dingus.
// Expand options_url to take an arbitrary list of key,value,key,value pairs.
// The escape function in JS is encodeURI. We need 'safe' URLs as tag IDs.
// Optimizing the find-last-page function: start on p3, look for p2 to see if it's a theme without links. Then multiply by 11s: 3, 33, 363, 3993, ~44,000.
// Consider empirical analysis of this ridiculous problem. Check several random-ish Tumblrs to gauge typical size, then test vs. different growth methods.

/* Assorted notes from another text file

. eza's tumblr scrape - testing open-loop vs. closed-loop updating for large tumblrs. caffeccino has 200-ish pages.
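// The multiply-by-11s note above (3, 33, 363, 3993, ...) amounts to geometric
// probing followed by a binary search of the gap. Sketch below, assuming a
// pageExists(n) predicate standing in for an AJAX does-this-page-have-content
// check; both names are assumptions for illustration.
function findLastPage(pageExists) {
	var lo = 1;
	var hi = 3;
	while (pageExists(hi)) {    // geometric probing: 3, 33, 363, 3993, ...
		lo = hi;
		hi = hi * 11;
	}
	while (lo + 1 < hi) {       // binary search between last good and first bad
		var mid = Math.floor((lo + hi) / 2);
		if (pageExists(mid)) { lo = mid; } else { hi = mid; }
	}
	return lo;
}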
from a cached state, and with stuff downloading, getting all 221 the old way takes 8m20s and has noticeable slowdown past 40-ish. The new method takes 16m and is honestly not very fast from the outset. The use of a global variable might cause ugly locking. With JS, who knows.

. eza's tumblr fixiv? de-style everything by simply erasing the