Hacking around memory limitations in shared hosting.

May 30 2019

Longtime readers are aware that I've been a customer of Dreamhost for quite a few years now, and by and large they've done all right by me.  They haven't complained (much) about all the stuff I have running there, and I try to keep my hosted databases in good condition.  However, the server they have my stuff on is starting to act wonky.  Periodic outages mostly, but when my Wallabag installation started throwing all sorts of errors and generally not working right, that got under my skin in a fairly big hurry.  I reinstalled.  I upgraded to the latest stable release.  I installed the latest commit from the source code repository.  401 and 500 errors as far as the eye could see whenever I tried to do anything, regardless of what I did.

In a misguided attempt to figure out what was going on, I bit the bullet and installed PHP on one of my servers, along with all of the usual dependencies, and tried to replicate my setup at Dreamhost.  While that was a bit tricky and took some debugging, I eventually got it to work.  It was getting my data out of the sorta-kinda-broken setup that proved troublesome.

Wallabag has a built-in data export feature which will allow you to take copies of your stored articles with you in a variety of formats.  Every time I tried to use it, however, I kept getting an HTTP 500 error - something was broken either on the server or in my install.  Poking around in the server logs showed that Wallabag kept using up all of the memory available to it.  In hindsight, perhaps I pushed things a little too far.  This isn't uncommon on shared hosting, by the way - if you're going to turn people loose on your server you have to have some restrictions in place to keep everyone from accidentally killing the box with all of their stuff.  It just so happened that the php-fpm daemons they'd allocated me were just a little short of memory to get my data dumped.  I wasn't able to get anything using the console commands, either.  Now what?

Statement of the problem:

  • Get my data out of the messed up Wallabag install.
  • Figure out how to parse said data.
  • Pump said data into my new Wallabag install.

When it comes to extracting data from recalcitrant applications, this isn't my first rodeo.  Off to the API documentation.  Step one: Get an API token to authenticate me to Wallabag:

[drwho @ leandra:(3) ~]$ curl -s "https://wallabag.virtadpt.net/oauth
    /v2/token?grant_type=password&client_id=AAAAA&client_secret=BBBBB
    &username=CCCCC&password=DDDDD"

{"access_token":"EEEEE","expires_in":3600,"token_type":"bearer","scope":null,
    "refresh_token":"FFFFF"}

(Secrets redacted, of course.)

Step two: Get (or, GET) Wallabag entries out of the database into a format I can use (namely, JSON for reimportation) using the /api/entries.json API call.  Due to the sheer volume of articles I have stashed away, I had to do it in batches of 500 or so at a time and store each batch into a separate file:

{14:22:13 @ Thu May 30}
[drwho @ leandra:(3) ~]$ curl -X GET "https://wallabag.virtadpt.net
    /api/entries.json?access_token=EEEEE&perPage=500&page=1"
    > wallabag-1.json

{14:22:13 @ Thu May 30}
[drwho @ leandra:(3) ~]$ curl -X GET "https://wallabag.virtadpt.net
    /api/entries.json?access_token=EEEEE&perPage=500&page=2"
    > wallabag-2.json

{14:22:13 @ Thu May 30}
[drwho @ leandra:(3) ~]$ curl -X GET "https://wallabag.virtadpt.net
    /api/entries.json?access_token=EEEEE&perPage=500&page=3"
    > wallabag-3.json

...

You get the picture.  Repeat until done.
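
For what it's worth, the paging above could be scripted rather than typed by hand.  The following is a minimal sketch, not what was actually run for this post.  It assumes the JSON response carries a "pages" field holding the total page count (if that field is missing, it stops after the first page that comes back less than full).  The names fetch_page() and dump_all_pages() are made up for illustration, and the HTTP request uses Python's standard library to stay dependency-free.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://wallabag.virtadpt.net"


def fetch_page(token, page, per_page=500):
    """GET one page of /api/entries.json and return the decoded JSON."""
    query = urllib.parse.urlencode(
        {"access_token": token, "perPage": per_page, "page": page})
    url = BASE_URL + "/api/entries.json?" + query
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def dump_all_pages(fetch, prefix="wallabag", per_page=500):
    """Write every page to <prefix>-<page>.json and return the filenames.

    fetch(page, per_page) must return the decoded JSON for one page.
    """
    filenames = []
    page = 1
    while True:
        batch = fetch(page, per_page)
        filename = "{}-{}.json".format(prefix, page)
        with open(filename, "w") as outfile:
            json.dump(batch, outfile)
        filenames.append(filename)
        # Assumption: "pages" holds the total number of pages.  If it is
        # missing, stop after a page that comes back less than full.
        items = batch.get("_embedded", {}).get("items", [])
        last_page = batch.get("pages")
        if last_page is not None:
            if page >= last_page:
                break
        elif len(items) < per_page:
            break
        page += 1
    return filenames
```

Calling dump_all_pages(lambda page, per_page: fetch_page("EEEEE", page, per_page)) would write wallabag-1.json, wallabag-2.json, and so on.  Taking the fetcher as an argument keeps the loop easy to try out against canned data before pointing it at a struggling server.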

Now the fun part: Re-importing those batches of articles into the new Wallabag install.  As it turned out, I couldn't just import them with Wallabag's built-in file reader because they're not in the right format.  So, after taking a look at one of the files in a text editor, I decided to hack together a quick bit of Python that would read in a batch of JSON, take each article, pick out the information needed to populate a new entry, and fire it over to the server with the /api/entries.json API call, using POST instead of GET this time.  It's short, it's sweet, it got the job done, and it could probably be turned into a utility with a little bit of spit and polish.

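import json
import requests

# Log into the new Wallabag server and get an API access token.
parameters = {}
parameters["grant_type"] = "password"
parameters["client_id"] = "GGGGG"
parameters["client_secret"] = "HHHHH"
parameters["username"] = "IIIII"
parameters["password"] = "JJJJJ"
request = requests.post("https://new.wallabag.virtadpt.net/oauth/v2/token",
    data=parameters)

# Every API request after this carries the token in an Authorization header.
headers = {}
headers["Authorization"] = "Bearer " + request.json()["access_token"]

# Load one batch of exported articles.  Replace X with the batch number.
with open("wallabag-X.json") as file:
    articles = json.load(file)

# Re-create each article on the new server.
for article in articles["_embedded"]["items"]:
    parameters = {}
    parameters["url"] = article["url"]
    parameters["title"] = article["title"]
    parameters["content"] = article["content"]
    if article["tags"]:
        # The export holds tags as objects; send their labels as a
        # comma-separated string (assumed to be what the API expects).
        parameters["tags"] = ",".join(
            tag["label"] if isinstance(tag, dict) else str(tag)
            for tag in article["tags"])
    parameters["published_at"] = article["updated_at"]
    parameters["origin_url"] = article["url"]
    request = requests.post("https://new.wallabag.virtadpt.net/api/entries.json",
        data=parameters, headers=headers)
    print("Uploaded article: " + parameters["url"])
    print()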

I had to watch the server logs in real time from another window because Wallabag's access tokens are only good for an hour, and when they expire you get 401 errors again.  And, of course, no new articles imported.  So, I basically ran that blob of code over and over again until all of my data was in the new Wallabag server.  It's not my finest work but it got the job done.  The new Wallabag server seems to be working pretty well across two different web browsers and two Android devices.  Much more reliably, too - when you're not contending with an unknown number of other users, things seem to work smoothly.  Who knew.
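
The babysitting described above could, in principle, be automated: when an upload comes back 401, get a fresh token and retry that article once.  Here is a minimal sketch of that idea, not the code that was actually run.  It assumes a 401 always means an expired token and that any 2xx status means success.  The names load_articles(), upload_all(), get_token, and post_entry are hypothetical; the two callables stand in for the requests.post() calls in the script above.

```python
import glob
import json


def load_articles(pattern="wallabag-*.json"):
    """Yield every article from every exported batch file."""
    for filename in sorted(glob.glob(pattern)):
        with open(filename) as infile:
            batch = json.load(infile)
        for article in batch["_embedded"]["items"]:
            yield article


def upload_all(articles, get_token, post_entry):
    """Upload each article, re-authenticating once whenever a 401 comes back.

    get_token() returns a fresh access token.
    post_entry(article, token) uploads one article and returns the HTTP
    status code of the response.
    """
    token = get_token()
    uploaded = 0
    for article in articles:
        status = post_entry(article, token)
        if status == 401:
            # Assumption: a 401 means the hour-long token has expired.
            token = get_token()
            status = post_entry(article, token)
        if not 200 <= status < 300:
            raise RuntimeError("Upload of {} failed with HTTP {}".format(
                article["url"], status))
        uploaded += 1
    return uploaded
```

With requests, post_entry would wrap the POST to /api/entries.json and return the response's status_code, and get_token would wrap the POST to /oauth/v2/token.  Stopping on the first failure that isn't a 401 keeps a broken run from silently skipping articles.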