GunplaHub Update #5 - Automation and SEO

GunplaHub is now deployed at https://gunplahub.com!

Table of Contents

1) Intro, Technical Decisions, What's Next

2) Slugs, Obfuscated Primary Keys

3) Deployment and Serving Images

4) Refactor and Document Often

5) (Current) Automation and SEO

Automation

Scraping Prices and Updating Gunpla Database

Price tracking is a core feature of GunplaHub. I scrape prices with Scrapy, and with cron I can schedule the scrapers to run at certain times throughout the day.
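As an illustration, the crontab entries look something like this (the paths, spider name, and times here are hypothetical, not my actual schedule):

# run the price spider every morning and evening
0 6 * * * cd /home/me/gunplahub-scrapers && scrapy crawl prices >> /var/log/gunplahub-scrape.log 2>&1
0 18 * * * cd /home/me/gunplahub-scrapers && scrapy crawl prices >> /var/log/gunplahub-scrape.log 2>&1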

Until a month ago, I was using BeautifulSoup for scraping. It worked well enough, but Scrapy let me simplify my code. BeautifulSoup on its own is fine for my purposes, since I don't need to crawl links within a page; I only supply a URL and scrape it with selectors. Scrapy, however, can crawl URLs with only minor configuration, as the sketch below shows.
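GunplaHub doesn't actually need a crawler, but for comparison, this is roughly all the setup a link-following Scrapy spider takes (the site and link pattern below are hypothetical):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# Hypothetical crawler: follow product links from a listing page and scrape each one
class StoreSpider(CrawlSpider):
    name = 'store'
    start_urls = ['https://example-store.com/gunpla']
    rules = (
        Rule(LinkExtractor(allow=r'/toys/'), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {'url': response.url}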

My Scrapy code looks like this:

# Both methods live inside a scrapy.Spider subclass
def start_requests(self):
    # This meta endpoint returns a list of urls to parse
    urls = requests.get(-my-meta-endpoint-).json()
    for url in urls:
        yield scrapy.Request(url, callback=self.parse)

def parse(self, response):
    price = response.xpath(...)
    if price:
        # POST the price scraped from the page to the db endpoint
        requests.post(-add-to-db-)
    else:
        alert_me()

Scrapy also abstracts configuration out into a settings file, whereas my BeautifulSoup scraper kept its configs and edge-case handling all in one long class.
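For reference, a Scrapy settings.py gathers options like throttling and retries in one place. The values below are only illustrative, not GunplaHub's actual configuration:

# settings.py -- illustrative values only
BOT_NAME = 'gunplahub'
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 2          # wait between requests to the same site
CONCURRENT_REQUESTS = 4
RETRY_TIMES = 3
DEFAULT_REQUEST_HEADERS = {'Accept-Language': 'en'}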

Going back to the spider code above, you can see that I am grabbing the list of urls through a meta endpoint. All I have to do to start scraping new pages is add them to my database. No additional configuration needed!

Generating Sitemap

NuxtJS has a sitemap generator module that GunplaHub uses. All I had to do was supply an endpoint that exposes every gunpla in GunplaHub's db and configure the sitemap module to build routes from it.

The config in nuxt.config.js looks like:

sitemap: {
  hostname: 'https://gunplahub.com',
  gzip: true,
  routes (callback) {
    axios.get('my-meta-endpoint')
      .then((res) => {
        let routes = res.data.map(gunpla => '/gunpla/' + gunpla.url.split('/toys/')[1] + '/' + gunpla.slug)
        callback(null, routes)          
      })
      .catch(callback)
  }
},

Thus, GunplaHub's sitemap is created and served on the fly whenever it is requested. I could also pre-render sitemap.xml and serve it as a static file from AWS S3 or Google Cloud Storage, but I haven't thought about how to automate that yet.
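If I do automate the pre-rendering, one possible approach (just a sketch, not implemented, and the bucket name is made up) is a small scheduled script that fetches the live sitemap and copies it to a bucket:

import boto3
import requests

# Fetch the sitemap that Nuxt builds on the fly...
sitemap = requests.get('https://gunplahub.com/sitemap.xml')

# ...and store a static copy in S3 (bucket name is hypothetical)
s3 = boto3.client('s3')
s3.put_object(
    Bucket='gunplahub-static',
    Key='sitemap.xml',
    Body=sitemap.content,
    ContentType='application/xml',
)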

Either way, with this setup I never have to generate sitemaps by hand.

Resizing Images

When I first started GunplaHub, I didn't think too much about the images I scraped. All I did was scrape, store, and serve them.

A website or service chasing speed gains should compress its images and keep each payload as small as possible. In GunplaHub's case, images in search results aren't displayed at full size, yet the full-sized variants were being delivered to the screen anyway.

Say a user sees 20 items on a search results page on average, with each full-sized image taking up 100kb. That's about 2MB for each search results page! On search pages, "thumbnails" are capped at 256 pixels, a size where even the most eagle-eyed user won't notice the finer details.

This is where resizing comes in handy. For pages that don't need high-definition image variants, we can use a small variant (aka a thumbnail) and cut the kilobytes the page has to load. GunplaHub uses 256px image variants on the search pages, which only take up to 15kb each. With this, we've reduced roughly 2MB worth of photos to only about 300kb. If the user wants to see the full-sized variant, we can redirect them to it. The point is we don't have to serve huge images when they aren't needed. See the example below, where the unoptimized and optimized images are both locked at 256px height.

Before:

MG F91 Ver 2.0 Full Image

After:

MG F91 Ver 2.0 Thumbnail

The difference in quality is noticeable, but it comes with huge savings in bandwidth.
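The thumbnails themselves can come from any image tool. GunplaHub's actual resize step uses ImageMagick (shown further below for the larger variants), but as a rough Python equivalent, a Pillow pass that generates 256px search thumbnails might look like this (the output directory is hypothetical):

import os
from PIL import Image

SRC = './media/image'        # scraped originals
DST = './optimized/thumbs'   # hypothetical thumbnail directory

os.makedirs(DST, exist_ok=True)
for name in os.listdir(SRC):
    img = Image.open(os.path.join(SRC, name)).convert('RGB')
    img.thumbnail((256, 256))  # fit within 256px, preserving aspect ratio
    img.save(os.path.join(DST, name), 'JPEG', quality=85)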

I've also scraped full images that are around 4MB, and even for the full-size variants I don't want a user spending that much bandwidth to load a single image. For these I made a compromise and resized the larger images down to a manageable ~100kb each.

The script I used to compress images looks like:

# compress quality and resize extremely large images to 728px
mogrify -path ./optimized/image -sampling-factor 4:2:0 -strip -quality 85 -interlace JPEG -colorspace RGB -resize 728x728\> ./media/image/*

This script was inspired by Google PageSpeed Insights and this StackOverflow post. Running this script turns this 1.4MB 954x1200 image into a 131kb 579x728 image. Image heights are locked at 728px in the comparison below.

Before:

MG F91 Ver 2.0 Full Image

After:

MG F91 Ver 2.0 Optimized Image

Again, there's a noticeable difference in photo quality. You can play with the -quality flag in the script above to find the best quality-to-size tradeoff. I thought this was a reasonable compromise, and I think GunplaHub can live with it.
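If you'd rather measure that tradeoff than eyeball it, a few lines of Pillow can re-encode one image at several quality levels and print the resulting sizes (the input filename here is hypothetical):

import os
from PIL import Image

img = Image.open('./media/image/mg-f91.jpg').convert('RGB')  # hypothetical file
for quality in (95, 85, 75, 65):
    out = f'/tmp/quality-{quality}.jpg'
    img.save(out, 'JPEG', quality=quality)
    print(quality, round(os.path.getsize(out) / 1024), 'kb')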

SEO

With NuxtJS, I can use the head property to set meta tags for each page: title, description, Open Graph properties, etc. This is what the head property looks like on each gunpla's page.

head () {
  return {
    title: this.gunpla.name + ' - ' + this.gunpla.grade,
    titleTemplate: '%s - GunplaHub',
    meta: [
      { hid: 'description', name: 'description', content: this.gunpla.name + ' prices from gunplahub.com' }, 
      { property: 'og:title', content: this.gunpla.name },                                                                         
      { property: 'og:url', content: 'https://gunplahub.com' + this.$route.fullPath },
      { property: 'og:type', content: 'article' },
      { property: 'og:description', content: this.gunpla.name + ' prices from gunplahub.com' },                                    
      { property: 'og:image', content: this.gunpla.image }
    ]
  }
}

On Slack, GunplaHub's link previews look like this: GunplaHub Slack Preview

And on Twitter: GunplaHub Twitter Preview

Open Graph makes sharing individual gunpla pages really pretty.

What's Next

GunplaHub's immediate backlog includes UI bug fixes, adding tests for frontend code, manually fixing some data, etc. A few of the bigger features I'm looking at are user login, 'liking' gunplas, and price alerts.

Thanks for reading this far! It's been an up-and-down journey and I've learned a lot. Over the past few months I've been casually using GunplaHub myself, and I think it's a useful service for looking up historical prices.

Scraping is automated, but adding new gunpla isn't. Adding new gunpla is a 1-2 hour effort each week, so it's fairly manageable for a one-man operation. I'm constantly trying to add Amazon links to gunpla and will soon look into other hobby sites like HobbyLink.

tags: gunplahub