Ethan Chiu My personal blog

Extracting Subreddit Names From URLs

As part of my research on Discord, I gathered a list of URLs that mentioned the platform during the beginning of its rise in popularity. I wanted to investigate how Discord grew on Reddit.

After gathering those URLs with a Google scraper I had modified, I needed to pull the links out of the scraped data and reduce them to a list of the subreddits where Discord was being mentioned.

The data I collected looked like this:

{
  "effective_query": "",
  "id": "1721",
  "no_results": "False",
  "num_results": "10",
  "num_results_for_query": "About 2,150,000 results (0.33 seconds)\u00a0",
  "page_number": "9",
  "query": "",
  "requested_at": "2017-11-21 06:58:48.987283",
  "requested_by": "localhost",
  "results": [
    {
      "domain": "",
      "id": "2665",
      "link": "",
      "link_type": "results",
      "rank": "1",
      "serp_id": "1721",
      "snippet": " ... The good thing about discord is if you're like me and your Mic don't work there's a .... Sub still kinda active, but the discord is much more.",
      "time_stamp": "Apr 13, 2017 - 100+ posts - \u200e100+ authors",
      "title": "Join HeistTeam's Offical Discord Server! Invite: - Reddit",
      "visible_link": ""
    },
    {
      "domain": "",
      "id": "2666",
      "link": "",
      "link_type": "results",
      "rank": "2",
      "serp_id": "1721",
      "snippet": "Ive changed the link in the sidebar over to the official discord or you can join it by following this link here. Here are the rules for\u00a0...",
      "time_stamp": "Jul 28, 2017 - 5 posts - \u200e4 authors",
      "title": "The Official Neebs Gaming Discord! : NeebsGaming - Reddit",
      "visible_link": ""
    }
  ]
}

First, I needed to extract only the links from the “results” part of the data. I used a simple nested for loop to do that, appending each link to an initially empty list that is used later on:

import json

#Load Json
data = json.load(open('discordgg/November2015December2016.json'))

#Get Only Links from JSON
links = []
for a in data:
	for b in a['results']:
		#Append each result's link to the list
		links.append(b['link'])
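
The same extraction can also be wrapped in a small function. This is a minimal sketch, assuming the file holds a list of result pages shaped like the sample above. The name `extract_links`, the choice to skip entries with an empty or missing link, and the sample URL are all assumptions made for illustration, not part of the original script.

```python
import json


def extract_links(pages):
    """Collect every non-empty 'link' value from a list of result pages."""
    links = []
    for page in pages:
        # .get() tolerates pages that have no "results" key
        for result in page.get("results", []):
            link = result.get("link")
            if link:
                links.append(link)
    return links


# Tiny in-memory stand-in for the scraped file (made-up URL)
sample = json.loads("""
[
  {"id": "1721", "results": [
    {"rank": "1", "link": "https://www.reddit.com/r/NeebsGaming/comments/abc123/example/"},
    {"rank": "2", "link": ""}
  ]},
  {"id": "1722"}
]
""")

print(extract_links(sample))
```

Returning a list from a function makes it easy to run the same step over several scraped files and concatenate the results.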

Next, I needed to extract just the part of each URL that corresponds to the subreddit name. For example, for the url “”, I wanted to extract just “NeebsGaming”. Luckily, all of the links I collected from Reddit followed the same pattern, with the subreddit name appearing between “/r/” and the next “/”, so I used a regex to split each link and then selected the piece at the right index:

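#Process data using regex to get subreddits (needs "import re")
subreddits = []
for y in links:
	parts = re.split('/r/', y)  #one possible pattern: split on "/r/"
	if len(parts) > 1:
		subreddits.append(parts[1].split('/')[0])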
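
An alternative to splitting and indexing is a regex with a capture group, which returns the subreddit name directly and makes non-matching links easy to detect. The sketch below is only one way to express the "between /r/ and the next /" rule. `SUBREDDIT_RE` and `subreddit_from_url` are hypothetical names, and the URLs are made up for illustration.

```python
import re

# Capture the text between "/r/" and the next "/" (or "?", "#", end of string)
SUBREDDIT_RE = re.compile(r"/r/([^/?#]+)")


def subreddit_from_url(url):
    """Return the subreddit name in a Reddit URL, or None if there is none."""
    match = SUBREDDIT_RE.search(url)
    return match.group(1) if match else None


# Made-up URLs for illustration only
urls = [
    "https://www.reddit.com/r/NeebsGaming/comments/abc123/example_post/",
    "https://www.reddit.com/r/discordapp",
    "https://www.reddit.com/user/someone/",
]
print([subreddit_from_url(u) for u in urls])
```

Returning `None` for links without a "/r/" segment avoids the index error that a plain split-and-index approach would raise on such links.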

Code in its totality:

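import json
import re

#Load Json
data = json.load(open('discordgg/November2015December2016.json'))

#Get Only Links from JSON
links = []
for a in data:
	for b in a['results']:
		#Append each result's link to the list
		links.append(b['link'])

#Process data using regex to get subreddits
subreddits = []
for y in links:
	parts = re.split('/r/', y)  #one possible pattern: split on "/r/"
	if len(parts) > 1:
		subreddits.append(parts[1].split('/')[0])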

Right now, I’m using the Reddit API to get short descriptions of those subreddits, which I then categorize with a simple bag-of-words algorithm. Stay tuned!
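
To give a rough idea of what a bag-of-words categorizer can look like, here is a minimal sketch. It does not call the Reddit API. The categories, keywords, and description text are all invented for illustration, and `bag_of_words` and `categorize` are hypothetical names rather than the code used in this project.

```python
from collections import Counter
import re

# Hypothetical categories and keywords, chosen only for illustration
CATEGORY_KEYWORDS = {
    "gaming": {"game", "games", "gaming", "players", "server"},
    "technology": {"software", "programming", "code", "hardware"},
}


def bag_of_words(text):
    """Lowercase the text and count how often each word occurs."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def categorize(description):
    """Pick the category whose keywords occur most often, or None if none do."""
    bag = bag_of_words(description)
    scores = {
        category: sum(bag[word] for word in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Made-up description, not real Reddit API output
print(categorize("A community for players of the game to find a server"))
```

Word order is ignored entirely here: only the counts of known keywords decide the category, which is what makes this a bag-of-words approach.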