
Extracting Subreddit Names From URLs

In my research on the platform Discord, I had gathered a list of URLs that mentioned the platform during the beginning of Discord’s rise in popularity. I wanted to investigate how Discord grew on Reddit.

After gathering those URLs with a Google Scraper I had modified, I needed to extract the links from that data to get a list of the subreddits where Discord was being mentioned.

The data I collected looked like this:

[{
  "effective_query": "",
  "id": "1721",
  "no_results": "False",
  "num_results": "10",
  "num_results_for_query": "About 2,150,000 results (0.33 seconds)\u00a0",
  "page_number": "9",
  "query": "discord.gg site:reddit.com",
  "requested_at": "2017-11-21 06:58:48.987283",
  "requested_by": "localhost",
  "results": [
    {
      "domain": "www.reddit.com",
      "id": "2665",
      "link": "https://www.reddit.com/r/HeistTeams/comments/6543q7/join_heistteams_offical_discord_server_invite/",
      "link_type": "results",
      "rank": "1",
      "serp_id": "1721",
      "snippet": "http://discord.gg/gtao. ... The good thing about discord is if you're like me and your Mic don't work there's a .... Sub still kinda active, but the discord is much more.",
      "time_stamp": "Apr 13, 2017 - 100+ posts - \u200e100+ authors",
      "title": "Join HeistTeam's Offical Discord Server! Invite: discord.gg/gtao - Reddit",
      "visible_link": "https://www.reddit.com/r/HeistTeams/.../join_heistteams_offical_discord_server_invite..."
    },
    {
      "domain": "www.reddit.com",
      "id": "2666",
      "link": "https://www.reddit.com/r/NeebsGaming/comments/6q3wlk/the_official_neebs_gaming_discord/",
      "link_type": "results",
      "rank": "2",
      "serp_id": "1721",
      "snippet": "Ive changed the link in the sidebar over to the official discord or you can join it by following this link here. http://discord.gg/neebsgaming. Here are the rules for\u00a0...",
      "time_stamp": "Jul 28, 2017 - 5 posts - \u200e4 authors",
      "title": "The Official Neebs Gaming Discord! : NeebsGaming - Reddit",
      "visible_link": "https://www.reddit.com/r/NeebsGaming/.../the_official_neebs_gaming_discord/"
    },

I first needed to extract just the links from the “results” part of the data. So, I used a simple nested for loop to pull out each link and append it to a list for later use:

#Load Json
data = json.load(open('discordgg/November2015December2016.json'))

#Get Only Links from JSON
links=[]
for a in data:
	for b in a['results']:
		links.append(b['link'])
		#pprint(b['link'])
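
This assumes every query entry in the JSON actually has a “results” key. If some of the scraped pages came back empty, a slightly more defensive version of the same loop (just a sketch, using the field names shown above) would skip those entries:

import json

#Load the same JSON file, then skip entries without a "results" key
with open('discordgg/November2015December2016.json') as f:
	data = json.load(f)

links = []
for entry in data:
	for result in entry.get('results', []):
		link = result.get('link')
		if link:
			links.append(link)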

Then, I needed to extract the part of each URL that corresponds to the subreddit name. For example, for the URL “https://www.reddit.com/r/NeebsGaming/comments/6q3wlk/the_official_neebs_gaming_discord/”, I wanted just “NeebsGaming”. Luckily, all of the links I collected from Reddit followed the same pattern, with the subreddit name sitting between “/r/” and the next “/”, so I simply split each URL on “/” and selected the index that holds the subreddit name:

#Split each link on "/" and grab the subreddit name
subReddits=[]

for y in links:
	name = y.split('/')[4]
	subReddits.append(name)
	pprint(name)
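
Splitting on “/” works here because every collected link has the same shape. For messier input, a small regex that captures whatever sits between “/r/” and the next “/” is a bit more forgiving; this is a sketch of that alternative, not part of the original script:

import re

#Capture the subreddit name between "/r/" and the following "/"
SUBREDDIT_RE = re.compile(r'reddit\.com/r/([^/]+)/')

def subreddit_from_url(url):
	match = SUBREDDIT_RE.search(url)
	return match.group(1) if match else None

#Example:
#subreddit_from_url('https://www.reddit.com/r/NeebsGaming/comments/6q3wlk/the_official_neebs_gaming_discord/')
#returns 'NeebsGaming'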

Code in its totality:

import json
from pprint import pprint

#Load Json
data = json.load(open('discordgg/November2015December2016.json'))

#Get Only Links from JSON
links=[]
for a in data:
	for b in a['results']:
		links.append(b['link'])
		#pprint(b['link'])

#Split each link on "/" and grab the subreddit name
subReddits=[]

for y in links:
	name = y.split('/')[4]
	subReddits.append(name)
	pprint(name)
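
With the subreddit names collected, a couple of extra lines show which communities come up most often in the results. This isn’t in the script above, but collections.Counter makes the tally trivial:

from collections import Counter

#Tally how many scraped results point at each subreddit
counts = Counter(subReddits)
for name, count in counts.most_common(10):
	print(name, count)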

Right now, I’m using the Reddit API to get short descriptions of those subreddits and then using a simple bag-of-words model to categorize them.
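
As a rough sketch of that next step (this assumes the PRAW library and placeholder credentials, neither of which appears in the code above), pulling a subreddit’s short description looks something like this:

import praw

#Placeholder credentials; register an app at reddit.com/prefs/apps for real ones
reddit = praw.Reddit(client_id='CLIENT_ID',
                     client_secret='CLIENT_SECRET',
                     user_agent='subreddit categorizer')

for name in set(subReddits):
	print(name, reddit.subreddit(name).public_description)

Stay tuned!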