How to Speak Parse-ltongue: A Lightning-Fast Tutorial

AWK-Ward

Alright, let's get the basics out of the way. cat is used to get the contents of a file (or many, as it actually stands for conCATenate).

But if it's something like a CSV file, we might want to extract columns. And that's where awk comes in. You pass it arguments, tell it what you want, and what file to work on (unless it's piped in).

Let's get some free data to play with.

Grab the customers-100.csv file from here.

Let's say we want full names out of this. We need to see what fields are in there. You could cat it, but you don't need all the lines!

Instead let's use head. head takes a -n argument for how many lines you want (it has a cousin called tail too).

head -n 3 customers-100.csv

So we want the third and fourth columns. Let's do it.

head -n 3 customers-100.csv | awk -F, '{print $3, $4}' | less

Breaking it down: I only want to run it on the first few lines, so I pipe in from head. Then we tell awk with -F, that commas are F-ield separators, and we tell it to print the 3rd and 4th columns. The comma between $3 and $4 matters: it makes awk put a space between the two, where a bare $3 $4 would glue the first and last names together. To keep our terminals tidy, we view the output in less.

It works... but there's a problem. The first line is column names, and those appear too.

We kinda want everything but the first, we want the... tail of it.

head -n 3 customers-100.csv | awk -F, '{print $3, $4}' | tail -n +2 | less

So we use tail, and tell it (with the plus) to start from the second line.

And now with the whole file...

awk -F, '{print $3, $4}' customers-100.csv | tail -n +2 | less

And that's the basics. There are other tools like cut, sed, etc.

And you can feel free to experiment, but for more complex stuff you can also dip into Python.
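
As a bridge to the Python half, here is a rough equivalent of the awk pipeline above using the standard csv module. Treat it as a sketch: the sample rows are made up for illustration (the real file has more columns and different values), and full_names is a hypothetical helper name.

```python
import csv
import io

# Made-up sample standing in for customers-100.csv (only four columns shown).
SAMPLE = """Index,Customer Id,First Name,Last Name
1,abc123,Ada,Lovelace
2,def456,Grace,Hopper
"""


def full_names(lines):
    """Yield 'First Last' from columns 3 and 4, skipping the header row."""
    reader = csv.reader(lines)
    next(reader, None)  # same job as tail -n +2: drop the column names
    for row in reader:
        # awk counts fields from 1, Python from 0, so $3 and $4 become
        # row[2] and row[3]
        yield f"{row[2]} {row[3]}"


names = list(full_names(io.StringIO(SAMPLE)))
print(names)
```

With the real file you would pass open("customers-100.csv", newline="") instead of the io.StringIO sample. Unlike a plain split on commas, csv.reader also copes with quoted fields that contain commas.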

Let's Make Soup

For Python, we are going to do something crazy: turn the Hacker News front page into a useful dataset.

Hacker News

First, let's set up. In your (assumed Linux) environment, run python -m venv hndata

You'll have a nice folder. cd into it, and then run source bin/activate. That's your playground.

We are going to use a lovely and useful library called BeautifulSoup.

Have a look at the docs.

Following the docs, we install it with pip install beautifulsoup4.

Wikipedia has a great template at https://en.wikipedia.org/wiki/Beautiful_Soup_%28HTML_parser%29.

Note that you will also need to install requests with pip install requests for the template to work.

Now look at the source for the Hacker News site, and understand the structure. For our purposes, we will want the title, and number of comments and points for each.

There are a number of ways to skin this cat, but I chose to go by table rows (<tr>). Each entry has a tr with class 'athing submission', and that has the title in it.

And then the next tr has the points and number of comments.

Let's see if we can grab all the titles for a trial run.

import requests
from bs4 import BeautifulSoup

url = "https://news.ycombinator.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

trs = soup.find_all("tr", attrs={"class":"athing submission"})

for tr in trs:
    heading = tr.find("span", attrs={"class":"titleline"})
    print(heading.a.string)
        

Try and follow the structure of the document and see how I am using a bit of finding with navigating down the trees to get what I want. The docs for BeautifulSoup will help too.

And now let's do the same, but grab the comments and points... Oh, and I've added comments to my code to make it easier to understand. Throw it into a good text editor to have it syntax highlighted.

import requests
from bs4 import BeautifulSoup

url = "https://news.ycombinator.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

trs = soup.find_all("tr", attrs={"class":"athing submission"})

for tr in trs:
    heading = tr.find("span", attrs={"class":"titleline"})
       
    # So we go on to the next table row (tr) with next_sibling
    the_rest = tr.next_sibling
    
    # the score is a span with class score, but we only care about the first
    # part (the score is "NUMBER points", so we split the string and take
    # the first piece)
    score = the_rest.find("span", attrs={"class":"score"}).string.split()[0]
    
    # the comments are inside a span with class subline
    subline = the_rest.find("span", attrs={"class":"subline"})
    
    # And they are the 4th (counting starts from zero in python) anchor tag
    # we also don't want the word 'comments' so we split and take the first
    # piece
    comments = subline.find_all("a")[3].string.split()[0]
    
    # look up f-strings: they make it easier to print pretty!
    print(f"{heading.a.string} -> {score} points and {comments} comments")
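
If f-strings are new to you, here is a small sketch of what that last line is doing, plus one extra trick. The values are made up for illustration; in the script they come from the page.

```python
# Made-up values standing in for one scraped entry.
title = "Example post"
score = "128"
comments = "42"

# Anything inside {} is evaluated and dropped into the string.
line = f"{title} -> {score} points and {comments} comments"
print(line)

# A format spec after a colon sets width and alignment:
# < pads on the right (left-aligned), > pads on the left (right-aligned).
row = f"{title:<20}|{int(score):>5}|{int(comments):>4}"
print(row)
```

The second form is handy once you print many entries and want the numbers to line up in columns.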
        

But all this is useless without a data structure. Let's use a list (look up python lists if you aren't familiar):

import requests
from bs4 import BeautifulSoup

url = "https://news.ycombinator.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

trs = soup.find_all("tr", attrs={"class":"athing submission"})
data = [] # Make the empty list before the loop, so every pass appends to the same list instead of starting over

for tr in trs:
    heading = tr.find("span", attrs={"class":"titleline"})
       
    # So we go on to the next table row (tr) with next_sibling
    the_rest = tr.next_sibling
    
    # the score is a span with class score, but we only care about the first
    # part (the score is "NUMBER points", so we split the string and take
    # the first piece)
    score = the_rest.find("span", attrs={"class":"score"}).string.split()[0]
    
    # the comments are inside a span with class subline
    subline = the_rest.find("span", attrs={"class":"subline"})
    
    # And they are the 4th (counting starts from zero in python) anchor tag
    # we also don't want the word 'comments' so we split and take the first
    # piece
    comments = subline.find_all("a")[3].string.split()[0]
    
    # Uh oh! I found out if there are no comments, it just has the word 
    # discuss, and that will break when we convert it to a number.
    # Let's make it zero when that happens.
    if comments.startswith("discuss"): comments = "0"
    
    # We use int() to turn our string of numbers into what Python
    # understands to be an integer
    data.append([heading.a.string, int(score), int(comments)])

print(data)
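
The split-then-int() dance above can also be pulled out into one small function, which makes the odd cases easier to handle in a single place. This is only a sketch: parse_count is a hypothetical helper, and treating anything non-numeric as zero is an assumption that happens to suit the "discuss" case.

```python
def parse_count(text):
    """Turn strings like '123 points' or '45 comments' into an int.

    Anything that does not start with a number (for example 'discuss'),
    and None or an empty string, is treated as zero. That fallback is an
    assumption made for this sketch.
    """
    parts = text.split() if text else []
    if parts and parts[0].isdecimal():
        return int(parts[0])
    return 0


print(parse_count("123 points"))
# split() with no arguments splits on any whitespace, so a non-breaking
# space between the number and the word is handled too
print(parse_count("45\xa0comments"))
print(parse_count("discuss"))
print(parse_count(None))
```

The None check is there because BeautifulSoup's .string gives back None when a tag holds more than one child, so it is worth guarding against before calling split().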
    

Now when that is printed out, it is U-G-L-Y compared to before. But you can do so much with data that's correctly stored! My last gift to you, padawan, is this program showing off so many different things.

import requests
from bs4 import BeautifulSoup

url = "https://news.ycombinator.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

trs = soup.find_all("tr", attrs={"class":"athing submission"})
data = [] # Make the empty list before the loop, so every pass appends to the same list instead of starting over

for tr in trs:
    heading = tr.find("span", attrs={"class":"titleline"})
       
    # So we go on to the next table row (tr) with next_sibling
    the_rest = tr.next_sibling
    
    # the score is a span with class score, but we only care about the first
    # part (the score is "NUMBER points", so we split the string and take
    # the first piece)
    score = the_rest.find("span", attrs={"class":"score"}).string.split()[0]
    
    # the comments are inside a span with class subline
    subline = the_rest.find("span", attrs={"class":"subline"})
    
    # And they are the 4th (counting starts from zero in python) anchor tag
    # we also don't want the word 'comments' so we split and take the first
    # piece
    comments = subline.find_all("a")[3].string.split()[0]
    
    # Uh oh! I found out if there are no comments, it just has the word 
    # discuss, and that will break when we convert it to a number.
    # Let's make it zero when that happens.
    if comments.startswith("discuss"): comments = "0"
    
    # We use int() to turn our string of numbers into what Python
    # understands to be an integer
    data.append([heading.a.string, int(score), int(comments)])

# How many titles start with a certain letter?
letters = dict() # empty python dictionary
for entry in data:
    title = entry[0] # the title is the first thing in each entry
    # we'll work with upper case. We are assuming that the titles are all starting with letters,
    # and that could bite us in the arse... But here it will work and just add an entry for whatever number or symbol instead.
    first_letter = title[0].upper()
    # We'll try and update the entry, but if this is the first time
    # for that entry, we'll catch the error and just set the value
    # (you can't update an entry with a += if it doesn't exist, see?)
    try:
        letters[first_letter] += 1
    except KeyError:
        letters[first_letter] = 1
 
# How about average points? That's easy!
total = 0
for entry in data:
    total += entry[1]

avg = total / len(data)
print(f"Average points for a post is {avg}")

# Most points? Easy too.
highest = 0
title = ""
for entry in data:
    points = entry[1]
    if points > highest:
        highest = points
        title = entry[0] 
print(f"\"{title}\" had the most points ({highest}).")
# Mind, we could have ties...
# Oh, and the backslashes let me insert quotes without Python thinking that
# marks the beginning or end of the string

# now, for something more impressive, let's find the most common starting letter and note ties
greatest = 0
greatest_l = ""
tied = False
for key,value in letters.items():
    if value > greatest:
        greatest = value
        greatest_l = key
        tied = False
    elif value == greatest:
        tied = True
    else:
        continue

print(f"{greatest_l} is the most common starting letter with {greatest} entries.")
if tied:
    print("But it was tied with others.")
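
Once the loops above make sense, the standard library can do the same bookkeeping in fewer lines. The following is a sketch of that alternative, not a correction of the script: the sample rows are made up, but they follow the same [title, points, comments] shape as data.

```python
from collections import Counter
from statistics import mean

# Made-up rows in the same [title, points, comments] shape as data.
sample = [
    ["Show HN: A tiny parser", 120, 45],
    ["Awk in twenty minutes", 80, 0],
    ["sed tricks I use daily", 100, 17],
]

# Counter treats missing keys as zero, so no try/except is needed.
letters = Counter(title[0].upper() for title, _, _ in sample)

average = mean(points for _, points, _ in sample)

# max() with a key function picks the row with the most points.
top_title, top_points, _ = max(sample, key=lambda row: row[1])

# most_common(1) gives the (letter, count) pair with the highest count.
letter, count = letters.most_common(1)[0]
tied = sum(1 for c in letters.values() if c == count) > 1

print(f"Average points for a post is {average}")
print(f'"{top_title}" had the most points ({top_points}).')
print(f"{letter} is the most common starting letter with {count} entries.")
if tied:
    print("But it was tied with others.")
```

Note that mean() and max() both raise an error on an empty list, so with real scraped data it is worth checking that anything was found first.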
    

Bye!

I hope that was semi-useful. Take what is, discard the rest. Go off into tangents, explore. Nothing I've done here is even the fastest way to do things, or most efficient. Part of it is I am trying to make it easier to read, but part of it is that when you need data quickly, you have to write a dirty script and get it done. You can go back and polish it later if you think you or someone else will need it in future. Or at least lie to yourself that you will... hehe.