UTD Directory Scraping

by Alex Guerra, 09-02-2013

I was working on a script to scrape the UTD online student directory for fun the other day (dataset without emails available here) and it actually turned into a pretty neat algorithms problem. Maybe you'll find it interesting.

The Problem

The online directory takes a search query and either returns the information of some number of students, the information of no students with a message telling you as much, or the information of exactly 50 students with a message stating that your search is too broad. The search will return any names that contain the query as a substring.

As an added constraint, I want the program to be heavily throttled (1 req/s) so as not to raise any suspicions, which means that reducing the total number of requests is priority number one.

Requesting the pages and extracting the data is easy and not particularly interesting. The only problem then is how to write a program that exhaustively searches the directory using only the limited interface provided and in as few requests as possible.
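As a sketch of that uninteresting part, the throttling can live in a small decorator so the rest of the code never has to think about pacing. The search body here is just a stub (the real version would fetch and parse the directory page); only the pacing logic is the point:

```python
import time

def throttled(seconds):
    # Enforce a minimum delay between calls so the scraper
    # never sends more than one request per `seconds` interval.
    def wrap(fn):
        last_call = [0.0]
        def inner(*args, **kwargs):
            wait = last_call[0] + seconds - time.time()
            if wait > 0:
                time.sleep(wait)
            last_call[0] = time.time()
            return fn(*args, **kwargs)
        return inner
    return wrap

@throttled(1.0)
def search(query):
    # Stub: the real version would request the directory
    # page for `query` and extract the student rows.
    return 'results for ' + query
```

Back-to-back calls to search now automatically space themselves a second apart.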

Queries as Trees

So step one is to convert the problem domain into code. Easy enough: the only action I can perform on the directory is encapsulated in one method with one argument, search with a query. I also know it'll return either that there were no results, that there were some results, or that the search was too broad, along with a set of extracted student data.

def search(query):
    # Make the request, extract the data...
    if there were no results:
        return NO_RESULTS, set()
    elif there were some results:
        return SOME_RESULTS, students
    elif the search was too broad:
        return TOO_BROAD, students

Plus, since the interface I have is so simple, I know the brunt of the work is going to be in generating queries and choosing which order to execute them.

The obvious brute force a-z method is where I start trying to work it out:

'a'

Good start. What happens next? Well, I run search('a') and it returns one of three values.

  • NO_RESULTS: A name containing 'a' doesn't exist.
  • SOME_RESULTS: The results returned have every name containing 'a'. This also lets us know that more specific queries (say 'aa', 'ab', 'ac') are unnecessary as they will not return anything new.
  • TOO_BROAD: The results returned are a subset of the names containing 'a'. We'll need to do more specific queries to uncover them all.

Using this we can imagine our query scheme as a big 26-ary tree; NO_RESULTS and SOME_RESULTS are both leaf nodes and TOO_BROAD is a non-leaf node. With this in mind, creating a recursive method to generate the tree is easy.

import string

def check(query):
    result, new_students = search(query)
    if result in (NO_RESULTS, SOME_RESULTS):
        return new_students
    else:
        new_students = set()
        for letter in string.lowercase:
            new_students |= check(query+letter)
        return new_students

students = set()
for letter in string.lowercase:
    students |= check(letter)

We pick a letter and then recursively continue down the tree until we hit a leaf. Problem solved.
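The whole scheme can be exercised against a toy in-memory directory. Everything here is invented for illustration: six fake names, a result cap of 3 standing in for the real 50, and a search stub instead of an HTTP request (string.ascii_lowercase is the Python 3 spelling of string.lowercase):

```python
import string

NO_RESULTS, SOME_RESULTS, TOO_BROAD = range(3)
LIMIT = 3  # stand-in for the directory's 50-result cap

# Toy directory so the recursion can run without any requests
DIRECTORY = {'alan', 'alice', 'albert', 'bob', 'bonnie', 'carol'}

def search(query):
    matches = {name for name in DIRECTORY if query in name}
    if not matches:
        return NO_RESULTS, set()
    elif len(matches) < LIMIT:
        return SOME_RESULTS, matches
    else:
        # The real site would return only the first 50
        return TOO_BROAD, set(sorted(matches)[:LIMIT])

def check(query):
    result, new_students = search(query)
    if result in (NO_RESULTS, SOME_RESULTS):
        return new_students
    found = set()
    for letter in string.ascii_lowercase:
        found |= check(query + letter)
    return found

students = set()
for letter in string.ascii_lowercase:
    students |= check(letter)

# Every name is recovered despite the result cap
assert students == DIRECTORY
```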

Keepin it DRY

The program currently duplicates a lot of work. Suppose it's about to check the string "cba", and that the string "ba" has already been checked and returned no results. Since "cba" is more specific than "ba", it's impossible for "cba" to return anything new.

Because the search method throttles itself to one request per second, almost any local operation will be faster than actually sending out a request, which means I need to find a way to not run queries whose substrings have already returned a leaf node.

I reason through this by hypothetically running the query "abcde". The obvious solution is to keep a set of discovered leaf nodes and check every substring of "abcde" against it.

However, if we've gotten this far, we know that the parent of "abcde", "abcd", returned TOO_BROAD. By extension, every substring of "abcd" will also have returned TOO_BROAD, since each is less specific than "abcd" and so matches at least as many names. To check for leaves among the substrings of "abcde", then, the only subqueries we actually need to look up in our leaf set are the ones created by the most recent addition to the query, namely the suffixes "bcde", "cde", "de" and "e".

for r in range(len(query)-1):
    subquery = query[r+1:]
    if subquery in leafs:
        # A suffix already hit a leaf, so this
        # search is unnecessary; skip the query
        break
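Wrapped up as a predicate, with a leaf set seeded by hand for the "cba"/"ba" example above:

```python
leafs = {'ba'}  # suppose "ba" already came back empty

def redundant(query):
    # True when some proper suffix of the query is a known
    # leaf, meaning this query can't return anything new
    return any(query[r+1:] in leafs for r in range(len(query) - 1))

assert redundant('cba')      # suffix "ba" is a known leaf
assert not redundant('abc')  # nothing rules this one out
```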

Order

In the previous section we were still working under the recursive, depth-first tree method for generating queries that we discovered at the beginning. However, because that program exhaustively searches each branch before moving on, it's guaranteed that in our previous example 'bcde', 'cde', 'de' and 'e' will not be in the leaf set yet, as they come after 'abcde' in the generation order.

The solution is to search breadth-first: increment the depth of all trees at the same time, that is, only check the direct descendants of the previous round's queries on each iteration. That's kinda tough to put into words, so here's an example.

students = set()
leafs = set()
queries = ['']
while queries:
    non_leaf_descendants = []
    for query in queries:
        for letter in string.lowercase:
            descendant = query + letter
            # Skip the request if any suffix already hit a leaf
            if any(descendant[r+1:] in leafs
                   for r in range(len(descendant)-1)):
                continue
            result, new_students = search(descendant)
            if result == TOO_BROAD:
                non_leaf_descendants.append(descendant)
            else:
                leafs.add(descendant)
                students |= new_students
    queries = non_leaf_descendants

Starting from the empty query means the first pass searches the single letters 'a' through 'z', just like the recursive version did, and every completed search lands in the leaf set in time to prune later levels.
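To check the leveled scheme end to end, here's a self-contained run against a toy directory. The names and a cap of 3 (instead of 50) are invented for illustration; the suffix check from the previous section is folded in, and the walk starts from the empty query so single letters get searched too:

```python
import string

NO_RESULTS, SOME_RESULTS, TOO_BROAD = range(3)
LIMIT = 3  # stand-in for the 50-result cap
DIRECTORY = {'alan', 'alice', 'albert', 'bob', 'bonnie', 'carol'}
requests = []  # record of every query actually sent

def search(query):
    requests.append(query)
    matches = {n for n in DIRECTORY if query in n}
    if not matches:
        return NO_RESULTS, set()
    elif len(matches) < LIMIT:
        return SOME_RESULTS, matches
    else:
        return TOO_BROAD, set(sorted(matches)[:LIMIT])

students, leafs = set(), set()
queries = ['']
while queries:
    non_leaf_descendants = []
    for query in queries:
        for letter in string.ascii_lowercase:
            q = query + letter
            # Suffix already hit a leaf: nothing new to find
            if any(q[r+1:] in leafs for r in range(len(q) - 1)):
                continue
            result, new_students = search(q)
            if result == TOO_BROAD:
                non_leaf_descendants.append(q)
            else:
                leafs.add(q)
                students |= new_students
    queries = non_leaf_descendants

assert students == DIRECTORY
assert 'ac' not in requests  # pruned: suffix 'c' was already a leaf
```

Every name is still recovered, and queries like 'ac' never go out at all because 'c' already came back as a leaf on the first pass.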

Conclusion

And that's where this ends. I'm sure there are more efficient options out there if you're willing to utilize outside information (name dictionaries so that you're not beginning on queries that obviously will return TOO_BROAD, frequencies of different letter combinations, etc), but I'm hoping I've hit upon a pretty decent language-naive algorithm, and it was good enough for me.

The actual implementation that I wrote had to consider a few more things, most notably spaces (people with extremely common first and last names only show up in combination). There's also still technically an edge case: if 50 people have the exact same name, none of them will ever be added. But I'm thinking that's an outlier :) Hope you enjoyed reading.