Python URL mangling

Posted Thursday, June 30 2011 by jonathan

TL;DR: using query_dict = urlparse.parse_qs(query_string, True) and query_string = urllib.urlencode(query_dict, True) will round-trip query strings the way you probably want them to.


A quick quandary.

>>> import urlparse
>>> import urllib

>>> source_url = "http://netjunky.com/as/a?matter=of&fact="
>>> parsed_url = urlparse.urlparse(source_url)
>>> query = urlparse.parse_qs(parsed_url.query)
>>> result_url = urlparse.ParseResult(parsed_url.scheme,
            parsed_url.netloc,
            parsed_url.path,
            parsed_url.params,
            urllib.urlencode(query),
            parsed_url.fragment).geturl()

>>> source_url == result_url
False

Huh?

>>> source_url
'http://netjunky.com/as/a?matter=of&fact='
>>> result_url
'http://netjunky.com/as/a?matter=%5B%27of%27%5D'

Well, that’s odd.

>>> query
{'matter': ['of']}

>>> urllib.urlencode(query)
'matter=%5B%27of%27%5D'

Since any URL query parameter may appear multiple times, python implements parse_qs() to return a list for every value, even when there is only one (and even if you’re not expecting that). urlencode(), in turn, converts each value to a string with str(), so a single-element list is encoded as its repr — which causes all sorts of problems.
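To see both halves of the problem in one place, here’s a sketch of the same behavior in Python 3, where these functions now live in urllib.parse:

```python
from urllib.parse import parse_qs, urlencode

# parse_qs wraps every value in a list, even singletons:
query = parse_qs("matter=of&fact=so")
# {'matter': ['of'], 'fact': ['so']}

# urlencode stringifies each value, so a list is encoded as its repr:
encoded = urlencode({'matter': ['of']})
# 'matter=%5B%27of%27%5D' -- i.e. "matter=['of']", percent-encoded
```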

Now, you could go about converting all single-element lists into strings:

>>> dict( (k, v if len(v)>1 else v[0] )
            for k, v in query.iteritems() )
{'matter': 'of'}
>>> urllib.urlencode(dict( (k, v if len(v)>1 else v[0] )
            for k, v in query.iteritems() ))
'matter=of'

That’s annoying and feels like a hack. Also, if a parameter does appear twice in your query string, you’re still stuck with a list value.

It turns out that urlencode() takes an optional second parameter, doseq. If True, urlencode() generates a separate key=value pair for each value in the respective list.
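doseq also handles genuinely repeated parameters, which the single-element-unwrapping hack above can’t. A Python 3 sketch (same functions, now in urllib.parse):

```python
from urllib.parse import parse_qs, urlencode

# A parameter that really does appear twice:
query = parse_qs("tag=python&tag=urls")
# {'tag': ['python', 'urls']}

# With doseq=True, each element of the list becomes its own key=value pair:
encoded = urlencode(query, doseq=True)
# 'tag=python&tag=urls'
```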

>>> urllib.urlencode(query, True)
'matter=of'

Sweet.

>>> result_url = urlparse.ParseResult(parsed_url.scheme,
            parsed_url.netloc,
            parsed_url.path,
            parsed_url.params,
            urllib.urlencode(query, True),
            parsed_url.fragment).geturl()
>>> source_url == result_url
False

Hmmm. What happened?

>>> source_url
'http://netjunky.com/as/a?matter=of&fact='

>>> result_url
'http://netjunky.com/as/a?matter=of'

Ah, we lost that query string parameter with no value.

It turns out that parse_qs() ALSO takes an optional second argument, keep_blank_values, that preserves blank values.
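The difference is easy to see in isolation (again sketched in Python 3, where parse_qs lives in urllib.parse):

```python
from urllib.parse import parse_qs

# By default, parameters with blank values are silently dropped:
without = parse_qs("matter=of&fact=")
# {'matter': ['of']}

# keep_blank_values=True keeps them, with '' as the value:
with_blanks = parse_qs("matter=of&fact=", keep_blank_values=True)
# {'matter': ['of'], 'fact': ['']}
```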

>>> source_url = "http://netjunky.com/as/a?matter=of&fact="
>>> parsed_url = urlparse.urlparse(source_url)
>>> query = urlparse.parse_qs(parsed_url.query, True)

>>> result_url = urlparse.ParseResult(parsed_url.scheme,
            parsed_url.netloc,
            parsed_url.path,
            parsed_url.params,
            urllib.urlencode(query, True),
            parsed_url.fragment).geturl()
>>> source_url == result_url
True

But Wait!

The output of urlencode() depends on the iteration order of its input dictionary, so the source_url == result_url comparison above is unreliable in general.

Why is urlencode() non-deterministic? Because its input, a dictionary, does not guarantee the order in which its keys are iterated.

Implementing a deterministic URL query string round-trip algorithm is left as an exercise for the reader. (Hint: urlparse.parse_qsl())
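One possible solution, sketched with Python 3’s urllib.parse: parse_qsl() returns an ordered list of (key, value) pairs instead of a dict, preserving both parameter order and duplicates, and urlencode() accepts that list of pairs directly.

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

def reencode(url):
    """Round-trip a URL's query string deterministically."""
    parts = urlparse(url)
    # parse_qsl preserves parameter order and duplicates as a list of pairs
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    return urlunparse(parts._replace(query=urlencode(pairs)))

result = reencode("http://netjunky.com/as/a?matter=of&fact=")
# 'http://netjunky.com/as/a?matter=of&fact='
```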

Your Thoughts?