Translating Categorizr from PHP to Python

Published by Orde Saunders

Whilst working on building a layered UI I was looking for a Python server side user agent detection solution. Categorizr provided the functionality I was looking for but is written in PHP and, as I could not find a Python port, I wrote my own.

PHP implementation

The reference PHP implementation of Categorizr is tightly coupled to the $_SERVER and $_SESSION superglobals and adds a number of functions to the global namespace, a model that does not translate easily to Python. Whilst there is a cross-check page for the reference implementation, there are no systematic unit tests.

JSON test data

In order to translate it to another language I wanted a set of tests that I could run against both the reference implementation and my own, so that I could verify mine behaved in the same way. As these tests would need to run in both PHP and Python they had to be defined in a language-agnostic way, so I settled on using JSON to define the inputs and expected outputs.

Categorizr uses a cascade of regular expressions that are frequently logically combined so, rather than testing against real user agents, I wrote a set of test strings that were each designed to test a specific element of the user agent detection. Each test string is paired with the category we expect it to result in and a flag, which is "i" when the test should be run in a case-insensitive manner:

  ["Xbox", "tv", "i"],
  ["iPad", "tablet", "i"],
  ["Windows NT  ", "desktop", "s"]
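
As a rough illustration of how data in this shape can be consumed, the sketch below loads a few rows and expands the "i" rows into lower-case and upper-case variants. The helper name expand_cases and the exact variants generated are assumptions made for the example, not a description of the original test runner.

```python
import json

# Hypothetical test data in the [user agent, expected category, flag] shape.
RAW = """[
  ["Xbox", "tv", "i"],
  ["iPad", "tablet", "i"],
  ["Windows NT", "desktop", "s"]
]"""


def expand_cases(rows):
    """Yield (user_agent, expected) pairs, adding case variants for "i" rows."""
    for user_agent, expected, flag in rows:
        yield user_agent, expected
        if flag == "i":
            # Assumption: "i" means the match must survive a change of case.
            for variant in (user_agent.lower(), user_agent.upper()):
                if variant != user_agent:
                    yield variant, expected


cases = list(expand_cases(json.loads(RAW)))
for user_agent, expected in cases:
    print(user_agent, "->", expected)
```

Because the data is plain JSON, the same file can be read by a PHP runner and a Python runner without either needing to know about the other.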


To run the tests against the reference implementation I selected PHPUnit as this is what I have used in the past to test projects written in PHP. However, taking the test data from JSON and turning this into PHPUnit test cases turned out to be more challenging than I anticipated.

PHPUnit expects its tests to be structured as a sub-class of the main PHPUnit_Framework_TestCase class and each test is expected to be a method named test*. Whilst it would be possible to write one test method that looped over the test data and checked each user agent, this would result in PHPUnit reporting a single test that either passed or failed, which would provide little useful debugging information.

The ideal solution would be to have a number of individual test methods but, as PHP doesn't make it easy to monkey patch classes to add in the test* methods by iterating over the JSON test data, I decided to go with code generation. Although less than ideal as an approach, it meant that I could generate an arbitrary number of tests from a single template.

The initial version of the tests produced an error report that only listed the expected device class and the resulting device class; it didn't give any information about the user agent string being tested. The reporting format for failed tests in PHPUnit gives the name of the failing test* method as the main way of diagnosing which test is failing but, as I am using generated code, I hadn't given each test a meaningful name, merely an incremental counter to keep the names unique. To get around this I changed the tests to assert the equality of two arrays, each holding the user agent being tested alongside the device class, which means the failure report includes the user agent being tested:

1) CategorizrTest::test_276
Failed asserting that two arrays are equal.
--- Expected
+++ Actual
@@ @@
 Array (
     0 => 'Windows NT'
-    1 => 'desktop'
+    1 => 'mobile'
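
The generation step can be sketched as follows. This is not the original generator: it is written in Python for illustration, the templates are invented, and it assumes the reference implementation exposes a categorizr() function that reads $_SERVER['HTTP_USER_AGENT']. Resetting $_SESSION in each test is likewise a precaution added for the sketch, and the generated file would still need to include the reference implementation.

```python
import json

# Template for one generated PHPUnit test method. `categorizr()` is assumed
# to be the entry point of the reference implementation.
METHOD_TEMPLATE = """\
    public function test_{index}()
    {{
        $_SESSION = array();
        $_SERVER['HTTP_USER_AGENT'] = {ua};
        $this->assertEquals(
            array({ua}, {expected}),
            array({ua}, categorizr())
        );
    }}
"""

CLASS_TEMPLATE = """\
<?php
class CategorizrTest extends PHPUnit_Framework_TestCase
{{
{methods}}}
"""


def php_quote(text):
    """Return text as a single-quoted PHP string literal."""
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


def generate_test_class(rows):
    """Build PHP source with one test_N method per row of test data."""
    methods = []
    for index, (user_agent, expected, _flag) in enumerate(rows, start=1):
        methods.append(
            METHOD_TEMPLATE.format(
                index=index,
                ua=php_quote(user_agent),
                expected=php_quote(expected),
            )
        )
    return CLASS_TEMPLATE.format(methods="\n".join(methods))


rows = json.loads('[["Xbox", "tv", "i"], ["Windows NT", "desktop", "s"]]')
php_source = generate_test_class(rows)
print(php_source)
```

Putting the user agent into both arrays costs nothing in the passing case and makes the failing case self-describing, which is what the report above shows.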

With the test data and test runner set up I was able to run the tests and check the test data was correct. As I was attempting to benchmark the reference implementation at this stage, any test failure meant that the test data needed to be adjusted rather than the code.

Python implementation

With a set of test data that would enable me to ensure that I was able to match the reference implementation I started on the Python implementation.

As mentioned above, the reference implementation used a number of PHP idioms (such as superglobals and adding functions to the global namespace) which cannot be directly translated to Python so a new, Pythonic, approach was required.

  • The code was moved to a module. This could be done as a single file but incorporating it as a file in a directory means it can be easily included as a git submodule.
  • The detection chain was moved from a single fall-through to a number of functions. This made the code easier for me to manage and enables the testing of each individual function to isolate issues if required.
  • The options were moved from being variables in the main function to keyword arguments. This makes it possible to set them at runtime and to have several instances at once that have different options, again useful for testing.
  • The results were returned as a class with boolean properties indicating the device class. The class was also retained as a string which means it can be tested as a simple string comparison.
  • The Python regular expression engine has the ability to pre-compile regular expressions and this is the recommended approach. As there were a significant number of regular expressions I ensured they were compiled when the module was imported and stored them in a dictionary for ease of reference.
  • A number of the regular expressions amounted to a case-insensitive string match, which can be achieved with a combination of Python's str.lower and str.find methods, so I created a helper function for this. This reduces the number of pre-compiled regular expressions required and makes the code easier for me to read.

I also added an option to classify robots as mobile. With a fully mobile first approach there may be SEO advantages to serving mobile content to search engine crawlers.
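
To make the structure described above concrete, here is a minimal sketch. It is not the actual port: the patterns are a tiny illustrative subset, the names (PATTERNS, contains, Result, detect, robots_as_mobile, tablets_as_desktop) are hypothetical, and treating robots as desktop by default and falling back to mobile are assumptions made for the example.

```python
import re

# Compiled once at import time. Illustrative patterns only; the real
# cascade of regular expressions is much larger.
PATTERNS = {
    "tv": re.compile(r"xbox|smart-?tv", re.IGNORECASE),
    "tablet": re.compile(r"ipad", re.IGNORECASE),
    "desktop": re.compile(r"Windows NT|Macintosh"),
}


def contains(haystack, needle):
    """Case-insensitive substring test without a regular expression."""
    return haystack.lower().find(needle.lower()) != -1


def is_robot(user_agent):
    # Hypothetical crawler tokens, used only for this sketch.
    return any(contains(user_agent, token) for token in ("googlebot", "bingbot"))


def is_tv(user_agent):
    return PATTERNS["tv"].search(user_agent) is not None


def is_tablet(user_agent):
    return PATTERNS["tablet"].search(user_agent) is not None


def is_desktop(user_agent):
    return PATTERNS["desktop"].search(user_agent) is not None


class Result:
    """Detection result: boolean properties plus the category as a string."""

    def __init__(self, category):
        self.category = category
        self.is_desktop = category == "desktop"
        self.is_tablet = category == "tablet"
        self.is_tv = category == "tv"
        self.is_mobile = category == "mobile"

    def __str__(self):
        return self.category


def detect(user_agent, robots_as_mobile=False, tablets_as_desktop=False):
    """Categorise a user agent; options are passed as keyword arguments."""
    if is_robot(user_agent):
        # Assumption: robots count as desktop unless the option is set.
        category = "mobile" if robots_as_mobile else "desktop"
    elif is_tv(user_agent):
        category = "tv"
    elif is_tablet(user_agent):
        category = "desktop" if tablets_as_desktop else "tablet"
    elif is_desktop(user_agent):
        category = "desktop"
    else:
        # Assumption: anything unrecognised is treated as mobile.
        category = "mobile"
    return Result(category)


print(detect("Mozilla/5.0 (iPad; CPU OS 5_0 like Mac OS X)"))
print(detect("Googlebot/2.1", robots_as_mobile=True))
```

Because the options are ordinary keyword arguments, two calls with different settings can sit side by side in the same test without touching any shared state.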

For unit testing the Python implementation the obvious choice was PyUnit (the unittest module in the standard library). This is similar to PHPUnit in that it looks for methods whose names start with test, but with Python it is possible to monkey patch these test methods into a class by iterating over the JSON input data. PyUnit will use the test method's docstring when reporting test failures so it was possible to use this to pass the user agent being tested.
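
The monkey patching approach can be sketched as below. The detect function is a stand-in so the example runs on its own, and the method names and docstring wording are choices made for the sketch rather than those of the real test suite.

```python
import json
import unittest

# Inline sample in the same [user agent, expected category, flag] shape.
TEST_DATA = json.loads('[["Xbox", "tv", "i"], ["iPad", "tablet", "i"]]')


def detect(user_agent):
    """Stand-in for the function under test (hypothetical)."""
    lowered = user_agent.lower()
    if "xbox" in lowered:
        return "tv"
    if "ipad" in lowered:
        return "tablet"
    return "mobile"


class CategorizrTest(unittest.TestCase):
    """Test methods are attached to this class from the JSON data below."""


def make_test(user_agent, expected):
    # A factory function gives each test its own user_agent and expected
    # values instead of sharing the loop variables.
    def test(self):
        self.assertEqual(detect(user_agent), expected)

    # unittest shows the first line of the docstring in failure reports.
    test.__doc__ = "%r should be categorised as %r" % (user_agent, expected)
    return test


for index, (user_agent, expected, _flag) in enumerate(TEST_DATA):
    setattr(CategorizrTest, "test_%03d" % index, make_test(user_agent, expected))
```

Saved as a module, this can be run with python -m unittest, and each row of the JSON data is reported as its own test.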

In addition to the user agent tests I wrote a number of other tests to ensure the helper functions, overrides and results class behaved as expected.

With the test suite in place and the detection code ported I was then able to run the tests and modify the detection code to ensure it matched the reference implementation.

Categorizr cross-check page

The initial test data used user agents specifically designed to test the detection code but, as mentioned above, Categorizr has a cross-check page that provides a list of real user agents and the corresponding categorisation from the reference implementation. This is provided in HTML format, so I wrote a basic HTML page holding a jQuery script that pulls in the cross-check page and parses it into JSON. The JSON can then be copied into a test data file and run through the tests to further ensure the two implementations match.
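
The same conversion could also be done outside the browser. The sketch below swaps the jQuery script for Python's html.parser and assumes, purely for illustration, that the page is a table whose rows hold the user agent in the first cell and the category in the second; the real page's markup may differ. The "s" flag given to every row is also an assumption.

```python
import json
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of the first two <td> cells of every table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if len(self._row) >= 2:
                self.rows.append(self._row[:2])
            self._row = None
            self._cell = None


# Hypothetical stand-in for the downloaded cross-check page.
SAMPLE = """<table>
<tr><th>User agent</th><th>Category</th></tr>
<tr><td>Mozilla/5.0 (iPad; CPU OS 5_0 like Mac OS X)</td><td>tablet</td></tr>
</table>"""

collector = RowCollector()
collector.feed(SAMPLE)

# Assumption: real user agents are tested exactly as written, hence "s".
test_data = [[user_agent, category, "s"] for user_agent, category in collector.rows]
print(json.dumps(test_data, indent=2))
```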

When I ran this data through the reference implementation it produced two failures, where Android user agents were categorised as mobile rather than tablet. As this test data is intended to match the reference implementation, I modified the test data to match the implementation rather than modifying the implementation to match the test data.

The Python implementation passed this test data first time, which was particularly gratifying and indicated that the translation had been successful.

GitHub Resources