An Introduction to Object Recognition Testing

Many applications act as a black box; the user types things and the application responds. "Driving" the software with a tool means either driving it the way a user would, or taking on a serious engineering project to separate the layers of the application so that each one can be tested.

It's no surprise that many automation projects pick the user interface layer. Not only is it immediately accessible, but it also corresponds directly to the user experience, where, say, testing just the database layer might not. At this point, almost every popular tool offers a recording feature, and recorders are a lot more approachable for less-technical testers.

A tester who cannot write code at all (and might not want to) can get started with UI automation quickly; just click the record button and have the tool capture the browser actions. Running those tests for the first time is a quick lesson in how hard it can be to make automating a user interface useful. A layout that shifts by just a couple of pixels, a text field or button that is not ready when the test tries to access it, and a refactor onto a new UI technology are all ways to create failures that need re-recording, or at least "maintenance" fixes.

Getting consistently useful results is tough, but focusing on an object recognition strategy can get you closer.

Object Location

In order to do the hard job of driving a browser, the test tool has to find objects, like a button, a text field or a drop-down selector. There are several ways to identify these elements. You might use pixel coordinates on a screen, XPath (a path-style locator syntax), or the element's label or ID. Each method comes with its own benefits and drawbacks, and each fits some situations better than others.

Pixel coordinates are probably the most primitive way to find an element. This is what happens with most of the basic record and playback tools. When a tester opens a browser, clicks on a record button, and starts interacting with a user interface, this is usually how elements are found on playback. The resulting script looks like this:

click(176, 543)
type(300, 425, "justin")
click(300, 450)

Each pair of numbers there represents an X and Y coordinate on the monitor where the recording was made. This strategy can work just fine on legacy programs that are mostly in maintenance mode. The user interfaces and screen sizes in these products tend to be unchanging. But as soon as a user interface refactor moves elements around on the screen, or even if the browser gets resized, everything stops working. That click at 176, 543 is now clicking on empty space instead of setting focus on a text field. One change has a cascading effect on everything that comes after.
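
To make that failure concrete, here is a minimal sketch in Python. It does not drive a real browser; the Widget class, the widget_at function, and the layout numbers are all hypothetical stand-ins, chosen only so that the recorded click at 176, 543 has something to hit.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Widget:
    """A rectangular UI element on a simulated screen (hypothetical stand-in)."""
    name: str
    x: int
    y: int
    width: int
    height: int

    def contains(self, px: int, py: int) -> bool:
        """True when the point (px, py) falls inside this widget's rectangle."""
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def widget_at(screen: list[Widget], px: int, py: int) -> Optional[str]:
    """Return the name of the widget under the point, or None for empty space."""
    for widget in screen:
        if widget.contains(px, py):
            return widget.name
    return None


# Layout at recording time: the click at (176, 543) lands on the text field.
recorded_layout = [Widget("username_field", 150, 530, 200, 30)]

# Layout after a redesign moves the same field 60 pixels further down.
redesigned_layout = [Widget("username_field", 150, 590, 200, 30)]

hit_before = widget_at(recorded_layout, 176, 543)
hit_after = widget_at(redesigned_layout, 176, 543)

print(hit_before)  # username_field
print(hit_after)   # None
```

The recorded script has not changed at all, yet the same coordinates now hit nothing, and every step that depended on that field having focus fails along with it.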

XPath is slightly less prone to failure. With XPath the tool views the page as a directory of objects, similar to a file system on a computer. Pointing a file browser at C:\Users\justin\Documents\TEST_FILE.txt is a very precise way to find a file and it will work every single time ... until the file moves. XPath looks a little different, more like "//div/table", but the principle is the same: the descending elements match the structure of the page. Both approaches locate a thing by the structure around it, so both break when that structure changes. Software that is important enough to have an automation project is probably getting regular user interface changes like new pages, but also refactoring and redesigning of existing pages. Each time this happens, every reference to a relocated element will need to be updated to point to the new path.
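
The same breakage can be sketched in Python without a browser, using the limited XPath support in the standard library's xml.etree.ElementTree. The two page snippets and the path below are invented for illustration; a real tool would evaluate the path against the live page.

```python
import xml.etree.ElementTree as ET

# Hypothetical page as it looked when the test was written.
ORIGINAL_PAGE = """
<html><body>
  <div>
    <table><tr><td><button>Submit</button></td></tr></table>
  </div>
</body></html>
"""

# The same button after a redesign wraps the table in a new <section>.
REDESIGNED_PAGE = """
<html><body>
  <div>
    <section>
      <table><tr><td><button>Submit</button></td></tr></table>
    </section>
  </div>
</body></html>
"""

# A structural locator: every step mirrors the nesting of the page.
STRUCTURAL_PATH = "./body/div/table/tr/td/button"


def locate(page_source, path):
    """Parse the page and return the first element matching path, or None."""
    root = ET.fromstring(page_source.strip())
    return root.find(path)


found_before = locate(ORIGINAL_PAGE, STRUCTURAL_PATH)
found_after = locate(REDESIGNED_PAGE, STRUCTURAL_PATH)

print(found_before is not None)  # True
print(found_after is not None)   # False
```

The button is still on the page and still says Submit; only the route to it changed, and that is enough to break a locator that spells out the route.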

Object ID is the most stable way to find user interface elements. These IDs are a lot like an address. If the submit button lives at 23 Enigma Ln in a developing neighborhood, houses, schools, and parks can all be built around it, but the button can still be found. If the button's owner gets a bonus at work and wants to expand by buying the lot next door and adding a fancy drop list, that button still has the same address. The challenge here is a social one, not a technical one. It is easy to say that every element in the user interface of web-based software should have an ID from the start, but automation projects often don't start at the same time as development. These user interface projects are usually a mix of elements with different types of IDs and some elements with no ID at all. Making this technique useful is an exercise in convincing the development manager and the developers themselves of the Boy Scout rule of leaving every campsite cleaner than it was found. Note that this has another tradeoff: if the mailman notices the house looks different, he doesn't care; he delivers the mail (the tool clicks the button) without noticing that the text, font size, or location has changed.

It takes more than knowing how to identify UI objects to make a lasting UI automation project; you'll also want an architecture strategy.
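
Both halves of that tradeoff show up in a small Python sketch, again using xml.etree.ElementTree in place of a browser. The pages and the "submit" ID are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical page at the time the test was written.
ORIGINAL_PAGE = """
<html><body>
  <div><table><tr><td>
    <button id="submit">Submit</button>
  </td></tr></table></div>
</body></html>
"""

# After a redesign: the button moved out of the table and its label changed.
REDESIGNED_PAGE = """
<html><body>
  <div><section>
    <button id="submit">Send it!</button>
  </section></div>
</body></html>
"""


def find_by_id(page_source, element_id):
    """Return the first element whose id attribute matches, wherever it lives."""
    root = ET.fromstring(page_source.strip())
    return root.find(f".//*[@id='{element_id}']")


button_before = find_by_id(ORIGINAL_PAGE, "submit")
button_after = find_by_id(REDESIGNED_PAGE, "submit")

# The lookup survives the move...
print(button_after is not None)  # True

# ...and it also stays silent about the changed label.
print(button_before.text)  # Submit
print(button_after.text)   # Send it!
```

The lookup keeps working after the redesign, which is the point of using IDs. It also says nothing about the new label, so a check that cares about the visible text has to assert on it explicitly.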

Inventory and Architecture

User interface automation projects are easy to start. A tester can open up Firefox, click the record button on Selenium IDE, and record a procedural scenario in the browser. After exporting that recording to a programming language such as Ruby or Python, and adding some assertions to verify data and UI elements, an automated check is born. That new automated check is a problem child, though. Every recorded check is future tech debt, and the bill comes due when the UI changes in the test environment.

The conventional wisdom around automating the user interface is that it is doomed to failure. Even if someone were able to consistently locate elements in the browser and record a few scenarios, they would still end up with a pile of scripts that need to be updated for every little button change or text field rename. Times have changed. Most UI automation tooling now offers object libraries at a minimum. There is also the object library's more sophisticated cousin, the PageObject.

Object Library:

The object library is an inventory system for elements in a user interface. In a very simple test script, even in a recording, each element the tester wants to interact with has to be located and given a name. If the script needs to click a save button, that button has to be found by searching through the DOM and assigned to a variable. If that object is only going to be used one time, ever, this isn't a big deal. Problems come into focus when there are multiple tests that touch that Save button. And this is almost always the case.

Creating an object library is a way to remove that duplication. Rather than searching for a button and defining it every time a new test is made, each object on a webpage can be defined once in a separate file. In plain terms, a lookup like driver.findElement("saveToCart").click(); no longer has to be spelled out in every test; each test refers to the element by the name the library gives it. The obvious benefit is that there is less typing and less potential for mistakes. The big deal, the real reason to do this, is to isolate in one place the code changes that have to be made when an element changes. In software engineering terms, this is called the DRY principle, or Don't Repeat Yourself.
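
One way this could look in Python is sketched below. FakeDriver and FakeElement are hypothetical stand-ins for a real browser driver so the example runs on its own, and the library names and the "cancelOrder" ID are invented; only "saveToCart" comes from the line of code above.

```python
class FakeElement:
    """Hypothetical stand-in for a UI element; it only counts its clicks."""

    def __init__(self, element_id):
        self.element_id = element_id
        self.click_count = 0

    def click(self):
        self.click_count += 1


class FakeDriver:
    """Hypothetical stand-in for a browser driver that finds elements by ID."""

    def __init__(self, element_ids):
        self._elements = {eid: FakeElement(eid) for eid in element_ids}

    def find_by_id(self, element_id):
        return self._elements[element_id]


# The object library: the one place that maps readable names to locators.
# If the Save button's ID changes, this is the only line that needs editing.
OBJECT_LIBRARY = {
    "save_button": "saveToCart",
    "cancel_button": "cancelOrder",  # assumed ID, for illustration only
}


def get_object(driver, name):
    """Look an element up by its library name instead of its raw locator."""
    return driver.find_by_id(OBJECT_LIBRARY[name])


# Two separate scenarios touch the Save button; neither repeats the locator.
def add_item_scenario(driver):
    get_object(driver, "save_button").click()


def add_two_items_scenario(driver):
    get_object(driver, "save_button").click()
    get_object(driver, "save_button").click()


driver = FakeDriver(["saveToCart", "cancelOrder"])
add_item_scenario(driver)
add_two_items_scenario(driver)

save_clicks = get_object(driver, "save_button").click_count
print(save_clicks)  # 3
```

A plain dictionary is about the simplest form an object library can take. Commercial tools usually wrap the same idea in a repository file or a GUI, but the payoff is identical: one edit per changed element.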

PageObject:

The PageObject pattern takes this idea several steps further. If the object library is a caveman just discovering the wheel, PageObjects are someone flying down the freeway on a motorcycle. Think about one of the simpler parts of most web apps, the user profile page. These pages usually have a grouping of text fields: first name, last name, address, password. There might also be some checkbox sets for things like gender and maybe a place to upload an avatar, too.

In the PageObject for that user profile webpage, each element on that page will be identified and assigned to a variable. On top of that, any functions you might want to perform on those elements will be specified. So for a user profile we would see setFirstName(), setLastName(), deleteFirstName(), setAvatar(), and so on. There are two big outcomes here. First, the tests become incredibly simple. Lines of code that previously looked like firstName.sendKeys("Justin") transform into setFirstName("Justin"). That style is easier to read and write. Second, it pushes the DRY principle further. Normal changes that happen during the evolution of a product, like buttons moving to a different part of a page, can be addressed in the PageObject instead of in the several tests that work with those UI elements.

Object recognition is the cause of, and solution to, many problems that exist in automating checks in a web browser. Knowing the trade-offs involved in how objects are found, and how to build an architecture around that, can help prevent a big headache.
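
A PageObject for that profile page might be sketched in Python as follows. The FakeDriver and FakeTextField classes are hypothetical stand-ins for a real browser driver, the element IDs are assumptions, and the method names follow Python's snake_case convention, so setFirstName() becomes set_first_name().

```python
class FakeTextField:
    """Hypothetical stand-in for a text input; it just stores typed text."""

    def __init__(self):
        self.value = ""

    def send_keys(self, text):
        self.value += text

    def clear(self):
        self.value = ""


class FakeDriver:
    """Hypothetical stand-in for a browser driver that finds fields by ID."""

    def __init__(self, element_ids):
        self._elements = {eid: FakeTextField() for eid in element_ids}

    def find_by_id(self, element_id):
        return self._elements[element_id]


class UserProfilePage:
    """PageObject: owns the locators and the actions for the profile page."""

    # Assumed element IDs; a locator change is fixed here and nowhere else.
    FIRST_NAME_ID = "firstName"
    LAST_NAME_ID = "lastName"

    def __init__(self, driver):
        self._driver = driver

    def _set_text(self, element_id, text):
        """Replace whatever is in the field with the given text."""
        field = self._driver.find_by_id(element_id)
        field.clear()
        field.send_keys(text)

    def set_first_name(self, name):
        self._set_text(self.FIRST_NAME_ID, name)

    def set_last_name(self, name):
        self._set_text(self.LAST_NAME_ID, name)

    def delete_first_name(self):
        self._driver.find_by_id(self.FIRST_NAME_ID).clear()

    def first_name(self):
        return self._driver.find_by_id(self.FIRST_NAME_ID).value


# The test reads as intent; it contains no locators and no raw key presses.
profile = UserProfilePage(FakeDriver(["firstName", "lastName"]))
profile.set_first_name("Justin")
name_after_set = profile.first_name()
profile.delete_first_name()
name_after_delete = profile.first_name()

print(name_after_set)           # Justin
print(repr(name_after_delete))  # ''
```

Swapping FakeDriver for a real driver would only touch the PageObject's internals. The three test lines at the bottom would stay exactly as they are, which is the isolation the pattern is meant to buy.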