Using automatic tools in accessibility and usability assurance processes

Giorgio Brajnik
Dipartimento di Matematica e Informatica
Università di Udine
Via delle Scienze, 206
33100 Udine, Italy
giorgio@dimi.uniud.it

Nov 14, 2002

Abstract:

The paper claims that processes for monitoring, assessing and ensuring appropriate levels of accessibility and usability have to be adopted by web development and maintenance teams. Secondarily it argues that automatic tools for accessibility and usability are a necessary component of these processes.

The paper presents first an analysis of web development and maintenance activities that highlights the reasons why accessibility and usability are so poorly achieved. It then suggests which processes, borrowed from the domain of software quality assurance, should be established to improve production and maintenance of websites. The paper finally shows how automatic tools could fit in those processes and actually improve them, while being cost-effective.

Introduction

There are many books and other published material that present a wealth of information on what to do, and what to avoid, when designing, developing or maintaining a web site [10]. For example, they discuss typography on the web, dealing with issues like alignment, capitalization, typefaces, etc. Although extremely educational and useful, this written knowledge is not sufficient. In order to improve the quality of a site a web developer has to study this material, has to extract useful guidelines from it and has to decide which principles to apply, how to apply them, and when.

A better start is by looking at compiled lists of web usability guidelines [14,13,12]. But also in this case the web developer has to study this material to figure out which guideline is relevant for the specific situation at hand, and to apply it.

A crucial decision is which principles to apply, as different situations and contexts call for different choices. In order to determine which principles are relevant to a specific situation a web developer has to (i) detect failures of the site, (ii) diagnose them and identify their causes, (iii) prioritize them in terms of importance, (iv) determine how to repair them, and (v) estimate benefits and costs of these changes.

The rapid pace and tight deadlines at which development and maintenance of websites proceed severely hinder the ability to perform processes (i) to (v) in an effective, reliable and cost-effective manner.

On the other hand, processes (i) to (v) are necessary if the site is to meet a certain level of quality, since accessibility and usability play an increasingly important role in achieving and maintaining a successful website.

The claim of the paper is twofold. First, in order to improve accessibility and usability levels of websites developers and maintainers have to establish assurance processes that are similar to those currently adopted for ensuring software quality. Second, due to the nature of accessibility and usability failures, the adoption of automatic tools supporting these assurance processes appears to be inevitable.

Low accessibility and usability levels

The current demand for accessible and usable websites (as witnessed also by the large number of books published on the subject in the last two years) stems from the fact that the vast majority of existing websites suffer from chronically low accessibility and/or usability levels. The trend is improving, thanks to legislation [18], public awareness and market pressure [16].


Reasons

Consider the reasons that are behind changes to websites (like adding a couple of new pages) leading to reduced accessibility and/or usability (A&U).

Usability and accessibility assurance processes

Website development processes

In general a site development and maintenance team includes at least the following roles (potentially carried out by the same persons): web architects, web programmers, web designers, web content producers, web accessibility verifiers, web usability verifiers, web masters, project managers.

These roles are involved in a number of processes, covering the following activities: elicitation of clients' and users' goals, definition of sites' purposes, analysis of users jobs and workflows, definition of task models, choice of technologies, definition of the information architecture and of the look and feel of the site, definition of website style guides, of page layout, of the site launching activities, and of the site maintenance processes; implementation of the site, graphical design, page debugging, multimedia production, site launch, bug fixing [8,10,7,9].

Developing websites vs. packaged software

Developing a website is, in many ways, like developing an interactive software system. Development has to result from analysis, planning, design, implementation and verification steps.

However, due to the specific nature of websites, the following major differences can be acknowledged, each leading to more specific and demanding A&U assurance processes.

For these reasons it is important that every design and implementation decision that may affect the user interface (i.e. anything that manifests in the DHTML coding of the pages) is validated as soon as possible. Validation means making sure that the adopted solution is indeed a solution to the right problem.

Difficult decisions

When carrying out these processes a number of difficult managerial and technical decisions have to be taken (see the glossary in the appendix).

  1. Determining the goal of the assurance process.
  2. Determining which guidelines to adopt.
  3. Determining how to resolve conflicts between competing guidelines.

    Determining the goal of the assurance process requires that somebody decide which part of the site has to be assessed, and against which level of A&U. One has to decide whether this level should be based on existing guidelines (official ones [18,19] or unofficial but well known ones [14,12,15,13]) or whether in-house guidelines should be developed for use within the organization. The advantage of adopting existing guidelines is that they embody the results of investigations carried out by others, which may also be relevant to the specific case. However, guidelines may need to be adapted to fit specific situations.
    Guidelines can also conflict with each other. For example, putting navbars to the right of the page improves accessibility (since even though they are repeated in many pages, the visitor using a screen reader does not have to hear them over and over before getting the page content). However, sighted visitors might be required to horizontally scroll the page to see the navbar, which is a known usability hindrance.

  4. Choosing the assessment methods.
  5. Determining how to detect failures.

    The choice of methods for carrying out the assessment process affects both the effectiveness and the efficiency of the process. One has to decide when to run user testing sessions or expert reviews for accessibility or usability, and when to use automatic tools. In any case, general principles stated in the guidelines have to be made operational so that it is possible to determine whether a website satisfies the guideline or not.

    The Goal, Questions and Metrics (GQM) approach [1] can be helpful to frame these decisions. It is based on the following steps, described here in the context of website analysis:

    1. establish the goals of the analysis. Possible goals include: to detect and remove usability obstacles that hinder an online sale procedure; to learn whether the site conveys trust to the intended user population; to compare two designs of a site to determine which one is more usable; to determine performance bottlenecks in the website and its back-end implementation.
      Goals can be normally defined by understanding the situation of concern, which is the reason for performing the analysis. Which actions will be taken as a result of the quality assessment? What do we need to know in order to take these actions?
    2. develop questions of interest whose answers would be sufficient to decide on the appropriate course of actions. For example, a question related to the online sale obstacles goal mentioned above could be ``how many users are able to buy product X in less than 3 minutes and with no mistakes?''. Another question related to the same goal might be ``are we using too many font faces on those forms?''. Questions of interest constitute a bridge between goals of the analysis and measurable properties that are used for data collection. Questions also lead to sharper definition of the goals. They can be used to filter out inappropriate goals: goals for which questions cannot be formulated and goals for which there are no measurable properties can be filtered out. Questions can also be evaluated to determine if they completely cover their goal and if they refer to measurable properties. In addition certain questions will be more important than others in fulfilling a goal.
      Published guidelines are useful since they suggest relevant questions. For example, a guideline requiring the use of white space to separate links in navbars suggests questions like ``are links too close to each other?'', ``how will the space between links change if font size is increased?'', ``how many visitors will be affected by violations of this guideline?'', and ``what are the possible consequences on visitors' tasks if this violation occurs?''.
      Once relevant questions have been chosen, in-house guidelines can be defined.
    3. establish measurement methods (i.e. metrics) to be used to collect and organize data to answer the questions of interest. A measurement method (see [6] for additional details) should include at least:
      1. a description of how the data have to be elicited. For example via user testing with a given task description; or automatically inspecting HTML sources of online pages to learn how links are organized.
      2. a description of how data are identified. For example, what does ``successfully completed task'' mean in a user testing session?; how is ``time to complete the task'' going to be measured?; what constitutes a link in a webpage (i.e. which HTML tags should be considered: A, AREA, IMG, BLOCKQUOTE, ...)?; what constitutes a local link to a website (e.g. same server, same sub-domain, or same path)?
      3. a description of how data are going to be categorized. For example, what are the different kinds of mistakes that users might make, how can the "time to complete" be broken down into different phases, and whether there are different categories of local links: links to images, videos, HTML files, etc. Measurement methods sharpen and refine the questions of interest.
        Even though some A&U guidelines are sufficiently precise and specific to suggest the appropriate measurement method, it is often the case that guidelines need to be specialized in order to be operational.
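The three GQM steps above can be recorded in a small data structure so that goals lacking questions, and questions lacking measurement methods, are easy to spot. The following is only a minimal sketch: the class names `Goal`, `Question` and `Metric` and their fields are hypothetical and do not come from [1] or [6].

```python
from dataclasses import dataclass, field


@dataclass
class Metric:
    """A measurement method (hypothetical structure)."""
    name: str
    elicitation: str      # how the data are elicited
    identification: str   # how a data item is recognized
    categories: list = field(default_factory=list)  # how data are categorized


@dataclass
class Question:
    text: str
    metrics: list = field(default_factory=list)


@dataclass
class Goal:
    description: str
    questions: list = field(default_factory=list)

    def unmeasurable_questions(self):
        """Questions that do not yet refer to any measurable property."""
        return [q for q in self.questions if not q.metrics]

    def is_operational(self):
        """True when the goal has questions and each one has a metric."""
        return bool(self.questions) and not self.unmeasurable_questions()


# Example taken from the text: the online sale obstacles goal.
task_time = Metric(
    name="time to buy product X",
    elicitation="user testing with a given task description",
    identification="seconds from landing on the product page to order confirmation",
    categories=["search", "form filling", "payment"],
)
goal = Goal(
    description="detect and remove usability obstacles that hinder an online sale procedure",
    questions=[
        Question("how many users are able to buy product X in less than 3 minutes "
                 "and with no mistakes?", [task_time]),
        Question("are we using too many font faces on those forms?"),
    ],
)
pending = [q.text for q in goal.unmeasurable_questions()]
```

Here `pending` lists the font-face question, which still needs a measurement method before the goal can be considered operational.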

  6. Deciding which guidelines to apply at which stage of the development process.
  7. Determining how to prevent errors.

    Certain methods and guidelines should be applied to early deliverables of the development process. For example, some accessibility failures are due to bad design decisions: the use of frames, putting navbars to the left of the page, laying out the page content, or implementing menus with rollovers. These decisions are taken when the page layout is designed, at a stage that occurs earlier than production of content. But, unless a careful assessment of the page template has been carried out when the template is produced, the consequences of these choices show up only when content is produced. They lead, for example, to pages where skip-links have to be added, navbars have to be moved to the right, a new implementation of the table layout has to be done in order to yield the correct reading order, and navigation options implemented through rollovers have to be made redundant with other mechanisms. The problem is that fixing a seemingly simple defect requires a reassessment of page design, and it is likely that a change in many of the pages is needed. The later this is done, the more costly it will be.

    On the other hand, violations of other guidelines can be spotted later (say, after content is produced) without incurring significantly larger costs. For example, discovering that images are not labeled (with the ALT attribute) is a problem that requires a localized fix with no impact on other parts of the site. However, doing it afterwards still requires a systematic, tedious activity of locating all image occurrences, labeling them and then making sure that the labels are all consistent.

  8. Determining how to estimate the cost, and the Return on Investment of the entire assurance process.

    A quality cost model used for software quality [2] could be applied to web development as well.

    According to the model the cost of quality ($C_{q}$) is the sum of the cost of conformance ( $C_{conformance}$) and the cost of non-conformance ( $C_{non-conformance}$). $C_{conformance}$ includes any expenditure for preventing defects (e.g. training, tools, process tuning) ($C_{prevent}$) and the cost of planning and running verifications (personnel, setting the test environment, running the verifications) ($C_{appraise}$). $C_{prevent}$ is basically a fixed cost incurred initially, while $C_{appraise}$ is a cost that depends also on the number of defects that are found (and fixed, which requires repeated verifications). $C_{non-conformance}$ includes the economic effects of bugs discovered after release. At the very least, $C_{non-conformance}$ includes the infrastructure, time and effort needed for replicating, isolating, fixing, verifying each of the bugs reported by web visitors ($C_{direct}$). In addition it includes also less tangible costs like delays, slowdowns, customer alienation, uncompleted transactions, reduced traffic, reduced sales, etc. ($C_{indirect}$).


    \begin{displaymath}C_q = C_{conformance} + C_{non-conformance}\end{displaymath}

    that is


    \begin{displaymath}
    C_{q} = C_{prevent} + C_{appraise}(N) + C_{direct}(M) + C_{indirect}(M)
    \end{displaymath}

    where $N$ and $M$ are the numbers of defects found and fixed before and after the release, respectively.

    If we keep the total number of defects in the site ($N+M$) constant, and explore a scenario (n. 1) where no specific assurance processes are put in place, then we may assume that all the defects will be discovered sooner or later by the visitors. Hence:


    \begin{displaymath}CQ_{1} = 0 + 0 + CD_1 + CI_1\end{displaymath}

    Alternatively, in a scenario (n. 2) where assurance processes are implemented, and assuming that all the defects are fixed as early as possible in the development process the result is:


    \begin{displaymath}CQ_{2} = CP_2 + CA_2 + 0 + 0\end{displaymath}

    It will often be the case that $CP_2 + CA_2 < CD_1$ (and hence $CQ_2<CQ_1$), since finding and fixing the bugs reported by visitors in scenario 1 requires the same assurance processes (eliciting failures, diagnosing them, fixing them, verifying them again, ...) and the same infrastructure needed in scenario 2.

    While these scenarios are very abstract and extreme, as reported in [2] wise investment in and management of assurance processes leads to a reduced cost of quality, and the return on investment of $C_{conformance}$ is consistently positive.

    This cost model could be applied also to website development and maintenance. However, for websites the proportion of defects that are reported by visitors is smaller than for software. Most visitors will not complain and simply will move away from the site after facing one or more A&U failures. Therefore $C_{direct}$ will be small, because of the small number of reported bugs. However $C_{indirect}$ is likely to soar.
    Only effective assurance processes can avoid this increase in indirect costs.
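The two scenarios of the cost model can be compared numerically. The sketch below assumes, purely for illustration, that appraisal, direct and indirect costs grow linearly with the number of defects; the class `QualityCosts`, the function `cost_of_quality` and all the figures are hypothetical and are not taken from [2].

```python
from dataclasses import dataclass


@dataclass
class QualityCosts:
    """Hypothetical cost parameters, in arbitrary currency units."""
    prevent: float              # C_prevent: fixed cost (training, tools, process tuning)
    appraise_per_defect: float  # appraisal cost per defect found before release
    direct_per_defect: float    # cost of replicating and fixing one reported defect
    indirect_per_defect: float  # lost sales, alienated visitors, etc. per released defect


def cost_of_quality(costs, found_before, found_after, assurance_in_place=True):
    """C_q = C_prevent + C_appraise(N) + C_direct(M) + C_indirect(M),
    with N = found_before and M = found_after (linear costs assumed)."""
    c_prevent = costs.prevent if assurance_in_place else 0.0
    c_appraise = costs.appraise_per_defect * found_before
    c_direct = costs.direct_per_defect * found_after
    c_indirect = costs.indirect_per_defect * found_after
    return c_prevent + c_appraise + c_direct + c_indirect


costs = QualityCosts(prevent=5000, appraise_per_defect=100,
                     direct_per_defect=300, indirect_per_defect=400)
total_defects = 50  # N + M is kept constant across scenarios

# Scenario 1: no assurance process, every defect is met by visitors (N = 0).
cq1 = cost_of_quality(costs, found_before=0, found_after=total_defects,
                      assurance_in_place=False)
# Scenario 2: every defect is found and fixed before release (M = 0).
cq2 = cost_of_quality(costs, found_before=total_defects, found_after=0)
```

With these made-up figures scenario 1 costs 35000 units and scenario 2 costs 10000. Raising `indirect_per_defect` while lowering `direct_per_defect` mimics the website case described above, where few bugs are reported but many visitors leave.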

Defect flow model

Consider now how defects flow in and out of the website. Following a model used for software development [2], defects are inserted into and removed from the website as an effect of the overall development and maintenance process (see figure 1). More specifically, the following subprocesses can be identified:

defect insertion:
this occurs during website development and maintenance, at any stage. Defects may be due to errors in implementation decisions (like forgetting to label correctly controls of a form) or in design decisions (like relying on colors to convey information) or during analysis (like not providing the appropriate navigation path to support a given information task). In the majority of the cases this happens implicitly, without notice. Sometimes errors occur simply because somebody intended to do something and then forgot. Other times they occur because of misjudgment (e.g. deciding that a certain accessibility requirement is not so important) or because of ignorance.

defect removal:
this happens after defects have been discovered, have been analyzed and their removal has been planned. Web developers produce one or more solutions, rank them, plan their execution, and finally execute them. Usually, after executing a solution, a subsequent verification step takes place making sure that the defect has been removed, and that applied changes do not perturb the rest of the site.
Figure 1 shows that defect removal is a process informed by defect management, since the latter provides priorities and schedules for defect removals.

defect elicitation:
this occurs when A&U failures are detected and traced back to their causes: the defects. This may happen when website visitors complain about the site or when explicit verification activities are carried out.
During this process several elicitation techniques may be used: user testing, heuristic evaluations, accessibility inspections, accessibility ``sniff tests'', analysis of webserver logs, code inspections.
Figure 1 shows that defect elicitation is guided by A&U monitoring and by A&U assurance plan definition: the former triggers elicitation, the latter determines which part of the site has to be tested, why, how and when. Defect management also affects defect elicitation, since it determines the order in which defects are to be removed.

defect management:
is the process of determining which defects should be removed, scheduling their removal and tracking their evolution in the site. Defect triage is a step of this process that performs a superficial but quick evaluation of reported defects so that the most critical ones are treated first [11]. Another possible approach is ``Failure Mode and Effects Analysis'' [2], where the criticality of a defect is computed in a more formal way. In any case, the criticality of a defect depends on the following factors:
violation of required standards:
for example a failure such that a U.S. federal website cannot satisfy Section 508 accessibility requirements is a good candidate for a high priority.
individual impact:
(or severity) the impact of the failure with respect to an individual user who has to face it. This estimation may be formulated in terms of user inability to successfully complete the task that s/he intended to do.
affected audience:
(or likelihood) the fraction of the intended user population of the website that is affected by the failure. The likelihood of a failure in turn depends on (i) how frequently the pages causing it are shown to visitors and (ii) which proportion of those visitors will actually experience the failure.
A defect located on the home page will be more critical than the same defect located on a secondary page, which will be visited by a smaller proportion of visitors. Webserver logs can also be used to determine the frequency of display of given pages. On the other hand, determining the proportion of visitors that will experience the failure depends on assumptions, like knowing how many visitors use the page in the contexts leading to the failure. For example, only visually impaired visitors using JAWS on Windows 98 clicking on a certain link might experience the failure.
symbolic impact:
the failure might be perceived by the audience as a symbolic quality problem; it could be an embarrassing bug that requires a prompt treatment.
removal costs:
in order to remove a bug a number of activities have to be performed: (i) to diagnose the failures and determine their causes (i.e. the defects), (ii) to determine one or more possible solutions (i.e. alternative treatments), (iii) to estimate their requirements in terms of persons, skills, and infrastructure, (iv) to implement one solution, (v) to verify that the defect has disappeared, and finally (vi) to estimate time and money needed for all these activities.
instability of the site:
fixing a defect requires changing some part of the website, and these changes may propagate to other pages across the site. For example, fixing a usability defect by replacing a label of a link appears to be an easy and localized change of a page. But a deeper analysis may indicate that the same label was used in a global navigation bar, in the heading of an entire subsite and also in the URLs of the pages. What appeared to be an easy change has become a very destabilizing activity on a large fraction of the site.

A&U assurance plan definition:
the plan should describe:

A&U monitoring:
it is the process of managing methods, tools and roles for alerting when a failure has occurred and therefore triggering the defect elicitation process.
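The criticality factors listed under defect management can be combined into a rough ranking for triage. The sketch below is one possible way to do so, not the Failure Mode and Effects Analysis of [2]: the `Defect` fields, the weights (1.0, 0.5, 0.1) and the ratio of benefit to effort are all assumptions chosen only to make the example concrete.

```python
from dataclasses import dataclass


@dataclass
class Defect:
    name: str
    violates_standard: bool   # e.g. a Section 508 requirement is not met
    severity: float           # individual impact: 0 (cosmetic) to 1 (task cannot be completed)
    page_views: int           # views of the affected pages, e.g. from webserver logs
    affected_fraction: float  # assumed share of those visitors who experience the failure
    symbolic: bool            # embarrassing bug that calls for prompt treatment
    removal_cost: float       # estimated effort in person-days (must be positive)
    pages_touched: int        # pages that must change when fixing it (instability)


def likelihood(defect, total_views):
    """Affected audience: display frequency times proportion experiencing the failure."""
    return (defect.page_views / total_views) * defect.affected_fraction


def criticality(defect, total_views):
    """Hypothetical score: expected benefit of the fix divided by its effort."""
    benefit = defect.severity * likelihood(defect, total_views)
    if defect.violates_standard:
        benefit += 1.0   # arbitrary weight for required standards
    if defect.symbolic:
        benefit += 0.5   # arbitrary weight for symbolic impact
    effort = defect.removal_cost + 0.1 * defect.pages_touched
    return benefit / effort


def triage(defects, total_views):
    """Most critical defects first."""
    return sorted(defects, key=lambda d: criticality(d, total_views), reverse=True)


home = Defect("home", False, 0.8, 9000, 0.1, False, 1.0, 1)
secondary = Defect("secondary", False, 0.8, 500, 0.1, False, 1.0, 1)
section508 = Defect("section508", True, 0.5, 500, 0.05, False, 2.0, 5)
ranking = [d.name for d in triage([secondary, home, section508], total_views=10000)]
```

In this example the same defect ranks higher on the home page than on a secondary page, and the standards violation ranks first because of its weight. In practice the weights would have to be agreed upon by the team.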

Figure 1: The defect flow model
A diagram showing the processes involved in quality assurance

Impact of automatic tools on assurance processes

Automatic tools

There are several flavors of automatic webtesting tools [4,3]:

These tools cover a large set of tests for A&U. For example, usability tests can be grouped under the following factors (for more details see [3]):

Obviously, automatic tools cannot assess all sorts of properties. In particular, anything that requires interpretation (e.g. usage of natural and concise language) or that requires assessment of relevance (e.g. ALT text of an image is equivalent to the image itself) is out of reach. Nevertheless these tools can highlight a number of issues that have to be later inspected by humans and can avoid highlighting them when there is reasonable certainty that the issue is not a problem. For example, a non-empty ALT can contain placeholder text (like the filename of the image); it is possible to write heuristic programs that detect such a pattern and flag the issue only when appropriate, without flagging it if the ALT contains other text.
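A heuristic of this kind can be sketched with Python's standard `html.parser` module. The list of generic words and the file-extension pattern below are assumptions made for the example; they do not describe the tests implemented by any particular tool.

```python
import re
from html.parser import HTMLParser

# Assumed placeholder patterns: an ALT that ends like an image filename,
# or that consists of a single generic word.
FILENAME_RE = re.compile(r"\.(gif|jpe?g|png|bmp)$", re.IGNORECASE)
GENERIC_WORDS = {"image", "img", "picture", "photo", "graphic"}


def alt_issue(attrs):
    """Return a short issue label for one IMG tag, or None if nothing is flagged."""
    if "alt" not in attrs:
        return "missing ALT"
    alt = (attrs["alt"] or "").strip()
    if alt == "":
        return None  # an empty ALT may be fine for spacers; not flagged here
    if FILENAME_RE.search(alt) or alt.lower() in GENERIC_WORDS:
        return "placeholder ALT"
    return None      # any other text is left to human judgment


class AltChecker(HTMLParser):
    """Collects (src, issue) pairs for the IMG tags of a page."""

    def __init__(self):
        super().__init__()
        self.issues = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            issue = alt_issue(attributes)
            if issue:
                self.issues.append((attributes.get("src"), issue))


page = ('<img src="logo.gif" alt="logo.gif">'
        '<img src="a.png" alt="Company logo">'
        '<img src="b.png">'
        '<img src="s.gif" alt="">')
checker = AltChecker()
checker.feed(page)
checker.close()
```

After the run `checker.issues` contains the first image (filename used as ALT) and the third one (no ALT at all), while the image with meaningful text is not flagged.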

A case study: LIFT

A quick analysis of the potential of a tool like LIFT (produced by UsableNet Inc., see www.usablenet.com) can highlight its impact on the A&U assurance processes. LIFT is in fact a family of tools including:

Impact on Defect insertion

Automatic tools can affect the Defect insertion process.

LIFT, for example, comes with a rich description of what an accessibility problem is (i.e. the failure and the possible defect), why it is important to fix it (i.e. the consequences of the failure), where you can learn more about it, and how you could fix it (i.e. how to remove the defect). This description can be used to recall a specific guideline to an experienced developer or to train novices, an important requirement as many U.S. federal agencies hire external contractors to retrofit accessibility on their websites, as required by Section 508; these contractors are not always selected on the basis of their experience in accessibility.

In addition, since LIFT is embedded in the authoring environment, this information is readily available while developing the site, through the familiar user interface of the development environment.

Thirdly, LIFT has a monitoring function that continuously evaluates the page as it is developed, highlighting the potential defects. This can therefore alert the developer as soon as the defect is inserted in the site so that it can be removed as early as possible.

Finally, LIFT offers some page previewers that help to debug a page, right when it is being developed. Available previewers display the page with colors turned off (to discover if there is information that is color coded) or with linearized content (to check reading order).

Impact on Defect elicitation

Automatic tools can improve the effectiveness of the defect elicitation process. They are systematic, do not get tired or bored, and are fast.

LIFT, for example, can scan a large number of local or live pages (1000+) and apply a large (120+) set of tests on them.

Obviously, only certain kinds of properties can be tested in an automatic way. When heuristic tests are adopted (which often is the case since accessibility and usability are properties that seldom can be reduced to easy clear-cut decisions), the tools may produce false positives (i.e. they report potential failures in cases where there is no defect).
LIFT reduces the number of false positives by guessing the role that images and tables play in a page. On the basis of image size and type, and its location in the page, LIFT is able to guess with good accuracy if the image is a spacer, a decoration, a banner, a button, or something else. And therefore it can offer a more specific diagnosis of the problem and suggestions for a solution.
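How such a guess might work can be illustrated with a few rules on image size, filename and context. The rules and thresholds below are invented for the example and are not the ones used by LIFT, whose heuristics are not described here.

```python
def guess_image_role(width, height, src="", inside_link=False):
    """Guess the role of an image from its size, filename and context.

    All thresholds are hypothetical and would need tuning on real pages.
    """
    name = src.lower()
    if width <= 2 or height <= 2 or "spacer" in name:
        return "spacer"        # tiny or explicitly named spacer images
    if inside_link and width <= 200 and height <= 60:
        return "button"        # small image used as a link
    if width >= 400 and height <= 120 and width / height >= 4:
        return "banner"        # wide and short
    if width <= 32 and height <= 32:
        return "decoration"    # bullets, icons
    return "other"


roles = [
    guess_image_role(1, 1, "spacer.gif"),
    guess_image_role(468, 60, "ad.gif"),
    guess_image_role(88, 31, "go.gif", inside_link=True),
    guess_image_role(16, 16, "bullet.gif"),
    guess_image_role(300, 200, "photo.jpg"),
]
```

Once a role has been guessed, a test can adapt its diagnosis: an empty ALT is acceptable for a spacer, while a button needs an ALT that describes the link destination.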

Impact on Defect removal

Tools can improve also the removal process, at least for certain kinds of defects.

LIFT offers, whenever possible, examples of good solutions (like fragments of javascript code for correctly handling events causing new browser windows to be opened) that can be copied into the page being developed.

In addition, LIFT offers Fix Wizards that allow a user to fix a problem without having to manually edit DHTML code, which again is convenient for users who are not experienced with, or not used to working with, HTML or Javascript. For example, there are fix wizards for data tables (i.e. TABLE tags used to display tabular information) that help mark up table cells correctly so that they refer to table headers.

Finally, the ALT Editor of LIFT supports a global analysis and fix of the ALT attributes of all the images found in the site. This is useful to ensure consistency among these image labels.
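A site-wide consistency check of this kind can be sketched as follows. The function assumes that a crawler has already collected `(page, src, alt)` triples; that input format and the function name `inconsistent_alts` are hypothetical and do not describe the ALT Editor itself.

```python
from collections import defaultdict


def inconsistent_alts(images):
    """Map each image source to its ALT texts when more than one is in use.

    images: iterable of (page_url, src, alt) tuples; alt may be None.
    """
    alts_by_src = defaultdict(set)
    for _page, src, alt in images:
        alts_by_src[src].add((alt or "").strip())
    return {src: sorted(alts) for src, alts in alts_by_src.items() if len(alts) > 1}


images = [
    ("/", "logo.gif", "ACME home"),
    ("/about", "logo.gif", "logo"),
    ("/", "cart.gif", "Cart"),
    ("/shop", "cart.gif", "Cart"),
]
report = inconsistent_alts(images)
```

Here `report` contains only `logo.gif`, which is labeled in two different ways across the site.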

Impact on Monitoring

Tools can support the monitoring process.

For example, LIFT can schedule evaluations on a live site over time; they are run in an unsupervised mode and when ready an email is sent to the person that requested them.

Impact on Plan definition

Tools can support the definition of the test plan.

For example, LIFT allows the user to enable or disable individual tests and groups of tests, and to define and use named guideline profiles. In this way the user has fine-grained control over which tests are applied.

Secondly, the behavior of many tests is affected by parameters whose value can be changed by the user. For example, to determine if a site has a ``text only'' version, LIFT looks for links containing words like ``text only'', ``text version'', etc. By changing the value of this parameter the user can customize the behavior of the test.
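A parameterized test of this sort can be sketched with the standard `html.parser` module. The default phrase list and the function `has_text_only_link` are assumptions made for the example, not the actual LIFT test.

```python
from html.parser import HTMLParser

DEFAULT_PHRASES = ("text only", "text version")  # user-changeable parameter


class LinkTextCollector(HTMLParser):
    """Collects the visible text of every A element that has an HREF."""

    def __init__(self):
        super().__init__()
        self._in_link = False
        self._buffer = []
        self.link_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self._in_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            # Normalize runs of white space before storing the link text.
            self.link_texts.append(" ".join("".join(self._buffer).split()))
            self._in_link = False


def has_text_only_link(html, phrases=DEFAULT_PHRASES):
    """True if some link text contains one of the given phrases (case-insensitive)."""
    collector = LinkTextCollector()
    collector.feed(html)
    collector.close()
    return any(phrase.lower() in text.lower()
               for text in collector.link_texts
               for phrase in phrases)


english = has_text_only_link('<a href="/text/">Text  only version</a>')
italian_default = has_text_only_link('<a href="/testo/">Solo testo</a>')
italian_custom = has_text_only_link('<a href="/testo/">Solo testo</a>',
                                    phrases=("solo testo",))
```

The Italian page is missed with the default phrases and detected once the parameter is changed, which is the kind of customization described above.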

Impact on ROI

Tools affect the cost of A&U assurance in two ways. They contribute to increase fixed costs for the infrastructure and training ($C_{prevent}$). However, if appropriately deployed they are likely to reduce the running costs of appraisal ($C_{appraise}$) and, at the same time, increase the quality level that can be achieved.

Conclusion

While there are differences between development and maintenance of websites and of software, it is likely that the same assurance processes adopted for software can be deployed for websites. And, as a consequence, also automated tools for supporting A&U assurance can play a significant and positive role.

They can address the issues raised in section 2.1:

Bibliography

1
V. Basili and D. Weiss.
A methodology for collecting valid software engineering data.
IEEE Trans. on Software Engineering, 10(6):728-738, 1984.

2
R. Black.
Managing the testing process.
Wiley Publishing Inc., 2002.

3
G. Brajnik.
Automatic web usability evaluation: what needs to be done?
In Proc. Human Factors and the Web, 6th Conference, Austin TX, June 2000.
http://www.dimi.uniud.it/~giorgio/papers/hfweb00.html.

4
T. Brink and E. Hofer.
Automatically evaluating web usability.
CHI 2002 Workshop, April 2002.

5
E. Chi, P. Pirolli, and J. Pitkow.
The scent of a site: a system for analyzing and predicting information scent, usage and usability of a web site.
In ACM, editor, Proceedings of CHI 2000, 2000.

6
N. Fenton and S. Lawrence Pfleeger.
Software metrics.
International Thomson Publishing Company, 2nd edition, 1997.

7
K. Goto and E. Cotler.
Web redesign: workflow that works.
New Riders Publishing, 2002.

8
J. Hackos and J. Redish.
User and task analysis for interface design.
Wiley Computer Publishing, 1998.

9
M. Holzschlag and B. Lawson, editors.
Usability: the site speaks for itself.
Glasshouse Ltd., 2002.

10
P. Lynch and S. Horton.
Web Style Guide.
Yale University, 2nd edition, 2002.

11
J. McCarthy.
Dynamics of software development.
Microsoft Press, 1995.

12
National Cancer Institute.
Usability.gov.
http://www.usability.gov, 2002.

13
J. Nielsen.
Designing Web Usability: the practice of simplicity.
New Riders Publishing, 1999.

14
J. Nielsen.
Alertbox.
http://www.useit.com/alertbox/, 2002.

15
Nielsen Norman Group.
Beyond alt text: Making the web easy to use for users with disabilities.
http://www.nngroup.com/reports/accessibility/, Oct 2001.

16
A. Ramasastry.
Should web-only businesses be required to be disabled-accessible?
http://www.cnn.com/2002/LAW/11/07/findlaw.analysis.ramasastry.disabled/index.html, Nov 7 2002.

17
R. Sinha, M. Hearst, M. Ivory, and M. Draisin.
Content or graphics? an empirical analysis of criteria for award-winning websites.
In Proc. Human Factors and the Web, 7th Conference, Madison, WI, June 2001.

18
U.S. Dept. of Justice.
Section 508 of the rehabilitation act.
http://www.access-board.gov/sec508/guide/1194.22.htm, 2001.

19
World Wide Web Consortium - Web Accessibility Initiative.
Web content accessibility guidelines 1.0.
http://www.w3.org/TR/WCAG10, May 1999.

The following terminology has been used in the paper:

bug:
a generic term for referring to a misbehavior of the system and its causes.
failure:
the manifestation of a misbehavior of the system. For example, when a web visitor using a screen reader gets the content of the page read in an incorrect order. Notice that in case of a usability and accessibility failure of a website, misbehavior has to encompass the behavior of: the site, the webserver, the browser, the browser's plug-ins, any assistive technology, and the operating system used by the visitor.
failure mode:
the category of failures that share the same kind of misbehavior. A failure mode is the set of symptoms; these symptoms may show up during a specific failure, and are caused by one or more defects (i.e. the disease).
defect:
or fault, the reason why a failure may show up. Typically, for web usability and accessibility, a defect is rooted in some fragment of code implementing the site (HTML, Javascript, CSS). In the previous example, the defect associated with the incorrect reading order might be a bad use of the TABLE tag to layout the page.
error:
is the misbehavior of the developer causing a defect that is inserted into the site.


The translation was initiated by Giorgio Brajnik on 2003-03-26

