(C) 2001 Giorgio Brajnik.
paper presented at: 7th Conference on Human Factors and the Web,
Madison, Wisconsin. June 2001.


 

Towards valid quality models for websites

Giorgio Brajnik
Dipartimento di Matematica e Informatica
University of Udine, Udine, Italy
giorgio@dimi.uniud.it

Abstract

In order to assess the success of a website, a quality model is needed that highlights the relevant properties of the website and specifies how to measure them.

In this paper I discuss the role of quality models in the development and maintenance processes of websites and show how they can be based on existing guidelines and website testing tools.

I also discuss the validation problem of these automatic tools and propose ways to solve it.

1. Introduction

Consider a web designer who has to decide which of two alternative designs of a website, A or B, is more successful, given a heterogeneous user population using various devices to access the site. How do we compare A with B? How is "successful" going to be defined? Does it mean functionally adequate, usable, accessible, efficient? With respect to which user population? And how is it going to be measured?

These are some of the questions that require a quality model for the website in order to be answered. A quality model highlights the properties that are most relevant to a need (like improving trust) and relates them to a system of attributes (like the presence of descriptions of refund policies, testimonials for a technology, spelling errors, or readability) that have to be measured following a defined procedure (like user testing with appropriate questionnaires, or computing the fog readability index [Gunning, 1968]).
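As an illustration of the last kind of metric, the fog index can be computed directly from the text of a page. The following is a minimal Python sketch with a deliberately crude syllable heuristic; the function names and the sample sentence are mine, not taken from any tool mentioned in this paper.

    import re

    def gunning_fog(text):
        """Rough Gunning fog index: 0.4 * (average sentence length +
        percentage of 'complex' words, i.e. words with 3 or more syllables)."""
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text)
        if not sentences or not words:
            return 0.0

        def syllables(word):
            # Crude heuristic: count groups of consecutive vowels.
            return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

        complex_words = [w for w in words if syllables(w) >= 3]
        return 0.4 * (len(words) / len(sentences)
                      + 100.0 * len(complex_words) / len(words))

    print(round(gunning_fog("The quality of a website is difficult to define. "
                            "Everybody feels it when it is missing."), 1))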

Quality models can enable development and maintenance processes that consistently achieve high quality standards based on standardized data acquisition and measurement methods. Using appropriate quality models a web development team can understand, control and improve its products and processes.

A quality model is essentially a set of criteria that are used to determine if a website reaches certain levels of quality. Quality models also require ways to assess whether such criteria hold for a website, for example user testing, heuristic evaluation or automatic webtesting systems.

The goal of the paper is to show the role that quality models can play in web development and maintenance processes. The claim is that quality models used in software engineering can be applied to website engineering and that design guidelines and usability evaluation techniques and tools are powerful ingredients of quality models. Automatic tools for webtesting are particularly interesting for their low cost. However, appropriate methods should be used to validate them since they are often based on heuristic rules.

The paper presents a case study centered on a specific method, called page tracking, for assessing the utility of the rules used by one such tool.

2. Website quality

The quality of a website is a property difficult to define and capture in an operational way, yet everybody feels it when it is missing. In fact, for a website there can be as many views of its quality as there are usages.

Quality may depend on task-related factors affecting end users like:

It may also depend on performance-related factors that affect the efficiency of end users and the economics of the website within the company running it. These factors include:

It may depend on development-related factors that affect developers and maintainers of a website. These include:

The life cycle of a website is determined by the processes of analysis, design, implementation, validation and maintenance involving a variety of persons, resources, methods and tools. When these processes are not based on well defined frameworks, it is likely that they will be neither effective nor efficient, leading to products whose success is difficult to achieve and is not repeatable.

This might well be the case of the majority of websites currently online.

2.1. Quality models

The ISO 9126 definition of quality for software products is

the totality of features and characteristics of a software product that bear on its ability to satisfy stated or implied needs

Quality is specified further as a composite property involving a set of interdependent factors: functionality, reliability, usability, efficiency, maintainability and portability.

Quality is a property of a product (i.e. it applies to some entity, like the website, or some prototype, or its information architecture) defined in terms of a system of attributes, like readability or coupling. Finally a number of measurement methods (metrics) have to be defined in order to assess the attributes that a certain product possesses.

These aspects taken together are called quality model [Fenton and Lawrence Pfleeger, 1997].

A quality model may involve many interdependent attributes and has, of course, to take into account the particular usage of the product for which quality is being modeled.

Attributes of a software product may include a very large list of properties, possibly at different levels of detail, including usability, integrity, efficiency, reliability, maintainability, testability, reusability, portability, complexity and readability. Figure 1 shows a portion of a possible quality model of a website centered on usability, based on factors mentioned in [Brajnik, 2000].

Some attributes are internal (i.e. they can be measured by examining the product, separate from its behavior); others are external (i.e. they can be measured only with respect to how the product relates to its environment). For example, size is an internal attribute, while user error rate is an external one.

In general, attributes related to usability are external ones. External attributes are more difficult to acquire and represent because they refer to the usage of the system by its users, an environment that is very difficult to model.

2.2. Importance of quality models

A quality model can be used to understand, control and improve a product or a process. For example:

Even though empirical standards already exist that support a designer in creating high quality work, I believe these standards should be enriched with measurement techniques and they should refer to a systematic set of properties and attributes (i.e. they should become quality models) in order to be even more useful.

DeMarco's [1982] statement in the context of (early 80's) software engineering applies equally well to today's website engineering:

you cannot control what you cannot measure

2.3. Adopting a quality model

Determining what to measure is a difficult decision: often we focus on attributes that are convenient or easy to measure rather than those that are needed. For generic software products, general quality models already exist [Fenton and Lawrence Pfleeger, 1997] that can be adopted as-is and specify what has to be measured and how. For websites this is not the case and quality models need to be defined ad-hoc.

The Goal-Question-Metric (GQM) paradigm [Basili and Weiss, 1984] is a useful framework to guide the definition of a quality model. It is based on three steps:

  1. list the major goals of the development or maintenance process
  2. derive from each goal the questions that need to be answered in order to determine if the goals have been met
  3. decide what must be measured in order to answer the questions and how

For example, if the goal is to determine whether a given e-commerce system (the set of forms and related information for buying a product) is adequate, a possible question is whether it is usable enough by non-expert users, and a possible metric is the success rate of a sample of the user population, i.e. how many of them successfully complete the procedure in a given time.
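To make the GQM decomposition concrete, the following minimal Python sketch records the goal, question and metric of this example; the record layout and the user numbers are purely illustrative assumptions.

    # Purely illustrative GQM record for the e-commerce example above.
    gqm = {
        "goal": "determine whether the e-commerce buying procedure is adequate",
        "questions": [{
            "question": "is it usable enough by non-expert users?",
            "metric": "success rate of a user sample within a given time",
        }],
    }

    def success_rate(completed, sampled):
        # Metric of the example: fraction of sampled users completing the task in time.
        return completed / sampled if sampled else 0.0

    # Hypothetical figures: 14 out of 20 non-expert users completed the purchase in time.
    print(gqm["questions"][0]["metric"], "=", format(success_rate(14, 20), ".0%"))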

The goal and questions determine the quality factors that are more important and those that should be discarded.

Metrics have to be chosen in order to determine, in a reliable and accurate way, the value of an attribute from data that often have to be collected on purpose. Possible techniques for collecting, analyzing and measuring quality attributes are:

Notice that not all techniques apply equally well to each attribute. While the attribute number of different media objects embedded in a page can be obtained via scanning, time to complete a task may be obtainable from specific logs, and user error rate may be obtainable from interviews, think-aloud protocols or observations.

Furthermore, different data collection and analysis techniques may be complementary (as shown by [Nielsen and Mack, 1994]). Thus definition of a quality model should exploit complementary techniques to yield more complete and robust results.

2.4. Quality models for websites

Devising a quality model for websites is hindered by at least three factors.

First, the enterprise nature common to many web applications, which involve many different subsystems, languages and databases, makes the definition of attributes, of their relationships and of the associated metrics a challenging task.

Second, websites tend to have a greater information density than other interactive applications. Information seeking is a very difficult task to model and support because it encompasses complex cognitive, social and cultural processes [Allen, 1996] spanning the interpretation of textual, visual and audio messages, the selection of relevant information and learning. As a consequence, it is very difficult to formulate questions and metrics that accurately capture properties like usefulness, relevance, navigability, or satisfaction.

Third, the designer has no control over the devices and applications that the user is going to use when accessing the website. This requires additional effort in determining questions and metrics that cover all the possibilities.

However, the large body of informal design and analysis guidelines for websites (like the accessibility guidelines defined by W3C/WAI [WAI, 1999], Nielsen's usability guidelines [Nielsen, 1999; 2001], or the guidelines mentioned in [Scapin et al, 2000]) can be used as a basis for a quality model.

In addition, the use of markup languages to describe content, presentation and interaction controls (like links, buttons, frames, menus) makes the user interface of a website an object that can be automatically analyzed.

In fact, some guidelines suggest specific testing techniques able to detect whether or not the guideline is satisfied. Automatic tools are also available to carry out some of these tests. See, for example, the W3C specification of accessibility evaluation and repair tools [WAI, 2000] and a review of such tools [Brajnik, 2000].

At least for usability-related questions, quality models should blend these techniques with others, like user testing or heuristic evaluation, in order to take advantage of their complementarity and balance their costs and reliability. In fact, heuristic evaluation has a relatively high operational cost due to the usability experts who are needed to analyze the site and produce a report: not all companies can afford to employ or hire them to run routine evaluations for every release of a website. User testing, on the other hand, is very focused in terms of the features that are tested and the environment in which they are tested.

Furthermore, the variability entailed by these manual methods when applied to subjective properties like usability and quality (as shown in [Molich et al, 1998]) requires the definition of standardized procedures to yield repeatable and comparable results.

The time and effort needed to carry out heuristic evaluation or user testing is in conflict with two fundamental pragmatic aspects of current websites. First, web technologies evolve extremely fast, enabling sophisticated tools to be deployed and complex interactions to take place. Second, the life cycle of a website is also extremely fast: maintenance of a website is performed at a rate that is higher than that of other software products because of market pressure and the lack of distribution barriers. Such a conflict is one more reason why it is necessary to consider automatic tools for supporting quality assessments.

3. Testing tools

Automatic webtesting tools identify features of web pages that, under certain circumstances, may become defects causing failures. For example, an IMG element with no associated descriptive string (ALT attribute) is a feature of a page. It becomes a defect causing an accessibility failure when a user accessing that page through a speaking browser is unable to exploit the content of the image; it causes a usability failure when a user waiting for a slow page to be downloaded has no clue whether that image would be useful, and hence whether it is worth waiting for. In these cases the feature is a defect. In other cases (e.g. when a user does not notice the absence of the ALT string) the feature is not a defect since it causes no failures. Notice that the reason why a feature is not necessarily a defect lies in usability being an external property.

Automatic usability testing tools are able to detect only features related to internal attributes; there is no way for them, in a totally automatic way, to determine external attributes. This is the case for properties referring to the content, which require some sort of interpretation assigning meaning to symbols in order to be assessed (e.g. to determine if a textual label is equivalent to the image it describes). Notice that even if these properties require human intervention to be determined, automated tools might still be useful if they were able to flag features that are probably a defect. For example, even if determining the meaningfulness of a label is out of the scope of automatic tools, identifying labels that are likely to be meaningless, e.g. placeholder text ("describe the image"), can still be useful.
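A minimal Python sketch of such a rule follows, using only the standard library HTML parser; the placeholder phrases, the class name and the test markup are illustrative assumptions, not LIFT's actual implementation.

    from html.parser import HTMLParser

    # Illustrative list of phrases that suggest a meaningless placeholder label.
    PLACEHOLDERS = {"describe the image", "image", "picture", "alt"}

    class ImgAltRule(HTMLParser):
        """Flags IMG elements whose ALT is missing or looks like placeholder text."""

        def __init__(self):
            super().__init__()
            self.problems = []

        def handle_starttag(self, tag, attrs):
            if tag != "img":
                return
            attrs = dict(attrs)
            alt = attrs.get("alt")
            if alt is None:
                self.problems.append(("missing ALT", attrs.get("src")))
            elif alt.strip().lower() in PLACEHOLDERS:
                self.problems.append(("placeholder ALT: " + alt, attrs.get("src")))

    rule = ImgAltRule()
    rule.feed('<img src="logo.gif"><img src="photo.jpg" alt="describe the image">')
    print(rule.problems)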

Automatic testing tools may be based on a set of rules. A rule is a piece of code identifying a feature that is believed to be the cause of a failure (i.e. the rule asserts that the feature is a defect). For external properties these rules acquire a heuristic nature: they identify a feature that may, depending on the circumstances, be the defect causing a failure.

For example, an ALT label like "blue bullet" may go unnoticed when using graphical browsers, while it may distract and irritate users of speaking browsers, and is totally useless to non-English speaking users. A rule might check for the occurrence of the name of a color followed by the string "bullet" in the ALT attribute of an IMG element.
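A sketch of how such a rule might look is given below; the color vocabulary is an assumption and a real rule would obviously be broader.

    import re

    # Illustrative color vocabulary; a real rule would use a much larger list.
    COLOR_BULLET = re.compile(
        r"\b(red|green|blue|yellow|orange|purple|black|white)\s+bullet\b",
        re.IGNORECASE)

    def decorative_bullet_alt(alt_text):
        # Heuristic: ALT text such as "blue bullet" probably describes a decoration.
        return bool(COLOR_BULLET.search(alt_text))

    print(decorative_bullet_alt("blue bullet"))                          # True: likely defect
    print(decorative_bullet_alt("our new product line"))                 # False
    print(decorative_bullet_alt("picture of a blue bullet proof vest"))  # True: a false positive (see section 3.1)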

Rules of a testing tool may be thought of as components that count occurrences of specific features in websites. These numbers can then be related to quality attributes through appropriate metrics. A simple example is accessibility conformance as defined by WAI: accessibility is a true/false property defined on the basis of whether all guidelines are satisfied; automatic evaluation tools simply enumerate the features that violate some guidelines (or those requiring further manual inspection).
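As a simplified illustration of this counting view (the data layout below is an assumption, not the actual WAI conformance algorithm), a tool's output can be reduced to a violation count per guideline, with conformance declared only when every count is zero and no manual check is still pending.

    def conformant(violation_counts, open_manual_checks):
        # Simplified true/false conformance: no counted violations for any guideline
        # and no guideline still awaiting manual inspection.
        return all(n == 0 for n in violation_counts.values()) and not open_manual_checks

    print(conformant({"img-alt": 0, "frame-title": 0}, open_manual_checks=[]))                # True
    print(conformant({"img-alt": 3, "frame-title": 0}, open_manual_checks=["link purpose"]))  # False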

For another property, like navigability, such tools may check consistency (e.g. of colors, of font faces and styles, of labels, of navigational buttons), contextual navigation (e.g. links in NOFRAMES, links to the home page, frame titles), and navigation efficiency (e.g. image and table rendering, cachable graphics, page size). For each of these factors, the tool can then report a ranked list of defects according to their estimated frequency and impact on users.

3.1. Validating a testing tool

Testing tools have to be validated in order to be useful. Their validation may be concerned with usability, task-adequacy, efficiency, etc. (in fact, via an appropriate quality model). A crucial property that needs to be assessed is the validity of their rules. Following the above-mentioned example, the specific rule coding may be wrong, too specific (failing to detect "a bullet", for example), or too general (identifying a defect where there is none, as in "picture of a blue bullet proof vest").

A valid rule is:

  1. correct: whenever it asserts that a feature is a defect, that feature actually is a defect;
  2. complete: whenever a failure of the particular class dealt with by the rule can occur, the rule identifies the corresponding feature as a defect.

In other words, a correct rule never identifies false positives, and a complete rule never yields false negatives. Continuing with the previous example, a rule failing to detect "a bullet" is incomplete, while a rule asserting that "picture of a blue bullet proof vest" is an inappropriate ALT for a picture is incorrect.

Obviously, given the heuristic and generic nature of most guidelines concerned with usability of websites, only extremely simple and straightforward rules may be shown to be valid (like checking that the markup language used conforms to the HTML 4.0 standard).

In fact, rule completeness is extremely difficult to assess since there is no practical way to operationally qualify the condition "whenever a failure of the particular class dealt with by the rule can occur" mentioned above and required for the completeness property of a valid rule.

Fortunately, assessing the correctness of a rule is more viable. I envision at least three methods for achieving such a goal.

The first one is running comparative experiments. Given a website, an alternative evaluation technique (like user testing) is used, and the sets of problems found respectively by that technique and by the automatic tool are compared. If a problem found by a rule is confirmed by findings of the other technique, then the rule behaved correctly in that case. In addition, a statistical characterization of rule correctness can be given by considering how many times a problem found by a rule is confirmed by the other technique.
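Such a statistical characterization could be computed as sketched below; the problem identifiers and the two sets are hypothetical.

    def correctness_estimate(flagged, confirmed):
        # Fraction of problems flagged by a rule that the other technique
        # (e.g. user testing) also reported; 1.0 means no false positives observed.
        if not flagged:
            return None
        return len(flagged & confirmed) / len(flagged)

    # Hypothetical problem sets, keyed by (page, problem description).
    flagged = {("page1.html", "IMG with no ALT"), ("page2.html", "IMG with no ALT")}
    confirmed = {("page1.html", "IMG with no ALT")}
    print(correctness_estimate(flagged, confirmed))  # 0.5: half of the flagged problems were confirmed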

The second method is rule inspection and testing. In this case each rule is run and its results are analyzed by a team of webdesigners and usability experts.

A third method, called page tracking, entails the repeated use of an automatic tool. If a site is analyzed two or more times, each evaluation generates a set of problems. If, on subsequent evaluations, a problem disappears (due to a change in the underlying webpage/website), this is due to a maintenance action. If we assume that such an action was prompted by the rule showing the problem to the webmaster, then this method provides an indirect way to determine the utility of the rule, a property that is closely related to rule correctness. In fact, a rule that is useful (i.e. whose advice is followed by some webmaster) is likely to be also correct.

Unfortunately, the latter method relies on a strong assumption, namely that the change action is triggered by the rule. In fact, a change action might have been scheduled regardless of being prompted by the rule; or perhaps the designer implemented a change action whose scope also indirectly affected the feature tested by the rule. In such cases the rule played no role in triggering the change, but the method fails to capture this.

The big advantage of page tracking is that it is amenable to totally automatic procedures and therefore it is much more cost-effective than the other two methods.

The case study described in the next section is aimed at applying such a method and presenting the insights it yields. In subsequent research, the comparative experiments method will also be applied to the same tool, and comparisons between the two validation methods will be drawn.

4. A case study: LIFT

An online testing tool, LIFT, developed and deployed by Usablenet Inc. [http://www.usablenet.com], was selected for the case study, and a portion of its database of problems has been analyzed. The considered database contained 243,784 problems found by 12 rules distributed over 8,926 pages (each containing at least one problem) of websites analyzed by the system in 1,282 different evaluations over a couple of weeks during March 2001.

Table 1 shows a brief description of the analyzed rules.

Table 1: Description of analyzed rules
Rulename Description
KnownFontColor FONT color should be a correct RGB string
KnownBodyColor BODY colors should be correct RGB strings
NonStandardLinks colors for visited and new links should be the conventional ones
SelfReferentialPage page should not contain a link pointing to itself (apart from named anchors)
NoKeywords META/keywords should be defined
ExplicitMailto labels of mailto: links should be the email address
IMGWithSize GIF images should specify size attributes
SpacerWithALT spacer image (1x1, 1xN, Nx1) should have an ALT
IMGWithEmptyALT ALT of image should be the empty string
IMGWithALT image should specify an ALT
NOFRAMESWithContent NOFRAMES should not be empty
NOFRAMESexists NOFRAMES should be present

To implement the page tracking method, the history of each page has to be analyzed to find out how many times the page has been evaluated and how many problems a rule has found on each evaluation. Any time such a number decreases we can assume that the rule actually worked well, since the person in charge of the maintenance of the page (let's call this person the webmaster) undertook some fixing action to remove the problem. As was previously mentioned, we assume that the fixing action was triggered by seeing the results produced by the rule, which might not always be the case.

With the specific system producing the data used in this study (i.e. LIFTOnline), a page is in general evaluated more than once because, for each request, LIFTOnline downloads a portion of a website (up to 250 pages) and applies its rules to all these pages. Often a webmaster runs a first evaluation to find out the existing problems, fixes some of them on some of the pages, and then reruns LIFTOnline on all the pages to verify whether those problems were actually fixed. In this way, pages that were not changed by the webmaster are evaluated twice and the number of problems found on them does not change. There can therefore be many repeated dummy evaluations of a page. As we will shortly see, this introduces some noise in the data produced by the page tracking method.

In what follows, a problem is a specific defect found by a rule in a page (for example, an IMG element with no ALT). A page may have multiple problems identified by the same rule. An evaluation is a run of the LIFTOnline system on a website. A fixed problem is a problem that disappeared between two consecutive evaluations of the same page. A faulty page is a page where some rule found some problem. A fixed page is a faulty page where some problem has been fixed between two consecutive evaluations.
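Under these definitions, the page tracking computation reduces to comparing consecutive evaluations of each page. The following Python sketch shows the idea; the data layout and the sample history are assumptions, not LIFTOnline's internal format.

    from collections import defaultdict

    def fixed_problems(history):
        """Count fixed problems per rule from one page's evaluation history.

        history: consecutive evaluations of the page, each a dict mapping
        rule name -> number of problems found in that evaluation. A problem
        counts as fixed whenever the count decreases between two consecutive
        evaluations (the page tracking assumption)."""
        fixed = defaultdict(int)
        for before, after in zip(history, history[1:]):
            for rule, n_before in before.items():
                n_after = after.get(rule, 0)
                if n_after < n_before:
                    fixed[rule] += n_before - n_after
        return dict(fixed)

    # Hypothetical history: three evaluations of the same page.
    history = [{"IMGWithALT": 5, "NoKeywords": 1},
               {"IMGWithALT": 5, "NoKeywords": 1},   # dummy re-evaluation, nothing changed
               {"IMGWithALT": 2, "NoKeywords": 1}]
    print(fixed_problems(history))  # {'IMGWithALT': 3}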

4.1. Data analysis

Table 2 shows the breakdown of problems and faulty pages by rule. The problem that occurs most often in terms of pages is NoKeywords: it occurs in 6,500 pages, covering almost 73% of the faulty pages found by all the 12 rules under analysis. The rule that found the largest number of problems is IMGWithALT, which generated 131,519 problems, almost 54% of all the problems considered.


Note: some rules, like NoKeywords, found a number of problems that is larger than the number of pages even though the features they observe can appear at most once per page. This is because in table 2 the column # pages shows only unique pages, whereas the column # problems also counts problems found by repeated analyses of a page. Furthermore, a page may contain problems found by two or more rules, which is why the total of column # pages is much larger (actually three times as much) than the total number of distinct pages.

Table 2: Distribution of problems and faulty pages by rules
Rulename # problems problem ratio # pages page ratio
NoKeywords 9077 3.7% 6500 72.8%
IMGWithALT 131519 53.9% 5784 64.8%
NonStandardLinks 9847 4.0% 3821 42.8%
SpacerWithALT 60001 24.6% 2982 33.4%
SelfReferentialPage 4773 2.0% 2327 26.1%
ExplicitMailto 4182 1.7% 1837 20.6%
IMGWithSize 11870 4.9% 1322 14.8%
IMGWithEmptyALT 3535 1.5% 727 8.1%
KnownFontColor 7631 3.1% 670 7.5%
NOFRAMESWithContent 734 0.3% 424 4.8%
KnownBodyColor 471 0.2% 319 3.6%
NOFRAMESexists 144 0.1% 129 1.4%
 Totals 243784 100.0% 26842 300.7%
 Total number of distinct pages     8926  

Table 3 presents the number of problems and of fixed problems rule by rule and for all the 12 rules. Rules have been followed by webmasters in many cases (6,018 times for all the 12 rules). IMGWithALT has been followed 2,069 times; SpacerWithALT found 60,001 problems (24.6% of all the problems) and was followed 1,539 times.

The rules that found the fewest problems are NOFRAMESexists and KnownBodyColor, which found 144 and 471 problems respectively, of which only 1 and 19 have been fixed.

As the ratio column reports, the proportion of problems found by each rule that have been fixed is relatively low (smaller than 10%). There may be many reasons explaining this, including:

The last condition affects the proportion quantitatively, as the # problems column reports values that are larger than they should be and the ratio values are therefore underestimated. However, such a distortion is similar for all the rules. If we consider only the relative change in percentage between different rules, then we can draw some useful conclusions about the relative effectiveness of the rules.

The last line of table 3 shows the global number of problems found by all 12 rules and the global number of fixed problems.

Table 3: Distribution of fixed problems rule by rule
Rulename # fixed problems # problems ratio
KnownFontColor 747 7631 9.8%
SelfReferentialPage 295 4773 6.2%
NoKeywords 523 9077 5.8%
ExplicitMailto 224 4182 5.4%
KnownBodyColor 19 471 4.0%
NonStandardLinks 334 9847 3.4%
SpacerWithALT 1539 60001 2.6%
IMGWithSize 223 11870 1.9%
IMGWithALT 2069 131519 1.6%
NOFRAMESWithContent 8 734 1.1%
IMGWithEmptyALT 36 3535 1.0%
NOFRAMESexists 1 144 0.7%
 Average 501.5 20315.3 3.6%
 Globally 6018 243784 2.5%

Table 4 shows similar results in terms of faulty pages and fixed pages. In this case, too, the numbers in column # pages are inflated because of repeated evaluations, leading to an underestimation of the ratio. But again the distortion is uniform among the rules and it does not prevent us from getting useful results.

Table 4: Distribution of fixed pages rule by rule
Rulename # fixed pages # pages ratio
KnownFontColor 58 670 8.7%
SelfReferentialPage 186 2327 8.0%
NoKeywords 523 6500 8.0%
ExplicitMailto 116 1837 6.3%
KnownBodyColor 19 319 6.0%
IMGWithALT 279 5784 4.8%
NonStandardLinks 173 3821 4.5%
IMGWithSize 59 1322 4.5%
SpacerWithALT 118 2982 4.0%
IMGWithEmptyALT 18 727 2.5%
NOFRAMESWithContent 8 424 1.9%
NOFRAMESexists 1 129 0.8%
 Average 129.8 2236.8 5.0%
 Global 1558 8926 17.5%

By considering the ratio between fixed and found problems (column ratio of table 3), the most effective rule is KnownFontColor, where 9.8% of the problems found have been fixed, i.e. 747 fixed problems.

Similarly, from table 4 the most effective rule is again KnownFontColor where 8.7% of the faulty pages have been fixed, i.e. 58 pages. It is also worth noting that globally 17.5% of the faulty pages have been fixed (1558 pages over 8926).

4.2. Effectiveness of the page tracking method

Tables 3 and 4 show that the considered rules have identified a relatively large number of problems that have been fixed soon after being discovered. They also show that there is great variation in the effectiveness of these rules, measured in terms of the proportion of problems that have been fixed or, alternatively, in terms of the proportion of pages that have been fixed.

In fact the four most effective rules are KnownFontColor, SelfReferentialPage, NoKeywords and ExplicitMailto with ratios of fixed problems that range from 5.4% to 9.8% (the average is 3.6%) and ratios of fixed pages ranging from 6.3% to 8.7% (the average is 5%).

The least effective rules are NOFRAMESexists, NOFRAMESWithContent and IMGWithEmptyALT, with ratios of fixed problems ranging from 0.7% to 1.1% and ratios of fixed pages ranging from 0.8% to 2.5%.

These numbers, so much smaller than the averages, are symptoms that something is not working properly for these rules. Two of them, those dealing with NOFRAMES, are based on standard guidelines [WAI, 1999] and should be generally accepted by webmasters. In addition, fixing the page in order to comply with those rules does not require a major change in the page structure, since adding a NOFRAMES element and populating it with a summary of the content and links of the framed pages should be pretty simple.

However, a closer look at the rule implementation highlighted a bug in the HTML parser adopted by the testing system. The bug in many cases prevented a correct parsing of documents containing NOFRAMES elements. This explains why such rules were followed so rarely by webmasters.

For the third rule, IMGWithEmptyALT, a closer inspection and a simple user test uncovered that the wording used to describe the problem was confusing. That might have been the reason why such a simple rule was not followed by many webmasters.

Finally, the IMGWithALT rule, a straightforward implementation of an accessibility guideline, was followed for only 1.6% of the problems found. This can be explained by considering the magnitude of the number of problems found by that rule (131,519) and the relatively short time span available to webmasters to fix those problems. Many webmasters might have postponed fixing those problems to a later stage in the maintenance process.

Two rules that are not straightforward implementations of published guidelines, and for which there is no large consensus in the web development community, have been followed a relatively large number of times: SelfReferentialPage and ExplicitMailto have been followed 295 and 224 times respectively, i.e. in 6.2% and 5.4% of the cases.

4.3. Quality models based on correct rules

These numbers suggest that the method yields valuable information that can be used to assess utility of a rule. It has a number of advantages:

In this way, even rules that are heuristic, and that could generate many false positives, can be validated and then successfully used.

Once a set of rules has been assessed in terms of utility, a quality model can be defined by following these steps (a minimal sketch follows the list):

  1. selecting one or more quality factors to be emphasized (for example, by weighting factors mentioned in Figure 1, up to the most detailed ones on the right-hand side of the tree);
  2. defining the metrics to be adopted to assess those factors; this step requires:
    1. identifying the rules of the chosen webtesting system that are closest to the selected factors
    2. considering their utility
    3. customizing them (for example, assigning them a weight)
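A minimal sketch of the resulting metric is given below; the weights (loosely inspired by the fixed-problem ratios of table 3) and the problem counts are purely illustrative assumptions.

    def quality_score(problem_counts, weights):
        # Combine rule results into a single weighted score for the chosen
        # quality factors; lower is better under this illustrative convention.
        return sum(weights.get(rule, 0.0) * count
                   for rule, count in problem_counts.items())

    # Hypothetical weights and counts for a freshly re-evaluated website.
    weights = {"KnownFontColor": 0.098, "IMGWithALT": 0.016, "NoKeywords": 0.058}
    counts = {"KnownFontColor": 3, "IMGWithALT": 40, "NoKeywords": 1}
    print(round(quality_score(counts, weights), 3))  # 0.992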

Such a model can then be run automatically on the website after any maintenance action to monitor changes in quality levels.

Obviously, this can be done only for quality factors for which there is some automatic rule. For other quality factors (like presence of equivalent ALT strings) human judgment is needed and additional techniques need to be employed.

5. Conclusion

In this paper I claim that the lack of appropriate quality models for websites is one reason explaining the low quality that they exhibit nowadays.

Quality models include attributes describing properties relevant to quality and appropriate measurement methods to assign them values. DeMarco's [1982] statement "you cannot control what you cannot measure" applies very well to current website development and maintenance practice.

Guidelines and automatic testing tools for websites can play an important role in determining a set of attributes and measurement methods that are both viable and reliable. However guidelines and tools need to be validated, and it appears that only methods based on empirical evaluations can be deployed to achieve this.

I describe some of these methods and present some data about the kind of information that one of them, called page tracking, is capable of yielding. In particular, I show that the page tracking method yields information that can be used to determine whether a tool is working properly and, if not, helps in understanding which part of the tool is not working properly.

Acknowledgements

Many thanks to Marco Ranon for his help in exploring the data about rule results.

The results described in this paper are based on data provided by Usablenet Inc. (http://www.usablenet.com), a company for which the author is a scientific advisor.

References

[Allen, 1996] B. Allen, Information Tasks: Toward a User-Centered Approach to Information Systems, Academic Press, 1996.

[Basili and Weiss, 1984] Basili V.R. and Weiss D. "A methodology for collecting valid software engineering data", IEEE Trans. on Software Engineering, SE-10(6), pp. 728-738, 1984.

[Brajnik, 2000] Brajnik, G. "Automatic web usability evaluation: what needs to be done?", in Proc. Human Factors and the WEB, 6th Conference, Austin, June 2000, http://www.tri.sbc.com/hfweb/brajnik/hfweb-brajnik.html

[DeMarco, 1982] De Marco T. Controlling software projects, Yourdon Press, New York, 1982.

[Fenton and Lawrence Pfleeger, 1997] Fenton N.E. and Lawrence Pfleeger S., Software metrics, 2nd ed., International Thompson Publishing Company, 1997

[Gunning, 1968] Gunning R. The techniques of clear writing, McGraw-Hill, New York, 1968.

[Molich et al, 1998] Molich R. et al. "Comparative evaluation of usability tests", Procs. of the Usability Professionals Association 1998 Conference (UPA98), Washington D.C., USA, June 1998.

[Nielsen and Mack, 1994] J. Nielsen and R. Mack (eds), Usability Inspection Methods, Wiley, 1994.

[Nielsen, 1999] Nielsen J., Designing Web Usability: the practice of simplicity, New Riders Publishing, 1999.

[Nielsen, 2001] Nielsen J., http://www.useit.com/alertbox/, January 2001

[Scapin et al, 2000] Scapin D., Leulier C., Vanderdonckt J., Mariage C., Bastien C., Farenc C., Palanque P., Bastide R. Towards automated testing of web usability guidelines, Proc. Human Factors and the WEB, 6th Conference, Austin, June 2000, http://www.tri.sbc.com/hfweb/scapin/Scapin.html

[WAI, 1999] Web Accessibility Initiative, Web Content Accessibility Guidelines 1.0, http://www.w3.org/TR/1999/WAI-WEBCONTENT-19990505, 1999

[WAI, 2000] Web Accessibility Initiative, Accessibility Evaluation and Repair Tools, http://www.w3.org/TR/AERT, 2000