Information Networks

Networks of information consist of items of data linked together in some way. Perhaps the best known example is the World Wide Web.

The World Wide Web

The Web is a network in which the vertices are web pages and the edges are the hyperlinks that allow us to navigate from page to page. More precisely, there exists an edge from page P to page Q if page P contains at least one hyperlink pointing to page Q. Usually, the actual number of hyperlinks between two pages is not important and hence the network modelling the Web is an unweighted directed graph. Moreover, page self-links are ignored.
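
As a minimal illustration, such an unweighted directed graph can be stored as an adjacency list mapping each page to the set of pages it links to. The following Python sketch uses invented page names and is only meant to make the representation concrete:

    # A minimal sketch of the Web as an unweighted directed graph, stored as an
    # adjacency list; the page names are invented for illustration.
    web = {
        "page_P": {"page_Q", "page_R"},  # P links to Q and R
        "page_Q": {"page_R"},            # several hyperlinks from Q to R still give one edge
        "page_R": {"page_P"},
        "page_S": set(),                 # a page with no outgoing hyperlinks
    }

    def has_edge(graph, source, target):
        """True if the source page contains at least one hyperlink to the target page."""
        return target in graph.get(source, set())

    print(has_edge(web, "page_P", "page_Q"))  # True
    print(has_edge(web, "page_Q", "page_P"))  # False: hyperlinks are directed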

The Web was invented in 1989 by the physicist Tim Berners-Lee at the European Organization for Nuclear Research (CERN). Berners-Lee proposed the HyperText Markup Language (HTML) to keep track of experimental data at CERN. An interesting excerpt of the original, far-sighted proposal, in which Berners-Lee attempts to persuade CERN management to adopt the new global hypertext system, follows:

We should work toward a universal linked information system, in which generality and portability are more important than fancy graphics techniques and complex extra facilities. The aim would be to allow a place to be found for any information or reference which one felt was important, and a way of finding it afterwards. The result should be sufficiently attractive to use that the information contained would grow past a critical threshold.

The Web structure was in fact envisioned much earlier. In 1945 Vannevar Bush wrote a now-celebrated article in The Atlantic Monthly entitled "As We May Think", describing a futuristic device he called the Memex. Bush writes:

Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them, ready to be dropped into the Memex and there amplified.

The structure of the Web can be measured using a crawler, a computer program that automatically surfs the Web looking for pages using a breadth-first search technique. Web crawlers are typically employed by Web search engines, such as Google, since knowing the structure of the Web is fundamental for providing users with meaningful rankings of Web pages. More specifically, the modules involved in the Web search process are briefly described in the following:

  1. Crawler. The hyperlinked nature of the Web is exploited by the crawling software, which spawns virtual robots, called spiders, that repeatedly jump from page to page following the hyperlinks. The pages collected by the spiders are temporarily stored as full pages in a central repository.
  2. Indexer. This module takes the uncompressed version of pages from the page repository, extracts from them only the interesting information, and stores it in various data structures called indexes. A Web search engine mainly works with two indexes: a content index, containing information about the content of the pages, and a structure index, storing precious information about the hyperlink topology of the Web. The content index uses an inverted file structure (a minimal sketch follows this list). This file is similar to a book index and contains, next to each significant term, a list of page locations where the term appears. Furthermore, the index contains the number of occurrences of the term in each page and whether or not the term appears in important page sections such as the title.
  3. Query processor. The modules described above exist and operate independently of user queries. The query processor is, instead, a query-dependent module that accepts user queries and responds in real time. The query module parses the query and selects from the content index the pages that are relevant to the posed query.
  4. Ranking module. This final piece of a Web search engine has the critical task of ranking the query-relevant pages, putting first the pages that the user most likely desires. Typically, each relevant page is assessed using a content score, which measures how similar a page is to the user query, for instance by counting the number of occurrences of the query terms within the page and by checking whether the query terms occur in important page sections like the title. Furthermore, for each page an importance score is computed, which gauges the status of the page within the entire Web network. The importance score computed by the Google search engine is well known as the PageRank score of the page; the idea is to give more importance to pages that are linked to from other important pages (a toy PageRank computation is sketched after this list). The content and importance scores are then combined into an overall score and the pages are presented to the user in order of this global value.
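
For the content index of step 2, the following Python sketch builds an inverted file over a toy two-page corpus; the page names and texts are invented, and a real index would also record term positions and whether a term occurs in the title:

    from collections import defaultdict

    # A toy inverted file: each term maps to the pages containing it, together
    # with the number of occurrences of the term in each page.
    pages = {
        "page1": "network science studies networks",
        "page2": "the web is a network of pages",
    }

    inverted_index = defaultdict(dict)
    for page, text in pages.items():
        for term in text.lower().split():
            inverted_index[term][page] = inverted_index[term].get(page, 0) + 1

    print(inverted_index["network"])  # {'page1': 1, 'page2': 1}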
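
For the importance score of step 4, the following power-iteration sketch illustrates the PageRank idea on a tiny invented graph; the damping factor 0.85 is a conventional choice, and this is only an illustration of the principle, not Google's actual implementation:

    # Toy PageRank: rank flows along hyperlinks, so a page becomes important
    # when important pages link to it.
    def pagerank(graph, damping=0.85, iterations=100):
        pages = list(graph)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for p in pages:
                targets = graph[p] or pages  # a dangling page spreads its rank over all pages
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            rank = new_rank
        return rank

    toy_web = {"P": ["Q", "R"], "Q": ["R"], "R": ["P"], "S": []}
    print(pagerank(toy_web))  # R, which collects rank from both P and Q, scores highest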

Academic citation networks

Most academic papers refer to other previous papers, usually in a bibliography at the end of the paper. Academic citation networks are networks in which the nodes are bibliometric units, such as papers and journals, and the directed edges are the bibliographic citations between them.

Two typical examples are article citation networks, in which the vertices are papers and the edges are the bibliographic citations between papers, and journal citation networks, in which the nodes are academic journals and the edges are the bibliographic citations between papers published in the journals. In article citation networks, the actual number of citations between two papers is ignored and the network is represented as an unweighted directed graph. Furthermore, since paper citations point backward in time, from newer to older papers, article citation networks are acyclic graphs. On the other hand, journal citation networks are weighted directed graphs, where the edges are weighted with the actual number of citations between the involved journals. Since journals contain papers published in different periods, journal citation networks typically contain cycles.
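
As a rough illustration of how a weighted journal citation network arises from article-level citations, the following Python sketch aggregates invented paper citations by the journals that published the papers; all names are hypothetical:

    from collections import defaultdict

    # Hypothetical article-level data: each paper cites a list of older papers,
    # and each paper belongs to the journal that published it.
    article_citations = {
        "paper1": ["paper2", "paper3"],
        "paper2": ["paper3"],
        "paper3": [],
        "paper4": ["paper1", "paper3"],
    }
    journal_of = {"paper1": "J1", "paper2": "J2", "paper3": "J1", "paper4": "J2"}

    # Weighted directed edge (citing journal, cited journal) -> number of citations.
    journal_citations = defaultdict(int)
    for paper, cited_papers in article_citations.items():
        for cited in cited_papers:
            journal_citations[(journal_of[paper], journal_of[cited])] += 1

    print(dict(journal_citations))
    # {('J1', 'J2'): 1, ('J1', 'J1'): 1, ('J2', 'J1'): 3} -- note the loop on J1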

Citations are often a way of acknowledging intellectual debt, because the ideas and the results of the earlier paper have been useful for the investigations presented in the newer paper. Hence, the number of citations received by a paper from other papers gives an indication of the impact of the paper within the academic community. The impact of the papers written by a scholar can be used as one criterion for assessing the quality of the research carried out by that scholar.

Recently, bibliometric indicators, like the Eigenfactor, have been proposed which exploit the topology of the journal citation network in order to compute importance scores for journals. The Eigenfactor indicator works similarly to PageRank for Web pages, giving more importance to journals that are cited by other important journals. Hence, academic journals within a given field can be ranked by their importance, and these rankings can be exploited by, for instance, librarians in order to purchase the most important journals within a limited budget.

The two main commercial bibliometric data sources are Web of Science, owned by Thomson Reuters, and Scopus, owned by Elsevier. Both Web of Science and Scopus are subscription-based proprietary databases that are hand-maintained by professional staff. On the other hand, Google Scholar uses automated citation indexing and provides free access.

Citation networks described so far are not the only possible network representation of citation patterns. An alternative representation is the cocitation network. Two papers are said to be cocited if they are both cited by the same third paper. Cocitation is often taken as an indicator that papers deal with related topics. A cocitation network is a network in which the vertices represent papers and the edges represent cocitation of pairs of papers. Notice that cocitation is a symmetric relationship, hence the edges in cocitation networks are undirected. It is also possible to define the strength of the cocitation between two papers as the number of other papers that cite both, and hence create a weighted cocitation network.
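
The following Python sketch computes cocitation strengths from a small set of invented citations; two papers are linked by an undirected edge whose weight is the number of papers citing both:

    from itertools import combinations
    from collections import defaultdict

    # Hypothetical citation data: each citing paper maps to the papers it cites.
    citations = {
        "paperA": ["paperX", "paperY"],
        "paperB": ["paperX", "paperY", "paperZ"],
        "paperC": ["paperY", "paperZ"],
    }

    # Undirected weighted cocitation edges: (cited i, cited j) -> number of cociting papers.
    cocitation = defaultdict(int)
    for citing, cited in citations.items():
        for i, j in combinations(sorted(cited), 2):
            cocitation[(i, j)] += 1

    print(dict(cocitation))
    # {('paperX', 'paperY'): 2, ('paperX', 'paperZ'): 1, ('paperY', 'paperZ'): 2}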

Another related concept is bibliographic coupling. Two papers are said to be bibliographically coupled if they cite the same other paper. Bibliographic coupling, like cocitation, can be taken as an indicator that the papers deal with related material. One can define a bibliographic coupling network, either weighted or not, in which the vertices are papers and the undirected edges indicate bibliographic coupling.
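
A weighted bibliographic coupling network can be computed in the same spirit, reusing the hypothetical citations dictionary from the cocitation sketch above; here the weight of an undirected edge is the number of references the two papers share:

    from itertools import combinations

    # Bibliographic coupling: (citing paper p, citing paper q) -> number of common references.
    coupling = {}
    for p, q in combinations(sorted(citations), 2):
        common = set(citations[p]) & set(citations[q])
        if common:
            coupling[(p, q)] = len(common)

    print(coupling)
    # {('paperA', 'paperB'): 2, ('paperA', 'paperC'): 1, ('paperB', 'paperC'): 2}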

Citation networks, in particular article citation networks, are in many ways similar to the World Wide Web. The vertices of the network hold information in the form of text and pictures, just as Web pages do, and the links from one paper to another play a role similar to the hyperlinks on Web pages, alerting the reader when information relevant to the topic of one paper can be found in another. Papers with many citations are often more influential than those with few, just as is the case with Web pages that are linked to by many other pages; in particular, the Eigenfactor bibliometric indicator has been designed to work on the journal citation network as PageRank does on the Web network. Finally, it is typical for a scholar to surf the citation network by following chains of citations from paper to paper, just as computer users surf the Web from page to page.

However, quantitative studies of citation networks are much older than the Web and go back to the 1960s with the studies of de Solla Price. Notably, Sergey Brin and Larry Page were inspired by work on citation networks in the development of the PageRank algorithm for Google. Studies on academic publications and citations fall within the field of bibliometrics, a branch of information and library science.

Recommender networks

A type of information network important for technology and commerce is the recommender network. Recommender networks represent people's preferences for things, such as for products sold by a retailer. Online merchants, for instance, keep records of which customers bought which products and sometimes ask them whether they liked the products. As another example, many large supermarket chains record the purchases made by each of their regular customers, usually identified by a barcode card that is scanned when purchases are made.

The appropriate representation of a recommender network is as a bipartite graph in which the two types of vertices represent products and customers, with edges connecting customers to the products they buy or like. One can also add weights to the edges to indicate how often a person bought a product or how much the person likes it.
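
Concretely, such a weighted bipartite graph can be sketched in Python as a mapping from customers to the products they bought, with weights recording purchase counts; the customers and products below are invented:

    # Hypothetical purchase data: customer -> {product: number of purchases}.
    # Customers and products form the two vertex classes of a bipartite graph.
    purchases = {
        "Alice": {"bread": 3, "milk": 5, "coffee": 1},
        "Bob":   {"bread": 2, "milk": 4, "tea": 2},
        "Carol": {"milk": 1, "tea": 3},
    }

    customers = set(purchases)
    products = {product for basket in purchases.values() for product in basket}
    print(customers)  # the three customer vertices
    print(products)   # the four product vertices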

The primary commercial interest in recommender networks arises from their use in recommender systems. These are computer algorithms that try to guess items that people will like by comparing a person's known preferences with those of other people who bought similar products. If person A bought or likes many of the same products as persons B, C, and D, and if persons B, C, and D all bought or like some further item that A has never bought or expressed an opinion about, then maybe A would like that item too. The website Amazon.com, for instance, uses recommender systems to recommend book titles, and many supermarkets print out discount coupons for products that the customer has never bought but might be interested in trying.
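
The following sketch implements this neighbourhood idea on the hypothetical purchases dictionary from the previous sketch: it finds the customers whose baskets overlap most with a given customer and suggests the products they bought that the customer has not. It is a toy heuristic, not any particular commercial system:

    # Recommend products bought by the customers most similar to the given one,
    # where similarity is the number of products bought in common.
    def recommend(purchases, customer, k=2):
        mine = set(purchases[customer])
        neighbours = sorted(
            (other for other in purchases if other != customer),
            key=lambda other: len(mine & set(purchases[other])),
            reverse=True,
        )[:k]
        suggestions = set()
        for other in neighbours:
            suggestions |= set(purchases[other]) - mine
        return suggestions

    print(recommend(purchases, "Alice"))  # {'tea'}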