
Information Architecture

mojavelinux edited this page Mar 20, 2012 · 24 revisions

This document describes the information architecture of the website. In particular, it covers:

  • terminology

  • data structure, retrieval and storage

  • information linkages

  • infrastructure and source code conventions

  • site structure

It also documents areas where source information can be improved to be more friendly to automation.

We’ll start with the terminology that’s central to the information architecture: components and modules.

Component vs Module

When the Arquillian repository was divided up, we began referring to the individual Arquillian repositories as modules. However, as we built out the website, we realized that a repository was not granular enough to document all aspects (or dimensions) of the software. In particular, a container adapter repository typically contains multiple container adapter modules. There is certainly a relationship between the repository and the artifacts, but they aren’t the same.

People often use component to describe a distinct part of a module. These terms can also be used the other way around. In fact, they are generally interchangeable. Fortunately, there’s precedent within the Arquillian project that helps us decide how to apply them.

JIRA uses the term component to categorize issues within a project. The Arquillian team has adopted the convention of mapping components in JIRA to source repositories. Furthermore, a tag (e.g., a release) in the source repository is mapped to a component version in JIRA. So, we’ll reinforce this definition by adopting it for the website data structure.

component == repository

That leaves us with the term module to refer to an individual artifact within a repository. Fortunately, there is precedent for this term as well. Maven refers to sub-projects as modules, which get listed in a module element in the parent pom. Once again, we’ll reinforce this definition by adopting it for the website data structure.

module == artifact

Sorting out these two terms gives us the ability to map the primary data for the website. We arrive at the following facts:

  • A component has a 1-to-many relationship to a module (e.g., a container repository has many adapters)

  • Each module is represented by a single web page (e.g., a remote container adapter)

  • Releases are performed on a component; all the modules of a component have the same version

  • Components are categorized by component type: platform, container-adapter, extension, test-runner or tool-plugin
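These facts can be sketched as a small data model. This is only an illustration: the class names (Component, ArqModule), the field names and the sample component are assumptions, not the structures the website build actually uses.

```ruby
# Hypothetical model of the facts above; names and fields are illustrative only.
Component = Struct.new(:name, :type, :version, :modules) do
  # A component has a 1-to-many relationship to its modules.
  def add_module(name)
    mod = ArqModule.new(name, self)
    modules << mod
    mod
  end
end

# Called ArqModule to avoid clashing with Ruby's built-in Module class.
ArqModule = Struct.new(:name, :component) do
  # Releases happen on the component, so every module reports the component's version.
  def version
    component.version
  end
end

# Sample data is made up for the example.
component = Component.new('arquillian-container-example', 'container-adapter', '1.0.0.CR3', [])
remote = component.add_module('example-remote')
managed = component.add_module('example-managed')

puts component.modules.map { |m| "#{m.name} #{m.version}" }
```

Keeping the version on the component rather than on each module mirrors the rule that all modules of a component are released together.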

With those facts in place, we can turn to where information about the components is retrieved.

Information sources

Information is pulled from three main sources:

  • Cloned git repositories

  • REST APIs

  • Screen scraping (when the bastards won’t give us a proper API)

The repositories are explained in the next section. The REST API and screen scraping requests are performed by the Ruby RestClient. The results are transparently cached in the following directory:

'_tmp/restcache'

Some of the calls have an expires timeout, though most of the data we are pulling is relatively static.
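To make the caching behavior concrete, here is a minimal sketch of a file-backed response cache with an optional expiry. The RestCache class, its fetch method and the SHA1-based file naming are assumptions made for this example, and the actual HTTP request is replaced by a block so the sketch runs without network access.

```ruby
require 'digest/sha1'
require 'fileutils'
require 'tmpdir'

# Hypothetical helper, not the website's real cache implementation.
class RestCache
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(@dir)
  end

  # Returns the cached body for url if present and fresh; otherwise yields
  # to the block to fetch it and stores the result on disk.
  # expires_in is in seconds; nil means the entry never expires.
  def fetch(url, expires_in: nil)
    path = File.join(@dir, Digest::SHA1.hexdigest(url))
    return File.read(path) if File.exist?(path) && fresh?(path, expires_in)

    body = yield(url)
    File.write(path, body)
    body
  end

  private

  def fresh?(path, expires_in)
    expires_in.nil? || (Time.now - File.mtime(path)) < expires_in
  end
end

# Usage with a stand-in fetcher; a real build would issue the HTTP request here.
Dir.mktmpdir do |tmp|
  cache = RestCache.new(File.join(tmp, 'restcache'))
  calls = 0
  fetcher = ->(url) { calls += 1; "body of #{url}" }
  cache.fetch('http://example.com/api/repos', &fetcher)
  cache.fetch('http://example.com/api/repos', &fetcher) # served from disk
  puts "fetched #{calls} time(s)"
end
```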

Repositories

The primary source of information for a component (and thus its modules) is a source repository. At the moment, we are going to limit that association to git source repositories because git is very efficient at reporting information about the entire history of a project… and we just happened to implement it that way.

We need to slightly qualify the assumption that a component and a repository are equivalent. For components within Arquillian, that is likely to remain true. We don’t control all repositories, however, and Arquillian support may (and often does) appear as a module within the source repository of another project. We therefore support looking for a component within a relative path of a repository, so the mapping is more accurately described as:

component == repository or repository subtree

There’s another very important consequence of this narrowing. Not all contributors on a project contribute to the Arquillian support. We want to make sure that when we count commits for Arquillian, we only look within the relevant portion of the repository. Otherwise, on a project like AS7, which hosts its own Arquillian adapters, the number of contributors to Arquillian would be highly inflated.
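One way to restrict the commit count to a subtree is to pass the relative path to git log as a path limiter. The sketch below assumes that approach; the function names and the 'arquillian' subtree are hypothetical, and it is not necessarily how the website build does the counting.

```ruby
require 'open3'

# Builds the git arguments that print one author email per commit,
# limited to relative_path when one is given.
def author_log_args(relative_path = nil)
  args = ['log', '--format=%ae']
  args += ['--', relative_path] unless relative_path.to_s.empty?
  args
end

# Turns the raw log output into a hash of author email => commit count.
def count_authors(log_output)
  log_output.each_line.map(&:strip).reject(&:empty?)
            .each_with_object(Hash.new(0)) { |email, counts| counts[email] += 1 }
end

# Runs git inside repo_dir (a local clone) and counts commits per author.
def commit_counts(repo_dir, relative_path = nil)
  output, status = Open3.capture2('git', '-C', repo_dir, *author_log_args(relative_path))
  raise "git log failed in #{repo_dir}" unless status.success?

  count_authors(output)
end

# The subtree name here is made up for the example.
puts author_log_args('arquillian').join(' ')
```

Calling commit_counts with a relative path only sees commits that touched files under that path, which keeps contributors to the rest of the host project out of the totals.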

The website build begins by grabbing a list of repositories from the ohloh.net REST API. This list is supplemented by manual entries (until we can figure out how to discover them automatically). The following information is captured about a repository:

  • path

  • relative path (subtree)

  • desc

  • owner

  • host

  • http url

  • clone url

The build then clones all the repositories locally to allow them to be analyzed for tags, commits and additional project metadata in the build file and source code.

Note
We found using git to operate on the repositories locally is much, much more efficient than using the REST API that github offers. We still rely on that REST API for information that is not available in the repository, such as user accounts and teams.

The source code analyzers then take over and mine information out of the git repositories. The following information is gathered:

  • component: name, desc, type, group id, JIRA key, basepath, etc.

    • releases: version, sha hash, release date, released by, contributors, published artifacts, compile dep versions, etc.

  • modules: name, description, artifacts, etc.

Note
At the moment we are getting the published artifacts for a release by switching to the tag and running a partial Maven build to get the build plan. Since releases are permanent, and this operation is expensive, we should think about making this available as a data service (or perhaps update a data file in the website repository when a new release occurs).

The component type is determined from the name of the repository as follows:

  • arquillian-core ⇒ platform

  • arquillian-container-* ⇒ container-adapter

  • arquillian-extension-* ⇒ extension

  • arquillian-testrunner-* ⇒ test-runner

  • arquillian-*-plugin ⇒ tool-plugin
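The naming rules above translate directly into a pattern table. The following is a sketch under the assumption that the patterns are tried in the order listed and that an unmatched name yields no type; the constant and method names, and the sample plugin name, are made up for the example.

```ruby
# Ordered list of [pattern, type]; the first matching pattern wins.
COMPONENT_TYPE_PATTERNS = [
  [/\Aarquillian-core\z/, 'platform'],
  [/\Aarquillian-container-.+\z/, 'container-adapter'],
  [/\Aarquillian-extension-.+\z/, 'extension'],
  [/\Aarquillian-testrunner-.+\z/, 'test-runner'],
  [/\Aarquillian-.+-plugin\z/, 'tool-plugin']
].freeze

# Returns the component type for a repository name, or nil if no rule applies.
def component_type(repo_name)
  rule = COMPONENT_TYPE_PATTERNS.find { |pattern, _type| pattern.match?(repo_name) }
  rule && rule.last
end

puts component_type('arquillian-core')
puts component_type('arquillian-extension-drone')
puts component_type('arquillian-example-plugin') # hypothetical repository name
```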

Every component has a reference to its modules, and every module has a reference to its component. This association makes it easy for extensions and page templates to access the data they require. Where we find the data is not readily accessible enough, we’ll mix helper methods into the objects.

Once all the information is captured into the data structures, they are dumped to yaml and written to the following cache location:

'_tmp/datacache/components.yml'

Pages are then generated for each module and fed into the page building engine.

The next time the website build occurs, the data is read from the cache file rather than doing all the analysis again. This optimization speeds up development, and also makes it easy to inspect the data that was collected.

Tip
If you need to regenerate, delete the _tmp/datacache directory.
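The read-from-cache-or-analyze behavior can be sketched as follows. The load_or_build helper is hypothetical, and the example stores plain hashes with string keys rather than the richer objects the real build dumps.

```ruby
require 'yaml'
require 'fileutils'
require 'tmpdir'

# Returns the data stored in cache_file if it exists; otherwise runs the
# block (the expensive analysis), writes its result as YAML, and returns it.
def load_or_build(cache_file)
  return YAML.load_file(cache_file) if File.exist?(cache_file)

  data = yield
  FileUtils.mkdir_p(File.dirname(cache_file))
  File.write(cache_file, YAML.dump(data))
  data
end

# Usage in a temporary directory so the example leaves nothing behind.
Dir.mktmpdir do |tmp|
  cache_file = File.join(tmp, 'datacache', 'components.yml')
  analyze = -> { [{ 'name' => 'arquillian-core', 'type' => 'platform' }] }
  load_or_build(cache_file, &analyze)              # runs the analysis and writes the file
  components = load_or_build(cache_file, &analyze) # reads the file instead
  puts components.first['name']
end
```

Deleting the cache directory, as the tip says, simply makes the File.exist? check fail so the analysis runs again.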

The other key data set in the information architecture is the user identities.

Identities

In addition to the components and modules, the other important data that the website showcases is the contributors. The identities serve two purposes:

  • Properly credit each contributor with the contribution made (code contribution, guide or blog author, component lead, etc.)

  • Link to the contributor’s online identity

Giving credit is a core tenet of the Arquillian project and we want to do whatever we can to give ownership to the individuals and organizations that have contributed to Arquillian. Rather than decide how many contributions are meaningful, we let the activity speak for itself. That’s why we also track any contribution we can, including website updates, translations and blog entries. It’s all "source code".

There are three pivot points for the identities:

  • commit email address (associated w/ every git commit)

  • github login

  • jboss username

Unfortunately, the hard work of linking these identities is up to us :) We’ve set up a chain of identity collectors and crawlers. Here’s how we are currently linking the identities together:

  1. when analyzing the repositories, collect every unique commit email address and a reference commit sha

  2. use the commit sha to make a REST API call to get the associated github login (assumes the repo is on github)

  3. use the github login to make a REST API call to get all public identity info from github

  4. use the gravatar id provided by github to get the person’s avatar and grab any reachable identity

  5. use name or github login to make a REST API call to the JBoss Community Confluence instance to get the jboss username (fuzzy match)

    • the jboss username is later used to link the identities of component leads
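The chain above can be pictured as a list of collectors, each enriching the identity gathered so far. The sketch below is an illustration only: the REST calls are replaced with in-memory lookup tables, the gravatar step is left out, the fuzzy match is reduced to a case-insensitive lookup, and every name and sample value is made up.

```ruby
# Stand-ins for the remote lookups; a real build would call the REST APIs instead.
GITHUB_LOGIN_BY_SHA = { 'abc123' => 'janedoe' }.freeze
GITHUB_PROFILE_BY_LOGIN = { 'janedoe' => { 'name' => 'Jane Doe', 'gravatar_id' => 'f00' } }.freeze
JBOSS_USERNAME_BY_NAME = { 'jane doe' => 'jdoe' }.freeze

# Each collector receives the identity so far and returns the fields it found.
COLLECTORS = [
  ->(id) { { 'github_login' => GITHUB_LOGIN_BY_SHA[id['sha']] } },
  ->(id) { GITHUB_PROFILE_BY_LOGIN.fetch(id['github_login'], {}) },
  ->(id) { { 'jboss_username' => JBOSS_USERNAME_BY_NAME[id['name'].to_s.downcase] } }
].freeze

# Starts from the commit email and reference sha, then runs every collector in order.
def link_identity(email, sha)
  COLLECTORS.reduce({ 'email' => email, 'sha' => sha }) do |identity, collector|
    # Drop nil values so a failed lookup never erases data found earlier.
    identity.merge(collector.call(identity).compact)
  end
end

puts link_identity('jane@example.com', 'abc123')['jboss_username']
```

Because each collector only adds what it finds, an identity that cannot be resolved on github still keeps its commit email and sha.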

We then wire everything together as best we can and provide helper methods to retrieve identities. The goal is that any reference on the website to a user links consistently to their profile (on the website or otherwise).

Tip
The username should be used whenever an identity is needed, such as for a blog entry.

When it’s all said and done, we end up with a data structure of all the identities. This data structure is converted to yaml and written to the following cache location:

'_tmp/datacache/identities.yml'

The next time the website build occurs, it reads from this file rather than doing all the analysis again. This optimization speeds up development, and also makes it easy to inspect the data that was collected.

Tip
If you need to regenerate, delete the _tmp/datacache directory.

Issues and component leads

Resolved issues for a version are retrieved using the JIRA REST API and some screen scraping (because the REST API is stingy). The resolved issues are indexed by the component release version from JIRA so that they can be looked up per release in the release announcement or module page. The component release version uses the following convention:

last segment of repository path + _ + version

For example:

Arquillian Drone 1.0.0.CR3 ⇒ drone_1.0.0.CR3
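As a sketch, the convention can be written as a one-line function. It assumes that "last segment" means the last hyphen-separated token of the repository name; the method name is hypothetical.

```ruby
# Builds the JIRA component release version from a repository path and a version.
# Assumption: the "last segment" is the last hyphen-separated token of the path.
def jira_release_version(repository_path, version)
  "#{repository_path.split('-').last}_#{version}"
end

puts jira_release_version('arquillian-extension-drone', '1.0.0.CR3') # drone_1.0.0.CR3
```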

In addition to issues, JIRA serves as the canonical source for component leads. Since assignment of issues is a daily activity, this data is, by necessity, kept correct. There’s only one challenge: what is the id used to associate a JIRA component with a source repository (and hence a component in the data structure)? It turns out, there isn’t one. Thus, we have to adopt the following convention.

The source repository path is stored at the end of the description field of the JIRA component, offset by two colons followed by a space. Here’s an example:

Issues pertaining to the internal Arquillian implementation :: arquillian-core

Warning
If this hint is removed, the component lead will not be listed on the website.
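Extracting the hint from a description could look like the sketch below. The method name and the exact pattern are assumptions; it only recognizes a hint that sits at the very end of the description, separated by " :: ".

```ruby
# Returns the repository path stored at the end of a JIRA component
# description, or nil when the description carries no " :: " hint.
def repository_hint(description)
  match = description.to_s.match(/ :: (\S+)\s*\z/)
  match && match[1]
end

puts repository_hint('Issues pertaining to the internal Arquillian implementation :: arquillian-core')
```

A nil result corresponds to the situation in the warning: without the hint there is nothing to tie the JIRA component (and its lead) back to a repository.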

This linkage is also how we resolve the URL to the component in JIRA to display on the module pages for that component.

Release announcements

The information collected by the repository and identity extensions can be used to generate the bulk of the release announcements. This generation is done in two steps:

  1. Pages are generated for each release, integrating an optional release notes file for the release from the website source, if found.

  2. Those pages are then picked up by the posts extension and integrated into the blog along with the rest of the entries.

We only do releases for our own components (for now, any repository whose owner is arquillian).