We already have the pfx_dw data warehouse database; this is where job data persists after it is removed from the qube database. The table of most interest to you is the pfx_dw.job_fact table.
mysql> DESCRIBE job_fact;
+------------------+----------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------+----------------------+------+-----+---------+-------+
| time_sk | int(11) | NO | | NULL | |
| job_id | int(11) | YES | UNI | NULL | |
| job_name | varchar(255) | YES | | NULL | |
| job_priority | int(11) | YES | | NULL | |
| job_timesubmit | int(11) | YES | | NULL | |
| job_timestart | int(11) | YES | | NULL | |
| job_timecomplete | int(11) | YES | | NULL | |
| jobstatus_sk | int(11) | YES | | NULL | |
| user_sk | smallint(5) unsigned | YES | | NULL | |
| cluster_sk | smallint(5) unsigned | YES | | NULL | |
| account_sk | smallint(5) unsigned | YES | | NULL | |
| kind_sk | smallint(5) unsigned | YES | | NULL | |
| prototype_sk | smallint(5) unsigned | YES | | NULL | |
| cpu_count | int(10) unsigned | NO | | NULL | |
| cpu_seconds | int(10) unsigned | NO | | NULL | |
| work_count | int(10) unsigned | NO | | NULL | |
| worktime_min | int(10) unsigned | NO | | NULL | |
| worktime_max | int(10) unsigned | NO | | NULL | |
| worktime_avg | int(10) unsigned | NO | | NULL | |
| mem_max | bigint(20) unsigned | YES | | NULL | |
| mem_avg | bigint(20) unsigned | YES | | NULL | |
| mem_stddev | bigint(20) unsigned | YES | | NULL | |
+------------------+----------------------+------+-----+---------+-------+
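Since job_timestart and job_timecomplete are epoch timestamps, wall-clock duration can be computed directly from the fact table without joining anywhere. A sketch (the guard against zero timestamps is an assumption, meant to skip jobs that never started or completed):

```sql
-- Wall-clock duration per job, and how many CPUs were busy on average.
-- Assumes jobs that never ran have zero/equal timestamps and should be skipped.
SELECT
    job_id,
    job_name,
    (job_timecomplete - job_timestart) AS wall_seconds,
    cpu_seconds / (job_timecomplete - job_timestart) AS avg_cpus_busy
FROM
    pfx_dw.job_fact
WHERE
    job_timestart > 0
    AND job_timecomplete > job_timestart;
```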
Each column whose name ends in "_sk" contains a "surrogate key", which is used to join to the corresponding "dimension" table; a dimension is any attribute that might appear in a WHERE clause. The dimension tables follow a naming convention: the table name is the dimension name suffixed with "_dim", the column to JOIN on has the same name as in the fact table, and the "actual" name of the entity is in the 'name' column:
mysql> show tables like '%dim';
+-------------------------+
| Tables_in_pfx_dw (%dim) |
+-------------------------+
| account_dim |
| cluster_dim |
| hoststate_dim |
| jobstatus_dim |
| kind_dim |
| prototype_dim |
| resource_dim |
| time_dim |
| user_dim |
| worker_dim |
+-------------------------+
mysql> DESCRIBE user_dim;
+---------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------+--------------+------+-----+---------+----------------+
| user_sk | int(11) | NO | PRI | NULL | auto_increment |
| name | varchar(255) | NO | UNI | NULL | |
+---------+--------------+------+-----+---------+----------------+
2 rows in set (0.00 sec)
mysql> DESCRIBE cluster_dim;
+------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+----------------+
| cluster_sk | int(11) | NO | PRI | NULL | auto_increment |
| name | varchar(255) | NO | UNI | NULL | |
+------------+--------------+------+-----+---------+----------------+
2 rows in set (0.00 sec)
The jobstatus_dim table is slightly different, in that it provides a mapping between the integer job status values and human-readable ones:
mysql> DESCRIBE jobstatus_dim;
+----------------+-------------+------+-----+------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------------+-------------+------+-----+------------+----------------+
| jobstatus_sk | int(11) | NO | PRI | NULL | auto_increment |
| status_int | smallint(6) | NO | | NULL | |
| status_char | varchar(15) | NO | | NULL | |
| effective_date | date | YES | | NULL | |
| expiry_date | date | YES | | 9999-12-31 | |
+----------------+-------------+------+-----+------------+----------------+
5 rows in set (0.02 sec)
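To see the full integer-to-text mapping, you can just dump the dimension. The effective_date/expiry_date columns suggest the mapping is versioned over time; presumably the currently-active rows are the ones whose expiry_date is still '9999-12-31', though that interpretation is an assumption:

```sql
-- Show the status mapping; rows with expiry_date = '9999-12-31'
-- are presumably the definitions currently in effect.
SELECT status_int, status_char, effective_date, expiry_date
FROM pfx_dw.jobstatus_dim
ORDER BY status_int, effective_date;
```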
So once you get used to the naming convention, INNER JOINs are easy to compose:
mysql> SELECT
COUNT(*)
FROM
pfx_dw.job_fact AS fact
INNER JOIN
user_dim AS user
ON fact.user_sk=user.user_sk
WHERE
user.name='jburk';
+----------+
| COUNT(*) |
+----------+
| 503 |
+----------+
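The same join generalizes to all users: drop the WHERE clause and GROUP BY the dimension's name column to get a per-user total. A sketch:

```sql
-- Job count per user, busiest users first.
SELECT
    user.name,
    COUNT(*) AS job_count
FROM
    pfx_dw.job_fact AS fact
INNER JOIN
    pfx_dw.user_dim AS user
ON fact.user_sk = user.user_sk
GROUP BY user.name
ORDER BY job_count DESC;
```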
The time_sk values in the fact tables are UNIXTIME epoch values, but you don't need to run date and time functions against each row in the fact table; if you do an INNER JOIN to the time_dim table on the time_sk columns, all of those values are pre-calculated in that table:
mysql> DESCRIBE time_dim;
+------------+----------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+----------------------+------+-----+---------+-------+
| time_sk | int(11) | NO | PRI | NULL | |
| date_time | datetime | NO | | NULL | |
| hour | tinyint(4) | NO | | NULL | |
| date | date | NO | | NULL | |
| dow | tinyint(4) | NO | | NULL | |
| month_name | char(9) | NO | | NULL | |
| month | tinyint(3) unsigned | NO | | NULL | |
| year | smallint(5) unsigned | NO | | NULL | |
+------------+----------------------+------+-----+---------+-------+
mysql> SELECT * FROM time_dim LIMIT 1;
+------------+---------------------+------+------------+-----+------------+-------+------+
| time_sk | date_time | hour | date | dow | month_name | month | year |
+------------+---------------------+------+------------+-----+------------+-------+------+
| 1327359000 | 2012-01-23 14:50:00 | 14 | 2012-01-23 | 2 | January | 1 | 2012 |
+------------+---------------------+------+------------+-----+------------+-------+------+
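Because the calendar fields are pre-calculated, time-based rollups reduce to a join plus a GROUP BY. For example, a per-day job count for a given month might look like this (a sketch; the year/month values are placeholders):

```sql
-- Jobs per day for April 2012, using the pre-calculated calendar columns.
SELECT
    time.date,
    COUNT(*) AS jobs
FROM
    pfx_dw.job_fact AS fact
INNER JOIN
    pfx_dw.time_dim AS time
ON fact.time_sk = time.time_sk
WHERE
    time.year = 2012
    AND time.month = 4
GROUP BY time.date
ORDER BY time.date;
```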
Example: find all failed jobs for 'jburk' for the month of April 2012:
mysql> SELECT
COUNT(*)
FROM
pfx_dw.job_fact AS fact
INNER JOIN
user_dim AS user
ON fact.user_sk=user.user_sk
INNER JOIN
time_dim AS time
ON fact.time_sk=time.time_sk
INNER JOIN
jobstatus_dim AS status
ON fact.jobstatus_sk=status.jobstatus_sk
WHERE
user.name='jburk'
AND
time.year=2012
AND
time.month=4
AND
status.status_char='failed'
;
+----------+
| COUNT(*) |
+----------+
| 4 |
+----------+
The nnqube databases are used to group the jobs into "buckets" of 128 jobs per database. When a job is removed from Qube, the tables for that job are dropped, and when the last set of tables in a particular database is dropped, the database itself is dropped. So Qube cleans up after itself.
We shard out the tables this way to cut down on table contention: MyISAM does not support row-level locking, and InnoDB (which does) has performance issues when handling hundreds of connections. Our roadmap includes a switch to Postgres within the next year, which will allow us to coalesce all of these tables into 6 tables in the main qube database with a 'jobid' column.