Author Topic: Keep data longterm / general architecture questions

instinct-vfx

Keep data longterm / general architecture questions
« on: April 20, 2012, 10:01:31 AM »
Hi there,

we are currently setting up Qube to match our needs as well as possible, so expect more questions soon :D

After a bit of fiddling around, I was wondering what the best way to persist data would be. Ideally I'd like to remove deletion privileges from users (which can be done via default permissions, afaik) and keep the jobs for statistical and reference purposes. I'd then like to get them out of the way of the artists (and, for performance, probably the supervisor too) after a certain amount of time. So ideally they would simply be moved to a legacy database and removed from the main DB. Is there anything built in to help with such a setup?
If not: I took a quick look at the DB, and it seems that Qube creates quite a lot of databases. What is that based on? A new DB after every x jobs? Will these databases (and the corresponding job tables!) simply keep counting upwards forever? Would the strategy have to be something like "use the Qube API to find jobs older than x, iterate over those jobs across the *qube databases (brr), copy all the data to a legacy DB, delete the records"? That sounds a tad scary, to be honest hehe.

Any input on that?

Kind regards,
Thorsten

jburk

Re: Keep data longterm / general architecture questions
« Reply #1 on: April 20, 2012, 05:19:19 PM »
We already have the pfx_dw data warehouse database; this is where job data persists after it's removed from the qube database.  The table that will be of the most interest to you is the pfx_dw.job_fact table:

mysql> DESCRIBE job_fact;
+------------------+----------------------+------+-----+---------+-------+
| Field            | Type                 | Null | Key | Default | Extra |
+------------------+----------------------+------+-----+---------+-------+
| time_sk          | int(11)              | NO   |     | NULL    |       |
| job_id           | int(11)              | YES  | UNI | NULL    |       |
| job_name         | varchar(255)         | YES  |     | NULL    |       |
| job_priority     | int(11)              | YES  |     | NULL    |       |
| job_timesubmit   | int(11)              | YES  |     | NULL    |       |
| job_timestart    | int(11)              | YES  |     | NULL    |       |
| job_timecomplete | int(11)              | YES  |     | NULL    |       |
| jobstatus_sk     | int(11)              | YES  |     | NULL    |       |
| user_sk          | smallint(5) unsigned | YES  |     | NULL    |       |
| cluster_sk       | smallint(5) unsigned | YES  |     | NULL    |       |
| account_sk       | smallint(5) unsigned | YES  |     | NULL    |       |
| kind_sk          | smallint(5) unsigned | YES  |     | NULL    |       |
| prototype_sk     | smallint(5) unsigned | YES  |     | NULL    |       |
| cpu_count        | int(10) unsigned     | NO   |     | NULL    |       |
| cpu_seconds      | int(10) unsigned     | NO   |     | NULL    |       |
| work_count       | int(10) unsigned     | NO   |     | NULL    |       |
| worktime_min     | int(10) unsigned     | NO   |     | NULL    |       |
| worktime_max     | int(10) unsigned     | NO   |     | NULL    |       |
| worktime_avg     | int(10) unsigned     | NO   |     | NULL    |       |
| mem_max          | bigint(20) unsigned  | YES  |     | NULL    |       |
| mem_avg          | bigint(20) unsigned  | YES  |     | NULL    |       |
| mem_stddev       | bigint(20) unsigned  | YES  |     | NULL    |       |
+------------------+----------------------+------+-----+---------+-------+



Each column whose name ends in "_sk" contains a "surrogate key" which can be used to join to the appropriate "dimension" table; a dimension is any column that might appear in a WHERE clause.  The dimension tables follow a naming convention: the table name is the dimension name, the column to do the JOIN on has the same name as in the fact table, and the "actual" name of the entity is in the 'name' column:

mysql> show tables like '%dim';
+-------------------------+
| Tables_in_pfx_dw (%dim) |
+-------------------------+
| account_dim             |
| cluster_dim             |
| hoststate_dim           |
| jobstatus_dim           |
| kind_dim                |
| prototype_dim           |
| resource_dim            |
| time_dim                |
| user_dim                |
| worker_dim              |
+-------------------------+

mysql> DESCRIBE user_dim;
+---------+--------------+------+-----+---------+----------------+
| Field   | Type         | Null | Key | Default | Extra          |
+---------+--------------+------+-----+---------+----------------+
| user_sk | int(11)      | NO   | PRI | NULL    | auto_increment |
| name    | varchar(255) | NO   | UNI | NULL    |                |
+---------+--------------+------+-----+---------+----------------+
2 rows in set (0.00 sec)

mysql> DESCRIBE cluster_dim;
+------------+--------------+------+-----+---------+----------------+
| Field      | Type         | Null | Key | Default | Extra          |
+------------+--------------+------+-----+---------+----------------+
| cluster_sk | int(11)      | NO   | PRI | NULL    | auto_increment |
| name       | varchar(255) | NO   | UNI | NULL    |                |
+------------+--------------+------+-----+---------+----------------+
2 rows in set (0.00 sec)



The jobstatus_dim table is slightly different, in that it provides a mapping between the job status integer values and human-readable ones:

mysql> DESCRIBE jobstatus_dim;
+----------------+-------------+------+-----+------------+----------------+
| Field          | Type        | Null | Key | Default    | Extra          |
+----------------+-------------+------+-----+------------+----------------+
| jobstatus_sk   | int(11)     | NO   | PRI | NULL       | auto_increment |
| status_int     | smallint(6) | NO   |     | NULL       |                |
| status_char    | varchar(15) | NO   |     | NULL       |                |
| effective_date | date        | YES  |     | NULL       |                |
| expiry_date    | date        | YES  |     | 9999-12-31 |                |
+----------------+-------------+------+-----+------------+----------------+
5 rows in set (0.02 sec)
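
The full integer-to-name mapping can also just be selected directly; a sketch using only the columns shown above (the actual rows will depend on your installation, so no output is shown here):

mysql> SELECT status_int, status_char FROM jobstatus_dim ORDER BY status_int;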



So INNER JOINs can easily be composed once you get used to the naming convention:

mysql> SELECT
    COUNT(*)
FROM
    pfx_dw.job_fact AS fact
INNER JOIN
    user_dim AS user
    ON fact.user_sk=user.user_sk
WHERE
    user.name='jburk';
+----------+
| COUNT(*) |
+----------+
|      503 |
+----------+
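
The same pattern extends naturally to aggregates; for example, a per-user job count with total CPU time, sketched from the job_fact columns shown above:

mysql> SELECT
    user.name,
    COUNT(*) AS job_count,
    SUM(fact.cpu_seconds) AS total_cpu_seconds
FROM
    pfx_dw.job_fact AS fact
INNER JOIN
    user_dim AS user
    ON fact.user_sk=user.user_sk
GROUP BY
    user.name
ORDER BY
    job_count DESC;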



The time_sk values in all the fact tables are UNIX epoch values, but you don't need to run date and time functions on each row in the fact table; if you do an INNER JOIN to the time_dim table on the time_sk column, all of those values are pre-calculated in that table:

mysql> DESCRIBE time_dim;
+------------+----------------------+------+-----+---------+-------+
| Field      | Type                 | Null | Key | Default | Extra |
+------------+----------------------+------+-----+---------+-------+
| time_sk    | int(11)              | NO   | PRI | NULL    |       |
| date_time  | datetime             | NO   |     | NULL    |       |
| hour       | tinyint(4)           | NO   |     | NULL    |       |
| date       | date                 | NO   |     | NULL    |       |
| dow        | tinyint(4)           | NO   |     | NULL    |       |
| month_name | char(9)              | NO   |     | NULL    |       |
| month      | tinyint(3) unsigned  | NO   |     | NULL    |       |
| year       | smallint(5) unsigned | NO   |     | NULL    |       |
+------------+----------------------+------+-----+---------+-------+

mysql> SELECT * FROM time_dim LIMIT 1;
+------------+---------------------+------+------------+-----+------------+-------+------+
| time_sk    | date_time           | hour | date       | dow | month_name | month | year |
+------------+---------------------+------+------------+-----+------------+-------+------+
| 1327359000 | 2012-01-23 14:50:00 |   14 | 2012-01-23 |   2 | January    |     1 | 2012 |
+------------+---------------------+------+------------+-----+------------+-------+------+
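
This makes calendar-based reporting cheap, since you can group on the pre-calculated columns instead of converting epoch values per row; for instance, a per-day job count, sketched from the columns shown above:

mysql> SELECT
    time.date,
    COUNT(*) AS jobs
FROM
    pfx_dw.job_fact AS fact
INNER JOIN
    time_dim AS time
    ON fact.time_sk=time.time_sk
GROUP BY
    time.date
ORDER BY
    time.date;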



Example: find all failed jobs for 'jburk' for the month of April/2012:

mysql> SELECT
    COUNT(*)
FROM
    pfx_dw.job_fact AS fact
INNER JOIN
    user_dim AS user
    ON fact.user_sk=user.user_sk
INNER JOIN
    time_dim AS time
    ON fact.time_sk=time.time_sk
INNER JOIN
    jobstatus_dim AS status
    ON fact.jobstatus_sk=status.jobstatus_sk
WHERE
    user.name='jburk'
    AND
    time.date BETWEEN '2012-04-01' AND '2012-04-30'
    AND
    status.status_char='failed'
;
+----------+
| COUNT(*) |
+----------+
|        4 |
+----------+


The nnqube databases are used to group jobs into "buckets" of 128 jobs per database.  When a job is removed from Qube, the tables for that job are dropped, and when the last set of tables in a given database is dropped, the database itself is dropped.  So Qube does clean up after itself.

We shard the tables this way to cut down on table contention, since MyISAM does not support row locking, and InnoDB (which does support row locking) has performance issues when handling hundreds of connections.  Our roadmap includes a plan to switch over to Postgres within the next year, which will allow us to coalesce all of these tables into 6 tables in the main qube db with a 'jobid' column.
« Last Edit: April 20, 2012, 05:46:03 PM by jburk »

instinct-vfx

Re: Keep data longterm / general architecture questions
« Reply #2 on: April 23, 2012, 07:26:32 AM »
Great, thanks a lot! That is a ton of info to get started with. When does data actually go into the warehouse database? When a job is removed, or earlier?

Kind regards and thanks again,
Thorsten

jburk

Re: Keep data longterm / general architecture questions
« Reply #3 on: April 30, 2012, 10:10:32 PM »
The data collectors run every 5 minutes, so it's before a job is removed; in fact it's at most 5 minutes after the job is created, and the data updates every 5 minutes for as long as the job is active.

The data collectors essentially filter for "any job that has had anything about it change in the last 5 minutes", which includes new jobs.

But we can't (or rather, we don't) track when a job is removed (technically, that's an update), since the "absence" of a thing is awkward to track.  We'd have to keep a list of the jobs from 5 minutes ago and then ask "what's missing from the current job list that was present 5 minutes ago?".  Frankly, I don't think it's worth the effort just to know when a job was removed.  Besides curiosity, what would you realistically do with that data point?

instinct-vfx

Re: Keep data longterm / general architecture questions
« Reply #4 on: May 01, 2012, 12:52:07 PM »
Heya,

sorry, that was easy to misunderstand. I wasn't looking for the removal point; I was just wondering when and how data is collected, mainly to know whether the data warehouse is "up to date" or reflects the state WITHOUT the jobs that are still active/not yet removed.

Thanks again for the information, that seems perfect :)

Thorsten