We already have the pfx_dw data warehouse database; this is where job data persists after it is removed from the qube database. The table of most interest to you is the pfx_dw.job_fact table.
mysql> DESCRIBE job_fact;
+------------------+----------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------+----------------------+------+-----+---------+-------+
| time_sk | int(11) | NO | | NULL | |
| job_id | int(11) | YES | UNI | NULL | |
| job_name | varchar(255) | YES | | NULL | |
| job_priority | int(11) | YES | | NULL | |
| job_timesubmit | int(11) | YES | | NULL | |
| job_timestart | int(11) | YES | | NULL | |
| job_timecomplete | int(11) | YES | | NULL | |
| jobstatus_sk | int(11) | YES | | NULL | |
| user_sk | smallint(5) unsigned | YES | | NULL | |
| cluster_sk | smallint(5) unsigned | YES | | NULL | |
| account_sk | smallint(5) unsigned | YES | | NULL | |
| kind_sk | smallint(5) unsigned | YES | | NULL | |
| prototype_sk | smallint(5) unsigned | YES | | NULL | |
| cpu_count | int(10) unsigned | NO | | NULL | |
| cpu_seconds | int(10) unsigned | NO | | NULL | |
| work_count | int(10) unsigned | NO | | NULL | |
| worktime_min | int(10) unsigned | NO | | NULL | |
| worktime_max | int(10) unsigned | NO | | NULL | |
| worktime_avg | int(10) unsigned | NO | | NULL | |
| mem_max | bigint(20) unsigned | YES | | NULL | |
| mem_avg | bigint(20) unsigned | YES | | NULL | |
| mem_stddev | bigint(20) unsigned | YES | | NULL | |
+------------------+----------------------+------+-----+---------+-------+
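Since job_timestart and job_timecomplete are epoch timestamps, wall-clock duration can be computed directly from the fact table without joining anywhere. A sketch (the guard against zero timestamps is an assumption, meant to skip jobs that never started or completed):

```sql
-- Wall-clock duration per job, and how many CPUs were busy on average.
-- Assumes jobs that never ran have zero/equal timestamps and should be skipped.
SELECT
    job_id,
    job_name,
    (job_timecomplete - job_timestart) AS wall_seconds,
    cpu_seconds / (job_timecomplete - job_timestart) AS avg_cpus_busy
FROM
    pfx_dw.job_fact
WHERE
    job_timestart > 0
    AND job_timecomplete > job_timestart;
```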
Each column whose name ends in "_sk" contains a "surrogate key", which is used to join to the corresponding "dimension" table; a dimension is any attribute that might appear in a WHERE clause. The dimension tables follow a naming convention: the table name is the dimension name suffixed with "_dim", the column to JOIN on has the same name as in the fact table, and the "actual" name of the entity is in the 'name' column:
mysql> show tables like '%dim';
+-------------------------+
| Tables_in_pfx_dw (%dim) |
+-------------------------+
| account_dim |
| cluster_dim |
| hoststate_dim |
| jobstatus_dim |
| kind_dim |
| prototype_dim |
| resource_dim |
| time_dim |
| user_dim |
| worker_dim |
+-------------------------+
mysql> DESCRIBE user_dim;
+---------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------+--------------+------+-----+---------+----------------+
| user_sk | int(11) | NO | PRI | NULL | auto_increment |
| name | varchar(255) | NO | UNI | NULL | |
+---------+--------------+------+-----+---------+----------------+
2 rows in set (0.00 sec)
mysql> DESCRIBE cluster_dim;
+------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+----------------+
| cluster_sk | int(11) | NO | PRI | NULL | auto_increment |
| name | varchar(255) | NO | UNI | NULL | |
+------------+--------------+------+-----+---------+----------------+
2 rows in set (0.00 sec)
The jobstatus_dim table is slightly different, in that it provides a mapping between the integer job status values and human-readable ones:
mysql> DESCRIBE jobstatus_dim;
+----------------+-------------+------+-----+------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------------+-------------+------+-----+------------+----------------+
| jobstatus_sk | int(11) | NO | PRI | NULL | auto_increment |
| status_int | smallint(6) | NO | | NULL | |
| status_char | varchar(15) | NO | | NULL | |
| effective_date | date | YES | | NULL | |
| expiry_date | date | YES | | 9999-12-31 | |
+----------------+-------------+------+-----+------------+----------------+
5 rows in set (0.02 sec)
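To see the full integer-to-text mapping, you can just dump the dimension. The effective_date/expiry_date columns suggest the mapping is versioned over time; presumably the currently-active rows are the ones whose expiry_date is still '9999-12-31', though that interpretation is an assumption:

```sql
-- Show the status mapping; rows with expiry_date = '9999-12-31'
-- are presumably the definitions currently in effect.
SELECT status_int, status_char, effective_date, expiry_date
FROM pfx_dw.jobstatus_dim
ORDER BY status_int, effective_date;
```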
So once you get used to the naming convention, INNER JOINs are easy to compose:
mysql> SELECT
COUNT(*)
FROM
pfx_dw.job_fact AS fact
INNER JOIN
user_dim AS user
ON fact.user_sk=user.user_sk
WHERE
user.name='jburk';
+----------+
| COUNT(*) |
+----------+
| 503 |
+----------+
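The same join generalizes to all users: drop the WHERE clause and GROUP BY the dimension's name column to get a per-user total. A sketch:

```sql
-- Job count per user, busiest users first.
SELECT
    user.name,
    COUNT(*) AS job_count
FROM
    pfx_dw.job_fact AS fact
INNER JOIN
    pfx_dw.user_dim AS user
ON fact.user_sk = user.user_sk
GROUP BY user.name
ORDER BY job_count DESC;
```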
The time_sk values in the fact tables are UNIXTIME epoch values, but you don't need to run date and time functions against each row in the fact table; if you do an INNER JOIN to the time_dim table on the time_sk columns, all of those values are pre-calculated in that table:
mysql> DESCRIBE time_dim;
+------------+----------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+----------------------+------+-----+---------+-------+
| time_sk | int(11) | NO | PRI | NULL | |
| date_time | datetime | NO | | NULL | |
| hour | tinyint(4) | NO | | NULL | |
| date | date | NO | | NULL | |
| dow | tinyint(4) | NO | | NULL | |
| month_name | char(9) | NO | | NULL | |
| month | tinyint(3) unsigned | NO | | NULL | |
| year | smallint(5) unsigned | NO | | NULL | |
+------------+----------------------+------+-----+---------+-------+
mysql> SELECT * FROM time_dim LIMIT 1;
+------------+---------------------+------+------------+-----+------------+-------+------+
| time_sk | date_time | hour | date | dow | month_name | month | year |
+------------+---------------------+------+------------+-----+------------+-------+------+
| 1327359000 | 2012-01-23 14:50:00 | 14 | 2012-01-23 | 2 | January | 1 | 2012 |
+------------+---------------------+------+------------+-----+------------+-------+------+
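Because the calendar fields are pre-calculated, time-based rollups reduce to a join plus a GROUP BY. For example, a per-day job count for a given month might look like this (a sketch; the year/month values are placeholders):

```sql
-- Jobs per day for April 2012, using the pre-calculated calendar columns.
SELECT
    time.date,
    COUNT(*) AS jobs
FROM
    pfx_dw.job_fact AS fact
INNER JOIN
    pfx_dw.time_dim AS time
ON fact.time_sk = time.time_sk
WHERE
    time.year = 2012
    AND time.month = 4
GROUP BY time.date
ORDER BY time.date;
```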
Example: find all failed jobs for 'jburk' for the month of April 2012:
mysql> SELECT
COUNT(*)
FROM
pfx_dw.job_fact AS fact
INNER JOIN
user_dim AS user
ON fact.user_sk=user.user_sk
INNER JOIN
time_dim AS time
ON fact.time_sk=time.time_sk
INNER JOIN
jobstatus_dim AS status
ON fact.jobstatus_sk=status.jobstatus_sk
WHERE
user.name='jburk'
AND
time.year=2012
AND
time.month=4
AND
status.status_char='failed'
;
+----------+
| COUNT(*) |
+----------+
| 4 |
+----------+
The nnqube databases are used to group the jobs into "buckets" of 128 jobs per database. When a job is removed from Qube, the tables for that job are dropped, and when the last set of tables in a particular database is dropped, the database itself is dropped. So Qube cleans up after itself.
We shard out the tables this way to cut down on table contention: MyISAM does not support row-level locking, and InnoDB (which does) has performance issues when handling hundreds of connections. Our roadmap includes a switch to Postgres within the next year, which will allow us to coalesce all of these tables into 6 tables in the main qube database with a 'jobid' column.