Home » SQL & PL/SQL » SQL & PL/SQL » Finding initial, final, in_between, and initial_and_final events
Finding initial, final, in_between, and initial_and_final events [message #687913] |
Thu, 20 July 2023 10:24 |
manubatham20
Messages: 566 Registered: September 2010 Location: Seattle, WA, USA
|
Senior Member |
|
|
WITH RawData AS (
SELECT 1 AS Sequence_ID, 'program6' AS Program_Name
UNION ALL
SELECT 2, 'program2'
UNION ALL
SELECT 3, 'program7'
UNION ALL
SELECT 4, 'program3'
UNION ALL
SELECT 5, 'program1'
-- Rows with no Program_Name
UNION ALL
SELECT 6, NULL
UNION ALL
SELECT 7, NULL
UNION ALL
SELECT 8, NULL
-- Continue with program names
UNION ALL
SELECT 9, 'program4'
UNION ALL
SELECT 10, 'program1'
UNION ALL
SELECT 11, 'program2'
-- Rows with no Program_Name
UNION ALL
SELECT 12, NULL
UNION ALL
SELECT 13, NULL
-- Continue with program names
UNION ALL
SELECT 14, 'program6'
UNION ALL
SELECT 15, 'program1'
-- Rows with no Program_Name
UNION ALL
SELECT 16, NULL
UNION ALL
-- Continue with program names
SELECT 17, 'program5'
UNION ALL
SELECT 18, 'program2'
-- Rows with no Program_Name
UNION ALL
SELECT 19, NULL
-- Continue with program names
UNION ALL
SELECT 20, 'program3'
UNION ALL
SELECT 21, 'program4'
UNION ALL
SELECT 22, 'program2'
UNION ALL
SELECT 23, 'program5'
-- Rows with no Program_Name
UNION ALL
SELECT 24, NULL
UNION ALL
SELECT 25, NULL
-- Continue with program names
UNION ALL
SELECT 26, 'program1'
UNION ALL
SELECT 27, 'program7'
-- Rows with no Program_Name
UNION ALL
SELECT 28, NULL
-- Continue with program names
UNION ALL
SELECT 29, 'program6'
UNION ALL
SELECT 30, 'program4'
UNION ALL
SELECT 31, 'program2'
-- Rows with no Program_Name
UNION ALL
SELECT 32, NULL
-- Continue with program names
UNION ALL
SELECT 33, 'program5'
UNION ALL
SELECT 34, 'program3'
-- Rows with no Program_Name
UNION ALL
SELECT 35, NULL
UNION ALL
SELECT 36, NULL
-- Continue with program names
UNION ALL
SELECT 37, 'program2'
UNION ALL
SELECT 38, 'program1'
-- Rows with no Program_Name
UNION ALL
SELECT 39, NULL
-- Continue with program names
UNION ALL
SELECT 40, 'program6'
UNION ALL
SELECT 41, 'program4'
-- Rows with no Program_Name
UNION ALL
SELECT 42, NULL
UNION ALL
-- Continue with program names
SELECT 43, 'program5'
UNION ALL
SELECT 44, 'program2'
-- Rows with no Program_Name
UNION ALL
SELECT 45, NULL
-- Continue with program names
UNION ALL
SELECT 46, 'program1'
UNION ALL
SELECT 47, 'program7'
-- Rows with no Program_Name
UNION ALL
SELECT 48, NULL
-- Continue with program names
UNION ALL
SELECT 49, 'program3'
UNION ALL
SELECT 50, 'program5'
),
pivotdata AS (
SELECT
sequence_id,
program_name,
ROW_NUMBER() OVER (PARTITION BY CASE WHEN program_name IS NOT NULL THEN 0 ELSE 1 END ORDER BY sequence_id ASC) AS rno_asc,
ROW_NUMBER() OVER (PARTITION BY CASE WHEN program_name IS NOT NULL THEN 0 ELSE 1 END ORDER BY sequence_id DESC) AS rno_desc,
CASE
WHEN
FIRST_VALUE(program_name) OVER (PARTITION BY CASE WHEN program_name IS NOT NULL THEN 0 ELSE 1 END ORDER BY sequence_id ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) =
LAST_VALUE(program_name) OVER (PARTITION BY CASE WHEN program_name IS NOT NULL THEN 0 ELSE 1 END ORDER BY sequence_id ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
THEN 1
END initial_and_final_flag
FROM
rawdata
)
-- SELECT * FROM pivotdata
SELECT
DECODE(MAX(CASE WHEN program_name = 'program1' THEN appearance END), 2, 'initial', 1, 'in_between', 3, 'final', 4, 'initial_and_final') AS program1,
DECODE(MAX(CASE WHEN program_name = 'program2' THEN appearance END), 2, 'initial', 1, 'in_between', 3, 'final', 4, 'initial_and_final') AS program2,
DECODE(MAX(CASE WHEN program_name = 'program3' THEN appearance END), 2, 'initial', 1, 'in_between', 3, 'final', 4, 'initial_and_final') AS program3,
DECODE(MAX(CASE WHEN program_name = 'program4' THEN appearance END), 2, 'initial', 1, 'in_between', 3, 'final', 4, 'initial_and_final') AS program4,
DECODE(MAX(CASE WHEN program_name = 'program5' THEN appearance END), 2, 'initial', 1, 'in_between', 3, 'final', 4, 'initial_and_final') AS program5,
DECODE(MAX(CASE WHEN program_name = 'program6' THEN appearance END), 2, 'initial', 1, 'in_between', 3, 'final', 4, 'initial_and_final') AS program6,
DECODE(MAX(CASE WHEN program_name = 'program7' THEN appearance END), 2, 'initial', 1, 'in_between', 3, 'final', 4, 'initial_and_final') AS program7,
DECODE(MAX(CASE WHEN program_name = 'program8' THEN appearance END), 2, 'initial', 1, 'in_between', 3, 'final', 4, 'initial_and_final') AS program8
FROM
(
SELECT
program_name,
CASE
WHEN program_name IS NOT NULL and rno_asc = 1 AND initial_and_final_flag = 1
THEN 4
WHEN program_name IS NOT NULL and rno_desc = 1
THEN 3
WHEN program_name IS NOT NULL and rno_asc = 1
THEN 2
WHEN program_name IS NOT NULL
THEN 1
END AS appearance
FROM
pivotdata
) AS pivoteddata
Given the RawData dataset (which has sequence_id to determine the order in which different programs appeared in a timeline, and it few sequence can have null program_name where there are no program associated)
I would like to determine, for (program1, program2, .... programn) where they have occurred in the timeline.
Possible values could be:
initial: This indicates that program appeared in the very beginning.
final: This indicates that the program appeared in the very last.
in_between: This indicates that appeared in between and was not initial or final.
initial_and_final: Represents that the same program was there initially, and when it concluded (i.e. final).
single: Only one single program was present in the timeline, and no other programs.
With current state of data, output should look like:
+----------+----------+----------+----------+--------+--------+----------+--------+
|program1 |program2 |program3 |program4 |program5|program6|program7 |program8|
+----------+----------+----------+----------+--------+--------+----------+--------+
|in_between|in_between|in_between|in_between|final |initial |in_between|null |
+----------+----------+----------+----------+--------+--------+----------+--------+
But if we change program_name='program6' for sequence 50, the output should look like:
+----------+----------+----------+----------+----------+--------------------+----------+--------+
|program1 |program2 |program3 |program4 |program5 |program6 |program7 |program8|
+----------+----------+----------+----------+----------+--------------------+----------+--------+
|in_between|in_between|in_between|in_between|in_between|initial_and_final |in_between|null |
+----------+----------+----------+----------+----------+--------------------+----------+--------+
And if we change all the sequence to either null, or just have one program (let's say program6), then the output should be:
+----------+----------+----------+----------+----------+---------+----------+--------+
|program1 |program2 |program3 |program4 |program5 |program6 |program7 |program8|
+----------+----------+----------+----------+----------+---------+----------+--------+
|null |null |null |null |null |single |null |null |
+----------+----------+----------+----------+----------+---------+----------+--------+
Note:
- first or last sequence can also have NULL program_name.
Thanks.
[Updated on: Thu, 20 July 2023 10:59] Report message to a moderator
|
|
|
|
Re: Finding initial, final, in_between, and initial_and_final events [message #687915 is a reply to message #687914] |
Thu, 20 July 2023 12:06 |
manubatham20
Messages: 566 Registered: September 2010 Location: Seattle, WA, USA
|
Senior Member |
|
|
mathguy wrote on Thu, 20 July 2023 11:23Three questions.
First: this is a perfect application of match_recognize, but that was only introduced in Oracle 12.1. What is your Oracle version? That should be included with all questions. Sorry if you provided it already and I missed it.
Second: it would be much simpler (no dynamic query) if the output was in just two columns, with one program name per row, not one per column.
Third: A program name may be "single" in one group, "final" in another, etc. Your output only shows one category for each program name. Shouldn't there be one for each group of non-null names?
1. Oracle 10g.
2. Nope, having one program per row doesn't make sense, as this is just a very simple example of a complex overall use case. If I have 8 pre-defined programs, I prefer to keep it in columns (sparse data) over one row per program, and pivoting it again and again.
3. If program name is classified as single, it will take precedence over any other category.
|
|
|
|
Re: Finding initial, final, in_between, and initial_and_final events [message #687917 is a reply to message #687916] |
Thu, 20 July 2023 12:37 |
|
mathguy
Messages: 108 Registered: January 2023
|
Senior Member |
|
|
OK, I think I get it - I misunderstood part of the task.
So, you have a set of predetermined program names (including 'program8' which doesn't appear anywhere in the inputs). So the columns in the output are pre-determined, there is nothing "dynamic" about the requirement. This is good.
Also, you are looking at the entire timeline, from sequence = 1 to sequence = 50 - I thought you were splitting it into sub-sequences when a name was NULL, but re-reading the question (and your follow-up), that isn't what you are doing. So there should be only one classification for each program name.
I am working on possible solutions, I will post if I can come up with something simple. But in the meanwhile, just to make sure: you are indeed using Oracle, right? I ask only because in your RawData query, you are selecting values, but not showing where you are selecting them FROM (missing FROM clause). This works in some other RDBMS products, but not in Oracle; in Oracle you select "FROM DUAL" when you need such things.
Also, where do the predetermined program names come from? In my modeling I am adding a subquery in the WITH clause, but if in your use case they come from some other source, you will need to adapt.
|
|
|
|
Re: Finding initial, final, in_between, and initial_and_final events [message #687919 is a reply to message #687918] |
Thu, 20 July 2023 13:52 |
|
mathguy
Messages: 108 Registered: January 2023
|
Senior Member |
|
|
I think this will do what you need. I assumed that program1, for example, is classified as 'initial' only if it appears at sequence_id = 1 (and not if it is the first NON-NULL program name in the sequence, if the program name at sequence_id = 1 is NULL).
with
raw_data as ( ...... )
, prep as (
select case grouping(program_name) when 0 then program_name else 'global' end as program_name,
grouping_id(program_name) as lvl,
min(case grouping(program_name) when 1 then min(program_name) end) over () as min_pn,
min(case grouping(program_name) when 1 then max(program_name) end) over () as max_pn,
min(case grouping(program_name) when 1 then min(program_name)
keep (dense_rank first order by sequence_id) end) over () as first_pn,
min(case grouping(program_name) when 1 then min(program_name)
keep (dense_rank last order by sequence_id) end) over () as last_pn
from rawdata
group by rollup(program_name)
)
select *
from ( select program_name, min_pn, max_pn, first_pn, last_pn
from prep
where lvl = 0
)
pivot ( min( case when min_pn = max_pn then case when program_name = min_pn then 'single' end
when first_pn = last_pn and program_name = first_pn then 'initial_and_final'
when program_name = first_pn then 'initial'
when program_name = last_pn then 'final'
else 'in_between'
end )
for program_name in ('program1' as program1, 'program2' as program2, 'program3' as program3, 'program4' as program4,
'program5' as program5, 'program6' as program6, 'program7' as program7, 'program8' as program8)
)
;
|
|
|
Re: Finding initial, final, in_between, and initial_and_final events [message #687920 is a reply to message #687919] |
Thu, 20 July 2023 16:27 |
manubatham20
Messages: 566 Registered: September 2010 Location: Seattle, WA, USA
|
Senior Member |
|
|
Thanks mathguy. The definition of initial and final is first and last NOT NULL program_name in order of sequence_id. Also, I have moved on from Oracle, but try to write code in such a fashion which is generic, and can be ported easily if required. But since the beginning of my career this forum has been super helpful, and I turn to this forum looking for suggestions.
Could you please explain, how your code is efficient, and intuitive over what I've already?
[Updated on: Thu, 20 July 2023 17:48] Report message to a moderator
|
|
|
Re: Finding initial, final, in_between, and initial_and_final events [message #687921 is a reply to message #687920] |
Thu, 20 July 2023 21:46 |
|
mathguy
Messages: 108 Registered: January 2023
|
Senior Member |
|
|
If "initial" and "final" must capture the first and last non-null program names, then you need a small modification of the code: In the subquery named PREP in the WITH clause, add the WHERE clause between FROM and GROUP BY:
...
from rawdata
where program_name is not null
group by rollup(program_name)
...
This assumes that the program names in the PROGRAM_NAME column are always a subset of the "predetermined" names. If the "predetermined" names are from 'program1' to 'program8' but the PROGRAM_NAME column in the data may contain many other values, then a different WHERE clause is more efficient:
where program_name in ('program1', 'program2', ...... , 'program8')
The most time and resource consuming part of the query is the GROUP BY operation, so it is best to cut down the number of rows to which it applies as early as possible.
How is this solution more intuitive and more efficient... I don't know about "intuitive" - one's intuition depends on one's experience. In my query all the meaningful work is done in the subquery named PREP - an aggregation over the input data, using the ROLLUP option, to achieve two goals: first, to get all the distinct program names that are present in the data, and second, to get the global min, max, first and last program names. We only need these aggregates at the global level (over all the data, not for each group separately); this is what the ROLLUP thing does. Otherwise the query could be written without GROUP BY, with a SELECT DISTINCT; but that wouldn't get the global min, max, first and last.
The SELECT list of the PREP subquery includes several analytic functions - over the entire output of the aggregation - just to present the global min, max, first and last names on each row. Note that the aggregation is performed first, resulting in a small number of groups in the output ('program1' to 'program7' and NULL, plus the rollup row). The analytic functions apply to a very small number of rows, compared to the raw data.
Then the computations in PIVOT are trivial, and over small data. All the time in the query is in the aggregation in PREP (and I mean specifically in the aggregation, not in the analytic functions applied after the fact).
In your query, you get to essentially the same intermediate values to use in comparisons, but in a more indirect way. First you apply analytic functions, to the raw data itself (potentially a large number of rows). Then you apply an aggregation (pivoting) to the result - again over the same potentially large number of rows. Whether that matters, and by how much, is an empirical question. I tried both queries with 500 rows in rawdata, both finished too fast to see any difference. I tried again with 1.5 million rows in rawdata; now my query finishes about 10 times faster (under 0.3 seconds vs. 3 seconds for your query, on my system). Whether this matters depends on how much data you have in the first place; if it's of the order of 500 rows, then it doesn't matter.
[Updated on: Fri, 21 July 2023 00:27] Report message to a moderator
|
|
|
Goto Forum:
Current Time: Sat Jun 29 08:36:15 CDT 2024
|