Use Case Accelerators > How to Implement Count Distinct

The count distinct function in a database returns the number of distinct values for a specified column. This functionality can be implemented in DMExpress in two ways:

- Single value count distinct: optimal when you want to count distinct values of one field.
- Multiple value count distinct: optimal when you want to count distinct values of more than one field.

DMExpress can be used to obtain the number of distinct values of:

- one field: using the single value count distinct solution
- more than one field: using the multiple value count distinct solution

Single Value Count Distinct

This solution, included in the Attachments, is recommended when you want to count distinct values of a single field.

In the Source Data shown in the data flow diagram below, we want to count distinct cities in a state. To do this, we create a job with two tasks, where the output of the first task is the input to the second task:

- Task 1 – an Aggregate task to deduplicate
- Task 2 – an Aggregate task to count the number of unique occurrences

This results in the following data flow:

Combining the Tasks into a Job

In the Job Editor, create a job and add the two Aggregate tasks. Enable direct data flow between the tasks by selecting Edit > Use Direct Data Flows Everywhere.

Multiple Value Count Distinct

This solution, included in the Attachments, is recommended when you want to count distinct values of more than one field.

In the Source Data shown in the data flow diagram below, we want to count distinct counties and cities in a state. To do this, we create a job with four tasks, where the output of each task is the input to the subsequent task:

- Task 1 – a Copy task to do a vertical pivot
- Task 2 – an Aggregate task to deduplicate
- Task 3 – an Aggregate task to count the number of unique occurrences
- Task 4 – a Copy task to do a horizontal pivot

This results in the following data flow:

Task 1 – Vertical Pivot

In the Task Editor, create a Copy task:

- Define the following values:
delimiter=","

terminator="\r\n"

countyIndicator="County"

cityIndicator="City"

- Add a Fixed position target reformat and add the following values in the order shown

This will effectively create two target records for each source record:

- one with the state, county, and "county" indicator
- one with the state, city, and "city" indicator

In the subsequent task, these fields will be interpreted in the Source layout as State, County_or_City, and countyCityIndicator.

Task 2 – Deduplicate

In the Task Editor, create an Aggregate task:

- Define the output of Task 1 as the source.
- Add State, County_or_City, and countyCityIndicator as the Group by fields in the Aggregate dialog.
- Add State, County_or_City, and countyCityIndicator to the target reformat.

Task 4 – Horizontal Pivot

In the Task Editor, create a Copy task:

- Define the output of Task 3 as the source.
- Define the following values:
countDistinctCounty = IfThenElse(

recordlayout1.countyCityIndicator = "County",

recordlayout1.countDistinct,

0)

countDistinctCity = IfThenElse(

recordlayout1.countyCityIndicator = "City",

recordlayout1.countDistinct,

0)

- Add State as a Group by field and generate a Total summary of the countDistinctCounty and countDistinctCity values in the Aggregate dialog.
- Add State, total(countDistinctCounty), and total(countDistinctCity) to the target reformat.

Coming into this task, for each state, there will be two records – one with the number of distinct counties, and one with the number of distinct cities for that state, as shown in "Output of Task 3" in the data flow diagram above.

To follow the evaluation, start with the reformat. As each input record is processed, it evaluates:

- countDistinctCounty, which will return the record's countDistinct value if it's a county record (or 0 otherwise)
- countDistinctCity, which will return the record's countDistinct value if it's a city record (or 0 otherwise)

Pre-aggregation, the data would logically appear as follows:

The aggregation then groups by state and totals the counts for county and city, producing the final output.

Combining the Tasks into a Job

In the Job Editor, create a job and add the four tasks. Enable direct data flow between the tasks by selecting Edit > Use Direct Data Flows Everywhere.

50_singleValueCountDistinct.zip, compatible with DMExpress version 7.6.0 or higher

50_multipleValueCountDistinct.zip, compatible with DMExpress version 7.6.0 or higher

See How to Perform Horizontal Pivoting (Denormalization).

Copyright © 2016 Syncsort All rights reserved.