URGENT: remove duplicate records and sorting in DEDUP??

achan
Posts: 71
Joined: Thu Mar 20, 2008 7:19 pm

Postby achan » Sat Aug 30, 2008 12:35 am

Hi,

How do I remove duplicate records without specifying all the fields? My record's metadata has 2000 fields...
Here is a subset of my input data, sorted by REFERENCE (primary key):

"REFERENCE","NAME","NO"
"000000010271 ","WFB ","1"
"000000010271 ","WFB ","1"
"000000010272 ","ABC ","1"
"000000010272 ","ABC ","2"

I want output like this:

"REFERENCE","NAME","NO"
"000000010271 ","WFB ","1" (removed the duplicate)
"000000010272 ","ABC ","1"
"000000010272 ","ABC ","2"

I know I can use DEDUP and set dedupKey="REFERENCE;NAME;NO" to get this output, but if my input data has 2000 fields, I do not want to list all 2000 fields in dedupKey, right? Moreover, can dedupKey even be set to such a long string? So, is there a way to tell CloverETL to remove duplicate records when there are 2000 fields to match?

I would think DEDUP just needs a flag, say remove_only_if_all_fields_matches, that can be set to true so that DEDUP takes the list of fields from the FMT. If the values of all corresponding fields match, the record is a duplicate and gets removed. That way dedupKey would not need to hold a large number of field names, right?

Just to make sure: DEDUP does not sort the records, right?

Any help would be greatly appreciated.

thanks,
al

avackova
Posts: 841
Joined: Fri Jul 20, 2007 9:28 am

Re: URGENT: remove duplicate records and sorting in DEDUP??

Postby avackova » Thu Dec 17, 2009 11:32 am

Hello,
I've created a new issue for your request (http://bug.cloveretl.com/view.php?id=3401).
As for sorting: Dedup doesn't sort records, but it expects the records to be sorted by the key fields. If they are not, it only removes duplicates within each run of consecutive input records that share the same key.
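To make the behaviour described above concrete, here is a minimal Python sketch of deduplication that only compares neighbouring records. It is not CloverETL code: the function name `dedup_adjacent` and the sample rows (modelled on the data in the first post) are made up for illustration, and the key is assumed to be the whole record.

```python
from itertools import groupby


def dedup_adjacent(records, key):
    """Keep the first record of each run of consecutive records with an equal key."""
    return [next(group) for _, group in groupby(records, key=key)]


def whole_record(record):
    """Hypothetical key function: use every field of the record as the key."""
    return record


rows = [
    ("000000010271", "WFB", "1"),
    ("000000010272", "ABC", "1"),
    ("000000010271", "WFB", "1"),  # duplicate of the first row, but not adjacent to it
    ("000000010272", "ABC", "2"),
]

# Unsorted input: the duplicate survives because it is not next to its twin.
unsorted_result = dedup_adjacent(rows, whole_record)

# Sorted input: equal records become neighbours, so the duplicate is dropped.
sorted_result = dedup_adjacent(sorted(rows, key=whole_record), whole_record)

print(len(unsorted_result), len(sorted_result))
```

This is why a sort on the same key fields normally sits in front of the dedup step.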
Agata Vackova
Javlin a.s.
agata.vackova@javlin.eu

twaller
Posts: 49
Joined: Mon Feb 23, 2009 4:21 pm

Re: URGENT: remove duplicate records and sorting in DEDUP??

Postby twaller » Thu Dec 17, 2009 3:23 pm

Hello Achan,

You can click inside the left pane of the Edit key dialog (the Fields pane), press Ctrl+A (all fields will be selected and highlighted in blue), and then use the Right arrow key.

This moves all the fields to the pane on the right. You only need to confirm by clicking OK.

Before that, do the same in the ExtSort component, since Dedup expects its input to be sorted by the key fields.

I think this is what you wanted.
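If clicking through the dialog is not practical, the same key strings could also be generated outside the Designer. The following is only a sketch in Python, assuming the field names match the header row of the CSV file and that the key formats are the ones quoted in this thread (`REFERENCE;NAME;NO` for the dedup key, `field(a)` entries for the sort key). The helper `field_names` is a hypothetical name, not part of CloverETL.

```python
import csv
import io

# Sample modelled on the data in the first post; a real file would be opened instead.
sample = '"REFERENCE","NAME","NO"\n"000000010271 ","WFB ","1"\n'


def field_names(text):
    """Return the field names found in the header row of CSV text."""
    return next(csv.reader(io.StringIO(text)))


names = field_names(sample)

# Dedup key: all field names joined with semicolons.
dedup_key = ";".join(names)

# Sort key: the same fields, each marked ascending with "(a)".
sort_key = ";".join(f"{name}(a)" for name in names)

print(dedup_key)
print(sort_key)
```

The resulting strings can then be pasted into the key attributes of the two components.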

Best regards,

Tomas Waller
Javlin, a.s.
wallert@mail.javlin.cz

TomFS
Posts: 3
Joined: Tue Jun 18, 2013 3:15 pm

Re: URGENT: remove duplicate records and sorting in DEDUP??

Postby TomFS » Wed Jun 26, 2013 5:12 pm

Reviving this old thread because I am facing a similar issue. My problem is not the number of metadata fields (100), but rather the number of separate graphs. We have 400 files that we will be uploading via CloverETL, and each needs its own graph so that the files can be run individually.

I have used a shared metadata resource, database connection and SQL query file to make those elements common among all graphs. But I cannot figure out how to make an automatically updated or shared component for the sort/dedup. I have read that dedup has some functionality when the key is not specified, but that it still expects sorted input data. Is there a way to create a shared resource for the sort components? Or is there a sort that works without any specified key and simply sorts on all fields in order?

I think a 'shared resource', or a keyless sort feeding a keyless dedup, would fix my problem, but I can't figure out how to make it work. Thank you for your help!

kubosj
Posts: 372
Joined: Thu Jan 12, 2012 9:10 am

Re: URGENT: remove duplicate records and sorting in DEDUP??

Postby kubosj » Thu Jun 27, 2013 10:55 am

Hi Tom,

I would recommend using parameters in workspace.prm together with ${} parameter substitution. That will allow you to control the sort and dedup keys from one place. Most component parameters can be passed this way. Please see the attached sample.

In workspace.prm please define parameters like:

SORT_KEY=field1(a);field2(a)
DEDUP_KEY=field1(a)
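
As a rough illustration of how this kind of substitution works, here is a Python sketch that reads the two parameters above and resolves ${} placeholders in component attributes. It only mimics the mechanism and is not CloverETL's implementation. The attribute name `sortKey` and the helper `load_params` are assumptions for the example; `dedupKey` is the attribute mentioned earlier in this thread.

```python
from string import Template

# Contents of workspace.prm as suggested above.
prm_text = """\
SORT_KEY=field1(a);field2(a)
DEDUP_KEY=field1(a)
"""


def load_params(text):
    """Parse NAME=value lines into a dict, skipping blank lines and # comments."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        params[name.strip()] = value.strip()
    return params


params = load_params(prm_text)

# Hypothetical component attributes that refer to the shared parameters.
attributes = {"sortKey": "${SORT_KEY}", "dedupKey": "${DEDUP_KEY}"}

# Replace each ${NAME} placeholder with the value from workspace.prm.
resolved = {
    name: Template(value).substitute(params) for name, value in attributes.items()
}

print(resolved)
```

With every graph pointing at the same parameter file, changing SORT_KEY or DEDUP_KEY in one place changes the key in all of them.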

Attachments: test_parametrization.grf (2.62 KiB)

Jaroslav Kubos
CloverCARE Support
CloverETL | Rapid Data Integration

Visit us online at http://www.cloveretl.com