I need to process a large CSV file with over 100,000 rows, translating some lookup values from a DB before inserting the records into a DB table.
I attached a screenshot to demonstrate. Basically I used:
1. A flat file reader to read in the file
2. A number of 'Lookup Join' components to translate some values from the DB
3. An 'ExtHashJoin' to link up all the data (since the source is not sorted)
4. An insert step to write the records to a DB table
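To make the intent of those steps concrete, here is roughly the logic in plain Python (a minimal sketch, not what the tool runs; the column names and lookup data are made-up placeholders):

```python
import csv
import io

# Placeholder lookup tables that would normally be loaded from the DB
# (this is what the 'Lookup Join' steps translate against).
country_lookup = {"US": 1, "GB": 2}
status_lookup = {"active": 10, "closed": 20}

def translate(row):
    """Replace the raw lookup values in one CSV row with their DB ids."""
    return {
        "name": row["name"],
        "country_id": country_lookup[row["country"]],
        "status_id": status_lookup[row["status"]],
    }

# Stand-in for the flat file reader: a tiny in-memory CSV.
sample = io.StringIO("name,country,status\nalice,US,active\nbob,GB,closed\n")
records = [translate(r) for r in csv.DictReader(sample)]
print(records)
```

Each translated record is independent of the others, which is why I expect this to be streamable.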
It's a pretty simple process and it works fine for a small number of records. But when I try to import the whole file, it simply runs out of memory (increasing the memory allocation doesn't help).
I found that the ExtHashJoin waits until it has joined all records before starting to insert any of them, and that seems to be the problem, since it has to hold every record in memory. I don't really need that behaviour: the records are all independent and could be processed in batches, e.g. 1,000 rows at a time, but I can't figure out how to make it work that way.
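The behaviour I'm after, sketched in plain Python (table name, columns, and lookup data are hypothetical; `sqlite3` stands in for the real DB): stream the CSV, translate each row, and flush to the DB every 1,000 rows, so only one batch is ever held in memory.

```python
import csv
import io
import sqlite3
from itertools import islice

BATCH_SIZE = 1000

def batches(reader, size):
    """Yield lists of up to `size` rows from an iterator of CSV rows."""
    while True:
        chunk = list(islice(reader, size))
        if not chunk:
            return
        yield chunk

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE target (name TEXT, country_id INTEGER)")
country_lookup = {"US": 1, "GB": 2}  # placeholder lookup data

# Stand-in for the 100,000-row file: a fake 2,500-row CSV.
source = io.StringIO(
    "name,country\n" + "".join(f"row{i},US\n" for i in range(2500)))

for chunk in batches(csv.DictReader(source), BATCH_SIZE):
    db.executemany(
        "INSERT INTO target VALUES (?, ?)",
        [(r["name"], country_lookup[r["country"]]) for r in chunk])
    db.commit()  # each batch is committed, then its rows are released

print(db.execute("SELECT COUNT(*) FROM target").fetchone()[0])
```

Memory use here is bounded by the batch size, not the file size, which is the behaviour I can't get out of the ExtHashJoin.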
I also tried setting different phase values, but it still tries to join all the records before inserting the first one.
How can I tell the flat file reader to break the records down and process them in batches?
Please see my attached flow graph.