S3 and Wildcards

Support/help with CloverETL implementation problems

DTaylor
Posts: 29
Joined: Tue Jan 22, 2013 6:41 pm

Post by DTaylor » Tue Dec 06, 2016 7:56 pm

I have recently had trouble with the UniversalDataReader (and its successor, the FlatFileReader) when using the S3 protocol with the * wildcard. S3 appears to treat the wildcard as a literal '*' and returns no sources, while the ETL component assumes that no files match the pattern and finishes successfully. We've worked around this via other protocols, but I thought I would post it here in case anyone else is having this issue with the S3 protocol.

vazquezrosariop
Posts: 111
Joined: Mon Feb 29, 2016 5:33 pm

Re: S3 and Wildcards

Post by vazquezrosariop » Wed Dec 07, 2016 5:35 pm

Hi DTaylor

Could you please provide us with more information about the behavior you are seeing? Also, if possible, could you post an anonymized example graph? We would like to better understand what might be causing the issue.
---
Pedro Vazquez Rosario
CloverCARE Support
CloverETL | Rapid Data Integration

Visit us online at http://www.cloveretl.com

Re: S3 and Wildcards

Post by DTaylor » Wed Feb 01, 2017 5:34 pm

Sorry for the delay. I'm unfortunately unable to post an anonymized ETL at this time.

We were accessing a bucket using the S3 protocol with the UniversalDataReader. The ETL needed to read the contents of all files that matched a particular pattern. Previously, we had been using the * character as a part of the pattern and it had matched the files appropriately. At the time of posting, S3 had started regarding the * character as a literal rather than a wildcard, so none of the filenames matched the pattern that I had set up. So, for example, rather than finding all files that matched the pattern 'ABC123*.txt', the S3 protocol started telling UniversalDataReader that there were no files with the name 'ABC123*.txt'. Since UniversalDataReader recognized the wildcard even though S3 did not, the ETL did not error and we had a process that was quietly failing.

We managed to work around this by accessing the bucket via HTTPS, but that's not ideal when there is a dedicated protocol.
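
The silent failure described in this post can be illustrated outside CloverETL. The following is a minimal Python sketch, not CloverETL's implementation: it contrasts a literal name lookup with client-side wildcard expansion over a list of object keys, and it raises an error when nothing matches so the job cannot finish "successfully" with zero inputs. The key names and the helper functions `expand_pattern` and `literal_lookup` are hypothetical and exist only for illustration.

```python
import fnmatch

def literal_lookup(keys, name):
    """Treat the name literally, the way the post describes S3 behaving."""
    return [k for k in keys if k == name]

def expand_pattern(keys, pattern):
    """Expand a shell-style pattern against known keys; fail loudly if empty."""
    matches = [k for k in keys if fnmatch.fnmatchcase(k, pattern)]
    if not matches:
        # Raising here turns a quiet no-op into a visible failure.
        raise FileNotFoundError(f"no objects match pattern {pattern!r}")
    return matches

# Hypothetical object keys standing in for a bucket listing.
keys = ["ABC123_jan.txt", "ABC123_feb.txt", "XYZ999.txt"]

wildcard_hits = expand_pattern(keys, "ABC123*.txt")  # the two ABC123 files
literal_hits = literal_lookup(keys, "ABC123*.txt")   # empty: no key has that exact name
```

A check of this kind (failing when an expected pattern yields no sources) is one way to keep such a process from failing quietly, whichever protocol is used.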

svecp
Posts: 25
Joined: Wed Nov 09, 2016 11:51 pm
Location: 2111 Wilson Blvd., Arlington VA 22201

Re: S3 and Wildcards

Post by svecp » Thu Feb 02, 2017 3:00 pm

If you have a Corporate Server, you can use the ListFiles component to list all available files in an S3 bucket and feed those into UniversalDataReader. Since version 4.2.0, we have been using the official Amazon SDK to access S3. Did you encounter this change after upgrading to a later version? More details are in https://bug.javlin.eu/browse/CLO-7170

Attachment: overview.png (23.78 KiB)
--
Pavel Švec | CloverETL | Sales Engineer | 2111 Wilson Blvd | Suite 320 | Arlington, VA 22201

Re: S3 and Wildcards

Post by DTaylor » Thu Feb 16, 2017 9:10 pm

We encountered this issue on 4.3. We do not have Corporate Server unfortunately, so ListFiles was not an option.

Re: S3 and Wildcards

Post by svecp » Fri Feb 24, 2017 3:28 pm

I just tried this on the new 4.5.0-M2 and was told that the algorithm changed in 4.4.0. Would it be possible for you to try a later version?

s3://***:***@s3.amazonaws.com/cloveretl.svecp/Monitored/cust*.dat - works
s3://***:***@s3.amazonaws.com/cloveretl.svecp/Monitored/* - works
s3://***:***@s3.amazonaws.com/* - does not work, because of insufficient privileges
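
To make the difference between these three URLs easier to see, here is a small Python sketch that splits such a URL into a bucket, the literal key prefix before the first `*`, and the key pattern. It assumes the path-style layout the URLs above suggest (bucket as the first path segment); the function name `split_s3_pattern` and the `KEY:SECRET` placeholder credentials are hypothetical, and this is not how CloverETL itself parses the URL.

```python
from urllib.parse import urlsplit

def split_s3_pattern(url):
    """Split a path-style S3 URL into (bucket, literal_prefix, key_pattern)."""
    path = urlsplit(url).path.lstrip("/")
    # Assumption: the first path segment is the bucket name.
    bucket, _, key_pattern = path.partition("/")
    star = key_pattern.find("*")
    # Everything before the first wildcard can be used as a listing prefix.
    prefix = key_pattern if star == -1 else key_pattern[:star]
    return bucket, prefix, key_pattern

base = "s3://KEY:SECRET@s3.amazonaws.com"
by_name = split_s3_pattern(base + "/cloveretl.svecp/Monitored/cust*.dat")
by_folder = split_s3_pattern(base + "/cloveretl.svecp/Monitored/*")
top_level = split_s3_pattern(base + "/*")
```

In the first two cases the wildcard sits inside the object key, so only the objects under one bucket need to be listed. In the third case the wildcard lands in the bucket position, so resolving it would presumably require permission to list buckets rather than objects. That is consistent with the privilege error above, but it is an inference and not something confirmed in this thread.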