On 15Apr2012 14:30, Amadeus W.M. amadeus84@verizon.net wrote:
| > Look at this (completely untested) loop:
| >
| > # a little setup
| > cmd=`basename "$0"`
| > : ${TMPDIR:=/tmp}
| > tmppfx=$TMPDIR/$cmd.$$
| >
| > i=0
| > while read -r url
| > do
| >   i=$((i+1))
| >   out=$tmppfx.$i
| >   if curl -s "$url" >"$out"
| >   then echo "$out"
| >   else echo "$cmd: curl fails on: $url" >&2 fi &
| > done < myURLs \
| > | while read -r out
| >   do
| >     cat "$out"
| >     rm "$out"
| >   done \
| > | tee all-data.out \
| > | your-data-parsing-program
|
| I understand the script, although I haven't tested it either. My take on
| it:
| + it solves the problem of curls overwriting (I think)
| + the data parsing and tracking is done on the combined curls
Yes.
| - it retrieves the urls serially, not in parallel
No, in parallel. There is an "&" after the "fi" in the if.
It looks like the "fi &" got sucked onto the end of an echo statement. It should be on its own line.
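For clarity, that part of the loop body was meant to read like this (still
untested):

    if curl -s "$url" >"$out"
    then  echo "$out"
    else  echo "$cmd: curl fails on: $url" >&2
    fi &

With the "fi &" on its own line each if/curl runs as a background job, and
only the successful fetches emit their filenames down the pipe; the failures
go to stderr.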
| - it writes them to disk
Just long enough to be read and catted, then removed.
| - it re-reads them from disk, hence some disk activity, although
|   probably insignificant relative to the download time.
Should be, yes.
| The way I'm doing it now is this: I do the retrieval and the parsing and
| tracking all within a single program. For each url I create a separate
| thread from which I call curl and get its output, then parse.
| Like this:
|
| // inside each thread: [... popen(curl...) ...]
| // when threads done, analyze the combined info.
|
| This works, but I would have liked a more modular solution. I want the url
| retrieval to be a separate, standalone entity and the parsing and
| tracking another entity (possibly two entities). Hence, what I want is
|
| - in a shell
| - download in parallel
| - merge curl outputs
My above loop tries to do that. The curls do run in parallel.
| then pipe into the parser/tracker. Parsing can be done per url, but
| tracking MUST be across urls.
That should work; your parser comes at the end of the pipeline.
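Putting it all together with the "fi &" unmangled, the whole thing would look
like this (still untested; "myURLs" and "your-data-parsing-program" are just
placeholders for whatever you actually use):

    # a little setup
    cmd=`basename "$0"`
    : ${TMPDIR:=/tmp}
    tmppfx=$TMPDIR/$cmd.$$

    i=0
    while read -r url
    do
      i=$((i+1))
      out=$tmppfx.$i
      # background each fetch; the filename is only printed
      # once the download has completed
      if curl -s "$url" >"$out"
      then  echo "$out"
      else  echo "$cmd: curl fails on: $url" >&2
      fi &
    done < myURLs \
    | while read -r out
      do
        # merge the outputs in completion order, then discard the temp file
        cat "$out"
        rm "$out"
      done \
    | tee all-data.out \
    | your-data-parsing-program

Because the merged data arrive as a single stream, the parser can do its
per-url parsing and its cross-url tracking in the one place, and the tee
keeps a copy of the combined data in all-data.out should you want to rerun
the parser later.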
Cheers,