I'm just getting started with python. Curious what I need to change to properly filter out only strings starting with "200"
script:

#!/usr/bin/python

directories = ['20080412', '20080324', 'blahblah', 'latest-dir', 'rawhide-20080410', 'rawhide-20080411', 'rawhide-20080412', '20080401']
print 'directories == %s' % directories
for directory in directories:
    print 'processing directory %s ' % directory
    if not directory.startswith('200'):
        directories.remove(directory)
        print 'removed == %s' % directory
print 'directories == %s' % directories
~~~~~~~~~~~~~~~~~~~~~~~~~
output:
$ python dir-filter.py
directories == ['20080412', '20080324', 'blahblah', 'latest-dir', 'rawhide-20080410', 'rawhide-20080411', 'rawhide-20080412', '20080401']
processing directory 20080412
type == <type 'str'>
processing directory 20080324
type == <type 'str'>
processing directory blahblah
type == <type 'str'>
removed == blahblah
processing directory rawhide-20080410
type == <type 'str'>
removed == rawhide-20080410
processing directory rawhide-20080412
type == <type 'str'>
removed == rawhide-20080412
directories == ['20080412', '20080324', 'latest-dir', 'rawhide-20080411', '20080401']
~~~~~~~~~~~~~~~~~~~~
why do 'rawhide-20080411' and 'latest-dir' remain?
Thanks, John
On Sat, Apr 12, 2008 at 01:35:33PM -0700, John Poelstra wrote:
> directories = ['20080412', '20080324', 'blahblah', 'latest-dir', 'rawhide-20080410', 'rawhide-20080411', 'rawhide-20080412', '20080401']
> print 'directories == %s' % directories
> for directory in directories:
>     print 'processing directory %s ' % directory
>     if not directory.startswith('200'):
>         directories.remove(directory)
>         print 'removed == %s' % directory
> print 'directories == %s' % directories
Modifying the contents of a list that you're iterating over with a generator will give you strange results. Basically, you're causing the generator to lose its place. If you put the result in a new list (say, directories_new), this won't happen.
That said, I'd probably do it this way, going over the list just once with a list comprehension:
>>> [d for d in directories if d.startswith('200')]
['20080412', '20080324', '20080401']
Since it builds a new list, you can even assign it back to the directories variable:
>>> directories = [d for d in directories if d.startswith('200')]
>>> directories
['20080412', '20080324', '20080401']
I'm not a huge fan of list comprehensions due to the somewhat baroque syntax, but this seems like a perfect single-pass O(n) application of the construct that is still easy to read.
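One detail of "assign it back" may be worth a small sketch: the assignment rebinds the name directories to a new list, so any other reference to the original list still sees the unfiltered contents. The snippet below uses Python 3 print() syntax and a shortened sample list; the name alias is hypothetical and only stands in for "some other reference to the same list".

```python
directories = ['20080412', 'blahblah', 'latest-dir', '20080401']
alias = directories  # hypothetical second reference to the same list object

# Rebinding: 'directories' now names a new list; 'alias' still sees the old one.
directories = [d for d in directories if d.startswith('200')]
print(directories)
print(alias)

# Slice assignment replaces the contents of the existing list object instead,
# so every reference to that object sees the filtered result.
alias[:] = [d for d in alias if d.startswith('200')]
print(alias)
```

If nothing else holds a reference to the original list, plain reassignment as shown in the message above is the simpler choice.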
On Sat, Apr 12, 2008 at 01:47:25PM -0700, Kyle VanderBeek wrote:
> Modifying the contents of a list that you're iterating over with a generator will give you strange results. Basically, you're causing the generator to lose its place. If you put the result in a new list (say, directories_new), this won't happen.
I misspoke on a minor detail: the list object doesn't use a generator; it's just an iterable in the usual sense. Sorry for the confusion.
My technique still holds. Modifying something you're iterating over will give you unexpected results.
On Sat, 2008-04-12 at 13:35 -0700, John Poelstra wrote:
> I'm just getting started with python. Curious what I need to change to properly filter out only strings starting with "200"
> script:
> #!/usr/bin/python
> directories = ['20080412', '20080324', 'blahblah', 'latest-dir', 'rawhide-20080410', 'rawhide-20080411', 'rawhide-20080412', '20080401']
> print 'directories == %s' % directories
> for directory in directories:
>     print 'processing directory %s ' % directory
>     if not directory.startswith('200'):
>         directories.remove(directory)
>         print 'removed == %s' % directory
> print 'directories == %s' % directories
> output:
> $ python dir-filter.py
> directories == ['20080412', '20080324', 'blahblah', 'latest-dir', 'rawhide-20080410', 'rawhide-20080411', 'rawhide-20080412', '20080401']
> processing directory 20080412
> type == <type 'str'>
> processing directory 20080324
> type == <type 'str'>
> processing directory blahblah
> type == <type 'str'>
> removed == blahblah
> processing directory rawhide-20080410
> type == <type 'str'>
> removed == rawhide-20080410
> processing directory rawhide-20080412
> type == <type 'str'>
> removed == rawhide-20080412
> directories == ['20080412', '20080324', 'latest-dir', 'rawhide-20080411', '20080401']
> ~~~~~~~~~~~~~~~~~~~~
> why do 'rawhide-20080411' and 'latest-dir' remain?
Removing items from a list you're looping over shifts the indexes of the remaining items in place,
so you'll end up skipping some items because the loop steps past them.
-sv
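To make the skipping concrete, here is a minimal sketch that records which index the loop is at each time it visits an item. It uses the list from the original script but Python 3 print() syntax; the visited list is a helper added purely for illustration.

```python
directories = ['20080412', '20080324', 'blahblah', 'latest-dir',
               'rawhide-20080410', 'rawhide-20080411',
               'rawhide-20080412', '20080401']

visited = []  # illustration only: (loop position, item seen at that position)
for index, directory in enumerate(directories):
    visited.append((index, directory))
    if not directory.startswith('200'):
        # Everything after this slot shifts left by one, but the loop's
        # position still advances, so the item that slid into this slot
        # is never examined.
        directories.remove(directory)

for index, directory in visited:
    print(index, directory)
print(directories)
```

The loop never visits 'latest-dir' or 'rawhide-20080411': each one slides into the slot of an item that was just removed, and the loop then moves on to the next position.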
On Sat, 2008-04-12 at 13:35 -0700, John Poelstra wrote:
> I'm just getting started with python. Curious what I need to change to properly filter out only strings starting with "200"
> script:
> #!/usr/bin/python
> directories = ['20080412', '20080324', 'blahblah', 'latest-dir', 'rawhide-20080410', 'rawhide-20080411', 'rawhide-20080412', '20080401']
> print 'directories == %s' % directories
> for directory in directories:
>     print 'processing directory %s ' % directory
>     if not directory.startswith('200'):
>         directories.remove(directory)
>         print 'removed == %s' % directory
> print 'directories == %s' % directories
You can't safely alter things that you are iterating over; the easiest fix is:
for directory in directories[:]:
...the others being to create a new list of just what you want, or a list of what needs to go and then do the .remove() calls on that (these methods can be worth it, for large lists).
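As a rough sketch of two of the options mentioned here (iterating over a copy, and collecting what needs to go before calling .remove()), the following uses Python 3 syntax. The function names and the sample list are hypothetical and chosen only for this illustration.

```python
def remove_via_copy(directories):
    """Iterate over a shallow copy while removing from the original list."""
    for directory in directories[:]:
        if not directory.startswith('200'):
            directories.remove(directory)
    return directories


def remove_via_kill_list(directories):
    """Collect the unwanted items first, then remove them in a second pass."""
    unwanted = [d for d in directories if not d.startswith('200')]
    for directory in unwanted:
        directories.remove(directory)
    return directories


sample = ['20080412', '20080324', 'blahblah', 'latest-dir',
          'rawhide-20080410', 'rawhide-20080411',
          'rawhide-20080412', '20080401']

# list(sample) hands each function its own copy to modify in place.
print(remove_via_copy(list(sample)))
print(remove_via_kill_list(list(sample)))
```

Both variants modify the list they are given in place, which matters if other code holds a reference to that same list.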
On Sat, Apr 12, 2008 at 04:52:06PM -0400, James Antill wrote:
> You can't safely alter things that you are iterating over; the easiest fix is:
> for directory in directories[:]:
> ...the others being to create a new list of just what you want, or a list of what needs to go and then do the .remove() calls on that (these methods can be worth it, for large lists).
Actually, I'd contend this technique gets worse as your list size increases. First, you're making a copy pass, doubling your memory footprint, and then a second pass to actually filter the list down to just the elements you want. That will get slower as your list size increases.
Oh, and that reminds me of another way, using the builtin filter():
>>> directories = ['20080412', '20080324', 'blahblah', 'latest-dir',
...     'rawhide-20080410', 'rawhide-20080411', 'rawhide-20080412', '20080401']
>>> filter(lambda x: x.startswith('200'), directories)
['20080412', '20080324', '20080401']
Kyle VanderBeek wrote:
> On Sat, Apr 12, 2008 at 04:52:06PM -0400, James Antill wrote:
> > You can't safely alter things that you are iterating over; the easiest fix is:
> > for directory in directories[:]:
> > ...the others being to create a new list of just what you want, or a list of what needs to go and then do the .remove() calls on that (these methods can be worth it, for large lists).
> Actually, I'd contend this technique gets worse as your list size increases. First, you're making a copy pass, doubling your memory footprint, and then a second pass to actually filter the list down to just the elements you want. That will get slower as your list size increases.
> Oh, and that reminds me of another way, using the builtin filter():
> >>> directories = ['20080412', '20080324', 'blahblah', 'latest-dir',
> ...     'rawhide-20080410', 'rawhide-20080411', 'rawhide-20080412', '20080401']
> >>> filter(lambda x: x.startswith('200'), directories)
> ['20080412', '20080324', '20080401']
Just a note: This will be much slower than the list comprehension for large lists.
That's because function calls have tremendous overhead in python and a lambda is just a function without a name. So using filter you make two function calls for every entry in the list:
- lambda [...]
- str.startswith()
With a list comprehension you only call str.startswith().
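One way to check this claim on a given machine is a quick timeit comparison. This is only a sketch in Python 3 syntax: the list is the thread's sample repeated 1000 times, the repetition count is arbitrary, and the function names are hypothetical. Note that on Python 3 filter() returns a lazy iterator, so list() is needed to get the same result as the Python 2 session quoted above.

```python
import timeit

directories = ['20080412', '20080324', 'blahblah', 'latest-dir',
               'rawhide-20080410', 'rawhide-20080411',
               'rawhide-20080412', '20080401'] * 1000


def with_filter():
    # Two calls per item: the lambda itself, then str.startswith() inside it.
    return list(filter(lambda x: x.startswith('200'), directories))


def with_comprehension():
    # One call per item: str.startswith().
    return [d for d in directories if d.startswith('200')]


# Both approaches should select exactly the same items.
assert with_filter() == with_comprehension()

print('filter + lambda:   ', timeit.timeit(with_filter, number=100))
print('list comprehension:', timeit.timeit(with_comprehension, number=100))
```

The absolute numbers depend on the interpreter version and the machine, so treat the output as a local measurement rather than a general result.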
As I tell people, if you want readability, write it out fully, using James Antill's solution or even:
newDirectories = []
for directory in directories:
    if directory.startswith('200'):
        newDirectories.append(directory)
directories = newDirectories
If you want speed and code compactness use a list comprehension.
-Toshio
On Sat, 2008-04-12 at 17:10 -0700, Toshio Kuratomi wrote:
> Kyle VanderBeek wrote:
> > On Sat, Apr 12, 2008 at 04:52:06PM -0400, James Antill wrote:
> > > You can't safely alter things that you are iterating over; the easiest fix is:
> > > for directory in directories[:]:
> > > ...the others being to create a new list of just what you want, or a list of what needs to go and then do the .remove() calls on that (these methods can be worth it, for large lists).
> > Actually, I'd contend this technique gets worse as your list size increases. First, you're making a copy pass, doubling your memory footprint, and then a second pass to actually filter the list down to just the elements you want. That will get slower as your list size increases.
> > Oh, and that reminds me of another way, using the builtin filter():
> > >>> directories = ['20080412', '20080324', 'blahblah', 'latest-dir',
> > ...     'rawhide-20080410', 'rawhide-20080411', 'rawhide-20080412', '20080401']
> > >>> filter(lambda x: x.startswith('200'), directories)
> > ['20080412', '20080324', '20080401']
> Just a note: This will be much slower than the list comprehension for large lists.
> That's because function calls have tremendous overhead in python and a lambda is just a function without a name. So using filter you make two function calls for every entry in the list:
> - lambda [...]
> - str.startswith()
> With a list comprehension you only call str.startswith().
> As I tell people, if you want readability, write it out fully, using James Antill's solution or even:
> newDirectories = []
> for directory in directories:
>     if directory.startswith('200'):
>         newDirectories.append(directory)
> directories = newDirectories
> If you want speed and code compactness use a list comprehension.
And just as another note - you better be doing this several hundred thousand times to justify the complete unreadability of list comprehensions.
-sv
On Sun, 2008-04-13 at 17:05 -0400, seth vidal wrote:
> > newDirectories = []
> > for directory in directories:
> >     if directory.startswith('200'):
> >         newDirectories.append(directory)
> > directories = newDirectories
> > If you want speed and code compactness use a list comprehension.
> And just as another note - you better be doing this several hundred thousand times to justify the complete unreadability of list comprehensions.
directories = [directory for directory in directories if directory.startswith('200') ]
The change in line ordering is unsettling at first. But I think the intent is very clear with a list comprehension. Certainly the phrasing is quite similar.
python-devel@lists.fedoraproject.org