Notes
CH 1: Introduction
- `map` and `reduce` can be combined to execute "transform and consolidate" workflows: `map` is a one-to-one transformation; `reduce` is a one-to-any transformation (assembling/consolidating).
- Spotify has ~10 PB of MP3s, which would take 20,000 years to play.
- Hadoop distributes both data storage and processing.
- Hadoop provides a layer of abstraction on top of a distributed file system (DFS) that allows us to run highly parallel MapReduce (MR) jobs.
- Spark does more of its work in memory rather than by writing to files.
- AWS Elastic MapReduce (EMR): AWS's approach to managed distributed computing.
CH 2: Map and Parallel Computing
- `map` is a powerful lazy function: the instructions for evaluating the function are saved and run only when we ask for a value (see the sketch below).
- Python's `map` converts a sequence of inputs into instructions for computing the outputs.
- Get the number of CPUs:
```python
import os
print(os.cpu_count())
```
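A minimal sketch of that laziness; `double` is a made-up helper:

```python
def double(x):
    print(f"doubling {x}")
    return x * 2

# In Python 3, map returns a lazy map object; no work happens yet
lazy = map(double, [1, 2, 3])
print(lazy)  # <map object at 0x...> -- "doubling" has not printed

# Values are computed only when we ask for them
print(list(lazy))  # prints "doubling 1" ... then [2, 4, 6]
```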
2.1 Pickling
- Pickling is Python's version of object serialization or marshalling.
- It allows moving code objects (functions + data) across machines.
- Parallel work follows a pickle-work-pickle approach: arguments are pickled and sent to workers, and results are pickled and sent back.
- The `multiprocessing` module cannot pickle:
  1. lambda functions
  2. nested functions
  3. nested classes
- Use a third-party library (like `pathos`) for those cases; a sketch follows below.
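A minimal sketch of the limitation and the workaround, assuming `pathos` is installed (`pip install pathos`):

```python
import multiprocessing
from pathos.multiprocessing import ProcessingPool

if __name__ == "__main__":
    # The standard library serializes tasks with pickle, which rejects lambdas
    try:
        with multiprocessing.Pool(2) as pool:
            pool.map(lambda x: x * 2, range(4))
    except Exception as exc:
        print(f"multiprocessing failed: {exc}")

    # pathos serializes with dill instead of pickle, so lambdas work
    pool = ProcessingPool(2)
    print(pool.map(lambda x: x * 2, range(4)))  # [0, 2, 4, 6]
```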
2.2 Order and Parallelization
CH 3: Function Pipelines
- Helper functions and function chains.
- Use `compose` (which applies its functions in reverse order, right to left) and `pipe` from `toolz` (e.g., to decode hacker messages); a sketch follows below.
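A minimal sketch of both helpers, with made-up string transformations standing in for the decoding steps:

```python
from toolz import compose, pipe

def strip_spaces(s):
    return s.replace(" ", "")

def to_lower(s):
    return s.lower()

def reverse(s):
    return s[::-1]

# compose applies right-to-left: reverse runs first, strip_spaces last
decode = compose(strip_spaces, to_lower, reverse)
print(decode("SEGA SSEM"))  # "messages"

# pipe threads a value through the functions left-to-right
print(pipe("SEGA SSEM", reverse, to_lower, strip_spaces))  # "messages"
```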
CH 4: Lazy Workflows
- Lazy and Eager evaluation
- Some lazy functions:
  - `filter`
  - `zip`
  - `iglob`: a lazy way to query filesystems
- Other important functions: `filterfalse`, `keyfilter`, `valfilter`, and `itemfilter` (a sketch follows below).
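A minimal sketch of those filtering helpers, assuming `filterfalse` comes from `itertools` and the dict filters from `toolz`:

```python
from itertools import filterfalse
from toolz import keyfilter, valfilter, itemfilter

# filterfalse is lazy: it keeps the items the predicate rejects
odds = filterfalse(lambda x: x % 2 == 0, range(10))
print(list(odds))  # [1, 3, 5, 7, 9]

prices = {"apple": 3, "banana": 1, "cherry": 7}
print(keyfilter(lambda k: k.startswith("a"), prices))   # {'apple': 3}
print(valfilter(lambda v: v > 2, prices))               # {'apple': 3, 'cherry': 7}
print(itemfilter(lambda kv: kv[1] < 5, prices))         # {'apple': 3, 'banana': 1}
```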
4.1 Iterators
- Iterators can move through a sequence; generators generate sequences.
- We can loop over the items of an iterator, or we can map a function across one.
- The iterator protocol is defined by the `__iter__()` method, which returns an object with a `__next__()` method (sketched below).
- All our lazy friends are one-way streets: once we call `next`, the item returned to us is removed from the sequence.
- Iterators are not for by-hand inspection; they are meant for processing big data. They use less memory and offer better performance.
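A minimal sketch of the protocol, using a hypothetical `Countdown` iterator:

```python
class Countdown:
    """Iterator that counts down from n to 1."""
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n + 1

it = Countdown(3)
print(next(it))  # 3 -- consumed: this item is gone from the stream
print(list(it))  # [2, 1] -- only what remains
```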
4.2 Generators
- Generators produce sequences without holding them in memory.
- Define a function with a `yield` statement (see the sketch after the snippet below).
- Use generator expressions when possible.
```python
from itertools import count, islice

# Generator expression: an infinite, lazy sequence of even numbers
evens = (i * 2 for i in count())

# Lazily slice out items 5 through 9 of the infinite generator
chunk = islice(evens, 5, 10)
print(list(chunk))  # [10, 12, 14, 16, 18]
```
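The same stream written as a generator function with `yield`, a minimal sketch:

```python
from itertools import islice

def evens_gen():
    """Yield even numbers forever, one at a time."""
    i = 0
    while True:
        yield i * 2
        i += 1

print(list(islice(evens_gen(), 5, 10)))  # [10, 12, 14, 16, 18]
```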
4.3 Lazily Processing Large Datasets
- Example problem: finding the author of a poem (the general lazy pattern is sketched below).
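The details of the poem problem aren't captured in these notes, but a minimal sketch of the lazy pattern it relies on, with a hypothetical `poems/` directory:

```python
from glob import iglob

def read_lines(paths):
    """Lazily yield every line from every file, one at a time."""
    for path in paths:
        with open(path) as f:
            yield from f

# iglob queries the filesystem lazily; nothing is read until we iterate
paths = iglob("poems/*.txt")                        # hypothetical directory
lines = read_lines(paths)                           # still lazy
long_lines = filter(lambda l: len(l) > 40, lines)   # still lazy

# Only now do files get opened and read, one line at a time
for line in long_lines:
    print(line.rstrip())
```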