Notes
CH 1: Introduction
- `map` and `reduce` can be combined to execute "transform and consolidate" workflows. `map` is a one-to-one transformation; `reduce` is a one-to-any transformation (assembling/consolidating).
- Spotify has 10 PB of MP3s, which would take 20,000 years to play.
- Hadoop distributes both data storage and processing.
- Hadoop provides a layer of abstraction on top of its distributed file system (HDFS) that allows us to run highly parallel MapReduce (MR) jobs.
- Spark does more of its work in memory instead of by writing to files.
- AWS Elastic MapReduce (EMR): AWS's managed approach to running these workloads in the cloud.
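A minimal illustration of the "transform and consolidate" pattern above, using the built-in `map` and `functools.reduce`:

```python
from functools import reduce

words = ["map", "and", "reduce"]

# Transform: map is one-to-one (each word -> its length)
lengths = map(len, words)

# Consolidate: reduce is many-to-one (lengths -> their sum)
total = reduce(lambda acc, n: acc + n, lengths, 0)
print(total)  # 12
```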
CH 2: Map and // Computing
- `map` is a powerful lazy function: the instructions for evaluating the function are saved, and they run only when we ask for the values.
- In Python, `map` converts a sequence of inputs into instructions for computing.
- Get the number of CPUs:

import os
print(os.cpu_count())
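A small sketch showing that `map` saves instructions rather than computing values up front (the `calls` log is just for illustration):

```python
calls = []

def slow_square(x):
    calls.append(x)  # record each actual evaluation
    return x * x

lazy = map(slow_square, [1, 2, 3])
assert calls == []        # nothing computed yet: map is lazy
assert next(lazy) == 1    # values are computed only on demand
assert calls == [1]
assert list(lazy) == [4, 9]
```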
2.1 Pickling
- Python's version of object serialization or marshalling
- It allows moving code objects (functions + data) across machines.
- The `multiprocessing` module uses a pickle-work-unpickle approach.
- `multiprocessing` does not pickle:
  1. Lambda functions
  2. Nested functions
  3. Nested classes
- Use a third-party library for them (like `pathos`).
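A quick sketch of why lambdas are a problem: `pickle` stores module-level functions by reference (module + name), and a lambda has no importable name to look up.

```python
import pickle

def square(x):
    return x * x

# Module-level functions pickle fine: they are stored by reference
restored = pickle.loads(pickle.dumps(square))
assert restored(3) == 9

# Lambdas cannot be found by name, so pickling them fails
try:
    pickle.dumps(lambda x: x * x)
except (pickle.PicklingError, AttributeError) as e:
    print("cannot pickle:", e)
```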
2.2 Order and //ization
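Assuming this section is about whether a parallel map preserves input order, a sketch using threads (via the stdlib `concurrent.futures`, so it runs anywhere): tasks finish out of order, but `Executor.map` still yields results in input order.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_identity(x):
    # Later inputs finish sooner, so completion order is reversed...
    time.sleep((5 - x) * 0.01)
    return x

with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(slow_identity, range(5)))

# ...but Executor.map yields results in input order regardless.
print(results)  # [0, 1, 2, 3, 4]
```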
CH 3: Function pipelines
- helper functions and function chains
- Use `compose` (write functions in reverse order) and `pipe` from `toolz` (used to decode hacker messages).
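The `toolz` versions are the ones the notes refer to; as a sketch of their behavior, here are minimal plain-Python stand-ins (so this runs without `toolz` installed):

```python
from functools import reduce

def compose(*funcs):
    """Right-to-left composition, like toolz.compose."""
    def composed(x):
        return reduce(lambda acc, f: f(acc), reversed(funcs), x)
    return composed

def pipe(value, *funcs):
    """Thread value through funcs left-to-right, like toolz.pipe."""
    return reduce(lambda acc, f: f(acc), funcs, value)

double = lambda x: x * 2
inc = lambda x: x + 1

# compose applies right-to-left: inc first, then double
assert compose(double, inc)(3) == 8
# pipe applies left-to-right: double first, then inc
assert pipe(3, double, inc) == 7
```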
CH 4: Lazy Workflows
- Lazy and eager evaluation
- Some lazy functions:
  --> `filter`
  --> `zip`
  --> `iglob`: lazy way to query filesystems
- Other important functions: `filterfalse` (from `itertools`), plus `keyfilter`, `valfilter`, and `itemfilter` (from `toolz`).
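A quick sketch of the stdlib lazy functions above (`filter`, `zip`, `filterfalse`; the `toolz` dict filters follow the same idea):

```python
from itertools import filterfalse

is_even = lambda x: x % 2 == 0

evens = filter(is_even, range(10))      # lazy iterator
odds = filterfalse(is_even, range(10))  # lazy complement of filter
pairs = zip("abc", [1, 2, 3])           # also lazy

assert list(evens) == [0, 2, 4, 6, 8]
assert list(odds) == [1, 3, 5, 7, 9]
assert next(pairs) == ("a", 1)          # consumed one item at a time
```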
4.1 Iterators
- Iterators move through a sequence; generators generate sequences.
- we can loop over items of an iterator, or we can map a function across one.
- The iterator protocol is defined by the `__iter__()` method, which returns an object with a `__next__()` method.
- All our lazy friends are one-way streets; once we call `next`, the item returned to us is removed from the sequence.
- Iterators are not for by-hand inspection, they are meant for processing big data. They use less memory and offer better performance.
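A minimal iterator implementing the protocol by hand, which also shows the one-way-street behavior:

```python
class Countdown:
    """Iterator protocol: __iter__ returns an object with __next__."""
    def __init__(self, n):
        self.n = n
    def __iter__(self):
        return self
    def __next__(self):
        if self.n <= 0:
            raise StopIteration  # signals the sequence is exhausted
        self.n -= 1
        return self.n + 1

it = iter(Countdown(3))
assert next(it) == 3       # items are consumed one at a time...
assert list(it) == [2, 1]  # ...and only the remainder is left
```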
4.2 Generators
- A generator generates a sequence without holding it in memory.
- Define a function with a `yield` statement, or use generator expressions where possible.
from itertools import count, islice

# Generator expression: an infinite lazy sequence of evens
evens = (i * 2 for i in count())

# Slice items 5-9 out of the evens generator (islice is also lazy)
chunk = islice(evens, 5, 10)
print(list(chunk))  # [10, 12, 14, 16, 18]
4.3 Lazily processing large datasets
- The "finding the author of a poem" problem.
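The notes only name the problem; as a sketch of the general shape (the `poems/*.txt` layout is hypothetical), a lazy pipeline that reads many files one line at a time, so no file is fully loaded into memory and nothing is read until the pipeline is consumed:

```python
from glob import iglob

def read_lines(paths):
    """Lazily yield lines from many files, one line at a time."""
    for path in paths:
        with open(path) as f:
            yield from f  # only one file is open at any moment

# Hypothetical usage over a poems/ directory:
# lines = read_lines(iglob("poems/*.txt"))           # still lazy
# word_counts = (len(line.split()) for line in lines)  # still lazy
# total = sum(word_counts)                             # work happens here
```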