NYC Open Data: Streaming Python on Hadoop
Posted by Vivian Zhang
Updated: Sep 19, 2015
Our upcoming 12-week Data Science bootcamp starts on January 11, 2016. Apply now to secure a spot in our winter cohort!
In the meantime, come join the NYC Open Data Meetup Group and learn how easily you can use Hadoop for Machine Learning.
If you are hiring Data Scientists, call us at +1-888-752-7585 (USA) or reach us at info@nycdatascience.com to share your openings and set up interviews with our excellent students.
What is an NYC Open Data Meetup event like? Here’s an example:
This summer, Sam Kamin (Vice President of Engineering, NYC Data Science Academy) gave a teaser version of NYC Data Science Academy’s five-week Hadoop course. (The current version on offer is a six-week evening program, “Big Data with Hadoop and Spark.”) See below for a class syllabus from the five-week Hadoop course.
First, though, here are the slides from the Meetup:
(Slides can also be accessed here through SlideShare.)
This five-week course is an intensive, hands-on introduction to the Hadoop ecosystem of Big Data technologies. The emphasis is on learning several of the major components of Apache Hadoop (HDFS, MapReduce, Hive, Pig, and Streaming) by doing exercises of increasing complexity. Programming will be done in Python. Students are expected to be comfortable using an operating system from the command line. Knowledge of Python is helpful; the material in Learn Python the Hard Way is sufficient background. The course format is mixed lecture and lab. Students will need to bring their own laptops to connect to our server; instructions for installing any required software will be provided ahead of time.
SYLLABUS
Week 1 – Introduction: MapReduce
Overview of Big Data and the Hadoop ecosystem
The concept of MapReduce
HDFS – Hadoop Distributed File System
MapReduce with Python streaming
Week 2 – More on MapReduce
More on Big Data, the Hadoop ecosystem, and MapReduce.
Mixed case studies and exercises using MR with Python streaming
Week 3 – Hive: A database for Big Data
Hive concepts
HiveQL
User-defined functions in the Hive language
User-defined functions in Python (using streaming)
Advanced topic: Hive queries in Python code
Week 4 – Pig: Simplified MapReduce
Basic concepts
Pig Latin
Pig functions and macros
User-defined functions
Week 5 – Project day
The Hadoop ecosystem
Brief intro to Spark
Brief intro to Mahout
Case studies/project ideas
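To give a flavor of the Week 1 topic "MapReduce with Python streaming," here is a minimal word-count sketch. It is not taken from the course materials; the function names (`mapper`, `reducer`, `run_local`) and the sample input are illustrative assumptions. With Hadoop Streaming, a mapper and a reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output, and the framework sorts the mapper output by key before the reducer sees it. The sketch imitates that sort locally so it can run without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts for each word; records must already be sorted by key."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_local(lines):
    """Local stand-in for the Hadoop shuffle: sort mapper output, then reduce."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    # Hypothetical sample input, only for illustration.
    sample = ["big data big ideas", "data on Hadoop"]
    for record in run_local(sample):
        print(record)
```

On a real cluster, the mapper and reducer would typically live in two separate scripts that loop over `sys.stdin` and `print` their records, and those scripts would be handed to the Hadoop Streaming jar through its `-mapper` and `-reducer` options.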
Vivian Zhang
Vivian is the CTO and School Director of NYC Data Science Academy and CTO of SupStat. With her extensive experience working in the data science field, she has developed expertise in multiple programming languages and tools, including R, Python, Hadoop, and Spark. In August 2016, Forbes ranked her as one of the nine women leading the pack in data analytics. In 2013, she created the NYC Open Data Meetup group, which stands as one of the largest data science communities offering meetups, conferences, and a weekly newsletter. In her spare time, Vivian enjoys meeting people and sharing her motivational stories with our students and other professionals.