I've spent some time with the Hortonworks sandbox and reading about the Hadoop project. It's fun and you can learn a lot quickly, but there is plenty of confusion when you face Hadoop for the first time: too many projects, concepts, and topics. The bad news is that you cannot easily use it in the real world unless you actually work with big data. But don't worry: if you want to explore and learn about Hadoop, you can try the Hortonworks virtual machine, read books, articles, and more. Trying this technology on your own computer is the first step to understanding how Hadoop works. Before starting, we have to look at the infrastructure and get an overview of the main projects.
From hardware to software
HDFS is the part of Hadoop closest to the hardware infrastructure. The following list covers the most important terms and concepts you must know to understand the Hadoop architecture.
- HDFS, Hadoop Distributed File System
- MapReduce, the programming model used to process and filter data in parallel
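To make the MapReduce idea concrete, here is a small plain-Python sketch of the classic word-count job. It only simulates the map and reduce phases in a single process; it is not the actual Hadoop API, where you would instead write Mapper and Reducer classes in Java and the framework would distribute the work across the cluster.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce step: group the pairs by key and sum the counts.
    In real Hadoop, the shuffle/sort between map and reduce does the grouping."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

# Toy input standing in for files stored on HDFS
lines = [
    "Hadoop stores data in HDFS",
    "MapReduce processes data in parallel",
]
counts = dict(reduce_phase(map_phase(lines)))
print(counts["data"])  # "data" appears once in each line, so this prints 2
```

The key point the sketch shows is the shape of the computation: the map step emits independent key/value pairs, so it can run on many machines at once, and the reduce step only needs all values for a given key to end up together.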
The software projects are many, and they can seem very complex
Hadoop is composed of a set of projects, each of which can relate to the others. Let's look at the main projects and their roles in a Hadoop architecture.
Other related projects
- Mahout, a project about machine learning
- Ambari, Ganglia, and Nagios, projects for cluster management and monitoring, not only for network monitoring
- Cascading, an alternative API to Hadoop MapReduce
- Oozie, a workflow scheduler for Hadoop jobs