Analysis of Hadoop HDFS Data Read and Write Process
This article explains the underlying principles of Hadoop HDFS read and write operations, detailing how the client interacts with the NameNode and DataNodes, the roles of FSDataInputStream and FSDataOutputStream, block location retrieval, pipeline replication, and the steps for closing a file.
When learning Hadoop HDFS, many developers wonder how a few lines of code can perform read and write operations on HDFS; this article explains the underlying mechanisms with diagrams from the China University MOOC "Big Data Technology Principles and Applications" course.
1. Principles of Reading Data
1. Open the file. In Java, import the FileSystem class and obtain an instance with FileSystem.get(conf). The configuration is read from files such as core-site.xml and hdfs-site.xml, and when the default file system is HDFS the call returns a DistributedFileSystem object that connects to the distributed file system.
2. Obtain an input stream. The input stream type is FSDataInputStream, which wraps a DFSInputStream. Client code works with FSDataInputStream, while DFSInputStream performs the actual communication with the NameNode and DataNodes.
3. Get block information. Through the wrapped DFSInputStream, the client calls ClientProtocol.getBlockLocations() to ask the NameNode which DataNodes store the required blocks; the NameNode returns the block locations for the first part of the file.
4. Read request. With the block locations in hand, the client calls the read function. The NameNode returns the DataNodes for each block sorted by proximity to the client, so the stream connects to the nearest DataNode that holds the block.
5. Read data. Data is streamed from that DataNode to the client. When the end of the block is reached, the stream closes its connection to the DataNode.
6. Retrieve next block information. If more blocks are needed, the client again calls ClientProtocol.getBlockLocations() to obtain the locations of the next set of blocks from the NameNode.
7. Continue reading. The client repeats the read process until all blocks are read.
8. Close the file. Finally, the client calls FSDataInputStream.close() to finish the read operation.
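The read steps above can be sketched as a small, self-contained toy model. This is not Hadoop code: the names HdfsReadSketch, ToyNameNode, Replica, and sampleFile are hypothetical, block contents are plain strings, and "distance" is just an integer standing in for network proximity. It only illustrates the control flow: ask for a batch of block locations, read each block from the nearest replica, then ask for the next batch until none remain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy model of the read path described above. NOT Hadoop code:
// every class and method name here is hypothetical.
class HdfsReadSketch {

    // One replica of a block: the DataNode holding it, that node's distance
    // from the client, and the block contents (a String for simplicity).
    static final class Replica {
        final String dataNode;
        final int distance;
        final String data;

        Replica(String dataNode, int distance, String data) {
            this.dataNode = dataNode;
            this.distance = distance;
            this.data = data;
        }
    }

    // Stand-in for the NameNode: knows the replicas of every block of one file.
    static final class ToyNameNode {
        private final List<List<Replica>> blocks;
        int calls = 0; // how many location requests the client made

        ToyNameNode(List<List<Replica>> blocks) {
            this.blocks = blocks;
        }

        // Returns locations for up to batchSize blocks starting at firstBlock,
        // with each block's replicas sorted nearest-first.
        List<List<Replica>> getBlockLocations(int firstBlock, int batchSize) {
            calls++;
            List<List<Replica>> batch = new ArrayList<>();
            int end = Math.min(blocks.size(), firstBlock + batchSize);
            for (int i = firstBlock; i < end; i++) {
                List<Replica> sorted = new ArrayList<>(blocks.get(i));
                sorted.sort(Comparator.comparingInt(r -> r.distance));
                batch.add(sorted);
            }
            return batch;
        }
    }

    // Client side: fetch locations in batches and read each block from the
    // nearest replica. The chosen DataNode names are recorded in 'visited'.
    static String readFile(ToyNameNode nameNode, int batchSize, List<String> visited) {
        StringBuilder content = new StringBuilder();
        int nextBlock = 0;
        while (true) {
            List<List<Replica>> batch = nameNode.getBlockLocations(nextBlock, batchSize);
            if (batch.isEmpty()) {
                break; // no more blocks: the file has been read completely
            }
            for (List<Replica> replicas : batch) {
                Replica nearest = replicas.get(0);
                visited.add(nearest.dataNode);
                content.append(nearest.data);
            }
            nextBlock += batch.size();
        }
        return content.toString();
    }

    // A made-up three-block file with two replicas per block.
    static ToyNameNode sampleFile() {
        return new ToyNameNode(Arrays.asList(
            Arrays.asList(new Replica("dn3", 4, "Hello, "), new Replica("dn1", 0, "Hello, ")),
            Arrays.asList(new Replica("dn2", 2, "HDFS"), new Replica("dn3", 4, "HDFS")),
            Arrays.asList(new Replica("dn1", 0, "!"), new Replica("dn2", 2, "!"))));
    }

    public static void main(String[] args) {
        List<String> visited = new ArrayList<>();
        String text = readFile(sampleFile(), 2, visited);
        System.out.println(text + " read from " + visited);
    }
}
```

With a batch size of 2, the toy client makes three location requests for the three-block sample: one for blocks 0 and 1, one for block 2, and a final one that comes back empty and ends the loop.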
2. Principles of Writing Data
1. Create file request. Similar to reading, obtain a FileSystem instance via FileSystem.get(conf), which creates a DistributedFileSystem object.
2. Use an output stream. The output stream type is FSDataOutputStream, which wraps a DFSOutputStream. Client code works with FSDataOutputStream, while DFSOutputStream performs the actual communication with the NameNode.
3. Create file metadata. DFSOutputStream makes an RPC call to the NameNode to create a new file in the namespace. The NameNode first checks that the file does not already exist and that the client has permission to create it.
4. Write data. The data is split into small packets, which are placed in an internal queue of DFSOutputStream. The client then asks the NameNode to allocate DataNodes to store the replicas of the block these packets belong to.
5. Pipeline replication. The NameNode returns a list of DataNodes; the client sends each packet to the first DataNode, which forwards it to the next, forming a pipeline that writes replicas to multiple DataNodes.
6. Receive acknowledgment. Acknowledgments travel back through the pipeline in the reverse direction, from the last DataNode to the client, confirming that each packet has been written to every replica.
7. Close the file. After all acknowledgments have been received, the client calls FSDataOutputStream.close() to close the file, completing the write operation.
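The packet queue and pipeline described in the write steps can be illustrated with a minimal sketch. This is a toy model, not Hadoop's implementation: HdfsWritePipelineSketch, ToyDataNode, toPackets, and sendThroughPipeline are hypothetical names, packets are short strings, and the acknowledgment is modeled as the return value of a recursive call unwinding from the last DataNode back to the client. The separate queue of packets awaiting acknowledgment is an assumption of this sketch that goes beyond what the steps above state.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Toy model of the write pipeline described above. NOT Hadoop code:
// every class and method name here is hypothetical.
class HdfsWritePipelineSketch {

    // Stand-in for a DataNode: it just accumulates the packets it receives.
    static final class ToyDataNode {
        final String name;
        final StringBuilder stored = new StringBuilder();

        ToyDataNode(String name) {
            this.name = name;
        }
    }

    // Splits the data into fixed-size packets (the last one may be shorter).
    static List<String> toPackets(String data, int packetSize) {
        List<String> packets = new ArrayList<>();
        for (int i = 0; i < data.length(); i += packetSize) {
            packets.add(data.substring(i, Math.min(data.length(), i + packetSize)));
        }
        return packets;
    }

    // Node 'index' stores the packet and forwards it to the next node.
    // The boolean result models the acknowledgment: it originates after the
    // last node and is passed back through each earlier node to the client.
    static boolean sendThroughPipeline(String packet, List<ToyDataNode> pipeline, int index) {
        if (index == pipeline.size()) {
            return true; // past the last DataNode: the ack starts travelling back
        }
        pipeline.get(index).stored.append(packet);
        return sendThroughPipeline(packet, pipeline, index + 1);
    }

    // Client side: queue the packets, send each one to the first DataNode,
    // and drop it from the ack queue once the acknowledgment comes back.
    // Returns the number of acknowledged packets.
    static int write(String data, int packetSize, List<ToyDataNode> pipeline) {
        Deque<String> dataQueue = new ArrayDeque<>(toPackets(data, packetSize));
        Deque<String> ackQueue = new ArrayDeque<>(); // sent but not yet acknowledged
        int acked = 0;
        while (!dataQueue.isEmpty()) {
            String packet = dataQueue.poll();
            ackQueue.add(packet);
            if (sendThroughPipeline(packet, pipeline, 0)) {
                ackQueue.remove(); // ack received: packet is safely replicated
                acked++;
            }
        }
        return acked;
    }

    public static void main(String[] args) {
        List<ToyDataNode> pipeline = Arrays.asList(
            new ToyDataNode("dn1"), new ToyDataNode("dn2"), new ToyDataNode("dn3"));
        int acked = write("abcdefghij", 4, pipeline);
        System.out.println(acked + " packets acknowledged");
        for (ToyDataNode node : pipeline) {
            System.out.println(node.name + " stored " + node.stored);
        }
    }
}
```

The client only ever talks to the first node in the list; every other replica is produced by forwarding, which is the point of the pipeline design.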
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
