Downloading a large file from Jetty (Ambari WebHDFS) is slow #155

Open
ianzhang1988 opened this issue Apr 21, 2020 · 0 comments

I have a file of about 5G that downloads from HDFS at 12M/s, but my network can reach 500M/s, and smaller files work fine. I then reproduced the problem with both curl and requests.

Here is the curl debug log:

curl -v -X GET http://x.x.x.x/file

> GET /webhdfs/v1/user/sohuvideo/online/srcFile/188/718/188718791/dat1_188718791_2020_4_11_17_4_172647e6e60.mp4?op=OPEN&user.name=sohuvideo&namenoderpcaddress=sotocyon&offset=0 HTTP/1.1
> User-Agent: curl/7.29.0
> Host: x.x.x.com:50075
> Accept: */*
> 
< HTTP/1.1 200 OK
< Cache-Control: no-cache
< Expires: Tue, 21 Apr 2020 03:01:26 GMT
< Date: Tue, 21 Apr 2020 03:01:26 GMT
< Pragma: no-cache
< Expires: Tue, 21 Apr 2020 03:01:26 GMT
< Date: Tue, 21 Apr 2020 03:01:26 GMT
< Pragma: no-cache
< Content-Type: application/octet-stream
< Access-Control-Allow-Methods: GET
< Access-Control-Allow-Origin: *
< Transfer-Encoding: chunked
< Server: Jetty(6.1.26)
< 
{ [data not shown]
100  119M    0  119M    0     0  13.0M      0 --:--:--  0:00:09 --:--:-- 12.1M^C

After some digging, I found that attaching the header Connection: close to the request makes the download much faster.

curl -v -H "Connection: close" -X GET http://x.x.x.x/file

> GET /webhdfs/v1/user/sohuvideo/online/srcFile/188/718/188718791/dat1_188718791_2020_4_11_17_4_172647e6e60.mp4?op=OPEN&user.name=sohuvideo&namenoderpcaddress=sotocyon&offset=0 HTTP/1.1
> User-Agent: curl/7.29.0
> Host: x.x.x.com:50075
> Accept: */*
> Connection: close
> 
< HTTP/1.1 200 OK
< Cache-Control: no-cache
< Expires: Tue, 21 Apr 2020 03:00:13 GMT
< Date: Tue, 21 Apr 2020 03:00:13 GMT
< Pragma: no-cache
< Expires: Tue, 21 Apr 2020 03:00:13 GMT
< Date: Tue, 21 Apr 2020 03:00:13 GMT
< Pragma: no-cache
< Content-Type: application/octet-stream
< Access-Control-Allow-Methods: GET
< Access-Control-Allow-Origin: *
< Connection: close
< Server: Jetty(6.1.26)
< 
{ [data not shown]
100 4517M    0 4517M    0     0   138M      0 --:--:--  0:00:32 --:--:--  153M
* Closing connection 0
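
The same comparison can be scripted so it is easy to repeat. The following is a minimal sketch, not part of the library: `drain` and `timed_get` are hypothetical helper names, and the URL is whatever datanode URL you are testing. It uses `http.client` because that module does not add a Connection header on its own, so the two cases really differ only by that header.

```python
import http.client
import time
from urllib.parse import urlsplit


def drain(stream, block_size=1 << 20):
    """Read a file-like object to the end; return (total_bytes, seconds)."""
    total = 0
    start = time.perf_counter()
    while True:
        block = stream.read(block_size)
        if not block:
            break
        total += len(block)
    return total, time.perf_counter() - start


def timed_get(url, close_connection):
    """GET `url` over plain HTTP, optionally sending Connection: close."""
    parts = urlsplit(url)
    path = (parts.path or "/") + ("?" + parts.query if parts.query else "")
    headers = {"Connection": "close"} if close_connection else {}
    conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=60)
    try:
        conn.request("GET", path, headers=headers)
        return drain(conn.getresponse())
    finally:
        conn.close()
```

Calling `timed_get(url, False)` and then `timed_get(url, True)` and dividing bytes by seconds should show roughly the same gap as the two curl runs, if the header is indeed what makes the difference.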

I think this is probably caused by the server using Transfer-Encoding: chunked when the file is large. The server chooses chunked encoding because the file size has not yet been determined when it starts the transfer, and the chunked stream could add a lot of overhead. If Connection: close is given, the server does not need Transfer-Encoding: chunked to mark the end of the stream; it just closes the connection instead.

I added Connection: close to the request, and it seems to have solved the problem:
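
For reference, this is what chunked framing looks like on the wire. It is a minimal sketch of the HTTP/1.1 chunked transfer coding without chunk extensions or trailers; `encode_chunked` and `decode_chunked` are illustrative names and the 8192-byte chunk size is an assumption, not what Jetty uses. It only shows the framing the client has to parse and does not measure the server-side cost.

```python
def encode_chunked(payload: bytes, chunk_size: int = 8192) -> bytes:
    """Frame `payload` as a chunked body: hex size, CRLF, data, CRLF."""
    out = bytearray()
    for start in range(0, len(payload), chunk_size):
        chunk = payload[start:start + chunk_size]
        out += b"%x\r\n" % len(chunk)   # chunk size in hexadecimal
        out += chunk + b"\r\n"
    out += b"0\r\n\r\n"                 # zero-size chunk ends the stream
    return bytes(out)


def decode_chunked(framed: bytes) -> bytes:
    """Reverse of encode_chunked (no extensions or trailers handled)."""
    out = bytearray()
    pos = 0
    while True:
        line_end = framed.index(b"\r\n", pos)
        size = int(framed[pos:line_end], 16)
        pos = line_end + 2
        if size == 0:
            break
        out += framed[pos:pos + size]
        pos += size + 2                 # skip the data and its trailing CRLF
    return bytes(out)
```

With Connection: close and no Content-Length, none of this framing is needed: the body is simply everything up to the point where the server closes the socket, which matches the second curl log above.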

    return self._session.request(
      method=method,
      url=url,
      timeout=self._timeout,
      headers={
        'content-type': 'application/octet-stream', # For HttpFS.
        'Connection': 'close', # Added: ask the server to close the connection when done.
      },
      **kwargs
    )

Though I am not sure whether this could have any side effects.

Any suggestions?
