History log of /linux-master/fs/splice.c
Revision Date Author Comments
# 7c98f7cb 28-Aug-2023 Miklos Szeredi <mszeredi@redhat.com>

remove call_{read,write}_iter() functions

These have no clear purpose. This is effectively a revert of commit
bb7462b6fd64 ("vfs: use helpers for calling f_op->{read,write}_iter()").

The patch was created with the help of a coccinelle script.

Fixes: bb7462b6fd64 ("vfs: use helpers for calling f_op->{read,write}_iter()")
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 705bcfcb 12-Dec-2023 Amir Goldstein <amir73il@gmail.com>

fs: use splice_copy_file_range() inline helper

generic_copy_file_range() is just a wrapper around splice_file_range(),
which caps the maximum copy length.

The only caller of splice_file_range(), namely __ceph_copy_file_range()
is already ready to cope with short copy.

Move the length capping into splice_file_range() and replace the exported
symbol generic_copy_file_range() with a simple inline helper.

Suggested-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-fsdevel/20231204083849.GC32438@lst.de/
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231212094440.250945-3-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 0f292086 12-Dec-2023 Amir Goldstein <amir73il@gmail.com>

splice: return type ssize_t from all helpers

Not sure why some splice helpers return long, maybe historic reasons.
Change them all to return ssize_t to conform to the splice methods and
to the rest of the helpers.

Suggested-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20231208-horchen-helium-d3ec1535ede5@brauner/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231212094440.250945-2-amir73il@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# da40448c 30-Nov-2023 Amir Goldstein <amir73il@gmail.com>

fs: move file_start_write() into direct_splice_actor()

The callers of do_splice_direct() hold file_start_write() on the output
file.

This may cause file permission hooks to be called indirectly on an
overlayfs lower layer, which is on the same filesystem of the output
file and could lead to deadlock with fanotify permission events.

To fix this potential deadlock, move file_start_write() from the callers
into the direct_splice_actor(), so file_start_write() will not be held
while splicing from the input file.

Suggested-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20231128214258.GA2398475@perftesting/
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231130141624.3338942-3-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 488e8f68 30-Nov-2023 Amir Goldstein <amir73il@gmail.com>

fs: fork splice_file_range() from do_splice_direct()

In preparation of calling do_splice_direct() without file_start_write()
held, create a new helper splice_file_range(), to be called from context
of ->copy_file_range() methods instead of do_splice_direct().

Currently, the only difference is that splice_file_range() does not take
flags argument and that it asserts that file_start_write() is held, but
we factor out a common helper do_splice_direct_actor() that will be used
later.

Use the new helper from __ceph_copy_file_range(), that was incorrectly
passing to do_splice_direct() the copy flags argument as splice flags.
The value of copy flags in ceph is always 0, so it is a smenatic bug fix.

Move the declaration of both helpers to linux/splice.h.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231130141624.3338942-2-amir73il@gmail.com
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# d53471ba6f 23-Nov-2023 Amir Goldstein <amir73il@gmail.com>

splice: remove permission hook from iter_file_splice_write()

All the callers of ->splice_write(), (e.g. do_splice_direct() and
do_splice()) already check rw_verify_area() for the entire range
and perform all the other checks that are in vfs_write_iter().

Instead of creating another tiny helper for special caller, just
open-code it.

This is needed for fanotify "pre content" events.

Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231122122715.2561213-6-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>


# b70d8e2b 22-Nov-2023 Amir Goldstein <amir73il@gmail.com>

splice: move permission hook out of splice_file_to_pipe()

vfs_splice_read() has a permission hook inside rw_verify_area() and
it is called from splice_file_to_pipe(), which is called from
do_splice() and do_sendfile().

do_sendfile() already has a rw_verify_area() check for the entire range.
do_splice() has a rw_verify_check() for the splice to file case, not for
the splice from file case.

Add the rw_verify_area() check for splice from file case in do_splice()
and use a variant of vfs_splice_read() without rw_verify_area() check
in splice_file_to_pipe() to avoid the redundant rw_verify_area() checks.

This is needed for fanotify "pre content" events.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231122122715.2561213-5-amir73il@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# feebea75 22-Nov-2023 Amir Goldstein <amir73il@gmail.com>

splice: move permission hook out of splice_direct_to_actor()

vfs_splice_read() has a permission hook inside rw_verify_area() and
it is called from do_splice_direct() -> splice_direct_to_actor().

The callers of do_splice_direct() (e.g. vfs_copy_file_range()) already
call rw_verify_area() for the entire range, but the other caller of
splice_direct_to_actor() (nfsd) does not.

Add the rw_verify_area() checks in nfsd_splice_read() and use a
variant of vfs_splice_read() without rw_verify_area() check in
splice_direct_to_actor() to avoid the redundant rw_verify_area() checks.

This is needed for fanotify "pre content" events.

Acked-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231122122715.2561213-4-amir73il@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 2a33e2dd 22-Nov-2023 Amir Goldstein <amir73il@gmail.com>

splice: remove permission hook from do_splice_direct()

All callers of do_splice_direct() have a call to rw_verify_area() for
the entire range that is being copied, e.g. by vfs_copy_file_range() or
do_sendfile() before calling do_splice_direct().

The rw_verify_area() check inside do_splice_direct() is redundant and
is called after sb_start_write(), so it is not "start-write-safe".
Remove this redundant check.

This is needed for fanotify "pre content" events.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231122122715.2561213-3-amir73il@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 0201ebf2 28-Jun-2023 David Howells <dhowells@redhat.com>

mm: merge folio_has_private()/filemap_release_folio() call pairs

Patch series "mm, netfs, fscache: Stop read optimisation when folio
removed from pagecache", v7.

This fixes an optimisation in fscache whereby we don't read from the cache
for a particular file until we know that there's data there that we don't
have in the pagecache. The problem is that I'm no longer using PG_fscache
(aka PG_private_2) to indicate that the page is cached and so I don't get
a notification when a cached page is dropped from the pagecache.

The first patch merges some folio_has_private() and
filemap_release_folio() pairs and introduces a helper,
folio_needs_release(), to indicate if a release is required.

The second patch is the actual fix. Following Willy's suggestions[1], it
adds an AS_RELEASE_ALWAYS flag to an address_space that will make
filemap_release_folio() always call ->release_folio(), even if
PG_private/PG_private_2 aren't set. folio_needs_release() is altered to
add a check for this.


This patch (of 2):

Make filemap_release_folio() check folio_has_private(). Then, in most
cases, where a call to folio_has_private() is immediately followed by a
call to filemap_release_folio(), we can get rid of the test in the pair.

There are a couple of sites in mm/vscan.c that this can't so easily be
done. In shrink_folio_list(), there are actually three cases (something
different is done for incompletely invalidated buffers), but
filemap_release_folio() elides two of them.

In shrink_active_list(), we don't have have the folio lock yet, so the
check allows us to avoid locking the page unnecessarily.

A wrapper function to check if a folio needs release is provided for those
places that still need to do it in the mm/ directory. This will acquire
additional parts to the condition in a future patch.

After this, the only remaining caller of folio_has_private() outside of
mm/ is a check in fuse.

Link: https://lkml.kernel.org/r/20230628104852.3391651-1-dhowells@redhat.com
Link: https://lkml.kernel.org/r/20230628104852.3391651-2-dhowells@redhat.com
Reported-by: Rohith Surabattula <rohiths.msft@gmail.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steve French <sfrench@samba.org>
Cc: Shyam Prasad N <nspmangalore@gmail.com>
Cc: Rohith Surabattula <rohiths.msft@gmail.com>
Cc: Dave Wysochanski <dwysocha@redhat.com>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 781ca602 21-Aug-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

splice: Convert page_cache_pipe_buf_confirm() to use a folio

Convert buf->page to a folio once instead of five times. There's only
one uptodate bit per folio, not per page, so we lose nothing here.

Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Message-Id: <20230821141541.2535953-1-willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 576d498e 03-Jul-2023 Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>

splice: fsnotify_access(in), fsnotify_modify(out) on success in tee

Same logic applies here: this can fill up the pipe, and pollers that rely
on getting IN_MODIFY notifications never wake up.

Fixes: 983652c69199 ("splice: report related fsnotify events")
Link: https://lore.kernel.org/linux-fsdevel/jbyihkyk5dtaohdwjyivambb2gffyjs3dodpofafnkkunxq7bu@jngkdxx65pux/t/#u
Link: https://bugs.debian.org/1039488
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Acked-by: Jan Kara <jack@suse.cz>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Message-Id: <10d76dd8c85017ae3cd047c9b9a32e26daefdaa2.1688393619.git.nabijaczleweli@nabijaczleweli.xyz>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 7f0f1ea0 03-Jul-2023 Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>

splice: fsnotify_access(fd)/fsnotify_modify(fd) in vmsplice

Same logic applies here: this can fill up the pipe and pollers that rely
on getting IN_MODIFY notifications never wake up.

Fixes: 983652c69199 ("splice: report related fsnotify events")
Link: https://lore.kernel.org/linux-fsdevel/jbyihkyk5dtaohdwjyivambb2gffyjs3dodpofafnkkunxq7bu@jngkdxx65pux/t/#u
Link: https://bugs.debian.org/1039488
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Acked-by: Jan Kara <jack@suse.cz>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Message-Id: <8d9ad5acb9c5c1dd2376a2ff5da6ac3183115389.1688393619.git.nabijaczleweli@nabijaczleweli.xyz>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 12ee4b66 03-Jul-2023 Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>

splice: always fsnotify_access(in), fsnotify_modify(out) on success

The current behaviour caused an asymmetry where some write APIs
(write, sendfile) would notify the written-to/read-from objects,
but splice wouldn't.

This affected userspace which uses inotify, most notably coreutils
tail -f, to monitor pipes.
If the pipe buffer had been filled by a splice-family function:
* tail wouldn't know and thus wouldn't service the pipe, and
* all writes to the pipe would block because it's full,
thus service was denied.
(For the particular case of tail -f this could be worked around
with ---disable-inotify.)

Fixes: 983652c69199 ("splice: report related fsnotify events")
Link: https://lore.kernel.org/linux-fsdevel/jbyihkyk5dtaohdwjyivambb2gffyjs3dodpofafnkkunxq7bu@jngkdxx65pux/t/#u
Link: https://bugs.debian.org/1039488
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Acked-by: Jan Kara <jack@suse.cz>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Message-Id: <604ec704d933e0e0121d9e107ce914512e045fad.1688393619.git.nabijaczleweli@nabijaczleweli.xyz>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 0f0fa27b 24-Jul-2023 Jan Stancek <jstancek@redhat.com>

splice, net: Fix splice_to_socket() for O_NONBLOCK socket

LTP sendfile07 [1], which expects sendfile() to return EAGAIN when
transferring data from regular file to a "full" O_NONBLOCK socket,
started failing after commit 2dc334f1a63a ("splice, net: Use
sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()").
sendfile() no longer immediately returns, but now blocks.

Removed sock_sendpage() handled this case by setting a MSG_DONTWAIT
flag, fix new splice_to_socket() to do the same for O_NONBLOCK sockets.

[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/sendfile/sendfile07.c

Fixes: 2dc334f1a63a ("splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()")
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Xi Ruoyao <xry111@xry111.site>
Link: https://lore.kernel.org/r/023c0e21e595e00b93903a813bc0bfb9a5d7e368.1690219914.git.jstancek@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 2e82f6c3 14-Jun-2023 Christoph Hellwig <hch@lst.de>

splice: simplify a conditional in copy_splice_read

Check for -EFAULT instead of wrapping the check in an ret < 0 block.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230614140341.521331-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0b24be46 14-Jun-2023 Christoph Hellwig <hch@lst.de>

splice: don't call file_accessed in copy_splice_read

copy_splice_read calls into ->read_iter to read the data, which already
calls file_accessed.

Fixes: 33b3b041543e ("splice: Add a func to do a splice from an O_DIRECT file without ITER_PIPE")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230614140341.521331-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ca2d49f7 14-Jun-2023 David Howells <dhowells@redhat.com>

splice, net: Fix splice_to_socket() to handle pipe bufs larger than a page

splice_to_socket() assumes that a pipe_buffer won't hold more than a single
page of data - but this assumption can be violated by skb_splice_bits()
when it splices from a socket into a pipe.

The problem is that splice_to_socket() doesn't advance the pipe_buffer
length and offset when transcribing from the pipe buf into a bio_vec, so if
the buf is >PAGE_SIZE, it keeps repeating the same initial chunk and
doesn't advance the tail index. It then subtracts this from "remain" and
overcounts the amount of data to be sent.

The cleanup phase then tries to overclean the pipe, hits an unused pipe buf
and a NULL-pointer dereference occurs.

Fix this by not restricting the bio_vec size to PAGE_SIZE and instead
transcribing the entirety of each pipe_buffer into a single bio_vec and
advancing the tail index if remain hasn't hit zero yet.

Large bio_vecs will then be split up by iterator functions such as
iov_iter_extract_pages().

This resulted in a KASAN report looking like:

general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
...
RIP: 0010:pipe_buf_release include/linux/pipe_fs_i.h:203 [inline]
RIP: 0010:splice_to_socket+0xa91/0xe30 fs/splice.c:933

Fixes: 2dc334f1a63a ("splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()")
Reported-by: syzbot+f9e28a23426ac3b24f20@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/0000000000000900e905fdeb8e39@google.com/
Tested-by: syzbot+f9e28a23426ac3b24f20@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christian Brauner <brauner@kernel.org>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/1428985.1686737388@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 219d9205 07-Jun-2023 David Howells <dhowells@redhat.com>

splice, net: Fix SPLICE_F_MORE signalling in splice_direct_to_actor()

splice_direct_to_actor() doesn't manage SPLICE_F_MORE correctly[1] - and,
as a result, it incorrectly signals/fails to signal MSG_MORE when splicing
to a socket. The problem I'm seeing happens when a short splice occurs
because we got a short read due to hitting the EOF on a file: as the length
read (read_len) is less than the remaining size to be spliced (len),
SPLICE_F_MORE (and thus MSG_MORE) is set.

The issue is that, for the moment, we have no way to know *why* the short
read occurred and so can't make a good decision on whether we *should* keep
MSG_MORE set.

MSG_SENDPAGE_NOTLAST was added to work around this, but that is also set
incorrectly under some circumstances - for example if a short read fills a
single pipe_buffer, but the next read would return more (seqfile can do
this).

This was observed with the multi_chunk_sendfile tests in the tls kselftest
program. Some of those tests would hang and time out when the last chunk
of file was less than the sendfile request size:

build/kselftest/net/tls -r tls.12_aes_gcm.multi_chunk_sendfile

This has been observed before[2] and worked around in AF_TLS[3].

Fix this by making splice_direct_to_actor() always signal SPLICE_F_MORE if
we haven't yet hit the requested operation size. SPLICE_F_MORE remains
signalled if the user passed it in to splice() but otherwise gets cleared
when we've read sufficient data to fulfill the request.

If, however, we get a premature EOF from ->splice_read(), have sent at
least one byte and SPLICE_F_MORE was not set by the caller, ->splice_eof()
will be invoked.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: David Hildenbrand <david@redhat.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Chuck Lever <chuck.lever@oracle.com>
cc: Boris Pismenny <borisp@nvidia.com>
cc: John Fastabend <john.fastabend@gmail.com>
cc: linux-mm@kvack.org

Link: https://lore.kernel.org/r/499791.1685485603@warthog.procyon.org.uk/ [1]
Link: https://lore.kernel.org/r/1591392508-14592-1-git-send-email-pooja.trivedi@stackpath.com/ [2]
Link: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d452d48b9f8b1a7f8152d33ef52cfd7fe1735b0a [3]
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 2bfc6685 07-Jun-2023 David Howells <dhowells@redhat.com>

splice, net: Add a splice_eof op to file-ops and socket-ops

Add an optional method, ->splice_eof(), to allow splice to indicate the
premature termination of a splice to struct file_operations and struct
proto_ops.

This is called if sendfile() or splice() encounters all of the following
conditions inside splice_direct_to_actor():

(1) the user did not set SPLICE_F_MORE (splice only), and

(2) an EOF condition occurred (->splice_read() returned 0), and

(3) we haven't read enough to fulfill the request (ie. len > 0 still), and

(4) we have already spliced at least one byte.

A further patch will modify the behaviour of SPLICE_F_MORE to always be
passed to the actor if either the user set it or we haven't yet read
sufficient data to fulfill the request.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/CAHk-=wh=V579PDYvkpnTobCLGczbgxpMgGmmhqiTyE34Cpi5Gg@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: David Hildenbrand <david@redhat.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Chuck Lever <chuck.lever@oracle.com>
cc: Boris Pismenny <borisp@nvidia.com>
cc: John Fastabend <john.fastabend@gmail.com>
cc: linux-mm@kvack.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 2dc334f1 07-Jun-2023 David Howells <dhowells@redhat.com>

splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()

Replace generic_splice_sendpage() + splice_from_pipe + pipe_to_sendpage()
with a net-specific handler, splice_to_socket(), that calls sendmsg() with
MSG_SPLICE_PAGES set instead of calling ->sendpage().

MSG_MORE is used to indicate if the sendmsg() is expected to be followed
with more data.

This allows multiple pipe-buffer pages to be passed in a single call in a
BVEC iterator, allowing the processing to be pushed down to a loop in the
protocol driver. This helps pave the way for passing multipage folios down
too.

Protocols that haven't been converted to handle MSG_SPLICE_PAGES yet should
just ignore it and do a normal sendmsg() for now - although that may be a
bit slower as it may copy everything.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 9eee8bd8 22-May-2023 David Howells <dhowells@redhat.com>

splice: kdoc for filemap_splice_read() and copy_splice_read()

Provide kerneldoc comments for filemap_splice_read() and
copy_splice_read().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Christoph Hellwig <hch@lst.de>
cc: Jens Axboe <axboe@kernel.dk>
cc: Steve French <smfrench@gmail.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230522135018.2742245-32-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c6585011 22-May-2023 David Howells <dhowells@redhat.com>

splice: Remove generic_file_splice_read()

Remove generic_file_splice_read() as it has been replaced with calls to
filemap_splice_read() and copy_splice_read().

With this, ITER_PIPE is no longer used.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Steve French <smfrench@gmail.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230522135018.2742245-30-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b85930a0 22-May-2023 David Howells <dhowells@redhat.com>

splice: Make splice from a DAX file use copy_splice_read()

Make a read splice from a DAX file go directly to copy_splice_read() to do
the reading as filemap_splice_read() is unlikely to find any pagecache to
splice.

I think this affects only erofs, Ext2, Ext4, fuse and XFS.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-erofs@lists.ozlabs.org
cc: linux-ext4@vger.kernel.org
cc: linux-xfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20230522135018.2742245-9-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# aa3dbde8 22-May-2023 David Howells <dhowells@redhat.com>

splice: Make splice from an O_DIRECT fd use copy_splice_read()

Make a read splice from a file descriptor that's open O_DIRECT use
copy_splice_read() to do the reading as filemap_splice_read() is unlikely
to find any pagecache to splice.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20230522135018.2742245-8-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 123856f0 22-May-2023 David Howells <dhowells@redhat.com>

splice: Check for zero count in vfs_splice_read()

Make vfs_splice_read() return immediately if the length is 0.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20230522135018.2742245-7-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6a3f30b8 22-May-2023 David Howells <dhowells@redhat.com>

splice: Make do_splice_to() generic and export it

Rename do_splice_to() to vfs_splice_read() and export it so that it can be
used as a helper when calling down to a lower layer filesystem as it
performs all the necessary checks[1].

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: John Hubbard <jhubbard@nvidia.com>
cc: David Hildenbrand <david@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-unionfs@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/CAJfpeguGksS3sCigmRi9hJdUec8qtM9f+_9jC1rJhsXT+dV01w@mail.gmail.com/ [1]
Link: https://lore.kernel.org/r/20230522135018.2742245-6-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# e69f37bc 22-May-2023 David Howells <dhowells@redhat.com>

splice: Clean up copy_splice_read() a bit

Do a couple of cleanups to copy_splice_read():

(1) Cast to struct page **, not void *.

(2) Simplify the calculation of the number of pages to keep/reclaim in
copy_splice_read().

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230522135018.2742245-5-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 69df79a4 22-May-2023 David Howells <dhowells@redhat.com>

splice: Rename direct_splice_read() to copy_splice_read()

Rename direct_splice_read() to copy_splice_read() to better reflect as to
what it does.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: linux-cifs@vger.kernel.org
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230522135018.2742245-4-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0f99fc51 24-Apr-2023 Jens Axboe <axboe@kernel.dk>

splice: clear FMODE_NOWAIT on file if splice/vmsplice is used

In preparation for pipes setting FMODE_NOWAIT on pipes to indicate that
RWF_NOWAIT/IOCB_NOWAIT is fully supported, have splice and vmsplice
clear that file flag. Splice holds the pipe lock around IO and cannot
easily be refactored to avoid that, as splice and pipes are inherently
tied together.

By clearing FMODE_NOWAIT if splice is being used on a pipe, other users
of the pipe will know that the pipe is no longer safe for RWF_NOWAIT
and friends.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 983652c6 22-Mar-2023 Chung-Chiang Cheng <cccheng@synology.com>

splice: report related fsnotify events

The fsnotify ACCESS and MODIFY event are missing when manipulating a file
with splice(2).

Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Jan Kara <jack@suse.cz>
Message-Id: <20230322062519.409752-1-cccheng@synology.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# c3a4aec0 07-Mar-2023 Jiapeng Chong <jiapeng.chong@linux.alibaba.com>

splice: Remove redundant assignment to ret

The variable ret belongs to redundant assignment and can be deleted.

fs/splice.c:940:2: warning: Value stored to 'ret' is never read.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=4406
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 7c8e01eb 15-Feb-2023 David Howells <dhowells@redhat.com>

splice: Export filemap/direct_splice_read()

filemap_splice_read() and direct_splice_read() should be exported.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-cifs@vger.kernel.org
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>


# 33b3b041 07-Feb-2023 David Howells <dhowells@redhat.com>

splice: Add a func to do a splice from an O_DIRECT file without ITER_PIPE

Implement a function, direct_file_splice(), that deals with this by using
an ITER_BVEC iterator instead of an ITER_PIPE iterator as the former won't
free its buffers when reverted. The function bulk allocates all the
buffers it thinks it is going to use in advance, does the read
synchronously and only then trims the buffer down. The pages we did use
get pushed into the pipe.

This fixes a problem with the upcoming iov_iter_extract_pages() function,
whereby pages extracted from a non-user-backed iterator such as ITER_PIPE
aren't pinned. __iomap_dio_rw(), however, calls iov_iter_revert() to
shorten the iterator to just the bufferage it is going to use - which has
the side-effect of freeing the excess pipe buffers, even though they're
attached to a bio and may get written to by DMA (thanks to Hillf Danton for
spotting this[1]).

This then causes memory corruption that is particularly noticeable when the
syzbot test[2] is run. The test boils down to:

out = creat(argv[1], 0666);
ftruncate(out, 0x800);
lseek(out, 0x200, SEEK_SET);
in = open(argv[1], O_RDONLY | O_DIRECT | O_NOFOLLOW);
sendfile(out, in, NULL, 0x1dd00);

run repeatedly in parallel. What I think is happening is that ftruncate()
occasionally shortens the DIO read that's about to be made by sendfile's
splice core by reducing i_size.

This should be more efficient for DIO read by virtue of doing a bulk page
allocation, but slightly less efficient by ignoring any partial page in the
pipe.

Reported-by: syzbot+a440341a59e3b7142895@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230207094731.1390-1-hdanton@sina.com/ [1]
Link: https://lore.kernel.org/r/000000000000b0b3c005f3a09383@google.com/ [2]
Signed-off-by: Steve French <stfrench@microsoft.com>


# 664e4078 03-Feb-2023 Christoph Hellwig <hch@lst.de>

splice: use bvec_set_page to initialize a bvec

Use the bvec_set_page helper to initialize a bvec.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230203150634.3199647-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# de4eda9d 15-Sep-2022 Al Viro <viro@zeniv.linux.org.uk>

use less confusing names for iov_iter direction initializers

READ/WRITE proved to be actively confusing - the meanings are
"data destination, as used with read(2)" and "data source, as
used with write(2)", but people keep interpreting those as
"we read data from it" and "we write data to it", i.e. exactly
the wrong way.

Call them ITER_DEST and ITER_SOURCE - at least that is harder
to misinterpret...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 7d690c15 09-Jun-2022 Al Viro <viro@zeniv.linux.org.uk>

iter_to_pipe(): switch to advancing variant of iov_iter_get_pages()

... and untangle the cleanup on failure to add into pipe.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 0d964934 12-Jun-2022 Al Viro <viro@zeniv.linux.org.uk>

splice: stop abusing iov_iter_advance() to flush a pipe

Use pipe_discard_from() explicitly in generic_file_read_iter(); don't bother
with rather non-obvious use of iov_iter_advance() in there.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 97ef77c5 29-Jun-2022 Jason A. Donenfeld <Jason@zx2c4.com>

fs: check FMODE_LSEEK to control internal pipe splicing

The original direct splicing mechanism from Jens required the input to
be a regular file because it was avoiding the special socket case. It
also recognized blkdevs as being close enough to a regular file. But it
forgot about chardevs, which behave the same way and work fine here.

This is an okayish heuristic, but it doesn't totally work. For example,
a few chardevs should be spliceable here. And a few regular files
shouldn't. This patch fixes this by instead checking whether FMODE_LSEEK
is set, which represents decently enough what we need rewinding for when
splicing to internal pipes.

Fixes: b92ce5589374 ("[PATCH] splice: add direct fd <-> fd splicing support")
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 5100da38 12-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: Convert remove_mapping() to take a folio

Add kernel-doc and return the number of pages removed in order to
get the statistics right in __invalidate_mapping_pages().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>


# b9ccad2e 11-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

splice: Use a folio in page_cache_pipe_buf_try_steal()

This saves a lot of calls to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>


# b964bf53 25-Jan-2021 Al Viro <viro@zeniv.linux.org.uk>

teach sendfile(2) to handle send-to-pipe directly

no point going through the intermediate pipe

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# faa97c48 25-Jan-2021 Al Viro <viro@zeniv.linux.org.uk>

take the guts of file-to-pipe splice into a helper function

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 313d64a3 24-Jan-2021 Al Viro <viro@zeniv.linux.org.uk>

do_splice_to(): move the logics for limiting the read length in

Both callers have the identical logics limiting the amount of
data we try to read into pipe - no more than would fit into
that pipe. Move that into do_splice_to() itself.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 0f1d344f 09-Jan-2021 Pavel Begunkov <asml.silence@gmail.com>

splice: don't generate zero-len segement bvecs

iter_file_splice_write() may spawn bvec segments with zero-length. In
preparation for prohibiting them, filter out by hand at splice level.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ee6e00c8 22-Oct-2020 Jens Axboe <axboe@kernel.dk>

splice: change exported internal do_splice() helper to take kernel offset

With the set_fs change, we can no longer rely on copy_{to,from}_user()
accepting a kernel pointer, and it was bad form to do so anyway. Clean
this up and change the internal helper that io_uring uses to deal with
kernel pointers instead. This puts the offset copy in/out in __do_splice()
instead, which just calls the same helper.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# d1a819a2 05-Oct-2020 Linus Torvalds <torvalds@linux-foundation.org>

splice: teach splice pipe reading about empty pipe buffers

Tetsuo Handa reports that splice() can return 0 before the real EOF, if
the data in the splice source pipe is an empty pipe buffer. That empty
pipe buffer case doesn't happen in any normal situation, but you can
trigger it by doing a write to a pipe that fails due to a page fault.

Tetsuo has a test-case to show the behavior:

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
const int fd = open("/tmp/testfile", O_WRONLY | O_CREAT, 0600);
int pipe_fd[2] = { -1, -1 };
pipe(pipe_fd);
write(pipe_fd[1], NULL, 4096);
/* This splice() should wait unless interrupted. */
return !splice(pipe_fd[0], NULL, fd, NULL, 65536, 0);
}

which results in

write(5, NULL, 4096) = -1 EFAULT (Bad address)
splice(4, NULL, 3, NULL, 65536, 0) = 0

and this can confuse splice() users into believing they have hit EOF
prematurely.

The issue was introduced when the pipe write code started pre-allocating
the pipe buffers before copying data from user space.

This is modified verion of Tetsuo's original patch.

Fixes: a194dfe6e6f6 ("pipe: Rearrange sequence in pipe_write() to preallocate slot")
Link:https://lore.kernel.org/linux-fsdevel/20201005121339.4063-1-penguin-kernel@I-love.SAKURA.ne.jp/
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 598b3cec 24-Sep-2020 Christoph Hellwig <hch@lst.de>

fs: remove compat_sys_vmsplice

Now that import_iovec handles compat iovecs, the native vmsplice syscall
can be used for the compat case as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 89cd35c5 24-Sep-2020 Christoph Hellwig <hch@lst.de>

iov_iter: transparently handle compat iovecs in import_iovec

Use in compat_syscall to import either native or the compat iovecs, and
remove the now superflous compat_import_iovec.

This removes the need for special compat logic in most callers, and
the remaining ones can still be simplified by using __import_iovec
with a bool compat parameter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 472e5b05 01-Oct-2020 Linus Torvalds <torvalds@linux-foundation.org>

pipe: remove pipe_wait() and fix wakeup race with splice

The pipe splice code still used the old model of waiting for pipe IO by
using a non-specific "pipe_wait()" that waited for any pipe event to
happen, which depended on all pipe IO being entirely serialized by the
pipe lock. So by checking the state you were waiting for, and then
adding yourself to the wait queue before dropping the lock, you were
guaranteed to see all the wakeups.

Strictly speaking, the actual wakeups were not done under the lock, but
the pipe_wait() model still worked, because since the waiter held the
lock when checking whether it should sleep, it would always see the
current state, and the wakeup was always done after updating the state.

However, commit 0ddad21d3e99 ("pipe: use exclusive waits when reading or
writing") split the single wait-queue into two, and in the process also
made the "wait for event" code wait for _two_ wait queues, and that then
showed a race with the wakers that were not serialized by the pipe lock.

It's only splice that used that "pipe_wait()" model, so the problem
wasn't obvious, but Josef Bacik reports:

"I hit a hang with fstest btrfs/187, which does a btrfs send into
/dev/null. This works by creating a pipe, the write side is given to
the kernel to write into, and the read side is handed to a thread that
splices into a file, in this case /dev/null.

The box that was hung had the write side stuck here [pipe_write] and
the read side stuck here [splice_from_pipe_next -> pipe_wait].

[ more details about pipe_wait() scenario ]

The problem is we're doing the prepare_to_wait, which sets our state
each time, however we can be woken up either with reads or writes. In
the case above we race with the WRITER waking us up, and re-set our
state to INTERRUPTIBLE, and thus never break out of schedule"

Josef had a patch that avoided the issue in pipe_wait() by just making
it set the state only once, but the deeper problem is that pipe_wait()
depends on a level of synchonization by the pipe mutex that it really
shouldn't. And the whole "wait for any pipe state change" model really
isn't very good to begin with.

So rather than trying to work around things in pipe_wait(), remove that
legacy model of "wait for arbitrary pipe event" entirely, and actually
create functions that wait for the pipe actually being readable or
writable, and can do so without depending on the pipe lock serializing
everything.

Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing")
Link: https://lore.kernel.org/linux-fsdevel/bfa88b5ad6f069b2b679316b9e495a970130416c.1601567868.git.josef@toxicpanda.com/
Reported-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-and-tested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 36e2c742 03-Sep-2020 Christoph Hellwig <hch@lst.de>

fs: don't allow splice read/write without explicit ops

default_file_splice_write is the last piece of generic code that uses
set_fs to make the uaccess routines operate on kernel pointers. It
implements a "fallback loop" for splicing from files that do not actually
provide a proper splice_read method. The usual file systems and other
high bandwidth instances all provide a ->splice_read, so this just removes
support for various device drivers and procfs/debugfs files. If splice
support for any of those turns out to be important it can be added back
by switching them to the iter ops and using generic_file_splice_read.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 566d1362 19-May-2020 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

pipe: Fix pipe_full() test in opipe_prep().

syzbot is reporting that splice()ing from non-empty read side to
already-full write side causes unkillable task, for opipe_prep() is by
error not inverting pipe_full() test.

CPU: 0 PID: 9460 Comm: syz-executor.5 Not tainted 5.6.0-rc3-next-20200228-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:rol32 include/linux/bitops.h:105 [inline]
RIP: 0010:iterate_chain_key kernel/locking/lockdep.c:369 [inline]
RIP: 0010:__lock_acquire+0x6a3/0x5270 kernel/locking/lockdep.c:4178
Call Trace:
lock_acquire+0x197/0x420 kernel/locking/lockdep.c:4720
__mutex_lock_common kernel/locking/mutex.c:956 [inline]
__mutex_lock+0x156/0x13c0 kernel/locking/mutex.c:1103
pipe_lock_nested fs/pipe.c:66 [inline]
pipe_double_lock+0x1a0/0x1e0 fs/pipe.c:104
splice_pipe_to_pipe fs/splice.c:1562 [inline]
do_splice+0x35f/0x1520 fs/splice.c:1141
__do_sys_splice fs/splice.c:1447 [inline]
__se_sys_splice fs/splice.c:1427 [inline]
__x64_sys_splice+0x2b5/0x320 fs/splice.c:1427
do_syscall_64+0xf6/0x790 arch/x86/entry/common.c:295
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Reported-by: syzbot+b48daca8639150bc5e73@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=9386d051e11e09973d5a4cf79af5e8cedf79386d
Fixes: 8cefc107ca54c8b0 ("pipe: Use head and tail pointers for the ring, not cursor and length")
Cc: stable@vger.kernel.org # 5.5+
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# c928f642 20-May-2020 Christoph Hellwig <hch@lst.de>

fs: rename pipe_buf ->steal to ->try_steal

And replace the arcane return value convention with a simple bool
where true means success and false means failure.

[AV: braino fix folded in]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b8d9e7f2 20-May-2020 Christoph Hellwig <hch@lst.de>

fs: make the pipe_buf_operations ->confirm operation optional

Just return 0 for success if it is not present.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 76887c25 20-May-2020 Christoph Hellwig <hch@lst.de>

fs: make the pipe_buf_operations ->steal operation optional

Just return 1 for failure if it is not present.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# f6dd9755 20-May-2020 Christoph Hellwig <hch@lst.de>

pipe: merge anon_pipe_buf*_ops

All the op vectors are exactly the same, they are just used to encode
packet or nomerge behavior. There already is a flag for the packet
behavior, so just add a new one to allow for merging. Inverting it vs
the previous nomerge special casing actually allows for much nicer code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 00c285d0 20-May-2020 Christoph Hellwig <hch@lst.de>

fs: simplify do_splice_from

No need for a local function pointer when we can trivial branch on the
->splice_write presence.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2bc01060 20-May-2020 Christoph Hellwig <hch@lst.de>

fs: simplify do_splice_to

No need for a local function pointer when we can trivial branch on the
->splice_read presence.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c73be61c 14-Jan-2020 David Howells <dhowells@redhat.com>

pipe: Add general notification queue support

Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.

The way the notification queue is used is:

(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):

pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);

(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.

Notification messages include a length in the header so that the
caller can split them up.

Each message has a header that describes it:

struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};

The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.

Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.

Signed-off-by: David Howells <dhowells@redhat.com>


# 9dafdfc2 17-May-2020 Pavel Begunkov <asml.silence@gmail.com>

splice: export do_tee()

export do_tee() for use in io_uring

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 90da2e3f 04-May-2020 Pavel Begunkov <asml.silence@gmail.com>

splice: move f_mode checks to do_{splice,tee}()

do_splice() is used by io_uring, as will be do_tee(). Move f_mode
checks from sys_{splice,tee}() to do_{splice,tee}(), so they're
enforced for io_uring as well.

Fixes: 7d67af2c0134 ("io_uring: add splice(2) support")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 444ebb57 24-Feb-2020 Pavel Begunkov <asml.silence@gmail.com>

splice: make do_splice public

Make do_splice(), so other kernel parts can reuse it

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0ddad21d 09-Dec-2019 Linus Torvalds <torvalds@linux-foundation.org>

pipe: use exclusive waits when reading or writing

This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).

While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.

In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.

This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.

A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):

#include <unistd.h>

int main(int argc, char **argv)
{
int fd[2], counters[2];

pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));

/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();

do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}

and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.

But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows

231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches

14.621273891 seconds time elapsed

0.018243000 seconds user
3.611468000 seconds sys

before this commit.

After this commit, I get

5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches

1.702148350 seconds time elapsed

0.004868000 seconds user
0.110786000 seconds sys

instead. Much better.

[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").

But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]

Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.

All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"

Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a28c8b9d 07-Dec-2019 Linus Torvalds <torvalds@linux-foundation.org>

pipe: remove 'waiting_writers' merging logic

This code is ancient, and goes back to when we only had a single page
for the pipe buffers. The exact history is hidden in the mists of time
(ie "before git", and in fact predates the BK repository too).

At that long-ago point in time, it actually helped to try to merge big
back-and-forth pipe reads and writes, and not limit pipe reads to the
single pipe buffer in length just because that was all we had at a time.

However, since then we've expanded the pipe buffers to multiple pages,
and this logic really doesn't seem to make sense. And a lot of it is
somewhat questionable (ie "hmm, the user asked for a non-blocking read,
but we see that there's a writer pending, so let's wait anyway to get
the extra data that the writer will have").

But more importantly, it makes the "go to sleep" logic much less
obvious, and considering the wakeup issues we've had, I want to make for
less of those kinds of things.

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ec057595 06-Dec-2019 Linus Torvalds <torvalds@linux-foundation.org>

pipe: fix incorrect caching of pipe state over pipe_wait()

Similarly to commit 8f868d68d335 ("pipe: Fix missing mask update after
pipe_wait()") this fixes a case where the pipe rewrite ended up caching
the pipe state incorrectly over a pipe lock drop event.

It wasn't quite as obvious, because you needed to splice data from a
pipe to a file, which is a fairly unusual operation, but it's completely
wrong.

Make sure we load the pipe head/tail/size information only after we've
waited for there to be data in the pipe.

While in that file, also make one of the splice helper functions use the
canonical arghument order for pipe_empty(). That's syntactic - pipe
emptiness is just that head and tail are equal, and thus mixing up head
and tail doesn't really matter. It's still wrong, though.

Reported-by: David Sterba <dsterba@suse.cz>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6718b6f8 16-Oct-2019 David Howells <dhowells@redhat.com>

pipe: Allow pipes to have kernel-reserved slots

Split pipe->ring_size into two numbers:

(1) pipe->ring_size - indicates the hard size of the pipe ring.

(2) pipe->max_usage - indicates the maximum number of pipe ring slots that
userspace orchestrated events can fill.

This allows for a pipe that is both writable by the general kernel
notification facility and by userspace, allowing plenty of ring space for
notifications to be added whilst preventing userspace from being able to
pin too much unswappable kernel space.

Signed-off-by: David Howells <dhowells@redhat.com>


# 8cefc107 15-Nov-2019 David Howells <dhowells@redhat.com>

pipe: Use head and tail pointers for the ring, not cursor and length

Convert pipes to use head and tail pointers for the buffer ring rather than
pointer and length as the latter requires two atomic ops to update (or a
combined op) whereas the former only requires one.

(1) The head pointer is the point at which production occurs and points to
the slot in which the next buffer will be placed. This is equivalent
to pipe->curbuf + pipe->nrbufs.

The head pointer belongs to the write-side.

(2) The tail pointer is the point at which consumption occurs. It points
to the next slot to be consumed. This is equivalent to pipe->curbuf.

The tail pointer belongs to the read-side.

(3) head and tail are allowed to run to UINT_MAX and wrap naturally. They
are only masked off when the array is being accessed, e.g.:

pipe->bufs[head & mask]

This means that it is not necessary to have a dead slot in the ring as
head == tail isn't ambiguous.

(4) The ring is empty if "head == tail".

A helper, pipe_empty(), is provided for this.

(5) The occupancy of the ring is "head - tail".

A helper, pipe_occupancy(), is provided for this.

(6) The number of free slots in the ring is "pipe->ring_size - occupancy".

A helper, pipe_space_for_user() is provided to indicate how many slots
userspace may use.

(7) The ring is full if "head - tail >= pipe->ring_size".

A helper, pipe_full(), is provided for this.

Signed-off-by: David Howells <dhowells@redhat.com>


# 3253d9d0 15-Oct-2019 Darrick J. Wong <darrick.wong@oracle.com>

splice: only read in as much information as there is pipe buffer space

Andreas Grünbacher reports that on the two filesystems that support
iomap directio, it's possible for splice() to return -EAGAIN (instead of
a short splice) if the pipe being written to has less space available in
its pipe buffers than the length supplied by the calling process.

Months ago we fixed splice_direct_to_actor to clamp the length of the
read request to the size of the splice pipe. Do the same to do_splice.

Fixes: 17614445576b6 ("splice: don't read more than available pipe space")
Reported-by: syzbot+3c01db6025f26530cf8d@syzkaller.appspotmail.com
Reported-by: Andreas Grünbacher <andreas.gruenbacher@gmail.com>
Reviewed-by: Andreas Grünbacher <andreas.gruenbacher@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 87e5e6da 14-May-2019 Jens Axboe <axboe@kernel.dk>

uio: make import_iovec()/compat_import_iovec() return bytes on success

Currently these functions return < 0 on error, and 0 for success.
Change that so that we return < 0 on error, but number of bytes
for success.

Some callers already treat the return value that way, others need a
slight tweak.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 457c8996 19-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Add SPDX license identifier for missed files

Add SPDX license identifiers to all files which:

- Have no license information of any form

- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# b9872226 04-Apr-2019 Jann Horn <jannh@google.com>

tracing: Fix buffer_ref pipe ops

This fixes multiple issues in buffer_pipe_buf_ops:

- The ->steal() handler must not return zero unless the pipe buffer has
the only reference to the page. But generic_pipe_buf_steal() assumes
that every reference to the pipe is tracked by the page's refcount,
which isn't true for these buffers - buffer_pipe_buf_get(), which
duplicates a buffer, doesn't touch the page's refcount.
Fix it by using generic_pipe_buf_nosteal(), which refuses every
attempted theft. It should be easy to actually support ->steal, but the
only current users of pipe_buf_steal() are the virtio console and FUSE,
and they also only use it as an optimization. So it's probably not worth
the effort.
- The ->get() and ->release() handlers can be invoked concurrently on pipe
buffers backed by the same struct buffer_ref. Make them safe against
concurrency by using refcount_t.
- The pointers stored in ->private were only zeroed out when the last
reference to the buffer_ref was dropped. As far as I know, this
shouldn't be necessary anyway, but if we do it, let's always do it.

Link: http://lkml.kernel.org/r/20190404215925.253531-1-jannh@google.com

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org
Fixes: 73a757e63114d ("ring-buffer: Return reader page back into existing ring buffer")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>


# 15fab63e 05-Apr-2019 Matthew Wilcox <willy@infradead.org>

fs: prevent page refcount overflow in pipe_buf_get

Change pipe_buf_get() to return a bool indicating whether it succeeded
in raising the refcount of the page (if the thing in the pipe is a page).
This removes another mechanism for overflowing the page refcount. All
callers converted to handle a failure.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ee5e0011 07-Feb-2019 Slavomir Kaslev <kaslevs@vmware.com>

fs: Make splice() and tee() take into account O_NONBLOCK flag on pipes

The current implementation of splice() and tee() ignores O_NONBLOCK set
on pipe file descriptors and checks only the SPLICE_F_NONBLOCK flag for
blocking on pipe arguments. This is inconsistent since splice()-ing
from/to non-pipe file descriptors does take O_NONBLOCK into
consideration.

Fix this by promoting O_NONBLOCK, when set on a pipe, to
SPLICE_F_NONBLOCK.

Some context for how the current implementation of splice() leads to
inconsistent behavior. In the ongoing work[1] to add VM tracing
capability to trace-cmd we stream tracing data over named FIFOs or
vsockets from guests back to the host.

When we receive SIGINT from user to stop tracing, we set O_NONBLOCK on
the input file descriptor and set SPLICE_F_NONBLOCK for the next call to
splice(). If splice() was blocked waiting on data from the input FIFO,
after SIGINT splice() restarts with the same arguments (no
SPLICE_F_NONBLOCK) and blocks again instead of returning -EAGAIN when no
data is available.

This differs from the splice() behavior when reading from a vsocket or
when we're doing a traditional read()/write() loop (trace-cmd's
--nosplice argument).

With this patch applied we get the same behavior in all situations after
setting O_NONBLOCK which also matches the behavior of doing a
read()/write() loop instead of splice().

This change does have potential of breaking users who don't expect
EAGAIN from splice() when SPLICE_F_NONBLOCK is not set. OTOH programs
that set O_NONBLOCK and don't anticipate EAGAIN are arguably buggy[2].

[1] https://github.com/skaslev/trace-cmd/tree/vsock
[2] https://github.com/torvalds/linux/blob/d47e3da1759230e394096fd742aad423c291ba48/fs/read_write.c#L1425

Signed-off-by: Slavomir Kaslev <kaslevs@vmware.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 736706be 04-Mar-2019 Linus Torvalds <torvalds@linux-foundation.org>

get rid of legacy 'get_ds()' function

Every in-kernel use of this function defined it to KERNEL_DS (either as
an actual define, or as an inline function). It's an entirely
historical artifact, and long long long ago used to actually read the
segment selector valueof '%ds' on x86.

Which in the kernel is always KERNEL_DS.

Inspired by a patch from Jann Horn that just did this for a very small
subset of users (the ones in fs/), along with Al who suggested a script.
I then just took it to the logical extreme and removed all the remaining
gunk.

Roughly scripted with

git grep -l '(get_ds())' -- :^tools/ | xargs sed -i 's/(get_ds())/(KERNEL_DS)/'
git grep -lw 'get_ds' -- :^tools/ | xargs sed -i '/^#define get_ds()/d'

plus manual fixups to remove a few unusual usage patterns, the couple of
inline function cases and to fix up a comment that had become stale.

The 'get_ds()' function remains in an x86 kvm selftest, since in user
space it actually does something relevant.

Inspired-by: Jann Horn <jannh@google.com>
Inspired-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 01e7187b 23-Jan-2019 Jann Horn <jannh@google.com>

pipe: stop using ->can_merge

Al Viro pointed out that since there is only one pipe buffer type to which
new data can be appended, it isn't necessary to have a ->can_merge field in
struct pipe_buf_operations, we can just check for a magic type.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a0ce2f0a 23-Jan-2019 Jann Horn <jannh@google.com>

splice: don't merge into linked buffers

Before this patch, it was possible for two pipes to affect each other after
data had been transferred between them with tee():

============
$ cat tee_test.c

int main(void) {
int pipe_a[2];
if (pipe(pipe_a)) err(1, "pipe");
int pipe_b[2];
if (pipe(pipe_b)) err(1, "pipe");
if (write(pipe_a[1], "abcd", 4) != 4) err(1, "write");
if (tee(pipe_a[0], pipe_b[1], 2, 0) != 2) err(1, "tee");
if (write(pipe_b[1], "xx", 2) != 2) err(1, "write");

char buf[5];
if (read(pipe_a[0], buf, 4) != 4) err(1, "read");
buf[4] = 0;
printf("got back: '%s'\n", buf);
}
$ gcc -o tee_test tee_test.c
$ ./tee_test
got back: 'abxx'
$
============

As suggested by Al Viro, fix it by creating a separate type for
non-mergeable pipe buffers, then changing the types of buffers in
splice_pipe_to_pipe() and link_pipe().

Cc: <stable@vger.kernel.org>
Fixes: 7c77f0b3f920 ("splice: implement pipe to pipe splicing")
Fixes: 70524490ee2e ("[PATCH] splice: add support for sys_tee()")
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 17614445 30-Nov-2018 Darrick J. Wong <darrick.wong@oracle.com>

splice: don't read more than available pipe space

In commit 4721a601099, we tried to fix a problem wherein directio reads
into a splice pipe will bounce EFAULT/EAGAIN all the way out to
userspace by simulating a zero-byte short read. This happens because
some directio read implementations (xfs) will call
bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
reads, but as soon as we run out of pipe buffers that _get_pages call
returns EFAULT, which the splice code translates to EAGAIN and bounces
out to userspace.

In that commit, the iomap code catches the EFAULT and simulates a
zero-byte read, but that causes assertion errors on regular splice reads
because xfs doesn't allow short directio reads.

The brokenness is compounded by splice_direct_to_actor immediately
bailing on do_splice_to returning <= 0 without ever calling ->actor
(which empties out the pipe), so if userspace calls back we'll EFAULT
again on the full pipe, and nothing ever gets copied.

Therefore, teach splice_direct_to_actor to clamp its requests to the
amount of free space in the pipe and remove the simulated short read
from the iomap directio code.

Fixes: 4721a601099 ("iomap: dio data corruption and spurious errors when pipes fill")
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Ranted-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# aa563d7b 19-Oct-2018 David Howells <dhowells@redhat.com>

iov_iter: Separate type from direction and use accessor functions

In the iov_iter struct, separate the iterator type from the iterator
direction and use accessor functions to access them in most places.

Convert a bunch of places to use switch-statements to access them rather
then chains of bitwise-AND statements. This makes it easier to add further
iterator types. Also, this can be more efficient as to implement a switch
of small contiguous integers, the compiler can use ~50% fewer compare
instructions than it has to use bitwise-and instructions.

Further, cease passing the iterator type into the iterator setup function.
The iterator function can set that itself. Only the direction is required.

Signed-off-by: David Howells <dhowells@redhat.com>


# 6da2ec56 12-Jun-2018 Kees Cook <keescook@chromium.org>

treewide: kmalloc() -> kmalloc_array()

The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
patch replaces cases of:

kmalloc(a * b, gfp)

with:
kmalloc_array(a * b, gfp)

as well as handling cases of:

kmalloc(a * b * c, gfp)

with:

kmalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

kmalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

kmalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The tools/ directory was manually excluded, since it has its own
implementation of kmalloc().

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
kmalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kmalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
kmalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kmalloc
+ kmalloc_array
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
kmalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
kmalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kmalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
kmalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
kmalloc(C1 * C2 * C3, ...)
|
kmalloc(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
kmalloc(sizeof(THING) * C2, ...)
|
kmalloc(sizeof(TYPE) * C2, ...)
|
kmalloc(C1 * C2 * C3, ...)
|
kmalloc(C1 * C2, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- (E1) * E2
+ E1, E2
, ...)
|
- kmalloc
+ kmalloc_array
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kmalloc
+ kmalloc_array
(
- E1 * E2
+ E1, E2
, ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>


# 87a3002a 26-May-2018 Al Viro <viro@zeniv.linux.org.uk>

vmsplice(): lift importing iovec into vmsplice(2) and compat counterpart

... getting rid of transformations in the latter - just use
compat_import_iovec().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 30cfe4ef 17-Mar-2018 Dominik Brodowski <linux@dominikbrodowski.net>

fs: add do_vmsplice() helper; remove in-kernel call to syscall

Using the fs-internal do_vmsplice() helper allows us to get rid of the
fs-internal call to the sys_vmsplice() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>


# 6aa7de05 23-Oct-2017 Mark Rutland <mark.rutland@arm.com>

locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()

Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.

For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.

However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:

----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()

// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch

virtual patch

@ depends on patch @
expression E1, E2;
@@

- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)

@ depends on patch @
expression E;
@@

- ACCESS_ONCE(E)
+ READ_ONCE(E)
----

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# ac452aca 01-Sep-2017 Christoph Hellwig <hch@lst.de>

fs: move kernel_write to fs/read_write.c

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# abbb6589 27-May-2017 Christoph Hellwig <hch@lst.de>

fs: implement vfs_iter_write using do_iter_write

De-dupliate some code and allow for passing the flags argument to
vfs_iter_write. Additionally it now properly updates timestamps.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 174cd4b1 02-Feb-2017 Ingo Molnar <mingo@kernel.org>

sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>

Fix up affected files that include this signal functionality via sched.h.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# bb7462b6 20-Feb-2017 Miklos Szeredi <mszeredi@redhat.com>

vfs: use helpers for calling f_op->{read,write}_iter()

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>


# 5a81e6a1 16-Feb-2017 Miklos Szeredi <mszeredi@redhat.com>

vfs: fix uninitialized flags in splice_to_pipe()

Flags (PIPE_BUF_FLAG_PACKET, PIPE_BUF_FLAG_GIFT) could remain on the
unused part of the pipe ring buffer. Previously splice_to_pipe() left
the flags value alone, which could result in incorrect behavior.

Uninitialized flags appears to have been there from the introduction of
the splice syscall.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # 2.6.17+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 13c0f52b 10-Dec-2016 Al Viro <viro@zeniv.linux.org.uk>

make nr_pages calculation in default_file_splice_read() a bit less ugly

It's an artifact of lousy calling conventions of iov_iter_get_pages_alloc().
Hopefully, we'll get something saner come next cycle; for now that'll
do.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 3d6ea290 10-Dec-2016 Al Viro <viro@zeniv.linux.org.uk>

splice/tee/vmsplice: validate flags

Long overdue...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 23c832b1 12-Oct-2016 Al Viro <viro@zeniv.linux.org.uk>

remove spd_release_page()

no users left

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 52bce911 21-Dec-2016 Linus Torvalds <torvalds@linux-foundation.org>

splice: reinstate SIGPIPE/EPIPE handling

Commit 8924feff66f3 ("splice: lift pipe_lock out of splice_to_pipe()")
caused a regression when there were no more readers left on a pipe that
was being spliced into: rather than the expected SIGPIPE and -EPIPE
return value, the writer would end up waiting forever for space to free
up (which obviously was not going to happen with no readers around).

Fixes: 8924feff66f3 ("splice: lift pipe_lock out of splice_to_pipe()")
Reported-and-tested-by: Andreas Schwab <schwab@linux-m68k.org>
Debugged-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org # v4.9
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8e54cada 26-Nov-2016 Al Viro <viro@zeniv.linux.org.uk>

fix default_file_splice_read()

Botched calculation of number of pages. As the result,
we were dropping pieces when doing splice to pipe from
e.g. 9p.

Reported-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e519e777 10-Nov-2016 Al Viro <viro@zeniv.linux.org.uk>

splice: remove detritus from generic_file_splice_read()

i_size check is a leftover from the horrors that used to play with
the page cache in that function. With the switch to ->read_iter(),
it's neither needed nor correct - for gfs2 it ends up being buggy,
since i_size is not guaranteed to be correct until later (inside
->read_iter()).

Spotted-by: Abhi Das <adas@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# be297968 01-Nov-2016 Christoph Hellwig <hch@lst.de>

mm: only include blk_types in swap.h if CONFIG_SWAP is enabled

It's only needed for the CONFIG_SWAP-only use of bio_end_io_t.

Because CONFIG_SWAP implies CONFIG_BLOCK this will allow to drop some
ifdefs in blk_types.h.

Instead we'll need to add a few explicit includes that were implicit
before, though.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# c3a69024 10-Oct-2016 Al Viro <viro@zeniv.linux.org.uk>

fix ITER_PIPE interaction with direct_IO

by making sure we call iov_iter_advance() on original
iov_iter even if direct_IO (done on its copy) has returned 0.
It's a no-op for old iov_iter flavours and does the right thing
(== truncation of the stuff we'd allocated, but not filled) in
ITER_PIPE case. Failures (e.g. -EIO) get caught and dealt with
by cleanup in generic_file_read_iter().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# fba597db 27-Sep-2016 Miklos Szeredi <mszeredi@redhat.com>

pipe: add pipe_buf_confirm() helper

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# a779638c 27-Sep-2016 Miklos Szeredi <mszeredi@redhat.com>

pipe: add pipe_buf_release() helper

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 7bf2d1df 27-Sep-2016 Miklos Szeredi <mszeredi@redhat.com>

pipe: add pipe_buf_get() helper

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 523ac9af 23-Sep-2016 Al Viro <viro@zeniv.linux.org.uk>

switch default_file_splice_read() to use of pipe-backed iov_iter

we only use iov_iter_get_pages_alloc() and iov_iter_advance() -
pages are filled by kernel_readv() via a kvec array (as we used
to do all along), so iov_iter here is used only as a way of
arranging for those pages to be in pipe.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 82c156f8 22-Sep-2016 Al Viro <viro@zeniv.linux.org.uk>

switch generic_file_splice_read() to use of ->read_iter()

... and kill the ->splice_read() instances that can be switched to it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 241699cd 22-Sep-2016 Al Viro <viro@zeniv.linux.org.uk>

new iov_iter flavour: pipe-backed

iov_iter variant for passing data into pipe. copy_to_iter()
copies data into page(s) it has allocated and stuffs them into
the pipe; copy_page_to_iter() stuffs there a reference to the
page given to it. Both will try to coalesce if possible.
iov_iter_zero() is similar to copy_to_iter(); iov_iter_get_pages()
and friends will do as copy_to_iter() would have and return the
pages where the data would've been copied. iov_iter_advance()
will truncate everything past the spot it has advanced to.

New primitive: iov_iter_pipe(), used for initializing those.
pipe should be locked all along.

Running out of space acts as fault would for iovec-backed ones;
in other words, giving it to ->read_iter() may result in short
read if the pipe overflows, or -EFAULT if it happens with nothing
copied there.

In other words, ->read_iter() on those acts pretty much like
->splice_read(). Moreover, all generic_file_splice_read() users,
as well as many other ->splice_read() instances can be switched
to that scheme - that'll happen in the next commit.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 79fddc4e 17-Sep-2016 Al Viro <viro@zeniv.linux.org.uk>

new helper: add_to_pipe()

single-buffer analogue of splice_to_pipe(); vmsplice_to_pipe() switched
to that, leaving splice_to_pipe() only for ->splice_read() instances
(and that only until they are converted as well).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8924feff 17-Sep-2016 Al Viro <viro@zeniv.linux.org.uk>

splice: lift pipe_lock out of splice_to_pipe()

* splice_to_pipe() stops at pipe overflow and does *not* take pipe_lock
* ->splice_read() instances do the same
* vmsplice_to_pipe() and do_splice() (ultimate callers of splice_to_pipe())
arrange for waiting, looping, etc. themselves.

That should make pipe_lock the outermost one.

Unfortunately, existing rules for the amount passed by vmsplice_to_pipe()
and do_splice() are quite ugly _and_ userland code can be easily broken
by changing those. It's not even "no more than the maximal capacity of
this pipe" - it's "once we'd fed pipe->nr_buffers pages into the pipe,
leave instead of waiting".

Considering how poorly these rules are documented, let's try "wait for some
space to appear, unless given SPLICE_F_NONBLOCK, then push into pipe
and if we run into overflow, we are done".

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# db85a9eb 17-Sep-2016 Al Viro <viro@zeniv.linux.org.uk>

splice: switch get_iovec_page_array() to iov_iter

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e7c3c646 17-Sep-2016 Al Viro <viro@zeniv.linux.org.uk>

splice_to_pipe(): don't open-code wakeup_pipe_readers()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 09cbfeaf 01-Apr-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized. And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are
not.

The changes are pretty straight-forward:

- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

- page_cache_get() -> get_page();

- page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 03cc0789 02-Apr-2016 Al Viro <viro@zeniv.linux.org.uk>

do_splice_to(): cap the size before passing to ->splice_read()

pipe capacity won't exceed 2G anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# d6785d91 10-Mar-2016 Rabin Vincent <rabin@rab.in>

splice: handle zero nr_pages in splice_to_pipe()

Running the following command:

busybox cat /sys/kernel/debug/tracing/trace_pipe > /dev/null

with any tracing enabled pretty very quickly leads to various NULL
pointer dereferences and VM BUG_ON()s, such as these:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
IP: [<ffffffff8119df6c>] generic_pipe_buf_release+0xc/0x40
Call Trace:
[<ffffffff811c48a3>] splice_direct_to_actor+0x143/0x1e0
[<ffffffff811c42e0>] ? generic_pipe_buf_nosteal+0x10/0x10
[<ffffffff811c49cf>] do_splice_direct+0x8f/0xb0
[<ffffffff81196869>] do_sendfile+0x199/0x380
[<ffffffff81197600>] SyS_sendfile64+0x90/0xa0
[<ffffffff8192cbee>] entry_SYSCALL_64_fastpath+0x12/0x6d

page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
kernel BUG at include/linux/mm.h:367!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
RIP: [<ffffffff8119df9c>] generic_pipe_buf_release+0x3c/0x40
Call Trace:
[<ffffffff811c48a3>] splice_direct_to_actor+0x143/0x1e0
[<ffffffff811c42e0>] ? generic_pipe_buf_nosteal+0x10/0x10
[<ffffffff811c49cf>] do_splice_direct+0x8f/0xb0
[<ffffffff81196869>] do_sendfile+0x199/0x380
[<ffffffff81197600>] SyS_sendfile64+0x90/0xa0
[<ffffffff8192cd1e>] tracesys_phase2+0x84/0x89

(busybox's cat uses sendfile(2), unlike the coreutils version)

This is because tracing_splice_read_pipe() can call splice_to_pipe()
with spd->nr_pages == 0. spd_pages underflows in splice_to_pipe() and
we fill the page pointers and the other fields of the pipe_buffers with
garbage.

All other callers of splice_to_pipe() avoid calling it when nr_pages ==
0, and we could make tracing_splice_read_pipe() do that too, but it
seems reasonable to have splice_to_page() handle this condition
gracefully.

Cc: stable@vger.kernel.org
Signed-off-by: Rabin Vincent <rabin@rab.in>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 793b80ef 03-Mar-2016 Christoph Hellwig <hch@lst.de>

vfs: pass a flags argument to vfs_readv/vfs_writev

This way we can set kiocb flags also from the sync read/write path for
the read_iter/write_iter operations. For now there is no way to pass
flags to plain read/write operations as there is no real need for that,
and all flags passed are explicitly rejected for these files.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
[hch: rebased on top of my kiocb changes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stephen Bates <stephen.bates@pmcs.com>
Tested-by: Stephen Bates <stephen.bates@pmcs.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 90330e68 18-Dec-2015 Abhi Das <adas@redhat.com>

fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE

During testing, I discovered that __generic_file_splice_read() returns
0 (EOF) when aops->readpage fails with AOP_TRUNCATED_PAGE on the first
page of a single/multi-page splice read operation. This EOF return code
causes the userspace test to (correctly) report a zero-length read error
when it was expecting otherwise.

The current strategy of returning a partial non-zero read when ->readpage
returns AOP_TRUNCATED_PAGE works only when the failed page is not the
first of the lot being processed.

This patch attempts to retry lookup and call ->readpage again on pages
that had previously failed with AOP_TRUNCATED_PAGE. With this patch, my
tests pass and I haven't noticed any unwanted side effects.

This version removes the thrice-retry loop and instead indefinitely
retries lookups on AOP_TRUNCATED_PAGE errors from ->readpage. This
behavior is now similar to do_generic_file_read().

Signed-off-by: Abhi Das <adas@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c2489e07 23-Nov-2015 Jan Kara <jack@suse.cz>

vfs: Avoid softlockups with sendfile(2)

The following test program from Dmitry can cause softlockups or RCU
stalls as it copies 1GB from tmpfs into eventfd and we don't have any
scheduling point at that path in sendfile(2) implementation:

int r1 = eventfd(0, 0);
int r2 = memfd_create("", 0);
unsigned long n = 1<<30;
fallocate(r2, 0, 0, n);
sendfile(r1, r2, 0, n);

Add cond_resched() into __splice_from_pipe() to fix the problem.

CC: Dmitry Vyukov <dvyukov@google.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c725bfce 23-Nov-2015 Jan Kara <jack@suse.cz>

vfs: Make sendfile(2) killable even better

Commit 296291cdd162 (mm: make sendfile(2) killable) fixed an issue where
sendfile(2) was doing a lot of tiny writes into a filesystem and thus
was unkillable for a long time. However sendfile(2) can be (mis)used to
issue lots of writes into arbitrary file descriptor such as evenfd or
similar special file descriptors which never hit the standard filesystem
write path and thus are still unkillable. E.g. the following example
from Dmitry burns CPU for ~16s on my test system without possibility to
be killed:

int r1 = eventfd(0, 0);
int r2 = memfd_create("", 0);
unsigned long n = 1<<30;
fallocate(r2, 0, 0, n);
sendfile(r1, r2, 0, n);

There are actually quite a few tests for pending signals in sendfile
code however we data to write is always available none of them seems to
trigger. So fix the problem by adding a test for pending signal into
splice_from_pipe_next() also before the loop waiting for pipe buffers to
be available. This should fix all the lockup issues with sendfile of the
do-ton-of-tiny-writes nature.

CC: stable@vger.kernel.org
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# c62d2555 06-Nov-2015 Michal Hocko <mhocko@suse.com>

mm, fs: introduce mapping_gfp_constraint()

There are many places which use mapping_gfp_mask to restrict a more
generic gfp mask which would be used for allocations which are not
directly related to the page cache but they are performed in the same
context.

Let's introduce a helper function which makes the restriction explicit and
easier to track. This patch doesn't introduce any functional changes.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6afdb859 24-Jun-2015 Michal Hocko <mhocko@suse.cz>

mm: do not ignore mapping_gfp_mask in page cache allocation paths

page_cache_read, do_generic_file_read, __generic_file_splice_read and
__ntfs_grab_cache_pages currently ignore mapping_gfp_mask when calling
add_to_page_cache_lru which might cause recursion into fs down in the
direct reclaim path if the mapping really relies on GFP_NOFS semantic.

This doesn't seem to be the case now because page_cache_read (page fault
path) doesn't seem to suffer from the reclaim recursion issues and
do_generic_file_read and __generic_file_splice_read also shouldn't be
called under fs locks which would deadlock in the reclaim path. Anyway it
is better to obey mapping gfp mask and prevent from later breakage.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2b514574 21-May-2015 Hannes Frederic Sowa <hannes@stressinduktion.org>

net: af_unix: implement splice for stream af_unix sockets

unix_stream_recvmsg is refactored to unix_stream_read_generic in this
patch and enhanced to deal with pipe splicing. The refactoring is
inneglible, we mostly have to deal with a non-existing struct msghdr
argument.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 0ff28d9f 06-May-2015 Christophe Leroy <christophe.leroy@c-s.fr>

splice: sendfile() at once fails for big files

Using sendfile with below small program to get MD5 sums of some files,
it appear that big files (over 64kbytes with 4k pages system) get a
wrong MD5 sum while small files get the correct sum.
This program uses sendfile() to send a file to an AF_ALG socket
for hashing.

/* md5sum2.c */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <linux/if_alg.h>

int main(int argc, char **argv)
{
int sk = socket(AF_ALG, SOCK_SEQPACKET, 0);
struct stat st;
struct sockaddr_alg sa = {
.salg_family = AF_ALG,
.salg_type = "hash",
.salg_name = "md5",
};
int n;

bind(sk, (struct sockaddr*)&sa, sizeof(sa));

for (n = 1; n < argc; n++) {
int size;
int offset = 0;
char buf[4096];
int fd;
int sko;
int i;

fd = open(argv[n], O_RDONLY);
sko = accept(sk, NULL, 0);
fstat(fd, &st);
size = st.st_size;
sendfile(sko, fd, &offset, size);
size = read(sko, buf, sizeof(buf));
for (i = 0; i < size; i++)
printf("%2.2x", buf[i]);
printf(" %s\n", argv[n]);
close(fd);
close(sko);
}
exit(0);
}

Test below is done using official linux patch files. First result is
with a software based md5sum. Second result is with the program above.

root@vgoip:~# ls -l patch-3.6.*
-rw-r--r-- 1 root root 64011 Aug 24 12:01 patch-3.6.2.gz
-rw-r--r-- 1 root root 94131 Aug 24 12:01 patch-3.6.3.gz

root@vgoip:~# md5sum patch-3.6.*
b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz

root@vgoip:~# ./md5sum2 patch-3.6.*
b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
5fd77b24e68bb24dcc72d6e57c64790e patch-3.6.3.gz

After investivation, it appears that sendfile() sends the files by blocks
of 64kbytes (16 times PAGE_SIZE). The problem is that at the end of each
block, the SPLICE_F_MORE flag is missing, therefore the hashing operation
is reset as if it was the end of the file.

This patch adds SPLICE_F_MORE to the flags when more data is pending.

With the patch applied, we get the correct sums:

root@vgoip:~# md5sum patch-3.6.*
b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz

root@vgoip:~# ./md5sum2 patch-3.6.*
b3ffb9848196846f31b2ff133d2d6443 patch-3.6.2.gz
c5e8f687878457db77cb7158c38a7e43 patch-3.6.3.gz

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Jens Axboe <axboe@fb.com>


# be64f884 15-Apr-2015 Boaz Harrosh <boaz@plexistor.com>

dax: unify ext2/4_{dax,}_file_operations

The original dax patchset split the ext2/4_file_operations because of the
two NULL splice_read/splice_write in the dax case.

In the vfs if splice_read/splice_write are NULL we then call
default_splice_read/write.

What we do here is make generic_file_splice_read aware of IS_DAX() so the
original ext2/4_file_operations can be used as is.

For write it appears that iter_file_splice_write is just fine. It uses
the regular f_op->write(file,..) or new_sync_write(file, ...).

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 345995fa 21-Mar-2015 Al Viro <viro@zeniv.linux.org.uk>

vmsplice_to_user(): switch to import_iovec()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# e2e40f2c 22-Feb-2015 Christoph Hellwig <hch@lst.de>

fs: move struct kiocb to fs.h

struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# dbe4e192 25-Jan-2015 Christoph Hellwig <hch@lst.de>

fs: add vfs_iter_{read,write} helpers

Simple helpers that pass an arbitrary iov_iter to filesystems.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 05afcb77 22-Jan-2015 Al Viro <viro@zeniv.linux.org.uk>

new helper: iov_iter_bvec()

similar to iov_iter_kvec(), for ITER_BVEC ones

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 1c118596 23-Oct-2014 Miklos Szeredi <mszeredi@suse.cz>

vfs: export do_splice_direct() to modules

Export do_splice_direct() to modules. Needed by overlay filesystem.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>


# 5f073850 05-Apr-2014 Al Viro <viro@zeniv.linux.org.uk>

kill generic_file_splice_write()

no callers left

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 96f9bc8f 05-Apr-2014 Al Viro <viro@zeniv.linux.org.uk>

fs/splice.c: remove unneeded exports

ocfs2 was using a bunch of splice.c guts...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8d020765 05-Apr-2014 Al Viro <viro@zeniv.linux.org.uk>

->splice_write() via ->write_iter()

iter_file_splice_write() - a ->splice_write() instance that gathers the
pipe buffers, builds a bio_vec-based iov_iter covering those and feeds
it to ->write_iter(). A bunch of simple cases coverted to that...

[AV: fixed the braino spotted by Cyrill]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b6dd6f47 27-May-2014 Miklos Szeredi <mszeredi@suse.cz>

vfs: fix vmplice_to_user()

Commit 6130f5315ee8 "switch vmsplice_to_user() to copy_page_to_iter()" in
v3.15-rc1 broke vmsplice(2).

This patch fixes two bugs:

- count is not initialized to a proper value, which resulted in no data
being copied

- if rw_copy_check_uvector() returns negative then the iov might be leaked.

Tested OK.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 71d8e532 05-Mar-2014 Al Viro <viro@zeniv.linux.org.uk>

start adding the tag to iov_iter

For now, just use the same thing we pass to ->direct_IO() - it's all
iovec-based at the moment. Pass it explicitly to iov_iter_init() and
account for kvec vs. iovec in there, by the same kludge NFS ->direct_IO()
uses.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6130f531 03-Feb-2014 Al Viro <viro@zeniv.linux.org.uk>

switch vmsplice_to_user() to copy_page_to_iter()

I've switched the sanity checks on iovec to rw_copy_check_uvector();
we might need to do a local analog, if any behaviour differences are
not actually bugfixes here...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# fbb32750 02-Feb-2014 Al Viro <viro@zeniv.linux.org.uk>

pipe: kill ->map() and ->unmap()

all pipe_buffer_operations have the same instances of those...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 28a625cb 22-Jan-2014 Miklos Szeredi <mszeredi@suse.cz>

fuse: fix pipe_buf_operations

Having this struct in module memory could Oops when if the module is
unloaded while the buffer still persists in a pipe.

Since sock_pipe_buf_ops is essentially the same as fuse_dev_pipe_buf_steal
merge them into nosteal_pipe_buf_ops (this is the same as
default_pipe_buf_ops except stealing the page from the buffer is not
allowed).

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org


# 72c2d531 22-Sep-2013 Al Viro <viro@zeniv.linux.org.uk>

file->f_op is never NULL...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 18c67cb9 19-Jun-2013 Al Viro <viro@zeniv.linux.org.uk>

splice: lift checks from do_splice_from() into callers

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 50cd2c57 23-May-2013 Al Viro <viro@zeniv.linux.org.uk>

lift file_*_write out of do_splice_direct()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 500368f7 23-May-2013 Al Viro <viro@zeniv.linux.org.uk>

lift file_*_write out of do_splice_from()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# acdb37c3 22-Jun-2013 Randy Dunlap <rdunlap@infradead.org>

fs: fix new splice.c kernel-doc warning

Fix new kernel-doc warning in fs/splice.c:

Warning(fs/splice.c:1298): No description found for parameter 'opos'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 7995bd28 20-Jun-2013 Al Viro <viro@zeniv.linux.org.uk>

splice: don't pass the address of ->f_pos to methods

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 7bee130e 21-Mar-2013 Al Viro <viro@zeniv.linux.org.uk>

get rid of alloc_pipe_info() argument

not used anymore

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 6447a3cf 21-Mar-2013 Al Viro <viro@zeniv.linux.org.uk>

get rid of pipe->inode

it's used only as a flag to distinguish normal pipes/FIFOs from the
internal per-task one used by file-to-file splice. And pipe->files
would work just as well for that purpose...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 2dd8c9ad 20-Mar-2013 Al Viro <viro@zeniv.linux.org.uk>

lift sb_start_write out of ->splice_write()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 17338fcc 20-Mar-2013 Al Viro <viro@zeniv.linux.org.uk>

lift sb_start_write into default_file_splice_write()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 03d95eb2 20-Mar-2013 Al Viro <viro@zeniv.linux.org.uk>

lift sb_start_write() out of ->write()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 06ae43f3 20-Mar-2013 Al Viro <viro@zeniv.linux.org.uk>

Don't bother with redoing rw_verify_area() from default_file_splice_from()

default_file_splice_from() ends up calling vfs_write() (via very convoluted
callchain). It's an overkill, since we already have done rw_verify_area()
in the caller by the time we call vfs_write() we are under set_fs(KERNEL_DS),
so access_ok() is also pointless. Add a new helper (__kernel_write()),
use it instead of kernel_write() in there.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 76b021d0 02-Mar-2013 Al Viro <viro@zeniv.linux.org.uk>

convert vmsplice to COMPAT_SYSCALL_DEFINE

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 7bb307e8 23-Feb-2013 Al Viro <viro@zeniv.linux.org.uk>

export kernel_write(), convert open-coded instances

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 496ad9aa 23-Jan-2013 Al Viro <viro@zeniv.linux.org.uk>

new helper: file_inode(file)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ae62ca7b 06-Jan-2013 Eric Dumazet <edumazet@google.com>

tcp: fix MSG_SENDPAGE_NOTLAST logic

commit 35f9c09fe9c72e (tcp: tcp_sendpages() should call tcp_push() once)
added an internal flag : MSG_SENDPAGE_NOTLAST meant to be set on all
frags but the last one for a splice() call.

The condition used to set the flag in pipe_to_sendpage() relied on
splice() user passing the exact number of bytes present in the pipe,
or a smaller one.

But some programs pass an arbitrary high value, and the test fails.

The effect of this bug is a lack of tcp_push() at the end of a
splice(pipe -> socket) call, and possibly very slow or erratic TCP
sessions.

We should both test sd->total_len and fact that another fragment
is in the pipe (pipe->nrbufs > 1)

Many thanks to Willy for providing very clear bug report, bisection
and test programs.

Reported-by: Willy Tarreau <w@1wt.eu>
Bisected-by: Willy Tarreau <w@1wt.eu>
Tested-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# d0e1d66b 11-Dec-2012 Namjae Jeon <linkinjeon@gmail.com>

writeback: remove nr_pages_dirtied arg from balance_dirty_pages_ratelimited_nr()

There is no reason to pass the nr_pages_dirtied argument, because
nr_pages_dirtied value from the caller is unused in
balance_dirty_pages_ratelimited_nr().

Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Signed-off-by: Vivek Trivedi <vtrivedi018@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2903ff01 27-Aug-2012 Al Viro <viro@zeniv.linux.org.uk>

switch simple cases of fget_light to fdget

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 14da9200 12-Jun-2012 Jan Kara <jack@suse.cz>

fs: Protect write paths by sb_start_write - sb_end_write

There are several entry points which dirty pages in a filesystem. mmap
(handled by block_page_mkwrite()), buffered write (handled by
__generic_file_aio_write()), splice write (generic_file_splice_write),
truncate, and fallocate (these can dirty last partial page - handled inside
each filesystem separately). Protect these places with sb_start_write() and
sb_end_write().

->page_mkwrite() calls are particularly complex since they are called with
mmap_sem held and thus we cannot use standard sb_start_write() due to lock
ordering constraints. We solve the problem by using a special freeze protection
sb_start_pagefault() which ranks below mmap_sem.

BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 047fe360 12-Jun-2012 Eric Dumazet <edumazet@google.com>

splice: fix racy pipe->buffers uses

Dave Jones reported a kernel BUG at mm/slub.c:3474! triggered
by splice_shrink_spd() called from vmsplice_to_pipe()

commit 35f3d14dbbc5 (pipe: add support for shrinking and growing pipes)
added capability to adjust pipe->buffers.

Problem is some paths don't hold pipe mutex and assume pipe->buffers
doesn't change for their duration.

Fix this by adding nr_pages_max field in struct splice_pipe_desc, and
use it in place of pipe->buffers where appropriate.

splice_shrink_spd() loses its struct pipe_inode_info argument.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Tom Herbert <therbert@google.com>
Cc: stable <stable@vger.kernel.org> # 2.6.35
Tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c3b2da31 26-Mar-2012 Josef Bacik <josef@redhat.com>

fs: introduce inode operation ->update_time

Btrfs has to make sure we have space to allocate new blocks in order to modify
the inode, so updating time can fail. We've gotten around this by having our
own file_update_time but this is kind of a pain, and Christoph has indicated he
would like to make xfs do something different with atime updates. So introduce
->update_time, where we will deal with i_version an a/m/c time updates and
indicate which changes need to be made. The normal version just does what it
has always done, updates the time and marks the inode dirty, and then
filesystems can choose to do something different.

I've gone through all of the users of file_update_time and made them check for
errors with the exception of the fault code since it's complicated and I wasn't
quite sure what to do there, also Jan is going to be pushing the file time
updates into page_mkwrite for those who have it so that should satisfy btrfs and
make it not a big deal to check the file_update_time() return code in the
generic fault path. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>


# bd1a68b5 04-Apr-2012 Eric Dumazet <eric.dumazet@gmail.com>

vmsplice: relax alignement requirements for SPLICE_F_GIFT

It seems there is no fundamental reason to limit vmsplice()
SPLICE_F_GIFT to page aligned chunks.

All helpers are prepared to cope with offsets in page.

This limitation makes vmsplice() API very impractical in the zero-copy
land.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Miller <davem@davemloft.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Changli Gao <xiaosuo@gmail.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 35f9c09f 04-Apr-2012 Eric Dumazet <eric.dumazet@gmail.com>

tcp: tcp_sendpages() should call tcp_push() once

commit 2f533844242 (tcp: allow splice() to build full TSO packets) added
a regression for splice() calls using SPLICE_F_MORE.

We need to call tcp_flush() at the end of the last page processed in
tcp_sendpages(), or else transmits can be deferred and future sends
stall.

Add a new internal flag, MSG_SENDPAGE_NOTLAST, acting like MSG_MORE, but
with different semantic.

For all sendpage() providers, its a transparent change. Only
sock_sendpage() and tcp_sendpages() can differentiate the two different
flags provided by pipe_to_sendpage()

Reported-by: Tom Herbert <therbert@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail>com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e8e3c3d6 25-Nov-2011 Cong Wang <amwang@redhat.com>

fs: remove the second argument of k[un]map_atomic()

Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Cong Wang <amwang@redhat.com>


# 630d9c47 16-Nov-2011 Paul Gortmaker <paul.gortmaker@windriver.com>

fs: reduce the use of module.h wherever possible

For files only using THIS_MODULE and/or EXPORT_SYMBOL, map
them onto including export.h -- or if the file isn't even
using those, then just delete the include. Fix up any implicit
include dependencies that were being masked by module.h along
the way.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>


# ff01bb48 16-Sep-2011 Al Viro <viro@zeniv.linux.org.uk>

fs: move code out of buffer.c

Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export
kill_bdev as well, so brd doesn't have to open code it. Reduce
buffer_head.h requirement accordingly.

Removed a rather large comment from invalidate_bdev, as it looked a bit
obsolete to bother moving. The small comment replacing it says enough.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 708e3508 25-Jul-2011 Hugh Dickins <hughd@google.com>

tmpfs: clone shmem_file_splice_read()

Copy __generic_file_splice_read() and generic_file_splice_read() from
fs/splice.c to shmem_file_splice_read() in mm/shmem.c. Make
page_cache_pipe_buf_ops and spd_release_page() accessible to it.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 825cdcb1 23-May-2011 Namhyung Kim <namhyung@gmail.com>

splice: add wakeup_pipe_readers()

Add and use wakeup_pipe_readers() to consolidate duplicated codes.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# a8adbe37 17-Dec-2010 Michał Mirosław <mirq-linux@rere.qmqm.pl>

fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors

This patch pulls calls to buf->ops->confirm() from all actors passed
(also indirectly) to splice_from_pipe_feed().

Is avoiding the call to buf->ops->confirm() while splice()ing to
/dev/null is an intentional optimization? No other user does that
and this will remove this special case.

Against current linux.git 6313e3c21743cc88bb5bd8aa72948ee1e83937b6.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# c66fb347 28-Nov-2010 Linus Torvalds <torvalds@linux-foundation.org>

Export 'get_pipe_info()' to other users

And in particular, use it in 'pipe_fcntl()'.

The other pipe functions do not need to use the 'careful' version, since
they are only ever called for things that are already known to be pipes.

The normal read/write/ioctl functions are called through the file
operations structures, so if a file isn't a pipe, they'd never get
called. But pipe_fcntl() is special, and called directly from the
generic fcntl code, and needs to use the same careful function that the
splice code is using.

Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 71993e62 28-Nov-2010 Linus Torvalds <torvalds@linux-foundation.org>

Rename 'pipe_info()' to 'get_pipe_info()'

.. and change it to take the 'file' pointer instead of an inode, since
that's what all users want anyway.

The renaming is preparatory to exporting it to other users. The old
'pipe_info()' name was too generic and is already used elsewhere, so
before making the function public we need to use a more specific name.

Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6965031d 02-Aug-2010 Miklos Szeredi <miklos@szeredi.hu>

splice: fix misuse of SPLICE_F_NONBLOCK

SPLICE_F_NONBLOCK is clearly documented to only affect blocking on the
pipe. In __generic_file_splice_read(), however, it causes an EAGAIN
if the page is currently being read.

This makes it impossible to write an application that only wants
failure if the pipe is full. For example if the same process is
handling both ends of a pipe and isn't otherwise able to determine
whether a splice to the pipe will fill it or not.

We could make the read non-blocking on O_NONBLOCK or some other splice
flag, but for now this is the simplest fix.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 1676effc 21-Jun-2010 Andi Kleen <andi@firstfloor.org>

gcc-4.6: fs: fix unused but set warnings

No real bugs I believe, just some dead code, and some
shut up code.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 19c9a49b 29-Jun-2010 Changli Gao <xiaosuo@gmail.com>

splice: check f_mode for seekable file

check f_mode for seekable file

As a seekable file is allowed without a llseek function, so the old way isn't
work any more.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
----
fs/splice.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 2cb4b05e 29-Jun-2010 Changli Gao <xiaosuo@gmail.com>

splice: direct_splice_actor() should not use pos in sd

direct_splice_actor() shouldn't use sd->pos, as sd->pos is for file reading,
file->f_pos should be used instead.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
----
fs/splice.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


# 0ae0b5d0 25-May-2010 Nick Piggin <npiggin@suse.de>

fs/splice.c: fix mapping_gfp_mask usage

mapping_gfp_mask() is not supposed to store allocation contex details,
only page location details. So mapping_gfp_mask should be applied to the
pagecache page allocation, wheras normal (kernel mapped) memory should be
used for surrounding allocations such as radix-tree nodes allocated by
add_to_page_cache. Context modifiers should be applied on a per-callsite
basis.

So change splice to follow this convention (which is followed in similar
code patterns in core code).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 35f3d14d 20-May-2010 Jens Axboe <jens.axboe@oracle.com>

pipe: add support for shrinking and growing pipes

This patch adds F_GETPIPE_SZ and F_SETPIPE_SZ fcntl() actions for
growing and shrinking the size of a pipe and adjusts pipe.c and splice.c
(and relay and network splice) usage to work with these larger (or smaller)
pipes.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 5a0e3ad6 24-Mar-2010 Tejun Heo <tj@kernel.org>

include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>


# cc56f7de 04-Nov-2009 Changli Gao <xiaosuo@gmail.com>

sendfile(): check f_op.splice_write() rather than f_op.sendpage()

sendfile(2) was reworked with the splice infrastructure, but it still
checks f_op.sendpage() instead of f_op.splice_write() wrongly. Although
if f_op.sendpage() exists, f_op.splice_write() always exists at the same
time currently, the assumption will be broken in future silently. This
patch also brings a side effect: sendfile(2) can work with any output
file. Some security checks related to f_op are added too.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 148f948b 17-Aug-2009 Jan Kara <jack@suse.cz>

vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode

Introduce new function for generic inode syncing (vfs_fsync_range) and use
it from fsync() path. Introduce also new helper for syncing after a sync
write (generic_write_sync) using the generic function.

Use these new helpers for syncing from generic VFS functions. This makes
O_SYNC writes to block devices acquire i_mutex for syncing. If we really
care about this, we can make block_fsync() drop the i_mutex and reacquire
it before it returns.

CC: Evgeniy Polyakov <zbr@ioremap.net>
CC: ocfs2-devel@oss.oracle.com
CC: Joel Becker <joel.becker@oracle.com>
CC: Felix Blyakher <felixb@sgi.com>
CC: xfs@oss.sgi.com
CC: Anton Altaparmakov <aia21@cantab.net>
CC: linux-ntfs-dev@lists.sourceforge.net
CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
CC: linux-ext4@vger.kernel.org
CC: tytso@mit.edu
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>


# 723590ed 15-Aug-2009 Miklos Szeredi <mszeredi@suse.cz>

splice: update mtime and atime on files

Splice should update the modification and access times on regular
files just like read and write. Not updating mtime will confuse
backup tools, etc...

This patch only adds the time updates for regular files. For pipes
and other special files that splice touches the need for updating the
times is less clear. Let's discuss and fix that separately.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# b2858d7d 19-May-2009 Miklos Szeredi <mszeredi@suse.cz>

splice: fix kmaps in default_file_splice_write()

Unfortunately multiple kmap() within a single thread are deadlockable,
so writing out multiple buffers with writev() isn't possible.

Change the implementation so that it does a separate write() for each
buffer. This actually simplifies the code a lot since the
splice_from_pipe() helper can be used.

This limitation is caused by HIGHMEM pages, and so only affects a
subset of architectures and configurations. In the future it may be
worth to implement default_file_splice_write() in a more efficient way
on configs that allow it.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 77f6bf57 14-May-2009 Andrew Morton <akpm@linux-foundation.org>

splice: fix error return code

fs/splice.c: In function 'default_file_splice_read':
fs/splice.c:566: warning: 'error' may be used uninitialized in this function

which is sort-of true. The code will in fact return -ENOMEM instead of the
kernel_readv() return value.

Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 4f231228 13-May-2009 Jens Axboe <jens.axboe@oracle.com>

splice: fix repeated kmap()'s in default_file_splice_read()

We cannot reliably map more than one page at the time, or we risk
deadlocking. Just allocate the pages from low mem instead.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 0b0a47f5 07-May-2009 Miklos Szeredi <miklos@szeredi.hu>

splice: implement default splice_write method

If f_op->splice_write() is not implemented, fall back to a plain write.
Use vfs_writev() to write from the pipe buffers.

This will allow splice on all filesystems and file types. This
includes "direct_io" files in fuse which bypass the page cache.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 6818173b 07-May-2009 Miklos Szeredi <miklos@szeredi.hu>

splice: implement default splice_read method

If f_op->splice_read() is not implemented, fall back to a plain read.
Use vfs_readv() to read into previously allocated pages.

This will allow splice and functions using splice, such as the loop
device, to work on all filesystems. This includes "direct_io" files
in fuse which bypass the page cache.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 7c77f0b3 07-May-2009 Miklos Szeredi <miklos@szeredi.hu>

splice: implement pipe to pipe splicing

Allow splice(2) to work when both the input and the output is a pipe.

Based on the impementation of the tee(2) syscall, but instead of
duplicating the buffer references move the buffers from the input pipe
to the output pipe.

Moving the whole buffer only succeeds if the full length of the buffer
is spliced. Otherwise duplicate the buffer, just like tee(2), set the
length of the output buffer and advance the offset on the input
buffer.

Since splice is operating on two pipes, special care needs to be taken
with locking to prevent AN ABBA deadlock. Again this is done
similarly to the tee(2) syscall, first preparing the input and output
pipes so there's data to consume and space for that data, and then
doing the move operation while holding both locks.

If other processes are doing I/O on the same pipes parallel to the
splice, then by the time both inodes are locked there might be no
buffers left to move, or no space to move them to. In this case retry
the whole operation, including the preparation phase. This could lead
to starvation, but I'm not sure if that's serious enough to worry
about.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# b80901bb 16-Apr-2009 Randy Dunlap <randy.dunlap@oracle.com>

splice: fix new kernel-doc warnings

splice: fix kernel-doc warnings

Warning(fs/splice.c:617): bad line:
Warning(fs/splice.c:722): No description found for parameter 'sd'
Warning(fs/splice.c:722): Excess function parameter 'pipe' description in 'splice_from_pipe_begin'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 61e0d47c 14-Apr-2009 Miklos Szeredi <miklos@szeredi.hu>

splice: add helpers for locking pipe inode

There are lots of sequences like this, especially in splice code:

if (pipe->inode)
mutex_lock(&pipe->inode->i_mutex);
/* do something */
if (pipe->inode)
mutex_unlock(&pipe->inode->i_mutex);

so introduce helpers which do the conditional locking and unlocking.
Also replace the inode_double_lock() call with a pipe_double_lock()
helper to avoid spreading the use of this functionality beyond the
pipe code.

This patch is just a cleanup, and should cause no behavioral changes.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# f8cc774c 14-Apr-2009 Miklos Szeredi <miklos@szeredi.hu>

splice: remove generic_file_splice_write_nolock()

Remove the now unused generic_file_splice_write_nolock() function.
It's conceptually broken anyway, because splice may need to wait for
pipe events so holding locks across the whole operation is wrong.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 328eaaba 14-Apr-2009 Miklos Szeredi <miklos@szeredi.hu>

ocfs2: fix i_mutex locking in ocfs2_splice_to_file()

Rearrange locking of i_mutex on destination and call to
ocfs2_rw_lock() so locks are only held while buffers are copied with
the pipe_to_file() actor, and not while waiting for more data on the
pipe.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# eb443e5a 14-Apr-2009 Miklos Szeredi <miklos@szeredi.hu>

splice: fix i_mutex locking in generic_splice_write()

Rearrange locking of i_mutex on destination so it's only held while
buffers are copied with the pipe_to_file() actor, and not while
waiting for more data on the pipe.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 2933970b 14-Apr-2009 Miklos Szeredi <miklos@szeredi.hu>

splice: remove i_mutex locking in splice_from_pipe()

splice_from_pipe() is only called from two places:

- generic_splice_sendpage()
- splice_write_null()

Neither of these require i_mutex to be taken on the destination inode.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# b3c2d2dd 14-Apr-2009 Miklos Szeredi <miklos@szeredi.hu>

splice: split up __splice_from_pipe()

Split up __splice_from_pipe() into four helper functions:

splice_from_pipe_begin()
splice_from_pipe_next()
splice_from_pipe_feed()
splice_from_pipe_end()

splice_from_pipe_next() will wait (if necessary) for more buffers to
be added to the pipe. splice_from_pipe_feed() will feed the buffers
to the supplied actor and return when there's no more data available
(or if all of the requested data has been copied).

This is necessary so that implementations can do locking around the
non-waiting splice_from_pipe_feed().

This patch should not cause any change in behavior.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 7bfac9ec 06-Apr-2009 Miklos Szeredi <mszeredi@suse.cz>

splice: fix deadlock in splicing to file

There's a possible deadlock in generic_file_splice_write(),
splice_from_pipe() and ocfs2_file_splice_write():

- task A calls generic_file_splice_write()
- this calls inode_double_lock(), which locks i_mutex on both
pipe->inode and target inode
- ordering depends on inode pointers, can happen that pipe->inode is
locked first
- __splice_from_pipe() needs more data, calls pipe_wait()
- this releases lock on pipe->inode, goes to interruptible sleep
- task B calls generic_file_splice_write(), similarly to the first
- this locks pipe->inode, then tries to lock inode, but that is
already held by task A
- task A is interrupted, it tries to lock pipe->inode, but fails, as
it is already held by task B
- ABBA deadlock

Fix this by explicitly ordering locks: the outer lock must be on
target inode and the inner lock (which is later unlocked and relocked)
must be on pipe->inode. This is OK, pipe inodes and target inodes
form two nonoverlapping sets, generic_file_splice_write() and friends
are not called with a target which is a pipe.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 266cf658 03-Apr-2009 David Howells <dhowells@redhat.com>

FS-Cache: Recruit a page flags for cache management

Recruit a page flag to aid in cache management. The following extra flag is
defined:

(1) PG_fscache (PG_private_2)

The marked page is backed by a local cache and is pinning resources in the
cache driver.

If PG_fscache is set, then things that checked for PG_private will now also
check for that. This includes things like truncation and page invalidation.
The function page_has_private() had been added to make the checks for both
PG_private and PG_private_2 at the same time.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Daire Byrne <Daire.Byrne@framestore.com>


# 836f92ad 14-Jan-2009 Heiko Carstens <hca@linux.ibm.com>

[CVE-2009-0029] System call wrappers part 31

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>


# 08e552c6 07-Jan-2009 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

memcg: synchronized LRU

A big patch for changing memcg's LRU semantics.

Now,
- page_cgroup is linked to mem_cgroup's its own LRU (per zone).

- LRU of page_cgroup is not synchronous with global LRU.

- page and page_cgroup is one-to-one and statically allocated.

- To find page_cgroup is on what LRU, you have to check pc->mem_cgroup as
- lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);

- SwapCache is handled.

And, when we handle LRU list of page_cgroup, we do following.

pc = lookup_page_cgroup(page);
lock_page_cgroup(pc); .....................(1)
mz = page_cgroup_zoneinfo(pc);
spin_lock(&mz->lru_lock);
.....add to LRU
spin_unlock(&mz->lru_lock);
unlock_page_cgroup(pc);

But (1) is spin_lock and we have to be afraid of dead-lock with zone->lru_lock.
So, trylock() is used at (1), now. Without (1), we can't trust "mz" is correct.

This is a trial to remove this dirty nesting of locks.
This patch changes mz->lru_lock to be zone->lru_lock.
Then, above sequence will be written as

spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
mem_cgroup_add/remove/etc_lru() {
pc = lookup_page_cgroup(page);
mz = page_cgroup_zoneinfo(pc);
if (PageCgroupUsed(pc)) {
....add to LRU
}
spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU

This is much simpler.
(*) We're safe even if we don't take lock_page_cgroup(pc). Because..
1. When pc->mem_cgroup can be modified.
- at charge.
- at account_move().
2. at charge
the PCG_USED bit is not set before pc->mem_cgroup is fixed.
3. at account_move()
the page is isolated and not on LRU.

Pros.
- easy for maintenance.
- memcg can make use of laziness of pagevec.
- we don't have to duplicated LRU/Active/Unevictable bit in page_cgroup.
- LRU status of memcg will be synchronized with global LRU's one.
- # of locks are reduced.
- account_move() is simplified very much.
Cons.
- may increase cost of LRU rotation.
(no impact if memcg is not configured.)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4e02ed4b 29-Oct-2008 Nick Piggin <npiggin@suse.de>

fs: remove prepare_write/commit_write

Nothing uses prepare_write or commit_write. Remove them from the tree
completely.

[akpm@linux-foundation.org: schedule simple_prepare_write() for unexporting]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# efc968d4 09-Oct-2008 Linus Torvalds <torvalds@linux-foundation.org>

Don't allow splice() to files opened with O_APPEND

This is debatable, but while we're debating it, let's disallow the
combination of splice and an O_APPEND destination.

It's not entirely clear what the semantics of O_APPEND should be, and
POSIX apparently expects pwrite() to ignore O_APPEND, for example. So
we could make up any semantics we want, including the old ones.

But Miklos convinced me that we should at least give it some thought,
and that accepting writes at arbitrary offsets is wrong at least for
IS_APPEND() files (which always have O_APPEND set, even if the reverse
isn't true: you can obviously have O_APPEND set on a regular file).

So disallow O_APPEND entirely for now. I doubt anybody cares, and this
way we have one less gray area to worry about.

Reported-and-argued-for-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Jens Axboe <ens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 529ae9aa 01-Aug-2008 Nick Piggin <npiggin@suse.de>

mm: rename page trylock

Converting page lock to new locking bitops requires a change of page flag
operation naming, so we might as well convert it to something nicer
(!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

This also facilitates lockdeping of page lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2f1936b8 24-Jun-2008 Miklos Szeredi <mszeredi@suse.cz>

[patch 3/5] vfs: change remove_suid() to file_remove_suid()

All calls to remove_suid() are made with a file pointer, because
(similarly to file_update_time) it is called when the file is written.

Clean up callers by passing in a file instead of a dentry.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>


# bc40d73c 25-Jul-2008 Nick Piggin <npiggin@suse.de>

splice: use get_user_pages_fast

Use get_user_pages_fast in splice. This reverts some mmap_sem batching
there, however the biggest problem with mmap_sem tends to be hold times
blocking out other threads rather than cacheline bouncing. Further: on
architectures that implement get_user_pages_fast without locks, mmap_sem
can be avoided completely anyway.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 32502b84 04-Jul-2008 Miklos Szeredi <mszeredi@suse.cz>

splice: fix generic_file_splice_read() race with page invalidation

If a page was invalidated during splicing from file to a pipe, then
generic_file_splice_read() could return a short or zero count.

This manifested itself in rare I/O errors seen on nfs exported fuse
filesystems. This is because nfsd uses splice_direct_to_actor() to read
files, and fuse uses invalidate_inode_pages2() to invalidate stale data on
open.

Fix by redoing the page find/create if it was found to be truncated
(invalidated).

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# ca39d651 20-May-2008 Jens Axboe <jens.axboe@oracle.com>

splice: handle try_to_release_page() failure

splice currently assumes that try_to_release_page() always suceeds,
but it can return failure. If it does, we cannot steal the page.

Acked-by: Mingming Cao <cmm@us.ibm.com
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# a82c53a0 09-May-2008 Tom Zanussi <zanussi@comcast.net>

splice: fix sendfile() issue with relay

Splice isn't always incrementing the ppos correctly, which broke
relay splice.

Signed-off-by: Tom Zanussi <zanussi@comcast.net>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 75065ff6 08-May-2008 Jens Axboe <jens.axboe@oracle.com>

Revert "relay: fix splice problem"

This reverts commit c3270e577c18b3d0e984c3371493205a4807db9d.


# 7f3d4ee1 07-May-2008 Miklos Szeredi <mszeredi@suse.cz>

vfs: splice remove_suid() cleanup

generic_file_splice_write() duplicates remove_suid() just because it
doesn't hold i_mutex. But it grabs i_mutex inside splice_from_pipe()
anyway, so this is rather pointless.

Move locking to generic_file_splice_write() and call remove_suid() and
__splice_from_pipe() instead.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# c3270e57 23-Apr-2008 Tom Zanussi <zanussi@comcast.ne>

relay: fix splice problem

Splice isn't always incrementing the ppos correctly, which broke
relay splice.

Signed-off-by: Tom Zanussi <zanussi@comcast.net>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 8191ecd1 10-Apr-2008 Jens Axboe <jens.axboe@oracle.com>

splice: fix infinite loop in generic_file_splice_read()

There's a quirky loop in generic_file_splice_read() that could go
on indefinitely, if the file splice returns 0 permanently (and not
just as a temporary condition). Get rid of the loop and pass
back -EAGAIN correctly from __generic_file_splice_read(), so we
handle that condition properly as well.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 4cd13504 03-Apr-2008 Hugh Dickins <hugh@veritas.com>

splice: use mapping_gfp_mask

The loop block driver is careful to mask __GFP_IO|__GFP_FS out of its
mapping_gfp_mask, to avoid hangs under memory pressure. But nowadays
it uses splice, usually going through __generic_file_splice_read. That
must use mapping_gfp_mask instead of GFP_KERNEL to avoid those hangs.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 02cf01ae 20-Feb-2008 Jens Axboe <jens.axboe@oracle.com>

splice: only return -EAGAIN if there's hope of more data

sys_tee() currently is a bit eager in returning -EAGAIN, it may do so
even if we don't have a chance of anymore data becoming available. So
improve the logic and only return -EAGAIN if we have an attached writer
to the input pipe.

Reported by Johann Felix Soden <johfel@gmx.de> and
Patrick McManus <mcmanus@ducksong.com>.

Tested-by: Johann Felix Soden <johfel@users.sourceforge.net>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 712a30e6 10-Feb-2008 Bastian Blank <bastian@waldi.eu.org>

splice: fix user pointer access in get_iovec_page_array()

Commit 8811930dc74a503415b35c4a79d14fb0b408a361 ("splice: missing user
pointer access verification") added the proper access_ok() calls to
copy_from_user_mmap_sem() which ensures we can copy the struct iovecs
from userspace to the kernel.

But we also must check whether we can access the actual memory region
pointed to by the struct iovec to fix the access checks properly.

Signed-off-by: Bastian Blank <waldi@debian.org>
Acked-by: Oliver Pinter <oliver.pntr@gmail.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8811930d 08-Feb-2008 Jens Axboe <jens.axboe@oracle.com>

splice: missing user pointer access verification

vmsplice_to_user() must always check the user pointer and length
with access_ok() before copying. Likewise, for the slow path of
copy_from_user_mmap_sem() we need to check that we may read from
the user region.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Wojciech Purczynski <cliph@research.coseinc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 80848708 29-Jan-2008 Jens Axboe <jens.axboe@oracle.com>

splice: always updated atime in direct splice

Andre Majorel <aym-xunil@teaser.fr> points out that if we only updated
the atime when we transfer some data, we deviate from the standard
of always updating the atime. So change splice to always call
file_accessed() even if splice_direct_to_actor() didn't transfer
any data.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 9e97198d 29-Jan-2008 Jens Axboe <jens.axboe@oracle.com>

splice: fix problem with atime not being updated

A bug report on nfsd that states that since it was switched to use
splice instead of sendfile, the atime was no longer being updated
on the input file. do_generic_mapping_read() does this when accessing
the file, make splice do it for the direct splice handler.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# bbdfc2f7 07-Nov-2007 Jens Axboe <jens.axboe@oracle.com>

[SPLICE]: Don't assume regular pages in splice_to_pipe()

Allow caller to pass in a release function, there might be
other resources that need releasing as well. Needed for
network receive.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c43e259c 12-Jan-2008 James Morris <jmorris@namei.org>

security: call security_file_permission from rw_verify_area

All instances of rw_verify_area() are followed by a call to
security_file_permission(), so just call the latter from the former.

Acked-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>


# b5376771 17-Oct-2007 Serge E. Hallyn <serue@us.ibm.com>

Implement file posix capabilities

Implement file posix capabilities. This allows programs to be given a
subset of root's powers regardless of who runs them, without having to use
setuid and giving the binary all of root's powers.

This version works with Kaigai Kohei's userspace tools, found at
http://www.kaigai.gr.jp/index.php. For more information on how to use this
patch, Chris Friedhoff has posted a nice page at
http://www.friedhoff.org/fscaps.html.

Changelog:
Nov 27:
Incorporate fixes from Andrew Morton
(security-introduce-file-caps-tweaks and
security-introduce-file-caps-warning-fix)
Fix Kconfig dependency.
Fix change signaling behavior when file caps are not compiled in.

Nov 13:
Integrate comments from Alexey: Remove CONFIG_ ifdef from
capability.h, and use %zd for printing a size_t.

Nov 13:
Fix endianness warnings by sparse as suggested by Alexey
Dobriyan.

Nov 09:
Address warnings of unused variables at cap_bprm_set_security
when file capabilities are disabled, and simultaneously clean
up the code a little, by pulling the new code into a helper
function.

Nov 08:
For pointers to required userspace tools and how to use
them, see http://www.friedhoff.org/fscaps.html.

Nov 07:
Fix the calculation of the highest bit checked in
check_cap_sanity().

Nov 07:
Allow file caps to be enabled without CONFIG_SECURITY, since
capabilities are the default.
Hook cap_task_setscheduler when !CONFIG_SECURITY.
Move capable(TASK_KILL) to end of cap_task_kill to reduce
audit messages.

Nov 05:
Add secondary calls in selinux/hooks.c to task_setioprio and
task_setscheduler so that selinux and capabilities with file
cap support can be stacked.

Sep 05:
As Seth Arnold points out, uid checks are out of place
for capability code.

Sep 01:
Define task_setscheduler, task_setioprio, cap_task_kill, and
task_setnice to make sure a user cannot affect a process in which
they called a program with some fscaps.

One remaining question is the note under task_setscheduler: are we
ok with CAP_SYS_NICE being sufficient to confine a process to a
cpuset?

It is a semantic change, as without fsccaps, attach_task doesn't
allow CAP_SYS_NICE to override the uid equivalence check. But since
it uses security_task_setscheduler, which elsewhere is used where
CAP_SYS_NICE can be used to override the uid equivalence check,
fixing it might be tough.

task_setscheduler
note: this also controls cpuset:attach_task. Are we ok with
CAP_SYS_NICE being used to confine to a cpuset?
task_setioprio
task_setnice
sys_setpriority uses this (through set_one_prio) for another
process. Need same checks as setrlimit

Aug 21:
Updated secureexec implementation to reflect the fact that
euid and uid might be the same and nonzero, but the process
might still have elevated caps.

Aug 15:
Handle endianness of xattrs.
Enforce capability version match between kernel and disk.
Enforce that no bits beyond the known max capability are
set, else return -EPERM.
With this extra processing, it may be worth reconsidering
doing all the work at bprm_set_security rather than
d_instantiate.

Aug 10:
Always call getxattr at bprm_set_security, rather than
caching it at d_instantiate.

[morgan@kernel.org: file-caps clean up for linux/capability.h]
[bunk@kernel.org: unexport cap_inode_killpriv]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Andrew Morgan <morgan@kernel.org>
Signed-off-by: Andrew Morgan <morgan@kernel.org>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# afddba49 16-Oct-2007 Nick Piggin <npiggin@suse.de>

fs: introduce write_begin, write_end, and perform_write aops

These are intended to replace prepare_write and commit_write with more
flexible alternatives that are also able to avoid the buffered write
deadlock problems efficiently (which prepare_write is unable to do).

[mark.fasheh@oracle.com: API design contributions, code review and fixes]
[akpm@linux-foundation.org: various fixes]
[dmonakhov@sw.ru: new aop block_write_begin fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f4e6b498 16-Oct-2007 Fengguang Wu <wfg@mail.ustc.edu.cn>

readahead: combine file_ra_state.prev_index/prev_offset into prev_pos

Combine the file_ra_state members
unsigned long prev_index
unsigned int prev_offset
into
loff_t prev_pos

It is more consistent and better supports huge files.

Thanks to Peter for the nice proposal!

[akpm@linux-foundation.org: fix shift overflow]
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6866bef4 16-Oct-2007 Jens Axboe <jens.axboe@oracle.com>

splice: fix double kunmap() in vmsplice copy path

The out label should not include the unmap, the only way to jump
there already has unmapped the source.

00002000
f7c21a00 00000000 00000000 c0489036 00018e32 00000002 00000000
00001000
Call Trace:
[<c0487dd9>] pipe_to_user+0xca/0xd3
[<c0488233>] __splice_from_pipe+0x53/0x1bd
[<c0454947>] ------------[ cut here ]------------
filemap_fault+0x221/0x380
[<c0487d0f>] pipe_to_user+0x0/0xd3
[<c0489036>] sys_vmsplice+0x3b7/0x422
[<c045ec3f>] kernel BUG at mm/highmem.c:206!
handle_mm_fault+0x4d5/0x8eb
[<c041ed5b>] kmap_atomic+0x1c/0x20
[<c045d33d>] unmap_vmas+0x3d1/0x584
[<c045f717>] free_pgtables+0x90/0xa0
[<c041d84b>] pgd_dtor+0x0/0x1
[<c044d665>] audit_syscall_exit+0x2aa/0x2c6
[<c0407817>] do_syscall_trace+0x124/0x169
[<c0404df2>] syscall_call+0x7/0xb
=======================
Code: 2d 00 d0 5b 00 25 00 00 e0 ff 29 invalid opcode: 0000 [#1]
c2 89 d0 c1 e8 0c 8b 14 85 a0 6c 7c c0 4a 85 d2 89 14 85 a0 6c 7c c0 74 07
31 c9 4a 75 15 eb 04 <0f> 0b eb fe 31 c9 81 3d 78 38 6d c0 78 38 6d c0 0f
95 c1 b0 01
EIP: [<c045bbc3>] kunmap_high+0x51/0x8e SS:ESP 0068:f5960df0
SMP
Modules linked in: netconsole autofs4 hidp nfs lockd nfs_acl rfcomm l2cap
bluetooth sunrpc ipv6 ib_iser rdma_cm ib_cm iw_cmib_sa ib_mad ib_core
ib_addr iscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_multipath
dm_mod video output sbs batteryac parport_pc lp parport sg i2c_piix4
i2c_core floppy cfi_probe gen_probe scb2_flash mtd chipreg tg3 e1000 button
ide_cd serio_raw cdrom aic7xxx scsi_transport_spi sd_mod scsi_mod ext3 jbd
ehci_hcd ohci_hcd uhci_hcd
CPU: 3
EIP: 0060:[<c045bbc3>] Not tainted VLI
EFLAGS: 00010246 (2.6.23 #1)
EIP is at kunmap_high+0x51/0x8e

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 75723957 01-Oct-2007 Linus Torvalds <torvalds@woody.linux-foundation.org>

Fix possible splice() mmap_sem deadlock

Nick Piggin points out that splice isn't being good about the mmap
semaphore: while two readers can nest inside each others, it does leave
a possible deadlock if a writer (ie a new mmap()) comes in during that
nesting.

Original "just move the locking" patch by Nick, replaced by one by me
based on an optimistic pagefault_disable(). And then Jens tested and
updated that patch.

Reported-by: Nick Piggin <npiggin@suse.de>
Tested-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 79685b8d 27-Jul-2007 Randy Dunlap <randy.dunlap@oracle.com>

docbook: add pipes, other fixes

Fix some typos in pipe.c and splice.c.
Add pipes API to kernel-api.tmpl.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 6a860c97 20-Jul-2007 Jens Axboe <jens.axboe@oracle.com>

splice: fix bad unlock_page() in error case

If add_to_page_cache_lru() fails, the page will not be locked. But
splice jumps to an error path that does a page release and unlock,
causing a BUG() in unlock_page().

Fix this by adding one more label that just releases the page. This bug
was actually triggered on EL5 by gurudas pai <gurudas.pai@oracle.com>
using fio.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cf914a7d 19-Jul-2007 Rusty Russell <rusty@rustcorp.com.au>

readahead: split ondemand readahead interface into two functions

Split ondemand readahead interface into two functions. I think this makes it
a little clearer for non-readahead experts (like Rusty).

Internally they both call ondemand_readahead(), but the page argument is
changed to an obvious boolean flag.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# d8983910 19-Jul-2007 Fengguang Wu <wfg@mail.ustc.edu.cn>

readahead: pass real splice size

Pass real splice size to page_cache_readahead_ondemand().

The splice code works in chunks of 16 pages internally. The readahead code
should be told of the overall splice size, instead of the internal chunk size.
Otherwize bad things may happen. Imagine some 17-page random splice reads.
The code before this patch will result in two readahead calls: readahead(16);
readahead(1); That leads to one 16-page I/O and one 32-page I/O: one extra I/O
and 31 readahead miss pages.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 431a4820 19-Jul-2007 Fengguang Wu <wfg@mail.ustc.edu.cn>

readahead: move synchronous readahead call out of splice loop

Move synchronous page_cache_readahead_ondemand() call out of splice loop.

This avoids one pointless page allocation/insertion in case of non-zero
ra_pages, or many pointless readahead calls in case of zero ra_pages.

Note that if a user sets ra_pages to less than PIPE_BUFFERS=16 pages, he will
not get expected readahead behavior anyway. The splice code works in batches
of 16 pages, which can be taken as another form of synchronous readahead.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a08a166f 19-Jul-2007 Fengguang Wu <wfg@mail.ustc.edu.cn>

readahead: convert splice invocations

Convert splice reads to use on-demand readahead.

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Steven Pratt <slpratt@austin.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Jens Axboe <axboe@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# bcd4f3ac 16-Jul-2007 Jens Axboe <jens.axboe@oracle.com>

splice: direct splicing updates ppos twice

OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> reported that he's noticed
nfsd read corruption in recent kernels, and did the hard work of
discovering that it's due to splice updating the file position twice.
This means that the next operation would start further ahead than it
should.

nfsd_vfs_read()
splice_direct_to_actor()
while(len) {
do_splice_to() [update sd->pos]
-> generic_file_splice_read() [read from sd->pos]
nfsd_direct_splice_actor()
-> __splice_from_pipe() [update sd->pos]

There's nothing wrong with the core splice code, but the direct
splicing is an addon that calls both input and output paths.
So it has to take care in locally caching offset so it remains correct.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 51a92c0f 13-Jul-2007 Jens Axboe <jens.axboe@oracle.com>

splice: fix offset mangling with direct splicing (sendfile)

If the output actor doesn't transfer the full amount of data, we will
increment ppos too much. Two related bugs in there:

- We need to break out and return actor() retval if it is shorted than
what we spliced into the pipe.

- Adjust ppos only according to actor() return.

Also fix loop problem in generic_file_splice_read(), it should not keep
going when data has already been transferred.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 29ce2058 13-Jul-2007 James Morris <jmorris@namei.org>

security: revalidate rw permissions for sys_splice and sys_vmsplice

Revalidate read/write permissions for splice(2) and vmslice(2), in case
security policy has changed since the files were opened.

Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 0845718d 12-Jun-2007 Jens Axboe <jens.axboe@oracle.com>

pipe: add documentation and comments

As per Andrew Mortons request, here's a set of documentation for
the generic pipe_buf_operations hooks, the pipe, and pipe_buffer
structures.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# cac36bb0 14-Jun-2007 Jens Axboe <jens.axboe@oracle.com>

pipe: change the ->pin() operation to ->confirm()

The name 'pin' was badly chosen, it doesn't pin a pipe buffer
in the most commonly used sense in the kernel. So change the
name to 'confirm', after debating this issue with Hugh
Dickins a bit.

A good return from ->confirm() means that the buffer is really
there, and that the contents are good.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 932cc6d4 21-Jun-2007 Jens Axboe <jens.axboe@oracle.com>

splice: completely document external interface with kerneldoc

Also add fs/splice.c as a kerneldoc target with a smaller blurb that
should be expanded to better explain the overview of splice.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 497f9625 10-Jun-2007 Jens Axboe <jens.axboe@oracle.com>

pipe: allow passing around of ops private pointer

relay needs this for proper consumption handling, and the network
receive support needs it as well to lookup the sk_buff on pipe
release.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# d6b29d7c 04-Jun-2007 Jens Axboe <jens.axboe@oracle.com>

splice: divorce the splice structure/function definitions from the pipe header

We need to move even more stuff into the header so that folks can use
the splice_to_pipe() implementation instead of open-coding a lot of
pipe knowledge (see relay implementation), so move to our own header
file finally.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 6a14b90b 14-Jun-2007 Jens Axboe <jens.axboe@oracle.com>

vmsplice: add vmsplice-to-user support

A bit of a cheat, it actually just copies the data to userspace. But
this makes the interface nice and symmetric and enables people to build
on splice, with room for future improvement in performance.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# c66ab6fa 12-Jun-2007 Jens Axboe <jens.axboe@oracle.com>

splice: abstract out actor data

For direct splicing (or private splicing), the output may not be a file.
So abstract out the handling into a specified actor function and put
the data in the splice_desc structure earlier, so we can build on top
of that.

This is the first step in better splice handling for drivers, and also
for implementing vmsplice _to_ user memory.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 02676e5a 15-Jun-2007 Jens Axboe <jens.axboe@oracle.com>

splice: only check do_wakeup in splice_to_pipe() for a real pipe

We only ever set do_wakeup to non-zero if the pipe has an inode
backing, so it's pointless to check outside the pipe->inode
check.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 00de00bd 15-Jun-2007 Jens Axboe <jens.axboe@oracle.com>

splice: fix leak of pages on short splice to pipe

If the destination pipe is full and we already transferred
data, we break out instead of waiting for more pipe room.
The exit logic looks at spd->nr_pages to see if we moved
everything inside the spd container, but we decrement that
variable in the loop to decide when spd has emptied.

Instead we want to compare to the original page count in
the spd, so cache that in a local variable.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 17ee4f49 15-Jun-2007 Jens Axboe <jens.axboe@oracle.com>

splice: adjust balance_dirty_pages_ratelimited() call

As we have potentially dirtied more than 1 page, we should indicate as
such to the dirty page balancing. So call
balance_dirty_pages_ratelimited_nr() and pass in the approximate number
of pages we dirtied.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 620a324b 07-Jun-2007 Jens Axboe <jens.axboe@oracle.com>

splice: __generic_file_splice_read: fix read/truncate race

Original patch and description from Neil Brown <neilb@suse.de>,
merged and adapted to splice branch by me. Neils text follows:

__generic_file_splice_read() currently samples the i_size at the start
and doesn't do so again unless it needs to call ->readpage to load
a page. After ->readpage it has to re-sample i_size as a truncate
may have caused that page to be filled with zeros, and the read()
call should not see these.

However there are other activities that might cause ->readpage to be
called on a page between the time that __generic_file_splice_read()
samples i_size and when it finds that it has an uptodate page. These
include at least read-ahead and possibly another thread performing a
read

So we must sample i_size *after* it has an uptodate page. Thus the
current sampling at the start and after a read can be replaced with a
sampling before page addition into spd.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 475ecade 07-Jun-2007 Hugh Dickins <hugh@veritas.com>

splice: __generic_file_splice_read: fix i_size_read() length checks

__generic_file_splice_read's partial page check, at eof after readpage,
not only got its calculations wrong, but also reused the loff variable:
causing data corruption when splicing from a non-0 offset in the file's
last page (revealed by ext2 -b 1024 testing on a loop of a tmpfs file).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 20d698db 05-Jun-2007 Jens Axboe <jens.axboe@oracle.com>

splice: move balance_dirty_pages_ratelimited() outside of splice actor

I've seen inode related deadlocks, so move this call outside of the
actor itself, which may hold the inode lock.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 267adc3e 08-Jun-2007 Jens Axboe <jens.axboe@oracle.com>

splice: remove do_splice_direct() symbol export

It's only supposed to be used by do_sendfile(), which is never
modular. So kill the export.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# d366d398 01-Jun-2007 Jens Axboe <jens.axboe@oracle.com>

splice: move inode size check into generic_file_splice_read()

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 86aa5ac5 08-May-2007 Jens Axboe <jens.axboe@oracle.com>

[PATCH] splice: always call into page_cache_readahead()

Don't try to guess what the read-ahead logic will do, allow it
to make its own decisions.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 9ae9d68c 08-May-2007 Fengguang Wu <fengguang.wu@gmail.com>

[PATCH] splice(): fix interaction with readahead

Eric Dumazet, thank you for disclosing this bug.

Readahead logic somehow fails to populate the page range with data.
It can be because

1) the readahead routine is not always called in the following lines of

fs/splice.c:
if (!loff || nr_pages > 1)
page_cache_readahead(mapping, &in->f_ra, in, index, nr_pages);

2) even called, page_cache_readahead() wont guarantee the pages are there.
It wont submit readahead I/O for pages already in the radix tree, or when
(ra_pages == 0), or after 256 cache hits.

In your case, it should be because of the retried reads, which lead to
excessive cache hits, and disables readahead at some time.

And that _one_ failure of readahead blocks the whole read process.
The application receives EAGAIN and retries the read, but
__generic_file_splice_read() refuse to make progress:

- in the previous invocation, it has allocated a blank page and inserted it
into the radix tree, but never has the chance to start I/O for it: the test
of SPLICE_F_NONBLOCK goes before that.

- in the retried invocation, the readahead code will neither get out of the
cache hit mode, nor will it submit I/O for an already existing page.

Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# d9993c37 29-Mar-2007 Dmitriy Monakhov <dmonakhov@sw.ru>

[PATCH] splice: partial write fix

Currently if partial write has happened while ->commit_write() then page
wasn't marked as accessed and rebalanced.

Signed-off-by: Monakhov Dmitriy <dmonakhov@openvz.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 40bee44e 21-Mar-2007 Mark Fasheh <mark.fasheh@oracle.com>

Export __splice_from_pipe()

Ocfs2 wants to implement it's own splice write actor so that it can better
manage cluster / page locks. This lets us re-use the rest of splice write
while only providing our own code where it's actually important.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 08c72591 27-Mar-2007 Nick Piggin <npiggin@suse.de>

2/2 splice: dont readpage

Splice does not need to readpage to bring the page uptodate before writing
to it, because prepare_write will take care of that for us.

Splice is also wrong to SetPageUptodate before the page is actually uptodate.
This results in the old uninitialised memory leak. This gets fixed as a
matter of course when removing the readpage logic.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 485ddb4b 27-Mar-2007 Nick Piggin <npiggin@suse.de>

1/2 splice: dont steal

Stealing pages with splice is problematic because we cannot just insert
an uptodate page into the pagecache and hope the filesystem can take care
of it later.

We also cannot just ClearPageUptodate, then hope prepare_write does not
write anything into the page, because I don't think prepare_write gives
that guarantee.

Remove support for SPLICE_F_MOVE for now. If we really want to bring it
back, we might be able to do so with a the new filesystem buffered write
aops APIs I'm working on. If we really don't want to bring it back, then
we should decide that sooner rather than later, and remove the flag and
all the stealing infrastructure before anybody starts using it.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# d4c3cca9 13-Dec-2006 Eric Dumazet <dada1@cosmosbay.com>

[PATCH] constify pipe_buf_operations

- pipe/splice should use const pipe_buf_operations and file_operations

- struct pipe_inode_info has an unused field "start" : get rid of it.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 0f7fc9e4 08-Dec-2006 Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>

[PATCH] VFS: change struct file to use struct path

This patch changes struct file to use struct path instead of having
independent pointers to struct dentry and struct vfsmount, and converts all
users of f_{dentry,vfsmnt} in fs/ to use f_path.{dentry,mnt}.

Additionally, it adds two #define's to make the transition easier for users of
the f_dentry and f_vfsmnt.

Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# ddac0d39 03-Nov-2006 Jens Axboe <jens.axboe@oracle.com>

[PATCH] splice: fix problem introduced with inode diet

After the inode slimming patch that unionised i_pipe/i_bdev/i_cdev, it's
no longer enough to check for existance of ->i_pipe to verify that this
is a pipe.

Original patch from Eric Dumazet <dada1@cosmosbay.com>
Final solution suggested by Linus.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 2ae88149 28-Oct-2006 Nick Piggin <npiggin@suse.de>

[PATCH] mm: clean up pagecache allocation

- Consolidate page_cache_alloc

- Fix splice: only the pagecache pages and filesystem data need to use
mapping_gfp_mask.

- Fix grab_cache_page_nowait: same as splice, also honour NUMA placement.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 8c34e2d6 17-Oct-2006 Jens Axboe <jens.axboe@oracle.com>

[PATCH] Remove SUID when splicing into an inode

Originally from Mark Fasheh <mark.fasheh@oracle.com>

generic_file_splice_write() does not remove S_ISUID or S_ISGID. This is
inconsistent with the way we generally write to files.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 6da61809 17-Oct-2006 Mark Fasheh <mark.fasheh@oracle.com>

[PATCH] Introduce generic_file_splice_write_nolock()

This allows file systems to manage their own i_mutex locking while
still re-using the generic_file_splice_write() logic.

OCFS2 in particular wants this so that it can order cluster locks within
i_mutex.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 62752ee1 17-Oct-2006 Mark Fasheh <mark.fasheh@oracle.com>

[PATCH] Take i_mutex in splice_from_pipe()

The splice_actor may be calling ->prepare_write() and ->commit_write(). We
want i_mutex on the inode being written to before calling those so that we
don't race i_size changes.

The double locking behavior is done elsewhere in splice.c, and if we
eventually want _nolock variants of generic_file_splice_write(), fs modules
might have to replicate the nasty locking code. We introduce
inode_double_lock() and inode_double_unlock() to consolidate the locking
rules into one set of functions.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# e6e80f29 11-Oct-2006 Jens Axboe <jens.axboe@oracle.com>

[PATCH] splice: fix pipe_to_file() ->prepare_write() error path

Don't jump to the unlock+release path, we already did that.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


# 0fe23479 04-Sep-2006 Jens Axboe <axboe@kernel.dk>

[PATCH] Update axboe@suse.de email address

As people often look for the copyright in files to see who to mail,
update the link to a neutral one.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# aadd06e5 10-Jul-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: fix problems with sys_tee()

Several issues noticed/fixed:

- We cannot reliably block in link_pipe() while holding both input and output
mutexes. So do preparatory checks before locking down both mutexes and doing
the link.

- The ipipe->nrbufs vs i check was bad, because we could have dropped the
ipipe lock in-between. This causes us to potentially look at unknown
buffers if we were racing with someone else reading this pipe.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 9e94cd4f 20-Jun-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: retrieve mapping after locking the page

Otherwise we could be racing with truncate/mapping removal.

Problem found/fixed by Nick Piggin <npiggin@suse.de>, logic rewritten
by me.

Signed-off-by: Jens Axboe <axboe@suse.de>


# a0548871 03-May-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: redo page lookup if add_to_page_cache() returns -EEXIST

This can happen quite easily, if several processes are trying to splice
the same file at the same time. It's not a failure, it just means someone
raced with us in allocating this file page. So just dump the allocated
page and relookup the original.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 76ad4d11 03-May-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: rename remaining info variables to pipe

Same thing was done in fs/pipe.c and most of fs/splice.c, but we had
a few missing still.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 1432873a 03-May-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: LRU fixups

Nick says that the current construct isn't safe. This goes back to the
original, but sets PIPE_BUF_FLAG_LRU on user pages as well as they all
seem to be on the LRU in the first place.

Signed-off-by: Jens Axboe <axboe@suse.de>


# bfc4ee39 03-May-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: fix unlocking of page on error ->prepare_write()

Looking at generic_file_buffered_write(), we need to unlock_page() if
prepare write fails and it isn't due to racing with truncate().

Also trim the size if ->prepare_write() fails, if we have to.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 330ab716 02-May-2006 Jens Axboe <axboe@suse.de>

[PATCH] vmsplice: restrict stealing a little more

Apply the same rules as the anon pipe pages, only allow stealing
if no one else is using the page.

Signed-off-by: Jens Axboe <axboe@suse.de>


# a893b99b 02-May-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: fix page LRU accounting

Currently we rely on the PIPE_BUF_FLAG_LRU flag being set correctly
to know whether we need to fiddle with page LRU state after stealing it,
however for some origins we just don't know if the page is on the LRU
list or not.

So remove PIPE_BUF_FLAG_LRU and do this check/add manually in pipe_to_file()
instead.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 7591489a 01-May-2006 Jens Axboe <axboe@suse.de>

[PATCH] vmsplice: fix badly placed end paranthesis

We need to use the minium of {len, PAGE_SIZE-off}, not {len, PAGE_SIZE}-off.
The latter doesn't make any sense, and could cause us to attempt negative
length transfers...

Signed-off-by: Jens Axboe <axboe@suse.de>


# 7afa6fd0 01-May-2006 Jens Axboe <axboe@suse.de>

[PATCH] vmsplice: allow user to pass in gift pages

If SPLICE_F_GIFT is set, the user is basically giving this pages away to
the kernel. That means we can steal them for eg page cache uses instead
of copying it.

The data must be properly page aligned and also a multiple of the page size
in length.

Signed-off-by: Jens Axboe <axboe@suse.de>


# f6762b7a 01-May-2006 Jens Axboe <axboe@suse.de>

[PATCH] pipe: enable atomic copying of pipe data to/from user space

The pipe ->map() method uses kmap() to virtually map the pages, which
is both slow and has known scalability issues on SMP. This patch enables
atomic copying of pipe pages, by pre-faulting data and using kmap_atomic()
instead.

lmbench bw_pipe and lat_pipe measurements agree this is a Good Thing. Here
are results from that on a UP machine with highmem (1.5GiB of RAM), running
first a UP kernel, SMP kernel, and SMP kernel patched.

Vanilla-UP:
Pipe bandwidth: 1622.28 MB/sec
Pipe bandwidth: 1610.59 MB/sec
Pipe bandwidth: 1608.30 MB/sec
Pipe latency: 7.3275 microseconds
Pipe latency: 7.2995 microseconds
Pipe latency: 7.3097 microseconds

Vanilla-SMP:
Pipe bandwidth: 1382.19 MB/sec
Pipe bandwidth: 1317.27 MB/sec
Pipe bandwidth: 1355.61 MB/sec
Pipe latency: 9.6402 microseconds
Pipe latency: 9.6696 microseconds
Pipe latency: 9.6153 microseconds

Patched-SMP:
Pipe bandwidth: 1578.70 MB/sec
Pipe bandwidth: 1579.95 MB/sec
Pipe bandwidth: 1578.63 MB/sec
Pipe latency: 9.1654 microseconds
Pipe latency: 9.2266 microseconds
Pipe latency: 9.1527 microseconds

Signed-off-by: Jens Axboe <axboe@suse.de>


# e27dedd8 01-May-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: call handle_ra_miss() on failure to lookup page

Notify the readahead logic of the missing page. Suggested by
Oleg Nesterov.

Signed-off-by: Jens Axboe <axboe@suse.de>


# f84d7519 01-May-2006 Jens Axboe <axboe@suse.de>

[PATCH] pipe: introduce ->pin() buffer operation

The ->map() function is really expensive on highmem machines right now,
since it has to use the slower kmap() instead of kmap_atomic(). Splice
rarely needs to access the virtual address of a page, so it's a waste
of time doing it.

Introduce ->pin() to take over the responsibility of making sure the
page data is valid. ->map() is then reduced to just kmap(). That way we
can also share a most of the pipe buffer ops between pipe.c and splice.c

Signed-off-by: Jens Axboe <axboe@suse.de>


# 0568b409 01-May-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: fix bugs in pipe_to_file()

Found by Oleg Nesterov <oleg@tv-sign.ru>, fixed by me.

- Only allow full pages to go to the page cache.
- Check page != buf->page instead of using PIPE_BUF_FLAG_STOLEN.
- Remember to clear 'stolen' if add_to_page_cache() fails.

And as a cleanup on that:

- Make the bottom fall-through logic a little less convoluted. Also make
the steal path hold an extra reference to the page, so we don't have
to differentiate between stolen and non-stolen at the end.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 46e678c9 30-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: fix bugs with stealing regular pipe pages

- Check that page has suitable count for stealing in the regular pipes.
- pipe_to_file() assumes that the page is locked on succesful steal, so
do that in the pipe steal hook
- Missing unlock_page() in add_to_page_cache() failure.

Signed-off-by: Jens Axboe <axboe@suse.de>


# eb20796b 27-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: make the read-side do batched page lookups

Use the new find_get_pages_contig() to potentially look up the entire
splice range in one single call. This speeds up generic_file_splice_read()
quite a bit.

Signed-off-by: Jens Axboe <axboe@suse.de>


# eb645a24 27-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: switch to using page_cache_readahead()

Avoids doing useless work, when the file is fully cached.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 00522fb4 26-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: rearrange moving to/from pipe helpers

We need these for people writing their own ->splice_read/write hooks.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 912d35f8 26-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] Add support for the sys_vmsplice syscall

sys_splice() moves data to/from pipes with a file input/output. sys_vmsplice()
moves data to a pipe, with the input being a user address range instead.

This uses an approach suggested by Linus, where we can hold partial ranges
inside the pages[] map. Hopefully this will be useful for network
receive support as well.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 016b661e 25-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: fix offset problems

Make the move_from_pipe() actors return number of bytes processed, then
move_from_pipe() can decide more cleverly when to move on to the next
buffer.

This fixes problems with pipe offset and differing file offset.

Signed-off-by: Jens Axboe <axboe@suse.de>


# ba5f5d90 25-Apr-2006 Andrew Morton <akpm@osdl.org>

[PATCH] splice: fix min() warning

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Jens Axboe <axboe@suse.de>


# 82aa5d61 20-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: fix smaller sized splice reads

Signed-off-by: Jens Axboe <axboe@suse.de>


# 9e0267c2 19-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: fixup writeout path after ->map changes

Since ->map() no longer locks the page, we need to adjust the handling
of those pages (and stealing) a little. This now passes full regressions
again.

Signed-off-by: Jens Axboe <axboe@suse.de>


# a4514ebd 19-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: offset fixes

- We need to adjust *ppos for writes as well.
- Copy back modified offset value if one was passed in, similar to
what sendfile does.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 2a27250e 19-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] tee: link_pipe() must be careful when dropping one of the pipe locks

We need to ensure that we only drop a lock that is ordered last, to avoid
ABBA deadlocks with competing processes.

Signed-off-by: Jens Axboe <axboe@suse.de>


# c4f895cb 19-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: cleanup the SPLICE_F_NONBLOCK handling

- generic_file_splice_read() more readable and correct
- Don't bail on page allocation with NONBLOCK set, just don't allow
direct blocking on IO (eg lock_page).

Signed-off-by: Jens Axboe <axboe@suse.de>


# 91ad66ef 19-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: close i_size truncate races on read

We need to check i_size after doing a blocking readpage.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 70524490 11-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: add support for sys_tee()

Basically an in-kernel implementation of tee, which uses splice and the
pipe buffers as an intelligent way to pass data around by reference.

Where the user space tee consumes the input and produces a stdout and
file output, this syscall merely duplicates the data inside a pipe to
another pipe. No data is copied, the output just grabs a reference to the
input pipe data.

Signed-off-by: Jens Axboe <axboe@suse.de>


# cbb7e577 11-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: pass offset around for ->splice_read() and ->splice_write()

We need not use ->f_pos as the offset for the file input/output. If the
user passed an offset pointer in through sys_splice(), just use that and
leave ->f_pos alone.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 73d62d83 11-Apr-2006 Ingo Molnar <mingo@elte.hu>

[PATCH] splice: comment styles

- capitalize consistently
- end sentences in one way or another
- update comment text to match the implementation

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jens Axboe <axboe@suse.de>


# c2058e06 11-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: add Ingo as addition copyright holder

The comment is also somewhat out of date, correct that as well.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 49570e9b 11-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: unlikely() optimizations

Also corrects a few comments. Patch mainly from Ingo, changes by me.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jens Axboe <axboe@suse.de>


# 6f767b04 11-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: speedups and optimizations

- Kill the local variables that cache ->nrbufs, they just take up space.

- Only set do_wakeup for a real pipe. This is a big win for direct splicing.

- Kill i_mutex lock around ->f_pos update, regular io paths don't do this
either.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 7480a904 11-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: speedup __generic_file_splice_read

Using find_get_page() is a lot faster than find_or_create_page(). This
gets splice a lot closer to sendfile() for fd -> socket transfers.

Signed-off-by: Jens Axboe <axboe@suse.de>


# b92ce558 11-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: add direct fd <-> fd splicing support

It's more efficient for sendfile() emulation. Basically we cache an
internal private pipe and just use that as the intermediate area for
pages. Direct splicing is not available from sys_splice(), it is only
meant to be used for sendfile() emulation.

Additional patch from Ingo Molnar to avoid the PIPE_BUFFERS loop at
exit for the normal fast path.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 529565dc 10-Apr-2006 Ingo Molnar <mingo@elte.hu>

[PATCH] splice: add optional input and output offsets

add optional input and output offsets to sys_splice(), for seekable file
descriptors:

asmlinkage long sys_splice(int fd_in, loff_t __user *off_in,
int fd_out, loff_t __user *off_out,
size_t len, unsigned int flags);

semantics are straightforward: f_pos will be updated with the offset
provided by user-space, before the splice transfer is about to begin.
Providing a NULL offset pointer means the existing f_pos will be used
(and updated in situ). Providing an offset for a pipe results in
-ESPIPE. Providing an invalid offset pointer results in -EFAULT.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jens Axboe <axboe@suse.de>


# 3a326a2c 10-Apr-2006 Ingo Molnar <mingo@elte.hu>

[PATCH] introduce a "kernel-internal pipe object" abstraction

separate out the 'internal pipe object' abstraction, and make it
usable to splice. This cleans up and fixes several aspects of the
internal splice APIs and the pipe code:

- pipes: the allocation and freeing of pipe_inode_info is now more symmetric
and more streamlined with existing kernel practices.

- splice: small micro-optimization: less pointer dereferencing in splice
methods

Signed-off-by: Ingo Molnar <mingo@elte.hu>

Update XFS for the ->splice_read/->splice_write changes.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 0b749ce3 10-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: be smarter about calling do_page_cache_readahead()

We don't want to call into the read-ahead logic unless we are at the
start of a page, _or_ we have multiple pages to read.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 49d0b21b 10-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: optimize the splice buffer mapping

We don't really need to lock down the pages, just make sure they
are uptodate.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 16c523dd 10-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: cleanup __generic_file_splice_read()

The whole shadow/pages logic got overly complex, and this simpler
approach is actually faster in testing.

Signed-off-by: Jens Axboe <axboe@suse.de>


# c0bd1f65 10-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: only call wake_up_interruptible() when we really have to

__wake_up_common() is pretty heavy in the kernel profiles, this brings
it down to a more acceptable level.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 9aefe431 10-Apr-2006 Dave Jones <davej@redhat.com>

[PATCH] splice: potential !page dereference

We can get to out: with a NULL page, which we probably
don't want to be calling page_cache_release() on.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Jens Axboe <axboe@suse.de>


# c7f21e4f 10-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: mark the io page as accessed

We should do that, since we do the LRU manipulation ourselves now. Suggested
by Nick Piggin.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 3e7ee3e7 02-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: fix page stealing LRU handling.

Originally from Nick Piggin, just adapted to the newer branch.

You can't check PageLRU without holding zone->lru_lock. The page
release code can get away with it only because the page refcount is 0 at
that point. Also, you can't reliably remove pages from the LRU unless
the refcount is 0. Ever.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Jens Axboe <axboe@suse.de>


# ad8d6f0a 02-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: page stealing needs to wait_on_page_writeback()

Thanks to Andrew for the good explanation of why this is so. akpm writes:

If a page is under writeback and we remove it from pagecache, it's still
going to get written to disk. But the VFS no longer knows about that page,
nor that this page is about to modify disk blocks.

So there might be scenarios in which those
blocks-which-are-about-to-be-written-to get reused for something else.
When writeback completes, it'll scribble on those blocks.

This won't happen in ext2/ext3-style filesystems in normal mode because the
page has buffers and try_to_release_page() will fail.

But ext2 in nobh mode doesn't attach buffers at all - it just sticks the
page in a BIO, finds some new blocks, points the BIO at those blocks and
lets it rip.

While that write IO's in flight, someone could truncate the file. Truncate
won't block on the writeout because the page isn't in pagecache any more.
So truncate will the free the blocks from the file under the page's feet.
Then something else can reallocate those blocks. Then write data to them.

Now, the original write completes, corrupting the filesystem.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 059a8f37 02-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: export generic_splice_sendpage

Forgot that one, thanks Jeff. Also move the other EXPORT_SYMBOL
to right below the functions.

Signed-off-by: Jens Axboe <axboe@suse.de>


# b2b39fa4 02-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: add a SPLICE_F_MORE flag

This lets userspace indicate whether more data will be coming in a
subsequent splice call.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 83f9135b 02-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: add comments documenting more of the code

Hopefully this will make Andrew a little more happy.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 4f6f0bd2 02-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: improve writeback and clean up page stealing

By cleaning up the writeback logic (killing write_one_page() and the manual
set_page_dirty()), we can get rid of ->stolen inside the pipe_buffer and
just keep it local in pipe_to_file().

This also adds dirty page balancing logic and O_SYNC handling.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 53cd9ae8 02-Apr-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: fix shadow[] filling logic

Clear the entire range, and don't increment pidx or we keep filling
the same position again and again.

Thanks to KAMEZAWA Hiroyuki.

Signed-off-by: Jens Axboe <axboe@suse.de>


# 29e35094 02-Apr-2006 Linus Torvalds <torvalds@g5.osdl.org>

splice: add SPLICE_F_NONBLOCK flag

It doesn't make the splice itself necessarily nonblocking (because the
actual file descriptors that are spliced from/to may block unless they
have the O_NONBLOCK flag set), but it makes the splice pipe operations
nonblocking.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# a0f06780 30-Mar-2006 Jeff Garzik <jeff@garzik.org>

[PATCH] splice exports

Woe be unto he who builds their filesystems as modules.

Signed-off-by: Jeff Garzik <jeff@garzik.org>
[ Obscure quote from the infamous geek bible? ]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 5abc97aa 30-Mar-2006 Jens Axboe <axboe@suse.de>

[PATCH] splice: add support for SPLICE_F_MOVE flag

This enables the caller to migrate pages from one address space page
cache to another. In buzz word marketing, you can do zero-copy file
copies!

Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>


# 5274f052 30-Mar-2006 Jens Axboe <axboe@suse.de>

[PATCH] Introduce sys_splice() system call

This adds support for the sys_splice system call. Using a pipe as a
transport, it can connect to files or sockets (latter as output only).

From the splice.c comments:

"splice": joining two ropes together by interweaving their strands.

This is the "extended pipe" functionality, where a pipe is used as
an arbitrary in-memory buffer. Think of a pipe as a small kernel
buffer that you can use to transfer data from one end to the other.

The traditional unix read/write is extended with a "splice()" operation
that transfers data buffers to or from a pipe buffer.

Named by Larry McVoy, original implementation from Linus, extended by
Jens to support splicing to files and fixing the initial implementation
bugs.

Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>