summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Btrfs: remove the block device pointer from the scrub context structStefan Behrens2012-12-121-60/+73
| | | | | | | | | | | | | | | | | | | | The block device is removed from the scrub context state structure. The scrub code as it is used for the device replace procedure reads the source data from whereever it is optimal. The source device might even be gone (disconnected, for instance due to a hardware failure). Or the drive can be so faulty so that the device replace procedure tries to avoid access to the faulty source drive as much as possible, and only if all other mirrors are damaged, as a last resort, the source disk is accessed. The modified scrub code operates as if it would handle the source drive and thereby generates an exact copy of the source disk on the target disk, even if the source disk is not present at all. Therefore the block device pointer to the source disk is removed in the scrub context struct and moved into the lower level scope of scrub_bio, fixup and page structures where the block device context is known. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: rename the scrub context structureStefan Behrens2012-12-122-253/+253
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The device replace procedure makes use of the scrub code. The scrub code is the most efficient code to read the allocated data of a disk, i.e. it reads sequentially in order to avoid disk head movements, it skips unallocated blocks, it uses read ahead mechanisms, and it contains all the code to detect and repair defects. This commit is a first preparation step to adapt the scrub code to be shareable for the device replace procedure. The block device will be removed from the scrub context state structure in a later step. It used to be the source block device. The scrub code as it is used for the device replace procedure reads the source data from whereever it is optimal. The source device might even be gone (disconnected, for instance due to a hardware failure). Or the drive can be so faulty so that the device replace procedure tries to avoid access to the faulty source drive as much as possible, and only if all other mirrors are damaged, as a last resort, the source disk is accessed. The modified scrub code operates as if it would handle the source drive and thereby generates an exact copy of the source disk on the target disk, even if the source disk is not present at all. Therefore the block device pointer to the source disk is removed in a later patch, and therefore the context structure is renamed (this is the goal of the current patch) to reflect that no source block device scope is there anymore. Summary: This first preparation step consists of a textual substitution of the term "dev" to the term "ctx" whereever the scrub context is used. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: protect devices list with its mutexLiu Bo2012-12-121-4/+5
| | | | | | | | Since we've kill the bigger one volume_mutex, we need to add devices list mutex back. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: cleanup for btrfs_btree_balance_dirtyLiu Bo2012-12-127-81/+34Star
| | | | | | | | | - 'nr' is no more used. - btrfs_btree_balance_dirty() and __btrfs_btree_balance_dirty() can share a bunch of code. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: merge inode_list in __merge_refsAlexander Block2012-12-121-2/+11
| | | | | | | | | | | When __merge_refs merges two refs, it is also needed to merge the inode_list of both refs. Otherwise we have missed backrefs and memory leaks. This happens for example if two inodes share an extent and both lie in the same leaf and thus also have the same parent. Signed-off-by: Alexander Block <ablock84@googlemail.com> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: set hole punching time properlyTsutomu Itoh2012-12-121-0/+3
| | | | | | | | | Even if the hole punching is executed, the modification time of the file is not updated. So, current time is set to inode. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: Don't trust the superblock label and simply printk("%s") itStefan Behrens2012-12-121-2/+5
| | | | | | | | | | | | Someone who is root or capable(CAP_SYS_ADMIN) could corrupt the superblock and make Btrfs printk("%s") crash while holding the uuid_mutex since nobody forces a limit on the string. Since the uuid_mutex is significant, the system would be unusable afterwards. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: fix a double free on pending snapshots in error handlingLiu Bo2012-12-121-1/+5
| | | | | | | | | | | When creating a snapshot, failing to commit a transaction can end up with aborting the transaction, following by doing a cleanup for it, where we'll free all snapshots pending to disk. So we check it and avoid double free on pending snapshots. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: fix a deadlock in aborting transaction due to ENOSPCLiu Bo2012-12-121-0/+7
| | | | | | | | | | | When committing a transaction, we may bail out of running delayed refs due to ENOSPC, and then abort the current transaction to flip into readonly. But we'll hit a deadlock on ref head's lock since we forget to release its lock and other cleanup stuff. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* fs/btrfs: drop if around WARN_ONJulia Lawall2012-12-123-10/+5Star
| | | | | | | | | | | | | | | | | | | Just use WARN_ON rather than an if containing only WARN_ON(1). A simplified version of the semantic patch that makes this transformation is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression e; @@ - if (e) WARN_ON(1); + WARN_ON(e); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* fs/btrfs: use WARNJulia Lawall2012-12-127-38/+21Star
| | | | | | | | | | | | | | | | | | | | | | Use WARN rather than printk followed by WARN_ON(1), for conciseness. A simplified version of the semantic patch that makes this transformation is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression list es; @@ -printk( +WARN(1, es); -WARN_ON(1); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: fix missing log when BTRFS_INODE_NEEDS_FULL_SYNC is setMiao Xie2012-12-121-1/+4
| | | | | | | | | If we set BTRFS_INODE_NEEDS_FULL_SYNC, we should log all the extent, but now we forget to take it into account, and set a wrong max key, if so, we will skip the file extent metadata when doing logging. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: fix unprotected extent map operation when logging file extentsMiao Xie2012-12-121-0/+2
| | | | | | | | We forget to protect the modified_extents list, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: fix wrong file extent lengthMiao Xie2012-12-123-9/+23
| | | | | | | | | There are two types of the file extent - inline extent and regular extent, When we log file extents, we didn't take inline extent into account, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: fix missing flush when committing a transactionMiao Xie2012-12-121-35/+47
| | | | | | | | | | | | | | | | | | | | | | Consider the following case: Task1 Task2 start_transaction commit_transaction check pending snapshots list and the list is empty. add pending snapshot into list skip the delalloc flush end_transaction ... And then the problem that the snapshot is different with the source subvolume happen. This patch fixes the above problem by flush all pending stuffs when all the other tasks end the transaction. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: fix joining the same transaction handler more than 2 timesMiao Xie2012-12-122-30/+48
| | | | | | | | | | | | | | | | | | | | | If we flush inodes with pending delalloc in a transaction, we may join the same transaction handler more than 2 times. The reason is: Task use_count of trans handle commit_transaction 1 |-> btrfs_start_delalloc_inodes 1 |-> run_delalloc_nocow 1 |-> join_transaction 2 |-> cow_file_range 2 |-> join_transaction 3 In fact, cow_file_range needn't join the transaction again because the caller have joined the transaction, so we fix this problem by this way. Reported-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: cleanup for btrfs_wait_order_rangeLiu Bo2012-12-121-3/+0Star
| | | | | | | Variable 'found' is no more used. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: get right arguments for btrfs_wait_ordered_rangeLiu Bo2012-12-121-1/+1
| | | | | | | btrfs_wait_ordered_range expects for 'len' instead of 'end'. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: do not log extents when we only log new namesLiu Bo2012-12-121-1/+2
| | | | | | | | | | | | When we log new names, we need to log just enough to recreate the inode during log replay, and there is no need to log extents along with it. This actually fixes a bug revealed by xfstests 241, where it shows that we're logging some extents that have not updated metadata, so we don't get proper EXTENT_DATA items to be copied to log tree. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: don't allow degraded mount if too many devices are missingStefan Behrens2012-12-122-0/+16
| | | | | | | | | | | | | | | | | | | | | The current behavior is to allow mounting or remounting a filesystem writeable in degraded mode if at least one writeable device is present. The next failed write access to a missing device which is above the tolerance of the configured level of redundancy results in an read-only enforcement. Even without this, the next time barrier_all_devices() is called and more devices are missing than tolerable, the switch to read-only mode takes place. In order to behave predictably and to provide proper feedback to the user at mount time, this patch compares the number of missing devices with the number of devices that are tolerated to be missing according to the configured RAID level. If more devices are missing than tolerated, e.g. if two devices are missing in case of RAID1, only a read-only mount and remount is allowed. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: Fix typo in fs/btrfsMasanari Iida2012-12-122-2/+2
| | | | | | | Correct spelling typo in btrfs. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: Remove the invalid shrink size check up from btrfs_shrink_dev()jeff.liu2012-12-122-4/+1Star
| | | | | | | | | | | | Remove an invalid size check up from btrfs_shrink_dev(). The new size should not larger than the device->total_bytes as it was already verified before coming to here(i.e. new_size < old_size). Remove invalid check up for btrfs_shrink_dev(). Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: make ordered extent be flushed by multi-taskMiao Xie2012-12-112-9/+37
| | | | | | | | | Though the process of the ordered extents is a bit different with the delalloc inode flush, but we can see it as a subset of the delalloc inode flush, so we also handle them by flush workers. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: make ordered operations be handled by multi-taskMiao Xie2012-12-113-21/+45
| | | | | | | | The process of the ordered operations is similar to the delalloc inode flush, so we handle them by flush workers. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: make delalloc inodes be flushed by multi-taskMiao Xie2012-12-115-8/+103
| | | | | | | | | | This patch introduce a new worker pool named "flush_workers", and if we want to force all the inode with pending delalloc to the disks, we can queue those inodes into the work queue of the worker pool, in this way, those inodes will be flushed by multi-task. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: fill the global reserve when unpinning spaceJosef Bacik2012-12-111-5/+24
| | | | | | | | | | | | | | | | | Dave gave me an image of a very full file system that would abort the transaction because it ran out of space while committing the transaction. This is because we would think there was plenty of room to create a snapshot even though the global reserve was not full. This happens because we calculate the global reserve size before we unpin any space, so after we unpin the space we allow reservations to occur even though we haven't reserved all of the space for our global reserve. Fix this by adding to the global reserve while unpinning in order to make sure we always have enough space to do our work. With this patch we no longer end up with an aborted transaction, we return ENOSPC properly to the person trying to create the snapshot. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: cleanup unused argumentsLiu Bo2012-12-111-7/+6Star
| | | | | | | 'disk_key' is not used at all. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: kill unnecessary arguments in del_ptrLiu Bo2012-12-111-9/+7Star
| | | | | | | The argument 'tree_mod_log' is not necessary since all of callers enable it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: reorder tree mod log operations in deleting a pointerLiu Bo2012-12-111-4/+6
| | | | | | | | Since we don't use MOD_LOG_KEY_REMOVE_WHILE_MOVING to add nritems during rewinding, we should insert a MOD_LOG_KEY_REMOVE operation first. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: MOD_LOG_KEY_REMOVE_WHILE_MOVING never change node's nritemsLiu Bo2012-12-111-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | Key MOD_LOG_KEY_REMOVE_WHILE_MOVING means that we're doing memmove inside an extent buffer node, and the node's number of items remains unchanged (unless we are inserting a single pointer, but we have MOD_LOG_KEY_ADD for that). So we don't need to increase node's number of items during rewinding, otherwise we may get an node larger than leafsize and cause general protection errors later. Here is the details, - If we do memory move for inserting a single pointer, we need to add node's nritems by one, and we honor MOD_LOG_KEY_ADD for adding. - If we do memory move for deleting a single pointer, we need to decrease node's nritems by one, and we honor MOD_LOG_KEY_REMOVE for deleting. - If we do memory move for balance left/right, we need to decrease node's nritems, and we honor MOD_LOG_KEY_REMOVE for balaning. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: fix unnecessary while loop when search the free space, cacheMiao Xie2012-12-111-20/+10Star
| | | | | | | | | | | | | When we find a bitmap free space entry, we may check the previous extent entry covers the offset or not. But if we find this entry is also a bitmap entry, we will continue to check the previous entry of the current one by a while loop. It is unnecessary because it is impossible that the extent entry which is in front of a bitmap entry can cover the offset of the entry after that bitmap entry. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: recheck bio against block device when we map the bioJosef Bacik2012-12-111-28/+131
| | | | | | | | | | | | | Alex reported a problem where we were writing between chunks on a rbd device. The thing is we do bio_add_page using logical offsets, but the physical offset may be different. So when we map the bio now check to see if the bio is still ok with the physical offset, and if it is not split the bio up and redo the bio_add_page with the physical sector. This fixes the problem for Alex and doesn't affect performance in the normal case. Thanks, Reported-and-tested-by: Alex Elder <elder@inktank.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: improve the noflush reservationMiao Xie2012-12-118-86/+97
| | | | | | | | | | | | | | | | In some places(such as: evicting inode), we just can not flush the reserved space of delalloc, flushing the delayed directory index and delayed inode is OK, but we don't try to flush those things and just go back when there is no enough space to be reserved. This patch fixes this problem. We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL. If we can in the transaction, we should not flush anything, or the deadlock would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used, and we will flush all things. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: fix wrong comment in can_overcommit()Miao Xie2012-12-111-3/+3
| | | | | | | The comment is not coincident with the code. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: cleanup duplicated division functionsMiao Xie2012-12-113-40/+46
| | | | | | | | | div_factor{_fine} has been implemented for two times, cleanup it. And I move them into a independent file named math.h because they are common math functions. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Linux 3.7Linus Torvalds2012-12-111-1/+1
|
* Input: matrix-keymap - provide proper module licenseFlorian Fainelli2012-12-111-0/+3
| | | | | | | | | | | | The matrix-keymap module is currently lacking a proper module license, add one so we don't have this module tainting the entire kernel. This issue has been present since commit 1932811f426f ("Input: matrix-keymap - uninline and prepare for device tree support") Signed-off-by: Florian Fainelli <florian@openwrt.org> CC: stable@vger.kernel.org # v3.5+ Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2012-12-112-42/+131
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Netlink socket dumping had several missing verifications and checks. In particular, address comparisons in the request byte code interpreter could access past the end of the address in the inet_request_sock. Also, address family and address prefix lengths were not validated properly at all. This means arbitrary applications can read past the end of certain kernel data structures. Fixes from Neal Cardwell. 2) ip_check_defrag() operates in contexts where we're in the process of, or about to, input the packet into the real protocols (specifically macvlan and AF_PACKET snooping). Unfortunately, it does a pskb_may_pull() which can modify the backing packet data which is not legal if the SKB is shared. It very much can be shared in this context. Deal with the possibility that the SKB is segmented by using skb_copy_bits(). Fix from Johannes Berg based upon a report by Eric Leblond. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: ipv4: ip_check_defrag must not modify skb before unsharing inet_diag: validate port comparison byte code to prevent unsafe reads inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run() inet_diag: validate byte code to prevent oops in inet_diag_bc_run() inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state
| * ipv4: ip_check_defrag must not modify skb before unsharingJohannes Berg2012-12-101-10/+9Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ip_check_defrag() might be called from af_packet within the RX path where shared SKBs are used, so it must not modify the input SKB before it has unshared it for defragmentation. Use skb_copy_bits() to get the IP header and only pull in everything later. The same is true for the other caller in macvlan as it is called from dev->rx_handler which can also get a shared SKB. Reported-by: Eric Leblond <eric@regit.org> Cc: stable@vger.kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * inet_diag: validate port comparison byte code to prevent unsafe readsNeal Cardwell2012-12-101-7/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add logic to verify that a port comparison byte code operation actually has the second inet_diag_bc_op from which we read the port for such operations. Previously the code blindly referenced op[1] without first checking whether a second inet_diag_bc_op struct could fit there. So a malicious user could make the kernel read 4 bytes beyond the end of the bytecode array by claiming to have a whole port comparison byte code (2 inet_diag_bc_op structs) when in fact the bytecode was not long enough to hold both. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run()Neal Cardwell2012-12-101-11/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add logic to check the address family of the user-supplied conditional and the address family of the connection entry. We now do not do prefix matching of addresses from different address families (AF_INET vs AF_INET6), except for the previously existing support for having an IPv4 prefix match an IPv4-mapped IPv6 address (which this commit maintains as-is). This change is needed for two reasons: (1) The addresses are different lengths, so comparing a 128-bit IPv6 prefix match condition to a 32-bit IPv4 connection address can cause us to unwittingly walk off the end of the IPv4 address and read garbage or oops. (2) The IPv4 and IPv6 address spaces are semantically distinct, so a simple bit-wise comparison of the prefixes is not meaningful, and would lead to bogus results (except for the IPv4-mapped IPv6 case, which this commit maintains). Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * inet_diag: validate byte code to prevent oops in inet_diag_bc_run()Neal Cardwell2012-12-101-3/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add logic to validate INET_DIAG_BC_S_COND and INET_DIAG_BC_D_COND operations. Previously we did not validate the inet_diag_hostcond, address family, address length, and prefix length. So a malicious user could make the kernel read beyond the end of the bytecode array by claiming to have a whole inet_diag_hostcond when the bytecode was not long enough to contain a whole inet_diag_hostcond of the given address family. Or they could make the kernel read up to about 27 bytes beyond the end of a connection address by passing a prefix length that exceeded the length of addresses of the given family. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV stateNeal Cardwell2012-12-101-14/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix inet_diag to be aware of the fact that AF_INET6 TCP connections instantiated for IPv4 traffic and in the SYN-RECV state were actually created with inet_reqsk_alloc(), instead of inet6_reqsk_alloc(). This means that for such connections inet6_rsk(req) returns a pointer to a random spot in memory up to roughly 64KB beyond the end of the request_sock. With this bug, for a server using AF_INET6 TCP sockets and serving IPv4 traffic, an inet_diag user like `ss state SYN-RECV` would lead to inet_diag_fill_req() causing an oops or the export to user space of 16 bytes of kernel memory as a garbage IPv6 address, depending on where the garbage inet6_rsk(req) pointed. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Revert "revert "Revert "mm: remove __GFP_NO_KSWAPD""" and associated damageLinus Torvalds2012-12-105-13/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commits a50915394f1fc02c2861d3b7ce7014788aa5066e and d7c3b937bdf45f0b844400b7bf6fd3ed50bac604. This is a revert of a revert of a revert. In addition, it reverts the even older i915 change to stop using the __GFP_NO_KSWAPD flag due to the original commits in linux-next. It turns out that the original patch really was bogus, and that the original revert was the correct thing to do after all. We thought we had fixed the problem, and then reverted the revert, but the problem really is fundamental: waking up kswapd simply isn't the right thing to do, and direct reclaim sometimes simply _is_ the right thing to do. When certain allocations fail, we simply should try some direct reclaim, and if that fails, fail the allocation. That's the right thing to do for THP allocations, which can easily fail, and the GPU allocations want to do that too. So starting kswapd is sometimes simply wrong, and removing the flag that said "don't start kswapd" was a mistake. Let's hope we never revisit this mistake again - and certainly not this many times ;) Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Revert "mm: avoid waking kswapd for THP allocations when compaction is ↵Linus Torvalds2012-12-101-27/+10Star
|/ | | | | | | | | | | | | | | | | | | | deferred or contended" This reverts commit 782fd30406ecb9d9b082816abe0c6008fc72a7b0. We are going to reinstate the __GFP_NO_KSWAPD flag that has been removed, the removal reverted, and then removed again. Making this commit a pointless fixup for a problem that was caused by the removal of __GFP_NO_KSWAPD flag. The thing is, we really don't want to wake up kswapd for THP allocations (because they fail quite commonly under any kind of memory pressure, including when there is tons of memory free), and these patches were just trying to fix up the underlying bug: the original removal of __GFP_NO_KSWAPD in commit c654345924f7 ("mm: remove __GFP_NO_KSWAPD") was simply bogus. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: vmscan: fix inappropriate zone congestion clearingJohannes Weiner2012-12-081-3/+0Star
| | | | | | | | | | | | | | | | | | | commit c702418f8a2f ("mm: vmscan: do not keep kswapd looping forever due to individual uncompactable zones") removed zone watermark checks from the compaction code in kswapd but left in the zone congestion clearing, which now happens unconditionally on higher order reclaim. This messes up the reclaim throttling logic for zones with dirty/writeback pages, where zones should only lose their congestion status when their watermarks have been restored. Remove the clearing from the zone compaction section entirely. The preliminary zone check and the reclaim loop in kswapd will clear it if the zone is considered balanced. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vfs: fix O_DIRECT read past end of block deviceLinus Torvalds2012-12-081-1/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The direct-IO write path already had the i_size checks in mm/filemap.c, but it turns out the read path did not, and removing the block size checks in fs/block_dev.c (commit bbec0270bdd8: "blkdev_max_block: make private to fs/buffer.c") removed the magic "shrink IO to past the end of the device" code there. Fix it by truncating the IO to the size of the block device, like the write path already does. NOTE! I suspect the write path would be *much* better off doing it this way in fs/block_dev.c, rather than hidden deep in mm/filemap.c. The mm/filemap.c code is extremely hard to follow, and has various conditionals on the target being a block device (ie the flag passed in to 'generic_write_checks()', along with a conditional update of the inode timestamp etc). It is also quite possible that we should treat this whole block device size as a "s_maxbytes" issue, and try to make the logic even more generic. However, in the meantime this is the fairly minimal targeted fix. Noted by Milan Broz thanks to a regression test for the cryptsetup reencrypt tool. Reported-and-tested-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2012-12-086-9/+24
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: "Two stragglers: 1) The new code that adds new flushing semantics to GRO can cause SKB pointer list corruption, manage the lists differently to avoid the OOPS. Fix from Eric Dumazet. 2) When TCP fast open does a retransmit of data in a SYN-ACK or similar, we update retransmit state that we shouldn't triggering a WARN_ON later. Fix from Yuchung Cheng." * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: net: gro: fix possible panic in skb_gro_receive() tcp: bug fix Fast Open client retransmission
| * net: gro: fix possible panic in skb_gro_receive()Eric Dumazet2012-12-073-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 2e71a6f8084e (net: gro: selective flush of packets) added a bug for skbs using frag_list. This part of the GRO stack is rarely used, as it needs skb not using a page fragment for their skb->head. Most drivers do use a page fragment, but some of them use GFP_KERNEL allocations for the initial fill of their RX ring buffer. napi_gro_flush() overwrite skb->prev that was used for these skb to point to the last skb in frag_list. Fix this using a separate field in struct napi_gro_cb to point to the last fragment. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: bug fix Fast Open client retransmissionYuchung Cheng2012-12-073-6/+16
|/ | | | | | | | | | | | | | | | | | | | If SYN-ACK partially acks SYN-data, the client retransmits the remaining data by tcp_retransmit_skb(). This increments lost recovery state variables like tp->retrans_out in Open state. If loss recovery happens before the retransmission is acked, it triggers the WARN_ON check in tcp_fastretrans_alert(). For example: the client sends SYN-data, gets SYN-ACK acking only ISN, retransmits data, sends another 4 data packets and get 3 dupacks. Since the retransmission is not caused by network drop it should not update the recovery state variables. Further the server may return a smaller MSS than the cached MSS used for SYN-data, so the retranmission needs a loop. Otherwise some data will not be retransmitted until timeout or other loss recovery events. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>