summaryrefslogtreecommitdiffstats
path: root/drivers/md/raid1.c
Commit message (Collapse)AuthorAgeFilesLines
...
| * md: initialise ->writes_pending in personality modules.NeilBrown2017-06-061-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new per-cpu counter for writes_pending is initialised in md_alloc(), which is not called by dm-raid. So dm-raid fails when md_write_start() is called. Move the initialization to the personality modules that need it. This way it is always initialised when needed, but isn't unnecessarily initialized (requiring memory allocation) when the personality doesn't use writes_pending. Reported-by: Heinz Mauelshagen <heinzm@redhat.com> Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending") Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* | block: switch bios to blk_status_tChristoph Hellwig2017-06-091-18/+18
|/ | | | | | | | | | Replace bi_error with a new bi_status to allow for a clear conversion. Note that device mapper overloaded bi_error with a private value, which we'll have to keep arround at least for now and thus propagate to a proper blk_status_t value. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* raid1: prefer disk without bad blocksTomasz Majchrzak2017-05-121-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | If an array consists of two drives and the first drive has the bad block, the read request to the region overlapping the bad block chooses the same disk (with bad block) as device to read from over and over and the request gets stuck. If the first disk only partially overlaps with bad block, it becomes a candidate ("best disk") for shorter range of sectors. The second disk is capable of reading the entire requested range and it is updated accordingly, however it is not recorded as a best device for the request. In the end the request is sent to the first disk to read entire range of sectors. It fails and is re-tried in a moment but with the same outcome. Actually it is quite likely scenario but it had little exposure in my test until commit 715d40b93b10 ("md/raid1: add failfast handling for reads.") removed preference for idle disk. Such scenario had been passing as second disk was always chosen when idle. Reset a candidate ("best disk") to read from if disk can read entire range. Do it only if other disk has already been chosen as a candidate for a smaller range. The head position / disk type logic will select the best disk to read from - it is fine as disk with bad block won't be considered for it. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md/raid1/10: avoid unnecessary lockingShaohua Li2017-05-121-4/+3Star
| | | | | | | | If we add bios to block plugging list, locking is unnecessry, since the block unplug is guaranteed not to run at that time. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md: don't return -EAGAIN in md_allow_write for external metadata arraysArtur Paszkiewicz2017-05-081-6/+3Star
| | | | | | | | | | | | | | | | This essentially reverts commit b5470dc5fc18 ("md: resolve external metadata handling deadlock in md_allow_write") with some adjustments. Since commit 6791875e2e53 ("md: make reconfig_mutex optional for writes to md sysfs files.") changing array_state to 'active' does not use mddev_lock() and will not cause a deadlock with md_allow_write(). This revert simplifies userspace tools that write to sysfs attributes like "stripe_cache_size" or "consistency_policy" because it removes the need for special handling for external metadata arrays, checking for EAGAIN and retrying the write. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
* Merge branch 'md-next' into md-linusShaohua Li2017-05-011-390/+289Star
|\
| * md/raid1: Use a new variable to count flighting sync requestsXiao Ni2017-04-271-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In new barrier codes, raise_barrier waits if conf->nr_pending[idx] is not zero. After all the conditions are true, the resync request can go on be handled. But it adds conf->nr_pending[idx] again. The next resync request hit the same bucket idx need to wait the resync request which is submitted before. The performance of resync/recovery is degraded. So we should use a new variable to count sync requests which are in flight. I did a simple test: 1. Without the patch, create a raid1 with two disks. The resync speed: Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdb 0.00 0.00 166.00 0.00 10.38 0.00 128.00 0.03 0.20 0.20 0.00 0.19 3.20 sdc 0.00 0.00 0.00 166.00 0.00 10.38 128.00 0.96 5.77 0.00 5.77 5.75 95.50 2. With the patch, the result is: sdb 2214.00 0.00 766.00 0.00 185.69 0.00 496.46 2.80 3.66 3.66 0.00 1.03 79.10 sdc 0.00 2205.00 0.00 769.00 0.00 186.44 496.52 5.25 6.84 0.00 6.84 1.30 100.10 Suggested-by: Shaohua Li <shli@kernel.org> Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>
| * md: clear WantReplacement once disk is removedGuoqing Jiang2017-04-251-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | We can clear 'WantReplacement' flag directly no matter it's replacement existed or not since the semantic is same as before. Also since the disk is removed from array, then it is straightforward to remove 'WantReplacement' flag and the comments in raid10/5 can be removed as well. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md/raid1/10: remove unused queueLidong Zhong2017-04-241-3/+0Star
| | | | | | | | | | | | | | | | | | A queue is declared and get from the disk of the array, but it's not used anywhere. So removing it from the source. Signed-off-by: Lidong Zhong <lzhong@suse.com> Acted-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md/raid1: factor out flush_bio_list()NeilBrown2017-04-111-40/+26Star
| | | | | | | | | | | | | | | | | | flush_pending_writes() and raid1_unplug() each contain identical copies of a fairly large slab of code. So factor that out into new flush_bio_list() to simplify maintenance. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md/raid1: simplify handle_read_error().NeilBrown2017-04-111-80/+60Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | handle_read_error() duplicates a lot of the work that raid1_read_request() does, so it makes sense to just use that function. This doesn't quite work as handle_read_error() relies on the same r1bio being re-used so that, in the case of a read-only array, setting IO_BLOCKED in r1bio->bios[] ensures read_balance() won't re-use that device. So we need to allow a r1bio to be passed to raid1_read_request(), and to have that function mostly initialise the r1bio, but leave the bios[] array untouched. Two parts of handle_read_error() that need to be preserved are the warning message it prints, so they are conditionally added to raid1_read_request(). Note that this highlights a minor bug on alloc_r1bio(). It doesn't initalise the bios[] array, so it is possible that old content is there, which might cause read_balance() to ignore some devices with no good reason. With this change, we no longer need inc_pending(), or the sectors_handled arg to alloc_r1bio(). As handle_read_error() is called from raid1d() and allocates memory, there is tiny chance of a deadlock. All element of various pools could be queued waiting for raid1 to handle them, and there may be no extra memory free. Achieving guaranteed forward progress would probably require a second thread and another mempool. Instead of that complexity, add __GFP_HIGH to any allocations when read1_read_request() is called from raid1d. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md/raid1: simplify alloc_behind_master_bio()NeilBrown2017-04-111-7/+4Star
| | | | | | | | | | | | | | | | | | | | | | Now that we always always pass an offset of 0 and a size that matches the bio to alloc_behind_master_bio(), we can remove the offset/size args and simplify the code. We could probably remove bio_copy_data_partial() too. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md/raid1: simplify the splitting of requests.NeilBrown2017-04-111-91/+50Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | raid1 currently splits requests in two different ways for two different reasons. First, bio_split() is used to ensure the bio fits within a resync accounting region. Second, multiple r1bios are allocated for each bio to handle the possiblity of known bad blocks on some devices. This can be simplified to just use bio_split() once, and not use multiple r1bios. We delay the split until we know a maximum bio size that can be handled with a single r1bio, and then split the bio and queue the remainder for later handling. This avoids all loops inside raid1.c request handling. Just a single read, or a single set of writes, is submitted to lower-level devices for each bio that comes from generic_make_request(). When the bio needs to be split, generic_make_request() will do the necessary looping and call md_make_request() multiple times. raid1_make_request() no longer queues request for raid1 to handle, so we can remove that branch from the 'if'. This patch also creates a new private bio_set (conf->bio_split) for splitting bios. Using fs_bio_set is wrong, as it is meant to be used by filesystems, not block devices. Using it inside md can lead to deadlocks under high memory pressure. Delete unused variable in raid1_write_request() (Shaohua) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md/raid1: avoid reusing a resync bio after error handling.NeilBrown2017-04-101-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fix_sync_read_error() modifies a bio on a newly faulty device by setting bi_end_io to end_sync_write. This ensure that put_buf() will still call rdev_dec_pending() as required, but makes sure that subsequent code in fix_sync_read_error() doesn't try to read from the device. Unfortunately this interacts badly with sync_request_write() which assumes that any bio with bi_end_io set to non-NULL other than end_sync_read is safe to write to. As the device is now faulty it doesn't make sense to write. As the bio was recently used for a read, it is "dirty" and not suitable for immediate submission. In particular, ->bi_next might be non-NULL, which will cause generic_make_request() to complain. Break this interaction by refusing to write to devices which are marked as Faulty. Reported-and-tested-by: Michael Wang <yun.wang@profitbricks.com> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.") Cc: stable@vger.kernel.org (v4.10+) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md: raid1: kill warning on powerpc_pseriesMing Lei2017-03-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch kills the warning reported on powerpc_pseries, and actually we don't need the initialization. After merging the md tree, today's linux-next build (powerpc pseries_le_defconfig) produced this warning: drivers/md/raid1.c: In function 'raid1d': drivers/md/raid1.c:2172:9: warning: 'page_len$' may be used uninitialized in this function [-Wmaybe-uninitialized] if (memcmp(page_address(ppages[j]), ^ drivers/md/raid1.c:2160:7: note: 'page_len$' was declared here int page_len[RESYNC_PAGES]; ^ Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md/raid1: skip data copy for behind io for discard requestShaohua Li2017-03-251-1/+5
| | | | | | | | | | | | | | | | | | | | | | discard request doesn't have data attached, so it's meaningless to allocate memory and copy from original bio for behind IO. And the copy is bogus because bio_copy_data_partial can't handle discard request. We don't support writesame/writezeros request so far. Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md: raid1: improve write behindMing Lei2017-03-241-64/+54Star
| | | | | | | | | | | | | | | | | | | | | | | | | | This patch improve handling of write behind in the following ways: - introduce behind master bio to hold all write behind pages - fast clone bios from behind master bio - avoid to change bvec table directly - use bio_copy_data() and make code more clean Suggested-by: Shaohua Li <shli@fb.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md: raid1: move 'offset' out of loopMing Lei2017-03-241-2/+3
| | | | | | | | | | | | | | | | The 'offset' local variable can't be changed inside the loop, so move it out. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md: raid1: use bio helper in process_checks()Ming Lei2017-03-241-4/+8
| | | | | | | | | | | | | | Avoid to direct access to bvec table. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md: raid1: retrieve page from pre-allocated resync page arrayMing Lei2017-03-241-8/+8
| | | | | | | | | | | | | | | | Now one page array is allocated for each resync bio, and we can retrieve page from this table directly. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md: raid1: don't use bio's vec table to manage resync pagesMing Lei2017-03-241-29/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now we allocate one page array for managing resync pages, instead of using bio's vec table to do that, and the old way is very hacky and won't work any more if multipage bvec is enabled. The introduced cost is that we need to allocate (128 + 16) * raid_disks bytes per r1_bio, and it is fine because the inflight r1_bio for resync shouldn't be much, as pointed by Shaohua. Also the bio_reset() in raid1_sync_request() is removed because all bios are freshly new now and not necessary to reset any more. This patch can be thought as a cleanup too Suggested-by: Shaohua Li <shli@kernel.org> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md: raid1: simplify r1buf_pool_free()Ming Lei2017-03-241-9/+8Star
| | | | | | | | | | | | | | | | | | | | | | This patch gets each page's reference of each bio for resync, then r1buf_pool_free() gets simplified a lot. The same policy has been taken in raid10's buf pool allocation/free too. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md: move two macros into md.hMing Lei2017-03-241-2/+0Star
| | | | | | | | | | | | | | | | Both raid1 and raid10 share common resync block size and page count, so move them into md.h. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md: raid1/raid10: don't handle failure of bio_add_page()Ming Lei2017-03-241-16/+6Star
| | | | | | | | | | | | | | | | | | | | | | | | All bio_add_page() is for adding one page into resync bio, which is big enough to hold RESYNC_PAGES pages, and the current bio_add_page() doesn't check queue limit any more, so it won't fail at all. remove unused label (shaohua) Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md/raid1: stop using bi_phys_segmentNeilBrown2017-03-231-66/+23Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change to use bio->__bi_remaining to count number of r1bio attached to a bio. See precious raid10 patch for more details. Like the raid10.c patch, this fixes a bug as nr_queued and nr_pending used to measure different things, but were being compared. This patch fixes another bug in that nr_pending previously did not could write-behind requests, so behind writes could continue while resync was happening. How that nr_pending counts all r1_bio, the resync cannot commence until the behind writes have completed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md/raid1, raid10: move rXbio accounting closer to allocation.NeilBrown2017-03-231-13/+11Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When raid1 or raid10 find they will need to allocate a new r1bio/r10bio, in order to work around a known bad block, they account for the allocation well before the allocation is made. This separation makes the correctness less obvious and requires comments. The accounting needs to be a little before: before the first rXbio is submitted, but that is all. So move the accounting down to where it makes more sense. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md: superblock changes for PPLArtur Paszkiewicz2017-03-171-1/+2
| | | | | | | | | | | | | | | | | | | | | | Include information about PPL location and size into mdp_superblock_1 and copy it to/from rdev. Because PPL is mutually exclusive with bitmap, put it in place of 'bitmap_offset'. Add a new flag MD_FEATURE_PPL for 'feature_map', analogically to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL to mddev->flags to indicate that PPL is enabled on an array. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
* | md: support REQ_OP_WRITE_ZEROESChristoph Hellwig2017-04-081-1/+3
|/ | | | | | | | Copy & paste from the REQ_OP_WRITE_SAME code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* md/raid1: fix a trivial typo in commentsZhilong Liu2017-03-141-1/+1
| | | | | | | | | | | raid1.c: fix a trivial typo in comments of freeze_array(). Cc: Jack Wang <jack.wang.usish@gmail.com> Cc: Guoqing Jiang <gqjiang@suse.com> Cc: John Stoffel <john@stoffel.org> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Zhilong Liu <zlliu@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md/raid1/10: fix potential deadlockShaohua Li2017-03-091-2/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Neil Brown pointed out a potential deadlock in raid 10 code with bio_split/chain. The raid1 code could have the same issue, but recent barrier rework makes it less likely to happen. The deadlock happens in below sequence: 1. generic_make_request(bio), this will set current->bio_list 2. raid10_make_request will split bio to bio1 and bio2 3. __make_request(bio1), wait_barrer, add underlayer disk bio to current->bio_list 4. __make_request(bio2), wait_barrer If raise_barrier happens between 3 & 4, since wait_barrier runs at 3, raise_barrier waits for IO completion from 3. And since raise_barrier sets barrier, 4 waits for raise_barrier. But IO from 3 can't be dispatched because raid10_make_request() doesn't finished yet. The solution is to adjust the IO ordering. Quotes from Neil: " It is much safer to: if (need to split) { split = bio_split(bio, ...) bio_chain(...) make_request_fn(split); generic_make_request(bio); } else make_request_fn(mddev, bio); This way we first process the initial section of the bio (in 'split') which will queue some requests to the underlying devices. These requests will be queued in generic_make_request. Then we queue the remainder of the bio, which will be added to the end of the generic_make_request queue. Then we return. generic_make_request() will pop the lower-level device requests off the queue and handle them first. Then it will process the remainder of the original bio once the first section has been fully processed. " Note, this only happens in read path. In write path, the bio is flushed to underlaying disks either by blk flush (from schedule) or offladed to raid1/10d. It's queued in current->bio_list. Cc: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org (v3.14+, only the raid10 part) Suggested-by: NeilBrown <neilb@suse.com> Reviewed-by: Jack Wang <jinpu.wang@profitbricks.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md: move funcs from pers->resize to update_sizeGuoqing Jiang2017-03-091-2/+0Star
| | | | | | | | | | | | | raid1_resize and raid5_resize should also check the mddev->queue if run underneath dm-raid. And both set_capacity and revalidate_disk are used in pers->resize such as raid1, raid10 and raid5. So move them from personality file to common code. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar2017-03-021-0/+3
| | | | | | | | | | | | | | | | | | | | <linux/sched/signal.h> We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/signal.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds2017-02-241-230/+366
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull md updates from Shaohua Li: "Mainly fixes bugs and improves performance: - Improve scalability for raid1 from Coly - Improve raid5-cache read performance, disk efficiency and IO pattern from Song and me - Fix a race condition of disk hotplug for linear from Coly - A few cleanup patches from Ming and Byungchul - Fix a memory leak from Neil - Fix WRITE SAME IO failure from me - Add doc for raid5-cache from me" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (23 commits) md/raid1: fix write behind issues introduced by bio_clone_bioset_partial md/raid1: handle flush request correctly md/linear: shutup lockdep warnning md/raid1: fix a use-after-free bug RAID1: avoid unnecessary spin locks in I/O barrier code RAID1: a new I/O barrier implementation to remove resync window md/raid5: Don't reinvent the wheel but use existing llist API md: fast clone bio in bio_clone_mddev() md: remove unnecessary check on mddev md/raid1: use bio_clone_bioset_partial() in case of write behind md: fail if mddev->bio_set can't be created block: introduce bio_clone_bioset_partial() md: disable WRITE SAME if it fails in underlayer disks md/raid5-cache: exclude reclaiming stripes in reclaim check md/raid5-cache: stripe reclaim only counts valid stripes MD: add doc for raid5-cache Documentation: move MD related doc into a separate dir md: ensure md devices are freed before module is unloaded. md/r5cache: improve journal device efficiency md/r5cache: enable chunk_aligned_read with write back cache ...
| * md/raid1: fix write behind issues introduced by bio_clone_bioset_partialShaohua Li2017-02-231-4/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | There are two issues, introduced by commit 8e58e32(md/raid1: use bio_clone_bioset_partial() in case of write behind): - bio_clone_bioset_partial() uses bytes instead of sectors as parameters - in writebehind mode, we return bio if all !writemostly disk bios finish, which could happen before writemostly disk bios run. So all writemostly disk bios should have their bvec. Here we just make sure all bios are cloned instead of fast cloned. Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md/raid1: handle flush request correctlyShaohua Li2017-02-231-3/+7
| | | | | | | | | | | | | | | | | | | | I got a warning triggered in align_to_barrier_unit_end. It's a flush request so sectors == 0. The flush request happens to work well without the new barrier patch, but we'd better handle it explictly. Cc: NeilBrown <neilb@suse.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>
| * md/raid1: fix a use-after-free bugShaohua Li2017-02-201-1/+2
| | | | | | | | | | | | | | Commit fd76863 (RAID1: a new I/O barrier implementation to remove resync window) introduces a user-after-free bug. Signed-off-by: Shaohua Li <shli@fb.com>
| * RAID1: avoid unnecessary spin locks in I/O barrier codecolyli@suse.de2017-02-201-51/+114
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When I run a parallel reading performan testing on a md raid1 device with two NVMe SSDs, I observe very bad throughput in supprise: by fio with 64KB block size, 40 seq read I/O jobs, 128 iodepth, overall throughput is only 2.7GB/s, this is around 50% of the idea performance number. The perf reports locking contention happens at allow_barrier() and wait_barrier() code, - 41.41% fio [kernel.kallsyms] [k] _raw_spin_lock_irqsave - _raw_spin_lock_irqsave + 89.92% allow_barrier + 9.34% __wake_up - 37.30% fio [kernel.kallsyms] [k] _raw_spin_lock_irq - _raw_spin_lock_irq - 100.00% wait_barrier The reason is, in these I/O barrier related functions, - raise_barrier() - lower_barrier() - wait_barrier() - allow_barrier() They always hold conf->resync_lock firstly, even there are only regular reading I/Os and no resync I/O at all. This is a huge performance penalty. The solution is a lockless-like algorithm in I/O barrier code, and only holding conf->resync_lock when it has to. The original idea is from Hannes Reinecke, and Neil Brown provides comments to improve it. I continue to work on it, and make the patch into current form. In the new simpler raid1 I/O barrier implementation, there are two wait barrier functions, - wait_barrier() Which calls _wait_barrier(), is used for regular write I/O. If there is resync I/O happening on the same I/O barrier bucket, or the whole array is frozen, task will wait until no barrier on same barrier bucket, or the whold array is unfreezed. - wait_read_barrier() Since regular read I/O won't interfere with resync I/O (read_balance() will make sure only uptodate data will be read out), it is unnecessary to wait for barrier in regular read I/Os, waiting in only necessary when the whole array is frozen. The operations on conf->nr_pending[idx], conf->nr_waiting[idx], conf-> barrier[idx] are very carefully designed in raise_barrier(), lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to avoid unnecessary spin locks in these functions. Once conf-> nr_pengding[idx] is increased, a resync I/O with same barrier bucket index has to wait in raise_barrier(). Then in _wait_barrier() if no barrier raised in same barrier bucket index and array is not frozen, the regular I/O doesn't need to hold conf->resync_lock, it can just increase conf->nr_pending[idx], and return to its caller. wait_read_barrier() is very similar to _wait_barrier(), the only difference is it only waits when array is frozen. For heavy parallel reading I/Os, the lockless I/O barrier code almostly gets rid of all spin lock cost. This patch significantly improves raid1 reading peroformance. From my testing, a raid1 device built by two NVMe SSD, runs fio with 64KB blocksize, 40 seq read I/O jobs, 128 iodepth, overall throughput increases from 2.7GB/s to 4.6GB/s (+70%). Changelog V4: - Change conf->nr_queued[] to atomic_t. - Define BARRIER_BUCKETS_NR_BITS by (PAGE_SHIFT - ilog2(sizeof(atomic_t))) V3: - Add smp_mb__after_atomic() as Shaohua and Neil suggested. - Change conf->nr_queued[] from atomic_t to int. - Change conf->array_frozen from atomic_t back to int, and use READ_ONCE(conf->array_frozen) to check value of conf->array_frozen in _wait_barrier() and wait_read_barrier(). - In _wait_barrier() and wait_read_barrier(), add a call to wake_up(&conf->wait_barrier) after atomic_dec(&conf->nr_pending[idx]), to fix a deadlock between _wait_barrier()/wait_read_barrier and freeze_array(). V2: - Remove a spin_lock/unlock pair in raid1d(). - Add more code comments to explain why there is no racy when checking two atomic_t variables at same time. V1: - Original RFC patch for comments. Signed-off-by: Coly Li <colyli@suse.de> Cc: Shaohua Li <shli@fb.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Guoqing Jiang <gqjiang@suse.com> Reviewed-by: Neil Brown <neilb@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>
| * RAID1: a new I/O barrier implementation to remove resync windowcolyli@suse.de2017-02-201-212/+261
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")' introduces a sliding resync window for raid1 I/O barrier, this idea limits I/O barriers to happen only inside a slidingresync window, for regular I/Os out of this resync window they don't need to wait for barrier any more. On large raid1 device, it helps a lot to improve parallel writing I/O throughput when there are background resync I/Os performing at same time. The idea of sliding resync widow is awesome, but code complexity is a challenge. Sliding resync window requires several variables to work collectively, this is complexed and very hard to make it work correctly. Just grep "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches to fix the original resync window patch. This is not the end, any further related modification may easily introduce more regreassion. Therefore I decide to implement a much simpler raid1 I/O barrier, by removing resync window code, I believe life will be much easier. The brief idea of the simpler barrier is, - Do not maintain a global unique resync window - Use multiple hash buckets to reduce I/O barrier conflicts, regular I/O only has to wait for a resync I/O when both them have same barrier bucket index, vice versa. - I/O barrier can be reduced to an acceptable number if there are enough barrier buckets Here I explain how the barrier buckets are designed, - BARRIER_UNIT_SECTOR_SIZE The whole LBA address space of a raid1 device is divided into multiple barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE. Bio requests won't go across border of barrier unit size, that means maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes. For random I/O 64MB is large enough for both read and write requests, for sequential I/O considering underlying block layer may merge them into larger requests, 64MB is still good enough. Neil also points out that for resync operation, "we want the resync to move from region to region fairly quickly so that the slowness caused by having to synchronize with the resync is averaged out over a fairly small time frame". For full speed resync, 64MB should take less then 1 second. When resync is competing with other I/O, it could take up a few minutes. Therefore 64MB size is fairly good range for resync. - BARRIER_BUCKETS_NR There are BARRIER_BUCKETS_NR buckets in total, which is defined by, #define BARRIER_BUCKETS_NR_BITS (PAGE_SHIFT - 2) #define BARRIER_BUCKETS_NR (1<<BARRIER_BUCKETS_NR_BITS) this patch makes the bellowed members of struct r1conf from integer to array of integers, - int nr_pending; - int nr_waiting; - int nr_queued; - int barrier; + int *nr_pending; + int *nr_waiting; + int *nr_queued; + int *barrier; number of the array elements is defined as BARRIER_BUCKETS_NR. For 4KB kernel space page size, (PAGE_SHIFT - 2) indecates there are 1024 I/O barrier buckets, and each array of integers occupies single memory page. 1024 means for a request which is smaller than the I/O barrier unit size has ~0.1% chance to wait for resync to pause, which is quite a small enough fraction. Also requesting single memory page is more friendly to kernel page allocator than larger memory size. - I/O barrier bucket is indexed by bio start sector If multiple I/O requests hit different I/O barrier units, they only need to compete I/O barrier with other I/Os which hit the same I/O barrier bucket index with each other. The index of a barrier bucket which a bio should look for is calculated by sector_to_idx() which is defined in raid1.h as an inline function, static inline int sector_to_idx(sector_t sector) { return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS, BARRIER_BUCKETS_NR_BITS); } Here sector_nr is the start sector number of a bio. - Single bio won't go across boundary of a I/O barrier unit If a request goes across boundary of barrier unit, it will be split. A bio may be split in raid1_make_request() or raid1_sync_request(), if sectors returned by align_to_barrier_unit_end() is smaller than original bio size. Comparing to single sliding resync window, - Currently resync I/O grows linearly, therefore regular and resync I/O will conflict within a single barrier units. So the I/O behavior is similar to single sliding resync window. - But a barrier unit bucket is shared by all barrier units with identical barrier uinit index, the probability of conflict might be higher than single sliding resync window, in condition that writing I/Os always hit barrier units which have identical barrier bucket indexs with the resync I/Os. This is a very rare condition in real I/O work loads, I cannot imagine how it could happen in practice. - Therefore we can achieve a good enough low conflict rate with much simpler barrier algorithm and implementation. There are two changes should be noticed, - In raid1d(), I change the code to decrease conf->nr_pending[idx] into single loop, it looks like this, spin_lock_irqsave(&conf->device_lock, flags); conf->nr_queued[idx]--; spin_unlock_irqrestore(&conf->device_lock, flags); This change generates more spin lock operations, but in next patch of this patch set, it will be replaced by a single line code, atomic_dec(&conf->nr_queueud[idx]); So we don't need to worry about spin lock cost here. - Mainline raid1 code split original raid1_make_request() into raid1_read_request() and raid1_write_request(). If the original bio goes across an I/O barrier unit size, this bio will be split before calling raid1_read_request() or raid1_write_request(), this change the code logic more simple and clear. - In this patch wait_barrier() is moved from raid1_make_request() to raid1_write_request(). In raid_read_request(), original wait_barrier() is replaced by raid1_read_request(). The differnece is wait_read_barrier() only waits if array is frozen, using different barrier function in different code path makes the code more clean and easy to read. Changelog V4: - Add alloc_r1bio() to remove redundant r1bio memory allocation code. - Fix many typos in patch comments. - Use (PAGE_SHIFT - ilog2(sizeof(int))) to define BARRIER_BUCKETS_NR_BITS. V3: - Rebase the patch against latest upstream kernel code. - Many fixes by review comments from Neil, - Back to use pointers to replace arraries in struct r1conf - Remove total_barriers from struct r1conf - Add more patch comments to explain how/why the values of BARRIER_UNIT_SECTOR_SIZE and BARRIER_BUCKETS_NR are decided. - Use get_unqueued_pending() to replace get_all_pendings() and get_all_queued() - Increase bucket number from 512 to 1024 - Change code comments format by review from Shaohua. V2: - Use bio_split() to split the orignal bio if it goes across barrier unit bounday, to make the code more simple, by suggestion from Shaohua and Neil. - Use hash_long() to replace original linear hash, to avoid a possible confilict between resync I/O and sequential write I/O, by suggestion from Shaohua. - Add conf->total_barriers to record barrier depth, which is used to control number of parallel sync I/O barriers, by suggestion from Shaohua. - In V1 patch the bellowed barrier buckets related members in r1conf are allocated in memory page. To make the code more simple, V2 patch moves the memory space into struct r1conf, like this, - int nr_pending; - int nr_waiting; - int nr_queued; - int barrier; + int nr_pending[BARRIER_BUCKETS_NR]; + int nr_waiting[BARRIER_BUCKETS_NR]; + int nr_queued[BARRIER_BUCKETS_NR]; + int barrier[BARRIER_BUCKETS_NR]; This change is by the suggestion from Shaohua. - Remove some inrelavent code comments, by suggestion from Guoqing. - Add a missing wait_barrier() before jumping to retry_write, in raid1_make_write_request(). V1: - Original RFC patch for comments Signed-off-by: Coly Li <colyli@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Guoqing Jiang <gqjiang@suse.com> Reviewed-by: Neil Brown <neilb@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>
| * md: fast clone bio in bio_clone_mddev()Ming Lei2017-02-151-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Firstly bio_clone_mddev() is used in raid normal I/O and isn't in resync I/O path. Secondly all the direct access to bvec table in raid happens on resync I/O except for write behind of raid1, in which we still use bio_clone() for allocating new bvec table. So this patch replaces bio_clone() with bio_clone_fast() in bio_clone_mddev(). Also kill bio_clone_mddev() and call bio_clone_fast() directly, as suggested by Christoph Hellwig. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
| * md/raid1: use bio_clone_bioset_partial() in case of write behindMing Lei2017-02-151-5/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | Write behind need to replace pages in bio's bvecs, and we have to clone a fresh bio with new bvec table, so use the introduced bio_clone_bioset_partial() for it. For other bio_clone_mddev() cases, we will use fast clone since they don't need to touch bvec table. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
* | block: Use pointer to backing_dev_info from request_queueJan Kara2017-02-021-2/+2
| | | | | | | | | | | | | | | | | | | | | | We will want to have struct backing_dev_info allocated separately from struct request_queue. As the first step add pointer to backing_dev_info to request_queue and convert all users touching it. No functional changes in this patch. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
* | md: cleanup bio op / flags handling in raid1_write_requestChristoph Hellwig2017-01-271-5/+2Star
|/ | | | | | | | | | | No need for the local variables, the bio is still live and we can just assign the bits we want directly. Make me wonder why we can't assign all the bio flags to start with. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* md: cleanup mddev flag clear for takeoverShaohua Li2017-01-051-2/+6
| | | | | | | | | | Commit 6995f0b (md: takeover should clear unrelated bits) clear unrelated bits, but it's quite fragile. To avoid error in the future, define a macro for unsupported mddev flags for each raid type and use it to clear unsupported mddev flags. This should be less error-prone. Suggested-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md/raid1: Refactor raid1_make_requestRobert LeBlanc2017-01-031-128/+139
| | | | | | | | Refactor raid1_make_request to make read and write code in their own functions to clean up the code. Signed-off-by: Robert LeBlanc <robert@leblancnet.us> Signed-off-by: Shaohua Li <shli@fb.com>
* md: separate flags for superblock changesShaohua Li2016-12-091-6/+6
| | | | | | | | | | | The mddev->flags are used for different purposes. There are a lot of places we check/change the flags without masking unrelated flags, we could check/change unrelated flags. These usage are most for superblock write, so spearate superblock related flags. This should make the code clearer and also fix real bugs. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md: takeover should clear unrelated bitsShaohua Li2016-12-091-1/+4
| | | | | | | | | | When we change level from raid1 to raid5, the MD_FAILFAST_SUPPORTED bit will be accidentally set, but raid5 doesn't support it. The same is true for the MD_HAS_JOURNAL bit. Fix: 46533ff (md: Use REQ_FAILFAST_* on metadata writes where appropriate) Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md/raid1: add failfast handling for writes.NeilBrown2016-11-221-1/+25
| | | | | | | | | | | | | | | | | | | When writing to a fastfail device we use MD_FASTFAIL unless it is the only device being written to. For resync/recovery, assume there was a working device to read from so always use REQ_FASTFAIL_DEV. If a write for resync/recovery fails, we just fail the device - there is not much else to do. If a normal failfast write fails, but the device cannot be failed (must be only one left), we queue for write error handling. This will call narrow_write_error() to retry the write synchronously and without any FAILFAST flags. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md/raid1: add failfast handling for reads.NeilBrown2016-11-221-10/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a device is marked FailFast and it is not the only device we can read from, we mark the bio with REQ_FAILFAST_* flags. If this does fail, we don't try read repair but just allow failure. If it was the last device it doesn't fail of course, so the retry happens on the same device - this time without FAILFAST. A subsequent failure will not retry but will just pass up the error. During resync we may use FAILFAST requests and on a failure we will simply use the other device(s). During recovery we will only use FAILFAST in the unusual case were there are multiple places to read from - i.e. if there are > 2 devices. If we get a failure we will fail the device and complete the resync/recovery with remaining devices. The new R1BIO_FailFast flag is set on read reqest to suggest the a FAILFAST request might be acceptable. The rdev needs to have FailFast set as well for the read to actually use REQ_FAILFAST_*. We need to know there are at least two working devices before we can set R1BIO_FailFast, so we mustn't stop looking at the first device we find. So the "min_pending == 0" handling to not exit early, but too always choose the best_pending_disk if min_pending == 0. The spinlocked region in raid1_error() in enlarged to ensure that if two bios, reading from two different devices, fail at the same time, then there is no risk that both devices will be marked faulty, leaving zero "In_sync" devices. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md: Use REQ_FAILFAST_* on metadata writes where appropriateNeilBrown2016-11-221-0/+1
| | | | | | | | | | | | | | | | | This can only be supported on personalities which ensure that md_error() never causes an array to enter the 'failed' state. i.e. if marking a device Faulty would cause some data to be inaccessible, the device is status is left as non-Faulty. This is true for RAID1 and RAID10. If we get a failure writing metadata but the device doesn't fail, it must be the last device so we re-write without FAILFAST to improve chance of success. We also flag the device as LastDev so that future metadata updates don't waste time on failfast writes. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md/raid1, raid10: add blktrace records when IO is delayedNeilBrown2016-11-181-0/+8
| | | | | | | | | | | Both raid1 and raid10 will sometimes delay handling an IO request, such as when resync is happening or there are too many requests queued. Add some blktrace messsages so we can see when that is happening when looking for performance artefacts. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>